
Apache Spark Best Practices: Optimize Your Data Processing
Apache Spark is a powerful open-source distributed computing system that excels at big data processing. It is lauded for its speed and ease of use, making it a favorite among software engineers and data scientists.
Claude Paugh
2 days ago · 4 min read
2 views


Gathering Data Statistics Using PySpark: A Comparative Analysis with Scala
Data processing and statistics gathering are essential tasks in today's data-driven world. Engineers frequently find themselves choosing between tools like PySpark and Scala when embarking on these tasks.
Claude Paugh
3 days ago · 5 min read
3 views

Harnessing the Dask Python Library for Parallel Computing
Dask is a flexible library for parallel computing in Python. It is designed to scale seamlessly from a single machine to a cluster. With Dask, you can manage and manipulate datasets that are too large to fit into memory on a single machine.
Claude Paugh
3 days ago · 5 min read
2 views


Spark Data Engineering: Best Practices and Use Cases
In today's data-driven world, organizations are generating vast amounts of data every second. This data can be a goldmine for insights when processed and analyzed effectively. One of the most powerful tools in this realm is Apache Spark.
Claude Paugh
4 days ago · 4 min read
8 views


Portfolio Holdings Data: Introduction
Several years ago, I started a side project that I thought would be fun: collecting and loading SEC filings for ETF and mutual fund holdings on a monthly basis. I wanted to essentially automate the collection of the SEC filings
Claude Paugh
Apr 8 · 5 min read
3 views


HDF5 Data Processing Toolkit
The processing design uses batches alongside Python's multiprocessing (mp) module to manage processes and resources. Each content "classification" is encapsulated in its own class, such as ImageProcessor, VideoProcessor, TextFileProcessor, etc.
Claude Paugh
Apr 7 · 1 min read
18 views


DASK: Data Scientist Power Tool
https://www.kdnuggets.com/introduction-dask-python-data-scientist-power-tool
Linked Article
Dec 17, 2024 · 1 min read
2 views


Data Engineering 2.0
https://medium.com/towards-data-engineering/data-engineering-2-0-trends-that-are-shaping-the-industrys-future-8d9415ddaa1d
Linked Article
Dec 17, 2024 · 1 min read
2 views