| Dataset type | Size range | Fits in RAM? | Fits on local disk? |
|---|---|---|---|
| Small dataset | Less than 2-4 GB | Yes | Yes |
| Medium dataset | Less than 2 TB | No | Yes |
| Large dataset | Greater than 2 TB | No | No |
Why Dask?
In late 2014, Dask was launched to bring native scalability to the Python Open Data Science Stack and overcome its single-machine limitations.
Over time, the project has become one of the best scalable computing frameworks available to Python developers. Dask consists of several different components and APIs, divided into three layers: the scheduler, the low-level APIs, and the high-level APIs.
Dask brings four key advantages to the table:
- Dask is fully implemented in Python and natively scales NumPy, Pandas, and scikit-learn (a brief sketch follows this list).
- Dask can be used effectively to work with both medium datasets on a single machine and large datasets on a cluster.
- Dask can be used as a general framework for parallelizing most Python objects.
- Dask has a very low configuration and maintenance overhead.
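To make the first point concrete, here is a minimal sketch of Dask's NumPy-like array API; the array shape and chunk sizes are arbitrary choices for illustration:

```python
import dask.array as da

# A 10,000 x 10,000 array of random values, split into 1,000 x 1,000 blocks,
# each of which is an ordinary NumPy array under the hood.
x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))

# Familiar NumPy-style operations build a lazy task graph...
y = (x + x.T).mean(axis=0)

# ...and compute() evaluates the graph and returns a plain NumPy array.
print(y.compute())
```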
A Dask DataFrame consists of multiple smaller Pandas DataFrames, a Dask Array consists of multiple smaller NumPy arrays, and so on.
Each of the smaller underlying objects (called blocks or partitions) can be passed between machines in the cluster, or queued locally and processed one part at a time.
This design makes transitioning from small datasets to medium and large datasets very easy for experienced Pandas, NumPy, and scikit-learn users. In this book, we’ll examine some of the best practices and pitfalls that will enable you to get the most out of Dask.
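As a minimal sketch of this design, the following splits a small pandas DataFrame into partitions; the column names and values are invented for illustration:

```python
import pandas as pd
import dask.dataframe as dd

# A small pandas DataFrame (values are invented for illustration).
pdf = pd.DataFrame({"id": range(8), "value": [1, 2, 3, 4, 5, 6, 7, 8]})

# Split it into 4 partitions; each partition is itself a pandas DataFrame.
ddf = dd.from_pandas(pdf, npartitions=4)
print(ddf.npartitions)                       # 4
print(type(ddf.get_partition(0).compute()))  # <class 'pandas.core.frame.DataFrame'>

# The familiar pandas API now runs across all partitions.
print(ddf["value"].sum().compute())          # 36
```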
Dask is just as useful for working with medium datasets on a single machine as it is for working with large datasets on a cluster, and scaling it up or down is straightforward.
Dask has been optimized to minimize its memory footprint, so it can gracefully handle medium datasets even on relatively low-powered machines.
The local task scheduler runs Dask on a single machine, while the distributed task scheduler can run either locally or across a cluster.
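Here is a minimal sketch of switching schedulers; `processes=False` keeps the distributed cluster in-process so the snippet runs anywhere, and a real deployment would instead pass a scheduler address to `Client`:

```python
import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client

ddf = dd.from_pandas(pd.DataFrame({"x": range(100)}), npartitions=4)

# Local task scheduler: runs in threads in the current process, zero setup.
print(ddf["x"].sum().compute(scheduler="threads"))  # 4950

# Distributed task scheduler: Client() with no arguments starts a local
# cluster; processes=False keeps it in-process for this sketch. Passing a
# scheduler address instead (e.g. Client("tcp://scheduler:8786")) runs the
# same code on a real cluster.
client = Client(processes=False)
print(ddf["x"].sum().compute())  # now routed through the distributed scheduler
client.close()
```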
One of the most unusual aspects of Dask is its inherent ability to scale most Python objects: Dask's low-level APIs, Dask Delayed and Dask Futures, are the common basis for scaling the NumPy arrays used in Dask Array, the Pandas DataFrames used in Dask DataFrame, and the Python lists used in Dask Bag.
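As a minimal sketch of Dask Delayed, the toy `load` and `process` functions below are hypothetical stand-ins for real work:

```python
import dask
from dask import delayed

@delayed
def load(i):
    # Hypothetical stand-in for loading a chunk of data.
    return list(range(i))

@delayed
def process(data):
    # Hypothetical stand-in for processing that chunk.
    return sum(data)

# Nothing runs yet; this just builds a task graph of delayed calls...
tasks = [process(load(i)) for i in range(5)]

# ...which executes in parallel on compute().
print(dask.compute(*tasks))  # (0, 0, 1, 3, 6)
```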
Finally, Dask is very lightweight and is easy to set up, tear down, and maintain. Its short learning curve, flexibility, and familiar APIs make it an attractive solution for data scientists with a background in the Python Open Data Science Stack.
You can read more or give it a try at the Dask website: https://dask.org
If you're interested in more Python content, please visit: https://quietbookspace.com/blog/python/