The burgeoning volume and complexity of data make scalability and reliability increasingly challenging issues. But while modern systems contain multicore CPUs and GPUs that have the potential for parallel computing, many Python tools weren't designed to leverage this parallelism. Using Dask to paral
Dask: The Definitive Guide - Scalable Python Data Science with Dask (Early Release 1)
β Scribed by Matthew Rocklin, Matthew Powers, Richard Pelgrim
- Publisher
- OβReilly Media, Inc.
- Year
- 2022
- Tongue
- English
- Leaves
- 105
- Edition
- 1
- Category
- Library
No coin nor oath required. For personal study only.
β¦ Synopsis
The burgeoning volume and complexity of data make scalability and reliability increasingly challenging issues. But while modern systems contain multicore CPUs and GPUs that have the potential for parallel computing, many Python tools weren't designed to leverage this parallelism. Using Dask to parallelize Python workflows delivers a competitive advantage by reducing turnaround time, freeing you to work on more interesting or complex data problems.
With this essential guide at your side, you'll be able to:
Deploy Dask on the cloud or on-prem
Scale your Python code to bigger datasets and CPU-intensive workflows
Speed up data pipelines that often take weeks or months to execute
Overcome the limits of serial computing on your local machine (or system of machines)
Use the examples provided to scale your workflows, whether you're working with NumPy, pandas, scikit-learn, PyTorch, XGBoost, or other tools
Develop a specialized data science library that leverages parallel and distributed computing
Scale computations to a cluster of machines and to the cloud securely and efficiently
β¦ Table of Contents
- Understanding the Architecture of Dask DataFrames
pandas Architecture
pandas Limitations
How Dask DataFrames Differ from pandas
Example illustrating Dask DataFrame Architectural Components
Lazy Execution
Dask DataFrame Divisions
pandas vs. Dask DataFrame on Larger than RAM Datasets
Scaling Up vs Scaling Out
Summary - How to Work with Dask DataFrames
Reading Data into a Dask DataFrame
Read a single file into a Dask DataFrame
Read multiple files into a Dask DataFrame
Working with partitioned data
Setting Partition Size
Inspecting Data Types
Reading Remote Data from the Cloud
Processing Data with Dask DataFrames
Converting to Parquet files
Materializing results in memory with compute
Materializing results in memory with persist()
Repartitioning Dask DataFrames
Filtering Dask DataFrames
Setting the Index
Joining Dask DataFrames
Mapping Custom Functions
groupby aggregations
Memory usage
Tips on managing memory
Converting to number columns with to_numeric
Vertically union Dask DataFrames
Writing Data with Dask DataFrames
File Compression
to_csv: single_file
to_parquet: engine
to_parquet: partition_on
Other Keywords
Summary
π SIMILAR VOLUMES
Dask is a free and open source library for parallel computing in Python that helps you scale your data science and machine learning workflows. With this quick but thorough resource, data scientists and Python programmers will learn how Dask provides APIs that make it easy to parallelize PyData libra
Dask is a native parallel analytics tool designed to integrate seamlessly with the libraries you're already using, including Pandas, NumPy, and Scikit-Learn. With Dask you can crunch and work with huge datasets, using the tools you already have. And Data Science with Python and Dask is your guide to
Dask is a native parallel analytics tool designed to integrate seamlessly with the libraries you're already using, including Pandas, NumPy, and Scikit-Learn. With Dask you can crunch and work with huge datasets, using the tools you already have. And Data Science with Python and Dask is your guide to
<b>Summary</b><br /><br />Dask is a native parallel analytics tool designed to integrate seamlessly with the libraries you're already using, including Pandas, NumPy, and Scikit-Learn. With Dask you can crunch and work with huge datasets, using the tools you already have. And<i>Data Science with Pyth
<p><span>Modern systems contain multicore CPUs and GPUs that have the potential for parallel computing. But many scientific Python tools were not designed to leverage this parallelism. With this short but thorough resource, data scientists and Python programmers will learn how the Dask open source l