

Dask: The Definitive Guide - Scalable Python Data Science with Dask (Early Release 1)

✍ Scribed by Matthew Rocklin, Matthew Powers, Richard Pelgrim


Publisher
O’Reilly Media, Inc.
Year
2022
Tongue
English
Leaves
105
Edition
1
Category
Library


✦ Synopsis


The burgeoning volume and complexity of data make scalability and reliability increasingly challenging issues. But while modern systems contain multicore CPUs and GPUs that have the potential for parallel computing, many Python tools weren't designed to leverage this parallelism. Using Dask to parallelize Python workflows delivers a competitive advantage by reducing turnaround time, freeing you to work on more interesting or complex data problems.

With this essential guide at your side, you'll be able to:

Deploy Dask on the cloud or on-prem
Scale your Python code to bigger datasets and CPU-intensive workflows
Speed up data pipelines that often take weeks or months to execute
Overcome the limits of serial computing on your local machine (or system of machines)
Use the examples provided to scale your workflows, whether you're working with NumPy, pandas, scikit-learn, PyTorch, XGBoost, or other tools
Develop a specialized data science library that leverages parallel and distributed computing
Scale computations to a cluster of machines and to the cloud securely and efficiently

✦ Table of Contents


  1. Understanding the Architecture of Dask DataFrames
    pandas Architecture
    pandas Limitations
    How Dask DataFrames Differ from pandas
    Example illustrating Dask DataFrame Architectural Components
    Lazy Execution
    Dask DataFrame Divisions
    pandas vs. Dask DataFrame on Larger than RAM Datasets
    Scaling Up vs Scaling Out
    Summary
  2. How to Work with Dask DataFrames
    Reading Data into a Dask DataFrame
    Read a single file into a Dask DataFrame
    Read multiple files into a Dask DataFrame
    Working with partitioned data
    Setting Partition Size
    Inspecting Data Types
    Reading Remote Data from the Cloud
    Processing Data with Dask DataFrames
    Converting to Parquet files
    Materializing results in memory with compute
    Materializing results in memory with persist()
    Repartitioning Dask DataFrames
    Filtering Dask DataFrames
    Setting the Index
    Joining Dask DataFrames
    Mapping Custom Functions
    groupby aggregations
    Memory usage
    Tips on managing memory
    Converting to number columns with to_numeric
    Vertically union Dask DataFrames
    Writing Data with Dask DataFrames
    File Compression
    to_csv: single_file
    to_parquet: engine
    to_parquet: partition_on
    Other Keywords
    Summary

📜 SIMILAR VOLUMES


Dask: The Definitive Guide - Scalable Py
✍ Matthew Rocklin, Matthew Powers, Richard Pelgrim 📂 Library 📅 2022 🏛 O’Reilly Media, Inc. 🌐 English

The burgeoning volume and complexity of data make scalability and reliability increasingly challenging issues. But while modern systems contain multicore CPUs and GPUs that have the potential for parallel computing, many Python tools weren't designed to leverage this parallelism. Using Dask to paral

Scaling Python with Dask (Sixth Early Re
✍ Holden Karau and Mika Kimmins 📂 Library 📅 2023 🏛 O'Reilly Media, Inc. 🌐 English

Dask is a free and open source library for parallel computing in Python that helps you scale your data science and machine learning workflows. With this quick but thorough resource, data scientists and Python programmers will learn how Dask provides APIs that make it easy to parallelize PyData libra

Data Science with Python and Dask
✍ Jesse C. Daniel 📂 Library 📅 2019 🏛 Manning Publications 🌐 English

Dask is a native parallel analytics tool designed to integrate seamlessly with the libraries you're already using, including Pandas, NumPy, and Scikit-Learn. With Dask you can crunch and work with huge datasets, using the tools you already have. And Data Science with Python and Dask is your guide to

Data Science With Python And Dask
✍ Jesse C. Daniel 📂 Library 📅 2019 🏛 Manning Publications 🌐 English

Dask is a native parallel analytics tool designed to integrate seamlessly with the libraries you're already using, including Pandas, NumPy, and Scikit-Learn. With Dask you can crunch and work with huge datasets, using the tools you already have. And Data Science with Python and Dask is your guide to

Data science at scale with python and da
✍ Daniel, Jesse C 📂 Library 📅 2019 🏛 Manning Publications; O'Reilly Media 🌐 English

Dask is a native parallel analytics tool designed to integrate seamlessly with the libraries you're already using, including Pandas, NumPy, and Scikit-Learn. With Dask you can crunch and work with huge datasets, using the tools you already have. And Data Science with Pyth

Scaling Python with Dask: From Data Scie
✍ Holden Karau, Mika Kimmins 📂 Library 🏛 O'Reilly Media 🌐 English

Modern systems contain multicore CPUs and GPUs that have the potential for parallel computing. But many scientific Python tools were not designed to leverage this parallelism. With this short but thorough resource, data scientists and Python programmers will learn how the Dask open source l