Fast Python for Data Science (MEAP Version 8)

✍ Scribed by Tiago Antão


Publisher: Manning Publications
Year: 2022
Tongue: English
Leaves: 273
Edition: MEAP Edition
Category: Library


✦ Table of Contents


Fast Python for Data Science MEAP V08
Copyright
welcome
brief contents
Chapter 1: The need for efficient computing and data storage
1.1 The overwhelming need for efficient computing in Python
1.2 The impact of modern computing architectures on high performance computing
1.2.1 Changes inside the computer
1.2.2 Changes in the network
1.2.3 The cloud
1.3 Working with Python’s limitations
1.3.1 The Global Interpreter Lock (GIL)
1.4 What will you learn from this book
1.5 The reader for this book
1.6 Summary
Chapter 2: Extracting performance from built-in features
2.1 Introducing the project dataset
2.1.1 An architecture for big data processing
2.1.2 Preparing the data
2.2 Profiling code to detect performance bottlenecks
2.2.1 Using Python’s built-in profiling module
2.2.2 Visualizing profiling information
2.2.3 Line profiling
2.3 Optimizing basic data structures for speed: lists, sets, dictionaries
2.3.1 Performance of list searches
2.3.2 Using the bisect module
2.3.3 Content-aware search approaches
2.3.4 Searching using sets or dictionaries
2.3.5 List complexity in Python
2.4 Finding excessive memory allocation
2.4.1 Navigating the minefield of Python memory estimation
2.4.2 Using more compact representations
2.4.3 Packing many observations in a number
2.4.4 Use the array module
2.4.5 Systematizing what we have learned: Estimating the memory usage of Python objects
2.5 Using laziness and generators for big-data pipelining
2.5.1 Using generators instead of standard functions
2.5.2 Enabling code pipelining with generators
2.6 Summary
Chapter 3: Concurrency, parallelism and asynchronous processing
3.1 Writing the scaffold of an asynchronous server
3.1.1 The implementation of the scaffold for communicating with clients
3.1.2 Programming with coroutines
3.1.3 Sending complex data from a simple synchronous client
3.1.4 Alternative approaches
3.2 Implementing the first MapReduce engine
3.2.1 Understanding MapReduce frameworks
3.2.2 Developing a very simple test scenario
3.2.3 Implementing a too-simple MapReduce framework
3.3 Implementing a concurrent version of a MapReduce engine
3.3.1 Using concurrent.futures to implement a threaded server
3.3.2 Asynchronous execution with Futures
3.3.3 The GIL and multi-threading
3.4 Using multi-processing to implement MapReduce
3.4.1 A solution based on concurrent.futures
3.4.2 A solution based on the multiprocessing module
3.4.3 Monitoring the progress of the multiprocessing solution
3.4.4 Transferring data in chunks
3.5 Tying it all together: an asynchronous multi-threaded and multi-processing MapReduce server
3.5.1 Architecting a complete high-performance solution
3.5.2 Creating a robust version of the server
3.6 Summary
Chapter 4: High performance NumPy
4.1 Understanding NumPy from a performance perspective
4.1.1 Copies and views
4.1.2 Understanding NumPy’s view machinery
4.1.3 Making use of views for efficiency
4.2 Using array programming
4.2.1 Broadcasting in NumPy
4.2.2 Applying array programming to image manipulation
4.2.3 Developing a "vectorized mentality"
4.3 Tuning NumPy’s internal architecture for performance
4.3.1 An overview of NumPy dependencies
4.3.2 How to tune NumPy in your Python distribution
4.3.3 Threads in NumPy
4.4 Summary
Chapter 5: Re-implementing critical code with Cython
5.1 An overview of alternatives for efficient code re-implementation
5.2 A whirlwind introduction to Cython
5.2.1 A naïve implementation in Cython
5.2.2 Using Cython annotations
5.3 Profiling Cython code
5.3.1 Using Python's built-in profiling infrastructure
5.4 Optimizing array access with Cython memoryviews
5.4.1 Cleaning up all internal interactions with Python
5.5 NumPy generalized universal functions in Cython
5.6 Advanced array access in Cython
5.6.1 A Cython implementation of QuadLife
5.7 Parallelism with Cython
5.8 Summary
Chapter 7: High performance Pandas and Apache Arrow
7.1 Memory and time optimization of data loading
7.2 Techniques to increase data analysis speed
7.2.1 Using indexing to accelerate access
7.2.2 Row iteration strategies
7.3 Pandas on top of NumPy, Cython and NumExpr
7.3.1 Explicit use of NumPy
7.3.2 Pandas on top of NumExpr
7.3.3 Cython and Pandas
7.4 Introducing Apache Arrow and reading data into Pandas with Arrow
7.4.1 The relationship between Pandas and Apache Arrow
7.4.2 Reading a CSV file
7.4.3 Doing analysis with Arrow
7.5 Using Arrow interop to delegate work to more efficient languages and systems
7.5.1 Implications of Arrow’s language interop architecture
7.5.2 Zero-copy operations on data with Arrow’s Plasma server
7.6 Summary
Chapter 8: Storing big data
8.1 A unified interface for file access: fsspec
8.1.1 Looking at a GitHub repository as a file system
8.1.2 Using fsspec to inspect Zip files
8.1.3 Accessing files using fsspec
8.1.4 Using URL chaining to traverse different file systems transparently
8.1.5 Replacing file system backends
8.1.6 Interfacing with PyArrow
8.2 An efficient format to store columnar data: Parquet
8.2.1 Inspecting Parquet metadata
8.2.2 Column encoding with Parquet
8.2.3 Partitioning with datasets
8.3 Dealing with larger-than-memory datasets the old-fashioned way
8.3.1 Memory mapping files with NumPy
8.3.2 Chunk reading and writing of data frames
8.4 Zarr for large array persistence
8.4.1 Understanding Zarr's internal structure
8.4.2 Storage of arrays in Zarr
8.4.3 Creating a new array
8.4.4 Parallel reading and writing of Zarr arrays
8.5 Summary
Chapter 9: Data Analysis using GPU computing
9.1 Making sense of GPU computing power
9.1.1 Understanding the advantages of GPUs
9.1.2 The relationship between CPU and GPU
9.1.3 GPU internal architecture
9.1.4 Software architecture considerations
9.2 Using Numba to generate GPU code
9.2.1 Installation of GPU software for Python
9.2.2 The basics of GPU programming with Numba
9.2.3 Revisiting the Mandelbrot example
9.2.4 A NumPy version of the Mandelbrot code
9.3 Performance analysis of GPU code: the case of a CuPy application
9.3.1 GPU-based data analysis libraries
9.3.2 Using CuPy: a high level data science library
9.3.3 A basic interaction with CuPy
9.3.4 Writing a Mandelbrot generator using Numba
9.3.5 Writing a Mandelbrot generator using CUDA C
9.3.6 Profiling tools for GPU code
Chapter 10: Analyzing big data with Dask
10.1 Understanding the execution model of Dask
10.1.1 A Pandas baseline for comparison
10.1.2 Developing a Dask based data frame solution
10.2 The computational cost of Dask operations
10.2.1 Partitioning data
10.2.2 Persisting intermediate computations
10.2.3 Algorithm implementations over distributed data frames
10.2.4 Repartitioning the data
10.2.5 Persisting distributed data frames
10.3 Using Dask’s distributed scheduler
10.3.1 Architecture
10.3.2 Running code using dask.distributed
10.3.3 Dealing with datasets larger than memory
10.4 Summary
Appendix A: Setting up the environment
A.1 Code style and organization
Notes


📜 SIMILAR VOLUMES


Fast Python for Data Science
✍ Tiago Rodrigues Antao 📂 Library 📅 2022 🏛 Manning 🌐 English

Master these effective techniques to reduce costs and run times, handle huge datasets, and implement complex machine learning applications efficiently in Python. Fast Python for Data Science is a hands-on guide to writing Python code that can process more data, faster, and with less…

Python For Data Science
✍ Kevin Clark 📂 Library 📅 2020 🏛 Kevin Clark 🌐 English

The phenomenon, named the fourth industrial revolution, and also known as Big Data, is bringing profound changes in the world we live in. It is still difficult to make accurate predictions of how the phenomenon will affect our lives and our world, but Big Data will change your personal life, your…

Python for Data Science
✍ Muddana A Lakshmi Lakshmi, Sandhya Vinayakam 📂 Library 📅 2024 🏛 Springer 🌐 English

The book is designed to serve as a textbook for courses offered to undergraduate and graduate students enrolled in data science. This book aims to help the readers understand the basic and advanced concepts for developing simple programs and the fundamentals required for building machine learning…

Python for Data Science for Dummies
✍ Zanab Hussain 📂 Library 📅 2015 🏛 For Dummies 🌐 English

Unleash the power of Python for your data analysis projects with For Dummies! Python is the preferred programming language for data scientists and combines the best features of Matlab, Mathematica, and R into libraries specific to data analysis and visualization. Python for Data…