Minimalist Data Wrangling with Python

✍ Scribed by Marek Gagolewski

Publisher: Marek Gagolewski
Year: 2022
Tongue: English
Leaves: 436
Category: Library


✦ Synopsis

Minimalist Data Wrangling with Python is envisaged as a student's first introduction to data science, providing a high-level overview as well as discussing key concepts in detail. We explore methods for cleaning data gathered from different sources, transforming, selecting, and extracting features, performing exploratory data analysis and dimensionality reduction, identifying naturally occurring data clusters, modelling patterns in data, comparing data between groups, and reporting the results.
This textbook is a non-profit project. Its online and PDF versions are freely available at https://datawranglingpy.gagolewski.com/.
This is version 1.0.3 of the book (last updated 2023-02-06).

✦ Table of Contents

Preface
The art of data wrangling
Aims, scope, and design philosophy
We need maths
We need some computing environment
We need data and domain knowledge
Structure
The Rules
About the author
Acknowledgements
You can make this book better
I Introducing Python
Getting started with Python
Installing Python
Working with Jupyter notebooks
Launching JupyterLab
First notebook
More cells
Edit vs command mode
Markdown cells
The best note-taking app
Initialising each session and getting example data
Exercises
Scalar types and control structures in Python
Scalar types
Logical values
Numeric values
Arithmetic operators
Creating named variables
Character strings
F-strings (formatted string literals)
Calling built-in functions
Positional and keyword arguments
Modules and packages
Slots and methods
Controlling program flow
Relational and logical operators
The if statement
The while loop
Defining functions
Exercises
Sequential and other types in Python
Sequential types
Lists
Tuples
Ranges
Strings (again)
Working with sequences
Extracting elements
Slicing
Modifying elements of mutable sequences
Searching for specific elements
Arithmetic operators
Dictionaries
Iterable types
The for loop
Tuple assignment
Argument unpacking
Variadic arguments: *args and **kwargs
Object references and copying
Copying references
Pass by assignment
Object copies
Modify in place or return a modified copy?
Further reading
Exercises
II Unidimensional data
Unidimensional numeric data and their empirical distribution
Creating vectors in numpy
Enumerating elements
Arithmetic progressions
Repeating values
numpy.r_
Generating pseudorandom variates
Loading data from files
Some mathematical notation
Inspecting the data distribution with histograms
heights: A bell-shaped distribution
income: A right-skewed distribution
How many bins?
peds: A bimodal distribution (already binned)
matura: A bell-shaped distribution (almost)
marathon (truncated – fastest runners): A left-skewed distribution
Log-scale and heavy-tailed distributions
Cumulative probabilities and the empirical cumulative distribution function
Exercises
Processing unidimensional data
Aggregating numeric data
Measures of location
Arithmetic mean and median
Sensitive to outliers vs robust
Sample quantiles
Measures of dispersion
Standard deviation (and variance)
Interquartile range
Measures of shape
Box (and whisker) plots
Other aggregation methods
Vectorised mathematical functions
Logarithms and exponential functions
Trigonometric functions
Arithmetic operators
Vector-scalar case
Application: Feature scaling
Standardisation and z-scores
Min-max scaling and clipping
Normalisation (l2; dividing by magnitude)
Normalisation (l1; dividing by sum)
Vector-vector case
Indexing vectors
Integer indexing
Logical indexing
Slicing
Other operations
Cumulative sums and iterated differences
Sorting
Dealing with tied observations
Determining the ordering permutation and ranking
Searching for certain indexes (argmin, argmax)
Dealing with round-off and measurement errors
Vectorising scalar operations with list comprehensions
Exercises
Continuous probability distributions
Normal distribution
Estimating parameters
Data models are useful
Assessing goodness-of-fit
Comparing cumulative distribution functions
Comparing quantiles
Kolmogorov–Smirnov test
Other noteworthy distributions
Log-normal distribution
Pareto distribution
Uniform distribution
Distribution mixtures
Generating pseudorandom numbers
Uniform distribution
Not exactly random
Sampling from other distributions
Natural variability
Adding jitter (white noise)
Independence assumption
Further reading
Exercises
III Multidimensional data
From uni- to multidimensional numeric data
Creating matrices
Reading CSV files
Enumerating elements
Repeating arrays
Stacking arrays
Other functions
Reshaping matrices
Mathematical notation
Transpose
Row and column vectors
Identity and other diagonal matrices
Visualising multidimensional data
2D Data
3D data and beyond
Scatter plot matrix (pairs plot)
Exercises
Processing multidimensional data
Extending vectorised operations to matrices
Vectorised mathematical functions
Componentwise aggregation
Arithmetic, logical, and relational operations
Matrix vs scalar
Matrix vs matrix
Matrix vs any vector
Row vector vs column vector
Other row and column transforms
Indexing matrices
Slice-based indexing
Scalar-based indexing
Mixed logical/integer vector and scalar/slice indexers
Two vectors as indexers
Views of existing arrays
Adding and modifying rows and columns
Matrix multiplication, dot products, and Euclidean norm
Pairwise distances and related methods
Euclidean metric
Centroids
Multidimensional dispersion and other aggregates
Fixed-radius and k-nearest neighbour search
Spatial search with K-d trees
Exercises
Exploring relationships between variables
Measuring correlation
Pearson linear correlation coefficient
Perfect linear correlation
Strong linear correlation
No linear correlation does not imply independence
False linear correlations
Correlation is not causation
Correlation heat map
Linear correlation coefficients on transformed data
Spearman rank correlation coefficient
Regression tasks
K-nearest neighbour regression
From data to (linear) models
Least squares method
Analysis of residuals
Multiple regression
Variable transformation and linearisable models
Descriptive vs predictive power
Fitting regression models with scikit-learn
Ill-conditioned model matrices
Finding interesting combinations of variables
Dot products, angles, collinearity, and orthogonality
Geometric transformations of points
Matrix inverse
Singular value decomposition
Dimensionality reduction with SVD
Principal component analysis
Further reading
Exercises
IV Heterogeneous data
Introducing data frames
Creating data frames
Data frames are matrix-like
Series
Index
Aggregating data frames
Transforming data frames
Indexing Series objects
Do not use [...] directly (in the current version of pandas)
loc[...]
iloc[...]
Logical indexing
Indexing data frames
loc[...] and iloc[...]
Adding rows and columns
Modifying items
Pseudorandom sampling and splitting
Hierarchical indexes
Further operations on data frames
Sorting
Stacking and unstacking (long/tall and wide forms)
Joining (merging)
Set-theoretic operations and removing duplicates
…and (too) many more
Exercises
Handling categorical data
Representing and generating categorical data
Encoding and decoding factors
Binary data as logical and probability vectors
One-hot encoding
Binning numeric data (revisited)
Generating pseudorandom labels
Frequency distributions
Counting
Two-way contingency tables: Factor combinations
Combinations of even more factors
Visualising factors
Bar plots
Political marketing and statistics
Pareto charts
Heat maps
Aggregating and comparing factors
Mode
Binary data as logical vectors
Pearson chi-squared test
Two-sample Pearson chi-squared test
Measuring association
Binned numeric data
Ordinal data
Exercises
Processing data in groups
Basic methods
Aggregating data in groups
Transforming data in groups
Manual splitting into subgroups
Plotting data in groups
Series of box plots
Series of bar plots
Semitransparent histograms
Scatter plots with group information
Grid (trellis) plots
Kolmogorov–Smirnov test for comparing ECDFs
Comparing quantiles
Classification tasks
K-nearest neighbour classification
Assessing prediction quality
Splitting into training and test sets
Validating many models (parameter selection)
Clustering tasks
K-means method
Solving k-means is hard
Lloyd algorithm
Local minima
Random restarts
Further reading
Exercises
Accessing databases
Example database
Exporting data to a database
Exercises on SQL vs pandas
Filtering
Ordering
Removing duplicates
Grouping and aggregating
Joining
Solutions to exercises
Closing the database connection
Common data serialisation formats for the Web
Working with many files
File paths
File search
Exception handling
File connections
Further reading
Exercises
V Other data types
Text data
Basic string operations
Unicode as the universal encoding
Normalising strings
Substring searching and replacing
Locale-aware services in ICU
String operations in pandas
String operations in numpy
Working with string lists
Formatted outputs for reproducible report generation
Formatting strings
str and repr
Aligning strings
Direct Markdown output in Jupyter
Manual Markdown file output
Regular expressions
Regex matching with re
Regex matching with pandas
Matching individual characters
Matching anything (almost)
Defining character sets
Complementing sets
Defining code point ranges
Using predefined character sets
Alternating and grouping subexpressions
Alternation operator
Grouping subexpressions
Non-grouping parentheses
Quantifiers
Capture groups and references thereto
Extracting capture group matches
Replacing with capture group matches
Back-referencing
Anchoring
Matching at the beginning or end of a string
Matching at word boundaries
Looking behind and ahead
Exercises
Missing, censored, and questionable data
Missing data
Representing and detecting missing values
Computing with missing values
Missing at random or not?
Discarding missing values
Mean imputation
Imputation by classification and regression
Censored and interval data
Incorrect data
Outliers
The 3/2 IQR rule for normally-distributed data
Unidimensional density estimation
Multidimensional density estimation
Exercises
Time series
Temporal ordering and line charts
Working with date-times and time-deltas
Representation: The UNIX epoch
Time differences
Date-times in data frames
Basic operations
Iterated differences and cumulative sums revisited
Smoothing with moving averages
Detecting trends and seasonal patterns
Imputing missing values
Plotting multidimensional time series
Candlestick plots (*)
Further reading
Exercises
Changelog
References

📜 SIMILAR VOLUMES

Minimalist Data Wrangling with Python

📁 Minimalist Data Wrangling with Python

✍ Marek Gagolewski 📂 Library 📅 2023 🏛 Independently Published 🌐 English

Minimalist Data Wrangling with Python

📁 Minimalist Data Wrangling with Python

✍ Marek Gagolewski 📂 Library 📅 2024 🏛 Independently Published 🌐 English

Python for Data Analysis: Data Wrangling

📁 Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython

✍ Wes McKinney 📂 Library 📅 2017 🏛 O'Reilly Media 🌐 English

Looking for complete instructions on manipulating, processing, cleaning, and crunching structured data in Python? The second edition of this hands-on guide--updated for Python 3.5 and Pandas 1.0--is packed with practical case studies that show you how to effectively solve a broad set of data analys

Python for Data Analysis. Data Wrangling

📁 Python for Data Analysis. Data Wrangling with Pandas, NumPy, and IPython

✍ Wes McKinney 📂 Library 📅 2017 🏛 O’Reilly 🌐 English

Python for Data Analysis: Data Wrangling

📁 Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython

✍ Wes McKinney 📂 Library 📅 2017 🏛 O’Reilly Media 🌐 English

Get complete instructions for manipulating, processing, cleaning, and crunching datasets in Python. Updated for Python 3.6, the second edition of this hands-on guide is packed with practical case studies that show you how to solve a broad set of data analysis problems effectively. You’ll lea
