Minimalist Data Wrangling with Python

✍ Marek Gagolewski


Publisher: Independently Published
Year: 2023
Language: English
Pages: 443
Category: Library


✦ Synopsis


Minimalist Data Wrangling with Python is envisaged as a student's first introduction to data science, providing a high-level overview as well as discussing key concepts in detail. We explore methods for cleaning data gathered from different sources, transforming, selecting, and extracting features, performing exploratory data analysis and dimensionality reduction, identifying naturally occurring data clusters, modelling patterns in data, comparing data between groups, and reporting the results.

Data science aims at making sense of and generating predictions from data that have been collected in copious quantities from various sources, such as physical sensors, surveys, online forms, access logs, and (pseudo)random number generators, to name a few. These data can take diverse forms, e.g., vectors, matrices, or other tensors, graphs/networks, audio/video streams, or text.
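As a rough illustration of these representations (a sketch not taken from the book itself; all values are invented), each of the forms mentioned above has a natural counterpart in Python:

```python
import numpy as np

vector = np.array([172.5, 168.0, 181.2])           # a 1-D vector (e.g., heights in cm)
matrix = np.array([[1.0, 2.0], [3.0, 4.0]])        # a 2-D matrix
tensor = np.zeros((2, 3, 4))                       # a higher-order tensor
graph = {"a": ["b", "c"], "b": ["c"]}              # a graph as an adjacency list
text = "Data usually do not come in a tidy form."  # unstructured text

print(vector.ndim, matrix.ndim, tensor.ndim)       # number of axes of each array
```

NumPy's `ndim` attribute reports the number of axes, which is one convenient way to distinguish vectors, matrices, and higher-order tensors.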

Data usually do not come in a tidy and tamed form. Data wrangling is the very broad process of appropriately curating raw information chunks and then exploring the underlying data structure so that they become analysable.
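A hedged sketch of what curating raw information chunks can look like in practice (the column names and values below are hypothetical, chosen only to illustrate the idea):

```python
import pandas as pd

# Raw, untidy input: mixed types, missing entries, inconsistent labels
raw = pd.DataFrame({
    "name": ["Alice", "bob", "Carol", None],
    "height_cm": ["172.5", "181", "n/a", "168.0"],
    "group": ["A", "a", "B", "b"],
})

clean = (
    raw
    .dropna(subset=["name"])  # discard rows lacking a name
    .assign(
        name=lambda d: d["name"].str.title(),          # normalise capitalisation
        height_cm=lambda d: pd.to_numeric(             # strings -> numbers;
            d["height_cm"], errors="coerce"),          # unparsable values become NaN
        group=lambda d: d["group"].str.upper(),        # unify group labels
    )
)
print(clean)
```

After such a pass, the columns have consistent types and labels, and the data become analysable with the standard numeric and grouping tools.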


By no means do we have the ambition to be comprehensive with regard to any topic we cover. Time for that will come later in separate lectures on calculus, matrix algebra, probability, mathematical statistics, continuous and combinatorial optimisation, information theory, stochastic processes, statistical/machine learning, algorithms and data structures, take a deep breath, databases and Big Data analytics, operational research, graphs and networks, differential equations and dynamical systems, time series analysis, signal processing, etc.

We primarily focus on methods and algorithms that have stood the test of time and that continue to inspire researchers and practitioners. They all meet the reality check comprised of the three following properties, which we believe are essential in practice:

  • simplicity (and thus interpretability, being equipped with no or only a few underlying tunable parameters; being based on some sensible intuitions that can be explained in our own words),
  • mathematical analysability (at least to some extent; so that we can understand their strengths and limitations),
  • implementability (not too abstract on the one hand, but also not requiring any advanced computer-y hocus-pocus on the other).

This course uses the Python language, which we shall introduce from scratch. Consequently, we do not require any prior programming experience.

Over the last few years, Python has proven to be a very robust choice for learning and applying data wrangling techniques. This is thanks to its devoted community of open-source programmers, who develop and maintain high-quality packages such as NumPy, SciPy, Matplotlib, pandas, Seaborn, and scikit-learn.
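As a rough illustration of how a few of these packages fit together (a sketch under invented parameters, not an excerpt from the book): NumPy supplies the numeric arrays, pandas wraps them in labelled structures, and SciPy adds statistical utilities.

```python
import numpy as np
import pandas as pd
import scipy.stats

rng = np.random.default_rng(seed=123)                  # reproducible pseudorandom numbers
heights = rng.normal(loc=170.0, scale=7.0, size=500)   # NumPy: a vector of simulated heights

s = pd.Series(heights)                                 # pandas: a labelled data structure
print(s.describe())                                    # quick summary statistics

z = scipy.stats.zscore(heights)                        # SciPy: standardisation (z-scores)
print(round(z.mean(), 6), round(z.std(), 6))           # standardised: mean ~0, sd ~1
```

The same standardisation could of course be written by hand as `(heights - heights.mean()) / heights.std()`; the point is that the packages interoperate on the same underlying arrays.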

Nevertheless, Python and its third-party packages are amongst many software tools which can help extract knowledge from data. Other robust open-source choices include R and Julia.

We will focus on developing transferable skills: most of what we learn here can be applied (using different syntax but the same kind of reasoning) in other environments. Thus, this is a course on data wrangling (with Python), not a course on Python (with examples in data wrangling).

✦ Table of Contents


Preface
The art of data wrangling
Aims, scope, and design philosophy
We need maths
We need some computing environment
We need data and domain knowledge
Structure
The Rules
About the author
Acknowledgements
You can make this book better
I Introducing Python
Getting started with Python
Installing Python
Working with Jupyter notebooks
Launching JupyterLab
First notebook
More cells
Edit vs command mode
Markdown cells
The best note-taking app
Initialising each session and getting example data
Exercises
Scalar types and control structures in Python
Scalar types
Logical values
Numeric values
Arithmetic operators
Creating named variables
Character strings
F-strings (formatted string literals)
Calling built-in functions
Positional and keyword arguments
Modules and packages
Slots and methods
Controlling program flow
Relational and logical operators
The if statement
The while loop
Defining functions
Exercises
Sequential and other types in Python
Sequential types
Lists
Tuples
Ranges
Strings (again)
Working with sequences
Extracting elements
Slicing
Modifying elements of mutable sequences
Searching for specific elements
Arithmetic operators
Dictionaries
Iterable types
The for loop
Tuple assignment
Argument unpacking (*)
Variadic arguments: *args and **kwargs (*)
Object references and copying (*)
Copying references
Pass by assignment
Object copies
Modify in place or return a modified copy?
Further reading
Exercises
II Unidimensional data
Unidimensional numeric data and their empirical distribution
Creating vectors in numpy
Enumerating elements
Arithmetic progressions
Repeating values
numpy.r_ (*)
Generating pseudorandom variates
Loading data from files
Mathematical notation
Inspecting the data distribution with histograms
heights: A bell-shaped distribution
income: A right-skewed distribution
How many bins?
peds: A bimodal distribution (already binned)
matura: A bell-shaped distribution (almost)
marathon (truncated – fastest runners): A left-skewed distribution
Log-scale and heavy-tailed distributions
Cumulative probabilities and the empirical cumulative distribution function
Exercises
Processing unidimensional data
Aggregating numeric data
Measures of location
Arithmetic mean and median
Sensitive to outliers vs robust
Sample quantiles
Measures of dispersion
Standard deviation (and variance)
Interquartile range
Measures of shape
Box (and whisker) plots
Further methods (*)
Vectorised mathematical functions
Logarithms and exponential functions
Trigonometric functions
Arithmetic operators
Vector-scalar case
Application: Feature scaling
Standardisation and z-scores
Min-max scaling and clipping
Normalisation (l2; dividing by magnitude)
Normalisation (l1; dividing by sum)
Vector-vector case
Indexing vectors
Integer indexing
Logical indexing
Slicing
Other operations
Cumulative sums and iterated differences
Sorting
Dealing with tied observations
Determining the ordering permutation and ranking
Searching for certain indexes (argmin, argmax)
Dealing with round-off and measurement errors
Vectorising scalar operations with list comprehensions
Exercises
Continuous probability distributions
Normal distribution
Estimating parameters
Data models are useful
Assessing goodness-of-fit
Comparing cumulative distribution functions
Comparing quantiles
Kolmogorov–Smirnov test (*)
Other noteworthy distributions
Log-normal distribution
Pareto distribution
Uniform distribution
Distribution mixtures (*)
Generating pseudorandom numbers
Uniform distribution
Not exactly random
Sampling from other distributions
Natural variability
Adding jitter (white noise)
Independence assumption
Further reading
Exercises
III Multidimensional data
Multidimensional numeric data at a glance
Creating matrices
Reading CSV files
Enumerating elements
Repeating arrays
Stacking arrays
Other functions
Reshaping matrices
Mathematical notation
Row and column vectors
Transpose
Identity and other diagonal matrices
Visualising multidimensional data
2D Data
3D Data and beyond
Scatter plot matrix (pairs plot)
Exercises
Processing multidimensional data
From vectors to matrices
Vectorised mathematical functions
Componentwise aggregation
Arithmetic, logical, and relational operations
Matrix vs scalar
Matrix vs matrix
Matrix vs any vector
Row vector vs column vector (*)
Other row and column transforms (*)
Indexing matrices
Slice-based indexing
Scalar-based indexing
Mixed logical/integer vector and scalar/slice indexers
Two vectors as indexers (*)
Views of existing arrays (*)
Adding and modifying rows and columns
Matrix multiplication, dot products, and the Euclidean norm
Pairwise distances and related methods
The Euclidean metric
Centroids
Multidimensional dispersion and other aggregates
Fixed-radius and k-nearest neighbour search
Spatial search with K-d trees
Exercises
Exploring relationships between variables
Measuring correlation
Pearson’s linear correlation coefficient
Perfect linear correlation
Strong linear correlation
No linear correlation does not imply independence
False linear correlations
Correlation is not causation
Correlation heat map
Linear correlation coefficients on transformed data
Spearman’s rank correlation coefficient
Regression tasks
K-nearest neighbour regression
From data to (linear) models
Least squares method
Analysis of residuals
Multiple regression
Variable transformation and linearisable models (*)
Descriptive vs predictive power (*)
Fitting regression models with scikit-learn (*)
Ill-conditioned model matrices (*)
Finding interesting combinations of variables (*)
Dot products, angles, collinearity, and orthogonality
Geometric transformations of points
Matrix inverse
Singular value decomposition
Dimensionality reduction with SVD
Principal component analysis
Further reading
Exercises
IV Heterogeneous data
Introducing data frames
Creating data frames
Data frames are matrix-like
Series
Index
Aggregating data frames
Transforming data frames
Indexing Series objects
Do not use [...] directly
loc[...]
iloc[...]
Logical indexing
Indexing data frames
loc[...] and iloc[...]
Adding rows and columns
Modifying items
Pseudorandom sampling and splitting
Hierarchical indexes (*)
Further operations on data frames
Sorting
Stacking and unstacking (long/tall and wide forms)
Joining (merging)
Set-theoretic operations and removing duplicates
…and (too) many more
Exercises
Handling categorical data
Representing and generating categorical data
Encoding and decoding factors
Binary data as logical and probability vectors
One-hot encoding (*)
Binning numeric data (revisited)
Generating pseudorandom labels
Frequency distributions
Counting
Two-way contingency tables: Factor combinations
Combinations of even more factors
Visualising factors
Bar plots
Political marketing and statistics
Pie… don’t even trip
Pareto charts (*)
Heat maps
Aggregating and comparing factors
Mode
Binary data as logical vectors
Pearson chi-squared test (*)
Two-sample Pearson chi-squared test (*)
Measuring association (*)
Binned numeric data
Ordinal data (*)
Exercises
Processing data in groups
Basic methods
Aggregating data in groups
Transforming data in groups
Manual splitting into subgroups (*)
Plotting data in groups
Series of box plots
Series of bar plots
Semitransparent histograms
Scatter plots with group information
Grid (trellis) plots
Kolmogorov–Smirnov test for comparing ECDFs (*)
Comparing quantiles
Classification tasks
K-nearest neighbour classification
Assessing the quality of predictions
Splitting into training and test sets
Validating many models (parameter selection) (*)
Clustering tasks
K-means method
Solving k-means is hard
Lloyd algorithm
Local minima
Random restarts
Further reading
Exercises
Accessing databases
Example database
Exporting data to a database
Exercises on SQL vs pandas
Filtering
Ordering
Removing duplicates
Grouping and aggregating
Joining
Solutions to exercises
Closing the database connection
Common data serialisation formats for the Web
Working with many files
File paths
File search
Exception handling
File connections (*)
Exercises
V Other data types
Text data
Basic string operations
Unicode as the universal encoding
Normalising strings
Substring searching and replacing
Locale-aware services in ICU (*)
String operations in pandas
String operations in numpy (*)
Working with string lists
Formatted outputs for reproducible report generation
Formatting strings
str and repr
Aligning strings
Direct Markdown output in Jupyter
Manual Markdown file output (*)
Regular expressions (*)
Regex matching with re
Regex matching with pandas
Matching individual characters
Matching any character
Defining character sets
Complementing sets
Defining code point ranges
Using predefined character sets
Alternating and grouping subexpressions
Alternation operator
Grouping subexpressions
Non-grouping parentheses
Quantifiers
Capture groups and references thereto (*)
Extracting capture group matches
Replacing with capture group matches
Back-referencing
Anchoring
Matching at the beginning or end of a string
Matching at word boundaries
Looking behind and ahead (*)
Exercises
Missing, censored, and questionable data
Missing data
Representing and detecting missing values
Computing with missing values
Missing at random or not?
Discarding missing values
Mean imputation
Imputation by classification and regression (*)
Censored and interval data (*)
Incorrect data
Outliers
The 3/2 IQR rule for normally-distributed data
Unidimensional density estimation (*)
Multidimensional density estimation (*)
Exercises
Time series
Temporal ordering and line charts
Working with date-times and time-deltas
Representation: The UNIX epoch
Time differences
Date-times in data frames
Basic operations
Iterated differences and cumulative sums revisited
Smoothing with moving averages
Detecting trends and seasonal patterns
Imputing missing values
Plotting multidimensional time series
Candlestick plots (*)
Further reading
Exercises
Changelog
References


✦ Similar Volumes


Minimalist Data Wrangling with Python
✍ Marek Gagolewski 📂 Library 📅 2024 🏛 Independently Published 🌐 English

Minimalist Data Wrangling with Python is envisaged as a student's first introduction to data science, providing a high-level overview as well as discussing key concepts in detail. We explore methods for cleaning data gathered from different sources, transforming, selecting, and extracting features,…

Minimalist Data Wrangling with Python
✍ Marek Gagolewski 📂 Library 📅 2022 🏛 Marek Gagolewski 🌐 English

Minimalist Data Wrangling with Python is envisaged as a student's first introduction to data science, providing a high-level overview as well as discussing key concepts in detail. We explore methods for cleaning data gathered from different sources, trans…

Python for Data Analysis: Data Wrangling
✍ Wes McKinney 📂 Library 📅 2017 🏛 O'Reilly Media 🌐 English

Looking for complete instructions on manipulating, processing, cleaning, and crunching structured data in Python? The second edition of this hands-on guide, updated for Python 3.5 and Pandas 1.0, is packed with practical case studies that show you how to effectively solve a broad set of data analys…

Python for Data Analysis: Data Wrangling
✍ Wes McKinney 📂 Library 📅 2017 🏛 O’Reilly Media 🌐 English

Get complete instructions for manipulating, processing, cleaning, and crunching datasets in Python. Updated for Python 3.6, the second edition of this hands-on guide is packed with practical case studies that show you how to solve a broad set of data analysis problems effectively. You'll lea…
