
Data Preparation for Machine Learning - Data Cleaning, Feature Selection, and Data

✍ Scribed by Jason Brownlee


Publisher: Machine Learning Mastery
Year: 2020
Tongue: English
Leaves: 398
Category: Library

✦ Synopsis


Data preparation involves transforming raw data into a form that can be modeled using machine learning algorithms.

Cut through the equations, Greek letters, and confusion, and discover the specialized data preparation techniques that you need to know to get the most out of your data on your next project.

Using clear explanations, standard Python libraries, and step-by-step tutorial lessons, you will discover how to confidently and effectively prepare your data for predictive modeling with machine learning.

✦ Table of Contents


Copyright
Contents
Preface
I Introduction
II Foundation
Data Preparation in a Machine Learning Project
Tutorial Overview
Applied Machine Learning Process
What Is Data Preparation
How to Choose Data Preparation Techniques
Further Reading
Summary
Why Data Preparation is So Important
Tutorial Overview
What Is Data in Machine Learning
Raw Data Must Be Prepared
Predictive Modeling Is Mostly Data Preparation
Further Reading
Summary
Tour of Data Preparation Techniques
Tutorial Overview
Common Data Preparation Tasks
Data Cleaning
Feature Selection
Data Transforms
Feature Engineering
Dimensionality Reduction
Further Reading
Summary
Data Preparation Without Data Leakage
Tutorial Overview
Problem With Naive Data Preparation
Data Preparation With Train and Test Sets
Data Preparation With k-fold Cross-Validation
Further Reading
Summary
III Data Cleaning
Basic Data Cleaning
Tutorial Overview
Messy Datasets
Identify Columns That Contain a Single Value
Delete Columns That Contain a Single Value
Consider Columns That Have Very Few Values
Remove Columns That Have A Low Variance
Identify Rows That Contain Duplicate Data
Delete Rows That Contain Duplicate Data
Further Reading
Summary
Outlier Identification and Removal
Tutorial Overview
What are Outliers?
Test Dataset
Standard Deviation Method
Interquartile Range Method
Automatic Outlier Detection
Further Reading
Summary
How to Mark and Remove Missing Data
Tutorial Overview
Diabetes Dataset
Mark Missing Values
Missing Values Cause Problems
Remove Rows With Missing Values
Further Reading
Summary
How to Use Statistical Imputation
Tutorial Overview
Statistical Imputation
Horse Colic Dataset
Statistical Imputation With SimpleImputer
Further Reading
Summary
How to Use KNN Imputation
Tutorial Overview
k-Nearest Neighbor Imputation
Horse Colic Dataset
Nearest Neighbor Imputation with KNNImputer
Further Reading
Summary
How to Use Iterative Imputation
Tutorial Overview
Iterative Imputation
Horse Colic Dataset
Iterative Imputation With IterativeImputer
Further Reading
Summary
IV Feature Selection
What is Feature Selection
Tutorial Overview
Feature Selection
Statistics for Feature Selection
Feature Selection With Any Data Type
Common Questions
Further Reading
Summary
How to Select Categorical Input Features
Tutorial Overview
Breast Cancer Categorical Dataset
Categorical Feature Selection
Modeling With Selected Features
Further Reading
Summary
How to Select Numerical Input Features
Tutorial Overview
Diabetes Numerical Dataset
Numerical Feature Selection
Modeling With Selected Features
Tune the Number of Selected Features
Further Reading
Summary
How to Select Features for Numerical Output
Tutorial Overview
Regression Dataset
Numerical Feature Selection
Modeling With Selected Features
Tune the Number of Selected Features
Further Reading
Summary
How to Use RFE for Feature Selection
Tutorial Overview
Recursive Feature Elimination
RFE with scikit-learn
RFE Hyperparameters
Further Reading
Summary
How to Use Feature Importance
Tutorial Overview
Feature Importance
Test Datasets
Coefficients as Feature Importance
Decision Tree Feature Importance
Permutation Feature Importance
Feature Selection with Importance
Common Questions
Further Reading
Summary
V Data Transforms
How to Scale Numerical Data
Tutorial Overview
The Scale of Your Data Matters
Numerical Data Scaling Methods
Diabetes Dataset
MinMaxScaler Transform
StandardScaler Transform
Common Questions
Further Reading
Summary
How to Scale Data With Outliers
Tutorial Overview
Robust Scaling Data
Robust Scaler Transforms
Diabetes Dataset
IQR Robust Scaler Transform
Explore Robust Scaler Range
Further Reading
Summary
How to Encode Categorical Data
Tutorial Overview
Nominal and Ordinal Variables
Encoding Categorical Data
Breast Cancer Dataset
OrdinalEncoder Transform
OneHotEncoder Transform
Common Questions
Further Reading
Summary
How to Make Distributions More Gaussian
Tutorial Overview
Make Data More Gaussian
Power Transforms
Sonar Dataset
Box-Cox Transform
Yeo-Johnson Transform
Further Reading
Summary
How to Change Numerical Data Distributions
Tutorial Overview
Change Data Distribution
Quantile Transforms
Sonar Dataset
Normal Quantile Transform
Uniform Quantile Transform
Further Reading
Summary
How to Transform Numerical to Categorical Data
Tutorial Overview
Change Data Distribution
Discretization Transforms
Sonar Dataset
Uniform Discretization Transform
k-Means Discretization Transform
Quantile Discretization Transform
Further Reading
Summary
How to Derive New Input Variables
Tutorial Overview
Polynomial Features
Polynomial Feature Transform
Sonar Dataset
Polynomial Feature Transform Example
Effect of Polynomial Degree
Further Reading
Summary
VI Advanced Transforms
How to Transform Numerical and Categorical Data
Tutorial Overview
Challenge of Transforming Different Data Types
How to use the ColumnTransformer
Data Preparation for the Abalone Regression Dataset
Further Reading
Summary
How to Transform the Target in Regression
Tutorial Overview
Importance of Data Scaling
How to Scale Target Variables
Example of Using the TransformedTargetRegressor
Further Reading
Summary
How to Save and Load Data Transforms
Tutorial Overview
Challenge of Preparing New Data for a Model
Save Data Preparation Objects
Worked Example of Saving Data Preparation
Further Reading
Summary
VII Dimensionality Reduction
What is Dimensionality Reduction
Tutorial Overview
Problem With Many Input Variables
Dimensionality Reduction
Techniques for Dimensionality Reduction
Further Reading
Summary
How to Perform LDA Dimensionality Reduction
Tutorial Overview
Linear Discriminant Analysis
LDA Scikit-Learn API
Worked Example of LDA for Dimensionality
Further Reading
Summary
How to Perform PCA Dimensionality Reduction
Tutorial Overview
Dimensionality Reduction and PCA
PCA Scikit-Learn API
Worked Example of PCA for Dimensionality Reduction
Further Reading
Summary
How to Perform SVD Dimensionality Reduction
Tutorial Overview
Dimensionality Reduction and SVD
SVD Scikit-Learn API
Worked Example of SVD for Dimensionality
Further Reading
Summary
VIII Appendix
Getting Help
Data Preparation
Machine Learning Books
Python APIs
Ask Questions About Data Preparation
How to Ask Questions
Contact the Author
How to Setup Python on Your Workstation
Tutorial Overview
Download Anaconda
Install Anaconda
Start and Update Anaconda
Further Reading
Summary
IX Conclusions
How Far You Have Come


📜 SIMILAR VOLUMES


Feature Engineering for Machine Learning
✍ Dong, Guozhu; Liu, Huan 📂 Library 📅 2018 🏛 Taylor and Francis 🌐 English

"Feature engineering plays a vital role in big data analytics. Machine learning and data mining algorithms cannot work without data. Little can be achieved if there are few features to represent the underlying data objects, and the quality of results of those algorithms largely depends on the qualit

Data Science e Machine Learning: Dai dat
✍ Michele di Nuzzo 📂 Library 📅 2021 🏛 Michele di Nuzzo 🌐 Italian

Extracting knowledge from information through data analysis: the data scientist's job has been called the most attractive profession of the 21st century. Analyzing the relationships among data, discovering new information and, with the help of machine learning, exploiting the enormous po

Practical Data Science with Jupyter: Exp
✍ PRATEEK GUPTA 📂 Library 📅 2021 🏛 BPB Publications 🌐 English

Solve business problems with data-driven techniques and easy-to-follow Python examples. Key Features: Essential coverage on statistics and data science techniques. Exposure to Jupyter, PyCharm, and use of GitHub. Real use-cases, best practices, and smart techniqu

Feature Engineering for Machine Learning
✍ Alice Zheng, Amanda Casari 📂 Library 📅 2018 🏛 O'Reilly Media 🌐 English

Feature engineering is a crucial step in the machine-learning pipeline, yet this topic is rarely examined on its own. With this practical book, you'll learn techniques for extracting and transforming features, the numeric representations of raw data, into formats for machine-learning models. Each chap

Feature engineering for machine learning
✍ Casari, Amanda; Zheng, Alice 📂 Library 📅 2018 🏛 O'Reilly Media, Inc. 🌐 English

Intro; Copyright; Table of Contents; Preface; Introduction; Conventions Used in This Book; Using Code Examples; O'Reilly Safari; How to Contact Us; Acknowledgments; Special Thanks from Alice; Special Thanks from Amanda; Chapter 1. The Machine Learning Pipeline; Data; Tasks; Models; Features; Model E