Data Cleaning and Exploration with Machine Learning: Get to grips with machine learning techniques to achieve sparkling-clean data quickly
By Michael Walker
- Publisher: Packt Publishing
- Year: 2022
- Language: English
- Pages: 542
- Category: Library
Synopsis
Explore supercharged machine learning techniques to take care of your data laundry loads
Key Features
- Learn how to prepare data for machine learning processes
- Understand which algorithms are based on prediction objectives and the properties of the data
- Explore how to interpret and evaluate the results from machine learning
Book Description
Many individuals who know how to run machine learning algorithms do not have a good sense of the statistical assumptions they make and how to match the properties of the data to the algorithm for the best results.
As you start with this book, models are carefully chosen to help you grasp the underlying data, including feature importance, correlation, and the distribution of features and targets. The first two parts of the book introduce techniques for preparing data for ML algorithms, without being bashful about using some ML techniques for data cleaning, including anomaly detection and feature selection. The book then helps you apply that knowledge to a wide variety of ML tasks. You'll gain an understanding of popular supervised and unsupervised algorithms, how to prepare data for them, and how to evaluate them. Next, you'll build models and explore the relationships in your data, performing cleaning and exploration tasks along the way. You'll make quick progress in studying the distribution of variables, identifying anomalies, and examining bivariate relationships, while keeping the focus on the accuracy of your predictions.
By the end of this book, you'll be able to deal with complex data problems using unsupervised ML algorithms like principal component analysis and k-means clustering.
What you will learn
- Explore essential data cleaning and exploration techniques to be used before running the most popular machine learning algorithms
- Understand how to perform preprocessing and feature selection, and how to set up the data for testing and validation
- Model continuous targets with supervised learning algorithms
- Model binary and multiclass targets with supervised learning algorithms
- Execute clustering and dimension reduction with unsupervised learning algorithms
- Understand how to use regression trees to model a continuous target
Who this book is for
This book is for professional data scientists, particularly those in the first few years of their career, or more experienced analysts who are relatively new to machine learning. Readers should have prior knowledge of concepts in statistics typically taught in an undergraduate introductory course as well as beginner-level experience in manipulating data programmatically.
Table of Contents
- Examining the Distribution of Features and Targets
- Examining Bivariate and Multivariate Relationships between Features and Targets
- Identifying and Fixing Missing Values
- Encoding, Transforming, and Scaling Features
- Feature Selection
- Preparing for Model Evaluation
- Linear Regression Models
- Support Vector Regression
- K-Nearest Neighbors, Decision Tree, Random Forest, and Gradient Boosted Regression
- Logistic Regression
- Decision Trees and Random Forest Classification
- K-Nearest Neighbors for Classification
- Support Vector Machine Classification
- Naive Bayes Classification
- Principal Component Analysis
- K-Means and DBSCAN Clustering
Detailed Table of Contents
Cover
Title page
Copyright and Credits
Contributors
Table of Contents
Preface
Section 1 – Data Cleaning and Machine Learning Algorithms
Chapter 1: Examining the Distribution of Features and Targets
Technical requirements
Subsetting data
Generating frequencies for categorical features
Generating summary statistics for continuous and discrete features
Identifying extreme values and outliers in univariate analysis
Using histograms, boxplots, and violin plots to examine the distribution of features
Using histograms
Using boxplots
Using violin plots
Summary
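Chapter 1's core moves, summary statistics and univariate outlier detection, can be sketched with pandas, which the book likely relies on. The toy wage series below is hypothetical, not from the book; the outlier flag uses the common 1.5 × IQR rule:

```python
import pandas as pd

# Hypothetical toy series standing in for a continuous feature
wages = pd.Series([21, 24, 25, 26, 28, 30, 31, 33, 35, 120], name="hourlywage")

# Summary statistics for a continuous feature
stats = wages.describe()

# Flag extreme values with the interquartile-range rule (1.5 * IQR)
q1, q3 = wages.quantile(0.25), wages.quantile(0.75)
iqr = q3 - q1
outliers = wages[(wages < q1 - 1.5 * iqr) | (wages > q3 + 1.5 * iqr)]
```

Here only the 120 value falls outside the fences, which is the kind of extreme value the chapter's histograms and boxplots would surface visually.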
Chapter 2: Examining Bivariate and Multivariate Relationships between Features and Targets
Technical requirements
Identifying outliers and extreme values in bivariate relationships
Using scatter plots to view bivariate relationships between continuous features
Using grouped boxplots to view bivariate relationships between continuous and categorical features
Using linear regression to identify data points with significant influence
Using K-nearest neighbors to find outliers
Using Isolation Forest to find outliers
Summary
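The Isolation Forest approach named in Chapter 2 isolates anomalies by random splits: outliers take fewer splits to separate. A minimal scikit-learn sketch, with synthetic data and two planted anomalies (none of it from the book):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# One dense bivariate cluster plus two planted anomalies
X = np.vstack([rng.normal(0, 1, size=(200, 2)),
               np.array([[8.0, 8.0], [-9.0, 7.0]])])

# contamination sets the expected share of outliers
iso = IsolationForest(contamination=0.01, random_state=0).fit(X)
labels = iso.predict(X)          # 1 = inlier, -1 = outlier
n_outliers = int((labels == -1).sum())
```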
Chapter 3: Identifying and Fixing Missing Values
Technical requirements
Identifying missing values
Cleaning missing values
Imputing values with regression
Using KNN imputation
Using random forest for imputation
Summary
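KNN imputation, one of Chapter 3's techniques, fills a missing value from the average of the nearest rows on the observed features. A minimal sketch with scikit-learn's `KNNImputer` on a hypothetical matrix:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical feature matrix with one missing value
X = np.array([[1.0, 2.0],
              [2.0, np.nan],
              [3.0, 6.0],
              [4.0, 8.0]])

# The missing cell is replaced by the mean of that column
# in the two nearest rows (measured on the non-missing features)
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
```

The two nearest neighbors of the incomplete row have second-column values 2.0 and 6.0, so the imputed value is their mean, 4.0.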
Section 2 – Preprocessing, Feature Selection, and Sampling
Chapter 4: Encoding, Transforming, and Scaling Features
Technical requirements
Creating training datasets and avoiding data leakage
Removing redundant or unhelpful features
Encoding categorical features
One-hot encoding
Ordinal encoding
Encoding categorical features with medium or high cardinality
Feature hashing
Using mathematical transformations
Feature binning
Equal-width and equal-frequency binning
K-means binning
Feature scaling
Summary
Chapter 5: Feature Selection
Technical requirements
Selecting features for classification models
Mutual information classification for feature selection with a categorical target
ANOVA F-value for feature selection with a categorical target
Selecting features for regression models
F-tests for feature selection with a continuous target
Mutual information for feature selection with a continuous target
Using forward and backward feature selection
Using forward feature selection
Using backward feature selection
Using exhaustive feature selection
Eliminating features recursively in a regression model
Eliminating features recursively in a classification model
Using Boruta for feature selection
Using regularization and other embedded methods
Using L1 regularization
Using a random forest classifier
Using principal component analysis
Summary
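Mutual information scoring with a categorical target, one of Chapter 5's filter methods, can be sketched with `SelectKBest`. The iris dataset here is a stand-in, not necessarily one the book uses:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Keep the two features sharing the most mutual information with the target
selector = SelectKBest(mutual_info_classif, k=2).fit(X, y)
X_selected = selector.transform(X)
```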
Chapter 6: Preparing for Model Evaluation
Technical requirements
Measuring accuracy, sensitivity, specificity, and precision for binary classification
Examining CAP, ROC, and precision-sensitivity curves for binary classification
Constructing CAP curves
Plotting a receiver operating characteristic (ROC) curve
Plotting precision-sensitivity curves
Evaluating multiclass models
Evaluating regression models
Using K-fold cross-validation
Preprocessing data with pipelines
Summary
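The binary classification metrics Chapter 6 opens with all derive from the confusion matrix. A worked sketch with hypothetical predictions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical predictions against true labels for a binary target
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # recall for the positive class
specificity = tn / (tn + fp)   # recall for the negative class
precision = tp / (tp + fp)
```

With 3 true positives, 3 true negatives, and one error of each kind, all four metrics come out to 0.75, a reminder that they diverge only when the error types are imbalanced.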
Section 3 – Modeling Continuous Targets with Supervised Learning
Chapter 7: Linear Regression Models
Technical requirements
Key concepts
Key assumptions of linear regression models
Linear regression and ordinary least squares
Linear regression and gradient descent
Using classical linear regression
Pre-processing the data for our regression model
Running and evaluating our linear model
Improving our model evaluation
Using lasso regression
Tuning hyperparameters with grid searches
Using non-linear regression
Regression with gradient descent
Summary
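Chapter 7's lasso regression and grid-search tuning combine naturally in a pipeline, so scaling is re-fit inside each cross-validation fold. A sketch on scikit-learn's bundled diabetes data (the alpha grid is illustrative):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)

# Scale inside the pipeline so each CV fold scales on its own training data
pipe = Pipeline([("scale", StandardScaler()), ("lasso", Lasso(max_iter=10000))])

# Search over the lasso penalty strength with 5-fold cross-validation
grid = GridSearchCV(pipe, {"lasso__alpha": [0.1, 1.0, 10.0]}, cv=5)
grid.fit(X, y)
best_alpha = grid.best_params_["lasso__alpha"]
```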
Chapter 8: Support Vector Regression
Technical requirements
Key concepts of SVR
Nonlinear SVR and the kernel trick
SVR with a linear model
Using kernels for nonlinear SVR
Summary
Chapter 9: K-Nearest Neighbors, Decision Tree, Random Forest, and Gradient Boosted Regression
Technical requirements
Key concepts for K-nearest neighbors regression
K-nearest neighbors regression
Key concepts for decision tree and random forest regression
Using random forest regression
Decision tree and random forest regression
A decision tree example with interpretation
Building and interpreting our actual model
Random forest regression
Using gradient boosted regression
Summary
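Gradient boosted regression, the last technique in Chapter 9, fits trees sequentially on the residuals of the previous ones. A minimal sketch on the bundled diabetes dataset (a stand-in, not necessarily the book's data):

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Default settings: 100 boosting stages of shallow regression trees
gbr = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)
r2 = gbr.score(X_test, y_test)   # R-squared on held-out data
```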
Section 4 – Modeling Dichotomous and Multiclass Targets with Supervised Learning
Chapter 10: Logistic Regression
Technical requirements
Key concepts of logistic regression
Logistic regression extensions
Binary classification with logistic regression
Evaluating a logistic regression model
Regularization with logistic regression
Multinomial logistic regression
Summary
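Chapter 10's regularized logistic regression can be sketched as follows; L1 regularization shrinks some coefficients to exactly zero, doubling as feature selection. The breast cancer dataset is a stand-in:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y)

# L1 (lasso-style) penalty; C is the inverse regularization strength
clf = make_pipeline(StandardScaler(),
                    LogisticRegression(penalty="l1", solver="liblinear", C=1.0))
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
```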
Chapter 11: Decision Trees and Random Forest Classification
Technical requirements
Key concepts
Using random forest for classification
Using gradient-boosted decision trees
Decision tree models
Implementing random forest
Implementing gradient boosting
Summary
Chapter 12: K-Nearest Neighbors for Classification
Technical requirements
Key concepts of KNN
KNN for binary classification
KNN for multiclass classification
KNN for letter recognition
Summary
Chapter 13: Support Vector Machine Classification
Technical requirements
Key concepts for SVC
Nonlinear SVM and the kernel trick
Multiclass classification with SVC
Linear SVC models
Nonlinear SVM classification models
SVMs for multiclass classification
Summary
Chapter 14: Naïve Bayes Classification
Technical requirements
Key concepts
Naïve Bayes classification models
Naïve Bayes for text classification
Summary
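Naïve Bayes for text, Chapter 14's closing topic, pairs a bag-of-words vectorizer with a multinomial model. The four-document corpus below is a hypothetical miniature; the book works with a real labeled text dataset:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny hypothetical corpus: 1 = spam, 0 = not spam
texts = ["cheap pills buy now", "meeting at noon",
         "buy cheap now", "lunch meeting tomorrow"]
labels = [1, 0, 1, 0]

# Count word occurrences, then apply multinomial naive Bayes
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(texts, labels)
pred = clf.predict(["cheap pills now"])[0]
```

Every word of the query appears only in the spam documents, so the model classifies it as spam.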
Section 5 – Clustering and Dimensionality Reduction with Unsupervised Learning
Chapter 15: Principal Component Analysis
Technical requirements
Key concepts of PCA
Feature extraction with PCA
Using kernels with PCA
Summary
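Chapter 15's feature extraction with PCA projects standardized features onto the directions of greatest variance. A minimal sketch on the iris data (a stand-in dataset):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Standardize first so every feature contributes on the same scale
X_std = StandardScaler().fit_transform(X)

# Keep the two components that capture the most variance
pca = PCA(n_components=2).fit(X_std)
X_pca = pca.transform(X_std)
explained = pca.explained_variance_ratio_.sum()
```

On this data the first two components retain roughly 96% of the variance, so four features compress to two with little information loss.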
Chapter 16: K-Means and DBSCAN Clustering
Technical requirements
The key concepts of k-means and DBSCAN clustering
Implementing k-means clustering
Implementing DBSCAN clustering
Summary
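Chapter 16's two clustering algorithms differ in what you specify up front: k-means needs the number of clusters, DBSCAN instead needs a neighborhood radius and minimum density. A sketch on two synthetic, well-separated blobs (parameters chosen for this toy data):

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

rng = np.random.default_rng(0)
# Two hypothetical well-separated 2D blobs of 50 points each
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
               rng.normal(5, 0.5, size=(50, 2))])

# k-means: the number of clusters is specified in advance
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN: clusters emerge from density; -1 marks noise points
db_labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
n_db_clusters = len(set(db_labels)) - (1 if -1 in db_labels else 0)
```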
Index
About Packt
Other Books You May Enjoy
Similar Volumes
Data preparation involves transforming raw data into a form that can be modeled using machine learning algorithms. Cut through the equations, Greek letters, and confusion, and discover the specialized data preparation techniques that you need to know to get the most out of your data on your next
Key Features
- Harness the power of R for statistical computing and data science
- Explore, forecast, and classify data with R
- Use R to apply common machine learning algorithms to real-world scenarios
Book Description
Machine learning, at its core, is
Getting up and running with Spark -- Designing a machine learning system -- Obtaining, processing, and preparing data with Spark -- Building a recommendation engine with Spark -- Building a classification model with Spark -- Building a regression model with Spark -- Building a clustering model with