Hands-On Data Preprocessing in Python: Learn how to effectively prepare data for successful data analytics

✍ Scribed by Roy Jafari

Publisher: Packt Publishing
Year: 2022
Tongue: English
Leaves: 602
Category: Library

✦ Synopsis

This book links data cleaning and data preprocessing to help you design effective data analytics solutions.

Key Features

Develop the skills to perform data cleaning, data integration, data reduction, and data transformation
Get ready to make the most of your data with powerful data transformation and massaging techniques
Perform thorough data cleaning, such as dealing with missing values and outliers

Book Description

Data preprocessing is the first step in data visualization, data analytics, and machine learning, where data is prepared for analytics functions to get the best possible insights. Around 90% of the time spent on data analytics, data visualization, and machine learning projects is dedicated to performing data preprocessing.

This book will equip you with the optimum data preprocessing techniques from multiple perspectives. You'll learn about different technical and analytical aspects of data preprocessing – data collection, data cleaning, data integration, data reduction, and data transformation – and get to grips with implementing them using the open source Python programming environment. This book will provide a comprehensive articulation of data preprocessing, its whys and hows, and help you identify opportunities where data analytics could lead to more effective decision making. It also demonstrates the role of data management systems and technologies for effective analytics and how to use APIs to pull data.

By the end of this Python data preprocessing book, you'll be able to use Python to read, manipulate, and analyze data; perform data cleaning, integration, reduction, and transformation techniques; and handle outliers or missing values to effectively prepare data for analytic tools.

What you will learn

Use Python to perform analytics functions on your data
Understand the role of databases and how to effectively pull data from databases
Perform data preprocessing steps defined by your analytics goals
Recognize and resolve data integration challenges
Identify the need for data reduction and execute it
Detect opportunities to improve analytics with data transformation

Who this book is for

Junior and senior data analysts, business intelligence professionals, engineering undergraduates, and data enthusiasts looking to perform preprocessing and data cleaning on large amounts of data will find this book useful. Basic programming skills, such as working with variables, conditionals, and loops, along with beginner-level knowledge of Python and simple analytics experience, are assumed.

Review of the Core Modules of NumPy and Pandas
Review of Another Core Module - Matplotlib
Data – What Is It Really?
Databases
Data Visualization
Prediction
Classification
Clustering Analysis
Data Cleaning Level I - Cleaning Up the Table
Data Cleaning Level II - Unpacking, Restructuring, and Reformulating the Table
Data Cleaning Level III - Missing Values, Outliers, and Errors
Data Fusion and Data Integration
Data Reduction
Data Transformation and Massaging
Case Study 1 - Mental Health in Tech
Case Study 2 - Predicting COVID-19 Hospitalizations
Case Study 3 - United States Counties Clustering Analysis
Summary, Practice Case Studies, and Conclusions

✦ Table of Contents

Cover
Copyright
Contributors
Table of Contents
Preface
Part 1: Technical Needs
Chapter 1: Review of the Core Modules of NumPy and Pandas
Technical requirements
Overview of the Jupyter Notebook
Are we analyzing data via computer programming?
Overview of the basic functions of NumPy
The np.arange() function
The np.zeros() and np.ones() functions
The np.linspace() function
Overview of Pandas
Pandas data access
Boolean masking for filtering a DataFrame
Pandas functions for exploring a DataFrame
Pandas applying a function
The Pandas groupby function
Pandas multi-level indexing
Pandas pivot and melt functions
Summary
Exercises
Chapter 2: Review of Another Core Module – Matplotlib
Technical requirements
Drawing the main plots in Matplotlib
Summarizing numerical attributes using histograms or boxplots
Observing trends in the data using a line plot
Relating two numerical attributes using a scatterplot
Modifying the visuals
Adding a title to visuals and labels to the axis
Adding legends
Modifying ticks
Modifying markers
Subplots
Resizing visuals and saving them
Resizing
Saving
Example of Matplotlib assisting data preprocessing
Summary
Exercises
Chapter 3: Data – What Is It Really?
Technical requirements
What is data?
Why this definition?
DIKW pyramid
Data preprocessing for data analytics versus data preprocessing for machine learning
The most universal data structure – a table
Data objects
Data attributes
Types of data values
Analytics standpoint
Programming standpoint
Information versus pattern
Understanding everyday use of the word "information"
Statistical use of the word "information"
Statistical meaning of the word "pattern"
Summary
Exercises
References
Chapter 4: Databases
Technical requirements
What is a database?
Understanding the difference between a database and a dataset
Types of databases
The differentiating elements of databases
Relational databases (SQL databases)
Unstructured databases (NoSQL databases)
A practical example that requires a combination of both structured and unstructured databases
Distributed databases
Blockchain
Connecting to, and pulling data from, databases
Direct connection
Web page connection
API connection
Request connection
Publicly shared
Summary
Exercises
Part 2: Analytic Goals
Chapter 5: Data Visualization
Technical requirements
Summarizing a population
Example of summarizing numerical attributes
Example of summarizing categorical attributes
Comparing populations
Example of comparing populations using boxplots
Example of comparing populations using histograms
Example of comparing populations using bar charts
Investigating the relationship between two attributes
Visualizing the relationship between two numerical attributes
Visualizing the relationship between two categorical attributes
Visualizing the relationship between a numerical attribute and a categorical attribute
Adding visual dimensions
Example of a five-dimensional scatter plot
Showing and comparing trends
Example of visualizing and comparing trends
Summary
Exercise
Chapter 6: Prediction
Technical requirements
Predictive models
Forecasting
Regression analysis
Linear regression
Example of applying linear regression to perform regression analysis
MLP
How does MLP work?
Example of applying MLP to perform regression analysis
Summary
Exercises
Chapter 7: Classification
Technical requirements
Classification models
Example of designing a classification model
Classification algorithms
KNN
Example of using KNN for classification
Decision Trees
Example of using Decision Trees for classification
Summary
Exercises
Chapter 8: Clustering Analysis
Technical requirements
Clustering model
Clustering example using a two-dimensional dataset
Clustering example using a three-dimensional dataset
K-Means algorithm
Using K-Means to cluster a two-dimensional dataset
Using K-Means to cluster a dataset with more than two dimensions
Centroid analysis
Summary
Exercises
Part 3: The Preprocessing
Chapter 9: Data Cleaning Level I – Cleaning Up the Table
Technical requirements
The levels, tools, and purposes of data cleaning – a roadmap to chapters 9, 10, and 11
Purpose of data analytics
Tools for data analytics
Levels of data cleaning
Mapping the purposes and tools of analytics to the levels of data cleaning
Data cleaning level I – cleaning up the table
Example 1 – unwise data collection
Example 2 – reindexing (multi-level indexing)
Example 3 – intuitive but long column titles
Summary
Exercises
Chapter 10: Data Cleaning Level II – Unpacking, Restructuring, and Reformulating the Table
Technical requirements
Example 1 – unpacking columns and reformulating the table
Unpacking FileName
Unpacking Content
Reformulating a new table for visualization
The last step – drawing the visualization
Example 2 – restructuring the table
Example 3 – level I and II data cleaning
Level I cleaning
Level II cleaning
Doing the analytics – using linear regression to create a predictive model
Summary
Exercises
Chapter 11: Data Cleaning Level III – Missing Values, Outliers, and Errors
Technical requirements
Missing values
Detecting missing values
Example of detecting missing values
Causes of missing values
Types of missing values
Diagnosis of missing values
Dealing with missing values
Outliers
Detecting outliers
Dealing with outliers
Errors
Types of errors
Dealing with errors
Detecting systematic errors
Summary
Exercises
Chapter 12: Data Fusion and Data Integration
Technical requirements
What are data fusion and data integration?
Data fusion versus data integration
Directions of data integration
Frequent challenges regarding data fusion and integration
Challenge 1 – entity identification
Challenge 2 – unwise data collection
Challenge 3 – index mismatched formatting
Challenge 4 – aggregation mismatch
Challenge 5 – duplicate data objects
Challenge 6 – data redundancy
Example 1 (challenges 3 and 4)
Example 2 (challenges 2 and 3)
Example 3 (challenges 1, 3, 5, and 6)
Checking for duplicate data objects
Designing the structure for the result of data integration
Filling songIntegrate_df from billboard_df
Filling songIntegrate_df from songAttribute_df
Filling songIntegrate_df from artist_df
Checking for data redundancy
The analysis
Example summary
Summary
Exercise
Chapter 13: Data Reduction
Technical requirements
The distinction between data reduction and data redundancy
The objectives of data reduction
Types of data reduction
Performing numerosity data reduction
Random sampling
Stratified sampling
Random over/undersampling
Performing dimensionality data reduction
Linear regression as a dimension reduction method
Using a decision tree as a dimension reduction method
Using random forest as a dimension reduction method
Brute-force computational dimension reduction
PCA
Functional data analysis
Summary
Exercises
Chapter 14: Data Transformation and Massaging
Technical requirements
The whys of data transformation and massaging
Data transformation versus data massaging
Normalization and standardization
Binary coding, ranking transformation, and discretization
Example one – binary coding of nominal attribute
Example two – binary coding or ranking transformation of ordinal attributes
Example three – discretization of numerical attributes
Understanding the types of discretization
Discretization – the number of cut-off points
A summary – from numbers to categories and back
Attribute construction
Example – construct one transformed attribute from two attributes
Feature extraction
Example – extract three attributes from one attribute
Example – Morphological feature extraction
Feature extraction examples from the previous chapters
Log transformation
Implementation – doing it yourself
Implementation – the working module doing it for you
Smoothing, aggregation, and binning
Smoothing
Aggregation
Binning
Summary
Exercise
Part 4: Case Studies
Chapter 15: Case Study 1 – Mental Health in Tech
Technical requirements
Introducing the case study
The audience of the results of analytics
Introduction to the source of the data
Integrating the data sources
Cleaning the data
Detecting and dealing with outliers and errors
Detecting and dealing with missing values
Analyzing the data
Analysis question one – is there a significant difference between the mental health of employees across the attribute of gender?
Analysis question two – is there a significant difference between the mental health of employees across the Age attribute?
Analysis question three – do more supportive companies have mentally healthier employees?
Analysis question four – does the attitude of individuals toward mental health influence their mental health and their seeking of treatments?
Summary
Chapter 16: Case Study 2 – Predicting COVID-19 Hospitalizations
Technical requirements
Introducing the case study
Introducing the source of the data
Preprocessing the data
Designing the dataset to support the prediction
Filling up the placeholder dataset
Supervised dimension reduction
Analyzing the data
Summary
Chapter 17: Case Study 3 – United States Counties Clustering Analysis
Technical requirements
Introducing the case study
Introduction to the source of the data
Preprocessing the data
Transforming election_df to partisan_df
Cleaning edu_df, employ_df, pop_df, and pov_df
Data integration
Data cleaning level III – missing values, errors, and outliers
Checking for data redundancy
Analyzing the data
Using PCA to visualize the dataset
K-Means clustering analysis
Summary
Chapter 18: Summary, Practice Case Studies, and Conclusions
A summary of the book
Part 1 – Technical requirements
Part 2 – Analytics goals
Part 3 – The preprocessing
Part 4 – Case studies
Practice case studies
Google COVID-19 mobility dataset
Police killings in the US
US accidents
San Francisco crime
Data analytics job market
FIFA 2018 player of the match
Hot hands in basketball
Wildfires in California
Silicon Valley diversity profile
Recognizing fake job posting
Hunting more practice case studies
Conclusions
Index
Other Books You May Enjoy

📜 SIMILAR VOLUMES

📁 Hands-On Data Preprocessing in Python: Learn how to effectively prepare data for successful data analytics

✍ Roy Jafari 📂 Library 📅 2022 🏛 Packt Publishing 🌐 English

This book will make the link between data cleaning and preprocessing to help you design effective data analytic solutions. Key Features: Develop the skills to perform data cleaning, data integration, data reduction, and data transformation; Get ready to make t

📁 Hands-On Data Preprocessing in Python: Learn how to effectively prepare data for successful data analytics

✍ Roy Jafari 📂 Library 🏛 Packt Publishing 🌐 English

Get your raw data cleaned up and ready for processing to design better data analytic solutions. Key Features: Develop the skills to perform data cleaning, data integration, data reduction, and data transformation

📁 PYTHON DATA ANALYTICS: Mastering Python for Effective Data Analysis and Visualization

✍ Floyd Bax 📂 Library 📅 2024 🌐 English

"Python Data Analytics" is your gateway to becoming a proficient data analyst using the versatile Python programming language. Whether you're delving into the world of data for the first time or enhancing your analytical skills, this book provides a hands-on approach to harnessing Python's capabilit

📁 IoT Data Analytics using Python: Learn how to use Python to collect, analyze, and visualize IoT data

✍ M. S. Hariharan 📂 Library 📅 2023 🏛 BPB Online 🌐 English

Python is a popular programming language for data analytics, and it is also well-suited for IoT Data Analytics. By leveraging Python's versatility and its rich ecosystem of libraries and tools, Data Analytics for IoT can unlock valuable insights, enable predictive capabilities, and optimize decision

📁 Python Data Science: The Ultimate Handbook for Beginners on How to Explore NumPy for Numerical Data, Pandas for Data Analysis, IPython, Scikit-Learn and Tensorflow for Machine Learning and Business

✍ Steve Blair 📂 Library 📅 2019 🏛 Steve Blair 🌐 English

Recently, more and more companies are learning that they need to make DATA-DRIVEN decisions. And with big data and data science on the rise, we now have more data than we know what to do with. In fact, without a doubt, you have already experienced data science in one way or another. Obviously, yo

📁 DATA ANALYTICS: Simple and Effective Tips and Tricks to Learn Data Analytics Effectively

✍ Smith, Benjamin 📂 Library 📅 2020 🌐 English