Practical Data Science with Python: Learn tools and techniques from hands-on examples to extract insights from data

✍ Scribed by Nathan George

Publisher: Packt Publishing
Year: 2021
Language: English
Pages: 621
Category: Library

✦ Synopsis

Learn to effectively manage data and execute data science projects from start to finish using Python

Key Features

Understand and utilize data science tools in Python, such as specialized machine learning algorithms and statistical modeling
Build a strong data science foundation with the best data science tools available in Python
Add value to yourself, your organization, and society by extracting actionable insights from raw data

Book Description

Practical Data Science with Python teaches you core data science concepts through real-world and realistic examples. It strengthens your grip on both basic and advanced principles of data preparation and storage, statistics, probability theory, machine learning, and Python programming, helping you build a solid foundation in data science.

The book starts with an overview of basic Python skills and then introduces foundational data science techniques, followed by a thorough explanation of the Python code needed to execute the techniques. You'll understand the code by working through the examples. The code has been broken down into small chunks (a few lines or a function at a time) to enable thorough discussion.

As you progress, you will learn how to perform data analysis while exploring the functionalities of key data science Python packages, including pandas, SciPy, and scikit-learn. Finally, the book covers ethics and privacy concerns in data science and suggests resources for improving data science skills, as well as ways to stay up to date on new data science developments.

By the end of the book, you should be able to comfortably use Python for basic data science projects and should have the skills to execute the data science process on any data source.

What you will learn

Use Python data science packages effectively
Clean and prepare data for data science work, including feature engineering and feature selection
Perform data modeling, including classic statistical models (such as t-tests) and essential machine learning algorithms (such as random forests and boosted models)
Evaluate model performance
Compare and understand different machine learning methods
Interact with Excel spreadsheets through Python
Create automated data science reports through Python
Get to grips with text analytics techniques

Who this book is for

The book is intended for beginners, including students starting or about to start a data science, analytics, or related program (e.g. Bachelor's, Master's, bootcamp, online courses), recent college graduates who want to learn new skills to set them apart in the job market, professionals who want to learn hands-on data science techniques in Python, and those who want to shift their career to data science.

The book requires basic familiarity with Python. A "getting started with Python" section has been included to get complete novices up to speed.

Introduction to Data Science
Getting Started with Python
SQL and Built-in File Handling Modules in Python
Loading and Wrangling Data with Pandas and NumPy
Exploratory Data Analysis and Visualization
Data Wrangling Documents and Spreadsheets
Web Scraping
Probability, Distributions, and Sampling
Statistical Testing for Data Science
Preparing Data for Machine Learning: Feature Selection, Feature Engineering, and Dimensionality Reduction
Machine Learning for Classification
Evaluating Machine Learning Classification Models and Sampling for Classification
Machine Learning with Regression

✦ Table of Contents

Cover
Copyright
Contributors
Table of Contents
Preface
An Introduction and the Basics
Chapter 1: Introduction to Data Science
The data science origin story
The top data science tools and skills
Python
Other programming languages
GUIs and platforms
Cloud tools
Statistical methods and math
Collecting, organizing, and preparing data
Software development
Business understanding and communication
Specializations in and around data science
Machine learning
Business intelligence
Deep learning
Data engineering
Big data
Statistical methods
Natural Language Processing (NLP)
Artificial Intelligence (AI)
Choosing how to specialize
Data science project methodologies
Using data science in other fields
CRISP-DM
TDSP
Further reading on data science project management strategies
Other tools
Test your knowledge
Summary
Chapter 2: Getting Started with Python
Installing Python with Anaconda and getting started
Installing Anaconda
Running Python code
The Python shell
The IPython Shell
Jupyter
Why the command line?
Command line basics
Installing and using a code text editor – VS Code
Editing Python code with VS Code
Running a Python file
Installing Python packages and creating virtual environments
Python basics
Numbers
Strings
Variables
Lists, tuples, sets, and dictionaries
Lists
Tuples
Sets
Dictionaries
Loops and comprehensions
Booleans and conditionals
Packages and modules
Functions
Classes
Multithreading and multiprocessing
Software engineering best practices
Debugging errors and utilizing documentation
Debugging
Documentation
Version control with Git
Code style
Productivity tips
Test your knowledge
Summary
Dealing with Data
Chapter 3: SQL and Built-in File Handling Modules in Python
Introduction
Loading, reading, and writing files with base Python
Opening a file and reading its contents
Using the built-in JSON module
Saving credentials or data in a Python file
Saving Python objects with pickle
Using SQLite and SQL
Creating a SQLite database and storing data
Using the SQLAlchemy package in Python
Test your knowledge
Summary
Chapter 4: Loading and Wrangling Data with Pandas and NumPy
Data wrangling and analyzing iTunes data
Loading and saving data with Pandas
Understanding the DataFrame structure and combining/concatenating multiple DataFrames
Exploratory Data Analysis (EDA) and basic data cleaning with Pandas
Examining the top and bottom of the data
Examining the data's dimensions, datatypes, and missing values
Investigating statistical properties of the data
Plotting with DataFrames
Cleaning data
Filtering DataFrames
Removing irrelevant data
Dealing with missing values
Dealing with outliers
Dealing with duplicate values
Ensuring datatypes are correct
Standardizing data formats
Data transformations
Using replace, map, and apply to clean and transform data
Using GroupBy
Writing DataFrames to disk
Wrangling and analyzing Bitcoin price data
Understanding NumPy basics
Using NumPy mathematical functions
Test your knowledge
Summary
Chapter 5: Exploratory Data Analysis and Visualization
EDA and visualization libraries in Python
Performing EDA with Seaborn and pandas
Making boxplots and letter-value plots
Making histograms and violin plots
Making scatter plots with Matplotlib and Seaborn
Examining correlations and making correlograms
Making missing value plots
Using EDA Python packages
Using visualization best practices
Saving plots for sharing and reports
Making plots with Plotly
Test your knowledge
Summary
Chapter 6: Data Wrangling Documents and Spreadsheets
Parsing and processing Word and PDF documents
Reading text from Word documents
Extracting insights from Word documents: common words and phrases
Analyzing words and phrases from the text
Reading text from PDFs
Reading and writing data with Excel files
Using pandas for wrangling Excel files
Analyzing the data
Using openpyxl for wrangling Excel files
Test your knowledge
Summary
Chapter 7: Web Scraping
Understanding the structure of the internet
GET and POST requests, and HTML
Performing simple web scraping
Using urllib
Using the requests package
Scraping several files
Extracting the data from the scraped files
Parsing HTML from scraped pages
Using XPath, lxml, and bs4 to extract data from webpages
Collecting data from several pages
Using APIs to collect data
Using API wrappers
The ethics and legality of web scraping
Test your knowledge
Summary
Statistics for Data Science
Chapter 8: Probability, Distributions, and Sampling
Probability basics
Independent and conditional probabilities
Bayes' Theorem
Frequentist versus Bayesian
Distributions
The normal distribution and using scipy to generate distributions
Descriptive statistics of distributions
Fitting distributions to data to get parameters
The Student's t-distribution
The Bernoulli distribution
The binomial distribution
The uniform distribution
The exponential and Poisson distributions
The Weibull distribution
The Zipfian distribution
Sampling from data
The law of large numbers
The central limit theorem
Random sampling
Bootstrap sampling and confidence intervals
Test your knowledge
Summary
Chapter 9: Statistical Testing for Data Science
Statistical testing basics and sample comparison tests
The t-test and z-test
One-sample, two-sided t-test
The z-test
One-sided tests
Two-sample t- and z-tests: A/B testing
Paired t- and z-tests
Other A/B testing methods
Testing between several groups with ANOVA
Post-hoc ANOVA tests
Assumptions for these methods
Other statistical tests
Testing if data belongs to a distribution
Generalized ESD outlier test
The Pearson correlation test
Test your knowledge
Summary
Machine Learning
Chapter 10: Preparing Data for Machine Learning: Feature Selection, Feature Engineering, and Dimensionality Reduction
Types of machine learning
Feature selection
The curse of dimensionality
Overfitting and underfitting, and the bias-variance trade-off
Methods for feature selection
Variance thresholding – removing features with too much and too little variance
Univariate statistics feature selection
Correlation
Mutual information score and chi-squared
The chi-squared test
ANOVA
Using the univariate statistics for feature selection
Feature engineering
Data cleaning and preparation
Converting strings to dates
Outlier cleaning strategies
Combining multiple columns
Transforming numeric data
Standardization
Making data more Gaussian with the Yeo-Johnson transform
Extracting datetime features
Binning
One-hot encoding and label encoding
Simplifying categorical columns
One-hot encoding
Dimensionality reduction
Principal Component Analysis (PCA)
Test your knowledge
Summary
Chapter 11: Machine Learning for Classification
Machine learning classification algorithms
Logistic regression for binary classification
Getting predictions from our model
How logistic regression works
Odds ratio and the logit
Examining feature importances with sklearn
Using statsmodels for logistic regression
Maximum likelihood estimation, optimizers, and the logistic regression algorithm
Regularization
Hyperparameters and cross-validation
Logistic regression (and other models) with big data
Naïve Bayes for binary classification
k-nearest neighbors (KNN)
Multiclass classification
Logistic regression
One-versus-rest and one-versus-one formulations
Multi-label classification
Choosing a model to use
The "no free lunch" theorem
Computational complexity of models
Test your knowledge
Summary
Chapter 12: Evaluating Machine Learning Classification Models and Sampling for Classification
Evaluating classification algorithm performance with metrics
Train-validation-test splits
Accuracy
Cohen's Kappa
Confusion matrix
Precision, recall, and F1 score
AUC score and the ROC curve
Choosing the optimal cutoff threshold
Sampling and balancing classification data
Downsampling
Oversampling
SMOTE and other synthetic sampling methods
Test your knowledge
Summary
Chapter 13: Machine Learning with Regression
Linear regression
Linear regression with sklearn
Linear regression with statsmodels
Regularized linear regression
Regression with KNN in sklearn
Evaluating regression models
R² or the coefficient of determination
Adjusted R²
Information criteria
Mean squared error
Mean absolute error
Linear regression assumptions
Regression models on big data
Forecasting
Test your knowledge
Summary
Chapter 14: Optimizing Models and Using AutoML
Hyperparameter optimization with search methods
Using grid search
Using random search
Using Bayesian search
Other advanced search methods
Using learning curves
Optimizing the number of features with ML models
Using AutoML with PyCaret
The no free lunch theorem
AutoML solutions
Using PyCaret
Test your knowledge
Summary
Chapter 15: Tree-Based Machine Learning Models
Decision trees
Random forests
Random forests with sklearn
Random forests with H2O
Feature importance from tree-based methods
Using H2O for feature importance
Using sklearn random forest feature importances
Boosted trees: AdaBoost, XGBoost, LightGBM, and CatBoost
AdaBoost
XGBoost
XGBoost with PyCaret
XGBoost with the xgboost package
Training boosted models on a GPU
LightGBM
LightGBM plotting
Using LightGBM directly
CatBoost
Using CatBoost natively
Using early stopping with boosting algorithms
Test your knowledge
Summary
Chapter 16: Support Vector Machine (SVM) Machine Learning Models
How SVMs work
SVMs for classification
SVMs for regression
Using SVMs
Using SVMs in sklearn
Tuning SVMs with pycaret
Test your knowledge
Summary
Text Analysis and Reporting
Chapter 17: Clustering with Machine Learning
Using k-means clustering
Clustering metrics
Optimizing k in k-means
Examining the clusters
Hierarchical clustering
DBSCAN
Other unsupervised methods
Test your knowledge
Summary
Chapter 18: Working with Text
Text preprocessing
Basic text cleaning
Stemming and Lemmatizing
Preparing text with spaCy
Word vectors
TF-IDF vectors
Basic text analysis
Word frequency plots
Wordclouds
Zipf's law
Word collocations
Parts of speech
Unsupervised learning
Topic modeling
Topic modeling with pycaret
Topic modeling with Top2Vec
Supervised learning
Classification
Sentiment analysis
Test your knowledge
Summary
Wrapping Up
Chapter 19: Data Storytelling and Automated Reporting/Dashboarding
Data storytelling
Data storytelling example
Automated reporting and dashboarding
Automated reporting options
Automated dashboarding
Scheduling tasks to run automatically
Test your knowledge
Summary
Chapter 20: Ethics and Privacy
The ethics of machine learning algorithms
Bias
How to decrease ML biases
Carefully evaluating performance and consequences
Data privacy
Data privacy regulations and laws
k-anonymity, l-diversity, and t-closeness
Differential privacy
Using data science for the common good
Other ethical considerations
Test your knowledge
Summary
Chapter 21: Staying Up to Date and the Future of Data Science
Blogs, newsletters, books, and academic sources
Blogs
Newsletters
Books
Academic sources
Data science competition websites
Online learning platforms
Cloud services
Other places to keep an eye on
Strategies for staying up to date
Other data science topics we didn't cover
The future of data science
Summary
PacktPage
Other Books You May Enjoy
Index

📜 SIMILAR VOLUMES

📁 Practical Data Science with Python: Learn tools and techniques from hands-on examples to extract insights from data

✍ Nathan George 📂 Library 📅 2021 🏛 Packt Publishing 🌐 English

Learn to effectively manage data and execute data science projects from start to finish using Python. Key Features: Understand and utilize data science tools in Python, such as specialized machine learning algorithms and statistical modeling. Build a strong data sci

📁 Practical Data Science with Python 3: Synthesizing Actionable Insights from Data

✍ Ervin Varga 📂 Library 📅 2019 🏛 Apress 🌐 English

Gain insight into essential data science skills in a holistic manner using data engineering and associated scalable computational methods. This book covers the most popular Python 3 frameworks for both local and distributed (on-premises and cloud-based) processing. Along the way, you will be introduc

📁 Beginning Data Science with Python and Jupyter: Use powerful tools to unlock actionable insights from data

✍ Alex Galea 📂 Library 📅 2018 🏛 Packt Publishing 🌐 English

Getting started with data science doesn't have to be an uphill battle. This step-by-step guide is ideal for beginners who know a little Python and are looking for a quick, fast-paced introduction. Key Features: Get up and running with the Ju

📁 Hands-On Web Scraping with Python: Extract quality data from the web using effective Python techniques

✍ Anish Chapagain 📂 Library 📅 2023 🏛 Packt Publishing Pvt Ltd 🌐 English

Work through practical examples to unlock the full potential of web scraping with Python and gain valuable insights from high-quality data. Key Features: Build an initial portfolio of web scraping projects with detailed explanations. Grasp Python programming fundamentals related to web scraping an

📁 Hands-On Web Scraping with Python: Extract quality data from the web using effective Python techniques

✍ Anish Chapagain 📂 Library 📅 2023 🏛 Packt Publishing 🌐 English

Work through practical examples to unlock the full potential of web scraping with Python and gain valuable insights from high-quality data. Key Features: Build an initial portfolio of web scraping projects with detailed explanations

📁 Python Data Cleaning Cookbook: Modern techniques and Python tools to detect and remove dirty data to extract key insights

✍ Michael Walker 📂 Library 📅 2021 🏛 Packt Publishing 🌐 English

Discover how to describe your data in detail, identify data issues, and find out how to solve them using commonly used techniques and tips and tricks. Key Features: Get well-versed with various data cleaning techniques to reveal key insights. Manipulate data of diff