Practitioner's Guide to Data Science

✍ Scribed by Hui Lin, Ming Li

Publisher: CRC Press
Year: 2023
Language: English
Pages: 403
Series: Data Science
Category: Library

✦ Synopsis

This book aims to increase the visibility of Data Science as practiced in the real world, which differs from what you learn from a typical textbook. Many aspects of day-to-day Data Science work are almost absent from conventional statistics, Machine Learning, and Data Science curricula. Yet these activities account for a considerable share of the time and effort of data professionals in industry. Based on industry experience, this book outlines real-world scenarios and discusses pitfalls that Data Science practitioners should avoid. It also covers the Big Data cloud platform and the art of Data Science, such as soft skills. The authors use R as the primary tool and provide code for both R and Python.

This book is for readers who want to explore possible career paths and eventually become data scientists. It comprehensively introduces various Data Science fields, the soft and programming skills used in Data Science projects, and potential career paths. Traditional data-related practitioners such as statisticians, business analysts, and data analysts will find this book helpful in expanding their skills for future Data Science careers. Undergraduate and graduate students from analytics-related areas will find this book beneficial for learning real-world Data Science applications. Non-mathematical readers will appreciate the reproducibility of the companion R and Python code.

Key Features:

• It covers both technical and soft skills.
• It has a chapter dedicated to the Big Data cloud environment. In industry applications, data science is often practiced in such an environment.
• It is hands-on. We provide the data and repeatable R and Python code in notebooks. Readers can repeat the analysis in the book using the data and code provided. We also suggest that readers modify the notebooks to perform analyses with their own data and problems, if possible. The best way to learn Data Science is to do it!

✦ Table of Contents

Cover
Half Title
Series Page
Title Page
Copyright Page
Contents
List of Figures
Preface
About the Authors
Acknowledgment
1. Introduction
1.1. A Brief History of Data Science
1.2. Data Science Role and Skill Tracks
1.2.1. Engineering
1.2.2. Analysis
1.2.3. Modeling/Inference
1.3. What Kind of Questions Can Data Science Solve?
1.3.1. Prerequisites
1.3.2. Problem Type
1.4. Structure of Data Science Team
1.5. Data Science Roles
2. Soft Skills for Data Scientists
2.1. Comparison between Statistician and Data Scientist
2.2. Beyond Data and Analytics
2.3. Three Pillars of Knowledge
2.4. Data Science Project Cycle
2.4.1. Types of Data Science Projects
2.4.2. Problem Formulation and Project Planning Stage
2.4.3. Project Modeling Stage
2.4.4. Model Implementation and Post Production Stage
2.4.5. Project Cycle Summary
2.5. Common Mistakes in Data Science
2.5.1. Problem Formulation Stage
2.5.2. Project Planning Stage
2.5.3. Project Modeling Stage
2.5.4. Model Implementation and Post Production Stage
2.5.5. Summary of Common Mistakes
3. Introduction to the Data
3.1. Customer Data for a Clothing Company
3.2. Swine Disease Breakout Data
3.3. MNIST Dataset
3.4. IMDB Dataset
4. Big Data Cloud Platform
4.1. Power of Cluster of Computers
4.2. Evolution of Cluster Computing
4.2.1. Hadoop
4.2.2. Spark
4.3. Introduction of Cloud Environment
4.3.1. Open Account and Create a Cluster
4.3.2. R Notebook
4.3.3. Markdown cells
4.4. Leverage Spark Using R Notebook
4.5. Databases and SQL
4.5.1. History
4.5.2. Database, Table, and View
4.5.3. Basic SQL Statement
4.5.4. Advanced Topics in Database
5. Data Pre-processing
5.1. Data Cleaning
5.2. Missing Values
5.2.1. Impute Missing Values with Median/Mode
5.2.2. K-nearest Neighbors
5.2.3. Bagging Tree
5.3. Centering and Scaling
5.4. Resolve Skewness
5.5. Resolve Outliers
5.6. Collinearity
5.7. Sparse Variables
5.8. Re-encode Dummy Variables
6. Data Wrangling
6.1. Summarize Data
6.1.1. dplyr Package
6.1.2. apply(), lapply() and sapply() in base R
6.2. Tidy and Reshape Data
7. Model Tuning Strategy
7.1. Variance-Bias Trade-Off
7.2. Data Splitting and Resampling
7.2.1. Data Splitting
7.2.2. Resampling
8. Measuring Performance
8.1. Regression Model Performance
8.2. Classification Model Performance
8.2.1. Confusion Matrix
8.2.2. Kappa Statistic
8.2.3. ROC
8.2.4. Gain and Lift Charts
9. Regression Models
9.1. Ordinary Least Square
9.1.1. The Magic P-value
9.1.2. Diagnostics for Linear Regression
9.2. Principal Component Regression and Partial Least Square
10. Regularization Methods
10.1. Ridge Regression
10.2. LASSO
10.3. Elastic Net
10.4. Penalized Generalized Linear Model
10.4.1. Introduction to glmnet Package
10.4.2. Penalized Logistic Regression
11. Tree-Based Methods
11.1. Tree Basics
11.2. Splitting Criteria
11.2.1. Gini Impurity
11.2.2. Information Gain (IG)
11.2.3. Information Gain Ratio (IGR)
11.2.4. Sum of Squared Error (SSE)
11.3. Tree Pruning
11.4. Regression and Decision Tree Basic
11.4.1. Regression Tree
11.4.2. Decision Tree
11.5. Bagging Tree
11.6. Random Forest
11.7. Gradient Boosted Machine
11.7.1. Adaptive Boosting
11.7.2. Stochastic Gradient Boosting
12. Deep Learning
12.1. Feedforward Neural Network
12.1.1. Logistic Regression as Neural Network
12.1.2. Stochastic Gradient Descent
12.1.3. Deep Neural Network
12.1.4. Activation Function
12.1.5. Optimization
12.1.6. Deal with Overfitting
12.1.7. Image Recognition Using FFNN
12.2. Convolutional Neural Network
12.2.1. Convolution Layer
12.2.2. Padding layer
12.2.3. Pooling Layer
12.2.4. Convolution Over Volume
12.2.5. Image Recognition Using CNN
12.3. Recurrent Neural Network
12.3.1. RNN Model
12.3.2. Long Short Term Memory
12.3.3. Word Embedding
12.3.4. Sentiment Analysis Using RNN
A: Handling Large Local Data
A.1. readr
A.2. data.table— Enhanced data.frame
B: R Code for Data Simulation
B.1. Customer Data for Clothing Company
B.2. Swine Disease Breakout Data
Bibliography
Index

📜 SIMILAR VOLUMES

📁 Thinking Data Science: A Data Science Practitioner’s Guide

✍ Poornachandra Sarang 📂 Library 📅 2023 🏛 Springer 🌐 English

This definitive guide to Machine Learning projects answers the problems an aspiring or experienced data scientist frequently has: Confused on what technology to use for your ML development? Should I use GOFAI, ANN/DNN or Transfer Learning? Can I rely on AutoML for model development? What if

📁 Thinking Data Science: A Data Science Practitioner’s Guide

✍ Poornachandra Sarang 📂 Library 📅 2023 🏛 Springer Nature 🌐 English

This definitive guide to Machine Learning projects answers the problems an aspiring or experienced data scientist frequently has: Confused on what technology to use for your ML development? Should I use GOFAI, ANN/DNN or Transfer Learning? Can I rely on AutoML for model development? What if the clie

📁 The Practitioner's Guide to Graph Data

✍ Matthias Broecheler ; Denise Gosnell 📂 Library 📅 2020 🏛 O'Reilly Media, Inc. 🌐 English

Graph data closes the gap between the way humans and computers view the world. While computers rely on static rows and columns of data, people navigate and reason about life through relationships. This practical guide demonstrates how graph data brings these two approaches together. By working with

📁 Practitioner's Guide to Operationalizing Data Governance

✍ Mary Anne Hopper 📂 Library 📅 2023 🏛 Wiley 🌐 English

Discover what does--and doesn't--work when designing and building a data governance program. In A Practitioner's Guide to Operationalizing Data Governance, veteran SAS and data management expert Mary Anne Hopper walks readers through the planning, design, operationalization, and mainte

📁 The Practitioner's Guide to Data Quality Improvement

✍ David Loshin (Auth.) 📂 Library 📅 2011 🏛 Morgan Kaufmann 🌐 English

"There is NOTHING like this out there that I am aware of, and certainly nothing from anyone with same stature as David Loshin." --David Plotkin, Wells Fargo Bank "The book provides a comprehensive look at data qua