From Concepts to Code: Introduction to Data Science

✍ Scribed by Adam Tashman

Publisher: CRC Press
Year: 2024
Tongue: English
Leaves: 385
Category: Library

✦ Synopsis

The breadth of problems that can be solved with data science is astonishing, and this book provides the required tools and skills for a broad audience. The reader takes a journey into the forms, uses, and abuses of data and models, and learns how to critically examine each step. Python coding and data analysis skills are built from the ground up, with no prior coding experience assumed. The necessary background in computer science, mathematics, and statistics is provided in an approachable manner.

Each step of the machine learning lifecycle is discussed, from business objective planning to monitoring a model in production. This end-to-end approach supplies the broad view necessary to sidestep many of the pitfalls that can sink a data science project. Detailed examples are provided from a wide range of applications and fields, from fraud detection in banking to breast cancer classification in healthcare. The reader will learn the techniques to accomplish tasks that include predicting outcomes, explaining observations, and detecting patterns.

Improper use of data and models can introduce unwanted effects and dangers to society. A chapter on model risk provides a framework for comprehensively challenging a model and mitigating weaknesses. When data is collected, stored, and used, it may misrepresent reality and introduce bias. Strategies for addressing bias are discussed.

From Concepts to Code: Introduction to Data Science leverages content developed by the author for a full-year data science course suitable for advanced high school or early undergraduate students. The course is freely available and includes weekly lesson plans.

✦ Table of Contents

Cover
Half Title
Title Page
Copyright Page
Dedication
Contents
Acknowledgments
Preface
Symbols
1. Introduction
1.1. What Is Data Science?
1.2. Relationships Are of Primary Importance
1.3. Modeling and Uncertainty
1.4. Pipelines
1.4.1. The Data Pipeline
1.4.2. The Data Science Pipeline
1.5. Representation
1.6. For Everyone
1.7. Target Audience
1.8. How this Book Teaches Coding
1.9. Course and Code Package
1.10. Why Isn't Data Science Typically Done with Excel?
1.11. Goals and Scope
1.12. Exercises
2. Communicating Effectively and Earning Trust
2.1. Master Yourself
2.2. Technical Competence
2.3. Know Your Audience
2.4. Tell Good Stories
2.5. State Your Needs
2.6. Assume Positive Intent
2.7. Help Others
2.8. Take Ownership
2.9. Chapter Summary
2.10. Exercises
3. Data Science Project Planning
3.1. Defining the Project Objectives
3.2. A Questionnaire for Defining the Objectives
3.3. Analytical Framing
3.4. Planning Data Collection and Usage
3.5. Data Quantity and Coverage
3.6. Sourcing Data
3.7. Chapter Summary
3.8. Exercises
4. An Overview of Data
4.1. Data Types
4.2. Statistical Data Types
4.3. Datasets and States of Data
4.4. Data Sources and Data Veracity
4.5. Data Ingestion
4.5.1. Data Velocity and Volume
4.5.2. Batch versus Streaming
4.5.3. Web Scraping and APIs
4.6. Data Integration
4.7. Levels of Data Processing
4.7.1. Trusted Zone
4.7.2. Standardizing Data
4.7.3. Natural Language Processing
4.7.4. Protecting Identity
4.7.5. Refined Zone
4.8. The Structure of Data at Rest
4.8.1. Structured Data
4.8.2. Semi-structured Data
4.8.3. Unstructured Data
4.9. Metadata
4.10. Representativeness and Bias
4.11. Data Is Never Neutral
4.12. Chapter Summary
4.13. Exercises
5. Computing Preliminaries and Setup
5.1. Hardware
5.1.1. Processor
5.1.2. Memory (RAM)
5.1.3. Storage
5.1.4. Motherboard
5.2. Software
5.2.1. Modules
5.3. I/O
5.3.1. Directories and Paths
5.3.2. File Formats
5.4. Shell, Terminal, and Command Line
5.5. Version Control
5.5.1. Git
5.5.2. GitHub
5.5.3. GitHub Setup and Course Repo Download
5.6. Exploring the Code Repo
5.7. Coding Tools
5.7.1. IDEs
5.8. Cloud Computing
5.9. Chapter Summary
5.10. Appendix: Going Further with Git and GitHub
5.10.1. Syncing with the Upstream Repository
5.10.2. Initializing a Git Repo
5.10.3. Tracking Changes
5.11. Exercises
6. Data Processing
6.1. California Wildfires
6.1.1. Running Python with the CLI
6.1.2. Setting the Relative Path
6.1.3. Variables
6.1.4. Strings
6.1.5. Importing Data
6.1.6. Text Processing
6.1.7. Getting Help
6.2. Counting Leopards
6.2.1. Extracting DataFrame Attributes
6.2.2. Subsetting
6.2.3. Creating and Appending New Columns
6.2.4. Sorting
6.2.5. Saving the DataFrame
6.3. Patient Blood Pressure
6.3.1. Data Validation
6.3.2. Imputation
6.3.3. Data Type Conversion
6.3.4. Extreme Observations
6.4. Chapter Summary
6.5. Exercises
7. Data Storage and Retrieval
7.1. Relational Databases
7.1.1. Primary Key
7.1.2. Foreign Key
7.2. SQL
7.3. Music Query: Single Table
7.4. Music Query: Multiple Tables
7.5. Houses, Lakes, and Lake Houses
7.5.1. Data Warehouse
7.5.2. Data Lake
7.5.3. Data Lakehouse
7.6. Chapter Summary
7.7. Exercises
8. Mathematics Preliminaries
8.1. Set Theory
8.2. Functions
8.3. Differential Calculus
8.4. Probability
8.5. Matrix Algebra
8.6. Chapter Summary
8.7. Exercises
9. Statistics Preliminaries
9.1. Descriptive Statistics
9.2. Inferential Statistics
9.2.1. One-Sample Test of the Mean
9.2.2. Confidence Intervals
9.3. Chapter Summary
9.4. Exercises
10. Data Transformation
10.1. Transforms for Treating Noise
10.1.1. Moving Average
10.1.2. Limiting Extreme Values
10.2. Transforms for Treating Scale
10.2.1. Order of Magnitude and Use of Logarithm
10.2.2. Standardization
10.2.3. Normalization
10.3. Transforms for Treating Data Representation
10.3.1. Count Vectorizer
10.3.2. One-Hot Encoding
10.4. Other Common Methods for Creating Predictors
10.4.1. Binarization
10.4.2. Discretization
10.4.3. Additional Common Transformations
10.4.4. Storing Transformed Data
10.5. Chapter Summary
10.6. Exercises
11. Exploratory Data Analysis
11.1. Check Fraud
11.2. World Happiness
11.3. Use and Limitations of Summary Statistics
11.4. Graphical Excellence
11.5. Chapter Summary
11.6. Exercises
12. An Overview of Machine Learning
12.1. A Simple Tool for Decision Making
12.2. Supervised Learning
12.3. Unsupervised Learning
12.4. Semi-supervised Learning
12.5. Reinforcement Learning
12.6. Generalization
12.7. Loss Functions
12.8. Hyperparameter Tuning
12.9. Metrics
12.10. Chapter Summary
12.11. Exercises
13. Modeling with Linear Regression
13.1. Mathematical Framework
13.1.1. Parameter Estimation for Linear Regression
13.2. Being Thoughtful about Predictors
13.3. Predicting Housing Prices
13.3.1. Data Splitting
13.3.2. Data Scaling
13.3.3. Model Fitting
13.3.4. Interpreting the Parameter Estimates
13.3.5. Model Performance Evaluation
13.3.6. Comparing Models
13.3.7. Calculating R-Squared
13.4. Chapter Summary
13.5. Appendix: Parameter Estimation in Matrix Form
13.6. Exercises
14. Classification with Logistic Regression
14.1. Mathematical Framework
14.1.1. Parameter Estimation for Logistic Regression
14.2. Detecting Breast Cancer
14.2.1. Interpreting the Parameter Estimates
14.2.2. Revisiting Gradient Descent
14.2.3. Model Performance Evaluation
14.3. Chapter Summary
14.4. Exercises
15. Clustering with K-Means
15.1. Clustering Concepts
15.2. K-Means
15.2.1. K-Means Hand Calculation
15.2.2. Performance Evaluation
15.3. Clustering Foods by Nutritional Value
15.4. Chapter Summary
15.5. Exercises
16. Elements of Reproducible Data Science
16.1. Sharing Code
16.2. Testing
16.3. Containers
16.3.1. Installing Docker
16.3.2. Common Docker Commands
16.3.3. Dockerizing an ML Application
16.4. Chapter Summary
16.5. Exercises
17. Model Risk
17.1. Model Documentation
17.2. Conceptual Soundness
17.3. Data and Inputs
17.4. Outcomes Analysis
17.5. Model Benchmarking
17.6. Sensitivity Analysis
17.7. Stress Testing
17.8. Ongoing Model Performance Monitoring
17.8.1. Diving Deeper on Monitoring
17.9. Case Study: Fair Lending Risk
17.9.1. Fair Lending Background
17.9.2. Numerical Example
17.10. Chapter Summary
17.11. Exercises
18. Next Steps
18.1. Building Blocks
18.2. Advanced Technique: Regularization
18.3. Advanced Machine Learning Models
18.3.1. Tree-Based Models
18.3.2. Artificial Neural Networks
18.4. Additional Languages
18.5. Resources
18.6. Applications
18.7. Final Thoughts
Bibliography
Index

📜 SIMILAR VOLUMES

From Concepts to Code

📁 From Concepts to Code

✍ Adam P. Tashman 📂 Library 📅 2024 🏛 Chapman and Hall/CRC 🌐 English

Practical Linear Algebra for Data Scienc

📁 Practical Linear Algebra for Data Science: From Core Concepts to Applications Using Python

✍ Mike Cohen 📂 Library 📅 2022 🏛 O'Reilly Media 🌐 English

If you want to work in any computational or technical field, you need to understand linear algebra. As the study of matrices and operations acting upon them, linear algebra is the mathematical basis of nearly all algorithms and analyses implemented in computers. But the way it's presented in decades

Introduction to Data Science: A Python A

📁 Introduction to Data Science: A Python Approach to Concepts, Techniques and Applications

✍ Laura Igual; Santi Seguí 📂 Library 📅 2017 🏛 Springer 🌐 English

This accessible and classroom-tested textbook/reference presents an introduction to the fundamentals of the emerging and interdisciplinary field of data science. The coverage spans key concepts adopted from statistics and machine learning, useful techniques for graph analysis and parallel programmin

Introduction to data science: a Python a

📁 Introduction to data science: a Python approach to concepts, techniques and applications

✍ Igual, Laura;Seguí, Santi;Vitrià, Jordi 📂 Library 📅 2017 🏛 Springer 🌐 English