
Machine Learning in Production: Developing and Optimizing Data Science Workflows and Applications (Addison-Wesley Data & Analytics Series)

✍ Scribed by Andrew Kelleher, Adam Kelleher


Publisher
Addison-Wesley Professional
Year
2019
Language
English
Pages
282
Edition
1
Category
Library


✦ Synopsis


The typical data science task in industry starts with an β€œask” from the business. But few data scientists have been taught what to do with that ask. This book shows them how to assess it in the context of the business’s goals, reframe it to work optimally for both the data scientist and the employer, and then execute on it. Written by two of the experts who’ve achieved breakthrough optimizations at BuzzFeed, it’s packed with real-world examples that take you from start to finish: from ask to actionable insight.


Andrew Kelleher and Adam Kelleher walk you through well-formed, concrete principles for approaching common data science problems, giving you an easy-to-use checklist for effective execution. Using their principles and techniques, you’ll gain deeper understanding of your data, learn how to analyze noise and confounding variables so they don’t compromise your analysis, and save weeks of iterative improvement by planning your projects more effectively upfront.


Once you’ve mastered their principles, you’ll put them to work in two realistic, beginning-to-end site optimization tasks. These extended examples come complete with reusable code examples and recommended open-source solutions designed for easy adaptation to your everyday challenges. They will be especially valuable for anyone seeking their first data science job, and for everyone who’s found that job and wants to succeed in it.

✦ Table of Contents


Cover
Half Title
Title Page
Copyright Page
Dedication
Contents
Foreword
Preface
About the Authors
I: Principles of Framing
1 The Role of the Data Scientist
1.1 Introduction
1.2 The Role of the Data Scientist
1.2.1 Company Size
1.2.2 Team Context
1.2.3 Ladders and Career Development
1.2.4 Importance
1.2.5 The Work Breakdown
1.3 Conclusion
2 Project Workflow
2.1 Introduction
2.2 The Data Team Context
2.2.1 Embedding vs. Pooling Resources
2.2.2 Research
2.2.3 Prototyping
2.2.4 A Combined Workflow
2.3 Agile Development and the Product Focus
2.3.1 The 12 Principles
2.4 Conclusion
3 Quantifying Error
3.1 Introduction
3.2 Quantifying Error in Measured Values
3.3 Sampling Error
3.4 Error Propagation
3.5 Conclusion
4 Data Encoding and Preprocessing
4.1 Introduction
4.2 Simple Text Preprocessing
4.2.1 Tokenization
4.2.2 N-grams
4.2.3 Sparsity
4.2.4 Feature Selection
4.2.5 Representation Learning
4.3 Information Loss
4.4 Conclusion
5 Hypothesis Testing
5.1 Introduction
5.2 What Is a Hypothesis?
5.3 Types of Errors
5.4 P-values and Confidence Intervals
5.5 Multiple Testing and β€œP-hacking”
5.6 An Example
5.7 Planning and Context
5.8 Conclusion
6 Data Visualization
6.1 Introduction
6.2 Distributions and Summary Statistics
6.2.1 Distributions and Histograms
6.2.2 Scatter Plots and Heat Maps
6.2.3 Box Plots and Error Bars
6.3 Time-Series Plots
6.3.1 Rolling Statistics
6.3.2 Auto-Correlation
6.4 Graph Visualization
6.4.1 Layout Algorithms
6.4.2 Time Complexity
6.5 Conclusion
II: Algorithms and Architectures
7 Introduction to Algorithms and Architectures
7.1 Introduction
7.2 Architectures
7.2.1 Services
7.2.2 Data Sources
7.2.3 Batch and Online Computing
7.2.4 Scaling
7.3 Models
7.3.1 Training
7.3.2 Prediction
7.3.3 Validation
7.4 Conclusion
8 Comparison
8.1 Introduction
8.2 Jaccard Distance
8.2.1 The Algorithm
8.2.2 Time Complexity
8.2.3 Memory Considerations
8.2.4 A Distributed Approach
8.3 MinHash
8.3.1 Assumptions
8.3.2 Time and Space Complexity
8.3.3 Tools
8.3.4 A Distributed Approach
8.4 Cosine Similarity
8.4.1 Complexity
8.4.2 Memory Considerations
8.4.3 A Distributed Approach
8.5 Mahalanobis Distance
8.5.1 Complexity
8.5.2 Memory Considerations
8.5.3 A Distributed Approach
8.6 Conclusion
9 Regression
9.1 Introduction
9.1.1 Choosing the Model
9.1.2 Choosing the Objective Function
9.1.3 Fitting
9.1.4 Validation
9.2 Linear Least Squares
9.2.1 Assumptions
9.2.2 Complexity
9.2.3 Memory Considerations
9.2.4 Tools
9.2.5 A Distributed Approach
9.2.6 A Worked Example
9.3 Nonlinear Regression with Linear Regression
9.3.1 Uncertainty
9.4 Random Forest
9.4.1 Decision Trees
9.4.2 Random Forests
9.5 Conclusion
10 Classification and Clustering
10.1 Introduction
10.2 Logistic Regression
10.2.1 Assumptions
10.2.2 Time Complexity
10.2.3 Memory Considerations
10.2.4 Tools
10.3 Bayesian Inference, Naive Bayes
10.3.1 Assumptions
10.3.2 Complexity
10.3.3 Memory Considerations
10.3.4 Tools
10.4 K-Means
10.4.1 Assumptions
10.4.2 Complexity
10.4.3 Memory Considerations
10.4.4 Tools
10.5 Leading Eigenvalue
10.5.1 Complexity
10.5.2 Memory Considerations
10.5.3 Tools
10.6 Greedy Louvain
10.6.1 Assumptions
10.6.2 Complexity
10.6.3 Memory Considerations
10.6.4 Tools
10.7 Nearest Neighbors
10.7.1 Assumptions
10.7.2 Complexity
10.7.3 Memory Considerations
10.7.4 Tools
10.8 Conclusion
11 Bayesian Networks
11.1 Introduction
11.2 Causal Graphs, Conditional Independence, and Markovity
11.2.1 Causal Graphs and Conditional Independence
11.2.2 Stability and Dependence
11.3 D-separation and the Markov Property
11.3.1 Markovity and Factorization
11.3.2 D-separation
11.4 Causal Graphs as Bayesian Networks
11.4.1 Linear Regression
11.5 Fitting Models
11.6 Conclusion
12 Dimensional Reduction and Latent Variable Models
12.1 Introduction
12.2 Priors
12.3 Factor Analysis
12.4 Principal Components Analysis
12.4.1 Complexity
12.4.2 Memory Considerations
12.4.3 Tools
12.5 Independent Component Analysis
12.5.1 Assumptions
12.5.2 Complexity
12.5.3 Memory Considerations
12.5.4 Tools
12.6 Latent Dirichlet Allocation
12.7 Conclusion
13 Causal Inference
13.1 Introduction
13.2 Experiments
13.3 Observation: An Example
13.4 Controlling to Block Non-causal Paths
13.4.1 The G-formula
13.5 Machine-Learning Estimators
13.5.1 The G-formula Revisited
13.5.2 An Example
13.6 Conclusion
14 Advanced Machine Learning
14.1 Introduction
14.2 Optimization
14.3 Neural Networks
14.3.1 Layers
14.3.2 Capacity
14.3.3 Overfitting
14.3.4 Batch Fitting
14.3.5 Loss Functions
14.4 Conclusion
III: Bottlenecks and Optimizations
15 Hardware Fundamentals
15.1 Introduction
15.2 Random Access Memory
15.2.1 Access
15.2.2 Volatility
15.3 Nonvolatile/Persistent Storage
15.3.1 Hard Disk Drives or β€œSpinning Disks”
15.3.2 SSDs
15.3.3 Latency
15.3.4 Paging
15.3.5 Thrashing
15.4 Throughput
15.4.1 Locality
15.4.2 Execution-Level Locality
15.4.3 Network Locality
15.5 Processors
15.5.1 Clock Rate
15.5.2 Cores
15.5.3 Threading
15.5.4 Branch Prediction
15.6 Conclusion
16 Software Fundamentals
16.1 Introduction
16.2 Paging
16.3 Indexing
16.4 Granularity
16.5 Robustness
16.6 Extract, Transfer/Transform, Load
16.7 Conclusion
17 Software Architecture
17.1 Introduction
17.2 Client-Server Architecture
17.3 N-tier/Service-Oriented Architecture
17.4 Microservices
17.5 Monolith
17.6 Practical Cases (Mix-and-Match Architectures)
17.7 Conclusion
18 The CAP Theorem
18.1 Introduction
18.2 Consistency/Concurrency
18.2.1 Conflict-Free Replicated Data Types
18.3 Availability
18.3.1 Redundancy
18.3.2 Front Ends and Load Balancers
18.3.3 Client-Side Load Balancing
18.3.4 Data Layer
18.3.5 Jobs and Taskworkers
18.3.6 Failover
18.4 Partition Tolerance
18.4.1 Split Brains
18.5 Conclusion
19 Logical Network Topological Nodes
19.1 Introduction
19.2 Network Diagrams
19.3 Load Balancing
19.4 Caches
19.4.1 Application-Level Caching
19.4.2 Cache Services
19.4.3 Write-Through Caches
19.5 Databases
19.5.1 Primary and Replica
19.5.2 Multimaster
19.5.3 A/B Replication
19.6 Queues
19.6.1 Task Scheduling and Parallelization
19.6.2 Asynchronous Process Execution
19.6.3 API Buffering
19.7 Conclusion
Bibliography
Index


✦ Similar Volumes


Pandas for Everyone: Python Data Analysis
✍ Daniel Chen · Library · 2023 · Addison-Wesley Professional · English

Manage and Automate Data Analysis with Pandas in Python. Today, analysts must manage data characterized by extraordinary variety, velocity, and volume. Using the open source Pandas library, you can use Python to rapidly automate and perform virtually a

Pandas for Everyone: Python Data Analysis
✍ Daniel Y. Chen · Library · 2022 · Addison-Wesley Professional · English

Manage and Automate Data Analysis with Pandas in Python. Today, analysts must manage data characterized by extraordinary variety, velocity, and volume. Using the open source Pandas library, you can use Python to rapidly automate and perform virtually any data analysis task


Foundations of Deep Reinforcement Learning
✍ Laura Graesser, Wah Loon Keng · Library · 2019 · Addison-Wesley Professional · English

The Contemporary Introduction to Deep Reinforcement Learning that Combines Theory and Practice. Deep reinforcement learning (deep RL) combines deep learning and reinforcement learning, in which artificial agents learn to solve sequential decision-making problems. In the past deca