MLOps Lifecycle Toolkit: A Software Engineering Roadmap for Designing, Deploying, and Scaling Stochastic Systems
Scribed by Dayne Sorvisto
- Publisher: Apress
- Year: 2023
- Tongue: English
- Leaves: 285
- Category: Library
Synopsis
This book is aimed at practitioners of data science, with consideration for bespoke problems, standards, and tech stacks between industries. It will guide you through the fundamentals of technical decision making, including planning, building, optimizing, packaging, and deploying end-to-end, reliable, and robust stochastic workflows using the language of data science.
MLOps Lifecycle Toolkit walks you through the principles of software engineering, assuming no prior experience. It addresses the perennial "why" of MLOps early, along with insight into the unique challenges of engineering stochastic systems. Next, you'll discover resources to learn software craftsmanship, data-driven testing frameworks, and computer science. Additionally, you will see how to transition from Jupyter notebooks to code editors, and leverage infrastructure and cloud services to take control of the entire machine learning lifecycle. You'll gain insight into the technical and architectural decisions you're likely to encounter, as well as best practices for deploying accurate, extensible, scalable, and reliable models. Through hands-on labs, you will build your own MLOps "toolkit" that you can use to accelerate your own projects. In later chapters, author Dayne Sorvisto takes a thoughtful, bottom-up approach to machine learning engineering by considering the hard problems unique to industries such as high finance, energy, healthcare, and tech as case studies, along with the ethical and technical constraints that shape decision making.
After reading this book, whether you are a data scientist, product manager, or industry decision maker, you will be equipped to deploy models to production, understand the nuances of MLOps in the domain language of your industry, and have the resources for continuous delivery and learning.
What You Will Learn
- Understand the principles of software engineering and MLOps
- Design an end-to-end machine learning system
- Balance technical decisions and architectural trade-offs
- Gain insight into the fundamental problems unique to each industry and how to solve them
Who This Book Is For
Data scientists, machine learning engineers, and software professionals.
Table of Contents
About the Author
About the Technical Reviewer
Acknowledgments
Introduction
Chapter 1: Introducing MLOps
What Is MLOps?
Defining MLOps
MLOps Maturity Model
Brief History of MLOps
Defining the Relationship Between Data Science and Engineering
What Are the Types of Data Science Projects?
Supervised Machine Learning
Semi-supervised Machine Learning
Reinforcement Learning
Probabilistic Programming
Ad Hoc Statistical Analysis
The Two Worlds: Mindset Shift from Data Science to Engineering
What Is a Type A Data Scientist?
Types of Data Science Roles
Hackerlytics: Thinking Like an Engineer for Data Scientists
Anti-pattern: The Brittle Training Pipeline
Future-Proofing Data Science Code
What Is Technical Debt?
Hidden Technical Trade-Offs in MLOps
How to Protect Projects from Change
Drivers of Change in Data Science Projects
Choosing a Programming Language for Data Science
MapReduce and Big Data
Big Data a.k.a. "High Volume"
High-Velocity Data
High-Veracity Data
Types of Data Architectures
The Spiral MLOps Lifecycle
Data Discovery
Data Discovery and Insight Generation
Data and Feature Engineering
Model Training
Model Evaluation
Deployment and Ops
Monitoring Models in Production
Example Components of a Production Machine Learning System
Measuring the Quality of Data Science Projects
Measuring Quality in Data Science Projects
Importance of Measurement in MLOps
What Is Reliability?
What Is Maintainability?
Moving the Needle: From Measurement to Actionable Business Insights
Hackerlytics: The Mindset of an MLOps Role
Summary
Chapter 2: Foundations for MLOps Systems
Mathematical Thinking
Linear Algebra
Probability Distributions
Understanding Generative and Discriminative Models
Bayesian Thinking
Gaussian Mixture Models
General Additive Models
Kernel Methods
Higher Dimensional Spaces
Lab: Mathematical Statistics
Programming Nondeterministic Systems
Programming and Computational Concepts
Loops
Variables, Statements, and Mathematical Expressions
Control Flow and Boolean Expressions
Tensor Operations and Einsums
Data Structures for Data Science
Sets
Arrays and Lists
Hash Maps
Trees and Graphs
Binary Tree
DAGs
SQL Basics
Algorithmic Thinking for Data Science
Core Technical Decision-Making: Choosing the Right Tool
Translating Thoughts into Executable Code
Understanding Libraries and Packages
PyMc3 Package
Numpy and Pandas
R Packages
Important Frameworks for Deep Learning
TensorFlow
PyTorch
Theano
Keras
Further Resources in Computer Science Foundations
Further Reading in Mathematical Foundations
Summary
Chapter 3: Tools for Data Science Developers
Data and Code Version Control Systems
What Is Version Control?
What Is Git?
Git Internals
Plumbing and Porcelain: Understanding Git Terminology
How Git Stores Snapshots Internally
Sourcetree for the Data Scientist
Branching Strategy for Data Science Teams
Creating Pull Requests
Do I Need to Use Source Control?
Version Control for Data
Git and DVC Lab
Model Development and Training
Spyder
Visual Studio Code
Cloud Notebooks and Google Colab
Programming Paradigms and Craftsmanship
Naming Conventions and Standards in Data Science
Code Smells in Data Science Code
Documentation for Data Science Teams
Test Driven Development for Data Scientists
From Craftsmanship to Clean Code
Model Packages and Deployment
Choosing a Package Manager
Anaconda
Installing Python Packages Securely
Navigating Open Source Packages for Data Scientists
Common Packages for MLOps
DataOps Packages
Jupyter Notebook
JupyterLab Server
Databricks
ModelOps Packages
Model Tracking and Monitoring
Packages for Data Visualization and Reporting
Lab: Developing an MLOps Toolkit Accelerator in CookieCutter
Summary
Chapter 4: Infrastructure for MLOps
Containerization for Data Scientists
Introduction to Docker
Anatomy of the Docker File
Lab 1: Building a Docker Data Science Lab for MLOps
The Feature Store Pattern
Implementing Feature Stores: Online vs. Offline Feature Stores
Lab: Exploring Data Infrastructure with Feast
Exercise
Dive into Parquet Format
Hardware Accelerated Training
Cloud Service Providers
Distributed Training
Optional Lab: PaaS Feature Stores in the Cloud Using Databricks
Scaling Pandas Code with a Single Line
GPU Accelerated Training
Databases for Data Science
Patterns for Enterprise Grade Projects
No-SQL Databases and Metastores
Relational Databases
Introduction to Container Orchestration
Commands for Managing Services in Docker Compose
Making Technical Decisions
Summary
Chapter 5: Building Training Pipelines
Pipelines for Model Training
ELT and Loading Training Data
Tools for Building ELT Pipelines
Azure Data Factory and AWS Glue
Using Production Data in Training Pipeline
Preprocessing the Data
Handling Missing Values
Knowing When to Scale Your Training Data
Understanding Schema Drift
Feature Selection: To Automate or Not to Automate?
Building the Model
Evaluating the Model
Automated Reporting
Batch Processing and Feature Stores
Mini-Batch Gradient Descent
Stochastic Gradient Descent
Online Learning and Personalization
Shap Values and Explainability at Training Time
Feedback Loops: Augmenting Training Pipelines with User Data
Hyper-parameter Tuning
Hardware Accelerated Training Lab
Experimentation Tracking
MLFlow Architecture and Components
MLFlow Lab: Building a Training Pipeline with MLFlow
Summary
Chapter 6: Building Inference Pipelines
Reducing Production-Training Skew
Monitoring Infrastructure Used in Inference Pipelines
Monitoring Data and Model Drift
Designing Inference APIs
Comparing Models and Performance for Several Models
Performance Considerations
Scalability
What Is a RESTful API?
What Is a Microservice?
Lab: Building an Inference API
The Cold-Start Problem
Documentation for Inference Pipelines
Reporting for Inference Pipelines
Summary
Chapter 7: Deploying Stochastic Systems
Introducing the Spiral MLOps Lifecycle
Problem Definition
Problem Validation
Data Collection or Data Discovery
Data Validation
Data Engineering
Model Training
Diagnostic Plots and Model Retraining
Model Inference
The Various Levels of Schema Drift in Data Science
The Need for a More Flexible Table in Data Science
Model Deployment
Deploying Model as Public or Private API
Integrating Your Model into a Business System
Developing a Deployment Strategy
Reducing Technical Debt in Your Lifecycle
Generative AI for Code Reviews and Development
Adapting Agile for Data Scientists
Model-Centric vs. Data-Centric Workflows
Continuous Delivery for Stochastic Systems
Introduction to Kubeflow for Data Scientists
Lab: Deploying Your Data Science Project
Open Source vs. Closed Source in Data Science
Monolithic vs. Distributed Architectures
Choosing a Deployment Model
Post-deployment
Deploying More General Stochastic Systems
Summary
Chapter 8: Data Ethics
Data Ethics
Model Sustainability
Data Ethics for Data Science
GDPR and Data Governance
Ethics in Data Science
Generative AI's Impact on Data Ethics
Safeguards for Mitigating Risk
Data Governance for Data Scientists
Privacy and Data Science
How to Identify PII in Big Data
Using Only the Data You Need
ESG and Social Responsibility for Data Science
Data Ethics Maturity Framework
Responsible Use of AI in Data Science
Further Resources
Data Ethics Lab: Adding Bias Reduction to Titanic Disaster Dataset
Summary
Chapter 9: Case Studies by Industry
Causal Data Science and Confounding Variables
Energy Industry
Manufacturing
Transportation
Retail
Agritech
Finance Industry
Healthcare Industry
Insurance Industry
Product Data Science
Research and Development Data Science: The Last Frontier
Building a Data Moat for Your Organization
The Changing Role of the Domain Expert Through History
Will Data Outpace Processing Capabilities?
The MLOps Lifecycle Toolkit
Summary
Index