"Feature engineering plays a vital role in big data analytics. Machine learning and data mining algorithms cannot work without data. Little can be achieved if there are few features to represent the underlying data objects, and the quality of results of those algorithms largely depends on the qualit
The Pragmatic Programmer for Machine Learning: Engineering Analytics and Data Science Solutions
By Marco Scutari, Mauro Malvestio
- Publisher
- CRC Press/Chapman & Hall
- Year
- 2023
- Language
- English
- Pages
- 357
- Series
- Chapman & Hall/CRC Machine Learning & Pattern Recognition
- Category
- Library
Free of charge, no registration required. For personal study only.
Synopsis
Machine learning has redefined the way we work with data and is increasingly becoming an indispensable part of everyday life. The Pragmatic Programmer for Machine Learning: Engineering Analytics and Data Science Solutions discusses how modern software engineering practices are part of this revolution, both conceptually and in practical applications.
Comprising a broad overview of how to design machine learning pipelines as well as the state-of-the-art tools used to build them, this book provides a multi-disciplinary view of how traditional software engineering can be adapted to and integrated with the workflows of domain experts and probabilistic models.
From choosing the right hardware to designing effective pipeline architectures and adopting software development best practices, this guide will appeal to machine learning and data science specialists, whilst also laying out key high-level principles in a way that is approachable for students of computer science and aspiring programmers.
Table of Contents
Cover
Half Title
Series Page
Title Page
Copyright Page
Dedication
Contents
Preface
1. What Is This Book About?
1.1. Machine Learning
1.2. Data Science
1.3. Software Engineering
1.4. How Do They Go Together?
I. Foundations of Scientific Computing
2. Hardware Architectures
2.1. Types of Hardware
2.1.1. Compute
2.1.2. Memory
2.1.3. Connections
2.2. Making Hardware Live Up to Expectations
2.3. Local and Remote Hardware
2.4. Choosing the Right Hardware for the Job
3. Variable Types and Data Structures
3.1. Variable Types
3.1.1. Integers
3.1.2. Floating Point
3.1.3. Strings
3.2. Data Structures
3.2.1. Vectors and Lists
3.2.2. Representing Data with Data Frames
3.2.3. Dense and Sparse Matrices
3.3. Choosing the Right Variable Types for the Job
3.4. Choosing the Right Data Structures for the Job
4. Analysis of Algorithms
4.1. Writing Pseudocode
4.2. Computational Complexity and Big-O Notation
4.3. Big-O Notation and Benchmarking
4.4. Algorithm Analysis for Machine Learning
4.5. Some Examples of Algorithm Analysis
4.5.1. Estimating Linear Regression Models
4.5.2. Sparse Matrices Representation
4.5.3. Uniform Simulations of Directed Acyclic Graphs
4.6. Big-O Notation and Real-World Performance
II. Best Practices for Machine Learning Pipelines
5. Designing and Structuring Pipelines
5.1. Data as Code
5.2. Technical Debt
5.2.1. At the Data Level
5.2.2. At the Model Level
5.2.3. At the Architecture (Design) Level
5.2.4. At the Code Level
5.3. Machine Learning Pipeline
5.3.1. Project Scoping
5.3.2. Producing a Baseline Implementation
5.3.3. Data Ingestion and Preparation
5.3.4. Model Training, Evaluation and Validation
5.3.5. Deployment, Serving and Inference
5.3.6. Monitoring, Logging and Reporting
6. Writing Machine Learning Code
6.1. Choosing Languages and Libraries
6.2. Naming Things
6.3. Coding Styles and Coding Standards
6.4. Filesystem Structure
6.5. Effective Versioning
6.6. Code Review
6.7. Refactoring
6.8. Reworking Academic Code: An Example
7. Packaging and Deploying Pipelines
7.1. Model Packaging
7.1.1. Standalone Packaging
7.1.2. Programming Language Package Managers
7.1.3. Virtual Machines
7.1.4. Containers
7.2. Model Deployment: Strategies
7.3. Model Deployment: Infrastructure
7.4. Model Deployment: Monitoring and Logging
7.5. What Can Possibly Go Wrong?
7.6. Rolling Back
8. Documenting Pipelines
8.1. Comments
8.2. Documenting Public Interfaces
8.3. Documenting Architecture and Design
8.4. Documenting Algorithms and Business Cases
8.5. Illustrating Practical Use Cases
9. Troubleshooting and Testing Pipelines
9.1. Data Are the Problem
9.1.1. Large Data
9.1.2. Heterogeneous Data
9.1.3. Dynamic Data
9.2. Models Are the Problem
9.2.1. Large Models
9.2.2. Black-Box Models
9.2.3. Costly Models
9.2.4. Many Models
9.3. Common Signs That Something Is Up
9.4. Tests Are the Solution
9.4.1. What Do We Want to Achieve?
9.4.2. What Should We Test?
9.4.3. Offline and Online Data
9.4.4. Testing Local and Testing Global
9.4.5. Conceptual and Implementation Errors
9.4.6. Code Coverage and Test Prioritisation
III. Tools and Technologies
10. Tools for Developing Pipelines
10.1. Data Exploration and Experiment Tracking
10.2. Code Development
10.2.1. Code Editors and IDEs
10.2.2. Notebooks
10.2.3. Accessing Data and Documentation
10.3. Build, Test and Documentation Tools
11. Tools to Manage Pipelines in Production
11.1. Infrastructure Management
11.2. Machine Learning Software Management
11.3. Dashboards, Visualisation and Reporting
IV. A Case Study
12. Recommending Recommendations: A Recommender System Using Natural Language Understanding
12.1. The Domain Problem
12.2. The Machine Learning Model
12.3. The Infrastructure
12.4. The Architecture of the Pipeline
12.4.1. Data Ingestion and Data Preparation
12.4.2. Data Tracking and Versioning
12.4.3. Training and Experiment Tracking
12.4.4. Model Packaging
12.4.5. Deployment and Inference
Bibliography
Index
SIMILAR VOLUMES
Self-driving cars, natural language recognition, and online recommendation engines are all possible thanks to machine learning. Now you can create your own genetic algorithms, nature-inspired swarms, Monte Carlo simulations, cellular automata, and clusters. Learn how to test your ML code…
Packt Publishing, 2015. 607 p. ISBN: 1784397180, 9781784397180. The term "data science" has been widely used to define this new profession that is expected to interpret vast datasets and translate them to improved decision-making and performance. Clojure is a powerful language that combines the interactivity of a scripting language with the speed of a compiled language…
Empower Your Data Insights with Java's Top Tools and Frameworks. This book is a comprehensive guide to data analysis using Java. It starts with the fundamentals, covering the purpose of data analysis, different data types and structures, and how to pre-process datasets…
Get a step ahead of your competitors with insights from over 30 Kaggle Masters and Grandmasters. Discover tips, tricks, and best practices for competing effectively on Kaggle and becoming a better data scientist. Purchase of the print or Kindle book includes a free eBook in the PDF format.