Software Engineering for Data Scientists: From Notebooks to Scalable Systems
Scribed by Catherine Nelson
- Publisher
- O'Reilly Media
- Year
- 2024
- Tongue
- English
- Leaves
- 258
- Edition
- 1
- Category
- Library
Synopsis
Data science happens in code. The ability to write reproducible, robust, scalable code is key to a data science project's success, and it is absolutely essential for those working with production code. This practical book bridges the gap between data science and software engineering and clearly explains how to apply the best practices from software engineering to data science.
Examples are provided in Python, drawn from popular packages such as NumPy and pandas. If you want to write better data science code, this guide covers the essential topics that are often missing from introductory data science or coding classes, including how to:
• Understand data structures and object-oriented programming
• Clearly and skillfully document your code
• Package and share your code
• Integrate data science code with a larger code base
• Learn how to write APIs
• Create secure code
• Apply best practices to common tasks such as testing, error handling, and logging
• Work more effectively with software engineers
• Write more efficient, maintainable, and robust code in Python
• Put your data science projects into production
• And more
Table of Contents
Cover
Copyright
Table of Contents
Preface
Who Is This Book For?
Why Python?
What Is Not in This Book
Guide to This Book
Reading Order
Conventions Used in This Book
Using Code Examples
O'Reilly Online Learning
How to Contact Us
Acknowledgments
Chapter 1. What Is Good Code?
Why Good Code Matters
Adapting to Changing Requirements
Simplicity
Don't Repeat Yourself (DRY)
Avoid Verbose Code
Modularity
Readability
Standards and Conventions
Names
Cleaning up
Documentation
Performance
Robustness
Errors and Logging
Testing
Key Takeaways
Chapter 2. Analyzing Code Performance
Methods to Improve Performance
Timing Your Code
Profiling Your Code
cProfile
line_profiler
Memory Profiling with Memray
Time Complexity
How to Estimate Time Complexity
Big O Notation
Key Takeaways
Chapter 3. Using Data Structures Effectively
Native Python Data Structures
Lists
Tuples
Dictionaries
Sets
NumPy Arrays
NumPy Array Functionality
NumPy Array Performance Considerations
Array Operations Using Dask
Arrays in Machine Learning
pandas DataFrames
DataFrame Functionality
DataFrame Performance Considerations
Key Takeaways
Chapter 4. Object-Oriented Programming and Functional Programming
Object-Oriented Programming
Classes, Methods, and Attributes
Defining Your Own Classes
OOP Principles
Functional Programming
Lambda Functions and map()
Applying Functions to DataFrames
Which Paradigm Should I Use?
Key Takeaways
Chapter 5. Errors, Logging, and Debugging
Errors in Python
Reading Python Error Messages
Handling Errors
Raising Errors
Logging
What to Log
Logging Configuration
How to Log
Debugging
Strategies for Debugging
Tools for Debugging
Key Takeaways
Chapter 6. Code Formatting, Linting, and Type Checking
Code Formatting and Style Guides
PEP8
Import Formatting
Automatic Code Formatting with Black
Linting
Linting Tools
Linting in Your IDE
Type Checking
Type Annotations
Type Checking with mypy
Key Takeaways
Chapter 7. Testing Your Code
Why You Should Write Tests
When to Test
How to Write and Run Tests
A Basic Test
Testing Unexpected Inputs
Running Automated Tests with Pytest
Types of Tests
Unit Tests
Integration Tests
Data Validation
Data Validation Examples
Using Pandera for Data Validation
Data Validation with Pydantic
Testing for Machine Learning
Testing Model Training
Testing Model Inference
Key Takeaways
Chapter 8. Design and Refactoring
Project Design and Structure
Project Design Considerations
An Example Machine Learning Project
Code Design
Modular Code
A Code Design Framework
Interfaces and Contracts
Coupling
From Notebooks to Scalable Scripts
Why Use Scripts Instead of Notebooks?
Creating Scripts from Notebooks
Refactoring
Strategies for Refactoring
An Example Refactoring Workflow
Key Takeaways
Chapter 9. Documentation
Documentation Within the Codebase
Names
Comments
Docstrings
Readmes, Tutorials, and Other Longer Documents
Documentation in Jupyter Notebooks
Documenting Machine Learning Experiments
Key Takeaways
Chapter 10. Sharing Your Code: Version Control, Dependencies, and Packaging
Version Control Using Git
How Does Git Work?
Tracking Changes and Committing
Remote and Local
Branches and Pull Requests
Dependencies and Virtual Environments
Virtual Environments
Managing Dependencies with pip
Managing Dependencies with Poetry
Python Packaging
Packaging Basics
pyproject.toml
Building and Uploading Packages
Key Takeaways
Chapter 11. APIs
Calling an API
HTTP Methods and Status Codes
Getting Data from the SDG API
Creating Your Own API Using FastAPI
Setting Up the API
Adding Functionality to Your API
Making Requests to Your API
Key Takeaways
Chapter 12. Automation and Deployment
Deploying Code
Automation Examples
Pre-Commit Hooks
GitHub Actions
Cloud Deployments
Containers and Docker
Building a Docker Container
Deploying an API on Google Cloud
Deploying an API on Other Cloud Providers
Key Takeaways
Chapter 13. Security
What Is Security?
Security Risks
Credentials, Physical Security, and Social Engineering
Third-Party Packages
The Python Pickle Module
Version Control Risks
API Security Risks
Security Practices
Security Reviews and Policies
Secure Coding Tools
Simple Code Scanning
Security for Machine Learning
Attacks on ML Systems
Security Practices for ML Systems
Key Takeaways
Chapter 14. Working in Software
Development Principles and Practices
The Software Development Lifecycle
Waterfall Software Development
Agile Software Development
Agile Data Science
Roles in the Software Industry
Software Engineer
QA or Test Engineer
Data Engineer
Data Analyst
Product Manager
UX Researcher
Designer
Community
Open Source
Speaking at Events
The Python Community
Key Takeaways
Chapter 15. Next Steps
The Future of Code
Your Future in Code
Thank You
Index
About the Author
Colophon
Subjects
Software Engineering; Data Science; Debugging; Data Structures; Security; Python; Functional Programming; Logging; Deployment; Lambda Functions; Profiling; Best Practices; Design; Documentation; Refactoring; Object-Oriented Programming; Packaging System; Unit Testing; Error Handling; Automation; Integration Testing; Testing; Performance; Version Control Systems; Elementary; Data Validation; Dependency Management; mypy; Type Checking; Linting; APIs; Readability
SIMILAR VOLUMES
Data science happens in code. The ability to write reproducible, robust, scalable code is key to a data science project's success, and it is absolutely essential for those working with production code. This practical book bridges the gap between data science and software engineering, clearly explaining
These easy-to-learn, easy-to-apply software engineering techniques will radically improve collaboration, scaling, and deployment in your data science projects. In Software Engineering for Data Scientists you'll learn to improve performance and efficiency by: Using source control; Handling exception
These easy-to-learn, easy-to-apply software engineering techniques will radically improve collaboration, scaling, and deployment in your data science projects. In Software Engineering for Data Scientists you'll learn to improve performance and efficiency by: Using source control; Handling exceptions
Enhance your system design skills to build scalable and efficient systems by working through real-world case studies and expert strategies to excel in interviews. Key Features: Comprehensive coverage of distributed systems concepts and practical system design techniques. Insider tips and proven s