𝔖 Scriptorium
✦   LIBER   ✦

Software Engineering for Data Scientists: From Notebooks to Scalable Systems

✍ Scribed by Catherine Nelson


Publisher: O'Reilly Media
Year: 2024
Tongue: English
Leaves: 400
Category: Library

✦ Synopsis


Data science happens in code. The ability to write reproducible, robust, scalable code is key to a data science project's success, and is absolutely essential for those working with production code. This practical book bridges the gap between data science and software engineering and clearly explains how to apply the best practices from software engineering to data science.

Examples are provided in Python, drawn from popular packages such as NumPy and pandas. If you want to write better data science code, this guide covers the essential topics that are often missing from introductory data science or coding classes, including how to:

  • Understand data structures and object-oriented programming
  • Clearly and skillfully document your code
  • Package and share your code
  • Integrate data science code with a larger code base
  • Learn how to write APIs
  • Create secure code
  • Apply best practices to common tasks such as testing, error...

✦ Table of Contents


    Preface
    Who Is This Book For?
    Why Python?
    What Is Not in This Book
    Guide to This Book
    Reading Order
    Conventions Used in This Book
    Using Code Examples
    O'Reilly Online Learning
    How to Contact Us
    Acknowledgments
    1. What Is Good Code?
    Why Good Code Matters
    Adapting to Changing Requirements
    Simplicity
    Don't Repeat Yourself (DRY)
    Avoid Verbose Code
    Modularity
    Readability
    Standards and Conventions
    Names
    Cleaning up
    Documentation
    Performance
    Robustness
    Errors and Logging
    Testing
    Key Takeaways
    2. Analyzing Code Performance
    Methods to Improve Performance
    Timing Your Code
    Profiling Your Code
    cProfile
    line_profiler
    Memory Profiling with Memray
    Time Complexity
    How to Estimate Time Complexity
    Big O Notation
    Key Takeaways
    3. Using Data Structures Effectively
    Native Python Data Structures
    Lists
    Tuples
    Dictionaries
    Sets
    NumPy Arrays
    NumPy Array Functionality
    NumPy Array Performance Considerations
    Array Operations Using Dask
    Arrays in Machine Learning
    pandas DataFrames
    DataFrame Functionality
    DataFrame Performance Considerations
    Key Takeaways
    4. Object-Oriented Programming and Functional Programming
    Object-Oriented Programming
    Classes, Methods, and Attributes
    Defining Your Own Classes
    OOP Principles
    Functional Programming
    Lambda Functions and map()
    Applying Functions to DataFrames
    Which Paradigm Should I Use?
    Key Takeaways
    5. Errors, Logging, and Debugging
    Errors in Python
    Reading Python Error Messages
    Handling Errors
    Raising Errors
    Logging
    What to Log
    Logging Configuration
    How to Log
    Debugging
    Strategies for Debugging
    Tools for Debugging
    Key Takeaways
    6. Code Formatting, Linting, and Type Checking
    Code Formatting and Style Guides
    PEP8
    Import Formatting
    Automatic Code Formatting with Black
    Linting
    Linting Tools
    Linting in Your IDE
    Type Checking
    Type Annotations
    Type Checking with mypy
    Key Takeaways
    7. Testing Your Code
    Why You Should Write Tests
    When to Test
    How to Write and Run Tests
    A Basic Test
    Testing Unexpected Inputs
    Running Automated Tests with Pytest
    Types of Tests
    Unit Tests
    Integration Tests
    Data Validation
    Data Validation Examples
    Using Pandera for Data Validation
    Data Validation with Pydantic
    Testing for Machine Learning
    Testing Model Training
    Testing Model Inference
    Key Takeaways
    8. Design and Refactoring
    Project Design and Structure
    Project Design Considerations
    An Example Machine Learning Project
    Code Design
    Modular Code
    A Code Design Framework
    Interfaces and Contracts
    Coupling
    From Notebooks to Scalable Scripts
    Why Use Scripts Instead of Notebooks?
    Creating Scripts from Notebooks
    Refactoring
    Strategies for Refactoring
    An Example Refactoring Workflow
    Key Takeaways
    9. Documentation
    Documentation Within the Codebase
    Names
    Comments
    Docstrings
    Readmes, Tutorials, and Other Longer Documents
    Documentation in Jupyter Notebooks
    Documenting Machine Learning Experiments
    Key Takeaways
    10. Sharing Your Code: Version Control, Dependencies, and Packaging
    Version Control Using Git
    How Does Git Work?
    Tracking Changes and Committing
    Remote and Local
    Branches and Pull Requests
    Dependencies and Virtual Environments
    Virtual Environments
    Managing Dependencies with pip
    Managing Dependencies with Poetry
    Python Packaging
    Packaging Basics
    pyproject.toml
    Building and Uploading Packages
    Key Takeaways
    11. APIs
    Calling an API
    HTTP Methods and Status Codes
    Getting Data from the SDG API
    Creating Your Own API Using FastAPI
    Setting Up the API
    Adding Functionality to Your API
    Making Requests to Your API
    Key Takeaways
    12. Automation and Deployment
    Deploying Code
    Automation Examples
    Pre-Commit Hooks
    GitHub Actions
    Cloud Deployments
    Containers and Docker
    Building a Docker Container
    Deploying an API on Google Cloud
    Deploying an API on Other Cloud Providers
    Key Takeaways
    13. Security
    What Is Security?
    Security Risks
    Credentials, Physical Security, and Social Engineering
    Third-Party Packages
    The Python Pickle Module
    Version Control Risks
    API Security Risks
    Security Practices
    Security Reviews and Policies
    Secure Coding Tools
    Simple Code Scanning
    Security for Machine Learning
    Attacks on ML Systems
    Security Practices for ML Systems
    Key Takeaways
    14. Working in Software
    Development Principles and Practices
    The Software Development Lifecycle
    Waterfall Software Development
    Agile Software Development
    Agile Data Science
    Roles in the Software Industry
    Software Engineer
    QA or Test Engineer
    Data Engineer
    Data Analyst
    Product Manager
    UX Researcher
    Designer
    Community
    Open Source
    Speaking at Events
    The Python Community
    Key Takeaways
    15. Next Steps
    The Future of Code
    Your Future in Code
    Thank You
    Index
    About the Author


    📜 SIMILAR VOLUMES


    Software Engineering for Data Scientists
    ✍ Catherine Nelson 📂 Library 📅 2024 🏛 O'Reilly Media 🌐 English

    Data science happens in code. The ability to write reproducible, robust, scalable code is key to a data science project's success, and is absolutely essential for those working with production code. This practical book bridges the gap between data science and software engineering and clearly explain

    Software Engineering for Data Scientists
    ✍ Catherine Nelson 📂 Library 📅 2024 🏛 O'Reilly Media, Inc. 🌐 English

    Data science happens in code. The ability to write reproducible, robust, scalable code is key to a data science project's success, and is absolutely essential for those working with production code. This practical book bridges the gap between data science and software engineering, clearly explaining

    Software Engineering for Data Scientists
    ✍ Andrew Treadway 📂 Library 📅 2023 🏛 Manning Publications 🌐 English

    These easy to learn and apply software engineering techniques will radically improve collaboration, scaling, and deployment in your Data Science projects. In Software Engineering for Data Scientists you'll learn to improve performance and efficiency by: Using source control Handling exception

    Software Engineering for Data Scientists
    ✍ Andrew Treadway 📂 Library 📅 2023 🏛 Manning Publications 🌐 English

    These easy to learn and apply software engineering techniques will radically improve collaboration, scaling, and deployment in your data science projects. In Software Engineering for Data Scientists you'll learn to improve performance and efficiency by Using source control Handling exceptions

    System Design Guide for Software Profess
    ✍ Dhirendra Sinha, Tejas Chopra 📂 Library 📅 2024 🏛 Packt Publishing Pvt Ltd 🌐 English

    Enhance your system design skills to build scalable and efficient systems by working through real-world case studies and expert strategies to excel in interviews Key Features Comprehensive coverage of distributed systems concepts and practical system design techniques. Insider tips and proven s