Training Data for Machine Learning

✍ Scribed by Anthony Sarkis


Publisher: O'Reilly Media
Year: 2023
Tongue: English
Leaves: 329
Category: Library


✦ Synopsis


Your training data has as much to do with the success of your data project as the algorithms themselves because most failures in AI systems relate to training data. But while training data is the foundation for successful AI and machine learning, there are few comprehensive resources to help you ace the process.

In this hands-on guide, author Anthony Sarkis—lead engineer for the Diffgram AI training data software—shows technical professionals, managers, and subject matter experts how to work with and scale training data, while illuminating the human side of supervising machines. Engineering leaders, data engineers, and data science professionals alike will gain a solid understanding of the concepts, tools, and processes they need to succeed with training data.

With this book, you'll learn how to:

  • Work effectively with training data including schemas, raw data, and annotations
  • Transform your work, team, or...
✦ Table of Contents


    Preface
    Who Should Read This Book?
    For the Technical Professional and Engineer
    For the Manager and Director
    For the Subject Matter Expert and Data Annotation Specialist
    For the Data Scientist
    Why I Wrote This Book
    How This Book Is Organized
    Themes
    The Basics and Getting Started
    Concepts and Theories
    Putting It All Together
    Conventions Used in This Book
    O’Reilly Online Learning
    How to Contact Us
    Acknowledgments
    1. Training Data Introduction
    Training Data Intents
    What Can You Do With Training Data?
    What Is Training Data Most Concerned With?
    Schema
    Raw data
    Annotations
    Quality
    Integrations
    The human role
    Training Data Opportunities
    Business Transformation
    Training Data Efficiency
    Tooling Proficiency
    Process Improvement Opportunities
    Why Training Data Matters
    ML Applications Are Becoming Mainstream
    The Foundation of Successful AI
    Training Data Is Here to Stay
    Training Data Controls the ML Program
    New Types of Users
    Training Data in the Wild
    What Makes Training Data Difficult?
    The Art of Supervising Machines
    A New Thing for Data Science
    ML Program Ecosystem
    Raw data media types
    Data-Centric Machine Learning
    Failures
    History of Development Affects Training Data Too
    What Training Data Is Not
    Generative AI
    Human Alignment Is Human Supervision
    Summary
    2. Getting Up and Running
    Introduction
    Getting Up and Running
    Installation
    Tasks Setup
    Annotator Setup
    Portal (default)
    Embedded
    Data Setup
    Workflow Setup
    Data Catalog Setup
    Initial Usage
    Optimization
    Tools Overview
    Training Data for Machine Learning
    Growing Selection of Tools
    People, Process, and Data
    Embedded Supervision
    Human Computer Supervision
    Separation of End Concerns
    Standards
    Many Personas
    A Paradigm to Deliver Machine Learning Software
    Trade-Offs
    Costs
    Installed Versus Software as a Service
    Development System
    Sequentially dependent discoveries
    Scale
    Why is it useful to define scale?
    Transitioning from small to medium scale
    Large-scale thoughts
    Installation Options
    Packaging
    Storage
    Database
    Data configuration
    Annotation Interfaces
    Modeling Integration
    Multi-User versus Single-User Systems
    Integrations
    Scope
    Platform and suite solutions
    Decision-making process
    Cautions
    Point solutions
    Tools in between
    Hidden Assumptions
    Security
    Security architecture
    Attack surface
    Security configuration
    Security benefits
    User access
    Data science access
    Root-level access
    Open Source and Closed Source
    Choose an open source tool to get up and running quickly
    See the forest from the trees
    Capability over optimizations
    Ease of use in different flows
    Vastly different assumptions
    Look at settings, not first impressions
    Is it easy to use, or just lacking features?
    Customization is the name of the game
    History
    Open Source Standards
    Realizing the Need for Dedicated Tooling
    More usage, more demands
    Advent of new standards
    Summary
    3. Schema
    Schema Deep Dive Introduction
    Labels and Attributes—What Is It?
    What Do We Care About?
    Introduction to Labels
    Attributes Introduction
    Attribute concepts
    Schema complexity trade-off
    Attribute depth
    Attribute Complexity Exceeds Spatial Complexity
    The hidden background case
    Example of sharing attributes between labels
    Technical Overview
    Example of an attribute in relation to an instance
    Data representations for engineering
    Examples of attributes
    Technical example of an attribute
    Spatial Representation—Where Is It?
    Using Spatial Types to Prevent Social Bias
    One way to avoid spatial bias
    Joint responsibility
    Trade-Offs with Types
    Computer Vision Spatial Type Examples
    Full image tag
    Box (2D)
    Polygon
    Ellipse and circle
    Cuboid
    Types with multiple uses
    Other types
    Raster mask
    Polygons and raster masks
    Keypoint geometry
    Custom spatial templates
    Complex spatial types
    Relationships, Sequences, Time Series: When Is It?
    Sequences and Relationships
    When
    Guides and Instructions
    Judgment Calls
    Relation of Machine Learning Tasks to Training Data
    Semantic Segmentation
    Image Classification (Tags)
    Object Detection
    Pose Estimation
    Relationship of Tasks to Training Data Types
    General Concepts
    Instance Concept Refresher
    Upgrading Data Over Time
    The Boundary Between Modeling and Training Data
    Raw Data Concepts
    Summary
    4. Data Engineering
    Introduction
    Who Wants the Data?
    Annotators
    Data scientists
    ML programs
    Application engineers
    Other stakeholders
    A Game of Telephone
    When a system of record is needed
    Planning a Great System
    Naive and Training Data–Centric Approaches
    Naive approaches
    Training data–centric (system of record)
    The first steps
    Raw Data Storage
    By Reference or by Value
    Off-the-Shelf Dedicated Training Data Tooling on Your Own Hardware
    Data Storage: Where Does the Data Rest?
    External Reference Connection
    Raw Media (BLOB)–Type Specific
    Images
    Video
    3D
    Text
    Medical
    Geospatial
    Formatting and Mapping
    User-Defined Types (Compound Files)
    Defining DataMaps
    Ingest Wizards
    Organizing Data and Useful Storage
    Remote Storage
    Versioning
    Per-instance history
    Per file and per set
    Per-export snapshots
    Data Access
    Disambiguating Storage, Ingestion, Export, and Access
    File-Based Exports
    Streaming Data
    Streaming benefits
    Streaming drawbacks
    Example: Fetch and stream
    Queries Introduction
    Integrations with the Ecosystem
    Security
    Access Control
    Identity and Authorization
    Example of Setting Permissions
    Signed URLs
    Cloud connections and signed URLs
    Personally Identifiable Information
    PII-compliant data chain
    PII avoidance
    PII removal
    Pre-Labeling
    Updating Data
    Pre-labeling gotchas
    Pre-labeling data prep process
    Summary
    5. Workflow
    Introduction
    Glue Between Tech and People
    Why Are Human Tasks Needed?
    Partnering with Non-Software Users in New Ways
    Getting Started with Human Tasks
    Basics
    Schemas’ Staying Power
    User Roles
    Training
    Gold Standard Training
    Task Assignment Concepts
    Do You Need to Customize the Interface?
    How Long Will the Average Annotator Be Using It?
    Tasks and Project Structure
    Quality Assurance
    Annotator Trust
    Annotators Are Partners
    Who supervises the data
    All training data has errors
    Annotator needs
    Common Causes of Training Data Errors
    Task Review Loops
    Standard review loop
    Consensus
    Analytics
    Annotation Metrics Examples
    Data Exploration
    Data exploration tool example
    Explore processes
    Explore examples
    Similar image reduction
    Models
    Using the Model to Debug the Humans
    Distinctions Between a Dataset, Model, and Model Run
    Getting Data to Models
    Dataflow
    Overview of Streaming
    Data Organization
    Folders and static organization
    Filters and dynamic organization
    Pipelines and Processes
    The dataset connection
    Sending a single file to that set
    Relating a dataset to a template
    Putting the whole example together
    Expanding the example
    Non-linear example
    Hooks
    Direct Annotation
    Business Process Integration
    Attributes
    Depth of Labeling
    Supervising Existing Data
    Interactive Automations
    Example: Semantic Segmentation Auto Bordering
    Video
    Motion
    Examples of tracking objects through time (time series)
    Static objects
    Persistent objects: football example
    Series example
    Video events
    Detecting sequence errors
    Common issues in video annotation
    Summary
    6. Theories, Concepts, and Maintenance
    Introduction
    Theories
    A System Is Only as Useful as Its Schema
    Who Supervises the Data Matters
    Intentionally Chosen Data Is Best
    Working with Historical Data
    Training Data Is Like Code
    Surface Assumptions Around Usage of Your Training Data
    Use definitions and processes to protect against assumptions
    Human Supervision Is Different from Classic Datasets
    Discovery versus automation
    Discovery
    General Concepts
    Data Relevancy
    Overall system design
    Raw data collection
    Need for Both Qualitative and Quantitative Evaluations
    Iterations
    Prioritization: What to Label
    Transfer Learning’s Relation to Datasets (Fine-Tuning)
    Per-Sample Judgment Calls
    Ethical and Privacy Considerations
    Bias
    Bias Is Hard to Escape
    Metadata
    Preventing Lost Metadata
    Train/Val/Test Is the Cherry on Top
    Sample Creation
    Simple Schema for a Strawberry Picking System
    Geometric Representations
    Binary Classification
    Let’s Manually Create Our First Set
    Upgraded Classification
    Where Is the Traffic Light?
    Maintenance
    Actions
    Increase schema depth to improve performance
    Better align the spatial type to the raw data
    Create more tasks
    Change the raw data
    Net Lift
    Levels of System Maturity of Training Data Operations
    Applied Versus Research Sets
    Training Data Management
    Quality
    Completed Tasks
    Freshness
    Maintaining Set Metadata
    Task Management
    Summary
    7. AI Transformation and Use Cases
    Introduction
    AI Transformation
    Seeing Your Day-to-Day Work as Annotation
    The Creative Revolution of Data-centric AI
    You Can Create New Data
    You Can Change What Data You Collect
    You Can Change the Meaning of the Data
    You Can Create!
    Think Step Function Improvement for Major Projects
    Build Your AI Data to Secure Your AI Present and Future
    Appoint a Leader: The Director of AI Data
    New Expectations People Have for the Future of AI
    Sometimes Proposals and Corrections, Sometimes Replacement
    Upstream Producers and Downstream Consumers
    Producer and consumer comparison
    Producer and consumer mindset
    Why is new structure needed?
    The budget
    The AI Director’s background
    Director of Training Data role
    AI-focused company modifications
    Classic company modification
    Spectrum of Training Data Team Engagement
    Dedicated Producers and Other Teams
    Organizing Producers from Other Teams
    Director of AI data responsibilities
    Training Data Evangelist
    Training Data Production Manager(s)
    Annotation Producer
    Data Engineer
    Use Case Discovery
    Rubric for Good Use Cases
    Detailed rubric
    Adds a new capability use case
    Repeating use cases
    Specialists and experts
    Evaluating a Use Case Against the Rubric
    Automatic background removal
    Evaluation example
    Conceptual Effects of Use Cases
    Ongoing impact of use cases
    The New “Crowd Sourcing”: Your Own Experts
    Key Levers on Training Data ROI
    What the Annotated Data Represents
    Trade-Offs of Controlling Your Own Training Data
    The Need for Hardware
    Common Project Mistakes
    Modern Training Data Tools
    Think Learning Curve, Not Perfection
    New Training and Knowledge Are Required
    Everyone
    Annotators
    Managers
    Executives
    How Companies Produce and Consume Data
    Trap to Avoid: Premature Optimization in Training Data
    No Silver Bullets
    Culture of Training Data
    New Engineering Principles
    Summary
    8. Automation
    Introduction
    Getting Started
    Motivation: When to Use These Methods?
    Check What Part of the Schema a Method Is Designed to Work On
    What Do People Actually Use?
    Commonly used techniques
    Domain-specific
    A note on ordering
    What Kind of Results Can I Expect?
    Common Confusions
    “Fully” automatic labeling for novel model creation
    Proprietary automatic methods
    User Interface Optimizations
    Risks
    Trade-Offs
    Nature of Automations
    Setup Costs
    How to Benchmark Well
    How to Scope the Automation Relative to the Problem
    Correction Time
    Subject Matter Experts
    Consider How the Automations Stack
    Pre-Labeling
    Standard Pre-Labeling
    Benefits
    Caveats
    Pre-Labeling a Portion of the Data Only
    Use off-the-shelf models
    Clear separation of concerns
    The “one step early” trick
    How to get started pre-labeling
    Interactive Annotation Automation
    Creating Your Own
    Technical Setup Notes
    What Is a Watcher? (Observer Pattern)
    How to Use a Watcher
    Interactive Capturing of a Region of Interest
    Interactive Drawing Box to Polygon Using GrabCut
    Full Image Model Prediction Example
    Example: Person Detection for Different Attribute
    Quality Assurance Automation
    Using the Model to Debug the Humans
    Automated Checklist Example
    Domain-Specific Reasonableness Checks
    Data Discovery: What to Label
    Human Exploration
    Raw Data Exploration
    Metadata Exploration
    Adding Pre-Labeling-Based Metadata
    Augmentation
    Better Models Are Better than Better Augmentation
    To Augment or Not to Augment
    Training/runtime augmentation
    Patch and inject method (crop and inject)
    Simulation and Synthetic Data
    Simulations Still Need Human Review
    Media Specific
    What Methods Work with Which Media?
    Considerations
    Media-Specific Research
    Domain Specific
    Geometry-Based Labeling
    Multi-sensor labeling automation—spatial
    Spatial labeling
    Heuristics-Based Labeling
    Summary
    9. Case Studies and Stories
    Introduction
    Industry
    A Security Startup Adopts Training Data Tools
    Quality Assurance at a Large-Scale Self-Driving Project
    Tricky schemas should be expanded, not shrunk
    Don’t justify a clearly bad schema with domain-specific assumptions
    Tracking spatial quality and errors per image
    Regression and focused effort do not always solve specific problems
    Overfocus on complex instructions instead of fixing the schema
    Trade-offs of attempting to achieve “perfection” in nuanced domain-specific cases
    Understanding nuanced cases
    Learning from mistakes
    Define occlusion well
    Expand schemas
    Remember the null case
    Missing assumptions for language barriers
    Don’t overfocus on spatial information
    Big-Tech Challenges
    Two annotation software teams
    Confusing the media types
    Non-queryable
    Different teams for annotations and raw media
    Moving toward a system of record
    Missing the big picture
    Solution
    Let’s address loops
    Human in the loop
    The case for aligning teams around training data
    Insurance Tech Startup Lessons
    Will the production data match the training data?
    Too late to bring in training data software
    Stories
    “Static Schema Prevented Innovation at Self Driving Firm”
    “Startup Didn’t Change Schema and Wasted Effort”
    “Accident Prevention Startup Missed Data-Centric Approach”
    “Sports Startup Successfully Used Pre-Labeling”
    An Academic Approach to Training Data
    Kaggle TSA Competition
    Keying in on training data
    How focusing on training data reveals commercial efficiencies
    Learning lessons and mistakes
    Summary
    Index


    📜 SIMILAR VOLUMES


    Training Data for Machine Learning: Huma
    ✍ Anthony Sarkis 📂 Library 📅 2022 🏛 O'Reilly Media 🌐 English

    Your training data has as much to do with the success of your data project as the algorithms themselves--most failures in deep learning systems relate to training data. But while training data is the foundation for successful machine learning, there are few comprehensive resources to help yo

    Training Data for Machine Learning: Huma
    ✍ Anthony Sarkis 📂 Library 📅 2023 🏛 O'Reilly Media 🌐 English

    Your training data has as much to do with the success of your data project as the algorithms themselves because most failures in AI systems relate to training data. But while training data is the foundation for successful AI and machine learning, there are few comprehensive resources to help you ace

    Machine learning quick reference: quick
    ✍ Kumar, Rahul 📂 Library 📅 2019 🏛 Packt Publishing, Limited 🌐 English

    Your hands-on reference guide to develop, train and optimize your machine learning models. Key Features: Your guide to learning efficient machine learning process from scratch; Expert techniques and hacks on a variety of machine learning conce

    Training Data for Machine Learning: Huma
    ✍ Anthony Sarkis 📂 Library 📅 2023 🏛 O'Reilly Media, Inc. 🌐 English

    Your training data has as much to do with the success of your data project as the algorithms themselves--most failures in deep learning systems relate to training data. But while training data is the foundation for successful machine learning, there are few comprehensive resources to help you ace the

    Machine Learning for Data Mining
    ✍ Jesus Salcedo 📂 Library 📅 2019 🏛 Packt Publishing 🌐 English

    Get efficient in performing data mining and machine learning using IBM SPSS Modeler. Key Features: Learn how to apply machine learning techniques in the field of data science; Understand when to use different data mining techniques, how to set up different analyses, and how

    Training Data for Machine Learning: Huma
    ✍ Anthony Sarkis 📂 Library 📅 2023 🏛 O'Reilly Media, Inc. 🌐 English

    Your training data has as much to do with the success of your data project as the algorithms themselves--most failures in Deep Learning systems relate to training data. But while training data is the foundation for successful Machine Learning, there are few comprehensive resources to help you ace th