๐”– Scriptorium
โœฆ   LIBER   โœฆ

๐Ÿ“

Data Quality Fundamentals: A Practitioner's Guide to Building Trustworthy Data Pipelines

✍ Scribed by Barr Moses, Lior Gavish, Molly Vorwerck


Publisher: O'Reilly Media
Year: 2022
Tongue: English
Leaves: 311
Edition: 1
Category: Library


✦ Synopsis


Do your product dashboards look funky? Are your quarterly reports stale? Is the data set you're using broken or just plain wrong? These problems affect almost every team, yet they're usually addressed on an ad hoc basis and in a reactive manner. If you answered yes to these questions, this book is for you.

Many data engineering teams today face the "good pipelines, bad data" problem. It doesn't matter how advanced your data infrastructure is if the data you're piping is bad. In this book, Barr Moses, Lior Gavish, and Molly Vorwerck, from the data observability company Monte Carlo, explain how to tackle data quality and trust at scale by leveraging best practices and technologies used by some of the world's most innovative companies.

• Build more trustworthy and reliable data pipelines
• Write scripts to make data checks and identify broken pipelines with data observability
• Learn how to set and maintain data SLAs, SLIs, and SLOs
• Develop and lead data quality initiatives at your company
• Learn how to treat data services and systems with the diligence of production software
• Automate data lineage graphs across your data ecosystem
• Build anomaly detectors for your critical data assets

✦ Table of Contents


Cover
Copyright
Table of Contents
Preface
Conventions Used in This Book
Using Code Examples
O’Reilly Online Learning
How to Contact Us
Acknowledgments
Chapter 1. Why Data Quality Deserves Attention—Now
What Is Data Quality?
Framing the Current Moment
Understanding the “Rise of Data Downtime”
Other Industry Trends Contributing to the Current Moment
Summary
Chapter 2. Assembling the Building Blocks of a Reliable Data System
Understanding the Difference Between Operational and Analytical Data
What Makes Them Different?
Data Warehouses Versus Data Lakes
Data Warehouses: Table Types at the Schema Level
Data Lakes: Manipulations at the File Level
What About the Data Lakehouse?
Syncing Data Between Warehouses and Lakes
Collecting Data Quality Metrics
What Are Data Quality Metrics?
How to Pull Data Quality Metrics
Using Query Logs to Understand Data Quality in the Warehouse
Using Query Logs to Understand Data Quality in the Lake
Designing a Data Catalog
Building a Data Catalog
Summary
Chapter 3. Collecting, Cleaning, Transforming, and Testing Data
Collecting Data
Application Log Data
API Responses
Sensor Data
Cleaning Data
Batch Versus Stream Processing
Data Quality for Stream Processing
Normalizing Data
Handling Heterogeneous Data Sources
Schema Checking and Type Coercion
Syntactic Versus Semantic Ambiguity in Data
Managing Operational Data Transformations Across AWS Kinesis and Apache Kafka
Running Analytical Data Transformations
Ensuring Data Quality During ETL
Ensuring Data Quality During Transformation
Alerting and Testing
dbt Unit Testing
Great Expectations Unit Testing
Deequ Unit Testing
Managing Data Quality with Apache Airflow
Scheduler SLAs
Installing Circuit Breakers with Apache Airflow
SQL Check Operators
Summary
Chapter 4. Monitoring and Anomaly Detection for Your Data Pipelines
Knowing Your Known Unknowns and Unknown Unknowns
Building an Anomaly Detection Algorithm
Monitoring for Freshness
Understanding Distribution
Building Monitors for Schema and Lineage
Anomaly Detection for Schema Changes and Lineage
Visualizing Lineage
Investigating a Data Anomaly
Scaling Anomaly Detection with Python and Machine Learning
Improving Data Monitoring Alerting with Machine Learning
Accounting for False Positives and False Negatives
Improving Precision and Recall
Detecting Freshness Incidents with Data Monitoring
F-Scores
Does Model Accuracy Matter?
Beyond the Surface: Other Useful Anomaly Detection Approaches
Designing Data Quality Monitors for Warehouses Versus Lakes
Summary
Chapter 5. Architecting for Data Reliability
Measuring and Maintaining High Data Reliability at Ingestion
Measuring and Maintaining Data Quality in the Pipeline
Understanding Data Quality Downstream
Building Your Data Platform
Data Ingestion
Data Storage and Processing
Data Transformation and Modeling
Business Intelligence and Analytics
Data Discovery and Governance
Developing Trust in Your Data
Data Observability
Measuring the ROI on Data Quality
How to Set SLAs, SLOs, and SLIs for Your Data
Case Study: Blinkist
Summary
Chapter 6. Fixing Data Quality Issues at Scale
Fixing Quality Issues in Software Development
Data Incident Management
Incident Detection
Response
Root Cause Analysis
Resolution
Blameless Postmortem
Incident Response and Mitigation
Establishing a Routine of Incident Management
Why Data Incident Commanders Matter
Case Study: Data Incident Management at PagerDuty
The DataOps Landscape at PagerDuty
Data Challenges at PagerDuty
Using DevOps Best Practices to Scale Data Incident Management
Summary
Chapter 7. Building End-to-End Lineage
Building End-to-End Field-Level Lineage for Modern Data Systems
Basic Lineage Requirements
Data Lineage Design
Parsing the Data
Building the User Interface
Case Study: Architecting for Data Reliability at Fox
Exercise “Controlled Freedom” When Dealing with Stakeholders
Invest in a Decentralized Data Team
Avoid Shiny New Toys in Favor of Problem-Solving Tech
To Make Analytics Self-Serve, Invest in Data Trust
Summary
Chapter 8. Democratizing Data Quality
Treating Your “Data” Like a Product
Perspectives on Treating Data Like a Product
Convoy Case Study: Data as a Service or Output
Uber Case Study: The Rise of the Data Product Manager
Applying the Data-as-a-Product Approach
Building Trust in Your Data Platform
Align Your Product’s Goals with the Goals of the Business
Gain Feedback and Buy-in from the Right Stakeholders
Prioritize Long-Term Growth and Sustainability Versus Short-Term Gains
Sign Off on Baseline Metrics for Your Data and How You Measure Them
Know When to Build Versus Buy
Assigning Ownership for Data Quality
Chief Data Officer
Business Intelligence Analyst
Analytics Engineer
Data Scientist
Data Governance Lead
Data Engineer
Data Product Manager
Who Is Responsible for Data Reliability?
Creating Accountability for Data Quality
Balancing Data Accessibility with Trust
Certifying Your Data
Seven Steps to Implementing a Data Certification Program
Case Study: Toast’s Journey to Finding the Right Structure for Their Data Team
In the Beginning: When a Small Team Struggles to Meet Data Demands
Supporting Hypergrowth as a Decentralized Data Operation
Regrouping, Recentralizing, and Refocusing on Data Trust
Considerations When Scaling Your Data Team
Increasing Data Literacy
Prioritizing Data Governance and Compliance
Prioritizing a Data Catalog
Beyond Catalogs: Enforcing Data Governance
Building a Data Quality Strategy
Make Leadership Accountable for Data Quality
Set Data Quality KPIs
Spearhead a Data Governance Program
Automate Your Lineage and Data Governance Tooling
Create a Communications Plan
Summary
Chapter 9. Data Quality in the Real World: Conversations and Case Studies
Building a Data Mesh for Greater Data Quality
Domain-Oriented Data Owners and Pipelines
Self-Serve Functionality
Interoperability and Standardization of Communications
Why Implement a Data Mesh?
To Mesh or Not to Mesh? That Is the Question
Calculating Your Data Mesh Score
A Conversation with Zhamak Dehghani: The Role of Data Quality Across the Data Mesh
Can You Build a Data Mesh from a Single Solution?
Is Data Mesh Another Word for Data Virtualization?
Does Each Data Product Team Manage Their Own Separate Data Stores?
Is a Self-Serve Data Platform the Same Thing as a Decentralized Data Mesh?
Is the Data Mesh Right for All Data Teams?
Does One Person on Your Team “Own” the Data Mesh?
Does the Data Mesh Cause Friction Between Data Engineers and Data Analysts?
Case Study: Kolibri Games’ Data Stack Journey
First Data Needs
Pursuing Performance Marketing
2018: Professionalize and Centralize
Getting Data-Oriented
Getting Data-Driven
Building a Data Mesh
Five Key Takeaways from a Five-Year Data Evolution
Making Metadata Work for the Business
Unlocking the Value of Metadata with Data Discovery
Data Warehouse and Lake Considerations
Data Catalogs Can Drown in a Data Lake—or Even a Data Mesh
Moving from Traditional Data Catalogs to Modern Data Discovery
Deciding When to Get Started with Data Quality at Your Company
You’ve Recently Migrated to the Cloud
Your Data Stack Is Scaling with More Data Sources, More Tables, and More Complexity
Your Data Team Is Growing
Your Team Is Spending at Least 30% of Their Time Firefighting Data Quality Issues
Your Team Has More Data Consumers Than They Did One Year Ago
Your Company Is Moving to a Self-Service Analytics Model
Data Is a Key Part of the Customer Value Proposition
Data Quality Starts with Trust
Summary
Chapter 10. Pioneering the Future of Reliable Data Systems
Be Proactive, Not Reactive
Predictions for the Future of Data Quality and Reliability
Data Warehouses and Lakes Will Merge
Emergence of New Roles on the Data Team
Rise of Automation
More Distributed Environments and the Rise of Data Domains
So Where Do We Go from Here?
Index
About the Authors
Colophon

✦ Subjects


Management; Anomaly Detection; Monitoring; Scalability; Data Cleaning; Data Lake; Data Warehouse; Data Collection; Data Ingestion; Apache Airflow; Data Quality; Data Normalization; Data Reliability


📜 SIMILAR VOLUMES


The Practitioner's Guide to Data Quality
✍ David Loshin (Auth.) 📂 Library 📅 2011 🏛 Morgan Kaufmann 🌐 English

"There is NOTHING like this out there that I am aware of, and certainly nothing from anyone with same stature as David Loshin." --David Plotkin, Wells Fargo Bank. "The book provides a comprehensive look at data qua

Modern Data Architectures with Python: A
✍ Brian Lipp 📂 Library 📅 2023 🏛 Packt Publishing 🌐 English

Learn to build scalable and reliable data ecosystems using Data Mesh, Databricks Spark, and Kafka. Key Features: Develop modern data skills in emerging technologies. Learn pragmatic design methodologies like Data Mesh and Lake House. Grow a deeper understanding of data governance. Book Descript

Modern Data Architectures with Python: A
✍ Brian Lipp 📂 Library 📅 2023 🏛 Packt Publishing Pvt Ltd 🌐 English

Build scalable and reliable data ecosystems using Data Mesh, Databricks Spark, and Kafka. Key Features: Develop modern data skills used in emerging technologies. Learn pragmatic design methodologies such as Data Mesh and data lakehouses. Gain a deeper understanding of data governance. Purchase of

Practitioner's Guide to Data Science
✍ Hui Lin, Ming Li 📂 Library 📅 2023 🏛 CRC Press 🌐 English

This book aims to increase the visibility of Data Science in the real world, which differs from what you learn from a typical textbook. Many aspects of day-to-day Data Science work are almost absent from conventional statistics, Machine Learning, and Data Science curriculum. Yet these activities account

Fundamentals of Data Observability: Impl
✍ Andy Petrella 📂 Library 📅 2023 🏛 O'Reilly Media 🌐 English

Quickly detect, troubleshoot, and prevent a wide range of data issues through data observability, a set of best practices that enables data teams to gain greater visibility of data and its usage. If you're a data engineer, data architect, or machine learning engineer who depends on the qual