Simplifying Data Engineering and Analytics with Delta: Create analytics-ready data that fuels artificial intelligence and business intelligence
Author: Anindita Mahapatra
- Publisher: Packt Publishing
- Year: 2022
- Language: English
- Pages: 335
- Category: Library
Synopsis
Explore how Delta brings reliability, performance, and governance to your data lake and all the AI and BI use cases built on top of it
Key Features
- Learn Delta's core concepts and features as well as what makes it a perfect match for data engineering and analysis
- Solve business challenges of different industry verticals using a scenario-based approach
- Make optimal choices by understanding the various tradeoffs provided by Delta
Book Description
Delta helps you generate reliable insights at scale and simplifies architecture around data pipelines, allowing you to focus primarily on refining the use cases being worked on. This is especially important when you consider that existing architecture is frequently reused for new use cases.
In this book, you'll learn about the principles of distributed computing, data modeling techniques, and big data design patterns and templates that help solve end-to-end data flow problems for common scenarios and are reusable across use cases and industry verticals. You'll also learn how to recover from errors and the best practices around handling structured, semi-structured, and unstructured data using Delta. After that, you'll get to grips with features such as ACID transactions on big data, disciplined schema evolution, time travel to help rewind a dataset to a different time or version, and unified batch and streaming capabilities that will help you build agile and robust data products.
By the end of this Delta book, you'll be able to use Delta as the foundational block for creating analytics-ready data that fuels all AI/BI use cases.
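The features named above (ACID transactions, schema evolution, time travel, unified batch and streaming) are exposed through standard Spark SQL. As a minimal illustrative sketch, assuming hypothetical Delta tables named `events` and `updates`, an atomic upsert and a time-travel query look like this:

```sql
-- ACID upsert: MERGE atomically inserts new rows and updates existing ones.
MERGE INTO events AS t
USING updates AS s
ON t.event_id = s.event_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

-- Time travel: query the table as of an earlier version or timestamp.
SELECT * FROM events VERSION AS OF 3;
SELECT * FROM events TIMESTAMP AS OF '2022-01-01';
```

The table and column names are placeholders; the syntax follows Delta Lake's SQL support on Apache Spark.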
What you will learn
- Explore the key challenges of traditional data lakes
- Appreciate the unique features of Delta that come out of the box
- Address reliability, performance, and governance concerns using Delta
- Analyze the open data format for an extensible and pluggable architecture
- Handle multiple use cases to support BI, AI, streaming, and data discovery
- Discover how common data and machine learning design patterns are executed on Delta
- Build and deploy data and machine learning pipelines at scale using Delta
Who this book is for
Data engineers, data scientists, ML practitioners, BI analysts, or anyone in the data domain working with big data will be able to put their knowledge to work with this practical guide to executing pipelines and supporting diverse use cases using the Delta protocol. Basic knowledge of SQL, Python programming, and Spark is required to get the most out of this book.
Table of Contents
- An Introduction to Data Engineering
- Data Modeling and ETL
- Delta – The Foundation Block for Big Data
- Unifying Batch and Streaming with Delta
- Data Consolidation in Delta Lake
- Solving Common Data Pattern Scenarios with Delta
- Delta for Data Warehouse Use Cases
- Handling Atypical Data Scenarios with Delta
- Delta for Reproducible Machine Learning Pipelines
- Delta for Data Products and Services
- Operationalizing Data and ML Pipelines
- Optimizing Cost and Performance with Delta
- Managing Your Data Journey
Detailed Table of Contents
Cover
Title Page
Copyright and Credits
Foreword
Contributors
Table of Contents
Preface
Section 1 – Introduction to Delta Lake and Data Engineering Principles
Chapter 1: Introduction to Data Engineering
The motivation behind data engineering
Use cases
How big is big data?
But isn't ML and AI all the rage today?
Understanding the role of data personas
Big data ecosystem
What characterizes big data?
Classifying data
Reaping value from data
Top challenges of big data systems
Evolution of data systems
Rise of cloud data platforms
SQL and NoSQL systems
OLTP and OLAP systems
Distributed computing
SMP and MPP computing
Parallel and distributed computing
Business justification for tech spending
Strategy for business transformation to use data as an asset
Big data trends and best practices
Summary
Chapter 2: Data Modeling and ETL
Technical requirements
What is data modeling and why should you care?
Advantages of a data modeling exercise
Stages of data modeling
Data modeling approaches for different data stores
Understanding metadata – data about data
Data catalog
Types of metadata
Why is metadata management the nerve center of data?
Moving and transforming data using ETL
Scenarios to consider for building ETL pipelines
Job orchestration
How to choose the right data format
Text format versus binary format
Row versus column formats
When to use which format
Leveraging data compression
Common big data design patterns
Ingestion
Transformations
Persist
Summary
Further reading
Chapter 3: Delta – The Foundation Block for Big Data
Technical requirements
Motivation for Delta
A case of too many is too little
Data silos to data swamps
Characteristics of curated data lakes
DDL commands
DML commands
APPEND
Demystifying Delta
Format layout on disk
The main features of Delta
ACID transaction support
Schema evolution
Unifying batch and streaming workloads
Time travel
Performance
Life with and without Delta
Lakehouse
Summary
Section 2 – End-to-End Process of Building Delta Pipelines
Chapter 4: Unifying Batch and Streaming with Delta
Technical requirements
Moving toward real-time systems
Streaming concepts
Lambda versus Kappa architectures
Streaming ETL
Extract – file-based versus event-based streaming
Transforming β stream processing
Loading β persisting the stream
Handling streaming scenarios
Joining with other static and dynamic datasets
Recovering from failures
Handling late-arriving data
Stateless and stateful in-stream operations
Trade-offs in designing streaming architectures
Cost trade-offs
Handling latency trade-offs
Data reprocessing
Multi-tenancy
De-duplication
Streaming best practices
Summary
Chapter 5: Data Consolidation in Delta Lake
Technical requirements
Why consolidate disparate data types?
Delta unifies all types of data
Structured data
Semi-structured data
Unstructured data
Avoiding patches of data darkness
Addressing problems in flight status using Delta
Augmenting domain knowledge constraints to quality
Continuous quality monitoring
Curating data in stages for analytics
RDD, DataFrames, and datasets
Spark transformations and actions
Spark APIs and UDFs
Ease of extending to existing and new use cases
Delta Lake connectors
Specialized Delta Lakes by industry
Data governance
GDPR and CCPA compliance
Role-based data access
Summary
Chapter 6: Solving Common Data Pattern Scenarios with Delta
Technical requirements
Understanding use case requirements
Minimizing data movement with Delta time travel
Delta cloning
Handling CDC
CDC
Change Data Feed (CDF)
Handling Slowly Changing Dimensions (SCD)
SCD Type 1
SCD Type 2
Summary
Chapter 7: Delta for Data Warehouse Use Cases
Technical requirements
Choosing the right architecture
Understanding what a data warehouse really solves
Lacunas of data warehouses
Discovering when a data lake does not suffice
Addressing concurrency and latency requirements with Delta
Visualizing data using BI reporting
Can cubes be constructed with Delta?
Analyzing tradeoffs in a push versus pull data flow
Why is being open such a big deal?
Considerations around data governance
The rise of the lakehouse category
Summary
Chapter 8: Handling Atypical Data Scenarios with Delta
Technical requirements
Emphasizing the importance of exploratory data analysis (EDA)
From big data to good data
Data profiling
Statistical analysis
Applying sampling techniques to address class imbalance
How to detect and address imbalance
Synthetic data generation to deal with data imbalance
Addressing data skew
Providing data anonymity
Handling bias and variance in data
Bias versus variance
How do we detect bias and variance?
How do we fix bias and variance?
Compensating for missing and out-of-range data
Monitoring data drift
Summary
Chapter 9: Delta for Reproducible Machine Learning Pipelines
Technical requirements
Data science versus machine learning
Challenges of ML development
Formalizing the ML development process
What is a model?
What is MLOps?
Aspirations of a modern ML platform
The role of Delta in an ML pipeline
Delta-backed feature store
Delta-backed model training
Delta-backed model inferencing
Model monitoring with Delta
From business problem to insight generation
Summary
Chapter 10: Delta for Data Products and Services
Technical requirements
DaaS
The need for data democratization
Delta for unstructured data
NLP data (text and audio)
Image and video data
Data mashups using Delta
Data blending
Data harmonization
Federated query
Facilitating data sharing with Delta
Setting up Delta sharing
Benefits of Delta sharing
Data clean room
Summary
Section 3 – Operationalizing and Productionalizing Delta Pipelines
Chapter 11: Operationalizing Data and ML Pipelines
Technical requirements
Why operationalize?
Understanding and monitoring SLAs
Scaling and high availability
Planning for DR
How to decide on the correct DR strategy
How Delta helps with DR
Guaranteeing data quality
Automation of CI/CD pipelines
Code under version control
Infrastructure as Code (IaC)
Unit and integration testing
Data as code – an intelligent pipeline
Summary
Chapter 12: Optimizing Cost and Performance with Delta
Technical requirements
Improving performance with common strategies
Where to look and what to look for
Optimizing with Delta
Changing the data layout in storage
Other platform optimizations
Automation
Is cost always inversely proportional to performance?
Best practices for managing performance
Summary
Chapter 13: Managing Your Data Journey
Provisioning a multi-tenant infrastructure
Data democratization via policies and processes
Capacity planning
Managing and monitoring
Data sharing
Data migration
COE best practices
Summary
Index
Other Books You May Enjoy