Data Engineering with Databricks Cookbook: Build effective data and AI solutions using Apache Spark, Databricks, and Delta Lake
✍ Written by Pulkit Chadha
- Publisher: Packt Publishing
- Year: 2024
- Language: English
- Pages: 438
- Edition: 1
- Category: Library
✦ Synopsis
Work through 70 recipes for implementing reliable data pipelines with Apache Spark, storing and processing structured and unstructured data optimally in Delta Lake, and using Databricks to orchestrate and govern your data.
Key Features
Learn data ingestion, data transformation, and data management techniques using Apache Spark and Delta Lake
Gain practical guidance on using Delta Lake tables and orchestrating data pipelines
Implement reliable DataOps and DevOps practices, and enforce data governance policies on Databricks
Purchase of the print or Kindle book includes a free PDF eBook
Book Description
Data Engineering with Databricks Cookbook will guide you through recipes to effectively use Apache Spark, Delta Lake, and Databricks for data engineering, beginning with an introduction to data ingestion and loading with Apache Spark. As you progress, you'll be introduced to various data manipulation and data transformation solutions that can be applied to data. You'll find out how to manage and optimize Delta tables, as well as how to ingest and process streaming data. The book will also show you how to address performance problems in Apache Spark apps and Delta Lake. Later chapters will show you how to use Databricks to implement DataOps and DevOps practices and teach you how to orchestrate and schedule data pipelines using Databricks Workflows. Finally, you'll understand how to set up and configure Unity Catalog for data governance. By the end of this book, you'll be well-versed in building reliable and scalable data pipelines using modern data engineering technologies.
What you will learn
Perform data loading, ingestion, and processing with Apache Spark
Discover data transformation techniques and custom user-defined functions (UDFs) in Apache Spark
Manage and optimize Delta tables with Apache Spark and Delta Lake APIs
Use Spark Structured Streaming for real-time data processing
Optimize Apache Spark application and Delta table query performance
Implement DataOps and DevOps practices on Databricks
Orchestrate data pipelines with Delta Live Tables and Databricks Workflows
Implement data governance policies with Unity Catalog
Who this book is for
This book is for data engineers, data scientists, and data practitioners who want to learn how to build efficient and scalable data pipelines using Apache Spark, Delta Lake, and Databricks. To get the most out of this book, you should have basic knowledge of data architecture, SQL, and Python programming.
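To give a flavor of what these recipes look like in practice, here is a minimal PySpark sketch of the workflow at the heart of the book: ingesting CSV data with Apache Spark, applying a transformation, and writing the result out as a Delta Lake table. This is not an example from the book; the file paths, column names, and configuration shown are illustrative assumptions, and the sketch presumes the delta-spark package is available.

# Minimal sketch of a Spark-plus-Delta pipeline. Illustrative only: the paths
# and column names below are assumptions, not examples taken from the book.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Assumes the delta-spark package is installed (pip install delta-spark);
# on Databricks these settings are already in place.
spark = (
    SparkSession.builder.appName("cookbook-style-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Ingest: read CSV with a header row, inferring the schema (hypothetical path).
orders = (
    spark.read.option("header", True)
    .option("inferSchema", True)
    .csv("/data/raw/orders.csv")
)

# Transform: keep completed orders and aggregate revenue per day
# (hypothetical columns: status, order_date, amount).
daily_totals = (
    orders.where(F.col("status") == "COMPLETE")
    .groupBy("order_date")
    .agg(F.sum("amount").alias("total_amount"))
)

# Load: persist the result as a Delta Lake table (hypothetical location).
daily_totals.write.format("delta").mode("overwrite").save("/data/delta/daily_totals")

On Databricks, the Delta Lake extensions come preconfigured, so the session-building boilerplate reduces to the ambient spark object.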
✦ Table of Contents
Title Page
Copyright and Credits
Dedication
Contributors
Table of Contents
Preface
Part 1 – Working with Apache Spark and Delta Lake
Chapter 1: Data Ingestion and Data Extraction with Apache Spark
Technical requirements
Reading CSV data with Apache Spark
How to do it...
There’s more…
See also
Reading JSON data with Apache Spark
How to do it...
There’s more…
See also
Reading Parquet data with Apache Spark
How to do it...
See also
Parsing XML data with Apache Spark
How to do it…
There’s more…
See also
Working with nested data structures in Apache Spark
How to do it…
There’s more…
See also
Processing text data in Apache Spark
How to do it…
There’s more…
See also
Writing data with Apache Spark
How to do it…
There’s more…
See also
Chapter 2: Data Transformation and Data Manipulation with Apache Spark
Technical requirements
Applying basic transformations to data with Apache Spark
How to do it...
There’s more…
See also
Filtering data with Apache Spark
How to do it…
There’s more…
See also
Performing joins with Apache Spark
How to do it...
There’s more…
See also
Performing aggregations with Apache Spark
How to do it...
There’s more…
See also
Using window functions with Apache Spark
How to do it...
There’s more…
Writing custom UDFs in Apache Spark
How to do it...
There’s more…
See also
Handling null values with Apache Spark
How to do it...
There’s more…
See also
Chapter 3: Data Management with Delta Lake
Technical requirements
Creating a Delta Lake table
How to do it...
There’s more…
See also
Reading a Delta Lake table
How to do it...
There’s more...
See also
Updating data in a Delta Lake table
How to do it...
See also
Merging data into Delta tables
How to do it...
There’s more…
See also
Change data capture in Delta Lake
How to do it...
See also
Optimizing Delta Lake tables
How to do it...
There’s more...
See also
Versioning and time travel for Delta Lake tables
How to do it...
There’s more...
See also
Managing Delta Lake tables
How to do it...
See also
Chapter 4: Ingesting Streaming Data
Technical requirements
Configuring Spark Structured Streaming for real-time data processing
Getting ready
How to do it…
How it works…
There’s more…
See also
Reading data from real-time sources, such as Apache Kafka, with Apache Spark Structured Streaming
Getting ready
How to do it…
How it works…
There’s more…
See also
Defining transformations and filters on a Streaming DataFrame
Getting ready
How to do it…
See also
Configuring checkpoints for Structured Streaming in Apache Spark
Getting ready
How to do it…
How it works…
There’s more…
See also
Configuring triggers for Structured Streaming in Apache Spark
Getting ready
How to do it…
How it works…
See also
Applying window aggregations to streaming data with Apache Spark Structured Streaming
Getting ready
How to do it…
There’s more…
See also
Handling out-of-order and late-arriving events with watermarking in Apache Spark Structured Streaming
Getting ready
How to do it…
There’s more…
See also
Chapter 5: Processing Streaming Data
Technical requirements
Writing the output of Apache Spark Structured Streaming to a sink such as Delta Lake
Getting ready
How to do it…
How it works…
See also
Idempotent stream writing with Delta Lake and Apache Spark Structured Streaming
Getting ready
How to do it…
See also
Merging or applying Change Data Capture on Apache Spark Structured Streaming and Delta Lake
Getting ready
How to do it…
There’s more…
Joining streaming data with static data in Apache Spark Structured Streaming and Delta Lake
Getting ready
How to do it…
There’s more…
See also
Joining streaming data with streaming data in Apache Spark Structured Streaming and Delta Lake
Getting ready
How to do it…
There’s more…
See also
Monitoring real-time data processing with Apache Spark Structured Streaming
Getting ready
How to do it…
There’s more…
See also
Chapter 6: Performance Tuning with Apache Spark
Technical requirements
Monitoring Spark jobs in the Spark UI
How to do it…
See also
Using broadcast variables
How to do it…
How it works…
There’s more…
Optimizing Spark jobs by minimizing data shuffling
How to do it…
See also
Avoiding data skew
How to do it…
There’s more...
Caching and persistence
How to do it…
There’s more…
Partitioning and repartitioning
How to do it…
There’s more…
Optimizing join strategies
How to do it…
See also
Chapter 7: Performance Tuning in Delta Lake
Technical requirements
Optimizing Delta Lake table partitioning for query performance
How to do it…
There’s more…
See also
Organizing data with Z-ordering for efficient query execution
How to do it…
How it works…
See also
Skipping data for faster query execution
How to do it…
See also
Reducing Delta Lake table size and I/O cost with compression
How to do it…
How it works…
See also
Part 2 – Data Engineering Capabilities within Databricks
Chapter 8: Orchestration and Scheduling Data Pipeline with Databricks Workflows
Technical requirements
Building Databricks workflows
How to do it…
See also
Running and managing Databricks Workflows
How to do it...
See also
Passing task and job parameters within a Databricks Workflow
How to do it...
See also
Conditional branching in Databricks Workflows
How to do it...
See also
Triggering jobs based on file arrival
Getting ready
How to do it…
See also
Setting up workflow alerts and notifications
How to do it…
There’s more…
See also
Troubleshooting and repairing failures in Databricks Workflows
How to do it...
See also
Chapter 9: Building Data Pipelines with Delta Live Tables
Technical requirements
Creating a multi-hop medallion architecture data pipeline with Delta Live Tables in Databricks
How to do it…
How it works…
See also
Building a data pipeline with Delta Live Tables on Databricks
How to do it…
See also
Implementing data quality and validation rules with Delta Live Tables in Databricks
How to do it…
How it works…
See also
Quarantining bad data with Delta Live Tables in Databricks
How to do it…
See also
Monitoring Delta Live Tables pipelines
How to do it…
See also
Deploying Delta Live Tables pipelines with Databricks Asset Bundles
Getting ready
How to do it…
There’s more…
See also
Applying changes (CDC) to Delta tables with Delta Live Tables
How to do it…
See also
Chapter 10: Data Governance with Unity Catalog
Technical requirements
Connecting to cloud object storage using Unity Catalog
Getting ready
How to do it…
See also
Creating and managing catalogs, schemas, volumes, and tables using Unity Catalog
Getting ready
How to do it…
See also
Defining and applying fine-grained access control policies using Unity Catalog
Getting ready
How to do it…
See also
Tagging, commenting, and capturing metadata about data and AI assets using Databricks Unity Catalog
Getting ready
How to do it…
See also
Filtering sensitive data with Unity Catalog
Getting ready
How to do it…
See also
Using Unity Catalog's lineage data for debugging, root cause analysis, and impact assessment
Getting ready
How to do it…
See also
Accessing and querying system tables using Unity Catalog
Getting ready
How to do it…
See also
Chapter 11: Implementing DataOps and DevOps on Databricks
Technical requirements
Using Databricks Repos to store code in Git
Getting ready
How to do it…
There’s more…
See also
Automating tasks by using the Databricks CLI
Getting ready
How to do it…
There’s more…
See also
Using the Databricks VSCode extension for local development and testing
Getting ready
How to do it…
See also
Using Databricks Asset Bundles (DABs)
Getting ready
How to do it…
See also
Leveraging GitHub Actions with Databricks Asset Bundles (DABs)
Getting ready
How to do it…
See also
Index
About Packt
Other Books You May Enjoy
📜 Similar Books
Design and implement a modern data lakehouse on the Azure Data Platform using Delta Lake, Apache Spark, Azure Databricks, Azure Synapse Analytics, and Snowflake. This book teaches you the intricate details of the Data Lakehouse Paradigm and how to efficiently design a cloud-based data lakehouse…
Get to grips with building and productionizing end-to-end big data solutions in Azure and learn best practices for working with large datasets. Key Features: Integrate with Azure Synapse Analytics, Cosmos DB, and Azure HDInsight Kafka Cluster to scale and analyze your projects…