𝔖 Scriptorium
✦   LIBER   ✦


Data Engineering with Scala and Spark: A practical guide helping you build streaming and batch pipelines that process massive amounts of data using Scala

✍ Scribed by Eric Tome, Rupam Bhattacharjee, David Radford


Publisher
Packt Publishing Pvt Ltd
Year
2024
Tongue
English
Leaves
323
Category
Library


✦ Synopsis


Most data engineers know that performance problems in a distributed computing environment can easily undermine the overall efficiency and effectiveness of data engineering tasks. While Python remains a popular choice for data engineering due to its ease of use, Scala shines in scenarios where the performance of distributed data processing is paramount.

This book will teach you how to leverage the Scala programming language on the Spark framework and use the latest cloud technologies to build continuous and triggered data pipelines. You'll do this by setting up a data engineering environment for local development and scalable distributed cloud deployments, using data engineering best practices, test-driven development, and CI/CD. You'll also get to grips with the DataFrame, Dataset, and Spark SQL APIs and how to use them. Data profiling and quality in Scala are also covered, alongside techniques for orchestrating and performance-tuning your end-to-end pipelines to deliver data to your end users.

By the end of this book, you will be able to build streaming and batch data pipelines using Scala while following software engineering best practices.
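As a small taste of the Scala foundations the book builds on in its opening chapter (higher-order functions, the Option type, and pattern matching), here is a minimal, standard-library-only sketch; the names and logic are illustrative, not taken from the book's code:

```scala
// Illustrative Scala sketch: higher-order functions, Option, and pattern matching.
object ScalaEssentials {
  // A higher-order function: takes another function as an argument.
  def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

  // Option with pattern matching instead of null checks, using a guard clause.
  def describe(maybeCount: Option[Int]): String = maybeCount match {
    case Some(n) if n > 100 => s"large batch of $n records"
    case Some(n)            => s"small batch of $n records"
    case None               => "no records"
  }

  def main(args: Array[String]): Unit = {
    println(applyTwice(_ + 3, 4))  // (4 + 3) + 3 = 10
    println(describe(Some(250)))   // matches the guarded case
    println(describe(None))        // matches the None case
  }
}
```

The same functional style (passing functions as values, exhaustively matching on cases) carries directly over to Spark's Dataset transformations later in the book.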

✦ Table of Contents


Data Engineering with Scala and Spark
Contributors
About the authors
About the reviewers
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Conventions used
Get in touch
Share Your Thoughts
Download a free PDF copy of this book
Part 1 – Introduction to Data Engineering, Scala, and an Environment Setup
1. Scala Essentials for Data Engineers
Technical requirements
Understanding functional programming
Understanding objects, classes, and traits
Classes
Object
Trait
Working with higher-order functions (HOFs)
Examples of HOFs from the Scala collection library
Understanding polymorphic functions
Variance
Option type
Collections
Understanding pattern matching
Wildcard patterns
Constant patterns
Variable patterns
Constructor patterns
Sequence patterns
Tuple patterns
Typed patterns
Implicits in Scala
Summary
Further reading
2. Environment Setup
Technical requirements
Setting up a cloud environment
Leveraging cloud object storage
Using Databricks
Local environment setup
The build tool
Summary
Further reading
Part 2 – Data Ingestion, Transformation, Cleansing, and Profiling Using Scala and Spark
3. An Introduction to Apache Spark and Its APIs – DataFrame, Dataset, and Spark SQL
Technical requirements
Working with Apache Spark
How do Spark applications work?
What happens on executors?
Creating a Spark application using Scala
Spark stages
Shuffling
Understanding the Spark Dataset API
Understanding the Spark DataFrame API
Spark SQL
The select function
Creating temporary views
Summary
4. Working with Databases
Technical requirements
Understanding the Spark JDBC API
Working with the Spark JDBC API
Loading the database configuration
Creating a database interface
Creating a factory method for SparkSession
Performing various database operations
Working with databases
Updating the Database API with Spark read and write
Summary
5. Object Stores and Data Lakes
Understanding distributed file systems
Data lakes
Object stores
Streaming data
Working with streaming sources
Processing and sinks
Aggregating streams
Summary
6. Understanding Data Transformation
Technical requirements
Understanding the difference between transformations and actions
Using Select and SelectExpr
Filtering and sorting
Learning how to aggregate, group, and join data
Leveraging advanced window functions
Working with complex dataset types
Summary
7. Data Profiling and Data Quality
Technical requirements
Understanding components of Deequ
Performing data analysis
Leveraging automatic constraint suggestion
Defining constraints
Storing metrics using MetricsRepository
Detecting anomalies
Summary
Part 3 – Software Engineering Best Practices for Data Engineering in Scala
8. Test-Driven Development, Code Health, and Maintainability
Technical requirements
Introducing TDD
Creating unit tests
Performing integration testing
Checking code coverage
Running static code analysis
Installing SonarQube locally
Creating a project
Running SonarScanner
Understanding linting and code style
Linting code with WartRemover
Formatting code using scalafmt
Summary
9. CI/CD with GitHub
Technical requirements
Introducing CI/CD and GitHub
Understanding Continuous Integration (CI)
Understanding Continuous Delivery (CD)
Understanding the big picture of CI/CD
Working with GitHub
Cloning a repository
Understanding branches
Writing, committing, and pushing code
Creating pull requests
Reviewing and merging pull requests
Understanding GitHub Actions
Workflows
Jobs
Steps
Summary
Part 4 – Productionalizing Data Engineering Pipelines – Orchestration and Tuning
10. Data Pipeline Orchestration
Technical requirements
Understanding the basics of orchestration
Understanding core features of Apache Airflow
Apache Airflow’s extensibility
Extending beyond operators
Monitoring and UI
Hosting and deployment options
Designing data pipelines with Airflow
Working with Argo Workflows
Installing Argo Workflows
Understanding the core components of Argo Workflows
Taking a short detour
Creating an Argo workflow
Using Databricks Workflows
Leveraging Azure Data Factory
Primary components of ADF
Summary
11. Performance Tuning
Introducing the Spark UI
Navigating the Spark UI
The Jobs tab – overview of job execution
Leveraging the Spark UI for performance tuning
Identifying performance bottlenecks
Optimizing data shuffling
Memory management and garbage collection
Scaling resources
Analyzing SQL query performance
Right-sizing compute resources
Understanding the basics
Understanding data skewing, indexing, and partitioning
Data skew
Indexing and partitioning
Summary
Part 5 – End-to-End Data Pipelines
12. Building Batch Pipelines Using Spark and Scala
Understanding our business use case
What’s our marketing use case?
Understanding the data
Understanding the medallion architecture
The end-to-end pipeline
Ingesting the data
Transforming the data
Checking data quality
Creating a serving layer
Orchestrating our batch process
Summary
13. Building Streaming Pipelines Using Spark and Scala
Understanding our business use case
What’s our IoT use case?
Understanding the data
The end-to-end pipeline
Ingesting the data
Transforming the data
Creating a serving layer
Orchestrating our streaming process
Summary
Index
Why subscribe?
Other Books You May Enjoy
Packt is searching for authors like you
Share Your Thoughts
Download a free PDF copy of this book


📜 SIMILAR VOLUMES


Building Big Data Pipelines with Apache
✍ Jan Lukavský 📂 Library 📅 2022 🏛 Packt Publishing 🌐 English

Implement, run, operate, and test data processing pipelines using Apache Beam. Key Features: understand how to improve usability and productivity when implementing Beam pipelines; learn how to use stateful processing to implement complex use cases using Apache Beam…

Big Data for Chimps: A Guide to Massive-
✍ Philip (flip) Kromer, Russell Jurney 📂 Library 📅 2015 🏛 O'Reilly Media 🌐 English

Finding patterns in massive event streams can be difficult, but learning how to find them doesn't have to be. This unique hands-on guide shows you how to solve this and many other problems in large-scale data processing with simple, fun, and elegant tools that leverage Apache Hadoop. You'll…