Hands-on Guide to Apache Spark 3: Build Scalable Computing Engines for Batch and Stream Data Processing
✍ Written by Alfonso Antolínez García
- Publisher: Apress
- Language: English
- Pages: 404
- Category: Library
✦ Synopsis
This book explains how to scale Apache Spark 3 to handle massive amounts of data, through either batch or stream processing. It shows how to use Spark’s structured APIs to perform complex data transformations and analyses that you can use to implement end-to-end analytics workflows, and it covers Spark 3's new features, theoretical foundations, and application architecture.

The first section introduces the Apache Spark ecosystem as a unified engine for large-scale data analytics and shows you how to run and fine-tune your first application in Spark. The second section centers on batch processing suited to end-of-cycle processing and on data ingestion through files and databases; it explains the Spark DataFrame API as well as working with structured and unstructured data in Apache Spark. The last section deals with scalable, high-throughput, fault-tolerant streaming workloads for processing real-time data. Here you'll learn about Apache Spark Streaming’s execution model, the architecture of Spark Streaming, and monitoring, reporting, and recovering Spark Streaming applications. A full chapter is devoted to future directions for Spark Streaming.

With real-world use cases, code snippets, and notebooks hosted on GitHub, this book will give you an understanding of large-scale data analysis concepts and help you put them to use.
Upon completing this book, you will have the knowledge and skills to seamlessly implement large-scale batch and streaming workloads to analyze real-time data streams with Apache Spark.
What You Will Learn
- Master the concepts of Spark clusters and batch data processing
- Understand data ingestion, transformation, and data storage
- Gain insight into essential stream processing concepts and different streaming architectures
- Implement streaming jobs and applications with Spark Streaming
Who This Book Is For
Data engineers, data analysts, machine learning engineers, and Python and R programmers.
✦ Table of Contents
About the Author
About the Technical Reviewer
Part I: Apache Spark Batch Data Processing
Chapter 1: Introduction to Apache Spark for Large-Scale Data Analytics
1.1 What Is Apache Spark?
Simpler to Use and Operate
Fast
Scalable
Ease of Use
Fault Tolerance at Scale
1.2 Spark Unified Analytics Engine
1.3 How Apache Spark Works
Spark Application Model
Spark Execution Model
Spark Cluster Model
1.4 Apache Spark Ecosystem
Spark Core
Spark APIs
Spark SQL and DataFrames and Datasets
Spark Streaming
Spark GraphX
1.5 Batch vs. Streaming Data
What Is Batch Data Processing?
What Is Stream Data Processing?
Difference Between Stream Processing and Batch Processing
1.6 Summary
Chapter 2: Getting Started with Apache Spark
2.1 Downloading and Installing Apache Spark
Installation of Apache Spark on Linux
Step 1: Verifying the Java Installation
Step 2: Installing Spark
Step 3: Moving Spark Software Files
Step 4: Setting Up the Environment for Spark
Step 5: Verifying the Spark Installation
Installation of Apache Spark on Windows
Step 1: Java Installation
Step 2: Download Apache Spark
Step 3: Install Apache Spark
Step 4: Download the winutils File for Hadoop
Step 5: Configure System and Environment Variables
2.2 Hands-On Spark Shell
Using the Spark Shell Command
The Scala Shell Command Line
The Pyspark Shell Command Line
Running Self-Contained Applications with the spark-submit Command
Spark Submit Options
Deployment Mode Options
Cluster Manager Options
Tuning Resource Allocation
Dynamically Loading Spark Submit Configurations
Application Properties
Runtime Environment
Dynamic Allocation
Others
2.3 Spark Application Concepts
Spark Application and SparkSession
Access the Existing SparkSession
SparkSession in spark-shell
Create a SparkSession Programmatically
2.4 Transformations, Actions, Immutability, and Lazy Evaluation
Transformations
Narrow Transformations
Wide Transformations
Actions
2.5 Summary
Chapter 3: Spark Low-Level API
3.1 Resilient Distributed Datasets (RDDs)
Creating RDDs from Parallelized Collections
Creating RDDs from External Datasets
Creating RDDs from Existing RDDs
3.2 Working with Key-Value Pairs
Creating Pair RDDs
Showing the Distinct Keys of a Pair RDD
Transformations on Pair RDDs
Sorting Pair RDDs by Key
Adding Values by Key in an RDD
Saving an RDD as a Text File
Combining Data at Each Partition
Merging Values with a Neutral ZeroValue
Combining Elements by Key Using Custom Aggregation Functions
Grouping of Data on Pair RDDs
Performance Considerations of groupByKey
Joins on Pair RDDs
Returning the Keys Present in Both RDDs
Returning the Keys Present in the Source RDD
Returning the Keys Present in the Parameter RDD
Sorting Data on Pair RDDs
Sorting an RDD by a Given Key Function
Actions on Pair RDDs
Count RDD Instances by Key
Count RDD Instances by Value
Returning Key-Value Pairs as a Dictionary
Collecting All Values Associated With a Key
3.3 Spark Shared Variables: Broadcasts and Accumulators
Broadcast Variables
When to Use Broadcast Variables
How to Create a Broadcast Variable
Accumulators
3.4 When to Use RDDs
3.5 Summary
Chapter 4: The Spark High-Level APIs
4.1 Spark Dataframes
Attributes of Spark DataFrames
Methods for Creating Spark DataFrames
Manually Creating a Spark DataFrame Using toDF()
Manually Creating a Spark DataFrame Using createDataFrame()
Data Sources for Creating Spark DataFrames
Querying Files Using SQL
Ignoring Corrupt and Missing Files
Time-Based Paths
Specifying Save Options
NOTICE
Read and Write Apache Parquet Files
Saving and Data Compression of a DataFrame to a Parquet File Format
Direct Queries on Parquet Files
Parquet File Partitioning
Reading Parquet File Partitions
Read and Write JSON Files with Spark
Reading Multiple JSON Files at Once
Reading JSON Files Based on Patterns at Once
Direct Queries on JSON Files
Saving a DataFrame to a JSON File
Saving Modes
Load JSON Files Based on Customized Schemas
Work with Complex Nested JSON Structures Using Spark
Read and Write CSV Files with Spark
Read and Write Hive Tables
Read and Write Data via JDBC from and to Databases
4.2 Use of Spark DataFrames
Select DataFrame Columns
Selecting All or Specific DataFrame Columns
Select Columns Based on Name Patterns
Filtering Results of a Query Based on One or Multiple Conditions
Using Different Column Name Notations
Using Logical Operators for Multi-condition Filtering
Manipulating Spark DataFrame Columns
Renaming DataFrame Columns
Dropping DataFrame Columns
Creating a New Dataframe Column Dependent on Another Column
User-Defined Functions (UDFs)
Merging DataFrames with Union and UnionByName
Merging DataFrames with Duplicate Values
Joining DataFrames with Join
4.3 Spark Cache and Persist of Data
Unpersisting Cached Data
4.4 Summary
Chapter 5: Spark Dataset API and Adaptive Query Execution
5.1 What Are Spark Datasets?
5.2 Methods for Creating Spark Datasets
5.3 Adaptive Query Execution
5.4 Data-Dependent Adaptive Determination of the Shuffle Partition Number
5.5 Runtime Replanning of Join Strategies
5.6 Optimization of Unevenly Distributed Data Joins
5.7 Enabling the Adaptive Query Execution (AQE)
5.8 Summary
Chapter 6: Introduction to Apache Spark Streaming
6.1 Real-Time Analytics of Bound and Unbound Data
6.2 Challenges of Stream Processing
6.3 The Uncertainty Component of Data Streams
6.4 Apache Spark Streaming’s Execution Model
6.5 Stream Processing Architectures
The Lambda Architecture
Data Ingestion Layer
Batch Processing Layer
Real-Time or Speed Layer
Serving Layer
Pros and Cons of the Lambda Architecture
The Kappa Architecture
Pros and Cons of the Kappa Architecture
6.6 Spark Streaming Architecture: Discretized Streams
6.7 Spark Streaming Sources and Receivers
Basic Input Sources
Socket Connection Streaming Sources
Running Socket Streaming Applications Locally
Improving Our Data Analytics with Spark Streaming Transformations
File System Streaming Sources
How Spark Monitors File Systems
Running File System Streaming Applications Locally
Known Issues While Dealing with Object Stores Data Sources
Advanced Input Sources
6.8 Spark Streaming Graceful Shutdown
6.9 Transformations on DStreams
6.10 Summary
Part II: Apache Spark Streaming
Chapter 7: Spark Structured Streaming
7.1 General Rules for Message Delivery Reliability
7.2 Structured Streaming vs. Spark Streaming
7.3 What Is Apache Spark Structured Streaming?
Spark Structured Streaming Input Table
Spark Structured Streaming Result Table
Spark Structured Streaming Output Modes
7.4 Datasets and DataFrames Streaming API
Socket Structured Streaming Sources
Running Socket Structured Streaming Applications Locally
File System Structured Streaming Sources
Running File System Streaming Applications Locally
7.5 Spark Structured Streaming Transformations
Streaming State in Spark Structured Streaming
Spark Stateless Streaming
Spark Stateful Streaming
Stateful Streaming Aggregations
Time-Based Aggregations
No-Time-Based Aggregations
7.6 Spark Checkpointing Streaming
Recovering from Failures with Checkpointing
7.7 Summary
Chapter 8: Streaming Sources and Sinks
8.1 Spark Streaming Data Sources
Reading Streaming Data from File Data Sources
Reading Streaming Data from Kafka
Reading Streaming Data from MongoDB
8.2 Spark Streaming Data Sinks
Writing Streaming Data to the Console Sink
Writing Streaming Data to the File Sink
Writing Streaming Data to the Kafka Sink
Writing Streaming Data to the ForeachBatch Sink
Writing Streaming Data to the Foreach Sink
Writing Streaming Data to Other Data Sinks
8.3 Summary
Chapter 9: Event-Time Window Operations and Watermarking
9.1 Event-Time Processing
9.2 Stream Temporal Windows in Apache Spark
What Are Temporal Windows and Why Are They Important in Streaming
9.3 Tumbling Windows
9.4 Sliding Windows
9.5 Session Windows
Session Window with Dynamic Gap
9.6 Watermarking in Spark Structured Streaming
What Is a Watermark?
9.7 Summary
Chapter 10: Future Directions for Spark Streaming
10.1 Streaming Machine Learning with Spark
What Is Logistic Regression?
Types of Logistic Regression
Use Cases of Logistic Regression
Assessing the Sensitivity and Specificity of Our Streaming ML Model
10.2 Spark 3.3.x
Spark RocksDB State Store Database
What Is RocksDB?
BlockBasedTableConfig
RocksDB Possible Performance Degradation
10.3 The Project Lightspeed
Predictable Low Latency
Enhanced Functionality for Processing Data/Events
New Ecosystem of Connectors
Improve Operations and Troubleshooting
10.4 Summary
Bibliography
Index