Learning Apache Flink
Scribed by Tanmay Deshpande
- Publisher: Packt Publishing
- Year: 2017
- Tongue: English
- Leaves: 275
- Category: Library
Synopsis
Learning Apache Flink
Table of Contents
Cover
Copyright
Credits
About the Author
About the Reviewers
www.PacktPub.com
Customer Feedback
Table of Contents
Preface
Chapter 1: Introduction to Apache Flink
History
Architecture
Distributed execution
Job Manager
Actor system
Scheduler
Check pointing
Task manager
Job client
Features
High performance
Exactly-once stateful computation
Flexible streaming windows
Fault tolerance
Memory management
Optimizer
Stream and batch in one platform
Libraries
Event time semantics
Quick start setup
Pre-requisite
Installing on Windows
Installing on Linux
Cluster setup
SSH configurations
Java installation
Flink installation
Configurations
Starting daemons
Adding additional Job/Task Managers
Stopping daemons and cluster
Running sample application
Summary
Chapter 2: Data Processing Using the DataStream API
Execution environment
Data sources
Socket-based
File-based
Transformations
Map
FlatMap
Filter
KeyBy
Reduce
Fold
Aggregations
Window
Global windows
Tumbling windows
Sliding windows
Session windows
WindowAll
Union
Window join
Split
Select
Project
Physical partitioning
Custom partitioning
Random partitioning
Rebalancing partitioning
Rescaling
Broadcasting
Data sinks
Event time and watermarks
Event time
Processing time
Ingestion time
Connectors
Kafka connector
Twitter connector
RabbitMQ connector
ElasticSearch connector
Embedded node mode
Transport client mode
Cassandra connector
Use case - sensor data analytics
Summary
Chapter 3: Data Processing Using the Batch Processing API
Data sources
File-based
Collection-based
Generic sources
Compressed files
Transformations
Map
Flat map
Filter
Project
Reduce on grouped datasets
Reduce on grouped datasets by field position key
Group combine
Aggregate on a grouped tuple dataset
MinBy on a grouped tuple dataset
MaxBy on a grouped tuple dataset
Reduce on full dataset
Group reduce on a full dataset
Aggregate on a full tuple dataset
MinBy on a full tuple dataset
MaxBy on a full tuple dataset
Distinct
Join
Cross
Union
Rebalance
Hash partition
Range partition
Sort partition
First-n
Broadcast variables
Data sinks
Connectors
Filesystems
HDFS
Amazon S3
Alluxio
Avro
Microsoft Azure storage
MongoDB
Iterations
Iterator operator
Delta iterator
Use case - Athletes data insights using Flink batch API
Summary
Chapter 4: Data Processing Using the Table API
Registering tables
Registering a dataset
Registering a datastream
Registering a table
Registering external table sources
CSV table source
Kafka JSON table source
Accessing the registered table
Operators
The select operator
The where operator
The filter operator
The as operator
The groupBy operator
The join operator
The leftOuterJoin operator
The rightOuterJoin operator
The fullOuterJoin operator
The union operator
The unionAll operator
The intersect operator
The intersectAll operator
The minus operator
The minusAll operator
The distinct operator
The orderBy operator
The limit operator
Data types
SQL
SQL on datastream
Supported SQL syntax
Scalar functions
Scalar functions in the table API
Scalar functions in SQL
Use case - Athletes data insights using Flink Table API
Summary
Chapter 5: Complex Event Processing
What is complex event processing?
Flink CEP
Event streams
Pattern API
Begin
Filter
Subtype
OR
Continuity
Strict continuity
Non-strict continuity
Within
Detecting patterns
Selecting from patterns
Select
flatSelect
Handling timed-out partial patterns
Use case - complex event processing on a temperature sensor
Summary
Chapter 6: Machine Learning Using FlinkML
What is machine learning?
Supervised learning
Regression
Classification
Unsupervised learning
Clustering
Association
Semi-supervised learning
FlinkML
Supported algorithms
Supervised learning
Support Vector Machine
Multiple Linear Regression
Optimization framework
Recommendations
Alternating Least Squares
Unsupervised learning
k Nearest Neighbour join
Utilities
Data pre processing and pipelines
Polynomial features
Standard scaler
MinMax scaler
Summary
Chapter 7: Flink Graph API - Gelly
What is a graph?
Flink graph API - Gelly
Graph representation
Graph nodes
Graph edges
Graph creation
From dataset of edges and vertices
From dataset of tuples representing edges
From CSV files
From collection lists
Graph properties
Graph transformations
Map
Translate
Filter
Join
Reverse
Undirected
Union
Intersect
Graph mutations
Neighborhood methods
Graph validation
Iterative graph processing
Vertex-Centric iterations
Scatter-Gather iterations
Gather-Sum-Apply iterations
Use case - Airport Travel Optimization
Summary
Chapter 8: Distributed Data Processing with Flink and Hadoop
Quick overview of Hadoop
HDFS
YARN
Flink on YARN
Configurations
Starting a Flink YARN session
Submitting a job to Flink
Stopping Flink YARN session
Running a single Flink job on YARN
Recovery behavior for Flink on YARN
Working details
Summary
Chapter 9: Deploying Flink on Cloud
Flink on Google Cloud
Installing Google Cloud SDK
Installing BDUtil
Launching a Flink cluster
Executing a sample job
Shutting down the cluster
Flink on AWS
Launching an EMR cluster
Installing Flink on EMR
Executing Flink on EMR-YARN
Starting a Flink YARN session
Executing Flink job on YARN session
Shutting down the cluster
Flink on EMR 5.3+
Using S3 in Flink applications
Summary
Chapter 10: Best Practices
Logging best practices
Configuring Log4j
Configuring Logback
Logging in applications
Using ParameterTool
From system properties
From command line arguments
From .properties file
Naming large TupleX types
Registering a custom serializer
Metrics
Registering metrics
Counters
Gauges
Histograms
Meters
Reporters
Monitoring REST API
Config API
Overview API
Overview of the jobs
Details of a specific job
User defined job configuration
Back pressure monitoring
Summary
Index