Foundations of Data Intensive Applications: Large Scale Data Analytics under the Hood

✍ Scribed by Supun Kamburugamuve, Saliya Ekanayake

Publisher: Wiley
Year: 2021
Tongue: English
Leaves: 415
Edition: 1
Category: Library

✦ Synopsis

PEEK “UNDER THE HOOD” OF BIG DATA ANALYTICS

The world of big data analytics grows ever more complex. And while many people can work superficially with specific frameworks, far fewer understand the fundamental principles of large-scale, distributed data processing systems and how they operate. In Foundations of Data Intensive Applications: Large Scale Data Analytics under the Hood, renowned big-data experts and computer scientists Drs. Supun Kamburugamuve and Saliya Ekanayake deliver a practical guide to applying the principles of big data to software development for optimal performance.

The authors discuss foundational components of large-scale data systems and walk readers through the major software design decisions that define performance, application type, and usability. You’ll learn how to recognize problems in your applications resulting in performance and distributed operation issues, diagnose them, and effectively eliminate them by relying on the bedrock big data principles explained within.

Moving beyond individual frameworks and APIs for data processing, this book unlocks the theoretical ideas that operate under the hood of every big data processing system.

Ideal for data scientists, data architects, dev-ops engineers, and developers, Foundations of Data Intensive Applications: Large Scale Data Analytics under the Hood shows readers how to:

Identify the foundations of large-scale, distributed data processing systems

Make major software design decisions that optimize performance

Diagnose performance problems and distributed operation issues

Understand state-of-the-art research in big data

Explain and use the major big data frameworks and understand what underpins them

Use big data analytics in the real world to solve practical problems

✦ Table of Contents

Cover
Title Page
Copyright Page
About the Authors
About the Editor
Acknowledgments
Contents at a Glance
Contents
Introduction
History of Data-Intensive Applications
Data Processing Architecture
Foundations of Data-Intensive Applications
Who Should Read This Book?
Organization of the Book
Scope of the Book
References
Chapter 1 Data Intensive Applications
Anatomy of a Data-Intensive Application
A Histogram Example
Program
Process Management
Communication
Execution
Data Structures
Putting It Together
Application
Resource Management
Messaging
Data Structures
Tasks and Execution
Fault Tolerance
Remote Execution
Parallel Applications
Serial Applications
Lloyd’s K-Means Algorithm
Parallelizing Algorithms
Decomposition
Task Assignment
Orchestration
Mapping
K-Means Algorithm
Parallel and Distributed Computing
Memory Abstractions
Shared Memory
Distributed Memory
Hybrid (Shared + Distributed) Memory
Partitioned Global Address Space Memory
Application Classes and Frameworks
Parallel Interaction Patterns
Pleasingly Parallel
Dataflow
Iterative
Irregular
Data Abstractions
Data-Intensive Frameworks
Components
Workflows
An Example
What Makes It Difficult?
Developing Applications
Concurrency
Data Partitioning
Debugging
Diverse Environments
Computer Networks
Synchronization
Thread Synchronization
Data Synchronization
Ordering of Events
Faults
Consensus
Summary
References
Chapter 2 Data and Storage
Storage Systems
Storage for Distributed Systems
Direct-Attached Storage
Storage Area Network
Network-Attached Storage
DAS or SAN or NAS?
Storage Abstractions
Block Storage
File Systems
Object Storage
Data Formats
XML
JSON
CSV
Apache Parquet
Apache Avro
Avro Data Definitions (Schema)
Code Generation
Without Code Generation
Avro File
Schema Evolution
Protocol Buffers, Flat Buffers, and Thrift
Data Replication
Synchronous and Asynchronous Replication
Single-Leader and Multileader Replication
Data Locality
Disadvantages of Replication
Data Partitioning
Vertical Partitioning
Horizontal Partitioning (Sharding)
Hybrid Partitioning
Considerations for Partitioning
NoSQL Databases
Data Models
Key-Value Databases
Document Databases
Wide Column Databases
Graph Databases
CAP Theorem
Message Queuing
Message Processing Guarantees
Durability of Messages
Acknowledgments
Storage First Brokers and Transient Brokers
Summary
References
Chapter 3 Computing Resources
A Demonstration
Computer Clusters
Anatomy of a Computer Cluster
Data Analytics in Clusters
Dedicated Clusters
Classic Parallel Systems
Big Data Systems
Shared Clusters
OpenMPI on a Slurm Cluster
Spark on a Yarn Cluster
Distributed Application Life Cycle
Life Cycle Steps
Step 1: Preparation of the Job Package
Step 2: Resource Acquisition
Step 3: Distributing the Application (Job) Artifacts
Step 4: Bootstrapping the Distributed Environment
Step 5: Monitoring
Step 6: Termination
Computing Resources
Data Centers
Physical Machines
Network
Virtual Machines
Containers
Processor, Random Access Memory, and Cache
Cache
Multiple Processors in a Computer
Nonuniform Memory Access
Uniform Memory Access
Hard Disk
GPUs
Mapping Resources to Applications
Cluster Resource Managers
Kubernetes
Kubernetes Architecture
Kubernetes Application Concepts
Data-Intensive Applications on Kubernetes
Slurm
Yarn
Job Scheduling
Scheduling Policy
Objective Functions
Throughput and Latency
Priorities
Lowering Distance Among the Processes
Data Locality
Completion Deadline
Algorithms
First in First Out
Gang Scheduling
List Scheduling
Backfill Scheduling
Summary
References
Chapter 4 Data Structures
Virtual Memory
Paging and TLB
Cache
The Need for Data Structures
Cache and Memory Layout
Memory Fragmentation
Data Transfer
Data Transfer Between Frameworks
Cross-Language Data Transfer
Object and Text Data
Serialization
Vectors and Matrices
1D Vectors
Matrices
Row-Major and Column-Major Formats
N-Dimensional Arrays/Tensors
NumPy
Sparse Matrices
Table
Table Formats
Column Data Format
Row Data Format
Apache Arrow
Arrow Data Format
Primitive Types
Variable-Length Data
Arrow Serialization
Arrow Example
Pandas DataFrame
Column vs. Row Tables
Summary
References
Chapter 5 Programming Models
Introduction
Parallel Programming Models
Parallel Process Interaction
Problem Decomposition
Data Structures
Data Structures and Operations
Data Types
Local Operations
Distributed Operations
Array
Tensor
Indexing
Slicing
Broadcasting
Table
Graph Data
Message Passing Model
Model
Message Passing Frameworks
Message Passing Interface
Bulk Synchronous Parallel
K-Means
Distributed Data Model
Eager Model
Dataflow Model
Data Frames, Datasets, and Tables
Input and Output
Task Graphs (Dataflow Graphs)
Model
User Program to Task Graph
Tasks and Functions
Source Task
Compute Task
Implicit vs. Explicit Parallel Models
Remote Execution
Components
Batch Dataflow
Data Abstractions
Table Abstraction
Matrix/Tensors
Functions
Source
Compute
Sink
An Example
Caching State
Evaluation Strategy
Lazy Evaluation
Eager Evaluation
Iterative Computations
DOALL Parallel
DOACROSS Parallel
Pipeline Parallel
Task Graph Models for Iterative Computations
K-Means Algorithm
Streaming Dataflow
Data Abstractions
Streams
Distributed Operations
Streaming Functions
Sources
Compute
Sink
An Example
Windowing
Windowing Strategies
Operations on Windows
Handling Late Events
SQL
Queries
Summary
References
Chapter 6 Messaging
Network Services
TCP/IP
RDMA
Messaging for Data Analytics
Anatomy of a Message
Data Packing
Protocol
Message Types
Control Messages
External Data Sources
Data Transfer Messages
Distributed Operations
How Are They Used?
Task Graph
Parallel Processes
Anatomy of a Distributed Operation
Data Abstractions
Distributed Operation API
Streaming and Batch Operations
Streaming Operations
Batch Operations
Distributed Operations on Arrays
Broadcast
Reduce and AllReduce
Gather and AllGather
Scatter
AllToAll
Optimized Operations
Broadcast
Reduce
AllReduce
Gather and AllGather Collective Algorithms
Scatter and AllToAll Collective Algorithms
Distributed Operations on Tables
Shuffle
Partitioning Data
Handling Large Data
Fetch-Based Algorithm (Asynchronous Algorithm)
Distributed Synchronization Algorithm
GroupBy
Aggregate
Join
Join Algorithms
Distributed Joins
Performance of Joins
More Operations
Advanced Topics
Data Packing
Memory Considerations
Message Coalescing
Compression
Stragglers
Nonblocking vs. Blocking Operations
Blocking Operations
Nonblocking Operations
Summary
References
Chapter 7 Parallel Tasks
CPUs
Cache
False Sharing
Vectorization
Threads and Processes
Concurrency and Parallelism
Context Switches and Scheduling
Mutual Exclusion
User-Level Threads
Process Affinity
NUMA-Aware Programming
Accelerators
Task Execution
Scheduling
Static Scheduling
Dynamic Scheduling
Loosely Synchronous and Asynchronous Execution
Loosely Synchronous Parallel System
Asynchronous Parallel System (Fully Distributed)
Actor Model
Asynchronous Messages
Actor Frameworks
Execution Models
Process Model
Thread Model
Remote Execution
Tasks for Data Analytics
SPMD and MPMD Execution
Batch Tasks
Data Partitions
Operations
Task Graph Scheduling
Threads, CPU Cores, and Partitions
Data Locality
Execution
Streaming Execution
State
Immutable Data
State in Driver
Distributed State
Streaming Tasks
Streams and Data Partitioning
Partitions
Operations
Scheduling
Uniform Resources
Resource-Aware Scheduling
Execution
Dynamic Scaling
Back Pressure (Flow Control)
Rate-Based Flow Control
Credit-Based Flow Control
State
Summary
References
Chapter 8 Case Studies
Apache Hadoop
Programming Model
Architecture
Cluster Resource Management
Apache Spark
Programming Model
RDD API
SQL, DataFrames, and DataSets
Architecture
Resource Managers
Task Schedulers
Executors
Communication Operations
Apache Spark Streaming
Apache Storm
Programming Model
Architecture
Cluster Resource Managers
Communication Operations
Kafka Streams
Programming Model
Architecture
PyTorch
Programming Model
Execution
Cylon
Programming Model
Architecture
Execution
Communication Operations
Rapids cuDF
Programming Model
Architecture
Summary
References
Chapter 9 Fault Tolerance
Dependable Systems and Failures
Fault Tolerance Is Not Free
Dependable Systems
Failures
Process Failures
Network Failures
Node Failures
Byzantine Faults
Failure Models
Failure Detection
Recovering from Faults
Recovery Methods
Stateless Programs
Batch Systems
Streaming Systems
Processing Guarantees
Role of Cluster Resource Managers
Checkpointing
State
Consistent Global State
Uncoordinated Checkpointing
Coordinated Checkpointing
Chandy-Lamport Algorithm
Batch Systems
When to Checkpoint?
Snapshot Data
Streaming Systems
Case Study: Apache Storm
Message Tracking
Failure Recovery
Case Study: Apache Flink
Checkpointing
Failure Recovery
Batch Systems
Iterative Programs
Case Study: Apache Spark
RDD Recomputing
Checkpointing
Recovery from Failures
Summary
References
Chapter 10 Performance and Productivity
Performance Metrics
System Performance Metrics
Parallel Performance Metrics
Speedup
Strong Scaling
Weak Scaling
Parallel Efficiency
Amdahl’s Law
Gustafson’s Law
Throughput
Latency
Benchmarks
LINPACK Benchmark
NAS Parallel Benchmark
BigDataBench
TPC Benchmarks
HiBench
Performance Factors
Memory
Execution
Distributed Operators
Disk I/O
Garbage Collection
Finding Issues
Serial Programs
Profiling
Scaling
Strong Scaling
Weak Scaling
Debugging Distributed Applications
Programming Languages
C/C++
Java
Memory Management
Data Structures
Interfacing with Python
Python
C/C++ Code Integration
Productivity
Choice of Frameworks
Operating Environment
CPUs and GPUs
Public Clouds
Future of Data-Intensive Applications
Summary
References
Index
EULA

📜 SIMILAR VOLUMES

Large-Scale Data Analytics

📁 Large-Scale Data Analytics

✍ Sherif Sakr, Anna Liu (auth.), Aris Gkoulalas-Divanis, Abderrahim Labbi (eds.) 📂 Library 📅 2014 🏛 Springer-Verlag New York 🌐 English

This edited book collects state-of-the-art research related to large-scale data analytics that has been accomplished over the last few years. This is among the first books devoted to this important area based on contributions from diverse scientific areas such as databases, data mining, superc

Large Scale Data Analytics

📁 Large Scale Data Analytics

✍ Chung Yik Cho, Rong Kun Jason Tan, John A. Leong, Amandeep S. Sidhu 📂 Library 📅 2019 🏛 Springer International Publishing 🌐 English

This book presents a language integrated query framework for big data. The continuous, rapid growth of data information to volumes of up to terabytes (1,024 gigabytes) or petabytes (1,048,576 gigabytes) means that the need for a system to manage and query information from large scale data sources

Data Just Right: Introduction to Large-S

📁 Data Just Right: Introduction to Large-Scale Data & Analytics

✍ Michael Manoochehri 📂 Library 🏛 Addison-Wesley 🌐 English

The array of tools for collecting, storing, and gaining insight from data is huge and getting bigger every day. For people entering the field, that means digging through hundreds of Web sites and dozens of books to get the basics of working with data at scale. That’s why this book is a great a

Real-Time Data Analytics for Large Scale

📁 Real-Time Data Analytics for Large Scale Sensor Data

✍ Himansu Das, Nilanjan Dey, Valentina Emilia Balas 📂 Library 📅 2019 🏛 Academic Press 🌐 English

Real-Time Data Analytics for Large-Scale Sensor Data covers the theory and applications of hardware platforms and architectures, the development of software methods, techniques and tools, applications, governance and adoption strategies for the use of massive sensor data in real-time data

Data Just Right Introduction to Large-S

📁 Data Just Right Introduction to Large-Scale Data & Analytics

✍ Michael Manoochehri 📂 Library 📅 2013 🏛 Addison-Wesley Professional 🌐 English

Large-scale data analysis is now vitally important to virtually every business. Mobile and social technologies are generating massive datasets; distributed cloud computing offers the resources to store and analyze them; and professionals have radically new technologies at their command, including NoSQ