Simplify Big Data Analytics with Amazon EMR: A Beginner's Guide to Learning and Implementing Amazon EMR for Building Data Analytics Solutions

✍ Scribed by Sakti Mishra

Year: 2022
Tongue: English
Leaves: 430
Edition: 1
Category: Library

✦ Synopsis

Design scalable big data solutions using Hadoop, Spark, and AWS cloud native services

Key Features:
Build data pipelines that require distributed processing capabilities on a large volume of data
Discover the security features of EMR such as data protection and granular permission management
Explore best practices and optimization techniques for building data analytics solutions in Amazon EMR

Book Description:
Amazon EMR, formerly Amazon Elastic MapReduce, provides a managed Hadoop cluster in Amazon Web Services (AWS) that you can use to implement batch or streaming data pipelines. By gaining expertise in Amazon EMR, you can design and implement data analytics pipelines with persistent or transient EMR clusters in AWS. This book is a practical guide to Amazon EMR for building data pipelines.

You'll start by understanding the Amazon EMR architecture, cluster nodes, features, and deployment options, along with their pricing. Next, the book covers the various big data applications that EMR supports. You'll then focus on the advanced configuration of EMR applications, hardware, networking, security, troubleshooting, logging, and the different SDKs and APIs it provides. Later chapters will show you how to implement common Amazon EMR use cases, including batch ETL with Spark, real-time streaming with Spark Streaming, and handling UPSERT in S3 Data Lake with Apache Hudi. Finally, you'll orchestrate your EMR jobs and strategize on-premises Hadoop cluster migration to EMR. In addition to this, you'll explore best practices and cost optimization techniques while implementing your data analytics pipeline in EMR.

By the end of this book, you'll be able to build and deploy Hadoop- or Spark-based apps on Amazon EMR and also migrate your existing on-premises Hadoop workloads to AWS.

What You Will Learn:
Explore Amazon EMR features, architecture, Hadoop interfaces, and EMR Studio
Configure, deploy, and orchestrate Hadoop or Spark jobs in production
Implement the security, data governance, and monitoring capabilities of EMR
Build applications for batch and real-time streaming data analytics solutions
Perform interactive development with a persistent EMR cluster and Notebook
Orchestrate an EMR Spark job using AWS Step Functions and Apache Airflow

Who this book is for:
This book is for data engineers, data analysts, data scientists, and solution architects who are interested in building data analytics solutions with the Hadoop ecosystem services and Amazon EMR. Prior experience in either Python programming, Scala, or the Java programming language and a basic understanding of Hadoop and AWS will help you make the most out of this book.

✦ Table of Contents

Cover
Copyright
Contributors
Table of Contents
Preface
Section 1: Overview, Architecture, Big Data Applications, and Common Use Cases of Amazon EMR
Chapter 1: An Overview of Amazon EMR
What is Amazon EMR?
What is big data?
Hadoop – processing framework to handle big data
Overview of Amazon EMR – managed and scalable Hadoop cluster in AWS
A brief history of the major big data releases
Benefits of Amazon EMR
Decoupling compute and storage
Persistent versus transient clusters
Integration with other AWS services
Amazon S3 with EMR File System (EMRFS)
Amazon Kinesis Data Streams (KDS)
Amazon Managed Streaming for Kafka (MSK)
AWS Glue Data Catalog
Amazon Relational Database Service (RDS)
Amazon DynamoDB
Amazon Redshift
AWS Lake Formation
AWS Identity and Access Management (IAM)
AWS Key Management Service (KMS)
Lake House architecture overview
EMR release history
Comparing Amazon EMR with AWS Glue and AWS Glue DataBrew
AWS Glue
AWS Glue DataBrew
Choosing the right service for your use case
Summary
Test your knowledge
Further reading
Chapter 2: Exploring the Architecture and Deployment Options
EMR architecture deep dive
Distributed storage layer
YARN – cluster resource manager
Distributed processing frameworks
Hadoop applications
Understanding clusters and nodes
Uniform instance groups
Instance fleet
Using S3 versus HDFS for cluster storage
HDFS as cluster-persistent storage
Amazon S3 as a persistent data store
Understanding the cluster life cycle
Options to submit work to the cluster
Submitting jobs to the cluster as EMR steps
Building Hadoop jobs with dependencies in a specific EMR release version
EMR deployment options
Amazon EMR on Amazon EC2
Amazon EMR on Amazon EKS
Amazon EMR on AWS Outposts
EMR pricing for different deployment options
Monitoring and controlling your costs with AWS Budgets and Cost Explorer
Summary
Test your knowledge
Further reading
Chapter 3: Common Use Cases and Architecture Patterns
Reference architecture for batch ETL workloads
Use case overview
Reference architecture walkthrough
Best practices to follow during implementation
Reference architecture for clickstream analytics
Use case overview
Reference architecture walkthrough
Best practices to follow during implementation
Reference architecture for interactive analytics and ML
Use case overview
Reference architecture walkthrough
Best practices to follow during implementation
Reference architecture for real-time streaming analytics
Use case overview
Reference architecture walkthrough
Best practices to follow during implementation
Reference architecture for genomics data analytics
Use case overview
Reference architecture walkthrough
Best practices to follow during implementation
Reference architecture for log analytics
Use case overview
Reference architecture walkthrough
Best practices to follow during implementation
Summary
Test your knowledge
Further reading
Chapter 4: Big Data Applications and Notebooks Available in Amazon EMR
Technical requirements
Understanding popular big data applications in EMR
Hive
Presto
Spark
HBase
Hue
Ganglia
Machine learning frameworks available in EMR
TensorFlow
MXNet
Notebook options available in EMR
EMR Notebooks
JupyterHub
EMR Studio
Zeppelin
Summary
Questions
Further reading
Section 2: Configuration, Scaling, Data Security, and Governance
Chapter 5: Setting Up and Configuring EMR Clusters
Technical requirements
Setting up and configuring clusters with the EMR console's quick create option
Advanced configuration for cluster hardware and software
Understanding the Software Configuration section
Understanding Steps
Understanding the Hardware Configuration section
Understanding general configurations
Understanding Security Options
Working with AMIs and controlling cluster termination
Working with AMIs
Controlling the EMR cluster termination process
Troubleshooting and logging in your EMR cluster
Tools available to debug your EMR cluster
Viewing and restarting cluster application processes
Troubleshooting a failed cluster
Troubleshooting a slow cluster
Logging in your EMR cluster
Summary
Test your knowledge
Further reading
Chapter 6: Monitoring, Scaling, and High Availability
Technical requirements
Monitoring your EMR cluster
Monitoring clusters and applications with web user interfaces
Monitoring cluster metrics with CloudWatch monitoring
EMR API audit logging with AWS CloudTrail
Scaling cluster resources
Managed scaling in EMR
Autoscaling in EMR with a custom policy for instance groups
Manually resizing your EMR cluster
Comparing managed scaling with autoscaling
Cluster cloning and high availability with multiple master nodes
High availability with multiple master nodes
Cloning an existing EMR cluster
Summary
Test your knowledge
Further reading
Chapter 7: Understanding Security in Amazon EMR
Technical requirements
Understanding the basics of security
Creating security configurations
Specifying a security configuration for your cluster
AWS IAM integration with Amazon EMR
Configuring an IAM service role for your EMR cluster
Configuring IAM roles for EMRFS
Integrating IAM roles in applications that invoke AWS services directly
Allowing users and groups to create and modify roles
Identity-based policies and best practices
Understanding authentication to cluster nodes
Understanding data protection in EMR
Encrypting data at rest for EMRFS on Amazon S3 data
Encrypting data in transit for EMRFS on Amazon S3 data
Role of security groups and interface VPC endpoints
Controlling cluster network traffic with security groups
Connecting to Amazon EMR on an EC2 cluster using an interface VPC endpoint
Connecting to Amazon EMR on an EKS cluster using an interface VPC endpoint
Summary
Test your knowledge
Further reading
Chapter 8: Understanding Data Governance in Amazon EMR
Technical requirements
Understanding data catalog and access management options
Using AWS Glue Data Catalog
Integrating AWS Glue Data Catalog with Amazon EMR
Permission management on top of a data catalog
Understanding Amazon EMR integration with AWS Lake Formation
Integrating Lake Formation with Amazon EMR
Launching an EMR cluster with Lake Formation
Setting up EMR notebooks to work with Lake Formation
Understanding Amazon EMR integration with Apache Ranger
Setting up Ranger in EMR
Understanding Apache Ranger plugins
Summary
Test your knowledge
Further reading
Section 3: Implementing Common Use Cases and Best Practices
Chapter 9: Implementing Batch ETL Pipeline with Amazon EMR and Apache Spark
Technical requirements
Use case and architecture overview
Architecture overview
Implementation steps
Creating Amazon S3 buckets
Creating the AWS Lambda function
Configuring an S3 file arrival event to trigger the Lambda function
Triggering the EMR job
Validating the output using Amazon Athena
Defining a virtual Glue Data Catalog table on top of Amazon S3 data
Querying output data using Amazon Athena standard SQL
Spark ETL and Lambda function code walk-through
Understanding the AWS Lambda function code
Understanding the PySpark script integrated into the EMR step
Summary
Test your knowledge
Further reading
Chapter 10: Implementing Real-Time Streaming with Amazon EMR and Spark Streaming
Technical requirements
Use case and architecture overview
Architecture overview
Implementation steps
Creating Amazon S3 buckets
Creating the Amazon Kinesis data stream
Creating and configuring the Kinesis Data Generator tool
Creating an Amazon EMR cluster and configuring a Spark Streaming job
Validating output using Amazon Athena
Defining a virtual Glue Catalog table on top of Amazon S3 data
Querying output data using a standard SQL query in Amazon Athena
Spark Streaming code walk-through
Summary
Test your knowledge
Further reading
Chapter 11: Implementing UPSERT on S3 Data Lake with Apache Spark and Apache Hudi
Technical requirements
Apache Hudi overview
Popular use cases
Registering Hudi data with your Hive or Glue Data Catalog metastore
Creating an EMR cluster and an EMR notebook
Creating the EMR cluster
Creating an EMR notebook
Creating an Amazon S3 bucket
Interactive development with Spark and Hudi
Creating a PySpark notebook for development
Integrating Hudi with our PySpark notebook
Executing Spark and Hudi scripts in your notebook
Summary
Test your knowledge
Further reading
Chapter 12: Orchestrating Amazon EMR Jobs with AWS Step Functions and Apache Airflow/MWAA
Technical requirements
Overview of AWS Step Functions
Integrating AWS Step Functions to orchestrate EMR jobs
Overview of Apache Airflow and MWAA
Integrating Airflow to trigger EMR jobs
Summary
Test your knowledge
Further reading
Chapter 13: Migrating On-Premises Hadoop Workloads to Amazon EMR
Understanding migration approaches
Lift and shift
Re-architecting
Hybrid architecture
Migrating data and metadata catalogs
Migrating data
Migrating metadata catalogs
Migrating ETL jobs and Oozie workflows
Migrating Oozie workflows
Testing and validation
Validating metadata quality
Validating data quality
Best practices for migration
Summary
Test your knowledge
Further reading
Chapter 14: Best Practices and Cost-Optimization Techniques
Best practices around EMR cluster configurations
Choosing the correct cluster type (transient versus long-running)
Best practices around sizing your cluster
Optimization techniques for data processing and storage
Best practices for cluster persistent storage
Best practices while processing data using EMR
Security best practices
Configuring edge nodes outside of the cluster to limit connectivity
Integrating logging, monitoring, and audit controls into your cluster
Blocking public access to your EMR cluster
Protecting your data at rest and in transit
Cost-optimization techniques
Cost savings with compute resources
Cost savings with storage
Integrating AWS Budgets and Cost Explorer
AWS Trusted Advisor
Cost allocation tags
Limitations of Amazon EMR and possible workarounds
Summary
Test your knowledge
Further reading
Index
Other Books You May Enjoy

📜 SIMILAR VOLUMES

📁 Python For Data Analysis: A Beginner’s Guide to Learn Data Analysis with Python Programming.

✍ Hush, John 📂 Library 📅 2020 🌐 English

📁 Big Data Analytics with Spark: A Practitioner's Guide to Using Spark for Large Scale Data Analysis

✍ Mohammed Guller 📂 Library 📅 2015 🏛 Apress 🌐 English

Big Data Analytics with Spark is a step-by-step guide for learning Spark, which is an open-source fast and general-purpose cluster computing framework for large-scale data analysis. You will learn how to use Spark for different types of big data analytics projects, including batch, inter

📁 Big Data Analytics with Spark: A Practitioner's Guide to Using Spark for Large Scale Data Analysis

✍ Mohammed Guller 📂 Library 📅 2016 🏛 Apress 🌐 English

This book is a step-by-step guide for learning how to use Spark for different types of big-data analytics projects, including batch, interactive, graph, and stream data analysis as well as machine learning. It covers Spark core and its add-on libraries, including Spark SQL, Spark Streaming, GraphX,

📁 Learning Big Data with Amazon Elastic MapReduce

✍ Amarkant Singh, Vijay Rayapati 📂 Library 📅 2014 🏛 Packt Publishing 🌐 English

📁 Statistics and Machine Learning Methods for EHR Data: From Data Extraction to Data Analytics

✍ Hulin Wu, Jose Miguel Yamal, Ashraf Yaseen, Vahed Maroufy 📂 Library 📅 2020 🏛 CRC Press 🌐 English

The use of Electronic Health Records (EHR)/Electronic Medical Records (EMR) data is becoming more prevalent for research. However, analysis of this type of data has many unique complications due to how they are collected, processed and types of questions that can be answered. This book covers man

📁 Data Analytics and Linux Operating System. Beginners Guide to Learn Data Analytics, Predictive Analytics and Data Science with Linux Operating System

✍ Isaac D. Cody 📂 Library 📅 2016 🏛 CreateSpace Independent Publishing Platform 🌐 English

This is a 2 book bundle related to Data Analytics and beginning your quest to understand the Linux Command Line Operating System. Two manuscripts for the price of one!