Data Analytics in the AWS Cloud: Building a Data Platform for BI and Predictive Analytics on AWS
✍ Scribed by Joe Minichino
- Publisher
- Sybex
- Year
- 2023
- Tongue
- English
- Leaves
- 411
- Edition
- 1
- Category
- Library
✦ Synopsis
A comprehensive and accessible roadmap to performing data analytics in the AWS cloud
In Data Analytics in the AWS Cloud: Building a Data Platform for BI and Predictive Analytics on AWS, accomplished software engineer and data architect Joe Minichino delivers an expert blueprint for storing, processing, and analyzing data on the Amazon Web Services cloud platform. In the book, you’ll explore every relevant aspect of data analytics, from data engineering to analysis, business intelligence, DevOps, and MLOps, as you discover how to integrate machine learning predictions with analytics engines and visualization tools.
You’ll also find:
- Real-world use cases of AWS architectures that demystify the applications of data analytics
- Accessible introductions to data acquisition, importation, storage, visualization, and reporting
- Expert insights into serverless data engineering and how to use it to reduce overhead and costs, improve stability, and simplify maintenance
A can't-miss for data architects, analysts, engineers and technical professionals, Data Analytics in the AWS Cloud will also earn a place on the bookshelves of business leaders seeking a better understanding of data analytics on the AWS cloud platform.
✦ Table of Contents
Cover
Title Page
Copyright Page
About the Author
About the Technical Editor
Acknowledgments
Contents at a Glance
Contents
Introduction
What Is a Data Lake?
When You Do Not Need a Data Lake
When Do You Need Analytics?
When Do You Need a Data Lake for Analytics?
How About an Analytics Team?
The Data Platform
The End of the Beginning
Chapter 1 AWS Data Lakes and Analytics Technology Overview
Why AWS?
What Does a Data Lake Look Like in AWS?
Analytics on AWS
Skills Required to Build and Maintain an AWS Analytics Pipeline
Chapter 2 The Path to Analytics: Setting Up a Data and Analytics Team
The Data Vision
Support
DA Team Roles
Early Stage Roles
Team Lead
Data Architect
Data Engineer
Data Analyst
Maturity Stage Roles
Data Scientist
Cloud Engineer
Business Intelligence (BI) Developer
Machine Learning Engineer
Business Analyst
Niche Roles
Analytics Flow at a Process Level
Workflow Methodology
The DA Team Mantra: “Automate Everything”
Analytics Models in the Wild: Centralized, Distributed, Center of Excellence
Centralized
Distributed
Center of Excellence
Summary
Chapter 3 Working on AWS
Accessing AWS
Everything Is a Resource
S3: An Important Exception
IAM: Policies, Roles, and Users
Policies
Identity-Based Policies
Resource-Based Policies
Roles
Users and User Groups
Summarizing IAM
Working with the Web Console
The AWS Command-Line Interface
Installing AWS CLI
Linux Installation
macOS Installation
Windows
Configuring AWS CLI
A Note on Region
Setting Individual Parameters
Using Profiles and Configuration Files
Final Notes on Configuration
Using the AWS CLI
Using Skeletons and File Inputs
Cleaning Up!
Infrastructure-as-Code: CloudFormation and Terraform
CloudFormation
CloudFormation Stacks
CloudFormation Template Anatomy
CloudFormation Changesets
Getting Stack Information
Cleaning Up Again
CloudFormation Conclusions
Terraform
Coding Style
Modularity
Limitations
Terraform vs. CloudFormation
Infrastructure-as-Code: CDK, Pulumi, Cloudcraft, and Other Solutions
AWS CDK
Pulumi
Cloudcraft
Infrastructure Management Conclusions
Chapter 4 Serverless Computing and Data Engineering
Serverless vs. Fully Managed
AWS Serverless Technologies
AWS Lambda
Pricing Model
Laser Focus on Code
The Lambda Paradigm Shift
Virtually Infinite Scalability
Geographical Distribution
A Lambda Hello World
Lambda Configuration
Runtime
Container-Based Lambdas
Architectures
Memory
Networking
Execution Role
Environment Variables
AWS EventBridge
AWS Fargate
AWS DynamoDB
AWS SNS
Amazon SQS
AWS CloudWatch
Amazon QuickSight
AWS Step Functions
Amazon API Gateway
Amazon Cognito
AWS Serverless Application Model (SAM)
Ephemeral Infrastructure
AWS SAM Installation
Configuration
Creating Your First AWS SAM Project
Application Structure
SAM Resource Types
SAM Lambda Template
!! Recursive Lambda Invocation !!
Function Metadata
Outputs
Implicitly Generated Resources
Other Template Sections
Lambda Code
Building Your First SAM Application
Testing the AWS SAM Application Locally
Deployment
Cleaning Up
Summary
Chapter 5 Data Ingestion
AWS Data Lake Architecture
Serverless Data Lake Architecture Structure
Ingestion
Storage and Processing
Cataloging, Governance, and Search
Security and Monitoring
Consumption
Sample Processing Architecture: Cataloging Images into DynamoDB
Use Case Description
SAM Application Creation
S3-Triggered Lambda
Adding DynamoDB
Lambda Execution Context
Inserting into DynamoDB
Cleaning Up
Serverless Ingestion
AWS Fargate
AWS Lambda
Example Architecture: Fargate-Based Periodic Batch Import
The Basic Importer
ECS CLI
AWS Copilot CLI
Clean Up
AWS Kinesis Ingestion
Example Architecture: Two-Pronged Delivery
Fully Managed Ingestion with AppFlow
Operational Data Ingestion with Database Migration Service
DMS Concepts
DMS Instance
DMS Endpoints
DMS Tasks
Summary of the Workflow
Common Use of DMS
Example Architecture: DMS to S3
DMS Instance
DMS Endpoints
DMS Task
Summary
Chapter 6 Processing Data
Phases of Data Preparation
What Is ETL? Why Should I Care?
ETL Job vs. Streaming Job
Overview of ETL in AWS
ETL with AWS Glue
ETL with Lambda Functions
ETL with Hadoop/EMR
Other Ways to Perform ETL
ETL Job Design Concepts
Source Identification
Destination Identification
Mappings
Validation
Filter
Join, Denormalization, Relationalization
AWS Glue for ETL
Really, It’s Just Spark
Visual
Spark Script Editor
Python Shell Script Editor
Jupyter Notebook
Connectors
Creating Connections
Creating Connections with the Web Console
Creating Connections with the AWS CLI
Creating ETL Jobs with AWS Glue Visual Editor
ETL Example: Format Switch from Raw (JSON) to Cleaned (Parquet)
Job Bookmarks
Transformations
Apply Mapping
Filter
Other Available Transforms
Run the Edited Job
Visual Editor with Source and Target Conclusions
Creating ETL Jobs with AWS Glue Visual Editor (without Source and Target)
Creating ETL Jobs with the Spark Script Editor
Developing ETL Jobs with AWS Glue Notebooks
What Is a Notebook?
Notebook Structure
Step 1: Load Code into a DynamicFrame
Step 2: Apply Field Mapping
Step 3: Apply the Filter
Step 4: Write to S3 in Parquet Format
Example: Joining and Denormalizing Data from Two S3 Locations
Conclusions for Manually Authored Jobs with Notebooks
Creating ETL Jobs with AWS Glue Interactive Sessions
It’s Magic
Development Workflow
Streaming Jobs
Differences with a Standard ETL Job
Streaming Sources
Example: Process Kinesis Streams with a Streaming Job
Streaming ETL Jobs Conclusions
Summary
Chapter 7 Cataloging, Governance, and Search
Cataloging with AWS Glue
AWS Glue and the AWS Glue Data Catalog
Glue Databases and Tables
Databases
The Idea of Schema-on-Read
Tables
Create Table Manually
Creating a Table from an Existing Schema
Creating a Table with a Crawler
Summary on Databases and Tables
Crawlers
Updating or Not Updating?
Running the Crawler
Creating a Crawler from the AWS CLI
Retrieving Table Information from the CLI
Classifiers
Classifier Example
Crawlers and Classifiers Summary
Search with Amazon Athena: The Heart of Analytics in AWS
A Bit of History
Interface Overview
Creating Tables Manually
Athena Data Types
Complex Types
Running a Query
Connecting with JDBC and ODBC
Query Stats
Recent Queries and Saved Queries
The Power of Partitions
Athena Pricing Model
Automatic Naming
Athena Query Output
Athena Peculiarities (SQL and Not)
Computed Fields Gotcha and WITH Statement Workaround
Lowercase!
Query Explain
Deduplicating Records
Working with JSON, Flattening, and Unnesting
Athena Views
CREATE TABLE AS SELECT (CTAS)
Saving Queries and Reusing Saved Queries
Running Parameterized Queries
Athena Federated Queries
Athena Lambda Connectors
Note on Connection Errors
Performing Federated Queries
Creating a View from a Federated Query
Governing: Athena Workgroups, Lake Formation, and More
Athena Workgroups
Fine-Grained Athena Access with IAM
Recap of Athena-Based Governance
AWS Lake Formation
Registering a Location in Lake Formation
Creating a Database in Lake Formation
Assigning Permissions in Lake Formation
LF-Tags and Permissions in Lake Formation
Data Filters
Governance Conclusions
Summary
Chapter 8 Data Consumption: BI, Visualization, and Reporting
QuickSight
Signing Up for QuickSight
Standard Plan
Enterprise Plan
Users and User Groups
Managing Users and Groups
Managing QuickSight
Users and Groups
Your Subscriptions
SPICE Capacity
Account Settings
Security and Permissions
VPC Connections
Mobile Settings
Domains and Embedding
Single Sign-On
Data Sources and Datasets
Creating an Athena Data Source
Creating Other Data Sources
Creating a Data Source from the AWS CLI
Creating a Dataset from a Table
Creating a Dataset from a SQL Query
Duplicating Datasets
Note on Creating Datasets
QuickSight Favorites, Recent, and Folders
SPICE
Manage SPICE Capacity
Refresh Schedule
QuickSight Data Editor
QuickSight Data Types
Change Data Types
Calculated Fields
Joining Data
Excluding Fields
Filtering Data
Removing Data
Geospatial Hierarchies and Adding Fields to Hierarchies
Unsupported Format Dates
Visualizing Data: QuickSight Analysis
Adding a Title and a Description to Your Analysis
Renaming the Sheet
Your First Visual with AutoGraph
Field Wells
Visual Types
Saving and Autosaving
A First Example: Pie Chart
Renaming a Visual
Filtering Data
Adding Drill-Downs
Parameters
Actions
Insights
ML-Powered Insights
Sharing an Analysis
Dashboards
Dashboard Layouts and Themes
Publishing a Dashboard
Embedding Visuals and Dashboards
Data Consumption: Not Only Dashboards
Summary
Chapter 9 Machine Learning at Scale
Machine Learning and Artificial Intelligence
What Are ML/AI Use Cases?
Types of ML Models
Overview of ML/AI AWS Solutions
Amazon SageMaker
SageMaker Domains
Adding a User to the Domain
SageMaker Studio
SageMaker Example Notebook
Step 1: Prerequisites and Preprocessing
Step 2: Data Ingestion
Step 3: Data Inspection
Step 4: Data Conversion
Step 5: Upload Training Data
Step 6: Train the Model
Step 7: Set Up Hosting and Deploy the Model
Step 8: Validate the Model
Step 9: Use the Model
Inference
Real Time
Asynchronous
Serverless
Batch Transform
Data Wrangler
SageMaker Canvas
Summary
Appendix Example Data Architectures in AWS
Modern Data Lake Architecture
ETL in a Lake House
Consuming Data in the Lake House
The Modern Data Lake Architecture
Batch Processing
Stream Processing
Architecture Design Recommendations
Automate Everything
Build on Events
Performance = Cost Savings
AWS Glue Catalog and Athena-Centric Workflow
Design Flexible
Pick Your Battles
Parquet
Summary
Index
EULA
📜 SIMILAR VOLUMES
- Build an end-to-end geospatial data lake in AWS using popular AWS services such as RDS, Redshift, DynamoDB, and Athena to manage geodata. Purchase of the print or Kindle book includes a free PDF eBook. Key Features: Explore the architecture a