𝔖 Scriptorium
✦   LIBER   ✦

📁

The Definitive Guide to Azure Data Engineering: Modern ELT, DevOps, and Analytics on the Azure Cloud Platform

✍ Scribed by Ron C. L'Esteve


Tongue
English
Leaves
611
Category
Library

⬇  Acquire This Volume

No coin nor oath required. For personal study only.

✦ Synopsis


Intermediate-Advanced user level

✦ Table of Contents


Table of Contents
About the Author
About the Technical Reviewer
Acknowledgments
Introduction
Part I: Getting Started
Chapter 1: The Tools and Prerequisites
Master the Traditional Microsoft Business Intelligence Stack
Understand Azure’s Modern Enterprise Data and Analytics Platform
Understand How to Manage Big Data with Azure
Understand the Fundamental Requirements for the Data Engineer Associate
Expand Your Knowledge Across Azure Specialties
Be Able to Address the Business Value of the Azure Data Platform
Get Hands-On with Azure Data Engineering Through Azure Portal
Azure Services Covered in This Book
Data Lake Storage Gen2
Data Factory
Ingest and Load
Mapping Data Flows for Transformation and Aggregation
Wrangling Data Flows
Schedule and Monitor
ADF Limitations
Databricks
Synapse Analytics
DevOps CI/CD
IoT Hub
Stream Analytics
Power BI
Purview
Snowflake
SQL Database
Purchasing Models (SQL DTU vs. vCore Database)
Deployment Models
Service Tiers
Cosmos DB
Relevant Azure Services Not Covered
Analysis Services
Cognitive Services
Decision
Language
Speech
Vision
Search
Azure Machine Learning
Supervised Learning
Unsupervised Learning
Reinforcement Learning
Monitor
Log Analytics
Event Hubs
Data Share
Logic Apps
Power Apps
App Service
SQL Managed Instance
Data Box
Data Sync
Data Gateway
Cost Management + Billing
Digital Twins
Mobile
Networking
Security
Identity
Kubernetes
Functions
HVR Real-Time Data Replication
Summary
Chapter 2: Data Factory vs. SSIS vs. Databricks
Choosing the Right Data Integration Tool
When to Use Azure Data Factory, Azure Databricks, or Both
Summary
Chapter 3: Design a Data Lake Storage Gen2 Account
Data Lake Layers
Environments
Storage Accounts
File Systems
Zones, Directories, and Files
Zones
Directories (Folders)
Files
Security
Control Plane Permissions
Data Plane Permissions
POSIX-Like Access Control Lists
Shared Access Signature
Data Encryption
Network Transport
Summary
Part II: Azure Data Factory for ELT
Chapter 4: Dynamically Load a SQL Database to Data Lake Storage Gen2
Azure Prerequisite Resources
Prepare and Verify SQL Server Database Objects
Prepare and Verify Azure SQL Database Objects
Prepare an Azure Data Lake Storage Gen2 Container
Create Azure Data Factory Pipeline Resources
Create a Self-Hosted Integration Runtime
Create Linked Services
Create Datasets
DS_ADLS2
DS_SQLSERVER
DS_ASQLDB_PIPELINE_PARAMETER
Create Azure Data Factory Pipelines
P_Insert_Base_Table_Info
P_SQL_to_ADLS
Run the Data Factory Pipeline and Verify Azure Data Lake Storage Gen2 Objects
Summary
Chapter 5: Use COPY INTO to Load a Synapse Analytics Dedicated SQL Pool
Features of the COPY INTO Command
Data Preparation Tips
Tip #1: Remove Spaces from the Column Names
Tip #2: Convert VARCHAR(MAX) to VARCHAR(4000)
COPY INTO Using a Parquet File
COPY INTO Using a CSV File
Using COPY INTO from Data Factory
Summary
Chapter 6: Load Data Lake Storage Gen2 Files into a Synapse Analytics Dedicated SQL Pool
Recreate the Pipeline Parameter Table
Create the Datasets
DS_ADLS_TO_SYNAPSE
DS_ADLS_TO_SYNAPSE_MI
DS_SYNAPSE_ANALYTICS_DW
Create the Pipeline
Choose the Copy Method
BULK INSERT
PolyBase
Copy Command
Summary
Chapter 7: Create and Load Synapse Analytics Dedicated SQL Pool Tables Dynamically
Dynamically Create and Load New Tables Using an ADF Pre-copy Script
Dynamically Truncate and Load Existing Tables Using an ADF Pre-copy Script
Dynamically Drop, Create, and Load Tables Using a Stored Procedure
Summary
Chapter 8: Build Custom Logs in SQL Database for Pipeline Activity Metrics
Option 1: Create a Stored Procedure Activity
Option 2: Create a CSV Log File in Data Lake Storage Gen2
Option 3: Create a Log Table in Azure SQL Database
Summary
Chapter 9: Capture Pipeline Error Logs in SQL Database
Create a Parameter Table
Create a Log Table
Create an Errors Table
Create a Stored Procedure to Update the Log Table
Create a Stored Procedure to Update the Errors Table
Create a Source Error
Add Records to a Parameter Table
Verify the Azure Data Lake Storage Gen2 Folders and Files
Configure the Pipeline Lookup Activity
Configure the Pipeline ForEach Loop Activity
Configure a Stored Procedure to Update the Log Table
Configure a Stored Procedure to Update the Errors Table
Run the Pipeline
Verify the Results
Other ADF Logging Options
Summary
Chapter 10: Dynamically Load a Snowflake Data Warehouse
Linked Services and Datasets
Base Linked Services
Datasets
Snowflake Control Database and Tables
Pipelines
Step 1: Design and Execute an ADF Pipeline to Load Azure SQL Database to Data Lake Storage Gen2
Step 2: Design the Data Lake Storage Gen2 to Snowflake ADF Pipeline
Option 1: ADF Pipeline to Load ADLS Gen2 to Snowflake Using Azure Databricks
Option 2: ADF Pipeline to Load ADLS Gen2 to Snowflake Using ADF Copy Activity
Option 3: ADF Pipeline to Load ADLS Gen2 to Snowflake Using Mapping Data Flows
Comparing the Various ADLS Gen2 to Snowflake Ingestion Options
Swim Lanes
Data Validation
Summary
Chapter 11: Mapping Data Flows for Data Warehouse ETL
Modern Data Warehouse
Creating the Base Azure Data Resources
Slowly Changing Dimension Type I
Create a Data Factory Pipeline and Datasets
Create a Data Factory Mapping Data Flow
Exists
LookupDates
SetAttributes
AlterRows
sink1
Updating a Record
Inserting a Record
Summary
Chapter 12: Aggregate and Transform Big Data Using Mapping Data Flows
Add Files and Folders to Azure Data Lake Storage Gen2
File Size
Folder Structure
Create Azure Data Factory Resources
Create the Mapping Data Flow
Regular Expressions (Regex)
Soundex
RANK Function
DENSE_RANK Function
ROW_NUMBER Function
Summary
Chapter 13: Incrementally Upsert Data
Create a Parameter Table
Create a Source Query for the ADF Pipeline
Add the ADF Datasets
Azure SQL Database
Azure Data Lake Storage Gen2
Azure Synapse Analytics DW
Create the ADF Pipeline
Add a Lookup Activity to Get the List of Tables
Add a ForEach Activity to Iterate and Copy Each Table
Mapping Data Flow for SQL to Lake Incremental ADF Pipeline
Round Robin
Hash
Dynamic Range
Fixed Range
Key
Mapping Data Flow to Incrementally Upsert from Lake to Synapse Analytics DW
Run the ADF Pipeline
Verify Incremental SQL to Lake Pipeline Results
Verify Incremental Upsert Lake to Synapse ADF Pipeline Results
Verify Source SQL Record Count
Verify Lake Folder and Parquet File Path
Verify Destination Synapse Record Count
Insert a Source SQL Record
Verify Incremental SQL to Lake ADF Pipeline Results
Verify Incremental Upsert Lake to Synapse ADF Pipeline Results
Verify Destination Synapse Analytics DW Record Count
Update a Source SQL Record
Verify Destination Synapse Analytics DW Record Count
Summary
Chapter 14: Load Excel Sheets into Azure SQL Database Tables
Prerequisites
Create an Excel Spreadsheet
Upload to Azure Data Lake Storage Gen2
Create Linked Services and Datasets
Create a Pipeline to Load Multiple Excel Sheets in a Spreadsheet into a Single Azure SQL Table
Create a Pipeline to Load Multiple Excel Sheets in a Spreadsheet into Multiple Azure SQL Tables
Summary
Chapter 15: Delta Lake
Why an ACID Delta Lake
Prerequisites
Create and Insert into Delta Lake
Update Delta Lake
Delete from Delta Lake
Explore Delta Logs
Insert
Update
Delete
Summary
Part III: Real-Time Analytics in Azure
Chapter 16: Stream Analytics Anomaly Detection
Prerequisites
Create an Azure Stream Analytics Job
Create an IoT Hub
Create a Power BI Service
Download the Device Simulator
Create a Stream Analytics Input and Output
Add Stream Input
Add Stream Output
Write the Stream Analytics Query
Start the Stream Analytics Job
Create a Real-Time Power BI Dashboard
Create a Dataset
Create a Dashboard
Add a Tile
Run the Device Simulator
Monitor Real-Time Power BI Streaming
Summary
Chapter 17: Real-Time IoT Analytics Using Apache Spark
Prerequisites
Create an IoT Hub
Create a Databricks Cluster
Install Maven Library
Create a Notebook and Run Structured Streaming Queries
Configure Notebook Connections
Start the Structured Stream
Start the IoT Device Simulator
Display the Real-Time Streaming Data
Create a Spark SQL Table
Write the Stream to a Delta Table
Summary
Chapter 18: Azure Synapse Link for Cosmos DB
Create an Azure Cosmos DB Account
Enable Azure Synapse Link
Create a Cosmos DB Container and Database
Import Data into Azure Cosmos DB
Create a Cosmos DB Linked Service in Azure Synapse Analytics
Load and Query the Data Using Synapse Spark
Summary
Part IV: DevOps for Continuous Integration and Deployment
Chapter 19: Deploy Data Factory Changes
Prerequisites
Create the DevOps Continuous Integration Build Pipeline
Create the DevOps Continuous Deployment Release Pipeline
Azure PowerShell Task to Stop Triggers
ARM Template Deployment Task
Azure PowerShell Task to Start Triggers
Run the Release Pipeline
Verify the Deployed Data Factory Resources
Summary
Chapter 20: Deploy a SQL Database
Pre-Requisites
Create a Visual Studio SQL Database Project
Install Visual Studio GitHub Extension
Import AdventureWorks Database
Connect to GitHub Repo Source Control
Check In Visual Studio Solution to GitHub Repo
Install Azure Pipelines from GitHub
Build CI Pipeline from GitHub Repo
Release CD Pipeline from DevOps Artifact Repo
Verify Deployed Azure SQL AdventureWorks Database
Summary
Part V: Advanced Analytics
Chapter 21: Graph Analytics Using Apache Spark’s GraphFrame API
Install JAR Library
Load New Data Tables
Load Data in a Databricks Notebook
Build a Graph with Vertices and Edges
Query the Graph
Find Patterns with Motifs
Discover Importance with PageRank
Explore In-Degree and Out-Degree Metrics
Run a Breadth-First Search
Find Connected Components
Summary
Chapter 22: Synapse Analytics Workspaces
Create a Synapse Analytics Workspace
Explore Sample Data with Spark
Query Data with SQL
Create External Table with SQL
Summary
Chapter 23: Machine Learning in Databricks
Create an MLflow Experiment
Install the MLflow Library
Create a Notebook
Selective Logging
Auto-logging
Register a Model
Summary
Part VI: Data Governance
Chapter 24: Purview for Data Governance
Create Azure Purview Account
Explore Azure Purview
Create and Register Data Source
Manage Credentials and Access
Create a Scan
Explore the Glossary
Browse Assets
Working with Purview Programmatically
Summary
Index


📜 SIMILAR VOLUMES


The Modern Data Warehouse in Azure: Buil
✍ Matt How 📂 Library 📅 2020 🏛 Apress 🌐 English

Build a modern data warehouse on Microsoft's Azure Platform that is flexible, adaptable, and fast—fast to snap together, reconfigure, and fast at delivering results to drive good decision making in your business. Gone are the days when data warehousing projects were lumbering dinosaur-style projects

The Modern Data Warehouse in Azure: Buil
✍ Matt How 📂 Library 📅 2020 🏛 Apress 🌐 English

Build a modern data warehouse on Microsoft's Azure Platform that is flexible, adaptable, and fast—fast to snap together, reconfigure, and fast at delivering results to drive good decision making in your business. Gone are the days when data warehousing projects were lumbering dinosaur-style projects

Cloud Native Development with Azure : A
✍ Verma, Pavan; 📂 Library 📅 2024 🏛 BPB Publications 🌐 English

Develop cloud-native skills by learning Azure cloud infrastructure offerings KEY FEATURES ● Master cloud-native development fundamentals and Azure services. ● Application security, monitoring, and efficient management. ● Explore advanced services like Azure Machine Learning & IoT Hub. DESCRIPTION Az