Mapping Data Flows in Azure Data Factory: Building Scalable ETL Projects in the Microsoft Cloud
✍ Scribed by Mark Kromer
- Publisher: Apress
- Year: 2022
- Tongue: English
- Leaves: 204
- Edition: 1
- Category: Library
✦ Synopsis
Build scalable ETL data pipelines in the cloud using Azure Data Factory’s Mapping Data Flows. Each chapter of this book addresses different aspects of an end-to-end data pipeline that includes repeatable design patterns based on best practices using ADF’s code-free data transformation design tools. The book shows data engineers how to take raw business data at cloud scale and turn that data into business value by organizing and transforming the data for use in data science projects and analytics systems.
The book begins with an introduction to Azure Data Factory, followed by an introduction to its Mapping Data Flows feature set. Subsequent chapters show how to build your first pipeline and corresponding data flow, implement common design patterns, and operationalize your result. By the end of the book, you will be able to apply what you’ve learned to your complex data integration and ETL projects in Azure. These projects will enable cloud-scale big data analytics and apply best practices for loading and transforming data in data warehouses.
What You Will Learn
- Build scalable ETL jobs in Azure without writing code
- Transform big data for data quality and data modeling requirements
- Understand the different aspects of Azure Data Factory ETL pipelines from datasets and Linked Services to Mapping Data Flows
- Apply best practices for designing and managing complex ETL data pipelines in Azure Data Factory
- Add cloud-based ETL patterns to your set of data engineering skills
- Build repeatable code-free ETL design patterns
Who This Book Is For
Data engineers who are new to building complex data transformation pipelines in the cloud with Azure, and data engineers who need ETL solutions that scale to match swiftly growing volumes of data.
✦ Table of Contents
About the Author
About the Technical Reviewer
Introduction
Part I: Getting Started with Azure Data Factory and Mapping Data Flows
Chapter 1: ETL for the Cloud Data Engineer
General ETL Process
Differences in Cloud-Based ETL
Data Drift
Landing the Refined Data
Typical SDLC
Summary
Chapter 2: Introduction to Azure Data Factory
What Is Azure Data Factory?
Factory Resources
Pipelines
Activities
Triggers
Mapping Data Flows
Linked Services
Datasets
Azure Integration Runtime
Self-Hosted Integration Runtime
Elements of a Pipeline
Pipeline Execution
Pipeline Triggers
Pipeline Monitoring
Summary
Chapter 3: Introduction to Mapping Data Flows
Getting Started
Design Surface
Connector Lines and Reference Lines
Repositioning Nodes
Data Flow Script
Transformation Primitives
Multiple Inputs/Outputs
New Branch
Join
Conditional Split
Exists
Union
Lookup
Schema Modifier
Derived Column
Select
Aggregate
Surrogate Key
Pivot
Unpivot
Window
Rank
External Call
Formatters
Flatten
Parse
Stringify
Row Modifier
Filter
Sort
Alter Row
Assert
Flowlets
Destination
Expression Language
Functions
Input Schema
Parameters
Cached Lookup
Locals
Data Preview
Manage Compute Environment from Azure IR
Debugging from the Data Flow Surface
Debugging from Pipeline
Summary
Part II: Designing Scalable ETL Jobs with ADF Mapping Data Flows
Chapter 4: Build Your First ETL Pipeline in ADF
Scenario
Data Quality
Task 1: Start with a New Data Flow
Task 2: Metadata Checker
Task 3: Add Asserts for Data Validation
Task 4: Filter Out NULLs
Task 5: Create Full Address Field
Final Step: Land the Data As Parquet in the Data Lake
Summary
Chapter 5: Common ETL Pipeline Practices in ADF with Mapping Data Flows
Task 1: Create a New Pipeline
Task 2: Debug the Pipeline
Task 3: Evaluate Execution Plan
Task 4: Evaluate Results
Task 5: Prepare Pipeline for Operational Deployment
Summary
Chapter 6: Slowly Changing Dimensions
Building a Slowly Changing Dimension Pattern in Mapping Data Flows
Data Sources
NewProducts
ExistingProducts
Cached Lookup
Create Cache
Create Row Hashes
Surrogate Key Generation
Check for Existing Dimension Members
Set Dimension Properties
Bring the Streams Together
Prepare Data for Writing to Database
Summary
Chapter 7: Data Deduplication
The Need for Data Deduplication
Type 1: Distinct Rows
Type 2: Fuzzy Matching
Column Pattern Matching
Self-Join
Match Scoring
Scoring Your Data for Duplication Evaluation
Turn the Data Flow into a Reusable Flowlet
Debugging a Flowlet
Summary
Chapter 8: Mapping Data Flow Advanced Topics
Working with Complex Data Types
Hierarchical Structures
Working with an Existing Hierarchical Structure
Building a Structure
Using Other Transformations
Arrays
Build an Array
Work with an Existing Array
Maps
Create a New Map
Data Lake File Formats
Parquet
Delta Lake
Optimized Row Columnar
Avro
JSON and Delimited Text
Data Flow Script
Summary
Part III: Operationalize Your ETL Data Pipelines
Chapter 9: Basics of CI/CD and Pipeline Scheduling
Configure Git
New Factory
Existing Factory
Branching
Publish Changes
Pipeline Scheduling
Debug Run
Trigger Now
Schedule Trigger
Tumbling Window Trigger
Storage Events Trigger
Custom Events Trigger
Summary
Chapter 10: Monitor, Manage, and Optimize
Monitoring Your Jobs
Error Row Handling
Partitioning Strategies
Optimizing Integration Runtimes
Compute Settings
Time to Live (TTL)
Iterating over Files
Parameterizing
Pipeline Parameters
Data Flow Parameters
Late Binding
Data Profiling
Mapping Data Flow Statistics
Data Preview Statistics
Profile Stats
Power Query Activity
Transformation Optimization
byName( ) and byNames( )
Rank and Surrogate Key
Sorting
Database Queries
Joins and Lookups
Broadcasting
Cached Lookup
Pipeline Optimizations for Data Flow Activity
Run in Parallel
Logging Level
Database Staging
Summary
Index
📜 SIMILAR VOLUMES
Intermediate-Advanced user level
Design and build architectures on the Microsoft Azure platform specifically for data-driven and ETL applications. Modern cloud architectures rely on serverless components more than ever, and this book helps you identify those components of data-driven or ETL applications that can be tackled us
Build a modern data warehouse on Microsoft's Azure Platform that is flexible, adaptable, and fast—fast to snap together, reconfigure, and fast at delivering results to drive good decision making in your business. Gone are the days when data warehousing projects were lumbering dinosaur-style projects
Intermediate-Advanced user level