Practical Lakehouse Architecture: Designing and Implementing Modern Data Platforms at Scale
Scribed by Gaurav Ashok Thalpati
- Publisher
- O'Reilly Media
- Year
- 2024
- Tongue
- English
- Leaves
- 3
- Category
- Library
✦ Synopsis
This concise yet comprehensive guide explains how to adopt a data lakehouse architecture to implement modern data platforms. It reviews the design considerations, challenges, and best practices for implementing a lakehouse and provides key insights into the ways that using a lakehouse can impact your data platform, from managing structured and unstructured data and supporting BI and AI/ML use cases to enabling more rigorous data governance and security measures.
Practical Lakehouse Architecture shows you how to:
✦ Table of Contents
Preface
Who Should Read This Book?
Why I Wrote This Book
Navigating This Book
O'Reilly Online Learning
Conventions Used in This Book
How to Contact Us
Acknowledgments
1. Introduction to Lakehouse Architecture
Understanding Data Architecture
What Is Data Architecture?
How Does Data Architecture Help Build a Data Platform?
Defining core components
Defining component interdependencies and data flow
Defining guiding principles
Defining the technology stack
Aligning with overall vision and data strategy
Core Components of a Data Platform
Source systems
Internal and external source systems
Batch, near real-time, and streaming systems
Structured, semi-structured, and unstructured data
Data ingestion
Batch ingestion
Near real-time
Streaming
Data storage
General storage
Purpose-built storage
Data processing and transformations
Data validation and cleansing
Data transformation
Data curation and serving
Data consumption and delivery
BI workloads
Ad hoc/Interactive analysis
Downstream applications and APIs
AI and ML workloads
Common services
Metadata management
Data governance and data security
Data operations
Why Do We Need a New Data Architecture?
Lakehouse Architecture: A New Pattern
The Lakehouse: Best of Both Worlds
How does a lakehouse get data lake features?
How does a lakehouse get data warehouse features?
Understanding Lakehouse Architecture
Storage layer
Cloud storage
Open file formats
Open table formats
Compute layer
Open-source engines
Commercial engines
Lakehouse Architecture Characteristics
Single storage tier with no dedicated warehouse
Warehouse-like performance on the data lake
Decoupled architecture with separate storage and compute scaling
Open architecture
Support for different data types
Support for diverse workloads
Lakehouse Architecture Benefits
Simplified architecture
Support for unstructured data and ML use cases
No vendor lock-ins
Data sharing
Scalable and cost efficient
No data swamps
Schema enforcement and evolution
Schema enforcement
Schema evolution
Unified platform for ETL/ELT, BI, AI/ML, and real-time workloads
ETL/ELT workloads
BI workloads
AI/ML workloads
Real-time workloads
Time travel
Retrieve older data based on version
Retrieve older data based on timestamp
Key Takeaways
References
2. Traditional Architectures and Modern Data Platforms
Traditional Architectures: Data Lakes and Data Warehouses
Data Warehouse Fundamentals
Benefits and advantages
Limitations and challenges
Data Lake Fundamentals
Benefits and advantages
Limitations and challenges
Modern Data Platforms
Finding Answers in the Cloud
Standalone Approach
Benefits
Limitations
Combined Approach
Benefits
Limitations
Expectations of Modern Data Platforms
Comparison: Data Warehouse, Data Lake, Lakehouse
Capabilities and Limitations
Standalone cloud data warehouse
Standalone cloud data lake
Combined architecture
Lakehouse architecture
Implementation Activities
Standalone cloud data warehouse
Standalone cloud data lake
Combined architecture
Lakehouse architecture
Administration and Management
Standalone cloud data warehouse
Standalone cloud data lake
Combined architecture
Lakehouse architecture
Business Outcomes
Standalone cloud data warehouse
Standalone cloud data lake
Combined architecture
Lakehouse architecture
Lakehouse Architecture: The Default Choice for Future Data Platforms?
Key Takeaways
References
3. Storage: The Heart of the Lakehouse
Lakehouse Storage: Key Concepts
Row Versus Columnar Storage
Storage-based Performance Optimization
Lakehouse Storage Components
Cloud Object Storage
Storage characteristics
File Formats
Parquet
File layout
Key features
ORC
File layout
Key features
Avro
File layout
Key features
Similarities, differences, and use cases
Table Formats
Hive
Iceberg
Table layout
Key features
Hudi
Table layout
Key features
Linux Foundation's Delta Lake
Table layout
Key features
Similarities, differences, and use cases
Key Design Considerations
Ecosystem Support
Community Support
Supported File Formats
Supported Compute Engines
Supported Features
Commercial Product Support
Current and Future Versions
Performance Benchmarking
Comparisons
Sharing Features
Key Takeaways
References
4. Data Catalogs
Understanding Metadata
Technical Metadata
Business Metadata
How Metastores and Data Catalogs Work Together
Features of a Data Catalog
Search, Explore, and Discover Data
Data Classification
Data Governance and Security
Data Lineage
Unified Data Catalog
Challenges of Siloed Metadata Management
What Is a Unified Data Catalog?
Benefits of a Unified Data Catalog
Implementing a Data Catalog: Key Design Considerations and Options
Using Hive metastore
Using AWS Services
Using Azure Services
Using GCP Services
Using Databricks
Key Takeaways
References
5. Compute Engines for Lakehouse Architectures
Data Computation Benefits of Lakehouse Architecture
Independent Scaling
Cross-region, Cross-account Access
Unified Batch and Real-Time Processing
Enhanced BI Performance
Freedom to Choose Different Engine Types
Cross-zone Analysis
Compute Engine Options for Lakehouse Platforms
Open Source Tools
Tools for data engineering
Spark
Flink
Tools for data consumption
Presto and Trino
Cloud Services
AWS
AWS Glue
Amazon EMR
Amazon Athena
Other AWS services
Azure
Azure Data Factory (ADF)
Azure HDInsight
Azure Synapse Analytics
GCP
Dataproc
BigQuery
Third-Party Platforms
Databricks
Snowflake
Key Design Considerations
Open Table Format Support
Supported Version and Features
Ecosystem Support
Persona-Based Preferences
Managed Open Source Versus Cloud Native Versus Third-Party Products
Data Consumption Workloads
BI workloads
AI/ML workloads
Key Takeaways
References
6. Data (and AI) Governance and Security in Lakehouse Architecture
What Is Data Governance and Data Security?
Benefits of Data Governance and Data Security
Unified Governance and Security in Lakehouse Architecture
Governance and Security Processes in Lakehouse Architecture
Metadata Management
Compliance and Regulations
Data and ML Model Quality
Lineage Across Data and AI assets
Understanding data flow
Performing impact analysis
Identifying unused objects
Tracking sensitive data
Data and AI Asset Sharing
Data Ownership
Auditing and Monitoring
Access Management
Data Protection
Data at rest
Data in transit
Handling Sensitive Data
Identify sensitive data
Anonymize sensitive data
What's Your Role?
Key Takeaways
References
7. The Big Picture: Designing and Implementing a Lakehouse Platform
Pre-design Activities
Understanding Platform Requirements
Studying Existing System
Understanding the Organization's Vision and Data Strategy
Conducting Workshops and Interviews
Choosing the Right Architecture
Establishing Guiding Principles
Data Ecosystem
Scalability and Performance
Cost Control and Optimization
Platform Operations
Governance and Security
Design Considerations and Implementation Best Practices
Architecture Blueprint
Data Ingestion
Data ingestion considerations
Ingestion frequency
Source system types
Identify incremental data (change data capture)
Sensitive data
Technology choices
Best practices
Data Storage
Storage zones considerations
Raw zone
Cleansed zone
Curated zone
Semantic zone
Data modeling considerations
Entity relationship (ER) modeling
Data Vault modeling
Dimensional modeling
Best practices
Data Processing
Data processing considerations
Open table format conversion
Schema and data quality validations
Data integration
Data transformations and enrichment
Best practices
Data Consumption and Delivery
Workload considerations
Best practices
Common Services
Metadata management
Governance and security
Platform operations
DataOps
MLOps
Best practices
Design References
Step-by-Step Design Guide
Design Questionnaire
Key Takeaways
References
8. Lakehouse in the Real World
Delivering a Real-World Lakehouse
Estimation and Planning Phase
Estimation
Planning
Analysis and Design Phase
Analyzing the Existing System
Data Modeling
Finalizing the Tech Stack
Implementation and Test Phase
Historical Data Migration
Data Reconciliation and Testing
Reverse Engineering
Data Quality and Handling Sensitive Data
Support and Maintenance Phase
Auditing and Tracking
Disaster Recovery Strategy
Decommissioning the Old System
Delivery References
Project Deliverables
Reference Architectures
Cloud native implementation
Third-party platform implementation
Key Takeaways
References
9. Lakehouse of the Future
Warehouse to Lakehouse: What's Next?
Data Mesh
HTAP
Zero ETL
Interoperability and New Formats
Universal Format (UniForm)
Apache XTable
Upcoming File and Table Formats
Managed Platforms for Public and Private Clouds
Microsoft Fabric and Other Platforms
Managed Lakehouse for Private Cloud Platform
AI in a Lakehouse
Key Takeaways
Book Conclusion
References
Index