Practical Lakehouse Architecture: Designing and Implementing Modern Data Platforms at Scale

✍ Scribed by Gaurav Ashok Thalpati


Publisher: O'Reilly Media
Year: 2024
Tongue: English
Leaves: 3
Category: Library

✦ Synopsis


This concise yet comprehensive guide explains how to adopt a data lakehouse architecture to implement modern data platforms. It reviews the design considerations, challenges, and best practices for implementing a lakehouse and provides key insights into the ways that using a lakehouse can impact your data platform, from managing structured and unstructured data and supporting BI and AI/ML use cases to enabling more rigorous data governance and security measures.

Practical Lakehouse Architecture shows you how to:

  • Understand key lakehouse concepts and features like transaction support, time travel, and schema evolution
  • Understand the differences between traditional and lakehouse data architectures
  • Differentiate between various file formats and table formats
  • Design lakehouse architecture layers for storage, compute, metadata management, and data consumption
  • Implement data governance and data security within the...

✦ Table of Contents


    Preface
    Who Should Read This Book?
    Why I Wrote This Book
    Navigating This Book
    O’Reilly Online Learning
    Conventions Used in This Book
    How to Contact Us
    Acknowledgments
    1. Introduction to Lakehouse Architecture
    Understanding Data Architecture
    What Is Data Architecture?
    How Does Data Architecture Help Build a Data Platform?
    Defining core components
    Defining component interdependencies and data flow
    Defining guiding principles
    Defining the technology stack
    Aligning with overall vision and data strategy
    Core Components of a Data Platform
    Source systems
    Internal and external source systems
    Batch, near real-time, and streaming systems
    Structured, semi-structured, and unstructured data
    Data ingestion
    Batch ingestion
    Near real-time
    Streaming
    Data storage
    General storage
    Purpose-built storage
    Data processing and transformations
    Data validation and cleansing
    Data transformation
    Data curation and serving
    Data consumption and delivery
    BI workloads
    Ad hoc/Interactive analysis
    Downstream applications and APIs
    AI and ML workloads
    Common services
    Metadata management
    Data governance and data security
    Data operations
    Why Do We Need a New Data Architecture?
    Lakehouse Architecture: A New Pattern
    The Lakehouse: Best of Both Worlds
    How does a lakehouse get data lake features?
    How does a lakehouse get data warehouse features?
    Understanding Lakehouse Architecture
    Storage layer
    Cloud storage
    Open file formats
    Open table formats
    Compute layer
    Open-source engines
    Commercial engines
    Lakehouse Architecture Characteristics
    Single storage tier with no dedicated warehouse
    Warehouse-like performance on the data lake
    Decoupled architecture with separate storage and compute scaling
    Open architecture
    Support for different data types
    Support for diverse workloads
    Lakehouse Architecture Benefits
    Simplified architecture
    Support for unstructured data and ML use cases
    No vendor lock-ins
    Data sharing
    Scalable and cost efficient
    No data swamps
    Schema enforcement and evolution
    Schema enforcement
    Schema evolution
    Unified platform for ETL/ELT, BI, AI/ML, and real-time workloads
    ETL/ELT workloads
    BI workloads
    AI/ML workloads
    Real-time workloads
    Time travel
    Retrieve older data based on version
    Retrieve older data based on timestamp
    Key Takeaways
    References
    2. Traditional Architectures and Modern Data Platforms
    Traditional Architectures: Data Lakes and Data Warehouses
    Data Warehouse Fundamentals
    Benefits and advantages
    Limitations and challenges
    Data Lake Fundamentals
    Benefits and advantages
    Limitations and challenges
    Modern Data Platforms
    Finding Answers in the Cloud
    Standalone Approach
    Benefits
    Limitations
    Combined Approach
    Benefits
    Limitations
    Expectations of Modern Data Platforms
    Comparison: Data Warehouse, Data Lake, Lakehouse
    Capabilities and Limitations
    Standalone cloud data warehouse
    Standalone cloud data lake
    Combined architecture
    Lakehouse architecture
    Implementation Activities
    Standalone cloud data warehouse
    Standalone cloud data lake
    Combined architecture
    Lakehouse architecture
    Administration and Management
    Standalone cloud data warehouse
    Standalone cloud data lake
    Combined architecture
    Lakehouse architecture
    Business Outcomes
    Standalone cloud data warehouse
    Standalone cloud data lake
    Combined architecture
    Lakehouse architecture
    Lakehouse Architecture: The Default Choice for Future Data Platforms?
    Key Takeaways
    References
    3. Storage: The Heart of the Lakehouse
    Lakehouse Storage: Key Concepts
    Row Versus Columnar Storage
    Storage-based Performance Optimization
    Lakehouse Storage Components
    Cloud Object Storage
    Storage characteristics
    File Formats
    Parquet
    File layout
    Key features
    ORC
    File layout
    Key features
    Avro
    File layout
    Key features
    Similarities, differences, and use cases
    Table Formats
    Hive
    Iceberg
    Table layout
    Key features
    Hudi
    Table layout
    Key features
    Linux Foundation’s Delta Lake
    Table layout
    Key features
    Similarities, differences, and use cases
    Key Design Considerations
    Ecosystem Support
    Community Support
    Supported File Formats
    Supported Compute Engines
    Supported Features
    Commercial Product Support
    Current and Future Versions
    Performance Benchmarking
    Comparisons
    Sharing Features
    Key Takeaways
    References
    4. Data Catalogs
    Understanding Metadata
    Technical Metadata
    Business Metadata
    How Metastores and Data Catalogs Work Together
    Features of a Data Catalog
    Search, Explore, and Discover Data
    Data Classification
    Data Governance and Security
    Data Lineage
    Unified Data Catalog
    Challenges of Siloed Metadata Management
    What Is a Unified Data Catalog?
    Benefits of a Unified Data Catalog
    Implementing a Data Catalog: Key Design Considerations and Options
    Using Hive metastore
    Using AWS Services
    Using Azure Services
    Using GCP Services
    Using Databricks
    Key Takeaways
    References
    5. Compute Engines for Lakehouse Architectures
    Data Computation Benefits of Lakehouse Architecture
    Independent Scaling
    Cross-region, Cross-account Access
    Unified Batch and Real-Time Processing
    Enhanced BI Performance
    Freedom to Choose Different Engine Types
    Cross-zone Analysis
    Compute Engine Options for Lakehouse Platforms
    Open Source Tools
    Tools for data engineering
    Spark
    Flink
    Tools for data consumption
    Presto and Trino
    Cloud Services
    AWS
    AWS Glue
    Amazon EMR
    Amazon Athena
    Other AWS services
    Azure
    Azure Data Factory (ADF)
    Azure HDInsight
    Azure Synapse Analytics
    GCP
    Dataproc
    BigQuery
    Third-Party Platforms
    Databricks
    Snowflake
    Key Design Considerations
    Open Table Format Support
    Supported Version and Features
    Ecosystem Support
    Persona-Based Preferences
    Managed Open Source Versus Cloud Native Versus Third-Party Products
    Data Consumption Workloads
    BI workloads
    AI/ML workloads
    Key Takeaways
    References
    6. Data (and AI) Governance and Security in Lakehouse Architecture
    What Is Data Governance and Data Security?
    Benefits of Data Governance and Data Security
    Unified Governance and Security in Lakehouse Architecture
    Governance and Security Processes in Lakehouse Architecture
    Metadata Management
    Compliance and Regulations
    Data and ML Model Quality
    Lineage Across Data and AI assets
    Understanding data flow
    Performing impact analysis
    Identifying unused objects
    Tracking sensitive data
    Data and AI Asset Sharing
    Data Ownership
    Auditing and Monitoring
    Access Management
    Data Protection
    Data at rest
    Data in transit
    Handling Sensitive Data
    Identify sensitive data
    Anonymize sensitive data
    What’s Your Role?
    Key Takeaways
    References
    7. The Big Picture: Designing and Implementing a Lakehouse Platform
    Pre-design Activities
    Understanding Platform Requirements
    Studying Existing System
    Understanding the Organization’s Vision and Data Strategy
    Conducting Workshops and Interviews
    Choosing the Right Architecture
    Establishing Guiding Principles
    Data Ecosystem
    Scalability and Performance
    Cost Control and Optimization
    Platform Operations
    Governance and Security
    Design Considerations and Implementation Best Practices
    Architecture Blueprint
    Data Ingestion
    Data ingestion considerations
    Ingestion frequency
    Source system types
    Identify incremental data (change data capture)
    Sensitive data
    Technology choices
    Best practices
    Data Storage
    Storage zones considerations
    Raw zone
    Cleansed zone
    Curated zone
    Semantic zone
    Data modeling considerations
    Entity relationship (ER) modeling
    Data Vault modeling
    Dimensional modeling
    Best practices
    Data Processing
    Data processing considerations
    Open table format conversion
    Schema and data quality validations
    Data integration
    Data transformations and enrichment
    Best practices
    Data Consumption and Delivery
    Workload considerations
    Best practices
    Common Services
    Metadata management
    Governance and security
    Platform operations
    DataOps
    MLOps
    Best practices
    Design References
    Step-by-Step Design Guide
    Design Questionnaire
    Key Takeaways
    References
    8. Lakehouse in the Real World
    Delivering a Real-World Lakehouse
    Estimation and Planning Phase
    Estimation
    Planning
    Analysis and Design Phase
    Analyzing the Existing System
    Data Modeling
    Finalizing the Tech Stack
    Implementation and Test Phase
    Historical Data Migration
    Data Reconciliation and Testing
    Reverse Engineering
    Data Quality and Handling Sensitive Data
    Support and Maintenance Phase
    Auditing and Tracking
    Disaster Recovery Strategy
    Decommissioning the Old System
    Delivery References
    Project Deliverables
    Reference Architectures
    Cloud native implementation
    Third-party platform implementation
    Key Takeaways
    References
    9. Lakehouse of the Future
    Warehouse to Lakehouse: What’s Next?
    Data Mesh
    HTAP
    Zero ETL
    Interoperability and New Formats
    Universal Format (UniForm)
    Apache XTable
    Upcoming File and Table Formats
    Managed Platforms for Public and Private Clouds
    Microsoft Fabric and Other Platforms
    Managed Lakehouse for Private Cloud Platform
    AI in a Lakehouse
    Key Takeaways
    Book Conclusion
    References
    Index


    📜 SIMILAR VOLUMES


    Practical Lakehouse Architecture: Design
    ✍ Gaurav Ashok Thalpati 📂 Library 📅 2024 🏛 O'Reilly Media 🌐 English

    This concise yet comprehensive guide explains how to adopt a data lakehouse architecture to implement modern data platforms. It reviews the design considerations, challenges, and best practices for implementing a lakehouse and provides key insights into the ways that using a lakehouse can i

    Data Lakehouse in Action: Architecting a
    ✍ Pradeep Menon 📂 Library 📅 2022 🏛 Packt Publishing 🌐 English

    Propose a new scalable data architecture paradigm, Data Lakehouse, that addresses the limitations of current data architecture patterns. Key Features: Understand how data is ingested, stored, served, governed, and secured for enabling data an

    Architecting Modern Data Platforms: A Gu
    ✍ Jan Kunigk, Ian Buss, Paul Wilkinson, Lars George 📂 Library 🏛 O'Reilly Media 🌐 English

    There’s a lot of information about big data technologies, but splicing these technologies into an end-to-end enterprise data platform is a daunting task not widely covered. With this practical book, you’ll learn how to build big data infrastructure both on-premises and in the cloud and succ

    Deciphering Data Architectures: Choosing
    ✍ James Serra 📂 Library 📅 2023 🏛 O'Reilly Media 🌐 English

    Data fabric, data lakehouse, and data mesh have recently appeared as viable alternatives to the modern data warehouse. These new architectures have solid benefits, but they're also surrounded by a lot of hyperbole and confusion. This practical book provides a guided tour of each architecture to h