𝔖 Scriptorium
✦   LIBER   ✦


Modern Data Architectures with Python: A modern approach to building data ecosystems

โœ Scribed by Brian Lipp


Publisher: Packt Publishing
Year: 2023
Tongue: English
Leaves: 318
Category: Library


✦ Synopsis


Learn to build scalable and reliable data ecosystems using Data Mesh, Databricks Spark, and Kafka.

Key Features

  • Develop modern data skills in emerging technologies
  • Learn pragmatic design methodologies like Data Mesh and Lake House
  • Grow a deeper understanding of data governance

Book Description

Modern Data Architectures with Python will teach you how to integrate your machine learning and data science workstreams into your data platform. You will also learn how to take your data and build open lakehouses that can combine with any technology. This book will give you deep hands-on experience with tools such as Kafka, Apache Spark, MongoDB, Neo4j, Delta Lake, MLflow, and SQL dashboards.

By the end of this journey, you will have amassed a wealth of hands-on and theoretical knowledge to architect your own data ecosystems.

What you will learn

  • Understand data patterns such as the Delta architecture
  • Learn key details of Spark internals and how to increase performance
  • Discover how to design critical data diagrams
  • Explore MLOps with tools like AutoML and MLflow
  • Learn to build data products in a data mesh
  • Discover data governance and how to build confidence in your data
  • Learn how to introduce data visualizations and dashboards into your data practice
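The Delta architecture and medallion pattern named above refine raw data through bronze (raw as ingested), silver (cleaned and typed), and gold (business-level aggregate) layers. As a rough, hypothetical sketch of that idea in plain Python (the book itself works in PySpark and Delta Lake; the record fields here are invented for illustration):

```python
# Bronze: raw records exactly as ingested, errors and all (hypothetical sales data).
bronze = [
    {"order_id": "1", "amount": "19.99", "country": "US"},
    {"order_id": "2", "amount": "bad",   "country": "US"},   # malformed row
    {"order_id": "3", "amount": "5.00",  "country": "DE"},
]

def to_silver(rows):
    """Silver layer: cleaned and typed records; malformed rows are dropped."""
    silver = []
    for row in rows:
        try:
            silver.append({
                "order_id": int(row["order_id"]),
                "amount": float(row["amount"]),
                "country": row["country"],
            })
        except ValueError:
            continue  # discard rows that fail type conversion
    return silver

def to_gold(rows):
    """Gold layer: business-level aggregate, here revenue per country."""
    totals = {}
    for row in rows:
        totals[row["country"]] = totals.get(row["country"], 0.0) + row["amount"]
    return totals

silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'US': 19.99, 'DE': 5.0}
```

In a lakehouse the same shape scales up: each layer becomes a Delta table and each function becomes a Spark job, which is the form the book develops.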

Who This Book Is For

This book is great for developers, analytics engineers, and managers looking to further develop a data ecosystem within their organization. Basic Python is useful but not required, and prior experience with data helps but is not necessary to read the book and complete the labs.

Table of Contents

  1. Modern Data Processing Architectures
  2. Basics of Data Analytics Engineering
  3. Cloud Storage and Processing Concepts
  4. Python Batch and Stream Processing with Spark
  5. Streaming Data with Kafka
  6. Python MLOps
  7. Python and SQL based Visualizations
  8. Integrating CI into your workflow
  9. Data Orchestration
  10. Data Governance
  11. Introduction to Saturn Insurance, Deploying CI and ELT
  12. Data Governance and Dashboards

✦ Table of Contents


Cover
Title Page
Copyright and Credits
Dedications
Contributors
Table of Contents
Preface
Part 1: Fundamental Data Knowledge
Chapter 1: Modern Data Processing Architecture
Technical requirements
Databases, data warehouses, and data lakes
OLTP
OLAP
Data lakes
Event stores
File formats
Data platform architecture at a high level
Comparing the Lambda and Kappa architectures
Lambda architecture
Kappa architecture
Lakehouse and Delta architectures
Lakehouses
The seven central tenets
The medallion data pattern and the Delta architecture
Data mesh theory and practice
Defining terms
The four principles of data mesh
Summary
Practical lab
Solution
Chapter 2: Understanding Data Analytics
Technical requirements
Setting up your environment
Python
venv
Graphviz
Workflow initialization
Cleaning and preparing your data
Duplicate values
Working with nulls
Using RegEx
Outlier identification
Casting columns
Fixing column names
Complex data types
Data documentation
diagrams
Data lineage graphs
Data modeling patterns
Relational
Dimensional modeling
Key terms
OBT
Practical lab
Loading the problem data
Solution
Summary
Part 2: Data Engineering Toolset
Chapter 3: Apache Spark Deep Dive
Technical requirements
Setting up your environment
Python, AWS, and Databricks
Databricks CLI
Cloud data storage
Object storage
Relational
NoSQL
Spark architecture
Introduction to Apache Spark
Key components
Working with partitions
Shuffling partitions
Caching
Broadcasting
Job creation pipeline
Delta Lake
Transaction log
Grouping tables with databases
Table
Adding speed with Z-ordering
Bloom filters
Practical lab
Problem 1
Problem 2
Problem 3
Solution
Summary
Chapter 4: Batch and Stream Data Processing Using PySpark
Technical requirements
Setting up your environment
Python, AWS, and Databricks
Databricks CLI
Batch processing
Partitioning
Data skew
Reading data
Spark schemas
Making decisions
Removing unwanted columns
Working with data in groups
The UDF
Stream processing
Reading from disk
Debugging
Writing to disk
Batch stream hybrid
Delta streaming
Batch processing in a stream
Practical lab
Setup
Creating fake data
Problem 1
Problem 2
Problem 3
Solution
Solution 1
Solution 2
Solution 3
Summary
Chapter 5: Streaming Data with Kafka
Technical requirements
Setting up your environment
Python, AWS, and Databricks
Databricks CLI
Confluent Kafka
Signing up
Kafka architecture
Topics
Partitions
Brokers
Producers
Consumers
Schema Registry
Kafka Connect
Spark and Kafka
Practical lab
Solution
Summary
Part 3: Modernizing the Data Platform
Chapter 6: MLOps
Technical requirements
Setting up your environment
Python, AWS, and Databricks
Databricks CLI
Introduction to machine learning
Understanding data
The basics of feature engineering
Splitting up your data
Fitting your data
Cross-validation
Understanding hyperparameters and parameters
Training our model
Working together
AutoML
MLflow
MLOps benefits
Feature stores
Hyperopt
Practical lab
Create an MLflow project
Summary
Chapter 7: Data and Information Visualization
Technical requirements
Setting up your environment
Principles of data visualization
Understanding your user
Validating your data
Data visualization using notebooks
Line charts
Bar charts
Histograms
Scatter plots
Pie charts
Bubble charts
A single line chart
A multiple line chart
A bar chart
A scatter plot
A histogram
A bubble chart
GUI data visualizations
Tips and tricks with Databricks notebooks
Magic
Markdown
Other languages
Terminal
Filesystem
Running other notebooks
Widgets
Databricks SQL analytics
Accessing SQL analytics
SQL Warehouses
SQL editors
Queries
Dashboards
Alerts
Query history
Connecting BI tools
Practical lab
Loading problem data
Problem 1
Solution
Problem 2
Solution
Summary
Chapter 8: Integrating Continuous Integration into Your Workflow
Technical requirements
Setting up your environment
Databricks
Databricks CLI
The DBX CLI
Docker
Git
GitHub
Pre-commit
Terraform
Docker
Install Jenkins, container setup, and compose
CI tooling
Git and GitHub
Pre-commit
Python wheels and packages
Anatomy of a package
DBX
Important commands
Testing code
Terraform – IaC
IaC
The CLI
HCL
Jenkins
Jenkinsfile
Practical lab
Problem 1
Problem 2
Summary
Chapter 9: Orchestrating Your Data Workflows
Technical requirements
Setting up your environment
Databricks
Databricks CLI
The DBX CLI
Orchestrating data workloads
Making life easier with Autoloader
Reading
Writing
Two modes
Useful options
Databricks Workflows
Terraform
Failed runs
REST APIs
The Databricks API
Python code
Logging
Practical lab
Solution
Lambda code
Notebook code
Summary
Part 4: Hands-on Project
Chapter 10: Data Governance
Technical requirements
Setting up your environment
Python, AWS, and Databricks
The Databricks CLI
What is data governance?
Data standards
Data catalogs
Data lineage
Data security and privacy
Data quality
Great Expectations
Creating test data
Data context
Data source
Batch request
Validator
Adding tests
Saving the suite
Creating a checkpoint
Datadocs
Testing new data
Profiler
Databricks Unity
Practical lab
Summary
Chapter 11: Building out the Groundwork
Technical requirements
Setting up your environment
The Databricks CLI
Git
GitHub
pre-commit
Terraform
PyPI
Creating GitHub repos
Terraform setup
Initial file setup
Schema repository
ML repository
Infrastructure repository
Summary
Chapter 12: Completing Our Project
Technical requirements
Documentation
Schema diagram
C4 System Context diagram
Faking data with Mockaroo
Managing our schemas with code
Building our data pipeline application
Creating our machine learning application
Displaying our data with dashboards
Summary
Index
Other Books You May Enjoy


📜 SIMILAR VOLUMES


Modern Data Architectures with Python: A
โœ Brian Lipp ๐Ÿ“‚ Library ๐Ÿ“… 2023 ๐Ÿ› Packt Publishing ๐ŸŒ English

Learn to build scalable and reliable data ecosystems using Data Mesh, Databricks Spark, and Kafka. Key Features Develop modern data skills in emerging technologies Learn pragmatic design methodologies like Data Mesh and Lake House Grow a deeper understanding of data governance Book Descript

Modern Data Architectures with Python: A
โœ Brian Lipp ๐Ÿ“‚ Library ๐Ÿ“… 2023 ๐Ÿ› Packt Publishing Pvt Ltd ๐ŸŒ English

Build scalable and reliable data ecosystems using Data Mesh, Databricks Spark, and Kafka Key Features Develop modern data skills used in emerging technologies Learn pragmatic design methodologies such as Data Mesh and data lakehouses Gain a deeper understanding of data governance Purchase of

Data Analysis With Python: A Modern Appr
โœ David Taieb ๐Ÿ“‚ Library ๐Ÿ“… 2018 ๐Ÿ› Packt Publishing ๐ŸŒ English

Data Analysis with Python offers a modern approach to data analysis so that you can work with the latest and most powerful Python tools, AI techniques, and open source libraries. Industry expert David Taieb shows you how to bridge data science with the power of programming and algorithms in Python.

Modern Data Mining with Python
โœ Dushyant Singh Sengar; Vikash Chandra ๐Ÿ“‚ Library ๐Ÿ“… 2024 ๐Ÿ› BPB Publications ๐ŸŒ English

"Modern Data Mining with Python" is a guidebook for responsibly implementing data mining techniques that involve collecting, storing, and analyzing large amounts of structured and unstructured data to extract useful insights and patterns. Enter into the world of data mining and machine learning