Data Engineering with Google Cloud Platform: A practical guide to operationalizing scalable data analytics systems on GCP

✍ Scribed by Adi Wijaya

Publisher: Packt Publishing
Year: 2022
Language: English
Pages: 440
Category: Library

✦ Synopsis

Build and deploy your own data pipelines on GCP, make key architectural decisions, and gain the confidence to boost your career as a data engineer

Key Features

Understand data engineering concepts, the role of a data engineer, and the benefits of using GCP for building your solution
Learn how to use the various GCP products to ingest, consume, and transform data and orchestrate pipelines
Discover tips to prepare for and pass the Professional Data Engineer exam

Book Description

With this book, you'll understand how the highly scalable Google Cloud Platform (GCP) enables data engineers to create end-to-end data pipelines right from storing and processing data and workflow orchestration to presenting data through visualization dashboards.

Starting with a quick overview of the fundamental concepts of data engineering, you'll learn the various responsibilities of a data engineer and how GCP plays a vital role in fulfilling those responsibilities. As you progress through the chapters, you'll be able to leverage GCP products to build a sample data warehouse using Cloud Storage and BigQuery and a data lake using Dataproc. The book gradually takes you through operations such as data ingestion, data cleansing, transformation, and integrating data with other sources. You'll learn how to design IAM for data governance, deploy ML pipelines with Vertex AI, leverage pre-built GCP models as a service, and visualize data with Google Data Studio to build compelling reports. Finally, you'll find tips on how to boost your career as a data engineer, take the Professional Data Engineer certification exam, and get ready to become an expert in data engineering with GCP.

By the end of this data engineering book, you'll have developed the skills to perform core data engineering tasks and build efficient ETL data pipelines with GCP.

What you will learn

Load data into BigQuery and materialize its output for downstream consumption
Build data pipeline orchestration using Cloud Composer
Develop Airflow jobs to orchestrate and automate a data warehouse
Build a Hadoop data lake, create ephemeral clusters, and run jobs on the Dataproc cluster
Leverage Pub/Sub for messaging and ingestion for event-driven systems
Use Dataflow to perform ETL on streaming data
Unlock the power of your data with Data Studio
Estimate the GCP cost of your end-to-end data solutions

Who this book is for

This book is for data engineers, data analysts, and anyone looking to design and manage data processing pipelines using GCP. You'll find this book useful if you are preparing to take Google's Professional Data Engineer exam. Beginner-level understanding of data science, the Python programming language, and Linux commands is necessary. A basic understanding of data processing and cloud computing, in general, will help you make the most out of this book.

Fundamentals of Data Engineering
Big Data Capabilities on GCP
Building a Data Warehouse in BigQuery
Building Orchestration for Batch Data Loading Using Cloud Composer
Building a Data Lake Using Dataproc
Processing Streaming Data with Pub/Sub and Dataflow
Visualizing Data for Making Data-Driven Decisions with Data Studio
Building Machine Learning Solutions on Google Cloud Platform
User and Project Management in GCP
Cost Strategy in GCP
CI/CD on Google Cloud Platform for Data Engineers
Boosting Your Confidence as a Data Engineer

✦ Table of Contents

Cover
Title page
Copyright and Credits
Contributors
Table of Contents
Preface
Section 1: Getting Started with Data Engineering with GCP
Chapter 1: Fundamentals of Data Engineering
Understanding the data life cycle
Understanding the need for a data warehouse
Knowing the roles of a data engineer before starting
Data engineer versus data scientist
The focus of data engineers
Foundational concepts for data engineering
ETL concept in data engineering
The difference between ETL and ELT
What is NOT big data?
A quick look at how big data technologies store data
A quick look at how to process multiple files using MapReduce
Summary
Exercise
See also
Chapter 2: Big Data Capabilities on GCP
Technical requirements
Understanding what the cloud is
The difference between the cloud and non-cloud era
The on-demand nature of the cloud
Getting started with Google Cloud Platform
Introduction to the GCP console
Practicing pinning services
Creating your first GCP project
Using GCP Cloud Shell
A quick overview of GCP services for data engineering
Understanding the GCP serverless service
Service mapping and prioritization
The concept of quotas on GCP services
User account versus service account
Summary
Section 2: Building Solutions with GCP Components
Chapter 3: Building a Data Warehouse in BigQuery
Technical requirements
Introduction to Google Cloud Storage and BigQuery
BigQuery data location
Introduction to the BigQuery console
Creating a dataset in BigQuery using the console
Loading a local CSV file into the BigQuery table
Using public data in BigQuery
Data types in BigQuery compared to other databases
Timestamp data in BigQuery compared to other databases
Preparing the prerequisites before developing our data warehouse
Step 1: Access your Cloud Shell
Step 2: Check the current setup using the command line
Step 3: The gcloud init command
Step 4: Download example data from Git
Step 5: Upload data to GCS from Git
Practicing developing a data warehouse
Data warehouse in BigQuery – Requirements for scenario 1
Steps and planning for handling scenario 1
Data warehouse in BigQuery – Requirements for scenario 2
Steps and planning for handling scenario 2
Summary
Exercise – Scenario 3
See also
Chapter 4: Build Orchestration for Batch Data Loading Using Cloud Composer
Technical requirements
Introduction to Cloud Composer
Understanding the working of Airflow
Provisioning Cloud Composer in a GCP project
Exercise: Build data pipeline orchestration using Cloud Composer
Level 1 DAG – Creating dummy workflows
Level 2 DAG – Scheduling a pipeline from Cloud SQL to GCS and BigQuery datasets
Level 3 DAG – Parameterized variables
Level 4 DAG – Guaranteeing task idempotency in Cloud Composer
Level 5 DAG – Handling late data using a sensor
Summary
Chapter 5: Building a Data Lake Using Dataproc
Technical requirements
Introduction to Dataproc
A brief history of the data lake and Hadoop ecosystem
A deeper look into Hadoop components
How much Hadoop-related knowledge do you need on GCP?
Introducing the Spark RDD and the DataFrame concept
Introducing the data lake concept
Hadoop and Dataproc positioning on GCP
Exercise – Building a data lake on a Dataproc cluster
Creating a Dataproc cluster on GCP
Using Cloud Storage as an underlying Dataproc file system
Exercise: Creating and running jobs on a Dataproc cluster
Preparing log data in GCS and HDFS
Developing Spark ETL from HDFS to HDFS
Developing Spark ETL from GCS to GCS
Developing Spark ETL from GCS to BigQuery
Understanding the concept of the ephemeral cluster
Practicing using a workflow template on Dataproc
Building an ephemeral cluster using Dataproc and Cloud Composer
Summary
Chapter 6: Processing Streaming Data with Pub/Sub and Dataflow
Technical requirements
Processing streaming data
Streaming data for data engineers
Introduction to Pub/Sub
Introduction to Dataflow
Exercise – Publishing event streams to Cloud Pub/Sub
Creating a Pub/Sub topic
Creating and running a Pub/Sub publisher using Python
Creating a Pub/Sub subscription
Exercise – Using Cloud Dataflow to stream data from Pub/Sub to GCS
Creating a HelloWorld application using Apache Beam
Creating a Dataflow streaming job without aggregation
Creating a streaming job with aggregation
Summary
Chapter 7: Visualizing Data for Making Data-Driven Decisions with Data Studio
Technical requirements
Unlocking the power of your data with Data Studio
From data to metrics in minutes with an illustrative use case
Understanding what BigQuery INFORMATION_SCHEMA is
Exercise – Exploring the BigQuery INFORMATION_SCHEMA table using Data Studio
Exercise – Creating a Data Studio report using data from a bike-sharing data warehouse
Understanding how Data Studio can impact the cost of BigQuery
What kind of table could be 1 TB in size?
How can a table be accessed 10,000 times in a month?
How to create materialized views and understanding how BI Engine works
Understanding BI Engine
Summary
Chapter 8: Building Machine Learning Solutions on Google Cloud Platform
Technical requirements
A quick look at machine learning
Exercise – practicing ML code using Python
Preparing the ML dataset by using a table from the BigQuery public dataset
Training the ML model using Random Forest in Python
Creating Batch Prediction using the training dataset's output
The MLOps landscape in GCP
Understanding the basic principles of MLOps
Introducing GCP services related to MLOps
Exercise – leveraging pre-built GCP models as a service
Uploading the image to a GCS bucket
Creating a detect text function in Python
Exercise – using GCP in AutoML to train an ML model
Exercise – deploying a dummy workflow with Vertex AI Pipeline
Creating a dedicated regional GCS bucket
Developing the pipeline on Python
Monitoring the pipeline on the Vertex AI Pipeline console
Exercise – deploying a scikit-learn model pipeline with Vertex AI
Creating the first pipeline, which will result in an ML model file in GCS
Running the first pipeline in Vertex AI Pipeline
Creating the second pipeline, which will use the model file and store the prediction results as a CSV file in GCS
Running the second pipeline in Vertex AI Pipeline
Summary
Section 3: Key Strategies for Architecting Top-Notch Data Pipelines
Chapter 9: User and Project Management in GCP
Technical requirements
Understanding IAM in GCP
Planning a GCP project structure
Understanding the GCP organization, folder, and project hierarchy
Deciding how many projects we should have in a GCP organization
Controlling user access to our data warehouse
Use-case scenario – planning a BigQuery ACL on an e-commerce organization
Column-level security in BigQuery
Practicing the concept of IaC using Terraform
Exercise – creating and running basic Terraform scripts
Self-exercise – managing a GCP project and resources using Terraform
Summary
Chapter 10: Cost Strategy in GCP
Technical requirements
Estimating the cost of your end-to-end data solution in GCP
Comparing BigQuery on-demand and flat-rate
Example – estimating data engineering use case
Tips for optimizing BigQuery using partitioned and clustered tables
Partitioned tables
Clustered tables
Exercise – optimizing BigQuery on-demand cost
Summary
Chapter 11: CI/CD on Google Cloud Platform for Data Engineers
Technical requirements
Introduction to CI/CD
Understanding the data engineer's relationship with CI/CD practices
Understanding CI/CD components with GCP services
Exercise – implementing continuous integration using Cloud Build
Creating a GitHub repository using Cloud Source Repository
Developing the code and Cloud Build scripts
Creating the Cloud Build Trigger
Pushing the code to the GitHub repository
Exercise – deploying Cloud Composer jobs using Cloud Build
Preparing the CI/CD environment
Preparing the cloudbuild.yaml configuration file
Pushing the DAG to our GitHub repository
Checking the CI/CD result in the GCS bucket and Cloud Composer
Summary
Further reading
Chapter 12: Boosting Your Confidence as a Data Engineer
Overviewing the Google Cloud certification
Exam preparation tips
Extra GCP services material
Quiz – reviewing all the concepts you've learned about
Questions
Answers
The past, present, and future of Data Engineering
Boosting your confidence and final thoughts
Summary
Index
Other Books You May Enjoy

📜 SIMILAR VOLUMES

📁 Data Engineering with Google Cloud Platform: A practical guide to operationalizing scalable data analytics systems on GCP

✍ Adi Wijaya 📂 Library 📅 2022 🏛 Packt Publishing 🌐 English

Build and deploy your own data pipelines on GCP, make key architectural decisions, and gain the confidence to boost your career as a data engineer. Key Features: Understand data engineering concepts, the role of a data engineer, and the benef...

📁 Data engineering with Google Cloud Platform : a practical guide to operationalizing scalable data analytics systems on GCP

✍ Adi Widjaja 📂 Library 📅 2022 🌐 English

📁 Data Analytics with Google Cloud Platform: Build Real Time Data Analytics on Google Cloud Platform (English Edition)

✍ Murari Ramuka 📂 Library 🌐 English

Step-by-step guide to different data movement and processing techniques, using Google Cloud Platform Services. Key Features: Learn the basic concept of Cloud Computing along with different Cloud service providers with their supported Mo...

📁 Data Engineering with dbt: A practical guide to building a cloud-based pragmatic and dependable data platform with SQL

✍ Roberto Zagni 📂 Library 📅 2023 🏛 Packt Publishing Pvt. Ltd. 🌐 English

Use easy-to-apply patterns in SQL and Python to adopt modern analytics engineering to build agile platforms with dbt that are well-tested and simple to extend and run. Purchase of the print or Kindle book includes a free PDF eBook. Key Features: Build a solid dbt base and learn data modeling and...

📁 Data Engineering with dbt: A practical guide to building a cloud-based, pragmatic, and dependable data platform with SQL

✍ Roberto Zagni 📂 Library 📅 2023 🏛 Packt Publishing 🌐 English

Use easy-to-apply patterns in SQL and Python to adopt modern analytics engineering to build agile platforms with dbt that are well-tested and simple to extend and run. Purchase of the print or Kindle book includes a free PDF eBook. Key Features: ...
