𝔖 Scriptorium
✦   LIBER   ✦


Beginning Apache Spark Using Azure Databricks: Unleashing Large Cluster Analytics in the Cloud

✍ Scribed by Robert Ilijason


Publisher
Apress
Year
2020
Tongue
English
Leaves
281
Edition
1
Category
Library


✦ Synopsis


Analyze vast amounts of data in record time using Apache Spark with Databricks in the cloud. Learn the fundamentals, and more, of running analytics on large clusters in Azure and AWS, using Apache Spark with Databricks on top. Discover how to squeeze the most value out of your data at a mere fraction of what classical analytics solutions cost, while getting the results you need faster.

This book explains how the confluence of these pivotal technologies gives you enormous power over huge datasets, at low cost. You will begin by learning how cloud infrastructure makes it possible to scale your code across large numbers of processing units without paying for the machinery in advance. From there you will learn how Apache Spark, an open source framework, puts all those CPUs to work for data analytics. Finally, you will see how services such as Databricks provide the power of Apache Spark without requiring you to know anything about configuring hardware or software. By removing the need for expensive experts and hardware, your resources can instead be spent on actually finding business value in the data.

This book guides you through advanced topics such as analytics in the cloud, data lakes, data ingestion, architecture, and machine learning, using tools including Apache Spark, Apache Hadoop, Apache Hive, Python, and SQL. Valuable exercises help reinforce what you have learned.


What You Will Learn

  • Discover the value of big data analytics that leverage the power of the cloud
  • Get started with Databricks using SQL and Python in either Microsoft Azure or AWS
  • Understand the underlying technology, and how the cloud and Apache Spark fit into the bigger picture
  • See how these tools are used in the real world
  • Run basic analytics, including machine learning, on billions of rows at a fraction of the usual cost, or even for free


Who This Book Is For

Data engineers, data scientists, and cloud architects who want or need to run advanced analytics in the cloud. It is assumed that the reader has data experience, but perhaps minimal exposure to Apache Spark and Azure Databricks. The book is also recommended for people who want to get started in the analytics field, as it provides a strong foundation.

✦ Table of Contents


About the Author
About the Technical Reviewer
Introduction
Chapter 1: Introduction to Large-Scale Data Analytics
Analytics, the hype
Analytics, the reality
Large-scale analytics for fun and profit
Data: Fueling analytics
Free as in speech. And beer!
Into the clouds
Databricks: Analytics for the lazy ones
How to analyze data
Large-scale examples from the real world
Telematics at Volvo Trucks
Fraud detection at Visa
Customer analytics at Target
Targeted ads at Cambridge Analytica
Summary
Chapter 2: Spark and Databricks
Apache Spark, the short overview
Databricks: Managed Spark
The far side of the Databricks moon
Spark architecture
Apache Spark processing
Working with data
Data processing
Storing data
Cool components on top of Core
Summary
Chapter 3: Getting Started with Databricks
Cloud-only
Community edition: No money? No problem
Mostly good enough
Getting started with the community edition
Commercial editions: The ones you want
Databricks on Amazon Web Services
Azure Databricks
Summary
Chapter 4: Workspaces, Clusters, and Notebooks
Getting around in the UI
Clusters: Powering up the engines
Data: Getting access to the fuel
Notebooks: Where the work happens
Summary
Chapter 5: Getting Data into Databricks
Databricks File System
Navigating the file system
The FileStore, a portal to your data
Schemas, databases, and tables
Hive Metastore
The many types of source files
Going binary
Alternative transportation
Importing from your computer
Getting data from the Web
Working with the shell
Basic importing with Python
Getting data with SQL
Mounting a file system
Mounting example: Amazon S3
Mounting example: Microsoft Blob Storage
Getting rid of the mounts
How to get data out of Databricks
Summary
Chapter 6: Querying Data Using SQL
The Databricks flavor
Getting started
Picking up data
Filtering data
Joins and merges
Ordering data
Functions
Windowing functions
A view worth keeping
Hierarchical data
Creating data
Manipulating data
Delta Lake SQL
UPDATE, DELETE, and MERGE
Keeping Delta Lake in order
Transaction logs
Selecting metadata
Gathering statistics
Summary
Chapter 7: The Power of Python
Python: The language of choice
A turbo-charged intro to Python
Finding the data
DataFrames: Where active data lives
Getting some data
Selecting data from DataFrames
Chaining combo commands
Working with multiple DataFrames
Slamming data together
Summary
Chapter 8: ETL and Advanced Data Wrangling
ETL: A recap
An overview of the Spark UI
Cleaning and transforming data
Finding nulls
Getting rid of nulls
Filling nulls with values
Removing duplicates
Identifying and clearing out extreme values
Taking care of columns
Pivoting
Explode
When being lazy is good
Caching data
Data compression
A short note about functions
Lambda functions
Storing and shuffling data
Save modes
Managed vs. unmanaged tables
Handling partitions
Summary
Chapter 9: Connecting to and from Databricks
Connecting to and from Databricks
Getting ODBC and JDBC up and running
Creating a token
Preparing the cluster
Let’s create a test table
Setting up ODBC on Windows
Setting up ODBC on OS X
Connecting tools to Databricks
Microsoft Excel on Windows
Microsoft Power BI Desktop on Windows
Tableau on OS X
PyCharm (and more) via Databricks Connect
Using RStudio Server
Accessing external systems
A quick recap of libraries
Connecting to external systems
Azure SQL
Oracle
MongoDB
Summary
Chapter 10: Running in Production
General advice
Assume the worst
Write rerunnable code
Document in the code
Write clear, simple code
Print relevant stuff
Jobs
Scheduling
Running notebooks from notebooks
Widgets
Running jobs with parameters
The command line interface
Setting up the CLI
Running CLI commands
Creating and running jobs
Accessing the Databricks File System
Picking up notebooks
Keeping secrets
Secrets with privileges
Revisiting cost
Users, groups, and security options
Users and groups
Using SCIM provisioning
Access Control
Workspace Access Control
Cluster, Pool, and Jobs Access Control
Table Access Control
Personal Access Tokens
The rest
Summary
Chapter 11: Bits and Pieces
MLlib
Frequent Pattern Growth
Creating some data
Preparing the data
Running the algorithm
Parsing the results
MLflow
Running the code
Checking the results
Updating tables
Create the original table
Connect from Databricks
Pulling the delta
Verifying the formats
Update the table
A short note about Pandas
Koalas, Pandas for Spark
Playing around with Koalas
The future of Koalas
The art of presenting data
Preparing data
Using Matplotlib
Building and showing the dashboard
Adding a widget
Adding a graph
Schedule run
REST API and Databricks
What you can do
What you can’t do
Getting ready for APIs
Example: Get cluster data
Example: Set up and execute a job
Example: Get the notebooks
All the APIs and what they do
Delta streaming
Running a stream
Checking and stopping the streams
Running it faster
Using checkpoints
Index


📜 SIMILAR VOLUMES


Beginning Apache Spark Using Azure Databricks
✍ Robert Ilijason 📂 Library 📅 2020 🏛 Apress 🌐 English

Analyze vast amounts of data in record time using Apache Spark with Databricks in the Cloud. Learn the fundamentals, and more, of running analytics on large clusters in Azure and AWS, using Apache Spark with Databricks on top. Discover how to squeeze the most value out of your data at a mere fraction…


Databricks. Using Apache Spark
📂 Library 🌐 English

A manual from the Databricks company on using Apache Spark. Contents: Introduction; Log Analysis with Spark (Introduction to Apache Spark, Importing Data, Exporting Data, Log Analyzer Application); Twitter Streaming Language…

Azure Databricks Cookbook: Accelerate an…
✍ Phani Raj, Vinod Jaiswal 📂 Library 📅 2021 🏛 Packt Publishing 🌐 English

Get to grips with building and productionizing end-to-end big data solutions in Azure and learn best practices for working with large datasets. Key Features: Integrate with Azure Synapse Analytics, Cosmos DB, and Azure HDInsight Kafka Cluster to scale and analyze your proj…


Data Engineering with Databricks Cookbook
✍ Pulkit Chadha 📂 Library 📅 2024 🏛 Packt Publishing 🌐 English

Work through 70 recipes for implementing reliable data pipelines with Apache Spark, optimally store and process structured and unstructured data in Delta Lake, and use Databricks to orchestrate and govern your data. Key Features: Learn data ingestion, data transformation, and data management techniques…