Learning and Operating Presto: Fast, Reliable SQL for Data Analytics and Lakehouses

✍ Scribed by Angelica Lo Duca

Publisher: O'Reilly Media
Year: 2023
Tongue: English
Leaves: 191
Category: Library


✦ Synopsis

The Presto community has mushroomed since its origins at Facebook in 2012. But ramping up this open source distributed SQL query engine can be challenging even for the most experienced engineers. With this practical book, data engineers and architects, platform engineers, cloud engineers, and software engineers will learn how to begin Presto operations at their organizations and derive insights on datasets wherever they reside.

Authors Angelica Lo Duca, Tim Meehan, Vivek Bharathan, and Ying Su explain what Presto is, where it came from, and how it differs from other data warehousing solutions. You'll discover why Facebook, Uber, Alibaba Cloud, Hewlett Packard Enterprise, IBM, Intel, and many more use Presto and how you can quickly deploy Presto in production.

With this book, you will:

Learn how to install and configure Presto

Use Presto with business intelligence tools

Understand how to connect Presto to a variety of data...

✦ Table of Contents

Preface
Why We Wrote This Book
Who This Book Is For
Conventions Used in This Book
Using Code Examples
O’Reilly Online Learning
How to Contact Us
Acknowledgments
Angelica Lo Duca
Tim Meehan
Vivek Bharathan
Ying Su
1. Introduction to Presto
Data Warehouses and Data Lakes
The Role of Presto in a Data Lake
Presto Origins and Design Considerations
High Performance
High Scalability
Compliance with the ANSI SQL Standard
Federation of Data Sources
Running in the Cloud
Presto Architecture and Core Components
Alternatives to Presto
Apache Impala
Apache Hive
Spark SQL
Trino
Presto Use Cases
Reporting and Dashboarding
Ad Hoc Querying
ETL Using SQL
Data Lakehouse
Real-Time Analytics with Real-Time Databases
Introducing Our Case Study
Conclusion
2. Getting Started with Presto
Presto Manual Installation
Running Presto on Docker
Installing Docker
Presto Docker Image
Dockerfile
The etc/ directory
node.properties
jvm.config
config.properties
log.properties
catalog/.properties
Building and Running Presto on Docker
The Presto Sandbox
Deploying Presto on Kubernetes
Introducing Kubernetes
Configuring Presto on Kubernetes
presto-coordinator.yaml
presto-workers.yaml
presto-config-map.yaml
presto-secrets.yaml
Adding a New Catalog
Running the Deployment on Kubernetes
Querying Your Presto Instance
Listing Catalogs
Listing Schemas
Listing Tables
Querying a Table
Conclusion
3. Connectors
Service Provider Interface
Connector Architecture
Popular Connectors
Thrift
Writing a Custom Connector
Prerequisites
Plugin and Module
ExamplePlugin
ExampleConnectorFactory
ExampleModule
ExampleConnector
ExampleHandleResolver
Configuration
ExampleConfig
SessionProperties
TableProperties
Metadata
Data model
Handles
ExampleMetadata
ExampleClient
Input/Output
ExampleSplitManager
ExampleSplit
ExampleRecordSetProvider and ExampleRecordSet
ExampleRecordCursor
Deploying Your Connector
Apache Pinot
Setting Up and Configuring Presto
Setting up Pinot
Configuring Pinot
Configuring Presto with Pinot
Presto-Pinot Querying in Action
Conclusion
4. Client Connectivity
Setting Up the Environment
Presto Client
Docker Image
Kubernetes Node
Connectivity to Presto
REST API
Python
R
JDBC
Node.js
ODBC
Other Presto Client Libraries
Building a Client Dashboard in Python
Setting Up the Client
Building the Dashboard
Connecting to and querying Presto
Preparing the results of the query
Building the first graph
Building the second graph
Conclusion
5. Open Data Lakehouse Analytics
The Emergence of the Lakehouse
Data Lakehouse Architecture
Data Lake
File Store
File Format
Table Format
Query Engine
Metadata Management
Data Governance
Data Access Control
Building a Data Lakehouse
Configuring MinIO
Populating MinIO
Configuring HMS
Configuring Spark
Registering Hudi Tables with HMS
Connecting and Querying Presto
Conclusion
6. Presto Administration
Introducing Presto Administration
Configuration
Properties
How to configure a cluster
Sessions
Using sessions
JVM
Memory
Out-of-memory errors
Garbage collection
Monitoring
Console
Using the console for monitoring
Using the console for debugging
Using the console for going over the interactive plan
REST API
Metrics
JMX connector
REST API
JMX exporters
Management
Resource Groups
Configuring resource groups
Resource groups properties
Example
Verifiers
Setting up the system
Configuring the MySQL database
Configuring the Presto verifier
Running a test
Session Properties Managers
Configuring a session property manager
Namespace Functions
Setting up the system
Configuring a function
Running a test
Conclusion
7. Understanding Security in Presto
Introducing Presto Security
Building Secure Communication in Presto
Encryption
Keystore Management
Configuring HTTPS/TLS
Running a Presto client
Running the Presto console
Authentication
File-Based Authentication
Running a Presto client
Running the Presto console
LDAP
Kerberos
Prerequisites
Configuring the Presto coordinator and workers
Configuring the Presto client
Creating a Custom Authenticator
Authorization
Authorizing Access to the Presto REST API
Configuring System Access Control
Authorization Through Apache Ranger
Building a custom audit function
Conclusion
8. Performance Tuning
Introducing Performance Tuning
Reasons for Performance Tuning
The Performance Tuning Life Cycle
Query Execution Model
Approaches for Performance Tuning in Presto
Resource Allocation
Storage
Query Optimization
Aria Scan
Table Scanning
Repartitioning
Implementing Performance Tuning
Building and Importing the Sample CSV Table in MinIO
Converting the CSV Table in ORC
Defining the Tuning Parameters
Running Tests
Default parameters
Reducing CPU usage
Query optimization
Aria scan
Conclusion
9. Operating Presto at Scale
Introducing Scalability
Reasons to Scale Presto
Common Issues
Design Considerations
Availability
Manageability
Performance
Protection
Configuration
How to Scale Presto
Multiple Coordinators
Presto on Spark
Spilling
Using a Cloud Service
Conclusion
Index

📜 SIMILAR VOLUMES

Learning and Operating Presto: Fast, Rel

📁 Learning and Operating Presto: Fast, Reliable SQL for Data Analytics and Lakehouses

✍ Angelica Lo Duca, Tim Meehan, Vivek Bharathan, Ying Su 📂 Library 📅 2023 🏛 O’Reilly Media 🌐 English

Learning and Operating Presto: Fast, Rel

📁 Learning and Operating Presto: Fast, Reliable SQL for Data Analytics and Lakehouses

✍ Angelica Lo Duca, Vivek Bharathan, Ying Su 📂 Library 🏛 O'Reilly Media 🌐 English

The Presto community has mushroomed since its origins at Facebook in 2012. But ramping up this distributed SQL query engine can be challenging even for the most experienced engineers. This practical book shows you how to begin Presto operations at your organization to derive insights on dat

SQL for Data Analytics: Perform fast and

📁 SQL for Data Analytics: Perform fast and efficient data analysis with the power of SQL

✍ Upom Malik, Matt Goldwasser, Benjamin Johnston 📂 Library 📅 2019 🏛 Packt Publishing 🌐 English

Take your first steps to become a fully qualified data analyst by learning how to explore large relational datasets. Key Features • Explore a variety of statistical techniques to analyze your data • Integrate your SQL pipelines with other analytics technologies • Perform adv

SQL for Data Analytics: Perform fast and

📁 SQL for Data Analytics: Perform fast and efficient data analysis with the power of SQL

✍ Upom Malik, Matt Goldwasser, Benjamin Johnston 📂 Library 📅 2019 🏛 Packt Publishing 🌐 English

Take your first steps to become a fully qualified data analyst by learning how to explore large relational datasets. Key Features • Explore a variety of statistical techniques to analyze your data • Integrate your SQL pipelines with other analytics technologies • Perform advanced analytics suc

SQL for Data Analytics: Perform Fast and

📁 SQL for Data Analytics: Perform Fast and Efficient Data Analysis with the Power of SQL

✍ Upom Malik; Matt Goldwasser; Benjamin Johnston 📂 Library 📅 2019 🌐 English

Take your first steps to become a fully qualified data analyst by learning how to explore large relational datasets. Key Features • Explore a variety of statistical techniques to analyze your data • Integrate your SQL pipelines with other analytics technologies • Perform advanced analytics such as geospat

Open Data for Education: Linked, Shared,

📁 Open Data for Education: Linked, Shared, and Reusable Data for Teaching and Learning

✍ Dmitry Mouromtsev, Mathieu d’Aquin (eds.) 📂 Library 📅 2016 🏛 Springer International Publishing 🌐 English

This volume comprises a collection of papers presented at an Open Data in Education Seminar and the LILE workshops during 2014-2015. In the first part of the book, two chapters give different perspectives on the current use of linked and open data in education, including the use of techn