Introduction

Presto Lakehouse Workshop - Building an Open Data Lakehouse with Presto

Welcome to our workshop! In this workshop, you’ll learn the basics of Presto, the open-source SQL query engine, and its support for the big three open source table formats. You’ll get Presto running locally on your machine and connect it to an S3-based data source and a Hive metastore, which enables our Iceberg and Hudi integration. This is a beginner-level workshop for software developers and engineers who are new to Presto and table formats. At the end of the workshop, you will understand how to integrate Presto with table formats and MinIO, and you will know the unique features of each table format.

The goals of this workshop are to show you:

What table formats are and how to use them
How to connect Presto to MinIO S3 storage and a compatible Hive metastore using Docker
How to take advantage of table formats using Presto and why you would want to
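
As a rough preview of the second goal, the sketch below renders the kind of catalog properties file that points Presto's Iceberg connector at a Hive metastore and a MinIO endpoint. The property keys are standard Presto Hive and Iceberg connector settings, but the host names, ports, and credentials are placeholders assumed for illustration, and the helper `render_catalog_properties` is hypothetical. The files used in the labs may differ.

```python
def render_catalog_properties(metastore_uri: str, s3_endpoint: str,
                              access_key: str, secret_key: str) -> str:
    """Render an Iceberg catalog properties file for Presto as plain text.

    All argument values are supplied by the caller; nothing here is taken
    from the workshop's actual configuration.
    """
    properties = {
        "connector.name": "iceberg",
        # Use a Hive metastore as the Iceberg catalog.
        "iceberg.catalog.type": "hive",
        "hive.metastore.uri": metastore_uri,
        # MinIO speaks the S3 API, so the S3 settings point at it.
        "hive.s3.endpoint": s3_endpoint,
        "hive.s3.path-style-access": "true",
        "hive.s3.aws-access-key": access_key,
        "hive.s3.aws-secret-key": secret_key,
    }
    return "\n".join(f"{key}={value}" for key, value in properties.items())


if __name__ == "__main__":
    # Placeholder values (assumptions), not the workshop's real settings.
    print(render_catalog_properties(
        metastore_uri="thrift://hive-metastore:9083",
        s3_endpoint="http://minio:9000",
        access_key="example-access-key",
        secret_key="example-secret-key",
    ))
```

In a Docker-based setup, a file like this would typically be mounted into the Presto container's catalog directory, with the host names resolving to the other containers on the same Docker network.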

About this workshop

The introductory page of the workshop is broken down into the following sections:

Agenda
Compatibility
Technology Used
Credits

Agenda


Introduction	Introduction to the technologies used
Pre-work	Prerequisites for the workshop
Lab 1: Set up an Open Lakehouse	Set up Presto & Spark clusters, a Hive metastore, and an S3 storage mechanism
Lab 2: Create & Query Iceberg Tables	Create Iceberg tables and explore them in MinIO and Presto
Lab 3: Create & Query Basic Hudi Tables	Set up Hudi tables from the `spark-shell` and explore them in MinIO and Presto
Lab 4: Explore Hudi Table & Query Types	Explore how to create and interact with different types of Hudi tables and queries (intermediate-level concepts)

Compatibility

This workshop has been tested on the following platforms:

Linux: Ubuntu 22.04
macOS: M1 Mac

Technology Used

Docker: A container engine to run several applications in self-contained containers.
Presto: A fast and reliable SQL engine for data analytics and the open lakehouse
Apache Iceberg: A high-performance format for huge analytic tables
Apache Hudi: A high-performance open table format to bring database functionality to your data lakes
Spark: A multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters
MinIO: A high-performance, S3-compatible object store