Introduction
Presto Lakehouse Workshop - Building an Open Data Lakehouse with Presto
Welcome to our workshop! In this workshop, you’ll learn the basics of Presto, the open-source SQL query engine, and its support for the big three open-source table formats. You’ll get Presto running locally on your machine and connect it to an S3-based data source and a Hive metastore, which enables our Iceberg and Hudi integration. This is a beginner-level workshop for software developers and engineers who are new to Presto and table formats. By the end of the workshop, you will understand how to integrate Presto with table formats and MinIO, and you will know the unique features of each table format.
The goals of this workshop are to show you:
- What table formats are and how to use them
- How to connect Presto to MinIO S3 storage and a compatible Hive metastore using Docker
- How to take advantage of table formats using Presto and why you would want to
About this workshop
The introductory page of the workshop is broken down into the following sections:
Agenda
| Section | Description |
| --- | --- |
| Introduction | Introduction to the technologies used |
| Pre-work | Prerequisites for the workshop |
| Lab 1: Set up an Open Lakehouse | Set up Presto & Spark clusters, a Hive metastore, and an S3 storage mechanism |
| Lab 2: Create & Query Iceberg Tables | Create Iceberg tables and explore them in MinIO and Presto |
| Lab 3: Create & Query Basic Hudi Tables | Set up Hudi tables from the spark-shell and explore them in MinIO and Presto |
| Lab 4: Explore Hudi Table & Query Types | Explore how to create and interact with different types of Hudi tables and queries (intermediate-level concepts) |
Compatibility
This workshop has been tested on the following platforms:
- Linux: Ubuntu 22.04
- macOS: M1 Mac
Technology Used
- Docker: A container engine to run several applications in self-contained containers.
- Presto: A fast and reliable SQL engine for data analytics and the open lakehouse
- Apache Iceberg: A high-performance format for huge analytic tables
- Apache Hudi: A high-performance open table format to bring database functionality to your data lakes
- Spark: A multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters
- MinIO: A high-performance, S3-compatible object store
Credits
- Kiersten Stokes
- Yihong Wang
- Deepak Panda