Introduction

Presto Lakehouse Workshop - Building an Open Data Lakehouse with Presto

Welcome to our workshop! In this workshop, you’ll learn the basics of Presto, the open-source SQL query engine, and its support for the big three open source table formats. You’ll get Presto running locally on your machine and connect it to an S3-based data source and a Hive metastore, which enables our Iceberg and Hudi integration. This is a beginner-level workshop for software developers and engineers who are new to Presto and table formats. By the end of the workshop, you will understand how to integrate Presto with table formats and MinIO, and you will know the unique features of each table format.

The goals of this workshop are to show you:

  • What table formats are and how to use them
  • How to connect Presto to MinIO S3 storage and a compatible Hive metastore using Docker
  • How to take advantage of table formats using Presto and why you would want to

About this workshop

The introductory page of the workshop is broken down into the following sections:

Agenda

Introduction | Introduction to the technologies used
Pre-work | Prerequisites for the workshop
Lab 1: Set up an Open Lakehouse | Set up Presto & Spark clusters, a Hive metastore, and an S3 storage mechanism
Lab 2: Create & Query Iceberg Tables | Create Iceberg tables and explore them in MinIO and Presto
Lab 3: Create & Query Basic Hudi Tables | Set up Hudi tables from the spark-shell and explore them in MinIO and Presto
Lab 4: Explore Hudi Table & Query Types | Explore how to create and interact with different types of Hudi tables and queries (intermediate-level concepts)

Compatibility

This workshop has been tested on the following platforms:

  • Linux: Ubuntu 22.04
  • macOS: M1 Mac

Technology Used

  • Docker: A container engine to run several applications in self-contained containers.
  • Presto: A fast and reliable SQL engine for data analytics and the open lakehouse
  • Apache Iceberg: A high-performance format for huge analytic tables
  • Apache Hudi: A high-performance open table format to bring database functionality to your data lakes
  • Spark: A multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters
  • MinIO: A high-performance, S3-compatible object store

Credits