Analytic Workload

This section will explore some of the analytic workloads that you can run with watsonx.data.

Before starting, make sure you are in the Query Workspace by clicking this icon on the left side

For the analytic workloads, you are going to create a new schema in the iceberg_data catalog. The next statement will create the workshop schema in the existing iceberg-bucket.

Create the workshop schema in the iceberg_data catalog

CREATE SCHEMA IF NOT EXISTS iceberg_data.workshop 
with (location='s3a://iceberg-bucket/');

This SQL will create a new Apache Iceberg table using existing data in the sample Customer table as part of the TPCH catalog schema called TINY.

Create the workshop customer table

create table iceberg_data.workshop.customer as 
  select * from tpch.tiny.customer;

Let us start with some simple examples of running queries and analyzing the execution.

Run a simple scan query which selects customer names and market segment

use iceberg_data.workshop; 
select 
  c_name, c_mktsegment 
from 
  customer 
limit 3;

Browser

To understand the query execution plan we use the explain statement.

Run an explain against the previous statement

use iceberg_data.workshop;
explain select name, mktsegment from customer;

Browser

What you see above is the hierarchy of logical operations to execute the query. It is very difficult to read since the information is found on one line. One option is to export the data and view the results in a spreadsheet.

Explain the query and focus on IO operations

use iceberg_data.workshop;
explain (type io) select name, mktsegment from customer;

Browser

Explain physical execution plan for the query.

Explain physical execution plan for the query

use iceberg_data.workshop;
explain (type distributed) select name, mktsegment from customer;

Browser

A fragment represents a stage of the distributed plan. The Presto scheduler schedules the execution by each stage, and stages can be run on separate instances.

Rather than executing these explain queries, the SQL interface provides an explain button beside the Run button.

Enter the following query into the SQL window and press the Explain button

select 
  name, mktsegment 
from 
  iceberg_data.workshop.customer

Browser

Click on the TableScan operator

Browser

The panel on the right side provides information about the work the engine will do to retrieve the data. The Presto optimizer estimates that 1500 rows will be retrieved and take approximately 16000 cpu units.

Close the explain window

Another way of viewing the explain information is to run the query and then review the Query history.

Run the query that is currently in the SQL window

Once the query completes, switch to the Query history screen.

Click on the Query History icon on the left side of the screen

Browser

Refresh the browser to update the Query history

Once you have refreshed the screen, look for your query in the list. It should be near the top.

Click on the kebab ⋮ icon and select View Execution Plan

Browser

You will see all the explain output that you created with the earlier SQL statements.

Browser

Close the explain output by pressing the [x] on the screen

Creating a Table with User-defined Partitions

Before starting, make sure you are in the Query Workspace by clicking this icon on the left side

Tables can be partitioned in watsonx.data. This SQL will create a partitioned table, based on column mktsegment with data copied from the TPCH.TINY.CUSTOMER table.

Create a partitioned table

create table iceberg_data.workshop.part_customer 
  with (partitioning = array['mktsegment']) 
as select * from tpch.tiny.customer;

The new table part_customer will be created in the iceberg_bucket with multiple files corresponding to the partitioning. We are going to use the Infrastructure manager to view the file structure.

Click on the Infrastructure icon on the left side of the screen

Find the iceberg-bucket in the Infrastructure diagram.

Browser

Click on the iceberg-bucket

Browser

Click on Objects → part_customer → data

Browser

The partitioning = array['mktsegment'] clause in the create table statement resulted in 5 different files being created, each with data specific to that segment. By partitioning the table this way, the Presto engine is able to optimize queries by only scanning partitions that match your SQL predicates.

Close the current window [x] and then return to the Query Workspace by clicking this icon on the left side

Now that have created a partitioned table, we will execute an SQL statement that will make use of this fact.

Run a query against the partitioned table

select
  * 
from 
  iceberg_data."workshop".part_customer 
where 
  mktsegment='MACHINERY';

Browser

Due to the partitioning of this table by mktsegment, it will completely skip scanning a large percentage of the objects in the object store.

Press the explain keyword in the SQL window to view the execution plan and then select the ScanFilter

Browser

You should see that the optimizer expected to read only 288 rows when running this query. This is due to the table being partitioned and scans only needing to be run against one of the partitions.

Close the explain window

Joins and Aggregations

This section will create an orders table to test joins and aggregations.

Create the Orders Table

create table iceberg_data.workshop.orders as 
  select * from tpch.tiny.orders;

The following SQL uses a windowing function to compute a result set.

SQL with a Windowing function

SELECT 
  orderkey, clerk, totalprice, 
  rank() OVER (PARTITION BY clerk ORDER BY totalprice DESC) AS rnk 
FROM 
  iceberg_data.workshop.orders 
ORDER BY 
  clerk, rnk;

Browser

Prepared statements

Presto provides a feature for preparing statements for repeated execution. Using this method will reduce the overhead of optimizing the statement every time you run it.

Save a query as a prepared statement

prepare 
  customer_by_segment
from 
  select * from iceberg_data.workshop.customer where mktsegment=?;

Once the statement has been prepared, you can execute the statement by supplying it with some parameters. In this example we are combining the two statements because the session context is not shared between SQL executions.

Execute prepared statement using parameters

prepare 
  customer_by_segment
from 
  select * from iceberg_data.workshop.customer where mktsegment=?;    
execute customer_by_segment using 'FURNITURE';

Browser

Note that the prepared statement only exists during the current session.

Summary

In this lab you ran some analytic workloads, learned how to view the explain plans for a query, created a partitioned table, and created prepared statements that can be executed multiple times.