Machine Learning

Machine learning (ML) is the practice of training algorithms to make predictions or decisions from data, without being explicitly programmed for each case. IBM i is an ideal starting point for ML because the data is already there — structured, reliable, and stored in Db2 for i — and the business domain context needed to frame a meaningful ML problem already exists in the people and processes that run on the platform.

Why Machine Learning?

IBM i systems accumulate years or decades of high-quality transactional data. That history is exactly what ML models need to learn from. Common ML use cases on IBM i include:

Fraud and anomaly detection — Identify unusual patterns in financial transactions or order data
Demand forecasting — Predict inventory needs based on historical sales and seasonal patterns
Predictive maintenance — Anticipate equipment failures using sensor or operational data
Customer segmentation — Group customers by behavior to improve targeting and service
Churn prediction — Identify customers likely to leave before they do

The key advantage: you don’t have to move your data to get started. Tools like Mapepire, JT400, and the Db2 for i Python SDK let you query Db2 for i directly from Python ML environments.

Basic aspects of the journey

A typical ML project with IBM i data follows this path:

Access the data — Connect a Python ML environment to Db2 for i
Prepare the data — Clean, transform, and engineer features
Train a model — Use a framework like scikit-learn, XGBoost, or PyTorch
Evaluate the model — Validate accuracy, precision, recall, and fairness
Export the model (if needed) — Package for deployment outside the training environment
Run inference — Score new records against the trained model
Integrate results — Surface predictions back into IBM i applications

Training

Training happens in a Python environment: on IBM i itself, on an adjacent Linux/Power server (via the Python Ecosystem for IBM Power), or in a cloud ML platform like watsonx.ai or Red Hat OpenShift AI.

A minimal scikit-learn example reading from Db2 for i via Mapepire:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report
from mapepire_python import connect

# Connect to Db2 for i database using Mapepire and fetch data
# The config.ini file should contain connection details (host, user, password, etc.)
with connect("./config.ini") as conn:
    with conn.execute("select * from demo.orders") as cursor:
        results = cursor.fetchall()
        df = pd.DataFrame(results['data'])

# ============================================================================
# FEATURE ENGINEERING & DATA PREPARATION
# ============================================================================
# Select features (independent variables) for the model
# Features used: ORDER_AMOUNT, DAYS_SINCE_LAST_ORDER, CUSTOMER_SEGMENT
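# .copy() gives an independent DataFrame, so the encoding step below does not
# modify a view of df (avoids pandas' SettingWithCopyWarning)
X = df[["ORDER_AMOUNT", "DAYS_SINCE_LAST_ORDER", "CUSTOMER_SEGMENT"]].copy()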
# Select target variable
# Binary classification: 1 = churned, 0 = not churned
y = df['CHURNED']

# Encode categorical variable (CUSTOMER_SEGMENT) to numerical values
# LabelEncoder converts text categories (e.g., 'Premium', 'Standard') to integers (0, 1, 2, etc.)
le = LabelEncoder()
X['CUSTOMER_SEGMENT'] = le.fit_transform(X['CUSTOMER_SEGMENT'])

# ============================================================================
# DATA SPLITTING
# ============================================================================
# Split dataset into training (80%) and testing (20%) sets
# - random_state=42: Ensures reproducible splits
# - stratify=y: Maintains the same proportion of churned/non-churned customers in both sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# ============================================================================
# MODEL TRAINING
# ============================================================================
# Initialize Random Forest Classifier with 100 decision trees
# - random_state=42: Makes the trained model reproducible across runs
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model on the training data
rf_model.fit(X_train, y_train)

# ============================================================================
# MODEL EVALUATION
# ============================================================================
# Generate predictions on the test set and display classification metrics
# Includes precision, recall, f1-score, and support for each class
print(classification_report(y_test, rf_model.predict(X_test)))
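
scikit-learn documents `LabelEncoder` as a transformer for target labels rather than input features, and with it the encoder has to be saved and reloaded alongside the model. One alternative, sketched here, is to wrap a `OneHotEncoder` and the classifier in a single `Pipeline` so that preprocessing travels with the model. The synthetic rows stand in for `demo.orders` and are made up for illustration; only the column names come from the example above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for rows fetched from demo.orders (values are made up)
df = pd.DataFrame({
    "ORDER_AMOUNT": [120.0, 1500.0, 80.0, 640.0, 95.0, 2100.0, 310.0, 45.0],
    "DAYS_SINCE_LAST_ORDER": [5, 45, 90, 12, 120, 3, 60, 200],
    "CUSTOMER_SEGMENT": ["A", "B", "C", "A", "C", "B", "A", "C"],
    "CHURNED": [0, 0, 1, 0, 1, 0, 0, 1],
})
feature_cols = ["ORDER_AMOUNT", "DAYS_SINCE_LAST_ORDER", "CUSTOMER_SEGMENT"]

# One-hot encode the categorical column; pass numeric columns through unchanged.
# handle_unknown="ignore" encodes unseen categories as all zeros instead of raising.
preprocess = ColumnTransformer(
    transformers=[
        ("segment", OneHotEncoder(handle_unknown="ignore"), ["CUSTOMER_SEGMENT"]),
    ],
    remainder="passthrough",
)

# Preprocessing and model are fitted, saved, and loaded as one object
churn_pipeline = Pipeline(steps=[
    ("preprocess", preprocess),
    ("model", RandomForestClassifier(n_estimators=100, random_state=42)),
])
churn_pipeline.fit(df[feature_cols], df["CHURNED"])

# Score a record whose segment ("D") was never seen during training
new_rows = pd.DataFrame({
    "ORDER_AMOUNT": [300.0],
    "DAYS_SINCE_LAST_ORDER": [30],
    "CUSTOMER_SEGMENT": ["D"],
})
churn_scores = churn_pipeline.predict_proba(new_rows)[:, 1]
print(churn_scores)
```

With this layout, a single `joblib.dump(churn_pipeline, ...)` would capture everything needed at scoring time, at the cost of retraining with the pipeline instead of the bare classifier.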

Export the model (if needed)

Whether you need to export a model depends entirely on your inferencing strategy:

You DON’T need to export if:

Training and inference both happen in the same Python environment (e.g., both in watsonx.ai, or both on IBM i PASE)
You’re using a managed ML platform that handles deployment internally (e.g., watsonx.ai deployments, Wallaroo AI Platform, Red Hat OpenShift AI, or Red Hat AI Inference Server)
The model stays in memory between training and scoring in a long-running service

You DO need to export if:

Training happens in one environment (e.g., cloud ML platform, data scientist’s workstation) but inference runs elsewhere (e.g., on IBM i, adjacent Linux server, or edge device)
You need to version and archive models for compliance, reproducibility, or rollback capability
Multiple applications or services need to load the same trained model
You’re deploying to a production environment separate from your training environment

When export is necessary, serialize the model using a format appropriate for your inference runtime:

ONNX — Open Neural Network Exchange format, supported by many runtimes including those available on IBM i via the Python Ecosystem for IBM Power and Red Hat OpenShift AI. Best for cross-platform deployment and when using different frameworks for training vs. inference.
Pickle / joblib — Standard Python serialization, suitable when inference also runs in Python
PMML — XML-based format for classical ML models, with broad tooling support

For production-grade model serving with versioning, monitoring, and auto-scaling, consider platforms like Red Hat OpenShift AI, Red Hat AI Inference Server, Wallaroo AI Platform, or watsonx.ai deployments, which handle model packaging and deployment automatically.

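import joblib
# Save the trained model and the fitted label encoder; inference needs both
joblib.dump(rf_model, "churn_model.pkl")
joblib.dump(le, "label_encoder.pkl")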

If inference will run directly on IBM i, copy the serialized model file to the IFS and load it from Python running in PASE.
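
Pickle and joblib files execute code when they are loaded, so a copied model file should only be loaded if it is known to be the one that was exported. One possible way to support that, and the versioning and rollback needs listed earlier, is to write a small manifest next to the model file. The sketch below uses only the standard library; the manifest layout, the function names `write_manifest` and `verify_model`, and the version string are illustrative assumptions, and a plain dictionary stands in for the trained model.

```python
import hashlib
import json
import pickle
import tempfile
from pathlib import Path


def write_manifest(model_path: Path, version: str, features: list) -> Path:
    """Write a JSON manifest (hypothetical layout) next to the model file."""
    manifest = {
        "model_file": model_path.name,
        "version": version,
        "features": features,
        "sha256": hashlib.sha256(model_path.read_bytes()).hexdigest(),
    }
    manifest_path = model_path.with_name(model_path.stem + ".manifest.json")
    manifest_path.write_text(json.dumps(manifest, indent=2))
    return manifest_path


def verify_model(model_path: Path, manifest_path: Path) -> bool:
    """Return True only if the file's checksum matches the manifest."""
    manifest = json.loads(manifest_path.read_text())
    digest = hashlib.sha256(model_path.read_bytes()).hexdigest()
    return digest == manifest["sha256"]


with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "churn_model.pkl"
    # Stand-in for joblib.dump(rf_model, ...): any serialized bytes will do here
    model_path.write_bytes(pickle.dumps({"stand_in": "model"}))

    manifest_path = write_manifest(
        model_path,
        version="1.0.0",
        features=["ORDER_AMOUNT", "DAYS_SINCE_LAST_ORDER", "CUSTOMER_SEGMENT"],
    )
    intact = verify_model(model_path, manifest_path)

    # Simulate a corrupted or swapped file after transfer
    model_path.write_bytes(b"tampered")
    tampered_ok = verify_model(model_path, manifest_path)

print(intact, tampered_ok)
```

The scoring script would call `verify_model` before `joblib.load` and refuse to load on a mismatch.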

Inferencing

Once a model is trained (and exported, if your deployment requires it), it can score new records. Inference can run in several deployment locations:

On IBM i via Python PASE — Load the model in a Python script running directly on IBM i
On an adjacent server — Deploy the model on a Linux/Power server via the Python Ecosystem for IBM Power near IBM i
In a managed ML platform — Use model serving capabilities in Red Hat OpenShift AI, Red Hat AI Inference Server, Wallaroo AI Platform, or watsonx.ai

REST APIs provide a universal integration pattern that works with all inference deployment options. Whether the model runs on IBM i PASE, an adjacent server, or a managed ML platform, exposing it via HTTP makes it accessible from any IBM i application.
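
To make the REST pattern concrete, here is a minimal scoring service built on Python's standard `http.server` module, followed by a client call against it. It is a sketch under assumptions: the `/predict` path, the JSON field names, and the `churn_score` response key are illustrative, and `score` is a hypothetical stand-in for loading the saved model and calling `predict_proba`. A production service would more likely use a dedicated web framework or one of the managed platforms listed above.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def score(payload: dict) -> float:
    """Hypothetical stand-in for the real model; returns a value in [0, 1]."""
    days = float(payload["days_since_last_order"])
    return max(0.0, min(days / 365.0, 1.0))


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404, "Unknown endpoint")
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length))
            body = json.dumps({"churn_score": score(payload)}).encode("utf-8")
        except (ValueError, KeyError, TypeError):
            # Malformed JSON or missing/invalid fields
            self.send_error(400, "Invalid request body")
            return
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # suppress per-request logging in this demo


# Port 0 lets the operating system pick a free port for the demo
server = HTTPServer(("127.0.0.1", 0), PredictHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: the same kind of POST an IBM i application would issue
request = urllib.request.Request(
    f"http://127.0.0.1:{server.server_port}/predict",
    data=json.dumps(
        {"order_amount": 1500, "days_since_last_order": 45, "segment": "B"}
    ).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    result = json.loads(response.read())

server.shutdown()
server.server_close()
print(result)
```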

# Load model and score new records
import joblib
import pandas as pd
from mapepire_python import connect

# ============================================================================
# LOAD TRAINED MODEL AND ENCODER
# ============================================================================
# Load the trained model
model = joblib.load("churn_model.pkl")
# Load the label encoder used during training
le = joblib.load("label_encoder.pkl")

# ============================================================================
# FETCH NEW DATA
# ============================================================================
# Connect to Db2 for i via Mapepire and retrieve new records for prediction
# (in practice, restrict the query to records that have not been scored yet)
with connect("./config.ini") as conn:
    with conn.execute("select * from demo.orders") as cursor:
        results = cursor.fetchall()
        df = pd.DataFrame(results['data'])

# ============================================================================
# PREPARE FEATURES FOR PREDICTION
# ============================================================================
# Select the same features used during training
# .copy() avoids modifying a view of df when the column is encoded below
X_new = df[["ORDER_AMOUNT", "DAYS_SINCE_LAST_ORDER", "CUSTOMER_SEGMENT"]].copy()

# Encode the categorical variable using the SAVED encoder
X_new['CUSTOMER_SEGMENT'] = le.transform(X_new['CUSTOMER_SEGMENT'])

# ============================================================================
# MAKE PREDICTIONS
# ============================================================================
df["CHURN_SCORE"] = model.predict_proba(X_new)[:, 1]

# Write predictions back to Db2 for i...
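
The write-back step left open above could look roughly like the following sketch, which sends one parameterized `UPDATE` per scored record through a DB-API (PEP 249) cursor. Several things here are assumptions: the `order_id` key column and the `churn_score` column are hypothetical, the helper name `write_scores` is made up, and an in-memory `sqlite3` database stands in for Db2 for i so the example runs anywhere. Whether a Mapepire connection accepts `executemany` with `?` markers in exactly this form should be checked against its documentation.

```python
import sqlite3


def write_scores(conn, update_sql, rows):
    """Apply one parameterized UPDATE per (score, key) pair and commit."""
    cursor = conn.cursor()
    try:
        cursor.executemany(update_sql, rows)
        conn.commit()
    finally:
        cursor.close()


# sqlite3 stand-in for the Db2 for i table (hypothetical columns)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, churn_score REAL)"
)
conn.executemany("INSERT INTO orders (order_id) VALUES (?)", [(1,), (2,)])

# With pandas, the pairs could be built from the scored DataFrame, for example
# list(zip(df["CHURN_SCORE"].tolist(), df["ORDER_ID"].tolist()))
scored_rows = [(0.12, 1), (0.87, 2)]

write_scores(
    conn,
    "UPDATE orders SET churn_score = ? WHERE order_id = ?",
    scored_rows,
)

stored = conn.execute(
    "SELECT order_id, churn_score FROM orders ORDER BY order_id"
).fetchall()
conn.close()
print(stored)
```

Parameter markers keep the scores out of the SQL text itself, which avoids quoting problems and lets the database reuse the prepared statement.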

RPG integration

RPG programs can invoke ML inference in several ways:

Option 1: Call a Python script using UNIXCMD

The UNIXCMD project lets an RPG program run a PASE command, such as a Python scoring script, and read the command's output.
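
As one way to picture the Python side of this option, the sketch below is a small command-line script that takes the order attributes as arguments and prints a single score line to standard output, which is the kind of output an RPG program could read back. The `score` function is a hypothetical stand-in for loading the saved model and calling `predict_proba`; the argument names and the four-decimal output format are assumptions, not part of UNIXCMD.

```python
import argparse


def score(order_amount: float, days_since_last_order: int, segment: str) -> float:
    """Hypothetical stand-in for the real model; returns a value in [0, 1]."""
    return max(0.0, min(days_since_last_order / 365.0, 1.0))


def format_score(value: float) -> str:
    """Fixed-width text is easier to parse on the RPG side than free-form output."""
    return f"{value:.4f}"


def main(argv=None) -> int:
    parser = argparse.ArgumentParser(description="Score one order and print the result")
    parser.add_argument("--order-amount", type=float, required=True)
    parser.add_argument("--days-since-last-order", type=int, required=True)
    parser.add_argument("--segment", required=True)
    args = parser.parse_args(argv)

    churn_score = score(args.order_amount, args.days_since_last_order, args.segment)
    print(format_score(churn_score))  # the only line written to stdout
    return 0


# Demo call with an explicit argument list; a deployed script would instead
# call main() with no arguments so that argparse reads sys.argv
exit_code = main(
    ["--order-amount", "1500", "--days-since-last-order", "45", "--segment", "B"]
)
```

Starting a Python interpreter per call adds latency, so this approach suits occasional or batch scoring better than per-transaction scoring.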

Option 2: Call a REST API using IBM i HTTP APIs

The IBM i QhttpClnt APIs or the SYSTOOLS.HTTPGETCLOB / SYSTOOLS.HTTPPOSTCLOB SQL functions can call a Python scoring service and parse the JSON response.

-- Call a REST scoring endpoint from SQL
SELECT SYSTOOLS.HTTPPOSTCLOB(
    'http://scoreserver:8080/predict',
    '<httpHeader><header name="Content-Type" value="application/json"/></httpHeader>',
    '{"order_amount": 1500, "days_since_last_order": 45, "segment": "B"}'
) AS PREDICTION
FROM SYSIBM.SYSDUMMY1;

Option 3: Db2 for i User-Defined Functions (UDFs)

For tighter integration, the Db2 for i AI SDK is planned to support wrapping a Python-based scoring function as an external UDF in Db2 for i. The function could then be called directly in SQL queries, scoring every row in a result set in a single statement. This capability is not yet implemented; it is planned for a future release.