Projected Quantum Kernel (PQK) Tutorial: Ovarian Cancer Survival Prediction#
Overview#
This tutorial demonstrates how to use Projected Quantum Kernels (PQK) for predicting 3-year survival in ovarian cancer patients from multi-omics data. PQK is a quantum machine learning technique that encodes classical data into quantum states and then projects those states back into a classical feature space via measurements, potentially capturing complex patterns that classical feature maps miss.
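For intuition, here is one common formulation of the projected quantum kernel, following Huang et al., "Power of data in quantum machine learning" (Nature Communications, 2021); the exact kernel computed by qbiocode's pqk may differ in detail. Each sample $x$ is encoded into a quantum state, the state is projected back to classical features via its one-qubit reduced density matrices $\rho_k(x)$, and a Gaussian kernel is built on those projections:

$$k^{PQ}(x, x') = \exp\left(-\gamma \sum_{k=1}^{n} \left\lVert \rho_k(x) - \rho_k(x') \right\rVert_F^2\right)$$

where $n$ is the number of qubits, $\lVert \cdot \rVert_F$ is the Frobenius norm, and $\gamma > 0$ is a bandwidth hyperparameter. In this tutorial, pqk returns the projected feature vectors directly, and a classical SVM is then trained on top of them.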
What You’ll Learn#
How to prepare multi-omics cancer data for quantum machine learning
How to use PQK to create quantum feature representations
How to compare quantum-enhanced vs. classical SVM performance
Best practices for hyperparameter tuning with quantum kernels
Dataset#
For convenience, we use preprocessed data from the Multi-Omics Cancer Benchmark - TCGA Preprocessed Data.
This tutorial uses ovarian cancer (OV) data with four data modalities:
miRNA expression (mirna)
DNA methylation (methy)
Gene expression (exp)
Integrated multi-omics (integrated)
Data Source:
OV Cancer Data: https://acgt.cs.tau.ac.il/multi_omic_benchmark/data/ovarian.zip
Other Cancer Types: https://acgt.cs.tau.ac.il/multi_omic_benchmark/download.html
The data will be automatically downloaded and extracted in the next section.
1. Setup and Imports#
First, let’s import all necessary libraries for data processing, quantum computing, and machine learning.
[ ]:
import sys
import os
import numpy as np
import pandas as pd
# QBioCode imports
from qbiocode import pqk
# Scikit-learn imports
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import f1_score
print("✓ All libraries imported successfully")
1.1. Download and Prepare Data#
Download and extract the ovarian cancer dataset from the Multi-Omics Cancer Benchmark.
[ ]:
import os
import zipfile
# Create OV directory if it doesn't exist
os.makedirs('OV', exist_ok=True)
# Define URL for ovarian cancer data (includes survival labels)
data_url = 'https://acgt.cs.tau.ac.il/multi_omic_benchmark/data/ovarian.zip'
zip_filename = 'ovarian.zip'
def download_file(url, filename):
"""Download a file with error handling"""
if os.path.exists(filename):
print(f'✓ Data already downloaded ({filename})')
return True
print(f'Downloading ovarian cancer data from {url}...')
print('This may take a few minutes...')
# Try using requests library (more robust)
try:
import requests
response = requests.get(url, verify=True, timeout=120)
response.raise_for_status()
with open(filename, 'wb') as f:
f.write(response.content)
print(f'✓ Download complete')
return True
except ImportError:
# Fallback to urllib with SSL context
import urllib.request
import ssl
# Create SSL context that doesn't verify certificates (use with caution)
ssl_context = ssl.create_default_context()
ssl_context.check_hostname = False
ssl_context.verify_mode = ssl.CERT_NONE
try:
with urllib.request.urlopen(url, context=ssl_context, timeout=120) as response:
with open(filename, 'wb') as f:
f.write(response.read())
print(f'✓ Download complete')
return True
except Exception as e:
print(f'Error downloading: {e}')
return False
except Exception as e:
print(f'Error downloading: {e}')
return False
# Download the data
print('=' * 60)
print('Downloading Multi-Omics Cancer Benchmark - Ovarian Cancer Data')
print('=' * 60)
success = download_file(data_url, zip_filename)
if not success:
print('\n⚠️ Download failed. Please download manually from:')
print(f' {data_url}')
    print(f'   Place it in the current directory as "{zip_filename}"')
else:
    print('\n' + '=' * 60)
print('Extracting Files')
print('=' * 60)
# Extract the zip file
if os.path.exists(zip_filename):
print(f'Extracting {zip_filename}...')
with zipfile.ZipFile(zip_filename, 'r') as zip_ref:
zip_ref.extractall('OV')
print(f'✓ Data extracted to OV/ directory')
print('\n' + '=' * 60)
print('Available Data Files')
print('=' * 60)
# List all downloaded files (including those without extensions)
ov_files = sorted([f for f in os.listdir('OV') if os.path.isfile(os.path.join('OV', f))])
print(f'\nFound {len(ov_files)} files in OV/ directory:\n')
for f in ov_files:
file_path = os.path.join('OV', f)
file_size = os.path.getsize(file_path) / (1024 * 1024) # Size in MB
print(f' - {f:<30} ({file_size:>6.2f} MB)')
print('\n✓ Setup complete! The survival labels are included in the survival file.')
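Before building labels, it helps to peek at the raw survival table. This optional check assumes the extracted archive contains a tab-separated file named survival with PatientID and Survival columns, exactly as the processing cell below expects.
[ ]:
# Optional sanity check: preview the raw survival table
# (tab-separated 'survival' file with 'PatientID' and 'Survival' columns)
import pandas as pd
survival_preview = pd.read_csv('OV/survival', sep='\t')
print(survival_preview.shape)
print(survival_preview.head())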
1.2. Process Data and Create Labels#
Process the downloaded multi-omics data and create 3-year survival labels from the survival file.
[ ]:
import pandas as pd
import numpy as np
data_path = 'OV/'
# Helper functions
def remove_dup_columns(df):
"""Remove duplicate columns"""
return df.loc[:, ~df.columns.duplicated()].copy()
def standardize_patient_id(patient_id):
"""Standardize patient ID format"""
# Convert to string, uppercase, replace dots with dashes
pid = str(patient_id).upper().replace('.', '-')
# Keep only first 3 parts (TCGA-XX-XXXX)
parts = pid.split('-')
if len(parts) >= 3:
return '-'.join(parts[:3])
return pid
print('=' * 60)
print('Loading Multi-Omics Data')
print('=' * 60)
# Load data files (files have no extension)
ov_methy = pd.read_csv(data_path + 'methy', sep=" ", index_col=0)
print(f'Methylation: {ov_methy.shape}')
ov_exp = pd.read_csv(data_path + 'exp', sep=" ", index_col=0)
# Drop gene rows with unknown identifiers (marked with '?')
ov_exp = ov_exp[~ov_exp.index.astype(str).str.contains(r'\?')]
print(f'Expression: {ov_exp.shape}')
ov_mirna = pd.read_csv(data_path + 'mirna', sep=" ", index_col=0)
print(f'miRNA: {ov_mirna.shape}')
ov_survival = pd.read_csv(data_path + 'survival', sep="\t")
print(f'Survival data: {ov_survival.shape}')
# Clean survival data
ov_survival.dropna(inplace=True)
ov_survival.drop_duplicates(['PatientID'], keep='last', inplace=True)
print('\n' + '=' * 60)
print('Processing and Aligning Data')
print('=' * 60)
# Standardize column names (patient IDs) in omics data
ov_methy.columns = [standardize_patient_id(col) for col in ov_methy.columns]
ov_exp.columns = [standardize_patient_id(col) for col in ov_exp.columns]
ov_mirna.columns = [standardize_patient_id(col) for col in ov_mirna.columns]
# Remove duplicates after standardization
ov_methy_new = remove_dup_columns(ov_methy)
ov_exp_new = remove_dup_columns(ov_exp)
ov_mirna_new = remove_dup_columns(ov_mirna)
# Standardize clinical sample IDs
ov_survival['PatientID'] = ov_survival['PatientID'].apply(standardize_patient_id)
clinical_samples_set = set(ov_survival['PatientID'])
print(f'Clinical samples: {len(clinical_samples_set)}')
print(f'Sample ID examples (clinical): {list(clinical_samples_set)[:3]}')
print(f'Sample ID examples (methy): {list(ov_methy_new.columns)[:3]}')
# Find overlapping samples across all modalities
overlapping_cols = set(ov_methy_new.columns).intersection(set(ov_exp_new.columns))
print(f'\nSamples in methy ∩ exp: {len(overlapping_cols)}')
overlapping_cols = set(ov_mirna_new.columns).intersection(overlapping_cols)
print(f'Samples in methy ∩ exp ∩ mirna: {len(overlapping_cols)}')
overlapping_cols = list(clinical_samples_set.intersection(overlapping_cols))
print(f'Samples with clinical data: {len(overlapping_cols)}')
if len(overlapping_cols) == 0:
print('\n⚠️ No overlapping samples found!')
print('This might indicate a data format issue.')
print('Please check the sample ID formats above.')
raise ValueError('No overlapping samples - cannot proceed with analysis')
# Filter data to overlapping samples and transpose
ov_methy_ov = ov_methy_new[overlapping_cols].T
ov_exp_ov = ov_exp_new[overlapping_cols].T
ov_mirna_ov = ov_mirna_new[overlapping_cols].T
print(f'\nFiltered shapes (samples × features):')
print(f' Methylation: {ov_methy_ov.shape}')
print(f' Expression: {ov_exp_ov.shape}')
print(f' miRNA: {ov_mirna_ov.shape}')
print('\n' + '=' * 60)
print('Creating Survival Labels')
print('=' * 60)
# Create binary survival labels (1 = survived past the cutoff, 0 = did not)
ov_survival['5y_survival'] = np.ones(ov_survival.shape[0], dtype=int)
ov_survival['3y_survival'] = np.ones(ov_survival.shape[0], dtype=int)
ov_survival['1y_survival'] = np.ones(ov_survival.shape[0], dtype=int)
ov_survival.loc[ov_survival['Survival'] < (1*365)+1, '1y_survival'] = 0
ov_survival.loc[ov_survival['Survival'] < (3*365)+1, '3y_survival'] = 0
ov_survival.loc[ov_survival['Survival'] < (5*365)+1, '5y_survival'] = 0
print('Survival label distribution:')
print(f' 5-year: {dict(ov_survival.groupby("5y_survival").size())}')
print(f' 3-year: {dict(ov_survival.groupby("3y_survival").size())}')
print(f' 1-year: {dict(ov_survival.groupby("1y_survival").size())}')
print('\n' + '=' * 60)
print('Creating Final Datasets')
print('=' * 60)
# Merge survival labels with each omics type
# Use the 3-year survival label as the prediction target
def create_dataset(omics_df, survival_df):
"""Merge omics data with survival labels"""
# Reset index to make patient IDs a column
omics_with_id = omics_df.reset_index()
omics_with_id.rename(columns={'index': 'sampleID'}, inplace=True)
# Merge with survival labels
merged = pd.merge(
survival_df[['PatientID', '3y_survival']],
omics_with_id,
left_on='PatientID',
right_on='sampleID',
how='inner'
)
    # Drop the redundant PatientID column (sampleID already identifies the row)
merged.drop(['PatientID'], axis=1, inplace=True)
return merged
ov_methy_final = create_dataset(ov_methy_ov, ov_survival)
ov_exp_final = create_dataset(ov_exp_ov, ov_survival)
ov_mirna_final = create_dataset(ov_mirna_ov, ov_survival)
# Create integrated dataset (concatenate all features)
ov_integrated_features = pd.concat([ov_methy_ov, ov_exp_ov, ov_mirna_ov], axis=1)
ov_integrated = create_dataset(ov_integrated_features, ov_survival)
# Save processed datasets
ov_methy_final.to_csv(data_path + 'OV_methy.csv', index=False)
ov_exp_final.to_csv(data_path + 'OV_exp.csv', index=False)
ov_mirna_final.to_csv(data_path + 'OV_mirna.csv', index=False)
ov_integrated.to_csv(data_path + 'OV_integrated.csv', index=False)
print(f'\nSaved datasets (samples × features):')
print(f' - OV_methy.csv: {ov_methy_final.shape}')
print(f' - OV_exp.csv: {ov_exp_final.shape}')
print(f' - OV_mirna.csv: {ov_mirna_final.shape}')
print(f' - OV_integrated.csv: {ov_integrated.shape}')
print('\n✓ Data processing complete! Ready for PQK analysis.')
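As a quick illustration of the ID normalization above, and a round-trip check on one of the saved files (the barcode below is a hypothetical example in TCGA format):
[ ]:
# Illustrative check: ID normalization and a saved-file round trip
print(standardize_patient_id('TCGA.04.1331.01A'))  # -> TCGA-04-1331
reloaded = pd.read_csv(data_path + 'OV_mirna.csv')
assert {'sampleID', '3y_survival'}.issubset(reloaded.columns)
print(f'OV_mirna.csv round trip: {reloaded.shape}')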
2. Helper Functions#
We define utility functions for hyperparameter grid generation and model evaluation.
[ ]:
def declare_hyperparams(param_list):
"""
Generate a comprehensive hyperparameter search space for SVM.
This function creates a wide range of C and gamma values spanning
multiple orders of magnitude to ensure thorough hyperparameter exploration.
Args:
param_list (list): Initial parameter list (can be empty)
Returns:
list: Extended parameter list with values from 0.001 to 9000
"""
param_list.extend([x * 0.001 for x in range(1, 10, 2)]) # 0.001 to 0.009
param_list.extend([x * 0.01 for x in range(1, 10, 2)]) # 0.01 to 0.09
param_list.extend([x * 0.1 for x in range(1, 10, 2)]) # 0.1 to 0.9
param_list.extend(list(range(1, 10, 2))) # 1 to 9
param_list.extend([x * 10 for x in range(1, 10, 2)]) # 10 to 90
param_list.extend([x * 100 for x in range(1, 10, 2)]) # 100 to 900
param_list.extend([x * 1000 for x in range(1, 10, 2)]) # 1000 to 9000
return param_list
def run_model(model, x_train, x_test, y_train, y_test,
param_grid, cv, verbose=1, n_jobs=-1, scoring='f1_weighted'):
"""
Train and evaluate an SVM model with grid search cross-validation.
Args:
model: Scikit-learn classifier (e.g., SVC)
x_train, x_test: Training and test features
y_train, y_test: Training and test labels
param_grid (dict): Hyperparameter grid for search
cv: Cross-validation strategy
verbose (int): Verbosity level
n_jobs (int): Number of parallel jobs (-1 uses all cores)
scoring (str): Scoring metric for optimization
Returns:
float: F1 score on test set
"""
# Perform grid search with cross-validation
grid_search = GridSearchCV(
model,
param_grid,
cv=cv,
verbose=verbose,
n_jobs=n_jobs,
scoring=scoring
)
grid_search.fit(x_train, y_train)
best_model = grid_search.best_estimator_
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")
    # Evaluate on the held-out test set. Note: SVC.score() reports accuracy,
    # so compute the weighted F1 explicitly to match the tuning metric.
    y_pred = best_model.predict(x_test)
    test_f1 = f1_score(y_test, y_pred, average='weighted')
    print(f"Test F1 score: {test_f1:.4f}\n")
    return test_f1
print("✓ Helper functions defined")
3. Configuration#
Set up the experimental parameters for PQK and classical SVM.
[ ]:
# PCA configuration - reduces data to match number of qubits
n_components = 10
# PQK quantum backend configuration
pqk_args = {
'seed': 1234,
'backend': 'simulator', # Use 'ibm_quantum' for real quantum hardware
}
# SVM hyperparameters
kernel_types = ['linear', 'poly', 'rbf', 'sigmoid']
# Generate hyperparameter search space
C_range = declare_hyperparams([])
gamma_range = declare_hyperparams(['auto', 'scale'])
param_grid = {
'C': C_range,
'gamma': gamma_range,
'kernel': kernel_types
}
# Cross-validation strategy
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
print(f"Configuration:")
print(f" - PCA components: {n_components}")
print(f" - C values: {len(C_range)}")
print(f" - Gamma values: {len(gamma_range)}")
print(f" - Kernel types: {kernel_types}")
print(f" - CV folds: 5")
4. Main Experiment Loop#
Process each data modality: load data, preprocess, apply PQK, and compare quantum vs. classical SVM.
[ ]:
# Store results for all data types
results_dict = {}
# Process each omics data type
data_types = ['mirna', 'methy', 'exp', 'integrated']
for data_type in data_types:
print("=" * 80)
print(f"Processing: {data_type.upper()}")
print("=" * 80)
# Load data
df = pd.read_csv(f'OV/OV_{data_type}.csv')
print(f"Dataset shape: {df.shape}")
# Extract features and labels
samples = df['sampleID'].tolist()
y = np.asarray(df['3y_survival'])
X = df.drop(columns=['sampleID', '3y_survival']).values
print(f"Class distribution: {np.bincount(y)}")
# Train-test split (stratified to maintain class balance)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=y, random_state=42
)
# Dimensionality reduction with PCA
print(f"\nApplying PCA: {X_train.shape[1]} → {n_components} features")
pca = PCA(n_components=n_components)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)
explained_var = pca.explained_variance_ratio_.sum()
print(f"Explained variance: {explained_var:.2%}")
# Standardization (zero mean, unit variance)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_pca)
X_test_scaled = scaler.transform(X_test_pca)
# Apply Projected Quantum Kernel
print("\n" + "-" * 40)
print("Applying Projected Quantum Kernel...")
print("-" * 40)
X_train_quantum, X_test_quantum = pqk(
X_train_scaled,
X_test_scaled,
args=pqk_args,
store=False,
encoding='ZZ', # ZZ feature map
entanglement='pairwise', # Pairwise qubit entanglement
reps=8 # 8 repetitions of the feature map
)
print(f"Quantum projection shape: {X_train_quantum.shape}")
# Train and evaluate quantum-enhanced SVM
print("\n" + "=" * 40)
print("QUANTUM-ENHANCED SVM")
print("=" * 40)
svc_quantum = SVC()
f1_quantum = run_model(
model=svc_quantum,
x_train=X_train_quantum,
y_train=y_train,
x_test=X_test_quantum,
y_test=y_test,
param_grid=param_grid,
cv=cv
)
# Train and evaluate classical SVM (baseline)
print("=" * 40)
print("CLASSICAL SVM (Baseline)")
print("=" * 40)
svc_classical = SVC()
f1_classical = run_model(
model=svc_classical,
x_train=X_train_scaled,
y_train=y_train,
x_test=X_test_scaled,
y_test=y_test,
param_grid=param_grid,
cv=cv
)
# Store results
results_dict[data_type] = [f1_quantum, f1_classical]
# Print comparison
improvement = ((f1_quantum - f1_classical) / f1_classical) * 100
print("\n" + "*" * 80)
print(f"RESULTS SUMMARY - {data_type.upper()}")
print("*" * 80)
print(f"Quantum F1: {f1_quantum:.4f}")
print(f"Classical F1: {f1_classical:.4f}")
print(f"Improvement: {improvement:+.2f}%")
print("*" * 80 + "\n\n")
print("\n" + "=" * 80)
print("ALL EXPERIMENTS COMPLETED")
print("=" * 80)
5. Results Analysis#
Compile and visualize the results across all data modalities.
[ ]:
# Create results DataFrame
results_df = pd.DataFrame.from_dict(results_dict, orient='index')
results_df.columns = ['F1_Quantum', 'F1_Classical']
results_df['Improvement_%'] = ((results_df['F1_Quantum'] - results_df['F1_Classical']) /
results_df['F1_Classical'] * 100)
# Display results
print("\nFinal Results Summary:")
print("=" * 60)
print(results_df.to_string())
print("=" * 60)
# Save results
results_df.to_csv('PQK_OV_results.csv')
print("\n✓ Results saved to 'PQK_OV_results.csv'")
# Summary statistics
print("\nOverall Statistics:")
print(f" Average Quantum F1: {results_df['F1_Quantum'].mean():.4f}")
print(f" Average Classical F1: {results_df['F1_Classical'].mean():.4f}")
print(f" Average Improvement: {results_df['Improvement_%'].mean():+.2f}%")
print(f" Best performing data: {results_df['F1_Quantum'].idxmax()}")
6. Visualization (Optional)#
Create a bar plot comparing quantum vs. classical performance.
[ ]:
import matplotlib.pyplot as plt
# Create comparison plot
fig, ax = plt.subplots(figsize=(10, 6))
x = np.arange(len(results_df))
width = 0.35
bars1 = ax.bar(x - width/2, results_df['F1_Quantum'], width,
label='Quantum-Enhanced', color='#1f77b4', alpha=0.8)
bars2 = ax.bar(x + width/2, results_df['F1_Classical'], width,
label='Classical', color='#ff7f0e', alpha=0.8)
ax.set_xlabel('Data Modality', fontsize=12)
ax.set_ylabel('F1 Score', fontsize=12)
ax.set_title('Quantum vs. Classical SVM Performance on Ovarian Cancer Data',
fontsize=14, fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels(results_df.index)
ax.legend(fontsize=11)
ax.grid(axis='y', alpha=0.3)
# Add value labels on bars
for bars in [bars1, bars2]:
for bar in bars:
height = bar.get_height()
ax.text(bar.get_x() + bar.get_width()/2., height,
f'{height:.3f}',
ha='center', va='bottom', fontsize=9)
plt.tight_layout()
plt.savefig('PQK_OV_results.png', dpi=300, bbox_inches='tight')
plt.show()
print("✓ Plot saved to 'PQK_OV_results.png'")
Key Takeaways#
What We Learned#
PQK Workflow: Applied quantum feature maps to project multi-omics cancer data into quantum-derived feature representations
Performance Comparison: Evaluated quantum-enhanced vs. classical SVM across multiple omics data types
Hyperparameter Tuning: Demonstrated comprehensive grid search for optimal model parameters
Best Practices#
Dimensionality Reduction: Use PCA to match the data dimension to the number of available qubits
Data Standardization: Always normalize features before quantum encoding (a pipeline sketch combining both steps follows this list)
Cross-Validation: Use stratified k-fold CV to ensure robust performance estimates
Baseline Comparison: Always compare quantum methods against classical baselines
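A compact way to enforce the first two practices, and to guarantee that PCA and the scaler are fit on the training split only, is a scikit-learn Pipeline. A minimal sketch using the same components as the main loop:
[ ]:
# Sketch: preprocessing pipeline fit on training data only
from sklearn.pipeline import Pipeline

preprocess = Pipeline([
    ('pca', PCA(n_components=n_components)),  # match feature count to qubit count
    ('scale', StandardScaler()),              # normalize before quantum encoding
])
X_train_ready = preprocess.fit_transform(X_train)  # fit on the training split only
X_test_ready = preprocess.transform(X_test)        # reuse the fitted transforms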
Next Steps#
Experiment with different quantum feature maps (Z, ZZ, Pauli)
Try different entanglement strategies (linear, full, circular)
Adjust the number of repetitions (reps)
Test on real quantum hardware using IBM Quantum (see the sketch below)
Explore other quantum ML algorithms (QSVC, VQC, QNN)
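For the hardware step, only the backend entry of pqk_args should need to change. A minimal sketch, reusing the 'ibm_quantum' string suggested in the configuration cell (consult the QBioCode documentation for exact backend names and credential setup):
[ ]:
# Sketch: point PQK at real hardware instead of the simulator
# ('ibm_quantum' is the string suggested in the configuration comment;
# backend names and authentication depend on the QBioCode version)
pqk_args_hw = {
    'seed': 1234,
    'backend': 'ibm_quantum',
}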
Questions or Issues? Open an issue on GitHub