AutoAI experiment

AutoAI

class ibm_watsonx_ai.experiment.autoai.autoai.AutoAI(credentials=None, project_id=None, space_id=None, verify=None, **kwargs)

Bases: BaseExperiment

AutoAI class for automating pipeline model optimization.

Parameters:
  • credentials (dict) – credentials for the instance

  • project_id (str, optional) – ID of the Watson Studio project

  • space_id (str, optional) – ID of the Watson Studio Space

  • verify (bool or str, optional) –

    You can pass one of the following as verify:

    • the path to a CA_BUNDLE file

    • the path to a directory with certificates of trusted CAs

    • True - takes the default path to the truststore

    • False - performs no verification
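The accepted forms of verify can be told apart with a small helper. The following is a minimal sketch, not part of ibm_watsonx_ai: the function name describe_verify is hypothetical, and it only classifies the value without performing any certificate verification.

```python
import os
import tempfile


def describe_verify(verify):
    """Classify a `verify` value (hypothetical helper, not part of ibm_watsonx_ai)."""
    if verify is None:
        return "not set"
    # Check the two boolean forms before treating the value as a path.
    if verify is True:
        return "default truststore"
    if verify is False:
        return "no verification"
    if isinstance(verify, str):
        if os.path.isdir(verify):
            return "directory with trusted CA certificates"
        if os.path.isfile(verify):
            return "CA_BUNDLE file"
        raise ValueError(f"path does not exist: {verify!r}")
    raise TypeError("verify must be a bool or a str path")


# Usage: a temporary directory stands in for a real certificate directory.
with tempfile.TemporaryDirectory() as cert_dir:
    print(describe_verify(cert_dir))
print(describe_verify(True))
print(describe_verify(False))
```

Checking the path before constructing the experiment surfaces a mistyped CA bundle location early.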

Example:

from ibm_watsonx_ai.experiment import AutoAI

experiment = AutoAI(
    credentials={
        "apikey": IAM_API_KEY,
        "iam_apikey_description": "...",
        "iam_apikey_name": "...",
        "iam_role_crn": "...",
        "iam_serviceid_crn": "...",
        "instance_id": "...",
        "url": "https://us-south.ml.cloud.ibm.com"
    },
    project_id="...",
    space_id="...")
class ClassificationAlgorithms(value)

Bases: Enum

Classification algorithms that AutoAI can use for IBM Cloud.

DT = 'DecisionTreeClassifier'
EX_TREES = 'ExtraTreesClassifier'
GB = 'GradientBoostingClassifier'
LGBM = 'LGBMClassifier'
LR = 'LogisticRegression'
RF = 'RandomForestClassifier'
SnapBM = 'SnapBoostingMachineClassifier'
SnapDT = 'SnapDecisionTreeClassifier'
SnapLR = 'SnapLogisticRegression'
SnapRF = 'SnapRandomForestClassifier'
SnapSVM = 'SnapSVMClassifier'
XGB = 'XGBClassifier'
class DataConnectionTypes

Bases: object

Supported types of DataConnection.

CA = 'connection_asset'
CN = 'container'
DS = 'data_asset'
FS = 'fs'
GH = 'github'
S3 = 's3'
class ForecastingAlgorithms(value)

Bases: Enum

Forecasting algorithms that AutoAI can use for IBM watsonx.ai software with IBM Cloud Pak® for Data.

ARIMA = 'ARIMA'
BATS = 'BATS'
ENSEMBLER = 'Ensembler'
HW = 'HoltWinters'
LR = 'LinearRegression'
RF = 'RandomForest'
SVM = 'SVM'
class Metrics

Bases: object

Supported types of classification and regression metrics in AutoAI.

ACCURACY_AND_DISPARATE_IMPACT_SCORE = 'accuracy_and_disparate_impact'
ACCURACY_SCORE = 'accuracy'
AVERAGE_PRECISION_SCORE = 'average_precision'
EXPLAINED_VARIANCE_SCORE = 'explained_variance'
F1_SCORE = 'f1'
F1_SCORE_MACRO = 'f1_macro'
F1_SCORE_MICRO = 'f1_micro'
F1_SCORE_WEIGHTED = 'f1_weighted'
LOG_LOSS = 'neg_log_loss'
MEAN_ABSOLUTE_ERROR = 'neg_mean_absolute_error'
MEAN_SQUARED_ERROR = 'neg_mean_squared_error'
MEAN_SQUARED_LOG_ERROR = 'neg_mean_squared_log_error'
MEDIAN_ABSOLUTE_ERROR = 'neg_median_absolute_error'
PRECISION_SCORE = 'precision'
PRECISION_SCORE_MACRO = 'precision_macro'
PRECISION_SCORE_MICRO = 'precision_micro'
PRECISION_SCORE_WEIGHTED = 'precision_weighted'
R2_AND_DISPARATE_IMPACT_SCORE = 'r2_and_disparate_impact'
R2_SCORE = 'r2'
RECALL_SCORE = 'recall'
RECALL_SCORE_MACRO = 'recall_macro'
RECALL_SCORE_MICRO = 'recall_micro'
RECALL_SCORE_WEIGHTED = 'recall_weighted'
ROC_AUC_SCORE = 'roc_auc'
ROOT_MEAN_SQUARED_ERROR = 'neg_root_mean_squared_error'
ROOT_MEAN_SQUARED_LOG_ERROR = 'neg_root_mean_squared_log_error'
class PipelineTypes

Bases: object

Supported types of Pipelines.

LALE = 'lale'
SKLEARN = 'sklearn'
class PredictionType

Bases: object

Supported types of learning.

BINARY = 'binary'
CLASSIFICATION = 'classification'
FORECASTING = 'forecasting'
MULTICLASS = 'multiclass'
REGRESSION = 'regression'
TIMESERIES_ANOMALY_PREDICTION = 'timeseries_anomaly_prediction'
class RAGMetrics

Bases: object

Supported types of AutoAI RAG metrics

ANSWER_CORRECTNESS = 'answer_correctness'
CONTEXT_CORRECTNESS = 'context_correctness'
FAITHFULNESS = 'faithfulness'
class RegressionAlgorithms(value)

Bases: Enum

Regression algorithms that AutoAI can use for IBM Cloud.

DT = 'DecisionTreeRegressor'
EX_TREES = 'ExtraTreesRegressor'
GB = 'GradientBoostingRegressor'
LGBM = 'LGBMRegressor'
LR = 'LinearRegression'
RF = 'RandomForestRegressor'
RIDGE = 'Ridge'
SnapBM = 'SnapBoostingMachineRegressor'
SnapDT = 'SnapDecisionTreeRegressor'
SnapRF = 'SnapRandomForestRegressor'
XGB = 'XGBRegressor'
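Because the algorithm classes are standard Enum subclasses, a member can be looked up from its string value. The sketch below uses a local stand-in that copies three of the members listed above, so it runs without the library; the helper name algorithms_from_names is hypothetical.

```python
from enum import Enum


class RegressionAlgorithms(Enum):
    """Local stand-in copying a few members listed above (not the library class)."""
    LGBM = "LGBMRegressor"
    RIDGE = "Ridge"
    XGB = "XGBRegressor"


def algorithms_from_names(names):
    """Map estimator name strings to enum members; unknown names raise ValueError."""
    return [RegressionAlgorithms(name) for name in names]


selected = algorithms_from_names(["Ridge", "XGBRegressor"])
print([member.name for member in selected])
```

A list built this way has the shape that include_only_estimators expects, provided the real AutoAI.RegressionAlgorithms class is used in place of the stand-in.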
class SamplingTypes

Bases: object

Types of training data sampling.

FIRST_VALUES = 'first_n_records'
LAST_VALUES = 'truncate'
RANDOM = 'random'
STRATIFIED = 'stratified'
class TShirtSize

Bases: object

Possible sizes of the AutoAI POD. Depending on the POD size, AutoAI can support different data set sizes.

  • S - Small (2vCPUs and 8GB of RAM)

  • M - Medium (4vCPUs and 16GB of RAM)

  • L - Large (8vCPUs and 32GB of RAM)

  • XL - Extra Large (16vCPUs and 64GB of RAM)

L = 'l'
M = 'm'
S = 's'
XL = 'xl'
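The size table above can be turned into a lookup for choosing a POD size. This is a sketch under stated assumptions: the helper smallest_size_for is hypothetical, and it compares raw resources only, since the source does not say which data set sizes each POD size supports.

```python
# Resources per POD size, copied from the list above: (vCPUs, RAM in GB).
POD_RESOURCES = {
    "s": (2, 8),
    "m": (4, 16),
    "l": (8, 32),
    "xl": (16, 64),
}


def smallest_size_for(min_ram_gb, min_vcpus=1):
    """Return the smallest size value meeting both minimums (hypothetical helper)."""
    # Dicts preserve insertion order, so sizes are checked from small to large.
    for size, (vcpus, ram_gb) in POD_RESOURCES.items():
        if vcpus >= min_vcpus and ram_gb >= min_ram_gb:
            return size
    raise ValueError("no POD size satisfies the requested resources")


print(smallest_size_for(min_ram_gb=20))
```

The returned strings match the TShirtSize values ('s', 'm', 'l', 'xl') shown above.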
class Transformers

Bases: object

Supported cognito transformer names in AutoAI.

ABS = 'abs'
CBRT = 'cbrt'
COS = 'cos'
CUBE = 'cube'
DIFF = 'diff'
DIVIDE = 'divide'
FEATUREAGGLOMERATION = 'featureagglomeration'
ISOFORESTANOMALY = 'isoforestanomaly'
LOG = 'log'
MAX = 'max'
MINMAXSCALER = 'minmaxscaler'
NXOR = 'nxor'
PCA = 'pca'
PRODUCT = 'product'
ROUND = 'round'
SIGMOID = 'sigmoid'
SIN = 'sin'
SQRT = 'sqrt'
SQUARE = 'square'
STDSCALER = 'stdscaler'
SUM = 'sum'
TAN = 'tan'
optimizer(name, *, prediction_type, prediction_column=None, prediction_columns=None, timestamp_column_name=None, scoring=None, desc=None, test_size=None, holdout_size=None, max_number_of_estimators=None, train_sample_rows_test_size=None, include_only_estimators=None, daub_include_only_estimators=None, include_batched_ensemble_estimators=None, backtest_num=None, lookback_window=None, forecast_window=None, backtest_gap_length=None, feature_columns=None, pipeline_types=None, supporting_features_at_forecast=None, cognito_transform_names=None, csv_separator=',', excel_sheet=None, encoding='utf-8', positive_label=None, drop_duplicates=True, outliers_columns=None, text_processing=None, word2vec_feature_number=None, daub_give_priority_to_runtime=None, fairness_info=None, sampling_type=None, sample_size_limit=None, sample_rows_limit=None, sample_percentage_limit=None, n_parallel_data_connections=None, number_of_batch_rows=None, categorical_imputation_strategy=None, numerical_imputation_strategy=None, numerical_imputation_value=None, imputation_threshold=None, retrain_on_holdout=None, categorical_columns=None, numerical_columns=None, test_data_csv_separator=',', test_data_excel_sheet=None, test_data_encoding='utf-8', confidence_level=None, incremental_learning=None, early_stop_enabled=None, early_stop_window_size=None, time_ordered_data=None, feature_selector_mode=None, **kwargs)

Initialize an AutoAI optimizer.

Parameters:
  • name (str) – name of the AutoPipelines

  • prediction_type (PredictionType) – the type of prediction

  • prediction_column (str, optional) – name of the target/label column, required for multiclass, binary, and regression prediction types

  • prediction_columns (list[str], optional) – names of the target/label columns, required for forecasting prediction type

  • timestamp_column_name (str, optional) – name of the timestamp column for time series forecasting

  • scoring (Metrics, optional) – type of the metric to optimize with, not used for forecasting

  • desc (str, optional) – description

  • test_size – deprecated, use holdout_size instead

  • holdout_size (float, optional) – percentage of the entire dataset to leave as a holdout

  • max_number_of_estimators (int, optional) – maximum number (top-K ranked by DAUB model selection) of selected algorithm, or estimator, types (for example LGBMClassifierEstimator, XGBoostClassifierEstimator, or LogisticRegressionEstimator) to use in the pipeline composition; the default is None, in which case the effective value is determined internally and only the algorithm type ranked highest by the model selection is used

  • train_sample_rows_test_size (float, optional) – percentage of training data sampling

  • daub_include_only_estimators – deprecated, use include_only_estimators instead

  • include_batched_ensemble_estimators (list[BatchedClassificationAlgorithms or BatchedRegressionAlgorithms], optional) – list of batched ensemble estimators to include in the computation process, see: AutoAI.BatchedClassificationAlgorithms, AutoAI.BatchedRegressionAlgorithms

  • include_only_estimators (list[ClassificationAlgorithms or RegressionAlgorithms or ForecastingAlgorithms], optional) – list of estimators to include in the computation process, see: AutoAI.ClassificationAlgorithms, AutoAI.RegressionAlgorithms or AutoAI.ForecastingAlgorithms

  • backtest_num (int, optional) – number of backtests used for forecasting prediction type, default value: 4, value from range [0, 20]

  • lookback_window (int, optional) – length of lookback window used for forecasting prediction type, default value: 10, if set to -1 lookback window will be auto-detected

  • forecast_window (int, optional) – length of forecast window used for forecasting prediction type, default value: 1, value from range [1, 60]

  • backtest_gap_length (int, optional) – gap between backtests used for forecasting prediction type, default value: 0, value from range [0, data length / 4]

  • feature_columns (list[str], optional) – for the forecasting prediction type, list of feature columns, which might contain the target column and/or supporting feature columns; for the timeseries anomaly prediction type, list of columns to check for anomalies

  • pipeline_types (list[ForecastingPipelineTypes or TimeseriesAnomalyPredictionPipelineTypes], optional) – list of pipeline types to be used for forecasting or timeseries anomaly prediction type

  • supporting_features_at_forecast (bool, optional) – enables the use of future supporting feature values during the forecast

  • cognito_transform_names (list[Transformers], optional) – list of transformers to include in the feature engineering computation process, see: AutoAI.Transformers

  • csv_separator (list[str] or str, optional) – the separator or list of separators for separating columns in a CSV file, not used if the file_name is not a CSV file, default is ‘,’

  • excel_sheet (str, optional) – name of the Excel sheet to use, only applicable when the input is an xlsx file; passing the sheet number is deprecated; by default the first sheet is used

  • encoding (str, optional) – encoding type for the CSV training file

  • positive_label (str, optional) – the positive class to report for binary classification; ignored for multiclass and regression

  • t_shirt_size (TShirtSize, optional) – size of the remote AutoAI POD instance (computing resources), only applicable to a remote scenario, see: AutoAI.TShirtSize

  • drop_duplicates (bool, optional) – if True, duplicated rows in data are removed before further processing

  • outliers_columns (list, optional) – replace outliers with NaN using the IQR method for specified columns, by default, turned ON for regression learning_type and target column, to turn OFF pass an empty list of columns

  • text_processing (bool, optional) – if True text processing will be enabled, applicable only on Cloud

  • word2vec_feature_number (int, optional) – number of features to be generated from the text column, applied only if text_processing is True, if None the default value will be taken

  • daub_give_priority_to_runtime (float, optional) – importance of runtime relative to score when ranking pipelines, a value between 0 and 5: 0.0 ranks by score only, 1 weighs score and runtime equally, and values above 1 give runtime more weight than score

  • fairness_info (fairness_info) – dictionary that specifies the metadata needed for measuring fairness. It contains three keys: favorable_labels, unfavorable_labels, and protected_attributes. favorable_labels indicates a positive outcome when the class column contains one of the listed values. unfavorable_labels is the opposite of favorable_labels and is obligatory for the regression learning type. protected_attributes is a list of features that partition the population into groups whose outcomes should have parity; if it is an empty list, protected attributes are detected automatically. If fairness_info is passed, the fairness metric is calculated.

  • n_parallel_data_connections (int, optional) – maximum number of parallel connections to the data source, supported only for IBM Cloud Pak® for Data 4.0.1 and later

  • categorical_imputation_strategy (ImputationStrategy, optional) –

    missing values imputation strategy for categorical columns

    Possible values (only non-forecasting scenario):

    • ImputationStrategy.MEAN

    • ImputationStrategy.MEDIAN

    • ImputationStrategy.MOST_FREQUENT (default)

  • numerical_imputation_strategy (ImputationStrategy, optional) –

    missing values imputation strategy for numerical columns

    Possible values (non-forecasting scenario):

    • ImputationStrategy.MEAN

    • ImputationStrategy.MEDIAN (default)

    • ImputationStrategy.MOST_FREQUENT

    Possible values (forecasting scenario):

    • ImputationStrategy.MEAN

    • ImputationStrategy.MEDIAN

    • ImputationStrategy.BEST_OF_DEFAULT_IMPUTERS (default)

    • ImputationStrategy.VALUE

    • ImputationStrategy.FLATTEN_ITERATIVE

    • ImputationStrategy.LINEAR

    • ImputationStrategy.CUBIC

    • ImputationStrategy.PREVIOUS

    • ImputationStrategy.NEXT

    • ImputationStrategy.NO_IMPUTATION

  • numerical_imputation_value (float, optional) – value for filling missing values if numerical_imputation_strategy is set to ImputationStrategy.VALUE, for forecasting only

  • imputation_threshold (float, optional) – maximum threshold of missing values imputation, for forecasting only

  • retrain_on_holdout (bool, optional) – if True, final pipelines are trained also on holdout data

  • categorical_columns (list, optional) – list of columns names to be treated as categorical

  • numerical_columns (list, optional) – list of columns names to be treated as numerical

  • sampling_type (str, optional) – type of sampling data for training, one of the SamplingTypes values, default is SamplingTypes.FIRST_VALUES, supported only for IBM Cloud Pak® for Data 4.0.1 and later

  • sample_size_limit (int, optional) – size of the sample upper bound (in bytes). The default value is 1 GB, supported only for IBM Cloud Pak® for Data 4.5 and later

  • sample_rows_limit (int, optional) – size of the sample upper bound (in rows), supported only for IBM Cloud Pak® for Data 4.6 and later

  • sample_percentage_limit (float, optional) – size of the sample upper bound (as fraction of dataset size), supported only for IBM Cloud Pak® for Data 4.6 and later

  • number_of_batch_rows (int, optional) – number of rows to read in each batch when reading from the flight connection

  • test_data_csv_separator (list[str] or str, optional) – the separator or list of separators for separating columns in a CSV user-defined holdout/test file, not used if the file_name is not a CSV file, default is ‘,’

  • test_data_excel_sheet (str or int, optional) – name of the Excel sheet to use for user-defined holdout/test data, use only when the test dataset file is an xlsx file, by default the first sheet is used

  • test_data_encoding (str, optional) – encoding type for the CSV user-defined holdout/test file

  • confidence_level (float, optional) – when the pipeline “PointwiseBoundedHoltWinters” or “PointwiseBoundedBATS” is used, the prediction interval is calculated at a given confidence_level to decide if a data record is an anomaly or not, optional for timeseries anomaly prediction

  • incremental_learning (bool, optional) – triggers incremental learning process for supported pipelines

  • early_stop_enabled (bool, optional) – enables early stop for incremental learning process

  • early_stop_window_size (int, optional) – the number of iterations without score improvements before the training stops

  • time_ordered_data (bool, optional) – defines your preference about time-based analysis, if True, the analysis considers the data as time-ordered and time-based, supported only for regression

  • feature_selector_mode (str, optional) – defines whether the feature selector is triggered, one of “on”, “off”, or “auto”. The “auto” mode analyzes the impact of removing insignificant features: if accuracy drops, PCA is applied to the insignificant features, the principal components that describe 30% or more of the variance are selected in place of the insignificant features, and the model is evaluated again; if accuracy still drops, all features are used. The “on” mode removes all insignificant features (0.0 importance). The feature selector is applied during the cognito phase (applicable to pipelines with a feature engineering stage).

  • **kwargs – Additional keyword arguments for AutoAI configuration.

Keyword Arguments:
  • datetime_processing_flag (bool) - When enabled, detects date column and adds new columns for different types of date/time format aggregations.

Returns:

RemoteAutoPipelines or LocalAutoPipelines, depending on how the AutoAI object was initialized

Return type:

RemoteAutoPipelines or LocalAutoPipelines
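The value ranges documented for the forecasting parameters (backtest_num, forecast_window, backtest_gap_length, lookback_window) can be checked before calling optimizer. The helper below is a hypothetical sketch, not part of the library; the defaults mirror the documented default values, and treating a non-positive lookback_window other than -1 as invalid is an assumption.

```python
def check_forecasting_options(data_length, backtest_num=4, lookback_window=10,
                              forecast_window=1, backtest_gap_length=0):
    """Return a list of problems found; an empty list means all ranges hold."""
    problems = []
    if not 0 <= backtest_num <= 20:
        problems.append("backtest_num must be in [0, 20]")
    if not 1 <= forecast_window <= 60:
        problems.append("forecast_window must be in [1, 60]")
    if not 0 <= backtest_gap_length <= data_length / 4:
        problems.append("backtest_gap_length must be in [0, data length / 4]")
    # -1 requests auto-detection; other non-positive values are assumed invalid.
    if lookback_window != -1 and lookback_window < 1:
        problems.append("lookback_window must be positive or -1")
    return problems


# Usage: 100 rows of data with a gap that exceeds a quarter of the data length.
print(check_forecasting_options(data_length=100, backtest_gap_length=30))
```

Returning a list rather than raising on the first problem lets all out-of-range values be reported at once.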

Examples

from ibm_watsonx_ai.experiment import AutoAI
experiment = AutoAI(...)

fairness_info = {
    "protected_attributes": [
        {"feature": "Sex", "reference_group": ["male"], "monitored_group": ["female"]},
        {"feature": "Age", "reference_group": [[50, 60]], "monitored_group": [[18, 49]]},
    ],
    "favorable_labels": ["No Risk"],
    "unfavorable_labels": ["Risk"],
}

optimizer = experiment.optimizer(
    name="name of the optimizer.",
    prediction_type=AutoAI.PredictionType.BINARY,
    prediction_column="y",
    scoring=AutoAI.Metrics.ROC_AUC_SCORE,
    desc="Some description.",
    holdout_size=0.1,
    max_number_of_estimators=1,
    fairness_info=fairness_info,
    cognito_transform_names=[AutoAI.Transformers.SUM, AutoAI.Transformers.MAX],
    train_sample_rows_test_size=1,
    include_only_estimators=[AutoAI.ClassificationAlgorithms.LGBM, AutoAI.ClassificationAlgorithms.XGB],
    t_shirt_size=AutoAI.TShirtSize.L,
)

optimizer = experiment.optimizer(
       name="name of the optimizer.",
       prediction_type=AutoAI.PredictionType.MULTICLASS,
       prediction_column="y",
       scoring=AutoAI.Metrics.ROC_AUC_SCORE,
       desc="Some description.",
   )
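The structure of the fairness_info dictionary described in the parameter list can be sanity-checked before it is passed to optimizer. This is a minimal sketch with a hypothetical helper name: the rule that unfavorable_labels is obligatory for regression comes from the parameter description, while treating favorable_labels and protected_attributes as always required is an assumption.

```python
def check_fairness_info(fairness_info, prediction_type):
    """Return a list of problems in a fairness_info dict (hypothetical helper)."""
    problems = []
    # Assumption: these two keys are always expected; the source only lists them.
    for key in ("favorable_labels", "protected_attributes"):
        if key not in fairness_info:
            problems.append(f"missing key: {key}")
    # Stated in the parameter description: obligatory for regression.
    if prediction_type == "regression" and not fairness_info.get("unfavorable_labels"):
        problems.append("unfavorable_labels is obligatory for regression")
    return problems


example_info = {
    "protected_attributes": [],  # empty list: automatic detection is run
    "favorable_labels": ["No Risk"],
}
print(check_fairness_info(example_info, "binary"))
print(check_fairness_info(example_info, "regression"))
```

The prediction type is passed as its string value ('binary', 'regression'), matching the PredictionType values listed earlier.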
rag_optimizer(name, *, description=None, chunking=None, embedding_models=None, retrieval_methods=None, foundation_models=None, max_number_of_rag_patterns=None, optimization_metrics=None, **kwargs)

Initialize an AutoAI RAG optimizer.

Parameters:
  • name (str) – name for the RAGOptimizer

  • description (str, optional) – description for the RAGOptimizer

  • chunking (list[dict], optional) – chunking configuration to be used.

  • embedding_models (list[str], optional) – The embedding models to try.

  • retrieval_methods (list[str], optional) – Retrieval methods to be used.

  • foundation_models (list[str], optional) – The foundation models to try.

  • max_number_of_rag_patterns (int, optional) – The maximum number of RAG patterns to create.

  • optimization_metrics (list[str], optional) – The metric name(s) to be used for optimization.

Returns:

AutoAI RAG optimizer

Return type:

RAGOptimizer

Examples

from ibm_watsonx_ai.experiment import AutoAI
experiment = AutoAI(...)

optimizer = experiment.rag_optimizer(
    name="RAG - AutoAI",
    description="Sample description",
    max_number_of_rag_patterns=5,
    optimization_metrics=["answer_correctness"]
)
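Metric names passed as optimization_metrics can be checked against the RAGMetrics values listed earlier. The sketch below copies those three strings into a local set so it runs without the library; the helper unknown_rag_metrics is hypothetical, and it assumes that only the RAGMetrics values are accepted.

```python
# Values copied from the RAGMetrics class listed earlier in this section.
RAG_METRICS = {"answer_correctness", "context_correctness", "faithfulness"}


def unknown_rag_metrics(optimization_metrics):
    """Return the metric names not found in RAG_METRICS, in input order."""
    return [name for name in optimization_metrics if name not in RAG_METRICS]


print(unknown_rag_metrics(["answer_correctness", "accuracy"]))
```

An empty result means every requested metric name matches one of the listed RAGMetrics values.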
runs(*, filter)

Get the historical runs with a Pipeline name filter (for remote scenario). Get the historical runs with an experiment name filter (for local scenario).

Parameters:

filter (str) – Pipeline name to filter the historical runs or experiment name to filter the local historical runs

Returns:

object that manages the list of runs

Return type:

AutoPipelinesRuns or LocalAutoPipelinesRuns

Example:

from ibm_watsonx_ai.experiment import AutoAI

experiment = AutoAI(...)
experiment.runs(filter='Test').list()