Pipeline

A tutorial for using the @pipeline expression

Dataset

Let us start the tutorial by loading the dataset.

using AutoMLPipeline
using CSV
using DataFrames

profbdata = CSV.File(joinpath(dirname(pathof(AutoMLPipeline)),"../data/profb.csv")) |> DataFrame
X = profbdata[:,2:end]        # input features
Y = profbdata[:,1] |> Vector  # target column (Home.Away)

We can check the data by showing the first 5 rows:

julia> show5(df)=first(df,5); # show first 5 rows

julia> show5(profbdata)
5×7 DataFrame
 Row │ Home.Away  Favorite_Points  Underdog_Points  Pointspread  Favorite_Name  Underdog_name  Year
     │ String     Int64            Int64            Float64      String         String         Int64
─────┼───────────────────────────────────────────────────────────────────────────────────────────────
   1 │ away                    27               24          4.0  BUF            MIA               89
   2 │ at_home                 17               14          3.0  CHI            CIN               89
   3 │ away                    51                0          2.5  CLE            PIT               89
   4 │ at_home                 28                0          5.5  NO             DAL               89
   5 │ at_home                 38                7          5.5  MIN            HOU               89

This dataset is a collection of pro football scores with the following variables and their descriptions:

  • Home/Away = Favored team is at home or away
  • Favorite Points = Points scored by the favored team
  • Underdog Points = Points scored by the underdog team
  • Pointspread = Oddsmaker's points to handicap the favored team
  • Favorite Name = Code for favored team's name
  • Underdog name = Code for underdog's name
  • Year = 89, 90, or 91
Note

For the purpose of this tutorial, we will use the first column, Home vs Away, as the target variable to be predicted, with the other columns as input features. In other words, we are asking whether the model can learn patterns from the input features that predict whether the game was played at home or away. Since the input contains both categorical and numerical features, the dataset is a good basis to describe how to extract these two types of features, preprocess them, and learn the mapping using a one-liner pipeline expression.
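
As a quick sanity check on the target variable, one can count how many games fall into each class. This is a small sketch using groupby/combine from DataFrames, which is already loaded above:

combine(groupby(profbdata, "Home.Away"), nrow => :count)   # rows per Home.Away class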

AutoMLPipeline Modules and Instances

Before continuing further with the tutorial, let us load the necessary modules of AutoMLPipeline:

using AutoMLPipeline

Let us also create some instances of filters, transformers, and models that we can use to preprocess and model the dataset.

#### Decomposition
pca = SKPreprocessor("PCA"); fa = SKPreprocessor("FactorAnalysis");
ica = SKPreprocessor("FastICA")

#### Scaler
rb = SKPreprocessor("RobustScaler"); pt = SKPreprocessor("PowerTransformer")
norm = SKPreprocessor("Normalizer"); mx = SKPreprocessor("MinMaxScaler")

#### categorical preprocessing
ohe = OneHotEncoder()

#### Column selector
disc = CatNumDiscriminator()
catf = CatFeatureSelector(); numf = NumFeatureSelector()

#### Learners
rf = SKLearner("RandomForestClassifier"); gb = SKLearner("GradientBoostingClassifier")
lsvc = SKLearner("LinearSVC"); svc = SKLearner("SVC")
mlp = SKLearner("MLPClassifier"); ada = SKLearner("AdaBoostClassifier")
jrf = RandomForest(); vote = VoteEnsemble(); stack = StackEnsemble()
best = BestLearner()
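
The SKPreprocessor and SKLearner instances above wrap Scikit-learn transformers and models. To discover which names are supported, the package provides listing helpers; treat the exact function names below as an assumption and consult the API reference if they differ in your version:

skpreprocessors()   # list the supported SKPreprocessor names
sklearners()        # list the supported SKLearner names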

Processing Categorical Features

For the first illustration, let us extract the categorical features of the data and display some of them using the pipeline expression and its interface:

pop_cat = @pipeline catf
tr_cat = fit_transform!(pop_cat,X,Y)
julia> show5(tr_cat)
5×2 DataFrame
 Row │ Favorite_Name  Underdog_name
     │ String         String
─────┼──────────────────────────────
   1 │ BUF            MIA
   2 │ CHI            CIN
   3 │ CLE            PIT
   4 │ NO             DAL
   5 │ MIN            HOU

One may notice that instead of using fit! and transform!, the example uses fit_transform!. The latter is equivalent to calling fit! and transform! in sequence, which is handy for examining the final output of the transformation prior to feeding it to the model.
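
For reference, here is the equivalent two-step version (a minimal sketch of the same operation):

pop_cat = @pipeline catf
fit!(pop_cat, X, Y)               # learn any internal state of the pipeline
tr_cat2 = transform!(pop_cat, X)  # apply the fitted pipeline to produce the same output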

Let us now transform the categorical features into one-hot encoding (ohe) and examine the results:

pop_ohe = @pipeline catf |> ohe
tr_ohe = fit_transform!(pop_ohe,X,Y)
julia> show5(tr_ohe)
5×56 DataFrame
 Row │ x1       x2       x3       x4       x5       x6       x7       x8       x9       x10      x11      x12      x13      x14      x15      x16      x17      x18      x19      x20      x21      x22      x23      x24      x25      x26      x27      x28      x29      x30      x31      x32      x33      x34      x35      x36      x37      x38      x39      x40      x41      x42      x43      x44      x45      x46      x47      x48      x49      x50      x51      x52      x53      x54      x55      x56
     │ Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │     1.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      1.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0
   2 │     0.0      1.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      1.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0
   3 │     0.0      0.0      1.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      1.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0
   4 │     0.0      0.0      0.0      1.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      1.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0
   5 │     0.0      0.0      0.0      0.0      1.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      1.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0

Processing Numerical Features

Here is an example of extracting the numerical features of the data using a combination of filters and transformers:

pop_rb = @pipeline (numf |> rb)
tr_rb = fit_transform!(pop_rb,X,Y)
julia> show5(tr_rb)
5×4 DataFrame
 Row │ x1         x2         x3       x4
     │ Float64    Float64    Float64  Float64
─────┼────────────────────────────────────────
   1 │  0.307692   0.576923   -0.25      -0.5
   2 │ -0.461538  -0.192308   -0.5       -0.5
   3 │  2.15385   -1.26923    -0.625     -0.5
   4 │  0.384615  -1.26923     0.125     -0.5
   5 │  1.15385   -0.730769    0.125     -0.5

Concatenating Extracted Categorical and Numerical Features

In a typical modeling workflow, the input features are a combination of categorical features transformed into one-hot encoding and numerical features that have been normalized, scaled, or transformed by decomposition.

Here is an example of a typical input feature:

pop_com = @pipeline (numf |> norm) + (catf |> ohe)
tr_com = fit_transform!(pop_com,X,Y)
julia> show5(tr_com)
5×60 DataFrame
 Row │ x1        x2         x3         x4        x1_1     x2_1     x3_1     x4_1     x5       x6       x7       x8       x9       x10      x11      x12      x13      x14      x15      x16      x17      x18      x19      x20      x21      x22      x23      x24      x25      x26      x27      x28      x29      x30      x31      x32      x33      x34      x35      x36      x37      x38      x39      x40      x41      x42      x43      x44      x45      x46      x47      x48      x49      x50      x51      x52      x53      x54      x55      x56
     │ Float64   Float64    Float64    Float64   Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ 0.280854  0.249648   0.041608   0.925778      1.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      1.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0
   2 │ 0.18532   0.152616   0.0327035  0.970204      0.0      1.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      1.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0
   3 │ 0.497041  0.0        0.0243647  0.867385      0.0      0.0      1.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      1.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0
   4 │ 0.299585  0.0        0.0588471  0.952253      0.0      0.0      0.0      1.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      1.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0
   5 │ 0.391021  0.0720301  0.0565951  0.915812      0.0      0.0      0.0      0.0      1.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      1.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0

The number of columns grew from 6 to 60 after one-hot encoding was applied because of the large number of unique values in the categorical columns.
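
To confirm the column counts reported above, one can check them directly with ncol from DataFrames:

julia> ncol(X), ncol(tr_ohe), ncol(tr_com)
(6, 56, 60)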

Performance Evaluation of the Pipeline

We can add a model at the end of the pipeline and evaluate the performance of the entire pipeline by cross-validation.

Let us use a linear SVC model and evaluate using 5-fold cross-validation.

julia> using Random

julia> Random.seed!(12345);

julia> pop_lsvc = @pipeline ( (numf |> rb) + (catf |> ohe) + (numf |> pt)) |> lsvc;

julia> tr_lsvc = crossvalidate(pop_lsvc,X,Y,"balanced_accuracy_score",5)
fold: 1, 0.7500750075007501
fold: 2, 0.6574500768049155
fold: 3, 0.7400793650793651
fold: 4, 0.6597042034250129
fold: 5, 0.6703817219281136
errors: 0
(mean = 0.6955380749476314, std = 0.04562293169822702, folds = 5, errors = 0)

What about using a Gradient Boosting model?

julia> Random.seed!(12345);

julia> pop_gb = @pipeline ( (numf |> rb) + (catf |> ohe) + (numf |> pt)) |> gb;

julia> tr_gb = crossvalidate(pop_gb,X,Y,"balanced_accuracy_score",5)
fold: 1, 0.652589240824535
fold: 2, 0.6675257731958764
fold: 3, 0.5700757575757576
fold: 4, 0.5848623853211009
fold: 5, 0.6598484848484849
errors: 0
(mean = 0.6269803283531509, std = 0.04580424791292191, folds = 5, errors = 0)

What about using a Random Forest model?

julia> Random.seed!(12345);

julia> pop_rf = @pipeline ( (numf |> rb) + (catf |> ohe) + (numf |> pt)) |> jrf;

julia> tr_rf = crossvalidate(pop_rf,X,Y,"balanced_accuracy_score",5)
fold: 1, 0.6102752293577982
fold: 2, 0.6073232323232324
fold: 3, 0.6244703389830508
fold: 4, 0.5104098893949252
fold: 5, 0.7065527065527066
errors: 0
(mean = 0.6118062793223427, std = 0.06971537419162234, folds = 5, errors = 0)

Let us evaluate several learners, which is a typical workflow when searching for the optimal model.

using Random
using DataFrames: DataFrame, nrow, ncol

using AutoMLPipeline

Random.seed!(1)
jrf = RandomForest()
ada = SKLearner("AdaBoostClassifier")
sgd = SKLearner("SGDClassifier")
tree = PrunedTree()
std = SKPreprocessor("StandardScaler")
disc = CatNumDiscriminator()
lsvc = SKLearner("LinearSVC")

learners = DataFrame()
for learner in [jrf,ada,sgd,tree,lsvc]
  pcmc = @pipeline disc |> ((catf |> ohe) + (numf |> std)) |> learner
  println(learner.name)
  mean,sd,_ = crossvalidate(pcmc,X,Y,"accuracy_score",10)
  global learners = vcat(learners,DataFrame(name=learner.name,mean=mean,sd=sd))
end;
rf_eq7
fold: 1, 0.7313432835820896
fold: 2, 0.7313432835820896
fold: 3, 0.6323529411764706
fold: 4, 0.6716417910447762
fold: 5, 0.7164179104477612
fold: 6, 0.5522388059701493
fold: 7, 0.6567164179104478
fold: 8, 0.6470588235294118
fold: 9, 0.6865671641791045
fold: 10, 0.746268656716418
errors: 0
AdaBoostClassifier_14y
fold: 1, 0.7910447761194029
fold: 2, 0.6865671641791045
fold: 3, 0.6911764705882353
fold: 4, 0.6865671641791045
fold: 5, 0.7014925373134329
fold: 6, 0.7014925373134329
fold: 7, 0.7761194029850746
fold: 8, 0.8382352941176471
fold: 9, 0.7014925373134329
fold: 10, 0.6268656716417911
errors: 0
SGDClassifier_g4m
fold: 1, 0.7014925373134329
fold: 2, 0.7014925373134329
fold: 3, 0.7205882352941176
fold: 4, 0.7313432835820896
fold: 5, 0.7910447761194029
fold: 6, 0.6716417910447762
fold: 7, 0.7164179104477612
fold: 8, 0.7058823529411765
fold: 9, 0.6268656716417911
fold: 10, 0.7164179104477612
errors: 0
prunetree_COs
fold: 1, 0.5522388059701493
fold: 2, 0.6417910447761194
fold: 3, 0.5882352941176471
fold: 4, 0.582089552238806
fold: 5, 0.582089552238806
fold: 6, 0.6268656716417911
fold: 7, 0.6268656716417911
fold: 8, 0.6764705882352942
fold: 9, 0.6268656716417911
fold: 10, 0.5074626865671642
errors: 0
LinearSVC_vrM
fold: 1, 0.7164179104477612
fold: 2, 0.7313432835820896
fold: 3, 0.6911764705882353
fold: 4, 0.7313432835820896
fold: 5, 0.6567164179104478
fold: 6, 0.7611940298507462
fold: 7, 0.7910447761194029
fold: 8, 0.7352941176470589
fold: 9, 0.7014925373134329
fold: 10, 0.7910447761194029
errors: 0
julia> @show learners;
learners = 5×3 DataFrame
 Row │ name                    mean      sd
     │ String                  Float64   Float64
─────┼─────────────────────────────────────────────
   1 │ rf_eq7                  0.677195  0.0589206
   2 │ AdaBoostClassifier_14y  0.720105  0.0623111
   3 │ SGDClassifier_g4m       0.708319  0.0418123
   4 │ prunetree_COs           0.601097  0.0487303
   5 │ LinearSVC_vrM           0.730707  0.0425931
Note

It can be inferred from the results that linear SVC has the best performance among the different pipelines evaluated. The compact pipeline expression makes testing different combinations of features and models trivial, and it makes performance evaluation of the pipelines easy to manage in a systematic way.
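
Once the best pipeline has been identified, it can be trained on the data and used to generate predictions. Here is a minimal sketch, assuming that calling transform! on a pipeline that ends with a learner returns that learner's predictions:

plsvc = @pipeline disc |> ((catf |> ohe) + (numf |> std)) |> lsvc
fit!(plsvc, X, Y)            # train the whole pipeline
pred = transform!(plsvc, X)  # predicted Home.Away labels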

Learners as Filters

It is also possible to use learners in the middle of the expression to serve as filters, with their outputs becoming inputs to the final learner, as illustrated below.

julia> Random.seed!(1);

julia> expr = @pipeline (
                          ((numf |> pca) |> gb) + ((numf |> pca) |> jrf)
                        ) |> ohe |> ada;

julia> crossvalidate(expr,X,Y,"accuracy_score",5)
fold: 1, 0.6194029850746269
fold: 2, 0.6888888888888889
fold: 3, 0.6119402985074627
fold: 4, 0.5777777777777777
fold: 5, 0.6194029850746269
errors: 0
(mean = 0.6234825870646766, std = 0.040414801249678875, folds = 5, errors = 0)

It is important to note that ohe is necessary because the outputs of the two learners (gb and jrf) are categorical values that need to be one-hot encoded before they are fed to the final ada learner.
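
To see this, one can inspect the intermediate output of one of the learner-filters on its own. This is only a sketch, assuming the learner's transformed output is its class predictions:

pop_gbfilter = @pipeline (numf |> pca) |> gb
pred_gb = fit_transform!(pop_gbfilter,X,Y)
first(pred_gb,5)   # categorical labels such as "at_home"/"away", hence the need for ohe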

Advanced Expressions using Selector Pipeline

You can use the * operation as a selector function which outputs the result of the best learner. Instead of looping over the different learners to identify the best one, you can use the selector function to automatically determine the best learner and output its prediction.

julia> Random.seed!(1);

julia> pcmc = @pipeline disc |> ((catf |> ohe) + (numf |> std)) |>
                        (jrf * ada * sgd * tree * lsvc);

julia> crossvalidate(pcmc,X,Y,"accuracy_score",10)
fold: 1, 0.7164179104477612
fold: 2, 0.7910447761194029
fold: 3, 0.6911764705882353
fold: 4, 0.7761194029850746
fold: 5, 0.6567164179104478
fold: 6, 0.7014925373134329
fold: 7, 0.6417910447761194
fold: 8, 0.7058823529411765
fold: 9, 0.746268656716418
fold: 10, 0.835820895522388
errors: 0
(mean = 0.7262730465320456, std = 0.060932268798867976, folds = 10, errors = 0)

Here is another example using the Selector Pipeline as a preprocessor in the feature extraction stage of the pipeline:

julia> Random.seed!(1);

julia> pjrf = @pipeline disc |> ((catf |> ohe) + (numf |> std)) |>
                        ((jrf * ada ) + (sgd * tree * lsvc)) |> ohe |> ada;

julia> crossvalidate(pjrf,X,Y,"accuracy_score")
fold: 1, 0.7164179104477612
fold: 2, 0.7910447761194029
fold: 3, 0.7205882352941176
fold: 4, 0.7761194029850746
fold: 5, 0.6865671641791045
fold: 6, 0.5970149253731343
fold: 7, 0.6268656716417911
fold: 8, 0.7058823529411765
fold: 9, 0.746268656716418
fold: 10, 0.835820895522388
errors: 0
(mean = 0.7202589991220368, std = 0.07259489614798799, folds = 10, errors = 0)