Pipeline
A tutorial for using the @pipeline expression
Dataset
Let us start the tutorial by loading the dataset.
using AutoMLPipeline
using CSV
using DataFrames
profbdata = CSV.File(joinpath(dirname(pathof(AutoMLPipeline)),"../data/profb.csv")) |> DataFrame
X = profbdata[:,2:end]
Y = profbdata[:,1] |> Vector
We can check the data by showing the first 5 rows:
julia> show5(df)=first(df,5); # show first 5 rows
julia> show5(profbdata)
5×7 DataFrame
Row │ Home.Away Favorite_Points Underdog_Points Pointspread Favorite_Name Underdog_name Year
│ String Int64 Int64 Float64 String String Int64
─────┼───────────────────────────────────────────────────────────────────────────────────────────────
1 │ away 27 24 4.0 BUF MIA 89
2 │ at_home 17 14 3.0 CHI CIN 89
3 │ away 51 0 2.5 CLE PIT 89
4 │ at_home 28 0 5.5 NO DAL 89
5 │ at_home 38 7 5.5 MIN HOU 89
This dataset is a collection of pro football scores with the following variables and their descriptions:
- Home/Away = Favored team is at home or away
- Favorite Points = Points scored by the favored team
- Underdog Points = Points scored by the underdog team
- Pointspread = Oddsmaker's points to handicap the favored team
- Favorite Name = Code for favored team's name
- Underdog name = Code for underdog's name
- Year = 89, 90, or 91
For the purpose of this tutorial, we will use the first column, Home/Away, as the target variable to be predicted from the other columns as input features. In other words, we are asking whether the model can learn patterns from the input features that predict whether the game was played at home or away. Since the input contains both categorical and numerical features, the dataset is a good basis for describing how to extract these two types of features, preprocess them, and learn the mapping with a one-liner pipeline expression.
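Before building any pipeline, it can be useful to confirm which columns are categorical and which are numerical. A minimal sketch using only DataFrames and the X table loaded above:

# Inspect the element type of each input column: String columns are
# categorical, numeric columns (Int64/Float64) are numerical.
for col in names(X)
    println(rpad(col, 18), eltype(X[!, col]))
end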
AutoMLPipeline Modules and Instances
Before continuing further with the tutorial, let us load the necessary modules of AutoMLPipeline:
using AutoMLPipeline
Let us also create some instances of filters, transformers, and models that we can use to preprocess and model the dataset.
#### Decomposition
pca = SKPreprocessor("PCA"); fa = SKPreprocessor("FactorAnalysis");
ica = SKPreprocessor("FastICA")
#### Scaler
rb = SKPreprocessor("RobustScaler"); pt = SKPreprocessor("PowerTransformer")
norm = SKPreprocessor("Normalizer"); mx = SKPreprocessor("MinMaxScaler")
#### Categorical preprocessing
ohe = OneHotEncoder()
#### Column selector
disc = CatNumDiscriminator()
catf = CatFeatureSelector(); numf = NumFeatureSelector()
#### Learners
rf = SKLearner("RandomForestClassifier"); gb = SKLearner("GradientBoostingClassifier")
lsvc = SKLearner("LinearSVC"); svc = SKLearner("SVC")
mlp = SKLearner("MLPClassifier"); ada = SKLearner("AdaBoostClassifier")
jrf = RandomForest(); vote = VoteEnsemble(); stack = StackEnsemble()
best = BestLearner()
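The SKPreprocessor and SKLearner instances above are thin wrappers over scikit-learn transformers and estimators. If you are unsure which names are valid, a sketch assuming the listing helpers are exported by your installed version of AutoMLPipeline:

# Print the scikit-learn wrappers known to AutoMLPipeline
# (availability of these helpers may vary by package version)
skpreprocessors()   # e.g. "PCA", "RobustScaler", "Normalizer", ...
sklearners()        # e.g. "RandomForestClassifier", "LinearSVC", ...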
Processing Categorical Features
For the first illustration, let us extract the categorical features of the data and output a few rows using the pipeline expression and its interface:
pop_cat = @pipeline catf
tr_cat = fit_transform!(pop_cat,X,Y)
julia> show5(tr_cat)
5×2 DataFrame
Row │ Favorite_Name Underdog_name
│ String String
─────┼──────────────────────────────
1 │ BUF MIA
2 │ CHI CIN
3 │ CLE PIT
4 │ NO DAL
5 │ MIN HOU
One may notice that instead of using fit! and transform, the example uses fit_transform!. The latter is equivalent to calling fit! and transform in sequence, which is handy for examining the final output of the transformation prior to feeding it to the model.
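For readers who prefer the two-step form, the equivalent calls look roughly like the sketch below (depending on the installed AutoMLPipeline version, the second step may be the mutating transform! instead of transform):

# Equivalent two-step form of fit_transform!
fit!(pop_cat, X, Y)              # fit the pipeline elements
tr_cat = transform(pop_cat, X)   # apply the fitted pipeline to the data
show5(tr_cat)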
Let us now transform the categorical features into one-hot-bit-encoding (ohe) and examine the results:
pop_ohe = @pipeline catf |> ohe
tr_ohe = fit_transform!(pop_ohe,X,Y)
julia> show5(tr_ohe)
5×56 DataFrame
Row │ x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16 x17 x18 x19 x20 x21 x22 x23 x24 x25 x26 x27 x28 x29 x30 x31 x32 x33 x34 x35 x36 x37 x38 x39 x40 x41 x42 x43 x44 x45 x46 x47 x48 x49 x50 x51 x52 x53 x54 x55 x56
│ Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 │ 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 │ 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 │ 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 │ 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Processing Numerical Features
Let us now extract the numerical features of the data using different combinations of filters/transformers:
pop_rb = @pipeline (numf |> rb)
tr_rb = fit_transform!(pop_rb,X,Y)
julia> show5(tr_rb)
5×4 DataFrame
Row │ x1 x2 x3 x4
│ Float64 Float64 Float64 Float64
─────┼────────────────────────────────────────
1 │ 0.307692 0.576923 -0.25 -0.5
2 │ -0.461538 -0.192308 -0.5 -0.5
3 │ 2.15385 -1.26923 -0.625 -0.5
4 │ 0.384615 -1.26923 0.125 -0.5
5 │ 1.15385 -0.730769 0.125 -0.5
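Other combinations of the filters defined earlier can be chained in the same way. A sketch of two variations, reusing only the instances created above (the resulting column values will of course differ from the RobustScaler output shown):

# Numerical features scaled to [0, 1] with MinMaxScaler
pop_mx = @pipeline (numf |> mx)
tr_mx = fit_transform!(pop_mx, X, Y)

# Numerical features robust-scaled and then decomposed with PCA
pop_pca = @pipeline (numf |> rb |> pca)
tr_pca = fit_transform!(pop_pca, X, Y)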
Concatenating Extracted Categorical and Numerical Features
In a typical modeling workflow, the input features are a combination of categorical features transformed to one-hot encoding and numerical features that are normalized, scaled, or transformed by decomposition.
Here is an example of a typical input feature:
pop_com = @pipeline (numf |> norm) + (catf |> ohe)
tr_com = fit_transform!(pop_com,X,Y)
julia> show5(tr_com)
5×60 DataFrame
Row │ x1 x2 x3 x4 x1_1 x2_1 x3_1 x4_1 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16 x17 x18 x19 x20 x21 x22 x23 x24 x25 x26 x27 x28 x29 x30 x31 x32 x33 x34 x35 x36 x37 x38 x39 x40 x41 x42 x43 x44 x45 x46 x47 x48 x49 x50 x51 x52 x53 x54 x55 x56
│ Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 0.280854 0.249648 0.041608 0.925778 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 │ 0.18532 0.152616 0.0327035 0.970204 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 │ 0.497041 0.0 0.0243647 0.867385 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 │ 0.299585 0.0 0.0588471 0.952253 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 │ 0.391021 0.0720301 0.0565951 0.915812 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
The number of columns grew from 6 to 60 after one-hot encoding was applied because of the large number of unique values in the categorical columns.
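You can confirm the resulting width directly with the DataFrames accessors:

julia> ncol(tr_com)   # matches the 5×60 preview above
60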
Performance Evaluation of the Pipeline
We can add a model at the end of the pipeline and evaluate the performance of the entire pipeline by cross-validation.
Let us use a linear SVC model and evaluate using 5-fold cross-validation.
julia> using Random

julia> Random.seed!(12345);
julia> pop_lsvc = @pipeline ( (numf |> rb) + (catf |> ohe) + (numf |> pt)) |> lsvc;
julia> tr_lsvc = crossvalidate(pop_lsvc,X,Y,"balanced_accuracy_score",5)
fold: 1, 0.7500750075007501
fold: 2, 0.6574500768049155
fold: 3, 0.7400793650793651
fold: 4, 0.6597042034250129
fold: 5, 0.6703817219281136
errors: 0
(mean = 0.6955380749476314, std = 0.04562293169822702, folds = 5, errors = 0)
What about using a Gradient Boosting model?
julia> Random.seed!(12345);
julia> pop_gb = @pipeline ( (numf |> rb) + (catf |> ohe) + (numf |> pt)) |> gb;
julia> tr_gb = crossvalidate(pop_gb,X,Y,"balanced_accuracy_score",5)
fold: 1, 0.652589240824535
fold: 2, 0.6675257731958764
fold: 3, 0.5700757575757576
fold: 4, 0.5848623853211009
fold: 5, 0.6598484848484849
errors: 0
(mean = 0.6269803283531509, std = 0.04580424791292191, folds = 5, errors = 0)
What about using a Random Forest model?
julia> Random.seed!(12345);
julia> pop_rf = @pipeline ( (numf |> rb) + (catf |> ohe) + (numf |> pt)) |> jrf;
julia> tr_rf = crossvalidate(pop_rf,X,Y,"balanced_accuracy_score",5)
fold: 1, 0.6102752293577982
fold: 2, 0.6073232323232324
fold: 3, 0.6244703389830508
fold: 4, 0.5104098893949252
fold: 5, 0.7065527065527066
errors: 0
(mean = 0.6118062793223427, std = 0.06971537419162234, folds = 5, errors = 0)
Let us now evaluate several learners, a typical workflow when searching for the optimal model.
using Random
using DataFrames: DataFrame, nrow,ncol
using AutoMLPipeline
Random.seed!(1)
jrf = RandomForest()
ada = SKLearner("AdaBoostClassifier")
sgd = SKLearner("SGDClassifier")
tree = PrunedTree()
std = SKPreprocessor("StandardScaler")
disc = CatNumDiscriminator()
lsvc = SKLearner("LinearSVC")
learners = DataFrame()
for learner in [jrf,ada,sgd,tree,lsvc]
pcmc = @pipeline disc |> ((catf |> ohe) + (numf |> std)) |> learner
println(learner.name)
mean,sd,_ = crossvalidate(pcmc,X,Y,"accuracy_score",10)
global learners = vcat(learners,DataFrame(name=learner.name,mean=mean,sd=sd))
end;
rf_eq7
fold: 1, 0.7313432835820896
fold: 2, 0.7313432835820896
fold: 3, 0.6323529411764706
fold: 4, 0.6716417910447762
fold: 5, 0.7164179104477612
fold: 6, 0.5522388059701493
fold: 7, 0.6567164179104478
fold: 8, 0.6470588235294118
fold: 9, 0.6865671641791045
fold: 10, 0.746268656716418
errors: 0
AdaBoostClassifier_14y
fold: 1, 0.7910447761194029
fold: 2, 0.6865671641791045
fold: 3, 0.6911764705882353
fold: 4, 0.6865671641791045
fold: 5, 0.7014925373134329
fold: 6, 0.7014925373134329
fold: 7, 0.7761194029850746
fold: 8, 0.8382352941176471
fold: 9, 0.7014925373134329
fold: 10, 0.6268656716417911
errors: 0
SGDClassifier_g4m
fold: 1, 0.7014925373134329
fold: 2, 0.7014925373134329
fold: 3, 0.7205882352941176
fold: 4, 0.7313432835820896
fold: 5, 0.7910447761194029
fold: 6, 0.6716417910447762
fold: 7, 0.7164179104477612
fold: 8, 0.7058823529411765
fold: 9, 0.6268656716417911
fold: 10, 0.7164179104477612
errors: 0
prunetree_COs
fold: 1, 0.5522388059701493
fold: 2, 0.6417910447761194
fold: 3, 0.5882352941176471
fold: 4, 0.582089552238806
fold: 5, 0.582089552238806
fold: 6, 0.6268656716417911
fold: 7, 0.6268656716417911
fold: 8, 0.6764705882352942
fold: 9, 0.6268656716417911
fold: 10, 0.5074626865671642
errors: 0
LinearSVC_vrM
fold: 1, 0.7164179104477612
fold: 2, 0.7313432835820896
fold: 3, 0.6911764705882353
fold: 4, 0.7313432835820896
fold: 5, 0.6567164179104478
fold: 6, 0.7611940298507462
fold: 7, 0.7910447761194029
fold: 8, 0.7352941176470589
fold: 9, 0.7014925373134329
fold: 10, 0.7910447761194029
errors: 0
julia> @show learners;
learners = 5×3 DataFrame
Row │ name mean sd
│ String Float64 Float64
─────┼─────────────────────────────────────────────
1 │ rf_eq7 0.677195 0.0589206
2 │ AdaBoostClassifier_14y 0.720105 0.0623111
3 │ SGDClassifier_g4m 0.708319 0.0418123
4 │ prunetree_COs 0.601097 0.0487303
5 │ LinearSVC_vrM 0.730707 0.0425931
It can be inferred from the results that the linear SVC has the best performance among the pipelines evaluated. The compact pipeline expression makes testing different combinations of features and models trivial, and it makes performance evaluation of the pipelines manageable in a systematic way.
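To pick out the winner programmatically rather than by eye, the learners summary table can be sorted by its mean column; a small sketch:

# Sort the cross-validation summary by mean accuracy, best learner first
best_by_mean = sort(learners, :mean, rev = true)
show5(best_by_mean)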
Learners as Filters
It is also possible to use learners in the middle of the expression to serve as filters, so that their outputs become inputs to the final learner, as illustrated below.
julia> Random.seed!(1);
julia> expr = @pipeline (
((numf |> pca) |> gb) + ((numf |> pca) |> jrf)
) |> ohe |> ada;
julia> crossvalidate(expr,X,Y,"accuracy_score",5)
fold: 1, 0.6194029850746269
fold: 2, 0.6888888888888889
fold: 3, 0.6119402985074627
fold: 4, 0.5777777777777777
fold: 5, 0.6194029850746269
errors: 0
(mean = 0.6234825870646766, std = 0.040414801249678875, folds = 5, errors = 0)
It is important to note that ohe is necessary because the outputs of the two learners (gb and jrf) are categorical values that need to be one-hot encoded before they are fed to the final ada learner.
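To see why the encoding is needed, you can inspect the intermediate output of the two learner branches before the final ada stage; a sketch reusing the instances defined earlier (the exact predicted labels depend on the fitted models):

# The two learner branches emit class predictions (categorical values),
# which is why they must be one-hot encoded before the final learner.
pred_features = @pipeline ((numf |> pca) |> gb) + ((numf |> pca) |> jrf)
tr_pred = fit_transform!(pred_features, X, Y)
show5(tr_pred)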
Advanced Expressions using Selector Pipeline
You can use the * operation as a selector function which outputs the result of the best learner. Instead of looping over the different learners to identify the best one, you can use the selector function to automatically determine the best learner and output its prediction.
julia> Random.seed!(1);
julia> pcmc = @pipeline disc |> ((catf |> ohe) + (numf |> std)) |>
(jrf * ada * sgd * tree * lsvc);
julia> crossvalidate(pcmc,X,Y,"accuracy_score",10)
fold: 1, 0.7164179104477612
fold: 2, 0.7910447761194029
fold: 3, 0.6911764705882353
fold: 4, 0.7761194029850746
fold: 5, 0.6567164179104478
fold: 6, 0.7014925373134329
fold: 7, 0.6417910447761194
fold: 8, 0.7058823529411765
fold: 9, 0.746268656716418
fold: 10, 0.835820895522388
errors: 0
(mean = 0.7262730465320456, std = 0.060932268798867976, folds = 10, errors = 0)
Here is another example using the Selector Pipeline as a preprocessor in the feature extraction stage of the pipeline:
julia> Random.seed!(1);
julia> pjrf = @pipeline disc |> ((catf |> ohe) + (numf |> std)) |>
((jrf * ada ) + (sgd * tree * lsvc)) |> ohe |> ada;
julia> crossvalidate(pjrf,X,Y,"accuracy_score")
fold: 1, 0.7164179104477612
fold: 2, 0.7910447761194029
fold: 3, 0.7205882352941176
fold: 4, 0.7761194029850746
fold: 5, 0.6865671641791045
fold: 6, 0.5970149253731343
fold: 7, 0.6268656716417911
fold: 8, 0.7058823529411765
fold: 9, 0.746268656716418
fold: 10, 0.835820895522388
errors: 0
(mean = 0.7202589991220368, std = 0.07259489614798799, folds = 10, errors = 0)