Training and Validation

Let us continue our discussion using another dataset. This time, let's use the CMC dataset, whose features are mostly categorical. CMC is a survey asking women about their contraceptive method choice. Let's load the data and look at its columns:

using AutoMLPipeline
using CSV
using DataFrames

# load the CMC dataset bundled with the AutoMLPipeline package
cmcdata = CSV.File(joinpath(dirname(pathof(AutoMLPipeline)),"../data/cmc.csv")) |> DataFrame;
X = cmcdata[:,1:end-1]        # features
Y = cmcdata[:,end] .|> string # target column converted to strings
show5(df) = first(df,5)       # helper to display the first 5 rows
julia> show5(cmcdata)
5×10 DataFrame
 Row │ Wifes_age  Wifes_education  Husbands_education  Number_of_children_ever_born  Wifes_religion  Wifes_now_working.3F  Husbands_occupation  Standard.of.living_index  Media_exposure  Contraceptive_method_used
     │ Int64      Int64            Int64               Int64                         Int64           Int64                 Int64                Int64                     Int64           Int64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │        24                2                   3                             3               1                     1                    2                         3               0                          1
   2 │        45                1                   3                            10               1                     1                    3                         4               0                          1
   3 │        43                2                   3                             7               1                     1                    3                         4               0                          1
   4 │        42                3                   2                             9               1                     1                    3                         3               0                          1
   5 │        36                3                   3                             8               1                     1                    3                         2               0                          1

Let's examine the number of unique instances for each column:

# count the unique instances in each column
[n => length(unique(c)) for (n, c) in pairs(eachcol(cmcdata))]

Except for the wife's age and the number of children, the columns have fewer than five unique instances. Let's create a pipeline that filters those columns, one-hot encodes them, and concatenates them with the standardized scale of the numeric columns.

std  = SKPreprocessor("StandardScaler")  # standardize numeric columns
ohe  = OneHotEncoder()                   # one-hot encode categorical columns
kohe = SKPreprocessor("OneHotEncoder")   # scikit-learn's one-hot encoder (alternative to ohe)
catf = CatFeatureSelector()              # select categorical columns
numf = NumFeatureSelector()              # select numeric columns
disc = CatNumDiscriminator(5)            # columns with <= 5 unique instances are treated as categorical
pcmc = @pipeline disc |> ((catf |> ohe) + (numf |> std))
dfcmc = fit_transform!(pcmc,X)
julia> show5(dfcmc)
5×24 DataFrame
 Row │ x1       x2       x3       x4       x5       x6       x7       x8       x9       x10      x11      x12      x13      x14      x15      x16      x17      x18      x19      x20      x21      x22      x1_1       x2_1
     │ Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64    Float64
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │     1.0      0.0      0.0      0.0      1.0      0.0      0.0      0.0      1.0      0.0      1.0      0.0      1.0      0.0      0.0      0.0      1.0      0.0      0.0      0.0      1.0      0.0  -1.03817   -0.110856
   2 │     0.0      1.0      0.0      0.0      1.0      0.0      0.0      0.0      1.0      0.0      1.0      0.0      0.0      1.0      0.0      0.0      0.0      1.0      0.0      0.0      1.0      0.0   1.51519    2.85808
   3 │     1.0      0.0      0.0      0.0      1.0      0.0      0.0      0.0      1.0      0.0      1.0      0.0      0.0      1.0      0.0      0.0      0.0      1.0      0.0      0.0      1.0      0.0   1.27202    1.58568
   4 │     0.0      0.0      1.0      0.0      0.0      1.0      0.0      0.0      1.0      0.0      1.0      0.0      0.0      1.0      0.0      0.0      1.0      0.0      0.0      0.0      1.0      0.0   1.15043    2.43394
   5 │     0.0      0.0      1.0      0.0      1.0      0.0      0.0      0.0      1.0      0.0      1.0      0.0      0.0      1.0      0.0      0.0      0.0      0.0      1.0      0.0      1.0      0.0   0.420897   2.00981
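
The preview above shows 24 columns: 22 one-hot bit columns from the categorical features plus the two standardized numeric columns (the wife's age and the number of children ever born). A quick sanity check of the transformed table's shape (a small sketch; nrow and ncol come from the DataFrames package):

# the transformed table should have 24 columns in total:
# 22 one-hot columns plus 2 standardized numeric columns
using DataFrames: nrow, ncol
(nrow(dfcmc), ncol(dfcmc))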

Evaluate Learners with the Same Pipeline

You can get a list of sklearners and skpreprocessors by using the following function calls:

julia> sklearners()
syntax: SKLearner(name::String, args::Dict=Dict())
where 'name' can be one of:

AdaBoostClassifier AdaBoostRegressor ARDRegression BaggingClassifier BayesianRidge BernoulliNB ComplementNB DecisionTreeClassifier DecisionTreeRegressor ElasticNet ExtraTreesClassifier ExtraTreesRegressor GaussianNB GaussianProcessClassifier GaussianProcessRegressor GradientBoostingClassifier GradientBoostingRegressor IsotonicRegression KernelRidge KNeighborsClassifier KNeighborsRegressor Lars Lasso LassoLars LinearDiscriminantAnalysis LinearSVC LogisticRegression MLPClassifier MLPRegressor MultinomialNB NearestCentroid NuSVC OrthogonalMatchingPursuit PassiveAggressiveClassifier PassiveAggressiveRegressor QuadraticDiscriminantAnalysis RadiusNeighborsClassifier RadiusNeighborsRegressor RandomForestClassifier RandomForestRegressor Ridge RidgeClassifier RidgeClassifierCV RidgeCV SGDClassifier SGDRegressor SVC SVR VotingClassifier 

and 'args' are the corresponding learner's initial parameters.
Note: Consult Scikitlearn's online help for more details about the learner's arguments.

julia> skpreprocessors()
syntax: SKPreprocessor(name::String, args::Dict=Dict())
where *name* can be one of:

Binarizer chi2 dict_learning dict_learning_online DictionaryLearning f_classif f_regression FactorAnalysis FastICA fastica FunctionTransformer GenericUnivariateSelect IncrementalPCA KBinsDiscretizer KernelCenterer KernelPCA LabelBinarizer LabelEncoder LatentDirichletAllocation MaxAbsScaler MiniBatchDictionaryLearning MiniBatchSparsePCA MinMaxScaler MissingIndicator MultiLabelBinarizer mutual_info_classif mutual_info_regression NMF non_negative_factorization Normalizer OneHotEncoder OrdinalEncoder PCA PolynomialFeatures PowerTransformer QuantileTransformer RFE RFECV RobustScaler SelectFdr SelectFpr SelectFromModel SelectFwe SelectKBest SelectPercentile SimpleImputer sparse_encode SparseCoder SparsePCA StandardScaler TruncatedSVD VarianceThreshold 

and *args* are the corresponding preprocessor's initial parameters.
Note: Please consult Scikitlearn's online help for more details about the preprocessor's arguments.
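
Both constructors accept an optional args dictionary for overriding the scikit-learn defaults. Below is a minimal sketch of how that might look; the :impl_args key and its structure are an assumption here, so consult the package documentation for the exact argument format:

# hypothetical example: a random forest with 300 trees and a fixed random seed
# (:impl_args is assumed to be the key that forwards keyword arguments
#  to the underlying scikit-learn estimator)
rf300 = SKLearner("RandomForestClassifier",
                  Dict(:impl_args => Dict(:n_estimators => 300, :random_state => 0)))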

Let us evaluate 4 learners using the same preprocessing pipeline:

jrf = RandomForest()
ada = SKLearner("AdaBoostClassifier")
sgd = SKLearner("SGDClassifier")
tree = PrunedTree()
using DataFrames: DataFrame, nrow, ncol

# run 5-fold cross-validation of each learner with the same preprocessing pipeline
learners = DataFrame()
for learner in [jrf,ada,sgd,tree]
  local pcmc = @pipeline disc |> ((catf |> ohe) + (numf |> std)) |> learner
  println(learner.name)
  mean,sd,folds = crossvalidate(pcmc,X,Y,"accuracy_score",5)
  global learners = vcat(learners,DataFrame(name=learner.name,mean=mean,sd=sd,kfold=folds))
end;
rf_3Of
fold: 1, 0.5288135593220339
fold: 2, 0.5306122448979592
fold: 3, 0.559322033898305
fold: 4, 0.5034013605442177
fold: 5, 0.5322033898305085
errors: 0
AdaBoostClassifier_GDC
fold: 1, 0.5389830508474577
fold: 2, 0.5238095238095238
fold: 3, 0.511864406779661
fold: 4, 0.5238095238095238
fold: 5, 0.5864406779661017
errors: 0
SGDClassifier_XxZ
fold: 1, 0.4711864406779661
fold: 2, 0.5034013605442177
fold: 3, 0.4406779661016949
fold: 4, 0.5102040816326531
fold: 5, 0.44745762711864406
errors: 0
prunetree_eOz
fold: 1, 0.4915254237288136
fold: 2, 0.4897959183673469
fold: 3, 0.4406779661016949
fold: 4, 0.47619047619047616
fold: 5, 0.488135593220339
errors: 0
julia> @show learners;
learners = 4×4 DataFrame
 Row │ name                    mean      sd         kfold
     │ String                  Float64   Float64    Int64
─────┼────────────────────────────────────────────────────
   1 │ rf_3Of                  0.530871  0.0198124      5
   2 │ AdaBoostClassifier_GDC  0.536981  0.0292749      5
   3 │ SGDClassifier_XxZ       0.474585  0.0316079      5
   4 │ prunetree_eOz           0.477265  0.0213209      5

For this particular pipeline, AdaBoost has the best performance, followed by RandomForest.
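
The same ranking can also be obtained programmatically by sorting the summary table (a small sketch using the sort function from DataFrames):

# sort the cross-validation summary by mean accuracy, best first
sort(learners, :mean, rev = true)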

Let's extend the pipeline by adding a Gradient Boosting learner and a Robust Scaler.

rbs = SKPreprocessor("RobustScaler")
gb = SKLearner("GradientBoostingClassifier")
learners = DataFrame()
for learner in [jrf,ada,sgd,tree,gb]
  local pcmc = @pipeline disc |> ((catf |> ohe) + (numf |> rbs) + (numf |> std)) |> learner
  println(learner.name)
  mean,sd,folds = crossvalidate(pcmc,X,Y,"accuracy_score",5)
  global learners = vcat(learners,DataFrame(name=learner.name,mean=mean,sd=sd,kfold=folds))
end;
rf_3Of
fold: 1, 0.49830508474576274
fold: 2, 0.4557823129251701
fold: 3, 0.511864406779661
fold: 4, 0.5102040816326531
fold: 5, 0.4915254237288136
errors: 0
AdaBoostClassifier_GDC
fold: 1, 0.5186440677966102
fold: 2, 0.5306122448979592
fold: 3, 0.5627118644067797
fold: 4, 0.6020408163265306
fold: 5, 0.49491525423728816
errors: 0
SGDClassifier_XxZ
fold: 1, 0.4711864406779661
fold: 2, 0.46258503401360546
fold: 3, 0.4610169491525424
fold: 4, 0.47278911564625853
fold: 5, 0.4271186440677966
errors: 0
prunetree_eOz
fold: 1, 0.49491525423728816
fold: 2, 0.4523809523809524
fold: 3, 0.44745762711864406
fold: 4, 0.47619047619047616
fold: 5, 0.5322033898305085
errors: 0
GradientBoostingClassifier_Lck
fold: 1, 0.576271186440678
fold: 2, 0.5680272108843537
fold: 3, 0.5457627118644067
fold: 4, 0.5850340136054422
fold: 5, 0.5491525423728814
errors: 0
julia> @show learners;
learners = 5×4 DataFrame
 Row │ name                            mean      sd         kfold
     │ String                          Float64   Float64    Int64
─────┼────────────────────────────────────────────────────────────
   1 │ rf_3Of                          0.493536  0.022726       5
   2 │ AdaBoostClassifier_GDC          0.541785  0.0416107      5
   3 │ SGDClassifier_XxZ               0.458939  0.0185201      5
   4 │ prunetree_eOz                   0.48063   0.034576       5
   5 │ GradientBoostingClassifier_Lck  0.56485   0.0170196      5

This time, Gradient Boosting has the best performance.
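
Once a pipeline and learner combination has been selected, the whole pipeline can be trained on the data and used for prediction. Below is a minimal sketch, assuming fit! and transform! are the training and prediction calls for a pipeline that ends in a learner:

# train the best-performing pipeline on the whole dataset
bestcmc = @pipeline disc |> ((catf |> ohe) + (numf |> rbs) + (numf |> std)) |> gb
fit!(bestcmc, X, Y)

# predict on the training data and compute the in-sample accuracy
pred = transform!(bestcmc, X)
sum(pred .== Y) / length(Y)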