Training and Validation
Let us continue our discussion by using another dataset. This time, let's use the CMC dataset, whose features are mostly categorical. CMC is a survey that asks women about their contraceptive method of choice. The dataset is composed of the features shown in the preview below:
using AutoMLPipeline
using CSV
using DataFrames

# load the CMC dataset bundled with AutoMLPipeline
cmcdata = CSV.File(joinpath(dirname(pathof(AutoMLPipeline)),"../data/cmc.csv")) |> DataFrame;
X = cmcdata[:,1:end-1]        # feature columns
Y = cmcdata[:,end] .|> string # target column as strings
show5(df) = first(df,5)       # helper: preview the first five rows
julia> show5(cmcdata)
5×10 DataFrame
Row │ Wifes_age Wifes_education Husbands_education Number_of_children_ever_born Wifes_religion Wifes_now_working.3F Husbands_occupation Standard.of.living_index Media_exposure Contraceptive_method_used
│ Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 24 2 3 3 1 1 2 3 0 1
2 │ 45 1 3 10 1 1 3 4 0 1
3 │ 43 2 3 7 1 1 3 4 0 1
4 │ 42 3 2 9 1 1 3 3 0 1
5 │ 36 3 3 8 1 1 3 2 0 1
Let's examine the number of unique instances for each column:
[n => length(unique(c)) for (n, c) in pairs(eachcol(cmcdata))]
Except for Wife's age and Number of children ever born, each of the remaining columns has fewer than five unique values. Let's create a pipeline that selects those low-cardinality columns, converts them to one-hot bits, and concatenates them with the standardized scale of the numeric columns.
std = SKPreprocessor("StandardScaler")    # standardize numeric columns
ohe = OneHotEncoder()                     # one-hot encode categorical columns
kohe = SKPreprocessor("OneHotEncoder")    # scikit-learn's one-hot encoder (alternative)
catf = CatFeatureSelector()               # select the categorical columns
numf = NumFeatureSelector()               # select the numeric columns
disc = CatNumDiscriminator(5)             # columns with <= 5 unique instances are treated as categories
pcmc = @pipeline disc |> ((catf |> ohe) + (numf |> std))
dfcmc = fit_transform!(pcmc,X)
julia> show5(dfcmc)
5×24 DataFrame
Row │ x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16 x17 x18 x19 x20 x21 x22 x1_1 x2_1
│ Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 -1.03817 -0.110856
2 │ 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 1.51519 2.85808
3 │ 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 1.27202 1.58568
4 │ 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 1.15043 2.43394
5 │ 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.420897 2.00981
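As a quick sanity check we can add (using only functions already loaded): each column treated as categorical by disc expands to one one-hot column per unique value, while the remaining numeric columns stay as single standardized columns, so the width of dfcmc should match the count computed below.
ncats = [length(unique(c)) for c in eachcol(X) if length(unique(c)) <= 5]   # one-hot columns per categorical feature
nnum  = count(c -> length(unique(c)) > 5, eachcol(X))                       # remaining numeric features
sum(ncats) + nnum == ncol(dfcmc)   # expected to hold for this pipeline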
Evaluate Learners with the Same Pipeline
You can get a list of sklearners and skpreprocessors by using the following function calls:
julia> sklearners()
syntax: SKLearner(name::String, args::Dict=Dict())
where 'name' can be one of:
AdaBoostClassifier AdaBoostRegressor ARDRegression BaggingClassifier BayesianRidge BernoulliNB ComplementNB DecisionTreeClassifier DecisionTreeRegressor ElasticNet ExtraTreesClassifier ExtraTreesRegressor GaussianNB GaussianProcessClassifier GaussianProcessRegressor GradientBoostingClassifier GradientBoostingRegressor IsotonicRegression KernelRidge KNeighborsClassifier KNeighborsRegressor Lars Lasso LassoLars LinearDiscriminantAnalysis LinearSVC LogisticRegression MLPClassifier MLPRegressor MultinomialNB NearestCentroid NuSVC OrthogonalMatchingPursuit PassiveAggressiveClassifier PassiveAggressiveRegressor QuadraticDiscriminantAnalysis RadiusNeighborsClassifier RadiusNeighborsRegressor RandomForestClassifier RandomForestRegressor Ridge RidgeClassifier RidgeClassifierCV RidgeCV SGDClassifier SGDRegressor SVC SVR VotingClassifier
and 'args' are the corresponding learner's initial parameters.
Note: Consult Scikitlearn's online help for more details about the learner's arguments.
julia> skpreprocessors()
syntax: SKPreprocessor(name::String, args::Dict=Dict())
where *name* can be one of:
Binarizer chi2 dict_learning dict_learning_online DictionaryLearning f_classif f_regression FactorAnalysis FastICA fastica FunctionTransformer GenericUnivariateSelect IncrementalPCA KBinsDiscretizer KernelCenterer KernelPCA LabelBinarizer LabelEncoder LatentDirichletAllocation MaxAbsScaler MiniBatchDictionaryLearning MiniBatchSparsePCA MinMaxScaler MissingIndicator MultiLabelBinarizer mutual_info_classif mutual_info_regression NMF non_negative_factorization Normalizer OneHotEncoder OrdinalEncoder PCA PolynomialFeatures PowerTransformer QuantileTransformer RFE RFECV RobustScaler SelectFdr SelectFpr SelectFromModel SelectFwe SelectKBest SelectPercentile SimpleImputer sparse_encode SparseCoder SparsePCA StandardScaler TruncatedSVD VarianceThreshold
and *args* are the corresponding preprocessor's initial parameters.
Note: Please consult Scikitlearn's online help for more details about the preprocessor's arguments.
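The optional args Dict passes initial parameters to the wrapped model. Below is a hedged sketch; the :impl_args key is assumed here to be the slot for the underlying scikit-learn keyword arguments, so please verify it against the package documentation before relying on it.
# assumed layout: :impl_args carries the scikit-learn keyword arguments
rf200  = SKLearner("RandomForestClassifier", Dict(:impl_args => Dict(:n_estimators => 200)))
minmax = SKPreprocessor("MinMaxScaler")   # preprocessors accept an args Dict the same way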
Let us evaluate 4 learners using the same preprocessing pipeline:
jrf = RandomForest()
ada = SKLearner("AdaBoostClassifier")
sgd = SKLearner("SGDClassifier")
tree = PrunedTree()
using DataFrames: DataFrame, nrow, ncol
learners = DataFrame()
for learner in [jrf,ada,sgd,tree]
    pipe = @pipeline disc |> ((catf |> ohe) + (numf |> std)) |> learner
    println(learner.name)
    mean, sd, folds = crossvalidate(pipe, X, Y, "accuracy_score", 5)
    global learners = vcat(learners, DataFrame(name=learner.name, mean=mean, sd=sd, kfold=folds))
end;
rf_3Of
fold: 1, 0.5288135593220339
fold: 2, 0.5306122448979592
fold: 3, 0.559322033898305
fold: 4, 0.5034013605442177
fold: 5, 0.5322033898305085
errors: 0
AdaBoostClassifier_GDC
fold: 1, 0.5389830508474577
fold: 2, 0.5238095238095238
fold: 3, 0.511864406779661
fold: 4, 0.5238095238095238
fold: 5, 0.5864406779661017
errors: 0
SGDClassifier_XxZ
fold: 1, 0.4711864406779661
fold: 2, 0.5034013605442177
fold: 3, 0.4406779661016949
fold: 4, 0.5102040816326531
fold: 5, 0.44745762711864406
errors: 0
prunetree_eOz
fold: 1, 0.4915254237288136
fold: 2, 0.4897959183673469
fold: 3, 0.4406779661016949
fold: 4, 0.47619047619047616
fold: 5, 0.488135593220339
errors: 0
julia> @show learners;
learners = 4×4 DataFrame
Row │ name mean sd kfold
│ String Float64 Float64 Int64
─────┼────────────────────────────────────────────────────
1 │ rf_3Of 0.530871 0.0198124 5
2 │ AdaBoostClassifier_GDC 0.536981 0.0292749 5
3 │ SGDClassifier_XxZ 0.474585 0.0316079 5
4 │ prunetree_eOz 0.477265 0.0213209 5
For this particular pipeline, AdaBoost has the best performance, followed by RandomForest.
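The same ranking can also be read off programmatically from the learners summary using plain DataFrames calls:
# sort the cross-validation summary by mean accuracy, best learner first
sort(learners, :mean, rev = true) |> show5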
Let's extend the pipeline by adding a Gradient Boosting learner and a Robust Scaler.
rbs = SKPreprocessor("RobustScaler")
gb = SKLearner("GradientBoostingClassifier")
learners = DataFrame()
for learner in [jrf,ada,sgd,tree,gb]
    pipe = @pipeline disc |> ((catf |> ohe) + (numf |> rbs) + (numf |> std)) |> learner
    println(learner.name)
    mean, sd, folds = crossvalidate(pipe, X, Y, "accuracy_score", 5)
    global learners = vcat(learners, DataFrame(name=learner.name, mean=mean, sd=sd, kfold=folds))
end;
rf_3Of
fold: 1, 0.49830508474576274
fold: 2, 0.4557823129251701
fold: 3, 0.511864406779661
fold: 4, 0.5102040816326531
fold: 5, 0.4915254237288136
errors: 0
AdaBoostClassifier_GDC
fold: 1, 0.5186440677966102
fold: 2, 0.5306122448979592
fold: 3, 0.5627118644067797
fold: 4, 0.6020408163265306
fold: 5, 0.49491525423728816
errors: 0
SGDClassifier_XxZ
fold: 1, 0.4711864406779661
fold: 2, 0.46258503401360546
fold: 3, 0.4610169491525424
fold: 4, 0.47278911564625853
fold: 5, 0.4271186440677966
errors: 0
prunetree_eOz
fold: 1, 0.49491525423728816
fold: 2, 0.4523809523809524
fold: 3, 0.44745762711864406
fold: 4, 0.47619047619047616
fold: 5, 0.5322033898305085
errors: 0
GradientBoostingClassifier_Lck
fold: 1, 0.576271186440678
fold: 2, 0.5680272108843537
fold: 3, 0.5457627118644067
fold: 4, 0.5850340136054422
fold: 5, 0.5491525423728814
errors: 0
julia> @show learners;
learners = 5×4 DataFrame
Row │ name mean sd kfold
│ String Float64 Float64 Int64
─────┼────────────────────────────────────────────────────────────
1 │ rf_3Of 0.493536 0.022726 5
2 │ AdaBoostClassifier_GDC 0.541785 0.0416107 5
3 │ SGDClassifier_XxZ 0.458939 0.0185201 5
4 │ prunetree_eOz 0.48063 0.034576 5
5 │ GradientBoostingClassifier_Lck 0.56485 0.0170196 5
This time, the Gradient Boosting learner has the best performance.
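To put the winning pipeline to use, one would typically retrain it on a training split and check its hold-out accuracy. The following is a minimal sketch, assuming fit!, transform!, and score(:accuracy, ...) behave as in the package's basic usage; the 70/30 split and the variable names (pgb, tr, te) are illustrative.
using Random

idx = shuffle(1:nrow(X))                 # illustrative 70/30 hold-out split
ntr = floor(Int, 0.7 * length(idx))
tr, te = idx[1:ntr], idx[ntr+1:end]

pgb = @pipeline disc |> ((catf |> ohe) + (numf |> rbs) + (numf |> std)) |> gb
fit!(pgb, X[tr,:], Y[tr])                # train on the training rows only
pred = transform!(pgb, X[te,:])          # predict the hold-out rows
score(:accuracy, pred, Y[te])            # hold-out accuracy (assumed score signature)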