Preprocessing
Let us start by loading the diabetes dataset:
using AutoMLPipeline
using CSV
using DataFrames
# load the diabetes dataset bundled with the package
diabetesdf = CSV.File(joinpath(dirname(pathof(AutoMLPipeline)),"../data/diabetes.csv")) |> DataFrame
X = diabetesdf[:,1:end-1]       # input features
Y = diabetesdf[:,end] |> Vector # target labels
We can check the data by showing the first 5 rows:
julia> show5(df)=first(df,5); # show first 5 rows
julia> show5(diabetesdf)
5×9 DataFrame
Row │ preg plas pres skin insu mass pedi age class
│ Int64 Int64 Int64 Int64 Int64 Float64 Float64 Int64 String
─────┼─────────────────────────────────────────────────────────────────────────────
1 │ 6 148 72 35 0 33.6 0.627 50 tested_positive
2 │ 1 85 66 29 0 26.6 0.351 31 tested_negative
3 │ 8 183 64 0 0 23.3 0.672 32 tested_positive
4 │ 1 89 66 23 94 28.1 0.167 21 tested_negative
5 │ 0 137 40 35 168 43.1 2.288 33 tested_positive
This UCI dataset is a collection of diagnostic test results from Pima Indian patients, used to investigate whether a patient shows signs of diabetes based on the following features:
- Number of times pregnant
- Plasma glucose concentration at 2 hours in an oral glucose tolerance test
- Diastolic blood pressure (mm Hg)
- Triceps skin fold thickness (mm)
- 2-Hour serum insulin (mu U/ml)
- Body mass index (weight in kg/(height in m)^2)
- Diabetes pedigree function
- Age (years)
- Class variable (tested_positive or tested_negative) indicating whether the patient shows signs of diabetes
What is interesting about this dataset is that one or more numeric columns may actually be categorical and should be hot-bit encoded. One way to verify is to compute the number of unique instances in each column and look for columns with relatively small counts:
julia> [n=>length(unique(x)) for (n,x) in pairs(eachcol(diabetesdf))] |> collect
9-element Vector{Pair{Symbol, Int64}}:
:preg => 17
:plas => 136
:pres => 47
:skin => 51
:insu => 186
:mass => 248
:pedi => 517
:age => 52
:class => 2
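To pick out candidate categorical columns programmatically, we can filter the same pairs by a threshold. Here is a minimal sketch (threshold and catcandidates are illustrative names; the cut-off of 24 matches the CatNumDiscriminator default used below):
# columns whose unique-instance count is at or below the cut-off
threshold = 24
catcandidates = [n for (n,x) in pairs(eachcol(diabetesdf)) if length(unique(x)) <= threshold]
# yields [:preg, :class] for this dataset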
Among the input columns, preg has only 17 unique instances, so it could be treated as a categorical variable. However, its description indicates that the feature refers to the number of times the patient has been pregnant, so it can also be considered numerical. Given this dilemma, we need to figure out which representation provides better performance for our classifier. To test the two options, we can use the CatNumDiscriminator module to filter and transform the preg column to either numeric or categorical and choose the pipeline with the better performance.
CatNumDiscriminator for Detecting Categorical Numeric Features
Transform numeric columns with few unique instances into categories.
Let us use CatNumDiscriminator, which takes one argument indicating the maximum number of unique instances for a column to be considered categorical. For the sake of this discussion, let us use its default value, which is 24.
using AutoMLPipeline

# treat numeric columns with at most 24 unique instances as categorical
disc = CatNumDiscriminator(24)
tr_disc = fit_transform!(disc,X,Y)
julia> show5(tr_disc)
5×8 DataFrame
Row │ preg plas pres skin insu mass pedi age
│ String Int64 Int64 Int64 Int64 Float64 Float64 Int64
─────┼─────────────────────────────────────────────────────────────
1 │ 6 148 72 35 0 33.6 0.627 50
2 │ 1 85 66 29 0 26.6 0.351 31
3 │ 8 183 64 0 0 23.3 0.672 32
4 │ 1 89 66 23 94 28.1 0.167 21
5 │ 0 137 40 35 168 43.1 2.288 33
You may notice that the preg column is converted by CatNumDiscriminator into a String type, which can then be fed to the hot-bit encoder to preprocess categorical data:
disc = CatNumDiscriminator(24)
catf = CatFeatureSelector() # select categorical columns
ohe = OneHotEncoder()       # hot-bit encode them
pohe = @pipeline disc |> catf |> ohe
tr_pohe = fit_transform!(pohe,X,Y)
julia> show5(tr_pohe)
5×17 DataFrame
Row │ x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16 x17
│ Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 │ 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 │ 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 │ 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 │ 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
We have now converted all categorical data into hot-bit encoded values.
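As a quick sanity check (a sketch, assuming ncol from DataFrames), the number of hot-bit columns should equal the number of unique preg values, 17 in this dataset:
using DataFrames: ncol
# one hot-bit column per unique value of preg
@assert ncol(tr_pohe) == length(unique(X.preg))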
In a typical scenario, one can consider columns with around 3-10 unique numeric instances to be categorical. Using CatNumDiscriminator, it is trivial to convert feature columns with few unique instances into categorical ones and hot-bit encode them, as shown below. Let us use 5 as the cut-off, so any column with at most 5 unique instances is converted to hot-bits.
julia> using DataFrames: DataFrame, nrow, ncol
julia> df = rand(1:3,100,3) |> DataFrame;
julia> show5(df)
5×3 DataFrame
Row │ x1 x2 x3
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 3 2
2 │ 2 3 3
3 │ 3 3 2
4 │ 2 1 3
5 │ 2 3 3
julia> disc = CatNumDiscriminator(5);
julia> pohe = @pipeline disc |> catf |> ohe;
julia> tr_pohe = fit_transform!(pohe,df);
julia> show5(tr_pohe)
5×9 DataFrame
Row │ x1 x2 x3 x4 x5 x6 x7 x8 x9
│ Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64
─────┼─────────────────────────────────────────────────────────────────────────────────
1 │ 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0
2 │ 0.0 1.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0
3 │ 0.0 0.0 1.0 1.0 0.0 0.0 1.0 0.0 0.0
4 │ 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0
5 │ 0.0 1.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0
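Each of the three columns draws values from 1:3, so, assuming all three values occur in every column (virtually certain with 100 rows), the encoded frame has 3 × 3 = 9 hot-bit columns. A quick check:
# total hot-bit columns = sum of unique-value counts over the original columns
@assert ncol(tr_pohe) == sum(length(unique(c)) for c in eachcol(df))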
Concatenating Hot-Bits with PCA of Numeric Columns
Going back to the original diabetes dataset, we can now use CatNumDiscriminator to differentiate between categorical and numerical columns and preprocess each group based on its type (String vs. Number). Below is the pipeline that converts the preg column to hot-bits and applies PCA to the numerical features:
pca = SKPreprocessor("PCA")
disc = CatNumDiscriminator(24)
ohe = OneHotEncoder()
catf = CatFeatureSelector()
numf = NumFeatureSelector()
# PCA branch for numeric columns, concatenated (+) with the hot-bit branch for categorical columns
pl = @pipeline disc |> ((numf |> pca) + (catf |> ohe))
res_pl = fit_transform!(pl,X,Y)
julia> show5(res_pl)
5×24 DataFrame
Row │ x1 x2 x3 x4 x5 x6 x7 x1_1 x2_1 x3_1 x4_1 x5_1 x6_1 x7_1 x8 x9 x10 x11 x12 x13 x14 x15 x16 x17
│ Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ -75.7103 -35.9097 -7.24498 15.8651 16.3111 3.44999 0.0983858 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 │ -82.3643 28.8482 -5.54565 8.92881 3.75463 5.57591 -0.0757265 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 │ -74.6222 -67.8436 19.4901 -5.61152 -10.7677 7.1707 0.245201 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 │ 11.0716 34.84 -0.0969794 1.16049 -7.43015 2.58333 -0.267739 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 │ 89.7362 -2.84072 25.1258 18.9669 8.76339 -9.50796 1.69639 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
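The 24 output columns come from the 7 PCA components of the numeric features (assuming PCA keeps all 7 components by default) concatenated with the 17 hot-bit columns derived from preg. A quick check:
# 7 numeric columns -> 7 PCA components, plus one hot-bit column per unique preg value
@assert ncol(res_pl) == 7 + length(unique(X.preg))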
Performance Evaluation
Let us compare the random forest cross-validation results between two options for predicting diabetes:
- the preg column is treated as categorical, vs.
- the preg column is treated as numerical,
where the numerical features are decomposed by PCA.
Option 1: Treat All Numeric Columns as Non-Categorical and Evaluate
pca = SKPreprocessor("PCA")
dt = SKLearner("DecisionTreeClassifier")
rf = SKLearner("RandomForestClassifier")
rbs = SKPreprocessor("RobustScaler")
jrf = RandomForest()
lsvc = SKLearner("LinearSVC")
ohe = OneHotEncoder()
catf = CatFeatureSelector()
numf = NumFeatureSelector()
disc = CatNumDiscriminator(0) # disable turning numeric to categorical features
pl = @pipeline disc |> ((numf |> pca) + (catf |> ohe)) |> jrf
julia> crossvalidate(pl,X,Y,"accuracy_score",30)
fold: 1, 0.8076923076923077
fold: 2, 0.6
fold: 3, 0.8846153846153846
fold: 4, 0.84
fold: 5, 0.7692307692307693
fold: 6, 0.6923076923076923
fold: 7, 0.6
fold: 8, 0.9230769230769231
fold: 9, 0.56
fold: 10, 0.7307692307692307
fold: 11, 0.8461538461538461
fold: 12, 0.8
fold: 13, 0.7307692307692307
fold: 14, 0.72
fold: 15, 0.6923076923076923
fold: 16, 0.7307692307692307
fold: 17, 0.76
fold: 18, 0.7307692307692307
fold: 19, 0.76
fold: 20, 0.8076923076923077
fold: 21, 0.8076923076923077
fold: 22, 0.84
fold: 23, 0.6538461538461539
fold: 24, 0.8
fold: 25, 0.6923076923076923
fold: 26, 0.7307692307692307
fold: 27, 0.68
fold: 28, 0.6538461538461539
fold: 29, 0.68
fold: 30, 0.7307692307692307
errors: 0
(mean = 0.7418461538461539, std = 0.08501079543050155, folds = 30, errors = 0)
Option 2: Treat Numeric Columns with at Most 24 Unique Instances as Categorical and Evaluate
disc = CatNumDiscriminator(24) # turning numeric to categorical if unique instances <= 24
pl = @pipeline disc |> ((numf |> pca) + (catf |> ohe)) |> jrf
julia> crossvalidate(pl,X,Y,"accuracy_score",30)
┌ Warning: Unseen value found in OneHotEncoder,
│ for entry (7, 1) = 15.
│ Patching value to 6.
└ @ AMLPipelineBase.BaseFilters ~/.julia/packages/AMLPipelineBase/DSlpJ/src/basefilters.jl:99
fold: 1, 0.7307692307692307
┌ Warning: Unseen value found in OneHotEncoder,
│ for entry (7, 1) = 17.
│ Patching value to 6.
└ @ AMLPipelineBase.BaseFilters ~/.julia/packages/AMLPipelineBase/DSlpJ/src/basefilters.jl:99
fold: 2, 0.68
fold: 3, 0.8461538461538461
fold: 4, 0.84
fold: 5, 0.7692307692307693
fold: 6, 0.7692307692307693
fold: 7, 0.84
fold: 8, 0.5769230769230769
fold: 9, 0.72
fold: 10, 0.7692307692307693
fold: 11, 0.6538461538461539
fold: 12, 0.68
fold: 13, 0.8846153846153846
fold: 14, 0.64
fold: 15, 0.7692307692307693
fold: 16, 0.9230769230769231
fold: 17, 0.64
fold: 18, 0.6538461538461539
fold: 19, 0.8
fold: 20, 0.8076923076923077
fold: 21, 0.6923076923076923
fold: 22, 0.68
fold: 23, 0.8076923076923077
fold: 24, 0.72
fold: 25, 0.9615384615384616
fold: 26, 0.7692307692307693
fold: 27, 0.72
fold: 28, 0.6153846153846154
fold: 29, 0.76
fold: 30, 0.6538461538461539
errors: 0
(mean = 0.7457948717948717, std = 0.09283991864007987, folds = 30, errors = 0)
From this evaluation, the two pipelines perform comparably: treating preg as categorical scores marginally higher on average (0.746 vs. 0.742) but with larger variance, so the difference is well within the fold-to-fold noise, and the simpler numerical treatment of preg is a reasonable choice. One thing to note is the OneHotEncoder warnings raised during cross-validation for the pipeline that treats preg as categorical. A k-fold training subset may not contain every category, so unseen values encountered at prediction time are patched to a known category. This is a side effect of hot-bit encoding increasing data sparsity, and it is avoided entirely when preg is kept numerical.
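To automate this comparison, one can loop over candidate cut-offs and cross-validate the resulting pipelines. Below is a minimal sketch reusing the operators defined above, with the fold count reduced to 10 for brevity:
# compare discriminator cut-offs by mean cross-validated accuracy
for maxcat in (0, 24)
    disc = CatNumDiscriminator(maxcat)
    pl = @pipeline disc |> ((numf |> pca) + (catf |> ohe)) |> jrf
    res = crossvalidate(pl,X,Y,"accuracy_score",10)
    println("max unique instances = $maxcat: mean accuracy = $(res.mean)")
end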