Preprocessing

Let us start by loading the diabetes dataset:

using AutoMLPipeline
using CSV
using DataFrames

diabetesdf = CSV.File(joinpath(dirname(pathof(AutoMLPipeline)),"../data/diabetes.csv")) |> DataFrame
X = diabetesdf[:,1:end-1]       # feature columns
Y = diabetesdf[:,end] |> Vector # target column (class)

We can check the data by showing the first 5 rows:

julia> show5(df)=first(df,5); # show first 5 rows

julia> show5(diabetesdf)
5×9 DataFrame
 Row │ preg   plas   pres   skin   insu   mass     pedi     age    class
     │ Int64  Int64  Int64  Int64  Int64  Float64  Float64  Int64  String
─────┼─────────────────────────────────────────────────────────────────────────────
   1 │     6    148     72     35      0     33.6    0.627     50  tested_positive
   2 │     1     85     66     29      0     26.6    0.351     31  tested_negative
   3 │     8    183     64      0      0     23.3    0.672     32  tested_positive
   4 │     1     89     66     23     94     28.1    0.167     21  tested_negative
   5 │     0    137     40     35    168     43.1    2.288     33  tested_positive
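
For a quick statistical overview of each column, the describe function from DataFrames can also be used (output omitted here):

describe(diabetesdf) # per-column summary statistics: mean, min, max, element type, etc.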

This UCI dataset is a collection of diagnostic measurements from Pima Indian patients, gathered to investigate whether a patient shows signs of diabetes based on the following features:

  • Number of times pregnant
  • Plasma glucose concentration at 2 hours in an oral glucose tolerance test
  • Diastolic blood pressure (mm Hg)
  • Triceps skin fold thickness (mm)
  • 2-Hour serum insulin (mu U/ml)
  • Body mass index (weight in kg/(height in m)^2)
  • Diabetes pedigree function
  • Age (years)
  • Class variable (tested_negative or tested_positive) indicating whether the patient is diabetic

What is interesting about this dataset is that one or more numeric columns may actually be categorical and should be hot-bit encoded. One way to verify is to compute the number of unique instances in each column and look for columns with relatively small counts:

julia> [n=>length(unique(x)) for (n,x) in pairs(eachcol(diabetesdf))]
9-element Vector{Pair{Symbol, Int64}}:
  :preg => 17
  :plas => 136
  :pres => 47
  :skin => 51
  :insu => 186
  :mass => 248
  :pedi => 517
   :age => 52
 :class => 2
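
To make the candidate categorical columns explicit, we can filter this list by a threshold (a small helper expression for illustration, not part of the AutoMLPipeline API):

# among the input features, only :preg has at most 24 unique instances
candidates = [n for (n,x) in pairs(eachcol(X)) if length(unique(x)) <= 24]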

Among the input columns, preg has only 17 unique instances and could be treated as a categorical variable. However, its description indicates that the feature counts the number of times the patient has been pregnant, so it can equally be considered numerical. Faced with this dilemma, we need to figure out which representation gives our classifier better performance. To test the two options, we can use the feature discriminator module to filter and transform the preg column to either numeric or categorical and choose the pipeline with the better performance.

CatNumDiscriminator for Detecting Categorical Numeric Features

CatNumDiscriminator transforms numeric columns with few unique instances into categorical columns.

Let us use CatNumDiscriminator, which expects one argument: the maximum number of unique instances a column may contain for it to be considered categorical. For the sake of this discussion, let us use its default value, which is 24.

using AutoMLPipeline

disc = CatNumDiscriminator(24) # columns with <= 24 unique instances become categorical
tr_disc = fit_transform!(disc,X,Y)
julia> show5(tr_disc)
5×8 DataFrame
 Row │ preg    plas   pres   skin   insu   mass     pedi     age
     │ String  Int64  Int64  Int64  Int64  Float64  Float64  Int64
─────┼─────────────────────────────────────────────────────────────
   1 │ 6         148     72     35      0     33.6    0.627     50
   2 │ 1          85     66     29      0     26.6    0.351     31
   3 │ 8         183     64      0      0     23.3    0.672     32
   4 │ 1          89     66     23     94     28.1    0.167     21
   5 │ 0         137     40     35    168     43.1    2.288     33
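
We can confirm the conversion by inspecting the element type of each column (a quick check, separate from the pipeline itself):

eltype.(eachcol(tr_disc)) # preg is now String; the remaining columns stay numeric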

You may notice that CatNumDiscriminator converted the preg column into a String column, which can then be fed to the hot-bit encoder to preprocess categorical data:

disc = CatNumDiscriminator(24)
catf = CatFeatureSelector() # select the categorical (String) columns
ohe = OneHotEncoder()       # hot-bit encode them
pohe = @pipeline disc |> catf |> ohe
tr_pohe = fit_transform!(pohe,X,Y)
julia> show5(tr_pohe)
5×17 DataFrame
 Row │ x1       x2       x3       x4       x5       x6       x7       x8       x9       x10      x11      x12      x13      x14      x15      x16      x17
     │ Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │     1.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0
   2 │     0.0      1.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0
   3 │     0.0      0.0      1.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0
   4 │     0.0      1.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0
   5 │     0.0      0.0      0.0      1.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0

We have now converted all categorical data into hot-bit encoded values.
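
The 17 output columns correspond exactly to the 17 unique values of preg, one hot-bit column per value. A quick sanity check:

length(unique(X.preg)) # 17 unique preg values
ncol(tr_pohe)          # 17 hot-bit columns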

For a typical scenario, one can consider columns with around 3-10 unique numeric instances to be categorical. Using CatNumDiscriminator, it is trivial to convert features with few unique instances into categorical columns and hot-bit encode them, as shown below. Let us use 5 as the cut-off, so any column with at most 5 unique instances is converted to hot-bits.

julia> using DataFrames: DataFrame, nrow, ncol

julia> df = rand(1:3,100,3) |> DataFrame;

julia> show5(df)
5×3 DataFrame
 Row │ x1     x2     x3
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      3      2
   2 │     2      3      3
   3 │     3      3      2
   4 │     2      1      3
   5 │     2      3      3

julia> disc = CatNumDiscriminator(5);

julia> pohe = @pipeline disc |> catf |> ohe;

julia> tr_pohe = fit_transform!(pohe,df);

julia> show5(tr_pohe)
5×9 DataFrame
 Row │ x1       x2       x3       x4       x5       x6       x7       x8       x9
     │ Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64
─────┼─────────────────────────────────────────────────────────────────────────────────
   1 │     1.0      0.0      0.0      1.0      0.0      0.0      1.0      0.0      0.0
   2 │     0.0      1.0      0.0      1.0      0.0      0.0      0.0      1.0      0.0
   3 │     0.0      0.0      1.0      1.0      0.0      0.0      1.0      0.0      0.0
   4 │     0.0      1.0      0.0      0.0      1.0      0.0      0.0      1.0      0.0
   5 │     0.0      1.0      0.0      1.0      0.0      0.0      0.0      1.0      0.0
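
Each of the three columns has three unique values, so hot-bit encoding produces 3 × 3 = 9 columns. We can verify this relationship directly:

ncol(tr_pohe) == sum(length(unique(c)) for c in eachcol(df)) # true: 9 == 3 + 3 + 3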

Concatenating Hot-Bits with PCA of Numeric Columns

Going back to the original diabetes dataset, we can now use CatNumDiscriminator to differentiate between categorical and numerical columns and preprocess each group based on its type (String vs Number). Below is the pipeline that converts the preg column to hot-bits and applies PCA to the numerical features:

pca = SKPreprocessor("PCA")
disc = CatNumDiscriminator(24)
ohe = OneHotEncoder()
catf = CatFeatureSelector()
numf = NumFeatureSelector()
pl = @pipeline disc |> ((numf |> pca) + (catf |> ohe)) # + concatenates the outputs of the two branches
res_pl = fit_transform!(pl,X,Y)
julia> show5(res_pl)
5×24 DataFrame
 Row │ x1        x2         x3          x4        x5         x6        x7          x1_1     x2_1     x3_1     x4_1     x5_1     x6_1     x7_1     x8       x9       x10      x11      x12      x13      x14      x15      x16      x17
     │ Float64   Float64    Float64     Float64   Float64    Float64   Float64     Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ -75.7103  -35.9097   -7.24498    15.8651    16.3111    3.44999   0.0983858      1.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0
   2 │ -82.3643   28.8482   -5.54565     8.92881    3.75463   5.57591  -0.0757265      0.0      1.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0
   3 │ -74.6222  -67.8436   19.4901     -5.61152  -10.7677    7.1707    0.245201       0.0      0.0      1.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0
   4 │  11.0716   34.84     -0.0969794   1.16049   -7.43015   2.58333  -0.267739       0.0      1.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0
   5 │  89.7362   -2.84072  25.1258     18.9669     8.76339  -9.50796   1.69639        0.0      0.0      0.0      1.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0
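
The 24 output columns are the concatenation of the two branches: 7 PCA components derived from the 7 numeric columns, plus the 17 hot-bit columns for preg:

ncol(res_pl) # 24 = 7 PCA components + 17 hot-bit columns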

Performance Evaluation

Let us compare the random forest (RF) cross-validation results between two options:

  • the preg column is treated as categorical, vs
  • the preg column is treated as numerical,

in predicting diabetes, where the numerical features are decomposed by PCA.

Option 1: Treat All Numeric Columns as Non-Categorical and Evaluate
pca = SKPreprocessor("PCA")
jrf = RandomForest()
ohe = OneHotEncoder()
catf = CatFeatureSelector()
numf = NumFeatureSelector()
disc = CatNumDiscriminator(0) # disable turning numeric columns into categorical features
pl = @pipeline disc |> ((numf |> pca) + (catf |> ohe)) |> jrf
julia> crossvalidate(pl,X,Y,"accuracy_score",30)
fold: 1, 0.8076923076923077
fold: 2, 0.6
fold: 3, 0.8846153846153846
fold: 4, 0.84
fold: 5, 0.7692307692307693
fold: 6, 0.6923076923076923
fold: 7, 0.6
fold: 8, 0.9230769230769231
fold: 9, 0.56
fold: 10, 0.7307692307692307
fold: 11, 0.8461538461538461
fold: 12, 0.8
fold: 13, 0.7307692307692307
fold: 14, 0.72
fold: 15, 0.6923076923076923
fold: 16, 0.7307692307692307
fold: 17, 0.76
fold: 18, 0.7307692307692307
fold: 19, 0.76
fold: 20, 0.8076923076923077
fold: 21, 0.8076923076923077
fold: 22, 0.84
fold: 23, 0.6538461538461539
fold: 24, 0.8
fold: 25, 0.6923076923076923
fold: 26, 0.7307692307692307
fold: 27, 0.68
fold: 28, 0.6538461538461539
fold: 29, 0.68
fold: 30, 0.7307692307692307
errors: 0
(mean = 0.7418461538461539, std = 0.08501079543050155, folds = 30, errors = 0)
Option 2: Treat Numeric Columns with at Most 24 Unique Instances as Categorical and Evaluate
disc = CatNumDiscriminator(24) # turn numeric columns into categorical if unique instances <= 24
pl = @pipeline disc |> ((numf |> pca) + (catf |> ohe)) |> jrf
julia> crossvalidate(pl,X,Y,"accuracy_score",30)
┌ Warning: Unseen value found in OneHotEncoder,
│                 for entry (7, 1) = 15.
│                 Patching value to 6.
└ @ AMLPipelineBase.BaseFilters ~/.julia/packages/AMLPipelineBase/DSlpJ/src/basefilters.jl:99
fold: 1, 0.7307692307692307
┌ Warning: Unseen value found in OneHotEncoder,
│                 for entry (7, 1) = 17.
│                 Patching value to 6.
└ @ AMLPipelineBase.BaseFilters ~/.julia/packages/AMLPipelineBase/DSlpJ/src/basefilters.jl:99
fold: 2, 0.68
fold: 3, 0.8461538461538461
fold: 4, 0.84
fold: 5, 0.7692307692307693
fold: 6, 0.7692307692307693
fold: 7, 0.84
fold: 8, 0.5769230769230769
fold: 9, 0.72
fold: 10, 0.7692307692307693
fold: 11, 0.6538461538461539
fold: 12, 0.68
fold: 13, 0.8846153846153846
fold: 14, 0.64
fold: 15, 0.7692307692307693
fold: 16, 0.9230769230769231
fold: 17, 0.64
fold: 18, 0.6538461538461539
fold: 19, 0.8
fold: 20, 0.8076923076923077
fold: 21, 0.6923076923076923
fold: 22, 0.68
fold: 23, 0.8076923076923077
fold: 24, 0.72
fold: 25, 0.9615384615384616
fold: 26, 0.7692307692307693
fold: 27, 0.72
fold: 28, 0.6153846153846154
fold: 29, 0.76
fold: 30, 0.6538461538461539
errors: 0
(mean = 0.7457948717948717, std = 0.09283991864007987, folds = 30, errors = 0)

In this run the two options perform comparably: mean accuracy is 0.742 with preg treated as numerical and 0.746 with preg treated as categorical, a difference well within one standard deviation, and both complete all 30 folds with zero errors. Note, however, the OneHotEncoder warnings in the categorical pipeline: some validation folds contain preg values unseen during training, which the encoder patches to a known value. This is a symptom of the increased data sparsity that hot-bit encoding introduces, and with a stricter encoder or less data such unseen values could cause evaluation errors rather than warnings. Since the categorical treatment adds complexity and sparsity without a clear performance gain, treating preg as numerical remains the simpler and safer choice.
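
To avoid duplicating the pipeline code when running such comparisons, the two evaluations above can also be written as a loop over the discriminator cut-off (a sketch assuming the component definitions from Option 1; crossvalidate returns a named tuple with mean and std fields, as seen in the outputs above):

for maxcat in (0, 24)
    disc = CatNumDiscriminator(maxcat)
    pl = @pipeline disc |> ((numf |> pca) + (catf |> ohe)) |> jrf
    res = crossvalidate(pl,X,Y,"accuracy_score",30)
    println("maxcat = $maxcat: mean accuracy = $(res.mean), std = $(res.std)")
end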