Extending AutoMLPipeline

Having a meta-ML package sounds ideal but is not practical in terms of maintainability and flexibility. The metapackage becomes a central point of failure and a bottleneck. It doesn't subscribe to the KISS philosophy of Unix, which encourages decentralization of implementation. As long as the input and output behavior of transformers and learners follows a standard format, they should work without depending on or communicating with one another. With consistent input/output interfaces, the passing of information among the elements of the pipeline will not bring any surprises to the receivers and transmitters of information down the line.

Because AMLP's symbolic pipeline is based on the idea of Linux pipes and filters, there is a deliberate effort to follow the KISS philosophy as much as possible by having just two interfaces to overload (fit! and transform!), with input features of DataFrame type and target output of Vector type. A transformer's fit! function expects only one input argument and ignores the target argument. On the other hand, the fit! function of any learner requires both input and target arguments to carry out the supervised learning phase. For the transform! function, both learners and transformers expect one input argument, to which they apply their learned parameters to produce a prediction, decomposition, normalization, scaling, etc.

AMLP Abstract Types

The AMLP abstract types are composed of the following:

abstract type Machine end
abstract type Workflow    <:  Machine  end 
abstract type Computer    <:  Machine  end 
abstract type Learner     <:  Computer end
abstract type Transformer <:  Computer end

At the top of the hierarchy is the Machine abstraction, which supports two major interfaces: fit! and transform!. The abstract Machine has two major subtypes: Computer and Workflow. Computer types perform computations, such as those carried out by filters, transformers, and learners, while a Workflow controls the flow of information. A Workflow can be a sequential flow of information or a combination of information from two or more workflows. A Workflow that provides sequential flow is called a Pipeline (or linear pipeline), while one that combines information from different workflows is called a ComboPipeline.
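
The two kinds of Workflow can be sketched with the same `@pipeline` syntax used later in this document, where `|>` chains elements sequentially and `+` combines the outputs of parallel branches. The small DataFrame and its column names below are made up for illustration, and the sketch assumes that `fit_transform!` accepts the features as its second argument.

```julia
using DataFrames: DataFrame, nrow, ncol
using AutoMLPipeline

# Hypothetical toy data: one categorical and two numeric columns.
df = DataFrame(team   = ["BUF", "CHI", "BUF", "NO"],
               points = [27.0, 17.0, 51.0, 28.0],
               spread = [4.0, 3.0, 2.5, 5.5])

numf = NumFeatureSelector()
catf = CatFeatureSelector()
ohe  = OneHotEncoder()

# Linear pipeline: select the categorical columns, then one-hot encode them.
linear = @pipeline catf |> ohe
out_linear = fit_transform!(linear, df)

# Combo pipeline: numeric columns side by side with the encoded categoricals.
combo = @pipeline numf + (catf |> ohe)
out_combo = fit_transform!(combo, df)

println(size(out_linear))
println(size(out_combo))
```

Each branch of the combo pipeline receives the same input, and their outputs are concatenated column-wise, so the number of rows should be unchanged.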

The Computer type has two subtypes: Learner and Transformer. Their main difference is in the behavior of their fit! function. The Learner type learns its parameters by finding a mapping function between its input and output arguments, while the Transformer does not require this mapping function to perform its operation. The Transformer learns all its parameters by just processing its input features. Both Transformer and Learner behave similarly in the transform! function: both apply their learned parameters to transform their input into output.
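
The difference in fit! behavior can be illustrated with a minimal, stand-alone sketch. It does not use AutoMLPipeline: the abstract types are local copies of the hierarchy above, and `MeanCenterer` and `MajorityClassifier` are hypothetical names invented for this example. In a real extension, the abstract types would come from `AutoMLPipeline.AbsTypes`, and fit! and transform! would be imported from AutoMLPipeline, as in the CSVReader example that follows.

```julia
# Import only what is needed: DataFrames exports its own transform!,
# which would clash with the transform! defined here.
using DataFrames: DataFrame, nrow
using Statistics: mean

# Local stand-ins for the AMLP hierarchy (assumption: for illustration only).
abstract type Machine end
abstract type Computer    <: Machine  end
abstract type Learner     <: Computer end
abstract type Transformer <: Computer end

# Hypothetical transformer: learns column means from the features alone.
mutable struct MeanCenterer <: Transformer
    name::String
    model::Dict{Symbol,Any}
    MeanCenterer() = new("meancenterer", Dict{Symbol,Any}())
end

# The target is accepted for a uniform interface but ignored.
function fit!(mc::MeanCenterer, X::DataFrame, y::Vector=[])
    mc.model[:means] = Dict(n => mean(X[!, n]) for n in names(X))
    return nothing
end

function transform!(mc::MeanCenterer, X::DataFrame)
    means = mc.model[:means]
    out = copy(X)
    for n in names(X)
        out[!, n] = X[!, n] .- means[n]   # subtract the learned mean
    end
    return out
end

# Hypothetical learner: needs both features and target to fit.
mutable struct MajorityClassifier <: Learner
    name::String
    model::Dict{Symbol,Any}
    MajorityClassifier() = new("majority", Dict{Symbol,Any}())
end

function fit!(clf::MajorityClassifier, X::DataFrame, y::Vector)
    isempty(y) && throw(ArgumentError("empty target."))
    nrow(X) == length(y) || throw(ArgumentError("features and target differ in length."))
    counts = Dict{eltype(y),Int}()
    for label in y
        counts[label] = get(counts, label, 0) + 1
    end
    best = first(y)
    for (label, c) in counts
        c > counts[best] && (best = label)   # ties are broken arbitrarily
    end
    clf.model[:majority] = best
    return nothing
end

# Predict the most frequent training label for every row.
transform!(clf::MajorityClassifier, X::DataFrame) = fill(clf.model[:majority], nrow(X))

X = DataFrame(a = [1.0, 2.0, 3.0], b = [10.0, 20.0, 30.0])
y = ["yes", "no", "yes"]

center = MeanCenterer()
fit!(center, X)                 # features only
Xc = transform!(center, X)

clf = MajorityClassifier()
fit!(clf, Xc, y)                # features and target
pred = transform!(clf, Xc)
```

Both types store their learned state in a `model` dictionary and expose the same transform! signature, which is what lets a pipeline treat them uniformly.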

Extending AMLP by Adding a CSVReader Transformer

Let's extend AMLP by adding CSV reading support embedded in the pipeline. Instead of passing the data as an argument to the pipeline, we create a CSV transformer that reads the data from a CSV file and passes it to the succeeding elements in the pipeline.

module FileReaders

using CSV
using DataFrames: DataFrame, nrow,ncol

using AutoMLPipeline
using AutoMLPipeline.AbsTypes # abstract types (Learners and Transformers)

import AutoMLPipeline.fit!
import AutoMLPipeline.transform!

export fit!, transform!
export CSVReader

# define a user-defined structure for type dispatch
mutable struct CSVReader <: Transformer
   name::String
   model::Dict

   function CSVReader(args = Dict(:fname=>""))
      fname = args[:fname]
      fname != "" || throw(ArgumentError("missing filename."))
      isfile(fname) || throw(ArgumentError("file does not exist."))
      new(fname,args)
   end
end

CSVReader(fname::String) = CSVReader(Dict(:fname=>fname))

# Define fit!, which only does error checking. You can also make
# it do nothing and let the transform! function do the
# checking and loading. The fit! function is defined here
# only to make sure there is a fit! dispatch for the CSVReader
# type, which is needed when the pipeline iterates over its elements.
function fit!(csvreader::CSVReader, df::DataFrame=DataFrame(), target::Vector=Vector())
   fname = csvreader.name
   isfile(fname) || throw(ArgumentError("file does not exist."))
end

# define transform!, which reads the file and returns a DataFrame
function transform!(csvreader::CSVReader, df::DataFrame=DataFrame())
   fname = csvreader.name
   df = CSV.File(fname) |> DataFrame
   nrow(df) > 0 || throw(ArgumentError("empty dataframe."))
   return df
end
end

Let's now load the FileReaders module together with the other AutoMLPipeline modules and create a pipeline that includes the csv reader we just created.

using DataFrames: DataFrame, nrow,ncol


using AutoMLPipeline

using .FileReaders # load from the Main module

# column selectors, PCA preprocessor, and one-hot encoder
catf = CatFeatureSelector()
numf = NumFeatureSelector()
pca = SKPreprocessor("PCA")
ohe = OneHotEncoder()

fname = joinpath(dirname(pathof(AutoMLPipeline)),"../data/profb.csv")
csvrdr = CSVReader(fname)

p1 = @pipeline csvrdr |> (catf + numf)
df1 = fit_transform!(p1) # no arguments because the input comes from csvrdr
julia> first(df1,5)
5×7 DataFrame
 Row │ Home.Away  Favorite_Name  Underdog_name  Favorite_Points  Underdog_Points  Pointspread  Year
     │ String     String         String         Int64            Int64            Float64      Int64
─────┼───────────────────────────────────────────────────────────────────────────────────────────────
   1 │ away       BUF            MIA                         27               24          4.0     89
   2 │ at_home    CHI            CIN                         17               14          3.0     89
   3 │ away       CLE            PIT                         51                0          2.5     89
   4 │ at_home    NO             DAL                         28                0          5.5     89
   5 │ at_home    MIN            HOU                         38                7          5.5     89
p2 = @pipeline csvrdr |> (numf |> pca) + (catf |> ohe)
df2 = fit_transform!(p2) # no arguments because the input comes from csvrdr
julia> first(df2,5)
5×62 DataFrame
 Row │ x1        x2         x3        x4        x1_1     x2_1     x3_1     x4_1     x5       x6       x7       x8       x9       x10      x11      x12      x13      x14      x15      x16      x17      x18      x19      x20      x21      x22      x23      x24      x25      x26      x27      x28      x29      x30      x31      x32      x33      x34      x35      x36      x37      x38      x39      x40      x41      x42      x43      x44      x45      x46      x47      x48      x49      x50      x51      x52      x53      x54      x55      x56      x57      x58
     │ Float64   Float64    Float64   Float64   Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │  2.47477    7.87074  -1.10495  0.902431      1.0      0.0      1.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      1.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0
   2 │ -5.47113   -3.82946  -2.08342  1.00524       0.0      1.0      0.0      1.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      1.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0
   3 │ 30.4068   -10.8073   -6.12339  0.883938      1.0      0.0      0.0      0.0      1.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      1.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0
   4 │  8.18372  -15.507    -1.43203  1.08255       0.0      1.0      0.0      0.0      0.0      1.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      1.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0
   5 │ 16.6176    -6.68636  -1.66597  0.978243      0.0      1.0      0.0      0.0      0.0      0.0      1.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      1.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0

With the CSVReader extension, CSV files can now be loaded and processed directly inside the pipeline and used together with the other existing filters and transformers.