Extending AutoMLPipeline
A meta-ML package sounds ideal, but it is not practical in terms of maintainability and flexibility. The metapackage becomes a central point of failure and a bottleneck. It does not follow the KISS philosophy of Unix, which encourages decentralized implementation. As long as the input and output behavior of transformers and learners follows a standard format, they should work without depending on or communicating with one another. With consistent input/output interfaces, the passing of information among the elements of the pipeline brings no surprises to the receivers and transmitters of information down the line.
Because AMLP's symbolic pipeline is based on the idea of Linux pipelines and filters, there is a deliberate effort to follow the KISS philosophy as much as possible by using just two interfaces to be overloaded (fit! and transform!). Input features should be a DataFrame type while the target output should be a Vector type. A transformer's fit! function expects only one input argument and ignores the target argument. On the other hand, the fit! function of any learner requires both input and target arguments to carry out the supervised learning phase. For the transform! function, both learners and transformers expect one input argument, to which they apply their learned parameters to transform the input into a prediction, decomposition, normalization, scaling, etc.
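The contract described above can be illustrated with a minimal, dependency-free sketch. It is not AMLP's actual implementation: the abstract types are redeclared locally so the snippet runs on its own, a plain Matrix stands in for the DataFrame input, and Centerer and MajorityLearner are hypothetical names introduced only for this example.

```julia
# Local stand-ins for the AMLP abstract types (assumption: declared here
# only so the sketch runs without loading AutoMLPipeline).
abstract type Machine end
abstract type Computer <: Machine end
abstract type Learner <: Computer end
abstract type Transformer <: Computer end

# Hypothetical transformer: centers each column on the mean seen in fit!.
mutable struct Centerer <: Transformer
    means::Vector{Float64}
    Centerer() = new(Float64[])
end

# A transformer's fit! accepts a target for interface uniformity but ignores it.
function fit!(c::Centerer, X::Matrix{Float64}, y::Vector=[])
    c.means = vec(sum(X, dims=1)) ./ size(X, 1)
    return nothing
end

# transform! takes a single input and applies the learned parameters.
transform!(c::Centerer, X::Matrix{Float64}) = X .- c.means'

# Hypothetical learner: predicts the most frequent label seen in fit!.
mutable struct MajorityLearner <: Learner
    label::Any
    MajorityLearner() = new(nothing)
end

# A learner's fit! needs both the input and the target.
function fit!(m::MajorityLearner, X::Matrix{Float64}, y::Vector)
    counts = Dict{Any,Int}()
    for label in y
        counts[label] = get(counts, label, 0) + 1
    end
    best, bestcount = nothing, 0
    for (label, n) in counts
        if n > bestcount
            best, bestcount = label, n
        end
    end
    m.label = best
    return nothing
end

# A learner's transform! also takes a single input and returns predictions.
transform!(m::MajorityLearner, X::Matrix{Float64}) = fill(m.label, size(X, 1))

# Usage: fit and apply the transformer, then the learner.
X = [1.0 10.0; 3.0 30.0; 5.0 50.0]
y = ["a", "b", "a"]
c = Centerer(); fit!(c, X)
Xc = transform!(c, X)
m = MajorityLearner(); fit!(m, Xc, y)
pred = transform!(m, Xc)
```

Both types share the same two entry points; only the learner's fit! actually consumes the target, which is the distinction the paragraph above draws.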
AMLP Abstract Types
The AMLP abstract types are composed of the following:
abstract type Machine end
abstract type Workflow <: Machine end
abstract type Computer <: Machine end
abstract type Learner <: Computer end
abstract type Transformer <: Computer end
At the top of the hierarchy is the Machine abstraction that supports two major interfaces: fit! and transform!. The abstract Machine has two major types: Computer and Workflow. The Computer types perform computations such as filters, transformers, and learners, while the Workflow controls the flow of information. A Workflow can be a sequential flow of information or a combination of information from two or more workflows. A Workflow that provides sequential flow is called Pipeline (or linear pipeline) while the one that combines information from different workflows is called ComboPipeline.
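The difference between the two workflow kinds can be sketched as follows. This is only one plausible reading of the description above, not AMLP's source code: ToyPipeline, ToyComboPipeline, and Scale are hypothetical names, the abstract types are redeclared locally, and a plain Matrix stands in for the DataFrame so the snippet has no dependencies.

```julia
# Local stand-ins for the AMLP abstract types (assumption for this sketch).
abstract type Machine end
abstract type Workflow <: Machine end
abstract type Computer <: Machine end
abstract type Transformer <: Computer end

# Hypothetical transformer that multiplies its input by a fixed factor.
struct Scale <: Transformer
    factor::Float64
end
fit!(s::Scale, X::Matrix{Float64}, y::Vector=[]) = nothing
transform!(s::Scale, X::Matrix{Float64}) = X .* s.factor

# Sequential workflow: each element receives the output of the previous one.
struct ToyPipeline <: Workflow
    machines::Vector{Machine}
end

# Combining workflow: every branch sees the same input and the
# branch outputs are joined column-wise.
struct ToyComboPipeline <: Workflow
    machines::Vector{Machine}
end

function fit!(p::ToyPipeline, X::Matrix{Float64}, y::Vector=[])
    current = X
    for m in p.machines
        fit!(m, current, y)                # fit on what this stage receives
        current = transform!(m, current)   # pass the result down the line
    end
    return nothing
end

function transform!(p::ToyPipeline, X::Matrix{Float64})
    current = X
    for m in p.machines
        current = transform!(m, current)
    end
    return current
end

function fit!(c::ToyComboPipeline, X::Matrix{Float64}, y::Vector=[])
    for m in c.machines
        fit!(m, X, y)                      # every branch is fit on the same input
    end
    return nothing
end

transform!(c::ToyComboPipeline, X::Matrix{Float64}) =
    reduce(hcat, [transform!(m, X) for m in c.machines])

# Usage: a workflow is itself a Machine, so workflows can be nested.
X = [1.0 2.0]
combo = ToyComboPipeline([Scale(2.0), Scale(3.0)])
nested = ToyPipeline([combo, Scale(0.5)])
fit!(nested, X)
out = transform!(nested, X)
```

Because both workflow types are subtypes of Machine and answer the same fit! and transform! calls, a combining workflow can sit inside a sequential one without either needing to know what the other contains.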
The Computer type has two subtypes: Learner and Transformer. Their main difference is in the behavior of their fit! function. The Learner type learns its parameters by finding a mapping function between its input and output arguments, while the Transformer does not require this mapping function to perform its operation. The Transformer learns all its parameters by just processing its input features. Both Transformer and Learner have similar behavior in the transform! function. Both apply their learned parameters to transform their input into output.
Extending AMLP by Adding a CSVReader Transformer
Let's extend AMLP by adding CSV reading support embedded in the pipeline. Instead of passing the data as an argument to the pipeline, we create a CSV transformer that reads the data from a CSV file and passes it to the succeeding elements in the pipeline.
module FileReaders

using CSV
using DataFrames: DataFrame, nrow, ncol

using AutoMLPipeline
using AutoMLPipeline.AbsTypes # abstract types (Learners and Transformers)

import AutoMLPipeline.fit!
import AutoMLPipeline.transform!

export fit!, transform!
export CSVReader

# define a user-defined structure for type dispatch
mutable struct CSVReader <: Transformer
    name::String   # path of the CSV file
    model::Dict    # arguments passed to the constructor

    function CSVReader(args = Dict(:fname=>""))
        fname = args[:fname]
        fname != "" || throw(ArgumentError("missing filename."))
        isfile(fname) || throw(ArgumentError("file does not exist."))
        new(fname, args)
    end
end

# convenience constructor that takes the filename directly
CSVReader(fname::String) = CSVReader(Dict(:fname=>fname))

# Define fit!, which only does error checking. You can also make
# it do nothing and let the transform! function do the
# checking and loading. The fit! function is only defined
# here to make sure there is a fit! dispatch for the CSVReader
# type, which is needed in the pipeline call iteration.
function fit!(csvreader::CSVReader, df::DataFrame=DataFrame(), target::Vector=Vector())
    fname = csvreader.name
    isfile(fname) || throw(ArgumentError("file does not exist."))
end

# define transform!, which opens the file and returns a dataframe;
# the df argument is ignored because the data comes from the file
function transform!(csvreader::CSVReader, df::DataFrame=DataFrame())
    fname = csvreader.name
    df = CSV.File(fname) |> DataFrame
    # reject files that contain no rows (a header alone is not enough)
    nrow(df) > 0 || throw(ArgumentError("empty dataframe."))
    return df
end

end
Let's now load the FileReaders module together with the other AutoMLPipeline modules and create a pipeline that includes the csv reader we just created.
using DataFrames: DataFrame, nrow,ncol
using AutoMLPipeline
using .FileReaders # load from the Main module
#### Column selector
catf = CatFeatureSelector()
numf = NumFeatureSelector()
pca = SKPreprocessor("PCA")
ohe = OneHotEncoder()
fname = joinpath(dirname(pathof(AutoMLPipeline)),"../data/profb.csv")
csvrdr = CSVReader(fname)
p1 = @pipeline csvrdr |> (catf + numf)
df1 = fit_transform!(p1) # no argument because the input comes from csvrdr
julia> first(df1,5)
5×7 DataFrame
 Row │ Home.Away  Favorite_Name  Underdog_name  Favorite_Points  Underdog_Points  Pointspread  Year
     │ String     String         String         Int64            Int64            Float64      Int64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────
   1 │ away       BUF            MIA                         27               24          4.0     89
   2 │ at_home    CHI            CIN                         17               14          3.0     89
   3 │ away       CLE            PIT                         51                0          2.5     89
   4 │ at_home    NO             DAL                         28                0          5.5     89
   5 │ at_home    MIN            HOU                         38                7          5.5     89
p2 = @pipeline csvrdr |> (numf |> pca) + (catf |> ohe)
df2 = fit_transform!(p2) # no argument because the input comes from csvrdr
julia> first(df2,5)
5×62 DataFrame
Row │ x1 x2 x3 x4 x1_1 x2_1 x3_1 x4_1 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16 x17 x18 x19 x20 x21 x22 x23 x24 x25 x26 x27 x28 x29 x30 x31 x32 x33 x34 x35 x36 x37 x38 x39 x40 x41 x42 x43 x44 x45 x46 x47 x48 x49 x50 x51 x52 x53 x54 x55 x56 x57 x58
│ Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 2.47477 7.87074 -1.10495 0.902431 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 │ -5.47113 -3.82946 -2.08342 1.00524 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 │ 30.4068 -10.8073 -6.12339 0.883938 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 │ 8.18372 -15.507 -1.43203 1.08255 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 │ 16.6176 -6.68636 -1.66597 0.978243 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
With the CSVReader extension, CSV files can now be loaded and processed directly inside the pipeline and combined with the other existing filters and transformers.