Aggregators and Imputers

The package assumes a two-column table composed of Dates and Values. The first part of the workflow aggregates values based on the specified date-time interval which minimizes occurence of missing values and noise. The aggregated data is then left-joined to the complete sequence of DateTime in a specified date-time interval. Remaining missing values are replaced by k nearest neighbors where k is the symmetric distance from the location of missing value. This replacement algo is called several times until there are no more missing values.

Let us create a Date, Value table with some missing values and output the first 15 rows. We will then apply some TSML functions to normalize/clean the data. Below is the code of the generateDataWithMissing() function:

using TSML
function generateDataWithMissing()
   Random.seed!(123)
   gdate = DateTime(2014,1,1):Dates.Minute(15):DateTime(2016,1,1)
   gval = Array{Union{Missing,Float64}}(rand(length(gdate)))
   gmissing = 50000
   gndxmissing = Random.shuffle(1:length(gdate))[1:gmissing]
   df = DataFrame(Date=gdate,Value=gval)
   df[!,:Value][gndxmissing] .= missing
   return df
end

julia> X = generateDataWithMissing();
julia> first(X,15)15×2 DataFrame
 Row │ Date                 Value
     │ DateTime             Float64?
─────┼─────────────────────────────────────
   1 │ 2014-01-01T00:00:00        0.9063
   2 │ 2014-01-01T00:15:00  missing
   3 │ 2014-01-01T00:30:00  missing
   4 │ 2014-01-01T00:45:00  missing
   5 │ 2014-01-01T01:00:00  missing
   6 │ 2014-01-01T01:15:00        0.334152
   7 │ 2014-01-01T01:30:00  missing
   8 │ 2014-01-01T01:45:00  missing
   9 │ 2014-01-01T02:00:00  missing
  10 │ 2014-01-01T02:15:00  missing
  11 │ 2014-01-01T02:30:00  missing
  12 │ 2014-01-01T02:45:00        0.136551
  13 │ 2014-01-01T03:00:00  missing
  14 │ 2014-01-01T03:15:00  missing
  15 │ 2014-01-01T03:30:00  missing

DateValgator

You'll notice several blocks of missing in the table above with reading frequency of every 15 minutes. To minimize noise and lessen the occurrence of missing values, let's aggregate our dataset by taking the hourly median using the DateValgator transformer.

using TSML

dtvlgator = DateValgator(Dict(:dateinterval=>Dates.Hour(1)))
results = fit_transform!(dtvlgator,X)

julia> first(results,10)10×2 DataFrame
 Row │ Date                 Value
     │ DateTime             Float64?
─────┼─────────────────────────────────────
   1 │ 2014-01-01T00:00:00        0.9063
   2 │ 2014-01-01T01:00:00        0.334152
   3 │ 2014-01-01T02:00:00  missing
   4 │ 2014-01-01T03:00:00        0.136551
   5 │ 2014-01-01T04:00:00        0.478823
   6 │ 2014-01-01T05:00:00  missing
   7 │ 2014-01-01T06:00:00        0.131798
   8 │ 2014-01-01T07:00:00        0.450484
   9 │ 2014-01-01T08:00:00        0.360583
  10 │ 2014-01-01T09:00:00        0.571645

The occurrence of missing values is now reduced because of the hourly aggregation. While the default is hourly aggregation, you can easily change it by using a different interval in the argument during instance creation. Below indicates every 30 minutes interval.

julia> dtvlgator = DateValgator(Dict(:dateinterval=>Dates.Minute(30)))DateValgator("dtvalgtr_DZ6", Dict{Symbol, Any}(:dateinterval => Minute(30), :aggregator => :median, :name => "dtvalgtr_DZ6"))

DateValgator is one of the several TSML transformers to preprocess and clean the time series data. In order to create additional transformers to extend TSML, each transformer must overload the two Transformer functions:fit! and transform!. DateValgator fit! performs initial setups of necessary parameters and validation of arguments while its transform! function contains the algorithm for aggregation.

For machine learning prediction and classification transformer, fit! function is equivalent to ML training or parameter optimization, while the transform! function is for doing the actual prediction. The later part of the tutorial will provide an example how to add a Transformer to extend the functionality of TSML.

DateValNNer

Let's perform further processing to replace the remaining missing values with their nearest neighbors. We will use DateValNNer which is a TSML transformer to process the output of DateValgator. DateValNNer can also process non-aggregated data by first running similar workflow of DateValgator before performing its imputation routine.

using TSML

datevalnner = DateValNNer(Dict(:dateinterval=>Dates.Hour(1)))
results = fit_transform!(datevalnner,X)

julia> first(results,10)10×2 DataFrame
 Row │ Date                 Value
     │ DateTime             Float64?
─────┼───────────────────────────────
   1 │ 2014-01-01T00:00:00  0.9063
   2 │ 2014-01-01T01:00:00  0.334152
   3 │ 2014-01-01T02:00:00  0.235352
   4 │ 2014-01-01T03:00:00  0.136551
   5 │ 2014-01-01T04:00:00  0.478823
   6 │ 2014-01-01T05:00:00  0.305311
   7 │ 2014-01-01T06:00:00  0.131798
   8 │ 2014-01-01T07:00:00  0.450484
   9 │ 2014-01-01T08:00:00  0.360583
  10 │ 2014-01-01T09:00:00  0.571645

After running the DateValNNer, it's guaranteed that there will be no more missing data unless the input are all missing data.

DateValizer

One more imputer to replace missing data is DateValizer. It computes the hourly median over 24 hours and use the hour => median hashmap learned to replace missing data using hour as the key. In this implementation, fit! function is doing the training of parameters by computing the medians and save it for the transform! function to use for imputation. It is possible that the hashmap can contain missing values in cases where the pooled hourly median in a particular hour have all missing data. Below is a sample workflow to replace missing data in X with the hourly medians.

using TSML

datevalizer = DateValizer(Dict(:dateinterval=>Dates.Hour(1)))
results = fit_transform!(datevalizer,X)

julia> first(results,10)10×2 DataFrame
 Row │ Date                 Value
     │ DateTime             Float64?
─────┼───────────────────────────────
   1 │ 2014-01-01T00:00:00  0.9063
   2 │ 2014-01-01T01:00:00  0.334152
   3 │ 2014-01-01T02:00:00  0.485687
   4 │ 2014-01-01T03:00:00  0.136551
   5 │ 2014-01-01T04:00:00  0.478823
   6 │ 2014-01-01T05:00:00  0.51387
   7 │ 2014-01-01T06:00:00  0.131798
   8 │ 2014-01-01T07:00:00  0.450484
   9 │ 2014-01-01T08:00:00  0.360583
  10 │ 2014-01-01T09:00:00  0.571645