Imputation

There are two ways to impute date-value time series (TS) data: DateValNNer, which uses nearest neighbors, and DateValizer, which uses a dictionary of medians mapped to a date-time interval grouping.

DateValNNer

DateValNNer expects the following arguments, shown with their default values, during instantiation:

:dateinterval => Dates.Hour(1)
- grouping interval
:nnsize => 1
- size of neighborhood
:missdirection => :symmetric
- :forward vs :reverse vs :symmetric
:strict => true
- whether or not to repeatedly iterate until no more missing data

The :missdirection argument sets the imputation direction and the extent of the neighborhood. :symmetric draws on both sides of the missing data. :forward starts imputing from the top, while :reverse starts from the bottom. Please refer to Aggregators and Imputers for other examples.
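To make the three directions concrete, here is a minimal sketch of which rows could be consulted when imputing row i. The helper `neighborhood` is hypothetical and not part of the package; it assumes that :forward looks at rows above the missing value, :reverse at rows below, and :symmetric at both.

```julia
# Hypothetical helper, not a TSML function: indices of the rows consulted
# when imputing row `i` of an `n`-row series with neighborhood size `nnsize`.
# Assumption: :forward uses rows above, :reverse rows below, :symmetric both.
function neighborhood(i::Int, n::Int, nnsize::Int, direction::Symbol)
    lo, hi = if direction == :forward
        (i - nnsize, i - 1)
    elseif direction == :reverse
        (i + 1, i + nnsize)
    elseif direction == :symmetric
        (i - nnsize, i + nnsize)
    else
        throw(ArgumentError("unknown direction: $direction"))
    end
    # Clamp to the series boundaries and skip the row being imputed.
    return [j for j in max(lo, 1):min(hi, n) if j != i]
end

neighborhood(5, 10, 2, :forward)    # [3, 4]
neighborhood(5, 10, 2, :reverse)    # [6, 7]
neighborhood(5, 10, 2, :symmetric)  # [3, 4, 6, 7]
neighborhood(1, 10, 2, :forward)    # empty: no rows above the first one
```

Under these assumptions, the first row has an empty :forward neighborhood and the last row has an empty :reverse neighborhood.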

Let's use the same dataset we have used in the tutorial and print the first few rows.

julia> first(X,10)
10×2 DataFrame
 Row │ Date                 Value
     │ DateTime             Float64?
─────┼─────────────────────────────────────
   1 │ 2014-01-01T00:00:00        0.9063
   2 │ 2014-01-01T00:15:00  missing
   3 │ 2014-01-01T00:30:00  missing
   4 │ 2014-01-01T00:45:00  missing
   5 │ 2014-01-01T01:00:00  missing
   6 │ 2014-01-01T01:15:00        0.334152
   7 │ 2014-01-01T01:30:00  missing
   8 │ 2014-01-01T01:45:00  missing
   9 │ 2014-01-01T02:00:00  missing
  10 │ 2014-01-01T02:15:00  missing

Let's try the following setup, grouping every two hours with forward imputation and 10 neighbors:

dnnr = DateValNNer(Dict(:dateinterval=>Dates.Hour(2),
             :nnsize=>10,:missdirection => :forward,
             :strict=>false))
forwardres=fit_transform!(dnnr,X)

julia> first(forwardres,5)
5×2 DataFrame
 Row │ Date                 Value
     │ DateTime             Float64?
─────┼───────────────────────────────
   1 │ 2014-01-01T00:00:00  0.9063
   2 │ 2014-01-01T02:00:00  0.478823
   3 │ 2014-01-01T04:00:00  0.478823
   4 │ 2014-01-01T06:00:00  0.131798
   5 │ 2014-01-01T08:00:00  0.360583

The same parameters as above, but using the reverse instead of the forward direction:

dnnr = DateValNNer(Dict(:dateinterval=>Dates.Hour(2),
             :nnsize=>10,:missdirection => :reverse,
             :strict=>false))
reverseres=fit_transform!(dnnr,X)

julia> first(reverseres,5)
5×2 DataFrame
 Row │ Date                 Value
     │ DateTime             Float64?
─────┼───────────────────────────────
   1 │ 2014-01-01T00:00:00  0.9063
   2 │ 2014-01-01T02:00:00  0.486039
   3 │ 2014-01-01T04:00:00  0.478823
   4 │ 2014-01-01T06:00:00  0.131798
   5 │ 2014-01-01T08:00:00  0.360583

Using symmetric imputation:

dnnr = DateValNNer(Dict(:dateinterval=>Dates.Hour(2),
             :nnsize=>10,:missdirection => :symmetric,
             :strict=>false))
symmetricres=fit_transform!(dnnr,X)

julia> first(symmetricres,5)
5×2 DataFrame
 Row │ Date                 Value
     │ DateTime             Float64?
─────┼───────────────────────────────
   1 │ 2014-01-01T00:00:00  0.9063
   2 │ 2014-01-01T02:00:00  0.478823
   3 │ 2014-01-01T04:00:00  0.478823
   4 │ 2014-01-01T06:00:00  0.131798
   5 │ 2014-01-01T08:00:00  0.360583

Symmetric imputation guarantees that all missing data are imputed as long as the input has non-missing elements. Forward and reverse imputation cannot guarantee this because of boundary issues: if the top or bottom of the input is missing, asymmetric imputation cannot replace the missing endpoints. For a successful imputation, symmetric imputation is therefore recommended.

In the example above, the number of values still missing after forward, reverse, and symmetric imputation is:

julia> sum(ismissing.(forwardres.Value))
0
julia> sum(ismissing.(reverseres.Value))
0
julia> sum(ismissing.(symmetricres.Value))
0
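The dataset above happens to leave nothing missing in any direction, so the boundary issue does not show. The sketch below reproduces it on a toy vector. The function `fill_missing` is a hypothetical single-pass illustration, not the package's implementation; it assumes that each missing value is replaced by the median of its available neighbors and that already-imputed values can feed later ones.

```julia
using Statistics

# Hypothetical single-pass imputer, not the TSML implementation.
# Assumption: a missing value becomes the median of the non-missing values
# in its neighborhood; values imputed earlier in the pass are reused.
function fill_missing(v::AbstractVector, nnsize::Int, direction::Symbol)
    out = collect(Union{Missing,Float64}, v)   # work on a copy
    n = length(out)
    # :reverse walks from the bottom, the other directions from the top.
    order = direction == :reverse ? (n:-1:1) : (1:n)
    for i in order
        ismissing(out[i]) || continue
        lo = direction == :reverse ? i + 1 : i - nnsize
        hi = direction == :forward ? i - 1 : i + nnsize
        vals = Float64[out[j] for j in max(lo, 1):min(hi, n) if !ismissing(out[j])]
        isempty(vals) || (out[i] = median(vals))   # leave missing if no neighbor
    end
    return out
end

v = [missing, 1.0, missing, missing, 4.0, missing]

count(ismissing, fill_missing(v, 1, :forward))    # 1 (first row has nothing above it)
count(ismissing, fill_missing(v, 1, :reverse))    # 1 (last row has nothing below it)
count(ismissing, fill_missing(v, 1, :symmetric))  # 0
```

Because the toy vector is missing at both ends, each asymmetric direction leaves one endpoint unfilled, while the symmetric pass fills everything.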

DateValizer

DateValizer operates on the principle that patterns recur regularly within a specific time period, so replacing a missing value is a matter of identifying the time period it belongs to and using the pooled median of that period. The default time period for DateValizer is hourly. A more advanced implementation could add daily, hourly, and weekly periods, but it would require a much larger hash table. Additional grouping criteria also produce smaller subgroups, some of which may be 100% missing, which makes imputation fail. DateValizer depends only on the :dateinterval argument, which defaults to Dates.Hour(1). Please refer to Aggregators and Imputers for more examples.
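The pooled-median idea can be sketched in a few lines of plain Julia. The function `hourly_median_impute` is hypothetical and only illustrates the principle; it assumes the grouping key is the hour of day and is not the package's actual code.

```julia
using Dates, Statistics

# Hypothetical sketch, not the TSML implementation.
# Assumption: observations are pooled by hour of day, and a missing value
# is replaced by the median of the pool that its timestamp falls into.
function hourly_median_impute(dates::Vector{DateTime}, vals::AbstractVector)
    pooled = Dict{Int,Vector{Float64}}()
    for (d, x) in zip(dates, vals)
        ismissing(x) && continue
        push!(get!(pooled, hour(d), Float64[]), x)
    end
    medians = Dict(h => median(xs) for (h, xs) in pooled)
    # A pool with no observed values has no median, so the value stays missing.
    return [ismissing(x) ? get(medians, hour(d), missing) : x
            for (d, x) in zip(dates, vals)]
end

dates = [DateTime(2014, 1, 1, 0), DateTime(2014, 1, 1, 1),
         DateTime(2014, 1, 2, 0), DateTime(2014, 1, 2, 1),
         DateTime(2014, 1, 3, 0), DateTime(2014, 1, 3, 2)]
vals = [1.0, missing, 3.0, 5.0, missing, missing]

hourly_median_impute(dates, vals)
# [1.0, 5.0, 3.0, 5.0, 2.0, missing]
```

The last value stays missing because hour 2 has no observed values at all, which is the subgroup failure described above.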

Let's try hourly, daily, and monthly median as the basis of imputation:

julia> hourlyzer = DateValizer(Dict(:dateinterval => Dates.Hour(1)));
julia> monthlyzer = DateValizer(Dict(:dateinterval => Dates.Month(1)));
julia> dailyzer = DateValizer(Dict(:dateinterval => Dates.Day(1)));
julia> hourlyres = fit_transform!(hourlyzer,X)
17521×2 DataFrame
   Row │ Date                 Value
       │ DateTime             Float64?
───────┼───────────────────────────────
     1 │ 2014-01-01T00:00:00  0.9063
     2 │ 2014-01-01T01:00:00  0.334152
     3 │ 2014-01-01T02:00:00  0.485687
     4 │ 2014-01-01T03:00:00  0.136551
     5 │ 2014-01-01T04:00:00  0.478823
     6 │ 2014-01-01T05:00:00  0.51387
     7 │ 2014-01-01T06:00:00  0.131798
     8 │ 2014-01-01T07:00:00  0.450484
   ⋮   │          ⋮              ⋮
 17515 │ 2015-12-31T18:00:00  0.920049
 17516 │ 2015-12-31T19:00:00  0.380189
 17517 │ 2015-12-31T20:00:00  0.970942
 17518 │ 2015-12-31T21:00:00  0.3312
 17519 │ 2015-12-31T22:00:00  0.508722
 17520 │ 2015-12-31T23:00:00  0.632262
 17521 │ 2016-01-01T00:00:00  0.951966
                     17506 rows omitted
julia> dailyres = fit_transform!(dailyzer,X)
731×2 DataFrame
 Row │ Date                 Value
     │ DateTime             Float64?
─────┼───────────────────────────────
   1 │ 2014-01-01T00:00:00  0.390013
   2 │ 2014-01-02T00:00:00  0.509696
   3 │ 2014-01-03T00:00:00  0.571543
   4 │ 2014-01-04T00:00:00  0.578252
   5 │ 2014-01-05T00:00:00  0.463037
   6 │ 2014-01-06T00:00:00  0.646811
   7 │ 2014-01-07T00:00:00  0.468079
   8 │ 2014-01-08T00:00:00  0.538296
  ⋮  │          ⋮              ⋮
 725 │ 2015-12-26T00:00:00  0.478571
 726 │ 2015-12-27T00:00:00  0.441035
 727 │ 2015-12-28T00:00:00  0.651315
 728 │ 2015-12-29T00:00:00  0.438614
 729 │ 2015-12-30T00:00:00  0.442234
 730 │ 2015-12-31T00:00:00  0.433304
 731 │ 2016-01-01T00:00:00  0.632262
                     716 rows omitted
julia> monthlyres = fit_transform!(monthlyzer,X)
25×2 DataFrame
 Row │ Date                 Value
     │ DateTime             Float64?
─────┼───────────────────────────────
   1 │ 2014-01-01T00:00:00  0.51885
   2 │ 2014-02-01T00:00:00  0.531905
   3 │ 2014-03-01T00:00:00  0.495285
   4 │ 2014-04-01T00:00:00  0.485427
   5 │ 2014-05-01T00:00:00  0.459212
   6 │ 2014-06-01T00:00:00  0.504739
   7 │ 2014-07-01T00:00:00  0.485148
   8 │ 2014-08-01T00:00:00  0.511601
  ⋮  │          ⋮              ⋮
  19 │ 2015-07-01T00:00:00  0.512868
  20 │ 2015-08-01T00:00:00  0.498659
  21 │ 2015-09-01T00:00:00  0.480135
  22 │ 2015-10-01T00:00:00  0.486423
  23 │ 2015-11-01T00:00:00  0.482642
  24 │ 2015-12-01T00:00:00  0.498927
  25 │ 2016-01-01T00:00:00  0.485564
                      10 rows omitted