Imputation

There are two ways to impute date-value time series (TS) data. DateValNNer uses nearest neighbors, while DateValizer uses a dictionary of medians keyed by a date-time interval grouping.

DateValNNer

DateValNNer expects the following arguments, shown with their default values, during instantiation:

  • :dateinterval => Dates.Hour(1)
    • grouping interval
  • :nnsize => 1
    • size of neighborhood
  • :missdirection => :symmetric
    • :forward vs :reverse vs :symmetric
  • :strict => true
    • whether or not to repeatedly iterate until no more missing data

The :missdirection argument sets the imputation direction and the extent of the neighborhood. :symmetric draws information from both sides of the missing data. :forward starts imputing from the top, while :reverse starts from the bottom. Please refer to Aggregators and Imputers for other examples.
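To make the roles of the neighborhood size and the strict flag concrete, here is a minimal sketch in plain Julia. It is not TSML's implementation: `symmetric_pass` and `impute_symmetric` are hypothetical helpers, and the sketch assumes that a missing entry takes the median of the non-missing values within `k` rows on either side, with passes repeated until nothing is missing.

```julia
using Statistics

# Hypothetical helper, not TSML's code: one symmetric pass in which each
# missing entry takes the median of the non-missing values among the k rows
# above and the k rows below it.
function symmetric_pass(v::AbstractVector, k::Int)
    out = Vector{Union{Missing,Float64}}(v)
    n = length(v)
    for i in 1:n
        ismissing(v[i]) || continue
        vals = collect(skipmissing(v[max(1, i - k):min(n, i + k)]))
        isempty(vals) || (out[i] = median(vals))
    end
    return out
end

# Repeat passes until nothing is missing (the "strict" idea) or until a pass
# makes no progress, which happens when the input is entirely missing.
function impute_symmetric(v::AbstractVector, k::Int; strict::Bool=true)
    out = symmetric_pass(v, k)
    while strict && any(ismissing, out)
        next = symmetric_pass(out, k)
        isequal(next, out) && break
        out = next
    end
    return out
end

v = [0.9, missing, missing, missing, missing, 0.3]
once = impute_symmetric(v, 1; strict=false)  # a single pass leaves the middle gap
full = impute_symmetric(v, 1; strict=true)   # repeated passes close the gap
```

With a neighborhood of 1, a single pass can only fill the entries adjacent to observed values, so a larger neighborhood or repeated passes are needed for long gaps.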

Let's use the same dataset we have used in the tutorial and print the first few rows.

julia> first(X,10)
10×2 DataFrame
 Row │ Date                 Value
     │ DateTime             Float64?
─────┼─────────────────────────────────────
   1 │ 2014-01-01T00:00:00        0.9063
   2 │ 2014-01-01T00:15:00  missing
   3 │ 2014-01-01T00:30:00  missing
   4 │ 2014-01-01T00:45:00  missing
   5 │ 2014-01-01T01:00:00  missing
   6 │ 2014-01-01T01:15:00        0.334152
   7 │ 2014-01-01T01:30:00  missing
   8 │ 2014-01-01T01:45:00  missing
   9 │ 2014-01-01T02:00:00  missing
  10 │ 2014-01-01T02:15:00  missing

Let's try the following setup, grouping into 2-hour intervals with forward imputation and 10 neighbors:

dnnr = DateValNNer(Dict(:dateinterval=>Dates.Hour(2),
             :nnsize=>10,:missdirection => :forward,
             :strict=>false))
forwardres=fit_transform!(dnnr,X)
julia> first(forwardres,5)
5×2 DataFrame
 Row │ Date                 Value
     │ DateTime             Float64?
─────┼───────────────────────────────
   1 │ 2014-01-01T00:00:00  0.9063
   2 │ 2014-01-01T02:00:00  0.478823
   3 │ 2014-01-01T04:00:00  0.478823
   4 │ 2014-01-01T06:00:00  0.131798
   5 │ 2014-01-01T08:00:00  0.360583

Same parameters as above but uses reverse instead of forward direction:

dnnr = DateValNNer(Dict(:dateinterval=>Dates.Hour(2),
             :nnsize=>10,:missdirection => :reverse,
             :strict=>false))
reverseres=fit_transform!(dnnr,X)
julia> first(reverseres,5)
5×2 DataFrame
 Row │ Date                 Value
     │ DateTime             Float64?
─────┼───────────────────────────────
   1 │ 2014-01-01T00:00:00  0.9063
   2 │ 2014-01-01T02:00:00  0.486039
   3 │ 2014-01-01T04:00:00  0.478823
   4 │ 2014-01-01T06:00:00  0.131798
   5 │ 2014-01-01T08:00:00  0.360583

Using symmetric imputation:

dnnr = DateValNNer(Dict(:dateinterval=>Dates.Hour(2),
             :nnsize=>10,:missdirection => :symmetric,
             :strict=>false))
symmetricres=fit_transform!(dnnr,X)
julia> first(symmetricres,5)
5×2 DataFrame
 Row │ Date                 Value
     │ DateTime             Float64?
─────┼───────────────────────────────
   1 │ 2014-01-01T00:00:00  0.9063
   2 │ 2014-01-01T02:00:00  0.478823
   3 │ 2014-01-01T04:00:00  0.478823
   4 │ 2014-01-01T06:00:00  0.131798
   5 │ 2014-01-01T08:00:00  0.360583

Symmetric imputation guarantees that all missing data are imputed as long as the input has non-missing elements. Forward and reverse imputation cannot guarantee this because of boundary issues: if the top or bottom of the input is missing, asymmetric imputation cannot replace the missing endpoints. For reliable imputation, use symmetric imputation.

In the example above, the number of remaining missing data not imputed for forward, reverse, and symmetric is:

julia> sum(ismissing.(forwardres.Value))
0
julia> sum(ismissing.(reverseres.Value))
0
julia> sum(ismissing.(symmetricres.Value))
0
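All three counts are zero for this dataset, but the boundary issue described earlier is easy to reproduce on a toy vector. The sketch below is plain Julia, not TSML code: `fill_from_above` and `fill_from_below` are hypothetical one-sided fills, and which side TSML's :forward and :reverse actually draw from is not assumed here.

```julia
# Hypothetical one-sided fill, not TSML code: each missing entry copies the
# value directly above it, so filled values propagate downward through a gap.
function fill_from_above(v::AbstractVector)
    out = copy(v)
    for i in 2:length(out)
        if ismissing(out[i]) && !ismissing(out[i-1])
            out[i] = out[i-1]
        end
    end
    return out
end

# The mirror image: fill each missing entry from the value below it.
fill_from_below(v::AbstractVector) = reverse(fill_from_above(reverse(v)))

v = [missing, 0.9, missing, 0.3, missing]

above = fill_from_above(v)       # the leading missing has nothing above it
below = fill_from_below(v)       # the trailing missing has nothing below it
both  = fill_from_below(above)   # using both sides leaves no gaps

remaining = (count(ismissing, above), count(ismissing, below), count(ismissing, both))
```

A one-sided fill always leaves the endpoint on its blind side untouched, which is why only a two-sided neighborhood can promise complete imputation.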

DateValizer

DateValizer operates on the principle that patterns recur regularly within a specific time period, so replacing a missing value is a matter of identifying which time period it belongs to and using the pooled median of that period. The default time period for DateValizer is hourly. A more advanced implementation could combine daily, hourly, and weekly periods, but it would require a much larger hash table. Additional grouping criteria also produce smaller subgroups, some of which may be 100% missing, resulting in imputation failure. DateValizer depends only on the :dateinterval argument, which defaults to Dates.Hour(1). Please refer to Aggregators and Imputers for more examples.
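The pooled-median idea can be sketched in a few lines of plain Julia. This is an illustration, not TSML's implementation: `hourly_medians` and `impute_by_hour` are hypothetical helpers, and the sketch assumes the grouping key for the hourly case is the hour of day.

```julia
using Dates, Statistics

# Hypothetical sketch, not TSML's code: build a Dict mapping hour of day to
# the median of the non-missing values observed in that hour.
function hourly_medians(dates::Vector{DateTime}, obs::AbstractVector)
    groups = Dict{Int,Vector{Float64}}()
    for (d, x) in zip(dates, obs)
        ismissing(x) && continue
        push!(get!(groups, hour(d), Float64[]), x)
    end
    return Dict(h => median(xs) for (h, xs) in groups)
end

# Replace each missing value with the median for its hour. The value stays
# missing when that hour has no observed values at all.
function impute_by_hour(dates, obs, medians)
    return [ismissing(x) ? get(medians, hour(d), missing) : x
            for (d, x) in zip(dates, obs)]
end

dates = [DateTime(2014, 1, 1, 0), DateTime(2014, 1, 1, 1),
         DateTime(2014, 1, 2, 0), DateTime(2014, 1, 2, 1),
         DateTime(2014, 1, 3, 0), DateTime(2014, 1, 3, 1)]
obs = [0.25, missing, 0.75, missing, missing, missing]

medians = hourly_medians(dates, obs)
filled = impute_by_hour(dates, obs, medians)
```

In this toy input every reading at hour 1 is missing, so that group has no median and its entries remain missing. This is the subgroup failure mode that finer grouping criteria make more likely.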

Let's try hourly, daily, and monthly medians as the basis of imputation:

julia> hourlyzer = DateValizer(Dict(:dateinterval => Dates.Hour(1)));
julia> monthlyzer = DateValizer(Dict(:dateinterval => Dates.Month(1)));
julia> dailyzer = DateValizer(Dict(:dateinterval => Dates.Day(1)));
julia> hourlyres = fit_transform!(hourlyzer,X)
17521×2 DataFrame
   Row │ Date                 Value
       │ DateTime             Float64?
───────┼───────────────────────────────
     1 │ 2014-01-01T00:00:00  0.9063
     2 │ 2014-01-01T01:00:00  0.334152
     3 │ 2014-01-01T02:00:00  0.485687
     4 │ 2014-01-01T03:00:00  0.136551
     5 │ 2014-01-01T04:00:00  0.478823
     6 │ 2014-01-01T05:00:00  0.51387
     7 │ 2014-01-01T06:00:00  0.131798
     8 │ 2014-01-01T07:00:00  0.450484
     ⋮ │          ⋮              ⋮
 17515 │ 2015-12-31T18:00:00  0.920049
 17516 │ 2015-12-31T19:00:00  0.380189
 17517 │ 2015-12-31T20:00:00  0.970942
 17518 │ 2015-12-31T21:00:00  0.3312
 17519 │ 2015-12-31T22:00:00  0.508722
 17520 │ 2015-12-31T23:00:00  0.632262
 17521 │ 2016-01-01T00:00:00  0.951966
                     17506 rows omitted
julia> dailyres = fit_transform!(dailyzer,X)
731×2 DataFrame
 Row │ Date                 Value
     │ DateTime             Float64?
─────┼───────────────────────────────
   1 │ 2014-01-01T00:00:00  0.390013
   2 │ 2014-01-02T00:00:00  0.509696
   3 │ 2014-01-03T00:00:00  0.571543
   4 │ 2014-01-04T00:00:00  0.578252
   5 │ 2014-01-05T00:00:00  0.463037
   6 │ 2014-01-06T00:00:00  0.646811
   7 │ 2014-01-07T00:00:00  0.468079
   8 │ 2014-01-08T00:00:00  0.538296
   ⋮ │          ⋮              ⋮
 725 │ 2015-12-26T00:00:00  0.478571
 726 │ 2015-12-27T00:00:00  0.441035
 727 │ 2015-12-28T00:00:00  0.651315
 728 │ 2015-12-29T00:00:00  0.438614
 729 │ 2015-12-30T00:00:00  0.442234
 730 │ 2015-12-31T00:00:00  0.433304
 731 │ 2016-01-01T00:00:00  0.632262
                     716 rows omitted
julia> monthlyres = fit_transform!(monthlyzer,X)
25×2 DataFrame
 Row │ Date                 Value
     │ DateTime             Float64?
─────┼───────────────────────────────
   1 │ 2014-01-01T00:00:00  0.51885
   2 │ 2014-02-01T00:00:00  0.531905
   3 │ 2014-03-01T00:00:00  0.495285
   4 │ 2014-04-01T00:00:00  0.485427
   5 │ 2014-05-01T00:00:00  0.459212
   6 │ 2014-06-01T00:00:00  0.504739
   7 │ 2014-07-01T00:00:00  0.485148
   8 │ 2014-08-01T00:00:00  0.511601
   ⋮ │          ⋮              ⋮
  19 │ 2015-07-01T00:00:00  0.512868
  20 │ 2015-08-01T00:00:00  0.498659
  21 │ 2015-09-01T00:00:00  0.480135
  22 │ 2015-10-01T00:00:00  0.486423
  23 │ 2015-11-01T00:00:00  0.482642
  24 │ 2015-12-01T00:00:00  0.498927
  25 │ 2016-01-01T00:00:00  0.485564
                      10 rows omitted