Imputation

There are two ways to impute date-value time series (TS) data: DateValNNer, which uses nearest neighbors, and DateValizer, which uses a dictionary of medians mapped to a date-time interval grouping.

DateValNNer

DateValNNer expects the following arguments, shown with their default values, during instantiation:

:dateinterval => Dates.Hour(1)
- grouping interval
:nnsize => 1
- size of neighborhood
:missdirection => :symmetric
- :forward vs :reverse vs :symmetric
:strict => true
- whether or not to repeatedly iterate until no more missing data

The :missdirection argument sets the imputation direction and the extent of the neighborhood. :symmetric uses information from both sides of the missing data. The :forward direction starts imputing from the top, while :reverse starts from the bottom. Please refer to Aggregators and Imputers for other examples.
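To make the three directions concrete, here is a minimal sketch in plain Julia. It is not the DateValNNer implementation: the helper names `skipmedian` and `impute_pass` are hypothetical, and the windowing rule is an assumption (forward sweeps from the top and looks at the preceding `nnsize` entries, reverse sweeps from the bottom and looks at the following `nnsize` entries, symmetric looks at both sides).

```julia
using Statistics

# Median of the non-missing entries, or `missing` if there are none.
function skipmedian(v)
    vals = collect(skipmissing(v))
    return isempty(vals) ? missing : median(vals)
end

# One imputation pass over a vector that may contain `missing`.
# Assumed semantics (illustration only):
#   :forward   sweep top to bottom, window = the nnsize entries before i
#   :reverse   sweep bottom to top, window = the nnsize entries after i
#   :symmetric sweep top to bottom, window = nnsize entries on both sides of i
function impute_pass(x::AbstractVector, nnsize::Int, direction::Symbol)
    n = length(x)
    out = collect(Union{Missing,Float64}, x)   # work on a copy
    order = direction == :reverse ? (n:-1:1) : (1:1:n)
    for i in order
        ismissing(out[i]) || continue
        lo = direction == :reverse ? i + 1 : max(1, i - nnsize)
        hi = direction == :forward ? i - 1 : min(n, i + nnsize)
        out[i] = skipmedian(out[lo:hi])        # an empty window yields `missing`
    end
    return out
end

x = [missing, missing, 1.0, missing, 3.0, missing]
fwd = impute_pass(x, 1, :forward)     # leading missing values have nothing before them
rev = impute_pass(x, 1, :reverse)     # the trailing missing value has nothing after it
sym = impute_pass(x, 1, :symmetric)   # one pass may still leave gaps wider than nnsize
```

In this toy input, the forward pass cannot fill the two leading entries and the reverse pass cannot fill the last one, which is the boundary effect that an asymmetric direction runs into.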

Let's use the same dataset we have used in the tutorial and print the first few rows.

julia> first(X,10)
10×2 DataFrame
│ Row │ Date                │ Value     │
│     │ DateTime            │ Float64?  │
├─────┼─────────────────────┼───────────┤
│ 1   │ 2014-01-01T00:00:00 │ missing   │
│ 2   │ 2014-01-01T00:15:00 │ missing   │
│ 3   │ 2014-01-01T00:30:00 │ missing   │
│ 4   │ 2014-01-01T00:45:00 │ missing   │
│ 5   │ 2014-01-01T01:00:00 │ missing   │
│ 6   │ 2014-01-01T01:15:00 │ missing   │
│ 7   │ 2014-01-01T01:30:00 │ missing   │
│ 8   │ 2014-01-01T01:45:00 │ 0.0521332 │
│ 9   │ 2014-01-01T02:00:00 │ 0.26864   │
│ 10  │ 2014-01-01T02:15:00 │ 0.108871  │

Let's try the following setup: grouping every two hours, with forward imputation and 10 neighbors:

dnnr = DateValNNer(Dict(:dateinterval=>Dates.Hour(2),
             :nnsize=>10,:missdirection => :forward,
             :strict=>false))
forwardres=fit_transform!(dnnr,X)

julia> first(forwardres,5)
5×2 DataFrame
│ Row │ Date                │ Value    │
│     │ DateTime            │ Float64? │
├─────┼─────────────────────┼──────────┤
│ 1   │ 2014-01-01T00:00:00 │ 0.491286 │
│ 2   │ 2014-01-01T02:00:00 │ 0.108871 │
│ 3   │ 2014-01-01T04:00:00 │ 0.361194 │
│ 4   │ 2014-01-01T06:00:00 │ 0.918165 │
│ 5   │ 2014-01-01T08:00:00 │ 0.690462 │

The same parameters as above, but with the reverse instead of the forward direction:

dnnr = DateValNNer(Dict(:dateinterval=>Dates.Hour(2),
             :nnsize=>10,:missdirection => :reverse,
             :strict=>false))
reverseres=fit_transform!(dnnr,X)

julia> first(reverseres,5)
5×2 DataFrame
│ Row │ Date                │ Value    │
│     │ DateTime            │ Float64? │
├─────┼─────────────────────┼──────────┤
│ 1   │ 2014-01-01T00:00:00 │ 0.491286 │
│ 2   │ 2014-01-01T02:00:00 │ 0.108871 │
│ 3   │ 2014-01-01T04:00:00 │ 0.361194 │
│ 4   │ 2014-01-01T06:00:00 │ 0.918165 │
│ 5   │ 2014-01-01T08:00:00 │ 0.690462 │

Using symmetric imputation:

dnnr = DateValNNer(Dict(:dateinterval=>Dates.Hour(2),
             :nnsize=>10,:missdirection => :symmetric,
             :strict=>false))
symmetricres=fit_transform!(dnnr,X)

julia> first(symmetricres,5)
5×2 DataFrame
│ Row │ Date                │ Value    │
│     │ DateTime            │ Float64? │
├─────┼─────────────────────┼──────────┤
│ 1   │ 2014-01-01T00:00:00 │ 0.491286 │
│ 2   │ 2014-01-01T02:00:00 │ 0.108871 │
│ 3   │ 2014-01-01T04:00:00 │ 0.361194 │
│ 4   │ 2014-01-01T06:00:00 │ 0.918165 │
│ 5   │ 2014-01-01T08:00:00 │ 0.690462 │

Symmetric imputation guarantees that all missing data are imputed as long as the input has non-missing elements. Forward and reverse imputation cannot guarantee this because of boundary issues: if the top or bottom of the input is missing, asymmetric imputation cannot replace those missing endpoints. For reliable imputation, symmetric imputation is recommended.

In the example above, the number of missing values remaining after forward, reverse, and symmetric imputation is:

julia> sum(ismissing.(forwardres.Value))
3

julia> sum(ismissing.(reverseres.Value))
3

julia> sum(ismissing.(symmetricres.Value))
0
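The idea of iterating until no missing data remain, as described for the :strict argument, can be sketched in plain Julia as follows. This is only an illustration under stated assumptions, not the package's code: `symmetric_pass` and `impute_until_complete` are hypothetical helpers, and each pass reads only from the values available at the start of that pass.

```julia
using Statistics

# One symmetric pass: each missing entry takes the median of the non-missing
# values within nnsize positions on either side, read from the pass input.
function symmetric_pass(x, nnsize)
    n = length(x)
    out = collect(Union{Missing,Float64}, x)
    for i in 1:n
        ismissing(x[i]) || continue
        window = x[max(1, i - nnsize):min(n, i + nnsize)]
        vals = collect(skipmissing(window))
        isempty(vals) || (out[i] = median(vals))
    end
    return out
end

# Repeat symmetric passes until nothing is missing. The loop stops early
# if a pass makes no progress, which happens when the input is all missing.
function impute_until_complete(x, nnsize)
    cur = collect(Union{Missing,Float64}, x)
    while any(ismissing, cur)
        next = symmetric_pass(cur, nnsize)
        count(ismissing, next) == count(ismissing, cur) && break
        cur = next
    end
    return cur
end

x = [missing, missing, missing, 1.0, missing, 3.0, missing, missing]
onepass = symmetric_pass(x, 1)          # gaps wider than nnsize survive one pass
full = impute_until_complete(x, 1)      # repeated passes close them from the inside out
```

Because a symmetric window can reach a known value from either side, every pass shrinks each gap, so the loop finishes whenever the input has at least one non-missing element.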

DateValizer

DateValizer operates on the principle that patterns recur regularly within a specific time period, so replacing a missing value is a matter of determining which time period it belongs to and using the pooled median of that period. The default time period for DateValizer is hourly. A more advanced implementation could add daily, hourly, and weekly periods, but it would require a much larger hash table. Additional grouping criteria can also produce smaller subgroups, some of which may be 100% missing, causing imputation to fail. DateValizer depends only on the :dateinterval argument, whose default value is Dates.Hour(1) (hourly). Please refer to Aggregators and Imputers for more examples.
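The pooled-median principle can be sketched with a small lookup table in plain Julia. This is a simplified illustration, not DateValizer itself: the helpers `hourly_medians` and `impute_by_hour` are hypothetical, and keying the table by hour of day is an assumption made for the example.

```julia
using Dates, Statistics

# Build a table: hour of day => median of the non-missing values seen in that hour.
function hourly_medians(dates, values)
    pools = Dict{Int,Vector{Float64}}()
    for (d, v) in zip(dates, values)
        ismissing(v) && continue
        push!(get!(pools, hour(d), Float64[]), v)
    end
    return Dict(h => median(vs) for (h, vs) in pools)
end

# Replace each missing value with the pooled median of its hour.
# If an hour has no observed values at all, the entry stays missing.
function impute_by_hour(dates, values)
    table = hourly_medians(dates, values)
    return [ismissing(v) ? get(table, hour(d), missing) : v
            for (d, v) in zip(dates, values)]
end

# Three days of hourly data with a synthetic value of 10*day + hour.
dates = DateTime(2014, 1, 1):Hour(1):DateTime(2014, 1, 3, 23)
values = Vector{Union{Missing,Float64}}(Float64.(10 .* day.(dates) .+ hour.(dates)))
values[1] = missing     # 2014-01-01T00:00:00
values[27] = missing    # 2014-01-02T02:00:00
filled = impute_by_hour(dates, values)
```

The fallback to `missing` in `impute_by_hour` mirrors the failure mode described above: a group with no observed values has no median to offer, and finer grouping keys make such empty groups more likely.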

Let's try hourly, daily, and monthly medians as the basis of imputation:

julia> hourlyzer = DateValizer(Dict(:dateinterval => Dates.Hour(1)));

julia> monthlyzer = DateValizer(Dict(:dateinterval => Dates.Month(1)));

julia> dailyzer = DateValizer(Dict(:dateinterval => Dates.Day(1)));

julia> hourlyres = fit_transform!(hourlyzer,X)
17521×2 DataFrame
│ Row   │ Date                │ Value    │
│       │ DateTime            │ Float64? │
├───────┼─────────────────────┼──────────┤
│ 1     │ 2014-01-01T00:00:00 │ 0.498827 │
│ 2     │ 2014-01-01T01:00:00 │ 0.500748 │
│ 3     │ 2014-01-01T02:00:00 │ 0.108871 │
│ 4     │ 2014-01-01T03:00:00 │ 0.473017 │
│ 5     │ 2014-01-01T04:00:00 │ 0.361194 │
│ 6     │ 2014-01-01T05:00:00 │ 0.582318 │
│ 7     │ 2014-01-01T06:00:00 │ 0.918165 │
⋮
│ 17514 │ 2015-12-31T17:00:00 │ 0.549606 │
│ 17515 │ 2015-12-31T18:00:00 │ 0.680491 │
│ 17516 │ 2015-12-31T19:00:00 │ 0.500731 │
│ 17517 │ 2015-12-31T20:00:00 │ 0.468921 │
│ 17518 │ 2015-12-31T21:00:00 │ 0.28438  │
│ 17519 │ 2015-12-31T22:00:00 │ 0.533108 │
│ 17520 │ 2015-12-31T23:00:00 │ 0.308998 │
│ 17521 │ 2016-01-01T00:00:00 │ 0.498827 │

julia> dailyres = fit_transform!(dailyzer,X)
731×2 DataFrame
│ Row │ Date                │ Value    │
│     │ DateTime            │ Float64? │
├─────┼─────────────────────┼──────────┤
│ 1   │ 2014-01-01T00:00:00 │ 0.48     │
│ 2   │ 2014-01-02T00:00:00 │ 0.628368 │
│ 3   │ 2014-01-03T00:00:00 │ 0.509263 │
│ 4   │ 2014-01-04T00:00:00 │ 0.559623 │
│ 5   │ 2014-01-05T00:00:00 │ 0.539073 │
│ 6   │ 2014-01-06T00:00:00 │ 0.387866 │
│ 7   │ 2014-01-07T00:00:00 │ 0.464466 │
⋮
│ 724 │ 2015-12-25T00:00:00 │ 0.44458  │
│ 725 │ 2015-12-26T00:00:00 │ 0.625784 │
│ 726 │ 2015-12-27T00:00:00 │ 0.659934 │
│ 727 │ 2015-12-28T00:00:00 │ 0.368161 │
│ 728 │ 2015-12-29T00:00:00 │ 0.506546 │
│ 729 │ 2015-12-30T00:00:00 │ 0.516895 │
│ 730 │ 2015-12-31T00:00:00 │ 0.299126 │
│ 731 │ 2016-01-01T00:00:00 │ 0.434787 │

julia> monthlyres = fit_transform!(monthlyzer,X)
25×2 DataFrame
│ Row │ Date                │ Value    │
│     │ DateTime            │ Float64? │
├─────┼─────────────────────┼──────────┤
│ 1   │ 2014-01-01T00:00:00 │ 0.525587 │
│ 2   │ 2014-02-01T00:00:00 │ 0.501297 │
│ 3   │ 2014-03-01T00:00:00 │ 0.540474 │
│ 4   │ 2014-04-01T00:00:00 │ 0.492871 │
│ 5   │ 2014-05-01T00:00:00 │ 0.514414 │
│ 6   │ 2014-06-01T00:00:00 │ 0.515317 │
│ 7   │ 2014-07-01T00:00:00 │ 0.501932 │
⋮
│ 18  │ 2015-06-01T00:00:00 │ 0.499711 │
│ 19  │ 2015-07-01T00:00:00 │ 0.509305 │
│ 20  │ 2015-08-01T00:00:00 │ 0.505218 │
│ 21  │ 2015-09-01T00:00:00 │ 0.511359 │
│ 22  │ 2015-10-01T00:00:00 │ 0.504835 │
│ 23  │ 2015-11-01T00:00:00 │ 0.487876 │
│ 24  │ 2015-12-01T00:00:00 │ 0.512668 │
│ 25  │ 2016-01-01T00:00:00 │ 0.482073 │