Imputation
There are two ways to impute date-value time series (TS) data: DateValNNer, which uses nearest neighbors, and DateValizer, which uses a dictionary of medians keyed by date-time interval grouping.
DateValNNer
DateValNNer expects the following arguments, shown with their default values, at instantiation:
:dateinterval => Dates.Hour(1) - grouping interval
:nnsize => 1 - size of neighborhood
:missdirection => :symmetric - :forward vs :reverse vs :symmetric
:strict => true - whether or not to repeatedly iterate until no more missing data
The :missdirection argument sets the imputation direction and the extent of the neighborhood. :symmetric draws on values from both sides of the missing data. The :forward direction starts imputing from the top, while :reverse starts from the bottom. Please refer to Aggregators and Imputers for other examples.
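To make the three directions concrete, here is a minimal sketch of a single neighborhood-median pass over a plain vector. It is not the DateValNNer implementation: `impute_pass` is a hypothetical helper, and the window orientation (`:forward` looks at the next `nnsize` entries, `:reverse` at the previous ones, `:symmetric` at both sides) is an assumption made for illustration.

```julia
using Statistics

# Hypothetical helper, not part of TSML: one imputation pass over `values`.
# Assumed windows: :forward   -> the next `nnsize` entries,
#                  :reverse   -> the previous `nnsize` entries,
#                  :symmetric -> `nnsize` entries on each side.
function impute_pass(values::AbstractVector, nnsize::Integer, direction::Symbol)
    direction in (:forward, :reverse, :symmetric) ||
        throw(ArgumentError("unknown direction: $direction"))
    n = length(values)
    out = Vector{Union{Missing,Float64}}(values)   # copy; the input is left untouched
    for i in 1:n
        ismissing(values[i]) || continue
        lo = direction == :forward ? i + 1 : max(1, i - nnsize)
        hi = direction == :reverse ? i - 1 : min(n, i + nnsize)
        lo > hi && continue                        # no neighbors past the boundary
        known = collect(skipmissing(view(values, lo:hi)))
        isempty(known) || (out[i] = median(known)) # stays missing if nothing is known
    end
    return out
end

series = [missing, 1.0, missing, 3.0, missing]
fwd = impute_pass(series, 1, :forward)
rev = impute_pass(series, 1, :reverse)
sym = impute_pass(series, 1, :symmetric)

count(ismissing, fwd)   # 1: the last entry has nothing after it
count(ismissing, rev)   # 1: the first entry has nothing before it
count(ismissing, sym)   # 0
```

Under these assumptions, the one-sided windows leave one endpoint unfilled, while the two-sided window fills every gap in this small series.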
Let's use the same dataset we have used in the tutorial and print the first few rows.
julia> first(X,10)
10×2 DataFrame
 Row │ Date                 Value
     │ DateTime             Float64?
─────┼───────────────────────────────
   1 │ 2014-01-01T00:00:00  0.9063
   2 │ 2014-01-01T00:15:00  missing
   3 │ 2014-01-01T00:30:00  missing
   4 │ 2014-01-01T00:45:00  missing
   5 │ 2014-01-01T01:00:00  missing
   6 │ 2014-01-01T01:15:00  0.334152
   7 │ 2014-01-01T01:30:00  missing
   8 │ 2014-01-01T01:45:00  missing
   9 │ 2014-01-01T02:00:00  missing
  10 │ 2014-01-01T02:15:00  missing
Let's try the following setup: grouping by 2-hour intervals with forward imputation and 10 neighbors:
dnnr = DateValNNer(Dict(:dateinterval=>Dates.Hour(2),
:nnsize=>10,:missdirection => :forward,
:strict=>false))
forwardres=fit_transform!(dnnr,X)
julia> first(forwardres,5)
5×2 DataFrame
 Row │ Date                 Value
     │ DateTime             Float64?
─────┼───────────────────────────────
   1 │ 2014-01-01T00:00:00  0.9063
   2 │ 2014-01-01T02:00:00  0.478823
   3 │ 2014-01-01T04:00:00  0.478823
   4 │ 2014-01-01T06:00:00  0.131798
   5 │ 2014-01-01T08:00:00  0.360583
Same parameters as above, but using the reverse instead of the forward direction:
dnnr = DateValNNer(Dict(:dateinterval=>Dates.Hour(2),
:nnsize=>10,:missdirection => :reverse,
:strict=>false))
reverseres=fit_transform!(dnnr,X)
julia> first(reverseres,5)
5×2 DataFrame
 Row │ Date                 Value
     │ DateTime             Float64?
─────┼───────────────────────────────
   1 │ 2014-01-01T00:00:00  0.9063
   2 │ 2014-01-01T02:00:00  0.486039
   3 │ 2014-01-01T04:00:00  0.478823
   4 │ 2014-01-01T06:00:00  0.131798
   5 │ 2014-01-01T08:00:00  0.360583
Using symmetric imputation:
dnnr = DateValNNer(Dict(:dateinterval=>Dates.Hour(2),
:nnsize=>10,:missdirection => :symmetric,
:strict=>false))
symmetricres=fit_transform!(dnnr,X)
julia> first(symmetricres,5)
5×2 DataFrame
 Row │ Date                 Value
     │ DateTime             Float64?
─────┼───────────────────────────────
   1 │ 2014-01-01T00:00:00  0.9063
   2 │ 2014-01-01T02:00:00  0.478823
   3 │ 2014-01-01T04:00:00  0.478823
   4 │ 2014-01-01T06:00:00  0.131798
   5 │ 2014-01-01T08:00:00  0.360583
Unlike symmetric imputation, which guarantees 100% imputation of missing data as long as the input has non-missing elements, forward and reverse cannot guarantee that every missing value is replaced because of boundary issues. If the top or bottom of the input is missing, asymmetric imputation cannot replace the missing endpoints. For a complete imputation, symmetric imputation is recommended.
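One way to picture how a two-sided neighborhood can eventually fill a gap wider than the neighborhood itself is to repeat passes until nothing is missing, in the spirit of the :strict option. The sketch below assumes that behavior rather than reproducing DateValNNer; `symmetric_pass` and `impute_until_done` are hypothetical names introduced only for this illustration.

```julia
using Statistics

# Hypothetical single symmetric pass: each missing entry becomes the median of
# the known values within `nnsize` positions on either side, if any exist.
function symmetric_pass(values::AbstractVector, nnsize::Integer)
    n = length(values)
    out = Vector{Union{Missing,Float64}}(values)
    for i in 1:n
        ismissing(values[i]) || continue
        window = view(values, max(1, i - nnsize):min(n, i + nnsize))
        known = collect(skipmissing(window))
        isempty(known) || (out[i] = median(known))
    end
    return out
end

# Repeat passes until nothing is missing, or until a pass makes no progress
# (which only happens when the input has no known values at all).
function impute_until_done(values::AbstractVector, nnsize::Integer)
    current = Vector{Union{Missing,Float64}}(values)
    passes = 0
    while any(ismissing, current)
        next = symmetric_pass(current, nnsize)
        count(ismissing, next) == count(ismissing, current) && break
        current = next
        passes += 1
    end
    return current, passes
end

gappy = [1.0, missing, missing, missing, missing, missing, 2.0]
filled, passes = impute_until_done(gappy, 1)
```

With a neighborhood of 1, each pass in this sketch fills only the edges of the gap, so the five-wide gap closes from both ends over three passes.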
In the example above, the number of missing values left unimputed by forward, reverse, and symmetric is:
julia> sum(ismissing.(forwardres.Value))
0
julia> sum(ismissing.(reverseres.Value))
0
julia> sum(ismissing.(symmetricres.Value))
0
DateValizer
DateValizer operates on the principle that patterns recur regularly within a specific time period, so imputing a value is a matter of identifying the time period it belongs to and using the pooled median of that period to replace the missing data. The default time period for DateValizer is hourly. A more advanced implementation could add daily, hourly, and weekly periods, but it would require a much larger hash table. Additional grouping criteria also produce smaller subgroups, some of which may be 100% missing, causing imputation to fail. DateValizer depends only on the :dateinterval argument, whose default is Dates.Hour(1) (hourly). Please refer to Aggregators and Imputers for more examples.
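The pooled-median idea can be sketched with a plain dictionary. This is an illustration under stated assumptions, not the DateValizer source: it assumes the pooling key is the hour of day, and `hourly_medians` and `impute_by_hour` are hypothetical helper names.

```julia
using Dates, Statistics

# Hypothetical helper: map each hour of day (0-23) to the median of the known
# values observed at that hour. Assumes hour-of-day is the pooling key.
function hourly_medians(dates::AbstractVector{DateTime}, values::AbstractVector)
    pools = Dict{Int,Vector{Float64}}()
    for (d, v) in zip(dates, values)
        ismissing(v) && continue
        push!(get!(pools, hour(d), Float64[]), v)
    end
    return Dict(h => median(vs) for (h, vs) in pools)
end

# Replace each missing value with the pooled median for its hour of day.
# If an hour has no known values at all, its entries stay missing.
function impute_by_hour(dates, values)
    table = hourly_medians(dates, values)
    return [ismissing(v) ? get(table, hour(d), missing) : v
            for (d, v) in zip(dates, values)]
end

# Three days of hourly timestamps with one gap at 2014-01-01T01:00:00.
dates = collect(DateTime(2014, 1, 1, 0):Hour(1):DateTime(2014, 1, 3, 23))
values = Union{Missing,Float64}[Float64(hour(d) + day(d)) for d in dates]
values[2] = missing

imputed = impute_by_hour(dates, values)
imputed[2]   # 3.5, the median of the 01:00 values on the other two days (3.0 and 4.0)
```

The sketch also shows the failure mode mentioned above: a group with no known values has no entry in the lookup table, so its gaps cannot be filled.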
Let's try hourly, daily, and monthly medians as the basis of imputation:
julia> hourlyzer = DateValizer(Dict(:dateinterval => Dates.Hour(1)));
julia> monthlyzer = DateValizer(Dict(:dateinterval => Dates.Month(1)));
julia> dailyzer = DateValizer(Dict(:dateinterval => Dates.Day(1)));
julia> hourlyres = fit_transform!(hourlyzer,X)
17521×2 DataFrame
   Row │ Date                 Value
       │ DateTime             Float64?
───────┼───────────────────────────────
     1 │ 2014-01-01T00:00:00  0.9063
     2 │ 2014-01-01T01:00:00  0.334152
     3 │ 2014-01-01T02:00:00  0.485687
     4 │ 2014-01-01T03:00:00  0.136551
     5 │ 2014-01-01T04:00:00  0.478823
     6 │ 2014-01-01T05:00:00  0.51387
     7 │ 2014-01-01T06:00:00  0.131798
     8 │ 2014-01-01T07:00:00  0.450484
   ⋮   │          ⋮              ⋮
 17515 │ 2015-12-31T18:00:00  0.920049
 17516 │ 2015-12-31T19:00:00  0.380189
 17517 │ 2015-12-31T20:00:00  0.970942
 17518 │ 2015-12-31T21:00:00  0.3312
 17519 │ 2015-12-31T22:00:00  0.508722
 17520 │ 2015-12-31T23:00:00  0.632262
 17521 │ 2016-01-01T00:00:00  0.951966
                     17506 rows omitted
julia> dailyres = fit_transform!(dailyzer,X)
731×2 DataFrame
 Row │ Date                 Value
     │ DateTime             Float64?
─────┼───────────────────────────────
   1 │ 2014-01-01T00:00:00  0.390013
   2 │ 2014-01-02T00:00:00  0.509696
   3 │ 2014-01-03T00:00:00  0.571543
   4 │ 2014-01-04T00:00:00  0.578252
   5 │ 2014-01-05T00:00:00  0.463037
   6 │ 2014-01-06T00:00:00  0.646811
   7 │ 2014-01-07T00:00:00  0.468079
   8 │ 2014-01-08T00:00:00  0.538296
  ⋮  │          ⋮              ⋮
 725 │ 2015-12-26T00:00:00  0.478571
 726 │ 2015-12-27T00:00:00  0.441035
 727 │ 2015-12-28T00:00:00  0.651315
 728 │ 2015-12-29T00:00:00  0.438614
 729 │ 2015-12-30T00:00:00  0.442234
 730 │ 2015-12-31T00:00:00  0.433304
 731 │ 2016-01-01T00:00:00  0.632262
                     716 rows omitted
julia> monthlyres = fit_transform!(monthlyzer,X)
25×2 DataFrame
 Row │ Date                 Value
     │ DateTime             Float64?
─────┼───────────────────────────────
   1 │ 2014-01-01T00:00:00  0.51885
   2 │ 2014-02-01T00:00:00  0.531905
   3 │ 2014-03-01T00:00:00  0.495285
   4 │ 2014-04-01T00:00:00  0.485427
   5 │ 2014-05-01T00:00:00  0.459212
   6 │ 2014-06-01T00:00:00  0.504739
   7 │ 2014-07-01T00:00:00  0.485148
   8 │ 2014-08-01T00:00:00  0.511601
  ⋮  │          ⋮              ⋮
  19 │ 2015-07-01T00:00:00  0.512868
  20 │ 2015-08-01T00:00:00  0.498659
  21 │ 2015-09-01T00:00:00  0.480135
  22 │ 2015-10-01T00:00:00  0.486423
  23 │ 2015-11-01T00:00:00  0.482642
  24 │ 2015-12-01T00:00:00  0.498927
  25 │ 2016-01-01T00:00:00  0.485564
                      10 rows omitted