Statistical Metrics
Each time series (TS) can be evaluated to extract statistical features, which can be used for data quality assessment, data discovery by clustering and classification, and anomaly characterization, among others.
TSML relies on Statifier to compute statistical metrics on the TS. It can be configured to extract statistics of the missing blocks in addition to those of the non-missing elements. The scalar statistics it computes include pacf, acf, autocor, quartiles, mean, median, max, min, kurtosis, skewness, variation, standard error, and entropy, among others. It takes a single argument, :processmissing (for example :processmissing => true), which indicates whether to include the statistics of missing data.
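To make the idea of scalar statistics over the non-missing elements concrete, here is a minimal sketch in plain Julia using only the Statistics standard library. It is not Statifier's implementation: the toy values, the `stats` name, and the hand-written moment formulas for skewness and kurtosis are assumptions chosen for illustration.

```julia
using Statistics  # standard library: mean, median, quantile, std

# Toy series with gaps (hypothetical values, not the tutorial data).
vals = [0.91, missing, missing, 0.33, 0.14, missing, 0.52, 0.77]

# Drop the missing entries before computing scalar statistics.
observed = collect(skipmissing(vals))

m = mean(observed)
s = std(observed; corrected=false)  # population standard deviation

stats = (
    count    = length(observed),
    min      = minimum(observed),
    max      = maximum(observed),
    mean     = m,
    median   = median(observed),
    q25      = quantile(observed, 0.25),
    q75      = quantile(observed, 0.75),
    # Moment-based skewness and excess kurtosis, written out by hand.
    skewness = mean((observed .- m) .^ 3) / s^3,
    kurtosis = mean((observed .- m) .^ 4) / s^4 - 3,
)

println(stats)
```

Each entry of `stats` is a single number, so a whole series collapses into one row of features, which is the shape of the one-row DataFrame that Statifier returns in the examples that follow.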
Let us start again by generating artificial data with missing values using the generateDataWithMissing() function described at the beginning of the tutorial.
julia> X = generateDataWithMissing();
julia> first(X,15)
15×2 DataFrame
 Row │ Date                 Value
     │ DateTime             Float64?
─────┼─────────────────────────────────────
   1 │ 2014-01-01T00:00:00        0.9063
   2 │ 2014-01-01T00:15:00  missing
   3 │ 2014-01-01T00:30:00  missing
   4 │ 2014-01-01T00:45:00  missing
   5 │ 2014-01-01T01:00:00  missing
   6 │ 2014-01-01T01:15:00        0.334152
   7 │ 2014-01-01T01:30:00  missing
   8 │ 2014-01-01T01:45:00  missing
   9 │ 2014-01-01T02:00:00  missing
  10 │ 2014-01-01T02:15:00  missing
  11 │ 2014-01-01T02:30:00  missing
  12 │ 2014-01-01T02:45:00        0.136551
  13 │ 2014-01-01T03:00:00  missing
  14 │ 2014-01-01T03:15:00  missing
  15 │ 2014-01-01T03:30:00  missing
Statifier for Both Non-Missing and Missing Values
TSML includes the Statifier transformer, which computes scalar statistics to characterize the time series data. By default, it also computes statistics of the missing blocks of data. To disable this feature, pass :processmissing => false as the argument when creating the instance. The code below illustrates this workflow.
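using TSML
using Dates  # for Dates.Hour

# Transformers configured with an hourly date interval; the two imputers
# (DateValNNer and DateValizer) are used in the later examples.
dtvalgator = DateValgator(Dict(:dateinterval => Dates.Hour(1)))
dtvalnner = DateValNNer(Dict(:dateinterval => Dates.Hour(1)))
dtvalizer = DateValizer(Dict(:dateinterval => Dates.Hour(1)))
# Compute statistics, including those of the missing blocks.
stfier = Statifier(Dict(:processmissing => true))
mypipeline = dtvalgator |> stfier
results = fit_transform!(mypipeline,X)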
julia> show(results,allcols=true)
1×26 DataFrame
 Row │ tstart               tend                 sfreq     count  max       min         median    mean      q1        q2        q25       q75       q8        q9        kurtosis   skewness   variation  entropy  autocor    pacf       bmedian  bmean    bq25     bq75     bmin     bmax
     │ DateTime             DateTime             Float64   Int64  Float64   Float64     Float64   Float64   Float64   Float64   Float64   Float64   Float64   Float64   Float64    Float64    Float64    Float64  Float64    Float64    Float64  Float64  Float64  Float64  Float64  Float64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ 2014-01-01T00:00:00  2016-01-01T00:00:00  0.999943  12932  0.999752  5.16098e-5  0.49567   0.497334  0.146138  0.252456  0.296718  0.695049  0.742278  0.850896  -0.921986  0.0136205  0.512859   3531.43  0.0410337  0.0414848  1.0      1.3585   1.0      2.0      1.0      7.0
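The b-prefixed columns (bmedian, bmean, bq25, bq75, bmin, bmax) hold the statistics of the missing blocks. The sketch below shows one way such statistics could be derived, under the assumption that they summarize the lengths of consecutive runs of missing values; the integer-like bmin and bmax above suggest this reading, but the text does not confirm it. The helper `missing_block_lengths` and the toy data are hypothetical and are not part of TSML.

```julia
using Statistics  # standard library: mean, median, quantile

# Lengths of consecutive runs of `missing` in a vector (hypothetical helper).
function missing_block_lengths(v)
    lengths = Int[]
    len = 0
    for x in v
        if ismissing(x)
            len += 1
        elseif len > 0
            push!(lengths, len)   # a run of missing values just ended
            len = 0
        end
    end
    len > 0 && push!(lengths, len)  # close a run that reaches the end
    return lengths
end

# Toy data: runs of 3, 1, and 2 missing values.
vals = [0.9, missing, missing, missing, 0.3, missing, 0.1, missing, missing]
blocks = missing_block_lengths(vals)

block_stats = (
    bmin    = minimum(blocks),
    bmax    = maximum(blocks),
    bmean   = mean(blocks),
    bmedian = median(blocks),
    bq25    = quantile(blocks, 0.25),
    bq75    = quantile(blocks, 0.75),
)

println(blocks)
println(block_stats)
```

Read this way, a large bmax or bmean would point to long gaps in the series, which is useful for data quality assessment before choosing an imputation strategy.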
Statifier for Non-Missing Values only
If you are not interested in the statistics of the missing blocks, you can disable the missing-block summary by passing :processmissing => false in the instance argument:
stfier = Statifier(Dict(:processmissing=>false))
mypipeline = dtvalgator |> stfier
results = fit_transform!(mypipeline,X)
julia> show(results,allcols=true)
1×20 DataFrame
 Row │ tstart               tend                 sfreq     count  max       min         median    mean      q1        q2        q25       q75       q8        q9        kurtosis   skewness   variation  entropy  autocor    pacf
     │ DateTime             DateTime             Float64   Int64  Float64   Float64     Float64   Float64   Float64   Float64   Float64   Float64   Float64   Float64   Float64    Float64    Float64    Float64  Float64    Float64
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ 2014-01-01T00:00:00  2016-01-01T00:00:00  0.999943  12932  0.999752  5.16098e-5  0.49567   0.497334  0.146138  0.252456  0.296718  0.695049  0.742278  0.850896  -0.921986  0.0136205  0.512859   3531.43  0.0410337  0.0414848
Statifier After Imputation
Let us check the statistics after imputation by adding a DateValNNer instance to the pipeline. If the imputation is successful, we expect the stats for the missing blocks to all be NaN, because the statistics of an empty set are NaN.
stfier = Statifier(Dict(:processmissing=>true))
mypipeline = dtvalgator |> dtvalnner |> stfier
results = fit_transform!(mypipeline,X)
julia> show(results,allcols=true)
1×26 DataFrame
 Row │ tstart               tend                 sfreq     count  max       min         median    mean      q1        q2        q25       q75       q8        q9        kurtosis   skewness   variation  entropy  autocor    pacf       bmedian  bmean    bq25     bq75     bmin     bmax
     │ DateTime             DateTime             Float64   Int64  Float64   Float64     Float64   Float64   Float64   Float64   Float64   Float64   Float64   Float64   Float64    Float64    Float64    Float64  Float64    Float64    Float64  Float64  Float64  Float64  Float64  Float64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ 2014-01-01T00:00:00  2016-01-01T00:00:00  0.999943  17521  0.999752  5.16098e-5  0.495285  0.496761  0.164574  0.27195   0.315593  0.678395  0.722277  0.830334  -0.789767  0.016735   0.487173   4923.45  0.284516   0.275754   NaN      NaN      NaN      NaN      NaN      NaN
As expected, the missing-block stats are all NaN: the imputation is successful and there are no more missing values in the processed time series dataset. The count of non-missing values also rises from 12932 to 17521.
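The NaN convention for an empty set can be sketched in plain Julia. Note that `minimum`, `maximum`, and `median` throw an error on an empty collection, so a wrapper is needed to get NaN instead. The `nan_if_empty` helper and the toy block lengths are hypothetical; whether Statifier implements the convention this way is an assumption.

```julia
using Statistics  # standard library: mean, median

# Return NaN for an empty input instead of throwing; otherwise apply `f`.
nan_if_empty(f, v) = isempty(v) ? NaN : float(f(v))

blocks_before = [3, 1, 2]  # hypothetical missing-block lengths before imputation
blocks_after  = Int[]      # no missing blocks remain after imputation

fs = (minimum, maximum, mean, median)
stats_before = [nan_if_empty(f, blocks_before) for f in fs]
stats_after  = [nan_if_empty(f, blocks_after) for f in fs]

println(stats_before)
println(all(isnan, stats_after))
```

A check like `all(isnan, ...)` over the b-prefixed values is therefore a compact way to confirm that an imputation step left no gaps behind.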
Let's try the other imputer, DateValizer, and validate from the stats that there are no more missing values.
stfier = Statifier(Dict(:processmissing=>true))
mypipeline = dtvalgator |> dtvalizer |> stfier
results = fit_transform!(mypipeline,X)
julia> show(results,allcols=true)
1×26 DataFrame
 Row │ tstart               tend                 sfreq     count  max       min         median    mean      q1        q2        q25       q75       q8        q9        kurtosis   skewness   variation  entropy  autocor    pacf       bmedian  bmean    bq25     bq75     bmin     bmax
     │ DateTime             DateTime             Float64   Int64  Float64   Float64     Float64   Float64   Float64   Float64   Float64   Float64   Float64   Float64   Float64    Float64    Float64    Float64  Float64    Float64    Float64  Float64  Float64  Float64  Float64  Float64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ 2014-01-01T00:00:00  2016-01-01T00:00:00  0.999943  17521  0.999752  5.16098e-5  0.494636  0.497349  0.184767  0.317136  0.371291  0.621298  0.677163  0.808563  -0.192232  0.015616   0.440888   5124.35  0.0518163  0.0515352  NaN      NaN      NaN      NaN      NaN      NaN
Indeed, the imputation got rid of the missing values.