Statistical Metrics

Each time series (TS) can be evaluated to extract statistical features, which can be used for data quality assessment, data discovery by clustering and classification, and anomaly characterization, among others.

TSML relies on Statifier to compute statistical metrics on the TS. It can be configured to extract statistics of the missing blocks in addition to those of the non-missing elements. The scalar statistics it computes include: pacf, acf, autocor, quartiles, mean, median, max, min, kurtosis, skewness, variation, standard error, entropy, etc. Statifier takes a single argument, :processmissing (true by default), which indicates whether to include the statistics of missing data.
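To make "statistics of missing blocks" concrete, here is a minimal sketch that assumes a missing block is a run of consecutive missing values and that the block statistics summarize the lengths of those runs. The helper name missing_block_lengths is hypothetical and is not part of TSML; the actual Statifier implementation may differ.

```julia
using Statistics

# Hypothetical helper (not a TSML function): lengths of each run of
# consecutive `missing` values in `v`.
function missing_block_lengths(v::AbstractVector)
    lengths = Int[]
    len = 0
    for x in v
        if ismissing(x)
            len += 1                 # extend the current block
        elseif len > 0
            push!(lengths, len)      # a non-missing value closes the block
            len = 0
        end
    end
    len > 0 && push!(lengths, len)   # close a block that reaches the end
    return lengths
end

v = [missing, missing, 0.05, 0.27, missing, 0.47, missing, missing, missing]
blocks = missing_block_lengths(v)
println("block lengths: ", blocks)
println("min = ", minimum(blocks), ", max = ", maximum(blocks),
        ", mean = ", mean(blocks), ", median = ", median(blocks))
```

Under this reading, a maximum block statistic of 6.0 would mean that the longest gap spans six consecutive intervals.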

Let us again start by generating artificial data with missing values using the generateDataWithMissing() function described at the beginning of the tutorial.

julia> X = generateDataWithMissing();

julia> first(X,15)
15×2 DataFrame
│ Row │ Date                │ Value     │
│     │ DateTime            │ Float64?  │
├─────┼─────────────────────┼───────────┤
│ 1   │ 2014-01-01T00:00:00 │ missing   │
│ 2   │ 2014-01-01T00:15:00 │ missing   │
│ 3   │ 2014-01-01T00:30:00 │ missing   │
│ 4   │ 2014-01-01T00:45:00 │ missing   │
│ 5   │ 2014-01-01T01:00:00 │ missing   │
│ 6   │ 2014-01-01T01:15:00 │ missing   │
│ 7   │ 2014-01-01T01:30:00 │ missing   │
│ 8   │ 2014-01-01T01:45:00 │ 0.0521332 │
│ 9   │ 2014-01-01T02:00:00 │ 0.26864   │
│ 10  │ 2014-01-01T02:15:00 │ 0.108871  │
│ 11  │ 2014-01-01T02:30:00 │ 0.163666  │
│ 12  │ 2014-01-01T02:45:00 │ 0.473017  │
│ 13  │ 2014-01-01T03:00:00 │ 0.865412  │
│ 14  │ 2014-01-01T03:15:00 │ missing   │
│ 15  │ 2014-01-01T03:30:00 │ missing   │

Statifier for Both Non-Missing and Missing Values

TSML includes the Statifier transformer, which computes scalar statistics to characterize the time series data. By default, it also computes statistics of the missing blocks of data. To disable this feature, pass :processmissing => false in the argument dictionary when creating the instance. The workflow below illustrates the default behavior.

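using Dates   # provides Dates.Hour; harmless if TSML already re-exports it
using TSML

# Aggregate to hourly intervals; the two imputers are used later in this section.
dtvalgator = DateValgator(Dict(:dateinterval => Dates.Hour(1)))
dtvalnner = DateValNNer(Dict(:dateinterval => Dates.Hour(1)))
dtvalizer = DateValizer(Dict(:dateinterval => Dates.Hour(1)))

# Include statistics of missing blocks (the default).
stfier = Statifier(Dict(:processmissing => true))

mypipeline = @pipeline dtvalgator |> stfier

results = fit_transform!(mypipeline,X)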

julia> show(results,allcols=true)
1×26 DataFrame
│ Row │ tstart              │ tend                │ sfreq    │ count │
│     │ DateTime            │ DateTime            │ Float64  │ Int64 │
├─────┼─────────────────────┼─────────────────────┼──────────┼───────┤
│ 1   │ 2014-01-01T00:00:00 │ 2016-01-01T00:00:00 │ 0.999943 │ 13055 │

│ Row │ max      │ min         │ median   │ mean     │ q1       │ q2       │
│     │ Float64  │ Float64     │ Float64  │ Float64  │ Float64  │ Float64  │
├─────┼──────────┼─────────────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ 0.999751 │ 0.000456433 │ 0.500944 │ 0.502196 │ 0.146624 │ 0.251897 │

│ Row │ q25      │ q75      │ q8       │ q9       │ kurtosis  │ skewness   │
│     │ Float64  │ Float64  │ Float64  │ Float64  │ Float64   │ Float64    │
├─────┼──────────┼──────────┼──────────┼──────────┼───────────┼────────────┤
│ 1   │ 0.304154 │ 0.701455 │ 0.750592 │ 0.860293 │ -0.928492 │ 0.00290679 │

│ Row │ variation │ entropy │ autocor   │ pacf      │ bmedian │ bmean   │
│     │ Float64   │ Float64 │ Float64   │ Float64   │ Float64 │ Float64 │
├─────┼───────────┼─────────┼───────────┼───────────┼─────────┼─────────┤
│ 1   │ 0.510627  │ 3544.12 │ 0.0475025 │ 0.0477302 │ 1.0     │ 1.3268  │

│ Row │ bq25    │ bq75    │ bmin    │ bmax    │
│     │ Float64 │ Float64 │ Float64 │ Float64 │
├─────┼─────────┼─────────┼─────────┼─────────┤
│ 1   │ 1.0     │ 1.0     │ 1.0     │ 6.0     │

Statifier for Non-Missing Values only

If you are not interested in the statistics of the missing blocks, you can disable the missing-block summary by passing :processmissing => false in the instance argument:

stfier = Statifier(Dict(:processmissing=>false))

mypipeline = @pipeline dtvalgator |> stfier

results = fit_transform!(mypipeline,X)

julia> show(results,allcols=true)
1×20 DataFrame
│ Row │ tstart              │ tend                │ sfreq    │ count │
│     │ DateTime            │ DateTime            │ Float64  │ Int64 │
├─────┼─────────────────────┼─────────────────────┼──────────┼───────┤
│ 1   │ 2014-01-01T00:00:00 │ 2016-01-01T00:00:00 │ 0.999943 │ 13055 │

│ Row │ max      │ min         │ median   │ mean     │ q1       │ q2       │
│     │ Float64  │ Float64     │ Float64  │ Float64  │ Float64  │ Float64  │
├─────┼──────────┼─────────────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ 0.999751 │ 0.000456433 │ 0.500944 │ 0.502196 │ 0.146624 │ 0.251897 │

│ Row │ q25      │ q75      │ q8       │ q9       │ kurtosis  │ skewness   │
│     │ Float64  │ Float64  │ Float64  │ Float64  │ Float64   │ Float64    │
├─────┼──────────┼──────────┼──────────┼──────────┼───────────┼────────────┤
│ 1   │ 0.304154 │ 0.701455 │ 0.750592 │ 0.860293 │ -0.928492 │ 0.00290679 │

│ Row │ variation │ entropy │ autocor   │ pacf      │
│     │ Float64   │ Float64 │ Float64   │ Float64   │
├─────┼───────────┼─────────┼───────────┼───────────┤
│ 1   │ 0.510627  │ 3544.12 │ 0.0475025 │ 0.0477302 │
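For readers who want to see how a few of these columns could be reproduced by hand, the following sketch computes some of them with Julia's Statistics standard library on the non-missing values only. It assumes that variation denotes the coefficient of variation (standard deviation divided by mean); the function name basic_stats is hypothetical, and this is not how Statifier itself is implemented.

```julia
using Statistics

# Hypothetical helper (not a TSML function): summary statistics over the
# non-missing entries of `v`.
function basic_stats(v::AbstractVector)
    x = collect(skipmissing(v))          # drop missing values first
    return (count = length(x),
            min = minimum(x), max = maximum(x),
            mean = mean(x), median = median(x),
            q25 = quantile(x, 0.25), q75 = quantile(x, 0.75),
            variation = std(x) / mean(x))   # assumed: coefficient of variation
end

s = basic_stats([missing, 1.0, 2.0, 3.0, missing, 4.0, 5.0])
println(s)
```

Because the missing entries are skipped before any statistic is computed, these columns are identical whether or not :processmissing is enabled, which matches the two outputs above.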

Statifier After Imputation

Let us check the statistics after imputation by adding a DateValNNer instance to the pipeline. If the imputation is successful, no missing blocks remain, so we expect all missing-block statistics to be NaN: the statistics of an empty set are reported as NaN.

stfier = Statifier(Dict(:processmissing=>true))

mypipeline = @pipeline dtvalgator |> dtvalnner |> stfier

results = fit_transform!(mypipeline,X)

julia> show(results,allcols=true)
1×26 DataFrame
│ Row │ tstart              │ tend                │ sfreq    │ count │
│     │ DateTime            │ DateTime            │ Float64  │ Int64 │
├─────┼─────────────────────┼─────────────────────┼──────────┼───────┤
│ 1   │ 2014-01-01T00:00:00 │ 2016-01-01T00:00:00 │ 0.999943 │ 17521 │

│ Row │ max      │ min         │ median  │ mean     │ q1       │ q2       │
│     │ Float64  │ Float64     │ Float64 │ Float64  │ Float64  │ Float64  │
├─────┼──────────┼─────────────┼─────────┼──────────┼──────────┼──────────┤
│ 1   │ 0.999751 │ 0.000456433 │ 0.50022 │ 0.501232 │ 0.167654 │ 0.274743 │

│ Row │ q25      │ q75      │ q8      │ q9       │ kurtosis  │ skewness   │
│     │ Float64  │ Float64  │ Float64 │ Float64  │ Float64   │ Float64    │
├─────┼──────────┼──────────┼─────────┼──────────┼───────────┼────────────┤
│ 1   │ 0.320764 │ 0.680764 │ 0.7263  │ 0.838924 │ -0.789838 │ 0.00479609 │

│ Row │ variation │ entropy │ autocor  │ pacf     │ bmedian │ bmean   │
│     │ Float64   │ Float64 │ Float64  │ Float64  │ Float64 │ Float64 │
├─────┼───────────┼─────────┼──────────┼──────────┼─────────┼─────────┤
│ 1   │ 0.485412  │ 4896.49 │ 0.279811 │ 0.273143 │ NaN     │ NaN     │

│ Row │ bq25    │ bq75    │ bmin    │ bmax    │
│     │ Float64 │ Float64 │ Float64 │ Float64 │
├─────┼─────────┼─────────┼─────────┼─────────┤
│ 1   │ NaN     │ NaN     │ NaN     │ NaN     │

As expected, all missing-block statistics are NaN: the imputation is successful and there are no more missing values in the processed time series dataset.
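The NaN values can be understood from how empty collections behave in Julia. The sketch below shows that the mean of an empty Float64 vector is NaN, and uses a hypothetical wrapper, safe_stat, to return NaN for reductions such as minimum that are not defined on an empty collection. Whether Statifier uses a similar guard internally is an assumption.

```julia
using Statistics

# After a successful imputation there are no missing blocks left.
block_lengths = Float64[]

# The mean of an empty Float64 vector is 0.0 / 0, which is NaN.
println(isnan(mean(block_lengths)))

# Hypothetical guard (not a TSML function): return NaN instead of
# applying `f` to an empty collection.
safe_stat(f, x) = isempty(x) ? NaN : f(x)

println(safe_stat(minimum, block_lengths))
println(safe_stat(maximum, [1.0, 6.0]))
```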

Let's try the other imputer, DateValizer, and use the statistics to validate that there are no more missing values.

stfier = Statifier(Dict(:processmissing=>true))

mypipeline = @pipeline dtvalgator |> dtvalizer |> stfier

results = fit_transform!(mypipeline,X)

julia> show(results,allcols=true)
1×26 DataFrame
│ Row │ tstart              │ tend                │ sfreq    │ count │
│     │ DateTime            │ DateTime            │ Float64  │ Int64 │
├─────┼─────────────────────┼─────────────────────┼──────────┼───────┤
│ 1   │ 2014-01-01T00:00:00 │ 2016-01-01T00:00:00 │ 0.999943 │ 17521 │

│ Row │ max      │ min         │ median   │ mean     │ q1       │ q2       │
│     │ Float64  │ Float64     │ Float64  │ Float64  │ Float64  │ Float64  │
├─────┼──────────┼─────────────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ 0.999751 │ 0.000456433 │ 0.500748 │ 0.502152 │ 0.185242 │ 0.319607 │

│ Row │ q25      │ q75      │ q8       │ q9       │ kurtosis  │ skewness   │
│     │ Float64  │ Float64  │ Float64  │ Float64  │ Float64   │ Float64    │
├─────┼──────────┼──────────┼──────────┼──────────┼───────────┼────────────┤
│ 1   │ 0.377796 │ 0.625371 │ 0.685253 │ 0.820227 │ -0.225831 │ 0.00396169 │

│ Row │ variation │ entropy │ autocor   │ pacf      │ bmedian │ bmean   │
│     │ Float64   │ Float64 │ Float64   │ Float64   │ Float64 │ Float64 │
├─────┼───────────┼─────────┼───────────┼───────────┼─────────┼─────────┤
│ 1   │ 0.441044  │ 5088.2  │ 0.0284397 │ 0.0284876 │ NaN     │ NaN     │

│ Row │ bq25    │ bq75    │ bmin    │ bmax    │
│     │ Float64 │ Float64 │ Float64 │ Float64 │
├─────┼─────────┼─────────┼─────────┼─────────┤
│ 1   │ NaN     │ NaN     │ NaN     │ NaN     │

Indeed, the imputation got rid of the missing values.
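The count column offers a second sanity check. The hourly grid between the reported tstart and tend has 730 days × 24 hours plus one endpoint, which matches the count of 17521 after imputation. Assuming count reports the number of non-missing rows, the earlier value of 13055 implies that 4466 hourly slots were filled by the imputers. The sketch below reproduces this arithmetic with the Dates standard library.

```julia
using Dates

# Hourly grid spanning the reported tstart and tend (both endpoints included).
grid = DateTime(2014, 1, 1):Hour(1):DateTime(2016, 1, 1)
nslots = length(grid)

# Count reported before imputation, taken from the output above.
count_before = 13055

println("hourly slots: ", nslots)
println("slots filled by imputation: ", nslots - count_before)
```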