Statistical Metrics

Each time series (TS) can be evaluated to extract statistical features, which can be used for data quality assessment, data discovery by clustering and classification, and anomaly characterization, among others.

TSML relies on Statifier to compute statistical metrics on the TS. It can be configured to extract statistics of the missing blocks in addition to those of the non-missing elements. The scalar statistics it computes include: pacf, acf, autocor, quartiles, mean, median, max, min, kurtosis, skewness, variation, standard error, entropy, etc. It takes a single argument, :processmissing (default true), which indicates whether to include the statistics of missing data.
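To make the idea of scalar statistics concrete, here is a minimal sketch that summarizes the non-missing values of a small vector using only Julia's Statistics standard library. The helper basicstats and the sample data are hypothetical and are not part of TSML; the exact formulas Statifier uses may differ.

```julia
using Statistics

# Hypothetical helper (not part of TSML): summarize the non-missing values.
function basicstats(v::AbstractVector)
    x = collect(skipmissing(v))      # drop missing entries before computing
    return (count  = length(x),
            min    = minimum(x),
            max    = maximum(x),
            mean   = mean(x),
            median = median(x),
            q25    = quantile(x, 0.25),
            q75    = quantile(x, 0.75))
end

# Made-up sample data with gaps
vals = [0.9, missing, 0.3, 0.1, missing, 0.5]
s = basicstats(vals)
println(s)
```

The result is a named tuple, so individual statistics can be read as s.mean, s.q25, and so on.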

Let us again start by generating artificial data with missing values using the generateDataWithMissing() function described at the beginning of the tutorial.

julia> X = generateDataWithMissing();
julia> first(X,15)
15×2 DataFrame
 Row │ Date                 Value
     │ DateTime             Float64?
─────┼─────────────────────────────────────
   1 │ 2014-01-01T00:00:00        0.9063
   2 │ 2014-01-01T00:15:00  missing
   3 │ 2014-01-01T00:30:00  missing
   4 │ 2014-01-01T00:45:00  missing
   5 │ 2014-01-01T01:00:00  missing
   6 │ 2014-01-01T01:15:00        0.334152
   7 │ 2014-01-01T01:30:00  missing
   8 │ 2014-01-01T01:45:00  missing
   9 │ 2014-01-01T02:00:00  missing
  10 │ 2014-01-01T02:15:00  missing
  11 │ 2014-01-01T02:30:00  missing
  12 │ 2014-01-01T02:45:00        0.136551
  13 │ 2014-01-01T03:00:00  missing
  14 │ 2014-01-01T03:15:00  missing
  15 │ 2014-01-01T03:30:00  missing

Statifier for Both Non-Missing and Missing Values

TSML includes the Statifier transformer, which computes scalar statistics to characterize time series data. By default, it also computes statistics of the missing blocks. To disable this feature, pass :processmissing => false when creating the instance. The following illustrates the workflow.

using Dates
using TSML

# Aggregate the values into hourly intervals
dtvalgator = DateValgator(Dict(:dateinterval => Dates.Hour(1)))
# Imputers used later in this section
dtvalnner = DateValNNer(Dict(:dateinterval => Dates.Hour(1)))
dtvalizer = DateValizer(Dict(:dateinterval => Dates.Hour(1)))
# Compute statistics, including those of the missing blocks
stfier = Statifier(Dict(:processmissing => true))

mypipeline = dtvalgator |> stfier

results = fit_transform!(mypipeline,X)

julia> show(results,allcols=true)
1×26 DataFrame
 Row │ tstart               tend                 sfreq     count  max       min         median   mean      q1        q2        q25       q75       q8        q9        kurtosis   skewness   variation  entropy  autocor    pacf       bmedian  bmean    bq25     bq75     bmin     bmax
     │ DateTime             DateTime             Float64   Int64  Float64   Float64     Float64  Float64   Float64   Float64   Float64   Float64   Float64   Float64   Float64    Float64    Float64    Float64  Float64    Float64    Float64  Float64  Float64  Float64  Float64  Float64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ 2014-01-01T00:00:00  2016-01-01T00:00:00  0.999943  12932  0.999752  5.16098e-5  0.49567  0.497334  0.146138  0.252456  0.296718  0.695049  0.742278  0.850896  -0.921986  0.0136205   0.512859  3531.43  0.0410337  0.0414848      1.0   1.3585      1.0      2.0      1.0      7.0
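The b-prefixed columns (bmedian, bmean, bq25, bq75, bmin, bmax) summarize the missing blocks. A plausible reading, consistent with the values above but stated here as an assumption, is that they describe the lengths of consecutive runs of missing values. The sketch below computes such run lengths with a hypothetical helper, missingruns, on made-up data; it is not TSML's implementation.

```julia
using Statistics

# Hypothetical helper (not part of TSML): lengths of consecutive runs of `missing`.
function missingruns(v::AbstractVector)
    runs = Int[]
    len = 0
    for x in v
        if ismissing(x)
            len += 1                      # extend the current run
        else
            len > 0 && push!(runs, len)   # a run just ended, record it
            len = 0
        end
    end
    len > 0 && push!(runs, len)           # close a run that reaches the end
    return runs
end

# Made-up sample data
vals = [0.91, missing, missing, missing, missing, 0.33, missing, missing, 0.14, missing]
runs = missingruns(vals)

# Summaries analogous to the b-prefixed columns (assumed meaning)
blockstats = (bmin = minimum(runs), bmax = maximum(runs),
              bmean = mean(runs), bmedian = median(runs))
println(runs)
println(blockstats)
```

Under this reading, a bmax of 7.0 would mean that the longest gap in the aggregated series spans seven consecutive intervals.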

Statifier for Non-Missing Values only

If you are not interested in the statistics of the missing blocks, you can disable the missing-block summary by passing :processmissing => false as the instance argument:

stfier = Statifier(Dict(:processmissing=>false))

mypipeline = dtvalgator |> stfier

results = fit_transform!(mypipeline,X)

julia> show(results,allcols=true)
1×20 DataFrame
 Row │ tstart               tend                 sfreq     count  max       min         median   mean      q1        q2        q25       q75       q8        q9        kurtosis   skewness   variation  entropy  autocor    pacf
     │ DateTime             DateTime             Float64   Int64  Float64   Float64     Float64  Float64   Float64   Float64   Float64   Float64   Float64   Float64   Float64    Float64    Float64    Float64  Float64    Float64
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ 2014-01-01T00:00:00  2016-01-01T00:00:00  0.999943  12932  0.999752  5.16098e-5  0.49567  0.497334  0.146138  0.252456  0.296718  0.695049  0.742278  0.850896  -0.921986  0.0136205   0.512859  3531.43  0.0410337  0.0414848

Statifier After Imputation

Let us check the statistics after imputation by adding a DateValNNer instance to the pipeline. If the imputation succeeds, we expect all missing-block statistics to be NaN, because statistics of an empty set are NaN.

stfier = Statifier(Dict(:processmissing=>true))

mypipeline = dtvalgator |> dtvalnner |> stfier

results = fit_transform!(mypipeline,X)

julia> show(results,allcols=true)
1×26 DataFrame
 Row │ tstart               tend                 sfreq     count  max       min         median    mean      q1        q2       q25       q75       q8        q9        kurtosis   skewness  variation  entropy  autocor   pacf      bmedian  bmean    bq25     bq75     bmin     bmax
     │ DateTime             DateTime             Float64   Int64  Float64   Float64     Float64   Float64   Float64   Float64  Float64   Float64   Float64   Float64   Float64    Float64   Float64    Float64  Float64   Float64   Float64  Float64  Float64  Float64  Float64  Float64
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ 2014-01-01T00:00:00  2016-01-01T00:00:00  0.999943  17521  0.999752  5.16098e-5  0.495285  0.496761  0.164574  0.27195  0.315593  0.678395  0.722277  0.830334  -0.789767  0.016735   0.487173  4923.45  0.284516  0.275754      NaN      NaN      NaN      NaN      NaN      NaN

As expected, the imputation is successful: the missing-block statistics are all NaN and the count rises from 12932 to 17521, so no missing values remain in the processed time series dataset.
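The NaN entries are what one gets when summarizing an empty collection of missing blocks. In Julia, the mean of an empty Float64 vector evaluates to NaN, while functions such as median and maximum raise an error on empty input. The sketch below shows this and a guarded wrapper; safestat is a hypothetical helper, and how Statifier handles the empty case internally is an assumption.

```julia
using Statistics

runs = Float64[]        # no missing blocks remain after imputation
m = mean(runs)          # sum of nothing divided by zero gives NaN
println(isnan(m))

# Hypothetical guard (not part of TSML): return NaN for any statistic on empty input
safestat(f, x) = isempty(x) ? NaN : f(x)

println(safestat(median, runs))        # NaN instead of an error
println(safestat(maximum, [1.0, 3.0])) # non-empty input is passed through to f
```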

Let us try the other imputer, DateValizer, and confirm from the statistics that no missing values remain.

stfier = Statifier(Dict(:processmissing=>true))

mypipeline = dtvalgator |> dtvalizer |> stfier

results = fit_transform!(mypipeline,X)

julia> show(results,allcols=true)
1×26 DataFrame
 Row │ tstart               tend                 sfreq     count  max       min         median    mean      q1        q2        q25       q75       q8        q9        kurtosis   skewness  variation  entropy  autocor    pacf       bmedian  bmean    bq25     bq75     bmin     bmax
     │ DateTime             DateTime             Float64   Int64  Float64   Float64     Float64   Float64   Float64   Float64   Float64   Float64   Float64   Float64   Float64    Float64   Float64    Float64  Float64    Float64    Float64  Float64  Float64  Float64  Float64  Float64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ 2014-01-01T00:00:00  2016-01-01T00:00:00  0.999943  17521  0.999752  5.16098e-5  0.494636  0.497349  0.184767  0.317136  0.371291  0.621298  0.677163  0.808563  -0.192232  0.015616   0.440888  5124.35  0.0518163  0.0515352      NaN      NaN      NaN      NaN      NaN      NaN

Indeed, the imputation got rid of the missing values.
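As a cross-check on the count column, the value 17521 reported after imputation equals the number of hourly timestamps between tstart and tend, inclusive. The sketch below verifies this with the Dates standard library only; the variable names are illustrative.

```julia
using Dates

tstart = DateTime(2014, 1, 1)
tend   = DateTime(2016, 1, 1)

# Hourly grid from tstart to tend, both endpoints included
grid = tstart:Hour(1):tend
npoints = length(grid)
println(npoints)

# Share of hourly slots that held a value before imputation (count was 12932)
println(12932 / npoints)
```

A count equal to the grid length indicates that every hourly slot holds a value after imputation.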