Statistical Metrics

Each time series (TS) can be summarized by its statistical features, which are useful for data quality assessment, data discovery through clustering and classification, and anomaly characterization, among other tasks.

TSML relies on the Statifier transformer to compute statistical metrics of a TS. It can be configured to report statistics of missing blocks in addition to those of the non-missing elements. The scalar statistics it computes include pacf, acf, autocor, quartiles, mean, median, max, min, kurtosis, skewness, variation, standard error, and entropy, among others. It takes a single argument, :processmissing => true, which indicates whether to include the statistics of missing data.
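To make the idea of scalar statistics over the non-missing elements concrete, here is a minimal sketch that uses only Julia's Statistics standard library. The helper name basicstats and the hand-written lag-1 autocorrelation are assumptions made for illustration; they are not Statifier's implementation, and its definitions may differ.

```julia
using Statistics

# Hypothetical helper (not part of TSML): a few scalar statistics
# computed over the non-missing values of a vector.
function basicstats(v::AbstractVector)
    x = collect(skipmissing(v))      # drop missing entries
    m = mean(x)
    d = x .- m
    # Lag-1 autocorrelation computed by hand. Skipping missings joins
    # values across gaps, so this is only a rough illustration.
    ac1 = sum(d[1:end-1] .* d[2:end]) / sum(d .^ 2)
    return (count = length(x), min = minimum(x), max = maximum(x),
            mean = m, median = median(x),
            q25 = quantile(x, 0.25), q75 = quantile(x, 0.75),
            lag1autocor = ac1)
end

v = [missing, 0.1, 0.4, missing, 0.2, 0.3]
s = basicstats(v)
println(s)
```

Returning a NamedTuple keeps each statistic addressable by name (for example s.median), similar in spirit to the one-row table that Statifier produces.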

Let us again start by generating artificial data with missing values using the generateDataWithMissing() function described at the beginning of the tutorial.

X = generateDataWithMissing()
first(X,15)

15 rows × 2 columns

     Date                 Value
     Dates…               Float64⍰
1    2014-01-01T00:00:00  missing
2    2014-01-01T00:15:00  missing
3    2014-01-01T00:30:00  missing
4    2014-01-01T00:45:00  missing
5    2014-01-01T01:00:00  missing
6    2014-01-01T01:15:00  missing
7    2014-01-01T01:30:00  missing
8    2014-01-01T01:45:00  0.0521332
9    2014-01-01T02:00:00  0.26864
10   2014-01-01T02:15:00  0.108871
11   2014-01-01T02:30:00  0.163666
12   2014-01-01T02:45:00  0.473017
13   2014-01-01T03:00:00  0.865412
14   2014-01-01T03:15:00  missing
15   2014-01-01T03:30:00  missing

Statifier for Both Non-Missing and Missing Values

TSML includes the Statifier transformer, which computes scalar statistics to characterize time series data. By default, it also computes statistics of the missing blocks of data. To disable this feature, pass :processmissing => false in the argument dictionary when creating the instance. The workflow below uses the default behavior.

using Dates
using TSML
using TSML.TSMLTypes
using TSML.TSMLTransformers
using TSML: Pipeline
using TSML: DateValgator
using TSML: DateValNNer
using TSML: Statifier

dtvalgator = DateValgator(Dict(:dateinterval => Dates.Hour(1)))
dtvalnner = DateValNNer(Dict(:dateinterval => Dates.Hour(1)))
dtvalizer = DateValizer(Dict(:dateinterval => Dates.Hour(1)))
stfier = Statifier(Dict(:processmissing => true))

mypipeline = Pipeline(
  Dict( :transformers => [
            dtvalgator,
            stfier
         ]
  )
)

fit!(mypipeline,X)
results = transform!(mypipeline,X)

1 rows × 26 columns (shown transposed, one line per column)

Column     Type     Value
tstart     Dates…   2014-01-01T00:00:00
tend       Dates…   2016-01-01T00:00:00
sfreq      Float64  0.999943
count      Int64    13055
max        Float64  0.999751
min        Float64  0.000456433
median     Float64  0.500944
mean       Float64  0.502196
q1         Float64  0.146624
q2         Float64  0.251897
q25        Float64  0.304154
q75        Float64  0.701455
q8         Float64  0.750592
q9         Float64  0.860293
kurtosis   Float64  -0.928492
skewness   Float64  0.00290679
variation  Float64  0.510627
entropy    Float64  3544.12
autocor    Float64  0.0475025
pacf       Float64  0.0477302
bmedian    Float64  1.0
bmean      Float64  1.3268
bq25       Float64  1.0
bq75       Float64  1.0
bmin       Float64  1.0
bmax       Float64  6.0
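The b-prefixed columns in the output above summarize the missing blocks. The sketch below shows one plausible reading of them, assuming a missing block is a run of consecutive missing values and the statistics are taken over the run lengths. The source does not spell out this definition (the reported bmin of 1.0 and bmax of 6.0 are consistent with it), and missingblocklengths is a hypothetical helper, not a TSML function.

```julia
using Statistics

# Hypothetical helper (not part of TSML): lengths of consecutive
# runs of `missing` values in a vector.
function missingblocklengths(v::AbstractVector)
    lengths = Int[]
    runlen = 0
    for x in v
        if ismissing(x)
            runlen += 1                  # extend the current block
        elseif runlen > 0
            push!(lengths, runlen)       # a block just ended
            runlen = 0
        end
    end
    runlen > 0 && push!(lengths, runlen) # block that reaches the end
    return lengths
end

v = [missing, missing, 0.3, 0.7, missing, 0.1, missing, missing, missing]
blocks = missingblocklengths(v)

# Assumed meaning of the b-columns: statistics over the block lengths.
bstats = (bmin = minimum(blocks), bmax = maximum(blocks),
          bmean = mean(blocks), bmedian = median(blocks))
println(blocks)
println(bstats)
```

For this toy vector the blocks have lengths 2, 1, and 3, so the shortest block spans one sample and the longest spans three.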

Statifier for Non-Missing Values only

If you are not interested in the statistics of the missing blocks, you can disable the missing-block summary by passing :processmissing => false in the instance argument:

stfier = Statifier(Dict(:processmissing=>false))
mypipeline = Pipeline(
  Dict( :transformers => [
            dtvalgator,
            stfier
         ]
  )
)
fit!(mypipeline,X)
results = transform!(mypipeline,X)

1 rows × 20 columns (shown transposed, one line per column)

Column     Type     Value
tstart     Dates…   2014-01-01T00:00:00
tend       Dates…   2016-01-01T00:00:00
sfreq      Float64  0.999943
count      Int64    13055
max        Float64  0.999751
min        Float64  0.000456433
median     Float64  0.500944
mean       Float64  0.502196
q1         Float64  0.146624
q2         Float64  0.251897
q25        Float64  0.304154
q75        Float64  0.701455
q8         Float64  0.750592
q9         Float64  0.860293
kurtosis   Float64  -0.928492
skewness   Float64  0.00290679
variation  Float64  0.510627
entropy    Float64  3544.12
autocor    Float64  0.0475025
pacf       Float64  0.0477302

Statifier After Imputation

Let us check the statistics after imputation by adding a DateValNNer instance to the pipeline. If the imputation is successful, we expect the stats for missing blocks to all be NaN, because the statistics of an empty set are NaN.

stfier = Statifier(Dict(:processmissing=>true))
mypipeline = Pipeline(
  Dict( :transformers => [
            dtvalgator,
            dtvalnner,
            stfier
         ]
  )
)
fit!(mypipeline,X)
results = transform!(mypipeline,X)

1 rows × 26 columns (shown transposed, one line per column)

Column     Type     Value
tstart     Dates…   2014-01-01T00:00:00
tend       Dates…   2016-01-01T00:00:00
sfreq      Float64  0.999943
count      Int64    17521
max        Float64  0.999751
min        Float64  0.000456433
median     Float64  0.50022
mean       Float64  0.501232
q1         Float64  0.167654
q2         Float64  0.274743
q25        Float64  0.320764
q75        Float64  0.680764
q8         Float64  0.7263
q9         Float64  0.838924
kurtosis   Float64  -0.789838
skewness   Float64  0.00479609
variation  Float64  0.485412
entropy    Float64  4896.49
autocor    Float64  0.279811
pacf       Float64  0.273143
bmedian    Float64  NaN
bmean      Float64  NaN
bq25       Float64  NaN
bq75       Float64  NaN
bmin       Float64  NaN
bmax       Float64  NaN

As we expected, the imputation is successful and there are no more missing values in the processed time series dataset.
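The NaN entries are what one would expect when there are no missing blocks left to summarize. The sketch below illustrates this with Julia's Statistics standard library. How Statifier itself produces NaN is not shown in the source, and nanifempty is a hypothetical helper introduced only for this example.

```julia
using Statistics

blocks = Int[]            # no missing blocks remain after imputation

m = mean(blocks)          # 0 / 0 in floating point, which is NaN

# Hypothetical guard: median, minimum, and maximum throw an error on
# empty input, so the empty case is mapped to NaN explicitly here.
nanifempty(f, x) = isempty(x) ? NaN : float(f(x))

bstats = (bmean   = nanifempty(mean, blocks),
          bmedian = nanifempty(median, blocks),
          bmin    = nanifempty(minimum, blocks),
          bmax    = nanifempty(maximum, blocks))
println(bstats)
```

Because NaN is never equal to itself, test these values with isnan rather than with ==.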

Let's try the other imputation method, DateValizer, and use the stats to confirm that no missing values remain.

stfier = Statifier(Dict(:processmissing=>true))
mypipeline = Pipeline(
  Dict( :transformers => [
            dtvalgator,
            dtvalizer,
            stfier
         ]
  )
)
fit!(mypipeline,X)
results = transform!(mypipeline,X)

1 rows × 26 columns (shown transposed, one line per column)

Column     Type     Value
tstart     Dates…   2014-01-01T00:00:00
tend       Dates…   2016-01-01T00:00:00
sfreq      Float64  0.999943
count      Int64    17521
max        Float64  0.999751
min        Float64  0.000456433
median     Float64  0.500748
mean       Float64  0.502152
q1         Float64  0.185242
q2         Float64  0.319607
q25        Float64  0.377796
q75        Float64  0.625371
q8         Float64  0.685253
q9         Float64  0.820227
kurtosis   Float64  -0.225831
skewness   Float64  0.00396169
variation  Float64  0.441044
entropy    Float64  5088.2
autocor    Float64  0.0284397
pacf       Float64  0.0284876
bmedian    Float64  NaN
bmean      Float64  NaN
bq25       Float64  NaN
bq75       Float64  NaN
bmin       Float64  NaN
bmax       Float64  NaN

Indeed, the imputation got rid of the missing values.
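The same conclusion can be cross-checked with a little arithmetic on the reported values. The sketch below uses only the Dates standard library and plain NamedTuples as stand-ins for the result rows. The helper nomissingblocks is hypothetical, and the numbers are copied from the outputs above.

```julia
using Dates

# Number of hourly timestamps between tstart and tend, inclusive.
expected = length(DateTime(2014, 1, 1):Hour(1):DateTime(2016, 1, 1))

count_before = 13055      # count reported without imputation
count_after  = 17521      # count reported after imputation
filled = count_after - count_before

# Stand-ins for the missing-block columns of the result rows.
before = (bmedian = 1.0, bmean = 1.3268, bq25 = 1.0, bq75 = 1.0, bmin = 1.0, bmax = 6.0)
after  = (bmedian = NaN, bmean = NaN, bq25 = NaN, bq75 = NaN, bmin = NaN, bmax = NaN)

# Hypothetical helper: true when every missing-block statistic is NaN.
nomissingblocks(r) = all(isnan, values(r))

println((expected = expected, filled = filled))
println((before = nomissingblocks(before), after = nomissingblocks(after)))
```

The count after imputation matches the number of hourly slots in the two-year range, which agrees with the NaN missing-block statistics: every slot now holds a value.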