Statistical Metrics

Each time series (TS) can be evaluated to extract its statistical features, which can be used for data quality assessment, data discovery through clustering and classification, and anomaly characterization, among others.

TSML relies on Statifier to compute statistical metrics on the TS. It can be configured to extract the statistics of the missing blocks in addition to those of the non-missing elements. The scalar statistics it uses include pacf, acf, autocor, quartiles, mean, median, max, min, kurtosis, skewness, variation, standard error, and entropy. It takes a single argument, :processmissing => true, which indicates whether to include the statistics of missing data.
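
To make the idea of scalar statistics over the non-missing elements concrete, here is a minimal sketch that uses only Julia's Statistics standard library. The helper name basicstats and the returned field names are hypothetical, chosen for illustration. This is not Statifier's implementation and it covers only a few of the statistics listed above.

```julia
using Statistics

# Hypothetical helper, not part of TSML: summarize the non-missing values only.
function basicstats(values::AbstractVector{<:Union{Missing,Real}})
    data = collect(skipmissing(values))   # drop the missing entries
    isempty(data) && error("no non-missing values to summarize")
    return (
        count  = length(data),
        min    = minimum(data),
        max    = maximum(data),
        mean   = mean(data),
        median = median(data),
        q25    = quantile(data, 0.25),
        q75    = quantile(data, 0.75),
    )
end

v = [missing, 1.0, 2.0, missing, 3.0, 4.0]
s = basicstats(v)
println(s)
```

Returning a named tuple keeps each statistic addressable by name (for example s.median), which is similar in spirit to the one-row table of named columns that Statifier produces.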

Let us again start by generating artificial data with missing values using the generateDataWithMissing() function described at the beginning of the tutorial.

X = generateDataWithMissing()
first(X,15)

15 rows × 2 columns

	Date	Value
	Dates…	Float64⍰
1	2014-01-01T00:00:00	missing
2	2014-01-01T00:15:00	missing
3	2014-01-01T00:30:00	missing
4	2014-01-01T00:45:00	missing
5	2014-01-01T01:00:00	missing
6	2014-01-01T01:15:00	missing
7	2014-01-01T01:30:00	missing
8	2014-01-01T01:45:00	0.0521332
9	2014-01-01T02:00:00	0.26864
10	2014-01-01T02:15:00	0.108871
11	2014-01-01T02:30:00	0.163666
12	2014-01-01T02:45:00	0.473017
13	2014-01-01T03:00:00	0.865412
14	2014-01-01T03:15:00	missing
15	2014-01-01T03:30:00	missing

Statifier for Both Non-Missing and Missing Values

TSML includes the Statifier transformer, which computes scalar statistics to characterize the time series data. By default, it also computes statistics of the missing blocks of data. To disable this feature, pass :processmissing => false as the argument when creating the instance. The code below illustrates the workflow with missing-block statistics enabled.

using Dates
using TSML
using TSML.TSMLTypes
using TSML.TSMLTransformers
using TSML: Pipeline
using TSML: DateValgator
using TSML: DateValNNer
using TSML: Statifier
using TSML: DateValizer

dtvalgator = DateValgator(Dict(:dateinterval => Dates.Hour(1)))
dtvalnner = DateValNNer(Dict(:dateinterval => Dates.Hour(1)))
dtvalizer = DateValizer(Dict(:dateinterval => Dates.Hour(1)))
stfier = Statifier(Dict(:processmissing => true))

mypipeline = Pipeline(
  Dict( :transformers => [
            dtvalgator,
            stfier
         ]
  )
)

fit!(mypipeline,X)
results = transform!(mypipeline,X)

1 rows × 26 columns

	tstart	tend	sfreq	count	max	min	median	mean	q1	q2	q25	q75	q8	q9	kurtosis	skewness	variation	entropy	autocor	pacf	bmedian	bmean	bq25	bq75	bmin	bmax
	Dates…	Dates…	Float64	Int64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64
1	2014-01-01T00:00:00	2016-01-01T00:00:00	0.999943	13055	0.999751	0.000456433	0.500944	0.502196	0.146624	0.251897	0.304154	0.701455	0.750592	0.860293	-0.928492	0.00290679	0.510627	3544.12	0.0475025	0.0477302	1.0	1.3268	1.0	1.0	1.0	6.0
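
The columns prefixed with b (bmedian, bmean, bq25, bq75, bmin, bmax) describe the missing blocks. As a rough sketch, and assuming these columns summarize the lengths of consecutive runs of missing values, such statistics could be computed as shown below. The helper missingblocklengths is hypothetical and uses only the standard library. It is not the code Statifier runs.

```julia
using Statistics

# Hypothetical helper, not part of TSML: lengths of consecutive runs of `missing`.
function missingblocklengths(values::AbstractVector)
    lens = Int[]
    runlen = 0
    for v in values
        if ismissing(v)
            runlen += 1              # extend the current run of missing values
        elseif runlen > 0
            push!(lens, runlen)      # a run just ended at a non-missing value
            runlen = 0
        end
    end
    runlen > 0 && push!(lens, runlen) # close a run that reaches the end
    return lens
end

v = [missing, missing, 0.3, 0.7, missing, 0.1, missing, missing, missing]
lens = missingblocklengths(v)
blockstats = (bmin = minimum(lens), bmax = maximum(lens),
              bmean = mean(lens), bmedian = median(lens))
println(lens)
println(blockstats)
```

Under this reading, bmax = 6.0 in the table above would mean that the longest gap spans six consecutive hourly slots after aggregation.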

Statifier for Non-Missing Values only

If you are not interested in the statistics of the missing blocks, you can disable the missing-block summary by passing :processmissing => false in the instance argument:

stfier = Statifier(Dict(:processmissing=>false))
mypipeline = Pipeline(
  Dict( :transformers => [
            dtvalgator,
            stfier
         ]
  )
)
fit!(mypipeline,X)
results = transform!(mypipeline,X)

1 rows × 20 columns

	tstart	tend	sfreq	count	max	min	median	mean	q1	q2	q25	q75	q8	q9	kurtosis	skewness	variation	entropy	autocor	pacf
	Dates…	Dates…	Float64	Int64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64
1	2014-01-01T00:00:00	2016-01-01T00:00:00	0.999943	13055	0.999751	0.000456433	0.500944	0.502196	0.146624	0.251897	0.304154	0.701455	0.750592	0.860293	-0.928492	0.00290679	0.510627	3544.12	0.0475025	0.0477302

Statifier After Imputation

Let us check the statistics after imputation by adding a DateValNNer instance to the pipeline. If the imputation is successful, we expect the stats for the missing blocks to all be NaN, because the statistics of an empty set are NaN.

stfier = Statifier(Dict(:processmissing=>true))
mypipeline = Pipeline(
  Dict( :transformers => [
            dtvalgator,
            dtvalnner,
            stfier
         ]
  )
)
fit!(mypipeline,X)
results = transform!(mypipeline,X)

1 rows × 26 columns

	tstart	tend	sfreq	count	max	min	median	mean	q1	q2	q25	q75	q8	q9	kurtosis	skewness	variation	entropy	autocor	pacf	bmedian	bmean	bq25	bq75	bmin	bmax
	Dates…	Dates…	Float64	Int64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64
1	2014-01-01T00:00:00	2016-01-01T00:00:00	0.999943	17521	0.999751	0.000456433	0.50022	0.501232	0.167654	0.274743	0.320764	0.680764	0.7263	0.838924	-0.789838	0.00479609	0.485412	4896.49	0.279811	0.273143	NaN	NaN	NaN	NaN	NaN	NaN

As expected, the imputation is successful: the missing-block statistics are all NaN and the count rises from 13055 to 17521, so there are no more missing values in the processed time series dataset.
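
The NaN values follow from summarizing an empty collection of missing blocks. In Julia, mean of an empty numeric vector already yields NaN, while median and maximum raise an error on empty input. One way to get a uniform NaN is a small guard like the hypothetical safestat below. This is a sketch of the idea only, and it is an assumption that Statifier handles the empty case in a comparable way.

```julia
using Statistics

# Hypothetical guard, not part of TSML: NaN when there is nothing to summarize.
safestat(f, xs) = isempty(xs) ? NaN : Float64(f(xs))

blocklens = Int[]                     # no missing runs left after imputation
bmean   = safestat(mean, blocklens)
bmedian = safestat(median, blocklens)
bmax    = safestat(maximum, blocklens)
allnan  = all(isnan, (bmean, bmedian, bmax))
println(allnan)

# With at least one missing run, the guard simply forwards to the statistic.
println(safestat(maximum, [2, 1, 3]))
```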

Let us try the other imputer, DateValizer, and validate from the stats that there are no more missing values.

stfier = Statifier(Dict(:processmissing=>true))
mypipeline = Pipeline(
  Dict( :transformers => [
            dtvalgator,
            dtvalizer,
            stfier
         ]
  )
)
fit!(mypipeline,X)
results = transform!(mypipeline,X)

1 rows × 26 columns

	tstart	tend	sfreq	count	max	min	median	mean	q1	q2	q25	q75	q8	q9	kurtosis	skewness	variation	entropy	autocor	pacf	bmedian	bmean	bq25	bq75	bmin	bmax
	Dates…	Dates…	Float64	Int64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64
1	2014-01-01T00:00:00	2016-01-01T00:00:00	0.999943	17521	0.999751	0.000456433	0.500748	0.502152	0.185242	0.319607	0.377796	0.625371	0.685253	0.820227	-0.225831	0.00396169	0.441044	5088.2	0.0284397	0.0284876	NaN	NaN	NaN	NaN	NaN	NaN

Indeed, the imputation got rid of the missing values.