TS Data Discovery

We now have enough building blocks to perform data discovery on a collection of time series (TS) generated by sensors. Processing hundreds or thousands of time series is becoming a common challenge with the rapid adoption of IoT technology in buildings, manufacturing, and other industries.

In this section, we will use the transformers discussed in the previous sections to normalize time series and extract their statistical features. These extracted features will serve as input to a machine learning model, which we will train to learn the signatures of different TS types so that it can classify unknown or unlabeled sensor data.

In this tutorial, we will use TSClassifier, which works in the following context: given a collection of time series with known types, extract the statistical features of each series, use these features as inputs to a classifier whose output is the TS type, then train and test. Another option is to use these statistical features for clustering and check the cluster quality. If accuracy is poor, add more statistical features and repeat the training and testing process. Each time series used for training is assumed to be named after its type, and that name serves as the target output. For example, a temperature time series will be named temperature?.csv, where ? is any positive integer. With this setup, TSClassifier loops over each file in the training directory, computes its statistics, accumulates the statistical features into a dataframe, and trains the model to learn the input->output mapping during the fit! operation. The transform! operation then applies the learned model to the files in the testing directory.

The entire process of training to learn the appropriate parameters and of classifying unlabeled data builds on the pipeline workflow discussed in the previous sections.

Let's illustrate the process by loading some sample data:

using TSML
using Random        # needed for Random.seed!

Random.seed!(12345)
trdirname = joinpath(dirname(pathof(TSML)),"../data/realdatatsclassification/training")
tstdirname = joinpath(dirname(pathof(TSML)),"../data/realdatatsclassification/testing")
modeldirname = joinpath(dirname(pathof(TSML)),"../data/realdatatsclassification/model")
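
Under the hood, TSClassifier performs this per-file stat extraction using the same kind of pipeline covered in the previous sections. Here is a minimal sketch of the idea for a single training file; the chosen filename, the dd/mm/yyyy HH:MM date format, and the exact transformer composition are assumptions for illustration, not necessarily TSClassifier's internals:

using TSML
using Dates

# Read one CSV, aggregate to hourly values, fill missing values by nearest neighbor,
# and extract the statistical features (one row of stats for this series).
fname    = joinpath(trdirname,"AirOffTemp1.csv")
csvrdr   = CSVDateValReader(Dict(:filename=>fname,:dateformat=>"dd/mm/yyyy HH:MM"))
valgator = DateValgator(Dict(:dateinterval=>Dates.Hour(1)))
valnner  = DateValNNer(Dict(:dateinterval=>Dates.Hour(1)))
stfier   = Statifier(Dict(:processmissing=>true))

spipeline = Pipeline(Dict(:transformers=>[csvrdr,valgator,valnner,stfier]))
fit!(spipeline)
statfeatures = transform!(spipeline)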

Here's the list of files for training:

show(readdir(trdirname) |> x->filter(y->match(r".csv",y) != nothing,x))
["AirOffTemp1.csv", "AirOffTemp2.csv", "AirOffTemp3.csv", "Energy1.csv", "Energy10.csv", "Energy2.csv", "Energy3.csv", "Energy4.csv", "Energy6.csv", "Energy7.csv", "Energy8.csv", "Energy9.csv", "Pressure1.csv", "Pressure3.csv", "Pressure4.csv", "Pressure6.csv", "RetTemp11.csv", "RetTemp21.csv", "RetTemp41.csv", "RetTemp51.csv"]

and here are the files in the testing directory:

show(readdir(tstdirname) |> x->filter(y->match(r".csv",y) != nothing,x))
["AirOffTemp4.csv", "AirOffTemp5.csv", "Energy5.csv", "Pressure5.csv", "RetTemp31.csv"]

The files in the testing directory don't need to be labeled, but we label them here to validate the effectiveness of the classifier. The labels serve as the ground truth during prediction/classification.
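
Since the type is just the alphabetic prefix of each filename, we can quickly tally how many training examples each type has. This is a small sanity check; the prefix regex below is an illustrative assumption:

trfiles = filter(y->occursin(r"\.csv$",y), readdir(trdirname))
trtypes = map(f->match(r"^[A-Za-z]+",f).match, trfiles)
foreach(t->println(t," => ",count(==(t),trtypes)), unique(trtypes))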

TSClassifier

Let us now set up an instance of TSClassifier and pass the arguments containing the directory locations of the files for training and testing, as well as the directory where the model will be stored.

using TSML

tscl = TSClassifier(Dict(:trdirectory=>trdirname,   # labeled training files
          :tstdirectory=>tstdirname,                # files to classify (testing)
          :modeldirectory=>modeldirname,            # where the trained model and its stat features are saved
          :feature_range => 6:20,                   # stat-feature columns to use as model inputs
          :num_trees=>20)                           # number of trees in the random forest
       )

Time to train our TSClassifier to learn the mapping between the extracted statistical features and the TS type.

julia> fit!(tscl);
getting stats of AirOffTemp1.csv
getting stats of AirOffTemp2.csv
getting stats of AirOffTemp3.csv
getting stats of Energy1.csv
getting stats of Energy10.csv
getting stats of Energy2.csv
getting stats of Energy3.csv
getting stats of Energy4.csv
getting stats of Energy6.csv
getting stats of Energy7.csv
getting stats of Energy8.csv
getting stats of Energy9.csv
getting stats of Pressure1.csv
getting stats of Pressure3.csv
getting stats of Pressure4.csv
getting stats of Pressure6.csv
getting stats of RetTemp11.csv
skipping RetTemp21.csv: ErrorException("Nearest Neigbour algo failed to replace missings")
skipping RetTemp41.csv: ErrorException("Nearest Neigbour algo failed to replace missings")
getting stats of RetTemp51.csv

We can examine the extracted features that the model saved and used for its training.

using CSV, DataFrames

mdirname = tscl.model[:modeldirectory]
modelfname = tscl.model[:juliarfmodelname]

trstatfname = joinpath(mdirname,modelfname*".csv")
res = CSV.read(trstatfname,DataFrame)
julia> first(res,5)
5×22 DataFrame
 Row │ tstart               tend                 sfreq     count  max        m ⋯
     │ DateTime             DateTime             Float64   Int64  Float64    F ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ 2012-12-01T00:00:00  2013-01-01T00:00:00  0.998658    745  8.9        3 ⋯
   2 │ 2012-12-01T00:00:00  2013-01-01T00:00:00  0.998658    745  5.2        2
   3 │ 2012-12-01T00:00:00  2013-01-01T00:00:00  0.998658    745  2.0        0
   4 │ 2017-10-01T00:00:00  2018-10-30T23:00:00  0.999895   9480  2.73733e7  2
   5 │ 2014-01-01T00:00:00  2014-12-31T23:00:00  0.999886   8760  6.5        1 ⋯
                                                              17 columns omitted
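
We can also list which statistical features were saved and check how many training series contributed rows; this should correspond to the files that were not skipped during fit!. Both calls are standard DataFrames functions:

show(names(res))    # column names of the stat-feature table
nrow(res)           # number of training series that contributed stats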

Let's check the prediction accuracy on the test data using the transform! function.

julia> dfresults = transform!(tscl);
getting stats of AirOffTemp4.csv
getting stats of AirOffTemp5.csv
getting stats of Energy5.csv
getting stats of Pressure5.csv
skipping RetTemp31.csv: ErrorException("Nearest Neigbour algo failed to replace missings")
loading model from file: /home/runner/work/TSML.jl/TSML.jl/src/../data/realdatatsclassification/model/juliarfmodel.serialized
julia> first(dfresults,5)
4×2 DataFrame
 Row │ fname            predtype
     │ String           SubStrin…
─────┼─────────────────────────────
   1 │ AirOffTemp4.csv  AirOffTemp
   2 │ AirOffTemp5.csv  AirOffTemp
   3 │ Energy5.csv      Energy
   4 │ Pressure5.csv    Pressure

The table above shows the prediction for each filename; the filename itself encodes the ground truth. We can compute the accuracy by extracting the TS type from each filename and comparing it with the corresponding prediction. Below computes the prediction accuracy:

prediction = dfresults.predtype
fnames = dfresults.fname
myregex = r"(?<dtype>[A-Z _ - a-z]+)(?<number>\d*).(?<ext>\w+)"
groundtruth=map(fnames) do fname
  mymatch=match(myregex,fname)
  mymatch[:dtype]
end
julia> sum(groundtruth .== prediction) / length(groundtruth) * 100
100.0
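
To see where any misclassifications occur, we can also break the accuracy down per TS type. Here is a minimal sketch using standard DataFrames grouping; the column and variable names are illustrative:

using DataFrames

# group the (ground truth, prediction) pairs by type and compute the percentage
# of correct predictions within each group
preds = DataFrame(truth=groundtruth, pred=prediction)
perclass = combine(groupby(preds,:truth),
                   [:truth,:pred] => ((t,p)->sum(t .== p)/length(t)*100) => :accuracy)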

Of course, we need more data to split between training and testing in order to improve accuracy and get a more stable measurement of performance.