TS Data Discovery

We now have enough building blocks to perform data discovery on a collection of time series generated by sensors. Processing hundreds or thousands of time series is becoming a common challenge with the rapid adoption of IoT technology in buildings, manufacturing industries, etc.

In this section, we will use the transformers discussed in the previous sections to normalize time series and extract their statistical features. These extracted features will serve as input to a machine learning model. We will train this model to learn the signatures of different TS types so that it can classify unknown or unlabeled sensor data.

In this tutorial, we will use TSClassifier, which works in the following context: given a collection of time series with known types, compute the statistical features of each series, use these features as inputs to a classifier whose output is the TS type, then train and test. (Another option is to use these statistical features for clustering and check the cluster quality. If accuracy is poor, add more statistical features and repeat the same training and testing process.) Each time series used for training is assumed to be named after its type, which serves as the target output. For example, a temperature time series is named temperature?.csv, where ? is any positive integer. Using this setup, the TSClassifier loops over each file in the training directory, computes the statistics, accumulates these statistical features into a dataframe, and trains the model to learn the input->output mapping during the fit! operation. The transform! operation then applies the learned model to the files in the testing directory.

The entire process, from training to learn the appropriate parameters to classification of unlabeled data, builds on the pipeline workflow discussed in the previous sections.
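To build intuition for this workflow before using TSClassifier itself, here is a minimal self-contained sketch. The feature set and the nearest-centroid "model" are illustrative stand-ins, not TSML's actual implementation:

```julia
using Statistics, Random

Random.seed!(1)

# Hypothetical feature extractor: a few summary statistics per series.
statfeatures(ts) = [mean(ts), std(ts), minimum(ts), maximum(ts), median(ts)]

# Labeled training series, keyed by type as in the tutorial's file-naming convention.
train = Dict("Temperature" => [20 .+ randn(100) for _ in 1:5],
             "Pressure"    => [1000 .+ 5 .* randn(100) for _ in 1:5])

# "fit!" analogue: learn one centroid of the stat features per TS type.
centroids = Dict(label => mean(statfeatures.(series))
                 for (label, series) in train)

# "transform!" analogue: classify an unlabeled series by its nearest centroid.
classify(ts) = argmin(Dict(label => sum(abs2, statfeatures(ts) .- c)
                           for (label, c) in centroids))

classify(20 .+ randn(100))   # expect "Temperature"
```

TSClassifier replaces the toy centroid model with a random forest and a much richer feature set, but the fit-then-transform shape of the pipeline is the same.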

Let's illustrate the process by loading some sample data:

using TSML
using Random

Random.seed!(12345)
trdirname = joinpath(dirname(pathof(TSML)),"../data/realdatatsclassification/training")
tstdirname = joinpath(dirname(pathof(TSML)),"../data/realdatatsclassification/testing")
modeldirname = joinpath(dirname(pathof(TSML)),"../data/realdatatsclassification/model")

Here's the list of files for training:

show(readdir(trdirname) |> x->filter(y->match(r".csv",y) != nothing,x))
["AirOffTemp1.csv", "AirOffTemp2.csv", "AirOffTemp3.csv", "Energy1.csv", "Energy10.csv", "Energy2.csv", "Energy3.csv", "Energy4.csv", "Energy6.csv", "Energy7.csv", "Energy8.csv", "Energy9.csv", "Pressure1.csv", "Pressure3.csv", "Pressure4.csv", "Pressure6.csv", "RetTemp11.csv", "RetTemp21.csv", "RetTemp41.csv", "RetTemp51.csv"]

and here are the files in the testing directory:

show(readdir(tstdirname) |> x->filter(y->match(r".csv",y) != nothing,x))
["AirOffTemp4.csv", "AirOffTemp5.csv", "Energy5.csv", "Pressure5.csv", "RetTemp31.csv"]

The files in the testing directory don't need to be labeled, but we keep the labels here to validate the effectiveness of the classifier: they serve as the ground truth during prediction/classification.

TSClassifier

Let us now set up an instance of TSClassifier, passing as arguments the directory locations of the files for training, testing, and model storage.

using TSML

tscl = TSClassifier(Dict(:trdirectory=>trdirname,
          :tstdirectory=>tstdirname,
          :modeldirectory=>modeldirname,
          :feature_range => 6:20,
          :num_trees=>20)
       )
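To get a feel for the kind of per-series statistics that end up in the feature table (the `:feature_range => 6:20` option selects which of its columns feed the classifier), here is a hedged sketch. The actual feature set and column order are TSML internals; `summarystats` is an illustrative helper, not part of the TSML API:

```julia
using Statistics

# Illustrative per-series summary statistics, loosely mirroring columns such as
# count, max, min, and mean seen in the saved feature table later in this
# tutorial. The real TSClassifier feature set is internal to TSML.
function summarystats(ts::AbstractVector{<:Real})
    (count  = length(ts),
     max    = maximum(ts),
     min    = minimum(ts),
     mean   = mean(ts),
     median = median(ts),
     std    = std(ts))
end

summarystats([8.9, 5.2, 2.0, 6.5])
```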

It is now time to train our TSClassifier to learn the mapping between the extracted statistical features and the TS type.

julia> fit!(tscl);
Progress:  10%|██▊                        |  ETA: 0:00:09 ( 0.48  s/it)
  fname:  AirOffTemp2.csv


Progress:  20%|█████▍                     |  ETA: 0:00:07 ( 0.43  s/it)
  fname:  Energy1.csv


Progress:  25%|██████▊                    |  ETA: 0:00:06 ( 0.38  s/it)
  fname:  Energy10.csv


Progress:  30%|████████▏                  |  ETA: 0:00:05 ( 0.35  s/it)
  fname:  Energy2.csv


Progress:  35%|█████████▌                 |  ETA: 0:00:04 ( 0.32  s/it)
  fname:  Energy3.csv


Progress:  40%|██████████▊                |  ETA: 0:00:04 ( 0.31  s/it)
  fname:  Energy4.csv


Progress:  45%|████████████▏              |  ETA: 0:00:03 ( 0.29  s/it)
  fname:  Energy6.csv


Progress:  50%|█████████████▌             |  ETA: 0:00:03 ( 0.28  s/it)
  fname:  Energy7.csv


Progress:  55%|██████████████▉            |  ETA: 0:00:02 ( 0.27  s/it)
  fname:  Energy8.csv


Progress:  60%|████████████████▎          |  ETA: 0:00:02 ( 0.26  s/it)
  fname:  Energy9.csv


Progress:  65%|█████████████████▌         |  ETA: 0:00:02 ( 0.25  s/it)
  fname:  Pressure1.csv


Progress:  70%|██████████████████▉        |  ETA: 0:00:02 ( 0.25  s/it)
  fname:  Pressure3.csv


Progress:  75%|████████████████████▎      |  ETA: 0:00:01 ( 0.25  s/it)
  fname:  Pressure4.csv


Progress:  80%|█████████████████████▋     |  ETA: 0:00:01 ( 0.24  s/it)
  fname:  Pressure6.csv


Progress:  85%|███████████████████████    |  ETA: 0:00:01 ( 0.24  s/it)
  fname:  RetTemp11.csv


Progress:  90%|████████████████████████▎  |  ETA: 0:00:00 ( 0.24  s/it)
  fname:  RetTemp21.csv


Progress:  95%|█████████████████████████▋ |  ETA: 0:00:00 ( 0.23  s/it)
  fname:  RetTemp41.csv


Progress: 100%|███████████████████████████| Time: 0:00:04 ( 0.23  s/it)
  fname:  RetTemp51.csv

We can examine the extracted features that the model saved and used for its training.

using CSV, DataFrames

mdirname = tscl.model[:modeldirectory]
modelfname = tscl.model[:juliarfmodelname]

trstatfname = joinpath(mdirname,modelfname*".csv")
res = CSV.read(trstatfname,DataFrame)
julia> first(res,5)
5×22 DataFrame
 Row │ tstart               tend                 sfreq     count  max        m ⋯
     │ DateTime             DateTime             Float64   Int64  Float64    F ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ 2012-12-01T00:00:00  2013-01-01T00:00:00  0.998658    745  8.9        3 ⋯
   2 │ 2012-12-01T00:00:00  2013-01-01T00:00:00  0.998658    745  5.2        2
   3 │ 2012-12-01T00:00:00  2013-01-01T00:00:00  0.998658    745  2.0        0
   4 │ 2017-10-01T00:00:00  2018-10-30T23:00:00  0.999895   9480  2.73733e7  2
   5 │ 2014-01-01T00:00:00  2014-12-31T23:00:00  0.999886   8760  6.5        1 ⋯
                                                              17 columns omitted

Let's check the prediction accuracy on the test data using the transform! function.

julia> dfresults = transform!(tscl);
Progress:  40%|██████████▊                |  ETA: 0:00:01 ( 0.25  s/it)
  fname:  AirOffTemp5.csv


Progress:  60%|████████████████▎          |  ETA: 0:00:00 ( 0.23  s/it)
  fname:  Energy5.csv


Progress:  80%|█████████████████████▋     |  ETA: 0:00:00 ( 0.22  s/it)
  fname:  Pressure5.csv


Progress: 100%|███████████████████████████| Time: 0:00:01 ( 0.21  s/it)
  fname:  RetTemp31.csv
loading model from file: /home/runner/work/TSML.jl/TSML.jl/src/../data/realdatatsclassification/model/juliarfmodel.serialized
julia> first(dfresults,5)
5×2 DataFrame
 Row │ fname            predtype
     │ String           SubStrin…
─────┼─────────────────────────────
   1 │ AirOffTemp4.csv  AirOffTemp
   2 │ AirOffTemp5.csv  AirOffTemp
   3 │ Energy5.csv      Energy
   4 │ Pressure5.csv    Pressure
   5 │ RetTemp31.csv    Energy

The table above shows the prediction for each filename. Since the filename encodes the ground-truth TS type, we can extract the type from each filename and compare it with the corresponding prediction. Below we compute the prediction accuracy:

prediction = dfresults.predtype
fnames = dfresults.fname
myregex = r"(?<dtype>[A-Za-z_\- ]+)(?<number>\d*)\.(?<ext>\w+)"
groundtruth=map(fnames) do fname
  mymatch=match(myregex,fname)
  mymatch[:dtype]
end
julia> sum(groundtruth .== prediction) / length(groundtruth) * 100
80.0

Of course, we need more data to split between training and testing in order to improve accuracy and get a more stable measurement of performance.
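With only five test files, the single 80% figure hides which class failed. A quick tally of (groundtruth, prediction) pairs, using only Base, makes the misclassifications visible. The vectors below are filled in with the values from this run's output:

```julia
# Ground truth and predictions from the run above (RetTemp31.csv was
# misclassified as Energy).
groundtruth = ["AirOffTemp", "AirOffTemp", "Energy", "Pressure", "RetTemp"]
prediction  = ["AirOffTemp", "AirOffTemp", "Energy", "Pressure", "Energy"]

# Count each (groundtruth => prediction) pair: off-diagonal entries are errors.
confusion = Dict{Tuple{String,String},Int}()
for (g, p) in zip(groundtruth, prediction)
    confusion[(g, p)] = get(confusion, (g, p), 0) + 1
end

confusion   # ("RetTemp", "Energy") => 1 flags the single misclassification
```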