TS Data Discovery
We now have enough building blocks to perform data discovery on a collection of time series (TS) generated by sensors. Processing hundreds or thousands of time series is becoming a common challenge with the rapid adoption of IoT technology in buildings, manufacturing, and other industries.
In this section, we will use the transformers discussed in the previous sections to normalize each TS and extract its statistical features. These features will be used as input to a machine learning model. We will train this model to learn the signatures of different TS types so that we can use it to classify unknown or unlabeled sensor data.
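To make "statistical features" concrete, here is a minimal sketch that summarizes one series with Julia's Statistics standard library. The helper name basicstats, the chosen feature set, and the sample readings are assumptions for illustration only; they are not TSML's own transformers or feature list.

```julia
using Statistics  # standard library

# Hypothetical helper (not part of TSML): summarize one series of readings
# as a fixed-length set of features.
function basicstats(values::AbstractVector{<:Real})
    q = quantile(values, [0.25, 0.5, 0.75])
    return (count = length(values),
            min = minimum(values),
            max = maximum(values),
            mean = mean(values),
            median = q[2],
            q25 = q[1],
            q75 = q[3],
            iqr = q[3] - q[1])
end

readings = [20.5, 21.0, 21.4, 22.1, 21.8, 20.9]  # made-up temperature values
features = basicstats(readings)
println(features)
```

Each series, whatever its length, is reduced to the same fixed set of numbers, which is what allows series of different lengths to be rows of one training table.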
In this tutorial, we will use TSClassifier, which works in the following context: given a set of time series with known types, get the statistical features of each, use these as inputs to a classifier whose output is the TS type, then train and test. Another option is to use these statistical features for clustering and check the cluster quality. If accuracy is poor, add more statistical features and repeat the same training and testing process. We assume that each time series used for training is named after its type, which serves as the target output. For example, temperature time series are named temperature?.csv, where ? is any positive integer. With this setup, the fit! operation of TSClassifier loops over each file in the training directory, computes its statistics, records the accumulated features in a dataframe, and trains the model to learn the input->output mapping. The transform! operation then applies the learned model to the files in the testing directory.
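The naming convention can be turned into training labels with a small amount of string handling. The sketch below is only an illustration of that idea: the helper typefromname and its regular expression are assumptions, not the code TSClassifier uses internally.

```julia
# Hypothetical helper (not part of TSML): "Energy10.csv" -> "Energy".
# Assumes names of the form <type><optional digits>.csv.
function typefromname(fname::AbstractString)
    m = match(r"^(?<dtype>[A-Za-z_-]+)\d*\.csv$", fname)
    m === nothing && error("unexpected file name: $fname")
    return String(m[:dtype])
end

fnames = ["AirOffTemp1.csv", "Energy10.csv", "Pressure3.csv", "RetTemp11.csv"]
labels = typefromname.(fnames)  # one target label per training file
println(labels)
```

Raising an error on a non-matching name is a deliberate choice here: a silently mislabeled training file would be harder to notice than a failed run.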
The entire process of training to learn the appropriate parameters and classification to identify unlabeled data exploits the idea of the pipeline workflow discussed in the previous sections.
Let's illustrate the process by loading some sample data:
using TSML
using Random
Random.seed!(12345)
trdirname = joinpath(dirname(pathof(TSML)),"../data/realdatatsclassification/training")
tstdirname = joinpath(dirname(pathof(TSML)),"../data/realdatatsclassification/testing")
modeldirname = joinpath(dirname(pathof(TSML)),"../data/realdatatsclassification/model")
Here's the list of files for training:
show(readdir(trdirname) |> x->filter(y->occursin(r"\.csv$",y),x))
["AirOffTemp1.csv", "AirOffTemp2.csv", "AirOffTemp3.csv", "Energy1.csv", "Energy10.csv", "Energy2.csv", "Energy3.csv", "Energy4.csv", "Energy6.csv", "Energy7.csv", "Energy8.csv", "Energy9.csv", "Pressure1.csv", "Pressure3.csv", "Pressure4.csv", "Pressure6.csv", "RetTemp11.csv", "RetTemp21.csv", "RetTemp41.csv", "RetTemp51.csv"]
And here are the files in the testing directory:
show(readdir(tstdirname) |> x->filter(y->occursin(r"\.csv$",y),x))
["AirOffTemp4.csv", "AirOffTemp5.csv", "Energy5.csv", "Pressure5.csv", "RetTemp31.csv"]
The files in the testing directory don't need to be labeled, but we use the labels to validate the effectiveness of the classifier. They serve as the ground truth when evaluating the predictions.
TSClassifier
Let us now set up an instance of the TSClassifier and pass the arguments containing the directory locations of the files for training, testing, and modeling.
using TSML
tscl = TSClassifier(Dict(:trdirectory=>trdirname,
                         :tstdirectory=>tstdirname,
                         :modeldirectory=>modeldirname,
                         :feature_range=>6:20,
                         :num_trees=>20)
       )
Time to train our TSClassifier to learn the mapping between the extracted statistical features and the TS type.
julia> fit!(tscl);
Progress: 10%|██▊ | ETA: 0:00:11 ( 0.60 s/it) fname: AirOffTemp2.csv
Progress: 35%|█████████▌ | ETA: 0:00:06 ( 0.47 s/it) fname: Energy3.csv
Progress: 40%|██████████▊ | ETA: 0:00:05 ( 0.43 s/it) fname: Energy4.csv
Progress: 45%|████████████▏ | ETA: 0:00:04 ( 0.41 s/it) fname: Energy6.csv
Progress: 50%|█████████████▌ | ETA: 0:00:04 ( 0.38 s/it) fname: Energy7.csv
Progress: 55%|██████████████▉ | ETA: 0:00:03 ( 0.37 s/it) fname: Energy8.csv
Progress: 60%|████████████████▎ | ETA: 0:00:03 ( 0.35 s/it) fname: Energy9.csv
Progress: 65%|█████████████████▌ | ETA: 0:00:02 ( 0.34 s/it) fname: Pressure1.csv
Progress: 70%|██████████████████▉ | ETA: 0:00:02 ( 0.33 s/it) fname: Pressure3.csv
Progress: 75%|████████████████████▎ | ETA: 0:00:02 ( 0.32 s/it) fname: Pressure4.csv
Progress: 80%|█████████████████████▋ | ETA: 0:00:01 ( 0.32 s/it) fname: Pressure6.csv
Progress: 85%|███████████████████████ | ETA: 0:00:01 ( 0.31 s/it) fname: RetTemp11.csv
Progress: 90%|████████████████████████▎ | ETA: 0:00:01 ( 0.30 s/it) fname: RetTemp21.csv
Progress: 95%|█████████████████████████▋ | ETA: 0:00:00 ( 0.30 s/it) fname: RetTemp41.csv
Progress: 100%|███████████████████████████| Time: 0:00:05 ( 0.29 s/it) fname: RetTemp51.csv
We can examine the extracted features that the model saved and used for training.
using CSV, DataFrames  # needed for CSV.read and DataFrame if not already in scope
mdirname = tscl.model[:modeldirectory]
modelfname = tscl.model[:juliarfmodelname]
trstatfname = joinpath(mdirname,modelfname*".csv")
res = CSV.read(trstatfname,DataFrame)
julia> first(res,5)
5×22 DataFrame
 Row │ tstart               tend                 sfreq     count  max        m ⋯
     │ DateTime             DateTime             Float64   Int64  Float64    F ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ 2012-12-01T00:00:00  2013-01-01T00:00:00  0.998658    745  8.9        3 ⋯
   2 │ 2012-12-01T00:00:00  2013-01-01T00:00:00  0.998658    745  5.2        2
   3 │ 2012-12-01T00:00:00  2013-01-01T00:00:00  0.998658    745  2.0        0
   4 │ 2017-10-01T00:00:00  2018-10-30T23:00:00  0.999895   9480  2.73733e7  2
   5 │ 2014-01-01T00:00:00  2014-12-31T23:00:00  0.999886   8760  6.5        1
                                                              17 columns omitted
Let's check the prediction accuracy on the test data using the transform! function.
julia> dfresults = transform!(tscl);
Progress: 40%|██████████▊ | ETA: 0:00:01 ( 0.25 s/it) fname: AirOffTemp5.csv
Progress: 60%|████████████████▎ | ETA: 0:00:00 ( 0.24 s/it) fname: Energy5.csv
Progress: 80%|█████████████████████▋ | ETA: 0:00:00 ( 0.23 s/it) fname: Pressure5.csv
Progress: 100%|███████████████████████████| Time: 0:00:01 ( 0.22 s/it) fname: RetTemp31.csv
loading model from file: /home/runner/work/TSML.jl/TSML.jl/src/../data/realdatatsclassification/model/juliarfmodel.serialized
julia> first(dfresults,5)
5×2 DataFrame
 Row │ fname            predtype
     │ String           SubStrin…
─────┼─────────────────────────────
   1 │ AirOffTemp4.csv  AirOffTemp
   2 │ AirOffTemp5.csv  AirOffTemp
   3 │ Energy5.csv      Energy
   4 │ Pressure5.csv    Pressure
   5 │ RetTemp31.csv    Energy
The table above shows the predicted type for each file. Each filename encodes the ground truth, so we can compute the accuracy by extracting the TS type from the filename and comparing it with the corresponding prediction. The code below computes the prediction accuracy:
prediction = dfresults.predtype
fnames = dfresults.fname
myregex = r"(?<dtype>[A-Za-z_-]+)(?<number>\d*)\.(?<ext>\w+)"
groundtruth = map(fnames) do fname
    mymatch = match(myregex, fname)
    mymatch[:dtype]
end
julia> sum(groundtruth .== prediction) / length(groundtruth) * 100
80.0
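Of course, five test files are too few for a stable estimate: a single misclassification changes the accuracy by 20 percentage points. We need more data, split between training and testing, to improve accuracy and to measure performance more reliably.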
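A single accuracy figure does not show which type was missed. As a rough sketch, the per-type breakdown below uses plain Julia on the five results from the table above, with the true types read off the file names. The helper pertypeaccuracy is hypothetical and not part of TSML.

```julia
# Values copied from the results table; the true type comes from each file name.
groundtruth = ["AirOffTemp", "AirOffTemp", "Energy", "Pressure", "RetTemp"]
prediction  = ["AirOffTemp", "AirOffTemp", "Energy", "Pressure", "Energy"]

# Hypothetical helper (not part of TSML): fraction of correct predictions
# for each true type.
function pertypeaccuracy(truth, pred)
    hits = Dict{String,Int}()
    totals = Dict{String,Int}()
    for (t, p) in zip(truth, pred)
        totals[t] = get(totals, t, 0) + 1
        hits[t] = get(hits, t, 0) + (t == p ? 1 : 0)
    end
    return Dict(t => hits[t] / totals[t] for t in keys(totals))
end

acc = pertypeaccuracy(groundtruth, prediction)
for t in sort(collect(keys(acc)))
    println(rpad(t, 12), acc[t])
end
```

Here the only error is the RetTemp file predicted as Energy, so every type except RetTemp scores 1.0. With one or two test files per type, these per-type figures are indicative at best.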