We now have enough building blocks to perform data discovery on a collection of time series (TS) data generated by sensors. With the rapid adoption of IoT technology in buildings, manufacturing, and other industries, processing hundreds or thousands of time series has become a common challenge.
In this section, we will use the transformers discussed in the previous sections to normalize TS data and extract its statistical features. These extracted features will be used as input to a machine learning model. We will train this model to learn the signatures of different TS types so that we can use it to classify unknown or unlabeled sensor data.
In this tutorial, we will use TSClassifier, which works in the following context. Given a set of time series of known types, it extracts the statistical features of each series, uses these as inputs to a classifier whose output is the TS type, and then trains and tests the model. Another option is to use these statistical features for clustering and check the cluster quality. If accuracy is poor, add more statistical features and repeat the same training and testing process.
Assume that each time series used for training is named after its type, which serves as the target output. For example, temperature time series will be named temperature?.csv, where ? is any positive integer. Using this setup, TSClassifier loops over each file in the training directory, computes its stats, records the accumulated statistical features in a dataframe, and trains the model to learn the input->output mapping during the fit! operation. The transform! operation then applies the learned model to the files loaded from the testing directory.
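The naming convention and the feature table described above can be sketched in plain Julia. This is only an illustration of the idea, not how TSClassifier is implemented: the helpers label_from_filename and basic_stats, the choice of statistics, and the toy series are all assumptions made for this example.

```julia
using Statistics

# Hypothetical helper: derive the TS type from a file name such as
# "Energy10.csv" by dropping the trailing digits and the extension.
function label_from_filename(fname::AbstractString)
    m = match(r"^(?<dtype>[A-Za-z _-]+)(?<number>\d*)\.(?<ext>\w+)$", fname)
    m === nothing && error("unexpected file name: $fname")
    return String(m[:dtype])
end

# Hypothetical feature extractor: a few summary statistics of one series.
basic_stats(x::AbstractVector{<:Real}) =
    (mean = mean(x), std = std(x), min = minimum(x), max = maximum(x), median = median(x))

# Toy stand-ins for the values stored in two labeled training files.
series = Dict(
    "Energy1.csv"   => [10.0, 12.0, 11.0, 13.0],
    "Pressure1.csv" => [1.0, 1.1, 0.9, 1.0],
)

# One row per file: the file name, its features, and the target label.
rows = [merge((fname = f,), basic_stats(x), (label = label_from_filename(f),))
        for (f, x) in series]
```

Each element of rows pairs the inputs (the statistics) with the output (the label), which is the shape of data a classifier needs for training.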
Both training, which learns the appropriate parameters, and classification, which identifies unlabeled data, build on the pipeline workflow discussed in the previous sections.
Let's illustrate the process by loading some sample data:
using TSML
using Random
Random.seed!(12345)
trdirname = joinpath(dirname(pathof(TSML)), "../data/realdatatsclassification/training")
tstdirname = joinpath(dirname(pathof(TSML)), "../data/realdatatsclassification/testing")
modeldirname = joinpath(dirname(pathof(TSML)), "../data/realdatatsclassification/model")
Here's the list of files for training:
show(filter(f -> endswith(f, ".csv"), readdir(trdirname)))
["AirOffTemp1.csv", "AirOffTemp2.csv", "AirOffTemp3.csv", "Energy1.csv", "Energy10.csv", "Energy2.csv", "Energy3.csv", "Energy4.csv", "Energy6.csv", "Energy7.csv", "Energy8.csv", "Energy9.csv", "Pressure1.csv", "Pressure3.csv", "Pressure4.csv", "Pressure6.csv", "RetTemp11.csv", "RetTemp21.csv", "RetTemp41.csv", "RetTemp51.csv"]
and here are the files in the testing directory:
show(filter(f -> endswith(f, ".csv"), readdir(tstdirname)))
["AirOffTemp4.csv", "AirOffTemp5.csv", "Energy5.csv", "Pressure5.csv", "RetTemp31.csv"]
The files in the testing directory don't need to be labeled, but we use the labels to validate the effectiveness of the classifier. They serve as the groundtruth during prediction/classification.
Let us now set up an instance of TSClassifier and pass the arguments containing the directory locations of the files for training, testing, and modeling.
using TSML
tscl = TSClassifier(Dict(:trdirectory => trdirname,
                         :tstdirectory => tstdirname,
                         :modeldirectory => modeldirname,
                         :feature_range => 6:20,
                         :num_trees => 20))
Time to train our TSClassifier with fit! to learn the mapping between the extracted statistical features and the TS type.
julia> fit!(tscl);
getting stats of AirOffTemp1.csv
getting stats of AirOffTemp2.csv
getting stats of AirOffTemp3.csv
getting stats of Energy1.csv
getting stats of Energy10.csv
getting stats of Energy2.csv
getting stats of Energy3.csv
getting stats of Energy4.csv
getting stats of Energy6.csv
getting stats of Energy7.csv
getting stats of Energy8.csv
getting stats of Energy9.csv
getting stats of Pressure1.csv
getting stats of Pressure3.csv
getting stats of Pressure4.csv
getting stats of Pressure6.csv
getting stats of RetTemp11.csv
skipping RetTemp21.csv: ErrorException("Nearest Neigbour algo failed to replace missings")
skipping RetTemp41.csv: ErrorException("Nearest Neigbour algo failed to replace missings")
getting stats of RetTemp51.csv
We can examine the extracted features that the model saved and used for its training.
using CSV, DataFrames
mdirname = tscl.model[:modeldirectory]
modelfname = tscl.model[:juliarfmodelname]
# the training stats file shares the model's name, with a .csv extension
trstatfname = joinpath(mdirname, modelfname * ".csv")
res = CSV.read(trstatfname, DataFrame)
Let's check the accuracy of prediction with the test data using the transform! function:
julia> dfresults = transform!(tscl);
getting stats of AirOffTemp4.csv
getting stats of AirOffTemp5.csv
getting stats of Energy5.csv
getting stats of Pressure5.csv
skipping RetTemp31.csv: ErrorException("Nearest Neigbour algo failed to replace missings")
loading model from file: /home/runner/work/TSML.jl/TSML.jl/src/../data/realdatatsclassification/model/juliarfmodel.serialized
4×2 DataFrame
 Row │ fname            predtype
     │ String           SubStrin…
─────┼─────────────────────────────
   1 │ AirOffTemp4.csv  AirOffTemp
   2 │ AirOffTemp5.csv  AirOffTemp
   3 │ Energy5.csv      Energy
   4 │ Pressure5.csv    Pressure
The table above shows the predicted type for each file; the type embedded in each filename serves as the groundtruth. We can compute the accuracy by extracting the TS type from the filename and comparing it with the corresponding prediction. The code below computes the prediction accuracy:
prediction = dfresults.predtype
fnames = dfresults.fname
# dtype: type name, number: optional trailing digits, ext: file extension
myregex = r"(?<dtype>[A-Za-z _-]+)(?<number>\d*)\.(?<ext>\w+)"
groundtruth = map(fnames) do fname
    mymatch = match(myregex, fname)
    mymatch[:dtype]
end
julia> sum(groundtruth .== prediction) / length(groundtruth) * 100
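The same computation can be wrapped in a small helper that also guards against mismatched or empty inputs. This is a sketch rather than part of TSML: accuracy is a hypothetical name, and the file names and predictions are copied by hand from the results table above so that the block runs on its own.

```julia
# Hypothetical helper: percentage of predictions that equal the groundtruth.
function accuracy(groundtruth, prediction)
    length(groundtruth) == length(prediction) ||
        throw(ArgumentError("inputs must have the same length"))
    isempty(groundtruth) && throw(ArgumentError("inputs must not be empty"))
    return count(groundtruth .== prediction) / length(groundtruth) * 100
end

# File names and predictions copied from the results table above.
fnames = ["AirOffTemp4.csv", "AirOffTemp5.csv", "Energy5.csv", "Pressure5.csv"]
prediction = ["AirOffTemp", "AirOffTemp", "Energy", "Pressure"]

# The groundtruth is the type name embedded in each file name.
myregex = r"^(?<dtype>[A-Za-z _-]+)(?<number>\d*)\.(?<ext>\w+)$"
groundtruth = [String(match(myregex, f)[:dtype]) for f in fnames]

acc = accuracy(groundtruth, prediction)  # 100.0 for these four files
```

Every prediction in the table matches the type in its file name, so the helper returns 100.0 for these four files.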
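Note that this accuracy covers only the four test files that were processed; RetTemp31.csv was skipped. Of course, we need more data, split between training and testing, to improve accuracy and get a more stable measurement of performance.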
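When more labeled files are available, one way to divide them is to hold out a share of each type so that every type appears in both sets. The sketch below illustrates that idea under stated assumptions: split_by_type and its testfrac keyword are hypothetical and not part of TSML, and it works on file names only, leaving the actual copying of files into the training and testing directories to the reader.

```julia
using Random

# Hypothetical helper: split file names into training and testing sets,
# holding out about `testfrac` of the files of each type. Types with a
# single file stay entirely in the training set.
function split_by_type(fnames; testfrac = 0.25, rng = Random.default_rng())
    regex = r"^(?<dtype>[A-Za-z _-]+)(?<number>\d*)\.(?<ext>\w+)$"
    groups = Dict{String,Vector{String}}()
    for f in fnames
        m = match(regex, f)
        m === nothing && continue  # ignore names that do not fit the pattern
        push!(get!(groups, String(m[:dtype]), String[]), f)
    end
    train, test = String[], String[]
    for files in values(groups)
        n = length(files)
        # hold out at least one file, but always keep at least one for training
        ntest = n < 2 ? 0 : min(n - 1, max(1, round(Int, testfrac * n)))
        shuffled = shuffle(rng, files)
        append!(test, shuffled[1:ntest])
        append!(train, shuffled[ntest+1:end])
    end
    return sort(train), sort(test)
end

files = ["Energy1.csv", "Energy2.csv", "Energy3.csv", "Energy4.csv",
         "Pressure1.csv", "Pressure3.csv"]
train, test = split_by_type(files; rng = MersenneTwister(1))
```

Splitting per type rather than over the whole list avoids ending up with a type that is present only in the testing set, which the classifier would never have seen during training.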