Lessons from a Simple, Unsupervised Anomaly Detection Use Case
When time-series models fall short, and why that's okay
Anomaly detection is not a novel problem in Data Science; it has long been part of the common vocabulary owing to its diverse applicability, from finance (fraud detection) to infrastructure monitoring (outages, cost spikes, and so on). Early in my career, I had the opportunity to work in the Knowledge-based Computer Systems group at NCST, Mumbai. Our use case was to monitor quotation outliers in imports, and our systems took hours to analyse a few months of data before flagging “undervalued” imports.
Two decades later, in a recent hackathon, we ended up processing millions of records in just minutes! Since we were dealing with timed events, the initial hunch was that multi-variate¹ time-series-based anomaly detection would be a natural fit. However, it wasn’t to be!
To our surprise, common anomaly detection models like Isolation Forest significantly outperformed the time-series one in both training latency and prediction accuracy for multi-variate use cases.
This could of course be subjective, but the training latency gap seemed to stem from the fact that the time-series anomaly-detection model relied only on process-level concurrency, which limited its scalability in distributed environments like Spark.
The prediction accuracy (or the lack of it) can be blamed on the data: although it was timestamped, the events were sporadic, sometimes clustered and at other times followed by a lull. This was not in line with the regular frequency the time-series-based model expected.
Time-series-based models typically expect data points to be evenly spaced in time, and irregular bursts cause them to fumble. In other words, time-series-based detection works better in situations like sensor-based events, where one expects a constant, regularly timestamped stream of data.
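As a rough illustration of the mismatch, here is a minimal pandas sketch (the timestamps and values are made up): forcing sporadic, bursty events onto the evenly spaced grid a time-series model expects leaves long runs of empty buckets.

```python
import pandas as pd

# Hypothetical bursty events at second granularity: a cluster, then a lull
events = pd.DataFrame(
    {"value": [10, 12, 11, 95, 9]},
    index=pd.to_datetime([
        "2024-01-01 00:00:01",
        "2024-01-01 00:00:02",
        "2024-01-01 00:00:03",  # a clustered burst...
        "2024-01-01 00:07:30",  # ...then a long lull
        "2024-01-01 00:07:31",
    ]),
)

# Resampling onto the fixed 1-minute grid a time-series model expects
regular = events.resample("1min").mean()
print(regular)  # mostly NaN rows: gaps the model has to interpolate somehow
```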
Since the data was granular only to the second, clustering of events posed another interesting challenge: data loss! To leverage the model, an index was needed, and in our case an index based purely on the timestamp field would have resulted in a last-write-wins scenario, because the system keeps only the last event (or row) it processes for a given index value. This is, of course, a side effect of not maintaining millisecond or microsecond granularity when storing event timestamps.
To train a model on timestamped events for time-series anomaly detection, it is important to maintain the data at the highest available level of temporal granularity, to minimise data loss.
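Here is a minimal sketch of that pitfall (field names are hypothetical): two events landing within the same second collide on a timestamp index, and a last-write-wins policy silently drops one of them, whereas microsecond granularity keeps both.

```python
import pandas as pd

# Two events land within the same second, so their second-level
# timestamps are identical
raw = pd.DataFrame({
    "ts": pd.to_datetime(["2024-01-01 00:00:05", "2024-01-01 00:00:05"]),
    "value": [10, 500],
})

# Building a unique timestamp index with last-write-wins semantics
indexed = raw.groupby("ts").last()
print(len(indexed))  # 1 -- the value=10 event is silently lost

# At microsecond granularity the two events no longer collide
raw["ts"] = raw["ts"] + pd.to_timedelta([0, 1], unit="us")
print(len(raw.groupby("ts").last()))  # 2 -- both events survive
```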
In parallel, we were also exploring other models, like Isolation Forest, LOF, and PCA, with various levels of contamination. We observed that Isolation Forest outperformed all the other models during both training and prediction. For training, the time-series model we used became exponentially slower when the feature count was increased even slightly, with the sample size held constant.
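For context, here is a minimal scikit-learn sketch of the kind of sweep we ran (synthetic data; the contamination values are illustrative, and the PCA reconstruction-error variant is omitted for brevity):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)
X = rng.normal(size=(10_000, 10))  # stand-in for our event features

# contamination = expected fraction of anomalies in the data
for contamination in (0.001, 0.01, 0.05):
    iso = IsolationForest(contamination=contamination, random_state=42)
    iso_flags = iso.fit_predict(X)  # -1 = anomaly, 1 = normal

    lof = LocalOutlierFactor(contamination=contamination)
    lof_flags = lof.fit_predict(X)

    print(contamination,
          (iso_flags == -1).sum(),
          (lof_flags == -1).sum())
```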
Isolation Forest, in a Spark environment, could not only leverage the distributed setup, it did so far more efficiently, and at negligible cost!
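Our exact setup isn’t reproduced here, but a common pattern for this (an assumption on my part, including the table and feature names) is to fit the forest on a driver-side sample, broadcast it, and let the executors score their partitions in parallel through a pandas UDF:

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from sklearn.ensemble import IsolationForest

spark = SparkSession.builder.getOrCreate()
events = spark.table("events")  # hypothetical events table

# Fit on a small driver-side sample, then ship the model to executors
sample = events.sample(fraction=0.01).toPandas()
model = IsolationForest(contamination=0.01, random_state=42)
model.fit(sample[["f1", "f2", "f3"]].to_numpy())  # hypothetical features
bc = spark.sparkContext.broadcast(model)

@pandas_udf("double")
def iforest_score(f1: pd.Series, f2: pd.Series, f3: pd.Series) -> pd.Series:
    # Each executor scores its own partition; lower score = more anomalous
    X = pd.concat([f1, f2, f3], axis=1).to_numpy()
    return pd.Series(bc.value.decision_function(X))

scored = events.withColumn("anomaly_score",
                           iforest_score("f1", "f2", "f3"))
```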
Based on these results, our next step would be to refine the model further with some labeled data, allowing us to improve its precision.
Further reading
¹ Multivariate anomaly detection models evaluate patterns across multiple features, unlike traditional models that operate on a single metric (e.g. daily cost, or a single-parameter threshold breach). This allows them to detect more complex, less obvious anomalies.