Large-Scale Unusual Time Series Detection 2015

Exploring the feature space of large collections of time series

Code and tools for time series anomaly detection
Workshop on Frontiers in Functional Data Analysis
Banff, Canada.

It is becoming increasingly common for organizations to collect very large amounts of data over time. Data visualization is essential for exploring and understanding structures and patterns, and for identifying unusual observations. However, the sheer quantity of data available challenges current time series visualization methods.

For example, Yahoo has banks of mail servers that are monitored over time. Many measurements on server performance are collected every hour for each of thousands of servers. We wish to identify servers that are behaving unusually.

Alternatively, we may have thousands of time series we wish to forecast, and we want to be able to identify the types of time series that are easy to forecast and those that are inherently challenging.

I will demonstrate a functional data approach to this problem using a vector of features on each time series, measuring characteristics of the series. For example, the features may include lag correlation, strength of seasonality, spectral entropy, etc. Then we use a principal component decomposition on the features, and plot the first few principal components. This enables us to explore a lower dimensional space and discover interesting structure and unusual observations.
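As a rough illustration of what such per-series features might look like, here is a minimal sketch of two of them: first-order autocorrelation and spectral entropy. The function names are hypothetical, and the spectral entropy uses a naive periodogram (an assumption made for brevity), so its values need not match those produced by the authors' software.

```python
import cmath
import math


def lag1_autocorrelation(x):
    """Sample autocorrelation at lag 1."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    if denom == 0:
        return 0.0  # constant series: no variation to correlate
    num = sum((x[t] - mean) * (x[t - 1] - mean) for t in range(1, n))
    return num / denom


def spectral_entropy(x):
    """Shannon entropy of the normalised periodogram, scaled to [0, 1].

    Assumption: a raw periodogram stands in for a proper spectral density
    estimate. Values near 0 suggest a strongly periodic series; values
    near 1 suggest a noise-like series.
    """
    n = len(x)
    mean = sum(x) / n
    centred = [v - mean for v in x]
    power = []
    for k in range(1, n // 2 + 1):  # positive Fourier frequencies only
        coef = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                   for t, v in enumerate(centred))
        power.append(abs(coef) ** 2)
    total = sum(power)
    if total == 0 or len(power) < 2:
        return 0.0
    probs = [p / total for p in power]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(power))


# Two toy series: an alternating series and a pure sine wave.
alternating = [1, -1] * 4
sine = [math.sin(2 * math.pi * 4 * t / 64) for t in range(64)]
feature_vector = [lag1_autocorrelation(sine), spectral_entropy(sine)]
```

Each series in the collection would be reduced to one such feature vector, and the vectors stacked into a matrix with one row per series.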

Large-scale unusual time series detection

Rob J Hyndman (1), Earo Wang (1) and Nikolay Laptev (2)

(1) Monash Business School, Monash University, Clayton, Victoria, Australia.
(2) Yahoo Labs, Sunnyvale, California, USA

Abstract: It is becoming increasingly common for organizations to collect very large amounts of data over time, and to need to detect unusual or anomalous time series. For example, Yahoo has banks of mail servers that are monitored over time. Many measurements on server performance are collected every hour for each of thousands of servers. We wish to identify servers that are behaving unusually.
We compute a vector of features on each time series, measuring characteristics of the series. The features may include lag correlation, strength of seasonality, spectral entropy, etc. Then we use a principal component decomposition on the features, and use various bivariate outlier detection methods applied to the first two principal components. This enables the most unusual series, based on their feature vectors, to be identified. The bivariate outlier detection methods used are based on highest density regions and α-hulls.

A new R package for detecting unusual time series

The anomalous package provides some tools to detect unusual time series in a large collection of time series. This is joint work with Earo Wang (an honours student at Monash) and Nikolay Laptev (from Yahoo Labs). Yahoo is interested in detecting unusual patterns in server metrics.
The package is based on this paper with Earo and Nikolay.
The basic idea is to measure a range of features of the time series (such as strength of seasonality, an index of spikiness, first order autocorrelation, etc.). Then a principal component decomposition of the feature matrix is calculated, and outliers are identified in the two-dimensional space of the first two principal component scores.
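The principal component step can be sketched in plain Python as follows. This is only an illustration, not the package's code: the feature values are made up, the function names are hypothetical, and a Jacobi eigenvalue routine is used here simply to avoid external dependencies.

```python
import math


def standardize(rows):
    """Centre each feature column and scale it to unit sample variance."""
    n, p = len(rows), len(rows[0])
    cols = []
    for j in range(p):
        col = [row[j] for row in rows]
        mean = sum(col) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
        cols.append([(v - mean) / sd if sd > 0 else 0.0 for v in col])
    return [[cols[j][i] for j in range(p)] for i in range(n)]


def covariance(z):
    """Sample covariance matrix of already-centred rows."""
    n, p = len(z), len(z[0])
    return [[sum(r[a] * r[b] for r in z) / (n - 1) for b in range(p)]
            for a in range(p)]


def jacobi_eigen(sym, sweeps=100, tol=1e-18):
    """Eigenvalues and eigenvectors (as columns) of a symmetric matrix."""
    p = len(sym)
    a = [row[:] for row in sym]
    v = [[float(i == j) for j in range(p)] for i in range(p)]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(p) for j in range(i + 1, p))
        if off < tol:
            break
        for i in range(p - 1):
            for j in range(i + 1, p):
                if a[i][j] == 0.0:
                    continue
                diff = a[j][j] - a[i][i]
                theta = (math.pi / 4 if diff == 0
                         else 0.5 * math.atan(2 * a[i][j] / diff))
                c, s = math.cos(theta), math.sin(theta)
                for k in range(p):  # rotate columns i and j of a and v
                    a[k][i], a[k][j] = (c * a[k][i] - s * a[k][j],
                                        s * a[k][i] + c * a[k][j])
                    v[k][i], v[k][j] = (c * v[k][i] - s * v[k][j],
                                        s * v[k][i] + c * v[k][j])
                for k in range(p):  # rotate rows i and j of a
                    a[i][k], a[j][k] = (c * a[i][k] - s * a[j][k],
                                        s * a[i][k] + c * a[j][k])
    return [a[i][i] for i in range(p)], v


def first_two_pc_scores(features):
    """Scores of each row on the two leading principal components."""
    z = standardize(features)
    values, vectors = jacobi_eigen(covariance(z))
    order = sorted(range(len(values)), key=lambda j: -values[j])[:2]
    return [tuple(sum(row[k] * vectors[k][j] for k in range(len(row)))
                  for j in order) for row in z]


# Made-up feature matrix: one row per series, three hypothetical features.
features = [[0.90, 0.80, 0.10], [0.80, 0.70, 0.20], [0.85, 0.75, 0.15],
            [0.70, 0.90, 0.10], [0.95, 0.60, 0.25], [0.10, 0.10, 0.90]]
scores = first_two_pc_scores(features)
```

Standardizing the columns first is a design choice made here so that no single feature dominates because of its scale; the signs of the components are arbitrary.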

We use two methods to identify outliers.

A bivariate kernel density estimate of the first two PC scores is computed, and the points are ordered based on the value of the density at each observation. This gives us a ranking from most outlying (lowest density) to least outlying (highest density).
A series of α-convex hulls are computed on the first two PC scores with decreasing α, and points are classified as outliers when they become singletons separated from the main hull. This gives us an alternative ranking with the most outlying having separated at the highest value of α, and the remaining outliers with decreasing values of α.
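The first of these two methods, the density ranking, can be sketched as below. This is a simplified stand-in rather than the package's implementation: it assumes a product Gaussian kernel with a normal-reference (Scott-style) bandwidth per axis, and the function name and toy points are hypothetical.

```python
import math


def density_ranking(points):
    """Order 2-D points from lowest to highest kernel density estimate.

    Assumption: product Gaussian kernel, bandwidth sd * n**(-1/6) per axis.
    Returns (order, densities), where order[0] is the most outlying index.
    """
    n = len(points)
    bandwidths = []
    for axis in (0, 1):
        values = [p[axis] for p in points]
        mean = sum(values) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
        bandwidths.append((sd if sd > 0 else 1.0) * n ** (-1 / 6))
    hx, hy = bandwidths

    def density(p):
        total = 0.0
        for q in points:
            u, v = (p[0] - q[0]) / hx, (p[1] - q[1]) / hy
            total += math.exp(-0.5 * (u * u + v * v))
        return total / (n * 2 * math.pi * hx * hy)

    densities = [density(p) for p in points]
    order = sorted(range(n), key=lambda i: densities[i])
    return order, densities


# Toy PC scores: six points near the origin and one far away (index 6).
pc_scores = [(0.0, 0.0), (0.1, 0.2), (-0.2, 0.1), (0.2, -0.1),
             (-0.1, -0.2), (0.05, 0.05), (5.0, 5.0)]
order, densities = density_ranking(pc_scores)
```

The isolated point receives the lowest density and so heads the ranking. The α-hull method is not reproduced here.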

I explained the ideas in a talk last Tuesday given at a joint meeting of the Statistical Society of Australia and the Melbourne Data Science Meetup Group. Slides are available here. A link to a video of the talk will also be added there when it is ready.
The density-ranking of PC scores was also used in my work on detecting outliers in functional data. See my 2010 JCGS paper and the associated rainbow package for R.
There are two versions of the package: one under an ACM licence, and a limited version under a GPL licence. Eventually we hope to make the GPL version contain everything, but we are currently dependent on the alphahull package which has an ACM licence.

A new open source data set for detecting time series outliers

Yahoo Labs has just released an interesting new data set useful for research on detecting anomalies (or outliers) in time series data. There are many contexts in which anomaly detection is important. For Yahoo, the main use case is in detecting unusual traffic on Yahoo servers.

The data set comprises real traffic to Yahoo services, along with some synthetic data. There are 367 time series in the data set, each of which contains between 741 and 1680 observations recorded at regular intervals. Each series is accompanied by an indicator series with a 1 if the observation was an anomaly, and 0 otherwise. The anomalies in the real data were determined by human judgement, while those in the synthetic data were generated algorithmically. For the synthetic data, some information about the components used to construct the data is also provided.

Although the Yahoo announcement claims that the data are publicly available, in fact they are only available to people with an edu address. Further, you have to apply to use them, and it takes about 24 hours before approval is granted. I have suggested that they remove these restrictions, and make the data available without restriction to anyone who wants to use them.

Research on anomaly detection in time series seems to be growing in popularity. Twitter has also released their own AnomalyDetection R package. Their approach has some similarities with my own tsoutliers function in the forecast package. The tso function in the tsoutliers package is another approach to the same problem.
Hopefully having a large public data set available will lead to improvements in time series outlier detection methods, at least for detecting outliers in internet traffic data.

2015/6/28 11:31:21
How the problem in this paper differs (p. 1)
We are interested in the time series that are anomalous relative to the other time series in the same cluster, or more generally, in the same set. This type of anomaly detection is different from univariate anomaly detection or even from a multivariate point anomaly detection [6] because we are interested in identifying entire time series that are behaving unusually in the context of other metrics.

An R package already exists (p. 2)
Authors' contributions
First, we introduce a novel and accurate method of using PCA with α-convex hulls for finding anomalous time series. Second, we perform a study of possible features that are useful for the types of time series dynamics seen in web-traffic time series.

Why PCA works (p. 2)
Therefore, loosely speaking, the first k principal components capture the k most prevalent patterns in the data.

Method used in this paper
To find anomalies in the first two PCs we use a multi-dimensional outlier detection algorithm. We have implemented both a density-based and an α-hull-based multidimensional outlier detection algorithm.
The density-based multi-dimensional anomaly detection algorithm [7] (Computing and Graphing Highest Density Regions) finds points in the first two principal components with lowest density. The α-hull method [15] (Generalizing the Convex Hull of a Sample: The R Package ...) is a generalization of the convex hull [6] (A Survey of Outlier Detection Methodologies), which is a bounding region of a point set. The α parameter in the α-hull method defines a generalized disk of radius α. When α is sufficiently large, the α-hull method is equivalent to the convex hull. Given α, an edge of the α-shape is drawn between two members of the finite point set if there exists a generalized disk of radius α containing the entire point set and the two points lie on its boundary.
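The edge rule in the last sentence can be tried out with a small sketch. It reads the sentence literally and assumes an ordinary closed disk of radius α; generalized disks in the cited work cover further cases that are not handled here, and the R package cited as [15] follows its own conventions, which this sketch does not reproduce. The function name is hypothetical.

```python
import math
from itertools import combinations


def alpha_shape_edges(points, alpha, eps=1e-9):
    """Pairs (i, j) joined by an edge under the literal rule:
    some disk of radius alpha has both points on its boundary
    and contains every point of the set."""
    edges = []
    for i, j in combinations(range(len(points)), 2):
        (px, py), (qx, qy) = points[i], points[j]
        d = math.hypot(qx - px, qy - py)
        if d == 0 or d > 2 * alpha:
            continue  # no disk of radius alpha passes through both points
        mx, my = (px + qx) / 2, (py + qy) / 2
        h = math.sqrt(max(0.0, alpha ** 2 - (d / 2) ** 2))
        nx, ny = -(qy - py) / d, (qx - px) / d  # unit normal to the segment
        for sign in (1, -1):  # the two candidate disk centres
            cx, cy = mx + sign * h * nx, my + sign * h * ny
            if all(math.hypot(x - cx, y - cy) <= alpha + eps
                   for x, y in points):
                edges.append((i, j))
                break
    return edges


# Unit square plus its centre point.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.5, 0.5)]
large_alpha_edges = alpha_shape_edges(square, alpha=10.0)
small_alpha_edges = alpha_shape_edges(square, alpha=0.5)
```

With α = 10 the edges are exactly the four sides of the square, which agrees with the remark that a sufficiently large α recovers the convex hull. With α = 0.5 no disk can hold all five points, so no edges are drawn.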

2015/6/28 15:20:03
The variance of the variances across blocks measures the “lumpiness” of the series.
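A minimal sketch of the lumpiness feature, under stated assumptions: the blocks are non-overlapping, the block width is a free parameter (the default of 24 is a guess for hourly data, not taken from these notes), and any incomplete final block is dropped. The function name is hypothetical.

```python
import statistics


def lumpiness(x, width=24):
    """Variance of the per-block variances of a series.

    Assumptions: non-overlapping blocks of `width` observations
    (width >= 2), incomplete trailing block ignored.
    """
    blocks = [x[i:i + width] for i in range(0, len(x) - width + 1, width)]
    if len(blocks) < 2:
        return 0.0  # not enough blocks to compare
    block_variances = [statistics.variance(b) for b in blocks]
    return statistics.variance(block_variances)


# First block is flat, second block oscillates: the variances differ.
lumpy = [0, 0, 0, 0, 0, 2, 0, 2]
lumpy_value = lumpiness(lumpy, width=4)
steady_value = lumpiness([1, 2] * 24, width=24)
```

A series whose variability is the same in every block scores zero, while one that alternates between calm and volatile stretches scores higher.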
Some of our features rely on a robust STL decomposition.

2015/6/28 15:44:43
“Flat spots” are computed by dividing the sample space of a time series into ten equal-sized intervals, and computing the maximum run length within any single interval.
Finally, “crossing points” are defined as the number of times a time series crosses the mean line.
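Both definitions translate into a few lines of code. The sketch below is one possible reading: the intervals are taken as equal-width bins between the minimum and maximum of the series, and an observation exactly on the mean counts as not above it. Both choices, and the function names, are assumptions.

```python
def flat_spots(x, bins=10):
    """Longest run of consecutive observations falling in the same bin."""
    lo, hi = min(x), max(x)
    width = (hi - lo) / bins

    def bin_of(v):
        if width == 0:
            return 0  # constant series: everything in one bin
        return min(int((v - lo) / width), bins - 1)  # maximum joins last bin

    longest = run = 0
    previous = None
    for v in x:
        b = bin_of(v)
        run = run + 1 if b == previous else 1
        longest = max(longest, run)
        previous = b
    return longest


def crossing_points(x):
    """Number of times consecutive observations switch sides of the mean."""
    mean = sum(x) / len(x)
    above = [v > mean for v in x]
    return sum(1 for a, b in zip(above, above[1:]) if a != b)


flat_example = flat_spots([1, 1, 1, 5, 10, 10, 1])
crossing_example = crossing_points([1, 3, 1, 3, 1])
```

In the first example the three leading observations share a bin, giving a run of length 3; in the second the series switches sides of its mean (1.8) four times.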

2015/6/28 15:50:11
Results of our approach
Our approach first extracts the two most significant principal components (PCs) from all time series and then determines the outliers in the new 2D “feature space”. For multidimensional outlier detection on the PC space we show results for the density-based method (HDR) and for the α-hull method.