Large-Scale Unusual Time Series Detection 2015

Exploring the feature space of large collections of time series

Video Hyndman.pdf
Time series anomaly detection: code and tools
Workshop on Frontiers in Functional Data Analysis
Banff, Canada.

It is becoming increasingly common for organizations to collect very large amounts of data over time. Data visualization is essential for exploring and understanding structures and patterns, and for identifying unusual observations. However, the sheer quantity of data available challenges current time series visualization methods.

For example, Yahoo has banks of mail servers that are monitored over time. Many measurements on server performance are collected every hour for each of thousands of servers. We wish to identify servers that are behaving unusually.

Alternatively, we may have thousands of time series we wish to forecast, and we want to be able to identify the types of time series that are easy to forecast and those that are inherently challenging.

I will demonstrate a functional data approach to this problem using a vector of features on each time series, measuring characteristics of the series. For example, the features may include lag correlation, strength of seasonality, spectral entropy, etc. Then we use a principal component decomposition on the features, and plot the first few principal components. This enables us to explore a lower dimensional space and discover interesting structure and unusual observations.
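
As a rough illustration of the "vector of features" idea, the sketch below computes two of the features named above (lag-1 autocorrelation and spectral entropy) for a single series. It is not the code used in the talk: the function names are made up for this example, the periodogram uses a naive DFT, and scaling the entropy to [0, 1] is an assumption.

```python
import cmath
import math


def lag1_autocorrelation(x):
    """Sample autocorrelation of the series at lag 1."""
    m = sum(x) / len(x)
    denom = sum((v - m) ** 2 for v in x)
    if denom == 0:
        return 0.0  # constant series: define the feature as 0
    num = sum((x[t] - m) * (x[t - 1] - m) for t in range(1, len(x)))
    return num / denom


def spectral_entropy(x):
    """Shannon entropy of the normalised periodogram, scaled to [0, 1].

    Assumption: a naive O(n^2) DFT over the positive frequencies is good
    enough for a sketch; a real implementation would use an FFT.
    """
    n = len(x)
    m = sum(x) / n
    centred = [v - m for v in x]
    power = []
    for k in range(1, n // 2 + 1):
        coef = sum(c * cmath.exp(-2j * math.pi * k * t / n)
                   for t, c in enumerate(centred))
        power.append(abs(coef) ** 2)
    total = sum(power)
    if total == 0 or len(power) < 2:
        return 0.0
    probs = [p / total for p in power]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(power))


def feature_vector(x):
    """Hypothetical two-element feature vector for one time series."""
    return [lag1_autocorrelation(x), spectral_entropy(x)]
```

A smooth periodic series should give a high autocorrelation and a low entropy, while white noise should give the opposite, which is what makes these features useful for separating series by how forecastable they are.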

Large-scale unusual time series detection

Rob J Hyndman¹, Earo Wang¹ and Nikolay Laptev²

¹ Monash Business School, Monash University, Clayton, Victoria, Australia.
² Yahoo Labs, Sunnyvale, California, USA.

Abstract: It is becoming increasingly common for organizations to collect very large amounts of data over time, and to need to detect unusual or anomalous time series. For example, Yahoo has banks of mail servers that are monitored over time. Many measurements on server performance are collected every hour for each of thousands of servers. We wish to identify servers that are behaving unusually.
We compute a vector of features on each time series, measuring characteristics of the series. The features may include lag correlation, strength of seasonality, spectral entropy, etc. Then we use a principal component decomposition on the features, and use various bivariate outlier detection methods applied to the first two principal components. This enables the most unusual series, based on their feature vectors, to be identified. The bivariate outlier detection methods used are based on highest density regions and α-hulls.

A new R package for detecting unusual time series

The anomalous package provides some tools to detect unusual time series in a large collection of time series. This is joint work with Earo Wang (an honours student at Monash) and Nikolay Laptev (from Yahoo Labs). Yahoo is interested in detecting unusual patterns in server metrics.
The package is based on this paper with Earo and Nikolay.
The basic idea is to measure a range of features of the time series (such as strength of seasonality, an index of spikiness, first-order autocorrelation, etc.). Then a principal component decomposition of the feature matrix is calculated, and outliers are identified in the 2-dimensional space of the first two principal component scores.
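
The principal component step can be sketched in plain Python as follows. This is only an illustration, not the anomalous package's implementation: it assumes the features are standardised first, and it finds the first two components by power iteration with deflation, a technique swapped in here to avoid external libraries. All function names are hypothetical.

```python
import math
import random


def standardise(rows):
    """Centre each feature column and scale it to unit variance."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    sds = []
    for j in range(p):
        var = sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1)
        sds.append(math.sqrt(var) or 1.0)  # leave constant columns unscaled
    return [[(r[j] - means[j]) / sds[j] for j in range(p)] for r in rows]


def covariance(z):
    """Sample covariance matrix of already-centred rows."""
    n, p = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(p)]
            for i in range(p)]


def leading_eigenvector(c, iters=500, seed=0):
    """Power iteration for the dominant eigenvector of a symmetric matrix."""
    rng = random.Random(seed)  # random start avoids an unlucky orthogonal guess
    p = len(c)
    v = [rng.uniform(-1.0, 1.0) for _ in range(p)]
    for _ in range(iters):
        w = [sum(c[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def first_two_pcs(rows):
    """Return (scores, (v1, v2)): 2-D PC scores and the two loading vectors."""
    z = standardise(rows)
    c = covariance(z)
    p = len(c)
    v1 = leading_eigenvector(c)
    lam1 = sum(v1[i] * c[i][j] * v1[j] for i in range(p) for j in range(p))
    # Deflation: remove the first component, then repeat the power iteration.
    c2 = [[c[i][j] - lam1 * v1[i] * v1[j] for j in range(p)] for i in range(p)]
    v2 = leading_eigenvector(c2)
    scores = [[sum(a * b for a, b in zip(r, v1)),
               sum(a * b for a, b in zip(r, v2))] for r in z]
    return scores, (v1, v2)
```

Each row passed to `first_two_pcs` is the feature vector of one time series, so the returned scores give one 2-D point per series, which is the space in which outliers are then sought.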

We use two methods to identify outliers.

A bivariate kernel density estimate of the first two PC scores is computed, and the points are ordered based on the value of the density at each observation. This gives us a ranking of most outlying (least density) to least outlying (highest density).
A series of α-convex hulls is computed on the first two PC scores with decreasing α, and points are classified as outliers when they become singletons separated from the main hull. This gives us an alternative ranking with the most outlying having separated at the highest value of α, and the remaining outliers with decreasing values of α.
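
The first of the two methods, ranking points by their estimated density, can be sketched as below. This is a minimal illustration and not the package's code: it assumes a product Gaussian kernel and a Scott-style rule-of-thumb bandwidth (standard deviation times n^(-1/6) in two dimensions), and the function name is hypothetical.

```python
import math


def density_ranking(points):
    """Return point indices ordered from lowest to highest estimated density.

    Assumptions (not from the source): Gaussian product kernel, and a
    per-axis bandwidth h = sd * n ** (-1 / 6).
    """
    n = len(points)

    def sd(values):
        m = sum(values) / n
        return math.sqrt(sum((v - m) ** 2 for v in values) / (n - 1))

    hx = sd([p[0] for p in points]) * n ** (-1 / 6) or 1.0
    hy = sd([p[1] for p in points]) * n ** (-1 / 6) or 1.0
    norm = 1.0 / (n * 2.0 * math.pi * hx * hy)

    densities = []
    for xi, yi in points:
        total = sum(
            math.exp(-0.5 * (((xi - xj) / hx) ** 2 + ((yi - yj) / hy) ** 2))
            for xj, yj in points
        )
        densities.append(norm * total)

    # Most outlying (least density) first, as in the ranking described above.
    return sorted(range(n), key=lambda i: densities[i])
```

Applied to the first two PC scores, the first few indices returned are the candidate unusual series. The α-hull method is not sketched here because it needs a full α-shape construction.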

I explained the ideas in a talk last Tuesday given at a joint meeting of the Statistical Society of Australia and the Melbourne Data Science Meetup Group. Slides are available here. A link to a video of the talk will also be added there when it is ready.
The density ranking of PC scores was also used in my work on detecting outliers in functional data. See my 2010 JCGS paper and the associated rainbow package for R.
There are two versions of the package: one under an ACM licence, and a limited version under a GPL licence. Eventually we hope to make the GPL version contain everything, but we are currently dependent on the alphahull package which has an ACM licence.



A new open source data set for detecting time series outliers

Yahoo Labs has just released an interesting new data set useful for research on detecting anomalies (or outliers) in time series data. There are many contexts in which anomaly detection is important. For Yahoo, the main use case is in detecting unusual traffic on Yahoo servers.

The data set comprises real traffic to Yahoo services, along with some synthetic data. There are 367 time series in the data set, each of which contains between 741 and 1680 observations recorded at regular intervals. Each series is accompanied by an indicator series with a 1 if the observation was an anomaly, and 0 otherwise. The anomalies in the real data were determined by human judgement, while those in the synthetic data were generated algorithmically. For the synthetic data, some information about the components used to construct the data is also provided.

Although the Yahoo announcement claims that the data are publicly available, in fact they are only available to people with an edu address. Further, you have to apply to use them, and it takes about 24 hours before approval is granted. I have suggested that they remove these restrictions, and make the data available without restriction to anyone who wants to use them.

Research on anomaly detection in time series seems to be growing in popularity. Twitter has also released their own Anomaly Detection R package. Their approach has some similarities with my own tsoutliers function in the forecast package. The tso function in the tsoutliers package is another approach to the same problem.
Hopefully having a large public data set available will lead to improvements in time series outlier detection methods, at least for detecting outliers in internet traffic data.


2015/6/28 11:31:21
How this paper's problem differs (p. 1)
We are interested in the time series that are anomalous relative to the other time series in the same cluster, or more generally, in the same set. This type of anomaly detection is different from univariate anomaly detection, or even from multivariate point anomaly detection [6], because we are interested in identifying entire time series that are behaving unusually in the context of other metrics.

A toolkit already exists in R (p. 2)
Authors' contributions
First, we introduce a novel and accurate method of using PCA with α-convex hulls for finding anomalous time series. Second, we perform a study of possible features that are useful for the types of time series dynamics seen in web-traffic time series.

Why PCA works (p. 2)
Therefore, loosely speaking, the first k principal components capture the k most prevalent patterns in the data.

Methods used in this paper
To find anomalies in the first two PCs we use a multi-dimensional outlier detection algorithm. We have implemented both a density-based and an α-hull-based multidimensional outlier detection algorithm.
The density-based multi-dimensional anomaly detection algorithm [7] (Computing and Graphing Highest Density Regions) finds points in the first two principal components with lowest density. The α-hull method [15] (Generalizing the Convex Hull of a Sample: The R Package ...) is a generalization of the convex hull [6] (A Survey of Outlier Detection Methodologies), which is a bounding region of a point set. The α parameter in the α-hull method defines a generalized disk of radius α. When α is sufficiently large, the α-hull method is equivalent to the convex hull. Given α, an edge of the α-shape is drawn between two members of the finite point set if there exists a generalized disk of radius α containing the entire point set and the two points lie on its boundary.
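
The edge condition quoted above can be checked by brute force, which may help make it concrete. The sketch below is a literal reading of that sentence for ordinary disks in the plane, not the algorithm used by the alphahull package; the function names and the numerical tolerance are assumptions.

```python
import math
from itertools import combinations


def disk_centres(p, q, alpha):
    """Centres of the two circles of radius alpha passing through p and q."""
    dx, dy = q[0] - p[0], q[1] - p[1]
    d = math.hypot(dx, dy)
    if d == 0 or d > 2 * alpha:
        return []  # no circle of this radius passes through both points
    mx, my = (p[0] + q[0]) / 2, (p[1] + q[1]) / 2
    h = math.sqrt(alpha ** 2 - (d / 2) ** 2)
    ux, uy = -dy / d, dx / d  # unit normal to the segment pq
    return [(mx + h * ux, my + h * uy), (mx - h * ux, my - h * uy)]


def is_edge(p, q, points, alpha, tol=1e-9):
    """True if some radius-alpha disk with p and q on its boundary
    contains every point of the set (the condition as stated in the text)."""
    for cx, cy in disk_centres(p, q, alpha):
        if all(math.hypot(x - cx, y - cy) <= alpha + tol for x, y in points):
            return True
    return False


def edges(points, alpha):
    """All index pairs (i, j), i < j, that satisfy the edge condition."""
    return [(i, j) for i, j in combinations(range(len(points)), 2)
            if is_edge(points[i], points[j], points, alpha)]
```

With a very large α each disk behaves almost like a half-plane, so the edges found are those of the convex hull, which matches the remark that the method reduces to the convex hull when α is sufficiently large.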

2015/6/28 15:20:03
The variance of the variances across blocks measures the “lumpiness” of the series.
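
A minimal sketch of the lumpiness feature follows. The source does not say how the blocks are formed, so non-overlapping blocks of a fixed width are assumed here, and `block_size` is a hypothetical parameter.

```python
from statistics import variance


def lumpiness(x, block_size):
    """Variance of the per-block variances of the series.

    Assumptions: non-overlapping blocks of width block_size (at least 2),
    sample variances, and any incomplete trailing block is dropped.
    """
    blocks = [x[i:i + block_size]
              for i in range(0, len(x) - block_size + 1, block_size)]
    if block_size < 2 or len(blocks) < 2:
        raise ValueError("need at least two blocks of two or more points")
    block_vars = [variance(b) for b in blocks]
    return variance(block_vars)
```

A series whose variability is the same in every block scores close to zero, while a series that is quiet in some blocks and volatile in others scores high.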
Some of our features rely on a robust STL decomposition.

2015/6/28 15:44:43
“Flat spots” are computed by dividing the sample space of a time series into ten equal-sized intervals, and computing the maximum run length within any single interval.
Finally, “crossing points” are defined as the number of times a time series crosses the mean line.
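
These two features can be sketched as below. The code assumes that "sample space" means the observed range from minimum to maximum, split into ten equal-width bins, and that a crossing is counted whenever consecutive observations fall on opposite sides of the mean. The function names are hypothetical.

```python
def flat_spots(x, n_bins=10):
    """Longest run of consecutive observations falling in the same bin."""
    lo, hi = min(x), max(x)
    if hi == lo:
        return len(x)  # a constant series is one long flat spot
    width = (hi - lo) / n_bins
    # The maximum value is folded into the last bin.
    bins = [min(int((v - lo) / width), n_bins - 1) for v in x]
    longest = run = 1
    for prev, cur in zip(bins, bins[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return longest


def crossing_points(x):
    """Number of times the series moves from one side of its mean to the other."""
    m = sum(x) / len(x)
    above = [v > m for v in x]
    return sum(a != b for a, b in zip(above, above[1:]))
```

Values exactly equal to the mean are treated as being below it, which is a design choice of this sketch rather than something stated in the paper.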

2015/6/28 15:50:11
Our approach and results
Our approach first extracts the two most significant principal components (PCs) from all time series and then determines the outliers in the new 2D “feature space”. For multidimensional outlier detection on the PC space we show results for the density-based method (HDR) and for the α-hull method.

References

6 R: A Language and Environment for Statistical Computing
21 A PCA-based Similarity Measure for Multivariate Time Series

