"Breakout events in a time series can reveal unusual activity that has happened in the past, as well as possible forthcoming level shifts and other anomalous behavior in the near future. Detecting them helps you understand the time series better and can tell you where to look, revealing valuable insights that you may so far have overlooked."
Installation and setup
For this analysis, we are going to use 3 packages that offer facilities to detect breakouts:
- AnomalyDetection (available at Twitter’s GitHub page)
- changepoint (available on CRAN)
- strucchange (available on CRAN)
While the AnomalyDetection package has its own mechanism to plot graphs, we will use the ‘autoplot’ function in ggplot2 along with the ggfortify package, which enables autoplot to draw time series graphs. The data we use for this analysis is the Australian air passengers data between the years 1970 and 2009, which is available as the ‘ausair’ time series in the ‘fpp’ package.
devtools::install_github("twitter/AnomalyDetection") # install twitter's AnomalyDetection
library(AnomalyDetection) # for 'AnomalyDetectionTs'
library(fpp) # for 'ausair' data
library(ggfortify) # enable timeseries in autoplot
myTS <- ausair # initialise data
myPeriod <- "year" # set the period
ymth <- paste(start(myTS), collapse="/")
startDate <- as.Date(paste(ymth, "1", sep="/"), format="%Y/%m/%d") # start date
eymth <- paste(end(myTS), collapse="/")
endDate <- as.Date(paste(eymth, "1", sep="/"), format="%Y/%m/%d") # end date
Dates <- seq.Date(startDate, endDate, by=myPeriod) # create the dates
Dates <- as.POSIXct(Dates) # convert to POSIXct (base R, so no extra package is needed)
myData <- data.frame(Dates, myTS) # cast as a data.frame
AnomalyDetectionTs(myData, max_anoms = 0.2, direction='both', plot=TRUE) # perform anomaly detection and plot
What happened in the code above? Our aim is to prepare the data in the format required by the AnomalyDetectionTs function. It takes as its first argument a data frame that has time stamps in the first column and the actual values of the time series in the second column. Since the original ‘ausair’ data is available to us as a time series, we do the above steps to convert it to the required data frame format before applying AnomalyDetectionTs to it.
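The conversion can also be sketched in base R alone, reading the dates from the series' own time index. This is a minimal illustration, not the code used for the results in this post: ‘demo_ts’, ‘stamps’ and ‘demo_df’ are hypothetical names, and the values are made up rather than taken from ‘ausair’.

```r
# Hypothetical stand-in for an annual series such as 'ausair' (made-up values)
demo_ts <- ts(c(7.3, 7.6, 8.1, 9.4, 10.7), start = 1970, frequency = 1)

# One timestamp per observation, taken from the series' own time index
years  <- floor(as.numeric(time(demo_ts)))
stamps <- as.POSIXct(paste0(years, "-01-01"), tz = "UTC")

# Two columns: time stamps first, numeric values second
demo_df <- data.frame(timestamp = stamps, count = as.numeric(demo_ts))
print(demo_df)
```

Using ‘time()’ avoids assembling the start and end dates by hand, but it assumes an annual series; a monthly or quarterly series would need the fractional part of the time index as well.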
The key arguments of AnomalyDetectionTs are ‘max_anoms’, the maximum share of data points that may be flagged as anomalies (0.2 here, i.e. up to 20% of the observations), and ‘direction’ (pos/neg/both), the direction in which anomalies are to be looked for.
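To get a feel for what these two arguments control, here is a deliberately simplified sketch. It is not the algorithm inside AnomalyDetection, which uses a more elaborate seasonal test; ‘flag_extremes’ and its ‘threshold’ argument are hypothetical and exist only for this illustration. It scores each point by its distance from the median in MAD units, looks in the requested direction, and flags at most ‘max_anoms’ times the number of points.

```r
# Hypothetical helper: NOT the AnomalyDetection algorithm, just an illustration
flag_extremes <- function(x, max_anoms = 0.1,
                          direction = c("both", "pos", "neg"),
                          threshold = 3) {
  direction <- match.arg(direction)
  z <- (x - median(x)) / mad(x)                  # robust z-score (assumes mad(x) > 0)
  score <- switch(direction, pos = z, neg = -z, both = abs(z))
  k <- floor(max_anoms * length(x))              # cap on the number of flagged points
  candidates <- order(score, decreasing = TRUE)[seq_len(k)]
  sort(candidates[score[candidates] > threshold])  # positions of flagged points
}

x <- c(10, 11, 9, 10, 30, 10, 11, -12, 9, 10)    # made-up data: one spike up, one down
both_dir <- flag_extremes(x, max_anoms = 0.2, direction = "both")
pos_only <- flag_extremes(x, max_anoms = 0.2, direction = "pos")
capped   <- flag_extremes(x, max_anoms = 0.1, direction = "both")
print(list(both = both_dir, pos = pos_only, capped = capped))
```

With ‘direction = "pos"’ the downward spike is ignored, and lowering ‘max_anoms’ to 0.1 leaves room for only one of the two spikes.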
Applying the function returns a time series graph that highlights the breakout points and a $anoms element that lists them. We may infer that the focus of these breakouts is on future events, because the marked points typically lie at rising points in the time series, where a breakout seems to have initiated a different level for the series. My guess is that, had the time series contained low-lying points leading to a sharper movement in the negative direction, those would have been marked as breakouts as well.
$anoms
   timestamp    anoms
1 2002-01-01 39.02158
2 2003-01-01 41.38643
3 2004-01-01 41.59655
4 2005-01-01 44.65732
5 2006-01-01 46.95177
6 2007-01-01 48.72884
7 2008-01-01 51.48843
8 2009-01-01 50.02697
library(changepoint) # for 'cpt.meanvar'
autoplot(cpt.meanvar(myTS), size=1.5, colour="firebrick") +
labs(x="Date", y="Total Annual Air Passengers", title="AusAir - Changepoint") + # add labels
theme(plot.title = element_text(size=20, face="bold", vjust=2), # style the axis and title text
plot.margin=unit(c(10,10,0,0),"mm")) # adjust plot margin
In the above code we use the ‘cpt.meanvar’ function to detect the change points, which are essentially points at which the mean and variance of the series shift, so that what follows stands out from the rest of the series. The main engine of the above code is the ‘autoplot’ call; the ‘labs’ and ‘theme’ lines that follow are meant for styling the graph.
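The idea of a change in mean and variance can be sketched in base R for the simplest case of a single change point. This is an illustration under a Gaussian assumption, not the changepoint package's implementation: ‘seg_cost’ and ‘best_split’ are hypothetical helpers, the data are made up, and no penalty or significance test is applied. Each candidate split is scored by how well two separate normal distributions fit the two halves.

```r
# Twice the negative Gaussian log-likelihood of a segment, up to a constant
seg_cost <- function(x) {
  sigma2 <- max(mean((x - mean(x))^2), 1e-8)  # MLE variance, floored to avoid log(0)
  length(x) * log(sigma2)
}

# Try every split point and keep the one with the lowest combined cost
best_split <- function(y, min_len = 2) {
  n <- length(y)
  taus <- min_len:(n - min_len)               # last observation of the first segment
  costs <- sapply(taus, function(tau) {
    seg_cost(y[1:tau]) + seg_cost(y[(tau + 1):n])
  })
  list(changepoint = taus[which.min(costs)],
       gain = seg_cost(y) - min(costs))       # improvement over fitting no change
}

# Made-up series: calm around 10, then higher and much more volatile
y <- c(10, 11, 9, 10, 11, 9, 20, 26, 14, 22, 18, 25)
res <- best_split(y)
print(res)
```

A real detector would compare ‘gain’ against a penalty before declaring a change, which is what keeps it from reporting a change point in every series.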
library(strucchange) # for 'breakpoints'
bpts <- breakpoints(myTS ~ 1) # get the breakpoints
autoplot(bpts, ts.colour="firebrick", size=1.5, cpt.linetype="solid") +
labs(x="Date", y="Total Annual Air Passengers", title="AusAir - Strucchange") + # add labels
theme(plot.title = element_text(size=20, face="bold", vjust=2)) # style the title text
The breakpoints function uses a linear-regression-based approach to compute the breaks. It tries to partition the time series into segments; with the formula myTS ~ 1 each segment is fitted with a constant, so the breaks mark shifts in the mean level. The algorithm for computing the optimal breakpoints, given the number of breaks, is based on a dynamic programming approach using the Bellman principle. The main computational effort is to compute a triangular RSS matrix, which gives the residual sum of squares for a segment starting at observation i and ending at i’ with i < i’. Each reported breakpoint is the number of the observation that is the last in its segment.
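The dynamic programming idea can be sketched in base R for the constant-mean case. This is a simplified illustration, not strucchange's code: ‘seg_rss’ and ‘optimal_breaks’ are hypothetical helpers, the minimum segment length ‘h’ is an assumption of the sketch, and the RSS matrix is filled naively rather than efficiently.

```r
# RSS of a constant-mean fit on observations i..j
seg_rss <- function(y, i, j) {
  seg <- y[i:j]
  sum((seg - mean(seg))^2)
}

# Optimal split of y into m + 1 segments of at least h observations each
optimal_breaks <- function(y, m, h = 2) {
  n <- length(y)
  rss <- matrix(Inf, n, n)                  # triangular RSS matrix, rss[i, j] for i <= j
  for (i in 1:n) for (j in i:n) {
    if (j - i + 1 >= h) rss[i, j] <- seg_rss(y, i, j)
  }
  cost <- matrix(Inf, m + 1, n)             # cost[k, j]: best RSS for y[1..j] in k segments
  last <- matrix(NA_integer_, m + 1, n)     # last[k, j]: where segment k - 1 ends
  cost[1, ] <- rss[1, ]
  for (k in seq_len(m) + 1) {
    for (j in 1:n) {
      if (j < k * h) next                   # not enough observations for k segments
      ends <- ((k - 1) * h):(j - h)         # candidate ends of the previous segment
      totals <- cost[k - 1, ends] + rss[cbind(ends + 1, j)]
      best <- which.min(totals)
      cost[k, j] <- totals[best]            # Bellman step: best sub-solution plus one segment
      last[k, j] <- ends[best]
    }
  }
  breaks <- integer(m)                      # backtrack from the full series
  j <- n
  for (k in rev(seq_len(m) + 1)) {
    j <- last[k, j]
    breaks[k - 1] <- j
  }
  list(breakpoints = breaks, rss = cost[m + 1, n])
}

# Made-up series with three clear levels
y <- c(1, 1.2, 0.9, 1.1, 5, 5.1, 4.9, 5.2, 9, 9.1, 8.8, 9.2)
fit <- optimal_breaks(y, m = 2)
print(fit)
```

As in the text, each breakpoint is the number of the last observation in its segment. The number of breaks ‘m’ is given here; choosing it is a separate step that this sketch leaves out.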
Optimal 5-segment partition:

Call:
breakpoints.formula(formula = myTS ~ 1)

Breakpoints at observation number:
9 21 27 33

Corresponding to breakdates:
1978 1990 1996 2002
The three methods discussed here approach breakouts in quite different ways. It is up to you, the investigator, to decide which method to use based on your problem’s specific objectives.