Using R to Test Pairs of Securities for Cointegration
Ernie Chan's book, Quantitative Trading, explains why cointegrated pairs of securities are useful for constructing mean-reverting trades. It also explains how to test pairs of securities for cointegration. Ernie uses Matlab, but some readers may want to use R, the software for statistical computing and graphics. This note explains how to perform the cointegration test using R.
Let's assume you have the history of daily prices for two securities, GLD and GDX. You want to know if the prices are cointegrated. Let's also assume the data are captured in two files, GLD.csv and GDX.csv, which are in comma-separated value (CSV) format with seven columns: date, open, high, low, close, volume, and adjusted close.
Two notes before we begin.
- This is not a tutorial on R. (Sorry.) You can learn about R from its official web site, from An Introduction to R, from one of the tutorials, or from one of the available books.
- This is also not a tutorial on cointegration. For that I recommend Ernie Chan's book; or, for the more mathematically inclined, the book by Bernhard Pfaff, Analysis of Integrated and Cointegrated Time Series with R. Both books are excellent.
Data Representation
You could perform this work using vectors or data frames to represent your time series data, but that would be tedious. I use zoo objects to represent time series, and I strongly recommend either the zoo package or the xts package. xts is a superset of zoo with an extremely fast implementation and many other features. Here, I'll assume you are using zoo.

Once your data are captured in a zoo object, say t, it behaves like a data frame. One zoo object can contain several columns, where each column is a different time series and each row is a set of (simultaneous) observations of those series. The object provides an additional attribute, index(t), which is the vector of dates, one date per observation. The first and last dates are available using start(t) and end(t), respectively.
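For instance, here is a tiny, self-contained sketch (with made-up prices and dates, not the GLD/GDX data) showing how a zoo object and its date index behave.

library(zoo)

# Build a small zoo object from a vector of (made-up) prices
# and a vector of dates, then inspect its date index.
#
prices <- c(100.0, 100.5, 99.8)
dates <- as.Date(c("2009-08-06", "2009-08-07", "2009-08-10"))
z <- zoo(prices, dates)

index(z)   # the vector of dates
start(z)   # the first date
end(z)     # the last date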
Loading the Data
Loading the CSV files into a zoo object requires a few simple steps.
- Read the two files into two data frames.
- Convert the dates from strings into Date objects.
- Convert the two data frames into two zoo objects.
- Take the intersection of the two zoo objects. That will create one zoo object with the observations common to both datasets.
Here is the R code.
library(zoo)      # Load the zoo package

# Read the CSV files into data frames
#
gld <- read.csv("GLD.csv", stringsAsFactors=F)
gdx <- read.csv("GDX.csv", stringsAsFactors=F)

# The first column contains dates. The as.Date
# function can convert strings into Date objects.
#
gld_dates <- as.Date(gld[,1])
gdx_dates <- as.Date(gdx[,1])

# The seventh column contains the adjusted close.
# We use the zoo function to create zoo objects from that data.
# The function takes two arguments: a vector of data and
# a vector of dates.
#
gld <- zoo(gld[,7], gld_dates)
gdx <- zoo(gdx[,7], gdx_dates)

# The merge function can combine two zoo objects,
# computing either their intersection (all=FALSE)
# or union (all=TRUE).
#
t.zoo <- merge(gld, gdx, all=FALSE)

# At this point, t.zoo is a zoo object with two columns: gld and gdx.
# Most statistical functions expect a data frame for input,
# so we create a data frame here.
#
t <- as.data.frame(t.zoo)

# Tell the user what dates are spanned by the data.
#
cat("Date range is", format(start(t.zoo)), "to", format(end(t.zoo)), "\n")
The as.Date function assumes your date strings are formatted like yyyy-mm-dd, which is the ISO standard format. If your strings use the common US format of mm/dd/yy, include a format specification.

gld_dates <- as.Date(gld[,1], format="%m/%d/%y")
gdx_dates <- as.Date(gdx[,1], format="%m/%d/%y")
Constructing the Spread
In Ernie's book, he first tests for cointegration, then he constructs the spread. In R, we do it the other way around: first we construct the spread, then we test the spread for a unit root. If the spread has a root inside the unit circle, the underlying securities are cointegrated. (See Bernhard Pfaff's book for a discussion of unit roots and their significance.)
The spread is defined this way.

S = y - (β × x)

where β is the hedge ratio, calculated using ordinary least squares (OLS). Rearranging terms, we want to find the value of β which best fits this equation.

y = β × x
This is a simple linear equation with no y intercept. In R, the lm function can fit a linear model such as this.
# The lm function builds linear regression models using OLS.
# We build the linear model, m, forcing a zero intercept,
# then we extract the model's first regression coefficient.
#
m <- lm(gld ~ gdx + 0, data=t)
beta <- coef(m)[1]
cat("Assumed hedge ratio is", beta, "\n")
# Now compute the spread
#
sprd <- t$gld - beta*t$gdx
The first argument to lm is a formula, which specifies the linear model. The formula gld ~ gdx + 0 says the model is

GLD_i = β × GDX_i + ε_i

(If we omit "+ 0" from the formula, R would fit a y intercept, too.)
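To see the difference, you can fit both forms and compare their coefficients; this is just an aside, reusing the data frame t built above.

# Fit the zero-intercept model used above, and a model with a
# free intercept, then compare their coefficients.
#
m0 <- lm(gld ~ gdx + 0, data=t)   # one coefficient: the hedge ratio
m1 <- lm(gld ~ gdx, data=t)       # two coefficients: intercept and slope
coef(m0)
coef(m1)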
Testing for Cointegration
The Augmented Dickey-Fuller test is a basic statistical test for a unit root, and several R packages implement that test. Here, we will use the adf.test function, which is implemented in the tseries package. The function returns an object which contains the test results. In particular, it contains the p-value that we want.
library(tseries)      # Load the tseries package

# Setting alternative="stationary" chooses the appropriate test.
# Setting k=0 forces a basic (not augmented) test. See the
# documentation for its full meaning.
#
ht <- adf.test(sprd, alternative="stationary", k=0)

cat("ADF p-value is", ht$p.value, "\n")
Setting alternative="stationary" is important.
- To a statistician, it means we test the null hypothesis that the spread is non-stationary against the alternative that it is stationary (rather than explosive).
- For everyone else, it means that if the p-value is small, the spread is mean-reverting. By "small," I mean less than 0.10 or less than 0.05, depending on how fussy you are. (Smaller is better.)
We can interpret the ADF test results for the user.
# The ht object contains the p-value from the ADF test.
# The p-value is the probability that the spread is NOT
# mean-reverting. Hence, a small p-value means it is very
# improbable that the spread is NOT mean-reverting
# (got that?).
#
if (ht$p.value < 0.05) {
    cat("The spread is likely mean-reverting.\n")
} else {
    cat("The spread is not mean-reverting.\n")
}
One word of caution: The adf.test function essentially detrends your data before testing for stationarity. If your data contain a strong trend, you might be very surprised to learn they are "mean reverting" when they are obviously moving upward or downward. If this is a problem for you, consider the fUnitRoots package, which contains the adfTest function (note the spelling!). That function lets you analyze either with or without the trend assumption.
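Here is a rough sketch of that analysis (assuming the fUnitRoots package is installed and sprd has been computed as above); the type argument selects the deterministic terms in the test regression.

library(fUnitRoots)

# Test the spread without and with a trend term.
# type="nc" means no constant, "c" a constant only,
# and "ct" a constant plus a time trend.
#
adfTest(sprd, type="nc")   # no constant, no trend
adfTest(sprd, type="ct")   # constant and trend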
Putting It All Together
OK, I've talked a lot, but it's really not much code. Here it is with some fat removed.
library(zoo)
library(tseries)

gld <- read.csv("GLD.csv", stringsAsFactors=F)
gdx <- read.csv("GDX.csv", stringsAsFactors=F)

gld <- zoo(gld[,7], as.Date(gld[,1]))
gdx <- zoo(gdx[,7], as.Date(gdx[,1]))

t.zoo <- merge(gld, gdx, all=FALSE)
t <- as.data.frame(t.zoo)

cat("Date range is", format(start(t.zoo)), "to", format(end(t.zoo)), "\n")

m <- lm(gld ~ gdx + 0, data=t)
beta <- coef(m)[1]

cat("Assumed hedge ratio is", beta, "\n")

sprd <- t$gld - beta*t$gdx

ht <- adf.test(sprd, alternative="stationary", k=0)

cat("ADF p-value is", ht$p.value, "\n")

if (ht$p.value < 0.05) {
    cat("The spread is likely mean-reverting.\n")
} else {
    cat("The spread is not mean-reverting.\n")
}
When I ran this code on recent data, I got this output.
Date range is 2006-05-23 to 2009-08-10
Assumed hedge ratio is 1.928096
ADF p-value is 0.2609909
The spread is not mean-reverting.
Postscripts
I used many R functions, but did not try to explain all their nuances. If you have any questions (e.g., "Why did you set stringsAsFactors=FALSE for read.csv?"), use the help facility to learn more about the function:

> ?read.csv
These notes assume your data are captured inside CSV files. I did that just to mirror the example in Ernie Chan's book. R can read its data from many places. To learn more about R's input/output in general, start with the R Data Import/Export manual.
To learn more about loading financial data in particular, visit the web site for the quantmod package, an excellent source of tools for anyone working with financial data.
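As a small sketch (assuming the quantmod package is installed and Yahoo! Finance is reachable), its getSymbols function can download price histories directly, and Ad extracts the adjusted close.

library(quantmod)

# Download GLD and GDX as xts objects, then keep the
# adjusted close columns.
#
getSymbols(c("GLD", "GDX"), src="yahoo")
gld_adj <- Ad(GLD)
gdx_adj <- Ad(GDX)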
R's input/output is so flexible, in fact, that you can read data directly from web sites, such as the Yahoo! Finance web site. Just use a URL instead of a file name, like this.

gld <- read.csv("http://ichart.finance.yahoo.com/table.csv?s=GLD&ignore=.csv",
    stringsAsFactors=F)
gdx <- read.csv("http://ichart.finance.yahoo.com/table.csv?s=GDX&ignore=.csv",
    stringsAsFactors=F)

This eliminates the extra step of saving the data to an intermediate file.
I used the adf.test function, above, to test for stationarity. The ADF test is implemented in several other packages, too, such as the fUnitRoots and urca packages. Bernhard Pfaff's book contains an entire chapter on tests for unit roots, including tests beyond the original ADF test.
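For instance (just a sketch, assuming the urca package is installed and sprd is available), the ur.df function in urca runs an ADF-style test, and summary prints the test statistic with its critical values.

library(urca)

# ADF-style test on the spread with no deterministic terms
# ("none") and one lagged difference.
#
ht2 <- ur.df(sprd, type="none", lags=1)
summary(ht2)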
Finally, high-powered statisticians will be offended by my statement that "the p-value [of the ADF test] is the probability that the spread is not mean-reverting." Just for the record, here is the strict interpretation: "The p-value is the probability that we could have observed the spread data, given the assumption that the spread is not mean-reverting."