# Using R to Test Pairs of Securities for Cointegration

Paul Teetor
August 2009

Ernie Chan's book, Quantitative Trading, explains why cointegrated pairs of securities are useful for constructing mean-reverting trades.  It also explains how to test pairs of securities for cointegration.  Ernie uses Matlab, but some readers may want to use R, the software for statistical computing and graphics.  This note explains how to perform the cointegration test using R.

Let's assume you have the history of daily prices for two stocks, GLD and GDX. You want to know if the prices are cointegrated. Let's also assume the data are captured in two files, `GLD.csv` and `GDX.csv`, which are in comma-separated value (CSV) format with seven columns: date, open, high, low, close, volume, and adjusted close.

Two notes before we begin.
• This is not a tutorial on R.  (Sorry.)  You can learn about R from its official web site, from An Introduction to R, from one of the tutorials, or from one of the available books.
• This is also not a tutorial on cointegration.  For that I recommend Ernie Chan's book; or, for the more mathematically inclined, the book by Bernhard Pfaff, Analysis of Integrated and Cointegrated Time Series with R.  Both books are excellent.

## Data Representation

You could perform this work using vectors or data frames to represent your time series data, but that would be tedious. I use objects for representation of time series, and I strongly recommend either the zoo package or the xts package.  xts is a super-set of zoo with an extremely fast implementation and many other features. Here, I'll assume you are using zoo.

Once your data are captured in a zoo object, say t, it behaves like a data frame. One zoo object can contain several columns, where each column is a different time series and each row is a set of (simultaneous) observations of those series.  The object provides an additional attribute, `index(t)`, which is the vector of dates, one date per observation.  The first and last dates are available using `start(t)` and `end(t)`, respectively.

Loading the CSV files into a zoo object requires a few simple steps.
1. Read the two files into two data frames.
2. Convert the dates from strings into Date objects.
3. Convert the two data frames into two zoo objects.
4. Take the intersection of the two zoo objects.  That will create one zoo object with the observations common to both datasets.
Here is the R code.

library(zoo)            # Load the zoo package

# Read the CSV files into data frames
#

# The first column contains dates.  The as.Date
# function can convert strings into Date objects.
#
gld_dates <- as.Date(gld[,1])
gdx_dates <- as.Date(gdx[,1])

# The seventh column contains the adjusted close.
# We use the zoo function to create zoo objects from that data.
# The function takes two arguments: a vector of data and
# a vector of dates.
#
gld <- zoo(gld[,7], gld_dates)
gdx <- zoo(gdx[,7], gdx_dates)

# The merge function can combine two zoo objects,
# computing either their intersection (all=FALSE)
# or union (all=TRUE).
#
t.zoo <- merge(gld, gdx, all=FALSE)

# At this point, t.zoo is a zoo object with two columns: gld and gdx.
# Most statistical functions expect a data frame for input,
# so we create a data frame here.
#
t <- as.data.frame(t.zoo)

# Tell the user what dates are spanned by the data.
#
cat("Date range is", format(start(t.zoo)), "to", format(end(t.zoo)), "\n")

The as.Date function assumes your date strings are formatted like yyyy-mm-dd, which is the ISO standard format. If your strings use the common US format of mm/dd/yy, include a format specification.

gld_dates <- as.Date(gld[,1], format="%m/%d/%y")
gdx_dates <- as.Date(gdx[,1], format="%m/%d/%y")

In Ernie's book, he first tests for cointegration, then he constructs the spread. In R, we do it the other way around: First we construct the spread, then we test the spread for a unit root.  It the spread has a root inside the unit circle, the underlying securities are cointegrated.  (See Bernhard Pfaff's book for a discusion of unit roots and their significance.)

The spread is defined this way.

Sy - (β × x)

where β is the hedge ratio, calculated using ordinary least squares (OLS).  Rearranging terms, we want to find the value of β which best fits this equation.

y = (-β) × x

This is a simple linear equation with no y intercept.  In R, the `lm` function can fit a linear model such as this.

# The lm function builds linear regression models using OLS.
# We build the linear model, m, forcing a zero intercept,
# then we extract the model's first regression coefficient.
#
m <- lm(gld ~ gdx + 0, data=t)
beta <- coef(m)

cat("Assumed hedge ratio is", beta, "\n")

#
sprd <- t\$gld - beta*t\$gdx

The first argument to `lm` is a formula, which specifies the linear model.  The formula `gld ~ gdx + 0` says the model is

GLDi = β×GDXi + εi

(If we omit "+ 0" from the formula, R would fit a y intercept, too.)

## Testing for Cointegration

The Augmented Dickey-Fuller test is a basic statistical test for a unit root, and several R packages implement that test.  Here, we will use the `adf.test` function which is implemented in the `tseries` package.  The function returns an object which contains the test results. In particular, it contains the p-value that we want.

library(tseries)            # Load the tseries package

# Setting alternative="stationary" chooses the appropriate test.
# Setting k=0 forces a basic (not augmented) test.  See the
# documentation for its full meaning.
#

Setting alternative="stationary" is important.
• To a statistician, it specifies a null hypothesis that the spread is non-stationary or explosive.
• For everyone else, it means that if the p-value is small, the spread is mean-reverting. By "small," I mean less than 0.10 or less than 0.05, depending how fussy you are.  (Smaller is better.)
We can interpret the ADF test results for the user.

# The ht object contains the p-value from the ADF test.
# The p-value is the probability that the spread is NOT
# mean-reverting.  Hence, a small p-value means it is very
# improbable that the spread is NOT mean-reverting
# (got that?).
#
if (ht\$p.value < 0.05) {
} else {
}

One word of caution:  The `adf.test` function essentially detrends your data before testing for stationarity.  If your data contains a strong trend, you might be very surprised to learn it is "mean reverting" when it is obvously moving upward or downward.  If this is a problem for you, consider the  fUnitRoots package which contains the `adfTest` function (note the spelling!).  That function lets you analyze either with or without the trend assumption.

## Putting It All Together

OK, I've talked a lot, but it's really not much code. Here it is with some fat removed.

library(zoo)
library(tseries)

gld <- zoo(gld[,7],
as.Date(gld[,1]))
gdx <- zoo(gdx[,7],
as.Date(gdx[,1]))

t.zoo <- merge(gld, gdx, all=FALSE
)
t <- as.data.frame(t.zoo)

cat("Date range is", format(start(t.zoo)), "to", format(end(t.zoo)), "\n")

m <- lm(gld ~ gdx + 0, data=t)
beta <- coef(m)

cat("Assumed hedge ratio is", beta, "\n")

sprd <- t\$gld - beta*t\$gdx

if (ht\$p.value < 0.05) {
} else {
}

When I ran this code on recent data, I got this output.

```Date range is 2006-05-23 to 2009-08-10 Assumed hedge ratio is 1.928096 ADF p-value is 0.2609909 The spread is not mean-reverting.```

## Postscripts

I used many R functions, but did not try to explain all their nuances.  If you have any questions (e.g., "Why did you set `stringsAsFactors=FALSE` for `read.csv`?"), use the help facility to learn more about the function:

```> ?read.csv```

```gld <- read.csv("http://ichart.finance.yahoo.com/table.csv?s=GLD&ignore=.csv", stringsAsFactors=F) gdx <- read.csv("http://ichart.finance.yahoo.com/table.csv?s=GDX&ignore=.csv", stringsAsFactors=F)```
I used the `adf.test` function, above, to test for stationarity.  The ADF test is implemented in several other packages, too, such as the fUnitRoots and urca packages.  Bernhard Pfaff's book contains an entire chapter on tests for unit roots, including tests beyond the original ADF test.