Study Book

STA3303 Statistics for Climate Research
Faculty of Sciences
Study Book
Written by Dr Peter Dunn Department of Mathematics & Computing Faculty of Sciences The University of Southern Queensland
ii
Published by University of Southern Queensland Toowoomba Queensland 4350 Australia http://www.usq.edu.au The University of Southern Queensland, 2007.2.
Copyrighted materials reproduced herein are used under the provisions of the Copyright Act 1968 as amended, or as a result of application to the copyright owner. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means electronic, mechanical, photocopying, recording or otherwise without prior permission.
A Produced using LTEX in the USQ style by the Department of Mathematics and Computing.
USQ, February 21, 2007
Table of Contents
Time Series Analysis
1
3 23 41 59 73 105 129 173 205
1 Introduction 2 Autoregressive (AR) models 3 Moving Average (MA) models 4 arma Models 5 Finding a Model 6 Diagnostic Tests 7 Non-Stationary Models 8 Markov chains 9 Other Models
II
Multivariate Statistics
213
215 iii
10 Introduction
iv 11 Principal Components Analysis 12 Factor Analysis 13 Cluster Analysis A Installing other packages in R B Review of statistical rules C Some time series tricks in R D Time series functions in R E Multivariate analysis functions in R
Table of Contents 225 255 279 291 293 299 301 305
Strand
Time Series Analysis
Module
1
. . . . . . . . . . . . . . . . . . . . . . . . . 4 4 4 7 9 10 12 13 13 17 18 19
Introduction
Module contents
1.1 1.2 Introduction 1.2.1 1.2.2 Time-series . . . . . . . . . . . . . . . . . . . . . . . . . . Denitions . . . . . . . . . . . . . . . . . . . . . . . . . Purpose . . . . . . . . . . . . . . . . . . . . . . . . . . .
1.2.3 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.3 Signal and noise . . . . . . . . . . . . . . . . . . . . . . . 1.4 1.5 Simple methods . . . . . . . . . . . . . . . . . . . . . . . Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.5.1 The r package . . . . . . . . . . . . . . . . . . . . . . . 1.5.2 Getting help in r . . . . . . . . . . . . . . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.6.1 Answers to selected Exercises . . . . . . . . . . . . . . .
1.6
Module objectives
Upon completion of this module students should be able to:
recognize and dene a time series;
Module 1. Introduction
understand what denes a stationary time series; know the particular kinds of time series being discussed in this course; recognise the reasons for nding statistical models for time series; understand the notation used to designate a time series; understand that a time series consists of a signal plus noise; understand that the signal of a time series can be modelled and that the noise is random; list some simple time series modelling methods; know how to use the software package r to do basic manipulations with time series data, including loading data, plotting the data, and dening the data as time series data.
1.1
Introduction
This Module introduces time series and associated terminology. Some simple methods are discussed for analysing time series, and the software used in the course is also introduced.
1.2
1.2.1
Time-series
Denitions
A time series is a sequence of observations ordered by time. Examples include the noon temperature measured daily at the Oakey airport, the annual sales of passenger cars in Australia, monthly average values of the southern oscillation index (SOI), the number of people receiving unemployment benets in Queensland each month, and the number of bits of information sent through a computer line per second. In each case, the observations are taken at regular time intervals. This is not necessary, but greatly simplies the mathematics; we will only be concerned with time series where observations are taken at regular intervals (that is, equally spaced: each month, each day or each year for example). In this course, the emphasis is on climatological applications; however time series are used in many branches of science and engineering, and are particularly common in business (sales forecasts, share markets and so on).
1.2. Time-series
A time series is interesting because the series is a function of past values of itself, and so the series is somewhat predictable. The task of the scientist is to nd out more about that relationship between observations. Unlike most statistics, the observations in a time series are not independent (that is, they are dependent). Time series are usually plotted using a time-plot, as in the next example. Example 1.1: The monthly Southern Oscillation Index (the SOI) is available for approximately the last 130 years. A plot of the monthly average SOI (Fig. 1.1) has time on the horizontal axis, and the SOI on the vertical axis. Generally, the observations are joined with a line to indicate that the points are given in a particular order. (Note the horizontal line at zero was added by me, and is not part of the default plot.)
Example 1.2: The seasonal SOI can also be examined. This series certainly does not consist of independent observations. The seasonal SOI can be plotted against the SOI for the previous season, the season before that, and so on (Fig. 1.2). There is a reasonably strong relationship between the seasonal SOI and the previous season. The relationship between the SOI and the season before that is still obvious; it is less obvious (but still present) with three seasons previous. There is basically no relationship between the seasonal SOI and the SOI four seasons previous.
A stationary time series is a time series whose statistics do not change over time. Such statistics are typically the mean and the variance (and the covariance, discussed in Sect. 2.5.3). Initially, only stationary time series are considered in this course. In Module 7, methods are discussed for modelling non-stationary time series and for identifying non-stationary time series. At present, identify a non-stationary time series simply using a time series plot of the data, as shown in the next Example. Example 1.3: Consider the annual rainfall near Wendover, Utah, USA. (These data are considered in more detail in Example 7.1.) A plot of the data (Fig. 1.3, top panel) suggests a non-stationary mean (the mean goes up and down a little). To check this, a smoothing lter was applied
30 20 10 SOI 0 10 20 30 40 1880 1900 1920 1940 Time 1960 1980 2000
20 10 SOI 0 10 20 30
1980
1985
1990 Time
1995
2000
Figure 1.1: A time-plot of the monthly average SOI. Top: the SOI from from 1876 to 2001; Bottom: the SOI since 1980 showing more detail. (In this example, the SOI has been plotted using las=1; this just make the labels on the vertical axis easier to read in my opinion, but is not necessary.)
1.2. Time-series
SOI vs SOI one season previous
q
7
SOI vs SOI two seasons previous
q q
20 10 0 10 20
q q
q q qq q q q q q q q q q q q q q qqq qq q qq q q q q q qqqq q q q q q q q qq q qq q q q qq q q qq qq q q qqq q q q qqq qq q q q q q qq q q q q qq q q q q qqq q q q q q qqq qqqqq q q q q q q q q q qqq qq qqq q q qq q q q q qq q qq q q q q q q q q q q q qqqqqq qqqq qq qq q qq qqq qq q qq q q q q q qqqqqq q q q q q qqqq q q qqq q qq qq q q q q qqq q qqq q qqqq q qqq q qq q q qqq qq qq q qqq qqqqqq qqqqqqq qq q qqqqqq q qq q q q qqqq q qqq q q q q qq q qq qqqqqqqq q q q q q q q qq q q q q q q q q q qq q q qq qq qq qq q q q qq q q q qq qq q q qq q qq q q q q qq q q q q q q q qqqq q q q qqqqq qqq q q q q qq q q q q qq q q q q q q qq qq q q q qqq qq q qq q q qq q q q qq q q q q q qqq q q q q q q q q q q q q q q q q q q q q q
20 10 0
q
10 20 30
q
q q qq q q q q q qq q qqq q q q q q q qq q q q q q q q q q q q q qqq q q q qq q q q q q q q qq q q qqq q q q q q q q q q q qqq qqqq q qq q q q q q q q q q q q qq qq q q q qq qq q qqq qqqq q q q q qq q q q q q q q q q qq q q q q q q qq q q q qqqqqqqq q q q q qqq q q qq q q qq q qq q q qqqq q q q qq q qq q q q q q q qq qq q qq q q qqqqq q q q q q q q q qqqqq qq q q q qqq q q q qqq qq qq q q q q qq q qq q qq q qq qqq q qqqqqqqqqqq q q qqq q qqq q q qq q qq q q q qq q q qq q q q q q qq qq qq q q qq q q q qq q q q q q q q q qq q q q q q q q q q q q qqq qqq qq qq q q q q q qq qq qq q q qq qqq q q qq qq q q q q qqq q qqq qqqq q qq q q q q q qq q q q q q qq q q q q q q qqq q q q q q q q q q q q q q q qqq q q q q q q q q qq qq q q q q q q q q q q q q q q q q q q q q q q q q q q q q
SOI at time t 1
30 30 20
10
10
20
SOI at time t 2
30
20
10
10
20
SOI at time t
SOI at time t
SOI vs SOI three seasons previous

q
SOI vs SOI four seasons previous

q
20 10 0
q
q q q q q q q q qq q q q q q qq q q q q q q qq q q q q qqq q qq qq q qqq qq q qq q q qq q qq q q q qq qq qq qq q q q q qq q q q q q q q qq q q qq q qqqqq qq qq q q q q q qq q q q q qq qqq q q q q q qqq q q q q q q qq q qq q q q q q q qqqqqq qqq q qq q q q q qq qqq qq q q q q q qq qq qqq q q q q q q q q q qq q qq q qq q qq q qq q q q q qqq q q q q qqq qqqq q q qq q q q qq q q qq q qq q q qq q q q qq q q q q q qqqq qqq q q qq q q q q qq q qqqq q q qq qq q qq q q q q q q q qqqqqq q q qq q qq q q qqq q qq q q q q q q q qq qqq qq q qqq q q q qq q q q q qq q q qq q q qq q q q qq q q qq q qq q q q qq q qq qqqqq q q q qq q q q qq qq q q q q q q q q q qq qq q q q q q q q qq q qq q qqq q q q q q qq qqq q qq q q q q q q qq q q qq q q q q q q q q qq q q q q q q q q q q q q q q
20
q
q q q q q qq q q q q qq q q q qq q q q q q q q q qq q q q q q q q q q q q q q q q q qq qqq qq q qq q q q q qq q q q q q q qqq q q qq q q qq q q qq q q q q q q q q q q qqq q qq q q q q qq qq qq q qq q q q q q q q q q qq q qq q q q q q q q q q q qq qq q q qq q q q qq qq q q qq qq qq q qqq q q q q q q qq q q q q q q qq q q q q q q q q qq qq q q qqqq qq q q q q qq qqq qq qq q q q q q q q qq q q q q q q qq qq q qqqqqqqq qqq q qq q q qq qq q q q q q q q q q q qq q q q qq q q qqqq q q q q qq q q q q q q q q q qqq qqqqq qqqqq q q q q q q q qq q qqq q q q q q q q qq q q q qq qqq q q q q qq q q q qq q q qq q q q q q q q q qq q qq q q q qq q q q q qq q q q qq q q q q q qq qq q qq q q q qq q q qq q q q q qq q q q q q q q q q q q q q q qq q q q q q q qqq q q q q q q q qq q q q qq q q q q q q qq q q q q q qq q q q q q q q
SOI at time t 3
SOI at time t 4
10
q
0 10 20 30
10 20 30 30
20
10
10
20
30
20
10
10
20
SOI at time t
SOI at time t
Figure 1.2: The seasonal SOI plotted against previous values of the SOI. that computed the mean of each set of six observations at a time. This smooth gave the thick, dark line in the bottom panel of Fig. 1.3, and suggests that the mean is perhaps non-stationary as this line is not (approximately) constant. However, it is not too bad. The middle panel of Fig. 1.3 shows a series that is denitely non-stationary. This seriesthe average monthly sea-level at Darwinis not stationary as the mean obviously uctuates. However, the SOI from 1876 to 2001, plotted in the top panel of Fig. 1.3 (and seen in Example 1.1), is approximately stationary.
All the time series considered in this part of the course will be equally spaced (or regular ). These are time series recorded at regular intervals every day, year, month, second, etc. Until Module 8, the time series are all considered for continuous data. In addition, only stationary time series will be considered initially (until Module 7).
1.2.2
Purpose
There are two main purposes of gathering time series:
30 20 10 SOI 0 10 20 30 40 1880 1900 1920 1940 Time 1960 1980 2000
4.2 Sea level (in m) 4.1 4.0 3.9 3.8 1988 1990 1992 1994 Time 1996 1998 2000
Annual rainfall (in mm)
500 400 300 200
1920
1940 Year
1960
1980
2000
Figure 1.3: Stationary and non-stationary time series. Bottom: the annual rainfall near Wendover, Utah, USA in mm is plotted. The data is plotted with a thin line, and the smoothed data in a thick line indicating that the mean is perhaps non-stationary. Middle: the monthly average sea level (in metres) in Darwin, Australia is plotted. The data are denitely not stationary, as the mean uctuates. Top: the average monthly SOI from 1876 to 2001 is shown. This series looks approximately stationary.
1.2. Time-series
1. First, it helps us understand the process underlying the observations. 2. Secondly, data is gathered to predict, or forecast, what may happen next. It is of great interest in climatology, for example, to predict the value of seasonal climatic indicators. In business, it is important to be able to predict future sales of products. Forecasting is the process of estimating future values of numerical parameters on the basis of the past. To do this, a model is created. This model is an articial equation that captures the important features of the data. Example 1.4: Consider the average monthly sea level (in metres) in Darwin, Australia (Fig. 1.3, middle panel). Any useful model for this time series would need to capture the important features of this time series. What are the important features? One obvious feature is that the series has a cyclic pattern: the average sea level rises and falls on a regular basis. Is there also an indication that the average sea level has been rising since about 1994? Any good model should capture these important features of the data. As noted in the previous Example, the series is not stationary.
Methods for modelling and forecasting time series are well established and rigourous and are sometimes quite accurate, but keep in mind the following:
Any forecast is only as good as the information it is based on. It is not possible for a good method of forecasting to make up for lack of information, or inaccurate information, about the process being forecasted. Some processes may be impossible to forecast with any useful accuracy (for example, future outcomes of a coin tossing experiment). Some processes are usefully forecast by means of complex expensive methodsfor example, daily regional weather forecasting.
1.2.3
Notation
Consider a sequence of numbers {Xn } = {X1 , X2 , . . . , XN }, ordered by time, so that Xa comes before Xb if a is less than b; that is, {Xn } is a time series. This notation indicates that the time series measures the variable X (which may be monthly rainfall, water temperatures or snowfall depths,
10
for example). The subscript indicates particular observations in the series. Hence, X1 is the rst observation, the rst recorded in the data. (Note that Y , W or some other letter may be used in place of X.) The notation Xt (or Xn , or similar) is used to indicate the value of the time series X at a particular point in time t. For dierent values of t, values of the time series at dierent points in time are indicated. That is, Xt+1 refers to the next term in the series following Xt . The entire series is usually written {Xn }n1 , indicating the variable X is a time sequence of numbers. Sometimes, the upper and lower limits are specied explicitly as {Xn }n=1000 . Quite often, the notation is abbreviated n=1 so that {Xn } {Xn }n1 .
1.3
Signal and noise
The observed and recorded time series, say {Xn }, consists of two components: 1. The signal. This is the component of the data that contains information, say {Sn } This is the component of the time series that can be forecast. 2. The noise. This is the randomness that is observed, which may be due to numerous other variables aecting the signal, measurement imperfections, etc. Because the noise is random, it cannot be forecast. The task of the scientist is to extract the signal (or information) from the time series in the presence of noise. There is no way of knowing exactly what the signal is; instead, statistical methods are used to separate the random noise from the forecastable signal. There are many methods for doing this; in this course, one of those methods will be studied in detail: the BoxJenkins method. Some other simple models are discussed in Sect. 1.4; more complex methods are discussed in Module 9. Example 1.5: Consider the monthly Pacic Decadal Oscillation, or PDO (obtained from monthly Sea-Surface Temperature (SST) anomalies in the North Pacic Ocean). The data from January 1980 to December 2000 (Fig. 1.4, top panel) is non-stationary. The data consist of a signal and noise. One way to extract the signal is to use a smoother.
1.3. Signal and noise
11
1 PDO 0 1 2 1980 1985 1990 Time 1995 2000
1 PDO signal 0 1 2 1980 1985 1990 Time 1995 2000
1.5 1.0 PDO noise 0.5 0.0 0.5 1.0 1.5 1980 1985 1990 Time 1995 2000
Figure 1.4: The monthly Pacic Decadal Oscillation (PDO) from Jan 1980 to Dec 2000 in the top plot. Middle: a lowess smooth is shown superimposed over the PDO. Bottom: the noise is shown (observations minus signal).
12
Module 1. Introduction A lowess smoother can be applied to the data1 . (The details are not importantit is simply one type of smoother.) For one set of parameters, the smooth is shown in Fig. 1.4 (middle panel). The smoother captures the important features of the time series, and ignores the random noise. The noise is shown Fig. 1.4 (bottom panel), and if the smooth is good, should be random. (In this example, the noise does not apear random, and so the model is probably not very good.) One diculty with using smoothers is that they have limited use for forecasting into the future, as the tted smoother apply only for the given data. Consequently, other methods are considered here.
1.4
Simple methods
Many methods exist for modelling time series. These notes concentrate only on the BoxJenkins method, though some other methods will be discussed very briey at the end of the course. It is very important to use the appropriate forecasting technique for each particular application however. The BoxJenkins technique is of general applicability, and has been used in many applications. In addition, studying the BoxJenkins method will enable the student to learn other techniques as appropriate: the language, basic techniques and skills are applicable to other methods also. In this section, a variety of simple methods for forecasting are rst discussed. Importantly, in some situations they are also the best method available. If this is the case, it may not be obviousit might require some careful statistical analysis to show that a simple model is the best model. Constant estimation The simplest possible approach is to use a constant forecast for all future values of the time series. This is appropriate when successive values of the time series are completely uncorrelated but do come from the same distribution. Slope estimation If the time series appears to have a linear trend, it may be appropriate to estimate this trend by tting a straight line by linear regression. Future values can then be forecast by extrapolating this line.
1 Many statisticians would probably not identify a smoother as a statistical model (in fact, I am one of them). But the use of a smoother here demonstrates a point.
1.5. Software
13
Random walk model In some cases, the best estimate of a future value is the most recent observation. This model is called a random walk model. For example, the best forecast of the future price of a share is usually quite close to the present price. Smoothing Smoothing is the name given to a collection of techniques which estimate future values of a time series by an average of past values. This approach makes sense when there is random variations which add onto a relatively stable trend in the process under study. An example has been seen in Example 1.5. Regression Another method of forecasting is to relate the parameter under study to some known parameter, or parameters, by means of a functional relationship which is statistically estimated using regression.
1.5
Software
Most standard statistical software packagessuch as SPSS, SAS, r and SPluscan analyse time series data. In addition, many mathematical packages (such as Matlab) can be used, but sometime require add-ons which usually cost money.
1.5.1
The
package
This course uses the free software package r. r is a free, open source software project which is not unlike S-Plus, an expensive commercial software package. r is available for many operating systems from http://cran. r-project.org/, or http://mirror.aarnet.edu.au/pub/CRAN/ for residents of Australia and New Zealand. More information about r, including documentation, is found at http://www.r-project.org/. r is command line driven like Matlab, but has a statistical rather than mathematical focus. r is object orientated. This means to get the most benet from r, objects should be correctly dened. For example, time series data should be declared as time series data. When r knows that a particular data set is a time series, it has default mechanisms of working with the data. For example, plotting data in r generally produces a dot-plot; if the data is declared as time series data, the data are joined by lines which is the standard way of plotting time series data. The following example explains some of these details.
14
Module 1. Introduction In r, you can set the working directory using (for example) setwd("c:/My Documents/USQ/STA3303/data"). Check the current working directory using getwd(). It is usually sensible to set this working directory as soon as you start r to the location of your data les. This will be assumed throughout these study notes.
Example 1.6: In Example 1.1, the monthly average SOI was plotted. Assming the current folder (or directory) is set as described above, the following code reproduces this plot. > soidata <- read.table("soiphases.dat", header = TRUE) The data is loaded using read.table. The option header=TRUE means that the rst row of the data contained header information (that is, names for the variables). An alternative method for loading the data directly from the internet is:
> soidata <- read.table("http://www.sci.usq.edu.au/staff/dunn/Datasets/appl + header = TRUE) Now, take a quick look at the variables: > summary(soidata) year Min. :1876 1st Qu.:1907 Median :1939 Mean :1939 3rd Qu.:1970 Max. :2002 soiphase Min. :0.000 1st Qu.:2.000 Median :3.000 Mean :3.148 3rd Qu.:5.000 Max. :5.000 month Min. : 1.000 1st Qu.: 3.000 Median : 6.000 Mean : 6.493 3rd Qu.: 9.000 Max. :12.000 soi Min. :-38.8000 1st Qu.: -6.6000 Median : 0.3000 Mean : -0.1514 3rd Qu.: 6.7500 Max. : 33.1000
> soidata[1:5, ]
1.5. Software year month soi soiphase 1876 1 10.8 2 1876 2 10.6 2 1876 3 -0.7 3 1876 4 7.9 4 1876 5 6.9 2
15
1 2 3 4 5
> names(soidata) [1] "year" "month" "soi" "soiphase"
This shows the dataset (or object) soidata consists of four dierent variables. The one of interest now is soi, and this variable is referred to (and accessed) as soidata$soi. To use this variable rst declare it as a time series object: > SOI <- ts(soidata$soi, start = c(1876, 1), + end = c(2002, 2), frequency = 12) The rst argument is the name of the variable. The input start indicates the time when the data starts. For the SOI data, the data starts at January 1876, which is input to r as c(1876, 1) (the one means January, the rst month). The command c means concatenate, or join together. The data set ends at February 2002; if an end is not dened, r should be able to deduce it anyway from the rest of the given information. But make sure you check your time series to ensure r has interpreted the input correctly. The argument frequency indicates that the data have a cycle of twelve (that is, each twelve points make one larger groupinghere twelve months make one year). Now plot the data: > plot(SOI, las = 1) > abline(h = 0) The plot (Fig. 1.5, top panel) is formatted correctly for time series data. (The command abline(h=0) adds a horizontal line at y = 0.) In contrast, if the data is not declared as time series data, the default plot appears as the bottom panel in Fig. 1.5. When the data are declared as a time series, the observations are plotted and joined by lines and the horizontal axis is labelled Time by default (the axis label is easily changed using the command: title(ylab="New y-axis label")). Other methods also have a standard default if the data have been declared as a time series object.
16
30 20 10 SOI 0 10 20 30 40 1880 1900 1920 1940 Time 1960 1980 2000
30 20 soidata$soi 10 0 10 20 30 40
q q q q q q q q q q q q q q q q q q q q qq q q q q qq q q qq q q q q q qq q q q q q q q q q q q q q qqq q q qq q q qq q q q q q qq q q q q qqq qq q q qq q q q q q q qq q q q q q qqq qq q qq q qq q q q q q q q q q q qq q q qq q q qq q qqq q q q qq q q q qq qqq q q q q q qq q qqq qq qqq q q q q q q q q q q q q qqq qq q qqq q q q q q qq qq q qq q q q qq qq q qqq q q qqq qqq q q q q q q q q q qq q q q q q q q q q q q qq qq q qq qq qq q q q q qq q qq qqq q q qq qq qqqqq qq q q qqqq q q q qq q q qqq q q q q qq q q q qq qqqq qqq q q qq qq q q qqqq qqqq q q q q q q qqq q qqq q q q q q q q q q qq qqq qq q q q qq qqq q qqq q qq q qq q qq q q q qq q q q qq qqq q q qq q q q qqqq qqqqqqqq q qqq q q q q q q q qqqq q qq q qq q qq qq qq q q q q qq q q q q q q q qq q qq q q q q qq q qq qqqqqqqqqqq qqq qqqqq qqq qqq qq qqqqqqqqq q q qq q qq qqqq q qqq q q q q qq q q q qq qqq q qq qq q q qq qqqqq q qqqqq qqqqq q qq q qqqq q qq q q q q q q qq q q qq q q q q q qq q q q q qq q q q qqq q qqq q q qq qqq qqqqq qqq q qqq qqq q qq q q q q q q q qq qq q q q qq qqq q qqqq q qqqq qq q qqq qqqqqqqqqqqqqq qqqq q q q qq q qq q q qq q q q qq q qqqqq q q q q qq qq q qqqq q qq q qq q q q q q q q q q q q qq q qqq qq q qq qqqq q q qq q qqq qq q qqq qq qq qq q q q q qq q qq q qq q qq q q q q qqq qqqq q q qqqq q qq q q q q q q qqq q q qq q qqq q qqqqq qqqqq qq qq qq q q q q q q qq q qq q q q q qq q qq qq qq q qq qq qq qq q q q q q q q q q qq qq q q q q qq q q q q q q q qq q q q q q qq q qqq q qqq q q q qq q qq q q q qqq qqqqq qqq qqqqqq q q qqq q q q qq q qq q q q qqq qq q q q q qqq q q q q q q q q q qq q q q qq q q q q q q q qq q qq q q qq q qq q qq q q qq q q q q qq q q q q q q q q q qq q qq q q q q qq q q q q qqq q qq q q q qq q q qq qq q q q qq q q q q q q q q q qq q qq q q q q q q q q q q q q qq q q q q q q q qqq q qq q q q qq q q q q q qq q q q qq q qq q q q q q q q qqq q q qq q q qqq q qq q q q q q q q q q qq q qq q q q q q q qq q q q q q q q q q q q q q q q q q q q q q q q
500 Index
1000
1500
Figure 1.5: A plot of the monthly average SOI from 1876 to 2001, without the series declared as being a time series. Top: the data has been declared as a time series; Bottom: the data has not been declared as a time series.
1.5. Software
17
In the above example, the data was available in a le. If it is not available, a data le can be created, or the data can be entered in to r. The following commands show the general approach to entering data in r. The command c is very useful: it is used to create a list of numbers, and stands for concatenate. > data.values <- c(12, 14, 1, 8, 9, 10, 7) > data <- ts(data.values, start = 1980) The rst line puts the observation into a list called data.values. The second line designates the data as a time series starting in 1980 (and so r assumes the values are annual measurements). You can also use scan(); see ?scan.Data stops being read when a blank line is entered if you use scan. Other commands will be introduced as appropriate throughout the course. A full list of the time series functions available in r are given in Appendix D.
1.5.2
Getting help in
Two commands of particular interest are the commands help and help.search. The help command gives help on a particular topic. For example, try typing help("names") or help("plot") at the r command prompt. (The quotes are necessary.) A short-cut is also available: typing ?names is equivalent to typing help("names"). Using the short-cut is generally more convenient. The command help.search searches the help database for particular words. For example, try typing help.search("eigen") to nd how to evaluate eigenvalues in r. (The quotes are necessary.) This function requires a reasonably specic search phrase. The command help.start starts the r help in a web browser (if everything is congured correctly). Further help and information is available at http://stat.ethz.ch/R/manual/ doc/html/, including a Web-based manual An Introduction to r. After starting r, look under the Help menu for available documentation.
18
1.6
Exercises
Ex. 1.7: Start r and load in the data le qbo.dat. This data le is a time series of the monthly quasi-biennial oscillation (QBO) from January 1948 to December 2001. (a) Examine the variables in the data set using names. (b) Declare the QBO as a time series, setting the start, end and frequency parameters correctly. (c) Plot the data. (d) Is the data stationary? Explain. (e) Determine the mean and variance of the series. (f) List important features in the data (if any) that should be modelled. Ex. 1.8: Start r and load in the data le easterslp.dat. This data le is a time series of sea-level air pressure anomalies at Easter Island from Jan 1951 to Dec 1995. (a) Examine the variables in the data set using names. (b) Declare the air pressures as a time series, setting the start, end and frequency parameters correctly. (c) Plot the data. (d) List important features in the data (if any) that should be modelled. Ex. 1.9: Obtain the maximum temperature from your town or residence for as far back as possible up to, say, thirty days. This may be obtained from a newspaper or website. (a) Load the data into r. (b) Declare the series a time series, and plot the data. (c) List important features in the data (if any) that should be modelled. (d) Compute the mean and variance of the series. Ex. 1.10: The data in Table 1.1 shows the mean annual levels at Lake Victoria Nyanza from 1902 to 1921, relative to a xed reference point (units are not given). The data are from Shaw [41], as quoted in Hand [19]. (a) Enter the data into r as a time series.
1.6. Exercises Year 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 Level 10 13 18 15 29 21 10 8 1 7 Year 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 Level 11 3 2 4 15 35 27 8 3 5
19
Table 1.1: The mean annual level of Lake Victoria Nyanza from 1902 to 1921 relative to some xed level (units are unknown).
(b) Plot the data. Make sure you give appropriate labels. (c) List important features in the data (if any) that should be modelled. Ex. 1.11: Many people believe that sunspots aect the climate on the earth. The mean number of sunspots from 1770 to 1869 for each year are given in the data le sunspots.dat and are shown in Table 1.2. (The data are from Izenman [23] and Box & Jenkins [9, p 530], as quoted in Hand [19]). (a) Enter the data into r as a time series by loading the data le sunspots.dat. (b) Plot the data. Make sure you give appropriate labels. (c) List important features in the data (if any) that should be modelled.
1.6.1
Answers to selected Exercises
1.7 (a) Here is one solution: > qbo <- read.table("qbo.dat", header = TRUE) > names(qbo) [1] "Year" (b) One option is: > qbo <- ts(qbo$QBO, start = c(qbo$Year[1], + 1), frequency = 12) "Month" "QBO"
20
Year 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803
Sunspots 101 82 66 35 31 7 20 92 154 125 85 68 38 23 10 24 83 132 131 118 90 67 60 47 41 21 16 6 4 7 14 34 45 43
Year 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837
Sunspots 48 42 28 10 8 2 0 1 5 12 14 35 46 41 30 24 16 7 4 2 8 17 36 50 62 67 71 48 28 8 13 57 122 138
Year 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869
Sunspots 103 86 63 37 24 11 15 40 62 98 124 96 66 64 54 39 21 7 4 23 55 94 96 77 59 44 47 30 16 7 37 74
Table 1.2: The annual sunspot numbers from 1770 to 1869.
1.6. Exercises
21
QBO from 1948 to 2001
10 Quasibienniel oscillation
10
20
30 1950 1960 1970 Time 1980 1990 2000
Figure 1.6: The QBO from January 1948 to December 2001.
Here the square brackets [ . . . ] have been used; they are used by r to indicate elements of an array or matrix2 . (Note that start must have numeric inputs, so qbo$Month[1]will not work as it returns Jan, which is a text string.) It is worth printing out qbo to ensure that r has interpretted your statements correctly. Type qbo at the prompt, and in particular check that the series ends in December 2001. (c) The following code plots the graph: > plot(qbo, las = 1, xlab = "Time", ylab = "Quasi-bienniel oscillation", + main = "QBO from 1948 to 2001") The nal plot is shown in Fig. 1.6. 1.10 Here is one way of doing the problem. (Note: The data can be entered using scan or by typing the data into a data le and loading the usual way. Here, we assume the data is available as the object llevel.) > llevel <- ts(llevel, start = c(1902)) > plot(llevel, las = 1, xlab = "Time", ylab = "Level of Lake Victoria Nyanza", + main = "The (relative) Level of Lake Nyanza from 1902 to 1921")
2
Matlab, for example, uses round brackets: ( . . . ).
22
The (relative) Level of Lake Nyanza from 1902 to 1921
30 Level of Lake Victoria Nyanza
20
10
10 1905 1910 Time 1915 1920
Figure 1.7: The mean annual level of Lake Victoria Nyanza from 1902 to December 1921. The gures are relative to some xed level and units are unknown.
The nal plot is shown in Fig. 1.7. There is too little data to be sure of any patterns of features to be modelled, but the series suggests there may be some regular up-and-down pattern.
Module
2
. . . . . . . . . . . . . . . . . . . . . . . . . 24 24 27 27 28 29 29 30 31 31 32 32 35 36 36 39
Autoregressive (AR) models
Module contents
2.1 2.2 2.3 Introduction Denition . . . . . . . . . . . . . . . . . . . . . . . . . . . Forecasting ar models . . . . . . . . . . . . . . . . . . . . 2.3.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.3.2 Forecasting . . . . . . . . . . . . . . . . . . . . . . . . . 2.4 The backshift operator . . . . . . . . . . . . . . . . . . . 2.4.1 2.5 Denition . . . . . . . . . . . . . . . . . . . . . . . . . . Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.5.1 The mean . . . . . . . . . . . . . . . . . . . . . . . . . . 2.5.2 2.5.3 2.5.4 2.6 2.7 2.8 The variance . . . . . . . . . . . . . . . . . . . . . . . . Covariance and correlation . . . . . . . . . . . . . . . . Autocovariance and autocorrelation . . . . . . . . . . .
More on stationarity . . . . . . . . . . . . . . . . . . . . Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.8.1 Answers to selected Exercises . . . . . . . . . . . . . . .
23
24
Module 2. Autoregressive (AR) models
Module objectives
understand what is meant by an autoregressive (ar) model; use the ar(p) notation to dene ar models; use ar models to develop forecasting formulae; understand the operation of the backshift operator; write ar models using the backshift operator; compute the mean of a time series written in ar form; understand that the variance is not easily computed from the ar form of a model; understand the concepts of autocorrelation and autocovariance; understand the term lag used in the context of autocorrelation; compute the autocorrelation function (acf) for an ar model; know that the acf will always be one at a lag of zero; understand that the acf for a lower-order ar model will decay slowly toward zero.
2.1
Introduction
In this Module, one particular type of time series modelan autoregressive modelis discussed. Subsequent Modules examine other types of models.
2.2
Denition
As stated in previously, the observations in a time series are somehow related to past values of the series, and the task of the scientist is to nd out more about that relationship. Recall that a time series consists of two components: the or information; and the or random error. If values in a time series are related to past values of the series, one possible model for the signal St is for the series Xt to be expressed as a function of previous values of X. This is exactly the idea behind an autoregressive model, denoted an ar model. An ar model is one particular type of model in the BoxJenkins methodology.
2.2. Denition
25
Example 2.1: Consider the model Wn+1 = 3.12+0.63Wn +en+1 for n 0. This model is an ar(1) model, since Wn+1 is a function of only one past value of the series {Wn }. In this model, the information or signal is Sn+1 = 3.12 + 0.63Wn and the noise is en+1 .
Example 2.2: An example of an ar(3) model is Tn = 0.9Tn1 0.4Tn2 + 0.1Tn3 + en for n 1. In this model, the information or signal is Sn = 0.9Tn1 0.4Tn2 + 0.1Tn3 and the noise is en .
A more formal denition of an autoregressive process follows. Denition 2.1 An autoregressive model of order p, or an ar(p) model, satises the equation
p
Xn = m + en +
k=1
k Xnk (2.1)
= m + en + 1 Xn1 + 2 Xn2 + + p Xnp
for n 0, where {en }n0 is a series of independent, identically distributed (iid) random variables, and m is some constant. The letter p denotes the order of autoregressive model, dening how many previous values the current value is related to. The model is called autoregressive because the series is regressed on to past values of itself. The error term {en } in Equation (2.1) refers to the noise in the time series. Above, the errors were said to be iid. Commonly, they are also assumed to 2 have a normal distribution with mean zero and variance e . For the model in Equation (2.1) to be of use in practice, the scientist must be able to estimate the value of p (that is, how many terms are needed in the ar model), and then estimate the values of k and m . Each of these issues will be addressed in later sections. Notice the subscripts are dened so that the rst value of the series to appear on the left of the equation is always one. Now consider the ar(3) model in Example 2.2: When n = 1 (for the rst observation in the time series), the equation reads T1 = 0.9T0 0.4T1 + 0.1T2 + e1 . Bbut the series {T } only exists for positive indices. This means that the model does not apply for the rst three terms in the series, because the data T0 , T1 and T2 are unavailable.
26
Example 2.3: Using r, it is easy to simulate an ar model. For Example 2.1, the following r code simulates the series: > > > > + + > noise <- rnorm(100, 0, 1) W <- array(dim = length(noise)) W[1] <- 0 for (i in 2:length(noise)) { W[i] <- 3.12 + 0.63 * W[i - 1] + noise[i] } plot(W, type = "l", las = 1)
Note type="l" means to use lines, not points (meaning it is an ell, not a numeral one). More directly, a time series can be simulated using arima.sim as follows: > sim.ar1 <- arima.sim(model = list(ar = c(0.63)), + n = 100)
10
0 0 20 40 Index 60 80 100
Figure 2.1: One realization of the ar(1) model Wn+1 = 3.12+0.63Wn +en+1 The nal plot is shown in Fig. 2.1. The data created in r are called a realization of the model. Every realization will be dierent, since each will be based on a dierent set of random {e}. The rst few values
2.3. Forecasting ar models
27
are not typical, as the model cannot be used for the rst observation (when n = 0 in the ar(1) model in Example 2.1, W0 does not exist); it takes a few terms before the eect of this is out of the system.
Example 2.4: Chu & Katz [13] studied the seasonal SOI time series {Xt } from January 1935 to August 1983 (that is, the average SOI for (northern hemisphere) Summer, Spring, etc), and concluded the data was well modelled using the ar(3) model Xt = 0.6885Xt1 + 0.2460Xt2 0.3497Xt3 + et . An ar(3) model was alluded to in Example 1.2 (Fig. 1.2 on p 7).
2.3
Forecasting AR models
One purposes of having models for time series data is to make forecasts. In this section, ar models will be discussed. First, some notation is established.
2.3.1
Notation
Consider a time series {Xn }. Suppose the values of {Xn } are known from n = 1 to n = 100. Then the forecast of {Xn } at n = 101 is written as X101|100 . The hat indicates the quantity is a forecast, not an observed value of the series. The subscript implies the value of {Xn } is known up to n = 100, and the forecast is for the value at n = 101. This is called a one-step ahead forecast, since the forecast is one-step ahead of the available data. In general, the notation Xn+k|n indicates the value of the time series {Xn } is to be forecast for time n + k assuming that the series is known up to time n. This forecast is a k-step ahead forecast. Note a k-step ahead forecast can be written in many ways: Xn+k|n , Xn|nk and Xn2|nk2 are all k-step ahead forecasts. Example 2.5: Consider the forecast Yt+3|t+1 . This is a forecast of the time series {Yt } at time t + 3 if the time series is known to time t + 1. This is a two-step ahead forecast, since the forecast at t + 3 is two steps ahead of the available information, known up to time t + 1.
28
2.3.2
Forecasting
Forecasting using an ar model is quite simple. Consider the following ar(2) model: Fn = 23 + 0.4Fn1 0.2Fn2 + en , (2.2) where en has a normal distribution with a mean of zero and variance of 2 e = 5; that is, en N (0, 5). Suppose a one-step ahead forecast is required if the information about the time series {Fn } is known up to time n; that is, Fn+1|n is required. The value of Fn+1 , if we knew exactly what is was, is found from Equation (2.2) as Fn+1 = 23 + 0.4Fn 0.2Fn1 + en+1 (2.3) by adjusting the subscripts. Then conditioning on what we actually know gives Fn+1|n = 23 + 0.4Fn|n 0.2Fn1|n + en+1|n Adding hats to all the terms, the forecast will be Fn+1|n = 23 + 0.4Fn|n 0.2Fn1|n + en+1|n . Now, since information is known up to time n, the value of Fn|n is known exactly: its the value of F at time n, Fn . Likewise, Fn1|n = Fn1 . But what about the value of en+1|n ? It is not known at time n as it is a future random noise component. So what do we do with the en+1|n term? If we know nothing about the value of en+1|n , a sensible approach would be to use the mean value of {en }, which is zero. Hence, Fn+1|n = 23 + 0.4Fn 0.2Fn1 is the forecast. The dierence between Fn+1 and Fn+1|n determined from Equation (2.3) and (2.4) is Fn+1 Fn+1|n = (23 + en+1 + 0.4Fn 0.2Fn1 ) (23 + 0.4Fn 0.2Fn1 ) = en+1 . Hence, the error in making the forecast is en+1 , and so the terms {en } are actually the one-step ahead forecasting errors. The same approach can be used for k-step ahead forecasts also, as shown in the next example. (2.4)
2.4. The backshift operator
29
Example 2.6: Consider the ar(2) model in Equation (2.2). To determine the two-step ahead forecast, rst nd Fn+2 = 23 + 0.4Fn+1 0.2Fn + en+2 . Hence Fn+2|n = 23 + 0.4Fn+1|n 0.2Fn|n + en+2|n . Now, information is known up to time n, so Fn|n = Fn . As before, en+2|n is not know, so is replaced by the mean value, which is zero. But what about Fn+1|n ? It is unknown, since information is only known up to time n, so information at time n + 1 is unknown. So what is the best estimate of Fn+1|n ? Note that Fn+1|n is simply a one-step ahead forecast itself available from Equation (2.4). So the two-step ahead forecast here is Fn+2|n = 23 + 0.4Fn+1|n 0.2Fn , where Equation (2.4) can be substituted for Fn+1|n , but it is not necessary.
2.4
The backshift operator
This section introduces the backshift operator , a tool that enables complicated time series model to be written in a simple form, and also allows the models to be manipulated. A full appreciation of the value of the backshift operator will not become apparent until later, when the models considered become very complicated and cannot be written down in any other (practical) way (see, for example, Example 7.22).
2.4.1
Denition
The backshift operator, B, is dened on a time series as follows: Denition 2.2 Consider a time series {Xt }. The backshift operator, B, is dened so that BXt = Xt1 .
30
Note the backshift operator can be used more than once, so that B 2 Xt = B.B.Xt = B(BXt ) = BXt1 = Xt2 . In general, B r Xt = Xtr . The backshift operator allows ar models to be written in a dierent form, which will later prove very useful. Note the backshift operator only operates on time series (otherwise it makes no sense to shift backward in time). This implies that Bk = k if k is a constant. Example 2.7: Consider the ar(2) model Yt+1 = 0.23Yt 0.15Yt1 + et+1 . Using the backshift operator notation, this model is written Yt+1 0.23BYt+1 + 0.15B 2 Yt+1 = et+1 (1 0.23B + 0.15B 2 )Yt+1 = et+1 .
Example 2.8: The ar(3) model Xt = et 0.4Xt1 + 0.6Xt2 0.1Xt3 is written using the backshift operator as (B)Xt = et where (B) = (1 + 0.4B 0.6B 2 + 0.1B 3 ). The notation (B) is often used to denote an autoregressive polynomial in B.
2.5
Statistics
In this Section, the important statistics of an ar model are studied.
2.5. Statistics
31
2.5.1
The mean
In Equation (2.1), the general form of an ar(p) model is given. Taking expected values of each term in this series gives E[Xn ] = E[m ] + E[en ] + E[1 Xn1 ] + E[2 Xn2 ] + + E[p Xnp ] = m + + 1 E[Xn1 ] + 2 E[Xn2 ] + + p E[Xnp ],
since E[en ] = 0 (the average error is zero). Now, assuming the time series {Xk } is stationary, the mean of this series will be approximately constant at any time (that is, for any subscript). Let this constant mean be . (It only makes sense to talk about the mean of a series if the series is stationary.) Then, = m + 1 + 2 + + p , and so, on solving for , = m . 1 1 2 p
This enables the mean of the sequences to be computed from the ar model. Example 2.9: In Equation (2.2), let the mean of the series be = E[F ]. Taking expected values of each term, = 23 + 0.4 0.2 + 0. The mean of the series is = E[F ] = 23/0.8 = 28.75.
Example 2.10: Consider the ar(1) model of Example 2.3: Wn+1 = 3.12 + 0.63Wn + en+1 , for n 0. Taking expectations, E[W ] = 8.43. The plot of the simulated data in Fig. 2.1 (page 26) conrms this.
2.5.2
The variance
(It may be useful to refer to Appendix B while reading this section.) Consider the ar(1) model Yt = 12 + 0.5Yt1 + et ,
32 where {en } N (0, 4). First, write as
Yt 0.5Yt1 = 12 + et , and then taking the variance of both sides gives var[Yt ] + (0.5)2 var[Yt1 ] + 2Covar[Yt , Yt1 ] = var[et ], since the errors {en } are assumed to be independent of the time series {Yn }. Since the series is assumed stationary, the variance is constant at all time 2 steps; hence dene Y = var[Yn ]. Then,
2 1.25Y + 2Covar[Yt , Yt1 ] = 4,
since var[en ] = 4 in this example. This equation cannot be simplied and 2 solved for Y unless there is some understanding of the covariance which characterizes the time series.
2.5.3
Covariance and correlation
The covariance is a measure of how two variables change together. For two 2 random variables X (with mean X and variance X ) and Y (with mean 2 ), the covariance is dened as Y and variance Y Covar[X, Y ] = E[(X X )(Y Y )]. Then, the correlation is Corr[X, Y ] = Covar[X, Y ] . 2 2 X Y
A correlation of +1 indicates perfect positive correlation; a correlation of 1 indicates perfect negative correlation. A correlation of zero indicates no correlation at all between X and Y .
2.5.4
Autocovariance and autocorrelation
In the case of a time series, the autocovariance is dened between two points in the time series {Xn } (with a mean ), say Xi and Xj , as ij = E[(Xi )(Xj )]. Since the time series is stationary, the autocovariance is the same if the time series is shifted in time. For example, consider Example 1.2 which includes
2.5. Statistics
33
a plot of the SOI. If we were to split the SOI series into (say) ve equal period of time, and produce a plot like Fig. 1.2 (top panel) (p 7) for each time period, the correlation would be similar for each time period. This all means the important information about Xi and Xj is the time between the two observations (that is, |i j|). Arbitrarily, Xi can be set to X0 then, and hence the autocovariance can be written as k = Covar[X0 , Xk ] for integer k. As with correlation, the autocorrelation is then dened as k k = 0 for integer k, where 0 = Covar[X0 , X0 ] is simply the variance of the time series. The series {k } is known as the autocorrelation function, or acf, at lag k. For any given ar model, it is possible to determine the acf, which will be unique to that ar model. For this reason, the acf is one of the most important pieces of information to know about a time series. Later, the acf isused to determine which ar model is appropriate for our data. The term lag indicates the time dierence in the acf. Thus, the acf at lag 2 means the term in the acf for k = 2, which is the correlation of any term in the series with the term two time steps before (or after, as the series is assumed stationary). Note that since the autocorrelation is a series, the backshift operator can be used with the autocorrelation. It can be shown that the autocovariance for an ar(p) model is 2 e . (2.5) (B) = (B)(B 1 ) Example 2.11: In Example 1.2 (p 5), the seasonal SOI was plotted against the seasonal SOI for one, two, three and four seasons ago. In r, the correlation coecients were computed as > > > > > > > > > > soi <- read.table("soiseason.dat", header = TRUE) attach(soi) len <- length(soi$SOI) lags <- 5 SOI0 <- soi$SOI[lags:len] SOI1 <- soi$SOI[(lags - 1):(len - 1)] SOI2 <- soi$SOI[(lags - 2):(len - 2)] SOI3 <- soi$SOI[(lags - 3):(len - 3)] SOI4 <- soi$SOI[(lags - 4):(len - 4)] cor(cbind(SOI0, SOI1, SOI2, SOI3, SOI4))
34 SOI0 1.000000000 0.631920149 0.409889218 0.200195528 0.007600544 SOI4 0.007600544 0.201856306 0.411991828 0.634015609 1.000000000 SOI1 0.6319201 1.0000000 0.6327576 0.4111551 0.2018563
Module 2. Autoregressive (AR) models SOI2 0.4098892 0.6327576 1.0000000 0.6336245 0.4119918 SOI3 0.2001955 0.4111551 0.6336245 1.0000000 0.6340156
SOI0 SOI1 SOI2 SOI3 SOI4 SOI0 SOI1 SOI2 SOI3 SOI4
The correlations between the SOI and lagged values of the SOI can be written as the series of autocorrelations: {} = {1, 0.632, 0.41, 0.2, 0.0076}.
Example 2.12: The ar(2) model Ut+1 = 0.3Ut 0.2Ut1 + et+1 is written using the backshift operator as (B)Ut+1 = et+1 where (B) = 1 0.3B + 0.2B 2 . Suppose for the sake of example that 2 e = 10. Then, since (B 1 ) = 1 0.3B 1 + 0.2B 2 , the autocovariance is 10 (B) = 1 + 0.2B 2 )(1 0.3B 1 + 0.2B 2 ) (1 0.3B 10 . = (0.2B 2 0.36B 1 + 1.13 0.36B + 0.2B 2 ) By some detailed mathematics (Sect. 3.6.3), this equals (B) = +11.11+2.78B 1 1.39B 2 0.97B 3 0.0139B 4 +0.190B 5 + , only quoting the terms for the non-negative lags (recall that the autocorrelation is symmetric). The terms in the autocorrelation are therefore (quoting terms from the non-negative lags again): {} = {0 , 1 , 2 , . . . } = {11.11, 2.78, 1.39, 0.97, 0.0139, 0.190, 0.0598, }. (2.6)
2.6. More on stationarity
35
The corresponding terms in the autocovariance are found by dividing by 0 = var[U ] = 11.11, to give {k } = {1, 0.25, 0.125, 0.0875, 0.00125, 0.0017125, 0.00053875, } The rst term at lag zero always has an acf value of one (that is, each term is perfectly correlated with itself). It is usual to plot the acf, (Fig. 2.2).
1.0
0.8
0.6 ACF
0.4
0.2
0.0
10 lag
15
20
Figure 2.2: The acf for the ar(2) model in Equation (2.6). The plot is typical of an ar(2) model: the terms in the acf decay slowly towards zero. Indeed, any low order ar model (such as ar(1), ar(2), ar(3), or similar) shows similar behaviour: a slow decay of the term towards zero.
2.6
More on stationarity
In an ar(1) model, it can be shown that the model is stationary only if |1 | < 1, otherwise the model is non-stationary (Exercise 2.24).
36
For an ar(2) process to be stationary, the following conditions must be satised: 1 + 2 < 1 2 1 < 1 1 < 2 < 1 These inequalities dene a triangular region in the (1 , 2 ) plane (Exercise 2.26).
2.7
Summary
In this Module, autoregressive models, or ar models, were studied. Forecasting and the statistics of the models have been considered. In addition, the use of the backshift operator was studied.
2.8
Exercises
Ex. 2.13: Classify the following ar models (that is, state if they are ar(1), ar(4), etc.) (a) Xn+1 = en+1 + 78.03 0.56Xn 0.23Xn1 + 0.19Xn2 . (b) Yn = 12.8 0.22Yn1 + en . (c) Dt 0.17Dt1 + 0.18Dt2 = et . Ex. 2.14: Classify the following ar models (that is, state if they are ar(1), ar(4), etc.) (a) Xn = en + 0.223Xn1 . (b) At = 26.7 + 0.2At1 0.2At2 + et . (c) Qt + 0.21Qt1 + 0.034Qt2 0.13Qt3 = et . Ex. 2.15: Determine the mean of each series in Exercise 2.13. Ex. 2.16: Determine the mean of each series in Exercise 2.14. Ex. 2.17: Write each of the models in Exercise 2.13 using the backshift operator. Ex. 2.18: Write each of the models in Exercise 2.14 using the backshift operator.
2.8. Exercises
37
Ex. 2.19: The time series {An } has a mean of 47.4. The following ar(2) model was tted to the series: An = m + 0.25An1 + 0.17An2 + en . (a) Find the value of m . (b) Write the model using the backshift operator. Ex. 2.20: The time series {Yn } has a mean of 12.26. The following ar(3) model was tted to the series: Yn = en + m 0.31Yn1 + 0.12Yn2 0.10Yn3 . (a) Find the value of m . (b) Write down formulae for forecasting the series one, two and three steps ahead. Ex. 2.21: Yao [52] ts numerous ar models to model the total June rainfall (in mm) at Shanghai, {Yt }, from 1932 to 1950. One of the tted models is Yt = 309.70 0.44Yt1 0.29Yt2 + et . (a) Classify the ar model tted to the series. (b) Determine the mean of the series {Yt }. (c) Write down formulae for forecasting the June rainfall in Shanghai one and two years ahead. (d) Write the model using the backshift operator. Ex. 2.22: In Guiot & Tessier [18], ar(3) models are tted to the widths of tree rings. This is of interest as there is evidence that pollution may be aecting tree growth. Each observation in the series {Ct } is the average of 30 tree-ring widths from 1900 to 1941 of a species of conifer. Write down the general form of the model used to forecast tree-ring width. Ex. 2.23: Woodward and Gray [51] use a number of models, including ar models, to study change in global temperature. One such ar model is given in the paper (their Table 2) for modelling the International Panel for Climate Change (IPCC) data series from 1968 to 1990 has the factor (1 + 0.22B + 0.59B 2 ) when the model is written using backshift operators. Write out the model without using the backshift operator.
38
Ex. 2.24: Write a short piece of R-code to simulate the ar model Xt = Xt1 +et where e N (0, 4) (see Example 2.3). Plot a simulated series of length 200 for each of the following eight values of : = 1.5, 1, 0.6, 0.2, 0, 0.5, 1, 1.5. Comment on your ndings: What eect does the value of have on the stationarity of the series? Ex. 2.25: Write a short piece of R-code to simulate the ar model Yn = 2 0.2Yn1 + en where e N (0, e ) (see Example 2.3). Plot a simulated 2 2 series of length 200 for each of the following four values of e : e = 0.5, 1, 2, 4. Comment on your ndings: What eect does changing the 2 value of e have? Ex. 2.26: The notes indicate that for an ar(2) process to be stationary, the following conditions must be satised: 1 + 2 < 1 2 1 < 1 1 < 2 < 1 These inequalities dene a triangular region in the (1 , 2 ) plane. Draw this rectangular region, and then write some R-code to simulate some ar(2) series with parameters in this region, and some with parameters outside this region. You should observe non-stationary time series when the parameters are outside this triangular region. Ex. 2.27: Consider the time series {G}, for which the last three observations are: G67 = 40.3, G68 = 39.6, G69 = 50.1. A statistician has developed the ar(2) model Gn = en 0.3Gn1 0.1Gn2 + 63 for modelling the data. (a) Determine the mean of the series {G}. (b) Develop a forecasting formula for forecasting {G} one-, two- and three-steps ahead. (c) Using the data above, compute numerical forecasts for G70|69 , G71|69 and G72|69 . Ex. 2.28: Use r generate a time series from the ar(1) model Ft+1 = 12 + 0.3Ft + et+1 of length 300 (see Example 2.3 for a guideline). (a) Compute the mean of {F } from Equation (2.7). (2.7)
2.8. Exercises
39
(b) Compute the mean of your R-generated time series, ignoring the rst 50 observations. (It usually takes a little while for the simulations to stabilize; see Fig. 2.1.) Compare to your previous answer, and comment. (c) Develop a forecasting formula for forecasting {F } one-, two- and three-steps ahead. (d) Using your generated data set, compute numerical forecasts for the next three observations.
2.8.1
2.13 The models are: ar(3), ar(1) and ar(2). 2.15 (a) Let = E[X] and take expectations of each term. This gives: = 0 + 78.03 0.56 0.23 + 0.19. Solving for shows that = E[X] 48.77. (b) In a similar manner, E[Y ] = 10.49. (c) E[D] = 0. 2.17 (a) (1 + 0.56B 1 + 0.23B 2 0.19B 3 )Xn+1 = 78.3 + en+1 ; (b) (1 + 0.22B)Yn = 12.8 + en ; (c) (1 0.17B + 0.18B 2 )Dt = et . 2.19 (a) Taking expectations shows that 0.58E[A] = m . Since E[A] = 47.4, it follows that m = 27.492. (b) (1 0.25B 0.17B 2 )An = en . 2.20 (a) Taking expectations, 1.29E[Y ] = m . Since E[Y ] = 12.26, it follows that m = 15.8154. (b) The one-step ahead forecast is Yn+1|n = 15.8154 0.31Yn + 0.12Yn1 0.10Yn2 . The two-step ahead forecast is Yn+2|n = 15.8154 0.31Yn+1|n + 0.12Yn 0.10Yn1 . The three-step ahead forecast is Yn+3|n = 15.8154 0.31Yn+2|n + 0.12Yn+1|n 0.10Yn . 2.23 If Gt is the global temperature, one model is Gt = 0.22Gt1 0.59Gt2 + et .
40
Module
3
. . . . . . . . . . . . . . . . . . . . . . . . . 42 42 43 44 44 45 47 47 47 48 49 50 50 51 53 55 55 57
Moving Average (MA) models
Module contents
3.1 3.2 3.3 3.4 Introduction Denition . . . . . . . . . . . . . . . . . . . . . . . . . . . The backshift operator . . . . . . . . . . . . . . . . . . . Forecasting ma models . . . . . . . . . . . . . . . . . . . 3.4.1 3.4.2 Forecasting . . . . . . . . . . . . . . . . . . . . . . . . . Condence intervals . . . . . . . . . . . . . . . . . . . .
3.4.3 Forecasting diculties with ma models . . . . . . . . . . 3.5 Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.5.1 3.5.2 3.5.3 3.6 The mean . . . . . . . . . . . . . . . . . . . . . . . . . . The variance . . . . . . . . . . . . . . . . . . . . . . . . Autocovariance and autocorrelation . . . . . . . . . . .
Why have dierent types of models? . . . . . . . . . . . 3.6.1 Two reasons . . . . . . . . . . . . . . . . . . . . . . . . . 3.6.2 Conversion of models . . . . . . . . . . . . . . . . . . . . 3.6.3 The acf for ar models . . . . . . . . . . . . . . . . . . Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.8.1 Answers to selected Exercises . . . . . . . . . . . . . . .
3.7 3.8
41
42
Module 3. Moving Average (MA) models
Module objectives
understand what is meant by a moving average (ma) model; use the ma(q) notation to dene ma models; use ma models to develop forecasting formulae; use ma models to develop condence intervals for forecasts; write ma models using the backshift operator; compute the mean and variance of a time series written in ma form; understand the need for both ar and ma models; convert ar models to ma models using appropriate methods; compute the autocorrelation function (acf) for an ma model; understand that the acf for an ma(q) model will have q non-zero terms (apart from the term at lag zero, which is always one).
3.1
Introduction
This Module introduces a second type of time series model: moving average models. Together with autoregressive models, they form the two basic models in the BoxJenkins methodology.
3.2
Denition
Another type of (BoxJenkins) time series model is a Moving Average model, or ma model. ar models imply the time series signal can be expressed as a linear function of previous values on the time series. The error (or noise) term in the equation, en , is the one-step ahead forecasting error. In contrast, ma models imply the signal can be expressed as a function of previous forecasting errors. This is sensible: it suggests ma models make forecasts based on the errors made in the past, and so one can learn from the errors made in the past to improve later forecasts. (Colloquially, it means it learns from its own mistakes!)
3.3. The backshift operator
43
Example 3.1: Consider the model Xt = et 0.3et1 + 0.2et2 0.15et3 . The signal or information is Sn = 0.3et1 + 0.2et2 0.15et3 . This is an ma(3) model, since the is based on three previous error terms. The term et is the error.
Example 3.2: An example of an MA(2) model is Wn+1 = 12 + 0.9en 0.4en1 + en+1 .
A more formal denition of a moving average model follows. Denition 3.1 A moving average model of order q, or an ma(q) model, is of the form Xn = m + en + 1 en1 + 2 en2 + + q enq
q
(3.1) (3.2)
= m + en +
k=1
k enk
for n 1 where 1 , . . . , q are real numbers and m is a real number. For the model in Equation (3.2) to be of use in practice, the scientist must be able to estimate the value of q (that is, how many terms are needed in the ma model), and then estimate the values of k and m. Each of these issues will be addressed in later sections.
3.3
The backshift operator can be used to write ma models in the same way as ar models. Consider the model in Example 3.1. Using backshift operators, this is written Xt = (1 0.3B + 0.2B 2 0.15B 3 )et = (B)et .
44
3.4
3.4.1
Forecasting MA models
Forecasting
The principles of forecasting were developed in Sect. 2.3.1 (it may be worth reading this section again) in the context of ar models. The same principles apply for ma models. Consider the following ma(2) model: Rn = 12 + en 0.3en1 0.12en2 , (3.3)
where en has a normal distribution with a mean of zero and variance of 2 e = 3; that is, en N (0, 3). Suppose a one-step ahead forecast is required if the information about the time series {Rn } is known up to time n; that is, Rn+1|n is required. Proceeding as before, rst adjust the subscripts: Rn+1 = 12 + en+1 0.3en 0.12en1 ; then write Rn+1|n = 12 + en+1|n 0.3n|n 0.12n1|n . e e Now, en|n and en1|n are both known at time n to be en and en1 , but en+1 is not known at time n. So what do we use for the value of en+1 ? Again, if we have no other information, use the mean value of {en }, which is zero. So the forecast is Rn+1|n = 12 0.3en 0.12en1 . The same procedure is used for k-step ahead forecasts. Example 3.3: A two-step ahead forecast for the ma(2) model in Equation (3.3) is found by rst adjusting the subscripts: Rn+2 = 12 + en+2 0.3en+1 0.12en , and then writing Rn+2|n = 12 + en+2|n 0.3n+1|n 0.12n|n . e e Of the terms on the right, only en|n is known; the rest must be replaced by the mean value of zero. So the two-step ahead forecast is Rn+2|n = 12 0.12en . The forecasts for three-steps ahead is Rn+3|n = 12, which is also the forecast for further steps ahead as well. (3.4)
3.4. Forecasting ma models
45
3.4.2
Condence intervals
Consider the ma(2) model in Equation (3.3): Rn = 12 + en 0.3en1 0.12en2 . (3.5)
For this equation, one- and two-step ahead forecasts were developed. The one-step ahead forecast is Rn+1|n = 12 0.3en 0.12en1 . The forecasting error is the dierence between the value forecast, and the value actually observed. It is found as follows: Rn+1 Rn+1|n . Now, even though Rn+1 is not known exactly, it can be expressed as Rn+1 = 12 + en+1 0.3en 0.12en1 , from Equation (3.5). (The reason Rn+1 is not known exactly is that Rn+1 depends on the unknown random value of en+1 ; this is the error we make when we make our forecast, which is of course unknown.) This means that the forecasting error is Rn+1 Rn+1|n = [12 + en+1 0.3en 0.12en1 ] [12 0.3en 0.12en1 ] = en+1 . This tells us that the series {en } is actually just the one-step ahead forecasting errors. A condence interval for the forecast of Rn+1 can also be formed. The actual error about to be made, en+1 is, of course, unknown. But this information can be used to develop condence intervals for the forecast. The variance of {en } can generally be estimated by computing all the previous forecasting errors (r computes these) and then computing the variance. Suppose for the sake of example the variance of the errors is 5.8. Then the variance of the forecast is var[Rn+1 Rn+1|n ] = var[en+1 ] = 5.8. Then a 95% condence interval for the one-step ahead forecast is Rn+1|n z Rn+1|n z var[Rn+1 Rn+1|n ] 5.8
46
for the appropriate value of z . Generally, this is taken as 2 for a 95% condence interval. (1.96 is more precise; t-values with an appropriate number of degrees of freedom even more precise. In practice, however, the value of 2 is often used.) So the condence interval for the forecast is approximately Rn+1|n 2 5.8 or Rn+1|n 4.82.
The same principles apply for other forecasts. Example 3.4: In Example 3.3, the following two-step ahead forecast was obtained for Equation (3.5): Rn+2|n = 12 0.12en . The actual value of Rn+2|n is Rn+2 = 12 + en+2 0.3en+1 0.12en , so the forecasting error is Rn+2 Rn+2|n = [12 + en+2 0.3en+1 0.12en ] [12 0.12en ] = en+2 0.3en+1 . The variance of the forecasting error is var[en+2 0.3en+1 ] = var[en+2 ] + (0.3)2 var[en+1 ] = 5.8 + (0.09 5.8) = 6.322. The condence interval becomes Rn+2|n 2 6.322 = Rn+2|n 5.03. The same principle is used for three-, four- and further steps ahead, when the condence interval is Rn+k|n 2 6.40552 = Rn+k|n 5.06 when k > 2. Notice that the condence interval gets wider as we predict further ahead of our knowledge. This should be expected.
3.5. Statistics
47
MA
3.4.3
Forecasting difculties with
models
Consider the ma model Tn = en 0.3en1 . The one-step ahead forecasting formula is Tn+1|n = 0.3en . Suppose we seek a forecast; the last three observations are: T8 = 4.6; T9 = 3.0; T10 = 0.1. Lets use the forecasting formula to produce a forecast for T11|10 : We would use T11|10 = 0.3e10 . So we need to know the one-step ahead forecasting error at n = 10; that is e10 . What is this forecasting error? We know the actual observed value at 10: it is T10 = 0.1. But to know the one-step ahead error in forecasting T10 , we need to know T10|9 . What is this value? By the forecasting formula, it is computed using T10|9 = 0.3e9 . And so we need the one-step ahead forecasting error for n = 9, which requires knowledge of T9|8 . From the forecasting formula, we nd this using T9|8 = 0.3e8 . And so the cycle continues, right back to the start of the series. In practice, we need to compute all the one-step ahead forecasting errors. r can compute these errors and produce predictions without having to worry about these diculties in a real (data-driven) situation; see Sect. 5.4.
3.5
Statistics
In this Section, the important statistics of a model are found.
3.5.1
The mean
In Equation (3.2), the general form of an ma(q) model is given. Taking expected values of each term in this series gives E[Xn ] = E[m] + E[en ] + E[1 en1 ] + E[2 en2 ] + + E[p enp ] = m, since the average error is zero. Hence, for an ma model, the constant term m is actually the mean of the series {Xn }.
48
Example 3.5: In Equation (3.3), let the mean of the series be = E[R]. Then taking expected values of each term gives = 12, so that the mean of the series is = E[R] = 12. This should not be unexpected given the forecasts in Example 3.3
3.5.2
The variance
The variance of a time series written in ma(1) form is found by taking the variance of each term. Consider again Equation (3.2); taking the variance of each term gives var[Rn ] = 12 + var[en ] + (0.3)2 var[en1 ] + (0.12)2 var[en2 ], where {en } N (0, 3), since the errors {en } are independent of the time series {Rn } and independent of each other. This gives var[Rn ] = {1 + (0.3)2 + (0.12)2 }var[en ], and so var[R] = 1.1044 3 = 3.3132. This approach can be applied to other ma models also. Example 3.6: The above results can be checked numerically in r as follows (set.seed() sets the random number seed so these results are reproducible): > set.seed(100) > ma.sim <- arima.sim(model = list(ma = c(-0.3, + -0.12)), n = 10000, sd = sqrt(3)) > var(ma.sim) [1] 3.321068 > ma.sim <- arima.sim(model = list(ma = c(-0.3, + -0.12)), n = 10000, sd = sqrt(3)) > var(ma.sim) [1] 3.309557
3.5. Statistics
49
3.5.3
Autocovariance and autocorrelation
The autocovariance for a time series is written, as shown earlier, as k = Covar[X0 , Xk ] for integer k. The is then dened as k = k 0
for integer k, where 0 = Covar[X0 , X0 ] is simply the variance of the time series. The series {k } is the autocorrelation function, or . For any ma model, the acf can be computed, which will be unique to that ma model. For this reason, the acf is one of the most important pieces of information that we can know about a time series. Later, the acf will be used to determine which ma model might be appropriate for our data. Note that since the autocorrelation is a series, it can be written using the backshift operator. It can be shown that the autocovariance for an ma(p) model is 2 (B) = (B)(B 1 )e . Example 3.7: The ma(2) model Vn+1 = en+1 0.39en 0.22en1 can be written Vn+1 = (B)en+1 where (B) = 1 0.39B 1 0.22B 2 . Suppose for the sake of example 2 that e = 2. Then, since (B 1 ) = 1 0.39B 1 0.22B 2 , the autocovariance is (B) = 2(1 0.39B 1 0.22B 2 )(1 0.39B 1 0.22B 2 ) = 2(0.22B 2 0.3042B 1 + 1.2005 0.3042B 1 0.22B 2 ) = 0.44B 2 0.6084B 1 + 2.4010 0.6084B 1 0.44B 2 . The terms in the autocovariance are therefore (quoting only the terms for the non-zero lags, as the autocorrelation is symmetric): {} = {2.4010, 0.6084, 0.4400}. and so the corresponding terms in the autocorrelation are k = k /0 , where 0 = 2.4010. Hence {} = {1, 0.253, 0.183}. The rst element of the autocorrelation is always one. It is usual to plot the acf(Fig. 3.1).
50
1.0 0.8 0.6 ACF 0.4 0.2 0.0 0.2 0 1 2 3 Lag 4 5 6 7
Figure 3.1: The acf for the ma(2) model in Example 3.7. The plot is typical of an ma(2) model: there are two terms in the acf that are non-zero (apart from the term at a lag of zero, which is always one). In general, the acf of an ma(q) model has q non-zero terms excluding the term at lag zero which is always one.
3.6
Why have different types of models?
Why are both ar and ma models necessary? ar models are far more popular in the literature than ma models, so why not just have ar models? There are two important reasons why both ma and ar models are necessary.
3.6.1
Two reasons
The rst reason is that only ma models can be used to create condence intervals on forecasts (Sect. 3.4.2). If an ar model is developed, it must be written as ma model to produce condence intervals for the forecasts.
3.6. Why have dierent types of models?
51
Secondly, it is necessary to again recall one of the principles of statistical modelling: to nd the simplest possible model that captures the important features of the data. In some applications, the only suitable ar model has a large number of parameters. In these situation, there will probably be an ma model that will be almost identical in terms of forecasting ability, but has fewer parameters to estimate. In this case, the ma model would be preferred. In other applications, a simpler ar model will be preferred over a more complicated ma model.
3.6.2
Conversion of models
This discussion implies that it is possible to convert ar models into ma models, and ma models into ar models. This is indeed true, and the vehicle through which this is done is the backshift operator. Consider an ar model, written using backshift notation as (B)Xn = en . If it is possible and sensible to divide by (B), expressed the model as Xn = Denoting 1/(B) by (B) gives Xn = (B)en , which looks like an ma model. This is exactly the way models are converted from ar to ma. Consider writing the ar(1) model Xn = 0.6Xn1 +en as an ma model. There are three ways of proceeding. The rst can only be used for ar(1) models as it uses a mathematical result relevant only then. The second approach is more dicult, but is used for any ar model. The third approach uses r, and so is the easiest but of no use in the examination. Using the rst approach, write the model using the backshift operator as: (B)Xn = en , where (B) = 1 0.6B. Then divide by (B) to obtain Xn = (B)en , where (B) = 1/(B). So, (B) = 1 . 1 0.6B (3.6) 1 en . (B)
The mathematical result for the sum of a geometric series1 is then used to obtain (B) =
1
1 = 1 + 0.6B + (0.6)2 B 2 + (0.6)3 B 3 + . 1 0.6B
1 + r + r2 + r3 + = 1/(1 r) if |r| < 1.
52
So, the corresponding ma model is the innite ma model (written ma()) Xn = en + 0.6en1 + (0.6)2 en2 + (0.6)3 en3 + . This shows an ar(1) model has an equivalent ma() form. Since both are equivalent, the simpler ar(1) form would be preferred, but the ma form is necessary for computing condence intervals of forecasts. In the second approach, start with Equation (3.6), and equate it to an unknown innite sequence of s: 1 = 1 + 1 B + 2 B 2 + . 1 0.6B Then multiply both sides by 1 0.6B to get 1 = (1 0.6B)(1 + 1 B + 2 B 2 + ) = 1 + B(1 0.6) + B 2 (2 0.61 ) + , and then equate the powers of B on both sides of the equation. For example, looking at constants, there is one on both sides. Looking at powers of B, zero are on the left, and 0.6 + 1 on the right after multiplying out. Equating, we nd that 1 = 0.6 (as before). Then equating powers of B 2 , the left hand side has zero, and the right hand side has 2 0.61 . Substituting 1 = 0.6 and solving gives 2 = (0.6)2 (as before). A general pattern emerges, giving the same result as before. Remember the second method is used to convert any ar model into an ma model (and also any ma model into an ar model). The third approach uses r. This is useful, but you will need to know other methods for the examination. Naturally, the answers are the same as using the other two methods. > imp <- as.ts(c(1, rep(0, 19))) > phi <- 0.6 Note the one is not needed in the list of ar components as it is always one! Confusingly, the sign is dierent for the . > theta <- filter(imp, phi, method = "recursive") > theta
3.6. Why have dierent types of models? Time Series: Start = 1 End = 20 Frequency = 1 [1] 1.000000e+00 [4] 2.160000e-01 [7] 4.665600e-02 [10] 1.007770e-02 [13] 2.176782e-03 [16] 4.701850e-04 [19] 1.015600e-04
53
6.000000e-01 1.296000e-01 2.799360e-02 6.046618e-03 1.306069e-03 2.821110e-04 6.093597e-05
3.600000e-01 7.776000e-02 1.679616e-02 3.627971e-03 7.836416e-04 1.692666e-04
3.6.3
The
ACF
for
AR
models
Briey, we digress to again consider the acf for ar models, seen previously in Sect. 2.5.4, Equation 2.5, and Example 2.12 (p 34) in particular. In this example, the following in stated: . . . the autocovariance is (B) = = 10 (1 0.3B 1 + 0.2B 2 )(1 0.3B 1 + 0.2B 2 ) 10 . (0.2B 2 0.36B 1 + 1.13 0.36B + 0.2B 2 )
By some detailed mathematics (covered in Sect. 3.6), this equals (B) = +11.11+2.78B 1 1.39B 2 0.97B 3 0.0139B 4 +0.190B 5 + , (3.7) only quoting the terms for the non-negative lags Since this is Sect. 3.6, we had better deliver! The way to convert to Equation (3.7) is to proceed as in this section. First, write (B) = + 2 B 2 + 1 B 1 + 0 + 1 B + 2 B 2 + (recalling that the autocovariance in a series in both directions, but is symmetric.) Then, rearrange the original equation to get 10 = (B)(0.2B 2 0.36B 1 + 1.13 0.36B + 0.2B 2 ) = ( + 1 B 1 + 0 + 1 B + 2 B 2 + ) (0.2B 2 0.36B 1 + 1.13 0.36B + 0.2B 2 )
54
Then, expand and equate powers of B as before in this section. In this situation, it is just a lot trickier. On the left, the constant term is 10; on the right, a constant can be found from: 0 (1.13) + 1 (0.36) + 2 (0.2) + 1 (0.36) +
1 B 1 (0.36B) 2 B 2 (0.2B) 1 B 1 (0.36B 1 )
2 (0.2)
2 B 2 (0.2B 1 )
So we have 10 = 0 (1.13) + 1 (0.36) + 2 (0.2) + 1 (0.36) + 2 (0.2) Proceed for other powers of B also, and develop a set of equations to be solved for 1 , 2 , and so on. Far easier is to use r after rst converting to an ma model, whose parameters we call theta: > imp <- as.ts(c(1, rep(0, 99))) > theta <- filter(imp, c(0.3, -0.2), "recursive") Thats the AR model found. Note that the rst component is 1 and is assumed; it should not be included. > theta[1:4] [1] 1.000 0.300 -0.110 -0.093
> gamma <- convolve(theta, theta) * 10 > gamma[1:4] [1] 11.1111111 2.7777778 -1.3888889 -0.9722222
> rho <- gamma/gamma[1] > rho[1:4] [1] 1.0000 0.2500 -0.1250 -0.0875
3.7. Summary
55
3.7
Summary
In this Module, moving average models were studied, including forecasting, establishing condence intervals on forecasts, and writing using the backshift operator. In addition, three methods were shown that can be used to convert ar models to ma models.
3.8
Exercises
Ex. 3.8: Classify the following ma models (that is, state if they are ma(3), ma(2), etc.) (a) At+1 = et+1 + 8.39 0.06et + 0.35et1 . (b) Xn = 0.12en1 + en . (c) Yt 0.29et1 + 0.19et2 + 0.62et3 0.26et4 et = 12.40. Ex. 3.9: Classify the following ma models (that is, state if they are ma(3), ma(2), etc.) (a) Bt = 0.1et1 + et . (b) Yn = 0.036en2 0.36en1 + en . (c) Wt + 0.39et1 + 0.25et2 0.21et3 et = 8.00. Ex. 3.10: Determine the mean of each series in Exercise 3.8. Ex. 3.11: Determine the mean of each series in Exercise 3.9. Ex. 3.12: Write each of the models in Exercise 3.8 using the backshift operator. Ex. 3.13: Write each of the models in Exercise 3.9 using the backshift operator. Ex. 3.14: Convert the ar model Xt+1 = et+1 + 0.4Xt into the equivalent ma model using each of the three methods outlined in Sect. 3.6, and conrm that they give the same answer. Ex. 3.15: Convert the ma(2) model Yn = en + 0.3en1 0.1en2 into the equivalent ar model using one of the three methods outlined in Sect. 3.6.
56 Ex. 3.16: Convert the AR model
Yn = 0.25Yn1 0.13Yn2 + en into the equivalent ma model using one of the three methods outlined in Sect. 3.6. Ex. 3.17: Compute forecasting formula for each of the ma models in Exercise 3.8 for one-, two- and three-steps ahead, and compute condence 2 intervals for each forecast in terms of the error variance e . Ex. 3.18: Compute forecasting formula for each of the ma models in Exercise 3.9 for one-, two- and three-steps ahead, and compute condence 2 intervals for each forecast. In each case, assume e = 2. Ex. 3.19: Write a short piece of r-code to simulate the ma model Xt = et + et1 where e N (0, 1) (see Example 2.3 for a guideline). Plot a simulated series of length 200 for each of the following eight values of : = 1.5, 1, 0.6, 0.2, 0, 0.5, 1, 1.5. Comment on your ndings: What eect does the value of have on the stationarity of the series? Ex. 3.20: Consider the ma(1) model Xn = 0.4en1 + en where e N (0, 3). (a) Write the model using backshift operators. (b) Find the autocovariance series {}. (c) Compute the autocorrelation function (acf), {}. Ex. 3.21: Consider the ma(1) model Sn+1 = 0.2en + en+1 , where e N (0, 2). (a) Write the model using backshift operators. (b) Find the autocovariance series {}. (c) Compute the autocorrelation function (acf), {}. Ex. 3.22: Consider the time series model Zt = 0.2et1 0.1et2 + et , where e N (0, 5).
3.8. Exercises (a) Write the model using backshift operators. (b) Find the autocovariance series {}. (c) Compute the autocorrelation function (acf), {}. Ex. 3.23: Consider the ar(1) model Wn = 0.3Wn1 + en , where e N (0, 2.5). (a) Write the model using backshift operators. (b) Find the autocovariance series {} using R. (c) Compute the autocorrelation function (acf), {}. Ex. 3.24: Consider the AR model Yt = 0.45Yt1 0.2Yt2 + et , where e N (0, 5). (a) Write the model using backshift operators. (b) Find the autocovariance series {} using R. (c) Compute the autocorrelation function (acf), {}.
57
3.8.1
3.8 The models are: ma(2); ma(1) and ma(4). 3.10 The means are: E[A] = 8.39; E[X] = 0; and E[Y ] = 12.40. 3.12 (a) At+1 = (1 0.06B + 0.35B 2 )et+1 + 8.39; (b) Xn = (1 0.12B)en ; (c) Yt = (1 + 0.29B 0.19B 2 0.62B 3 + 0.26B 4 )et + 12.40. 3.14 First, convert to backshift operator notation: (1 0.4B)Xt+1 = et+1 . The innite ma model is given by Xt+1 = 1 et+1 . 1 0.4B
Then the equivalent ma model, using any method, is Xt+1 = (1 + 0.4B + (0.4)2 B 2 + (0.4)3 B 3 + )et+1 , or Xt+1 = et+1 + 0.4et + 0.16et1 + 0.064et2 + .
58 3.17 For (a) only:
(a) One-step ahead: At+1|t = 8.39 0.06et + 0.35et1 ; var[At+1|t 2 2 At+1 ] = var[et ] = e ; the CI is At+1|t 2e . (b) Two-steps ahead: At+2|t = 8.39 + 0.35et ; var[At+2|t At+2 ] = 2 2 (1 + (0.06)2 )var[et ] = 1.0036e ; the CI is At+1|t 2 1.0036e . (c) Three-steps ahead: At+3|t = 8.39; var[At+3|t At+3 ] = (1 + 2 2 (0.06)2 +(0.35)2 )var[et ] = 1.1261e ; the CI is At+1|t 2 1.1261e . 3.20 (a) Xn = (1 + 0.4B)en , or Xn = (B)en where (B) = (1 + 0.4B).
2 (b) The autocovariance using the backshift operator is (B) = (B)(B 1 )e , 1 ) = 1.2B 1 + 3.48 + 1.2B, so so (B) = 3(1 + 0.4B)(1 + 0.4B the series is {1.2, 3.48, 1.2}.
(c) Dividing the autocovariance by 0 = 3.48 gives the acf series as {0.345, 1, 0.345} 3.23 (a) (1 0.3B)Wn = en . (b) > > > > imp <- as.ts(c(1, rep(0, 99))) theta <- filter(imp, c(-0.3), "recursive") gamma <- convolve(theta, theta) * 2.5 gamma[1:6]
[1] 2.747252747 -0.824175824 0.247252747 [4] -0.074175824 0.022252747 -0.006675824 (c) > rho <- gamma/gamma[1] > rho[1:6] [1] 1.00000 -0.30000 [6] -0.00243 0.09000 -0.02700 0.00810
Module
ARMA
4
. . . . . . . . . . . . . . . . . . . . . . . . . 60 60 62 62 62 63 64 65 65 66 67 67 68 70
Models
Module contents
4.1 4.2 4.3 4.4 Introduction Denition . . . . . . . . . . . . . . . . . . . . . . . . . . . The backshift operator for arma models . . . . . . . . . Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.4.1 4.4.2 4.5 4.6 The mean . . . . . . . . . . . . . . . . . . . . . . . . . . The autocovariance and autocorrelation . . . . . . . . .
Conversion of arma models to ar and ma models . . . Forecasting arma models . . . . . . . . . . . . . . . . . . 4.6.1 4.6.2 4.6.3 Forecasting . . . . . . . . . . . . . . . . . . . . . . . . . Condence intervals . . . . . . . . . . . . . . . . . . . . Forecasting diculties with arma models . . . . . . . .
4.7 4.8
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.8.1 Answers to selected Exercises . . . . . . . . . . . . . . .
Module objectives
Upon completion of this module students should be able to: 59
60
Module 4. arma Models

understand what is meant by an autoregressive moving average (arma) model; use the arma(p, q) notation to dene arma models; use arma models to develop forecasting formulae; develop condence intervals for forecasts from arma models; write arma models using the backshift operator; compute the mean of a time series written in arma form; understand the need for ar, ma and arma models; convert arma models to ma and ar models using appropriate methods; compute the autocorrelation function (acf) for an arma model.
4.1
Introduction
This Module examines models with both autoregressive and moving average components.
4.2
Denition
The principle of parsimonythat the best model is the simplest model that captures the important features of the datahas been mentioned before, where it was noted that a complex ar model can often be replaced by a simpler ma model. Sometimes, however, neither a simple ar model or simple ma model exists. In these cases, a combination of ar and ma models will almost always produce a simple model. These models are called AutoRegressive Moving Average models, or arma models. Once again, p is used for the number of autoregressive components, and q for the number of moving average components. Consider rst some examples. Example 4.1: An example of an arma(2, 1) model is Wn+1 = 0.56 + 0.8Wn 0.4Wn1 +en+1 +
2 ar components
0.5en .
1 ma component
4.2. Denition
61
The signal or information is of the form Sn+1 = 0.56 + 0.8Wn 0.4Wn1 + 0.5en , and has both ar and ma components.
Example 4.2: An example of an arma(1, 3) model is Xt = 120.78 + 0.88Xt1

The ar(1) component
+et
0.41et1 0.15et2 + 0.08et3 .

The ma(3) component
A more formal denition follows. Denition 4.1 The form of an arma(p, q) model is the equation:
p q
Xn
k=1
k Xnk = m + en +
j=1
j enj ,
n 0,
(4.1)
where {Xn }n1 , m is some constant, and the k and j are dened as for ar and ma models respectively. Example 4.3: Chu & Katz [13] studied the monthly SOI time series from January 1935 to August 1983, and concluded the data could be modelled by an arma(1, 1) model.
Example 4.4: Davis & Rappoport [15] use an arma(2, 2) model for the Palmer Drought Index, {Yt }. The nal tted model is Yt = 1.344Yt1 0.431Yt2 + et 0.419et1 + 0.034et2 . Katz & Skaggs [26] claim the equivalent ar(2) model is almost as good as the model given by Davis & Rappoport, yet has half the number of parameters. For this reason, they prefer the ar(2) model.
62
4.3
The backshift operator for ARMA models
arma models have both ar and ma components; the model is easily written using the backshift operator by following the guidelines for ar and ma models. Example 4.5: Consider the arma(1, 2) model Zt = 0.83 + et 0.66et1 + 0.72et2 0.29Zt1 . First, re-write as Zt + 0.29Zt1 = 0.83 + et 0.66et1 + 0.72et2 ; then use the backshift operator to get (B)Zt = m + (B)et (1 + 0.29B)Zt = 0.83 + (1 0.66B + 0.72B 2 )et . (4.2)
4.4
4.4.1
Statistics
The mean
In Equation (4.1), the general form of an arma(p, q) model is given. Taking expected values of each term in this series gives E[Xn ] = E[m ] + E[en ] + E[1 en1 ] + + E[q enq ] + + E[1 Xn1 ] + E[2 Xn2 ] + + E[p Xnp ] = m + 1 E[Xn1 ] + 2 E[Xn2 ] + + p E[Xnp ], since the average error is zero. Now, since the assumption is that the time series {X} is stationary, the mean of this series is approximately constant, so the expected value of the series will be the same at any time step. Let this constant mean be . Then, = m + 1 + 2 + + p , and so, on solving for , = m 1 1 2 p
This enables the mean of the sequences to be computed from the arma model.
4.4. Statistics
63
Example 4.6: The mean of {Zt }, say , in the arma(1, 2) model in Equation (4.2) is found by taking expectations of each term: E[Zt ] = 0.83 + E[et ] 0.66E[et1 ] + 0.72E[et2 ] E[0.29Zt1 ] = 0.83 0.29 so that E[Z] = = 0.643.
See also Example 4.8.
4.4.2
The autocovariance and autocorrelation
For an arma(p, q) model, the autocovariance can be expressed as (B) =

2 e (B)(B 1 ) . (B)(B 1 )
Example 4.7: In Example 4.5, the following were found: (B) = (1 + 0.29B 1 ) (B) = (1 0.66B 1 + 0.72B 2 ).
2 Suppose for the sake of example that e = 4. Then, the autocovariance is
(B) =
4(1 0.66B 1 + 0.72B 2 )(1 0.66B 1 + 0.72B 2 ) . (1 + 0.29B)(1 + 0.29B 1 )
This can be converted into the series (B) = 5.4430B 2 8.8380B 1 + 11.9381 8.8380B + 5.4430B 2 1.5785B 3 + 0.4578B 4 + so that the autocorrelation is {} = {1.0000, 0.7403, 0.4559, 0.1322, 0.0383, . . . }.
64
4.5
Conversion of ARMA models to AR and MA models
Using similar approaches as used before in Sect. 3.6, arma models can be converted to pure ar or pure ma models. Example 4.8: Consider the arma(1, 2) model Xt = 0.3Xt1 + et + 0.4et1 0.1et2 + 10. For the moment, ignore the constant term m = 10 and write (1 0.3B)Xt = (1 + 0.4B 0.1B 2 )et (B)Xt = (B)et . To write the model as a pure ma model, Xt = (B) et = (B)et (B) (4.4) (4.3)
where (B) = 0 + 1 B + 2 B + . To convert to this pure ma form, the values of 0 , 1 , and so on must be found. Rearrange Equation (4.4) to obtain (B) = (B)(B) 1 + 0.4B 0.1B 2 = (0 + 1 B + 2 B 2 + 3 B 3 + )(1 0.3B) = 0 + B(0.30 + 1 ) + B 2 (0.31 + 2 ) + B 3 (0.32 + 3 ) +
Now, equate powers of B so that both sides of the equation are equal. Equating constant terms: 1 = 0 as expected. Equating terms in B: 0.4 = 0.30 + 1 , so that 1 = 0.4 + 0.30 = 0.7. Equating terms in B 2 : 0.1 = 0.31 + 2 , so that 2 = 0.1 + 0.31 = 0.11. Equating terms in B 3 : 0 = 0.32 + 3 ,
4.6. Forecasting arma models so that 3 = 0.32 = 0.11(0.3).
65
Continuing, a pattern emerges showing that k = (0.3)k2 (0.11) when k 2. Hence, (B) = 1 + 0.7B + 0.11B 2 + 0.11(0.3)B 3 + + 0.11(0.3)k2 B k . This means the arma(1, 2) model has an equivalent ma() representation of Xt = et + 0.7et1 + 0.11et2 + + 0.11(0.3)k2 etk . However, there will probably be a constant term in the model yet to be found, so that Xt = m + et + 0.7et1 + 0.11et2 + + 0.11(0.3)k2 etk . (4.5)
Taking expectations of Equation (4.3) shows that the mean of the series is E[X] = 10/0.7 14.2857. Taking expectations of Equation (4.5) shows that m = E[X] 14.2857. So the arma(1, 2) model has the equivalent ma model Xt = 14.2857 + et + 0.7et1 + 0.11et2 + + 0.11(0.3)k2 etk . Note that I have yet to determine how to do these conversion in r.
4.6
4.6.1
Forecasting ARMA models

Forecasting
Forecasting arma models uses the same principles as for forecasting ma and ar models. This procedures is called the hat principle, summarized below: The forecasting equation for an arma model is obtained from the model equation by placing hats on all the terms of the equation, and adjusting subscripts accordingly. The hat designates the best linear estimate of the quantity underneath the hat. This equation is then adjusted by noting: 1. An ek|j for which k is in the future (i.e. k > j) just equals zero (the mean of {ek }), while one for which k is in the present or past (k j) just equals ek . In other words, hats change future ek s to zeros and they fall o present and past ek s.
66
Module 4. arma Models 2. A Xk|j for which k is in the present or past (i.e. k j) just equals Xk , while one for which k is in the future can be expressed in terms of another forecasting equation, which ultimately will allow it to be expressed in terms of known quantities. In other words, hats fall o present and past Xk s and they stay on future ones.
Example 4.9: Consider the arma(2, 1) model Wn = 0.72 + 0.44Wn1 + 0.17Wn2 + en 0.26en1 . A one-step ahead forecast is Wn+1|n = 0.72 + 0.44Wn|n + 0.17Wn1|n + en+1|n 0.26n|n . e Since en+1|n is in the future, it is replaced by the mean of the {ek }, which is zero. In contrast, en|n = en . Likewise, Wn|n = Wn and Wn1|n = Wn1 , so the forecasting formula is Wn+1|n = 0.72 + 0.44Wn + 0.17Wn1 0.26en . (4.7) (4.6)
Using the sample principles, the two-step ahead forecasting formula is Wn+2|n = 0.72 + 0.44Wn+1|n + 0.17Wn . Again, Wn+1|n can be replaced by Equation (4.7) (though this is not necessary) to get Wn+2|n = 0.72 + 0.44 {0.72 + 0.44Wn + 0.17Wn1 0.26n } + 0.17Wn , e which can be simplied if you wish.
4.6.2
Condence intervals
As with ar models, arma models must be rst converted to pure ma models before condence intervals can be computed for forecasts. After conversion to a pure ma form, the same principles as used in Sect. 3.4.2 are used. Example 4.10: Consider the arma(1, 2) model from Example 4.8: Xt = 0.3Xt1 + et + 0.4et1 0.1et2 + 10. This model has the equivalent ma() form Xt = 14.2857 + et + 0.7et1 + 0.11et2 + + 0.11(0.3)k2 etk .
4.7. Summary The one-step ahead forecast of the model in ma form is Xt+1|t = 14.2857 + 0.7et + 0.11et1 + + 0.11(0.3)k2 etk+1 , whereas the exact (but unknown) value will be
67
Xt+1 = 14.2857 + et+1 + 0.7et + 0.11et1 + + 0.11(0.3)k2 etk+1 . The dierence between them is et+1 , and so the forecasting error is 2 just the error variance, say e . The two-step ahead forecast is Xt+2|t = 14.2857 + 0.11et + + 0.11(0.3)k2 etk+2 , whereas the exact (but unknown) value is Xt+2 = 14.2857 + et+2 + 0.7et+1 + 0.11et + + 0.11(0.3)k2 etk+2 . The dierence between them is Xt+2 Xt+2|t = et+2 + 0.7et+1 ,
2 2 so the variance of the forecast is e (1 + 0.72 ) = 1.49e . Condence intervals can be constructed from the values of the error variance.
Continuing in the same manner, the variance of a three-step ahead 2 2 forecast is 1.5021e and a four-step ahead forecast is 1.503189e .
4.6.3
Forecasting difculties with
ARMA
models
In Sect. 3.4.3, some diculties forecasting with ma models were presented. In short, the one-step ahead forecasting errors need to be determined right to the beginning of the series. Because aspects of ma models are present in arma models, this same diculty is also present. Of course, r can compute these errors and produce predictions without having to worry about these diculties in a real (data-driven) situation; see Sect. 5.4.
4.7
Summary
In this Module, a combination of autoregressive and moving average models, called arma models, was discussed. Forecasting methods were also examined for these models.
68
4.8
Exercises
Ex. 4.11: Classify the following models as ar, ma or arma, and state the orders of the models (for example, an answer may be arma(1, 3)): (a) At = 12.6 0.44At1 + 0.37et1 + et ; (b) Xn 0.24Xn1 + 0.38Xn2 14.8 = en ; (c) Yt+1 = et+1 0.19et 0.44Yt ; (d) Rn = 0.46en1 + en ; (e) Pn+1 = 8.69 + en+1 0.35Pn 0.26en 0.18en1 + 0.11en2 . Ex. 4.12: Classify the following models as ar, ma or arma, and state the orders of the models (for example, an answer may be arma(1, 3)): (a) An 0.1An1 = 7.40 + 0.22en1 + en ; (b) Bn 0.5Bn1 = en ; (c) Xt et = 0.61Xt1 0.67et1 ; (d) Zt+1 = 0.26et + 0.10et1 + 0.17Zt 0.16Zt1 + et+1 ; (e) Xt+1 0.2et + 0.2et1 = et+1 + 7; (f) Yn = 2.2 + en + 0.23Yn1 0.19en1 0.18en2 + 0.17en3 . Ex. 4.13: Find the mean of each series in Exercise 4.11. Ex. 4.14: Find the mean of each series in Exercise 4.12. Ex. 4.15: Write each model in Exercise 4.11. using the backshift operator. Ex. 4.16: Write each model in Exercise 4.12. using the backshift operator. Ex. 4.17: Consider the arma(1, 1) model Xn = 0.2Xn1 + en 0.1en1 where var[en ] = 9.3. (a) Write the model using the backshift operator. (b) Find a one- and two-step ahead forecast for the model. (c) Convert the model into a pure ma model. (d) Find 95% condence intervals for the forecasts in (b).
4.8. Exercises Ex. 4.18: Consider the arma(1, 1) model Yn = 0.3Yn1 + en + 0.2en1 where var[en ] = 7.0. (a) Write the model using the backshift operator.
69
(b) Find a one-, two- and three-step ahead forecast for the model. (c) Convert the model into a pure ma model. (d) Find 95% condence intervals for the forecasts in (b). Ex. 4.19: Consider the arma(1, 1) model Wt+1 + 0.2Wt = 2 + en + 0.2en1 where var[en ] = 7.0. (a) Write the model using the backshift operator. (b) Find a one-, two- and three-step ahead forecast for the model. (c) Convert the model into a pure ma model. (d) Find 95% condence intervals for the forecasts in (b). Ex. 4.20: Give two reasons why it is sometimes necessary to convert ar and arma models into pure ma models. Ex. 4.21: Claps & Morrone [14] give the following model for modelling runo Dt under certain conditions: Dt exp{1/K3 }Dt1 = (1 c3 exp{1/K3 })It exp{1/K3 }(1 c3 )It1 , where c3 is a recharge coecient (constant in any given problem), It is the eective rainfall input, and K3 is a storage coecient (constant in any given problem). The authors state that if the eective rainfall input It is white noise, then the model is equivalent to an arma(1, 1) model. Use that (1 c3 exp{1/K3 })It = et to show that this is the case. Ex. 4.22: Sales, Pereira & Vieira [40] discuss numerous arma-type models in connection with the Brazilian Electrical Sector. A signicant proportion of electricity is sourced from hydroelectricty in Brazil. In their paper, the author use arma-type models to model natural monthly average ow rate (in cubic metres per second) of the reservoir of Furnas on the Grande River in Brazil. Initially, the logarithm of the data was found to create a time series {Ft }, and then an arma(1, 1) model was tted. The information in Table 4.1 comes from their Table 2.
70
Table 4.1: Parameters estimates and standard errors for the arma(1, 1) model tted by Sales, Pereira & Vieira [40]. Parameter Estimate Standard Error 1 1 2 e 0.8421 0.2398 0.4343 0.0237 0.0426
(a) Write down the tted model. (b) Convert the model to a pure ma model. (c) Develop one-, two- and three- step ahead forecasts for the log of the owrate. (d) Determine 95% condence intervals for each of these forecasts. Ex. 4.23: Consider the arma(2, 2) model for the Palmer Drought Index seen in Example 4.4. Write this model using the backshift operator. Then create forecasting formulae for forecasting one-, two-, three- and four-steps ahead.
4.8.1
4.11 The models are arma(1, 1); ar(2) (or arma(2, 0)); arma(1, 1); ma(1) (or arma(0, 1)); arma(1, 3). 4.13 The means are: E[A] = 8.75; E[X] 13.0; E[Y ] = 0; E[R] = 0; E[P ] 6.44. 4.15 (a) (1 + 0.44B)At = 12.6 + (1 + 0.37B)et ; (b) (1 0.24B + 0.38B 2 )Xn = 14.8 + en ; (c) (1 + 0.44B)Yt+1 = (1 0.19B)et+1 ; (d) Rn = (1 + 0.46B)en ; (e) (1 + 0.35B)Pn+1 = 8.69 + (1 0.26B 0.18B 2 + 0.11B 3 )en+1 . 4.17 (a) (1 0.2B 1 )Xn = (1 0.1B 1 )en ; (b) The one-step ahead forecast is Xn+1|n = 0.2Xn 0.1en . The two-step ahead forecast is Xn+2|n = 0.2Xn+1|n . (c) We have (B) = (1 0.1B)/(1 0.2B). Solving shows that (B) = 1 + 0.1B + 0.1(0.2)B 2 + + 0.1(0.2)k1 B k . The pure MA models is therefore Xt = et + 0.1et1 + 0.1(0.2)et2 + + 0.1(0.2)k1 etk .
4.8. Exercises
71
(d) The variance of the forecasting error for the one-step ahead fore2 cast is e = 9.3. For the two-step ahead forecast, the variance of 2 2 the forecast error is e + (0.1)2 e = 9.393. The 95% condence intervals therefore are Xt+1|t 2 9.3 for the one-step ahead fore cast; and Xt+2|t 2 9.393 for the two-step ahead forecast. 4.20 Firstly, models must be in ma form to compute condence intervals for forecasts; secondly, sometimes the ma model will be the simplest model in a given situation. 4.21 Hint: First write = exp(1/K3 ), and the right-hand side looks like the ar(1) part. Then, use the given relationship between It and et to nd It1 and hence show that = (1 c3 )/(1 c3 ) for the ma(1) part.
72
Module
5
. . . . . . . . . . . . . . . . . . . . . . . . . 74 75 75 75 80 83 83 84 85 85 86 88 89 91 95
Finding a Model
Module contents
5.1 5.2 Introduction Identifying a Model . . . . . . . . . . . . . . . . . . . . . 5.2.1 The Autocorrelation Function . . . . . . . . . . . . . . . 5.2.2 5.2.3 5.2.4 5.2.5 5.2.6 5.3 Sample acf . . . . . . . . . . . . . . . . . . . . . . . . Sample pacf . . . . . . . . . . . . . . . . . . . . . . . . Tips for using the sample acf and pacf . . . . . . . . . Model selection using aic . . . . . . . . . . . . . . . . . Selecting arma models . . . . . . . . . . . . . . . . . .
Parameter estimation . . . . . . . . . . . . . . . . . . . . 5.3.1 Preliminary estimation for ar models: The YuleWalker equations . . . . . . . . . . . . . . . . . . . . . . . . . .
5.3.2 Parameter estimation in R . . . . . . . . . . . . . . . . 5.4 Forecasting using R . . . . . . . . . . . . . . . . . . . . . 5.5 5.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.6.1 Answers to selected Exercises . . . . . . . . . . . . . . .
73
74
Module 5. Finding a Model
Module objectives
understand the information contained in the sample autocorrelation function (acf); understand the information contained in the sample partial acf (PACF); use the sample acf and sample pacf to select ar and ma models for time series data; use r to plot the sample acf and pacf for time series data; write down the tted ar or ma model given the r output; use the Akaike Information Criterion (aic) to select the order of ar models for time series data using r; understand that selecting arma models is more dicult than selecting ar and ma models; compute initial parameter estimates of an ar model using the Yule Walker equations; use r to compute predictions from an ar or ma model; use r to compute parameter estimates for ar and ma models of a given order.
5.1
Introduction
In this Module, methods are discussed for nding the best model for a particular time series. This consists of two stages: rst, determining which type of model is appropriate for the given data (for example, ar(1) or ma(2)); then secondly estimating the parameters in the chosen model. The choice of ar, ma and arma models are discussed, as well as the number of parameters necessary for the chosen type of model. The two most important tools in making these decisions are the sample autocorrelation function (acf) and sample partial autocorrelation function (pacf).
5.2. Identifying a Model
75
5.2
5.2.1
Identifying a Model
The Autocorrelation Function
The autocorrelation function, or acf, was studied in earlier Modules. The approach then was to take a given model and deduce the acf is characteristic of that particular model. In practice, the scientist doesnt start with a known model, but instead starts with data for which a model is sought. Using software, the acf is estimated from the data (using a sample acf), and the characteristics of the sample acf used to select the best model.
5.2.2
Sample
ACF
The autocorrelation function is estimated from the data using the formulae k = k = 1 N k 0
N k
(Xi )(Xi+k )
i=1
k0
k0
where N is the number of terms in the time series and is the sample mean of the time series. Of course, the actual computations are performed by computer, using a package such as r. Since the quantities k (and hence k ) are estimated, there will be some sampling error. Formulae exist for estimation of the sampling error but will not be given here. However, r uses these formulae to produce approximate 95% condence intervals for k . Consider the ma(2) model as used in Example 3.7 (p 49): Vn+1 = en+1 2 0.39en 0.22en1 , where e = 2. In that example, the theoretical acf was computed as {} = {1, 0.253, 0.183}. The series {Vn } is simulated in r as follows: > ma.terms <- c(-0.39, -0.22) > sim.ma2 <- arima.sim(model = list(ma = ma.terms), + n = 1000, sd = sqrt(2)) Note the variance of the errors is given as 2. The sample acf of this data is found as follows (Fig. 5.1):
76
Series sim.ma2[10:1000]
1.0 ACF 0.2 0 0.0 0.2 0.4 0.6 0.8
10
15 Lag
20
25
30
Figure 5.1: The sample acf for the ma(2) model in Example 2.2 (p 35). > acf(sim.ma2[10:1000]) Note the rst few terms have been ignored; this allows the simulation to recover from the initial (arbitrary) choice of errors needed to begin the simulation. First, note the dotted horizontal lines on the plot. These indicate the approximate 95% condence intervals for k . In other words, if the autocorrelation value lies within the dotted lines, the value can be considered as zero; the reason it is not exactly zero is due to sampling error only. We would expect that the sample acf would demonstrate the features of the acf for the model. Compare Figures 3.1 (p 50) and 5.1; the sample acf and acf do look similarthey both show two components in the plot that are larger than the rest when we ignore the term at a lag of zero which will always be one. (Recall that only two acf values are outside the dotted condence bands, so the rest can be considered as zero, and that the rst term will always be one so is of no importance.) Notice there are two components in the acf that are non-zero for a two-parameter ma model (that is, ma(2)). In fact, this is typical. Here is one of the most important rules for identifying time series models:
5.2. Identifying a Model If the sample acf has k non-zero components from 1 to k, then an ma(k) model is appropriate.
77
In r, the sample acf is produced by typing acf( time.series ) at the r prompt, where time.series is the name of the time series. Example 5.1: Consider the ar(2) model Xn = 0.4Xn1 0.3Xn2 + en . The theoretical acf can be computed and plotted in r (by rst converting to an ma model): > > > > > > imp <- as.ts(c(1, rep(0, 99))) ar.terms <- c(0.4, -0.3) theta <- filter(imp, ar.terms, "recursive") errorvar <- 1 gamma <- convolve(theta, theta) * sqrt(errorvar) rho <- gamma/gamma[1]
2 Note we used e = 1; it doesnt matter what we use since we eventually compute anyway.
> plot(c(1, 10), c(1, -0.2), type = "n", las = 1, + main = "Actual ACF ", xlab = "Lag", ylab = "ACF") > lines(rho, type = "h", lwd = 2) > abline(h = 0) This theoretical acf is shown in the top panel of Fig. 5.2. Suppose we generated some random numbers from this time series and computed the sample acf; we would expect the sample acf to look similar to Fig. 5.2. Proceed: > ar2.sim <- arima.sim(model = list(ar = ar.terms), + n = 1000) > acf(ar2.sim[10:1000], lwd = 2, las = 1, lag.max = 10, + main = "Sample ACF") This sample acf is shown in the bottom panel of Fig. 5.2. They are very similar as expected.
Example 5.2: Parzens [36] studied a time series of yearly snowfall in Bualo from 1910 to 1972 (recorded to the nearest tenth of an inch):
78
Actual ACF
1.0 0.8 ACF 0.6 0.4 0.2 0.0 0.2 2 4 Lag 6 8 10
Sample ACF
1.0 0.8 ACF 0.6 0.4 0.2 0.0 0.2 0 2 4 Lag 6 8 10
Figure 5.2: Top: the theoretical acf for the ar(2) model in Example 5.1; Bottom: the sample acf for data simulated from the ar(2) model in Example 5.1.
79
120 100 sf 80 60 40
1910
1920
1930
1940 Time
1950
1960
1970
Series sf
1.0 ACF 0.2 0 0.2 0.6
5 Lag
10
15
Figure 5.3: Yearly Bualo snowfall from 1910 to 1972. Top: the plot of the data; Bottom: the sample acf. > bs <- read.table("buffalosnow.dat", header = TRUE) > sf <- ts(bs$Snow, start = c(1910, 1), frequency = 1) The data are plotted in the top panel of Fig. 5.3. The time series is small, but the series appears to be approximately stationary. The sample acf for the data has been computed in r(Fig. 5.3, bottom panel). The acf has two non-zero terms (ignoring the term at lag zero, which is always one), suggesting an ma(2) model is appropriate for modelling the data. Note the condence bands are approximate only. Here is some of the code used to produce the plots: > plot(sf, las = 1) > acf(sf, lwd = 2)
80
5.2.3
Sample
PACF
In the previous section, the acf was introduced to indicate the order of the ma model appropriate for a dataset. How do we choose the appropriate order of the ar model? To identify ar models, a partial acf is used, which is explained below. Consider three random variable X, Y and Z. Suppose X and Y are correlated, and Y and Z are correlated. Does this mean X and Z will be correlated? Generally yesbecause both are correlated with Y . If Y changes, both X and Z will change, and so there will be a non-zero correlation between X and Z. Partial correlation measures the correlation between X and Z after removing the eect of the variable Y on both X and Z. Likewise, the partial autocorrelation measures the correlation between Xi and Xi+k after removing the eect of the joint correlations with Xi+1 , Xi+2 , . . . , Xi+(k1) . The number of non-zero terms in the partial acf or pacf suggests the order of the ar model. Here is the second of the most important rules for identifying time series models: If the sample pacf has k non-zero components from 1 to k, then an ar(k) model is appropriate. In r, the sample pacf is produced by typing pacf( time.series ) at the r prompt, where time.series is the name of the time series. Note there is no term at a lag of zero for the sample pacf, as it makes no sense given the explanation above about removing the eect of intermediate observations. Example 5.3: Consider the ar(2) model from Example 5.1. As this is an ar(2) model, the sample pacf from the simulated data is expected to have two signicant terms. The sample pacf (Fig. 5.4) has two signicant terms as expected. As explained, note there is no term at a lag of zero for the sample pacf.
Example 5.4: In Example 5.2 (p 77), the annual Bualo snowfall data was examined using the acf and an ma(2) model was found to be suitable.
81
Sample Partial ACF

0.3 Partial ACF 0.3 0.2 0.1 0.0 0.1 0.2
4 Lag
10
Figure 5.4: The sample pacf of data simulated from an ar(2) model.
Series sf
0.3 Partial ACF 0.2 0.1 0.0 0.1 0.2
5 Lag
10
15
Figure 5.5: The sample pacf of yearly Bualo snowfall from 1910 to 1972.
82
Correct AR(2) model Incorrect ARIMA(9,2,9) model
Data 4 0 2 0
100
200 Time
300
400
500
Figure 5.6: Simulated ar(2) data. Two models have been used to make predictions; the simple model is better for prediction. Note the more complex model predicts snowfall wil increase linearly over time! The sample pacf for the data has been computed in r(Fig. 5.5, bottom panel); there is no term at a lag of zero for the sample pacf. The pacf has only one non-zero term, suggesting an ar(1) model is appropriate for modelling the data. Recall the acf suggested an ma(2) model. Which model do we choose? Since the one-parameter ar model is simpler than the two-parameter ma model, the ar(1) model would be chosen as the best model. We almost certainly do not need both a ma(2) and ar(1) term in the model. (Later, we will learn about other criteria to use that helps make this decision also.) Now that an ar(1) model is chosen, it remains to estimate the parameters of the model. This will be discussed in Sect. 5.3.
Example 5.5: Consider some simulated ar(2) data. An ar(2) model and a more complicated model (an arima(9, 2, 9); we learn about arima models in Module 7.4) are tted to the data. Predictions can be made using both models; these predictions are compared in Fig. 5.6. The simple model is far better for making predictions!
83
Table 5.1: Typical features of a sample acf and sample pacf for ar and ma models. The slow decay may not always be observed. acf pacf ar(k) model ma(k) model slow decay k non-zero terms k non-zero terms slow decay
5.2.4
Tips for using the sample
ACF
and
PACF
When using the sample acf and pacf it is important to realize they are obtained from sample information. This means they have sampling error. To allow for this, the dotted lines produced by r represent condence intervals (95% by default). This implies a small number of terms (about 1 in 20) will lie outside the dotted lines even if they are truly zero. In addition, these condence intervals are approximate only. Since 5% (or 1 in 20) components are expected to be outside these approximate limits anyway, it is important to not place too much emphasis on term in the sample acf and pacf are marginal. For example, if the sample acf has two signicant terms, but one is just over the condence bands, perhaps an ma(1) model will be just as good as an ma(2). Tools for assisting in making this decision will be considered in Module 6. An ar(k) model is implied by a sample pacf with non-zero terms from 1 to k, and typically (but not always) the terms in the sample acf will decay slowly toward zero. Similarly, a ma(k) model will be implied by a sample acf with k non-zero terms from 1 to k, and typically (but not always) the terms in the sample pacf will decay slowly toward zero. Table 5.1 summarizes these very important facts for selecting time series models.
5.2.5
Model selection using
AIC
Another method of selecting the order of the ar model is appropriate is to use the Akaike Information Criterion (aic). The aic is used in many areas of statistics, and details will not be considered here. The aic, in general terms, determines the size of the errors by evaluating the log-likelihood, but also penalizes overtting of models by including a penalty term (usually twice the number of parameters used). While including extra (but possibly unnecessary) parameters in the model will reduce the size of the errors, the penalty function ensures these unnecessary terms will be less attractive when using the aic. There are numerous variations of the aic which use dierent forms for the penalty function, and often produce dierent models
84
than produced using the aic. In each case, the model with the minimum aic is selected. In r, the function ar uses the aic to select the order of the best ar model; unfortunately, ma and arma models are not considered. The advantage of this method it is automatic, and any two people using the same data and software will select the same model. The disadvantage is the computer is very strict in its decision making and does not allow for a humans expert knowledge or interpretation of the information. Example 5.6: Using the snowfall data from Example 5.4 (p 80), the function ar can be used to select the order of the ar model. > sf.armodel <- ar(sf) > sf.armodel Call: ar(x = sf) Coefficients: 1 2 0.2379 0.2229 Order selected 2 sigma^2 estimated as 500.7
(We will consider writing down the actual model in Sect. 5.3.2). Thus the ar function recommends an ar(2) model (from line 10) Order selected 2). There are therefore three models to consider: an ma(2) from the sample acf an ar(1) from the sample pacf and now an ar(2) from r using the aic. Which do we choose? This predicament happens often in time series analysis: there are often many good models from which to choose. In Module 6, some methods will be discussed for evaluating various models. If one of the model appears better than the others using these methods, that model should be chosen. But what if they all appear to be equally good? In that case, the simplest model would be chosenthe ar(1) model in this case.
5.2.6
Selecting
ARMA
models
Selecting arma models is not easy from the acf and the pacf. To select arma models, it is rst necessary to study some diagnostics of ar and ma
5.3. Parameter estimation
85
models in the next Module. The issue of selecting arma models will be reconsidered in Sect. 6.3.
5.3
Parameter estimation
Previous sections have given the basis for selecting an ar or ma model for a given data set, and to determine the order of the model. This section now discusses how to estimate the unknown parameters in the model using r. The actual mathematics is not discussed and indeed, it is not easy.
5.3.1
Preliminary estimation for equations
AR
models: The YuleWalker
Consider the ar model in Equation (2.1) (p 2.1). If the number of terms p in the series is nite, it is possible to write down a system of equations for calculating the autoregressive coecients {k }p from the autocorrelation k=1 coecients, {k }k0 . Multiplying Equation (2.1) by Xnk and taking expectations, obtain k = 1 k1 + + p kp for k 0. Dividing through by 0 , k = 1 k1 + + p kp , for k 0. The set of equations (5.1) with k = 0, . . . p matrix equation, and can be solved for the coecients k . as the . In matrix form, we have 1 1 2 p1 1 1 1 1 p2 2 . . . . . = .. . . . . . . . . . . . p1 p2 p3 1 p (5.1) are written as a These are known . (5.2)
1 2 . . . p
This matrix equation can be solved for the coecients {k }p via the fork=0 mula 1 1 1 1 2 p1 1 2 1 1 1 p1 2 (5.3) . = . . . . . . .. . . . . . . . . . . . . . p p1 p2 p3 1 p
86
Example 5.7: Suppose we have a set of time series data. A plot of the reveals that the rst few non-zero terms of the (and hence k values) are 0.36, 0.14, 0.01 and 0.03. We could use the to determine approximate values for k : 1 1 1 0.36 0.14 0.01 0.36 2 0.36 1 0.36 0.14 0.14 , 3 = 0.14 0.36 1 0.36 0.01 0.01 0.14 0.36 1 0.03 4 which gives = (0.6032, 0.5247, 0.3708, 0.2430). Using more terms would give estimate for k for k > 4, but this is sucient to demonstrate the use of the YuleWalker equation.
The YuleWalker equations are used to nd an initial estimate of the parameters. Note also that they are based on nding parameters for an ar model only.
5.3.2
Parameter estimation in R
The function used by r to estimate parameters in arma models in the function arima. To demonstrate how to use this function, consider again the yearly Bualo snowfall from Example 5.2 (p 77), Example 5.4 (p 80) and Example 5.6 (p 84). In these examples, the following models were considered: ar(1) (from the pacf); ma(2) (from the acf); and a ar(2) (from the aic). Example 5.8: To t the ar(1) model, use > snow.ar1 <- arima(sf, order = c(1, 0, 0)) > snow.ar1 Call: arima(x = sf, order = c(1, 0, 0)) Coefficients: ar1 intercept 0.3302 80.8809 s.e. 0.1236 4.1722 sigma^2 estimated as 496.8: log likelihood = -285.01, aic = 576.01
5.3. Parameter estimation
87
Importantly, r always ts a model to the mean-corrected time series. That is, the mean of the series is subtracted from the observations before computing the acf and pacf. Hence, if yearly Bualo snowfall is {Bt }, the output indicates the tted model is Bt 80.88 = 0.3302(Bt1 80.88) + et . Rearranging produces the model Bt = 54.17 + 0.3302Bt1 + et . The parameter estimates are also given in the output. Either form is acceptable as the nal model.
Example 5.9: Similarly, the ar(2) model is found thus: > snow.ar2 <- arima(sf, order = c(2, 0, 0)) > snow.ar2 Call: arima(x = sf, order = c(2, 0, 0)) Coefficients: ar1 ar2 0.2542 0.2373 s.e. 0.1262 0.1262
intercept 81.5422 5.2973 log likelihood = -283.3, aic = 574.59
sigma^2 estimated as 469.6: This indicates the ar(2) model is
Bt 81.54 = 0.2542(Bt1 81.54) + 0.2373(Bt2 81.54) + et . Rearranging produces Bt = 41.46 + 0.2542Bt1 + 0.2373Bt2 + et . Comparing the aic for both the ar models show that the ar(2) model is only slightly better using this criterion than the ar(1) model. The output from using the function ar can also be used to write down the tted model but it doesnt estimate the intercept; see Example 5.6. The estimates are also slightly dierent as a dierent algorithm is used for estimating the parameters.
88 Example 5.10: To t the ma(1) model, use
> snow.ma1 <- arima(sf, order = c(0, 0, 1)) > snow.ma1 Call: arima(x = sf, order = c(0, 0, 1)) Coefficients: ma1 intercept 0.2104 80.5421 s.e. 0.0982 3.4616 sigma^2 estimated as 517.6: This indicates the ma(1) model is Bt 80.54 = +et + 0.2104et1 . or Bt = 80.54 + et + 0.2104et1 . log likelihood = -286.27, aic = 578.53
In general, the model is tted using arima using the order option. The rst component in order is the order of the ar component, and the third is the order of the ma component. What is the second term? The second term is only necessary if the series is non-stationary. The next Module discusses this issue, where the meaning of the second term in the order parameter will be discussed.
5.4
Forecasting using R
Once a model has been found, r can be used to make forecasts. The function to use is predict. The following example shows how to use this function. Example 5.11: To demonstrate how to use this function, consider again the yearly Bualo snowfall recently seen in Examples 5.8 to 5.10. The data contain the annual snowfall in Bualo up to 1972. Consider just Example 5.8, where an ar(1) model was tted. To make a forecast, the following commands are used (note that the object snow.ar1 was created earlier by tting an ar(1) model to the data):
5.5. Summary > snow.pred <- predict(snow.ar1, n.ahead = 10) > snow.pred $pred Time Series: Start = 1973 End = 1982 Frequency = 1 [1] 90.49534 84.05536 81.92903 81.22696 80.99516 [6] 80.91862 80.89335 80.88500 80.88225 80.88134 $se Time Series: Start = 1973 End = 1982 Frequency = 1 [1] 22.28815 23.47162 23.59705 23.61068 23.61217 [6] 23.61233 23.61235 23.61235 23.61235 23.61235
89
r has made predictions for the next ten years based on the ar(1) model, and has included the standard errors of the forecasts as well. (This make it easy to compute the condence intervals.) Notice the forecasts from about six years ahead and further are almost the same. This implies that the model has little skill at forecasting that far ahead (which is not surprising). Forecasts a long way into the future tend to be the mean, which is reasonable. The data and the forecasts can be plotted together (Fig. 5.7) as follows: > snow.and.preds <- ts.union(sf, snow.ar1$pred) > plot(snow.and.preds, plot.type = "single", + lty = c(1, 2), lwd = 2, las = 1) Similar forecasts and plots can be constructed from the other types of models (that is, ma or arma models) in a similar way. The forecasts are shown for each of these models in Table 5.2.
5.5
Summary
This Module considered the identication of ar and ma models for a given set of stationary time series data, primarily using the acf and the pacf. The Akaike Information Criterion (aic) was also considered.
90
120
100 snow.and.preds
80
60
40
1910
1920
1930
1940
1950
1960
1970
1980
Time
Figure 5.7: Forecasting the Bualo snowfall data ten years ahead. There is little skill in the forecast after a few years. The forecasts are shown using a dashed line.
Table 5.2: Comparison of the predictions for forecasting ten-steps ahead using the ar(1), ar(2) and ma(2) models for the Bualo snowfall data. ar(1) ar(2) ma(2) 1 2 3 4 5 6 7 8 9 10 90.50 84.06 81.93 81.23 81.00 80.92 80.89 80.89 80.88 80.88 92.44 91.06 86.55 85.07 83.63 82.91 82.38 82.08 81.88 81.76 86.18 85.90 80.90 80.90 80.90 80.90 80.90 80.90 80.90 80.90
5.6. Exercises
91
Note that most time series (including climatological time series) are not stationary, but the methods developed so far apply only to stationary data. In Module 7, non-stationary time series will be examined.
5.6
Exercises
Ex. 5.12: Consider a time series {L}. The tted model is an arma(1, 0) model. (a) The model is a special case of an arma model. What is another way of expressing the model? (b) Write this model using the backshift operator. (c) Sketch the possible sample acf and pacf that lead to the selection of this model. Ex. 5.13: Consider a time series {Y }. The tted model is an arma(0, 2) model. (a) The model is a special case of an arma model. What is another way of expressing the model? (b) Write this model using the backshift operator. (c) Sketch the possible sample acf and pacf that lead to the selection of this model. Ex. 5.14: The mean annual streamow in Cache River at Forman, Illinois, from 1925 to 1988 is given in the le cacheriver.dat. (The data are not reported by calendar year, but by water year. A water year starts in October of the calendar year one year less than the water year and ends in September of the calendar year the same as the water year. For example, water year 1980 covers the period October 1, 1979 through September 30, 1980. However, this does not aect the model or your analysis.) There are two variables of interest: Mean reports the mean annual ow, and Max reports the maximum ow each water year, each measured in cubic feet per second. (The data have been obtained from USGS [4].) (a) Use r to nd a suitable model for the mean annual stream ow using the acf and pacf. (b) Use r to nd a suitable model for the maximum annual stream ow using the function ar and the sample acf and sample pacf. (c) Using your chosen model, produce forecasts up to three-steps ahead.
92 0.77 3.37 1.31 4.75 1.74 2.20 0.32 2.48 0.81 3.00 0.59 0.96 1.20 3.09 0.81 1.89 1.95 1.51 2.81 0.90
Module 5. Finding a Model 1.20 2.10 1.87 2.05 0.47 0.52 1.18 1.43 1.62 1.35
Table 5.3: Thirty consecutive days of precipitation in inches at Minneapolis, St Paul. The data should be read across the rows.
Ex. 5.15: Simulate the ar(2) model Rn+1 = 0.2Rn 0.4Rn1 + en+1 where {e} N (0, 4). Compute the sample acf and sample pacf from this simulated data. Do they show the features you expect? Ex. 5.16: Simulate the ma(2) model Xt = 0.3et1 0.2et2 + et where {e} N (0, 8). Compute the sample acf and sample pacf from this simulated data. Do they show the features you expect? Ex. 5.17: The data in Table 5.3 are thirty consecutive values of March precipitation in inches for Minneapolis, St. Paul obtained from Hand et al. [19]. The years are not given. (The data are available in the data le minn.txt.) (a) Load the data into r and nd a suitable model (ma or ar) for the data. (b) Produce forecasts up to three-steps ahead with your chosen model. Ex. 5.18: The data in the le lake.dat give the mean annual levels at Lake Victoria Nyanza from 1902 to 1921, relative to a xed reference point (units are not given). The data are from Shaw [41] as quoted in Hand et al [19]. Explain why an ar, ma or arma cannot be tted to this data set. Ex. 5.19: The Easter Island sea level air pressure anomalies from 1951 to 1995 are given in the data le easterslp.dat, which were obtained from the IRI/LDEO Climate Data Library (http://ingrid.ldgo. columbia.edu/). Find a suitable ar or ma model for the series using the sample acf and pacf. Use this model to forecast up to three months ahead.
5.6. Exercises
93
Ex. 5.20: The Western Pacic Index (WPI) measures the mode of lowfrequency variability over the North Pacic. The time series in the data le wpi.txt is from the Climate Prediction Center [3] and the Climate Diagnostic Centre [2], and gives the monthly WPI from January 1950 to December 2001. (a) Conrm that the data are approximately stationary by plotting the data. (b) Find an appropriate model for the data using the acf and pacf. (c) Find an appropriate model using the ar function. (d) Which model is your preferred model? Explain your answer. (e) Find parameter estimates for your preferred model. Ex. 5.21: The seasonal average SOI from (southern hemisphere) summer 1876 to (southern hemisphere) summer 2001 is given in the le soiseason.dat (a) Conrm that the data are approximately stationary by plotting the data. (b) Find an appropriate model for the data using the acf and pacf. (c) Find an appropriate model using the ar function. (d) Which model is your preferred model? Explain your answer. (e) Find parameter estimates for your preferred model. Ex. 5.22: The monthly average solar ux from January 1948 to December 2002 is given in the le solarflux.txt. (a) Conrm that the data are approximately stationary by plotting the data. (b) Find an appropriate model for the data using the acf and pacf. (c) Find an appropriate model using the ar function. (d) Which model is your preferred model? Explain your answer. (e) Find parameter estimates for your preferred model. Ex. 5.23: The acf in Fig. 5.8 was produced for a time series {P }. In this question, the YuleWalker equations are used to form initial estimates for the values of . (a) Use the rst three terms in the acf to set up the YuleWalker equations, and solve for the ar parameters. (Any terms within the condence limits can be assumed to be zero.) (b) Repeat, but use four terms of the acf. Compare you answers to those in part (a).
94
1.0
0.8
0.6 ACF
0.4
0.2
0.0
0.2 0 2 4 Lag 6 8 10
Figure 5.8: The acf for the time series {P }.
Ex. 5.24: The acf in Fig. 5.9 was produced for a time series {Q}. In this question, the YuleWalker equations are used to form initial estimates for the values of . (a) Use the rst three terms in the acf to set up the YuleWalker equations, and solve for the ar parameters. (Any terms within the condence limits can be assumed to be zero.) (b) Repeat, but use four terms of the acf. Compare you answers to those in part (a). (c) Repeat, but use ve terms of the acf. Compare you answers to those in parts (a) and (b). Ex. 5.25: The acf in Fig. 5.10 was produced for a time series {R}. In this question, the YuleWalker equations are used to form initial estimates for the values of . (a) Use the rst three terms in the acf to set up the YuleWalker equations, and solve for the ar parameters. (Any terms within the condence limits can be assumed to be zero.) (b) Repeat, but use four terms of the acf. Compare you answers to those in part (a).
5.6. Exercises
95
1.0 0.8 0.6 0.4 ACF 0.2 0.0 0.2 0.4 0 2 4 Lag 6 8 10
Figure 5.9: The acf for the time series {Q}.
(c) Repeat, but use ve terms of the acf. Compare you answers to those in parts (a) and (b).
5.6.1
5.12 (a) ar(1) model. (b) (1 B)Ln = en for some value of . (c) The possible sample acf and pacf are shown in Fig. 5.11. The actual details of the s are not important; what is important is that there is only one signicant term in the sample pacf and the sample acf takes a long time to decay (and the term at lag zero in the acf is one as always). 5.14 (a) The time series is plotted in Fig. 5.12. The data appears to be approximately stationary. The sample acf and pacf are shown in Fig. 5.13. The sample acf has no signicant terms, suggesting no particular ma model will be useful. The sample pacf has only one term marginally signicant at a lag of 14. This suggests that there is
96
1.0 0.8 0.6 0.4 0.2 0.0 0.2 0.4 0 2 4 Lag 6 8 10
ACF
Figure 5.10: The acf for the time series {R}.
no obvious ar model. What is the conclusion? The conclusion is that there is no suitable ar or ma model for modelling the data. In fact, it suggests that the observations are actually random, and therefore unpredictable. Using the function ar suggests the same. Notice that a lot of work is sometimes needed to come to the conclusion that no model is useful. This does not mean the exercise has been a waste of timeafter all, it is now known that there is no useful ar or ma model, which in in itself is useful information. If {St } is the mean annual streamow, then the model is St = m + et for the appropriate value of m (which will be the mean in this case). Since the mean value of the mean streamow is 299.3, the model is St = 299.3 + et . The forecasts up to three-steps ahead are all 299.3. (b) Using the function ar, a suitable ar model is an ar(5) model: > cr <- read.table("cacheriver.dat", header = TRUE) > ar(cr$Max) Call: ar(x = cr$Max)
5.6. Exercises
97
0.0 0.2 0.4 0.6 0.8 1.0 0
ACF
4 Lag
10
Partial ACF
0.0
0.2
0.4
0.6
4 Lag
10
Figure 5.11: A possible sample acf and pacf for an ar(1) model. The acf is shown in the top plot; the pacf in the bottom plot.
Coefficients: 1 2 -0.2096 -0.1298 Order selected 5
3 -0.3270
4 -0.2821
5 -0.2111 4662001
sigma^2 estimated as
In contrast, using the acf and pacf would suggest that the data are random. This is an example of a situation where the human is probably correct, and the computer doesnt actually know best. (c) The chosen model is St = 4133 + et where 4133 is the mean. The forecasts are all 4133. 5.19 The time series is plotted in Fig. 5.14. The data appears to be approximately stationary. The sample acf and pacf are shown in Fig. 5.15. The sample acf has seven signicant terms, suggesting an ma(7)
98
800
600
rflow
400
200
1930
1940
1950
1960 Time
1970
1980
1990
Figure 5.12: A plot of the mean annual streamow in cubic feet per second at Cache River, Illinois, from 1925 to 1988.
5.6. Exercises
99
ACF 0.2 0 0.2
0.6
1.0
5 Lag
10
15
Partial ACF
0.3
0.1
0.1
10 Lag
15
Figure 5.13: The sample acf and pacf of the mean annual streamow in cubic feet per second at Cache River, Illinois, from 1925 to 1988. Top: the sample acf; Bottom: the sample pacf.
100
6 4 2 eislp 0 2 4 6
1950
1960
1970 Time
1980
1990
Figure 5.14: A plot of the Easter Island sea level air pressure anomaly from 1951 to 1995.
5.6. Exercises
101
ACF
0.0 0.0
0.4
0.8
0.5
1.0 Lag
1.5
2.0
Partial ACF
0.1 0.0
0.0
0.1
0.2
0.5
1.0 Lag
1.5
2.0
Figure 5.15: The sample acf and pacf of the Easter Island sea level air pressure anomaly. Top: the sample acf; Bottom: the sample pacf.
102
Module 5. Finding a Model model. It is likely that a more compact ar model can be found. The sample pacf suggests an ar(3) model may be appropriate (the terms at lag 5 and 6 are so marginal, they can probably be ignored.) The second term is not signicant, but the third term in the pacf is signicant, so we need to use an ar(3) model if the signicant term at lag 3 is to be taken. Given the choice of either ma(7) or ar(3), the more compact ar model is to be preferred. The code used to generate the above plots is shown below > > > > > ei <- read.table("easterslp.dat", header = TRUE) eislp <- ts(ei$slpa, start = c(1951, 1), frequency = 12) plot(eislp, main = "", las = 1) acf(eislp, main = "") pacf(eislp, main = "")
To estimate the parameters, use > eislp.model <- arima(eislp, order = c(3, 0, + 0)) > eislp.model$coef ar1 0.25139496 ar2 0.02663228 ar3 intercept 0.16891009 -0.15173751
The tted ar model is therefore Et = 0.251Et1 + 0.0266Et2 + 0.1689Et3 + en if {Et } is the Easter Island sea level air pressure anomaly. The one-step ahead forecast is Et+1|t = 0.251Et + 0.0266Et1 + 0.1689Et2 The last few values in the series are: > length(eislp) [1] 540 > eislp[535:540] [1] 0.1 3.0 2.9 -1.2 3.2 -1.0
So the one-step ahead forecast is Et+1|t = 0.251 (1.0) + 0.0266 3.2 + 0.1689 (1.2) = 0.36896, and likewise for further steps ahead.
5.6. Exercises
103
5.23 From the acf 1 0.3, 3 0.2 and 3 0.2 (and the rest are essentially zero). So the matrix equation is 1 0.3 0.2 0.3 1 0.3 1 0.3 2 = 0.2 3 0.2 0.3 1 0.2 with solution 0.5416667 . 0.5 0.4583333
In r: > + + > + > > + + + > + > > Mat <- matrix(data = c(1, 0.3, -0.2, 0.3, 1, 0.3, -0.2, 0.3, 1), byrow = FALSE, nrow = 3, ncol = 3) rhs <- matrix(nrow = 3, data = c(0.3, -0.2, 0.2)) sol1a <- solve(Mat, rhs) Mat <- matrix(data = c(1, 0.3, -0.2, 0.2, 0.3, 1, 0.3, -0.2, -0.2, 0.3, 1, 0.3, 0.2, -0.2, 0.3, 1), byrow = FALSE, nrow = 4, ncol = 4) rhs <- matrix(nrow = 4, data = c(0.3, -0.2, 0.2, 0)) sol1b <- solve(Mat, rhs) sol1b
[,1] [1,] 0.7870968 [2,] -0.7677419 [3,] 0.7483871 [4,] -0.5354839 The solutions are very dierent. In practice, all the available information is used (and hence very large matrices result).
104
Module
6
. . . . . . . . . . . . . . . . . . . . . . . . . 106
Diagnostic Tests
Module contents
6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9 Introduction Residual acf and pacf . . . . . . . . . . . . . . . . . . . . 107 Identication of arma models . . . . . . . . . . . . . . . 110 The BoxPierce test (Q-statistic) . . . . . . . . . . . . . 116 The cumulative periodogram . . . . . . . . . . . . . . . 117 Signicance of parameters . . . . . . . . . . . . . . . . . 118 Normality of residuals . . . . . . . . . . . . . . . . . . . . 119 Alternative models . . . . . . . . . . . . . . . . . . . . . 120 Evaluating the performance of a model . . . . . . . . . 121
6.10 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 6.11 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 6.11.1 Answers to selected Exercises . . . . . . . . . . . . . . . 124
Module objectives
use r to create residual acf and pacf plots;
105
106
Module 6. Diagnostic Tests

understand the information contained in residual acf and pacf plots; use the residual acf and pacf to identify arma models; write down the arma model from the r output; use r to make forecasts from a tted arma model; use r to evaluate the BoxPierce statistic and LjungBox statistic and understand what they imply about the tted model; use r to create a cumulative periodogram and understand what it implies about the tted model; use r to create QQ plot and understand what it implies about the tted model; use r to test the signifcance of tted parameters in a tted model; t competing models to a time series, and use the appropriate tests to compare the possible models; select a good model for given stationary time series data.
6.1
Introduction
Once a model is tted, it is important to know if the model is a good model, or if it can be improved. But rst, what is a good model? A good model should be able to capture the important features of the data, or, in other words, capture the signal . After removing the signal from the time series, only random noise should remain. So to test if a model is a good model or not, the noise is usually tested to ensure it is indeed random (and hence unpredictable). If the residuals are somehow predictable, the model should be rened so the residuals are unpredictable and random. In addition, a good model is as simple as possible. To ensure the model is as simple as possible, each term in the model should be tested to make sure it is signicant; otherwise, the insignicant parameters should be removed from the model. The process of evaluating a model is called diagnostic testing. A number of diagnostic tests are considered in this Module.
6.2. Residual acf and pacf
107
6.2
Residual ACF and PACF
Since the residuals should be white noise (that is, are independent and contain no elements are predictable), the acf and pacf of the residuals should contain no hint of being forecastable. In other words, the terms of the residual acf and residual pacf should all lie between the (approximate) 95% condence limits. If not, there are elements in the residuals are forecastable, and these forecastable aspects should be included in the signal of model. Example 6.1: In Sect. 5.3 (p 85), numerous models were tted to the yearly Bualo snowfall data rst introduced in Example 5.2 (p 77). Two of those models were ar models. Here, consider the ar(1) model. The model was tted in Example 5.8 (p 86). There are two ways to do diagnostic tests in r. The rst way is to use the tsdiag function; this function plots the standardized residuals in order and plots the acf of the residuals. (It also produces another plot studied in Sect. 6.4). Here is how the function can be used: > > > > > par(mfrow = c(1, 1)) bs <- read.table("buffalosnow.dat", header = TRUE) sf <- ts(bs$Snow, start = c(1910, 1), frequency = 1) ar1 <- arima(sf, order = c(1, 0, 0)) tsdiag(ar1)
The result is shown in Fig. 6.1. The middle panel in Fig. 6.1 indicates the residual acf is ne and no model could be tted to the residuals. The second method involves using the output object from the arima command, as shown below. > ar1 <- arima(sf, order = c(1, 0, 0)) > names(ar1) [1] [5] [9] [13] "coef" "loglik" "call" "model" "sigma2" "aic" "series" "var.coef" "arma" "code" "mask" "residuals" "n.cond"
The residuals are given by ar1$resid, or more directly as resid(ar1): > summary(resid(ar1)) Min. 1st Qu. -65.6600 -14.6800 Median 1.4540 Mean -0.2791 3rd Qu. 16.8700 Max. 47.3700
108
Standardized Residuals
2 3 1910 2 1 0 1
1920
1930
1940 Time
1950
1960
1970
ACF of Residuals
1.0 ACF 0.2 0 0.2 0.6
5 Lag
10
15
p values for LjungBox statistic

1.0 0.8
q q q q q q q q q q
p value
0.0
0.2
0.4
0.6
4 lag
10
Figure 6.1: Diagnostic plots after tting an ar(1) model to the yearly Bualo snowfall data. This is the output of using the tsdiag command in r.
6.2. Residual acf and pacf
109
Series resid(ar1)
1.0 ACF 0.2 0 0.2 0.6
5 Lag
10
15
Series resid(ar1)
Partial ACF
0.2
0.0
0.1
0.2
5 Lag
10
15
Figure 6.2: Diagnostic plots after tting an ar(1) model to the yearly Bualo snowfall data. Top: the residual acf; Bottom: the residual pacf.
110
Module 6. Diagnostic Tests These residuals can be used to perform diagnostic tests. For example, the residual acf and residual pacf are shown in Fig. 6.2. The residual acf and pacf indicate the residuals (or the noise) are not forecastable. This suggests the ar(1) model tted in Sect. 5.3 is adequate, considering this single criterion.
6.3
Identication of ARMA models
Using the residual acf and pacf is often how arma models are tted. A researcher may look at the sample acf and sample pacf and conclude an ar(2) model is appropriate. After tting such a model, an examination of the residual acf and residual pacf indicates an ma(1) model now seems appropriate. The best model for the data then be an arma(2, 1) model. The researcher would hope the residuals from this arma(2, 1) would be white noise. As was alluded to in Sect. 6.2, using the residual acf and pacf allows arma models to be identied. Example 6.2: In Example 4.3, Chu & Katz [13] were said to t an arma(1, 1) model to the monthly SOI time series from January 1935 to August 1983. In this example we see how that model may have been chosen. Keep in mind that selecting arma models is very much an art and requires experience to do well. As with any time series, the data must be stationarity. (Fig. 1.3 bottom panel, p 8), which it appears to be. The next step is to look at the acf and pacf; (Fig. 6.3). The acf suggests a very large order ma model; the pacf suggests possibly an ar(2) model or an ar(4) model. To begin, select an ar(2) model as it is simpler and the terms at lags 3 and 4 are only just signicant; if an ar(4) model is necessary, it will become apparent in the diagnostic analysis. The code so far: > > > > ms <- read.table("soiphases.dat", header = TRUE) acf(ms$soi) pacf(ms$soi) ms.ar2 <- arima(ms$soi, order = c(2, 0, 0))
The residuals can be examined now to see if the tted ar(2) model is adequate using the residual acf and pacf from the ar(2) model (Fig. 6.4). The residual acf suggests the model is reasonable, but the residual pacf suggests at least one ma term at lag 2 may be necessary. (There
6.3. Identication of arma models
111
Series ms$soi
1.0 ACF 0.2 0 0.2 0.6
10
15 Lag
20
25
30
Series ms$soi
0.6 Partial ACF 0.0 0 0.2 0.4
10
15 Lag
20
25
30
Figure 6.3: The acf and pacf of monthly SOI. Top: the acf; Bottom: the pacf.
112
Series resid(ms.ar2)
ACF
0.0 0
0.4
0.8
10
15 Lag
20
25
30
Series resid(ms.ar2)
Partial ACF
0.10 0
0.00
0.05
10
15 Lag
20
25
30
Figure 6.4: The acf and pacf of residuals for the ar(2) model tted to the monthly SOI. In (a), the acf; in (b), the pacf.
113
are signicant terms at lags 5, 6, 14 and 15 also; it is more common that observations will be strongly related to more recent observations than those some time ago. Initially, then, deal with problem at lag 2; if the problems at the others lags persist, they can be dealt with later.) This is surprising as we tted an ar(2) model which we would expect to account for signicant terms at lag 2. This suggests trying to add an ma(2) component to the ar(2) component above, making an arma(2, 2) model. Fit this and look again at the residual plots: > > > + > > acf(ms.ar2$residuals) pacf(ms.ar2$residuals) ms.arma22 <- arima(ms$soi, order = c(2, 0, 2)) acf(ms.arma22$residuals) pacf(ms.arma22$residuals)
Again the residual acf looks ne; the residual pacf looks better, but still not ideal (Fig. 6.5). The signicant term at lag 2 has gone as well as those at lags 5 and 6 however; this is more important than the signicant terms at lags 14 and higher (as lags 14 time steps away are less likely to be of importance). So perhaps the arma(2, 2) model will suce. Heres the model: > ms.arma22 Call: arima(x = ms$soi, order = c(2, 0, 2)) Coefficients: ar1 ar2 0.9192 -0.0473 s.e. 0.3801 0.3250
ma1 -0.4273 0.3792
ma2 -0.0131 0.1451
intercept -0.0903 0.8158 aic = 10325.74
sigma^2 estimated as 53.19:
log likelihood = -5156.87,
Note the second ar term and the second ma term are both unnecessary (the estimate divided by the standard errors are much less than one). This suggests the second ar term and the second ma term should be excluded from the model. In other words, try tting an arma(1, 1) model. > ms.arma11 <- arima(ms$soi, order = c(1, 0, + 1)) > acf(ms.arma11$residuals) > pacf(ms.arma11$residuals)
114
Series ms.arma22$residuals
ACF
0.0 0
0.4
0.8
10
15 Lag
20
25
30
0.05 Partial ACF 0.10 0 0.05 0.00
10
15 Lag
20
25
30
Figure 6.5: The acf and pacf of residuals for the arma(2, 2) model tted to the monthly SOI. Top: the acf; Bottom: the pacf.
115
ACF
0.0 0
0.4
0.8
10
15 Lag
20
25
30
0.05 Partial ACF 0.10 0 0.05 0.00
10
15 Lag
20
25
30
Figure 6.6: The acf and pacf of residuals for the arma(1, 1) model tted to the monthly SOI. Top: the acf; Bottom: the pacf. The residual acf and pacf from this model (Fig. 6.6) look very similar to those in Fig. 6.5, suggesting the arma(1, 1) model is better than the arma(2, 2) model, and also simpler. Heres the arma(1, 1) model: > ms.arma11 Call: arima(x = ms$soi, order = c(1, 0, 1)) Coefficients: ar1 ma1 0.8514 -0.3698 s.e. 0.0196 0.0355
intercept -0.1183 0.7927
116 sigma^2 estimated as 53.25:
Module 6. Diagnostic Tests log likelihood = -5157.63, aic = 10323.26
The aic implies this is a better model than the arma(2, 2) model, and so the arma(1, 1) is appropriate for the data.
6.4
The BoxPierce test (Q-statistic)
Another test to apply to the residuals is to calculate the BoxPierce statistic, or the Q-statistic, also known as a Portmanteau test. The purpose of the test is to check if the residuals are independent. The null hypothesis is that the residuals are independent, and the alternative is they are not independent. This test computes the sum of the squares of the rst m (e.g. m = 15) sample acf coecients, multiplied by the length of the time series (say N ) and calls this Q:
m
Q=N
k=1
2 . k
If the residuals are taken from a white noise process, the Q statistic will have approximately a chi-square (2 ) distribution with mN degrees of freedom, where m is the number of autocorrelation coecients used in computing the statistic (15 above), and N is the number of autoregressive and moving average components estimated for the model. Some authors use m rather than m N degrees of freedom (as does r). Chateld [11, p 62] and others note the test is really only useful when the time series has more than 100 observations. An alternative test to use which is better for shorter series is
m
Q = N (N + 2)
k=1
2 k , N k
called the LjungBox test. Both tests, however, may lack statistical power. In r, the function Box.test is used for both tests. Example 6.3: In Example 6.1, the yearly Bualo snowfall data were considered. In that Example, the residual acf and pacf showed the residuals were not forecastable using an ar(1) model. To test if the residuals appear to be independent, use the Box.test function in r. The input variables are the residuals from the tted model, and the number of terms in the acf to be used to compute the statistic. The default value is one, which is far too few. Typically, a value such as 15 is used (it is often more if the series is longer or is seasonal, and shorter if the time series is short).
6.5. The cumulative periodogram > Box.test(resid(ar1), lag = 15) Box-Pierce test data: resid(ar1) X-squared = 7.3209, df = 15, p-value = 0.9481 > Box.test(resid(ar1), lag = 15, type = "Ljung-Box") Box-Ljung test data: resid(ar1) X-squared = 8.1009, df = 15, p-value = 0.9197
117
The P -value indicates there is no evidence that the residuals are dependent. The conclusion from the LjungBox test is similar. This further conrms that the ar(1) model is adequate. If the P -value was below about 0.05, there would be some cause for concern: it would imply that the terms in the acf are too large to be a white noise.
Note the r function tsdiag produces a plot using P -value of the BoxPierce statistic for various value of the lag; see the third (bottom) panel in Fig. 6.1. The dotted line in the plot corresponds to a P -value of 0.05.
6.5
The cumulative periodogram
Another test applied to the residuals is to calculate the cumulative (or integrated) periodogram and apply a KolmogorovSmirnov test to check the assumption that the residuals form a white noise process. The r function cpgram performs this test. The cumulative periodogram from a white noise process will lie close to the central diagonal line. Thus, if the residuals do form a white noise process as they should do approximately if the model is correct, the cumulative periodogram of the residuals will lie within the indicated bounds with probability 95%. Example 6.4: In Example 6.1, the yearly Bualo snowfall data were considered and an ar(1) model tted. The cumulative periodogram is found as follows:
118
0.0 0.0
0.2
0.4
0.6
0.8
1.0
0.1
0.2
0.3
0.4
0.5
frequency
Figure 6.7: The cumulative periodogram after tting an ar(1) model to the yearly Bualo snowfall data. > cpgram(ar1$resid, main = "") The result (Fig. 6.7) indicates that the model is adequate as it remains between the condence bands.
6.6
Signicance of parameters
The next important test to perform is to check on the statistical signicance of the parameters. Standard errors of the parameter estimates are computed and shown by r when the model is tted using arima. Roughly speaking, the parameters of a model are accepted as signicant if the estimated value of the parameter is twice the standard error of this estimate or more. This is made a little more precise by using a statistical test (the t-test), however in practice this amounts to almost the same thing. If a parameter shows up as not signicant, it should be removed from the model.
6.7. Normality of residuals
119
Example 6.5: In Example 6.1, the yearly Bualo snowfall data were considered. An ar(1) model was tted to the data. There were two estimated parameters: the constant term in the model, m , and the ar term. The ar term can be tested for signicance. (Recall that the intercept is of no interest to the structure of the model.) The parameter estimates and the standard errors are shown in Example 5.8 (p 86). Dividing the estimate by the standard error produces an approximate t-score. The parameter estimates for the ar term has a t-score greater than two in absolute value, indicating that it is necessary in the model. The actual t-scores can be computed using the output from the tting of the model, as shown below. > coef(ar1) ar1 intercept 0.3301765 80.8808921 > ar1$coef ar1 intercept 0.3301765 80.8808921 > ar1$var.coef ar1 intercept ar1 0.01528329 0.03975151 intercept 0.03975151 17.40728000 > coef(ar1)/sqrt(diag(ar1$var.coef)) ar1 intercept 2.670778 19.385655 The conclusion is that the ar parameter in the model is necessary, and so that the ar(1) model seems appropriate.
6.7
Normality of residuals
Throughout, the residual have been assumed to be normally distributed. To test this, use a QQ plot of the residuals. If the residuals do have a normal distribution, the points in the plot will lie close to the diagonal line.
120
Normal QQ Plot
q
40
q qq q q q q q q qq q q q q q q q q q q q qq qqq qqq qq qq q qq q q q q q qq q q q q qq q q qqq q
Sample Quantiles
20
20
qq
60
40
0 Theoretical Quantiles
Figure 6.8: The cumulative periodogram after tting an ar(1) model to the yearly Bualo snowfall data. Example 6.6: Continuing Example 6.1 (the yearly Bualo snowfall), consider again the tted ar(1) model. The QQ plot of the residuals (Fig. 6.8) indicates the residuals are approximately normally distributed. > qqnorm(resid(ar1)) > qqline(resid(ar1)) (Note: qqnorm plots the points; qqline draw the diagonal line.)
6.8
Alternative models
The last type of test is to check if an alternative model might be better. This is open-ended, because there is an endless variety of alternative models from which to choose. But, as seen before, there are sometimes a small number of models that are suggested from which the researcher has to choose. If one model proves to be better using the diagnostic tests, that model should be used. If all perform similarly, choose the simplest model. But what if there
6.9. Evaluating the performance of a model
121
is more than one model that perform similarly, and each are as simple as the other? If you cant decide between them, then it probably doesnt matter!
6.9
Evaluating the performance of a model
Finally, consider an evaluation tool that is slightly dierent than those previously discussed. The idea is that the model is tted to the rst portion of the data (perhaps half the data), and then forecasts are made on the basis of that model tted to this portion (called the training sets). One-step ahead forecasts are then made for each of the remaining data points (called the testing set) to see how adequate the model can forecastwhich, after all, is one of the main reasons for developing time series models. This approach generally requires a time series with a large number of observations to work well, since splitting the data into two parts halves the amount of information available for model selection. Obviously, smaller portions can be withheld from the model selection stage if necessary, as shown in the next example. The approach discussed here is called cross-validation. The best model is the model whose predictions in the tesing set are closest to the actual observed values; this can be summarised by noting the mean and variance of the dierences. More sophisticated cross-validation techniques are possible, but not discussed here. Example 6.7: Because the Bualo snowfall data is a short series, we withhold only the last ten observations and retain those for model evaluation. The one-step ahead forecasts for the remaing ten observations for each model are shown in Table 6.1. These one-step aheads predictions are plotted in Fig. 6.9. Table 6.1 suggests little dierence between the models; the ar(2) model has smaller errors on average (compare the means), but the ar(1) model is closest more consistent (compare the variances).
6.10
Summary
Before accepting a time series model, it must be tested. The main tests are based on analysing the residualsthe one-step ahead forecast errors of the model. Table 6.2 summaries the diagnostic tests discussed.
122
Table 6.1: The one-step ahead forecasts for the ar(1), ar(2) and ma(2) model after withholding the last ten observations and using the remainder as a training set. Prediction from Model: Actual 1 2 3 4 5 6 7 8 9 10 Errors: 89.80 71.50 70.90 98.30 55.50 66.10 78.40 120.50 97.00 110.00 Mean: Var: ar(1) 87.08 83.21 77.10 76.89 86.05 71.75 75.29 79.40 93.46 85.61 4.215 414.9 ar(2) 91.67 88.52 80.89 75.88 82.53 79.17 70.43 76.31 90.05 95.39 2.717 444.7 ma(2) 86.91 84.59 77.14 73.97 85.08 79.26 66.64 79.19 95.84 93.69 3.570 427.0
120
110 Series and predictions
100
90
80
70 Series AR(1) preds AR(2) preds MA(2) preds 1960 1962 1964 1966 Years 1968 1970 1972
60
Figure 6.9: The cross-validation one-step ahead predictions for the ar(1), ar(2) and ma(2) models applied to the Bualo snowfall data.
6.11. Exercises Assumption to Test Residuals unforecastable Residuals independent Residuals white noise Simple model Residuals normally distributed Test to Use Residual acf & Residual pacf BoxPierce test cumulative periodogram signicance of parameters QQ plot of residuals R Commands to Use use acf and pacf on residuals Box.test cpgram output from arima qqnorm
123
Table 6.2: A summary of the diagnostic test to use on given time series models.
6.11
Exercises
Ex. 6.8: In Exercise 4.22, an arma(1, 1) model was discussed that was tted by Sales, Pereira & Vieira [40] to the natural monthly average ow rate (in cubic metres per second) of the reservior of Furnas on the Grande River in Brazil. Table 4.1 (p 70) gave the parameter estimates and their standard errors. Determine if each parameter is signicant at the 95% level. Ex. 6.9: In Exercise 5.14 (p 91), data concerning the mean annual streamow from 1925 to 1988 in Cache River at Forman, Illinois, given in the le cacheriver.dat There are two variables of interest: Mean reports the mean annual ow, and Max reports the maximum ow each water year, each measured in cubic feet per second. Perform the diagnostic checks to see if the model found for the variable Mean in that exercise produce adequate models. Ex. 6.10: In Exercise 5.19 (p 92), the Easter Island sea level air pressure anomalies from 1951 to 1995, given in the data le easterslp.txt, were analysed. An ar(3) model was considered a suitable model. Perform the appropriate diagnostic checks on this model, and determine if the model is adequate. Ex. 6.11: In Exercise 4.4, Davis & Rappoport [15] were reported to use an arma(2, 2) model for modelling the Palmer Drought Index, {Yt }. Katz & Skaggs [26] claim the equivalent ar(2) model is almost as good
124
Module 6. Diagnostic Tests as the model given by Davis & Rappoport, yet has half the number of parameters. For this reason, they prefer the ar(2) model. Load the data into r and decide on the best model. Give reasons for your solution, and include diagnostics analyses.
Ex. 6.12: In Exercise 5.20, a model was tted to the Western Pacic Index (WPI). The time series in the data le wpi.txt gives the monthly WPI from January 1950 to December 2001. Perform some diagnostic analyses and select the best model for the data, justifying your choice and illustrating your answer with appropriate diagrams. Ex. 6.13: In Exercise 5.21, the seasonal average SOI from (southern hemisphere) summer 1876 to (southern hemisphere) summer 2001 was studied. The data is given in the le soiseason.dat. Fit an appropriate model to the data justifying your choice and illustrating your answer with appropriate diagrams. Ex. 6.14: In Exercise 5.22, the monthly average solar ux from December 1950 to December 2001 was studied. The data is given in the le solarflux.txt. Fit an appropriate model to the data justifying your choice and illustrating your answer with appropriate diagrams. Ex. 6.15: The data le rionegro.dat contains the average monthly heights of the Rio Negro river at Manaus from 19031992 in metres (relative to an arbitrary reference point). Find a suitable model for the times series, including a diagnostic analysis of possible models.
6.11.1
6.9 The model chosen for the variable Mean was simply that the data were random. Hence the residual acf and residual pacf are just the sample acf and sample pacf as shown in Fig. 5.13. The cumulative periodogram shows no problems with this model; see Fig. 6.10. The BoxPierce test likewise indicates no problems. The QQ plot is not ideal though (and looks better if an ar(3) model is tted). Here is some of the code: > Box.test(rflow) Box-Pierce test data: rflow X-squared = 0.865, df = 1, p-value = 0.3523 6.15 First, load and prepare the data:
6.11. Exercises
125
Normal QQ Plot
800
q
Series: rflow
q
1.0
600
Sample Quantiles
0.6
q q q q q q q q q q q q q q q q q q q q q q q q q q qq qq q q q q q q q qq qq qq q q q q q qq qq q qq qq qq
0.8
0.2
0.4
0.0
0.0
0.1
0.2
0.3
0.4
0.5
q
200
400
frequency 2 1 0 1 2
Theoretical Quantiles
Figure 6.10: The cumulative periodogram of the annual streamow at Cache River; the plot suggests that the data are random. However, the QQ plot suggests that data are perhaps not normally distributed. > RN <- read.table("rionegro.dat", header = TRUE) > ht <- ts(RN$Height, start = c(RN$Year[1], + RN$Month[1]), frequency = 12) A plot of the data shows the series is reasonaboy stationary (Fig. 6.11). See the acf and pacf (Fig. 6.12); the acf suggests a very large order ma model, while the pacf suggests an ar(3) model. Decide to start with the ar(3) model! > rn.ar3 <- arima(ht, order = c(3, 0, 0)) The residual acf and pacf are pretty good if not perfect (Fig. 6.13); there are a a couple of components outside the approximate condence limits, but probably nothing of importance. Lets examine more diagnostics (Fig. 6.14); the cumulative periodogramn looks ne, but the normal probability plot looks bad. However, a histogram shows the residuals have a decent distribution that loos slightly normal, so things arent so bad (try hist(resid(rn.ar3))). So, for some nal diagnostics: > Box.test(resid(rn.ar3)) Box-Pierce test data: resid(rn.ar3) X-squared = 0.1847, df = 1, p-value = 0.6674
126 > plot(ht)
ht 6 1900 4 2
1920
1940 Time
1960
1980
Figure 6.11: A plot of the Rio negro river data > par(mfrow = c(1, 2)) > acf(ht) > pacf(ht)
Series ht
1.0
Series ht
0.8
0.6
Partial ACF 0.0 0.5 1.0 Lag 1.5 2.0 2.5
ACF
0.4
0.2
0.0
0.2 0.0
0.0
0.2
0.4
0.6
0.8
0.5
1.0
1.5 Lag
2.0
2.5
Figure 6.12: The acf and pacf of the Rio negro river data
6.11. Exercises > par(mfrow = c(1, 2)) > acf(resid(rn.ar3)) > pacf(resid(rn.ar3))
Series resid(rn.ar3)
1.0
127
Series resid(rn.ar3)
0.6
0.8
Partial ACF 0.0 0.5 1.0 Lag 1.5 2.0 2.5
ACF
0.4
0.0
0.05 0.0
0.2
0.00
0.05
0.5
1.0
1.5 Lag
2.0
2.5
Figure 6.13: The residual acf and pacf of the Rio Negro river data after tting the ar(3) model > > > > par(mfrow = c(1, 2)) cpgram(resid(rn.ar3)) qqnorm(resid(rn.ar3)) qqline(resid(rn.ar3))
Normal QQ Plot Series: resid(rn.ar3)
1.0 3
q q q q q q q q q qq qq q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q qq qq qq qq qq q qq qq q qq qq q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q qq q q q
0.8
Sample Quantiles 0 1 2 3 4 5 6 frequency
0.6
0.4
0.2
0.0
Figure 6.14: Further diagnsotic plot of the Rio Negro river data after tting the ar(3) model
128
Module 6. Diagnostic Tests > coef(rn.ar3)/sqrt(diag(rn.ar3$var.coef)) ar1 ar2 38.72317372 -11.39726406 ar3 6.14219789 intercept -0.01352855
Ther Box test shows no problems; all the parameters seem necessary. This model seems ne (if not perfect). > rn.ar3 Call: arima(x = ht, order = c(3, 0, 0)) Coefficients: ar1 ar2 1.1587 -0.4985 s.e. 0.0299 0.0437
ar3 0.1837 0.0299
intercept -0.0020 0.1462 aic = 2463.77
Module
7
. . . . . . . . . . . . . . . . . . . . . . . . . 130 134 136
Non-Stationary Models
Module contents
7.1 7.2 7.3 7.4 Introduction Non-stationarity in the mean . . . . . . . . . . . . . . . 131 Non-stationarity in the variance . . . . . . . . . . . . . 134 arima models . . . . . . . . . . . . . . . . . . . . . . . . . 134 7.4.1 7.4.2 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . Estimation . . . . . . . . . . . . . . . . . . . . . . . . .
7.4.3 Backshift operator . . . . . . . . . . . . . . . . . . . . . 138 7.5 Seasonal models . . . . . . . . . . . . . . . . . . . . . . . 138 7.5.1 7.5.2 7.5.3 Identifying the season length . . . . . . . . . . . . . . . Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . The backshift operator . . . . . . . . . . . . . . . . . . . 141 145 147
7.5.4 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . 149 7.6 Forecasting . . . . . . . . . . . . . . . . . . . . . . . . . . 150 7.7 7.8 7.9 Diagnostics . . . . . . . . . . . . . . . . . . . . . . . . . . 151 A summary of model tting . . . . . . . . . . . . . . . . 154 A complete example . . . . . . . . . . . . . . . . . . . . . 156
7.10 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 161 7.11 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161 7.11.1 Answers to selected Exercises . . . . . . . . . . . . . . . 164
129
130
Module 7. Non-Stationary Models
Module objectives
identify time series that are not stationary in the mean; use dierences to remove non-stationarity in the mean; identify time series that are not stationary in the variance; use logarithms to remove non-stationarity in the variance; understand what is meant by an arima model; use the arima(p, d, q) notation to dene arima models; develop forecasting formulae for arima models; develop condence intervals for forecasts for arima models; write arima models using the backshift operator; identify seasonal time series; identify the length of a season in a seasonal time series; use the arima(p, d, q) (P, D, Q)s notation to dene seasonal arima models; develop forecasting formulae for seasonal arima models; develop condence intervals for forecasts for seasonal arima models; write seasonal arima models using the backshift operator; use r to estimate the parameters in seasonal arima models; use r to t an appropriate BoxJenkins model to time series data.
7.1
Introduction
Up to now, all the time series considered have been assumed stationary. This assumption was crucial to the denitions of the autocorrelation and partial autocorrelation. In practice, however, many time series are not stationary. In this Module, methods for identifying non-stationary series are considered, and then models for modelling these series are examined. In this Module, three types of non-stationarity are discussed:
7.2. Non-stationarity in the mean 1. series that have a non-stationary mean; 2. series that have a non-stationary variance; and 3. series with a periodic or seasonal component.
131
Many series may exhibit more than one of these types of non-stationarity.
7.2
Non-stationarity in the mean
One common type of non-stationarity is a non-stationary mean. Typically, the mean of the series tends to increase or uctuate. This is easiest to identify by looking at a plot of the data. Sometimes, the sample acf may indicate a non-stationary mean if the terms take a long time to decay to zero. If a dataset exhibits a non-stationary mean, the solution is to take dierences. That is, if a time series {X} is non-stationary in the mean, compute the dierences Yn = Xn Xn1 . Generally, this makes any time series with a non-stationary mean into a time series with a stationary mean {Y }. Occasionally, the dierenced time series {Y } will also be non-stationary in the mean, and another set of dierences will be needed. It is rare to ever need more than two sets of dierences. When dierences of this kind are taken (soon another type of dierence is considered), this is referred to as taking rst dierences. Note that each time a set of dierences is calculated, the new series has one less observation than the original. In r, dierences are created using diff. Example 7.1: Consider the annual rainfall near Wendover, Utah, USA. The data appear to have a non-stationary mean (Fig. 7.1) as the mean goes up and down, though it is not too severe. To check this, a smoothing lter was applied computing the mean of each set of six observations at a time. This smooth (Fig. 7.1, top panel) suggests the mean is probably non-stationary as this line is not (approximately) constant. The following code fragment shows how the dierences series was found in r. > rfdata <- read.table("./rainfall/wendover.dat", + header = TRUE) > rf <- rfdata[(rfdata$Year > 1907) & (rfdata$Year < + 1999), ]
132
Annual rainfall (in mm)
500 400 300 200
1920
1940
1960 Year
1980
2000
Differences of Annual rainfall (in mm)
200 100 0 100 200 1920 1940 1960 Year 1980 2000
Figure 7.1: The annual rainfall near Wendover, Utah, USA in mm. Top: the original data is plotted with a thin line, and a smooth in a thick line, indicating that the mean is non-stationary. Bottom: the dierenced data is plotted with a thin line, and a smooth in a thick line. Since the smooth is relatively at, the dierenced data has a stationary mean. > + > + > + > > ann.rain <- tapply(rfdata$Rain, list(rfdata$Year), sum) ann.rain <- ts(as.vector(ann.rain), start = rfdata$Year[1], end = rfdata$Year[length(rfdata$Year)]) plot(ann.rain, type = "l", las = 1, ylab = "Annual rainfall (in mm)", xlab = "Year") ar.l <- lowess(ann.rain, f = 0.1) lines(ar.l, lwd = 2)
If dierences are applied, the series appears more stationary in the mean (Fig. 7.1, bottom panel).
7.2. Non-stationarity in the mean

AMO
0.2 0.1 amo 0.0 0.1
133
1950
1960
1970 Time
1980
1990
One difference of AMO
0.02 damo 0.00
0.02 0.04 1950 1960 1970 Time 1980 1990
Two differences of AMO

0.04 0.03 0.02 ddamo 0.01 0.00
0.01 0.02 0.03 1950 1960 1970 Time 1980 1990
Figure 7.2: The Atlantic Multidecadal Oscillation from 1948 to 1994. Top: the plot shows the data is not stationary. Middle: the rst dierences are also not stationary. Bottom: taking two sets of dierences has produced a stationary series. Example 7.2: Eneld et al. [16] used the Kaplan SST to compute a tenyear running mean of detrended Atlantic SST anomalies north of the equator. This data series is called the Atlantic Multidecadal Oscillation (AMO). The data, obtained from the NOAA Climatic Diagnostic Center [2], are stored as amo.dat. A plot of the data shows the series is non-stationary in the mean; (Fig. 7.2, top panel). The rst dierences are also non-stationary; (Fig. 7.2, middle panel). Taking one more set of dierences produces approximately stationary data; (Fig. 7.2, bottom panel). Here is the code used. > amo <- read.table("amo.dat", header = TRUE) > amo <- ts(amo$AMO, start = c(amo$Year[1]),
134 + > > > > + > > +
Module 7. Non-Stationary Models frequency = 1) par(mfrow = c(3, 1)) plot(amo, main = "AMO", las = 1) damo <- diff(amo) plot(damo, main = "One difference of AMO", las = 1) ddamo <- diff(damo) plot(ddamo, main = "Two differences of AMO", las = 1)
7.3
Non-stationarity in the variance
A less common type of non-stationarity with climate data is non-stationarity in the variance. A non-stationary variance is a common diculty, however, in many business applications. Generally, a series that is non-stationary in the variance has a variance that gets larger over time (that is, as time progresses, the observations become more variable). In these case, usually taking logarithms of the time series will help. Another possible diculty is that the time series contains negative values (for example, SOI series). In these cases, add a suciently large constant to the data (which wont aect the variance), and then take logarithms. If the time series is non-stationary in the mean and the variance, logs should be taken before dierences (to avoid taking logs of negative values).
7.4
ARIMA
models
Once a non-stationary time series has been made stationary, it can be analysed like any other (stationary) time series. These models, which include some dierencing, are called Autoregressive Integrated Moving Average models, or arima models.
7.4.1
Notation
ar, ma or arma models in which dierences have been taken are collectively called autoregressive integrated moving average models, or arima models. Consider an arima model in which the original time series has been dierenced d times (d is mostly 1, sometimes 2, and almost never greater than 2). If this now-stationary time series can be well modelled by an arma(p, q)
7.4. arima models
135
ACF 0.5 0 0.0
0.5
1.0
10 Lag
15
Partial ACF
0.4
0.2
0.0
0.2
10 Lag
15
Figure 7.3: The dierences of the annual rainfall near Wendover, Utah, USA in mm. Top: the sample acf. Bottom: the sample pacf. model, then the nal model is said to be an arima(p, d, q) model, where d is the number of sets of dierences needed to make the series stationary. Example 7.3: In Example 7.1, the annual rainfall near Wendover, Utah, say {Xn }, was considered. The time series was non-stationary, and dierences were taken. The dierenced time series, say {Yn }, is now stationary. The sample acf and pacf of the stationary series {Yn } is shown in Fig. 7.3. The sample acf suggests an ma(1) model is appropriate (again recalling that the term at lag 0 is always one), while the sample pacf suggests an ar(2) model is appropriate. The AIC recommends an ar(1) model. If the ar(2) model is chosen, the model would be an arima(2, 1, 0). If the ma(1) model is chosen, the model would be an arima(0, 1, 1). If the ar(1) model is chosen, the model would be an
136
Module 7. Non-Stationary Models arima(1, 1, 0), since there is one set of dierences. Here is some of the code used: > > + > + > + > + > > > + > >
rf <- read.table("wendover.dat", header = TRUE) rf <- rf[(rf$Year > 1907) & (rf$Year < 1999), ] ann.rain <- tapply(rf$Rain, list(rf$Year), sum) ann.rain <- ts(as.vector(ann.rain), start = rf$Year[1], end = rf$Year[length(rf$Year)]) plot(ann.rain, type = "n", las = 1, ylab = "Annual rainfall (in mm)", xlab = "Year") lines(ann.rain) ann.rain.d <- diff(ann.rain) plot(ann.rain.d, type = "n", las = 1, ylab = "Differences of Annual rainf xlab = "Year") acf(ann.rain.d, main = "") pacf(ann.rain.d, main = "")
Example 7.4: An example of an arima(2, 1, 1) model is Wt = 0.3Wt1 0.1Wt2 + et 0.24et1 , where Wt = Yt Yt1 is the stationary, dierenced time series. The model for the original series, {Yt }, is therefore (Yt Yt1 ) = 0.3(Yt1 Yt2 ) 0.1(Yt2 Yt3 ) + et 0.24et1 Yt = 1.3Yt1 0.4Yt2 + 0.1Yt3 + et 0.24et1 .
7.4.2
Estimation
The r function arima can be used to t arima models, with only a simple change to what was seen for stationary models. Example 7.5: In Example 7.3, three models are considered. To t the arima(0, 1, 1) model, use the code > ann.rain.ma1 <- arima(ann.rain, order = c(0, + 1, 1)) > ann.rain.ma1
7.4. arima models Call: arima(x = ann.rain, order = c(0, 1, 1)) Coefficients: ma1 -0.7036 s.e. 0.1208 sigma^2 estimated as 5548: log likelihood = -516,
137
aic = 1035.99
We have now seen what the second element of order is for: it indicates the order of the dierencing necessary to make the series stationary. The tted model for the rst dierences of the annual rainfall series is therefore Wt = 0.7036et1 + et where Wt = Yt Yt1 , and {Y } is the original time series of annual rainfall (since rst dierences were taken). This can be written as Yt Yt1 = 0.7036et1 + et and further unravelled to Yt = Yt1 0.7036et1 + et . To t the arima(1, 1, 0) model, proceed as follows: > ann.rain.ar1 <- arima(ann.rain, order = c(1, + 1, 0)) > ann.rain.ar1 Call: arima(x = ann.rain, order = c(1, 1, 0)) Coefficients: ar1 -0.4494 s.e. 0.0933 sigma^2 estimated as 6296: log likelihood = -521.46, aic = 1046.92
So the model for the rst dierence of annual rainfall is Wt = 0.4494Wt1 + et , where Wt = Yt Yt1 and {Y } is the original rainfall series. This can be also expressed as Yt = 0.5506Yt1 + 0.4494Yt2 + et in terms of the original rainfall series.
138
7.4.3
Backshift operator
When dierences are taken of a time series {Xt }, this is written using the backshift operator as Yt = (1 B)Xt . Example 7.6: In Example 7.4 an arma(2, 1) model was tted to a stationary series {Wt } (hence making an arima(2, 1, 1) model). The model can be written using the backshift operator as (1 0.3B + 0.1B 2 )Wt = (1 0.24B)et . Since {Wt } is a dierenced time series, Yt = (1 B)Wt , so the model for {Yt } written using the backshift operator is (1 0.3B + 0.1B 2 )(1 B)Yt = (1 0.24B)et . This expression can be expanded to give (1 1.3B + 0.4B 2 0.1B 3 )Yt = (1 0.24B)et , producing the same model as before.
Example 7.7: In Example 7.2, the AMO from 1948 to 1994 was examined. Two sets of dierences were required to make the data stationary. Looking at the sample acf and sample pacf of the twice-dierenced data shows that no model is necessary. The tted model is therefore is an arima(0, 2, 0) model. Using the backshift operator, the model is (1 B)2 At = et , where {A} is the AMO series.
7.5
Seasonal models
The most common type of non-stationarity is when the time series exhibits a seasonal pattern. Seasonal does not necessarily have anything to do with the seasons of Winter, Spring, and so on. It means that there is some kind of regular pattern in the data. This type of non-stationarity is very common in climatological and meteorological applications, where there is often an annual pattern evident in the data. Seasonal data is time series data that shows regular uctuation aligned usually with some natural time period (not just the actual seasons of Winter, Spring, etc). The length of a season is the time period over which the pattern repeats. For example, monthly data might show an annual pattern with a season of length 12, as the data may have a pattern that repeats each year (that is, each twelve months). These patterns usually appear in the sample acf and pacf.
7.5. Seasonal models Example 7.8:
139
The average monthly sea level at Darwin, Australia (in millimetres), obtained from the Joint Archive for Sea Level [1], is plotted in the top panel of Fig. 7.4. The sample acf and sample pacf are also shown. The code used to produce these Figure is given below: > > + > > > sealevel <- read.table("darwinsl.txt", header = TRUE) sl <- ts(sealevel$Sealevel/1000, start = c(1988, 1), end = c(1999, 12), frequency = 12) plot(sl, ylab = "Sea level (in m)", las = 1) acf(sl, lag.max = 40, main = "") pacf(sl, lag.max = 40, main = "")
The data show a seasonal patternthe sea level has a regular rise and fall according to the months of the year (as expected). The length of the season is therefore twelve, since the pattern is of length twelve, when the pattern then repeats. This seasonality also appears in the sample acf.
Seasonal time series have a non-stationary mean, but the non-stationarity is of a regular kind (that is, every year or every month a cycle repeats). These type of time series can be represented using a model that explicitly allows for the seasonality. In seasonal models, ar and ma components can be introduced at the value of the season. For example, the model Xt = et 0.23Xt12 might be used to model monthly data (where the season length is twelve, as the data might be expected to repeated each year). This model explicitly models the seasonal pattern by incorporating an autoregressive term at a lag of twelve. This model is a seasonal ar model. More generally, a seasonal arma model may have the usual non-seasonal ar and ma components (or the ordinary ar and ma components) but also seasonal ar and ma components. The model in the previous paragraph is a seasonal ar(1) model, since the one ar term is one season before. Similarly, for a time series with a season of length twelve, an example of a seasonal ar(2) model is Yn+1 = en+1 + 0.17Yn11 0.55Yn23 ,
140
4.2 Sea level (in m) 4.1 4.0 3.9 3.8 1988 1990 1992 1994 Time 1996 1998 2000
ACF
0.4 0.0
0.0
0.4
0.8
0.5
1.0
1.5 Lag
2.0
2.5
3.0
Partial ACF
0.4 0.0
0.0
0.4
0.8
0.5
1.0
1.5 Lag
2.0
2.5
3.0
Figure 7.4: The monthly average sea level at Darwin, Australia in metres. Top: the data are plotted. Centre: the sample acf and Bottom: the sample pacf.
7.5. Seasonal models
141
since the rst ar term is one season (12 time steps) behind, and the second ar term is two seasons (2 12 = 24 time steps) behind. Sometimes it is also necessary to take seasonal dierences. If a time series {Xt } shows a very strong seasonal component with a season of length s, then a seasonal dierence of the form Yt = Xt Xts is used to create a more stationary time series. Again, the r function diff is used with an optional parameter given to indicate the season length. Example 7.9: The Darwin sea average monthly sea level data (Example 7.8, p 139) has a strong seasonal pattern. Taking seasonal dierences seems appropriate: > dsl <- diff(sl, 12) > plot(dsl, las = 1) The plot of the seasonally dierenced data (Fig. 7.5, top panel) suggests the series is still possibly non-stationary in the mean, so taking ordinary (non-seasonal) dierences also seems appropriate: > ddsl <- diff(dsl) > plot(ddsl, las = 1) The plot of the twice-dierenced data (Fig. 7.5, bottom panel) is now approximately stationary.
Example 7.10: Krner & Rannik [25] using the seasonal ma model a xt xt12 = et 1 et12 to model cloud amount, where seasonal dierence have been initially taken.
7.5.1
Identifying the season length
Sometimes it is easy to identify the length of a season as it is aligned with a yearly or seasonal cycle. If this is the case, the season length should be made to aligned with the natural season. But for many climatological variables this is not true. In these cases, identifying the season length can be dicult.
142
0.3 0.2 dsl 0.1 0.0 0.1 0.2 1990 1992 1994 Time 1996 1998 2000
0.10 0.05 ddsl 0.00
0.05 0.10
1990
1992
1994 Time
1996
1998
2000
Figure 7.5: The dierences in monthly average sea level at Darwin, Australia in metres (see also Fig. 7.4). Top: the seasonal dierences are plotted, while in the bottom plot, both seasonal and non-seasonal dierences have been taken.
143
10 0 qbo 10 20 30 1960 1970 1980 Time 1990 2000
Series: x Raw Periodogram

1e+03 spectrum 1e03 1e01 1e+01
3 frequency bandwidth = 0.00601
Series: x Smoothed Periodogram

1e+02 spectrum 1e02 1e+00
3 frequency bandwidth = 0.0662
Figure 7.6: The Quasi-Biennial Oscillation (QBO) from 1955 to 2001. Top: the the QBO is plotted and shown cyclic behaviour. Middle: the spectrum is shown. Bottom: the spectrum is shown again, but has been smoothed. To help identifying the season length, a periodogram, or spectrum, is used. The spectrum examines many frequencies in the data and computes the strength of each possible frequency. Hence any frequency that is very strong is an indication of the period of the season. Example 7.11: The quasi-biennial oscillation, or QBO, is calculated at the Climate Diagnostic Centre from the zonal average of the 30mb zonal wind at the equator. The monthly data have a distinct seasonal pattern (Fig. 7.6, top panel) but is not aligned with years or seasons, or anything else useful. Using the spectrum is found using the function spectrum as follows: , > qbo <- ts(qbo, start = c(1955, 1), end = c(2001,
144 + 12), frequency = 12) > qbo.spec <- spectrum(qbo)
This spectrum (Fig. 7.6, centre panel) is very noisy. It is best to smooth the plot, as shown below. (You do not need to understand what this smoother does or how it works; the point is that it smooths the plot.) > k5 <- kernel("daniell", 5) > qbo.spec <- spectrum(qbo, kernel = k5) The result is a much smoother spectrum (Fig. 7.6, bottom panel). The season length is identied as the frequency where the spectrum is at its greatest. This can also be done in r: > max.spec <- max(qbo.spec$spec) > max.freq <- qbo.spec$freq[max.spec == qbo.spec$spec] > max.freq [1] 0.4375 > 1/max.freq [1] 2.285714 The maximum frequency corresponds to 2.3 seasons, or 2.3 years in this case.
Random numbers are expected to have a spectrum that is fairly constant for all frequencies; see the following example. Example 7.12: In this example, we look at the spectrum of four sets of random numbers from a Normal distribution. > > > > + + + set.seed(102030) par(mfrow = c(2, 2)) k5 <- kernel("daniell", 5) for (i in (1:4)) { random.numbers <- rnorm(1000) spectrum(random.numbers, kernel = k5) }
In the output (Fig. 7.7)., no frequencies stand out as being much stronger than others.

1.0 1.5
145
spectrum
1.0
spectrum 0.0 0.1 0.2 0.3 0.4 0.5
0.5
0.2
0.5 0.0
0.1
0.2
0.3
0.4
0.5
frequency bandwidth = 0.00318

2.0 2.0 spectrum 0.0 0.1 0.2 0.3 0.4 0.5 0.5 1.0
spectrum
0.2
0.5
1.0
0.0
0.1
0.2
0.3
0.4
0.5
Figure 7.7: Four replication of a spectrum from 1000 Normal random numbers. There is no evidence of one frequency dominating.
7.5.2
Notation
These models are very dicult to write down. There are a number of parameters that must be included: 1. The order of non-seasonal (or ordinary) dierencing, d; 2. The order of the non-seasonal ar model, p; 3. The order of the non-seasonal ma model, q; 4. The length of a season, s; 5. The order of seasonal dierencing, D; 6. The order of the seasonal ar model, P ; 7. The order of the seasonal ma model, Q. Note that any one model should only have a few parameters, and so some of p, q P , and Q are expected to be zero. In addition, d + D is most often one, sometimes two, and rarely greater than two. These parameters are summarized by writing a model down as follows: A model with all of the above parameters would be written as a arima(p, d, q) (P, D, Q)s model.
146
Example 7.13: Consider a time series {Rt }. The series is non-stationary, and ordinary dierences and seasonal dierences (period 7) are taken to make the series stationary in the mean. An ordinary ar(2) model and seasonal ma(1) model is then tted. The nal model is a arima(2, 1, 0) (0, 1, 1)7 model.
Example 7.14: Consider a time series {Zt }. The series is non-stationary, and two sets of seasonal dierences (period 12) are taken to make the series stationary in the mean. An ordinary arma(1, 1) model is then tted. The nal model is a arima(1, 0, 1) (0, 2, 0)12 model.
Example 7.15: Consider a time series {Pn }. The series is non-stationary, and one set of seasonal dierences (period 4) are taken to make the series stationary in the mean. An ordinary ma(1) model and seasonal ar(2) model is then tted. The nal model is a arima(0, 0, 1) (2, 1, 0)4 model.
Example 7.16: Consider a time series {An }. The series is non-stationary, and one set of seasonal dierences (period 12) are taken to make the series stationary in the mean. The data then appears to be white noise (that is, the acf andpacf suggest no model to be tted). The nal model is a arima(0, 0, 0) (0, 1, 0)12 model.
When writing seasonal components of the model, it is usual to write seasonal ar terms with a capital phi: . Likewise, seasonal ma models are written using a capital theta: . This is in line with using capital P and Q for the orders of the seasonal components. Example 7.17: In Example 7.8 (p 139), the average monthly sea level at Darwin was analysed. In Example 7.9 (p 141), seasonal dierences were taken to make the data stationary. The seasonally dierenced data (Fig. 7.5, top panel) was non-statonary. The seasonally dierenced and non-seasonally dierenced data (Fig. 7.5, bottom panel) looks approximately stationary. The sample acf and pacf of this series is shown in Fig. 7.8.
147
1.0
0.5
Partial ACF 0.0 0.5 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Lag
ACF
0.4 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Lag
Figure 7.8: The sample acf and pacf for the twice-dierenced monthly average sea level at Darwin, Australia in metres. Top: the sample acf; Bottom: the sample pacfof the twice-dierenced data are shown. For the non-seasonal components of the model, the sample acf suggests no model is necessary (the one component above the dotted condence interval can probably be ignoredit is just over the approximate lines and is at a lag of two). The sample pacf suggest no model is needed eitherthough there is again a marginal component at a lag of two. (It may be necessary to include these terms later, as wil become evident in the diagnostic analysis, but it is unlikely.) For the seasonal model, the sample pacf decays very slowly (there is one at seasonal lag 1, lag 2 and lag 3), suggesting a large number seasonal ar terms would be necessary. In contrast, the sample acf suggests one seasonal ma term is needed. In summary, two dierences have been taken (so d = 1 and D = 1). No non-seasonal model seems necessary (so p = q = 0), but a seasonal ma(1) term is suggested (so P = 0 and Q = 1). So the model is arima(0, 1, 0) (0, 1, 1)12 , and there is only one parameter to estimate (the seasonal ma(1) parameter).
7.5.3
Earlier, it was shown the backshift operator equivalent of taking non-seasonal dierences was (1 B). Similarly, if the series {Xt } is seasonally dierenced with a season of length s, then the backshift operator equivalent is (1 B s )Xt .
0.3
0.2
0.1
0.0
0.1
0.2
148
The general form of an arima(p, d, q) (P, D, Q)s model is written using the backshift operator as (1 B)d (1 B s )D (B)(B)Xt = (B)(B)et , where (B) is the non-seasonal ar component written using the backshift operator, (B) is the seasonal ar component written using the backshift operator, (B) is the non-seasonal ma component written using the backshift operator, and (B) is the the seasonal ma component written using the backshift operator. The terms in the seasonal components decay in steps of the season-length (that is, of the season has a length of seven, (B) = 1 + 0.311 B 7 0.192 B 14 is a typical term). Example 7.18: In Example 7.17, one model suggested for the average monthly sea level at Darwin was arima(0, 1, 0) (0, 1, 1)12 . Using the backshift operator, this model is (1 B)(1 B 12 )Xt = (B)et , where (B) = 1 + 1 B 12 . Using r, the unknown parameter is calculated to be 0.9996, so the model is (1 B)(1 B 12 )Xt = (1 0.9996B 12 )et .
Example 7.19: Maier & Dandy [31] use arima models to model the daily salinity at Murray Bridge, South Australia from Jan 1, 1987 to 31 Dec 1991. They examined numerous models, including some models not in the BoxJenkins methodology. The best BoxJenkins models were those based on one set of non-seasonal dierences, and one or two sets of seasonal dierences, with a season of length s = 365. One of their nal models was the arima(1, 1, 1) (1, 2, 0)365 model (1 B)(1 B 365 )2 (1 + 0.267B)(1 0.513B 365 )Xt = (1 0.455B)et , where the daily salinity is {Xt }.
149
7.5.4
Estimation
Estimation of seasonal arima models is quite tricky, as there are many parameters that could be specied: the ar and ma components both seasonally and non-seasonally. This is part of the help from the arima function arima(x, order = c(0, 0, 0), seasonal = list(order = c(0, 0, 0), period = NA) ) The input order has been used previously; to also speciy seasonal components, the input seasonal must be used. Example 7.20: In Example 7.17 (p 146), an arima(0, 1, 0) (0, 1, 1)12 was suggested for the average monthly sea level at Darwin. This model is tted in r as follows: > dsl.small <- arima(sealevel$Sealevel, order = c(0, + 1, 0), seasonal = list(order = c(0, 1, + 1), period = 12)) > dsl.small Call: arima(x = sealevel$Sealevel, order = c(0, 1, 0), seasonal = list(order = c(0, 1, 1), period = 12)) Coefficients: sma1 -0.9996 s.e. 0.2305 sigma^2 estimated as 1013: So the tted model is written
ord. di
aic = 1312.02
(1 B) (1 B 12 ) Dt = (1 0.99957B 12 )et
seas. di
Dt Dt12 Dt1 + Dt13 = et 0.99957et12 . This means the estimated seasonal ma parameter is 0.99957.
150
7.6
Forecasting
The principles of forecasting used earlier apply to arima and seasonal arima models without signicant dierences. However, it is necessary to write the model without using the backshift operator rst, which can be quite tedious. Example 7.21: Consider the arima(1, 0, 0) (0, 1, 1)4 model Wn = 0.20Wn1 + en 0.16en4 , where Wn = Zn Zn4 is the seasonally dierenced series. Using the backshift operator, the model is (1 B 4 ) (1 0.20B)Zt = (1 0.16B 4 )et .
seasonal di.
Expanding the backshift terms gives (1 0.2B B 4 + 0.20B 5 )Zt = (1 0.16B 4 )et , so the model is written as Zn = 0.2Zn1 + Zn4 0.20Zn5 0.16en4 + en . A one-step ahead forecast is given by Zn+1|n = 0.2Zn + Zn3 0.20Zn4 0.16en3 . A two-step ahead forecast is given by Zn+2|n = 0.2Zn+1|n + Zn2 0.20Zn3 0.16en2 .
Example 7.22: In Example 7.19, the arima(1, 1, 1) (1, 2, 0)365 model (1 B)(1 B 365 )2 (1 + 0.267B)(1 0.513B 365 )Xt = (1 0.455B)et , was given for the daily salinity at Murray Bridge, South Australia, say {Xt }. After expanding the terms on the left-hand side, there will be terms involving B, B 2 , B 365 , B 366 , B 367 , B 730 , B 731 , B 732 , B 1095 , B 1096 and B 1097 . This makes it very dicult to write down. Indeed, without using the backshift operator as above, it would be very tedious to write down the model at all, even though only three parameters have been estimated. Note this is an unusual case of model tting in that three sets of dierences were taken.
7.7. Diagnostics
Series resid(sma1)
0.00 0.10 0.20 1.0
151
Series resid(sma1)
Partial ACF 0.0 0.5 1.0 Lag 1.5
ACF
0.2
0.6
0.2
0.15
0.5
1.0 Lag
1.5
Series: resid(dsl.small)
Normal QQ Plot
q
Sample Quantiles
0.8
100 0.0 0.2 frequency 0.4
q qq qq q q qq qq qqq qq qq qq qq qq qq qq qq qq q q qq q qq qq qq qq qq qqq qqq qq qq qq q qq q q qq qq qq qq q qq qq qq q q q q q q q q q q q qq qq qq q q q
0.0
0.4
50
Figure 7.9: Some residual plots for the arima(0, 1, 0) (0, 1, 1)12 tted to the monthly sea level at Darwin. Top left: the residual acf; top right: the residual pacf; bottom left: the cumulative periodogram; bottom right: the QQ plot
7.7
Diagnostics
The usual diagnostics apply equally for non-stationary models; see Module 6. Example 7.23: In Example 7.17, the model arima(0, 1, 0) (0, 1, 1)12 was suggested for the monthly sea level at Darwin. The residual acf, residual pacf and the cumulative periodogram can be produced in (Fig. 7.9). > + > > > sma1 <- arima(sl, order = c(0, 1, 0), seasonal = list(order = c(0, 1, 1), period = 12)) acf(resid(sma1)) pacf(resid(sma1)) cpgram(resid(sma1))
Both the residual acf and pacf look OK, but both have a signicant term at lag 2; the periodogram looks a little suspect, but isnt too bad. The BoxPierce Q statistic can be computed, and the standard error of the estimated parameter found also:
152 > Box.test(resid(sma1)) Box-Pierce test
data: resid(sma1) X-squared = 2.9538, df = 1, p-value = 0.08567 > coef(sma1)/sqrt(diag(sma1$var.coef)) sma1 -4.335704 The BoxPierce test is OK, but is marginal. The estimated parameter is signicant. The QQ plot is OK if not perfect. In summary, the arima(0, 1, 0) (0, 1, 1)12 looks OK, but there are some points of minor concern. Is there perhaps a better model? Perhaps a model with a term at lag 2 such as arima(2, 1, 0) (0, 1, 1)12 ? (The reason for proposing this model is that the residual acf and pacf suggest diculties at lag 2.) We t this model and compare the residual analyses; see Fig. 7.10. > + > > > > > > oth.mod <- arima(sl, order = c(2, 1, 0), seasonal = list(order = c(0, 1, 1), period = 12)) acf(resid(oth.mod)) pacf(resid(oth.mod)) cpgram(resid(oth.mod)) qqnorm(resid(oth.mod)) qqline(resid(oth.mod)) Box.test(resid(oth.mod)) Box-Pierce test data: resid(oth.mod) X-squared = 0.0116, df = 1, p-value = 0.9144 > coef(oth.mod)/sqrt(diag(oth.mod$var.coef)) ar1 -1.309895 ar2 sma1 2.202391 -4.992391
The residual acf and pacf appear better, as does the periodogram. The BoxPierce statistic now certainly not signicant, but one of the parameters is unnecessary. (This was expected; we only really wanted
7.7. Diagnostics
153
Series resid(oth.mod)
1.0
Series resid(oth.mod)
Partial ACF 0.0 0.5 1.0 Lag 1.5
ACF
0.6
0.2
0.2
0.15
0.00
0.10
0.5
1.0 Lag
1.5
Series: resid(oth.mod)
Normal QQ Plot
q q q q qq qq q qq qq qq qq qq qq qq qqq qq qq qq q q qq qq qq qq q qqq qqq qqq qqq qqq q q qq qq qq qq q qq q q q q q q qq qq qq q qq q qq qq qq q q q qq q
Sample Quantiles 0 1 2 3 4 5 6
0.8
0.4
0.10
0.0
0.00
frequency
Figure 7.10: Some residual plots for the arima(2, 1, 0) (0, 1, 1)12 tted to the monthly sea level at Darwin. Top left: the residual acf; top right: the residual pacf; bottom left: the cumulative periodogram; bottom right: the QQ plot.
154
Module 7. Non-Stationary Models the second lag, but were forced to take the rst, insignicant one.) The QQ plot looks marginally improved also. Fitting a arima(0, 1, 2) (0, 1, 1)12 produces similar results. Which is the better model? It is not entirely clear; either is probably OK.
7.8
A summary of model tting
To summarise, these are the steps that need to be taken to t a good model:
Plot the data. Check that the data is stationary. If the data is not stationary, deal with it appropriately (by taking logarithms or dierences (seasonal and/or non-seasonal), or perhaps both). Remember that it is rare to require many levels of dierencing. Examine the sample acf, sample pacf and/or the AIC to determine possible models to for the data. Models may include ma, ar, arma or arima models, with non-seasonal and/or seasonal aspects. (Remember that is it rare to have models with a large number of parameters to be estimated.) You may have to use a periodogram to identify season length. Use rs arima function to t the models and determine the parameter estimates. Perform the following diagnostic checks for each of the possible models.
examine the residual acf and pacf; the cumulative periodogram of residuals; and the BoxPierce statistic; examine the QQ plot; and the signicance of the parameter estimates.
Choose the best model from the available information, and write down the model (probably using backshift operators). Remember that the simplest, most adequate model is the best model; more parameters do not necessarily make a better model.
These steps are summarized in the owchart in Fig. 7.11.
7.8. A summary of model tting
155
Plot the series
Is time series stationary?
Yes
No Use differences and/or logs
Identify possible models: AR, MA, ARMA or ARIMA (using ACF, PACF and/or AIC)
Estimate model parameters
Perform diagnostic checks
Is the model adequate? No
Yes Write down the final model
Figure 7.11: A owchart for tting arima (BoxJenkins) type models.
156
7.9
A complete example
The data le mlco2.dat contains monthly measurements of carbon dioxide above Mauna Loa Hawaii from Jan 1959 to Dec 1990 in parts per million (ppm). (Missing values have been lled in by linear interpolation.) The data were collected by Scripps Institute of Oceanography, La Jolla, California. The original source is the climatology database maintained by the Oak Ridge National Laboratory, and the data here have been obtained from Hyndman [5]. We will nd a suitable model for the data, and use diagnostic tests to determine if the model is adequate. A plot (Fig. 7.12, top panel) shows the series is clearly non-stationary in the mean, and is also seasonal. Taking non-seasonal dierence produces an approximately stationary series in the mean, but the series is still strikingly seasonal (Fig. 7.12, centre panel). Taking seasonal dierences (length 12) produces a series that appears stationary (Fig. 7.12, bottom panel). Using the stationary (twice-dierenced) series, the sample acf and pacf are shown in Fig. 7.13. To nd a model, rst consider the non-seasonal components. The acf suggests an ma(1) or perhaps ma(3) model. The pacf suggests an ar(1) or perhaps ar(3) model. At this stage, choosing either the ma(1) or ar(1) model seems appropriate as the terms at lag 3 are marginally over the approximate condence limits. Which to choose? Since the lag 3 term seems more marginal in the acf, perhaps the ma(1) model is the best choice (this may not turn out to be the case). Consider now the seasonal components. The acf has a strong term at a seasonal lag of 1 only, suggesting a seasonal ma(1) model. In contrast, the pacf shows signicant terms at a seasonal lag of 1, 2 and 3 (and there may be more if we looked at higher seasonal lags). This suggests a seasonal model of at least ar(2). For the seasonal component, the best model is the ma(1). Combining this information suggests the model arima(0, 1, 1) (0, 1, 1)12 . This model is tted as follows: > co.model <- arima(co, order = c(0, 1, 1), + seasonal = list(order = c(0, 1, 1), season = 12)) Is this model an adequate model? The residual acf and pacf (Fig. 7.14) suggest the model is adequate. The cumulative periodogram and QQ plots (Fig. 7.15) indicate the model is adequate. The two estimated parameters are also signicant:
7.9. A complete example
157
350 340 co 330 320
1960
1965
1970
1975 Time
1980
1985
1990
First difference of CO2

2 1 dco 0 1 2
1960
1965
1970
1975 Time
1980
1985
1990
Seasonal and nonseasonal differences of CO2
1.0 0.5 ddco 0.0 0.5 1.0 1960 1965 1970 1975 Time 1980 1985 1990
Figure 7.12: The monthly measurements of carbon dioxide above Mauna Loa, Hawaii. from Jan 1959 to Dec 1990 in parts per million (ppm). Top: the data is clearly non-stationary in the mean and is seasonal; Middle: the rst dierences have been taken; Bottom: the seasonal dierences have also been taken, and now the series appears stationary.
158
ACF 0.5 0.0 0.0
0.5
1.0
0.5
1.0
1.5 Lag
2.0
2.5
3.0
Partial ACF
0.3 0.0
0.1
0.1
0.5
1.0
1.5 Lag
2.0
2.5
3.0
Figure 7.13: The monthly measurements of carbon dioxide above Mauna Loa, Hawaii from Jan 1959 to Dec 1990 in parts per million (ppm). Top: the sample acf of the twice-dierenced series; Bottom: the sample pacf of the twice-dierenced series.
7.9. A complete example
159
Series resid(co.model)
ACF
0.0 0.0
0.4
0.8
0.5
1.0 Lag
1.5
2.0
Series resid(co.model)
0.00 0.05 0.10 0.10
Partial ACF
0.5
1.0 Lag
1.5
2.0
Figure 7.14: The monthly measurements of carbon dioxide above Mauna Loa, Hawaii. from Jan 1959 to Dec 1990 in parts per million (ppm). Top: the residual acf; Bottom: the residual pacf.
160
Normal QQ Plot
q q q qq q q q q q q q q q q q q q q q q q q q q qq q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q qq q q q q q q q q q
Series: resid(co.model)
1.0 0.5 Sample Quantiles 0 1 2 3 frequency 3 4 5 6 1.0 0.5
q
0.8
0.0
0.2
0.4
0.6
0.0
Figure 7.15: The monthly measurements of carbon dioxide above Mauna Loa, Hawaii. from Jan 1959 to Dec 1990 in parts per million (ppm). The cumulative periodogram indicates that the model is adequate. > co.model$coef/sqrt(diag(co.model$var.coef)) ma1 sma1 -6.62813 -27.24159 The model suggested is > co.model
Call: arima(x = co, order = c(0, 1, 1), seasonal = list(order = c(0, 1, 1), season = 1 Coefficients: ma1 -0.3634 s.e. 0.0548
sma1 -0.8581 0.0315 log likelihood = -66.66, aic = 139.33
Using backshift operators, the tted model is B(1 B 12 )Ct = (1 0.3634B 0.8581B 1 2)et .
7.10. Summary
161
7.10
Summary
In this Module, three types of non-stationarity have been considered: nonstationarity in the mean, non-stationarity in the variance, and seasonal models. For each, identication and estimation has been considered, as well as the notation for each. The diagnostic testing involved is the same as for stationary models.
7.11
Exercises
Ex. 7.24: Consider a arima(1, 0, 0) (0, 1, 1)7 model tted to a time series {Pn }. Write this model using the backshift operator notation (make up some reasonable parameter estimates). Ex. 7.25: Consider a arima(1, 1, 0) (1, 1, 0)12 model tted to a time series {Yn }. Write this model using the backshift operator notation (make up some reasonable parameter estimates). Ex. 7.26: Consider some non-stationary data {W }. After taking nonseasonal dierences, the series seems stationary. Let this dierenced data be {Y }. A non-seasonal ar(1) model and seasonal ma(2) is tted to the stationary data (the season is of length 12). (a) Write down the model tted to the series {W } using the backshift notation; (b) Write down the model tted to the series {W } using arima notation. (c) Write the model out in terms of Wt , et and previous terms (that is, dont use the backshift operator). Ex. 7.27: Consider some non-stationary data {Z}. After taking seasonal dierences, the series seems stationary. Let this dierenced data be {Y }. A non-seasonal ma(2) model and seasonal arma(1, 1) is tted to the stationary data (the season is of length 24). (a) Write down the model tted to the series {Z} using the backshift notation; (b) Write down the model tted to the series {Z} using arima notation. (c) Write the model out in terms of Zt , et and previous terms (that is, dont use the backshift operator).
162
Ex. 7.28: For each of the following cases, write down the nal model using the backshift operator and using notation. (a) The time series {P } is non-stationary; after taking ordinary differences, an arma(1, 0) model was tted to the data. (b) The time series {T } is seasonal with period 12. After seasonal dierence were taken, a seasonal ma(2) model was tted to the data. Ex. 7.29: For each of the following cases, write down the nal model using the backshift operator and using notation. (a) The time series {Y } is non-stationary; after taking seasonal differences (season of length 12), an arma(1, 1) model was tted to the data. (b) The daily time series {S} is seasonal with period 365. After ordinary and seasonal dierence were taken, an arma(1, 1) model was tted to the data. Ex. 7.30: For the following models written using backshift operators, expand the model and write down the model in standard form. In addition, write down the model using arima notation. (a) (1 B)(1 0.3B 12 )Xt = (1 + 0.2B)et . (b) (1 B 7 )Hn = (1 0.5B 0.2B 2 )en . (c) (1 0.3B)Wn+1 = (1 0.4B)en+1 . Ex. 7.31: For the following models written using backshift operators, expand the model and write down the model in standard form. In addition, write down the model using arima notation. (a) (1 B)2 (1 + 0.3B)Yt = et . (b) (1 B 12 )(1 B)(1 + 0.3B)Mn+1 = en+1 . (c) Wn+1 = (1 0.4B)(1 + 0.3B 7 )en+1 . Ex. 7.32: Consider some non-stationary monthly data {G}. After taking seasonal dierences, the series seems stationary. Let this dierenced data be {H}. A non-seasonal ma(2) model, a seasonal ma(1) and a seasonal ar(1) model are tted to {H}. Write down the model tted to the series {G} using (a) the backshift operator; (b) arima notation.
7.11. Exercises
163
(c) Make up some (reasonable) numbers for the parameters in this model. Then write the model out in terms of Gt , et and previous terms. Ex. 7.33: Trenberth & Stepaniak [43] dened an index of El Nio evolution n they called the Trans-Nio Index (TNI). This monthly time series is n given in the data le tni.txt, and contains values of the TNI from January 1958 to December 1999. (The data have been obtained from the Climate Diagnostic Center [2].) (a) Plot the series and see that it is a little non-stationary. (b) Use dierences to make the series stationary. (c) Find a suitable ar model for the series. (d) Find a suitable ma models for the series. (e) Which model would you prefer: the ar or ma model? Explain your answer using diagnostic analyses. (f) For your prefered model, estimate the parameters. Ex. 7.34: The sunspot numbers from 1770 to 1869 were given in Table 1.2 (p 20). The data are given in the data le sunspots.dat. (a) Plot the data and decide if a seasonal component appears to exist. (b) Use spectrum (and a smoother) to nd any seasonal components. (c) Suggest a possible model for the data (make sure to do a diagnostic analysis). Ex. 7.35: The quasi-bienniel oscillation (QBO) was considered in Exercise 1.7. (a) Plot the data and decide if a seasonal component appears to exist. (b) Use spectrum (and a smoother) to nd any seasonal components. (c) Suggest a possible model for the data (make sure to do a diagnostic analysis). Ex. 7.36: The average monthly air temperatures in degrees Fahrenheit at Nottingham Castle has been recorded for 20 years and is given in the data le nottstmp.txt. (The data are from Anderson [6, p 166], as quoted in Hand et al. [19, p 279].) Find a suitable time series model for the data. Note that the season is expected to be of length 12. See if you can discover this from the unsmoothed spectrum, and also from the smoothed spectrum.
164
Ex. 7.37: Krner & Rannik [25] t an arima(0, 0, 0) (0, 1, 1)12 to the Intera national Satellite Cloud Climatology Project (ISCCP) cloud detection time series {Cn }. They t dierent model for dierent latitudes. At 90 latitude, the unknown model parameter is about 0.7 (taken from their Figure 5). (a) Write this model using the backshift operator. (b) Write the model in terms of Cn and en . (c) Develop a forecasting model for forecasting one-, two-, twelveand thirteen- steps ahead. Ex. 7.38: The streamow in Little Mahoning Creek, McCormick, Pennsylvania, from 1940 to 1988 is given in the data le mcreek.txt. The le contains the monthly mean values of streamow in cubic feet per second. Find a suitable time series model for the data. (Make sure to do a diagnostic analysis.) Ex. 7.39: The data le wateruse.dat contains the annual water usage in Baltimore city in litres per capita per day from 1885 to 1963. The data are from Hipel & McLeod [21] and Hyndman [5]. Plot the data and conrm that the data is non-stationary. (a) Use appropriate methods to make the series stationary. (b) Find a suitable model for the series and estimate the parameters of the model. Make sure to do a diagnostic analysis. (c) Write the model using the backshift operator. Ex. 7.40: The le firring.txt contain the tree ring indicies for the Douglas r at the Navajo National Monument in Arizona, USA from 1107 to 1968. Find a suitable model for the data. Ex. 7.41: The data le venicesealevel.dat contains the maximum sea levels recorded at Venice from 18871981. Find a suitable model for the times series, including a diagnostic analysis of possible models.
7.11.1
7.24 The model is of the form (1 B 7 )(1 B)Pn = (1 B 7 )et for some values and .
7.11. Exercises
165
7.26 (a) (1 B)(1 B)Wt = (1 + 1 B 12 + 2 B 24 )et for some numbers , 1 and 2 . (b) arima(1, 1, 0) (0, 0, 2)12 . (c) Expanding the model written using backshift operators gives (1 (1 + )B + B 2 )Wt = (1 + 1 B 12 + 2 B 24 )et . This is equivalent to Wt = (1 + )Wt1 Wt2 + et + 1 et12 + 2 et24 . 7.28 (1 B)(1 B)Pt = et which is arima(1, 1, 0) (0, 0, 0)0 ; (1 B 12 )Tt = (1 + 1 B 12 + 2 B 24 )et which is arima(0, 0, 0) (0, 1, 2)12 . 7.30 (a) Xt = Xt1 + 0.3Xt12 0.3Xt13 + et + 0.2et1 , which is an arima(0, 1, 1) (1, 0, 0)12 model. (b) Hn = Hn7 + en 0.5en1 0.2en2 which is a arima(0, 0, 2) (1, 0, 0)7 model. (c) Wn+1 = 0.3Wn + en+1 0.4en which is a arima(1, 0, 1) (0, 0, 0)? model; ie, it is not seasonal. 7.39 The series is plotted in the top plot in Fig. 7.16. The data are clearly non-stationary in the mean. Taking dierence produces an approximately stationary series; see the bottom plot in Fig. 7.16. Using the stationary dierenced series, the sample acf and pacf are shown in Fig. 7.17. These plots suggest that no model can be tted to the dierenced series. That is, the rst dierences are random. The model for the water usage {Wt } is therefore (1 B)Wt = et or Wt = Wt1 + et . There are no parameters to estimate. Here is the code used: > > > > > + > > wu <- read.table("wateruse.dat", header = TRUE) wu <- ts(wu$Use, start = 1885) plot(wu, las = 1) dwu <- diff(wu) plot(dwu, main = "First difference of water use", las = 1) acf(dwu, main = "") pacf(dwu, main = "")
The diagnostics have been left for you.
166
650 600 550 wu 500 450 400 350 1900 1920 Time 1940 1960
First difference of water use
100 50 dwu 0 50 100 150 1900 1920 Time 1940 1960
Figure 7.16: The annual water usage in Baltimore city in litres per capita per day from 1885 to 1968. Top: the data is clearly non-stationary in the mean; Bottom: the rst dierences are approximately stationary.
7.11. Exercises
167
ACF 0.2 0 0.2
0.6
1.0
5 Lag
10
15
Partial ACF
0.2
0.0
0.1
0.2
10 Lag
15
Figure 7.17: The annual water usage in Baltimore city in litres per capita per day from 1885 to 1968. Top: the sample acf of the dierenced series; Bottom: the sample pacf of the dierenced series.
168 > par(mfrow = c(1, 2)) > plot(vs) > plot(diff(vs))
180
160
140
120
diff(vs) 1900 1920 1940 Time 1960 1980
vs
100
60
60
40
80
20
20
40
60
80
1900
1920
1940 Time
1960
1980
Figure 7.18: A plot of the Venice sea level data. Left: original data; right: after taking rst dierences 7.41 First, load and prepare the data: > VSL <- read.table("venicesealevel.dat", header = TRUE) > vs <- ts(VSL$MaxSealevel, start = c(1887)) A plot of the data shows the series is non-stationary (Fig. 7.18, left panel) and the data increasing (what is the implication there?). Taking dierence produces a more stationary series (Fig. 7.18, right panel). See the acf and pacf (Fig. 7.19); the acf suggests an ma(1) model (or possibly ma(3), but start with the simpler choice), while the pacf suggests an ar(2) model. Decide to start with the ma(1) model: > vs.ar1 <- arima(vs, order = c(1, 1, 0)) The residual acf and pacf arent great (Fig. 7.20); there are quite a few components outside the approximate condence limits, but the components at lag 1 are ne in both plots. Maybe the ma(1) would be better? That does appear to be true (Fig. 7.21). > vs.ma1 <- arima(vs, order = c(0, 1, 1)) This model appears ne, if not perfect, so lets examine more diagnostics (Fig. 7.22); these look OK too. So, for some nal diagnostics:
7.11. Exercises
169
> par(mfrow = c(1, 2)) > acf(diff(vs)) > pacf(diff(vs))

Series diff(vs)
1.0 0.2
Series diff(vs)
0.8
0.6
Partial ACF 0 5 10 Lag 15
0.4
ACF
0.2
0.0
0.2
0.4
0.4
0.3
0.2
0.1
0.0
0.1
10 Lag
15
Figure 7.19: The acf and pacf of the Venice sea level data
> par(mfrow = c(1, 2)) > acf(resid(vs.ar1)) > pacf(resid(vs.ar1))

Series resid(vs.ar1)
1.0 0.2
Series resid(vs.ar1)
0.8
0.6
Partial ACF 0 5 10 Lag 15
ACF
0.4
0.2
0.2
0.0
0.3
0.2
0.1
0.0
0.1
10 Lag
15
Figure 7.20: The residual acf and pacf of the Venice sea level data after tting the ar(1) model
170
> par(mfrow = c(1, 2)) > acf(resid(vs.ma1)) > pacf(resid(vs.ma1))

Series resid(vs.ma1)
1.0 0.2 Partial ACF 0 5 10 Lag 15 0.3 0.2 0.1 0.0 0.1
Series resid(vs.ma1)
ACF
0.2
0.0
0.2
0.4
0.6
0.8
10 Lag
15
Figure 7.21: The residual acf and pacf of the Venice sea level data after tting the ma(1) model
> > > >
par(mfrow = c(1, 2)) cpgram(resid(vs.ma1)) qqnorm(resid(vs.ma1)) qqline(resid(vs.ma1))

Normal QQ Plot
80
Series: resid(vs.ma1)
1.0
0.8
Sample Quantiles
60
qq q q q q q q q q q q q qq q qq q q q q q q q q qq qq q qq qq qq qq qq qq qq qq qq qq qq qq q q q q q qq q q q q q qq qq qq qq qq q q q q q q q q q qq qq q qq q
0.6
0.4
0.2
0.0
0.0
0.1
0.2
0.3
0.4
0.5
frequency
40
20
20
40
Figure 7.22: Further diagnsotic plot of the Venice sea level data after tting the ma(1) model
7.11. Exercises > Box.test(resid(vs.ma1)) Box-Pierce test data: resid(vs.ma1) X-squared = 0.8388, df = 1, p-value = 0.3597 > coef(vs.ma1)/sqrt(diag(vs.ma1$var.coef)) ma1 -16.48790 All looks well; decide the ma(1) model is suitable: > vs.ma1 Call: arima(x = vs, order = c(0, 1, 1)) Coefficients: ma1 -0.8677 s.e. 0.0526 sigma^2 estimated as 319.5: log likelihood = -405.11,
171
aic = 814.23
172
Module
Markov chains
Module contents
8.1 8.2 8.3 8.4 8.5 8.6 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 174 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . 174 The transition matrix . . . . . . . . . . . . . . . . . . . . 177 Forecast the future with powers of the transition matrix181 Classication of nite Markov chains . . . . . . . . . . 184 Limiting state (steady state) probabilities . . . . . . . . 187 8.6.1 Share of the market model . . . . . . . . . . . . . . . . 190 8.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191 8.7.1 Answers to selected Exercises . . . . . . . . . . . . . . . 200
Module objectives: Upon completion of this Module students should be able to:
state and understand the Markov property; identify processes requiring a Markov chain description; determine the transition and higher order matrices of a Markov chain; calculate state probabilities;
173
174
Module 8. Markov chains

determine and interpret the steady state distribution; calculate and interpret mean recurrence intervals; apply Markov chain techniques to basic decision making problems; determine future states or conditions using Markov analysis.
8.1
Introduction
Up to now, only continuous time series have been considered; that is, the quantity being measured over time is continuous. In this Module1 , a simple method is considered for time series that take on discrete values. A simple example is the state of the weather: If it is ne or if it is raining, for example.
8.2
Terminology
A stochastic process is a collection of random variables {X(t)} where the parameter t denotes time (possibly space) and ranges over some interval of interest; e.g. t 0. X(t) denotes the random variable X at time t. The values assumed by X(t) may be called states and the set of all possible states is called the state space. The state space (and hence X(t)) may be discrete or continuous: the state space of a queue is discrete; the state space of inter-event times is continuous. The time parameter t (sometimes called the indexing parameter) may also be discrete or continuous. In this Module, we study the case where the state space is discrete, and the time parameter t is also discrete (and equally spaced). Two examples of a discrete time stochastic process follow.
Let Y (n) be the volume of water in a reservoir at the start of month n. The parameter n is used in place of t to emphasise the fact that this parameter is discrete, taking on values 0, 1, 2,. . . . Although Y (n) is naturally continuous, since it is a measure of volume, it may be sucient in some applications to measure Y (n) on a crude scale containing relatively few values, in which case Y (n) would be treated as discrete. Let T (n) be the time between nth and (n + 1)th pulse registered on a Geiger counter. The indexing parameter n {0, 1, 2, . . .} is discrete and the state space continuous. A realisation of this process would be a discrete set of real numbers with values in the range (0, ).
1 Most of the material in this Module has been drawn from previous work by Dr Ashley Plank and Professor Tony Roberts.
8.2. Terminology
175
In this section we consider stochastic models with both discrete state space and discrete parameter space. Some example include annual survey of biological populations; monthly assessment of the water levels in a reservoir; weekly inventories of stock; daily inspections of a vending machine; microsecond sampling of a buer state. These models are used occasionally in climate modelling. Example 8.1: Tomorrows weather Consider the state of the weather on a day to day basis. Days may be classied as either ne/sunny or overcast/cloudy. Suppose a ne day follows a ne sunny day 40% of time and an overcast cloudy day follows an overcast day 20% of the time. For example, the data this conclusion comes from may be the following sequence of observations for consecutive days: C, S, C, C, S, S, S, C, S, C, S (though illustrative, this sample is far too small for real applications). See that as stated above, 2/5s of the sunny days are followed by a sunny day and that 1/5 of the cloudy days are followed by cloudy days. Dene Xt = 1, if day t is ne/sunny, 2, if day t is overcast/cloudy.
In other words, let state 1 correspond to sunny days, and state 2 to cloudy. We model this process as a Markov chain as dened below by assuming that for any two consecutive days t and t + 1 in the future: Pr {Xt+1 = 1 | Xt = 1} = 0.4 , It follows that Pr {Xt+1 = 2 | Xt = 1} = 1 0.4 = 0.6 , Pr {Xt+1 = 1 | Xt = 2} = 1 0.2 = 0.8 . This information is recorded on a state transition diagram such as:
0.6
Pr {Xt+1 = 2 | Xt = 2} = 0.2 .
Always draw a state transition diagram
1=sunny
0.4
2=cloudy
0.2
0.8
176
Module 8. Markov chains These four probabilities are conveniently represented as the matrix P = P11 P12 P21 P22 = 0.4 0.6 0.8 0.2 (8.1)
Note that the rows sum to one.
Markov chains are a special type of discrete-time stochastic process. For convenience, as above, we write times as an integral number of some basic units such as days, weeks, months, years or microseconds. Denition 8.1 (Markov chain) Suppose a discrete-time stochastic process can be in one of a nite number of states, generally labelled 1, 2, 3, . . . , s, then the stochastic process is called a Markov chain if Pr {Xt+1 = it+1 | Xt = it , Xt1 = it1 , . . . , X1 = i1 , X0 = i0 } = Pr {Xt+1 = it+1 | Xt = it } , This expression says that the probability distribution of the state at time t + 1 depends only on the state at time t (namely it ) and does not depend on the states the chain passed through on the way to it at time t. Usually we make a further assumption that for all states i and j and all t, Pr {Xt+1 = j | Xt = i} is independent of t. This assumption applies whenever the system under study behaves consistently over time. Any stochastic process with this behaviour is called stationary. Based on this assumption we write Pr {Xt+1 = j | Xt = i} = Pij , (8.2) so that Pij is the probability that given the system is in state i at time t, the system will be in state j at time t + 1. Pij s are referred to as the transition probabilities. Note that it is crucial that you clearly dene the states and the discrete times. Example 8.2: Preisendorfer and Mobley [37] and Wilks [48] use a threestate Markov chain to model the transitions between below-normal, normal and above-normal months for temperature and precipitation.
8.3. The transition matrix
177
8.3
The transition matrix
For a system with s states the transition probabilities are conveniently represented as an s s matrix P . Such a matrix P is called the transition matrix and each Pij is called a one-step transition probability. For example, P12 represents the probability that the process makes a transition from state 1 to state 2 in one period, whereas P22 is the probability that the system stays in state 2. Each row represents the one-step transition probability distribution over all states. If we observe the system in state i at the beginning of any period, then the ith row of the transition matrix P represents the probability distribution over the states at the beginning of the next period. The same transition matrix completely describes the probabilistic behaviour of the system for all future one-step transitions. The probabilistic behaviour of such a system over time is called a stationary Markov chain. Stationary because the matrix P is the same for transitions between all times. Example 8.3: Tomorrows weather continued Consider the weather example with transition matrix (8.1) and suppose today, t = 0, is sunny, state X0 = 1. Then from the given data the probability of being sunny tomorrow, state X1 = 1, is 0.4 and the probability of it being cloudy, X1 = 2, is thus 0.6 . So our forecast for tomorrows weather is the probabilistic mix p(1) = 0.4 0.6 , called a probability vector and denoting the probability of being sunny or cloudy respectively. As claimed above, this is just the rst row of the transition matrix P . What can we say about the weather in two days time? We seek a vector of probabilities, say p(2), giving the probabilities of the day after tomorrow being sunny of cloudy respectively. Given today is sunny, then the day after tomorrow can be sunny, X2 = 1, via two possible routes: it can be cloudy tomorrow then sunny the day after, with probability (as the Markov assumption is that these transitions are independent) Pr {X2 = 1 | X1 = 2} Pr {X1 = 2 | X0 = 1} = P21 P12 = 0.8 0.6 ; or it can be sunny tomorrow then sunny the day after, with probability (as the transitions are assumed independent) Pr {X2 = 1 | X1 = 1} Pr {X1 = 1 | X0 = 1} = P11 P11 = 0.4 0.4 . Since these are two mutually exclusive routes, we add their probability to determine Pr {X2 = 1 | X0 = 1} = 0.4 0.4 + 0.6 0.8 = 0.64 .
178
Module 8. Markov chains Similarly, the probability that it is cloudy the day after tomorrow is the sum of two possible routes: Pr {X2 = 2 | X0 = 1} = 0.4 0.6 + 0.6 0.2 = 0.36 . Combining these into one probability vector our probabilistic forecast for the day after tomorrow is p(2) = 0.64 0.36 . The important general feature of this example is that post-multiplication by the transition matrix determines how the vector of probabilities evolve, p(2) = p(1)P , as you see realised in the above two displayed expressions. This formula applies to the initial forecast of tomorrows weather too. Since we know today is sunny, the current state is p(0) = 1 0 denoting that we are certain the weather is in state 1. Then observe in the above that p(1) = p(0)P .
Using independence of transitions from step to step, and the mutual exclusiveness of dierent possible paths we establish the general key result: Theorem 8.2 If a Markov chain has transition matrix P and is in states with probability vector p(t) at time t, then 1 time step later its probability vector is p(t + 1) = p(t)P . Proof: Consider the following schematic general but partial state transition diagram:
Z P1j Z Z 2 Z p2 (t) PP P Z pj (t + 1) . ~ Z PPPZ . 2j P q P . : j > i ij pi (t) P . . . Psj
p1 (t)
ps (t)

8.3. The transition matrix
179
The system arrives to be in some state j at time t+1 by s mutually exclusive possibilities depending upon the state of the system at time t:
s
pj (t + 1) =
i=1 s
Pr {make state i to j transition} Pr {in state i} Pr {Xt+1 = j | Xt = i}

i=1 s
= =
i=1
pi (t)Pij
by their denition
= jth element of p(t)P . Hence putting these elements together: p(t + 1) = p(t)P . Note that the future behaviour of the system (for example, the states of the weather) only depends on the current state and not on how it entered this state. Given the transition matrix P , knowledge of the current state occupied by the process is sucient to completely describe the future probabilistic behaviour of the process. This lack of memory of earlier history may be viewed as an extreme limitation. However, this is not so. As the next example shows, we can build into the current state such a memory. The trick is widely applicable and creates a powerful modelling mechanism. Example 8.4: Remembering yesterdays weather. Assume that tomorrows weather depends on the weather condition during the last two days as follows:
if the last two days have been sunny, then 95% of the time tomorrow will be sunny; if yesterday was cloudy and today is sunny, then 70% of the time tomorrows will be sunny; if yesterday was sunny and today is cloudy, then 60% of the time tomorrows will be cloudy; if the last two days have been cloudy, then 80% of the time tomorrow will be cloudy.
Using this information model the weather as a Markov chain, draw the state transition diagram and write down its transition matrix. If tomorrows weather depends on the weather conditions during the last three days, how many states would be needed to model the weather as a Markov chain?
180
Module 8. Markov chains Solution: Since each day is classied as either sunny (S) or cloudy (C) then we have 4 states: SS, SC, CS, and CC. In these labels the rst letter denotes what the weather was yesterday and the second letter denotes todays weather. For example, the second rule above says that if today we are in state CS (that yesterday was cloudy and today is sunny), then with probability 70% tomorrow will be in state SS because tomorrow will be sunny, the second S, and tomorrows yesterday, namely today, was sunny, the rst S. The state transition diagram is:
SC
0.05 0.6 SS 0.95 0.3
0.4
CC
0.8
0.2 0.7 CS
The transition matrix is thus SS SC CS CC SS 0.95 0.0 0.70 0.0 SC 0.05 0.0 0.30 0.0 CS 0.0 0.40 0.0 0.20 CC 0.0 0.60 0.0 0.80
P =
If tomorrows weather depends on the weather conditions during the last 3 days, then 23 = 8 states are needed: SSS, SSC, SCS, SCS, SCC, CCC, CCS, CSC, and CSS.
See that you may write down the states of a Markov chain in any order that you please. But once you have decided on an ordering, you must stick to that ordering throughout the analysis. In the above example, the labels for both the rows and the columns of the transition matrix must be, and are, in the same order, namely SS, SC, CS, and CC. In applying Markov chains, there need not be a natural order for the states, and so you will have to decide and x upon one.
8.4. Forecast the future with powers of the transition matrix
181
8.4
Forecast the future with powers of the transition matrix
Using independence of transitions from step to step, and the mutual exclusiveness of dierent possible paths: Theorem 8.3 If the process is in states with probability vector p(t) at time t then n steps later its probability vector is p(t + n) = p(t)P n . Example 8.5: In the weather Example 8.3 we saw that p(1) = p(0)P and p(2) = p(1)P so that p(2) = p(1)P = p(0)P P = p(0)P 2 . Thus the forecast 2 days later is P 2 times the current probability vector.
Proof: It is certainly true for the n = 1 case: p(t + 1) = p(t)P by Theorem 8.2. For the case n = 2: p(t + 2) = p(t + 1)P = p(t)P P = p(t)P . For the case n = 3: p(t + 3) = p(t + 2)P = p(t)P P = p(t)P . For the case n = 4: p(t + 4) = p(t + 3)P = p(t)P P = p(t)P . And so on (formally by induction) for the general case. Given a Markov chain with transition probability matrix P , if the chain is in state i at time t, we might be interested to know the probability that n periods later the chain will be in a state j. Since we are dealing with a stationary Markov chain, this probability will be independent of t.
4 3 3 2 2
by Theorem 8.2 by Theorem 8.2 again
by Theorem 8.2 by n = 2 case
by Theorem 8.2 by n = 3 case
182
Corollary 8.4 The (i, j)th element of P n gives the probability of starting from state i and being in state j precisely n steps later. Proof: Being in state i at time t corresponds to p(t) being zero except for the ith element which is one, then the right-hand side of p(t + n) = p(t)P n shows p(t + n) must be just the ith row of P n . Thus the corollary follows. Example 8.6: Assume that the population movement of people between city and country is modelled as a Markov chain with transition matrix P = where:
P11 = 0.9, is the probability that a person currently living in the city will remain in the city after one transition (year) P12 = 0.1, is the probability that a person currently living in the city will move to country after one transition (year) P21 = 0.2, is the probability that a person currently living in the country will move to the city after one transition (year); and P22 = 0.8, is the probability that a person currently living in the country will remain in the country after one transition (year).
P11 P12 P21 P22
If a person is currently living in the city what is the probability that this person will be living in the country 2 years from now? If 75% of the population is currently living in the city and 25% in the country, what is the population distribution after 1, 2, 3 and 10 years from now. Solution: To answer the rst question we determine element (1, 2) of the matrix P 2 . P2 = 0.9 0.1 0.2 0.8 0.9 0.1 0.2 0.8 = 0.83 0.17 0.34 0.66 .
Hence [P 2 ]12 = 0.17. This means that the probability that a city person will live in the country after 2 transitions (years) is 17%. To nd the population distribution after 1, 2, 3 and 10 years given that the initial distribution is p(0) = 0.75 0.25 we perform the following calculations.
8.4. Forecast the future with powers of the transition matrix After 1 year the distribution is: p(1) = 0.75 0.25 0.9 0.1 0.2 0.8 = 0.725 0.275 .
183
Use this result to nd the population distribution after 2 years: p(2) = 0.725 0.275 0.9 0.1 0.2 0.8 = 0.7075 0.2925 .
And after 3 years: p(3) = 0.7075 0.2925 0.9 0.1 0.2 0.8 = 0.6952 0.3048 .
We continue with this process to obtain the population distribution after 9 years and 10 years: p(9) = 0.6700 0.3300 , p(10) = 0.6690 0.3310 .
Notice that after many transitions the population distribution tends to settle down to a steady state distribution. The above calculations can be also performed as follows: p(n) = p(0)P n . Hence to calculate p(10) we multiply the initial population distribution by P 10 = p(10) = 0.75 0.25 0.6761 0.3239 0.6478 0.3522 0.6761 0.3239 0.6478 0.3522 = . 0.6690 0.3310 .
which is the same result as before. For large n notice that P n also approaches a steady state with identical rows. For example,
n
lim P n =
0.6667 0.3333 0.6667 0.3333
The probabilities in each row represent the population distribution in the steady state. This distribution is independent of the initial conditions. For example if a fraction x (0 x 1) of the population initially lived in the city and a fraction (1x) in the country, in the steady state situation we will nd 66.67 percent living in the city and 33.33 percent living in the country regardless of the value of x. This is veried by computing p() = x 1x 0.6667 0.3333 0.6667 0.3333 = 0.6667 0.3333 .
184
8.5
Classication of nite Markov chains
There are biological situations with intriguing non-ergodic eects.
The long term behaviour of Markov chains depend on the general structure of the transition matrix. For some transition matrices the chain will settle down to a steady state condition which is independent of the initial state. In this subsection we identify the characteristics of a Markov chain that will ensure a steady state exists. In order to do this we must classify a Markov chain according to the structure of its transition diagram and matrix. The critical property we need for a steady state is that the Markov chain is ergodicyou may meet this term in completely dierent contexts, such as in uid turbulence, but the meaning is essentially the same: here it means that the probabilities get mixed up enough to ensure there are no long time correlations in behaviour and hence a steady state will appear. Further, an ergodic system is one in which time averages, such as might be obtained from an experiment, are identical to ensemble averages, averages over many realisations, which is what we often want to discuss and report in applications. Consider the following transition matrix 1 2 3 4 5 1 0.3 0.9 0 0 0 2 0.7 0.1 0 0 0 3 0 0 0.2 0.5 0 4 0 0 0.8 0.3 0.4 5 0 0 0 0.2 0.6
P =
Always draw such a state transition diagram for your Markov chains.
This matrix is depicted by the following state transition diagram. Each node represents a state and the labels on the arrows represent the transition probability Pij .
0.7 1 0.3 0.9 2 0.1 3
0.8 0.2 0.5 4
0.2 0.3 0.4 5 0.6
The following properties refer to this particular Markov chain as a rst example.
Given two states i and j a path from i to j is a sequence of transitions that begins in i and ends in j such that each transition in the sequence
8.5. Classication of nite Markov chains
185
has a positive probability of occurrence: thus [P n ]ij > 0 for an n-step path. For example, see that there are paths from 1 to 2, from 1 to 1, from 3 to 5, but not from 3 to 1.
A state j is accessible from i if there is a path leading from i to j after one or more transitions.
For example, state 5 is accessible from state 3 but state 5 is not accessible from states 1 nor 2.
Two states i and j communicate with each other if j is accessible from i and i is accessible from j. If state i communicates with j and with k, then j also communicates with k. Therefore, all states that communicate with i also communicate with each other.2
For example, states 1 and 2 communicate with each other. Similarly states 3 and 5 communicate, but states 1 and 5 do not.
A set of states S in a Markov chain is a closed set if no state outside of S is accessible from any state in S.
For example, S1 = {1, 2} and S2 = {3, 4, 5} are both closed sets.

A state i is an absorbing state if Pii = 1. Once we enter such an absorbing state, we never leave that state because with probability 1 we can only make the transition from i to i, there is no spare probability to go elsewhere.
There are no absorbing states in the above example. However, in many models of biological populations, the population going extinct is an absorbing state because with zero females the species cannot breed and so remains extinct forever.
A state i is a transient state if a state j exists that is accessible from i, but the state i is not accessible from j. If a state is not transient it is called a recurrent state. After a large number of transitions the probability of being in a transient state is zero.
There are no transient states in the above example. States 1 and 2 in the Markov chain with the following state transition diagram are transient, states 3 and 4 are recurrent:
2 For those who did Discrete Mathematics for Computing: communication is an equivalence relation.
186
A recurrent state i is cyclic (periodic) with period d > 1 if the system can never return to state i except after a multiple of d steps. (Thus d is the greatest common divisor, over all possibilities, of the number of transitions, n, for the process to move from state i back to state i: d = gcd{n | [P n ]ii > 0}.) A state that is not cyclic is called aperiodic.
In the earlier example, all states are aperiodic because from each state the system can revisit that state after any integer number of steps. However, for the example immediately above, the two recurrent states 3 and 4 are cyclic with period d = 2 as, for example, state 3 can only be returned to after a multiple of d = 2 steps, similarly for state 4.
If all states in a chain are recurrent, aperiodic and communicate with each other, the chain is ergodic.
The above examples are not ergodic because not all states communicate with each other. See in the earlier example that no unique steady state exists because if the system starts in states 1 or 2 it must stay in those states forever, whereas if its starts in states 35 then it stays in those states forever: the long time behaviour is quite dierent depending upon which case occurs, and thus there is no unique steady state. Example 8.7: Determine which matrices is ergodic. 0 0 0.5 0 0 0.4 P1 = 0.1 0.9 0 0.4 0.6 0 of the chains with the following transition 0.5 0.6 , 0 0
0.2 0.4 0.4 P2 = 0.1 0.2 0.7 . 0.3 0.3 0.4
Solution: Draw a state transition diagram for each, then the following observations easily follow. The states in P1 communicate with each other. However, if the process is in state 1 or 2 it will always move to either state 3 or 4 in the next transition. Similarly if the process is in
8.6. Limiting state (steady state) probabilities
187
state 3 or 4 it will move back to state 1 or 2. All states in such a chain are cyclic with period d = 2. This chain is not ergodic. All states in P2 communicate with each other. The states are recurrent and aperiodic. Therefore P2 is an ergodic chain.
8.6
Limiting state (steady state) probabilities
Theorem 8.5 Let P be the the transition matrix of an s-state ergodic Markov chain, then a vector = 1 2 . . . s exists such that 1 2 s 1 2 s (8.3) lim P n = . . .. . . . . . n . . . . 1 2 s The common row vector represents the limiting state probability distribution or the steady state probability distribution that the process approaches regardless of the initial state. When the above limit occurs, then following any initial condition p(0) the probability vector after a large number n of transitions is p(n) = p(0)P n . To show this last step, consider just the rst element, p1 (n), of the probability vector p(n). It is computed as p(0) times the rst column of P n , but T hence P n to 1 1 1 . p1 (n) p(0) . . 1 . 1 p(0) . . 1 1 1
= =
as the sum of the elements in p(0) have to be 1. Similarly for all the other elements in p(n). How do we nd these limiting state probabilities ? For a given chain with transitions matrix P we have observed that as the number of transitions n increases p(n)
188
But we know p(n + 1) = p(n)P and so taking the limit as n : = P . (8.4)
The limiting steady state probabilities are therefore the solution of the system of linear equations such that the row sum of is 1:
s
j = 1 .
j=1
(8.5)
Unfortunately, with the above condition we have s + 1 linear equations in s unknowns. To solve for the unknowns we may replace any one of the s linear equations obtained from (8.4) with s j = 1. j=1 Example 8.8: To illustrate how to solve the steady state probabilities consider the transition matrix, 0.7 0.2 0.1 0 0.3 0.4 0.2 0.1 . P = 0 0.3 0.4 0.3 0 0 0.3 0.7 Solving = P we have 1 2 3 4 = 1 2 3 4 0.7 0.2 0.1 0 0.3 0.4 0.2 0.1 0 0.3 0.4 0.3 , 0 0 0.3 0.7
or 1 2 3 4 together with 1 + 2 + 3 + 4 = 1 . (8.6) Discarding any of the rst four equations and solving the remaining equations we nd the steady state probabilities: =
3 15 3 15 4 15 5 15
= = = =
0.71 + 0.32 + 03 + 04 , 0.21 + 0.42 + 0.33 + 04 , 0.11 + 0.22 + 0.43 + 0.34 , 01 + 0.12 + 0.33 + 0.74 ,
The steady state probabilities can be found by rst noting that = P can be written as (I P ) = 0,
8.6. Limiting state (steady state) probabilities
189
where I is an identity matrix of appropriate size (and remembering that the order of multiplication s important in matrix multiplication). This equation is of the form xA = b. To turn it into the more familiar form Ax = b, transpose both sides: (I P )T T = 0 (since (AB)T = B T AT ). Now, this system has four equation, only three of which are necessary. One row (say the last row) can be replaced with the equation = 1 1 1 1
(that is, Equation (8.6)). This can all be done in ralbeit with some eort. > + + > + > > > > + > > data <- c(0.7, 0.2, 0.1, 0, 0.3, 0.4, 0.2, 0.1, 0, 0.3, 0.4, 0.3, 0, 0, 0.3, 0.7) P <- matrix(data, nrow = 4, ncol = 4, byrow = T) eye <- diag(4) tIP <- t(eye - P) tIP[4, ] <- c(1, 1, 1, 1) rhs <- matrix(c(0, 0, 0, 1), nrow = 4, ncol = 1) steady.state <- solve(tIP, rhs) steady.state [,1] 0.2000000 0.2000000 0.2666667 0.3333333
[1,] [2,] [3,] [4,]
Of course, in R it can be easier just to raise the transition matrix to a large power: > > > > > > P2 <- P %*% P P4 <- P %*% P %*% P %*% P P16 <- P4 %*% P4 %*% P4 %*% P4 P64 <- P16 %*% P16 %*% P16 %*% P16 P256 <- P64 %*% P64 %*% P64 %*% P64 P256
190 [,1] [,2] [,3] [,4] 0.2 0.2 0.2666667 0.3333333 0.2 0.2 0.2666667 0.3333333 0.2 0.2 0.2666667 0.3333333 0.2 0.2 0.2666667 0.3333333
[1,] [2,] [3,] [4,]
The answers are the same.
8.6.1
Share of the market model
One application area of the Markov chains is in brand switching or share of the market models. Suppose NoFrill Airlines (nfa) is competing for the market share of domestic passengers with the other two major carriers, KangaRoo Airways (kra) and emu Airlines. The major airlines have commissioned a survey to determine the likely impact of the newcomer on their market share. The results of a random survey have revealed the following information:
40% of passengers currently y with kra; 50% of passengers currently y with emu; 10% of passengers currently y with nfa.
The survey results also showed that:

80% of the passengers who currently y with kra will y with kra next time, 15% will switch to emu and the remaining 5% will switch to nfa; 90% of the passengers who currently y with emu will y with emu next time, 6% will switch to kra and the remaining 4% will switch to nfa; 90% of the passengers who currently y with nfa will y with nfa next time, 4% will switch to kra and the remaining 6% will switch to emu Airlines.
The preference pattern of passengers is here with the following transitions matrix: 0.8 0.15 kra P = emu 0.06 0.90 nfa 0.04 0.06
modelled as a Markov chain 0.05 0.04 . 0.90
8.7. Exercises We also have the initial market share p(0) = 0.4 0.5 0.1 .
191
To determine the long term market share for each airline we nd the steady state probabilities of the transition matrix. Solving = P and replacing any of the equations with 1 + 2 + 3 = 1 we get = 0.2077 0.4918 0.3005 .
Therefore, in the long term, the market share of the kra Airlines would drop from an initial 40% to 20.77%, the market share for the emu Airlines will remain steady and the nfa airline would increase their market share from 10% to 30.05%. Note that the future market share for each airline only depends on the transition matrix and not on the initial market share. The management of the kra could launch an advertising campaign to regain some of the 20% of their customers who are switching to the other two airlines.
8.7
Exercises
Ex. 8.9: The SOI is a well known climatological indicator for eastern Australia. Stone and Auliciems [42] developed SOI phases in which the average monthly SOI is allocated to one of ve phases correspond to the SOI falling rapidly (phase 1), staying consistently negative (phase 2), staying consistently near zero (phase 3), staying consistently positive (phase 4), and rising rapidly (phase 5). The transition matrix, based on data February 2002 is 0.668 0.000 0.081 0.000 0.683 0.125 P = 0.354 0.000 0.063 0.000 0.387 0.204 0.036 0.026 0.132 collected from July 1877 to 0.154 0.062 0.370 0.102 0.276 0.101 0.130 0.212 0.303 0.529 .
(a) Draw a transition diagram for the SOI phases. (b) Determine if the Markov chain is ergodic. (c) Determine the steady state probabiities. Ex. 8.10: Draw the state transition diagram for the Markov chain given by P = 1/3 2/3 1/4 3/4 .
192
Ex. 8.11: Draw the state transition diagram and hence determine if the following Markov chain is ergodic. Also determine the recurrent, transient and absorbing states of the chain. 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 P = 1 1 1 4 4 0 2 0 0 1 0 0 0 0 0 0
1 3
0 0 0
2 3
Ex. 8.12: The daily rainfall in Melbourne has been recorded from 1981 to 1990. The data is contained in the le melbrain.dat, and is from Hyndman [5] (and originally from the Australian Bureau of Meteorology). A large number of days recorded no rainfall at all. The following transition matrix shows the transition matrix for the two states Rain and No rain: 0.721 0.279 P = . 0.440 0.560 (a) Draw a transition diagram from the matrix P . (b) Use r to determine the steady state probabilities of days with rain in Melbourne. (c) Determine the probability of having a wet day two days after a ne day. Ex. 8.13: The daily rainfall in Melbourne has been recorded from 1981 to 1990, and was used in the previous exercise. In that exercise, two states (Rain (R) or No rain (N)) were used. Then, the state yesterday was used to deduce probabilities of the two states today. In this exercise, four states are used, taking into account the weather for the previous two days. There are four states RR, RN, NR, NN; the left-most state occurs earlier. (That is, RN means a rain-day followed by a day with no rain). The following transition matrix shows the transition matrix for the four states: 0.564 0.436 0 0 0 0 0.315 0.685 . P = 0.5554 0.445 0 0 0 0 0.265 0.735 (a) Draw a transition diagram from the matrix P .
8.7. Exercises
193
(b) Explain why eight entries in the transition matrix must be exactly zero. (c) Use r to determine the steady state probabilities of the four states for the data. (d) Determine the probability that two wet days will be followed by two dry days. Ex. 8.14: A computer laboratory has become notorious in service because of computer breakdowns. Data collected on its status every 15 minutes for about 12 hours (50 observations) is given below (1 indicates system up and 0 indicates system down.) 1110010011111110011110111 1111001111111110001101101 Assuming this process can be modelled as a Markov chain, estimate from the data the probabilities of the system being up or down each 15 minutes given it was up or down in the previous period, draw the state transition diagram and write down the transition matrix. Ex. 8.15: Suppose that if it has rained for the past three days, then it will rain today with probability 0.8; if it did not rain for any of the past three days, then it will rain today with probability 0.2; and in any other case the weather today will, with probability 0.6, be the same as the weather yesterday. Determine the transition matrix for this Markov chain. Ex. 8.16: Let {Xn | n = 0, 1, 2, ...} be a Markov chain with state space {1, 2, 3} and transition matrix 1 1 1 P =
2 2 3 3 5 4
0
2 5
4 1 3
Determine the following probabilities: (a) being in state 3 two steps after being in state 2; (b) Pr {X4 = 1 | X2 = 1} ; (c) p(2) given that p(0) = 1 0 0 ; (d) Pr {X2 = 3} given that Pr {X0 = 1} = Pr {X0 = 2} = Pr {X0 = 3} ; (e) Pr {X2 = 3 | X1 = 2 & X0 = 1} ; (f) Pr {X2 = 3 & X1 = 2 | X0 = 1} .
194
Ex. 8.17: Determine the limiting state probabilities for Markov chains with the following transition matrices. 0 1 0 0.2 0.4 0.4 0.5 0.5 P1 = P2 = 0 0 1 P3 = 0.5 0.2 0.3 0.7 0.3 0.4 0 0.6 0.3 0.4 0.3
Ex. 8.18: Two white and two black balls are distributed in two urns in such a way that each contains two balls. We say that the system is in state i, i = 0, 1, 2, if the rst urn contains i white balls. At each step, we randomly select one ball from each urn and place the ball drawn from the rst urn into the second, and conversely with the ball from the second urn. Let Xn denote the state of the system after nth step. Assuming that the process can be modelled as a Markov chain, draw the state transition diagram and determine the transition matrix. Ex. 8.19: A company has two machines. During any day each machine that is working at the beginning of the day has a 1/3 chance of breaking down. If a machine breaks down during the day, it is sent to repair facility and will be working 3 days after it breaks down. (i.e. if a machine breaks down during day 3, it will be working at the beginning of day 6). Letting the state of the system be the number of machines working at the beginning of the day, draw a state transition diagram and formulate a transition probability matrix for this situation. Ex. 8.20: The State Water Authority plans to build a reservoir for ood mitigation and irrigation purposes on the Macintyre river. The proposed maximum capacity of the reservoir is 4 million cubic metres. The weekly ow of the river can be approximated by the following discrete probability distribution: weekly inow (106 m3 ) probability 2 0.3 3 0.4 4 0.2 5 0.1
Irrigation demand is 2 million cubic metres per week. Environmental demand is 1 million cubic metres per week. Minimum storage requirement is 1 million cubic metres. Any demand shortage is at the expense of irrigation. Excess inow would be released over the spillway. Assume that the irrigation water may be supplied after the inow arrives. Before proceeding with the construction, the Water Authority wishes to have some idea of the behaviour of the reservoir. (a) Model the system as a Markov chain and determine the steady state probabilities. State any assumptions you make.
8.7. Exercises
195
(b) Explain the steady state probabilities in the context of this question. Ex. 8.21: Past records indicate that the survival function for light bulbs of trac lights has the following pattern: Age of Bulbs in months Number surviving to age n 0 100 1 85 2 60 3 0
(a) If each light bulb is replaced after failure, draw a state transition diagram and nd the transition matrix associated with this process. Assume that a replacement during the month is equivalent to a replacement at the end of the month. (b) Determine the steady state probabilities. (c) If an intersection has 20 bulbs, how many bulbs fail on average per month? (d) If an individual replacement has a cost of $15, what is the longrun average cost per month ? Ex. 8.22: A machine in continuous service requires frequent adjustment to ensure quality output. If it gets out of adjustment, on average $600 of defective parts are made before it can be corrected. Adjustment costs $200 in labour and downtime. Data collected on the operation of the machine is summarised below: Time since adjustment (hours) 1 2 3 4 Probability of defective production 0.00 0.20 0.50 1.00
In answering the following questions make suitable assumptions where appropriate. (a) If the machine is adjusted only when defective production occurs, nd the transition matrix associated with this process. (b) Determine the steady state probabilities. What is the long run mean hourly cost of this policy? (c) Suppose a policy of readjustment when needed or after three hours of running time (whichever comes rst) is introduced. What is the long run mean cost of this policy? Ex. 8.23: The following exercises involve computer work in R.
196
Module 8. Markov chains (a) The weather can be classied as Sunny (S) or Cloudy (C). Consider the previous two days classied in this manner; then there are four states: SS, SC, CS and CC. The transition matrix, entered in R, is pp <- matrix( nrow=4, ncol=4, byrow=TRUE, data=c(.9, .1, 0, 0, 0, 0, .4, .6, .7, .3, 0, 0, 0, 0, .2, .8) ) (You can read ?matrix for assistance.) Verify this could be a valid transition matrix by computing the row sums rowSums(pp) (the row sums ) are all one. (See ?rowSums.) Suppose today is the second of two sunny days in a row, state SS, that is p0 = 1 0 0 0 . Enter this state into r by typing pie <- c(1,0,0,0), then compute the probabilities of being in various states tomorrow as pie <- pie %*% pp. (See ?"%*%" for help here (quotes necessary). Note that the operator * does an element-by-element multiplication; the command %*% is used for matrix multiplication in r.) Why is Pr {cloudy tomorrow} = 0.1? Evaluate pie <- pie %*% pp again to compute the probabilities for two days time. Why is Pr {cloudy in 2 days} = 0.15? What is the probability of being sunny in 3 days time? (b) Keep applying pie <- pie %*% pp iteratively and see that the predicted probabilities recognisably converge in about 1020 days to = 0.58 0.08 0.08 0.25 . These are the long-term probabilities of the various states. Compute P 10 , P 20 and P 30 and see that the rows of powers of the transition matrix also converge to the same probabilities. (c) So far we have only addressed patterns of probabilities. Sometimes we run simulations to see how the Markov chain may actually evolve. That is, we need to generate a sequence of states according to the probabilities of transitions. For this weather example, if we start in zeroth state SS we need to generate for the rst state either SS with probability 0.9 or SC with probability 0.1. Suppose it was SC, then for the second state we need generate either CS with probability 0.4 and CC with probability 0.6. How is this done? Sampling from general probability distributions is done using the cumulative probability distribution (cdf) and rand. For example, if we are in state SS, i = 1, then the cdf for the choice of next state is .9 1 1 1 obtained in r by cumsum( pp[1,] ). Thus in general the next state j is found from the current state i by for example
8.7. Exercises
197 j <- sample( c(1,2,3,4), prob=pp[1,] , size=1)
Run a simulation by wrapping this in a loop such as i<- 1 # Initial state for (t in (1:99)) { i <- sample( c(1,2,3,4), prob=pp[ i,] , size=1) cat(i) # prints the value of i } cat("\n") # Ends the line Now save the history of states by executing i <- array( dim=200) # Set up an array i[1] <- 1 # Initial state for (t in (1:199)) { i[t+1] <- sample( c(1,2,3,4), prob=pp[ i[t],] , size=1) } Use hist(i, freq=FALSE, breaks=c(0,1,2,3,4) ) to draw a histogram and verify the long-term histogram is reasonably close to that predicted by theory. (The proportions are found by table(i)/sum(table(i)); compare to pie.) Ex. 8.24: Packets of information sent via modems down a noisy telephone line often fail. For example, suppose in 31 attempts we nd that packets are sent successfully or fail with the following pattern, where 1 denotes success and 0 failure: 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1 . There seem to be runs of success and runs of failure so we guess that these may not be each independent of each other (failure in communication networks are indeed generally correlated). Thus we try to model as a Markov chain. Suppose the probability of success of the next packet only depends upon chance and the success or failure of the current attempt. Argue from the data that the transition matrix should be P 1/2 1/2 1/3 2/3 .
Given this model, what is the long-term probability of success for each packet? Suppose the probability of success of packet transmission depends upon chance and the success or failure of the previous two attempts.
198
Module 8. Markov chains Write down and interpret the four states of this Markov chain model. Use the data to estimate the transition probabilities, then form them into a 44 transition matrix P . Compute using Matlab a high power of P to see that the long-term distribution of states is approximately = .4 .2 .2 .2 and hence deduce this model would predict a slightly higher overall success rate.
Ex. 8.25: Let a Markov chain with the state space S = {0, 1, 2} be such that:
from state 0 the particle jumps to states 1 or 2 with equal probability 1/2; from state 2 the particle must next jump to state 1; state 1 is absorbing (that is, once the particle enters state 1, it cannot leave.
Draw the transition diagram and write down the transition matrix. Ex. 8.26: For a Markov chain with the 0 0.8 P = 0.7 transition matrix 0.1 0.9 0 0.2 , 0.3 0
draw the transition diagram and nd the probability that the particle will be in state 1 after three jumps given it started in state 1. Ex. 8.27: (sampling problem) Let X be a Markov Chain. Show that the sequence Yn = X2n , n 0 is a Markov chain (such chains are called imbedded in X). Ex. 8.28: (lumping states together) Let X be a Markov chain. Show that Yn = |Xn | , n 0 is not necessarily a Markov chain. Ex. 8.29: Classify the states of the following Markov chains and determine whether they are absorbing, transient or recurrent: 0 1/2 1/2 P1 = 1/2 0 1/2 ; 1/2 1/2 0 0 0 0 0 0 0 P2 = 1/2 1/2 0 0 0 1 1 1 ; 0 0
8.7. Exercises 1/2 1/4 P3 = 1/2 0 0 1/4 1/2 P4 = 0 0 1 0 1/2 0 0 1/2 1/4 0 0 0 1/2 0 0 ; 0 0 1/2 1/2 0 0 1/2 1/2 3/4 0 0 0 1/2 0 0 0 0 1 0 0 . 0 1/3 2/3 0 0 0 0 0
199
Ex. 8.30: Classify the states of the Markov chains with the following transition probability matrices: 0 1/2 1/2 P1 = 1/2 0 1/2 ; 1/2 1/2 0 0 0 1/2 1/2 1 0 0 0 ; P2 = 0 1 0 0 0 1 0 0 1/2 1/2 0 0 0 1/2 1/2 0 0 0 0 1/2 1/2 0 . P3 = 0 0 0 1/2 1/2 0 1/4 1/4 0 0 1/2 Ex. 8.31: Consider the Markov chain consisting of the four states and having transition probability matrix 0 0 1/2 1/2 1 0 0 0 . P = 0 1 0 0 0 1 0 0 Which states are recurrent? Ex. 8.32: Let a Markov chain be 1/2 1/2 P = 0 0 1/4 dened by the matrix 1/2 0 0 0 1/2 0 0 0 0 1/2 1/2 0 . 0 1/2 1/2 0 1/4 0 0 1/2
What can you say about its decomposability into disjoint Markov chains and the transient and recurrent nature of its states?
200
Ex. 8.33: (A Communications system) Consider a communications system which transmits the digit 0 and 1. Each digit transmitted must pass through several stages, at each of which there is a probability p that the digit entered will be unchanged when it leaves. Letting Xn denote the digit entering the nth stage, dene its transmission probability matrix. Show by induction that Pn =
1 2 1 2
+ 1 (2p 1)n 2 1 (2p 1)n 2
1 2 1 2
1 2 (2p 1)n 1 + 2 (2p 1)n
Ex. 8.34: Suppose that coin 1 has probability 0.7 of coming up heads, and coin 2 has probability 0.6 of coming up heads. If the coin ipped today comes up heads, then we select coin 1 to ip tomorrow, and if it comes up tails, then we select coin 2 to ip tomorrow. If the coin initially ipped is equally likely to be coin 1 or coin 2, then what is the probability that the coin ipped on the third day after the initial ip is coin 1? Ex. 8.35: For a series of dependent trials the probability of success on any trial is (k + 1)/(k + 2) where k is equal to the number of successes on the previous two trials. Compute
n
lim Pr {success on the nth trial} .
Ex. 8.36: An organisation has N employees where N is a large number. Each employee has one of three possible job classications and changes classications (independently) according to a Markov chain with transition probabilities 0.7 0.2 0.1 P = 0.2 0.6 0.2 . 0.1 0.4 0.5 What percentage of employees are in each classication in the long run?
8.7.1
8.9 The chain is ergodic, and the steady state probabilities are (to three decimal places) [0.165, 0.247, 0.126, 0.183, 0.278]. 8.11 States 1, 2, 3, 5 and 6 are recurrent. State 4 is transient. S1 = {1, 3, 5} and S2 = {2, 6} are two closed sets. Since states 4 and 1 do not communicate the chain is not ergodic.
8.7. Exercises 8.14 P = 0 1
201
6 14 8 35
8 14 27 35
8.15 The process may be modelled as an 8 state Markov chain with states {[111], [112], [121], [122], [211], [212], [221], [222]} where 1 indicates no rain, 2 indicates rain and a triple [abc] indicates the weather was a the day before yesterday, b yesterday and c today. 0.8 0.2 0 0 0 0 0 0 [111] 0 0.4 0.6 0 0 0 0 [112] 0 0 0 0 0 0.6 0.4 0 0 [121] 0 0 0 0 0 0.4 0.6 [122] 0 P = 0 0 0 0 0 [211] 0.6 0.4 0 0 0.4 0.6 0 0 0 0 [212] 0 0 0 0 0.6 0.4 0 0 [221] 0 [222] 8.16 0 0 P2 = (a) [P 2 ]23 = 1/6 (b) Pr {X4 = 1 | X2 = 1} = 17/30 (c) p(2) = p(0)P 2 = (d) p(0) = [1/3 79/360 1/3 17/30 9/40 5/24 1/3] therefore Pr {X2 = 3} = p(0)P 2
3
0
17 30 16 30 17 30
0
9 40 9 30 3 20 5 24 1 6 17 60
0 .
0.2 0.8
(e) Pr {X2 = 3 | X1 = 2 & X0 = 1} = Pr {X2 = 3 | X1 = 2} = 1/3 (f) Pr {X2 = 3 & X1 = 2 | X0 = 1} = Pr {X2 = 3 | X1 = 2}Pr {X1 = 2 | X0 = 1} = 1/3 1/4 = 1/12 8.17 (a) = (b) = (c) = 8.18
7 12 2 9 1 3 5 12 2 9 1 3 5 9 1 3
P =
0
1 4
1
1 2
0
1 4
8.19 The process may be modelled as a 6 state Markov chain with the following states {[200], [101], [110], [020], [011], [002]}. The three numbers in the label for each state describes the number of machines currently working, in repair for 1 day and in repair for 2
202
Module 8. Markov chains days. For example, the state [020] means no machines are currently working and both machines were broken down yesterday and would be available again the day after tomorrow. If we are currently at state [020] then after one transition (day) the process will move to state [002]. Following this process we nd the transition matrix as [200] [101] [110] P = [020] [011] [002]
4 9 2 3
0 0 0 1
0 0
2 3
4 9 1 3
1 9
0 1 0
0 0 0 0
0 0 0 0 0
0 0 0 0 1 3 0 0 1 0 0 0 0
8.20 (a) The states are the volume of water in the reservoir, which although continuous are assumed to take discrete values {1, 2, 3, 4}. Hence the transition matrix is 0.7 0.2 0.1 0.0 0.3 0.4 0.2 0.1 P = 0.0 0.3 0.4 0.3 0.0 0.0 0.3 0.7 The steady state probabilities may be computed by solving = P where the elements of sum to one to give = 0.2 0.2 0.2667 0.3333 .
(b) The steady state probabilities represent the long term average probability of nding the reservoir in each state. For example in the long run we expect the reservoir will start, or end, with a volume of 1 million m3 , 20.5% of the time and a volume of 4 million m3 , 32.9% of the time. 8.21 (a) The states are the age of lights in months {0, 1, 2} then the transition matrix associated with this process is 0.15 0.85 0.0 P = 0.29 0.0 0.71 . 1.0 0.0 0.0 (b) = [0.407 0.346 0.246]
(c) Average number of failures per month = 0.4076 20 = 8.15 units (d) Long term average cost per month = $15 8.15 = $122.28
8.7. Exercises
203
8.22 (a) Let Xn = elapsed time in hours (at time n) since adjustment {0, 1, 2, 3}. Assume that adjustments occur on the hour only and that the time taken to service the machine is negligible. (alternative sets of assumptions are possible.) The transition probabilities can be found by converting the given table as follows. If the machine is adjusted 100 times, the number of these adjustments which we expect to survive are given by Time since adjustment (hours) 0 1 2 3 4 We then have P01 = Pr {Xn+1 = 1 | Xn = 0} = P12 = P23 = 80 = 0.8 100 50 = 0.625 80 a state 4 is not needed. Hence the 1 0 0 0 0.8 0 . 0 0 0.625 0 0 0 100 =1 100 Number surviving 100 100 80 50 0
Since none survive to age 4 required transition matrix is 0 0.2 P = 0.375 1 (b) =
10 33
10 33
8 33
5 33
In the long run breakdowns occur in a proportion 0 of the hours and each breakdown costs $800. Therefore mean cost per 10 hour = 33 800 = $242.42 . (c) Now Xn {0, 1, 2} and 0 1 0 P = 0.2 0 0.8 , 1 0 0 with steady state distribution. =
5 14 5 14 4 14
204
Module 8. Markov chains In the long run, the proportion of hours in which a breakdown occurs 5 4 5 5 0+ 0.2 + 0.375 = = 14 14 14 28 and each breakdown costs $800. The proportion of time that readjustment occurs without a breakdown = Pr {reaching age 3 and no breakdown occurs} 5 4 0.624 = = 14 28 and each readjustment alone costs $200. Hence, the long term cost per hour of this policy is = 5 5 800 + 200 = $178.57 28 28
8.34 Model this as a Markov chain with two states: C1 means that coin 1 is to be tossed; C2 means that coin 2 is to be tossed. From the question the state transition diagram is Pr(T)=0.3 0.7 C1 C2 0.4 Pr(H)=0.6

From this diagram the transition matrix is read o to be P = 0.7 0.3 0.6 0.4 0.5 0.5 the predictions for
whence starting from the state 0 = the states after three tosses is 3 = 0 P 3 =
0.6665 0.3335
Thus the probability of tossing coin 1 on the third day is 0.6665 .
Module
9
. . . . . . . . . . . . . . . models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206 206 206 207 207 208 208 211 211 212
Other Models
Module contents
9.1 9.2 9.3 9.4 9.5 9.6 9.7 9.8 9.9 9.10 Introduction . . . . . . . . . . . . Using other models . . . . . . . . Seasonally adjusted models . . . Regime-dependent autoregressive Neural networks . . . . . . . . . . Trend-regression models . . . . . Multivariate time series models . Forecasting by means of a model Finding similar past patterns . . Singular spectrum analysis . . . .
Module objectives
understand there are numerous other types of models for modelling time series; name some other time series models used in climatology; explain one of the methods in more detail.
205
206
Module 9. Other Models
9.1
Introduction
In this part of the course, one particular type of time series methodology has been discussed: the BoxJenkins models, or arma type models. There are a large number of other possible models for time series however. In this Module, some of these models are briey discussed. You are required to know the details of only one of these models in particular, but should at least know the names and ideas behind the others. You dont need to understand all the details in this Module; but see Assignment 3.
9.2
Using other models
The time series models previously discussedarima( , m, o)dels and Markov chain modelsare reasonably simple. There are, however, many far more complicated models have not studied. In an attempt to compare numerous types of forecasting methods, Spyros Makridakis and Mich`le Hibon e conducted the M3-Competition (following the M- and M2-Competitions), which compared 24 time series methods (including the BoxJenkins approach adopted here plus many more complicated models) on 3003 dierent time series. This was one of the conclusion from the competition: Statistically sophisticated or complex methods do not necessarily produce more accurate forecasts than simpler ones. In particular, the method that unocially won the competition was the theta-method (see Assimakopoulos and Nikolopoulos [7]) which was shown later (see Hyndman and Billah [22]) to be simply exponential smoothing with a drift (or trend) component. Exponential smoothing was listed in Section 1.4 as a simple method. The lesson is clear: Just because methods appear clever, complicated or technical, simple methods are often the best. However, all methods have situation in which they perform well, and there are other methods worthy of consideration. Some of those are considered here.
9.3
Seasonally adjusted models

Activity 9.A: Read Chu & Katz [13] in the selected readings.
9.4. Regime-dependent autoregressive models
207
Table 9.1: The parameter estimates for an ar(3) model with seasonally varying parameters for modelling the seasonal SOI. Note the seasons refer to northern hemisphere seasons. SOI predictand Parameter Estimates 1 (t) 2 (t) 3 (t) Spring (t = 1) 0.5268 0.1158 0.2011 Summer (t = 2) 0.7832 0.2568 0.3816 Fall (t = 3) 0.8554 0.1700 0.0674 Winter (t = 4) 0.7736 0.1971 0.1808
Chu & Katz [13] discuss tting arma type models to the seasonal and monthly SOI using an arma model whose coecients change according to the season. They t a seasonally varying ar(3) model to the seasonal SOI, {Xt }, of the form Xt = 1 (t)Xt1 + 2 (t)Xt2 + 3 (t)Xt3 + et , with the parameters as shown in Table 9.1.
9.4
Regime-dependent autoregressive models

Activity 9.B: readings. Read Zwiers & von Storch [53] in the selected
Zwiers & von Storch [53] t a regime-dependent ar model (ram) to the SOI described by a stochastic dierential equation. (These models are also called Threshold Autoregressive Models by other authors, such as Tong [44].) In essence, the SOI is modelled using one of two indicators of the SOI (either the South Pacic Convergence Zone hypothesis, or the Indian Monsoon hypothesis, as explained in the article), and a seasonal indicator.
9.5
Neural networks
Neural networks consist of processing elements (called nodes) joined by weighted connections. The processing elements take as inputs the weighted sum of the output of the nodes connected to it. The input to the processing element is transformed (linearly or non-linearly) which is then the output
208
(and can be passed to other processing elements). Neural networks are said to be loosely based on the operation of the human brain (!). Maier & Dandy [31] t neural networks to daily salinity data at Murray Bridge, South Australia, as well as numerous BoxJenkins models. They conclude the BoxJenkins models produce better one-day ahead forecasts, while the neural networks produce better long term forecasts. Guiot & Tessier [18] use neural networks and ar(3) models to detect the eects of pollution of the widths of tree rings, and hence tree growth, from 1900 to 1983.
9.6
Trend-regression models
Visser & Molenaar [47] discuss a trend-regression model for modelling a time series {Yt } in the presence of k other variables {Xi,t } for i = 1, . . . , k. These models are of the form yt = t + 1,t X1,t + + k,t Xk,t + et where the stochastic trend t is described using an arima(p, d, q) process, and et is the noise term. These models are written as TR(p, d, q, k) models, where p, d and q are the usual parameters for an arima(p, d, q) model, and k is the number of explanatory variables. The authors state the trendregression models include most trend and regression models used in the literature. One particular model they t is for modelling annual mean surface air temperatures in the northern hemisphere from 1851 to 1990, {Tn }. They t a TR(0, 2, 0, 2) model using the Southern Oscillation Index (SOI) and the index of volcanic dust (VDI) on the northern hemisphere as covariates. The tted model is Tn = t 0.050SOIt 0.086VDIt + et where the trend t is described using an arima(0, 2, 0) model (parameters not given).
9.7
Multivariate time series models
In this course, only univariate time series have been discussed. It is possible, however, for two time series to be related to each other. In this case, there is a multivariate time series.
9.7. Multivariate time series models
209
SOI
30
10
10
30
1960
1970
1980 Time
1990
Sea Level Pressure Anomaly
2 0
1960
1970
1980 Time
1990
Figure 9.1: Two time series that might be expected to vary together: Top: the SOI; Bottom: the sea level air pressure anomaly at Easter Island.
Example 9.1: The SOI and the sea level air pressure anomaly at Easter Island might be expected to vary together, since the SOI is related to pressure anomalies at Darwin and Tahiti. The two are plotted together in Figure 9.1. In a similar way as the autocorrelation was measured, the cross correlation can be dened as: XY = E[(Xt X )(Ytk Y )], where X is the mean of the time series {Xt } and Y is the mean of the time series {Yt }, and k is again the lag. The cross correlation can be computed for various k. For this example, the plot of the cross correlation is shown in Figure 9.2. The cross correlation indicates there is a signicant correlation between
210
SOI & slpa
ACF
0.1 2
0.0
0.1
0.2
0.3
0 Lag
Figure 9.2: The cross correlation between the SOI and the sea level air pressure anomaly at Easter Island.
9.8. Forecasting by means of a model
211
the two series near a lag of zero. That is, when the SOI goes up, there is a strong chance the sea level air pressure anomaly at Easter Island will also go up at the same time.
9.8
Forecasting by means of a model
Forecasting by means of a model is common in meteorology and astronomy. The weather is routinely forecast by special groups in all developed countries. They use data from satellites and terrestrial weather stations as input to a uid dynamical model of the earths atmosphere. The model is simply projected forward by a type of numerical integration to produce the forecasts.
9.9
Finding similar past patterns
Suppose we have a time series {Xt }t0 and we wish to be able to forecast future values. We wish to identify an estimator of the next value of the time series, say Xt+1|t . One way of doing this is to search through the history of the time series and nd a time when the past k values of the time series have approximately occurred before. For example, suppose we wish to forecast tomorrows maximum daily temperature at Hervey Bay and wish to use the past ve days maximum temperatures to make this forecast. The strategy is to search through the available history of the maximum temperatures at Hervey Bay and nd a time is the past when ve maximum temperatures have been very similar to the maximum temperatures over the last ve days. Whatever the next days maximum temperature was in the past will be the prediction for tomorrow. How do we determine which ve past values are like the pattern we are currently observing? Call the current m values vector x. For any past series of m values, say vector y, one measure of the distance between these two vectors is dened by
m
d(x, y) =
k=1
(xk yk )2 .
where xk is the kth element of the vector x. (This is the most common way to dene distance between vectors.)
212
The choice of m needs to be made carefully after consideration of the time series in question. The idea here is simply this: we are trying to nd all the times in the past when things were similar to now.
9.10
Singular spectrum analysis
Singular spectrum analysis is a method which attempts to identify naturally occurring patterns in a time series. The time series formed by keeping the most important patterns, and removing the others (which are regarded as noise) potentially leaves a series which represents the underlying dynamics and is also easier to forecast.
Strand
II
Multivariate Statistics
213
214
Module
10
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216 216 217 217 217 217 221 221 222 223
Introduction
Module contents
10.1 10.2 10.3 10.4 10.5 10.6 10.7 10.8 10.9 Introduction . . . . . . . . . . . . . . . . . . . Multivariate data . . . . . . . . . . . . . . . . Preview of methods . . . . . . . . . . . . . . . Review of mathematical concepts . . . . . . . Software . . . . . . . . . . . . . . . . . . . . . . Displaying multivariate data . . . . . . . . . . Some hypothesis tests . . . . . . . . . . . . . . Further comments . . . . . . . . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . 10.9.1 Answers to selected Exercises . . . . . . . . .
Module objectives
recognize multivariate data; give some examples of multivariate data; list some type of multivariate statistical methods; appropriately display multivariate data.
215
216
10.1
Introduction
In this Module, some basic multivariate statistical techniques are introduced. The emphasis is on the application rather than the details and the theory; there is insucient time to delve too far into the theory. This Module is based on the textbook Multivariate Statistical Methods by Bryan F. J. Manly. This book includes numerous examples using real data sets, although most examples do not have a climatological avour. Some examples with such a avour are given in these notes. There are numerous books available about multivariate statistics, and many are available from the USQ library. You may nd other books useful to refer to during your study of this multivariate anlaysis component of this course. As a general comment, you will be expected to read the textbook to understand this Module. The Study Book will supplement these notes where necessary, provide extra example, and make notes about using the r software for performing the analyses.
10.2
Multivariate data
Activity 10.A: Read Manly, section 1.1.
Multivariate analysis is popular in many area of science, engineering and business; the examples give some avour of typical problems. Climatology is lled with examples of multivariate data. There are numerous climatological variables measured on a routine basis which can collectively be considered multivariate data. One of the most common sources of multivariate data are the Sea Surface Temperatures (SST). SSTs are measurements of the temperature of the oceans, measured at locations all around the world. In addition, multivariate data can be created from any univariate series since climatological variable are often time-dependent. The original data, with say n observations, can be designated as X1 . The series can then be shifted back t time steps to create a new variable X2 . Both variables can be adjusted to have a length of n t, when the variables could now be identied as X1 and X2 . The two variables (X1 , X2 ) can be considered multivariate data.
10.3. Preview of methods
217
10.3
Preview of methods
Activity 10.B: Read Manly, section 1.2.
This section introduces some dierent types of multivariate methods. Not all the methods will be discussed in this course, but it is useful to know the types of methods available.
10.4
Review of mathematical concepts

Activity 10.C: Briey read Manly, Chapter 2.
This Chapter contains material that should be revision for the most part. You may nd it useful to refer back to Chapter 2 throughout this course. Pay particular attention to sections 2.5 to 2.7 as many multivariate techniques use these concepts.
10.5
Software
The software package r will be used for this Part, as with the time series component. See Sect. 1.5.1 for more details. Most statistical programs will have multivariate analysis capabilities. For this part of the course, the r multivariate analysis library is needed; this should be part of the package that you install by default. To enable this package to be available to r, type library(mva) at the r prompt when r is started. For an idea of what functions are available in this library, type library(help=mva) at the r prompt.
10.6
Displaying multivariate data
With multivariate data, any plots will be of a multi-dimensional nature, and will therefore be dicult to display on a two-dimensional page. Plotting data is, of course, always useful for understanding the data and detecting possible problems in the data (outliers, errors, missing values, and so on). Some creative solutions have been developed for plotting multivariate data.
218
Module 10. Introduction Activity 10.D: Read Manly, Chapter 3. We will not discuss Andrews method.
Many of the plots discussed are available in the package S-Plus, a commercial package not unlike r. In the free software, r, however, some of these plots are not available (in particular, Cherno faces). The general consensus is that it would be a lot of work for a graphic that isnt that useful. One particular problem with Cherno faces is that the faces (and interpretations) can change dramatically depending on what variables are allocated to which dimensions of the face. However, the star plot is available using the function stars. The Draftsmans display is available just by plotting a multivariate dataset; see the following example. The prole plots are useful, but only when there are not too many variables or too many groups, otherwise the plots become too cluttered to be of any use. Example 10.1: Hand et al. [19, dataset 26] gives a number of measurements air pollution from 41 cities in the USA. The data consists of seven variables (plus the names of the cities), generally means from 1969 to 1971
SO2: The SO2 content in micrograms per cubic metre; temp: The average annual temperature in degrees F; manufac: The number of manufacturing enterprizes employing 20 or more workers; population: The population in thousands, in the 1970 census; wind.speed: The average annual wind speed in miles per hours; annual.precip: The average annual precipitation in inches; days.precip: The average number of days with precipitation each year.
The following code shows how to plot this multi-dimensional data in r. First, a Draftsmans display (this is an unusual term; it is often called a pairwise scatterplot): > > > > library(mva) us <- read.table("usair.dat", header = TRUE) plot(us[, 1:4]) pairs(us[, 1:4])
10.6. Displaying multivariate data
219
45
q q
55
65
75
q q
1000
2500 100 0 1000 2500 20

q q q q q q
SO2
q
q q q q
q q
q q q qq q q q qq q q q q q q q qq q q q q
q q q qq
75
q q q qq q q q q q q q q q q q q q q q q qq q q q q q q q q q q q q q q q q
q q q q q q
q q qq q q q q qq q q qq q q q q q q q q q q qqq q qq q q q q q q q
65
55
temp
q q
q qq qq qq q q q q q qq q q q q q q q qq q q q qq q q q q
q q
45
q q q q q qq q q qq q qq q qq q q q qq q qq q qq q q qq qqq q q q q q q q
manufac
q q qq q q q q q q q q qq q q qqq qq q q q qqq q qq q qq qq qq qq qq q q
q q q q qqq qqq q q q qq q q q q q qq q q q qq qq q
2500
q q q
q q q q qq q q q q q q q qq q q q qq q q q q q q q qq q q q q qq q q q q q q q q q q qq q qq q q qqq q qq q q q q qq q q q q q q q q
population
1000
q q q q q q q qq q qq q q q q qq q qq q q q qq qq q q qq q q
20
60
100
1000
2500
Figure 10.1: A multivariate plot of the US pollution dataset.
The plot is shown in Fig. 10.1. Star plots can also be produced: > stars(us[1:11, ], main = + flip.labels = FALSE, + 2)) > stars(us[1:11, ], main = + flip.labels = FALSE, + key.loc = c(7.8, 2)) "Pollution measures in 41 US cities", key.loc = c(7.8, "Pollution measures in 41 US cities", draw.segments = TRUE,
The input key.loc changes the location of the key that shows which variable is displayed where on the star; it was determined through trial and error. Alter the value of the input flip.labels to true (that is, set flip.labels=TRUE) to see what aect this has. The star plot discussed in the text is in Fig. 10.2. Only the stars for the rst eleven cities are shown so that the detail can be seen here. A variation of the star plot is given in Fig. 10.3, and is particularly instructive when seen in colour. From the star plots, can you nd any cities that look very similar? That look very dierent?
60
q q q q q q q q q qq q q q q qq qq qqq q q q q qq qqq q q q q
q q q q q q q q q qqq q q q qq q q q qqq q qq q q qq q q q q q
220

Pollution measures in 41 US cities
Phoenix
Little.Rock
San.Francicso
Denver
Hartford
Wilmington
Washington
Jacksonville
Miami manufac temp population SO2 wind.speed
Atlanta
Chicago annual.precip
days.precip
Figure 10.2: A star plot of the US pollution dataset.
Pollution measures in 41 US cities
Phoenix
Little.Rock
San.Francicso
Denver
Hartford
Wilmington
Washington
Jacksonville
Miami temp manufac SO2 population days.precip
Atlanta
Chicago
wind.speed annual.precip
Figure 10.3: A variation of the star plot of the US pollution dataset.
10.7. Some hypothesis tests
221
10.7
Some hypothesis tests

Activity 10.E: Read Manly, Chapter 4. We will not dwell on the details, but it is important that you understand the issues involved (especially section 4.4).
Currently, Hotellings T 2 test is not implemented in r.
10.8
Further comments
One diculty with multivariate data has already been discussed: it may be hard to display the data in a useful way. Because of this, it is often dicult to nd any outliers in multivariate data. Note that an observation may not appear as an outlier with regard to any particular variables, but it may have a strange combination of variables. Multivariate data also can present computational diculties. The mathematics involved in using multivariate techniques is usually matrix based, and so often very large matrices will be in use. This can create memory problems, particularly when matrix inversion is necessary. Many computational tricks and advanced methods are employed in standard software for performing the computations. Techniques such as singular value decomposition (SVD) are common. Indeed, dierent answers are often obtained in dierent software packages because dierent algorithms are used. The main multivariate techniques can be broadly divided into the following categories:
Data reduction techniques. These techniques reduce the dimension of the data at the expense of losing a small amount of information. A balance is made between reducing the dimension of the data and retaining as much information as possible. Techniques such as principal components analysis (PCA; see Module 11) and factor analysis (FA; see Module 12) are in this category. Classication techniques. These techniques attempt to classify data into a number of groups. Techniques such as cluster analysis (see Module 13) and discriminant analysis fall into this category.
Consider the data in Example 10.1. We may wish to reduce the number of variables from eight to two or three. If we could reduce the number of
222
Toowoomba weather by Decade
1890
1900
1910
1920
1930
1940
1950
1960
1970 maxt
mint 1980 1990
radn
Figure 10.4: A star plot of the Toowoomba weather data. variables to just one, this might be called a pollution index. This would be an example of data reduction. Data reduction works with the variables. However, we may wish to classify the 41 cities into a number of groups depending on their characteristics. We may be able to identify three group: high pollution, moderate pollution and low pollution categories. This is a classication problem. Classication works with the individuals.
10.9
Exercises
Ex. 10.2: The data set twdecade.dat contains (among other things) the average rainfall, maximum temperature and minimum temperature at Toowoomba for the decades 1890s to the 1990s. Produce a multivariate plot of the three variables by decade. Which decades appear similar? Ex. 10.3: The data set twdecade.dat contains the average rainfall, maximum temperature and minimum temperature at Toowoomba for the each month. It should be possible to see the seasonal pattern in temperatures and rainfall. Produce a multivariate plot that shows the features by month.
10.9. Exercises
223
Ex. 10.4: The data set emdecade.dat contains the average rainfall, maximum temperature and minimum temperature at Emerald for the decades 1890s to the 1990s. Produce a multivariate plot of the three variables by decade. Which decades appear similar? How similar are the patterns to those observed for Toowoomba? Ex. 10.5: The data set emdecade.dat contains the average rainfall, maximum temperature and minimum temperature at Emerald for the each month. It should be possible to see the seasonal pattern in temperatures and rainfall. Produce a multivariate plot that shows the features by month. How similar are the patterns to those observed for Toowoomba? Ex. 10.6: The data in the le countries.dat contains numerous variables from a number of countries, and the countries have been classied by region. Create a plot to see which countries appear similar. Ex. 10.7: This question concerns a data set that is not climatological, but you may nd interesting. The data le chocolates.dat, available from http://www.sci.usq.edu.au/staff/dunn/Datasets/applications/ popular/chocolates.html, contains measurements of the price, weight and nutritional information for 17 chocolates commonly available in Queensland stores. The data was gathered in April 2002 in Brisbane. Create a plot to see which chocolates appear similar. Are there are surprises? Ex. 10.8: The data le soitni.txt contains the SOI and TNI from 1958 to 1999. The TNI is related to sea surface temperatures (SSTs), and SOI is also known to be related to SSTs. It may be expected, therefore, that there may be a relationship between the two indices. Create a plot to examine if such a relationship exists.
10.9.1
10.2 A star plot can be found as follows:
> td <- read.table("http://www.sci.usq.edu.au/staff/dunn/Datasets/applications/clim + header = TRUE) > head(td) rain maxt mint radn pan vpd 1890 1087.22 22.426 11.430 17.798 4.521 14.526 1900 850.78 22.426 11.430 17.798 4.521 14.526
224 1910 1920 1930 1940 856.65 921.28 931.85 969.08 22.426 22.427 22.426 22.427 11.430 11.431 11.430 11.431 17.798 17.798 17.798 17.798
Module 10. Introduction 4.521 4.521 4.521 4.521 14.526 14.527 14.526 14.527
> stars(td[, 1:3], draw.segments = TRUE, + key.loc = c(7, 2), main = "Toowoomba weather by Decade") The plot (Fig. 10.4) shows a trend of increasing rainfall from the 1900s to the 1950s, a big drop in the 1960s, then a big jump in the 1970s. The 1990s were very dry again. The 1990s were also a very warm decade (relatively speaking), and the 1960s very cold (relatively speaking).
Module
11
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226 228 232 233 234 235 236 242 242 245 247 248 252
Principal Components Analysis
Module contents
11.1 Introduction . . . . . . . . . . . . . . . . . . . 11.2 The procedure . . . . . . . . . . . . . . . . . . 11.2.1 When should the correlation matrix be used? 11.2.2 Selecting the number of pcs . . . . . . . . . . 11.2.3 Interpretation of pcs . . . . . . . . . . . . . . 11.3 pca and other statistical techniques . . . . . 11.4 Using r . . . . . . . . . . . . . . . . . . . . . . . 11.5 Spatial pca . . . . . . . . . . . . . . . . . . . . 11.5.1 A small example . . . . . . . . . . . . . . . . 11.5.2 A larger example . . . . . . . . . . . . . . . . 11.6 Rotation of pcs . . . . . . . . . . . . . . . . . . 11.7 Exercises . . . . . . . . . . . . . . . . . . . . . . 11.7.1 Answers to selected Exercises . . . . . . . . .
Module objectives
226
Module 11. Principal Components Analysis

understand the principles underlying principal components analysis; give a geometric interpretation of the principal components method; compute principal components from given data using r; select an appropriate number of principal components using suitable techniques; make sensible interpretations of the principal components where possible; compute the principal components scores for each subject; conduct a spatial pca; understand that rotation of principal components is a contentious issue.
11.1
Introduction
Principal components analysis (pca) is one of the basic multivariate techniques, and is also one of the simplest. Wilks [49, p 373] says of pca that it is possibly the most widely used multivariate statistical technique in the atmospheric sciences. . . (to which statistical climatology belongs). pca is an example of a data reduction technique, one that reduces the dimension of the data. This is possible if the variables are correlated. pca attempts to nd a new coordinate system for the data. In climatology and related sciences, numerous variables are correlated, so pca is a commonly used technique. pca is also called empirical orthogonal functions (EOFs) or sometimes empirical eigenvector analysis (EEA). Activity 11.A: Read Manly, Section 6.1. Read Wilks, the introduction to Section 9.3. For a geometric interpretation of principal components in the two-dimensional case, see Fig. 11.1. Fig. 11.1 (top left panel) shows the original data. The data have a strong trend in the SWNE direction. In Fig. 11.1 (top right panel), the two principal components are shown. The rst principal component is in the SWNE direction as expected. Fig. 11.1 (centre left) shows one particular point being mapped to the new coordinates. In Fig. 11.1 (bottom right panel), a screeplot (see the next section) shows that most (almost 96%) of the original variation in the data can be explained by the rst principal
11.1. Introduction
227
q q q q q
q q q q q q
q q q
q q q
q q
q q q q q q qq q q q qq q q q q q q qq q q q q q q q q q q q q q q q q q q qq q q q q q q qq q q q q q q q qq q q q q q q q q q q q q q q q q qq
q q
q q q q q q qq q q q qq q q q q q q qq q q q q q q q q q q q q q q q q q q qq q q q q q q qq q q q q q q q qq q q q q q q q q q q q q q q q q q qq
x2
x2 1 0
q q q
q q qq q
q q q q q q
q q qq q
0 x1
q q
0 x1
Histogram of PCA 1
q
q q q q q q
q q
q q q q q
q q qq q
q q q qq
0 x1
0 3
10
q q q q q q qq q q q qq q q q q q q qq q q q q q q q q q q q q q q q q q q qq q q q q q q qq q q q q q q q qq q q q q q q q q q q q q q
Frequency
x2
15
20
25
q q q
30
35
predict(pca)[, 1]
Histogram of PCA 2
30
Scree plot of principal components
20
25
Frequency
10
Variances 0.6 0.4 0.2 0.0 0.2 0.4 0.6 0.8 predict(pca)[, 2]
15
Figure 11.1: A geometric interpretation for principal components in the two-dimensional case. Top left: some points are shown. They tend to be strongly oriented in one direction. Top right: the corresponding principal components are shown as bold lines. The main principal component is in the SWNE direction. Bottom left: a particular point is mapped to the new coordinates. Bottom right: the scree plot shows that the rst pc accounts for most of the variation in the data.
0.0
0.5
1.0
1.5
228
component only. That is, using just the rst principal components reduces the dimension of the problem from two to one, with only a small loss of information. Note the pcs are simply linear combinations of the variables, and that they are orthogonal. Also note that my computer struggles to perform the required computations on my machine (it seems to manage despite complaining).
11.2
The procedure
Activity 11.B: Read Manly, Sections 6.2 and 6.3. Read Wilks, Section 9.3.1.
A pca is conducted on a set of n observations of p (probably correlated) variables. It is important to realize that pcaand most other multivariate methods alsois based on nding the eigenvalues and eigenvectors. Also note that the eigenvalues and eigenvectors are found from either the correlation matrix or the covariance matrix. The next section discusses which should be used. The four steps outlined at the bottom of p 80 of Manly show the general procedure. Software is used to do the computations. Example 11.1: Consider the following data matrix X with two variables X1 and X2 , with three observations (so n = 3) for each variable: 1 0 X = 1 1 . 4 2 The data are plotted in Fig. 11.2 (a). The mean for each variable is X 1 = 2 and X 2 = 1, so the centred matrix is 1 1 Xc = 1 0 . 2 1 The centred data are plotted in Fig. 11.2 (b). It is usual to nd the pcs from the correlation matrix. First, nd the covariance matrix, found
11.2. The procedure by computing (X X)T (X X)/n as follows1 : P = (X X)T (X X)/n = 1 3 6 3 3 2 = 2 1 1 2/3 .
229
This matrix is always symmetric. From the diagonals of this matrix, var[X1 ] = 2 and var[X2 ] = 2/3. Using these two numbers, the diagonal matrix D can be formed: D= 2 0 0 2/3 ,
when, by convention, D1/2 refers to the matrix with the diagonals raised to the power 1/2: 1/ 2 0 1/2 D = . 0 3/ 2 The correlation matrix, say R, can then be found as follows: 1 3/2 1/2 1/2 R=D PD = . 3/2 1 This matrix will always have ones on the diagonals. The data can be scaled after being centred by dividing by the standard deviation (obtained from matrix D1/2 ); in this case, the centred and scaled data are 1/2 3/ 2 Xcs = 1/ 2 0 . 2 3/ 2 The centred and scaled data are plotted in Fig. 11.2 (c). In eect, it is this data for which principal components are sought (since R = T Xcs Xcs /n). Now, 1 3 3 3/2 1 3/2 R= = , 3 3/2 1 3 3 3/2 the correlation matrix. The eigenvectors e and eigenvalues of matrix R are now required2 , which are the solutions of (R I)e = 0.
1
(11.1)
Notice that we have divided by n rather than n 1. This is simply to follow what r does; more commonly, the divisor is n 1 when sample variances (and covariances) are computed. I do not know why r divides by n instead of n 1. 2 This is a quick review of work already studied in MAT2100. Eigenvalues and eigenvectors are covered in most introductory algebra texts.
230
Module 11. Principal Components Analysis This system of equation is only consistent if |R I| = 0 (where |W | means the determinant of matrix W ). This becomes 1 3/2 = 0, 3/2 1 or with solutions 1 = 1+ 3/2 1.866 and 2 = 1 3/2 0.134. Subsituting these eigenvalues into Equation (11.1) to nd the eigenvectors gives 1/ 2 1/2 ; e2 = . e1 = 1/ 2 1/ 2 These eigenvectors become the principal components, or pcs. There are two pcs as there were originally two variables. Note that the two eigenvectors (or the two pcs) are orthogonal: e1 .e2 = 0. Generally, a matrix of eigenvectors is dened: 1/2 1/ 2 C= . 1/ 2 1/ 2 (Note that these vectors are only dened up to a constant. These vectors have been dened to have a length of one, and the signs determined to be equivalent to those given in the current version of r I have3 .) Note that the two eigenvectors are orthogonal: e1 .e2 = 0. The (directions of the) eigenvectors are shown plotted with the centred and scaled data in Fig. 11.2 (d). There were originally two variables; there will be two pcs. The pcs are dened in the direction of the two eigenvectors. The proportions of the variance explained by each is found from the eigenvalues, and can be reported in a table like that shown below. pc pc 1 pc 2 evalue 1.866 0.134 2 % variance 93.3% 6.7% 100% cumulative % 93.3% 100% (1 )2 3/4 = 0,
The signs may change from one version of r to another, or even dier between copies of r on dierent operating systems. This is true for almost any computer package generating eigenvectors. A change in sign simply means the eigenvectors point in the opposite direction and makes no eective dierence to the analysis
11.2. The procedure
231
Original data
4 4
Centred data
Centred X 2
X2
1
q
1 1 0 1 X1 2 3 4
1 0
Centred X 1
Centred and scaled data

4 4
With the PCs shown
Centred & scaled X 2
e1
X2
e2
1 0
1 0
1 X1
Centred & scaled X 1
Figure 11.2: The data from Example 11.1. Top left: the original data; Top right: the data have been centred; Bottom left: the data have been centred and then scaled; Bottom right: the directions of the principcal components have been added.
232
Module 11. Principal Components Analysis A scree plot can be drawn from this if you wish. In any case, one pc would be taken (otherwise, no simplication has been made for all this work!). It is possible to then determine what score each of the original points now have on the new variables (or principal components). These new scores, say Y , can be found from the original variables, X, using Y = XC. In our case in this example, the matrix X will refer to the centred, scaled variables since the pcs were computed using these. Hence, 1/2 3/ 2 2 1/2 1/ 1/ 2 Y = 0 1/ 2 1/ 2 2 3/ 2 (1 3)/2 (1 + 3)/2 . 1/2 1/2 = (1 3)/2 1 + 3/2 Thus, the point (1, 0) is now mapped to ((1 3)/2, (1 + 3)/2), the point (1, 1) is now mapped to (1/2, 1/2), and the point (4, 2) is now mapped to (1 + 3/2, (1 3)/2) in the new system. In Fig. 11.2 (d), the point (1, 1) can be seen to be mapped to a negative value for the rst pc, and the same (possibly negative) value for the second pc4 . Thus, (1/2, 1/2) seems a sensible value to which the second point could be mapped. Since we only take one pc, the new variable takes the values (1 3)/2, 1/2, 1 + 3/2 which accounts for about 93% of the variation in the original data.
11.2.1
When should the correlation matrix be used?

Activity 11.C: Read Wilks, Section 9.3.4.
When the variables measure similar information, or have similar units of measurement, the covariance matrix is generally used. If the variables are on very dierent scales, the correlation matrix is usually the basis for pca.
4
We say possibly since it depends on which direction the eigenvectors are pointing.
11.2. The procedure
233
For example, Example 10.1 involves variables that are measured on dierent scales: SO2 was measured in micrograms per cubic metre, whereas manufac is simply the number of manufacturing enterprises with more than 20 employees. These are very dierent and measured in dierent units of measurement. For this reason, the pca should be based on the correlation matrix. In eect, the correlation matrix transforms all of the variables to a similar scale so that the actual units of measurement are not important. Commonly, but not always, the correlation matrix is used. It is important to realize that, in general, dierent results are obtained using the correlation and covariance matrices.
11.2.2
Selecting the number of
PC s
Activity 11.D: Read Wilks, Sections 9.3.2 and 9.3.3. One of the dicult decisions to make in pca is how many principal components (pcs) are necessary to keep. The analysis will always produce as many pcs as there are variables, so keeping all the pcs means that no information is lost, but it also completely reproduces the data. This defeats the purpose of performing a data reduction technique such as pcait simply complicates matters! There are many criteria for making this decision, but no formal procedure (involving tests, etc.). There are only guidelines; some are given below. Using any of the methods without thought is dangerous and prone to error. Always examine the information and make a sensible decision that you can justify. Sometimes, there is not one clear decision. Remember the purpose of pca is to reduce the dimension of the data, so a small number of pcs is preferred.
Scree plots One way to help make the decision is to use a scree plot. The scree plot is used to help decide between the important pcs (with large eigenvalues) and the less important pcs (with small eigenvalues). Some authors claim this method generally includes too many pcs. When using a screeplot, some pcs should be clearly more important than others. (This is not always the case, however.)
234 Total variance rule
Another proposed method is to take as many pcs as necessary until a certain percentage (often 90%) of the variance has been explained.
Use above average PCs This method recommends only keeping those pcs whose eigenvalues are greater than the average. (Note that if the correlation matrix has been used to compute the pcs, this means that pcs are retained if their eigenvalues are greater than one.) For a small number of variables (say 20), this method is reported to include too few techniques. Example 11.2: Kidson [29] analysed monthly means of surface pressures, temperature and rainfall using principal components analysis. In each case considered, 10 out of a possible 120 components accounted for more than 80% of the observed variance.
Example 11.3: Katz & Glantz [27] use a principal components analysis on rainfall data to show that no single rainfall index (or principal component) can adequately explain rainfall variation.
11.2.3
Interpretation of
PC s
It is often useful to nd an interpretation for the pcs, recalling that the pcs are simply linear combinations of the variables. It is not uncommon for the rst pc to be a measure of size. Finding interpretations is often quite an art, and sometimes any interpretation is dicult to nd. Example 11.4: Mantua et al. [33] dene the the Pacic Decadal Oscillation (PDO) as the leading pc of monthly SST anomalies in the North Pacic Ocean.
11.3. pca and other statistical techniques
235
11.3
PCA
and other statistical techniques
pca is often used as a data reduction technique, as has been described in these notes. But there are other uses as well. For example, pca can be used on various type of data often as a preliminary step before further analysis. pca is sometimes used as a preliminary step before a regression analysis. In particular, if there are a large number of covariates, or there are a number of large correlations between covariates, a pca is often performed, a number of pcs selected, and these pcs used as covariates in a regression analysis. Example 11.5: Wol, Morrisey & Kelly [50] use principal components analysis followed by a regression to identify source areas of the ne particles and sulphates which are the primary components of summer haze in the Blue Ridge Mountains of Virginia, USA.
Example 11.6: Fritts [17] describes two techniques for examining the relationship between ring-width of conifers in western North America and climatic variables. The rst technique is a multiple regression on the principal components of climate.
pca is sometimes used with cluster analysis (see Module 13) to classify climatological variables. Example 11.7: Stone & Auliciems [42] use a combination of cluster analysis and pca to dene phases of the Southern Oscillation Index (SOI).
Example 11.8: One use of principal components analysis is to extract principal components from a multivariate time series. Michaelsen [35] used this method (which he called frequency domain principal components analysis) on the movement of sea surface temperatures (SST) anomalies in the North Pacic, and found a low frequency SST eld.
236 r function princomp prcomp
Module 11. Principal Components Analysis Computation method Eigen-analysis SVD* Matrix used correlation or covariance centre and/or scale
Table 11.1: Two methods for computing principal components in r. The stars indicate the preferred option. SVD stands for singular-value decomposition. princomp uses the less-preferred eigenvalue-based analysis (for compatibility with programs such as S-Plus). The functions use dierent methods of specifying the matrix on which to base the computations: using center=TRUE and scale=TRUE in prcomp is equivalent to using cor=TRUE in princomp. (The default for prcomp is center=TRUE, scale=FALSE; the default for princomp is cor=FALSE (that is, use the covariance matrix)).
11.4
Using R
r can be used to nd principal components; confusingly, two dierent methods exist; Table 11.1 compares the methods. In general, the function prcomp will be used here. The next example continues on from Example 11.9 and uses a very small data matrix to show how the calculations done by hand can be compared to those performed in r. Example 11.9: Refer to Example 11.1. and the data are plotted in Fig. 11.2 (a). How can this analysis be done in r? Of course, tasks such as multiplying matrices and computing the eigenvalues can be done in r (using the commands %*% and eigen respectively). First, dene the data matrix (and then centre it also): > + > > > testd <- matrix(byrow = TRUE, nrow = 3, data = c(1, 0, 1, 1, 4, 2)) means <- colMeans(testd) means <- c(1, 1, 1) %o% means ctestd <- testd - means
Some of the matrics we used can be dened also: > XtX <- t(ctestd) %*% ctestd > P <- XtX/length(testd[, 1])
11.4. Using r > > > > > > > > > st.devs <- sqrt(diag(P)) cstestd <- testd cstestd[, 1] <- ctestd[, 1]/st.devs[1] cstestd[, 2] <- ctestd[, 2]/st.devs[2] cormat <- cor(ctestd) D.power <- diag(st.devs) cormat2 <- D.power^T %*% P %*% D.power es <- eigen(cormat) es
237
$values [1] 1.8660254 0.1339746 $vectors [,1] [,2] [1,] 0.7071068 0.7071068 [2,] 0.7071068 -0.7071068 These results agree with those in Example 11.1. But of course, r can compute principal components without us having to resort to matrix multiplication and nding eigenvalues. > p <- prcomp(testd, center = TRUE, scale = TRUE) > names(p) [1] "sdev" [5] "x" "rotation" "center" "scale"
Specifying center=TRUE and scale=TRUE instructs r to use the correlation matrix to nd the pcs. The standard deviations used by r to scale the data is > p$sdev [1] 1.3660254 0.3660254 > p$sdev^2 [1] 1.8660254 0.1339746 Likewise, the centres (means) of each variable is found using p$center (but arent shown here). The eigenvectors are in the columns of: > p$rotation
238
Module 11. Principal Components Analysis PC1 PC2 [1,] 0.7071068 0.7071068 [2,] 0.7071068 -0.7071068 A screeplot can be produced using > screeplot(p) or just > plot(p) but is not shown here. The proportion of the variance explained by each pc is found using summary: > summary(p) Importance of components: PC1 PC2 Standard deviation 1.366 0.366 Proportion of Variance 0.933 0.067 Cumulative Proportion 0.933 1.000 The eigenvalues are given by > p$sdev^2 [1] 1.8660254 0.1339746 The new scores, called the principal components or pcs (and called Y earlier), can be found using > predict(p) PC1 PC2 [1,] -1.1153551 0.2988585 [2,] -0.4082483 -0.4082483 [3,] 1.5236034 0.1093898
This example was to show you how to perform a pca by hand, and how to nd those bits-and-pieces in the r output. Notice that once the correlation matrix has been found, the analysis proceeds without knowledge of anything else. Hence, given only a correlation matrix, pca can be performed. (Note
11.4. Using r
239
r requires a data matrix for use in prcomp; to use only a correlation matrix, you must use eigen and so on.) Commonly, a small number of the pcs are chosen for further analysis; these can be extracted as follows (where the rst two pcs here are extracted as an example): > p.pcs <- predict(p) The next example is more practical. Example 11.10: Consider the sparrow example used by Manly in Example 6.1. (While not a climatological, it will demonstrate how to do equivalent analyses in r.) We use the correlation matrix since the variables are dissimilar. First, load the data > sp <- read.table("sparrows.txt", header = TRUE) It is then interesting to examine the correlations between the variables: > cor(sp) Length 1.0000000 0.7349642 0.6618119 0.6269482 0.6051247 Sternum 0.6051247 0.5290138 0.5262701 0.5787743 1.0000000 Extent 0.7349642 1.0000000 0.6737411 0.7621451 0.5290138 Head 0.6618119 0.6737411 1.0000000 0.7184943 0.5262701 Humerus 0.6269482 0.7621451 0.7184943 1.0000000 0.5787743
Length Extent Head Humerus Sternum Length Extent Head Humerus Sternum
There are many high correlations, so it may be possible to reduce the number of variables are retain most of the information. That is, a pca may be useful. The following code analyses the data: > sp <- read.table("sparrows.txt", header = TRUE) > sp.prcomp <- prcomp(sp, center = TRUE, + scale = TRUE) > names(sp.prcomp)
240 [1] "sdev" [5] "x"
Module 11. Principal Components Analysis "rotation" "center" "scale"
The command prcomp returns numerous variables, as can be seen. The table at the bottom of Manly, p 81 is found as follows: > sp.prcomp$rotation PC1 PC2 PC3 0.4548793 -0.06760175 0.7340681 0.4662631 0.30512343 0.2671031 0.4494628 0.29277283 -0.3470235 0.4635108 0.22746613 -0.4772988 0.3985280 -0.87457014 -0.2038638 PC4 PC5 Length 0.23424318 0.4413490 Extent -0.47737764 -0.6247119 Head 0.73389847 -0.2307272 Humerus -0.41989524 0.5738386 Sternum -0.04818454 -0.1800565
Length Extent Head Humerus Sternum
Can these pcs be interpretted? The rst pc is almost equally loaded for each variable; it therefore measures the general size of the bird. The second pc is highly loaded with the sternum length, not very loaded with length, and equally loaded for the rest. It is not easy to interpret, but perhaps is a measure of sternum length. The third pc has a high loading for length; perhaps it is a length pc. The fourth is a measure of head size; the fth the constrast between extent and humerus (since these two variable are loaded with dierent signs). As can be seen, some creativity may be necessary to develop meaningful interpretations! The table above is equivalent to Table 6.3 in Manly, but information is transposed (try t(sp.prcomp$rotation)). The numbers are also slightly dierent, but certainly similar. The eigenvalues (variances of the pcs) in Manlys Table 6.3 are found as follows: > sp.prcomp$sdev^2 [1] 3.5762941 0.5355019 0.3788619 0.3273533 [5] 0.1819888 A screeplot is produced using screeplot: > screeplot(sp.prcomp) > screeplot(sp.prcomp, type = "lines")
11.4. Using r
241
sp.prcomp
3.5 3.5
q
sp.prcomp
3.0
2.5
Variances
2.0
Variances
1.5
1.0
0.5
0.5
1.0
1.5
2.0
2.5
3.0
q q q q
0.0
Figure 11.3: Two dierent ways of presenting the screeplot for the sparrow data. In (a), the default screeplot; in (b), the more standard screeplot produced with the option type="lines". The nal plot is shown in Fig. 11.3. The rst pc obviously is much larger than the rest, and easily accounts for most of the variation in the data. If we use the screeplot, you may decide to keep only one pc. Using the total variance rule, you may decide that three or four pcs are necessary: > summary(sp.vars) Min. 1st Qu. Median Mean 3rd Qu. Max. 0.03640 0.06547 0.07577 0.20000 0.10710 0.71530 Using the above average pc rule would select only one pc: > mean(sp.vars) [1] 0.2 > sp.vars > mean(sp.vars) Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 TRUE FALSE FALSE FALSE FALSE The values of the pcs for each bird is found using (for the rst 10 birds only)
242
Module 11. Principal Components Analysis > predict(sp.prcomp)[1:10] [1] 0.07836554 -2.16233078 -1.13609553 [4] -2.29462019 -0.28519596 1.93013405 [7] -1.03954232 0.44378025 2.70477182 [10] 0.19259851 Note that the rst bird has a score of 0.07837 on the rst pc, whereas the score is 0.064 in Manly. The scores on the second pc are very similar: 0.6166 (above) compared to 0.602 (Manly). The rst three pcs are extracted for further analysis using > sp.pcs <- predict(sp.prcomp)[, 1:3]
11.5
Spatial PCA
One important use of pca in climatology is spatial pca, or eld pca. Activity 11.E: Read Wilks, Section 9.3.5. As noted by Wilks, this is a very common use of pca. The idea is this: Data, such as rainfall, may be available for a large number of locations (usually called stations), usually over a long time period. pca can be used to nd patterns over those locations.
11.5.1
A small example
Example 11.11: As a preliminary example, consider some rainfall data from selected rainfall stations in Australia, as shown in Table 11.2. Each column consist of 15 observations of the rainfall at each station. Thus, there are the equivalent of 10 variables with 15 repeated observations each. A pca can be performed to reduce the information contained in 10 stations to a smaller number. Notice that each of the 15 observations for each station constitute a time series. > p <- prcomp(rain, cor = TRUE) > plot(p, main = "Small rainfall example")
11.5. Spatial pca
243
Table 11.2: Monthly rainfall gures for ten stations in Australia. There are 15 observations for each station, given in order of time (the actual recording months are unknown; the source did not state).
Station number 1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 111.70 25.50 82.90 174.30 77.70 117.10 111.20 147.40 66.50 107.70 26.70 92.40 157.00 20.80 137.20 2 30.80 2.80 47.50 81.50 22.00 35.90 52.70 109.70 29.00 37.70 6.10 25.70 63.00 4.10 38.10 3 78.70 19.20 98.90 106.80 48.90 118.10 69.10 150.70 41.70 77.00 16.20 45.50 79.20 12.50 82.40 4 58.60 4.00 65.20 80.90 56.20 86.90 56.80 101.20 22.60 52.80 11.90 58.00 70.10 7.90 59.70 5 30.60 8.10 73.50 73.90 67.10 81.90 27.20 102.80 50.60 27.60 14.20 22.20 45.70 7.40 27.60 6 63.60 7.80 117.00 123.50 113.00 98.60 51.60 112.40 73.10 34.80 34.80 32.30 66.80 11.70 58.00 7 53.40 10.30 95.60 155.80 256.40 84.00 76.00 32.60 92.80 16.20 32.60 35.70 76.00 9.30 45.30 8 15.90 1.00 37.50 51.20 38.30 42.40 16.30 42.60 26.40 7.60 18.00 8.80 14.40 14.80 5.00 9 27.60 4.10 93.40 81.50 65.60 67.30 50.40 52.50 36.00 5.50 28.70 13.80 16.30 6.60 34.30 10 72.60 27.30 139.90 177.10 253.30 154.30 191.50 47.30 80.10 12.20 118.30 37.80 71.50 19.40 108.40
The scree plot is shown in Fig. 11.4; it is not clear how many pcs should be retained. We shall select three for the purpose of this example; three is not unreasonable as they account for over 90% of the variation in the data (see line 11 of the output). There a few important points to note: (a) In practice, there are often hundreds of stations with available data, and over a hundred years worth of rainfall data for most stations. This creates huge data les that, in practice, take large amounts of computing power to analyse. (b) If latitudes and longitudes of the stations are known, contour maps can be drawn of the principal components over a map of Australia (see the next example). (c) Each pc is a vector of length 15 and is also a time series. These can be plotted as time series (see Fig. 11.5) and even analysed as a time series using the techniques previously studied. This analysis can detect time trends in the pcs. In this small example, the time trends of 15 stations have been reduced to time trends of three new variables that capture the important information carried by all 15.
244
Small rainfall example
Variances Time
Figure 11.4: The scree plot for the pca of the small rainfall example.
100
100
0 0
2000
4000
6000
8000
10000
200 2 4 6 8 PCs 10 12
PCA 1 PCA 2 PCA 3 14
Figure 11.5: The pcs plotted over time for the small rainfall examples.
11.5. Spatial pca
245
Full rainfall example
Variances
Figure 11.6: The scree plot for the full rainfall example.
11.5.2
A larger example
Example 11.12: Using the larger data le from which the data in the previous example came, a more thorough pca can be performed. This analysis was over 1188 time points for 52 stations. The data matrix has 1188 52 = 61 776 entries; this needs a lot of storage in the computer, and a lot of memory for performing operations such as matrix multiplication and matrix inversions. The scree plot is shown in Fig. 11.6. Plotting the rst pc over a map of Australia gives Fig. 11.7 (a). The second pc has been plotted over a map of Australia Fig. 11.7 (b). This time, the rst three pcs account for about 57% of the total variation. Notice that even with 52 stations, the contours are jagged; they could, of course, be smoothed. It requires special methods to handle data les of this size. The code used to generate these picture is given below. Be aware that you probably cannot run this code as it requires installing r libraries that you probably do not have by default (but can perhaps be installed; see Appendix A). The huge data les necessary are in a format called netCDF, and a special library is required to read these les.
10
15
246
First PC
10 Latitude 45 40 35 30 25 20 15
120
130 Longitude
140
150
Second PC
10 Latitude 45 40 35 30 25 20 15
120
130 Longitude
140
150
Figure 11.7: The rst two pcs plotted over a map of Australia.
11.6. Rotation of pcs > > > > > > > > > > > > > > > > + + + + + + + + > > > + > > >
247
library(oz) library(ncdf) set.datadir() d <- open.ncdf("./pca/oz-rain.nc") rawrain <- get.var.ncdf(d, "RAIN") missing <- attr(rawrain, "missing_value") rawrain[rawrain == missing] <- NA set.docdir() longs <- get.var.ncdf(d, "LONGITUDE79_90") nx <- length(longs) lats <- get.var.ncdf(d, "LATITUDE19_33") ny <- length(lats) times <- get.var.ncdf(d, "TIME") ntime <- length(times) rain <- matrix(0, ntime, nx * ny) for (ix in (1:nx)) { for (iy in (1:ny)) { idx <- (iy - 1) * nx + ix t <- rawrain[ix, iy, 1:ntime] if (length(na.omit(t)) == ntime) { rain[, idx] <- t } } } pc.rain <- rain[, colSums(rain) > 0] p1 <- prcomp(pc.rain, center = TRUE, scale = TRUE) plot(p1$rotation, type = "b", main = "Full rainfall example", ylab = "Eigenvalues") par(mfrow = c(2, 1)) oz(add = TRUE, lwd = 2) oz(add = TRUE, lwd = 2)
The gaps in the plots are because there is such little data in those remote parts of Australia, and rainfall is scare there anyway. Note the pcs are deduced from the correlations, so the contours are for small and sometimes negative numbers, not rainfall amounts.
11.6
Rotation of PCs
One controverisal topic is the rotation of principal components, which we briey discuss here.
248
One constraint on the pcs is they must be orthogonal, which some authors argue limits how well they can be interpretted. If the physical interpretation of the pcs is more important than data reduction, some authors argue that the orthogonality constraint should be relaxed to allow better interpretation (see, for example, Richman [38]). This is called rotation of the pcs. Many methods exist for rotation of the pcs. However, there are many arguments against rotation of pcs (see, for example, Basilevsky [8]). Accordingly, r does not explicitly allow for pcs to be rotated, but it can be accomplished using functions designed to be used in factor analysis (where rotations are probably the norm rather than the exception). We will not discuss this topic any further, except to note two issues: 1. Rotation is discussed further in Chapter 12 on factor analysis, where it is more appropriate; 2. The purpose of rotation of the pcs appears to generally be to cluster the pcs together. This can be accomplished using a cluster analysis (see Chapter 13).
11.7
Exercises
Ex. 11.13: Consider the following data: 3 3 X= 1 1
3 4 . 3 6
(a) Perform a pca by hand using the correlation matrix (follow Example 11.1 or Example 11.9). (Dont use prcomp or similar functions; you may use r to do the matrix multiplication and so on for you.) (b) Perform a pca by hand, but using the covariance matrix. (c) Compare and comment on the two strategies. Ex. 11.14: Consider the following data: 1 0 X= 3 4
2 3 . 5 6
11.7. Exercises
249
(a) Perform a pca by hand using the correlation matrix (follow Example 11.1 or Example 11.9). (Dont use prcomp or similar functions; you may use r to do the matrix multiplication and so on for you.) (b) Perform a pca by hand, but using the covariance matrix. (c) Compare and comment on the two strategies. Ex. 11.15: Consider the correlation matrix R= 1 0.6 0.6 1 .
Perform a pca using the correlation matrix. Dene the new variables, and explain how many new pcs are necessary. Ex. 11.16: Consider the correlation matrix R= 1 r r 1 .
(a) Perform a pca using the correlation matrix and show it always produces new axes at 45 to the original axes. (b) Explain what happens in the pca for r = 0, r = 0.25, r = 0.5 and r = 1. Ex. 11.17: The data le toowoomba.dat contains (among other things) the daily rainfall, maximum and minimum temperatures at Toowoomba from 1 January 1889 to 21 July 2002 (a total of 41474 observations on three variables). Perform a pca. How many pcs are necessary to summarize the data? Ex. 11.18: Consider again the air quality data from 41 cities in the USA, as seen in Example 10.1. For each city, seven variables have been measured (see p 218). The rst is the concentration of SO2 in microgram per cubic metre; the other six are potential identiers of pollution problems. The original source treats the concentration of SO2 as a response variable, and the other six as covariates. (a) Examine the correlation matrix; what varaible are highly correlated? (b) Produce a star plot of the data, and comment. (c) Is it possible to reduce these six covariates to a smaller number, without losing much information? Use a pca to perform a data reduction.
250
Module 11. Principal Components Analysis (d) Should a correlation or covaraince matrix be used for the pca? Explain your answer. (e) Examine the loadings; is there any sensible interpretation? (f)
Ex. 11.19: Consider the example in 11.5.2. If you can load the appropriate libraries, try the same steps in that example but for the data in oz-slp.nc. Ex. 11.20: The data le emerald.dat contains the daily rainfall, maximum and minimum temperatures, radiationp, an evaporation and maximum vapour pressure decit (in hPa) at Emerald from 1 January 1889 to 15 September 2002 (a total of 41530 observations on three variables). Perform a pca. How many pcs are necessary to summarize the data? Ex. 11.21: The data le gatton.dat contains the daily rainfall, maximum and minimum temperatures, radiationp, an evaporation and maximum vapour pressure decit (in hPa) at Gatton from 1 January 1889 to 15 September 2002 (a total of 41530 observations on three variables). (a) Perform a pca using the covariance matrix. (b) Perform a pca using the correlation matrix. Compare to the previous pca. Which would you choose: a pca based on the covariance or the correlation matrix? Explain. (c) How many pcs are necessary to summarize the data? Explain. (d) If possible, interpret the pcs. (e) Take the rst pc; perform a quick time series analysis on this pc. (Dont attempt necessarily to nd an optimal model; doing so will be time consuming because oe the amount of data, and may be dicult also. Just plot an ACF, PACF and suggest a model based on those.) Ex. 11.22: The data le strainfall.dat contains the average month and annual rainfall (in tenths of mm) for 363 Australian rainfall stations. (a) Perform a pca using the monthly averages (and not the annual average) using the correlation matrix. How many pcs seems necessary? (b) Perform a pca using the monthly averages (and not the annual average) using the covariance matrix. How many pcs seems necessary? (c) Which pca would you prefer? Why? (d) Select the rst two pcs. Conrm that they are uncorrelated.
11.7. Exercises
251
Ex. 11.23: The data le jondaryn.dat contains the daily rainfall, maximum and minimum temperatures, radiationp, an evaporation and maximum vapour pressure decit (in hPa) at Jondaryn from 1 January 1889 to 15 September 2002 (a total of 41474 observations on six variables). Perform a pca. How many pcs are necessary to summarize the data? Ex. 11.24: The data le wind_ca.dat contains numerous weather and wind measurements from Canberra during 1989. (a) Explain why it is best to use the correlation matrix for this data. (b) Perform a pca using the correlation matrix. (c) How many pcs are necessary to summarize the data? Explain. (d) If possible, interpret the pcs. (e) Perform a time series analysis on the rst pc. Ex. 11.25: The data le wind_wp.dat contains numerous weather and wind measurements from Wilsons Promontory, Victoria (the most southerly point of mainland Australia) during 1989. (a) Explain why it is best to use the correlation matrix for this data. (b) Perform a pca using the correlation matrix. (c) How many pcs are necessary to summarize the data? Explain. (d) If possible, interpret the pcs. (e) Explain why a time series analysis on, say, the rst pc cannot be done here. (Hint: Read the help about the data.) Ex. 11.26: The data le qldweather.dat contains six weather-related variables for 20 Queensland cities. (a) Perform a pca using the correlation matrix. How many pcs seems necessary? (b) Perform a pca using the covariance matrix. How many pcs seems necessary? (c) Which pca would you prefer? Why? (d) Select the rst three pcs. Conrm that they are uncorrelated. Ex. 11.27: This question concerns a data set that is not climatological, but you may nd interesting. The data le chocolates.dat, available from http://www.sci.usq.edu.au/staff/dunn/Datasets/applications/ popular/chocolates.html, contains measurements of the price, weight and nutritional information for 17 chocolates commonly available in Queensland stores. The data was gathered in April 2002 in Brisbane.
252
Module 11. Principal Components Analysis (a) Would it be best to use the correlation or covariance matris for the pca? Explain. (b) Perform this pca using the nutritional information. (c) How many pcs are useful? (d) If possible, give an interpretation for the pcs.
11.7.1
11.13 First, use the correlation matrix. > + > > > > > > > > > > > > > > > testd <- matrix(byrow = TRUE, nrow = 4, data = c(3, 3, 3, 4, 1, 3, 1, 6)) means <- colMeans(testd) ctestd <- testd - means means <- colMeans(testd) means <- c(1, 1, 1) %o% means XtX <- t(ctestd) %*% ctestd P <- XtX/length(testd[, 1]) st.devs <- sqrt(diag(P)) cstestd <- testd cstestd[, 1] <- ctestd[, 1]/st.devs[1] cstestd[, 2] <- ctestd[, 2]/st.devs[2] cormat <- cor(ctestd) D.power <- diag(1/st.devs) cormat2 <- D.power^T %*% P %*% D.power es <- eigen(cormat) es
$values [1] 1.5 0.5 $vectors [,1] [,2] [1,] 0.7071068 0.7071068 [2,] -0.7071068 0.7071068 Using the covariance matrix: > es <- eigen(cov(testd)) > es $values [1] 2.4120227 0.9213107
11.7. Exercises
253
$vectors [,1] [,2] [1,] 0.5257311 0.8506508 [2,] -0.8506508 0.5257311 As expected, the eigenvalues and pcs are dierent. 11.18 Here is some r code: > us <- read.table("usair.dat", header = TRUE, + row.names = 1) > us.pca <- prcomp(us[, 2:7], center = TRUE, + scale = TRUE) > plot(us.pca, main = "Screeplot for US air data") How many pcs should be selected? The screeplot is shown in Fig. 11.8, from which three or four might be selected. The variances of the eigenvectors are > summary(us.pca) Importance of components: PC1 PC2 PC3 PC4 Standard deviation 1.482 1.225 1.181 0.872 Proportion of Variance 0.366 0.250 0.232 0.127 Cumulative Proportion 0.366 0.616 0.848 0.975 PC5 PC6 Standard deviation 0.3385 0.18560 Proportion of Variance 0.0191 0.00574 Cumulative Proportion 0.9943 1.00000 Perhaps three pcs are appropriate. The rst three account for almost 74% of the total variance. It would also be possible to choose four pcs, but with six original variables, this isnt a large reduction. PC1 PC2 PC3 temp -0.32964613 0.1275974 -0.67168611 manufac 0.61154243 0.1680577 -0.27288633 population 0.57782195 0.2224533 -0.35037413 wind.speed 0.35383877 -0.1307915 0.29725334 annual.precip -0.04080701 -0.6228578 -0.50456294 days.precip 0.23791593 -0.7077653 0.09308852 PC4 PC5 PC6 temp -0.30645728 0.55805638 -0.13618780
254
Screeplot for US air data
Variances
Figure 11.8: The scree plot for the US air data.
manufac 0.13684076 -0.10204211 -0.70297051 population 0.07248126 0.07806551 0.69464131 wind.speed -0.86942583 0.11326688 -0.02452501 annual.precip -0.17114826 -0.56818342 0.06062222 days.precip 0.31130693 0.58000387 -0.02196062 Is there a sensible interpretation for these pcs? The rst pc has a high positive loading for temperature, but a high negative loading for the other variables (apart from annual precipitation). This could be seen as the contrast between temperature and the other variables: the contrast between temperature rises and other variables rising. It is hard to see any intelligent purpose in such a pc. Likewise, interpretations for the next two pcs are dicult to determine.
0.0
0.5
1.0
1.5
2.0
Module
12
. . . . . . . . . . . . . . . . . . . . . . . . . 256 258 260
Factor Analysis
Module contents
12.1 Introduction 12.2 The Procedure . . . . . . . . . . . . . . . . . . . . . . . . 257 12.2.1 Path model . . . . . . . . . . . . . . . . . . . . . . . . . 12.2.2 Steps in a fa . . . . . . . . . . . . . . . . . . . . . . . .
12.3 Factor rotation . . . . . . . . . . . . . . . . . . . . . . . . 262 12.3.1 Methods of factor rotation . . . . . . . . . . . . . . . . . 262 12.4 Interpretation of factors . . . . . . . . . . . . . . . . . . 263 12.5 The dierences between pca and fa . . . . . . . . . . . 266 12.6 Principal components factor analysis . . . . . . . . . . . 267 12.7 How many factors to choose? . . . . . . . . . . . . . . . 268 12.8 Using r . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269 12.9 Concluding comments . . . . . . . . . . . . . . . . . . . . 274 12.10Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274 12.10.1 Answers to selected Exercises . . . . . . . . . . . . . . . 277
Module objectives
256
Module 12. Factor Analysis

understand the principles underlying factor analysis; give a geometric interpretation of the factors used, where possible; perform a factor analysis from given data using r; select an appropriate number of factors using suitable techniques.
12.1
Introduction
Factor analysis is a data reduction technique very similar to pca. Indeed, many students nd it hard to see the dierences between the two methods; see Sect. 12.5 for a discussion on this issue. Activity 12.A: Read Manly, Sect. 7.1. Factor analysis refers to a variety of statistical techniques whose common objective is to represent a set of variables in terms of a smaller number of hypothetical variables or factors. pca is therefore an example of a factor analysis. Usually, however, factor analysis refers to so-called common factor analysis, which is considered here. In general, the rst step is an examination of the interrelationships between the variables. Usually correlation coecients are used as a measure of the association between variables. Inspection of the correlation matrix may reveal relationships within some subsets of variables, and that these correlations are higher than those between subsets. Factor analysis explains these observed correlations by postulating the existence of a small number of hypothetical variables or factors which are causing the observed correlations. It can be argued that, ignoring sampling errors, a causal system of factors will lead to a unique correlation system of observed variables. However the reverse is not true. Only under very limiting conditions can one unequivocably determine the underlying causal structure from the correlational structure. In practice, only a correlational structure is presented. The construction of a causal system of factors from this structure relies as much on mathematics as judgement, knowledge of the system under investigation and interpretation of the analysis. At one extreme, the researcher may not have any idea as to how many underlying factors exist. Then fa is an exploratory technique aiming at ascertaining the minimum number of hypothetical factors that can account for the observed covariation. The majority of applications of this type are in the social sciences.
12.2. The Procedure
257
fa may also be used as a means of testing specic hypotheses. A researcher with a considerable depth of knowledge of an area may hypothesize two dierent underlying dimensions or factors, and that certain variables belong to one dimension while others belong to the second. If fa is used to test this expectation, then it is used as a means of conrming a certain hypothesis, not as a means of exploring underlying dimensions. Thus, it is referred to as conrmatory factor analysis. The idea of having underlying, but unobservable, factors may sound odd. But consider an example: annual taxable income, number of cars owned, value of home, and occupation may all be measure various observable socioeconomic status indicators. Likewise, heart rate, muscle strength, blood pressure and hours of exercise per week may all be measurements of tness. The observable measurements are all aspects of the underlying factor called tness. In both cases, the true, underlying variable of interest (socioeconomic status and tness) is hard to measure directly, but can be measured using the observed variables given.
12.2
The Procedure
Activity 12.B: Read Manly, Sect 7.2 and 7.3.
Factor analysis (fa), like pca, is a data reduction technique. fa and pca are very similar, and indeed some computer programs and texts barely distingish between them. However, there are certainly dierences. As with pca, the analysis starts with n observations on p variables. These p variables are assumed to have a common set of m factors underlying them; the role of fa is to identify these factors. Mathematically, the p variables are X1 = a11 F1 + a12 F2 + + a1m Fm + e1 X2 = a21 F1 + a22 F2 + + a2m Fm + e2 . . . . . . . . . Xp = ap1 F1 + ap2 F2 + + apm Fm + ep (12.1)
where Fj are the underlying factors common to all the variables Xi , aij are called factor loadings, and the ei are the parts of each variable unique to that variable. In matrix notation, x = f + e, (12.2)
258
where the factor loadings are in the matrix . In general, the Xi are standardized to have mean zero and variance one. Likewise, the factors Fj are assumed to have mean zero and variance one, and are independent of ei . The factor loadings aij are assumed constant. Under these assumptions, var[Xi ] = 1 = a2 + a2 + + a2 + var[ei ]. i1 i2 im Hence, the observed variance in Xi is due to two components: 1. The eect of the common factors Fj , through the constants aij . Hence, the quantity a2 + a2 + + a2 is called the communality for Xi . i1 i2 im 2. The eect of the component specic to Xi , through var[ei ]. Hence var[ei ] is called the specicity or uniqueness of Xi . This can also be seen as the error variance.
12.2.1
Path model
The relationship between (observed) variables and factors is often displayed using a path model. For example, consider the (unlikely) situation where there are three observed variables, X1 , X2 and X3 , and two factors F1 and F2 . Suppose further that the factor loadings aij in Eq. (12.1) are known. Then a path model can be constructed which is consistent with the original data:
F1
F2
XXX Z a a31Z XXX 21 X X z X X Z : 2 a12Z Z a22 Z X X Z aXXXXXZ 32 ~ Z X z X X 3
> a 11
: X1
e1
e2
e3
Using properties of expectations and covariances, the original variances of the Xi (which are 1, recall) can be recovered.
12.2. The Procedure
259
Example 12.1: Consider a (hypothetical) example where three variables are observed on a number of t men: X1 is the numbers of hours of exercise performed each week; X2 is the time taken to run 10km; and X3 is the time taken to sprint 100m. The correlation matrix is 1 0.64 0.51 0.64 1 0.27 . 0.51 0.27 1 One possible allocation of the factors is shown below.
F1
F2
: X1 > 0.6 X ZXXXX 0.9 0.1Z XXX z X Z : X2 Z 0.5Z Z 0.2 XXX XXXZZ 0.9 XX ~ Z z X X 3
0.38
0.15
0.18
Note that, for example, Covar[X1 , X2 ] = Covar[0.6F1 + 0.5F2 , 0.9F1 + 0.2F2 ] = Covar[0.6F1 + 0.5F2 , 0.9F1 ] + Covar[0.6F1 + 0.5F2 , 0.2F2 ] = Covar[0.6F1 , 0.9F1 ] + Covar[0.5F2 , 0.9F1 ] + Covar[0.6F1 , 0.2F2 ] + Covar[0.5F2 , 0.2F2 ] = 0.54Covar[F1 , F1 ] + 0 + 0 + 0.1Covar[F2 , F2 ] = 0.64, as in the original correlation matrix. The communalities are given by e1 = 0.38; e2 = 0.15 and e3 = 0.18. At this stage, we are assuming F1 and F2 are orthogonal, so Covar[F1 , F2 ] = 0. (Recall var[Fi ] = Covar[Fi , Fi ] = 1 and var[Xi ] = 1.) In addition, var[X1 ] = var[0.6F1 ] + var[0.5F2 ] + var[e1 ] = 0.36 + 0.25 + 0.38 1 as required. Thus this path model represents one possible allocation of the factors; there are, however, others possible. Often, the relationships between the factors and the observable variables are given in a table:
260 F1 X1 X2 X3 0.6 0.9 0.1 F2 0.5 0.2 0.9
Is there a sensible interpretation of the factors? F1 is strongly related to the time to run 10km, and also to the hours of exercise per week; perhaps this factor could be interpretted as measuring stamina. The second factor is highly related to the time to sprint 100m, and the hours of exercise per week; perhaps this factor could be interpreted as measuring strength. Written using the matrix notation of Eq. (12.2), x = f + e 0.6 0.5 X1 X2 = 0.9 0.2 0.1 0.9 X3 where 0.38 + 0.15 , 0.18
F1 F2
0.6 0.5 = 0.9 0.2 . 0.1 0.9
12.2.2
Steps in a
FA
fa has three steps: 1. Find some provisional factor loadings. Commonly, this is done using a pca. Since the number of underlying factors is often unknown, m pcs are chosen to become the m underlying factors. Since these factors are actually pcs, they are uncorrelated. However, the choice of factors F1 , F2 , . . . , Fm is not unique. Any linear combination of these is also a valid choice for the factors. That is, F1 = d11 F1 + d12 F2 + + d1m Fm F2 = d21 F1 + d22 F2 + + d2m Fm . . . . . . . . . Fm = dm1 F1 + dm2 F2 + + dmm Fm are also valid factors. The original factor loadings are eectively replaced by T for some rotation matrix T .
12.2. The Procedure
261
2. The second step involves selecting a linear combination of the factors to help interpretation; that is, computing the dij above. This step is called rotation. There are two types of rotation: (a) Orthogonal: With this type of rotation, the factors remain orthogonal. A common example is the varimax rotation. This method maximizes ij (dij d.j )2 where d.j is the mean over i of the dij . A transformation y = Ax is orthogonal if the transformation matrix A is orthogonal; a square matrix A is orthogonal if and only if its column vectors (say, ai , a2 , . . . an ) form an orthonormal set; that is 0 if i = j aT aj = i 1 if i = j For example, the matrix P= 0.9397 0.3420 0.3420 0.9397
is orthogonal. First, write a1 = [0.9397, 0.3420]T , and a2 = [0.3420, 0.9397]T . Then, aT a1 = 0.93972 + 0.34202 = 1 and 1 aT a2 = (0.3420)2 +0.93972 = 1; also, aT a2 = (0.93970.3420)+ 2 1 (0.9397 0.3420) = 0. Thus, a transformation based on matrix P is an orthogonal transformation. (In fact, it represents a rotation of 20 .) (b) Oblique: The factors do not have to remain orthogonal with this type of rotation. The promax rotation is an example. This procedure tends to increase large loadings in magnitude relative to small loadings. 3. The third step is to compute the factors scores; that is, how much of each variable is explained by each factor. This leads to interpretations of the factors. To make interpretation easier, a good rotation should produce factor loadings so that some are close to one, and the others close to zero. Some points to note:
pca is often the rst step in a factor analysis; Factor analysis, like pca, is based on eigenvalues; Many types of rotation may be performed. The software package SPlus (which is very similar to r) implements twelve dierent criteria (Venables & Ripley [46, p 409]). The varimax method is probably the most popular.
262
12.3
Factor rotation
In general, with two or more common factors, the initial factor solution may be converted to another equally valid solution with the same number of factors by an orthogonal rotation. Such a rotation preserves the correlations and communalities amongst variables, but of course changes the loadings or correlations between the original variables and the factors. Recalling that the initial factor solution may result in loadings which do not allow easy interpretation of the factors, rotation can be used to simplify the loadings in the sense of enabling easier interpretation. The rotational process of factor analysis allows the reseacher a degree of exibility by presenting a multiplicity of views of the same data set. Obtain a parsimonious or simple structure following these guidelines: 1. Any column of the factor loadings matrix should have mostly small values, as close to zero as possible. 2. Any row of the matrix should have only a few entries far from zero. 3. Any two columns of the matrix should exhibit a dierent pattern of high and low loadings.
12.3.1
Methods of factor rotation
Orthogonal rotation discussed above preserves the orientation between the initial factors so that they are still perpendicular after rotation. In fact the initial factor axes can be rotated independently giving factors which are not necessarily perpendicular to each other but still explain the reduced correlation matrix. This rotation technique is called oblique. Orthogonal rotation methods enjoy some distinctive properties: 1. Factors remain uncorrelated. 2. The communality estimates are not aected but the proportion of variability accounted for by a given factor will change as a result of the rotation. 3. Although the total amount of variance explained by the common factors wont change with orthogonal rotation, the percentage accounted for by an individual factor will, in general, be dierent.
12.4. Interpretation of factors
263
The standard orthogonal rotation techniques are the varimax (which is in r), quartimax , and equimax methods. They each aim to simplify the factor structure but in dierent ways. Varimax is the most popular and is usually used with pca extraction. It aims to create small, medium and large loadings within a particular factor. Quartimax aims, for each variable, to obtain one and only one major loading across the factors. Equimax attempts to simplify both the rows and the columns of the structure matrix. Unfortunately, the use of orthogonal rotation techniques may not result in uncovering an easily interpretable set of factors. Also there is often no reason to believe that the hypothetical factors should be uncorrelated. Thus, it is possible to arrive at much more interpretable factors if oblique rotation is allowed. The most popular oblique factor rotation methods are promax (which is in r), oblimax , quartimin, covarimin, biquartimin, and oblimin. Similar to orthogonal rotation methods, oblique methods are designed to satisfy various denitions of simple structure, and no algorithm is clearly superior to another. Oblique methods present complexities that dont exist for orthogonal methods. They include: 1. The factors are no longer uncorrelated and hence the pattern and structure matrices will not in general be identical. 2. Communalities and variances accounted for are not invariant under oblique rotation. For more information on some popular rotation techniques, see Kim and Mueller [30]. Example 12.2: Buell & Bundgaard [10] use factor analysis to represent wind soundings over Battery MacKenzie.
12.4
Interpretation of factors
It is often useful to nd an interpretation for the resultant factors; rotation is usually performed to help with this. As with pca, nding interpretations is often quite an art, and sometimes any interpretation is dicult to nd. Sometimes using a dierent kind of rotation may help.
264
Example 12.3: Kalnicky [24] used factor analysis to classify the atmospheric circulation over the midlatitudes of the northern hemisphere from 18991969.
Example 12.4: Hannes [20] used rotated factors to explore the relationship between water temperatures measured at Blunts Reef Light Ship and the air pressure at Eureka, California. The factor loadings indicated that the water temperatures measured at Trinidad Head and Blunts Reef were quite dierent.
Example 12.5: Rogers [39] used factor analysis to nd areal patterns of anomalous sea surface temperature (SST) over the eastern North Pacic based on monthly SSTs, surface pressure and 1000500mb layer thickness over North America during 19601973.
Example 12.6: Consider Example 12.1. An orthogonal rotation can be used to rotate the matrix of factor loadings. For example (and this is probably not a practical example of a rotation but serves to demonstrate the point), an orthogonal rotation could be achieved using the matrix 3/2 1/2 . (12.3) T = 1/2 3/2 (Check this transformation matrix is orthogonal!) Then, the factor loadings become 0.6 0.5 3/2 1/2 T = 0.9 0.5 3/2 1/2 0.1 0.9 0.774 0.13 0.88 0.28 . 0.54 0.73 This allocation of factor loadings produces the following path diagram:
12.4. Interpretation of factors
265
Transformed y 4 2 0 Original x 1 2 3
Original y
1 0
3 4
1 0
Transformed x
Figure 12.1: The eect on the cartesian plane of applying the orthogonal transform in matrix T in Eq. (12.3)
F1
F2
X ZXXXX 0.88 0.54 Z XXX z X Z : X2 Z 0.13 Z 0.28 Z XXX XXXZZ 0.73 XX ~ Z z X X 3
> 0.77
: X1
0.38
0.15
0.18
Note that still Covar[X1 , X2 ] = Covar[0.77F1 + 0.13F2 , 0.88F1 0.28F2 ] = (0.77 0.88) + (0.1 0.28) 0.64. It is not clear that this (arbitrary) rotation helps aid interpretation; it has been used merely to demonstrate the concepts. The transformation represents a rotation of 30 (Fig. 12.2).
Example 12.7: A non-orthogonal rotation for Example 12.1 can be obtained using the rotation matrix S= 1.07 0.288 0.116 1.04 .
Then, the factor loadings become 0.58 0.35 S 0.94 0.052 . 0.0023 0.90
266
Transformed y 4 3 2 1 0 1 2
Original y
3 2 1
3 4
1 0
Original x
Transformed x
Figure 12.2: The eect on the cartesian plane of applying the oblique transform in matrix S in Eq. (12.7) With oblique rotations, matters become more complicated because now the factors are correlated. In a path diagram, this is indicated as shown below, where r is the correlation between the two factors.
: X1 > 0.58 XXX F1 Z 0.94 0.0023 XXXXX Z 6 z X X Z r : 2 ? 0.35 Z Z 0.052 Z XXX F2 XXXZZ 0.90 XX ~ Z z X X 3
0.38
0.15
0.18
12.5
The differences between PCA and FA
pca and factor analysis are similar methods, which is often a source of confusion for students. This section lists some of the dierence (also see Mardia, Kent & Bibby [34, 9.8]). 1. As seen above, a pca is often a rst step in a factor analysis. 2. There is an essential dierence between the two analyses. In pca, the hypothetical new variables (the principal components) are dened as linear combinations of the observed variables. In factor analysis, it is the other way around: The observed variables are conceptualized as being linear composites of some unobserved variables or factors.
12.6. Principal components factor analysis
267
3. In pca, the major objective is to select a number of components that explain as much of the total variance as possible. The values of the principal components for an individual are relatively simple to compute and interpret usually. In contrast, the factors obtained in factor analysis are selected mainly to explain the interrelationships between the original variables. 4. In pca, computations are started with the covariance matrix or the correlation matrix. In factor analysis computations often begin with a reduced correlation matrix, a matrix in which the 1s on the main diagonal are replaced by communalities. These are further explained below. 5. In pca, the principal components is just a transformation of the original data, with no assumptions made about the form of the covariance matrix of the data. In factor analysis, a denite form is assumed.
12.6
Principal components factor analysis
In the previous section, dierences between fa and pca were pointed out. However, pca can actually be used to assist in performing a fa. This is called principal components factor analysis, and uses a pca to perform the rst step in the fa (note that this is not the only option) from which the next two steps can be done. This idea is presented in this section. Begin with p original variables Xi for i = 1 . . . p. Performing a pca will produce p pcs, Zi for i = 1 . . . p. The pcs are dened as Z1 = b11 X1 + b12 X2 + + b1p Xp . . . . . . . . . Zp = bp1 X1 + bp2 X2 + + bpp Xp where the bij are given by the eigenvectors of the correlation matrix. In matrix form, write Z = BX. Since B is a matrix of eigenvectors, B 1 = B T , so also X = B T Z, or X1 = b11 Z1 + b21 Z2 + + bp1 Zp . . . . . . . . . Xp = b1p Z1 + b2p Z2 + + bpp Zp Now in a factor analysis, we only keep m of the p factors; hence X1 = b11 Z1 + b21 Z2 + bp1 Zm + e1 . . . . . . . . . Xp = b1p Z1 + b2p Z2 + bmp Zm + ep
268
where the ei are unexplained components after omitting the last p m pcs. In this equation, the bij are like factor loadings. But true factors have a variance of one; here, var[Zi ] = i since the Zi is a pc. This means the Zi are not true factors. Of course, the Zi can be rescaled to have a variance of one: X1 = ( 1 b11 )Z1 / . . . . . . . . . Xp = ( 1 b1p )Z1 1 + ( 2 b21 )Z2 2 + + ( m bm1 )Zm 2 + + ( p bmp )Zm m + e1
1 + ( 2 b2p )Z2
m + ep
when we can also write X1 = a11 F1 + b12 F2 + b1m Fm + e1 . . . . . . . . . Xp = ap1 F1 + ap2 F2 + bpm Fm + em , where Fi = Zi / i and aij = bji i (note the subscripts carefully!). In matrix form, X = f + e. A rotation can be perfomed by writing X = T f + e for an appropriate rotation matrix T .
12.7
How many factors to choose?
In pca, there were some guidelines for selecting the number of pcs. Similar guidelines also exist for factor analysis. r will not let you have too many factors; for example, if you try to extract three factors from four variables, you will be told this is too many. As usual, there are two competing criteria: To have the simplest model possible, and to explain as much of the variation as possible. There is no easy answer to explain how many factors are chosen. This is one of the major criticism of fa. Try to nd a number of factors that explains as much variation as possible (using the communalities and uniquenesses), but is not too complicated, and preferably leads to a useful interpretation. The best methods is probably to perform a pca, and note that best number of pcs, and then use this many factors in the fa.
12.8. Using r
269
Note also that choosing the number of factors is a separate issue to the rotation. The rotation will not alter the communalities or uniquenesses. The rst step is therefore to decide on the number of factors using communalities and uniquenesses, and then try various rotations to nd the best interpretation.
12.8
Using R
r can be used to perform factor analysis using the function factanal. The help le for this r function states The t is done by optimizing the log likelihood assuming multivariate normality over the uniquenesses. Actually doing this is beyond the scope of this course; we will just use r trusting the code gives sensible answers. Example 12.8: Consider the European employment data used by Manly in Example 7.1. (While not a climatological, it will demonstrate how to do equivalent analyses in r.) The following code analyses the data. First, Manlys Table 7.1 can be found directly, or using factanal: The factor analysis without rotation, shown in the middle of Manly p 101, can be obtained as follows: > > > > ee <- read.table("europe.txt", header = TRUE) cmat <- cor(ee) ee.fa4 <- factanal(ee, factors = 4, rotation = "none") print(ee.fa4$loadings, cutoff = 0)
Loadings: Factor1 AGR -0.961 MIN 0.143 MAN 0.744 PS 0.582 CON 0.449 SER 0.601 FIN 0.103 SPS 0.697 TC 0.615
Factor2 0.178 0.625 0.416 0.576 0.034 -0.327 -0.121 -0.672 -0.121
Factor3 -0.178 -0.410 -0.102 -0.017 0.376 0.600 0.631 -0.138 -0.233
Factor4 0.094 -0.078 -0.508 0.569 -0.375 0.089 0.228 0.196 0.146
270
Module 12. Factor Analysis Factor1 Factor2 Factor3 Factor4 3.274 1.516 1.184 0.858 0.364 0.168 0.132 0.095 0.364 0.532 0.664 0.759
SS loadings Proportion Var Cumulative Var
Notice that the value are not identical to those shown in Manly; there are numerous dierent algorithms for factor analysis, so this is of no concern. The help for the function factanal in r states
There are so many variations on factor analysis that it is hard to compare output from dierent programs. Further, the optimization in maximum likelihood factor analysis is hard, and many other examples we compared had less good ts than produced by this function.
The values are, however, similar. The signs are dierent, but this is of no consequence. The results using the varimax rotation are obtained as follows: > ee.fa4r <- factanal(ee, factors = 4, rotation = "varimax") > print(ee.fa4r$loadings, cutoff = 0) Loadings: Factor1 Factor2 Factor3 AGR -0.695 -0.633 -0.278 MIN -0.142 0.194 -0.546 MAN 0.199 0.882 -0.293 PS 0.205 0.086 0.084 CON 0.081 0.644 0.250 SER 0.427 0.368 0.720 FIN -0.022 0.041 0.686 SPS 0.972 0.051 0.197 TC 0.614 0.160 -0.061
Factor4 -0.185 0.479 0.302 0.969 -0.033 0.023 0.055 -0.091 0.249
Factor1 Factor2 Factor3 Factor4 2.097 1.803 1.563 1.368 0.233 0.200 0.174 0.152 0.233 0.433 0.607 0.759
Again, the factors are not identical, but are similar. The communalities are not produced by r; instead, the uniqueness is computed (these are called specicity in Manly). Simply, the variance of each factor (that is, the eigenvalues) consists of two parts: the uniqueness plus the communality. The communalities represent the proportion of each variable that is shared with the other variables
12.8. Using r
271
through the common factors. The uniqueness is the proportion of the variance unique to each variable and not shared with the other variables. The communalities are computed in r as follows: > 1 - ee.fa4$uniqueness AGR MIN MAN PS CON 0.9950000 0.5852025 0.9950000 0.9950000 0.4853668 SER FIN SPS TC 0.8366518 0.4758879 0.9950000 0.4682174 Again, while they are somewhat similar to those shown in Manly, they are not identical. We now show how to extract the scores from the factor analysis. In this example, the scores represent how each country scores on each factor. First, we need to adjust the call to factanal by adding scores="regression": > ee.fa4scores <- factanal(ee, factors = 4, + scores = "regression") > names(ee.fa4scores) [1] [4] [7] [10] [13] "converged" "correlation" "dof" "STATISTIC" "call" "loadings" "criteria" "method" "PVAL" "uniquenesses" "factors" "scores" "n.obs"
> ee.fa4scores$scores Factor1 0.735864909 1.660414922 0.196749209 0.367203694 0.109519146 -0.243011308 -0.387309499 0.998482010 1.296604056 -0.554916391 0.674141842 -1.463580232 0.910020084 -0.582448069 Factor2 0.39347899 -0.66961020 0.34219028 1.17655067 -1.15479327 0.72606452 1.19266940 -0.43751930 -0.10590831 0.47914938 -0.53711442 -0.79224633 -0.36175539 -0.01995971
Belgium Denmark France W.Germany Ireland Italy Luxemborg Netherlands UK Austria Finland Greece Norway Portugal
272 Spain Sweden Switzerland Turkey Bulgaria Czechoslovakia E.Germany Hungary Poland Romania USSR Yugoslavia
Module 12. Factor Analysis -1.446012474 0.95896642 1.826033446 -0.42669861 -0.904728288 2.13805967 -1.041975726 -2.66833845 -0.045481468 0.55490800 0.000259302 0.62003588 0.668431037 1.24247646 0.027698217 -0.79180745 -0.354297646 -0.48659964 -1.022644195 0.39156263 0.760473031 -0.50031050 -2.185489608 -1.26345072 Factor3 Factor4 Belgium 1.05675332 -0.29779175 Denmark 0.42843017 -1.16880321 France 0.76331126 -0.16036086 W.Germany -0.55782348 -0.15329451 Ireland 0.78688797 1.07501857 Italy 0.52722797 -1.17426449 Luxemborg 0.95859219 -0.38523076 Netherlands 1.45523464 -0.04966847 UK 0.16428566 1.06316594 Austria 0.88978886 1.33940370 Finland 0.38459638 0.93802719 Greece 0.60623551 -0.51373635 Norway 1.10665730 -0.54306128 Portugal 0.05541367 -0.72558905 Spain 0.69160242 -0.40625354 Sweden -0.11120997 -0.63396595 Switzerland 0.32945045 -0.32444541 Turkey -1.07926842 -1.66123747 Bulgaria -1.64948488 -0.73494006 Czechoslovakia -1.36338513 0.86309299 E.Germany -1.65825030 0.96913085 Hungary -0.72829821 2.83417294 Poland -0.93180860 0.18018067 Romania -1.52721874 -0.52834041 USSR -1.32194122 -0.83876243 Yugoslavia 0.72422119 1.03755315 These scores may be used in further analysis. For example, a factor analysis (or pca) is often used to reduce the number of covariates used in a regression analysis. Suppose, in this example, the given variables were to be used in a regression analysis where the response variable
12.8. Using r
273
is gross domestic product (GDP). (There is no such variable, but this will demonstrate the ideas). In r, to perform the regression of GDP against the four factors identied above, use > ee.lm <- lm(GDP ~ ee.fa4scores) To learn more about this regression t, use > summary(ee.lm) > names(ee.lm)
Example 12.9: Consider Example 11.10, where a pca was performed on Manlys sparrow data. Here, a fa is conducted for comparison. > sp <- read.table("sparrows.txt", header = TRUE) > sp.fa.vm <- factanal(sp, factors = 2, + rotation = "varimax") > loadings(sp.fa.vm) Loadings: Factor1 Length 0.370 Extent 0.659 Head 0.638 Humerus 0.901 Sternum 0.475
Factor2 0.926 0.530 0.459 0.317 0.463
Factor1 Factor2 SS loadings 2.017 1.665 Proportion Var 0.403 0.333 Cumulative Var 0.403 0.736 > 1 - sp.fa.vm$uniqueness Length Extent Head Humerus Sternum 0.9950000 0.7151875 0.6186305 0.9123528 0.4403432 > sp.fa.pm <- factanal(sp, factors = 2, + rotation = "promax") > loadings(sp.fa.pm) Loadings: Factor1 Factor2
274 Length -0.184 Extent 0.588 Head 0.614 Humerus 1.138 Sternum 0.358 1.143 0.293 0.200 -0.234 0.338 Factor1 Factor2 2.180 1.601 0.436 0.320 0.436 0.756
> 1 - sp.fa.pm$uniqueness Length Extent Head Humerus Sternum 0.9950000 0.7151875 0.6186305 0.9123528 0.4403432
12.9
Concluding comments
Activity 12.C: Read Manly, Sect. 7.6.
Factor analysis is perceived as valuable by many, and with scepticism by many others. We present the technique here as a tool, without judgement. Of note, however, is that Wilks [49] does not consider fa; he only mentions in passing that pca and fa are distinct methods.
12.10
Exercises
Ex. 12.10: The data le toowoomba.dat contains the daily rainfall, maximum and minimum temperatures, radiation, pan evaporation and maximum vapour pressure decit (in hPa) at Toowoomba from 1 January 1889 to 21 July 2002 (a total of 41474 observations on three variables). Perform a fa to nd two underying factors, and compare the factors using no rotation, promax rotation and varimax rotation. Ex. 12.11: In a certain factor analysis, the factor loadings were computed as shown in the following table.
12.10. Exercises F1 X1 X2 X3 X4 0.3 0.8 0.1 0.6 F2 0.5 0.1 0.8 0.7
275
(a) Draw the path model for this problem. (b) Determine the uniqueness for each variable. Ex. 12.12: Consider again the air quality data from 41 cities in the USA, as seen in Example 10.1. For each city, seven variables have been measured (see p 218). The rst is the concentration of SO2 in microgram per cubic metre; the other six are potential identiers of pollution problems. The original source treats the concentration of SO2 as a response variable, and the other six as covariates. (a) Is it possible to reduce these six covariates to a smaller number, without losing much information? How many factors are adequate? (b) Use an appropriate fa to perform a data reduction. If possible, nd a useful interpretation of the resultant factors. (c) Perform a regression analysis using SO2 as the response, and the factors as regressors. Compare to a regression of SO2 on all the original variables, and comment. (To regress variables A and B against Y in r, use m1 <- lm ( Y ~ A + B); then names(m1) and summary(m1) may prove useful.) Ex. 12.13: The data le gatton.dat contains the daily rainfall, maximum and minimum temperatures, radiation, pan evaporation and maximum vapour pressure decit (in hPa) at Gatton from 1 January 1889 to 21 July 2002 from 1 January 1889 to 15 September 2002 (a total of 41474 observations on six variables). Perform a fa to nd two underying factors, and compare the factors using no rotation, promax rotation and varimax rotation. Ex. 12.14: The data le strainfall.dat contains the average month and annual rainfall (in tenths of mm) for 363 Australian rainfall stations. (a) Perform a fa. How many factors seems necessary? (b) How many factors are useful? (c) If possible, nd a rotation that provides a useful interpretation for the factors.
276
Ex. 12.15: The data le jondaryn.dat contains the daily rainfall, maximum and minimum temperatures, radiation, pan evaporation and maximum vapour pressure decit (in hPa) at Jondaryn from 1 January 1889 to 21 July 2002 (a total of 41474 observations on six variables). Perform a fa to nd two underying factors, and compare the factors using no rotation, promax rotation and varimax rotation. Ex. 12.16: The data le emerald.dat contains the daily rainfall, maximum and minimum temperatures, radiationp, an evaporation and maximum vapour pressure decit (in hPa) at Emerald from 1 January 1889 to 21 July 2002 (a total of 41474 observations on six variables). Perform a fa to nd two underying factors, and compare the factors using no rotation, promax rotation and varimax rotation. Ex. 12.17: The data le wind_ca.dat contains numerous weather and wind measurements from Canberra during 1989. (a) Perform a fa on the data. (b) How many factors are necessary to summarize the data? Explain. (c) If possible, interpret the factors. What rotation makes for easiest interpretation? Ex. 12.18: The data le wind_wp.dat contains numerous weather and wind measurements from Wilsons Promontory, Victoria (the most southerly point of mainland Australia) during 1989. (a) Perform a fa on the data. (b) How many factors are necessary to summarize the data? Explain. (c) If possible, interpret the factors. What rotation makes for easiest interpretation? Ex. 12.19: The data le qldweather.dat contains six weather-related variables for 20 Queensland cities. (a) Perform a fa. How many factors seems necessary? (b) How many factors are useful? (c) If possible, nd a rotation that provides a useful interpretation for the factors. Ex. 12.20: This question concerns a data set that is not climatological, but you may nd interesting. The data le chocolates.dat, available from http://www.sci.usq.edu.au/staff/dunn/Datasets/applications/ popular/chocolates.html, contains measurements of the price, weight and nutritional information for 17 chocolates commonly available in Queensland stores. The data was gathered in April 2002 in Brisbane.
12.10. Exercises (a) Perform a fa using the nutritional information. (b) How many factors are useful?
277
(c) If possible, nd a rotation that provides a useful interpretation for the factors.
12.10.1
12.10 Here is a brief analysis. > > + > + > + tw <- read.table("toowoomba.dat", header = TRUE) tw.2.n <- factanal(tw[4:9], factors = 2, rotation = "none") tw.2.v <- factanal(tw[4:9], factors = 2, rotation = "varimax") tw.2.p <- factanal(tw[4:9], factors = 2, rotation = "promax")
278
Module
13
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280 280 280 281 281 281 281 287 288 290
Cluster Analysis
Module contents
13.1 Introduction . . . . . . . . . . . . . . . . . . . 13.2 Types of cluster analysis . . . . . . . . . . . . 13.2.1 Hierarchical methods . . . . . . . . . . . . . . 13.3 Problems with cluster analysis . . . . . . . . 13.4 Measures of distance . . . . . . . . . . . . . . 13.5 Using PCA and cluster analysis . . . . . . . . 13.6 Using r . . . . . . . . . . . . . . . . . . . . . . . 13.7 Some nal comments . . . . . . . . . . . . . . 13.8 Exercises . . . . . . . . . . . . . . . . . . . . . . 13.8.1 Answers to selected Exercises . . . . . . . . .
Module objectives
understand the principles underlying cluster analysis; compute clusters using r; select an appropriate number of cluster for a given data; plot a dendrogram using r.
279
280
Module 13. Cluster Analysis
13.1
Introduction
Cluster analysis, unlike PCA and factor analysis, is a classication technique. Activity 13.A: Read Manly, section 9.1. Example 13.1: Kavvas and Delleur [28] use a cluster analysis for modelling sequences of daily rainfall in Indiana.
Example 13.2: Fritts [17] describes two techniques for examining the relationship between ring-width of conifers in western North America and climatic variables. The second technique is a cluster analysis which he uses to identify similarities and dierences in the response function and then to classify the tree sites.
13.2
Types of cluster analysis

Activity 13.B: Read Manly, sections 9.2.
The simple idea of cluster analysis is explained in Manly, section 9.1. The actual mechanics, however, can be performed in numerous ways. Manly discusses two of these. Two methods are hierarchical clustering (using hclust),also the rst mentioned by Manly; and k-means clustering (using kmeans),the second method mentioned by Manly. The hierarchical methods are discussed in more detail in both Manly and these notes.
13.2.1
Hierarchical methods
Activity 13.C: Read Manly, section 9.3.
The hierarchical methods discussed in this section are well explained by the text. The third method, using group averages, can be performed in r using the option method="average" in the call to hclust. A similar approach to the rst method is found using the option method="single". r also provides other hierarchical clustering methods; see ?hclust.
13.3. Problems with cluster analysis
281
13.3
Problems with cluster analysis

Activity 13.D: Read Manly, section 9.4.
13.4
Measures of distance
The hierarchical clustering methods are all based on measures of distance between observations. There are dierent measures of distance that can be used besides the standard Euclidean distance. Activity 13.E: Read Manly, sections 9.5, 5.1, 5.2 and 5.3.
13.5
Using PCA and cluster analysis
As mentioned in Sect. 11.3, PCA is often a preliminary step before conducting a cluster analysis. Activity 13.F: Read Manly, section 9.6. Example 13.3: Stone & Auliciems [42] use a combination of cluster analysis and PCA to dene phases of the Southern Oscillation Index (SOI).
13.6
Using R
Cluster analysis can be performed in r, as briey been mentioned previously. The primary functions to use for hierarchical methods are hclust (which performs the clustering), and dist (which computes the distance matrix of which the clustering is based). The default distance measure is the standard Euclidean distance. After hclust is used, the resultant object can be plotted; the default plot is the dendrogram (Manly, Figure 9.1). For k-means clustering (called partitioning in Manly), the function kmeans can be used.
282
Example 13.4: Consider the example concerning European countries used by Manly in Example 9.1. (While not a climatological, it will demonstrate how to do equivalent analyses in r.) The following code analyses the data. First the data is loaded, the names of the countries extracted, and the rest of the variables re-labelled as ec: > ec <- read.table("europe.txt", header = TRUE) Then, attach the data: > attach(ec) The example in Manly uses standardized data (see the top of page 135). Here is one way to do this in r: > ec.std <- scale(ec) Now that the data is prepared, the clustering can commence. The clustering method used in the Example is the nearest neighbour method; the most similar of the methods available in r is called method="single". The distance measure used is the default Euclidean distance. > es.hc <- hclust(dist(ec.std), method = "single") > plot(es.hc, hang = -1) The nal plot, shown in Fig. 13.1, looks very similar to that shown in Manly Figure 9.3. You can try other methods if you want to experiment. To then determine which countries are in which cluster, the function cutree is used, here is an example of extracting four clusters: > cutree(es.hc, k = 4) Belgium Denmark 1 1 W.Germany Ireland 1 1 Luxemborg Netherlands 1 1 Austria Finland 1 1 Norway Portugal 1 1 Sweden Switzerland 1 1 Bulgaria Czechoslovakia France 1 Italy 1 UK 1 Greece 1 Spain 2 Turkey 3 E.Germany
13.6. Using r
283
Cluster Dendrogram
5 Height 1 Turkey Yugoslavia Spain USSR Hungary Czechoslovakia E.Germany Romania Bulgaria Poland Luxemborg Italy Greece Portugal Norway Switzerland Ireland Austria UK Finland W.Germany Denmark Sweden Netherlands Belgium France dist(ec.std) hclust (*, "single")
Figure 13.1: The dendrogram after tting a hierarchical clustering model (using the single agglomeration method) to the European countries data
284 1 Hungary 1 USSR 1 1 Poland 1 Yugoslavia 4
Module 13. Cluster Analysis 1 Romania 1
> sort(cutree(es.hc, k = 4)) Belgium 1 W.Germany 1 Luxemborg 1 Austria 1 Norway 1 Switzerland 1 E.Germany 1 Romania 1 Turkey 3 Denmark France 1 1 Ireland Italy 1 1 Netherlands UK 1 1 Finland Greece 1 1 Portugal Sweden 1 1 Bulgaria Czechoslovakia 1 1 Hungary Poland 1 1 USSR Spain 1 2 Yugoslavia 4
Later (Example 13.6), we will see that using Wards method is common in the climatological literature. This produces four dierent clusters (Fig. 13.2.) > es.hc.w <- hclust(dist(ec.std), method = "ward") > plot(es.hc.w, hang = -1) > sort(cutree(es.hc.w, k = 4)) Belgium 1 Ireland 1 Austria 1 Sweden 1 Luxemborg Denmark 1 Netherlands 1 Finland 1 W.Germany 2 Greece France 1 UK 1 Norway 1 Italy 2 Portugal
13.6. Using r
285
Cluster Dendrogram
15 Height 0 UK Finland Ireland Austria Netherlands Belgium France Norway Denmark Sweden Luxemborg Italy W.Germany Switzerland Spain Greece Portugal Turkey Yugoslavia Hungary Czechoslovakia E.Germany USSR Romania Bulgaria Poland dist(ec.std) hclust (*, "ward")
Figure 13.2: The dendrogram after tting a hierarchical clustering model (using Wards method) to the European countries data
2 Spain 2 Yugoslavia 3 E.Germany 4 Romania 4
10
2 2 Switzerland Turkey 2 3 Bulgaria Czechoslovakia 4 4 Hungary Poland 4 4 USSR 4
Which clustering seems to produce the more sensible clusters? Why?
Example 13.5: On page 137, Manly discusses using the partitioning, or k-means, method, on the European cities data. This can also be done in r; rstly, grouping into two groups: > ec.km2 <- kmeans(ec, centers = 2) > row.names(ec)[ec.km2$cluster == 1]
286 [1] [3] [5] [7] [9] [11] [13] [15] [17] [19] [21] "Belgium" "France" "Ireland" "Luxemborg" "UK" "Finland" "Portugal" "Sweden" "Bulgaria" "E.Germany" "USSR"
Module 13. Cluster Analysis "Denmark" "W.Germany" "Italy" "Netherlands" "Austria" "Norway" "Spain" "Switzerland" "Czechoslovakia" "Hungary"
> row.names(ec)[ec.km2$cluster == 2] [1] "Greece" [4] "Romania" "Turkey" "Poland" "Yugoslavia"
These are dierent groups that given in Manly (since a dierent algorithm is used). Six groups can also be specied: > ec.km6 <- kmeans(ec, centers = 6) > row.names(ec)[ec.km6$cluster == 1] [1] "Greece" "Yugoslavia"
> row.names(ec)[ec.km6$cluster == 2] [1] "W.Germany" "Switzerland" [3] "Czechoslovakia" "E.Germany" > row.names(ec)[ec.km6$cluster == 3] [1] "Belgium" [4] "UK" "Denmark" "Norway" "Netherlands" "Sweden"
> row.names(ec)[ec.km6$cluster == 4] [1] "Ireland" [5] "Hungary" "Portugal" "Spain" "Poland" "Romania" "Bulgaria" "USSR"
> row.names(ec)[ec.km6$cluster == 5] [1] "Turkey" > row.names(ec)[ec.km6$cluster == 6]
13.7. Some nal comments [1] "France" [4] "Austria" "Italy" "Finland" "Luxemborg"
287
Example 13.6: Unal, Kindap and Karaca [45] use cluster analysis to analyse Turkeys climate. The abstract states: Climate zones of Turkey are redened by using . . . cluster analysis. Data from 113 climate stations for temperatures (mean, maximum and minimum) and total precipitation from 1951 to 1998 are used after standardizing with zero mean and unit variance, to conrm that all variables are weighted equally in the cluster analysis. Hierarchical cluster analysis is chosen to perform the regionalization. Five different techniques were applied initially to decide the most suitable method for the region. Stability of the clusters is also tested. It is decided that Wards method is the most likely to yield acceptable results in this particular case, as is often the case in climatological research. Seven dierent climate zones are found, as in conventional climate zones, but with considerable dierences at the boundaries.
In the above quote, it is noted that Wards method is commonly used in climatology. This is specied in r as follows: hclust( dist( data ), method="ward") The clusters produced using dierent methods can be quite dierent (the default method is the complete agglomeration method).
13.7
Some nal comments
A cluster analysis is generally used to classify data into clusters. It is rarely obvious how many clusters is ideal. There are, however, hypothesis tests available for helping make this decision (see Wilks [49, p 424] for some references). In terms of hierarchical cluster analysis as described here, the dendrogram can help in the decision. One can select a value of the Height
288
or Distance appropriately. An appropriate distance may be that value under which the clusters change rapidly; alternatively, interpretations may aid the clustering. After clustering, there are sometimes useful labels that can be applied to the clusters (in a data set containing climatic variables for various countries, for example, countries may be clustered by climatic regions and labels such as Desert, Mediterranean etc, may be applied). In addition, it is often recommended that data be rst standardized to similar scales before a cluster analysis is performed, especially if the data are in dierent units. (That is, subtract the mean from the variable, and divide by the standard deviation; use the r function scale). This prevents variables with large variances dominating the distance measure.
13.8
Exercises
Ex. 13.7: Try to reproduce Manlys Figure 9.4 by standardizing and then using hclust. Ex. 13.8: The data le tempppt.dat contains the average July temperature (in F ) and the average July precipitation for 28 stations in the USA. Each station has also been classied as belonging to southeastern, central or northeastern USA. (a) Plot the temperature and precipitation data on a set of axes, identifying on the plot the three regions the stations are from. Do the three regions appear to form clusters? (b) Perform a cluster analysis using the temperature and precipitation data. Use various clustering methods and compare. (c) How well are the stations clustered according to the three predened classications? (d) Using a dendrogram, which two regions are most similar? Ex. 13.9: The data le strainfall.dat contains the average month and annual rainfall (in tenths of mm) for 363 Australian rainfall stations. (a) Perform a cluster analysis using the monthly averages. Use various clustering methods and compare. (b) Using a dendrogram, how many classications seems useful? Ex. 13.10: Consider the data le strainfall.dat again. (a) Perform a PCA on the data. Show that two PCs is reasonable. (b) Plot the rst PC against the second PC. What does this indicate?
13.8. Exercises (c) Perform a cluster analysis on the rst two PCs. (d) Using a dendrogram, how many classications seems useful?
289
Ex. 13.11: This question concerns a data set that is not climatological, but you may nd interesting. The data le chocolates.dat, available from http://www.sci.usq.edu.au/staff/dunn/Datasets/applications/ popular/chocolates.html, contains measurements of the price, weight and nutritional information for 17 chocolates commonly available in Queensland stores. The data was gathered in April 2002 in Brisbane. (a) Perform a cluster analysis using the nutritional information using various clustering methods, and compare. (b) Using a dendrogram, how many classications of seems useful? What broad names could be given to these classications? Ex. 13.12: The data le ustemps.dat contains the normal average January minimum temperature in degrees Fahrenheit with the latitude and longitude of 56 U.S. cities. (See the help le for full details.) Perform a cluster analysis. How many clusters seem appropriate? Explain. Ex. 13.13: In Exercise 11.18, the US pollution data was examined, and a PCA performed. (a) Perform a cluster analysis of the rst two PCs. Produce a dendrogram. Does it appear the cities can be clustered into a small number of groups, based on the rst two PCs? (b) Repeat the above exercise, but use the rst three PCs. Compare the two cluster analyses. Ex. 13.14: The data in the le qldweather.dat contains six weather-related data for 20 Queensland cities (covering temperatures, rainfall, number of raindays, humidity) plus elevation. (a) Perform a PCA to summarise the seven variables into a small number. How many PCs seems appropriate? (b) Using the rst three PCs, perform a cluster analysis (use Wards method). (c) Plot a dendrogram. Can you identify and nd useful names for some clusters? A map of Queensland may be useful (Fig. 13.3). (d) Compare your results to a cluster analysis on all numerical variables. (e) Based on the cluster analysis of all variables, use cutree to divide into four clusters. (f) Plot a star plot, and see if the clusters can be identied.
290
Weipa
Cairns Atherton Innisfail Townsville Mt.Isa Mackay
Rockhampton Birdsville Theodore
Cunnamulla
Gladstone Childers Maryborough Gympie Roma Nambour Brisbane Toowomba Mt.Tamborine Warwick Stanthorpe
Figure 13.3: A map of Queensland may be useful to label the clusters in Exercise 13.14. Ex. 13.15: The data in the le countries.dat contains numerous variables from a number of countries, and the countries have been classied by region. (a) Perform a cluster analysis on the original data. Given the regions of the countries, is there a sensible clustering that emerges? Explain. (b) Perform a PCA on the data. How many PCs seem necessary? Let this number of PCs be p. (c) Cluster the rst p PCs. Given the regions of the countries, is there a sensible clustering that emerges? Explain. (d) How do these clusters compare to the clusters identied using all the data?
13.8.1
13.7 The following r code will work: > mn <- read.table("mandible.txt", header = TRUE) > mn.hc <- hclust(dist(scale(mn)), method = "single") > plot(mn.hc, hang = -1)
Appendix
Installing other packages in R
Installing extra r packages can be a tricky business in Windows (I have never had any trouble in Linux, however). To install the packages oz and ncdf as used in Section 11.5, there are a couple of options.
First try using the menu in r: click Packages|Install packages from CRAN. I have never got this to work for me, but some people have. If the above doesnt work, there is an alternative. Follow these steps:
1. Check your version of r by typing version at the r prompt. 2. If the CD has packages for your version of r, then use the r menu to select Packages|Install package from local zip le and install the packages from the zip les on the CD. 3. If this doesnt work, or your version of r diers from that on the CD, use your browser to go to: http://mirror.aarnet.edu. au/pub/CRAN/, then click on r Binaries, then Windows and then contrib. (Alternatively, go straight to http://mirror.aarnet. edu.au/pub/CRAN/bin/windows/contrib/ directly). 4. Select the directory/folder that corresponds to your version of r. 5. Then download the zip le you need and put it somewhere that youll remember.
291
292
Appendix A. Installing other packages in R 6. Then, from the r menu, select Install package from local zip le. Then point to where you saved the le.
Then you should have the package installed ready for use. At the r prompt, you can then type library(oz), for example, and the library is loaded.
Appendix
Review of statistical rules
B.1
Basic denitions
Experiment: Any situation where the outcome is uncertain is called an experiment. An experiments range from the simple tossing of a coin, to the complex simulation of a queuing system. Sample space: For any experiment, the sample space S of the experiment consists of all possible outcomes for the experiment. For example, the sample space of a coin toss is simply S = {tail, head}, usually abbreviated to {T, H}, whereas for a queuing system the sample space is the huge set of all possible realisations over time of people arriving and being served in the queue. Event: An event E consists of any collection of points (set of outcomes) in the sample space. For example, in the coin toss there are two possible outcomes: either T or H. These engender three possible nontrivial events: {T }, {H} and {T, H} (this last event always happens). Whereas when two coins are tossed there are four possible outcomes: either T T , T H, HT or HH (using what I trust is an obvious notation). There are then fteen possible nontrivial events such as: two heads, E1 = {HH}; the rst 293
294
Appendix B. Review of statistical rules coin is a head, E2 = {HT, HH}; at least one of the coins is a tail, E3 = {T T, T H, HT }; etc. This denition of an event as a set of outcomes is very important as it allows us to discuss events at level appropriate to the circumstances. For example, a driver is the event drunk if his/her blood-alcohol content is above 0.05% . This groups all the possible outcomes of the level of alcohol (a real percentage) into two possible sets, that is, events: drunk or not drunk.
Mutually exclusive: A collection of events E1 , E2 , E3 , . . . are mutually exclusive if for i = j, Ei and Ej have no outcomes in common. For example, when tossing two coins, the above events E1 and E3 are mutually exclusive because HH (the only outcome in E1 ) is not in E3 . But, E2 and E3 is not mutually exclusive because HT is in both; neither is E1 and E2 mutually exclusive. The probabilities of events must satisfy the following rules of probability.
For any event E, Pr {E} 0 . Something always happens: Pr {S} = 1 . If E1 , E2 ,. . . , En are mutually exclusive events, then
n
Pr {E1 E2 E3 En } =
j=1
Pr {Ej } .
For example, we used these last two properties to determine the steady state probabilities in a queue. Let event Ej denote that the queue is in state j (that is, with j people in the queue). These are clearly mutually exclusive events as the queue cannot be in two states at once. Further the sample space is the union of all possible states: S = E0 E1 E2 and hence 1 = Pr {S} = Pr {E0 E1 E2 } = Pr {E0 } + Pr {E1 } + Pr {E2 } + = 0 + 1 + 2 + . Pr E = 1 Pr {E} where E is the complement of E, that is, E is the set of outcomes that are not in E . We used this before too. For example, the the event E be that none are waiting in the queue, that is the system is in states 0 or 1, then Pr E = 1 Pr {E} = 1 Pr {E0 E1 } = 1 Pr {E0 } Pr {E1} gives the probability that there is someone waiting in the queue.
B.1. Basic denitions

If two events are not mutually exclusive, then
295
Pr {E1 E2 } = Pr {E1 } + Pr {E2 } Pr {E1 E2 } . This is known as the general addition rule of probability. Note that if E1 and E2 are mutually exclusive then Pr {E1 E2 } = 0 and so Pr {E1 E2 } = Pr {E1 } + Pr {E2 } as given above.
For two events E1 and E2 , the conditional probability that event E2 will occur given that E1 has already occurred, is
Pr {E2 | E1 } =
Pr {E1 E2 } . Pr {E1 }
This gives rise to the general multiplication rule: Pr {E2 E1 } = Pr {E1 } Pr {E2 | E1 } = Pr {E2 } Pr {E1 | E2 } . Events E1 and E2 are termed independent if and only if Pr {E2 | E1 } = Pr {E2 }, or equivalently Pr {E1 | E2 } = Pr {E1 }, or equivalently Pr {E2 E1 } = Pr {E1 } Pr {E2 }. Example B.1: The probability that a person convicted of dangerous driving will be ned is 0.87, and the probability that he/she will lose his/her licence is 0.52. The probability that such a person will be ned and lose their licence is 0.41. What is the probability that a person convicted of dangerous driving will be either ned or lose licence or both? Solution: The events of being ned and losing licence are not mutually exclusive, therefore apply the general addition rule: Pr {F L} = Pr {F } + Pr {L} Pr {F L} = 0.87 + 0.52 0.41 = 0.98 .
Example B.2: A researcher knows that 60% of the goats in a certain district are male and that 30% of female goats have a certain disease. Find the probability that a goat picked at random from the district is a female and has the disease.
296
Appendix B. Review of statistical rules Solution: Apply the general multiplication rule: Pr {F D} = Pr {F } Pr {D | F } = 0.4 0.3 = 0.12 .
B.2
Mean and variance for sums of random variables
If X1 and X2 are random variables and c is a constant, then the following relationships must hold.
E(cX1 ) = cE(X1 ) E(X1 + c) = E(X1 ) + c E(X1 + X2 ) = E(X1 ) + E(X2 ) Var(cX1 ) = c2 Var(X1 ) Var(X1 + c) = Var(X1 )
If X1 and X2 are independent random variables, Var(X1 + X2 ) = Var(X1 ) + Var(X2 ) . If X1 and X2 are not independent random variables, Var(X1 + X2 ) = Var(X1 ) + Var(X2 ) + 2Covar[X1 , X2 ] , where Covar[X1 , X2 ,] is the covariance between X1 and X2 . Example B.3: A random variable X has a mean of 10 and variance of 5. Determine the mean and variance of 3X 1.
B.2. Mean and variance for sums of random variables Solution: Given that E(X) = 10 and Var(X) = 5
E(3X 1) = E(3X) 1 = 3E(X) 1 = 3(10) 1 = 29 Var(3X 1) = Var(3X) = 9 Var(X) = 9(5) = 45
297
Example B.4: The alternative formula for the variance is derived as follows Var(X) = E (X X )2 = E X 2 2X X + 2 X = E X 2 + E [2X X] + 2 X = E X
2
by addition rules by multiplication rule
2X E [X] +
2
2 X
= E X 2 E [X]
as X = E(X) .
Denition B.1 (general expectation, variance and standard deviation)
The expected value of any given function g(X) of a discrete random variable is E(g(X)) = g(x)p(x) ,
x
where p(x) is its probability distribution.
For a discrete random variable X, the variance of X is the expected value of g(X) where g(x) = (x X )2 (recall X = E(X)), that is,
2 Var(X) = X = E (X X )2 = x
(x X )2 p(x) .
The standard deviation of X is X =
Var(X) .
For any distribution, the variance of X may also be computed from Var(X) = E(X 2 ) E(X)2 , as can be shown from properties established in the next subsubsection. Example B.5: A random variable X has the following probability distribution x p(x) 0 0.05 1 0.15 2 0.35 3 0.25 4 0.20
Determine the expected value of X and the variance of X.
298 Solution:
Appendix B. Review of statistical rules
E(X) = x = x xp(x) = 0 0.05 + 1 0.15 + 2 0.35 + 3 0.25 + 4 0.20, therefore E(X) = 2.40 E(X 2 ) = x x2 p(x) = 02 0.05 + 12 0.15 + 22 0.35 + 32 0.25 + 42 0.20 = 7.0, therefore Var(X) = E(X 2 ) E(X)2 = 7.0 2.402 = 1.24
Appendix
Some time series tricks in R
C.1
Helpful R commands
To convert an AR model to an MA model: imp <- as.ts( c(1, rep(0,19) ) ) # Creates a time-series (1, 0, 0, 0, ...) theta <- filter(imp, c(ar1, ar2, ...), "recursive") # Note that ar0 = 1 is assumed, and should not be included. To nd the ACF of an AR model: imp <- as.ts( c(1, rep(0,99) ) ) # Creates a time-series (1, 0, 0, 0, ...) # Now convert to MA model, as above theta <- filter(imp, c(ar1, ar2, ...), "recursive") # Note that ar0 = 1 is assumed, and should not be included. # Now get gamma: convolve( theta, theta ) 299
300
Appendix C. Some time series tricks in R
Appendix
Time series functions in R
The following is a list of the functions available in r for time series analysis. Table D.1: The time series library in r.
Function acf ar ar.burg ar.mle ar.ols ar.yw arima austres bandwidth.kernel beaver1 beaver2 beavers BJsales Box.test ccf
Description Autocovariance and Autocorrelation Function Estimation Fit Autoregressive Models to Time Series Fit Autoregressive Models to Time Series Fit Autoregressive Models to Time Series Fit Autoregressive Models to Time Series by OLS Fit Autoregressive Models to Time Series ARIMA Modelling of Time Series Quarterly Time Series: Number of Australian Residents Smoothing Kernel Objects Body Temperature Series of Two Beavers Body Temperature Series of Two Beavers Body Temperature Series of Two Beavers Sales Data with Leading Indicator. BoxPierce and LjungBox Tests Function Estimation
301
302 Function (cont.) cpgram df.kernel dinv embed EuStockMarkets fdeaths lter is.tskernel kernapply kernel lag lag.plot LakeHuron ldeaths lh lynx mdeaths na.contiguous nottem pacf plot.acf plot.spec plot.stl plot.tskernel PP.test predict.ar predict.arima0 print.ar print.arima0 print.stl print.tskernel spec spec.ar spec.pgram spec.taper
Appendix D. Time series functions in R Description (cont.) Plot Cumulative Periodogram Smoothing Kernel Objects Discrete Integrals: Inverse of Dierencing Embedding a Time Series Daily Closing Prices of Major European Stock Indices, 1991-1998. Monthly Deaths from Lung Diseases in the UK Linear Filtering on a Time Series Smoothing Kernel Objects Apply Smoothing Kernel Smoothing Kernel Objects Lag a Time Series Time Series Lag Plots Level of Lake Huron 18751972 Monthly Deaths from Lung Diseases in the UK Luteinizing Hormone in Blood Samples Annual Canadian Lynx trappings 18211934 Monthly Deaths from Lung Diseases in the UK NA Handling Routines for Time Series Average Monthly Temperatures at Nottingham, 19201939 Autocovariance and Autocorrelation Function Estimation Plotting Autocovariance and Autocorrelation Functions Plotting Spectral Densities Methods for STL Objects Smoothing Kernel Objects Phillips-Perron Unit Root Test Fit Autoregressive Models to Time Series ARIMA Modelling of Time Series - Preliminary Version Fit Autoregressive Models to Time Series ARIMA Modelling of Time Series - Preliminary Version Methods for STL Objects Smoothing Kernel Objects Spectral Density Estimation Estimate Spectral Density of a Time Series from AR Fit Estimate Spectral Density of a Time Series from Smoothed Periodogram Taper a Time Series
303 Function (cont.) spectrum stl summary.stl sunspot toeplitz treering ts.intersect ts.plot ts.union UKDriverDeaths UKLungDeaths USAccDeaths Tskernel Description (cont.) Spectral Density Estimation Seasonal Decomposition of Time Series by Loess Methods for STL Objects Yearly Sunspot Data, 1700-1988. Monthly Sunspot Data, 1749-1997. Form Symmetric Toeplitz Matrix Yearly Treering Data, -6000-1979. Bind Two or More Time Series Plot Multiple Time Series Bind Two or More Time Series Deaths of Car Drivers in Great Britain, 1969-84 Monthly Deaths from Lung Diseases in the UK Accidental Deaths in the US 1973-1978 Smoothing Kernel Objects
304
Appendix D. Time series functions in R
Appendix
Multivariate analysis functions in R
Table E.1: The multivariate statistics library in r. Function ability.cov as.dendrogram as.dist as.hclust as.matrix.dist biplot biplot.princomp cancor Canonical cmdscale cut.dendrogram cutree dist factanal factanal.t.mle format.dist Harman23.cor Harman74.cor Description Ability and Intelligence Tests General Tree Structures Distance Matrix Computation Convert Objects to Class hclust Distance Matrix Computation Biplot of Multivariate Data Biplot for Principal Components Correlations Classical (Metric) Multidimensional Scaling General Tree Structures Cut a tree into groups of data Distance Matrix Computation Factor Analysis Factor Analysis Distance Matrix Computation Harman Example 2.3 Harman Example 7.4
305
306 Function (cont.) hclust identify.hclust kmeans K-Means loadings names.dist plclust plot.dendrogram plot.hclust plot.prcomp plot.princomp plotNode plotNodeLimit prcomp predict.princomp princomp print.dist print.factanal print.hclust print.loadings print.prcomp print.princomp print.summary.prcomp print.summary.princomp promax rect.hclust screeplot summary.prcomp summary.princomp varimax
Appendix E. Multivariate analysis functions in R Description (cont.) Hierarchical Clustering Identify Clusters in a Dendrogram Clustering Print Loadings in Factor Analysis Distance Matrix Computation Hierarchical Clustering General Tree Structures Hierarchical Clustering Principal Components Analysis Principal Components Analysis General Tree Structures General Tree Structures Principal Components Analysis Principal Components Analysis Principal Components Analysis Distance Matrix Computation Print Loadings in Factor Analysis Hierarchical Clustering Print Loadings in Factor Analysis Principal Components Analysis Principal Components Analysis Principal Components Analysis Summary method for Principal Components Analysis Rotation Methods for Factor Analysis Draw Rectangles Around Hierarchical Clusters Screeplot of PCA Results Principal Components Analysis Summary method for Principal Components Analysis Rotation Methods for Factor Analysis
Bibliography
[1] Joint Archive for Sea Level, http://uhslc.soest.hawaii.edu/uhslc/ jasl.html. [2] Climate Indicies, from the Climate Diagnostic Centre, http://www. cdc.noaa.gov/ClimateIndices/ [3] Climate Prediction Center http://www.cpc.noaa.gov/ [4] U.S. Geological Survey, Hydro-Climatic Data Network (HCDN): Streamow Data Set, 18741988 By J.R. Slack, Alan M. Lumb, and Jurate Maciunas Landwehr. http://water.usgs.gov/pubs/wri/ wri934076/1st_page.html [5] Hyndman, Rob. The Time Series Data Library, http:// www-personal.buseco.monash.edu.au/~hyndman/TSDL/index.htm [6] Anderson, O. D. (1976). Time Series Analysis and Forecasting: The BoxJenkins Approach. London and Boston: Butterworths. [7] Assimakopoulos, V. and Nikolopoulos, K. (2000) The theta model: a decomposition approach to forecasting in International Journal of Forecasting 16(4) 521530. [8] Basilevsky, A. (1994). Statistical factor analyasis and related methods, New York: John Wiley and Sons. [9] Box, G. E. P. and Jenkins, G. M. (1970). Time Series Analysis, San Francisco: Holden-Day. [10] Buell, C. Eugene and Bundgaard, Robert C. (1971). A factor analysis of winds to 60 km over Battery MacKenzie, C.Z. in Journal of Applied Meteorology, 10(4), 803810. [11] Chateld, Chris (1996). The Analysis of Time Series: an introduction, Boca Raton:Chapman and Hall. 307
308
Bibliography
[12] Chin, Roland T., Jau, Jack Y. C. and Weinman, James A. (1987). The application of time series models to cloud eld morphology analysis in Journal of Climate and Applied Meteorology, 26, 363373. [13] Chu, Pao-Shin and Katz, Richard W. (1985). Modeling and forecasting the southern oscillation: A Time-Domain Approach in Monthly Weather Review, 113, 18761888. [14] Claps, P. and Murrone, F. (1994). Optimal parameter estimation of conceptually-based streamow models by time series aggregation in Stochastic and Statistical Methods in Hydrology and Environmental Engineering, Volume 3, eds Keith W. Hipel, A. Ian McLeod, U. S. Panu and Vijay P. Singh, Netherland: Kluwer Academic Publishers p421 434. [15] Davis, J. M. and Rapoport, P. N. (1974). The use of time series analysis techniques in forecasting meteorological drought, in Monthly Weather Review 102, 176180. [16] Eneld, D.B., A. M. Mestas-Nunez and P.J. Tribble, (2001). The Atlantic multidecadal oscillation and its relation to rainfall and river ows in the continental U.S. in Geophysical Research Letters, 28, 20772080. [17] Fritts, Harold C. (1974). Relationships of ring widths in arid-site conifers to variations in monthly temperature and precipitation in Ecological Monographs, 44, 411440. [18] Guiot, J. and Tessier, L. (1997). Detection of pollution signals in treering series using AR processes and neural networks in Applications of Time Series Analysis in Astronomy and Meteorology, eds T. Subba Rao, M. B. Priestley and O. Lessi, London: Chapman and Hall, p413426. [19] Hand, D. J., Daly, F., Lunn, A. D., McConway, K. J. and Ostrowski, E. (1994). A Handbook of Small Data Sets, London: Chapman and Hall. [20] Hannes, Gerald (1976). Factor analyusis of costal air pressure and water temperature in Journal of Applied Meteorology, 15(2), 120126. [21] Hipel and McLeod (1984). Time Series Modelling of Water Resources and Environmental Systems, Elsevier. [22] Hyndman, Rob J. and Billah, Baki. Unmasking the Theta method to appear in International Journal of Forecasting. [23] Izenman, A. J. (1983). J. R. Wolf and H. A. Wolfer: An historical note on the Zurick sunspot relative numbers in Journal of the Royal Statistical Society A, 146, 311318.
Bibliography
309
[24] Kalnicky, Richard A. (1987) Seasons, singularities, and climatic changes over the midlatitudes of the northern hemisphere during 1899 1969 in Journal of Applied Meterology, 26(11), 14961510. [25] Krner, Olavi and Rannik, Ullar (1996). Stochastic models to reprea sent the temporal variability of zonal average cloudiness in Journal of Climate, 9, 27182726. [26] Katz, Richard W. and Skaggs, Richard H. (1981). On the use of autoregressive-moving average processes to model meteorological time series in Monthly Weather Review, 109, 479484. [27] Katz, Richard W. and Glantz, Michael H. (1986). Anatomy of a rainfall index in Monthly Weather Review, 114(4), 764771. [28] Kavvas, M. L. and Delleur, J. W. (1981). A stochastic cluster model of daily rainfall sequences in Water Resources Research, 17(4), 1151 1160. [29] Kidson, John W. (1975). Eigenvector analysis of monthly mean surface data in Monthly Weather Review, 103(3), 177186. [30] Kim, Jae-On and Mueller, Charles W. (1990). Fcator Analysis: Statistical Methods and Prcatical Issues, Sage University Paper series on Quantitative Applications in the Social Sciences, series no. 14. Beverley Hills and London: Sage Publications. [31] Maier, H. R. and Dandy, G. C. (1995). Comparison Of The Box-Jenkins Procedure with Articial Neural Network Methods for Univariate Time Series Modelling, Volume 1, Research Report R127, Department of Civil and Environmental Engineering, The University of Adelaide. [32] Makridakis, Spyros and Hibon, Mich`le (2000). The M3-Competition: e results, conclusions and implications in International Journal of Forecasting 16(4), 451476 [33] Mantua, Nathan J. Hare, Steven R., Zhang, Yuan, Wallace, John M., and Francis, Robert C. (1997). A Pacic interdecadal climate oscillation with impacts on salmon production in the Bulletin of the American Meteorological Society, 78, 10691079. [34] Mardia, K. V., Kent, J. T. and Bibby, J. M. (1979). Multivariate Analysis, London: Academic Press. [35] Michaelsen, Joel (1982). A statistical study of large-scale, long period variability in North Pacic sea surface temperature anomalies in Journal of Physical Oceanography, 12(7), 694703.
310
Bibliography
[36] Parzen, Emanuel (1979). Nonparametric statistical data modeling in Journal of the American Statistical Association, 74(365), 105121. [37] Preisendorfer, RW and Mobley, CD (1984). Climate forecast verications , United States Mainland 197483 in Monthly Weather Review, 112, 809825. [38] Richman, MB (1986). Rotation of principal components in Journal of Climatology, 6, 293335. [39] Rogers, Jeery C. (1976). Sea surface temperature anomalies in the eastern North Pacic and associated wintertime atmospheric uctuations over North America, 196073 in Monthly Weather Review, 104(8), 985993. [40] Sales, P. R. H., Pereira, B. de B. and Vieira, A. M. (1994). Linear procedures for time series analysis in hydrology in in Stochastic and Statistical Methods in Hydrology and Environmental Engineering, Volume 3, eds Keith W. Hipel, A. Ian McLeod, U. S. Panu and Vijay P. Singh, Netherland: Kluwer Academic Publishers p105117. [41] Shaw, N. (1942). Manual of Meterology, Volume 1, London: Cambridge University Press. [42] Stone RC and Auliciems A. (1992). SOI phase relationships with rainfall in eastern Australia in International Journal of Climatology, 12, 625636. [43] Trenberth Kevin E. and and Stepaniak, David P. (19XX). Indices of El Nino Evolution in Journal of Climate, 14, 16971701. [44] Tong, Howell (1983). Threshold Models in Nonlinear Time Series Analysis, Springer-Verlag. [45] Unal,Yurdanur, Kindap, Tayfun and Karaca, Mehmet (2003). Redening the climate zones of Turkey using cluster analysis in International journal of climatology 23, 10451055. [46] Venables, W. N. and Ripley, B. D. (1997). Modern Applied Statistics with S-PLUS, second edition, Springer-Verlag: New York. [47] Visser, H. and Molenaar, J. (1995). Trend estimation and regression analysis in climatological time series: an application of structural time series models and Kalman lter in Journal of Climate, 8, 969979. [48] Wilks, DS (1989). Conditioning stochastic daily precipitation models on total monthly precipitation in Water Resources Research, 25, 14291439.
Bibliography
311
[49] Wilks, Daniel S. (1995). Statistical Methods in the Atmospheric Sciences. Academic Press, San Diego. [50] Wol, George T., Morrisey, Mark L. and Kelly, Nelson A. (1984). An investigation of the sources of summertime haze in the Blue Ridge Mountains using multivariate statistical methods in Journal of Applied Meteorology, 23(9), 13331341. [51] Woodward, Wayna A. and Gray, H. L. (1995). Selecting a model for detecting the presence of a trend in Journal of Climate, 8, 19291937. [52] Yao, C. S. (1983). Fitting a linear autoregressive model for long-range forecasting in Monthly Weather Review, 111, 692700. [53] Zwiers, Francis and von Storch, Hans (1990). Regime-dependent autoregressive time series models of the southern oscillation in Journal of Climate, 3, 13471363.
312
Bibliography

Study Book

Cargado por

Información del documento

Descripción original:

Derechos de autor

Formatos disponibles

Compartir este documento

Compartir o incrustar documentos

Opciones para compartir

¿Le pareció útil este documento?

¿Este contenido es inapropiado?

Copyright:

Formatos disponibles

Study Book

Cargado por

Copyright:

Formatos disponibles

STA3303 Statistics for Climate Research

USQ, February 21, 2007

Time Series Analysis

USQ, February 21, 2007

Time Series Analysis

USQ, February 21, 2007

USQ, February 21, 2007

USQ, February 21, 2007

30 20 10 SOI 0 10 20 30 40 1880 1900 1920 1940 Time 1960 1980 2000

USQ, February 21, 2007

SOI vs SOI three seasons previous

SOI vs SOI four seasons previous

There are two main purposes of gathering time series:

USQ, February 21, 2007

30 20 10 SOI 0 10 20 30 40 1880 1900 1920 1940 Time 1960 1980 2000

Annual rainfall (in mm)

500 400 300 200

USQ, February 21, 2007

USQ, February 21, 2007

Signal and noise

USQ, February 21, 2007

1.3. Signal and noise

1 PDO 0 1 2 1980 1985 1990 Time 1995 2000

1 PDO signal 0 1 2 1980 1985 1990 Time 1995 2000

USQ, February 21, 2007

USQ, February 21, 2007

USQ, February 21, 2007

USQ, February 21, 2007

> names(soidata) [1] "year" "month" "soi" "soiphase"

USQ, February 21, 2007

30 20 10 SOI 0 10 20 30 40 1880 1900 1920 1940 Time 1960 1980 2000

USQ, February 21, 2007

USQ, February 21, 2007

USQ, February 21, 2007

Answers to selected Exercises

USQ, February 21, 2007

Sunspots 101 82 66 35 31 7 20 92 154 125 85 68 38 23 10 24 83 132 131 118 90 67 60 47 41 21 16 6 4 7 14 34 45 43

Sunspots 48 42 28 10 8 2 0 1 5 12 14 35 46 41 30 24 16 7 4 2 8 17 36 50 62 67 71 48 28 8 13 57 122 138

Sunspots 103 86 63 37 24 11 15 40 62 98 124 96 66 64 54 39 21 7 4 23 55 94 96 77 59 44 47 30 16 7 37 74

Table 1.2: The annual sunspot numbers from 1770 to 1869.

USQ, February 21, 2007

QBO from 1948 to 2001

30 1950 1960 1970 Time 1980 1990 2000

Figure 1.6: The QBO from January 1948 to December 2001.

Matlab, for example, uses round brackets: ( . . . ).

USQ, February 21, 2007

The (relative) Level of Lake Nyanza from 1902 to 1921

30 Level of Lake Victoria Nyanza

10 1905 1910 Time 1915 1920

USQ, February 21, 2007

Autoregressive (AR) models

More on stationarity . . . . . . . . . . . . . . . . . . . . Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.8.1 Answers to selected Exercises . . . . . . . . . . . . . . .

Module 2. Autoregressive (AR) models

USQ, February 21, 2007

= m + en + 1 Xn1 + 2 Xn2 + + p Xnp

USQ, February 21, 2007

Module 2. Autoregressive (AR) models

USQ, February 21, 2007

2.3. Forecasting ar models

USQ, February 21, 2007

Module 2. Autoregressive (AR) models

USQ, February 21, 2007

2.4. The backshift operator