A handout made by the authors, students of UP Diliman, in partial fulfillment of the requirements for a course.


An Introduction to Robust Regression

Carl Dominick CALUB

Robert Elcivir RULONA

Emkay EVANGELISTA

Contents

1 Introduction
  1.1 Outliers: What they are and what they do
  1.2 Robust Regression: What does it do
2 Univariate Robust Estimation
  2.1 LMS
  2.2 LTS
  2.3 Large Batch Estimation
3 Robust Regression
  3.1 LMS Regression
  3.2 LTS Regression
  3.3 Inference in Robust Regression
4 Software Implementations
  4.1 PROGRESS algorithm
  4.2 SAS
  4.3 R
5 Illustration: Land Use and Water Quality in New York Rivers

Bibliography

Appendix
A R functions for Exact Univariate LMS and LTS Estimation
B SAS Code and Outputs used for Section 5
  B.1 SAS Code
  B.2 SAS Outputs
C R Script and Results used for Section 5
D Software Specifications Used
E Review Questions


1 Introduction

Before proceeding to discuss robust regression, salient features of ordinary least squares (OLS) regression must first be revisited.

Ordinary Least Squares Regression  The classical linear model,

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip} + \varepsilon_i \qquad \text{where } \varepsilon_i \sim N(0, \sigma^2)\ \forall i,$$

estimates its parameters, $\beta_0, \beta_1, \ldots, \beta_p$, as the values that would minimize the sum of the squared residuals, i.e.

$$\hat{\beta}_j = \arg\min_{\beta_j} \left\{ \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 \right\} \qquad \text{where } \hat{y} = \hat{\beta}_0 + \sum_{j=1}^{p} \hat{\beta}_j X_j; \quad j = 0, 1, \ldots, p$$

OLS regression is popular because of the convenience brought about by its properties: the parameter estimates are BLUE, computation is easy, and interpretation is simple.

However, there is a caveat to the beauty of OLS regression: it imposes stringent assumptions, viz. normality, independence of observations, and homoskedasticity. OLS is quite sensitive to departures from these classical assumptions.

But it is not just the fulfillment of the classical assumptions that affects the tenability of inferences. OLS regression is also quite sensitive to outliers because of the way its parameter estimates are arrived at.

1.1 Outliers: What they are and what they do

Outliers persist for various reasons: encoding errors, data contamination, or observations surrounded by unique circumstances. Regardless of source, outliers pose a serious threat to data analysis through the distortion of resulting inferences.


In fact, the presence of outliers introduces non-normality into the equation through heavy-tailed error distributions (Hamilton, 1992). Robust regression assigns lower weights to outlying observations so as to limit their spurious influence, thus rendering the inferences resistant.

In order to appreciate the benefits brought by robust regression, the different characteristics of outliers and how they garble the analysis are presented.

Leverage Point  An observation whose explanatory value(s) lie far from the bulk of the dataset is deemed a leverage point. Leverage points need special attention because of their potential to greatly influence the resulting OLS estimates. Ipso facto, the presence of a leverage point has the potential to severely distort inferences made from the subject data.

To illustrate its effect (and understand where the term leverage comes from), consider the following datasets taken from Rousseeuw and Leroy (1987).

Dataset.noLev

     x     y
1  0.20  0.30
2  1.00  1.23
3  1.27  1.78
4  1.57  2.79
5  2.10  3.90

Dataset.wLev

     x     y
1  5.00  0.30
2  1.00  1.23
3  1.27  1.78
4  1.57  2.79
5  2.10  3.90

One of the data points in Dataset.noLev has been erroneously encoded into Dataset.wLev (in particular, the x-value), causing an observation to lie far from the other data points along the x-axis (a plot is presented in Figure 1 to better visualize the datasets). The resulting fitted OLS models on the two datasets are then compared.

Fitted OLS Model without Leverage

R-squared: 0.9557

Parameter Estimates:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -0.3726911 0.3316215 -1.123845 0.342892886

x 1.9321589 0.2402323 8.042877 0.004014026

Fitted OLS Model with Leverage

R-squared: 0.2277

Parameter Estimates:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 2.8956610 1.1431918 2.5329618 0.08520275

x -0.4093515 0.4352826 -0.9404268 0.41637622
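The two fits can be reproduced with a few lines of R; a minimal sketch, with the data frames typed in directly from the datasets above:

# Reconstruct the two small datasets from Rousseeuw and Leroy (1987)
noLev <- data.frame(x = c(0.20, 1.00, 1.27, 1.57, 2.10),
                    y = c(0.30, 1.23, 1.78, 2.79, 3.90))
wLev <- noLev
wLev$x[1] <- 5.00                  # the erroneously encoded x-value

summary(lm(y ~ x, data = noLev))   # R-squared 0.9557, slope  1.93
summary(lm(y ~ x, data = wLev))    # R-squared 0.2277, slope -0.41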


Notice that the stability of the OLS model fitted on the dataset with the leverage point is comparably lower than that of the model fitted on the dataset without, with the R-squared falling from 0.9557 to 0.2277. Furthermore, the validity of the parameter estimates of the fitted OLS model has become dubious upon the introduction of the leverage point, as reflected by the difference in the standard errors (or, equivalently, the p-values).

Apart from the degradation in the tenability of the parameter estimates, juxtaposing the two models also points out the drastic change in the estimated slope. This is a very dangerous case in the context of regression, as it may lead to misleading inferences.

Figure 1: Scatterplots and fitted OLS lines for Dataset.noLev (no leverage) and Dataset.wLev (with leverage)

The substantial change in the values of the parameter estimates caused by the presence of the leverage point is illustrated in Figure 1.

On another note, notice that the outlier pulled the fitted OLS model towards it, similar to how an external force acting on a lever changes the lever's orientation (hence the term leverage).

Setting the trivia aside, the potential for leverage points to mislead comes not only from the wild change in parameter estimates, but also from how the drastic change in the fitted OLS line masks which observations are supposed to be treated as outliers. In other words, discrimination of outliers based on the fitted regression model becomes misleading as well.

To provide insight on this, the residuals of Dataset.wLev from its fitted OLS model and the residuals of the same dataset from the model fitted on Dataset.noLev [1] are compared.

[1] Technically speaking, this kind of procedure is spurious. The proper procedure will be discussed later on. But for the purposes of illustrating the effects of leverage points, this example is enough, since the premise is that the fitted OLS model on the dataset without a leverage point and the model fitted on the bulk of the observations in the dataset are close enough to each other.


From OLS Model

x y Residuals Std.Residuals

1 5.00 0.30 -0.5489037 -0.345044

2 1.00 1.23 -1.2563095 -1.331951

3 1.27 1.78 -0.5957846 -0.689797

4 1.57 2.79 0.5370208 0.676806

5 2.10 3.90 1.8639771 2.548244

From Model Fitted on Data without Leverage

x y Residuals Std.Residuals

1 5.00 0.30 -8.9881033 -1.97266528

2 1.00 1.23 -0.3294678 -0.13005900

3 1.27 1.78 -0.3011507 -0.12613446

4 1.57 2.79 0.1292017 0.04767381

5 2.10 3.90 0.2151575 0.05292135

Looking at the standardized residuals from the first and second models generated, it can be observed that the observations considered outliers by the two models are different. The OLS model identifies the observation at x = 2.10 as an outlier, despite its consistency with the general linear trend followed by the rest of the points. Meanwhile, the second model discriminates the observation at x = 5.00 as the relatively wayward one, which is not surprising, because the second set of residuals has been obtained from a model fitted on a set of points that closely follow a specific linear trend.

While on the topic of identifying outlying observations using residuals, it is worth mentioning that although Studentized residuals could be applied to the residuals of the OLS model fitted on the dataset with a leverage point, and would reveal that the first observation, rather than the fifth, is actually the outlier in this example, this is not always the case.

The Studentized residual only singles out one observation at a time. Ipso facto, the inclusion of other outliers in the computation of the Studentized residual of one of the actual outliers can keep that Studentized residual from inflating.

Apart from the propensity of leverage points to severely affect analysis through spurious estimates and spurious discrimination of outliers, special attention is given to them because they are more likely to occur in a multidimensional setting. Naturally, the consideration of more explanatory variables provides more opportunities for leverage points to appear.

That said, not all outliers are detrimental to analysis; some outliers are benign, simply because they do not debilitate the inferences.

Figure 2 highlights how important it is to keep in mind that leverage points only have the potential to impair analysis. While Figure 2(a) is similar to the earlier illustration of the effects of leverage points (Dataset.noLev and Dataset.wLev), Figure 2(b) shows that the outlying observation, despite being a leverage point, is still consistent with the linear trend followed by the bulk of the data points.


(a) Debilitating Outlier (b) Benign Outlier

Figure 2: Examples of a Debilitating and a Benign Leverage Point

Source: Hamilton (1992)


If a leverage point is consistent with the linear trend followed by the majority of the dataset, it is concomitant that the observation would not just be a leverage point, but also an outlier in the y-direction. This is not to say, however, that an observation considered as both a leverage point and an outlier in the y-direction must be consistent with the linear trend of the majority. If the value of its response or one of its explanatory variables is too far off, then it will not follow the linear trend.

Outlying points that deviate from the linear trend exhibited by the majority of the data points are labeled regression outliers.

That said, it is actually the presence of regression outliers that erodes the tenability of the parameter estimates. So, leverage points are considered detrimentally influential if they are regression outliers as well. If not, then the subject observation is a benign leverage point.

1.2 Robust Regression: What does it do

It is worth mentioning, before discussing the essence of robust regression, that in OLS regression outliers are determined based on their deviation from the fitted line using various measures such as the adjusted, standardized, and Studentized residuals; DFFITS; DFBETAs; Cook's Distance; etc.


As previously mentioned, this sort of discrimination of residuals could lead to complications, as fitting an OLS line would mask regression outliers before their effects are marginalized.

Robust regression, on the other hand, fits a line using resistant estimators first. In using resistant estimators in the estimation process itself, the effects of outlying values are marginalized, thus obtaining a robust solution.

Figure 3: Illustration of the difference in response of OLS and robust regression to outliers

Source: Hamilton (1992)

Notice in Figure 3 that the fitted OLS model was pulled downward by the data values of four regression outliers (San Jose, San Diego, San Francisco, and Los Angeles), while the model fitted using robust regression ignored the excessive influence imposed by said outliers. In this sense, robust regression is sometimes referred to as resistant regression.

It is noteworthy that these resistant estimators essentially assign weights to observations (similar to L-estimators). Wayward observations are just assigned lower weights, but not always zero weights.

Outliers are then identified based on their deviation from the robust solution.

While there is a plethora of robust estimators available (e.g., the repeated median and iteratively reweighted least squares), this article will focus on only two: the least median of squares and the least trimmed squares. These two estimators are noted for their very high breakdown bound.


2 Univariate Robust Estimation

This section presents two estimators used in robust regression, the least median of squares (LMS) and the least trimmed squares (LTS), but only in the univariate setting. LTS and LMS estimation for multidimensional data will be presented in the next section, already in the context of regression, since estimation involving more than one variable is usually done in that setting.

That said, this section will proceed as follows: a brief description of the estimation procedure is presented, followed by an outline of its computation, then an illustration, ending with a presentation of its properties.

2.1 Least Median of Squares (LMS)

As the name implies, the LMS estimator, $\hat{\theta}_{LMS}$, is computed as the value that would minimize the median squared deviation, i.e.:

$$\hat{\theta}_{LMS} = \arg\inf_{\theta} \left\{ \operatorname{Med}\left[ \left( y_i - \theta \right)^2 \right] \right\}$$

In fact, this definition would imply that the LMS estimator's objective function is given by:

$$\rho\left( y_i; \theta \right) = \operatorname{Med}\left[ \left( y_i - \theta \right)^2 \right]$$

However, LMS estimation is not a form of M-estimation, because the objective function above does not include all observations, which is inconsistent with the definition of M-estimation. In fact, the Help and Documentation of SAS 9.3 (SAS Institute, Inc., 2011) differentiates LMS and LTS estimation from M-estimation.

Computing for the LMS estimator

(1) The first order of business is to arrange the batch of size n in ascending order:

$$y_{(1)}, y_{(2)}, \ldots, y_{(n)} \qquad y_{(1)} \le y_{(2)} \le \cdots \le y_{(n)}$$

If n is odd, just repeat the median and include it in the ordered batch. Adjust the batch size accordingly and still denote it by n.

(2) Compute for:

$$h = \left\lfloor \frac{n}{2} \right\rfloor + 1$$

(3) Partition the batch into two (overlapping) parts, where the second part starts at $y_{(h)}$, and pair their elements off:

$$\left( y_{(1)}, y_{(h)} \right), \left( y_{(2)}, y_{(h+1)} \right), \ldots, \left( y_{(n-h+1)}, y_{(n)} \right)$$

Note that both of the sub-batches are of size $n - h + 1$. Ipso facto, there is a one-to-one correspondence between the sub-batches.

(4) Compute for:

$$y^{(d)}_i = y_{(i+h-1)} - y_{(i)}, \qquad i = 1, 2, \ldots, n - h + 1$$

(5) The LMS estimate is the midpoint of the values corresponding to the pair with the least difference, i.e.:

$$\hat{\theta}_{LMS} = \frac{y_{(h+k-1)} + y_{(k)}}{2} \qquad \text{where } y_{(h+k-1)} - y_{(k)} = \min_i\, y^{(d)}_i$$

Illustration  Consider the following batch of numbers taken from Rousseeuw and Leroy (1987):

40 75 80 83 86 88 90 92 93 95

Note that n = 10, which means that $h = \lfloor 10/2 \rfloor + 1 = 6$, so the batch is split before the 6th ordered observation:

40 75 80 83 86 | 88 90 92 93 95

After dividing the batch into two sub-batches, the sub-batches are then paired up and their differences obtained:

Pair:        (40, 88)   (75, 90)   (80, 92)   (83, 93)   (86, 95)
Difference:     48         15         12         10          9

$$\min\left\{ y^{(d)}_1, y^{(d)}_2, y^{(d)}_3, y^{(d)}_4, y^{(d)}_5 \right\} = \min\{48, 15, 12, 10, 9\} = 9 = y^{(d)}_5 = y_{(10)} - y_{(5)} = 95 - 86$$

$$\hat{\theta}_{LMS} = \frac{y_{(10)} + y_{(5)}}{2} = \frac{95 + 86}{2} = 90.5$$
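These steps translate directly into a few lines of R. The sketch below reproduces the estimate above for an even-sized batch (the odd-n adjustment in step (1) is omitted); a full version appears in Appendix A:

# Univariate LMS: midpoint of the tightest pair of values h - 1 positions apart
lms <- function(y) {
  y <- sort(y)
  n <- length(y)
  h <- floor(n / 2) + 1
  d <- y[h:n] - y[1:(n - h + 1)]   # the differences y^(d)_i
  k <- which.min(d)                # pair with the least difference
  (y[k] + y[k + h - 1]) / 2        # midpoint of that pair
}
lms(c(40, 75, 80, 83, 86, 88, 90, 92, 93, 95))   # 90.5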

Properties of the LMS estimator  Having laid down the procedure involved in its computation, some salient properties of the LMS estimator are presented:

1. it has a breakdown bound of 50%;

2. it is location and scale equivariant (i.e., linearly equivariant);

3. a solution for its objective function always exists;

4. its objective function is not smooth; and

5. it has a low convergence rate.


Figure 4: Wilkinson dot plot and location of the mean, median, and the LMS estimator.

In addition to the properties mentioned, the LMS estimator is also considered a sort of mode estimator, in that it tends to the modal value of the batch. In simpler terms, the LMS estimator tends to where the values cluster, as seen in Figure 4, compared to orthodox location estimators such as the mean and median.

Since the LMS estimator is affected by the shape (or the skewness) of the data, it is inherently less reliable than other robust estimators because it is more variable.

Despite its higher variability and a non-smooth objective function with a slow convergence rate, the LMS estimator is generalizable to the multidimensional case while still maintaining a high breakdown bound and linear equivariance.

2.2 Least Trimmed Squares (LTS)

The LTS estimator, meanwhile, is computed as the value that would minimize the trimmed sum of ordered squared deviations. In mathematical notation:

$$\hat{\theta}_{LTS} = \arg\inf_{\theta} \left\{ \sum_{i=1}^{h} r^2_{(i)} \right\} \qquad \text{where } h = \left\lfloor \frac{n}{2} \right\rfloor + 1, \quad r_j = \left( y_j - \theta \right), \; j = 1, 2, \ldots, n,$$

and where $r^2_{(1)} \le r^2_{(2)} \le \cdots \le r^2_{(n)}$.

Again, this definition would imply that the objective function of the LTS estimator is given by:

$$\rho\left( y_i; \theta \right) = \sum_{i=1}^{h} r^2_{(i)}$$

As before, it must be kept in mind that the LTS estimator is still not an M-estimator, because it does not include all observations in evaluating its objective function, similar to the reasoning for the LMS estimator.

Note that the upper bound of the summation is h, not n. So, in essence, the LTS estimator minimizes the sum of the lowest h ordered squared residuals, equivalently discarding the upper n - h squared deviations.

Computing for the LTS estimator

(1) As before, the first order of business is to sort the data:

$$y_{(1)}, y_{(2)}, \ldots, y_{(n)} \qquad y_{(1)} \le y_{(2)} \le \cdots \le y_{(n)}$$

But n here can take on any positive integer value; no special procedure is needed for odd or even n.

(2) Compute for $h = \left\lfloor \frac{n}{2} \right\rfloor + 1$.

(3) Now, partition the sorted data into $n - h + 1$ sub-batches, each of size h, in the following manner:

$$\left\{ y_{(1)}, y_{(2)}, \ldots, y_{(h)} \right\}, \left\{ y_{(2)}, y_{(3)}, \ldots, y_{(h+1)} \right\}, \ldots, \left\{ y_{(n-h+1)}, y_{(n-h+2)}, \ldots, y_{(n)} \right\}$$

i.e., simply enclose the first h units of the sorted batch to obtain the first sub-batch. To obtain the second, just move the left and right enclosures one unit to the right. Repeat the process $n - h + 1$ times (including the first iteration) until the right enclosure reaches the end of the batch. Each repetition then corresponds to one sub-batch.

(4) Next, compute for the mean of each sub-batch. There are two ways to go about this:

$$\bar{y}_{(j)} = \frac{1}{h} \sum_{i=j}^{j+h-1} y_{(i)} \tag{1}$$

$$\bar{y}_{(j)} = \frac{h\,\bar{y}_{(j-1)} - y_{(j-1)} + y_{(j+h-1)}}{h}, \qquad j = 2, 3, \ldots, n - h + 1 \tag{2}$$

Note that Equation 1 is simply the sub-batch mean.

11

2.2 LTS 2 UNIVARIATE ROBUST ESTIMATION

To understand Equation 2, keep in mind that the $n - h + 1$ sub-batches are obtained in a progressive manner. For example, the second sub-batch contains some elements from the first sub-batch, but the first ordered observation is excluded while the (h + 1)th observation is included.

Generally speaking, the (j + 1)th sub-batch is the same as the jth sub-batch, but excluding the jth observation and including the (j + h)th observation, where $j = 1, 2, \ldots, n - h$.

That said, note that before Equation 2 can be used, Equation 1 must first be evaluated at j = 1.

(5) After obtaining the $n - h + 1$ means, the $n - h + 1$ sub-batch variances, computed here as sums of squared deviations (which order the sub-batches identically), must then be computed. Either of the two formulae can be used:

$$SQ_{(j)} = \sum_{i=j}^{j+h-1} \left( y_{(i)} - \bar{y}_{(j)} \right)^2 \tag{3}$$

$$SQ_{(j)} = SQ_{(j-1)} - y^2_{(j-1)} + h\,\bar{y}^2_{(j-1)} + y^2_{(j+h-1)} - h\,\bar{y}^2_{(j)}, \qquad j = 2, 3, \ldots, n - h + 1 \tag{4}$$

Again, Equation 4 is a recursive form of Equation 3. Also, Equation 3 must first be evaluated at j = 1 before proceeding to use Equation 4.

(6) The LTS estimate is then taken as the mean corresponding to the sub-batch with the least variance, $SQ_{(j)}$, i.e.:

$$\hat{\theta}_{LTS} = \bar{y}_{(k)} \qquad \text{where } SQ_{(k)} = \min_j\, SQ_{(j)}$$

Before moving on, care must be taken when using the recursive formulae, Equations 2 and 4: rounding off must not be done within each iteration. Rounding off the $\bar{y}_{(j)}$'s and the $SQ_{(j)}$'s in each iteration will result in not just grouping errors, but also their propagation.
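As a concrete rendering of the recursion, here is a minimal R sketch of steps (1) through (6) using Equations 2 and 4 (the Appendix A version instead computes each sub-batch variance from scratch; both pick the same sub-batch):

# Univariate LTS via the recursive sub-batch means and sums of squares
lts_scan <- function(y) {
  y <- sort(y); n <- length(y); h <- floor(n / 2) + 1
  m <- numeric(n - h + 1); sq <- numeric(n - h + 1)
  m[1]  <- mean(y[1:h])                     # Equation 1 at j = 1
  sq[1] <- sum((y[1:h] - m[1])^2)           # Equation 3 at j = 1
  for (j in 2:(n - h + 1)) {
    m[j]  <- (h * m[j - 1] - y[j - 1] + y[j + h - 1]) / h         # Equation 2
    sq[j] <- sq[j - 1] - y[j - 1]^2 + h * m[j - 1]^2 +
             y[j + h - 1]^2 - h * m[j]^2                          # Equation 4
  }
  m[which.min(sq)]                          # mean of the sub-batch with least SQ
}
lts_scan(c(40, 75, 80, 83, 86, 88, 90, 92, 93, 95))   # 90.66667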

LTS Illustration  Consider the same batch of numbers from the previous illustration:

40 75 80 83 86 88 90 92 93 95

The resulting sub-batch means and variances are rounded here only to conserve space; again, these values must not be rounded off before obtaining the actual LTS estimate.


That said, the $\bar{y}_{(j)}$'s and the $SQ_{(j)}$'s are computed as follows:

Sub-batch (size h = 6)       $\bar{y}_{(j)}$   $SQ_{(j)}$
{40, 75, 80, 83, 86, 88}        75.3333        1603.3333
{75, 80, 83, 86, 88, 90}        83.6667         153.3333
{80, 83, 86, 88, 90, 92}        86.5             99.5
{83, 86, 88, 90, 92, 93}        88.6667          71.3333
{86, 88, 90, 92, 93, 95}        90.6667          55.3333

$$\min\left\{ SQ_{(1)}, SQ_{(2)}, SQ_{(3)}, SQ_{(4)}, SQ_{(5)} \right\} = \min\{1603.33, 153.33, 99.5, 71.33, 55.33\} = 55.33 = SQ_{(5)}$$

$$\hat{\theta}_{LTS} = \bar{y}_{(5)} \approx 90.6667$$

As previously mentioned, the LTS estimator includes only the elements of the size-h sub-batch with the lowest variance. In doing so, the other n - h observations are excluded. So really, the LTS estimator, at least as presented, is a trimmed mean of the sub-batch with the lowest squared deviations, with a trimming proportion of $\left( 1 - \frac{h}{n} \right)$. Note that, having been described as a trimmed mean, the LTS estimator allows for an asymmetric trimming of observations.

Properties of the LTS Estimator  Unlike the LMS estimator, the LTS estimator performs (relatively) well in terms of asymptotic efficiency. That is, it has a comparably faster convergence rate; equivalently, it takes fewer iterations before a value for the estimate is arrived at, at least compared to the LMS estimator.


Like the LMS estimator, the LTS estimator:

1. has a breakdown bound of 50%;

2. is linearly equivariant (i.e., location and scale equivariant);

3. is extendable to multidimensional cases (while still maintaining a high breakdown bound and linear equivariance); and

4. lacks a smooth objective function.

Figure 5: Wilkinson dot plot and locations of the mean, median, LTS, and LMS estimators.

Like the LMS estimator, the LTS estimator should also be located somewhere near the modal value of the batch (at least relative to the mean and the median). Since the objective function of the LTS estimator is based on the ordered partition of the batch with the smallest variance, which more often than not is the interval around which the data values cluster, it follows that the LTS estimator can likewise be likened to a modal estimator.

2.3 LMS and LTS Estimation in Large Batches

The compromise for the high breakdown bound, among other properties, of these estimators is inefficiency in computation. As illustrated in the previous examples, computing these estimators involves solving for the scales of the sub-batches (the range and the sum of squared deviations for LMS and LTS, respectively) $n - h + 1$ times.

In especially large batches, this is quite impractical. To render efficiency in solving for the LMS and LTS estimators of large batches, resampling techniques are used instead. Thus, solutions are determined randomly for large batch sizes. Ipso facto, it is possible to yield inconsistent computational results from run to run.


3 Robust Regression

This section is outlined as follows: a brief description of the properties of the robust regression techniques is presented, in particular the objective function used to arrive at the parameter estimates and the breakdown bounds of those estimates. After that, inferential properties under the robust regression techniques are presented.

3.1 LMS Regression

The parameter estimates in LMS regression are estimated as those that would yield the minimum median of squared residuals, i.e.:

$$\hat{\boldsymbol{\beta}}_{LMS} = \arg\min_{\boldsymbol{\beta}} \left\{ \operatorname{Med}\left( r^2_{(i)} \right) \right\} = \arg\min_{\boldsymbol{\beta}} \left\{ \operatorname{Med}\left| r_{(i)} \right| \right\} \qquad \text{where } r_i = y_i - \hat{y}_i \; \forall i$$

The breakdown bound of the resulting estimates is:

$$BDB(LMS) = \frac{\left\lfloor \frac{n-p}{2} \right\rfloor + 1}{n}$$

provided that p > 1, p being the number of parameters estimated.

3.2 LTS Regression

The parameter estimates in LTS regression are computed as the ones that would yield the minimum trimmed sum of ordered squared residuals:

$$\hat{\boldsymbol{\beta}}_{LTS} = \arg\inf_{\boldsymbol{\beta}} \left\{ \sum_{i=1}^{h} r^2_{(i)} \right\} \qquad \text{where } h = \left\lfloor \frac{n}{2} \right\rfloor + 1, \quad r_j = \left( y_j - \hat{y}_j \right), \; j = 1, 2, \ldots, n,$$

and where $r^2_{(1)} \le r^2_{(2)} \le \cdots \le r^2_{(n)}$, with a breakdown bound of:

$$BDB(LTS) = \frac{\left\lfloor \frac{n-p}{2} \right\rfloor + 1}{n}$$

where p is the number of parameters estimated.
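As a quick sanity check, the breakdown bound is easy to evaluate in R. For the Section 5 illustration (n = 20 observations, p = 4 estimated parameters) it gives 45%, matching the "Highest Possible Breakdown Value" reported in the SAS LMS output in Appendix B.2:

# Breakdown bound shared by the LMS and LTS regression estimates
bdb <- function(n, p) (floor((n - p) / 2) + 1) / n
bdb(20, 4)   # 0.45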


3.3 Inference in Robust Regression

Scale estimators of the error terms:

$$s_{LMS} = \left( 1 + \frac{5}{n-p} \right) c_{h,n} \sqrt{r^2_{(h)}} \tag{5}$$

$$s_{LTS} = d_{h,n} \sqrt{\frac{1}{h} \sum_{i=1}^{h} r^2_{(i)}} \tag{6}$$

where

$$d_{h,n} = \left[ 1 - \frac{2n}{h\,c_{h,n}}\,\phi\!\left( \frac{1}{c_{h,n}} \right) \right]^{-1/2}, \qquad c_{h,n} = \frac{1}{\Phi^{-1}\!\left( \frac{n+h}{2n} \right)}, \qquad h = \left\lfloor \frac{n}{2} \right\rfloor + 1$$

Note that $c_{h,n}$ and $d_{h,n}$ are chosen to make the scale estimators consistent with the Gaussian model (Rousseeuw and Hubert, 1997).

Moreover, it is important to note that Equation 5 only applies for odd n, and that $\left( 1 + \frac{5}{n-p} \right)$ is a finite population correction factor (see Rousseeuw and Hubert (1997)).

It is noteworthy that there are more efficient scale estimates (see Rousseeuw and Hubert (1997)) based on Equations 5 and 6; but for the purposes of just introducing the notion of robust regression, these equations should suffice.
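The constants $c_{h,n}$ and $d_{h,n}$ are simple to compute; a short R sketch following the formulas above (note how $c_{h,n}$ approaches 1.4826, the familiar consistency constant of the MAD, as n grows):

# Consistency constants for the LMS/LTS scale estimators, as given above
h_fun <- function(n) floor(n / 2) + 1
c_hn  <- function(n, h = h_fun(n)) 1 / qnorm((n + h) / (2 * n))
d_hn  <- function(n, h = h_fun(n)) {
  cc <- c_hn(n, h)
  1 / sqrt(1 - (2 * n / (h * cc)) * dnorm(1 / cc))
}
c_hn(1e6)   # ~1.4826
d_hn(20)    # finite-sample constant for a batch of 20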

Coefficient of determination, $R^2$  Rousseeuw and Hubert (1997) propose a robust counterpart of the OLS notion of $R^2$, based on Equations 5 and 6, for LMS and LTS regression:

$$R^2_{LMS} = 1 - \frac{s_{LMS}}{\left( 1 + \frac{5}{n-p} \right) c_{h,n}\,\sqrt{r^2_{(h),LMS}}}$$

$$R^2_{LTS} = 1 - \frac{s_{LTS}}{d_{h,n} \sqrt{\frac{1}{h} \sum_{i=1}^{h} r^2_{(i),LTS}}}$$

where the $r_{(i)}$'s in the denominators are residuals from the corresponding univariate LMS or LTS estimates of the response.


Unfortunately, this is as far as the related literature goes regarding inference in robust regression. Noteworthy is how the Help and Documentation of SAS 9.3 (SAS Institute, Inc., 2011) specifies that there is no test for the canonical linear hypothesis under LMS and LTS regression.

Interpretation of predicted values  Following the logic of the previous paragraphs, it is quite obvious that the resulting predicted values are interpreted under the paradigm of robustness. In other words, since the parameter estimates of the model were computed using a robust solution, ipso facto, the resulting predicted values are based on the linear relationship of the majority of the values, as determined by the robust solution employed.


4 Software Implementations

Since the objective functions of the LMS and LTS estimators are not smooth, they do not lend themselves to mathematical optimization. In other words, there is no closed-form formula for computing the parameter estimates. (In fact, it appears that this difficulty is inherent to all regression estimators with high breakdown bounds (Bhar, n.d.).)

4.1 PROGRESS algorithm

To this end, Rousseeuw and Hubert (1997) proposed an algorithm for computing the parameter estimates under LMS and LTS regression called PROGRESS (Program for RObust reGRESSion). PROGRESS essentially involves resampling methods.

The details of the algorithm will not be discussed here; instead, a general flow of the algorithm is outlined.

Briefly, the process involves first obtaining a subsample (or sub-batch) of comparably smaller size to make the computation efficiently feasible. After obtaining the parameter estimates from that subsample, the objective function is evaluated. The process is repeated a number of times. The (overall) parameter estimates are then taken as the estimates generated from the subsample that yielded the lowest value of the evaluated objective function.

Modern algorithms for robust regression are based on PROGRESS.
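To make the flow concrete, here is a bare-bones R sketch of such a resampling loop for the LMS objective; it is only a cartoon of the idea, not the actual PROGRESS implementation:

# Resampling in the spirit of PROGRESS: exact fits on random elemental subsets
lms_resample <- function(y, X, nsamp = 2000) {
  X <- cbind(1, X)                             # prepend the intercept column
  n <- nrow(X); p <- ncol(X)
  best <- Inf; beta_best <- NULL
  for (s in seq_len(nsamp)) {
    idx  <- sample(n, p)                       # draw a subset of p cases
    beta <- tryCatch(solve(X[idx, ], y[idx]),  # exact fit through the p points
                     error = function(e) NULL) # skip singular subsets
    if (is.null(beta)) next
    crit <- median((y - X %*% beta)^2)         # evaluate the LMS objective
    if (crit < best) { best <- crit; beta_best <- beta }
  }
  list(coefficients = beta_best, objective = best)
}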

4.2 SAS

LMS Regression  LMS regression in SAS is done in the Interactive Matrix Language (IML) environment. Ipso facto, SAS datasets generated via the DATA step or the IMPORT procedure (or any other SAS facility outside the IML procedure, for that matter) must be converted into an object that is usable within PROC IML. An example of how to do this can be seen in Appendix B.1.

That said, the following command is invoked within PROC IML to conduct an LMS regression:

CALL LMS(sc, coef, wgt, opt, y, x);

It is important to mention that the parameters of this function are in the form of a matrix or a vector. After all, the function is called inside the IML environment.

For the purposes of this introduction, only the last three of the six function parameters are discussed: opt, y, and x. The first three function parameters are implicitly left at their default values.


x is the matrix of explanatory values, with the rows as observations and the columns as explanatory variables. An additional column of 1s (for the intercept) need not be included, since opt can specify that an intercept be included among the estimated parameters.

y and opt are vectors: the vector of values of the response variable and the vector of the different options, respectively.

LTS Regression  Fortunately, invoking LTS regression in SAS is not as cumbersome as LMS regression; it is simply invoked via PROC ROBUSTREG. The general syntax is:

PROC ROBUSTREG DATA=dataset METHOD=LTS;
  MODEL response = var1 var2 ... vark / options;
RUN;

4.3 R

LMS Regression  Before LMS regression can be invoked in R, the MASS package must first be installed.

After installing the package, conducting an LMS regression is as simple as:

lmsreg(formula, dataframe)

which is actually a wrapper around the lqs function.

While there are other function parameters (such as the seed and weights), these parameter specifications are sufficient for this and the subsequent R functions presented.

LTS Regression  As with LMS regression, LTS regression in R also requires the MASS package, and is called via the following function:

ltsreg(formula, dataframe)

which, too, is an lqs wrapper function.

There is another option, though:

ltsReg(formula, dataframe)

which requires the robustbase package.
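For instance, a fit on a generic data frame dat (a hypothetical name) looks like the following; Appendix C shows the same calls on the Section 5 data:

library(MASS)
fit <- ltsreg(y ~ x1 + x2, data = dat)   # lqs(..., method = "lts") underneath
coef(fit)                                # the robust coefficient estimates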


5 Illustration: Land Use and Water Quality in New York Rivers

As a demonstration of how robust regression analysis usually goes, consider the following dataset taken from Haith (1976), as cited in Hamilton (1992).

Table 1: Land use and nitrogen content in 20 river basins.

Basin Agri Forest Urban Nitro

1 Olean 26 63 1.49 1.1

2 Cassadaga 29 57 0.79 1.01

3 Oatka 54 26 2.38 1.9

4 Neversink 2 84 3.88 1

5 Hackensack 3 27 32.61 1.99

6 Wappinger 19 61 3.96 1.42

7 Fishkill 16 60 6.71 2.04

8 Honeoye 40 43 1.64 1.65

9 Susquehanna 28 62 1.25 1.01

10 Chenago 26 60 1.13 1.21

11 Tioughnioga 26 53 1.08 1.33

12 West Canada 15 75 0.86 0.75

13 East Canada 6 84 0.62 0.73

14 Saranac 3 81 1.15 0.8

15 Ausable 2 89 1.05 0.76

16 Black 6 82 0.65 0.87

17 Schoharie 22 70 1.12 0.8

18 Raquette 4 75 0.58 0.87

19 Oswegatchie 21 56 0.63 0.66

20 Chocton 40 49 1.23 1.25

The variable Basin is the name of the river basin (or the area containing it). Agri is the percentage of land in active agriculture, while Forest is the percentage of land forested, brushland, or plantation. Urban is the percentage of urban land (including residential, commercial, and industrial). Nitro is the nitrogen concentration in the river water (in mg/L).

That said, the data in Table 1 were used to explore the effect of the different types of land use on nonpoint-source water pollution.

OLS, LMS, and LTS regression models are fitted on this dataset and then compared. For the purposes of this demonstration, SAS outputs are used. The code used to generate the outputs is in Appendix B.1.


(a) OLS Model

(b) OLS Residuals

Figure 6: Selected SAS PROC REG Outputs


(a) LMS Model (b) LMS Residuals

Figure 7: Selected SAS PROC IML Outputs


(a) LTS Model (b) LTS Model $R^2$ (c) LTS Residuals

Figure 8: Selected SAS PROC ROBUSTREG Outputs


In comparing the models fitted using OLS, LMS, and LTS regression, there are three key aspects that could be scrutinized as far as resistance goes: (i) $R^2$, (ii) parameter estimates, and (iii) identified outlying observations.

Coefficient of determination  Note that the OLS $R^2$, which is approximately 65 percent, is comparably lower than the LMS and LTS $R^2$'s, which are approximately 91 percent and 88 percent, respectively. Not much information can be obtained from these values, though. In order to gain a deeper understanding of the dynamics of the relationship between the extents of the different land uses and water pollution, let us look at the parameter estimates.

Parameter estimates  To facilitate the comparison of the resulting parameter estimates from the three regression procedures, Table 2 below summarizes the parameter estimates taken from Figures 6(a), 7(a), and 8(a).

Table 2: Parameter Estimates from the Fitted OLS, LMS, and LTS models

Variable OLS LMS LTS

Agri 0.0085 -0.0116 -0.0151

Forest -0.0084 -0.0288 -0.0319

Urban 0.0293 0.1413 0.1235

Table 2 shows that the OLS estimates of the effects of the different types of land use on water pollution are comparatively marginal, at around half the magnitude of the effects estimated by the LMS and LTS procedures, and that their stability is rendered questionable by the estimated standard errors (as shown in Figure 6(a)).

The effects of each land-use type estimated by the LMS and LTS regression methods do not lie very far from one another, and are more pronounced than the OLS estimates (as has already been mentioned). What is important to note is the change in sign of the parameter estimate for the percentage of agricultural land use.

Ipso facto, it can be inferred that the outlying observation(s) are (bad) leverage points with respect to the percentage of agricultural land use (which shall be further discussed below).

That said, it is apparent from the robust parameter estimates that, among the three variables, the percentage of urban land use has the largest (positive) effect on water pollution. It could be that the discharge of urban wastes into the surrounding waters increases their nitrogen concentration because these kinds of wastes have the largest nitrogen content.

The negative signs on the agricultural and forested land percentages, on the other hand, could be attributed to the relatively smaller nitrogen content associated with these types of land use compared to the nitrogen content of the wastes associated with the land uses they displace (particularly urban).


The magnitude of the effect of the percentage of forested land being around twice that of the percentage of land used for agriculture could be due to the relative lack of nitrogen in the run-off from the former land use compared to the latter.

So, in essence, these parameter estimates can be interpreted through the nitrogen content associated with the types of land use, with urban wastes apparently possessing the largest nitrogen content.

Hence, based on these estimates from robust solutions, it is possible that the outlying observations have an increased nitrogen concentration in their surrounding waters, despite an increased percentage of agricultural land use, because of major agricultural activities (such as industrial farming, extensive use of fertilizers, etc.) that produce run-off wastes contributing to the proliferation of nitrogen pollution in the surrounding bodies of water. This can be inferred from the positive sign of the OLS estimate of the change in nitrogen concentration due to a change in the percentage of agricultural land use, an estimate which these observations pulled toward them.

Outlying observations  Note that based on the fitted OLS model, observations 5, 7, and 19 (Hackensack, Fishkill, Oswegatchie) have been identified as outliers (Figure 6(b)), while the robust models identified only observations 5 and 19 (Hackensack and Oswegatchie) as outliers (Figures 7(b) and 8(c)).

Following the logic of the preceding paragraphs, it could be that Hackensack and Oswegatchie are areas with industrial agricultural activities, and that their inclusion in the sample pulled the OLS estimates toward their dynamics and away from the usual pattern: boosting the quantitative effect of the percentage of agricultural use on nitrogen concentration, and slightly pushing that of the percentage of forested land upward, at the expense of the quantitative effect of the percentage of urbanised land.

Meanwhile, Fishkill is an area that is consistent with the pattern exhibited by the majority of the areas near the subject river basin.


References

Bhar, L. (n.d.). Robust regression.

Chen, C. (n.d.). Robust Regression and Outlier Detection with the ROBUSTREG Procedure.

Haith, D. A. (1976). Land use and water quality in New York rivers.

Hamilton, L. C. (1992). Regression with Graphics: A Second Course in Applied Statistics. Duxbury Press, Belmont, California.

O'Kelly, M. (2006). A Tour Around PROC ROBUSTREG. In PhUSE.

Ripley, B., Venables, B., Hornik, K., Gebhardt, A., and Firth, D. (2013). MASS: Support Functions and Datasets to Support Venables and Ripley, "Modern Applied Statistics with S" (4th edition, 2002).

Rousseeuw, P. J. (1984). Least Median of Squares Regression. Journal of the American Statistical Association, 79(388):871-880.

Rousseeuw, P. J., Croux, C., Todorov, V., Ruckstuhl, A., Salibian-Barrera, M., Verbeke, T., Koller, M., Maechler, M., et al. (2012). robustbase: Basic Robust Statistics.

Rousseeuw, P. J. and Hubert, M. (1997). Recent developments in PROGRESS. IMS Lecture Notes - Monograph Series, 31.

Rousseeuw, P. J. and Leroy, A. (1987). Robust Regression & Outlier Detection. John Wiley & Sons.

SAS Institute, Inc. (2011). SAS 9.3 Help and Documentation. Cary, North Carolina.

Verardi, V. and Croux, C. (2009). Robust regression in Stata. The Stata Journal, 9(3):439-453.

Yaffee, R. A. (2002). Robust Regression Analysis: Some Popular Statistical Package Options.

A R functions for Exact Univariate LMS and LTS Estimation

> ## LMS ####

> lms <- function(x){

+ h <- (floor(length(x)/2)+1)

+ y.diff <- x[h:length(x)] - x[1:(h-1)]

+ min.i <- which(y.diff == min(y.diff))

+ print(mean(c(x[min.i], x[min.i+h-1])))

+ }

> ## LTS ####

> lts <- function(x){

+ h <- (floor(length(x)/2)+1)

+ sq <- rep(NA, length(x)-h+1)

+ for(j in 1:(length(x)-h+1)){

+ sq[j] <- var(x[j:(j+h-1)])

+ }

+ min.j <- which(sq == min(sq))

+ print(mean(x[min.j:(min.j+h-1)]))

+ }

> ## Example ----

> ODex <- c(40, 75, 80, 83, 86, 88, 90, 92, 93, 95)

> lms(ODex)

[1] 90.5

> lts(ODex)

[1] 90.66667


B SAS Code and Outputs used for Section 5

B.1 SAS Code

OPTIONS noxsync noxwait nodate nonumber;
LIBNAME robreg 'path';
X 'path\filename.xlsx'; /* must remain open until the SAS dataset registers */
FILENAME hdata
  DDE 'EXCEL|path\[filename.xlsx]sheet!r2c1:r21c5'
  NOTAB
;
DATA robreg.hamilton;
  INFILE hdata DLM='09'x DSD;
  INPUT Basin :$11. Agri Forest Urban Nitro;
  LABEL Basin = 'New York river basin'
        Agri  = 'Percentage of land in active agriculture'
        Forest = 'Percentage of land forested, brushland, or plantation'
        Urban = 'Percentage of land urban'
        Nitro = 'Nitrogen concentration in river water (mg/l)'
  ;
RUN;

/***********/
/*   OLS   */
/***********/
PROC REG DATA=robreg.hamilton;
  TITLE 'OLS Regression';
  MODEL Nitro = Agri Forest Urban / R;
RUN;
QUIT;

/***********/
/*   LMS   */
/***********/
PROC IML;
  TITLE 'LMS Regression';
  USE robreg.hamilton;                /* converts the dataset for use in IML  */
  READ ALL VAR _ALL_ INTO hamilData;  /* reads everything from the dataset    */
  nitro = hamilData[, 4];             /* extract response values              */
  land  = hamilData[, 1:3];           /* extract explanatory values           */
  opt = J(8, 1, .);                   /* 8x1 options vector, values set to .  */
  opt[2] = 2;                         /* more info in output; others default  */
  CALL LMS(sc, coef, wgt, opt, nitro, land);
QUIT;

/***********/
/*   LTS   */
/***********/
PROC ROBUSTREG DATA=robreg.hamilton METHOD=LTS;
  TITLE 'LTS Regression';
  MODEL Nitro = Agri Forest Urban / DIAGNOSTICS(ALL) LEVERAGE;
RUN;

B.2 SAS Outputs

The REG Procedure

Model: MODEL1

Dependent Variable: Nitro Nitrogen concentration in river water (mg/l)

OLS Regression

Number of Observations Read 20

Number of Observations Used 20

Analysis of Variance

Source DF

Sum of

Squares

Mean

Square F Value Pr > F

Model 3 2.36602 0.78867 10.04 0.0006

Error 16 1.25656 0.07853

Corrected Total 19 3.62257

Root MSE 0.28024 R-Square 0.6531

Dependent Mean 1.15750 Adj R-Sq 0.5881

Coeff Var 24.21085

Parameter Estimates

Variable Label DF

Parameter

Estimate

Standard

Error t Value Pr > |t|

Intercept Intercept 1 1.42745 1.29346 1.10 0.2861

Agri Percentage of land in active agriculture 1 0.00851 0.01581 0.54 0.5981

Forest Percentage of land forested, brushland, or plantation 1 -0.00843 0.01448 -0.58 0.5684

Urban Percentage of land urban 1 0.02934 0.02757 1.06 0.3031

29

B.2 SAS Outputs B SAS CODE AND OUTPUTS USED FOR SECTION 5

The REG Procedure

Model: MODEL1

Dependent Variable: Nitro Nitrogen concentration in river water (mg/l)

OLS Regression

Output Statistics

Obs

Dependent

Variable

Predicted

Value

Std Error

Mean Predict Residual

Std Error

Residual

Student

Residual -2-1 0 1 2

Cook's

D

1 1.1000 1.1610 0.0898 -0.0610 0.265 -0.230 | | | 0.002

2 1.0100 1.2166 0.0764 -0.2066 0.270 -0.766 | *| | 0.012

3 1.9000 1.7373 0.1695 0.1627 0.223 0.729 | |* | 0.077

4 1.0000 0.8499 0.1157 0.1501 0.255 0.588 | |* | 0.018

5 1.9900 2.1820 0.2741 -0.1920 0.0582 -3.299 |******| | 60.413

6 1.4200 1.1908 0.0646 0.2292 0.273 0.840 | |* | 0.010

7 2.0400 1.2544 0.0706 0.7856 0.271 2.897 | |***** | 0.142

8 1.6500 1.4531 0.1112 0.1969 0.257 0.765 | |* | 0.027

9 1.0100 1.1794 0.0991 -0.1694 0.262 -0.646 | *| | 0.015

10 1.2100 1.1758 0.0702 0.0342 0.271 0.126 | | | 0.000

11 1.3300 1.2333 0.1210 0.0967 0.253 0.382 | | | 0.008

12 0.7500 0.9478 0.0855 -0.1978 0.267 -0.741 | *| | 0.014

13 0.7300 0.7883 0.1009 -0.0583 0.261 -0.223 | | | 0.002

14 0.8000 0.8036 0.1101 -0.003628 0.258 -0.0141 | | | 0.000

15 0.7600 0.7247 0.1213 0.0353 0.253 0.140 | | | 0.001

16 0.8700 0.8060 0.0948 0.0640 0.264 0.243 | | | 0.002

17 0.8000 1.0571 0.1099 -0.2571 0.258 -0.997 | *| | 0.045

18 0.8700 0.8460 0.1621 0.0240 0.229 0.105 | | | 0.001

19 0.6600 1.1523 0.1576 -0.4923 0.232 -2.124 | ****| | 0.521

20 1.2500 1.3905 0.1248 -0.1405 0.251 -0.560 | *| | 0.019

Sum of Residuals 0

Sum of Squared Residuals 1.25656

Predicted Residual SS (PRESS) 21.55732


LMS: The 12th ordered squared residual will be minimized.

There are 4845 subsets of 4 cases out of 20 cases.

The algorithm will draw 2000 random subsets of 4 cases.

Random Subsampling for LMS

Minimum Criterion= 0.1776901227

Least Median of Squares (LMS) Method

Minimizing 12th Ordered Squared Residual.

Highest Possible Breakdown Value = 45.00 %

LMS Regression

Median and Mean

Median Mean

VAR1 20 19.4

VAR2 61.5 62.85

VAR3 1.14 3.2405

Intercep 1 1

Response 1.01 1.1575

Dispersion and Standard Deviation

Dispersion StdDev

VAR1 17.049925513 14.730562572

VAR2 19.273828841 17.842217823

VAR3 0.6226929318 7.0778806484

Intercep 0 0

Response 0.3632375435 0.4366484193

Subset Singular

Best

Criterion Percent

500 0 0.195351 25

1000 0 0.183161 50

1500 0 0.177690 75

2000 0 0.177690 100


Random Selection of 2000 Subsets

All 2000 Subsets were Nonsingular

LMS Objective Function = 0.0645437237

Preliminary LMS Scale = 0.09586219

Robust R Squared = 0.9139278457

Final LMS Scale = 0.1024008633

Observations of Best Subset

14 7 11 1

Estimated Coefficients

VAR1 VAR2 VAR3 Intercep

-0.011581144 -0.028795009 0.1413416723 2.9400524804

LMS Residuals

N Observed Estimated Residual Res / S

1 1.100000 1.035456 0.064544 0.630304

2 1.010000 1.074544 -0.064544 -0.630304

3 1.900000 1.902394 -0.002394 -0.023375

4 1.000000 1.046515 -0.046515 -0.454246

5 1.990000 6.736996 -4.746996 -46.356990

6 1.420000 1.523228 -0.103228 -1.008080

7 2.040000 1.975456 0.064544 0.630304

8 1.650000 1.470422 0.179578 1.753680

9 1.010000 1.007167 0.002833 0.027666

10 1.210000 1.070958 0.139042 1.357818

11 1.330000 1.265456 0.064544 0.630304

12 0.750000 0.728264 0.021736 0.212269

13 0.730000 0.539417 0.190583 1.861149

14 0.800000 0.735456 0.064544 0.630304

15 0.760000 0.502543 0.257457 2.514206

16 0.870000 0.601247 0.268753 2.624519

17 0.800000 0.827919 -0.027919 -0.272648

18 0.870000 0.816080 0.053920 0.526554

19 0.660000 1.173373 -0.513373 -5.013368

20 1.250000 1.239702 0.010298 0.100570


Distribution of Residuals

Median(U)= 4.6559203206

The run has been executed successfully.

MinRes 1st Qu. Median Mean 3rd Qu. MaxRes

-4.746995751 -0.037217268 0.0378280304 -0.206129679 0.1017927118 0.2687530018

Resistant Diagnostic

N U

Resistant

Diagnostic

1 2.150564 0.461899

2 3.893599 0.836268

3 6.843048 1.469752

4 16.385379 3.519257

5 174.497682 37.478666

6 15.235979 3.272388

7 28.046416 6.023818

8 5.271374 1.132187

9 2.845798 0.611221

10 2.466932 0.529849

11 6.035103 1.296221

12 2.471721 0.530877

13 3.207251 0.688854

14 4.910965 1.054779

15 4.400876 0.945221

16 4.163891 0.894322

17 3.346356 0.718731

18 8.768730 1.883351

19 15.481396 3.325099

20 3.105903 0.667087


The ROBUSTREG Procedure

LTS Regression

Model Information

Data Set ROBREG.HAMILTON

Dependent Variable Nitro Nitrogen concentration in river water (mg/l)

Number of Independent Variables 3

Number of Observations 20

Method LTS Estimation

Number of Observations Read 20

Number of Observations Used 20

Parameter Information

Parameter Effect

Intercept Intercept

Agri Agri

Forest Forest

Urban Urban

Summary Statistics

Variable Q1 Median Q3 Mean

Standard

Deviation MAD

Agri 5.0000 20.0000 27.0000 19.4000 14.7306 17.0499

Forest 54.5000 61.5000 78.0000 62.8500 17.8422 19.2738

Urban 0.8250 1.1400 2.0100 3.2405 7.0779 0.6227

Nitro 0.8000 1.0100 1.3750 1.1575 0.4366 0.3632

LTS Profile

Total Number of Observations 20

Number of Squares Minimized 16

Number of Coefficients 4

Highest Possible Breakdown Value 0.2500

LTS Parameter Estimates

Parameter DF Estimate

Intercept 1 3.2853

Agri 1 -0.0151

Forest 1 -0.0319

Urban 1 0.1235

Scale (sLTS) 0 0.1154

Scale (Wscale) 0 0.1147


Diagnostics

Obs Mahalanobis Distance Robust MCD Distance Leverage

Standardized

Robust Residual Outlier

1 1.0005 1.1798 0.2945

2 0.6791 1.6962 -1.0105

3 2.4496 2.9214 -0.2931

4 1.5125 10.9064 * -0.4692

5 4.1510 108.6323 * -38.4804 *

6 0.2431 9.9706 * -1.0524

7 0.5042 19.5216 * 0.7195

8 1.4287 1.3156 1.2085

9 1.1948 1.0546 -0.2465

10 0.4908 0.3881 0.8064

11 1.6098 1.4173 -0.0408

12 0.9043 0.9732 -0.1896

13 1.2310 1.3119 1.2123

14 1.4089 2.1799 0.0225

15 1.6161 1.9827 1.8749

16 1.1055 1.1118 1.8441

17 1.4045 1.3030 -0.5023

18 2.3245 2.0565 -0.2911

19 2.2486 2.2784 -5.2212 *

20 1.6793 1.8679 -0.1682

Diagnostics Summary

Observation Type Proportion Cutoff

Outlier 0.1000 3.0000

Leverage 0.2000 3.0575

R-Square for LTS Estimation

R-Square 0.8853


C R Script and Results used for Section 5

> ## start: Data Input ########

> require("XLConnect")

> data.hamilton <- readWorksheetFromFile("hamildata.xlsx",

+ sheet="Data",

+ header=T,

+ startRow=1, endRow=21,

+ startCol=1, endCol=5)

> ## end: Data Input ########

>

> ## OLS ====

> OLSmodel.hamilton <- (lm(Nitro~Agri+Forests+Urban,

+ data=data.hamilton))

> summary(OLSmodel.hamilton)

Call:

lm(formula = Nitro ~ Agri + Forests + Urban, data = data.hamilton)

Residuals:

Min 1Q Median 3Q Max

-0.49229 -0.17505 0.01018 0.11003 0.78560

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 1.427453 1.293457 1.104 0.286

Agri 0.008505 0.015815 0.538 0.598

Forests -0.008433 0.014479 -0.582 0.568

Urban 0.029337 0.027572 1.064 0.303

Residual standard error: 0.2802 on 16 degrees of freedom

Multiple R-squared: 0.6531, Adjusted R-squared: 0.5881

F-statistic: 10.04 on 3 and 16 DF, p-value: 0.0005817

> ## LMS ====

> require("MASS")

> LMSmodel.hamilton <- lmsreg(Nitro~Agri+Forests+Urban,

+ data=data.hamilton)

> print(LMSmodel.hamilton)

Call:

lqs.formula(formula = Nitro ~ Agri + Forests + Urban, data = data.hamilton,

method = "lms")

Coefficients:

(Intercept) Agri Forests Urban


3.40116 -0.01693 -0.03338 0.10374

Scale estimates 0.05056 0.05213

> ## LTS ====

> require("MASS")

> LTSmodel.hamilton.MASS <- ltsreg(Nitro~Agri+Forests+Urban,

+ data=data.hamilton)

> print(LTSmodel.hamilton.MASS)

Call:

lqs.formula(formula = Nitro ~ Agri + Forests + Urban, data = data.hamilton,

method = "lts")

Coefficients:

(Intercept) Agri Forests Urban

3.36336 -0.01672 -0.03334 0.13787

Scale estimates 0.08284 0.09678

> # or

> require("robustbase")

> LTSmodel.hamilton.robustbase <- ltsReg(Nitro~Agri+Forests+Urban,

+ data=data.hamilton)

> summary(LTSmodel.hamilton.robustbase)

Call:

ltsReg.formula(formula = Nitro ~ Agri + Forests + Urban, data = data.hamilton)

Residuals (from reweighted LS):

Min 1Q Median 3Q Max

-0.130496 -0.070646 -0.001935 0.074345 0.154205

Coefficients:

Estimate Std. Error t value Pr(>|t|)

Intercept 3.115874 0.575651 5.413 9.15e-05 ***

Agri -0.014039 0.006925 -2.027 0.062103 .

Forests -0.029167 0.006411 -4.549 0.000454 ***

Urban 0.119286 0.019622 6.079 2.84e-05 ***

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1056 on 14 degrees of freedom

Multiple R-Squared: 0.9416, Adjusted R-squared: 0.9291

F-statistic: 75.28 on 3 and 14 DF, p-value: 7.066e-09


D Software Specifications Used

Software Version

SAS 9.3 TS Level 1M0

R 2.15.2 (2012-10-26)

XLConnect package 0.2-3

MASS package 7.3-22

robustbase package 0.9-4

E Review Questions

True or False

1. The term robust in robust regression is a misnomer because robust regression only

deals with outlying values, i.e. it does not address assumption violations.

[FALSE]

2. LMS and LTS estimation are forms of M-estimation because they minimize an objective

function, after all. [FALSE]

3. If a wayward observation is considered both as an outlier in the y-direction and as a

leverage point, then it will always be consistent with the linear trend followed by the

majority of the data points. [FALSE]

4. Software implementations of LMS and LTS robust regression use resampling techniques only because said procedures are asymptotically inefficient. [FALSE]

5. The LTS estimator is relatively asymptotically inefficient. [FALSE]

Multiple Choice

1. Also known as a leverage point

a.) Outlier in the x-direction

b.) Outlier in the y-direction

c.) Regression outlier

2. What kind of outliers does robust regression help identify?

a.) Outlier in the x-direction

b.) Outlier in the y-direction

c.) Regression outlier


3. In a positively skewed distribution, the LMS and LTS estimators will always be:

a.) less than the mean and the median

b.) between the mean and the median of said distribution

c.) greater than the mean and the median of said distribution

4. The breakdown bounds of the LMS and LTS univariate estimators are (approximately):

a.) 0%

b.) 25%

c.) 50%

5. The LTS estimator only includes the sub-batch of size ______ with the lowest squared residuals in computing for its value.

a.) $\left\lfloor \frac{n}{2} \right\rfloor + 1$

b.) $n - \left\lfloor \frac{n}{2} \right\rfloor$

c.) $n - \left\lfloor \frac{n}{2} \right\rfloor - 1$
