
Stat 146 WFW

2nd Sem. A.Y. 2012-13

An Introduction to Robust Regression

Carl Dominick CALUB
Robert Elcivir RULONA
Emkay EVANGELISTA
Contents

1 Introduction ................................................... 1
  1.1 Outliers: What they are and what they do ................... 1
  1.2 Robust Regression: What does it do ......................... 5
2 Univariate Robust Estimation ................................... 7
  2.1 LMS ....................................................... 7
  2.2 LTS ....................................................... 10
  2.3 Large Batch Estimation .................................... 14
3 Robust Regression .............................................. 15
  3.1 LMS Regression ............................................ 15
  3.2 LTS Regression ............................................ 15
  3.3 Inference in Robust Regression ............................ 16
4 Software Implementations ....................................... 18
  4.1 PROGRESS algorithm ........................................ 18
  4.2 SAS ....................................................... 18
  4.3 R ......................................................... 19
5 Illustration: Land Use and Water Quality in New York Rivers ... 20
Bibliography ..................................................... 26
Appendix ......................................................... 27
  A R functions for Exact Univariate LMS and LTS Estimation ..... 27
  B SAS Code and Outputs used for Section 5 ..................... 28
    B.1 SAS Code ................................................ 28
    B.2 SAS Outputs ............................................. 29
  C R Script and Results used for Section 5 ..................... 37
  D Software Specifications Used ................................ 39
  E Review Questions ............................................ 39
1 Introduction
Before proceeding to discuss robust regression, salient features of ordinary least squares
(OLS) regression must first be revisited.
Ordinary Least Squares Regression  The classical linear model,

    Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip} + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2) \;\; \forall i,

estimates its parameters, \beta_0, \beta_1, \ldots, \beta_p, as the values that would minimize the sum of the
squared residuals, i.e.

    \hat{\beta}_j = \arg\min_{\beta_j} \left\{ \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \right\}, \qquad \text{where} \quad \hat{y} = \hat{\beta}_0 + \sum_{j=1}^{p} \hat{\beta}_j X_j, \quad j = 0, 1, \ldots, p.
OLS regression is popular because of the convenience brought about by its properties:
parameter estimates are BLUE, computation is easy, and interpretation is simple.
However, there is a caveat to the beauty of OLS regression: it imposes stringent assumptions,
viz. normality, independence of observations, and homoskedasticity. OLS is quite
sensitive to departures from these classical assumptions.
But it is not just the fulfillment of the classical assumptions that affects the tenability of
inferences. OLS regression is quite sensitive to outliers because of the nature of how the
parameter estimates are arrived at.
1.1 Outliers: What they are and what they do
Outliers persist for various reasons: encoding errors, data contamination, or observations
surrounded by unique circumstances. Regardless of source, outliers pose a serious threat to
data analysis through the distortion of resulting inferences.
In fact, the presence of outliers introduces non-normality into the equation through heavy-tailed
error distributions (Hamilton, 1992). Robust regression assigns lower weights to outlying
observations so as to limit their spurious influence, thus rendering resistance to the
inferences.
In order to appreciate the benefits brought by robust regression, the different characteristics
of outliers and how they garble the analysis are presented.
Leverage Point  An observation whose explanatory value(s) lie far from the bulk of the
dataset is deemed a leverage point. Leverage points need to be paid special attention
because of their potential to greatly influence the resulting OLS estimates. Ipso facto, the
presence of a leverage point has the potential to severely distort inferences made from the
subject data.
To illustrate its effect (and understand where the term "leverage" comes from), consider the
following datasets taken from Rousseeuw and Leroy (1987).
Dataset.noLev
      x     y
1  0.20  0.30
2  1.00  1.23
3  1.27  1.78
4  1.57  2.79
5  2.10  3.90

Dataset.wLev
      x     y
1  5.00  0.30
2  1.00  1.23
3  1.27  1.78
4  1.57  2.79
5  2.10  3.90
One of the datapoints in Dataset.noLev has been erroneously encoded into Dataset.wLev,
in particular the x-value, causing an observation to lie far from the other data points along
the x-axis (a plot is presented in Figure 1 to better visualize the
datasets). The resulting fitted OLS models on the two datasets are then compared.
Fitted OLS Model without Leverage
R-squared: 0.9557
Parameter Estimates:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.3726911 0.3316215 -1.123845 0.342892886
x 1.9321589 0.2402323 8.042877 0.004014026
Fitted OLS Model with Leverage
R-squared: 0.2277
Parameter Estimates:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.8956610 1.1431918 2.5329618 0.08520275
x -0.4093515 0.4352826 -0.9404268 0.41637622
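The slope reversal in the two outputs above can be reproduced in a few lines. The following sketch (ours, not part of the original illustration) fits simple OLS via the closed-form formulas; only the five (x, y) pairs listed earlier are used, and all names are illustrative.

```python
# Sketch (not from the original text): re-fitting the two small datasets with
# ordinary least squares to reproduce the slope reversal caused by the
# leverage point. Plain closed-form simple linear regression, no libraries.

def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line through (x, y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return my - slope * mx, slope

y = [0.30, 1.23, 1.78, 2.79, 3.90]
x_no_lev = [0.20, 1.00, 1.27, 1.57, 2.10]   # Dataset.noLev
x_w_lev  = [5.00, 1.00, 1.27, 1.57, 2.10]   # Dataset.wLev (first x mis-encoded)

b0_a, b1_a = ols_fit(x_no_lev, y)   # slope ≈ 1.93: steep, positive trend
b0_b, b1_b = ols_fit(x_w_lev, y)    # slope ≈ -0.41: the sign flips entirely
```

Running the sketch recovers the printed estimates: the slope is about 1.93 without the leverage point and about -0.41 with it, even though only a single x-value differs between the datasets.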
Notice that the stability of the OLS model fitted on the dataset with the leverage point is
comparably lower than that of the one fitted on the dataset without, with the R-squared falling
from 0.9557 to 0.2277. Furthermore, the validity of the parameter estimates of the fitted
OLS model has become dubious upon the introduction of the leverage point, as reflected by
the difference in the standard errors (or equivalently, the p-values).
Apart from the degradation in the tenability of the parameter estimates, juxtaposing the two
models also points out the drastic change in the estimated slope. This is a very dangerous
case in the context of regression as it may lead to misleading inferences.
Figure 1: Scatterplot and Fitted OLS Lines of Dataset.noLev and Dataset.wLev
The substantial change in the values of the parameter estimates caused by the presence of
the leverage point is illustrated in Figure 1.
On another note, notice that the outlier pulled the fitted OLS model towards it, similar
to how an external force acting on a lever changes the lever's orientation (hence the term
"leverage").
Setting the trivia aside, the potential for leverage points to mislead comes not only
from the wild change in parameter estimates, but also from how the drastic change in the
fitted OLS line masks which observations are supposed to be treated as outliers. In other
words, discrimination of outliers based on the fitted regression model becomes misleading as
well.
To provide an insight on this, the residuals of Dataset.wLev from its fitted OLS model
and the residuals of the same dataset from the model fitted on Dataset.noLev¹ are compared.

¹ Technically speaking, this kind of procedure is spurious. The proper procedure will be discussed later on.
But for the purposes of illustrating the effects of leverage points, this example is enough, since the premise
is that the fitted OLS model on the dataset without a leverage point and the model fitted on the bulk of
the observations in the dataset are close enough.
From OLS Model
x y Residuals Std.Residuals
1 5.00 0.30 -0.5489037 -0.345044
2 1.00 1.23 -1.2563095 -1.331951
3 1.27 1.78 -0.5957846 -0.689797
4 1.57 2.79 0.5370208 0.676806
5 2.10 3.90 1.8639771 2.548244
From Model Fitted on Data without Leverage
x y Residuals Std.Residuals
1 5.00 0.30 -8.9881033 -1.97266528
2 1.00 1.23 -0.3294678 -0.13005900
3 1.27 1.78 -0.3011507 -0.12613446
4 1.57 2.79 0.1292017 0.04767381
5 2.10 3.90 0.2151575 0.05292135
Looking at the standardized residuals from the first and second models generated, it can
be observed that the observations considered as outliers by the two models are different.
The OLS model identifies the observation at x = 2.1 as an outlier, despite its consistency
with the general linear trend followed by the rest of the points. Meanwhile, the second
model discriminates the observation at x = 5.00 as the relatively wayward one, which is not
surprising because the second set of residuals has been obtained from a model fitted on a
set of points that closely follow a specific linear trend.
While on the topic of identifying outlying observations using residuals, it is worth mentioning
that although Studentized residuals could be applied to the residuals of the OLS model
fitted on the dataset with a leverage point in this example, and would discover that the first
observation is actually the outlier instead of the fifth, this is not always the case.
The Studentized residual only singles out one observation at a time. Ipso facto, the inclusion
of other outliers in the computation of the Studentized residual of one of the actual outliers
can fail to inflate that Studentized residual.
Apart from the propensity of leverage points to severely affect analysis through spurious
estimates and discrimination of outliers, special attention is given to them because they are
more likely to occur in a multidimensional setting. Naturally, the consideration of more
explanatory variables provides more opportunities for leverage points to appear.
That said, not all outliers are detrimental to analysis; some outliers are benign precisely
because they do not debilitate the inferences.
Figure 2 highlights how important it is to keep in mind that leverage points
only have the potential to impair analysis. While Figure 2(a) is similar to the illustration of
the effects of leverage points (Dataset.noLev and Dataset.wLev), Figure 2(b) shows that
the outlying observation, despite being a leverage point, is still consistent with the linear
trend followed by the bulk of the data points.

(a) Debilitating Outlier  (b) Benign Outlier
Figure 2: Examples of a Debilitating and a Benign Leverage Point
Source: Hamilton (1992)
It is not concomitant that a leverage point consistent with the linear trend followed by the
majority of the dataset would also be an outlier in the y-direction, nor that an observation
considered both a leverage point and an outlier in the y-direction must be consistent with
the linear trend of the majority. If the value of its response or one of its explanatory variables
is too far off, then it will not follow the linear trend.
Outlying points that deviate from the linear trend exhibited by the majority of the
datapoints are labeled regression outliers.
That said, it is actually the presence of regression outliers that erodes the tenability of the
parameter estimates. So, leverage points are considered detrimentally influential if they
are regression outliers as well. If not, then the subject observation is a benign leverage
point.
1.2 Robust Regression: What does it do
Before discussing the essence of robust regression, it is worth mentioning that in OLS
regression, outliers are determined based on their deviation from the fitted line using various
measures such as the adjusted, standardized, and Studentized residuals; DFFits; DFBetas;
Cook's Distance; etc.
As previously mentioned, this sort of discrimination via residuals could lead to complications,
as fitting an OLS line would mask regression outliers before their effects are marginalized.
Robust regression, on the other hand, fits a line using resistant estimators first. In using
resistant estimators in the estimation process itself, the effects of outlying values are marginalized,
thus obtaining a robust solution.
Figure 3: Illustration of the difference in response of OLS and robust regression to outliers
Source: Hamilton (1992)
Notice in Figure 3 that the OLS fitted model was pulled downward by the data values of four
regression outliers (San Jose, San Diego, San Francisco, and Los Angeles), while the
fitted model using robust regression resisted the excessive influence imposed by said outliers.
In this sense, robust regression is sometimes referred to as resistant regression.
It is noteworthy that these resistant estimators essentially assign weights to observations
(similar to L-estimators). Wayward observations are just assigned lower weights, but not
always zero weights.
Outliers are then identified based on their deviation from the robust solution.
While there is a plethora of robust estimators available (e.g. the repeated median, iteratively
reweighted least squares), this article will focus on only two: the least median of
squares and the least trimmed squares. These two estimators are noted for their very high
breakdown bound.
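The idea of a breakdown bound can be made concrete with a small numeric sketch (ours, not from the text): an estimator with a 50% breakdown bound, like the median, tolerates gross contamination in up to half the batch, while the sample mean breaks down under even a few wild values. The batch and contamination values below are illustrative.

```python
# Sketch: contaminating a batch to see which location estimates "break down".
# The batch and the contamination values are illustrative, not from the text.

def mean(xs):
    return sum(xs) / len(xs)

def median(xs):
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

clean = [83, 86, 88, 90, 92, 93, 95, 96, 97, 98]
# Replace 4 of the 10 values (40% contamination) with gross outliers:
dirty = [-1000, -1000, -1000, -1000, 92, 93, 95, 96, 97, 98]

# The mean is dragged far outside the range of the clean data,
# while the median stays within it.
```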
2 Univariate Robust Estimation
This section will be presenting two estimators used in robust regression the least median
of squares (LMS) and the least trimmed squares (LTS) but only under the univariate
setting. LTS and LMS estimation under multidimensional data will be presented in the next
section, but already under the context of regression since estimation involving more than
one variables is usually done in this setting.
That said, this section will proceed as follows: a brief description about the estimation
procedure is presented, followed by an outline on its computation, then an illustration, and
ending with a presentation of its properties.
2.1 Least Median of Squares (LMS)
As the name implies, the LMS estimator, \hat{\theta}_{LMS}, is computed as the value that would minimize
the median squared deviation, i.e.:

    \hat{\theta}_{LMS} = \arg\inf_{\theta} \left\{ \mathrm{Med}\!\left[ (y_i - \theta)^2 \right] \right\}

In fact, this definition would imply that the LMS estimator's objective function is given
by:

    \rho(y_i; \theta) = \mathrm{Med}\!\left[ (y_i - \theta)^2 \right]
However, LMS estimation is not a form of M-estimation because the objective function above
does not include all observations, which is inconsistent with the definition of M-estimation.
In fact, the Help and Documentation of SAS 9.3 (SAS Institute, Inc., 2011) differentiates
LMS and LTS estimation from M-estimation.
Computing for the LMS estimator

(1) First order of business is to arrange the batch of size n in ascending order:

    y_{(1)}, y_{(2)}, \ldots, y_{(n)}, \qquad y_{(1)} \le y_{(2)} \le \cdots \le y_{(n)}

If n is odd, just repeat the median and include it in the ordered batch. Adjust the batch
size accordingly and still denote it by n.

(2) Compute for:

    h = \lfloor n/2 \rfloor + 1

(3) Partition the batch into two overlapping sub-batches, where the second starts at y_{(h)}:

    y_{(1)}, y_{(2)}, \ldots, y_{(n-h+1)} \qquad \text{and} \qquad y_{(h)}, y_{(h+1)}, \ldots, y_{(n)}

Note that both of the sub-batches are of size n − h + 1. Ipso facto, there is a
one-to-one correspondence between the sub-batches.

(4) Compute for:

    y^{(d)}_i = y_{(i+h-1)} - y_{(i)}, \qquad i = 1, 2, \ldots, n-h+1

(5) The LMS estimate is the midpoint of the values corresponding to the pair with the least
difference, i.e.:

    \hat{\theta}_{LMS} = \frac{y_{(h+k-1)} + y_{(k)}}{2} \qquad \text{where} \qquad y_{(h+k-1)} - y_{(k)} = \min_i y^{(d)}_i
Illustration  Consider the following batch of numbers taken from Rousseeuw and Leroy
(1987):

    40 75 80 83 86 88 90 92 93 95

Note that n = 10, which means that h = \lfloor 10/2 \rfloor + 1 = 6. Hence, a dividing line is cast just
before the 6th ordered observation:

    40 75 80 83 86 | 88 90 92 93 95

After dividing the batch into two sub-batches, the sub-batches are then paired up and their
differences obtained:

    (88, 40), (90, 75), (92, 80), (93, 83), (95, 86)

    \min \left\{ y^{(d)}_1, y^{(d)}_2, y^{(d)}_3, y^{(d)}_4, y^{(d)}_5 \right\} = \min \{48, 15, 12, 10, 9\} = 9 = y^{(d)}_5 = y_{(10)} - y_{(5)} = 95 - 86

    \hat{\theta}_{LMS} = \frac{y_{(10)} + y_{(5)}}{2} = \frac{95 + 86}{2} = 90.5
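The five steps above translate directly into code. The following is a sketch (ours, with illustrative names) of exact univariate LMS estimation; it reproduces the 90.5 obtained in the illustration.

```python
# Sketch of exact univariate LMS estimation, following steps (1)-(5) above.

def lms_estimate(batch):
    """Exact univariate least median of squares location estimate."""
    y = sorted(batch)                      # step (1): order the batch
    if len(y) % 2 == 1:                    # odd n: repeat the median
        y.append(y[len(y) // 2])
        y.sort()
    n = len(y)
    h = n // 2 + 1                         # step (2): the half size h
    # steps (3)-(4): differences between the paired sub-batches
    diffs = [y[i + h - 1] - y[i] for i in range(n - h + 1)]
    k = min(range(len(diffs)), key=diffs.__getitem__)
    # step (5): midpoint of the pair with the least difference
    return (y[k + h - 1] + y[k]) / 2

batch = [40, 75, 80, 83, 86, 88, 90, 92, 93, 95]
# lms_estimate(batch) -> 90.5, the midpoint of the tightest pair (86, 95)
```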
Properties of the LMS estimator  Having laid down its computation, some salient
properties of the LMS estimator are presented:

1. it has a breakdown bound of 50%;
2. it is location and scale equivariant (i.e. linearly equivariant);
3. a solution for the objective function always exists;
4. the objective function is not smooth; and
5. the objective function has a low convergence rate.
Figure 4: Wilkinson dot plot and location of the mean, median, and the LMS estimator.
In addition to the properties mentioned, the LMS estimator is also considered a sort
of mode estimator, in that it tends toward the modal value of the batch. In simpler terms, the
LMS estimator tends to where the values cluster, as seen in Figure 4, compared to orthodox
location estimators such as the mean and median.
Since the LMS estimator is affected by the shape (or the skewness) of the data, it is inherently
less reliable than other robust estimators because it is more variable.
Despite its higher variability and its non-smooth objective function with a slow
convergence rate, the LMS estimator is generalizable to the multidimensional case while still
maintaining a high breakdown bound and linear equivariance.
2.2 Least Trimmed Squares (LTS)
The LTS estimator, meanwhile, is computed as the value that would minimize the trimmed
sum of ordered squared deviations. In mathematical notation:

    \hat{\theta}_{LTS} = \arg\inf_{\theta} \left\{ \sum_{i=1}^{h} r^2_{(i)} \right\}

where

    h = \lfloor n/2 \rfloor + 1, \qquad r_j = (y_j - \theta), \; j = 1, 2, \ldots, n, \qquad r^2_{(1)} \le r^2_{(2)} \le \cdots \le r^2_{(n)}
Again, this definition would imply that the objective function of the LTS estimator is given
by:

    \rho(y_i; \theta) = \sum_{i=1}^{h} r^2_{(i)}

As before, it must be kept in mind that the LTS estimator is still not an M-estimator because
it does not include all observations in evaluating its objective function, similar to the premise
of how the LMS estimator is not an M-estimator.
Note that the upper bound of the summation is h, not n. So in essence, the LTS estimator
minimizes the sum of the lower h ordered squared residuals, equivalently discarding the upper
n − h squared deviations.
Computing for the LTS estimator

(1) As before, first order of business is to sort the data:

    y_{(1)}, y_{(2)}, \ldots, y_{(n)}, \qquad y_{(1)} \le y_{(2)} \le \cdots \le y_{(n)}

But n here can take on any positive integer value; special procedures are needed for
neither odd nor even n.

(2) Compute for h = \lfloor n/2 \rfloor + 1.

(3) Now, partition the sorted data into n − h + 1 sub-batches, each of size h, in the following
manner:

    \left\{ y_{(1)}, y_{(2)}, \ldots, y_{(h)} \right\},
    \left\{ y_{(2)}, y_{(3)}, \ldots, y_{(h+1)} \right\},
    \vdots
    \left\{ y_{(n-h+1)}, y_{(n-h+2)}, \ldots, y_{(n)} \right\}

i.e. simply enclose the first h units of the sorted batch to obtain the first sub-batch. To
obtain the second sub-batch, just move the left and the right enclosures one unit to the
right. Repeat the process n − h + 1 times (including the first iteration) until the right
enclosure reaches the end of the batch. Each repetition would then correspond to one
sub-batch.
(4) Next, compute for the means of each sub-batch. There are two ways to go about this:

    \bar{y}_{(j)} = \frac{1}{h} \sum_{i=j}^{j+h-1} y_{(i)}    (1)

    \bar{y}_{(j)} = \frac{h\,\bar{y}_{(j-1)} - y_{(j-1)} + y_{(j+h-1)}}{h}, \qquad j = 2, 3, \ldots, n-h+1    (2)

Note that Equation 1 is simply the sub-batch mean.
To understand Equation 2, keep in mind that the n − h + 1 sub-batches are obtained in
a progressive manner. For example, the second sub-batch contains some elements from
the first sub-batch, but the 1st ordered observation is excluded while the (h + 1)th
observation is included.
Generally speaking, the (j + 1)th sub-batch is the same as the jth sub-batch, but excluding
the jth observation and including the (j + h)th observation, where j = 1, 2, \ldots, n − h.
That said, note that before Equation 2 can be used, Equation 1 must first be evaluated
at j = 1.
(5) After obtaining the n − h + 1 means, the n − h + 1 variances must then be computed for.
Either of the two formulae can be used for this:

    SQ_{(j)} = \sum_{i=j}^{j+h-1} \left( y_{(i)} - \bar{y}_{(j)} \right)^2    (3)

    SQ_{(j)} = SQ_{(j-1)} - y^2_{(j-1)} + h \left( \bar{y}_{(j-1)} \right)^2 + y^2_{(j+h-1)} - h \left( \bar{y}_{(j)} \right)^2, \qquad j = 2, 3, \ldots, n-h+1    (4)

Again, Equation 4 is a recursive form of Equation 3. Also, Equation 3 must be evaluated
at j = 1 first before proceeding to use Equation 4.

(6) The LTS estimator is then taken as the mean corresponding to the sub-batch with the
least variance, SQ_{(j)}, i.e.:

    \hat{\theta}_{LTS} = \bar{y}_{(k)} \qquad \text{where} \qquad SQ_{(k)} = \min_j SQ_{(j)}

Before moving on, care must be taken when using the recursive formulae, Equations 2 and 4,
in that rounding-off must not be done within each iteration. Rounding-off the \bar{y}_{(j)}s and
the SQ_{(j)}s in each iteration will result in not just grouping errors, but also their propagation.
LTS Illustration  Consider the same batch of numbers from the previous illustration:

    40 75 80 83 86 88 90 92 93 95

The resulting sub-batch means and sub-batch variances are approximated here only to conserve
space; again, these values must not be rounded off before obtaining the actual LTS estimate.
That said, the \bar{y}_{(j)}s and the SQ_{(j)}s are computed as follows:

    j    sub-batch                    \bar{y}_{(j)}    SQ_{(j)}
    1    {40, 75, 80, 83, 86, 88}     75.3333          1603.3333
    2    {75, 80, 83, 86, 88, 90}     83.6667           153.3333
    3    {80, 83, 86, 88, 90, 92}     86.5               99.5
    4    {83, 86, 88, 90, 92, 93}     88.6667            71.3333
    5    {86, 88, 90, 92, 93, 95}     90.6667            55.3333

    \min \left\{ SQ_{(1)}, SQ_{(2)}, SQ_{(3)}, SQ_{(4)}, SQ_{(5)} \right\} = \min \{1603.33, 153.33, 99.5, 71.33, 55.33\} = 55.33 = SQ_{(5)}

    \hat{\theta}_{LTS} = \bar{y}_{(5)} \approx 90.6667
As previously mentioned, the LTS estimator includes only the elements from the sub-batch,
of size h, with the lowest variance. In doing so, the other n − h observations are excluded. So
really, the LTS estimator, at least as presented, is the trimmed mean of the sub-batch with
the lowest squared deviations, with a trimming proportion of (1 − h/n). It need not be said
that, having described the LTS estimator as a trimmed mean, it allows for an asymmetric
trimming of observations.
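For comparison with the LMS sketch, steps (1) to (6) can likewise be coded. The sketch below (ours, with illustrative names) recomputes each sub-batch mean and SQ directly from Equations 1 and 3 for clarity; the recursive Equations 2 and 4 are an efficiency refinement that yields the same values.

```python
# Sketch of exact univariate LTS estimation, following steps (1)-(6) above.
# Each sub-batch mean and SQ is recomputed directly (Equations 1 and 3);
# the recursive updates (Equations 2 and 4) would give identical results.

def lts_estimate(batch):
    """Exact univariate least trimmed squares location estimate."""
    y = sorted(batch)                        # step (1): sort the data
    n = len(y)
    h = n // 2 + 1                           # step (2)
    best_mean, best_sq = None, float("inf")
    for j in range(n - h + 1):               # step (3): sliding sub-batches
        sub = y[j:j + h]
        m = sum(sub) / h                     # step (4): sub-batch mean
        sq = sum((v - m) ** 2 for v in sub)  # step (5): sub-batch SQ
        if sq < best_sq:
            best_mean, best_sq = m, sq
    return best_mean                         # step (6): mean of min-SQ sub-batch

batch = [40, 75, 80, 83, 86, 88, 90, 92, 93, 95]
# lts_estimate(batch) -> 90.666..., the mean of {86, 88, 90, 92, 93, 95}
```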
Properties of the LTS Estimator  Unlike the LMS estimator, the LTS estimator performs
(relatively) well in terms of asymptotic efficiency. Meaning to say, it has a comparably faster
convergence rate; equivalently, it takes fewer iterations before a value for the estimate is
arrived at, at least compared to the LMS estimator.
Like the LMS estimator, the LTS estimator:

1. has a breakdown bound of 50%;
2. is linearly equivariant (i.e. location and scale equivariant);
3. is extendable to multidimensional cases (while still maintaining a high breakdown
bound and linear equivariance); and
4. lacks a smooth objective function.

Figure 5: Wilkinson dot plot and locations of the mean, median, LTS, and LMS estimators.
Like the LMS estimator, the LTS estimator should also be located somewhere near the
modal value of the batch (at least relative to the mean and the median). Since the objective
function of the LTS estimator is based on the ordered partition of the batch with the smallest
variance, which more often than not is the interval around which the data values cluster, it
should follow that the LTS estimator can likewise be likened to a modal estimator.
2.3 LMS and LTS Estimation in Large Batches
The compromise for the high breakdown bound, among others, of these estimators is
inefficiency in computation. As illustrated in the previous examples, computing
these estimators involves solving for the scales of the sub-batches (the range and the sum of
squared deviations for LMS and LTS, respectively) n − h + 1 times.
In especially large batches, this is quite impractical. To render efficiency in solving for the
LMS and LTS estimators of large batches, resampling techniques are used instead. Thus,
solutions are determined randomly for large batch sizes. Ipso facto, it is possible to yield
inconsistent resultant computational values.
3 Robust Regression
This section is outlined as follows: a brief description of the properties of the robust regression
techniques is presented, in particular the objective function used to arrive at parameter
estimates and the breakdown bounds of the parameter estimates. After, inferential
properties under the robust regression techniques are presented.
3.1 LMS Regression
The parameter estimates in LMS regression are estimated as those that would yield the
minimum median of squared residuals, i.e.:

    \arg\min_{\hat{\beta}} \left\{ \mathrm{Med}\!\left[ r^2_{(i)} \right] \right\} = \arg\min_{\hat{\beta}} \left\{ \mathrm{Med}\left| r_{(i)} \right| \right\} \qquad \text{where} \quad r_i = y_i - \hat{y}_i \;\; \forall i

The breakdown bound of the resulting estimates is:

    \mathrm{BDB}(LMS) = \frac{\lfloor (n-p)/2 \rfloor + 1}{n}

provided that p > 1, p being the number of parameters estimated.
3.2 LTS Regression
The parameter estimates in LTS regression are computed as the ones that would yield the
minimum trimmed sum of ordered squared residuals:

    \arg\inf_{\hat{\beta}} \left\{ \sum_{i=1}^{h} r^2_{(i)} \right\}

where

    h = \lfloor n/2 \rfloor + 1, \qquad r_j = (y_j - \hat{y}_j), \; j = 1, 2, \ldots, n, \qquad r^2_{(1)} \le r^2_{(2)} \le \cdots \le r^2_{(n)}

with a breakdown bound of:

    \mathrm{BDB}(LTS) = \frac{\lfloor (n-p)/2 \rfloor + 1}{n}

where p is the number of parameters estimated.
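Since LMS and LTS regression share the same breakdown-bound formula, a quick sketch (ours; the (n, p) values are illustrative) shows how the bound sits just below 50 percent and decreases slightly as more parameters are estimated.

```python
# Sketch: the common LMS/LTS regression breakdown bound,
# BDB = (floor((n - p) / 2) + 1) / n.

def breakdown_bound(n, p):
    """Breakdown bound of LMS/LTS regression: n observations, p parameters."""
    return ((n - p) // 2 + 1) / n

# e.g. n = 20 observations and p = 4 estimated parameters:
# breakdown_bound(20, 4) -> 0.45, i.e. up to 45% of the observations may be
# contaminated before the fit can be carried arbitrarily far away.
```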
3.3 Inference in Robust Regression
Scale estimator of error terms  The scale of the error terms is estimated by:

    s_{LMS} = \left( 1 + \frac{5}{n-p} \right) c_{h,n} \left| r \right|_{(h)}    (5)

    s_{LTS} = d_{h,n} \sqrt{ \frac{1}{h} \sum_{i=1}^{h} r^2_{(i)} }    (6)

where

    d_{h,n} = \frac{1}{\sqrt{ 1 - \dfrac{2n}{h\, c_{h,n}}\, \varphi\!\left( \dfrac{1}{c_{h,n}} \right) }}, \qquad c_{h,n} = \frac{1}{\Phi^{-1}\!\left( \dfrac{n+h}{2n} \right)}, \qquad h = \lfloor n/2 \rfloor + 1

Note that c_{h,n} and d_{h,n} are chosen to make the scale estimators consistent with the Gaussian
model (Rousseeuw and Hubert, 1997).
Moreover, it is important to note that Equation 5 only applies for odd n, and that
(1 + 5/(n − p)) is a finite-population correction factor (see Rousseeuw and Hubert (1997)).
It is noteworthy that there are more efficient scale estimates (see Rousseeuw
and Hubert (1997)) based on Equations 5 and 6; but for the purposes of just introducing the
notion of robust regression, these equations should suffice.
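The constants c_{h,n} and d_{h,n} can be evaluated numerically. The sketch below (ours) uses the standard normal quantile and density functions from Python's standard library; the form of d_{h,n} follows the SAS PROC ROBUSTREG documentation and should be treated as an assumption to be checked against the source being followed.

```python
# Sketch: evaluating the Gaussian-consistency constants c_{h,n} and d_{h,n}.
# Phi^{-1} and phi come from the stdlib NormalDist; the d_{h,n} form follows
# the SAS documentation and is an assumption, not taken from this text.
from statistics import NormalDist

def consistency_constants(n):
    nd = NormalDist()                       # standard normal
    h = n // 2 + 1
    c = 1 / nd.inv_cdf((n + h) / (2 * n))   # c_{h,n} = 1 / Phi^{-1}((n+h)/(2n))
    d = 1 / (1 - (2 * n) / (h * c) * nd.pdf(1 / c)) ** 0.5
    return h, c, d

h, c, d = consistency_constants(20)
# For n = 20: h = 11, c ≈ 1.32, and d ≈ 2.38; d is well above 1 because the
# raw trimmed scale badly underestimates sigma at roughly 55% coverage.
```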
Coefficient of determination, R²  Rousseeuw and Hubert (1997) propose a robust
counterpart of the OLS notion of R², based on Equations 5 and 6, for LMS and LTS regression
as:

    R^2_{LMS} = 1 - \frac{s_{LMS}}{\left( 1 + \frac{5}{n-p} \right) c_{h,n} \left| \tilde{r} \right|_{(h),LMS}}

    R^2_{LTS} = 1 - \frac{s_{LTS}}{d_{h,n} \sqrt{ \frac{1}{h} \sum_{i=1}^{h} \tilde{r}^2_{(i),LTS} }}

where the \tilde{r}s are the corresponding deviations of the y-observations from their univariate
LMS or LTS estimates.
Unfortunately, this is as far as the related literature goes regarding inference in robust
regression. Noteworthy is how the Help and Documentation of SAS 9.3 (SAS Institute, Inc.,
2011) specifies that there is no test for the canonical linear hypothesis under LMS and LTS
regression.

Interpretation of predicted values  Following the logic of the previous paragraphs,
it is quite obvious that the resultant predicted values would be interpreted under
the paradigm of robustness. In other words, since the parameter estimates of the model
were computed using a robust solution, ipso facto, the resulting predicted values would
be based on the linear relationship of the majority of the values as determined by the robust
solution employed.
4 Software Implementations
Since the objective functions of the LMS and LTS estimators are not smooth, they do not
lend themselves to mathematical optimization. In other words, there is no closed-form formula
for computing the parameter estimates. (In fact, it is apparent that this difficulty is inherent
to all regression estimators with high breakdown bounds (Bhar, n.d.).)
4.1 PROGRESS algorithm
To this end, Rousseeuw and Hubert (1997) proposed an algorithm for computing the
parameter estimates under LMS and LTS regression called PROGRESS (Program for RObust
reGRESSion). PROGRESS essentially involves resampling methods.
The details of the algorithm will not be discussed here; instead, its general flow is outlined.
Briefly, the process involves first obtaining a subsample (or sub-batch) of comparably
smaller size to make the computation efficiently feasible. After the parameter estimates are
obtained from the subsample, the objective function is evaluated. The process
is repeated a number of times. The (overall) parameter estimates are then obtained as
the estimates generated from the subsample that yielded the lowest value of the evaluated
objective function.
Modern algorithms for robust regression are based on PROGRESS.
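The flow just described can be sketched for the simple one-regressor case: fit an exact line through each elemental subset (a pair of points), evaluate the LMS objective, and keep the best candidate. For small n, all pairs can be enumerated instead of sampled randomly; everything below is our illustrative sketch, not the actual PROGRESS code.

```python
# Sketch of a PROGRESS-style search for simple (one-regressor) LMS regression:
# candidate fits come from elemental subsets (pairs of points); the winner
# minimizes the median of squared residuals. For small n we enumerate all
# pairs instead of sampling them randomly, which makes the result exact.
from itertools import combinations
from statistics import median

def lms_line(x, y):
    """Return (intercept, slope) minimizing the median squared residual."""
    best, best_obj = None, float("inf")
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue                       # skip vertical candidate lines
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        obj = median((yi - (intercept + slope * xi)) ** 2
                     for xi, yi in zip(x, y))
        if obj < best_obj:
            best, best_obj = (intercept, slope), obj
    return best

# Dataset.wLev from Section 1.1: the first point is a leverage point.
x = [5.00, 1.00, 1.27, 1.57, 2.10]
y = [0.30, 1.23, 1.78, 2.79, 3.90]
b0, b1 = lms_line(x, y)
```

On Dataset.wLev, the winning line follows the four well-behaved points (slope of roughly 2.4), whereas OLS on the same data produced a slope of -0.41.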
4.2 SAS
LMS Regression  LMS regression in SAS is done in the Interactive Matrix Language
(IML) environment. Ipso facto, SAS datasets generated via the DATA step or the IMPORT
procedure (or any other SAS facility outside the IML procedure, for that matter) must be
converted into an object that is usable within PROC IML. An example of how to do such a
task can be seen in Appendix B.1.
That said, the following command is invoked while in PROC IML to conduct an LMS
regression:

CALL LMS(sc, coef, wgt, opt, y, x);

It is important to mention that the parameters of this function are in the form of a matrix
or a vector. After all, the function is called inside the IML environment.
For the purposes of this introduction, only the last three of the six function parameters are
discussed: opt, y, and x. The first three function parameters are implicitly left at their
default values.
x is the matrix of explanatory values, with the rows as the observations and the columns as the
explanatory variables. An additional column of 1s (for the intercept) need not be included,
since opt can specify that an intercept be included in the estimated parameters.
y and opt are vectors corresponding to the values of the response variable and
the different options, respectively.

LTS Regression  Fortunately, invoking LTS regression in SAS is not as cumbersome as
LMS regression; it is done simply via PROC ROBUSTREG. The general syntax is:

PROC ROBUSTREG DATA=dataset METHOD=LTS;
MODEL response = var1 var2 ... vark / options;
RUN;
4.3 R
LMS Regression  Before LMS regression can be invoked in R, the MASS package
must first be installed.
After installing the package, conducting an LMS regression is as simple as:

lmsreg(formula, dataframe)

which is actually a wrapper of the lqs function.
While there are other function parameters (such as the seed number and weights), these
parameter specifications are sufficient for this and the subsequent R functions presented.

LTS Regression  As with LMS regression, LTS regression in R also requires the MASS
package, and is called via the following function:

ltsreg(formula, dataframe)

which too is an lqs wrapper function.
There is another option, though, which is via:

ltsReg(formula, dataframe)

which requires the robustbase package.
5 Illustration: Land Use and Water Quality in New York Rivers
As a demonstration of how robust regression analysis usually goes, consider the following
dataset taken from Haith (1976) (as cited in Hamilton (1992)).
Table 1: Land use and nitrogen content in 20 river basins.
Basin Agri Forest Urban Nitro
1 Olean 26 63 1.49 1.1
2 Cassadaga 29 57 0.79 1.01
3 Oatka 54 26 2.38 1.9
4 Neversink 2 84 3.88 1
5 Hackensack 3 27 32.61 1.99
6 Wappinger 19 61 3.96 1.42
7 Fishkill 16 60 6.71 2.04
8 Honeoye 40 43 1.64 1.65
9 Susquehanna 28 62 1.25 1.01
10 Chenago 26 60 1.13 1.21
11 Tioughnioga 26 53 1.08 1.33
12 West Canada 15 75 0.86 0.75
13 East Canada 6 84 0.62 0.73
14 Saranac 3 81 1.15 0.8
15 Ausable 2 89 1.05 0.76
16 Black 6 82 0.65 0.87
17 Schoharie 22 70 1.12 0.8
18 Raquette 4 75 0.58 0.87
19 Oswegatchie 21 56 0.63 0.66
20 Chocton 40 49 1.23 1.25
The variable Basin is the name of the river basin / area containing the river basin. Agri is the
percentage of land in active agriculture, while Forest is the percentage of land forested, brushland,
or plantation. Urban is the percentage of urban land (including residential, commercial,
and industrial). Nitro is the nitrogen concentration in the river water (in mg/L).
That said, the data in Table 1 were used to explore the effect of the different types of land
use on nonpoint-source water pollution.
OLS, LMS, and LTS regression models are tted on this dataset and then compared. For
the purposes of this demonstration, SAS outputs are used. The codes used to generate the
outputs are in Appendix B.1 on page 28
(a) OLS Model (b) OLS Residuals
Figure 6: Selected SAS PROC REG Outputs
(a) LMS Model (b) LMS Residuals
Figure 7: Selected SAS PROC IML Outputs
(a) LTS Model (b) LTS Model R² (c) LTS Residuals
Figure 8: Selected SAS PROC ROBUSTREG Outputs
In comparing the models fitted using OLS, LMS, and LTS regression, there are three key
aspects that could be scrutinized as far as resistance goes: (i) R², (ii) the parameter estimates,
and (iii) the identified outlying observations.
Coefficient of determination Note that the OLS R², approximately 65 percent, is
considerably lower than the LMS and LTS R²s, which are approximately 91 percent and
88 percent, respectively. Not much information can be obtained from these values alone,
though. To gain a deeper understanding of the dynamics of the relationship between the
extents of the different land uses and water pollution, let us look at the parameter
estimates.
Parameter estimates To facilitate the comparison of the resultant parameter estimates
from the three regression procedures, Table 2 below summarizes the parameter estimates
taken from Figures 6(a), 7(a), and 8(a).
Table 2: Parameter Estimates from the Fitted OLS, LMS, and LTS models
Variable OLS LMS LTS
Agri 0.0085 -0.0116 -0.0151
Forest -0.0084 -0.0288 -0.0319
Urban 0.0293 0.1413 0.1235
Table 2 shows that the OLS estimates of the effects of the different types of land use on
water pollution are comparatively marginal, at around half the magnitude of the effects
estimated by the LMS and LTS procedures; their stability is also rendered questionable by
the estimated standard errors (as shown in Figure 6(a)).
The effects estimated by the LMS and LTS regression methods do not lie very far from one
another, and are more pronounced than the OLS estimates (as already mentioned). What's
important to note is the change in sign of the parameter estimate for the percentage of
agricultural land use.
This suggests that the outlying observation(s) are (bad) leverage points with respect to the
percentage of agricultural land use, which shall be discussed further below.
That said, it is apparent from the robust parameter estimates that, among the three variables,
the percentage of urban land use has the largest (positive) effect on water pollution. This
may be because the discharge of urban wastes into the surrounding waters raises their
nitrogen concentration, such wastes having the largest nitrogen content.
The negative signs for agricultural and forested land, on the other hand, could be attributed
to the relatively smaller nitrogen content associated with these types of land use compared
to the wastes associated with the other land uses that could have occupied the same ground
(particularly urban).
That the magnitude of the effect of the percentage of forested land is around twice that of
the percentage of land used for agriculture could be due to the relative lack of nitrogen in
the run-off from the former type of land use compared to the latter.
So, in essence, these parameter estimates can be interpreted through the nitrogen content
associated with each type of land use, with urban wastes apparently possessing the largest
nitrogen content.
Hence, based on these robust estimates, it is possible that the outlying observations show
increased nitrogen concentrations in their surrounding waters, despite the otherwise negative
estimated effect of agricultural land use, because of major agricultural activities (such as
industrial farming, extensive use of fertilizers, etc.) whose run-off wastes contribute to the
proliferation of nitrogen pollution in the surrounding bodies of water. This would explain
the positive sign of the OLS estimate for the percentage of agricultural land use: the outliers
pulled the estimate toward them.
Outlying observations Note that based on the fitted OLS model, observations 5, 7, and
19 (Hackensack, Fishkill, Oswegatchie) have been identified as outliers (Figure 6(b)), while
the robust models identified only observations 5 and 19 (Hackensack and Oswegatchie) as
outliers (Figures 7(b) and 8(c)).
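The rule behind these outlier flags is simple: standardize each residual by the robust scale estimate and flag any observation whose standardized residual exceeds a cutoff in absolute value (3.0 in the PROC ROBUSTREG Diagnostics Summary). As a sketch, applying that rule in Python to the LMS residuals and final scale reported in Figure 7 recovers the same two basins:

```python
# Residuals and the final LMS scale, copied from the PROC IML output (Figure 7).
residuals = [ 0.064544, -0.064544, -0.002394, -0.046515, -4.746996,
             -0.103228,  0.064544,  0.179578,  0.002833,  0.139042,
              0.064544,  0.021736,  0.190583,  0.064544,  0.257457,
              0.268753, -0.027919,  0.053920, -0.513373,  0.010298]
scale = 0.1024008633
CUTOFF = 3.0  # same cutoff shown in the Diagnostics Summary

# Observation numbers (1-based) whose |residual / scale| exceeds the cutoff.
outliers = [i + 1 for i, r in enumerate(residuals) if abs(r / scale) > CUTOFF]
print(outliers)  # [5, 19] -- Hackensack and Oswegatchie
```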
Following the logic of the preceding paragraphs, it could be that Hackensack and Oswegatchie
are areas with industrial-scale agricultural activity, and that their inclusion in the sample
pulled the OLS estimates away from the usual pattern and toward theirs: boosting the
quantitative effect of the percentage of agricultural use, and slightly pushing up that of the
percentage of forested land, at the expense of the effect of the percentage of urbanised land,
on nitrogen concentration.
Meanwhile, Fishkill appears to be consistent with the pattern exhibited by the majority of
the river basins in the sample.
References
Bhar, L. (n.d.). Robust regression.
Chen, C. (n.d.). Robust Regression and Outlier Detection with the ROBUSTREG Procedure.
Haith, D. A. (1976). Land use and water quality in new york rivers.
Hamilton, L. C. (1992). Regression with Graphics: A Second Course in Applied Statistics.
Duxbury Press, Belmont, California.
O'Kelly, M. (2006). A Tour Around PROC ROBUSTREG. In PhUSE.
Ripley, B., Venables, B., Hornik, K., Gebhardt, A., and Firth, D. (2013). Support Functions
and datasets to support Venables and Ripley, Modern Applied Statistics in S (4th edition,
2002).
Rousseeuw, P. J. (1984). Least Median of Squares Regression. Journal of the American
Statistical Association, 79(388):871–880.
Rousseeuw, P. J., Croux, C., Todorov, V., Ruckstuhl, A., Salibian-Barrera, M., Verbeke, T.,
Koller, M., Maechler, M., et al. (2012). Basic Robust Statistics.
Rousseeuw, P. J. and Hubert, M. (1997). Recent developments in PROGRESS. 31.
Rousseeuw, P. J. and Leroy, A. (1987). Robust Regression & Outlier Detection. John Wiley
& Sons.
SAS Institute, Inc. (2011). SAS 9.3 HELP AND DOCUMENTATION. Cary, North Carolina.
Verardi, V. and Croux, C. (2009). Robust regression in Stata. The Stata Journal, 9(3):439–453.
Yaffee, R. A. (2002). Robust Regression Analysis: Some Popular Statistical Package Options.
A R functions for Exact Univariate LMS and LTS Estimation
> ## LMS ####
> lms <- function(x){
+   x <- sort(x)                      # the shortest-half search needs sorted data
+   h <- floor(length(x)/2) + 1
+   y.diff <- x[h:length(x)] - x[1:(length(x)-h+1)]
+   min.i <- which.min(y.diff)
+   print(mean(c(x[min.i], x[min.i+h-1])))  # midpoint of the shortest half
+ }
> ## LTS ####
> lts <- function(x){
+   x <- sort(x)
+   h <- floor(length(x)/2) + 1
+   sq <- rep(NA, length(x)-h+1)
+   for(j in 1:(length(x)-h+1)){
+     sq[j] <- var(x[j:(j+h-1)])      # variance of each contiguous half-sample
+   }
+   min.j <- which.min(sq)
+   print(mean(x[min.j:(min.j+h-1)])) # mean of the least-variable half
+ }
> ## Example ----
> ODex <- c(40, 75, 80, 83, 86, 88, 90, 92, 93, 95)
> lms(ODex)
[1] 90.5
> lts(ODex)
[1] 90.66667
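For readers working outside R, the same exhaustive half-sample searches can be sketched in another language. Here is a Python version (the function and variable names are ours) that reproduces the example above:

```python
def lms(x):
    # LMS location estimate: the midpoint of the shortest half-sample, where a
    # half-sample is h = floor(n/2) + 1 consecutive sorted values.
    x = sorted(x)
    n, h = len(x), len(x) // 2 + 1
    i = min(range(n - h + 1), key=lambda i: x[i + h - 1] - x[i])
    return (x[i] + x[i + h - 1]) / 2

def lts(x):
    # LTS location estimate: the mean of the half-sample with the smallest
    # sum of squared deviations from its own mean.
    x = sorted(x)
    n, h = len(x), len(x) // 2 + 1
    def ss(w):
        m = sum(w) / len(w)
        return sum((v - m) ** 2 for v in w)
    j = min(range(n - h + 1), key=lambda j: ss(x[j:j + h]))
    return sum(x[j:j + h]) / h

odex = [40, 75, 80, 83, 86, 88, 90, 92, 93, 95]
print(lms(odex))            # 90.5
print(round(lts(odex), 5))  # 90.66667
```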
B SAS Code and Outputs used for Section 5
B.1 SAS Code
OPTIONS noxsync noxwait nodate nonumber;
LIBNAME robreg 'path';
x 'path\filename.xlsx'; /* must remain open until SAS dataset registers */
FILENAME hdata
  DDE 'EXCEL|path\[filename.xlsx]sheet!r2c1:r21c5'
  NOTAB
;
DATA robreg.hamilton;
  INFILE hdata DLM='09'x DSD;
  INPUT Basin : $11. Agri Forest Urban Nitro;
  LABEL Basin  = 'New York river basin'
        Agri   = 'Percentage of land in active agriculture'
        Forest = 'Percentage of land forested, brushland, or plantation'
        Urban  = 'Percentage of land urban'
        Nitro  = 'Nitrogen concentration in river water (mg/l)'
  ;
RUN;

/***********/
/*   OLS   */
/***********/
PROC REG DATA=robreg.hamilton;
  TITLE 'OLS Regression';
  MODEL Nitro = Agri Forest Urban / R;
RUN;
QUIT;

/***********/
/*   LMS   */
/***********/
PROC IML;
  TITLE 'LMS Regression';
  USE robreg.hamilton;                /* opens the SAS dataset */
  READ ALL VAR _ALL_ INTO hamilData;  /* reads everything into a matrix */
  nitro = hamilData[, 4];             /* response values of the observations */
  land  = hamilData[, 1:3];           /* explanatory values of the observations */
  opt   = J(8, 1, .);                 /* 8x1 options vector; . means default */
  opt[2] = 2;                         /* more info in output; other options left at default */
  CALL LMS(sc, coef, wgt, opt, nitro, land);
QUIT;
/***********/
/*   LTS   */
/***********/
PROC ROBUSTREG DATA=robreg.hamilton METHOD=LTS;
  TITLE 'LTS Regression';
  MODEL Nitro = Agri Forest Urban / DIAGNOSTICS(ALL) LEVERAGE;
RUN;
B.2 SAS Outputs
The REG Procedure
Model: MODEL1
Dependent Variable: Nitro Nitrogen concentration in river water (mg/l)
OLS Regression
Number of Observations Read 20
Number of Observations Used 20
Analysis of Variance

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              3          2.36602       0.78867     10.04   0.0006
Error             16          1.25656       0.07853
Corrected Total   19          3.62257

Root MSE          0.28024   R-Square   0.6531
Dependent Mean    1.15750   Adj R-Sq   0.5881
Coeff Var        24.21085
Parameter Estimates

Variable    Label                                                   DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   Intercept                                                1              1.42745          1.29346      1.10     0.2861
Agri        Percentage of land in active agriculture                 1              0.00851          0.01581      0.54     0.5981
Forest      Percentage of land forested, brushland, or plantation    1             -0.00843          0.01448     -0.58     0.5684
Urban       Percentage of land urban                                 1              0.02934          0.02757      1.06     0.3031
The REG Procedure
Model: MODEL1
Dependent Variable: Nitro Nitrogen concentration in river water (mg/l)
OLS Regression
Output Statistics

Obs   Dependent Variable   Predicted Value   Std Error Mean Predict   Residual   Std Error Residual   Student Residual   -2-1 0 1 2   Cook's D
1 1.1000 1.1610 0.0898 -0.0610 0.265 -0.230 | | | 0.002
2 1.0100 1.2166 0.0764 -0.2066 0.270 -0.766 | *| | 0.012
3 1.9000 1.7373 0.1695 0.1627 0.223 0.729 | |* | 0.077
4 1.0000 0.8499 0.1157 0.1501 0.255 0.588 | |* | 0.018
5 1.9900 2.1820 0.2741 -0.1920 0.0582 -3.299 |******| | 60.413
6 1.4200 1.1908 0.0646 0.2292 0.273 0.840 | |* | 0.010
7 2.0400 1.2544 0.0706 0.7856 0.271 2.897 | |***** | 0.142
8 1.6500 1.4531 0.1112 0.1969 0.257 0.765 | |* | 0.027
9 1.0100 1.1794 0.0991 -0.1694 0.262 -0.646 | *| | 0.015
10 1.2100 1.1758 0.0702 0.0342 0.271 0.126 | | | 0.000
11 1.3300 1.2333 0.1210 0.0967 0.253 0.382 | | | 0.008
12 0.7500 0.9478 0.0855 -0.1978 0.267 -0.741 | *| | 0.014
13 0.7300 0.7883 0.1009 -0.0583 0.261 -0.223 | | | 0.002
14 0.8000 0.8036 0.1101 -0.003628 0.258 -0.0141 | | | 0.000
15 0.7600 0.7247 0.1213 0.0353 0.253 0.140 | | | 0.001
16 0.8700 0.8060 0.0948 0.0640 0.264 0.243 | | | 0.002
17 0.8000 1.0571 0.1099 -0.2571 0.258 -0.997 | *| | 0.045
18 0.8700 0.8460 0.1621 0.0240 0.229 0.105 | | | 0.001
19 0.6600 1.1523 0.1576 -0.4923 0.232 -2.124 | ****| | 0.521
20 1.2500 1.3905 0.1248 -0.1405 0.251 -0.560 | *| | 0.019
Sum of Residuals 0
Sum of Squared Residuals 1.25656
Predicted Residual SS (PRESS) 21.55732
LMS: The 12th ordered squared residual will be minimized.
There are 4845 subsets of 4 cases out of 20 cases.
The algorithm will draw 2000 random subsets of 4 cases.
Random Subsampling for LMS
Minimum Criterion= 0.1776901227
Least Median of Squares (LMS) Method
Minimizing 12th Ordered Squared Residual.
Highest Possible Breakdown Value = 45.00 %
LMS Regression
Median and Mean
Median Mean
VAR1 20 19.4
VAR2 61.5 62.85
VAR3 1.14 3.2405
Intercep 1 1
Response 1.01 1.1575
Dispersion and Standard Deviation
Dispersion StdDev
VAR1 17.049925513 14.730562572
VAR2 19.273828841 17.842217823
VAR3 0.6226929318 7.0778806484
Intercep 0 0
Response 0.3632375435 0.4366484193
Subset   Singular   Best Criterion   Percent
500 0 0.195351 25
1000 0 0.183161 50
1500 0 0.177690 75
2000 0 0.177690 100
Random Selection of 2000 Subsets
All 2000 Subsets were Nonsingular
LMS Objective Function = 0.0645437237
Preliminary LMS Scale = 0.09586219
Robust R Squared = 0.9139278457
Final LMS Scale = 0.1024008633
Observations of Best Subset
14 7 11 1
Estimated Coefficients
VAR1 VAR2 VAR3 Intercep
-0.011581144 -0.028795009 0.1413416723 2.9400524804
LMS Residuals
N Observed Estimated Residual Res / S
1 1.100000 1.035456 0.064544 0.630304
2 1.010000 1.074544 -0.064544 -0.630304
3 1.900000 1.902394 -0.002394 -0.023375
4 1.000000 1.046515 -0.046515 -0.454246
5 1.990000 6.736996 -4.746996 -46.356990
6 1.420000 1.523228 -0.103228 -1.008080
7 2.040000 1.975456 0.064544 0.630304
8 1.650000 1.470422 0.179578 1.753680
9 1.010000 1.007167 0.002833 0.027666
10 1.210000 1.070958 0.139042 1.357818
11 1.330000 1.265456 0.064544 0.630304
12 0.750000 0.728264 0.021736 0.212269
13 0.730000 0.539417 0.190583 1.861149
14 0.800000 0.735456 0.064544 0.630304
15 0.760000 0.502543 0.257457 2.514206
16 0.870000 0.601247 0.268753 2.624519
17 0.800000 0.827919 -0.027919 -0.272648
18 0.870000 0.816080 0.053920 0.526554
19 0.660000 1.173373 -0.513373 -5.013368
20 1.250000 1.239702 0.010298 0.100570
Distribution of Residuals

MinRes         1st Qu.        Median         Mean           3rd Qu.        MaxRes
-4.746995751   -0.037217268   0.0378280304   -0.206129679   0.1017927118   0.2687530018

Median(U) = 4.6559203206
The run has been executed successfully.
Resistant Diagnostic

N    U            Resistant Diagnostic
1 2.150564 0.461899
2 3.893599 0.836268
3 6.843048 1.469752
4 16.385379 3.519257
5 174.497682 37.478666
6 15.235979 3.272388
7 28.046416 6.023818
8 5.271374 1.132187
9 2.845798 0.611221
10 2.466932 0.529849
11 6.035103 1.296221
12 2.471721 0.530877
13 3.207251 0.688854
14 4.910965 1.054779
15 4.400876 0.945221
16 4.163891 0.894322
17 3.346356 0.718731
18 8.768730 1.883351
19 15.481396 3.325099
20 3.105903 0.667087
The ROBUSTREG Procedure
LTS Regression
Model Information
Data Set ROBREG.HAMILTON
Dependent Variable Nitro Nitrogen concentration in river water (mg/l)
Number of Independent Variables 3
Number of Observations 20
Method LTS Estimation
Number of Observations Read 20
Number of Observations Used 20
Parameter Information
Parameter Effect
Intercept Intercept
Agri Agri
Forest Forest
Urban Urban
Summary Statistics
Variable   Q1        Median    Q3        Mean      Standard Deviation   MAD
Agri 5.0000 20.0000 27.0000 19.4000 14.7306 17.0499
Forest 54.5000 61.5000 78.0000 62.8500 17.8422 19.2738
Urban 0.8250 1.1400 2.0100 3.2405 7.0779 0.6227
Nitro 0.8000 1.0100 1.3750 1.1575 0.4366 0.3632
LTS Profile
Total Number of Observations 20
Number of Squares Minimized 16
Number of Coefficients 4
Highest Possible Breakdown Value 0.2500
LTS Parameter Estimates
Parameter DF Estimate
Intercept 1 3.2853
Agri 1 -0.0151
Forest 1 -0.0319
Urban 1 0.1235
Scale (sLTS) 0 0.1154
Scale (Wscale) 0 0.1147
Diagnostics
Obs   Mahalanobis Distance   Robust MCD Distance   Leverage   Standardized Robust Residual   Outlier
1 1.0005 1.1798 0.2945
2 0.6791 1.6962 -1.0105
3 2.4496 2.9214 -0.2931
4 1.5125 10.9064 * -0.4692
5 4.1510 108.6323 * -38.4804 *
6 0.2431 9.9706 * -1.0524
7 0.5042 19.5216 * 0.7195
8 1.4287 1.3156 1.2085
9 1.1948 1.0546 -0.2465
10 0.4908 0.3881 0.8064
11 1.6098 1.4173 -0.0408
12 0.9043 0.9732 -0.1896
13 1.2310 1.3119 1.2123
14 1.4089 2.1799 0.0225
15 1.6161 1.9827 1.8749
16 1.1055 1.1118 1.8441
17 1.4045 1.3030 -0.5023
18 2.3245 2.0565 -0.2911
19 2.2486 2.2784 -5.2212 *
20 1.6793 1.8679 -0.1682
Diagnostics Summary
Observation Type Proportion Cutoff
Outlier 0.1000 3.0000
Leverage 0.2000 3.0575
R-Square for LTS Estimation
R-Square 0.8853
C R Script and Results used for Section 5
> ## start: Data Input ########
> require("XLConnect")
> data.hamilton <- readWorksheetFromFile("hamildata.xlsx",
+ sheet="Data",
+ header=T,
+ startRow=1, endRow=21,
+ startCol=1, endCol=5)
> ## end: Data Input ########
>
> ## OLS ====
> OLSmodel.hamilton <- (lm(Nitro~Agri+Forests+Urban,
+ data=data.hamilton))
> summary(OLSmodel.hamilton)
Call:
lm(formula = Nitro ~ Agri + Forests + Urban, data = data.hamilton)
Residuals:
Min 1Q Median 3Q Max
-0.49229 -0.17505 0.01018 0.11003 0.78560
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.427453 1.293457 1.104 0.286
Agri 0.008505 0.015815 0.538 0.598
Forests -0.008433 0.014479 -0.582 0.568
Urban 0.029337 0.027572 1.064 0.303
Residual standard error: 0.2802 on 16 degrees of freedom
Multiple R-squared: 0.6531, Adjusted R-squared: 0.5881
F-statistic: 10.04 on 3 and 16 DF, p-value: 0.0005817
> ## LMS ====
> require("MASS")
> LMSmodel.hamilton <- lmsreg(Nitro~Agri+Forests+Urban,
+ data=data.hamilton)
> print(LMSmodel.hamilton)
Call:
lqs.formula(formula = Nitro ~ Agri + Forests + Urban, data = data.hamilton,
method = "lms")
Coefficients:
(Intercept) Agri Forests Urban
3.40116 -0.01693 -0.03338 0.10374
Scale estimates 0.05056 0.05213
> ## LTS ====
> require("MASS")
> LTSmodel.hamilton.MASS <- ltsreg(Nitro~Agri+Forests+Urban,
+ data=data.hamilton)
> print(LTSmodel.hamilton.MASS)
Call:
lqs.formula(formula = Nitro ~ Agri + Forests + Urban, data = data.hamilton,
method = "lts")
Coefficients:
(Intercept) Agri Forests Urban
3.36336 -0.01672 -0.03334 0.13787
Scale estimates 0.08284 0.09678
> # or
> require("robustbase")
> LTSmodel.hamilton.robustbase <- ltsReg(Nitro~Agri+Forests+Urban,
+ data=data.hamilton)
> summary(LTSmodel.hamilton.robustbase)
Call:
ltsReg.formula(formula = Nitro ~ Agri + Forests + Urban, data = data.hamilton)
Residuals (from reweighted LS):
Min 1Q Median 3Q Max
-0.130496 -0.070646 -0.001935 0.074345 0.154205
Coefficients:
Estimate Std. Error t value Pr(>|t|)
Intercept 3.115874 0.575651 5.413 9.15e-05 ***
Agri -0.014039 0.006925 -2.027 0.062103 .
Forests -0.029167 0.006411 -4.549 0.000454 ***
Urban 0.119286 0.019622 6.079 2.84e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1056 on 14 degrees of freedom
Multiple R-Squared: 0.9416, Adjusted R-squared: 0.9291
F-statistic: 75.28 on 3 and 14 DF, p-value: 7.066e-09
D Software Specifications Used
Software Version
SAS 9.3 TS Level 1M0
R 2.15.2 (2012-10-26)
XLConnect package 0.2-3
MASS package 7.3-22
robustbase package 0.9-4
E Review Questions
True or False
1. The term robust in robust regression is a misnomer because robust regression only
deals with outlying values, i.e. it does not address assumption violations.
[FALSE]
2. LMS and LTS estimation are forms of M-estimation because they minimize an objective
function, after all. [FALSE]
3. If a wayward observation is considered both as an outlier in the y-direction and as a
leverage point, then it will always be consistent with the linear trend followed by the
majority of the data points. [FALSE]
4. Software implementations of LMS and LTS robust regression use resampling techniques
only because said procedures are asymptotically inefficient. [FALSE]
5. The LTS estimator is relatively asymptotically inefficient. [FALSE]
Multiple Choice
1. Also known as a leverage point
a.) Outlier in the x-direction
b.) Outlier in the y-direction
c.) Regression outlier
2. What kind of outliers does robust regression help identify?
a.) Outlier in the x-direction
b.) Outlier in the y-direction
c.) Regression outlier
3. In a positively skewed distribution, the LMS and LTS estimators will always be:
a.) less than the mean and the median
b.) between the mean and the median of said distribution
c.) greater than the mean and the median of said distribution
4. The breakdown bounds of the LMS and LTS univariate estimators are (approximately):
a.) 0%
b.) 25%
c.) 50%
5. The LTS estimator only includes the sub-batch of size ______ with the lowest squared
residuals in computing for its value.
a.) ⌊n/2⌋ + 1
b.) n − ⌊n/2⌋
c.) n − ⌊n/2⌋ − 1