
Ordinary Least Squares Regression

PO7001: Quantitative Methods I


Kenneth Benoit

24 November 2010

Independent and Dependent variables


- A dependent variable represents the quantity we wish to explain variation in, or the thing we are trying to explain.
  Typical examples of a dependent variable in political science:
  - votes received by a governing party
  - support for a referendum result like Lisbon
  - party support for European integration
- An independent variable represents a quantity whose variation will be used to explain variation in the dependent variable.
  Typical examples of independent variables in political science:
  - demographic: gender, national background, age
  - economic: socioeconomic status, income, national wealth
  - political: party affiliation of one's parents
  - institutional: district magnitude, electoral system, presidential v. parliamentary
  - behavioural: campaign spending levels
- Using this language implies causality: X → Y

The importance of variation


- Variation in outcomes of the dependent variable is what we seek to explain in social and political research.
- We seek to explain these outcomes using (independent) variables.
- The very language here, the term "variable", suggests that the quantity so named has to vary.
- Conversely, a quantity that does not vary is impossible to study in this way. This also applies to samples that do not vary: these will not help us in research.
- Typically when we collect data, we wish to have as much variation in our sample as possible.
- Example: In the returns-from-schooling example, it would be best to have as much variation in schooling as possible, to maximize the leverage of our research.

Different functional relationships


Linear A linear relationship, also known as a straight-line relationship, exists if a line drawn through the central tendency of the points is a straight line
Curvilinear Exists if the relationship between the variables is not a straight-line function, but is instead curved
Example: Television viewing (Y) as a function of age (X)

(More on this in Quant 2)

Correlations

- The central idea behind correlation is that two variables have a systematic relationship between their values.
- This relationship can be positive or negative.
- This relationship varies in strength.
- Bivariate associations are usually depicted graphically by use of a scatterplot.
- We can also summarize (as per an earlier week) the correlation using Pearson's r, to show the direction and strength of the bivariate relationship.

Pearson's r revisited

$$
r = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_i (X_i - \bar{X})^2 \sum_i (Y_i - \bar{Y})^2}}
  = \frac{\text{Sum of Products}}{\sqrt{\text{Sum of Squares}_x \cdot \text{Sum of Squares}_y}}
  = \frac{SP}{\sqrt{SS_x \, SS_y}}
$$
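
A minimal sketch of this formula in R, using the x and y vectors from the significance-testing example later in these slides:

x <- c(12,10,6,16,8,9,12,11)
y <- c(12,8,12,11,10,8,16,15)
SP  <- sum((x - mean(x)) * (y - mean(y)))   # sum of products
SSx <- sum((x - mean(x))^2)                 # sum of squares of x
SSy <- sum((y - mean(y))^2)                 # sum of squares of y
SP / sqrt(SSx * SSy)                        # equals cor(x, y)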

Testing the significance of Pearson's r

- Pearson's r measures correlation in the sample, but the association we are interested in exists in the population.
- Question: what is the probability that any correlation we measure in a sample really exists in the population, and is not merely due to sampling error?
- H0: ρ_xy = 0 (no correlation exists in the population)
- HA: ρ_xy ≠ 0
- Significance can be tested by the t-ratio:

$$
t = \frac{r\sqrt{N-2}}{\sqrt{1-r^2}}
$$

Pearson's r significance testing: Example

- Let's assume we have data such that r = .24, N = 8.
- The computation of t is then:

$$
t = \frac{.24\sqrt{8-2}}{\sqrt{1-(.24)^2}}
  = \frac{.24(2.45)}{\sqrt{1-.0576}}
  = \frac{.59}{\sqrt{.9424}}
  = \frac{.59}{.97}
  = .61
$$

- From R (using qt()), we know that the critical value for t with df = 6, α = .05 is 2.447, so we do not reject H0.
- In R there is a test for this called cor.test(x,y).

Pearson's r significance testing using R


> x <- c(12,10,6,16,8,9,12,11)
> y <- c(12,8,12,11,10,8,16,15)
> cor(x,y)
[1] 0.2420615
> # compute the empirical t-value
> (t.calc <- (.24*sqrt(8-2) / sqrt(1-.24^2)))
[1] 0.6055768
> # compute the critical t-value
> (t.crit <- qt(1-.05/2, 6))
[1] 2.446912
> # compare
> t.calc > t.crit
[1] FALSE
> # using R's built-in correlation significance test
> cor.test(x,y)
Pearson's product-moment correlation
data: x and y
t = 0.6111, df = 6, p-value = 0.5636
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.5577490 0.8087778
sample estimates:
cor
0.2420615

Regression analysis

- Recall the basic linear model:

$$
y_i = a + bX_i
$$

- Here the relationship is determined by two parameters:
  a the intercept: this refers to the value of Y when X is zero
  b the slope: the rate of change in Y for a one-unit change in X. Also known as the regression coefficient
- Note that this implies a straight-line, perfect relationship, however. Because this is never the case in real research, we also have an error term or residual ε_i:

$$
Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i
$$

  and we use the β and ε terminology instead of the high-school math quantities a and b.

Example

par(mar=c(4,4,1,1))
x <- c(0,3,1,0,6,5,3,4,10,8)
y <- c(12,13,15,19,26,27,29,31,40,48)
plot(x, y, xlab="Number of prior convictions (X)",
     ylab="Sentence length (Y)", pch=19)
abline(h=c(10,20,30,40), col="grey70")

[Figure: scatterplot of sentence length (Y) against number of prior convictions (X)]

Least squares formulas

For the three parameters (simple regression):

- the regression coefficient:

$$
\hat\beta_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}
$$

- the intercept:

$$
\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}
$$

- and the residual variance σ²:

$$
\hat\sigma^2 = \frac{1}{n-2} \sum \left[ y_i - (\hat\beta_0 + \hat\beta_1 x_i) \right]^2
$$

Least squares formulas continued

Things to note:

- the prediction line is $\hat{y} = \hat\beta_0 + \hat\beta_1 x$
- the value $\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i$ is the predicted value for $x_i$
- the residual is $e_i = y_i - \hat{y}_i$
- the residual sum of squares (RSS) is $\sum_i e_i^2$
- the estimate for σ² is the same as $\hat\sigma^2 = RSS/(n-2)$

Example to show formulas in R

> x <- c(0,3,1,0,6,5,3,4,10,8)
> y <- c(12,13,15,19,26,27,29,31,40,48)
> (data <- data.frame(x, y, xdev=(x-mean(x)), ydev=(y-mean(y)),
+                     xdevydev=((x-mean(x))*(y-mean(y))),
+                     xdev2=(x-mean(x))^2,
+                     ydev2=(y-mean(y))^2))
    x  y xdev ydev xdevydev xdev2 ydev2
1   0 12   -4  -14       56    16   196
2   3 13   -1  -13       13     1   169
3   1 15   -3  -11       33     9   121
4   0 19   -4   -7       28    16    49
5   6 26    2    0        0     4     0
6   5 27    1    1        1     1     1
7   3 29   -1    3       -3     1     9
8   4 31    0    5        0     0    25
9  10 40    6   14       84    36   196
10  8 48    4   22       88    16   484
> (SP <- sum(data$xdevydev))
[1] 300
> (SSx <- sum(data$xdev2))
[1] 100
> (SSy <- sum(data$ydev2))
[1] 1250
> (b1 <- SP / SSx)
[1] 3
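
As a small addition (not part of the original output), the intercept then follows directly from the sample means, continuing the same R session:

> (b0 <- mean(y) - b1 * mean(x))   # 26 - 3*4, using x, y, b1 from above
[1] 14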

From observed to predicted relationship

- In the above example, $\hat\beta_0 = 14$, $\hat\beta_1 = 3$.
- This linear equation forms the regression line.
- The regression line always passes through two points:
  - the point $(x = 0,\; y = \hat\beta_0)$
  - the point $(\bar{x}, \bar{y})$ (the average X predicts the average Y)
- The residual sum of squares (RSS) is $\sum_i e_i^2$
- The regression line is the one that minimizes the RSS.

Plot of regression example

[Figure: scatterplot of sentence length (Y) against number of prior convictions (X), with the fitted regression line Yhat = 14 + 3X and the Y intercept marked]

Requirements for regression

1. Both variables should be measured at the interval level
2. The relationship (in the population) between X and Y is linear
   - A transformation may fix non-linearity in some cases
   - Outliers or influential points may need special treatment
3. Sample is randomly chosen (necessary for inference)
4. Both variables must be normally distributed, unless we have a very large sample

Regression terminology

- y is the dependent variable
  - referred to also (by Greene) as a regressand
- X are the independent variables
  - also known as explanatory variables
  - also known as regressors
- y is regressed on X
- The error term ε is sometimes called a disturbance

Notation

- Independent variables: X
- Dependent variable: Y
- Y_i is a random variable (not directly observed)
- y_i is a realised value of Y_i (observed)
- So "dependent variable" sometimes refers to a set of numbers in your dataset (y_i) and sometimes to a random variable at each i (Y_i).

Interpreting the regression results

- The Y-intercept corresponds to the expected value of Y when X = 0 (may or may not be meaningful).
- The regression coefficient β1 refers to the expected change in Y resulting from a one-unit change in X.
- Typically we are more interested in β1 than β0, but β0 forms a vital part of any regression estimation.
- We can make a prediction $\hat{y}_i$ for any i (and $x_i$, although we should be careful to choose reasonable values of $x_i$).
- Example: for $x_i = 13$,

$$
\hat{y}_i = \hat\beta_0 + \hat\beta_1(13) = 14 + 3(13) = 14 + 39 = 53
$$
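
The same prediction can be obtained with R's predict() function; a minimal sketch using the sentencing-example data from above:

x <- c(0,3,1,0,6,5,3,4,10,8)
y <- c(12,13,15,19,26,27,29,31,40,48)
m <- lm(y ~ x)                              # yields intercept 14, slope 3
predict(m, newdata = data.frame(x = 13))    # 14 + 3*13 = 53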

Sums of squares (ANOVA)

TSS Total sum of squares: $\sum (y_i - \bar{y})^2$
ESS Estimation or Regression sum of squares: $\sum (\hat{y}_i - \bar{y})^2$
RSS Residual sum of squares: $\sum e_i^2 = \sum (y_i - \hat{y}_i)^2$

The key to remember is that TSS = ESS + RSS.

R²

How much of the variance did we explain?

$$
R^2 = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum_{i=1}^N (y_i - \hat{y}_i)^2}{\sum_{i=1}^N (y_i - \bar{y})^2}
$$

Can be interpreted as the proportion of total variance explained by the model.

R² and Pearson's r

For simple linear regression (i.e. one independent variable), R² is the same as the correlation coefficient, Pearson's r, squared.

R²

- Note that computing the regression line uses the same sums of squares (SP, SS_x, SS_y) used in computing the correlation r.
- The squared correlation R² is an important quantity in regression (sometimes called the coefficient of determination).
- R² is the proportion of variance in Y determined by variation in X.
- 0 ≤ R² ≤ 1.0
- But remember that R² is not a regression parameter.
- Computation:

$$
r = \frac{SP}{\sqrt{SS_x \, SS_y}}
$$

R² computation example

$$
r = \frac{SP}{\sqrt{SS_x \, SS_y}}
  = \frac{300}{\sqrt{(100)(1250)}}
  = \frac{300}{\sqrt{125000}}
  = \frac{300}{353.55}
  = .85
$$

r² is then (.85)² = .72.

This means that 72% of the variation in sentence length is explained by the number of prior convictions.

R²

- A much over-used statistic: it may not be what we are interested in at all.
- Interpretation: the proportion of the variation in y that is explained linearly by the independent variables.
- Defined in terms of sums of squares:

$$
R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}
$$

- Alternatively, R² is the squared correlation coefficient between y and ŷ.
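
A quick check of that equivalence in R, again using the sentencing-example data (a sketch):

x <- c(0,3,1,0,6,5,3,4,10,8)
y <- c(12,13,15,19,26,27,29,31,40,48)
m <- lm(y ~ x)
summary(m)$r.squared    # R^2 reported by the model
cor(y, fitted(m))^2     # squared correlation between y and yhat: the same value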

R² continued

- When a model has no intercept, it is possible for R² to lie outside the interval (0, 1).
- R² rises with the addition of more explanatory variables. For this reason we often report the adjusted R²: $1 - (1-R^2)\frac{n-1}{n-k-1}$, where k is the total number of regressors in the linear model (excluding the constant).
- Whether R² is high or not depends a lot on the overall variance in Y.
- Hence R² values from different Y samples cannot be compared.
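
A brief illustration of these points in R (a sketch; the noise regressor is invented purely for illustration):

set.seed(1)
x <- c(0,3,1,0,6,5,3,4,10,8)
y <- c(12,13,15,19,26,27,29,31,40,48)
noise <- rnorm(10)                          # an irrelevant regressor
summary(lm(y ~ x))$r.squared                # R^2 with one regressor
summary(lm(y ~ x + noise))$r.squared        # never lower after adding a regressor
summary(lm(y ~ x + noise))$adj.r.squared    # adjusted R^2 penalises the extra term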

R² continued

Solid arrow: variation in y when X is unknown (TSS, Total Sum of Squares: $\sum (y_i - \bar{y})^2$)

Dashed arrow: variation in y when X is known (ESS, Estimation Sum of Squares: $\sum (\hat{y}_i - \bar{y})^2$)

R² decomposed

$$y = \hat{y} + e$$
$$\mathrm{Var}(y) = \mathrm{Var}(\hat{y}) + \mathrm{Var}(e) + 2\,\mathrm{Cov}(\hat{y}, e) = \mathrm{Var}(\hat{y}) + \mathrm{Var}(e) + 0$$
$$\sum (y_i - \bar{y})^2 / N = \sum (\hat{y}_i - \bar{y})^2 / N + \sum (e_i - \bar{e})^2 / N$$
$$\sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (e_i - \bar{e})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum e_i^2$$
$$\mathrm{TSS} = \mathrm{ESS} + \mathrm{RSS}$$
$$\mathrm{TSS}/\mathrm{TSS} = \mathrm{ESS}/\mathrm{TSS} + \mathrm{RSS}/\mathrm{TSS}$$
$$1 = R^2 + \text{unexplained variance}$$

Example from height-weight data

(of course) there is a direct way to compute regression statistics in R:

> # regression models in R
> regmodel <- lm(y ~ x)
> summary(regmodel)

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-5.2121 -2.5682 -0.6515  1.5303  8.2424

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  49.0909    21.6202   2.271   0.0636 .
x             0.7576     0.3992   1.898   0.1065
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.587 on 6 degrees of freedom
Multiple R-squared: 0.375, Adjusted R-squared: 0.2709
F-statistic: 3.601 on 1 and 6 DF, p-value: 0.1065

Illustration using the Anscombe dataset

[show R analysis here]

## Illustration using Anscombe dataset
data(anscombe)
attach(anscombe)
round(coef(lm(y1~x1)), 2)   # same b0, b1
round(coef(lm(y2~x2)), 2)
round(coef(lm(y3~x3)), 2)
round(coef(lm(y4~x4)), 2)
round(summary(lm(y1~x1))$r.squared, 2)   # same R^2
round(summary(lm(y2~x2))$r.squared, 2)
round(summary(lm(y3~x3))$r.squared, 2)
round(summary(lm(y4~x4))$r.squared, 2)
# plot the four x-y pairs
par(mfrow=c(2,2), mar=c(4, 4, 1, 1)+0.1)   # 4 plots in one graphic window
plot(x1,y1)
abline(lm(y1~x1), col="red", lty="dashed")
plot(x2,y2)
abline(lm(y2~x2), col="red", lty="dashed")
plot(x3,y3)
abline(lm(y3~x3), col="red", lty="dashed")
plot(x4,y4)
abline(lm(y4~x4), col="red", lty="dashed")
detach(anscombe)

Anscombe plots

[Figure: four scatterplots (y1 vs x1, y2 vs x2, y3 vs x3, y4 vs x4), each with the same dashed fitted regression line]

Extensions of the regression model

- The regression model can be generalized to multiple regression, which involves regressing Y on several independent variables X1, X2, etc. (see the sketch after this list).
- Regression allows us to isolate the linear contribution of each unit of each Xk on Y, by holding everything else constant.
- This is the most common and most powerful basic technique in social science statistics, and something that is used in virtually every analysis that attempts to establish any sort of cause and effect.
- The additional X variables are typically considered to be control variables.
- Some variables may also be categorical, especially when they are dummy variables (0 or 1).
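
A minimal sketch of a multiple regression in R, using the built-in mtcars data rather than one of the lecture datasets:

data(mtcars)
m <- lm(mpg ~ wt + hp + factor(am), data = mtcars)   # factor(am) enters as a 0/1 dummy
summary(m)   # each slope: expected change in mpg for a one-unit change, holding the others constant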

Linear model

The linear model can be written as:

$$Y_i = X_i\beta + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2)$$

Or, alternatively, as:

$$Y_i \sim N(\mu_i, \sigma^2), \qquad \mu_i = X_i\beta$$

Linear model

Two components of the model:

$$Y_i \sim N(\mu_i, \sigma^2) \quad \text{(stochastic)}$$
$$\mu_i = X_i\beta \quad \text{(systematic)}$$

Generalised version:

$$Y_i \sim f(\theta_i, \alpha) \quad \text{(stochastic)}$$
$$\theta_i = g(X_i, \beta) \quad \text{(systematic)}$$

Model

$$Y_i \sim f(\theta_i, \alpha) \quad \text{(stochastic)}$$
$$\theta_i = g(X_i, \beta) \quad \text{(systematic)}$$

Stochastic component: varies over repeated (hypothetical) observations on the same unit.

Systematic component: varies across units, but is constant given X.

Model

$$Y_i \sim f(\theta_i, \alpha) \quad \text{(stochastic)}$$
$$\theta_i = g(X_i, \beta) \quad \text{(systematic)}$$

Two types of uncertainty:

Estimation uncertainty: lack of knowledge about α and β; can be reduced by increasing N.

Fundamental uncertainty: represented by the stochastic component and exists independent of the researcher.

Inference from regression

In linear regression, the sampling distribution of the coefficient estimates is normal, and is approximated by a t distribution because σ is approximated by s.

Thus we can calculate a confidence interval for each estimated coefficient, or perform a hypothesis test along the lines of:

H0: β1 = 0
H1: β1 ≠ 0

Inference from regression

To calculate the confidence interval, we need to calculate the standard error of the coefficient.

Rule of thumb to get the 95% confidence interval:

$$\hat\beta - 2\,SE < \beta < \hat\beta + 2\,SE$$

Thus if $\hat\beta$ is positive, we are 95% certain it is different from zero when $\hat\beta - 2\,SE > 0$ (or when the t value is greater than 2 or less than -2).

In R, we get the standard errors by using the summary() command on the model output.
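
R can also compute exact t-based intervals directly; a sketch using the sentencing-example model from earlier:

x <- c(0,3,1,0,6,5,3,4,10,8)
y <- c(12,13,15,19,26,27,29,31,40,48)
m <- lm(y ~ x)
summary(m)$coefficients    # estimates, standard errors, t values, p values
confint(m, level = 0.95)   # compare with the rough estimate +/- 2*SE rule of thumb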

Simple regression model: example

Does infant mortality decrease as GDP per capita increases (measured in 1976)?

> m <- lm(INFMORT ~ LEVEL, data=aclp,
+         subset=(YEAR == 1976 & INFMORT >= 0))
> summary(m)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) 33.3028647  2.7222356  12.234 9.44e-13 ***
LEVEL       -0.0018998  0.0003038  -6.253 9.29e-07 ***

F-test

In simple linear regression, we can do an F-test of:

H0: β1 = 0
H1: β1 ≠ 0

$$
F = \frac{ESS/1}{RSS/(n-2)} = \frac{ESS}{\hat\sigma^2}
$$

with 1 and n − 2 degrees of freedom.

Example

> require(foreign)
> dail <- read.dta("dailcorrected.dta")
> summary(lm(votes1st ~ spend_total + incumb + electorate + minister, data=dail))

Call:
lm(formula = votes1st ~ spend_total + incumb + electorate + minister,
    data = dail)

Residuals:
    Min      1Q  Median      3Q     Max
-4934.1 -1038.8  -347.6  1054.0  6900.3

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)      7.966e+02  4.172e+02   1.909   0.0569 .
spend_total      1.737e-01  1.095e-02  15.862   <2e-16 ***
incumbIncumbent  2.522e+03  2.207e+02  11.424   <2e-16 ***
electorate      -4.827e-04  5.404e-03  -0.089   0.9289
minister        -1.303e+02  3.965e+02  -0.329   0.7425
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1847 on 458 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared: 0.6478, Adjusted R-squared: 0.6447
F-statistic: 210.6 on 4 and 458 DF, p-value: < 2.2e-16

CLRM: Basic Assumptions

1. Specification:
   - The relationship between X and Y in the population is linear: E(Y) = Xβ
   - No extraneous variables in X
   - No omitted independent variables
   - Parameters (β) are constant
2. E(ε) = 0
3. Error terms:
   - Var(ε) = σ², or homoskedastic errors
   - E(ε_i ε_j) = 0 for i ≠ j, or no auto-correlation

CLRM: Basic Assumptions (cont.)

4. X is non-stochastic, meaning observations on independent variables are fixed in repeated samples
   - implies no measurement error in X
   - implies no serial correlation where a lagged value of Y would be used as an independent variable
   - no simultaneity or endogenous X variables
5. N > k, or the number of observations is greater than the number of independent variables (in matrix terms: rank(X) = k), and no exact linear relationships exist in X
6. Normally distributed errors: ε|X ~ N(0, σ²). Technically, however, this is a convenience rather than a strict assumption

Ordinary Least Squares (OLS)

Objective: minimize $\sum e_i^2 = \sum (Y_i - \hat{Y}_i)^2$, where

- $\hat{Y}_i = b_0 + b_1 X_i$
- the error is $e_i = (Y_i - \hat{Y}_i)$

$$
b_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} = \frac{\sum x_i y_i}{\sum x_i^2}
$$

(where lower-case x and y denote deviations from the means)

The intercept is: $b_0 = \bar{Y} - b_1\bar{X}$

OLS rationale

- Formulas are very simple
- Closely related to ANOVA (sums of squares decomposition)
- Predicted Y is the sample mean when Pr(Y|X) = Pr(Y)
  - In the special case where Y has no relation to X, $b_1 = 0$, and the OLS fit is simply $\hat{Y} = b_0$
  - Why? Because $b_0 = \bar{Y} - b_1\bar{X}$, so $\hat{Y} = \bar{Y}$
  - The prediction is then the sample mean when X is unrelated to Y
- Since OLS is then an extension of the sample mean, it has the same attractive properties (efficiency and lack of bias)
- Alternatives exist, but OLS generally has the best properties when its assumptions are met

OLS in matrix notation

- Formula for the coefficient vector β (a short verification in R follows below):

$$Y = X\beta + \varepsilon$$
$$X'Y = X'X\beta + X'\varepsilon$$
$$X'Y = X'X\beta + 0$$
$$(X'X)^{-1}X'Y = \beta + 0$$
$$\hat\beta = (X'X)^{-1}X'Y$$

- Formula for the variance-covariance matrix: $\sigma^2 (X'X)^{-1}$
  - In the simple case where $y = \beta_0 + \beta_1 x$, this gives $\sigma^2 / \sum (x_i - \bar{x})^2$ for the variance of $\hat\beta_1$
  - Note how increasing the variation in X will reduce the variance of $\hat\beta_1$
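
A short verification of the matrix formula against lm(), using the sentencing-example data (a sketch):

x <- c(0,3,1,0,6,5,3,4,10,8)
y <- c(12,13,15,19,26,27,29,31,40,48)
X <- cbind(1, x)                    # design matrix: a column of ones plus x
solve(t(X) %*% X) %*% t(X) %*% y    # (X'X)^{-1} X'Y, giving 14 and 3
coef(lm(y ~ x))                     # the same estimates from lm()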

The hat matrix

The hat matrix H is defined as:

$$\hat\beta = (X'X)^{-1}X'y$$
$$X\hat\beta = X(X'X)^{-1}X'y$$
$$\hat{y} = Hy$$

- $H = X(X'X)^{-1}X'$ is called the hat matrix (see the sketch below)
- Other important quantities, such as $\hat{y}$ and $\sum e_i^2$ (RSS), can be expressed as functions of H
- Corrections for heteroskedastic errors ("robust" standard errors) involve manipulating H
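
A minimal sketch of these identities in R, using the same design matrix as in the previous sketch:

x <- c(0,3,1,0,6,5,3,4,10,8)
y <- c(12,13,15,19,26,27,29,31,40,48)
X <- cbind(1, x)
H <- X %*% solve(t(X) %*% X) %*% t(X)   # the hat matrix
yhat <- H %*% y                         # fitted values, identical to fitted(lm(y ~ x))
sum((y - yhat)^2)                       # RSS expressed through H
diag(H)                                 # leverages, also available as hatvalues(lm(y ~ x))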

Some important OLS properties to understand

Applies to $y = \alpha + \beta x + \varepsilon$

- If β = 0 and the only regressor is the intercept, then this is the same as regressing y on a column of ones, and hence $\hat\alpha = \bar{y}$, the mean of the observations
- If α = 0, so that there is no intercept and one explanatory variable x, then $\hat\beta = \sum xy / \sum x^2$
- If there is an intercept and one explanatory variable, then

$$
\hat\beta = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} = \frac{\sum_i (x_i - \bar{x}) y_i}{\sum_i (x_i - \bar{x})^2}
$$

Some important OLS properties (cont.)

- If the observations are expressed as deviations from their means, $y^* = y - \bar{y}$ and $x^* = x - \bar{x}$, then $\hat\beta = \sum x^* y^* / \sum x^{*2}$
- The intercept can be estimated as $\bar{y} - \hat\beta\bar{x}$. This implies that the intercept is estimated by the value that causes the sum of the OLS residuals to equal zero.
- The mean of the $\hat{y}$ values equals the mean of the y values; together with the previous properties, this implies that the OLS regression line passes through the overall mean of the data points.

Normally distributed errors

OLS in R

> dail <- read.dta("dail2002.dta")
> mdl <- lm(votes1st ~ spend_total*incumb + minister, data=dail)
> summary(mdl)

Call:
lm(formula = votes1st ~ spend_total * incumb + minister, data = dail)

Residuals:
    Min      1Q  Median      3Q     Max
-5555.8  -979.2  -262.4   877.2  6816.5

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)         469.37438  161.54635   2.906  0.00384 **
spend_total           0.20336    0.01148  17.713  < 2e-16 ***
incumb             5150.75818  536.36856   9.603  < 2e-16 ***
minister           1260.00137  474.96610   2.653  0.00826 **
spend_total:incumb   -0.14904    0.02746  -5.428 9.28e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1796 on 457 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared: 0.6672, Adjusted R-squared: 0.6643
F-statistic: 229 on 4 and 457 DF, p-value: < 2.2e-16

OLS in Stata

. use dail2002
(Ireland 2002 Dail Election - Candidate Spending Data)
. gen spendXinc = spend_total * incumb
(2 missing values generated)
. reg votes1st spend_total incumb minister spendXinc

      Source |       SS       df       MS              Number of obs =     462
-------------+------------------------------           F(  4,   457) =  229.05
       Model |  2.9549e+09     4  738728297            Prob > F      =  0.0000
    Residual |  1.4739e+09   457 3225201.58            R-squared     =  0.6672
-------------+------------------------------           Adj R-squared =  0.6643
       Total |  4.4288e+09   461 9607007.17            Root MSE      =  1795.9

------------------------------------------------------------------------------
    votes1st |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 spend_total |   .2033637   .0114807    17.71   0.000     .1808021    .2259252
      incumb |   5150.758   536.3686     9.60   0.000     4096.704    6204.813
    minister |   1260.001   474.9661     2.65   0.008      326.613     2193.39
   spendXinc |  -.1490399   .0274584    -5.43   0.000    -.2030003   -.0950794
       _cons |   469.3744   161.5464     2.91   0.004     151.9086    786.8402
------------------------------------------------------------------------------

Sums of squares (ANOVA)

TSS Total sum of squares: $\sum (y_i - \bar{y})^2$
ESS Estimation or Regression sum of squares: $\sum (\hat{y}_i - \bar{y})^2$
RSS Residual sum of squares: $\sum e_i^2 = \sum (y_i - \hat{y}_i)^2$

The key to remember is that TSS = ESS + RSS.

Examining the sums of squares

> yhat <- mdl$fitted.values    # uses the lm object mdl from the previous slide
> ybar <- mean(mdl$model[,1])
> y <- mdl$model[,1]           # can't use dail$votes1st since different N
> TSS <- sum((y-ybar)^2)
> ESS <- sum((yhat-ybar)^2)
> RSS <- sum((yhat-y)^2)
> RSS
[1] 1473917120
> sum(mdl$residuals^2)
[1] 1473917120
> (r2 <- ESS/TSS)
[1] 0.6671995
> (adjr2 <- (1 - (1-r2)*(462-1)/(462-4-1)))
[1] 0.6642865
> summary(mdl)$r.squared       # note the call to summary()
[1] 0.6671995
> RSS/457
[1] 3225202
> sqrt(RSS/457)
[1] 1795.885
> summary(mdl)$sigma
[1] 1795.885

Regression model return values


Here we will talk about the quantities returned with the lm()
command and lm class objects.
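
As a starting point, a quick sketch of what an lm object and its summary contain (component names may vary slightly across R versions):

x <- c(0,3,1,0,6,5,3,4,10,8)
y <- c(12,13,15,19,26,27,29,31,40,48)
m <- lm(y ~ x)       # any fitted model; here the sentencing example
names(m)             # includes "coefficients", "residuals", "fitted.values", "df.residual", ...
names(summary(m))    # includes "coefficients", "sigma", "r.squared", "adj.r.squared", "fstatistic", ...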

Next (last) week

- ANOVA (analysis of variance)
- Regression with categorical independent variables
- Regression with interaction terms (Brambor, Clark, and Golder)
- Preview of Quant 2 with non-linear/non-normal models
