
Extensions of the Two-Variable Linear Regression Model

and Finding Outliers


Jamie Monogan
University of Georgia

Intermediate Political Methodology

Jamie Monogan (UGA)

Extensions of the Two-Variable Linear Model

POLS 7014

1 / 17

Objectives

By the end of this meeting, participants should be able to:


Explain the problems of regression through the origin.
Rescale the variables in a regression model and estimate a linear
model with standardized variables.
Solve problems with logarithmic and exponential functions.
Interpret linear models in which one or more variables has been
logged.
Estimate a reciprocal model.
Diagnose outliers, leverage, and influence points.


Regression Through the Origin


Very Bad 99.9999% of the Time

What if we assumed $\beta_1 = 0$ and assumed the following population
regression function: $Y_i = \beta_2 X_i + u_i$.
Our OLS estimator of this model is:
$\hat{\beta}_2 = \frac{\sum X_i Y_i}{\sum X_i^2}$
Our estimators for the standard error of the regression, the standard error of $\hat{\beta}_2$,
and the raw $r^2$ also change.
Notice the lack of centering in the estimator. We are now finding the
best fit line that also goes through the origin.
Many of the great mathematical properties we derived from the
model with the intercept are now gone. Under regression through the
origin, the sum of the residuals need not be zero.
This model is only used in special applications, and you have to be
100% sure that it is appropriate before using it.
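A minimal R sketch of the points above, using simulated data (the data-generating values and variable names are assumptions for illustration, not from the lecture). It checks the closed-form slope against `lm()` with the intercept dropped and compares the residual sums with and without an intercept.

```r
# Simulated data: the true intercept is 3, so forcing the line
# through the origin is a misspecification here.
set.seed(7014)
x <- runif(50, min = 1, max = 10)
y <- 3 + 2 * x + rnorm(50)

# Closed-form slope for regression through the origin: sum(XY) / sum(X^2)
b2_origin <- sum(x * y) / sum(x^2)

# Same model via lm(); "0 +" drops the intercept
fit_origin <- lm(y ~ 0 + x)
fit_full   <- lm(y ~ x)

# Residuals sum to (numerically) zero only when an intercept is included
resid_sum_origin <- sum(resid(fit_origin))
resid_sum_full   <- sum(resid(fit_full))

print(c(formula = b2_origin, lm = unname(coef(fit_origin)["x"])))
print(c(origin = resid_sum_origin, full = resid_sum_full))
```

In this setup the through-the-origin slope is pulled away from 2 because the line has to absorb the omitted intercept.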

Rescaling Variables
In general, it is probably best to scale your variables appropriately
before you even estimate a model.
Before estimating a model with income as a variable, you should think
about what unit you want to use. Dollars? Thousands of dollars?
If you rescale the variable beforehand, then you can sensibly interpret
the results afterward. Just remember, the true $\beta_2$ tells us, for a
one-unit change in the input variable, how many units the
outcome will change on average. You have to know what a unit is,
though.
For example, suppose $X_i$ is years of education and $Y_i$ is dollars of
income. How would you interpret the slope coefficient for this
estimated model?
$Y_i = 20000 + 1000 X_i + \hat{u}_i$
So get the scaling right beforehand and know your data in order to
interpret a model well.

Rescaling Variables, the Post Hoc Strategy


Gujarati & Porter show some results about rescaling after the
fact. This is fun because it shows how theoretical properties of
estimators can help us find real results.
Suppose you estimate $Y_i = \hat{\beta}_1 + \hat{\beta}_2 X_i + \hat{u}_i$. What if you want to
rescale the variables as $Y_i^* = w_1 Y_i$ and $X_i^* = w_2 X_i$, where $w_1$ and $w_2$
are two constants?
It turns out you can fill in the blanks for the model
$Y_i^* = \hat{\beta}_1^* + \hat{\beta}_2^* X_i^* + \hat{u}_i^*$. In particular:
$\hat{\beta}_2^* = \frac{w_1}{w_2} \hat{\beta}_2$
$\hat{\beta}_1^* = w_1 \hat{\beta}_1$
Formulas for the error variance and $r^2$ can also be derived.

Take our model where $X_i$ is years of education and $Y_i$ is dollars of
income: $Y_i = 20000 + 1000 X_i + \hat{u}_i$. What is the model in which
income is measured in thousands of dollars?
$Y_i^* = \frac{1}{1000} Y_i$
$X_i^* = 1 \cdot X_i$
$Y_i^* = 20 + 1 X_i^* + \hat{u}_i^*$
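The rescaling rule can be checked numerically. The sketch below simulates education and income (the sample size, noise level, and variable names are assumptions for illustration), then compares the coefficients predicted by the $w_1$, $w_2$ rule with those from refitting on the rescaled variables.

```r
# Simulated data loosely matching the slide's example
set.seed(7014)
educ   <- sample(8:20, 100, replace = TRUE)            # years of education
income <- 20000 + 1000 * educ + rnorm(100, sd = 5000)  # dollars

w1 <- 1 / 1000  # income rescaled to thousands of dollars
w2 <- 1         # education left in years

fit_raw <- lm(income ~ educ)

# Rescale, then re-estimate
income_k <- w1 * income
educ_s   <- w2 * educ
fit_rescaled <- lm(income_k ~ educ_s)

# Coefficients predicted by the post hoc rule:
# intercept* = w1 * intercept, slope* = (w1 / w2) * slope
predicted <- c(w1 * coef(fit_raw)[1], (w1 / w2) * coef(fit_raw)[2])

print(rbind(rule = unname(predicted), refit = unname(coef(fit_rescaled))))
```

The two rows agree up to floating-point error, so no re-estimation is needed once $w_1$ and $w_2$ are known.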

Standardizing Variables
A common rescaling we might like to do is standardizing our variables.
The formula is simple:
$Y_i^* = \frac{Y_i - \bar{Y}}{S_Y}$
$X_i^* = \frac{X_i - \bar{X}}{S_X}$
Each rescaled variable has a mean of 0 and a standard deviation of 1.
Consequently the standardized regression model simplifies:
$Y_i^* = \beta_1^* + \beta_2^* X_i^* + u_i^*$
$= \beta_2^* X_i^* + u_i^*$
In terminology that is awful but path dependent, we call the
coefficients from a standardized regression beta coefficients or beta
weights.
I recommend just referring to these coefficients as standardized
coefficients.
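As an illustration, here is a short R sketch on simulated data (all values are assumptions). It standardizes both variables with `scale()` and shows that the estimated intercept is zero up to rounding. In the two-variable case, the standardized slope also equals the sample correlation between $X$ and $Y$.

```r
set.seed(7014)
x <- rnorm(200, mean = 12, sd = 3)
y <- 5 + 2 * x + rnorm(200, sd = 4)

# scale() subtracts the mean and divides by the sample standard deviation
x_std <- as.numeric(scale(x))
y_std <- as.numeric(scale(y))

fit_std <- lm(y_std ~ x_std)

intercept_std <- unname(coef(fit_std)[1])  # zero up to floating-point error
slope_std     <- unname(coef(fit_std)[2])  # the standardized coefficient

print(c(intercept = intercept_std, slope = slope_std, correlation = cor(x, y)))
```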

Interpreting Beta Weights


We interpret beta weights by saying that for a one standard deviation increase
in $X_i$, we expect $Y_i$ to increase by $\beta_2^*$ standard deviations on average.
What if our model of income as a function of education worked out
as follows when the variables were standardized? How would we
interpret this equation?
$Y_i^* = .9 X_i^* + \hat{u}_i^*$
Beta weights are substantially more interesting in multiple regression.
They allow you to ask: Which of my predictors explains more
variance in the outcome?

Looking Ahead: Occupational Prestige Data (Duncan)


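# Duncan's occupational prestige data are available once car is loaded
library(car); data(Duncan)
model <- lm(prestige ~ education + income, data = Duncan)
summary(model)
# lm.beta() from QuantPsyc reports the standardized (beta) coefficients
library(QuantPsyc); lm.beta(model)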
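If the add-on packages are not installed, standardized coefficients can also be computed in base R. The sketch below is an illustration only: it swaps in the built-in `mtcars` data rather than the Duncan data. It multiplies each slope by $S_X / S_Y$ and checks the result against a regression on fully standardized variables.

```r
# Unstandardized multiple regression on built-in data (not the Duncan data)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Standardized coefficient = slope * sd(predictor) / sd(outcome)
b  <- coef(fit)[-1]
sx <- sapply(mtcars[, c("wt", "hp")], sd)
sy <- sd(mtcars$mpg)
beta_manual <- b * sx / sy

# Cross-check: standardize every variable first, then re-estimate
std <- as.data.frame(scale(mtcars[, c("mpg", "wt", "hp")]))
fit_std <- lm(mpg ~ wt + hp, data = std)
beta_refit <- coef(fit_std)[-1]

print(rbind(manual = beta_manual, refit = beta_refit))
```

Comparing the absolute sizes of the two standardized coefficients is one way to approach the question of which predictor matters more.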

Exponents and Logarithms: A Math Review


Definition:
$(\log_b P = Q) \iff (b^Q = P)$
Important properties of logarithms:
1. $b^{\log_b x} = x$
2. $\log_b(xy) = \log_b(x) + \log_b(y)$
3. $\log_b\left(\frac{x}{y}\right) = \log_b(x) - \log_b(y)$
4. $\log_b(y^x) = x \log_b y$
5. $\log_b\left(\frac{1}{x}\right) = -\log_b(x)$, paralleling $b^{-x} = \frac{1}{b^x}$
6. $\log_b(1) = 0 \iff b^0 = 1$
7. $\log_b(x) = \frac{\log_a(x)}{\log_a(b)}$
Proof of property 7: Let $\log_b(x) = y$. Then $b^y = x$, so
$\log_a(b^y) = \log_a(x)$
$y \log_a(b) = \log_a(x)$
$y = \frac{\log_a(x)}{\log_a(b)}$
$\log_b(x) = \frac{\log_a(x)}{\log_a(b)}$

Logarithms
Base 10

The two most common logarithms are base 10 and base e.


Base 10: $y = \log_{10}(x) \iff 10^y = x$
The base 10 logarithm is often simply written as log(x) with no
base denoted.
Check the book or software you are using, though. Sometimes log
means base 10, sometimes log means base e.


Natural Log
Base e

[Figure: curves of $e^x$ and $\ln(x)$.]

Base e: $y = \log_e(x) \iff y = \ln(x) \iff e^y = x$
The base e logarithm is referred to as the natural logarithm and is
written as ln(x).
$e = 2.718281828\ldots$ Also, $\exp(x) = e^x$.


Examples:

$\log(\sqrt{10}) = \frac{1}{2}$

log(1) = 0
log(10) = 1
log(100) = 2
ln(1) = 0
ln(e) = 1
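These examples can be reproduced in R. Note that R's `log()` is the natural logarithm by default, while `log10()` and `log(x, base = b)` give other bases. The sketch also checks the change-of-base property with $b = 2$ and $x = 8$, values chosen only for illustration.

```r
# Base 10 examples from the slide
base10 <- c(log10(sqrt(10)), log10(1), log10(10), log10(100))

# Natural log examples: log() in R is base e unless told otherwise
natural <- c(log(1), log(exp(1)))

# Change of base: log_b(x) = log_a(x) / log_a(b), here with a = e
x <- 8
change_of_base <- log(x) / log(2)
direct         <- log(x, base = 2)

print(base10)
print(natural)
print(c(change_of_base = change_of_base, direct = direct))
```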


Logging Variables in a Regression Model


We can use the algebra of logarithms to force non-linear functional
forms into a linear framework.
For example, $Y_i = \beta_1 X_i^{\beta_2} e^{u_i}$ can be expressed as
$\ln Y_i = \ln \beta_1 + \beta_2 \ln X_i + u_i$. If we then say $\alpha = \ln \beta_1$, then we can
estimate $\ln Y_i = \alpha + \beta_2 \ln X_i + u_i$ with least squares.
Similarly, $Y_i = e^{\beta_1 + \beta_2 X_i + u_i}$ can be expressed as $\ln Y_i = \beta_1 + \beta_2 X_i + u_i$,
which allows us to estimate this with least squares.
How do you want to interpret these results? For 1-unit shifts in the
log of X? (Or Y?) Would you rather interpret in terms of raw X or Y?
Using algebra allows you to make more interesting statements in
many of these cases. Gujarati & Porter also give some interpretation
advice on page 173.
You always have the option to graphically illustrate your model.
Often this is the clearest possible choice. See Chapter 3 of Political
Analysis Using R for examples.
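To make the log-log case concrete, the following sketch simulates data from $Y_i = \beta_1 X_i^{\beta_2} e^{u_i}$ with assumed values $\beta_1 = 3$ and $\beta_2 = 0.5$ (chosen for illustration only) and recovers them with least squares on the logged variables.

```r
set.seed(7014)
n  <- 500
b1 <- 3
b2 <- 0.5
x  <- runif(n, min = 1, max = 100)
u  <- rnorm(n, sd = 0.2)
y  <- b1 * x^b2 * exp(u)   # multiplicative model, non-linear in x

# Taking logs of both sides gives a model that is linear in the parameters
fit_loglog <- lm(log(y) ~ log(x))

alpha_hat <- unname(coef(fit_loglog)[1])  # estimates ln(beta_1)
b2_hat    <- unname(coef(fit_loglog)[2])  # estimates beta_2
b1_hat    <- exp(alpha_hat)               # back-transform the intercept

print(c(alpha_hat = alpha_hat, b1_hat = b1_hat, b2_hat = b2_hat))
```

In a log-log model the slope is commonly read as an elasticity: a 1% increase in $X$ is associated with roughly a $\beta_2$% change in $Y$ on average.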

Example: Interpreting the Immigrant Policy Model


In our analysis of immigrant policy in the fifty states, we have really
been analyzing a logged outcome variable. For Yi the ratio of
welcoming laws to hostile and Xi the percentage of the public that is
liberal minus the percentage conservative, our results are:
$\ln Y_i = .910 + .065 X_i + \hat{u}_i$
We could rewrite this: $Y_i = e^{.910 + .065 X_i + \hat{u}_i}$
Suppose $X_i$ increases one unit. How does $Y_i$ change? The new value of $Y_i$ is
$e^{.910 + .065 (X_i + 1) + \hat{u}_i} = e^{.910 + .065 X_i + \hat{u}_i} \, e^{.065}$
Since $e^{.065} \approx 1.067$, we can say that $Y_i$ gets 1.067 times bigger for a
one unit increase in $X_i$ on average.
For clearer interpretation, we can say: For each percentage point
more liberal a state's population becomes, the ratio of welcoming to
hostile laws increases by 6.7% on average.
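The arithmetic behind this interpretation is easy to reproduce. The sketch below uses only the slope reported on the slide. The ten-point shift is an added illustration, not a result from the lecture.

```r
b_x <- 0.065  # slope on X from the logged-outcome model

# Multiplicative change in Y for a one-unit increase in X
multiplier <- exp(b_x)

# The same effect expressed as a percentage change
pct_change <- 100 * (multiplier - 1)

# Effects compound multiplicatively: a ten-unit increase multiplies Y by exp(10 * b)
multiplier_10 <- exp(10 * b_x)

print(round(c(multiplier = multiplier, pct_change = pct_change,
              multiplier_10 = multiplier_10), 3))
```

For small coefficients, $100 \times \beta$ is a close approximation to the percentage change, but the exact figure comes from $100(e^{\beta} - 1)$.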

Reciprocal Models

Yet another kind of rescaling you may consider is to include the
reciprocal of your input as the predictor:
$Y_i = \beta_1 + \beta_2 \left(\frac{1}{X_i}\right) + u_i$
This yields a non-linear effect of X on Y .
If the coefficient is positive, as X increases Y decreases on average.
If the coefficient is negative, as X increases Y increases on average.
Popular application: For data on unemployment and inflation, find
the Phillips curve.
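A reciprocal model can be estimated with `lm()` by wrapping the transformation in `I()`. The sketch below uses simulated unemployment and inflation figures (hypothetical values, not real Phillips curve data) with a positive coefficient, so the fitted values fall as the predictor rises.

```r
# Simulated data with a positive coefficient on 1/X (values are assumptions)
set.seed(7014)
unemp <- runif(200, min = 2, max = 12)
infl  <- -1 + 10 * (1 / unemp) + rnorm(200, sd = 0.5)

# I() tells the formula to treat 1/unemp as arithmetic, not formula syntax
fit_recip <- lm(infl ~ I(1 / unemp))
slope_recip <- unname(coef(fit_recip)[2])

# Fitted curve over a grid: non-linear in unemp, declining when the slope is positive
grid <- data.frame(unemp = seq(2, 12, by = 0.5))
grid$infl_hat <- predict(fit_recip, newdata = grid)

print(coef(fit_recip))
print(head(grid))
```

As $X$ grows large, $1/X$ goes to zero, so the fitted curve flattens toward the intercept.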


Outliers, Leverage, and Influence Points


An outlier is any observation that has a large residual. See (a).
A leverage point is any observation that is disproportionately distant
from the bulk of the values of a regressor(s). See (c).
An influential point is an observation that is far from the bulk of the
values of a regressor and pulls the regression line towards itself. See
(b).

Source: Gujarati & Porter 2009, 497



Diagnosing Problematic Observations


Tools to diagnose data points that may cause trouble:
Visual evidence: scatterplots.
Outliers: studentized residuals.
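Leverage: hat values ($h_i$). Be careful about what $h_i$ is: it is the diagonal element $h_{ii}$ of the hat matrix $H$.
In matrix notation: $\hat{y} = X\hat{\beta} = X(X'X)^{-1}X'y = Hy$, where $H = \{h_{ij}\}$.
Influence: Cook's distance. $D_i = \frac{\hat{u}_i^2}{(k+1)\hat{\sigma}^2} \cdot \frac{h_i}{(1-h_i)^2}$.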

Solutions
One option is to remove problematic observations and re-estimate the
model. Be very cautious if you take this approach, though.
Draper & Smith 1998: Outliers may convey information that other
data cannot. Outliers may warrant careful investigation. Only when
traced to recording error should outliers be rejected automatically.
Can you identify why an observation is an outlier, leverage, or
influence point? Such information can usually guide your decision.
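The diagnostics listed above are all available in base R. The sketch below uses the built-in `mtcars` data purely as a stand-in (it is not the course data). It also rebuilds the hat values and Cook's distance from the formulas on this slide to confirm they match R's output.

```r
fit <- lm(mpg ~ wt, data = mtcars)
k <- length(coef(fit)) - 1          # number of predictors

stud_res <- rstudent(fit)           # studentized residuals (outliers)
h        <- hatvalues(fit)          # hat values (leverage)
cook     <- cooks.distance(fit)     # Cook's distance (influence)

# Hat matrix H = X (X'X)^{-1} X'; its diagonal holds the hat values
X <- model.matrix(fit)
H <- X %*% solve(crossprod(X)) %*% t(X)

# Cook's distance from the slide's formula
u_hat      <- resid(fit)
sigma2_hat <- sum(u_hat^2) / df.residual(fit)
cook_manual <- (u_hat^2 / ((k + 1) * sigma2_hat)) * (h / (1 - h)^2)

# Observations with the largest Cook's distance
diagnostics <- data.frame(stud_res, h, cook)
print(head(diagnostics[order(-diagnostics$cook), ], 3))
```

Sorting by Cook's distance gives a short list of observations to investigate. Whether any of them should be dropped is the judgment call described above.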

For Next Time


Read Gujarati & Porter chapter 7 (Multiple Regression Analysis: The
Problem of Estimation).
Load Agresti & Finlay's data on crime in the states:
R: library(foreign); crime <- read.dta("http://www.ats.ucla.edu/stat/stata/webbooks/reg/crime.dta")
Stata: use http://www.ats.ucla.edu/stat/stata/webbooks/reg/crime.dta

View the scatterplot with crime on the vertical axis and poverty on
the horizontal axis. What stands out? What is the best functional
form the model could take?
Do one of two things: (A) Estimate a model that is not linear in the
variables. (B) Estimate a linear-in-variables model, reporting both
unstandardized and standardized coefficients.
Diagnose your data for outliers, leverage, and influential data points.
Do you think it is fair to remove any observations? Why or why not?
If so, re-estimate your model without the influential observation(s).
Present the results of your model in a journal-worthy table.
Graph the functional form of your model with the real data on a
scatterplot.
