Extensions of the Two-Variable Linear Regression Model
and Finding Outliers

Jamie Monogan
University of Georgia
Intermediate Political Methodology
Jamie Monogan (UGA)
Extensions of the Two-Variable Linear Model
POLS 7014
Objectives
By the end of this meeting, participants should be able to:

Explain the problems of regression through the origin.
Rescale the variables in a regression model and estimate a linear
model with standardized variables.
Solve problems with logarithmic and exponential functions.
Interpret linear models in which one or more variables have been
logged.
Estimate a reciprocal model.
Diagnose outliers, leverage, and influence points.
Regression Through the Origin

Very Bad 99.9999% of the Time
What if we assumed $\beta_1 = 0$ and assumed the following population
regression function: $Y_i = \beta_2 X_i + u_i$.
Our OLS estimator of this model is:
$$\hat{\beta}_2 = \frac{\sum X_i Y_i}{\sum X_i^2}$$
Our estimators for standard error of regression, standard error of $\hat{\beta}_2$,
and raw $r^2$ also change.
Notice the lack of centering in the estimator. We are now finding the
best fit line that also goes through the origin.
Many of the great mathematical properties we derived from the
model with the intercept are now gone. Under regression through the
origin, the sum of the residuals need not be zero.
This model is only used in special applications, and you have to be
100% sure that it is appropriate before using it.
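A quick check of these claims in R may help. This is only a sketch on simulated data (the true intercept of 5, the sample size, and the seed are assumptions made for illustration): it compares the closed-form estimator with `lm()` and shows that the residuals no longer sum to zero.

```r
# Simulated data (hypothetical): the true intercept is 5, so forcing the
# line through the origin is the wrong model here.
set.seed(7014)
x <- runif(50, min = 1, max = 10)
y <- 5 + 2 * x + rnorm(50)

# Regression through the origin: "0 +" drops the intercept.
rto <- lm(y ~ 0 + x)

# Closed-form estimator: sum(X * Y) / sum(X^2), with no centering.
b2_by_hand <- sum(x * y) / sum(x^2)
all.equal(unname(coef(rto)), b2_by_hand)

# Residuals sum to (numerically) zero with an intercept, but not without one.
ols <- lm(y ~ x)
sum(resid(ols))
sum(resid(rto))
```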
Rescaling Variables
In general, it is probably best to scale your variables appropriately
before you even estimate a model.
Before estimating a model with income as a variable, you should think
about what unit you want to use. $1? $1,000?
If you rescale the variable beforehand, then you can sensibly interpret
the results afterward. Just remember, the true $\beta_2$ tells us, for a
one-unit change in the input variable, how many units the outcome will
change on average. You have to know what a unit is, though.
For example, suppose $X_i$ is years of education and $Y_i$ is dollars of
income. How would you interpret the slope coefficient for this
estimated model?
$$Y_i = 20000 + 1000X_i + \hat{u}_i$$
So get the scaling right beforehand and know your data in order to
interpret a model well.
Rescaling Variables, the Post Hoc Strategy

Gujarati & Porter show some results about rescaling after the
fact. This is fun because it shows how theoretical properties of
estimators can help us find real results.
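Suppose you estimate $Y_i = \hat{\beta}_1 + \hat{\beta}_2 X_i + \hat{u}_i$. What if you want to
rescale the variables as such: $Y_i^* = w_1 Y_i$ and $X_i^* = w_2 X_i$, for $w_1$ and $w_2$
two constant terms?
It turns out you can fill in the blanks for the model
$Y_i^* = \hat{\beta}_1^* + \hat{\beta}_2^* X_i^* + \hat{u}_i^*$. In particular:
$$\hat{\beta}_2^* = \frac{w_1}{w_2}\hat{\beta}_2$$
$$\hat{\beta}_1^* = w_1 \hat{\beta}_1$$
And formulas for error variance and $r^2$ also can be derived.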
Take our model where $X_i$ is years of education and $Y_i$ is dollars of
income: $Y_i = 20000 + 1000X_i + \hat{u}_i$. What is the model in which
income is measured in thousands of dollars?
$$Y_i^* = \frac{1}{1000} Y_i$$
$$X_i^* = 1 \cdot X_i$$
$$Y_i^* = 20 + 1X_i^* + \hat{u}_i^*$$
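The rescaling results can be checked numerically. The sketch below uses made-up education and income data (the variable names, sample size, and noise level are assumptions for illustration only) and confirms that the intercept scales by $w_1$ and the slope by $w_1 / w_2$.

```r
# Hypothetical data: years of education and income in dollars.
set.seed(7014)
educ <- sample(8:20, size = 100, replace = TRUE)
income <- 20000 + 1000 * educ + rnorm(100, sd = 5000)

w1 <- 1 / 1000   # income measured in thousands of dollars
w2 <- 1          # education left in years

income_star <- w1 * income
educ_star <- w2 * educ

original <- lm(income ~ educ)
rescaled <- lm(income_star ~ educ_star)

# Intercept scales by w1; slope scales by w1 / w2.
b <- unname(coef(original))
b_star <- unname(coef(rescaled))
all.equal(b_star[1], w1 * b[1])
all.equal(b_star[2], (w1 / w2) * b[2])
```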
Standardizing Variables
A common rescaling we might like to do is standardizing our variables.
The formula is simple:
$$Y_i^* = \frac{Y_i - \bar{Y}}{S_Y}$$
$$X_i^* = \frac{X_i - \bar{X}}{S_X}$$
Each rescaled variable has a mean of 0 and a standard deviation of 1.
Consequently the standardized regression model simplifies:
$$Y_i^* = \beta_1^* + \beta_2^* X_i^* + u_i^*$$
$$= \beta_2^* X_i^* + u_i^*$$
In terminology that is awful but path dependent, we call the
coefficients from a standardized regression beta coefficients or beta
weights.
I recommend just referring to these coefficients as standardized
coefficients.
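To see the simplification at work, here is a minimal sketch in base R on simulated data (the data-generating values are arbitrary assumptions). It standardizes both variables by hand, shows that the estimated intercept is zero up to rounding error, and checks the well-known fact that with a single predictor the standardized slope equals the correlation between X and Y.

```r
set.seed(7014)
x <- rnorm(200, mean = 12, sd = 3)
y <- 20 + 1.5 * x + rnorm(200, sd = 4)

# Standardize by hand: subtract the mean, divide by the sample standard deviation.
y_star <- (y - mean(y)) / sd(y)
x_star <- (x - mean(x)) / sd(x)

std_model <- lm(y_star ~ x_star)
coef(std_model)   # the intercept is zero up to rounding error

# With one predictor, the standardized slope equals the correlation r.
all.equal(unname(coef(std_model)[2]), cor(x, y))
```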
Interpreting Beta Weights

We interpret beta weights by saying for a standard deviation increase
in $X_i$, we expect $Y_i$ to increase by $\beta_2^*$ standard deviations on average.
What if our model of income as a function of education worked out
as follows when the variables were standardized? How would we
interpret this equation?
$$Y_i^* = .9X_i^* + \hat{u}_i^*$$
Beta weights are substantially more interesting in multiple regression.
They allow you to ask: Which of my predictors explains more
variance in the outcome?
Looking Ahead: Occupational Prestige Data (Duncan)

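# Duncan's occupational prestige data are available once car is loaded
library(car); data(Duncan)
model <- lm(prestige ~ education + income, data = Duncan)
summary(model)
# Standardized (beta) coefficients
library(QuantPsyc); lm.beta(model)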
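If the add-on packages are not installed, standardized coefficients can also be computed in base R. The sketch below is an illustration rather than the course's method: it swaps in the built-in `mtcars` data for the Duncan data and uses the standard identity that a standardized coefficient equals the raw coefficient times sd(x) / sd(y).

```r
# mtcars ships with base R; it stands in for the Duncan data here.
model <- lm(mpg ~ wt + hp, data = mtcars)

# Standardized coefficient: b_j * sd(x_j) / sd(y).
b <- coef(model)[c("wt", "hp")]
sx <- sapply(mtcars[, c("wt", "hp")], sd)
beta_by_formula <- b * sx / sd(mtcars$mpg)

# The same numbers come from refitting the model on z-scored variables.
z <- as.data.frame(scale(mtcars[, c("mpg", "wt", "hp")]))
beta_by_refit <- coef(lm(mpg ~ wt + hp, data = z))[c("wt", "hp")]

all.equal(unname(beta_by_formula), unname(beta_by_refit))
```

Comparing the absolute sizes of the two standardized coefficients is then the "which predictor explains more" comparison described above.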
Exponents and Logarithms: A Math Review

Definition:
$(\log_b P = Q) \iff (b^Q = P)$
Important properties of logarithms:
1. $b^{\log_b x} = x$
2. $\log_b(xy) = \log_b(x) + \log_b(y)$
3. $\log_b\left(\frac{x}{y}\right) = \log_b(x) - \log_b(y)$
4. $\log_b(y^x) = x \log_b y$
5. $\log_b\left(\frac{1}{x}\right) = -\log_b(x)$, and $b^{-x} = \frac{1}{b^x}$
6. $\log_b(1) = 0$, and $b^0 = 1$
7. $\log_b(x) = \frac{\log_a(x)}{\log_a(b)}$
Proof of property 7: Let $\log_b(x) = y$. Then $b^y = x$.
$$\log_a(b^y) = \log_a(x)$$
$$y \log_a(b) = \log_a(x)$$
$$y = \frac{\log_a(x)}{\log_a(b)}$$
$$\log_b(x) = \frac{\log_a(x)}{\log_a(b)}$$
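As a worked illustration of the change-of-base rule (the numbers are chosen here only for convenience), take $b = 2$, $x = 8$, and $a = e$. Property 4 handles the numerator, and the definition confirms the answer because $2^3 = 8$.

```latex
\begin{align*}
\log_2(8) &= \frac{\ln 8}{\ln 2} \\
          &= \frac{\ln(2^3)}{\ln 2} \\
          &= \frac{3 \ln 2}{\ln 2} \\
          &= 3
\end{align*}
```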
Logarithms
Base 10
The two most common logarithms are base 10 and base e.

Base 10: $y = \log_{10}(x) \iff 10^y = x$
The base 10 logarithm is often simply written as log(x) with no
base denoted.
Check the book or software you are using, though. Sometimes log
means base 10, sometimes log means base e.
Natural Log
Base e
Base e: $y = \log_e(x) \iff y = \ln(x) \iff e^y = x$
The base e logarithm is referred to as the natural logarithm and is
written as ln(x).
$e = 2.718281828\ldots$ Also, $\exp(x) = e^x$.
[Figure: plot of $e^x$ and $\ln(x)$]
Examples:
$\log(\sqrt{10}) = \frac{1}{2}$
log(1) = 0
log(10) = 1
log(100) = 2
ln(1) = 0
ln(e) = 1
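These examples are easy to reproduce in R, which also illustrates the warning about checking your software: in R, `log()` with no base argument is the natural log, and base 10 needs `log10()` or an explicit `base`. A small sketch:

```r
log10(sqrt(10))       # 0.5
log10(1)              # 0
log10(100)            # 2
log(1)                # 0
log(exp(1))           # 1, because log() is the natural log in R
log(100, base = 10)   # 2, the same as log10(100)
```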
Logging Variables in a Regression Model

We can use the algebra of logarithms to force non-linear functional
forms into a linear framework.
For example, $Y_i = \beta_1 X_i^{\beta_2} e^{u_i}$ can be expressed as
$\ln Y_i = \ln \beta_1 + \beta_2 \ln X_i + u_i$. If we then say $\alpha = \ln \beta_1$, then we can
estimate $\ln Y_i = \alpha + \beta_2 \ln X_i + u_i$ with least squares.
Similarly, $Y_i = e^{\beta_1 + \beta_2 X_i + u_i}$ can be expressed as $\ln Y_i = \beta_1 + \beta_2 X_i + u_i$,
which allows us to estimate this with least squares.
How do you want to interpret these results? For 1-unit shifts in the
log of X? (Or Y?) Would you rather interpret in terms of raw X or Y?
Using algebra allows you to make more interesting statements in
many of these cases. Gujarati & Porter also give some interpretation
advice on page 173.
You always have the option to graphically illustrate your model.
Often this is the clearest possible choice. See Chapter 3 of Political
Analysis Using R for examples.
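As a rough illustration of the first transformation, the sketch below simulates data from $Y_i = \beta_1 X_i^{\beta_2} e^{u_i}$ and recovers the parameters with least squares on the logged variables. The values $\beta_1 = 2$ and $\beta_2 = 0.5$, the sample size, and the error variance are all assumptions chosen for the example.

```r
set.seed(7014)
n <- 500
x <- runif(n, min = 1, max = 100)
u <- rnorm(n, sd = 0.1)
y <- 2 * x^0.5 * exp(u)   # Y = beta1 * X^beta2 * e^u with beta1 = 2, beta2 = 0.5

# Linear in the logs: ln Y = alpha + beta2 * ln X + u, where alpha = ln(beta1).
loglog <- lm(log(y) ~ log(x))
coef(loglog)              # intercept estimates ln(2); slope estimates 0.5

# Back-transform the intercept to get a rough estimate of beta1.
exp(coef(loglog)[1])
```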
Example: Interpreting the Immigrant Policy Model

In our analysis of immigrant policy in the fifty states, we have really
been analyzing a logged outcome variable. For $Y_i$ the ratio of
welcoming laws to hostile laws and $X_i$ the percentage of the public that is
liberal minus the percentage conservative, our results are:
$$\ln Y_i = .910 + .065X_i + \hat{u}_i$$
We could rewrite this: $Y_i = e^{.910 + .065X_i + \hat{u}_i}$
Suppose $X_i$ increases one unit. How does $Y_i$ change?
$$Y_i = e^{.910 + .065(X_i + 1) + \hat{u}_i}$$
$$= e^{.910 + .065X_i + \hat{u}_i} \, e^{.065}$$
Since $e^{.065} \approx 1.067$, we can say that $Y_i$ gets 1.067 times bigger for a
one unit increase in $X_i$ on average.
For clearer interpretation, we can say: For each percentage point
more liberal a state's population becomes, the ratio of welcoming to
hostile laws increases by 6.7% on average.
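The same algebra gives a general rule for any model with a logged outcome and an unlogged input. The derivation below is a sketch; the approximation in the last line is the standard small-coefficient shortcut, not something stated on the slide.

```latex
\begin{align*}
\text{percent change in } Y_i \text{ per unit of } X_i
  &= 100 \left( e^{\beta_2} - 1 \right) \\
\text{with } \beta_2 = .065: \quad
  100 \left( e^{.065} - 1 \right) &\approx 100(1.067 - 1) = 6.7 \\
\text{for small } \beta_2: \quad
  100 \left( e^{\beta_2} - 1 \right) &\approx 100 \beta_2 = 6.5
\end{align*}
```

The shortcut of reading the coefficient directly as a percentage is close here, but it drifts further from the exact figure as the coefficient grows.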
Reciprocal Models
Yet another kind of rescaling you may consider is to include the
reciprocal of your input as the predictor:
$$Y_i = \beta_1 + \beta_2 \left( \frac{1}{X_i} \right) + u_i$$
This yields a non-linear effect of X on Y .
If the coefficient is positive, as X increases Y decreases on average.
If the coefficient is negative, as X increases Y increases on average.
Popular application: For data on unemployment and inflation, find
the Phillips curve.
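A reciprocal model can be estimated with `lm()` by wrapping the transformation in `I()`. The sketch below uses simulated data with an assumed positive coefficient ($\beta_1 = 2$, $\beta_2 = 6$, both invented for illustration), so predicted Y should fall as X rises.

```r
set.seed(7014)
x <- runif(200, min = 1, max = 10)
y <- 2 + 6 * (1 / x) + rnorm(200, sd = 0.3)   # positive beta2: Y falls as X rises

# I() makes R treat 1 / x as arithmetic rather than formula syntax.
recip <- lm(y ~ I(1 / x))
coef(recip)

# Predicted Y at a few X values: decreasing, and flattening out as X grows.
p <- predict(recip, newdata = data.frame(x = c(1, 2, 5, 10)))
p
```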
Outliers, Leverage, and Influence Points

An outlier is any observation that has a large residual. See (a).
A leverage point is any observation that is disproportionately distant
from the bulk of the values of the regressor(s). See (c).
An influential point is an observation that is far from the bulk of the
values of a regressor and pulls the regression line towards itself. See
(b).
Source: Gujarati & Porter 2009, 497
Diagnosing Problematic Observations

Tools to diagnose data points that may cause trouble:
Visual evidence: scatterplots.
Outliers: studentized residuals.
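Leverage: hat values ($h_i$). Be careful about what $h_i$ is.
In matrix notation: $\hat{y} = X\hat{\beta} = X(X'X)^{-1}X'y = Hy$, where $H = \{h_{ij}\}$ and $h_i = h_{ii}$, the $i$th diagonal element.
Influence: Cook's distance.
$$D_i = \frac{\hat{u}_i^2}{(k + 1)\hat{\sigma}^2} \cdot \frac{h_i}{(1 - h_i)^2}$$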
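All three diagnostics are available in base R. The sketch below uses the built-in `mtcars` data purely as a stand-in (the choice of dataset and model is an assumption for illustration) and also rebuilds the hat values and Cook's distance by hand to show how the formulas map onto R's output.

```r
# mtcars ships with base R and is used here only as a stand-in dataset.
model <- lm(mpg ~ wt, data = mtcars)

rs <- rstudent(model)        # studentized residuals (outliers)
h  <- hatvalues(model)       # hat values h_i, the diagonal of H (leverage)
d  <- cooks.distance(model)  # Cook's distance (influence)

# Hat values by hand from H = X (X'X)^{-1} X'.
X <- model.matrix(model)
H <- X %*% solve(t(X) %*% X) %*% t(X)
all.equal(unname(h), unname(diag(H)))

# Cook's distance by hand: u_i^2 / ((k + 1) * sigma^2) * h_i / (1 - h_i)^2.
u_hat  <- resid(model)
k      <- 1                                   # one slope coefficient
sigma2 <- sum(u_hat^2) / df.residual(model)   # estimated error variance
d_by_hand <- u_hat^2 / ((k + 1) * sigma2) * h / (1 - h)^2
all.equal(unname(d), unname(d_by_hand))

# The three observations with the largest Cook's distance.
head(sort(d, decreasing = TRUE), 3)
```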
Solutions
One option is to remove problematic observations and re-estimate the
model. Be very cautious if you take this approach, though.
Draper & Smith 1998: Outliers may convey information that other
data cannot. Outliers may warrant careful investigation. Only when
traced to recording error should outliers be rejected automatically.
Can you identify why an observation is an outlier, leverage, or
influence point? Such information can usually guide your decision.
For Next Time

Read Gujarati & Porter chapter 7 (Multiple Regression Analysis: The
Problem of Estimation).
Load Agresti & Finlay's data on crime in the states
R: library(foreign); crime<-read.dta("http://www.ats.ucla.edu/stat/stata/webbooks/reg/crime.dta")
Stata: use http://www.ats.ucla.edu/stat/stata/webbooks/reg/crime.dta
View the scatterplot with crime on the vertical axis and poverty on
the horizontal axis. What stands out? What is the best functional
form the model could take?
Do one of two things: (A) Estimate a model that is not linear in the
variables. (B) Estimate a linear-in-variables model, reporting both
unstandardized and standardized coefficients.
Diagnose your data for outliers, leverage, and influential data points.
Do you think it is fair to remove any observations? Why or why not?
If so, re-estimate your model without the influential observation(s).
Present the results of your model in a journal-worthy table.
Graph the functional form of your model with the real data on a
scatterplot.