
OLS Data Analysis in R
Dino Christenson & Scott Powell
Ohio State University
November 20, 2007

Introduction to R Outline
I. Data Description
II. Data Analysis
   i. Command functions
   ii. Hand-rolling
III. OLS Diagnostics & Graphing
IV. Functions and loops
V. Moving forward

11/20/2007

Christenson & Powell: Intro to R

Data Analysis: Descriptive Stats
R has several built-in commands for describing data.
The list() command can output all elements of an object.

Data Analysis: Descriptive Stats
The summary() command can be used to describe all variables contained within a data frame.
The summary() command can also be used with individual variables.
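A minimal sketch of both uses of summary(), assuming a small simulated data frame. The name dat and its values are hypothetical stand-ins, not the presenters' data.

```r
# Hypothetical data standing in for the slides' data set (22 observations)
set.seed(1)
dat <- data.frame(
  income   = runif(22, 30, 50),   # assumed: income in thousands
  presvote = runif(22, 35, 65)
)

# Describe every variable in the data frame at once
summary(dat)

# Describe a single variable
inc.summary <- summary(dat$income)
inc.summary
```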

Data Analysis: Descriptive Stats
Simple plots can also provide familiarity with the data.
The hist() command produces a histogram for any given data values.

Data Analysis: Descriptive Stats
Simple plots can also provide familiarity with the data.
The plot() command can produce both univariate and bivariate plots for any given objects.

Data Analysis: Descriptive Stats
Other Useful Commands

sum
mean
var
sd
range

min
max
median
di
cor
summary

Data Analysis: Regression
As mentioned above, one of the big perks of using R is flexibility.
R comes with its own canned linear regression command:
lm(y ~ x)
However, we're going to use R to make our own OLS estimator. Then we will compare with the canned procedure, as well as Stata.

Data Analysis: Regression
First, let's take a look at our code for the hand-rolled OLS estimator.
The Holy Grail: $(X'X)^{-1} X'Y$
We need a single matrix of independent variables.
The cbind() command takes the individual variable vectors and combines them into one x variable matrix.
A 1 is included as the first element to account for the constant.
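The cbind() step might look like the sketch below. The variable names follow the model used later in the slides, but the data are simulated here as an assumption (income is taken in thousands), since the original data set is not part of this text.

```r
# Hypothetical data: simulated stand-ins for the slides' variables
set.seed(42)
n <- 22
income   <- runif(n, 30, 50)    # assumed: income in thousands
presvote <- runif(n, 35, 65)
pressup  <- runif(n, 65, 95)
repvshr  <- 5 + 0.4 * income + 0.5 * presvote + 0.1 * pressup + rnorm(n, sd = 8)

# Combine the constant and the independent variables into one matrix
x <- cbind(1, income, presvote, pressup)
y <- repvshr

dim(x)      # rows = observations, columns = constant + regressors
head(x, 3)  # the first column is the constant
```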

Data Analysis: Regression
With the x and y matrices complete, we can now manipulate them to produce coefficients.
After performing the divine multiplication, we can observe the estimates by entering the object name (in this case b).
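A sketch of that multiplication, written directly from $(X'X)^{-1} X'Y$. The data are simulated (an assumption, as the slides' data are not reproduced here); only the object names x, y and b come from the slides.

```r
# Hypothetical data: simulated stand-ins for the slides' variables
set.seed(42)
n <- 22
income   <- runif(n, 30, 50)    # assumed: income in thousands
presvote <- runif(n, 35, 65)
pressup  <- runif(n, 65, 95)
repvshr  <- 5 + 0.4 * income + 0.5 * presvote + 0.1 * pressup + rnorm(n, sd = 8)

x <- cbind(1, income, presvote, pressup)
y <- repvshr

# (X'X)^-1 X'Y: t() transposes, %*% multiplies matrices, solve() inverts
b <- solve(t(x) %*% x) %*% t(x) %*% y

# Entering the object name prints the estimates
b
```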


Data Analysis: Regression
To find the standard errors, we need to compute both the variance of the residuals and the cov matrix of the xs.
The sqrt of the diagonal elements of this var-cov matrix will give us the standard errors.
Other test statistics can be easily computed.
View the standard errors.
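One way these steps could be written, assuming the usual residual variance with n minus the number of columns of x as degrees of freedom. The data are simulated and the names e, s2, vcv, se and t.stat are illustrative, not taken from the slides.

```r
# Hypothetical data: simulated stand-ins for the slides' variables
set.seed(42)
n <- 22
income   <- runif(n, 30, 50)    # assumed: income in thousands
presvote <- runif(n, 35, 65)
pressup  <- runif(n, 65, 95)
repvshr  <- 5 + 0.4 * income + 0.5 * presvote + 0.1 * pressup + rnorm(n, sd = 8)

x <- cbind(1, income, presvote, pressup)
y <- repvshr
b <- solve(t(x) %*% x) %*% t(x) %*% y

e   <- y - x %*% b                         # residuals
s2  <- sum(e^2) / (nrow(x) - ncol(x))      # variance of the residuals
vcv <- s2 * solve(t(x) %*% x)              # var-cov matrix of the estimates
se  <- sqrt(diag(vcv))                     # standard errors

t.stat <- b / se                           # another easily computed statistic

# View the standard errors
se
```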


Data Analysis: Regression
Time to Compare
Use the lm() command to estimate the model using R's canned procedure.
As we can see, the estimates are very similar.
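A sketch of the comparison, placing the hand-rolled and canned estimates side by side. The data are simulated (an assumption), so the numbers will not match the slides' output; the model formula and the name ols.model1 follow the slides.

```r
# Hypothetical data: simulated stand-ins for the slides' variables
set.seed(42)
n <- 22
income   <- runif(n, 30, 50)    # assumed: income in thousands
presvote <- runif(n, 35, 65)
pressup  <- runif(n, 65, 95)
repvshr  <- 5 + 0.4 * income + 0.5 * presvote + 0.1 * pressup + rnorm(n, sd = 8)

# Hand-rolled estimator
x <- cbind(1, income, presvote, pressup)
b <- solve(t(x) %*% x) %*% t(x) %*% repvshr

# Canned procedure
ols.model1 <- lm(repvshr ~ income + presvote + pressup)

# Side-by-side comparison of the two sets of estimates
comparison <- cbind(hand.rolled = as.vector(b), canned = coef(ols.model1))
comparison
```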

Data Analysis: Regression
Time to Compare
We can also see how both the hand-rolled and canned OLS procedures stack up to Stata.
Use the reg command to estimate the model.
As we can see, the estimates are once again very similar.

Data Analysis: Regression
Other Useful Commands
lm - Linear Model
lme - Mixed Effects
glm - Generalized lm
multinom - Multinomial Logit
anova
optim - General Optimizer

OLS Diagnostics in R
Post-estimation diagnostics are key to data analysis.
We want to make sure we estimated the proper model.
Besides, Irfan will hurt you if you neglect to do this.
Furthermore, diagnostics allow us the opportunity to show off some of R's graphs.
R's real strength is that it has virtually unlimited graphing capabilities.
Of course, such strength on R's part is dependent on your knowledge of both R and statistics.
Still, with just some basics we can do some cool graphs.

OLS Diagnostics in R
What could be unjustifiably driving our data?
Outlier: unusual observation
Leverage: ability to change the slope of the regression line
Influence: the combined impact of strong leverage and outlier status
According to John Fox, influence = leverage * outliers

OLS Diagnostics: Leverage
Recall our ols model:
ols.model1 <- lm(formula = repvshr ~ income + presvote + pressup)
Our measure of leverage is the $h_i$ or hat value.
It's just the predicted values written in terms of $h_i$,
where $h_{ij}$ is the contribution of observation $Y_i$ to the fitted value $\hat{Y}_j$.
If $h_{ij}$ is large, then the ith observation has a significant impact on the jth fitted value.
So, skipping the formulas, we know that the larger the hat value, the greater the leverage of that observation.

OLS Diagnostics: Leverage
Find the hat values:
hatvalues(ols.model1)
Calculate the average hat value:
avg.mod1 <- ncol(x)/nrow(x)
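To see where these numbers come from, the sketch below builds the hat matrix by hand and checks it against hatvalues(). It assumes simulated data in place of the slides' data; H and h are illustrative names. The diagonal of the hat matrix sums to the number of columns of x, which is why the average hat value is ncol(x)/nrow(x).

```r
# Hypothetical data: simulated stand-ins for the slides' variables
set.seed(42)
n <- 22
income   <- runif(n, 30, 50)    # assumed: income in thousands
presvote <- runif(n, 35, 65)
pressup  <- runif(n, 65, 95)
repvshr  <- 5 + 0.4 * income + 0.5 * presvote + 0.1 * pressup + rnorm(n, sd = 8)

x <- cbind(1, income, presvote, pressup)
ols.model1 <- lm(repvshr ~ income + presvote + pressup)

# Hat matrix H = X (X'X)^-1 X'; its diagonal holds the hat values
H <- x %*% solve(t(x) %*% x) %*% t(x)
h <- diag(H)

# Average hat value, as on the slide
avg.mod1 <- ncol(x) / nrow(x)

# Observations above twice the average hat value
which(h > 2 * avg.mod1)
```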

OLS Diagnostics: Leverage
But a picture is worth a hundred numbers?
Graph the hat values with lines for the average, twice the avg (large samples) and three times the avg (small samples) hat values:
plot(hatvalues(ols.model1))
abline(h=1*(ncol(x))/nrow(x))
abline(h=2*(ncol(x))/nrow(x))
abline(h=3*(ncol(x))/nrow(x))
identify(hatvalues(ols.model1))
identify() lets us select the data points in the new graph.
State #2 is over twice the avg.
Nothing above three times.
[Figure: index plot of hatvalues(ols.model1) against Index, with horizontal reference lines and labeled points]

OLS Diagnostics: Outliers
Can we find any data points that are unusual for Y given the Xs?
Use studentized residuals:
$u_i^* = \frac{u_i}{\sigma_{u(-i)} \sqrt{1 - h_i}}$
We can see whether there is a significant change in the model.
If their absolute values are larger than 2, then the corresponding observations are likely to be outliers.
rstudent(ols.model1)
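As a rough check on the formula, the studentized residuals can be computed by hand and compared with rstudent(). This sketch assumes simulated data; it also uses the standard shortcut for the leave-one-out residual variance rather than refitting the model 22 times. The names u, h, s2.del and u.star are illustrative.

```r
# Hypothetical data: simulated stand-ins for the slides' variables
set.seed(42)
n <- 22
income   <- runif(n, 30, 50)    # assumed: income in thousands
presvote <- runif(n, 35, 65)
pressup  <- runif(n, 65, 95)
repvshr  <- 5 + 0.4 * income + 0.5 * presvote + 0.1 * pressup + rnorm(n, sd = 8)
ols.model1 <- lm(repvshr ~ income + presvote + pressup)

u <- resid(ols.model1)          # residuals u_i
h <- hatvalues(ols.model1)      # hat values h_i
p <- length(coef(ols.model1))   # number of coefficients, constant included

# Residual variance with observation i deleted
s2     <- sum(u^2) / (n - p)
s2.del <- ((n - p) * s2 - u^2 / (1 - h)) / (n - p - 1)

# Studentized residuals: u_i / (sigma_(-i) * sqrt(1 - h_i))
u.star <- u / sqrt(s2.del * (1 - h))

# Candidate outliers under the |value| > 2 rule of thumb
which(abs(u.star) > 2)
```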


OLS Diagnostics: Outliers
Again, let's plot them with lines for 2 & -2.
States 2 and 3 appear to be outliers, or darn close.
We should definitely take a look at what makes these states unusual.
Perhaps there is a mistake in data entry.
Perhaps the model is misspecified in terms of functional form (forthcoming) or omitted vars.
Maybe you can throw out your bad observation.
If you must include the bad observation, try robust regression.
[Figure: index plot of rstudent(ols.model1) against Index]

OLS Diagnostics: Influence
Cook's D gives a kind of summary for each observation's influence:
$D_i = \frac{u_i'^2}{k + 1} \times \frac{h_i}{1 - h_i}$
If Cook's D is greater than 4/(n-k-1), then the observation is said to exert undue influence.
Let's just plot it:
plot(cookd(ols.model1))
abline(h=4/(nrow(x)-ncol(x)))
identify(cookd(ols.model1))
States 2 and (maybe) 3 are in the trouble zone.
[Figure: index plot of cookd(ols.model1) against Index]
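The Cook's D formula can be sketched by hand as follows, taking $u_i'$ to be the standardized residual. The data are simulated (an assumption). The sketch uses base R's cooks.distance() for the check, since cookd() belongs to older versions of the car package and may not be available in a current installation.

```r
# Hypothetical data: simulated stand-ins for the slides' variables
set.seed(42)
n <- 22
income   <- runif(n, 30, 50)    # assumed: income in thousands
presvote <- runif(n, 35, 65)
pressup  <- runif(n, 65, 95)
repvshr  <- 5 + 0.4 * income + 0.5 * presvote + 0.1 * pressup + rnorm(n, sd = 8)
ols.model1 <- lm(repvshr ~ income + presvote + pressup)

h <- hatvalues(ols.model1)
p <- length(coef(ols.model1))   # k + 1: regressors plus the constant

# Cook's D: squared standardized residual over (k + 1), times h / (1 - h)
d.hand <- rstandard(ols.model1)^2 / p * h / (1 - h)

# Rule-of-thumb cutoff 4 / (n - k - 1)
cutoff <- 4 / (n - p)
which(d.hand > cutoff)

# Plot only when a graphics device is available interactively
if (interactive()) {
  plot(d.hand)
  abline(h = cutoff)
}
```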

OLS Diagnostics: Influence
For a host of measures of influence, including dfbetas and dffits:
influence.measures(ols.model1)
dfbeta gives the influence of an observation on the coefficients, or the change in an iv's coefficient caused by deleting a single observation.
Simple commands for partial regression plots can be found on Fox's website.

OLS Diagnostics: Normality
Is our data distributed normally?
Was it correct to use a linear model?
Use a quantile plot (qq plot) to check:
qq.plot(ols.model1, distribution="norm")
Plots the empirical quantiles of the studentized residuals against the quantiles of the normal distribution.
Looking for obs on a straight line.
In R it is simple to plot the error bands as well.
Deviation requires us to transform our variables.
The problems are again 2 and 13, with 3, 22 and 14 bordering on trouble this time around.
[Figure: qq plot of Studentized Residuals(ols.model1) against norm Quantiles]

OLS Diagnostics: Normality
A simple density plot of the studentized residuals helps to determine the nature of our data.
The apparent deviation from the normal curve is not severe, but there certainly seems to be a slight negative skew.
[Figure: density.default(x = rstudent(ols.model1)); y axis Density; N = 22, Bandwidth = 0.4217]

OLS Diagnostics: Error Variance
We can also easily look for heteroskedasticity.
Plotting the residuals against the fitted values and the continuous independent variables lets us examine our statistical model for the presence of unbalanced error variance:
par(mfrow=c(2,2))
plot(resid(ols.model1)~fitted.values(ols.model1))
plot(resid(ols.model1)~income)
plot(resid(ols.model1)~presvote)
plot(resid(ols.model1)~pressup)
[Figure: four panels of resid(ols.model1) plotted against fitted.values(ols.model1), income, presvote and pressup]

OLS Diagnostics: Error Variance
Formal tests for heteroskedasticity are available from the lmtest library:
library(lmtest)
bptest(ols.model1) will give you the Breusch-Pagan test stat
gqtest(ols.model1) will give you the Goldfeld-Quandt test stat
hmctest(ols.model1) will give you the Harrison-McCabe test stat
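In the hand-rolled spirit of the earlier slides, the idea behind the Breusch-Pagan test can be sketched in base R. This is the studentized (Koenker) form of the statistic: n times the R-squared from regressing the squared residuals on the regressors. The data are simulated and the names u2, aux, bp and p.value are illustrative; it is a sketch of the idea, not a replacement for lmtest.

```r
# Hypothetical data: simulated stand-ins for the slides' variables
set.seed(42)
n <- 22
income   <- runif(n, 30, 50)    # assumed: income in thousands
presvote <- runif(n, 35, 65)
pressup  <- runif(n, 65, 95)
repvshr  <- 5 + 0.4 * income + 0.5 * presvote + 0.1 * pressup + rnorm(n, sd = 8)
ols.model1 <- lm(repvshr ~ income + presvote + pressup)

# Auxiliary regression: squared residuals on the same regressors
u2  <- resid(ols.model1)^2
aux <- lm(u2 ~ income + presvote + pressup)

# Test statistic: n * R^2, chi-squared with one df per regressor
bp      <- n * summary(aux)$r.squared
p.value <- pchisq(bp, df = 3, lower.tail = FALSE)

c(statistic = bp, p.value = p.value)
```

A small p-value would point toward error variance that changes with the regressors.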


OLS Diagnostics: Collinearity
Finally, let's look out for collinearity.
To get the variance inflation factors:
vif(ols.model1)
Let's look at the condition index from the perturb library:
library(perturb)
colldiag(ols.model1)
At issue here is the largest condition index.
If it is larger than 30, Houston, we have...
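For intuition about what a variance inflation factor measures, a hand-rolled sketch follows: each regressor is regressed on the others, and the VIF is 1/(1 - R-squared) from that regression. The data are simulated and the names X and vif.hand are illustrative.

```r
# Hypothetical data: simulated stand-ins for the slides' regressors
set.seed(42)
n <- 22
income   <- runif(n, 30, 50)    # assumed: income in thousands
presvote <- runif(n, 35, 65)
pressup  <- runif(n, 65, 95)
X <- data.frame(income, presvote, pressup)

# VIF for each regressor: 1 / (1 - R^2) from regressing it on the rest
vif.hand <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})

vif.hand
```

Values near 1 indicate that a regressor is nearly unrelated to the others.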

OLS Diagnostics: Shortcut
My favorite shortcut command to get you four essential diagnostic plots after you run your model:
plot(ols.model1, which=1:4)
Now you have no excuse not to run some diagnostics!
Btw, look at the high residuals in the rvf plot for 14, 13 and 3, suggesting outliers.
[Figure: four panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, Cook's distance]

The Final Act: Loops and Functions
As was mentioned above, R's biggest asset is its flexibility. Loops and functions directly utilize this asset.
Loops can be implemented for a number of purposes, essentially when repeated actions are needed (e.g. simulations).
Functions allow us to create our own commands. This is especially useful when a canned procedure does not exist.
We will create our own OLS function with the hand-rolled code used earlier.

Loops
for loops are the most common and the only type of loop we will look at today.
The first loop command at the right shows simple loop iteration.
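The loop shown on the original slide is not reproduced in this text, so the following is only an illustrative guess at a simple iteration. The name squares is hypothetical.

```r
# A simple for loop: i takes the values 1 through 5 in turn
squares <- numeric(5)
for (i in 1:5) {
  print(i)              # show the current value of the index
  squares[i] <- i^2     # store something computed from it
}
squares
```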

Loops
However, we can also see how loops can be a little more useful.
The second example at right (although inefficient) calculates the mean of income.
Note how the index accesses elements of the income vector.
Loops and Monte Carlo
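A sketch of what such a loop might look like, assuming a simulated income vector; the names total and loop.mean are illustrative. The built-in mean() does the same job in one call, which is why the loop is described as inefficient.

```r
# Hypothetical income vector standing in for the slides' data
set.seed(42)
income <- runif(22, 30, 50)    # assumed: income in thousands

# Add up the elements one at a time; income[i] is the i-th element
total <- 0
for (i in seq_along(income)) {
  total <- total + income[i]
}
loop.mean <- total / length(income)

loop.mean
mean(income)   # the canned equivalent
```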


Functions
Now we will make our own linear regression function using our hand-rolled OLS code.
Functions require inputs (which are the objects to be utilized) and arguments (which are the commands that the function performs).
The actual estimation procedure does not change. However, some changes are made.

Functions
First, we have to tell R that we are creating a function. We'll name it ols.
This lets us generalize the procedure to multiple objects.
Second, we have to tell the function what we want returned or what we want the output to look like.
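The presenters' own function is not reproduced in this text, so the following is a sketch of how an ols function could wrap the hand-rolled steps. The argument names, the choice to add the constant inside the function, and the layout of the returned table are all assumptions; the data are simulated.

```r
# Hand-rolled OLS wrapped in a function (a sketch, not the slides' code)
ols <- function(y, x) {
  x  <- cbind("(Intercept)" = 1, x)             # add the constant
  b  <- solve(t(x) %*% x) %*% t(x) %*% y        # (X'X)^-1 X'Y
  e  <- y - x %*% b                             # residuals
  s2 <- sum(e^2) / (nrow(x) - ncol(x))          # residual variance
  se <- sqrt(diag(s2 * solve(t(x) %*% x)))      # standard errors
  output <- cbind(b, se, b / se)                # what we want returned
  colnames(output) <- c("Estimate", "Std. Error", "t value")
  return(output)
}

# Hypothetical data: simulated stand-ins for the slides' variables
set.seed(42)
n <- 22
income   <- runif(n, 30, 50)    # assumed: income in thousands
presvote <- runif(n, 35, 65)
pressup  <- runif(n, 65, 95)
repvshr  <- 5 + 0.4 * income + 0.5 * presvote + 0.1 * pressup + rnorm(n, sd = 8)

result <- ols(repvshr, cbind(income, presvote, pressup))
result
```

Because the inputs are function arguments, the same function can be reused with any other y vector and x matrix.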


Functions
OLS: Hand-rolled vs Function

Functions
Implementing our new function ols, we get precisely the output that we asked for.
We can check this against the results produced by the standard lm function.


Favorite Resources
Invaluable Resources online
The R manuals http://cran.r-project.org/manuals.html
Fox's slides http://socserv.mcmaster.ca/jfox/Courses/Rcourse/index.html
Faraway's book http://cran.r-project.org/doc/contrib/Faraway-PRA.pdf
Andersen's ICPSR lectures using R http://socserv.mcmaster.ca/andersen/icpsr.html
Arai's guide http://people.su.se/~ma/R_intro/
UCLA notes http://www.ats.ucla.edu/stat/SPLUS/default.htm
Keele's intro guide http://www.polisci.ohio-state.edu/faculty/lkeele/RIntro.pdf

Great R books
Verzani's book http://www.amazon.com/UsingIntroductoryStatisticsJohnVerzani/dp/1584884509
Maindonald and Braun's book http://www.amazon.com/DataAnalysisGraphicsUsingR/dp/0521813360
