Introduction to R
2013
Working directory
Introduction to R (Prof. Henrique Castro, FEA-USP)
Installing and loading R packages
Install packages
install.packages("foreign")
Loading packages
library(foreign)
Importing data
Exercise
• Create a CSV file and import it.
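One possible way to work through the exercise, as a minimal sketch. The file is written to a temporary path, and the column names (id, score) are made up for illustration.

```r
# Hypothetical example data; the column names are illustrative only
scores <- data.frame(id = 1:3, score = c(7.5, 8.0, 6.5))

# Write the data frame to a temporary CSV file (without a row-names column)
csv_path <- tempfile(fileext = ".csv")
write.csv(scores, csv_path, row.names = FALSE)

# Import it again; header = TRUE reads the first row as column names
mydata <- read.csv(csv_path, header = TRUE)
print(mydata)
```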
From Excel
• The best way to read an Excel file is to export it to a comma
delimited file and import it using the method above.
• On Windows systems you can use the RODBC package to access
Excel files.
• The first row should contain variable/column names.
library(RODBC)
channel <- odbcConnectExcel("c:/myexel.xls")
mydata <- sqlFetch(channel, "mysheet")
odbcClose(channel)
From SPSS
# save SPSS dataset in transport format
get file='c:\mydata.sav'.
export outfile='c:\mydata.por'.
# in R
library(Hmisc)
mydata <- spss.get("c:/mydata.por", use.value.labels=TRUE)
# last option converts value labels to R factors
From Stata
# input Stata file
library(foreign)
mydata <- read.dta("c:/mydata.dta")
Getting Information on a Dataset
# dimensions of an object
dim(object)
# print mydata
mydata
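A few other base R functions are commonly used for a first look at a dataset. The sketch below uses the built-in mtcars data as a stand-in for mydata, which is an assumption made only so that the example runs on its own.

```r
# Built-in mtcars stands in for mydata here (an assumption for illustration)
mydata <- mtcars

dim(mydata)      # number of rows and columns
names(mydata)    # variable names
str(mydata)      # structure: type and first values of each variable
head(mydata, 3)  # first 3 rows
```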
Selecting Elements
Vectors and Data Frames
Missing Data
Function complete.cases()
> complete.cases(mydata)
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[13] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE
[25] TRUE
Function na.omit()
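The two functions are related: na.omit() keeps exactly the rows for which complete.cases() is TRUE. A small sketch on hypothetical data:

```r
# Small illustrative data frame with missing values (hypothetical data)
df <- data.frame(a = c(1, NA, 3, 4), b = c("x", "y", NA, "w"))

ok <- complete.cases(df)   # TRUE for rows with no NA in any column
ok

clean <- na.omit(df)       # keeps only the complete rows
nrow(clean)

colSums(is.na(df))         # number of missing values per column
```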
Data Management
Recoding variables
> mydata$x1cat <- ifelse(mydata$x1 > 0.5, "high", "low")
> head(mydata)
          y        x1         x2       sum      mean x1cat
1 0.8122147 0.4219462 0.02138096 0.4433272 0.2216636   low
2 0.6111948 0.3738518 0.53024983 0.9041016 0.4520508   low
3 0.3270520 0.2195572 0.81832713 1.0378844 0.5189422   low
4 0.9560805 0.2477098 0.68361351 0.9313233 0.4656617   low
5 0.9247670 0.2061416 0.57439575 0.7805373 0.3902687   low
6 0.8935983 0.6727692 0.65139071 1.3241599 0.6620799  high
Merging Data
> x <- data.frame(1:10, rep(1))
> names(x) <- c("id", "x1")
> head(x, 2)
  id x1
1  1  1
2  2  1
> y <- data.frame(1:10, rep(2))
> names(y) <- c("id", "y1")
> head(y, 2)
  id y1
1  1  2
2  2  2
> dataxy <- merge(x, y, by="id")
> head(dataxy)
  id x1 y1
1  1  1  2
2  2  1  2
3  3  1  2
4  4  1  2
5  5  1  2
6  6  1  2
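In the example above every id appears in both data frames. When the keys only partly overlap, merge() by default keeps only the matching ids; the all, all.x and all.y arguments change that. A sketch with hypothetical data:

```r
# Hypothetical data frames whose ids overlap only partly
left  <- data.frame(id = 1:4, x1 = c(10, 20, 30, 40))
right <- data.frame(id = 3:6, y1 = c("a", "b", "c", "d"))

inner     <- merge(left, right, by = "id")                # ids 3 and 4 only
outer     <- merge(left, right, by = "id", all = TRUE)    # ids 1 to 6, NA where unmatched
keep_left <- merge(left, right, by = "id", all.x = TRUE)  # every row of left

outer
```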
Reshaping Data
> d1 <- data.frame(subject = c("1", "2"),
+ x0 = c("male", "female"),
+ x1_2000 = 1:2,
+ x1_2005 = 5:6,
+ x2_2000 = 1:2,
+ x2_2005 = 5:6
+ )
> d1
  subject     x0 x1_2000 x1_2005 x2_2000 x2_2005
1       1   male       1       5       1       5
2       2 female       2       6       2       6
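The reshaping step itself is not preserved on the slide. One way to turn d1 from wide to long format is base R's reshape(), sketched below; the choice of function and the new variable names (year, x1, x2) are assumptions.

```r
d1 <- data.frame(subject = c("1", "2"),
                 x0 = c("male", "female"),
                 x1_2000 = 1:2, x1_2005 = 5:6,
                 x2_2000 = 1:2, x2_2005 = 5:6)

# Wide to long: one row per subject and year
d2 <- reshape(d1, direction = "long",
              idvar   = "subject",
              varying = list(c("x1_2000", "x1_2005"),
                             c("x2_2000", "x2_2005")),
              v.names = c("x1", "x2"),   # names of the stacked variables
              timevar = "year",          # name of the new time variable
              times   = c(2000, 2005))
d2
```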
Subsetting Data
Selecting Observations
# first 5 observations
> newdata <- mydata[1:5,]
> newdata
          y        x1         x2       sum      mean x1cat
1 0.8122147 0.4219462 0.02138096 0.4433272 0.2216636   Low
2 0.6111948 0.3738518 0.53024983 0.9041016 0.4520508   Low
3 0.3270520 0.2195572 0.81832713 1.0378844 0.5189422   Low
4 0.9560805 0.2477098 0.68361351 0.9313233 0.4656617   Low
5 0.9247670 0.2061416 0.57439575 0.7805373 0.3902687   Low
Random Samples
> set.seed(1)
> x <- 1:25
> sample(x, 7, replace = FALSE)
[1] 7 9 14 20 5 18 22
> set.seed(1)
> mysample <- mydata[sample(1:nrow(mydata), 7, replace=FALSE),]
> mysample
            y        x1         x2       sum      mean x1cat
7  0.59623223 0.7039585 0.77752234 1.4814808 0.7407404  High
9  0.05417394 0.6776613 0.99746985 1.6751312 0.8375656  High
14 0.02172214 0.2773447 0.02295936 0.3003041 0.1501520   Low
20 0.61295952 0.8376189 0.40297945 1.2405984 0.6202992  High
5  0.92476700 0.2061416 0.57439575 0.7805373 0.3902687   Low
18 0.34776470 0.1576324 0.56327819 0.7209106 0.3604553   Low
22 0.48290474 0.1548488 0.40990097 0.5647498 0.2823749   Low
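Setting the seed before each call is what makes the two draws above pick the same rows. The sketch below checks this on hypothetical data. Note that R 3.6.0 changed the default sampling algorithm, so the exact indices shown above may not be reproduced by a current R version.

```r
df <- data.frame(id = 1:25, value = (1:25)^2)  # hypothetical data

set.seed(123)
rows_a <- sample(nrow(df), 7)   # 7 row indices, without replacement by default
set.seed(123)
rows_b <- sample(nrow(df), 7)   # same seed, same draw

identical(rows_a, rows_b)       # TRUE
mysample <- df[rows_a, ]        # the sampled rows
```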
Numeric Functions
Statistical Functions
Function         Description
sd(x)            standard deviation of object x; see also var(x) for the
                 variance and mad(x) for the median absolute deviation
median(x)        median
range(x)         range
sum(x)           sum
diff(x, lag=1)   lagged differences, with lag indicating which lag to use
min(x)           minimum
max(x)           maximum
Summary function
> # mean, median, 1st and 3rd quartiles, min, max
> summary(mydata)
       y                x1                x2               sum
 Min.   :0.02172   Min.   :0.03389   Min.   :0.02138   Min.   :0.3003
 1st Qu.:0.32603   1st Qu.:0.16459   1st Qu.:0.38283   1st Qu.:0.5796
 Median :0.54611   Median :0.32560   Median :0.56884   Median :0.9313
 Mean   :0.50232   Mean   :0.37177   Mean   :0.55256   Mean   :0.9200
 3rd Qu.:0.67414   3rd Qu.:0.57398   3rd Qu.:0.79498   3rd Qu.:1.1864
 Max.   :0.95608   Max.   :0.83762   Max.   :0.99747   Max.   :1.6751
                   NA's   :1         NA's   :1         NA's   :2
      mean            x1cat
 Min.   :0.1502   Length:25
 1st Qu.:0.2898   Class :character
 Median :0.4657   Mode  :character
 Mean   :0.4600
 3rd Qu.:0.5932
 Max.   :0.8376
 NA's   :2
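Since mydata contains missing values, note that most of the statistical functions listed above return NA unless na.rm = TRUE is given. A short sketch on a hypothetical vector:

```r
x <- c(2, 4, 4, 5, NA, 9)   # hypothetical values with one missing

mean(x)                     # NA, because of the missing value
mean(x, na.rm = TRUE)       # 4.8
median(x, na.rm = TRUE)     # 4
sd(x, na.rm = TRUE)         # standard deviation of the 5 observed values
range(x, na.rm = TRUE)      # 2 9
diff(c(1, 4, 9, 16), lag = 1)   # 3 5 7
```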
Descriptive Statistics
> library(pastecs)
> stat.desc(mydata)
y x1 x2 sum mean x1cat
nbr.val 25.00000000 24.00000000 24.00000000 23.00000000 23.00000000 NA
nbr.null 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 NA
nbr.na 0.00000000 1.00000000 1.00000000 2.00000000 2.00000000 NA
min 0.02172214 0.03388894 0.02138096 0.30030407 0.15015204 NA
max 0.95608055 0.83761895 0.99746985 1.67513117 0.83756558 NA
range 0.93435840 0.80373000 0.97608890 1.37482709 0.68741355 NA
sum 12.55789913 8.92258074 13.26137199 21.16102961 10.58051480 NA
median 0.54610883 0.32559825 0.56883697 0.93132330 0.46566165 NA
mean 0.50231597 0.37177420 0.55255717 0.92004477 0.46002238 NA
SE.mean 0.05751775 0.05243628 0.06117871 0.07890179 0.03945090 NA
CI.mean.0.95 0.11871081 0.10847271 0.12655781 0.16363231 0.08181615 NA
var 0.08270730 0.06598952 0.08982804 0.14318634 0.03579659 NA
std.dev 0.28758876 0.25688425 0.29971327 0.37839971 0.18919986 NA
coef.var 0.57252563 0.69096848 0.54241133 0.41128402 0.41128402 NA
> library(psych)
> describe(mydata[c(-6)])
var n mean sd median trimmed mad min max range skew kurtosis se
y 1 25 0.50 0.29 0.55 0.51 0.33 0.02 0.96 0.93 -0.11 -1.17 0.06
x1 2 24 0.37 0.26 0.33 0.36 0.26 0.03 0.84 0.80 0.39 -1.31 0.05
x2 3 24 0.55 0.30 0.57 0.56 0.32 0.02 1.00 0.98 -0.21 -1.13 0.06
sum 4 23 0.92 0.38 0.93 0.91 0.50 0.30 1.68 1.37 0.13 -1.08 0.08
mean 5 23 0.46 0.19 0.47 0.45 0.25 0.15 0.84 0.69 0.13 -1.08 0.04
Creating a Graph
# Creating a Graph
attach(mtcars)
plot(mpg~wt, xlab="Car weight", ylab="Miles per gallon")
title("Regression of MPG on Weight")
[Figure: scatterplot of miles per gallon against car weight]
Saving Graphs
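The body of the "Saving Graphs" item is not preserved here. One common approach, sketched below as an assumption about what was shown, is to open a file device, draw the plot, and close the device. The output path is a temporary file chosen for illustration; png() and jpeg() are used in the same way.

```r
out_file <- tempfile(fileext = ".pdf")   # illustrative output path

pdf(out_file, width = 6, height = 4)     # open a PDF graphics device
plot(mpg ~ wt, data = mtcars,
     xlab = "Car weight", ylab = "Miles per gallon")
title("Regression of MPG on Weight")
dev.off()                                # close the device to write the file

file.exists(out_file)
```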
• You can create histograms with the function hist(x), where x is
a numeric vector of values to be plotted.
• The option freq=FALSE plots probability densities instead of
frequencies.
• The option breaks= controls the number of bins.
# Simple Histogram
hist(mtcars$mpg)
[Figure: histogram of mtcars$mpg; y-axis: Frequency]
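The two options can be combined. In the sketch below the number of bins (12) and the labels are illustrative choices; R treats a single number given to breaks= as a suggestion and may pick a nearby "pretty" value.

```r
# Roughly 12 bins, shown as densities rather than counts
hist(mtcars$mpg, breaks = 12, freq = FALSE,
     main = "Histogram of mpg", xlab = "Miles per gallon")

# The same binning can be inspected without drawing anything
h <- hist(mtcars$mpg, breaks = 12, plot = FALSE)
h$breaks        # bin boundaries actually used
sum(h$counts)   # 32, one count per car
```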
[Figure: a second histogram of mtcars$mpg, drawn with more bins; y-axis: Frequency]
dotchart(mtcars$mpg,labels=row.names(mtcars),cex=.7,
[Figure: dot chart of mtcars$mpg, one labeled point per car model]
color of the groups label.
# Dotplot: Grouped Sorted and Colored
x$cyl <- factor(x$cyl) # it must be a factor
x$color[x$cyl==4] <- "red"
x$color[x$cyl==6] <- "blue"
x$color[x$cyl==8] <- "darkgreen"
dotchart(x$mpg,labels=row.names(x),cex=.7,groups= x$cyl,
main="Gas Mileage for Car Models\ngrouped by cylinder",
[Figure: dot chart of mpg by car model, grouped by number of cylinders]
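The code on this slide is cut off at both ends: the definition of x and the closing arguments of dotchart() are missing. The sketch below fills them in under stated assumptions: x is taken to be mtcars sorted by mpg, and xlab, gcolor and color are plausible choices rather than the original ones.

```r
x <- mtcars[order(mtcars$mpg), ]   # assumed: cars sorted by mpg
x$cyl <- factor(x$cyl)             # the grouping variable must be a factor
x$color <- NA_character_
x$color[x$cyl == 4] <- "red"
x$color[x$cyl == 6] <- "blue"
x$color[x$cyl == 8] <- "darkgreen"

dotchart(x$mpg, labels = row.names(x), cex = 0.7, groups = x$cyl,
         main = "Gas Mileage for Car Models\ngrouped by cylinder",
         xlab = "Miles per gallon",   # assumed axis label
         gcolor = "black",            # color of the group labels
         color = x$color)             # color of each car's point and label
```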
• Bar plots are created with the barplot(height) function, where
height is a vector or matrix.
• If height is a vector, the values determine the heights of the
bars in the plot.
# Simple Bar Plot
counts <- table(mtcars$gear)
barplot(counts, main="Car Distribution",
xlab="Number of Gears")
[Figure: bar plot of car counts by number of gears (3, 4, 5)]
# Simple Horizontal Bar Plot with Added Labels
counts <- table(mtcars$gear)
barplot(counts, main="Car Distribution", horiz=TRUE,
names.arg=c("3 Gears", "4 Gears", "5 Gears"))
[Figure: horizontal bar plot with bars labeled 3 Gears, 4 Gears, 5 Gears]
• If height is a matrix and the option beside=FALSE then each bar
of the plot corresponds to a column of height, with the values in
the column giving the heights of stacked “sub-bars”.
# Stacked Bar Plot with Colors and Legend
counts <- table(mtcars$vs, mtcars$gear)
barplot(counts, main="Car Distribution by Gears and VS",
xlab="Number of Gears", col=c("darkblue","red"),
legend = rownames(counts))
[Figure: stacked bar plot of car counts by number of gears, split by VS (0, 1)]
• If height is a matrix and beside=TRUE, then the values in each
column are juxtaposed rather than stacked.
# Grouped Bar Plot
counts <- table(mtcars$vs, mtcars$gear)
barplot(counts, main="Car Distribution by Gears and VS",
xlab="Number of Gears", col=c("darkblue","red"),
legend = rownames(counts), beside=TRUE)
[Figure: grouped bar plot of car counts by number of gears, split by VS (0, 1)]
Line Charts
• Line charts are created with the function lines(x, y, type=),
where x and y are numeric vectors of (x,y) points to connect.
• type= can take the values in the table.
• The lines( ) function adds information to a graph. It cannot
produce a graph on its own.
• Usually it follows a plot(x, y) command that produces a graph.
• By default, plot( ) plots the (x,y) points. Use the type="n"
option in the plot( ) command to create the graph with axes,
titles, etc., but without plotting the points.
type   description
p      points
l      lines
o      overplotted points and lines
b, c   points (empty if "c") joined by lines
s, S   stair steps
h      histogram-like vertical lines
n      does not produce any points or lines
x <- 1:5 # create some data
y <- x # create some data
# draw empty axes, then add points and lines
plot(x, y, type="n")
lines(x, y, type="o")
[Figure: line chart of y against x with overplotted points and lines]
# plotting symbol and color
par(pch=22, col="red")
# all plots on one page
par(mfrow=c(2,4))
opts = c("p","l","o","b","c","s","S","h")
for(i in 1:length(opts)){
  heading = paste("type=",opts[i])
  plot(x, y, type="n", main=heading)
  lines(x, y, type=opts[i])
}
# return to original setting
par(mfrow=c(1,1))
[Figure: 2 x 4 grid of panels titled type= p, l, o, b, c, s, S, h, showing only the lines]
par(pch=22, col="blue")
par(mfrow=c(2,4))
opts = c("p","l","o","b","c","s","S","h")
for(i in 1:length(opts)){
  heading = paste("type=",opts[i])
  plot(x, y, main=heading)
  lines(x, y, type=opts[i])
}
# return to original setting
par(mfrow=c(1,1))
[Figure: 2 x 4 grid of panels titled type= p, l, o, b, c, s, S, h, with the points drawn under each line]
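The loops above reset only mfrow at the end, so the pch and col settings stay changed for later plots. A self-contained variant, sketched below, saves the previous settings returned by par() and restores all of them; the variable names are illustrative.

```r
x <- 1:5
y <- x
opts <- c("p", "l", "o", "b", "c", "s", "S", "h")

# par() returns the previous settings, so they can be restored afterwards
old_par <- par(mfrow = c(2, 4), pch = 22, col = "red")
for (type_code in opts) {
  plot(x, y, type = "n", main = paste("type=", type_code))  # empty axes
  lines(x, y, type = type_code)                             # add the line
}
par(old_par)   # restores mfrow, pch and col together
```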
Pie Charts
• Pie charts are not recommended in the R documentation, and their
features are somewhat limited.
• The authors recommend bar or dot plots
# Simple Pie Chart
slices <- c(10, 12, 4, 16, 8)
lbls <- c("US", "UK", "Australia", "Germany", "France")
pie(slices, labels = lbls, main="Pie Chart of Countries")
[Figure: pie chart titled "Pie Chart of Countries"]
# Pie Chart with Percentages
slices <- c(10, 12, 4, 16, 8)
lbls <- c("US", "UK", "Australia", "Germany", "France")
pct <- round(slices/sum(slices)*100)
lbls <- paste(lbls, pct) # add percents to labels
lbls <- paste(lbls,"%",sep="") # add % to labels
pie(slices,labels = lbls, col=rainbow(length(lbls)),
main="Pie Chart of Countries")
[Figure: pie chart with percentage labels: US 20%, UK 24%, Australia 8%, Germany 32%, France 16%]
• Boxplots can be created for individual variables or for variables
by group.
• The format is boxplot(x, data=), where x is a formula and data=
denotes the data frame providing the data.
# Boxplot of MPG by Car Cylinders
boxplot(mpg~cyl,data=mtcars, main="Car Mileage Data",
xlab="Number of Cylinders", ylab="Miles Per Gallon")
[Figure: boxplots of miles per gallon for 4, 6, and 8 cylinders]
• The basic function is plot(x, y), where x and y are numeric
vectors of (x,y) points to plot.
# Simple Scatterplot
attach(mtcars)
plot(wt, mpg, main = "Scatterplot Example",
xlab = "Car Weight",
ylab = "Miles Per Gallon", pch=19)
[Figure: "Scatterplot Example" of miles per gallon against car weight]
Add Fit Line to Scatterplot
abline(lm(mpg~wt), col="red")
[Figure: "Scatterplot Example" with the fitted regression line added; x-axis: Car Weight]
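The abline() call above relies on attach(mtcars) from the previous slide. A self-contained version that passes the data explicitly might look like this sketch:

```r
fit <- lm(mpg ~ wt, data = mtcars)   # data= avoids the need for attach()

plot(mtcars$wt, mtcars$mpg, main = "Scatterplot Example",
     xlab = "Car Weight", ylab = "Miles Per Gallon", pch = 19)
abline(fit, col = "red")             # draws intercept + slope * wt

coef(fit)   # negative slope: heavier cars get fewer miles per gallon
```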
Simple Scatterplot
• The scatterplot( ) function in the car package offers many
enhanced features, including fit lines, marginal box plots,
conditioning on a factor, and interactive point identification.
• Each of these features is optional.
# Enhanced Scatterplot of MPG vs. Weight
# by Number of Car Cylinders
library(car)
scatterplot(mpg ~ wt | cyl, data=mtcars, smoother = F,
xlab="Weight of Car", ylab="Miles Per Gallon",
main="Enhanced Scatter Plot",
labels=row.names(mtcars))
[Figure: "Enhanced Scatter Plot" of Miles Per Gallon against Weight of Car, with points grouped by cyl (4, 6, 8)]
Scatterplot Matrices
• There are at least 4 useful functions for creating scatterplot
matrices.
# Basic Scatterplot Matrix
pairs(~mpg+disp+drat+wt,data=mtcars,
main="Simple Scatterplot Matrix")
[Figure: scatterplot matrix of mpg, disp, drat, and wt]
• The car package can condition the scatterplot matrix on a factor,
and optionally include lowess and linear best fit lines, and
boxplot, densities, or histograms in the principal diagonal, as
well as rug plots in the margins of the cells.
# Scatterplot Matrices from the car Package
library(car)
scatterplotMatrix(~mpg+disp+drat+wt|cyl,
data=mtcars, smoother=F, by.group=TRUE,
diagonal = "density")
[Figure: scatterplot matrix of mpg, disp, drat, and wt, grouped by cyl (4, 6, 8)]
3D Scatterplots
• Use the function scatterplot3d(x, y, z).
# 3D Scatterplot
library(scatterplot3d)
attach(mtcars)
scatterplot3d(wt,disp,mpg, main="3D Scatterplot")
[Figure: "3D Scatterplot" of mpg against wt and disp]
library(scatterplot3d)
attach(mtcars)
scatterplot3d(wt,disp,mpg, pch=16,
highlight.3d=TRUE,
type="h", main="3D Scatterplot")
[Figure: "3D Scatterplot" of mpg against wt and disp, with colored points and vertical drop lines]
3D Scatterplots with Coloring and Vertical Drop Lines and Regression Plane
library(scatterplot3d)
attach(mtcars)
s3d <-scatterplot3d(wt,disp,mpg, pch=16,
highlight.3d=TRUE,
type="h", main="3D Scatterplot")
fit <- lm(mpg ~ wt+disp)
s3d$plane3d(fit)
[Figure: "3D Scatterplot" of mpg against wt and disp, with vertical drop lines and the fitted regression plane]
Combining Plots
• With the par( ) function, you can include the option
mfrow=c(nrows, ncols) to create a matrix of nrows x ncols plots
that are filled in by row.
# 4 figures arranged in 2 rows and 2 columns
attach(mtcars)
par(mfrow=c(2,2))
[Figure: 2 x 2 grid of plots, including mpg against wt, disp against wt, and a histogram]
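The plotting calls that fill the four panels are not preserved. The sketch below shows one way to complete the layout; the choice of the four plots is an assumption based on the surviving axis labels.

```r
old_par <- par(mfrow = c(2, 2))   # 2 x 2 grid, filled row by row
plot(mtcars$wt, mtcars$mpg, xlab = "wt", ylab = "mpg")
plot(mtcars$wt, mtcars$disp, xlab = "wt", ylab = "disp")
hist(mtcars$wt, main = "")
boxplot(mtcars$wt, horizontal = TRUE)
par(old_par)                      # back to one plot per page
```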
Hypothesis Testing
Student’s t-Test
• The t.test() function produces a variety of t-tests.
• Unlike most statistical packages, the default assumes unequal variance and
applies the Welch df modification.
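The difference between the default and the equal-variance test shows up in the degrees of freedom. A sketch using the built-in mtcars data (the choice of variables is illustrative):

```r
# Two-sample test of mpg by transmission type (am) in mtcars
welch  <- t.test(mpg ~ am, data = mtcars)                    # default: Welch
pooled <- t.test(mpg ~ am, data = mtcars, var.equal = TRUE)  # classical test

welch$parameter    # fractional degrees of freedom
pooled$parameter   # n1 + n2 - 2 = 30

# One-sample test of a hypothesized mean
one <- t.test(mtcars$mpg, mu = 20)
one$p.value
```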
Multiple (Linear) Regression
Linear Regression
Simple Linear Regression
R code
par(mfrow=c(2,2))
hist(journals$subs, main="")
hist(journals$citeprice, main="")
hist(log(journals$subs), main="")
hist(log(journals$citeprice), main="")
par(mfrow=c(1,1))
[Figure: 2 x 2 grid of histograms of journals$subs, journals$citeprice, log(journals$subs), and log(journals$citeprice)]
• The goal is to estimate the effect of the price per citation on the
number of library subscriptions.
> jmodel<-lm(log(subs)~log(citeprice), data = journals)
> summary(jmodel)
Call:
lm(formula = log(subs) ~ log(citeprice), data = journals)
Residuals:
Min 1Q Median 3Q Max
-2.72478 -0.53609 0.03721 0.46619 1.84808
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.76621 0.05591 85.25 <2e-16 ***
log(citeprice) -0.53305 0.03561 -14.97 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Confidence intervals
• It is good practice to give a measure of error along with every
estimate.
• One way to do this is to provide a confidence interval.
• This is available via the extractor function confint().
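A sketch of confint() in use. Because the journals data are not available here, the example fits an illustrative log-log model to the built-in mtcars data instead.

```r
# Illustrative log-log model on mtcars (not the journals data)
fit <- lm(log(mpg) ~ log(wt), data = mtcars)

ci95 <- confint(fit)                 # 95% intervals by default
ci99 <- confint(fit, level = 0.99)   # a higher level gives wider intervals
ci95
ci99
```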
Prediction
• Often a regression model is used for prediction.
• There are two types of predictions: the prediction of points on the regression
line and the prediction of a new data value.
• The standard errors of predictions for new data take into account both the
uncertainty in the regression line and the variation of the individual points
about the line.
• Thus, the prediction interval is larger than that for prediction of points on the
line.
• The function predict() provides both types of standard errors.
> predict(jmodel, newdata=data.frame(citeprice=2.11), interval = "confidence")
fit lwr upr
1 4.368188 4.247485 4.48889
> predict(jmodel, newdata=data.frame(citeprice=2.11), interval = "prediction")
fit lwr upr
1 4.368188 2.883746 5.852629
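The same comparison can be reproduced on any lm fit. The sketch below uses an illustrative model on the built-in mtcars data and a hypothetical new observation, and checks that the prediction interval is the wider of the two.

```r
fit <- lm(log(mpg) ~ log(wt), data = mtcars)   # illustrative model
new_car <- data.frame(wt = 3)                  # hypothetical new observation

conf_int <- predict(fit, newdata = new_car, interval = "confidence")
pred_int <- predict(fit, newdata = new_car, interval = "prediction")

# Same fitted value, but the prediction interval is wider
conf_int
pred_int
```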
Diagnostic plots
• The plot() method for objects of class lm provides four
diagnostic plots.
• The figure depicts the result for the journals regression.
• We set the graphical parameter mfrow to c(2, 2) using the par()
function, creating a 2 x 2 matrix of plotting areas to see all
four plots simultaneously.
[Figure: four diagnostic plots: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage]
Hypothesis:
log(citeprice) = - 0.5
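The test output for this hypothesis is not preserved. As a rough sketch, the corresponding t statistic can be computed by hand from the estimate and standard error reported by summary(jmodel), using the 178 residual degrees of freedom shown in the stargazer output. This is a manual calculation, not necessarily the function used on the slide.

```r
estimate <- -0.53305   # coefficient of log(citeprice) from summary(jmodel)
std_err  <-  0.03561   # its standard error
df_resid <-  178       # residual degrees of freedom (180 obs., 2 coefficients)

t_stat  <- (estimate - (-0.5)) / std_err
p_value <- 2 * pt(-abs(t_stat), df = df_resid)   # two-sided p-value

round(t_stat, 3)   # about -0.928
p_value > 0.05     # TRUE: -0.5 is not rejected at the 5% level
```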
R and LaTeX: texreg package
> library(texreg)
> texreg(jmodel, dcolumn = TRUE, booktabs = TRUE)
\begin{table}
\begin{center}
\begin{tabular}{l D{.}{.}{3.5}@{} }
\toprule
& \multicolumn{1}{c}{Model 1} \\
\midrule
(Intercept) & 4.77^{***} \\
& (0.06) \\
log(citeprice) & -0.53^{***} \\
& (0.04) \\
\midrule
R$^2$ & 0.56 \\
Adj. R$^2$ & 0.55 \\
Num. obs. & 180 \\
\bottomrule
\multicolumn{2}{l}{\scriptsize{
\textsuperscript{***}$p<0.001$,
\textsuperscript{**}$p<0.01$,
\textsuperscript{*}$p<0.05$}}
\end{tabular}
\caption{Statistical models}
\label{table:coefficients}
\end{center}
\end{table}
Rendered table:
                 Model 1
(Intercept)      4.77***
                 (0.06)
log(citeprice)   -0.53***
                 (0.04)
R2               0.56
Adj. R2          0.55
Num. obs.        180
*** p < 0.001, ** p < 0.01, * p < 0.05
Table: Statistical models
R and LaTeX: stargazer package
> library(stargazer)
> stargazer(jmodel, align=T)
\begin{table}[!htbp] \centering
\caption{}
\label{}
\begin{tabular}{@{\extracolsep{5pt}}lD{.}{.}{-3} } \\[-1.8ex]\hline \hline \\[-1.8ex]
& \multicolumn{1}{c}{\textit{Dependent variable:}} \\ \cline{2-2}
\\[-1.8ex] & \multicolumn{1}{c}{log(subs)} \\ \hline \\[-1.8ex]
log(citeprice) & -0.533^{***} \\
& (0.036) \\
& \\
Constant & 4.766^{***} \\
& (0.056) \\
& \\ \hline \\[-1.8ex]
Observations & \multicolumn{1}{c}{180} \\
R$^{2}$ & \multicolumn{1}{c}{0.557} \\
Adjusted R$^{2}$ & \multicolumn{1}{c}{0.555} \\
Residual Std. Error & \multicolumn{1}{c}{0.750 (df = 178)} \\
F Statistic & \multicolumn{1}{c}{224.037$^{***}$ (df = 1; 178)} \\ \hline \hline \\[-1.8ex]
\textit{Note:} & \multicolumn{1}{r}{$^{*}$p$<$0.1; $^{**}$p$<$0.05; $^{***}$p$<$0.01} \\
\normalsize
\end{tabular}
\end{table}
Table: caption
                      Dependent variable:
                      log(subs)
log(citeprice)        -0.533***
                      (0.036)
Constant              4.766***
                      (0.056)
Observations          180
R2                    0.557
Adjusted R2           0.555
Residual Std. Error   0.750 (df = 178)
F Statistic           224.037*** (df = 1; 178)
Note: * p<0.1; ** p<0.05; *** p<0.01
Multiple Linear Regression
> library(AER)
> data("CPS1988")
> head(CPS1988)
   wage education experience ethnicity smsa    region parttime
1 354.9         7         45      cauc  yes northeast       no
2 123.5        12          1      cauc  yes northeast      yes
3 370.4         9          9      cauc  yes northeast       no
4 754.9        11         46      cauc  yes northeast       no
5 593.5        12         36      cauc  yes northeast       no
6 377.2        16         22      cauc  yes northeast       no
> options(digits=4)
> educmodel<-lm(log(wage)~experience+I(experience^2)+education+ethnicity,data=CPS1988)
> summary(educmodel)
Call:
lm(formula = log(wage) ~ experience + I(experience^2) + education +
ethnicity, data = CPS1988)
Residuals:
Min 1Q Median 3Q Max
-2.943 -0.316 0.058 0.376 4.383
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.321395 0.019174 225.4 <2e-16 ***
experience 0.077473 0.000880 88.0 <2e-16 ***
I(experience^2) -0.001316 0.000019 -69.3 <2e-16 ***
education 0.085673 0.001272 67.3 <2e-16 ***
ethnicityafam -0.243364 0.012918 -18.8 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interactions
• Let us consider an interaction between ethnicity and education.
> cps_int<-lm(log(wage)~experience+I(experience^2)+education*ethnicity,data=CPS1988)
> coeftest(cps_int)
t test of coefficients:
• Let us consider only the interaction between ethnicity and education.
> cps_int2<-lm(log(wage)~experience+I(experience^2)+education:ethnicity,data=CPS1988)
> coeftest(cps_int2)
t test of coefficients:
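The two formulas differ in which terms they generate: a*b expands to a + b + a:b, while a:b contains the interaction alone. One way to see this without fitting a model is to look at the design matrix columns, as in the sketch below. The small data frame is hypothetical and only mimics the variable names of CPS1988.

```r
# Hypothetical mini data set with the same variable names as CPS1988
d <- data.frame(
  y = c(1.2, 2.3, 3.1, 4.8, 5.0, 6.7),
  education = c(8, 10, 12, 12, 14, 16),
  ethnicity = factor(c("cauc", "afam", "cauc", "afam", "cauc", "afam"),
                     levels = c("cauc", "afam"))
)

# education * ethnicity: both main effects plus the interaction
full_cols <- colnames(model.matrix(y ~ education * ethnicity, data = d))
# education:ethnicity: interaction only, one education slope per group
only_cols <- colnames(model.matrix(y ~ education:ethnicity, data = d))

full_cols
only_cols
```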
R and LaTeX: stargazer package
Dependent variable:
log(wage)
(1) (2)
education 0.101∗∗∗ 0.076∗∗∗
(0.001) (0.001)
experience 0.020∗∗∗
(0.0003)
Linear Regression with Time Series Data
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-0.9384 -0.3069 -0.0697 0.2697 1.1731
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.0188 0.0485 -0.39 0.7
x 0.4995 0.0539 9.27 4.6e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> summary(dynmodel)
Call:
dynlm(formula = d(y) ~ L(d(y)) + L(x, 4), data = data.ts)
Residuals:
Min 1Q Median 3Q Max
-2.0278 -0.4267 0.0013 0.5882 2.0425
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.0143 0.0838 -0.17 0.86
L(d(y)) -0.5087 0.0876 -5.81 8.9e-08 ***
L(x, 4) 0.0214 0.0940 0.23 0.82
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
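In the dynlm formula, d() takes first differences and L() takes lags. To make the meaning of d(y) ~ L(d(y)) + L(x, 4) concrete, the sketch below builds the same regressors by hand in base R. The data are simulated and the variable names are illustrative; this is not the series used above.

```r
set.seed(42)
n <- 100
x <- rnorm(n)           # simulated regressor
y <- cumsum(rnorm(n))   # simulated random walk

t_idx <- 5:n            # first period at which all lags exist
d <- data.frame(
  dy    = y[t_idx] - y[t_idx - 1],       # d(y)    at time t
  dy_l1 = y[t_idx - 1] - y[t_idx - 2],   # L(d(y)) : d(y) at time t - 1
  x_l4  = x[t_idx - 4]                   # L(x, 4) : x at time t - 4
)

fit <- lm(dy ~ dy_l1 + x_l4, data = d)
coef(fit)
```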
Linear Regression with Panel Data
• For illustrating the basic fixed- and random-effects methods, we use the
well-known Grunfeld data (Grunfeld 1958) comprising 20 annual
observations on the three variables for 11 large US firms for the years
1935-1954.
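• The basic one-way panel regression is
$\mathrm{invest}_{it} = \beta_1 \mathrm{value}_{it} + \beta_2 \mathrm{capital}_{it} + \alpha_i + \nu_{it}$,
• where invest is the real gross investment, value is the real value of the
firm, and capital is the real value of the capital stock.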
• Originally employed in a study of the determinants of corporate
investment in a University of Chicago Ph.D. thesis, these data have been
a textbook classic since the 1970s.
• The package AER provides the full data set comprising all 11 firms.
> gr_pool <- plm(inv ~ value + capital, data = pgr, model = "pooling")
> gr_fe <- plm(inv ~ value + capital, data = pgr, model = "within")
> gr_re <- plm(inv ~ value + capital, data = pgr, model = "random")
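To illustrate what model = "within" estimates, the sketch below reproduces the fixed-effects slope in base R on simulated data, by subtracting firm means before running OLS. It is a didactic sketch with hypothetical data, not a replacement for plm().

```r
set.seed(1)
firms <- rep(1:5, each = 10)               # 5 firms, 10 periods each
alpha <- rep(rnorm(5, sd = 3), each = 10)  # unobserved firm effects
x <- rnorm(50) + alpha                     # regressor correlated with the effects
y <- 2 * x + alpha + rnorm(50)             # true slope is 2

# Least-squares dummy variables: one intercept per firm
lsdv <- lm(y ~ x + factor(firms))

# Within transformation: subtract firm means, then OLS without intercept
x_w <- x - ave(x, firms)
y_w <- y - ave(y, firms)
within_fit <- lm(y_w ~ x_w - 1)

coef(lsdv)[["x"]]          # identical slope estimates
coef(within_fit)[["x_w"]]
```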
library(stargazer)
stargazer(gr_pool, gr_fe, gr_re, align=T, column.labels = c("POLS", "FE", "RE"),
dep.var.caption = "Dependent variable: investment")
Hausman Test
Dynamic Linear Regression with Panel Data
Time Series
Time Series: quantmod Package
> library(quantmod)
> getSymbols("AAPL",src="yahoo")
[1] "AAPL"
> head(AAPL)
AAPL.Open AAPL.High AAPL.Low AAPL.Close AAPL.Volume AAPL.Adjusted
2007-01-03 86.29 86.58 81.90 83.80 44225700 81.03
2007-01-04 84.05 85.95 83.82 85.66 30259300 82.83
2007-01-05 85.77 86.20 84.40 85.05 29812200 82.24
2007-01-08 85.96 86.53 85.28 85.47 28468100 82.64
2007-01-09 86.45 92.98 85.15 92.57 119617800 89.51
2007-01-10 94.75 97.80 93.45 97.00 105460000 93.79
[Figure: chart of AAPL prices with a volume panel (in millions), 2007-01-03 to 2013-11-22; last price 519.8]
> head(to.weekly(AAPL))
AAPL.Open AAPL.High AAPL.Low AAPL.Close AAPL.Volume AAPL.Adjusted
2007-01-05 86.29 86.58 81.90 85.05 104297200 82.24
2007-01-12 85.96 97.80 85.15 94.62 351865300 91.49
2007-01-19 95.68 97.60 88.12 88.50 236407700 85.57
2007-01-26 89.14 89.16 84.99 85.38 195789700 82.55
2007-02-02 86.30 86.65 83.70 84.75 129342000 81.95
2007-02-09 84.30 86.51 82.86 83.27 144630100 80.51
> head(to.monthly(AAPL))
AAPL.Open AAPL.High AAPL.Low AAPL.Close AAPL.Volume AAPL.Adjusted
jan 2007 86.29 97.80 81.90 85.73 971777900 82.89
fev 2007 86.23 90.81 82.86 84.61 490084100 81.81
mar 2007 84.03 96.83 83.75 92.91 568523000 89.84
abr 2007 94.14 102.50 89.60 99.80 480705000 96.50
mai 2007 99.59 122.17 98.55 121.19 620181000 117.18
jun 2007 121.10 127.61 115.40 122.04 831412200 118.00
> head(r.aapl)
           monthly.returns
2007-01-31        0.022695
2007-02-28       -0.013115
2007-03-30        0.093631
2007-04-30        0.071513
2007-05-31        0.194168
2007-06-29        0.006973
> library(TSA)
> par(mfrow=c(2,1))
> acf(r.aapl, main = "acf(r.aapl)")
> pacf(r.aapl, main = "pacf(r.aapl)")
> par(mfrow=c(1,1))
[Figure: time plot of the monthly returns (jan 2007 to jan 2013), followed by the ACF and PACF of r.aapl against lag]
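As a reference for reading such plots, the sketch below draws the ACF and PACF of a simulated AR(1) series with the base stats functions. The series is simulated and is not the AAPL returns. Note that stats::acf() includes lag 0 in its output.

```r
set.seed(2013)
sim <- arima.sim(model = list(ar = 0.6), n = 500)   # simulated AR(1), not AAPL returns

old_par <- par(mfrow = c(2, 1))
acf_sim  <- stats::acf(sim, main = "acf(sim)")     # decays gradually for an AR(1)
pacf_sim <- stats::pacf(sim, main = "pacf(sim)")   # cuts off after lag 1
par(old_par)

acf_sim$acf[2]    # lag-1 autocorrelation (element 1 is lag 0)
pacf_sim$acf[1]   # lag-1 partial autocorrelation
```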
Time Series: ARIMA models
Coefficients:
ar1 intercept
0.170 0.022
s.e. 0.107 0.013
> coeftest(ar.aapl)
z test of coefficients:
Box-Ljung test
data: resid(ar.aapl)
X-squared = 4.592, df = 4, p-value = 0.3318
Box-Ljung test
data: resid(ar.aapl)
X-squared = 9.166, df = 9, p-value = 0.4221
Box-Ljung test
data: resid(ar.aapl)
X-squared = 12.21, df = 14, p-value = 0.5896
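The commands behind these outputs are not preserved. The degrees of freedom (4, 9, 14) are consistent with Ljung-Box tests at lags 5, 10 and 15 with one fitted AR coefficient subtracted, which is an inference rather than something stated on the slide. A sketch of that workflow on a simulated AR(1) series:

```r
set.seed(2013)
sim <- arima.sim(model = list(ar = 0.6), n = 500)   # simulated AR(1) series

ar_fit <- arima(sim, order = c(1, 0, 0))   # AR(1) with intercept
ar_fit

# Ljung-Box tests on the residuals; fitdf = 1 subtracts the one AR coefficient
for (lag_k in c(5, 10, 15)) {
  print(Box.test(resid(ar_fit), lag = lag_k, type = "Ljung-Box", fitdf = 1))
}

lb5 <- Box.test(resid(ar_fit), lag = 5, type = "Ljung-Box", fitdf = 1)
lb5$parameter   # df = 5 - 1 = 4
```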
Time Series: ARCH effects
Box-Ljung test
data: resid(ar.aapl)^2
X-squared = 2.762, df = 5, p-value = 0.7367
Box-Ljung test
data: resid(ar.aapl)^2
X-squared = 26.59, df = 10, p-value = 0.003021
Box-Ljung test
data: resid(ar.aapl)^2
X-squared = 27.31, df = 15, p-value = 0.02633
ARCH LM Test
data: r.aapl
Chi-squared = 20.93, df = 12, p-value = 0.05143
Time Series: GARCH Model
GARCH(1, 1)
> library(fGarch)
> model<-garchFit(~garch(1,1), data = r.aapl, trace=F)
> summary(model)
Title: GARCH Modelling
Conditional Distribution: norm
Coefficient(s):
Estimate Std. Error t value Pr(>|t|)
mu 0.0215996 0.0104962 2.058 0.0396 *
omega 0.0006796 0.0005811 1.169 0.2422
alpha1 0.1014303 0.0807611 1.256 0.2091
beta1 0.8378938 0.0795071 10.539 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1