
Regression and Correlation

Introduction
Simple Linear Regression & GLM
Least Squares Trend Line Fitting
Model Testing
Inference and Regression
Regression Diagnostics
Correlation
Chpts. 16 & 17 W&S

Introduction

Recall that, to date, we have focused on statistics examining one variable from either one or two samples.

We now turn our attention to examining the relationships between two variables from one sample. These statistics are often referred to as bivariate statistics (as opposed to univariate).

Regression and correlation are the major approaches to bivariate analysis.

Multiple regression can be used to extend the case to three or more variables.

Regression

Regression is a method used to predict the value of one numerical variable from another.

The most commonly encountered type of regression is simple linear regression, which draws a straight line through a cloud of points to predict the response variable (Y) from the explanatory variable (X).
Simple Linear Regression

A typical question asked often by biologists is: how does variable Y change as a function of variable X?

[Scatter plot of Y against X]

X is the independent or predictor variable. Independent because it is assigned by the investigator and is independent of measurement error.

Y is the dependent or response variable.

Simple Linear Regression

The simplest relationship between two variables is a straight line (the most parsimonious solution).

Simple indicates that there is only one independent variable.

Linear indicates the nature of the model type.

Hence, Simple Linear Regression.

Plot the Data! Plot the Data!

As with univariate analysis, your first look at the data should always be some form of exploratory graphical analysis.

The simplest form of a graph used to relate two variables is referred to as a scatter plot.

NEVER skip this step! The data may not even be linear, and a different model may be more appropriate.
Simple Linear Regression
- Example -

Let's examine a simple example that looks at the effect of age on blood pressure:

X: 28 23 52 42 27 29 43 34 40 28
Y: 70 68 90 75 68 80 78 70 80 72

[Initial scatter plot: X from 0 to 60, Y from 0 to 80]

In a simple initial scatterplot, Y does not appear to respond to X.

Simple Linear Regression
- Example -

However, be careful with your data plot! The previous scatterplot is misleading.

Changing the (1) aspect ratio and (2) scale reveals a more obvious and pronounced trend than previously apparent.

[Scatter plot of Blood Pressure (mm Hg), 65 to 95, against Age (years), 20 to 60]

Simple Linear Regression
- Example -

After plotting the data, we see what looks like a linear cloud of data.

We must now fit a linear model to those data.

Recall from basic algebra and geometry that the equation for a straight line is:

y = bx + a

where b is the slope and a is the y-intercept.
The General Linear Model

We can reformulate our equation for a line into a more formal and utilitarian model:

y = α + βx + ε

Where: α = the y-intercept
       β = the slope
       ε = error (deviation from the mean)

This is the General Linear Model (GLM)!

The General Linear Model

But, we still need to fit a line through the cloud of points.

We use a procedure known as Least Squares Trend Line Fitting.

We attempt to minimize the εs.

[Scatter plot of Blood Pressure (mm Hg), 65.0 to 95.0, against Age (yrs), 20.0 to 60.0]

Least Squares Trend Line Fitting

The procedure is analogous to what we have already done in univariate analysis, whereby we calculate the least squares for variance determinations.

ŷ = a + bx              (1)   Assume the line takes the form of eq. 1
                              (ŷ is a predicted value of y for a value of x)

f(a, b) = Σ(y − ŷ)²     (2)   The goal is to minimize the function of eq. 2

a = ȳ − b·x̄             (3)   Solving simultaneously, you will find eq. 3
Least Squares Trend Line Fitting

Equation 3 is very notable, because it means that the least squares trend line MUST run through the point (x̄, ȳ) defined by the mean of X and the mean of Y.

The y-intercept provides a second point, and a line can be drawn.

[Scatter plot of Blood Pressure (mm Hg), 65 to 95, against Age (yrs), 20 to 60, with the mean point (x̄, ȳ) and the y-intercept marked]
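The pass-through-the-means property of eq. 3 is easy to verify numerically. A minimal sketch (in Python rather than the deck's R, so it runs with no dependencies; not part of the original slides) using the age/blood-pressure data from the earlier example:

```python
# Verify that a = y-bar - b*x-bar forces the fitted line through (x-bar, y-bar).
age = [28, 23, 52, 42, 27, 29, 43, 34, 40, 28]   # X from the example
bp  = [70, 68, 90, 75, 68, 80, 78, 70, 80, 72]   # Y from the example

n = len(age)
xbar = sum(age) / n
ybar = sum(bp) / n
Sxx = sum(x * x for x in age) - sum(age) ** 2 / n
Sxy = sum(x * y for x, y in zip(age, bp)) - sum(age) * sum(bp) / n

b = Sxy / Sxx            # slope
a = ybar - b * xbar      # eq. 3: y-intercept

# The fitted line evaluated at x-bar returns y-bar (exactly, by construction).
print(a + b * xbar, ybar)
```

Because a is defined as ȳ − b·x̄, the identity holds exactly, not just approximately.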

Least Squares Trend Line Fitting
- Procedure -

1. Determine Σx, Σx², Σy, Σxy

2. Calculate the means of x & y

3. Calculate the slope via Sxx and Sxy

4. Calculate the y-intercept

5. Draw the line through the means and the y-intercept

Least Squares Trend Line Fitting
- Lexicon -

Before proceeding further, we need to develop and clarify a shorthand notation:

Sxx = sum of the squared-x deviations
Syy = sum of the squared-y deviations
Sxy = sum of the products of the x and y deviations
s²y·x = variance of y about x
Least Squares Trend Line Fitting
- Example -

   x      y      x²      xy
  31     7.8     961    241.8
  32     8.3    1024    265.6
  33     7.6    1089    250.8
  34     9.1    1156    309.4
  35     9.6    1225    336.0
  35     9.8    1225    343.0
  40    11.8    1600    472.0
  41    12.1    1681    496.1
  42    14.7    1764    617.4
  46    13.0    2116    598.0
 369   103.8   13841   3930.1   (sums)

Mean x = 36.90
Mean y = 10.38

Lastly, we need to calculate:
Sxx: the sum of squared-x deviations
Sxy: the sum of xy-product deviations

Least Squares Trend Line Fitting
- Example -

Sxx = Σx² − (Σx)²/N            Sxx = 224.9

Sxy = Σxy − (Σx)(Σy)/N         Sxy = 99.88

b = Sxy / Sxx                  b = 0.444

a = ȳ − b·x̄                    a = −6.0076

Thus, ŷ = a + bx = −6.0076 + (0.444)x
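The hand calculation above can be reproduced in a few lines. A sketch in Python (chosen over the deck's R for a dependency-free check; not part of the original slides) of the same sums and deviations:

```python
# Least squares by hand: the worked example's data and computing formulas.
x = [31, 32, 33, 34, 35, 35, 40, 41, 42, 46]
y = [7.8, 8.3, 7.6, 9.1, 9.6, 9.8, 11.8, 12.1, 14.7, 13.0]
N = len(x)

Sxx = sum(v * v for v in x) - sum(x) ** 2 / N                   # 224.9
Sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / N  # 99.88
b = Sxy / Sxx                                                   # slope, ~0.444
a = sum(y) / N - b * sum(x) / N                                 # intercept, ~-6.0076
print(round(Sxx, 2), round(Sxy, 2), round(b, 3), round(a, 4))
```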

Least Squares Trend Line Fitting
- Example -

[Plot of the fitted line Y = 0.444 X − 6.007 over the data, X from 0 to 50, Y from −10 to 20]
Least Squares Trend Line Fitting
- Caveat -

You have just fitted your first regression line!

You have also created a linear model against which you could predict any value of y from x.

However, BEWARE: this model is only valid in the region of x that you examined.

[Hypothetical data: the actual response curve vs. the response predicted via the fitted line, showing the valid prediction range]

OLS using R

Let's return to our original age and blood pressure example:

> X
 [1] 28 23 52 42 27 29 43 34 40 28
> Y
 [1] 70 68 90 75 68 80 78 70 80 72
> model1 <- lm(Y~X)
> model1

Call:
lm(formula = Y ~ X)

Coefficients:
(Intercept)            X
    53.6934       0.6187

> plot(X, Y, cex=1.5)
> abline(lm(Y~X))

[Scatter plot of Y, 70 to 90, against X, 25 to 50, with the fitted line]
> fitted(model1)
       1        2        3        4        5
71.01666 67.92322 85.86517 79.67829 70.39797
       6        7        8        9       10
71.63535 80.29698 74.72879 78.44092 71.01666

> resid(model1)
          1           2           3           4           5
-1.01665799  0.07678293  4.13482561 -4.67829256 -2.39796981
          6           7           8           9          10
 8.36465383 -2.29698074 -4.72878709  1.55908381  0.98334201

> segments(X, fitted(model1), X, Y)

[Scatter plot of Y, 70 to 90, against X, 25 to 50, with vertical residual segments drawn to the fitted line]

Model Testing

To use the least squares trend line for predictive purposes, two conditions are necessary:

(1) The straight line model fits the data

(2) The straight line being fitted is not horizontal (i.e., β ≠ 0). In other words, the regression line needs to be a better predictor of y than the mean of y.

Model Testing

To meet the conditions for the regression of y on x:

The x values must be fixed by the investigator and have negligible error.
For each value of x there is a normal distribution of y values.
The distributions of y around each x must have similar variances (i.e., a common variance σ²y·x [read as the variance of y independent of x]).
The expected values of y for each x lie on a straight line.
Model Testing

In other words, we can say that we have satisfied the model:

y = α + βx + ε

In which the εs:
- are normally distributed
- have a mean of zero and variance σ²y·x
- are independent of the xs
- are independent of each other

Model Testing

Once again, recall that the εs, or residuals, are represented by the difference between the observed ys and the predicted ŷs, and are represented in the diagram by the vertical red lines.

[Scatter plot of Blood Pressure (mm Hg), 65.0 to 95.0, against Age (yrs), 20.0 to 60.0, with vertical residual lines drawn to the fitted line]
Model Testing

Thus, the assumptions of linear regression are largely tied to the behavior of the residuals!

We test for the violation of assumptions primarily by examining the εs:

(1) Examine the normality of the εs.
(2) Check linearity by plotting the εs against the predicted values of y (there should be random scatter around ε = 0).
(3) Check equality of variance by plotting the εs against the xs.
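The checks above start from the εs themselves. A Python sketch (not part of the original slides; it reuses the age/blood-pressure data from the earlier example) of computing the fitted values and residuals that would feed those diagnostic plots:

```python
# Compute fitted values and residuals for the age/blood-pressure fit.
X = [28, 23, 52, 42, 27, 29, 43, 34, 40, 28]
Y = [70, 68, 90, 75, 68, 80, 78, 70, 80, 72]
n = len(X)

Sxx = sum(x * x for x in X) - sum(X) ** 2 / n
Sxy = sum(x * y for x, y in zip(X, Y)) - sum(X) * sum(Y) / n
b = Sxy / Sxx
a = sum(Y) / n - b * sum(X) / n

fitted = [a + b * x for x in X]
resid = [y - f for y, f in zip(Y, fitted)]

# Least squares guarantees the residuals sum to zero; plotting resid
# against fitted and against X is what tests the model assumptions.
print(round(sum(resid), 10))
```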

Normality

You may subject the residuals to the same measures of skewness, kurtosis, and tests of normality that we have previously used in univariate analysis.

Linearity

A graph of the εs against the expected values of y (the ŷs) should produce a band centered around ε = 0.

Any other systematic pattern is indicative of a nonlinear trend.
Equality of Variance

Variances that are independent of x (i.e., homogeneous) will result in a horizontal band of points around ε = 0.

Variances that are dependent upon x will have a fan shape.

Independence

If the observations have a natural sequence in time or space, they may suffer from autocorrelation.

Plot the residuals against rank order in time or space to check.

Model Testing

If the diagrams do not reveal a substantive departure from the assumptions, we have not proved the model correct, but there is no evidence that it is wrong.

If a problem is revealed, consider a closer examination of outliers, transformations, or switching to another model.
Model Testing

If we determine that a straight line is a reasonable model and we have not violated any obvious assumptions, we need to more closely examine the parameters of the model.

Specifically, we should examine the characteristics of the slope, because this provides information about the nature of the relationship between x and y.

If β = 0, then there is no relationship
   β > 0, then there is a positive relationship
   β < 0, then there is a negative relationship

Model Testing
- Beta -

As in previous situations though, we are left with an important question:

Q. If b ≠ 0, how non-zero does it have to be to be significantly non-zero?

A. Once again, this is tied to issues of sample size and variance. Statistical tests have been developed to test the individual model parameters.
Model Testing
- Beta -

To test whether the line is horizontal or not, we must perform an explicit test of Ho: β = 0 using a form of t-test.

Ho: β = 0
Ha: β ≠ 0, β < 0, or β > 0

Test statistic: t = (b − βo) / (s_y·x / √Sxx)

where: b = Sxy / Sxx
       s²y·x = (Syy − b·Sxy) / (n − 2)
and:   Sxx = Σx² − (Σx)²/n
       Syy = Σy² − (Σy)²/n
       Sxy = Σxy − (Σx)(Σy)/n

For a two-tailed test: reject Ho if |t| ≥ t(α/2, n−2)

NB: df = n − 2
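The test statistic can be computed directly from the sums. A Python sketch (not part of the original slides) for the age/blood-pressure data, using the tabled critical value t(0.025, 8) = 2.306:

```python
# t-test of Ho: beta = 0 for the age/blood-pressure regression.
X = [28, 23, 52, 42, 27, 29, 43, 34, 40, 28]
Y = [70, 68, 90, 75, 68, 80, 78, 70, 80, 72]
n = len(X)

Sxx = sum(x * x for x in X) - sum(X) ** 2 / n
Syy = sum(y * y for y in Y) - sum(Y) ** 2 / n
Sxy = sum(x * y for x, y in zip(X, Y)) - sum(X) * sum(Y) / n

b = Sxy / Sxx                        # slope
s2_yx = (Syy - b * Sxy) / (n - 2)    # variance of y about x
t = (b - 0) / (s2_yx ** 0.5 / Sxx ** 0.5)

# Two-tailed test at alpha = 0.05 with n - 2 = 8 df: t(0.025, 8) = 2.306.
print(round(t, 3), abs(t) >= 2.306)
```

Since |t| exceeds the critical value, Ho: β = 0 is rejected for these data.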

Model Testing
- Alpha, Beta, y-hat -

[Table of tests for the individual parameters α, β, and ŷ]

> model1 <- lm(Y~X)            Age vs. BP Example
> summary(model1)

Call:
lm(formula = Y ~ X)

Residuals:
    Min      1Q  Median      3Q     Max
-4.7288 -2.3727 -0.4699  1.4151  8.3647

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  53.6934     5.5153   9.735 1.04e-05 ***
X             0.6187     0.1545   4.004  0.00393 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.283 on 8 degrees of freedom

Multiple R-squared: 0.6671, Adjusted R-squared: 0.6255
F-statistic: 16.03 on 1 and 8 DF, p-value: 0.003928
> par(mfrow=c(2,2))            Age vs. BP Example
> plot(model1)

[Four diagnostic panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance contours)]

Model Testing
- Confidence Intervals -

From our least squares trend line fitting, it is obvious that all points do not lie exactly along the fitted line.

Often, we wish to place 95% CIs around our best-fit trend line.

By hand, this is tedious. It requires one to pick sequential x* values, determine the CI0.95 at each point, and then replot the data. Note that if you do this, you get a pair of flared lines.
Model Testing
- Confidence Intervals -

Note that outside the CI0.95 there is another pair of lines referred to as the PI0.95.

This is the 95% Prediction Interval and is used to predict a single y from a single x*.

CI0.95: ŷ ± t(α/2, n−2) · s_y·x · √( 1/n + (x* − x̄)² / Sxx )

PI0.95: ŷ ± t(α/2, n−2) · s_y·x · √( 1 + 1/n + (x* − x̄)² / Sxx )

The added 1 lessens the relative influence of the distance from the means, hence less flare on the plot.
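The flare in the bands comes from the (x* − x̄)² term under the root. A Python sketch (not part of the original slides; t(0.025, 8) = 2.306 taken from tables) comparing CI and PI half-widths at and away from x̄ for the age/blood-pressure fit:

```python
# Half-widths of the 95% CI and PI at chosen x* values.
X = [28, 23, 52, 42, 27, 29, 43, 34, 40, 28]
Y = [70, 68, 90, 75, 68, 80, 78, 70, 80, 72]
n = len(X)
xbar = sum(X) / n

Sxx = sum(x * x for x in X) - sum(X) ** 2 / n
Syy = sum(y * y for y in Y) - sum(Y) ** 2 / n
Sxy = sum(x * y for x, y in zip(X, Y)) - sum(X) * sum(Y) / n
b = Sxy / Sxx
s_yx = ((Syy - b * Sxy) / (n - 2)) ** 0.5
t_crit = 2.306  # t(0.025, 8), from tables

def ci_half(xstar):   # 95% confidence-interval half-width at x*
    return t_crit * s_yx * (1 / n + (xstar - xbar) ** 2 / Sxx) ** 0.5

def pi_half(xstar):   # 95% prediction-interval half-width at x*
    return t_crit * s_yx * (1 + 1 / n + (xstar - xbar) ** 2 / Sxx) ** 0.5

# Both intervals are narrowest at x-bar and flare as x* moves away;
# the PI is always wider because of the extra "1" under the root.
print(ci_half(xbar), ci_half(50), pi_half(xbar))
```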

> plot(X,Y)                    Age vs. BP Example
> abline(model1)
> XV<-seq(20,65,5)
> YV<-predict(model1, list(X=XV), int="c")
> matlines(XV, YV, lty=c(1,2,2))

[Scatter plot of Y, 70 to 90, against X, 25 to 50, with the fitted line and flared 95% confidence bands]

Nonparametric Regression

If the data are suspected to be linear and you are unable to correct either a variance or normality assumption (via transformation), it may be appropriate to conduct a nonparametric regression.

There are a variety of options, one of which is Kendall's robust line-fit method.
Kendall's Robust Line-fit Method
- Procedure -

The method is fairly straightforward:

Rank order the x, y pairs based on the xs.

Compute a slope (Sji) for EVERY pair of x-values, for i = 1 to n − 1 and j > i:

Sji = (Yj − Yi) / (Xj − Xi)

There will be n(n − 1)/2 slope estimates per sample.

Kendall's Robust Line-fit Method
- Procedure -

After computing all possible Sji, the nonparametric estimate b of the slope is the MEDIAN of the Sji values.

Kendall's Rank Correlation Coefficient (an ordering test) can be used to test b for significance.

To estimate the y-intercept, compute the n values of Yi − bXi and again choose the MEDIAN.

Kendall's Robust Line-fit Method
- Example -

Data Set:               Calculate the Sjis:
   X      Y
  0.0    8.98           S21 = (8.14 − 8.98)/(12 − 0)    = −0.07000
 12.0    8.14           S31 = (6.67 − 8.98)/(29.5 − 0)  = −0.07831
 29.5    6.67           S32 = (6.67 − 8.14)/(29.5 − 12) = −0.08400
 43.0    6.08           ...
 53.0    5.90           S91 = (3.72 − 8.98)/(93.0 − 0)  = −0.05656
 62.5    5.83
 75.5    4.68           Median of the 36 slopes:
 85.0    4.20           b = −0.05436
 93.0    3.72
Kendall's Robust Line-fit Method
- Example -

To estimate the y-intercept, compute the 9 values of ai using Yi − bXi.

For i = 1: a1 = 8.98 − (−0.05436)(0.0) = 8.980
For i = 2: a2 = 8.14 − (−0.05436)(12.0) = 8.792

MEDIAN of the n intercepts: a = 8.783

THEREFORE: Y = 8.783 − 0.05436 X
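The whole procedure fits in a few lines. A Python sketch (not part of the original slides) for the 9-point data set above:

```python
# Kendall's robust line-fit: median of all pairwise slopes, then the
# median of the n candidate intercepts Yi - b*Xi.
from statistics import median

pts = [(0.0, 8.98), (12.0, 8.14), (29.5, 6.67), (43.0, 6.08), (53.0, 5.90),
       (62.5, 5.83), (75.5, 4.68), (85.0, 4.20), (93.0, 3.72)]
n = len(pts)

# One slope per pair of points: n(n-1)/2 = 36 estimates for n = 9.
slopes = [(pts[j][1] - pts[i][1]) / (pts[j][0] - pts[i][0])
          for i in range(n - 1) for j in range(i + 1, n)]
b = median(slopes)                      # robust slope estimate

a = median(y - b * x for x, y in pts)   # robust intercept estimate
print(len(slopes), round(b, 5), round(a, 3))
```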

Correlation

There are many purposes to regression, but the main one is prediction. Thus, the goal is to determine the NATURE of the relationship between two variables.

Often, as a next step or for other reasons, one wishes just to determine the STRENGTH of the relationship; to do so, one would perform a correlation analysis.

The product is a correlation coefficient, r.

NB: Regression and correlation are different, but not mutually exclusive, techniques.

Correlation

The correlation coefficient is determined using the same sample statistics as used in regression:

r = Sxy / √(Sxx · Syy)

In all cases, −1 ≤ r ≤ +1
r = 0 is no relationship
r = ±1 is a perfect relationship (pos. or neg.)
Coefficient of Determination

To demonstrate the inter-relatedness of correlation & regression, let's return to regression momentarily to tie up a loose end.

When we discuss the variability in y which is explained by the linear association between x and y, we frequently refer to the coefficient of determination, R²:

R² = S²xy / (Sxx · Syy)
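The identity R² = r² is easy to confirm numerically. A Python sketch (not part of the original slides) using the age/blood-pressure data:

```python
# r and R-squared computed from the same sums of deviations.
X = [28, 23, 52, 42, 27, 29, 43, 34, 40, 28]
Y = [70, 68, 90, 75, 68, 80, 78, 70, 80, 72]
n = len(X)

Sxx = sum(x * x for x in X) - sum(X) ** 2 / n
Syy = sum(y * y for y in Y) - sum(Y) ** 2 / n
Sxy = sum(x * y for x, y in zip(X, Y)) - sum(X) * sum(Y) / n

r = Sxy / (Sxx * Syy) ** 0.5    # correlation coefficient
R2 = Sxy ** 2 / (Sxx * Syy)     # coefficient of determination

# R2 here matches the Multiple R-squared (0.6671) reported by summary()
# for the regression of Y on X, and equals r squared.
print(round(r, 4), round(R2, 4))
```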

Coefficient of Determination

If R² is large (close to 1.0), virtually all of the variability is explained by the relationship. Knowledge of x permits knowledge of y.

If R² is close to 0, there is no relationship, and knowledge of x permits no insight into y.

R² is sometimes symbolized as r² (since it is the square of the correlation coefficient) but, to clearly differentiate the two, use a capital R for regression and a lowercase r for correlation.

Correlation

The assumptions for correlation are different than for regression:

Subjects are sampled at random.
Both x and y contain sampling variability.
For each value of x there is a normal dist. of ys.
For each value of y there is a normal dist. of xs.
The x distributions have the same variance.
The y distributions have the same variance.
The joint distribution of x and y is bivariate normal.
Regression Model vs. Correlation Model

[Diagram contrasting the regression model and the correlation model]

Nonparametric Correlation

When we discuss the correlation coefficient, it is usually assumed that we are referring to the parametric correlation coefficient, which is most correctly referred to as the Pearson Product Moment Correlation Coefficient.

However, recognize that there are MANY types of correlation coefficients to deal with different situations. One, most often used upon the failure of parametric assumptions, is the nonparametric Spearman's Rank Correlation Coefficient.

Spearman's Rank Correlation
- Procedure -

Ho: E(rs) = 0 (i.e., the ranking of the x variable is independent of the ranking of the y variable)
Ha: E(rs) ≠ 0, E(rs) < 0, or E(rs) > 0

rs = 1 − 6Σd² / (N(N² − 1))    with d = rx − ry (diff. in x, y ranks)

Test statistic: z = rs·√(n − 1)

The test can be performed on continuous data converted to ranks, or on ordinal data.
Spearman's Rank Correlation
- Example -

Rank of rat health by two observers at the end of an endurance experiment.

Rat  Obs-1  Obs-2    d    d²
 1     4      4      0    0
 2     1      2     -1    1
 3     6      5      1    1
 4     5      6     -1    1
 5     3      1      2    4
 6     2      3     -1    1
 7     7      7      0    0
                 Sum d² = 8

rs = 1 − 6Σd² / (N(N² − 1))
rs = 1 − (6)(8)/(7)(48)
rs = 0.857

z = 2.099, P = 0.018
reject Ho
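The same arithmetic in a Python sketch (not part of the original slides; ranks taken from the table above):

```python
# Spearman's rank correlation for the rat-health ranking example.
obs1 = [4, 1, 6, 5, 3, 2, 7]   # ranks assigned by observer 1
obs2 = [4, 2, 5, 6, 1, 3, 7]   # ranks assigned by observer 2
N = len(obs1)

d2 = sum((a - b) ** 2 for a, b in zip(obs1, obs2))   # sum of squared rank diffs
rs = 1 - 6 * d2 / (N * (N ** 2 - 1))                 # Spearman's rs
z = rs * (N - 1) ** 0.5                              # large-sample test statistic
print(d2, round(rs, 3), round(z, 3))
```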
