
# Regression and Correlation

- Introduction
- Simple Linear Regression & GLM
- Least Squares Trend Line Fitting
- Model Testing
- Inference and Regression
- Regression Diagnostics
- Correlation

Chpts. 16 & 17, W&S

## Introduction

Recall that, to date, we have focused on statistics examining one variable from either one or two samples.

We now turn our attention to examining the relationship between two variables from one sample. These statistics are often referred to as bivariate statistics (as opposed to univariate).

Regression and correlation are the major approaches to bivariate analysis.

Multiple regression can be used to extend the case to three or more variables.

## Regression

Regression is a method used to predict the value of one numerical variable from another.

The most commonly encountered type of regression is simple linear regression, which draws a straight line through a cloud of points to predict the response variable (Y) from the explanatory variable (X).
## Simple Linear Regression

A typical question asked often by biologists is: how does variable Y change as a function of variable X?

[Scatter plot of Y against X]

X is the independent or predictor variable: independent because it is assigned by the investigator and is independent of measurement error.

Y is the dependent or response variable.

The simplest relationship between two variables is a straight line (the most parsimonious solution).

As with univariate analysis, your first look at the data should always be some form of exploratory graphical analysis.

The simplest form of graph used to relate two variables is referred to as a scatter plot.

NEVER skip this step! The data may not even be linear, and a different model may be more appropriate.
## Simple Linear Regression
- Example -

Let's examine a simple example that looks at the effect of age on blood pressure:

X: 28 23 52 42 27 29 43 34 40 28
Y: 70 68 90 75 68 80 78 70 80 72

[Initial scatter plot of Y against X with wide axis ranges (X: 0-60, Y: 0-80)]

In a simple initial scatterplot, Y does not appear to respond to X.

## Simple Linear Regression
- Example -

Pay attention to the scaling of your data plot! The previous scatterplot is misleading.

[Rescaled scatter plot: Blood Pressure (mm Hg), 65-95, against Age (years), 20-60]

Rescaled, the data show a much more pronounced trend than was previously apparent.

## Simple Linear Regression
- Example -

We now need to fit a line through the cloud of data.

Recall from basic algebra and geometry that the equation for a straight line is:

$$y = bx + a$$

where b is the slope and a is the y-intercept.
## The General Linear Model

We can reformulate our equation for a line into a more formal and utilitarian model:

$$y = \alpha + \beta x + \varepsilon$$

Where: α = the y-intercept
       β = the slope
       ε = error (deviation from the mean)

## The General Linear Model

But we still need to fit a line through the cloud of points.

We use a procedure known as Least Squares Trend Line Fitting.

We attempt to minimize the εs.

[Scatter plot of Blood Pressure (mm Hg), 65.0-95.0, against Age (yrs), 20.0-60.0, with a fitted line]

## Least Squares Trend Line Fitting

The procedure is analogous to what we have already done in univariate analysis, whereby we calculate the least squares for variance determinations.

$$\hat{y} = a + bx \qquad (1)$$

Assume the line takes the form of eq. 1 (ŷ is a predicted value of y for a value of x).

$$f(a, b) = \sum (y - \hat{y})^2 \qquad (2)$$

The goal is to minimize the function of eq. 2.

$$a = \bar{y} - b\bar{x} \qquad (3)$$

Solving simultaneously, you will find eq. 3.
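To make the minimization in eq. 2 concrete, here is a minimal R sketch (assuming the age/blood-pressure vectors X and Y introduced earlier) that minimizes f(a, b) numerically and compares the result with R's built-in least-squares fit:

```r
# f(a, b) from eq. 2: the sum of squared deviations of y from the line
sse <- function(p) sum((Y - (p[1] + p[2] * X))^2)

# Numerical minimization; should approach the least-squares solution
optim(c(0, 0), sse)$par

# Closed-form least-squares fit for comparison
coef(lm(Y ~ X))
```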
## Least Squares Trend Line Fitting

Equation 3 is very notable, because it means that the least squares trend line MUST run through the point defining the mean of X and the mean of Y, i.e., (x̄, ȳ).

Defining the y-intercept provides a second point, and a line can be drawn.

[Scatter plot of Blood Pressure (mm Hg), 65-95, against Age (yrs), 20-60, marking the y-intercept and the point (x̄, ȳ)]
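A quick check of this property in R (assuming a fitted model1 <- lm(Y ~ X) for the age/blood-pressure data):

```r
# The least-squares line always passes through (mean(X), mean(Y))
all.equal(unname(predict(model1, list(X = mean(X)))), mean(Y))  # TRUE
```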

## Least Squares Trend Line Fitting
- Procedure -

1. Determine Σx, Σx², Σy, Σxy

## Least Squares Trend Line Fitting
- Lexicon -

Before proceeding further, we need to develop and clarify a shorthand notation:

Sxx = sum of the squared-x deviations
Syy = sum of the squared-y deviations
Sxy = sum of the products of deviations
s²y·x = variance of y about x
## Least Squares Trend Line Fitting
- Example -

|   x    |   y   |  x²   |   xy   |
|--------|-------|-------|--------|
|   31   |  7.8  |  961  | 241.8  |
|   32   |  8.3  | 1024  | 265.6  |
|   33   |  7.6  | 1089  | 250.8  |
|   34   |  9.1  | 1156  | 309.4  |
|   35   |  9.6  | 1225  | 336.0  |
|   35   |  9.8  | 1225  | 343.0  |
|   40   | 11.8  | 1600  | 472.0  |
|   41   | 12.1  | 1681  | 496.1  |
|   42   | 14.7  | 1764  | 617.4  |
|   46   | 13.0  | 2116  | 598.0  |
| Σ: 369 | 103.8 | 13841 | 3930.1 |

Mean x = 36.90; Mean y = 10.38.

Lastly, we need to calculate:

Sxx: sum of squared-x deviations
Sxy: sum of xy-product deviations

## Least Squares Trend Line Fitting
- Example -

$$S_{xx} = \sum x^2 - \frac{(\sum x)^2}{N} = 224.9$$

$$S_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{N} = 99.88$$

$$b = \frac{S_{xy}}{S_{xx}} = 0.444$$

$$a = \bar{y} - b\bar{x} = -6.0076$$
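As a check, the same quantities can be computed directly in R; a minimal sketch using the data from the table above:

```r
x <- c(31, 32, 33, 34, 35, 35, 40, 41, 42, 46)
y <- c(7.8, 8.3, 7.6, 9.1, 9.6, 9.8, 11.8, 12.1, 14.7, 13.0)
n <- length(x)

Sxx <- sum(x^2) - sum(x)^2 / n           # 224.9
Sxy <- sum(x * y) - sum(x) * sum(y) / n  # 99.88
b   <- Sxy / Sxx                         # 0.444
a   <- mean(y) - b * mean(x)             # -6.0076

coef(lm(y ~ x))                          # should agree with a and b
```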

## Least Squares Trend Line Fitting
- Example -

[Scatter plot of the example data with the fitted line: Y = 0.444 X - 6.007]
## Least Squares Trend Line Fitting
- Caveat -

You have just fitted your first regression line!

You have also created a linear model with which you could predict any value of y from x.

However, BEWARE: this model is only valid in the region of x that you examined.

[Hypothetical data: the fitted line tracks the actual response only within the valid prediction range; outside that range, the predicted and actual responses diverge]

## OLS using R

Let's return to our original age and blood pressure example:

```r
> X
 [1] 28 23 52 42 27 29 43 34 40 28
> Y
 [1] 70 68 90 75 68 80 78 70 80 72
> model1 <- lm(Y~X)
> model1

Call:
lm(formula = Y ~ X)

Coefficients:
(Intercept)            X  
    53.6934       0.6187  
```

## OLS using R

```r
> plot(X, Y, cex=1.5)
> abline(lm(Y~X))
```

[Scatter plot of Y (70-90) against X (25-50) with the fitted regression line]
## OLS using R

```r
> fitted(model1)
       1        2        3        4        5 
71.01666 67.92322 85.86517 79.67829 70.39797 
       6        7        8        9       10 
71.63535 80.29698 74.72879 78.44092 71.01666 

> resid(model1)
          1           2           3           4           5 
-1.01665799  0.07678293  4.13482561 -4.67829256 -2.39796981 
          6           7           8           9          10 
 8.36465383 -2.29698074 -4.72878709  1.55908381  0.98334201 

> segments(X, fitted(model1), X, Y)
```

[The same scatter plot with vertical segments joining each observed point to its fitted value on the line]

## Model Testing

To use the least squares trend line for predictive purposes, two conditions are necessary:

(1) The data must satisfy the assumptions of the linear regression model (next slide).

(2) The straight line being fitted is not horizontal (i.e., β ≠ 0). In other words, the regression line needs to be a better predictor of y than the mean of y.

## Model Testing

To meet the conditions for the regression of y on x:

- The x values must be fixed by the investigator and have negligible error.
- For each value of x there is a normal distribution of y values.
- The distributions of y around each x must have similar variances (i.e., a common variance σ²y·x, read as "variance of y independent of x").
- The expected values of y for each x lie on a straight line.
## Model Testing

Recall the model:

$$y = \alpha + \beta x + \varepsilon$$

in which the εs are:
- normally distributed
- have a mean of zero and variance σ²y·x
- independent of the xs
- independent of each other

## Model Testing

Once again, recall that the εs, or residuals, are represented by the difference between the observed ys and the predicted ŷs, and are represented in the diagram by the vertical red lines.

[Scatter plot of Blood Pressure (mm Hg), 65.0-95.0, against Age (yrs), 20.0-60.0, with vertical red residual lines running from each point to the fitted line]
## Model Testing

Thus, the assumptions of linear regression are largely tied to the behavior of the residuals!

We check them by examining the εs (see the sketch below):

(1) Examine the normality of the εs.
(2) Check linearity by plotting the εs against the predicted values of y (there should be random scatter around ε = 0).
(3) Check equality of variance by plotting the εs against the xs.
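A minimal R sketch of these three checks, assuming model1 <- lm(Y ~ X) from the age/blood-pressure example:

```r
e <- resid(model1)

# (1) Normality of the residuals
hist(e)
qqnorm(e); qqline(e)

# (2) Linearity: residuals vs. fitted values should scatter
#     randomly around e = 0 with no systematic pattern
plot(fitted(model1), e); abline(h = 0)

# (3) Equality of variance: residuals vs. x should form a
#     horizontal band rather than a fan
plot(X, e); abline(h = 0)
```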

## Normality

You may subject the residuals to the same measures of skewness, kurtosis, and tests of normality that we have previously used in univariate analysis.

## Linearity

A graph of the εs against the expected values of y (the ŷs) should produce a band centered around ε = 0.

Any other systematic pattern is indicative of a nonlinear trend.
## Equality of Variance

Variances that are independent of x (i.e., homogeneous) will result in a horizontal band of points around ε = 0.

Variances that are dependent upon x will have a fan shape.

## Independence

If the observations have a natural sequence in time or space, they may suffer from autocorrelation.

Plot residuals against rank order in time or space to check.
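A brief sketch of this check in R (again assuming model1; the age/blood-pressure data have no natural ordering, so this is purely illustrative):

```r
e <- resid(model1)

# Residuals in observation (rank) order; trends or runs suggest autocorrelation
plot(seq_along(e), e, type = "b"); abline(h = 0)

# Autocorrelation function of the residuals; spikes beyond the
# confidence bands at nonzero lags indicate serial dependence
acf(e)
```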

## Model Testing

If the diagrams do not reveal a substantive departure from the assumptions, we have not proved the model correct, but there is no evidence that it is wrong.

If a problem is revealed, consider a closer examination of outliers, transformations, or switching to another model.
## Model Testing

If we determine that a straight line is a reasonable model and we have not violated any obvious assumptions, we need to more closely examine the parameters of the model.

Specifically, we should examine the characteristics of the slope, because this provides information about the nature of the relationship between x and y:

- If β = 0, there is no relationship.
- If β > 0, there is a positive relationship.
- If β < 0, there is a negative relationship.

## Model Testing
- Beta -

As in previous situations, though, we are left with an important question:

Q. If b ≠ 0, how non-zero does it have to be to be significantly non-zero?

A. Once again, this is tied to issues of sample size and variance. Statistical tests have been developed to test the individual model parameters.
## Model Testing
- Beta -

To test whether the line is horizontal or not, we must perform an explicit test of H₀: β = 0 using a form of t-test.

$$H_0: \beta = 0 \qquad H_a: \beta \neq 0,\ \beta < 0,\ \text{or}\ \beta > 0$$

$$\text{Test statistic: } t = \frac{b - \beta_0}{s_{y \cdot x} / \sqrt{S_{xx}}}$$

where:

$$b = S_{xy} / S_{xx}, \qquad s^2_{y \cdot x} = (S_{yy} - b\,S_{xy})/(n - 2)$$

and:

$$S_{xx} = \sum x^2 - (\sum x)^2/n, \quad S_{yy} = \sum y^2 - (\sum y)^2/n, \quad S_{xy} = \sum xy - (\sum x)(\sum y)/n$$

For a two-tailed test, reject H₀ if |t| ≥ t(α/2, n−2). NB: df = n − 2.
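A minimal R sketch of this test, computed by hand for the age (X) and blood-pressure (Y) data; the results can be checked against summary(model1):

```r
n   <- length(X)
Sxx <- sum(X^2) - sum(X)^2 / n
Syy <- sum(Y^2) - sum(Y)^2 / n
Sxy <- sum(X * Y) - sum(X) * sum(Y) / n

b    <- Sxy / Sxx                    # ~0.619
s2yx <- (Syy - b * Sxy) / (n - 2)    # residual variance of y about x
t    <- b / sqrt(s2yx / Sxx)         # ~4.004

# Two-tailed P-value on n - 2 df; ~0.0039, matching summary(model1)
2 * pt(abs(t), df = n - 2, lower.tail = FALSE)
```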

## Model Testing
- Alpha, Beta, y-hat -

[Confidence-interval formulas for α, β, and ŷ]

## Age vs. BP Example

```r
> summary(model1)

Call:
lm(formula = Y ~ X)

Residuals:
     Min       1Q   Median       3Q      Max 
-4.72879 -2.37272 -0.46994  1.41515  8.36465 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  53.6934     5.5152   9.736 1.04e-05 ***
X             0.6187     0.1545   4.004  0.00393 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.283 on 8 degrees of freedom
Multiple R-squared:  0.6671, Adjusted R-squared:  0.6255 
F-statistic: 16.03 on 1 and 8 DF,  p-value: 0.003928
```
## Age vs. BP Example

```r
> par(mfrow=c(2,2))
> plot(model1)
```

[The four standard diagnostic panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage with Cook's distance contours; the most extreme observations are labeled]

## Model Testing
- Confidence Intervals -

From our least squares trend line fitting, it is obvious that all points do not lie exactly along the fitted line.

Often, we wish to place 95% CIs around our best-fit trend line.

By hand, this is tedious: it requires one to pick sequential x* values, determine the CI₀.₉₅ for each point, and then replot the data. Note that if you do this, you get a pair of flared lines.
## Model Testing
- Confidence Intervals -

Note that outside the CI₀.₉₅ there is another pair of lines, referred to as the PI₀.₉₅.

This is the 95% Prediction Interval and is used to predict a single y from a single x*.

$$CI_{0.95}: \hat{y} \pm t_{\alpha/2,\,n-2}\, s_{y \cdot x} \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}$$

$$PI_{0.95}: \hat{y} \pm t_{\alpha/2,\,n-2}\, s_{y \cdot x} \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}$$

The added 1 in the PI lessens the relative influence of the distance-from-the-mean term, hence less flare on the plot.

## Age vs. BP Example

```r
> plot(X, Y)
> abline(model1)
> XV <- seq(20, 65, 5)
> YV <- predict(model1, list(X=XV), int="c")
> matlines(XV, YV, lty=c(1,2,2))
```

[Scatter plot of Y against X with the fitted line (solid) and the 95% confidence band (dashed)]
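A hedged continuation of the same sketch, overlaying the wider 95% prediction interval:

```r
# int="p" requests the prediction interval; columns are fit, lwr, upr
YP <- predict(model1, list(X = XV), int = "p")
matlines(XV, YP[, 2:3], lty = 3)   # PI bounds lie outside the CI band
```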

## Nonparametric Regression

If the data are suspected to be linear and you are unable to correct either a variance or normality assumption (via transformation), it may be appropriate to conduct a nonparametric regression.

There are a variety of options, one of which is Kendall's robust line-fit method.
## Kendall's Robust Line-fit Method
- Procedure -

Compute a slope (Sji) for EVERY pair of x-values, for i = 1 to n − 1 and j > i:

$$S_{ji} = \frac{Y_j - Y_i}{X_j - X_i}$$

There will be n(n−1)/2 slope estimates per sample.

## Kendall's Robust Line-fit Method
- Procedure -

After computing all possible Sji, the nonparametric estimate b of the slope is the MEDIAN of the Sji values.

Kendall's Rank Correlation Coefficient (an ordering test) can be used to test b for significance.

To estimate the y-intercept, compute the n values of Yi − bXi and again choose the MEDIAN (see the sketch below).
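A minimal R sketch of this procedure (the same estimator is widely known as the Theil-Sen line; the function name here is illustrative):

```r
kendall_linefit <- function(x, y) {
  n <- length(x)
  # All n(n-1)/2 pairwise slopes Sji = (Yj - Yi) / (Xj - Xi)
  slopes <- unlist(lapply(1:(n - 1), function(i) {
    (y[(i + 1):n] - y[i]) / (x[(i + 1):n] - x[i])
  }))
  b <- median(slopes)        # slope: median of the pairwise slopes
  a <- median(y - b * x)     # intercept: median of the Yi - b*Xi
  c(intercept = a, slope = b)
}
```

Applied to the data on the next slide, this reproduces the slide's estimates (b ≈ −0.054, a ≈ 8.78).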

## Kendall's Robust Line-fit Method
- Example -

Data set:

|  X   |  Y   |
|------|------|
|  0.0 | 8.98 |
| 12.0 | 8.14 |
| 29.5 | 6.67 |
| 43.0 | 6.08 |
| 53.0 | 5.90 |
| 62.5 | 5.83 |
| 75.5 | 4.68 |
| 85.0 | 4.20 |
| 93.0 | 3.72 |

Calculate the Sjis:

S21 = (8.14 − 8.98)/(12 − 0) = −0.07000
S32 = (6.67 − 8.14)/(29.5 − 12) = −0.08400
...
S31 = (6.67 − 8.98)/(29.5 − 0) = −0.07831
...
S91 = (3.72 − 8.98)/(93.0 − 0) = −0.05656

Median of the 36 slopes: b = −0.05436
## Kendall's Robust Line-fit Method
- Example -

The y-intercept is the median of the n values computed using Yi − bXi.

THEREFORE: Y = 8.783 − 0.05436 X

## Correlation

There are many purposes to regression, but the main one is prediction. Thus, the goal is to determine the NATURE of the relationship between two variables.

Often, as a next step or for other reasons, one wishes simply to determine the STRENGTH of the relationship; in that case, one would do a correlation analysis.

NB: Regression and correlation are different, but not mutually exclusive, techniques.

## Correlation

The correlation coefficient is determined using the same sample statistics as used in regression:

$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$$

In all cases, −1 ≤ r ≤ +1:
- r = 0 is no relationship
- r = ±1 is a perfect relationship (positive or negative)
## Coefficient of Determination

To demonstrate the inter-relatedness of correlation and regression, let's return to regression momentarily to tie up a loose end.

When we discuss the variability in y which is explained by the linear association between x and y, we frequently refer to the coefficient of determination, R²:

$$R^2 = \frac{S_{xy}^2}{S_{xx} S_{yy}}$$

## Coefficient of Determination

If R² is large (close to 1.0), virtually all of the variability is explained by the relationship; knowledge of x permits knowledge of y.

If R² is close to 0, there is no relationship, and knowledge of x permits no insight into y.

R² is sometimes symbolized as r² (since it is the square of the correlation coefficient) but, to clearly differentiate the two, use a capital R for regression and a lowercase r for correlation.
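A minimal R check of both quantities, assuming the age (X) and blood-pressure (Y) vectors from earlier:

```r
n   <- length(X)
Sxx <- sum(X^2) - sum(X)^2 / n
Syy <- sum(Y^2) - sum(Y)^2 / n
Sxy <- sum(X * Y) - sum(X) * sum(Y) / n

r <- Sxy / sqrt(Sxx * Syy)   # ~0.817; agrees with cor(X, Y)
r^2                          # ~0.667; matches R-squared from summary(model1)
```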

## Correlation

The assumptions of the correlation model differ from those of regression:

- Subjects are sampled at random.
- Both x and y contain sampling variability.
- For each value of x there is a normal distribution of ys.
- For each value of y there is a normal distribution of xs.
- The x distributions have the same variance.
- The y distributions have the same variance.
- The joint distribution of x and y is bivariate normal.
## Regression Model vs. Correlation Model

[Paired diagrams contrasting the regression model (distributions of y at fixed x values) with the correlation model (a joint bivariate normal distribution of x and y)]

## Nonparametric Correlation

When we discuss the correlation coefficient, it is usually assumed that we are referring to the parametric correlation coefficient, which is most correctly referred to as the Pearson Product-Moment Correlation Coefficient.

However, recognize that there are MANY types of correlation coefficients to deal with different situations. One, most often used upon failure of the parametric assumptions, is the nonparametric Spearman's Rank Correlation Coefficient.

## Spearman's Rank Correlation
- Procedure -

H₀: E(rs) = 0 (i.e., the ranking of the x variable is independent of the ranking of the y variable)
Hₐ: E(rs) ≠ 0, E(rs) > 0, or E(rs) < 0

$$r_s = 1 - \frac{6 \sum d^2}{N(N^2 - 1)} \qquad \text{with } d = r_x - r_y \text{ (the difference in x, y ranks)}$$

$$\text{Test statistic: } z = r_s \sqrt{n - 1}$$

The test can be performed on continuous data converted to ranks, or on ordinal data.
## Spearman's Rank Correlation
- Example -

Rank of rat health by two observers at the end of an endurance experiment:

| Rat | Obs-1 | Obs-2 |  d  | d² |
|-----|-------|-------|-----|----|
|  1  |   4   |   4   |  0  | 0  |
|  2  |   1   |   2   | −1  | 1  |
|  3  |   6   |   5   |  1  | 1  |
|  4  |   5   |   6   | −1  | 1  |
|  5  |   3   |   1   |  2  | 4  |
|  6  |   2   |   3   | −1  | 1  |
|  7  |   7   |   7   |  0  | 0  |

Σd² = 8

rs = 1 − (6)(8)/((7)(48)) = 0.857

z = 2.099, P = 0.018 → reject H₀
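A hedged R sketch of this example, first by hand following the formula above, then with the built-in test:

```r
obs1 <- c(4, 1, 6, 5, 3, 2, 7)   # ranks from observer 1
obs2 <- c(4, 2, 5, 6, 1, 3, 7)   # ranks from observer 2

# By hand, following the slide's formula
d  <- obs1 - obs2
rs <- 1 - 6 * sum(d^2) / (length(d) * (length(d)^2 - 1))  # 0.857
z  <- rs * sqrt(length(d) - 1)                            # 2.099
pnorm(z, lower.tail = FALSE)                              # ~0.018, one-tailed

# Built-in equivalent; note cor.test() computes an exact p-value
# for small n, so it will not match the normal approximation exactly
cor.test(obs1, obs2, method = "spearman", alternative = "greater")
```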
