
Regression and Correlation

Introduction
Simple Linear Regression & GLM
Least Squares Trend Line Fitting
Model Testing
Inference and Regression
Regression Diagnostics
Correlation
Chpts. 16 & 17 W&S

Introduction

Recall that, to date, we have focused on statistics examining one variable from either one or two samples.

We now turn our attention to examining the relationships between two variables from one sample. These statistics are often referred to as bivariate statistics (as opposed to univariate).

Regression and correlation are the major approaches to bivariate analysis.

Multiple regression can be used to extend the case to three or more variables.

Regression

Regression is a method used to predict the value of one numerical variable from another.

The most commonly encountered type of regression is simple linear regression, which draws a straight line through a cloud of points to predict the response variable (Y) from the explanatory variable (X).
Simple Linear Regression

A typical question asked often by biologists is: how does variable Y change as a function of variable X?

[Scatter plot of Y against X]

X is the independent or predictor variable. Independent because it is assigned by the investigator and is independent of measurement error.

Y is the dependent or response variable.

Simple Linear Regression

The simplest relationship between two variables is a straight line (the most parsimonious solution).

Simple indicates that there is only one independent variable.

Linear indicates the nature of the model type.

Hence, Simple Linear Regression.

Plot the Data! Plot the Data!

As with univariate analysis, your first look at the data should always be some form of exploratory graphical analysis.

The simplest form of a graph used to relate two variables is referred to as a scatter plot.

NEVER skip this step! The data may not even be linear, and a different model may be more appropriate.
Simple Linear Regression
- Example -

Let's examine a simple example that looks at the effect of age on blood pressure:

X: 28 23 52 42 27 29 43 34 40 28
Y: 70 68 90 75 68 80 78 70 80 72

[Initial scatter plot: X from 0 to 60, Y from 0 to 80]

In a simple initial scatterplot, Y does not appear to respond to X.

Simple Linear Regression
- Example -

However, be careful with your data plot! The previous scatterplot is misleading.

Changing the (1) aspect ratio and (2) scale reveals a more obvious and pronounced trend than previously apparent.

[Scatter plot of Blood Pressure (mm Hg), 65 to 95, against Age (years), 20 to 60]

Simple Linear Regression
- Example -

After plotting the data, we see what looks like a linear cloud of data.

We must now fit a linear model to those data.

Recall from basic algebra and geometry that the equation for a straight line is:

y = bx + a

where b is the slope and a is the y-intercept.
The General Linear Model

We can reformulate our equation for a line into a more formal and utilitarian model:

y = α + βx + ε

Where: α = the y-intercept
       β = the slope
       ε = error (deviation from the mean)

This is the General Linear Model (GLM)!

The General Linear Model

But, we still need to fit a line through the cloud of points.

We use a procedure known as Least Squares Trend Line Fitting.

We attempt to minimize the εs.

[Scatter plot of Blood Pressure (mm Hg), 65.0 to 95.0, against Age (yrs), 20.0 to 60.0]

Least Squares Trend Line Fitting

The procedure is analogous to what we have already done in univariate analysis, whereby we calculate the least squares for variance determinations.

ŷ = a + bx              (1)   Assume the line takes the form of eq. 1
                              (ŷ is a predicted value of y for a value of x)

f(a, b) = Σ(y − ŷ)²     (2)   The goal is to minimize the function of eq. 2

a = ȳ − b·x̄             (3)   Solving simultaneously, you will find eq. 3
Least Squares Trend Line Fitting

Equation 3 is very notable, because it means that the least squares trend line MUST run through the point (x̄, ȳ) defined by the mean of X and the mean of Y.

The y-intercept provides a second point, and a line can be drawn.

[Scatter plot of Blood Pressure (mm Hg), 65 to 95, against Age (yrs), 20 to 60, with the mean point (x̄, ȳ) and the y-intercept marked]
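The pass-through-the-means property of eq. 3 is easy to verify numerically. A minimal sketch (in Python rather than the deck's R, so it runs with no dependencies; not part of the original slides) using the age/blood-pressure data from the earlier example:

```python
# Verify that a = y-bar - b*x-bar forces the fitted line through (x-bar, y-bar).
age = [28, 23, 52, 42, 27, 29, 43, 34, 40, 28]   # X from the example
bp  = [70, 68, 90, 75, 68, 80, 78, 70, 80, 72]   # Y from the example

n = len(age)
xbar = sum(age) / n
ybar = sum(bp) / n
Sxx = sum(x * x for x in age) - sum(age) ** 2 / n
Sxy = sum(x * y for x, y in zip(age, bp)) - sum(age) * sum(bp) / n

b = Sxy / Sxx            # slope
a = ybar - b * xbar      # eq. 3: y-intercept

# The fitted line evaluated at x-bar returns y-bar (exactly, by construction).
print(a + b * xbar, ybar)
```

Because a is defined as ȳ − b·x̄, the identity holds exactly, not just approximately.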

Least Squares Trend Line Fitting
- Procedure -

1. Determine Σx, Σx², Σy, Σxy

2. Calculate the means of x & y

3. Calculate the slope via Sxx and Sxy

4. Calculate the y-intercept

5. Draw the line through the means and the y-intercept

Least Squares Trend Line Fitting
- Lexicon -

Before proceeding further, we need to develop and clarify a shorthand notation:

Sxx = sum of the squared-x deviations
Syy = sum of the squared-y deviations
Sxy = sum of the products of the x and y deviations
s²y·x = variance of y about x
Least Squares Trend Line Fitting
- Example -

   x      y      x²      xy
  31     7.8     961    241.8
  32     8.3    1024    265.6
  33     7.6    1089    250.8
  34     9.1    1156    309.4
  35     9.6    1225    336.0
  35     9.8    1225    343.0
  40    11.8    1600    472.0
  41    12.1    1681    496.1
  42    14.7    1764    617.4
  46    13.0    2116    598.0
 369   103.8   13841   3930.1   (sums)

Mean x = 36.90
Mean y = 10.38

Lastly, we need to calculate:
Sxx: the sum of squared-x deviations
Sxy: the sum of xy-product deviations

Least Squares Trend Line Fitting
- Example -

Sxx = Σx² − (Σx)²/N            Sxx = 224.9

Sxy = Σxy − (Σx)(Σy)/N         Sxy = 99.88

b = Sxy / Sxx                  b = 0.444

a = ȳ − b·x̄                    a = −6.0076

Thus, ŷ = a + bx = −6.0076 + (0.444)x
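The hand calculation above can be reproduced in a few lines. A sketch in Python (chosen over the deck's R for a dependency-free check; not part of the original slides) of the same sums and deviations:

```python
# Least squares by hand: the worked example's data and computing formulas.
x = [31, 32, 33, 34, 35, 35, 40, 41, 42, 46]
y = [7.8, 8.3, 7.6, 9.1, 9.6, 9.8, 11.8, 12.1, 14.7, 13.0]
N = len(x)

Sxx = sum(v * v for v in x) - sum(x) ** 2 / N                   # 224.9
Sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / N  # 99.88
b = Sxy / Sxx                                                   # slope, ~0.444
a = sum(y) / N - b * sum(x) / N                                 # intercept, ~-6.0076
print(round(Sxx, 2), round(Sxy, 2), round(b, 3), round(a, 4))
```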

Least Squares Trend Line Fitting
- Example -

[Plot of the fitted line Y = 0.444 X − 6.007 over the data, X from 0 to 50, Y from −10 to 20]
Least Squares Trend Line Fitting
- Caveat -

You have just fitted your first regression line!

You have also created a linear model against which you could predict any value of y from x.

However, BEWARE: this model is only valid in the region of x that you examined.

[Hypothetical data: the actual response curve vs. the response predicted via the fitted line, showing the valid prediction range]

OLS using R

Let's return to our original age and blood pressure example:

> X
 [1] 28 23 52 42 27 29 43 34 40 28
> Y
 [1] 70 68 90 75 68 80 78 70 80 72
> model1 <- lm(Y~X)
> model1

Call:
lm(formula = Y ~ X)

Coefficients:
(Intercept)            X
    53.6934       0.6187

> plot(X, Y, cex=1.5)
> abline(lm(Y~X))

[Scatter plot of Y, 70 to 90, against X, 25 to 50, with the fitted line]
> fitted(model1)
       1        2        3        4        5
71.01666 67.92322 85.86517 79.67829 70.39797
       6        7        8        9       10
71.63535 80.29698 74.72879 78.44092 71.01666

> resid(model1)
          1           2           3           4           5
-1.01665799  0.07678293  4.13482561 -4.67829256 -2.39796981
          6           7           8           9          10
 8.36465383 -2.29698074 -4.72878709  1.55908381  0.98334201

> segments(X, fitted(model1), X, Y)

[Scatter plot of Y, 70 to 90, against X, 25 to 50, with vertical residual segments drawn to the fitted line]

Model Testing

To use the least squares trend line for predictive purposes, two conditions are necessary:

(1) The straight line model fits the data

(2) The straight line being fitted is not horizontal (i.e., β ≠ 0). In other words, the regression line needs to be a better predictor of y than the mean of y.

Model Testing

To meet the conditions for the regression of y on x:

The x values must be fixed by the investigator and have negligible error.
For each value of x there is a normal distribution of y values.
The distributions of y around each x must have similar variances (i.e., a common variance σ²y·x [read as the variance of y independent of x]).
The expected values of y for each x lie on a straight line.
Model Testing

In other words, we can say that we have satisfied the model:

y = α + βx + ε

In which the εs:
- are normally distributed
- have a mean of zero and variance σ²y·x
- are independent of the xs
- are independent of each other

Model Testing

Once again, recall that the εs, or residuals, are represented by the difference between the observed ys and the predicted ŷs, and are represented in the diagram by the vertical red lines.

[Scatter plot of Blood Pressure (mm Hg), 65.0 to 95.0, against Age (yrs), 20.0 to 60.0, with vertical residual lines drawn to the fitted line]
Model Testing

Thus, the assumptions of linear regression are largely tied to the behavior of the residuals!

We test for the violation of assumptions primarily by examining the εs:

(1) Examine the normality of the εs.
(2) Check linearity by plotting the εs against the predicted values of y (there should be random scatter around ε = 0).
(3) Check equality of variance by plotting the εs against the xs.
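The checks above start from the εs themselves. A Python sketch (not part of the original slides; it reuses the age/blood-pressure data from the earlier example) of computing the fitted values and residuals that would feed those diagnostic plots:

```python
# Compute fitted values and residuals for the age/blood-pressure fit.
X = [28, 23, 52, 42, 27, 29, 43, 34, 40, 28]
Y = [70, 68, 90, 75, 68, 80, 78, 70, 80, 72]
n = len(X)

Sxx = sum(x * x for x in X) - sum(X) ** 2 / n
Sxy = sum(x * y for x, y in zip(X, Y)) - sum(X) * sum(Y) / n
b = Sxy / Sxx
a = sum(Y) / n - b * sum(X) / n

fitted = [a + b * x for x in X]
resid = [y - f for y, f in zip(Y, fitted)]

# Least squares guarantees the residuals sum to zero; plotting resid
# against fitted and against X is what tests the model assumptions.
print(round(sum(resid), 10))
```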

Normality

You may subject the residuals to the same measures of skewness, kurtosis, and tests of normality that we have previously used in univariate analysis.

Linearity

A graph of the εs against the expected values of y (the ŷs) should produce a band centered around ε = 0.

Any other systematic pattern is indicative of a nonlinear trend.
Equality of Variance

Variances that are independent of x (i.e., homogeneous) will result in a horizontal band of points around ε = 0.

Variances that are dependent upon x will have a fan shape.

Independence

If the observations have a natural sequence in time or space, they may suffer from autocorrelation.

Plot the residuals against rank order in time or space to check.

Model Testing

If the diagrams do not reveal a substantive departure from the assumptions, we have not proved the model correct, but there is no evidence that it is wrong.

If a problem is revealed, consider a closer examination of outliers, transformations, or switching to another model.
Model Testing

If we determine that a straight line is a reasonable model and we have not violated any obvious assumptions, we need to more closely examine the parameters of the model.

Specifically, we should examine the characteristics of the slope, because this provides information about the nature of the relationship between x and y.

If β = 0, then there is no relationship
   β > 0, then there is a positive relationship
   β < 0, then there is a negative relationship

Model Testing
- Beta -

As in previous situations though, we are left with an important question:

Q. If b ≠ 0, how non-zero does it have to be to be significantly non-zero?

A. Once again, this is tied to issues of sample size and variance. Statistical tests have been developed to test the individual model parameters.
Model Testing
- Beta -

To test whether the line is horizontal or not, we must perform an explicit test of Ho: β = 0 using a form of t-test.

Ho: β = 0
Ha: β ≠ 0, β < 0, or β > 0

Test statistic: t = (b − βo) / (s_y·x / √Sxx)

where: b = Sxy / Sxx
       s²y·x = (Syy − b·Sxy) / (n − 2)
and:   Sxx = Σx² − (Σx)²/n
       Syy = Σy² − (Σy)²/n
       Sxy = Σxy − (Σx)(Σy)/n

For a two-tailed test: reject Ho if |t| ≥ t(α/2, n−2)

NB: df = n − 2
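The test statistic can be computed directly from the sums. A Python sketch (not part of the original slides) for the age/blood-pressure data, using the tabled critical value t(0.025, 8) = 2.306:

```python
# t-test of Ho: beta = 0 for the age/blood-pressure regression.
X = [28, 23, 52, 42, 27, 29, 43, 34, 40, 28]
Y = [70, 68, 90, 75, 68, 80, 78, 70, 80, 72]
n = len(X)

Sxx = sum(x * x for x in X) - sum(X) ** 2 / n
Syy = sum(y * y for y in Y) - sum(Y) ** 2 / n
Sxy = sum(x * y for x, y in zip(X, Y)) - sum(X) * sum(Y) / n

b = Sxy / Sxx                        # slope
s2_yx = (Syy - b * Sxy) / (n - 2)    # variance of y about x
t = (b - 0) / (s2_yx ** 0.5 / Sxx ** 0.5)

# Two-tailed test at alpha = 0.05 with n - 2 = 8 df: t(0.025, 8) = 2.306.
print(round(t, 3), abs(t) >= 2.306)
```

Since |t| exceeds the critical value, Ho: β = 0 is rejected for these data.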

Model Testing
- Alpha, Beta, y-hat -

[Table of tests for the individual parameters α, β, and ŷ]

> model1 <- lm(Y~X)            Age vs. BP Example
> summary(model1)

Call:
lm(formula = Y ~ X)

Residuals:
    Min      1Q  Median      3Q     Max
-4.7288 -2.3727 -0.4699  1.4151  8.3647

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  53.6934     5.5153   9.735 1.04e-05 ***
X             0.6187     0.1545   4.004  0.00393 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.283 on 8 degrees of freedom

Multiple R-squared: 0.6671, Adjusted R-squared: 0.6255
F-statistic: 16.03 on 1 and 8 DF, p-value: 0.003928
> par(mfrow=c(2,2))            Age vs. BP Example
> plot(model1)

[Four diagnostic panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance contours)]

Model Testing
- Confidence Intervals -

From our least squares trend line fitting, it is obvious that all points do not lie exactly along the fitted line.

Often, we wish to place 95% CIs around our best-fit trend line.

By hand, this is tedious. It requires one to pick sequential x* values, determine the CI0.95 at each point, and then replot the data. Note that if you do this, you get a pair of flared lines.
Model Testing
- Confidence Intervals -

Note that outside the CI0.95 there is another pair of lines referred to as the PI0.95.

This is the 95% Prediction Interval and is used to predict a single y from a single x*.

CI0.95: ŷ ± t(α/2, n−2) · s_y·x · √( 1/n + (x* − x̄)² / Sxx )

PI0.95: ŷ ± t(α/2, n−2) · s_y·x · √( 1 + 1/n + (x* − x̄)² / Sxx )

The added 1 lessens the relative influence of the distance from the means, hence less flare on the plot.
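The flare in the bands comes from the (x* − x̄)² term under the root. A Python sketch (not part of the original slides; t(0.025, 8) = 2.306 taken from tables) comparing CI and PI half-widths at and away from x̄ for the age/blood-pressure fit:

```python
# Half-widths of the 95% CI and PI at chosen x* values.
X = [28, 23, 52, 42, 27, 29, 43, 34, 40, 28]
Y = [70, 68, 90, 75, 68, 80, 78, 70, 80, 72]
n = len(X)
xbar = sum(X) / n

Sxx = sum(x * x for x in X) - sum(X) ** 2 / n
Syy = sum(y * y for y in Y) - sum(Y) ** 2 / n
Sxy = sum(x * y for x, y in zip(X, Y)) - sum(X) * sum(Y) / n
b = Sxy / Sxx
s_yx = ((Syy - b * Sxy) / (n - 2)) ** 0.5
t_crit = 2.306  # t(0.025, 8), from tables

def ci_half(xstar):   # 95% confidence-interval half-width at x*
    return t_crit * s_yx * (1 / n + (xstar - xbar) ** 2 / Sxx) ** 0.5

def pi_half(xstar):   # 95% prediction-interval half-width at x*
    return t_crit * s_yx * (1 + 1 / n + (xstar - xbar) ** 2 / Sxx) ** 0.5

# Both intervals are narrowest at x-bar and flare as x* moves away;
# the PI is always wider because of the extra "1" under the root.
print(ci_half(xbar), ci_half(50), pi_half(xbar))
```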

> plot(X,Y)                    Age vs. BP Example
> abline(model1)
> XV<-seq(20,65,5)
> YV<-predict(model1, list(X=XV), int="c")
> matlines(XV, YV, lty=c(1,2,2))

[Scatter plot of Y, 70 to 90, against X, 25 to 50, with the fitted line and flared 95% confidence bands]

Nonparametric Regression

If the data are suspected to be linear and you are unable to correct either a variance or normality assumption (via transformation), it may be appropriate to conduct a nonparametric regression.

There are a variety of options, one of which is Kendall's robust line-fit method.
Kendall's Robust Line-fit Method
- Procedure -

The method is fairly straightforward:

Rank order the x, y pairs based on the xs.

Compute a slope (Sji) for EVERY pair of x-values, for i = 1 to n − 1 and j > i:

Sji = (Yj − Yi) / (Xj − Xi)

There will be n(n − 1)/2 slope estimates per sample.

Kendall's Robust Line-fit Method
- Procedure -

After computing all possible Sji, the nonparametric estimate b of the slope is the MEDIAN of the Sji values.

Kendall's Rank Correlation Coefficient (an ordering test) can be used to test b for significance.

To estimate the y-intercept, compute the n values of Yi − bXi and again choose the MEDIAN.

Kendall's Robust Line-fit Method
- Example -

Data Set:               Calculate the Sjis:
   X      Y
  0.0    8.98           S21 = (8.14 − 8.98)/(12 − 0)    = −0.07000
 12.0    8.14           S31 = (6.67 − 8.98)/(29.5 − 0)  = −0.07831
 29.5    6.67           S32 = (6.67 − 8.14)/(29.5 − 12) = −0.08400
 43.0    6.08           ...
 53.0    5.90           S91 = (3.72 − 8.98)/(93.0 − 0)  = −0.05656
 62.5    5.83
 75.5    4.68           Median of the 36 slopes:
 85.0    4.20           b = −0.05436
 93.0    3.72
Kendall's Robust Line-fit Method
- Example -

To estimate the y-intercept, compute the 9 values of ai using Yi − bXi.

For i = 1: a1 = 8.98 − (−0.05436)(0.0) = 8.980
For i = 2: a2 = 8.14 − (−0.05436)(12.0) = 8.792

MEDIAN of the n intercepts: a = 8.783

THEREFORE: Y = 8.783 − 0.05436 X
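The whole procedure fits in a few lines. A Python sketch (not part of the original slides) for the 9-point data set above:

```python
# Kendall's robust line-fit: median of all pairwise slopes, then the
# median of the n candidate intercepts Yi - b*Xi.
from statistics import median

pts = [(0.0, 8.98), (12.0, 8.14), (29.5, 6.67), (43.0, 6.08), (53.0, 5.90),
       (62.5, 5.83), (75.5, 4.68), (85.0, 4.20), (93.0, 3.72)]
n = len(pts)

# One slope per pair of points: n(n-1)/2 = 36 estimates for n = 9.
slopes = [(pts[j][1] - pts[i][1]) / (pts[j][0] - pts[i][0])
          for i in range(n - 1) for j in range(i + 1, n)]
b = median(slopes)                      # robust slope estimate

a = median(y - b * x for x, y in pts)   # robust intercept estimate
print(len(slopes), round(b, 5), round(a, 3))
```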

Correlation

There are many purposes to regression, but the main one is prediction. Thus, the goal is to determine the NATURE of the relationship between two variables.

Often, as a next step or for other reasons, one wishes just to determine the STRENGTH of the relationship; to do so, one would perform a correlation analysis.

The product is a correlation coefficient, r.

NB: Regression and correlation are different, but not mutually exclusive, techniques.

Correlation

The correlation coefficient is determined using the same sample statistics as used in regression:

r = Sxy / √(Sxx · Syy)

In all cases, −1 ≤ r ≤ +1
r = 0 is no relationship
r = ±1 is a perfect relationship (pos. or neg.)
Coefficient of Determination

To demonstrate the inter-relatedness of correlation & regression, let's return to regression momentarily to tie up a loose end.

When we discuss the variability in y which is explained by the linear association between x and y, we frequently refer to the coefficient of determination, R²:

R² = S²xy / (Sxx · Syy)
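The identity R² = r² is easy to confirm numerically. A Python sketch (not part of the original slides) using the age/blood-pressure data:

```python
# r and R-squared computed from the same sums of deviations.
X = [28, 23, 52, 42, 27, 29, 43, 34, 40, 28]
Y = [70, 68, 90, 75, 68, 80, 78, 70, 80, 72]
n = len(X)

Sxx = sum(x * x for x in X) - sum(X) ** 2 / n
Syy = sum(y * y for y in Y) - sum(Y) ** 2 / n
Sxy = sum(x * y for x, y in zip(X, Y)) - sum(X) * sum(Y) / n

r = Sxy / (Sxx * Syy) ** 0.5    # correlation coefficient
R2 = Sxy ** 2 / (Sxx * Syy)     # coefficient of determination

# R2 here matches the Multiple R-squared (0.6671) reported by summary()
# for the regression of Y on X, and equals r squared.
print(round(r, 4), round(R2, 4))
```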

Coefficient of Determination

If R² is large (close to 1.0), virtually all of the variability is explained by the relationship. Knowledge of x permits knowledge of y.

If R² is close to 0, there is no relationship, and knowledge of x permits no insight into y.

R² is sometimes symbolized as r² (since it is the square of the correlation coefficient) but, to clearly differentiate the two, use a capital R for regression and a lowercase r for correlation.

Correlation

The assumptions for correlation are different than for regression:

Subjects are sampled at random.
Both x and y contain sampling variability.
For each value of x there is a normal dist. of ys.
For each value of y there is a normal dist. of xs.
The x distributions have the same variance.
The y distributions have the same variance.
The joint distribution of x and y is bivariate normal.
Regression Model vs. Correlation Model

[Diagram contrasting the regression model and the correlation model]

Nonparametric Correlation

When we discuss the correlation coefficient, it is usually assumed that we are referring to the parametric correlation coefficient, which is most correctly referred to as the Pearson Product Moment Correlation Coefficient.

However, recognize that there are MANY types of correlation coefficients to deal with different situations. One, most often used upon the failure of parametric assumptions, is the nonparametric Spearman's Rank Correlation Coefficient.

Spearman's Rank Correlation
- Procedure -

Ho: E(rs) = 0 (i.e., the ranking of the x variable is independent of the ranking of the y variable)
Ha: E(rs) ≠ 0, E(rs) < 0, or E(rs) > 0

rs = 1 − 6Σd² / (N(N² − 1))    with d = rx − ry (diff. in x, y ranks)

Test statistic: z = rs·√(n − 1)

The test can be performed on continuous data converted to ranks, or on ordinal data.
Spearman's Rank Correlation
- Example -

Rank of rat health by two observers at the end of an endurance experiment.

Rat  Obs-1  Obs-2    d    d²
 1     4      4      0    0
 2     1      2     -1    1
 3     6      5      1    1
 4     5      6     -1    1
 5     3      1      2    4
 6     2      3     -1    1
 7     7      7      0    0
                 Sum d² = 8

rs = 1 − 6Σd² / (N(N² − 1))
rs = 1 − (6)(8)/(7)(48)
rs = 0.857

z = 2.099, P = 0.018
reject Ho
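The same arithmetic in a Python sketch (not part of the original slides; ranks taken from the table above):

```python
# Spearman's rank correlation for the rat-health ranking example.
obs1 = [4, 1, 6, 5, 3, 2, 7]   # ranks assigned by observer 1
obs2 = [4, 2, 5, 6, 1, 3, 7]   # ranks assigned by observer 2
N = len(obs1)

d2 = sum((a - b) ** 2 for a, b in zip(obs1, obs2))   # sum of squared rank diffs
rs = 1 - 6 * d2 / (N * (N ** 2 - 1))                 # Spearman's rs
z = rs * (N - 1) ** 0.5                              # large-sample test statistic
print(d2, round(rs, 3), round(z, 3))
```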
