
Analysis of quantitative outcomes

(AS13)
EPM304 Advanced Statistical Methods in Epidemiology

Course: PG Diploma/ MSc Epidemiology

This document contains a copy of the study material located within the computer
assisted learning (CAL) session.
If you have any questions regarding this document or your course, please contact
DLsupport via DLsupport@lshtm.ac.uk.
Important note: this document does not replace the CAL material found on your
module CDROM. When studying this session, please ensure you work through the
CDROM material first. This document can then be used for revision purposes to
refer back to specific sessions.
These study materials have been prepared by the London School of Hygiene & Tropical Medicine as part of
the PG Diploma/MSc Epidemiology distance learning course. This material is not licensed either for resale
or further copying.
London School of Hygiene & Tropical Medicine September 2013 v1.1

Section 1: Analysis of quantitative outcomes


Aims
To give an introduction to analysing quantitative outcomes in regression
Objectives
By the end of this session you will be able to:
model the relationship between a quantitative outcome and explanatory variable(s)
using linear regression;
interpret the parameters of the regression model and use significance tests to
assess the strength of evidence of an association;
use regression diagnostics to check the model assumptions;
use regression modelling to adjust for confounding of an explanatory variable by
another variable.

Section 2: Planning your study


In SC14 and SC15 you were introduced to assessment of correlation between
two quantitative variables, as well as linear regression of a quantitative outcome
(note: quantitative variables are also referred to as continuous variables).
The aim of this session is to recap and extend this work.
If you need to review any materials before you continue, refer to the appropriate
sessions below.
Correlation: SC14
Linear regression: SC15

2.1: Planning your study


To illustrate the concepts and method of quantitative regression we will use a
simple example and data from one study:
The In-Vitro Fertilization study
Click on this study to see the details below.
Interaction: Hyperlink: The In-Vitro Fertilization study:
Output (appears in new window):
In-Vitro Fertilization study

This study was set up to compare babies conceived following in-vitro fertilization to
those from the general population. The data used here refer to the records of 641
singleton births following in-vitro fertilization (IVF).

Section 3: Background - correlation


We want to examine the association between birth weight and gestational age in
our dataset. Firstly we use a scatter plot:

What statistic could we calculate to determine the strength of association
between these two variables?
Interaction: thought bubble:
Output (appears below):
Pearson's correlation coefficient, r, is a measure between -1 and +1 governed by
the direction and strength of the linear relationship between two quantitative
variables. If one increases as the other increases r is positive, and if one
decreases as the other increases r is negative. If there is no linear relationship
between the two variables then r is 0. In a scatter plot of one variable against
the other, the closer the points are to a straight line the closer the value of r
will be to +1 or -1, i.e. the stronger the linear relationship between the variables.
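
As an illustration of how r is calculated, the sketch below computes Pearson's correlation coefficient from its usual definition. The function name `pearson_r` and the data are hypothetical and are not taken from the IVF study.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares about the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: points lying exactly on a straight line
x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2, 4, 6, 8, 10])    # y increases with x
r_down = pearson_r(x, [10, 8, 6, 4, 2])  # y decreases as x increases
print(r_up, r_down)
```

Points lying exactly on an increasing line give r = +1 and points lying exactly on a decreasing line give r = -1; real data fall somewhere in between.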

3.1: Background - correlation


Below are some examples of some different correlation coefficients of
hypothetical data. Use the drop-down menu to explore what different values of r
might look like graphically.
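Interaction: Pulldown: r = 0.9:
[Figure: scatter plot of y against x for hypothetical data with r = 0.9]
Interaction: Pulldown: r = 0.7:
[Figure: scatter plot of y against x for hypothetical data with r = 0.7]
Interaction: Pulldown: r = 0.3:
[Figure: scatter plot of y against x for hypothetical data with r = 0.3]
Interaction: Pulldown: r = -0.5:
[Figure: scatter plot of y against x for hypothetical data with r = -0.5]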

3.2: Background - correlation


Returning to the association between birth weight and gestational age, displayed
in the graph below, Pearson's correlation coefficient was 0.74.

What does this indicate?


Interaction: Hotspot: Weak positive association
Output:
(appears in new window):
No. Although the correlation coefficient is positive, indicating a positive relationship,
we would generally consider such a large correlation coefficient to indicate a strong
association.
Note however, that there are no fixed rules on what defines a strong versus weak
association.
Interaction: Hotspot: Strong negative association
Output:
(appears in new window):
No. Although this is quite a large correlation coefficient, indicating a strong
association, it is positive, indicating a positive association.
Note however, that there are no fixed rules on what defines a strong versus weak
association.
Interaction: Hotspot: Strong positive association
Output:
(appears in new window):
Yes. This is quite a large correlation coefficient, indicating a strong association, and it
is positive, indicating a positive association.
Note however, that there are no fixed rules on what defines a strong versus weak
association.

3.3: Background - why do we need linear regression?

The correlation coefficient tells us the strength and direction of the linear
relationship, but it does not allow us to quantify the relationship of the two variables,
for example, by how much does one variable change, on average, with a unit change
in the other. It also does not allow us to quantify the relationship of two variables,
while adjusting for confounding with a third variable. It also assumes a linear
relationship between the two variables, which may not be the case. For such
situations we need linear regression.

3.4: Background - linear regression


Simple linear regression between two continuous variables uses a method known as
least squares to derive an equation y = a + bx, with a and b chosen to ensure the
best fit of the line to the data.
Note: y are the values of the dependent or outcome variable plotted on the y-axis
(the vertical axis) and x are the values of the independent or exposure variable
plotted on the x-axis (the horizontal axis).
For example:

Note that a is the predicted value of y when x is zero and b is the gradient of the
slope i.e. a one unit change in x is predicted to lead to a change in y of b units. In
our example, a would be the predicted birth weight for a gestational age of zero and
b would be the increase in birth weight in grams given by a one week increase in
gestational age.

3.5: Background - linear regression


The least squares method works by reducing the distances between each point and
the line. We will now illustrate this using a subset of the data of only seven points:

[Figure: scatter plot of birth weight (grams) against gestational age in weeks for
the seven points, with the fitted regression line]

The vertical distance from each point to the line is called a residual. Press swap to
see these illustrated.
Interaction: Button:
Swap: Output (changes to figure below):

[Figure: the same scatter plot of birth weight (grams) against gestational age in
weeks, with each residual drawn as a vertical line from the point to the fitted line]

3.6: Background - least squares


The least squares estimates of a and b are derived by minimising the sum of the
square of each residual.
By squaring the residuals, we penalise larger residuals, so for example, two residuals
of 0 and 10 would have a sum of squares of 0+100=100, whereas two residuals of 5
and 5 would have a sum of squares of 25+25=50 and so we would prefer the two
residuals of 5 and 5.
Calculate the sum of squares of these residuals to one decimal place:

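[Figure: scatter plot of birth weight (grams) against gestational age in weeks for
the seven points, with the fitted line and each residual labelled. The seven
residuals are: -133.6, 555.2, -666.0, -800.3, 26.6, -305.8, 623.3]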

Interaction: Calculation: (calc)


Output(appears in new window):
Incorrect answer:
No. The sum of squares of the residuals is 623.3^2 + 305.8^2 + 26.6^2 +
555.2^2 + 800.3^2 + 133.6^2 + 666.0^2 = 1892856.2
Correct answer:
Correct
Yes, the sum of squares of the residuals is 623.3^2 + 305.8^2 + 26.6^2 +
555.2^2 + 800.3^2 + 133.6^2 + 666.0^2 = 1892856.2
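
The same arithmetic can be checked with a few lines of code. This is a minimal sketch; the function name `sum_of_squares` is chosen here for illustration.

```python
def sum_of_squares(residuals):
    """Return the sum of the squared residuals."""
    return sum(r ** 2 for r in residuals)


# The two toy cases from the text: squaring penalises the larger residual
toy_unequal = sum_of_squares([0, 10])  # 0 + 100
toy_equal = sum_of_squares([5, 5])     # 25 + 25
print(toy_unequal, toy_equal)

# The seven residuals (in grams) from the exercise
residuals = [-133.6, 555.2, -666.0, -800.3, 26.6, -305.8, 623.3]
rss = sum_of_squares(residuals)
print(round(rss, 1))  # 1892856.2
```

The sign of each residual does not matter once it is squared, which is why the worked answer can use the absolute values.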

3.7: Background - least squares


The estimates of a and b that minimise the sum of squares of the residuals are
given by:

$$b = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$$

$$a = \bar{y} - b\bar{x}$$
Note: these equations are given for information and you are not expected to
memorise them.
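
To make the formulas concrete, here is a minimal sketch that applies them to a small hypothetical dataset (not the IVF data) and confirms that nudging either estimate increases the residual sum of squares. The names `least_squares` and `rss` are illustrative.

```python
def least_squares(xs, ys):
    """Closed-form least squares estimates (a, b) for y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def rss(a, b, xs, ys):
    """Residual sum of squares for the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = least_squares(x, y)
print(round(a, 2), round(b, 2))  # 0.05 1.99

# Moving either estimate away from the least squares value increases the RSS
print(rss(a, b, x, y) < rss(a + 0.1, b, x, y))
print(rss(a, b, x, y) < rss(a, b + 0.1, x, y))
```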

3.8: Background - likelihood theory


An alternative way to estimate a and b is by likelihood theory.
Firstly we rewrite the equation to explicitly include the residuals $e_i$ as follows:
$y_i = a + b x_i + e_i$
Then we assume these residuals $e_i$ are Normally distributed with mean zero and
variance $\sigma^2$, i.e. $N(0, \sigma^2)$. So we can now re-write the log likelihood
for the $e_i$ using:
$e_i = y_i - a - b x_i$
If we now maximise the log likelihood, which is written in terms of the data yi and xi
and the parameters a and b, we obtain the maximum likelihood estimates of a and b.
These turn out to be exactly the same as the least squares estimates i.e.

$$b = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$$

$$a = \bar{y} - b\bar{x}$$
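
For reference, the log likelihood described above can be written out as follows. This is the standard Normal-theory result rather than a formula reproduced from the CAL material, and it assumes n independent observations with residual variance $\sigma^2$.

```latex
\ell(a, b, \sigma^2)
  = -\frac{n}{2}\log\left(2\pi\sigma^2\right)
    - \frac{1}{2\sigma^2}\sum_{i=1}^{n} \left(y_i - a - b x_i\right)^2
```

For any fixed value of $\sigma^2$, the first term does not involve a or b, so maximising the log likelihood over a and b is the same as minimising $\sum_i (y_i - a - b x_i)^2$, the sum of squares of the residuals. This is why the two approaches give identical estimates.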

3.9: Background - interpreting output


Returning to our original example, we fit a linear regression line to the data

[Figure: scatter plot of birth weight (grams) against gestational age in weeks,
with the fitted regression line]

In this example, a will be the predicted value of the birth weight for a baby with a
gestational age of zero weeks. This does not make much sense, so we first centre
the gestational age around a central point, in this case the mean gestational age in
our sample, which was 38.7 weeks, i.e. we subtract 38.7 from each gestational age in
our dataset. This is called mean-centring. Now a represents the predicted birth
weight in grams at the average gestational age in our sample of 38.7 weeks.
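
The effect of mean-centring can be seen in a small sketch. The data below are hypothetical (not the IVF records) and the helper `least_squares` is written out here for illustration: centring leaves the slope unchanged and moves the intercept to the mean of the outcome.

```python
def least_squares(xs, ys):
    """Closed-form least squares estimates (a, b) for y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return mean_y - b * mean_x, b


# Hypothetical gestational ages (weeks) and birth weights (grams)
gest = [30, 34, 36, 38, 40, 41]
bweight = [1400, 2300, 2600, 3100, 3500, 3600]

mean_gest = sum(gest) / len(gest)
mean_bweight = sum(bweight) / len(bweight)
mgest = [g - mean_gest for g in gest]  # mean-centred gestational age

a_raw, b_raw = least_squares(gest, bweight)
a_centred, b_centred = least_squares(mgest, bweight)

print("slope (raw, centred):", b_raw, b_centred)
print("intercept (raw):", a_raw)  # predicted weight at 0 weeks, not meaningful
print("intercept (centred):", a_centred, "mean birth weight:", mean_bweight)
```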

3.10: Background - interpreting output


If we fit a linear regression line to the data (with bweight representing the birth
weight in grams and mgest representing the mean-centred gestational age in
weeks) we get the following output.
bweight |     Coef.   Std. Err.        t     P>t    [95% Conf. Interval]
--------+----------------------------------------------------------------
  mgest |  206.6412    7.484572    27.61   0.000    191.9439    221.3386
  _cons |  3129.137    17.42493   179.58   0.000     3094.92    3163.354

How would you interpret the constant coefficient of 3129.137?


Interaction: thought bubble:
Output (appears below):
This is the estimate of a and so it is the predicted birth weight when the
mean-centred gestational age is zero, i.e. the predicted birth weight is 3129.137
grams when the gestational age is 38.7 weeks.

Note: the mean birth weight in our dataset is 3129.137 grams, i.e. the same as a,
because the regression line will always go through the point $(\bar{x}, \bar{y})$.
How would you interpret the coefficient of mgest (the mean-centred gestational
age)?
Interaction: thought bubble:
Output (appears below):
For every increase in gestational age of one week, the predicted birth weight
increases by 206.6 grams.

3.11: Background - interpreting output


bweight |     Coef.   Std. Err.        t     P>t    [95% Conf. Interval]
--------+----------------------------------------------------------------
  mgest |  206.6412    7.484572    27.61   0.000    191.9439    221.3386
  _cons |  3129.137    17.42493   179.58   0.000     3094.92    3163.354

The estimate of the slope, b, is the expected change in the outcome (i.e. birth
weight) for a unit increase in the exposure variable (i.e. gestational age). Here it is
estimated to be 206.6 grams. The output gives the 95% confidence interval for this
parameter to be 191.9g to 221.3g.
If there was no association between gestational age and birth weight the true value
of the parameter would be zero and the points on the scatter plot would be randomly
scattered about the mean values of birth weight. However, based on this analysis the
lower limit of the 95% confidence interval is substantially above zero indicating there
is strong evidence for an association.
We can confirm this by looking at the Wald test, which compares the ratio of the
parameter estimate to its standard error with a t distribution. The larger the value of
b compared to its standard error the larger the test statistic, and the smaller the
P-value (stronger evidence of an association). The test statistic t is given as 27.61
under the column labelled t in the output.
The null hypothesis of the test is that b is zero, or in other words that there is no
association between birth weight and gestational age. The P-value is reported as
P<0.001 under P>t confirming that there is very strong evidence of an association
between birth weight and gestational age, when the relationship is modelled as
linear.
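
The Wald statistic and an approximate confidence interval can be reproduced from the coefficient and standard error in the output. The sketch below uses a Normal approximation (multiplier about 1.96) as a simplifying assumption; the output itself is based on the t distribution, so its limits (191.9439 to 221.3386) are very slightly wider than the ones computed here.

```python
from statistics import NormalDist

coef = 206.6412  # estimated slope for mgest, from the output
se = 7.484572    # its standard error, from the output

# Wald test statistic: estimate divided by its standard error
t_stat = coef / se
print(round(t_stat, 2))  # 27.61

# Approximate 95% confidence interval (Normal approximation, an assumption)
z = NormalDist().inv_cdf(0.975)
lower = coef - z * se
upper = coef + z * se
print(round(lower, 1), round(upper, 1))
```

With 641 observations the t and Normal multipliers are almost identical, which is why the approximation agrees with the output to about one decimal place.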

3.12: Background - regression equation


bweight |     Coef.   Std. Err.        t     P>t    [95% Conf. Interval]
--------+----------------------------------------------------------------
  mgest |  206.6412    7.484572    27.61   0.000    191.9439    221.3386
  _cons |  3129.137    17.42493   179.58   0.000     3094.92    3163.354

We can see from the output that the best prediction of birth weight will be given
by the equation:
Birth weight = 3129.1 + 206.6*mgest
= 3129.1 + 206.6*(gestational age - 38.7)
What is the best prediction of the birth weight for a gestational age of 30 weeks
(to the nearest gram)?
Interaction: Calculation: (calc)
Output(appears in new window):
Incorrect answer:
No. The best prediction is:
= 3129.1 + 206.6*(gestational age - 38.7)
= 3129.1 + 206.6*(30 - 38.7)
= 3129.1 - 206.6*8.7
= 1331.68
= 1332 to the nearest gram
Correct answer:
Correct
Yes. The best prediction is:
= 3129.1 + 206.6*(gestational age - 38.7)
= 3129.1 + 206.6*(30 - 38.7)
= 3129.1 - 206.6*8.7
= 1331.68
= 1332 to the nearest gram
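
The prediction can also be written as a small function. This sketch uses the rounded coefficients from the regression equation above; the function name `predict_birth_weight` is hypothetical.

```python
MEAN_GEST = 38.7  # mean gestational age in the sample (weeks)


def predict_birth_weight(gest_weeks, intercept=3129.1, slope=206.6):
    """Predicted birth weight (grams) from the fitted straight line."""
    mgest = gest_weeks - MEAN_GEST  # mean-centred gestational age
    return intercept + slope * mgest


pred_30 = predict_birth_weight(30)
print(round(pred_30, 2))  # 1331.68
print(round(pred_30))     # 1332
```

At the mean gestational age the function returns the intercept, 3129.1 grams, as expected for a mean-centred model.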

Section 4: Checking model assumptions


The assumptions of a linear regression model are
1. The residuals come from a Normal distribution and are independent from
each other (i.e. no correlation in the residuals between observations).
2. The variance of the residuals is constant across y and x.
3. The correct relationship between y and x has been modelled.
We can check whether these assumptions seem reasonable using plots of the
residuals. This is easiest done using standardised residuals (i.e. with mean=0 and
standard deviation=1). These are obtained by dividing each residual by the
standard deviation of all the residuals.
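
A minimal sketch of this standardisation is given below, using hypothetical residuals (in grams) that sum to zero. It follows the simple definition in the text, dividing by the sample standard deviation; statistical packages may use a slightly different formula for standardised residuals, so treat this as an illustration only.

```python
from statistics import mean, stdev

# Hypothetical residuals in grams (they sum to zero, as least squares residuals do)
residuals = [-620.0, -310.0, -20.0, 150.0, 280.0, 520.0]

sd = stdev(residuals)  # sample standard deviation of the residuals
standardised = [r / sd for r in residuals]

print([round(s, 2) for s in standardised])
print("mean:", round(mean(standardised), 6), "sd:", round(stdev(standardised), 6))

# Count how many standardised residuals lie between -2 and +2
within_two = sum(1 for s in standardised if -2 <= s <= 2)
print(within_two, "of", len(standardised))
```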

4.1: Normality assumption


The Normality assumption can be checked by producing a histogram of the
residuals.
Note: because these are standardised residuals they should come from a Standard
Normal distribution, e.g. we expect about 95% of values to lie between -2 and +2.

[Figure: histogram of the standardized residuals (density against standardized
residual)]

The histogram looks symmetrical and reasonably bell-shaped, therefore the
assumption that the residuals come from a Normal distribution is reasonable. The
larger the sample size the less the shape of the distribution of the residuals will
affect the model estimation.

4.2: Constant variance


The second assumption of constant variance of the residuals across y and x can be
checked from a scatter plot of the residuals versus the predicted values $\hat{y}$
(also called fitted values), i.e. for each point on the graph below, the fitted value
is plotted against the residual.

[Figure: scatter plot of birth weight (grams) against gestational age in weeks with
the fitted line, illustrating the predicted (or fitted) value and the residual for
one point]

The residuals should be randomly scattered about zero with constant variance
over the predicted values.

[Figure: scatter plot of the standardized residuals against the fitted values]

There is no obvious relationship between the residuals and $\hat{y}$, and no
evidence of a change in the variance of the residuals with $\hat{y}$.

4.3: Linear relationship


We started our analysis with a scatter plot of the data. The importance of this is
highlighted by scatter plots of some hypothetical data below. Use the drop down
menu to examine the different plots. Note: the correlation coefficient is the same in
each example.
Interaction: Pulldown: linear relationship:

This scatter plot shows a roughly linear relationship between y and x and so linear
regression is appropriate.
Interaction: Pulldown: remote point:

There is a remote point, an observation which is far from the range of the other
data.
Interaction: Pulldown: outlier:

There is an outlier, a point which is not well fitted by the model. Remote points
and outliers can change the regression parameters substantially, and in this case
they are known as influential points. In order to identify influential points the
model can be re-fitted with one observation left out each time, or they can be
spotted by eye in a scatter plot and then the regression parameters estimated with
and without the observation in the model. The first step once an influential point
has been identified would be to check whether there has been a data-entry error, if
this check is possible. If there is no data entry error, the observation should not
be removed from the data unless there is very good reason to do so. One option is
to report the results of the analysis with the observation included and with it
removed in order to demonstrate how sensitive the results are to this observation.
Interaction: Pulldown: non-linear:

In this example, the values of y seem to initially decrease with increasing values of
x, but then start to increase with even greater values of x. Hence, a linear regression
will not give a good fit to such data and would be inappropriate.
Interaction: Pulldown: two clusters:

There are two clusters of points, with what appears to be random scatter in each
cluster. This may suggest some kind of threshold in x, below which y takes one
average value and above which y takes another average value.
Interaction: Pulldown: two lines:

There appear to be two different lines shown here. It may well be that the value of y
depends on x and on another binary variable.

Section 5: Categorical variables


The model we have fitted assumes that birth weight is linearly associated with
gestational age within the range of gestational age in the data. Other models we
have fitted in this course have used categories or grouped data to examine the
relationship between an outcome variable, e.g. log odds and an explanatory variable.
We can do this with a quantitative outcome too.

5.1: Categorical variables


For our birth weight data we can group the data according to gestational age.

Gestational age category | Freq. | Mean birth weight (g) | SD of birth weight (g)
-------------------------+-------+-----------------------+-----------------------
<38 wks                  |   157 |             2,445.567 |                705.815
38-39 wks                |   130 |             3,186.023 |                408.357
39-40 wks                |   166 |             3,365.193 |                456.661
>40 wks                  |   188 |             3,452.223 |                441.366

The means of birth weight for the four groups increase sequentially from around 2.5
kg to 3.5 kg.

5.2: Categorical variables


We can fit a simple linear regression model of birth weight with three indicators for
the highest three groups of gestational age and estimate the mean differences
between each of these three groups and the first (i.e. the <38 week category).
bweight |     Coef.   Std. Err.        t     P>t    [95% Conf. Interval]
--------+----------------------------------------------------------------
gestcat |
      2 |  740.4562    61.27115    12.08   0.000    620.1383    860.7741
      3 |  919.6259      57.522    15.99   0.000    806.6702    1032.582
      4 |  1006.657    55.86212    18.02   0.000    896.9604    1116.353
  _cons |  2445.567    41.23697    59.31   0.000     2364.59    2526.544

5.3: Categorical variables


bweight |     Coef.   Std. Err.        t     P>t    [95% Conf. Interval]
--------+----------------------------------------------------------------
gestcat |
      2 |  740.4562    61.27115    12.08   0.000    620.1383    860.7741
      3 |  919.6259      57.522    15.99   0.000    806.6702    1032.582
      4 |  1006.657    55.86212    18.02   0.000    896.9604    1116.353
  _cons |  2445.567    41.23697    59.31   0.000     2364.59    2526.544

The intercept, estimated as 2445.567, is the predicted mean value of birth weight in
the baseline category of gestational age (i.e. the <38 week category).
The value 740.4562 for gestcat 2 is the predicted increase in birth weight among
babies in the next gestational age category (38-39 weeks) compared to the baseline
category (<38 weeks).
Similarly 919.6259 and 1006.657 are the predicted increases for the next two
categories (39-40 weeks and >40 weeks), respectively, each compared to the
baseline category (<38 weeks).

The estimated increases can be added to the estimated mean in the baseline
category to get the predicted mean value of birth weight in the four categories.
The resulting equations for the predicted values are

$\hat{y}$ = 2445.567,
$\hat{y}$ = 2445.567 + 740.4562 = 3186.023,
$\hat{y}$ = 2445.567 + 919.6259 = 3365.193 and
$\hat{y}$ = 2445.567 + 1006.657 = 3452.223

respectively for babies in the first, second, third and fourth categories.

These values correspond exactly to those shown in the column headed Mean of birth
weights in the previous table.
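
As a check on this interpretation, the sketch below adds each estimated increase to the baseline mean and compares the result with the group means in the table. The dictionary names are hypothetical. Because the output displays the coefficients to a limited number of digits, agreement is only expected to within rounding of the last displayed digit.

```python
intercept = 2445.567  # predicted mean in the baseline (<38 wks) category

# Estimated increases relative to the baseline category, from the output
increments = {
    "38-39 wks": 740.4562,
    "39-40 wks": 919.6259,
    ">40 wks": 1006.657,
}

predicted = {"<38 wks": intercept}
for category, increase in increments.items():
    predicted[category] = intercept + increase

# Observed group means from the table of birth weight by gestational age category
table_means = {
    "<38 wks": 2445.567,
    "38-39 wks": 3186.023,
    "39-40 wks": 3365.193,
    ">40 wks": 3452.223,
}

for category, observed in table_means.items():
    print(category, round(predicted[category], 3), observed)
```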
Interaction: Button: Show:
Output (appears below):
Gestational age category | Freq. | Mean birth weight (g) | SD of birth weight (g)
-------------------------+-------+-----------------------+-----------------------
<38 wks                  |   157 |              2445.567 |               705.8151
38-39 wks                |   130 |              3186.023 |               408.3571
39-40 wks                |   166 |              3365.193 |                456.661
>40 wks                  |   188 |              3452.223 |               441.3662

5.4: Categorical variables


The predicted values for this regression model can be seen on the scatter plot below.

[Figure: scatter plot of birth weight (grams) against gestational age in weeks,
with the predicted values from the categorical model]

You can click swap to add the original line to the same graph.
Interaction: Button: Swap:
Output (figure changes to this and text appears below):

[Figure: the same scatter plot of birth weight (grams) against gestational age in
weeks, with the predicted values from the categorical model and the original fitted
straight line]

We can see that the categorical variable gives a reasonable fit to the data in the
middle values of gestational age, but does particularly poorly at younger gestational
ages.

Section 6: Quadratic relationships


The first model we fitted assumed a linear relationship between birth weight and
gestational age, while the model using gestational age categories is more flexible
about the shape of the relationship between the two variables, though clearly it
makes some strong assumptions about what that relationship is, e.g. it assumes that
all babies born before 38 weeks have the same mean birth weight. Another option is
to model a quadratic relationship between birth weight and gestational age, which
would allow some departure from a linear relationship.

6.1: Quadratic relationships


Consider the (hypothetical) example below, in which it is possible that there is some
curvature.

[Figure: scatter plot of hypothetical data, y against x, with possible curvature]

[Figure: the scatter plot with a fitted straight line]
A straight line ignores any curvature there may be between y and x.

[Figure: the scatter plot with x categorised]
Categorising x allows a non-linear relationship between y and x.

[Figure: the scatter plot with a fitted quadratic curve]
Fitting a quadratic allows some departure from a linear relationship.

6.2: Quadratic relationships


To fit a quadratic relationship the model becomes
$y = a + bx + cx^2$
where a is still the estimated birth weight of a baby when x=0 (i.e. at the mean
gestational age) but now the shape of the curve is described by two coefficients, b
and c. In this case, if b and c were both positive this would mean that the baby's
growth accelerates over the period of gestation that we are examining. If b is
positive and c is negative then the baby's growth rate would be slowing down over
the period of gestation that we are examining. Note that interpretation of the two
parameter estimates, b and c, is less straightforward than interpretation of the
parameter estimates when x was categorical.

6.3: Quadratic relationships


We can fit this model by first calculating a new variable that is the square of the
mean-centred gestational age e.g. mgest2=mgest^2 and then fitting a linear
regression model on both the mean-centred gestational age and its square:
bweight |     Coef.   Std. Err.        t     P>t    [95% Conf. Interval]
--------+----------------------------------------------------------------
  mgest |  193.2323    10.73898    17.99   0.000    172.1443    214.3203
 mgest2 | -2.796836    1.608682    -1.74   0.083   -5.955789    .3621158
  _cons |  3144.296    19.46009   161.58   0.000    3106.083     3182.51

So the regression equation is now:

Birth weight = 3144.3 + 193.2*mgest - 2.8*mgest2
= 3144.3 + 193.2*(age - 38.7) - 2.8*(age - 38.7)^2
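
To see how far the quadratic fit departs from the straight line, the sketch below evaluates both fitted equations at a few gestational ages, using the rounded coefficients quoted in this session. The function names `linear` and `quadratic` are hypothetical.

```python
MEAN_GEST = 38.7  # mean gestational age in the sample (weeks)


def linear(age):
    """Fitted straight line: 3129.1 + 206.6*mgest."""
    return 3129.1 + 206.6 * (age - MEAN_GEST)


def quadratic(age):
    """Fitted quadratic: 3144.3 + 193.2*mgest - 2.8*mgest^2."""
    mgest = age - MEAN_GEST
    return 3144.3 + 193.2 * mgest - 2.8 * mgest ** 2


for age in (28, 32, 36, 40):
    gap = linear(age) - quadratic(age)
    print(age, round(linear(age)), round(quadratic(age)), round(gap))
```

The two predictions differ by only a few tens of grams between 32 and 40 weeks, and the gap widens at the earliest gestational ages, where the dataset contains very few babies.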

6.4: Quadratic relationships

The graph below shows both the fitted linear and quadratic relationships. We can see
that between 32 and 42 weeks gestation the lines are virtually identical, and it is
only for babies born before 32 weeks that there appears to be any curvature. There
are very few babies in the dataset born before about 32 weeks. It is unsurprising
that there is little statistical evidence of departure from a linear relationship (P=0.08
for mgest2 in the output).

[Figure: scatter plot of birth weight (grams) against gestational age in weeks,
with the fitted linear and quadratic relationships]

Section 7: Multivariable regression


Just as we have done with Poisson and logistic regression models, we can include
several covariates in a linear regression model. For example, we can fit a model of
birth weight to gestational age and gender (0=male, 1=female):
bweight |     Coef.   Std. Err.        t     P>t    [95% Conf. Interval]
--------+----------------------------------------------------------------
  mgest |  206.4446    7.363321    28.04   0.000    191.9853    220.9039
 gender | -161.7075    34.29034    -4.72   0.000   -229.0431   -94.37194
  _cons |  3208.604    24.03779   133.48   0.000    3161.401    3255.806

7.1: Multivariable regression


bweight |     Coef.   Std. Err.        t     P>t    [95% Conf. Interval]
--------+----------------------------------------------------------------
  mgest |  206.4446    7.363321    28.04   0.000    191.9853    220.9039
 gender | -161.7075    34.29034    -4.72   0.000   -229.0431   -94.37194
  _cons |  3208.604    24.03779   133.48   0.000    3161.401    3255.806

The value for mgest of 206.4 gives us the estimated increase in birth weight for each
additional week of gestation, after adjusting for gender. The value -161.7 shows
that females (coded 1 in the data) are estimated to be 161.7 g lighter than males
(coded 0 in the data), after adjusting for week of gestation. Note that gender is not
a confounder between gestational age and birth weight here, because adjusting for it
barely changes the mgest estimate (from 206.6412 to 206.4446). However, we can
see from the p-value and confidence interval that gender is independently associated
with birth weight.
What is the regression equation fitted here?
Interaction: thought bubble:
Output (appears below):
Birth weight = 3208.6 + 206.4*mgest - 161.7*gender
= 3208.6 + 206.4*(gestational age - 38.7) - 161.7*gender
where gender takes the value 0 for males and 1 for females

7.2: Multivariable regression


We can plot two separate lines of predicted birth weight by gestational age for
males and females on the scatter plot.

[Figure: scatter plot of birth weight (grams) against gestational age in weeks,
with parallel fitted lines for males and females; the vertical difference between
the two lines is 161.7g]

This makes clear that we are fitting two parallel lines to the data, with equal
gradients but different intercepts.
What are the regression equations for males and females separately?
Interaction: thought bubble:
Output (appears below):
Males
Birth weight = 3208.6 + 206.4*(gestational age - 38.7) - 161.7*0
= 3208.6 + 206.4*(gestational age - 38.7)
Females
Birth weight = 3208.6 + 206.4*(gestational age - 38.7) - 161.7*1
= 3208.6 - 161.7 + 206.4*(gestational age - 38.7)
= 3046.9 + 206.4*(gestational age - 38.7)
So we can see that the gradients are the same, but the intercepts are different.
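
The parallel-lines structure can be checked numerically. This sketch uses the rounded coefficients from the regression equation; the function name `predict` and the argument name `female` are hypothetical.

```python
MEAN_GEST = 38.7  # mean gestational age in the sample (weeks)


def predict(gest_weeks, female):
    """Predicted birth weight (grams); female is 0 for males, 1 for females."""
    return 3208.6 + 206.4 * (gest_weeks - MEAN_GEST) - 161.7 * female


# The male-female difference is the same at every gestational age
for age in (30, 35, 40):
    print(age, round(predict(age, 0) - predict(age, 1), 1))

# Intercepts at the mean gestational age
print(round(predict(MEAN_GEST, 0), 1), round(predict(MEAN_GEST, 1), 1))
```

Because the model contains no interaction between gestational age and gender, the difference between the two lines is 161.7 g wherever it is evaluated.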

7.3: Multivariable regression


When there is more than one covariate in the model, the goodness-of-fit checking
should also include plots of the residuals against each covariate. With one
covariate this is not necessary, as $\hat{y}$ is determined by x.

7.4: Multivariable regression


Note that the size of the effect estimate depends on the units of the covariate. For
example, here gestational age is measured in weeks, but if it had been measured in
days the parameter for gestational age would be 7 times smaller (since it is the
estimated increase per day not per week). It is useful to consider the clinical
impact of the covariate on the outcome in the wider population. For example, a
baby born at 32 weeks is estimated to be 1.65kg lighter than a baby born at 40
weeks (206.44 x -8 = -1652g).

Section 8: Analysis of variance for goodness of fit


When testing for the significance of an association between the quantitative
outcome and a covariate that has a single parameter, we can use the p-value for
the parameter in the output table, as we did for the covariate gender above.
However, if we want to test the impact of several parameters simultaneously we
need to use an F test from the analysis of variance table. The F test in linear
regression is analogous to the likelihood ratio test in logistic or Poisson regression.

8.1: Analysis of variance for goodness of fit


A measure that allows us to evaluate how well a particular linear regression model
fits the data is the sum of squares of the differences between the observed values
of the outcome variable, and the values predicted by the model, i.e. the residual
sum of squares,

$$\sum_i (y_i - \hat{y}_i)^2$$

This is obtained from an analysis of variance table. For example, the analysis of
variance table produced when fitting the model with gestational age categories
was:
  Source |         SS     df           MS
---------+--------------------------------
   Model |  102656030      3   34218676.7
Residual |  170064092    637   266976.596
---------+--------------------------------
   Total |  272720122    640    426125.19

Here the residual sum of squares (SS) is 170064092.

8.2: Analysis of variance for goodness of fit


  Source |         SS     df           MS
---------+--------------------------------
   Model |  102656030      3   34218676.7
Residual |  170064092    637   266976.596
---------+--------------------------------
   Total |  272720122    640    426125.19

If the residual sum of squares is zero the regression line would fit the data
perfectly, that is every observed point would lie on the fitted line. By contrast
larger values would indicate worse fits, since the deviations of the observed points
from the regression line will be larger. Two possible factors contribute to a high
residual sum of squares; either there is a lot of variation in the data, i.e.
$\sigma^2$ is large, or the model does not explain much of the variation observed.

8.3: Analysis of variance for goodness of fit


  Source |         SS     df           MS
---------+--------------------------------
   Model |  102656030      3   34218676.7
Residual |  170064092    637   266976.596
---------+--------------------------------
   Total |  272720122    640    426125.19

We can obtain an estimate of $\sigma^2$ as:

$\hat{\sigma}^2$ = residual sum of squares / df

where df is the degrees of freedom.
This can be calculated as:

170064092/637 = 266976.596

This value is known as the residual mean square (MS, shown in the last column in
the table), and its square root is called the Root MSE (root mean square error),
here 516.7.
The larger the Model MS compared to the Residual MS, the better the model fits the
data. Under the null hypothesis of no effect of gestational age, the Model MS and
the Residual MS are two independent estimates of $\sigma^2$ and their ratio is
expected to be close to 1. We can then test the significance of the current model
by dividing these two terms:

$$F = \frac{\text{Model MS}}{\text{Residual MS}}$$

This is known as an F test. Under a null hypothesis of no effect of gestational age
on birth weight, this statistic would be expected to follow the F distribution with the
appropriate degrees of freedom. This is written as F(3, 637), as we have fitted
three parameters for the gestational age categories and we are then left with 641
observations minus these three parameters minus the constant, leaving 637 degrees
of freedom for the residual SS.
Calculate the F statistic for the null hypothesis of no relationship between birth
weight and gestational age to one decimal place.
Interaction: Calculation: (calc)

Output(appears in new window):


Incorrect answer:
No.
F = Model MS / Residual MS = 34218676.7 / 266976.596 = 128.2
Correct answer:
Correct
Yes.
F = Model MS / Residual MS = 34218676.7 / 266976.596 = 128.2

8.4: Analysis of variance for goodness of fit


If there is a strong relationship between the outcome and the exposure variables
used in the model, then the Model MS will be much larger than the Residual MS,
and F will be (substantially) greater than 1. If there is no relationship then, on
average, the Model MS will be equal to the Residual MS and F will be 1. In this
example the value of F is much larger than 1, so there is evidence that birth weight
changes with gestational age. The p-value for the null hypothesis that there is no
relationship, based on F = 128.2 is p<0.0001.

8.5: Analysis of variance for goodness of fit


If we include gender in the model with gestational age as categorical, we get the
output below.
  Source |         SS     df           MS
---------+--------------------------------
   Model |  106953800      4     26738450
Residual |  165766322    636   260638.871
---------+--------------------------------
   Total |  272720122    640    426125.19

bweight |     Coef.   Std. Err.        t     P>t    [95% Conf. Interval]
--------+----------------------------------------------------------------
gestcat |
      2 |  736.1429    60.54885    12.16   0.000     617.243    855.0427
      3 |  909.2086    56.89301    15.98   0.000    797.4877    1020.929
      4 |  1012.249    55.21226    18.33   0.000    903.8286    1120.669
 gender | -164.2448    40.44732    -4.06   0.000   -243.6713   -84.81839
  _cons |  2528.212    45.54496    55.51   0.000    2438.776    2617.649

The analysis of variance table for this model is at the top. For this model, our null
hypothesis is that there is no effect of gestational age (considered in categories) or
gender. A test of whether the true effects of both explanatory variables are zero is
made by examining F(4, 636) = 102.59 and referring this to the F distribution. The
p-value for this hypothesis is again <0.001, indicating strong evidence that the two
exposure variables, gestational age category and gender, taken together help to
explain the variability in the data.

8.6: Analysis of variance for goodness of fit


Adjusting for gender has made very little difference to the parameter estimates for
gestational age so gender is not a confounder here. But for demonstration, in
order to assess whether gestational age (as categorical) is associated with birth
weight after adjusting for gender we need to do a partial F test.
The F statistic is F(3, 636) = 131.06, which gives a p-value of <0.001, so there is
very strong evidence of an association between gestational age category and birth
weight independently of gender.
This partial F test is analogous to the likelihood ratio test in logistic or Poisson
regression. With a likelihood ratio test you would need to fit one model with
gestational age and gender and compare it to a model with just gender, but you
don't need to do that with linear regression.
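
A partial F statistic compares the residual sums of squares of two nested models. The session does not show the residual sum of squares for the gender-only model, so the F(3, 636) = 131.06 above cannot be recomputed here. As an illustration of the same calculation, the sketch below instead tests gender adjusted for gestational age category, using the two analysis of variance tables given in this session. The function name `partial_f` is hypothetical.

```python
def partial_f(rss_reduced, rss_full, extra_params, df_resid_full):
    """Partial F statistic for the extra parameters in the fuller model."""
    numerator = (rss_reduced - rss_full) / extra_params
    denominator = rss_full / df_resid_full  # Residual MS of the fuller model
    return numerator / denominator


# Reduced model: gestational age categories only (residual SS 170064092)
# Full model: gestational age categories plus gender (residual SS 165766322, 636 df)
f_gender = partial_f(170064092, 165766322, extra_params=1, df_resid_full=636)
print(round(f_gender, 2))  # 16.49

# With a single extra parameter, the partial F equals the square of its t statistic
t_gender = -164.2448 / 40.44732
print(round(t_gender, 2), round(t_gender ** 2, 2))  # -4.06 16.49
```

The agreement between the partial F for gender and the square of its t statistic shows why, for a covariate with a single parameter, the p-value in the coefficient table is sufficient; the partial F test is needed when several parameters are tested together, as for the three gestational age indicators.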

Section 9: Summary
This is the end of AS13.
The main points of this session will appear below as you click on the relevant title.
Fitting regression equation
Fitting a linear regression equation to data from two quantitative variables uses
one of two approaches, both of which give the same result: least squares
minimises the sum of squares of residuals, while maximum likelihood assumes
the residuals follow a Normal distribution with mean 0 and constant variance
$\sigma^2$.
Checking assumptions
Linearity can be checked by a scatter-plot of the outcome and explanatory
variables, Normality of residuals can be checked with a histogram of standardised
residuals and constancy of the variance of the residuals can be checked by
examining a scatter-plot of the residuals plotted against the fitted values.
Allowing for a non-linear relationship
Two common ways of allowing for a non-linear relationship are to either:
categorise the explanatory variable or to allow for a quadratic relationship by
including both the explanatory variable and its square in the regression.

Multivariable linear regression


As for logistic, Poisson and Cox regression, multiple variables can be included in
the regression equation to produce estimates that are adjusted for the other
variables. The partial F test is used instead of the likelihood ratio test to examine
the evidence for some variables adjusted for other variable(s).