
(AS13)

EPM304 Advanced Statistical Methods in Epidemiology

This document contains a copy of the study material located within the computer

assisted learning (CAL) session.

If you have any questions regarding this document or your course, please contact

DLsupport via DLsupport@lshtm.ac.uk.

Important note: this document does not replace the CAL material found on your

module CDROM. When studying this session, please ensure you work through the

CDROM material first. This document can then be used for revision purposes to

refer back to specific sessions.

These study materials have been prepared by the London School of Hygiene & Tropical Medicine as part of

the PG Diploma/MSc Epidemiology distance learning course. This material is not licensed either for resale

or further copying.

London School of Hygiene & Tropical Medicine September 2013 v1.1

Aims

To give an introduction to analysing quantitative outcomes in regression

Objectives

By the end of this session you will be able to:

- model the relationship between a quantitative outcome and explanatory variable(s) using linear regression;
- interpret the parameters of the regression model and use significance tests to assess the strength of evidence of an association;
- use regression diagnostics to check the model assumptions;
- use regression modelling to adjust for confounding of an explanatory variable by another variable.

In SC14 and SC15 you were introduced to assessment of correlation between

two quantitative variables, as well as linear regression of a quantitative outcome

(note: quantitative variables are also referred to as continuous variables).

The aim of this session is to recap and extend this work.

If you need to review any materials before you continue, refer to the appropriate

sessions below.

Correlation: SC14

Linear regression: SC15

To illustrate the concepts and method of quantitative regression we will use a

simple example and data from one study:

The In-Vitro Fertilization study

Click on this study to see the details below.

Interaction: Hyperlink: The In-Vitro Fertilization study:

Output (appears in new window):

In-Vitro Fertilization study

This study was set up to compare babies conceived following in-vitro fertilization to

those from the general population. The data used here refer to the records of 641

singleton births following in-vitro fertilization (IVF).

We want to examine the association between birth weight and gestational age in
our dataset. Firstly we use a scatter plot. What do you think is the relationship
between these two variables?

Interaction: thought bubble:

Output (appears below):

Pearson's correlation coefficient, r, is a measure between -1 and +1 of the
direction and strength of the relationship between two quantitative variables.
If one increases as the other increases, r is positive, and if one decreases as the
other increases, r is negative. If there is no relationship between the two
variables then r is 0. In a scatter plot of one variable against the other, the closer
the points are to a straight line the closer the value of r will be to +1 or -1, i.e.
the stronger the linear relationship between the variables.

Below are some examples of some different correlation coefficients of

hypothetical data. Use the drop-down menu to explore what different values of r

might look like graphically.

Interaction: Pulldown: r = 0.9:

[Scatter plot of y against x for hypothetical data with r = 0.9]

Interaction: Pulldown: r = 0.7:

[Scatter plot of y against x for hypothetical data with r = 0.7]

Interaction: Pulldown: r = 0.3:

[Scatter plot of y against x for hypothetical data with r = 0.3]

Interaction: Pulldown: r = -0.5:

[Scatter plot of y against x for hypothetical data with r = -0.5]

Returning to the association between birth weight and gestational age, displayed
in the graph below, the Pearson correlation coefficient was 0.74.

Interaction: Hotspot: Weak positive association

Output:

(appears in new window):

No. Although the correlation coefficient is positive, indicating a positive relationship,
we would generally consider such a large correlation coefficient to indicate a strong
association.

Note however, that there are no fixed rules on what defines a strong versus weak

association.

Interaction: Hotspot: Strong negative association

Output:

(appears in new window):

No. Although this is quite a large correlation coefficient, indicating a strong
association, it is positive, indicating a positive association.

Note however, that there are no fixed rules on what defines a strong versus weak

association.

Interaction: Hotspot: Strong positive association

Output:

(appears in new window):

Yes. This is quite a large correlation coefficient, indicating a strong association, and
it is positive, indicating a positive association.

Note however, that there are no fixed rules on what defines a strong versus weak

association.

The correlation coefficient tells us the strength and direction of the linear
relationship, but it does not allow us to quantify the relationship between the two
variables, for example, by how much one variable changes, on average, with a unit
change in the other. Nor does it allow us to quantify the relationship between two
variables while adjusting for confounding by a third variable. It also assumes a
linear relationship between the two variables, which may not be the case. For such
situations we need linear regression.

Simple linear regression between two continuous variables uses a method known as

least squares to derive an equation y = a + bx, with a and b chosen to ensure the

best fit of the line to the data.

Note: y are the values of the dependent or outcome variable plotted on the y-axis

(the vertical axis) and x are the values of the independent or exposure variable

plotted on the x-axis (the horizontal axis).

For example:

Note that a is the predicted value of y when x is zero, and b is the gradient of the
line, i.e. a one unit change in x is predicted to lead to a change in y of b units. In
our example, a would be the predicted birth weight for a gestational age of zero and
b would be the increase in birth weight in grams for a one week increase in
gestational age.

The least squares method works by minimising the distances between each point and
the line. We will now illustrate this using a subset of only seven data points:

[Scatter plot of birth weight (grams) against gestational age in weeks for the seven-point subset]

The vertical distance from each point to the line is called a residual. Press swap to

see these illustrated.

Interaction: Button: Swap:

Output (changes to figure below):

[The same scatter plot with a fitted line and vertical lines showing the residual for each point]

The least squares estimates of a and b are derived by minimising the sum of the

square of each residual.

By squaring the residuals, we penalise larger residuals, so for example, two residuals

of 0 and 10 would have a sum of squares of 0+100=100, whereas two residuals of 5

and 5 would have a sum of squares of 25+25=50 and so we would prefer the two

residuals of 5 and 5.

Calculate the sum of squares of these residuals to one decimal place:

[Scatter plot of birth weight (grams) against gestational age in weeks, with the seven residuals labelled: 26.6, -133.6, -305.8, 555.2, 623.3, -666.0, -800.3]

Output(appears in new window):

Incorrect answer:

No. The sum of squares of the residuals is 623.3^2 + 305.8^2 + 26.6^2 +

555.2^2 + 800.3^2 + 133.6^2 + 666.0^2 = 1892856.2

Correct answer:

Correct

Yes, the sum of squares of the residuals is 623.3^2 + 305.8^2 + 26.6^2 +

555.2^2 + 800.3^2 + 133.6^2 + 666.0^2 = 1892856.2
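As a quick check, the arithmetic above can be reproduced in a few lines (a sketch using the seven residual values shown in the figure):

```python
# Residuals read from the seven-point figure above (in grams).
residuals = [26.6, -133.6, -305.8, 555.2, 623.3, -666.0, -800.3]

# Least squares judges a candidate line by the sum of the squared residuals.
sum_of_squares = sum(r ** 2 for r in residuals)
print(round(sum_of_squares, 1))  # 1892856.2
```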

The estimates of a and b that minimise the sum of squares of the residuals are

given by:

b = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)²

a = ȳ − b x̄

Note: these equations are given for information and you are not expected to

memorise them.

An alternative way to estimate a and b is by likelihood theory.

Firstly we rewrite the equation to explicitly include the residuals e_i as follows:

y_i = a + b x_i + e_i

Then we assume these residuals e_i are Normally distributed with mean zero and
variance σ², i.e. N(0, σ²). So we can now re-write the log likelihood for the e_i using:

e_i = y_i − a − b x_i

If we now maximise the log likelihood, which is written in terms of the data yi and xi

and the parameters a and b, we obtain the maximum likelihood estimates of a and b.

These turn out to be exactly the same as the least squares estimates i.e.

b = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)²

a = ȳ − b x̄
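The closed-form estimates can be computed directly and checked against `numpy.polyfit`, which minimises the same criterion. This is a sketch on synthetic data; the variable names and values are illustrative, not from the IVF dataset:

```python
import numpy as np

# Synthetic data loosely mimicking the example (names and values illustrative).
rng = np.random.default_rng(0)
x = rng.uniform(25, 42, size=100)                       # "gestational ages"
y = 3000 + 200 * (x - 38.7) + rng.normal(0, 400, 100)   # noisy linear outcome

# b = sum((x_i - xbar)(y_i - ybar)) / sum((x_i - xbar)^2),  a = ybar - b*xbar
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

# np.polyfit's degree-1 fit minimises the same least squares criterion.
b_ref, a_ref = np.polyfit(x, y, 1)
assert np.isclose(a, a_ref) and np.isclose(b, b_ref)
```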

Returning to our original example, we fit a linear regression line to the data

[Scatter plot of birth weight (grams) against gestational age in weeks with the fitted regression line]

In this example, a will be the predicted value of the birth weight for a baby with a
gestational age of zero weeks. This does not make much sense, so we first centre
gestational age around a central point, in this case the mean gestational age in
our sample, which was 38.7 weeks, i.e. we subtract 38.7 from each gestational age in
our dataset. This is called mean-centring. Now a represents the predicted birth
weight in grams at the average gestational age in our sample of 38.7 weeks.

If we fit a linear regression line to the data (with bweight representing the birth

weight in grams and mgest representing the mean-centred gestational age in

weeks) we get the following output.

bweight |     Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
--------+----------------------------------------------------------------
  mgest |  206.6412    7.484572   27.61    0.000      191.9439   221.3386
  _cons |  3129.137    17.42493  179.58    0.000       3094.92   3163.354

Interaction: thought bubble:

Output (appears below):

This is the estimate of a and so it is the predicted birth weight when the mean-centred
gestational age is zero, i.e. the predicted birth weight is 3129.137 g when the
gestational age is 38.7 weeks.

Note: the mean birth weight in our dataset is 3129.137, i.e. the same as a; the
regression line will always go through the point (x̄, ȳ).

How would you interpret the coefficient of mgest (the mean-centred gestational

age)?

Interaction: thought bubble:

Output (appears below):

For every increase in gestational age of one week, the predicted birth weight

increases by 206.6 grams.


The estimate of the slope, b, is the expected change in the outcome (i.e. birth

weight) for a unit increase in the exposure variable (i.e. gestational age). Here it is

estimated to be 206.6 grams. The output gives the 95% confidence interval for this

parameter to be 191.9g to 221.3g.

If there was no association between gestational age and birth weight the true value

of the parameter would be zero and the points on the scatter plot would be randomly

scattered about the mean values of birth weight. However, based on this analysis the

lower limit of the 95% confidence interval is substantially above zero indicating there

is strong evidence for an association.

We can confirm this by looking at the Wald test, which compares the ratio of the
parameter estimate to its standard error with a t distribution. The larger the value of
b compared to its standard error, the larger the test statistic, and the smaller the
P-value (stronger evidence of an association). The test statistic t is given as 27.6
under the column labelled t in the output.

The null hypothesis of the test is that b is zero, or in other words that there is no
association between birth weight and gestational age. The P-value is reported as
P<0.001 under P>|t|, confirming that there is very strong evidence of an association
between birth weight and gestational age, when the relationship is modelled as
linear.


We can see from the output that the best prediction of birth weight will be given

by the equation:

Birth weight = 3129.1 + 206.6*mgest
             = 3129.1 + 206.6*(gestational age − 38.7)

What is the best prediction of the birth weight for a gestational age of 30 weeks

(to the nearest gram)?

Interaction: Calculation: (calc)

Output(appears in new window):

Incorrect answer:

No. The best prediction is:

= 3129.1 + 206.6*(gestational age − 38.7)
= 3129.1 + 206.6*(30 − 38.7)
= 3129.1 − 206.6*8.7
= 1331.68
= 1332 to the nearest gram

Correct answer:

Correct

Yes. The best prediction is:

= 3129.1 + 206.6*(gestational age − 38.7)
= 3129.1 + 206.6*(30 − 38.7)
= 3129.1 − 206.6*8.7
= 1331.68
= 1332 to the nearest gram
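The prediction is just the fitted equation evaluated at 30 weeks. A minimal sketch, using the coefficients from the output above:

```python
# Coefficients taken from the regression output above.
a, b, mean_gest = 3129.1, 206.6, 38.7

def predicted_bweight(gest_age_weeks):
    """Predicted birth weight (grams) from the mean-centred regression."""
    return a + b * (gest_age_weeks - mean_gest)

print(round(predicted_bweight(30)))  # 1332
```

At the mean gestational age the prediction is simply the intercept, 3129.1 g.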

The assumptions of a linear regression model are

1. The residuals come from a Normal distribution and are independent from

each other (i.e. no correlation in the residuals between observations).

2. The variance of the residuals is constant across y and x.

3. The correct relationship between y and x has been modelled.

We can check whether these assumptions seem reasonable using plots of the

residuals. This is easiest done using standardised residuals (i.e. with mean=0 and

standard deviation=1). These are obtained by dividing each residual by the

standard deviation of all the residuals.

The Normality assumption can be checked by producing a histogram of the

residuals.

Note: because these are standardised residuals they should come from a Standard
Normal distribution, e.g. we expect about 95% of values to lie between −2 and +2.

[Histogram of the standardised residuals]

The histogram is approximately symmetric and bell-shaped, suggesting that the
assumption that the residuals come from a Normal distribution is reasonable. The
larger the sample size, the less the shape of the distribution of the residuals will
affect the model estimation.

The second assumption of constant variance of the residuals across y and x can be

checked from a scatter plot of the residuals versus the predicted values y (also

called fitted values) i.e. for each point on the graph below, the fitted value is

plotted against the residual.

[Scatter plot of the standardised residuals against the fitted values]

The residuals should be randomly scattered about zero with constant variance

over the predicted values.

Here there is no evidence of a trend in the residuals with ŷ, and no evidence of a
change in the variance of the residuals with ŷ.

We started our analysis with a scatter plot of the data. The importance of this is

highlighted by scatter plots of some hypothetical data below. Use the drop down

menu to examine the different plots. Note: the correlation coefficient is the same in

each example.

Interaction: Pulldown: linear relationship:

This scatter plot shows a roughly linear relationship between y and x and so linear

regression is appropriate.

Interaction: Pulldown: remote point:

There is a remote point, an observation which is far from the range of the other

data.

Interaction: Pulldown: outlier:

There is an outlier, a point which is not well fitted by the model. Remote points

and outliers can change the regression parameters substantially, and in this case

they are known as influential points. In order to identify influential points the

model can be re-fitted with one observation left out each time, or they can be

spotted by eye in a scatter plot and then the regression parameters estimated with

and without the observation in the model. The first step once an influential point

has been identified would be to check whether there has been a data-entry error, if

this check is possible. If there is no data entry error, the observation should not

be removed from the data unless there is very good reason to do so. One option is

to report the results of the analysis with the observation included and with it

removed in order to demonstrate how sensitive the results are to this observation.

Interaction: Pulldown: non-linear:

In this example, the values of y seem to initially decrease with increasing values of

x, but then start to increase with even greater values of x. Hence, a linear regression

will not give a good fit to such data and would be inappropriate.

Interaction: Pulldown: two clusters:

There are two clusters of points, with what appears to be random scatter in each

cluster. This may suggest some kind of threshold in x, below which y takes one

average value and above which y takes another average value.

Interaction: Pulldown: two lines:

There appear to be two different lines shown here. It may well be that the value of y

depends on x and on another binary variable.
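The leave-one-out refitting described for influential points can be sketched as follows (hypothetical data; a large change in the slope when one observation is dropped flags that observation as influential):

```python
import numpy as np

# Hypothetical data: the last observation is remote and poorly fitted.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 20.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1, 2.0])

def slope(xs, ys):
    return np.polyfit(xs, ys, 1)[0]

full_slope = slope(x, y)

# Refit with one observation left out each time and record the slope change.
changes = [slope(np.delete(x, i), np.delete(y, i)) - full_slope
           for i in range(len(x))]

# The influential remote point produces by far the largest change.
print(int(np.argmax(np.abs(changes))))  # 5
```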

The model we have fitted assumes that birth weight is linearly associated with

gestational age within the range of gestational age in the data. Other models we

have fitted in this course have used categories or grouped data to examine the
relationship between an outcome variable (e.g. log odds) and an explanatory variable.
We can do this with a quantitative outcome too.

For our birth weight data we can group the data according to gestational age.

Categories of       Freq.    Mean of birth    Standard deviation
gestational age              weights          of birth weights
----------------------------------------------------------------
<38 wks             157      2,445.567        705.815
38-39 wks           130      3,186.023        408.357
39-40 wks           166      3,365.193        456.661
>40 wks             188      3,452.223        441.366

The means of birth weight for the four groups increase sequentially from around 2.5

kg to 3.5 kg.

We can fit a simple linear regression model of birth weight with three indicators for

the highest three groups of gestational age and estimate the mean differences

between each of these three groups and the first (i.e. the <38 week category).

bweight |     Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
--------+----------------------------------------------------------------
gestcat |
      2 |  740.4562    61.27115   12.08    0.000      620.1383   860.7741
      3 |  919.6259      57.522   15.99    0.000      806.6702   1032.582
      4 |  1006.657    55.86212   18.02    0.000      896.9604   1116.353
  _cons |  2445.567    41.23697   59.31    0.000       2364.59   2526.544


The intercept, estimated as 2445.567, is the predicted mean value of birth weight in

the baseline category of gestational age (i.e. the <38 week category).

The value 740.4562 for gestcat 2 is the predicted increase in birth weight among

babies in the next gestational age category (38-39 weeks) compared to the baseline

category (<38 weeks).

Similarly 919.6259 and 1006.657 are the predicted increases for the next two

categories (39-40 weeks and >40 weeks), respectively, each compared to the

baseline category (<38 weeks).

The estimated increases can be added to the estimated mean in the baseline

category to get the predicted mean value of birth weight in the four categories.

The resulting equations for the predicted values are

ŷ = 2445.567,
ŷ = 2445.567 + 740.4562 = 3186.023,
ŷ = 2445.567 + 919.6259 = 3365.193 and
ŷ = 2445.567 + 1006.657 = 3452.223

respectively for babies in the first, second, third and fourth categories of
gestational age. These values correspond exactly to those shown in the column
headed Mean of birth weights in the previous table.
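This reconstruction is easy to verify numerically. A small sketch, using the coefficients from the output above (small rounding differences in the last decimal are expected):

```python
# Intercept and category effects taken from the regression output above.
intercept = 2445.567                          # baseline (<38 wks) mean
increases = [740.4562, 919.6259, 1006.657]    # gestcat 2, 3 and 4

predicted = [intercept] + [intercept + d for d in increases]
table_means = [2445.567, 3186.023, 3365.193, 3452.223]

# The predictions reproduce the group means, up to rounding in the output.
for p, m in zip(predicted, table_means):
    assert abs(p - m) < 0.01
```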

Interaction: Button: Show:

Output (appears below):

Categories of       Freq.    Mean of birth    Standard deviation
gestational age              weights          of birth weights
----------------------------------------------------------------
<38 wks             157      2445.567         705.8151
38-39 wks           130      3186.023         408.3571
39-40 wks           166      3365.193         456.661
>40 wks             188      3452.223         441.3662

The predicted values for this regression model can be seen on the scatter plot below.

[Scatter plot of birth weight (grams) against gestational age in weeks, with the four predicted category means shown as horizontal line segments]

You can click swap to add the original line to the same graph.

Interaction: Button: Swap:

Output (figure changes to this and text appears below):

[The same scatter plot with both the category-mean predictions and the original fitted straight line]

We can see that the categorical variable gives a reasonable fit to the data in the

middle values of gestational age, but does particularly poorly at younger gestational

ages.

The first model we fitted assumed a linear relationship between birth weight and

gestational age, while the model using gestational age categories is more flexible

about the shape of the relationship between the two variables, though clearly it

makes some strong assumptions about what that relationship is, e.g. it assumes that

all babies born before 38 weeks have the same mean birth weight. Another option is

to model a quadratic relationship between birth weight and gestational age, which

would allow some departure from a linear relationship.

Consider the (hypothetical) example below, in which it is possible that there is some

curvature.

[Scatter plots of hypothetical data in which some curvature is possible]

To fit a quadratic relationship the model becomes

y = a + bx + cx²

where a is still the estimated birth weight of a baby when x = 0 (i.e. with mean
gestational age) but now the slope of the line is described by two coefficients, b and
c. In this case if b and c were both positive, this would mean that the baby's growth
accelerates over the period of gestation that we are examining. If b is positive and c
is negative then the baby's growth rate would be slowing down over the period of
gestation that we are examining. Note that interpretation of the two parameter
estimates, b and c, is less straightforward than interpretation of the parameter
estimates when x was categorical.
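Mechanically, fitting the quadratic amounts to adding the squared centred variable as an extra column of the design matrix. A sketch on synthetic data (variable names and values illustrative, not from the IVF dataset):

```python
import numpy as np

# Synthetic data with a mild quadratic relationship (values illustrative).
rng = np.random.default_rng(1)
gest = rng.uniform(25, 42, size=200)
mgest = gest - gest.mean()                    # mean-centre, as in the session
bweight = 3100 + 190 * mgest - 3 * mgest**2 + rng.normal(0, 400, 200)

# Design matrix with a constant, the centred variable and its square.
X = np.column_stack([np.ones_like(mgest), mgest, mgest**2])
coef, *_ = np.linalg.lstsq(X, bweight, rcond=None)
a, b, c = coef   # intercept, linear and quadratic coefficients
```

With enough data the estimates land close to the generating values (190 and −3 here), illustrating how the squared column captures the curvature.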

We can fit this model by first calculating a new variable that is the square of the

mean-centred gestational age e.g. mgest2=mgest^2 and then fitting a linear

regression model on both the mean-centred gestational age and its square:

bweight |     Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
--------+----------------------------------------------------------------
  mgest |  193.2323    10.73898   17.99    0.000      172.1443   214.3203
 mgest2 | -2.796836    1.608682   -1.74    0.083     -5.955789   .3621158
  _cons |  3144.296    19.46009  161.58    0.000      3106.083    3182.51

Birth weight = 3144.3 + 193.2*mgest − 2.8*mgest²
             = 3144.3 + 193.2*(age − 38.7) − 2.8*(age − 38.7)²


The graph below shows both the fitted linear and quadratic relationships. We can see

that between 32 and 42 weeks gestation the lines are virtually identical, and it is

only for babies born before 32 weeks that there appears to be any curvature. There

are very few babies in the dataset born before about 32 weeks. It is unsurprising

that there is little statistical evidence of departure from a linear relationship (P=0.08

for mgest2 in the output).

[Scatter plot of birth weight (grams) against gestational age in weeks, with the fitted linear and quadratic curves labelled]

Just as we have done with Poisson and logistic regression models, we can include

several covariates in a linear regression model. For example, we can fit a model of

birth weight to gestational age and gender (0=male, 1=female):

bweight |     Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
--------+----------------------------------------------------------------
  mgest |  206.4446    7.363321   28.04    0.000      191.9853   220.9039
 gender | -161.7075    34.29034   -4.72    0.000     -229.0431  -94.37194
  _cons |  3208.604    24.03779  133.48    0.000      3161.401   3255.806


The value for mgest of 206.4 gives us the estimated increase in birth weight for each

additional week of gestation, after adjusting for gender. The value -161.7 shows

that females (coded 1 in the data) are estimated to be 161.7 g lighter than males

(coded 0 in the data), after adjusting for week of gestation. Note that gender is not

a confounder between gestational age and birth weight here, because adjusting for it

barely changes the mgest estimate (from 206.6412 to 206.4446). However, we can

see from the p-value and confidence interval that gender is independently associated

with birth weight.

What is the regression equation fitted here?

Interaction: thought bubble:

Output (appears below):

Birth weight = 3208.6 + 206.4*mgest − 161.7*gender
             = 3208.6 + 206.4*(gestational age − 38.7) − 161.7*gender

where gender takes the value 0 for males and 1 for females

We can plot two separate lines of predicted birth weight by gestational age for

males and females on the scatter plot.

[Scatter plot of birth weight (grams) against gestational age in weeks, with two parallel fitted lines for males and females; the vertical difference between the lines is 161.7 g]

This makes clear that we are fitting two parallel lines to the data, with equal

gradients but different intercepts.

What are the regression equations for males and females separately?

Interaction: thought bubble:

Output (appears below):

Males

Birth weight = 3208.6 + 206.4*(gestational age − 38.7) − 161.7*0
             = 3208.6 + 206.4*(gestational age − 38.7)

Females

Birth weight = 3208.6 + 206.4*(gestational age − 38.7) − 161.7*1
             = 3208.6 − 161.7 + 206.4*(gestational age − 38.7)
             = 3046.9 + 206.4*(gestational age − 38.7)

So we can see that the gradients are the same, but the intercepts are different.
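Because the two equations differ only in their intercept, the predicted male-female gap is the same 161.7 g at every gestational age. A small sketch, using the coefficients from the output above:

```python
# Coefficients taken from the regression output above.
a, b_gest, b_female, mean_gest = 3208.6, 206.4, -161.7, 38.7

def predict(gest_age, female):
    """Predicted birth weight (g); female is 0 for males, 1 for females."""
    return a + b_gest * (gest_age - mean_gest) + b_female * female

# Parallel lines: the male-female gap is constant across gestational age.
for age in (30, 35, 40):
    print(round(predict(age, 0) - predict(age, 1), 1))  # 161.7 each time
```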

When there is more than one covariate in the model, the goodness-of-fit checking
should also include plots of the residuals against each covariate. With one
covariate this is not necessary, as the fitted value ŷ is determined entirely by x.

Note that the size of the effect estimate depends on the units of the covariate. For
example, here gestational age is measured in weeks, but if it had been measured in
days the parameter for gestational age would be 7 times smaller (since it would be
the estimated increase per day, not per week). It is useful to consider the clinical
impact of the covariate on the outcome in the wider population. For example, a
baby born at 32 weeks is estimated to be 1.65 kg lighter than a baby born at 40
weeks (206.44 × (−8) = −1652 g).

When testing for the significance of an association between the quantitative

outcome and a covariate that has a single parameter, we can use the p-value for

the parameter in the output table, as we did for the covariate gender above.

However, if we want to test the impact of several parameters simultaneously we

need to use an F test from the analysis of variance table. The F test in linear

regression is analogous to the likelihood ratio test in logistic or Poisson regression.

A measure that allows us to evaluate how well a particular linear regression model
fits the data is the sum of squares of the differences between the observed values
of the outcome variable and the values predicted by the model, i.e. the residual
sum of squares,

Σ(y_i − ŷ_i)²

This is obtained from an analysis of variance table. For example, the analysis of

variance table produced when fitting the model with gestational age categories

was:

  Source |        SS    df           MS
---------+---------------------------------
   Model |  102656030     3   34218676.7
Residual |  170064092   637   266976.596
---------+---------------------------------
   Total |  272720122   640    426125.19


If the residual sum of squares is zero the regression line would fit the data

perfectly, that is every observed point would lie on the fitted line. By contrast

larger values would indicate worse fits, since the deviations of the observed points
from the regression line will be larger. Two possible factors contribute to a high
residual sum of squares: either there is a lot of variation in the data, i.e. σ² is large,
or the model does not explain much of the variation observed.


The residual variance σ² is estimated as:

σ̂² = residual sum of squares / df

where df is the degrees of freedom. This can be calculated as:

170064092 / 637 = 266976.596

This value is known as the residual mean square (MS, shown in the last column of
the table), and its square root is called the Root MSE (root mean square error = 516.7).

The larger the Model MS compared to the Residual MS, the better the model fits the
data. Under the null hypothesis of no effect of gestational age, the Model MS and
the Residual MS are two independent estimates of σ² and their ratio is expected to
be 1. We can then test the significance of the current model by dividing these two
terms:

F = Model MS / Residual MS

Under the null hypothesis of no effect of gestational age on birth weight, this
statistic would be expected to follow the F distribution with the appropriate degrees
of freedom. This is written as F(3, 637), as we have fitted three parameters for the
age categories and we are then left with 641 observations minus these three
parameters minus the constant, leaving 637 degrees of freedom for the residual SS.

Calculate the F statistic for the Null hypothesis of no relationship between birth

weight and gestational age to one decimal place.

Interaction: Calculation: (calc)

Incorrect answer:

No.

F = Model MS / Residual MS = 34218676.7 / 266976.596 = 128.2

Correct answer:

Correct

Yes.

F = Model MS / Residual MS = 34218676.7 / 266976.596 = 128.2

If there is a strong relationship between the outcome and the exposure variables

used in the model, then the Model MS will be much larger than the Residual MS,

and F will be (substantially) greater than 1. If there is no relationship then, on

average, the Model MS will be equal to the Residual MS and F will be 1. In this

example the value of F is much larger than 1, so there is evidence that birth weight

changes with gestational age. The p-value for the null hypothesis that there is no

relationship, based on F = 128.2 is p<0.0001.
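The F statistic and its p-value can be reproduced from the analysis of variance table. A sketch, assuming SciPy is available for the F distribution:

```python
from scipy.stats import f  # SciPy assumed available for the F distribution

model_ms = 34218676.7      # Model MS from the analysis of variance table
residual_ms = 266976.596   # Residual MS (the estimate of sigma squared)
df_model, df_resid = 3, 637

F = model_ms / residual_ms
root_mse = residual_ms ** 0.5           # about 516.7, the Root MSE
p_value = f.sf(F, df_model, df_resid)   # upper-tail probability

print(round(F, 1))        # 128.2
print(p_value < 0.0001)   # True
```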

If we include gender in the model with gestational age as categorical, we get the

output below.

  Source |        SS    df           MS
---------+---------------------------------
   Model |  106953800     4     26738450
Residual |  165766322   636   260638.871
---------+---------------------------------
   Total |  272720122   640    426125.19

bweight |     Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
--------+----------------------------------------------------------------
gestcat |
      2 |  736.1429    60.54885   12.16    0.000       617.243   855.0427
      3 |  909.2086    56.89301   15.98    0.000      797.4877   1020.929
      4 |  1012.249    55.21226   18.33    0.000      903.8286   1120.669
 gender | -164.2448    40.44732   -4.06    0.000     -243.6713  -84.81839
  _cons |  2528.212    45.54496   55.51    0.000      2438.776   2617.649

The analysis of variance table for this model is at the top. For this model, our null

hypothesis is that there is no effect of gestational age (considered in categories) or

sex. A test of whether the true effects of both explanatory variables are zero is

made by examining F(4, 636) = 102.59 and referring this to the F distribution. The

p-value for this hypothesis is again <0.001, indicating strong evidence that the two

exposure variables gestational age categories and gender taken together contribute

to explain the data variability.

Adjusting for gender has made very little difference to the parameter estimates for

gestational age so gender is not a confounder here. But for demonstration, in

order to assess whether gestational age (as categorical) is associated with birth

weight after adjusting for gender we need to do a partial F test.

The F statistic is F(3, 636) = 131.06, which gives a p-value of <0.001, so there is

very strong evidence of an association between gestational age category and birth

weight independently of gender.

This partial F test is analogous to the likelihood ratio test in logistic or Poisson regression. For a likelihood ratio test you would need to fit one model with gestational age and gender and compare it to a model with just gender; with linear regression the partial F test can be obtained from the full model alone, so the second model does not need to be fitted explicitly.
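To make the mechanics of the partial F test concrete, the sketch below fits a full model (gestational-age dummies plus gender) and a reduced model (gender only) by least squares and forms F from the drop in the residual sum of squares. The data, effect sizes and noise level are all invented for illustration; only the sample size (641) and residual degrees of freedom (636) mimic the example above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 641  # same sample size as the example above

# Simulated data: 4 gestational-age categories (1-4) and gender (0/1);
# the coefficients below are invented, not the fitted values from the text
gest = rng.integers(1, 5, size=n)
gender = rng.integers(0, 2, size=n)
bweight = 2500 + 250 * gest - 160 * gender + rng.normal(0, 450, size=n)

def rss(X, y):
    """Residual sum of squares from an ordinary least-squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ beta) ** 2))

ones = np.ones(n)
dummies = np.column_stack([(gest == k).astype(float) for k in (2, 3, 4)])

X_full = np.column_stack([ones, dummies, gender])  # gestcat + gender
X_red = np.column_stack([ones, gender])            # gender only

q = 3                           # extra parameters in the full model
df_resid = n - X_full.shape[1]  # 641 - 5 = 636
partial_f = ((rss(X_red, bweight) - rss(X_full, bweight)) / q) / (
    rss(X_full, bweight) / df_resid)
```

A large partial F (referred to the F distribution with q and df_resid degrees of freedom) is evidence that the gestational-age dummies explain variation beyond what gender explains.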

Section 9: Summary

This is the end of AS13.

The main points of this session are summarised below.

Fitting regression equation

Fitting a linear regression equation to data from two quantitative variables uses

one of two approaches, both of which give the same result: least squares

minimises the sum of squares of the residuals, while maximum likelihood assumes the residuals follow a Normal distribution with mean 0 and constant standard deviation σ.
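For a single explanatory variable the least-squares estimates have a simple closed form: the slope is the sum of cross-products of deviations from the means divided by the sum of squared deviations of x, and the intercept makes the residuals average to zero. A minimal sketch on simulated data (all numbers invented):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(30, 42, 200)                    # e.g. gestational age in weeks
y = -3000 + 160 * x + rng.normal(0, 300, 200)   # e.g. birth weight in grams

# Slope b = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2);
# intercept a chosen so the residuals sum to zero
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

residuals = y - (a + b * x)   # sum to zero by construction
```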

Checking assumptions

Linearity can be checked by a scatter-plot of the outcome and explanatory

variables, Normality of residuals can be checked with a histogram of standardised

residuals and constancy of the variance of the residuals can be checked by

examining a scatter-plot of the residuals plotted against the fitted values.
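The numerical part of these checks can be sketched as follows: fit the model, form the residuals, and standardise them by the estimated residual standard deviation; a histogram of the standardised residuals checks Normality, and plotting them against the fitted values checks for constant variance. Simulated data with invented values:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(30, 42, 250)
y = 100 * x + rng.normal(0, 250, 250)   # invented outcome

X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
resid = y - fitted

# Standardise by the residual SD (ddof=2: two parameters estimated)
std_resid = resid / resid.std(ddof=2)
# A histogram of std_resid should look Normal(0, 1); a scatter of
# std_resid against fitted should show no pattern or fanning-out.
```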

Allowing for a non-linear relationship

Two common ways of allowing for a non-linear relationship are to either:

categorise the explanatory variable or to allow for a quadratic relationship by

including both the explanatory variable and its square in the regression.
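The quadratic option simply adds the squared term as an extra column in the design matrix. A sketch on simulated data where the true relationship curves (all coefficients invented):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(25, 42, 300)
# True curve flattens at higher x (negative quadratic coefficient)
y = -8000 + 480 * x - 5 * x**2 + rng.normal(0, 200, 300)

X = np.column_stack([np.ones_like(x), x, x**2])  # intercept, x, x squared
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
# beta[2] estimates the curvature; a value clearly below zero indicates
# that the relationship is not a straight line.
```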

Multiple regression

As for logistic, Poisson and Cox regression, multiple variables can be included in the regression equation to produce estimates that are adjusted for the other variables. The partial F test is used instead of the likelihood ratio test to examine the evidence for some variables after adjusting for the other variable(s).
