
Kelli Peterman

EPID 602
Winter 2019
Homework 4: Logistic Regression

In the homework assignments we will be using data from the Child Health and Development
Studies to answer the question: Is girls’ age of menarche affected by their mothers’ cigarette
smoking during pregnancy? The data sets are located on the Epid 602 Canvas. In the fourth
assignment we will use logistic regression to explore the associations between mothers’
cigarette smoking during pregnancy and daughters’ ages at menarche.

Please remember that you may consult with each other while working on this assignment but
you must run your own code, write your answers in your own words, and submit your own
assignment. Please upload an electronic Word document (no PDFs will be accepted) of your
assignment with your code pasted in on the last page via Canvas no later than 1PM on
3/29/19. Only output that is relevant to the questions should be pasted in below each question.
Please do not change the numbering or ordering of the questions.

Be sure that:
1) Your SAS code runs from start to finish.
2) Your results make sense (check your sample size and look for unreasonable,
unlikely, or impossible answers).
3) Your code is well commented (the top of your file should include the
homework number and your name, each question should be identified in the
code, and each new task should be described by comments) and formatted
(indentation and carriage returns should be used to improve readability). 5% will
be deducted if either of these two tasks is not completed.

1. (1 point) We’re going to use the same dataset as we did for HW3: hw3and4. You should
already have a copy of this dataset in the folder you’re using for your chds library. If not,
you can find it in the CHDS folder on the Canvas site. (You might also want to download it
again in case you changed it while you were doing HW3.) Remember that all files and
libraries should be located in your Private folder on your IFS space or similar. Create a
temporary dataset called hw4 containing all the observations and variables from the
permanent dataset hw3and4. If you like using formats, also include a separate libname
statement to reference your format library from HW2. Paste a copy of the log output from
the LIBNAME and DATA statements into your homework. (1 point)
NOTE: Libref CHDS was successfully assigned as follows:
Engine: V9
Physical Name: K:\Epid602\HW4
40 *Search for formats in library;
41 OPTIONS FMTSEARCH = (CHDS);
NOTE: There were 1003 observations read from the data set CHDS.HW3AND4.
NOTE: The data set WORK.HW4 has 1003 observations and 25 variables.

2. (4 points) Just as in HW3, we’re going to conduct a complete-case analysis that excludes
observations with missing data. This time, though, our outcome is the binary measure of
late age at menarche (LATEMENS).
a. Create a new dataset based on hw4 containing only observations with nonmissing
information for the outcome (LATEMENS), exposure (MOMCIGS), and covariates
(MOMMENS3, PARITY3, MOMED3, INCOME3, and RACE3).

b. How many observations are in the new dataset? What percent of observations in the
original dataset contained missing information?
There are 836 observations in the new dataset. Of the 1003 observations in the original
dataset, 1003 - 836 = 167 (167/1003*100 = 16.65%) contained missing information.

c. Why is the number of observations in the dataset for HW4 different from the dataset
for HW3? If you don’t remember, go back to HW2 and check how we coded the
continuous outcome variable (TEENMENS) and the binary one (LATEMENS).
For TEENMENS we assigned missing to any girl who had not reached menarche by the time of the
interview. For LATEMENS, we assigned missing to girls who had not reached menarche by the
time of the interview only if they were under the age of 14, and 1 to girls who had not reached
menarche at the time of the interview if they were 14 or older, because they were already
considered to have late menarche at that point. Thus TEENMENS has more missing data
(22 missing) than LATEMENS (9 missing).

3. (12 points) We’ll start with a simple logistic regression.


a. In your new dataset, use PROC LOGISTIC (or another procedure you prefer, such as
PROC GENMOD) to run a simple logistic regression of late age at menarche (LATEMENS)
on the simplified variable of mothers’ smoking during pregnancy (MOMCIGS).
 Treat MOMCIGS as a class, rather than ordinal, variable (i.e., do include a CLASS
statement or create dummy variables).

 Remember to specify the reference group: Use the mothers who did not smoke at
all during pregnancy as the reference group.
 Remember to include the option /PARAM=REF at the end of the class statement to
get the parameterization we want for interpreting the coefficients. If you do not
do this, your coefficients will not correspond to your odds ratios.
 Remember to tell SAS that you want to model the probability of having late
menarche (LATEMENS=1).

b. Report and interpret the coefficients (NOT the odds ratios) for the exposure variable.

Compared with girls whose mothers did not smoke during pregnancy (MOMCIGS = 0), the log odds
of late menarche are 0.1431 lower for girls in the MOMCIGS1 category, 0.00926 higher for girls
in the MOMCIGS2 category, and 0.1315 lower for girls in the MOMCIGS3 category.
c. Show the calculations to convert the coefficients for the exposure variable and their
standard errors into odds ratios with 95% confidence intervals.
MOMCIGS1 OR: exp(-0.1431) = 0.867
95% CI: exp(-0.1431 ± 1.96(0.2437)) = (exp(-0.6207), exp(0.3345)) = (0.538, 1.397)
MOMCIGS2 OR: exp(0.00926) = 1.009
95% CI: exp(0.00926 ± 1.96(0.3238)) = (exp(-0.625), exp(0.6439)) = (0.535, 1.904)
MOMCIGS3 OR: exp(-0.1315) = 0.877
95% CI: exp(-0.1315 ± 1.96(0.2366)) = (exp(-0.595), exp(0.3322)) = (0.551, 1.394)
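As a cross-check on the hand calculations in part c, the same conversion can be sketched in a short DATA step. This is only an illustrative sketch and not part of the assigned output: the dataset name OR_CHECK and the variable names LEVEL, BETA, SE, OR, LOWER, and UPPER are made up here, and the coefficients and standard errors are typed in from the values reported above.

```sas
/* Sketch: convert each coefficient and its standard error into an odds ratio */
/* with a 95% Wald confidence interval. All names below are hypothetical.     */
DATA OR_CHECK;
    INPUT LEVEL $ BETA SE;
    OR    = EXP(BETA);              /* odds ratio                 */
    LOWER = EXP(BETA - 1.96*SE);    /* lower 95% confidence limit */
    UPPER = EXP(BETA + 1.96*SE);    /* upper 95% confidence limit */
    DATALINES;
MOMCIGS1 -0.1431 0.2437
MOMCIGS2 0.00926 0.3238
MOMCIGS3 -0.1315 0.2366
;
RUN;

/* Print the results rounded to three decimals */
PROC PRINT DATA=OR_CHECK NOOBS;
    FORMAT OR LOWER UPPER 6.3;
RUN;
```

The printed values should agree with the hand calculations up to rounding, and with the odds ratio estimates that PROC LOGISTIC reports for the same model.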
d. Report and interpret the odds ratios with confidence intervals.
Compared with girls whose mothers did not smoke during pregnancy (MOMCIGS = 0), the odds of
late menarche are 13.3% lower among girls in the MOMCIGS1 category (OR = 0.867; 95% CI: 0.538,
1.397), 0.9% higher among girls in the MOMCIGS2 category (OR = 1.009; 95% CI: 0.535, 1.904),
and 12.3% lower among girls in the MOMCIGS3 category (OR = 0.877; 95% CI: 0.551, 1.394). All
three confidence intervals include 1, so none of these odds ratios is statistically significant.
4. (4 points) One thing we have to watch out for in the adjusted model is sparse or empty
cells. Recheck Tables 1 and 2 from HW2 and look for sparse or empty cells. It looks like we
have a problem with the race variable.

a. Where are our sparse (i.e., fewer than 10 observations) and/or empty cells with respect
to race? How many observations are there in these cells?

The cell for late menarche and race = other in Table 1 is close to being sparse
with 10 observations in this cell. In Table 2, the cell for race = other and 1-9 cigarettes
per day is sparse, with 1 observation; the cell for race = other and 20 or more cigarettes
per day is also sparse, with 7 observations. The cell for race = other and 10-19 cigarettes
per day is empty, with 0 observations.
b. Create a new binary race variable that combines the Black and Other Race categories
into a single Nonwhite category. Create a 2-way frequency table of the new race
variable and the prenatal smoking variable (MOMCIGS) and paste the table below.

5. (12 points) Now we’ll adjust for covariates. Just as in HW3, we’re going to skip the model-
building steps we discussed in class and all run the same adjusted model.
a. Run a multiple logistic regression adjusted for mother’s age at menarche (MOMMENS3),
mother’s parity (PARITY3), mother’s education (MOMED3), family income (INCOME3),
and child’s race (using your new binary race variable).
 Include a CLASS statement or create dummy variables for all variables. Just as
before, use observations whose mothers did not smoke at all during pregnancy as
the referent group for MOMCIGS. Don’t worry about which group is the referent
group for the other variables.
 Remember to include the option /PARAM=REF at the end of the class statement to
get the parameterization we want for interpreting the coefficients. If you do not
do this, your coefficients will not correspond to your odds ratios.
 Ask SAS to provide some output we can use to assess model fit: the Hosmer-
Lemeshow test and the influence plot panels. In addition, save the output, including
PROB and DIFCHISQ, as a new dataset.

b. Using SAS output or your own calculations from the coefficients, report and interpret
the odds ratios (including confidence intervals) for mother’s prenatal smoking. Do these
estimates differ substantially from the simple regression estimates?

Compared with the no smoking referent group, the odds of late menarche are 7.8% lower in the
1-9 cigarettes/day category (95% CI for the OR: 0.562, 1.512), 1.3% lower in the 10-19
cigarettes/day category (95% CI for the OR: 0.514, 1.895), and 7.8% lower in the 20 or more
cigarettes/day category (95% CI for the OR: 0.566, 1.500).
Neither the beta coefficients nor the odds ratios differ substantially between the simple
regression and the adjusted model. The odds ratio for 10-19 cigarettes/day was slightly above 1
in the simple regression and is slightly below 1 in the adjusted model, but the confidence
intervals for both odds ratios include 1, meaning neither odds ratio is statistically
significant. The other odds ratios change only slightly, and their confidence intervals also
include 1.
c. Do you think the unadjusted results were confounded by the other variables?
The unadjusted results do not seem to be confounded by the other variables because
the estimates did not change substantially between the unadjusted and adjusted models.
d. According to the Type 3 analysis, which covariates are statistically significantly
associated with the outcome at the 95% confidence level when the other variables are
included in the model?
MOMMENS3 is statistically significantly associated with the outcome (p < 0.0001). The
dichotomous race variable narrowly misses significance at the 0.05 level (p = 0.054).
e. What is the null hypothesis for Hosmer-Lemeshow test? What is your conclusion for this
model based on this test (i.e., does the model fit the data)?
The null hypothesis is that the model fits the data adequately. With p = 0.3589, we fail to
reject the null hypothesis: there is no evidence of lack of fit, so we conclude that the model
fits the data adequately.

6. (4 points) We’re going to check for influential observations.

a. Paste the first panel of four plots from the influence diagnostics plots output (includes
Pearson residuals, deviance residuals, and leverage) from your model in question 5
below.

b. Using the output dataset from the model, plot the Pearson chi-square difference
(DIFCHISQ) vs. predicted probabilities (PROB). Recall that DIFCHISQ specifies the change
in the chi-square goodness-of-fit statistic attributable to deleting the individual
observation. Are there any observations that appear to be influential based on this plot?
What are their CHILD ID values?

There appear to be 3 possibly influential observations: CHILD ID 377115, CHILD ID 460415,
and CHILD ID 400415.
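
One way to confirm which points stand out in the plot is to list the observations with the largest DIFCHISQ values. The sketch below assumes the output dataset DIAG from question 5 (with PROB and DIFCHISQ saved) and assumes the ID variable is named CHILD, as the POINTLABEL option in the plotting code suggests; DIAG_SORTED is a hypothetical dataset name, and printing five rows is an arbitrary choice.

```sas
/* Sketch: rank observations by their Pearson chi-square difference.      */
/* Assumes DIAG exists and contains CHILD, PROB, and DIFCHISQ.            */
PROC SORT DATA=DIAG OUT=DIAG_SORTED;
    BY DESCENDING DIFCHISQ;
RUN;

/* Show the five observations with the largest DIFCHISQ values */
PROC PRINT DATA=DIAG_SORTED (OBS=5) NOOBS;
    VAR CHILD PROB DIFCHISQ;
RUN;
```

The IDs at the top of this listing can then be compared with the labeled points on the DIFCHISQ vs. PROB plot.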

7. (12 points) We’re going to check for effect measure modification of the association
between maternal prenatal smoking (MOMCIGS) and age at menarche (LATEMENS) by race
(the new binary race variable). We’ll do it two different ways so we can compare them. Do
not remove any observations based on your conclusions from question 6.

a. First, we’ll use interaction terms. Run a multiple logistic regression using all the
adjustment variables from question 5. In addition, include interaction terms for your
new binary race variable and maternal prenatal smoking.
 Include a CLASS statement or create dummy variables for all variables.
 Just as before, use observations whose mothers did not smoke at all during
pregnancy as the referent group.
 Use white race as the referent group for the binary race variable.
 Don’t worry about which group is the referent group for the other variables.
 Remember to include the option /PARAM=REF in the class statement to get the
parameterization we want for interpreting the coefficients.

a. Are we using these product terms to test for statistical or biological interaction? Explain.
We are testing for statistical interaction. Product terms in a regression model test
whether the association between prenatal smoking and late menarche differs across race
groups on the scale of the model, which is a property of the data and the model. Biological
interaction, by contrast, means that two causes act together mechanistically to produce
an event; it either exists or does not, and product terms alone cannot establish it.
b. Are we testing for additive or multiplicative interaction? Explain.
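We are testing for multiplicative interaction. Logistic regression is linear on the
log-odds scale, so the product terms measure departure from multiplicativity of the odds
ratios. Testing for additive interaction would require a model on the additive scale,
such as linear regression.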
c. According to the Type 3 analysis, do we have evidence of interaction? Explain.
There is no evidence of interaction because the p value for the interaction term is
greater than 0.05, p=0.5154.
d. Calculate a likelihood-ratio test comparing the model from question 7 to the model from
question 5. Show your calculations, and report and interpret the result. Is your
conclusion different from in part c?
-2 Log L without interaction: 873.473
-2 Log L with interaction: 870.603
LRT: 873.473 - 870.603 = 2.87
Df = 15 – 12 = 3
P = 0.412
We fail to reject the null hypothesis because the likelihood ratio statistic is 2.87 on 3
degrees of freedom, p = 0.412, which is not significant. There is no evidence that adding the
interaction terms improves the fit of the model. The conclusion is the same as in part c.
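The p-value for the likelihood-ratio statistic can be checked with the chi-square distribution function. This is a minimal sketch with hypothetical variable names (NEG2LL_REDUCED, NEG2LL_FULL, DF, LRT, P); it assumes the two fit statistics used in the calculation above are the -2 Log L values of the models without and with the interaction terms.

```sas
/* Sketch: likelihood-ratio test comparing nested models. */
/* Values are typed in from the calculation above.        */
DATA _NULL_;
    NEG2LL_REDUCED = 873.473;   /* model without interaction terms */
    NEG2LL_FULL    = 870.603;   /* model with interaction terms    */
    DF             = 3;         /* 15 - 12 parameters              */
    LRT = NEG2LL_REDUCED - NEG2LL_FULL;
    P   = 1 - PROBCHI(LRT, DF); /* upper-tail chi-square probability */
    PUT LRT= 6.3 P= 6.4;        /* written to the SAS log */
RUN;
```

The result is written to the log and should agree with the hand calculation (LRT = 2.87, p of about 0.41).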
e. Next we will use the coefficients from this interaction model to calculate by hand,
report, and interpret the following odds ratios. Show all your calculations. You do not
have to calculate or report confidence intervals (HINT: you can check your answers with
SAS using approaches from the lab in week 10).

(1) nonwhite girls whose mothers smoked 20 or more cigarettes/day vs. nonwhite girls
whose mothers smoked zero cigarettes/day
OR = exp(0.017 – 0.4076 – 1.3128) / exp(-0.4076) = 0.182/0.665 = 0.274
The odds of late menarche for nonwhite girls whose mother smoked 20 or more
cigarettes/day are 72.6% lower than nonwhite girls whose mothers smoked zero
cigarettes per day.
(2) nonwhite girls whose mothers smoked 20 or more cigarettes/day vs. white girls
whose mothers smoked zero cigarettes/day.
OR = exp(0.017 – 0.4076 – 1.3128) / exp(0) = 0.182/1 = 0.182
The odds of late menarche for nonwhite girls whose mothers smoked 20 or more
cigarettes/day are 81.8% lower than white girls whose mothers smoked zero
cigarettes/day.
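
The two hand calculations above can be double-checked in a DATA step. This is a sketch under stated assumptions: the variable names (B_CIGS3, B_RACE, B_INT, OR_1, OR_2) are hypothetical, and the roles assigned to the three coefficients are inferred from how they are used in the hand calculations (0.017 for 20 or more cigarettes/day, -0.4076 for nonwhite race, -1.3128 for their product term).

```sas
/* Sketch: odds ratios from the interaction model, computed from coefficients. */
DATA _NULL_;
    B_CIGS3 = 0.017;     /* 20+ cigarettes/day vs. 0 (white girls)    */
    B_RACE  = -0.4076;   /* nonwhite vs. white (nonsmoking mothers)   */
    B_INT   = -1.3128;   /* nonwhite by 20+ cigarettes/day interaction */

    /* (1) nonwhite, 20+ vs. nonwhite, 0: the race term cancels */
    OR_1 = EXP(B_CIGS3 + B_INT);

    /* (2) nonwhite, 20+ vs. white, 0: all three terms contribute */
    OR_2 = EXP(B_CIGS3 + B_RACE + B_INT);

    PUT OR_1= 6.3 OR_2= 6.3;   /* written to the SAS log */
RUN;
```

Because the race coefficient appears in both the numerator and the denominator of comparison (1), dropping it gives the same answer as dividing the two exponentiated sums.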

8. (6 points) Now we’ll use composite dummy variables instead of interaction terms.
a. Create a set of dummy variables to represent each exposure–race group (there are 8
groups in all). Then run a multiple logistic regression using all the adjustment variables
from question 5 and including the composite dummy variables.
 Use white race girls with nonsmoking mothers as the referent group.
 Include a CLASS statement or create dummy variables for the confounder variables.
 Remember to include the option /PARAM=REF at the end of the class statement to
get the parameterization we want for interpreting the coefficients.

b. Report and interpret the odds ratios, including 95% confidence intervals, for (1) white
girls whose mothers smoked 20 or more cigarettes/day and (2) nonwhite girls whose
mothers smoked 20 or more cigarettes/day. Do not forget to include the referent group
from the model in your interpretations.
1) The odds of late menarche in white girls whose mothers smoked 20 or more cigarettes/day
are 1.017 times (95% CI: 0.610, 1.697) the odds in white girls whose mothers smoked no
cigarettes.
2) The odds of late menarche are 81.8% lower (OR = 0.182; 95% CI: 0.023, 1.425) in nonwhite
girls whose mothers smoked 20 or more cigarettes/day than in white girls whose mothers
smoked no cigarettes.

9. (10 points)
a. Use the interaction model to fill in the table below. Make sure you are reporting results
stratified by race (take note of the reference groups specified in the table). You should
use the composite dummy variable approach from question 8 to answer this question
(don’t forget to change the referent group in the model when appropriate). Please note
that you could also use the model from question 7 to answer this question. Don’t forget
to fill in the footnote.

Table 4. Adjusted relative odds of late menarche, Child Health and Development Study,
California, pregnancy years 1959–1966 [a]

                              Whites                      Non-Whites
Maternal prenatal smoking     Odds     95% Confidence     Odds     95% Confidence
(cigarettes/day)              Ratio    Interval           Ratio    Interval
0                             1.00     Referent           1.00     Referent
1–9                           0.835    (0.470, 1.483)     1.246    (0.477, 3.253)
10–19                         0.939    (0.464, 1.901)     1.403    (0.258, 7.615)
≥ 20                          1.017    (0.610, 1.697)     0.274    (0.034, 2.215)

[a] n=836, adjusted by mother’s age at menarche, mother’s parity, mother’s education, and
family income.

b. Write a short (2–5 sentences) summary of the results of your analysis. Include odds
ratios and 95% confidence intervals as appropriate. The goal is to present a full and
accurate picture of your results while also remaining concise.
The results in Table 4 come from a logistic regression model in which prenatal smoking and
race predicted the odds of late menarche, adjusting for mother’s age at menarche, mother’s
parity, mother’s education, and family income. Among white girls, compared with those whose
mothers did not smoke during pregnancy, the odds of late menarche were 16.5% lower for 1-9
cigarettes/day (OR = 0.835; 95% CI: 0.470, 1.483), 6.1% lower for 10-19 cigarettes/day
(OR = 0.939; 95% CI: 0.464, 1.901), and 1.7% higher for 20 or more cigarettes/day (OR = 1.017;
95% CI: 0.610, 1.697). Among non-white girls, compared with those whose mothers did not smoke
during pregnancy, the odds of late menarche were 24.6% higher for 1-9 cigarettes/day
(OR = 1.246; 95% CI: 0.477, 3.253), 40.3% higher for 10-19 cigarettes/day (OR = 1.403; 95% CI:
0.258, 7.615), and 72.6% lower for 20 or more cigarettes/day (OR = 0.274; 95% CI: 0.034,
2.215). None of these results is statistically significant: every confidence interval
includes 1.

/*************************************************
EPID 602
MARCH 29, 2019
KELLI PETERMAN
HOMEWORK 4
*************************************************/

*Question 1: Creating a permanent library;
LIBNAME CHDS "K:\Epid602\HW4";

*Search for formats in library;
OPTIONS FMTSEARCH = (CHDS);

*Creating temporary dataset;
DATA HW4;
    SET CHDS.HW3AND4;
RUN;

*Question 2;
*Checking missing data;
PROC FREQ DATA=HW4;
    TABLES LATEMENS MOMCIGS MOMMENS3 PARITY3 MOMED3 INCOME3 RACE3;
RUN;

*Creating dataset with no missing;
DATA NOMISS4;
    SET HW4;
    IF LATEMENS = . OR MOMCIGS = . OR MOMMENS3 = . OR PARITY3 = .
       OR MOMED3 = . OR INCOME3 = . OR RACE3 = . THEN DELETE;
RUN;

*Checking nonmissing dataset;
PROC FREQ DATA=NOMISS4;
    TABLES LATEMENS MOMCIGS MOMMENS3 PARITY3 MOMED3 INCOME3 RACE3;
RUN;

*Comparing missing data;
PROC FREQ DATA=HW4;
    TABLES TEENMENS LATEMENS;
RUN;

*Question 3;
*Logistic regression;
PROC LOGISTIC DATA=NOMISS4;
    CLASS MOMCIGS (REF="0") / PARAM=REF;
    MODEL LATEMENS (EVENT="1") = MOMCIGS;
RUN;

*Question 4;
*Collapse Black and Other races into Nonwhite;
DATA NOMISS4;
    SET NOMISS4;
    IF RACE3 = 2 OR RACE3 = 3 THEN RACEDI = 2;
    ELSE RACEDI = RACE3;
RUN;

*Checking recoding;
PROC FREQ DATA=NOMISS4;
    TABLES RACE3*RACEDI;
RUN;

*2 way frequency table of race and momcigs;
PROC FREQ DATA=NOMISS4;
    TABLES RACEDI*MOMCIGS;
RUN;

*Question 5;
*Multiple logistic regression;
ODS GRAPHICS ON;
PROC LOGISTIC DATA=NOMISS4 DESCENDING;
    CLASS MOMCIGS (REF="0") MOMMENS3 PARITY3 MOMED3 INCOME3
          RACEDI / PARAM=REF;
    MODEL LATEMENS (EVENT="1") = MOMCIGS MOMMENS3 PARITY3 MOMED3 INCOME3
                                 RACEDI / LACKFIT IPLOTS;
    OUTPUT OUT=DIAG P=PROB DIFCHISQ=DIFCHISQ;
RUN;
ODS GRAPHICS OFF;

*Question 6;
*Plotting influence diagnostics plot output;
GOPTIONS RESET = ALL;
SYMBOL1 POINTLABEL = ("#CHILD");
PROC GPLOT DATA=DIAG;
    PLOT DIFCHISQ*PROB;
RUN;

*Question 7;
*A: Running logistic regression with interaction term to check for effect
modification;
PROC LOGISTIC DATA=NOMISS4 DESCENDING;
    CLASS MOMCIGS (REF="0") MOMMENS3 PARITY3 MOMED3 INCOME3
          RACEDI (REF="1") / PARAM=REF;
    MODEL LATEMENS (EVENT="1") = MOMCIGS MOMMENS3 PARITY3 MOMED3 INCOME3
                                 RACEDI RACEDI*MOMCIGS;
RUN;

*D: Likelihood-ratio test - model without interaction term;
PROC LOGISTIC DATA=NOMISS4 DESCENDING;
    CLASS MOMCIGS (REF="0") MOMMENS3 PARITY3 MOMED3 INCOME3
          RACEDI (REF="1") / PARAM=REF;
    MODEL LATEMENS (EVENT="1") = MOMCIGS MOMMENS3 PARITY3 MOMED3 INCOME3
                                 RACEDI;
RUN;

*Question 8: Dummy variables instead of interaction terms;
*Creating one dummy variable per exposure-race group (RACEDI 1 = white, 2 = nonwhite);
DATA NOMISS4;
    SET NOMISS4;
    CIGRACE1 = (MOMCIGS=0 AND RACEDI=1);
    CIGRACE2 = (MOMCIGS=1 AND RACEDI=1);
    CIGRACE3 = (MOMCIGS=2 AND RACEDI=1);
    CIGRACE4 = (MOMCIGS=3 AND RACEDI=1);
    CIGRACE5 = (MOMCIGS=0 AND RACEDI=2);
    CIGRACE6 = (MOMCIGS=1 AND RACEDI=2);
    CIGRACE7 = (MOMCIGS=2 AND RACEDI=2);
    CIGRACE8 = (MOMCIGS=3 AND RACEDI=2);
RUN;

*Logistic regression with adjustment variables and dummy variables;
*CIGRACE1 (white, nonsmoking mother) is omitted, so it is the referent group;
PROC LOGISTIC DATA=NOMISS4 DESCENDING;
    CLASS MOMMENS3 PARITY3 MOMED3 INCOME3 / PARAM=REF;
    MODEL LATEMENS (EVENT="1") = CIGRACE2-CIGRACE8 MOMMENS3 PARITY3 MOMED3
                                 INCOME3;
RUN;

*Question 9: logistic regression model with non-white as referent group;
*CIGRACE5 (nonwhite, nonsmoking mother) is omitted, so it is the referent group;
PROC LOGISTIC DATA=NOMISS4 DESCENDING;
    CLASS MOMMENS3 PARITY3 MOMED3 INCOME3 / PARAM=REF;
    MODEL LATEMENS (EVENT="1") = CIGRACE1-CIGRACE4 CIGRACE6-CIGRACE8
                                 MOMMENS3 PARITY3 MOMED3 INCOME3;
RUN;
