Notes for Submission: Upload your assignment directly to Crowdmark via the link you received by email
(let me know if you have not received this email). It is your responsibility to make sure your solution to
each question is submitted in the correct section, that the pages are rotated correctly, and that everything is
legible. Typed solutions are preferred. Be sure to include all R code and relevant output for each question
(where applicable). Once the solution key is posted on Learn, no further late submissions will be accepted.
Question 1 [8 marks]
Consider the following data on deaths from coronary disease among British male doctors reported in a study
by Doll and Hill (1966). The data are stratified by age. It is of interest to model the incidence of death due
to coronary disease and to examine how the effect of smoking on the rate of death varies across age strata.
We therefore can consider a model
log(µ) = β0 + β1 x1 + β2 x2 + β3 x3 + β4 x4 + β5 x5 + β6 x1 ∗ x5 + β7 x2 ∗ x5 + β8 x3 ∗ x5 + β9 x4 ∗ x5 + log(τ)
where x1 to x4 are indicator variables for the age strata 45–54, 55–64, 65–74, and 75–84 respectively,
x5 = 1 for smokers and is zero otherwise, τ is the corresponding person-years of follow-up, and log(τ) is an
offset term. A reduced model which excludes the interaction terms can also be fit and is appropriate if the
effect of the smoking variable is the same across all age strata.
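Dividing through by τ restates the model above on the rate scale, which can make the role of each coefficient easier to read. The display below is only a rearrangement of the model as written. The symbol λ = µ/τ is shorthand introduced here for the death rate per person-year, and "baseline stratum" means the stratum with x1 = x2 = x3 = x4 = 0.

```latex
\begin{align*}
\log\lambda = \log\!\left(\frac{\mu}{\tau}\right)
  &= \beta_0 + \sum_{k=1}^{4}\beta_k x_k + \beta_5 x_5
     + \sum_{k=1}^{4}\beta_{k+5}\, x_k x_5, \\[4pt]
\frac{\lambda(\text{smoker},\ \text{stratum } k)}
     {\lambda(\text{non-smoker},\ \text{stratum } k)}
  &= \exp(\beta_5 + \beta_{k+5}), \qquad k = 1,\dots,4, \\[4pt]
\frac{\lambda(\text{smoker},\ \text{baseline stratum})}
     {\lambda(\text{non-smoker},\ \text{baseline stratum})}
  &= \exp(\beta_5).
\end{align*}
```

In the reduced model, β6 = β7 = β8 = β9 = 0, so the smoker to non-smoker rate ratio equals exp(β5) in every stratum.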
a) [2 marks] Explain why it is important to include the offset term, log(τ ), in this model to obtain
interpretable estimates of the covariate effects.
b) [3 marks] By fitting appropriate models, test the hypothesis that the effect of smoking is the same for
all age groups.
c) [3 marks] Based on the most appropriate model, estimate the relative rate of coronary death for an
80-year-old smoker versus an 80-year-old non-smoker and give a 95% confidence interval.
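The arithmetic behind a relative rate and its Wald interval can be sketched as follows. An 80-year-old falls in the 75−84 stratum (x4 = 1), so under the interaction model the log relative rate is β5 + β9, and under the reduced model it is β5 alone. This is a minimal illustration in Python using only the standard library, not the required R solution: the function name `rate_ratio_ci` and its arguments are hypothetical, and the coefficient estimates and covariance entries must be taken from your own fitted model.

```python
import math
from statistics import NormalDist


def rate_ratio_ci(coefs, cov, weights, level=0.95):
    """Wald interval for exp(w' beta), a rate ratio built from model coefficients.

    coefs   : estimated coefficients entering the contrast (supplied by the user)
    cov     : their estimated covariance matrix, given as a list of rows
    weights : contrast weights w, e.g. [1, 1] for beta5 + beta9
    level   : confidence level, 0.95 by default
    """
    # Point estimate of the log rate ratio: w' beta
    log_rr = sum(w * b for w, b in zip(weights, coefs))
    # Variance of the linear combination: w' Cov w
    var = sum(
        wi * wj * cov[i][j]
        for i, wi in enumerate(weights)
        for j, wj in enumerate(weights)
    )
    # Standard normal quantile for a two-sided interval
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * math.sqrt(var)
    # Exponentiate the endpoints to return to the rate-ratio scale
    return (
        math.exp(log_rr),
        math.exp(log_rr - half_width),
        math.exp(log_rr + half_width),
    )
```

For the reduced model the call would take a single coefficient, for example `rate_ratio_ci([b5], [[var_b5]], [1])`, where `b5` and `var_b5` are placeholders for values read from the fitted model.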
Question 2 [14 marks]
Ohlsson & Johansson (2010)¹ provided a dataset based on 38508 Swedish moped insurance policies and 860
associated claims during 1994-1999. The data are available on the course website in the file moped.txt. The
following variables are available:
The dataset has 28 observations based on combinations of the first three variables.
a) [5 marks] Develop a log-linear model for the number of claims (number) using the explanatory
variables class, age, and zone, and log(duration) as the offset term. Use formal hypothesis tests to
determine if any explanatory variables can be excluded from the model or if any interaction terms
should be included. Explain your formal process of finding the best fitting model and include the R
summary output of your final model.
b) [3 marks] Use the model you selected in part (a) to calculate the expected number of claims per
one thousand policy years among mopeds registered to drivers from small towns and countryside, more
than 2 years old, with weight over 60 kg and more than 2 gears. Also calculate a 95% confidence interval.
c) [2 marks] Conduct a residual analysis of the model you selected in part (a). Are you satisfied with the
fit of the model?
d) [4 marks] Regardless of which model you selected in part (a), consider now the main effects regression
model. Fit this model using zone 4 as the baseline or comparison group. A local expert suggests that
there is not much difference between zones 4-7 and that perhaps they should be merged into a single
“rural” class. Use a deviance test to determine if this simplification is appropriate. Clearly state the
null and alternative hypotheses in terms of the regression parameters of one of the models.
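A deviance test of two nested Poisson models compares the drop in residual deviance with a chi-square distribution whose degrees of freedom equal the number of parameters set to zero. Merging four zones into one class removes three zone coefficients, so that comparison would use 3 degrees of freedom. The sketch below, in Python with only the standard library, shows the tail-probability arithmetic. The names `chi2_sf` and `deviance_test` are hypothetical helpers, and the deviances must come from your own fitted models.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for X ~ chi-square with integer df >= 1."""
    if df < 1 or int(df) != df:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df % 2 == 0:
        # Even df = 2k: exp(-h) * sum_{j=0}^{k-1} h^j / j!
        term = math.exp(-h)
        total = term
        for j in range(1, df // 2):
            term *= h / j
            total += term
        return total
    # Odd df = 2k+1: erfc(sqrt(h)) + exp(-h) * sum_{j=1}^{k} h^(j-1/2) / Gamma(j+1/2)
    total = math.erfc(math.sqrt(h))
    term = math.exp(-h) * math.sqrt(h) / math.gamma(1.5)
    for j in range(1, (df - 1) // 2 + 1):
        total += term
        term *= h / (j + 0.5)
    return total


def deviance_test(dev_reduced, dev_full, df):
    """Return (test statistic, p-value) for nested models.

    dev_reduced, dev_full : residual deviances of the smaller and larger model
    df                    : number of parameters constrained under the null
    """
    stat = dev_reduced - dev_full
    return stat, chi2_sf(stat, df)
```

A small p-value is evidence against the simpler model. The same function applies to any nested comparison once the appropriate degrees of freedom are supplied.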
¹ Ohlsson, E., & Johansson, B. (2010). Non-life insurance pricing with generalized linear models (Vol. 21). Berlin: Springer.
Question 3 [8 marks]
In a prospective study, subjects were randomly allocated to four different treatment groups and followed over
time to determine whether their health changed over the course of follow-up. One group received a placebo,
one group received drug A, one group received drug B, and one group received both drugs A and B. The
responses of all patients are summarized below.
Therefore, for this study there is one explanatory variable (treatment) and one trinomial response variable
(improved, no difference, worse). We want to know if the treatments have any effect on the response, or in
other words, if the distribution of responses is the same for each of the four treatment groups.
Let πij denote the probability of outcome j for the ith treatment group, let yij denote the frequency with
which this happens, j = 1, 2, 3, and let mi = yi1 + yi2 + yi3 , i = 1, 2, 3, 4.
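With this notation, one common way to write the most general model is as a product of four independent trinomial distributions. The display below is a sketch under the assumption that the group sizes mi are treated as fixed by the design; it uses only the symbols defined above.

```latex
\begin{align*}
L(\pi) &= \prod_{i=1}^{4} \frac{m_i!}{y_{i1}!\, y_{i2}!\, y_{i3}!}\;
          \pi_{i1}^{\,y_{i1}}\, \pi_{i2}^{\,y_{i2}}\, \pi_{i3}^{\,y_{i3}},
          \qquad \sum_{j=1}^{3} \pi_{ij} = 1 \ \text{ for each } i, \\[4pt]
H_0 &: \ \pi_{1j} = \pi_{2j} = \pi_{3j} = \pi_{4j}, \qquad j = 1, 2, 3, \\[4pt]
\hat{\pi}_{ij} &= \frac{y_{ij}}{m_i} \quad \text{(unrestricted maximum likelihood estimate)}.
\end{align*}
```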
a) [2 marks] Write down the likelihood function for the data in Table 1 under the most general model
and give the null hypothesis of interest (stated above) in terms of the parameters of this model.
b) [1 mark] Give an expression for the maximum likelihood estimate of πij under the model in (a).
c) [2 marks] Give the form of the log-linear model corresponding to the null hypothesis. Define any
notation that you introduce.
d) [3 marks] Fit the model in (c) and draw conclusions regarding the presence of any association in this
table.
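For a two-way table, the log-linear model with row and column main effects only has fitted values mi × (column total) / (grand total), and its deviance is G² = 2 Σ yij log(yij / fitted), referred to a chi-square distribution with (rows − 1)(columns − 1) degrees of freedom. The Python sketch below, using only the standard library, illustrates that arithmetic. The function name `independence_fit` is hypothetical, and the counts must be supplied from the table in the assignment; none are assumed here.

```python
import math


def independence_fit(table):
    """Fit the independence (main-effects) log-linear model to a two-way table.

    table : list of rows of observed counts y_ij (one row per treatment group)
    Returns (fitted values, deviance G^2, degrees of freedom).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Fitted count under independence: row total * column total / grand total
    fitted = [[r * c / n for c in col_totals] for r in row_totals]
    g2 = 0.0
    for obs_row, fit_row in zip(table, fitted):
        for y, mu in zip(obs_row, fit_row):
            if y > 0:  # a zero count contributes nothing to the deviance
                g2 += 2.0 * y * math.log(y / mu)
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return fitted, g2, df
```

For a table with four treatment groups and three response categories this gives 6 degrees of freedom. A large G² relative to that reference distribution suggests the response distribution differs across groups.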