
Stat 431 Assignment 3 Winter 2017

Due by 1:00pm on Friday, March 10, 2017

Notes for Submission: Upload your assignment directly to Crowdmark via the link you received by email
(let me know if you have not received this email). It is your responsibility to make sure your solution to
each question is submitted in the correct section, that the pages are rotated correctly, and that everything is
legible. Typed solutions are preferred. Be sure to include all R code and relevant output for each question
(where applicable). Once the solution key is posted on Learn, no further late submissions will be accepted.

Question 1 [8 marks]
Consider the following data on deaths from coronary disease among British male doctors reported in a study
by Doll and Hill (1966). The data are stratified by age. It is of interest to model the incidence of death due
to coronary disease and to examine how the effect of smoking on the rate of death varies across age strata.
We therefore can consider a model

log(µ) = β0 + β1 x1 + β2 x2 + β3 x3 + β4 x4 + β5 x5 + β6 x1 x5 + β7 x2 x5 + β8 x3 x5 + β9 x4 x5 + log(τ)

where x1 to x4 are indicator variables for the age strata 45-54, 55-64, 65-74, and 75-84 respectively,
x5 = 1 for smokers and is zero otherwise, τ is the corresponding person-years of follow-up, and log(τ) is an
offset term. A reduced model which excludes the interaction terms can also be fit and is appropriate if the
effect of the smoking variable is the same across all age strata.

             Person-years of Follow-up     Number of Deaths
Age Group    Non-smokers    Smokers        Non-smokers    Smokers
35-44        18790          52407          2              32
45-54        10673          43248          12             104
55-64        5710           28612          28             206
65-74        2585           12663          28             186
75-84        1462           5317           31             102
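
As a quick look at the raw data before any modelling, the crude death rates per 1000 person-years and the smoker versus non-smoker rate ratio in each age stratum can be computed directly from the table. The sketch below is only an illustrative Python calculation (the assignment itself asks for R), and the variable and function names are assumptions introduced here, not part of the assignment.

```python
# Data transcribed from the table above (Doll and Hill, 1966).
age_groups = ["35-44", "45-54", "55-64", "65-74", "75-84"]
person_years = {
    "non_smokers": [18790, 10673, 5710, 2585, 1462],
    "smokers": [52407, 43248, 28612, 12663, 5317],
}
deaths = {
    "non_smokers": [2, 12, 28, 28, 31],
    "smokers": [32, 104, 206, 186, 102],
}


def crude_rate_per_1000(n_deaths, py):
    """Crude death rate per 1000 person-years of follow-up."""
    return 1000.0 * n_deaths / py


# Crude smoker / non-smoker rate ratio within each age stratum.
rate_ratios = {}
for i, age in enumerate(age_groups):
    rate_ns = crude_rate_per_1000(deaths["non_smokers"][i], person_years["non_smokers"][i])
    rate_s = crude_rate_per_1000(deaths["smokers"][i], person_years["smokers"][i])
    rate_ratios[age] = rate_s / rate_ns
    print(f"{age}: non-smokers {rate_ns:.2f}, smokers {rate_s:.2f}, ratio {rate_ratios[age]:.2f}")
```

These are descriptive quantities only; they do not replace the model-based estimates and tests requested in the parts below.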

a) [2 marks] Explain why it is important to include the offset term, log(τ), in this model to obtain
interpretable estimates of the covariate effects.
b) [3 marks] By fitting appropriate models, test the hypothesis that the effect of smoking is the same for
all age groups.
c) [3 marks] Based on the most appropriate model, estimate the relative rate of coronary death for an
80-year-old smoker versus an 80-year-old non-smoker and give a 95% confidence interval.

Question 2 [14 marks]
Ohlsson & Johansson (2010) [1] provided a dataset based on 38508 Swedish moped insurance policies and 860
associated claims during 1994-1999. The data are available on the course website in the file moped.txt. The
following variables are available:

• class: 1=Weight over 60 kg and more than 2 gears; 2=Other
• age: 1=At most 1 year, 2=2 years or more
• zone: 1=Central and semi-central parts of Sweden’s three largest cities, 2=suburbs and middle-sized
towns, 3=Lesser towns, except those in 5 or 7, 4=Small towns and countryside, except 5-7, 5=Northern
towns, 6=Northern countryside, 7=Gotland (Sweden’s largest island)
• duration: Total time the group of policies was in force [in years]
• severity: Total claim amount divided by the number of claims (i.e. the average cost per claim) [in
SEK]
• number: Number of claims made in the group
• pure: Pure premium: Total claim divided by duration (i.e. average cost per policy year) [in SEK]
• actual: Actual premium [in SEK] (the premium for one year according to the tariff in force 1999)

The dataset has 28 observations based on combinations of the first three variables.
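
The count of 28 follows from crossing the levels of the three classification variables (2 × 2 × 7). A minimal Python sketch of that enumeration is shown below; the level codes come from the variable list above, while the names `class_levels`, `age_levels`, `zone_levels`, and `cells` are assumptions made for illustration.

```python
from itertools import product

# Level codes as given in the variable descriptions above.
class_levels = [1, 2]
age_levels = [1, 2]
zone_levels = [1, 2, 3, 4, 5, 6, 7]

# Each tuple is one (class, age, zone) combination, i.e. one row of the dataset.
cells = list(product(class_levels, age_levels, zone_levels))
print(len(cells))
```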

a) [5 marks] Develop a log linear model for the number of claims (number) using the explanatory
variables class, age, and zone, and log(duration) as the offset term. Use formal hypothesis tests to
determine if any explanatory variables can be excluded from the model or if any interaction terms
should be included. Explain your formal process of finding the best fitting model and include the R
summary output of your final model.
b) [3 marks] Use the model you selected in part (a) to calculate the expected number of claims per
one thousand policy years among mopeds registered to drivers from small towns and countryside, more
than 2 years old, with weight over 60 kg and more than 2 gears. Also calculate a 95% confidence interval.
c) [2 marks] Conduct a residual analysis of the model you selected in part (a). Are you satisfied with the
fit of the model?
d) [4 marks] Regardless of which model you selected in part (a) consider now the main effects regression
model. Fit this model using zone 4 as the baseline or comparison group. A local expert suggests that
there is not much difference between zones 4-7 and that perhaps they should be merged into a single
“rural” class. Use a deviance test to determine if this simplification is appropriate. Clearly state the
null and alternative hypotheses in terms of the regression parameters of one of the models.

[1] Ohlsson, E., & Johansson, B. (2010). Non-life insurance pricing with generalized linear models (Vol. 21). Berlin: Springer.

Question 3 [8 marks]
In a prospective study subjects were randomly allocated to four different treatment groups and followed over
time to determine whether their health changed over the course of follow-up. One group received a placebo,
one group received drug A, one group received drug B, and one group received both drugs A and B. The
responses of all patients are summarized below.

Table 1: Randomized Controlled Trial with Multinomial Response

                                         Response
Treatment            Improved (j=1)    No difference (j=2)    Worse (j=3)    Total
Placebo (i=1)        18                17                     15             50
Drug A (i=2)         20                10                     20             50
Drug B (i=3)         23                10                     17             50
Drugs A & B (i=4)    25                13                     12             50

Therefore, for this study there is one explanatory variable (treatment) and one trinomial response variable
(improved, no difference, worse). We want to know if the treatments have any effect on the response, or in
other words, if the distribution of responses is the same for each of the four treatment groups.
Let πij denote the probability of outcome j for the ith treatment group, let yij denote the frequency with
which this happens, j = 1, 2, 3, and let mi = yi1 + yi2 + yi3 , i = 1, 2, 3, 4.
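
To make the notation concrete, the sketch below stores the observed frequencies yij in Python and recomputes the row totals mi along with the column totals. It is only a bookkeeping check on the table, written under the assumption that the dictionary layout and the names `counts`, `row_totals`, and `col_totals` are acceptable stand-ins; none of them appear in the assignment.

```python
# Observed frequencies y_ij: rows are treatments (i), columns are
# responses in the order Improved (j=1), No difference (j=2), Worse (j=3).
counts = {
    "Placebo": [18, 17, 15],
    "Drug A": [20, 10, 20],
    "Drug B": [23, 10, 17],
    "Drugs A & B": [25, 13, 12],
}

# Row totals m_i = y_i1 + y_i2 + y_i3 (fixed by the design at 50 per group).
row_totals = {treatment: sum(y) for treatment, y in counts.items()}

# Column totals: number of subjects with each response across all groups.
col_totals = [sum(col) for col in zip(*counts.values())]

grand_total = sum(row_totals.values())
print(row_totals, col_totals, grand_total)
```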

a) [2 marks] Write down the likelihood function for the data in Table 1 under the most general model
and give the null hypothesis of interest (stated above) in terms of the parameters of this model.
b) [1 mark] Give an expression for the maximum likelihood estimate of πij under the model in (a).

c) [2 marks] Give the form of the log-linear model corresponding to the null hypothesis. Define any
notation that you introduce.
d) [3 marks] Fit the model in (c) and draw conclusions regarding the presence of any association in this
table.
