
Bayesian Statistics:

An Introduction

PETER M. LEE
Formerly Provost of Wentworth College,
University of York, England

Fourth Edition

John Wiley & Sons, Ltd

Appendix D

Answers to Exercises
D.1

Exercises on Chapter 1

1. Considering trumps and non-trumps separately, the required probability is

$$2\binom{3}{2}\binom{23}{11}\bigg/\binom{26}{13} = \frac{39}{50}.$$

The probability of a 2 : 1 split is thus 39/50, and the conditional probability that the king is the
odd one is 1/3, so the probability that one player has the king and the other the remaining two
is 13/50.
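A quick numerical check of these fractions (not part of the original solution) can be made in R with the hypergeometric mass function; dhyper(2, 3, 23, 13) is the probability that a named defender holds exactly two of the three outstanding trumps when the 26 unseen cards split 13–13:

p2 <- dhyper(2, 3, 23, 13)        # named defender holds 2 of the 3 trumps
p1 <- dhyper(1, 3, 23, 13)        # named defender holds 1 of the 3 trumps
p.split <- p1 + p2                # probability of a 2:1 split
p.king  <- p.split / 3            # king is the singleton with probability 1/3
c(p.split, 39/50, p.king, 13/50)  # 0.78 0.78 0.26 0.26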
2. (a) If P(A) = 0 or P(A) = 1.
(b) Take A as odd red die, B as odd blue die and C as odd sum.
(c) Take C as {HHH, THH, THT, TTH}.
3. (a) P(homozygous) = 1/3; (b) P(heterozygous) = 2/3.

By Bayes' Theorem

$$P(BB \mid 7\ \text{black}) = \frac{(1/3)(1)}{(1/3)(1) + (2/3)(1/2^7)} = \frac{64}{65}.$$

(c) Similarly we see by induction (with the case n = 0 given by part (a)) that
P(BB | first n black) = 2^{n−1}/(2^{n−1} + 1), since

$$P(BB \mid \text{first } n+1 \text{ black}) = \frac{\{2^{n-1}/(2^{n-1}+1)\}(1)}{\{2^{n-1}/(2^{n-1}+1)\}(1) + \{1/(2^{n-1}+1)\}(1/2)} = \frac{2^n}{2^n+1}.$$

4. Use P(GG) = P(M)P(GG | M) + P(D)P(GG | D) to get

$$P(M) = \frac{P(GG) - p^2}{p(1-p)}.$$
5. p(3) = p(4) = 1/36, p(5) = p(6) = 2/36, p(7) = ... = p(14) = 3/36,
p(15) = p(16) = 2/36, p(17) = p(18) = 1/36. As for the distribution function,

F(x) = 0                for x < 3;
       1/36             for 3 ≤ x < 4;
       2/36             for 4 ≤ x < 5;
       4/36             for 5 ≤ x < 6;
       6/36             for 6 ≤ x < 7;
       (3[x] − 12)/36   for n ≤ x < n + 1, where 7 ≤ n < 15;
       32/36            for 15 ≤ x < 16;
       34/36            for 16 ≤ x < 17;
       35/36            for 17 ≤ x < 18;
       1                for x ≥ 18,

where [x] denotes the integer part of x.
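These values can be checked by enumeration in R; the probabilities above are those of the score x + 2y from two fair dice, which is assumed here to be the variable in question since the exercise statement is not reproduced:

s <- outer(1:6, 1:6, function(x, y) x + 2 * y)  # all 36 equally likely outcomes
pmf <- table(s) / 36
pmf              # 1/36, 1/36, 2/36, 2/36, 3/36, ..., 3/36, 2/36, 2/36, 1/36, 1/36
cumsum(pmf)      # matches F(x) at the jumps, e.g. F(14) = 30/36, F(15) = 32/36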
6. P(k = 0) = (1 − π)^n = (1 − λ/n)^n → e^{−λ}. More generally

$$p(k) = \binom{n}{k}\pi^k(1-\pi)^{n-k} = \frac{\lambda^k}{k!}\,\frac{n}{n}\,\frac{n-1}{n}\cdots\frac{n-k+1}{n}\left(1-\frac{\lambda}{n}\right)^{n-k} \to \frac{\lambda^k}{k!}\exp(-\lambda).$$

7. (a) P(k̃ = 0) = P(m̃ = ñ = 0) = P(m̃ = 0) P(ñ = 0) = e^{−λ} e^{−μ} = e^{−(λ+μ)},

P(k̃ = 1) = P({m̃ = 1, ñ = 0} or {m̃ = 0, ñ = 1}) = λe^{−(λ+μ)} + μe^{−(λ+μ)} = (λ + μ)e^{−(λ+μ)}.

(b) More generally

$$P(\tilde k = k) = \sum_{m=0}^{k} P(\tilde m = m,\ \tilde n = k-m) = \sum_{m=0}^{k} P(\tilde m = m)\,P(\tilde n = k-m)$$

$$= \sum_{m=0}^{k} \frac{\lambda^m}{m!}\exp(-\lambda)\,\frac{\mu^{k-m}}{(k-m)!}\exp(-\mu) = \frac{1}{k!}\sum_{m=0}^{k}\binom{k}{m}\lambda^m\mu^{k-m}\exp(-(\lambda+\mu)) = \frac{(\lambda+\mu)^k}{k!}\exp(-(\lambda+\mu)).$$

(c) By definition of conditional probability

$$P(\tilde m = m \mid \tilde k = k) = \frac{P(\tilde m = m,\ \tilde k = k)}{P(\tilde k = k)} = \frac{P(\tilde m = m,\ \tilde n = k-m)}{P(\tilde k = k)}$$

$$= \frac{\dfrac{\lambda^m}{m!}\exp(-\lambda)\,\dfrac{\mu^{k-m}}{(k-m)!}\exp(-\mu)}{\dfrac{(\lambda+\mu)^k}{k!}\exp(-(\lambda+\mu))} = \binom{k}{m}\left(\frac{\lambda}{\lambda+\mu}\right)^m\left(1 - \frac{\lambda}{\lambda+\mu}\right)^{k-m}.$$
8. Let y = x² where x ∼ N(0, 1). Then

P(ỹ ≤ y) = P(x̃² ≤ y) = P(−√y ≤ x̃ ≤ √y) = P(x̃ ≤ √y) − P(x̃ < −√y)

so that (because P(x̃ = −√y) = 0)

F_ỹ(y) = F_x̃(√y) − F_x̃(−√y)

and on differentiation

p_ỹ(y) = ½ y^{−1/2} p_x̃(√y) + ½ y^{−1/2} p_x̃(−√y).

Alternatively, you could argue that

P(y < ỹ ≤ y + dy) = P(x < x̃ ≤ x + dx) + P(−x − dx ≤ x̃ < −x)

implying

p_ỹ(y) dy = p_x̃(x) dx + p_x̃(−x) dx

which, as dy/dx = 2x = 2√y, leads to the same conclusion.

In the case where x ∼ N(0, 1) this gives

p_ỹ(y) = (2πy)^{−1/2} exp(−½ y)

which is the density of χ²₁.
9. By independence, since M̃ ≤ M iff every one of the individual Xᵢ is less than or
equal to M,

F_M(M) = P(Xᵢ ≤ M for all i) = (F(M))ⁿ;   F_m(m) = 1 − (1 − F(m))ⁿ;
p_M(M) = n f(M)(F(M))ⁿ⁻¹;   p_m(m) = n f(m)(1 − F(m))ⁿ⁻¹.

10. Let m be the minimum of u and v and c be the length of the centre section. Then
p(m, c) = pue, ve(m, c) + pue, ve(c, m) = 2 if m + c 6 1 and 0 otherwise.
11. If F(x, y) = F(x)F(y) then by differentiation with respect to x and with respect
to y we have p(x, y) = p(x)p(y). The converse follows by integration.
In discrete cases, if F(x, y) = F(x)F(y) then

p(x, y) = F(x, y) − F(x − 1, y) − F(x, y − 1) + F(x − 1, y − 1)
        = (F_x̃(x) − F_x̃(x − 1))(F_ỹ(y) − F_ỹ(y − 1)) = p(x)p(y).

Conversely, if p(x, y) = p(x)p(y) then $F(x, y) = \sum_{\xi \le x}\sum_{\eta \le y} p_{\tilde x,\tilde y}(\xi, \eta)$,
and so $F(x, y) = \sum_{\xi \le x} p_{\tilde x}(\xi)\sum_{\eta \le y} p_{\tilde y}(\eta) = F(x)F(y)$.
12. EX = n(1 − π)/π;  VX = n(1 − π)/π².

13. EX = Σ EZᵢ² = Σ 1 = ν, while EX² = Σ EZᵢ⁴ + Σ_{i≠j} EZᵢ²Zⱼ² = 3ν + ν(ν − 1),
so that VX = EX² − (EX)² = 2ν. Similarly by integration.
14. EX = nπ, EX(X − 1) = n(n − 1)π² and EX(X − 1)(X − 2) = n(n − 1)(n − 2)π³,
so EX = nπ, EX² = EX(X − 1) + EX = n(n − 1)π² + nπ and

E(X − EX)³ = EX³ − 3(EX²)(EX) + 2(EX)³
           = EX(X − 1)(X − 2) − 3(EX²)(EX) + 3EX² + 2(EX)³ − 2EX
           = nπ(1 − π)(1 − 2π),

and thus γ₁ = (1 − 2π)/√[nπ(1 − π)].
15. $$V\tilde x = \int (x - \mu)^2 p(x)\,dx \ge \int_{\{x:\,|x-\mu|>c\}} c^2 p(x)\,dx = c^2 P(|\tilde x - \mu| > c).$$

16. By symmetry Ex = Ey = Exy = 0, so C(x, y) = Exy − Ex Ey = 0. However

0 = P(x = 0, y = 0) ≠ P(x = 0) P(y = 0) = ½ × ½ = ¼.

17. p(y|x) = {2π(1 − ρ²)}^{−1/2} exp{−½(y − ρx)²/(1 − ρ²)}, so conditional on x̃ = x
we have y ∼ N(ρx, 1 − ρ²), so E(y|x) = ρx. E(y²|x) = V(y|x) + {E(y|x)}² =
1 − ρ² + ρ²x², hence E(xy|x) = ρx² and E(x²y²|x) = x² − ρ²x² + ρ²x⁴. Therefore
Exy = ρ and (as Ex⁴ = 3) Ex²y² = 1 + 2ρ², so Exy − (Ex)(Ey) = ρ and Ex²y² −
(Ex²)(Ey²) = 2ρ². As Vx = 1 and Vx² = Ex⁴ − (Ex²)² = 3 − 1 = 2 = Vy², the
result follows.

18. (a) p(x, y) = (λˣe^{−λ}/x!) $\binom{x}{y}$ πʸ(1 − π)^{x−y}, so adding over x = y, y + 1, ... and
using $\sum_x \lambda^{x-y}(1-\pi)^{x-y}/(x-y)! = e^{\lambda(1-\pi)}$ we get p(y) = (λπ)ʸe^{−λπ}/y!, so that ỹ ∼ P(λπ).
Now note that E_{ỹ|x̃}(ỹ|x̃) = πx̃ and this has expectation πλ.

(b) Note that V_{ỹ|x̃}(ỹ|x̃) = x̃π(1 − π), which has expectation λπ(1 − π), and that
E_{ỹ|x̃}(ỹ|x̃) = πx̃, which has variance π²λ, so that the right hand side adds to λπ, the
variance of ỹ.
19. We note that

$$I = \int_0^\infty \exp(-\tfrac12 z^2)\,dz = \int_0^\infty \exp(-\tfrac12 (xy)^2)\,y\,dx$$

for any y (on setting z = xy). Putting z in place of y, it follows that for any z

$$I = \int_0^\infty \exp(-\tfrac12 (zx)^2)\,z\,dx$$

so that

$$I^2 = \int_0^\infty \exp(-\tfrac12 z^2)\left(\int_0^\infty \exp(-\tfrac12 (zx)^2)\,z\,dx\right)dz = \int_0^\infty\!\!\int_0^\infty \exp\{-\tfrac12 (x^2+1)z^2\}\,z\,dz\,dx.$$

Now set (1 + x²)z² = 2t so that z dz = dt/(1 + x²) to get

$$I^2 = \int_0^\infty\left(\int_0^\infty \exp(-t)\,\frac{dt}{1+x^2}\right)dx = \int_0^\infty \exp(-t)\,dt \int_0^\infty \frac{dx}{1+x^2} = \big[-\exp(-t)\big]_0^\infty\,\big[\tan^{-1}x\big]_0^\infty = 1\cdot\tfrac12\pi$$

and hence I = √(π/2), so that the integral of φ from −∞ to ∞ is 1, and hence φ is a
probability density function. This method is apparently due to Laplace (1812, Section
24, pages 94–95 in the first edition).

D.2

Exercises on Chapter 2

1. p(π) = [B(k + 1, n − k + 1)]^{−1} π^k (1 − π)^{n−k}, or Be(k + 1, n − k + 1).

2. x̄ = 16.35525, so assuming a uniform prior, the posterior is N(16.35525, 1/12). A
90% HDR is 16.35525 ± 1.6449/√12, or 16.35525 ± 0.47484, that is, the interval
(15.88041, 16.83009).
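A one-line check in R of the interval quoted above:

qnorm(c(0.05, 0.95), mean = 16.35525, sd = sqrt(1/12))
# 15.88041 16.83009 -- for a normal posterior the 90% HDR is the equal-tailed interval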
3. x ∼ N(θ, 1) and θ ∼ N(16.35525, 1/12), so x ∼ N(16.35525, 13/12).

4. Assuming a uniform prior, the posterior is N(x̄, φ/n), so take n = k. If the prior variance
is φ₀, the posterior variance is {1/φ₀ + n/φ}^{−1}, so take n to be the least integer such that
n > (k − 1)φ/φ₀.
1

5. The posterior is k(2π/25)^{−1/2} exp{−½(θ − 0.33)² × 25} for θ > 0 and 0 for θ < 0.
Integrating, we find 1 = kP(X > 0) where X ∼ N(0.33, 1/25), so k = {1 −
Φ(−1.65)}^{−1} = 1.052. We now seek θ₁ such that p(θ ≤ θ₁ | x) = 0.95, or equivalently
kP(0 < X ≤ θ₁) = 0.95 with k and X as before. This results in 0.95 = 1.052{Φ(5θ₁ −
1.65) − Φ(−1.65)}, so Φ(5θ₁ − 1.65) = 0.95/1.052 + 0.0495 = 0.9525, leading to
5θ₁ − 1.65 = 1.67 and so to θ₁ = 0.664. The required interval is thus [0, 0.664).
6. From the point of view of someone starting from prior ignorance, my beliefs are
equivalent to an observation of mean and variance and yours to one of mean and
variance , so after taking into account my beliefs such a person is able to use Section
2.2 and conclude N(1 , 1 ) where 1/1 = 1/ + 1/ and 1 /1 = / + /.
7.

The likelihood (after inserting a convenient constant) is


1

l(x|) = (2/n) 2 exp{ 21 (x )2 /(/n)}.


Hence by Bayes Theorem, within I
1

Ac(1 )(2/n) 2 exp{ 12 (x )2 /(/n)} 6 p(|x)


1

6 Ac(1 + )(2/n) 2 exp{ 12 (x )2 /(/n)}


and outside I
1

0 6 p(|x) 6 AM c(2/n) 2 exp{ 12 (x )2 /(/n)},


where A is a constant equal to p(x)1 . Using the right hand inequality for the region
inside I we get
Z
Z
1
p(|x) d 6 Ac(1 + )
(2/n) 2 exp{ 12 (x )2 /(/n)} d
I

= Ac(1 + )

(2) 2 exp( 12 t2 ) dt, where t = (x )/

p
/n

= Ac(1 + )[( ) ( )] = Ac(1 + )(1 )


since ( ) = 1 ( ). Similarly, the same integral exceeds AQc(1 )(1 ),
and, if J is the outside of I ,
Z
06
p(|x) d 6 AM c.
J

Combining these results we have, since

R
I J

p(|x) d = 1,

Ac(1 )(1 ) 6 1 6 Ac[(1 + )(1 ) + M ],


and hence

1
1
6 Ac 6
.
(1 + )(1 ) + M
(1 )(1 )

The result now follows on remarking


p that the maximum value of the exponential in J
occurs at the end-points /n, where it has the value exp( 21 2 ).

8.

The likelihood is
nX

o
t(xi )()
n
h
 .X
io
h()n exp exp log () log 1
t(xi )
.

l(|x) h()n exp

This is of the data-translated form g(() t(x)) if the function h(x) 1.


9. Take the prior to be S₀χ⁻²_ν where ν = 8 and S₀/(ν − 2) = 100, so S₀ = 600, and take n =
30 and S = 13.2² × (n − 1) = 5052.96. Then (cf. Section 2.7) the posterior is (S₀ + S)χ⁻²_{ν+n}
where ν + n = 38 and S₀ + S = 5652.96. Values of χ² corresponding to a 90%
HDR for log χ²₃₈ are (interpolating between 35 and 40 degrees of freedom) 25.365 and
54.269, so the 90% interval is (104, 223).

10. n = 10, x̄ = 5.032, S = 3.05996, s/√n = 0.1844. The posterior for the mean is
such that (μ − 5.032)/0.1844 ∼ t₉, so as t₉,₀.₉₅ = 1.833 a 90% HDR is (4.69, 5.37).
The posterior for the variance is Sχ⁻²₉, so as the values of χ² corresponding to a 90% HDR for
log χ²₉ are 3.628 and 18.087, the required interval is (0.17, 0.84).
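The arithmetic can be reproduced in R (taking the quoted HDR endpoints for log χ²₉ as given):

n <- 10; xbar <- 5.032; S <- 3.05996
se <- sqrt(S / (n - 1)) / sqrt(n)            # 0.1844
xbar + c(-1, 1) * qt(0.95, n - 1) * se       # 90% HDR for the mean, about (4.69, 5.37)
S / c(18.087, 3.628)                         # interval for the variance, about (0.17, 0.84)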
11. Sufficient statistic is $\sum_{i=1}^{n} x_i$, or equivalently x̄.

12. Sufficient statistic is $\left(\sum_{i=1}^{n} x_i, \prod_{i=1}^{n} x_i\right)$, or equivalently (x̄, x̃) where x̃ is the
geometric mean.

13.

p() 0 exp(/).

14. If θ ∼ U(−π/2, π/2) and x = tan θ, then by the usual change-of-variable rule
p(x) = π^{−1}|dθ/dx| = π^{−1}(1 + x²)^{−1}, and similarly if θ ∼ U(π/2, 3π/2). The
result follows.
15.

Straightforwardly
n!
x y z
x! y! z!


n!
=
exp{n log(1 )}
x! y! (n x y)!

p(x|) =

exp[x log{/(1 } + y log{/(1 )}]


= g(x, y) h(, ) exp[t(x, y)(, ) + u(x, y)(, )]

16. 1 = 0 + n = 104; n1 = n0 + n = 101; 1 = (n0 0 + nx)/n1 = 88.96;

1 1
S = (n 1)s2 = 2970; S1 = S0 + S + (n1
) (0 x)2 = 3336; s1 / n1 =
0 +n

3336/(101 104) = 0.56. It follows that a posteriori


1
t1 ,
s1 / n1

S1 2
1 .

For prior, find 75% point of t4 from, e.g., Neaves Table 3.1 as 0.741. For posterior,
as degrees of freedom are large, can approximate t by normal, noting that the 75%
point of thep
standard normal distribution is 0.6745. Hence a 50% prior HDR for is
85 0.741 (350/4 1), that is (75.6, 94.4), while a 50% posterior HDR for is
88.96 0.6745 0.56, that is, (88.58, 89.34).
17. With p1 () being N(0, 1) we see that p1 (|x) is N(2, 2), and with p2 () being
N(1, 1) we see that p2 (|x) is N(3, 1). As
Z
Z
1
exp{ 21 (x )2 12 2 } d
p(x|)p1 () d =
2
q
1
2

= exp{ 14 x2 }
2
Z
1
q
exp{ 12 ( 21 x)2 / 12 } d

1
2 2
q
1
1
2
= exp{ 14 x2 } = exp{2}
2
2
and similarly
Z
Z
1
exp{ 12 (x )2 21 ( 1)2 } d
p(x|)p1 () d =
2
q
1
2

= exp{ 41 (x 1)2 }
2
Z
1
q

exp{ 12 ( 12 (x + 1))2 / 12 } d
1
2 2
q
1
1
2
= exp{ 41 (x 1)2 } = exp{ 41 }
2
2
so that just as in Section 2.13 we see that the posterior is an 0 to 0 mixture of N(2, 1)
and N(3, 1) where 0 32 e2 = 0.09 and 0 13 e1/4 = 0.26, so that 0 = 0.26 and
0 = 0.74. It follows that
P( > 1) = 0.26 0.1587 + 0.74 0.0228 = 0.058.

18.

Elementary manipulation gives


2

nX + n0 02 (n + n0 )

nX + n0 0
n + n0

2

1
2
[{n(n + n0 ) n2 }X + {n0 (n + n0 ) n2o }02 2(nn0 )X]
n + n0
nn0
2
1 1
=
[X + 02 2X0 ] = (n1
) (X )2 .
0 +n
n + n0
=

D.3

Exercises on Chapter 3

1. Using Bayes' postulate p(π) = 1 for 0 ≤ π ≤ 1 we get a posterior (n + 1)πⁿ,
which has mean (n + 1)/(n + 2).
2. From Table B.5, $\underline F_{40,24}$ = 0.55 and $\bar F_{40,24}$ = 1.87, so for Be(20, 12) take lower
limit 20 × 0.55/(12 + 20 × 0.55) = 0.48 and upper limit 20 × 1.87/(12 + 20 × 1.87) = 0.76,
so the 90% HDR is (0.48, 0.76). Similarly by interpolation $\underline F_{41,25}$ = 0.56
and $\bar F_{41,25}$ = 1.85, so for Be(20.5, 12.5) quote (0.48, 0.75). Finally by interpolation
$\underline F_{42,26}$ = 0.56 and $\bar F_{42,26}$ = 1.83, so for Be(21, 13) quote (0.47, 0.75). It does not
seem to make much difference whether we use a Be(0, 0), a Be(½, ½) or a Be(1, 1)
prior.
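For comparison only (this is not the book's method, which reads HDRs for log F from tables), equal-tailed 90% intervals computed directly in R are very close to the quoted HDRs:

qbeta(c(0.05, 0.95), 20, 12)       # close to (0.48, 0.76) for Be(20, 12)
qbeta(c(0.05, 0.95), 20.5, 12.5)   # close to (0.48, 0.75) for Be(20.5, 12.5)
qbeta(c(0.05, 0.95), 21, 13)       # close to (0.47, 0.75) for Be(21, 13)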
3. Take α/(α + β) = 1/3, so β = 2α and

$$V = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)} = \frac{2}{3^2(3\alpha+1)}$$

so α = 55/27 ≈ 2 and β = 4. The posterior is then Be(2 + 8, 4 + 12), that is Be(10, 16).
The 95% values for F₃₂,₂₀ are 0.45 and 2.30 by interpolation, so those for F₂₀,₃₂ are
0.43 and 2.22. An appropriate interval for π is from 10 × 0.43/(16 + 10 × 0.43) to
10 × 2.22/(16 + 10 × 2.22), that is (0.21, 0.58).
4. Take /( + ) = 0.4 and + = 12, so approximately = 5, = 7. Posterior
is then Be(5 + 12, 7 + 13), that is, Be(17, 20).
5. Take a beta prior with α/(α + β) = 1/3 and α + β = ¼ × 11 = 2.75. It is convenient
to take integer α and β, so take α = 1 and β = 2, giving α/(α + β) = 1/3 and
(α + β)/11 = 0.273. The variance of the prior is 2/36 = 0.0555, so the standard deviation
is 0.236. The data are such that n = 11 and X = 3, so the posterior is Be(4, 10). From
tables, values of F corresponding to a 50% HDR for log F₂₀,₈ are $\underline F$ = 0.67 and $\bar F$ =
1.52. It follows that the appropriate interval of values of F₈,₂₀ is (1/$\bar F$, 1/$\underline F$), that is
(0.66, 1.49). Consequently an appropriate interval for the proportion required is
4 × 0.66/(10 + 4 × 0.66) ≤ π ≤ 4 × 1.49/(10 + 4 × 1.49), that is (0.21, 0.37). The
posterior mean is 4/(4 + 10) = 0.29, the posterior mode is (4 − 1)/(4 + 10 − 2) = 0.25,
and using the relationship median ≈ (2 × mean + mode)/3 the posterior median is
approximately 0.28. The actual overall proportion 86/433 = 0.20 is consistent with
the above. [Data from the York A.U.T. Membership List for 1990/91.]
6. By Appendix A.13, Ex = n(1 )/ and Vx = n(1 )/ 2 , so g 0 (Ex) =
1
1 1
[(1 )/ 2 ] 2 , so Vx = 1/4n (cf. Section 1.5). By analogy with the argument
2n
that an arc-sine prior for the binomialdistribution is approximately data-translated,
1
1
this suggests a prior uniform in sinh1 so with density 12 2 (1 + ) 2 . But note
Jeffreys Rule suggests Be(0, 12 ) as remarked in Section 7.4.
7. As this is a rare event, we use the Poisson distribution. We have n = 280,
T = Σxᵢ = 196. The posterior is S₁⁻¹χ²_{ν′} where S₁ = S₀ + 2n and ν′ = ν + 2T. For the
reference prior take ν = 1, S₀ = 0; we get a posterior which is 560⁻¹χ²₃₉₃. Using the
approximation in Appendix A.3, we find that a 95% interval is ½(√785 ± 1.96)²/560,
that is, (0.61, 0.80).

8. The prior is such that ν/S₀ = 0.66 and 2ν/S₀² = 0.115², so ν = 66, S₀ = 100, and the
posterior has S₁ = 660, ν′ = 458. This gives a posterior 95% interval ½(√915 ± 1.96)²/660,
that is, (0.61, 0.79).
9. ∂ log p(x|θ)/∂θ = 3/(2θ) − ½x² and ∂² log p(x|θ)/∂θ² = −3/(2θ²), so I(θ|x) = 3/(2θ²)
and we take p(θ) ∝ 1/θ, or equivalently a uniform prior in ψ = log θ.
10.

We have 2 L/ 2 = x/ 2 z/(1 )2 , etc., so




n/ + n/(1 )
n/(1 )
I(, | x, y, z) =
n/(1 )
n/ + n/(1 )

and so after a little manipulation det I(, | x, y, z) = n{(1)}1 , suggesting


a prior
1
1
1
p(, ) 2 2 (1 ) 2
which is evidently related to the arc-sine distribution.
11. p(x|)/ = 1/ + log log x and 2 p(x|)/ 2 = 1/ 2 , so I(|x) =
1/ 2 and we take p() 1/ or equivalently a uniform prior in = log .

12. Using Appendix A.16, coefficient of variation is {2/( + 1)( 2)}. This
is less than0.01 if 21 ( + 1)( 2) > 1/0.012 or 2 20, 002 > 0, so if
> 12 (1 + 80, 009) = 141.9. Taking the reference prior p(, ) ( )2 , that
is, Pabb(, , 1) (cf. Section 3.6), we need 0 = n 1 > 141.9, that is, n at least
143.


13. Take the prior p(θ) = (d − 1)θ₀^{d−1}θ^{−d} for θ₀ < θ < ∞. Then the posterior is
p(θ|x) = (d′ − 1)θ₁^{d′−1}θ^{−d′} for θ₁ < θ < ∞, where d′ = d + n(c + 1) and θ₁ = max{θ₀, M}
where M = max{xᵢ}.

14. The prior is p(θ) ∝ 1/θ as before, but the likelihood is now p(71, 100 | θ) = 1/θ² for
θ > 100, so the posterior is approximately

$$p(\theta \mid 71, 100) \propto \theta^{-3} \bigg/ \sum_{\theta > 100} \theta^{-3}.$$

Approximating sums by integrals, the posterior probability is P(θ > θ₀ | 71, 100) =
100²/θ₀², and in particular the posterior median is 100√2 or about 140.
15. We have p() = 1/ and setting = log we get p() = p() |d/d| = 1.
Thus we expect all pages to be more or less equally dirty.
16. The group operation is (x, y) 7 x + y mod 2 and the appropriate Haar measure is Lebesgue measure on the circle, that is, arc length around the circle.
17. Table of c²{c² + (x₁ − θ)²}⁻¹{c² + (x₂ − θ)²}⁻¹:

 c\θ      0      2      4      6      8
  1     0.00   0.06   0.04   0.06   0.00
  2     0.01   0.05   0.06   0.05   0.01
  3     0.01   0.04   0.05   0.04   0.01
  4     0.01   0.03   0.04   0.03   0.01
  5     0.01   0.02   0.03   0.02   0.01

Integrating over c using Simpson's Rule (and ignoring the constant) for the above
values of θ we get 0.11, 0.48, 0.57, 0.48 and 0.11 respectively. Integrating again we get
for intervals as shown:

 (−1, 1)   (1, 3)   (3, 5)   (5, 7)   (7, 9)   Total
   0.92     2.60     3.24     2.60     0.92    10.28

so the required posterior probability is 3.24/10.28 = 0.31.


18. First part follows as sum of concave functions is concave and a concave function
has a unique maximum. For the example, note that
p(x|) = exp( x)/{1 + exp( x)}2 ( < x < )
=

1
4

sech2 12 ( x) ( < x < )

(which is symmetrical about x = ), so that the log-likelihood is


L(|x) = x 2 log{1 + exp( x)}.

Hence
L0 (|x) = 1 2 exp( x)/{1 + exp( x)}
= {1 exp( x)}/{1 + exp( x)}
= 1 2/{1 + exp(x )}
00

L (|x) = 2 exp(x )/{1 + exp(x )}2


= 2 exp( x)/{1 + exp( x)}2 .
As this is clearly always negative, the log-likelihood is concave. Also
L0 (|x)/L00 (|x) = 21 {exp( x) exp(x )}
Z
I(|x) = 2
exp(2( x))/{1 + exp( x)}4 dx

Z
= (1/8)
sech4 ( x) dx

Z
= (1/8)
sech4 y dy

= (1/24)[sinh y sech3 y + 2 tanh y]


= 1/6.
(The integration can be checked by differentiation of the result). Now proceed as in
Section 3.10.

D.4

Exercises on Chapter 4


1
1. 1 (1 p0 ) = p0 = 1 + (1 0 )01 B 1
= 1 (1 0 )01 B 1 + (1
1
1
2 2 2
0 ) 0 B . Result then follows on noting 0 = {1 (1 0 )}
= 1+(10 ).
2. Substituting in the formulae in the text we get

φ₁ = (0.9⁻² + 1.8⁻²)⁻¹ = 0.648 = 0.80²;
θ₁ = 0.648 × (93.3/0.9² + 93.0/1.8²) = 93.2;
π₀ = Φ((93.0 − 93.3)/0.9) = Φ(−0.333) = 0.3696;   π₀/(1 − π₀) = 0.59;
p₀ = Φ((93.0 − 93.2)/0.8) = Φ(−0.25) = 0.4013;   p₀/(1 − p₀) = 0.67;
B = 0.67/0.59 = 1.14.

3. n = 12, x = 118.58, S = 12969, s/ n = 9.91, (x 100)/(s/ n) = 1.875.


Taking a normal approximation this gives a P -value of 1 (1.875) = 0.0303.

4. n = 300, nθ = 900/16 = 56.25, nθ(1 − θ) = 45.70, (56 − 56.25)/√45.70 = −0.037,
so certainly not significant.

5. The posterior with the reference prior is S₁⁻¹χ²_{ν′} with S₁ = 12 and ν′ = 31. Values of
χ² corresponding to a 90% HDR for log χ²₃₁ are (by interpolation) 19.741 and 45.898,
so a 90% posterior HDR is from 1.65 to 3.82, which includes 3. So it is not appropriate to
reject the null hypothesis.
6. Ifp
k = 0.048 then exp(2k) = 1.101 = 1/0.908, so take within = k

0.048 (/10)/2.5 = 0.006 .


7.

(/n)z =

Using standard power series expansions


1

B = (1 + 1/) 2 exp[ 12 (1 + )1 ]
1

= 2 (1 + ) 2 exp( 12 z 2 ) exp[ 21 z 2 (1 + )1 ]
1

= 2 (1 + 12 + . . . ) exp( 21 z 2 )[1 + 12 z 2 (1 + )1 + . . . ]
1

= 2 exp( 12 z 2 )(1 + 12 (z 2 + 1) + . . . ).

8.

Likelihood ratio is
P 2

n
h X
i
xi /( + )]
{2( + )}n/2 exp[ 12

2
2
P
1
+
exp

x
/
=
i

{2( )}n/2 exp[ 12


x2i /( )]

h X
i

x2i n /2
= exp

 P 2
xi /n
n

1 .
= exp

9. p1 (x)/ = { 12 ( + /n)1 + 12 (x )2 /( + /n)2 }p1 (x) which vanishes if + /n = (x )2 = z 2 /n so if = (z 2 1)/n. Hence p1 (x) 6

1
1
(2/n) 2 z 1 exp( 12 ). ConsequentlyB = (2/n) 2 exp( 21 z 2 )/p1 (x) > e z exp( 12 z 2 ).
For last part use p0 = 1 + (1 /0 )B 1 .
10. Posterior probability is a minimum when B is a minimum, hence when 2 log B
is a minimum, and
d (2 log B)/dn = {1 + n}1 + z 2 {1 + 1/n}2 (1/n2 )
which vanishes when n = z 2 1. Since z 2 1 is not necessarily an integer, this answer
is only approximate.
11. Test against B(7324, 0.75) with mean 5493 and variance 1373, so z = (5474 −
5493)/√1373 = −0.51. Not significant, so the theory is confirmed.
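A quick R check of the normal approximation used above:

n <- 7324; p <- 0.75; x <- 5474
z <- (x - n * p) / sqrt(n * p * (1 - p))     # about -0.51
2 * pnorm(-abs(z))                           # two-sided P-value, about 0.61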


12.

Likelihood ratio is
1

(22) 2 exp[ 21 u2 /(2)](2) 2 exp[ 12 (z )/]


12

(22)

21

exp[ 21 u2 /(2)](2/2)

exp[ 12 (z )/(/2)]

= (/2) 2 exp[ 12 u2 (1//2 1/2) + 12 (z )2 /]


1

= (/2) 2 exp[ 12 u2 /(2) + 12 (z )2 /]


p
p

With (/) = 100, u = 2 (2) and z = , we get B = 100/ 2 exp(2) =
9.57, although 2 standard deviation is beyond 1.96, the 5% level.
R
13. p1 (x) = 1 ()p(x|) d which is 1/ times the integral between + /2 and
/2 of an N(, /n) density and so as  /n is nearly the whole integral. Hence
1
p1 (x)
= 1/ , from which it easily follows that B = (2/n 2 ) 2 exp( 21 z 2 ). In
1
Section 4.5 we found B = (1 + n/) 2 exp[ 12 z 2 (1 + /n)1 ], which for large
1
n is about B = (/n) 2 exp( 12 z 2 ). This agrees with the first form found here if
2
= 2. As the variance of a uniform distribution on ( /2, + /2) is 2 /12,
this may be better expressed as 2 /12 = (/6) = 0.52.
14. Jeffreys considers the case where both the mean and the variance are unknown, and wishes to test H0 : = 0 versus H1 : 6= 0 using the conventional
choice of prior odds 0 /1 = 1, so that B = p0 /p1 . Under both hypotheses he uses
the standard conventional prior
for , namely p() 1/. He assumes that the prior
for is dependent on = as a scale factor. Thus if = / he assumes that
(, ) = p()p() so that and are independent. He then assumes that
(i) if the sample size n = 1 then B = 1, and
(ii) if the sample size n > 2 and S =

P
(Xi X)2 = 0, then B = 0, that is, p1 = 1.

From (i) he deduces that p() must be an even function with integral 1, while he shows
that (ii) is equivalent to the condition
Z
p() n2 d = .
0

He then shows that the simplest function satisfying these conditions is p() = 1 (1+
2 )1from which it follows that p() = p()|d/d| = 1 ( 2 + 2 )1 . Putting
= and generalizing to the case where H0 is that = 0 we get the distribution
in the question.
There is some arbitrariness in Jeffreys simplest function; if instead he had taken
p() = 1 (2 + 2 )1 he would have ended up with p() = 1 ( 2 + (
0 )2 )1 where = . However, even after this generalization, the argument is not
overwhelmingly convincing.


15. (a)

B>

Maximum likelihood estimator b of is x/n, so


0
b

x 

1 0
1 b

nx
;

1 0
p0 > 1 +
0

b
0

!x

1 b
1 0

!nx 1

(b) From tables (e.g. D. V. Lindley and W. F. Scott, New Cambridge Elementary
Statistical Tables, Cambridge: University Press 1995 [1st edn (1984)], Table 1, or
H. R. Neave, Statistics Tables for mathematicians, engineers, economists and the behavioural and management sciences, London: George Allen & Unwin (1978), Table 1.1) the probability of a binomial observation 6 14 from B(20, 0.5) is 0.9793,
so the appropriate (two-tailed) P -value is 2(1 0.9793) = 0.0414
= 1/24. The
maximum likelihood estimate is b = 15/20 = 0.75, so the lower bound on B is
(0.5/0.75)15 (0.5/0.25)5 = 220 /315 = 0.0731, implying a lower bound on p0 of
0.0681 or just over 1/15.
p

16. (a) n = 12, = 11, t = x/(s/ n) = 1.2/ (1.1/12) = 3.96 and if we take
k = 1 the Bayes factor B is
(1 + t2 /)(+1)/2
12

(1 + nk)

(b)

(1 +

t2 (1

nk)1 /)(+1)/2

0.004910
= 0.033.
(0.2774)(0.5356)

z = x/(s/ n) = 1.2/ 12 = 4.16 and (taking = as usual)




1
B = (1 + n) 2 exp 12 z 2 (1 + 1/n)1


1
= (1 + 12) 2 exp 21 (4.16)2 (1 + 1/12)1 = 0.001.

17. Two-tailed P -value is 0.0432 (cf. D. V. Lindley and W. F. Scott, New Cambridge
Elementary Statistical Tables, Cambridge: University Press 1995 [1st edn (1984)], Table 9), while Bayes factor is
(1 + t2 /)(+1)/2
21

(1 + nk)

(1 +

t2 (1

nk)1 /)(+1)/2

0.08712
= 0.377
(0.3162)(0.7313)

so F = 1/B = 2.65. Range (1/30P, 3/10P ) is (0.77, 6.94), so F is inside it and we


do not need to think again.
18. For P -value 0.1 think again if F not in ( 13 , 3). As p0 = [1 + F ]1 this means
if p0 not in (0.250, 0.750), so if n = 1000. Similarly if P -value 0.05, if p0 not in
(0.143, 0.600), so if n = 1000 (and the case n = 100 is marginal); if P -value 0.01,
if p0 not in (0.032, 0.231), so if n = 100 or n = 1000; if P -value 0.001, if p0 not in
(0.003, 0.029), so if n = 50, n = 100 or n = 1000.

D.5

Exercises on Chapter 5

1. Mean difference w̄ = 0.053, S = 0.0498, s = 0.0789. Assuming a standard
reference prior for δ and a variance known to equal 0.0789², the posterior distribution
of the effect δ of Analyst A over Analyst B is N(0.053, 0.0789²/9), leading to a 90%
interval 0.053 ± 1.6449 × 0.0789/3, that is, (0.010, 0.097). If the variance is not
assumed known then the normal distribution should be replaced by t₈.
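A short R check using the summary statistics above:

wbar <- 0.053; s <- 0.0789; n <- 9
wbar + c(-1, 1) * qnorm(0.95) * s / sqrt(n)      # (0.010, 0.097), variance treated as known
wbar + c(-1, 1) * qt(0.95, n - 1) * s / sqrt(n)  # slightly wider t_8 interval otherwise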
2.

If variance assumed known, z = 0.053/(0.0789/3)


= 2.028 so with =
1

B = (1 + n) 2 exp[ 12 z 2 (1 + 1/n)1 ]
1

= (1 + 9) 2 exp[ 12 (2.028)2 (1 + 19 )1 ] = 0.497



1
and p0 = 1 + 0.4971
= 0.33 with the usual assumptions. If the variance is not
assumed known,
B=

{1 + 2.0282 /8}9/2
1
10 2 {1

+ 2.0282 101 /8}9/2



1
and p0 = 1 + 0.6131
= 0.38.

0.1546
= 0.613
(0.3162)(0.7980)

P
P
P
P
3. PWe know (xP
i ) and (xi yi ) is independently N(0, 2 i ),
i +yi ) N(2, 2
so (xi yi )2 /2 i 2n and hence if = 0 then
qX
X
(xi + yi )/
(xi yi )2 tn .
Hence test whether = 0. If 6= 0, it can be estimated as
4.

1
2

P
(xi + yi ).

In that case
1

B = (1 + 440.5/99) 2 exp[ 21 (1.91)2 {1 + 99/440.5}1 ]


1

= 5.4495 2 exp(1.4893) = 0.53.


If the prior probability of the null hypothesis is taken as 0 =
posterior probability of p0 = (1 + 0.531 )1 = 0.35.
5.

1
2,

then this gives a

m = 9, n = 12, x = 12.42, y = 12.27, sx = 0.1054, sy = 0.0989.

2
(a) Posterior
of is N(12.42 12.27, 0.1 (1/9 + 1/12)) so 90% interval is 0.15
1.6449 0.00194, that is, (0.077, 0.223).

p
(b)
= 8 0.10542 + 11 0.09892 = 0.1965, s = 0.1965/19 = 0.102, so
p S 1
s (m +n1 ) = 0.045, so from tables of t19 a 90% interval is 0.151.7290.045,
that is (0.072, 0.228).

6.

With independent priors uniform in , and log , that is, p(, , ) 1/,

p(, , | x, y) p(, , ) p(x|, ) p(y|, )


h
nX
(1/)(2)(m+n)/2 exp 12
(xi )2 +

1
2

o i
(yi )2 /

(m+n)/21 exp[ 21 {Sx + m(x )2 + 21 Sy + 12 n(y )2 }/]


Writing S = Sx + 12 Sy and = m + n 2 we get
1

p(, , |x, y) /21 exp[ 12 S/] (2/m) 2 exp[ 12 m( x)2 )/2]


1

(22/m) 2 exp[ 12 m( y)2 )/2]


p(|S) p(|, x) p(|, y)
where

p( | S)
is an S2
density,
p( | , x) is an N(x, /m) density,
p( | , x) is an N(y, 2/m) density,

It follows that, for given , the parameters and have independent normal distributions, and so that the joint density of = and is
p(, |x, y) = p(|S) p(|x y, )
where p( | x y, ) is an N(x y, (m1 + 2n1 ) density. This variance can now
be integrated out just as in Sections 2.12 and 5.2, giving a very similar conclusion, that
is, that if
(x y)
t=
1
s{m1 + 2n1 } 2
where s2 = S/, then t t .
7.

We find that
S1 = S0 + Sx + Sy + m0 20 + mx2 + n0 20 + ny 2 m1 21 n1 21

and then proceed as in Exercise 18 on Chapter 2 to show that


1 1
1 1
S1 = S0 + Sx + Sy + (m1
) (x 0 )2 + (n1
) (y 0 )2 .
0 +m
0 +n

8. m = 10, n = 9, x̄ = 22.2, ȳ = 23.1, s_x = 1.253, s_y = 0.650. Consequently
√(s_x²/m + s_y²/n) = 0.452 and tan θ = (s_y/√n)/(s_x/√m) = 0.547, so θ ≈ 30°.
Interpolating in Table B.1 with ν₁ = 8 and ν₂ = 9, the 90% point of the Behrens
distribution is 1.42, so a 90% HDR is 22.2 − 23.1 ± 1.42 × 0.452, that is, (−1.54, −0.26).
9.

Evidently f1 = (m 1)/(m 3) and


f2 =

(m 1)2
(sin4 + cos4 ).
(m 3)2 (m 5)

Also 1 = (sin2 +cos2 )2 = sin4 +cos4 +2 sin2 cos2 , so that sin4 +cos4 =
1 sin2 2 = cos2 2. The result follows.
10.

As Tx t(x) and Ty t(y) we have


p(Tx , Ty |x, y) (1 + Tx2 /(x))((x)+1)/2 (1 + Ty2 /(y))((y)+1)/2

Jacobian is trivial, and the result follows (cf. Appendix A.18).


11. Set y = 1 x/(2 + 1 x) so that 1 y = 1/(2 + 1 x) and dy/dx = 1 /(2 +
1 x)2 . Then using Appendix A.19 for the density of x

dx
x1 /21
y 1 /21 (1 y)2 /21
p(y) = p(x)
dy
(2 + 1 x)(1 +2 )/22
from which the result follows.
12. k = s21 /s22 = 6.77, so 1 = k/ = 6.77/ F24,14 . A 95% interval for F24,14
from Table B.5 is (0.40, 2.73) so one for is (2.5, 16.9).
13. k = 3 with νx = νy = 9. An interval corresponding to a 99% HDR for log F
for F₉,₉ is (0.15, 6.54), so a 99% interval for the ratio is from √(3 × 0.15) to √(3 × 6.54), that
is, (0.67, 4.43).
14. Take 0 +0 = 6, 0 /(0 +0 ) = 12 , 0 +0 = 6, 0 /(0 +0 ) = 32 , so 0 = 3,
0 = 3, 0 = 4, 0 = 2 and hence = 3 + a = 11, = 3 + b = 15, = 4 + c = 52,
= 2 + d = 64 so that
log{( 12 )( 12 )/( 21 )( 21 )} = log 0.983 = 0.113
while 1 + 1 + 1 + 1 = 0.192. Hence posterior of log-odds ratio is 0
N(0.113, 0.192). The posterior probability that > is

(0.113/ 0.192) = (0.257) = 0.3986.

15. Cross-ratio is (45 29)/(28 30) = 1.554 so its logarithm is 0.441. More
accurately, adjusted value is (44.5 28.5)/(27.5 29.5) = 1.563 so its logarithm is
0.446. Value of a1 + b1 + c1 + d1 is 0.126, and so the posterior distribution of
the log-odds ratio is 0 N(0.441, 0.126) or more accurately N(0.446,
0.126).
The posterior probability that the log-odds ratio ispositive is (0.441/ 0.126) =
(1.242) = 0.8929 or more
= 0.8955.
p accurately (0.446/ 0.126) = (1.256)
p
With the same data sin1 (45/73) = 0.903 radians and sin1 (30/59) = 0.794
radians, and 1/4m +
1/4n = 0.00766, so the posterior probability that > is
((0.903 0.794)/ 0.00766) = (1.245) = 0.8934.

0 = log(/) log{(1 )/(1 )} so if = we get

16.

0 = log{/( )} log{(1 )/(1 + )}


= log{1 )/} + log{1 + /(1 )}

= / + /(1 ) = /{(1 )}.

17. With conventional priors, posteriors are near Be(56, 252) and Be(34, 212),
so approximating by normals of the same means and variances N(0.1818, 0.0004814),
N(0.1382, 0.0004822),
so N(0.0436, 0.000963) so P( > 0.01) =

((0.0436 0.01)/ 0.000963) = (1.083) = 0.8606 and so the posterior odds are
0.8606/(1 0.8606) = 6.174.
18. Using a normal approximation, x N(8.5, 8.5) and y N(11.0, 11.0), so that
x y N(2.5, 19.5).

D.6

Exercises on Chapter 6

1. Straightforward substitution gives

 Sample          1        2        3        4        5     Total
 n              12       45       23       19       30       129
 r           0.631    0.712    0.445    0.696    0.535
 tanh⁻¹ r    0.743    0.891    0.478    0.860    0.597
 n tanh⁻¹ r  8.916   40.095   10.994   16.340   17.910    94.255

Posterior for ζ is N(94.255/129, 1/129) and a 95% HDR is 0.7307 ± 1.96 × 0.0880,
that is, (0.5582, 0.9032). A corresponding interval for ρ is (0.507, 0.718).
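The same arithmetic in R (following the solution above, which uses a precision of Σnᵢ for the combined posterior):

n <- c(12, 45, 23, 19, 30)
r <- c(0.631, 0.712, 0.445, 0.696, 0.535)
z <- atanh(r)                          # tanh^{-1} r for each sample
zeta <- sum(n * z) / sum(n)            # 94.255 / 129 = 0.7307
hdr <- zeta + c(-1, 1) * qnorm(0.975) / sqrt(sum(n))
tanh(hdr)                              # back to the rho scale: about (0.51, 0.72)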
2. Another straightforward substitution gives

 n            45       34       49
 1/n      0.0222   0.0294   0.0204
 r         0.489    0.545    0.601
 tanh⁻¹ r  0.535    0.611    0.695

Hence ζ₁ − ζ₂ ∼ N(0.535 − 0.611, 0.0222 + 0.0294), that is, N(−0.076, 0.227²).
Similarly ζ₂ − ζ₃ ∼ N(−0.084, 0.223²) and ζ₃ − ζ₁ ∼ N(0.160, 0.206²). It follows
without detailed examination that there is no evidence of difference.
3. ζ ∼ N(Σnᵢ tanh⁻¹ rᵢ / Σnᵢ, 1/Σnᵢ), so the required interval is

$$\frac{\sum n_i \tanh^{-1} r_i}{\sum n_i} \pm \frac{1.96}{\sqrt{\sum n_i}}.$$

4.

We found in Section 6.1 that


p( | x, y) p()

(1 2 )(n1)/2
(1 r)n(3/2)

As p()(1 2 ) 2 (1 r)3/2 does not depend on n, for large n


p( | x, y)

(1 2 )n/2
(1 r)n

so that log l( | x, y) = c + 12 n log(1 2 ) n log(1 r), and hence


(/) log l( | x, y) = n(1 2 )1 + nr(1 r)1 .
implying that the maximum likelihood estimator b
= r. Further
( 2 /2 ) log l( | x, y) = n(1 2 )1 2n2 (1 2 )2 + nr2 (1 r)2 ,
so that if = r we have ( 2 /2 ) log l( | x, y) = n(1 2 )2 . This implies that
the information should be about I( | x, y) = n(1 2 )2 , and so leads to a prior
p() (1 2 )1 .
5.

We have
p(|x, y) p()(1 2 )(n1)/2

(cosh t r)(n1) dt

Now write r = cos so that


p(|x, y) p()(1 2 )(n1)/2

(cosh t + cos )(n1) dt

and set

Z
Ik =

(cosh t + cos )k dt

We know that I₁ = θ/sin θ (cf. J. A. Edwards, A Treatise on the Integral Calculus


(2 vols), London: Macmillan (1921) [reprinted New York: Chelsea (1955)], art. 180).
Moreover
(/)(cosh t + cos )k = k sin (cosh t + cos )(k+1)
and by induction
(/ sin )k (cosh t + cos )1 = k!(cosh t + cos )(k+1)
Differentiating under the integral sign, we conclude that
(/ sin )k I1 = k!Ik+1

(k > 0).

Taking k = n 2, or k + 1 = n 1, we get
Z
Ik =
(cosh t + cos )k dt (/ sin )k (/ sin ).
0

(ignoring the factorial). Consequently


p(|x, y) p()(1 2 )(n1)/2 (/ sin )k (/ sin ).
Since d(r)/d = d( cos )/d = sin and so / sin = d/d(r), we could
alternatively write


arc cos(r)
dn2
p(|x, y) p()(1 2 )(n1)/2
d(r)n2 (1 2 r2 ) 21

6.

Supposing that the prior is


p() (1 2 )k/2

and r = 0 then
p(|x, y) p()(1 2 )(n1)/2
so with the prior as in the question we get
p(|x, y) (1 2 )(k+n1)/2
If we define
t=

(k + n + 1) p

(1

2 )

= p

t
{(k + n + 1) + t2 }

so that
1 2 =

(k + n + 1)
d
(k + n + 1)2t
,
2
=
2
(k + n + 1) + t
dt
{(k + n + 1) + t2 }2
(k + n + 1)
d
=
dt
{(k + n + 1) + t2 }3/2

and hence
p(t|X, Y ) p(|X, Y ) d/ dt

(k+n1)/2
(k + n + 1)
(k + n + 1)

2
(k + n 1) + t
{(k + n + 1) + t2 }3/2


(k+n+2)/2
t2
1+
k+n+1
This can be recognized as Students t distribution on k + n + 1 degrees of freedom (see
Section 2.12).

7. See G. E. P. Box and G. C. Tiao, Bayesian Inference in Statistical Analysis, Reading, MA: Addison-Wesley (1973, Section 8.5.4the equation given in the question is
implicit in their equation (8.5.49)).
8. See G. E. P. Box and G. C. Tiao, Bayesian Inference in Statistical Analysis, Reading, MA: Addison-Wesley (1973, Section 8.5.4, and in particular equation (8.5.43)).
9.

For a bivariate normal distribution


log l(, , |x, y) = log 2 +

1
2 (x

1
2

log
2

) (x )(y ) 21 (y )2

where = 2 = 1/. Then


(/) log l = 12 / 12 (x )2
(/) log l == 21 / 12 (y )2 ,
(/) log l = / (x )(y ),
Consequently the information matrix, i.e. minus the matrix of second derivatives (taking expectations is trivial as all elements are constant) is

1 2 2

21 1 + 12 2
2
2
1 2 2


2
I = 12 1 + 21 2
2
2
2
1

+ 2 2 2

2
2
2
1

2
2
2
= 2
2
2 2 2( + 2 )
so that its determinant is
1
det I = 6
8


2

2

2

2
2
2

2
2
2( + 2 )

Adding 21 / times the last column to the first and 12 / times the last column to the
second we get

0
0
2
1

1
2
det I = 6
8
1
1
2
2( + )


1

1
= 6 (2) 2 + 2 = 3
8

4
We thus conclude that det I = 1/4 3 implying a reference prior
p(, , 3/2 .

Rather similarly we get




2 /2
1/ /2
(, , )
2
2 /2
= 1/ /
(, , )
2
/
/2


2
2
2
1
2
2
= 6 2

+ 2

2/2
2/2
1/ 22 /2

and evaluate this as 1/3 . It then follows that


p(, , ) | 1/3 | p(, , ) 3/2 .

10. Age is y, weight is x. n = 12, x̄ = 2911, ȳ = 38.75, Sxx = 865,127,
Syy = 36.25, Sxy = 4727. Hence a = ȳ = 38.75, b = Sxy/Sxx = 0.005464,
r = Sxy/√(Sxx Syy) = 0.8441, See = Syy − Sxy²/Sxx = Syy(1 − r²) = 10.422, and
s² = See/(n − 2) = 1.0422. Take x₀ = 3000 and get a + b(x₀ − x̄) = 39.24
as the mean age for weight 3000. For a particular baby, note that the 95% point of
t₁₀ is 1.812 and s√{1 + n⁻¹ + (x₀ − x̄)²/Sxx} = 1.067, so a 90% interval is
39.24 ± 1.812 × 1.067, that is, (37.31, 41.17). For the mean age of all such babies,
note s√{n⁻¹ + (x₀ − x̄)²/Sxx} = 0.310, so the interval is (38.68, 39.80).
11.

From the formulae in the text

2
2
2
See = Syy Sxy
/Sxx = Syy 2Sxy
/Sxx + Sxy
/Sxx = Syy 2bSxy + b2 Sxx

where b = Sxy /Sxx . Hence


X
X
X
(yi y)(xi x) + b2
(xi x)2
See =
(yi y)2 2b
X
=
{(yi y) b(xi x)}2 .
The result now follows as yi = a.
P
P
P
12. Clearly (as stated in the hint)
ui =
vi =
ui vi = 0, hence u = v = 0
and Suv = 0. We now proceed on the lines of Section 6.3, redefining See as Syy
2
Suy 2 /Suu Svy
/Svv and noting that
X

(yi ui vi )2 =

{(yi y) + (y ) ui vi }

= Syy + n(yi y)2 + 2 Suu + 2 Svv 2Suy 2Svy


2
2
= Syy Suy
/Suu Svy
/Svv + n(yi y)2

+ Suu ( Suy /Suu )2 + Svv ( Svy /Svv )2


= See + n(yi y)2 + Suu ( Suy /Suu )2 + Svv ( Svy /Svv )2

We consequently get a density p(, , , | x, y) from which we integrate out both


and to get
p(, | x, y) n/2 exp[ 12 {See + n( a)2 }/].
The result now follows.
13. I = 4, ΣKᵢ = 28, G = 1779, ΣΣx²ᵢₖ = 114,569, C = G²/N = 113,030.
Further, ST = 1539, St = (426² + 461² + 450² + 442²)/7 − C = 93, and hence the
analysis of variance table is as follows:

                       ANOVA Table
  Source        Sum of     Degrees of    Mean      Ratio
                squares     freedom      square
  Treatments        93          3          31      (< 1)
  Error           1446         24          60
  TOTAL           1539         27

We conclude that the results from the four samples agree.
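A short R sketch reproducing this table from the quoted summary statistics (the raw data are not given here, so ΣΣx²ᵢₖ = 114,569 is taken as stated):

Ti <- c(426, 461, 450, 442)      # treatment totals, 7 observations each
N <- 28; G <- sum(Ti)            # G = 1779
C <- G^2 / N                     # correction factor, about 113030
ST <- 114569 - C                 # total sum of squares, about 1539
St <- sum(Ti^2) / 7 - C          # treatment sum of squares, about 93
Se <- ST - St                    # error sum of squares, about 1446
c(St / 3, Se / 24, (St / 3) / (Se / 24))   # mean squares 31 and 60, ratio well below 1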


14. d = 2 + 4 + 6 3 5 7 with db = 1.4 and Kd = (6/4)1 so that
0.82(d + 1.4)/2.74 t25 .
15. The analysis of variance table is as follows:

                       ANOVA Table
  Source        Sum of     Degrees of    Mean      Ratio
                squares     freedom      square
  Treatments     49,884         2        24,942     13.3
  Blocks        149,700         5        29,940     16.0
  Error          18,725        10         1,872
  TOTAL         218,309        17

We conclude that the treatments differ significantly (an F₂,₁₀ variable exceeds 9.43
with probability 0.5%).
16.

This is probably seen most clearly by example.

x111
1 1 0 1 0
x112
1 1 0 1 0

x121
1 1 0 0 1

x122

, A = 1 1 0 0 1
x=
x211
1 0 1 1 0

x212
1 0 1 1 1

x221
1 0 1 0 1
x222
1 0 1 0 1

When r = t = 2 we take

1
1

2
1

=
1 .
1

2
1

0
12
0

17. Evidently A+ A = I from which (a) and (b) and (d) are immediate. For (c), note
AA+ = A(AT A)1 AT which is clearly symmetric.
18.

We take

1 x1


P
1 x2
n
xi

T
P
P
A= .
,
A A=
..
xi
x2i

1 xn
P
so that writing S = (xi x)2 we see that det(AT A) = nS and hence
 P 2

 P

P
1
xi
Pxi
P yi
(AT A)1 =
,
AT y =
xi
n
xi yy
nS
b = (AT A)1 AT y is easily found. In particular
from which
 P(y y)(x x)
X
1  X X
i
i
P

xi
yi + n
xi yi =
.
b1 =
nS
(xi x)2

D.7

Exercises on Chapter 7

1. Recall that t is sufficient for given x if p(x|) = f (t, )g(x) (see Section 2.9).
Now

p(x|) + p(y|) = p(z|) if t = t(z)
p(t|) =
p(x|)
if t = t(x) for x 6= y
so that


p(t|) =

0
if x = y
p(x|) if x = t

Setting f(t, θ) = p(t|θ) and g(x) = 1 (x ≠ y), g(y) = 0, it follows that t is sufficient
for x given θ. It then follows from a naïve application of the weak sufficiency principle
that Ev(E, y, θ) = Ev(E, z, θ). However, if x̃ is a continuous random variable, then
p_x̃(y|θ) = 0 for all y, so we may take y and z as any two possible values of x̃.
2.

l( | g(x)) = l( | h(x)) from which the result follows.

3.

The likelihood function is easily seen to be

x/2, 2x, 2x + 1
1
(x 1)/2, 2x, 2x + 1
l(|x) =
for
=

3
1, 2, 3

when x is even
when x 6= 1 is odd
when x = 1.

The estimators d1 , d2 and d3 corresponding to the smallest, middle and largest possible
are

when x is even
x/2
(x 1)/2 when x 6= 1 is odd
d1 (x) =

1
when x = 1,

and d2 (x) = 2x, d3 (x) = 2x + 1. The values of p(d1 = ) given in the question now
follow. However, a Bayesian analyis leads to a posterior
p(|x) =

()IA ()
l(|x)()
=
p(x)
(d1 (x)) + (d2 (x)) + (d3 (x))

where () is the prior, A = {d1 (x), d2 (x), d3 (x)} and IA () is the indicator function
of A. Thus, indeed, the data conveys nothing to the Bayesian version except that
is d1 (x), d2 (x), or d3 (x). However, Bayesians are indifferent between d1 (x), d2 (x),
or d3 (x) only if they have equal prior probabilities, which cannot hold for all x if the
prior is proper. For a discussion, see J. O. Berger and R. L. Wolpert, The Likelihood
Principle, Hayward, CA: Institute of Mathematical Statistics (1984 and 1988, Example
34).
4. Computation of the posterior is straightforward. For a discussion, see J. O. Berger
and R. L. Wolpert The Likelihood Principle, Hayward, CA: Institute of Mathematical
Statistics (1984 and 1988, Example 35).
5.

Rules (a) and (b) are stopping times and rule (c) is not.

6.

n = 4, x = 1.25, z = 1.25 4 = 2.50.

(a) Reject at the 5% level (and indeed possibly do so on the basis of the first observation alone).
1

(a) Since B = (1 + n/) 2 exp[ 12 z 2 (1 + /n)1 ] = 1.07 with = , so with


0 = 12 we get p0 = (1 + B 1 )1 = 0.52. Null hypothesis still more probable than
not.
7.

As we do not stop the first four times but do the fifth time
3+1+2+5+7
3 1 2 5 11
exp(5)
3! 1! 5! 2! 7!
3 4 6 11 18
l(|x) 10 exp(5).

p(x|) =

8. ES = E(S + 1) 1 and after re-arrangement E(S + 1) is (s + 1)(R00 2)/(r00 2)


times a sum of probabilities for the beta-Pascal distribution with S replaced by (S + 1),
with s replaced by (s + 1), and with r00 replaced by (r00 1). As probabilities sum to
unity, the result follows.
9.

Up to a constant
L(|x, y) = log l(|x, y) = (x + n) log + (n x + y) log(1 )

so ( 2 L()|x, y)/ 2 ) = (x + n)/ 2 + (n x + y)/(1 )2 . Because the


expectations of x and y are n and n(1 )/ respectively, we get I(|x, y) =
n(1 + )/ 2 (1 ), so that Jeffreys rule leads to
1

p(|x, y) (1 + ) 2 1 (1 ) 2 .

P
P
x
10. Eu(x) =
u(x)p(x) =
u(x)2P
suggesting you would enter provided
e < Eu(x). If u(x) x then Eu(x)
2x 2x = resulting in the implausible proposition that you would pay an arbitrarily large entry fee e.
11. By differentiation of the log-likelihood L(|x) = x log + (n x) log(1 )
with respect to we see that x/n is the maximum likelihood estimator.
Because prior for is uniform, that is, Be(1, 1), posterior is Be(x + 1, n x + 1).
The question deals with a particular case of weighted quadratic loss, so we find d(x) as
Ew (|x) =

B(x + 1, n x) B(x + 1, n x + 1)
x
E((1 )1 |x)
=
= .
E(1 (1 )1 | x)
B(x + 1, n x + 1)
B(x, n x)
n

If x = 0 or x = n then the posterior loss (a, x) is infinite for all a because the integral
diverges at x = 0 or at x = n so the minimum is not well-defined.
12. The minimum of E( )2 2a(E( )+a2 clearly occurs when a = E( ).
But since the prior for is uniform, that is, Be(1, 1), its posterior is Be(x+1, nx+1)
and so its posterior mean is (x+1)/(n+2); similarly for y. We conclude that the Bayes
rule is
d(x, y) = (x y)/(n + 2).

13. Posterior mean (resulting from quadratic loss) is a weighted mean of the component means, so with the data in Section 2.10 is
0

10
20
20
115 10
14
+ 0
=
+
= 0.370.
10 + 20
10 + 20
129 10 + 20 129 10 + 20

Posterior median (resulting from absolute error loss) can be found by computing the
weighted mean of the distribution functions of the two beta distributions for various
values and honing in. Result is 0.343. Posterior mode (resulting from zero-one loss)
is not very meaningful in a bimodal case, and even in a case where the posterior is not
actually bimodal it is not very useful.
14.

If we take as loss function



L(, a) =

u(a ) if a 6
v( a) if a >

Suppose m(x) is a v/(u + v) fractile, so that




P x 6 m(x) > v/(u + v),
P x > m(x) > u/(u + v).
Suppose further that d(x) is any other rule and, for definiteness, that d(x) > m(x)
for some particular x (the proof is similar if d(x) < m(x)). Then

if 6 m(x)
u[m(x) d(x)]
(u + v) [um(x) + vd(x)] if m(x) < < d(x)
L(, m(x)) L(, d(x)) =

v[d(x) m(x)]
if > d(x)
while for m(x) < < d(x)
(u + v) [um(x) + vd(x)] < u[ m(x)] < u[d(x) m(x)]
so that

L(, m(x)) L(, d(x)) 6

u[m(x) d(x)]
v[d(x) m(x)]

if 6 m(x)
if > m(x)

and hence on taking expectations over




(m(x), x) (d(x), x) 6 u[m(x) d(x)] P( 6 m(x) | x)


+ v[d(x) m(x)] P( > m(x) | x)


= d(x) m(x) uP( 6 m(x) | x)

+ vP( > m(x) | x)



6 d(x) m(x) uv/(u + v) + uv/(u + v) = 0
from which it follows that taking a v/(u + v) fractile of the posterior distribution does
indeed result in the appropriate Bayes rule for this loss function.
15.

By integration by parts, if Be(2, k) then


Z
P( < ) =
k(k + 1)(1 )k d
0
Z


= (k + 1)(1 )k 0 +
(k + 1)(1 )k d
0

= [(k + 1)(1 ) (1 )k+1 = 1 (1 )k (1 + k).


In this case, the prior for is Be(2, 12) so that the prior probability of H0 , that is, that
< 0.1, is 0.379, while the posterior is Be(2, 18) so that the posterior probability that
< 0.1 is 0.580.
(a) With 01 loss we accept the hypothesis with the greater posterior probability,
in this case H1 .


(b) The second suggested loss function is of the 0–Kᵢ form and the decision depends on the relative sizes of p₁ and 2p₀. Again this results in a decision in favour of
H₁.
16.

Posterior expected losses are


(a0 , x) = 10p1 ,

(a1 , x) = 10p0 ,

(a2 , x) = 3p0 + 3p1 .

Choose a0 if 0 6 p0 6 0.3, choose a1 if 0.3 6 p0 6 0.7 and choose a2 if 0.7 6 p0 6


1.
17. Posterior variance is (2251 + 1001 )1 = 69.23 = 8.322 and the posterior
mean is 69.23(100/225 + 115/100) = 110.38. Posterior expected losses are
Z 110
Z
p(a1 , x) =
( 90)(|x) d +
2( 90)(|x) d
90

110

Now note that (with (x) the N(0, 1) density function


Z 0.046
Z 0.046
Z 110

z(z) d + 20.38
(z) d
( 90)(|x) d = 69.23
2.450
2.450
90
p
= (69.23/2){exp( 12 0.0462 ) exp( 21 2.4502 )}
+ 20.39{(0.046) (2.450)}
= 3.15 + 9.66 = 6.51.
By similar manipulations we find that (a1 , x) = 34.3, (a2 , x) = 3.6 and (a3 , x) =
3.3, and thus conclude that a3 is the Bayes decision.
18.

In the negative binomial case


p(|x) = p()p(x|)/pxe(x)


n+x1
=
(1 )n x /pxe(x)
x

It follows that the posterior mean is



Z 
n+x1
E(|x) =
(1 )n x /pxe(x)
x


x+1 n+x
=
(1 )n x /pxe(x)
n+x x+1
(x + 1) pxe(x + 1)
=
(n + x) pxe(x)
This leads in the same way as in Section 7.5 to the estimate
n =

( + 1)fn ( + 1)
.
(n + )(fn () + 1)

D.8

Exercises on Chapter 8

1.

Write = /( + ) and = ( + )1/2 . The Jacobian is




/( + )2
/ / /( + )2
=
= ( + )5/2
/ / 21 ( + )3/2 12 ( + )3/2

from which the result follows.


2.

p(, | x) p() p(|) p(x|). So


Y
p(, | x) n
ixi exp(( + 1)i )

and by the usual changeof-variable rule (using |dz/ d| = 1/(1 + )2 )


Y
p(z, | x) z n2 (1 z)n
ixi exp(i /z).
Integrating with respect to all the i using

x exp(/z) z x+1 we get

p(z|x) z nx2 (1 z)n


or z Be(nx 1, n + 1), from which it follows that Ez = (nx 1)/(nx + n). Now
note that
p(i | xi , ) p(xi |i ) p(i |) ixi exp(i ) exp(i ) ixi exp(i /z)
which is a G(xi , z) or 21 z2xi distribution, from which it follows that E(i | xi , ) = xi z
and so
E(i |x) = xi E(z|x) = xi (nx 1)/(nx + n).

b = bB . Expression for (b0 , X) has S1 replaced by a


3. (a) Use same estimator
weighted sum of squares about .
b with bi the posterior median of the ith component.
(b) Use estimator
4. We find the mean square error of the transformed data is 3.14 times smaller and
of the equivalent probabilities is 3.09 (as opposed to corresponding figures of 3.50 and
3.49 for the Efron-Morris estimator). For a complete analysis, see the program
http://www-users.york.ac.uk/~pml1/bayes/rprogs/baseball.txt
or
http://www-users.york.ac.uk/~pml1/bayes/cprogs/baseball.cpp


5. Evidently AT A = I or equivalently j T k = 1 if j = k and 0 if j 6= k. It


follows that Wj N(0, 1) and that C(Wj , Wk ) = 0 if j 6= k which, since the Wj have
a multivariate normal distribution, implies that Wj and Wk are independent. Further,
as AT A = I we also have AAT = I and so
W T W = (X )T AAT (X ) = (X )T (X ).
For the last part, normalize the ij for fixed j to produce a unit vector and the rest
follows as above.
6.

We know that
+

R(, bJS ) R(, bJS ) =


i
1 X h  bJS 2 bJS + 2 
+
E
2i E biJS biJS
.

To show that the experession in brackets is always > 0, calculate the expectation by
+
first conditioning on S1 . For any value S1 6 r 2, we have biJS = biJS so that it
is enough to show that the right hand side is positive when conditional on any value
+
S1 = s1 > r 2. Since in that case biJS = 0, it is enough to show that for any
s1 > r 2



h
i
r2
JS
b
i E i S1 = s1 = i 1
E(Xi i | S1 = s1 ) 6 0
s1
and hence it is enough to show thati E(Xi i | S1 = s1 ) > 0. Now S1 = s1
is equivalent to (X1 )2 = s1 (X2 )2 + + (Xn )2 . Conditioning
further on X2 , . . . , Xn and noting that

P(X1 = y | S1 = s1 , X2 = x2 , . . . , Xn = xn ) exp 21 (y 1 )2

P(X1 = y | S1 = s1 , X2 = x2 , . . . , Xn = xn ) exp 21 (y + 1 )2
we find that
E [1 (X1 1 ) | S1 = s1 , X2 = x2 , . . . Xn = xn ] =

1 y(e1 y e1 y )
e1 y + e1 y

p
where y = s1 (x2 )2 (xn )2 . The right hand side is an increasing
function of |1 y|, which is zero when 1 y = 0, and this completes the proof.
7.

Note that, as AT A + kI and AT A commute,




b bk = (AT A + kI) (AT A) (AT A)1 (AT A + kI)1 AT x
= k(AT A)1 bk .

The bias is


b(k) = (AT A + kI)1 AT A I ,


and the sum of the squares of the biases is


G(k) = b(k)T b(k) = (Ebk )T (Ebk ).
The variance-covariance matrix is
V(bk ) = (AT A + kI)1 AT A(AT A + kI)1
so that the sum of the variances of the regression coefficients is
F(k) = E(bk E)T (bk E)
= Trace(V(bk ))


= Trace (AT A + kI)1 AT A (AT A + kI)1
and the residual sum of squares is
RSSk = (x Abk )T (x Abk )
b + A(b bk )}T {(x A)
b + A(b bk )}.
= E{(x A)
b = AT x AT A(AT A)1 AT x = 0, this can be written as
Because AT (x A)
b T (x A)
b + (b bk )T AT A(b bk )
RSSk = (x A)
= RSS + k 2 bk T (AT A)1 bk
while the mean square error is
MSEk = E(bk )T (bk )
= E{(bk Ebk ) + (Ebk )}T {(bk Ebk ) + (Ebk )}
= E(bk Ebk )T (bk Ebk ) + (Ebk )T (Ebk )
= F(k) + G(k)
using the fact that E(bk Ebk ) = 0.
It can be shown that F(k) is a continuous monotonic decreasing function of k with
F 0 (0) < 0, while G(k) is a continuous monotonic increasing function with G(0) =
G 0 (0) = 0 which approaches T as an upper limit as k . It follows that there
always exists a k > 0 such that MSEk < MSE0 . In fact, this is always true when
k < 2/ T (cf. C. M. Theobold, Generalizations of mean square error applied to
ridge regression, Journal of the Royal Statistical Society Series B, 36 (1974), 103
106).
8.

We defined

H 1 = 1 1 B B T 1 B B T 1

so that
B T H 1 B = B T 1 B B T 1 B = 0.

If B is square and non-singular then


H 1 = 1 1 BB 1 B T

9.

1

B T 1 = 1 1 = 0.

It is easily seen that


B T 1 =
T




B=

1
0

1
0

21
0

0
21

0
1


0
1


,

so that

H 1 = 1 1 B B T 1 B B T 1
1 1
12 1
0
0
2
1
1 1
2 1
0
0

2
=
1 1
1 1

0
0
2
2
1 1
0
0
12 1
2
while

4
0
T
A A=
2
2

0
4
2
2

2
2
4
0

2
2
.
0
4

Now
K 1 = AT 1 A + H 1

4 + 12 1
21 1
1
1
2
4 + 12 1
=

2
2
2
2

2
2
4 + 21 1
21 1

12 1
4 + 12 1

10. See D. V. Lindley and A. F. M. Smith, Bayes estimates for the linear model
(with discussion), Journal of the Royal Statistical Society Series B, 34 (1971), 141
[reprinted in N. Polson and G. C. Tiao, Bayesian Inference (2 vols), (The International
Library of Critical Writings in Econometrics, No. 7), Aldershot: Edward Elgar (1995)].
11.

We have all ai = a so uT u = mb/a so 1 + uT u = (a mb)/a and thus






1
1
b
1
b
a (m + 1)b
ii =
1
=
1

=
a
a
a mb
a(a mb)
1 + uT u a
(m + 1)b
ij =
a(a mb)

where a = n/ + 1/ and b = 1/r.



D.9

Exercises on Chapter 9

1. A simulation in R, the program for which can be seen on

http://www-users.york.ac.uk/~pml1/bayes/tprogs/integral.txt

resulted in values 1.684199, 1.516539, 1.7974, 1.921932, 1.595924, 1.573164, 1.812976,


1.880691, 1.641073, 1.770603 with a mean of 1.719450, as opposed to the theoretical value of e − 1 = 1.71828. The theoretical variance of a single value of e^X where
X U(0, 1) is 21 (e 1)(3 e) = 0.24204 so that the standard deviation of the mean
p
of 10 values is 0.24204/10 = 0.15557 which compares with a sample standard
deviation of 0.1372971.
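The linked program is not reproduced here, but a minimal R sketch consistent with the description above (each of the ten quoted values being the mean of ten draws of e^X) is:

set.seed(1)                                   # any seed; the quoted values will differ
means <- replicate(10, mean(exp(runif(10))))  # ten crude Monte Carlo estimates of the integral
mean(means)                                   # compare with e - 1 = 1.71828
sd(means)                                     # compare with sqrt(0.24204 / 10) = 0.15557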
A simulation in C++, the program for which can be seen on
http://www-users.york.ac.uk/~pml1/bayes/cprogs/integral.cpp

resulted in values 1.74529, 1.86877, 1.68003, 1.95945, 1.62928, 1.62953, 1.84021,


1.87704, 1.49146, 1.55213 with a mean of 1.72732 and a sample standard deviation of
0.155317.
2. The easiest way to check the value of At is by induction. Then note that for large
t the matrix At is approximately


3/5
2/5
3/5
2/5
from which the result follows.
3.

In this case

Q = E[(y₁ + y₂ − 1) log η + (y₃ + y₄ − 1) log(1 − η)]

giving

$$\eta^{(t+1)} = \frac{E(y_1 \mid \eta^{(t)}, x) + x_2 - 1}{E(y_1 \mid \eta^{(t)}, x) + x_2 + x_3 + E(y_4 \mid \eta^{(t)}, x) - 2}$$

where y₁ ∼ B(x₁, η/(η + 1)) and y₄ ∼ B(x₄, (1 − η)/(2 − η)), so that

$$\eta^{(t+1)} = \frac{x_1\eta^{(t)}/(\eta^{(t)}+1) + x_2 - 1}{x_1\eta^{(t)}/(\eta^{(t)}+1) + x_2 + x_3 + x_4(1-\eta^{(t)})/(2-\eta^{(t)}) - 2}$$

$$= \frac{461\,\eta^{(t)}/(\eta^{(t)}+1) + 130 - 1}{461\,\eta^{(t)}/(\eta^{(t)}+1) + 130 + 161 + 515(1-\eta^{(t)})/(2-\eta^{(t)}) - 2}.$$

Starting with (0) = 0.5, successive iterations give 0.4681, 0.4461, 0.4411, 0.4394,
0.4386, 0.4385, and thereafter 0.4384. This compares with a value of 0.439 found by
maximum likelihood in C. A. B. Smith, op. cit.
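The iteration can be run directly in R; this sketch simply codes the update displayed above:

x <- c(461, 130, 161, 515)
eta <- 0.5
for (t in 1:10) {
  Ey1 <- x[1] * eta / (eta + 1)         # E-step: expected count in the eta/4 part of cell 1
  Ey4 <- x[4] * (1 - eta) / (2 - eta)   # E-step: expected count in the (1-eta)/4 part of cell 4
  eta <- (Ey1 + x[2] - 1) / (Ey1 + x[2] + x[3] + Ey4 - 2)   # M-step
  cat(t, eta, "\n")
}
# converges to about 0.4384, the value quoted above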

4. The proof that p( (t+1) | x) > p( (t+1) | x) in the subsection Why the EM algorithm works only uses the properties of a GEM algorithm. As for the last part, if
convergence has occurred there is no further room for increasing Q, so we must have
reached a maximum.
5. Take prior for variance S0 2 , that is, 35024 , and prior for mean which is N(0 , 0 )
where 0 = 85 and 0 = S0 /n0 ( 2) = 350/(1 2) = 175. We then have data with
n = 100, x = 89 and S1 = (n 1)s2 = 2970. Successive iterations using the EM
algorithm give the posterior mean as 88.9793 and at all subsequent stages 88.9862.
This compares with 88.96 as found in Exercise 16 on Chapter 2.
6. Means for the four looms are 97.3497, 91.7870, 95.7272 and 96.8861, overall
mean is 97.4375. Variance of observations from the same loom is 1.6250 and variance
of means from different looms in the population is 5.1680.
7. (a) For the example on genetic linkage in Section 9.3, see
http://www-users.york.ac.uk/~pml1/bayes/rprogs/dataaug.txt
or
http://www-users.york.ac.uk/~pml1/bayes/cprogs/dataaug.cpp

(b) For the example on chained data sampling due to Casella and George, see a similar
file with chained in place of dataaug.
(c) For the example on the semi-conjugate prior with a normal likelihood (using both
the EM algorithm and chained data augmentation, see a similar file with semiconj
in place of dataaug.
(d) For the example on a change-point analysis of data on coal mining disasters, see
similar files with coalnr.cpp or coal.cpp in place of dataaug.cpp (the difference is that the former uses only routines in W. H. Press et al., Numerical Recipes
in C (2nd edn), Cambridge: University Press 1992, whereas the latter program, which
is in many ways preferable, uses gamma variates with non-integer parameters).
8.

See the program referred to in part (a) of the previous answer.

9.

See the program referred to in part (b) of the answer to the question before last.

10.

See the program


http://www-users.york.ac.uk/~pml1/bayes/rprogs/semicon2.txt
or
http://www-users.york.ac.uk/pml1/bayes/cprogs/semicon2.cpp
11. B. P. Carlin and T. A. Louis, Bayes and Empirical Bayes Methods for Data Analysis (2nd edn), Chapman and Hall 2000, p. 149, remark that
In our dataset, each rat was weighed once a week for five consecutive weeks. In
fact the rats were all the same age at each weighing xi1 = 8, xi2 = 15, xi3 = 22,
xi4 = 29, and xi5 = 36 for all i. As a result we may simplify our computations by
rewriting the likelihood as
Yij N (i + i (xij x) , 2 ), i = 1, ..., k, j = 1, ..., ni ,
so that it is now reasonable to think of i and i as independent a priori. Thus we
may set = Diag(2 , 2 ), and replace the Wishart prior with a product of independent
inverse gamma priors, say IG(a , b ) and IG(a , b ).
This being so, it probably suffices to proceed as in the Rats example supplied with
WinBUGS.
A more detailed description of the general set-up is to be found in
Illustration of Bayesian Inference in Normal Data Models Using Gibbs Sampling,
Alan E. Gelfand, Susan E. Hills, Amy Racine-Poon, and Adrian F. M. Smith, Journal
of the American Statistical Association 85 (412) (1990), 972985, Section 6, pp. 978
979, which can be found at
http://www-users.york.ac.uk/~pml1/bayes/rprogs/gelfand.pdf
(LATEXsource at gelfand.htm). A program for generating random matrices with
a Wishart distribution based on the algorithm of Odell and Feiveson described in
W. J. Kennedy and J. E. Gentle, Statistical Computing, Marcel Dekker 1980, pp. 231
232 can be found at
http://www-users.york.ac.uk/~pml1/bayes/rprogs/wishart.txt
while the relevant extract from Kennedy and Gentle is at
http://www-users.york.ac.uk/~pml1/bayes/rprogs/rwishart.pdf
12.

See the program at

http://www-users.york.ac.uk/~pml1/bayes/rprogs/linkagemh.txt
13.

See the program at


http://www-users.york.ac.uk/~pml1/bayes/winbugs/wheat.txt
or
http://www-users.york.ac.uk/~pml1/bayes/rprogs/wheat.txt
14.

See the program at


http://www-users.york.ac.uk/~pml1/bayes/winbugs/logisti2.txt
or
http://www-users.york.ac.uk/~pml1/bayes/rprogs/logisti2.txt

D.10

Exercises on Chapter 10

1. We note that, since EX² ≥ (EX)²,

Z 

Ew(x) =

f (x)q(x)
p(x)

2

Z
p(x) dx >

2

2

Z
=

|f (x)|q(x)
p(x) dx
p(x)

|f (x)|q(x) dx

If we take
p(x) = R

|f (x)|q(x)
|f ()|q() d

then
2

2

Z Z

Ew(x) =

|f ()|q() d

2

Z
p(x) dx. =

|f (x)|q(x) dx

so that this choice of p(x) minimizes Ew(x)2 and, since we always have Ew(x) = ,
consequently minimizes Vw(x).
2.

For sampling with the Cauchy distribution we proceed thus:


(a) Substituting x = tan we find
Z
1 1
dx =
= P(x > 2) =
1 + x2
2
= 0.147 583 6.

1
2

tan1 (2)/ = tan1 ( 12 )/

The variance is
(1 )/n = 0.125 802 7/n.

(b) Use the usual change of variable rule to deduce the density of y. Then note that
q(yi )
1 yi2
1/((1 + yi2 ))
=
=
p(yi )
2/yi2
2 1 + yi2
and that we can take f (y) = 1 as all values of y satisfy y > 2.
(c) Substitute xi = 2/yi in the above to get
4
1
.
2 4 + x2i
(d) On writing x = 2 tan and 0 = tan1 ( 12 ) we find
Z
Eb
=
0

1
4
dx =
2 4 + x2

1 1
2 sec2 d = 0 / =
2 sec2

Similarly, noting that sin(20 ) = 4/5 the integral for Eb


2 transforms to
Z
0

1
1
2 sec2 d =
4 2 sec4

1
2 cos2 d
4 2

1
{1 + cos(2)} d
2
0 4

0
= + 21 sin(2) 0 /(4)2


= tan1 ( 21 ) + 25 /(4 2 )
=

= 0.021 876 4
so that
V b = Eb
2 (Eb
)2 = 0.000 095 5.

3.

A suitable program is
n <- 10000
alpha <- 2
beta <- 3
x <- runif(n)
p <- function(x) dunif(x)
q <- function(x) dbeta(x,alpha,beta)
w <- q(x)/p(x)
pi <- w/sum(w)
samp <- sample(x,size=n,prob=pi,replace=T)
print(mean(samp))
print(var(samp))

A run of this program resulted in a value of the mean of 0.399 as compared with the
true value of 0.4 and a value of the variance of 0.0399 compared with a true value of 0.04.
4.

Continue the program (avoiding ambiguity in the definition of ) by


theta <- 0.1
le <- (1-theta)*n
lo <- 1:(n-le)
hi <- (le+1):n
y <- sort(samp)
r <- y[hi]-y[lo]
rm <- min(r)
lom <- min(lo[r==rm])
him <- min(hi[r==rm])
dd <- function(x) dbeta(x,alpha,beta)
plot(dd,xlim=c(0,1),ylim=c(0,2))
par(new=T)
plot(density(samp),xlim=c(0,1),ylim=c(0,2),
main="",xlab="",ylab="")
abline(v=y[lom])
abline(v=y[him])
print(y[lom])
print(y[him])

The lower limit resulting from a run of the program was 0.061 and the upper limit was
0.705. Referring to the tables in the Appendix we find that values of F corresponding
to a 90% HDR of log F6,4 are F = 0.21 and F = 5.75. It follows that an appropriate
interval of values of F4,6 is (1/F, 1/F), that is (0.17, 4.76), so that an appropriate
interval for is
2 0.17
2 4.76
66
3 + 2 0.17
3 + 2 4.76
that is (0.10, 0.76) (but note that in Section 3.1 we were looking for intervals in which
the distribution of the log-odds is higher than anywhere outside, rather than for HDRs
for the beta distribution itself).
5.

A suitable program is
r <- 10
phi <- rep(NA,r)
mu <- rep(NA,r)
S <- rep(NA,r)
n <- 100; xbar <- 89; SS <- (n-1)*30
mu0 <- 85; S0 <- 350; nu0 <- 4; n0 <- 1
phi0 <- S0/(n0*(nu0-2))
S[1] <- S0
nustar <- nu0 + n

for (i in 2:r) {
phi[i] <- (1/phi0 + n*nustar/S[i-1])^(-1)
mu[i] <- phi[i]*(mu0/phi0 + n*xbar*nustar/S[i-1])
S[i] <- S0 + (SS+n*xbar^2) - 2*n*xbar*mu[i] +
n*(mu[i]^2 + phi[i])
cat("i",i,"phi",phi[i],"mu",mu[i],"S",S[i],"\n")
}
mustar <- mu[r]; phistar <- phi[r]; Sstar <- S[r]
cat("mu has mean",mustar,"and s.d.",sqrt(phistar),"\n")
cat("phi has mean",Sstar/(nustar-2),"and s.d.",
(Sstar/(nustar-2))*sqrt(2/(nustar-4)),"\n")
We get convergence to = 88.993 and = 0.322.
6.

In the discrete case the Kullback-Leibler divergence is

$$I(q:p) = \sum_x q(x)\log\{q(x)/p(x)\} = \sum_x \binom{n}{x}\rho^x(1-\rho)^{n-x}\log\left[\binom{n}{x}\rho^x(1-\rho)^{n-x}\bigg/\binom{n}{x}\pi^x(1-\pi)^{n-x}\right]$$

$$= \sum_x \binom{n}{x}\rho^x(1-\rho)^{n-x}\,\{x\log(\rho/\pi) + (n-x)\log[(1-\rho)/(1-\pi)]\}$$

$$= n\rho\log(\rho/\pi) + n(1-\rho)\log[(1-\rho)/(1-\pi)]$$

using the fact that the mean of B(n, ρ) is nρ.


For I(q : p) = I(p : q) we need
n log(/)+n(1) log[(1)/(1)] = n log(/rho)+n(1) log[(1)/(1)]

so


( ) log

from which it is clear that this can happen if and only if = (when I(q : p) = I(p :
q) = 0).
7.

In the continuous case the Kullback-Leibler divergence is


Z
I(q : p) = q(x) log{q(x)/p(x)} dx
Z
= (2)1/2 exp{ 12 (x )2 /}
n h
i
log (2)1/2 exp{ 12 (x )2 /}
h
io
log (2)1/2 exp{ 21 (x )2 /}

1
2

log(/) E(x )2 / + E(x )2 /

where x N(, ). By writing (x )2 = {(x ) + ( )}2 it is easily concluded


that
2I(q : p) = log(/) + ( )/ + ( )2 /.
In particular if = then
I(q : p) = 21 ( )2 /.

8.

We find
Z
I(q : p) = q(x) log{q(x)/p(x)} dx
Z
n
o
.
=
1 exp(x/) log 1 exp(x/) 2(2)1/2 exp( 21 x2 )
0

1
2

log(8) log Ex/ + 12 Ex2

where x E(). It follows that


I(q : p) =

1
2

log(8) log 1 + 2

and hence
1
I(q : p)
= + 2

2 I(q : p)
1
= 2 +2
2

It follows that I(q : p) is a minimum when = 2.


9.

See the paper by Corduneanu and Bishop (2001).

10. The model is

(¼ + ¼η, ¼η, ¼(1 − η), ¼(1 − η) + ¼)

and the values quoted are x1 = 461, x2 = 130, x3 = 161 and x4 = 515. An R
program for the ABC-REJ algorithm is
N <- 100000
etastar <- c(461,130,161,515)
n <- sum(etastar)
eps <- 22
d <- rep(NA,N)
trial <- rep(NA,N)

for (j in 1:N) {
etatrial <- runif(1)
inds <- sample(4,n,replace=T,p=c(0.25+etatrial/4,etatrial/4,
(1-etatrial)/4,(1-etatrial)/4+0.25))
samp <- c(sum(inds==1),sum(inds==2),sum(inds==3),sum(inds==4))
d <- sqrt(sum((etastar-samp)^2))
if (d <= eps) trial[j] <- etatrial
}
eta <- trial[!is.na(trial)]
k <- length(eta)
m <- mean(eta)
s <- sd(eta)
cat("k",k,"m",m,"s",s,"\n")
Similar modifications can be made for the other algorithms.
11. Let Mi denote model i (i = 1, 2, 3, 4) with model parameters i , where 1 =
(, ), 2 = (, , ), 3 = (, , ) and 4 = (, , , ).
The reversible jump MCMC algorithm is as follows:
Initiate the algorithm by choosing an initial model, m to start in and initial parameter values m .
Each iteration t:
Update the model parameters mt1 , where mt1 is the model which the
algorithm is residing in at the end of iteration t 1.
Propose to move to a new model Mt0 with parameter values 0 . Calculate
the acceptance probability for the move and decide whether to accept or
reject the new model (and parameter values).
The models are nested with model 4 being the full model with sub-models model 2 and
model 3 and the simplest model 1. Each model either includes or excludes and
(exclusion of a parameter is equivalent to setting it equal to 0).
Given that the current model j, employ the following reversible jump:
Choose randomly either and , and propose to move to model i has the chosen
parameter included (excluded) if the parameter is excluded (included) in model j.
A bijection is required for transforming between the models. Therefore for inclusion
of a parameter propose u N(0, 2 ) for suitably chosen 2 and propose to set the
parameter equal to u. (This works best if xi and ti have been standardised to have
mean 0 and variance 1.) Leave the other parameters unchanged. Thus the Jacobian of
the transformation is simply 1.


Let Lj (j ) denote the likelihood of j given model Mj . Then the acceptance probability for the inclusion of a parameter (either or set equal to u) is
)
(
Li (i )
(i , Mi )
1/2
1

.
min 1,

Lj (j ) (j , Mj ) 1/2 exp( 1 u2 / 2 )/ 2 2
2

