Cours MC

Monte Carlo Statistical Methods/October 29, 2001
Monte Carlo Statistical Methods Christian P. Robert Universit Paris Dauphine e
Based on the book Monte Carlo Statistical Methods by Christian P. Robert and George Casella Springer-Verlag 1999
Introduction
1.1 1.2 1.3 1.4 1.5
Statistical Models Likelihood Methods Bayesian Methods Deterministic Numerical Methods Simulation versus numerical analysis
Experimenters choice before fast computers: Describe an accurate model which usually precludes computation of explicit answers or Choose a standard model which would allow such computations, but may not be a close representation of a realistic model. Such problems contributed to the development of simulation-based inference
1.1
Statistical Models
Example 1 Censored data models Missing data models where densities are not sampled directly. Typical simple statistical model: we observe Y1 , , Yn f (y|). The distribution of the sample given by the product
n
f (yi |)
i=1
Inference about based on this likelihood.
With censored random variables, actual observations: Yi = min{Yi , u} where u is censoring point. Inference about based on the censored likelihood.
For instance, if X N (, 2 ) and the variable Z = X Y = min(X, Y ) is distributed as z z z + 1

1
Y N (, 2 ),
where and are the density and cdf of the normal N (0, 1) distribution.
Similarly, if X Weibull(, ), with density f (x) = x1 exp(x ) the censored variable Z = X , has the density f (z) = z e
z
constant,
I z + I
x e
dx (z) ,
where a () Dirac mass at a.
Example 2 Mixture models Models of mixtures of distributions: X fj with probability pj , for j = 1, 2, . . . , k, with overall density X p1 f1 (x) + + pk fk (x) . For a sample of independent random variables (X1 , , Xn ), sample density
n
{p1 f1 (xi ) + + pk fk (xi )} .

i=1
Expanding this product involves k n elementary terms: prohibitive to compute in large samples.
10
1.2
Likelihood Methods
Maximum Likelihood Methods For an iid sample X1 , . . . , Xn from a population with density f (x|1 , . . . , k ), the likelihood function is L(|x) = L(1 , . . . , k |x1 , . . . , xn ) =
n i=1
f (xi |1 , . . . , k ).
Global justications from asymptotics
11
Example 3 Gamma MLE X1 , , Xn iid observations from gamma density f (x|, ) = where is known. Log likelihood
n
1 x1 ex/ , ()
log L(, |x1 , , xn ) = log

i=1 n
f (xi |, )
1 = log x1 exi / () i i=1 = n log ()

n n
n log + ( 1)
i=1
log xi
i=1
xi /.
12
Solving log L(, |x1 , , xn ) = 0 is straightforward Yields the MLE =

n
xi /(n).
i=1
13
When also unknown, additional equation log L(, |x1 , , xn ) = 0 is particularly nasty! Involve dicult computations (incl. derivative of the gamma function, the digamma function) Explicit solution no longer possible
14
Example 4 Students t distribution Reasonable alternative to normal errors T (p, , ) more robust against possible modeling errors Density of T (p, , ) proportional to 1 (x )2 1+ p 2
(p+1)/2
15
When p known and and both unknown, the likelihood

n p+1 2 n i=1
(xi )2 1+ p 2
may have n local minima, each of which needs to be calculated to determine the global maximum.
16
0.0
-10
0.00001
0.00002
0.00003
0.00004
-5
10
15
20
Multiplicity of modes of the likelihood from C(, 1) when n = 3 and x1 = 0, x2 = 5, x3 = 9.
17
Example 5 Mixtures again For a mixture of two normal distributions, pN (, 2 ) + (1 p)N (, 2 ) , likelihood proportional to
n
p
i=1
xi
+ (1 p)
xi
containing 2n terms. Standard maximization techniques often fail to nd the global maximum because of multimodality of the likelihood function.
18
In the special case f (x|, ) = (1 ) exp{(1/2)x2 }+ with > 0 known exp{(1/2 2 )(x)2 } (1)
Then, whatever n, the likelihood is unbounded:

0
lim ( = x1 , |x1 , . . . , xn ) =
19
Echantillon N(0,1)
0.6 0.0 0.1 0.2 0.3 0.4 0.5
Sample from (1)
20
sigma
0.05
0.10
0.15
0.20
0.25
0.30
0.35
0 mu
Likelihood of (1)
21
1.3
Bayesian Methods
In the Bayesian paradigm, information brought by the data x, realization of X f (x|), combined with prior information specied by prior distribution with density ()
22
Summary in a probability distribution, (|x), called the posterior distribution Derived from the joint distribution f (x|)(), according to (|x) = f (x|)() , f (x|)()d [Bayes Theorem] where m(x) = is the marginal density of X f (x|)()d
23
Example 6 Binomial Bayes Estimator For an observation X from the binomial distribution B(n, p) the (so-called) conjugate prior is the family of beta distributions Be(a, b) The classical Bayes estimator is the posterior mean
(a + b + n) (a + x)(n x + b)
1
p px+a1 (1 p)nx+b1 dp
x+a . a+b+n
24
The curse of conjugate priors The use of conjugate priors for computational reasons implies a restriction on the modeling of the available prior information may be detrimental to the usefulness of the Bayesian approach gives an impression of subjective manipulation of the prior information disconnected from reality.
25
Example 7 Logistic Regression Standard regression model for binary (0 1) responses: distribution of Y {0, 1} modeled by exp(xt ) P (Y = 1) = p = . 1 + exp(xt ) Equivalently, the logit transform of p, logit(p) = log[p/(1 p)] satises logit(p) = xt .
26
Computation of a condence region on quite delicate when (|x) not explicit. In particular, when the condence region involves only one component of a vector parameter, calculation of (|x) requires the integration of the joint distribution over all the other parameters.
27
Example 8 Cauchy condence regions X1 , , Xn an iid sample from the Cauchy distribution C(, ), with prior (, ) = 1 . Condence region on then based on
n
(|x1 , , xn )
0
n1 i=1
1+
xi
2 1
d ,
an integral which cannot be evaluated explicitly. Similar computational problems with likelihood estimation in this model.
28
1.4
Deterministic Numerical Methods
To solve an equation of the form f (x) = 0, the NewtonRaphson algorithm produces a sequence xn : xn+1 = xn f x
1
f (xn )
x=xn
that converges to a solution of f (x) = 0. [Note that f is a matrix x in multidimensional settings.]
29
Optimization of smooth functions F done using the equation F (x) = 0, where F denotes the gradient of F , vector of derivatives of F .
The corresponding techniques are gradient methods, where the sequence xn is xn+1 = xn where
t t
(xn ) F (xn ) ,
F denotes the matrix of second derivatives of F .
30
Numerical computation of an integral

b
I=
a
h(x)dx
can be done by Riemann integration or by improved techniques like the trapezoidal rule 1 I= 2
n1
(xi+1 xi )(h(xi ) + h(xi+1 )) ,

i=1
where the xi s constitute an ordered partition of [a, b], or Simpsons rule = I 3

n n
f (a) + 4
i=1
h(x2i1 ) + 2
i=1
h(x2i ) + f (b)
31
1.5
Simulation versus numerical analysis: when is it useful?
numerical methods do not take into account the probabilistic aspects of the problem, numerical integration often focus on regions of low probability occurrence of local modes of a likelihood often cause more problems for a deterministic gradient method than for simulation methods
32
but simulation methods very rarely take into account the specic analytical form of the functions (For instance, because of the randomness induced by the simulation, a gradient method yields a much faster determination of the mode of a unimodal density) For small dimensions, integration by Riemann sums or by quadrature converges much faster than the mean of a simulated sample. It is thus often reasonable to use a numerical approach when dealing with regular functions in small dimensions
33
When the statistician needs to study the details of a likelihood surface or posterior distribution needs to simultaneously estimate several features of these functions or when the distributions are highly multimodal it is preferable to use a simulation-based approach
34
Fruitless to advocate the superiority of one method over the other More reasonable to justify the use of simulation-based methods by the statistician in terms of his/her expertise. The intuition acquired by a statistician in his/her every-day processing of random models can be directly exploited in the implementation of simulation techniques
35
Random Variable Generation
2.1 Basic methods 2.2 Beyond Uniform Distribution
36
Rely on the possibility of producing (computer-wise) an endless ow of random variables (usually iid) from well-known distributions Given a uniform random number generator, illustration of methods that produce random variables from both standard and nonstandard distributions
37
2.1
Basic Methods
2.1.1
Introduction
For a function F on IR, generalized inverse of F , F , dened by F (u) = inf {x; F (x) u} . Probability Integral Transform: If U U[0,1] , then the random variable F (U ) has the distribution F .
38
Consequence: To generate a random variable X F , suces to generate U U[0,1] and then make the transform x = F (u)
39
2.1.2
Desiderata and Limitations
Any one who considers arithmetical methods of reproducing random digits is, of course, in a state of sin. As has been pointed out several times, there is no such thing as a random number---there are only methods of producing random numbers, and a strict arithmetic procedure of course is not such a method." [John Von Neumann, 1951)]
40
Production of a deterministic sequence of values in [0, 1] which imitates a sequence of iid uniform random variables U[0,1] . Cant use the physical imitation of a random draw [no guarantee of uniformity, no reproducibility] Random sequence in the sense: Having generated (X1 , , Xn ), knowledge of Xn [or of (X1 , , Xn )] imparts no discernible knowledge of the value of Xn+1 .
41
Deterministic: Given the initial value X0 , sample (X1 , , Xn ) always the same Validity of a random number generator based on a single sample X1 , , Xn when n tends to +, not on replications (X11 , , X1n ), (X21 , , X2n ), . . . (Xk1 , , Xkn ) where n xed and k tends to innity.
42
2.1.3
Uniform pseudo-random number generator
Algorithm starting from an initial value u0 and a transformation D, which produces a sequence (ui ) = (Di (u0 )) in [0, 1]. For all n, (u1 , , un ) reproduces the behavior of an iid U[0,1] sample (V1 , , Vn ) when compared through usual tests
43
Validity of the algorithm means that the sequence U1 , , Un leads to accept the hypothesis H : U1 , , Un are iid U[0,1] .
The set of tests used is generally of some consequence KolmogorovSmirnov Time series methods, for correlation between Ui and (Ui1 , , Uik ) nonparametric tests Marsaglias battery of tests called Die Hard (!)
44
2.1.4
The Kiss Generator
A real-life generated random sequence takes values on {0, 1, , M } rather than in [0, 1] [M largest integer accepted by the computer]
45
Period, T0 , of a generator: smallest integer T such that ui+T = ui for every i, A generator of the form Xn+1 = f (Xn ) has a period no greater than M +1
46
Warning! A uniform generator on [0, 1] should not never take the values 0 and 1 [Gentle, 1998]
47
Congruential generator on {0, 1, , M }: dened by the function D(x) = (ax + b) mod (M + 1). Period and other performance of congruential generators depend heavily on (a, b). With a rational, pairs (xn , D(xn )) lie on parallel lines.
48
1.0 0.0 0.2 0.4 0.6 0.8
0.0 0.2 0.4 0.6 0.8 1.0
Representation of the line y = 69069x mod 1 by uniform sampling with sampling step 3 104 .
49
For k k matrix T , with entries in {0, 1}, shift register generator: given by the transformation xn+1 = T xn (mod 2) where xn represented as a vector of binary coordinates eni {0, 1},
k1
xn =
i=0
eni 2i .
50
To generate a sequence of integers X1 , X2 , , the Kiss algorithm generates three sequences of integers First, a congruential generator In+1 = (69069 In + 23606797) (mod 232 ) , Then two shift register generators (Jn ) and (Kn ) Overall sequence Xn+1 = (In+1 + Jn+1 + Kn+1 ) (mod 232 ) The period of Kiss is of order 295 Kiss has been successfully tested on Die Hard
51
2.2
Beyond Uniform Distributions
Generation of any sequence of random variables can be formally implemented through a uniform generator For distributions with explicit forms of F (for instance, exponential, double-exponential or Weibull distributions), the Probability Integral Transform can be implemented. Case specic methods, which rely on properties of the distribution (for instance, normal distribution, Poisson distribution)
52
More general (indirect) methods exist, for example the accept-reject and the ratio-of-uniform methods Simulation of the standard distributions is accomplished quite eciently by many statistical programming packages (for instance, IMSL, Gauss, Mathematica, Matlab/Scilab, Splus/R).
53
2.2.1
Transformation Methods
Case where a distribution F is linked in a simple way to another distribution easy to simulate. Example 9 Exponential variables If U U[0,1] , the random variable X = log U/ has distribution P (X x) = = P ( log U x) P (U ex ) = 1 ex ,
the exponential distribution Exp().
54
Other random variables that can be generated starting from an exponential include
Y = 2
j=1
log(Uj ) 2 2
Y =
j=1
log(Uj ) Ga(a, )
Y =
a j=1 log(Uj ) a+b j=1 log(Uj )
Be(a, b)
55
Points to note Transformation quite simple to use There are more ecient algorithms for gamma and beta random variables Cannot generate gamma random variables with a non-integer shape parameter For instance, cannot get a 2 variable, which would get us a 1 N (0, 1) variable.
56
Example 10 Normal variables If r, polar coordinates of (X1 , X2 ), then,

2 2 r2 = X1 + X2 2 = Exp(1/2) 2
and uniform distribution on [0, 2] Consequence: If U1 , U2 iid U[0,1] , X1 X2 iid N (0, 1). = = 2 log(U1 ) cos(2U2 ) 2 log(U1 ) sin(2U2 )
57
Box-Muller Algorithm: 1. Generate U1 , U2 iid U[0,1] ; 2. Define x1 = x2 = 2 log(u1 ) cos(2u2 ) , 2 log(u1 ) sin(2u2 ) ;
3. Take x1 and x2 as two independent draws from N (0, 1).
58
Unlike algorithms based on the CLT, this algorithm is exact Get two normals for the price of two uniforms Drawback (in speed) in calculating log, cos and sin.
59
Example 11 Poisson generation Poissonexponential connection: If N P() and Xi Exp(), i IN , P (N = k) = P (X1 + + Xk 1 < X1 + + Xk+1 ) .
60
A Poisson can be simulated by generating exponentials until their sum exceeds 1. This method is simple, but is really practical only for smaller values of . On average, the number of exponential variables required is . Other approaches are more suitable for large s.
61
A generator of Poisson random variables can produce negative binomial random variables since, Y Ga(n, (1 p)/p) implies X N eg(n, p) X|y P(y)
62
Mixture representation The representation of the negative binomial is a particular case of a mixture distribution The principle of a mixture representation is to represent a density f as the marginal of another distribution, for example f (x) =
iY
pi fi (x) ,
If the component distributions fi (x) can be easily generated, X can be obtained by rst choosing fi with probability pi and then generating an observation from fi .
63
2.2.2
Accept-Reject Methods
Many distributions from which dicult, or even impossible, to directly simulate. Another class of methods that only require us to know the functional form of the density f of interest only up to a multiplicative constant. The key to this method is to use a simpler (simulation-wise) density g, the instrumental density, from which the simulation from the target density f is actually done.
64
Accept-Reject method Given a density of interest f , nd a density g and a constant M such that f (x) M g(x) on the support of f .
1. Generate X g, U U[0,1] ; 2. Accept Y = X if U f (X)/M g(X) ; 3. Return to 1. otherwise.
65
Validation of the Accept-Reject method This algorithm produces a variable Y distributed according to f
66
Uniform repartition under the graph of f of accepted points
67
Two interesting properties: First, it provides a generic method to simulate from any density f that is known up to a multiplicative factor Property particularly important in Bayesian calculations: there, the posterior distribution (|x) () f (x|) . is specied up to a normalizing constant Second, the probability of acceptance in the algorithm is 1/M , e.g., expected number of trials until a variable is accepted is M
68
Some intuition In cases f and g both probability densities, the constant M is necessarily larger that 1. The size of M , and thus the eciency of the algorithm, functions of how closely g can imitate f , especially in the tails For f /g to remain bounded, necessary for g to have tails thicker than those of f . It is therefore impossible to use the A-R algorithm to simulate a Cauchy distribution f using a normal distribution g, however the reverse works quite well.
69
Example 12 Normal from a Cauchy 1 f (x) = exp(x2 /2) 2 1 1 , 2 1+x densities of the normal and Cauchy distributions. g(x) = f (x) = g(x) attained at x = 1.
2 (1 + x2 ) ex /2 2
and
2 = 1.52 e
70
So probability of acceptance 1/1.52 = 0.66, and, on the average, one out of every three simulated Cauchy variables is rejected. Mean number of trials to success 1.52.
71
Example 13 Normal/Double Exponential Generate a N (0, 1) by using a double-exponential distribution with density g(x|) = (/2) exp(|x|) f (x) g(x|) 2 1 2 /2 e
and minimum of this bound (in ) attained for = 1. Probability of acceptance /2e = .76: To produce one normal random variable, this Accept-Reject algorithm requires on the average 1/.76 1.3 uniform variables. Compare with the xed single uniform required by the Box-Muller algorithm.
72
Example 14 Gamma with non-integer shape parameter Illustrates a real advantage of the Accept-Reject algorithm The gamma distribution Ga(, ) represented as the sum of exponential random variables, only if is an integer
73
Can use the Accept-Reject algorithm with instrumental distribution Ga(a, b), with a = [], (Without loss of generality, = 1.) Up to a normalizing constant, f /gb = ba xa exp{(1 b)x} ba for b 1. The maximum is attained at b = a/. a (1 b)e
a
0.
74
Example 15 Truncated Normal distributions Truncated Normals appear in many contexts Constraints x produce densities proportional to e
(x)2 /2 2
I x I
for a bound large compared with Alternatives far superior to the na method of generating a ve N (, 2 ) until exceeding , which requires an average number of 1/(( )/) simulations from N (, 2 ) for one acceptance.
75
Instrumental distribution: translated exponential distribution, Exp(, ), with density g (z) = e(z) I z . I The ratio f /g is bounded by f /g 1/ exp(2 /2 ) 1/ exp(2 /2) if > , otherwise.
76
Monte Carlo Integration
3.1 3.2 3.3 3.4
Introduction Classical Monte Carlo Integration Importance Sampling Acceleration Methods
77
3.1
Introduction
Two major classes of numerical problems that arise in statistical inference optimization - generally associated with the likelihood approach integration- generally associated with the Bayesian approach
78
Example 16 Bayes median Bayes estimators are not always posterior expectations, but rather solutions of the minimization problem min
2
L(, ) () f (x|) d . the Bayes estimator is the posterior
For quadratic loss mean.
79
For absolute error loss L(, ) = | |, the Bayes estimator is the posterior median, solution to the equation () f (x|) d =

() f (x|) d
which can be quite complicated.
80
3.2
Classical Monte Carlo integration
Generic problem of evaluating the integral IEf [h(X)] =

X
h(x) f (x) dx
where X is uni- or multidimensional, f is a closed form, partly closed form, or implicit density, and h is a function
81
First use a sample (X1 , . . . , Xm ) from the density f to approximate the integral by the empirical average hm Average hm IEf [h(X)] by the Strong Law of Large Numbers 1 = m
m
h(xj ) ,
j=1
82
Estimate the variance with vm and for m large, hm IEf [h(X)] N (0, 1). vm Note: This can lead to the construction of a convergence test and of condence bounds on the approximation of IEf [h(X)]. 1 1 = mm1
m
[h(xj ) hm ]2 ,
j=1
83
Example 17 Cauchy prior For estimating a normal mean, a robust prior is a Cauchy prior X N (, 1), C(0, 1).
Under squared error loss, posterior mean
(x) =
(x)2 /2 e d 1 + 2 1 (x)2 /2 e d 1 + 2
84
Form of suggests simulating iid variables 1 , , m N (x, 1) and calculate i m i=1 2 1 + i m (x) = . 1 m i=1 2 1 + i The Law of Large Numbers implies m (x) (x) as m .
85
Example 18 Normal cdf Approximation of the normal cdf

t
(t) =
2 1 ey /2 dy 2
by the Monte Carlo method: 1 (t) = n

n
I xi t , I
i=1
based on generating a sample of size n, (x1 , . . . , xn ) using the Box-Muller algorithm.
86
Exact variance (t)(1 (t))/n, as the variables I xi t iid I Bernoulli((t)). For values of t around t = 0 the variance approximately 1/4n: to achieve a precision of four decimals the approximation requires on average n = 2 104 simulations, that is, 200 million iterations. Greater accuracy is achieved in the tails.
87
3.3
Importance Sampling
Simulation from f (the true density) is not necessarily optimal Alternative to direct sampling from f is importance sampling, based on alternative representation IEf [h(X)] =
X
h(x)
f (x) g(x)
g(x) dx .
which allows us to use other distributions than f
88
Evaluation of IEf [h(X)] =

X
h(x) f (x) dx
by 1. Generate a sample X1 , . . . , Xn from a distribution g 2. Use the approximation 1 m

m j=1
f (Xj ) h(Xj ) g(Xj )
89
The estimator 1 m
m j=1
f (Xj ) h(Xj ) g(Xj )
h(x) f (x) dx
X
90
same reason the regular Monte Carlo estimator hm converges; converges for any choice of the distribution g [as long as supp(g) supp(f )]. Instrumental distribution g chosen from distributions easy to simulate. The same sample (generated from g) can be used repeatedly, not only for dierent functions h, but also for dierent densities f .
91
Although g can be any density, some choices are better than others: Finite variance only when IEf f (X) h (X) = g(X)
2
f (X) h (x) dx < . g(X)

2
Instrumental distributions with tails lighter than those of f (that is, with sup f /g = ) not appropriate. If sup f /g = , the weights f (xj )/g(xj ) vary widely, giving too much importance to a few values xj .
92
The choice of g that minimizes the variance of the importance sampling estimator is g (x) =
Z
|h(x)| f (x) . |h(z)| f (z) dz
Rather formal optimality result since optimal choice of g (x) requires the knowledge of h(x)f (x)dx, the integral of interest!
93
Practical alternative
m j=1
h(Xj ) f (Xj )/g(Xj )

m j=1
f (Xj )/g(Xj )
where f and g are known up to constants. Also converges to Numbers. h(x)f (x)dx by the Strong Law of Large
Biased, but the bias is quite small In some settings beats the unbiased estimator in squared error loss.
94
Example 19 Students t distribution X T (, , 2 ), with density (( + 1)/2) f (x) = (/2) (x )2 1+ 2

(+1)/2
Without loss of generality, take = 0, = 1. Calculate the integral

2.1
x5 f (x)dx.
95
Simulation possibilities 2 Directly from f , since f = N (0,1)
Importance sampling using Cauchy C(0, 1) Importance sampling using a normal (expected to be nonoptimal). Importance sampling using a U([0, 1/2.1])
96
Simulation results: Uniform is best Cauchy is OK f and Normal are rotten

7.0 5.0
0
5.5
6.0
6.5
10000
20000
30000
40000
50000
97
sampling from f (solid lines), importance sampling with: Cauchy instrumental (short dashes), U([0, 1/2.1]) instrumental (long dashes) and normal instrumental (dots).
98
3.4
Acceleration Methods
(a) Use correlation to reduce variance Specialized but ecient if applicable With two samples (X1 , . . . , Xm ) and (Y1 , . . . , Ym ) from f to estimate I=
I R
h(x)f (x)dx .
Denote 1 = 1 I m
m
h(Xi )
i=1
and
2 = 1 I m
h(Yi )
i=1
with each having mean I and variance 2
99
The variance of the average is var I1 + I2 2 2 1 = + Cov(I1 , I2 ). 2 2
So negatively correlated samples better than independent samples
100
(b) Antithetic Variables - constructing negatively correlated variables If f symmetric around , take Yi = 2 Xi If Xi = F 1 (Ui ), take Yi = F 1 (1 Ui )
101
(c) Control Variates - another strategy Suppose I = h(x)f (x)dx is the desired integral and I0 = h0 (x)f (x)dx is known Estimate I with I and I0 with I0 , and construct the combined estimator I = I + (I0 I0 ) I is unbiased for I and var(I ) = var(I) + 2 var(I) + 2Cov(I, I0 )
102
For the optimal choice cov(I, I0 ) = , 0 ) var(I
we have var(I ) = (1 2 ) var(I) , where is the correlation between I and I0
103
Example 20 Control variate integration Evaluate P (X > a) = Xi s iid f .

a
f (x)dx and start with

1 2
1 n
n i=1
I i > a), I(X
Suppose we know P (X > ) = Control variate estimator 1 n

n
I i > a) + I(X
i=1
1 n
I i > ) P (X > ) . I(X

i=1
improves if <0 and || < 2 cov(1 , 3 ) P (X > a) =2 . var(3 ) P (X > )
104
(d) Another method - Conditional Expectations Use the conditioning inequality var(IE[(X)|Y]) var((X)) sometimes called Rao-Blackwellization If I is an estimator of I = IEf [h(X)], based on X simulated from the joint distribution f (x, y), such that f (x, y)dy = f (x), the estimator I = IEf [I|y1 , . . . , yn ] dominates I(x1 , . . . , xn ) in terms of variance
105
Example 21 Students t expectation Compute IE[h(x) = exp(x2 )] when X T (, 0, 2 )
Students t distribution can be simulated as X|y N (, 2 y) and Y 1 2 .
106
Therefore, the empirical average

m 2 exp(Xj ) , j=1
can be improved upon using the sample ((X1 , Y1 ), . . . , (Xm , Ym )), since m m 1 1 1 IE[exp(X 2 )|Yj ] = m j=1 m j=1 2 2 Yj + 1 is the conditional expectation. The conditional expectation can have ten times greater precision.
107
0.50
0
0.52
0.54
0.56
0.58
0.60
2000
4000
6000
8000
10000
Estimators of IE[exp(X 2 )]: simple average (solid lines) and conditional expectation (dots) for (, , ) = (4.6, 0, 1).
108
Markov Chains
4.1 4.2 4.3 4.4 4.5 4.6
Basic Notions Irreducibility Transience/Recurrence Invariant Measures Ergodicity and stationarity Limit Theorems
109
Use of Markov chains Many algorithms can be described as Markov chains Needed properties The quantity of interest is what the chain converges to We need to know When will chains converge What do they converge to
110
4.1
Basic notions
A Markov chain is a sequence of random variables that can be thought of as evolving over time. Probability of a transition depends on the particular set that the chain is in Chain dened through its transition kernel A transition kernel is a function K dened on X B(X ) such that (i). x X , K(x, ) is a probability measure; (ii). A B(X ), K(, A) is measurable.
111
When X is discrete, the transition kernel simply is a (transition) matrix K with elements Pxy = P (Xn = y|Xn1 = x) , x, y X .
In the continuous case, the kernel also denotes the conditional density K(x, x ) of the transition K(x, ) P (X A|x) =
A
K(x, x )dx .
112
Given a transition kernel K, a sequence X0 , X1 , . . . , Xn , . . . of random variables is a Markov chain denoted by (Xn ), if, for any t, the conditional distribution of Xt given xt1 , xt2 , . . . , x0 is the same as the distribution of Xt given xt1 . That is, P (Xk+1 A|x0 , x1 , x2 , . . . , xk ) = P (Xk+1 A|xk ) =
A
K(xk , dx)
113
Example 22 AR(1) Models Simple illustration of Markov chains on continuous state space Xn = Xn1 + n , with n N (0, 2 ) If the n s are independent, Xn independent from Xn2 , Xn3 , . . . conditionally on Xn1 . IR,
114
Note that the entire structure of the chain only depends on The transition function K The initial state x0 or initial distribution X0
115
4.2
Irreducibility
Irreducibility is one measure of the sensitivity of the Markov chain to initial conditions It leads to a guarantee of convergence In the discrete case, the chain is irreducible if all states communicate, namely if Px (y < ) > 0 , y being the rst time y is visited x, y X ,
116
In the continuous case, the chain is -irreducible for some measure if for some n, K n (x, A) > 0 for all x X for every A B(X ) with (A) > 0
117
Example 23 AR(1) again Xn+1 = Xn + n+1 with n iid normal variables, The chain is irreducible The reference measure is Lebesgue measure In fact, K(x, A) > 0 for every x IR and every A such that (A) > 0.
118
If n is uniform on [1, 1] and || > 1, Xn+1 Xn ( 1)Xn 1 0 for Xn 1/( 1), the chain is increasing and cannot visit previous values.
119
4.2.1
Cycles and Aperiodicity
Sometimes deterministic constraints on the moves from Xn to Xn+1 . In the discrete case, the period of a state X is d() = g.c.d. {m 1; K m (, ) > 0} , where g.c.d. is the greatest common denominator.
120
For an irreducible chain on a nite space X , the transition matrix is a block matrix 0 D1 0 0 0 0 D2 0 P = , .. . Dd 0 0 0 where the blocks Di are stochastic matrices. From block 1 you must go to block 2, from 2 to 3, etc. You return to the initial group every d-th step
121
If the chain is irreducible (so all states communicate) only one value for the period. An irreducible chain is aperiodic if it has period 1. If one state x X satises Pxx > 0, the chain (Xn ) is aperiodic, although this is not a necessary condition.
122
For continuous chains, similar denition: If the transition kernel has density f (|xn ), sucient condition for aperiodicity is that f (|xn ) is positive in a neighborhood of xn (since the chain can then remain in this neighborhood for an arbitrary number of instants before visiting any set A). For instance, in the AR(1)Example, (Xn ) is aperiodic when n is distributed according to U[1,1] and || < 1
123
4.3
Transience and Recurrence
Irreducibility ensures that every set A will be visited by the Markov chain (Xn ) This property is too weak to ensure that the trajectory of (Xn ) will enter A often enough. A Markov chain must enjoy good stability properties to guarantee an acceptable approximation of the simulated model. Formalizing this stability leads to dierent notions of recurrence For discrete chains, the recurrence of a state equivalent to probability one of sure return. Always satised for irreducible chains on nite spaces
124
In a nite state space X , denote the average number of visits to a state by
=
i=1
I (Xi ) I
If IE [ ] = the state is recurrent If IE [ ] < the state is transient For irreducible chains, recurrence/transience property of the chain, not of a particular state Similar denitions for the continuous case.
125
Stronger form of recurrence: Harris recurrence A set A is Harris recurrent if Px (A = ) = 1 for all x A. The chain (Xn ) is Harris recurrent if it is -irreducible for every set A with (A) > 0, A is Harris recurrent. Note that Px (A = ) = 1 implies IEx [A ] =
126
4.4
Invariant Measures
Stability increases for the chain (Xn ) if marginal distribution of Xn independent of n Requires the existence of a probability distribution such that Xn+1 if Xn
A measure is invariant for the transition kernel K(, ) if (B) =

X
K(x, B) (dx) ,
B B(X ) .
127
The chain is positive recurrent if is a probability measure. Otherwise it is null recurrent If probability measure also called stationary distribution since X0 implies that Xn for every n The stationary distribution is unique
128
Example 24 Back to AR(1) For the AR(1) model Xn = Xn1 + n , with n N (0, 2 ), the transition kernel is N (xn1 , 2 ) and N (, 2 ) is stationary only if = and 2 = 2 2 + 2 . IR,
129
These conditions imply that = 0, and hence || < 1. N (0, 2 /(1 2 )) is the unique stationary distribution 2 = 2 /(1 2 ),
130
4.5
Ergodicity and convergence
We nally consider: to what is the chain converging? The invariant distribution natural candidate for the limiting distribution A fundamental property is ergodicity, or independence of initial conditions. In the discrete case, a state is ergodic if
n
lim |K n (, ) ()| = 0 .
131
In general , we establish convergence using the total variation norm 1 2 and we want K n (x, )(dx)
TV TV
= sup |1 (A) 2 (A)|.

A
= sup
A
K n (x, A)(dx) (A)
to be small.
132
If (Xn ) Harris positive recurrent and aperiodic, then

n
lim
K n (x, )(dx)
TV
=0
for every initial distribution . We thus take Harris positive recurrent and aperiodic as equivalent to ergodic Convergence in total variation implies
n
lim |IE [h(Xn )] IE [h(X)]| = 0
for every bounded function h.
133
There are dierence speeds of convergence ergodic (fast) geometrically ergodic (faster) uniformly ergodic (fastest)
134
4.6
Limit theorems
Ergodicity determines the probabilistic properties of average behavior of the chain. But also need of statistical inference, made by induction from the observed sample.
n If Px close to 0, no direct information about n Xn Px
We need LLNs and CLTs!!!
135
Classical LLNs and CLTs not directly applicable due to: Markovian dependence structure between the observations Xi Non-stationarity of the sequence
136
Ergodic Theorem LLN If the Markov chain (Xn ) is Harris recurrent, then for any function h with E|h| < , 1 lim h(Xi ) = n n h(x)d(x),
137
To get a CLT, we need more assumptions. For MCMC, the easiest is reversibility: A Markov chain (Xn ) is reversible if for all n Xn+1 |Xn+2 Xn+1 |Xn . The direction of time does not matter
138
If the Markov chain (Xn ) is Harris recurrent and reversible, 1 N where 0<
2 h N
(h(Xn ) IE [h])
n=1
2 N (0, h ) .
= IE [h (X0 )]
+2
k=1
IE [h(X0 )h(Xk )] < +.
139
Monte Carlo Optimization
5.1 Introduction 5.2 Stochastic Exploration 5.3 Stochastic Approximation
140
5.1
Introduction
Dierences between the numerical approach and the simulation approach to the problem max h()
lie in the treatment of the function h. Using deterministic numerical methods, the analytical properties of the target function (e.g., convexity, boundedness, smoothness) are often paramount. For the simulation approach, we are more concerned with h from a probabilistic (rather than analytical) point of view.
141
Distinguish between two approaches to Monte Carlo optimization: 1. Exploratory approach Goal: to optimize h by describing its entire range Actual properties of h play a lesser role here 2. Probabilistic approximation of h Monte Carlo exploits probabilistic properties of h approach mostly tied to missing data methods
142
5.2
Stochastic Exploration
When is bounded, (which may always be achieved by reparameterization) 1. Simulate from a uniform distribution on , u1 , . . . , um U
2. Use the approximation h = max(h(u1 ), . . . , h(um )). m
143
A more probabilistic approach If h is positive with h() d < +
nding max h()
is equivalent to nding the modes of h
144
Extension: transform h to H that satises: (i). H 0 and H < .
(ii). h and H have the same maxima Examples: H() = exp(h()/T ) or H() = exp{h()/T }/(1 + exp{h()/T })
145
Example 25 Minimization Consider minimizing h(x, y) = + (x sin(20y) + y sin(20x))2 cosh(sin(10x)x) (x cos(10y) y sin(10x))2 cosh(cos(20y)y) ,
with global minimum 0 at (x, y) = (0, 0).
146
0
1
Z 3
0.5
0.5
0 Y
0 X
-0.5
-0.5
-1
-1
Grid representation h(x, y) on [1, 1]2
147
Properties: Many local minima. Standard methods may not nd the global minimum We can simulate from exp(h(x, y)). Get the minimum from the resulting h(xi , yi )s.
148
5.2.1
Stochastic gradient
Deterministic numerical minimizer that produces a sequence (j ) = arg max h()
if the domain IRd the function h are concave
149
The sequence (j ) is constructed by j+1 = j + j h(j ) , j > 0 , [Gradient sequence] where h is the gradient of h
(j ) is chosen to aid convergence.
150
Stochastic variant Stochastic perturbation: With a second sequence (j ), dene j+1 = j + where j are uniform on the unit sphere |||| = 1 h(x, y) = h(x + y) h(x y) 2||y|| h(x) j h(j , j j ) j 2j
151
This method does not always go along the steepest slope This can be a plus, as it may avoid being trapped in local maxima or saddlepoints
152
Example 26 More minimization Use the stochastic gradient method with our test function h, with dierent sequences of j s and j s, and dierent convergence patterns.
153
Choices of (j ) Case 1 - j = 1/10j : poor evaluation of the minimum and in big jumps in the rst iterations. Case 2 - j = 1/100j : converges to the closest local minima Case 3 - j = 1/10 log(1 + j) : a slower decrease in (j ) tends to achieve better minima.
154
Results of three dierent stochastic gradient runs j j T h(T ) mint h(t ) Iteration 1/10j 1/10j (0.166, 1.02) 1.287 0.115 50 1/100j 1/100j (0.629, 0.786) 0.00013 0.00013 93 1/10 log(1 + j) 1/j (0.0004, 0.245) 4.24 106 2.163 107 58
The iteration T is obtained by the stopping rule ||T T 1 || < 105
155
0.8
0.9
1.0
1.1
-0.2
0.0
0.2
0.4
0.6
(1) j = j = 1/10j
Stochastic gradient paths for starting point (0.65, 0.8) (darker shades mean higher elevations)
156
0.785
0.790
0.795
0.800
0.805
0.630
0.635
0.640
0.645
0.650
(2) j = j = 1/100j
157
0.2
-0.2
0.4
0.6
0.8
0.0
0.2
0.4
0.6
(3) j = 1/10 log(1 + j), j = 1/j
158
5.2.2
Simulated Annealing
Name borrowed from Metallurgy: A metal manufactured by a slow decrease of temperature (annealing) is stronger than a metal manufactured by a fast decrease of temperature. Fundamental idea: A change of scale, called temperature, allows easier and faster exploration of function h Rescaling partially avoids being trapped in local maxima.
159
Idea Given a temperature T > 0, generate

T T 1 , 2 , . . . () exp(h()/T ) T and approximate the maximum of h based on the sequence i .
As T 0, the values simulated concentrate in a narrower and narrower neighborhood of the local maxima of h
160
Metropolis algorithm
Starting from 0 , 1. uniform in a neighborhood of 0 2. the new value of is generated by: 1 = 0 with probability = exp(h/T ) 1 with probability 1 ,
where h = h() h(0 ).
161
Features if h() h(0 ), is accepted with probability 1 if h() < h(0 ), may still be accepted with probability = 0
162
Features (contd.) If 0 is a local maximum of h, the algorithm escapes from 0 with a probability that depends on T Usually, the simulated annealing algorithm modies the temperature T at each iteration/on-line (heterogeneous Markov chain)
163
Algorithm 27 Simulated Annealing
1. Simulate g(| i |) [instrumental distribution] 2. Accept i+1 = with probability i = exp{hi /Ti } 1; take i+1 = i otherwise. 3. Update Ti to Ti+1 .
164
Convergence Under mild assumptions on (Ti ), this algorithm is guaranteed to nd the global maximum
165
Example 28 More minimization Apply the simulated annealing algorithm to minimize h, with g uniform on [0.1, 0.1], and dierent sequences (Ti ). The results change with the sequences Ti
166
Results of simulated annealing runs for dierent values of Ti and starting point (0.5, 0.4).
Case Ti T h(T ) mint h(t ) Accept. rate 1 1/10i (1.94, 0.480) 0.198 4.02 107 0.9998 2 1/ log(1 + i) (1.99, 0.133) 3.408 3.823 107 0.96 3 100/ log(1 + i) (0.575, 0.430) .0017 4.708 109 0.6888
167
-1.0
-0.5
0.0
0.5
1.0
1.5
-2
-1
(1) Ti = 1/10 i
Simulated annealing sequence of 5000 points for the rst choice of the temperature Ti and starting point (0.5, 0.4)
168
-1.0
-0.5
0.0
0.5
1.0
1.5
-2
-1
(2) Ti = 1/ log(i + 1)
169
-0.5
0.0
0.5
1.0
-1.0
-0.5
0.0
0.5
(3) Ti = 100/ log(i + 1)
170
5.2.3
Recursive integration
Also called Prior feedback : a very statistical approach Based on the result that if there exists a unique solution satisfying = arg max h() ,
then
lim
eh() d = , eh() d
provided h is continuous at .
171
Observations with a log likelihood function (|x). The MLE can be obtained as e (|x) ()d lim = . e (|x) ()d where is a positive density The Bayes estimator associated with the prior distribution is
(x)
e (|x) ()d e (|x) ()d
So we have
lim (x) =
172
Example 29 Gamma shape Estimation of , from (|x) x(1) ex () . IE[|x, ] can be obtained by simulation
Sequence of Bayes estimators of for the estimation of when X G(, 1) and x = 1.5
5 2.02
10 2.04
100 1000 1.89 1.98
5000 1.94
104 2.00
173
5.3
Stochastic Approximation
Methods that work directly with the objective function and are less concerned with fast explorations of the space. Approximations of the objective function that could result in an additional level of error. Many of these approximation methods only work in missing data models, where the likelihood g(x|) can be expressed as g(x|) =
Z
f (x, z|) dz
174
Example 30 Censored data likelihood Observe Y1 , . . ., Yn , iid, from f (y ) Order the observations so that y = (y1 , , ym ) are uncensored and (ym+1 , . . . , yn ) are censored (and equal to a) The likelihood function is
m
L(|y) =
i=1
f (yi ) [1 F (a )]
nm
where F is the cdf associated with f .
175
If we had observed the last n m values, say z = (zm+1 , . . . , zn ), with zi > a (i = m + 1, . . . , n), we could have constructed the (complete data) likelihood
m n
Lc (|y, z) =
i=1
f (yi )
i=m+1
f (zi ),
which is easier to work with.
176
5.3.1
Monte Carlo Approximation h(x) = IE[H(x, Z)]
can be approximated by 1 h(x) = m Problems : h(x) needs to be evaluated at many points, thus involves generation of many samples of Zi s of size m. The sample changes with every value of x: the resulting sequence of evaluations of h usually not smooth enough
m
H(x, zi ) Zi f (z|x) .
i=1
177
Importance sampling solution Use a single sample of Zi g and optimize m (x) = 1 h m

m i=1
f (zi |x) H(x, zi ) , g(zi )
This evaluation of h does not depend [so much] on x
178
Features hm is a sum, thus with possibly less analytical properties than the original h. For example, no smoothing eect of the integral on the integrand H(x, z) The choice of g is very inuential in obtaining a good approximation of the function h(x) The number of points zi used in the approximation should vary with x to achieve the same precision in the approximation of h(x), but it is usually impossible to assess in advance. When g(z) = f (z|x0 ), Geyers (1996) recursive process in which x0 is updated by the solution of the last optimization at each step.
179
Algorithm 31 Monte Carlo Maximization At step i 1.. Generate z1 , . . . , zm f (z|xi ) and compute hgi with gi = f (|xi ). 2.. Find x = arg max hgi (x). 3.. Update xi to xi+1 = x . Repeat until xi = xi+1 .
180
Example 32 MLE MLEs in exponential families; h(x|) = c()ex(x) = c()h(x|) . normalizing constant c() may be unknown or dicult to compute (gamma or beta distributions, for example) Since h(x|)dx = 1 c()
maximization of h(x|) in is equivalent to maximizing h(x|) = log log h(x|) h(x|) log IE h(x|) h(X|) , h(X|)
181
Accomplished by maximizing the approximation h(x|) 1 log log m h(x|)

m i=1
h(xi |) , i |) h(x
where the Xi s are generated from h(x|)
182
In more general missing-data models, the likelihood function is L(|x) = Thus f (x, z|)dz
L(|x) f (x, Z|) = IE x L(|x) f (x, Z|)

m i=1
Z f (x, z|)
can be maximized through approximations like 1 h () = m f (x, zi |) f (x, zi |) z1 , . . . , zm f (z|x, )
183
Example 33 ARCH Models Gaussian ARCH (Auto Regressive Conditionally Heteroscedastic) model, for t = 2, . . . , T , Z = (1 + Z 2 )1/2 , N (0, 1), t t t t1 Xt = aZt + t , t Np (0, 2 Ip ).
184
The approximation of the likelihood ratio is then based on the simulation of the missing data Z T = (Z1 , . . . , ZT ) from f (z T xT , ) f (z T , xT )
T
2T exp
t=1 T
||xt azt ||2 /2 2
ez1 /2
e , 2 )1/2 (1 + zt1 t=2

m i=1
2 2 zt /2(1+zt1 )
The likelihood approximation is given by 1 m

T f (zi , xT ) . T , xT ) f (zi
185
5.3.2
The EM Algorithm
Introduced by Dempster, Laird and Rubin (1977) Takes advantage of the representation g(x|) =
Z
f (x, z|) dz
and solves a sequence of easier maximization problems whose limit is the answer to the original problem.
186
Note EM algorithm relates to MCMC algorithms in the sense that it can be seen as a forerunner of the Gibbs sampler in its Data Augmentation version, replacing simulation by maximization.
187
Suppose that we observe X1 , . . . , Xn , iid from g(x|) and want to compute = arg max L(|x) =
n
g(xi |).
i=1
We augment the data with z, where X, Z f (x, z|) and note the identity f (x, z|) k(z|, x) = , g(x|) where k(z|, x) is the conditional distribution of the missing data Z given the observed data x.
188
Principle This identity leads to the following relationship between the complete-data likelihood Lc (|x, z) = f (x, z|) and the observed data likelihood L(|x). For any value 0 , log L(|x) = IE0 [log Lc (|x, z)|0 , x] IE0 [log k(z|, x)|0 , x], where the expectation is with respect to k(z|0 , x).
189
Properties 1. The strength of the EM algorithm is that we only have to deal with the rst term on the right side above, as the other term can be ignored. 2. The observed likelihood is increased at every iteration : this is a convergence guarantee
190
Preparation Denote the expected log-likelihood by Q(|0 , x) = IE0 [log Lc (|x, z)|0 , x].
A sequence of estimators (j) , j = 1, 2, . . ., is obtained iteratively by Q((j) |(j1) , x) = max Q(|(j1) , x).
191
Algorithm 34 The EM Algorithm Iterate 1. (the E-step) Compute Q(|(m) , x) = IE(m) [log Lc (|x, z)|x] , 2. (the M-step) Maximize Q(|(m) , x) in and take (m+1) = arg max Q(|(m) , x).
until a xed point Q is obtained
192
Example 35 Censored data If f (x ) is the N (, 1) density, the censored data likelihood is 1 1 L(|x) = exp 2 (2)m/2
m
(xi )2
i=1
[1 (a )]
nm
and the complete-data log-likelihood is 1 c log L (|x, z) 2 1 2 (xi ) (zi )2 , 2 i=m+1 i=1
m n
where the zi s are observations from the truncated Normal distribution exp{ 1 (z )2 } (z ) 2 k(z|, x) = = , 1 (a ) 2[1 (a )] a < z.
193
At the jth step in the EM sequence, we have 1 Q(|(j) , x) 2

m
(xi )2
i=1 n a
1 2 i=m+1
(zi )2 k(z|(j) , x) dzi ,
194
Dierentiating with respect to yields (j+1) where IE[Z|(j) ] =

a
m + (n m)IE[Z|(j) ] x = , n
(a (j) ) zk(z|(j) , x) dz = (j) + . (j) ) 1 (a
Thus, the EM sequence is dened by (j+1) (a (j) ) m nm = x+ (j) + , (j)) n n 1 (a
which converges to the MLE .
195
5.3.3
MCEM
A (potential) diculty with the EM algorithm is the computation of Q(|0 , x) To overcome this diculty, use 1 Q(|0 , x) = m where Z1 , . . . , Zm k(z|x, ). When m , this quantity converges to Q(|0 , x).
m
log Lc (|x, z) ,
i=1
196
The Metropolis-Hastings Algorithm
6.1 6.2 6.3 6.4
Markov Chain Monte Carlo The Metropolis-Hastings Algorithm Examples Extensions
197
6.1
Monte Carlo Methods based on Markov Chains
Unnecessary to use a sample from the distribution f to approximate the integral h(x)f (x)dx ,
Now we obtain X1 , . . . , Xn f (approx) without directly simulating from f , using an ergodic Markov chain with stationary distribution f
198
Idea For an arbitrary starting value x(0) , an ergodic chain (X (t) ) is generated using a transition kernel with stationary distribution f Insures the convergence in distribution of (X (t) ) to a random variable from f . For a large enough T0 , X (T0 ) can be considered as distributed from f Produce a dependent sample X (T0 ) , X (T0 +1) , . . ., which is generated from f , sucient for most approximation purposes.
199
6.2
The MetropolisHastings algorithm
6.2.1
Basics
The algorithm starts with the objective (target) density f A conditional density q(y|x) called the instrumental (or proposal) distribution, is then chosen.
200
Algorithm 36 MetropolisHastings Given x(t) , 1. Generate Yt q(y|x(t) ). 2. Take X (t+1) = where (x, y) = min Yt x(t) with prob. (x(t) , Yt ), with prob. 1 (x(t) , Yt ), f (y) q(x|y) ,1 f (x) q(y|x) .
201
Features Always accept upwards moves Independent of normalizing constants for both f and q(|x) (constants independent of x) Never move to values with f (y) = 0 The chain (x(t) )t may take the same value several times in a row, even though f is a density wrt Lebesgue measure The sequence (yt )t is usually not a Markov chain
202
6.2.2
Convergence properties
1. The M-H Markov chain is reversible, with invariant/stationary density f since it satises the detailed balance condition f (y) K(y, x) = f (x) K(x, y) 2. As f is a probability measure, the chain is positive recurrent 3. If f (Yt ) q(X (t) |Yt ) P 1 < 1. f (X (t) ) q(Yt |X (t) )
(1)
that is, the event {X (t+1) = X (t) } is possible, then the chain is aperiodic
203
4. If q(y|x) > 0 for every (x, y), the chain is irreducible 5. For M-H, f -irreducibility implies Harris recurrence 6. Thus, for M-H satisfying (1) and (2) (a) For h, with IEf |h(X)| < , 1 lim T T (b) and
n T
(2)
h(X (t) ) =
t=1
h(x)df (x)
a.e. f.
lim
K n (x, )(dx) f
TV
=0
for every initial distribution , where K n (x, ) denotes the kernel for n transitions.
204
6.3
A Collection of Metropolis-Hastings Algorithms
6.3.1
The Independent Case
The instrumental distribution q is independent of X (t) , and is denoted g by analogy with Accept-Reject.
205
Algorithm 37 Independent Metropolis-Hastings Given x(t) , 1. Generate Yt g(y) 2. Take X (t+1) =
Yt x(t)
with prob. min otherwise.
f (Yt ) g(x(t) ) ,1 , (t) ) g(Y ) f (x t
206
The resulting sample is not iid There can be strong convergence properties: The algorithm produces a uniformly ergodic chain if there exists a constant M such that f (x) M g(x) , In this case, K n (x, ) f
TV
x supp f.
1 1 M
.
1 M.
and the expected acceptance probability is at least
207
Example 38 Generating gamma variables Generate the Ga(, ) distribution using a gamma Ga([], b = []/) candidate Algorithm 39 Gamma accept-reject 1. Generate Y Ga([], []/) 2. Accept X = Y with prob. e y exp(y/)
[]
208
and Algorithm 40 Gamma Metropolis-Hastings 1. Generate Yt Ga([], []/) 2. Take X (t+1) =
Yt x(t)
with prob. otherwise.
Yt exp x(t)
x(t) Yt
[]
209
Comparison Close agreement in M-H and A-R, with a slight edge to M-H.
9.5 8.0
0
8.5
9.0
(5000 iterations)
Accept-reject (solid line) vs. MetropolisHastings (dotted line) estimators of IEf [X 2 ] = 8.33, for = 2.43 based on Ga(2, 2/2.43)
1000
2000
3000
4000
210
6.3.2
Random walk MetropolisHastings
Use the proposal Yt = X (t) + t , where t g, independent of X (t) . The instrumental density is now of the form g(y x) and the Markov chain is a random walk if we take g to be symmetric
211
Algorithm 41 Random walk Metropolis Given x(t) 1. Generate Yt g(y x(t) ) 2. Take X (t+1) =
Yt x(t)
f (Yt ) with prob. min 1, , f (x(t) ) otherwise.
212
Example 42 Random walk normal Generate N (0, 1) based on the uniform proposal [, ] [Hastings (1970)] The probability of acceptance is then (x , yt ) = exp{(x
(t) (t)2 2 yt )/2} 1.
213
Sample statistics 0.1 mean 0.399 variance 0.698 0.5 1.0 0.111 0.10 1.11 1.06
As , we get better histograms and a faster exploration of the support of f .
214
250
0.5
400
0.5
400
200
0.0
300
0.0
300
150
-0.5
200
-0.5
200
100
-1.0
100
-1.0
100
50
-1.5
-1.5
-1
0 (a)
-2
0 (b)
-3
-2
-1
0 (c)
Three samples based on U[, ] with (a) = 0.1, (b) = 0.5 and (c) = 1.0, superimposed with the convergence of the means (15, 000 simulations).
-1.5
-1.0
-0.5
0.0
0.5
215
Convergence properties Uniform ergodicity prohibited by random walk structure At best, geometric ergodicity: For a symmetric density f , log-concave in the tails, and a positive and symmetric density g, the chain (X (t) ) is geometrically ergodic.
216
Example 43 Comparison of tail eects Random-walk MetropolisHastings algorithms based on a N (0, 1) instrumental for the generation of (a) a N (0, 1) distribution and (b) a distribution with density (x) (1 + |x|)3
1.5 1.5
0 50 100 (a) 150 200
1.0
0.5
0.0
-0.5
-1.0
-1.5
-1.5
0
-1.0
-0.5
0.0
0.5
1.0
50
100 (b)
150
200
90% condence envelopes of the means, derived from 500 parallel independent chains
217
6.4
Extensions
There are many other algorithms Adaptive Rejection Metropolis Sampling Reversible Jump Langevin algorithms to name a few...
218
6.4.1 Facts:
Reversible jump MCMC
- No clear dominating measure in variable dimension spaces - Gibbs sampling does not apply Solution: - Create xed dimension moves locally - Supplements 1 from Mk1 and 2 from Mk2 by u12 and u21 resp. so that (1 , u12 ) and (2 , u21 ) are in bijection (one-to-one correspondance): (2 , u21 ) = T (1 , u12 )
219
- Use acceptance probability min (k2 , 2 )12 g(u21 ) T (1 , u12 ) ,1 (k1 , 1 )21 g(u12 ) (1 , u12 ) [Green, 1995]
220
6.4.2
Langevin Algorithms
Proposal based on the Langevin diusion Lt is dened by the stochastic dierential equation dLt = dBt + 1 2 log f (Lt )dt,
where Bt is the standard Brownian motion The Langevin diusion is the only non-explosive diusion which is reversible with respect to f .
221
Discretization: x
(t+1)
=x
(t)
2 + 2
log f (x(t) ) + t ,
t Np (0, Ip )
where 2 corresponds to the discretization Unfortunately, the discretized chain may be be transient, for instance when
x
lim
log f (x)|x|1 > 1
222
MH correction Accept the new value Yt with probability f (Yt ) (t) ) f (x exp Yt x
(t)
2 2 2 2
log f (x ) log f (Yt )
(t)
2 2 1. 2 2
exp x(t) Yt
Choice of the scaling factor Should lead to an acceptance rate of 0.574 to achieve optimal convergence rates (when the components of x are uncorrelated)
223
6.4.3
Optimizing the Acceptance Rate
Problem of choice of the transition kernel from a practical point of view Most common alternatives: (a) a fully automated algorithm like ARMS; (b) an instrumental density g which approximates f , such that f /g is bounded for uniform ergodicity to apply; (c) a random walk In both cases (b) and (c), the choice of g is critical,
224
Case of the independent MetropolisHastings algorithm Choice of g that maximizes the average acceptance rate = IE min = 2P f (Y ) g(X) ,1 f (X) g(Y ) f (X) f (Y ) , g(Y ) g(X)
X f, Y g,
Related to the speed of convergence of 1 T

T
h(X (t) )
t=1
to IEf [h(X)] and to the ability of the algorithm to explore any complexity of f
225
Practical implementation Choose a parameterized instrumental distribution g(|) and adjusting the corresponding parameters based on the evaluated acceptance rate 2 () = m
m
I {f (yi )g(xi )>f (xi )g(yi )} , I

i=1
where x1 , . . . , xm sample from f and y1 , . . . , ym iid sample from g.
226
Example 44 Inverse Gaussian distribution. Simulation from f (z|1 , 2 ) z

3/2
2 exp 1 z +2 z
1 2 + log
22
I IR+ (z) I
based on the Gamma distribution Ga(, ) with = Since f (x) 2 1/2 x exp ( 1 )x g(x) x ,
2 /1
the optimal value of x is x = ( + 1/2) ( + 1/2)2 + 42 (1 ) . 2( 1 )
227
The analytical optimization (in ) of M () = (x )1/2 exp ( 1 )x is impossible () 0.2 0.22 0.5 0.41 0.8 0.54 0.9 0.56 1 0.60 1.1 0.63 1.2 0.64 1.5 0.71 1.148 1.115 2 x
IE[Z] 1.137 1.158 IE[1/Z] 1.116 1.108
1.164 1.154 1.133 1.148 1.181 1.116 1.115 1.120 1.126 1.095
(1 = 1.5, 2 = 2, and m = 5000).
228
Case of the random walk Dierent approach to acceptance rates A high acceptance rate does not indicate that the algorithm is moving correctly since it indicates that the random walk is moving too slowly on the surface of f . If x(t) and yt are close, i.e. f (x(t) ) f (yt ) y is accepted with probability f (yt ) ,1 1. min f (x(t) ) For multimodal densities with well separated modes, the negative eect of limited moves on the surface of f clearly shows.
229
If the average acceptance rate is low, the successive values of f (yt ) tend to be small compared with f (x(t) ), which means that the random walk moves quickly on the surface of f since it often reaches the borders of the support of f
230
Rule of thumb In small dimensions, aim at an average acceptance rate of 50%. In large dimensions, at an average acceptance rate of 25%. [Gelman,Gilks and Roberts, 1995]
231
The Gibbs Sampler
7.1 General Principles 7.2 Data Augmentation 7.3 Improper Priors
232
7.1
General Principles
A very specic simulation algorithm based on the target f Uses the conditional densities f1 , . . . , fp from f Start with the random variable X = (X1 , . . . , Xp ) Simulate from the conditional densities, Xi |x1 , x2 , . . . , xi1 , xi+1 , . . . , xp fi (xi |x1 , x2 , . . . , xi1 , xi+1 , . . . , xp ) for i = 1, 2, . . . , p.
233
Algorithm 45 The Gibbs sampler
Given x(t) = (x1 , . . . , xp ), generate 1. X1 2. X2

(t+1) (t+1)
(t)
(t)
f1 (x1 |x2 , . . . , xp ); f2 (x2 |x1

(t+1)
(t)
(t)
, x3 , . . . , xp ),
(t+1)
(t)
(t)
... p. Xp
(t+1)
fp (xp |x1
(t+1)
, . . . , xp1 )
Then X(t+1) X f
234
The full conditionals densities f1 , . . . , fp are the only densities used for simulation. Thus, even in a high dimensional problem, all of the simulations may be univariate
235
Example 46 Bivariate Gibbs sampler (X, Y ) f (x, y) Generate a sequence of observations by
Set X0 = x0 , and for t = 1, 2, . . . , generate Yt Xt fY |X (|xt1 ) fX|Y (|yt )
where fY |X and fX|Y are the conditional distributions
236
(Xt , Yt )t , is a Markov chain (Xt )t and (Yt )t individually are Markov chains For example, the chain (Xt )t has transition density K(x, x ) = with invariant density fX () fY |X (y|x)fX|Y (x |y)dy,
237
For the special case (X, Y ) N2 0, the Gibbs sampler is 1 1 ,
Given yt , generate Xt+1 | yt Yt+1 | xt+1 N (yt , 1 2 ) , N (xt+1 , 1 2 ).
238
Example 47 Auto-exponential model On IR3 , density + f (y1 , y2 , y3 ) exp{(y1 + y2 + y3 + 12 y1 y2 + 23 y2 y3 + 31 y3 y1 )} , with known ij > 0. The full conditional densities are exponential Y3 |y1 , y2 Exp (1 + 23 y2 + 31 y1 ) , In contrast, the other conditionals, and the marginal distributions are dicult.
239
Properties of the Gibbs sampler Formally, a special case of a sequence of 1-D M-H kernels, all with acceptance rate uniformly equal to 1. The Gibbs sampler 1. limits the choice of instrumental distributions 2. requires some knowledge of f 3. is, by construction, multidimensional 4. does not apply to problems where the number of parameters varies as the resulting chain is not irreducible.
240
7.1.1
Completion
The Gibbs sampler can be generalized in much wider generality A density g is a completion of f if g(x, z) dz = f (x)
Z
241
Purpose g should have full conditionals that are easy to simulate for a Gibbs sampler to be implemented with g rather than f For p > 1, write y = (x, z) and denote the conditional densities of g(y) = g(y1 , . . . , yp ) by Y1 |y2 , . . . , yp Y2 |y1 , y3 , . . . , yp Yp |y1 , . . . , yp1 g1 (y1 |y2 , . . . , yp ), g2 (y2 |y1 , y3 , . . . , yp ), ..., gp (yp |y1 , . . . , yp1 ).
242
The move from Y (t) to Y (t+1) is dened as follows: Algorithm 48 Completion Gibbs sampler
Given (y1 , . . . , yp ), simulate 1. Y1

(t+1) (t+1)
(t)
(t)
g1 (y1 |y2 , . . . , yp ), g2 (y2 |y1

(t+1)
(t)
(t)
2. Y2 ... p. Yp
, y3 , . . . , yp ), , . . . , yp1 ).
(t+1)
(t)
(t)
(t+1)
gp (yp |y1
(t+1)
243
Example 49 Cauchy-normal Consider the density e /2 f (|0 ) [1 + ( 0 )2 ] posterior from the model X| N (, 1) and C(0 , 1). Then f (|0 )
0
2
2 /2
[1+(0 )2 ] /2
1 d,
and therefore g(, ) e

2
/2
e[1+(0 )
] /2
1 ,
244
with conditional densities g1 (|) g2 (|) 1 + ( 0 )2 = Ga , 2 0 1 = N , . 1+ 1+ ,
The parameter is completely meaningless for the problem at hand but serves to facilitate computations.
245
7.1.2
Slice sampler
If f () can be written as a product

k
fi (),
i=1
it can be completed
k
I 0i fi () , I
i=1
leading to the following Gibbs algorithm:
246
Algorithm 50 Slice sampler
Simulate 1. 1
(t+1)
U[0,f1 ((t) )] ;
... k. k
(t+1)
U[0,fk ((t) )] ;
k+1. (t+1) UA(t+1) , with A(t+1) = {y; fi (y) i

(t+1)
, i = 1, . . . , k}.
247
The slice sampler usually enjoys good theoretical properties (like geometric ergodicity). As k increases, the determination of the set A(t+1) may get increasingly complex.
248
Example 51 Normal simulation For the standard normal density, f (x) exp(x2 /2), a slice sampler is based on |x X| U[0,exp(x2 /2)] , U
[ 2 log(),
2 log()]
249
7.1.3
Properties of the Gibbs sampler (Y1 , Y2 , , Yp ) g(y1 , . . . , yp )
If either (i) g (i) (yi ) > 0 for every i = 1, , p, implies that g(y1 , . . . , yp ) > 0, where g (i) denotes the marginal distribution of Yi , or [Positivity condition] (ii) the transition kernel is absolutely continuous with respect to g, then the chain is irreducible and positive Harris recurrent. (i). If h(y)g(y)dy < , then lim 1 T
T
nT
h1 (Y (t) ) =
t=1
h(y)g(y)dy a.e. g.
250
(ii). If, in addition, (Y (t) ) is aperiodic, then

n
lim
K n (y, )(dx) f
TV
=0
for every initial distribution .
251
7.1.4
Hammersley-Cliord Theorem
An illustration that conditionals determine the joint distribution If the joint density g(y1 , y2 ) have conditional distributions g1 (y1 |y2 ) and g2 (y2 |y1 ), then g(y1 , y2 ) = g2 (y2 |y1 ) . g2 (v|y1 )/g1 (y1 |v) dv
252
General case Under the positivity condition, the joint distribution g satises
p
g(y1 , . . . , yp )
j=1
g j (y j |y 1 , . . . , y g j (y j |y 1 , . . . , y
j1 j1
,y ,y
j+1 j+1
, . . . , y p) , . . . , y p)
for every permutation
on {1, 2, . . . , p} and every y Y.
253
7.1.5
Hierarchical models
The Gibbs sampler is particularly well suited to hierarchical models Example 52 Hierarchical models in animal epidemiology Counts of the number of cases of clinical mastitis in 127 dairy cattle herds over a one year period. Number of cases in herd i Xi P(i ) i = 1, , m
where i is the underlying rate of infection in herd i Lack of independence might manifest itself as overdispersion.
254
Modied model Xi i i P(i ) Ga(, i ) IG(a, b),
The Gibbs sampler corresponds to conditionals i i (i |x, , i ) = Ga(xi + , [1 + 1/i ]1 ) (i |x, , a, b, i ) = IG( + a, [i + 1/b]1 )
255
7.2
Data Augmentation
The Gibbs sampler with only two steps is particularly useful Algorithm 53 Data Augmentation
Given y (t) , 1.. Simulate Y1 2.. Simulate Y2

(t+1) (t+1)
g1 (y1 |y2 ) ; g2 (y2 |y1

(t+1)
(t)
).
256
Convergence is ensured
(Y1 , Y2 )(t) Y1 Y2
(t) (t)
(Y1 , Y2 ) g Y2 g2
Y1 g1
257
Example 54 Grouped counting data 360 consecutive records of the number of passages per unit time. Number of passages Number of observations 0 139 1 128 2 55 3 25 4 or more 13
258
Feature Observations with 4 passages and more are grouped If observations are Poisson P(), the likelihood is (|x1 , . . . , x5 )
3
347 128+552+253
1e
i=0
i!
13
which can be dicult to work with. Idea With a prior () = 1/, complete the vector (y1 , . . . , y13 ) of the 13 units larger than 4
259
Algorithm 55 Poisson-Gamma Gibbs
1.. Simulate Yi 2.. Simulate
(t)
P((t1) ) I y4 I
i = 1, . . . , 13
13
(t) Ga 313 +
i=1
yi , 360
(t)
The Bayes estimator 1 = 360T converges quite rapidly

T 13
313 +
t=1 i=1
yi
(t)
260
1.025
1.023
10 20 30 40
1.024
0.9
1.0 lambda
1.1
1.2
1.021
0
1.022
100
200
300
400
500
261
7.2.1
Rao-Blackwellization
If (y1 , y2 , . . . , yp )(t) , t = 1, 2, . . . T is the output from a Gibbs sampler 1 0 = T

T t=1
h y1
(t)
h(y1 )g(y1 )dy1
and is unbiased. The Rao-Blackwellization replaces 0 with its conditional expectation rb 1 = T

T t=1 (t) IE h(Y1 )|y2 , . . . , yp . (t)
262
Then Both estimators converge to IE[h(Y1 )] Both are unbiased, and var IE h(Y1 )|Y2 , . . . , Yp(t)
(t)
var(h(Y1 )),
so rb is uniformly better (for Data Augmentation)
263
Some examples of Rao-Blackwellization For the bivariate normal 0 0 , 1 1
(X, Y ) N
the Gibbs sampler is based upon X|y N (y, 1 2 )
Y | x N (x, 1 2 ).
264
To estimate = IE(X) we could use 1 0 = T or its Rao-Blackwellized version 1 1 = T 1 (i) (i) IE[X |Y ] = T i=1
1 2 T T T
X (i)
i=1
Y (i) ,
i=1
2 2 which satises 0 /1 =
> 1.
265
For the Poisson-Gamma Gibbs sampler, we could estimate with 1 0 = T

T
(t) ,
t=1
but we instead used the Rao-Blackwellized version = 1 T

T t=1 T 13
IE[(t) |x1 , x2 , . . . , x5 , y1 , y2 , . . . , y13 ] 313 +

t=1 i=1
(i)
(i)
(i)
1 360T
yi
(t)
266
Another substantial benet of Rao-Blackwellization is in the approximation of densities of dierent components of y without nonparametric density estimation methods. The estimator 1 T and is unbiased.,
T t=1
gi (yi |yj , j = i) gi (yi ),
(t)
267
7.2.2
The Duality Principle
Ties together the properties of the two Markov chains in Data Augmentation Consider a Markov chain (X (t) ) and a sequence (Y (t) ) of random variables generated from the conditional distributions X (t) |y (t) Y (t+1) |x(t) , y (t) Properties If the chain (Y (t) ) is ergodic then so is (X (t) ) The conclusion holds for geometric or uniform ergodicity. The chain (Y (t) ) can be discrete, and the chain (X (t) ) can be continuous. (x|y (t) ) f (y|x(t) , y (t) ) .
268
7.2.3
Parameterization
Convergence of both Gibbs sampling and MetropolisHastings algorithms may suer from a poor choice of parameterization The overall advice is to try to make the components as independent as possible.
269
Example 56 Random eects model In the simple random eects model yij = + i + ij where
2 2 i N (0, ) and ij N (0, y )
i = 1, . . . , I,
j = 1, . . . , J
for a at prior on , the Gibbs sampler implemented for the (, 1 , . . . , I )

2 2 parameterization exhibits high correlation if y /(IJ ) is large and consequent slow convergence
270
If the model is rewritten as the hierarchy

2 yij N (i , y ), 2 i N (, ),
the correlations between the i s and between and the i s are lower
271
7.3
Improper Priors
Unsuspected danger resulting from careless use of MCMC algorithms: It can happen that all conditional distributions are well dened, all conditional distributions may be simulated from, but... the system of conditional distributions may not correspond to any joint distribution Warning The problem is due to careless use of the Gibbs sampler in a situation for which the underlying assumptions are violated
272
Example 57 Conditional exponential distributions For the model X1 |x2 Exp(x2 ) , X2 |x1 Exp(x1 )
the only function f (x1 , x2 ) that is a candidate for the joint density is f (x1 , x2 ) exp(x1 x2 ), but f (x1 , x2 )dx1 dx2 =
Thus, these conditional distributions do not correspond to a joint probability distribution.
273
Example 58 Improper random eects For a random eects model, Yij = + i + ij , where i N (0, 2 ) and ij N (0, 2 ), the Jereys (improper) prior for the parameters , and is 1 (, , ) = 2 2 .
2 2
i = 1, . . . , I, j = 1, . . . , J,
274
The conditional distributions

2 2
i |y, , ,
|, y, 2 , 2 2 |, , y, 2 2 |, , y, 2
J(i ) y N , (J 2 + 2 )1 J + 2 2 N ( , 2 /JI) , y IG I/2, (1/2)

2 i i
IG IJ/2, (1/2)
i,j
(yij i )2 ,
are well-dened and a Gibbs sampling can be easily implemented in this setting.
275
Evolution of ((t) ) and corresponding histogram

30
25
20
10
-4
-3
-2 (1000 iterations)
-1
-8
-6
-4 observations
freq. 15
-2
276
The gure shows the sequence of the (t) and the corresponding histogram for 1000 iterations. The trend of the sequence and the histogram do not indicate that the corresponding joint distribution does not exist
277
Final notes on impropriety The improper posterior Markov chain cannot be positive recurrent The major task in such settings is to nd indicators that ag that something is wrong. However, the output of an improper Gibbs sampler may not dier from a positive recurrent Markov chain. Example The random eects model was initially treated in Gelfand et al. (1990) as a legitimate model
278
Diagnosing Convergence
8.1 Stopping the Chain 8.2 Monitoring Stationarity Convergence 8.3 Monitoring Average Convergence
279
8.1
Stopping the Chain
Convergence results do not tell us when to stop the MCMC algorithm and produce our estimates. Methods of controlling the chain in the sense of a stopping rule to guarantee that the number of iterations is sucient.
280
Three types of convergence 1. Convergence to the Stationary Distribution Minimal requirement for approximation of simulation from f 2. Convergence of Averages convergence of the empirical average 1 T
T
h((t) ) IEf [h()].

t=1
most relevant in the implementation of MCMC algorithms. 3. Convergence to iid Sampling (t) (t) How close a sample (1 , . . . , n ) is to being iid.
281
8.1.1
Single vs. Multiple Chains
Some methods involve the simulation in sf parallel of M independent (t) chains (m ) (1 m M ) Some are based on a single on-line chain. Motivations for parallel chains Variability and dependence on the initial values are reduced Potentialy easier to control convergence comparing estimation of quantities of interest over dierent chains, But... in a naive implementation, slower chain governs convergence But... initial distribution is paramount
282
8.2
Monitoring Convergence to the Stationary Distribution
8.2.1
Graphical Methods
Natural empirical approach to convergence control is to draw pictures May detect deviant or nonstationary behaviors A rst idea is to draw the sequence of the (t) s against t However, this plot is only useful for strong nonstationarities of the chain.
283
Example 59 Witchs hat distribution Consider (|y) (1 ) d e

y
2
/(2 2 )
+ I C (), I
y IRd , C = [0, 1]d
One mode very concentrated around y for small and
284
Naive implementation of the Gibbs sampler Algorithm 60 Witchs hat distribution 1. Generate Ui U[0,1] 2. Generate i U[0,1]
+ N (yi , 2 , 0, 1)
if Ui < /(wi + ) otherwise
285
where
+ N (, 2 , , ) = N (, 2 ) restricted to [, ]
and
(1 ) 2 { Pj=i (yj j )2 /(22 )} wi = e d+1
1 yi
yi
286
0
1 0.8 0.6 0.4 0.2 0
0 0.4 0.2 0.6 0.8 1
000 40000 10000 20000 30
287
0.0 0.2 0.4 0.6 0.8 1.0

0
200
400
600
800
1000
initial value 0.0217
0.0 0.2 0.4 0.6 0.8 1.0

0
200
400
600
800
1000
initial value 0.9098
Chain (1 ) for two initial values, 0.0217 (top) and 0.9098 (bottom)
(t)
288
Strong attraction of the mode that gives the impression of stationarity Chain with initial value 0.9098 achieves a momentary escape from the mode, but is actually atypical.
289
8.2.2
Other monitors of stationarity
Nonparametric tests of stationarity Standard nonparametric tests (Kolmogorov-Smirnov,...) When the chain is stationary, (t1 ) and (t2 ) have the same marginal distribution for arbitrary times t1 and t2 . Methods based on Renewal Theory Methods based on Distance Evaluations between the n-step kernel and the marginal Remember, stationarity is not the main concern!
290
8.3
Monitoring Convergence of Averages
Evaluation of the convergence of 1 T

T
h((t) ) IEf [h()].

t=1
291
8.3.1
Graphical Methods
Purely graphical evaluation based on cumulative sums, graphing the partial dierences
i i DT = t=1
[h((t) ) ST ],
i = 1, , T,
[CUSUM] where 1 ST = T
T
h((t) ).
t=1
292
i When the mixing of the chain is high, the graph of DT is highly irregular and concentrated around 0. (Looks like a Brownian bridge.)
Slowly mixing chains (chains with a slow pace of exploration of the stationary distribution) produce regular graphs with long excursions away from 0.
293
-25
-20
0
-15
-10
-5
200
400
600
800
1000
CUSUM criterion for the witch hat chain (1000 iterations).
294
But... The pathological shape the witchs hat distribution is actually close to the ideal shape of Yu and Mykland, and there is no indication that the chain has not yet left the mode (0.7, 0.7) This diculty is common to most on-line methods, that is, to diagnoses based on a single chain. It is almost impossible to detect the existence of other modes Youve only seen where youve been
295
8.3.2
Multiple Estimates
In most cases, the graph of the raw sequence ((t) ) is unhelpful Given some quantities of interest IEf [h()], a more helpful indicator is the behavior of the averages 1 T in terms of T . A more controlled approach is to use simultaneously several convergent estimators of IEf [h()] based on the same chain ((t) ), until all estimations coincide (up to a given precision).
T
h((t) )
t=1
296
Most common estimators The empirical average ST Rao-Blackwellized versions of this average,
C ST
1 = T
IE[h()| (t) ] ,
t=1
Importance sampling alternatives

P ST
1 = T
wt h((t) ) ,
t=1
where wt = f ((t) )/gt ((t) ) and gt is the true density used for the simulation of (t) .
297
P Note that ST removes the correlation between the (t) s So up to P second order, ST behaves as an independent sum. This implies that P VarST decreases at speed 1/T in stationarity settings. Thus, nonstationarity can be detected if the decrease of the variations of P ST does not t in a condence parabola of order 1/ T .
298
Example 61 Cauchy posterior For the posterior distribution (|x1 , x2 , x3 ) e

/2
2 2
1 . 1 + ( xi )2 i=1
a completion Gibbs sampling algorithm can be derived via articial variables, 1 , 2 , 3 , such that (, 1 , 2 , 3 |x1 , x2 , x3 ) e
2
/2
3 i=1
e(1+(xi )
)i /2
299
resulting in the Gibbs sampler based on the conditionals (i = 1, 2, 3) i |, xi |x1 , x2 , x3 , 1 , 2 , 3 Exp N 1 + ( xi )2 2 i xi , i + 2 i

i
, 1 2 i i + .
300
0
-10
200
400
600
800
-5
5 (20 000 iterations)
10
15
20
Comparison of the normal-Cauchy density and of the histogram (20, 000 points)
301
Eciency of this algorithm: agreement between the histogram of the simulated (t) s and the true posterior distribution If the function of interest is h() = exp(/) the dierent approximations of IE [h()] can be monitored.
302
0.80
0
0.81
0.82
0.83
0.84
0.85
100
200
300
400
500
(thousand iterations)
C R Convergence of ST (full line), ST (dotted line), ST (mixed) P and ST (long dashes)
303
C The strong agreement of ST , ST indicates convergence
The bad behavior the importance sampler is most likely associated with an innite variance.
304
Example 62 Beta mixture For the (dicult) chain (X (t) ) X

(t+1)
x(t) Y (t) Be( + 1, 1)
with probability 1 x(t) , otherwise,
estimate the expectation of h(x) = x1 with the four estimators:
305
C ST and ST are indistinguishable
importance sampling is the best mixing is very slow

R C The nal values are 0.224, 0.224, 0.211 and 0.223 for ST , ST , ST P and ST respectively, to compare with a theoretical value of 0.2.
306
0.18
0
0.20
0.22
0.24
0.26
0.28
200
400
600
800
1000
(thousand iterations)
R C Convergence of ST (full line), ST (dotted line), ST (mixed P dashes) and ST (long dashes) of IE[(X (t) )0.8 ] for Be(0.2, 1).
307
Limitations of the method (1). Multiple estimates may not be available (2). Intrinsically conservative the slowest ox drives the team (3). Cannot detect missing modes Youve only seen where youve been (4). Empirical and conditional estimators often similar, while importance sampler enjoys innite variance
308
8.3.3
Within and between variances
Control strategy devised by Gelman and Rubin (1992) Starts with the derivation of a distribution related with the modes of f , obtained by numerical methods (??). For instance, mixture of Students t distributions centered around the identied modes of f For every quantity of interest = h(), stopping rule based on the dierence between a weighted estimator of the variance and the variance of estimators from dierent chains
309
Denote BT WT with m and m = h(m ). BT and WT represent the between- and within-chains variances.
(t) (t)
= =
1 M 1 M
( m )2 ,
m=1 M
s2 m
m=1
1 = M
M m=1
1 T
T (t) (m m )2 , t=1
1 = T
T (t) m , t=1
1 = M
m
m=1
310
Estimator of the posterior variance of T = 2 T 1 WT + BT . T
Comparison of T with WT through 2 RT F approximately, so IERT 1. Stopping rule is based on either testing that the mean of RT is equal to 1 or on condence intervals on RT .
311
Example 63 (Normal-Cauchy again) Evolution of RT for h() = , M = 100 and 10, 00 iterations. Convergence after about 6, 00 iterations. But... superimposed graph of WT does not exhibit stationarity The distribution of (t) is stationary after a few hundred iterations : the criterion is conservative
312
1.015
1.010
1.005
1.000
200
400
600
800
Evolutions of RT (solid lines and scale on the left) and of WT (dotted lines and scale on the right)
34.6
34.8
35.0
35.2
35.4
313
Example 64 (Witches hat again) The density (|y) has a very concentrated mode around y Use the uniform distribution on C = [0, 1]d as the initial distribution The scale of RT is very concentrated Stability of RT (and of WT ) implies convergence But... the chain has not left the neighborhood of the mode (0.7, 0.7)!
314
1.016
1.018
1.014
1.010
1.012
1.006
1.008
200
400
600
800
Evolutions of RT (solid lines and scale on the left) and of WT (dotted lines and scale on the right)
0.002
0.004
0.006
0.008
315
Comments This method has enjoyed wide usage, in particular because of its simplicity and of its connections with the standard tools of linear regression. However... The accurate construction of the initial distribution can be delicate/time-consuming. In some models, number of modes too great to complete identication The method relies on normal approximations
316
In general, best to use a battery of tests/assessments: There are many others that we have not mentioned. Some are Methods based on Renewal Theory Methods based on Discretization
317
Implementation in Missing Data Models
9.1 First examples 9.2 Finite mixtures of distributions 9.3 Extensions
318
Missing data models are a natural application for simulation Simulation replaces the missing data part so that one can proceed with a classical inference on the complete model. The EM algorithm rst described a rigorous and general formulation of statistical inference though completion of missing data. Potential of Markov Chain Monte Carlo algorithms in the analysis of missing data models
319
9.1
First examples
Example 65 Rounding eect Numerous settings (surveys, medical experiments, epidemiological studies, design of experiment, quality control, etc.) produce a grouping of the original observations into less informative categories, often for reasons beyond the control of the experimenter: Data coarsening For instance, approximation bias in a study on smoking habits
320
Yi Exp() the number of cigarettes smoked per day is unobserved (rounding) and instead we observe bins Xi , where Xi |gi , yi = [[yi ], [yi ] + 1) [20[yi /20], 20[yi /20] + 20) if gi = 0, (cig. reported) if gi = 1, (packs reported)
321
This means that, as yi increases, the probability of rounding up the answer xi to the nearest full pack of cigarettes also increases, under the constraint 2 > 0, thus we observe the Gi s according to Gi |yi Bernoulli((1 2 yi )), where is the cdf of N (0, 1).
322
If c(xi ) denotes the center of the ith bin, the likelihood function is
n c(xi )+10 c(xi )10 c(xi )+1/2 1gi gi
L(, 1 , 2 |x, g)
=
i=1
eyi [1 (1 2 yi )]dyi eyi (1 2 yi )dyi
c(xi )1/2
[incomplete-data]
323
The complete-data likelihood is L(, 1 , 2 |y, x, g)

n
=
i=1
eyi [1 (1 2 yi )] I I(c(xi ) 10 yi < c(xi ) + 10)] eyi (1 2 yi ) I I(c(xi ) 1/2 yi < c(xi ) + 1/2)]
1gi gi
Solutions EM algorithm (Monte Carlo version) Gibbs sampling (priors on , 1 and 2 ).
324
Contingency table When several variables are studied simultaneously in a sample, each corresponding to a grouping of individual data If the context is suciently informative to allow for a modeling of the individual data, the completion of the contingency table (by reconstruction of the individual data) may improve inference
325
Example 66 Lizard habitat Observation of two characteristics of the habitat of 164 lizards Diameter (inches) Height (feet) 4.0 > 4.75 32 4.75 86 > 4.0 11 35
326
Distribution on the individual observations Xijk of diameter and of height (i, j = 1, 2, k = 1, . . . , nij ) Yijk = log(Xijk ) N (0, ) , where = 2 1 1
= 2 0 .
327
T N2 (, ; Qij ) represents the normal distribution restricted to one of the four quadrants induced by (log(4.75), log(4))
Prior
1 (, , ) = I [1,1] () I
328
Algorithm 67 Contingency table completion
T 1. Simulate yijk N2 (, ; Qij )
(i, j = 1, 2, k = 1, . . . , nij );
2. Simulate N2 (y, /164); 3. Simulate 2 from the inverted gamma distribution IG(164,
i,j,k
(yijk )t 1 (yijk )/2) ; 0
4. Simulate according to (1 2 )164/2 exp{

i,j,k
(yijk )t 1 (yijk )/2} ,
329
Note: The distribution in step 4. requires a MetropolisHastings step based, for instance, on an inverse Wishart distribution
330
Qualitative models Example 68 Probit Regression Threshold model: observe Yi Bernoulli{0, 1} with pi = (xt ) , i where is the standard normal cdf. Create latent (unobservable) continuous rvs Yi where Yi = 1 0 if Yi > 0, otherwise. IRp .
Thus, pi = P (Yi = 1) = P (Yi > 0), and we have an automatic way to complete the model
331
Given prior Np (0 , ) on Algorithm 69 Probit posterior distribution
1. Simulate
yi
N+ (xt , 1, 0) i N (xt , 1, 0) i
if yi = 1, if yi = 0,
(i = 1, . . . , n)
2. Simulate Np (1 + XX t )1 (1 0 +
i yi xi ), (1 + XX t )1
332
Incomplete observations arise in numerous settings. A survey with multiple questions may include nonresponses to some personal questions; A calibration experiment may lack observations for some values of the calibration parameters; A pharmaceutical experiment on the aftereects of a toxic product may skip some doses for a given patient.
333
Analysis of such structures complicated by the fact that the failure to observe is not always explained. If these missing observations are at random the incompletely observed data only play a role through their marginal distribution. But... these distributions are not always explicit and a natural approach leading to a Gibbs sampler algorithm is to replace the missing data by simulation.
334
Example 70 Non-ignorable non-response Average incomes and numbers of responses/non-responses to a survey on the income by age, sex and marital status. Age < 30 > 30 Men Single 20.0 24/1 30.0 15/5 Married 21.0 5/11 36.0 2/8 Women Single Married 16.0 16.0 11/1 2/2 18.0 8/4 0/4
335
Observations grouped by average, exponential shape for the individual data,

ya,s,m,i
with
Exp(a,s,m ) a,s,m = 0 + a + s + m ,
where 1 i na,s,m a (a = 1, 2) corresponds to age (junior/senior) s (s = 1, 2) corresponds to sex (fem./male) m (m = 1, 2) corresponds to family (single/married) The model is unidentiable, but that can be remedied by constraining 1 = 1 = 1 = 0.
336
More dicult problem: Nonresponse depends on the income in the shape of a logit model, pa,s,m,i where pa,s,m,i denotes the probability of nonresponse and (w0 , w1 ) are the logit parameters.
exp{w0 + w1 ya,s,m,i } = , 1 + exp{w0 + w1 ya,s,m,i }
337
The likelihood of the complete model is

na,s,m i=1 exp{za,s,m,i (w0 + w1 ya,s,m,i )} (0 + a + s + m )ra,s,m 1 + exp{w0 + w1 ya,s,m,i }
a=1,2 s=1,2 m=1,2
exp ra,s,m y a,s,m (0 + a + s + m ) where

za,s,m,i is the indicator of a missing observation
na,s,m is the number of people by category ra,s,m is the number of responses by category y a,s,m is the average of these responses by category
338
Completion of the data

The ya,s,m,i s from exp{za,s,m,i (w0 + w1 ya,s,m,i )} exp(ya,s,m,i a,s,m ) , 1 + exp{w0 + w1 ya,s,m,i }
[requires a MetropolisHastings step] The parameters are simulated from (0 + a + s + m )ra,s,m

a=1,2 s=1,2 m=1,2
exp ra,s,m y a,s,m (0 + a + s + m ) using a gamma instrumental distribution.
339
And (w0 , w1 ) from

na,s,m i=1 exp{za,s,m,i (w0 + w1 ya,s,m,i )} 1 + exp{w0 + w1 ya,s,m,i }
a=1,2 s=1,2 m=1,2
[logit model]
340
9.2
Finite mixtures of distributions
Distribution f (x) = with p1 + . . . + pk = 1

k
pj f (x|j ) ,
j=1
341
useful in practical modeling challenging from an inferential point of view (i.e., estimation of pj and j ) likelihood dicult to work with
n
pj f (xi |j ) ,
L(p, |x1 , . . . , xn )
i=1
j=1
containing k n terms.
342
Missing data structure Associate with every observation xi an indicator variable zi {1, . . . , k} Indicates which component of the mixture xi comes from. Demarginalized model: zi Mk (1; p1 , . . . , pk ), xi |zi f (x|zi ) .
343
Completed model likelihood:

n
(p, |x , . . . , x ) i i
i=1 k
pzi f (xi |zi ) pj f (xi |j )

j=1 i;zi =j
344
Algorithm 71 Mixture simulation
1. Simulate zi (i = 1, . . . , n) from P (zi = j) pj f (xi |j ) and compute the statistics

n n
(j = 1, . . . , k)
nj =
i=1
I zi =j , I
nj xj =
i=1
I zi =j xi . I
345
2. Generate (j = 1, . . . , k) j j + nj xj , j + nj , j + n j Dk (1 + n1 , . . . , k + nk ) .
346
Example 72 Normal mixtures f (x) =

k
pj
j=1
2 (xj )2 /(2j )
2 j
with conjugate distributions on (j , j ) j |j N

2 j , j /j
2 j
IG
j + 3 j , 2 2
347
Algorithm 73 Normal mixture
1. Simulate (i = 1, . . . , n)
2 zi P (zi = j) pj exp (xi j )2 /(2j ) 1 j
and compute the statistics (j = 1, . . . , k)

n n n
nj =
i=1
I zi =j , I
nj xj =
i=1
I zi =j xi , I
s2 = j
i=1
I zi =j (xi xj )2 . I
348
2. Generate j |j
2 j
N IG
2 j j j + nj xj , j + n j j + n j
, ,
j + n j + 3 j + s 2 j , 2 2
p Dk (1 + n1 , . . . , k + nk ) .
349
Good performance of the Gibbs sampler guaranteed by the duality principle Practical implementation of the Gibbs sampler might, face serious convergence diculties (phenomenon of the absorbing component)
350
Trapping states When only a small number of observations are allocated to a given component j0 , the following probabilities are quite small: (1). The probability of allocating new observations to the component j0 . (2). The probability of reallocating, to another component, observations already allocated to j0 .
351
A paradox While the Gibbs sampler chain (z (t) , (t) ) is irreducible, there exists (almost)absorbing states (or trapping state) which require an enormous number of iterations of the Gibbs sampler to escape from.
352
Example 74 Acidity level in lakes 149 observations of acidity levels in lakes in the American North-East Mixture model t with the Gibbs sampler algorithm Lack of evolution of estimated (plug-in) density from the Gibbs sampler when iterations increase Phenomenon which occurs often in mixture settings, due to weak identiability of these models.
353
T = 500
T = 1000
15
10
-2
0
-2
10
15
T = 2000
T = 3000
15
10
-2
0
-2
10
15
T = 4000
T = 5000
15
10
-2
0
-2
10
15
Estimation of the density for 3 components and T iterations
354
9.3
Extensions
Relaxation of the independence assumption between observations leads to hidden Markov chains Dierent constraints on the changes of components corresponds to changepoint models
355
Example 75 Switching AR model Xt |xt1 , zt , zt1 Zt |zt1 where Zt takes values in {0, 1} initial values z0 = 0, x0 = 0 . N (zt + (xt1 zt1 ), 2 ), zt1 I zt1 (zt ) + (1 zt1 )I 1zt1 (zt ), I I
356
States zt not observed and the algorithm completes the sample x1 , . . ., xT by simulating every zt (1 < t < T ) from P (Zt |zt1 , zt+1 , xt , xt1 , xt+1 ) 1 exp 2 (xt zt (xt1 zt1 ))2 2 + (xt+1 zt+1 (xt zt ))2 zt1 I zt1 (zt ) + (1 zt1 )I 1zt1 (zt ) I I (zt I zt (zt+1 ) + (1 zt )I 1zt (zt+1 )) . I I
357
Given the prior (1 0 ) N (0, 2 ), (, 2 ) = 1/, 0 , 1 U[0,1] ,
the generation of the parameters is straightforward because of the conjugate structure.
358
Example 76 Hidden Markov Poisson model Observe Xt s depending on an unobserved Markov chain (Zt ) such that (i, j = 1, 2) Xt |zt P(zt ), Noninformative prior (1 , 2 , p11 , p22 ) = 1 I 2 <1 I 1 P (Zt = i|Zt1 = j) = pji .
359
0
0
50
100
150
200
Data of Leroux and Puterman (1992) on the number of moves of a lamb fetus during 240 successive 5 second periods.
360
Changepoint model Sample (x1 , . . . , xn ) associated with a latent index such that X1 , . . . , X | X +1 , . . . , Xn | where the support of 0 is {1, . . . , n}.
iid
0 ( ), f (x|1 ) f (x|2 ),
iid
361
Example 77 Changepoint regression yi with U{1, . . . , n}, and

1 2 j = (j , j )t N2 (0 , ) , j IG(0 , 0 ) 2 N (1 + 1 xi , 1 ) 2 N (2 + 2 xi , 2 )
for i = 1, . . . , , for i = + 1, . . . , n,
W2
1 1 , V
0 N (, C)
where Wp (, A) is the Wishart
362
Example 78 Stochastic Volatility Popular in nancial applications, in describing series with sudden changes in the magnitude of variation of the observed values through a latent linear process (Yt ), the volatility
Yt = Yt1 + Yt /2 Yt = e t , t1
where
and
N (0, 1).
363
Observed likelihood L( , |y0 , . . . , yT ) obtained by integrating the complete-data likelihood

Lc ( , |y0 , . . . , yT , y0 , . . . , yT ) T
exp
t=0
2 yt yt e
+ yt /2 T
( )T +1 exp (y0 )2 + t=1
(yt yt1 )2
/2( )2 .
364
-10
0
-5
10
15
20
100
200
300
400
500
Simulated stochastic volatility process with = 1 and
= .9

Cours MC

Cargado por

Información del documento

Descripción original:

Título original

Derechos de autor

Formatos disponibles

Compartir este documento

Compartir o incrustar documentos

Opciones para compartir

¿Le pareció útil este documento?

¿Este contenido es inapropiado?

Copyright:

Formatos disponibles

Cours MC

Cargado por

Copyright:

Formatos disponibles

Monte Carlo Statistical Methods/October 29, 2001

Monte Carlo Statistical Methods Christian P. Robert Universit Paris Dauphine e

Monte Carlo Statistical Methods/October 29, 2001

Monte Carlo Statistical Methods/October 29, 2001

1.1 1.2 1.3 1.4 1.5

Monte Carlo Statistical Methods/October 29, 2001

Monte Carlo Statistical Methods/October 29, 2001

Inference about based on this likelihood.

Monte Carlo Statistical Methods/October 29, 2001

Monte Carlo Statistical Methods/October 29, 2001

For instance, if X N (, 2 ) and the variable Z = X Y = min(X, Y ) is distributed as z z z + 1

Monte Carlo Statistical Methods/October 29, 2001

where a () Dirac mass at a.

Monte Carlo Statistical Methods/October 29, 2001

{p1 f1 (xi ) + + pk fk (xi )} .

Monte Carlo Statistical Methods/October 29, 2001

Global justications from asymptotics

Monte Carlo Statistical Methods/October 29, 2001

log L(, |x1 , , xn ) = log

1 = log x1 exi / () i i=1 = n log ()

Monte Carlo Statistical Methods/October 29, 2001

Solving log L(, |x1 , , xn ) = 0 is straightforward Yields the MLE =

Monte Carlo Statistical Methods/October 29, 2001

Monte Carlo Statistical Methods/October 29, 2001

Monte Carlo Statistical Methods/October 29, 2001

When p known and and both unknown, the likelihood

Monte Carlo Statistical Methods/October 29, 2001

Multiplicity of modes of the likelihood from C(, 1) when n = 3 and x1 = 0, x2 = 5, x3 = 9.

Monte Carlo Statistical Methods/October 29, 2001

Monte Carlo Statistical Methods/October 29, 2001

Then, whatever n, the likelihood is unbounded:

Monte Carlo Statistical Methods/October 29, 2001

Sample from (1)

Monte Carlo Statistical Methods/October 29, 2001

Monte Carlo Statistical Methods/October 29, 2001

Monte Carlo Statistical Methods/October 29, 2001

Monte Carlo Statistical Methods/October 29, 2001

Monte Carlo Statistical Methods/October 29, 2001

Monte Carlo Statistical Methods/October 29, 2001

Monte Carlo Statistical Methods/October 29, 2001

Monte Carlo Statistical Methods/October 29, 2001

Monte Carlo Statistical Methods/October 29, 2001

Deterministic Numerical Methods

that converges to a solution of f (x) = 0. [Note that f is a matrix x in multidimensional settings.]

Monte Carlo Statistical Methods/October 29, 2001

F denotes the matrix of second derivatives of F .

Monte Carlo Statistical Methods/October 29, 2001

Numerical computation of an integral

(xi+1 xi )(h(xi ) + h(xi+1 )) ,

where the xi s constitute an ordered partition of [a, b], or Simpsons rule = I 3

Monte Carlo Statistical Methods/October 29, 2001

Simulation versus numerical analysis: when is it useful?

Monte Carlo Statistical Methods/October 29, 2001

Monte Carlo Statistical Methods/October 29, 2001

Monte Carlo Statistical Methods/October 29, 2001

Monte Carlo Statistical Methods/October 29, 2001

Random Variable Generation

2.1 Basic methods 2.2 Beyond Uniform Distribution

Monte Carlo Statistical Methods/October 29, 2001

Monte Carlo Statistical Methods/October 29, 2001

Monte Carlo Statistical Methods/October 29, 2001

Monte Carlo Statistical Methods/October 29, 2001