

Monte Carlo Statistical Methods
Christian P. Robert, Université Paris Dauphine


Based on the book Monte Carlo Statistical Methods by Christian P. Robert and George Casella, Springer-Verlag, 1999.


Introduction

1.1 Statistical Models
1.2 Likelihood Methods
1.3 Bayesian Methods
1.4 Deterministic Numerical Methods
1.5 Simulation versus numerical analysis


The experimenter's choice before fast computers:
- describe an accurate model, which usually precludes computation of explicit answers, or
- choose a standard model which allows such computations, but may not be a close representation of a realistic model.
Such problems contributed to the development of simulation-based inference.


1.1 Statistical Models

Example 1 (Censored data models). Missing data models, where densities are not sampled directly.
Typical simple statistical model: we observe Y_1, …, Y_n ~ f(y|θ). The distribution of the sample is given by the product
∏_{i=1}^n f(y_i|θ).
Inference about θ is based on this likelihood.


With censored random variables, the actual observations are Y_i* = min{Y_i, u}, where u is the censoring point. Inference about θ is then based on the censored likelihood.


For instance, if X ~ N(θ, σ²) and Y ~ N(μ, ρ²), the variable Z = X ∧ Y = min(X, Y) has density
f(z) = (1/σ) φ((z − θ)/σ) [1 − Φ((z − μ)/ρ)] + (1/ρ) φ((z − μ)/ρ) [1 − Φ((z − θ)/σ)],
where φ and Φ are the density and cdf of the normal N(0, 1) distribution.


Similarly, if X ~ Weibull(α, β), with density
f(x) = αβ x^{α−1} exp(−β x^α),
the variable censored at a constant ω, Z = X ∧ ω, has density
f(z) = αβ z^{α−1} e^{−β z^α} 𝟙(z ≤ ω) + ( ∫_ω^∞ αβ x^{α−1} e^{−β x^α} dx ) δ_ω(z),
where δ_a(·) is the Dirac mass at a.


Example 2 (Mixture models). Models of mixtures of distributions: X ~ f_j with probability p_j, for j = 1, 2, …, k, with overall density
X ~ p_1 f_1(x) + ⋯ + p_k f_k(x).
For a sample of independent random variables (X_1, …, X_n), the sample density is
∏_{i=1}^n { p_1 f_1(x_i) + ⋯ + p_k f_k(x_i) }.
Expanding this product involves k^n elementary terms: prohibitive to compute for large samples.


1.2 Likelihood Methods

Maximum Likelihood Methods. For an iid sample X_1, …, X_n from a population with density f(x|θ_1, …, θ_k), the likelihood function is
L(θ|x) = L(θ_1, …, θ_k | x_1, …, x_n) = ∏_{i=1}^n f(x_i | θ_1, …, θ_k).
Global justifications from asymptotics.


Example 3 (Gamma MLE). X_1, …, X_n iid observations from the gamma density
f(x|α, β) = 1/(Γ(α) β^α) x^{α−1} e^{−x/β},
where α is known. Log-likelihood:
log L(α, β | x_1, …, x_n) = log ∏_{i=1}^n f(x_i|α, β)
= −n log Γ(α) − nα log β + (α − 1) Σ_{i=1}^n log x_i − Σ_{i=1}^n x_i/β.


Solving ∂ log L(α, β | x_1, …, x_n)/∂β = 0 is straightforward and yields the MLE
β̂ = Σ_{i=1}^n x_i / (nα).


When α is also unknown, the additional equation ∂ log L(α, β | x_1, …, x_n)/∂α = 0 is particularly nasty! It involves difficult computations (including the derivative of the gamma function, the digamma function). An explicit solution is no longer possible.


Example 4 (Student's t distribution). A reasonable alternative to normal errors: T(p, θ, σ) is more robust against possible modeling errors. The density of T(p, θ, σ) is proportional to
σ^{−1} [1 + (x − θ)²/(p σ²)]^{−(p+1)/2}.


When p is known and θ and σ are both unknown, the likelihood
σ^{−n} ∏_{i=1}^n [1 + (x_i − θ)²/(p σ²)]^{−(p+1)/2}
may have up to n local maxima, each of which needs to be calculated to determine the global maximum.


[Figure: Multiplicity of modes of the likelihood from C(θ, 1) when n = 3 and x_1 = 0, x_2 = 5, x_3 = 9.]


Example 5 (Mixtures again). For a mixture of two normal distributions,
p N(μ, τ²) + (1 − p) N(θ, σ²),
the likelihood is proportional to
∏_{i=1}^n [ p τ^{−1} φ((x_i − μ)/τ) + (1 − p) σ^{−1} φ((x_i − θ)/σ) ],
containing 2^n terms. Standard maximization techniques often fail to find the global maximum because of the multimodality of the likelihood function.


In the special case
f(x|θ, σ) = (1 − ε) φ(x) + (ε/σ) φ((x − θ)/σ),   (1)
with ε > 0 known, then, whatever n, the likelihood is unbounded:
lim_{σ→0} ℓ(θ = x_1, σ | x_1, …, x_n) = ∞.


[Figure: Sample from (1) (histogram of an N(0,1) sample).]


[Figure: Likelihood of (1) as a function of (mu, sigma).]


1.3 Bayesian Methods

In the Bayesian paradigm, the information brought by the data x, a realization of X ~ f(x|θ), is combined with prior information specified by a prior distribution with density π(θ).


Summary in a probability distribution, π(θ|x), called the posterior distribution, derived from the joint distribution f(x|θ)π(θ) according to
π(θ|x) = f(x|θ)π(θ) / ∫ f(x|θ)π(θ) dθ   [Bayes' Theorem]
where m(x) = ∫ f(x|θ)π(θ) dθ is the marginal density of X.


Example 6 (Binomial Bayes Estimator). For an observation X from the binomial distribution B(n, p), the (so-called) conjugate prior is the family of beta distributions Be(a, b). The classical Bayes estimator is the posterior mean
[Γ(a + b + n) / (Γ(a + x) Γ(n − x + b))] ∫_0^1 p · p^{x+a−1} (1 − p)^{n−x+b−1} dp = (x + a)/(a + b + n).


The curse of conjugate priors. The use of conjugate priors for computational reasons
- implies a restriction on the modeling of the available prior information,
- may be detrimental to the usefulness of the Bayesian approach,
- gives an impression of subjective manipulation of the prior information, disconnected from reality.


Example 7 (Logistic Regression). Standard regression model for binary (0-1) responses: the distribution of Y ∈ {0, 1} is modeled by
P(Y = 1) = p = exp(xᵗβ) / (1 + exp(xᵗβ)).
Equivalently, the logit transform of p, logit(p) = log[p/(1 − p)], satisfies logit(p) = xᵗβ.


Computation of a confidence region on β is quite delicate when π(β|x) is not explicit. In particular, when the confidence region involves only one component of a vector parameter, its calculation requires the integration of the joint distribution over all the other parameters.


Example 8 (Cauchy confidence regions). X_1, …, X_n an iid sample from the Cauchy distribution C(θ, σ), with prior π(θ, σ) = 1/σ. A confidence region on θ is then based on
π(θ | x_1, …, x_n) ∝ ∫_0^∞ σ^{−n−1} ∏_{i=1}^n [1 + ((x_i − θ)/σ)²]^{−1} dσ,
an integral which cannot be evaluated explicitly. Similar computational problems arise with likelihood estimation in this model.


1.4 Deterministic Numerical Methods

To solve an equation of the form f(x) = 0, the Newton-Raphson algorithm produces a sequence (x_n):
x_{n+1} = x_n − (∂f/∂x |_{x=x_n})^{−1} f(x_n)
that converges to a solution of f(x) = 0. [Note that ∂f/∂x is a matrix in multidimensional settings.]
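A minimal one-dimensional sketch of this iteration (not from the slides; the target function and starting point below are arbitrary illustrative choices):

```python
# Minimal Newton-Raphson sketch: x_{n+1} = x_n - f(x_n)/f'(x_n).
def newton_raphson(f, fprime, x0, tol=1e-10, max_iter=100):
    x = x0
    for _ in range(max_iter):
        step = f(x) / fprime(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Example: solve x^3 - 2 = 0, i.e. compute 2**(1/3).
root = newton_raphson(lambda x: x**3 - 2, lambda x: 3 * x**2, x0=1.0)
print(root)
```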


Optimization of a smooth function F is done using the equation ∇F(x) = 0, where ∇F denotes the gradient of F, the vector of derivatives of F.

The corresponding techniques are gradient methods, where the sequence (x_n) is
x_{n+1} = x_n − (∇∇ᵗF(x_n))^{−1} ∇F(x_n),
where ∇∇ᵗF denotes the matrix of second derivatives of F.


Numerical computation of an integral
I = ∫_a^b h(x) dx
can be done by Riemann integration or by improved techniques like the trapezoidal rule
Î = (1/2) Σ_{i=1}^{n−1} (x_{i+1} − x_i) (h(x_i) + h(x_{i+1})),
where the x_i's constitute an ordered partition of [a, b], or Simpson's rule (for a partition of [a, b] into 2n intervals of constant width δ)
Î = (δ/3) [ h(a) + 4 Σ_{i=1}^{n} h(x_{2i−1}) + 2 Σ_{i=1}^{n−1} h(x_{2i}) + h(b) ].


1.5 Simulation versus numerical analysis: when is it useful?

- Numerical methods do not take into account the probabilistic aspects of the problem.
- Numerical integration often focuses on regions of low probability.
- The occurrence of local modes of a likelihood often causes more problems for a deterministic gradient method than for simulation methods.


But simulation methods very rarely take into account the specific analytical form of the functions. (For instance, because of the randomness induced by the simulation, a gradient method yields a much faster determination of the mode of a unimodal density.) For small dimensions, integration by Riemann sums or by quadrature converges much faster than the mean of a simulated sample. It is thus often reasonable to use a numerical approach when dealing with regular functions in small dimensions.


When the statistician needs to study the details of a likelihood surface or posterior distribution needs to simultaneously estimate several features of these functions or when the distributions are highly multimodal it is preferable to use a simulation-based approach


Fruitless to advocate the superiority of one method over the other More reasonable to justify the use of simulation-based methods by the statistician in terms of his/her expertise. The intuition acquired by a statistician in his/her every-day processing of random models can be directly exploited in the implementation of simulation techniques


Random Variable Generation

2.1 Basic Methods
2.2 Beyond Uniform Distributions


Rely on the possibility of producing (computer-wise) an endless flow of random variables (usually iid) from well-known distributions. Given a uniform random number generator, illustration of methods that produce random variables from both standard and nonstandard distributions.


2.1 Basic Methods

2.1.1 Introduction

For a function F on ℝ, the generalized inverse of F, F⁻, is defined by
F⁻(u) = inf {x; F(x) ≥ u}.
Probability Integral Transform: If U ~ U[0,1], then the random variable F⁻(U) has distribution F.


Consequence: To generate a random variable X ~ F, it suffices to generate U ~ U[0,1] and then apply the transform x = F⁻(u).


2.1.2 Desiderata and Limitations

"Any one who considers arithmetical methods of reproducing random digits is, of course, in a state of sin. As has been pointed out several times, there is no such thing as a random number---there are only methods of producing random numbers, and a strict arithmetic procedure of course is not such a method." [John von Neumann, 1951]


Production of a deterministic sequence of values in [0, 1] which imitates a sequence of iid uniform random variables U[0,1]. Can't use the physical imitation of a random draw [no guarantee of uniformity, no reproducibility]. Random sequence in the sense: having generated (X_1, …, X_n), knowledge of X_n [or of (X_1, …, X_n)] imparts no discernible knowledge of the value of X_{n+1}.


Deterministic: given the initial value X_0, the sample (X_1, …, X_n) is always the same. Validity of a random number generator is based on a single sample X_1, …, X_n as n tends to +∞, not on replications (X_{11}, …, X_{1n}), (X_{21}, …, X_{2n}), …, (X_{k1}, …, X_{kn}) where n is fixed and k tends to infinity.


2.1.3 Uniform pseudo-random number generator

Algorithm starting from an initial value u_0 and a transformation D, which produces a sequence (u_i) = (D^i(u_0)) in [0, 1]. For all n, (u_1, …, u_n) reproduces the behavior of an iid U[0,1] sample (V_1, …, V_n) when compared through usual tests.


Validity of the algorithm means that the sequence U_1, …, U_n leads us to accept the hypothesis
H: U_1, …, U_n are iid U[0,1].

The set of tests used is generally of some consequence:
- Kolmogorov-Smirnov
- time series methods, for correlation between U_i and (U_{i−1}, …, U_{i−k})
- nonparametric tests
- Marsaglia's battery of tests called Die Hard (!)


2.1.4 The Kiss Generator

A real-life generated random sequence takes values on {0, 1, …, M} rather than in [0, 1] [M being the largest integer accepted by the computer].


Period, T0 , of a generator: smallest integer T such that ui+T = ui for every i, A generator of the form Xn+1 = f (Xn ) has a period no greater than M +1


Warning! A uniform generator on [0, 1] should never take the values 0 and 1 [Gentle, 1998].


Congruential generator on {0, 1, …, M}: defined by the function D(x) = (ax + b) mod (M + 1). The period and other performance measures of congruential generators depend heavily on (a, b). With a rational, the pairs (x_n, D(x_n)) lie on parallel lines.
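A toy sketch of the recursion D(x) = (ax + b) mod (M + 1); the modulus and multiplier below are common textbook choices, not the pair (a, b) analyzed in the slides:

```python
# Toy linear congruential generator: x_{n+1} = (a*x_n + b) mod m,
# rescaled to [0, 1). The constants are illustrative only.
def lcg(seed, a=1664525, b=1013904223, m=2**32):
    x = seed
    while True:
        x = (a * x + b) % m
        yield x / m

gen = lcg(seed=12345)
print([next(gen) for _ in range(5)])
```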


[Figure: Representation of the line y = 69069 x mod 1 by uniform sampling with sampling step 3 × 10⁻⁴.]


For a k × k matrix T with entries in {0, 1}, the shift register generator is given by the transformation
x_{n+1} = T x_n (mod 2),
where x_n is represented as a vector of binary coordinates e_{ni} ∈ {0, 1},
x_n = Σ_{i=0}^{k−1} e_{ni} 2^i.


To generate a sequence of integers X_1, X_2, …, the Kiss algorithm generates three sequences of integers.
- First, a congruential generator I_{n+1} = (69069 I_n + 23606797) (mod 2³²).
- Then two shift register generators (J_n) and (K_n).
- Overall sequence: X_{n+1} = (I_{n+1} + J_{n+1} + K_{n+1}) (mod 2³²).
The period of Kiss is of order 2⁹⁵. Kiss has been successfully tested on Die Hard.


2.2 Beyond Uniform Distributions

Generation of any sequence of random variables can be formally implemented through a uniform generator.
- For distributions with an explicit form of F⁻ (for instance, exponential, double-exponential or Weibull distributions), the Probability Integral Transform can be implemented.
- Case-specific methods, which rely on properties of the distribution (for instance, normal distribution, Poisson distribution).


More general (indirect) methods exist, for example the accept-reject and the ratio-of-uniforms methods. Simulation of the standard distributions is accomplished quite efficiently by many statistical programming packages (for instance, IMSL, Gauss, Mathematica, Matlab/Scilab, Splus/R).


2.2.1 Transformation Methods

Case where a distribution F is linked in a simple way to another distribution that is easy to simulate.

Example 9 (Exponential variables). If U ~ U[0,1], the random variable X = −log U / λ has distribution
P(X ≤ x) = P(−log U ≤ λx) = P(U ≥ e^{−λx}) = 1 − e^{−λx},
the exponential distribution Exp(λ).
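A short sketch of this transformation, assuming numpy is available:

```python
import numpy as np

# Exponential Exp(lam) variables via the probability integral transform:
# if U ~ U[0,1], then X = -log(U)/lam ~ Exp(lam).
rng = np.random.default_rng(0)

def rexp(n, lam):
    u = rng.uniform(size=n)
    return -np.log(u) / lam

x = rexp(10_000, lam=2.0)
print(x.mean())   # should be close to 1/lam = 0.5
```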


Other random variables that can be generated starting from an exponential include
Y = −2 Σ_{j=1}^ν log(U_j) ~ χ²_{2ν},
Y = −β Σ_{j=1}^a log(U_j) ~ Ga(a, β),
Y = Σ_{j=1}^a log(U_j) / Σ_{j=1}^{a+b} log(U_j) ~ Be(a, b).


Points to note:
- The transformation is quite simple to use.
- There are more efficient algorithms for gamma and beta random variables.
- Cannot generate gamma random variables with a non-integer shape parameter.
- For instance, cannot get a χ²₁ variable, which would give us a N(0, 1) variable.


Example 10 (Normal variables). If (r, θ) are the polar coordinates of (X_1, X_2), with X_1, X_2 iid N(0, 1), then
r² = X_1² + X_2² ~ χ²₂ = Exp(1/2)
and θ is uniform on [0, 2π].
Consequence: If U_1, U_2 are iid U[0,1],
X_1 = √(−2 log U_1) cos(2π U_2),  X_2 = √(−2 log U_1) sin(2π U_2)
are iid N(0, 1).


Box-Muller Algorithm:
1. Generate U_1, U_2 iid U[0,1];
2. Define
x_1 = √(−2 log u_1) cos(2π u_2),
x_2 = √(−2 log u_1) sin(2π u_2);
3. Take x_1 and x_2 as two independent draws from N(0, 1).
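A minimal numpy sketch of the Box-Muller transform:

```python
import numpy as np

# Box-Muller: two iid N(0,1) draws from two iid U[0,1] draws.
def box_muller(n, rng=np.random.default_rng(1)):
    u1 = rng.uniform(size=n)
    u2 = rng.uniform(size=n)
    r = np.sqrt(-2.0 * np.log(u1))
    return r * np.cos(2 * np.pi * u2), r * np.sin(2 * np.pi * u2)

x1, x2 = box_muller(100_000)
print(x1.mean(), x1.std())   # approximately 0 and 1
```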


Unlike algorithms based on the CLT, this algorithm is exact Get two normals for the price of two uniforms Drawback (in speed) in calculating log, cos and sin.


Example 11 (Poisson generation). Poisson-exponential connection: If N ~ P(λ) and X_i ~ Exp(λ), i ∈ ℕ*, then
P(N = k) = P(X_1 + ⋯ + X_k ≤ 1 < X_1 + ⋯ + X_{k+1}).


A Poisson variable can be simulated by generating exponentials until their sum exceeds 1. This method is simple, but is really practical only for smaller values of λ. On average, the number of exponential variables required is λ. Other approaches are more suitable for large λ's.
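A sketch of this generator (assuming numpy), directly following the Poisson-exponential connection above:

```python
import numpy as np

# Poisson(lam) via exponential inter-arrival times: count how many
# Exp(lam) variables (generated as -log(U)/lam) fit before their sum
# exceeds 1.
def rpois_exp(lam, rng=np.random.default_rng(2)):
    total, n = 0.0, 0
    while True:
        total += -np.log(rng.uniform()) / lam
        if total > 1.0:
            return n
        n += 1

draws = [rpois_exp(3.5) for _ in range(10_000)]
print(np.mean(draws))   # close to lam = 3.5
```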


A generator of Poisson random variables can also produce negative binomial random variables since
Y ~ Ga(n, (1 − p)/p) and X|y ~ P(y) imply X ~ Neg(n, p).


Mixture representation. The representation of the negative binomial is a particular case of a mixture distribution. The principle of a mixture representation is to represent a density f as the marginal of another distribution, for example
f(x) = Σ_{i∈𝒴} p_i f_i(x).
If the component distributions f_i(x) can be easily generated, X can be obtained by first choosing f_i with probability p_i and then generating an observation from f_i.


2.2.2 Accept-Reject Methods

There are many distributions from which it is difficult, or even impossible, to directly simulate. Another class of methods only requires us to know the functional form of the density f of interest up to a multiplicative constant. The key to this method is to use a simpler (simulation-wise) density g, the instrumental density, from which the simulation from the target density f is actually done.


Accept-Reject method. Given a density of interest f, find a density g and a constant M such that f(x) ≤ M g(x) on the support of f.

1. Generate X ~ g, U ~ U[0,1];
2. Accept Y = X if U ≤ f(X)/(M g(X));
3. Return to 1 otherwise.
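A generic sketch of this loop (assuming numpy); the target, proposal, and bound in the demo are those of the normal-from-Cauchy setting of Example 12 below:

```python
import numpy as np

def accept_reject(f, g_sample, g_pdf, M, n, rng=np.random.default_rng(3)):
    """Draw n variables from density f (known up to a constant) using
    proposal g, where f(x) <= M * g_pdf(x) on the support of f."""
    out = []
    while len(out) < n:
        x = g_sample(rng)
        if rng.uniform() <= f(x) / (M * g_pdf(x)):
            out.append(x)
    return np.array(out)

# N(0,1) target, Cauchy C(0,1) proposal, M = sqrt(2*pi/e) (Example 12).
f = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
g_pdf = lambda x: 1.0 / (np.pi * (1 + x**2))
g_sample = lambda rng: rng.standard_cauchy()
M = np.sqrt(2 * np.pi / np.e)

x = accept_reject(f, g_sample, g_pdf, M, n=10_000)
print(x.mean(), x.std())   # approximately 0 and 1
```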


Validation of the Accept-Reject method This algorithm produces a variable Y distributed according to f


Uniform repartition under the graph of f of accepted points


Two interesting properties:
- First, it provides a generic method to simulate from any density f that is known up to a multiplicative factor. This property is particularly important in Bayesian calculations: there, the posterior distribution
π(θ|x) ∝ π(θ) f(x|θ)
is specified up to a normalizing constant.
- Second, the probability of acceptance in the algorithm is 1/M, i.e., the expected number of trials until a variable is accepted is M.


Some intuition: when f and g are both probability densities, the constant M is necessarily larger than 1. The size of M, and thus the efficiency of the algorithm, is a function of how closely g can imitate f, especially in the tails. For f/g to remain bounded, it is necessary for g to have tails thicker than those of f. It is therefore impossible to use the A-R algorithm to simulate a Cauchy distribution f using a normal distribution g; however, the reverse works quite well.


Example 12 (Normal from a Cauchy).
f(x) = (1/√(2π)) exp(−x²/2) and g(x) = (1/π) · 1/(1 + x²),
the densities of the normal and Cauchy distributions.
f(x)/g(x) = √(π/2) (1 + x²) e^{−x²/2} ≤ √(2π/e) = 1.52,
attained at x = ±1.


So probability of acceptance 1/1.52 = 0.66, and, on the average, one out of every three simulated Cauchy variables is rejected. Mean number of trials to success 1.52.


Example 13 (Normal/Double Exponential). Generate a N(0, 1) by using a double-exponential distribution with density g(x|α) = (α/2) exp(−α|x|):
f(x)/g(x|α) ≤ √(2/π) α^{−1} e^{α²/2},
and the minimum of this bound (in α) is attained for α = 1. Probability of acceptance √(π/2e) = 0.76: to produce one normal random variable, this Accept-Reject algorithm requires on average 1/0.76 ≈ 1.3 uniform variables. Compare with the fixed single uniform required by the Box-Muller algorithm.


Example 14 (Gamma with non-integer shape parameter). Illustrates a real advantage of the Accept-Reject algorithm: the gamma distribution Ga(α, β) can be represented as a sum of exponential random variables only if α is an integer.


Can use the Accept-Reject algorithm with instrumental distribution Ga(a, b), with a = [α] (and, without loss of generality, β = 1). Up to a normalizing constant,
f/g_b = b^{−a} x^{α−a} exp{−(1 − b)x} ≤ b^{−a} ((α − a)/((1 − b)e))^{α−a}
for b ≤ 1. The bound is minimized (i.e., the acceptance probability is maximized) at b = a/α.


Example 15 (Truncated Normal distributions). Truncated normals appear in many contexts. Constraints x ≥ μ⁻ produce densities proportional to
e^{−(x−μ)²/2σ²} 𝟙(x ≥ μ⁻)
for a bound μ⁻ large compared with μ. Alternatives are far superior to the naïve method of generating N(μ, σ²) variables until one exceeds μ⁻, which requires an average number of 1/Φ((μ − μ⁻)/σ) simulations from N(μ, σ²) for one acceptance.


Instrumental distribution: the translated exponential distribution Exp(α, μ⁻), with density
g_α(z) = α e^{−α(z−μ⁻)} 𝟙(z ≥ μ⁻).
The ratio f/g_α is bounded by
f/g_α ≤ (1/α) exp(α²/2 − αμ⁻) if α > μ⁻, and (1/α) exp(−(μ⁻)²/2) otherwise.


Monte Carlo Integration

3.1 Introduction
3.2 Classical Monte Carlo Integration
3.3 Importance Sampling
3.4 Acceleration Methods


3.1 Introduction

Two major classes of numerical problems that arise in statistical inference:
- optimization, generally associated with the likelihood approach
- integration, generally associated with the Bayesian approach


Example 16 (Bayes median). Bayes estimators are not always posterior expectations, but rather solutions of the minimization problem
min_δ ∫_Θ L(δ, θ) π(θ) f(x|θ) dθ.
For the quadratic loss L(δ, θ) = (δ − θ)², the Bayes estimator is the posterior mean.


For the absolute error loss L(δ, θ) = |δ − θ|, the Bayes estimator is the posterior median, the solution δ of the equation
∫_{θ ≤ δ} π(θ) f(x|θ) dθ = ∫_{θ ≥ δ} π(θ) f(x|θ) dθ,
which can be quite complicated.


3.2 Classical Monte Carlo integration

Generic problem of evaluating the integral
IE_f[h(X)] = ∫_𝒳 h(x) f(x) dx,
where 𝒳 is uni- or multidimensional, f is a closed-form, partly closed-form, or implicit density, and h is a function.


First use a sample (X_1, …, X_m) from the density f to approximate the integral by the empirical average
h̄_m = (1/m) Σ_{j=1}^m h(x_j).
Then h̄_m → IE_f[h(X)] by the Strong Law of Large Numbers.

Estimate the variance with
v_m = (1/m) · 1/(m − 1) Σ_{j=1}^m [h(x_j) − h̄_m]²,
and for m large,
(h̄_m − IE_f[h(X)]) / √v_m ~ N(0, 1).
Note: This can lead to the construction of a convergence test and of confidence bounds on the approximation of IE_f[h(X)].
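A small numpy sketch of the empirical average with its CLT-based error assessment; the integrand h(x) = exp(−x) with X ~ N(0, 1) is an arbitrary illustrative choice (exact value exp(1/2) ≈ 1.6487):

```python
import numpy as np

rng = np.random.default_rng(4)
m = 100_000
x = rng.standard_normal(m)          # sample from f = N(0,1)
h = np.exp(-x)                      # h(X)

h_bar = h.mean()                    # Monte Carlo estimate of E_f[h(X)]
v_m = h.var(ddof=1) / m             # estimated variance of h_bar
ci = (h_bar - 1.96 * np.sqrt(v_m), h_bar + 1.96 * np.sqrt(v_m))
print(h_bar, ci)
```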


Example 17 (Cauchy prior). For estimating a normal mean, a robust prior is a Cauchy prior:
X ~ N(θ, 1), θ ~ C(0, 1).
Under squared error loss, the posterior mean is
δπ(x) = ∫ [θ/(1 + θ²)] e^{−(x−θ)²/2} dθ / ∫ [1/(1 + θ²)] e^{−(x−θ)²/2} dθ.


The form of δπ suggests simulating iid variables θ_1, …, θ_m ~ N(x, 1) and calculating
δ̂π_m(x) = Σ_{i=1}^m θ_i/(1 + θ_i²) / Σ_{i=1}^m 1/(1 + θ_i²).
The Law of Large Numbers implies δ̂π_m(x) → δπ(x) as m → ∞.


Example 18 (Normal cdf). Approximation of the normal cdf
Φ(t) = ∫_{−∞}^t (1/√(2π)) e^{−y²/2} dy
by the Monte Carlo method:
Φ̂(t) = (1/n) Σ_{i=1}^n 𝟙(x_i ≤ t),
based on generating a sample of size n, (x_1, …, x_n), using the Box-Muller algorithm.


Exact variance Φ(t)(1 − Φ(t))/n, as the variables 𝟙(x_i ≤ t) are iid Bernoulli(Φ(t)). For values of t around t = 0 the variance is approximately 1/4n: to achieve a precision of four decimals, the approximation requires on average n = 2 × 10⁸ simulations, that is, 200 million iterations. Greater accuracy is achieved in the tails.


3.3 Importance Sampling

Simulation from f (the true density) is not necessarily optimal. An alternative to direct sampling from f is importance sampling, based on the alternative representation
IE_f[h(X)] = ∫_𝒳 h(x) [f(x)/g(x)] g(x) dx,
which allows us to use distributions other than f.


Evaluation of
IE_f[h(X)] = ∫_𝒳 h(x) f(x) dx
by:
1. Generate a sample X_1, …, X_m from a distribution g.
2. Use the approximation
(1/m) Σ_{j=1}^m [f(X_j)/g(X_j)] h(X_j).
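A sketch of these two steps (assuming numpy), in the spirit of Example 19 below: the target is a Student t_ν density, h(x) = x⁵ 𝟙(x > 2.1), and the instrumental distribution is the Cauchy C(0, 1); the value ν = 12 is an arbitrary illustrative choice.

```python
import numpy as np
from math import lgamma, pi

nu = 12.0  # illustrative degrees of freedom

def t_pdf(x, nu=nu):
    c = np.exp(lgamma((nu + 1) / 2) - lgamma(nu / 2)) / np.sqrt(nu * pi)
    return c * (1 + x**2 / nu) ** (-(nu + 1) / 2)

cauchy_pdf = lambda x: 1.0 / (pi * (1 + x**2))
h = lambda x: x**5 * (x > 2.1)

rng = np.random.default_rng(5)
x = rng.standard_cauchy(200_000)                  # sample from g
est = np.mean(h(x) * t_pdf(x) / cauchy_pdf(x))    # importance sampling estimate
print(est)
```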


The estimator
(1/m) Σ_{j=1}^m [f(X_j)/g(X_j)] h(X_j) → ∫_𝒳 h(x) f(x) dx


for the same reason the regular Monte Carlo estimator h̄_m converges; it converges for any choice of the distribution g [as long as supp(g) ⊃ supp(f)]. The instrumental distribution g is chosen from distributions that are easy to simulate. The same sample (generated from g) can be used repeatedly, not only for different functions h, but also for different densities f.


Although g can be any density, some choices are better than others. Finite variance only when
IE_f[h²(X) f(X)/g(X)] = ∫_𝒳 h²(x) [f²(x)/g(x)] dx < ∞.
- Instrumental distributions with tails lighter than those of f (that is, with sup f/g = ∞) are not appropriate.
- If sup f/g = ∞, the weights f(x_j)/g(x_j) vary widely, giving too much importance to a few values x_j.


The choice of g that minimizes the variance of the importance sampling estimator is
g*(x) = |h(x)| f(x) / ∫_𝒵 |h(z)| f(z) dz.
Rather formal optimality result, since the optimal choice of g*(x) requires the knowledge of ∫ h(x)f(x)dx, the integral of interest!


Practical alternative:
Σ_{j=1}^m h(X_j) f(X_j)/g(X_j) / Σ_{j=1}^m f(X_j)/g(X_j),
where f and g are known up to constants.
- Also converges to ∫ h(x)f(x)dx by the Strong Law of Large Numbers.
- Biased, but the bias is quite small.
- In some settings beats the unbiased estimator in squared error loss.


Example 19 (Student's t distribution). X ~ T(ν, θ, σ²), with density
f(x) = [Γ((ν + 1)/2) / (σ √(νπ) Γ(ν/2))] (1 + (x − θ)²/(νσ²))^{−(ν+1)/2}.
Without loss of generality, take θ = 0, σ = 1. Calculate the integral
∫_{2.1}^∞ x⁵ f(x) dx.


Simulation possibilities:
- Directly from f, since f is the distribution of N(0, 1)/√(χ²_ν/ν)
- Importance sampling using the Cauchy C(0, 1)
- Importance sampling using a normal (expected to be nonoptimal)
- Importance sampling using a U([0, 1/2.1])


Simulation results: Uniform is best Cauchy is OK f and Normal are rotten




[Figure: Sampling from f (solid lines), importance sampling with Cauchy instrumental (short dashes), U([0, 1/2.1]) instrumental (long dashes), and normal instrumental (dots).]


3.4 Acceleration Methods

(a) Use correlation to reduce variance. Specialized but efficient if applicable. With two samples (X_1, …, X_m) and (Y_1, …, Y_m) from f, to estimate
I = ∫_ℝ h(x) f(x) dx,
denote
Î_1 = (1/m) Σ_{i=1}^m h(X_i) and Î_2 = (1/m) Σ_{i=1}^m h(Y_i),
each having mean I and variance σ².


The variance of the average is
var((Î_1 + Î_2)/2) = σ²/2 + (1/2) Cov(Î_1, Î_2).
So negatively correlated samples are better than independent samples.


(b) Antithetic Variables: constructing negatively correlated variables.
- If f is symmetric around μ, take Y_i = 2μ − X_i.
- If X_i = F⁻¹(U_i), take Y_i = F⁻¹(1 − U_i).
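A quick sketch of the second construction with uniform variables (assuming numpy); h(u) = exp(u) is an arbitrary monotone choice, for which the pairs (U, 1 − U) are negatively correlated (exact value e − 1):

```python
import numpy as np

rng = np.random.default_rng(6)
h = np.exp
m = 50_000

u = rng.uniform(size=m)
plain = h(rng.uniform(size=2 * m)).mean()       # 2m independent evaluations
antithetic = 0.5 * (h(u) + h(1.0 - u)).mean()   # 2m evaluations via m antithetic pairs
print(plain, antithetic, np.e - 1)
```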


(c) Control Variates: another strategy. Suppose I = ∫ h(x)f(x)dx is the desired integral and I_0 = ∫ h_0(x)f(x)dx is known. Estimate I with Î and I_0 with Î_0, and construct the combined estimator
Î* = Î + β(I_0 − Î_0).
Î* is unbiased for I and
var(Î*) = var(Î) + β² var(Î_0) − 2β Cov(Î, Î_0).


For the optimal choice
β* = Cov(Î, Î_0) / var(Î_0),
we have
var(Î*) = (1 − ρ²) var(Î),
where ρ is the correlation between Î and Î_0.


Example 20 (Control variate integration). Evaluate
P(X > a) = ∫_a^∞ f(x) dx
and start with
δ̂_1 = (1/n) Σ_{i=1}^n 𝟙(X_i > a),   X_i's iid from f.
Suppose we know P(X > μ) = 1/2. The control variate estimator
δ̂_2 = (1/n) Σ_{i=1}^n 𝟙(X_i > a) + β [ (1/n) Σ_{i=1}^n 𝟙(X_i > μ) − P(X > μ) ]
improves on δ̂_1 if
β < 0 and |β| < 2 cov(δ̂_1, δ̂_3)/var(δ̂_3) = 2 P(X > a)/P(X > μ),
where δ̂_3 = (1/n) Σ_{i=1}^n 𝟙(X_i > μ).


(d) Another method: Conditional Expectations. Use the conditioning inequality
var(IE[δ(X)|Y]) ≤ var(δ(X)),
sometimes called Rao-Blackwellization. If δ̂ is an estimator of I = IE_f[h(X)], based on X simulated from the joint distribution f(x, y), such that ∫ f(x, y) dy = f(x), the estimator
δ̂* = IE_f[δ̂ | y_1, …, y_n]
dominates δ̂(x_1, …, x_n) in terms of variance.


Example 21 (Student's t expectation). Compute IE[h(X)] with h(x) = exp(−x²) when X ~ T(ν, 0, σ²).

The Student's t distribution can be simulated as
X | y ~ N(μ, σ² y) with Y⁻¹ ~ χ²_ν/ν   (here μ = 0).


Therefore, the empirical average
(1/m) Σ_{j=1}^m exp(−X_j²)
can be improved upon using the sample ((X_1, Y_1), …, (X_m, Y_m)), since
(1/m) Σ_{j=1}^m IE[exp(−X²)|Y_j] = (1/m) Σ_{j=1}^m 1/√(2σ²Y_j + 1)
is the conditional expectation. The conditional expectation can have ten times greater precision.
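A numpy sketch comparing both estimators for the parameter values of the figure below, (ν, μ, σ) = (4.6, 0, 1):

```python
import numpy as np

# Rao-Blackwellization: average E[exp(-X^2)|Y] = 1/sqrt(2*sigma^2*Y + 1)
# instead of exp(-X_j^2), where X|Y ~ N(0, sigma^2 Y) and 1/Y ~ chi^2_nu/nu.
nu, sigma = 4.6, 1.0
rng = np.random.default_rng(7)
m = 10_000

y = nu / rng.chisquare(nu, size=m)        # 1/Y ~ chi^2_nu / nu
x = rng.normal(0.0, sigma * np.sqrt(y))   # X | Y ~ N(0, sigma^2 Y)

simple = np.mean(np.exp(-x**2))
rao_blackwell = np.mean(1.0 / np.sqrt(2 * sigma**2 * y + 1))
print(simple, rao_blackwell)
```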


[Figure: Estimators of IE[exp(−X²)]: simple average (solid lines) and conditional expectation (dots) for (ν, μ, σ) = (4.6, 0, 1).]


Markov Chains

4.1 Basic Notions
4.2 Irreducibility
4.3 Transience/Recurrence
4.4 Invariant Measures
4.5 Ergodicity and stationarity
4.6 Limit Theorems


Use of Markov chains: many algorithms can be described as Markov chains. Needed properties: the quantity of interest is what the chain converges to. We need to know
- when chains will converge,
- what they converge to.


4.1 Basic notions

A Markov chain is a sequence of random variables that can be thought of as evolving over time; the probability of a transition depends on the particular set that the chain is in. The chain is defined through its transition kernel. A transition kernel is a function K defined on 𝒳 × B(𝒳) such that
(i) ∀x ∈ 𝒳, K(x, ·) is a probability measure;
(ii) ∀A ∈ B(𝒳), K(·, A) is measurable.


When 𝒳 is discrete, the transition kernel simply is a (transition) matrix K with elements
P_xy = P(X_n = y | X_{n−1} = x),   x, y ∈ 𝒳.
In the continuous case, the kernel also denotes the conditional density K(x, x′) of the transition K(x, ·):
P(X ∈ A | x) = ∫_A K(x, x′) dx′.


Given a transition kernel K, a sequence X_0, X_1, …, X_n, … of random variables is a Markov chain, denoted by (X_n), if, for any t, the conditional distribution of X_t given x_{t−1}, x_{t−2}, …, x_0 is the same as the distribution of X_t given x_{t−1}. That is,
P(X_{k+1} ∈ A | x_0, x_1, …, x_k) = P(X_{k+1} ∈ A | x_k) = ∫_A K(x_k, dx).


Example 22 (AR(1) Models). Simple illustration of Markov chains on a continuous state space:
X_n = θ X_{n−1} + ε_n, with ε_n ~ N(0, σ²), θ ∈ ℝ.
If the ε_n's are independent, X_n is independent from X_{n−2}, X_{n−3}, … conditionally on X_{n−1}.
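A minimal numpy simulation of this chain; the values θ = 0.8 and σ = 1 are illustrative, and the stationary variance σ²/(1 − θ²) quoted in the check comes from Example 24 below:

```python
import numpy as np

# AR(1) chain: X_n = theta * X_{n-1} + eps_n, eps_n ~ N(0, sigma^2).
def ar1(n, theta=0.8, sigma=1.0, x0=0.0, rng=np.random.default_rng(8)):
    x = np.empty(n)
    x[0] = x0
    for t in range(1, n):
        x[t] = theta * x[t - 1] + sigma * rng.standard_normal()
    return x

chain = ar1(50_000)
print(chain.var(), 1.0 / (1 - 0.8**2))   # empirical vs stationary variance
```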


Note that the entire structure of the chain only depends on The transition function K The initial state x0 or initial distribution X0


4.2 Irreducibility

Irreducibility is one measure of the sensitivity of the Markov chain to initial conditions; it leads to a guarantee of convergence. In the discrete case, the chain is irreducible if all states communicate, namely if
P_x(τ_y < ∞) > 0 for all x, y ∈ 𝒳,
τ_y being the first time y is visited.


In the continuous case, the chain is φ-irreducible for some measure φ if, for some n, K^n(x, A) > 0 for all x ∈ 𝒳 and every A ∈ B(𝒳) with φ(A) > 0.


Example 23 (AR(1) again). X_{n+1} = θX_n + ε_{n+1} with the ε_n iid normal variables: the chain is irreducible; the reference measure is Lebesgue measure λ; in fact, K(x, A) > 0 for every x ∈ ℝ and every A such that λ(A) > 0.


If ε_n is uniform on [−1, 1] and θ > 1, then
X_{n+1} − X_n ≥ (θ − 1)X_n − 1 ≥ 0 for X_n ≥ 1/(θ − 1):
the chain is increasing and cannot visit previous values.


4.2.1 Cycles and Aperiodicity

Sometimes there are deterministic constraints on the moves from X_n to X_{n+1}. In the discrete case, the period of a state ω ∈ 𝒳 is
d(ω) = g.c.d. {m ≥ 1; K^m(ω, ω) > 0},
where g.c.d. is the greatest common divisor.


For an irreducible chain with period d on a finite space 𝒳, the transition matrix is (after reordering the states) a block matrix
P = [ 0  D_1  0  …  0
      0  0   D_2 …  0
      ⋮            ⋱
      D_d 0   0  …  0 ],
where the blocks D_i are stochastic matrices. From block 1 you must go to block 2, from 2 to 3, etc. You return to the initial group every d-th step.


If the chain is irreducible (so all states communicate), there is only one value for the period. An irreducible chain is aperiodic if it has period 1. If one state x ∈ 𝒳 satisfies P_xx > 0, the chain (X_n) is aperiodic, although this is not a necessary condition.


For continuous chains, a similar definition: if the transition kernel has density f(·|x_n), a sufficient condition for aperiodicity is that f(·|x_n) is positive in a neighborhood of x_n (since the chain can then remain in this neighborhood for an arbitrary number of instants before visiting any set A). For instance, in the AR(1) Example, (X_n) is aperiodic when ε_n is distributed according to U[−1,1] and |θ| < 1.


4.3 Transience and Recurrence

Irreducibility ensures that every set A will be visited by the Markov chain (X_n). This property is too weak to ensure that the trajectory of (X_n) will enter A often enough. A Markov chain must enjoy good stability properties to guarantee an acceptable approximation of the simulated model. Formalizing this stability leads to different notions of recurrence. For discrete chains, the recurrence of a state is equivalent to a probability one of sure return. This is always satisfied for irreducible chains on finite spaces.


In a finite state space 𝒳, denote the average number of visits to a state ω by
η_ω = Σ_{i=1}^∞ 𝟙_ω(X_i).
- If IE_ω[η_ω] = ∞, the state is recurrent.
- If IE_ω[η_ω] < ∞, the state is transient.
For irreducible chains, recurrence/transience is a property of the chain, not of a particular state. Similar definitions hold for the continuous case.


Stronger form of recurrence: Harris recurrence. A set A is Harris recurrent if P_x(η_A = ∞) = 1 for all x ∈ A. The chain (X_n) is Harris recurrent if it is ψ-irreducible and, for every set A with ψ(A) > 0, A is Harris recurrent. Note that P_x(η_A = ∞) = 1 implies IE_x[η_A] = ∞.


4.4 Invariant Measures

Stability increases for the chain (X_n) if the marginal distribution of X_n is independent of n. This requires the existence of a probability distribution π such that X_{n+1} ~ π if X_n ~ π.

A measure π is invariant for the transition kernel K(·, ·) if
π(B) = ∫_𝒳 K(x, B) π(dx),   ∀B ∈ B(𝒳).


The chain is positive recurrent if π is a probability measure; otherwise it is null recurrent. If π is a probability measure, it is also called the stationary distribution, since X_0 ~ π implies that X_n ~ π for every n. The stationary distribution is unique.


Example 24 (Back to AR(1)). For the AR(1) model X_n = θX_{n−1} + ε_n, with ε_n ~ N(0, σ²), θ ∈ ℝ, the transition kernel is N(θx_{n−1}, σ²), and N(μ, τ²) is stationary only if
μ = θμ and τ² = θ²τ² + σ².


These conditions imply that μ = 0 and τ² = σ²/(1 − θ²), and hence |θ| < 1.
N(0, σ²/(1 − θ²)) is the unique stationary distribution.


4.5 Ergodicity and convergence

We finally consider: to what is the chain converging? The invariant distribution π is a natural candidate for the limiting distribution. A fundamental property is ergodicity, or independence of initial conditions. In the discrete case, a state ω is ergodic if
lim_{n→∞} |K^n(ω, ω) − π(ω)| = 0.


In general, we establish convergence using the total variation norm
‖μ_1 − μ_2‖_TV = sup_A |μ_1(A) − μ_2(A)|,
and we want
‖ ∫ K^n(x, ·) μ(dx) − π ‖_TV = sup_A | ∫ K^n(x, A) μ(dx) − π(A) |
to be small.


If (X_n) is Harris positive recurrent and aperiodic, then
lim_{n→∞} ‖ ∫ K^n(x, ·) μ(dx) − π ‖_TV = 0
for every initial distribution μ. We thus take "Harris positive recurrent and aperiodic" as equivalent to ergodic. Convergence in total variation implies
lim_{n→∞} |IE_μ[h(X_n)] − IE_π[h(X)]| = 0
for every bounded function h.


There are different speeds of convergence:
- ergodic (fast)
- geometrically ergodic (faster)
- uniformly ergodic (fastest)


4.6 Limit theorems

Ergodicity determines the probabilistic properties of the average behavior of the chain. But statistical inference is also needed, made by induction from the observed sample. Even if P_x^n is close to π, this gives no direct information about averages of the X_n's under P_x.

We need LLNs and CLTs!


Classical LLNs and CLTs not directly applicable due to: Markovian dependence structure between the observations Xi Non-stationarity of the sequence


Ergodic Theorem (LLN). If the Markov chain (X_n) is Harris recurrent, then for any function h with IE_π|h| < ∞,
lim_{n→∞} (1/n) Σ_{i=1}^n h(X_i) = ∫ h(x) dπ(x).


To get a CLT, we need more assumptions. For MCMC, the easiest is reversibility: a Markov chain (X_n) is reversible if, for all n, X_{n+1}|X_{n+2} has the same distribution as X_{n+1}|X_n. The direction of time does not matter.


If the Markov chain (X_n) is Harris recurrent and reversible,
(1/√N) Σ_{n=1}^N (h(X_n) − IE_π[h]) →_L N(0, γ_h²),
where, writing h̄ = h − IE_π[h],
0 < γ_h² = IE_π[h̄²(X_0)] + 2 Σ_{k=1}^∞ IE_π[h̄(X_0) h̄(X_k)] < +∞.


Monte Carlo Optimization

5.1 Introduction
5.2 Stochastic Exploration
5.3 Stochastic Approximation


5.1 Introduction

Differences between the numerical approach and the simulation approach to the problem
max_{θ∈Θ} h(θ)
lie in the treatment of the function h. Using deterministic numerical methods, the analytical properties of the target function (e.g., convexity, boundedness, smoothness) are often paramount. For the simulation approach, we are more concerned with h from a probabilistic (rather than analytical) point of view.


Distinguish between two approaches to Monte Carlo optimization:
1. Exploratory approach. Goal: to optimize h by describing its entire range; the actual properties of h play a lesser role here.
2. Probabilistic approximation of h. Monte Carlo exploits probabilistic properties of h; this approach is mostly tied to missing data methods.


5.2 Stochastic Exploration

When Θ is bounded (which may always be achieved by reparameterization):
1. Simulate u_1, …, u_m ~ U_Θ from a uniform distribution on Θ;
2. Use the approximation h*_m = max(h(u_1), …, h(u_m)).


A more probabilistic approach: if h is positive with ∫_Θ h(θ) dθ < +∞, finding
max_{θ∈Θ} h(θ)
is equivalent to finding the modes of h.


Extension: transform h into H that satisfies
(i) H ≥ 0 and ∫_Θ H < ∞;
(ii) h and H have the same maxima.
Examples: H(θ) = exp(h(θ)/T) or H(θ) = exp{h(θ)/T}/(1 + exp{h(θ)/T}).


Example 25 (Minimization). Consider minimizing
h(x, y) = (x sin(20y) + y sin(20x))² cosh(sin(10x) x) + (x cos(10y) − y sin(10x))² cosh(cos(20y) y),
with global minimum 0 at (x, y) = (0, 0).


[Figure: Grid representation of h(x, y) on [−1, 1]².]


Properties: many local minima; standard methods may not find the global minimum. We can simulate from a density proportional to exp(−h(x, y)) and get the minimum from the resulting h(x_i, y_i)'s.


5.2.1 Stochastic gradient

Deterministic numerical method that produces a sequence
(θ_j) → θ* = arg max_{θ∈Θ} h(θ)
if the domain Θ ⊂ ℝ^d is convex and the function h is concave.


The sequence (θ_j) is constructed by
θ_{j+1} = θ_j + α_j ∇h(θ_j),   α_j > 0,   [gradient sequence]
where ∇h is the gradient of h. The sequence (α_j) is chosen to aid convergence.


Stochastic variant (stochastic perturbation). With a second sequence (β_j), define
θ_{j+1} = θ_j + (α_j / (2β_j)) Δh(θ_j, β_j ζ_j) ζ_j,
where the ζ_j are uniform on the unit sphere ‖ζ‖ = 1 and
Δh(x, y) = h(x + y) − h(x − y),
so that Δh(x, y)/(2‖y‖) approximates the derivative of h at x in the direction y.


This method does not always go along the steepest slope This can be a plus, as it may avoid being trapped in local maxima or saddlepoints


Example 26 (More minimization). Use the stochastic gradient method with our test function h, with different sequences of α_j's and β_j's, and different convergence patterns.


Choices of (α_j):
- Case 1, α_j = 1/10j: poor evaluation of the minimum and big jumps in the first iterations.
- Case 2, α_j = 1/100j: converges to the closest local minimum.
- Case 3, α_j = 1/10 log(1 + j): a slower decrease in (α_j) tends to achieve better minima.


Results of three different stochastic gradient runs:

α_j             β_j      θ_T              h(θ_T)        min_t h(θ_t)   Iterations T
1/10j           1/10j    (0.166, 1.02)    1.287         0.115          50
1/100j          1/100j   (0.629, 0.786)   0.00013       0.00013        93
1/10 log(1+j)   1/j      (0.0004, 0.245)  4.24 × 10⁻⁶   2.163 × 10⁻⁷   58

The iteration T is obtained by the stopping rule ‖θ_T − θ_{T−1}‖ < 10⁻⁵.


[Figure: (1) α_j = β_j = 1/10j. Stochastic gradient path for starting point (0.65, 0.8) (darker shades mean higher elevations).]


[Figure: (2) α_j = β_j = 1/100j. Stochastic gradient path.]


[Figure: (3) α_j = 1/10 log(1 + j), β_j = 1/j. Stochastic gradient path.]


5.2.2 Simulated Annealing

The name is borrowed from metallurgy: a metal manufactured by a slow decrease of temperature (annealing) is stronger than a metal manufactured by a fast decrease of temperature. Fundamental idea: a change of scale, called temperature, allows easier and faster exploration of the function h. Rescaling partially avoids being trapped in local maxima.


Idea: given a temperature T > 0, generate
θ_1^T, θ_2^T, … ~ π(θ) ∝ exp(h(θ)/T)
and approximate the maximum of h based on the sequence (θ_i^T).

As T → 0, the simulated values concentrate in a narrower and narrower neighborhood of the local maxima of h.


Metropolis algorithm. Starting from θ_0:
1. Generate ζ uniform in a neighborhood of θ_0;
2. The new value of θ is
θ_1 = ζ with probability ρ = exp(Δh/T) ∧ 1, and θ_1 = θ_0 with probability 1 − ρ,
where Δh = h(ζ) − h(θ_0).


Features:
- if h(ζ) ≥ h(θ_0), ζ is accepted with probability 1;
- if h(ζ) < h(θ_0), ζ may still be accepted with probability ρ ≠ 0.


Features (contd.):
- If θ_0 is a local maximum of h, the algorithm escapes from θ_0 with a probability that depends on T.
- Usually, the simulated annealing algorithm modifies the temperature T at each iteration/on-line (heterogeneous Markov chain).


Algorithm 27 (Simulated Annealing)

1. Simulate ζ ~ g(|ζ − θ_i|)   [instrumental distribution];
2. Accept θ_{i+1} = ζ with probability ρ_i = exp{Δh_i/T_i} ∧ 1; take θ_{i+1} = θ_i otherwise;
3. Update T_i to T_{i+1}.
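A numpy sketch in the spirit of this algorithm and of Example 28 below: minimization of the test function h of Example 25, uniform moves on [−0.1, 0.1]², and the temperature schedule T_i = 100/log(1 + i) (one of the schedules in the table below).

```python
import numpy as np

def h(x, y):
    return ((x * np.sin(20 * y) + y * np.sin(20 * x)) ** 2 * np.cosh(np.sin(10 * x) * x)
            + (x * np.cos(10 * y) - y * np.sin(10 * x)) ** 2 * np.cosh(np.cos(20 * y) * y))

rng = np.random.default_rng(9)
theta = np.array([0.5, 0.4])                      # starting point as in Example 28
best, best_val = theta.copy(), h(*theta)

for i in range(1, 5001):
    T = 100.0 / np.log(1.0 + i)
    prop = theta + rng.uniform(-0.1, 0.1, size=2)
    delta = h(*prop) - h(*theta)                  # minimizing h: accept rises with prob exp(-delta/T)
    if delta < 0 or rng.uniform() < np.exp(-delta / T):
        theta = prop
    if h(*theta) < best_val:
        best, best_val = theta.copy(), h(*theta)

print(best, best_val)
```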


Convergence: under mild assumptions on (T_i), this algorithm is guaranteed to find the global maximum.


Example 28 (More minimization). Apply the simulated annealing algorithm to minimize h, with g uniform on [−0.1, 0.1]², and different sequences (T_i). The results change with the sequence (T_i).


Results of simulated annealing runs for different values of T_i and starting point (0.5, 0.4):

Case  T_i             θ_T              h(θ_T)    min_t h(θ_t)    Accept. rate
1     1/10i           (1.94, 0.480)    0.198     4.02 × 10⁻⁷     0.9998
2     1/log(1+i)      (1.99, 0.133)    3.408     3.823 × 10⁻⁷    0.96
3     100/log(1+i)    (0.575, 0.430)   0.0017    4.708 × 10⁻⁹    0.6888


[Figure: (1) T_i = 1/10i. Simulated annealing sequence of 5000 points for the first choice of temperature and starting point (0.5, 0.4).]


[Figure: (2) T_i = 1/log(i + 1). Simulated annealing sequence.]


[Figure: (3) T_i = 100/log(i + 1). Simulated annealing sequence.]


5.2.3 Recursive integration

Also called Prior feedback: a very statistical approach. Based on the result that if there exists a unique solution θ* satisfying
θ* = arg max_θ h(θ),
then
lim_{λ→∞} ∫ θ e^{λh(θ)} dθ / ∫ e^{λh(θ)} dθ = θ*,
provided h is continuous at θ*.


Observations x with a log-likelihood function ℓ(θ|x). The MLE can be obtained as
lim_{λ→∞} ∫ θ e^{λℓ(θ|x)} π(θ) dθ / ∫ e^{λℓ(θ|x)} π(θ) dθ = θ̂,
where π is a positive density. The Bayes estimator associated with the prior distribution π is
δ_λ(x) = ∫ θ e^{λℓ(θ|x)} π(θ) dθ / ∫ e^{λℓ(θ|x)} π(θ) dθ,
so we have
lim_{λ→∞} δ_λ(x) = θ̂.


Example 29 (Gamma shape). Estimation of α from
ℓ(α|x) ∝ x^{α−1} e^{−x} / Γ(α).
IE[α | x, λ] can be obtained by simulation.

Sequence of Bayes estimators δ_λ for the estimation of α when X ~ Ga(α, 1) and x = 1.5:

λ      5      10     100    1000   5000   10⁴
δ_λ    2.02   2.04   1.89   1.98   1.94   2.00


5.3 Stochastic Approximation

Methods that work directly with the objective function and are less concerned with fast exploration of the space. Approximations of the objective function could result in an additional level of error. Many of these approximation methods only work in missing data models, where the likelihood g(x|θ) can be expressed as
g(x|θ) = ∫_𝒵 f(x, z|θ) dz.

Example 30 (Censored data likelihood). Observe Y_1, …, Y_n, iid, from f(y − θ). Order the observations so that y = (y_1, …, y_m) are uncensored and (y_{m+1}, …, y_n) are censored (and equal to a). The likelihood function is
L(θ|y) = ∏_{i=1}^m f(y_i − θ) [1 − F(a − θ)]^{n−m},
where F is the cdf associated with f.


If we had observed the last n − m values, say z = (z_{m+1}, …, z_n), with z_i > a (i = m + 1, …, n), we could have constructed the (complete data) likelihood
L^c(θ|y, z) = ∏_{i=1}^m f(y_i − θ) ∏_{i=m+1}^n f(z_i − θ),
which is easier to work with.


5.3.1 Monte Carlo Approximation

h(x) = IE[H(x, Z)]
can be approximated by
ĥ(x) = (1/m) Σ_{i=1}^m H(x, z_i),   Z_i ~ f(z|x).
Problems: ĥ(x) needs to be evaluated at many points, thus involving the generation of many samples of Z_i's of size m. The sample changes with every value of x: the resulting sequence of evaluations of ĥ is usually not smooth enough.


Importance sampling solution: use a single sample of Z_i ~ g and optimize
ĥ_m(x) = (1/m) Σ_{i=1}^m [f(z_i|x)/g(z_i)] H(x, z_i).
This evaluation of h does not depend [so much] on x.


Features:
- ĥ_m is a sum, thus with possibly fewer analytical properties than the original h. For example, there is no smoothing effect of the integral on the integrand H(x, z).
- The choice of g is very influential in obtaining a good approximation of the function h(x).
- The number of points z_i used in the approximation should vary with x to achieve the same precision in the approximation of h(x), but this is usually impossible to assess in advance.
- When g(z) = f(z|x_0), Geyer's (1996) recursive process updates x_0 by the solution of the last optimization at each step.


Algorithm 31 (Monte Carlo Maximization). At step i:
1. Generate z_1, …, z_m ~ f(z|x_i) and compute ĥ_{g_i} with g_i = f(·|x_i).
2. Find x* = arg max ĥ_{g_i}(x).
3. Update x_i to x_{i+1} = x*.
Repeat until x_i = x_{i+1}.


Example 32 (MLE). MLEs in exponential families:
h(x|θ) = c(θ) e^{θ·x − ν(x)} = c(θ) h̃(x|θ).
The normalizing constant c(θ) may be unknown or difficult to compute (gamma or beta distributions, for example). Since ∫ h̃(x|θ) dx = 1/c(θ), maximization of h(x|θ) in θ is equivalent to maximizing (for a fixed θ_0)
log [h̃(x|θ)/h̃(x|θ_0)] − log IE_{θ_0}[ h̃(X|θ)/h̃(X|θ_0) ].


Accomplished by maximizing the approximation
log [h̃(x|θ)/h̃(x|θ_0)] − log (1/m) Σ_{i=1}^m h̃(x_i|θ)/h̃(x_i|θ_0),
where the x_i's are generated from h(x|θ_0).


In more general missing-data models, the likelihood function is
L(θ|x) = ∫ f(x, z|θ) dz.
Thus
L(θ|x)/L(θ_0|x) = IE_{θ_0}[ f(x, Z|θ)/f(x, Z|θ_0) | x ],   Z ~ f(z|x, θ_0),
can be maximized through approximations like
ĥ(θ) = (1/m) Σ_{i=1}^m f(x, z_i|θ)/f(x, z_i|θ_0),   z_1, …, z_m ~ f(z|x, θ_0).


Example 33 (ARCH models). Gaussian ARCH (AutoRegressive Conditionally Heteroscedastic) model: for t = 2, …, T,
Z_t = (1 + β Z_{t−1}²)^{1/2} ε_t,   ε_t ~ N(0, 1),
X_t = a Z_t + η_t,   η_t ~ N_p(0, σ² I_p).


The approximation of the likelihood ratio is then based on the simulation of the missing data Z^T = (Z_1, …, Z_T) from
f(z^T | x^T, θ) ∝ f(z^T, x^T | θ) ∝ σ^{−2T} exp{ −Σ_{t=1}^T ‖x_t − a z_t‖²/2σ² } e^{−z_1²/2} ∏_{t=2}^T e^{−z_t²/2(1 + β z_{t−1}²)} / (1 + β z_{t−1}²)^{1/2}.
The likelihood approximation is given by
(1/m) Σ_{i=1}^m f(z_i^T, x^T | θ) / f(z_i^T, x^T | θ_0).


5.3.2 The EM Algorithm

Introduced by Dempster, Laird and Rubin (1977). Takes advantage of the representation
g(x|θ) = ∫_𝒵 f(x, z|θ) dz
and solves a sequence of easier maximization problems whose limit is the answer to the original problem.


Note EM algorithm relates to MCMC algorithms in the sense that it can be seen as a forerunner of the Gibbs sampler in its Data Augmentation version, replacing simulation by maximization.


Suppose that we observe X_1, …, X_n, iid from g(x|θ), and want to compute
θ̂ = arg max L(θ|x) = ∏_{i=1}^n g(x_i|θ).
We augment the data with z, where X, Z ~ f(x, z|θ), and note the identity
k(z|θ, x) = f(x, z|θ) / g(x|θ),
where k(z|θ, x) is the conditional distribution of the missing data Z given the observed data x.


Principle. This identity leads to the following relationship between the complete-data likelihood L^c(θ|x, z) = f(x, z|θ) and the observed-data likelihood L(θ|x). For any value θ_0,
log L(θ|x) = IE_{θ_0}[log L^c(θ|x, z)|θ_0, x] − IE_{θ_0}[log k(z|θ, x)|θ_0, x],
where the expectation is with respect to k(z|θ_0, x).


Properties:
1. The strength of the EM algorithm is that we only have to deal with the first term on the right side above, as the other term can be ignored.
2. The observed likelihood is increased at every iteration: this is a convergence guarantee.


Preparation. Denote the expected log-likelihood by
Q(θ|θ_0, x) = IE_{θ_0}[log L^c(θ|x, z)|θ_0, x].
A sequence of estimators θ̂_(j), j = 1, 2, …, is obtained iteratively by
Q(θ̂_(j)|θ̂_(j−1), x) = max_θ Q(θ|θ̂_(j−1), x).


Algorithm 34 (The EM Algorithm). Iterate:
1. (the E-step) Compute
Q(θ|θ̂_(m), x) = IE_{θ̂_(m)}[log L^c(θ|x, z)|x];
2. (the M-step) Maximize Q(θ|θ̂_(m), x) in θ and take
θ̂_(m+1) = arg max_θ Q(θ|θ̂_(m), x),
until a fixed point of Q is obtained.


Example 35 (Censored data). If f(x − θ) is the N(θ, 1) density, the censored-data likelihood is
L(θ|x) ∝ (2π)^{−m/2} exp{ −(1/2) Σ_{i=1}^m (x_i − θ)² } [1 − Φ(a − θ)]^{n−m},
and the complete-data log-likelihood is
log L^c(θ|x, z) ∝ −(1/2) Σ_{i=1}^m (x_i − θ)² − (1/2) Σ_{i=m+1}^n (z_i − θ)²,
where the z_i's are observations from the truncated normal distribution
k(z|θ, x) = exp{−(z − θ)²/2} / (√(2π) [1 − Φ(a − θ)]) = φ(z − θ)/[1 − Φ(a − θ)],   a < z.


At the jth step in the EM sequence, we have
Q(θ|θ̂_(j), x) ∝ −(1/2) Σ_{i=1}^m (x_i − θ)² − (1/2) Σ_{i=m+1}^n ∫_a^∞ (z_i − θ)² k(z_i|θ̂_(j), x) dz_i.


Differentiating with respect to θ yields
θ̂_(j+1) = [ m x̄ + (n − m) IE[Z|θ̂_(j)] ] / n,
where
IE[Z|θ̂_(j)] = ∫_a^∞ z k(z|θ̂_(j), x) dz = θ̂_(j) + φ(a − θ̂_(j)) / [1 − Φ(a − θ̂_(j))].
Thus, the EM sequence is defined by
θ̂_(j+1) = (m/n) x̄ + ((n − m)/n) [ θ̂_(j) + φ(a − θ̂_(j)) / (1 − Φ(a − θ̂_(j))) ],
which converges to the MLE θ̂.
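A small sketch of this EM iteration (the data below are simulated for illustration; true θ = 1, censoring at a = 1.2):

```python
import numpy as np
from math import erf, exp, pi, sqrt

phi = lambda z: exp(-z * z / 2) / sqrt(2 * pi)          # standard normal density
Phi = lambda z: 0.5 * (1 + erf(z / sqrt(2)))            # standard normal cdf

rng = np.random.default_rng(10)
a, n = 1.2, 200
y = rng.normal(1.0, 1.0, size=n)
x = y[y <= a]                                           # uncensored observations
m = len(x)                                              # n - m observations censored at a

theta = x.mean()                                        # starting value
for _ in range(100):
    ez = theta + phi(a - theta) / (1 - Phi(a - theta))  # E[Z | theta], Z truncated to (a, inf)
    theta = (m * x.mean() + (n - m) * ez) / n           # EM update
print(theta)
```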


5.3.3 MCEM

A (potential) difficulty with the EM algorithm is the computation of Q(θ|θ_0, x). To overcome this difficulty, use
Q̂(θ|θ_0, x) = (1/m) Σ_{i=1}^m log L^c(θ|x, z_i),
where Z_1, …, Z_m ~ k(z|x, θ_0). When m → ∞, this quantity converges to Q(θ|θ_0, x).


The Metropolis-Hastings Algorithm

6.1 Markov Chain Monte Carlo
6.2 The Metropolis-Hastings Algorithm
6.3 Examples
6.4 Extensions


6.1 Monte Carlo Methods based on Markov Chains

It is unnecessary to use a sample from the distribution f to approximate the integral
∫ h(x) f(x) dx.
Now we obtain X_1, …, X_n ~ f (approximately) without directly simulating from f, using an ergodic Markov chain with stationary distribution f.


Idea: for an arbitrary starting value x^(0), an ergodic chain (X^(t)) is generated using a transition kernel with stationary distribution f. This ensures the convergence in distribution of (X^(t)) to a random variable from f. For a large enough T_0, X^(T_0) can be considered as distributed from f. This produces a dependent sample X^(T_0), X^(T_0+1), …, which is generated from f, sufficient for most approximation purposes.

Monte Carlo Statistical Methods/October 29, 2001

199

6.2

The MetropolisHastings algorithm

6.2.1

Basics

The algorithm starts with the objective (target) density f. A conditional density q(y|x), called the instrumental (or proposal) distribution, is then chosen.

Monte Carlo Statistical Methods/October 29, 2001

200

Algorithm 36 Metropolis–Hastings  Given x(t),
1. Generate Yt ∼ q(y|x(t)).
2. Take X(t+1) = Yt with prob. ρ(x(t), Yt), and X(t+1) = x(t) with prob. 1 − ρ(x(t), Yt), where
ρ(x, y) = min{ [f(y) q(x|y)] / [f(x) q(y|x)] , 1 }.
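A generic sketch of Algorithm 36; the standard normal target and Gaussian proposal used at the end are only illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)

def metropolis_hastings(log_f, q_sample, log_q, x0, n_iter=10_000):
    # log_q(a, b) is the log density of proposing b when the chain sits at a
    x = x0
    chain = np.empty(n_iter)
    for t in range(n_iter):
        y = q_sample(x)
        log_rho = min(0.0, log_f(y) + log_q(y, x) - log_f(x) - log_q(x, y))
        if np.log(rng.uniform()) < log_rho:     # accept with prob. rho(x, y)
            x = y
        chain[t] = x
    return chain

# Illustration: target f(x) proportional to exp(-x^2/2), proposal q(y|x) = N(x, 1)
chain = metropolis_hastings(
    log_f=lambda x: -0.5 * x**2,
    q_sample=lambda x: x + rng.normal(),
    log_q=lambda a, b: -0.5 * (b - a)**2,       # symmetric, so it cancels anyway
    x0=0.0,
)
print(chain.mean(), chain.var())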

Monte Carlo Statistical Methods/October 29, 2001

201

Features
Always accept upwards moves
Independent of normalizing constants for both f and q(·|x) (constants independent of x)
Never move to values with f(y) = 0
The chain (x(t))t may take the same value several times in a row, even though f is a density wrt Lebesgue measure
The sequence (yt)t is usually not a Markov chain

Monte Carlo Statistical Methods/October 29, 2001

202

6.2.2

Convergence properties

1. The M-H Markov chain is reversible, with invariant/stationary density f, since it satisfies the detailed balance condition
f(y) K(y, x) = f(x) K(x, y)
2. As f is a probability measure, the chain is positive recurrent
3. If
P[ f(Yt) q(X(t)|Yt) / ( f(X(t)) q(Yt|X(t)) ) ≥ 1 ] < 1 ,   (1)
that is, the event {X(t+1) = X(t)} is possible, then the chain is aperiodic

Monte Carlo Statistical Methods/October 29, 2001

203

4. If q(y|x) > 0 for every (x, y), the chain is irreducible   (2)
5. For M-H, f-irreducibility implies Harris recurrence
6. Thus, for M-H satisfying (1) and (2)
(a) For h with IE_f|h(X)| < ∞,
lim_{T→∞} (1/T) Σ_{t=1}^T h(X(t)) = ∫ h(x) f(x) dx   a.e. f.
(b) and
lim_{n→∞} || ∫ K^n(x, ·) μ(dx) − f ||_TV = 0
for every initial distribution μ, where K^n(x, ·) denotes the kernel for n transitions.

Monte Carlo Statistical Methods/October 29, 2001

204

6.3

A Collection of Metropolis-Hastings Algorithms

6.3.1

The Independent Case

The instrumental distribution q is independent of X (t) , and is denoted g by analogy with Accept-Reject.

Monte Carlo Statistical Methods/October 29, 2001

205

Algorithm 37 Independent Metropolis-Hastings  Given x(t),
1. Generate Yt ∼ g(y)
2. Take X(t+1) = Yt with prob. min{ [f(Yt) g(x(t))] / [f(x(t)) g(Yt)] , 1 }, and X(t+1) = x(t) otherwise.

Monte Carlo Statistical Methods/October 29, 2001

206

The resulting sample is not iid, but there can be strong convergence properties:
The algorithm produces a uniformly ergodic chain if there exists a constant M such that
f(x) ≤ M g(x) ,   x ∈ supp f.
In this case,
|| K^n(x, ·) − f ||_TV ≤ (1 − 1/M)^n ,
and the expected acceptance probability is at least 1/M.

Monte Carlo Statistical Methods/October 29, 2001

207

Example 38 Generating gamma variables  Generate the Ga(α, β) distribution using a gamma Ga([α], [α]/α) candidate, with [α] the integer part of α.
Algorithm 39 Gamma accept-reject
1. Generate Y ∼ Ga([α], [α]/α)
2. Accept X = Y with prob.
[ (e y / α) exp(−y/α) ]^{α−[α]}

Monte Carlo Statistical Methods/October 29, 2001

208

and
Algorithm 40 Gamma Metropolis-Hastings
1. Generate Yt ∼ Ga([α], [α]/α)
2. Take X(t+1) = Yt with prob.
[ (Yt / x(t)) exp{ (x(t) − Yt)/α } ]^{α−[α]} ∧ 1 ,
and X(t+1) = x(t) otherwise.
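A sketch of Algorithm 40 for the case β = 1 and α = 2.43 used in the comparison figure below; the chain length and starting point are arbitrary choices.

import numpy as np

rng = np.random.default_rng(2)
alpha = 2.43
a = int(alpha)                                   # [alpha], shape of the candidate

def gamma_mh(n_iter=5000, x0=alpha):
    x = x0
    out = np.empty(n_iter)
    for t in range(n_iter):
        y = rng.gamma(a, scale=alpha / a)        # Ga([alpha], [alpha]/alpha) draw
        rho = ((y / x) * np.exp((x - y) / alpha)) ** (alpha - a)
        if rng.uniform() < min(1.0, rho):
            x = y
        out[t] = x
    return out

chain = gamma_mh()
print(chain.mean(), (chain**2).mean())           # E[X^2] = alpha*(alpha+1) ~ 8.33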

Monte Carlo Statistical Methods/October 29, 2001

209

Comparison Close agreement in M-H and A-R, with a slight edge to M-H.
[Figure: Accept-reject (solid line) vs. Metropolis–Hastings (dotted line) estimators of IE_f[X²] = 8.33, for α = 2.43, based on Ga(2, 2/2.43) (5000 iterations)]
Monte Carlo Statistical Methods/October 29, 2001

210

6.3.2

Random walk MetropolisHastings

Use the proposal Yt = X(t) + εt, where εt ∼ g, independent of X(t). The instrumental density is now of the form g(y − x), and the Markov chain is a random walk if we take g to be symmetric.

Monte Carlo Statistical Methods/October 29, 2001

211

Algorithm 41 Random walk Metropolis  Given x(t),
1. Generate Yt ∼ g(y − x(t))
2. Take X(t+1) = Yt with prob. min{ 1, f(Yt)/f(x(t)) }, and X(t+1) = x(t) otherwise.

Monte Carlo Statistical Methods/October 29, 2001

212

Example 42 Random walk normal  Generate N(0, 1) based on the uniform proposal on [−δ, δ] [Hastings (1970)]. The probability of acceptance is then
ρ(x(t), yt) = exp{ (x(t)² − yt²)/2 } ∧ 1.
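A sketch of Example 42; the three δ values match the table that follows, while chain length and seed are arbitrary.

import numpy as np

rng = np.random.default_rng(3)

def rw_normal(delta, n_iter=15_000, x0=0.0):
    x = x0
    out = np.empty(n_iter)
    for t in range(n_iter):
        y = x + rng.uniform(-delta, delta)
        if rng.uniform() < min(1.0, np.exp(0.5 * (x**2 - y**2))):
            x = y
        out[t] = x
    return out

for delta in (0.1, 0.5, 1.0):
    chain = rw_normal(delta)
    print(delta, chain.mean(), chain.var())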

Monte Carlo Statistical Methods/October 29, 2001

213

Sample statistics
δ          0.1     0.5     1.0
mean       0.399   0.111   0.10
variance   0.698   1.11    1.06

As δ increases, we get better histograms and a faster exploration of the support of f.

Monte Carlo Statistical Methods/October 29, 2001

214

[Figure: Three samples based on U[−δ, δ] with (a) δ = 0.1, (b) δ = 0.5 and (c) δ = 1.0, superimposed with the convergence of the means (15,000 simulations)]

Monte Carlo Statistical Methods/October 29, 2001

215

Convergence properties Uniform ergodicity prohibited by random walk structure At best, geometric ergodicity: For a symmetric density f , log-concave in the tails, and a positive and symmetric density g, the chain (X (t) ) is geometrically ergodic.

Monte Carlo Statistical Methods/October 29, 2001

216

Example 43 Comparison of tail effects  Random-walk Metropolis–Hastings algorithms based on a N(0, 1) instrumental for the generation of (a) a N(0, 1) distribution and (b) a distribution with density ψ(x) ∝ (1 + |x|)^{−3}
[Figure: 90% confidence envelopes of the means, derived from 500 parallel independent chains, for cases (a) and (b)]

Monte Carlo Statistical Methods/October 29, 2001

217

6.4

Extensions

There are many other algorithms:
Adaptive Rejection Metropolis Sampling
Reversible Jump
Langevin algorithms
to name a few...

Monte Carlo Statistical Methods/October 29, 2001

218

6.4.1

Reversible jump MCMC

Facts:
- No clear dominating measure in variable dimension spaces
- Gibbs sampling does not apply
Solution:
- Create fixed dimension moves locally
- Supplement θ1 from Mk1 and θ2 from Mk2 by u1→2 and u2→1 resp., so that (θ1, u1→2) and (θ2, u2→1) are in bijection (one-to-one correspondence):
(θ2, u2→1) = T(θ1, u1→2)

Monte Carlo Statistical Methods/October 29, 2001

219

- Use acceptance probability
min{ [π(k2, θ2) ϖ21 g(u2→1)] / [π(k1, θ1) ϖ12 g(u1→2)] × | ∂T(θ1, u1→2)/∂(θ1, u1→2) | , 1 }
where ϖij denotes the probability of proposing a jump from model i to model j
[Green, 1995]

Monte Carlo Statistical Methods/October 29, 2001

220

6.4.2

Langevin Algorithms

Proposal based on the Langevin diffusion Lt, defined by the stochastic differential equation
dLt = dBt + (1/2) ∇ log f(Lt) dt,
where Bt is the standard Brownian motion. The Langevin diffusion is the only non-explosive diffusion which is reversible with respect to f.

Monte Carlo Statistical Methods/October 29, 2001

221

Discretization:
x(t+1) = x(t) + (σ²/2) ∇ log f(x(t)) + σ εt ,   εt ∼ Np(0, Ip),
where σ² corresponds to the discretization step. Unfortunately, the discretized chain may be transient, for instance when
lim_{|x|→∞} | σ² ∇ log f(x) | |x|^{−1} > 1

Monte Carlo Statistical Methods/October 29, 2001

222

MH correction  Accept the new value Yt with probability
[ f(Yt)/f(x(t)) ] · exp{ || Yt − x(t) − (σ²/2) ∇ log f(x(t)) ||²/(2σ²) − || x(t) − Yt − (σ²/2) ∇ log f(Yt) ||²/(2σ²) } ∧ 1.
Choice of the scaling factor σ  Should lead to an acceptance rate of 0.574 to achieve optimal convergence rates (when the components of x are uncorrelated)
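A sketch of the Metropolis-adjusted Langevin correction just described; the standard normal target and the step size σ = 0.9 are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(4)

def mala(log_f, grad_log_f, x0, sigma=0.9, n_iter=10_000):
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    chain = np.empty((n_iter, x.size))
    def log_q(to, frm):                          # log density of proposing `to` from `frm`
        mean = frm + 0.5 * sigma**2 * grad_log_f(frm)
        return -np.sum((to - mean)**2) / (2 * sigma**2)
    for t in range(n_iter):
        y = x + 0.5 * sigma**2 * grad_log_f(x) + sigma * rng.standard_normal(x.size)
        log_rho = log_f(y) - log_f(x) + log_q(x, y) - log_q(y, x)
        if np.log(rng.uniform()) < log_rho:
            x = y
        chain[t] = x
    return chain

chain = mala(log_f=lambda x: -0.5 * np.sum(x**2), grad_log_f=lambda x: -x, x0=[0.0])
print(chain.mean(), chain.var())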

Monte Carlo Statistical Methods/October 29, 2001

223

6.4.3

Optimizing the Acceptance Rate

Problem of choice of the transition kernel from a practical point of view Most common alternatives: (a) a fully automated algorithm like ARMS; (b) an instrumental density g which approximates f , such that f /g is bounded for uniform ergodicity to apply; (c) a random walk In both cases (b) and (c), the choice of g is critical,

Monte Carlo Statistical Methods/October 29, 2001

224

Case of the independent Metropolis–Hastings algorithm  Choice of g that maximizes the average acceptance rate
ρ = IE[ min{ [f(Y) g(X)] / [f(X) g(Y)] , 1 } ] = 2 P( f(X)/g(X) ≤ f(Y)/g(Y) ) ,   X ∼ f, Y ∼ g.
Related to the speed of convergence of
(1/T) Σ_{t=1}^T h(X(t))
to IE_f[h(X)] and to the ability of the algorithm to explore any complexity of f

Monte Carlo Statistical Methods/October 29, 2001

225

Practical implementation  Choose a parameterized instrumental distribution g(·|θ) and adjust the corresponding parameters θ based on the evaluated acceptance rate
ρ̂(θ) = (2/m) Σ_{i=1}^m I{ f(yi) g(xi) > f(xi) g(yi) } ,
where x1, . . . , xm is a sample from f and y1, . . . , ym an iid sample from g(·|θ).

Monte Carlo Statistical Methods/October 29, 2001

226

Example 44 Inverse Gaussian distribution.  Simulation from
f(z|θ1, θ2) ∝ z^{−3/2} exp{ −θ1 z − θ2/z + 2√(θ1θ2) + log √(2θ2) } I_{IR+}(z)
based on the Gamma distribution Ga(α, β) with α = β √(θ2/θ1). Since
f(x)/g(x) ∝ x^{−α−1/2} exp{ (β − θ1) x − θ2/x } ,
the optimal value of x is
x*_β = [ (α + 1/2) − √{ (α + 1/2)² + 4θ2(θ1 − β) } ] / [ 2(β − θ1) ] .

Monte Carlo Statistical Methods/October 29, 2001

227

The analytical optimization (in β) of
M(β) = (x*_β)^{−α−1/2} exp{ (β − θ1) x*_β }
is impossible, so β is calibrated from the estimated acceptance rate:

β        0.2    0.5    0.8    0.9    1      1.1    1.2    1.5
ρ̂(β)     0.22   0.41   0.54   0.56   0.60   0.63   0.64   0.71
IE[Z]    1.137  1.158  1.164  1.154  1.133  1.148  1.181  1.148
IE[1/Z]  1.116  1.108  1.116  1.115  1.120  1.126  1.095  1.115

(θ1 = 1.5, θ2 = 2, and m = 5000).

Monte Carlo Statistical Methods/October 29, 2001

228

Case of the random walk  Different approach to acceptance rates. A high acceptance rate does not indicate that the algorithm is moving correctly, since it may indicate that the random walk is moving too slowly on the surface of f. If x(t) and yt are close, i.e. f(x(t)) ≈ f(yt), then yt is accepted with probability
min{ f(yt)/f(x(t)) , 1 } ≈ 1.
For multimodal densities with well separated modes, the negative effect of limited moves on the surface of f clearly shows.

Monte Carlo Statistical Methods/October 29, 2001

229

If the average acceptance rate is low, the successive values of f (yt ) tend to be small compared with f (x(t) ), which means that the random walk moves quickly on the surface of f since it often reaches the borders of the support of f

Monte Carlo Statistical Methods/October 29, 2001

230

Rule of thumb In small dimensions, aim at an average acceptance rate of 50%. In large dimensions, at an average acceptance rate of 25%. [Gelman,Gilks and Roberts, 1995]

Monte Carlo Statistical Methods/October 29, 2001

231

The Gibbs Sampler

7.1 General Principles 7.2 Data Augmentation 7.3 Improper Priors

Monte Carlo Statistical Methods/October 29, 2001

232

7.1

General Principles

A very specific simulation algorithm based on the target f. Uses the conditional densities f1, . . . , fp from f. Start with the random variable X = (X1, . . . , Xp) and simulate from the conditional densities,
Xi | x1, x2, . . . , xi−1, xi+1, . . . , xp ∼ fi(xi | x1, x2, . . . , xi−1, xi+1, . . . , xp)
for i = 1, 2, . . . , p.

Monte Carlo Statistical Methods/October 29, 2001

233

Algorithm 45 The Gibbs sampler

Given x(t) = (x1(t), . . . , xp(t)), generate
1. X1(t+1) ∼ f1(x1 | x2(t), . . . , xp(t));
2. X2(t+1) ∼ f2(x2 | x1(t+1), x3(t), . . . , xp(t));
...
p. Xp(t+1) ∼ fp(xp | x1(t+1), . . . , xp−1(t+1)).

Then X(t+1) → X ∼ f

Monte Carlo Statistical Methods/October 29, 2001

234

The full conditionals densities f1 , . . . , fp are the only densities used for simulation. Thus, even in a high dimensional problem, all of the simulations may be univariate

Monte Carlo Statistical Methods/October 29, 2001

235

Example 46 Bivariate Gibbs sampler  (X, Y) ∼ f(x, y). Generate a sequence of observations by:
Set X0 = x0, and for t = 1, 2, . . . , generate
Yt ∼ fY|X(·|xt−1) ,
Xt ∼ fX|Y(·|yt) ,
where fY|X and fX|Y are the conditional distributions

Monte Carlo Statistical Methods/October 29, 2001

236

(Xt, Yt)t is a Markov chain, and (Xt)t and (Yt)t individually are Markov chains. For example, the chain (Xt)t has transition density
K(x, x*) = ∫ fY|X(y|x) fX|Y(x*|y) dy,
with invariant density fX(·)

Monte Carlo Statistical Methods/October 29, 2001

237

For the special case
(X, Y) ∼ N2( 0, [[1, ρ], [ρ, 1]] ),
the Gibbs sampler is: Given yt, generate
Xt+1 | yt ∼ N(ρ yt, 1 − ρ²) ,
Yt+1 | xt+1 ∼ N(ρ xt+1, 1 − ρ²).
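A sketch of this bivariate normal Gibbs sampler; ρ = 0.9 and the chain length are arbitrary illustration choices.

import numpy as np

rng = np.random.default_rng(5)

def bivariate_gibbs(rho, n_iter=10_000, x0=0.0):
    x, y = x0, 0.0
    out = np.empty((n_iter, 2))
    sd = np.sqrt(1 - rho**2)
    for t in range(n_iter):
        x = rng.normal(rho * y, sd)      # X_{t+1} | y_t
        y = rng.normal(rho * x, sd)      # Y_{t+1} | x_{t+1}
        out[t] = x, y
    return out

chain = bivariate_gibbs(rho=0.9)
print(np.corrcoef(chain.T)[0, 1])        # should be close to rho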

Monte Carlo Statistical Methods/October 29, 2001

238

Example 47 Auto-exponential model  On IR+³, density
f(y1, y2, y3) ∝ exp{ −(y1 + y2 + y3 + θ12 y1 y2 + θ23 y2 y3 + θ31 y3 y1) } ,
with known θij > 0. The full conditional densities are exponential, e.g.
Y3 | y1, y2 ∼ Exp(1 + θ23 y2 + θ31 y1).
In contrast, the other conditionals and the marginal distributions are difficult.

Monte Carlo Statistical Methods/October 29, 2001

239

Properties of the Gibbs sampler Formally, a special case of a sequence of 1-D M-H kernels, all with acceptance rate uniformly equal to 1. The Gibbs sampler 1. limits the choice of instrumental distributions 2. requires some knowledge of f 3. is, by construction, multidimensional 4. does not apply to problems where the number of parameters varies as the resulting chain is not irreducible.

Monte Carlo Statistical Methods/October 29, 2001

240

7.1.1

Completion

The Gibbs sampler can be applied in much wider generality. A density g is a completion of f if
∫_Z g(x, z) dz = f(x)

Monte Carlo Statistical Methods/October 29, 2001

241

Purpose  g should have full conditionals that are easy to simulate, so that a Gibbs sampler can be implemented with g rather than f. For p > 1, write y = (x, z) and denote the conditional densities of g(y) = g(y1, . . . , yp) by
Y1 | y2, . . . , yp ∼ g1(y1 | y2, . . . , yp),
Y2 | y1, y3, . . . , yp ∼ g2(y2 | y1, y3, . . . , yp),
. . . ,
Yp | y1, . . . , yp−1 ∼ gp(yp | y1, . . . , yp−1).

Monte Carlo Statistical Methods/October 29, 2001

242

The move from Y(t) to Y(t+1) is defined as follows:
Algorithm 48 Completion Gibbs sampler
Given (y1(t), . . . , yp(t)), simulate
1. Y1(t+1) ∼ g1(y1 | y2(t), . . . , yp(t)),
2. Y2(t+1) ∼ g2(y2 | y1(t+1), y3(t), . . . , yp(t)),
...
p. Yp(t+1) ∼ gp(yp | y1(t+1), . . . , yp−1(t+1)).

Monte Carlo Statistical Methods/October 29, 2001

243

Example 49 Cauchy-normal  Consider the density
f(θ|θ0) ∝ e^{−θ²/2} / [1 + (θ − θ0)²] ,
the posterior from the model X|θ ∼ N(θ, 1) and θ ∼ C(θ0, 1). Then
f(θ|θ0) ∝ e^{−θ²/2} ∫_0^∞ e^{−[1+(θ−θ0)²] η/2} dη ,
and therefore
g(θ, η) ∝ e^{−θ²/2} e^{−[1+(θ−θ0)²] η/2} ,

Monte Carlo Statistical Methods/October 29, 2001

244

with conditional densities
g1(η | θ) = Ga( 1, [1 + (θ − θ0)²]/2 ) ,
g2(θ | η) = N( η θ0/(1 + η), 1/(1 + η) ) .
The parameter η is completely meaningless for the problem at hand but serves to facilitate computations.

Monte Carlo Statistical Methods/October 29, 2001

245

7.1.2

Slice sampler

If f(θ) can be written as a product
f(θ) = ∏_{i=1}^k fi(θ),
it can be completed as
g(θ, ω1, . . . , ωk) = ∏_{i=1}^k I{ 0 ≤ ωi ≤ fi(θ) } ,
leading to the following Gibbs algorithm:

Monte Carlo Statistical Methods/October 29, 2001

246

Algorithm 50 Slice sampler

Simulate
1. ω1(t+1) ∼ U[0, f1(θ(t))] ;
. . .
k. ωk(t+1) ∼ U[0, fk(θ(t))] ;
k+1. θ(t+1) ∼ U_{A(t+1)}, with
A(t+1) = { y ; fi(y) ≥ ωi(t+1), i = 1, . . . , k }.

Monte Carlo Statistical Methods/October 29, 2001

247

The slice sampler usually enjoys good theoretical properties (like geometric ergodicity). As k increases, the determination of the set A(t+1) may get increasingly complex.

Monte Carlo Statistical Methods/October 29, 2001

248

Example 51 Normal simulation  For the standard normal density, f(x) ∝ exp(−x²/2), a slice sampler is based on
ω | x ∼ U[0, exp(−x²/2)] ,
X | ω ∼ U[ −√(−2 log ω), √(−2 log ω) ].
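A sketch of the slice sampler of Example 51; chain length and seed are arbitrary.

import numpy as np

rng = np.random.default_rng(6)

def normal_slice_sampler(n_iter=10_000, x0=0.0):
    x = x0
    out = np.empty(n_iter)
    for t in range(n_iter):
        w = rng.uniform(0.0, np.exp(-0.5 * x**2))   # omega | x
        bound = np.sqrt(-2.0 * np.log(w))           # A = [-bound, bound]
        x = rng.uniform(-bound, bound)              # x | omega
        out[t] = x
    return out

chain = normal_slice_sampler()
print(chain.mean(), chain.var())                    # approx. 0 and 1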

Monte Carlo Statistical Methods/October 29, 2001

249

7.1.3

Properties of the Gibbs sampler  For (Y1, Y2, . . . , Yp) ∼ g(y1, . . . , yp):
If either
(i) g(i)(yi) > 0 for every i = 1, . . . , p implies that g(y1, . . . , yp) > 0, where g(i) denotes the marginal distribution of Yi [Positivity condition], or
(ii) the transition kernel is absolutely continuous with respect to g,
then the chain is irreducible and positive Harris recurrent.
(i). If ∫ |h(y)| g(y) dy < ∞, then
lim_{T→∞} (1/T) Σ_{t=1}^T h(Y(t)) = ∫ h(y) g(y) dy   a.e. g.

Monte Carlo Statistical Methods/October 29, 2001

250

(ii). If, in addition, (Y(t)) is aperiodic, then
lim_{n→∞} || ∫ K^n(y, ·) μ(dy) − g ||_TV = 0
for every initial distribution μ.

Monte Carlo Statistical Methods/October 29, 2001

251

7.1.4

Hammersley-Cliord Theorem

An illustration that conditionals determine the joint distribution: If the joint density g(y1, y2) has conditional distributions g1(y1|y2) and g2(y2|y1), then
g(y1, y2) = g2(y2|y1) / ∫ [ g2(v|y1) / g1(y1|v) ] dv .

Monte Carlo Statistical Methods/October 29, 2001

252

General case  Under the positivity condition, the joint distribution g satisfies
g(y1, . . . , yp) ∝ ∏_{j=1}^p [ g_{ℓj}( y_{ℓj} | y_{ℓ1}, . . . , y_{ℓj−1}, y'_{ℓj+1}, . . . , y'_{ℓp} ) / g_{ℓj}( y'_{ℓj} | y_{ℓ1}, . . . , y_{ℓj−1}, y'_{ℓj+1}, . . . , y'_{ℓp} ) ]
for every permutation ℓ on {1, 2, . . . , p} and every y' ∈ Y.

Monte Carlo Statistical Methods/October 29, 2001

253

7.1.5

Hierarchical models

The Gibbs sampler is particularly well suited to hierarchical models.
Example 52 Hierarchical models in animal epidemiology  Counts of the number of cases of clinical mastitis in 127 dairy cattle herds over a one-year period. Number of cases in herd i:
Xi ∼ P(λi),   i = 1, . . . , m,
where λi is the underlying rate of infection in herd i. Lack of independence might manifest itself as overdispersion.

Monte Carlo Statistical Methods/October 29, 2001

254

Modified model
Xi ∼ P(λi),   λi ∼ Ga(α, βi),   βi ∼ IG(a, b).
The Gibbs sampler corresponds to the conditionals
λi ∼ π(λi | x, α, βi) = Ga( xi + α, [1 + 1/βi]^{−1} ),
βi ∼ π(βi | x, α, a, b, λi) = IG( α + a, [λi + 1/b]^{−1} )

Monte Carlo Statistical Methods/October 29, 2001

255

7.2

Data Augmentation

The Gibbs sampler with only two steps is particularly useful.
Algorithm 53 Data Augmentation  Given y(t),
1. Simulate Y1(t+1) ∼ g1(y1 | y2(t)) ;
2. Simulate Y2(t+1) ∼ g2(y2 | y1(t+1)).

Monte Carlo Statistical Methods/October 29, 2001

256

Convergence is ensured:
(Y1, Y2)(t) → (Y1, Y2) ∼ g,
Y1(t) → Y1 ∼ g1,
Y2(t) → Y2 ∼ g2.

Monte Carlo Statistical Methods/October 29, 2001

257

Example 54 Grouped counting data  360 consecutive records of the number of passages per unit time.

Number of passages      0    1    2   3   4 or more
Number of observations  139  128  55  25  13

Monte Carlo Statistical Methods/October 29, 2001

258

Feature  Observations with 4 passages or more are grouped. If observations are Poisson P(λ), the likelihood is
ℓ(λ | x1, . . . , x5) ∝ e^{−347λ} λ^{128 + 55·2 + 25·3} [ 1 − e^{−λ} Σ_{i=0}^3 λ^i/i! ]^{13} ,
which can be difficult to work with.
Idea  With a prior π(λ) = 1/λ, complete the vector (y1, . . . , y13) of the 13 units with 4 passages or more

Monte Carlo Statistical Methods/October 29, 2001

259

Algorithm 55 Poisson-Gamma Gibbs
1. Simulate Yi(t) ∼ P(λ(t−1)) I{y ≥ 4} ,   i = 1, . . . , 13
2. Simulate
λ(t) ∼ Ga( 313 + Σ_{i=1}^{13} yi(t), 360 ).
The Bayes estimator
δ = (1/(360 T)) Σ_{t=1}^T ( 313 + Σ_{i=1}^{13} yi(t) )
converges quite rapidly.
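A sketch of Algorithm 55; the rejection step used to draw the truncated Poisson variables and the number of sweeps are implementation choices, not prescriptions from the slides.

import numpy as np

rng = np.random.default_rng(7)

def truncated_poisson(lam, low=4, size=13):
    # simple rejection: redraw until every Poisson draw is >= low (fine for a sketch)
    y = rng.poisson(lam, size)
    while np.any(y < low):
        idx = y < low
        y[idx] = rng.poisson(lam, idx.sum())
    return y

T = 500
lam = 1.0
bayes_sum = 0.0
for t in range(T):
    y = truncated_poisson(lam)                      # step 1: complete the 13 grouped counts
    lam = rng.gamma(313 + y.sum(), 1.0 / 360.0)     # step 2: lambda | y ~ Ga(313 + sum y, 360)
    bayes_sum += (313 + y.sum()) / 360.0
print(bayes_sum / T)                                # Bayes estimate of lambda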

Monte Carlo Statistical Methods/October 29, 2001

260

[Figure: convergence of the Bayes estimator of λ and histogram of the simulated λ(t) values]
Monte Carlo Statistical Methods/October 29, 2001

261

7.2.1

Rao-Blackwellization

If (y1, y2, . . . , yp)(t), t = 1, 2, . . . , T, is the output from a Gibbs sampler,
δ0 = (1/T) Σ_{t=1}^T h(y1(t)) → ∫ h(y1) g(y1) dy1
and δ0 is unbiased. The Rao-Blackwellization replaces δ0 with its conditional expectation
δrb = (1/T) Σ_{t=1}^T IE[ h(Y1) | y2(t), . . . , yp(t) ].

Monte Carlo Statistical Methods/October 29, 2001

262

Then
Both estimators converge to IE[h(Y1)]
Both are unbiased, and
var( IE[ h(Y1) | Y2(t), . . . , Yp(t) ] ) ≤ var( h(Y1) ),
so δrb is uniformly better (for Data Augmentation)

Monte Carlo Statistical Methods/October 29, 2001

263

Some examples of Rao-Blackwellization  For the bivariate normal
(X, Y) ∼ N2( 0, [[1, ρ], [ρ, 1]] ),
the Gibbs sampler is based upon
X | y ∼ N(ρy, 1 − ρ²) ,   Y | x ∼ N(ρx, 1 − ρ²).

Monte Carlo Statistical Methods/October 29, 2001

264

To estimate μ = IE(X) we could use
δ0 = (1/T) Σ_{i=1}^T X(i)
or its Rao-Blackwellized version
δ1 = (1/T) Σ_{i=1}^T IE[ X(i) | Y(i) ] = (ρ/T) Σ_{i=1}^T Y(i) ,
which satisfies σ0²/σ1² = 1/ρ² > 1.
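A quick numerical check of this variance ratio, repeating the bivariate Gibbs sampler above over independent replications; ρ, the chain length and the number of replications are arbitrary illustration choices.

import numpy as np

rng = np.random.default_rng(8)
rho, T, n_rep = 0.6, 2_000, 200
d0, d1 = np.empty(n_rep), np.empty(n_rep)
sd = np.sqrt(1 - rho**2)

for r in range(n_rep):
    x, y = 0.0, 0.0
    xs, ys = np.empty(T), np.empty(T)
    for t in range(T):
        x = rng.normal(rho * y, sd)
        y = rng.normal(rho * x, sd)
        xs[t], ys[t] = x, y
    d0[r] = xs.mean()                # delta_0: empirical average of X
    d1[r] = rho * ys.mean()          # delta_1: Rao-Blackwellized version
print(d0.var() / d1.var())           # roughly 1 / rho^2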

Monte Carlo Statistical Methods/October 29, 2001

265

For the Poisson-Gamma Gibbs sampler, we could estimate λ with
δ0 = (1/T) Σ_{t=1}^T λ(t) ,
but we instead used the Rao-Blackwellized version
δ = (1/T) Σ_{t=1}^T IE[ λ(t) | x1, x2, . . . , x5, y1(t), y2(t), . . . , y13(t) ] = (1/(360 T)) Σ_{t=1}^T ( 313 + Σ_{i=1}^{13} yi(t) ).

Monte Carlo Statistical Methods/October 29, 2001

266

Another substantial benefit of Rao-Blackwellization is in the approximation of densities of different components of y without nonparametric density estimation methods. The estimator
(1/T) Σ_{t=1}^T gi( yi | yj(t), j ≠ i ) → gi(yi)
is unbiased.

Monte Carlo Statistical Methods/October 29, 2001

267

7.2.2

The Duality Principle

Ties together the properties of the two Markov chains in Data Augmentation. Consider a Markov chain (X(t)) and a sequence (Y(t)) of random variables generated from the conditional distributions
X(t) | y(t) ∼ π(x | y(t)) ,
Y(t+1) | x(t), y(t) ∼ f(y | x(t), y(t)) .
Properties
If the chain (Y(t)) is ergodic then so is (X(t))
The conclusion holds for geometric or uniform ergodicity.
The chain (Y(t)) can be discrete, and the chain (X(t)) can be continuous.

Monte Carlo Statistical Methods/October 29, 2001

268

7.2.3

Parameterization

Convergence of both Gibbs sampling and MetropolisHastings algorithms may suer from a poor choice of parameterization The overall advice is to try to make the components as independent as possible.

Monte Carlo Statistical Methods/October 29, 2001

269

Example 56 Random effects model  In the simple random effects model
yij = μ + αi + εij ,   i = 1, . . . , I,  j = 1, . . . , J,
where αi ∼ N(0, σα²) and εij ∼ N(0, σy²), for a flat prior on μ, the Gibbs sampler implemented for the (μ, α1, . . . , αI) parameterization exhibits high correlation if σy²/(I J σα²) is large, and consequent slow convergence

Monte Carlo Statistical Methods/October 29, 2001

270

If the model is rewritten as the hierarchy
yij ∼ N(ηi, σy²) ,   ηi ∼ N(μ, σα²) ,
the correlations between the ηi's and between μ and the ηi's are lower

Monte Carlo Statistical Methods/October 29, 2001

271

7.3

Improper Priors

Unsuspected danger resulting from careless use of MCMC algorithms: It can happen that all conditional distributions are well defined, all conditional distributions may be simulated from, but... the system of conditional distributions may not correspond to any joint distribution.
Warning  The problem is due to careless use of the Gibbs sampler in a situation for which the underlying assumptions are violated

Monte Carlo Statistical Methods/October 29, 2001

272

Example 57 Conditional exponential distributions  For the model
X1 | x2 ∼ Exp(x2) ,   X2 | x1 ∼ Exp(x1) ,
the only function f(x1, x2) that is a candidate for the joint density is
f(x1, x2) ∝ exp(−x1 x2), but ∫∫ f(x1, x2) dx1 dx2 = ∞.
Thus, these conditional distributions do not correspond to a joint probability distribution.

Monte Carlo Statistical Methods/October 29, 2001

273

Example 58 Improper random effects  For a random effects model,
Yij = μ + αi + εij ,   i = 1, . . . , I,  j = 1, . . . , J,
where αi ∼ N(0, σ²) and εij ∼ N(0, τ²), the Jeffreys (improper) prior for the parameters μ, σ and τ is
π(μ, σ, τ) = 1/(σ² τ²) .

Monte Carlo Statistical Methods/October 29, 2001

274

The conditional distributions
αi | y, μ, σ², τ² ∼ N( J(ȳi − μ) / (J + τ²σ^{−2}), (Jτ^{−2} + σ^{−2})^{−1} ),
μ | α, y, σ², τ² ∼ N( ȳ − ᾱ, τ²/(JI) ),
σ² | α, μ, y, τ² ∼ IG( I/2, (1/2) Σi αi² ),
τ² | α, μ, y, σ² ∼ IG( IJ/2, (1/2) Σi,j (yij − αi − μ)² ),
are well-defined, and a Gibbs sampler can be easily implemented in this setting.

Monte Carlo Statistical Methods/October 29, 2001

275

[Figure: Evolution of (μ(t)) and corresponding histogram (1000 iterations)]

Monte Carlo Statistical Methods/October 29, 2001

276

The figure shows the sequence of the μ(t) and the corresponding histogram for 1000 iterations. The trend of the sequence and the histogram do not indicate that the corresponding joint distribution does not exist

Monte Carlo Statistical Methods/October 29, 2001

277

Final notes on impropriety  The improper posterior Markov chain cannot be positive recurrent. The major task in such settings is to find indicators that flag that something is wrong. However, the output of an improper Gibbs sampler may not differ from a positive recurrent Markov chain.
Example  The random effects model was initially treated in Gelfand et al. (1990) as a legitimate model

Monte Carlo Statistical Methods/October 29, 2001

278

Diagnosing Convergence

8.1 Stopping the Chain 8.2 Monitoring Stationarity Convergence 8.3 Monitoring Average Convergence

Monte Carlo Statistical Methods/October 29, 2001

279

8.1

Stopping the Chain

Convergence results do not tell us when to stop the MCMC algorithm and produce our estimates. Methods of controlling the chain in the sense of a stopping rule to guarantee that the number of iterations is sucient.

Monte Carlo Statistical Methods/October 29, 2001

280

Three types of convergence
1. Convergence to the Stationary Distribution  Minimal requirement for approximation of simulation from f
2. Convergence of Averages  Convergence of the empirical average
(1/T) Σ_{t=1}^T h(θ(t)) → IEf[h(θ)] ;
most relevant in the implementation of MCMC algorithms.
3. Convergence to iid Sampling  How close a sample (θ1(t), . . . , θn(t)) is to being iid.

Monte Carlo Statistical Methods/October 29, 2001

281

8.1.1

Single vs. Multiple Chains

Some methods involve the simulation in parallel of M independent chains (θm(t)) (1 ≤ m ≤ M). Some are based on a single on-line chain.
Motivations for parallel chains  Variability and dependence on the initial values are reduced. Potentially easier to control convergence by comparing estimation of quantities of interest over different chains.
But... in a naive implementation, the slower chain governs convergence. But... the initial distribution is paramount

Monte Carlo Statistical Methods/October 29, 2001

282

8.2

Monitoring Convergence to the Stationary Distribution

8.2.1

Graphical Methods

Natural empirical approach to convergence control is to draw pictures May detect deviant or nonstationary behaviors A rst idea is to draw the sequence of the (t) s against t However, this plot is only useful for strong nonstationarities of the chain.

Monte Carlo Statistical Methods/October 29, 2001

283

Example 59 Witch's hat distribution  Consider
π(θ|y) ∝ [ (1 − δ) σ^{−d} e^{−||y−θ||²/(2σ²)} + δ ] I_C(θ) ,   y ∈ IRd, C = [0, 1]d.
One mode very concentrated around y for small δ and σ
Monte Carlo Statistical Methods/October 29, 2001

284

Naive implementation of the Gibbs sampler
Algorithm 60 Witch's hat distribution
1. Generate Ui ∼ U[0,1]
2. Generate θi ∼ U[0,1] if Ui < δ/(wi + δ), and θi ∼ N+(yi, σ², 0, 1) otherwise

Monte Carlo Statistical Methods/October 29, 2001

285

where
N+(μ, σ², a, b) = N(μ, σ²) restricted to [a, b]
and
wi = (1 − δ) √(2π) σ^{−d+1} e^{−Σ_{j≠i}(yj − θj)²/(2σ²)} [ Φ((1 − yi)/σ) − Φ(−yi/σ) ]

Monte Carlo Statistical Methods/October 29, 2001

286


Monte Carlo Statistical Methods/October 29, 2001

287

[Figure: Chain (θ1(t)) for two initial values, 0.0217 (top) and 0.9098 (bottom), over 1000 iterations]

Monte Carlo Statistical Methods/October 29, 2001

288

Strong attraction of the mode that gives the impression of stationarity Chain with initial value 0.9098 achieves a momentary escape from the mode, but is actually atypical.

Monte Carlo Statistical Methods/October 29, 2001

289

8.2.2

Other monitors of stationarity

Nonparametric tests of stationarity Standard nonparametric tests (Kolmogorov-Smirnov,...) When the chain is stationary, (t1 ) and (t2 ) have the same marginal distribution for arbitrary times t1 and t2 . Methods based on Renewal Theory Methods based on Distance Evaluations between the n-step kernel and the marginal Remember, stationarity is not the main concern!

Monte Carlo Statistical Methods/October 29, 2001

290

8.3

Monitoring Convergence of Averages

Evaluation of the convergence of
(1/T) Σ_{t=1}^T h(θ(t)) → IEf[h(θ)].

Monte Carlo Statistical Methods/October 29, 2001

291

8.3.1

Graphical Methods

Purely graphical evaluation based on cumulative sums, graphing the partial differences
D_T^i = Σ_{t=1}^i [ h(θ(t)) − S_T ] ,   i = 1, . . . , T,   [CUSUM]
where
S_T = (1/T) Σ_{t=1}^T h(θ(t)).
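A minimal sketch of the CUSUM plot, contrasted on a slowly mixing AR(1) sequence and an iid sequence; both series are purely illustrative, not the witch's hat chain.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(9)

def cusum(h_values):
    centered = h_values - h_values.mean()    # h(theta^(t)) - S_T
    return np.cumsum(centered)               # D_T^i, i = 1, ..., T

T = 1000
ar = np.empty(T); ar[0] = 0.0
for t in range(1, T):
    ar[t] = 0.99 * ar[t - 1] + rng.normal()
plt.plot(cusum(ar), label="slow mixing")
plt.plot(cusum(rng.normal(size=T)), label="iid")
plt.legend(); plt.show()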

Monte Carlo Statistical Methods/October 29, 2001

292

When the mixing of the chain is high, the graph of D_T^i is highly irregular and concentrated around 0. (Looks like a Brownian bridge.)

Slowly mixing chains (chains with a slow pace of exploration of the stationary distribution) produce regular graphs with long excursions away from 0.

Monte Carlo Statistical Methods/October 29, 2001

293

[Figure: CUSUM criterion for the witch's hat chain (1000 iterations)]

Monte Carlo Statistical Methods/October 29, 2001

294

But... The pathological shape of the witch's hat distribution is actually close to the ideal shape of Yu and Mykland, and there is no indication that the chain has not yet left the mode (0.7, 0.7). This difficulty is common to most on-line methods, that is, to diagnoses based on a single chain. It is almost impossible to detect the existence of other modes: You've only seen where you've been

Monte Carlo Statistical Methods/October 29, 2001

295

8.3.2

Multiple Estimates

In most cases, the graph of the raw sequence (θ(t)) is unhelpful. Given some quantities of interest IEf[h(θ)], a more helpful indicator is the behavior of the averages
(1/T) Σ_{t=1}^T h(θ(t))
in terms of T. A more controlled approach is to use simultaneously several convergent estimators of IEf[h(θ)] based on the same chain (θ(t)), until all estimations coincide (up to a given precision).

Monte Carlo Statistical Methods/October 29, 2001

296

Most common estimators
The empirical average ST
Rao-Blackwellized versions of this average,
ST^C = (1/T) Σ_{t=1}^T IE[ h(θ) | η(t) ] ,
Importance sampling alternatives
ST^P = (1/T) Σ_{t=1}^T wt h(θ(t)) ,
where wt ∝ f(θ(t)) / gt(θ(t)) and gt is the true density used for the simulation of θ(t).

Monte Carlo Statistical Methods/October 29, 2001

297

Note that ST^P removes the correlation between the θ(t)'s, so up to second order, ST^P behaves as an independent sum. This implies that Var ST^P decreases at speed 1/T in stationarity settings. Thus, nonstationarity can be detected if the decrease of the variations of ST^P does not fit in a confidence parabola of order 1/√T.

Monte Carlo Statistical Methods/October 29, 2001

298

Example 61 Cauchy posterior  For the posterior distribution
π(θ | x1, x2, x3) ∝ e^{−θ²/(2σ²)} ∏_{i=1}^3 [ 1 + (θ − xi)² ]^{−1} ,
a completion Gibbs sampling algorithm can be derived via artificial variables η1, η2, η3, such that
π(θ, η1, η2, η3 | x1, x2, x3) ∝ e^{−θ²/(2σ²)} ∏_{i=1}^3 e^{−[1+(θ−xi)²] ηi/2}

Monte Carlo Statistical Methods/October 29, 2001

299

resulting in the Gibbs sampler based on the conditionals (i = 1, 2, 3)
ηi | θ, xi ∼ Exp( [1 + (θ − xi)²]/2 ) ,
θ | x1, x2, x3, η1, η2, η3 ∼ N( Σi ηi xi / (Σi ηi + σ^{−2}), (Σi ηi + σ^{−2})^{−1} ).

Monte Carlo Statistical Methods/October 29, 2001

300

[Figure: Comparison of the normal-Cauchy density and of the histogram (20,000 points)]

Monte Carlo Statistical Methods/October 29, 2001

301

Eciency of this algorithm: agreement between the histogram of the simulated (t) s and the true posterior distribution If the function of interest is h() = exp(/) the dierent approximations of IE [h()] can be monitored.

Monte Carlo Statistical Methods/October 29, 2001

302

[Figure: Convergence of ST (full line), ST^C (dotted line), ST^R (mixed dashes) and ST^P (long dashes); x-axis in thousands of iterations]

Monte Carlo Statistical Methods/October 29, 2001

303

The strong agreement of ST and ST^C indicates convergence.
The bad behavior of the importance sampler is most likely associated with an infinite variance.

Monte Carlo Statistical Methods/October 29, 2001

304

Example 62 Beta mixture  For the (difficult) chain (X(t))
X(t+1) = x(t) with probability 1 − x(t), and X(t+1) = Y(t) ∼ Be(α + 1, 1) otherwise,
estimate the expectation of h(x) = x^{1−α} with the four estimators:

Monte Carlo Statistical Methods/October 29, 2001

305

ST and ST^C are indistinguishable
importance sampling is the best
mixing is very slow
The final values are 0.224, 0.224, 0.211 and 0.223 for ST, ST^C, ST^R and ST^P respectively, to compare with a theoretical value of 0.2.

Monte Carlo Statistical Methods/October 29, 2001

306

[Figure: Convergence of ST (full line), ST^C (dotted line), ST^R (mixed dashes) and ST^P (long dashes) of IE[(X(t))^{0.8}] for Be(0.2, 1); x-axis in thousands of iterations]

Monte Carlo Statistical Methods/October 29, 2001

307

Limitations of the method
(1). Multiple estimates may not be available
(2). Intrinsically conservative (the slowest ox drives the team)
(3). Cannot detect missing modes: You've only seen where you've been
(4). Empirical and conditional estimators often similar, while the importance sampler enjoys infinite variance

Monte Carlo Statistical Methods/October 29, 2001

308

8.3.3

Within and between variances

Control strategy devised by Gelman and Rubin (1992). Starts with the derivation of a distribution related with the modes of f, obtained by numerical methods. For instance, a mixture of Student's t distributions centered around the identified modes of f. For every quantity of interest ξ = h(θ), the stopping rule is based on the difference between a weighted estimator of the variance and the variance of estimators from the different chains

Monte Carlo Statistical Methods/October 29, 2001

309

Denote
BT = (1/M) Σ_{m=1}^M ( ξ̄m − ξ̄ )² ,
WT = (1/M) Σ_{m=1}^M s²m ,   with   s²m = (1/T) Σ_{t=1}^T ( ξm(t) − ξ̄m )² ,
ξ̄m = (1/T) Σ_{t=1}^T ξm(t) ,   ξ̄ = (1/M) Σ_{m=1}^M ξ̄m ,
with ξm(t) = h(θm(t)). BT and WT represent the between- and within-chains variances.

Monte Carlo Statistical Methods/October 29, 2001

310

Estimator of the posterior variance of ξ:
σ̂T² = [(T − 1)/T] WT + BT .
Comparison of σ̂T² with WT through
RT² = σ̂T²/WT ,
which is approximately F-distributed, so IE[RT] ≈ 1. The stopping rule is based either on testing that the mean of RT is equal to 1 or on confidence intervals on RT.
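A sketch of this between/within-chain diagnostic; the random-walk Metropolis target, the number of chains M and the overdispersed starting points below are illustrative assumptions.

import numpy as np

def gelman_rubin(xi):
    """xi: array of shape (M, T), one row per chain of the quantity xi = h(theta)."""
    M, T = xi.shape
    chain_means = xi.mean(axis=1)                        # xi_bar_m
    B = np.mean((chain_means - chain_means.mean())**2)   # between-chain variance B_T
    W = np.mean(xi.var(axis=1))                          # within-chain variance W_T
    sigma2_hat = (T - 1) / T * W + B                     # pooled variance estimate
    return sigma2_hat / W                                # R_T-type ratio, ~ 1 at stationarity

rng = np.random.default_rng(10)
M, T = 10, 2000
chains = np.empty((M, T))
for m in range(M):
    x = rng.uniform(-5, 5)                               # overdispersed starting point
    for t in range(T):
        y = x + rng.uniform(-0.5, 0.5)
        if np.log(rng.uniform()) < 0.5 * (x**2 - y**2):  # N(0,1) target
            x = y
        chains[m, t] = x
print(gelman_rubin(chains))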

Monte Carlo Statistical Methods/October 29, 2001

311

Example 63 (Normal-Cauchy again)  Evolution of RT for h(θ) = θ, M = 100 and 1,000 iterations. Convergence after about 600 iterations. But... the superimposed graph of WT does not exhibit stationarity. The distribution of θ(t) is stationary after a few hundred iterations: the criterion is conservative

Monte Carlo Statistical Methods/October 29, 2001

312

[Figure: Evolutions of RT (solid lines, scale on the left) and of WT (dotted lines, scale on the right)]

Monte Carlo Statistical Methods/October 29, 2001

313

Example 64 (Witch's hat again)  The density π(θ|y) has a very concentrated mode around y. Use the uniform distribution on C = [0, 1]d as the initial distribution. The scale of RT is very concentrated. Stability of RT (and of WT) implies convergence. But... the chain has not left the neighborhood of the mode (0.7, 0.7)!

Monte Carlo Statistical Methods/October 29, 2001

314

[Figure: Evolutions of RT (solid lines, scale on the left) and of WT (dotted lines, scale on the right) for the witch's hat example]

Monte Carlo Statistical Methods/October 29, 2001

315

Comments  This method has enjoyed wide usage, in particular because of its simplicity and of its connections with the standard tools of linear regression. However... The accurate construction of the initial distribution can be delicate/time-consuming. In some models, the number of modes is too great to complete identification. The method relies on normal approximations

Monte Carlo Statistical Methods/October 29, 2001

316

In general, best to use a battery of tests/assessments: There are many others that we have not mentioned. Some are Methods based on Renewal Theory Methods based on Discretization

Monte Carlo Statistical Methods/October 29, 2001

317

Implementation in Missing Data Models

9.1 First examples 9.2 Finite mixtures of distributions 9.3 Extensions

Monte Carlo Statistical Methods/October 29, 2001

318

Missing data models are a natural application for simulation. Simulation replaces the missing data part so that one can proceed with a classical inference on the complete model. The EM algorithm first described a rigorous and general formulation of statistical inference through completion of missing data. Potential of Markov Chain Monte Carlo algorithms in the analysis of missing data models

Monte Carlo Statistical Methods/October 29, 2001

319

9.1

First examples

Example 65 Rounding eect Numerous settings (surveys, medical experiments, epidemiological studies, design of experiment, quality control, etc.) produce a grouping of the original observations into less informative categories, often for reasons beyond the control of the experimenter: Data coarsening For instance, approximation bias in a study on smoking habits

Monte Carlo Statistical Methods/October 29, 2001

320

Yi ∼ Exp(λ): the number of cigarettes smoked per day is unobserved (rounding) and instead we observe bins Xi, where
Xi | gi, yi = [ [yi], [yi] + 1 )   if gi = 0 (cigarettes reported),
Xi | gi, yi = [ 20[yi/20], 20[yi/20] + 20 )   if gi = 1 (packs reported)

Monte Carlo Statistical Methods/October 29, 2001

321

This means that, as yi increases, the probability of rounding up the answer xi to the nearest full pack of cigarettes also increases, under the constraint γ2 > 0; thus we observe the Gi's according to
Gi | yi ∼ Bernoulli( Φ(γ1 + γ2 yi) ),
where Φ is the cdf of N(0, 1).

Monte Carlo Statistical Methods/October 29, 2001

322

If c(xi ) denotes the center of the ith bin, the likelihood function is
n c(xi )+10 c(xi )10 c(xi )+1/2 1gi gi

L(, 1 , 2 |x, g)

=
i=1

eyi [1 (1 2 yi )]dyi eyi (1 2 yi )dyi

c(xi )1/2

[incomplete-data]

Monte Carlo Statistical Methods/October 29, 2001

323

The complete-data likelihood is
L(λ, γ1, γ2 | y, x, g) = ∏_{i=1}^n [ λ e^{−λ yi} [1 − Φ(γ1 + γ2 yi)] I(c(xi) − 1/2 ≤ yi < c(xi) + 1/2) ]^{1−gi} [ λ e^{−λ yi} Φ(γ1 + γ2 yi) I(c(xi) − 10 ≤ yi < c(xi) + 10) ]^{gi}
Solutions  EM algorithm (Monte Carlo version); Gibbs sampling (priors on λ, γ1 and γ2).

Monte Carlo Statistical Methods/October 29, 2001

324

Contingency table When several variables are studied simultaneously in a sample, each corresponding to a grouping of individual data If the context is suciently informative to allow for a modeling of the individual data, the completion of the contingency table (by reconstruction of the individual data) may improve inference

Monte Carlo Statistical Methods/October 29, 2001

325

Example 66 Lizard habitat  Observation of two characteristics of the habitat of 164 lizards:

                      Diameter ≤ 4.0 in.   Diameter > 4.0 in.
Height > 4.75 ft.     32                   11
Height ≤ 4.75 ft.     86                   35

Monte Carlo Statistical Methods/October 29, 2001

326

Distribution on the individual observations Xijk of diameter and of height (i, j = 1, 2, k = 1, . . . , nij):
Yijk = log(Xijk) ∼ N2(μ, Σ) ,   where   Σ = σ² Σ0 ,   Σ0 = [[1, ρ], [ρ, 1]] .

Monte Carlo Statistical Methods/October 29, 2001

327

N2^T(μ, Σ; Qij) represents the normal distribution restricted to one of the four quadrants induced by (log(4.75), log(4)).
Prior
π(μ, σ, ρ) ∝ (1/σ²) I[−1,1](ρ)

Monte Carlo Statistical Methods/October 29, 2001

328

Algorithm 67 Contingency table completion

1. Simulate yijk ∼ N2^T(μ, Σ; Qij)   (i, j = 1, 2, k = 1, . . . , nij);
2. Simulate μ ∼ N2(ȳ, Σ/164);
3. Simulate σ² from the inverted gamma distribution
IG( 164, Σ_{i,j,k} (yijk − μ)ᵗ Σ0^{−1} (yijk − μ) / 2 ) ;
4. Simulate ρ according to
(1 − ρ²)^{−164/2} exp{ −Σ_{i,j,k} (yijk − μ)ᵗ Σ^{−1} (yijk − μ) / 2 } ,

Monte Carlo Statistical Methods/October 29, 2001

329

Note: The distribution in step 4. requires a MetropolisHastings step based, for instance, on an inverse Wishart distribution

Monte Carlo Statistical Methods/October 29, 2001

330

Qualitative models
Example 68 Probit Regression  Threshold model: observe Yi ∼ Bernoulli(pi), taking values in {0, 1}, with pi = Φ(xiᵗβ), β ∈ IRp, where Φ is the standard normal cdf. Create latent (unobservable) continuous rvs Yi* ∼ N(xiᵗβ, 1) where
Yi = 1 if Yi* > 0, and Yi = 0 otherwise.
Thus, pi = P(Yi = 1) = P(Yi* > 0), and we have an automatic way to complete the model

Monte Carlo Statistical Methods/October 29, 2001

331

Given prior β ∼ Np(β0, Ω) on β:
Algorithm 69 Probit posterior distribution
1. Simulate
yi* ∼ N+(xiᵗβ, 1, 0) if yi = 1,   yi* ∼ N−(xiᵗβ, 1, 0) if yi = 0   (i = 1, . . . , n)
2. Simulate
β ∼ Np( (Ω^{−1} + XXᵗ)^{−1} (Ω^{−1} β0 + Σi yi* xi), (Ω^{−1} + XXᵗ)^{−1} )
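A sketch of Algorithm 69 on simulated data; the sample size, covariates, true coefficients and the Np(0, 10 I) prior are illustrative assumptions.

import numpy as np
from scipy import stats

rng = np.random.default_rng(11)

n, p = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, -1.0])
y = (X @ beta_true + rng.normal(size=n) > 0).astype(int)

beta0, Omega_inv = np.zeros(p), np.eye(p) / 10.0
post_cov = np.linalg.inv(Omega_inv + X.T @ X)

beta = np.zeros(p)
draws = []
for t in range(2000):
    mu = X @ beta
    # step 1: y*_i from N(x_i' beta, 1) truncated to (0, inf) if y_i = 1, (-inf, 0) otherwise
    lower = np.where(y == 1, -mu, -np.inf)
    upper = np.where(y == 1, np.inf, -mu)
    y_star = mu + stats.truncnorm.rvs(lower, upper, size=n, random_state=rng)
    # step 2: beta | y* from the normal conditional of Algorithm 69
    mean = post_cov @ (Omega_inv @ beta0 + X.T @ y_star)
    beta = rng.multivariate_normal(mean, post_cov)
    draws.append(beta)
print(np.mean(draws[500:], axis=0))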

Monte Carlo Statistical Methods/October 29, 2001

332

Incomplete observations arise in numerous settings. A survey with multiple questions may include nonresponses to some personal questions; A calibration experiment may lack observations for some values of the calibration parameters; A pharmaceutical experiment on the aftereects of a toxic product may skip some doses for a given patient.

Monte Carlo Statistical Methods/October 29, 2001

333

Analysis of such structures complicated by the fact that the failure to observe is not always explained. If these missing observations are at random the incompletely observed data only play a role through their marginal distribution. But... these distributions are not always explicit and a natural approach leading to a Gibbs sampler algorithm is to replace the missing data by simulation.

Monte Carlo Statistical Methods/October 29, 2001

334

Example 70 Non-ignorable non-response  Average incomes and numbers of responses/non-responses to a survey on income, by age, sex and marital status.

            Men                        Women
            Single       Married       Single       Married
Age < 30    20.0 (24/1)  21.0 (5/11)   16.0 (11/1)  16.0 (2/2)
Age > 30    30.0 (15/5)  36.0 (2/8)    18.0 (8/4)   — (0/4)

Monte Carlo Statistical Methods/October 29, 2001

335

Observations grouped by average, exponential shape for the individual data,
ya,s,m,i ∼ Exp(λa,s,m)   with   λa,s,m = λ0 + αa + βs + γm ,
where 1 ≤ i ≤ na,s,m,
a (a = 1, 2) corresponds to age (junior/senior),
s (s = 1, 2) corresponds to sex (fem./male),
m (m = 1, 2) corresponds to family status (single/married).
The model is unidentifiable, but that can be remedied by constraining α1 = β1 = γ1 = 0.

Monte Carlo Statistical Methods/October 29, 2001

336

More difficult problem: Nonresponse depends on the income in the shape of a logit model,
pa,s,m,i = exp{w0 + w1 ya,s,m,i} / [1 + exp{w0 + w1 ya,s,m,i}] ,
where pa,s,m,i denotes the probability of nonresponse and (w0, w1) are the logit parameters.

Monte Carlo Statistical Methods/October 29, 2001

337

The likelihood of the complete model is
∏_{a=1,2} ∏_{s=1,2} ∏_{m=1,2} (λ0 + αa + βs + γm)^{ra,s,m} exp{ −ra,s,m ȳa,s,m (λ0 + αa + βs + γm) } ∏_{i=1}^{na,s,m} exp{za,s,m,i (w0 + w1 ya,s,m,i)} / [1 + exp{w0 + w1 ya,s,m,i}] ,
where
za,s,m,i is the indicator of a missing observation,
na,s,m is the number of people by category,
ra,s,m is the number of responses by category,
ȳa,s,m is the average of these responses by category.

Monte Carlo Statistical Methods/October 29, 2001

338

Completion of the data
The (missing) ya,s,m,i's are simulated from
exp{za,s,m,i (w0 + w1 ya,s,m,i)} exp(−ya,s,m,i λa,s,m) / [1 + exp{w0 + w1 ya,s,m,i}]
[requires a Metropolis–Hastings step]. The parameters are simulated from
∏_{a=1,2} ∏_{s=1,2} ∏_{m=1,2} (λ0 + αa + βs + γm)^{ra,s,m} exp{ −ra,s,m ȳa,s,m (λ0 + αa + βs + γm) }
using a gamma instrumental distribution.

Monte Carlo Statistical Methods/October 29, 2001

339

And (w0, w1) from
∏_{a=1,2} ∏_{s=1,2} ∏_{m=1,2} ∏_{i=1}^{na,s,m} exp{za,s,m,i (w0 + w1 ya,s,m,i)} / [1 + exp{w0 + w1 ya,s,m,i}]
[logit model]

Monte Carlo Statistical Methods/October 29, 2001

340

9.2

Finite mixtures of distributions

Distribution
f(x) = Σ_{j=1}^k pj f(x|θj) ,   with p1 + . . . + pk = 1

Monte Carlo Statistical Methods/October 29, 2001

341

useful in practical modeling
challenging from an inferential point of view (i.e., estimation of pj and θj)
likelihood difficult to work with:
L(p, θ | x1, . . . , xn) ∝ ∏_{i=1}^n Σ_{j=1}^k pj f(xi|θj) ,
containing k^n terms.

Monte Carlo Statistical Methods/October 29, 2001

342

Missing data structure  Associate with every observation xi an indicator variable zi ∈ {1, . . . , k}, which indicates which component of the mixture xi comes from. Demarginalized model:
zi ∼ Mk(1; p1, . . . , pk),   xi | zi ∼ f(x|θzi) .

Monte Carlo Statistical Methods/October 29, 2001

343

Completed model likelihood:
ℓ(p, θ | x1, . . . , xn, z1, . . . , zn) ∝ ∏_{i=1}^n pzi f(xi|θzi) = ∏_{j=1}^k ∏_{i; zi=j} pj f(xi|θj)

Monte Carlo Statistical Methods/October 29, 2001

344

Algorithm 71 Mixture simulation

1. Simulate zi (i = 1, . . . , n) from
P(zi = j) ∝ pj f(xi|θj)   (j = 1, . . . , k)
and compute the statistics
nj = Σ_{i=1}^n I{zi = j} ,   nj x̄j = Σ_{i=1}^n I{zi = j} xi .

Monte Carlo Statistical Methods/October 29, 2001

345

2. Generate (j = 1, . . . , k)
θj from the conjugate posterior based on the sufficient statistics (nj, nj x̄j) ,
p ∼ Dk(γ1 + n1, . . . , γk + nk) .

Monte Carlo Statistical Methods/October 29, 2001

346

Example 72 Normal mixtures
f(x) = Σ_{j=1}^k pj exp{ −(x − μj)²/(2σj²) } / (√(2π) σj)
with conjugate distributions on (μj, σj):
μj | σj ∼ N( ξj, σj²/λj ) ,   σj² ∼ IG( (νj + 3)/2, s̄j²/2 )

Monte Carlo Statistical Methods/October 29, 2001

347

Algorithm 73 Normal mixture

1. Simulate (i = 1, . . . , n)
zi with P(zi = j) ∝ pj σj^{−1} exp{ −(xi − μj)²/(2σj²) }
and compute the statistics (j = 1, . . . , k)
nj = Σ_{i=1}^n I{zi = j} ,   nj x̄j = Σ_{i=1}^n I{zi = j} xi ,   s²j = Σ_{i=1}^n I{zi = j} (xi − x̄j)² .

Monte Carlo Statistical Methods/October 29, 2001

348

2. Generate
μj | σj ∼ N( (λj ξj + nj x̄j)/(λj + nj), σj²/(λj + nj) ) ,
σj² ∼ IG( (νj + nj + 3)/2, (s̄j² + s²j)/2 ) ,
p ∼ Dk(γ1 + n1, . . . , γk + nk) .
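A sketch of the normal-mixture Gibbs sampler above; the hyperparameter values, the simulated two-component data and the initialization are illustrative assumptions, not the slides' settings.

import numpy as np

rng = np.random.default_rng(12)

def mixture_gibbs(x, k=3, n_iter=2000, lam=1.0, nu=1.0, s2_0=1.0, gamma=1.0):
    n = len(x)
    xi = np.quantile(x, np.linspace(0.1, 0.9, k))    # prior means spread over the data
    p = np.full(k, 1.0 / k)
    mu, sig2 = xi.copy(), np.full(k, x.var())
    for _ in range(n_iter):
        # step 1: allocations z_i with P(z_i = j) prop. to p_j exp{-(x_i-mu_j)^2/(2 sig2_j)}/sig_j
        logw = np.log(p) - 0.5 * np.log(sig2) - (x[:, None] - mu)**2 / (2 * sig2)
        w = np.exp(logw - logw.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)
        z = np.array([rng.choice(k, p=wi) for wi in w])
        # step 2: conjugate updates given the allocations
        for j in range(k):
            xj = x[z == j]
            nj = len(xj)
            xbar = xj.mean() if nj else 0.0
            ssj = ((xj - xbar)**2).sum()
            sig2[j] = 1.0 / rng.gamma((nu + nj + 3) / 2, 2.0 / (s2_0 + ssj))
            mu[j] = rng.normal((lam * xi[j] + nj * xbar) / (lam + nj),
                               np.sqrt(sig2[j] / (lam + nj)))
        p = rng.dirichlet(gamma + np.bincount(z, minlength=k))
    return p, mu, sig2

x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 0.5, 100)])
print(mixture_gibbs(x, k=2))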

Monte Carlo Statistical Methods/October 29, 2001

349

Good performance of the Gibbs sampler is guaranteed by the duality principle. Practical implementation of the Gibbs sampler might face serious convergence difficulties (phenomenon of the absorbing component)

Monte Carlo Statistical Methods/October 29, 2001

350

Trapping states When only a small number of observations are allocated to a given component j0 , the following probabilities are quite small: (1). The probability of allocating new observations to the component j0 . (2). The probability of reallocating, to another component, observations already allocated to j0 .

Monte Carlo Statistical Methods/October 29, 2001

351

A paradox  While the Gibbs sampler chain (z(t), θ(t)) is irreducible, there exist (almost) absorbing states (or trapping states) which require an enormous number of iterations of the Gibbs sampler to escape from.

Monte Carlo Statistical Methods/October 29, 2001

352

Example 74 Acidity level in lakes  149 observations of acidity levels in lakes in the American North-East. Mixture model fit with the Gibbs sampler algorithm. Lack of evolution of the estimated (plug-in) density from the Gibbs sampler when the number of iterations increases. Phenomenon which occurs often in mixture settings, due to weak identifiability of these models.

Monte Carlo Statistical Methods/October 29, 2001

353

[Figure: Estimation of the density for 3 components and T = 500, 1000, 2000, 3000, 4000, 5000 iterations]

Monte Carlo Statistical Methods/October 29, 2001

354

9.3

Extensions

Relaxation of the independence assumption between observations leads to hidden Markov chains. Different constraints on the changes of components correspond to changepoint models

Monte Carlo Statistical Methods/October 29, 2001

355

Example 75 Switching AR model
Xt | xt−1, zt, zt−1 ∼ N( μzt + φ(xt−1 − μzt−1), σ² ),
Zt | zt−1 ∼ ρzt−1 I{zt = zt−1} + (1 − ρzt−1) I{zt = 1 − zt−1},
where Zt takes values in {0, 1}, with initial values z0 = 0, x0 = 0.

Monte Carlo Statistical Methods/October 29, 2001

356

States zt are not observed and the algorithm completes the sample x1, . . ., xT by simulating every zt (1 < t < T) from
P(Zt | zt−1, zt+1, xt, xt−1, xt+1) ∝ exp{ −[ (xt − μzt − φ(xt−1 − μzt−1))² + (xt+1 − μzt+1 − φ(xt − μzt))² ] / (2σ²) }
× [ ρzt−1 I{zt = zt−1} + (1 − ρzt−1) I{zt = 1 − zt−1} ] × [ ρzt I{zt+1 = zt} + (1 − ρzt) I{zt+1 = 1 − zt} ] .

Monte Carlo Statistical Methods/October 29, 2001

357

Given the prior
(μ1 − μ0) ∼ N(0, ζ²) ,   π(μ0, σ²) ∝ 1/σ ,   ρ0, ρ1 ∼ U[0,1] ,

the generation of the parameters is straightforward because of the conjugate structure.

Monte Carlo Statistical Methods/October 29, 2001

358

Example 76 Hidden Markov Poisson model  Observe Xt's depending on an unobserved Markov chain (Zt) such that (i, j = 1, 2)
Xt | zt ∼ P(λzt) ,   P(Zt = i | Zt−1 = j) = pji .
Noninformative prior
π(λ1, λ2, p11, p22) = (1/λ1) I{λ2 < λ1}

Monte Carlo Statistical Methods/October 29, 2001

359

[Figure: Data of Leroux and Puterman (1992) on the number of moves of a lamb fetus during 240 successive 5-second periods]

Monte Carlo Statistical Methods/October 29, 2001

360

Changepoint model  Sample (x1, . . . , xn) associated with a latent index τ such that
X1, . . . , Xτ | τ ∼ iid f(x|θ1) ,
Xτ+1, . . . , Xn | τ ∼ iid f(x|θ2) ,
τ ∼ π0(τ) ,
where the support of π0 is {1, . . . , n}.

Monte Carlo Statistical Methods/October 29, 2001

361

Example 77 Changepoint regression
yi ∼ N(α1 + β1 xi, σ1²)   for i = 1, . . . , τ,
yi ∼ N(α2 + β2 xi, σ2²)   for i = τ + 1, . . . , n,
with τ ∼ U{1, . . . , n}, and
ηj = (αj, βj)ᵗ ∼ N2(η0, Σ) ,   σj² ∼ IG(ν0, s0) ,
Σ^{−1} ∼ W2(ν1, V) ,   η0 ∼ N(μ, C) ,
where Wp(ν, A) is the Wishart distribution.

Monte Carlo Statistical Methods/October 29, 2001

362

Example 78 Stochastic Volatility  Popular in financial applications, in describing series with sudden changes in the magnitude of variation of the observed values through a latent linear process (Yt*), the volatility:
Yt* = φ Yt−1* + σ* εt* ,   Yt = e^{Yt*/2} εt ,
where εt* and εt ∼ N(0, 1).

Monte Carlo Statistical Methods/October 29, 2001

363

Observed likelihood L(φ, σ* | y0, . . . , yT) obtained by integrating the complete-data likelihood
Lc(φ, σ* | y0, . . . , yT, y0*, . . . , yT*) ∝ exp{ −Σ_{t=0}^T ( yt² e^{−yt*} + yt* ) / 2 } (σ*)^{−(T+1)} exp{ −[ (y0*)² + Σ_{t=1}^T (yt* − φ yt−1*)² ] / (2(σ*)²) } .

Monte Carlo Statistical Methods/October 29, 2001

364

[Figure: Simulated stochastic volatility process with σ* = 1 and φ = .9 (500 time points)]
