Based on the book Monte Carlo Statistical Methods by Christian P. Robert and George Casella, Springer-Verlag, 1999.
Introduction
Statistical Models
Likelihood Methods
Bayesian Methods
Deterministic Numerical Methods
Simulation versus numerical analysis
Experimenters' choice before fast computers: describe an accurate model, which usually precludes computation of explicit answers, or choose a standard model which would allow such computations but may not be a close representation of a realistic model. Such problems contributed to the development of simulation-based inference.
1.1
Statistical Models
Example 1 (Censored data models) Missing data models, where densities are not sampled directly. Typical simple statistical model: we observe Y_1, …, Y_n ~ f(y|θ). The distribution of the sample is given by the product
∏_{i=1}^n f(y_i|θ).
With censored random variables, the actual observations are
Y_i* = min{Y_i, u},
where u is the censoring point; inference about θ is based on the censored likelihood. For instance, if
Y ~ N(θ, σ²),
the censored likelihood involves both φ and Φ, the density and cdf of the normal N(0,1) distribution.
Similarly, if X ~ Weibull(α, γ), with density
f(x) = αγ x^(α−1) exp(−γ x^α),
the censored variable Z = X ∧ ω has the density
f(z) = αγ z^(α−1) e^(−γ z^α) 𝕀(z < ω) + ( ∫_ω^∞ αγ x^(α−1) e^(−γ x^α) dx ) δ_ω(z),
a mixture of a continuous part and a point mass at the censoring constant ω.
Example 2 (Mixture models) Models of mixtures of distributions: X ~ f_j with probability p_j, for j = 1, 2, …, k, with overall density
p_1 f_1(x) + ⋯ + p_k f_k(x).
For a sample of independent random variables (X_1, …, X_n), the sample density is
∏_{i=1}^n [ p_1 f_1(x_i) + ⋯ + p_k f_k(x_i) ].
Expanding this product involves k^n elementary terms: prohibitive to compute in large samples.
1.2
Likelihood Methods
Maximum Likelihood Methods: For an iid sample X_1, …, X_n from a population with density f(x|θ_1, …, θ_k), the likelihood function is
L(θ|x) = L(θ_1, …, θ_k | x_1, …, x_n) = ∏_{i=1}^n f(x_i|θ_1, …, θ_k).
Example 3 (Gamma MLE) X_1, …, X_n iid observations from the gamma density
f(x|α, β) = (1/(Γ(α) β^α)) x^(α−1) e^(−x/β),
where α is known. Log-likelihood:
log L(β|x_1, …, x_n) = −n log Γ(α) − nα log β + (α − 1) Σ_{i=1}^n log x_i − Σ_{i=1}^n x_i/β.
Setting the derivative in β equal to zero yields the explicit maximum likelihood estimator
β̂ = Σ_{i=1}^n x_i / (nα).
When α is also unknown, the additional equation ∂ log L(α, β|x_1, …, x_n)/∂α = 0 is particularly nasty! It involves difficult computations (including the derivative of the gamma function, the digamma function), and an explicit solution is no longer possible.
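As a quick numerical check, here is a minimal Python sketch of the explicit MLE β̂ = Σx_i/(nα) for known α; the function names are illustrative and not from the book.

```python
import math
import random

def gamma_loglik(beta, xs, alpha):
    """Log-likelihood of Ga(alpha, beta) (scale parametrization) for known alpha."""
    n = len(xs)
    return (-n * alpha * math.log(beta) - n * math.lgamma(alpha)
            + (alpha - 1) * sum(math.log(x) for x in xs) - sum(xs) / beta)

def gamma_mle_scale(xs, alpha):
    """Explicit MLE of the scale beta: sum(x_i) / (n * alpha)."""
    return sum(xs) / (len(xs) * alpha)

random.seed(0)
alpha, beta = 3.0, 2.0
xs = [random.gammavariate(alpha, beta) for _ in range(5000)]
bhat = gamma_mle_scale(xs, alpha)
# bhat maximizes the log-likelihood: nearby scale values score lower
assert gamma_loglik(bhat, xs, alpha) >= gamma_loglik(bhat * 1.05, xs, alpha)
assert gamma_loglik(bhat, xs, alpha) >= gamma_loglik(bhat * 0.95, xs, alpha)
assert abs(bhat - beta) < 0.2
```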
Example 4 (Student's t distribution) A reasonable alternative to normal errors is the Student's t distribution T(p, θ, σ), more robust against possible modeling errors. The density of T(p, θ, σ) is proportional to
σ^(−1) ( 1 + (x − θ)²/(pσ²) )^(−(p+1)/2).
When p is known and θ is the only unknown parameter, the likelihood
∏_{i=1}^n ( 1 + (x_i − θ)²/(pσ²) )^(−(p+1)/2)
may have up to n local maxima, each of which needs to be examined to determine the global maximum.
[Figure: likelihood as a function of θ, exhibiting multiple local maxima.]
Example 5 (Mixtures again) For a mixture of two normal distributions,
p N(μ, τ²) + (1 − p) N(θ, σ²),
the likelihood is proportional to
∏_{i=1}^n [ p τ^(−1) φ((x_i − μ)/τ) + (1 − p) σ^(−1) φ((x_i − θ)/σ) ],
containing 2^n terms. Standard maximization techniques often fail to find the global maximum because of the multimodality of the likelihood function.
In the special case
f(x|μ, σ) = (1 − ε) (1/√(2π)) exp{−(1/2)x²} + (ε/(σ√(2π))) exp{−(1/(2σ²))(x − μ)²}    (1)
with ε > 0 known, the likelihood is unbounded:
lim_{σ→0} ℓ(μ = x_1, σ | x_1, …, x_n) = ∞.
[Figure: likelihood of (1) as a function of (μ, σ) for a simulated N(0,1) sample.]
1.3
Bayesian Methods
In the Bayesian paradigm, the information brought by the data x, a realization of X ~ f(x|θ), is combined with prior information specified by a prior distribution with density π(θ).
Summary in a probability distribution, π(θ|x), called the posterior distribution, derived from the joint distribution f(x|θ)π(θ) according to
π(θ|x) = f(x|θ)π(θ) / ∫ f(x|θ)π(θ) dθ,    [Bayes' Theorem]
where
m(x) = ∫ f(x|θ)π(θ) dθ
is the marginal density of X.
Example 6 (Binomial Bayes estimator) For an observation X from the binomial distribution B(n, p), the (so-called) conjugate prior is the family of beta distributions Be(a, b). The classical Bayes estimator δ^π is the posterior mean
δ^π = [ Γ(a+b+n) / (Γ(a+x)Γ(n−x+b)) ] ∫_0^1 p · p^(x+a−1) (1−p)^(n−x+b−1) dp = (x + a)/(a + b + n).
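A minimal sketch checking the closed-form posterior mean (x+a)/(a+b+n) against a Monte Carlo average of draws from the Be(a+x, b+n−x) posterior; the numbers below are illustrative.

```python
import random

def beta_binomial_posterior_mean(x, n, a, b):
    # Conjugacy: the posterior is Be(a + x, b + n - x), with mean (x + a)/(a + b + n)
    return (x + a) / (a + b + n)

random.seed(1)
x, n, a, b = 7, 10, 2.0, 2.0
exact = beta_binomial_posterior_mean(x, n, a, b)          # (7+2)/(2+2+10) = 9/14
draws = [random.betavariate(a + x, b + n - x) for _ in range(200_000)]
mc = sum(draws) / len(draws)
assert abs(mc - exact) < 0.005
```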
The curse of conjugate priors: the use of conjugate priors for computational reasons
- implies a restriction on the modeling of the available prior information,
- may be detrimental to the usefulness of the Bayesian approach,
- gives an impression of subjective manipulation of the prior information, disconnected from reality.
Example 7 (Logistic regression) Standard regression model for binary (0-1) responses: the distribution of Y ∈ {0, 1} is modeled by
P(Y = 1) = p = exp(xᵗβ) / (1 + exp(xᵗβ)).
Equivalently, the logit transform of p, logit(p) = log[p/(1 − p)], satisfies
logit(p) = xᵗβ.
Computation of a confidence region on θ is quite delicate when π(θ|x) is not explicit. In particular, when the confidence region involves only one component of a vector parameter, calculation of π(θ|x) requires the integration of the joint distribution over all the other parameters.
Example 8 (Cauchy confidence regions) Let X_1, …, X_n be an iid sample from the Cauchy distribution C(θ, σ), with prior π(θ, σ) = σ^(−1). A confidence region on θ is then based on the marginal posterior
π(θ|x_1, …, x_n) ∝ ∫_0^∞ σ^(−n−1) ∏_{i=1}^n ( 1 + (x_i − θ)²/σ² )^(−1) dσ,
an integral which cannot be evaluated explicitly. Similar computational problems arise with likelihood estimation in this model.
1.4
Deterministic Numerical Methods
To solve an equation of the form f(x) = 0, the Newton-Raphson algorithm produces a sequence (x_n):
x_{n+1} = x_n − ( ∂f/∂x |_{x=x_n} )^(−1) f(x_n).
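A minimal, self-contained sketch of the Newton-Raphson iteration above (the one-dimensional case, with an illustrative target function):

```python
def newton_raphson(f, fprime, x0, tol=1e-12, max_iter=100):
    """Iterate x_{n+1} = x_n - f(x_n)/f'(x_n) until the step is below tol."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / fprime(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Root of f(x) = x^2 - 2 starting from x0 = 1: converges to sqrt(2)
root = newton_raphson(lambda x: x * x - 2.0, lambda x: 2.0 * x, 1.0)
assert abs(root - 2.0 ** 0.5) < 1e-10
```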
Optimization of a smooth function F is done using the equation ∇F(x) = 0, where ∇F denotes the gradient of F, the vector of derivatives of F. The corresponding techniques are gradient methods, where the sequence (x_n) is given by
x_{n+1} = x_n + α_n ∇F(x_n), α_n > 0,
for the maximization of F.
Evaluation of an integral
I = ∫_a^b h(x) dx
can be done by Riemann integration or by improved techniques like Simpson's rule,
I = (δ/3) [ h(a) + 4 Σ_{i=1}^n h(x_{2i−1}) + 2 Σ_{i=1}^{n−1} h(x_{2i}) + h(b) ],
where the x_k = a + kδ are the points of a grid with spacing δ = (b − a)/(2n).
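A minimal sketch of the composite quadrature with the 4/2 coefficient pattern shown above (Simpson's rule); the function names are illustrative.

```python
import math

def simpson(h, a, b, n):
    """Composite Simpson's rule over 2n subintervals of width delta."""
    delta = (b - a) / (2 * n)
    s = h(a) + h(b)
    s += 4.0 * sum(h(a + (2 * i - 1) * delta) for i in range(1, n + 1))  # odd grid points
    s += 2.0 * sum(h(a + 2 * i * delta) for i in range(1, n))            # even interior points
    return s * delta / 3.0

# integral of sin over [0, pi] is exactly 2
assert abs(simpson(math.sin, 0.0, math.pi, 100) - 2.0) < 1e-8
```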
1.5
Simulation versus Numerical Analysis
Numerical methods do not take into account the probabilistic aspects of the problem; numerical integration often focuses on regions of low probability; the occurrence of local modes of a likelihood often causes more problems for a deterministic gradient method than for simulation methods.
But simulation methods very rarely take into account the specific analytical form of the functions. (For instance, because of the randomness induced by the simulation, a gradient method yields a much faster determination of the mode of a unimodal density.) For small dimensions, integration by Riemann sums or by quadrature converges much faster than the mean of a simulated sample. It is thus often reasonable to use a numerical approach when dealing with regular functions in small dimensions.
When the statistician needs to study the details of a likelihood surface or posterior distribution, needs to simultaneously estimate several features of these functions, or when the distributions are highly multimodal, it is preferable to use a simulation-based approach.
It is fruitless to advocate the superiority of one method over the other. It is more reasonable to justify the use of simulation-based methods by the statistician in terms of his/her expertise: the intuition acquired by a statistician in the everyday processing of random models can be directly exploited in the implementation of simulation techniques.
Random Variable Generation
These methods rely on the possibility of producing (computer-wise) an endless flow of random variables (usually iid) from well-known distributions. Given a uniform random number generator, we illustrate methods that produce random variables from both standard and nonstandard distributions.
2.1
Basic Methods
2.1.1
Introduction
For a function F on ℝ, the generalized inverse of F, F⁻, is defined by
F⁻(u) = inf {x; F(x) ≥ u}.
Probability Integral Transform: If U ~ U_[0,1], then the random variable F⁻(U) has the distribution F.
Consequence: To generate a random variable X ~ F, it suffices to generate U ~ U_[0,1] and then apply the transform x = F⁻(u).
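A minimal sketch of this inverse-transform recipe for a distribution where F⁻ is available in closed form, the exponential: F(x) = 1 − e^(−λx) gives F⁻(u) = −log(1 − u)/λ. Names are illustrative.

```python
import math
import random

def exp_inverse_cdf(u, lam):
    # F(x) = 1 - exp(-lam * x)  =>  F^-(u) = -log(1 - u)/lam
    return -math.log(1.0 - u) / lam

random.seed(2)
lam = 2.0
sample = [exp_inverse_cdf(random.random(), lam) for _ in range(100_000)]
mean = sum(sample) / len(sample)
assert abs(mean - 1.0 / lam) < 0.01   # Exp(lam) has mean 1/lam
```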
2.1.2
"Any one who considers arithmetical methods of reproducing random digits is, of course, in a state of sin. As has been pointed out several times, there is no such thing as a random number---there are only methods of producing random numbers, and a strict arithmetic procedure of course is not such a method." [John von Neumann, 1951]
Production of a deterministic sequence of values in [0, 1] which imitates a sequence of iid uniform random variables U_[0,1]. Can't use the physical imitation of a random draw [no guarantee of uniformity, no reproducibility]. Random in the sense that, having generated (X_1, …, X_n), knowledge of X_n [or of (X_1, …, X_n)] imparts no discernible knowledge of the value of X_{n+1}.
Deterministic: Given the initial value X_0, the sample (X_1, …, X_n) is always the same. The validity of a random number generator is based on a single sample X_1, …, X_n as n tends to +∞, not on replications (X_{11}, …, X_{1n}), (X_{21}, …, X_{2n}), …, (X_{k1}, …, X_{kn}) where n is fixed and k tends to infinity.
2.1.3
An algorithm starting from an initial value u_0 and a transformation D produces a sequence (u_i) = (D^i(u_0)) in [0, 1]. For all n, (u_1, …, u_n) reproduces the behavior of an iid U_[0,1] sample (V_1, …, V_n) when compared through the usual tests.
Validity of the algorithm means that the sequence U_1, …, U_n leads to accept the hypothesis
H: U_1, …, U_n are iid U_[0,1].
The set of tests used is generally of some consequence:
- Kolmogorov-Smirnov tests,
- time series methods, for correlation between U_i and (U_{i−1}, …, U_{i−k}),
- nonparametric tests,
- Marsaglia's battery of tests called Die Hard (!)
2.1.4
A real-life generated random sequence takes values on {0, 1, …, M} rather than in [0, 1] [M largest integer accepted by the computer].
Period T_0 of a generator: the smallest integer T such that u_{i+T} = u_i for every i. A generator of the form X_{n+1} = f(X_n) has a period no greater than M + 1.
Warning! A uniform generator on [0, 1] should never take the values 0 and 1 [Gentle, 1998].
Congruential generator on {0, 1, …, M}: defined by the function
D(x) = (ax + b) mod (M + 1).
The period and other performance measures of congruential generators depend heavily on (a, b). With a rational, the pairs (x_n, D(x_n)) lie on parallel lines.
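A toy congruential generator along the lines of D(x) = (ax + b) mod (M + 1); the constants below (a = 69069, b = 0, modulus 2³²) are only one illustrative choice, echoing the multiplier mentioned on these slides.

```python
def congruential(seed, a=69069, b=0, m=2 ** 32, n=10):
    """Iterate D(x) = (a*x + b) mod m and rescale the states to [0, 1)."""
    out, x = [], seed
    for _ in range(n):
        x = (a * x + b) % m
        out.append(x / m)
    return out

us = congruential(seed=12345, n=1000)
assert all(0.0 <= u < 1.0 for u in us)
assert congruential(seed=12345, n=1000) == us   # deterministic: same seed, same stream
```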
[Figure: representation of the line y = 69069 x mod 1 by uniform sampling with sampling step 3·10⁻⁴.]
For a k×k matrix T with entries in {0, 1}, a shift register generator is given by the transformation
x_{n+1} = T x_n (mod 2),
where x_n is represented as a vector of binary coordinates e_{ni} ∈ {0, 1},
x_n = Σ_{i=0}^{k−1} e_{ni} 2^i.
To generate a sequence of integers X_1, X_2, …, the KISS algorithm combines three sequences of integers. First, a congruential generator
I_{n+1} = (69069 I_n + 23606797) (mod 2³²),
then two shift register generators (J_n) and (K_n). The overall sequence is
X_{n+1} = (I_{n+1} + J_{n+1} + K_{n+1}) (mod 2³²).
The period of KISS is of order 2⁹⁵, and KISS has been successfully tested on Die Hard.
2.2
Generation of any sequence of random variables can be formally implemented through a uniform generator. For distributions with explicit forms of F⁻ (for instance, exponential, double-exponential or Weibull distributions), the Probability Integral Transform can be implemented directly. Case-specific methods rely on properties of the distribution (for instance, normal distribution, Poisson distribution).
More general (indirect) methods also exist, for example the Accept-Reject and the ratio-of-uniforms methods. Simulation of the standard distributions is accomplished quite efficiently by many statistical programming packages (for instance, IMSL, Gauss, Mathematica, Matlab/Scilab, Splus/R).
2.2.1
Transformation Methods
Case where a distribution F is linked in a simple way to another distribution that is easy to simulate.
Example 9 (Exponential variables) If U ~ U_[0,1], the random variable X = −log U/λ has distribution
P(X ≤ x) = P(−log U ≤ λx) = P(U ≥ e^(−λx)) = 1 − e^(−λx),
the exponential distribution Exp(λ).
Other random variables that can be generated starting from an exponential include
Y = −2 Σ_{j=1}^ν log(U_j) ~ χ²_{2ν},
Y = −(1/β) Σ_{j=1}^a log(U_j) ~ Ga(a, β),
Y = Σ_{j=1}^a log(U_j) / Σ_{j=1}^{a+b} log(U_j) ~ Be(a, b).
Points to note: these transformations are quite simple to use; there are more efficient algorithms for gamma and beta random variables; and one cannot generate gamma random variables with a non-integer shape parameter this way. For instance, one cannot get a χ²₁ variable, which would give us an N(0, 1) variable.
Example 10 (Normal variables) In polar coordinates, if (X_1, X_2) are iid N(0, 1), the squared radius is distributed as χ²₂ = Exp(1/2) and the angle has the uniform distribution on [0, 2π]. Consequence: If U_1, U_2 are iid U_[0,1],
X_1 = √(−2 log(U_1)) cos(2πU_2),
X_2 = √(−2 log(U_1)) sin(2πU_2)
are iid N(0, 1).
Box-Muller Algorithm:
1. Generate U_1, U_2 iid U_[0,1];
2. Define
x_1 = √(−2 log(u_1)) cos(2πu_2),
x_2 = √(−2 log(u_1)) sin(2πu_2);
3. Take x_1 and x_2 as two independent draws from N(0, 1).
Unlike algorithms based on the CLT, this algorithm is exact. We get two normals for the price of two uniforms. The drawback (in speed) is in calculating log, cos and sin.
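The steps above can be sketched directly in Python (using 1−U for the first uniform so the logarithm's argument stays in (0, 1]):

```python
import math
import random

def box_muller(u1, u2):
    """Map two uniforms to two independent N(0,1) draws."""
    r = math.sqrt(-2.0 * math.log(u1))
    return r * math.cos(2.0 * math.pi * u2), r * math.sin(2.0 * math.pi * u2)

random.seed(3)
zs = []
for _ in range(50_000):
    x1, x2 = box_muller(1.0 - random.random(), random.random())
    zs.extend((x1, x2))
mean = sum(zs) / len(zs)
second = sum(z * z for z in zs) / len(zs)
assert abs(mean) < 0.03 and abs(second - 1.0) < 0.05   # matches N(0,1) moments
```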
Example 11 (Poisson generation) Poisson-exponential connection: If N ~ P(λ) and X_i ~ Exp(λ), i ∈ ℕ*, then
P_λ(N = k) = P_λ(X_1 + ⋯ + X_k ≤ 1 < X_1 + ⋯ + X_{k+1}).
A Poisson variable can thus be simulated by generating exponentials until their sum exceeds 1. This method is simple, but is really practical only for small values of λ: on average, the number of exponential variables required is λ. Other approaches are more suitable for large λ's.
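This Poisson-exponential recipe can be sketched as follows (names are illustrative):

```python
import math
import random

def poisson_via_exponentials(lam, rng):
    """Count how many Exp(lam) variables can be summed before the total exceeds 1."""
    total, k = 0.0, 0
    while True:
        total += -math.log(1.0 - rng.random()) / lam   # one Exp(lam) draw by inversion
        if total > 1.0:
            return k
        k += 1

rng = random.Random(4)
lam = 3.0
draws = [poisson_via_exponentials(lam, rng) for _ in range(100_000)]
mean = sum(draws) / len(draws)
assert abs(mean - lam) < 0.05   # E[N] = lam for N ~ P(lam)
```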
A generator of Poisson random variables can produce negative binomial random variables since
Y ~ Ga(n, (1 − p)/p) and X|y ~ P(y) imply X ~ Neg(n, p).
Mixture representation: the representation of the negative binomial is a particular case of a mixture distribution. The principle of a mixture representation is to represent a density f as the marginal of another distribution, for example
f(x) = Σ_{i ∈ Y} p_i f_i(x).
If the component distributions f_i(x) can be easily generated, X can be obtained by first choosing f_i with probability p_i and then generating an observation from f_i.
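The two-step mixture recipe above can be sketched as follows; the two-component normal mixture is an illustrative choice.

```python
import random

def sample_mixture(ps, samplers, rng):
    """First choose component i with probability p_i, then draw from f_i."""
    u, acc = rng.random(), 0.0
    for p, draw in zip(ps, samplers):
        acc += p
        if u <= acc:
            return draw(rng)
    return samplers[-1](rng)   # guard against floating-point rounding

rng = random.Random(5)
ps = [0.3, 0.7]
samplers = [lambda r: r.gauss(-2.0, 1.0), lambda r: r.gauss(3.0, 1.0)]
xs = [sample_mixture(ps, samplers, rng) for _ in range(100_000)]
mean = sum(xs) / len(xs)
assert abs(mean - (0.3 * (-2.0) + 0.7 * 3.0)) < 0.05   # mixture mean is 1.5
```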
2.2.2
Accept-Reject Methods
There are many distributions from which it is difficult, or even impossible, to directly simulate. Another class of methods only requires knowing the functional form of the density f of interest up to a multiplicative constant. The key to this method is to use a simpler (simulation-wise) density g, the instrumental density, from which the simulation from the target density f is actually done.
Accept-Reject method: Given a density of interest f, find a density g and a constant M such that
f(x) ≤ M g(x)
on the support of f. Then:
1. Generate X ~ g, U ~ U_[0,1];
2. Accept Y = X if U ≤ f(X)/(M g(X));
3. Return to 1 otherwise.
Validation of the Accept-Reject method: this algorithm produces a variable Y distributed according to f.
Two interesting properties. First, it provides a generic method to simulate from any density f that is known up to a multiplicative factor. This property is particularly important in Bayesian calculations, where the posterior distribution
π(θ|x) ∝ π(θ) f(x|θ)
is specified up to a normalizing constant. Second, the probability of acceptance in the algorithm is 1/M, i.e., the expected number of trials until a variable is accepted is M.
Some intuition. When f and g are both probability densities, the constant M is necessarily larger than 1. The size of M, and thus the efficiency of the algorithm, is a function of how closely g can imitate f, especially in the tails. For f/g to remain bounded, it is necessary for g to have tails thicker than those of f. It is therefore impossible to use the A-R algorithm to simulate a Cauchy distribution f using a normal distribution g; however, the reverse works quite well.
Example 12 (Normal from a Cauchy)
f(x) = (1/√(2π)) exp(−x²/2) and g(x) = (1/π) · 1/(1 + x²),
the densities of the normal and Cauchy distributions. Then
f(x)/g(x) = √(π/2) (1 + x²) e^(−x²/2) ≤ √(2π/e) = 1.52,
attained at x = ±1.
So the probability of acceptance is 1/1.52 = 0.66 and, on average, one out of every three simulated Cauchy variables is rejected. The mean number of trials to success is 1.52.
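A minimal sketch of this Accept-Reject sampler, simulating N(0,1) from Cauchy proposals with the bound M = √(2π/e) derived above; function names are illustrative.

```python
import math
import random

M = math.sqrt(2.0 * math.pi / math.e)   # the bound 1.52 on f/g

def normal_from_cauchy(rng):
    """Accept-Reject with target f = N(0,1) and instrumental g = Cauchy C(0,1)."""
    trials = 0
    while True:
        trials += 1
        x = math.tan(math.pi * (rng.random() - 0.5))    # Cauchy draw by inversion
        ratio = math.sqrt(math.pi / 2.0) * (1.0 + x * x) * math.exp(-x * x / 2.0) / M
        if rng.random() <= ratio:                       # accept with probability f/(M g)
            return x, trials

rng = random.Random(6)
n = 50_000
total_trials, s = 0, 0.0
for _ in range(n):
    x, t = normal_from_cauchy(rng)
    s += x
    total_trials += t
assert abs(s / n) < 0.03                  # accepted draws are centered like N(0,1)
assert abs(total_trials / n - M) < 0.05   # mean number of trials is about 1.52
```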
Example 13 (Normal from a Double Exponential) Generate an N(0, 1) by using a double-exponential distribution with density
g(x|α) = (α/2) exp(−α|x|);
then
f(x)/g(x|α) ≤ (2/α) (1/√(2π)) e^(α²/2),
and the minimum of this bound (in α) is attained for α = 1. The probability of acceptance is √(π/2e) = .76: to produce one normal random variable, this Accept-Reject algorithm requires on average 1/.76 ≈ 1.3 uniform variables, compared with the fixed single uniform (per normal) required by the Box-Muller algorithm.
Example 14 (Gamma with non-integer shape parameter) This illustrates a real advantage of the Accept-Reject algorithm. The gamma distribution Ga(α, β) can be represented as the sum of exponential random variables only if α is an integer.
We can use the Accept-Reject algorithm with instrumental distribution Ga(a, b), with a = [α] and α ≥ 1. (Without loss of generality, β = 1.) Up to a normalizing constant,
f/g_b = b^(−a) x^(α−a) exp{−(1 − b)x} ≤ b^(−a) ( (α − a)/((1 − b)e) )^(α−a)
for b < 1, the maximum in x being attained at x = (α − a)/(1 − b). The bound is minimized at b = a/α.
Example 15 (Truncated normal distributions) Truncated normals appear in many contexts. Constraints x ≥ μ₋ produce densities proportional to
e^(−(x−μ)²/(2σ²)) 𝕀(x ≥ μ₋),
for a bound μ₋ large compared with μ. There are alternatives far superior to the naïve method of generating N(μ, σ²) until exceeding μ₋, which requires an average number of 1/Φ((μ − μ₋)/σ) simulations from N(μ, σ²) for one acceptance.
Instrumental distribution: the translated exponential distribution Exp(α, μ₋), with density
g_α(z) = α e^(−α(z−μ₋)) 𝕀(z ≥ μ₋).
The ratio f/g_α is bounded by
f/g_α ≤ (1/α) exp(α²/2 − αμ₋) if α > μ₋,
f/g_α ≤ (1/α) exp(−μ₋²/2) otherwise.
Monte Carlo Integration
3.1
Introduction
Two major classes of numerical problems arise in statistical inference:
- optimization, generally associated with the likelihood approach,
- integration, generally associated with the Bayesian approach.
Example 16 (Bayes median) Bayes estimators are not always posterior expectations, but rather solutions of the minimization problem
min_δ ∫_Θ L(θ, δ) π(θ) f(x|θ) dθ.
For absolute error loss L(θ, δ) = |θ − δ|, the Bayes estimator is the posterior median, solution of
∫_{−∞}^δ π(θ) f(x|θ) dθ = ∫_δ^{+∞} π(θ) f(x|θ) dθ.
3.2
Classical Monte Carlo Integration
Generic problem: evaluate the integral
IE_f[h(X)] = ∫_X h(x) f(x) dx,
where X is uni- or multidimensional, f is a closed form, partly closed form, or implicit density, and h is a function.
First use a sample (X_1, …, X_m) from the density f to approximate the integral by the empirical average
h̄_m = (1/m) Σ_{j=1}^m h(x_j).
Then h̄_m → IE_f[h(X)] by the Strong Law of Large Numbers.
Estimate the variance with
v_m = (1/m) · (1/(m−1)) Σ_{j=1}^m [h(x_j) − h̄_m]²;
then, for m large,
(h̄_m − IE_f[h(X)]) / √v_m ≈ N(0, 1).
Note: This can lead to the construction of a convergence test and of confidence bounds on the approximation of IE_f[h(X)].
Example 17 (Cauchy prior) For estimating a normal mean, a robust prior is a Cauchy prior:
X ~ N(θ, 1), θ ~ C(0, 1).
The posterior mean is
δ^π(x) = [ ∫ (θ/(1+θ²)) e^(−(x−θ)²/2) dθ ] / [ ∫ (1/(1+θ²)) e^(−(x−θ)²/2) dθ ].
The form of δ^π suggests simulating iid variables θ_1, …, θ_m ~ N(x, 1) and calculating
δ̂^π_m(x) = [ Σ_{i=1}^m θ_i/(1+θ_i²) ] / [ Σ_{i=1}^m 1/(1+θ_i²) ].
The Law of Large Numbers implies δ̂^π_m(x) → δ^π(x) as m → ∞.
Approximation of the normal cdf
Φ(t) = ∫_{−∞}^t (1/√(2π)) e^(−y²/2) dy
by the empirical average
Φ̂(t) = (1/n) Σ_{i=1}^n 𝕀(x_i ≤ t),
based on an iid N(0, 1) sample (x_1, …, x_n).
The exact variance is Φ(t)(1 − Φ(t))/n, as the variables 𝕀(x_i ≤ t) are iid Bernoulli(Φ(t)). For values of t around t = 0 the variance is approximately 1/4n: to achieve a precision of four decimals, the approximation requires on average n = 2·10⁸ simulations, that is, 200 million iterations. Greater accuracy is achieved in the tails.
3.3
Importance Sampling
Simulation from f (the true density) is not necessarily optimal. The alternative to direct sampling from f is importance sampling, based on the alternative representation
IE_f[h(X)] = ∫_X h(x) [f(x)/g(x)] g(x) dx,
which allows the use of distributions other than f.
The importance sampling estimator
(1/m) Σ_{j=1}^m [f(X_j)/g(X_j)] h(X_j), with X_j ~ g,
converges to ∫_X h(x) f(x) dx for the same reason the regular Monte Carlo estimator h̄_m converges, and it converges for any choice of the distribution g [as long as supp(g) ⊃ supp(f)]. The instrumental distribution g is chosen from distributions easy to simulate. The same sample (generated from g) can be used repeatedly, not only for different functions h, but also for different densities f.
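A minimal importance sampling sketch: the target IE_f[X²] = 1 under f = N(0,1), with a Cauchy instrumental g (chosen here because its tails are heavier than f's). The setup is illustrative.

```python
import math
import random

def normal_pdf(x):
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def cauchy_pdf(x):
    return 1.0 / (math.pi * (1.0 + x * x))

rng = random.Random(8)
m = 200_000
total = 0.0
for _ in range(m):
    x = math.tan(math.pi * (rng.random() - 0.5))       # draw X ~ g = Cauchy C(0,1)
    total += (x * x) * normal_pdf(x) / cauchy_pdf(x)   # h(x) * f(x)/g(x)
est = total / m
assert abs(est - 1.0) < 0.02    # E_f[X^2] = 1 for f = N(0,1)
```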
Although g can be any density, some choices are better than others. The variance is finite only when
IE_g[ h²(X) f²(X)/g²(X) ] = ∫_X h²(x) [f(x)/g(x)] f(x) dx < ∞.
Instrumental distributions with tails lighter than those of f (that is, with sup f/g = ∞) are not appropriate: if sup f/g = ∞, the weights f(x_j)/g(x_j) vary widely, giving too much importance to a few values x_j.
The choice of g that minimizes the variance of the importance sampling estimator is
g*(x) = |h(x)| f(x) / ∫_Z |h(z)| f(z) dz.
This is a rather formal optimality result, since the optimal choice of g* requires knowledge of ∫ h(x)f(x)dx, the integral of interest!
Practical alternative:
[ Σ_{j=1}^m h(X_j) f(X_j)/g(X_j) ] / [ Σ_{j=1}^m f(X_j)/g(X_j) ],
where f and g are known up to constants. It also converges to ∫ h(x)f(x)dx by the Strong Law of Large Numbers. It is biased, but the bias is quite small, and in some settings it beats the unbiased estimator in squared error loss.
Example (Student's t) For f the density of a Student's t distribution, approximate
∫ x⁵ f(x) dx
by
- sampling directly from f,
- importance sampling using a Cauchy C(0, 1),
- importance sampling using a normal (expected to be nonoptimal),
- importance sampling using a U([0, 1/2.1]).
[Figure: convergence of the estimators over 10000 to 50000 simulations: sampling from f (solid lines), importance sampling with Cauchy instrumental (short dashes), U([0, 1/2.1]) instrumental (long dashes) and normal instrumental (dots).]
3.4
Acceleration Methods
(a) Use correlation to reduce variance. Specialized, but efficient when applicable. With two samples (X_1, …, X_m) and (Y_1, …, Y_m) from f to estimate
I = ∫_ℝ h(x) f(x) dx,
denote
Î_1 = (1/m) Σ_{i=1}^m h(X_i) and Î_2 = (1/m) Σ_{i=1}^m h(Y_i).
The average (Î_1 + Î_2)/2 has variance
var(Î_1)/2 + cov(Î_1, Î_2)/2,
which is reduced when the two samples are negatively correlated.
(b) Antithetic variables: constructing negatively correlated variables. If f is symmetric around μ, take Y_i = 2μ − X_i. If X_i = F⁻¹(U_i), take Y_i = F⁻¹(1 − U_i).
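The F⁻¹(1 − U) construction can be sketched with uniforms directly (F the identity on [0,1]); the integrand u² is illustrative.

```python
import random

def antithetic_estimate(h, us):
    """Average h over the uniforms U and over the antithetic variables 1 - U."""
    return sum(h(u) + h(1.0 - u) for u in us) / (2.0 * len(us))

random.seed(9)
us = [random.random() for _ in range(50_000)]
est = antithetic_estimate(lambda u: u * u, us)
assert abs(est - 1.0 / 3.0) < 0.005   # int_0^1 u^2 du = 1/3
```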
(c) Control variates: another strategy. Suppose I = ∫ h(x)f(x)dx is the desired integral and I₀ = ∫ h₀(x)f(x)dx is known. Estimate I with Î and I₀ with Î₀, and construct the combined estimator
Î* = Î + β(I₀ − Î₀).
Î* is unbiased for I and
var(Î*) = var(Î) + β² var(Î₀) − 2β cov(Î, Î₀),
which is minimized (in β) at β* = cov(Î, Î₀)/var(Î₀).
For instance, to estimate a tail probability P(X > a) with f symmetric about μ, the estimator
(1/n) Σ_{i=1}^n 𝕀(X_i > a) + β [ (1/n) Σ_{i=1}^n 𝕀(X_i > μ) − 1/2 ]
uses 𝕀(X > μ) as a control variate, since its expectation 1/2 is known.
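A runnable sketch of the control variate construction: the target ∫₀¹ e^u du = e − 1, with control h₀(u) = u whose integral 1/2 is known; the coefficient is estimated from the sample. All choices are illustrative.

```python
import math
import random

random.seed(10)
n = 100_000
us = [random.random() for _ in range(n)]

# Target I = int_0^1 e^u du = e - 1; control variate h0(u) = u with known I0 = 1/2
hs = [math.exp(u) for u in us]
I_hat = sum(hs) / n
I0_hat = sum(us) / n

# Variance-minimizing coefficient beta* = cov(h, h0) / var(h0), estimated from the sample
cov = sum((a - I_hat) * (u - I0_hat) for a, u in zip(hs, us)) / (n - 1)
var0 = sum((u - I0_hat) ** 2 for u in us) / (n - 1)
beta = cov / var0

I_star = I_hat + beta * (0.5 - I0_hat)   # combined estimator I_hat + beta*(I0 - I0_hat)
assert abs(I_star - (math.e - 1.0)) < 0.002
```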
(d) Another method: conditional expectations. Use the conditioning inequality
var(IE[δ(X)|Y]) ≤ var(δ(X)),
sometimes called Rao-Blackwellization. If δ̂ is an estimator of I = IE_f[h(X)], based on X simulated from a joint distribution f(x, y) such that ∫ f(x, y) dy = f(x), the estimator
δ̂* = IE_f[δ̂ | y_1, …, y_n]
dominates δ̂(x_1, …, x_n) in terms of variance.
The empirical average can be improved upon using the sample ((X_1, Y_1), …, (X_m, Y_m)), since
(1/m) Σ_{j=1}^m IE[exp(−X²)|Y_j] = (1/m) Σ_{j=1}^m 1/√(2σ²Y_j + 1)
is the conditional expectation. The conditional expectation can have ten times greater precision.
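A sketch of the Rao-Blackwellization idea, using X|Y=y ~ N(0, y) so that IE[exp(−X²)|Y=y] = 1/√(1+2y) is available in closed form; the mixing distribution for Y below is illustrative, not the one of the slide's example.

```python
import math
import random

rng = random.Random(11)
m = 50_000
naive_sum, rb_sum = 0.0, 0.0
for _ in range(m):
    y = rng.expovariate(1.0) + 0.5               # illustrative mixing variable Y > 0
    x = rng.gauss(0.0, math.sqrt(y))             # X | Y = y  ~  N(0, y)
    naive_sum += math.exp(-x * x)                # plain Monte Carlo term
    rb_sum += 1.0 / math.sqrt(1.0 + 2.0 * y)     # E[exp(-X^2) | Y = y], closed form
naive, rb = naive_sum / m, rb_sum / m
# Both estimate the same expectation; the conditional version is less variable
assert abs(naive - rb) < 0.01
```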
[Figure: estimators of IE[exp(−X²)] over 10000 iterations: simple average (solid lines) and conditional expectation (dots) for (ν, μ, σ) = (4.6, 0, 1).]
Markov Chains
Basic Notions Irreducibility Transience/Recurrence Invariant Measures Ergodicity and stationarity Limit Theorems
Use of Markov chains: many simulation algorithms can be described as Markov chains, and the quantity of interest is what the chain converges to. We need to know when chains converge, and what they converge to.
4.1
Basic notions
A Markov chain is a sequence of random variables that can be thought of as evolving over time, the probability of a transition depending on the particular set that the chain is in. The chain is defined through its transition kernel, a function K defined on X × B(X) such that
(i) for all x ∈ X, K(x, ·) is a probability measure;
(ii) for all A ∈ B(X), K(·, A) is measurable.
When X is discrete, the transition kernel simply is a (transition) matrix K with elements
P_xy = P(X_n = y | X_{n−1} = x), x, y ∈ X.
In the continuous case, the kernel also denotes the conditional density K(x, x′) of the transition K(x, ·):
P(X ∈ A | x) = ∫_A K(x, x′) dx′.
Given a transition kernel K, a sequence X_0, X_1, …, X_n, … of random variables is a Markov chain, denoted by (X_n), if, for any t, the conditional distribution of X_t given x_{t−1}, x_{t−2}, …, x_0 is the same as the distribution of X_t given x_{t−1}. That is,
P(X_{k+1} ∈ A | x_0, x_1, …, x_k) = P(X_{k+1} ∈ A | x_k) = ∫_A K(x_k, dx).
Example 22 (AR(1) models) A simple illustration of Markov chains on a continuous state space:
X_n = θ X_{n−1} + ε_n, θ ∈ ℝ,
with ε_n ~ N(0, σ²). If the ε_n's are independent, X_n is independent of X_{n−2}, X_{n−3}, … conditionally on X_{n−1}.
Note that the entire structure of the chain only depends on
- the transition kernel K,
- the initial state x_0 or the initial distribution of X_0.
4.2
Irreducibility
Irreducibility is one measure of the sensitivity of the Markov chain to initial conditions, and it leads to a guarantee of convergence. In the discrete case, the chain is irreducible if all states communicate, namely if
P_x(τ_y < ∞) > 0 for all x, y ∈ X,
τ_y being the first time y is visited.
In the continuous case, the chain is φ-irreducible for some measure φ if, for every A ∈ B(X) with φ(A) > 0, there exists n such that K^n(x, A) > 0 for all x ∈ X.
Example 23 (AR(1) again) With X_{n+1} = θX_n + ε_{n+1} and ε_n iid normal variables, the chain is irreducible; the reference measure is Lebesgue measure λ. In fact, K(x, A) > 0 for every x ∈ ℝ and every A such that λ(A) > 0.
If instead ε_n is uniform on [−1, 1] and θ > 1, then
X_{n+1} − X_n ≥ (θ − 1)X_n − 1 ≥ 0 for X_n ≥ 1/(θ − 1):
the chain is increasing and cannot visit previous values, so it is not irreducible.
4.2.1
Sometimes there are deterministic constraints on the moves from X_n to X_{n+1}. In the discrete case, the period of a state ω is
d(ω) = g.c.d. {m ≥ 1; K^m(ω, ω) > 0},
where g.c.d. is the greatest common divisor.
For an irreducible chain of period d on a finite space X, the transition matrix is (up to a relabeling of states) a block matrix
P =
[ 0   D_1  0   …  0 ]
[ 0   0    D_2 …  0 ]
[ …                 ]
[ D_d 0    0   …  0 ]
where the blocks D_i are stochastic matrices. From block 1 you must go to block 2, from 2 to 3, etc. You return to the initial group every d-th step.
If the chain is irreducible (so all states communicate), there is only one value for the period. An irreducible chain is aperiodic if it has period 1. If one state x ∈ X satisfies P_xx > 0, the chain (X_n) is aperiodic, although this is not a necessary condition.
For continuous chains there is a similar definition: if the transition kernel has density f(·|x_n), a sufficient condition for aperiodicity is that f(·|x_n) is positive in a neighborhood of x_n (since the chain can then remain in this neighborhood for an arbitrary number of instants before visiting any set A). For instance, in the AR(1) example, (X_n) is aperiodic when ε_n is distributed according to U_[−1,1] and |θ| < 1.
4.3
Transience and Recurrence
Irreducibility ensures that every set A will be visited by the Markov chain (X_n), but this property is too weak to ensure that the trajectory of (X_n) will enter A often enough. A Markov chain must enjoy good stability properties to guarantee an acceptable approximation of the simulated model, and formalizing this stability leads to different notions of recurrence. For discrete chains, the recurrence of a state is equivalent to a probability one of return; this is always satisfied for irreducible chains on finite spaces.
The number of visits to a state ω,
η_ω = Σ_{i=1}^∞ 𝕀_ω(X_i),
distinguishes the two cases: if IE_ω[η_ω] = ∞ the state is recurrent; if IE_ω[η_ω] < ∞ the state is transient. For irreducible chains, recurrence/transience is a property of the chain, not of a particular state. Similar definitions hold for the continuous case.
A stronger form of recurrence is Harris recurrence: a set A is Harris recurrent if P_x(η_A = ∞) = 1 for all x ∈ A. The chain (X_n) is Harris recurrent if it is ψ-irreducible and, for every set A with ψ(A) > 0, A is Harris recurrent. Note that P_x(η_A = ∞) = 1 implies IE_x[η_A] = ∞.
4.4
Invariant Measures
Stability increases for the chain (X_n) if the marginal distribution of X_n is independent of n. This requires the existence of a probability distribution π such that X_{n+1} ~ π if X_n ~ π, i.e.,
π(B) = ∫_X K(x, B) π(dx) for all B ∈ B(X).
The chain is positive recurrent if π is a probability measure; otherwise it is null recurrent. If π is a probability measure, it is also called the stationary distribution, since X_0 ~ π implies that X_n ~ π for every n. The stationary distribution is unique.
Example 24 (Back to AR(1)) For the AR(1) model X_n = θX_{n−1} + ε_n, θ ∈ ℝ, with ε_n ~ N(0, σ²), the transition kernel is N(θx_{n−1}, σ²), and N(μ, τ²) is stationary only if
μ = θμ and τ² = θ²τ² + σ².
These conditions imply μ = 0 and τ² = σ²/(1 − θ²), and hence |θ| < 1:
N(0, σ²/(1 − θ²)) is the unique stationary distribution.
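The stationary variance σ²/(1−θ²) can be checked empirically by running the AR(1) chain from its stationary distribution; the parameter values are illustrative.

```python
import math
import random

rng = random.Random(12)
theta, sigma = 0.8, 1.0
stat_var = sigma ** 2 / (1.0 - theta ** 2)      # stationary variance sigma^2/(1 - theta^2)
x = rng.gauss(0.0, math.sqrt(stat_var))          # start the chain at stationarity
sq_sum = 0.0
n = 200_000
for _ in range(n):
    x = theta * x + rng.gauss(0.0, sigma)        # X_n = theta * X_{n-1} + eps_n
    sq_sum += x * x
emp_var = sq_sum / n                             # empirical second moment (mean is 0)
assert abs(emp_var - stat_var) < 0.15
```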
4.5
Ergodicity and Stationarity
We finally consider: to what is the chain converging? The invariant distribution π is a natural candidate for the limiting distribution. A fundamental property is ergodicity, or independence of initial conditions. In the discrete case, a state ω is ergodic if
lim_{n→∞} |K^n(ω, ω) − π(ω)| = 0.
In general, we establish convergence using the total variation norm
‖μ_1 − μ_2‖_TV = sup_A |μ_1(A) − μ_2(A)|,
and we want
‖ ∫ K^n(x, ·) μ(dx) − π ‖_TV
to be small. If the chain is Harris positive recurrent and aperiodic, then
lim_{n→∞} ‖ ∫ K^n(x, ·) μ(dx) − π ‖_TV = 0
for every initial distribution μ. We thus take Harris positive recurrent and aperiodic as equivalent to ergodic. Convergence in total variation implies convergence of the expectations IE_μ[h(X_n)] for bounded functions h.
There are different speeds of convergence: ergodic (fast), geometrically ergodic (faster), uniformly ergodic (fastest).
4.6
Limit theorems
Ergodicity determines the probabilistic properties of the average behavior of the chain. But there is also a need for statistical inference, made by induction from the observed sample. If P_x^n is close to π, X_n is approximately a draw from π; otherwise, X_n ~ P_x^n carries no direct information about π.
Classical LLNs and CLTs are not directly applicable due to
- the Markovian dependence structure between the observations X_i,
- the non-stationarity of the sequence.
Ergodic Theorem (LLN). If the Markov chain (X_n) is Harris recurrent with invariant measure π, then for any function h with ∫ |h| dπ < ∞,
lim_{n→∞} (1/n) Σ_{i=1}^n h(X_i) = ∫ h(x) dπ(x).
To get a CLT, we need more assumptions. For MCMC, the easiest is reversibility: a Markov chain (X_n) is reversible if, for all n, the distribution of X_{n+1} given X_{n+2} is the same as the distribution of X_{n+1} given X_n. The direction of time does not matter.
CLT. If the Markov chain (X_n) is Harris recurrent and reversible, then
(1/√N) Σ_{n=1}^N (h(X_n) − IE^π[h]) → N(0, γ_h²),
where, with h̄ = h − IE^π[h],
0 < γ_h² = IE^π[h̄²(X_0)] + 2 Σ_{k=1}^∞ IE^π[h̄(X_0) h̄(X_k)] < ∞.
Monte Carlo Optimization
5.1
Introduction
The differences between the numerical approach and the simulation approach to the problem
max_{θ∈Θ} h(θ)
lie in the treatment of the function h. With deterministic numerical methods, the analytical properties of the target function (e.g., convexity, boundedness, smoothness) are often paramount. For the simulation approach, we are more concerned with h from a probabilistic (rather than analytical) point of view.
Distinguish between two approaches to Monte Carlo optimization:
1. Exploratory approach. Goal: to optimize h by describing its entire range; the actual properties of h play a lesser role here.
2. Probabilistic approximation of h. Monte Carlo exploits the probabilistic properties of h; this approach is mostly tied to missing data methods.
5.2
Stochastic Exploration
A first solution when Θ is bounded (which may always be achieved by reparameterization):
1. Simulate u_1, …, u_m from a uniform distribution on Θ, U_Θ;
2. Approximate max_θ h(θ) by max(h(u_1), …, h(u_m)).
(ii) h and H have the same maxima. Examples:
H(θ) ∝ exp(h(θ)/T) or H(θ) ∝ exp{h(θ)/T} / (1 + exp{h(θ)/T}).
Example 25 (Minimization) Consider minimizing
h(x, y) = (x sin(20y) + y sin(20x))² cosh(sin(10x)x) + (x cos(10y) − y sin(10x))² cosh(cos(20y)y).
[Figure: surface plot of h(x, y) on [−1, 1] × [−1, 1].]
Properties: h has many local minima, so standard methods may not find its global minimum. We can instead simulate from a density proportional to exp(−h(x, y)) and get the minimum from the resulting values h(x_i, y_i).
5.2.1
Stochastic gradient
The sequence (θ_j) is constructed by
θ_{j+1} = θ_j + α_j ∇h(θ_j), α_j > 0, [gradient sequence]
where ∇h is the gradient of h.
Stochastic variant. Stochastic perturbation: with a second sequence (β_j) and perturbations ζ_j uniform on the unit sphere ‖ζ‖ = 1, define
θ_{j+1} = θ_j + (α_j/(2β_j)) Δh(θ_j, β_j ζ_j) ζ_j,
where
Δh(x, y) = h(x + y) − h(x − y),
so that Δh(x, y)/(2‖y‖) is a finite-difference approximation of the derivative of h in the direction y.
This method does not always move along the steepest slope. This can be a plus, as it may avoid being trapped in local maxima or saddlepoints.
Example 26 (More minimization) Use the stochastic gradient method with our test function h, with different sequences of α_j's and β_j's, and observe the different convergence patterns.
Choices of (α_j):
Case 1, α_j = 1/10j: poor evaluation of the minimum and big jumps in the first iterations.
Case 2, α_j = 1/100j: converges to the closest local minimum.
Case 3, α_j = 1/10 log(1 + j): a slower decrease of (α_j) tends to achieve better minima.
Results of three different stochastic gradient runs:

α_j              β_j      θ_T               h(θ_T)      min_t h(θ_t)    Iterations T
1/10j            1/10j    (0.166, 1.02)     1.287       0.115           50
1/100j           1/100j   (0.629, 0.786)    0.00013     0.00013         93
1/10 log(1+j)    1/j      (0.0004, 0.245)   4.24·10⁻⁶   2.163·10⁻⁷      58
[Figure: stochastic gradient paths for starting point (0.65, 0.8), case (1) α_j = β_j = 1/10j; darker shades mean higher elevations.]
[Figure: stochastic gradient path, case (2) α_j = β_j = 1/100j.]
[Figure: stochastic gradient path, case (3).]
5.2.2
Simulated Annealing
The name is borrowed from metallurgy: a metal manufactured by a slow decrease of temperature (annealing) is stronger than a metal manufactured by a fast decrease of temperature. Fundamental idea: a change of scale, called temperature, allows easier and faster exploration of the function h; rescaling partially avoids being trapped in local maxima.
As T 0, the values simulated concentrate in a narrower and narrower neighborhood of the local maxima of h
Metropolis algorithm
Starting from θ_0,
1. Generate ζ uniform in a neighborhood of θ_0.
2. The new value of θ is
θ_1 = ζ with probability ρ = exp(Δh/T) ∧ 1,
θ_1 = θ_0 with probability 1 − ρ,
where Δh = h(ζ) − h(θ_0).
Features:
- If h(ζ) ≥ h(θ_0), ζ is accepted with probability 1.
- If h(ζ) < h(θ_0), ζ may still be accepted with probability ρ ≠ 0.
Features (contd.):
- If θ_0 is a local maximum of h, the algorithm escapes from θ_0 with a probability that depends on T.
- Usually, the simulated annealing algorithm modifies the temperature T at each iteration/on-line (heterogeneous Markov chain).
At iteration i,
1. Simulate ζ ∼ g(|ζ − θ_i|) [instrumental distribution];
2. Accept θ_{i+1} = ζ with probability ρ_i = exp{Δh_i/T_i} ∧ 1; take θ_{i+1} = θ_i otherwise;
3. Update T_i to T_{i+1}.
Convergence: under mild assumptions on (T_i), this algorithm is guaranteed to find the global maximum.
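A minimal sketch of the annealing loop, here for minimization on a hypothetical one-dimensional test function (the function, schedule and proposal width are illustrative assumptions, not the book's example):

```python
import math
import random

def h(x):
    # hypothetical multimodal test function (assumption, not the book's h)
    return (x * x - 4) ** 2 / 10 + 0.5 * math.sin(5 * x) + 0.5

def simulated_annealing(x0, n_iter=20000, seed=1):
    rng = random.Random(seed)
    x, best = x0, x0
    for i in range(1, n_iter + 1):
        T = 1.0 / math.log(1 + i)          # temperature schedule T_i
        y = x + rng.uniform(-0.1, 0.1)     # g uniform on [-0.1, 0.1]
        dh = h(y) - h(x)
        # for minimization: always accept downhill moves, and uphill
        # moves with probability exp(-dh / T_i)
        if dh <= 0 or rng.random() < math.exp(-dh / T):
            x = y
        if h(x) < h(best):
            best = x
    return best

best = simulated_annealing(-1.0)
```

Early on, T is large and uphill moves are accepted freely, letting the chain cross barriers between modes; as T decreases, the chain settles into a (hopefully global) minimum.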
Example 28 (More minimization). Apply the simulated annealing algorithm to minimize h, with g uniform on [−0.1, 0.1], and different sequences (T_i). The results change with the sequences (T_i).
Results of simulated annealing runs for different values of T_i and starting point (0.5, 0.4):

Case   T_i             θ_T              h(θ_T)    min_t h(θ_t)   Accept. rate
1      1/10i           (1.94, 0.480)    0.198     4.02e−07       0.9998
2      1/log(1+i)      (1.99, 0.133)    3.408     3.823e−07      0.96
3      100/log(1+i)    (0.575, 0.430)   0.0017    4.708e−09      0.6888
[Figure: simulated annealing sequences of 5000 points for starting point (0.5, 0.4) and the temperature schedules (1) T_i = 1/10i, (2) T_i = 1/log(i+1), and (3) the third schedule from the table above.]
5.2.3
Recursive integration
Also called prior feedback: a very statistical approach. Based on the result that, if there exists a unique solution θ* satisfying
θ* = arg max h(θ),
then
lim_{λ→∞} ∫ θ e^{λh(θ)} dθ / ∫ e^{λh(θ)} dθ = θ*,
provided h is continuous at θ*.
Observations with a log-likelihood function ℓ(θ|x). The MLE θ̂ can be obtained as
lim_{λ→∞} ∫ θ e^{λℓ(θ|x)} π(θ) dθ / ∫ e^{λℓ(θ|x)} π(θ) dθ = θ̂,
where π is a positive density. The ratio on the left is the Bayes (posterior-mean) estimator δ^π_λ(x) associated with the prior π and the likelihood replicated λ times, so we have
lim_{λ→∞} δ^π_λ(x) = θ̂.
Example 29 (Gamma shape). Estimation of α from
π_λ(α | x) ∝ [ x^{α−1} e^{−x} / Γ(α) ]^λ ;
IE[α | x, λ] can be obtained by simulation.
Sequence of Bayes estimators of α for the estimation of α when X ∼ Ga(α, 1) and x = 1.5:

λ            5      10     5000   10^4
IE[α|x,λ]    2.02   2.04   1.94   2.00
5.3
Stochastic Approximation
Methods that work directly with the objective function and are less concerned with fast exploration of the space. Approximations of the objective function may introduce an additional level of error. Many of these approximation methods only work in missing data models, where the likelihood g(x|θ) can be expressed as
g(x|θ) = ∫_Z f(x, z|θ) dz.
Example 30 (Censored data likelihood). Observe Y_1, …, Y_n, iid from f(y − θ). Order the observations so that y = (y_1, …, y_m) are the uncensored observations and (y_{m+1}, …, y_n) are censored (and equal to a). The likelihood function is
L(θ|y) = ∏_{i=1}^{m} f(y_i − θ) · [1 − F(a − θ)]^{n−m}.
If we had observed the last n − m values, say z = (z_{m+1}, …, z_n), with z_i > a (i = m+1, …, n), we could have constructed the (complete data) likelihood
L^c(θ|y, z) = ∏_{i=1}^{m} f(y_i − θ) ∏_{i=m+1}^{n} f(z_i − θ).
5.3.1
h(x) can be approximated by
ĥ_m(x) = (1/m) Σ_{i=1}^{m} H(x, z_i),   Z_i ∼ f(z|x).
Problems:
- ĥ(x) needs to be evaluated at many points, and thus involves the generation of many samples of Z_i's of size m.
- The sample changes with every value of x: the resulting sequence of evaluations of ĥ is usually not smooth enough.
Features:
- ĥ_m is a sum, and thus possibly has fewer analytical properties than the original h; for example, there is no smoothing effect of the integral on the integrand H(x, z).
- The choice of g is very influential in obtaining a good approximation of the function h(x).
- The number of points z_i used in the approximation should vary with x to achieve the same precision in the approximation of h(x), but this is usually impossible to assess in advance.
- When g(z) = f(z|x_0), Geyer's (1996) recursive process updates x_0 with the solution of the last optimization at each step.
Algorithm 31 (Monte Carlo maximization). At step i,
1. Generate z_1, …, z_m ∼ f(z|x_i) and compute ĥ_{g_i} with g_i = f(·|x_i).
2. Find x* = arg max ĥ_{g_i}(x).
3. Update x_i to x_{i+1} = x*.
Repeat until x_i = x_{i+1}.
Example 32 (MLE in exponential families). For MLEs in exponential families,
h(x|θ) = c(θ) e^{θ·x} = c(θ) h̃(x|θ),
the normalizing constant c(θ) may be unknown or difficult to compute (gamma or beta distributions, for example). Since
∫ h̃(x|θ) dx = 1/c(θ),
maximization of h(x|θ) in θ is equivalent to maximizing
log h̃(x|θ) − log IE_{θ̂}[ h̃(X|θ) / h̃(X|θ̂) ],
and the expectation can be approximated from a sample x_1, …, x_m ∼ h(x|θ̂) by
(1/m) Σ_i h̃(x_i|θ) / h̃(x_i|θ̂).
In more general missing-data models, the likelihood function is
L(θ|x) = ∫_Z f(x, z|θ) dz.
Thus the likelihood ratio can be written as
L(θ|x) / L(θ̂|x) = IE_{θ̂}[ f(x, Z|θ) / f(x, Z|θ̂) | x ],
with the expectation taken over Z ∼ f(z|x, θ̂).
Example 33 (ARCH models). Gaussian ARCH (autoregressive conditionally heteroscedastic) model: for t = 2, …, T,
Z_t = (1 + β Z_{t−1}²)^{1/2} ε_t,   ε_t ∼ N(0, 1),
X_t = a Z_t + η_t,   η_t ∼ N_p(0, σ² I_p).
The approximation of the likelihood ratio is then based on the simulation of the missing data Z^T = (Z_1, …, Z_T) from
f(z^T | x^T, θ) ∝ f(z^T, x^T | θ) ∝ σ^{−2T} e^{−z_1²/2} ∏_{t=2}^{T} e^{−z_t²/2(1+β z_{t−1}²)} × (the Gaussian terms in x_t − a z_t).
5.3.2
The EM Algorithm
Introduced by Dempster, Laird and Rubin (1977). Takes advantage of the representation
g(x|θ) = ∫_Z f(x, z|θ) dz
and solves a sequence of easier maximization problems whose limit is the answer to the original problem.
Note: the EM algorithm relates to MCMC algorithms in the sense that it can be seen as a forerunner of the Gibbs sampler in its data augmentation version, replacing simulation by maximization.
Suppose that we observe X_1, …, X_n, iid from g(x|θ), and want to compute
θ̂ = arg max L(θ|x) = arg max ∏_{i=1}^{n} g(x_i|θ).
We augment the data with z, where X, Z ∼ f(x, z|θ), and note the identity
k(z|θ, x) = f(x, z|θ) / g(x|θ),
where k(z|θ, x) is the conditional distribution of the missing data Z given the observed data x.
Principle. This identity leads to the following relationship between the complete-data likelihood L^c(θ|x, z) = f(x, z|θ) and the observed-data likelihood L(θ|x): for any value θ_0,
log L(θ|x) = IE_{θ_0}[log L^c(θ|x, z) | θ_0, x] − IE_{θ_0}[log k(z|θ, x) | θ_0, x],
where the expectation is with respect to k(z|θ_0, x).
Properties:
1. The strength of the EM algorithm is that we only have to deal with the first term on the right side above, as the other term can be ignored.
2. The observed likelihood is increased at every iteration: this is a convergence guarantee.
Preparation. Denote the expected log-likelihood by
Q(θ|θ_0, x) = IE_{θ_0}[log L^c(θ|x, z) | θ_0, x].
A sequence of estimators θ^(j), j = 1, 2, …, is obtained iteratively by
Q(θ^(j) | θ^(j−1), x) = max_θ Q(θ | θ^(j−1), x).
Algorithm 34 (The EM algorithm). Iterate:
1. (E-step) Compute
Q(θ | θ^(m), x) = IE_{θ^(m)}[log L^c(θ|x, z) | x];
2. (M-step) Maximize Q(θ | θ^(m), x) in θ and take
θ^(m+1) = arg max_θ Q(θ | θ^(m), x).
Example 35 (Censored data). If f(x − θ) is the N(θ, 1) density, the censored-data likelihood is
L(θ|x) = (2π)^{−m/2} exp{ −(1/2) Σ_{i=1}^{m} (x_i − θ)² } [1 − Φ(a − θ)]^{n−m},
and the complete-data log-likelihood is
log L^c(θ|x, z) ∝ −(1/2) Σ_{i=1}^{m} (x_i − θ)² − (1/2) Σ_{i=m+1}^{n} (z_i − θ)²,
where the z_i's are observations from the truncated normal distribution
k(z|θ, x) = exp{−(z − θ)²/2} / ( √(2π) [1 − Φ(a − θ)] ),   a < z.
The E-step gives (up to constants)
Q(θ | θ^(j), x) = −(1/2) Σ_{i=1}^{m} (x_i − θ)² − (1/2) Σ_{i=m+1}^{n} IE_{θ^(j)}[(Z_i − θ)²],
and maximizing in θ yields the M-step update
θ^(j+1) = ( m x̄ + (n − m) IE[Z | θ^(j)] ) / n.
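The E- and M-steps for the censored normal model combine into a one-line update, which can be sketched directly; the data values below are made up for illustration.

```python
import math

def phi(t):
    # standard normal pdf
    return math.exp(-t * t / 2) / math.sqrt(2 * math.pi)

def Phi(t):
    # standard normal cdf
    return 0.5 * (1 + math.erf(t / math.sqrt(2)))

def em_censored_normal(obs, a, n_censored, mu0=0.0, n_iter=100):
    """EM for the mean of N(mu, 1) data right-censored at a."""
    m, n = len(obs), len(obs) + n_censored
    xbar = sum(obs) / m
    mu = mu0
    for _ in range(n_iter):
        # E-step: mean of a N(mu, 1) variable truncated to (a, infinity)
        t = a - mu
        ez = mu + phi(t) / (1 - Phi(t))
        # M-step: closed-form update of the mean
        mu = (m * xbar + n_censored * ez) / n
    return mu

mu_hat = em_censored_normal([0.2, -0.5, 1.1, 0.3], a=1.5, n_censored=2)
```

Because the censored values exceed a, the EM fixed point lies above the uncensored sample mean.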
5.3.3
MCEM
A (potential) difficulty with the EM algorithm is the computation of Q(θ|θ_0, x). To overcome this difficulty, use the Monte Carlo approximation
Q̂(θ|θ_0, x) = (1/m) Σ_{i=1}^{m} log L^c(θ|x, z_i),
where Z_1, …, Z_m ∼ k(z|x, θ_0). When m → ∞, this quantity converges to Q(θ|θ_0, x).
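Replacing the exact truncated-normal expectation of the censored-data example with a Monte Carlo average gives an MCEM sketch (data and tuning values are again illustrative):

```python
import random
from statistics import NormalDist

N01 = NormalDist()

def mcem_censored_normal(obs, a, n_censored, mu0=0.0,
                         n_iter=50, m_mc=500, seed=0):
    """MCEM for the mean of N(mu, 1) data right-censored at a: the
    E-step expectation is replaced by an average over simulated
    missing values."""
    rng = random.Random(seed)
    m, n = len(obs), len(obs) + n_censored
    xbar = sum(obs) / m
    mu = mu0
    for _ in range(n_iter):
        # simulate Z ~ N(mu, 1) truncated to (a, inf) by inverse cdf
        lo = N01.cdf(a - mu)
        zs = [mu + N01.inv_cdf(lo + (1 - lo) * rng.random())
              for _ in range(m_mc)]
        ez = sum(zs) / m_mc
        # M-step is unchanged from exact EM
        mu = (m * xbar + n_censored * ez) / n
    return mu

mu_hat = mcem_censored_normal([0.2, -0.5, 1.1, 0.3], a=1.5, n_censored=2)
```

With m_mc large the trajectory tracks the exact EM sequence up to Monte Carlo noise.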
6.1
It is unnecessary to use a sample from the distribution f to approximate the integral
∫ h(x) f(x) dx.
We now obtain X_1, …, X_n ∼ f (approximately) without directly simulating from f, using an ergodic Markov chain with stationary distribution f.
Idea. For an arbitrary starting value x^(0), an ergodic chain (X^(t)) is generated using a transition kernel with stationary distribution f. This ensures the convergence in distribution of (X^(t)) to a random variable from f: for a large enough T_0, X^(T_0) can be considered as distributed from f. This produces a dependent sample X^(T_0), X^(T_0+1), …, which is generated from f and is sufficient for most approximation purposes.
6.2
6.2.1
Basics
The algorithm starts with the objective (target) density f. A conditional density q(y|x), called the instrumental (or proposal) distribution, is then chosen.
Algorithm 36 (Metropolis–Hastings). Given x^(t),
1. Generate Y_t ∼ q(y|x^(t)).
2. Take X^(t+1) = Y_t with probability ρ(x^(t), Y_t), and X^(t+1) = x^(t) with probability 1 − ρ(x^(t), Y_t), where
ρ(x, y) = min{ [f(y) q(x|y)] / [f(x) q(y|x)], 1 }.
Features:
- Always accepts upward moves.
- Independent of normalizing constants for both f and q(·|x) (constants independent of x).
- Never moves to values with f(y) = 0.
- The chain (x^(t))_t may take the same value several times in a row, even though f is a density w.r.t. Lebesgue measure.
- The sequence (y_t)_t is usually not a Markov chain.
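Algorithm 36 can be sketched generically; the N(0, 1) target and Gaussian proposal in the usage lines are illustrative choices, not part of the algorithm.

```python
import math
import random

def metropolis_hastings(log_f, propose, log_q, x0, n_iter, seed=0):
    """Generic Metropolis-Hastings: log_f is the (unnormalized) log
    target, propose(rng, x) draws y ~ q(.|x), log_q(y, x) = log q(y|x)."""
    rng = random.Random(seed)
    x = x0
    chain = [x]
    for _ in range(n_iter):
        y = propose(rng, x)
        log_rho = (log_f(y) + log_q(x, y)) - (log_f(x) + log_q(y, x))
        if rng.random() < math.exp(min(0.0, log_rho)):
            x = y                 # accept the proposal
        chain.append(x)           # otherwise x(t) is repeated
    return chain

# illustrative use: N(0, 1) target with a (symmetric) Gaussian proposal
chain = metropolis_hastings(
    log_f=lambda x: -x * x / 2,
    propose=lambda rng, x: rng.gauss(x, 1.0),
    log_q=lambda y, x: -(y - x) ** 2 / 2,
    x0=0.0, n_iter=20000)
```

Note that only ratios of f and q appear, so unknown normalizing constants cancel, exactly as stated in the features above.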
6.2.2
Convergence properties
1. The M–H Markov chain is reversible, with invariant/stationary density f, since it satisfies the detailed balance condition
f(y) K(y, x) = f(x) K(x, y).
2. As f is a probability measure, the chain is positive recurrent.
3. If
P[ f(Y_t) q(X^(t)|Y_t) ≥ f(X^(t)) q(Y_t|X^(t)) ] < 1,    (1)
that is, if the event {X^(t+1) = X^(t)} is possible, then the chain is aperiodic.
4. If q(y|x) > 0 for every (x, y), the chain is irreducible.    (2)
5. For M–H, f-irreducibility implies Harris recurrence.
6. Thus, for an M–H chain satisfying (1) and (2):
(a) for h with IE_f |h(X)| < ∞,
lim_{T→∞} (1/T) Σ_{t=1}^{T} h(X^(t)) = ∫ h(x) f(x) dx   a.e. f;
(b) and
lim_{n→∞} || ∫ K^n(x, ·) μ(dx) − f ||_TV = 0
for every initial distribution μ, where K^n(x, ·) denotes the kernel for n transitions.
6.3
6.3.1
The instrumental distribution q is independent of X^(t), and is denoted g by analogy with accept–reject.
Given x^(t),
1. Generate Y_t ∼ g(y).
2. Take X^(t+1) = Y_t with probability min{ [f(Y_t) g(x^(t))] / [f(x^(t)) g(Y_t)], 1 }, and X^(t+1) = x^(t) otherwise.
The resulting sample is not iid, but there can be strong convergence properties: the algorithm produces a uniformly ergodic chain if there exists a constant M such that
f(x) ≤ M g(x),   x ∈ supp f.
In this case,
||K^n(x, ·) − f||_TV ≤ (1 − 1/M)^n,
and the expected acceptance probability is at least 1/M.
Example 38 (Generating gamma variables). Generate the Ga(α, β) distribution using a gamma Ga([α], b) candidate with b = [α]/α (for β = 1).
Algorithm 39 (Gamma accept–reject).
1. Generate Y ∼ Ga([α], [α]/α).
2. Accept X = Y with probability
( e y exp(−y/α) / α )^{α−[α]}.
The corresponding independent M–H version:
1. Generate Y_t ∼ Ga([α], [α]/α).
2. Take X^(t+1) = Y_t with probability
[ (Y_t / x^(t)) exp{ (x^(t) − Y_t)/α } ]^{α−[α]} ∧ 1,
and X^(t+1) = x^(t) otherwise.
Comparison Close agreement in M-H and A-R, with a slight edge to M-H.
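A sketch of the independent M–H run behind this comparison, using the Example 38 parametrization; the starting value and chain length are arbitrary choices.

```python
import math
import random

def indep_mh_gamma(alpha, n_iter=20000, seed=2):
    """Independent M-H for Ga(alpha, 1) with a Ga(a, a/alpha)
    candidate, a = floor(alpha), so that f/g stays bounded."""
    rng = random.Random(seed)
    a = math.floor(alpha)
    b = a / alpha                             # candidate rate
    log_ratio = lambda x: (alpha - a) * math.log(x) - (1 - b) * x  # log f/g
    x = alpha                                 # arbitrary starting value
    chain = [x]
    for _ in range(n_iter):
        y = rng.gammavariate(a, 1.0 / b)      # gammavariate takes a scale
        log_rho = log_ratio(y) - log_ratio(x)
        if rng.random() < math.exp(min(0.0, log_rho)):
            x = y
        chain.append(x)
    return chain

chain = indep_mh_gamma(2.43)
m2 = sum(v * v for v in chain) / len(chain)   # estimates E[X^2] = 8.33
```

For α = 2.43 the chain's second moment should settle near α(α+1) ≈ 8.33, the value quoted in the figure caption.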
[Figure: accept–reject (solid line) vs. Metropolis–Hastings (dotted line) estimators of IE_f[X²] = 8.33, for α = 2.43, based on Ga(2, 2/2.43) (5000 iterations).]
6.3.2
Use the proposal Y_t = X^(t) + ε_t, where ε_t ∼ g is independent of X^(t). The instrumental density is now of the form g(y − x), and the Markov chain is a random walk if we take g to be symmetric.
Algorithm 41 (Random walk Metropolis). Given x^(t),
1. Generate Y_t ∼ g(y − x^(t)).
2. Take X^(t+1) = Y_t with probability min{ f(Y_t)/f(x^(t)), 1 }, and X^(t+1) = x^(t) otherwise.
Example 42 (Random walk normal). Generate N(0, 1) based on the uniform proposal [−δ, δ] [Hastings (1970)]. The probability of acceptance is then
ρ(x^(t), y_t) = exp{ (x^(t)² − y_t²)/2 } ∧ 1.
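The random-walk chain of Example 42 takes a few lines (chain length and seed are arbitrary):

```python
import math
import random

def rw_metropolis_normal(delta, n_iter=15000, seed=3):
    """Random-walk Metropolis for N(0, 1) with a U[-delta, delta] step;
    acceptance probability exp{(x^2 - y^2)/2} ^ 1."""
    rng = random.Random(seed)
    x = 0.0
    chain = [x]
    for _ in range(n_iter):
        y = x + rng.uniform(-delta, delta)
        if rng.random() < math.exp(min(0.0, (x * x - y * y) / 2)):
            x = y
        chain.append(x)
    return chain

chain = rw_metropolis_normal(1.0)
```

Running this for δ = 0.1, 0.5 and 1.0 reproduces the qualitative behavior summarized in the table below: tiny steps are almost always accepted but explore slowly.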
Sample statistics:

δ          0.1     0.5    1.0
mean       0.399   0.111  0.10
variance   0.698   1.11   1.06
[Figure: three samples based on U[−δ, δ] with (a) δ = 0.1, (b) δ = 0.5 and (c) δ = 1.0, superimposed with the convergence of the means (15,000 simulations).]
Convergence properties:
- Uniform ergodicity is prohibited by the random walk structure.
- At best, geometric ergodicity: for a symmetric density f, log-concave in the tails, and a positive and symmetric density g, the chain (X^(t)) is geometrically ergodic.
Example 43 (Comparison of tail effects). Random-walk Metropolis–Hastings algorithms based on a N(0, 1) instrumental distribution for the generation of (a) a N(0, 1) distribution and (b) a distribution with density ψ(x) ∝ (1 + |x|)^{−3}.
[Figure: 90% confidence envelopes of the means for cases (a) and (b), derived from 500 parallel independent chains.]
6.4
Extensions
There are many other algorithms:
- Adaptive Rejection Metropolis Sampling
- Reversible Jump
- Langevin algorithms
to name a few…
6.4.1
Reversible jump
Facts:
- There is no clear dominating measure in variable-dimension spaces.
- Gibbs sampling does not apply.
Solution:
- Create fixed-dimension moves locally.
- Supplement θ_1 from M_{k_1} and θ_2 from M_{k_2} with u_{12} and u_{21} respectively, so that (θ_1, u_{12}) and (θ_2, u_{21}) are in bijection (one-to-one correspondence):
(θ_2, u_{21}) = T(θ_1, u_{12}).
- Use acceptance probability
min{ [π(k_2, θ_2) g(u_{21})] / [π(k_1, θ_1) g(u_{12})] · | ∂T(θ_1, u_{12}) / ∂(θ_1, u_{12}) |, 1 }.
[Green, 1995]
6.4.2
Langevin Algorithms
The proposal is based on the Langevin diffusion L_t, defined by the stochastic differential equation
dL_t = dB_t + (1/2) ∇log f(L_t) dt,
where B_t is standard Brownian motion. The Langevin diffusion is the only non-explosive diffusion that is reversible with respect to f.
Discretization:
x^(t+1) = x^(t) + (σ²/2) ∇log f(x^(t)) + σ ε_t,   ε_t ∼ N_p(0, I_p),
where σ² corresponds to the discretization step. Unfortunately, the discretized chain may be transient, for instance when the drift σ² ∇log f(x)/2 grows faster than ||x|| as ||x|| → ∞.
MH correction: accept the new value Y_t with probability
[ f(Y_t) / f(x^(t)) ] · exp{ −|| x^(t) − Y_t − (σ²/2) ∇log f(Y_t) ||² / 2σ² } / exp{ −|| Y_t − x^(t) − (σ²/2) ∇log f(x^(t)) ||² / 2σ² } ∧ 1.
Choice of the scaling factor Should lead to an acceptance rate of 0.574 to achieve optimal convergence rates (when the components of x are uncorrelated)
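Putting the discretized diffusion and the M–H correction together gives the Metropolis-adjusted Langevin sketch below (one-dimensional, with an illustrative N(0, 1) target):

```python
import math
import random

def mala(log_f, grad_log_f, x0, sigma, n_iter, seed=4):
    """Metropolis-adjusted Langevin: propose
    y = x + (sigma^2/2) grad log f(x) + sigma * eps, then M-H correct."""
    rng = random.Random(seed)
    s2 = sigma * sigma

    def log_q(y, x):
        # log density of the Gaussian proposal q(y | x), up to a constant
        mean = x + 0.5 * s2 * grad_log_f(x)
        return -(y - mean) ** 2 / (2 * s2)

    x = x0
    chain = [x]
    for _ in range(n_iter):
        y = x + 0.5 * s2 * grad_log_f(x) + sigma * rng.gauss(0.0, 1.0)
        log_rho = (log_f(y) + log_q(x, y)) - (log_f(x) + log_q(y, x))
        if rng.random() < math.exp(min(0.0, log_rho)):
            x = y
        chain.append(x)
    return chain

# illustrative target N(0, 1): log f = -x^2/2, grad log f = -x
chain = mala(lambda x: -x * x / 2, lambda x: -x, 0.0, sigma=1.0, n_iter=20000)
```

In practice σ would be tuned toward the 0.574 acceptance-rate target mentioned above.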
6.4.3
The problem of the choice of transition kernel, from a practical point of view. Most common alternatives:
(a) a fully automated algorithm like ARMS;
(b) an instrumental density g which approximates f, such that f/g is bounded, for uniform ergodicity to apply;
(c) a random walk.
In both cases (b) and (c), the choice of g is critical.
Case of the independent Metropolis–Hastings algorithm: choose g to maximize the average acceptance rate
ρ = IE[ min{ f(Y) g(X) / (f(X) g(Y)), 1 } ] = 2 P( f(Y)/g(Y) ≥ f(X)/g(X) ),   X ∼ f, Y ∼ g,
which is related to the speed of convergence of (1/T) Σ_{t=1}^{T} h(X^(t)) to IE_f[h(X)] and to the ability of the algorithm to explore any complexity of f.
Practical implementation: choose a parameterized instrumental distribution g(·|θ) and adjust the parameter θ based on an estimated acceptance rate of the form
ρ̂(θ) = (2/m) Σ_{i=1}^{m} I{ f(y_i) g(x_i|θ) ≥ f(x_i) g(y_i|θ) },
computed from samples x_1, …, x_m ∼ f and y_1, …, y_m ∼ g(·|θ).
Example (Inverse Gaussian distribution). The density
f(z | θ_1, θ_2) ∝ z^{−3/2} exp{ −θ_1 z − θ_2/z + 2√(θ_1 θ_2) + log √(2θ_2) } I_{IR+}(z)
can be simulated by an independent M–H algorithm based on the Gamma distribution Ga(α, β) with β = α √(θ_1/θ_2). Since
f(x)/g(x) ∝ x^{−α−1/2} exp{ (β − θ_1)x − θ_2/x },
the analytical optimization (in α) of
M(α) = sup_x x^{−α−1/2} exp{ (β − θ_1)x − θ_2/x }
is impossible, so α is calibrated empirically:

α        0.2    0.5    0.8    0.9    1      1.1    1.2    1.5
ρ̂(α)    0.22   0.41   0.54   0.56   0.60   0.63   0.64   0.71

(the original table also reports the corresponding estimates of the quantities of interest, all in the range 1.095 to 1.181).
Case of the random walk: a different approach to acceptance rates. A high acceptance rate does not indicate that the algorithm is moving correctly; it may indicate that the random walk is moving too slowly on the surface of f. If x^(t) and y_t are close, i.e. f(x^(t)) ≈ f(y_t), then y_t is accepted with probability
min{ f(y_t)/f(x^(t)), 1 } ≈ 1.
For multimodal densities with well-separated modes, the negative effect of limited moves on the surface of f shows clearly.
If the average acceptance rate is low, the successive values of f(y_t) tend to be small compared with f(x^(t)), which means that the random walk moves quickly on the surface of f, since it often reaches the borders of the support of f.
Rule of thumb: in small dimensions, aim at an average acceptance rate of 50%; in large dimensions, at an average acceptance rate of 25%. [Gelman, Gilks and Roberts, 1995]
7.1
General Principles
A very specific simulation algorithm, based on the target f. It uses the conditional densities f_1, …, f_p from f: starting with the random variable X = (X_1, …, X_p), simulate from the conditional densities
X_i | x_1, …, x_{i−1}, x_{i+1}, …, x_p ∼ f_i(x_i | x_1, …, x_{i−1}, x_{i+1}, …, x_p)
for i = 1, 2, …, p.
Given x^(t) = (x_1^(t), …, x_p^(t)), generate
1. X_1^(t+1) ∼ f_1(x_1 | x_2^(t), …, x_p^(t));
2. X_2^(t+1) ∼ f_2(x_2 | x_1^(t+1), x_3^(t), …, x_p^(t));
…
p. X_p^(t+1) ∼ f_p(x_p | x_1^(t+1), …, x_{p−1}^(t+1)).
Then X^(t+1) → X ∼ f.
The full conditional densities f_1, …, f_p are the only densities used for simulation. Thus, even in a high-dimensional problem, all of the simulations may be univariate.
(X_t, Y_t)_t is a Markov chain, and (X_t)_t and (Y_t)_t individually are Markov chains. For example, the chain (X_t)_t has transition density
K(x, x*) = ∫ f_{Y|X}(y|x) f_{X|Y}(x*|y) dy,
with invariant density f_X(·).
Example 47 (Auto-exponential model). On IR³₊, the density
f(y_1, y_2, y_3) ∝ exp{ −(y_1 + y_2 + y_3 + θ_{12} y_1 y_2 + θ_{23} y_2 y_3 + θ_{31} y_3 y_1) },
with known θ_{ij} > 0. The full conditional densities are exponential, e.g.
Y_3 | y_1, y_2 ∼ Exp(1 + θ_{23} y_2 + θ_{31} y_1).
In contrast, the other conditionals and the marginal distributions are difficult to work with.
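The three exponential full conditionals translate directly into a Gibbs loop (the θ values and starting point below are illustrative):

```python
import random

def gibbs_autoexp(t12, t23, t31, n_iter=20000, seed=5):
    """Gibbs sampler for the auto-exponential model: each component is
    exponential given the other two, with rate 1 + interaction terms."""
    rng = random.Random(seed)
    y1 = y2 = y3 = 1.0
    chain = []
    for _ in range(n_iter):
        y1 = rng.expovariate(1 + t12 * y2 + t31 * y3)
        y2 = rng.expovariate(1 + t12 * y1 + t23 * y3)
        y3 = rng.expovariate(1 + t23 * y2 + t31 * y1)
        chain.append((y1, y2, y3))
    return chain

chain = gibbs_autoexp(0.1, 0.1, 0.1)
```

Each sweep uses only univariate exponential draws even though the target is trivariate, which is exactly the appeal of the Gibbs sampler here.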
Properties of the Gibbs sampler: formally, a special case of a sequence of 1-D M–H kernels, all with acceptance rate uniformly equal to 1. The Gibbs sampler
1. limits the choice of instrumental distributions,
2. requires some knowledge of f,
3. is, by construction, multidimensional,
4. does not apply to problems where the number of parameters varies, as the resulting chain is not irreducible.
7.1.1
Completion
The Gibbs sampler admits a much more general form. A density g is a completion of f if
∫_Z g(x, z) dz = f(x).
Purpose: g should have full conditionals that are easy to simulate, for a Gibbs sampler to be implemented with g rather than f. For p > 1, write y = (x, z) and denote the conditional densities of g(y) = g(y_1, …, y_p) by
Y_1 | y_2, …, y_p ∼ g_1(y_1 | y_2, …, y_p),
Y_2 | y_1, y_3, …, y_p ∼ g_2(y_2 | y_1, y_3, …, y_p),
…,
Y_p | y_1, …, y_{p−1} ∼ g_p(y_p | y_1, …, y_{p−1}).
The move from Y^(t) to Y^(t+1) is defined as follows.
Algorithm 48 (Completion Gibbs sampler). Given y^(t) = (y_1^(t), …, y_p^(t)), simulate
1. Y_1^(t+1) ∼ g_1(y_1 | y_2^(t), …, y_p^(t));
2. Y_2^(t+1) ∼ g_2(y_2 | y_1^(t+1), y_3^(t), …, y_p^(t));
…
p. Y_p^(t+1) ∼ g_p(y_p | y_1^(t+1), …, y_{p−1}^(t+1)).
Example 49 (Cauchy–normal). Consider the density
f(θ | θ_0) ∝ e^{−θ²/2} / [1 + (θ − θ_0)²],
the posterior from the model X|θ ∼ N(θ, 1) and θ ∼ C(θ_0, 1). Then
f(θ | θ_0) ∝ e^{−θ²/2} ∫_0^∞ e^{−[1+(θ−θ_0)²] η/2} dη,
so g(θ, η | θ_0) ∝ e^{−θ²/2} e^{−[1+(θ−θ_0)²] η/2} is a completion of f, with conditionals
η | θ ∼ Exp( [1 + (θ − θ_0)²]/2 ),
θ | η ∼ N( η θ_0/(1 + η), 1/(1 + η) ).
The parameter η is completely meaningless for the problem at hand but serves to facilitate computations.
7.1.2
Slice sampler
When the target density can be written as a product
f(θ) ∝ ∏_{i=1}^{k} f_i(θ),
it can be completed into
g(θ, ω_1, …, ω_k) ∝ ∏_{i=1}^{k} I{ 0 ≤ ω_i ≤ f_i(θ) }.
Simulate
1. ω_1^(t+1) ∼ U[0, f_1(θ^(t))];
…
k. ω_k^(t+1) ∼ U[0, f_k(θ^(t))];
k+1. θ^(t+1) ∼ U_{A^(t+1)}, with
A^(t+1) = { θ : f_i(θ) ≥ ω_i^(t+1), i = 1, …, k }.
The slice sampler usually enjoys good theoretical properties (like geometric ergodicity). As k increases, the determination of the set A^(t+1) may become increasingly complex.
Example 51 (Normal simulation). For the standard normal density, f(x) ∝ exp(−x²/2), a slice sampler is based on
ω | x ∼ U[0, exp(−x²/2)],
X | ω ∼ U[ −√(−2 log ω), √(−2 log ω) ].
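Example 51 implements in a handful of lines:

```python
import math
import random

def slice_normal(n_iter=20000, seed=6):
    """Slice sampler for f(x) prop. exp(-x^2/2):
    omega | x ~ U[0, exp(-x^2/2)], then
    x | omega ~ U[-sqrt(-2 log omega), sqrt(-2 log omega)]."""
    rng = random.Random(seed)
    x = 0.0
    chain = []
    for _ in range(n_iter):
        # (1 - random()) lies in (0, 1], keeping omega strictly positive
        omega = math.exp(-x * x / 2) * (1.0 - rng.random())
        half = math.sqrt(-2.0 * math.log(omega))
        x = rng.uniform(-half, half)
        chain.append(x)
    return chain

chain = slice_normal()
```

Here k = 1, so the set A^(t+1) is just an interval and the θ-step is a single uniform draw.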
7.1.3
If either
(i) g^(i)(y_i) > 0 for every i = 1, …, p implies that g(y_1, …, y_p) > 0, where g^(i) denotes the marginal distribution of Y_i [positivity condition], or
(ii) the transition kernel is absolutely continuous with respect to g,
then the chain is irreducible and positive Harris recurrent, and
(a) if ∫ |h(y)| g(y) dy < ∞, then
lim_{T→∞} (1/T) Σ_{t=1}^{T} h(Y^(t)) = ∫ h(y) g(y) dy   a.e. g;
(b) lim_{n→∞} || ∫ K^n(y, ·) μ(dy) − g ||_TV = 0 for every initial distribution μ.
7.1.4
Hammersley–Clifford Theorem
An illustration that the conditionals determine the joint distribution: if the joint density g(y_1, y_2) has conditional densities g_1(y_1|y_2) and g_2(y_2|y_1), then
g(y_1, y_2) = g_2(y_2|y_1) / ∫ [ g_2(v|y_1) / g_1(y_1|v) ] dv.
General case: under the positivity condition, the joint distribution g satisfies
g(y_1, …, y_p) ∝ ∏_{j=1}^{p} g_j(y_j | y_1, …, y_{j−1}, y'_{j+1}, …, y'_p) / g_j(y'_j | y_1, …, y_{j−1}, y'_{j+1}, …, y'_p)
for every fixed point y' = (y'_1, …, y'_p).
7.1.5
Hierarchical models
The Gibbs sampler is particularly well suited to hierarchical models.
Example 52 (Hierarchical models in animal epidemiology). Counts of the number of cases of clinical mastitis in 127 dairy cattle herds over a one-year period. The number of cases in herd i is
X_i ∼ P(λ_i),   i = 1, …, m,
where λ_i is the underlying rate of infection in herd i. Lack of independence might manifest itself as overdispersion.
The Gibbs sampler corresponds to the conditionals
λ_i | x, α, β_i ∼ Ga( x_i + α, [1 + 1/β_i]^{−1} ),
β_i | x, α, a, b, λ_i ∼ IG( α + a, [λ_i + 1/b]^{−1} ).
7.2
Data Augmentation
The Gibbs sampler with only two steps is particularly useful.
Algorithm 53 (Data augmentation). Given y^(t),
1. Generate Y_1^(t+1) ∼ g_1(y_1 | y_2^(t));
2. Generate Y_2^(t+1) ∼ g_2(y_2 | y_1^(t+1)).
Convergence is ensured:
(Y_1, Y_2)^(t) → (Y_1, Y_2) ∼ g,   Y_1^(t) → Y_1 ∼ g_1,   Y_2^(t) → Y_2 ∼ g_2.
Example 54 (Grouped counting data). 360 consecutive records of the number of passages per unit time.

Number of passages       0     1     2    3    4 or more
Number of observations   139   128   55   25   13
Feature: observations with 4 passages and more are grouped. If the observations are Poisson P(λ), the likelihood is
ℓ(λ | x_1, …, x_5) ∝ e^{−347λ} λ^{128+55·2+25·3} ( 1 − e^{−λ} Σ_{i=0}^{3} λ^i/i! )^{13},
which can be difficult to work with.
Idea: with a prior π(λ) = 1/λ, complete the data with the vector (y_1, …, y_13) of the 13 observations of at least 4 passages.
At iteration t,
1. Simulate y_i^(t) ∼ P(λ^(t−1)) I{y ≥ 4},   i = 1, …, 13;
2. Simulate λ^(t) ∼ Ga( 313 + Σ_{i=1}^{13} y_i^(t), 360 ).
The Rao–Blackwellized estimator of λ is then
(1/360T) Σ_{t=1}^{T} ( 313 + Σ_{i=1}^{13} y_i^(t) ).
7.2.1
Rao-Blackwellization
To approximate IE[h(Y_1)], compare the empirical average of h(Y_1^(t)) with its conditional (Rao–Blackwellized) version
(1/T) Σ_{t=1}^{T} IE[ h(Y_1) | Y_2^(t), …, Y_p^(t) ].
Then both estimators converge to IE[h(Y_1)]; both are unbiased; and
var( IE[ h(Y_1) | Y_2^(t), …, Y_p^(t) ] ) ≤ var( h(Y_1) ).
For a bivariate normal
(X, Y) ∼ N_2(0, Σ) with correlation ρ,
the conditional distributions are X | y ∼ N(ρy, 1 − ρ²) and Y | x ∼ N(ρx, 1 − ρ²).
To estimate δ = IE(X) we could use
δ_0 = (1/T) Σ_{i=1}^{T} X^(i)
or its Rao-Blackwellized version
δ_1 = (1/T) Σ_{i=1}^{T} IE[ X^(i) | Y^(i) ] = (ρ/T) Σ_{i=1}^{T} Y^(i),
which satisfies var(δ_0)/var(δ_1) > 1.
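This comparison can be sketched with a two-stage Gibbs sampler on the bivariate normal (ρ, chain length and seed are illustrative):

```python
import random

def rao_blackwell_demo(rho=0.9, n_iter=20000, seed=7):
    """Gibbs sampler on a bivariate normal with correlation rho,
    returning the empirical mean of X and its Rao-Blackwellized
    version based on E[X | Y] = rho * Y."""
    rng = random.Random(seed)
    s = (1 - rho * rho) ** 0.5       # conditional standard deviation
    x = y = 0.0
    sum_x = sum_y = 0.0
    for _ in range(n_iter):
        x = rng.gauss(rho * y, s)    # X | y ~ N(rho*y, 1 - rho^2)
        y = rng.gauss(rho * x, s)    # Y | x ~ N(rho*x, 1 - rho^2)
        sum_x += x
        sum_y += y
    delta0 = sum_x / n_iter          # plain empirical average
    delta1 = rho * sum_y / n_iter    # Rao-Blackwellized average
    return delta0, delta1

d0, d1 = rao_blackwell_demo()
```

Both estimates target IE(X) = 0; repeating the run over many seeds would show the lower variance of δ_1.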
Applied to the grouped-count example, the Rao–Blackwellized estimator of λ is
(1/360T) Σ_{t=1}^{T} ( 313 + Σ_{i=1}^{13} y_i^(t) ).
Another substantial benefit of Rao–Blackwellization is in the approximation of the densities of the different components of y, without using nonparametric density estimation methods: the estimator
(1/T) Σ_{t=1}^{T} g_1( y_1 | y_2^(t), …, y_p^(t) )
converges to the marginal density of Y_1 and is unbiased.
7.2.2
Ties together the properties of the two Markov chains in data augmentation. Consider a Markov chain (X^(t)) and a sequence (Y^(t)) of random variables generated from the conditional distributions
X^(t) | y^(t) ∼ π(x | y^(t)),   Y^(t+1) | x^(t), y^(t) ∼ f(y | x^(t), y^(t)).
Properties: if the chain (Y^(t)) is ergodic then so is (X^(t)); the conclusion holds for geometric or uniform ergodicity; and the chain (Y^(t)) can be discrete while the chain (X^(t)) is continuous.
7.2.3
Parameterization
Convergence of both Gibbs sampling and Metropolis–Hastings algorithms may suffer from a poor choice of parameterization. The overall advice is to make the components as independent as possible.
Example 56 (Random effects model). In the simple random effects model
y_ij = μ + α_i + ε_ij,   i = 1, …, I,   j = 1, …, J,
where α_i ∼ N(0, σ²_α) and ε_ij ∼ N(0, σ²_y), a suitable reparameterization makes the correlations between the α_i's, and between μ and the α_i's, lower.
7.3
Improper Priors
An unsuspected danger resulting from careless use of MCMC algorithms: it can happen that all conditional distributions are well defined and can all be simulated from, but the system of conditional distributions does not correspond to any joint distribution.
Warning: the problem comes from careless use of the Gibbs sampler in a situation where the underlying assumptions are violated.
Example 57 (Conditional exponential distributions). For the model
X_1 | x_2 ∼ Exp(x_2),   X_2 | x_1 ∼ Exp(x_1),
the only candidate f(x_1, x_2) for the joint density is
f(x_1, x_2) ∝ exp(−x_1 x_2),
but ∫∫ f(x_1, x_2) dx_1 dx_2 = ∞.
Example 58 (Improper random effects). For the random effects model
Y_ij = μ + α_i + ε_ij,   i = 1, …, I, j = 1, …, J,
where α_i ∼ N(0, σ²) and ε_ij ∼ N(0, τ²), the Jeffreys (improper) prior for the parameters μ, σ and τ is
π(μ, σ², τ²) = 1/(σ² τ²).
The conditional distributions of α_i, μ, σ² and τ² given everything else, for instance
τ² | α, μ, y ∼ IG( IJ/2, (1/2) Σ_{i,j} (y_ij − μ − α_i)² ),
are all well defined, and a Gibbs sampler can be easily implemented in this setting.
[Figure: sequence of the μ^(t)'s (1000 iterations) and frequency histogram of the corresponding observations.]
The figure shows the sequence of the μ^(t)'s and the corresponding histogram for 1000 iterations. Neither the trend of the sequence nor the histogram indicates that the corresponding joint distribution does not exist.
Final notes on impropriety:
- The improper-posterior Markov chain cannot be positive recurrent.
- The major task in such settings is to find indicators that flag that something is wrong; however, the output of an improper Gibbs sampler may not differ from that of a positive recurrent Markov chain.
- Example: the random effects model was initially treated in Gelfand et al. (1990) as a legitimate model.
Diagnosing Convergence
8.1 Stopping the Chain
8.2 Monitoring Stationarity Convergence
8.3 Monitoring Average Convergence
8.1
Convergence results do not tell us when to stop the MCMC algorithm and produce our estimates. We need methods of controlling the chain, in the sense of a stopping rule guaranteeing that the number of iterations is sufficient.
Three types of convergence:
1. Convergence to the stationary distribution: the minimal requirement for approximation of simulation from f.
2. Convergence of averages: convergence of the empirical average
(1/T) Σ_{t=1}^{T} h(θ^(t)) to IE_f[h(θ)],
the most relevant in the implementation of MCMC algorithms.
3. Convergence to iid sampling: how close a sample (θ_1^(t), …, θ_n^(t)) is to being iid.
8.1.1
Some methods involve the simulation in parallel of M independent chains (θ_m^(t)) (1 ≤ m ≤ M); some are based on a single on-line chain.
Motivations for parallel chains:
- Variability and dependence on the initial values are reduced.
- Convergence is potentially easier to control by comparing the estimation of quantities of interest over different chains.
But… in a naive implementation the slowest chain governs convergence, and the initial distribution is paramount.
8.2
8.2.1
Graphical Methods
A natural empirical approach to convergence control is to draw pictures, which may detect deviant or nonstationary behaviors. A first idea is to plot the sequence of the θ^(t)'s against t; however, this plot is only useful for strong nonstationarities of the chain.
The witch's hat distribution: on C = [0, 1]^d,
π(θ | y) ∝ [ (1 − δ) σ^{−d} e^{−||θ−y||²/(2σ²)} + δ ] I_C(θ),
a sharp normal mode at y sitting on a uniform base.
Naive implementation of the Gibbs sampler:
Algorithm 60 (Witch's hat distribution). For each component i,
1. Generate U_i ∼ U[0, 1];
2. Take θ_i ∼ U[0, 1] if U_i falls below a data-dependent weight, and θ_i ∼ N⁺(ȳ_i, σ̃², 0, 1) otherwise,
where N⁺(μ, σ², a, b) denotes the N(μ, σ²) distribution restricted to [a, b].
[Figure: chain (θ_1^(t)) for two initial values, 0.0217 (top) and 0.9098 (bottom).]
The strong attraction of the mode gives the impression of stationarity. The chain with initial value 0.9098 achieves a momentary escape from the mode, but this excursion is actually atypical.
8.2.2
Nonparametric tests of stationarity: standard nonparametric tests (Kolmogorov–Smirnov, …) exploit the fact that, when the chain is stationary, θ^(t_1) and θ^(t_2) have the same marginal distribution for arbitrary times t_1 and t_2. Other approaches are based on renewal theory, or on distance evaluations between the n-step kernel and the marginal. Remember: stationarity is not the main concern!
8.3
8.3.1
Graphical Methods
Purely graphical evaluation based on cumulative sums, graphing the partial differences
D_T^i = Σ_{t=1}^{i} [ h(θ^(t)) − S_T ],   i = 1, …, T,   [CUSUM]
where
S_T = (1/T) Σ_{t=1}^{T} h(θ^(t)).
When the mixing of the chain is high, the graph of D_T^i is highly irregular and concentrated around 0 (it looks like a Brownian bridge). Slowly mixing chains (chains with a slow pace of exploration of the stationary distribution) produce regular graphs with long excursions away from 0.
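The partial sums D_T^i are trivial to compute from a stored chain:

```python
def cusum(values):
    """Partial CUSUM differences D_T^i = sum_{t<=i} [h_t - S_T],
    where S_T is the overall average; D_T^T is always 0."""
    T = len(values)
    s_bar = sum(values) / T
    d, running = [], 0.0
    for v in values:
        running += v - s_bar
        d.append(running)
    return d
```

For a fast-mixing chain the resulting path hugs 0 like a Brownian bridge; long one-signed excursions point to slow mixing.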
But… the pathological witch's hat distribution actually produces a CUSUM plot close to the ideal shape of Yu and Mykland, and there is no indication that the chain has not yet left the mode (0.7, 0.7). This difficulty is common to most on-line methods, that is, to diagnoses based on a single chain: it is almost impossible to detect the existence of other modes. You've only seen where you've been.
8.3.2
Multiple Estimates
In most cases, the graph of the raw sequence (θ^(t)) is unhelpful. Given some quantity of interest IE_f[h(θ)], a more helpful indicator is the behavior of the averages
(1/T) Σ_{t=1}^{T} h(θ^(t))
in terms of T. A more controlled approach is to use simultaneously several convergent estimators of IE_f[h(θ)] based on the same chain (θ^(t)), until all the estimates coincide (up to a given precision).
Most common estimators:
- the empirical average S_T;
- Rao–Blackwellized versions of this average,
S_T^C = (1/T) Σ_{t=1}^{T} IE[ h(θ) | η^(t) ],
conditioning on the other components η^(t) of the chain;
- importance-sampling versions,
S_T^P = (1/T) Σ_{t=1}^{T} w_t h(θ^(t)),
where w_t = f(θ^(t))/g_t(θ^(t)) and g_t is the true density used for the simulation of θ^(t).
Note that S_T^P removes the correlation between the θ^(t)'s, so up to second order S_T^P behaves as an independent sum. This implies that var(S_T^P) decreases at speed 1/T in stationary settings; thus, nonstationarity can be detected when the decrease of the variations of S_T^P does not fit in a confidence parabola of order 1/√T.
For the normal-Cauchy posterior
π(θ | x_1, x_2, x_3) ∝ e^{−θ²/2} ∏_{i=1}^{3} 1/(1 + (x_i − θ)²),
a completion Gibbs sampling algorithm can be derived via artificial variables η_1, η_2, η_3, such that
π(θ, η_1, η_2, η_3 | x_1, x_2, x_3) ∝ e^{−θ²/2} ∏_{i=1}^{3} e^{−(1+(x_i−θ)²) η_i /2},
with conditionals η_i | θ ∼ Exp( [1 + (x_i − θ)²]/2 ) and
θ | η ∼ N( Σ_i η_i x_i / (1 + Σ_i η_i), 1/(1 + Σ_i η_i) ).
[Figure: comparison of the normal–Cauchy density and of the histogram (20,000 points).]
Efficiency of this algorithm: agreement between the histogram of the simulated θ^(t)'s and the true posterior distribution. Given a function of interest h(θ), the different approximations of IE[h(θ)] can be monitored.
[Figure: convergence of the estimators of IE[h(θ)] (thousand iterations).]
The bad behavior of the importance sampler is most likely associated with an infinite variance.
[Figure: convergence of S_T (full line), S_T^R (dotted line), S_T^C (mixed dashes) and S_T^P (long dashes) for IE[X^{0.8}] under Be(0.2, 1) (thousand iterations).]
Limitations of the method:
(1) multiple estimates may not be available;
(2) it is intrinsically conservative (the slowest ox drives the team);
(3) it cannot detect missing modes (you've only seen where you've been);
(4) the empirical and conditional estimators are often similar, while the importance sampler may suffer from infinite variance.
8.3.3
A control strategy devised by Gelman and Rubin (1992). It starts with the derivation of a distribution related to the modes of f, obtained by numerical methods, for instance a mixture of Student's t distributions centered around the identified modes of f. For every quantity of interest ξ = h(θ), the stopping rule is based on the difference between a weighted estimator of the variance and the variance of estimators from the different chains.
Denote
B_T = (1/M) Σ_{m=1}^{M} (ξ̄_m − ξ̄)²,
W_T = (1/M) Σ_{m=1}^{M} s²_m = (1/M) Σ_{m=1}^{M} (1/T) Σ_{t=1}^{T} (ξ_m^(t) − ξ̄_m)²,
with ξ_m^(t) = h(θ_m^(t)), ξ̄_m = (1/T) Σ_{t} ξ_m^(t) and ξ̄ = (1/M) Σ_{m} ξ̄_m. B_T and W_T represent the between- and within-chains variances.
The comparison of the pooled variance estimate σ̂²_T = ((T−1)/T) W_T + B_T with W_T goes through the ratio R_T, which is approximately F-distributed, so IE[R_T] ≈ 1. The stopping rule is based either on testing that the mean of R_T is equal to 1 or on confidence intervals on R_T.
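A minimal between/within sketch of the diagnostic; this is the commonly used potential-scale-reduction form, and normalization details vary across presentations.

```python
def gelman_rubin(chains):
    """Between/within variance comparison for M chains of length T:
    values near 1 indicate the chains agree."""
    M, T = len(chains), len(chains[0])
    means = [sum(c) / T for c in chains]
    grand = sum(means) / M
    B = T * sum((m - grand) ** 2 for m in means) / (M - 1)    # between
    W = sum(sum((x - m) ** 2 for x in c) / (T - 1)
            for c, m in zip(chains, means)) / M               # within
    var_hat = (T - 1) / T * W + B / T     # pooled variance estimate
    return var_hat / W

# illustrative use: chains already sampling the same distribution
import random
rng = random.Random(8)
chains = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]
r = gelman_rubin(chains)
```

Chains stuck in different modes inflate B relative to W and push the ratio well above 1.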
Example 63 (Normal-Cauchy again). Evolution of R_T for h(θ) = θ, M = 100 chains and 10,000 iterations; convergence occurs after about 6,000 iterations. But… the superimposed graph of W_T does not exhibit stationarity. The distribution of θ^(t) is in fact stationary after a few hundred iterations: the criterion is conservative.
[Figure: evolutions of R_T (solid line, scale on the left) and of W_T (dotted line, scale on the right).]
Example 64 (Witch's hat again). The density π(θ|y) has a very concentrated mode around y. Using the uniform distribution on C = [0, 1]^d as the initial distribution, the scale of R_T is very concentrated, and the stability of R_T (and of W_T) suggests convergence. But… the chain has not left the neighborhood of the mode (0.7, 0.7)!
[Figure: evolutions of R_T (solid line, scale on the left) and of W_T (dotted line, scale on the right).]
Comments: this method has enjoyed wide usage, in particular because of its simplicity and its connections with the standard tools of linear regression. However:
- the accurate construction of the initial distribution can be delicate and time-consuming;
- in some models, the number of modes is too great for a complete identification;
- the method relies on normal approximations.
In general, it is best to use a battery of tests and assessments. There are many others that we have not mentioned, for example methods based on renewal theory and methods based on discretization.
Missing data models are a natural application for simulation: simulation replaces the missing data part, so that one can proceed with a classical inference on the complete model. The EM algorithm first gave a rigorous and general formulation of statistical inference through completion of missing data, and Markov chain Monte Carlo algorithms have great potential in the analysis of missing data models.
9.1 First examples
Example 65 Rounding effect Numerous settings (surveys, medical experiments, epidemiological studies, design of experiments, quality control, etc.) produce a grouping of the original observations into less informative categories, often for reasons beyond the control of the experimenter: data coarsening. For instance, approximation bias in a study on smoking habits.
Yi ~ Exp(θ): the number yi of cigarettes smoked per day is unobserved (rounding) and instead we observe the bin Xi, where
Xi | gi, yi = [⌊yi⌋, ⌊yi⌋ + 1) if gi = 0 (cigarettes reported),
Xi | gi, yi = [20⌊yi/20⌋, 20⌊yi/20⌋ + 20) if gi = 1 (packs reported).
This means that, as yi increases, the probability of rounding the answer xi up to the nearest full pack of cigarettes also increases: under the constraint ω2 > 0, we observe the Gi's according to
Gi | yi ~ Bernoulli(Φ(ω1 + ω2 yi)),
where Φ is the cdf of the N(0, 1) distribution.
If c(xi) denotes the center of the ith bin, the likelihood function is

L(θ, ω1, ω2 | x, g) ∝ ∏_{i=1}^n [ ∫_{c(xi)−1/2}^{c(xi)+1/2} θ e^{−θy} (1 − Φ(ω1 + ω2 y)) dy ]^{1−gi} [ ∫_{c(xi)−10}^{c(xi)+10} θ e^{−θy} Φ(ω1 + ω2 y) dy ]^{gi}   [incomplete-data]

while completing with the yi's gives

∏_{i=1}^n [ θ e^{−θ yi} (1 − Φ(ω1 + ω2 yi)) I(c(xi) − 1/2 ≤ yi < c(xi) + 1/2) ]^{1−gi} [ θ e^{−θ yi} Φ(ω1 + ω2 yi) I(c(xi) − 10 ≤ yi < c(xi) + 10) ]^{gi}.
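A hedged numerical sketch of one factor of the incomplete-data likelihood above, with the bin integral evaluated by the trapezoidal rule. The parameter values θ, ω1, ω2 and the bin centers are illustrative, not estimates:

```python
import numpy as np
from math import erf, sqrt

def Phi(t):
    """Standard normal cdf."""
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))

def bin_factor(c, g, theta, w1, w2, npts=2001):
    """Likelihood contribution of one observation: bin center c, reporting
    indicator g (0: cigarettes, half-width 1/2; 1: packs, half-width 10).
    The integral over the bin is computed by the trapezoidal rule."""
    half = 10.0 if g == 1 else 0.5
    y = np.linspace(max(c - half, 0.0), c + half, npts)
    dens = theta * np.exp(-theta * y)                   # Exp(theta) density
    p = np.array([Phi(w1 + w2 * yi) for yi in y])       # P(g = 1 | y)
    f = dens * (p if g == 1 else 1.0 - p)
    return float(np.sum((f[:-1] + f[1:]) * np.diff(y)) / 2.0)

# illustrative values: 12 cigarettes reported vs. one-and-a-half packs reported
f0 = bin_factor(c=12.5, g=0, theta=0.05, w1=-2.0, w2=0.1)
f1 = bin_factor(c=30.0, g=1, theta=0.05, w1=-2.0, w2=0.1)
```

Multiplying such factors over the sample gives the incomplete-data likelihood, which is exactly the quantity that the completion by simulated yi's lets one avoid integrating.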
Contingency tables When several variables are studied simultaneously in a sample, each corresponds to a grouping of individual data. If the context is sufficiently informative to allow for a modeling of the individual data, the completion of the contingency table (by reconstruction of the individual data) may improve inference.
Example 66 Lizard habitat Observation of two characteristics of the habitat of 164 lizards:

                    Diameter ≤ 4.0 in.   Diameter > 4.0 in.
Height > 4.75 ft          32                    11
Height ≤ 4.75 ft          86                    35
Distribution of the individual observations Xijk of diameter and of height (i, j = 1, 2, k = 1, . . . , nij):
Yijk = log(Xijk) ~ N2(μ, Σ),  where  Σ = σ² ( 1 ρ ; ρ 1 ),  σ² > 0.
TN2(μ, Σ; Qij) represents the normal N2(μ, Σ) distribution restricted to one of the four quadrants Qij induced by (log(4.75), log(4)).

Prior:
π(μ, σ, ρ) ∝ σ^{−2} I_{[−1,1]}(ρ)
Gibbs sampler:
1. Simulate yijk ~ TN2(μ, Σ; Qij) (i, j = 1, 2, k = 1, . . . , nij);
2. Simulate μ ~ N2(ȳ, Σ/164);
3. Simulate σ² from the inverted gamma distribution IG(164, s/2), where s is the corresponding sum of squares over i, j, k.
Note: The distribution in step 4 requires a Metropolis-Hastings step based, for instance, on an inverse Wishart distribution.
Qualitative models Example 68 Probit regression Threshold model: observe Yi ~ Bernoulli(pi) with pi = Φ(x_i^t β), where Φ is the standard normal cdf and β ∈ IR^p. Create latent (unobservable) continuous rvs Y*_i with
Yi = 1 if Y*_i > 0, 0 otherwise.
Thus, pi = P(Yi = 1) = P(Y*_i > 0), and we have an automatic way to complete the model.
Gibbs sampler:
1. Simulate (i = 1, . . . , n)
y*_i ~ N+(x_i^t β, 1, 0) if yi = 1,  N−(x_i^t β, 1, 0) if yi = 0;
2. Simulate
β ~ Np( (Σ^{−1} + XX^t)^{−1} (Σ^{−1} β0 + Σ_i y*_i xi), (Σ^{−1} + XX^t)^{−1} ).
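A minimal Python sketch of this two-step data-augmentation sampler, in the flat-prior limit Σ^{−1} → 0, so the β update reduces to Np((XX^t)^{−1} X y*, (XX^t)^{−1}). The simulated data and the rejection sampler for the truncated normals are simplifications of mine:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(1)

def rtruncnorm(mean, positive):
    """Draw from N(mean, 1) restricted to (0, inf) if positive, else to
    (-inf, 0), by simple rejection (fine for moderate |mean|)."""
    while True:
        z = rng.normal(mean, 1.0)
        if (z > 0) == positive:
            return z

def probit_gibbs(X, y, n_iter=300):
    """Data-augmentation Gibbs sampler for probit regression with a flat
    prior on beta (the Sigma^{-1} -> 0 limit of step 2 above)."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = np.zeros(p)
    draws = np.empty((n_iter, p))
    for t in range(n_iter):
        mu = X @ beta
        # step 1: impute the latent y* from truncated normals
        ystar = np.array([rtruncnorm(mu[i], y[i] == 1) for i in range(n)])
        # step 2: draw beta from its normal full conditional
        beta = rng.multivariate_normal(XtX_inv @ X.T @ ystar, XtX_inv)
        draws[t] = beta
    return draws

# simulated data with true beta = (0.5, -1.0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
p_true = np.array([0.5 * (1 + erf((0.5 - x) / sqrt(2))) for x in X[:, 1]])
y = (rng.uniform(size=n) < p_true).astype(int)
draws = probit_gibbs(X, y)
beta_hat = draws[100:].mean(axis=0)  # posterior mean after burn-in
```

Both conditional draws are exact, which is why this completion is so convenient: no Metropolis-Hastings step is needed.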
Incomplete observations arise in numerous settings: a survey with multiple questions may include nonresponses to some personal questions; a calibration experiment may lack observations for some values of the calibration parameters; a pharmaceutical experiment on the aftereffects of a toxic product may skip some doses for a given patient.
Analysis of such structures is complicated by the fact that the failure to observe is not always explained. If the observations are missing at random, the incompletely observed data only play a role through their marginal distribution. But... these distributions are not always explicit, and a natural approach, leading to a Gibbs sampler, is to replace the missing data by simulation.
Example 70 Non-ignorable non-response Average incomes and numbers of responses/non-responses to a survey on income, by age, sex and marital status:

                   Men                     Women
            Single     Married      Single     Married
Age < 30   20.0 24/1   21.0 5/11   16.0 11/1   16.0 2/2
Age > 30   30.0 15/5   36.0 2/8    18.0 8/4     --  0/4
with
y_{a,s,m,i} ~ Exp(λ_{a,s,m}),  λ_{a,s,m} = θ0 + α_a + β_s + γ_m,  1 ≤ i ≤ n_{a,s,m},
where
a (a = 1, 2) corresponds to age (junior/senior)
s (s = 1, 2) corresponds to sex (fem./male)
m (m = 1, 2) corresponds to family status (single/married)
The model is unidentifiable, but that can be remedied by constraining α1 = β1 = γ1 = 0.
More difficult problem: nonresponse depends on the income, in the shape of a logit model,
p_{a,s,m,i} = exp{w0 + w1 y_{a,s,m,i}} / (1 + exp{w0 + w1 y_{a,s,m,i}}),
where p_{a,s,m,i} denotes the probability of nonresponse and (w0, w1) are the logit parameters.
n_{a,s,m} is the number of people by category
r_{a,s,m} is the number of responses by category
ȳ_{a,s,m} is the average of these responses by category
9.2 Finite mixtures of distributions

Mixture of distributions:
x ~ Σ_{j=1}^k p_j f(x|θ_j)
Mixtures are useful in practical modeling but challenging from an inferential point of view (i.e., for the estimation of the pj and θj): the likelihood is difficult to work with,
L(p, θ | x1, . . . , xn) ∝ ∏_{i=1}^n Σ_{j=1}^k pj f(xi|θj),
containing k^n terms.
Missing data structure Associate with every observation xi an indicator variable zi ∈ {1, . . . , k} that indicates which component of the mixture xi comes from. Demarginalized model:
zi ~ Mk(1; p1, . . . , pk),  xi | zi ~ f(x|θ_{zi}).
Completed posterior:
π(p, θ | x*_1, . . . , x*_n) ∝ ∏_{i=1}^n p_{zi} f(xi|θ_{zi}) π(p, θ),  with x*_i = (xi, zi).
Gibbs sampler:
1. Simulate zi (i = 1, . . . , n) with P(zi = j) ∝ pj f(xi|θj) (j = 1, . . . , k), and compute
nj = Σ_{i=1}^n I(zi = j),  nj x̄j = Σ_{i=1}^n I(zi = j) xi.
2. Generate (j = 1, . . . , k)
θj from its conditional with updated hyperparameters (λj ξj + nj x̄j)/(λj + nj) and λj + nj,
p ~ Dk(γ1 + n1, . . . , γk + nk).
For a normal mixture
Σ_{j=1}^k pj exp{−(x − μj)²/(2σj²)} / σj
with conjugate priors
μj | σj² ~ N(ξj, σj²/λj),  σj² ~ IG((νj + 3)/2, sj²/2):
1. Simulate (i = 1, . . . , n)
zi with P(zi = j) ∝ pj σj^{−1} exp{−(xi − μj)²/(2σj²)},
and compute
nj = Σ_{i=1}^n I(zi = j),  nj x̄j = Σ_{i=1}^n I(zi = j) xi,  ŝj² = Σ_{i=1}^n I(zi = j)(xi − x̄j)².
2. Generate (j = 1, . . . , k)
μj | σj ~ N( (λj ξj + nj x̄j)/(λj + nj), σj²/(λj + nj) ),
σj² ~ IG( (νj + nj + 3)/2, (sj² + ŝj²)/2 ),
p ~ Dk(γ1 + n1, . . . , γk + nk).
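A stripped-down Python sketch of this two-step sampler for a k = 2 normal mean mixture with known common variance, N(0, 10²) priors on the means, and a uniform Dirichlet prior on the weights (a simplification of the conjugate updates above; the data are simulated):

```python
import numpy as np

rng = np.random.default_rng(2)

def mixture_gibbs(x, k=2, sigma=1.0, n_iter=500):
    """Gibbs sampler for a k-component normal mean mixture with known common
    variance sigma^2, N(0, 10^2) priors on the mu_j, and a uniform Dirichlet
    prior on the weights (a simplified variant of steps 1-2 above)."""
    n = len(x)
    mu = rng.normal(size=k)
    p = np.full(k, 1.0 / k)
    tau2 = 100.0  # prior variance of each mu_j
    mus = np.empty((n_iter, k))
    for it in range(n_iter):
        # step 1: allocate each observation to a component
        logw = np.log(p) - (x[:, None] - mu) ** 2 / (2.0 * sigma**2)
        w = np.exp(logw - logw.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)
        z = np.array([rng.choice(k, p=w[i]) for i in range(n)])
        # step 2: update means and weights from their full conditionals
        for j in range(k):
            nj = np.sum(z == j)
            var = 1.0 / (nj / sigma**2 + 1.0 / tau2)
            mu[j] = rng.normal(var * x[z == j].sum() / sigma**2, np.sqrt(var))
        p = rng.dirichlet(1 + np.bincount(z, minlength=k))
        mus[it] = np.sort(mu)  # sort to sidestep label switching
    return mus

# two well-separated components at -2 and 2
x = np.concatenate([rng.normal(-2, 1, 150), rng.normal(2, 1, 150)])
mus = mixture_gibbs(x)
est = mus[200:].mean(axis=0)
```

With well-separated components the sampler behaves nicely; the trapping-state phenomenon discussed next appears when a component attracts very few observations.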
Good performance of the Gibbs sampler is guaranteed by the duality principle. However, practical implementation of the Gibbs sampler might face serious convergence difficulties (the phenomenon of the absorbing component).
Trapping states When only a small number of observations are allocated to a given component j0, the following probabilities are quite small: (1) the probability of allocating new observations to the component j0; (2) the probability of reallocating, to another component, observations already allocated to j0.
A paradox While the Gibbs sampler chain (z(t), θ(t)) is irreducible, there exist (almost) absorbing states (or trapping states) which require an enormous number of iterations of the Gibbs sampler to escape from.
Example 74 Acidity level in lakes 149 observations of acidity levels in lakes in the American North-East. Mixture model fit with the Gibbs sampler. Lack of evolution of the estimated (plug-in) density from the Gibbs sampler as the number of iterations increases: a phenomenon which occurs often in mixture settings, due to the weak identifiability of these models.
[Figure: estimated plug-in densities after T = 500, 1000, 2000, 3000, 4000 and 5000 iterations]
9.3 Extensions

Relaxation of the independence assumption between observations leads to hidden Markov chains. Different constraints on the changes of components correspond to changepoint models.
Example 75 Switching AR model
Xt | xt−1, zt, zt−1 ~ N(μ_{zt} + φ(xt−1 − μ_{zt−1}), σ²),
Zt | zt−1 ~ ϖ_{zt−1} I_{zt−1}(zt) + (1 − ϖ_{zt−1}) I_{1−zt−1}(zt),
where Zt takes values in {0, 1}, with initial values z0 = 0, x0 = 0.
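A short Python sketch simulating this switching AR(1) process; the parameter values μ, φ, σ and the persistence probabilities ϖ are illustrative choices of mine, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate_switching_ar(T, mu, phi, sigma, stay):
    """Simulate the switching AR(1): z_t stays in its current state with
    probability stay[z_{t-1}], and
    x_t ~ N(mu[z_t] + phi * (x_{t-1} - mu[z_{t-1}]), sigma^2)."""
    z = np.zeros(T + 1, dtype=int)   # z_0 = 0
    x = np.zeros(T + 1)              # x_0 = 0
    for t in range(1, T + 1):
        z[t] = z[t - 1] if rng.uniform() < stay[z[t - 1]] else 1 - z[t - 1]
        x[t] = rng.normal(mu[z[t]] + phi * (x[t - 1] - mu[z[t - 1]]), sigma)
    return x[1:], z[1:]

# illustrative parameter values (not from the slides)
x, z = simulate_switching_ar(T=500, mu=(-1.0, 3.0), phi=0.5,
                             sigma=0.5, stay=(0.95, 0.90))
```

The completion step below inverts this simulation: given the x's, it draws each hidden zt from its full conditional.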
States zt are not observed, and the algorithm completes the sample x1, . . . , xT by simulating every zt (1 < t < T) from
P(Zt = zt | zt−1, zt+1, xt, xt−1, xt+1)
∝ exp{ −[ (xt − μ_{zt} − φ(xt−1 − μ_{zt−1}))² + (xt+1 − μ_{zt+1} − φ(xt − μ_{zt}))² ] / (2σ²) }
× [ ϖ_{zt−1} I_{zt−1}(zt) + (1 − ϖ_{zt−1}) I_{1−zt−1}(zt) ]
× [ ϖ_{zt} I_{zt}(zt+1) + (1 − ϖ_{zt}) I_{1−zt}(zt+1) ].
Example 76 Hidden Markov Poisson model Observe Xt's depending on an unobserved Markov chain (Zt) such that (i, j = 1, 2)
Xt | zt ~ P(λ_{zt}),  P(Zt = i | Zt−1 = j) = p_{ji}.
Noninformative prior
π(λ1, λ2, p11, p22) ∝ λ1^{−1} I_{λ2<λ1}.
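The model above can be simulated directly; a minimal Python sketch, where the rates λ and the transition matrix are illustrative values of mine (chosen to mimic a mostly quiet chain with occasional active periods, not estimates from the fetal-movement data):

```python
import numpy as np

rng = np.random.default_rng(4)

def simulate_hmm_poisson(T, lam, P):
    """Simulate the hidden Markov Poisson model: Z_t a two-state Markov
    chain with P[j, i] = p_ji = P(Z_t = i | Z_{t-1} = j), and
    X_t | z_t ~ Poisson(lam[z_t])."""
    z = np.zeros(T, dtype=int)
    z[0] = rng.integers(2)
    for t in range(1, T):
        z[t] = rng.choice(2, p=P[z[t - 1]])
    x = rng.poisson(np.asarray(lam)[z])
    return x, z

# illustrative rates and transitions (a mostly quiet state and an active one)
lam = (0.2, 3.0)
P = np.array([[0.95, 0.05],
              [0.20, 0.80]])
x, z = simulate_hmm_poisson(240, lam, P)
```

Fitting the model by Gibbs sampling proceeds as in the mixture case, with the allocation step replaced by a draw of the hidden Markov chain.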
[Figure: data of Leroux and Puterman (1992) on the number of moves of a lamb fetus during 240 successive 5-second periods]
Changepoint model Sample (x1, . . . , xn) associated with a latent index τ such that
X1, . . . , Xτ | τ ~ iid f(x|θ1),
Xτ+1, . . . , Xn | τ ~ iid f(x|θ2),
τ ~ π0(τ),
where the support of π0 is {1, . . . , n}.
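Under a uniform π0, the full conditional of τ (for known θ1, θ2) is proportional to ∏_{i≤τ} f(xi|θ1) ∏_{i>τ} f(xi|θ2). A Python sketch of this draw; the normal densities and the true changepoint at 60 are illustrative choices of mine:

```python
import numpy as np

rng = np.random.default_rng(5)

def sample_tau(x, logf1, logf2):
    """Draw tau from P(tau | x) proportional to
    prod_{i<=tau} f(x_i|theta1) * prod_{i>tau} f(x_i|theta2),
    under a uniform pi0 on {1, ..., n}."""
    n = len(x)
    l1 = np.cumsum(logf1(x))               # sums of log f(x_i|theta1), i <= tau
    l2 = np.cumsum(logf2(x)[::-1])[::-1]   # sums of log f(x_i|theta2), i > tau
    logp = np.array([l1[t - 1] + (l2[t] if t < n else 0.0)
                     for t in range(1, n + 1)])
    p = np.exp(logp - logp.max())          # stabilize before normalizing
    return 1 + rng.choice(n, p=p / p.sum())   # tau in {1, ..., n}

# normal observations, means 0 then 3, true changepoint at 60
x = np.concatenate([rng.normal(0, 1, 60), rng.normal(3, 1, 40)])
tau = sample_tau(x,
                 logf1=lambda v: -0.5 * v**2,        # log N(0,1), up to const
                 logf2=lambda v: -0.5 * (v - 3)**2)  # log N(3,1), up to const
```

In a full Gibbs sampler this draw alternates with updates of θ1 and θ2 given the current split of the sample.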
Example 78 Stochastic volatility Popular in financial applications for describing series with sudden changes in the magnitude of variation of the observed values, through a latent linear process (Y*t), the log-volatility:
Y*t = φ Y*t−1 + σ ε*t,  Yt = e^{Y*t/2} εt,
where ε*t and εt are iid N(0, 1).
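A minimal Python sketch simulating this model (φ = .9 matches the value used in the slides' illustration; σ = 0.5 is an illustrative choice of mine):

```python
import numpy as np

rng = np.random.default_rng(6)

def simulate_sv(T, phi, sigma):
    """Simulate the stochastic volatility model: latent log-volatility
    ystar_t = phi * ystar_{t-1} + sigma * eps*_t and observation
    y_t = exp(ystar_t / 2) * eps_t, with eps*_t, eps_t iid N(0, 1)."""
    ystar = np.zeros(T + 1)   # ystar_0 = 0
    y = np.zeros(T + 1)
    for t in range(1, T + 1):
        ystar[t] = phi * ystar[t - 1] + sigma * rng.normal()
        y[t] = np.exp(ystar[t] / 2.0) * rng.normal()
    return y[1:], ystar[1:]

y, ystar = simulate_sv(T=500, phi=0.9, sigma=0.5)
```

The persistence of the latent AR(1) process produces the clusters of large |yt| values that motivate the model.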
The conditional distribution of the latent process (y*t) given the observations is proportional to
exp{ −Σ_{t=1}^T [ yt² e^{−y*t} + y*t ] / 2 } × exp{ −Σ_{t=1}^T (y*t − φ y*t−1)² / (2σ²) }.
[Figure: simulated stochastic volatility series of length 500, φ = .9]