
1 Basic Probability Theory

1.1 Conditional Probability

$$p(y|x) = \frac{p(x, y)}{p_X(x)}$$

p(x, y): joint probability/density function of (X, Y)

p_X(x): marginal probability/density function of X

$$\int_y p(y|x)\,dy = 1 \qquad \Big(\sum_y p(y|x) = 1 \text{ in the discrete case}\Big)$$

Ex. (X_1, ..., X_n): n binomial trials with probability of success p

$$\Rightarrow Y = \sum_{i=1}^n X_i \sim \mathrm{Bin}(n, p), \qquad
p(x|y) = \frac{p^y(1-p)^{n-y}}{\binom{n}{y}p^y(1-p)^{n-y}} = \frac{1}{\binom{n}{y}}. \quad \square$$

• p(x, y) = p(x|y)p(y) = p(y|x)p(x)

• Bayes' rule

$$p(x|y) = \frac{p(y|x)\,p_X(x)}{\int_x p(y|x)\,p_X(x)\,dx}$$

Ex. Y ∼ Bin(N, θ): # of defectives in N products

X: # of defectives in n samples (sampled without replacement)

X|Y = y ∼ H(y, N, n) (hypergeometric distribution)


$$p(x, y) = \binom{N}{y}\theta^y(1-\theta)^{N-y}\,\frac{\binom{y}{x}\binom{N-y}{n-x}}{\binom{N}{n}}$$

$$p(Y = y|X = x)
= \frac{\binom{N}{y}\theta^y(1-\theta)^{N-y}\binom{y}{x}\binom{N-y}{n-x}}
       {\sum_y \binom{N}{y}\theta^y(1-\theta)^{N-y}\binom{y}{x}\binom{N-y}{n-x}}
= \binom{N-n}{y-x}\theta^{y-x}(1-\theta)^{N-n-(y-x)}. \quad \square$$
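The identity above says that, given X = x defectives in the sample, the remaining count Y − x among the N − n unsampled items is Bin(N − n, θ). Below is a minimal simulation sketch of this check; the parameter values and variable names are illustrative, not part of the notes.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)
N, n, theta = 20, 5, 0.3                           # illustrative values
reps = 200_000

Y = rng.binomial(N, theta, size=reps)              # total defectives among N products
X = rng.hypergeometric(Y, N - Y, n)                # defectives in the sample of size n

x = 1                                              # condition on X = x
rest = Y[X == x] - x                               # Y - x given X = x
ks = np.arange(N - n + 1)
emp = np.array([(rest == k).mean() for k in ks])
thy = np.array([comb(N - n, k) * theta**k * (1 - theta)**(N - n - k) for k in ks])
print(np.round(emp, 3))
print(np.round(thy, 3))                            # the two rows should roughly agree
```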

1.2 Conditional Expectation

$$E(X|Y = y) = \int_x x\,p(x|y)\,dx$$

Ex. (X1 , . . . , Xn ): n binomial trials with prob. of success p


$$\Rightarrow Y = \sum_{i=1}^n X_i \sim \mathrm{Bin}(n, p), \qquad
E(X_1|Y = y) = p(X_1 = 1|Y = y) = \frac{\binom{n-1}{y-1}}{\binom{n}{y}} = \frac{y}{n}. \quad \square$$

Useful identities of conditional expectation

• Double Expectation E(E(X|Y )) = E(X)

$$\int\!\!\int x\,p(x|y)\,dx\,p(y)\,dy = \int\!\!\int x\,p(x, y)\,dx\,dy = E(X)$$

• E{r(X)h(Y)} = E[h(Y) E{r(X)|Y}]

Ex. (X1 , . . . , Xn ): n binomial trials with prob. of success p


$$\Rightarrow Y = \sum_{i=1}^n X_i \sim \mathrm{Bin}(n, p), \qquad
E\{E(X_1|Y)\} = E\Big(\frac{Y}{n}\Big) = \frac{np}{n} = p = E(X_1). \quad \square$$

Ex. X_1, X_2 ∼iid U(0, 1). Let Y = min(X_1, X_2), Z = max(X_1, X_2).

The joint distribution of (Y, Z):

$$P(Y \le y, Z \le z) = 2P(X_1 < X_2,\, X_1 \le y,\, X_2 \le z)
= 2\int_0^z\!\!\int_0^{\min(x_2, y)} dx_1\,dx_2
= 2\int_0^z \min(x_2, y)\,dx_2
= \begin{cases} 2yz - y^2, & 0 < y \le z < 1 \\ z^2, & 0 < z \le y < 1 \end{cases}$$

$$p(y, z) = \frac{\partial^2}{\partial y\,\partial z} P(Y \le y, Z \le z)
= \begin{cases} 2, & 0 < y \le z < 1 \\ 0, & \text{o.w.} \end{cases}$$

$$p_Y(y) = \int_y^1 p(y, z)\,dz = 2(1 - y), \quad 0 < y < 1$$

$$p(z|y) = \frac{2}{2(1 - y)} = \frac{1}{1 - y}, \quad 0 < y \le z < 1$$

$$E(Z|Y = y) = \int_y^1 z\,\frac{1}{1 - y}\,dz = \frac{1 + y}{2}, \quad 0 < y < 1. \quad \square$$
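A short Monte Carlo sketch (mine, not from the notes) of the last identity, E(Z | Y = y) = (1 + y)/2, conditioning on a small window around a few values of y:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(size=(1_000_000, 2))
y_samp, z_samp = x.min(axis=1), x.max(axis=1)      # Y = min, Z = max

for y0 in (0.2, 0.5, 0.8):
    sel = np.abs(y_samp - y0) < 0.01               # condition on Y near y0
    print(y0, round(z_samp[sel].mean(), 3), (1 + y0) / 2)   # last two should agree
```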

1.3 Distribution for Transformations of Random Vectors

• Jacobian: for h = (h_1, ..., h_k)′ : R^k → R^k and t = (t_1, ..., t_k)′,

$$J_h(t) = \begin{vmatrix}
\frac{\partial}{\partial t_1}h_1(t) & \cdots & \frac{\partial}{\partial t_1}h_k(t) \\
\vdots & \ddots & \vdots \\
\frac{\partial}{\partial t_k}h_1(t) & \cdots & \frac{\partial}{\partial t_k}h_k(t)
\end{vmatrix}$$

• Y = g(X):

$$p_Y(y) = p_X(g^{-1}(y))\,|J_{g^{-1}}(y)| = p_X(g^{-1}(y))\,\frac{1}{|J_g(g^{-1}(y))|}$$

p.f. Let A_k = {x ∈ R^k : g_i(x) ≤ y_i, i = 1, ..., k}. Changing variables t = g(x),

$$F_Y(y) = \int\cdots\int_{A_k} p_X(x_1, \ldots, x_k)\,dx_1\cdots dx_k
= \int\cdots\int_{\{t:\ t_i \le y_i,\ i=1,\ldots,k\}} p_X(g^{-1}(t))\,|J_{g^{-1}}(t)|\,dt_1\cdots dt_k,$$

and differentiating with respect to y_1, ..., y_k gives the density formula. □

Theorem. Let X_1 ⊥⊥ X_2 with X_1 ∼ Γ(p, λ), X_2 ∼ Γ(q, λ), and define

$$Y_1 = X_1 + X_2, \qquad Y_2 = \frac{X_1}{X_1 + X_2}.$$

Then

• Y_1 ⊥⊥ Y_2

• Y_1 ∼ Γ(p + q, λ), Y_2 ∼ β(p, q)

Recall the densities:

$$\Gamma(p, \lambda):\ \frac{\lambda^p x^{p-1} e^{-\lambda x}}{\Gamma(p)}, \qquad
\Gamma(p) = \int_0^\infty t^{p-1}e^{-t}\,dt, \quad \Gamma(p+1) = p\,\Gamma(p)$$

$$\beta(r, s):\ \frac{x^{r-1}(1 - x)^{s-1}}{B(r, s)}, \qquad
B(r, s) = \frac{\Gamma(r)\Gamma(s)}{\Gamma(r + s)}$$

[Figure: gamma densities for Γ(0.5, 0.5), Γ(1, 1), Γ(10, 10) on (0, 5), and beta densities for β(0.5, 9.5), β(0.25, 0.25), β(3, 3), β(1, 1) on (0, 1).]

p.f. The joint density of (X_1, X_2) is

$$p(x_1, x_2) = \frac{\lambda^{p+q}}{\Gamma(p)\Gamma(q)}\, x_1^{p-1} x_2^{q-1} e^{-\lambda(x_1 + x_2)}, \qquad x_1, x_2 > 0.$$

The transformation and its inverse are

$$\begin{cases} y_1 = x_1 + x_2 \\ y_2 = \dfrac{x_1}{x_1 + x_2} \end{cases}
\qquad\Longleftrightarrow\qquad
\begin{cases} x_1 = y_1 y_2 \\ x_2 = y_1 - y_1 y_2 \end{cases}$$

with Jacobian

$$J = \begin{vmatrix} \dfrac{\partial x_1}{\partial y_1} & \dfrac{\partial x_2}{\partial y_1} \\[4pt] \dfrac{\partial x_1}{\partial y_2} & \dfrac{\partial x_2}{\partial y_2} \end{vmatrix}
= \begin{vmatrix} y_2 & 1 - y_2 \\ y_1 & -y_1 \end{vmatrix} = -y_1.$$

Therefore

$$p(y_1, y_2) = \frac{\lambda^{p+q}}{\Gamma(p)\Gamma(q)}\, e^{-\lambda y_1} (y_1 y_2)^{p-1} (y_1 - y_1 y_2)^{q-1}\, y_1
= \Gamma_{p+q,\lambda}(y_1)\,\beta_{p,q}(y_2). \quad \square$$

Corollary. If X_1, ..., X_n are independent with X_i ∼ Γ(p_i, λ), then

$$\sum_{i=1}^n X_i \sim \Gamma\Big(\sum_{i=1}^n p_i,\ \lambda\Big).$$
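A simulation sketch of the theorem above (illustrative code, assuming SciPy is available; the parameter values are mine): Y_1 = X_1 + X_2 should be Γ(p + q, λ), Y_2 = X_1/(X_1 + X_2) should be β(p, q), and the two should be (at least) uncorrelated.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
p, q, lam = 2.0, 3.0, 1.5                          # illustrative values
x1 = rng.gamma(shape=p, scale=1 / lam, size=500_000)
x2 = rng.gamma(shape=q, scale=1 / lam, size=500_000)
y1, y2 = x1 + x2, x1 / (x1 + x2)

print(np.corrcoef(y1, y2)[0, 1])                                  # ~ 0
print(stats.kstest(y1, stats.gamma(a=p + q, scale=1 / lam).cdf).pvalue)
print(stats.kstest(y2, stats.beta(a=p, b=q).cdf).pvalue)          # neither p-value should be tiny
```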

1.4 χ², F, t Distributions

• χ² distribution with n degrees of freedom


Theorem. Let X = (X_1, ..., X_n) with X_i ∼iid N(0, σ²). Then

$$V = \sum_{i=1}^n \frac{X_i^2}{\sigma^2} \sim \chi^2_n = \Gamma\Big(\frac{n}{2}, \frac{1}{2}\Big).$$

p.f. Let Z_i = X_i/σ and T = Z_1². Then

$$P(Z_1^2 \le t) = P(-\sqrt{t} < Z_1 < \sqrt{t}),$$

so

$$F_T(t) = \Phi(\sqrt{t}) - \Phi(-\sqrt{t}), \qquad
f_T(t) = t^{-1/2}\phi(\sqrt{t}) = \frac{1}{\sqrt{2\pi}}\, t^{-1/2} e^{-t/2},$$

which is the Γ(1/2, 1/2) density. By the corollary above, V = Σ_i Z_i² is thus Γ(n/2, 1/2) = χ²_n. □

• F distribution with k and m degrees of freedom

  – V ⊥⊥ W with V ∼ χ²_k, W ∼ χ²_m

  – S = (V/k)/(W/m) ∼ F_{k,m}

• t distribution with k degrees of freedom

  – Z ⊥⊥ V with Z ∼ N(0, 1), V ∼ χ²_k

  – Q = Z/√(V/k) ∼ t_k

Corollary. Let X = (X_1, ..., X_n) with X_i ∼iid N(0, σ²). Then

$$\frac{\sum_{i=1}^k X_i^2 / k}{\sum_{i=k+1}^{k+m} X_i^2 / m} \sim F_{k,m}
\qquad \text{and} \qquad
\frac{X_1}{\sqrt{\sum_{i=2}^{k+1} X_i^2 / k}} \sim t_k.$$
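A quick numerical sketch of the first part of the corollary (assumptions mine: SciPy available, arbitrary illustrative k, m, σ): the ratio of mean squares should follow F_{k,m} whatever σ is.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
k, m, sigma = 4, 7, 2.0                            # illustrative values
x = rng.normal(0.0, sigma, size=(200_000, k + m))
ratio = (x[:, :k] ** 2).mean(axis=1) / (x[:, k:] ** 2).mean(axis=1)
print(stats.kstest(ratio, stats.f(dfn=k, dfd=m).cdf).pvalue)   # should not be tiny
```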

[Figures: χ² densities for χ²(2), χ²(5), χ²(10); F densities for F(10, 1000), F(10, 50), F(10, 5), F(1, 5); t densities for t(2) and t(8) compared with the standard normal.]
1.5 Orthogonal Transformation

• Orthogonal matrix A:

  – A′ = A⁻¹; A′A = AA′ = I

  – u = Av ⇒ u′u = v′A′Av = v′v

  – |det(A)| = 1 (since det(A′A) = det(A′)·det(A) = {det(A)}² = 1)

• Y = AX + c:

$$p_Y(y) = p_X(A^{-1}(y - c))\,|\det(A^{-1})|
= p_X(A^{-1}(y - c))\,\frac{1}{|\det(A)|}
= p_X(A'(y - c))$$

• If p_X(x) = (2πσ²)^{-n/2} e^{-x′x/(2σ²)}, then

$$p_Y(y) = (2\pi\sigma^2)^{-n/2}\, e^{-\frac{(y-c)'AA'(y-c)}{2\sigma^2}}
= (2\pi\sigma^2)^{-n/2}\, e^{-\frac{(y-c)'(y-c)}{2\sigma^2}}
= \prod_{i=1}^n \frac{1}{\sigma}\,\phi\Big(\frac{y_i - c_i}{\sigma}\Big)$$

Theorem. Let Z_i ∼ N(µ_i, σ²), i = 1, ..., n, be independent, and let Y = AZ + c with A orthogonal. Then Y_i ∼ N(η_i, σ²), i = 1, ..., n, are independent, where η = Aµ + c.

Theorem. Let Z_i ∼iid N(µ, σ²), i = 1, ..., n. Then

$$\bar{Z} \;\perp\!\!\!\perp\; \frac{\sum_{i=1}^n (Z_i - \bar{Z})^2}{\sigma^2}
\qquad \text{and} \qquad
\frac{\sum_{i=1}^n (Z_i - \bar{Z})^2}{\sigma^2} \sim \chi^2_{n-1}.$$

p.f. Consider

$$A = \begin{pmatrix} \frac{1}{\sqrt{n}} & \cdots & \frac{1}{\sqrt{n}} \\ & A^* & \end{pmatrix},$$

with the rows of A^* chosen so that A is orthogonal.

Let Y = AZ. Then Y′Y = Z′Z and Y_1 = √n Z̄, so

$$\sum_{i=1}^n (Z_i - \bar{Z})^2 = Z'Z - n\bar{Z}^2 = Y'Y - Y_1^2 = \sum_{i=2}^n Y_i^2.$$

Also, since A is orthogonal, the Y_i are independent, and A^*\mathbf{1} = 0 (each row of A^* is orthogonal to the first row of A), implying that

$$\begin{pmatrix} Y_2 \\ \vdots \\ Y_n \end{pmatrix} = A^* Z$$

have zero means. Hence

$$\sum_{i=2}^n Y_i^2 \;\perp\!\!\!\perp\; Y_1^2,
\qquad \text{implying} \qquad
\bar{Z} \;\perp\!\!\!\perp\; \frac{\sum_{i=1}^n (Z_i - \bar{Z})^2}{\sigma^2}.$$

Also,

$$\frac{\sum_{i=1}^n (Z_i - \bar{Z})^2}{\sigma^2} = \frac{\sum_{i=2}^n Y_i^2}{\sigma^2} \sim \chi^2_{n-1}. \quad \square$$
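A Monte Carlo sketch (not part of the notes) illustrating the theorem: for normal samples, Z̄ and Σ(Z_i − Z̄)²/σ² show essentially zero correlation, and the latter matches χ²_{n−1}. Zero correlation is of course only a necessary symptom of independence, not a proof of it.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, mu, sigma = 5, 1.0, 2.0                         # illustrative values
z = rng.normal(mu, sigma, size=(200_000, n))
zbar = z.mean(axis=1)
ss = ((z - zbar[:, None]) ** 2).sum(axis=1) / sigma**2

print(np.corrcoef(zbar, ss)[0, 1])                         # ~ 0
print(stats.kstest(ss, stats.chi2(df=n - 1).cdf).pvalue)   # should not be tiny
```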

1.6 Bivariate Normal

 
• Y = AZ + µ, where Z = (Z_1, Z_2)′ with Z_1, Z_2 ∼iid N(0, 1).

$$\mathrm{cov}(\mathbf{Y}) = \begin{pmatrix} \mathrm{var}(Y_1) & \mathrm{cov}(Y_1, Y_2) \\ \mathrm{cov}(Y_1, Y_2) & \mathrm{var}(Y_2) \end{pmatrix}
= \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}
= \begin{pmatrix} a_{11}^2 + a_{12}^2 & a_{11}a_{21} + a_{12}a_{22} \\ a_{11}a_{21} + a_{12}a_{22} & a_{21}^2 + a_{22}^2 \end{pmatrix}
= A A'$$

$$p_Z(z) = \frac{1}{(\sqrt{2\pi})^2}\, e^{-\frac{z'z}{2}}$$

$$p_Y(y) = \frac{1}{(\sqrt{2\pi})^2\,|\det(A)|}\, e^{-\frac{(y-\mu)'(A^{-1})'A^{-1}(y-\mu)}{2}}
= \frac{1}{(\sqrt{2\pi})^2\sqrt{|\det(AA')|}}\, e^{-\frac{(y-\mu)'(AA')^{-1}(y-\mu)}{2}}
= \frac{1}{2\pi\sqrt{\det(\Sigma)}}\, e^{-\frac{(y-\mu)'\Sigma^{-1}(y-\mu)}{2}},$$

where Σ = cov(Y).

• For general X with E(X) = 0 and cov(X) = Σ, and Y = AX + µ, we have

E(Y) = µ

cov(Y) = AΣA′

[Figure: surface plot of a two-dimensional normal density with µ1 = 0, µ2 = 0, σ11 = 10, σ22 = 10, ρ = 0.5.]

$$f(x) = \frac{1}{2\pi\sqrt{\sigma_{11}\sigma_{22}(1-\rho^2)}}
\exp\left\{-\frac{1}{2(1-\rho^2)}\left[\frac{(x_1-\mu_1)^2}{\sigma_{11}}
- 2\rho\,\frac{(x_1-\mu_1)(x_2-\mu_2)}{\sqrt{\sigma_{11}\sigma_{22}}}
+ \frac{(x_2-\mu_2)^2}{\sigma_{22}}\right]\right\}$$

• (X, Y) ∼ BVN(µ, Σ)

  ⇒ X ∼ N(µ1, σ1²), Y ∼ N(µ2, σ2²)

• but the reverse is not true

• degenerate normal (when |ρ| = 1):

$$\frac{Y - \mu_2}{\sigma_2} = \rho\,\frac{X - \mu_1}{\sigma_1} \quad \text{almost surely}$$

Simulate BVN(µ1, µ2, σ1², σ2², ρ): with Z_1, Z_2 ∼iid N(0, 1), set

$$U_1 = aZ_1 + bZ_2 + \mu_1, \qquad U_2 = cZ_1 + dZ_2 + \mu_2,$$

where

$$\begin{cases} a^2 + b^2 = \sigma_1^2 \\ c^2 + d^2 = \sigma_2^2 \\ ac + bd = \rho\sigma_1\sigma_2 \end{cases}$$

One solution:

$$U_1 = \sigma_1\sqrt{1 - \rho^2}\, Z_1 + \rho\sigma_1 Z_2 + \mu_1, \qquad
U_2 = \sigma_2 Z_2 + \mu_2.$$
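A sketch of this construction in code (the function name and parameter values are mine); the sample means, standard deviations, and correlation should come out close to the targets.

```python
import numpy as np

def rbvn(mu1, mu2, s1, s2, rho, size, rng):
    """Draw from BVN(mu1, mu2, s1^2, s2^2, rho) via the construction above."""
    z1 = rng.standard_normal(size)
    z2 = rng.standard_normal(size)
    u1 = s1 * np.sqrt(1 - rho**2) * z1 + rho * s1 * z2 + mu1
    u2 = s2 * z2 + mu2
    return u1, u2

rng = np.random.default_rng(5)
u1, u2 = rbvn(0.0, 1.0, 2.0, 3.0, 0.6, 500_000, rng)
print(u1.mean(), u2.mean())            # ~ 0, 1
print(u1.std(), u2.std())              # ~ 2, 3
print(np.corrcoef(u1, u2)[0, 1])       # ~ 0.6
```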

• (X, Y) ∼ BVN(µ1, µ2, σ1², σ2², ρ)

$$\Rightarrow\ Y|X \sim N\Big(\mu_2 + \rho\frac{\sigma_2}{\sigma_1}(X - \mu_1),\ \sigma_2^2(1 - \rho^2)\Big)$$

1.7 Approximations to Distributions and Moments

Asymptotic Theory

• The central limit theorem (CLT)

Theorem. Let (X_1, ..., X_n) be a sample from a population with mean µ and positive, finite variance σ². Then

$$\frac{\sqrt{n}\,(\bar{X} - \mu)}{\sigma} \xrightarrow{\ d\ } Z \sim N(0, 1),$$

i.e.,

$$P(\bar{X} \le x) \approx \Phi\left(\frac{\sqrt{n}\,(x - \mu)}{\sigma}\right).$$

δ-method

Theorem. If h′(µ) ≠ 0, then

$$P\left(\frac{\sqrt{n}\,(h(\bar{X}) - h(\mu))}{|h'(\mu)|\,\sigma} \le x\right) \approx \Phi(x),$$

i.e.,

$$\sqrt{n}\,[h(\bar{X}) - h(\mu)] \approx N(0, [h'(\mu)]^2\sigma^2).$$

Ex. h(X̄) = X̄(1 − X̄)

h(µ) = µ(1 − µ), h′(µ) = 1 − 2µ

When µ ≠ 1/2,

$$\mathrm{var}(h(\bar{X})) \approx \frac{(1 - 2\mu)^2\sigma^2}{n},$$

where σ² = var(X). If X ∼ Bernoulli(p), then σ² = µ(1 − µ). □
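A short check by simulation (mine, with illustrative n and p) that the δ-method variance above is about right for Bernoulli data:

```python
import numpy as np

rng = np.random.default_rng(8)
n, p, reps = 200, 0.3, 100_000                     # illustrative values
xbar = rng.binomial(1, p, size=(reps, n)).mean(axis=1)
h = xbar * (1 - xbar)                              # h(Xbar) = Xbar(1 - Xbar)
print(round(h.var(), 6), (1 - 2 * p) ** 2 * p * (1 - p) / n)   # roughly equal
```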

• The (weak) law of large numbers (LLN) and Slutsky’s theorem

Ex.

$$F_{k,m} = \frac{\frac{1}{k}\sum_{i=1}^k Z_i^2}{\frac{1}{m}\sum_{i=k+1}^{k+m} Z_i^2}
\;\xrightarrow[\text{LLN, Slutsky's theorem}]{m \to \infty}\;
\frac{1}{k}\sum_{i=1}^k Z_i^2 \sim \frac{1}{k}\chi^2_k. \quad \square$$

• Variance stabilization: find h such that

$$\sigma^2\,[h'(\mu)]^2 = C,$$

where C is a constant.

Ex. X ∼ Poisson(λ), var(X) = λ, var(X̄) = λ/n.

$$[h'(\lambda)]^2\,\lambda = C
\;\Rightarrow\; h'(\lambda) = \sqrt{\frac{C}{\lambda}}
\;\Rightarrow\; h(\lambda) = 2\sqrt{C\lambda} + d$$

So we choose h(t) = √t; then

$$\sqrt{n}\,\big(\sqrt{\bar{X}} - \sqrt{\lambda}\big) \approx N\Big(0, \frac{1}{4}\Big). \quad \square$$
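A small sketch (not in the notes; n, reps, and the λ grid are illustrative) of what variance stabilization buys: the variance of √X̄ for Poisson data is roughly 1/(4n) regardless of λ.

```python
import numpy as np

rng = np.random.default_rng(6)
n, reps = 50, 100_000                              # illustrative values
for lam in (0.5, 2.0, 10.0):
    xbar = rng.poisson(lam, size=(reps, n)).mean(axis=1)
    print(lam, round(np.sqrt(xbar).var(), 5), 1 / (4 * n))   # last two roughly equal
```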

1.8 Mean Squared Error (MSE) Prediction

Lemma. If Q(c) = E(Y − c)2 , then either Q(c) = ∞ for all c, or Q is minimized uniquely

by c = E(Y ).

Lemma. If X is a random vector and Y is a random variable, then either E(Y −g(X))2 = ∞

for every function g, or

E(Y − E(Y |X))2 ≤ E(Y − g(X))2

for every function g with strict inequality holding unless g(X) = E(Y |X). This implies that

E(Y |X) is the unique best mean squared error predictor.

p.f. From the first lemma

E((Y − g(X))2 |X = x) ≥ E((Y − E(Y |X))2 |X = x).

Taking expectations on both sides, we have

E(Y − g(X))2 ≥ E(Y − E(Y |X))2 . ¤

Recall that var(Y |X) = E((Y − E(Y |X))2 |X) and note that

E(Y − g(X))² = E(Y − E(Y|X))² + E(g(X) − E(Y|X))²

             = E{var(Y|X)} + E(g(X) − E(Y|X))²

Taking g(X) = E(Y) in the above, we have

var(Y) = E(var(Y|X)) + var(E(Y|X))

Ex. Z_1 ⊥⊥ Z_2 ∼ Bernoulli(0.5), X = Z_1, Y = Z_1 Z_2:

$$E(Y|X) = \frac{1}{2}X,$$

$$\mathrm{var}(E(Y|X)) = \mathrm{var}\Big(\frac{1}{2}X\Big) = \frac{1}{16},$$

$$\mathrm{var}(Y|X = x) = E\Big[\Big(Z_1 Z_2 - \frac{1}{2}X\Big)^2\Big|Z_1 = x\Big] = \frac{1}{4}x^2,$$

$$E(\mathrm{var}(Y|X)) = \frac{1}{8},$$

$$\mathrm{var}(Y) = E(\mathrm{var}(Y|X)) + \mathrm{var}(E(Y|X)) = \frac{3}{16}. \quad \square$$
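A quick simulation sketch (mine) of the decomposition in this example, checking that E(var(Y|X)) + var(E(Y|X)) matches var(Y) = 3/16:

```python
import numpy as np

rng = np.random.default_rng(7)
z1 = rng.integers(0, 2, size=1_000_000)            # Bernoulli(0.5)
z2 = rng.integers(0, 2, size=1_000_000)
x, y = z1, z1 * z2

e_y_given_x = x / 2                                # E(Y|X) = X/2
var_y_given_x = x**2 / 4                           # var(Y|X) = X^2/4
print(y.var(), var_y_given_x.mean() + e_y_given_x.var(), 3 / 16)   # all ~ 3/16
```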

Theorem. If E(|Y |) < ∞, then

var(E(Y |X)) ≤ var(Y ).

If var(Y ) < ∞, strict inequality holds unless Y = E(Y |X).

• Bivariate normal:

$$(X, Y) \sim N(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho)
\;\Rightarrow\;
Y|X \sim N\Big(\mu_2 + \rho\frac{\sigma_2}{\sigma_1}(X - \mu_1),\ \sigma_2^2(1 - \rho^2)\Big).$$

Hence the best predictor is µ2 + ρ(σ2/σ1)(X − µ1), and the MSE of the predictor is σ2²(1 − ρ²).

• Regression line (regression towards the mean):

when σ1 = σ2 , µ1 = µ2 = µ,

E(Y |X) = (1 − ρ)µ + ρX

