
1 Basic Probability Theory

1.1 Conditional Probability

$$p(y|x) = \frac{p(x, y)}{p_X(x)}$$

p(x, y): joint probability/density function of (X, Y)

p_X(x): marginal probability/density function of X

$$\int_y p(y|x)\,dy = 1 \qquad \Big(\sum_y p(y|x) = 1 \text{ in the discrete case}\Big)$$

Ex. (X_1, ..., X_n): n binomial trials with probability of success p

$$\Rightarrow Y = \sum_{i=1}^n X_i \sim \mathrm{Bin}(n, p), \qquad
p(x|y) = \frac{p^y(1-p)^{n-y}}{\binom{n}{y}p^y(1-p)^{n-y}} = \frac{1}{\binom{n}{y}}. \quad \square$$

• p(x, y) = p(x|y)p(y) = p(y|x)p(x)

• Bayes' rule

$$p(x|y) = \frac{p(y|x)\,p_X(x)}{\int_x p(y|x)\,p_X(x)\,dx}$$

Ex. Y ∼ Bin(N, θ): # of defectives in N products

X: # of defectives in n samples (sampled without replacement)

X|Y = y ∼ H(y, N, n) (hypergeometric distribution)


$$p(x, y) = \binom{N}{y}\theta^y(1-\theta)^{N-y}\,\frac{\binom{y}{x}\binom{N-y}{n-x}}{\binom{N}{n}}$$

$$p(Y = y|X = x)
= \frac{\binom{N}{y}\theta^y(1-\theta)^{N-y}\binom{y}{x}\binom{N-y}{n-x}}
       {\sum_y \binom{N}{y}\theta^y(1-\theta)^{N-y}\binom{y}{x}\binom{N-y}{n-x}}
= \binom{N-n}{y-x}\theta^{y-x}(1-\theta)^{N-n-(y-x)}. \quad \square$$
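The identity above says that, given X = x defectives in the sample, the remaining count Y − x among the N − n unsampled items is Bin(N − n, θ). Below is a minimal simulation sketch of this check; the parameter values and variable names are illustrative, not part of the notes.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)
N, n, theta = 20, 5, 0.3                           # illustrative values
reps = 200_000

Y = rng.binomial(N, theta, size=reps)              # total defectives among N products
X = rng.hypergeometric(Y, N - Y, n)                # defectives in the sample of size n

x = 1                                              # condition on X = x
rest = Y[X == x] - x                               # Y - x given X = x
ks = np.arange(N - n + 1)
emp = np.array([(rest == k).mean() for k in ks])
thy = np.array([comb(N - n, k) * theta**k * (1 - theta)**(N - n - k) for k in ks])
print(np.round(emp, 3))
print(np.round(thy, 3))                            # the two rows should roughly agree
```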

1.2 Conditional Expectation

$$E(X|Y = y) = \int_x x\,p(x|y)\,dx$$

Ex. (X1 , . . . , Xn ): n binomial trials with prob. of success p


$$\Rightarrow Y = \sum_{i=1}^n X_i \sim \mathrm{Bin}(n, p), \qquad
E(X_1|Y = y) = p(X_1 = 1|Y = y) = \frac{\binom{n-1}{y-1}}{\binom{n}{y}} = \frac{y}{n}. \quad \square$$

Useful identities of conditional expectation

• Double Expectation E(E(X|Y )) = E(X)

$$\int\!\!\int x\,p(x|y)\,dx\,p(y)\,dy = \int\!\!\int x\,p(x, y)\,dx\,dy = E(X)$$

• E{r(X)h(Y)} = E[h(Y) E{r(X)|Y}]

Ex. (X1 , . . . , Xn ): n binomial trials with prob. of success p


$$\Rightarrow Y = \sum_{i=1}^n X_i \sim \mathrm{Bin}(n, p), \qquad
E\{E(X_1|Y)\} = E\Big(\frac{Y}{n}\Big) = \frac{np}{n} = p = E(X_1). \quad \square$$

Ex. X_1, X_2 ∼iid U(0, 1). Let Y = min(X_1, X_2), Z = max(X_1, X_2).

The joint distribution of (Y, Z):

$$P(Y \le y, Z \le z) = 2P(X_1 < X_2,\, X_1 \le y,\, X_2 \le z)
= 2\int_0^z\!\!\int_0^{\min(x_2, y)} dx_1\,dx_2
= 2\int_0^z \min(x_2, y)\,dx_2
= \begin{cases} 2yz - y^2, & 0 < y \le z < 1 \\ z^2, & 0 < z \le y < 1 \end{cases}$$

$$p(y, z) = \frac{\partial^2}{\partial y\,\partial z} P(Y \le y, Z \le z)
= \begin{cases} 2, & 0 < y \le z < 1 \\ 0, & \text{o.w.} \end{cases}$$

$$p_Y(y) = \int_y^1 p(y, z)\,dz = 2(1 - y), \quad 0 < y < 1$$

$$p(z|y) = \frac{2}{2(1 - y)} = \frac{1}{1 - y}, \quad 0 < y \le z < 1$$

$$E(Z|Y = y) = \int_y^1 z\,\frac{1}{1 - y}\,dz = \frac{1 + y}{2}, \quad 0 < y < 1. \quad \square$$
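A short Monte Carlo sketch (mine, not from the notes) of the last identity, E(Z | Y = y) = (1 + y)/2, conditioning on a small window around a few values of y:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(size=(1_000_000, 2))
y_samp, z_samp = x.min(axis=1), x.max(axis=1)      # Y = min, Z = max

for y0 in (0.2, 0.5, 0.8):
    sel = np.abs(y_samp - y0) < 0.01               # condition on Y near y0
    print(y0, round(z_samp[sel].mean(), 3), (1 + y0) / 2)   # last two should agree
```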

1.3 Distribution for Transformations of Random Vectors

• Jacobian: for h = (h_1, ..., h_k)′ : R^k → R^k and t = (t_1, ..., t_k)′,

$$J_h(t) = \begin{vmatrix}
\frac{\partial}{\partial t_1}h_1(t) & \cdots & \frac{\partial}{\partial t_1}h_k(t) \\
\vdots & \ddots & \vdots \\
\frac{\partial}{\partial t_k}h_1(t) & \cdots & \frac{\partial}{\partial t_k}h_k(t)
\end{vmatrix}$$

• Y = g(X):

$$p_Y(y) = p_X(g^{-1}(y))\,|J_{g^{-1}}(y)| = p_X(g^{-1}(y))\,\frac{1}{|J_g(g^{-1}(y))|}$$

p.f. Let A_k = {x ∈ R^k : g_i(x) ≤ y_i, i = 1, ..., k}. Changing variables t = g(x),

$$F_Y(y) = \int\cdots\int_{A_k} p_X(x_1, \ldots, x_k)\,dx_1\cdots dx_k
= \int\cdots\int_{\{t:\ t_i \le y_i,\ i=1,\ldots,k\}} p_X(g^{-1}(t))\,|J_{g^{-1}}(t)|\,dt_1\cdots dt_k,$$

and differentiating with respect to y_1, ..., y_k gives the density formula. □

Theorem. Let X_1 ⊥⊥ X_2 with X_1 ∼ Γ(p, λ), X_2 ∼ Γ(q, λ), and define

$$Y_1 = X_1 + X_2, \qquad Y_2 = \frac{X_1}{X_1 + X_2}.$$

Then

• Y_1 ⊥⊥ Y_2

• Y_1 ∼ Γ(p + q, λ), Y_2 ∼ β(p, q)

Recall the densities:

$$\Gamma(p, \lambda):\ \frac{\lambda^p x^{p-1} e^{-\lambda x}}{\Gamma(p)}, \qquad
\Gamma(p) = \int_0^\infty t^{p-1}e^{-t}\,dt, \quad \Gamma(p+1) = p\,\Gamma(p)$$

$$\beta(r, s):\ \frac{x^{r-1}(1 - x)^{s-1}}{B(r, s)}, \qquad
B(r, s) = \frac{\Gamma(r)\Gamma(s)}{\Gamma(r + s)}$$

[Figure: gamma densities for Γ(0.5, 0.5), Γ(1, 1), Γ(10, 10) on (0, 5), and beta densities for β(0.5, 9.5), β(0.25, 0.25), β(3, 3), β(1, 1) on (0, 1).]

p.f. The joint density of (X_1, X_2) is

$$p(x_1, x_2) = \frac{\lambda^{p+q}}{\Gamma(p)\Gamma(q)}\, x_1^{p-1} x_2^{q-1} e^{-\lambda(x_1 + x_2)}, \qquad x_1, x_2 > 0.$$

The transformation and its inverse are

$$\begin{cases} y_1 = x_1 + x_2 \\ y_2 = \dfrac{x_1}{x_1 + x_2} \end{cases}
\qquad\Longleftrightarrow\qquad
\begin{cases} x_1 = y_1 y_2 \\ x_2 = y_1 - y_1 y_2 \end{cases}$$

with Jacobian

$$J = \begin{vmatrix} \dfrac{\partial x_1}{\partial y_1} & \dfrac{\partial x_2}{\partial y_1} \\[4pt] \dfrac{\partial x_1}{\partial y_2} & \dfrac{\partial x_2}{\partial y_2} \end{vmatrix}
= \begin{vmatrix} y_2 & 1 - y_2 \\ y_1 & -y_1 \end{vmatrix} = -y_1.$$

Therefore

$$p(y_1, y_2) = \frac{\lambda^{p+q}}{\Gamma(p)\Gamma(q)}\, e^{-\lambda y_1} (y_1 y_2)^{p-1} (y_1 - y_1 y_2)^{q-1}\, y_1
= \Gamma_{p+q,\lambda}(y_1)\,\beta_{p,q}(y_2). \quad \square$$

Corollary. If X_1, ..., X_n are independent with X_i ∼ Γ(p_i, λ), then

$$\sum_{i=1}^n X_i \sim \Gamma\Big(\sum_{i=1}^n p_i,\ \lambda\Big).$$
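A simulation sketch of the theorem above (illustrative code, assuming SciPy is available; the parameter values are mine): Y_1 = X_1 + X_2 should be Γ(p + q, λ), Y_2 = X_1/(X_1 + X_2) should be β(p, q), and the two should be (at least) uncorrelated.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
p, q, lam = 2.0, 3.0, 1.5                          # illustrative values
x1 = rng.gamma(shape=p, scale=1 / lam, size=500_000)
x2 = rng.gamma(shape=q, scale=1 / lam, size=500_000)
y1, y2 = x1 + x2, x1 / (x1 + x2)

print(np.corrcoef(y1, y2)[0, 1])                                  # ~ 0
print(stats.kstest(y1, stats.gamma(a=p + q, scale=1 / lam).cdf).pvalue)
print(stats.kstest(y2, stats.beta(a=p, b=q).cdf).pvalue)          # neither p-value should be tiny
```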

1.4 χ², F, t Distributions

• χ² distribution with n degrees of freedom


Theorem. Let X = (X_1, ..., X_n) with X_i ∼iid N(0, σ²). Then

$$V = \sum_{i=1}^n \frac{X_i^2}{\sigma^2} \sim \chi^2_n = \Gamma\Big(\frac{n}{2}, \frac{1}{2}\Big).$$

p.f. Let Z_i = X_i/σ and T = Z_1². Then

$$P(Z_1^2 \le t) = P(-\sqrt{t} < Z_1 < \sqrt{t}),$$

so

$$F_T(t) = \Phi(\sqrt{t}) - \Phi(-\sqrt{t}), \qquad
f_T(t) = t^{-1/2}\phi(\sqrt{t}) = \frac{1}{\sqrt{2\pi}}\, t^{-1/2} e^{-t/2},$$

which is the Γ(1/2, 1/2) density. By the corollary above, V = Σ_i Z_i² is thus Γ(n/2, 1/2) = χ²_n. □

• F distribution with k and m degrees of freedom

  – V ⊥⊥ W with V ∼ χ²_k, W ∼ χ²_m

  – S = (V/k)/(W/m) ∼ F_{k,m}

• t distribution with k degrees of freedom

  – Z ⊥⊥ V with Z ∼ N(0, 1), V ∼ χ²_k

  – Q = Z/√(V/k) ∼ t_k

Corollary. Let X = (X_1, ..., X_n) with X_i ∼iid N(0, σ²). Then

$$\frac{\sum_{i=1}^k X_i^2 / k}{\sum_{i=k+1}^{k+m} X_i^2 / m} \sim F_{k,m}
\qquad \text{and} \qquad
\frac{X_1}{\sqrt{\sum_{i=2}^{k+1} X_i^2 / k}} \sim t_k.$$
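A quick numerical sketch of the first part of the corollary (assumptions mine: SciPy available, arbitrary illustrative k, m, σ): the ratio of mean squares should follow F_{k,m} whatever σ is.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
k, m, sigma = 4, 7, 2.0                            # illustrative values
x = rng.normal(0.0, sigma, size=(200_000, k + m))
ratio = (x[:, :k] ** 2).mean(axis=1) / (x[:, k:] ** 2).mean(axis=1)
print(stats.kstest(ratio, stats.f(dfn=k, dfd=m).cdf).pvalue)   # should not be tiny
```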

[Figures: χ² densities for χ²(2), χ²(5), χ²(10); F densities for F(10, 1000), F(10, 50), F(10, 5), F(1, 5); t densities for t(2) and t(8) compared with the standard normal.]
1.5 Orthogonal Transformation

• Orthogonal matrix A:

  – A′ = A⁻¹; A′A = AA′ = I

  – u = Av ⇒ u′u = v′A′Av = v′v

  – |det(A)| = 1 (since det(A′A) = det(A′)·det(A) = {det(A)}² = 1)

• Y = AX + c:

$$p_Y(y) = p_X(A^{-1}(y - c))\,|\det(A^{-1})|
= p_X(A^{-1}(y - c))\,\frac{1}{|\det(A)|}
= p_X(A'(y - c))$$

• If p_X(x) = (2πσ²)^{-n/2} e^{-x′x/(2σ²)}, then

$$p_Y(y) = (2\pi\sigma^2)^{-n/2}\, e^{-\frac{(y-c)'AA'(y-c)}{2\sigma^2}}
= (2\pi\sigma^2)^{-n/2}\, e^{-\frac{(y-c)'(y-c)}{2\sigma^2}}
= \prod_{i=1}^n \frac{1}{\sigma}\,\phi\Big(\frac{y_i - c_i}{\sigma}\Big)$$

Theorem. Let Z_i ∼ N(µ_i, σ²), i = 1, ..., n, be independent, and let Y = AZ + c with A orthogonal. Then Y_i ∼ N(η_i, σ²), i = 1, ..., n, are independent, where η = Aµ + c.

Theorem. Let Z_i ∼iid N(µ, σ²), i = 1, ..., n. Then

$$\bar{Z} \;\perp\!\!\!\perp\; \frac{\sum_{i=1}^n (Z_i - \bar{Z})^2}{\sigma^2}
\qquad \text{and} \qquad
\frac{\sum_{i=1}^n (Z_i - \bar{Z})^2}{\sigma^2} \sim \chi^2_{n-1}.$$

p.f. Consider

$$A = \begin{pmatrix} \frac{1}{\sqrt{n}} & \cdots & \frac{1}{\sqrt{n}} \\ & A^* & \end{pmatrix},$$

with the rows of A^* chosen so that A is orthogonal.

Let Y = AZ. Then Y′Y = Z′Z and Y_1 = √n Z̄, so

$$\sum_{i=1}^n (Z_i - \bar{Z})^2 = Z'Z - n\bar{Z}^2 = Y'Y - Y_1^2 = \sum_{i=2}^n Y_i^2.$$

Also, since A is orthogonal, the Y_i are independent, and A^*\mathbf{1} = 0 (each row of A^* is orthogonal to the first row of A), implying that

$$\begin{pmatrix} Y_2 \\ \vdots \\ Y_n \end{pmatrix} = A^* Z$$

have zero means. Hence

$$\sum_{i=2}^n Y_i^2 \;\perp\!\!\!\perp\; Y_1^2,
\qquad \text{implying} \qquad
\bar{Z} \;\perp\!\!\!\perp\; \frac{\sum_{i=1}^n (Z_i - \bar{Z})^2}{\sigma^2}.$$

Also,

$$\frac{\sum_{i=1}^n (Z_i - \bar{Z})^2}{\sigma^2} = \frac{\sum_{i=2}^n Y_i^2}{\sigma^2} \sim \chi^2_{n-1}. \quad \square$$
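A Monte Carlo sketch (not part of the notes) illustrating the theorem: for normal samples, Z̄ and Σ(Z_i − Z̄)²/σ² show essentially zero correlation, and the latter matches χ²_{n−1}. Zero correlation is of course only a necessary symptom of independence, not a proof of it.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, mu, sigma = 5, 1.0, 2.0                         # illustrative values
z = rng.normal(mu, sigma, size=(200_000, n))
zbar = z.mean(axis=1)
ss = ((z - zbar[:, None]) ** 2).sum(axis=1) / sigma**2

print(np.corrcoef(zbar, ss)[0, 1])                         # ~ 0
print(stats.kstest(ss, stats.chi2(df=n - 1).cdf).pvalue)   # should not be tiny
```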

1.6 Bivariate Normal

 
• Y = AZ + µ, where Z = (Z_1, Z_2)′ with Z_1, Z_2 ∼iid N(0, 1).

$$\mathrm{cov}(\mathbf{Y}) = \begin{pmatrix} \mathrm{var}(Y_1) & \mathrm{cov}(Y_1, Y_2) \\ \mathrm{cov}(Y_1, Y_2) & \mathrm{var}(Y_2) \end{pmatrix}
= \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}
= \begin{pmatrix} a_{11}^2 + a_{12}^2 & a_{11}a_{21} + a_{12}a_{22} \\ a_{11}a_{21} + a_{12}a_{22} & a_{21}^2 + a_{22}^2 \end{pmatrix}
= A A'$$

$$p_Z(z) = \frac{1}{(\sqrt{2\pi})^2}\, e^{-\frac{z'z}{2}}$$

$$p_Y(y) = \frac{1}{(\sqrt{2\pi})^2\,|\det(A)|}\, e^{-\frac{(y-\mu)'(A^{-1})'A^{-1}(y-\mu)}{2}}
= \frac{1}{(\sqrt{2\pi})^2\sqrt{|\det(AA')|}}\, e^{-\frac{(y-\mu)'(AA')^{-1}(y-\mu)}{2}}
= \frac{1}{2\pi\sqrt{\det(\Sigma)}}\, e^{-\frac{(y-\mu)'\Sigma^{-1}(y-\mu)}{2}},$$

where Σ = cov(Y).

• For general X with E(X) = 0 and cov(X) = Σ, and Y = AX + µ, we have

E(Y) = µ

cov(Y) = AΣA′

[Figure: surface plot of a two-dimensional normal density with µ1 = 0, µ2 = 0, σ11 = 10, σ22 = 10, ρ = 0.5.]

$$f(x) = \frac{1}{2\pi\sqrt{\sigma_{11}\sigma_{22}(1-\rho^2)}}
\exp\left\{-\frac{1}{2(1-\rho^2)}\left[\frac{(x_1-\mu_1)^2}{\sigma_{11}}
- 2\rho\,\frac{(x_1-\mu_1)(x_2-\mu_2)}{\sqrt{\sigma_{11}\sigma_{22}}}
+ \frac{(x_2-\mu_2)^2}{\sigma_{22}}\right]\right\}$$

• (X, Y) ∼ BVN(µ, Σ)

  ⇒ X ∼ N(µ1, σ1²), Y ∼ N(µ2, σ2²)

• but the reverse is not true

• degenerate normal (when |ρ| = 1):

$$\frac{Y - \mu_2}{\sigma_2} = \rho\,\frac{X - \mu_1}{\sigma_1} \quad \text{almost surely}$$

Simulate BVN(µ1, µ2, σ1², σ2², ρ): with Z_1, Z_2 ∼iid N(0, 1), set

$$U_1 = aZ_1 + bZ_2 + \mu_1, \qquad U_2 = cZ_1 + dZ_2 + \mu_2,$$

where

$$\begin{cases} a^2 + b^2 = \sigma_1^2 \\ c^2 + d^2 = \sigma_2^2 \\ ac + bd = \rho\sigma_1\sigma_2 \end{cases}$$

One solution:

$$U_1 = \sigma_1\sqrt{1 - \rho^2}\, Z_1 + \rho\sigma_1 Z_2 + \mu_1, \qquad
U_2 = \sigma_2 Z_2 + \mu_2.$$
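A sketch of this construction in code (the function name and parameter values are mine); the sample means, standard deviations, and correlation should come out close to the targets.

```python
import numpy as np

def rbvn(mu1, mu2, s1, s2, rho, size, rng):
    """Draw from BVN(mu1, mu2, s1^2, s2^2, rho) via the construction above."""
    z1 = rng.standard_normal(size)
    z2 = rng.standard_normal(size)
    u1 = s1 * np.sqrt(1 - rho**2) * z1 + rho * s1 * z2 + mu1
    u2 = s2 * z2 + mu2
    return u1, u2

rng = np.random.default_rng(5)
u1, u2 = rbvn(0.0, 1.0, 2.0, 3.0, 0.6, 500_000, rng)
print(u1.mean(), u2.mean())            # ~ 0, 1
print(u1.std(), u2.std())              # ~ 2, 3
print(np.corrcoef(u1, u2)[0, 1])       # ~ 0.6
```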

• (X, Y) ∼ BVN(µ1, µ2, σ1², σ2², ρ)

$$\Rightarrow\ Y|X \sim N\Big(\mu_2 + \rho\frac{\sigma_2}{\sigma_1}(X - \mu_1),\ \sigma_2^2(1 - \rho^2)\Big)$$

1.7 Approximations to Distributions and Moments

Asymptotic Theory

• The central limit theorem (CLT)

Theorem. Let (X_1, ..., X_n) be a sample from a population with mean µ and positive, finite variance σ². Then

$$\frac{\sqrt{n}\,(\bar{X} - \mu)}{\sigma} \xrightarrow{\ d\ } Z \sim N(0, 1),$$

i.e.,

$$P(\bar{X} \le x) \approx \Phi\left(\frac{\sqrt{n}\,(x - \mu)}{\sigma}\right).$$

δ-method

Theorem. If h′(µ) ≠ 0, then

$$P\left(\frac{\sqrt{n}\,(h(\bar{X}) - h(\mu))}{|h'(\mu)|\,\sigma} \le x\right) \approx \Phi(x),$$

i.e.,

$$\sqrt{n}\,[h(\bar{X}) - h(\mu)] \approx N(0, [h'(\mu)]^2\sigma^2).$$

Ex. h(X̄) = X̄(1 − X̄)

h(µ) = µ(1 − µ), h′(µ) = 1 − 2µ

When µ ≠ 1/2,

$$\mathrm{var}(h(\bar{X})) \approx \frac{(1 - 2\mu)^2\sigma^2}{n},$$

where σ² = var(X). If X ∼ Bernoulli(p), then σ² = µ(1 − µ). □
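A short check by simulation (mine, with illustrative n and p) that the δ-method variance above is about right for Bernoulli data:

```python
import numpy as np

rng = np.random.default_rng(8)
n, p, reps = 200, 0.3, 100_000                     # illustrative values
xbar = rng.binomial(1, p, size=(reps, n)).mean(axis=1)
h = xbar * (1 - xbar)                              # h(Xbar) = Xbar(1 - Xbar)
print(round(h.var(), 6), (1 - 2 * p) ** 2 * p * (1 - p) / n)   # roughly equal
```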

• The (weak) law of large numbers (LLN) and Slutsky’s theorem

Ex.

$$F_{k,m} = \frac{\frac{1}{k}\sum_{i=1}^k Z_i^2}{\frac{1}{m}\sum_{i=k+1}^{k+m} Z_i^2}
\;\xrightarrow[\text{LLN, Slutsky's theorem}]{m \to \infty}\;
\frac{1}{k}\sum_{i=1}^k Z_i^2 \sim \frac{1}{k}\chi^2_k. \quad \square$$

• Variance stabilization: find h such that

$$\sigma^2\,[h'(\mu)]^2 = C,$$

where C is a constant.

Ex. X ∼ Poisson(λ), var(X) = λ, var(X̄) = λ/n.

$$[h'(\lambda)]^2\,\lambda = C
\;\Rightarrow\; h'(\lambda) = \sqrt{\frac{C}{\lambda}}
\;\Rightarrow\; h(\lambda) = 2\sqrt{C\lambda} + d$$

So we choose h(t) = √t; then

$$\sqrt{n}\,\big(\sqrt{\bar{X}} - \sqrt{\lambda}\big) \approx N\Big(0, \frac{1}{4}\Big). \quad \square$$
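A small sketch (not in the notes; n, reps, and the λ grid are illustrative) of what variance stabilization buys: the variance of √X̄ for Poisson data is roughly 1/(4n) regardless of λ.

```python
import numpy as np

rng = np.random.default_rng(6)
n, reps = 50, 100_000                              # illustrative values
for lam in (0.5, 2.0, 10.0):
    xbar = rng.poisson(lam, size=(reps, n)).mean(axis=1)
    print(lam, round(np.sqrt(xbar).var(), 5), 1 / (4 * n))   # last two roughly equal
```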

1.8 Mean Squared Error (MSE) Prediction

Lemma. If Q(c) = E(Y − c)2 , then either Q(c) = ∞ for all c, or Q is minimized uniquely

by c = E(Y ).

Lemma. If X is a random vector and Y is a random variable, then either E(Y −g(X))2 = ∞

for every function g, or

E(Y − E(Y |X))2 ≤ E(Y − g(X))2

for every function g with strict inequality holding unless g(X) = E(Y |X). This implies that

E(Y |X) is the unique best mean squared error predictor.

p.f. From the first lemma

E((Y − g(X))2 |X = x) ≥ E((Y − E(Y |X))2 |X = x).

Taking expectations on both sides, we have

E(Y − g(X))2 ≥ E(Y − E(Y |X))2 . ¤

Recall that var(Y |X) = E((Y − E(Y |X))2 |X) and note that

E(Y − g(X))² = E(Y − E(Y|X))² + E(g(X) − E(Y|X))²

             = E{var(Y|X)} + E(g(X) − E(Y|X))²

Taking g(X) = E(Y) in the above, we have

var(Y) = E(var(Y|X)) + var(E(Y|X))

Ex. Z_1 ⊥⊥ Z_2 ∼ Bernoulli(0.5), X = Z_1, Y = Z_1 Z_2:

$$E(Y|X) = \frac{1}{2}X,$$

$$\mathrm{var}(E(Y|X)) = \mathrm{var}\Big(\frac{1}{2}X\Big) = \frac{1}{16},$$

$$\mathrm{var}(Y|X = x) = E\Big[\Big(Z_1 Z_2 - \frac{1}{2}X\Big)^2\Big|Z_1 = x\Big] = \frac{1}{4}x^2,$$

$$E(\mathrm{var}(Y|X)) = \frac{1}{8},$$

$$\mathrm{var}(Y) = E(\mathrm{var}(Y|X)) + \mathrm{var}(E(Y|X)) = \frac{3}{16}. \quad \square$$
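A quick simulation sketch (mine) of the decomposition in this example, checking that E(var(Y|X)) + var(E(Y|X)) matches var(Y) = 3/16:

```python
import numpy as np

rng = np.random.default_rng(7)
z1 = rng.integers(0, 2, size=1_000_000)            # Bernoulli(0.5)
z2 = rng.integers(0, 2, size=1_000_000)
x, y = z1, z1 * z2

e_y_given_x = x / 2                                # E(Y|X) = X/2
var_y_given_x = x**2 / 4                           # var(Y|X) = X^2/4
print(y.var(), var_y_given_x.mean() + e_y_given_x.var(), 3 / 16)   # all ~ 3/16
```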

Theorem. If E(|Y |) < ∞, then

var(E(Y |X)) ≤ var(Y ).

If var(Y ) < ∞, strict inequality holds unless Y = E(Y |X).

• Bivariate normal:

$$(X, Y) \sim N(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho)
\;\Rightarrow\;
Y|X \sim N\Big(\mu_2 + \rho\frac{\sigma_2}{\sigma_1}(X - \mu_1),\ \sigma_2^2(1 - \rho^2)\Big).$$

Hence the best predictor is µ2 + ρ(σ2/σ1)(X − µ1), and the MSE of the predictor is σ2²(1 − ρ²).

• Regression line (regression towards the mean):

when σ1 = σ2 , µ1 = µ2 = µ,

E(Y |X) = (1 − ρ)µ + ρX

