
Temporal-difference Learning
• Combines ideas from both Monte Carlo methods and classical dynamic programming algorithms.
• Can learn directly from experience, without requiring a model of the environment's dynamics.
• Updates estimates based on previously learned estimates (bootstrapping).
• The relationship between DP, MC, and TD is a central topic in reinforcement learning theory.
Temporal-difference Learning
Temporal-difference prediction
Algorithm for estimating V ≈ v_π, given a policy π
1. Initialize V(s) arbitrarily, ∀s ∈ S⁺, except V(s_T) = 0 for the terminal state s_T
2. Loop forever (for each episode):
3.   Initialize S
4.   Loop for each step of episode:
5.     A ← action given by π for S
6.     Take action A, observe R, S′
7.     V(S) ← V(S) + α[R + γV(S′) − V(S)]    (0 < α < 1, step size)
8.     S ← S′
9.   until S is a terminal state
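A minimal Python sketch of this TD(0) prediction loop. It assumes a simple environment object with reset() returning the start state and step(action) returning (next_state, reward, done), plus a policy given as a function pi(state); these names and the interface are illustrative assumptions, not part of the slides.

import collections

def td0_prediction(env, pi, num_episodes, alpha=0.5, gamma=1.0):
    """Tabular TD(0) policy evaluation: estimates V ≈ v_pi for a fixed policy pi."""
    V = collections.defaultdict(float)          # V(s) = 0 until updated; terminals stay 0
    for _ in range(num_episodes):
        state = env.reset()                     # initialize S
        done = False
        while not done:                         # loop for each step of the episode
            action = pi(state)                  # A <- action given by pi for S
            next_state, reward, done = env.step(action)   # take A, observe R, S'
            # TD(0) update: V(S) <- V(S) + alpha * [R + gamma * V(S') - V(S)]
            target = reward + gamma * V[next_state] * (not done)
            V[state] += alpha * (target - V[state])
            state = next_state                  # S <- S'
    return V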

Temporal-difference Learning
k = 0, policy to be evaluated: π

V:
 0.0     0.0     0.0     0.0
 0.0     0.0     0.0     0.0
 0.0     0.0     0.0     0.0
 0.0     0.0     0.0     0.0

Temporal-difference Learning
k = 1: S0 = 5, A0 = ↑

V:
 0.0    -0.338  -0.324  -0.334
-0.145  -0.237  -0.264  -0.580
-0.729  -0.451  -0.830  -0.288
-0.482  -0.186  -0.696   0.0

Temporal-difference Learning
k = 1

V (before the updates):
 0.0    -0.338  -0.324  -0.334
-0.145  -0.237  -0.264  -0.580
-0.729  -0.451  -0.830  -0.288
-0.482  -0.186  -0.696   0.0

S = 5, A = ↑
5, ↑ → −1, 1, ←
V(5) ← V(5) + 0.5[−1 + V(1) − V(5)]
V(5) ← −0.787

S = 1, A = ←
5, ↑ → −1, 1, ← → −1
V(1) ← V(1) + 0.5[−1 + V(0) − V(1)]    (state 0 is terminal, so V(0) = 0)
V(1) ← −0.669

Temporal-difference Learning
k = 1 (value function after the two updates: V(1) = −0.669, V(5) = −0.787)

V:
 0.0    -0.669  -0.324  -0.334
-0.145  -0.787  -0.264  -0.580
-0.729  -0.451  -0.830  -0.288
-0.482  -0.186  -0.696   0.0
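The two hand-computed updates above can be replayed directly. A small sketch using the cell values read off the k = 1 grid, with α = 0.5 and no discounting (γ = 1), which is what the slide updates appear to use:

alpha = 0.5

# Values from the k = 1 grid before the updates (only the cells this episode touches).
V = {0: 0.0, 1: -0.338, 5: -0.237}

# From state 5, action ↑ gives reward -1 and leads to state 1.
V[5] += alpha * (-1 + V[1] - V[5])
print(V[5])     # ≈ -0.7875, shown rounded as -0.787 on the slide

# From state 1, action ← gives reward -1 and reaches the terminal state 0.
V[1] += alpha * (-1 + V[0] - V[1])
print(V[1])     # ≈ -0.669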

Temporal-difference Learning
k = 2: S0 = 9, A0 = ↑

V (before the updates):
 0.0    -0.669  -0.324  -0.334
-0.145  -0.787  -0.264  -0.580
-0.729  -0.451  -0.830  -0.288
-0.482  -0.186  -0.696   0.0

S = 9, A = ↑
9, ↑ → −1, 5, ↑
V(9) ← V(9) + 0.5[−1 + V(5) − V(9)]
V(9) ← −1.119

S = 5, A = ↑
9, ↑ → −1, 5, ↑ → −1, 1, ←
V(5) ← V(5) + 0.5[−1 + V(1) − V(5)]
V(5) ← −1.228

S = 1, A = ←
9, ↑ → −1, 5, ↑ → −1, 1, ← → −1
V(1) ← V(1) + 0.5[−1 + V(0) − V(1)]
V(1) ← −0.835

Temporal-difference Learning
k = 2 (value function after the three updates: V(9) = −1.119, V(5) = −1.228, V(1) = −0.835)

V:
 0.0    -0.835  -0.324  -0.334
-0.145  -1.228  -0.264  -0.580
-0.729  -1.119  -0.830  -0.288
-0.482  -0.186  -0.696   0.0
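The same check extends to the k = 2 episode, starting from the values the k = 1 episode left behind (again α = 0.5, γ = 1):

alpha = 0.5

# Cell values after the k = 1 episode (only the cells this episode touches).
V = {0: 0.0, 1: -0.669, 5: -0.787, 9: -0.451}

V[9] += alpha * (-1 + V[5] - V[9])   # 9, ↑: reward -1, next state 5
print(V[9])                          # ≈ -1.119
V[5] += alpha * (-1 + V[1] - V[5])   # 5, ↑: reward -1, next state 1
print(V[5])                          # ≈ -1.228
V[1] += alpha * (-1 + V[0] - V[1])   # 1, ←: reward -1, terminal state 0
print(V[1])                          # ≈ -0.8345, shown rounded as -0.835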
Temporal-difference Learning
k = 500: π_k ≈ π*

V:
 0.0    -1.000  -2.000  -3.000
-1.000  -2.000  -3.000  -2.000
-2.000  -3.000  -2.000  -1.000
-3.000  -2.000  -1.000   0.0

Optimal policy!
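The converged grid can be reproduced end to end. The sketch below assumes a 4×4 gridworld with states 0 to 15 in row-major order, terminal corners 0 and 15, reward −1 per step and γ = 1, and evaluates a policy that always heads for the nearest terminal corner; the slides only show the value grids, so this environment, the policy, and the random start states are assumptions.

import random

ROWS = COLS = 4
TERMINALS = {0, 15}                  # assumed terminal corners

def step(state, action):
    """One move on the 4x4 grid; moves that would leave the grid keep the state."""
    row, col = divmod(state, COLS)
    if action == 'up' and row > 0:        state -= COLS
    elif action == 'down' and row < 3:    state += COLS
    elif action == 'left' and col > 0:    state -= 1
    elif action == 'right' and col < 3:   state += 1
    return state, -1, state in TERMINALS  # next state, reward, done

def pi(state):
    """Assumed evaluated policy: move toward the nearest terminal corner."""
    row, col = divmod(state, COLS)
    if row + col <= 3:                      # corner 0 is at least as close
        return 'up' if row > 0 else 'left'
    return 'down' if row < 3 else 'right'   # otherwise head for corner 15

alpha, gamma = 0.5, 1.0
V = [0.0] * (ROWS * COLS)
for _ in range(500):                        # 500 episodes, as on the slide
    s = random.choice([x for x in range(16) if x not in TERMINALS])
    done = False
    while not done:
        s2, r, done = step(s, pi(s))
        V[s] += alpha * (r + gamma * V[s2] - V[s])    # TD(0) update
        s = s2

for row in range(ROWS):                     # settles on the 0 / -1 / -2 / -3 pattern
    print(' '.join('%6.3f' % V[s] for s in range(row * COLS, (row + 1) * COLS)))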
Temporal-difference Learning
Q-learning: Off-policy TD Control
Algorithm for estimating Q ≈ q* and π ≈ π*
1. Initialize Q(s, a) arbitrarily, ∀s ∈ S⁺, a ∈ A, except Q(s_T, ·) = 0 for the terminal state s_T
2. Loop forever (for each episode):
3.   Initialize S
4.   Loop for each step of episode:
5.     Choose A from S using a policy derived from Q (e.g., ε-greedy)
6.     Take action A, observe R, S′
7.     Q(S, A) ← Q(S, A) + α[R + γ max_a Q(S′, a) − Q(S, A)]    (0 < α < 1, step size)
8.     S ← S′
9.   until S is a terminal state
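A minimal tabular Q-learning sketch, under the same assumed environment interface (reset(), and step(action) returning (next_state, reward, done)) and a finite list of actions; the ε-greedy behaviour policy and all names here are illustrative assumptions, not part of the slides.

import random
from collections import defaultdict

def q_learning(env, actions, num_episodes, alpha=0.5, gamma=1.0, epsilon=0.1):
    """Tabular Q-learning (off-policy TD control): estimates Q ≈ q*."""
    Q = defaultdict(float)                      # Q[(s, a)] = 0 until updated; terminal pairs stay 0

    def greedy(state):
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        state = env.reset()                     # initialize S
        done = False
        while not done:
            # Behaviour policy derived from Q: epsilon-greedy exploration
            action = random.choice(actions) if random.random() < epsilon else greedy(state)
            next_state, reward, done = env.step(action)    # take A, observe R, S'
            # Off-policy target bootstraps on the greedy (max) action in S'
            best_next = max(Q[(next_state, a)] for a in actions)
            target = reward + gamma * best_next * (not done)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state                  # S <- S'

    # Greedy policy read off the learned action values, for the states encountered
    states = {s for s, _ in Q}
    policy = {s: greedy(s) for s in states}
    return Q, policy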
