Understanding Gradient Descent
Before we begin...
Backpropagation is, at its core, a repeated application of the chain rule, so let's quickly review it:

$$\frac{df}{dx} = \frac{df}{dg} \cdot \frac{dg}{dh} \cdot \frac{dh}{dx}$$

As a worked example, take $f(x) = e^{\sin(x^2)}$ and decompose it as $f(v) = e^v$, $v = \sin(z)$, $z = x^2$:

$$\frac{df(v)}{dx} = \frac{df(v)}{dv} \cdot \frac{dv}{dz} \cdot \frac{dz}{dx}$$

$$\frac{df(v)}{dv} = \frac{d}{dv}\left[e^v\right] = e^v \qquad
\frac{dv}{dz} = \frac{d}{dz}\left[\sin(z)\right] = \cos(z) \qquad
\frac{dz}{dx} = \frac{d}{dx}\left[x^2\right] = 2x$$

$$\frac{df(v)}{dx} = e^v \cdot \cos(z) \cdot 2x = e^{\sin(x^2)} \cdot \cos(x^2) \cdot 2x$$
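Before moving on, it is worth sanity-checking the result numerically: a central finite difference of $f$ should match the chain-rule expression. A minimal sketch in Python (the function names and test point are our own choices, not from the original):

```python
import math

def f(x):
    return math.exp(math.sin(x ** 2))

def df_dx(x):
    # Chain-rule result derived above: e^{sin(x^2)} * cos(x^2) * 2x
    return math.exp(math.sin(x ** 2)) * math.cos(x ** 2) * 2 * x

x, h = 1.3, 1e-6
numeric = (f(x + h) - f(x - h)) / (2 * h)  # central finite difference
print(df_dx(x), numeric)  # both ~ -0.835, agreeing to several decimal places
```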
Introduction
Let's start with the simplest possible example of an ANN, using logistic regression. The reader is assumed to have basic knowledge of logistic regression and its equations. Its computational graph is shown below:
Intuition
Before we dive into Gradient Descent, we should probably ask ourselves what "gradient" means in this context. The gradient can be thought of as a value that measures how much the output of a function changes when you increase the value of its inputs a little. For functions of a single variable, the gradient is simply the derivative, or slope, of the function.
For example, taking $L(x) = (x - 3)^2$ as our function:

$$\begin{aligned}
\frac{d}{dx} L(x) &= \frac{d}{dx}\left[(x-3)^2\right] \\
&= 2 \left(\frac{d}{dx}[x] + \frac{d}{dx}[-3]\right)(x-3) \\
&= 2 (1 + 0)(x-3) \\
&= 2 (x-3)
\end{aligned}$$

Gradient Descent uses this gradient to update $x$ iteratively, moving it a small step (scaled by the learning rate $\eta$) in the direction that decreases $f$:

$$x = x - \eta \, \frac{d}{dx} f(x)$$
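To make the update rule concrete, here is a minimal sketch that minimizes $L(x) = (x - 3)^2$ with the gradient we just derived; the starting point, learning rate, and iteration count are arbitrary choices of ours:

```python
def grad(x):
    # dL/dx for L(x) = (x - 3)^2, as derived above
    return 2 * (x - 3)

x = 0.0    # arbitrary starting point
eta = 0.1  # learning rate; 0.1 is an arbitrary but typical choice
for _ in range(50):
    x -= eta * grad(x)

print(x)  # ~ 2.99996, converging to the minimum at x = 3
```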
Logistic Regression
*For simplicity, the calculations throughout this article are performed on a single training example $x$ with two features ($x_1$, $x_2$), but they are easily extended to a larger training set by applying vectorization.
To train the model we need the gradient of the loss $L$ with respect to each parameter, which the chain rule decomposes along the computational graph:

$$\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial a} \cdot \frac{da}{dz} \cdot \frac{\partial z}{\partial w_1} \qquad
\frac{\partial L}{\partial w_2} = \frac{\partial L}{\partial a} \cdot \frac{da}{dz} \cdot \frac{\partial z}{\partial w_2} \qquad
\frac{\partial L}{\partial b} = \frac{\partial L}{\partial a} \cdot \frac{da}{dz} \cdot \frac{\partial z}{\partial b}$$

We compute each factor in turn.
$$\begin{aligned}
\frac{\partial}{\partial a} L(a, y) &= \frac{\partial}{\partial a}\left[-y \ln(a) - (1-y) \ln(1-a)\right] \\
&= -y \cdot \frac{\partial}{\partial a}\left[\ln(a)\right] + (y-1) \cdot \frac{\partial}{\partial a}\left[\ln(1-a)\right] \\
&= -\frac{y}{a} + (y-1) \cdot \frac{1}{1-a} \cdot \frac{\partial}{\partial a}\left[1-a\right] \\
&= -\frac{y}{a} + \frac{(y-1)(0-1)}{1-a} \\
&= \frac{1-y}{1-a} - \frac{y}{a}
\end{aligned}$$
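As a quick numerical sanity check of this expression (the values of $a$ and $y$ below are arbitrary):

```python
import math

def L(a, y):
    # Cross-entropy loss for a single example
    return -y * math.log(a) - (1 - y) * math.log(1 - a)

a, y, h = 0.7, 1.0, 1e-7
analytic = (1 - y) / (1 - a) - y / a             # expression derived above
numeric = (L(a + h, y) - L(a - h, y)) / (2 * h)  # central finite difference
print(analytic, numeric)  # both ~ -1.42857
```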
$$\begin{aligned}
\frac{da}{dz} &= \frac{d}{dz}\left[\frac{1}{1+e^{-z}}\right] \\
&= \frac{d}{dz}\left[\left(1+e^{-z}\right)^{-1}\right] \\
&= -\left(1+e^{-z}\right)^{-2} \cdot \left(-e^{-z}\right) \\
&= \frac{e^{-z}}{\left(1+e^{-z}\right)^2} \\
&= \frac{1}{1+e^{-z}} \cdot \frac{e^{-z}}{1+e^{-z}} \\
&= \frac{1}{1+e^{-z}} \cdot \frac{\left(1+e^{-z}\right) - 1}{1+e^{-z}} \\
&= \frac{1}{1+e^{-z}} \cdot \left(1 - \frac{1}{1+e^{-z}}\right) \\
&= a \cdot (1 - a)
\end{aligned}$$
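This identity is what makes backpropagation through the sigmoid cheap: the derivative reuses the activation $a$ already computed in the forward pass. A quick numerical check (test point chosen arbitrarily):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

z, h = 0.5, 1e-6
a = sigmoid(z)
analytic = a * (1 - a)  # da/dz = a(1 - a), derived above
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
print(analytic, numeric)  # both ~ 0.2350
```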
$$\begin{aligned}
\frac{\partial z}{\partial w_1} &= \frac{\partial}{\partial w_1}\left[w_1 x_1 + w_2 x_2 + b\right] = 1 \cdot x_1 + 0 + 0 = x_1 \\
\frac{\partial z}{\partial w_2} &= \frac{\partial}{\partial w_2}\left[w_1 x_1 + w_2 x_2 + b\right] = 0 + 1 \cdot x_2 + 0 = x_2 \\
\frac{\partial z}{\partial b} &= \frac{\partial}{\partial b}\left[w_1 x_1 + w_2 x_2 + b\right] = 0 + 0 + 1 = 1
\end{aligned}$$
Combining the first two factors, almost everything cancels:

$$\begin{aligned}
\frac{\partial L}{\partial z} &= \frac{\partial L}{\partial a} \cdot \frac{da}{dz} \\
&= \left(\frac{1-y}{1-a} - \frac{y}{a}\right) \cdot a(1-a) \\
&= \frac{a(1-y) - y(1-a)}{a(1-a)} \cdot a(1-a) \\
&= \frac{a - y}{a(1-a)} \cdot a(1-a) \\
&= a - y
\end{aligned}$$
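This simplification is the key result of the section: composed with the cross-entropy loss, the sigmoid's derivative cancels almost entirely. A quick end-to-end check (values again arbitrary):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def loss_of_z(z, y):
    # Cross-entropy loss expressed directly as a function of z
    a = sigmoid(z)
    return -y * math.log(a) - (1 - y) * math.log(1 - a)

z, y, h = 0.5, 1.0, 1e-6
analytic = sigmoid(z) - y  # dL/dz = a - y
numeric = (loss_of_z(z + h, y) - loss_of_z(z - h, y)) / (2 * h)
print(analytic, numeric)  # both ~ -0.3775
```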
$$\begin{aligned}
\frac{\partial L}{\partial w_1} &= \frac{\partial L}{\partial a} \cdot \frac{da}{dz} \cdot \frac{\partial z}{\partial w_1} = \frac{\partial L}{\partial z} \cdot x_1 = x_1 \cdot (a - y) \\
\frac{\partial L}{\partial w_2} &= \frac{\partial L}{\partial a} \cdot \frac{da}{dz} \cdot \frac{\partial z}{\partial w_2} = \frac{\partial L}{\partial z} \cdot x_2 = x_2 \cdot (a - y) \\
\frac{\partial L}{\partial b} &= \frac{\partial L}{\partial a} \cdot \frac{da}{dz} \cdot \frac{\partial z}{\partial b} = \frac{\partial L}{\partial z} \cdot 1 = a - y
\end{aligned}$$
Forward Propagation
z = w1 x1 + w2 x2 + b
a = sigmoid(z)
Backward Propagation
dw1 = x1 ∗ (a − y)
dw2 = x2 ∗ (a − y)
db = a − y
Parameter Update
w1 = w1 − η ⋅ dw1
w2 = w2 − η ⋅ dw2
b = b − η ⋅ db
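The entire procedure fits in a few lines of Python. Below is a minimal sketch of training on the single example, transcribing the formulas above directly; the sample feature values, label, initialization, learning rate, and iteration count are our own choices:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# One training example with two features, plus its label
x1, x2, y = 0.8, -1.5, 1.0
w1, w2, b = 0.0, 0.0, 0.0   # parameters initialized to zero
eta = 0.1                   # learning rate

for _ in range(100):
    # Forward propagation
    z = w1 * x1 + w2 * x2 + b
    a = sigmoid(z)
    # Backward propagation
    dw1 = x1 * (a - y)
    dw2 = x2 * (a - y)
    db = a - y
    # Parameter update
    w1 -= eta * dw1
    w2 -= eta * dw2
    b -= eta * db

print(a)  # the prediction a moves toward the label y = 1 as training proceeds
```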
Artificial Neural Networks (ANNs)
We now apply the same procedure to a network with one hidden layer; superscripts in brackets index the layers. Starting from the output layer:

$$\frac{\partial L}{\partial w^{[2]}} = \frac{\partial L}{\partial a^{[2]}} \cdot \frac{da^{[2]}}{dz^{[2]}} \cdot \frac{\partial z^{[2]}}{\partial w^{[2]}} \qquad
\frac{\partial L}{\partial b^{[2]}} = \frac{\partial L}{\partial a^{[2]}} \cdot \frac{da^{[2]}}{dz^{[2]}} \cdot \frac{\partial z^{[2]}}{\partial b^{[2]}}$$

The output layer is again a sigmoid unit with the same loss, so the result from the previous section carries over directly:

$$\frac{\partial L}{\partial z^{[2]}} = \frac{\partial L}{\partial a^{[2]}} \cdot \frac{da^{[2]}}{dz^{[2]}} = a^{[2]} - y$$

$$\frac{\partial L}{\partial w^{[2]}} = a^{[1]} \cdot \left(a^{[2]} - y\right) \qquad
\frac{\partial L}{\partial b^{[2]}} = a^{[2]} - y$$

To keep propagating backwards we also need the gradient with respect to the hidden activation:

$$\frac{\partial L}{\partial a^{[1]}} = \frac{\partial L}{\partial a^{[2]}} \cdot \frac{da^{[2]}}{dz^{[2]}} \cdot \frac{\partial z^{[2]}}{\partial a^{[1]}}
\qquad \text{where} \qquad
\frac{\partial z^{[2]}}{\partial a^{[1]}} = \frac{\partial}{\partial a^{[1]}}\left[w^{[2]} a^{[1]} + b^{[2]}\right] = w^{[2]}$$
The hidden layer uses tanh as its activation, so we also need its derivative. Writing $z$ for $z^{[1]}$ to lighten the notation, and applying the quotient rule together with $\frac{d}{dz}\left[e^{-z}\right] = e^{-z} \cdot \frac{d}{dz}[-z] = -e^{-z}$:

$$\begin{aligned}
\frac{da^{[1]}}{dz} &= \frac{d}{dz}\left[\frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}\right] \\
&= \frac{\left(\frac{d}{dz}\left[e^{z}\right] - \frac{d}{dz}\left[e^{-z}\right]\right)\left(e^{z} + e^{-z}\right) - \left(\frac{d}{dz}\left[e^{z}\right] + \frac{d}{dz}\left[e^{-z}\right]\right)\left(e^{z} - e^{-z}\right)}{\left(e^{z} + e^{-z}\right)^2} \\
&= \frac{\left(e^{z} + e^{-z}\right)\left(e^{z} + e^{-z}\right) - \left(e^{z} - e^{-z}\right)\left(e^{z} - e^{-z}\right)}{\left(e^{z} + e^{-z}\right)^2} \\
&= 1 - \frac{\left(e^{z} - e^{-z}\right)^2}{\left(e^{z} + e^{-z}\right)^2} \\
&= 1 - \tanh(z)^2 \\
&= 1 - \left(a^{[1]}\right)^2
\end{aligned}$$
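Putting the pieces together, one training step of the two-layer network can be sketched as follows, with a tanh hidden layer and a sigmoid output as in the derivations above. This is a vectorized NumPy sketch under our own assumptions about layer sizes and sample data; for instance, `dW2 = dz2 @ a1.T` generalizes the scalar formula $\partial L / \partial w^{[2]} = a^{[1]} (a^{[2]} - y)$:

```python
import numpy as np

rng = np.random.default_rng(0)

# A single example with two features; hidden layer of 3 tanh units,
# sigmoid output unit. Sizes and sample data are our own choices.
x = rng.normal(size=(2, 1))
y = 1.0
W1, b1 = rng.normal(size=(3, 2)) * 0.01, np.zeros((3, 1))
W2, b2 = rng.normal(size=(1, 3)) * 0.01, np.zeros((1, 1))
eta = 0.1

# Forward propagation
z1 = W1 @ x + b1
a1 = np.tanh(z1)
z2 = W2 @ a1 + b2
a2 = 1 / (1 + np.exp(-z2))          # sigmoid

# Backward propagation, reusing dL/dz2 = a2 - y and tanh'(z) = 1 - a^2
dz2 = a2 - y
dW2 = dz2 @ a1.T
db2 = dz2
dz1 = (W2.T @ dz2) * (1 - a1 ** 2)  # chain rule through the hidden layer
dW1 = dz1 @ x.T
db1 = dz1

# Parameter update
W1 -= eta * dW1; b1 -= eta * db1
W2 -= eta * dW2; b2 -= eta * db2
```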