
Deep Convolutional and Recurrent Writer

Sadaf Gulshad, Jong-Hwan Kim


School of Electrical Engineering
KAIST
Daejeon, Republic of Korea
Email: sadaf@rit.kaist.ac.kr, johkim@rit.kaist.ac.kr

Abstract—This paper proposes a new architecture, the Deep Convolutional and Recurrent Writer (DCRW), for image generation, by adapting the Deep Recurrent Attentive Writer (DRAW) architecture, which is a sequential variational auto-encoder with a sequential attention mechanism for image generation. The main difference between DRAW and DCRW is that in DCRW we have replaced the RNN in the encoder with a CNN, and after this replacement an attention mechanism has been applied to the CNN. The reason behind this modification is that CNNs are the state of the art for image processing in deep learning and their basic architecture is inspired by the visual cortex. Further, to test the proposed architecture, experiments are performed on the MNIST handwritten digits data set for the generation of images and the results are analyzed.

I. INTRODUCTION

In recent years there has been a lot of progress in supervised learning research, but unsupervised learning research has not reached that level yet. Data are growing massively in the form of images, videos, speech, text and laboratory experiments, and most of the available data are unlabeled. So, in order to understand such unlabeled data, probabilistic models are used for learning the underlying structure of the data. Probabilistic graphical models can represent the distribution over random variables as a graph, and Bayesian networks are one of them. Bayesian networks can perform inference by learning the distribution of random variables parameterized by a set of parameters. These parameters can be learned in various ways, for example with a gradient descent algorithm. Learning the parameters is easy if all the random variables are observable, but that is not the case in complex problems. In complex problems, where the hidden variables are large in number (e.g. image pixels), it becomes difficult to maximize the likelihood of the observed variables and consequently the posterior distribution of the network becomes intractable. Several solutions have been proposed for this problem; one of them is the mean-field approach, which takes expectations with respect to the posterior distribution. This technique assumes that the computation of the latent variables given the observed ones is easy, which is not the case in general, so the expectations become intractable. In order to solve this problem, variational Bayesian models have been proposed, in which, instead of maximizing the likelihood of the observed variables, a lower bound on the likelihood is maximized [1].

Furthermore, deep directed generative models have been introduced by merging the ideas of deep neural networks and Bayesian inference [2]. The deep neural networks used with Bayesian inference in the proposed architecture are auto-encoders. Auto-encoders are neural networks which try to copy the input to the output, but through restrictions in the hidden layers, i.e. they try to reproduce data which is similar to the training data [3][4]. In this way they learn nontrivial features of the data through reconstruction. The auto-encoder model combined with a Gaussian latent variable at each layer is called a variational auto-encoder (VAE). The variational auto-encoder has an inference or recognition network and a generation network.

Recently, the Deep Recurrent Attentive Writer (DRAW) network has been proposed [5], which tries to copy the natural phenomenon of image reconstruction in a sequential manner. It implements the encoder or inference network and the decoder or generation network using recurrent neural networks, with an attention mechanism during the reading and writing operations. In this paper we propose the Deep Convolutional and Recurrent Writer architecture, which is a modification of the DRAW architecture, and analyze the results on the MNIST dataset.

The contributions of this paper are:

1) The RNN in the encoder has been replaced by a CNN, as Convolutional Neural Networks are considered to be the state of the art in image processing in deep learning.
2) After replacing the RNN with a CNN, the attention mechanism has been introduced into the Convolutional Neural Network.

The Recurrent Neural Network architecture used in this paper is the Long Short-Term Memory (LSTM), as LSTMs are good at handling long-term sequential data by eradicating the problem of vanishing gradients [6][7].

The background of the Deep Convolutional and Attentive Writer is presented in Section II. It is followed by the explanation of the proposed network's architecture and its equations in Section III. Section IV gives the experimental results and finally discussions are given in Section V.

II. BACKGROUND

A. Bayesian Inference

Bayesian networks are a type of directed graphical model. They can be trained to learn the distribution of random variables. They perform inference using Bayes' theorem to update probabilities given new information [8]. But while maximizing the likelihood of the observed variables, Bayesian learning faces the problem of intractability due to the presence of latent variables [9].



(a) Without Reparametrization Trick (b) With Reparametrization Trick

Fig. 1: Directed Graphical Model of Variational Bayes

In order to understand the Bayesian inference problem and its proposed solution, let us consider the directed graphical model shown in Fig. 1. Let x be the observed random variable and z be the hidden random variable, as shown in Fig. 1a. The prior over the latent variable z parametrized by θ is given by p_θ(z), where θ represents the model parameters, and the intractable true posterior is given as p_θ(z|x) = p_θ(x,z) / p_θ(x). Furthermore, let q_φ(z|x) be the recognition model, which tries to approximate the true posterior p_θ(z|x) [9]. So, instead of maximizing the likelihood directly, a lower bound on the likelihood is maximized.

In this scenario the log-likelihood can be decomposed as:

log p_θ(x) = D_KL(q_φ(z|x) || p_θ(z|x)) + ∫ q_φ(z|x) log [ p_θ(x,z) / q_φ(z|x) ] dz    (1)

log p_θ(x) = D_KL(q_φ(z|x) || p_θ(z|x)) + L(q)    (2)

There are two terms on the right-hand side of equation (1). The first term is the KL-divergence term, which should be minimized, i.e. the difference between the approximate posterior q_φ(z|x) and the true posterior p_θ(z|x) should be reduced. The second term on the right-hand side is the variational lower bound, which contains the reconstruction cost. Therefore, in order to maximize the likelihood, the first term on the right-hand side of equation (1) should be minimized and the second term should be maximized. To do this, the network with generative parameters θ and variational parameters φ is trained using backpropagation. But as the approximate posterior q_φ(z|x) is obtained by sampling, it is not possible to take derivatives through q_φ(z|x), which is required for backpropagation. In order to solve this problem, the stochastic gradient variational Bayes (SGVB) estimator has been introduced [1], which uses the reparametrization trick shown in Fig. 1b: the latent variable z is reparametrized by converting it into a deterministic function, parameterized by φ, of the input and a noise variable ε:

z = g_φ(ε, x)    (3)

Using this technique z becomes deterministic and differentiable, as shown in Fig. 1b. If the sampling of q_φ(z|x) is done from an isotropic Gaussian distribution, then equation (3) becomes:

z = μ + σ ⊙ ε,   ε ~ N(0, I)    (4)

where μ and σ, the mean and standard deviation respectively, are the outputs of the encoder network.

B. Variational Auto-encoder

The main concept behind the variational auto-encoder is to combine the idea of a deep neural network known as an auto-encoder with variational Bayesian theory. The main difference between auto-encoders and variational auto-encoders is that in variational auto-encoders the high-dimensional input data, represented by x, and the learned low-dimensional data, represented by z, are random variables, and this makes it possible for the auto-encoder to sample x from the distribution p(x|z), which is further used for the regeneration of the input. Let us consider the variational auto-encoder model shown in Fig. 2. In the figure, the recognition or inference model q_φ(z|x) is referred to as the encoder, since given the observed variable x it gives a distribution (e.g. Bernoulli, Gaussian) over the possible values of the hidden variable z from which the observed variable x can be generated. In a similar manner, p_θ(x|z) is a probabilistic decoder, since given the hidden variable z it produces a distribution over the possible corresponding values of x and thus performs the regeneration.
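As an illustration of equations (3) and (4), the reparametrization step can be written in a few lines. The snippet below is a minimal sketch (PyTorch is assumed here purely for illustration; the paper does not prescribe a framework): the encoder outputs μ and log σ, and the sample z = μ + σ ⊙ ε remains differentiable with respect to both.

```python
import torch

def reparameterize(mu, log_sigma):
    # z = mu + sigma * eps, eps ~ N(0, I)  -- equation (4)
    eps = torch.randn_like(mu)  # noise drawn outside the computation graph
    return mu + torch.exp(log_sigma) * eps

# toy usage: gradients flow through mu and log_sigma, not through eps
mu = torch.zeros(4, 10, requires_grad=True)
log_sigma = torch.zeros(4, 10, requires_grad=True)
z = reparameterize(mu, log_sigma)
z.sum().backward()
print(mu.grad.shape, log_sigma.grad.shape)
```

Because ε is drawn outside the computation graph, gradients of the lower bound can flow through μ and σ into the encoder parameters φ.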

Fig. 2: Left: Convolutional and Feed Forward Writer, Right: Convolutional and Recurrent Writer with attention

Fig. 3: Convolutional and Recurrent writer without attention in encoding

C. DRAW

The Deep Recurrent Attentive Writer (DRAW) is a recently proposed neural network architecture which generates images sequentially. Its main idea consists of a Recurrent Neural Network based encoder, which encodes the images through compression at each time step, and a Recurrent Neural Network based decoder, which decodes the compressed data from the encoder [5]. It is trained using the auto-encoding variational Bayes algorithm, which in turn uses the reparametrization trick for the backpropagation of the error through the network. DRAW encodes the image at time step t while observing the output of the decoder from time 0 till time t−1. Another important contribution of this architecture is the attention mechanism. It uses a unique sequential attention mechanism, which is applied while reading the image for encoding and while writing the image after the decoding operation. The attention mechanism introduced in this architecture is differentiable. The architecture is trained end to end for the generation of images.

III. THE DCRW NETWORK

The proposed Deep Convolutional and Recurrent Writer (DCRW) architecture is a modification of the DRAW architecture, which was explained in the previous section. There are two main modifications in the architecture. The first is the replacement of the Recurrent Neural Network at the encoder by a Convolutional Neural Network, as CNNs are the state of the art for feature extraction in images. The second modification is an attention mechanism applied to the CNN architecture at each time step. Usually the attention mechanism, which specifies where to look in the image, is applied sequentially in Recurrent Neural Networks. Here it is instead applied to the Convolutional Neural Network at each time step t to restrict the input region being observed by the recognition network. Finally, in the generation network, the RNN reconstructs the input at each time step. The network architecture is shown in Fig. 2.

A. Architecture

DCRW has been implemented in two different ways, that is, with and without an attention mechanism during the read operation. First, the encoder is implemented using a Convolutional Neural Network (CNN) and the decoder is implemented using a Recurrent Neural Network (RNN), with an attention mechanism in the encoding (i.e. read) and decoding (i.e. write) processes, as shown on the right side of Fig. 2. Next, the encoder is again implemented using a CNN and the decoder with an RNN, but without an attention mechanism in the read operation, as shown in Fig. 3. These architectures are explained in the following sections together with the theory of the variational auto-encoder used in the architecture.

B. Convolutional and Recurrent Writer with attention

The Convolutional and Recurrent architecture with attention, shown on the right side of Fig. 2, is a deep variational auto-encoder with read and write operations. Let X = {x_1, x_2, ..., x_n} be the input data fed into the architecture. The encoder performs Conv(3×3×32), Pool(2×2), Conv(3×3×64) and Pool(2×2) operations on the incoming input images. The decoder, on the other hand, is implemented using a Recurrent Neural Network (RNN^dec).

The Convolutional Neural Network receives the image input x and h^dec_{t−1} through the read operation, where h^dec_{t−1} is the output of the decoder at the previous time step. After extracting the features through the convolution operations, the output h^enc_t is used for sampling from the latent distribution z_t ~ q(z_t|h^enc_t). The latent distribution used in the architecture is a Gaussian distribution N(Z_t|μ_t, σ_t), and the mean and standard deviation of the distribution are the outputs of the encoder network, given by the following equations:

μ_t = W(h^enc_t)    (5)

σ_t = exp(W(h^enc_t))    (6)

The RNN takes the latent sample z_t as input and outputs h^dec_t at each time step through the write operation. This output is accumulated in the canvas matrix and after T time steps P(x|z_{1:T}) is computed from it.
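As a concrete reference for equations (5) and (6) and the Conv/Pool stack described above, the following minimal PyTorch sketch builds the convolutional encoder and samples z_t. The latent dimension, padding, activation functions and the two-channel input (image and error image stacked) are illustrative assumptions; for simplicity it takes the full 28×28 MNIST image, as in the no-attention variant of Section III-C, and omits the concatenation with h^dec_{t−1}.

```python
import torch
import torch.nn as nn

class DCRWEncoder(nn.Module):
    """Sketch of the convolutional encoder: Conv(3x3x32) -> Pool(2x2) ->
    Conv(3x3x64) -> Pool(2x2), followed by linear heads giving mu_t and
    sigma_t as in equations (5)-(6)."""

    def __init__(self, in_channels=2, z_dim=10, img_size=28):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        feat_dim = 64 * (img_size // 4) ** 2
        self.mu_head = nn.Linear(feat_dim, z_dim)         # mu_t = W(h_t^enc)
        self.log_sigma_head = nn.Linear(feat_dim, z_dim)  # sigma_t = exp(W(h_t^enc))

    def forward(self, r_t):
        h_enc = self.features(r_t).flatten(start_dim=1)
        mu = self.mu_head(h_enc)
        sigma = torch.exp(self.log_sigma_head(h_enc))
        z = mu + sigma * torch.randn_like(sigma)          # z_t ~ q(z_t | h_t^enc)
        return z, mu, sigma

# toy usage: x and the error image x_hat_t stacked as two input channels
r_t = torch.randn(8, 2, 28, 28)
z_t, mu_t, sigma_t = DCRWEncoder()(r_t)
print(z_t.shape)  # torch.Size([8, 10])
```

The sampled z_t would then be fed to the decoder LSTM of equation (11) below.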

For each image x the network performs the operations defined in the following equations:

x̂_t = x − σ(c_{t−1})    (7)

r_t = read(x, x̂_t, h^dec_{t−1})    (8)

h^enc_t = CNN^enc([r_t, h^dec_{t−1}])    (9)

z_t ~ q(z_t|h^enc_t)    (10)

h^dec_t = RNN^dec(h^dec_{t−1}, z_t)    (11)

c_t = c_{t−1} + write(h^dec_t)    (12)

where σ denotes the logistic sigmoid, σ(x) = 1 / (1 + exp(−x)), applied to the cumulative output (canvas) of the decoder network.

1) Read and Write Operations with attention: In the DRAW architecture, image read and write operations have been introduced with a selective attention mechanism. In our DCRW architecture we use the same attention mechanism on the Convolutional Neural Network. The attention mechanism is fully differentiable, making it possible to train the network end to end. The two-dimensional attention mechanism is a 2D array of Gaussian filters applied to the image, yielding a patch of the image. The centre of the filter grid is specified by g_X, g_Y, and the stride δ controls the zoom of the filter. The mean location of the filter at row i and column j of the N×N patch is given by:

μ^i_X = g_X + (i − N/2 − 0.5) δ    (13)

μ^j_Y = g_Y + (j − N/2 − 0.5) δ    (14)

As a Gaussian is being used as the filter, the variance of the filter is also required; it is computed dynamically from the output of the decoder, together with the grid centre and stride, which are scaled to the A×B input image:

(g̃_X, g̃_Y, log σ², log δ̃, log γ) = W(h^dec)    (15)

g_X = (A + 1)/2 (g̃_X + 1)    (16)

g_Y = (B + 1)/2 (g̃_Y + 1)    (17)

δ = (max(A, B) − 1)/(N − 1) δ̃    (18)

where A and B are the width and height of the input image. After calculating the parameters of the filter, the horizontal and vertical filterbank matrices are calculated as:

F_X[i, a] = (1/Z_X) exp(−(a − μ^i_X)² / (2σ²))    (19)

F_Y[j, b] = (1/Z_Y) exp(−(b − μ^j_Y)² / (2σ²))    (20)

where i, j denote a point in the filter patch, a, b denote a point in the whole image, and Z_X, Z_Y are normalization constants. Hence the read and write operations with attention are given by:

read(x, x̂_t, h^dec_{t−1}) = γ [F_Y x F_X^T, F_Y x̂_t F_X^T]    (21)

w_t = W(h^dec_t)    (22)

write(h^dec_t) = (1/γ̂) F̂_Y^T w_t F̂_X    (23)
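The filterbank equations (13)–(21) can be made concrete with a short sketch. The function below is a minimal illustration (single image, no batch dimension, and the intensity terms γ and γ̂ are left out); the raw parameters g̃_X, g̃_Y, log σ², log δ̃ would in practice come from the linear layer of equation (15) applied to the decoder state.

```python
import torch

def attention_filters(g_tilde_x, g_tilde_y, log_sigma2, log_delta, N, A, B):
    """Build the horizontal and vertical Gaussian filterbanks F_X (N x A) and
    F_Y (N x B) of equations (13)-(20) for a single A x B image."""
    g_x = (A + 1) / 2 * (g_tilde_x + 1)                         # equation (16)
    g_y = (B + 1) / 2 * (g_tilde_y + 1)                         # equation (17)
    delta = (max(A, B) - 1) / (N - 1) * torch.exp(log_delta)    # equation (18)
    sigma2 = torch.exp(log_sigma2)

    i = torch.arange(1, N + 1, dtype=torch.float32)
    mu_x = g_x + (i - N / 2 - 0.5) * delta                      # equation (13)
    mu_y = g_y + (i - N / 2 - 0.5) * delta                      # equation (14)

    a = torch.arange(A, dtype=torch.float32)
    b = torch.arange(B, dtype=torch.float32)
    F_x = torch.exp(-(a[None, :] - mu_x[:, None]) ** 2 / (2 * sigma2))  # equation (19)
    F_y = torch.exp(-(b[None, :] - mu_y[:, None]) ** 2 / (2 * sigma2))  # equation (20)
    F_x = F_x / F_x.sum(dim=1, keepdim=True)                    # normalisation Z_X
    F_y = F_y / F_y.sum(dim=1, keepdim=True)                    # normalisation Z_Y
    return F_x, F_y

def read(x, x_hat, F_x, F_y):
    # read(x, x_hat, h_dec) = [F_Y x F_X^T, F_Y x_hat F_X^T]  (equation (21), gamma omitted)
    return torch.cat([F_y @ x @ F_x.T, F_y @ x_hat @ F_x.T], dim=-1)

# toy usage: extract a 5x5 patch (two channels concatenated) from a 28x28 image
F_x, F_y = attention_filters(torch.tensor(0.0), torch.tensor(0.0),
                             torch.tensor(0.0), torch.tensor(0.0), N=5, A=28, B=28)
patch = read(torch.rand(28, 28), torch.rand(28, 28), F_x, F_y)
print(patch.shape)  # torch.Size([5, 10]): two 5x5 patches side by side
```

With g̃_X = g̃_Y = 0 the grid is centred on the image, and increasing log δ̃ zooms the 5×5 read window out towards the full 28×28 image.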

(a) MNIST data generation at time step t = 1 (b) MNIST data generation at time step t = 4

(c) MNIST data generation at time step t = 6 (d) MNIST data generation at time step t = 10

Fig. 4: MNIST data generation

2) Training of the Network: In order to calculate the loss for training the network, we rewrite the variational lower bound from equation (1) as:

log p_θ(x) ≥ E_q[log p_θ(x|z)] − D_KL[q_φ(z|x) || p_θ(z)]    (24)

The first term on the right-hand side of equation (24) is the reconstruction loss, denoted by L^x, and the second term is the regularization loss, denoted by L^z. In the DRAW architecture, the final output is stored in the canvas matrix c_T after T time steps and is used to determine D(x|c_T), where D is the Gaussian distribution considered here. The losses for this architecture are therefore given by:

L^x = −log D(x|c_T)    (25)

L^z = Σ_{t=1}^{T} D_KL(q(z_t|h^enc_t) || p(z_t))    (26)

It can be observed from equation (26) that the regularization loss depends on the samples drawn from q(z_t|h^enc_t), which in turn depend on the input. If the latent distribution is chosen to be Gaussian, it can be calculated analytically and is given by the formula:

L^z = (1/2) Σ_{t=1}^{T} (μ_t² + σ_t² − log σ_t²) − T/2    (27)

The total loss for the network is given by:

L = L^z + L^x    (28)

Hence, in order to maximize the likelihood of the data, the KL divergence should be minimized.

C. Convolutional and Recurrent Writer without Attention in Read

The Convolutional and Recurrent auto-encoder architecture without attention at read is shown in Fig. 3. It differs from the architecture with attention in the following aspects. First, the attention mechanism is a sequential process through which the image is read; as the attention mechanism is removed from the encoder, the sequential read operation is removed along with it. Second, the Convolutional network is not run for T time steps; instead the CNN extracts features from the input in a single time step, so T at the encoder network is one. However, the sampling from the output of the CNN is still performed at each time step t for T time steps. After sampling z_t from the output of the encoder h^enc, it is given as input to the decoder Long Short-Term Memory network (LSTM) for decoding at each time step t. The decoder gives its output h^dec_t to the write module as input, which keeps writing into the canvas matrix at each time step t. The attention mechanism is still present while writing, but has been removed while reading the image in this architecture.

The modified equations for the Deep Convolutional and Recurrent Writer without attention while reading are given by:

h^enc = CNN^enc(x)    (29)

z_t ~ q(z_t|h^enc)    (30)

h^dec_t = RNN^dec(h^dec_{t−1}, z_t)    (31)

c_t = c_{t−1} + write(h^dec_t)    (32)

It can be clearly observed from the above modified equations that the input image x is now given directly to the Convolutional Neural Network, and after the computation of h^enc, sampling is done at each time step from the CNN output. Then, using these samples, the decoding process is performed by the RNN at each time step.

1) Read without Attention and Write with Attention: In this architecture the encoding operation is implemented without using the attention mechanism. So the attention-mechanism equations introduced in DRAW are used only for the write operation, i.e. the Gaussian filter matrices F_X[i, a] and F_Y[j, b] are applied only while writing, as given in equations (19), (20) and (23). The equation for encoding without the attention mechanism is given below:

read(x) = CNN^enc(x)    (33)

2) Training of the Network: The loss function for training this architecture without attention at the input is the same as for the architecture with attention at the input, because the architecture is still a variational auto-encoder: it carries the basic ideas of auto-encoder neural networks and variational Bayes. Furthermore, although the CNN is used at only one time step, the sampling from its output is done for T time steps, therefore the loss equations remain the same and the total loss consists of the regularization loss, calculated analytically since the latent distribution is Gaussian, and the reconstruction loss, calculated from sampling. Hence the loss equations are the same for the architectures with and without the attention mechanism. The loss is optimized using a stochastic gradient descent algorithm.
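For reference, the total loss of equations (24)–(28) can be sketched as follows. This is a minimal illustration under stated assumptions: the reconstruction term uses the binary cross-entropy on the sigmoid of the final canvas, as reported for the experiments in Section IV, and the regularization term is the closed-form Gaussian KL of equation (26); tensor shapes in the toy usage are arbitrary.

```python
import torch
import torch.nn.functional as F

def dcrw_loss(x, c_T, mus, sigmas):
    """Total loss L = L^x + L^z (equation (28)). `mus` and `sigmas` hold the
    per-time-step encoder outputs of equations (5)-(6)."""
    # L^x: reconstruction loss on the final canvas c_T; binary cross-entropy
    # is used here, as in the experiments of Section IV
    Lx = F.binary_cross_entropy(torch.sigmoid(c_T), x, reduction='sum')

    # L^z: closed-form KL between N(mu_t, sigma_t^2) and the prior N(0, I),
    # summed over the T time steps (equation (26))
    Lz = sum(0.5 * (mu ** 2 + sig ** 2 - torch.log(sig ** 2) - 1).sum()
             for mu, sig in zip(mus, sigmas))

    return Lx + Lz

# toy usage with two time steps and a batch of one 28x28 binarised image
x = torch.rand(1, 28, 28).round()
c_T = torch.randn(1, 28, 28)
mus = [torch.zeros(1, 10), torch.zeros(1, 10)]
sigmas = [torch.ones(1, 10), torch.ones(1, 10)]
print(dcrw_loss(x, c_T, mus, sigmas))
```

Minimizing L by stochastic gradient descent, as described above, maximizes the reconstruction term while pulling q(z_t|h^enc_t) towards the prior.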

(a) MNIST data generation at time step t = 1 (b) MNIST data generation at time step t = 4

(c) MNIST data generation at time step t = 6 (d) MNIST data generation at time step t = 10

Fig. 5: MNIST data generation with no attention in Read

D. Data Generation

Data is generated by the decoder using latent samples z_t ~ q(z_t|h^enc_t). The output of the decoder updates the canvas matrix C_t at each time step t and after T time steps the data is generated using D(x|C_T), where x is the newly generated image. The encoder network makes no contribution while generating the images.

IV. EXPERIMENTAL RESULTS

In order to evaluate the proposed architectures, they were trained to encode and then decode (regenerate) the MNIST dataset images. As the sampling operation performed at each time step t during training was always unique, the generated images are also novel and indistinguishable from the training samples of the MNIST dataset. The reconstruction loss used was binary cross-entropy. The network parameters and losses, i.e. the reconstruction loss and the regularization loss for each network, are given in Table I.

TABLE I: Network Hyper-Parameters and Losses

Architecture                                                   | LSTM No. of Hidden Units | Read Size | Write Size | Losses
Convolutional and Recurrent Writer with attention              | 256                      | 7×7       | 5×5        | L^x = 70.168, L^z = 1.5468
Convolutional and Recurrent Writer without attention in Read   | 256                      | -         | 5×5        | L^x = 147.24, L^z = 31.065

A. MNIST Data Generation using Convolutional and Recurrent Network with Attention

The Convolutional and Recurrent Neural network is trained end to end as a generative model on the MNIST dataset. MNIST is a large database with 60,000 training images and 10,000 test images, commonly used for image processing experiments. Once the Convolutional and Recurrent network architecture is trained, the data generation operation is performed.

(a) Reconstruction and Regularization loss for DCRW (b) Reconstruction and Regularization loss for DCRW with no attention in read

Fig. 6: Reconstruction loss (L^x) and Regularization loss (L^z) plots


The regenerated MNIST images at time steps t = 1, t = 4, t = 6 and t = 10 are shown in Fig. 4. We can observe from the generated images that, although the architecture was simple, with only two convolutional and two pooling layers in the encoder network and a Recurrent Neural Network on the decoder side, it was able to reconstruct readable images of MNIST digits. The plots of the regularization and reconstruction losses are shown in Fig. 6a and their values are given in Table I. We can clearly observe that the losses decrease as the number of iterations increases and reach values comparable with the DRAW architecture.

B. MNIST Data Generation using Convolutional and Recurrent Network without Attention in Read

The final MNIST generation experiment is performed using the Convolutional and Recurrent network without attention in the encoding process, as shown in Fig. 3. In this architecture the input is encoded at a single time step using the Convolutional Neural Network and samples are decoded using the RNN in T time steps. We can observe from Fig. 5 that the network was able to reconstruct the images, although not as sharp as with the attention mechanism. Another observation from Fig. 5 is that the images are updated globally when the attention mechanism is removed from the encoder, while in the DCRW with attention at both read and write the images were updated locally, i.e. sequentially. Further, the loss plots for the Deep Convolutional and Recurrent Writer (DCRW) without attention in reading are shown in Fig. 6b. Consistent with the regeneration results in Fig. 5, the loss plots also show that the losses of the DCRW model without attention at reading are higher and hence the images are a bit blurrier than those generated with the attention mechanism.

V. CONCLUSION

In this paper we proposed the Deep Convolutional and Recurrent Writer (DCRW) architecture, which is a modification of the Deep Recurrent Attentive Writer (DRAW). We replaced the encoder of DRAW, changing it from a Recurrent Neural Network to a Convolutional Neural Network, and applied an attention mechanism to the CNN. The logic behind changing the encoder from an RNN to a CNN is that CNNs are the state of the art for image processing in deep learning applications and their architecture is biologically inspired by the visual cortex. The experimental results show that DCRW gave comparable results in the experiments.

VI. ACKNOWLEDGMENT

This work was supported by the ICT R&D program of MSIP/IITP [2016-0-00563, Research on adaptive machine learning technology development for intelligent autonomous digital companion].

REFERENCES

[1] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," arXiv preprint arXiv:1312.6114, 2013.
[2] D. J. Rezende, S. Mohamed, and D. Wierstra, "Stochastic backpropagation and approximate inference in deep generative models," arXiv preprint arXiv:1401.4082, 2014.
[3] I. Goodfellow, Y. Bengio, and A. Courville, "Deep Learning," 2016, book in preparation for MIT Press. [Online]. Available: http://www.deeplearningbook.org
[4] C.-Y. Liou, J.-C. Huang, and W.-C. Yang, "Modeling word perception using the Elman network," Neurocomputing, vol. 71, no. 16, pp. 3150-3157, 2008.
[5] K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra, "DRAW: A recurrent neural network for image generation," arXiv preprint arXiv:1502.04623, 2015.
[6] F. A. Gers, J. Schmidhuber, and F. Cummins, "Learning to forget: Continual prediction with LSTM," Neural Computation, vol. 12, no. 10, pp. 2451-2471, 2000.
[7] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[8] K. P. Kording and D. M. Wolpert, "Bayesian decision theory in sensorimotor control," Trends in Cognitive Sciences, vol. 10, no. 7, pp. 319-326, 2006.
[9] T. Griffiths and A. Yuille, "Technical introduction: A primer on probabilistic inference," UCLA Department of Statistics Papers, no. 2006010103, UCLA, Los Angeles, CA, 2006.

