Abstract: This paper proposes a new architecture, the Deep Convolutional and Recurrent Writer (DCRW), for image generation. It adapts the Deep Recurrent Attentive Writer (DRAW) architecture, which is a sequential variational auto-encoder with a sequential attention mechanism for image generation. The main difference between DRAW and DCRW is that DCRW replaces the RNN in the encoder with a CNN and then applies the attention mechanism to that CNN. The reason behind this modification is that CNNs are the state of the art for image processing in deep learning and their basic architecture is inspired by the visual cortex. The proposed architecture is tested on image generation with the MNIST handwritten digits data set, and the results are analyzed.

I. INTRODUCTION

In recent years there has been a lot of progress in supervised learning research, but unsupervised learning research has not reached that level yet. Data are growing massively in the form of images, videos, speech, text and laboratory experiments, and most of the available data is unlabeled. To understand such unlabeled data, probabilistic models are used to learn the underlying structure of the data. Probabilistic graphical models represent the distribution over random variables as a graph, and Bayesian networks are one among them. Bayesian networks perform inference by learning the distribution of random variables parameterized by a set of parameters. These parameters can be learned in various ways, for example with the gradient descent algorithm.

Learning the parameters is easy if all the random variables are observable, but that is not the case in complex problems. When the hidden variables are large in number, e.g. for image pixels, it becomes difficult to maximize the likelihood of the observed variables, and consequently the posterior distribution of the network becomes intractable. Several solutions have been proposed for this problem. One of them is the mean-field approach, which takes expectations with respect to the posterior distribution. This technique assumes that the distribution of the latent variables given the observed ones is easy to compute, but this is not the case in general, so the expectations become intractable. To solve this problem, variational Bayesian models have been proposed, in which a lower bound of the likelihood is maximized instead of the likelihood of the observed variables itself [1].

Furthermore, deep directed generative models have been introduced by merging the ideas of deep neural networks and Bayesian inference [2]. The deep neural networks used with Bayesian inference in the proposed architecture are auto-encoders. Auto-encoders are neural networks which try to copy the input to the output, but under some restrictions in the hidden layers, i.e. they can only copy data that is similar to the training data [3][4]. In this way they learn the nontrivial features of the data through reconstruction. An auto-encoder model combined with a Gaussian latent variable at each layer is called a variational auto-encoder (VAE). The variational auto-encoder has an inference (or recognition) network and a generation network.

Recently, the Deep Recurrent Attentive Writer network (DRAW) has been proposed [5], which tries to copy the natural phenomenon of image reconstruction in a sequential manner. It implements the encoder (inference network) and the decoder (generation network) using recurrent neural networks with an attention mechanism during the reading and writing operations. In this paper we propose the Deep Convolutional and Recurrent Writer architecture, which is a modification of the DRAW architecture, and analyze the results on the MNIST dataset.

The contributions of this paper are:
1) The RNN in the encoder has been replaced by a CNN, as Convolutional Neural Networks are considered to be the state of the art in image processing in deep learning.
2) After replacing the RNN with a CNN, the attention mechanism has been introduced into the Convolutional Neural Network.

The Recurrent Neural Network architecture used in this paper is Long Short-Term Memory (LSTM), as LSTMs are good at handling long-term sequential data by mitigating the problem of vanishing gradients [6][7].

The background of the Deep Convolutional and Recurrent Writer is presented in section II. It is followed by the explanation of the proposed network architecture and its equations in section III. Section IV gives the experimental results and finally discussions are given in section V.

II. BACKGROUND

A. Bayesian Inference

Bayesian networks are a type of directed graphical model. They can be trained to learn the distribution of random variables. They perform inference using Bayes' theorem to update the probability given new information [8]. But while maximizing the likelihood of the observed variables, Bayesian learning faces the problem of intractability due to the presence
of latent variables [9]. To understand the Bayesian inference problem and its proposed solution, let us consider the directed graphical model shown in Fig. 1. Let x be the observed random variable and z the hidden random variable, as shown in Fig. 1a. The prior of the latent variable z, parametrized by $\theta$, is given by $p_\theta(z)$, where $\theta$ represents the model parameters, and the intractable true posterior is given as $p_\theta(z|x) = p_\theta(x,z) / p_\theta(x)$. Furthermore, let $q_\phi(z|x)$ be the recognition model, which tries to approximate the true posterior $p_\theta(z|x)$ [9]. So, instead of maximizing the likelihood of the observed variables directly, a lower bound of the likelihood is maximized. In this scenario the log-likelihood and its lower bound are related by:

$$\log p_\theta(x) = D_{KL}\big(q_\phi(z|x)\,\|\,p_\theta(z|x)\big) + \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p_\theta(x,z)}{q_\phi(z|x)}\right] \tag{1}$$

$$\log p_\theta(x) = D_{KL}\big(q_\phi(z|x)\,\|\,p_\theta(z|x)\big) + \mathcal{L}(q) \tag{2}$$

There are two terms on the right-hand side of equation (1). The first is the KL-divergence term, which should be minimized, i.e. the difference between the approximate posterior $q_\phi(z|x)$ and the true posterior $p_\theta(z|x)$ should be reduced. The second term on the right-hand side is the lower bound $\mathcal{L}(q)$ of equation (2), which carries the reconstruction cost. Therefore, in order to maximize the likelihood, the first term on the right-hand side of equation (1) should be minimized and the second term should be maximized. To do this, the network with generative parameters $\theta$ and variational parameters $\phi$ is trained using backpropagation. But as z is obtained from the approximate posterior $q_\phi(z|x)$ by sampling, it is not possible to take the derivative through the sampling step, which is required for backpropagation. To solve this problem, the stochastic gradient variational Bayes (SGVB) estimator has been introduced [1]. It uses the reparametrization trick shown in Fig. 1b, in which the latent variable z is reparametrized by converting it into a deterministic function of $\phi$ and noise $\epsilon$:

$$z = g(\phi, \epsilon) \tag{3}$$

Using this technique, z becomes a deterministic and differentiable function of $\phi$ for a given noise sample, as shown in Fig. 1b. If $q_\phi(z|x)$ is an isotropic Gaussian distribution, equation (3) becomes:

$$z = \mu + \sigma \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I) \tag{4}$$

where the mean $\mu$ and the standard deviation $\sigma$ are the outputs of the encoder network.

B. Variational Auto-encoder

The main concept behind variational auto-encoders is to combine the deep neural networks known as auto-encoders with variational Bayesian theory. The main difference between auto-encoders and variational auto-encoders is that in variational auto-encoders the high-dimensional input data, represented by x, and the learned low-dimensional data, represented by z, are random variables. This makes it possible to sample x from the distribution $p_\theta(x|z)$, which is then used to regenerate the input. Let us consider the variational auto-encoder model shown in Fig. 2. In the figure, the recognition or inference model $q_\phi(z|x)$ is referred to as the encoder, since given the observed variable x it gives a distribution (e.g. Bernoulli, Gaussian) over the possible values of the hidden variable z from which the observed variable x can be generated. In a similar manner, $p_\theta(x|z)$ is a probabilistic decoder, since given the hidden variable z it produces a distribution over the possible corresponding values of x and thus performs the regeneration.

C. DRAW

The Deep Recurrent Attentive Writer (DRAW) is a recently proposed neural network architecture which generates images sequentially. Its main idea consists of a Recurrent Neural Network based encoder, which encodes the images through compression at each time step, and a Recurrent Neural Network based decoder, which decodes the compressed data from the encoder [5]. It is trained using the auto-encoding variational Bayes algorithm, which in turn uses the reparameterization trick for the backpropagation of error through the network. DRAW encodes the image at time step t while observing the output of the decoder from time 0 until time t-1. Another important
contribution of this architecture is the attention mechanism. It uses a unique sequential attention mechanism, which is applied while reading the image for encoding and while writing the image after the decoding operation. The attention mechanism introduced in this architecture is differentiable, so the architecture is trained end to end for the generation of images.

Fig. 2: Left: Convolutional and Feed Forward Writer. Right: Convolutional and Recurrent Writer with attention.
Fig. 3: Convolutional and Recurrent Writer without attention in encoding.

III. THE DCRW NETWORK

The proposed Deep Convolutional and Recurrent Writer (DCRW) architecture is a modification of the DRAW architecture explained in the previous section. There are two main modifications. The first is the replacement of the Recurrent Neural Network at the encoder by a Convolutional Neural Network, as CNNs are the state of the art for feature extraction in images. The second is an attention mechanism applied to the CNN architecture at each time step. Usually the attention mechanism, which specifies where to look in the image, is applied sequentially in Recurrent Neural Networks. Here it is applied to the Convolutional Neural Network at each time step t to restrict the input region being observed by the recognition network. Finally, in the generation network, the RNN reconstructs the input at each time step. The network architecture is shown in Fig. 2.

A. Architecture

DCRW has been implemented in two different ways: with and without the attention mechanism during the read operation. In the first, the encoder is implemented using a Convolutional Neural Network (CNN) and the decoder using a Recurrent Neural Network (RNN), with an attention mechanism in both the encoding (i.e. read) and decoding (i.e. write) processes, as shown on the right side of Fig. 2. In the second, the encoder is again implemented using a CNN and the decoder using an RNN, but without the attention mechanism in read, as shown in Fig. 3. These architectures are explained in the next sections together with the variational auto-encoder theory used in them.

B. Convolutional and Recurrent Writer with attention

The Convolutional and Recurrent architecture with attention, shown on the right side of Fig. 2, is a deep variational auto-encoder with read and write operations. Let $X = \{x_1, x_2, \ldots, x_n\}$ be the input data fed into the architecture. The encoder performs Conv(3 × 3 × 32), Pool(2 × 2), Conv(3 × 3 × 64) and Pool(2 × 2) operations on the incoming input images. The decoder, on the other hand, is implemented using a Recurrent Neural Network ($RNN^{dec}$).

The Convolutional Neural Network receives the image input x and $h^{dec}_{t-1}$ through the read operation, where $h^{dec}_{t-1}$ is the output of the decoder at the previous time step. After the features are extracted through the convolution operations, the output $h^{enc}_t$ is used for sampling from the latent distribution, $z_t \sim q(z_t | h^{enc}_t)$. The latent distribution used in the architecture is the Gaussian distribution $\mathcal{N}(z_t | \mu_t, \sigma_t)$, and the mean and standard deviation of the distribution are outputs of the encoder network, given by the following equations:

$$\mu_t = W(h^{enc}_t) \tag{5}$$

$$\sigma_t = \exp\big(W(h^{enc}_t)\big) \tag{6}$$

The RNN takes a sample from the latent distribution as input and outputs $h^{dec}_t$ at each time step, which is passed to the write operation. The written output is accumulated in the cumulative canvas matrix, and after T time steps $P(x | z_{1:T})$ is calculated from it. For each image x the network performs the operations defined in the following equations:

$$\hat{x}_t = x - \sigma(c_{t-1}) \tag{7}$$

$$r_t = read(x_t, \hat{x}_t, h^{dec}_{t-1}) \tag{8}$$

$$h^{enc}_t = CNN^{enc}([r_t, h^{dec}_{t-1}]) \tag{9}$$
$$z_t \sim q(z_t | h^{enc}_t) \tag{10}$$

$$h^{dec}_t = RNN^{dec}(h^{dec}_{t-1}, z_t) \tag{11}$$

$$c_t = c_{t-1} + write(h^{dec}_t) \tag{12}$$

where $\sigma$ is the logistic sigmoid, $\sigma(x) = \frac{1}{1+\exp(-x)}$, applied to the output from the decoder network.

1) Read and Write Operations with attention: In the DRAW architecture, image read and write operations have been introduced with a selective attention mechanism. In our architecture, DCRW, we use the same attention mechanism on the Convolutional Neural Network. The attention mechanism is fully differentiable, which makes it possible to train the network end to end. The two-dimensional attention mechanism is a 2D array of Gaussian filters applied to the image, yielding a patch of the image. The center of the filter array is specified as $g_X$, $g_Y$ and the stride as $\delta$, which controls the zoom of the filter. The mean location of the filter at column i and row j of the N × N array is given by:

$$\mu^i_X = g_X + \left(i - \frac{N}{2} - 0.5\right)\delta \tag{13}$$

$$\mu^j_Y = g_Y + \left(j - \frac{N}{2} - 0.5\right)\delta \tag{14}$$

As a Gaussian is being used as the filter, the variance of the filter is also required, which will be calculated dynamically from the output of the decoder:

2) Training of Network: In order to calculate the loss for the training of the network, we rewrite the variational lower bound from Equation (1) as:

$$\log p_\theta(x) \geq \mathbb{E}_q[\log p_\theta(x|z)] - D_{KL}\big[q_\phi(z|x)\,\|\,p_\theta(z)\big] \tag{24}$$

The first term on the right-hand side of Equation (24) gives, as its negative, the reconstruction loss denoted by $L^x$, and the second term is the regularization loss denoted by $L^z$. In the DRAW architecture the final output is stored in the canvas matrix $c_T$ after T time steps and is used to determine $D(X | c_T)$, where D is the Gaussian distribution considered here. So the losses for this architecture are given by:

$$L^x = -\log D(x | c_T) \tag{25}$$

$$L^z = \sum_{t=1}^{T} D_{KL}\big(q(z_t | h^{enc}_t)\,\|\,p(z_t)\big) \tag{26}$$

It can be observed from Equation (26) that the regularization loss depends on the distribution $q(z_t | h^{enc}_t)$ from which the samples are drawn, which in turn depends on the input. If the latent distribution is chosen to be Gaussian and the prior $p(z_t)$ is a standard Gaussian, the regularization loss can be calculated analytically and is given by the formula:

$$L^z = \frac{1}{2}\left(\sum_{t=1}^{T} \mu_t^2 + \sigma_t^2 - \log \sigma_t^2\right) - \frac{T}{2} \tag{27}$$

The total loss for the network is given by:
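A minimal numerical sketch of the loss terms in Equations (25) to (27) follows. It is illustrative only and rests on several assumptions that are not taken from the paper: each latent $z_t$ is a scalar (which is what the $T/2$ term of Equation (27) corresponds to), $D$ is a unit-variance Gaussian whose mean is the sigmoid of the canvas, and the total loss is the plain sum $L^x + L^z$. The function names are hypothetical. The helper `kl_numeric` integrates the KL divergence directly so that the closed form can be checked against Equation (26).

```python
import numpy as np

def regularization_loss(mu, sigma):
    """Closed-form L^z of Equation (27) for scalar latents z_1..z_T."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    T = mu.shape[0]
    return 0.5 * np.sum(mu**2 + sigma**2 - np.log(sigma**2)) - T / 2.0

def kl_numeric(mu, sigma, half_width=12.0, n=200001):
    """KL(N(mu, sigma^2) || N(0, 1)) by a Riemann sum over z (one term of Equation (26))."""
    z = np.linspace(mu - half_width * sigma, mu + half_width * sigma, n)
    dz = z[1] - z[0]
    log_q = -0.5 * np.log(2 * np.pi * sigma**2) - (z - mu)**2 / (2 * sigma**2)
    log_p = -0.5 * np.log(2 * np.pi) - z**2 / 2
    return float(np.sum(np.exp(log_q) * (log_q - log_p)) * dz)

def reconstruction_loss(x, c_T):
    """L^x = -log D(x | c_T), ASSUMING D is a unit-variance Gaussian with mean sigmoid(c_T)."""
    mean = 1.0 / (1.0 + np.exp(-c_T))
    return float(0.5 * np.sum((x - mean)**2) + 0.5 * x.size * np.log(2 * np.pi))

def total_loss(x, c_T, mu, sigma):
    # ASSUMPTION: the total loss is the plain sum of the two terms.
    return reconstruction_loss(x, c_T) + float(regularization_loss(mu, sigma))

# Usage: compare the closed form with the sum of numerically integrated KL terms.
mus, sigmas = [0.5, -1.0, 0.0], [0.8, 1.5, 1.0]
closed_form = regularization_loss(mus, sigmas)
summed_kl = sum(kl_numeric(m, s) for m, s in zip(mus, sigmas))
print(closed_form, summed_kl)
```

With a vector-valued latent of dimension d, the same check would apply per dimension, and the constant would become T·d/2 rather than T/2.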
(a)-(d): MNIST data generation at time steps t = 1, t = 4, t = 6 and t = 10.

$$z_t \sim q(z_t | h^{enc}) \tag{30}$$

$$h^{dec}_t = RNN^{dec}(h^{dec}_{t-1}, z_t) \tag{31}$$

$$c_t = c_{t-1} + write(h^{dec}_t) \tag{32}$$

It can be observed from the above modified equations that the input image x is now given directly to the Convolutional Neural Network, and that after $h^{enc}$ has been computed, sampling is done at each time step from the CNN output. The decoding process is then performed by the RNN on these samples at each time step.

1) Read without Attention and Write with Attention: In this architecture the encoding operation is implemented without the attention mechanism. The equations for the attention mechanism introduced in DRAW are therefore used only for the write operation, i.e. the Gaussian filter matrices $F_X[i, a]$ and $F_Y[j, b]$ are applied only while writing, as given in Equations (19), (20) and (23). The equation for encoding without the attention mechanism is given below:

$$read(x) = CNN^{enc}(x) \tag{33}$$

2) Training of Network: The loss function for training this architecture without attention at the input is the same as for the architecture with attention at the input, because the architecture is still a variational auto-encoder: it carries the basic idea of auto-encoder neural networks and variational Bayes. Furthermore, although the CNN is used at only one time step, sampling from its output is done T times, once per time step. The loss equations therefore remain the same, and the total loss consists of the regularization loss, calculated analytically because the latent distribution is Gaussian, and the reconstruction loss, calculated from sampling. The loss is optimized using a stochastic gradient descent algorithm.
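The generation loop of Equations (30) to (33) can be sketched in a few lines of NumPy. This is a rough illustration under stated assumptions, not the authors' implementation: the weights are random and untrained, a single dense ReLU layer stands in for $CNN^{enc}$, a plain tanh recurrence stands in for the LSTM decoder, and a linear map stands in for the attentive write operation. The function name `generate`, all weight names and all layer sizes are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def generate(x, T=10, z_dim=8, h_dim=32, feat_dim=16, seed=0):
    """Run T steps of Equations (30)-(32) for one flattened image x."""
    rng = np.random.default_rng(seed)
    n_pix = x.size
    # Hypothetical, untrained parameters (small random values).
    W_enc = rng.normal(0.0, 0.1, (feat_dim, n_pix))   # dense stand-in for CNN^enc
    W_mu = rng.normal(0.0, 0.1, (z_dim, feat_dim))
    W_sigma = rng.normal(0.0, 0.1, (z_dim, feat_dim))
    W_hh = rng.normal(0.0, 0.1, (h_dim, h_dim))
    W_hz = rng.normal(0.0, 0.1, (h_dim, z_dim))
    W_write = rng.normal(0.0, 0.1, (n_pix, h_dim))

    # Equation (33): the encoder sees the whole image once, without attention.
    h_enc = np.maximum(W_enc @ x, 0.0)
    mu = W_mu @ h_enc                  # as in Equation (5)
    sigma = np.exp(W_sigma @ h_enc)    # as in Equation (6), always positive

    h_dec = np.zeros(h_dim)
    canvas = np.zeros(n_pix)
    canvases = []
    for _ in range(T):
        eps = rng.standard_normal(z_dim)
        z = mu + sigma * eps                       # Equation (30) via Equation (4)
        h_dec = np.tanh(W_hh @ h_dec + W_hz @ z)   # Equation (31), tanh RNN stand-in
        canvas = canvas + W_write @ h_dec          # Equation (32), linear write stand-in
        canvases.append(canvas.copy())
    # Squash the final canvas to [0, 1] for display (an assumption).
    return sigmoid(canvas), mu, sigma, canvases

# Usage: a synthetic 28 x 28 "image" flattened to 784 values.
x = np.linspace(0.0, 1.0, 28 * 28)
image, mu, sigma, canvases = generate(x, T=10)
print(image.shape, len(canvases))
```

Because `h_enc`, `mu` and `sigma` are computed once outside the loop, only the noise `eps` changes between time steps, which mirrors the statement that the CNN runs at one time step while sampling is repeated T times.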
(a)-(d): MNIST data generation at time steps t = 1, t = 4, t = 6 and t = 10.
(a) Reconstruction and Regularization loss for DCRW
(b) Reconstruction and Regularization loss for DCRW with no attention in read