
Pixel Recursive Super Resolution

Ryan Dahl Mohammad Norouzi Jonathon Shlens

Google Brain
{rld,mnorouzi,shlens}@google.com

Abstract
We present a pixel recursive super resolution model that synthesizes realistic details into images while enhancing their resolution. A low resolution image may correspond to multiple plausible high resolution images; thus, modeling the super resolution process with a pixel independent conditional model often results in averaging different details and hence in blurry edges. By contrast, our model is able to represent a multimodal conditional distribution by properly modeling the statistical dependencies among the high resolution image pixels, conditioned on a low resolution input. We employ a PixelCNN architecture to define a strong prior over natural images and jointly optimize this prior with a deep conditioning convolutional network. Human evaluations indicate that samples from our proposed model look more photo realistic than a strong L2 regression baseline.

1. Introduction

The problem of super resolution entails artificially enlarging a low resolution photograph to recover a plausible high resolution version of it. When the zoom factor is large, the input image does not contain all of the information necessary to accurately construct a high resolution image. Thus, the problem is underspecified, and many plausible high resolution images exist that match the low resolution input image. This problem is significant for improving the state-of-the-art in super resolution, and more generally for building better conditional generative models of images.

A super resolution model must account for the complex variations of objects, viewpoints, illumination, and occlusions, especially as the zoom factor increases. When some details do not exist in the source image, the challenge lies not only in deblurring an image, but also in generating new image details that appear plausible to a human observer. Generating realistic high resolution images is not possible unless the model draws sharp edges and makes hard decisions about the type of textures, shapes, and patterns present at different parts of an image.

Imagine a low resolution image of a face, e.g., the 8x8 images depicted in the left column of Figure 1: the details of the hair and the skin are missing. Such details cannot be faithfully recovered using simple interpolation techniques such as linear or bicubic. However, by incorporating prior knowledge of faces and their typical variations, an artist is able to paint believable details. In this paper, we show how a fully probabilistic model that is trained end-to-end can play the role of such an artist by synthesizing the 32x32 face images depicted in the middle column of Figure 1. Our super resolution model comprises two components that are trained jointly: a conditioning network and a prior network. The conditioning network effectively maps a low resolution image to a distribution over corresponding high resolution images, while the prior models high resolution details to make the outputs look more realistic. Our conditioning network consists of a deep stack of ResNet [10] blocks, while our prior network comprises a PixelCNN [28] architecture.

We find that standard super resolution metrics such as peak signal-to-noise ratio (pSNR) and structural similarity (SSIM) fail to properly measure the quality of predictions for an underspecified super resolution task. These metrics prefer conservative blurry averages over more plausible photo realistic details, as new fine details often do not align exactly with the original details. Our evaluation studies demonstrate that humans easily distinguish real images from super resolution predictions when regression techniques are used, but they have a harder time telling our samples apart from real images.

Figure 1: Illustration of our probabilistic pixel recursive super resolution model trained end-to-end on a dataset of celebrity faces. The left column shows 8x8 low resolution inputs from the test set. The middle and last columns show 32x32 images as predicted by our model vs. the ground truth. Our model incorporates strong face priors to synthesize realistic hair and skin details.

* Work done as a member of the Google Brain Residency program (g.co/brainresidency).
2. Related work masked autoregressive model like PixelCNN [27] but with-
Super resolution has a long history in computer vision [22]. Methods relying on interpolation [11] are easy to implement and widely used; however, these methods suffer from a lack of expressivity, since linear models cannot express complex dependencies between the inputs and outputs. In practice, such methods often fail to adequately predict high frequency details, leading to blurry high resolution outputs.

Enhancing linear methods with rich image priors such as sparsity [2] or Gaussian mixtures [35] has substantially improved the quality of the methods; likewise, leveraging low-level image statistics such as edge gradients improves predictions [31, 26, 6, 12, 25, 17]. Much work has been done on algorithms that search a database of patches and combine them to create plausible high frequency details in zoomed images [7, 13]. Recent patch-based work has focused on improving basic interpolation methods by building a dictionary of pre-learned filters on images and selecting the appropriate patches by an efficient hashing mechanism [23]. Such dictionary methods have improved the inference speed while being comparable to the state-of-the-art.

Another approach for super resolution is to abandon inference speed requirements and focus on constructing high resolution images at increasingly higher magnification factors. Convolutional neural networks (CNNs) represent an approach to the problem that avoids explicit dictionary construction, instead implicitly extracting multiple layers of abstraction by learning layers of filter kernels. Dong et al. [5] employed a three layer CNN with an MSE loss. Kim et al. [16] improved accuracy by increasing the depth to 20 layers and learning only the residuals between the high resolution image and an interpolated low resolution image. Most recently, SRResNet [18] uses many ResNet blocks to achieve state of the art pSNR and SSIM on standard super resolution benchmarks; we employ a similar design for our conditioning network and our regression baseline.

Instead of using a per-pixel loss, Johnson et al. [14] use the Euclidean distance between activations of a pre-trained CNN for the model's predictions vs. ground truth images. Using this so-called perceptual loss, they train feed-forward networks for super resolution and style transfer. Bruna et al. [3] also use a perceptual loss to train a super resolution network, but inference is done via gradient propagation to the low-res input (e.g., [9]).

Ledig et al. [18] and Yu et al. [33] use GANs to create compelling super resolution results, showing the ability of the model to predict plausible high frequency details. Sønderby et al. [15] also investigate GANs for super resolution, using a learned affine transformation that ensures the models only generate images that downscale back to the low resolution inputs. Sønderby et al. [15] also explore a masked autoregressive model like PixelCNN [27], but without the gated layers and using a mixture of Gaussians instead of a multinomial distribution. Denton et al. [4] use a multi-scale adversarial network for image synthesis; the architecture also seems beneficial for super resolution.

PixelRNN and PixelCNN by van den Oord et al. [27, 28] are probabilistic generative models that impose an order on image pixels, representing them as a long sequence. The probability of each pixel is then conditioned on the previous ones. The gated PixelCNN obtained state of the art log-likelihood scores on CIFAR-10 and MNIST, making it one of the most competitive probabilistic generative models.

Since PixelCNN uses log-likelihood for training, the model is highly penalized if negligible probability is assigned to any of the training examples. By contrast, GANs only learn enough to fool a non-stationary discriminator. One of their common failure cases is mode collapse, where samples are not diverse enough [21]. Furthermore, GANs require careful tuning of hyperparameters to ensure the discriminator and generator are equally powerful and learn at equal rates. PixelCNNs are more robust to hyperparameter changes and usually have a nicely decaying loss curve. Thus, we adopt PixelCNN for super resolution applications.
3. Probabilistic super resolution

We aim to learn a probabilistic super resolution model that discerns the statistical dependencies between a high resolution image and a corresponding low resolution image. Let $x$ and $y$ denote a low resolution and a high resolution image, where $y^*$ represents a ground-truth high resolution image. In order to learn a parametric model of $p_\theta(y \mid x)$, we exploit a large dataset of pairs of low resolution inputs and ground-truth high resolution outputs, denoted $\mathcal{D} \equiv \{(x^{(i)}, y^{*(i)})\}_{i=1}^{N}$. One can easily collect such a large dataset by starting from a set of high resolution images and lowering their resolution as much as needed. To optimize the parameters $\theta$ of the conditional distribution $p$, we maximize a conditional log-likelihood objective defined as

$$\mathcal{O}(\theta \mid \mathcal{D}) = \sum_{(x,\, y^*) \in \mathcal{D}} \log p(y^* \mid x) \,. \qquad (1)$$
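For concreteness, here is a minimal sketch of this dataset construction, assuming Pillow for bicubic resizing; the target sizes follow the 32x32 to 8x8 setup of Section 5, and the file name is a hypothetical placeholder:

```python
# Build a (low-res, high-res) training pair by repeated bicubic downsampling.
from PIL import Image

def make_pair(path):
    img = Image.open(path).convert("RGB")
    hr = img.resize((32, 32), Image.BICUBIC)   # ground-truth output y*
    lr = hr.resize((8, 8), Image.BICUBIC)      # conditioning input x
    return lr, hr

# Replace with a real list of high resolution image files.
pairs = [make_pair(p) for p in ["face_0001.png"]]
```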

The key problem discussed in this paper is the exact form of $p(y \mid x)$ that enables efficient learning and inference, while generating realistic non-blurry outputs. We first discuss pixel-independent models that assume that each output pixel is generated with an independent stochastic process given the input. We elaborate why these techniques result in sub-optimal blurry super resolution results. Finally, we describe our pixel recursive super resolution model, which generates output pixels one at a time to enable modeling the statistical dependencies between the output pixels using PixelCNN [27, 28], and synthesizes sharp images from very blurry input.

Figure 2: Top: A cartoon of how the input and output pairs for the toy dataset were created (each input maps to two possible outputs, 50%/50%). Bottom: Example predictions for various algorithms trained on this dataset. The pixel independent L2 regression and cross-entropy models do not exhibit multimodal predictions. The PixelCNN output is stochastic and multiple samples will place a digit in either corner 50% of the time.
3.1. Pixel independent super resolution

The simplest form of a probabilistic super resolution model assumes that the output pixels are conditionally independent given the inputs. As such, the conditional distribution $p(y \mid x)$ factorizes into a product of independent pixel predictions. Suppose an RGB output $y$ has $M$ pixels, each with three color channels, i.e., $y \in \mathbb{R}^{3M}$. Then,

$$\log p(y \mid x) = \sum_{i=1}^{3M} \log p(y_i \mid x) \,. \qquad (2)$$

Two general forms of pixel prediction models have been explored in the literature: Gaussian and multinomial distributions to model continuous and discrete pixel values, respectively. In the Gaussian case,

$$\log p(y_i \mid x) = -\frac{1}{2\sigma^2}\, \lVert y_i - C_i(x) \rVert_2^2 - \log \sqrt{2\pi\sigma^2} \,, \qquad (3)$$

where $C_i(x)$ denotes the $i$th element of a non-linear transformation of $x$ via a convolutional neural network. $C_i(x)$ is the estimated mean for the $i$th output pixel $y_i$, and $\sigma^2$ denotes the variance. Often the variance is not learned, in which case maximizing the conditional log-likelihood of (1) reduces to minimizing the mean squared error (MSE) between $y_i$ and $C_i(x)$ across the pixels and channels throughout the dataset. Super resolution models based on MSE regression (e.g., [5, 16, 18]) fall within this family of pixel independent models, where the outputs of a neural network parameterize a set of Gaussians with fixed bandwidth.

Alternatively, one could use a flexible multinomial distribution as the pixel prediction model, in which case the output dimensions are discretized into $K$ possible values (e.g., $K = 256$), where $y_i \in \{1, \ldots, K\}$. The pixel prediction model based on a multinomial softmax operator is represented as

$$\log p(y_i = k \mid x) = w_{jk}^{\top} C_i(x) - \log \sum_{v=1}^{K} \exp\{ w_{jv}^{\top} C_i(x) \} \,, \qquad (4)$$

where $\{w_{jk}\}_{j=1,k=1}^{3,K}$ denote the softmax weights for different color channels and different discrete values.

3.2. Synthetic multimodal task

To demonstrate how the above pixel independent models can fail at conditional image modeling, we created a synthetic dataset that is explicitly multimodal. For many generative tasks like super resolution, colorization, and depth estimation, models that are able to predict a mode without averaging effects are desirable. For example, in colorization, selecting a strong red or blue for a car is better than selecting a sepia toned average of all of the colors of cars that the model has been exposed to. In this synthetic task, the input is an MNIST digit (1st row of Figure 2), and the output is the same input digit but scaled and translated either into the upper left corner or the upper right corner (2nd and 3rd rows of Figure 2). The dataset has an equal ratio of upper left and upper right outputs; we call it the MNIST corners dataset.
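As a concrete illustration, here is a minimal sketch of this construction; loading MNIST itself is omitted, and the exact rescaling used in the paper is an assumption (any 2x downscaling of the digit serves the illustration):

```python
# Each target places a shrunken copy of the input digit in the upper-left
# or upper-right corner with probability 1/2; the input is the digit itself.
import numpy as np

def corners_example(digit, rng):
    canvas = np.zeros((28, 28), dtype=digit.dtype)
    small = digit[::2, ::2]                    # crude 2x downscale to 14x14
    col = 0 if rng.random() < 0.5 else 14      # pick one of the two modes
    canvas[0:14, col:col + 14] = small
    return canvas

rng = np.random.default_rng(0)
target = corners_example(np.ones((28, 28), np.float32), rng)
```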
A convolutional network using a per-pixel squared error loss (Figure 2, "L2 Regression") produces two blurry figures. Replacing the continuous loss with a per-pixel cross-entropy produces crisper images, but also fails to capture the stochastic bimodality, because both digits are shown in both corners (Figure 2, "Cross-entropy"). In contrast, a model that explicitly deals with multi-modality, PixelCNN, stochastically predicts a digit in the upper-left or upper-right corner but never predicts both digits simultaneously (Figure 2, "PixelCNN").

See Figure 5 for examples of our super resolution model predicting different modes on a realistic dataset. Any good generative model should be able to make sharp single-mode predictions, and a dataset like this is a good starting point for evaluating new models.

Figure 3: The proposed super resolution network comprises a conditioning network and a prior network. The conditioning network is a CNN that receives a low resolution image as input and outputs logits predicting the conditional log-probability of each high resolution (HR) image pixel. The prior network, a PixelCNN [28], makes predictions based on previous stochastic predictions (indicated by dashed line). The model's probability distribution is computed as a softmax operator on top of the sum of the two sets of logits from the prior and conditioning networks.
4. Pixel recursive super resolution
The main issue with the previous probabilistic models (Equations (3) and (4)) for super resolution is the lack of conditional dependency between the super resolution pixels. There are two general methods to model statistical correlations between output pixels. One approach is to define the conditional distribution of the output pixels jointly, by either a multivariate Gaussian mixture [36] or an undirected graphical model such as conditional random fields [8]. With these approaches one has to commit to a particular form of statistical dependency between the output pixels, for which inference can be computationally expensive. The second approach, which we follow in this work, is to factorize the conditional distribution using the chain rule as

$$\log p(y \mid x) = \sum_{i=1}^{M} \log p(y_i \mid x, y_{<i}) \,, \qquad (5)$$

where the generation of each output dimension is conditioned on the input, previous output pixels, and the previous channels of the same output pixel. For simplicity of exposition, we ignore different output channels in our derivations, and use $y_{<i}$ to represent $\{y_1, \ldots, y_{i-1}\}$. The benefits of this approach are that the exact form of the conditional dependencies is flexible and the inference is straightforward. Inspired by the PixelCNN model, we use a multinomial distribution to model discrete pixel values in Eq. (5). Alternatively, one could use an autoregressive prediction model with Gaussian or Logistic (mixture) conditionals as proposed in [24].

Our model, outlined in Figure 3, comprises two major components that are fused together at a late stage and trained jointly: (1) a conditioning network, and (2) a prior network. The conditioning network is a pixel independent prediction model that maps a low resolution image to a probabilistic skeleton of a high resolution image, while the prior network is supposed to add natural high resolution details to make the outputs look more realistic.

Given an input $x \in \mathbb{R}^L$, let $A_i(x) : \mathbb{R}^L \to \mathbb{R}^K$ denote a conditioning network predicting a vector of logit values corresponding to the $K$ possible values that the $i$th output pixel can take. Similarly, let $B_i(y_{<i}) : \mathbb{R}^{i-1} \to \mathbb{R}^K$ denote a prior network predicting a vector of logit values for the $i$th output pixel. Our probabilistic model predicts a distribution over the $i$th output pixel by simply adding the two sets of logits and applying a softmax operator on them,

$$p(y_i \mid x, y_{<i}) = \mathrm{softmax}(A_i(x) + B_i(y_{<i})) \,. \qquad (6)$$

To optimize the parameters of $A$ and $B$ jointly, we perform stochastic gradient ascent to maximize the conditional log-likelihood in (1). That is, we optimize a cross-entropy loss between the model's predictions in (6) and the discrete ground truth labels $y_i^* \in \{1, \ldots, K\}$,

$$\mathcal{O}_1 = \sum_{(x,\, y^*) \in \mathcal{D}} \sum_{i=1}^{M} \Big( \mathbf{1}[y_i^*]^{\top} \big( A_i(x) + B_i(y_{<i}^*) \big) - \mathrm{lse}\big( A_i(x) + B_i(y_{<i}^*) \big) \Big) \,, \qquad (7)$$

where $\mathrm{lse}(\cdot)$ is the log-sum-exp operator corresponding to the log of the denominator of a softmax, and $\mathbf{1}[k]$ denotes a $K$-dimensional one-hot indicator vector with its $k$th dimension set to 1.
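To make the logit fusion of Eq. (6) and the per-pixel term of Eq. (7) concrete, here is a minimal numpy sketch; the array shapes and helper names are ours, not the paper's, and the logits would come from the trained networks A and B:

```python
# Log-likelihood of the true value k under the fused per-pixel distribution:
# the fused logit at k minus a numerically stable log-sum-exp normalizer.
import numpy as np

def log_prob_pixel(a_i, b_i, k):
    """a_i, b_i: length-K logits from A_i(x) and B_i(y_<i); k: true label."""
    fused = a_i + b_i                                   # Eq. (6), pre-softmax
    lse = fused.max() + np.log(np.sum(np.exp(fused - fused.max())))
    return fused[k] - lse                               # one term of Eq. (7)

K = 256
print(log_prob_pixel(np.zeros(K), np.zeros(K), 128))    # log(1/256)
```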
Our preliminary experiments indicate that models trained with (7) tend to ignore the conditioning network, as the statistical correlation between a pixel and the previous high resolution pixels is stronger than its correlation with the low resolution inputs. To mitigate this issue, we include an additional loss in our objective to enforce that the conditioning network is optimized. This additional loss measures the cross-entropy between the conditioning network's predictions via $\mathrm{softmax}(A_i(x))$ and the ground truth labels. The total loss that is optimized in our experiments is a sum of two cross-entropy losses, formulated as

$$\mathcal{O}_2 = \sum_{(x,\, y^*) \in \mathcal{D}} \sum_{i=1}^{M} \Big( \mathbf{1}[y_i^*]^{\top} \big( 2\, A_i(x) + B_i(y_{<i}^*) \big) - \mathrm{lse}\big( A_i(x) + B_i(y_{<i}^*) \big) - \mathrm{lse}\big( A_i(x) \big) \Big) \,. \qquad (8)$$
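A sketch of the corresponding per-pixel term, showing how the two cross-entropies collapse into the doubled $A_i(x)$ of Eq. (8) (again a numpy illustration under our own naming, not the training code):

```python
# Per-pixel term of Eq. (8): fused cross-entropy plus a cross-entropy on the
# conditioning logits alone, which is what doubles a_i[k] in the first term.
import numpy as np

def lse(v):
    m = v.max()
    return m + np.log(np.sum(np.exp(v - m)))

def o2_pixel_term(a_i, b_i, k):
    return (2.0 * a_i[k] + b_i[k]) - lse(a_i + b_i) - lse(a_i)

print(o2_pixel_term(np.zeros(256), np.zeros(256), 0))   # 2 * log(1/256)
```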
Once the network is trained, sampling from the model is straightforward. Using (6), starting at $i = 1$, we first sample a high resolution pixel. Then, we proceed pixel by pixel, feeding the previously sampled pixel values back into the network, and draw new high resolution pixels. The three channels of each pixel are generated sequentially in turn.
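In outline, the sampling procedure looks like the following sketch; the stand-in functions for A and B are hypothetical placeholders for the trained networks:

```python
# Autoregressive sampling: draw pixels one at a time in raster order and
# feed each draw back in before predicting the next pixel.
import numpy as np

def sample(x_lr, n_pixels, cond_logits, prior_logits, rng):
    y = []
    for i in range(n_pixels):
        logits = cond_logits(x_lr, i) + prior_logits(y)  # Eq. (6) pre-softmax
        p = np.exp(logits - logits.max())
        p /= p.sum()
        y.append(rng.choice(len(p), p=p))                # feed the draw back
    return np.asarray(y)

rng = np.random.default_rng(0)
uniform = lambda *_: np.zeros(256)                       # dummy stand-ins
print(sample(None, 4, uniform, uniform, rng))
```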
We additionally consider greedy decoding, where one always selects the pixel value with the largest probability, and sampling from a tempered softmax, where the concentration of a distribution $p$ is adjusted by using a temperature parameter $\tau > 0$:

$$p_\tau = \frac{p^{\tau}}{\lVert p^{\tau} \rVert_1} \,.$$

To control the concentration of our sampling distribution $p(y_i \mid x, y_{<i})$, it suffices to multiply the logits from $A$ and $B$ by the parameter $\tau$. Note that as $\tau$ goes towards $\infty$, the distribution converges to the mode (we use a non-standard notion of temperature that represents $1/\tau$ in the standard notation), and sampling converges to greedy decoding.
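A small sketch of tempered sampling under this convention; the probability values below are illustrative only:

```python
# Multiplying the fused logits by tau sharpens the per-pixel distribution;
# large tau approaches greedy decoding in the paper's notation.
import numpy as np

def tempered_probs(a_i, b_i, tau):
    logits = tau * (a_i + b_i)
    p = np.exp(logits - logits.max())
    return p / p.sum()                 # equals p^tau / ||p^tau||_1

logits = np.log([0.7, 0.2, 0.1])
for tau in (1.0, 1.2, 8.0):
    print(tau, tempered_probs(logits, np.zeros(3), tau).round(3))
```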
4.1. Implementation details

The conditioning network is a feed-forward convolutional neural network that takes an 8x8 RGB image through a series of ResNet [10] blocks and transposed convolution layers while maintaining 32 channels throughout. The last layer uses a 1x1 convolution to increase the channels to 256x3 and uses the resulting activations to predict a multinomial distribution over 256 possible sub-pixel values via a softmax operator.

This network provides the ability to absorb the global structure of the image in the marginal probability distribution of the pixels. Due to the softmax layer it can capture the rich intricacies of the high resolution distribution, but we have no way to coherently sample from it: sampling sub-pixels independently would mix the assortment of distributions.

The prior network provides a way to tie together the sub-pixel distributions and allows us to take samples that depend on each other. We use 20 gated PixelCNN layers with 32 channels at each layer. We leave conditioning until the late stages of the network, where we add the pre-softmax activations from the conditioning network and the prior network before computing the final joint softmax distribution.

Our model is built using TensorFlow [1] and trained across 8 GPUs with synchronous SGD updates. See appendix A for further details.

5. Experiments

We assess the effectiveness of the proposed pixel recursive super resolution method on two datasets containing small faces and bedroom images. The first dataset is a version of the CelebA dataset [19] composed of a set of celebrity faces, which are cropped around the face. In the second dataset, LSUN Bedrooms [32], images are center cropped. In both datasets we resize the images to 32x32 with bicubic interpolation and again to 8x8, constituting the output and input pairs for training and evaluation. We present representative super resolution examples on held out test sets and report human evaluations of our predictions in Table 1.

We compare our results with two baselines: a pixel independent L2 regression ("Regression") and a nearest neighbors search ("NN"). The architecture used for the regression baseline is identical to the conditioning network used in our model, consisting of several ResNet blocks and upsampling convolutional layers, except that the baseline regression model outputs three channels and has a final tanh() non-linearity instead of a ReLU. The regression architecture is similar in design to SRResNet [18], which reports state of the art scores in image similarity metrics. Furthermore, we train the regression network to predict super resolution residuals instead of the actual pixel values. The residuals are computed based on bicubic interpolation of the input, and are known to provide superior predictions [16]. The nearest neighbors baseline searches the downsampled training set for the nearest example (using Euclidean distance) and returns its high resolution counterpart.

Figure 4: Samples from the model trained on LSUN Bedrooms at 32x32.

Figure 5: Left: low-res input. Right: Diversity of super resolution samples at tau = 1.2.
5.1. Sampling
Sampling from the model multiple times results in different high resolution images for a given low resolution image (Figure 5). A given model will identify many plausible high resolution images that correspond to a given lower resolution image. Each one of these samples may contain distinct qualitative features, and each of these modes is contained within the PixelCNN. Note that the differences between samples for the faces dataset are far less drastic than in our synthetic dataset, where failure to cleanly predict modes meant complete failure.

The quality of samples is sensitive to the softmax temperature. When the mode is sampled (tau = infinity) at each sub-pixel, the samples are of poor quality: they look smooth, with horizontal and vertical line artifacts. Samples at tau = 1.0, the exact probability given by the model, tend to be more jittery with high frequency content. It seems in this case there are multiple less certain trajectories, and the samples jump back and forth between them; perhaps this is alleviated with more capacity and training time. Manually tuning the softmax temperature was necessary to find good looking samples; usually a value between 1.1 and 1.3 worked.

In Figure 6 are various test predictions with their negative log probability scores listed below each image. Smaller scores mean the model has assigned that image a larger probability mass. The greedy, bicubic, and regression faces are preferred by the model despite being of worse quality. This is probably because their smooth face-like structure doesn't contradict the learned distributions. Yet sampling with the proper softmax temperature nevertheless finds realistic looking images.
Ground Truth | NN | Bicubic | Regression | Greedy | tau = 1.0 | tau = 1.1 | tau = 1.2
2.85 | 2.74 | 1.76 | 2.34 | 1.82 | 2.94 | 2.79 | 2.69
2.96 | 2.71 | 1.82 | 2.17 | 1.77 | 3.18 | 3.09 | 2.95
2.76 | 2.63 | 1.80 | 2.35 | 1.64 | 2.99 | 2.90 | 2.64

Figure 6: Our model does not produce calibrated log-probabilities for the samples. Negative log-probabilities are reported below each image (one row per test face). Note that the best log-probabilities arise from bicubic interpolation and greedy sampling even though the images are poor quality.

5.2. Image similarity

Many methods exist for quantifying image similarity that attempt to measure human perceptual judgements of similarity [29, 30, 20]. We quantified the prediction accuracy of our model compared to ground truth using pSNR and MS-SSIM (Table 1). We found that our own visual assessment of the predicted image quality did not correspond to these image similarity metrics. For instance, bicubic interpolation achieved relatively high metrics even though the samples appeared quite poor. This result matches recent observations suggesting that pSNR and SSIM provide poor judgements of super resolution quality [18, 14] when new details are synthesized.

To ensure that samples do indeed correspond to the low-resolution input, we measured how consistent the high resolution output image is with the low resolution input image (Table 1, "Consistency"). Specifically, we measured the L2 distance between the low-resolution input image and a bicubic downsampled version of the high resolution estimate. Lower L2 distances correspond to high resolution outputs that are more similar to the original low resolution image. Note that the nearest neighbor high resolution images are less consistent, even though we used a database of 3 million training images to search for neighbors in the case of LSUN Bedrooms. In contrast, the bicubic resampling and the PixelCNN upsampling methods were consistently closer to the low resolution image. This indicates that our samples do indeed correspond to the low-resolution input.

5.3. Human study

We presented crowd sourced workers with two images: a true image and the corresponding prediction from one of our various models. Workers were asked "Which image, would you guess, is from a camera?" Following the setup in Zhang et al. [34], we present each image for one second at a time before allowing them to answer. Workers are started with 10 practice pairs, during which they get feedback on whether they chose correctly; the practice pairs are not counted in the results. After the practice pairs, they are shown 45 additional pairs, 5 of which are golden questions designed to test if the person is paying attention. The golden questions pit a bicubicly upsampled image (very blurry) against the ground truth. Excluding the golden and practice questions, we count forty answers per session. Sessions in which a worker missed any golden questions are thrown out. Workers were only allowed to participate in any of our studies once. We continued running sessions until forty different workers were tested on each of the four algorithms.

We report in Table 1 the percent of the time users chose an algorithm's output over the ground truth counterpart. Note that 50% would mean that an algorithm perfectly confused the subjects.
Algorithm | pSNR | SSIM | MS-SSIM | Consistency | % Fooled

(Cropped CelebA)
Bicubic | 28.92 | 0.84 | 0.76 | 0.006 | -
NN | 28.18 | 0.73 | 0.66 | 0.024 | -
Regression | 29.16 | 0.90 | 0.90 | 0.004 | 4.0 ± 0.2
tau = 1.0 | 29.09 | 0.84 | 0.86 | 0.008 | 11.0 ± 0.1
tau = 1.1 | 29.08 | 0.84 | 0.85 | 0.008 | 10.4 ± 0.2
tau = 1.2 | 29.08 | 0.84 | 0.86 | 0.008 | 10.2 ± 0.1

(LSUN Bedrooms)
Bicubic | 28.94 | 0.70 | 0.70 | 0.002 | -
NN | 28.15 | 0.49 | 0.45 | 0.040 | -
Regression | 28.87 | 0.74 | 0.75 | 0.003 | 2.1 ± 0.1
tau = 1.0 | 28.92 | 0.58 | 0.60 | 0.016 | 17.7 ± 0.4
tau = 1.1 | 28.92 | 0.59 | 0.59 | 0.017 | 22.4 ± 0.3
tau = 1.2 | 28.93 | 0.59 | 0.58 | 0.018 | 27.9 ± 0.3

Table 1: Top: results on the cropped CelebA test dataset at 32x32, magnified from 8x8. Bottom: LSUN Bedrooms. pSNR, SSIM, and MS-SSIM measure image similarity between samples and the ground truth. "Consistency" lists the MSE between the input low-res image and downsampled samples on a [0, 1] scale. "% Fooled" reports how often the algorithm's samples fooled a human in a crowd sourced study; 50% would be perfectly confused.
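The consistency column can be computed in a few lines; here is a sketch of the measurement, assuming Pillow's bicubic resampling stands in for the paper's downsampling:

```python
# MSE on a [0, 1] scale between the low-res input and a bicubic
# downsampling of a high-res output, as in Table 1's "Consistency".
import numpy as np
from PIL import Image

def consistency(lr_img, hr_img):
    down = hr_img.resize(lr_img.size, Image.BICUBIC)
    a = np.asarray(lr_img, dtype=np.float64) / 255.0
    b = np.asarray(down, dtype=np.float64) / 255.0
    return float(np.mean((a - b) ** 2))
```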

6. Conclusion

As in many image transformation tasks, the central problem of super resolution is in hallucinating sharp details by choosing a mode of the output distribution. We explored this underspecified problem using small images, demonstrating that even the smallest 8x8 images can be enlarged to sharp 32x32 images. We introduced a toy dataset with a small number of explicit modes to demonstrate the failure cases of two common pixel independent likelihood models. In the presented model, the conditioning network gets us most of the way towards predicting a high-resolution image, but the outputs are blurry where the model is uncertain. Combining the conditioning network with a PixelCNN model provides a strong prior over the output pixels, allowing the model to generate crisp predictions. Our human evaluations indicate that samples from our model on average look more photo realistic than those of a strong regression based conditioning network alone.

Figure 7: The best and worst rated images in the human study ("Ours" vs. "Ground Truth" pairs, with per-image preference fractions ranging from 1/40 = 2% up to 34/40 = 85%). The fractions below the images denote how many times a person chose that image over the ground truth. See the supplementary material for more images used in the study.

References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mane, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viegas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[2] M. Aharon, M. Elad, and A. Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. Trans. Sig. Proc., 54(11):4311-4322, Nov. 2006.
[3] J. Bruna, P. Sprechmann, and Y. LeCun. Super-resolution with deep convolutional sufficient statistics. CoRR, abs/1511.05666, 2015.
[4] E. L. Denton, S. Chintala, A. Szlam, and R. Fergus. Deep generative image models using a Laplacian pyramid of adversarial networks. NIPS, 2015.
[5] C. Dong, C. C. Loy, K. He, and X. Tang. Image super-resolution using deep convolutional networks. CoRR, abs/1501.00092, 2015.
[6] R. Fattal. Image upsampling via imposed edge statistics. ACM Trans. Graph., 26(3), July 2007.
[7] W. T. Freeman, T. R. Jones, and E. C. Pasztor. Example-based super-resolution. IEEE Computer Graphics and Applications, 2002.
[8] W. T. Freeman and E. C. Pasztor. Markov networks for super-resolution. In CISS, 2000.
[9] L. A. Gatys, A. S. Ecker, and M. Bethge. A neural algorithm of artistic style. CoRR, abs/1508.06576, 2015.
[10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CVPR, 2015.
[11] H. Hou and H. Andrews. Cubic splines for image interpolation and digital filtering. IEEE Transactions on Acoustics, Speech and Signal Processing, 26(6):508-517, 1978.
[12] J. Huang and D. Mumford. Statistics of natural images and models. In CVPR, volume 1. IEEE, 1999.
[13] J.-B. Huang, A. Singh, and N. Ahuja. Single image super-resolution from transformed self-exemplars. In CVPR, 2015.
[14] J. Johnson, A. Alahi, and F. Li. Perceptual losses for real-time style transfer and super-resolution. CoRR, abs/1603.08155, 2016.
[15] C. Kaae Sønderby, J. Caballero, L. Theis, W. Shi, and F. Huszár. Amortised MAP inference for image super-resolution. ArXiv e-prints, Oct. 2016.
[16] J. Kim, J. K. Lee, and K. M. Lee. Accurate image super-resolution using very deep convolutional networks. CoRR, abs/1511.04587, 2015.
[17] K. I. Kim and Y. Kwon. Single-image super-resolution using sparse regression and natural image prior. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(6):1127-1133, 2010.
[18] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photo-realistic single image super-resolution using a generative adversarial network. arXiv:1609.04802, 2016.
[19] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), 2015.
[20] K. Ma, Q. Wu, Z. Wang, Z. Duanmu, H. Yong, H. Li, and L. Zhang. Group MAD competition - a new methodology to compare objective image quality models. In CVPR, June 2016.
[21] L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein. Unrolled generative adversarial networks. CoRR, abs/1611.02163, 2016.
[22] K. Nasrollahi and T. B. Moeslund. Super-resolution: A comprehensive survey. Mach. Vision Appl., 25(6):1423-1468, Aug. 2014.
[23] Y. Romano, J. Isidoro, and P. Milanfar. RAISR: Rapid and accurate image super resolution. CoRR, abs/1606.01299, 2016.
[24] T. Salimans, A. Karpathy, X. Chen, D. P. Kingma, and Y. Bulatov. PixelCNN++: A PixelCNN implementation with discretized logistic mixture likelihood and other modifications. Under review at ICLR 2017.
[25] Q. Shan, Z. Li, J. Jia, and C.-K. Tang. Fast image/video upsampling. ACM Transactions on Graphics (TOG), 27(5):153, 2008.
[26] J. Sun, Z. Xu, and H.-Y. Shum. Image super-resolution using gradient profile prior. In CVPR, pages 1-8. IEEE, 2008.
[27] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. ICML, 2016.
[28] A. van den Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, and K. Kavukcuoglu. Conditional image generation with PixelCNN decoders. NIPS, 2016.
[29] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600-612, 2004.
[30] Z. Wang, E. P. Simoncelli, and A. C. Bovik. Multiscale structural similarity for image quality assessment. In Conference Record of the Thirty-Seventh Asilomar Conference on Signals, Systems and Computers, volume 2, pages 1398-1402. IEEE, 2004.
[31] C. Y. Yang, S. Liu, and M. H. Yang. Structured face hallucination. In CVPR, pages 1099-1106, June 2013.
[32] F. Yu, Y. Zhang, S. Song, A. Seff, and J. Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
[33] X. Yu and F. Porikli. Ultra-resolving face images by discriminative generative networks, pages 318-333. Springer International Publishing, Cham, 2016.
[34] R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. ECCV, 2016.
[35] D. Zoran and Y. Weiss. From learning models of natural image patches to whole image restoration. In ICCV, pages 479-486, 2011.
[36] D. Zoran and Y. Weiss. From learning models of natural image patches to whole image restoration. In CVPR, 2011.
A. Hyperparameters

Operation | Kernel | Strides | Feature maps

Conditioning network (input: 8x8x3)
B x ResNet block | 3x3 | 1 | 32
Transposed convolution | 3x3 | 2 | 32
B x ResNet block | 3x3 | 1 | 32
Transposed convolution | 3x3 | 2 | 32
B x ResNet block | 3x3 | 1 | 32
Convolution | 1x1 | 1 | 3x256

PixelCNN network (input: 32x32x3)
Masked convolution | 7x7 | 1 | 64
20 x gated convolution layers | 5x5 | 1 | 64
Masked convolution | 1x1 | 1 | 1024
Masked convolution | 1x1 | 1 | 3x256

Optimizer: RMSProp (decay=0.95, momentum=0.9, epsilon=1e-8)
Batch size: 32
Iterations: 2,000,000 for Bedrooms, 200,000 for faces
Learning rate: 0.0004, divided by 2 every 500,000 steps
Weight, bias initialization: truncated normal (stddev=0.1), Constant(0)

Table 2: Hyperparameters used for both datasets. For LSUN Bedrooms B = 10 and for the cropped CelebA faces B = 6.
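For orientation, here is a sketch of the conditioning-network column of Table 2 in tf.keras; the paper's own code is not reproduced here, so the exact ResNet block and layer names are assumptions that merely match the table's shapes (B = 6 for faces, B = 10 for Bedrooms):

```python
# Conditioning network sketch: B residual blocks at 32 channels, two 2x
# transposed-convolution upsamplings (8 -> 16 -> 32), then a 1x1 convolution
# producing 3 * 256 per-sub-pixel logits.
import tensorflow as tf
from tensorflow.keras import layers

def res_block(x):
    h = layers.Conv2D(32, 3, padding="same", activation="relu")(x)
    h = layers.Conv2D(32, 3, padding="same")(h)
    return layers.ReLU()(layers.add([x, h]))

def conditioning_network(B=6):
    inp = layers.Input((8, 8, 3))
    h = layers.Conv2D(32, 3, padding="same")(inp)   # lift input to 32 channels
    for stage in range(3):
        for _ in range(B):
            h = res_block(h)
        if stage < 2:
            h = layers.Conv2DTranspose(32, 3, strides=2, padding="same")(h)
    logits = layers.Conv2D(3 * 256, 1)(h)           # Table 2's final 1x1 conv
    return tf.keras.Model(inp, logits)

conditioning_network().summary()
```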
B. LSUN Bedrooms samples

[Sample grids; columns: Input, Bicubic, Regression, tau = 1.0, tau = 1.1, tau = 1.2, Truth, NN.]
C. Cropped CelebA faces

[Sample grids; columns: Input, Bicubic, Regression, tau = 1.0, tau = 1.1, tau = 1.2, Truth, NN.]
