Pixel Recurrent Neural Networks

Notes for the Data Science Reading Group meetup
June 21, 2017

The paper

Pixel Recurrent Neural Networks
Aäron van den Oord, Nal Kalchbrenner, Koray Kavukcuoglu
Google DeepMind


This paper won a Best Paper award at ICML 2016. It tackles a difficult problem and the results are impressive, but the paper is not an easy read. In these notes I’ll try to give some context for the approach and the results.

My main interest in proposing this paper was the prospect of learning about novel architectures for 2D recurrent neural networks (RNNs). RNNs for 1D problems like language modeling and translation are well-established, but 2D versions are not nearly as thoroughly explored. And indeed the RNN models in this paper are eye-opening in various ways.

However, the authors have a follow-on paper (Conditional Image Generation with PixelCNN Decoders) where they leave the PixelRNN models behind and instead focus on extending the PixelCNN model. They get better results more efficiently than with PixelRNN.

Content-aware fill

The work has been further developed by OpenAI (PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications), who have released code for it. And speaking of code, another implementation speeds up generation a lot: Fast PixelCNN++: speedy image generation.

Another motivation I had for looking at this paper was that I thought it might provide an interesting alternative to Photoshop’s content-aware fill, which impressively constructs missing bits of images. However, while this paper’s models can extrapolate missing pixels, they cannot interpolate them. So it’s not a content-aware fill replacement out of the box, at least.

What is this about?

The authors want to model the distribution of natural images. The idea is that there is a probability distribution over all possible images and natural images are more likely according to that distribution. If we knew that distribution, we could ask questions like, “Given an image, what is the probability that it is a natural image (as opposed to random pixels)?” Or we could generate images because we could calculate conditional probabilities to answer the question, “Given some pixels of a natural image, what are the most likely values for pixels we haven’t seen yet?”

Image denoising

A good model for the distribution of images would help with tasks like compression, denoising, reconstruction, and superresolution.

This problem is in the realm of unsupervised learning, which is a term that is widely used but whose definition is a little elusive. Basically it means that we learn something about the structure of our data without having to train using a bunch of labeled examples. Examples of unsupervised learning include finding clusters of topics among multiple news items, or separating voices in a multi-speaker recording.

What do they do in this paper?

Fig. 1 Pixel indexing

The authors build a probabilistic model of natural images. Suppose we have an \(n\times n\) image \(\mathbf{x}\) consisting of pixels \(\left\{ x_{i}\right\} _{i=1}^{n^{2}}\), where we have indexed the pixels by reading them row by row starting at the top-left corner (Fig. 1). It would be great to have a feasible way to access the full joint probability distribution \(p(\mathbf{x})=p(x_{1},x_{2},\ldots,x_{n^{2}})\) directly, but that is beyond the current state of the art, so instead we focus on a particular factorization into conditional distributions:

\[p(\mathbf{x})=\prod_{i=1}^{n^{2}}p(x_{i}|x_{1},\ldots,x_{i-1})\]


The contribution of this paper is to devise a tractable way to estimate each conditional probability \(p(x_{i}|x_{1},\ldots,x_{i-1})\), which is the probability of the \(i\)-th pixel \(x_{i}\) given all the previous pixels \(x_{1},\ldots,x_{i-1}\).
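To make the idea of scoring an image one pixel at a time concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions: `toy_conditional` is a hypothetical stand-in that favors values close to the previous pixel, not the network from the paper, and `raster_index` and `log_likelihood` are names I made up for these notes.

```python
import numpy as np

def raster_index(row, col, n):
    """1-based index i of pixel (row, col) when reading an n x n image row by row."""
    return row * n + col + 1

def toy_conditional(previous, levels=256):
    """Hypothetical stand-in for p(x_i | x_1, ..., x_{i-1}).

    Returns a distribution over `levels` values that favors values close to
    the most recent pixel. The paper uses a neural network here instead.
    """
    values = np.arange(levels)
    center = previous[-1] if len(previous) > 0 else levels // 2
    logits = -np.abs(values - center) / 8.0
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

def log_likelihood(image, conditional=toy_conditional):
    """Sum of log p(x_i | x_1, ..., x_{i-1}) over pixels in raster order."""
    pixels = image.reshape(-1)  # row-major flattening is raster order
    total = 0.0
    for i in range(pixels.size):
        probs = conditional(pixels[:i])
        total += np.log(probs[pixels[i]])
    return total

smooth = np.full((4, 4), 100, dtype=np.int64)
noisy = np.random.default_rng(0).integers(0, 256, size=(4, 4))
print(raster_index(0, 0, 4), raster_index(3, 3, 4))  # 1 16
print(log_likelihood(smooth) > log_likelihood(noisy))  # True
```

Even with this crude conditional, a smooth image gets a higher log-likelihood than random pixels, which is the kind of question a good model of natural images should answer well.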

One novel thing the authors do is model the pixel values as samples from the discrete set \(\{0,1,\ldots,255\}\) instead of from a continuous distribution on \([0,1]\) (for example). This simplifies things and is in some ways more natural. Given this, it would be more conventional to write \(P\) for probabilities rather than \(p\), which usually denotes a probability density, but let’s not quibble.
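As a rough illustration of what a discrete pixel distribution looks like in practice, the sketch below turns 256 scores into probabilities with a softmax and samples one pixel value. The scores here are random numbers I made up; in a real model they would come from the network, so treat this as a sketch and not as the authors' code.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1D array of scores."""
    shifted = logits - logits.max()
    exp = np.exp(shifted)
    return exp / exp.sum()

rng = np.random.default_rng(0)
logits = rng.normal(size=256)       # made-up scores, one per pixel value 0..255
probs = softmax(logits)             # P(x_i = v | earlier pixels) for v = 0..255
sample = rng.choice(256, p=probs)   # draw one pixel value from the distribution
cost_bits = -np.log2(probs[sample]) # bits needed to encode that value
print(probs.shape, int(sample), float(cost_bits))
```

One nice property of this setup is that the distribution can take any shape over the 256 values, including several peaks, without us having to choose a parametric family in advance.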

Categories of image generators

Among the things shown in this paper are images generated from the model. There are several ways of generating images (some of which we have encountered in the reading group before). They can be categorized as:

  1. Variational autoencoders (VAE)
  2. Generative adversarial networks (GAN)
  3. Autoregressive models

I won’t say any more about these other than to note that PixelRNN falls into the autoregressive category.

The models

The paper describes four models:

  1. Row LSTM (PixelRNN)
  2. Diagonal BiLSTM (PixelRNN)
  3. PixelCNN
  4. Multi-scale PixelRNN

These are all pretty complicated and while I have a general idea of what they are doing, I don’t understand them well enough to code them up.

If you are interested in going further with the ideas in this paper, you should focus on the concepts of PixelCNN, masked convolutions, and residual blocks. They are the key ingredients for follow-on work by the authors and others.
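For the masked convolution idea, here is a small single-channel NumPy sketch. The function names are mine, and it ignores the extra masking between color channels that the paper describes, so take it as a simplified illustration only. As I understand the paper, a type "A" mask (used in the first layer) hides the current pixel, while a type "B" mask (used in later layers) lets it through.

```python
import numpy as np

def causal_mask(kernel_size, mask_type):
    """Mask that keeps only kernel positions above or to the left of the center.

    Type "A" also zeroes the center position; type "B" keeps it.
    """
    if mask_type not in ("A", "B"):
        raise ValueError("mask_type must be 'A' or 'B'")
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd")
    mask = np.zeros((kernel_size, kernel_size), dtype=np.float32)
    center = kernel_size // 2
    mask[:center, :] = 1.0        # every row above the current pixel
    mask[center, :center] = 1.0   # pixels to the left in the current row
    if mask_type == "B":
        mask[center, center] = 1.0
    return mask

def masked_conv2d(image, weights, mask):
    """Zero-padded 'same' cross-correlation of a 2D image with a masked kernel."""
    k = weights.shape[0]
    pad = k // 2
    padded = np.pad(image, pad)   # constant zero padding on all sides
    kernel = weights * mask
    out = np.zeros_like(image, dtype=np.float64)
    for r in range(image.shape[0]):
        for c in range(image.shape[1]):
            out[r, c] = np.sum(padded[r:r + k, c:c + k] * kernel)
    return out

print(causal_mask(3, "A"))
# [[1. 1. 1.]
#  [1. 0. 0.]
#  [0. 0. 0.]]
```

The point of the mask is that the output at a pixel depends only on pixels that come earlier in raster order, which is what the conditional distributions require.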

The results

There are two categories of results: negative log-likelihood (NLL), which can be thought of as a measure of compressibility, and images generated by the model.

The NLL results are substantially better than the previous state of the art. On CIFAR-10, for example, the best model (Diagonal BiLSTM) achieves 3.00 bits per dimension (bpd). Since the raw images use 8 bits per dimension, you could theoretically build a lossless compressor for CIFAR-10 images that has a compression ratio of 8:3 ≈ 2.67. I compressed CIFAR-10 via PNG and got 5.87 bpd. Granted, PNG was not designed for this kind of image, but a state-of-the-art lossless compressor like EMMA didn’t do much better. So the PixelRNN results are certainly noteworthy.
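The arithmetic behind these numbers is simple enough to check in a few lines of Python. The helper names are mine; the only facts assumed are that CIFAR-10 images are 32×32 RGB with 8 bits per color value, and that an NLL reported in nats is converted to bits by dividing by ln 2.

```python
import math

def nats_to_bits_per_dim(nll_nats, num_dims):
    """Convert a per-image NLL in nats to bits per dimension."""
    return nll_nats / (num_dims * math.log(2))

def compression_ratio(bits_per_dim, raw_bits_per_dim=8):
    """Ideal lossless compression ratio relative to raw 8-bit storage."""
    return raw_bits_per_dim / bits_per_dim

cifar_dims = 32 * 32 * 3  # 3072 color values per CIFAR-10 image

print(round(compression_ratio(3.00), 2))  # 2.67 (PixelRNN, Diagonal BiLSTM)
print(round(compression_ratio(5.87), 2))  # 1.36 (my PNG experiment)
print(3.00 * cifar_dims / 8)              # 1152.0 bytes per image on average
```

So at 3.00 bpd an ideal entropy coder driven by the model would need about 1152 bytes per image on average, versus 3072 bytes for the raw pixels.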

The other results are visual: the bottom half of each image is occluded and then filled in by the model. The results might not impress someone who doesn’t know how hard this is, but they look sort of natural if you squint at them a bit.
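The filling-in procedure can be sketched as a loop that samples the occluded pixels one at a time in raster order, each conditioned on everything before it. In this sketch `neighbor_conditional` is a hypothetical toy distribution of my own, standing in for the trained network, so the output will not look like a natural image.

```python
import numpy as np

def neighbor_conditional(previous, levels=256):
    """Hypothetical toy conditional: favors values near the previous pixel."""
    values = np.arange(levels)
    center = previous[-1] if len(previous) > 0 else levels // 2
    logits = -np.abs(values - center) / 8.0
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

def complete_bottom_half(image, conditional, rng):
    """Keep the top half of `image` and sample the bottom half pixel by pixel."""
    n_rows, n_cols = image.shape
    pixels = image.reshape(-1).copy()  # copy so the input image is untouched
    start = (n_rows // 2) * n_cols     # first occluded pixel in raster order
    for i in range(start, pixels.size):
        probs = conditional(pixels[:i])              # depends on earlier pixels only
        pixels[i] = rng.choice(probs.size, p=probs)  # sample, then move on
    return pixels.reshape(n_rows, n_cols)

rng = np.random.default_rng(0)
image = np.full((4, 4), 100, dtype=np.int64)
completed = complete_bottom_half(image, neighbor_conditional, rng)
print(completed)
```

The loop also shows why these models extrapolate but do not interpolate: each sampled pixel only sees pixels earlier in raster order, so known pixels below or to the right of a hole cannot influence what gets filled in.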