Deep Unsupervised Learning
Tutorial – Part 1
Alex Graves, Marc’Aurelio Ranzato
– Yann LeCun
Example
● ImageNet training set contains ~1.28M images, each assigned one of
1000 labels
● If labels are equally probable, complete set of randomly shuffled labels
contains ~log2(1000)*1.28M ≈ 12.8 Mbits
● Complete set of images, uncompressed at 128×128, contains ~500
Gbits: > 4 orders of magnitude more
● A large conv net (~30M weights) can memorise randomised ImageNet
labellings. Could it memorise randomised pixels?
Understanding Deep Learning Requires Rethinking Generalization, Zhang et al. (2016)
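A quick back-of-the-envelope check of the numbers above (assuming 8-bit RGB pixels at 128×128):

```python
import math

n_images = 1.28e6           # ImageNet training set size
n_classes = 1000

# Information content of the labels: log2(1000) bits per image
label_bits = n_images * math.log2(n_classes)
print(f"labels: {label_bits / 1e6:.1f} Mbits")    # ~12.8 Mbits

# Raw pixel content: 128 x 128 pixels x 3 channels x 8 bits each
pixel_bits = n_images * 128 * 128 * 3 * 8
print(f"pixels: {pixel_bits / 1e9:.0f} Gbits")    # ~503 Gbits

print(f"ratio: {pixel_bits / label_bits:.0f}x")   # ~4e4, i.e. > 4 orders of magnitude
```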
Supervised Learning
● Given a dataset D of inputs x labelled with targets y, learn to predict
y from x, typically by maximum likelihood (objective written out below)
● Goal is to learn the ‘true’ distribution from which the data was drawn
● Means attempting to learn everything about the data
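Written out, the maximum-likelihood objective referred to above is the standard one (notation assumed, since the slide’s formula is not reproduced here):

```latex
\theta^{*} = \arg\max_{\theta} \sum_{(x,\,y) \in D} \log p_{\theta}(y \mid x)
```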
Where to Look
Not everyone agrees that trying to understand everything is a good
idea. Shouldn’t we instead focus on things that we believe will one day
be useful for us?
… we lived our lives under the constantly changing sky without sparing it a
glance or a thought. And why indeed should we? If the various formations had
had some meaning, if, for example, there had been concealed signs and
messages for us which it was important to decode correctly, unceasing
attention to what was happening would have been inescapable…
For decades, the quintessentially New York city has elevated its streets to the status of an icon.
van den Oord, A., et al. “WaveNet: A Generative Model for Raw Audio.” arXiv (2016).
PixelRNN - Model
● Fully visible
● Model pixels with Softmax
● ‘Language model’ for images
van den Oord, A., et al. “Pixel Recurrent Neural Networks.” ICML (2016).
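A minimal sketch of the idea rather than the PixelRNN architecture itself: flatten the image into a sequence of 256-way symbols and train an autoregressive model with a softmax over intensities, exactly as in a language model (PyTorch; sizes are placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyPixelLM(nn.Module):
    """Autoregressive model over flattened pixel intensities (0..255).
    Illustrative only -- the real PixelRNN uses row/diagonal LSTMs or
    masked convolutions over the 2-D image."""
    def __init__(self, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(256, hidden)   # one symbol per intensity
        self.rnn = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 256)        # softmax over 256 intensities

    def forward(self, pixels):
        # pixels: (batch, seq_len) integer intensities
        h, _ = self.rnn(self.embed(pixels))
        return self.out(h)                       # (batch, seq_len, 256) logits

# Training: predict pixel t+1 from pixels <= t, by maximum likelihood
model = TinyPixelLM()
x = torch.randint(0, 256, (8, 64))               # e.g. 8 flattened 8x8 crops
logits = model(x[:, :-1])
loss = F.cross_entropy(logits.reshape(-1, 256), x[:, 1:].reshape(-1))
loss.backward()
```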
PixelRNN - Samples
van den Oord, A., et al. “Pixel Recurrent Neural Networks.” ICML (2016).
Conditional Pixel CNN
van den Oord, A., et al. “Conditional Image Generation with PixelCNN Decoders.” NIPS (2016).
Autoregressive over slices, then pixels within a slice
[Figure: source and target pixel grids divided into four interleaved slices (Slice 1–4); the numbers give the generation order, slice by slice, then raster order within each slice]
J. Menick et al., Generating High Fidelity Images with Subscale Pixel Networks and Multidimensional Upscaling (2018)
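A rough sketch of the slice ordering in the figure above (my reading of it, not the paper’s code): the image is split into S×S interleaved slices, and pixels are generated slice by slice, in raster order within each slice.

```python
def subscale_order(height, width, s=2):
    """Return pixel coordinates in subscale order: slice by slice
    (offsets (0,0), (0,1), (1,0), (1,1) for s=2), raster order within a slice."""
    order = []
    for dy in range(s):
        for dx in range(s):
            for y in range(dy, height, s):
                for x in range(dx, width, s):
                    order.append((y, x))
    return order

order = subscale_order(4, 4, s=2)
# The first slice covers every other row/column; later slices fill in the rest,
# conditioned on the slices already generated.
print(order[:4])   # [(0, 0), (0, 2), (2, 0), (2, 2)]
```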
256 x 256 CelebA-HQ
Component Weights
Distribution over Sequences
Autoencoder
[Diagram: input → encoder → latent representation → decoder → reconstruction; trained with a reconstruction cost]
Slide: Irina Higgins, Loïc Matthey
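A minimal autoencoder matching the diagram (PyTorch; layer sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU(),
                                     nn.Linear(256, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, input_dim))

    def forward(self, x):
        z = self.encoder(x)              # latent representation
        return self.decoder(z)           # reconstruction

model = Autoencoder()
x = torch.rand(16, 784)                  # e.g. flattened MNIST digits
x_hat = model(x)
reconstruction_cost = nn.functional.mse_loss(x_hat, x)
```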
Variational AutoEncoder
Kingma et al. (2014); Rezende et al. (2014)
[Diagram: input → encoder → latent distribution → decoder → reconstruction; trained with a coding cost on the latent plus a reconstruction cost]
Slide: Irina Higgins, Loïc Matthey
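The corresponding VAE objective as a sketch (a diagonal Gaussian encoder and a Bernoulli decoder are assumptions, chosen for concreteness):

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_logits, mu, logvar):
    """Negative ELBO = reconstruction cost + coding cost (KL to the prior N(0, I)).
    mu, logvar are the encoder's outputs; x_logits are the decoder's outputs."""
    # Reconstruction cost: -log p(x|z) for a Bernoulli decoder
    recon = F.binary_cross_entropy_with_logits(x_logits, x, reduction='sum')
    # Coding cost: KL( N(mu, diag(exp(logvar))) || N(0, I) ), in closed form
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

# The latent sample is drawn inside the encoder with the reparameterisation trick:
#   z = mu + exp(0.5 * logvar) * eps,  eps ~ N(0, I)
```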
Minimum Description Length for VAE
● Alice wants to transmit x as compactly as possible to Bob, who knows
only the prior p(z) and the decoder weights
● The coding cost is the number of bits required for Alice to transmit a
sample from qθ(z|x) to Bob (e.g. bits-back coding)
● The reconstruction cost is the number of additional error bits Alice
must send so that Bob can reconstruct the data exactly, given the
latent sample (e.g. with arithmetic coding)
● The sum of the two costs is the total length of the message Alice needs
to send to Bob to allow him to recover x (c.f. variational inference)
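Putting the two costs together gives the usual description-length / negative-ELBO identity (standard VAE result; notation assumed to match the slides):

```latex
L(x) \;=\; \underbrace{\mathrm{KL}\!\left(q_{\theta}(z \mid x)\,\|\,p(z)\right)}_{\text{coding cost}}
\;+\; \underbrace{\mathbb{E}_{q_{\theta}(z \mid x)}\!\left[-\log p(x \mid z)\right]}_{\text{reconstruction cost}}
\;=\; -\,\mathrm{ELBO}(x)
```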
…one must take seriously the idea of working with datasets, rather than datapoints, as
the key objects to model.
– Edwards & Storkey, Towards a Neural Statistician, (2017)
Associative Compression Networks
● ACNs modify the VAE loss by replacing the unconditional prior p(z) with a
conditional prior p(z|z’), where z’ is the latent representation of an
associated data point (one of the K nearest Euclidean neighbours to z)
● p(z|z’) – parameterised by an MLP – models only part of the latent space,
rather than the whole thing, which greatly reduces the coding cost
● Implicit amortisation: the more clustered the codes, the cheaper they are
● Result: rich, informative codes are learned, even with powerful decoders.
Graves et al., Associative Compression Networks for Representation Learning (2018)
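A rough sketch of how the conditional-prior term could be computed, paraphrasing the bullets above rather than reproducing the paper’s code (`prior_net`, a small MLP returning a prior mean and log-variance, is a hypothetical component):

```python
import torch

def acn_prior_kl(mu, logvar, codes, prior_net, k=5):
    """KL between q(z|x) = N(mu, diag(exp(logvar))) and the conditional prior p(z|z'),
    where z' is drawn from the K nearest Euclidean neighbours of this code
    among the stored codes (the code itself would be excluded in practice)."""
    dists = torch.cdist(mu, codes)                       # (batch, n_codes) distances
    knn = dists.topk(k, largest=False).indices           # indices of K nearest codes
    pick = knn[torch.arange(mu.size(0)),
               torch.randint(0, k, (mu.size(0),))]       # one random neighbour each
    z_prime = codes[pick]

    # Conditional prior p(z|z') parameterised by an MLP (assumed interface)
    prior_mu, prior_logvar = prior_net(z_prime)

    # KL between two diagonal Gaussians, per latent dimension
    kl = 0.5 * (prior_logvar - logvar
                + (logvar.exp() + (mu - prior_mu).pow(2)) / prior_logvar.exp()
                - 1)
    return kl.sum(dim=-1)
```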
MDL for ACN
● Alice now wants to transmit the entire dataset to Bob, in any order
(justified for IID data?)
● Bob has the weights of the associative prior, decoder and encoder
● Alice chooses an ordering of the data that minimises the total coding cost
(a travelling-salesman-style problem) and sends the datapoints to Bob one at a time
● After receiving each latent code + error bits, he decodes the datapoint,
then re-encodes it and uses the result to determine the associative
prior for the next code
[Equation annotation: the terms shown in red differ from the standard VAE; the rest is the same]
Unordered: KL from unconditional prior
Ordered: KL from conditional ACN prior
Binary MNIST reconstructions: leftmost column shows test set images
CelebA Reconstructions: leftmost column from test set
‘Daydream’ sampling: encode data, sample latent from conditional prior,
generate new data conditioned on latent, repeat
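The daydream loop as schematic Python (`encoder`, `decoder` and `prior_net` stand in for the trained ACN components; the method names are placeholders):

```python
def daydream(x, encoder, decoder, prior_net, steps=10):
    """Repeat: encode -> sample latent from the conditional prior -> decode."""
    samples = []
    for _ in range(steps):
        z_context = encoder(x)            # encode the current data point
        z = prior_net.sample(z_context)   # sample a latent from p(z | z_context)
        x = decoder.sample(z)             # generate new data conditioned on the latent
        samples.append(x)
    return samples
```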
Mutual Information
● Want codes that ‘describe’ the data as well as possible
● Mathematically, we want to maximise the mutual information
between the code z and the data x
van den Oord et al., Representation Learning with Contrastive Predictive Coding (2018)
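For reference, the quantity being maximised (discrete form):

```latex
I(x; z) \;=\; \sum_{x,\,z} p(x, z) \log \frac{p(x, z)}{p(x)\,p(z)} \;=\; H(x) - H(x \mid z)
```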
Gutmann et al., Noise-Contrastive Estimation (2009)
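CPC is trained with the InfoNCE loss, a noise-contrastive objective: classify the true future encoding among negatives taken from the rest of the batch. A minimal sketch with a bilinear score and in-batch negatives (both details are assumptions about the setup):

```python
import torch
import torch.nn.functional as F

def info_nce(context, future, W):
    """context: (batch, c_dim) summaries c_t; future: (batch, z_dim) encodings z_{t+k};
    W: (c_dim, z_dim) bilinear weights. Each context's positive is the future at
    the same batch index; the other batch entries act as negatives."""
    scores = context @ W @ future.t()            # (batch, batch) scores f(z, c)
    labels = torch.arange(context.size(0))       # positives sit on the diagonal
    return F.cross_entropy(scores, labels)

# Minimising this loss lower-bounds the mutual information between context and
# future:  I(z; c) >= log(N) - loss,  with N the number of candidates per example.
```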
Speech - LibriSpeech
van den Oord et al., Representation Learning with Contrastive Predictive Coding (2018)
Images - ImageNet
M. Jaderberg et al., Reinforcement Learning with Unsupervised Auxiliary Tasks (2016)
Unsupervised RL Baselines
M. Jaderberg et al., Reinforcement Learning with Unsupervised Auxiliary Tasks (2016)
Sparse Rewards? More Cherries!
Auxiliary Losses
[Plot legend: Batched A2C; Aux loss]
Automated Curriculum Learning for Neural Networks, Graves et al. (2017)
Curiouser and Curiouser…
● Complexity Gain: seek out data that maximise the decrease in bits of
everything the agent has ever observed (!). In other words, find (or
create) the thing that makes the most sense of the agent’s life
so far: science, art, music, jokes…
Driven by Compression Progress: A Simple Principle Explains Essential Aspects of Subjective Beauty, Novelty,
Surprise, Interestingness, Attention, Curiosity, Creativity, Art, Science, Music, Jokes, Schmidhuber, 2008
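One schematic way to write the complexity-gain reward, in the spirit of Schmidhuber’s compression-progress principle (my paraphrase, not a formula from the slides): the intrinsic reward at time t is the number of bits saved on the whole observation history h when the compressor parameters are updated from φ_{t-1} to φ_t.

```latex
r^{\text{int}}_{t} \;\propto\; C\!\left(h_{\le t};\, \phi_{t-1}\right) \;-\; C\!\left(h_{\le t};\, \phi_{t}\right)
```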
Empowered Agents
Instead of curiosity, an agent can be motivated by empowerment: attempting to
maximise the mutual information between the agent’s actions and the
consequences of its actions (e.g. the state the actions will lead to). The agent
wants to have as much control as possible over its future.
Klyubin et al., Empowerment: A Universal Agent-Centric Measure of Control (2005)
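Formally (following Klyubin et al.; notation assumed), the empowerment of a state s is the channel capacity from the agent’s actions a to the resulting state s':

```latex
\mathcal{E}(s) \;=\; \max_{p(a)} \; I\!\left(a;\, s' \mid s\right)
```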