Predictive Learning
Yann LeCun
Facebook AI Research
Center for Data Science, NYU
Courant Institute of Mathematical Sciences, NYU
http://yann.lecun.com
1st Connectionist Summer School, CMU, July 1986
Pictured: Bartlett Mel, Yann LeCun, Stan Dehaene, Mike Mozer, Richard Durbin, Bart Selman, Gerry Tesauro, Andy Barto, Jordan Pollack, Dana Ballard, Mike Jordan, Jim Hendler, Dave Touretzky
Supervised Learning
[Figure: example images with their labels: PLANE, CAR, CAR]
Deep Learning = The Entire Machine is Trainable
Feature Extractor → Trainable Classifier
It's deep if it has more than one stage of non-linear feature transformation
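As a concrete illustration of this definition (a minimal sketch, not from the talk; all layer sizes are arbitrary choices), here is a PyTorch model with two trainable stages of non-linear feature transformation followed by a trainable classifier:

```python
import torch.nn as nn

# Two trainable non-linear feature stages + a trainable classifier head.
# By the definition above this counts as "deep"; a single-stage model
# (e.g., a linear classifier on fixed features) would not.
model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),  # stage 1 of non-linear features
    nn.Linear(256, 64), nn.ReLU(),   # stage 2 of non-linear features
    nn.Linear(64, 10),               # classifier
)
```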
Feature visualization of convolutional net trained on ImageNet from [Zeiler & Fergus 2013]
Very Deep ConvNet Architectures
VGG
GoogLeNet
ResNet
ConvNet for Driving
[Farabet et al., ICML 2011] [Farabet et al., PAMI 2013]
[Lebret, Pinheiro, Collobert 2015] [Kulkarni 11] [Mitchell 12] [Vinyals 14] [Mao 14] [Karpathy 14] [Donahue 14] ...
Driving Cars with Convolutional Nets
Mobileye
NVIDIA
DeepMask: ConvNet Locates and Recognizes Objects
https://arxiv.org/abs/1604.02135
Zagoruyko, Lerer, Lin, Pinheiro, Gross, Chintala, Dollár
https://github.com/facebookresearch/deepmask
Image Recognition
Results
Obstacles to Progress in AI
Predicting any part of the past, present, or future percepts from whatever information is available.
The number of samples required to train a large learning machine (for any task) depends on the amount of information that we ask it to predict. The more you ask of the machine, the larger it can be.
"The brain has about 10^14 synapses and we only live for about 10^9 seconds. So we have a lot more parameters than data. This motivates the idea that we must do a lot of unsupervised learning, since the perceptual input (including proprioception) is the only place we can get 10^5 dimensions of constraint per second."
Geoffrey Hinton (in his 2014 AMA on Reddit)
(but he has been saying that since the late 1970s)
(Yes, I know, this picture is slightly offensive to RL folks. But I'll make it up)
VizDoom Champion: Actor-Critic RL system from FAIR
The Architecture of an Intelligent System
AI System: Learning Agent + Immutable Objective
[Diagram: the World sends percepts to the Agent; the Agent sends actions/outputs to the World; an immutable Objective module reads the Agent's state and produces a cost.]
[Diagram: inside the Agent, a World Simulator turns percepts into predicted percepts and an inferred world state, an Actor produces action proposals, and a Critic predicts the cost that the Objective computes.]
What we need is Model-Based Reinforcement Learning
[Diagram: the Agent unrolls its World Simulator over several time steps, seeded by a Perception module.]
Learning Predictive Forward Models of the World
Learning Physics (PhysNet)
Inferring the State of the World from Text: Entity RNN
Augmenting Neural Nets with a Memory Module
Recurrent net + memory
Differentiable Memory
Keys K_i, Values V_i, Input (Address) X
Coefficients (softmax over key-input dot products): C_i = exp(K_i^T X) / sum_j exp(K_j^T X)
Output: Y = sum_i C_i V_i
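A minimal NumPy sketch of this read operation, under assumed shapes (n memory slots, key dimension d, value dimension m); the function name and shapes are mine, not from the slides:

```python
import numpy as np

def memory_read(K, V, x):
    """Differentiable memory read: Y = sum_i C_i V_i,
    with C the softmax of the key-address dot products K_i^T x.
    K: (n, d) keys, V: (n, m) values, x: (d,) input/address."""
    scores = K @ x            # dot products K_i^T x
    scores -= scores.max()    # stabilize the softmax numerically
    c = np.exp(scores)
    c /= c.sum()              # coefficients C_i
    return c @ V              # weighted sum of the values
```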
Memory/Stack-Augmented Recurrent Nets
Results on Question Answering Tasks
End-to-End Memory Network on bAbI tasks [Weston 2015]
Entity Recurrent Neural Net
Unsupervised Learning
Energy-Based Unsupervised Learning
Capturing Dependencies Between Variables with an Energy Function
The energy surface is a contrast function that takes low values on the data manifold, and higher values everywhere else.
Special case: energy = negative log density
Example: the samples live on the manifold Y2 = (Y1)^2
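One concrete energy with this property (my illustrative choice, not from the slides) is the squared residual of the parabola equation: it is zero exactly on the manifold and positive everywhere else.

```python
def energy(y1, y2):
    # Low (zero) on the manifold Y2 = (Y1)^2, higher everywhere else.
    return (y2 - y1 ** 2) ** 2

print(energy(2.0, 4.0))  # 0.0 -> on the manifold (low energy)
print(energy(2.0, 1.0))  # 9.0 -> off the manifold (high energy)
```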
[Figure: plausible futures get low energy; implausible futures get high energy.]
Learning the Energy Function
1. build the machine so that the volume of low-energy stuff is constant
PCA, K-means, GMM, square ICA
2. push down on the energy of data points, push up everywhere else
Max likelihood (needs tractable partition function)
3. push down on the energy of data points, push up on chosen locations
Contrastive divergence, Ratio Matching, Noise Contrastive Estimation, Minimum Probability Flow
4. minimize the gradient and maximize the curvature around data points
Score matching
5. train a dynamical system so that the dynamics go toward the manifold
Denoising auto-encoder (see the sketch after this list)
6. use a regularizer that limits the volume of space that has low energy
Sparse coding, sparse auto-encoder, PSD
7. if E(Y) = ||Y - G(Y)||^2, make G(Y) as "constant" as possible
Contracting auto-encoder, saturating auto-encoder
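Below is a minimal PyTorch sketch of strategy #5, a denoising auto-encoder: corrupt a data point, train the net to recover the clean point, so the reconstruction dynamics flow back toward the data manifold. Layer sizes, noise level, and optimizer settings are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Denoising auto-encoder: reconstruction error acts as the energy.
enc = nn.Sequential(nn.Linear(784, 128), nn.ReLU())
dec = nn.Linear(128, 784)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

def train_step(y):                            # y: (batch, 784) clean data
    y_noisy = y + 0.3 * torch.randn_like(y)   # corrupt the input
    y_hat = dec(enc(y_noisy))                 # reconstruct the clean input
    loss = ((y_hat - y) ** 2).mean()          # reconstruction error = energy
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```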
#1: constant volume of low energy
Energy surface for PCA and K-means
1. build the machine so that the volume of low energy stuff is constant
PCA, K-means, GMM, square ICA...
Z constrained to 1-of-K code (K-Means)
PCA: E(Y) = ||W^T W Y - Y||^2
K-Means: E(Y) = min_z ||Y - sum_i W_i z_i||^2
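A small NumPy sketch of the two energies above, assuming W stores principal directions (PCA, orthonormal rows) or prototypes (K-Means) as rows; the function names are mine:

```python
import numpy as np

def pca_energy(W, y):
    # E(Y) = ||W^T W y - y||^2 : squared distance of y from the
    # principal subspace spanned by the (orthonormal) rows of W.
    return np.sum((W.T @ (W @ y) - y) ** 2)

def kmeans_energy(W, y):
    # E(Y) = min_z ||y - sum_i W_i z_i||^2 with z a 1-of-K code,
    # i.e., squared distance to the nearest prototype row of W.
    return np.min(np.sum((W - y) ** 2, axis=1))
```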
#6: use a regularizer that limits the volume of space that has low energy
Adversarial Training
The Hard Part: Prediction Under Uncertainty
[Diagram: predicting from percepts and the hidden state of the world.]
Energy-Based Unsupervised Learning
Energy Function: Takes low value on data manifold, higher values everywhere else
Push down on the energy of desired outputs. Push up on everything else.
But how do we choose where to push up?
Adversarial Training: A Trainable Objective Function
[Diagram: a discriminator F(X,Y) is trained to take low values on pairs (X,Y) drawn from the dataset and higher values on pairs (X, T(X)) produced by a trainable generator T, which in turn is trained to make F take low values on its outputs; see the sketch below.]
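The following PyTorch sketch shows the standard probabilistic form of this two-player game (an illustration, not the talk's energy-based formulation); architectures, sizes, and learning rates are assumptions, and F/T are named D/G:

```python
import torch
import torch.nn as nn

# Adversarial training loop: D is the discriminator, G the generator.
D = nn.Sequential(nn.Linear(784, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1))
G = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 784))
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real):                     # real: (batch, 784) data samples
    batch = real.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)
    fake = G(torch.randn(batch, 64))
    # Discriminator step: score data as real, generated samples as fake.
    loss_d = bce(D(real), ones) + bce(D(fake.detach()), zeros)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # Generator step: fool the discriminator into scoring samples as real.
    loss_g = bce(D(fake), ones)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```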
DCGAN: reverse ConvNet maps random vectors to images
DCGAN: adversarial training to generate images. Trained on Manga characters. Interpolates between characters.
Face Algebra (in DCGAN space)
Loss functions
EBGAN solutions are Nash Equilibria
Loss functions for Discriminator and Generator
Nash Equilibrium
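For reference, a sketch of the two losses as defined in the EBGAN paper [Zhao, Mathieu, LeCun 2016], quoted from memory so treat as a sketch: D outputs an energy, m > 0 is a margin, and [.]^+ denotes max(0, .):

```latex
\mathcal{L}_D(x, z) = D(x) + \big[\, m - D(G(z)) \,\big]^{+}
\qquad
\mathcal{L}_G(z) = D\big(G(z)\big)
```

A (D, G) pair at which neither player can unilaterally reduce its own loss is a Nash equilibrium of this two-player game, which is the claim in the slide title above.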
EBGAN in which D is a Ladder Network
Ladder Network: auto-encoder with skip connections [Rasmus et al. 2015]
Permutation-invariant MNIST (fully connected nets)
[Diagram: mirrored fully connected encoder/decoder with layer sizes 784-1000-500-250-250-250-10, with L2 reconstruction costs at the 784 and 10 layers.]
Energy-Based GAN trained on ImageNet at 128x128 pixels
Energy-Based GAN trained on ImageNet at 256x256 pixels
Trained on dogs
Video Prediction
(with adversarial training)
[Mathieu, Couprie, LeCun ICLR 2016]
arXiv:1511.05440
Multi-Scale ConvNet for Video Prediction
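As a sketch of the training objective in Mathieu et al. (from memory; the weight names are mine), the multi-scale generator minimizes a weighted sum of a reconstruction loss, a gradient difference loss (GDL) that fights blur, and an adversarial loss:

```latex
\mathcal{L}(X, Y) =
\lambda_{\ell_p} \, \mathcal{L}_{\ell_p}(X, Y)
+ \lambda_{gdl} \, \mathcal{L}_{gdl}(X, Y)
+ \lambda_{adv} \, \mathcal{L}_{adv}(X, Y)
```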