Yann LeCun
Facebook AI Research
New York University
http://yann.lecun.com
AI today is mostly supervised learning
Training a machine by showing examples instead of programming it
When the output is wrong, tweak the parameters of the machine
Works well for:
Speech → words
Image → categories
Portrait → name
Photo → caption
Text → topic
…
1957 Perceptron & 1960 Adaline: Analog Computers
Linear threshold units: y = sign(∑_{i=1}^{N} W_i X_i + b)
Perceptron: weights are motorized potentiometers
https://youtu.be/X1G2g3SiCwU
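For intuition, a minimal software sketch of such a linear threshold unit with the classic mistake-driven perceptron update (a modern re-creation for illustration, not the 1957 analog hardware):

```python
# Linear threshold unit: y = sign(sum_i W_i x_i + b),
# trained with the classic perceptron update rule (illustrative sketch).
import numpy as np

def predict(w, b, x):
    return 1 if np.dot(w, x) + b >= 0 else -1

def train_perceptron(X, y, epochs=10, lr=1.0):
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if predict(w, b, xi) != yi:   # on a mistake,
                w += lr * yi * xi         # move the weights toward the example
                b += lr * yi
    return w, b
```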
Deep Learning vs. Traditional Machine Learning
Traditional Machine Learning: a hand-crafted feature extractor followed by a trainable classifier.
Deep Learning: the entire stack (weight matrices and hidden layers) is trainable end to end.
1986-1996 Neural Net Hardware at Bell Labs, Holmdel
1986: 12x12 resistor array
Fixed resistor value
E-beam lithography
1988: 54x54 neural net
Programmable ternary weights
On-chip amplifiers and I/O
1991: Net32k: 256x128 net
Programmable ternary weights
320GOPS, convolver.
1992: ANNA: 64x64 net
ConvNet accelerator: 4GOPS
6-bit weights, 3-bit activations
Supervised Machine Learning = Function Optimization
[Diagram: a function with adjustable parameters; an objective function measures the error between the machine's output and the desired output (example label: "traffic light: -1").]
It's like walking in the mountains in a fog and following the direction of steepest descent to reach the village in the valley.
But each sample gives us a noisy estimate of the direction, so our path is a bit random.
W_i ← W_i − η ∂L(W, X)/∂W_i
Stochastic Gradient Descent (SGD)
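A minimal sketch of the update above, using least-squares linear regression as an illustrative choice of loss (the data and model here are made up):

```python
# SGD: W_i <- W_i - eta * dL(W, X)/dW_i, one random sample at a time.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                    # synthetic inputs
true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ true_w + 0.1 * rng.normal(size=1000)      # noisy targets

w, eta = np.zeros(5), 0.01
for step in range(10000):
    i = rng.integers(len(X))    # one sample: a noisy estimate of the gradient
    err = X[i] @ w - y[i]       # residual on that sample (squared loss)
    w -= eta * err * X[i]       # the slightly random step downhill
```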
Computing Gradients by Back-Propagation
A practical application of the chain rule.
The network is a stack of modules X_i = F_i(X_{i-1}, W_i), with input X = X_0, desired output Y, and cost C(X, Y, Θ) at the top.
Backprop for the state gradients:
dC/dX_{i-1} = dC/dX_i · dX_i/dX_{i-1}
dC/dX_{i-1} = dC/dX_i · ∂F_i(X_{i-1}, W_i)/∂X_{i-1}
Backprop for the weight gradients:
dC/dW_i = dC/dX_i · dX_i/dW_i
dC/dW_i = dC/dX_i · ∂F_i(X_{i-1}, W_i)/∂W_i
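The same recursions written out by hand for a two-layer net (a sketch; the shapes and the tanh non-linearity are illustrative choices):

```python
# Manual chain rule for C = 0.5 * ||W2 tanh(W1 x0) - y||^2.
import numpy as np

def forward_backward(x0, W1, W2, y):
    x1 = np.tanh(W1 @ x0)                 # X1 = F1(X0, W1)
    x2 = W2 @ x1                          # X2 = F2(X1, W2)
    cost = 0.5 * np.sum((x2 - y) ** 2)    # C
    dC_dx2 = x2 - y                                # state gradient at the top
    dC_dW2 = np.outer(dC_dx2, x1)                  # dC/dW2 = dC/dX2 . dF2/dW2
    dC_dx1 = W2.T @ dC_dx2                         # dC/dX1 = dC/dX2 . dF2/dX1
    dC_dW1 = np.outer(dC_dx1 * (1 - x1 ** 2), x0)  # through tanh's Jacobian
    return cost, dC_dW1, dC_dW2
```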
Convolutional Network Architecture [LeCun et al. NIPS 1989]
[Architecture diagram: YUV input (3@36x484) → convolutions (7x6) → 20@30x484 → pooling → 20@30x125 → convolutions (6x5) → 100@25x121 → … Alternating stages of filter bank + non-linearity and pooling.]
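In modern PyTorch notation, a stack in the same spirit (filter sizes and feature counts are illustrative, not those of the 1989 network):

```python
# A toy alternation of filter bank + non-linearity + pooling stages.
import torch
import torch.nn as nn

convnet = nn.Sequential(
    nn.Conv2d(3, 20, kernel_size=7),   # filter bank
    nn.Tanh(),                         # non-linearity
    nn.MaxPool2d(2),                   # pooling
    nn.Conv2d(20, 100, kernel_size=6), # second filter bank
    nn.Tanh(),
    nn.MaxPool2d(2),
)

x = torch.randn(1, 3, 64, 64)          # dummy 3-channel input
print(convnet(x).shape)                # torch.Size([1, 100, 12, 12])
```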
Semantic Segmentation with ConvNet for off-Road Driving
[Hadsell et al., J. of Field Robotics 2009]
[Sermanet et al., J. of Field Robotics 2009]
MobilEye, NVIDIA
Deep Learning Today
Depth inflation
VGG
[Simonyan 2013]
GoogLeNet
[Szegedy 2014]
ResNet
[He et al. 2015]
DenseNet
[Huang et al 2017]
GOPS vs Accuracy on ImageNet vs #Parameters
[Canziani 2016]
ResNet50 and ResNet100 are used routinely in production.
Multilayer Architectures == Compositional Structure of Data
Natural data is compositional => it is efficiently representable hierarchically
Feature visualization of convolutional net trained on ImageNet from [Zeiler & Fergus 2013]
Future: weakly/self-supervised learning on massive datasets
Mask R-CNN running on Caffe2Go
Detectron: open source vision in PyTorch
https://github.com/facebookresearch/maskrcnn-benchmark
DensePose: real-time body pose estimation
[Audio examples: original vs. SING vs. NSynth]
Unsupervised Translation [Lample 2018]
DrQA: open domain question answering
Question answering systems that "know" Wikipedia
Applications of ConvNets
Self-driving cars, visual perception
Medical signal and image analysis
Radiology, dermatology, EEG/seizure prediction….
Bioinformatics/genomics
Speech recognition
Language translation
Image restoration/manipulation/style transfer
Robotics, manipulation
Physics
High-energy physics, astrophysics
New applications appear every day
E.g., environmental protection, …
Applications of Deep Learning
Medical image analysis [Geras 2017] [Esteva 2017]
Self-driving cars [MobilEye]
Accessibility
Face recognition
Language translation
Virtual assistants*
Content understanding for:
Filtering
Selection/ranking
Search
Games [Mnih 2015]
Security, anomaly detection
Diagnosis, prediction
Science!
Deep Learning on Graphs
IPAM workshop:
http://www.ipam.ucla.edu/programs/workshops/new-deep-learning-techniques/
Open Source Projects from FAIR
PyTorch: deep learning framework http://pytorch.org
Many examples and tutorials. Used by many research groups.
FAISS: fast similarity search (C++/CUDA)
ParlAI: training environment for dialog systems (Python)
ELF: distributed reinforcement learning framework
Median performance on 57 Atari games relative to human performance (100% = human).
Most methods require over 50 million frames to match human performance (230 hours of play).
The best method (a combination) takes 18 million frames (83 hours).
Pure RL is hard to use in the real world
Pure RL requires too many trials to learn anything:
it's OK in a game;
it's not OK in the real world.
RL works in simple virtual worlds that you can run faster than real time on many machines in parallel.
You can’t run the real world faster than real time
What are we missing?
To get to “real” AI
1. Reasoning
2. Learning models of the world
3. Learning hierarchical representations of actions
What current deep learning methods enable
What we can have:
Safer cars, autonomous cars
Better medical image analysis
Personalized medicine
Adequate language translation
Useful but stupid chatbots
Information search, retrieval, filtering
Numerous applications in energy, finance, manufacturing, environmental protection, commerce, law, artistic creation, games, …
What we cannot have (yet):
Machines with common sense
Intelligent personal assistants
"Smart" chatbots
Household robots
Agile and dexterous robots
Artificial General Intelligence (AGI)
Differentiable Programming:
Marrying Deep Learning
With Reasoning
Neural nets with dynamic, data-dependent structure: a program whose gradient is generated automatically.
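A toy sketch of what "dynamic, data-dependent structure" means in practice (the network and the depth rule are invented for illustration):

```python
import torch

lin = torch.nn.Linear(8, 8)

def dynamic_net(x):
    n_steps = int(x.abs().sum().item()) % 4 + 1   # depth chosen by the data
    for _ in range(n_steps):                      # structure differs per input
        x = torch.relu(lin(x))
    return x.sum()

x = torch.randn(8, requires_grad=True)
dynamic_net(x).backward()   # the gradient of the program is generated automatically
print(x.grad)
```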
Augmenting Neural Nets with a Memory Module
Recurrent networks cannot remember things for very long
The cortex only remembers things for 20 seconds
We need a “hippocampus” (a separate memory module)
LSTM [Hochreiter 1997], registers
Memory networks [Weston et al. 2014] (FAIR), associative memory
Stacked-Augmented Recurrent Neural Net [Joulin & Mikolov 2014] (FAIR)
Neural Turing Machine [Graves 2014],
Differentiable Neural Computer [Graves 2016]
Software 2.0:
The operations in a program are only partially specified: they are trainable parameterized modules.
The precise operations are learned from data; only the general structure of the program is designed.
Dynamic computational graph
Automatic differentiation by recording a "tape" of operations and rolling it backwards with the Jacobian of each operator.
Implemented in PyTorch 1.0, Chainer, …
Easy if the front-end language is dynamic and interpreted (e.g., Python)
Not so easy if we want to run without a Python runtime...
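A tiny demonstration of the tape in PyTorch (a sketch; any differentiable operations would do):

```python
import torch

x = torch.randn(3, requires_grad=True)
z = (x.sin() ** 2).sum()   # each operation is recorded on the tape
z.backward()               # the tape is rolled backwards, Jacobian by Jacobian
print(x.grad)              # equals 2 * sin(x) * cos(x)
```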
How do Humans and Animals Learn?
So quickly
Babies learn how the world works by observation
Largely by observation, with remarkably little interaction.
Photos courtesy of Emmanuel Dupoux
Early Conceptual Acquisition in Infants [from Emmanuel Dupoux]
[Chart: timeline of concept acquisition from 0 to 14 months. Perception develops first, followed by proto-imitation and emotional contagion; then crawling and walking; social-communicative skills such as pointing; and later concepts such as helping vs. hindering and false beliefs.]
Prediction is the essence of Intelligence
We learn models of the world by predicting
The Future:
Self-Supervised Learning
With massive amounts of data and very large networks
Self-Supervised Learning
Word2vec
[Mikolov 2013]
FastText
[Joulin 2016]
BERT: Bidirectional Encoder Representations from Transformers [Devlin 2018]
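A miniature of the BERT-style self-supervised objective: hide a token and train the network to recover it. Everything here (vocabulary, sizes, masking scheme) is an illustrative assumption, not the actual BERT recipe:

```python
import torch
import torch.nn as nn

vocab, dim, B, T = 1000, 64, 8, 16
embed = nn.Embedding(vocab, dim)
layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
readout = nn.Linear(dim, vocab)
MASK = 0                                          # reserved mask token id

tokens = torch.randint(1, vocab, (B, T))          # stand-in "sentences"
inp = tokens.clone()
pos = torch.randint(0, T, (B,))
inp[torch.arange(B), pos] = MASK                  # hide one token per sentence
logits = readout(encoder(embed(inp)))             # predict every position...
loss = nn.functional.cross_entropy(
    logits[torch.arange(B), pos],                 # ...but train on the masked ones
    tokens[torch.arange(B), pos])
loss.backward()
```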
Video prediction: multiple futures are possible.
Training a system to make a single prediction results in "blurry" outputs: the average of all the possible futures.
The Next AI Revolution
THE REVOLUTION
WILL NOT BE SUPERVISED
(nor purely reinforced)
With thanks to Alyosha Efros
Could Self-Supervised Learning Lead to Common Sense?
[Diagram: world-model agent architecture. The world sends percepts to the agent and receives its actions. Inside the agent, a world simulator predicts percepts and infers the world state; an actor turns the agent state into action proposals; a critic maps the predicted state to a predicted cost against an objective cost.]
Training the Actor with Optimized Action Sequences
[Diagram: perception feeds an initial state; the world simulator and actor are unrolled over several time steps, the actor emitting an action at each step.]
Forward model: [s(t+1), r(t+1)] = f(s(t), a(t))
s(t): state
a(t): action
r(t): reward/cost
Observe a sequence of states, actions, and rewards (or costs), and train the forward model on it.
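A sketch of that training loop: fit f by regression on logged (state, action, next state, reward) tuples. The dimensions and the two-layer net are assumptions for illustration:

```python
import torch
import torch.nn as nn

s_dim, a_dim = 16, 4
f = nn.Sequential(nn.Linear(s_dim + a_dim, 64), nn.ReLU(),
                  nn.Linear(64, s_dim + 1))       # outputs [s(t+1), r(t+1)]
opt = torch.optim.Adam(f.parameters(), lr=1e-3)

def train_step(s, a, s_next, r):                  # r has shape (batch, 1)
    pred = f(torch.cat([s, a], dim=-1))           # f(s(t), a(t))
    target = torch.cat([s_next, r], dim=-1)
    loss = nn.functional.mse_loss(pred, target)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```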
Learning Physics (PhysNet)
[Lerer, Gross, Fergus ICML 2016, arxiv:1603.01312]
ConvNet produces object masks that predict the trajectories of falling blocks. Blurry predictions when uncertain.
The Hard Part: Prediction Under Uncertainty
Invariant prediction: The training samples are merely representatives of a
whole set of possible outputs (e.g. a manifold of outputs).
[Diagram: the predictor sees the percepts but not the full hidden state of the world.]
Learning the “Data Manifold”: Energy-Based Approach
Energy function: takes low values on the data manifold, higher values everywhere else.
Push down on the energy of desired outputs. Push up on everything else.
But how do we choose where to push up?
Implausible futures (high energy); plausible futures (low energy).
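One common way to implement push-down/push-up is a contrastive hinge loss. A minimal sketch, where the energy net and the crude choice of where to push up (random samples) are illustrative assumptions:

```python
import torch
import torch.nn as nn

energy = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))  # F(y)

def contrastive_step(y_data, margin=1.0):
    y_neg = torch.randn_like(y_data)      # "everywhere else" (a crude sampler)
    e_pos = energy(y_data).mean()         # push down on the data manifold
    e_neg = energy(y_neg).mean()
    return e_pos + torch.relu(margin - e_neg)   # push up, until above the margin
```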
Adversarial Training
Adversarial Training: the key to prediction under uncertainty?
Generative Adversarial Networks (GAN) [Goodfellow et al. NIPS 2014]
Energy-Based GAN [Zhao, Mathieu, LeCun ICLR 2017 & arXiv:1609.03126]
[Diagram: from the dataset, a past X and the actual future Y go into the discriminator, which is trained to minimize F(X, Y) on such real pairs. A generator G(X, Z) maps the past X and a noise vector Z to a predicted future Y; on these generated pairs the discriminator is trained to maximize F(X, Y).]
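A minimal training loop matching the diagram, in the energy-based convention (module shapes and the margin are illustrative assumptions):

```python
import torch
import torch.nn as nn

x_dim, y_dim, z_dim = 8, 8, 4
D = nn.Sequential(nn.Linear(x_dim + y_dim, 64), nn.ReLU(), nn.Linear(64, 1))
G = nn.Sequential(nn.Linear(x_dim + z_dim, 64), nn.ReLU(), nn.Linear(64, y_dim))
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)

def step(x, y_real, margin=1.0):
    z = torch.randn(x.size(0), z_dim)
    y_fake = G(torch.cat([x, z], dim=-1))         # predicted future
    # Discriminator: minimize F on actual futures, push it up on fakes.
    loss_d = D(torch.cat([x, y_real], dim=-1)).mean() \
           + torch.relu(margin - D(torch.cat([x, y_fake.detach()], dim=-1))).mean()
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # Generator: produce futures the discriminator assigns low energy.
    loss_g = D(torch.cat([x, y_fake], dim=-1)).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```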
DCGAN: “reverse” ConvNet maps random vectors to images
DCGAN: adversarial training to generate images.
[Radford, Metz, Chintala 2015]
Input: random numbers; output: bedrooms.
Faces “invented” by a neural net (from NVIDIA)
[Sbai 2017]
Video Prediction with
Adversarial Training
[Mathieu, Couprie, LeCun ICLR 2016]
arXiv:1511.05440
Multi-Scale ConvNet for Video Prediction
4 to 8 frames input → ConvNet → 1 to 8 frames output
Multi-scale ConvNet, without pooling
If trained with least squares: blurry output
The latent variable is predicted from the target.
The latent variable is set to zero half the time during training (dropout) and corrupted with noise.
The model predicts as much as it can without the latent variable; the latent variable corrects the residual error.
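A sketch of that recipe (the encoder, predictor, sizes, and noise level are invented for illustration):

```python
import torch
import torch.nn as nn

f_pred = nn.Linear(32 + 8, 32)    # predictor: (input, latent) -> prediction
f_enc = nn.Linear(32, 8)          # infers the latent from the target

def predict(x, y_target):
    z = f_enc(y_target)                           # latent predicted from the target
    keep = (torch.rand(z.size(0), 1) > 0.5).float()
    z = z * keep                                  # zeroed half the time (dropout)
    z = z + 0.1 * torch.randn_like(z)             # corrupted with noise
    return f_pred(torch.cat([x, z], dim=-1))      # z only carries the residual
```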
Application to Autonomous Driving
Overhead camera on a highway; vehicles are tracked.
A "state" is a pixel representation of a rectangular window centered around each car.
The forward model is trained to predict how every car moves relative to the central car.
Steering and acceleration are computed.
Forward Model Architecture
[Diagram: forward-model architecture with an expander; MPER: expert regularization.]
Driving an Invisible Car in “Real” Traffic
Promising Areas for Research
Deep Learning on new domains (beyond multi-dimensional arrays)
Graphs, structured data...
Marrying deep learning and (logical) reasoning
Replacing symbols by vectors and logic by algebra
Self-supervised learning of world models
Dealing with uncertainty in high-dimensional continuous spaces
Learning hierarchical representations of control space
Instantiating complex/abstract action plans into simpler ones
Theory!
Compilers for differentiable programming.
Technology drives & motivates Science (and vice versa)
Science drives technology, but technology also drives science
Sciences are born from the study of technological artifacts
Telescope → optics
Steam engine → thermodynamics
Airplane → aerodynamics
Calculators → computer science
Telecommunication → information theory
What is the equivalent of thermodynamics for intelligence?
Are there underlying principles behind artificial and natural intelligence?
Are there simple principles behind learning?
Or is the brain a large collection of “hacks” produced by evolution?
Thank you