Most slides have been adapted from Fei-Fei Li's lectures (CS231n, Stanford, 2017), and some from Geoffrey Hinton's "Neural Networks for Machine Learning" course (2015).
Reasons to study neural computation
• Neuroscience: To understand how the brain actually works.
– It's very big and very complicated, and made of stuff that dies when you poke
it around. So we need to use computer simulations.
• Spike generation:
– There is an axon hillock that generates outgoing spikes whenever enough charge
has flowed in at synapses to depolarize the cell membrane.
A mathematical model for biological neurons
[Figure: a neuron with inputs x_1, x_2, x_3 and synaptic weights w_1, w_2, w_3]
How the brain works
• Each neuron receives inputs from other neurons
• The effect of each input line on the neuron is controlled by a synaptic weight
• The synaptic weights adapt so that the whole network learns to perform useful
computations
– Recognizing objects, understanding language, making plans, controlling the body.
• You have about 10^11 neurons, each with about 10^4 weights.
– A huge number of weights can affect the computation in a very short time: much better
bandwidth than a workstation.
Be very careful with your brain analogies!
• Biological Neurons:
– Many different types
– Dendrites can perform complex non-linear computations
– Synapses are not a single weight but a complex non-linear dynamical system
– Rate code may not be adequate
[Dendritic Computation. London and Häusser]
Binary threshold neurons
• McCulloch-Pitts (1943): influenced Von Neumann.
– First compute a weighted sum of the inputs.
– Send out a spike of activity if the weighted sum exceeds a threshold.
– McCulloch and Pitts thought that each spike is like the truth value of a
proposition and each neuron combines truth values to compute the truth
value of another proposition!
[Figure: a unit computing f(∑_i w_i x_i) — inputs input_1, …, input_d enter through weights w_1, …, w_d; f is the activation function]
McCulloch-Pitts neuron: binary threshold
• Neuron, unit, or processing element: binary McCulloch-Pitts neuron
– Inputs x_1, …, x_d enter through weights w_1, …, w_d; let z = ∑_i w_i x_i
– 𝜃: activation threshold
  y = 1 if z ≥ 𝜃;  y = 0 if z < 𝜃
• Equivalent to a unit with an extra input fixed at 1 and a bias weight b = −𝜃:
  y = 1 if ∑_i w_i x_i + b ≥ 0;  y = 0 otherwise
AND & OR networks
• For -1 and 1 inputs:
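The slides likely show the weight settings pictorially; here is a minimal sketch of a binary threshold unit realizing AND and OR over inputs in {−1, +1}. The particular weights and thresholds below are one choice that works, not values taken from the slides:

```python
import numpy as np

def threshold_neuron(x, w, theta):
    # McCulloch-Pitts unit: fire iff the weighted sum reaches the threshold
    return 1 if np.dot(w, x) >= theta else 0

w = np.array([1.0, 1.0])
# AND over {-1, +1}: only (+1, +1) gives a sum of 2, so theta = 2 works
and_out = [threshold_neuron(np.array(p), w, 2.0)
           for p in [(-1, -1), (-1, 1), (1, -1), (1, 1)]]
# OR over {-1, +1}: any +1 input makes the sum >= 0, so theta = 0 works
or_out = [threshold_neuron(np.array(p), w, 0.0)
          for p in [(-1, -1), (-1, 1), (1, -1), (1, 1)]]
```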
Sigmoid neurons
• These give a real-valued output that is a smooth and bounded
function of their total input.
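A sketch of the logistic sigmoid, σ(z) = 1/(1 + e^(−z)), the standard choice of such a smooth, bounded function:

```python
import numpy as np

def sigmoid(z):
    # smooth and bounded: output lies in (0, 1) for any real input;
    # its derivative has the convenient form sigmoid(z) * (1 - sigmoid(z))
    return 1.0 / (1.0 + np.exp(-z))
```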
• (The perceptron learning procedure, by contrast, is guaranteed to find a set of weights
that gets the right answer for all the training cases, if any such set exists.)
Adjusting weights
How to learn the weights: multi class example
• If correct: no change
• If wrong:
– lower score of the wrong answer (by removing the input from the weight
vector of the wrong answer)
– raise score of the target (by adding the input to the weight vector of the
target class)
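The update rule above can be sketched as follows. This is a minimal version, assuming one weight vector (row) per class and that the predicted class is the argmax of the class scores:

```python
import numpy as np

def perceptron_update(W, x, target):
    # W: (K, d) matrix with one weight vector (template) per class
    pred = int(np.argmax(W @ x))
    if pred != target:        # if correct: no change
        W[pred] -= x          # lower the score of the wrong answer
        W[target] += x        # raise the score of the target class
    return W
```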
Single layer networks as template matching
• Weights for each class as a template (or sometimes also called a
prototype) for that class.
– The winner is the most similar template.
• The ways in which hand-written digits vary are much too complicated
to be captured by simple template matches of whole shapes.
MLP with Different Number of Layers
MLP with unit step activation function
Beyond linear models
MLP with single hidden layer
• Two-layer MLP (the number of layers of adaptive weights is counted)

o_k(x) = ψ(∑_{j=0}^{M} w_{jk}^{[2]} z_j)  ⇒  o_k(x) = ψ(∑_{j=0}^{M} w_{jk}^{[2]} φ(∑_{i=0}^{d} w_{ij}^{[1]} x_i))

where z_j = φ(∑_{i=0}^{d} w_{ij}^{[1]} x_i), with bias units x_0 = 1 and z_0 = 1.

[Figure: network diagram — inputs x_0 = 1, x_1, …, x_d (i = 0, …, d); hidden units z_1, …, z_M with activation φ (j = 1, …, M); outputs o_1, …, o_K with activation ψ (k = 1, …, K); first-layer weights w_{ij}^{[1]}, second-layer weights w_{jk}^{[2]}]
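A forward pass for this two-layer MLP can be sketched as follows (assuming tanh for φ and the identity for ψ; the bias units x_0 = z_0 = 1 are appended explicitly):

```python
import numpy as np

def mlp_forward(x, W1, W2, phi=np.tanh, psi=lambda s: s):
    # x: (d,) input; W1: (M, d+1); W2: (K, M+1)
    x = np.concatenate(([1.0], x))   # bias unit x0 = 1
    z = phi(W1 @ x)                  # hidden activations z_1 .. z_M
    z = np.concatenate(([1.0], z))   # bias unit z0 = 1
    return psi(W2 @ z)               # outputs o_1 .. o_K

rng = np.random.default_rng(0)
o = mlp_forward(rng.standard_normal(4),          # d = 4 inputs
                rng.standard_normal((5, 5)),     # M = 5 hidden units
                rng.standard_normal((3, 6)))     # K = 3 outputs
```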
learns to extract features
• MLP with one hidden layer is a generalized linear model:
– o_k(x) = ψ(∑_{j=1}^{M} w_{jk}^{[2]} f_j(x))
– f_j(x) = φ(∑_{i=0}^{d} w_{ij}^{[1]} x_i)
– The form of the nonlinearity (the basis functions f_j) is adapted from the training data
(not fixed in advance)
• f_j is defined by parameters which can also be adapted during training
Deep networks
• It is an empirical observation that deeper networks (with multiple hidden layers) can
work better than single-hidden-layer networks
– despite the fact that their representational power is equal.
Computational graphs
Backpropagation: a simple example
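The simple example worked through in the CS231n slides is likely along these lines: f(x, y, z) = (x + y)·z, evaluated at x = −2, y = 5, z = −4, with gradients computed by walking the computational graph backward (the particular values are assumed here):

```python
# forward pass through the computational graph
x, y, z = -2.0, 5.0, -4.0
q = x + y            # intermediate node of the add gate
f = q * z            # output of the multiply gate

# backward pass (chain rule, starting from df/df = 1)
df_dq = z            # local gradient of the multiply gate w.r.t. q
df_dz = q            # local gradient of the multiply gate w.r.t. z
df_dx = df_dq * 1.0  # the add gate routes the gradient equally to both inputs
df_dy = df_dq * 1.0
```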
How to propagate the gradients backward
Another example
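The second worked example in CS231n is, I believe, a sigmoid applied to a weighted sum, f(w, x) = 1/(1 + e^(−(w0·x0 + w1·x1 + w2))); a sketch with the input values assumed from that lecture:

```python
import numpy as np

# forward pass
w0, x0, w1, x1, w2 = 2.0, -1.0, -3.0, -2.0, -3.0
s = w0 * x0 + w1 * x1 + w2       # weighted sum (a single "dot" node)
f = 1.0 / (1.0 + np.exp(-s))     # sigmoid gate

# backward pass: the sigmoid's local gradient simplifies to f * (1 - f),
# so the whole gate can be differentiated in one step
ds = f * (1.0 - f)
dw0, dx0 = x0 * ds, w0 * ds      # multiply gates swap their inputs
dw1, dx1 = x1 * ds, w1 * ds
dw2 = 1.0 * ds                   # bias input is constant 1
```

Grouping the whole sigmoid into one gate, rather than chaining exp, add, and divide nodes, is the point of the example: any differentiable subexpression can be treated as a single node with a known local gradient.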
[Figure: forward computation graph — x is multiplied by W^{[1]}, passed through f to give z^{[1]}, multiplied by W^{[2]}, and so on; the local gradient contributed at each layer is f′(z^{[L]}) × a^{[L−1]}]

For convenience, we use the same activation function for all layers.
However, output-layer neurons most commonly do not need an activation
function (they output class scores or real-valued targets).
Backpropagation: Notation
• 𝒂[0] ← 𝐼𝑛𝑝𝑢𝑡
• 𝑜𝑢𝑡𝑝𝑢𝑡 ← 𝒂[𝐿]
[Figure: a stack of L layers, each applying f(·) to its weighted input]
Backpropagation: Last layer gradient
∂E_n/∂W_{ij}^{[L]} = (∂E_n/∂a_j^{[L]}) × (∂a_j^{[L]}/∂W_{ij}^{[L]})

where a_j^{[l]} = f(z_j^{[l]}) and z_j^{[l]} = ∑_{i=0}^{M} w_{ij}^{[l]} a_i^{[l−1]}.

For squared error loss:
E_n = (a^{[L]} − y^{(n)})²  ⇒  ∂E_n/∂a^{[L]} = 2 (a^{[L]} − y^{(n)})

∂a_j^{[L]}/∂W_{ij}^{[L]} = f′(z_j^{[L]}) × ∂z_j^{[L]}/∂W_{ij}^{[L]} = f′(z_j^{[L]}) × a_i^{[L−1]}
Backpropagation:
∂E_n/∂w_{ij}^{[l]} = (∂E_n/∂a_j^{[l]}) × (∂a_j^{[l]}/∂w_{ij}^{[l]}) = δ_j^{[l]} × f′(z_j^{[l]}) × a_i^{[l−1]}

where a_j^{[l]} = f(z_j^{[l]}) and z_j^{[l]} = ∑_{i=0}^{M} w_{ij}^{[l]} a_i^{[l−1]}.

δ_j^{[l]} = ∂E_n/∂a_j^{[l]} is the sensitivity of the output to a_j^{[l]}
δ_i^{[l−1]} from δ_j^{[l]}

We will compute 𝜹^{[l−1]} from 𝜹^{[l]}:

δ_i^{[l−1]} = ∂E_n/∂a_i^{[l−1]}
  = ∑_{j=1}^{d^{[l]}} (∂E_n/∂a_j^{[l]}) × (∂a_j^{[l]}/∂z_j^{[l]}) × (∂z_j^{[l]}/∂a_i^{[l−1]})
  = ∑_{j=1}^{d^{[l]}} (∂E_n/∂a_j^{[l]}) × f′(z_j^{[l]}) × w_{ij}^{[l]}
  = ∑_{j=1}^{d^{[l]}} δ_j^{[l]} × f′(z_j^{[l]}) × w_{ij}^{[l]}

[Figure: unit i in layer l−1 feeds every unit j = 1, …, d^{[l]} of layer l through weight w_{ij}^{[l]}; each path contributes δ_j^{[l]} f′(z_j^{[l]}) w_{ij}^{[l]}]
Find and save 𝜹^{[L]}
Backpropagation of Errors
E_n = ∑_j (a_j^{[L]} − y_j^{(n)})²

Output-layer deltas δ_1^{[2]}, δ_2^{[2]}, …, δ_K^{[2]}:
δ_j^{[2]} = 2 (a_j^{[2]} − y_j^{(n)}) × f′(z_j^{[2]})

Hidden-layer deltas:
δ_i^{[1]} = f′(z_i^{[1]}) × ∑_{j=1}^{d^{[2]}} δ_j^{[2]} × w_{ij}^{[2]}

(Note: on this slide δ absorbs the local derivative f′, i.e. it is the sensitivity with respect to z_j^{[l]} rather than a_j^{[l]}.)
Gradients for vectorized code
Vectorized operations
• Always check: the gradient with respect to a variable should have the same shape
as the variable.
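For example, for a matrix multiply Y = XW the backward pass produces gradients whose shapes match their variables; a sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
W = rng.standard_normal((3, 4))
Y = X @ W                           # shape (5, 4)

dY = rng.standard_normal(Y.shape)   # upstream gradient, same shape as Y
dX = dY @ W.T                       # shape (5, 3) — matches X
dW = X.T @ dY                       # shape (3, 4) — matches W
```

The shape check alone often pins down which of `dY @ W.T` or `W @ dY.T` is the correct expression.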
SVM example
Summary
• Neural nets may be very large: it is impractical to write down the gradient formula
by hand for all parameters