
ECE 18-898G: Special Topics in Signal Processing:

Sparsity, Structure, and Inference


Neural Networks: A brief touch

Yuejie Chi
Department of Electrical and Computer Engineering

Spring 2018

1/41
Outline

• introduction to deep learning

• perceptron model (a single neuron)

• 1-hidden-layer (2-layer) neural network

2/41
The rise of deep neural networks

ImageNet Large Scale Visual Recognition Challenge (ILSVRC): led by Prof. Fei-Fei Li (Stanford).

Total number of non-empty synsets (categories): 21,841;
Total number of images: 14,197,122.

3/41
The rise of deep neural networks

The deeper, the better?

4/41
AlexNet

• Won the 2012 ILSVRC competition by a large margin: top-1 and top-5 error rates of 37.5% and 17.0%.

5/41
AlexNet

• 60 million parameters and 650,000 neurons.


• takes 5-6 days to train on two GTX 580 3GB GPUs.

[Figure: AlexNet architecture with 5 convolutional layers, followed by 3 fully connected layers and a softmax output.]

6/41
Faster training with ReLU

• Rectified linear units (ReLU):

y = max(0, x)

• compared to tanh and sigmoid, training is much faster.

[Figure: the ReLU, tanh, and sigmoid activation functions.]

ReLU doesn't saturate for positive inputs.

7/41
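To make the saturation point concrete, here is a minimal NumPy sketch (not from the slides; function names are ours) comparing the three activations through their derivatives at increasingly large inputs:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu_grad(z):
    return (z > 0).astype(float)        # stays 1 for positive inputs

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)                # shrinks toward 0 as |z| grows

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2        # also shrinks toward 0 as |z| grows

z = np.array([0.5, 2.0, 10.0])
print(relu_grad(z))      # [1. 1. 1.]
print(sigmoid_grad(z))   # gradient vanishes for large z
print(tanh_grad(z))      # gradient vanishes for large z
```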


Reduce overfitting

Important to reduce overfitting since:

number of training data ≪ number of parameters

• Data augmentation: apply label-invariant transforms

• Dropout

• Other ways of “implicit regularization”:

◦ early stopping
◦ weight decay (ridge regression)
◦ ....

8-10/41
Learned hierarchical representations

Learned representations using CNN trained on ImageNet:

Figure credit: Y. LeCun's slide, with research credit to Zeiler and Fergus, 2013.
11/41
single-layer networks (perceptron)

12/41
Perceptron

Input x = [x1, . . . , xd] ∈ R^d, weight w = [w1, . . . , wd] ∈ R^d, output y ∈ R:

y = σ(w> x) = σ( ∑_{i=1}^{d} w_i x_i )

where σ(·) is a nonlinear activation function, e.g. σ(z) = sign(z) (hard thresholding) or σ(z) = sigmoid(z) = 1/(1 + e^{−z}) (soft thresholding).

[Figure: a perceptron with inputs x1, . . . , xd, weights w1, . . . , wd, and output y.]

Decision making at test stage: given a test sample x, calculate y.


13/41
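A minimal sketch of the test-stage computation, assuming the sigmoid activation and a 0.5 decision threshold (the threshold is our illustrative choice, not fixed by the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def perceptron_output(w, x):
    # y = sigma(w^T x)
    return sigmoid(w @ x)

# toy usage with d = 3
w = np.array([0.4, -1.2, 0.7])
x = np.array([1.0, 0.5, -2.0])
y = perceptron_output(w, x)
label = int(y > 0.5)   # hypothetical thresholding rule for a binary decision
print(y, label)
```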
Nonlinear activation

Nonlinear activation is critical for forming complex decision boundaries.

14/41
Training of a perceptron

Empirical risk minimization: given training data {xi, yi}, i = 1, . . . , n, find the weight vector w:

ŵ = arg min_{w ∈ R^d} (1/n) ∑_{i=1}^{n} ℓ(w; xi, yi)

• find the weight parameter w that best fits the data;

• popular choices for the loss function: quadratic, cross entropy, hinge, etc., e.g. the quadratic loss

ℓ(w; xi, yi) = ( yi − σ(w> xi) )²

We'll use the quadratic loss and sigmoid activation as an example...

15/41
Training via (stochastic) gradient descent

ŵ = arg min_{w ∈ R^d} (1/2n) ∑_{i=1}^{n} ( yi − σ(w> xi) )² := arg min_{w ∈ R^d} ℓn(w)

• Gradient descent:

w_{t+1} = w_t − η_t ∇ℓn(w_t)

where η_t is the step size or learning rate.

• The gradient can be calculated via the chain rule.

◦ call ŷi = ŷi(w) = σ(w> xi), then

d/dw [ (1/2)(yi − ŷi)² ] = (ŷi − yi) dŷi(w)/dw = (ŷi − yi) σ′(w> xi) xi = (ŷi − yi) ŷi (1 − ŷi) xi,

where (ŷi − yi) ŷi (1 − ŷi) is a scalar, and we used σ′(z) = σ(z)(1 − σ(z)). This is called the “delta rule”.

16/41
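A sketch of the delta rule as a per-sample gradient, with a finite-difference check; the data here are made up purely for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_loss(w, x, y):
    # quadratic loss 1/2 (y - sigma(w^T x))^2
    return 0.5 * (y - sigmoid(w @ x)) ** 2

def sample_grad(w, x, y):
    # delta rule: (y_hat - y) * y_hat * (1 - y_hat) * x
    y_hat = sigmoid(w @ x)
    return (y_hat - y) * y_hat * (1.0 - y_hat) * x

# finite-difference check on random data
rng = np.random.default_rng(0)
d = 5
w, x, y = rng.normal(size=d), rng.normal(size=d), 1.0
eps = 1e-6
num_grad = np.array([
    (sample_loss(w + eps * e, x, y) - sample_loss(w - eps * e, x, y)) / (2 * eps)
    for e in np.eye(d)
])
print(np.allclose(num_grad, sample_grad(w, x, y), atol=1e-6))  # True
```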
Stochastic gradient descent

Stochastic gradient descent uses only a mini-batch of data at every iteration.

At every iteration t,

1. Draw a mini-batch of data indexed by St ⊆ {1, · · · , n};
2. Update

w_{t+1} = w_t − η_t ∑_{i ∈ St} ∇ℓ(w_t; xi, yi)

17/41
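A minimal mini-batch SGD loop for this model; it assumes the delta-rule gradient above, a constant step size, and synthetic binary labels, all of which are our illustrative choices rather than the slides':

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_perceptron(X, y, eta=0.1, batch_size=8, num_iters=5000, seed=0):
    """X: n x d data matrix, y: length-n labels in [0, 1]."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(num_iters):
        S = rng.choice(n, size=batch_size, replace=False)      # mini-batch S_t
        y_hat = sigmoid(X[S] @ w)
        # sum of per-sample delta-rule gradients over the mini-batch
        grad = X[S].T @ ((y_hat - y[S]) * y_hat * (1.0 - y_hat))
        w = w - eta * grad
    return w

# toy data: labels obtained by thresholding sigma(w_star^T x_i)
rng = np.random.default_rng(1)
n, d = 500, 10
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = (sigmoid(X @ w_star) > 0.5).astype(float)
w_hat = sgd_perceptron(X, y)
print(np.mean((sigmoid(X @ w_hat) > 0.5) == y))   # training accuracy
```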
Detour: Backpropagation

• Backpropagation is the basic algorithm for training neural networks. It was rediscovered several times in the literature in the 1970s-80s, but popularized by the 1986 paper by Rumelhart, Hinton, and Williams.

• Assuming node operations take unit time, backpropagation takes linear time, specifically O(Network Size) = O(V + E), to compute the gradient, where V is the number of vertices and E is the number of edges in the neural network.

Main idea: the chain rule from calculus,

(d/dx) f(g(x)) = f′(g(x)) g′(x)

Let's illustrate the process with a single-output, 2-layer NN.
18/41
Derivations of backpropagation
[Figure: a 2-layer network with input x, hidden units h_m connected to the input by weights w_{m,j}, and output ŷ connected to the hidden units by weights v_m.]

network output:

ŷ = σ( ∑_m v_m h_m ) = σ( ∑_m v_m σ( ∑_j w_{m,j} x_j ) )

loss function: f = (1/2)(y − ŷ)².
19/41
Backpropagation I

Optimize the weights for each layer, starting with the layer closest to the outputs and working back to the layer closest to the inputs.

1. To update the v_m's: realize

df/dv_m = (df/dŷ)(dŷ/dv_m)
        = (ŷ − y) dŷ/dv_m
        = (ŷ − y) σ′( ∑_m v_m h_m ) h_m
        = (ŷ − y) ŷ (1 − ŷ) h_m = δ h_m,

where δ := (ŷ − y) ŷ (1 − ŷ).

This is the same as updating the perceptron.

20/41
Backpropagation II

2. To update the w_{m,j}'s: realize

df/dw_{m,j} = (df/dŷ)(dŷ/dh_m)(dh_m/dw_{m,j})
            = (ŷ − y) ŷ (1 − ŷ) v_m h_m (1 − h_m) x_j
            = δ v_m h_m (1 − h_m) x_j

21/41
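The two update formulas, written as code for a single training pair; a sketch with sigmoid units in both layers as in the derivation (network sizes, step size, and data are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(W, v, x, y, eta=0.1):
    """One gradient step on f = 1/2 (y - y_hat)^2 for a 1-hidden-layer net.
    W: m x d hidden-layer weights (rows w_m), v: length-m output weights."""
    # forward pass
    h = sigmoid(W @ x)            # hidden activations h_m
    y_hat = sigmoid(v @ h)        # network output

    # backward pass
    delta = (y_hat - y) * y_hat * (1.0 - y_hat)       # shared scalar delta
    grad_v = delta * h                                # df/dv_m = delta * h_m
    grad_W = np.outer(delta * v * h * (1.0 - h), x)   # df/dw_{m,j}

    return W - eta * grad_W, v - eta * grad_v

# toy usage
rng = np.random.default_rng(0)
d, m = 4, 3
W, v = rng.normal(size=(m, d)), rng.normal(size=m)
x, y = rng.normal(size=d), 1.0
W, v = backprop_step(W, v, x, y)
```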
Questions we may ask (in general)

• Representation: how well can a given network (with fixed activation) approximate / explain the training data?

• Generalization: how well does the learned ŵ behave in prediction during testing?

• Optimization: how does the output wt of (S)GD relate to ŵ? (We should really plug wt into the previous two questions!)

22/41
Nonconvex landscape of perceptron can be very bad

SGD converges to local minimizers. Are they global?

Theorem 12.1 (Auer et al., 1995)

Let σ(·) be the sigmoid and ℓ(·) be the quadratic loss function. There exists a sequence of training samples {xi, yi}, i = 1, . . . , n, such that ℓn(w) has ⌊n/d⌋^d distinct local minima.

Consequence: there may exist exponentially many bad local minima with arbitrary data! (curse of dimensionality)

23/41
Why?

• saturation of the sigmoid

[Figure: single-sample losses ℓ(w; 10, 0.55) and ℓ(w; 0.7, 0.25), each with one local minimum and flat regions away from it.]

◦ each sample produces a local min + flat surfaces away from the minimizer
◦ if the local min of sample A falls into the flat region of sample B (and vice versa), the sum of the sample losses preserves both minima.

24/41
Why?

• We get one local minimum per sample in 1D.

[Figure: a 1D empirical loss whose local minima line up with the individual samples.]

• Curse of dimensionality: we can construct the samples to get ⌊n/d⌋^d distinct local minima in d dimensions.

25/41
Statistical models come to the rescue

Data/measurements follow certain statistical models and hence are not worst-case instances.

minimize_w  ℓn(w) = (1/n) ∑_{i=1}^{n} ℓ(w; xi, yi),  which converges as n → ∞ to  E[ℓ(w; x, y)] := ℓ(w)

empirical risk ≈ population risk

[Figure: contours of the population risk (minimizer θ0 = [1, 0]) and of the empirical risk (minimizer θ̂n = [0.816, −0.268]). Figure credit: Mei, Bai and Montanari.]

26/41


Statistical models of training data

Assume the training data {xi, yi}, i = 1, . . . , n, are drawn i.i.d. from some distribution:

(x, y) ∼ p(x, y)

We are using neural networks to fit p(x, y).

• A planted-truth model: let x ∼ N(0, I) and draw the label y as
◦ regression model: yi = σ(w⋆> xi)
◦ classification model: yi ∈ {0, 1}, where P(yi = 1) = σ(w⋆> xi)

• Parameter recovery: can we recover w⋆ using {xi, yi}, i = 1, . . . , n?

27/41
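A sketch of sampling from the planted-truth model (both variants); the ground truth w⋆ is drawn at random here purely for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def planted_data(w_star, n, mode="regression", seed=0):
    """Draw {x_i, y_i} with x_i ~ N(0, I) and labels generated by w_star."""
    rng = np.random.default_rng(seed)
    d = w_star.shape[0]
    X = rng.normal(size=(n, d))              # x_i ~ N(0, I)
    p = sigmoid(X @ w_star)
    if mode == "regression":
        y = p                                # y_i = sigma(w_star^T x_i)
    else:                                    # classification: y_i in {0, 1}
        y = rng.binomial(1, p).astype(float) # P(y_i = 1) = sigma(w_star^T x_i)
    return X, y

d = 10
w_star = np.random.default_rng(1).normal(size=d)
X, y = planted_data(w_star, n=1000, mode="classification")
```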
Roadmap

1. Step 1: verify the landscape properties of the population loss;

2. Step 2: translate properties of the population loss to the empirical loss;

3. Step 3: argue that ŵ (the minimizer of the empirical loss) is close to w⋆ (the minimizer of the population loss).

28/41
Step 1: population risk

[Figure: contours of the population risk, with w⋆ its unique minimizer.]

• w⋆ is the unique local minimizer, and it is also global. No bad local minima!

• strongly convex near the global optimum; large gradient elsewhere
29/41
Nonconvex landscape: from population to empirical

Figure credit: Mei, Bai and Montanari

30/41
Step 2: uniform convergence of gradients & Hessian

Theorem 12.2 (Mei, Bai, and Montanari, 2016)

Under suitable assumptions, for any δ > 0, there exists a positive constant C depending on (R, δ) but independent of n and d, such that as long as n ≥ C d log d, we have

1. preservation of the gradient:

P( sup_{‖w‖≤R} ‖∇ℓn(w) − ∇ℓ(w)‖_2 ≤ √( C d log n / n ) ) ≥ 1 − δ;

2. preservation of the Hessian:

P( sup_{‖w‖≤R} ‖∇²ℓn(w) − ∇²ℓ(w)‖ ≤ √( C d log n / n ) ) ≥ 1 − δ.

31/41
Step 3: establish rate of convergence for ERM

By the mean-value theorem, there exists some w′ between ŵ and w⋆ such that

ℓn(ŵ) = ℓn(w⋆) + ⟨∇ℓn(w⋆), ŵ − w⋆⟩ + (1/2)(ŵ − w⋆)> ∇²ℓn(w′)(ŵ − w⋆) ≤ ℓn(w⋆),

where the last inequality follows by the optimality of ŵ. Then

(1/2) λmin(∇²ℓn(w′)) ‖ŵ − w⋆‖²_2 ≤ (1/2)(ŵ − w⋆)> ∇²ℓn(w′)(ŵ − w⋆)
                                 ≤ |⟨∇ℓn(w⋆), ŵ − w⋆⟩|
                                 ≤ ‖∇ℓn(w⋆)‖ · ‖ŵ − w⋆‖,

which gives

‖ŵ − w⋆‖_2 ≤ 2‖∇ℓn(w⋆)‖ / λmin(∇²ℓn(w′)) ≲ √( C d log n / n ).
32/41
two-layer networks

33/41
Two-layer linear network
Given arbitrary data {xi, yi}, i = 1, . . . , n, with xi, yi ∈ R^d, fit the two-layer linear network with the quadratic loss:

f(A, B) = ∑_{i=1}^{n} ‖yi − A B xi‖²_2

where B ∈ R^{p×d}, A ∈ R^{d×p}, and p ≤ d.

[Figure: a two-layer linear network mapping x to A B x through a hidden layer of width p.]

• special case: auto-association (auto-encoding, identity mapping), where yi = xi, e.g. for image compression.
34/41
Landscape of 2-layer linear network

• bears some similarity to the nonconvex matrix factorization problem;

• lack of identifiability: for any invertible C, AB = (AC)(C^{−1}B).

• Define X = [x1, x2, · · · , xn] and Y = [y1, y2, · · · , yn]; then

f(A, B) = ‖Y − A B X‖²_F

• Let ΣXX = ∑_{i=1}^{n} xi xi> = XX>, ΣY Y = Y Y>, ΣXY = XY>, and ΣY X = Y X>.

• When X = Y, any optimum satisfies AB = U U>, where the columns of U are the top p eigenvectors of ΣXX.

35/41
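For the auto-association case (Y = X), a quick numerical sketch on synthetic data, checking that the projector U U> onto the top-p eigenvectors of ΣXX attains a residual equal to the sum of the trailing eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, p = 8, 200, 3
X = rng.normal(size=(d, n))                  # columns are the samples x_i

Sigma_XX = X @ X.T
eigvals, eigvecs = np.linalg.eigh(Sigma_XX)  # eigenvalues in ascending order
U = eigvecs[:, -p:]                          # top-p eigenvectors
P = U @ U.T                                  # candidate optimum AB = U U^T

residual = np.linalg.norm(X - P @ X, "fro") ** 2
print(np.isclose(residual, eigvals[:-p].sum()))  # True: sum of trailing eigenvalues
```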
Landscape of 2-layer linear network

Theorem 12.3 (Baldi and Hornik, 1989)

Suppose X is full rank and hence ΣXX is invertible. Further assume

Σ := ΣY X Σ_{XX}^{−1} ΣXY

is full rank with d distinct eigenvalues λ1 > · · · > λd > 0. Then f(A, B) has no spurious local minima, except for equivalent versions of the global minimum due to invertible transformations.

• no bad local min!


• generalizable to multi-layer linear networks.

36/41
Critical points of two-layer linear networks

Lemma 12.4 (critical points)


Any critical point satisfies

AB = P_A ΣY X Σ_{XX}^{−1},

where A satisfies

P_A Σ = P_A Σ P_A = Σ P_A,

and P_A is the orthogonal projector onto the column span of the sub-indexed matrix.

37/41
Critical points of two-layer linear networks

Let the EVD of Σ = ΣY X Σ_{XX}^{−1} ΣXY be Σ = U Λ U>.

Lemma 12.5 (critical points)

At any critical point, A can be written in the form

A = [U_J, 0_{d×(p−r)}] C,

where rank(A) = r ≤ p, J ⊂ {1, . . . , d} with |J| = r, and C is invertible. Correspondingly,

B = C^{−1} [ U_J> ΣY X Σ_{XX}^{−1} ; last p − r rows of C L ]   (two row blocks stacked vertically),

where L is any d × d matrix.

One can verify that (A, B) is a global optimum if and only if J = {1, . . . , p}.


38/41
Two-layer nonlinear network

Local strong convexity under the Gaussian model [Fu et al., 2018].

[Figure: a 1-hidden-layer nonlinear network with input x, hidden-layer weights W, and output y.]

39/41
Reference I

[1] ”ImageNet,” www.image-net.org/.


[2] ”Deep learning,” LeCun, Bengio, and Hinton, Nature, 2015.
[3] ”ImageNet classification with deep convolutional neural networks,”
Krizhevsky et al., NIPS, 2012.
[4] ”Learning representations by back-propagating errors,” Rumelhart,
Hinton, and Williams, Nature, 1986.
[5] ”Dropout: A simple way to prevent neural networks from overfitting,”
Srivastava et al., JMLR, 2014.
[6] ”Dropout Training as Adaptive Regularization,” Wager et al., NIPS,
2013.
[7] ”On the importance of initialization and momentum in deep learning,”
Sutskever et al., ICML, 2013.
[8] ”Exponentially many local minima for single neurons,” P. Auer, M.
Herbster and K. Warmuth, NIPS, 1995.
40/41
Reference II

[9] ”The landscape of empirical risk for non-convex losses,” S. Mei, Y. Bai,
and A. Montanari, arXiv preprint arXiv:1607.06534, 2016.
[10] ”Neural networks and principal component analysis: Learning from
examples without local minima,” P. Baldi and K. Hornik, Neural
Networks, 1989.
[11] ”Local Geometry of One-Hidden-Layer Neural Networks for Logistic
Regression,” Fu, Chi, Liang, arXiv preprint arXiv:1802.06463, 2018.

41/41
