
Deep learning overview

Nguyen Quang Uy

During the presentation

Please ask questions whenever you have them.


Outline
This presentation provides an introduction to machine
learning and deep learning.
Concept of Machine Learning

Artificial Neural network

Deep learning


Machine Learning Concept


Machine learning:

Methods that can automatically detect patterns in data


Use the uncovered patterns to predict future data.


Machine Learning System


a) Training phase: Data Collection -> Features Extraction/Selection -> Learning -> Learnt Model

b) Testing/Deploying: New Data -> Features Extraction/Selection -> Learnt Model -> Decision

Datasets
Often in the form of tables

Samples: A record/item in the dataset

Features: A column in the dataset representing a property
of the object of the learning problem.

Iris flower


Data set: Iris


sepal length   sepal width   petal length   petal width   Class
5.1            3.5           1.4            0.2           Iris-setosa
7.0            3.2           4.7            1.4           Iris-versicolor
6.3            3.3           6.0            2.5           Iris-virginica
..


Classification Problem

Given a dataset in which all samples have already been labeled with one of several classes.

Find the class label for new samples.


Training data:

F1     F2     F3     F4     Class
5.1    3.5    1.4    0.2    1
7.0    3.2    4.7    1.4    0
6.3    3.3    6.0    2.5    1
2.1    1.3    6.2    5.7    1
1.0    1.8    2.5    2.6    0

New samples:

F1     F2     F3     F4     Class
2.2    1.8    3.7    4.6    ?
3.5    2.9    4.8    5.2    ?

K-Nearest Neighbor Classifiers

Learning by analogy: Tell me who your friends are and I'll tell you who you are.


K-Nearest Neighbor Algorithm

To determine the class of a new sample:

Calculate the distance between the new sample and all examples in the training set.
Select the K nearest examples to the new sample in the training set.
Assign the new sample to the most common class among its K nearest neighbors.
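A minimal sketch of this algorithm in NumPy (the function name and the small dataset, taken from the F1..F4 table above, are only illustrative):

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training samples."""
    # Euclidean distance from the new sample to every training sample
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k closest training samples
    nearest = np.argsort(distances)[:k]
    # Most common class label among those k neighbors
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Illustrative data in the spirit of the F1..F4 table above
X_train = np.array([[5.1, 3.5, 1.4, 0.2],
                    [7.0, 3.2, 4.7, 1.4],
                    [6.3, 3.3, 6.0, 2.5]])
y_train = np.array([1, 0, 1])
print(knn_predict(X_train, y_train, np.array([2.2, 1.8, 3.7, 4.6]), k=3))
```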


K-Nearest Neighbor Example


Paper tissue dataset

X1 (Acid durability) (seconds)   X2 (Strength) (kg/m2)   Y = Classification
...                              ...                     Bad
...                              ...                     Bad
...                              ...                     Good
...                              ...                     Good

Classify a new paper tissue with X1 = 3 and X2 = 7, using K = 3.


K-Nearest Neighbor Example


Paper tissue dataset

X1 (Acid durability) (seconds)   X2 (Strength) (kg/m2)   Distance to the new sample   Y = Classification
...                              ...                     16                           Bad
...                              ...                     25                           Bad
...                              ...                     ...                          Good
...                              ...                     13                           Good

Classify the new paper tissue with X1 = 3 and X2 = 7, using K = 3.
Since K = 3 and two out of the three closest samples are Good, the new sample is classified as Good.

K-Nearest Neighbor Algorithm

There are several key issues that affect the performance of kNN:

The choice of K

The distance metric

The approach to combining the class labels of the K nearest neighbors.


Performance Measure

A popular measure for classification is accuracy:

Accuracy = (Number of correctly classified samples) / (Total number of samples)
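A small illustrative helper computing this measure (names and data are only examples):

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class matches the true class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return (y_true == y_pred).mean()

# Example: 4 out of 5 predictions are correct -> accuracy = 0.8
print(accuracy([1, 0, 1, 1, 0], [1, 0, 1, 0, 0]))
```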


Overfitting and model selection

Overfitting: When model A is better than model B on the training data, but B is better than A on the testing data, A is said to be overfitted.

Model selection: The problem of selecting a model that will perform well on unseen data.


No free lunch theorem

No free lunch theorem (David Wolpert and William Macready, 1997).

Averaged over all possible problems, the performance of all algorithms is the same, and no better than random guessing.


When to apply machine learning

Human expertise is absent.

Humans are unable to explain their expertise

Speech recognition, Face recognition

The problem size is too vast

Robotics on Mars

Calculating webpage ranks, matching ads to Google pages

Solution changes with time

Network traffic monitoring


Outline
This presentation provides an introduction to machine
learning and deep learning.
Concept of Machine Learning

Artificial Neural network

Deep learning


Artificial Neural Network


This learning model is inspired by the biological neural network.

A biological neural network is a series of interconnected neurons that interact with each other to process information.


Artificial Neural Network

An Artificial Neural Network is a system composed of many simple processing elements operating in parallel:

Each element of the network is a node called a unit.

Units are connected by links, and each link has a numeric weight.
The activation function of a node defines the output of that node given a set of inputs.


Activation function
The activation function defines the output of a node given a set of inputs.

It is used to transform the inputs into a different domain where they may be more easily separable.


Popular activation functions
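The slide shows these functions graphically; as an illustration, here are three commonly used activation functions written in NumPy (this particular set is an assumption, not necessarily the ones pictured):

```python
import numpy as np

def sigmoid(x):
    # Squashes any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Squashes input into (-1, 1), zero-centered
    return np.tanh(x)

def relu(x):
    # Keeps positive inputs, zeroes out negative ones
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x), tanh(x), relu(x))
```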


Numerical Example

net_h1 = 0.15*0.05 + 0.2*0.1 + 0.35 = 0.3775

out_h1 = 1/(1 + e^(-net_h1)) = 1/(1 + e^(-0.3775)) = 0.593
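The same computation, reproduced in NumPy using the weights, bias and inputs from the slide:

```python
import numpy as np

# Hidden unit h1: weights w1 = 0.15, w2 = 0.2, bias b1 = 0.35, inputs i1 = 0.05, i2 = 0.1
w1, w2, b1 = 0.15, 0.2, 0.35
i1, i2 = 0.05, 0.1

net_h1 = w1 * i1 + w2 * i2 + b1          # 0.3775
out_h1 = 1.0 / (1.0 + np.exp(-net_h1))   # sigmoid -> about 0.593
print(net_h1, out_h1)
```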


Multilayer Perceptron

An MLP consists of multiple layers of nodes in a directed graph, with each layer fully connected to the next one.

A network with one hidden layer is the most popular MLP structure.


Training MLP

Training means finding the parameters (weights) for which the objective function is optimal.

We need:

An objective function

A method for adjusting the parameters


Cost function

The cost function is the objective function that we want to optimize by choosing the model parameters.
One popular cost function for neural networks is the cross-entropy cost function:

J(θ) = -[ y·log h_θ(x) + (1-y)·log(1 - h_θ(x)) ]

where y is the target value for input x, and h_θ(x) is the output of the model given x.
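A small NumPy version of this cost, averaged over a batch of samples (the averaging and the clipping that avoids log(0) are implementation choices, not part of the slide):

```python
import numpy as np

def cross_entropy(y, h):
    """Binary cross-entropy J = -[y*log(h) + (1-y)*log(1-h)], averaged over samples."""
    y, h = np.asarray(y, dtype=float), np.asarray(h, dtype=float)
    eps = 1e-12                     # avoid log(0)
    h = np.clip(h, eps, 1.0 - eps)
    return -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

# Example: targets y and model outputs h(x)
print(cross_entropy([1, 0, 1], [0.9, 0.2, 0.6]))
```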


Parameters Estimation

Find the parameters for which the cost function is minimal.

We select θ such that
J(θ) = -[ y·log h_θ(x) + (1-y)·log(1 - h_θ(x)) ]
is minimal.


Gradient Descent Algorithm


0. Start at x_k with k = 0, and select a learning rate α.
1. Compute a search direction p_k = ∇J(x_k).
2. Update x_{k+1} = x_k - α·p_k.
3. Check for convergence (stopping criteria), e.g. ∇J(x_k) ≈ 0.
4. Set k = k + 1 and repeat steps 1 to 4.
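A minimal sketch of these steps in NumPy, applied to a toy objective (the function names and the example objective are illustrative):

```python
import numpy as np

def gradient_descent(grad_J, x0, alpha=0.1, tol=1e-6, max_iter=1000):
    """Minimize J by repeatedly stepping against its gradient grad_J."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        p = grad_J(x)                # step 1: search direction (gradient)
        x = x - alpha * p            # step 2: update
        if np.linalg.norm(p) < tol:  # step 3: convergence check
            break
    return x

# Example: J(x) = x^2 has gradient 2x, minimum at x = 0
print(gradient_descent(lambda x: 2 * x, x0=[5.0]))
```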


Gradient Descent Algorithm


(Illustration of gradient descent steps, following the search direction p_k = ∇J(x_k).)


Mini-batch Gradient Descent Algorithm


To increase the convergence speed, we often update the parameters of the model after a mini-batch of training samples.
This approach is referred to as mini-batch gradient descent.
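A possible sketch of one epoch of mini-batch updates (the least-squares example gradient is only an illustration):

```python
import numpy as np

def minibatch_updates(X, y, theta, grad_fn, alpha=0.1, batch_size=32):
    """One epoch of mini-batch gradient descent: update after each small batch."""
    idx = np.random.permutation(len(X))        # shuffle the training set
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]  # indices of one mini-batch
        theta = theta - alpha * grad_fn(theta, X[batch], y[batch])
    return theta

# Example: least-squares gradient for a linear model y ≈ X @ theta
grad_fn = lambda theta, Xb, yb: 2 * Xb.T @ (Xb @ theta - yb) / len(Xb)
X = np.random.randn(100, 3)
y = X @ np.array([1.0, -2.0, 0.5])
theta = np.zeros(3)
for epoch in range(20):
    theta = minibatch_updates(X, y, theta, grad_fn, alpha=0.1)
print(theta)  # approaches [1.0, -2.0, 0.5]
```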


MLP model selection

Several aspects need to be considered when using an MLP:

The way to initialize the weights and biases.

The number of neurons in the hidden layers (often chosen relative to the number of inputs).

The learning rate.

The stopping criteria.


MLP Initialization

For biases:

Can initialize all to zero.

For weights:

Should not be the same for all.

Should not be zero if the activation is tanh.
The common recipe is to initialize w_ij uniformly from [-a, a], where

a = sqrt( 6 / (H_k + H_{k-1}) )

and H_k, H_{k-1} are the numbers of units in the two connected layers.
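A small sketch of this recipe in NumPy, assuming the formula above is the uniform range a = sqrt(6 / (H_k + H_{k-1})):

```python
import numpy as np

def init_weights(h_prev, h_curr, rng=np.random.default_rng(0)):
    """Initialize a weight matrix uniformly in [-a, a] with a = sqrt(6/(h_prev + h_curr))."""
    a = np.sqrt(6.0 / (h_prev + h_curr))
    return rng.uniform(-a, a, size=(h_prev, h_curr))

W1 = init_weights(4, 10)   # e.g. 4 inputs -> 10 hidden units
b1 = np.zeros(10)          # biases can all start at zero
print(W1.shape, W1.min(), W1.max())
```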


How do we pick the learning rate α?

Stochastic gradient descent will converge if

Σ_{t=1}^{∞} α_t = ∞   and   Σ_{t=1}^{∞} α_t² < ∞

where α_t is the learning rate at the t-th update.

If α is a constant, the algorithm is not guaranteed to converge.


How do we pick the learning rate α?

Decreasing strategies: the learning rate α_t is decreased over time, for example

α_t = α_0 / (1 + δ·t)        or        α_t = α_0 / (1 + t)^δ

where δ is a constant, usually selected in [0.5, 1].

It is often better to use a fixed learning rate for the first few updates.
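A possible schedule in this spirit, with a fixed warm-up followed by decay (the exact decay formula here is an assumption, not necessarily the one on the slide):

```python
def learning_rate(t, alpha0=0.1, delta=0.75, warmup=10):
    """Fixed learning rate for the first `warmup` updates, then a decaying rate."""
    if t < warmup:
        return alpha0
    return alpha0 / (1.0 + delta * (t - warmup))

for t in [0, 5, 10, 50, 500]:
    print(t, learning_rate(t))
```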

When to stop Backpropagation?

Some common criteria are:

A fixed number of epochs.

Stop when the error can no longer be reduced.
Early stopping: stop training when the validation error increases (with some look-ahead).
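A sketch of early stopping with a look-ahead (patience) window; the train_epoch and validation_error callables are assumed to be provided by the caller:

```python
def train_with_early_stopping(train_epoch, validation_error, max_epochs=200, patience=10):
    """Stop when the validation error has not improved for `patience` epochs."""
    best_err, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        train_epoch()                          # one pass over the training data
        err = validation_error()               # error on a held-out validation set
        if err < best_err:
            best_err, best_epoch = err, epoch  # remember the best result so far
        elif epoch - best_epoch >= patience:   # no improvement for `patience` epochs
            break
    return best_err
```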


Neural network for digit recognition

If the output at position k is the greatest, then the network will recognize the input as digit k.


Universal approximation theorem


Theorem (Cybenko 1989): A feed-forward network with a
single hidden layer containing a finite number of neurons
can approximate any continuous function.
In other words, a set of weights exists that can produce
the targets from the inputs. However, the problem is
finding them.


Outline
This presentation provides an introduction to machine
learning and deep learning.
Concept of Machine Learning

Artificial Neural network

Deep learning


DL appears in The New York Times


Scientists See Promise in Deep-Learning Programs
John Markoff
November 23, 2012


DL is a core focus at Microsoft Research


Leading researchers

Hinton at Google and the University of Toronto

LeCun at Facebook

Andrew Ng at Stanford University


Successful applications

Deep learning is a powerful methodology well suited to training deep and large networks for big data applications.
Successful applications of deep networks have already been demonstrated in a large variety of domains:

Computer vision: Facebook image tagging

Natural Language Processing: Google Translate

Speech Recognition: Google Docs (voice to text)


Multilayer Neural Network

Can we use a multilayer neural network with a lot of layers?

The answer is yes, and people have tried this, but without much success.


Problems with many-layer neural networks


When training an MLP with many hidden layers, the gradient descent algorithm is not suitable since:

Too many parameters

A network with 1000 inputs, two hidden layers of 500 nodes, and 10 outputs has about 1000*500 + 500*500 + 500*10 = 755,000 weights
Computationally expensive
Gradients decay quickly (the vanishing gradient problem).


What is the novelty of DL?
1. What exactly is deep learning?
2. Why is it generally better than other methods on image, speech, and certain other types of data?


Deep Learning Overview

Deep learning is the training of neural networks with many layers, with some important modifications to:

Network structure

Training algorithms


Deep Learning Objective

Many layers work together to build an improved feature space

The first layer learns 1st-order features (e.g. edges)

The 2nd layer learns higher-order features (combinations of first-layer features, combinations of edges, etc.)


Why is deep learning great?

Data (feature) representation is important

Features are problem dependent

Feature engineering requires a lot of domain knowledge

Automatically learning the data representation is desirable

Deep learning is one such method


Why does deep learning work?

Biological plausibility, e.g. the visual cortex


Why does deep learning work?

Håstad's proof: problems that can be represented with a polynomial number of nodes using k layers may require an exponential number of nodes with k-1 layers (e.g. parity).

*Håstad (2014), "On the correlation of parity and small-depth circuits", SIAM Journal on Computing.


Deep learning categorization

Deep networks for supervised learning

Convolutional neural network

Deep-structured CRFs

Deep networks for unsupervised learning

Restricted Boltzmann machine

Deep autoencoder

Hybrid deep networks

Combine the two, e.g. using unsupervised methods as optimization or regularization aids for supervised learning


Deep learning framework

There are several frameworks for implementing deep learning.
In our systems we use:

Python

Numpy

Theano


Thank you!
