
Mathematical Background

Monday, February 20, 2017 10:01 AM

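Linear programming
- Linear objective function
- Linear constraints

Quadratic programming
- Quadratic objective function
- Linear constraints
- G: (Positive semi-definite) matrix

Method of Lagrange Multiplier
- Problem: Min an objective subject to constraints
- Define Lagrangian
    $v_i$: Lagrange multipliers (dual variables)
- At optimality: All partial derivatives of the Lagrangian are zero

Example: Find shortest distance from origin to hyperbola $x^2 + 8xy + 7y^2 = 225$

Min $x^2 + y^2$
subject to $x^2 + 8xy + 7y^2 - 225 = 0$

$L(x, y, \lambda) = x^2 + y^2 - \lambda(x^2 + 8xy + 7y^2 - 225)$

At optimality:
    $\partial L/\partial x = 2x - \lambda(2x + 8y) = 0$, i.e. $(1 - \lambda)x - 4\lambda y = 0$   (I)
    $\partial L/\partial y = 2y - \lambda(8x + 14y) = 0$, i.e. $-4\lambda x + (1 - 7\lambda)y = 0$   (II)

$(x, y) \neq (0, 0)$ (the origin is not on the hyperbola), so the determinant of (I), (II) must be zero:
    $(1 - \lambda)(1 - 7\lambda) - 16\lambda^2 = 0$
    $9\lambda^2 + 8\lambda - 1 = 0$
    $\lambda = 1/9$ or $\lambda = -1$

$\lambda = -1$:
    Substitute into (I): $x = -2y$
    Substitute into constraint: $-5y^2 = 225$
    No solution

$\lambda = 1/9$:
    Substitute into (I): $y = 2x$
    Substitute into constraint: $x^2 = 5$, $y^2 = 20$
    Min squared distance = $x^2 + y^2 = 25$, so min distance = 5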
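The hyperbola example can be checked numerically. The sketch below is only an illustration (the function names are made up for this note): it solves $9\lambda^2 + 8\lambda - 1 = 0$, follows the direction given by stationarity condition (I), $(1 - \lambda)x = 4\lambda y$, and scales it onto the constraint.

```python
import math


def solve_multipliers():
    """Roots of 9*lam^2 + 8*lam - 1 = 0, smallest first."""
    a, b, c = 9.0, 8.0, -1.0
    root = math.sqrt(b * b - 4 * a * c)
    return sorted([(-b - root) / (2 * a), (-b + root) / (2 * a)])


def candidate_points(lam):
    """Points on the hyperbola that satisfy (1 - lam) * x = 4 * lam * y.

    Every such point is s * (4*lam, 1 - lam); s is fixed by the constraint
    x^2 + 8xy + 7y^2 = 225. Returns [] when no real s exists.
    """
    dx, dy = 4 * lam, 1 - lam
    quad = dx * dx + 8 * dx * dy + 7 * dy * dy
    if quad <= 0:
        return []
    s = math.sqrt(225 / quad)
    return [(s * dx, s * dy), (-s * dx, -s * dy)]


if __name__ == "__main__":
    for lam in solve_multipliers():
        points = candidate_points(lam)
        distances = [math.hypot(x, y) for x, y in points]
        print(lam, points, distances)
```

For $\lambda = -1$ the quadratic form along the stationary direction is negative, which is the "no solution" branch of the notes; for $\lambda = 1/9$ both candidate points lie at distance 5 from the origin.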

COMP 4211 Page 1


Support Vector Machine (SVM)
Monday, February 20, 2017 9:37 AM
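Classification Problem

- Given: Training set $\{(x_i, y_i)\}$
    $(x_i, y_i)$: Training pattern
    $x_i$: Input
    $y_i \in \{+1, -1\}$: Output (label)

- Assumption: Linearly separable
    There exists a linear surface that separates the 2 classes
    (2-D: line, 3-D: plane, n-D: hyperplane)

- Find: $w$, $b$ such that the hyperplane $w^T x + b = 0$ perfectly separates the 2 classes

Optimal hyperplane

- Idea: Max margin
    Let $x^+$, $x^-$ be points on the (+) and (-) margins: $w^T x^+ + b = 1$, $w^T x^- + b = -1$
    Margin = Magnitude of projection of $x^+ - x^-$ on $w$ = $2/\|w\|$
    Hyperplane should separate the 2 classes: $y_i(w^T x_i + b) \geq 1$ for all $i$

- Objective (constrained optimization problem):
    Min $\frac{1}{2}\|w\|^2$ subject to $y_i(w^T x_i + b) \geq 1$ for all $i$

Finding the solution

- Primal (original problem):
    Min $\frac{1}{2}\|w\|^2$ subject to $y_i(w^T x_i + b) \geq 1$ for all $i$

- Apply method of Lagrangian:
    Associate 1 Lagrange multiplier $\alpha_i \geq 0$ with each constraint
    $L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i [y_i(w^T x_i + b) - 1]$
    At optimality: $w = \sum_i \alpha_i y_i x_i$ and $\sum_i \alpha_i y_i = 0$
    Substitute these back into the primal to get the dual

- Dual (derived problem):
    Max $\sum_i \alpha_i - \frac{1}{2}\sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^T x_j$
    subject to $\alpha_i \geq 0$, $\sum_i \alpha_i y_i = 0$

    NOTE:
    $\alpha_i \alpha_j$ terms: Quadratic programming
    $\sum_i \alpha_i y_i = 0$: Linear constraint
    Can be solved numerically by any general-purpose optimization package
    Achieves global optimality

    $\alpha_i$: Determines HOW MUCH $x_i$ contributes to the solution
    ($\alpha_i$ ↑: Contribution ↑)

- Find $b$:
    Find $x^+$ and $x^-$ (support vectors) that belong to the (+) & (-) class, respectively:
        $b = -\frac{1}{2}(w^T x^+ + w^T x^-)$
    Better method: Average over all support vectors:
        $b = \frac{1}{N_S}\sum_{s} (y_s - w^T x_s)$
        $N_S$: # support vectors

Support vector

- It can be shown that:
    $\alpha_i > 0$: $x_i$ is a SUPPORT vector
        Lies on the margins
        Contributes to the solution
    $\alpha_i = 0$: $x_i$ does NOT contribute to the solution
        If $x_i$ is removed/moved: Solution does NOT change

Perform testing
(Given new $x$, check which class (+/-) $x$ belongs to)
    Check $\mathrm{sign}(w^T x + b)$
    Or: $\mathrm{sign}(\sum_i \alpha_i y_i x_i^T x + b)$

When training data not linearly separable

- New objective: Separate training data with min # errors
    Introduce slack variables $\xi_i$ ($\xi_i \geq 0$)

- Primal: Penalize $\xi_i$ in objective function
    Min $\frac{1}{2}\|w\|^2 + C\sum_i \xi_i$
    subject to $y_i(w^T x_i + b) \geq 1 - \xi_i$, $\xi_i \geq 0$
    C: Helps decide whether MARGIN or ERROR is more important in determining the solution

- Dual: Soft margin hyperplane:
    Same objective as the hard-margin dual
    subject to $0 \leq \alpha_i \leq C$, $\sum_i \alpha_i y_i = 0$

    NOTE:
    C constrains $\alpha_i$: Does not let $\alpha_i$ increase to $\infty$ (which may lead to no solution)
    Still Quadratic Programming: only 1 global min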



Non-linear SVM (C-SVM)
Wednesday, March 1, 2017 11:51 AM

Feature transformation

- Idea: Change space (= coordinate system) of data
    Decision boundary: Arbitrary surface → Hyperplane

- Formulation: $\phi: R^m \to H$
    SVM input: $(\phi(x_i), y_i)$
    (Figure: data and decision boundary in the original space vs the feature space)
    Boundary shape: 2-D curve (original space) → 3-D plane (feature space)

- Problem with directly working on $\phi$: High dimensionality → Inefficient computation
    Example: Input data: Gray-scale 16×16 images
    Use polynomial curve of degree 5 as decision boundary in ORIGINAL space

Kernel trick

- Idea: Input data only appear (in training + testing) in the form of dot products
    Training: The dual objective uses the data only through $\phi(x_i)^T \phi(x_j)$
    Testing: The decision function uses the data only through $\phi(x_i)^T \phi(x)$
    Dot (inner) product can be computed in $R^m$ (without going to $H$): $\phi(x)^T \phi(y) = k(x, y)$

- Kernel function $k(x, y)$
    Purpose:
        Return dot product between data points in some space
        Replace dot products in SVM algorithm with kernel → Obtain efficient representation
    Choose $k$ = Choose $\phi$ (feature map)

Example of kernels
Name | Formula | Params needing tuning
Inhomogeneous polynomial | $(x^T y + 1)^d$ | $d$
Gaussian (Radial basis function) | $\exp(-\|x - y\|^2 / (2\sigma^2))$; alternative form: $\exp(-\gamma\|x - y\|^2)$ | $\sigma$ (or $\gamma$)
Sigmoid | $\tanh(\kappa x^T y + \theta)$ | Valid kernel only for certain $\kappa$, $\theta$

Any algorithm depending only on dot products can use the kernel trick
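To make the kernel trick concrete, here is a small sketch for 2-D inputs. It assumes the inhomogeneous polynomial kernel has the common form $(x^T y + 1)^d$ with $d = 2$, and the explicit feature map `phi` is one standard choice written out for illustration; neither is taken from the lecture itself.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def poly_kernel(x, y, degree=2):
    """Inhomogeneous polynomial kernel (x . y + 1)^degree, computed in input space."""
    return (dot(x, y) + 1) ** degree


def gaussian_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def phi(x):
    """Explicit 6-D feature map matching poly_kernel with degree 2 on 2-D inputs."""
    x1, x2 = x
    r2 = math.sqrt(2)
    return (1.0, r2 * x1, r2 * x2, x1 * x1, r2 * x1 * x2, x2 * x2)


if __name__ == "__main__":
    x, y = (1.5, -2.0), (0.5, 3.0)
    print(poly_kernel(x, y))    # kernel evaluated in the 2-D input space
    print(dot(phi(x), phi(y)))  # same value via the 6-D feature space
```

Both prints give the same number up to rounding, but the kernel never builds the 6-D vectors. For degree 5 on 16×16 images the explicit feature space would be far too large to enumerate, which is the point of the trick.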



Gaussian Kernel

- Recall: Gaussian = Bell curve
    Puts bumps of various sizes on the training set

- Param tuning:
    $\sigma$ ↑ (equivalently $\gamma$ ↓):
        Influence of EVERY data point on the decision boundary ↑
        "Smoothness" of decision boundary ↑
        $\sigma \to \infty$: Linear boundary
        Heuristics:

    C ↑:
        Importance of the goal of min error ↑
        Classification error ↓, but also "smoothness" of decision boundary ↓

Useful SVM observation

- Support vectors are usually very few in number
    Only need to store the support vectors when doing prediction

- High-dimensional data are MORE likely to be LINEARLY separable
    That's why SVM is quite commonly used in text and image processing



Overfitting
Thursday, March 2, 2017 9:17 AM

Learning Error Measurement

f: Target function (unknown)
D: Target data distribution (unknown)
h: Hypothesis
    Specific SVM param set, neural network, ...
S: Training set of size n (drawn from D)

- Training error: $error_S(h)$
    Prop of examples in training set that h misclassifies
    Can be measured

- Testing error: $error_D(h)$
    Prob that h misclassifies a data instance drawn randomly from D
    Can't be measured, but we wish to know it
    Estimated through a test set

Overfitting

- Def: Hypothesis h overfits training data if $\exists h'$ such that:
    On training set: $error_S(h) < error_S(h')$
    Over entire distribution: $error_D(h) > error_D(h')$

- Occam's Razor: Prefer SIMPLE hypothesis, because:
    There are fewer simple hypotheses than complicated ones
    A simple hypothesis that fits the data is unlikely to be a coincidence

Over-fitting avoidance principle

- Test set drawn independently from training set
- Do not use test set for training
- Early stopping: Stop before reaching the point where training data are perfectly classified

k-fold Cross (Stratified) Validation

- Purpose: Select "best" hypothesis from available data
    Select model hyperparams resulting in smallest testing error (highest accuracy)
    For Gaussian kernel C-SVM: $(C, \sigma)$ producing smallest testing error

- Process:
    For each hyperparam combination:
        Divide m examples into k disjoint subsets
            Each of size m/k
            (Stratifying step) Prop of examples from each class should be approx EQUAL across subsets

            Subset | # Class 1 examples | # Class 2 examples
            A | 5 | 2
            B | 5 | 2
            C | 5 | 2
            D | 5 | 3

        Run learning process k times, each time:
            Validation set = 1 subset
            Training set = Other (k-1) subsets
        Calculate avg accuracy
    Choose hyperparam combination with highest avg accuracy
    (Now, all data can be used to train with these hyperparams)

- Why use k-fold:
    High prop of data used for training
    Also, all data used in computing error

Example: 1000 examples: |Class 1| = 600, |Class 2| = 400

    Run C-SVM, 4-fold stratified cross-validation
    Gaussian kernel
    $C \in \{2^{-4}, 2^{-3}, 2^{-2}, 2^{-1}, 2^{0}, 2^{1}, 2^{2}\}$
    $\sigma \in \{2^{-6}, 2^{-5}, 2^{-4}, 2^{-3}, 2^{2}, 2^{1}, 2^{2}, 2^{3}, 2^{4}, 2^{5}\}$ (10 values, as listed in the source)

    Subset | |Class 1| | |Class 2|
    A | 150 | 100
    B | 150 | 100
    C | 150 | 100
    D | 150 | 100

    For each $(C, \sigma)$:

    Learning time | Training set | Validation set | Validation error | Accuracy
    1 | BCD | A | 50% | 50%
    2 | ACD | B | 45% | 55%
    3 | ABD | C | 40% | 60%
    4 | ABC | D | 54% | 46%
    Avg accuracy: 52.75%

    Choose $(C, \sigma)$ with highest avg accuracy

    How many models trained in total: |Set C| × |Set $\sigma$| × k = 7 × 10 × 4 = 280

Leave-one-out Cross-Validation

- Train on (m - 1) examples
- Validate on 1 example
- Useful for SMALL data sets
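The stratified k-fold procedure can be sketched in a few lines. This is a minimal illustration, not the course's implementation: `stratified_folds` deals each class out round-robin, and `train_and_score` is a hypothetical callback standing in for "train a model on these indices and return validation accuracy".

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Split example indices into k folds with roughly equal class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal the members of each class out to the folds in turn.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def cross_validate(labels, k, train_and_score):
    """Average accuracy over k runs; each fold is the validation set exactly once."""
    folds = stratified_folds(labels, k)
    accuracies = []
    for held_out in range(k):
        validation = folds[held_out]
        training = [i for f in range(k) if f != held_out for i in folds[f]]
        accuracies.append(train_and_score(training, validation))
    return sum(accuracies) / k


if __name__ == "__main__":
    labels = [1] * 600 + [2] * 400
    for fold in stratified_folds(labels, 4):
        print(sum(labels[i] == 1 for i in fold), sum(labels[i] == 2 for i in fold))
```

With 600 and 400 examples and k = 4 this reproduces the 150/100 split per subset from the example. A hyperparameter search would call `cross_validate` once per $(C, \sigma)$ pair, which is where the 7 × 10 × 4 = 280 trained models come from.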



Neural Networks
Friday, March 10, 2017 5:00 PM

Real neuron

- Cell structure:
    Dendrite: Receives info from other cells
    Axon (1/cell): Transmits info from cell body
    Synapse: Package of chemical substances (transmitter) that influence other cells when released

- Signal transmission:
    Impulses arrive simultaneously and are added together
    Transmitter released from synapse enters dendrite
    If sufficiently strong: Electrical pulse sent down axon
    Pulse reaches synapse, releasing transmitter into other cells' bodies

- Properties:
    Fault-tolerant: Cells die all the time with no ill effect on the brain's overall functioning
    Graceful degradation: As conditions worsen, performance drops gradually (rather than sharply)
    Learning capability: Network can be modified to improve performance
    Neural network: Massively parallel computation

Artificial Neural Network

- Use complex network of simple computing elements → Mimic brain's function

- Structure:
    Unit (input, hidden, output)
    Weighted link

- Learning = Updating weights



Perceptron
Friday, March 10, 2017 5:00 PM

Perceptron

- Feed-forward network
- Only 1 layer of adjustable weights

Model of Single Perceptron

- Input: $x_1, x_2, \dots, x_n$
- Weight: $w_1, w_2, \dots, w_n$
- Activation function: Relates input & output
    $O = 1$ if $\sum_{i=0}^{n} w_i x_i \geq 0$, else $O = 0$
    (with $x_0 \equiv 1$, $w_0 \equiv -\theta$, where $\theta$ is the threshold)

Learning Capability

- Function can be represented by a single perceptron ⇔ Function is linearly separable

    AND ($I_1 \wedge I_2$): Decision boundary: $I_1 + I_2 - 1.5 = 0$
    OR ($I_1 \vee I_2$): Decision boundary: $I_1 + I_2 - 0.5 = 0$

- With more layers of sufficiently many perceptrons, any boolean function can be represented
    (Why: Any boolean function can be represented in Sum-of-Products or Product-of-Sums form)
    (Figures: networks for XOR and for a Boolean function of a, b, c)

Learning Linearly Separable Function through Single Perceptron

- Principle of weight-updating rule:
    Observed output (T) ≠ Predicted output (O) → Make small adjustment in weights to reduce the difference

- Algorithm:
    Randomly initialize weights + Choose learning rate $\eta$
    Repeat until all examples correctly predicted:
        For each sample input $x = (x_1, x_2, \dots, x_n)$, with corresponding output T:
            Calculate predicted output: $O$
            Update weights: $w_i \leftarrow w_i + \eta(T - O)x_i$

Example:

    Input ($I_1$, $I_2$) | Output T
    (5, 1) | 0
    (2, 1) | 0
    (1, 1) | 1
    (3, 3) | 1
    (4, 2) | 0
    (2, 3) | 1

    Initialize: w = (0, 0, 0), Learning rate $\eta$ = 1

    Iteration | $w_{old} = (-\theta, w_1, w_2)$ | $I = (1, I_1, I_2)$ | T | O | T-O | $w_{new}$
    1.1 | (0, 0, 0) | (1, 5, 1) | 0 | 1 | -1 | (-1, -5, -1)
    1.2 | (-1, -5, -1) | (1, 2, 1) | 0 | 0 | 0 | (-1, -5, -1)
    1.3 | (-1, -5, -1) | (1, 1, 1) | 1 | 0 | 1 | (0, -4, 0)
    1.4 | (0, -4, 0) | (1, 3, 3) | 1 | 0 | 1 | (1, -1, 3)
    1.5 | (1, -1, 3) | (1, 4, 2) | 0 | 1 | -1 | (0, -5, 1)
    1.6 | (0, -5, 1) | (1, 2, 3) | 1 | 0 | 1 | (1, -3, 4)
    2.1 | (1, -3, 4) | (1, 5, 1) | 0 | 0 | 0 | (1, -3, 4)
    2.2 | (1, -3, 4) | (1, 2, 1) | 0 | 0 | 0 | (1, -3, 4)
    2.3 | (1, -3, 4) | (1, 1, 1) | 1 | 1 | 0 | (1, -3, 4)
    2.4 | (1, -3, 4) | (1, 3, 3) | 1 | 1 | 0 | (1, -3, 4)
    2.5 | (1, -3, 4) | (1, 4, 2) | 0 | 0 | 0 | (1, -3, 4)
    2.6 | (1, -3, 4) | (1, 2, 3) | 1 | 1 | 0 | (1, -3, 4)

    Decision boundary: $-3I_1 + 4I_2 + 1 = 0$


- Perceptron convergence theorem:
    If training examples are linearly separable: The perceptron weight-updating rule
        Always converges to a solution
        In a finite number of steps, for any initial weight choice
    If examples are not linearly separable: Perceptron may fail to converge
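The perceptron weight-updating rule can be replayed on the six-example data set from the worked table. This is a sketch under two assumptions read off that table: the unit outputs 1 when the weighted sum is at least 0 (iteration 1.1 predicts 1 with all-zero weights), and weights start at (0, 0, 0) with learning rate 1. The function names are illustrative.

```python
def predict(weights, augmented_input):
    """Threshold unit: 1 if the weighted sum is >= 0, else 0."""
    net = sum(w * x for w, x in zip(weights, augmented_input))
    return 1 if net >= 0 else 0


def train_perceptron(samples, weights, rate=1, max_epochs=100):
    """Apply w <- w + rate * (T - O) * x until one full pass makes no mistakes."""
    weights = list(weights)
    for _ in range(max_epochs):
        mistakes = 0
        for inputs, target in samples:
            augmented = (1,) + tuple(inputs)  # x0 = 1 carries the bias weight w0
            output = predict(weights, augmented)
            if output != target:
                mistakes += 1
                weights = [w + rate * (target - output) * x
                           for w, x in zip(weights, augmented)]
        if mistakes == 0:
            return weights
    raise RuntimeError("no convergence; data may not be linearly separable")


SAMPLES = [((5, 1), 0), ((2, 1), 0), ((1, 1), 1),
           ((3, 3), 1), ((4, 2), 0), ((2, 3), 1)]

if __name__ == "__main__":
    print(train_perceptron(SAMPLES, (0, 0, 0)))
```

The run ends with weights (1, -3, 4), i.e. the decision boundary $-3I_1 + 4I_2 + 1 = 0$. The `max_epochs` guard is there because, as the convergence theorem warns, the loop need not terminate on data that is not linearly separable.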



Adaline
Saturday, March 25, 2017 9:27 AM
Adaline (Adaptive Linear Element)

- Def:
    Feedforward network
    1 layer of adjustable weights

- Terminologies:
    Training example $d$:
        Input: $x_d$
        Target output: $t_d$
    Weight: $w$
    Predicted output: $o_d = w^T x_d$

- Objective: Min $E(w) = \frac{1}{2}\sum_d (t_d - o_d)^2$

* NOTE: Adaline is "linear regression" in statistics

Fundamentals of Weight-Finding

- Principle: Gradient descent
    Keep going "downhill" from any point on the error surface
    Can get stuck in a local optimal solution
    However, for Adaline's error surface: 1 global min only

- Mathematics: Gradient descent
    Gradient at $w$: $\nabla E(w)$, with $\partial E/\partial w_i = -\sum_d (t_d - o_d)x_{id}$
    Move $w$:
        Direction: Opposite to $\nabla E(w)$
        Magnitude: Small fraction of $\nabla E(w)$
        $w \leftarrow w - \eta\nabla E(w)$

Weight-Updating Methods

Batch Gradient Descent
    Principle: In each iteration: Use ALL examples to update weights
    Algorithm:
    - Randomly initialize $w$
    - (*) Repeat until termination condition met:
        Initialize $\Delta w_i \leftarrow 0$
        For each $d$:
            Calculate $o_d$
            For each $w_i$: $\Delta w_i \leftarrow \Delta w_i + \eta(t_d - o_d)x_{id}$
        For each $w_i$: $w_i \leftarrow w_i + \Delta w_i$
    Pros & Cons:
    - Gradient change summed over whole data set each iteration
        Expensive computation for big data set: Can't store whole set in memory → Have to read from disk → Slow
    - For big data set: Always need multiple (*) passes (through all training data) to get good result

Stochastic Gradient Descent
    Principle: In each iteration: Use 1 example to update weights
    Algorithm:
    - Randomly initialize $w$
    - (*) Repeat until termination condition met:
        For each $d$:
            Calculate $o_d$
            For each $w_i$: $w_i \leftarrow w_i + \eta(t_d - o_d)x_{id}$
    * NOTE: $w_i$ changes after each training example → IMMEDIATELY affects the $o_d$ calculation of the NEXT example
    Pros & Cons:
    - For big data set: Possible to get good result after 1 (*) pass (through all training data)
    - Generally moves in the right direction, but not always
        Have to use a smaller step ($\eta$)

- Mini-batch gradient descent:
    In each iteration: Use b examples to update weights
    Often faster than stochastic

- Termination condition:
    Reach preset # (*) iterations

Gradient Descent: Common Issues

- Overshooting:
    Problem: $\eta$ too large → Overstep min
    Solution:
        Test $\eta$ on a small subset of the training set
        Performs well → Use for whole set
        # (*) iterations ↑ → $\eta$ gradually ↓

- Gradient calculation:
    Problem: Sometimes, gradient can't be derived easily & calculated precisely (not the case for Adaline)
        Example: Multi-layer perceptron
    Solution: Check gradient using finite difference
        Repeat for different weights and examples:
            Compute $E(w)$
            Compute gradient using formula previously derived
            Set: $w' = w$ with one component $w_i$ increased by $\epsilon$ ($\epsilon$ very small)
            Compute $E(w')$
            Verify: $(E(w') - E(w))/\epsilon \approx \partial E/\partial w_i$
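The finite-difference gradient check can be sketched for Adaline. Assumptions: the squared-error objective $E(w) = \frac{1}{2}\sum_d (t_d - o_d)^2$ with a linear output, a central difference rather than a one-sided one, and a made-up three-example data set whose first input component is the constant 1.

```python
def predict(weights, x):
    """Adaline output: plain weighted sum, no threshold."""
    return sum(w * xi for w, xi in zip(weights, x))


def error(weights, data):
    """E(w) = 1/2 * sum over examples of (t - o)^2."""
    return 0.5 * sum((t - predict(weights, x)) ** 2 for x, t in data)


def gradient(weights, data):
    """Analytic gradient: dE/dw_i = -sum_d (t_d - o_d) * x_id."""
    return [-sum((t - predict(weights, x)) * x[i] for x, t in data)
            for i in range(len(weights))]


def numeric_gradient(weights, data, eps=1e-6):
    """Central finite difference, one weight at a time."""
    grads = []
    for i in range(len(weights)):
        plus, minus = list(weights), list(weights)
        plus[i] += eps
        minus[i] -= eps
        grads.append((error(plus, data) - error(minus, data)) / (2 * eps))
    return grads


def batch_step(weights, data, rate):
    """One batch gradient-descent update: w <- w - rate * grad E(w)."""
    return [w - rate * g for w, g in zip(weights, gradient(weights, data))]


DATA = [((1.0, 2.0, 1.0), 1.0), ((1.0, -1.0, 0.5), 0.0), ((1.0, 0.0, -2.0), -1.0)]

if __name__ == "__main__":
    w = [0.3, -0.2, 0.1]
    print(gradient(w, DATA))
    print(numeric_gradient(w, DATA))
```

If the two printed vectors disagree beyond rounding, the derived formula (or its implementation) is wrong. The same check carries over unchanged to networks where the gradient is harder to derive.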



Multilayer Perceptron & Back-Propagation
Monday, May 22, 2017 9:24 PM

Activation (Hidden Unit Transfer) Function

Must be nonlinear: Linear hidden units can be duplicated by a single-layer network

Step Unit
    - Non-differentiable
    - Not suitable for gradient descent

Sigmoid Unit
    - $\sigma(x) = 1/(1 + e^{-x})$
    - $\sigma(x) > 0$
    - Nice derivative: $\sigma'(x) = \sigma(x)(1 - \sigma(x))$
    - x very small: $\sigma(x) \approx$ Linear function
    - x very big: $\sigma(x) \approx$ Step function

Radial Basis Function (RBF)
    - Produces localized response to input: Significant nonzero output only when input falls within a small localized region
    - $f_{output}(x) = g_1(x) + g_2(x) + \dots$

Rectified Linear Unit (ReLU)
    - $f(x) = \max(0, x)$
    - Efficient computation
    - Simple gradient: $f'(x) = 1$ if $x > 0$, else $0$
    - Most popular in deep networks
    NOTE: Be careful when initializing weights → Avoid $f'(x) = 0$

Leaky ReLU
    - Efficient as ReLU
    - But: $f'(x) \neq 0$

Network Construction for Multiclass Classification

- $m \geq 2$ classes:
    1 output/class
    Object → Class $i$ whose output is largest

- m = 2 classes: Special method:
    1 output
    y > 0 → Object ∈ YES class
    y ≤ 0 → Object ∈ NO class

Universal Approximation

- 1 hidden layer of sigmoid units is sufficient to approximate any well-behaved function to arbitrary precision (1)

- Why need > 2 hidden layers: (2)
    Fewer weights (params)
    Same accuracy level

* NOTE: Approximate complicated function:
    (1) Use lots of hidden units in 1 layer
        $w_1\sigma_1(x) + w_2\sigma_2(x) + w_3\sigma_3(x) + \dots$
    (2) Use many hidden layers
        $\sigma(\sigma(\sigma(\dots)))$

Gradient Computation: Single Sigmoid Unit

    $E(w) = \frac{1}{2}\sum_d (t_d - o_d)^2$, $o_d = \sigma(w^T x_d)$
    $\partial E/\partial w_i = -\sum_d (t_d - o_d)\,o_d(1 - o_d)\,x_{id}$

Gradient Computation: Multiple Layers

    $N_o$ output units: $E = \frac{1}{2}\sum_{k=1}^{N_o} (t_k - o_k)^2$

- Weights between Hidden & Output layer:
    $\partial E/\partial w_{kj} = -\delta_k u_j$, with $\delta_k = o_k(1 - o_k)(t_k - o_k)$

- Weights between Input & Hidden layer:
    $\partial E/\partial w_{ji} = -\delta_j x_i$, with $\delta_j = u_j(1 - u_j)\sum_k w_{kj}\delta_k$

Backpropagation: Gradient Descent for Multilayer Network

- Algorithm (Stochastic version):
    Randomly initialize $w_{ji}$
    Repeat until convergence: For each training example $(x, t)$:
        Propagate input forward:
            Compute hidden outputs $u_j$ and network outputs $o_k$
        Propagate error backward:
            For each Output Unit k:
                $\delta_k = o_k(1 - o_k)(t_k - o_k)$
            For each Hidden Unit j:
                $\delta_j = u_j(1 - u_j)\sum_k w_{kj}\delta_k$
        Update weights:
            For each connection (i → j):
                $w_{ji} \leftarrow w_{ji} + \eta\,\delta_j\,Out_i$

    NOTE: $Out_i$ = Unit i's output, which is:
        $u_i$: For hidden unit i
        $x_i$: For input unit i

- Efficiency: $O(W^2)$
    W = Total number of weights

Gradient Descent: Practice issue

- Speed up training:
    Use momentum: $\Delta w_{ji}(n) = \eta\,\delta_j\,Out_i + \alpha\,\Delta w_{ji}(n - 1)$
    Dynamically adapt $\eta$
    Exploit error surface's high-order info

- Escape poor local minima:
    Train multiple networks, each initialized with different weights
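The stochastic backpropagation step with sigmoid units can be sketched as follows. Assumptions, all made for illustration: one hidden layer, no bias terms, squared error $E = \frac{1}{2}\sum_k (t_k - o_k)^2$, and weights stored as `w[to_unit][from_unit]`. The delta formula for output units is the one in the notes, $\delta_k = o_k(1 - o_k)(t_k - o_k)$.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(w_hidden, w_output, x):
    """Propagate input forward; returns (hidden outputs u_j, network outputs o_k)."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    outputs = [sigmoid(sum(w * u for w, u in zip(row, hidden))) for row in w_output]
    return hidden, outputs


def backprop_deltas(w_output, hidden, outputs, targets):
    """Propagate error backward: delta_k for output units, delta_j for hidden units."""
    delta_out = [o * (1 - o) * (t - o) for o, t in zip(outputs, targets)]
    delta_hidden = [
        u * (1 - u) * sum(w_output[k][j] * delta_out[k] for k in range(len(delta_out)))
        for j, u in enumerate(hidden)
    ]
    return delta_hidden, delta_out


def sgd_step(w_hidden, w_output, x, targets, rate):
    """One stochastic update: w_ji <- w_ji + rate * delta_j * Out_i."""
    hidden, outputs = forward(w_hidden, w_output, x)
    delta_hidden, delta_out = backprop_deltas(w_output, hidden, outputs, targets)
    new_output = [[w + rate * delta_out[k] * hidden[j] for j, w in enumerate(row)]
                  for k, row in enumerate(w_output)]
    new_hidden = [[w + rate * delta_hidden[j] * x[i] for i, w in enumerate(row)]
                  for j, row in enumerate(w_hidden)]
    return new_hidden, new_output


def squared_error(w_hidden, w_output, x, targets):
    _, outputs = forward(w_hidden, w_output, x)
    return 0.5 * sum((t - o) ** 2 for o, t in zip(outputs, targets))
```

Since $\partial E/\partial w_{ji} = -\delta_j\,Out_i$, adding $\eta\,\delta_j\,Out_i$ is a gradient-descent step. A finite-difference check on any single weight is a cheap way to confirm the deltas are wired correctly.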



Convolutional Neural Network
Monday, May 22, 2017 12:50 AM

Convolution Operator

- Def:
    Continuous: $(f * g)(t) = \int f(\tau)\,g(t - \tau)\,d\tau$
    Discrete: $(f * g)[n] = \sum_m f[m]\,g[n - m]$
    2-D: $(I * K)(i, j) = \sum_m \sum_n I(m, n)\,K(i - m, j - n)$
    I: Image; K: Mask (Kernel)

- Properties:
    Commutative: f * g = g * f
    Associative: f * (g * h) = (f * g) * h

Convolutional Network: Special Properties

- Sparse connectivity: Due to small convolution kernel

- Feature Hierarchy: Each hidden unit is only connected to a local subset of units in the previous layer

- Shared weights:
    Due to same kernel used throughout image
    Helps detect features regardless of their positions in images → Robustness ↑
    # params to learn ↓

- Zero-padding:
    Convolution → Representation shrinks at each layer → Limits # layers
    Solution: Add zeros to lost positions

- Activation function:
    Convolution is linear → Need nonlinear activation function
    Commonly used: ReLU: y = max(x, 0)

Pooling Layer

- Motivation:
    Once a feature is detected, only its approx position relative to other features is relevant
    Image of number 7:
        Endpoint of roughly horizontal segment in upper left
        Corner in upper right area
        Endpoint of roughly vertical segment in lower portion
    Different object instances:
        Different absolute positions of features
        But their relative positions to each other are the same

- Technique: Max-pooling: For each sub-region, output max value

- Advantages:
    Effect of shifting, rotation ↓ → Robustness ↑
    Computational burden on next layer ↓
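The convolution, ReLU and max-pooling steps can be sketched on nested lists. Assumptions for this illustration: "valid" convolution with no zero-padding (so the output shrinks), true convolution with the kernel flipped, and non-overlapping pooling windows.

```python
def convolve2d(image, kernel):
    """'Valid' 2-D convolution: flip the kernel in both axes, then slide it over the image."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    flipped = [row[::-1] for row in kernel[::-1]]
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    return [
        [
            sum(image[i + a][j + b] * flipped[a][b]
                for a in range(k_rows) for b in range(k_cols))
            for j in range(out_cols)
        ]
        for i in range(out_rows)
    ]


def relu(feature_map):
    """Element-wise y = max(x, 0)."""
    return [[max(value, 0) for value in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Output the max of each non-overlapping size x size sub-region."""
    return [
        [
            max(feature_map[i + a][j + b] for a in range(size) for b in range(size))
            for j in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for i in range(0, len(feature_map) - size + 1, size)
    ]


if __name__ == "__main__":
    image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    kernel = [[1, 0], [0, -1]]
    print(convolve2d(image, kernel))  # 3x3 input shrinks to 2x2 without padding
```

The same small kernel is reused at every position (shared weights), and each output value depends only on a 2×2 patch of the input (sparse connectivity).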



Reinforcement Learning (RL)
Monday, May 22, 2017 10:59 PM
Basic concepts

- Reinforcement Learning:
    Interact with env → Get EVALUATIVE output/reward
    Learn mapping State → Action
    Goal: Max long-term reward

    Example:
        Objective: Min time to reach "goal"
        States: Car's position & velocity
        Actions: Forward/Reverse/None
        Rewards:
            0, if goal reached
            -1, otherwise

- World/Environment:
    Deterministic world
        Actions have certain outcomes
        From state s, take action a → Definitely reach state s'
        $\delta(s, a)$ = New state reached from state s through action a
        $r(s, a)$ = Reward got by taking action a when at state s
    Non-deterministic world
        Actions have uncertain outcomes
        From state s, take action a, reach:
        - State s', with prob p'
        - State s'', with prob p''
        $P(s, s', a)$ = P(Transition s → s' | Action a)
        $R(s, s', a)$ = E(Reward of Transition s → s' | Action a)

- Policy: Map State → Action:
    Deterministic policy
        At state s, ALWAYS take action a
        $\pi: S \to A$, $s \mapsto \pi(s) = a$
    Non-deterministic policy
        At state s, take:
        - Action a', with prob p'
        - Action a'', with prob p''
        $\pi(s, a)$ = P(Take action a | State s)

- Discount factor $\gamma$:
    Total Reward = Reward(t) + Reward(t + 1) * (Discount Factor) + Reward(t + 2) * (Discount Factor)$^2$ + ...
    t: Time
    Discount factor < 1 (Future reward not worth as much as current reward)

Markov Assumption

    $s_{t+1} = \delta(s_t, a_t)$
    $r_t = r(s_t, a_t)$
    Current reward & next state depend ONLY on current state & action

Policy Evaluation

- Given policy $\pi$ → Compute state-value function $V^\pi$

- Bellman equation:
    Deterministic world:
        $V^\pi(s) = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots = r(s, a) + \gamma V^\pi(\delta(s, a))$, with $a = \pi(s)$
    Non-deterministic world:
        $V^\pi(s) = \sum_{s'} P(s, s', \pi(s))\,[R(s, s', \pi(s)) + \gamma V^\pi(s')]$

    Example:
        Policy (deterministic): State Good → Action Produce
                                State Bad (Broken) → Action Repair
        $V(Good) = 0.9(1 + \gamma V(Good)) + 0.1(0 + \gamma V(Bad))$
        $V(Bad) = 0.7(0 + \gamma V(Good)) + 0.3(-10 + \gamma V(Bad))$

- Computation method:
    Solve linear system of $V(s_1), V(s_2), \dots$
    Iterative method:
        Randomly initialize $V_0(s_1), V_0(s_2), \dots$
        Run until convergence:
            $V_{k+1}(s) = \sum_{s'} P(s, s', \pi(s))\,[R(s, s', \pi(s)) + \gamma V_k(s')]$
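The iterative policy-evaluation method can be sketched on the Good/Bad machine example. The transition probabilities and rewards are those in the two Bellman equations of the example; the discount factor 0.9 and the zero initial values are assumptions made here, since the notes do not fix them.

```python
GAMMA = 0.9  # assumed value; the notes leave the discount factor unspecified

# Under the fixed policy: state -> list of (probability, reward, next_state).
TRANSITIONS = {
    "Good": [(0.9, 1.0, "Good"), (0.1, 0.0, "Bad")],
    "Bad": [(0.7, 0.0, "Good"), (0.3, -10.0, "Bad")],
}


def backup(values, state, gamma=GAMMA):
    """Right-hand side of the Bellman equation for one state."""
    return sum(p * (r + gamma * values[nxt]) for p, r, nxt in TRANSITIONS[state])


def evaluate_policy(gamma=GAMMA, tolerance=1e-10):
    """Repeat V_{k+1}(s) = sum_s' P * (R + gamma * V_k(s')) until values stop changing."""
    values = {state: 0.0 for state in TRANSITIONS}  # zeros instead of random init
    while True:
        new_values = {state: backup(values, state, gamma) for state in TRANSITIONS}
        change = max(abs(new_values[s] - values[s]) for s in TRANSITIONS)
        values = new_values
        if change < tolerance:
            return values


if __name__ == "__main__":
    print(evaluate_policy())
```

With $\gamma = 0.9$ the two equations also form a 2×2 linear system ($0.19V_G - 0.09V_B = 0.9$, $-0.63V_G + 0.73V_B = -3$), so the iterative result can be compared against the direct solution.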



Learning Optimal (Deterministic) Policy
Monday, May 22, 2017 11:30 PM

Basic Concepts

- Learning situation:
    Situation | Condition | Learning method
    Known environment | Known $\delta(s, a)$, $r(s, a)$ | Policy iteration; Value iteration
    Unknown environment | Unknown $\delta(s, a)$, $r(s, a)$ | Q-learning

- Optimal policy: $\pi^* = \arg\max_\pi V^\pi(s)\ \forall s$

- Optimal state-value function: $V^*(s) = \max_\pi V^\pi(s)$

- Bellman Optimality Condition:
    Deterministic world:
        $V^*(s) = \max_a [r(s, a) + \gamma V^*(\delta(s, a))]$
    Non-deterministic world:
        $V^*(s) = \max_a \sum_{s'} P(s, s', a)\,[R(s, s', a) + \gamma V^*(s')]$

    Example:
        $V^*(A) = \max(\,0.5(5 + \gamma V^*(A)) + 0.5(5 + \gamma V^*(B)),\ 10 + \gamma V^*(B)\,)$
        $V^*(B) = 10 + \gamma V^*(B)$

- Q-function (Action-value function):
    $Q^\pi(s, a) \equiv r(s, a) + \gamma V^\pi(\delta(s, a))$, given $\pi$, s, a

    At optimal $\pi^*$:
        Deterministic world:
            $Q^*(s, a) = r(s, a) + \gamma V^*(\delta(s, a)) = r(s, a) + \gamma \max_{a'} Q^*(\delta(s, a), a')$
        Non-deterministic world:
            $Q^*(s, a) = \sum_{s'} P(s, s', a)\,[R(s, s', a) + \gamma \max_{a'} Q^*(s', a')]$

Learning in Known Environment: Policy Iteration

- Randomly initialize $\pi(s)\ \forall s$

- Repeat:
    Evaluate current policy (compute $V^\pi$)
    Improve policy:
        policyStable ← True
        For each $s \in S$:
            newAction = $\arg\max_a [r(s, a) + \gamma V^\pi(\delta(s, a))]$
            If newAction ≠ $\pi(s)$:
                $\pi(s)$ ← newAction
                policyStable ← False
    Stop if policyStable = True

Learning in Known Environment: Value Iteration
(Often converges faster than Policy Iteration)

- Randomly initialize $V(s)\ \forall s$

- Repeat until $\Delta < \epsilon$:
    $\Delta \leftarrow 0$
    For each $s \in S$:
        oldStateValue ← $V(s)$
        $V(s) \leftarrow \max_a [r(s, a) + \gamma V(\delta(s, a))]$
        $\Delta \leftarrow \max(\Delta, |oldStateValue - V(s)|)$

- Output $\pi^*$:
    $\pi^*(s) \leftarrow \arg\max_a [r(s, a) + \gamma V(\delta(s, a))]$

Learning in Unknown Environment: Q-Learning
(Learn to approximate Q-function)

- Initialize $Q(s, a)$

- Repeat sufficiently (each repetition = 1 training episode):
    Choose current state s (may or may not be random)
    Repeat until Terminate:
        Select & Execute action a (based on some policy)
        Observe: Immediate reward r
                 New state s'
        Update: $Q(s, a) \leftarrow r + \gamma \max_{a'} Q(s', a')$
        s ← s'

    Terminate condition:
        Reach allowed max # iterations
        Reach goal state

Q-Learning: Remarks

- Reward function is non-deterministic
    Each time (s, a) is visited, get different r
    Solution: Update through avg
        $Q(s, a) \leftarrow (1 - \alpha)Q(s, a) + \alpha[r + \gamma \max_{a'} Q(s', a')]$
        Equivalently: $Q(s, a) \leftarrow Q(s, a) + \alpha[r + \gamma \max_{a'} Q(s', a') - Q(s, a)]$
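Value iteration can be sketched on the two-state A/B example from the Bellman optimality condition. The rewards and probabilities are read off the two $V^*$ equations; the action names `a1`, `a2`, `b1`, the discount factor 0.9 and the zero initialization are assumptions made for illustration.

```python
GAMMA = 0.9  # assumed; the example in the notes does not fix the discount factor

# state -> action -> list of (probability, reward, next_state). Action names are hypothetical.
MODEL = {
    "A": {
        "a1": [(0.5, 5.0, "A"), (0.5, 5.0, "B")],
        "a2": [(1.0, 10.0, "B")],
    },
    "B": {
        "b1": [(1.0, 10.0, "B")],
    },
}


def action_value(values, state, action, gamma):
    """Expected reward plus discounted value of the next state."""
    return sum(p * (r + gamma * values[nxt]) for p, r, nxt in MODEL[state][action])


def value_iteration(gamma=GAMMA, epsilon=1e-9):
    """In-place sweeps until the largest change in a sweep falls below epsilon."""
    values = {state: 0.0 for state in MODEL}  # zeros instead of random init
    while True:
        delta = 0.0
        for state in MODEL:
            old_value = values[state]
            values[state] = max(action_value(values, state, a, gamma)
                                for a in MODEL[state])
            delta = max(delta, abs(old_value - values[state]))
        if delta < epsilon:
            break
    policy = {
        state: max(MODEL[state], key=lambda a: action_value(values, state, a, gamma))
        for state in MODEL
    }
    return values, policy


if __name__ == "__main__":
    print(value_iteration())
```

With $\gamma = 0.9$, $V^*(B) = 10/(1 - 0.9) = 100$, and in A the second option gives $10 + 0.9 \cdot 100 = 100$, which beats the first option's 95, so the greedy policy picks `a2` in state A.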



    Training progress ↑ → $\alpha$ ↓, e.g. $\alpha_n = 1/(1 + visit_n(s, a))$
    $visit_n(s, a)$: # times (s, a) visited until nth iteration

- Action selection policy:

    Greedy policy:
        At s, select "best" a: $a = \arg\max_a Q(s, a)$
        Problem:
            Over-commits to actions with high $Q$ found early
            Fails to explore potentially high-reward actions

    $\epsilon$-Greedy policy: State-of-the-art
        Greedy most of the time
        Occasionally (with prob $\epsilon$), take random action

    Exploit & Explore:
        Select action $a_i$ with prob:
            $P(a_i \mid s) = k^{Q(s, a_i)} / \sum_j k^{Q(s, a_j)}$, with constant $k > 0$
        Exploits high-$Q$ actions, but still has a chance to explore others
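The greedy and $\epsilon$-greedy selection rules can be sketched as below. The Q-values are stored in a plain dict from action name to estimate, and the example actions are made up for illustration.

```python
import random


def greedy_action(q_values):
    """Pick the action with the highest estimated Q-value."""
    return max(q_values, key=q_values.get)


def epsilon_greedy_action(q_values, epsilon, rng=random):
    """With probability epsilon explore uniformly at random; otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.choice(sorted(q_values))
    return greedy_action(q_values)


if __name__ == "__main__":
    q = {"left": 0.2, "right": 1.5, "stay": -0.3}
    rng = random.Random(0)
    picks = [epsilon_greedy_action(q, 0.1, rng) for _ in range(20)]
    print(picks)
```

Setting `epsilon` to 0 recovers the pure greedy policy, including its tendency to over-commit to whatever looked best early on. Passing an explicit `random.Random` instance keeps runs reproducible.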



Generalization & Function Approximation
Tuesday, May 23, 2017 12:32 AM

Q-Function Feature-Based Representation

- Problem with Q-learning: When |S| too big:
    Too many states to visit all
    Too many entries in Q-table to hold in memory
    Solution: Replace Q-table with function approximator

- Feature-based Representation:
    $Q(s, a) = w_1 f_1(s, a) + \dots + w_n f_n(s, a)$
    $f_k(s, a)$: Feature function
        Returns a real number
        Captures important properties of (s, a)
        Pacman: Distance to closest ghost, ...
    Advantage: n < # States

    Weight update:
        Recall: $Q(s, a) \leftarrow Q(s, a) + \alpha[Difference]$
                Difference = $(r + \gamma \max_{a'} Q(s', a')) - Q(s, a)$
        Formulation:
            Target: $t = r + \gamma \max_{a'} Q(s', a')$
            $w_i \leftarrow w_i + \alpha[Difference]\,f_i(s, a)$

Q-Learning with Linear Approximation

- Initialize $w_i$

- Repeat sufficiently (each repetition = 1 training episode):
    Choose current state s
    Repeat until Terminate:
        Select & Execute action
        Observe: Immediate reward r
                 New state s'
        Update weight:
            Difference = $r + \gamma \max_{a'} Q(s', a') - Q(s, a)$
            $w_i \leftarrow w_i + \alpha[Difference]\,f_i(s, a)$, i = 1..n
        s ← s'

Q-Learning with Linear Approximation: Issues

Issue 1
    Observation: Current params determine the next data sample that gets trained on
    What might go wrong:
        - Strong correlation between samples
        - Learning directly from consecutive samples is inefficient
        → Poor local minima/divergence
    Solution: Experience replay
        - Pool agent's experience over many episodes into a set D
            Experience e = (s, a, r, s')
        - Draw samples randomly from D → Update weights
    Why solution works:
        - Randomized samples → Break correlation
        - Each experience potentially used in many weight updates → Data efficiency ↑
        - Avoids oscillation/divergence

Issue 2
    Observation: Q used in both evaluating (s, a) & selecting action
        $Q(s, a) \leftarrow Q(s, a) + \alpha[r + \gamma \max_{a'} Q(s', a') - Q(s, a)]$
        $Q(s, a)$: Evaluation
        $\max_{a'} Q(s', a')$: Action selection
    What might go wrong: Divergence/Oscillation
    Solution: Target network
        - Every C updates: Clone network Q (= weight set) → Target network $\hat{Q}$
        - Next C updates: Use $\hat{Q}$ → Generate Q-learning targets
    Why solution works:
        - Delay between the time Q is updated & the time targets are affected by the new update
        - Divergence/Oscillation ↓
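A single weight update for Q-learning with linear function approximation can be sketched as follows. The feature vectors, rate and discount in the usage example are invented numbers; `next_features_by_action` is a hypothetical dict mapping each action available in s' to its feature vector, and an empty dict is treated as a terminal state with future value 0.

```python
def q_value(weights, features):
    """Linear approximation: Q(s, a) = w_1 * f_1(s, a) + ... + w_n * f_n(s, a)."""
    return sum(w * f for w, f in zip(weights, features))


def linear_q_update(weights, features, reward, next_features_by_action, rate, gamma):
    """One step of w_i <- w_i + rate * Difference * f_i(s, a)."""
    best_next = max(
        (q_value(weights, f) for f in next_features_by_action.values()),
        default=0.0,  # no actions in s': terminal state
    )
    difference = reward + gamma * best_next - q_value(weights, features)
    return [w + rate * difference * f for w, f in zip(weights, features)]


if __name__ == "__main__":
    weights = [0.5, -1.0]
    features = [1.0, 2.0]                             # f(s, a) for the action taken
    next_features = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
    print(linear_q_update(weights, features, 1.0, next_features, rate=0.1, gamma=0.9))
```

Here $Q(s, a) = -1.5$ and the target is $1 + 0.9 \cdot 0.5 = 1.45$, so the difference is 2.95 and each weight moves in proportion to its feature value. Experience replay would call the same function on transitions drawn at random from a stored pool, and a target network would evaluate `best_next` with a periodically cloned copy of `weights`.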
