
# Mathematical Background

## Monday, February 20, 2017 10:01 AM

## Linear Programming
- Linear objective function
- Linear constraints
- Min cᵀx subject to Ax ≤ b

## Quadratic Programming
- Quadratic objective function
- Linear constraints
- Min ½xᵀGx + cᵀx subject to Ax ≤ b
- G: (Positive semi-definite) matrix

## Method of Lagrange Multipliers
- Min f(x) subject to g(x) = 0
- Define Lagrangian: L(x, λ) = f(x) − λ g(x)
- At optimality: ∇L = 0

Example: Find the shortest distance from the origin to the hyperbola x² + 8xy + 7y² = 225:

Min x² + y²
subject to x² + 8xy + 7y² − 225 = 0

L(x, y, λ) = x² + y² − λ(x² + 8xy + 7y² − 225)

At optimality:
∂L/∂x = 2x − λ(2x + 8y) = 0 … (I)
∂L/∂y = 2y − λ(8x + 14y) = 0 … (II)

(x, y) ≠ (0, 0) → the linear system (I), (II) has a nontrivial solution → its determinant is 0:
9λ² + 8λ − 1 = 0
λ = 1/9, λ = −1

λ = −1:
Substitute into (I): x = −2y
Substitute into constraint: −5y² = 225 → No solution

λ = 1/9:
Substitute into (I): y = 2x
Substitute into constraint: x² = 5, y² = 20
Min distance² = x² + y² = 25 → Min distance = 5
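The worked example can be checked numerically. A minimal sketch (plain Python, no extra libraries); the candidate point (x, y) = (√5, 2√5) comes from the λ = 1/9 branch above:

```python
import math

# Candidate optimum from the Lagrange analysis: lambda = 1/9, y = 2x, x^2 = 5
x = math.sqrt(5.0)
y = 2.0 * x

# The point must lie on the hyperbola x^2 + 8xy + 7y^2 = 225
constraint = x**2 + 8*x*y + 7*y**2
assert abs(constraint - 225.0) < 1e-9

# Squared distance from the origin is 25, so the shortest distance is 5
dist_sq = x**2 + y**2
print(round(dist_sq, 9), round(math.sqrt(dist_sq), 9))  # 25.0 5.0
```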

## COMP 4211 Page 1

Support Vector Machine (SVM)
Monday, February 20, 2017 9:37 AM

- Given: Training set {(xᵢ, yᵢ)}
  - (xᵢ, yᵢ): Training pattern
  - xᵢ: Input
  - yᵢ ∈ {+1, −1}: Output (label)
- Assumption: Linearly separable
  - ∃ linear surface that separates the 2 classes
- Find: Hyperplane w·x + b = 0 that perfectly separates the 2 classes
- Idea: Max margin
  - Hyperplane should separate the 2 classes: yᵢ(w·xᵢ + b) ≥ 1
  - Let x₊ and x₋ be support vectors that belong to the (+) & (−) class, respectively
  - Margin = Magnitude of the projection of (x₊ − x₋) on w/‖w‖ = 2/‖w‖
- Objective (constrained optimization problem):
  - Min ½‖w‖² subject to yᵢ(w·xᵢ + b) ≥ 1 ∀i

Finding the solution
- Apply method of Lagrangian:
  - Associate 1 Lagrange multiplier αᵢ with each constraint
- Dual problem:
  - Max Σᵢ αᵢ − ½ Σᵢ Σⱼ αᵢαⱼ yᵢyⱼ (xᵢ·xⱼ)
  - subject to Σᵢ αᵢyᵢ = 0 (linear constraint), αᵢ ≥ 0
  - NOTE: Quadratic programming → Can be solved numerically by any general-purpose optimization package → Achieves global optimality (convex)
- Optimal hyperplane: w = Σᵢ αᵢyᵢxᵢ
  - αᵢ: Determines HOW MUCH xᵢ contributes to the solution (αᵢ ↑ → contribution ↑)
  - αᵢ > 0: xᵢ is a SUPPORT vector:
    - Lies on the margins
    - Contributes to the solution
  - αᵢ = 0: xᵢ does NOT contribute to the solution
    - If removed/moved → Solution does NOT change
- Find b:
  - Find support vectors x₊ and x₋ that belong to the (+) & (−) class, respectively:
  - b = −½ (w·x₊ + w·x₋)
- Perform testing (Given a new x, check which class (+/−) it belongs to):
  - Check sign(w·x + b)
  - Or: sign(Σᵢ αᵢyᵢ(xᵢ·x) + b)
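A tiny numerical illustration of the solution formulas, on an assumed two-point training set (not from the course material). For this set the dual optimum is α₁ = α₂ = 0.5, both points are support vectors, and the code checks that w = Σᵢαᵢyᵢxᵢ and b put both points exactly on the margins:

```python
# Assumed toy training set: x1=(1,0) with y1=+1, x2=(-1,0) with y2=-1.
X = [(1.0, 0.0), (-1.0, 0.0)]
y = [1.0, -1.0]
alpha = [0.5, 0.5]                 # dual optimum for this toy set

# Dual equality constraint: sum_i alpha_i y_i = 0
assert abs(sum(a*yi for a, yi in zip(alpha, y))) < 1e-12

# w = sum_i alpha_i y_i x_i
w = [sum(a*yi*xi[d] for a, yi, xi in zip(alpha, y, X)) for d in range(2)]

dot = lambda u, v: sum(ui*vi for ui, vi in zip(u, v))
# b from any support vector s: y_s (w.x_s + b) = 1  =>  b = y_s - w.x_s
b = y[0] - dot(w, X[0])

# Every training constraint y_i (w.x_i + b) >= 1 holds with equality (on margin)
for xi, yi in zip(X, y):
    assert abs(yi * (dot(w, xi) + b) - 1.0) < 1e-9

margin = 2.0 / dot(w, w) ** 0.5
print(w, b, margin)  # [1.0, 0.0] 0.0 2.0
```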


When the training data are not linearly separable:

- New objective: Separate the training data with min # errors
  - Introduce slack variables ξᵢ (ξᵢ ≥ 0):
  - Min ½‖w‖² + C Σᵢ ξᵢ subject to yᵢ(w·xᵢ + b) ≥ 1 − ξᵢ, ξᵢ ≥ 0
- C: Helps decide whether MARGIN or ERROR is more important in determining the solution
- Dual (soft-margin hyperplane):
  - Same dual as before, with the extra constraint 0 ≤ αᵢ ≤ C
  - NOTE:
    - C constrains αᵢ → Does not let αᵢ increase to ∞ (which may lead to no solution)
    - Still quadratic programming → Only 1 global min


Non-linear SVM (C-SVM): Kernel trick
Wednesday, March 1, 2017 11:51 AM

- Idea: Change the space (coordinate system) of the data with a feature transformation φ: Rᵐ → H
  - Decision boundary: Arbitrary surface → Hyperplane
- Key observation: Input data only appear (in training + testing) in the form of dot products
  - Training: xᵢ·xⱼ → φ(xᵢ)·φ(xⱼ)
  - Testing: xᵢ·x → φ(xᵢ)·φ(x)
- Kernel function k(x, y):
  - Purpose:
    - Return the dot product between data points in some space: k(x, y) = φ(x)·φ(y)
    - Replace the dot products in the SVM algorithm with the kernel → Obtain an efficient representation
  - Choose k ⇔ Choose φ (feature map)

Example kernels:

| Name | Formula | Params needing tuning |
| --- | --- | --- |
| Inhomogeneous polynomial | k(x, y) = (x·y + 1)ᵈ | d |
| Gaussian (Radial basis function) | k(x, y) = exp(−‖x − y‖² / (2σ²)); alternative form: exp(−γ‖x − y‖²) | σ (or γ) |
| Sigmoid | k(x, y) = tanh(κ(x·y) + θ); valid kernel only for certain κ, θ | κ, θ |

Original space vs feature space:

| | Original space | Feature space |
| --- | --- | --- |
| Data | x | φ(x) |
| Decision boundary | Arbitrary surface | Hyperplane |
| Boundary shape (example) | 2-D curve | 3-D plane |

- Problem with directly working on φ: High dimensionality → Inefficient computation
  - Example: Input data: Gray-scale 16×16 images; using a polynomial curve of degree 5 as the decision boundary in the ORIGINAL space corresponds to a huge feature space
- Any algorithm depending only on dot products can use the kernel trick
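The identity k(x, y) = φ(x)·φ(y) can be verified directly for the degree-2 inhomogeneous polynomial kernel on 2-D inputs, whose (known) explicit feature map is φ(x) = (1, √2x₁, √2x₂, x₁², x₂², √2x₁x₂); the kernel computes the 6-D dot product without ever building φ:

```python
import math, random

def phi(v):
    # Explicit feature map for k(x, y) = (x.y + 1)^2 on 2-D inputs
    x1, x2 = v
    r2 = math.sqrt(2.0)
    return (1.0, r2*x1, r2*x2, x1*x1, x2*x2, r2*x1*x2)

def k_poly(x, y):
    return (x[0]*y[0] + x[1]*y[1] + 1.0) ** 2

random.seed(0)
for _ in range(100):
    x = (random.uniform(-2, 2), random.uniform(-2, 2))
    y = (random.uniform(-2, 2), random.uniform(-2, 2))
    explicit = sum(a*b for a, b in zip(phi(x), phi(y)))
    assert abs(k_poly(x, y) - explicit) < 1e-9   # kernel = dot product in H
print("kernel identity verified")
```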

Gaussian Kernel

- Recall: Gaussian → Bell curve
  - Puts bumps of various sizes on the training set
- Param tuning:
  - σ ↑:
    - Influence on the decision boundary of EVERY data point ↑
    - "Smoothness" of the decision boundary ↑
    - σ → ∞: Linear boundary
    - Heuristics can be used to choose σ
  - C ↑:
    - Importance of the goal of min error ↑
    - Classification error ↓, but also "smoothness" of the decision boundary ↓
- Support vectors are usually very few in number
  - → Just need to store the support vectors when doing prediction
- High-dimensional data are MORE likely to be LINEARLY separable
  - → That's why SVM is quite commonly used in text, image processing


Overfitting & k-fold (Stratified) Cross-Validation
Thursday, March 2, 2017 9:17 AM

## Learning Error Measurement
- f: Target function (unknown)
- D: Target data distribution (unknown)
- h: Hypothesis (a specific SVM param set, neural network, …)
- S: Training set of size n (drawn from D)
- Training error: Proportion of examples in the training set that h misclassifies
  - Can measure
- Testing error: Prob that h misclassifies a data instance drawn randomly from D
  - Can't measure, but wish to know → Estimated through a test set:
    - Test set drawn independently from the training set
    - Do not use the test set for training

## Overfitting
- Def: Hypothesis h overfits the training data if ∃ h' such that:
  - On the training set: error_S(h) < error_S(h')
  - Over the entire distribution: error_D(h) > error_D(h')
- Occam's Razor: Prefer SIMPLE hypotheses, because:
  - There are fewer simple hypotheses than complicated ones
  - → A simple hypothesis fitting the data is unlikely to be a coincidence
- Early stopping: Stop before reaching the point where the training data are perfectly classified

## k-fold (Stratified) Cross-Validation
- Purpose: Select the "best" hypothesis from the available data
  - Select the model hyperparams resulting in the smallest testing error (highest accuracy)
  - For Gaussian-kernel C-SVM: the (C, σ) producing the smallest testing error
- Process: For each hyperparam combination:
  - Divide the m examples into k disjoint subsets, each of size m/k
  - (Stratifying step) The proportion of examples from each class in the subsets should be approx EQUAL, e.g.:

    | Subset | # Class 1 examples | # Class 2 examples |
    | --- | --- | --- |
    | A | 5 | 2 |
    | B | 5 | 2 |
    | C | 5 | 2 |
    | D | 5 | 3 |

  - Run the learning process k times; each time:
    - Validation set = 1 subset
    - Training set = (the other (k − 1) subsets)
  - Calculate the avg accuracy
  - Choose the hyperparam combination with the highest avg accuracy
    - (→ Now, all data can be trained on with these hyperparams)
- Why use k-fold:
  - A high proportion of the data is used for training
  - Also, all data are used in computing the error
- Leave-one-out (k = m):
  - Train on (m − 1) examples, validate on 1 example
  - Useful for SMALL data sets

Example: 1000 examples: |Class 1| = 600, |Class 2| = 400
Run C-SVM with Gaussian kernel, 4-fold stratified cross-validation:
C ∈ {2⁻⁴, 2⁻³, 2⁻², 2⁻¹, 2⁰, 2¹, 2²}
σ ∈ {2⁻⁶, 2⁻⁵, 2⁻⁴, 2⁻³, 2⁻², 2⁻¹, 2², 2³, 2⁴, 2⁵}

| Subset | # Class 1 | # Class 2 |
| --- | --- | --- |
| A | 150 | 100 |
| B | 150 | 100 |
| C | 150 | 100 |
| D | 150 | 100 |

For each (C, σ):

| Learning time | Training set | Validation set | Validation error | Accuracy |
| --- | --- | --- | --- | --- |
| 1 | BCD | A | 50% | 50% |
| 2 | ACD | B | 45% | 55% |
| 3 | ABD | C | 40% | 60% |
| 4 | ABC | D | 54% | 46% |

Avg accuracy = 52.75%

Choose the (C, σ) with the highest avg accuracy.
How many models are trained in total: |Set of C| × |Set of σ| × k = 7 × 10 × 4 = 280
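The stratifying step can be sketched as follows: split each class's examples across the k folds separately, so every fold keeps (approximately) the class proportions of the whole data set. Sizes mirror the example above (600 vs 400, k = 4); the function name is an assumption, not course code:

```python
import random
from collections import Counter

def stratified_folds(labels, k, seed=0):
    # Group example indices by class, shuffle within each class,
    # then deal them out round-robin so each fold gets ~1/k of every class.
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for i, idx in enumerate(idxs):
            folds[i % k].append(idx)
    return folds

labels = [1]*600 + [2]*400
folds = stratified_folds(labels, k=4)
for f in folds:
    counts = Counter(labels[i] for i in f)
    print(counts[1], counts[2])   # 150 100 in each fold
```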


Neural Networks
Friday, March 10, 2017 5:00 PM

## Artificial Neural Network
- Use a complex network of simple computing elements → Mimic the brain's function
- Structure: Units (input, hidden, output)
- Learning = Updating weights

## Real neuron
- Cell structure:
  - Dendrite: Receives info from other cells
  - Axon (1/cell): Transmits info from the cell body
  - Synapse: Package of chemical substances (transmitters) that influence other cells when released
- Signal transmission:
  - Impulses arrive simultaneously and are added together
  - Transmitter released from a synapse enters the dendrite
  - If sufficiently strong: An electrical pulse is sent down the axon
  - It reaches the synapse, releasing transmitter into other cells' bodies
- Properties:
  - Fault-tolerant: Cells die all the time with no ill effect on the brain's overall functioning
  - Graceful degradation: As conditions worsen, performance gradually (rather than sharply) drops


Perceptron
Friday, March 10, 2017 5:00 PM

## Model of Single Perceptron
- Input: x₁, x₂, …, xₙ
- Weight: w₁, w₂, …, wₙ
- Activation function: Relates input & output:
  - O = 1 if Σᵢ wᵢxᵢ ≥ 0, else 0 (with x₀ ≡ 1, w₀ ≡ −θ)

## Perceptron
- Feed-forward network
- Only 1 layer of adjustable weights

## Weight learning
- Principle of the weight-updating rule:
  - Observed output (T) ≠ Predicted output (O)
  - → Make a small adjustment in the weights ∝ the difference
- Algorithm:
  - Randomly initialize the weights + choose a learning rate η
  - Repeat until all examples are correctly predicted:
    - For each sample input x = (x₁, x₂, …, xₙ) with corresponding output T:
      - Calculate the predicted output O
      - Update the weights: wᵢ ← wᵢ + η(T − O)xᵢ

## Learning Capability
- A function can be represented by a single perceptron ⇔ It is linearly separable
  - AND: Decision boundary I₁ + I₂ − 1.5 = 0
  - OR: Decision boundary I₁ + I₂ − 0.5 = 0
  - XOR: Not linearly separable → Cannot be represented
- With more layers of sufficiently many perceptrons, any boolean function can be represented
  - (Why: Any boolean function can be represented in Sum-of-Products or Product-of-Sums form, e.g. (a ∨ c) ∧ (b ∨ c))

## Worked example
Training set:

| (I₁, I₂) | T |
| --- | --- |
| (5, 1) | 0 |
| (2, 1) | 0 |
| (1, 1) | 1 |
| (3, 3) | 1 |
| (4, 2) | 0 |
| (2, 3) | 1 |

Initialize: w = (0, 0, 0), learning rate η = 1

| Iteration | w_old = (−θ, w₁, w₂) | I = (1, I₁, I₂) | T | O | T − O | w_new |
| --- | --- | --- | --- | --- | --- | --- |
| 1.1 | (0, 0, 0) | (1, 5, 1) | 0 | 1 | −1 | (−1, −5, −1) |
| 1.2 | (−1, −5, −1) | (1, 2, 1) | 0 | 0 | 0 | (−1, −5, −1) |
| 1.3 | (−1, −5, −1) | (1, 1, 1) | 1 | 0 | 1 | (0, −4, 0) |
| 1.4 | (0, −4, 0) | (1, 3, 3) | 1 | 0 | 1 | (1, −1, 3) |
| 1.5 | (1, −1, 3) | (1, 4, 2) | 0 | 1 | −1 | (0, −5, 1) |
| 1.6 | (0, −5, 1) | (1, 2, 3) | 1 | 0 | 1 | (1, −3, 4) |
| 2.1 | (1, −3, 4) | (1, 5, 1) | 0 | 0 | 0 | (1, −3, 4) |
| 2.2 | (1, −3, 4) | (1, 2, 1) | 0 | 0 | 0 | (1, −3, 4) |
| 2.3 | (1, −3, 4) | (1, 1, 1) | 1 | 1 | 0 | (1, −3, 4) |
| 2.4 | (1, −3, 4) | (1, 3, 3) | 1 | 1 | 0 | (1, −3, 4) |
| 2.5 | (1, −3, 4) | (1, 4, 2) | 0 | 0 | 0 | (1, −3, 4) |
| 2.6 | (1, −3, 4) | (1, 2, 3) | 1 | 1 | 0 | (1, −3, 4) |

All examples are correctly predicted in iteration 2 → converged.
Decision boundary: −3I₁ + 4I₂ + 1 = 0



- Perceptron convergence theorem:
  - If the training examples are linearly separable, running the perceptron weight-updating rule will:
    - Always converge to a solution
    - In finitely many steps, for any initial weight choice
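The worked example above can be reproduced in a few lines. A sketch, using the same data, η = 1, w initialized to (0, 0, 0), and the activation O = 1 if w·x ≥ 0 else 0 (the "≥" matches iteration 1.1, where w·x = 0 yields O = 1):

```python
data = [((5, 1), 0), ((2, 1), 0), ((1, 1), 1),
        ((3, 3), 1), ((4, 2), 0), ((2, 3), 1)]
w = [0, 0, 0]          # (w0, w1, w2), with x0 fixed at 1
eta = 1

converged = False
while not converged:
    converged = True
    for (i1, i2), t in data:
        x = (1, i1, i2)
        o = 1 if sum(wi*xi for wi, xi in zip(w, x)) >= 0 else 0
        if o != t:                      # misclassified: adjust weights
            w = [wi + eta*(t - o)*xi for wi, xi in zip(w, x)]
            converged = False

print(w)  # [1, -3, 4]  ->  decision boundary -3*I1 + 4*I2 + 1 = 0
```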


Adaline
Saturday, March 25, 2017 9:27 AM

## Fundamentals of Weight-Finding
- Def: Adaline:
  - Feedforward network
  - 1 layer of adjustable weights
  - * NOTE: Adaline is "linear regression" in statistics
- Terminology:
  - Training example d:
    - Input: x_d
    - Target output: t_d
  - Weight: w
  - Predicted output: o_d = w·x_d
- Objective: Min E(w) = ½ Σ_d (t_d − o_d)²
- Mathematics: Gradient descent
  - Keep going "downhill" from any point on the error surface
  - Can get stuck in a locally optimal solution
  - However, for Adaline: 1 global min only (the error surface is quadratic)
  - Move w:
    - Direction: Opposite to the gradient ∇E(w)
    - Magnitude: A small fraction η of it: w ← w − η ∇E(w)

## Weight-Updating Methods

Batch Gradient Descent
- Principle: In each iteration, use ALL examples to update the weights
- Algorithm:
  - Randomly initialize w
  - (*) Repeat until termination condition met:
    - Initialize Δwᵢ = 0 for each i
    - For each d:
      - Calculate o_d
      - For each wᵢ: Δwᵢ ← Δwᵢ + η(t_d − o_d)x_id
    - For each wᵢ: wᵢ ← wᵢ + Δwᵢ

Stochastic Gradient Descent
- Principle: In each iteration, use 1 example to update the weights
- Algorithm:
  - Randomly initialize w
  - (*) Repeat until termination condition met:
    - For each d:
      - Calculate o_d
      - For each wᵢ: wᵢ ← wᵢ + η(t_d − o_d)x_id

## Gradient Descent: Common Issues
- Overshooting:
  - Problem: η too large → Overstep the min
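The two update schemes can be contrasted on a tiny Adaline (linear unit o = w·x) fitting the made-up target t = 2x, with a bias input x₀ = 1. Batch accumulates the gradient over all examples and updates once per pass; stochastic updates after each example:

```python
import random

data = [((1.0, x/5.0), 2.0*(x/5.0)) for x in range(-5, 6)]
eta = 0.05

def predict(w, x):
    return sum(wi*xi for wi, xi in zip(w, x))

# Batch: sum the gradient over ALL examples, then update once.
w = [0.0, 0.0]
for _ in range(500):
    delta = [0.0, 0.0]
    for x, t in data:
        err = t - predict(w, x)
        delta = [d + eta*err*xi for d, xi in zip(delta, x)]
    w = [wi + d for wi, d in zip(w, delta)]

# Stochastic: update immediately after EACH example, in shuffled order.
v = [0.0, 0.0]
random.seed(1)
for _ in range(500):
    for x, t in random.sample(data, len(data)):
        err = t - predict(v, x)
        v = [vi + eta*err*xi for vi, xi in zip(v, x)]

# Both recover the underlying line: bias ~ 0, slope ~ 2
print([round(c, 3) for c in w], [round(c, 3) for c in v])
```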


  - Solution: Test η on a small subset of the training set → If it performs well → Use it for the whole set
- * NOTE (stochastic): wᵢ is changed after each training example → IMMEDIATELY affects the o_d calculation of the NEXT example
- Batch pros & cons:
  - The gradient change is summed over the whole data set each iteration → Expensive computation for big data sets: Can't store the whole set in mem → Have to read from disk → Slow
  - For big data sets: Always needs multiple (*) passes (through all training data) to get a good result
- Stochastic pros & cons:
  - For big data sets: Possible to get a good result after 1 (*) pass (through all training data)
  - Generally moves in the right direction, but not always → # (*) iterations ↑; have to use smaller steps (η ↓)
- Mini-batch gradient descent:
  - In each iteration: Use b examples to update the weights
  - Often faster than stochastic
- Termination condition:
  - Reach a preset # of (*) iterations
- Gradient checking:
  - Problem: Sometimes the gradient can't be derived easily & precisely calculated (not the case for Adaline, but e.g. for a multi-layer perceptron)
  - Solution: Check the gradient using finite differences. Repeat for different d, i:
    - Compute ∂E/∂wᵢ using the formula previously derived
    - Set wᵢ ← wᵢ ± ε (ε very small)
    - Compute (E(w + ε eᵢ) − E(w − ε eᵢ)) / (2ε)
    - Verify: the two values are approximately equal
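A minimal sketch of the finite-difference check on Adaline's error E(w) = ½ Σ_d (t_d − w·x_d)², whose analytic gradient is ∂E/∂wᵢ = −Σ_d (t_d − w·x_d) x_di. Data and starting weights are made up:

```python
data = [((1.0, 2.0), 1.0), ((0.5, -1.0), 0.0), ((2.0, 1.0), 2.0)]
w = [0.3, -0.2]
eps = 1e-6

def E(w):
    return 0.5 * sum((t - sum(wi*xi for wi, xi in zip(w, x)))**2 for x, t in data)

for i in range(len(w)):
    # Analytic gradient: dE/dw_i = -sum_d (t_d - w.x_d) * x_di
    analytic = -sum((t - sum(wi*xi for wi, xi in zip(w, x))) * x[i] for x, t in data)
    w_plus  = w[:]; w_plus[i]  += eps
    w_minus = w[:]; w_minus[i] -= eps
    numeric = (E(w_plus) - E(w_minus)) / (2*eps)   # central difference
    assert abs(analytic - numeric) < 1e-5
print("gradient check passed")
```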


Multilayer Perceptron & Back-Propagation
Monday, May 22, 2017 9:24 PM

## Network Construction for Multiclass Classification
- m > 2 classes:
  - 1 output/class
  - Largest output i → Object ∈ Class i
- m = 2 classes: Special method:
  - 1 output y
  - y > 0 → Object ∈ YES class
  - y ≤ 0 → Object ∈ NO class

## Activation (Hidden Unit Transfer) Function
- Must be nonlinear: A linear hidden unit can be duplicated by a single-layer network
- Step unit:
  - Non-differentiable → Not suitable for gradient descent
- Sigmoid unit: σ(x) = 1 / (1 + e⁻ˣ):
  - σ(x) ∈ (0, 1)
  - Nice derivative: σ'(x) = σ(x)(1 − σ(x))
  - x very small: σ(x) ≈ linear function
  - x very big: σ(x) ≈ step function
- Radial Basis Function (RBF):
  - Produces a localized response to the input: significant nonzero output only when the input falls within a small localized region
- Rectified Linear Unit (ReLU): f(x) = max(0, x):
  - Efficient computation
  - NOTE: Careful when initializing the weights → Avoid f'(x) = 0

## Universal Approximation
- 1 hidden layer of sigmoid units is sufficient to approximate any well-behaved function to arbitrary precision (1)
- Why use ≥ 2 hidden layers: (2)
  - Fewer weights (params) for the same accuracy level
- * NOTE: Two ways to approximate a complicated function:
  - (1) Use lots of hidden units in 1 layer: f_output(x) = w₁φ₁(x) + w₂φ₂(x) + w₃φ₃(x) + …
  - (2) Compose layers: f_output(x) = g₁(g₂(g₃(…(x))))


- Leaky ReLU:
  - As efficient as ReLU
  - But: f'(x) ≠ 0

## Backpropagation: Gradient Descent for Multilayer Network
- Algorithm (stochastic version):
  - Randomly initialize the weights w_ji
  - Repeat until convergence: For each training example (x, t):
    - Propagate the input forward:
      - Compute the output o_u of every unit u
    - Propagate the error backward:
      - For each output unit k: δ_k = o_k(1 − o_k)(t_k − o_k)
      - For each hidden unit j: δ_j = o_j(1 − o_j) Σ_k w_kj δ_k
    - Update the weights:
      - For each connection (i → j): w_ji ← w_ji + η δ_j Out_i
      - NOTE: Out_i = unit i's output, which is:
        - u_i: For hidden unit i
        - x_i: For input unit i
- Efficiency: O(W) per training example, where W = total number of weights
  - (A naive finite-difference gradient would cost O(W²))

## Gradient Descent: Practice Issues
- Speed up training:
  - Use momentum: add a fraction of the previous weight update to the current one
  - Exploit the error surface's high-order info
- Escape poor local minima:
  - Train multiple networks, each initialized with different weights
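The delta rules above can be exercised on a tiny 2-2-1 sigmoid network and cross-checked against a finite-difference estimate of E = ½(t − o)². All weights and inputs below are made-up numbers:

```python
import math

sig = lambda z: 1.0 / (1.0 + math.exp(-z))

# Hidden unit j has weights w_h[j] = [bias, w_from_x1, w_from_x2];
# the output unit has w_o = [bias, w_from_u1, w_from_u2].
w_h = [[0.1, 0.4, -0.3], [-0.2, 0.25, 0.5]]
w_o = [0.05, -0.4, 0.6]
x, t = [1.0, 0.5, -1.0], 1.0       # x[0] = 1 is the bias input

def forward(w_h, w_o, x):
    u = [sig(sum(wi*xi for wi, xi in zip(wj, x))) for wj in w_h]
    o = sig(w_o[0] + sum(w*ui for w, ui in zip(w_o[1:], u)))
    return u, o

u, o = forward(w_h, w_o, x)
delta_o = o*(1-o)*(t-o)                                           # output delta
delta_h = [uj*(1-uj)*w_o[1+j]*delta_o for j, uj in enumerate(u)]  # hidden deltas

# Backprop implies dE/dw_ji = -delta_j * Out_i; compare with central differences.
eps = 1e-6
def E(w_h, w_o):
    _, o = forward(w_h, w_o, x)
    return 0.5*(t - o)**2

for j in range(2):
    for i in range(3):
        w_h[j][i] += eps; ep = E(w_h, w_o)
        w_h[j][i] -= 2*eps; em = E(w_h, w_o)
        w_h[j][i] += eps                      # restore the weight
        assert abs(-delta_h[j]*x[i] - (ep - em)/(2*eps)) < 1e-6
print("backprop deltas match the numerical gradient")
```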

## COMP 4211 Page 14

Convolutional Neural Network
Monday, May 22, 2017 12:50 AM

## Convolution Operator
- Def:
  - Continuous: (f * g)(t) = ∫ f(τ) g(t − τ) dτ
  - Discrete (2-D, for images): (I * K)(i, j) = Σ_m Σ_n I(m, n) K(i − m, j − n)
  - I: Image; K: Mask (kernel)
- Properties:
  - Commutative: f * g = g * f
  - Associative: f * (g * h) = (f * g) * h
- Convolution → The representation shrinks at each layer → Limits the # of layers
  - Solution: Pad the lost positions with zeros
- Sparse connectivity: Due to the small convolution kernel
  - Each hidden unit is only connected to a local subset of the units in the previous layer
  - → Feature hierarchy
- Shared weights: Due to the same kernel being used throughout the image
  - Helps detect features regardless of their positions in the image → Robustness
  - # params to learn ↓

- Activation function:
  - Convolution is linear → Need a nonlinear activation function
  - Commonly used: ReLU: y = max(x, 0)

## Pooling Layer
- Motivation:
  - Once a feature is detected, only its approximate position relative to other features is relevant
  - Image of the number 7:
    - Endpoint of a roughly horizontal segment in the upper left
    - Corner in the upper right area
    - Endpoint of a roughly vertical segment in the lower portion
  - Different object instances:
    - Different absolute positions of features
    - But their relative positions to each other are the same
- Technique: Max-pooling: For each sub-region, output the max value
  - Effect of shifting, rotation ↓ → Robustness
  - Computational burden on the next layer ↓
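The two building blocks can be sketched in a few lines: a "valid" 2-D convolution (with the kernel flipped, per the definition above) followed by 2×2 max-pooling. The edge-detector kernel and the toy image are assumptions for illustration:

```python
def conv2d(I, K):
    hi, wi = len(I), len(I[0])
    hk, wk = len(K), len(K[0])
    out = [[0]*(wi - wk + 1) for _ in range(hi - hk + 1)]
    for i in range(len(out)):
        for j in range(len(out[0])):
            # (I*K)(i,j) = sum_m sum_n I(i+m, j+n) * K(hk-1-m, wk-1-n)
            out[i][j] = sum(I[i+m][j+n] * K[hk-1-m][wk-1-n]
                            for m in range(hk) for n in range(wk))
    return out

def maxpool2x2(A):
    # For each 2x2 sub-region, output the max value
    return [[max(A[i][j], A[i][j+1], A[i+1][j], A[i+1][j+1])
             for j in range(0, len(A[0]) - 1, 2)]
            for i in range(0, len(A) - 1, 2)]

# Vertical-edge detector on a 5x5 image whose left half is bright:
I = [[1, 1, 0, 0, 0]] * 5
K = [[-1, 1]]                       # responds where brightness drops
C = conv2d(I, K)                    # strong response at the edge column
P = maxpool2x2(C)
print(C[0], P)  # [0, 1, 0, 0] [[1, 0], [1, 0]]
```

Note how the pooled output keeps "an edge was detected on the left" while discarding its exact column, which is the motivation stated above.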


Reinforcement Learning (RL)
Monday, May 22, 2017 10:59 PM

## Basic concepts
- Reinforcement Learning:
  - Interact with the env → Get an EVALUATIVE output/reward
  - Learn a mapping: State → Action
  - Goal: Max long-term reward
- Example (car):
  - States: Car's position & velocity
  - Actions: Forward/Reverse/None
  - Rewards: 0 if the goal is reached, −1 otherwise

## Markov Assumption
- s_{t+1} = δ(s_t, a_t)
- r_t = r(s_t, a_t)
- The current reward & next state depend ONLY on the current state & action

- World/Environment:
  - Deterministic world: Actions have certain outcomes:
    - From state s, take action a → Definitely reach state s'
    - δ(s, a) = New state reached from state s through action a
    - r(s, a) = Reward got by taking action a when at state s
  - Non-deterministic world: Actions have uncertain outcomes:
    - From state s, take action a → Reach state s' with prob p', state s'' with prob p'', …
    - P(s, s', a) = P(Transition s → s' | Action a)
    - R(s, s', a) = E(Reward of transition s → s' | Action a)
- Policy: Maps State → Action:
  - Deterministic policy: π: S → A; at state s, ALWAYS take action a = π(s)
  - Non-deterministic policy: π(s, a) = P(Take action a | State s); at state s, take action a' with prob p', action a'' with prob p''
- Discount factor γ:
  - Total reward = Reward(t) + Reward(t + 1)·γ + Reward(t + 2)·γ² + …  (t: time)
  - Discount factor γ < 1 (a future reward is not worth as much as a current reward)

## Policy Evaluation
- Bellman equation:
  - Deterministic world:
    - V^π(s) = r_t + γ r_{t+1} + γ² r_{t+2} + … = r(s, π(s)) + γ V^π(δ(s, π(s)))
  - Non-deterministic world:
    - V^π(s) = Σ_{s'} P(s, s', π(s)) [R(s, s', π(s)) + γ V^π(s')]
- Example: Policy (deterministic): State Good → Action Produce; State Bad → Action Repair:
  - V(Good) = 0.9(1 + γV(Good)) + 0.1(0 + γV(Bad))
  - V(Bad) = 0.7(0 + γV(Good)) + 0.3(−10 + γV(Bad))
- Computation methods:
  - Solve the linear system of V(s₁), V(s₂), …
  - Iterative method:
    - Randomly initialize V₀(s₁), V₀(s₂), …
    - Run until convergence: V_{k+1}(s) ← Σ_{s'} P(s, s', π(s)) [R(s, s', π(s)) + γ V_k(s')]
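Both computation methods can be run on the two-state example, with an assumed discount γ = 0.9 (the notes' equations leave γ implicit). The sketch solves the 2×2 linear system in closed form, runs the iterative Bellman backup, and checks that the two answers agree:

```python
g = 0.9   # assumed discount factor

# Rearrange the Bellman equations into A v = b:
#   (1 - 0.9g) V(Good) - 0.1g V(Bad) = 0.9
#   -0.7g V(Good) + (1 - 0.3g) V(Bad) = -3
a11, a12, b1 = 1 - 0.9*g, -0.1*g, 0.9
a21, a22, b2 = -0.7*g, 1 - 0.3*g, -3.0
det = a11*a22 - a12*a21
V_G = (b1*a22 - a12*b2) / det      # Cramer's rule
V_B = (a11*b2 - b1*a21) / det

# Iterative method: apply the Bellman backup until convergence.
vG = vB = 0.0
for _ in range(1000):
    vG, vB = (0.9*(1 + g*vG) + 0.1*(0 + g*vB),
              0.7*(0 + g*vG) + 0.3*(-10 + g*vB))

assert abs(vG - V_G) < 1e-6 and abs(vB - V_B) < 1e-6
print(round(V_G, 3), round(V_B, 3))
```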


Learning Optimal (Deterministic) Policy
Monday, May 22, 2017 11:30 PM

## Basic Concepts
- Learning situation:

| Situation | Condition | Learning method |
| --- | --- | --- |
| Known environment | Known δ(s, a), r(s, a) | Policy iteration; Value iteration |
| Unknown environment | Unknown δ(s, a), r(s, a) | Q-learning |

- Optimal policy: π* = argmax_π V^π(s) ∀s
- Optimal state-value function: V*(s) = max_π V^π(s)
- Bellman optimality condition:
  - Deterministic world: V*(s) = max_a [r(s, a) + γ V*(δ(s, a))]
  - Non-deterministic world: V*(s) = max_a Σ_{s'} P(s, s', a) [R(s, s', a) + γ V*(s')]

## Learning in Known Environment: Policy Iteration
- Randomly initialize π(s) ∀s
- Repeat:
  - Evaluate the current policy → V^π
  - Improve the policy:
    - policyStable ← True
    - For each s ∈ S:
      - newAction ← argmax_a [r(s, a) + γ V^π(δ(s, a))]
      - If newAction ≠ π(s):
        - π(s) ← newAction
        - policyStable ← False
  - Stop if policyStable = True

## Learning in Known Environment: Value Iteration
(Often converges faster than Policy Iteration)

- Repeat until Δ < θ:
  - Δ ← 0
  - For each s ∈ S:
    - oldStateValue ← V(s)
    - V(s) ← max_a [r(s, a) + γ V(δ(s, a))]
    - Δ ← max(Δ, |oldStateValue − V(s)|)
- Output π*:
  - π*(s) ← argmax_a [r(s, a) + γ V(δ(s, a))]
- Example:
  - V*(A) = max( 0.5(5 + γV*(A)) + 0.5(5 + γV*(B)), 10 + γV*(B) )
  - V*(B) = 10 + γV*(B)

- Q-function (action-value function):
  - Q^π(s, a) ≡ r(s, a) + γ V^π(δ(s, a))  (Given π, s, a)
  - At the optimal π*:
    - Deterministic world:
      - Q*(s, a) = r(s, a) + γ V*(δ(s, a)) = r(s, a) + γ max_{a'} Q*(δ(s, a), a')
    - Non-deterministic world:
      - Q*(s, a) = Σ_{s'} P(s, s', a) [R(s, s', a) + γ max_{a'} Q*(s', a')]

## Learning in Unknown Environment: Q-Learning
(Learn to approximate the Q-function)

- Initialize Q̂(s, a)
- Repeat sufficiently (each run below = 1 training episode):
  - Choose a current state s (might/might not be random)
  - Repeat until Terminate:
    - Select & execute an action a (based on some policy)
    - Observe: Immediate reward r, new state s'
    - Update: Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')
    - s ← s'
  - Terminate condition: Reach the allowed max # of iterations, or reach a goal state

## Q-Learning: Remarks
- The reward function may be non-deterministic:
  - Each time (s, a) is visited, a different r is obtained
  - Solution: Update through an avg:
    - Q̂(s, a) ← (1 − α) Q̂(s, a) + α [r + γ max_{a'} Q̂(s', a')]
    - ⇔ Q̂(s, a) ← Q̂(s, a) + α [r + γ max_{a'} Q̂(s', a') − Q̂(s, a)]
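The deterministic Q-learning update can be sketched on a tiny assumed world (not from the course material): a 1-D chain of states 0..4 where actions −1/+1 move the agent, reward is 10 on entering the goal state 4 (ending the episode) and 0 otherwise, with γ = 0.9:

```python
import random

random.seed(0)
gamma, n_states, goal = 0.9, 5, 4
actions = (-1, 1)
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}

for _ in range(500):                      # training episodes
    s = random.randrange(n_states - 1)    # start anywhere except the goal
    while s != goal:
        a = random.choice(actions)        # explore with a random policy
        s2 = min(max(s + a, 0), n_states - 1)
        r = 10.0 if s2 == goal else 0.0
        # Deterministic world: Q(s,a) <- r + gamma * max_a' Q(s',a')
        Q[(s, a)] = r + gamma * max(Q[(s2, a2)] for a2 in actions)
        s = s2

# Optimal values: Q(3,+1) = 10, Q(2,+1) = 0.9*10 = 9, Q(1,+1) = 0.9*9 = 8.1
print(Q[(3, 1)], Q[(2, 1)], Q[(1, 1)])
```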


- Action selection policy:
  - Greedy policy:
    - At s, select the "best" a: argmax_a Q̂(s, a)
    - Problem:
      - Over-commits to actions with a high Q̂ found early
      - Fails to explore potentially high-reward actions
  - ε-greedy policy: State-of-the-art:
    - Greedy most of the time
    - Occasionally (with prob ε), take a random action
    - (ε can be decreased as training progresses)
  - Exploit & explore (softmax):
    - Select action aᵢ with prob P(aᵢ | s) = k^{Q̂(s, aᵢ)} / Σⱼ k^{Q̂(s, aⱼ)}, for some k > 0


Generalization & Function Approximation
Tuesday, May 23, 2017 12:32 AM

## Q-Function: Feature-Based Representation
- Problem with (tabular) Q-learning: When |S| is too big:
  - Too many states to visit them all
  - Too many entries in the Q-table to hold in mem
  - Solution: Replace the Q-table with a function approximator
- Feature-based representation:
  - Q(s, a) = w₁f₁(s, a) + … + wₙfₙ(s, a)
  - fₖ(s, a): Feature function:
    - Returns a real number
    - Captures important properties of (s, a)
    - Pacman: Distance to the closest ghost, …
  - Advantage: n ≪ # states
- Weight update:
  - Recall: Q(s, a) ← Q(s, a) + α [Difference]
  - Difference = r + γ max_{a'} Q(s', a') − Q(s, a)
  - Formulation: wᵢ ← wᵢ + α [Difference] fᵢ(s, a)
- Algorithm:
  - Initialize wᵢ
  - Repeat sufficiently (each run below = 1 training episode):
    - Choose a current state s
    - Repeat until Terminate:
      - Select & execute an action a
      - Observe: Immediate reward r, new state s'
      - Update the weights:
        - Difference = r + γ max_{a'} Q(s', a') − Q(s, a)
        - wᵢ ← wᵢ + α [Difference] fᵢ(s, a), i = 1..n
      - s ← s'

## Q-Learning with Linear Approximation: Issues
- Issue 1: The current params determine the next data sample that gets trained on:
  - → Strong correlation between samples → Divergence/Oscillation
  - Also: Learning directly from consecutive samples is inefficient
  - Solution: Experience replay:
    - Pool the agent's experiences over many episodes into a data set D; experience e = (s, a, r, s')
    - Draw samples randomly from D → Update the weights
  - Why it works:
    - Randomized samples → Break the correlation
    - Each experience is potentially used in many weight updates → Data efficiency
- Issue 2: Q is used in both evaluating (s, a) & selecting the action:
  - Q(s, a): Evaluation; max_{a'} Q(s', a'): Action selection
  - → Poor local minima/Divergence/Oscillation
  - Solution: Target network:
    - Every C updates: Clone network Q (weights) → Target network
    - For the next C updates: Use the target network to generate the Q-learning targets: t = r + γ max_{a'} Q_target(s', a')
  - Why it works:
    - Delay between the time Q is updated & the time the targets are affected by the update
    - → Avoid oscillation/divergence
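One weight update of the linear-approximation scheme can be traced by hand. The feature functions and all numbers below are hypothetical, chosen only to make the arithmetic easy to follow:

```python
alpha, gamma = 0.1, 0.9

def features(s, a):
    # Hypothetical feature functions f_i(s, a) for a 1-D world:
    # a bias term, the state itself, and the proposed move.
    return (1.0, float(s), float(a))

def Q(w, s, a):
    # Q(s, a) = sum_i w_i * f_i(s, a)
    return sum(wi * fi for wi, fi in zip(w, features(s, a)))

w = [0.5, -0.2, 0.3]
s, a, r, s2 = 2, 1, 1.0, 3            # observed transition (s, a, r, s')
actions = (-1, 1)

# Difference = r + gamma * max_a' Q(s', a') - Q(s, a)
diff = r + gamma * max(Q(w, s2, a2) for a2 in actions) - Q(w, s, a)
# w_i <- w_i + alpha * Difference * f_i(s, a)
w = [wi + alpha * diff * fi for wi, fi in zip(w, features(s, a))]
print(round(diff, 4), [round(wi, 4) for wi in w])
```

Here Q(w, 2, 1) = 0.4 and max_{a'} Q(w, 3, a') = 0.2, so the difference is 1 + 0.9·0.2 − 0.4 = 0.78, and every weight moves by 0.078·fᵢ(s, a).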