
Machine learning
Model and cost function

Project important dates


• 28 March: title, objectives, and short abstract (100 words)
• 25 April: literature review (1500-2500 words)
• 8 May: data and method – empirical results
• Conclusion
• Publication – conference or journal


Today’s Lecture
• Review: Supervised Learning (Regression, Classification)
• Linear Regression with One Variable
• Hypothesis Function
• Cost Function
• Gradient Descent
• Linear Algebra Review
• PC Lab: LaTeX

Review:
Supervised Learning - Regression

• Housing price prediction.

[Figure: scatter plot of price ($, in 1000's) versus size in m²]

Supervised Learning / Regression: predict a continuous-valued
output (price), given “right answers” (labelled examples).


Review:
Supervised Learning - Classification

• Classification: discrete-valued output (0 or 1)

Review:
Supervised learning


Review:
Unsupervised Learning

• The data does not have any labels.
• We are not told what to do with the data, or what each data point is.
• Can you find some structure in the data?

[Figure: unlabelled data set]

Linear regression with one variable
Model Representation


Why study linear regression?

• Also known as “least squares”
• “Least squares” is at least 200 years old (Legendre, Gauss)
• Francis Galton: regression to mediocrity (1886)
• Real processes can often be approximated by linear models
• More complicated models require an understanding of linear regression
• Many key notions of machine learning can be introduced with it

Linear regression with one variable
• Model Representation

Linear regression with one variable is also known as
“univariate linear regression”. Univariate linear regression is
used when you want to predict a single output value y from a
single input value x.


A toy example: Housing prices

[Figure: price (in 1000s of dollars) versus size (m²)]

Supervised Learning / Regression Problem
• Given the “right answer” for each example in the data, predict a real-valued output.

Training set of housing prices:

Size in m² (x)    Price ($) in 1000's (y)
210               460
145               234
153               315
85                178
…                 …

Notation:
m = number of training examples
x = “input” variable / features
y = “output” variable / “target” variable
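
As an illustration, this training set and notation could be written directly in Python (a minimal sketch; the variable names x, y and m simply mirror the notation above):

    # Toy housing training set: size in m^2 (x) and price in $1000's (y)
    x = [210, 145, 153, 85]   # "input" variable / feature
    y = [460, 234, 315, 178]  # "output" / "target" variable
    m = len(x)                # m = number of training examples, here 4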


Model representation

How do we represent h?

[Diagram: Training Set → Learning Algorithm → h (the hypothesis);
size of house → h → estimated price]

Linear regression with one variable is also called univariate linear regression.

The hypothesis function

• Our hypothesis function has the general form:

ŷ = h_θ(x) = θ₀ + θ₁·x

This is the equation of a straight line.

We are trying to create a function h_θ that maps our input data
(the x's) to our output data (the y's).
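
As a concrete sketch in plain Python (the function name h and the explicit parameter arguments are illustrative, not notation from the slides):

    def h(x, theta0, theta1):
        """Univariate linear regression hypothesis: h_theta(x) = theta0 + theta1 * x."""
        return theta0 + theta1 * x

    # With theta0 = 50 and theta1 = 0.06 (the example hypothesis used later),
    # a 150 m^2 house is predicted at 50 + 0.06 * 150 = 59, i.e. $59,000.
    print(h(150, 50, 0.06))  # 59.0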


Example
• Suppose we have the following set of training data:

Input x    Output y
0          4
1          7
2          7
3          8

Training set:

Size in m² (x)    Price ($) in 1000's (y)
210               460
145               234
153               315
85                178
…                 …

Hypothesis: h_θ(x) = θ₀ + θ₁·x

θᵢ's: parameters

How to choose the θᵢ's?

[Figure: three example hypotheses plotted in the (x, y) plane]
  (a) θ₀ = 1.5, θ₁ = 0.0    (b) θ₀ = 0.0, θ₁ = 0.5    (c) θ₀ = 1.0, θ₁ = 0.5

Idea: choose θ₀ and θ₁ so that h_θ(x) is close to y for our
training examples (x, y).


Cost function

Full problem:
  Hypothesis:     h_θ(x) = θ₀ + θ₁·x
  Parameters:     θ₀, θ₁
  Cost function:  J(θ₀, θ₁) = (1/2m) · Σᵢ₌₁ᵐ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²
  Goal:           minimize J(θ₀, θ₁) over θ₀, θ₁

Simplified (θ₀ = 0):
  Hypothesis:     h_θ(x) = θ₁·x
  Parameters:     θ₁
  Cost function:  J(θ₁) = (1/2m) · Σᵢ₌₁ᵐ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²
  Goal:           minimize J(θ₁) over θ₁
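
A minimal Python sketch of this cost function (assuming plain lists for the training data; names are illustrative):

    def cost(theta0, theta1, x, y):
        """Squared-error cost J(theta0, theta1) = (1/2m) * sum_i (h(x_i) - y_i)^2."""
        m = len(x)
        return sum((theta0 + theta1 * xi - yi) ** 2
                   for xi, yi in zip(x, y)) / (2 * m)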

[Figure: left, h_θ(x) for fixed θ₁ (a function of x); right, J(θ₁) (a function of the parameter θ₁)]

For θ₁ = 1, the hypothesis is h_θ(x) = x, and J(θ₁) = 0.

For θ₁ = 0.5, the hypothesis is h_θ(x) = 0.5·x, and J(θ₁) ≈ 0.58.

For θ₁ = 0, the hypothesis is h_θ(x) = 0, and J(θ₁) ≈ 2.3.
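
These values are consistent with a toy training set (1, 1), (2, 2), (3, 3); assuming that data (an assumption, since the plotted points are not reproduced in the text), the simplified cost can be checked numerically:

    xs, ys = [1, 2, 3], [1, 2, 3]  # assumed toy data behind the plots
    m = len(xs)

    def J(theta1):
        # Simplified cost (theta0 = 0): (1/2m) * sum_i (theta1*x_i - y_i)^2
        return sum((theta1 * xi - yi) ** 2 for xi, yi in zip(xs, ys)) / (2 * m)

    print(J(1.0))  # 0.0
    print(J(0.5))  # about 0.58
    print(J(0.0))  # about 2.33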

Hypothesis:     h_θ(x) = θ₀ + θ₁·x

Parameters:     θ₀, θ₁

Cost function:  J(θ₀, θ₁) = (1/2m) · Σᵢ₌₁ᵐ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²

Goal:           minimize J(θ₀, θ₁) over θ₀, θ₁


[Figure: left, h_θ(x) for fixed θ₀, θ₁ (price ($, in 1000's) versus size in m²); right, J(θ₀, θ₁) as a function of the parameters θ₀, θ₁]

Example hypothesis: h_θ(x) = 50 + 0.06·x


[Figures: further examples of h_θ(x) (price in $1000's versus size in m²) alongside the corresponding plots of J(θ₀, θ₁)]

Gradient descent
Algorithm
Linear regression with one variable


• Have some function J(θ₀, θ₁)
• Want min over θ₀, θ₁ of J(θ₀, θ₁)

• Outline:
  • Start with some θ₀, θ₁
  • Keep changing θ₀, θ₁ to reduce J(θ₀, θ₁) until we hopefully end up at a minimum


Gradient Descent Algorithm

repeat until convergence {
    θⱼ := θⱼ − α · ∂/∂θⱼ J(θ₀, θ₁)    (for j = 0 and j = 1)
}

Correct (simultaneous update):
    temp0 := θ₀ − α · ∂/∂θ₀ J(θ₀, θ₁)
    temp1 := θ₁ − α · ∂/∂θ₁ J(θ₀, θ₁)
    θ₀ := temp0
    θ₁ := temp1

Incorrect (not simultaneous):
    temp0 := θ₀ − α · ∂/∂θ₀ J(θ₀, θ₁)
    θ₀ := temp0
    temp1 := θ₁ − α · ∂/∂θ₁ J(θ₀, θ₁)    (uses the already-updated θ₀)
    θ₁ := temp1

Gradient Descent Algorithm

repeat until convergence {
    θⱼ := θⱼ − α · ∂/∂θⱼ J(θ₀, θ₁)    (simultaneously update j = 0 and j = 1)
}

α: learning rate
∂/∂θⱼ J(θ₀, θ₁): partial derivative
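
One simultaneous update step can be sketched in Python as follows (dJ_dtheta0 and dJ_dtheta1 are illustrative names for functions returning the two partial derivatives; the point is that both temporaries are computed before either parameter is overwritten):

    def gradient_descent_step(theta0, theta1, alpha, dJ_dtheta0, dJ_dtheta1):
        """One simultaneous gradient descent update of (theta0, theta1)."""
        temp0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
        temp1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
        return temp0, temp1  # both derivatives used the old theta0, theta1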


[Figure: J(θ₁) plotted against θ₁, with the gradient descent steps shown]

θ₁ := θ₁ − α · ∂/∂θ₁ J(θ₁)

If α is too small, gradient descent can be slow.

If α is too large, gradient descent can overshoot the minimum. It may
fail to converge, or even diverge.
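
A small numerical illustration of these two regimes (a sketch reusing the assumed toy data (1, 1), (2, 2), (3, 3) from above, with dJ/dθ₁ = (1/m)·Σ(θ₁x⁽ⁱ⁾ − y⁽ⁱ⁾)·x⁽ⁱ⁾):

    xs, ys = [1, 2, 3], [1, 2, 3]
    m = len(xs)

    def dJ(theta1):
        # Derivative of the simplified cost: (1/m) * sum_i (theta1*x_i - y_i) * x_i
        return sum((theta1 * xi - yi) * xi for xi, yi in zip(xs, ys)) / m

    def run(alpha, steps=20):
        theta1 = 0.0
        for _ in range(steps):
            theta1 -= alpha * dJ(theta1)
        return theta1

    print(run(alpha=0.01))  # too small: still far from the minimum theta1 = 1 after 20 steps
    print(run(alpha=0.1))   # reasonable: very close to 1 after 20 steps
    print(run(alpha=0.5))   # too large here: the iterates grow in magnitude (divergence)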


Gradient descent can converge to a local minimum even with the
learning rate α held fixed.

θ₁ := θ₁ − α · ∂/∂θ₁ J(θ₁)

As we approach a local minimum, gradient descent automatically
takes smaller steps (the derivative shrinks), so there is no need
to decrease α over time.


Gradient descent algorithm for the linear regression model

Linear regression model:
    h_θ(x) = θ₀ + θ₁·x
    J(θ₀, θ₁) = (1/2m) · Σᵢ₌₁ᵐ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²

Gradient descent:
    repeat until convergence {
        θⱼ := θⱼ − α · ∂/∂θⱼ J(θ₀, θ₁)    (j = 0 and j = 1)
    }

Partial derivatives:
    j = 0:  ∂/∂θ₀ J(θ₀, θ₁) = (1/m) · Σᵢ₌₁ᵐ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)
    j = 1:  ∂/∂θ₁ J(θ₀, θ₁) = (1/m) · Σᵢ₌₁ᵐ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)·x⁽ⁱ⁾
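
For completeness, a short derivation of these two derivatives via the chain rule (this step is implicit on the slide), written in LaTeX:

    \frac{\partial}{\partial \theta_j} J(\theta_0,\theta_1)
      = \frac{\partial}{\partial \theta_j}\,\frac{1}{2m}\sum_{i=1}^{m}\bigl(\theta_0 + \theta_1 x^{(i)} - y^{(i)}\bigr)^2
      = \frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\,
        \frac{\partial}{\partial \theta_j}\bigl(\theta_0 + \theta_1 x^{(i)}\bigr)

    \text{with } \frac{\partial}{\partial \theta_0}\bigl(\theta_0 + \theta_1 x^{(i)}\bigr) = 1
    \text{ and } \frac{\partial}{\partial \theta_1}\bigl(\theta_0 + \theta_1 x^{(i)}\bigr) = x^{(i)},

which gives the two expressions above.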


Gradient descent algorithm

repeat until convergence {
    θ₀ := θ₀ − α · (1/m) · Σᵢ₌₁ᵐ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)
    θ₁ := θ₁ − α · (1/m) · Σᵢ₌₁ᵐ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)·x⁽ⁱ⁾
}    (update θ₀ and θ₁ simultaneously)
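
Putting the pieces together, a minimal sketch of this loop in Python (assumptions: the small toy data set from the earlier example slide, an illustrative learning rate, and a fixed number of iterations in place of a formal convergence test):

    # Toy training data from the earlier example slide
    x = [0, 1, 2, 3]
    y = [4, 7, 7, 8]
    m = len(x)

    alpha = 0.1                 # learning rate (illustrative)
    theta0, theta1 = 0.0, 0.0   # initial parameters

    for _ in range(2000):       # fixed iteration count instead of a convergence test
        # "Batch": each step sums over all m training examples
        errors = [theta0 + theta1 * xi - yi for xi, yi in zip(x, y)]
        grad0 = sum(errors) / m
        grad1 = sum(e * xi for e, xi in zip(errors, x)) / m
        # Simultaneous update of theta0 and theta1
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1

    print(theta0, theta1)       # approaches the least-squares fit theta0 = 4.7, theta1 = 1.2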


“BATCH” Gradient Descent

“BATCH”: each step of gradient descent uses all the training examples.

Quiz:
Which of the following are true statements? Select all that apply.

a. To make gradient descent converge, we must slowly decrease α over time.
b. Gradient descent is guaranteed to find the global minimum for any function J(θ₀, θ₁).
c. Gradient descent can converge even if α is kept fixed. (But α cannot be too large, or else it may fail to converge.)
d. For the specific choice of cost function J(θ₀, θ₁) used in linear regression, there are no local optima (other than the global optimum).
