
Machine learning
Model and cost function

Project important dates


• 28 March: title, objectives, and short abstract (100 words)
• 25 April: literature review (1500-2500 words)
• 8 May: data and method – empirical results
• Conclusion
• Publication – conference or journal


Today’s Lecture
• Review: Supervised Learning (Regression, Classification)
• Linear Regression with One Variable
• Hypothesis Function
• Cost Function
• Gradient Descent
• Linear Algebra Review
• PC Lab: LaTeX

Review:
Supervised Learning - Regression

• Housing price prediction.

[Figure: scatter plot of price ($, in 1000's) versus size in m²]

Supervised Learning / Regression: predict a continuous-valued
output (price), given “right answers” (labelled examples).


Review:
Supervised Learning - Classification

• Classification: discrete-valued output (0 or 1)

Review:
Supervised learning


Review:
Unsupervised Learning

• The data does not have any labels.
• We are not told what to do with the data, or what each data point is.
• Can you find some structure in the data?

[Figure: unlabelled data set]

Linear regression with one variable
Model Representation


Why study linear regression?

• Also known as “least squares”
• “Least squares” is at least 200 years old (Legendre, Gauss)
• Francis Galton: regression to mediocrity (1886)
• Real processes can often be approximated by linear models
• More complicated models require an understanding of linear regression
• Many key notions of machine learning can be introduced with it

Linear regression with one variable
• Model Representation

Linear regression with one variable is also known as
“univariate linear regression”. Univariate linear regression is
used when you want to predict a single output value y from a
single input value x.


A toy example: Housing prices

[Figure: price (in 1000s of dollars) versus size (m²)]

Supervised Learning / Regression Problem
• Given the “right answer” for each example in the data, predict a real-valued output.

Training set of housing prices:

Size in m² (x)    Price ($) in 1000's (y)
210               460
145               234
153               315
85                178
…                 …

Notation:
m = number of training examples
x = “input” variable / features
y = “output” variable / “target” variable
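
As an illustration, this training set and notation could be written directly in Python (a minimal sketch; the variable names x, y and m simply mirror the notation above):

    # Toy housing training set: size in m^2 (x) and price in $1000's (y)
    x = [210, 145, 153, 85]   # "input" variable / feature
    y = [460, 234, 315, 178]  # "output" / "target" variable
    m = len(x)                # m = number of training examples, here 4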


Model representation

How do we represent h?

[Diagram: Training Set → Learning Algorithm → h (the hypothesis);
size of house → h → estimated price]

Linear regression with one variable is also called univariate linear regression.

The hypothesis function

• Our hypothesis function has the general form:

ŷ = h_θ(x) = θ₀ + θ₁·x

This is the equation of a straight line.

We are trying to create a function h_θ that maps our input data
(the x's) to our output data (the y's).
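
As a concrete sketch in plain Python (the function name h and the explicit parameter arguments are illustrative, not notation from the slides):

    def h(x, theta0, theta1):
        """Univariate linear regression hypothesis: h_theta(x) = theta0 + theta1 * x."""
        return theta0 + theta1 * x

    # With theta0 = 50 and theta1 = 0.06 (the example hypothesis used later),
    # a 150 m^2 house is predicted at 50 + 0.06 * 150 = 59, i.e. $59,000.
    print(h(150, 50, 0.06))  # 59.0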


Example
• Suppose we have the following set of training data:

Input x    Output y
0          4
1          7
2          7
3          8

Training set:

Size in m² (x)    Price ($) in 1000's (y)
210               460
145               234
153               315
85                178
…                 …

Hypothesis: h_θ(x) = θ₀ + θ₁·x

θᵢ's: parameters

How to choose the θᵢ's?

[Figure: three example hypotheses plotted in the (x, y) plane]
  (a) θ₀ = 1.5, θ₁ = 0.0    (b) θ₀ = 0.0, θ₁ = 0.5    (c) θ₀ = 1.0, θ₁ = 0.5

Idea: choose θ₀ and θ₁ so that h_θ(x) is close to y for our
training examples (x, y).


Cost function

Full problem:
  Hypothesis:     h_θ(x) = θ₀ + θ₁·x
  Parameters:     θ₀, θ₁
  Cost function:  J(θ₀, θ₁) = (1/2m) · Σᵢ₌₁ᵐ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²
  Goal:           minimize J(θ₀, θ₁) over θ₀, θ₁

Simplified (θ₀ = 0):
  Hypothesis:     h_θ(x) = θ₁·x
  Parameters:     θ₁
  Cost function:  J(θ₁) = (1/2m) · Σᵢ₌₁ᵐ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²
  Goal:           minimize J(θ₁) over θ₁
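
A minimal Python sketch of this cost function (assuming plain lists for the training data; names are illustrative):

    def cost(theta0, theta1, x, y):
        """Squared-error cost J(theta0, theta1) = (1/2m) * sum_i (h(x_i) - y_i)^2."""
        m = len(x)
        return sum((theta0 + theta1 * xi - yi) ** 2
                   for xi, yi in zip(x, y)) / (2 * m)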

[Figure: left, h_θ(x) for fixed θ₁ (a function of x); right, J(θ₁) (a function of the parameter θ₁)]

For θ₁ = 1, the hypothesis is h_θ(x) = x, and J(θ₁) = 0.

For θ₁ = 0.5, the hypothesis is h_θ(x) = 0.5·x, and J(θ₁) ≈ 0.58.

For θ₁ = 0, the hypothesis is h_θ(x) = 0, and J(θ₁) ≈ 2.3.
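
These values are consistent with a toy training set (1, 1), (2, 2), (3, 3); assuming that data (an assumption, since the plotted points are not reproduced in the text), the simplified cost can be checked numerically:

    xs, ys = [1, 2, 3], [1, 2, 3]  # assumed toy data behind the plots
    m = len(xs)

    def J(theta1):
        # Simplified cost (theta0 = 0): (1/2m) * sum_i (theta1*x_i - y_i)^2
        return sum((theta1 * xi - yi) ** 2 for xi, yi in zip(xs, ys)) / (2 * m)

    print(J(1.0))  # 0.0
    print(J(0.5))  # about 0.58
    print(J(0.0))  # about 2.33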

Hypothesis:     h_θ(x) = θ₀ + θ₁·x

Parameters:     θ₀, θ₁

Cost function:  J(θ₀, θ₁) = (1/2m) · Σᵢ₌₁ᵐ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²

Goal:           minimize J(θ₀, θ₁) over θ₀, θ₁


[Figure: left, h_θ(x) for fixed θ₀, θ₁ (price ($, in 1000's) versus size in m²); right, J(θ₀, θ₁) as a function of the parameters θ₀, θ₁]

Example hypothesis: h_θ(x) = 50 + 0.06·x


[Figures: further examples of h_θ(x) (price in $1000's versus size in m²) alongside the corresponding plots of J(θ₀, θ₁)]

Gradient descent
Algorithm
Linear regression with one variable


• Have some function J(θ₀, θ₁)
• Want min over θ₀, θ₁ of J(θ₀, θ₁)

• Outline:
  • Start with some θ₀, θ₁
  • Keep changing θ₀, θ₁ to reduce J(θ₀, θ₁) until we hopefully end up at a minimum


Gradient Descent Algorithm

repeat until convergence {
    θⱼ := θⱼ − α · ∂/∂θⱼ J(θ₀, θ₁)    (for j = 0 and j = 1)
}

Correct (simultaneous update):
    temp0 := θ₀ − α · ∂/∂θ₀ J(θ₀, θ₁)
    temp1 := θ₁ − α · ∂/∂θ₁ J(θ₀, θ₁)
    θ₀ := temp0
    θ₁ := temp1

Incorrect (not simultaneous):
    temp0 := θ₀ − α · ∂/∂θ₀ J(θ₀, θ₁)
    θ₀ := temp0
    temp1 := θ₁ − α · ∂/∂θ₁ J(θ₀, θ₁)    (uses the already-updated θ₀)
    θ₁ := temp1

Gradient Descent Algorithm

repeat until convergence {
    θⱼ := θⱼ − α · ∂/∂θⱼ J(θ₀, θ₁)    (simultaneously update j = 0 and j = 1)
}

α: learning rate
∂/∂θⱼ J(θ₀, θ₁): partial derivative
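
One simultaneous update step can be sketched in Python as follows (dJ_dtheta0 and dJ_dtheta1 are illustrative names for functions returning the two partial derivatives; the point is that both temporaries are computed before either parameter is overwritten):

    def gradient_descent_step(theta0, theta1, alpha, dJ_dtheta0, dJ_dtheta1):
        """One simultaneous gradient descent update of (theta0, theta1)."""
        temp0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
        temp1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
        return temp0, temp1  # both derivatives used the old theta0, theta1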


[Figure: J(θ₁) plotted against θ₁, with the gradient descent steps shown]

θ₁ := θ₁ − α · ∂/∂θ₁ J(θ₁)

If α is too small, gradient descent can be slow.

If α is too large, gradient descent can overshoot the minimum. It may
fail to converge, or even diverge.
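
A small numerical illustration of these two regimes (a sketch reusing the assumed toy data (1, 1), (2, 2), (3, 3) from above, with dJ/dθ₁ = (1/m)·Σ(θ₁x⁽ⁱ⁾ − y⁽ⁱ⁾)·x⁽ⁱ⁾):

    xs, ys = [1, 2, 3], [1, 2, 3]
    m = len(xs)

    def dJ(theta1):
        # Derivative of the simplified cost: (1/m) * sum_i (theta1*x_i - y_i) * x_i
        return sum((theta1 * xi - yi) * xi for xi, yi in zip(xs, ys)) / m

    def run(alpha, steps=20):
        theta1 = 0.0
        for _ in range(steps):
            theta1 -= alpha * dJ(theta1)
        return theta1

    print(run(alpha=0.01))  # too small: still far from the minimum theta1 = 1 after 20 steps
    print(run(alpha=0.1))   # reasonable: very close to 1 after 20 steps
    print(run(alpha=0.5))   # too large here: the iterates grow in magnitude (divergence)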


Gradient descent can converge to a local minimum even with the
learning rate α held fixed.

θ₁ := θ₁ − α · ∂/∂θ₁ J(θ₁)

As we approach a local minimum, gradient descent automatically
takes smaller steps (the derivative shrinks), so there is no need
to decrease α over time.


Gradient descent algorithm for the linear regression model

Linear regression model:
    h_θ(x) = θ₀ + θ₁·x
    J(θ₀, θ₁) = (1/2m) · Σᵢ₌₁ᵐ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²

Gradient descent:
    repeat until convergence {
        θⱼ := θⱼ − α · ∂/∂θⱼ J(θ₀, θ₁)    (j = 0 and j = 1)
    }

Partial derivatives:
    j = 0:  ∂/∂θ₀ J(θ₀, θ₁) = (1/m) · Σᵢ₌₁ᵐ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)
    j = 1:  ∂/∂θ₁ J(θ₀, θ₁) = (1/m) · Σᵢ₌₁ᵐ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)·x⁽ⁱ⁾
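
For completeness, a short derivation of these two derivatives via the chain rule (this step is implicit on the slide), written in LaTeX:

    \frac{\partial}{\partial \theta_j} J(\theta_0,\theta_1)
      = \frac{\partial}{\partial \theta_j}\,\frac{1}{2m}\sum_{i=1}^{m}\bigl(\theta_0 + \theta_1 x^{(i)} - y^{(i)}\bigr)^2
      = \frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\,
        \frac{\partial}{\partial \theta_j}\bigl(\theta_0 + \theta_1 x^{(i)}\bigr)

    \text{with } \frac{\partial}{\partial \theta_0}\bigl(\theta_0 + \theta_1 x^{(i)}\bigr) = 1
    \text{ and } \frac{\partial}{\partial \theta_1}\bigl(\theta_0 + \theta_1 x^{(i)}\bigr) = x^{(i)},

which gives the two expressions above.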


Gradient descent algorithm

repeat until convergence {
    θ₀ := θ₀ − α · (1/m) · Σᵢ₌₁ᵐ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)
    θ₁ := θ₁ − α · (1/m) · Σᵢ₌₁ᵐ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)·x⁽ⁱ⁾
}    (update θ₀ and θ₁ simultaneously)
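
Putting the pieces together, a minimal sketch of this loop in Python (assumptions: the small toy data set from the earlier example slide, an illustrative learning rate, and a fixed number of iterations in place of a formal convergence test):

    # Toy training data from the earlier example slide
    x = [0, 1, 2, 3]
    y = [4, 7, 7, 8]
    m = len(x)

    alpha = 0.1                 # learning rate (illustrative)
    theta0, theta1 = 0.0, 0.0   # initial parameters

    for _ in range(2000):       # fixed iteration count instead of a convergence test
        # "Batch": each step sums over all m training examples
        errors = [theta0 + theta1 * xi - yi for xi, yi in zip(x, y)]
        grad0 = sum(errors) / m
        grad1 = sum(e * xi for e, xi in zip(errors, x)) / m
        # Simultaneous update of theta0 and theta1
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1

    print(theta0, theta1)       # approaches the least-squares fit theta0 = 4.7, theta1 = 1.2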


“BATCH” Gradient Descent

“BATCH”: each step of gradient descent uses all the training examples.

Quiz:
Which of the following are true statements? Select all that apply.

a. To make gradient descent converge, we must slowly decrease α over time.
b. Gradient descent is guaranteed to find the global minimum for any function J(θ₀, θ₁).
c. Gradient descent can converge even if α is kept fixed. (But α cannot be too large, or else it may fail to converge.)
d. For the specific choice of cost function J(θ₀, θ₁) used in linear regression, there are no local optima (other than the global optimum).
