

Reinforcement Learning

Introduction

In which we examine how an agent can learn from success and failure, reward and punishment.


Introduction

Learning to ride a bicycle:


The goal given to the Reinforcement Learning system is simply to ride the bicycle without falling over.
The system begins riding the bicycle and performs a series of actions that result in the bicycle being tilted 45 degrees to the right.

Photo:http://www.roanoke.com/outdoors/bikepages/bikerattler.html

Introduction

Learning to ride a bicycle:


The RL system turns the handlebars to the LEFT. Result: CRASH!!! It receives negative reinforcement.
The RL system turns the handlebars to the RIGHT. Result: CRASH!!! It receives negative reinforcement.


Introduction

Learning to ride a bicycle:


The RL system has learned that the state of being tilted 45 degrees to the right is bad.
It repeats the trial, now tilted 40 degrees to the right.
By performing enough of these trial-and-error interactions with the environment, the RL system will ultimately learn how to prevent the bicycle from ever falling over.

Reinforcement Learning
A trial-and-error learning paradigm
Rewards and Punishments

Not just an algorithm but a paradigm in itself
Learn a system's behaviour and how to control it from minimal feedback

Inspired by behavioural psychology



RL Framework

(Diagram: the Agent selects an Action; the Environment returns a State and an evaluative signal.)

Learn from close interaction with the environment
Stochastic environment
Noisy, delayed scalar evaluation
Maximize a measure of long-term performance

Not Supervised Learning!


(Diagram: in supervised learning, an Input is fed to the Agent, which produces an Output; an Error is computed against a provided Target.)

Very sparse supervision
No target output provided
No error gradient information available
The action chooses the next state
The agent must explore to estimate the gradient
Trial-and-error learning


Not Unsupervised Learning

(Diagram: the Agent maps an Input to an Activation and also receives an Evaluation.)

Sparse supervision is available
Pattern detection is not the primary goal

The Agent-Environment Interface


(Diagram: at each time step the Agent receives state $s_t$ and reward $r_t$ from the Environment and emits action $a_t$; the Environment responds with $r_{t+1}$ and $s_{t+1}$.)

Agent and environment interact at discrete time steps $t = 0, 1, 2, \ldots$
The agent observes the state at step $t$: $s_t \in S$
produces an action at step $t$: $a_t \in A(s_t)$
gets the resulting reward $r_{t+1}$ and the resulting next state $s_{t+1}$

This produces a trajectory:
$$s_t,\ a_t,\ r_{t+1},\ s_{t+1},\ a_{t+1},\ r_{t+2},\ s_{t+2},\ a_{t+2},\ r_{t+3},\ s_{t+3},\ a_{t+3},\ \ldots$$


The Agent Learns a Policy


Policy at step $t$, $\pi_t$: a mapping from states to action probabilities; $\pi_t(s,a)$ = probability that $a_t = a$ when $s_t = s$.

Reinforcement learning methods specify how the agent changes its policy as a result of experience. Roughly, the agent's goal is to get as much reward as it can over the long run.

Goals and Rewards


Is a scalar reward signal an adequate notion of a goal? Maybe not, but it is surprisingly flexible.
A goal should specify what we want to achieve, not how we want to achieve it.
A goal must be outside the agent's direct control, and thus outside the agent.
The agent must be able to measure success:
explicitly; frequently during its lifespan.


Returns
Suppose the sequence of rewards after step $t$ is $r_{t+1}, r_{t+2}, r_{t+3}, \ldots$  What do we want to maximize?

In general, we want to maximize the expected return $E\{R_t\}$ for each step $t$.

Episodic tasks: interaction breaks naturally into episodes, e.g., plays of a game, trips through a maze.

$$R_t = r_{t+1} + r_{t+2} + \cdots + r_T,$$
where $T$ is a final time step at which a terminal state is reached, ending an episode.

Passive Learning in a Known Environment


Passive Learner: A passive learner simply watches the world going by, and tries to learn the utility of being in various states. Another way to think of a passive learner is as an agent with a fixed policy trying to determine its benefits.


Passive Learning in a Known Environment


In passive learning, the environment generates state transitions and the agent perceives them. Consider an agent trying to learn the utilities of the states shown below:

Passive Learning in a Known Environment

The agent can move {North, East, South, West}
A run terminates on reaching [4,2] or [4,3]


Passive Learning in a Known Environment


The agent is provided with a model M_ij giving the probability of a transition from state i to state j

Passive Learning in a Known Environment


The objective is to use this information about rewards to learn the expected utility U(i) associated with each nonterminal state i.
Utilities can be learned using 3 approaches:
1) LMS (least mean squares)
2) ADP (adaptive dynamic programming)
3) TD (temporal difference learning)


Passive Learning in a Known Environment


LMS (Least Mean Squares)

The agent makes random runs (sequences of random moves) through the environment, e.g.:
[1,1] -> [1,2] -> [1,3] -> [2,3] -> [3,3] -> [4,3] = +1
[1,1] -> [2,1] -> [3,1] -> [3,2] -> [4,2] = -1

Passive Learning in a Known Environment


LMS
Collect statistics on the final payoff for each state (e.g., when in [2,3], how often is +1 reached vs. -1?)
The learner computes the average for each state
Provably converges to the true expected values (utilities)
(Algorithm on page 602, Figure 20.3)
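A minimal sketch of this averaging scheme (not the book's Figure 20.3 algorithm; training sequences are assumed to be given as lists of (state, reward) pairs):

```python
from collections import defaultdict

def lms_utilities(training_sequences):
    """Direct (LMS) utility estimation: the utility of a state is the
    average of the total rewards observed from that state onward."""
    totals = defaultdict(float)   # sum of observed rewards-to-go per state
    counts = defaultdict(int)     # number of observations per state
    for seq in training_sequences:            # seq = [(state, reward), ...]
        rewards = [r for _, r in seq]
        for i, (state, _) in enumerate(seq):
            reward_to_go = sum(rewards[i:])   # undiscounted return from this state
            totals[state] += reward_to_go
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}

# The two sample runs above, with a nonzero reward only at the terminal state:
seqs = [
    [((1, 1), 0), ((1, 2), 0), ((1, 3), 0), ((2, 3), 0), ((3, 3), 0), ((4, 3), +1)],
    [((1, 1), 0), ((2, 1), 0), ((3, 1), 0), ((3, 2), 0), ((4, 2), -1)],
]
print(lms_utilities(seqs))
```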


Passive Learning in a Known Environment


LMS
Main drawback: slow convergence; it takes the agent well over 1000 training sequences to get close to the correct values.

Passive Learning in a Known Environment


ADP (Adaptive Dynamic Programming)
Uses the value or policy iteration algorithm to calculate exact utilities of states given an estimated model


Passive Learning in a Known Environment


ADP
In general:
- R(i) is the reward of being in state i
(often non-zero for only a few end states)
- M_ij is the probability of a transition from state i to j

Passive Learning in a Known Environment


ADP

Consider U(3,3)
U(3,3) = 1/3 x U(4,3) + 1/3 x U(2,3) + 1/3 x U(3,2)
       = 1/3 x (1.0 + 0.0886 - 0.4430)
       = 0.2152
(each of the three neighbouring states is reached with probability 1/3)
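The constraint being applied here is the ADP fixed-point equation U(i) = R(i) + Σ_j M_ij U(j). A rough sketch of solving it by repeated sweeps over a known model; the dictionaries M and R and the set terminals are illustrative placeholders, not names from the slides:

```python
def adp_utilities(states, terminals, M, R, iters=100):
    """Solve U(i) = R(i) + sum_j M[i][j] * U(j) by simple iteration.
    M[i][j]: probability of moving from i to j under the fixed policy.
    R[i]:    reward for being in state i."""
    U = {s: 0.0 for s in states}
    for _ in range(iters):
        for i in states:
            if i in terminals:
                U[i] = R[i]
            else:
                U[i] = R[i] + sum(p * U[j] for j, p in M[i].items())
    return U
```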


Passive Learning in a Known Environment

ADP
Makes optimal use of the local constraints on the utilities of states imposed by the neighborhood structure of the environment
Somewhat intractable for large state spaces

Passive Learning in a Known Environment


TD (Temporal Difference Learning)
The key is to use the observed transitions to adjust the values of the observed states so that they agree with the constraint equations


Passive Learning in a Known Environment


TD Learning
Suppose we observe a transition from state i to state j, with U(i) = -0.5 and U(j) = +0.5.

This suggests that we should increase U(i) to make it agree better with its successor. This can be achieved using the following update rule:
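The update rule referred to here, in Russell & Norvig's notation with learning rate α, is the temporal-difference rule
$$U(i) \leftarrow U(i) + \alpha\big(R(i) + U(j) - U(i)\big),$$
which in this example nudges U(i) upward toward R(i) + U(j).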

Passive Learning in a Known Environment


TD Learning
Performance:
Runs are noisier than LMS but give smaller error
Deals only with states observed during sample runs (not all states, unlike ADP)


Passive Learning in an Unknown Environment


The Least Mean Squares (LMS) approach and the Temporal-Difference (TD) approach operate unchanged in an initially unknown environment.
The Adaptive Dynamic Programming (ADP) approach adds a step that updates an estimated model of the environment.

Passive Learning in an Unknown Environment


ADP Approach
The environment model is learned by direct observation of transitions
The environment model M can be updated by keeping track of the percentage of times each state transitions to each of its neighbors
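A minimal sketch of this counting scheme, assuming each observed transition is reported as a (state, next_state) pair (the class and method names are illustrative):

```python
from collections import defaultdict

class TransitionModel:
    """Maintain M[i][j] as the observed fraction of transitions i -> j."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))
        self.totals = defaultdict(int)

    def observe(self, state, next_state):
        self.counts[state][next_state] += 1
        self.totals[state] += 1

    def prob(self, state, next_state):
        if self.totals[state] == 0:
            return 0.0
        return self.counts[state][next_state] / self.totals[state]
```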


Passive Learning in an Unknown Environment


ADP & TD Approaches
The ADP approach and the TD approach are closely related
Both try to make local adjustments to the utility estimates in order to make each state agree with its successors

Passive Learning in an Unknown Environment


Minor differences:
TD adjusts a state to agree with its observed successor
ADP adjusts the state to agree with all of the successors

Important differences:
TD makes a single adjustment per observed transition
ADP makes as many adjustments as it needs to restore consistency between the utility estimates U and the environment model M


Passive Learning in an Unknown Environment


To make ADP more efficient:
Directly approximate the value iteration or policy iteration algorithm
The prioritized-sweeping heuristic makes adjustments to states whose likely successors have just undergone a large adjustment in their own utility estimates

Advantages of approximate ADP:

Efficient in terms of computation
Eliminates the long value iterations that occur in the early stages

The Markov Property


The state at step $t$ means whatever information is available to the agent at step $t$ about its environment. The state can include immediate sensations, highly processed sensations, and structures built up over time from sequences of sensations. Ideally, a state should summarize past sensations so as to retain all essential information, i.e., it should have the Markov Property:

$$\Pr\{s_{t+1}=s',\ r_{t+1}=r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, \ldots, r_1, s_0, a_0\} = \Pr\{s_{t+1}=s',\ r_{t+1}=r \mid s_t, a_t\}$$

for all $s'$, $r$, and histories $s_t, a_t, r_t, s_{t-1}, a_{t-1}, \ldots, r_1, s_0, a_0$.


Markov Decision Processes


If a reinforcement learning task has the Markov Property, it is basically a Markov Decision Process (MDP).
If the state and action sets are finite, it is a finite MDP.
To define a finite MDP, you need to give: $M = \langle S, A, P, R \rangle$
state and action sets
one-step dynamics defined by transition probabilities:
$$P^{a}_{ss'} = \Pr\{s_{t+1}=s' \mid s_t = s, a_t = a\} \quad \text{for all } s, s' \in S,\ a \in A(s)$$
reward expectations:
$$R^{a}_{ss'} = E\{r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\} \quad \text{for all } s, s' \in S,\ a \in A(s)$$
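One minimal way to encode such a finite MDP in code, with P and R stored per (s, a) pair; this is just an illustrative structure, not a standard interface:

```python
from dataclasses import dataclass, field

@dataclass
class FiniteMDP:
    states: list
    actions: dict                               # actions[s] = list of actions legal in s
    P: dict = field(default_factory=dict)       # P[(s, a)] = {s_next: probability}
    R: dict = field(default_factory=dict)       # R[(s, a, s_next)] = expected reward

    def successors(self, s, a):
        """Iterate over (s_next, probability, expected_reward) triples."""
        for s_next, p in self.P[(s, a)].items():
            yield s_next, p, self.R[(s, a, s_next)]
```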

An Example Finite MDP


Recycling Robot
At each step, the robot has to decide whether it should (1) actively search for a can, (2) wait for someone to bring it a can, or (3) go to home base and recharge.
Searching is better but runs down the battery; if the robot runs out of power while searching, it has to be rescued (which is bad).
Decisions are made on the basis of the current energy level: high or low.
Reward = number of cans collected



Recycling Robot MDP


S = {high, low}
A(high) = {search, wait}
A(low) = {search, wait, recharge}

(Transition diagram: from high, search stays in high with probability $\alpha$ and drops to low with probability $1-\alpha$, with reward $R^{\text{search}}$; wait stays in high with probability 1 and reward $R^{\text{wait}}$. From low, search stays in low with probability $\beta$ and reward $R^{\text{search}}$, but with probability $1-\beta$ the robot runs out of power and is rescued, with reward $-3$ and a return to high; wait stays in low with reward $R^{\text{wait}}$; recharge returns to high with probability 1 and reward 0.)

$R^{\text{search}}$ = expected no. of cans while searching
$R^{\text{wait}}$ = expected no. of cans while waiting
$R^{\text{search}} > R^{\text{wait}}$

Value Functions
The value of a state is the expected return starting from that state; it depends on the agent's policy.

State-value function for policy $\pi$:
$$V^{\pi}(s) = E_{\pi}\{R_t \mid s_t = s\} = E_{\pi}\Big\{\sum_{k=0}^{T} \gamma^{k} r_{t+k+1} \,\Big|\, s_t = s\Big\}$$

The value of a state-action pair is the expected return starting from that state, taking that action, and thereafter following $\pi$.

Action-value function for policy $\pi$:
$$Q^{\pi}(s, a) = E_{\pi}\{R_t \mid s_t = s, a_t = a\} = E_{\pi}\Big\{\sum_{k=0}^{T} \gamma^{k} r_{t+k+1} \,\Big|\, s_t = s, a_t = a\Big\}$$

(Here $\gamma$ is the discount rate.)


Bellman Equation for a Policy $\pi$


The basic idea:

$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \gamma^{3} r_{t+4} + \cdots = r_{t+1} + \gamma\,(r_{t+2} + \gamma r_{t+3} + \gamma^{2} r_{t+4} + \cdots) = r_{t+1} + \gamma R_{t+1}$$

So:

$$V^{\pi}(s) = E_{\pi}\{R_t \mid s_t = s\} = E_{\pi}\{r_{t+1} + \gamma V^{\pi}(s_{t+1}) \mid s_t = s\}$$

Or, without the expectation operator:

$$V^{\pi}(s) = \sum_{a} \pi(s, a) \sum_{s'} P^{a}_{ss'}\big[R^{a}_{ss'} + \gamma V^{\pi}(s')\big]$$

Optimal Value Functions


For finite MDPs, policies can be partially ordered:
$\pi \ge \pi'$ if and only if $V^{\pi}(s) \ge V^{\pi'}(s)$ for all $s \in S$

There is always at least one (and possibly many) policy that is better than or equal to all the others. This is an optimal policy. We denote them all $\pi^{*}$.
Optimal policies share the same optimal state-value function:

$$V^{*}(s) = \max_{\pi} V^{\pi}(s) \quad \text{for all } s \in S$$


Bellman Optimality Equation

The value of a state under an optimal policy must equal the expected return for the best action from that state:

$$V^{*}(s) = \max_{a \in A(s)} E\{r_{t+1} + \gamma V^{*}(s_{t+1}) \mid s_t = s, a_t = a\} = \max_{a \in A(s)} \sum_{s'} P^{a}_{ss'}\big[R^{a}_{ss'} + \gamma V^{*}(s')\big]$$

$V^{*}$ is the unique solution of this system of nonlinear equations.

Bellman Optimality Equation

Similarly, the optimal value of a state-action pair is the expected return for taking that action and thereafter following the optimal policy:

$$Q^{*}(s, a) = E\big\{r_{t+1} + \gamma \max_{a'} Q^{*}(s_{t+1}, a') \mid s_t = s, a_t = a\big\} = \sum_{s'} P^{a}_{ss'}\big[R^{a}_{ss'} + \gamma \max_{a'} Q^{*}(s', a')\big]$$

$Q^{*}$ is the unique solution of this system of nonlinear equations.


Dynamic Programming
DP is the solution method of choice for MDPs
Requires complete knowledge of the system dynamics (P and R)
Expensive and often not practical
Curse of dimensionality
Guaranteed to converge!

RL methods: online approximate dynamic programming


No knowledge of P and R
Sample trajectories through the state space
Some theoretical convergence analysis available

Policy Evaluation
Policy Evaluation: for a given policy $\pi$, compute the state-value function $V^{\pi}$.

Recall the state-value function for policy $\pi$:
$$V^{\pi}(s) = E_{\pi}\{R_t \mid s_t = s\} = E_{\pi}\Big\{\sum_{k=0}^{T} \gamma^{k} r_{t+k+1} \,\Big|\, s_t = s\Big\}$$

and the Bellman equation for $V^{\pi}$:
$$V^{\pi}(s) = \sum_{a} \pi(s, a) \sum_{s'} P^{a}_{ss'}\big[R^{a}_{ss'} + \gamma V^{\pi}(s')\big]$$

This is a system of $|S|$ simultaneous linear equations; solve it iteratively.
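A short iterative policy-evaluation sketch against the FiniteMDP structure sketched earlier; the stochastic policy is assumed to be given as pi[s][a], and gamma is the discount rate:

```python
def policy_evaluation(mdp, pi, gamma=0.9, theta=1e-6):
    """Iteratively apply the Bellman equation for V^pi until values stop changing."""
    V = {s: 0.0 for s in mdp.states}
    while True:
        delta = 0.0
        for s in mdp.states:
            v_new = sum(
                pi[s][a] * sum(p * (r + gamma * V[s2])
                               for s2, p, r in mdp.successors(s, a))
                for a in mdp.actions[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V
```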




Policy Improvement
Suppose we have computed $V^{\pi}$ for a deterministic policy $\pi$. For a given state $s$, would it be better to take an action $a \ne \pi(s)$?

The value of doing $a$ in state $s$ is:

$$Q^{\pi}(s, a) = E\{r_{t+1} + \gamma V^{\pi}(s_{t+1}) \mid s_t = s, a_t = a\} = \sum_{s'} P^{a}_{ss'}\big[R^{a}_{ss'} + \gamma V^{\pi}(s')\big]$$

It is better to switch to action $a$ for state $s$ if and only if $Q^{\pi}(s, a) > V^{\pi}(s)$.

Policy Improvement Cont.


Do this for all states to get a new policy $\pi'$ that is greedy with respect to $V^{\pi}$:

$$\pi'(s) = \arg\max_{a} Q^{\pi}(s, a) = \arg\max_{a} \sum_{s'} P^{a}_{ss'}\big[R^{a}_{ss'} + \gamma V^{\pi}(s')\big]$$

Then $V^{\pi'} \ge V^{\pi}$.
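A matching greedy-improvement sketch under the same assumed FiniteMDP structure; it returns a deterministic policy mapping each state to its best action under V:

```python
def policy_improvement(mdp, V, gamma=0.9):
    """Make the policy greedy w.r.t. V: pi'(s) = argmax_a sum_s' P * (R + gamma*V)."""
    pi_new = {}
    for s in mdp.states:
        pi_new[s] = max(
            mdp.actions[s],
            key=lambda a: sum(p * (r + gamma * V[s2])
                              for s2, p, r in mdp.successors(s, a)))
    return pi_new
```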



Policy Improvement Cont.


What if $V^{\pi'} = V^{\pi}$?

i.e., for all $s \in S$,
$$V^{\pi}(s) = \max_{a} \sum_{s'} P^{a}_{ss'}\big[R^{a}_{ss'} + \gamma V^{\pi}(s')\big]\ ?$$

But this is the Bellman Optimality Equation. So $V^{\pi} = V^{*}$, and both $\pi$ and $\pi'$ are optimal policies.

Policy Iteration

$$\pi_0 \rightarrow V^{\pi_0} \rightarrow \pi_1 \rightarrow V^{\pi_1} \rightarrow \cdots \rightarrow \pi^{*} \rightarrow V^{*}$$

Alternate between policy evaluation (computing $V^{\pi}$) and policy improvement ("greedification").


Value Iteration
Recall the Bellman optimality equation:
$$V^{*}(s) = \max_{a \in A(s)} E\{r_{t+1} + \gamma V^{*}(s_{t+1}) \mid s_t = s, a_t = a\} = \max_{a \in A(s)} \sum_{s'} P^{a}_{ss'}\big[R^{a}_{ss'} + \gamma V^{*}(s')\big]$$

We can convert it into a full value iteration backup:

$$V_{k+1}(s) \leftarrow \max_{a} \sum_{s'} P^{a}_{ss'}\big[R^{a}_{ss'} + \gamma V_{k}(s')\big]$$

Iterate until convergence
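A value-iteration sketch under the same assumptions as the earlier policy-evaluation code:

```python
def value_iteration(mdp, gamma=0.9, theta=1e-6):
    """Repeatedly apply the Bellman optimality backup until convergence."""
    V = {s: 0.0 for s in mdp.states}
    while True:
        delta = 0.0
        for s in mdp.states:
            v_new = max(
                sum(p * (r + gamma * V[s2]) for s2, p, r in mdp.successors(s, a))
                for a in mdp.actions[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V
```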



Generalized Policy Iteration


Generalized Policy Iteration (GPI): any interaction of policy evaluation and policy improvement, independent of their granularity.

(Diagram: evaluation drives $V \rightarrow V^{\pi}$ while improvement drives $\pi \rightarrow \text{greedy}(V)$; a geometric metaphor for the convergence of GPI shows the two processes, starting from an initial $V$ and $\pi$, pulling toward the point where $V = V^{*}$ and $\pi = \pi^{*}$.)


Dynamic Programming
$$V(s_t) \leftarrow E_{\pi}\{r_{t+1} + \gamma V(s_{t+1})\}$$

(Backup diagram: a full DP backup from $s_t$ branches over every action and every possible successor state $s_{t+1}$, down to terminal states T.)

Simplest TD Method
$$V(s_t) \leftarrow V(s_t) + \alpha\big[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\big]$$

(Backup diagram: a TD backup uses only the single sampled transition from $s_t$ to $s_{t+1}$ with reward $r_{t+1}$; $\alpha$ is the step-size parameter.)


RL Algorithms Prediction
Policy Evaluation (the prediction problem): for a given policy, compute the state-value function.
No knowledge of P and R, but access to the real system or a sample model is assumed.
Uses bootstrapping and sampling.

The simplest TD method, TD(0):
$$V(s_t) \leftarrow V(s_t) + \alpha\big[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\big]$$
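A minimal TD(0) prediction sketch; env.reset(), env.step(a), and policy(s) are assumed interfaces standing in for the real system or a sample model:

```python
from collections import defaultdict

def td0(env, policy, episodes=1000, alpha=0.1, gamma=0.9):
    """TD(0): after each transition, move V(s) toward r + gamma * V(s')."""
    V = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])
            s = s_next
    return V
```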



Advantages of TD
TD methods do not require a model of the environment, only experience
TD methods can be fully incremental:
You can learn before knowing the final outcome
Less memory
Less peak computation

You can learn without the final outcome


From incomplete sequences



RL Algorithms Control
SARSA
After every transition from a nonterminal state $s_t$, do:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\big[r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\big]$$
If $s_{t+1}$ is terminal, then $Q(s_{t+1}, a_{t+1}) = 0$.

Q-learning
One-step Q-learning:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\big[r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\big]$$
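A one-step Q-learning sketch with ε-greedy action selection; again, env.reset() and env.step(a) are assumed interfaces, and actions lists the legal actions:

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Off-policy one-step Q-learning with an epsilon-greedy behaviour policy."""
    Q = defaultdict(float)  # Q[(s, a)]
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(a)
            best_next = 0.0 if done else max(Q[(s_next, x)] for x in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```

For SARSA, the target would instead use Q[(s_next, a_next)] for the action actually selected next.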


Actor-Critic Methods
(Diagram: the Actor maintains the Policy and selects actions; the Critic maintains the Value Function and computes a TD error from the state and reward; the TD error drives learning in both, and the action is applied to the Environment.)

Explicit representation of the policy as well as the value function
Minimal computation to select actions
Can learn an explicit stochastic policy
Can put constraints on policies
Appealing as psychological and neural models


Actor-Critic Details
The TD error is used to evaluate actions:
$$\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$$

If actions are determined by preferences $p(s, a)$ as follows:

$$\pi_t(s, a) = \Pr\{a_t = a \mid s_t = s\} = \frac{e^{p(s, a)}}{\sum_{b} e^{p(s, b)}}$$

then you can update the preferences like this:
$$p(s_t, a_t) \leftarrow p(s_t, a_t) + \beta\,\delta_t$$
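A sketch of one tabular actor-critic step using the softmax (Gibbs) policy over preferences; alpha and beta are the critic's and actor's step sizes (illustrative names):

```python
import math
import random

def softmax_action(p, s, actions):
    """Sample an action with probability exp(p[s,a]) / sum_b exp(p[s,b])."""
    exps = [math.exp(p[(s, a)]) for a in actions]
    total = sum(exps)
    threshold = random.random() * total
    cumulative = 0.0
    for a, e in zip(actions, exps):
        cumulative += e
        if threshold <= cumulative:
            return a
    return actions[-1]

def actor_critic_step(V, p, s, a, r, s_next, done, alpha=0.1, beta=0.1, gamma=0.9):
    """One transition: the critic computes the TD error; both critic and actor use it."""
    delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
    V[s] += alpha * delta        # critic update
    p[(s, a)] += beta * delta    # actor update of the preference
    return delta
```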


Active Learning in an Unknown Environment

An active agent must consider:
what actions to take
what their outcomes may be
how they will affect the rewards received


Active Learning in an Unknown Environment


Minor changes to passive learning agent :

The environment model now incorporates the probabilities of transitions to other states given a particular action
The agent must maximize its expected utility
The agent needs a performance element to choose an action at each step

Active Learning in an Unknown Environment


Active ADP Approach
The agent needs to learn the probability M^a_ij of a transition, instead of M_ij
The input to the function will include the action taken


Active Learning in an Unknown Environment


Active TD Approach
The model acquisition problem for the TD agent is identical to that for the ADP agent
The update rule remains unchanged
The TD algorithm will converge to the same values as ADP as the number of training sequences tends to infinity

Exploration
Learning also involves the exploration of unknown areas

Photo:http://www.duke.edu/~icheese/cgeorge.html


Exploration

An agent can benefit from actions in 2 ways:
immediate rewards
received percepts

Exploration
Wacky Approach Vs. Greedy Approach

(Figure: the grid world annotated with learned state utilities such as -0.038, -0.165, 0.089, 0.215, -0.443, -0.418, -0.544, and -0.772, contrasting the two approaches.)


Exploration
The Bandit Problem

Photos: www.freetravel.net

Exploration
The Exploration Function: a simple example

u = the expected utility (greed)
n = the number of times the action has been tried (wacky)
R+ = the best possible reward
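In Russell & Norvig's example, one simple exploration function of this kind is optimistic about actions that have been tried fewer than N_e times; a sketch (parameter values are illustrative):

```python
def exploration_fn(u, n, R_plus=2.0, N_e=5):
    """Optimistic exploration: assume untried actions are as good as the best
    possible reward until each has been tried at least N_e times."""
    return R_plus if n < N_e else u
```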


Learning An Action Value-Function

What Are Q-Values?

Learning An Action Value-Function


The Q-Values Formula
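The formula referred to here, in the book's notation, relates Q-values to utilities and to the learned model $M^{a}_{ij}$:
$$U(i) = \max_{a} Q(a, i), \qquad Q(a, i) = R(i) + \sum_{j} M^{a}_{ij} \max_{a'} Q(a', j)$$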


Learning An Action Value-Function


The Q-Values Formula Application

Just an adaptation of the active learning equation

Learning An Action Value-Function


The TD Q-Learning Update Equation

Requires no model
Calculated after each transition from state i to j
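The update referred to here, again in the book's notation with learning rate α, is
$$Q(a, i) \leftarrow Q(a, i) + \alpha\big(R(i) + \max_{a'} Q(a', j) - Q(a, i)\big),$$
applied after each observed transition from state i to state j.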


Learning An Action Value-Function


The TD Q-Learning Update Equation in Practice
The TD-Gammon system (Tesauro); its predecessor program was Neurogammon
Attempted to learn from self-play and an implicit representation

Generalization In Reinforcement Learning

Explicit Representation
We have assumed that all the functions learned by the agent (U, M, R, Q) are represented in tabular form
An explicit representation involves one output value for each input tuple.


Generalization In Reinforcement Learning

Explicit Representation
Good for small state spaces, but the time to convergence and the time per iteration increase rapidly as the space gets larger
It may be possible to handle 10,000 states or more
This suffices for 2-dimensional, maze-like environments

Generalization In Reinforcement Learning

Explicit Representation
Problem: more realistic worlds are out of the question. E.g., chess and backgammon are tiny subsets of the real world, yet their state spaces contain on the order of 10^50 to 10^120 states. So it would be absurd to suppose that one must visit all these states in order to learn how to play the game.


Generalization In Reinforcement Learning

Implicit Representation
Overcomes the problem of explicit representation: a form that allows one to calculate the output for any input, but that is much more compact than the tabular form.

Generalization In Reinforcement Learning

Implicit Representation
For example, an estimated utility function for game playing can be represented as a weighted linear function of a set of board features f_1, ..., f_n:
U(i) = w_1 f_1(i) + w_2 f_2(i) + ... + w_n f_n(i)
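A sketch of learning such weights online with a simple delta-rule (gradient) update toward an observed or estimated utility; the feature values and target here are placeholders:

```python
def update_weights(w, features, target, lr=0.01):
    """Delta-rule update for U(i) = sum_k w[k] * f_k(i).
    features: the feature values f_1(i), ..., f_n(i) for the current state.
    target:   an observed or bootstrapped estimate of that state's utility."""
    prediction = sum(wk * fk for wk, fk in zip(w, features))
    error = target - prediction
    return [wk + lr * error * fk for wk, fk in zip(w, features)]
```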


Generalization In Reinforcement Learning

Implicit Representation
The utility function is characterized by n weights. A typical chess evaluation function might only have 10 weights, so this is enormous compression.

Generalization In Reinforcement Learning

Implicit Representation
Enormous compression is achieved by an implicit representation
It allows the learning agent to generalize from states it has visited to states it has not visited
The most important aspect: it allows for inductive generalization over input states. Therefore, such methods are said to perform input generalization.


Game-playing : Galapagos

Mendel is a four-legged, spider-like creature
He has goals and desires rather than instructions; through trial and error, he programs himself to satisfy those desires
He is born not even knowing how to walk, and he has to learn to identify all of the deadly things in his environment
He has two basic drives: move and avoid pain (negative reinforcement)

Game-playing : Galapagos

The player has no direct control over Mendel
The player turns various objects on and off and activates devices in order to guide him
The player has to let Mendel die a few times, otherwise he'll never learn
Each death proves to be a valuable lesson, as the more experienced Mendel begins to avoid the things that cause him pain


Generalization In Reinforcement Learning

Input Generalisation
The cart-pole problem: balancing a long pole upright on top of a moving cart.

Generalization In Reinforcement Learning

Input Generalisation
The cart can be jerked left or right by a controller that observes $x$, $\dot{x}$, $\theta$, and $\dot{\theta}$
The earliest work on learning for this problem was carried out by Michie and Chambers (1968)
Their BOXES algorithm was able to balance the pole for over an hour after only about 30 trials.


Generalization In Reinforcement Learning

Input Generalisation
The algorithm first discretized the 4-dimensional state space into boxes, hence the name
It then ran trials until the pole fell over or the cart hit the end of the track
Negative reinforcement was associated with the final action in the final box and then propagated back through the sequence

Generalization In Reinforcement Learning

Input Generalisation
The discretization causes some problems when the apparatus is initialized in a different position
Improvement: use an algorithm that adaptively partitions the state space according to the observed variation in the reward


Genetic Algorithms And Evolutionary Programming


A genetic algorithm starts with a set of one or more individuals that are successful, as measured by a fitness function
Several choices for the individuals exist, such as:
- Entire agent functions: the fitness function is a performance measure or reward function, and the analogy to natural selection is greatest

Genetic Algorithms And Evolutionary Programming


A genetic algorithm simply searches directly in the space of individuals, with the goal of finding one that maximizes the fitness function (a performance measure or reward function)
The search is parallel, because each individual in the population can be seen as a separate search


Genetic Algorithms And Evolutionary Programming


- Component functions of an agent: the fitness function is the critic
- Or they can be anything at all that can be framed as an optimization problem
The evolutionary process learns an agent function based on occasional rewards supplied by the selection function, so it can be seen as a form of reinforcement learning

Genetic Algorithms And Evolutionary Programming


Before we can apply a genetic algorithm to a problem, we need to answer 4 questions:
1. What is the fitness function?
2. How is an individual represented?
3. How are individuals selected?
4. How do individuals reproduce?


Genetic Algorithms And Evolutionary Programming

What is the fitness function?


It depends on the problem, but it is a function that takes an individual as input and returns a real number as output.

Genetic Algorithms And Evolutionary Programming

How is an individual represented?


In the classic genetic algorithm, an individual is represented as a string over a finite alphabet
Each element of the string is called a gene
In genetic algorithms, we usually use the binary alphabet {1, 0}, by analogy with DNA


Genetic Algorithms And Evolutionary Programming

How are individuals selected ?


The selection strategy is usually randomized, with the probability of selection proportional to fitness
For example, if an individual X scores twice as high as Y on the fitness function, then X is twice as likely to be selected for reproduction as Y
Selection is done with replacement

Genetic Algorithms And Evolutionary Programming

How do individuals reproduce?


By cross-over and mutation
All the individuals that have been selected for reproduction are randomly paired
For each pair, a cross-over point is randomly chosen; the cross-over point is a number in the range 1 to N


Genetic Algorithms And Evolutionary Programming

How do individuals reproduce?


One offspring will get genes 1 through 10 from the first parent, and the rest from the second parent
The second offspring will get genes 1 through 10 from the second parent, and the rest from the first
However, each gene can be altered by random mutation to a different value
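A minimal sketch of single-point cross-over and mutation on binary gene strings (parameter values are illustrative):

```python
import random

def reproduce(parent1, parent2, mutation_rate=0.01):
    """Single-point cross-over of two equal-length gene lists, then random mutation."""
    n = len(parent1)
    point = random.randint(1, n - 1)             # cross-over point in the range 1..N-1
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    def mutate(genes):
        # flip each binary gene independently with small probability
        return [1 - g if random.random() < mutation_rate else g for g in genes]
    return mutate(child1), mutate(child2)
```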

Conclusion
Passive Learning in a Known Environment
Passive Learning in an Unknown Environment
Active Learning in an Unknown Environment
Exploration
Learning an Action Value Function
Generalization in Reinforcement Learning
Genetic Algorithms and Evolutionary Programming


Resources And Glossary


Information source:
Russell, S. and Norvig, P. (1995). Artificial Intelligence: A Modern Approach. Upper Saddle River, NJ: Prentice Hall.

