

Reinforcement Learning

Introduction

In which we examine how an agent can learn from success and failure, reward and punishment.


Introduction

Learning to ride a bicycle:


The goal given to the Reinforcement Learning system is simply to ride the bicycle without falling over.
The system begins riding the bicycle and performs a series of actions that result in the bicycle being tilted 45 degrees to the right.

Photo:http://www.roanoke.com/outdoors/bikepages/bikerattler.html

Introduction

Learning to ride a bicycle:


The RL system turns the handlebars to the LEFT. Result: CRASH!!! It receives negative reinforcement.
The RL system turns the handlebars to the RIGHT. Result: CRASH!!! It receives negative reinforcement.


Introduction

Learning to ride a bicycle:


The RL system has learned that the state of being tilted 45 degrees to the right is bad.
It repeats the trial, now tilted 40 degrees to the right.
By performing enough of these trial-and-error interactions with the environment, the RL system will ultimately learn how to prevent the bicycle from ever falling over.

Reinforcement Learning
A trial-and-error learning paradigm
Rewards and Punishments

Not just an algorithm but a paradigm in itself
Learn a system's behaviour and how to control it from minimal feedback

Inspired by behavioural psychology



RL Framework

(Diagram: the Agent selects an Action; the Environment returns a State and an evaluative signal.)

Learn from close interaction with the environment
Stochastic environment
Noisy, delayed scalar evaluation
Maximize a measure of long-term performance

Not Supervised Learning!


(Diagram: in supervised learning, an Input is fed to the Agent, which produces an Output; an Error is computed against a provided Target.)

Very sparse supervision
No target output provided
No error gradient information available
The action chooses the next state
The agent must explore to estimate the gradient
Trial-and-error learning


Not Unsupervised Learning

(Diagram: the Agent maps an Input to an Activation and also receives an Evaluation.)

Sparse supervision is available
Pattern detection is not the primary goal

The Agent-Environment Interface


(Diagram: at each time step the Agent receives state $s_t$ and reward $r_t$ from the Environment and emits action $a_t$; the Environment responds with $r_{t+1}$ and $s_{t+1}$.)

Agent and environment interact at discrete time steps $t = 0, 1, 2, \ldots$
The agent observes the state at step $t$: $s_t \in S$
produces an action at step $t$: $a_t \in A(s_t)$
gets the resulting reward $r_{t+1}$ and the resulting next state $s_{t+1}$

This produces a trajectory:
$$s_t,\ a_t,\ r_{t+1},\ s_{t+1},\ a_{t+1},\ r_{t+2},\ s_{t+2},\ a_{t+2},\ r_{t+3},\ s_{t+3},\ a_{t+3},\ \ldots$$


The Agent Learns a Policy


Policy at step $t$, $\pi_t$: a mapping from states to action probabilities; $\pi_t(s,a)$ = probability that $a_t = a$ when $s_t = s$.

Reinforcement learning methods specify how the agent changes its policy as a result of experience. Roughly, the agent's goal is to get as much reward as it can over the long run.

Goals and Rewards


Is a scalar reward signal an adequate notion of a goal? Maybe not, but it is surprisingly flexible.
A goal should specify what we want to achieve, not how we want to achieve it.
A goal must be outside the agent's direct control, and thus outside the agent.
The agent must be able to measure success:
explicitly; frequently during its lifespan.


Returns
Suppose the sequence of rewards after step $t$ is $r_{t+1}, r_{t+2}, r_{t+3}, \ldots$  What do we want to maximize?

In general, we want to maximize the expected return $E\{R_t\}$ for each step $t$.

Episodic tasks: interaction breaks naturally into episodes, e.g., plays of a game, trips through a maze.

$$R_t = r_{t+1} + r_{t+2} + \cdots + r_T,$$
where $T$ is a final time step at which a terminal state is reached, ending an episode.

Passive Learning in a Known Environment


Passive Learner: A passive learner simply watches the world going by, and tries to learn the utility of being in various states. Another way to think of a passive learner is as an agent with a fixed policy trying to determine its benefits.


Passive Learning in a Known Environment


In passive learning, the environment generates state transitions and the agent perceives them. Consider an agent trying to learn the utilities of the states shown below:

Passive Learning in a Known Environment

The agent can move {North, East, South, West}
A run terminates on reaching [4,2] or [4,3]


Passive Learning in a Known Environment


The agent is provided with a model M_ij giving the probability of a transition from state i to state j

Passive Learning in a Known Environment


The objective is to use this information about rewards to learn the expected utility U(i) associated with each nonterminal state i.
Utilities can be learned using 3 approaches:
1) LMS (least mean squares)
2) ADP (adaptive dynamic programming)
3) TD (temporal difference learning)


Passive Learning in a Known Environment


LMS (Least Mean Squares)

The agent makes random runs (sequences of random moves) through the environment, e.g.:
[1,1] -> [1,2] -> [1,3] -> [2,3] -> [3,3] -> [4,3] = +1
[1,1] -> [2,1] -> [3,1] -> [3,2] -> [4,2] = -1

Passive Learning in a Known Environment


LMS
Collect statistics on the final payoff for each state (e.g., when in [2,3], how often is +1 reached vs. -1?)
The learner computes the average for each state
Provably converges to the true expected values (utilities)
(Algorithm on page 602, Figure 20.3)
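A minimal sketch of this averaging scheme (not the book's Figure 20.3 algorithm; training sequences are assumed to be given as lists of (state, reward) pairs):

```python
from collections import defaultdict

def lms_utilities(training_sequences):
    """Direct (LMS) utility estimation: the utility of a state is the
    average of the total rewards observed from that state onward."""
    totals = defaultdict(float)   # sum of observed rewards-to-go per state
    counts = defaultdict(int)     # number of observations per state
    for seq in training_sequences:            # seq = [(state, reward), ...]
        rewards = [r for _, r in seq]
        for i, (state, _) in enumerate(seq):
            reward_to_go = sum(rewards[i:])   # undiscounted return from this state
            totals[state] += reward_to_go
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}

# The two sample runs above, with a nonzero reward only at the terminal state:
seqs = [
    [((1, 1), 0), ((1, 2), 0), ((1, 3), 0), ((2, 3), 0), ((3, 3), 0), ((4, 3), +1)],
    [((1, 1), 0), ((2, 1), 0), ((3, 1), 0), ((3, 2), 0), ((4, 2), -1)],
]
print(lms_utilities(seqs))
```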


Passive Learning in a Known Environment


LMS
Main drawback: slow convergence; it takes the agent well over 1000 training sequences to get close to the correct values.

Passive Learning in a Known Environment


ADP (Adaptive Dynamic Programming)
Uses the value or policy iteration algorithm to calculate exact utilities of states given an estimated model


Passive Learning in a Known Environment


ADP
In general:
- R(i) is the reward of being in state i
(often non-zero for only a few end states)
- M_ij is the probability of a transition from state i to j

Passive Learning in a Known Environment


ADP

Consider U(3,3)
U(3,3) = 1/3 x U(4,3) + 1/3 x U(2,3) + 1/3 x U(3,2)
       = 1/3 x (1.0 + 0.0886 - 0.4430)
       = 0.2152
(each of the three neighbouring states is reached with probability 1/3)
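The constraint being applied here is the ADP fixed-point equation U(i) = R(i) + Σ_j M_ij U(j). A rough sketch of solving it by repeated sweeps over a known model; the dictionaries M and R and the set terminals are illustrative placeholders, not names from the slides:

```python
def adp_utilities(states, terminals, M, R, iters=100):
    """Solve U(i) = R(i) + sum_j M[i][j] * U(j) by simple iteration.
    M[i][j]: probability of moving from i to j under the fixed policy.
    R[i]:    reward for being in state i."""
    U = {s: 0.0 for s in states}
    for _ in range(iters):
        for i in states:
            if i in terminals:
                U[i] = R[i]
            else:
                U[i] = R[i] + sum(p * U[j] for j, p in M[i].items())
    return U
```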


Passive Learning in a Known Environment

ADP
Makes optimal use of the local constraints on the utilities of states imposed by the neighborhood structure of the environment
Somewhat intractable for large state spaces

Passive Learning in a Known Environment


TD (Temporal Difference Learning)
The key is to use the observed transitions to adjust the values of the observed states so that they agree with the constraint equations


Passive Learning in a Known Environment


TD Learning
Suppose we observe a transition from state i to state j, with U(i) = -0.5 and U(j) = +0.5.

This suggests that we should increase U(i) to make it agree better with its successor. This can be achieved using the following update rule:
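The update rule referred to here, in Russell & Norvig's notation with learning rate α, is the temporal-difference rule
$$U(i) \leftarrow U(i) + \alpha\big(R(i) + U(j) - U(i)\big),$$
which in this example nudges U(i) upward toward R(i) + U(j).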

Passive Learning in a Known Environment


TD Learning
Performance:
Runs are noisier than LMS but give smaller error
Deals only with states observed during sample runs (not all states, unlike ADP)


Passive Learning in an Unknown Environment


The Least Mean Squares (LMS) approach and the Temporal-Difference (TD) approach operate unchanged in an initially unknown environment.
The Adaptive Dynamic Programming (ADP) approach adds a step that updates an estimated model of the environment.

Passive Learning in an Unknown Environment


ADP Approach
The environment model is learned by direct observation of transitions
The environment model M can be updated by keeping track of the percentage of times each state transitions to each of its neighbors
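A minimal sketch of this counting scheme, assuming each observed transition is reported as a (state, next_state) pair (the class and method names are illustrative):

```python
from collections import defaultdict

class TransitionModel:
    """Maintain M[i][j] as the observed fraction of transitions i -> j."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))
        self.totals = defaultdict(int)

    def observe(self, state, next_state):
        self.counts[state][next_state] += 1
        self.totals[state] += 1

    def prob(self, state, next_state):
        if self.totals[state] == 0:
            return 0.0
        return self.counts[state][next_state] / self.totals[state]
```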


Passive Learning in an Unknown Environment


ADP & TD Approaches
The ADP approach and the TD approach are closely related
Both try to make local adjustments to the utility estimates in order to make each state agree with its successors

Passive Learning in an Unknown Environment


Minor differences:
TD adjusts a state to agree with its observed successor
ADP adjusts the state to agree with all of the successors

Important differences:
TD makes a single adjustment per observed transition
ADP makes as many adjustments as it needs to restore consistency between the utility estimates U and the environment model M


Passive Learning in an Unknown Environment


To make ADP more efficient:
Directly approximate the value iteration or policy iteration algorithm
The prioritized-sweeping heuristic makes adjustments to states whose likely successors have just undergone a large adjustment in their own utility estimates

Advantages of approximate ADP:

Efficient in terms of computation
Eliminates the long value iterations that occur in the early stages

The Markov Property


The state at step $t$ means whatever information is available to the agent at step $t$ about its environment. The state can include immediate sensations, highly processed sensations, and structures built up over time from sequences of sensations. Ideally, a state should summarize past sensations so as to retain all essential information, i.e., it should have the Markov Property:

$$\Pr\{s_{t+1}=s',\ r_{t+1}=r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, \ldots, r_1, s_0, a_0\} = \Pr\{s_{t+1}=s',\ r_{t+1}=r \mid s_t, a_t\}$$

for all $s'$, $r$, and histories $s_t, a_t, r_t, s_{t-1}, a_{t-1}, \ldots, r_1, s_0, a_0$.


Markov Decision Processes


If a reinforcement learning task has the Markov Property, it is basically a Markov Decision Process (MDP).
If the state and action sets are finite, it is a finite MDP.
To define a finite MDP, you need to give: $M = \langle S, A, P, R \rangle$
state and action sets
one-step dynamics defined by transition probabilities:
$$P^{a}_{ss'} = \Pr\{s_{t+1}=s' \mid s_t = s, a_t = a\} \quad \text{for all } s, s' \in S,\ a \in A(s)$$
reward expectations:
$$R^{a}_{ss'} = E\{r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\} \quad \text{for all } s, s' \in S,\ a \in A(s)$$
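One minimal way to encode such a finite MDP in code, with P and R stored per (s, a) pair; this is just an illustrative structure, not a standard interface:

```python
from dataclasses import dataclass, field

@dataclass
class FiniteMDP:
    states: list
    actions: dict                               # actions[s] = list of actions legal in s
    P: dict = field(default_factory=dict)       # P[(s, a)] = {s_next: probability}
    R: dict = field(default_factory=dict)       # R[(s, a, s_next)] = expected reward

    def successors(self, s, a):
        """Iterate over (s_next, probability, expected_reward) triples."""
        for s_next, p in self.P[(s, a)].items():
            yield s_next, p, self.R[(s, a, s_next)]
```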

An Example Finite MDP


Recycling Robot
At each step, the robot has to decide whether it should (1) actively search for a can, (2) wait for someone to bring it a can, or (3) go to home base and recharge.
Searching is better but runs down the battery; if the robot runs out of power while searching, it has to be rescued (which is bad).
Decisions are made on the basis of the current energy level: high or low.
Reward = number of cans collected



Recycling Robot MDP


S = {high, low}
A(high) = {search, wait}
A(low) = {search, wait, recharge}

(Transition diagram: from high, search stays in high with probability $\alpha$ and drops to low with probability $1-\alpha$, with reward $R^{\text{search}}$; wait stays in high with probability 1 and reward $R^{\text{wait}}$. From low, search stays in low with probability $\beta$ and reward $R^{\text{search}}$, but with probability $1-\beta$ the robot runs out of power and is rescued, with reward $-3$ and a return to high; wait stays in low with reward $R^{\text{wait}}$; recharge returns to high with probability 1 and reward 0.)

$R^{\text{search}}$ = expected no. of cans while searching
$R^{\text{wait}}$ = expected no. of cans while waiting
$R^{\text{search}} > R^{\text{wait}}$

Value Functions
The value of a state is the expected return starting from that state; it depends on the agent's policy.

State-value function for policy $\pi$:
$$V^{\pi}(s) = E_{\pi}\{R_t \mid s_t = s\} = E_{\pi}\Big\{\sum_{k=0}^{T} \gamma^{k} r_{t+k+1} \,\Big|\, s_t = s\Big\}$$

The value of a state-action pair is the expected return starting from that state, taking that action, and thereafter following $\pi$.

Action-value function for policy $\pi$:
$$Q^{\pi}(s, a) = E_{\pi}\{R_t \mid s_t = s, a_t = a\} = E_{\pi}\Big\{\sum_{k=0}^{T} \gamma^{k} r_{t+k+1} \,\Big|\, s_t = s, a_t = a\Big\}$$

(Here $\gamma$ is the discount rate.)


Bellman Equation for a Policy $\pi$


The basic idea:

$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \gamma^{3} r_{t+4} + \cdots = r_{t+1} + \gamma\,(r_{t+2} + \gamma r_{t+3} + \gamma^{2} r_{t+4} + \cdots) = r_{t+1} + \gamma R_{t+1}$$

So:

$$V^{\pi}(s) = E_{\pi}\{R_t \mid s_t = s\} = E_{\pi}\{r_{t+1} + \gamma V^{\pi}(s_{t+1}) \mid s_t = s\}$$

Or, without the expectation operator:

$$V^{\pi}(s) = \sum_{a} \pi(s, a) \sum_{s'} P^{a}_{ss'}\big[R^{a}_{ss'} + \gamma V^{\pi}(s')\big]$$

Optimal Value Functions


For finite MDPs, policies can be partially ordered:
$\pi \ge \pi'$ if and only if $V^{\pi}(s) \ge V^{\pi'}(s)$ for all $s \in S$

There is always at least one (and possibly many) policy that is better than or equal to all the others. This is an optimal policy. We denote them all $\pi^{*}$.
Optimal policies share the same optimal state-value function:

$$V^{*}(s) = \max_{\pi} V^{\pi}(s) \quad \text{for all } s \in S$$


Bellman Optimality Equation

The value of a state under an optimal policy must equal the expected return for the best action from that state:

$$V^{*}(s) = \max_{a \in A(s)} E\{r_{t+1} + \gamma V^{*}(s_{t+1}) \mid s_t = s, a_t = a\} = \max_{a \in A(s)} \sum_{s'} P^{a}_{ss'}\big[R^{a}_{ss'} + \gamma V^{*}(s')\big]$$

$V^{*}$ is the unique solution of this system of nonlinear equations.

Bellman Optimality Equation

Similarly, the optimal value of a state-action pair is the expected return for taking that action and thereafter following the optimal policy:

$$Q^{*}(s, a) = E\big\{r_{t+1} + \gamma \max_{a'} Q^{*}(s_{t+1}, a') \mid s_t = s, a_t = a\big\} = \sum_{s'} P^{a}_{ss'}\big[R^{a}_{ss'} + \gamma \max_{a'} Q^{*}(s', a')\big]$$

$Q^{*}$ is the unique solution of this system of nonlinear equations.


Dynamic Programming
DP is the solution method of choice for MDPs
Requires complete knowledge of the system dynamics (P and R)
Expensive and often not practical
Curse of dimensionality
Guaranteed to converge!

RL methods: online approximate dynamic programming


No knowledge of P and R
Sample trajectories through the state space
Some theoretical convergence analysis available

Policy Evaluation
Policy Evaluation: for a given policy $\pi$, compute the state-value function $V^{\pi}$.

Recall the state-value function for policy $\pi$:
$$V^{\pi}(s) = E_{\pi}\{R_t \mid s_t = s\} = E_{\pi}\Big\{\sum_{k=0}^{T} \gamma^{k} r_{t+k+1} \,\Big|\, s_t = s\Big\}$$

and the Bellman equation for $V^{\pi}$:
$$V^{\pi}(s) = \sum_{a} \pi(s, a) \sum_{s'} P^{a}_{ss'}\big[R^{a}_{ss'} + \gamma V^{\pi}(s')\big]$$

This is a system of $|S|$ simultaneous linear equations; solve it iteratively.
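A short iterative policy-evaluation sketch against the FiniteMDP structure sketched earlier; the stochastic policy is assumed to be given as pi[s][a], and gamma is the discount rate:

```python
def policy_evaluation(mdp, pi, gamma=0.9, theta=1e-6):
    """Iteratively apply the Bellman equation for V^pi until values stop changing."""
    V = {s: 0.0 for s in mdp.states}
    while True:
        delta = 0.0
        for s in mdp.states:
            v_new = sum(
                pi[s][a] * sum(p * (r + gamma * V[s2])
                               for s2, p, r in mdp.successors(s, a))
                for a in mdp.actions[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V
```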




Policy Improvement
Suppose we have computed $V^{\pi}$ for a deterministic policy $\pi$. For a given state $s$, would it be better to take an action $a \ne \pi(s)$?

The value of doing $a$ in state $s$ is:

$$Q^{\pi}(s, a) = E\{r_{t+1} + \gamma V^{\pi}(s_{t+1}) \mid s_t = s, a_t = a\} = \sum_{s'} P^{a}_{ss'}\big[R^{a}_{ss'} + \gamma V^{\pi}(s')\big]$$

It is better to switch to action $a$ for state $s$ if and only if $Q^{\pi}(s, a) > V^{\pi}(s)$.

Policy Improvement Cont.


Do this for all states to get a new policy $\pi'$ that is greedy with respect to $V^{\pi}$:

$$\pi'(s) = \arg\max_{a} Q^{\pi}(s, a) = \arg\max_{a} \sum_{s'} P^{a}_{ss'}\big[R^{a}_{ss'} + \gamma V^{\pi}(s')\big]$$

Then $V^{\pi'} \ge V^{\pi}$.
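A matching greedy-improvement sketch under the same assumed FiniteMDP structure; it returns a deterministic policy mapping each state to its best action under V:

```python
def policy_improvement(mdp, V, gamma=0.9):
    """Make the policy greedy w.r.t. V: pi'(s) = argmax_a sum_s' P * (R + gamma*V)."""
    pi_new = {}
    for s in mdp.states:
        pi_new[s] = max(
            mdp.actions[s],
            key=lambda a: sum(p * (r + gamma * V[s2])
                              for s2, p, r in mdp.successors(s, a)))
    return pi_new
```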



Policy Improvement Cont.


What if $V^{\pi'} = V^{\pi}$?

i.e., for all $s \in S$,
$$V^{\pi}(s) = \max_{a} \sum_{s'} P^{a}_{ss'}\big[R^{a}_{ss'} + \gamma V^{\pi}(s')\big]\ ?$$

But this is the Bellman Optimality Equation. So $V^{\pi} = V^{*}$, and both $\pi$ and $\pi'$ are optimal policies.

Policy Iteration

$$\pi_0 \rightarrow V^{\pi_0} \rightarrow \pi_1 \rightarrow V^{\pi_1} \rightarrow \cdots \rightarrow \pi^{*} \rightarrow V^{*}$$

Alternate between policy evaluation (computing $V^{\pi}$) and policy improvement ("greedification").


Value Iteration
Recall the Bellman optimality equation:
$$V^{*}(s) = \max_{a \in A(s)} E\{r_{t+1} + \gamma V^{*}(s_{t+1}) \mid s_t = s, a_t = a\} = \max_{a \in A(s)} \sum_{s'} P^{a}_{ss'}\big[R^{a}_{ss'} + \gamma V^{*}(s')\big]$$

We can convert it into a full value iteration backup:

$$V_{k+1}(s) \leftarrow \max_{a} \sum_{s'} P^{a}_{ss'}\big[R^{a}_{ss'} + \gamma V_{k}(s')\big]$$

Iterate until convergence
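A value-iteration sketch under the same assumptions as the earlier policy-evaluation code:

```python
def value_iteration(mdp, gamma=0.9, theta=1e-6):
    """Repeatedly apply the Bellman optimality backup until convergence."""
    V = {s: 0.0 for s in mdp.states}
    while True:
        delta = 0.0
        for s in mdp.states:
            v_new = max(
                sum(p * (r + gamma * V[s2]) for s2, p, r in mdp.successors(s, a))
                for a in mdp.actions[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V
```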



Generalized Policy Iteration


Generalized Policy Iteration (GPI): any interaction of policy evaluation and policy improvement, independent of their granularity.

(Diagram: evaluation drives $V \rightarrow V^{\pi}$ while improvement drives $\pi \rightarrow \text{greedy}(V)$; a geometric metaphor for the convergence of GPI shows the two processes, starting from an initial $V$ and $\pi$, pulling toward the point where $V = V^{*}$ and $\pi = \pi^{*}$.)


Dynamic Programming
$$V(s_t) \leftarrow E_{\pi}\{r_{t+1} + \gamma V(s_{t+1})\}$$

(Backup diagram: a full DP backup from $s_t$ branches over every action and every possible successor state $s_{t+1}$, down to terminal states T.)

Simplest TD Method
$$V(s_t) \leftarrow V(s_t) + \alpha\big[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\big]$$

(Backup diagram: a TD backup uses only the single sampled transition from $s_t$ to $s_{t+1}$ with reward $r_{t+1}$; $\alpha$ is the step-size parameter.)


RL Algorithms Prediction
Policy Evaluation (the prediction problem): for a given policy, compute the state-value function.
No knowledge of P and R, but access to the real system or a sample model is assumed.
Uses bootstrapping and sampling.

The simplest TD method, TD(0):
$$V(s_t) \leftarrow V(s_t) + \alpha\big[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\big]$$
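A minimal TD(0) prediction sketch; env.reset(), env.step(a), and policy(s) are assumed interfaces standing in for the real system or a sample model:

```python
from collections import defaultdict

def td0(env, policy, episodes=1000, alpha=0.1, gamma=0.9):
    """TD(0): after each transition, move V(s) toward r + gamma * V(s')."""
    V = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])
            s = s_next
    return V
```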



Advantages of TD
TD methods do not require a model of the environment, only experience
TD methods can be fully incremental:
You can learn before knowing the final outcome
Less memory
Less peak computation

You can learn without the final outcome


From incomplete sequences



RL Algorithms Control
SARSA
After every transition from a nonterminal state $s_t$, do:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\big[r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\big]$$
If $s_{t+1}$ is terminal, then $Q(s_{t+1}, a_{t+1}) = 0$.

Q-learning
One-step Q-learning:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\big[r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\big]$$
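A one-step Q-learning sketch with ε-greedy action selection; again, env.reset() and env.step(a) are assumed interfaces, and actions lists the legal actions:

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Off-policy one-step Q-learning with an epsilon-greedy behaviour policy."""
    Q = defaultdict(float)  # Q[(s, a)]
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(a)
            best_next = 0.0 if done else max(Q[(s_next, x)] for x in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```

For SARSA, the target would instead use Q[(s_next, a_next)] for the action actually selected next.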


Actor-Critic Methods
(Diagram: the Actor maintains the Policy and selects actions; the Critic maintains the Value Function and computes a TD error from the state and reward; the TD error drives learning in both, and the action is applied to the Environment.)

Explicit representation of the policy as well as the value function
Minimal computation to select actions
Can learn an explicit stochastic policy
Can put constraints on policies
Appealing as psychological and neural models


Actor-Critic Details
The TD error is used to evaluate actions:
$$\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$$

If actions are determined by preferences $p(s, a)$ as follows:

$$\pi_t(s, a) = \Pr\{a_t = a \mid s_t = s\} = \frac{e^{p(s, a)}}{\sum_{b} e^{p(s, b)}}$$

then you can update the preferences like this:
$$p(s_t, a_t) \leftarrow p(s_t, a_t) + \beta\,\delta_t$$
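A sketch of one tabular actor-critic step using the softmax (Gibbs) policy over preferences; alpha and beta are the critic's and actor's step sizes (illustrative names):

```python
import math
import random

def softmax_action(p, s, actions):
    """Sample an action with probability exp(p[s,a]) / sum_b exp(p[s,b])."""
    exps = [math.exp(p[(s, a)]) for a in actions]
    total = sum(exps)
    threshold = random.random() * total
    cumulative = 0.0
    for a, e in zip(actions, exps):
        cumulative += e
        if threshold <= cumulative:
            return a
    return actions[-1]

def actor_critic_step(V, p, s, a, r, s_next, done, alpha=0.1, beta=0.1, gamma=0.9):
    """One transition: the critic computes the TD error; both critic and actor use it."""
    delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
    V[s] += alpha * delta        # critic update
    p[(s, a)] += beta * delta    # actor update of the preference
    return delta
```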


Active Learning in an Unknown Environment

An active agent must consider:
what actions to take
what their outcomes may be
how they will affect the rewards received


Active Learning in an Unknown Environment


Minor changes to passive learning agent :

The environment model now incorporates the probabilities of transitions to other states given a particular action
The agent must maximize its expected utility
The agent needs a performance element to choose an action at each step

Active Learning in an Unknown Environment


Active ADP Approach
The agent needs to learn the probability M^a_ij of a transition, instead of M_ij
The input to the function will include the action taken


Active Learning in an Unknown Environment


Active TD Approach
The model acquisition problem for the TD agent is identical to that for the ADP agent
The update rule remains unchanged
The TD algorithm will converge to the same values as ADP as the number of training sequences tends to infinity

Exploration
Learning also involves the exploration of unknown areas

Photo:http://www.duke.edu/~icheese/cgeorge.html


Exploration

An agent can benefit from actions in 2 ways:
immediate rewards
received percepts

Exploration
Wacky Approach Vs. Greedy Approach

(Figure: the grid world annotated with learned state utilities such as -0.038, -0.165, 0.089, 0.215, -0.443, -0.418, -0.544, and -0.772, contrasting the two approaches.)


Exploration
The Bandit Problem

Photos: www.freetravel.net

Exploration
The Exploration Function: a simple example

u = the expected utility (greed)
n = the number of times the action has been tried (wacky)
R+ = the best possible reward
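In Russell & Norvig's example, one simple exploration function of this kind is optimistic about actions that have been tried fewer than N_e times; a sketch (parameter values are illustrative):

```python
def exploration_fn(u, n, R_plus=2.0, N_e=5):
    """Optimistic exploration: assume untried actions are as good as the best
    possible reward until each has been tried at least N_e times."""
    return R_plus if n < N_e else u
```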


Learning An Action Value-Function

What Are Q-Values?

Learning An Action Value-Function


The Q-Values Formula
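The formula referred to here, in the book's notation, relates Q-values to utilities and to the learned model $M^{a}_{ij}$:
$$U(i) = \max_{a} Q(a, i), \qquad Q(a, i) = R(i) + \sum_{j} M^{a}_{ij} \max_{a'} Q(a', j)$$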


Learning An Action Value-Function


The Q-Values Formula Application

Just an adaptation of the active learning equation

Learning An Action Value-Function


The TD Q-Learning Update Equation

Requires no model
Calculated after each transition from state i to j
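The update referred to here, again in the book's notation with learning rate α, is
$$Q(a, i) \leftarrow Q(a, i) + \alpha\big(R(i) + \max_{a'} Q(a', j) - Q(a, i)\big),$$
applied after each observed transition from state i to state j.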


Learning An Action Value-Function


The TD Q-Learning Update Equation in Practice
The TD-Gammon system (Tesauro); its predecessor program was Neurogammon
Attempted to learn from self-play and an implicit representation

Generalization In Reinforcement Learning

Explicit Representation
We have assumed that all the functions learned by the agent (U, M, R, Q) are represented in tabular form
An explicit representation involves one output value for each input tuple.


Generalization In Reinforcement Learning

Explicit Representation
Good for small state spaces, but the time to convergence and the time per iteration increase rapidly as the space gets larger
It may be possible to handle 10,000 states or more
This suffices for 2-dimensional, maze-like environments

Generalization In Reinforcement Learning

Explicit Representation
Problem: more realistic worlds are out of the question. E.g., chess and backgammon are tiny subsets of the real world, yet their state spaces contain on the order of 10^50 to 10^120 states. So it would be absurd to suppose that one must visit all these states in order to learn how to play the game.


Generalization In Reinforcement Learning

Implicit Representation
Overcomes the problem of explicit representation: a form that allows one to calculate the output for any input, but that is much more compact than the tabular form.

Generalization In Reinforcement Learning

Implicit Representation
For example, an estimated utility function for game playing can be represented as a weighted linear function of a set of board features f_1, ..., f_n:
U(i) = w_1 f_1(i) + w_2 f_2(i) + ... + w_n f_n(i)
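A sketch of learning such weights online with a simple delta-rule (gradient) update toward an observed or estimated utility; the feature values and target here are placeholders:

```python
def update_weights(w, features, target, lr=0.01):
    """Delta-rule update for U(i) = sum_k w[k] * f_k(i).
    features: the feature values f_1(i), ..., f_n(i) for the current state.
    target:   an observed or bootstrapped estimate of that state's utility."""
    prediction = sum(wk * fk for wk, fk in zip(w, features))
    error = target - prediction
    return [wk + lr * error * fk for wk, fk in zip(w, features)]
```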


Generalization In Reinforcement Learning

Implicit Representation
The utility function is characterized by n weights. A typical chess evaluation function might only have 10 weights, so this is enormous compression.

Generalization In Reinforcement Learning

Implicit Representation
Enormous compression is achieved by an implicit representation
It allows the learning agent to generalize from states it has visited to states it has not visited
The most important aspect: it allows for inductive generalization over input states. Therefore, such methods are said to perform input generalization.


Game-playing : Galapagos

Mendel is a four-legged, spider-like creature
He has goals and desires rather than instructions; through trial and error, he programs himself to satisfy those desires
He is born not even knowing how to walk, and he has to learn to identify all of the deadly things in his environment
He has two basic drives: move and avoid pain (negative reinforcement)

Game-playing : Galapagos

The player has no direct control over Mendel
The player turns various objects on and off and activates devices in order to guide him
The player has to let Mendel die a few times, otherwise he'll never learn
Each death proves to be a valuable lesson, as the more experienced Mendel begins to avoid the things that cause him pain


Generalization In Reinforcement Learning

Input Generalisation
The cart-pole problem: balancing a long pole upright on top of a moving cart.

Generalization In Reinforcement Learning

Input Generalisation
The cart can be jerked left or right by a controller that observes $x$, $\dot{x}$, $\theta$, and $\dot{\theta}$
The earliest work on learning for this problem was carried out by Michie and Chambers (1968)
Their BOXES algorithm was able to balance the pole for over an hour after only about 30 trials.


Generalization In Reinforcement Learning

Input Generalisation
The algorithm first discretized the 4-dimensional state space into boxes, hence the name
It then ran trials until the pole fell over or the cart hit the end of the track
Negative reinforcement was associated with the final action in the final box and then propagated back through the sequence

Generalization In Reinforcement Learning

Input Generalisation
The discretization causes some problems when the apparatus is initialized in a different position
Improvement: use an algorithm that adaptively partitions the state space according to the observed variation in the reward


Genetic Algorithms And Evolutionary Programming


A genetic algorithm starts with a set of one or more individuals that are successful, as measured by a fitness function
Several choices for the individuals exist, such as:
- Entire agent functions: the fitness function is a performance measure or reward function, and the analogy to natural selection is greatest

Genetic Algorithms And Evolutionary Programming


A genetic algorithm simply searches directly in the space of individuals, with the goal of finding one that maximizes the fitness function (a performance measure or reward function)
The search is parallel, because each individual in the population can be seen as a separate search


Genetic Algorithms And Evolutionary Programming


- Component functions of an agent: the fitness function is the critic
- Or they can be anything at all that can be framed as an optimization problem
The evolutionary process learns an agent function based on occasional rewards supplied by the selection function, so it can be seen as a form of reinforcement learning

Genetic Algorithms And Evolutionary Programming


Before we can apply a genetic algorithm to a problem, we need to answer 4 questions:
1. What is the fitness function?
2. How is an individual represented?
3. How are individuals selected?
4. How do individuals reproduce?


Genetic Algorithms And Evolutionary Programming

What is the fitness function?


It depends on the problem, but it is a function that takes an individual as input and returns a real number as output.

Genetic Algorithms And Evolutionary Programming

How is an individual represented?


In the classic genetic algorithm, an individual is represented as a string over a finite alphabet
Each element of the string is called a gene
In genetic algorithms, we usually use the binary alphabet {1, 0}, by analogy with DNA


Genetic Algorithms And Evolutionary Programming

How are individuals selected ?


The selection strategy is usually randomized, with the probability of selection proportional to fitness
For example, if an individual X scores twice as high as Y on the fitness function, then X is twice as likely to be selected for reproduction as Y
Selection is done with replacement

Genetic Algorithms And Evolutionary Programming

How do individuals reproduce?


By cross-over and mutation
All the individuals that have been selected for reproduction are randomly paired
For each pair, a cross-over point is randomly chosen; the cross-over point is a number in the range 1 to N


Genetic Algorithms And Evolutionary Programming

How do individuals reproduce?


One offspring will get genes 1 through 10 from the first parent, and the rest from the second parent
The second offspring will get genes 1 through 10 from the second parent, and the rest from the first
However, each gene can be altered by random mutation to a different value
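A minimal sketch of single-point cross-over and mutation on binary gene strings (parameter values are illustrative):

```python
import random

def reproduce(parent1, parent2, mutation_rate=0.01):
    """Single-point cross-over of two equal-length gene lists, then random mutation."""
    n = len(parent1)
    point = random.randint(1, n - 1)             # cross-over point in the range 1..N-1
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    def mutate(genes):
        # flip each binary gene independently with small probability
        return [1 - g if random.random() < mutation_rate else g for g in genes]
    return mutate(child1), mutate(child2)
```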

Conclusion
Passive Learning in a Known Environment
Passive Learning in an Unknown Environment
Active Learning in an Unknown Environment
Exploration
Learning an Action Value Function
Generalization in Reinforcement Learning
Genetic Algorithms and Evolutionary Programming


Resources And Glossary


Information source:
Russell, S. and Norvig, P. (1995). Artificial Intelligence: A Modern Approach. Upper Saddle River, NJ: Prentice Hall.

