Reinforcement Learning
Introduction
In which we examine how an agent can learn from success and failure, reward and punishment.
Photo:http://www.roanoke.com/outdoors/bikepages/bikerattler.html
Reinforcement Learning
A trial-and-error learning paradigm
Rewards and Punishments
Not just an algorithm but a new paradigm in itself
Learn to control a system's behaviour from minimal feedback
RL Framework
[Diagram: the agent acts on the environment; the environment returns a state and an evaluation signal]
Learn from close interaction
Stochastic environment
Noisy, delayed scalar evaluation
Maximize a measure of long-term performance
Agent
Very sparse supervision
No target output provided
No error gradient information available
Action chooses the next state
Explore to estimate the gradient
Trial-and-error learning
[Diagram: agent-environment interaction loop; the agent receives input s_t and evaluation r_t, emits action a_t, and the environment returns r_{t+1} and s_{t+1}]
Agent and environment interact at discrete time steps t = 0, 1, 2, \ldots
At each step t the agent:
observes state s_t \in S
produces action a_t \in A(s_t)
gets resulting reward r_{t+1} and resulting next state s_{t+1}
Resulting trajectory: s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, a_{t+3}, \ldots
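A minimal sketch of this interaction loop in Python. The `env.reset()` / `env.step()` interface and the `RandomAgent` are hypothetical stand-ins used only for illustration; they are not part of the original slides.

```python
import random

class RandomAgent:
    """Hypothetical agent that ignores the state and picks a random action."""
    def __init__(self, actions):
        self.actions = actions

    def act(self, state):
        return random.choice(self.actions)

def run_episode(env, agent, max_steps=100):
    """Run one agent-environment episode.

    Assumes a Gym-style interface: env.reset() -> state,
    env.step(action) -> (next_state, reward, done).
    """
    state = env.reset()
    trajectory = []
    for t in range(max_steps):
        action = agent.act(state)                     # a_t chosen in s_t
        next_state, reward, done = env.step(action)   # environment returns r_{t+1}, s_{t+1}
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory
```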
Reinforcement learning methods specify how the agent changes its policy as a result of experience. Roughly, the agent's goal is to get as much reward as it can over the long run.
Returns
Suppose the sequence of rewards after step t is r_{t+1}, r_{t+2}, r_{t+3}, \ldots What do we want to maximize?
In general, we want to maximize the expected return E\{ R_t \} for each step t.
Episodic tasks: the interaction breaks naturally into episodes, e.g. plays of a game, trips through a maze.
R_t = r_{t+1} + r_{t+2} + \cdots + r_T,
where T is a final time step at which a terminal state is reached, ending an episode.
The agent can move {North, East, South, West}. An episode terminates on reaching [4,2] or [4,3].
The agent makes random runs (sequences of random moves) through the environment:
[1,1] -> [1,2] -> [1,3] -> [2,3] -> [3,3] -> [4,3] = +1
[1,1] -> [2,1] -> [3,1] -> [3,2] -> [4,2] = -1
Consider U(3,3)
U(3,3) = 0.33 x U(4,3) + 0.33 x U(2,3) + 0.33 x U(3,2)
       = 0.33 x 1.0 + 0.33 x 0.0886 + 0.33 x (-0.4430)
       = 0.2152
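A small Python sketch of this expected-utility calculation. It assumes, as in the slide, that each of the three neighbouring moves is taken with equal probability (the 0.33, i.e. 1/3); the neighbour utilities are the example values above.

```python
# Hypothetical neighbour utilities taken from the slide's grid example.
neighbour_utilities = {
    (4, 3): 1.0,      # terminal goal state
    (2, 3): 0.0886,
    (3, 2): -0.4430,
}

# Expected utility of (3,3) when each listed neighbour is reached
# with equal probability 1/3 (shown as 0.33 on the slide).
u_33 = sum(u / 3 for u in neighbour_utilities.values())
print(round(u_33, 4))  # ~0.2152
```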
ADP
Makes optimal use of the local constraints on the utilities of states imposed by the neighborhood structure of the environment
Somewhat intractable for large state spaces
This suggests that we should increase U(i) to make it agree better with its successor, which can be achieved using the following updating rule:
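The rule itself did not survive extraction; a standard temporal-difference form of it (as in Russell and Norvig's treatment, with learning rate \alpha and observed successor j of state i) would be:

U(i) \leftarrow U(i) + \alpha \bigl( R(i) + U(j) - U(i) \bigr)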
Important differences:
TD makes a single adjustment per observed transition
ADP makes as many adjustments as it needs to restore consistency between the utility estimates U and the environment model M
Reward expectations:
R^{a}_{ss'} = E\{ r_{t+1} \mid s_t = s,\ a_t = a,\ s_{t+1} = s' \} \quad \text{for all } s, s' \in S,\ a \in A(s).
[Transition diagram of an example MDP: states {high, low}, actions {search, wait, recharge}, with rewards R^{search} and R^{wait}]
Value Functions
The value of a state is the expected return starting from that state; it depends on the agent's policy.
State-value function for policy π:
V^\pi(s) = E_\pi\{ R_t \mid s_t = s \} = E_\pi\Bigl\{ \sum_{k=0}^{\infty} \gamma^k \, r_{t+k+1} \,\Big|\, s_t = s \Bigr\}
The value of a state-action pair is the expected return starting from that state, taking that action, and thereafter following π:
Action-value function for policy π:
Q^\pi(s,a) = E_\pi\{ R_t \mid s_t = s,\ a_t = a \} = E_\pi\Bigl\{ \sum_{k=0}^{\infty} \gamma^k \, r_{t+k+1} \,\Big|\, s_t = s,\ a_t = a \Bigr\}
R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 r_{t+4} + \cdots = r_{t+1} + \gamma R_{t+1}
So:
V^\pi(s) = E_\pi\{ R_t \mid s_t = s \}
         = E_\pi\{ r_{t+1} + \gamma ( r_{t+2} + \gamma r_{t+3} + \gamma^2 r_{t+4} + \cdots ) \mid s_t = s \}
         = E_\pi\{ r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s \}
There is always at least one (and possibly many) policy that is better than or equal to all the others. This is an optimal policy. We denote them all π*. Optimal policies share the same optimal state-value function:
V^*(s) = \max_{\pi} V^\pi(s) \quad \text{for all } s \in S
The value of a state under an optimal policy must equal the expected return for the best action from that state:
V^*(s) = \max_{a \in A(s)} E\{ r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s,\ a_t = a \}
Similarly, the optimal value of a state-action pair is the expected return for taking that action and thereafter following the optimal policy:
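The corresponding equation appears to have been lost in extraction; the standard Bellman optimality equation for Q^* is:

Q^*(s,a) = E\{ r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \mid s_t = s,\ a_t = a \}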
Dynamic Programming
DP is the solution method of choice for MDPs
Requires complete knowledge of the system dynamics (P and R)
Expensive and often not practical
Curse of dimensionality
Guaranteed to converge!
Policy Evaluation
Policy evaluation: for a given policy π, compute the state-value function V^π. Recall:
State-value function for policy π:
V^\pi(s) = E_\pi\{ R_t \mid s_t = s \} = E_\pi\Bigl\{ \sum_{k=0}^{\infty} \gamma^k \, r_{t+k+1} \,\Big|\, s_t = s \Bigr\}
Bellman equation for V^π:
V^\pi(s) = \sum_{a} \pi(s,a) \sum_{s'} P^{a}_{ss'} \bigl[ R^{a}_{ss'} + \gamma V^\pi(s') \bigr]
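A minimal sketch of iterative policy evaluation built directly on this equation. The dictionary layouts of `P`, `R` and `policy`, and the values of gamma and the convergence threshold, are illustrative assumptions.

```python
def policy_evaluation(states, P, R, policy, gamma=0.9, theta=1e-6):
    """Iteratively apply the Bellman equation for V^pi until the values converge.

    Assumed (illustrative) formats:
      P[s][a]      -> {s_next: transition probability}
      R[s][a][s2]  -> expected reward for the transition s --a--> s2
      policy[s]    -> {a: probability of taking a in s}
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = sum(
                pi_a * sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                           for s2 in P[s][a])
                for a, pi_a in policy[s].items()
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V
```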
Policy Improvement
Suppose we have computed V^π for a deterministic policy π. For a given state s, would it be better to take an action a ≠ π(s)?
It is better to switch to action a in state s if and only if Q^π(s, a) > V^π(s).
\pi'(s) = \arg\max_{a} Q^\pi(s,a) = \arg\max_{a} \sum_{s'} P^{a}_{ss'} \bigl[ R^{a}_{ss'} + \gamma V^\pi(s') \bigr]
Then V^{π'} ≥ V^π.
But this is the Bellman optimality equation. So V^{π'} = V^* and both π and π' are optimal policies.
Policy Iteration
\pi_0 \rightarrow V^{\pi_0} \rightarrow \pi_1 \rightarrow V^{\pi_1} \rightarrow \cdots \rightarrow \pi^* \rightarrow V^*
alternating policy evaluation (π → V^π) with greedy policy improvement (V → π)
Value Iteration
Recall the Bellman optimality equation:
V^*(s) = \max_{a \in A(s)} E\{ r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s,\ a_t = a \}
[Diagram: generalized policy iteration; starting from any V and π, repeatedly making π greedy with respect to V and making V consistent with π drives both to V^* and π^*]
Dynamic Programming
V(s_t) \leftarrow E_\pi\{ r_{t+1} + \gamma V(s_{t+1}) \}
[Backup diagram: the DP backup of V(s_t) averages over every action and every possible next state s_{t+1} and reward r_{t+1}; T marks terminal states]
Simplest TD Method
V(s_t) \leftarrow V(s_t) + \alpha \bigl[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \bigr]
[Backup diagram: the TD(0) backup of V(s_t) uses only the single sampled transition (r_{t+1}, s_{t+1})]
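A minimal tabular TD(0) sketch of the update above. The Gym-style `env` interface, the `policy` callable, and the values of alpha and gamma are illustrative assumptions.

```python
from collections import defaultdict

def td0_evaluation(env, policy, episodes=500, alpha=0.1, gamma=0.9):
    """Tabular TD(0) policy evaluation.

    Assumes a Gym-style env: reset() -> state, step(a) -> (next_state, reward, done),
    and policy(state) -> action.
    """
    V = defaultdict(float)
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])   # TD(0) update
            state = next_state
    return V
```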
RL Algorithms: Prediction
Policy evaluation (the prediction problem): for a given policy, compute the state-value function.
No knowledge of P and R is assumed, only access to the real system or a sample model.
Uses bootstrapping and sampling.
Advantages of TD
TD methods do not require a model of the environment, only experience
TD methods can be fully incremental:
You can learn before knowing the final outcome
Less memory
Less peak computation
RL Algorithms: Control
SARSA
After every transition from a nonterminal state s_t, do:
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \bigl[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \bigr]
If s_{t+1} is terminal, then Q(s_{t+1}, a_{t+1}) = 0.
Q-learning
One-step Q-learning:
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \bigl[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \bigr]
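A minimal tabular one-step Q-learning sketch with epsilon-greedy exploration. The Gym-style environment interface and the hyperparameter values are illustrative assumptions; changing the target as noted in the comments would give SARSA instead.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular one-step Q-learning.

    Assumes a Gym-style env: reset() -> state, step(a) -> (next_state, reward, done).
    Using Q[(next_state, next chosen action)] as the target instead of the max
    would turn this into SARSA.
    """
    Q = defaultdict(float)
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```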
Actor-Critic Methods
[Diagram: actor-critic architecture; the actor (policy) selects actions, and the critic (value function) converts the state and reward into a TD error that updates both actor and critic]
Explicit representation of policy as well as value function
Minimal computation to select actions
Can learn an explicit stochastic policy
Can put constraints on policies
Appealing as psychological and neural models
Actor-Critic Details
The TD error is used to evaluate actions:
\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)
Then the action preferences can be updated like this:
p(s_t, a_t) \leftarrow p(s_t, a_t) + \beta \, \delta_t
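A minimal tabular actor-critic sketch built on these two updates. It assumes a softmax policy over the action preferences and a Gym-style environment interface; alpha, beta and gamma are illustrative values.

```python
import math
import random
from collections import defaultdict

def actor_critic(env, actions, episodes=1000, alpha=0.1, beta=0.1, gamma=0.9):
    """Tabular actor-critic: the critic learns V by TD(0);
    the actor adjusts softmax action preferences by the TD error."""
    V = defaultdict(float)       # critic: state values
    pref = defaultdict(float)    # actor: action preferences p(s, a)

    def sample_action(state):
        weights = [math.exp(pref[(state, a)]) for a in actions]
        return random.choices(actions, weights=weights)[0]

    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            action = sample_action(state)
            next_state, reward, done = env.step(action)
            td_error = reward + (0.0 if done else gamma * V[next_state]) - V[state]
            V[state] += alpha * td_error                  # critic update
            pref[(state, action)] += beta * td_error      # actor update
            state = next_state
    return pref, V
```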
An active agent must consider:
what actions to take
what their outcomes may be
how they will affect the rewards received
The environment model now incorporates the probabilities of transitions to other states given a particular action.
The agent must maximize its expected utility.
The agent needs a performance element to choose an action at each step.
Exploration
Learning also involves the exploration of unknown areas
Photo:http://www.duke.edu/~icheese/cgeorge.html
Exploration
An agent can benefit from its actions in two ways: the immediate rewards received, and the percepts received (which improve its learned model and hence future rewards).
Exploration
Wacky Approach Vs. Greedy Approach
[Figure: grid of learned state utilities, values 0.215, 0.089, -0.038, -0.165, -0.418, -0.443, -0.544, -0.772]
Exploration
The Bandit Problem
Photos: www.freetravel.net
Exploration
The exploration function: a simple example
u = expected utility (the greedy component)
n = number of times the action has been tried (the wacky component)
R+ = best possible reward
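One simple optimistic exploration function of this kind (in the style of Russell and Norvig) is sketched below. The values of R_PLUS and the trial threshold N_E are illustrative parameters, not from the original slides.

```python
R_PLUS = 2.0   # optimistic estimate of the best possible reward (illustrative)
N_E = 5        # try each action at least this many times before trusting u

def exploration_function(u, n):
    """Return an optimistic utility: pretend untried actions are great.

    u: current expected utility estimate (greed)
    n: number of times the action has been tried (wacky)
    """
    return R_PLUS if n < N_E else u
```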
Explicit Representation
So far we have assumed that all the functions learned by the agent (U, M, R, Q) are represented in tabular form: an explicit representation has one output value for each input tuple.
Explicit Representation
Good for small state spaces, but the time to convergence and the time per iteration increase rapidly as the space gets larger.
It may be possible to handle 10,000 states or more.
This suffices for 2-dimensional, maze-like environments.
Explicit Representation
Problem: more realistic worlds are out of the question. E.g. chess and backgammon are tiny subsets of the real world, yet their state spaces contain on the order of 10^50 to 10^120 states. It would be absurd to suppose that one must visit all these states in order to learn how to play the game.
Implicit Representation
Overcomes the explicit-representation problem: a form that allows one to calculate the output for any input, but that is much more compact than the tabular form.
Implicit Representation
For example, an estimated utility function for game playing can be represented as a weighted linear function of a set of board features f_1, ..., f_n:
U(i) = w_1 f_1(i) + w_2 f_2(i) + ... + w_n f_n(i)
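A small sketch of such a weighted linear utility function, with a delta-rule weight update that nudges U toward an observed return. The feature functions, feature names and learning rate are illustrative assumptions.

```python
def linear_utility(weights, features, state):
    """U(i) = w1*f1(i) + ... + wn*fn(i) for a list of feature functions."""
    return sum(w * f(state) for w, f in zip(weights, features))

def delta_rule_update(weights, features, state, target, lr=0.01):
    """Nudge the weights so U(state) moves toward an observed target return."""
    error = target - linear_utility(weights, features, state)
    return [w + lr * error * f(state) for w, f in zip(weights, features)]

# Illustrative usage with two made-up board features.
features = [lambda s: s["material_balance"], lambda s: s["mobility"]]
weights = [0.0, 0.0]
state = {"material_balance": 1.0, "mobility": 0.3}
weights = delta_rule_update(weights, features, state, target=0.8)
```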
Implicit Representation
The utility function is characterized by the n weights. A typical chess evaluation function might only have 10 weights, so this is an enormous compression.
Implicit Representation
This enormous compression, achieved by an implicit representation, allows the learning agent to generalize from states it has visited to states it has not visited.
The most important aspect: it allows for inductive generalization over input states. Therefore, such methods are said to perform input generalization.
Game-playing: Galapagos
Mendel is a four-legged, spider-like creature
He has goals and desires, rather than instructions
Through trial and error, he programs himself to satisfy those desires
He is born not even knowing how to walk, and he has to learn to identify all of the deadly things in his environment
He has two basic drives: move, and avoid pain (negative reinforcement)
Game-playing: Galapagos
The player has no direct control over Mendel
The player turns various objects on and off and activates devices in order to guide him
The player has to let Mendel die a few times, otherwise he'll never learn
Each death proves to be a valuable lesson, as the more experienced Mendel begins to avoid the things that cause him pain
Input Generalisation
The cart-pole problem: the task is to balance a long pole upright on top of a moving cart.
Input Generalisation
The cart can be jerked left or right by a controller that observes x, ẋ, θ, and θ̇. The earliest work on learning for this problem was carried out by Michie and Chambers (1968); their BOXES algorithm was able to balance the pole for over an hour after only about 30 trials.
Input Generalisation
The algorithm first discretized the 4-dimensional state space into boxes, hence the name. It then ran trials until the pole fell over or the cart hit the end of the track. Negative reinforcement was associated with the final action in the final box and then propagated back through the sequence.
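A tiny sketch of the kind of state discretization BOXES relies on. The bin edges below are illustrative placeholders, not the partition Michie and Chambers actually used.

```python
import bisect

# Illustrative bin edges for each of the four state variables (x, x_dot, theta, theta_dot).
BIN_EDGES = [
    [-1.0, 0.0, 1.0],   # cart position x
    [-0.5, 0.5],        # cart velocity x_dot
    [-0.1, 0.0, 0.1],   # pole angle theta (radians)
    [-0.5, 0.5],        # pole angular velocity theta_dot
]

def state_to_box(state):
    """Map a continuous 4-D state to a discrete 'box' index tuple."""
    return tuple(bisect.bisect(edges, value) for edges, value in zip(BIN_EDGES, state))

# Example: a slightly tilted pole on a centred cart.
print(state_to_box((0.2, 0.0, 0.05, -0.1)))  # -> (2, 1, 2, 1)
```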
Input Generalisation
The discretization causes some problems when the apparatus is initialized in a different position. An improvement: use an algorithm that adaptively partitions the state space according to the observed variation in the reward.
Conclusion
Passive learning in a known environment
Passive learning in an unknown environment
Active learning in an unknown environment
Exploration
Learning an action-value function
Generalization in reinforcement learning
Genetic algorithms and evolutionary programming