$$\Pr(a_t = a) = \frac{e^{Q_t(a)/\tau}}{\sum_{b=1}^{n} e^{Q_t(b)/\tau}}$$
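This is the softmax (Boltzmann) action-selection rule, with τ acting as a temperature. A minimal Python sketch, with illustrative names (`q` for the value estimates, `tau` for the temperature):

```python
import numpy as np

def softmax_action(q, tau, rng=None):
    """Sample an action with probability proportional to exp(Q_t(a)/tau)."""
    rng = rng or np.random.default_rng()
    prefs = np.exp((q - np.max(q)) / tau)   # subtract max for numerical stability
    probs = prefs / prefs.sum()
    return rng.choice(len(q), p=probs)

# A high tau gives near-uniform exploration; a low tau is near-greedy.
action = softmax_action(np.array([0.2, 1.0, 0.5]), tau=0.1)
```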
62
Incremental Implementation
The sample-average method estimates the mean of the first k rewards by computing:

$$Q_k = \frac{r_1 + r_2 + \cdots + r_k}{k}$$

Problem: with each new reward, more memory is needed to store the list of rewards, and more computation is needed to recompute Q exactly.
The computational and memory requirements grow over time and are not bounded.
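To make the cost concrete, here is a naive sketch (illustrative, not from the slides) that keeps the whole reward list and recomputes the mean on every step:

```python
rewards = []                      # O(k) memory: the list grows without bound

def naive_q(rewards):
    return sum(rewards) / len(rewards)    # O(k) work on every single update

for r in [1.0, 0.0, 1.0, 1.0]:    # a stream of rewards for one action
    rewards.append(r)
    q = naive_q(rewards)          # q is recomputed from scratch each time
```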
63
Incremental Implementation
How can Q be computed step by step, without storing all the rewards?

$$Q_{k+1} = \frac{r_1 + r_2 + \cdots + r_k + r_{k+1}}{k+1} = Q_k +\ ???$$
64
The math...
$$\begin{aligned}
Q_{k+1} &= \frac{1}{k+1}\sum_{i=1}^{k+1} r_i \\
&= \frac{1}{k+1}\left( r_{k+1} + \sum_{i=1}^{k} r_i \right) \\
&= \frac{1}{k+1}\left( r_{k+1} + k\,Q_k \right) \\
&= \frac{1}{k+1}\left( r_{k+1} + (k+1)\,Q_k - Q_k \right) \\
&= Q_k + \frac{1}{k+1}\left[ r_{k+1} - Q_k \right]
\end{aligned}$$
65
Incremental Implementation
That is, Q can be computed step by step using:

$$Q_{k+1} = Q_k + \frac{1}{k+1}\left[ r_{k+1} - Q_k \right]$$

This implementation requires memory to store only Q_k, and very little computation.
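A minimal Python sketch of this update; only the current estimate and the count are kept (variable names are illustrative):

```python
def incremental_update(q, k, reward):
    """Q_{k+1} = Q_k + (r_{k+1} - Q_k) / (k + 1): O(1) time and memory."""
    return q + (reward - q) / (k + 1)

q, k = 0.0, 0
for r in [1.0, 0.0, 1.0, 1.0]:
    q = incremental_update(q, k, r)
    k += 1
# q is approximately 0.75, matching the stored-list average
```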
66
Incremental Implementation
This is a very common form for value-update rules:

NewEstimate = OldEstimate + StepSize [Target − OldEstimate]

where StepSize determines how quickly the estimates are updated.
For nonstationary cases...
67
The Nonstationary Problem
Choosing Q_k as a sample average is appropriate for problems where Q*(a) does not change over time (i.e., stationary problems).
In the nonstationary case, an exponentially weighted average should be used instead:
$$Q_{k+1} = Q_k + \alpha \left[ r_{k+1} - Q_k \right], \quad \text{for constant } \alpha,\ 0 < \alpha \le 1$$

$$Q_{k+1} = (1-\alpha)^{k+1} Q_0 + \sum_{i=1}^{k+1} \alpha (1-\alpha)^{k+1-i}\, r_i$$
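A small numeric check (illustrative values) that the incremental constant-α update and its unrolled, exponentially weighted form agree:

```python
alpha, q0 = 0.1, 0.0
rewards = [1.0, 0.0, 1.0, 1.0, 0.0]

# Incremental form: Q_{k+1} = Q_k + alpha * (r_{k+1} - Q_k)
q = q0
for r in rewards:
    q += alpha * (r - q)

# Unrolled form: (1-alpha)^k * Q_0 + sum_i alpha * (1-alpha)^(k-i) * r_i
k = len(rewards)
q_unrolled = (1 - alpha) ** k * q0 + sum(
    alpha * (1 - alpha) ** (k - i) * r for i, r in enumerate(rewards, start=1)
)
assert abs(q - q_unrolled) < 1e-12   # recent rewards weigh more than old ones
```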
68
Initial Values
The iterative method just seen depends on the initial value Q_{k=0}(a).
Suppose an optimistic initialization; in the case of the n-armed bandit: Q_0(a) = 5, for all a.
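A sketch of why this works, on a hypothetical 10-armed testbed (all setup values here are assumptions for illustration): since every realistic reward is well below 5, each tried action is "disappointing", so even a purely greedy learner is pushed to try all actions early on.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = rng.normal(0.0, 1.0, size=10)   # hypothetical 10-armed testbed

q = np.full(10, 5.0)        # optimistic: Q_0(a) = 5 for all a
counts = np.zeros(10)

for _ in range(1000):
    a = int(np.argmax(q))                    # pure greedy selection
    r = rng.normal(true_means[a], 1.0)
    counts[a] += 1
    q[a] += (r - q[a]) / counts[a]           # sample-average update
```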
69
Evaluation versus Instruction
70
Evaluation versus Instruction
The n-armed bandit problem we considered
above is a case in which the feedback is
purely evaluative.
The reward received after each action gives some
information about how good the action was, but it
says nothing at all about whether the action was
correct or incorrect, that is, whether it was a best
action or not.
Here, correctness is a relative property of actions
that can be determined only by trying them all and
comparing their rewards.
71
Evaluation versus Instruction
You have to perform some form of the
generate-and-test method whereby you
try actions, observe the outcomes, and
selectively retain those that are the
most effective.
This is learning by selection, in contrast
to learning by instruction, and all
reinforcement learning methods have to
use it in one form or another.
72
Evaluation versus Instruction
RL contrasts sharply with supervised
learning, where the feedback from the
environment directly indicates what the
correct action should have been.
In this case there is no need to search:
whatever action you try, you will be told
what the right one would have been.
There is no need to try a variety of actions;
the instructive "feedback" is typically
independent of the action selected (so is not
really feedback at all).
73
Evaluation versus Instruction
The main problem facing a supervised learning
system is to construct a mapping from situations to
actions that mimics the correct actions specified by
the environment and that generalizes correctly to new
situations.
A supervised learning system cannot be said to learn
to control its environment because it follows, rather
than influences, the instructive information it receives.
Instead of trying to make its environment behave in a
certain way, it tries to make itself behave as
instructed by its environment.
74
Binary Bandit Tasks
Suppose you have just two actions, a_t = 1 or a_t = 2, and just two rewards, r_t = success or r_t = failure.
Then you might infer a target or desired action:

$$d_t = \begin{cases} a_t & \text{if success} \\ \text{the other action} & \text{if failure} \end{cases}$$

and then always play the action that was most often the target.
Call this the supervised algorithm.
It works fine on deterministic tasks...
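A sketch of this supervised algorithm (illustrative code, assuming actions 0 and 1 and Bernoulli success probabilities):

```python
import random

def supervised_bandit(success_prob, steps=1000, seed=0):
    """success_prob[a] = P(success | action a), for actions 0 and 1."""
    rng = random.Random(seed)
    target_counts = [0, 0]
    for _ in range(steps):
        # Play the most frequent target so far (ties -> action 0).
        a = 0 if target_counts[0] >= target_counts[1] else 1
        success = rng.random() < success_prob[a]
        d = a if success else 1 - a          # inferred desired action
        target_counts[d] += 1
    return target_counts

# Fine when outcomes are deterministic, e.g. success_prob = [1.0, 0.0];
# it can lock onto the wrong action on stochastic tasks.
counts = supervised_bandit([0.9, 0.1])
```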
75
Contingency Space
The space of all possible binary bandit tasks:
76
Linear Learning Automata
Let π_t(a) = Pr{a_t = a} be the only adapted parameter.

L_{R-I} (linear, reward-inaction):
On success: π_{t+1}(a_t) = π_t(a_t) + α (1 − π_t(a_t)),  0 < α < 1
(the other action probabilities are adjusted so they still sum to 1)
On failure: no change.

L_{R-P} (linear, reward-penalty):
On success: π_{t+1}(a_t) = π_t(a_t) + α (1 − π_t(a_t)),  0 < α < 1
(the other action probabilities are adjusted so they still sum to 1)
On failure: π_{t+1}(a_t) = π_t(a_t) + α (0 − π_t(a_t)),  0 < α < 1

For two actions, this is a stochastic, incremental version of the supervised algorithm.
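A minimal sketch of both rules for the two-action case (illustrative code; the state is reduced to `pi`, the probability of action 0, so the two probabilities always sum to 1):

```python
def lri_update(pi, a, success, alpha=0.1):
    """L_{R-I}: linear reward-inaction for two actions (pi = prob of action 0)."""
    p = pi if a == 0 else 1 - pi
    if success:
        p += alpha * (1 - p)     # move the played action's probability toward 1
    return p if a == 0 else 1 - p                 # on failure: no change

def lrp_update(pi, a, success, alpha=0.1):
    """L_{R-P}: linear reward-penalty; also punishes the played action on failure."""
    p = pi if a == 0 else 1 - pi
    p += alpha * ((1 if success else 0) - p)      # toward 1 on success, 0 on failure
    return p if a == 0 else 1 - p

# e.g. pi = lri_update(pi, a=0, success=True) nudges P(action 0) toward 1
```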
77
Performance on Binary Bandit Tasks A and B
78
Partial Conclusions
Everything shown so far is very simple:
But complicated enough...
Better methods will be built on top of these.
How can these methods be improved?
Estimate uncertainties.
Use function approximators.
Introduce Bayes...
After the break: formalization of the Reinforcement Learning problem...
79
Break