Key Papers in Deep RL

Josh Achiam, OpenAI

The papers highlighted in blue are more important---even if you skip all of the others, you need these on your radar.

1. Model-Free RL
a. Deep Q-Learning
i. Original Deep Q-Learning Paper (​https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf​)
ii. Deep Recurrent Q-Learning (​https://arxiv.org/abs/1507.06527​)
iii. Dueling Architectures (​https://arxiv.org/abs/1511.06581​)
iv. Deep Double Q-Learning (​https://arxiv.org/abs/1509.06461​)
v. Prioritized Replay (​https://arxiv.org/abs/1511.05952​)
vi. Rainbow (​https://arxiv.org/abs/1710.02298​)
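
As a quick orientation for this subsection: the quantity shared by all of the DQN-family methods above is the one-step TD target. The sketch below is a minimal illustration in plain numpy, where q_online and q_target are hypothetical callables returning a vector of action values for a state; it shows the standard DQN target and the Double DQN variant that decouples action selection from evaluation.

import numpy as np

def dqn_target(q_target, r, s_next, done, gamma=0.99):
    # Standard DQN target: y = r + gamma * max_a' Q_target(s', a'), with no bootstrap at terminal states.
    bootstrap = 0.0 if done else gamma * float(np.max(q_target(s_next)))
    return r + bootstrap

def double_dqn_target(q_online, q_target, r, s_next, done, gamma=0.99):
    # Double DQN: the online network selects the action, the target network evaluates it.
    a_star = int(np.argmax(q_online(s_next)))
    bootstrap = 0.0 if done else gamma * float(q_target(s_next)[a_star])
    return r + bootstrap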

b. Policy Gradients
i. A2C / A3C (​https://arxiv.org/abs/1602.01783​)
ii. TRPO (​https://arxiv.org/abs/1502.05477​)
iii. TRPO+GAE (​https://arxiv.org/abs/1506.02438​)
iv. ACKTR (​https://arxiv.org/abs/1708.05144​)
v. PPO (https://arxiv.org/abs/1707.06347, https://arxiv.org/abs/1707.02286)
vi. ACER (​https://arxiv.org/abs/1611.01224​)
vii. Soft Actor-Critic (​https://arxiv.org/abs/1801.01290​)
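
For readers new to this subsection, the estimator underlying A2C/A3C, TRPO, and PPO is the likelihood-ratio policy gradient. Below is a minimal REINFORCE-style sketch in plain numpy for a linear-softmax policy; the episode format and the constant baseline are illustrative assumptions, not taken from any single paper above.

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_gradient(theta, episode, gamma=0.99):
    # theta: (n_actions, n_features) weights of a linear-softmax policy.
    # episode: list of (state_features, action, reward) tuples.
    grad = np.zeros_like(theta)
    # Reward-to-go returns G_t, computed backwards through the episode.
    G, returns = 0.0, []
    for _, _, r in reversed(episode):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    baseline = float(np.mean(returns))  # constant baseline to reduce variance
    for (phi, a, _), G_t in zip(episode, returns):
        pi = softmax(theta @ phi)
        # grad log pi(a|s) for a linear-softmax policy is (1[a'=a] - pi(a')) * phi.
        for a_prime in range(theta.shape[0]):
            indicator = 1.0 if a_prime == a else 0.0
            grad[a_prime] += (indicator - pi[a_prime]) * phi * (G_t - baseline)
    return grad  # ascend: theta += learning_rate * grad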

c. Deterministic Policy Gradients
i. Original DPG Paper (​http://proceedings.mlr.press/v32/silver14.pdf​)
ii. DDPG (​https://arxiv.org/abs/1509.02971​)
iii. TD3 (​https://arxiv.org/abs/1802.09477​)
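
A one-line summary of how TD3 stabilizes DDPG's critic: bootstrap from the smaller of two target critics, at a noise-smoothed target-policy action. A minimal sketch in numpy, where q1_targ, q2_targ, and pi_targ stand in for the target networks:

import numpy as np

def td3_target(q1_targ, q2_targ, pi_targ, r, s_next, done,
               gamma=0.99, act_limit=1.0, sigma=0.2, noise_clip=0.5):
    # Target policy smoothing: add clipped Gaussian noise to the target action.
    a_next = pi_targ(s_next)
    noise = np.clip(sigma * np.random.randn(*np.shape(a_next)), -noise_clip, noise_clip)
    a_next = np.clip(a_next + noise, -act_limit, act_limit)
    # Clipped double-Q learning: bootstrap from the smaller of the two target critics.
    q_min = min(float(q1_targ(s_next, a_next)), float(q2_targ(s_next, a_next)))
    return r + (0.0 if done else gamma * q_min)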

d. Distributional RL
i. C51 (​https://arxiv.org/abs/1707.06887​)
ii. QR-DQN (​https://arxiv.org/abs/1710.10044​)
iii. IQN (​https://arxiv.org/abs/1806.06923​)
iv. Dopamine (code library) (​paper link​, ​code link​)

e. Policy Gradients with Action-Dependent Baselines
i. Q-Prop (​https://arxiv.org/abs/1611.02247​)
ii. Stein Control Variates (​https://arxiv.org/abs/1710.11198​)
iii. Mirage of Action-Dependent Baselines (​https://arxiv.org/abs/1802.10031​)

f. Path-Consistency Learning
i. Original PCL Paper (​https://arxiv.org/abs/1702.08892​)
ii. Trust PCL (​https://arxiv.org/abs/1707.01891​)

g. Other Directions for Combining Policy Learning and Q-Learning
i. PGQ (​https://arxiv.org/abs/1611.01626​)
ii. Reactor (​https://arxiv.org/abs/1704.04651​)
iii. Interpolated Policy Gradients (​this link is way too long​)
iv. Equivalence Between PG and SQL (​https://arxiv.org/abs/1704.06440​)

h. Evolution Algorithms
i. Evolution Strategies (https://arxiv.org/abs/1703.03864)

2. Exploration
a. Intrinsic Motivation
i. VIME (​https://arxiv.org/abs/1605.09674​)
ii. Count-Based
1. Original pseudocounts paper (​https://arxiv.org/abs/1606.01868​)
2. Neural Density Models (​https://arxiv.org/abs/1703.01310​)
3. Hashing (​https://arxiv.org/abs/1611.04717​)
4. EX2 (​https://arxiv.org/abs/1703.01260​)
iii. Self-Supervised Prediction (​https://arxiv.org/abs/1705.05363​)
iv. Large-Scale Study of Curiosity (​https://arxiv.org/abs/1808.04355​)
v. Random Network Distillation (​https://arxiv.org/abs/1810.12894​)
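
The count-based entries above (pseudocounts, neural density models, hashing) share a simple recipe: give the agent extra reward for visiting rarely seen states. A toy sketch, assuming states can be hashed or discretized into a key (the beta / sqrt(N(s)) bonus form follows the hashing paper; the state_key argument is a hypothetical discretization):

import math
from collections import defaultdict

class CountBonus:
    # Tabulates visit counts over hashed/discretized states and returns an
    # exploration bonus of beta / sqrt(N(s)), to be added to the environment reward.
    def __init__(self, beta=0.01):
        self.beta = beta
        self.counts = defaultdict(int)

    def bonus(self, state_key):
        self.counts[state_key] += 1
        return self.beta / math.sqrt(self.counts[state_key])
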
b. Unsupervised RL
i. Variational Intrinsic Control (​https://arxiv.org/abs/1611.07507​)
ii. Diversity is All You Need (​https://arxiv.org/abs/1802.06070​)
iii. Variational Option Discovery Algorithms (​https://arxiv.org/abs/1807.10299​)

3. Transfer and Multitask RL
a. Progressive Networks (​https://arxiv.org/abs/1606.04671​)
b. Universal Value Function Approximators (​http://proceedings.mlr.press/v37/schaul15.pdf​)
c. RL + Unsupervised Auxiliary Tasks (​https://arxiv.org/abs/1611.05397​)
d. Intentional / Unintentional Agent (​https://arxiv.org/abs/1707.03300​)
e. PathNet (​https://arxiv.org/abs/1701.08734​)
f. Mutual Alignment Transfer Learning (​https://arxiv.org/abs/1707.07907​)
g. Learning an Embedding Space for Transfer (​https://openreview.net/pdf?id=rk07ZXZRb​)
h. Hindsight Experience Replay (​https://arxiv.org/abs/1707.01495​)
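
Hindsight Experience Replay (item h) has a concrete core: replay each transition again with the goal replaced by a goal the agent actually achieved later in the episode, and the reward recomputed for that substituted goal. A minimal relabeling sketch under an assumed transition format; the dictionary keys and the "future" sampling strategy here are illustrative, not a definitive implementation:

import random

def her_relabel(episode, compute_reward, k=4):
    # episode: list of dicts with keys 'obs', 'action', 'next_obs',
    # 'achieved_goal' (the goal state reached at next_obs), and 'goal'.
    # compute_reward(achieved_goal, goal) recomputes the (typically sparse) reward.
    relabeled = []
    for t, tr in enumerate(episode):
        future = episode[t:]  # 'future' strategy: goals achieved later in the same episode
        for _ in range(min(k, len(future))):
            new_goal = random.choice(future)['achieved_goal']
            relabeled.append({
                'obs': tr['obs'],
                'action': tr['action'],
                'next_obs': tr['next_obs'],
                'goal': new_goal,
                'reward': compute_reward(tr['achieved_goal'], new_goal),
            })
    return relabeled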

4. Hierarchy
a. Feudal Networks (​https://arxiv.org/abs/1703.01161​)
b. Strategic Attentive Writer (​https://arxiv.org/abs/1606.04695​)
c. Data-Efficient Hierarchical RL (​https://arxiv.org/abs/1805.08296​)

5. Fast Memory
a. Model-Free Episodic Control (​https://arxiv.org/abs/1606.04460​)
b. Neural Episodic Control (​https://arxiv.org/abs/1703.01988​)
c. Neural Map (​https://arxiv.org/abs/1702.08360​)
d. MERLIN (​https://arxiv.org/abs/1803.10760​)
e. Relational RNNs (​https://arxiv.org/abs/1806.01822​)

6. Model-Based
a. Learned Model
i. Imagination-Augmented Agents (​https://arxiv.org/abs/1707.06203​)
ii. Model-Based Plus Model-Free Fine-tuning (​https://arxiv.org/abs/1708.02596​)
iii. Model-Based Value Expansion (​https://arxiv.org/abs/1803.00101​)
iv. Stochastic Ensemble Value Expansion (​https://arxiv.org/abs/1807.01675​)
v. Model Ensemble TRPO (​hyperlink​)
vi. MB-MPO (​https://arxiv.org/abs/1809.05214​)
vii. World Models (​https://arxiv.org/abs/1809.01999​)
b. Given Model
i. AlphaZero (​https://arxiv.org/abs/1712.01815​)
ii. Expert Iteration (​https://arxiv.org/abs/1705.08439​)

7. Meta-RL
a. RL^2: Fast RL via Slow RL (​https://arxiv.org/abs/1611.02779​)
b. Learning to Reinforcement Learn (​https://arxiv.org/abs/1611.05763​)
c. MAML (​https://arxiv.org/abs/1703.03400​)
d. SNAIL (​link to openreview​)
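
Of the meta-RL entries above, MAML (item c) has the most compact core loop: adapt a copy of the parameters on each task with a few gradient steps, then update the meta-parameters so that this adaptation works well. The sketch below uses the first-order approximation (FOMAML), which drops the second-order term; theta is a numpy parameter vector and loss_grad(theta, task) is a hypothetical function returning the gradient of the task loss at theta.

def fomaml_update(theta, tasks, loss_grad, inner_lr=0.01, outer_lr=0.001, inner_steps=1):
    # First-order MAML: the outer gradient is approximated by the task-loss
    # gradient evaluated at the adapted parameters (d phi / d theta is ignored).
    meta_grad = 0.0
    for task in tasks:
        phi = theta
        for _ in range(inner_steps):                    # inner loop: adapt to the task
            phi = phi - inner_lr * loss_grad(phi, task)
        meta_grad = meta_grad + loss_grad(phi, task)    # outer-loop gradient contribution
    return theta - outer_lr * meta_grad / len(tasks)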

8. Scaling RL
a. Accelerated Methods for Deep RL (​https://arxiv.org/abs/1803.02811​)
b. IMPALA (​https://arxiv.org/abs/1802.01561​)
c. Distributed Prioritized Experience Replay (​https://openreview.net/forum?id=H1Dy---0Z​)
d. R2D2 (​https://openreview.net/forum?id=r1lyTjAqYX​)
e. RLlib---Distributed RL with Ray (https://arxiv.org/abs/1712.09381)

9. RL in the Real World
a. Benchmarking Deep RL in the Real World (​https://arxiv.org/abs/1809.07731​)
b. Learning Dexterity (​https://arxiv.org/abs/1808.00177​)
c. QT-Opt (​https://arxiv.org/abs/1806.10293​)

10. Safety
a. Concrete Problems in AI Safety (​https://arxiv.org/abs/1606.06565​)
b. Learning from Human Preferences (​https://arxiv.org/abs/1706.03741​)
c. Constrained Policy Optimization (​https://arxiv.org/abs/1705.10528​)
d. Safe Exploration in Continuous Action Spaces (​https://arxiv.org/abs/1801.08757​)
e. Trial Without Error (​https://arxiv.org/abs/1707.05173​)
f. Leave No Trace (​https://arxiv.org/abs/1711.06782​)

11. Imitation Learning and Inverse Reinforcement Learning
a. MaxEnt IRL Thesis (​link​)
b. Guided Cost Learning (​https://arxiv.org/abs/1603.00448​)
c. GAIL (​https://arxiv.org/abs/1606.03476​)
d. DeepMimic (​link​)
e. VAIL (​https://arxiv.org/abs/1810.00821​)
f. One-Shot High-Fidelity Imitation Learning (​https://arxiv.org/abs/1810.05017​)

12. Bonus: Classic Papers in RL Theory or RL Review
(Not necessarily deep RL, but foundational nonetheless!)

a. Policy Gradient Methods for RL with Function Approximation (link)
b. TD Learning with Function Approximation (​link​)
c. RL of Motor Skills with Policy Gradients (​link​)
d. Approximately Optimal Approximate RL (​link​)
e. A Natural Policy Gradient (​link​)
f. Algorithms for Reinforcement Learning (Szepesvari) (​link​)

Other:
● Unicorn: Continual Learning (​https://arxiv.org/pdf/1802.08294.pdf​)
● Learning by Playing (​https://arxiv.org/pdf/1802.10567.pdf​)
