This repository implements several algorithms:

- Trust Region Policy Optimization [1]
- Proximal Policy Optimization (i.e., TRPO, but using a penalty instead of a constraint on KL divergence), where each subproblem is solved with either SGD or L-BFGS
- Cross Entropy Method
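
The difference between the first two entries can be written out as follows. This is a sketch in standard notation, not taken from this repository's code: the advantage estimate $\hat{A}_t$, the trust-region size $\delta$, and the penalty coefficient $\beta$ are symbols assumed here for illustration.

```latex
% TRPO: maximize the surrogate objective subject to a KL constraint
\max_{\theta}\; \mathbb{E}_t\!\left[ \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\, \hat{A}_t \right]
\quad \text{subject to} \quad
\mathbb{E}_t\!\left[ \mathrm{KL}\!\left( \pi_{\theta_{\text{old}}}(\cdot \mid s_t) \,\|\, \pi_{\theta}(\cdot \mid s_t) \right) \right] \le \delta

% PPO (penalty form): the same surrogate, with the KL term moved into the objective
\max_{\theta}\; \mathbb{E}_t\!\left[ \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\, \hat{A}_t \right]
- \beta\, \mathbb{E}_t\!\left[ \mathrm{KL}\!\left( \pi_{\theta_{\text{old}}}(\cdot \mid s_t) \,\|\, \pi_{\theta}(\cdot \mid s_t) \right) \right]
```

The penalized problem is unconstrained, which is why it can be handed to a generic optimizer such as SGD or L-BFGS.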

TRPO and PPO are implemented with neural-network value functions and use GAE [2].
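
For readers new to GAE, the estimator from [2] can be sketched in a few lines of NumPy. This is an illustrative standalone version, not the code used in this repository; the function name `gae_advantages` and the default values of `gamma` and `lam` are assumptions made for the example.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one trajectory.

    rewards: length-T sequence of rewards.
    values:  length-(T+1) sequence of value predictions; the last entry is
             the bootstrap value of the final state (0 if terminal).
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    T = len(rewards)
    # One-step TD residuals: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = np.zeros(T)
    running = 0.0
    # Discounted sum of residuals with factor gamma * lam, computed backwards
    for t in reversed(range(T)):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages
```

Setting `lam=0` reduces the estimate to the one-step TD residual, while `lam=1` gives the discounted return minus the value baseline.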

This library is written in a modular way so that code can be shared between the TRPO and PPO variants, and so that the same code works for different kinds of action spaces.

Dependencies:

- keras (2.0.2)
- theano (0.9.0)
- tabulate
- numpy
- scipy

To run the algorithms implemented here, you should put `modular_rl` on your `PYTHONPATH`, or run the scripts (e.g. `run_pg.py`) from this directory.
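
When working from an interactive session rather than a shell, a similar effect can be had by extending `sys.path` before importing. The snippet below is a sketch: the checkout location `~/src/modular_rl` is a hypothetical path, and it assumes that directory is the one from which `import modular_rl` resolves.

```python
import os
import sys

# Hypothetical checkout location; replace with wherever you cloned the repository.
repo_dir = os.path.expanduser("~/src/modular_rl")

# Prepending to sys.path has a similar effect to a PYTHONPATH entry,
# but only for the current interpreter session.
if repo_dir not in sys.path:
    sys.path.insert(0, repo_dir)
```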

Good parameter settings can be found in the `experiments` directory.

You can learn about the various parameters by running one of the experiment scripts with the `-h` flag while also providing the required `env` and `agent` parameters. (Those parameters determine what other parameters are available.) For example, to see the parameters of TRPO:

```
./run_pg.py --env CartPole-v0 --agent modular_rl.agentzoo.TrpoAgent -h
```

To run CEM (append `-h` to see its parameters instead):

```
./run_cem.py --env=Acrobot-v0 --agent=modular_rl.agentzoo.DeterministicAgent --n_iter=2
```
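
As background for the command above, the Cross Entropy Method itself can be sketched as a short derivative-free optimizer. This is a generic illustration with a diagonal Gaussian, not this repository's implementation; apart from `n_iter`, which mirrors the flag shown above, the function name and the parameters `batch_size` and `elite_frac` are assumptions made for the example.

```python
import numpy as np

def cem(f, mean, std, n_iter=20, batch_size=50, elite_frac=0.2, seed=0):
    """Maximize f over a parameter vector with the cross-entropy method."""
    rng = np.random.RandomState(seed)
    mean = np.array(mean, dtype=np.float64)
    std = np.array(std, dtype=np.float64)
    n_elite = max(1, int(round(batch_size * elite_frac)))
    for _ in range(n_iter):
        # Sample candidate parameter vectors from the current Gaussian
        samples = mean + std * rng.randn(batch_size, mean.size)
        scores = np.array([f(x) for x in samples])
        # Keep the highest-scoring candidates and refit the Gaussian to them
        elite = samples[np.argsort(scores)[-n_elite:]]
        mean = elite.mean(axis=0)
        std = elite.std(axis=0)
    return mean
```

In a reinforcement learning setting, `f` would typically return the total reward of an episode run with the sampled policy parameters; here any scalar-valued function works.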

[1] J Schulman, S Levine, P Moritz, M Jordan, P Abbeel, "Trust region policy optimization." arXiv preprint arXiv:1502.05477 (2015).

[2] J Schulman, P Moritz, S Levine, M Jordan, P Abbeel, "High-dimensional continuous control using generalized advantage estimation." arXiv preprint arXiv:1506.02438 (2015).