## Contents

- [Welcome!](#welcome)
- [Contributors](#contributors)
- [Algorithms](#algorithms)
- [Performance](#performance)
- [Getting started](#getting-started)
- [Class Diagram](#class-diagram)
- [References](#references)

## Welcome!

This repository contains reinforcement learning algorithms used for research at Medipixel. The source code is updated frequently. We warmly welcome external contributors! :)

Demos: BC agent on LunarLanderContinuous-v2 | RainbowIQN agent on PongNoFrameskip-v4 | SAC agent on Reacher-v2

## Contributors

Thanks goes to these wonderful people:

Jinwoo Park (Curt) | Kyunghwan Kim | darthegg

This project follows the all-contributors specification.

## Algorithms

- Advantage Actor-Critic (A2C)
- Deep Deterministic Policy Gradient (DDPG)
- Proximal Policy Optimization Algorithms (PPO)
- Twin Delayed Deep Deterministic Policy Gradient Algorithm (TD3)
- Soft Actor Critic Algorithm (SAC)
- Behaviour Cloning (BC with DDPG, SAC)
- Prioritized Experience Replay (PER with DDPG); a minimal buffer sketch follows this list
- From Demonstrations (DDPGfD, SACfD, DQfD)
- Rainbow DQN
- Rainbow IQN (without DuelingNet) - DuelingNet degrades performance
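
To give a feel for one of these components, below is a minimal sketch of a proportional prioritized replay buffer (Schaul et al., 2015). It is an illustrative simplification, not this repository's implementation; a production version would typically use a sum-tree so that sampling scales as O(log n) rather than O(n).

```python
import numpy as np

class PrioritizedReplayBuffer:
    """Minimal proportional PER sketch (Schaul et al., 2015)."""

    def __init__(self, capacity, alpha=0.6):
        self.capacity = capacity
        self.alpha = alpha  # 0 = uniform sampling, 1 = fully prioritized
        self.buffer = []
        self.priorities = np.zeros(capacity, dtype=np.float64)
        self.pos = 0

    def add(self, transition):
        # New transitions get the current max priority so each one is
        # sampled at least once before its TD error is known.
        max_prio = self.priorities.max() if self.buffer else 1.0
        if len(self.buffer) < self.capacity:
            self.buffer.append(transition)
        else:
            self.buffer[self.pos] = transition
        self.priorities[self.pos] = max_prio
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size, beta=0.4):
        prios = self.priorities[: len(self.buffer)]
        probs = prios ** self.alpha
        probs /= probs.sum()
        indices = np.random.choice(len(self.buffer), batch_size, p=probs)
        # Importance-sampling weights correct the bias introduced by
        # non-uniform sampling; beta is annealed toward 1 in practice.
        weights = (len(self.buffer) * probs[indices]) ** (-beta)
        weights /= weights.max()
        return [self.buffer[i] for i in indices], indices, weights

    def update_priorities(self, indices, td_errors, eps=1e-6):
        # Priority is proportional to the magnitude of the TD error.
        self.priorities[indices] = np.abs(td_errors) + eps
```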

## Performance

We have tested each algorithm on some of the following environments.

Performance was measured at commit 4248057; please note that these results are not updated frequently.

#### Reacher-v2

We reproduced the performance of **DDPG**, **TD3**, and **SAC** on Reacher-v2 (MuJoCo). They reach scores of around -3.5 to -4.5.
See the W&B log for more details.

#### PongNoFrameskip-v4

**RainbowIQN** learns the game incredibly fast! It achieves the perfect score (21) within 100 episodes!
The idea of RainbowIQN roughly follows W. Dabney et al.
See the W&B log for more details.

#### LunarLander-v2 / LunarLanderContinuous-v2

We used these environments for quick verification of each algorithm, so some of the experiments may not show the best achievable performance.

## Getting started

#### Prerequisites

- This repository is tested on the Anaconda virtual environment with Python 3.6.1+:
  `$ conda create -n rl_algorithms python=3.6.1`
  `$ conda activate rl_algorithms`

- In order to run MuJoCo environments (e.g. `Reacher-v2`), you need a MuJoCo license.

#### Installation

First, clone the repository.

```
git clone https://github.com/medipixel/rl_algorithms.git
cd rl_algorithms
```

Second, install the packages required to run the code:

```
make dep
```

###### For developers

You need to run one additional command, which configures formatting and linting settings; formatting and linting then run automatically whenever you commit code.

```
make dev
```

After running `make dev`, you can validate the code with the following commands.

```
make format # for formatting
make test # for linting
```

#### Usages

You can train or test `algorithm` on `env_name` if `examples/env_name/algorithm.py` exists; this file contains the hyper-parameters and network details.

```
python run_env_name.py --algo algorithm
```

For example, to run Soft Actor-Critic on LunarLanderContinuous-v2:

```
python run_lunarlander_continuous_v2.py --algo sac <other-options>
```

For example, to run a custom agent, **if you have written your own example** at `examples/env_name/ddpg-custom.py`:

```
python run_env_name.py --algo ddpg-custom
```

The agent will run with the hyper-parameter and model settings you configured.
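
As a rough illustration of this naming convention, a run file might resolve `--algo` to the matching example file as in the hypothetical sketch below; the env name, module loading, and `run` entry point here are assumptions for illustration, not the repository's actual code.

```python
import argparse
import importlib.util
import os

# Hypothetical sketch of how a run file could dispatch --algo to an
# example module; the repository's actual run files may differ.
parser = argparse.ArgumentParser()
parser.add_argument("--algo", type=str, default="sac")
args, _ = parser.parse_known_args()

env_name = "lunarlander_continuous_v2"  # fixed per run file (assumption)
path = os.path.join("examples", env_name, args.algo + ".py")

# Load examples/<env_name>/<algo>.py by file path and hand over control.
spec = importlib.util.spec_from_file_location(args.algo, path)
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
module.run(args)  # hypothetical entry point defined by each example file
```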

#### Arguments for run-files

In addition, there are various arguments for running the algorithms. To see the available options for a run file, use:

```
python <run-file> -h
```

- `--test`: start test mode (no training).
- `--off-render`: turn off rendering.
- `--log`: turn on logging using W&B.
- `--seed <int>`: set the random seed.
- `--save-period <int>`: set the saving period for model and optimizer parameters.
- `--max-episode-steps <int>`: set the maximum number of steps per episode. If the value is less than or equal to 0, the environment's default maximum is used.
- `--episode-num <int>`: set the number of training episodes.
- `--render-after <int>`: start rendering after the given number of episodes.
- `--load-from <save-file-path>`: load saved models and optimizers at the start of a run.
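
For example, to train SAC on LunarLanderContinuous-v2 with W&B logging, a fixed seed, and periodic checkpointing (the specific values here are arbitrary illustrations, not recommended settings):

```
python run_lunarlander_continuous_v2.py --algo sac --log --seed 777 --save-period 100 --episode-num 1500 --off-render
```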

#### W&B for logging

We use W&B to log network parameters and other metrics. To enable logging, follow the steps below after installing the requirements:

- Create a wandb account.
- Check your API key in settings, and log in to wandb from your terminal: `$ wandb login API_KEY`
- Initialize wandb: `$ wandb init`

For more details, read the W&B tutorial.
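
For a sense of what W&B logging involves, here is a minimal, self-contained sketch of logging a metric from a training loop. The project name and `score` metric are illustrative assumptions; the repository's agents handle their own logging when `--log` is passed.

```python
import wandb

# Minimal W&B logging sketch; project name and metrics are assumptions.
wandb.init(project="rl_algorithms", config={"algo": "sac", "seed": 777})

for episode in range(10):
    score = float(episode)  # placeholder for the real episode return
    wandb.log({"episode": episode, "score": score})
```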

## Class Diagram

The class diagram is available at #135. Please note that it is not updated frequently.

## References

- T. P. Lillicrap et al., "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971, 2015.
- J. Schulman et al., "Proximal Policy Optimization Algorithms." arXiv preprint arXiv:1707.06347, 2017.
- S. Fujimoto et al., "Addressing function approximation error in actor-critic methods." arXiv preprint arXiv:1802.09477, 2018.
- T. Haarnoja et al., "Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor." arXiv preprint arXiv:1801.01290, 2018.
- T. Haarnoja et al., "Soft Actor-Critic Algorithms and Applications." arXiv preprint arXiv:1812.05905, 2018.
- T. Schaul et al., "Prioritized Experience Replay." arXiv preprint arXiv:1511.05952, 2015.
- M. Andrychowicz et al., "Hindsight Experience Replay." arXiv preprint arXiv:1707.01495, 2017.
- A. Nair et al., "Overcoming Exploration in Reinforcement Learning with Demonstrations." arXiv preprint arXiv:1709.10089, 2017.
- M. Vecerik et al., "Leveraging Demonstrations for Deep Reinforcement Learning on Robotics Problems with Sparse Rewards." arXiv preprint arXiv:1707.08817, 2017.
- V. Mnih et al., "Human-level control through deep reinforcement learning." Nature, 518 (7540):529–533, 2015.
- H. van Hasselt et al., "Deep Reinforcement Learning with Double Q-learning." arXiv preprint arXiv:1509.06461, 2015.
- Z. Wang et al., "Dueling Network Architectures for Deep Reinforcement Learning." arXiv preprint arXiv:1511.06581, 2015.
- T. Hester et al., "Deep Q-learning from Demonstrations." arXiv preprint arXiv:1704.03732, 2017.
- M. G. Bellemare et al., "A Distributional Perspective on Reinforcement Learning." arXiv preprint arXiv:1707.06887, 2017.
- M. Fortunato et al., "Noisy Networks for Exploration." arXiv preprint arXiv:1706.10295, 2017.
- M. Hessel et al., "Rainbow: Combining Improvements in Deep Reinforcement Learning." arXiv preprint arXiv:1710.02298, 2017.
- W. Dabney et al., "Implicit Quantile Networks for Distributional Reinforcement Learning." arXiv preprint arXiv:1806.06923, 2018.