Reinforcement Learning using Intrinsic Rewards through Random Network Distillation in Chainer

This is a fairly complete implementation of Reinforcement Learning with Prediction-Based Rewards in Chainer. For more information on this implementation and intrinsic rewards using random network distillation, check out my blog post.

Random Network Distillation Schematic
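
To make the schematic concrete, here is a minimal, illustrative Chainer sketch of the RND idea: a fixed, randomly initialized target network and a trained predictor network, with the predictor's error serving as the intrinsic reward. This is not the repo's actual code; the network sizes and names below are made up.

import chainer
import chainer.functions as F
import chainer.links as L
import numpy as np

class SmallMLP(chainer.Chain):
    # Toy stand-in for the RND target/predictor networks (illustrative only).
    def __init__(self, out_dim=32):
        super().__init__()
        with self.init_scope():
            self.l1 = L.Linear(None, 64)
            self.l2 = L.Linear(64, out_dim)

    def __call__(self, x):
        return self.l2(F.relu(self.l1(x)))

target = SmallMLP()     # fixed, randomly initialized; never trained
predictor = SmallMLP()  # trained to match the target's outputs
optimizer = chainer.optimizers.Adam(alpha=1e-4)
optimizer.setup(predictor)

obs = np.random.randn(16, 84 * 84).astype(np.float32)  # fake batch of flattened frames

with chainer.no_backprop_mode():
    target_features = target(obs)
predicted = predictor(obs)

# Per-observation prediction error doubles as the intrinsic reward: unfamiliar
# observations are predicted poorly and therefore earn a larger exploration bonus.
intrinsic_reward = F.mean(F.squared_error(predicted, target_features), axis=1)

# Training the predictor on the same error means familiar observations
# gradually stop producing intrinsic reward.
loss = F.mean(intrinsic_reward)
predictor.cleargrads()
loss.backward()
optimizer.update()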

Notes

  • Why use this implementation when OpenAI has already provided a full one? Because this one is much simpler and easier to follow, while still being complete. It also uses Chainer rather than TensorFlow.
  • As in the paper, the Reinforcement Learning algorithm used is Proximal Policy Optimization (PPO); a sketch of its clipped objective follows this list.
  • The hyperparameters are mostly the same as those listed in the paper.
  • This implementation can seamlessly switch between a recurrent policy (RNN) and a convolutional-only policy (CNN). The recurrent layers can be turned off by specifying the argument --rnn_hidden_layers 0.
  • The RNN layers can be replaced with a Differentiable Neural Computer (DNC) using the --dnc and --rnn_single_record_training flags. You'll need to import my Chainer implementation (dnc.py) from: https://github.com/AdeelMufti/DifferentiableNeuralComputer. I was investigating its use in my work on Probabilistic Model-Based Reinforcement Learning Using The Differentiable Neural Computer.
  • This implementation was not tested on Montezuma's Revenge, as that environment did not intersect with the work I've been interested in, nor do I have access to free GPUs that I can leave running for long periods of time. If you have the compute capacity to try it out on Montezuma's Revenge, please let me know how it goes! I believe OpenAI used MontezumaRevengeNoFrameskip-v4 for their experiments.
  • When I use the term iteration, I mean a full round of performing and logging rollouts, and training the neural networks.
  • While the paper defines a total number of rollouts per environment for training (in batches of "Rollout Length"), I instead define a target cumulative extrinsic reward score, averaged over the number of trials ("Rollout Length") per iteration. Training continues indefinitely until this score is achieved; a toy sketch of this stopping rule follows this list.
  • During training, progress is saved at each iteration, and the program automatically resumes training if interrupted.
  • I didn't write fancy code to automatically determine the actions per environment, so you will need to define them manually in the ACTIONS constant in rl_rnd.py (a hypothetical sketch follows this list). Note that the current implementation only supports discrete actions.
  • GPU support is built in, and can be toggled using the --gpu argument. I recommend running single-threaded when a GPU is used; otherwise, rollouts on CPUs can be performed multi-threaded, configured through the --num_threads argument.
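
As referenced above, the policy is trained with PPO. For readers unfamiliar with it, here is an illustrative Chainer sketch of the clipped surrogate objective (not the repo's exact code); epsilon corresponds to the --epsilon_ppo argument:

import chainer.functions as F

def ppo_policy_loss(log_prob_new, log_prob_old, advantage, epsilon=0.1):
    # Probability ratio between the current and the old (rollout-time) policy.
    ratio = F.exp(log_prob_new - log_prob_old)
    # Clip the ratio so a single update cannot move the policy too far.
    clipped = F.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    # Take the pessimistic (minimum) surrogate and negate it for minimization.
    surrogate = F.minimum(ratio * advantage, clipped * advantage)
    return -F.mean(surrogate)

In RND, the advantage fed to such a loss is a weighted combination of the intrinsic and extrinsic advantages (cf. --intrinsic_coefficient and --extrinsic_coefficient).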
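
To illustrate the stopping rule described in the notes, here is a toy simulation (not the script's actual training loop; the fake per-trial scores simply drift upward so the loop terminates):

import numpy as np

rng = np.random.RandomState(31337)  # cf. --rng_seed
target_score = 100.0                # cf. --target_cumulative_rewards_extrinsic
num_trials = 128                    # cf. --num_trials ("Rollout Length")

iteration = 0
while True:
    iteration += 1
    # Stand-in for performing num_trials rollouts and training the networks;
    # here the per-trial cumulative extrinsic scores just improve over time.
    scores = rng.normal(loc=2.0 * iteration, scale=10.0, size=num_trials)
    avg_score = scores.mean()
    print('iteration %d: avg cumulative extrinsic reward %.2f' % (iteration, avg_score))
    if avg_score >= target_score:
        break  # training only stops once the target average score is reached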
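
And a hypothetical example of the manual action mapping mentioned above; the actual structure and values of ACTIONS in rl_rnd.py may differ, so treat this purely as a sketch:

# Hypothetical sketch only: map each supported game to its discrete action set.
ACTIONS = {
    'PixelCopter-v0': [0, 1],                           # e.g. no-op / ascend
    'MontezumaRevengeNoFrameskip-v4': list(range(18)),  # e.g. the full Atari action set
}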

Setup

Python 3.5 and pip 3 are required. Perform the following steps to clone this repo and install required packages:

To install Pygame Learning Environment (for PixelCopter-v0 and more):

Note: There is a bug in the gym_ple package. Find where your Python 3.5 packages are installed, edit lib/python3.5/site-packages/gym_ple/ple_env.py, and add the following lines under def __init__():

if game_name == 'PixelCopter':
    game_name = 'Pixelcopter'
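
If you're not sure where the package is installed, something along these lines will print the path of the file to edit (assuming gym_ple is importable):

import os
import gym_ple

# Prints the full path of the ple_env.py file that needs the patch above.
print(os.path.join(os.path.dirname(gym_ple.__file__), 'ple_env.py'))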

Usage

  • python rl_rnd.py [--args]
Parameter | Default | Description
--data_dir | /data/rl_rnd | The base data/output directory
--game | PixelCopter-v0 | Game to use
--experiment_name | experiment_1 | To isolate its files from others
--frame_resize | 84 | h x w resize of each observation frame
--initial_normalization_num_trials | 8 | Collect observations over this many trials to initialize the RND normalization parameters
--num_trials | 128 | Trials per iteration of training. Referred to as "Rollout Length" in the paper
--portion_experience_train_rnd_predictor | 0.25 | As in the RND paper
--z_dim | 32 | Dimension of the encoded vector
--rnn_hidden_dim | 256 | RNN hidden units
--rnn_hidden_layers | 1 | RNN hidden layers
--rnn_single_record_training | False | Required for DNC
--final_hidden_dim | 0 | Units for additional linear layers before the final output
--final_hidden_layers | 0 | Additional linear layers before the final output
--epochs_per_iteration | 4 | Number of optimization epochs
--sequence_length | 64 | This many input records are stacked together for a forward pass
--minibatches | 4 | Backprop is performed over this many rollouts
--stacked_frames | 4 | Number of observations stacked together for the value/policy networks
--stacked_frames_rnd | 1 | Number of observations stacked together for the RND predictor network
--sticky_action_probability | 0.25 | Repeat the previous action with this probability
--intrinsic_coefficient | 1.0 | As in the RND paper
--extrinsic_coefficient | 2.0 | As in the RND paper
--extrinsic_reward_clip | 1.0 | Extrinsic rewards are clipped to +/- this value, as in the RND paper
--gamma_intrinsic | 0.99 | Discount factor for intrinsic rewards
--gamma_extrinsic | 0.999 | Discount factor for extrinsic rewards
--lambda_gae | 0.95 | GAE parameter
--epsilon_ppo | 0.1 | PPO loss clip range, +/- this value
--rnd_obs_norm_clip | 5.0 | Normalized RND observations are clipped to +/- this value, as in the RND paper
--beta_entropy | 0.001 | Entropy coefficient
--epsilon_greedy | 0.0 | Epsilon-greedy probability for exploration
--model_policy_lr | 0.0001 | Learning rate for the policy network
--model_rnd_predictor_lr | 0.0001 | Learning rate for the RND predictor network
--model_value_lr | 0.0001 | Learning rate for the value network
--model | False | Resume using the .model files that are saved when training completes
--keep_past_x_snapshots | 10 | Delete snapshots older than this many iterations to free up disk space
--no_resume | False | Don't auto-resume from the latest snapshot
--disable_progress_bar | False | Disable Chainer's progress bar when optimizing
--gpu | -1 | GPU ID (a negative value means CPU)
--gradient_clip | 0.0 | Gradients are clipped/scaled to this L2 norm threshold
--rng_seed | 31337 | Random number generator seed
--num_threads | 10 | Number of threads for running rollouts in parallel. Use 1 when a GPU is used
--target_cumulative_rewards_extrinsic | 100 | Target cumulative extrinsic reward, averaged over all trials in an iteration. Training ends when this score is achieved
--dnc | | Use a Differentiable Neural Computer in place of the RNN layers. Format: N,W,R,K, e.g. 256,64,4,0

Example usage:

  • Terminal 1: killall -9 python; sleep 1; rm -fr /data/rl_rnd/PixelCopter-v0/experiment_1 && mkdir -p /data/rl_rnd/PixelCopter-v0/experiment_1 && python -u rl_rnd.py --game PixelCopter-v0 --gpu 0 --num_threads 1 --rnn_hidden_layers 0 --sticky_action_probability 0. --final_hidden_dim 256 --final_hidden_layers 3 --stacked_frames 10 --z_dim 1024 --gradient_clip 1. >> /data/rl_rnd/PixelCopter-v0/experiment_1/log.txt &
    • Cleans up the previous run (if any), launches training, and logs output to a file
  • Terminal 1: tail -f /data/rl_rnd/PixelCopter-v0/experiment_1/log.txt
    • To review useful output as training progresses
  • Terminal 2: watch -n 1 "grep avg /data/rl_rnd/PixelCopter-v0/experiment_1/log.txt | tail"
    • This will output the mean/std/min/max of the cumulative extrinsic rewards from all rollouts at each iteration
  • Terminal 3: python graph.py --game PixelCopter-v0
    • Graphs the cumulative extrinsic rewards from all rollouts at each iteration. The graph refreshes as new results become available