The 37 Implementation Details of Proximal Policy Optimization

This repo contains the source code for the blog post "The 37 Implementation Details of Proximal Policy Optimization".

Blog post url: https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/
Tracked Weights and Biases experiments: https://wandb.ai/vwxyzjn/ppo-details

If you like this repo, consider checking out CleanRL (https://github.com/vwxyzjn/cleanrl), the RL library that we used to build this repo.

Get started

Prerequisites:

Python 3.8+
Poetry

Install dependencies:

poetry install

Train agents:

poetry run python ppo.py

Train agents with experiment tracking:

poetry run python ppo.py --track --capture-video

Atari

Install dependencies:

poetry install -E atari

Train agents:

poetry run python ppo_atari.py

Train agents with experiment tracking:

poetry run python ppo_atari.py --track --capture-video

Pybullet

Install dependencies:

poetry install -E pybullet

Train agents:

poetry run python ppo_continuous_action.py

Train agents with experiment tracking:

poetry run python ppo_continuous_action.py --track --capture-video

Gym-microrts (MultiDiscrete)

Install dependencies:

poetry install -E gym-microrts

Train agents:

poetry run python ppo_multidiscrete.py

Train agents with experiment tracking:

poetry run python ppo_multidiscrete.py --track --capture-video

Train agents with invalid action masking:

poetry run python ppo_multidiscrete_mask.py

Train agents with invalid action masking and experiment tracking:

poetry run python ppo_multidiscrete_mask.py --track --capture-video
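For readers unfamiliar with invalid action masking, the idea can be sketched in a few lines: before sampling, the logits of invalid actions are replaced with a large negative number so that their probability becomes effectively zero. The snippet below is a minimal, dependency-free illustration and is not the code in ppo_multidiscrete_mask.py; the function name masked_softmax and the fill value -1e8 are assumptions made for this example.

```python
import math
from typing import List


def masked_softmax(logits: List[float], mask: List[bool], fill: float = -1e8) -> List[float]:
    """Softmax over logits where invalid actions (mask=False) get ~0 probability.

    Hypothetical helper for illustration only; `fill` is an assumed constant.
    """
    # Replace the logits of invalid actions with a large negative number.
    masked = [logit if valid else fill for logit, valid in zip(logits, mask)]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(masked)
    exps = [math.exp(x - peak) for x in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Action 1 is invalid, so all probability mass goes to actions 0 and 2.
probs = masked_softmax([1.0, 2.0, 3.0], [True, False, True])
print(probs)
```

Because the masked logit underflows to zero after exponentiation, the invalid action is never sampled and contributes nothing to the normalization.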

Atari with Envpool

Install dependencies:

poetry install -E envpool

Train agents:

poetry run python ppo_atari_envpool.py

Train agents with experiment tracking:

poetry run python ppo_atari_envpool.py --track

Solve Pong-v5 in 5 mins:

poetry run python ppo_atari_envpool.py --clip-coef=0.2 --num-envs=16 --num-minibatches=8 --num-steps=128 --update-epochs=3
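To see what the flags above imply for each update, here is a small arithmetic sketch. It assumes the script follows the usual CleanRL convention (batch size = num-envs x num-steps, split evenly into num-minibatches); the helper name ppo_batch_sizes is hypothetical and not part of the repo.

```python
from typing import Tuple


def ppo_batch_sizes(num_envs: int, num_steps: int, num_minibatches: int) -> Tuple[int, int]:
    """Return (batch_size, minibatch_size) under the assumed CleanRL convention."""
    # One rollout collects num_steps transitions from each parallel environment.
    batch_size = num_envs * num_steps
    # Each epoch splits the rollout into equally sized minibatches.
    minibatch_size = batch_size // num_minibatches
    return batch_size, minibatch_size


batch_size, minibatch_size = ppo_batch_sizes(num_envs=16, num_steps=128, num_minibatches=8)
update_epochs = 3
# Gradient steps per rollout: one per minibatch, repeated for every epoch.
gradient_steps = update_epochs * 8
print(batch_size, minibatch_size, gradient_steps)  # 2048 256 24
```

Under that assumption, the Pong command collects 2048 transitions per rollout and performs 24 gradient steps on minibatches of 256 transitions each.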

Reach a game score of about 400 in Breakout-v5 with PPO in ~1 hour (a 3-4x speedup over ppo_atari.py with SyncVectorEnv, with no side effects):

poetry run python ppo_atari_envpool.py --gym-id Breakout-v5

Procgen

Install dependencies:

poetry install -E procgen

Train agents:

poetry run python ppo_procgen.py

Train agents with experiment tracking:

poetry run python ppo_procgen.py --track

Reproduction of all of our results

To reproduce the results obtained with openai/baselines, install our fork at https://github.com/vwxyzjn/baselines, then follow the scripts in scripts/baselines. To reproduce our results, follow the scripts in scripts/ours.

Citation

@inproceedings{shengyi2022the37implementation,
  author = {Huang, Shengyi and Dossa, Rousslan Fernand Julien and Raffin, Antonin and Kanervisto, Anssi and Wang, Weixun},
  title = {The 37 Implementation Details of Proximal Policy Optimization},
  booktitle = {ICLR Blog Track},
  year = {2022},
  note = {https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/},
  url  = {https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/}
}
