High-quality single file implementation of Deep Reinforcement Learning algorithms with research-friendly features (PPO, DQN, C51, DDPG, TD3, SAC, PPG)
🎉 We are thrilled to announce the v1.0.0 CleanRL release. Along with our CleanRL paper's recent publication in the Journal of Machine Learning Research, the v1.0.0 release includes reworked documentation, new algorithm variants, support for Google's new ML framework JAX, hyperparameter tuning utilities, and more. CleanRL has come a long way in making high-quality deep reinforcement learning implementations easy to understand and reproducible. This release is a major milestone for the project, and we are excited to share it with you. Over 90 PRs were merged to make it happen, and we would like to thank all the contributors involved.
One of the biggest changes of the v1 release is the added documentation at docs.cleanrl.dev. Great documentation is important for building a reliable and reproducible project, so we have reworked it to make it easier to understand and use. For each implemented algorithm, we document as much as we can to promote transparency:
Here is a list of the algorithm variants and their documentation:
We also improved the contribution guide to make it easier for new contributors to get started. We are still working on improving the documentation. If you have any suggestions, please let us know in the GitHub Issues.
We now support JAX-based learning algorithm variants, which are usually faster than their torch equivalents! Here are the docs of the new JAX-based DQN, TD3, DDPG, and PPO implementations:
- `dqn_atari_jax.py` by @kinalmehta in vwxyzjn/cleanrl#222 (JAX variant of `dqn_atari.py`)
- `td3_continuous_action_jax.py` by @joaogui1 in vwxyzjn/cleanrl#225 (JAX variant of `td3_continuous_action.py`)
- `ddpg_continuous_action_jax.py` by @vwxyzjn in vwxyzjn/cleanrl#187 (JAX variant of `ddpg_continuous_action.py`)
- `ppo_atari_envpool_xla_jax.py` by @vwxyzjn in vwxyzjn/cleanrl#227
For example, below are the benchmark results of DDPG + JAX (see the docs here for further detail):
Other new algorithm variants include multi-GPU PPO, a PPO prototype that works with Isaac Gym, multi-agent Atari PPO, and refactored PPG and PPO-RND implementations:
- `ppo_atari_multigpu.py` by @vwxyzjn in vwxyzjn/cleanrl#178 (multi-GPU variant of `ppo_atari.py`, which uses `SyncVectorEnv`)
- `ppo_continuous_action_isaacgym.py` by @vwxyzjn in vwxyzjn/cleanrl#233 (trains Ant in 4 mins)
- `ppo_pettingzoo_ma_atari.py` by @vwxyzjn in vwxyzjn/cleanrl#188
- `ppg_procgen.py` by @Dipamc77 in vwxyzjn/cleanrl#186
- `ppo_rnd_envpool.py` by @yooceii in vwxyzjn/cleanrl#151 (evaluated on `MontezumaRevengeNoFrameskip-v4`)
We love tools! The v1.0.0 release comes with a series of DevOps improvements, including pre-commit utilities and CI integration with GitHub to run end-to-end test cases. We also make available a new hyperparameter tuning tool and a new tool for running benchmark experiments.
We added a pre-commit utility to help contributors format their code, check spelling, and remove unused variables and imports before submitting a pull request (see the Contribution guide for more detail).
To ensure our single-file implementations can run without error, we also added a CI/CD pipeline that runs end-to-end test cases for all the algorithm variants. The pipeline also tests builds across different operating systems, such as Linux, macOS, and Windows (see here for an example). GitHub Actions are free for open source projects, and we are very happy to have this tool to help us maintain the project.
We now have preliminary support for hyperparameter tuning via optuna (see docs), which is designed to help researchers find a single set of hyperparameters that work well across a family of games. The current API looks like this:
```python
import optuna

from cleanrl_utils.tuner import Tuner

tuner = Tuner(
    script="cleanrl/ppo.py",
    metric="charts/episodic_return",
    metric_last_n_average_window=50,
    direction="maximize",
    aggregation_type="average",
    target_scores={
        "CartPole-v1": [0, 500],
        "Acrobot-v1": [-500, 0],
    },
    params_fn=lambda trial: {
        "learning-rate": trial.suggest_loguniform("learning-rate", 0.0003, 0.003),
        "num-minibatches": trial.suggest_categorical("num-minibatches", [1, 2, 4]),
        "update-epochs": trial.suggest_categorical("update-epochs", [1, 2, 4, 8]),
        "num-steps": trial.suggest_categorical("num-steps", [5, 16, 32, 64, 128]),
        "vf-coef": trial.suggest_uniform("vf-coef", 0, 5),
        "max-grad-norm": trial.suggest_uniform("max-grad-norm", 0, 5),
        "total-timesteps": 100000,
        "num-envs": 16,
    },
    pruner=optuna.pruners.MedianPruner(n_startup_trials=5),
    sampler=optuna.samplers.TPESampler(),
)
tuner.tune(
    num_trials=100,
    num_seeds=3,
)
```
We also added a new tool for running benchmark experiments. It is designed to help researchers quickly run benchmark experiments across different algorithms and environments with multiple random seeds. The tool lives in the `cleanrl_utils.benchmark` module, and users can run commands such as:
```bash
OMP_NUM_THREADS=1 xvfb-run -a python -m cleanrl_utils.benchmark \
    --env-ids CartPole-v1 Acrobot-v1 MountainCar-v0 \
    --command "poetry run python cleanrl/ppo.py --cuda False --track --capture-video" \
    --num-seeds 3 \
    --workers 5
```
which will run the `ppo.py` script with the `--cuda False --track --capture-video` arguments across 3 random seeds for each of the 3 environments. It uses `multiprocessing` to create a pool of 5 workers that run the experiments in parallel.
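For intuition, here is a minimal sketch of that fan-out, assuming one subprocess per (env-id, seed) pair; it is an illustration only, not the actual `cleanrl_utils.benchmark` source:

```python
import shlex
import subprocess
from multiprocessing import Pool


def run_experiment(command: str) -> int:
    """Run one training command in a subprocess and return its exit code."""
    return subprocess.run(shlex.split(command)).returncode


if __name__ == "__main__":
    env_ids = ["CartPole-v1", "Acrobot-v1", "MountainCar-v0"]
    commands = [
        f"python cleanrl/ppo.py --env-id {env_id} --seed {seed}"
        for env_id in env_ids
        for seed in range(1, 4)  # 3 random seeds per environment
    ]
    with Pool(5) as pool:  # a pool of 5 workers, mirroring --workers 5
        pool.map(run_experiment, commands)
```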
It is an exciting time, and new improvements are coming to CleanRL. We plan to add more JAX-based implementations, Hugging Face integration, some RLops prototypes, and Gymnasium support. CleanRL is a community-based project, and we always welcome new contributors. If there is an algorithm or new feature you would like to contribute, feel free to chat with us on our Discord channel or raise a GitHub issue.
More JAX-based implementations are coming. Antonin Raffin, the core maintainer of Stable-Baselines3, SBX, and rl-baselines3-zoo, is contributing an optimized Soft Actor-Critic implementation in JAX (vwxyzjn/cleanrl#300) as well as TD3+TQC and DroQ (vwxyzjn/cleanrl#272). These are incredibly exciting new algorithms. For example, DroQ is extremely sample efficient and can obtain ~5000 return on `HalfCheetah-v3` in just 100k steps (tracked sbx experiment).
Hugging Face Hub 🤗 is a great platform for sharing and collaborating on models. We are working on a new integration with the Hub to make it easier for researchers to share their RL models and benchmark them against other models (vwxyzjn/cleanrl#292). Stay tuned! In the future, we will have a simple snippet for loading models, like the one below:
```python
import random
from typing import Callable

import gym
import numpy as np
import torch


def evaluate(
    model_path: str,
    make_env: Callable,
    env_id: str,
    eval_episodes: int,
    run_name: str,
    Model: torch.nn.Module,
    device: torch.device,
    epsilon: float = 0.05,
    capture_video: bool = True,
):
    envs = gym.vector.SyncVectorEnv([make_env(env_id, 0, 0, capture_video, run_name)])
    model = Model(envs).to(device)
    model.load_state_dict(torch.load(model_path))
    model.eval()

    obs = envs.reset()
    episodic_returns = []
    while len(episodic_returns) < eval_episodes:
        # epsilon-greedy evaluation: act randomly with probability `epsilon`
        if random.random() < epsilon:
            actions = np.array([envs.single_action_space.sample() for _ in range(envs.num_envs)])
        else:
            q_values = model(torch.Tensor(obs).to(device))
            actions = torch.argmax(q_values, dim=1).cpu().numpy()
        next_obs, _, _, infos = envs.step(actions)
        for info in infos:
            if "episode" in info.keys():
                print(f"eval_episode={len(episodic_returns)}, episodic_return={info['episode']['r']}")
                episodic_returns += [info["episode"]["r"]]
        obs = next_obs

    return episodic_returns


if __name__ == "__main__":
    from huggingface_hub import hf_hub_download

    from cleanrl.dqn import QNetwork, make_env

    model_path = hf_hub_download(repo_id="cleanrl/CartPole-v1-dqn-seed1", filename="q_network.pth")
    # Illustrative call (argument values assumed for this sketch):
    evaluate(
        model_path,
        make_env,
        "CartPole-v1",
        eval_episodes=10,
        run_name="eval",
        Model=QNetwork,
        device=torch.device("cpu"),
    )
```
How do we know the effect of a new feature or bug fix? DRL is brittle and has a series of reproducibility issues; even bug fixes can sometimes introduce performance regressions (e.g., see how a bug fix of the contact force in MuJoCo results in worse performance for PPO). Therefore, it is essential to understand how proposed changes impact the performance of the algorithms.
We are working on a prototype tool that allows us to compare the performance of the library across different versions via tracked experiments (vwxyzjn/cleanrl#307). With this tool, we can confidently merge new features and bug fixes without worrying about introducing catastrophic regressions. Users can run commands such as:
```bash
python -m cleanrl_utils.rlops --exp-name ddpg_continuous_action \
    --wandb-project-name cleanrl \
    --wandb-entity openrlbenchmark \
    --tags 'pr-299' 'rlops-pilot' \
    --env-ids HalfCheetah-v2 Walker2d-v2 Hopper-v2 InvertedPendulum-v2 Humanoid-v2 Pusher-v2 \
    --output-filename compare.png \
    --scan-history \
    --metric-last-n-average-window 100 \
    --report
```
which generates a comparison image (compare.png) and a corresponding report.
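Under the hood, a comparison like this only needs the tracked runs' metric histories. Below is a hedged sketch of fetching and averaging the last 100 logged `charts/episodic_return` values with the public wandb API; it is an illustration, not the rlops implementation, and the entity, project, and tags are taken from the example above:

```python
import wandb

api = wandb.Api()
runs = api.runs(
    "openrlbenchmark/cleanrl",  # --wandb-entity / --wandb-project-name above
    filters={"tags": {"$in": ["pr-299", "rlops-pilot"]}},
)
for run in runs:
    # run.history() returns a sampled pandas DataFrame; the real tool's
    # --scan-history flag suggests scanning the full, unsampled history.
    df = run.history(keys=["charts/episodic_return"], pandas=True)
    last_n = df["charts/episodic_return"].dropna().tail(100)
    print(f"{run.name} (tags={run.tags}): {last_n.mean():.2f}")
```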
Farama-Foundation/Gymnasium is the next generation of openai/gym that will continue to be maintained and introduce new features. Please see their announcement for further detail. We are migrating to gymnasium, and the progress can be tracked in vwxyzjn/cleanrl#277; a sketch of the main API change involved is shown below.
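The core user-facing change in this migration is Gymnasium's new reset/step API. The minimal sketch below is based on the public Gymnasium API, not on CleanRL's migration PR:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=42)  # reset now returns (obs, info)
done = False
while not done:
    action = env.action_space.sample()
    # step now returns five values; terminated/truncated replace the old `done`
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
env.close()
```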
Also, the Farama Foundation is working on a project called Shimmy, which offers conversion wrappers for deepmind/dm_env environments, such as dm_control and deepmind/lab. This is an exciting project that will allow us to support deepmind/dm_env in the future.
CleanRL has benefited from the contributions of many awesome folks. I would like to cordially thank the core dev members @dosssman @yooceii @Dipamc @kinalmehta @bragajj for their efforts in helping maintain the CleanRL repository. I would also like to give a shout-out to our new contributors @cool-RR, @Howuhh, @jseppanen, @joaogui1, @ALPH2H, @ElliotMunro200, @WillDudley, and @sdpkjc.
We always welcome new contributors to the project. If you are interested in contributing to CleanRL (e.g., new features, bug fixes, new algorithms), please check out our reworked contributing guide.
- `requirements.txt` automatically by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/143
- `pyupgrade` by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/158
- `ppo_continuous_action.py` only run 1M steps by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/161
- `ppo.py`'s default timesteps by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/164
- `ppo_procgen.py` by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/166
- `ppo_continuous_action_isaacgym.py` by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/242
- `ddpg_continuous_action.py` docs by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/137
- `dqn_atari.py` documentation by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/124
- `td3_continuous_action.py` by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/141
- `c51.py` and `c51_atari.py` by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/159
- `dqn.py` by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/157
- `ppo_atari_envpool.py` by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/160
Full Changelog: https://github.com/vwxyzjn/cleanrl/compare/v0.6.0...v1.0.0
Published by vwxyzjn about 2 years ago
🎉 I am thrilled to announce the v1.0.0b2 CleanRL Beta Release. This new release comes with exciting new features. First, we now support JAX-based learning algorithms, which are usually faster than their torch equivalents! Here are the docs of the new JAX-based DQN, TD3, and DDPG implementations:
Also, we now have preliminary support for hyperparameter tuning via optuna (see docs), which is designed to help researchers find a single set of hyperparameters that work well across a family of games. The current API looks like this:
```python
import optuna

from cleanrl_utils.tuner import Tuner

tuner = Tuner(
    script="cleanrl/ppo.py",
    metric="charts/episodic_return",
    metric_last_n_average_window=50,
    direction="maximize",
    aggregation_type="average",
    target_scores={
        "CartPole-v1": [0, 500],
        "Acrobot-v1": [-500, 0],
    },
    params_fn=lambda trial: {
        "learning-rate": trial.suggest_loguniform("learning-rate", 0.0003, 0.003),
        "num-minibatches": trial.suggest_categorical("num-minibatches", [1, 2, 4]),
        "update-epochs": trial.suggest_categorical("update-epochs", [1, 2, 4, 8]),
        "num-steps": trial.suggest_categorical("num-steps", [5, 16, 32, 64, 128]),
        "vf-coef": trial.suggest_uniform("vf-coef", 0, 5),
        "max-grad-norm": trial.suggest_uniform("max-grad-norm", 0, 5),
        "total-timesteps": 100000,
        "num-envs": 16,
    },
    pruner=optuna.pruners.MedianPruner(n_startup_trials=5),
    sampler=optuna.samplers.TPESampler(),
)
tuner.tune(
    num_trials=100,
    num_seeds=3,
)
```
Besides, we added support for new algorithms and environments:
- `ppo_continuous_action_isaacgym.py`
- `ppo_rnd_envpool.py`
I would like to cordially thank the core dev members @dosssman @yooceii @Dipamc @kinalmehta for their efforts in helping maintain the CleanRL repository. I would also like to give a shout-out to our new contributors @cool-RR, @Howuhh, @jseppanen, @joaogui1, @kinalmehta, and @ALPH2H.
Jiayi Weng, Min Lin, Shengyi Huang, Bo Liu, Denys Makoviichuk, Viktor Makoviychuk, Zichen Liu, Yufan Song, Ting Luo, Yukun Jiang, Zhongwen Xu, & Shuicheng YAN (2022). EnvPool: A Highly Parallel Reinforcement Learning Environment Execution Engine. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track. https://openreview.net/forum?id=BubxnHpuMbG
- `ppo_continuous_action_isaacgym.py` by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/242
Full Changelog: https://github.com/vwxyzjn/cleanrl/compare/v1.0.0b1...v1.0.0b2
Published by vwxyzjn over 2 years ago
🎉 I am thrilled to announce the v1.0.0b1 CleanRL Beta Release. CleanRL has come a long way in making high-quality deep reinforcement learning implementations easy to understand. In this release, we have put a huge effort into revamping our documentation site, making our implementations friendlier for new users.
I would like to cordially thank the core dev members @dosssman @yooceii @Dipamc77 @bragajj for their efforts in helping maintain the CleanRL repository. I would also like to give a shout-out to our new contributors @ElliotMunro200 and @Dipamc77.
- `ppo_continuous_action.py` only run 1M steps by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/161
- `ppo.py`'s default timesteps by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/164
- `ppo_procgen.py` by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/166
A significant amount of documentation changes (tracked by https://github.com/vwxyzjn/cleanrl/issues/121).
See the overview documentation page here: https://docs.cleanrl.dev/rl-algorithms/overview/
- `ddpg_continuous_action.py` docs by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/137
- `dqn_atari.py` documentation by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/124
- `td3_continuous_action.py` by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/141
- `c51.py` and `c51_atari.py` by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/159
- `dqn.py` by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/157
- `ppo_atari_envpool.py` by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/160
- `requirements.txt` automatically by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/143
- `pyupgrade` by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/158
Full Changelog: https://github.com/vwxyzjn/cleanrl/compare/v0.6.0...v1.0.0b1
Published by vwxyzjn over 2 years ago
- `parse_args` by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/78
- `parse_args()` by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/118
- `ppo.py` documentation by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/120
- Replace `episode_reward` with `episodic_return` by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/125
- `apex_dqn_atari.py` by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/136
- `gym==0.23.1` by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/138
Full Changelog: https://github.com/vwxyzjn/cleanrl/compare/v0.5.0...v0.6.0
Published by vwxyzjn almost 3 years ago
- `cleanrl` directory by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/77
Full Changelog: https://github.com/vwxyzjn/cleanrl/compare/v0.4.8...v0.5.0
Published by vwxyzjn over 3 years ago
Published by vwxyzjn about 4 years ago
| gym_id | apex_dqn_atari_visual | c51_atari_visual | dqn_atari_visual | ppo_atari_visual |
|---|---|---|---|---|
| BeamRiderNoFrameskip-v4 | 2936.93 ± 362.18 | 13380.67 ± 0.00 | 7139.11 ± 479.11 | 2053.08 ± 83.37 |
| QbertNoFrameskip-v4 | 3565.00 ± 690.00 | 16286.11 ± 0.00 | 11586.11 ± 0.00 | 17919.44 ± 383.33 |
| SpaceInvadersNoFrameskip-v4 | 1019.17 ± 356.94 | 1099.72 ± 14.72 | 935.40 ± 93.17 | 1089.44 ± 67.22 |
| PongNoFrameskip-v4 | 19.06 ± 0.83 | 18.00 ± 0.00 | 19.78 ± 0.22 | 20.72 ± 0.28 |
| BreakoutNoFrameskip-v4 | 364.97 ± 58.36 | 386.10 ± 21.77 | 353.39 ± 30.61 | 380.67 ± 35.29 |
| gym_id | ddpg_continuous_action | td3_continuous_action | ppo_continuous_action |
|---|---|---|---|
| Reacher-v2 | -6.25 ± 0.54 | -6.65 ± 0.04 | -7.86 ± 1.47 |
| Pusher-v2 | -44.84 ± 5.54 | -59.69 ± 3.84 | -44.10 ± 6.49 |
| Thrower-v2 | -137.18 ± 47.98 | -80.75 ± 12.92 | -58.76 ± 1.42 |
| Striker-v2 | -193.43 ± 27.22 | -269.63 ± 22.14 | -112.03 ± 9.43 |
| InvertedPendulum-v2 | 1000.00 ± 0.00 | 443.33 ± 249.78 | 968.33 ± 31.67 |
| HalfCheetah-v2 | 10386.46 ± 265.09 | 9265.25 ± 1290.73 | 1717.42 ± 20.25 |
| Hopper-v2 | 1128.75 ± 9.61 | 3095.89 ± 590.92 | 2276.30 ± 418.94 |
| Swimmer-v2 | 114.93 ± 29.09 | 103.89 ± 30.72 | 111.74 ± 7.06 |
| Walker2d-v2 | 1946.23 ± 223.65 | 3059.69 ± 1014.05 | 3142.06 ± 1041.17 |
| Ant-v2 | 243.25 ± 129.70 | 5586.91 ± 476.27 | 2785.98 ± 1265.03 |
| Humanoid-v2 | 877.90 ± 3.46 | 6342.99 ± 247.26 | 786.83 ± 95.66 |
| gym_id | ddpg_continuous_action | td3_continuous_action | ppo_continuous_action |
|---|---|---|---|
| MinitaurBulletEnv-v0 | -0.17 ± 0.02 | 7.73 ± 5.13 | 23.20 ± 2.23 |
| MinitaurBulletDuckEnv-v0 | -0.31 ± 0.03 | 0.88 ± 0.34 | 11.09 ± 1.50 |
| InvertedPendulumBulletEnv-v0 | 742.22 ± 47.33 | 1000.00 ± 0.00 | 1000.00 ± 0.00 |
| InvertedDoublePendulumBulletEnv-v0 | 5847.31 ± 843.53 | 5085.57 ± 4272.17 | 6970.72 ± 2386.46 |
| Walker2DBulletEnv-v0 | 567.61 ± 15.01 | 2177.57 ± 65.49 | 1377.68 ± 51.96 |
| HalfCheetahBulletEnv-v0 | 2847.63 ± 212.31 | 2537.34 ± 347.20 | 2347.64 ± 51.56 |
| AntBulletEnv-v0 | 2094.62 ± 952.21 | 3253.93 ± 106.96 | 1775.50 ± 50.19 |
| HopperBulletEnv-v0 | 1262.70 ± 424.95 | 2271.89 ± 24.26 | 2311.20 ± 45.28 |
| HumanoidBulletEnv-v0 | -54.45 ± 13.99 | 937.37 ± 161.05 | 204.47 ± 1.00 |
| BipedalWalker-v3 | 66.01 ± 127.82 | 78.91 ± 232.51 | 272.08 ± 10.29 |
| LunarLanderContinuous-v2 | 162.96 ± 65.60 | 281.88 ± 0.91 | 215.27 ± 10.17 |
| Pendulum-v0 | -238.65 ± 14.13 | -345.29 ± 47.40 | -1255.62 ± 28.37 |
| MountainCarContinuous-v0 | -1.01 ± 0.01 | -1.12 ± 0.12 | 93.89 ± 0.06 |
| gym_id | ppo | dqn |
|---|---|---|
| CartPole-v1 | 500.00 ± 0.00 | 182.93 ± 47.82 |
| Acrobot-v1 | -80.10 ± 6.77 | -81.50 ± 4.72 |
| MountainCar-v0 | -200.00 ± 0.00 | -142.56 ± 15.89 |
| LunarLander-v2 | 46.18 ± 53.04 | 144.52 ± 1.75 |
It runs data-processors in sub-processes to prepare data for the worker. This is kind of a workaround and a hack, but according to our benchmark, it works empirically well and is fast enough.

| Benchmarked Learning Curves | Atari |
|---|---|
| Metrics, logs, and recorded videos are at | cleanrl.benchmark/reports/Atari |
`CarRacing-v0` by PPO in the Experimental Domains. It is our first example with a pixel observation space and a continuous action space. See https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/experiments/ppo_car_racing.py.
Note that `CarRacing-v0` has large penalty rewards (e.g., if the car dies you get a -100 reward, and PPO is anecdotally sensitive to this kind of large reward).
Published by vwxyzjn about 4 years ago
See https://streamable.com/cq8e62 for a demo
A significant amount of effort was put into the making of Open RL Benchmark (http://benchmark.cleanrl.dev/). It provides benchmarks of popular Deep Reinforcement Learning algorithms in 34+ games with an unprecedented level of transparency, openness, and reproducibility.
In addition, the legacy common.py is deprecated in favor of single-file implementations.
Published by vwxyzjn almost 5 years ago
We've made the SAC algorithm work for both continuous and discrete action spaces, with primary references from the following papers (a sketch of the discrete-action idea follows the references):
https://arxiv.org/abs/1801.01290
https://arxiv.org/abs/1812.05905
https://arxiv.org/abs/1910.07207
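As a taste of what the discrete-action variant (arXiv:1910.07207) changes: with a categorical policy, the soft value can be computed as an exact expectation over actions rather than from sampled log-probabilities. The snippet below is a minimal illustration of that idea, not CleanRL's implementation; all tensors are stand-ins:

```python
import torch
import torch.nn.functional as F

batch, n_actions = 8, 4
logits = torch.randn(batch, n_actions)    # stand-in for a policy network's output
q_values = torch.randn(batch, n_actions)  # stand-in for a Q-network's output
alpha = 0.2                               # entropy temperature

probs = F.softmax(logits, dim=-1)
log_probs = F.log_softmax(logits, dim=-1)

# Soft state value V(s) = E_{a~pi}[Q(s, a) - alpha * log pi(a|s)],
# computed exactly by summing over the discrete action set:
soft_value = (probs * (q_values - alpha * log_probs)).sum(dim=-1)
```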
My personal thanks to everyone who participated in the monthly dev cycle and, in particular, @dosssman who implemented the SAC with discrete action spaces.
Additional improvements include:
- Support for gym.wrappers.Monitor to automatically record the agent's performance at certain episodes (the default is 1, 2, 9, 28, 65, ..., 1000, 2000, 3000) and integrate with wandb (so cool, see the screenshot below) #4. A sketch of this recording schedule follows the list.
- Use the same replay buffer from minimalRL for DQN and SAC #5 (a sketch of that buffer is also shown below).
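The episode schedule quoted above matches gym's default "capped cubic" video schedule; here is a sketch of that rule as we understand gym's default, not CleanRL-specific code:

```python
def capped_cubic_video_schedule(episode_id: int) -> bool:
    """Record episodes 0, 1, 8, 27, 64, ... (i.e., the 1st, 2nd, 9th, 28th,
    65th, ... episodes), then every 1000th episode after episode 1000."""
    if episode_id < 1000:
        return round(episode_id ** (1.0 / 3)) ** 3 == episode_id
    return episode_id % 1000 == 0
```

And here is a simplified sketch of the minimalRL-style replay buffer referenced above (modeled on the public minimalRL repo, which returns torch tensors; this version returns numpy arrays):

```python
import collections
import random

import numpy as np


class ReplayBuffer:
    def __init__(self, buffer_limit: int = 50000):
        self.buffer = collections.deque(maxlen=buffer_limit)

    def put(self, transition):
        # transition = (state, action, reward, next_state, done_mask)
        self.buffer.append(transition)

    def sample(self, n: int):
        mini_batch = random.sample(self.buffer, n)
        s, a, r, s_prime, done = map(np.array, zip(*mini_batch))
        return s, a, r, s_prime, done
```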
Published by vwxyzjn about 5 years ago
This is the initial release 🙌🙌
Working on more algorithms and bug fixes for the 1.0 release :) Comments and PRs are more than welcome.