An offline deep reinforcement learning library
MIT License
ReBRAC has been added to d3rlpy! Please check a reproduction script here.
Dependencies for dm_control can be installed via the d3rlpy CLI:
$ d3rlpy install dm_control
Please check an example script here.
- use_layer_norm option has been added to VectorEncoderFactory (see the sketch below).
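As a rough sketch of how these pieces could fit together (assuming ReBRACConfig follows the usual Config/create pattern and exposes actor_encoder_factory/critic_encoder_factory arguments; adjust the names if they differ):
import d3rlpy

# encoder with the new use_layer_norm flag enabled
encoder = d3rlpy.models.VectorEncoderFactory([256, 256], use_layer_norm=True)

dataset, env = d3rlpy.datasets.get_pendulum()

# ReBRAC relies heavily on layer normalization, so wire the encoder in
# (encoder factory argument names are assumed here)
rebrac = d3rlpy.algos.ReBRACConfig(
    actor_encoder_factory=encoder,
    critic_encoder_factory=encoder,
).create(device="cpu:0")

rebrac.fit(
    dataset,
    n_steps=10000,
    n_steps_per_epoch=1000,
    evaluators={"environment": d3rlpy.metrics.EnvironmentEvaluator(env)},
)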
Published by takuseno 5 months ago
Cal-QL has been added to d3rlpy in v2.5.0! Please check a reproduction script here. To support faithful reproduction, SparseRewardTransitionPicker has also been added, which is used in the reproduction script.
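As a minimal sketch of wiring Cal-QL into the usual training loop (CalQLConfig is assumed to follow the same Config/create pattern as the other algorithms; the SparseRewardTransitionPicker wiring is omitted here since its arguments are dataset-specific, so please refer to the reproduction script for the faithful setup):
import d3rlpy

# a sparse-reward dataset (the reproduction script uses AntMaze via Minari)
dataset, env = d3rlpy.datasets.get_minari("antmaze-umaze-v0")

cal_ql = d3rlpy.algos.CalQLConfig(
    actor_learning_rate=3e-4,
    critic_learning_rate=3e-4,
).create(device="cpu:0")

cal_ql.fit(
    dataset,
    n_steps=100000,
    n_steps_per_epoch=1000,
    evaluators={"environment": d3rlpy.metrics.EnvironmentEvaluator(env)},
)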
One of the frequent questions is "How can I implement a custom algorithm on top of d3rlpy?". Now, a new example script has been added to answer this question. Based on this example, you can build your own algorithm while utilizing the whole training pipeline provided by d3rlpy. Please check the script here.
- save_policy method is now supported in the same way as you use it with Q-learning algorithms.
- n_updates option has been added to the fit_online method to control the update-to-data (UTD) ratio (see the sketch below).
- write_at_termination option has been added to ReplayBuffer.
- fit_online method has been fixed.
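A rough sketch of the new online-training options (the create_fifo_replay_buffer helper and the exact placement of the n_updates argument are assumptions based on the notes above):
import gym
import d3rlpy

env = gym.make("Pendulum-v1")

sac = d3rlpy.algos.SACConfig().create(device="cpu:0")

# FIFO replay buffer for online training (helper name assumed)
buffer = d3rlpy.dataset.create_fifo_replay_buffer(limit=100000, env=env)

# n_updates controls how many gradient steps are taken per environment step (UTD ratio)
sac.fit_online(env, buffer, n_steps=10000, n_updates=1)

# export the trained policy the same way as with any Q-learning algorithm
sac.save_policy("policy.pt")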
Published by takuseno 8 months ago
In v2.4.0, d3rlpy supports tuple observations.
import numpy as np
import d3rlpy
observations = [np.random.random((1000, 100)), np.random.random((1000, 32))]
actions = np.random.random((1000, 4))
rewards = np.random.random((1000, 1))
terminals = np.random.randint(2, size=(1000, 1))
dataset = d3rlpy.dataset.MDPDataset(
    observations=observations,
    actions=actions,
    rewards=rewards,
    terminals=terminals,
)
You can find an example script here.
- logging_steps and logging_strategy options have been added to the fit and fit_online methods (thanks, @claudius-kienle). A short usage sketch follows.
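A small sketch of the new logging options (the top-level exposure of the LoggingStrategy enum is an assumption; check the documentation for the exact import path):
import d3rlpy

dataset, _ = d3rlpy.datasets.get_cartpole()

dqn = d3rlpy.algos.DQNConfig().create(device="cpu:0")

# write metrics every 500 gradient steps instead of once per epoch
# (LoggingStrategy is assumed to be exported at the package top level)
dqn.fit(
    dataset,
    n_steps=10000,
    n_steps_per_epoch=1000,
    logging_strategy=d3rlpy.LoggingStrategy.STEPS,
    logging_steps=500,
)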
Published by takuseno 11 months ago
Distributed data parallel training across multiple nodes and GPUs has been one of the most demanded features. Now, it's finally available! It's extremely easy to use this feature.
Example:
# train.py
from typing import Dict
import d3rlpy
def main() -> None:
    # GPU version:
    # rank = d3rlpy.distributed.init_process_group("nccl")
    rank = d3rlpy.distributed.init_process_group("gloo")
    print(f"Start running on rank={rank}.")

    # GPU version:
    # device = f"cuda:{rank}"
    device = "cpu:0"

    # setup algorithm
    cql = d3rlpy.algos.CQLConfig(
        actor_learning_rate=1e-3,
        critic_learning_rate=1e-3,
        alpha_learning_rate=1e-3,
    ).create(device=device)

    # prepare dataset
    dataset, env = d3rlpy.datasets.get_pendulum()

    # disable logging on rank != 0 workers
    logger_adapter: d3rlpy.logging.LoggerAdapterFactory
    evaluators: Dict[str, d3rlpy.metrics.EvaluatorProtocol]
    if rank == 0:
        evaluators = {"environment": d3rlpy.metrics.EnvironmentEvaluator(env)}
        logger_adapter = d3rlpy.logging.FileAdapterFactory()
    else:
        evaluators = {}
        logger_adapter = d3rlpy.logging.NoopAdapterFactory()

    # start training
    cql.fit(
        dataset,
        n_steps=10000,
        n_steps_per_epoch=1000,
        evaluators=evaluators,
        logger_adapter=logger_adapter,
        show_progress=rank == 0,
        enable_ddp=True,
    )

    d3rlpy.distributed.destroy_process_group()
if __name__ == "__main__":
    main()
You need to use the torchrun command to start training, which should already be installed once you install PyTorch.
$ torchrun \
--nnodes=1 \
--nproc_per_node=3 \
--rdzv_id=100 \
--rdzv_backend=c10d \
--rdzv_endpoint=localhost:29400 \
train.py
In this case, 3 processes will be launched and start the training loop. DecisionTransformer-based algorithms also support this distributed training feature.
The example is also available here.
Minari is an OSS library that provides a standard format for offline reinforcement learning datasets. Now, d3rlpy provides easy access to this library.
You can install Minari via d3rlpy CLI.
$ d3rlpy install minari
Example:
import d3rlpy
dataset, env = d3rlpy.datasets.get_minari("antmaze-umaze-v0")
iql = d3rlpy.algos.IQLConfig(
    actor_learning_rate=3e-4,
    critic_learning_rate=3e-4,
    batch_size=256,
    weight_temp=10.0,
    max_weight=100.0,
    expectile=0.9,
    reward_scaler=d3rlpy.preprocessing.ConstantShiftRewardScaler(shift=-1),
).create(device="cpu:0")
iql.fit(
    dataset,
    n_steps=1000000,
    n_steps_per_epoch=100000,
    evaluators={"environment": d3rlpy.metrics.EnvironmentEvaluator(env)},
)
From this version, the computation of some algorithms has been optimized to remove redundant inference. As a result, algorithms with dual optimization such as SAC and CQL are significantly faster than in the previous version.
- GoalConcatWrapper has been added to support goal-conditioned environments.
- return_to_go has been added to Transition and TransitionMiniBatch.
- MixedReplayBuffer has been added to sample experiences from two buffers with an arbitrary ratio.
- initial_temperature supports 0 at DiscreteSAC (see the sketch below).
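For instance, the zero-temperature option can be used like this (a minimal sketch; hyperparameters are illustrative):
import d3rlpy

dataset, env = d3rlpy.datasets.get_cartpole()

# initial_temperature=0 disables the entropy bonus, making DiscreteSAC act greedily
sac = d3rlpy.algos.DiscreteSACConfig(initial_temperature=0.0).create(device="cpu:0")

sac.fit(
    dataset,
    n_steps=10000,
    n_steps_per_epoch=1000,
    evaluators={"environment": d3rlpy.metrics.EnvironmentEvaluator(env)},
)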
Published by takuseno 12 months ago
DiscreteDecisionTransformer, a Decision Transformer implementation for discrete action-spaces, has finally been implemented in v2.2.0! The reproduction results with Atari 2600 are available here.
import d3rlpy
dataset, env = d3rlpy.datasets.get_cartpole()
dt = d3rlpy.algos.DiscreteDecisionTransformerConfig(
    batch_size=64,
    num_heads=1,
    learning_rate=1e-4,
    max_timestep=1000,
    num_layers=3,
    position_encoding_type=d3rlpy.PositionEncodingType.SIMPLE,
    encoder_factory=d3rlpy.models.VectorEncoderFactory([128], exclude_last_activation=True),
    observation_scaler=d3rlpy.preprocessing.StandardObservationScaler(),
    context_size=20,
    warmup_tokens=100000,
).create()
dt.fit(
    dataset,
    n_steps=100000,
    n_steps_per_epoch=1000,
    eval_env=env,
    eval_target_return=500,
)
- action_size and action_space options have been added for manual dataset creation #338 (see the sketch below).
- FrameStackTrajectorySlicer has been added.
- Typing check of numpy is enabled. Some parts of the code differentiate data types of numpy arrays, which is checked by mypy.
- batch.intervals #346
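A small sketch of manual dataset creation with the explicit action-space options (the keyword names follow #338 and the top-level ActionSpace enum is an assumption):
import numpy as np
import d3rlpy

observations = np.random.random((1000, 10))
actions = np.random.randint(4, size=(1000, 1))
rewards = np.random.random((1000, 1))
terminals = np.random.randint(2, size=(1000, 1))

# declare the action space and size explicitly instead of letting d3rlpy infer them
# (d3rlpy.ActionSpace export path is assumed)
dataset = d3rlpy.dataset.MDPDataset(
    observations=observations,
    actions=actions,
    rewards=rewards,
    terminals=terminals,
    action_space=d3rlpy.ActionSpace.DISCRETE,
    action_size=4,
)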
Published by takuseno about 1 year ago
From this version, d3rlpy requires PyTorch v2 (v1 may still partially work). To do this, the minimum Python version has been bumped to 3.8. This change allows d3rlpy to utilize more advanced features such as torch.compile in the upcoming releases.
From this version, d3rlpy diagnoses dependency health automatically. In this version, the version of Gym is checked to make sure you have installed the correct one.
d3rlpy now supports Gymnasium as well as Gym. You can use it just the same as Gym. Please check the example for further details.
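As a small sketch, a Gymnasium environment can be passed wherever a Gym environment was accepted before (assuming the standard Pendulum-v1 registration):
import gymnasium
import d3rlpy

# offline dataset plus a Gymnasium environment for evaluation
dataset, _ = d3rlpy.datasets.get_pendulum()
env = gymnasium.make("Pendulum-v1")

sac = d3rlpy.algos.SACConfig().create(device="cpu:0")

sac.fit(
    dataset,
    n_steps=10000,
    n_steps_per_epoch=1000,
    evaluators={"environment": d3rlpy.metrics.EnvironmentEvaluator(env)},
)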
To make your life easier, d3rlpy provides d3rlpy install commands to install additional dependencies. This is part of the d3rlpy CLI. Please check the docs for further details.
$ d3rlpy install atari # Atari 2600 dependencies
$ d3rlpy install d4rl_atari # Atari 2600 + d4rl-atari dependencies
$ d3rlpy install d4rl # D4RL dependencies
In this version, the internal design has been refactored, mainly the algorithm implementations and the way models are assigned. ⚠️ Because of this change, previously saved models might not be loadable in this version.
- d3rlpy.notebook_utils has been added to provide utilities for Jupyter Notebook.
Published by takuseno about 1 year ago
- dump method of ReplayBuffer #299
- InitialStateValueEstimationEvaluator #301
For the rendering fix, I recommend you reinstall d4rl-atari if you use it.
$ pip install -U git+https://github.com/takuseno/d4rl-atari
Published by takuseno over 1 year ago
An emergency patch to fix a bug in the predict_value method #297.
Published by takuseno over 1 year ago
The major update has finally been released! Since the start of the project, d3rlpy has earned almost 1K GitHub stars ⭐, which is a great milestone. This update includes many major changes.
From this version, d3rlpy only supports the latest Gym version 0.26.0. This change allows us to support Gymnasium in a future update.
From this version, each algorithm (e.g. "DQN") has a config class (e.g. "DQNConfig"). This allows us to serialize and deserialize algorithms as described later.
dqn = d3rlpy.algos.DQNConfig(learning_rate=3e-4).create(device="cuda:0")
Decision Transformer is finally available! You can check the reproduction code to see how to use it.
import d3rlpy
dataset, env = d3rlpy.datasets.get_pendulum()
dt = d3rlpy.algos.DecisionTransformerConfig(
    batch_size=64,
    learning_rate=1e-4,
    optim_factory=d3rlpy.models.AdamWFactory(weight_decay=1e-4),
    encoder_factory=d3rlpy.models.VectorEncoderFactory(
        [128],
        exclude_last_activation=True,
    ),
    observation_scaler=d3rlpy.preprocessing.StandardObservationScaler(),
    reward_scaler=d3rlpy.preprocessing.MultiplyRewardScaler(0.001),
    context_size=20,
    num_heads=1,
    num_layers=3,
    warmup_steps=10000,
    max_timestep=1000,
).create(device="cuda:0")
dt.fit(
    dataset,
    n_steps=100000,
    n_steps_per_epoch=1000,
    save_interval=10,
    eval_env=env,
    eval_target_return=0.0,
)
In this version, d3rlpy introduces a compact serialization format, the d3 format, which includes both hyperparameters and model parameters in a single file. This makes it possible to easily save checkpoints and reconstruct algorithms for evaluation and deployment.
import d3rlpy
dataset, env = d3rlpy.datasets.get_cartpole()
dqn = d3rlpy.algos.DQNConfig().create()
dqn.fit(dataset, n_steps=10000)
# save as d3 file
dqn.save("model.d3")
# reconstruct the exactly same DQN
new_dqn = d3rlpy.load_learnable("model.d3")
From this version, there is no longer a clear separation between ReplayBuffer and MDPDataset. Instead, ReplayBuffer has unlimited flexibility to support any kind of algorithm and experiment. Please check the documentation for details. A small sketch is shown below.
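A minimal sketch of the unified interface (FIFOBuffer and the keyword names are assumed from the v2 dataset API; see the documentation for the authoritative usage):
import d3rlpy

dataset, env = d3rlpy.datasets.get_cartpole()

dqn = d3rlpy.algos.DQNConfig().create()

# offline training: the loaded dataset is itself a ReplayBuffer in v2
dqn.fit(dataset, n_steps=10000)

# online training: the same ReplayBuffer class, backed by a FIFO buffer
# (FIFOBuffer name and arguments are assumed here)
buffer = d3rlpy.dataset.ReplayBuffer(d3rlpy.dataset.FIFOBuffer(limit=100000), env=env)
dqn.fit_online(env, buffer, n_steps=10000)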
Published by takuseno over 2 years ago
The benchmark results of IQL and NFQ have been added to d3rlpy-benchmarks. Plus, results with more random seeds (up to 10) have been added for all algorithms. The benchmark results are more reliable now.
- Finetuning tutorial page has been added.
- Offline Policy Selection tutorial page has been added.
- cloudpickle and GPUUtil dependencies have been removed.
Published by takuseno over 2 years ago
The timestep alignment is now exactly the same as D4RL:
# observations = [o_1, o_2, ..., o_n]
observations = np.random.random((1000, 10))
# actions = [a_1, a_2, ..., a_n]
actions = np.random.random((1000, 10))
# rewards = [r(o_1, a_1), r(o_2, a_2), ...]
rewards = np.random.random(1000)
# terminals = [t(o_1, a_1), t(o_2, a_2), ...]
terminals = ...
where r(o, a) is the reward function and t(o, a) is the terminal function.
The reason for this change is that many users were confused by the difference between d3rlpy and D4RL. Now, they're aligned in the same way. This change might break your dataset.
- MDPDataset
- target_reduction_type and bootstrap options have been removed.
- dataset.pyx has been fixed #167 (thanks, @zbzhu99).
Published by takuseno almost 3 years ago
I'm proud to announce that v1.0.0 has finally been released! The first version was released in Aug 2020 under the support of the IPA MITOU program. At the first release, d3rlpy only supported a few algorithms and did not even support online training. After months of constructive feedback and insights from the users and the community, d3rlpy has established itself as the first offline deep RL library supporting many online and offline algorithms along with unique features. The next chapter towards the ambitious v2.0.0 also starts today. Please stay tuned for the next announcement!
The workshop paper about d3rlpy has been presented at the NeurIPS 2021 Offline RL Workshop.
URL: https://arxiv.org/abs/2111.03788
The full benchmark results are finally available at d3rlpy-benchmarks.
- deterministic option is added to the collect method (see the sketch below).
- rollout_return metric is added to online training.
- random_steps is added to the fit_online method.
- --save option is added to d3rlpy CLI commands (thanks, @pstansell).
- multiplier option is added to reward normalizers.
- policy_type option is added to BC.
- get_atari_transition function is added for the Atari 2600 offline benchmark procedure.
- dataclasses
- torch.jit.script and CUDA Graphs
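A brief sketch of the new collect/fit_online options in the v1.x API (option names follow the list above; other arguments are illustrative):
import gym
import d3rlpy

env = gym.make("Pendulum-v0")

sac = d3rlpy.algos.SAC(use_gpu=False)
buffer = d3rlpy.online.buffers.ReplayBuffer(maxlen=100000, env=env)

# random_steps: fill the buffer with random actions before training starts
sac.fit_online(env, buffer, n_steps=100000, random_steps=1000)

# deterministic data collection with the trained policy
sac.collect(env, buffer, deterministic=True)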
Published by takuseno about 3 years ago
From this version, preprocessors are available for the rewards, which allow you to normalize, standardize, and clip the reward values.
import d3rlpy
# normalize
cql = d3rlpy.algos.CQL(reward_scaler="min_max")
# standardize
cql = d3rlpy.algos.CQL(reward_scaler="standardize")
# clip (you can't use string alias)
cql = d3rlpy.algos.CQL(reward_scaler=d3rlpy.preprocessing.ClipRewardScaler(-1.0, 1.0))
In the scenario of finetuning, you might want to initialize SAC's policy function with the pretrained CQL's policy function to boost the initial performance. From this version, you can do that as follows:
import d3rlpy
# pretrain with static dataset
cql = d3rlpy.algos.CQL()
cql.fit(...)
# transfer the policy function
sac = d3rlpy.algos.SAC()
sac.copy_policy_from(cql)
# you can also transfer the Q-function
sac.copy_q_function_from(cql)
# finetuning with online algorithm
sac.fit_online(...)
- alpha parameter option has been added to DiscreteCQL.
- callback function is called every gradient step (previously, it was called every epoch). A short sketch follows.
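A short sketch of both changes (the callback signature (algo, epoch, total_step) is assumed from the documentation of this era):
import d3rlpy

dataset, _ = d3rlpy.datasets.get_cartpole()

# alpha controls the conservativeness of DiscreteCQL
cql = d3rlpy.algos.DiscreteCQL(alpha=1.0)

# the callback is now invoked at every gradient step
def callback(algo, epoch, total_step):
    if total_step % 1000 == 0:
        print(f"epoch={epoch}, total_step={total_step}")

cql.fit(dataset, n_steps=10000, callback=callback)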
Published by takuseno over 3 years ago
From this version, the data augmentation feature has been dropped. The reason for this is that the feature introduced a lot of code complexity. In order to make d3rlpy support many algorithms while keeping it as simple as possible, the feature was dropped. Instead, TorchMiniBatch was internally introduced, and all algorithms became simpler.
In offline RL experiments, data collection plays an important role, especially when you try new tasks. From this version, the collect method is finally available.
import d3rlpy
import gym
# prepare environment
env = gym.make('Pendulum-v0')
# prepare algorithm
sac = d3rlpy.algos.SAC()
# prepare replay buffer
buffer = d3rlpy.online.buffers.ReplayBuffer(maxlen=100000, env=env)
# start data collection without updates
sac.collect(env, buffer)
# export to MDPDataset
dataset = buffer.to_mdp_dataset()
# save as file
dataset.dump('pendulum.h5')
Along with this change, random policies are also introduced. These are useful for collecting datasets with a random policy.
# continuous action-space
policy = d3rlpy.algos.RandomPolicy()
# discrete action-space
policy = d3rlpy.algos.DiscreteRandomPolicy()
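For example, a random policy can drive the same collect pipeline shown above (a sketch; RandomPolicy is assumed to expose the standard collect method):
import gym
import d3rlpy

env = gym.make('Pendulum-v0')

# uniform random policy for continuous action-spaces
policy = d3rlpy.algos.RandomPolicy()

buffer = d3rlpy.online.buffers.ReplayBuffer(maxlen=100000, env=env)

# collect transitions without any learning
policy.collect(env, buffer)

dataset = buffer.to_mdp_dataset()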
- callback argument has been added to algorithms.
- dataset_type='random' option has been added to get_cartpole and get_pendulum methods.
- predict_value method (thanks, @navidmdn).
Currently, I'm benchmarking all algorithms with the d4rl dataset. Through the experiments, I realized that it's very difficult to reproduce the tables reported in the papers because they don't actually reveal the full hyper-parameters, which are tuned for each dataset. So I gave up reproducing the tables and started producing numbers with the official codes to see if d3rlpy's results match.
Published by takuseno over 3 years ago
New algorithms are introduced in this version.
Previously, model-based RL was supported, with the model-based-specific logic implemented on the dynamics side. This approach enabled us to combine model-based algorithms with arbitrary model-free algorithms. However, it required complex designs to implement recent model-based RL methods. So, the dynamics interface was refactored, and MOPO is the first algorithm to show how d3rlpy supports model-based RL algorithms.
# train dynamics model
from d3rlpy.datasets import get_pendulum
from d3rlpy.dynamics import ProbabilisticEnsembleDynamics
from d3rlpy.metrics.scorer import dynamics_observation_prediction_error_scorer
from d3rlpy.metrics.scorer import dynamics_reward_prediction_error_scorer
from d3rlpy.metrics.scorer import dynamics_prediction_variance_scorer
from sklearn.model_selection import train_test_split
dataset, _ = get_pendulum()
train_episodes, test_episodes = train_test_split(dataset)
dynamics = ProbabilisticEnsembleDynamics(learning_rate=1e-4, use_gpu=True)
dynamics.fit(train_episodes,
             eval_episodes=test_episodes,
             n_epochs=100,
             scorers={
                 'observation_error': dynamics_observation_prediction_error_scorer,
                 'reward_error': dynamics_reward_prediction_error_scorer,
                 'variance': dynamics_prediction_variance_scorer,
             })
# train Model-based RL algorithm
from d3rlpy.algos import MOPO
# give the trained dynamics model to MOPO
mopo = MOPO(dynamics=dynamics)
mopo.fit(dataset, n_steps=100000)
- fitter method has been implemented (thanks, @jamartinh).
- tensorboard_dir replaces the tensorboard flag at the fit method (thanks, @navidmdn).
- fit method accepts MDPDataset object.
- dropout option has been implemented in encoders.
- __repr__ methods show pretty outputs when calling print(algo).
- core dumped errors are fixed by pinning the numpy version.
Published by takuseno over 3 years ago
New commands are added in this version.
You can record the video of the evaluation episodes without coding anything.
$ d3rlpy record d3rlpy_logs/CQL_20201224224314/model_100.pt --env-id HopperBulletEnv-v0
# record wrapped environment
$ d3rlpy record d3rlpy_logs/Discrete_CQL_20201224224314/model_100.pt \
--env-header 'import gym; env = d3rlpy.envs.Atari(gym.make("BreakoutNoFrameskip-v4"), is_eval=True)'
You can run the evaluation episodes while rendering images.
# play simple environment
$ d3rlpy play d3rlpy_logs/CQL_20201224224314/model_100.pt --env-id HopperBulletEnv-v0
# play wrapped environment
$ d3rlpy play d3rlpy_logs/Discrete_CQL_20201224224314/model_100.pt \
--env-header 'import gym; env = d3rlpy.envs.Atari(gym.make("BreakoutNoFrameskip-v4"), is_eval=True)'
Ensemble training of Q-functions has been shown to be a powerful method for achieving robust training. Previously, the bootstrap option was available for algorithms, but the mask for the Q-function loss was randomly created every time a batch was sampled.
In this version, the create_mask option is available for MDPDataset and ReplayBuffer, which will create a unique mask at each data-point.
# offline training
dataset = d3rlpy.dataset.MDPDataset(observations, actions, rewards, terminals, create_mask=True, mask_size=5)
cql = d3rlpy.algos.CQL(n_critics=5, bootstrap=True, target_reduction_type='none')
cql.fit(dataset)
# online training
buffer = d3rlpy.online.buffers.ReplayBuffer(1000000, create_mask=True, mask_size=5)
sac = d3rlpy.algos.SAC(n_critics=5, bootstrap=True, target_reduction_type='none')
sac.fit_online(env, buffer)
As you noticed above, target_reduction_type is newly introduced to specify how to aggregate target Q-values. In the standard Soft Actor-Critic, target_reduction_type='min' is used. If you choose 'none', each ensemble Q-function uses its own target value, which is similar to what Bootstrapped DQN does.
From this version, you can navigate to all modules through d3rlpy.
# previously
from d3rlpy.datasets import get_cartpole
dataset = get_cartpole()
# v0.70
import d3rlpy
dataset = d3rlpy.datasets.get_cartpole()
From this version, structlog is internally used to print information instead of the raw print function. This allows us to emit more structured information. Furthermore, you can control what to show and what to save to the file if you overwrite the logger configuration.
- soft_q_backup option is added to CQL.
- Paper Reproduction page has been added to the documentation in order to show the performance with the paper configurations.
- commit method at D3RLPyLogger returns metrics (thanks, @jamartinh).
- epoch count in offline training.
- total_step count in online training.
Published by takuseno over 3 years ago
record command is newly introduced in this version. You can record videos of evaluation episodes with the saved model.
$ d3rlpy record d3rlpy_logs/CQL_20210131144357/model_100.pt --env-id Hopper-v2
You can also use the wrapped environment.
$ d3rlpy record d3rlpy_logs/DQN_online_20210130170041/model_1000.pt \
--env-header 'import gym; from d3rlpy.envs import Atari; env = Atari(gym.make("BreakoutNoFrameskip-v4"), is_eval=True)'
- fit_online method
Published by takuseno over 3 years ago
New logo images are made for d3rlpy 🎉
[logo images: standard | inverted]
ActionScaler provides action scaling pre/post-processing for continuous control algorithms. Previously, actions had to be within [-1.0, 1.0]. From now on, you don't need to care about the range of actions.
from d3rlpy.algos import CQL
cql = CQL(action_scaler='min_max') # just pass action_scaler argument
Episodes terminated by timeouts should not be clipped at bootstrapping. From this version, you can specify episode boundaries as well as the terminal flags.
from d3rlpy.dataset import MDPDataset
observations = ...
actions = ...
rewards = ...
terminals = ... # this indicates the environmental termination
episode_terminals = ... # this indicates episode boundaries
datasets = MDPDataset(observations, actions, rewards, terminals, episode_terminals)
# if episode_terminals are omitted, terminals will be used to specify episode boundaries
# datasets = MDPDataset(observations, actions, rewards, terminals)
In online training, you can specify this option via timelimit_aware
flag.
import gym

from d3rlpy.algos import SAC

env = gym.make('Hopper-v2')  # make sure the environment is wrapped by gym.wrappers.TimeLimit

sac = SAC()
sac.fit_online(env, timelimit_aware=True)  # this flag is True by default
reference: https://arxiv.org/abs/1712.00378
When training with computationally expensive environments such as robotics simulators or rich 3D games, it will take a long time to finish due to the slow environment steps.
To solve this, d3rlpy supports batch online training.
import gym

from d3rlpy.algos import SAC
from d3rlpy.envs import AsyncBatchEnv

if __name__ == '__main__':  # this is necessary if you use AsyncBatchEnv
    # distribute 10 environments across different processes
    env = AsyncBatchEnv([lambda: gym.make('Hopper-v2') for _ in range(10)])

    sac = SAC(use_gpu=True)

    # train with 10 environments concurrently
    sac.fit_batch_online(env)
Pre-built d3rlpy docker image is available in DockerHub.
$ docker run -it --gpus all --name d3rlpy takuseno/d3rlpy:latest bash
- BEAR algorithm is updated based on the official implementation
- mmd_kernel option is available
- to_mdp_dataset method is added to ReplayBuffer
- ConstantEpsilonGreedy explorer is added
- d3rlpy.envs.ChannelFirst wrapper is added (thanks for reporting, @feyza-droid)
- d3rlpy.datasets.get_d4rl is added
- d3rlpy plot CLI function (thanks, @pstansell)
- save_interval argument is added to fit_online
Published by takuseno almost 4 years ago
- typing-extensions dependency
Published by takuseno almost 4 years ago
Now, d3rlpy is fully type-annotated not only for the better use of this library but also for the better contribution experiences.
mypy and pylint check the type consistency and code quality.
v0.50 introduces the new command-line interface, the d3rlpy command, which helps you do more without any effort. For now, d3rlpy provides the following commands.
# plot CSV data
$ d3rlpy plot d3rlpy_logs/XXX/YYY.csv
# plot all CSV data in the directory
$ d3rlpy plot-all d3rlpy_logs/XXX
# export the saved model as inference formats (e.g. ONNX, TorchScript)
$ d3rlpy export d3rlpy_logs/XXX/model_YYY.pt