Overview¶
This guide provides an overview of the features and structure of RSL-RL. After introducing the currently available features, we explain the core components and additional components of the library. Finally, we provide a minimal example of how to integrate RSL-RL into a project.
Library Features¶
RSL-RL is intentionally kept minimal and focuses on a small set of components that cover common robotics workflows while remaining easy to adapt. The following sections summarize the main features currently available.
Algorithms¶
- PPO
PPO is a model-free, on-policy RL algorithm for continuous control. In each iteration, the algorithm collects rollouts, computes GAE-based return targets, and optimizes actor and critic losses over multiple mini-batches. The implementation includes clipped surrogate and value losses, entropy regularization, gradient clipping, and optional adaptive learning-rate scheduling based on the observed KL divergence.
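The clipped surrogate objective mentioned above can be sketched in a few lines; the function name and signature here are illustrative, not the RSL-RL API:

```python
import numpy as np

def ppo_clip_loss(log_prob_new, log_prob_old, advantages, clip_ratio=0.2):
    """Clipped surrogate objective (negated, so lower is better)."""
    ratio = np.exp(log_prob_new - log_prob_old)  # importance sampling ratio
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio) * advantages
    # Take the pessimistic (minimum) objective per sample, then average
    return -np.minimum(unclipped, clipped).mean()
```

When the new and old policies agree, the ratio is 1 and the loss reduces to the negated mean advantage; as the policies diverge, clipping bounds the incentive to move further.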
- Distillation
Distillation is a student-teacher behavior cloning algorithm. During data collection, the student acts while the teacher provides supervision targets. The student is then optimized with a configurable behavior loss, making this algorithm useful for transferring policies from training-time privileged observation sets to observations available during real-world deployment.
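In its simplest form, the behavior loss is a mean-squared error between the student's actions and the teacher's supervision targets; a minimal sketch (hypothetical helper, not the library's API):

```python
import numpy as np

def behavior_loss(student_actions, teacher_actions):
    # Mean-squared error between the student's actions and the
    # teacher's supervision targets, averaged over batch and action dims
    return np.mean((student_actions - teacher_actions) ** 2)
```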
Models¶
- MLPModel
A feed-forward baseline model for actor, critic, student, or teacher networks. It concatenates selected observation groups, optionally normalizes them, processes them through an MLP, and optionally maps outputs through a stochastic action distribution.
- RNNModel
A recurrent extension of the MLPModel for partially observable settings. The recurrent network can be either a Long Short-Term Memory (LSTM) or a Gated Recurrent Unit (GRU). Its output is passed through the MLP to produce the final output.
- CNNModel
A model for mixed 1D + 2D observations. It combines an MLP pathway for 1D observations with one or more CNN encoders for 2D observations. Each 2D observation group is encoded with a separate CNN encoder that can be configured independently. When used in conjunction with PPO, the encoders may be shared between actor and critic to save memory.
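The MLPModel idea can be sketched by concatenating observation groups and passing them through a small MLP; the group names, shapes, and weights below are illustrative assumptions, not the RSL-RL implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical observation groups: 4 environments, two groups of different widths
obs_groups = {"policy": rng.standard_normal((4, 6)), "privileged": rng.standard_normal((4, 3))}

# Concatenate the selected groups along the feature dimension
x = np.concatenate([obs_groups[g] for g in ("policy", "privileged")], axis=-1)  # shape (4, 9)

# A single hidden layer stands in for the configurable MLP
w1 = rng.standard_normal((9, 16))
w2 = rng.standard_normal((16, 2))
hidden = np.tanh(x @ w1)
output = hidden @ w2  # e.g., action means, shape (4, 2)
```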
Distributions¶
- GaussianDistribution
A diagonal Gaussian distribution with state-independent standard deviation parameters. The mean is produced by the model's MLP output, while the standard deviation is learned globally and can use either a scalar or a log-scale parameterization.
- HeteroscedasticGaussianDistribution
A diagonal Gaussian distribution with state-dependent standard deviation. The model's MLP predicts both mean and standard-deviation terms per sample, allowing uncertainty to vary with the observation. As with the standard Gaussian variant, both scalar and log-scale parameterizations are supported.
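Both distributions reduce to the same diagonal Gaussian log-density; a minimal NumPy sketch of that computation (illustrative helper, not the library's implementation):

```python
import numpy as np

def diag_gaussian_log_prob(x, mean, log_std):
    # Log-density of a diagonal Gaussian with per-dimension log standard
    # deviations, summed over the action dimensions
    var = np.exp(2.0 * log_std)
    return -0.5 * np.sum((x - mean) ** 2 / var + 2.0 * log_std + np.log(2.0 * np.pi), axis=-1)
```

For the state-independent variant, `log_std` is a global parameter; for the heteroscedastic variant, it is predicted per sample alongside the mean.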
Extensions¶
- RandomNetworkDistillation
Random Network Distillation (RND) adds an intrinsic reward based on the prediction error between a trainable predictor network and a fixed target network. The implementation supports selecting dedicated observation groups for curiosity, optional state and reward normalization, and configurable weight schedules for annealing the intrinsic reward contribution over training. This extension is compatible with the PPO algorithm. For more details, please check this paper.
- Symmetry
Symmetry augments the collected environment interaction data with mirrored data using a user-provided symmetry function that defines how observations and actions are transformed. This can improve sample efficiency and promote symmetric behaviors for robots with structured morphology. Additionally, a mirror-loss regularization term can be added to the loss function to actively encourage symmetry in the policy. This extension is compatible with the PPO algorithm. For more details, please check this paper.
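The mirroring idea can be sketched as follows, with hypothetical mirror functions standing in for the user-provided symmetry function (the assumed layout, where the second observation component and all actions flip sign, is purely for illustration):

```python
import numpy as np

def mirror_obs(obs):
    # Hypothetical observation mirror: flip the sign of the second component
    # (e.g., a lateral velocity) under the robot's left/right symmetry
    mirrored = obs.copy()
    mirrored[:, 1] *= -1.0
    return mirrored

def mirror_act(act):
    # Hypothetical action mirror: all action components flip sign
    return -act

def augment_batch(obs, act):
    # Double the batch by appending the mirrored transitions
    return (np.concatenate([obs, mirror_obs(obs)], axis=0),
            np.concatenate([act, mirror_act(act)], axis=0))
```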
Loggers¶
- TensorBoard
The default local logging backend for scalar metrics.
- Weights & Biases
Cloud-based experiment tracking with support for remote monitoring and model uploads.
- Neptune
A cloud-based experiment tracking alternative to W&B.
All logging backends are accessed through a common logger interface, and can be used with both single-GPU and distributed multi-GPU training workflows.
Core Components¶
RSL-RL consists of four core components: Runners, Algorithms, Models, and Modules, which implement the learning loop, the algorithmic logic, and the neural network architectures. Together with the library's additional components (Environment, Storage, Extensions, and Utils), described in the next section, they form a complete learning pipeline.
Note
In the following, functions are linked to implementations for a standard PPO training
pipeline. Other learning pipelines may have different implementations of these functions.
Runners¶
The Runner implements the main learning loop in learn() and
coordinates the interaction with the Environment. It is the main API of the library, providing functionality for
saving, loading, and exporting checkpoints and models via save(),
load(),
export_policy_to_jit(),
and export_policy_to_onnx(). Additionally, it coordinates the
initialization of the multi-GPU pipeline, the Algorithm, and the Logger.
Algorithms¶
The Algorithm implements the logic of the core learning process. It is responsible for acting on the observations
provided by the Environment in act(), processing data collected from the
environment interaction in process_env_step(), and updating the parameters of the
learnable Model instances in update(). The Algorithm can make use of a
Storage instance to store collected environment interaction data. The Algorithm manages one or more Model
instances and possibly algorithmic Extensions, which are initialized in
construct_algorithm().
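The act / process_env_step / update cycle described above can be sketched with a stand-in algorithm driven by a bare loop; the class and its counters are hypothetical, not the RSL-RL API:

```python
# Illustrative stand-in for an Algorithm, showing how a runner might drive
# the act / process_env_step / update cycle (hypothetical class, not RSL-RL's)
class CountingAlgorithm:
    def __init__(self):
        self.steps_processed = 0
        self.updates = 0

    def act(self, obs):
        return [0.0 for _ in obs]  # placeholder actions

    def process_env_step(self, obs, rewards, dones):
        self.steps_processed += 1  # e.g., append the transition to storage

    def update(self):
        self.updates += 1  # e.g., optimize models over stored mini-batches

algo = CountingAlgorithm()
obs = [0.0, 0.0]
for _ in range(24):  # one rollout of 24 environment steps
    actions = algo.act(obs)
    algo.process_env_step(obs, rewards=[0.0, 0.0], dones=[False, False])
algo.update()  # one learning update per rollout
```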
Models¶
The Model implements a neural network that is used by the Algorithm for different purposes. For example,
PPO uses Model instances for the actor and critic networks, while
Distillation uses Model instances for the student and teacher networks. The
forward() pass of a Model instance consists of three steps:
1. A latent is computed from raw observations in get_latent(). This may involve normalization, recurrent processing for RNNModel instances, or convolutional encoding for CNNModel instances.
2. The latent is passed to a Multi-Layer Perceptron (MLP).
3. The output of the MLP is either returned as is or used to sample from a Distribution, in case stochastic outputs are requested.
The Model acts as a unified interface for the Algorithm and coordinates all Module instances needed to implement a certain network architecture.
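The three steps above can be condensed into a single function; this is an illustrative sketch assuming a simple per-sample normalizer as the latent computation and a fixed diagonal Gaussian for sampling, not the library's forward() implementation:

```python
import numpy as np

def forward(obs, w1, w2, log_std=None, rng=None):
    # 1) Compute a latent from raw observations (here: per-sample normalization)
    latent = (obs - obs.mean(axis=-1, keepdims=True)) / (obs.std(axis=-1, keepdims=True) + 1e-8)
    # 2) Pass the latent through an MLP (one hidden layer for brevity)
    out = np.tanh(latent @ w1) @ w2
    # 3) Optionally sample from a diagonal Gaussian distribution around the output
    if log_std is not None:
        rng = rng or np.random.default_rng(0)
        out = out + np.exp(log_std) * rng.standard_normal(out.shape)
    return out
```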
Modules¶
A Module implements a building block for the Model. This could be a neural network, such as the
MLP, a normalizer, such as
EmpiricalNormalization, or a distribution, such as the
GaussianDistribution. Modules are initialized in the Model constructor and
managed by the Model.
Additional Components¶
Additional components support the core components by defining interfaces, adding optional functionality, or providing utilities such as data storage and logging.
Environment¶
The Environment implements an abstract interface that is used by the Runner to interact with the environment.
Next to some required attributes, the Environment must implement the step()
and get_observations() methods.
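A toy environment satisfying such an interface might look like the following; the class, its attributes, and the observation shapes are illustrative assumptions, not the actual VecEnv base class:

```python
import numpy as np

class RandomVecEnv:
    """Toy vectorized environment sketch (illustrative, not RSL-RL's VecEnv)."""

    def __init__(self, num_envs=4, obs_dim=8, act_dim=2):
        self.num_envs, self.obs_dim, self.act_dim = num_envs, obs_dim, act_dim
        self._obs = np.zeros((num_envs, obs_dim), dtype=np.float32)

    def get_observations(self):
        # Return the current batched observations
        return self._obs

    def step(self, actions):
        # Advance all environments by one step with random dynamics
        self._obs = np.random.randn(self.num_envs, self.obs_dim).astype(np.float32)
        rewards = -np.linalg.norm(actions, axis=-1)  # toy reward: small actions
        dones = np.zeros(self.num_envs, dtype=bool)
        extras = {}
        return self._obs, rewards, dones, extras
```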
Storage¶
The Storage implements a storage buffer that is used by the Algorithm to store collected environment interaction
data. It returns the data to the Algorithm for its update() in a suitable format,
for example in mini-batches.
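As a sketch, a storage might serve shuffled mini-batches to the algorithm like this (hypothetical helper, not the library's storage implementation):

```python
import numpy as np

def mini_batches(data, batch_size, seed=0):
    # Shuffle the collected samples and yield them in mini-batches,
    # as a storage might do when serving the algorithm's update()
    idx = np.random.default_rng(seed).permutation(len(data))
    for start in range(0, len(data), batch_size):
        yield data[idx[start:start + batch_size]]
```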
Extensions¶
An Extension implements an augmentation to a specific Algorithm to modify its behavior. Currently, RSL-RL does not constrain the way an Extension may be implemented, allowing for arbitrary modifications to the learning process.
Utils¶
Utils include various helpers for the library, such as a Logger to record the learning
process, or functions to resolve configuration settings.
Example Integration¶
The following example shows a simple training script which 1) creates an environment, 2) loads a YAML configuration, 3)
initializes an OnPolicyRunner, 4) runs the training, and 5) exports the
trained policy for deployment. The environment is defined by the user and must implement the
VecEnv interface. The configuration setup is described in the
configuration guide.
```python
import yaml

from rsl_rl.runners import OnPolicyRunner

# 1) Create your environment (usually provided by environment libraries such as Isaac Lab)
env = make_env()

# 2) Load a YAML configuration and extract the configuration dictionary expected by RSL-RL
with open("config/my_training.yaml", "r", encoding="utf-8") as f:
    full_cfg = yaml.safe_load(f)
train_cfg = full_cfg["runner"]

# 3) Build the runner
runner = OnPolicyRunner(
    env=env,
    train_cfg=train_cfg,
    log_dir="logs/my_experiment",  # Directory for saving checkpoints and logs
    device="cuda:0",  # Device to run the training on
)

# 4) Start training
runner.learn(num_learning_iterations=1500)  # Specify the number of desired iterations

# 5) Export the trained policy for deployment
runner.export_policy_to_jit("logs/my_experiment/exported", filename="policy.pt")
runner.export_policy_to_onnx("logs/my_experiment/exported", filename="policy.onnx")
```
A trained policy can later be loaded to continue a previous training run or to be replayed for evaluation, as in the following replay script:
```python
import yaml

from rsl_rl.runners import OnPolicyRunner

# 1) Create your environment (usually provided by environment libraries such as Isaac Lab)
env = make_env()

# 2) Load the YAML configuration used during training
with open("config/my_training.yaml", "r", encoding="utf-8") as f:
    full_cfg = yaml.safe_load(f)
train_cfg = full_cfg["runner"]

# 3) Build the runner
runner = OnPolicyRunner(
    env=env,
    train_cfg=train_cfg,
    device="cuda:0",
)

# 4) Load the trained policy
runner.load("logs/my_experiment/model_1499.pt")

# 5) Get the inference policy
policy = runner.get_inference_policy()

# 6) Run the policy in the environment for 1000 steps
obs = env.get_observations()
for _ in range(1000):
    actions = policy(obs)
    obs, rewards, dones, extras = env.step(actions)
```