Configuration
RSL-RL is configured with a dictionary that is passed to RSL-RL’s runner during initialization. The dictionary is usually read from a YAML file or constructed from Python dataclasses, such as in Isaac Lab. It is nested to reflect the structure of the library, and follows this pattern:
The top level represents the runner configuration, which is composed of general settings and configuration dictionaries for the algorithm (e.g. PPO), as well as for the models used by the algorithm (e.g. actor and critic). The algorithm dictionary contains the parameters of the algorithm, and may contain one or more configuration dictionaries for extensions. The model dictionaries contain the parameters of the models, and may contain a configuration dictionary for a distribution.
In the following sections, we list the available settings for each configuration component, provide a minimal example configuration in YAML format, and explain how observations are configured.
Runner Configuration

Currently, RSL-RL implements two runner classes: OnPolicyRunner and DistillationRunner. The OnPolicyRunner is configured as follows:

| Key | Type | Default | Description |
|---|---|---|---|
| `num_steps_per_env` | int | required | Number of environment steps collected per iteration. |
| `obs_groups` | dict[str, list[str]] | required | Mapping from observation sets to observation groups coming from the environment. See the Observation Configuration section below for more details. |
| `save_interval` | int | required | Number of iterations between checkpoints. |
|  | str |  | Logging service to use. Valid values: |
|  | str | required for W&B | W&B project name used by the W&B writer. |
|  | str | required for Neptune | Neptune project name used by the Neptune writer. |
|  | str | missing | Optional run label shown in the console output. |
|  | bool |  | Whether to check for NaN values coming from the environment. |
| `algorithm` | dict | required | RL algorithm configuration. |
| `actor` | dict | required | Actor model configuration. |
| `critic` | dict | required | Critic model configuration. |
For the DistillationRunner, the actor and critic keys are simply
replaced by student and teacher keys, respectively:
| Key | Type | Default | Description |
|---|---|---|---|
| … | … | … | … |
| `student` | dict | required | Student model configuration. |
| `teacher` | dict | required | Teacher model configuration. |
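By analogy with the PPO example in the Example Configuration section, a minimal DistillationRunner configuration might look as follows. This is a hedged sketch: the observation-set names `student` and `teacher` mirror the runner keys above, and the values are placeholders rather than recommendations.

```yaml
# Sketch only: observation-set names and values are illustrative.
runner:
  num_steps_per_env: 24
  obs_groups: {"student": ["policy"], "teacher": ["policy", "privileged"]}
  save_interval: 100
  algorithm:
    class_name: Distillation
  student:
    class_name: MLPModel
  teacher:
    class_name: MLPModel
```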
Algorithm Configuration
RSL-RL implements two algorithms, PPO and
Distillation, which are configured as follows.
PPO

| Key | Type | Default | Description |
|---|---|---|---|
| `class_name` | str | required | Algorithm class name. Valid values: |
|  | str |  | Optimizer used for policy/value updates. Valid values: see |
|  | float |  | Optimizer learning rate. |
|  | int |  | Number of optimization epochs per iteration. |
|  | int |  | Number of mini-batches per iteration. |
|  | str |  | Learning-rate schedule. Valid values: |
|  | float |  | Coefficient for the value-function loss. |
|  | float |  | PPO clipping parameter for surrogate/value clipping. |
|  | bool |  | Whether to clip the value loss. |
|  | float |  | Target KL divergence used by the adaptive learning-rate schedule. |
|  | float |  | Entropy regularization coefficient. |
|  | float |  | Discount factor. |
|  | float |  | GAE lambda parameter. |
|  | float |  | Maximum gradient norm for gradient clipping. |
|  | bool |  | Whether to normalize advantages per mini-batch instead of across the entire rollout. |
|  | bool |  | Whether to share the CNN networks between actor and critic in case the |
|  | dict \| None |  | Optional RND extension configuration. |
|  | dict \| None |  | Optional symmetry extension configuration. |
Distillation

| Key | Type | Default | Description |
|---|---|---|---|
| `class_name` | str | required | Algorithm class name. Valid values: |
|  | str |  | Optimizer used for student updates. Valid values: see |
|  | float |  | Optimizer learning rate. |
|  | int |  | Number of optimization epochs per iteration. |
|  | int |  | Gradient backpropagation length. |
|  | float \| None |  | Maximum gradient norm for gradient clipping. |
|  | str |  | Loss type. Valid values: |
Model Configuration
Different algorithms use models for different purposes. For example, PPO uses an actor
and a critic, while Distillation uses a student and a teacher. Even though
their function might be different, they can all use the same underlying model classes. RSL-RL currently implements
three different models: MLPModel, RNNModel, and
CNNModel, which are configured as follows.
MLPModel

| Key | Type | Default | Description |
|---|---|---|---|
| `class_name` | str | required | Model class name. Valid values: |
|  | tuple[int] \| list[int] |  | Hidden dimensions of the MLP. |
|  | str |  | Activation function of the MLP. Valid values: see |
|  | bool |  | Whether to normalize the observations before passing them to the MLP. |
| `distribution_cfg` | dict \| None |  | Optional output distribution configuration. If provided, the model can output stochastic values sampled from the distribution. |
The distribution_cfg dictionary contains all parameters required by a specific distribution. RSL-RL implements two distributions by default: a simple Gaussian distribution (GaussianDistribution) and a Gaussian distribution with state-dependent standard deviation (HeteroscedasticGaussianDistribution). Both require the same parameters:

| Key | Type | Default | Description |
|---|---|---|---|
| `class_name` | str | required | Distribution class name. Valid values: |
|  | float |  | Initial standard deviation. |
|  | str |  | Parameterization of the standard deviation. Valid values: |
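For example, switching a model from the simple to the heteroscedastic Gaussian only changes the distribution's class name. A minimal sketch, based on the actor configuration shown in the Example Configuration section:

```yaml
# Sketch: only class_name differs from the GaussianDistribution case;
# the remaining distribution parameters keep their defaults.
actor:
  class_name: MLPModel
  distribution_cfg:
    class_name: HeteroscedasticGaussianDistribution
```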
RNNModel

The RNNModel inherits from the MLPModel and thus shares the same configuration keys, with the addition of the following keys:

| Key | Type | Default | Description |
|---|---|---|---|
| `class_name` | str | required | Model class name. Valid values: |
| … | … | … | … |
|  | str |  | Type of RNN network. Valid values: |
|  | int |  | Hidden dimension of the RNN. |
|  | int |  | Number of RNN layers. |
CNNModel

The CNNModel inherits from the MLPModel and thus shares the same configuration keys, with the addition of the following keys:

| Key | Type | Default | Description |
|---|---|---|---|
| `class_name` | str | required | Model class name. Valid values: |
| … | … | … | … |
| `cnn_cfg` | dict[str, dict] \| dict[str, Any] \| None |  | Configuration of the CNN encoder(s). |
Instead of directly passing the CNN parameters to the CNNModel (similar to how it is
done for the MLPModel and RNNModel), the parameters
are grouped in a dictionary cnn_cfg. This enables passing multiple CNN configurations for different observations
(e.g. different cameras). If only one CNN is needed or all CNNs have the same configuration, the dictionary may directly
contain the CNN parameters. If multiple CNNs with different configurations are needed, the dictionary must contain a
dictionary for each CNN configuration, with the key being the observation the configuration applies to. The
CNNModel will then create CNNs based on the provided configurations. A CNN
configuration includes the following parameters:
| Key | Type | Default | Description |
|---|---|---|---|
|  | tuple[int] \| list[int] | required | Output channels for each convolutional layer. |
|  | int \| tuple[int] \| list[int] | required | Kernel size for each convolutional layer, or a single kernel size for all layers. |
|  | int \| tuple[int] \| list[int] |  | Stride for each convolutional layer, or a single stride for all layers. |
|  | int \| tuple[int] \| list[int] |  | Dilation for each convolutional layer, or a single dilation for all layers. |
|  | str |  | Padding type to use. Valid values: |
|  | str \| tuple[str] \| list[str] |  | Normalization type for each convolutional layer, or a single normalization type for all layers. Valid values: |
|  | str |  | Activation function to use. Valid values: see |
|  | bool \| tuple[bool] \| list[bool] |  | Whether to apply max pooling after each convolutional layer, or a single boolean for all layers. |
|  | str |  | Global pooling type to apply at the end. Valid values: |
|  | bool |  | Whether to flatten the output tensor. |
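The two accepted shapes of cnn_cfg described above can be sketched as follows. The parameter names (`out_channels`, `kernel_size`) and observation names (`camera_front`, `camera_wrist`) are hypothetical placeholders, not the library's actual key names:

```yaml
# Shape 1: a single CNN (or identical CNNs for all image observations).
# The parameters sit directly in cnn_cfg; key names here are hypothetical.
cnn_cfg:
  out_channels: [32, 64]
  kernel_size: 3
---
# Shape 2: one sub-dictionary per CNN, keyed by the observation the
# configuration applies to (observation names here are hypothetical).
cnn_cfg:
  camera_front:
    out_channels: [32, 64]
    kernel_size: 3
  camera_wrist:
    out_channels: [16, 32]
    kernel_size: 5
```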
Extension Configuration

RSL-RL currently features two extensions for PPO: RandomNetworkDistillation and Symmetry, which may be configured as follows.
Random Network Distillation

| Key | Type | Default | Description |
|---|---|---|---|
|  | float |  | Initial weight of the RND reward. |
|  | dict \| None |  | Weight schedule for the RND reward. Valid values: see |
|  | float |  | Learning rate for the RND optimizer. |
|  | tuple[int] \| list[int] | required | Hidden dimensions of the RND predictor network. |
|  | tuple[int] \| list[int] | required | Hidden dimensions of the RND target network. |
|  | int | required | Number of outputs of the RND networks. |
|  | str |  | Activation function for the RND networks. Valid values: see |
|  | bool |  | Whether to normalize the RND state. |
|  | bool |  | Whether to normalize the RND reward. |
Symmetry Augmentation

| Key | Type | Default | Description |
|---|---|---|---|
|  | bool | required | Whether to add symmetric trajectories to the batch. |
|  | str \| callable \| None | required | Function to generate symmetric trajectories. Resolved using |
|  | bool | required | Whether to add a symmetry loss term to the loss function. |
|  | float | required | Coefficient for the symmetry loss. |
Example Configuration

While the previous sections make it seem rather complicated to set up a configuration, the required configuration to run a training with, e.g., PPO is actually quite simple. The following configuration is already sufficient:

```yaml
runner:
  num_steps_per_env: 24
  obs_groups: {"actor": ["policy"], "critic": ["policy", "privileged"]}
  save_interval: 100
  algorithm:
    class_name: PPO
  actor:
    class_name: MLPModel
    distribution_cfg:
      class_name: GaussianDistribution
  critic:
    class_name: MLPModel
```
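The same configuration can also be built as a plain Python dictionary, e.g. when constructing it programmatically instead of reading it from YAML. How the dictionary is then handed to the runner depends on your setup:

```python
# The YAML example above, mirrored as a nested Python dictionary.
train_cfg = {
    "runner": {
        "num_steps_per_env": 24,
        "obs_groups": {"actor": ["policy"], "critic": ["policy", "privileged"]},
        "save_interval": 100,
        "algorithm": {"class_name": "PPO"},
        "actor": {
            "class_name": "MLPModel",
            "distribution_cfg": {"class_name": "GaussianDistribution"},
        },
        "critic": {"class_name": "MLPModel"},
    }
}

# The nesting follows the pattern described at the top of this page:
# runner -> algorithm / model dictionaries -> (optional) distribution.
assert train_cfg["runner"]["algorithm"]["class_name"] == "PPO"
assert (
    train_cfg["runner"]["actor"]["distribution_cfg"]["class_name"]
    == "GaussianDistribution"
)
```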
Observation Configuration
RSL-RL expects the step() method of the environment to return observations as a
TensorDict. This dictionary contains one or more tensors with observation data, referred
to as observation groups in RSL-RL and Isaac Lab.
The obs_groups dictionary of the runner configuration defines which observation groups
are used for which purpose. Each purpose defines its own observation set, which is simply a list of observation
groups. In other words, the obs_groups dictionary maps from observation sets to lists of observation groups.
As the above definition is quite abstract, let’s consider a simple example for a
PPO training. The step() method of our environment
might return the following observations:
```python
obs = TensorDict(
    {
        "policy": torch.tensor([1.0, 2.0, 3.0]),  # available during robot deployment
        "privileged": torch.tensor([4.0, 5.0, 6.0]),  # only available during training
    }
)
```
Let’s assume the “policy” observation group is meant for both actor and critic. The “privileged” observation group is
only available during training and therefore cannot be used by the actor model, but may still improve learning
performance when passed to the critic. Thus, the obs_groups dictionary would be configured as follows:
obs_groups: {"actor": ["policy"], "critic": ["policy", "privileged"]}
With this configuration, the actor would receive the “policy” tensor as input, while the critic would receive both the “policy” and the “privileged” tensor as input.
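The resolution of observation sets can be sketched in plain Python. Lists stand in for tensors here, and the concatenation of groups within a set is illustrative rather than a statement about RSL-RL internals:

```python
# Sketch of how obs_groups maps observation sets to model inputs.
# Plain lists stand in for the tensors of the TensorDict example above.
obs = {
    "policy": [1.0, 2.0, 3.0],
    "privileged": [4.0, 5.0, 6.0],
}

obs_groups = {"actor": ["policy"], "critic": ["policy", "privileged"]}


def collect(obs_set: str) -> list[float]:
    """Gather the observation groups of one set, concatenated in order."""
    out: list[float] = []
    for group in obs_groups[obs_set]:
        out.extend(obs[group])
    return out


actor_input = collect("actor")    # the "policy" data only
critic_input = collect("critic")  # "policy" followed by "privileged" data
```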
Depending on the algorithm and extensions used, RSL-RL expects different observation sets to be present in the
obs_groups dictionary. Currently, the following observation sets may be required, depending on the configuration:
| Key | Description |
|---|---|
| `actor` | Observations used as input to the actor model. |
| `critic` | Observations used as input to the critic model. |
| `student` | Observations used as input to the student model. |
| `teacher` | Observations used as input to the teacher model. |
|  | Observations used as input to the RND extension. |
Incomplete or incorrect configurations are handled in resolve_obs_groups(), which provides
detailed information on how errors are resolved.