Algorithms

PPO

class rsl_rl.algorithms.ppo.PPO[source]

Proximal Policy Optimization algorithm.

Reference:
  • Schulman et al. “Proximal policy optimization algorithms.” arXiv preprint arXiv:1707.06347 (2017).
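For orientation, the clipped surrogate objective from the paper above (the quantity this class maximizes, with `clip_param` playing the role of ε) can be written as:

```latex
L^{\mathrm{CLIP}}(\theta)
  = \hat{\mathbb{E}}_t\!\left[
      \min\!\big(
        r_t(\theta)\,\hat{A}_t,\;
        \operatorname{clip}\!\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t
      \big)
    \right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```

Here \(\hat{A}_t\) is the advantage estimate and \(r_t(\theta)\) the probability ratio between the current and rollout-time policies.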

__init__(actor, critic, storage, num_learning_epochs=5, num_mini_batches=4, clip_param=0.2, gamma=0.99, lam=0.95, value_loss_coef=1.0, entropy_coef=0.01, learning_rate=0.001, max_grad_norm=1.0, optimizer='adam', use_clipped_value_loss=True, schedule='adaptive', desired_kl=0.01, normalize_advantage_per_mini_batch=False, device='cpu', rnd_cfg=None, symmetry_cfg=None, multi_gpu_cfg=None)[source]

Initialize the algorithm with models, storage, and optimization settings.

Parameters:
  • actor (MLPModel)

  • critic (MLPModel)

  • storage (RolloutStorage)

  • num_learning_epochs (int)

  • num_mini_batches (int)

  • clip_param (float)

  • gamma (float)

  • lam (float)

  • value_loss_coef (float)

  • entropy_coef (float)

  • learning_rate (float)

  • max_grad_norm (float)

  • optimizer (str)

  • use_clipped_value_loss (bool)

  • schedule (str)

  • desired_kl (float)

  • normalize_advantage_per_mini_batch (bool)

  • device (str)

  • rnd_cfg (dict | None)

  • symmetry_cfg (dict | None)

  • multi_gpu_cfg (dict | None)

Return type:

None
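When `schedule='adaptive'`, the learning rate is adjusted during `update()` so that the measured KL divergence tracks `desired_kl`. A minimal sketch of that idea, assuming a common heuristic (the exact thresholds, growth factor of 1.5, and learning-rate bounds here are assumptions, not guaranteed to match the implementation):

```python
def adapt_learning_rate(lr, kl, desired_kl=0.01, min_lr=1e-5, max_lr=1e-2):
    """Adaptive-KL schedule sketch: shrink the step size when the measured
    KL divergence overshoots the target, grow it when KL undershoots."""
    if kl > desired_kl * 2.0:
        # Policy moved too far in one update: back off.
        lr = max(min_lr, lr / 1.5)
    elif 0.0 < kl < desired_kl / 2.0:
        # Updates are overly conservative: speed up.
        lr = min(max_lr, lr * 1.5)
    return lr
```

With `desired_kl=0.01`, a measured KL of 0.03 halves the step toward `min_lr`, while a KL of 0.001 grows it; anything in between leaves the rate unchanged.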

actor: MLPModel

The actor model.

critic: MLPModel

The critic model.

act(obs)[source]

Sample actions and store transition data.

Parameters:

obs (tensordict.TensorDict)

Return type:

torch.Tensor

process_env_step(obs, rewards, dones, extras)[source]

Record one environment step and update the normalizers.

Parameters:
  • obs (tensordict.TensorDict)

  • rewards (torch.Tensor)

  • dones (torch.Tensor)

  • extras (dict[str, torch.Tensor])

Return type:

None

compute_returns(obs)[source]

Compute return and advantage targets from stored transitions.

Parameters:

obs (tensordict.TensorDict)

Return type:

None
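The `gamma` and `lam` parameters above correspond to discounted returns with Generalized Advantage Estimation (GAE). A minimal, dependency-free sketch of the recursion over one rollout (operating on flat Python lists rather than the tensors the class uses internally):

```python
def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """GAE sketch: advantages via the backward recursion
    gae_t = delta_t + gamma * lam * (1 - done_t) * gae_{t+1},
    where delta_t = r_t + gamma * V(s_{t+1}) * (1 - done_t) - V(s_t)."""
    num_steps = len(rewards)
    advantages = [0.0] * num_steps
    gae = 0.0
    for t in reversed(range(num_steps)):
        next_value = last_value if t == num_steps - 1 else values[t + 1]
        next_nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * next_nonterminal - values[t]
        gae = delta + gamma * lam * next_nonterminal * gae
        advantages[t] = gae
    # Return targets for the value function are advantage + value estimate.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns
```

Note how `dones` cuts the bootstrap: a terminal step contributes only its immediate reward, regardless of `last_value`.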

update()[source]

Run optimization epochs over stored batches and return mean losses.

Return type:

dict[str, float]
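The per-sample losses minimized during `update()` follow the clipped PPO formulation, with `clip_param` bounding both the probability ratio and (when `use_clipped_value_loss=True`) the value update. A scalar sketch under those assumptions (the class itself operates on batched tensors):

```python
import math

def ppo_sample_losses(log_prob, old_log_prob, advantage,
                      value, old_value, return_target, clip_param=0.2):
    """Clipped surrogate policy loss and clipped value loss for one sample."""
    # Probability ratio between current and rollout-time policy.
    ratio = math.exp(log_prob - old_log_prob)
    clipped_ratio = max(min(ratio, 1.0 + clip_param), 1.0 - clip_param)
    # Negated because we minimize; the surrogate objective is maximized.
    policy_loss = -min(ratio * advantage, clipped_ratio * advantage)
    # Clipped value loss: limit how far the new value estimate may move
    # from the one recorded at rollout time.
    value_clipped = old_value + max(min(value - old_value, clip_param), -clip_param)
    value_loss = max((value - return_target) ** 2,
                     (value_clipped - return_target) ** 2)
    return policy_loss, value_loss
```

The total objective then combines these with the constructor's coefficients, roughly `policy_loss + value_loss_coef * value_loss - entropy_coef * entropy`.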

train_mode()[source]

Set train mode for learnable models.

Return type:

None

eval_mode()[source]

Set evaluation mode for learnable models.

Return type:

None

save()[source]

Return a dict of all models for saving.

Return type:

dict

load(loaded_dict, load_cfg, strict)[source]

Load specified models from a saved dict.

Parameters:
  • loaded_dict (dict)

  • load_cfg (dict | None)

  • strict (bool)

Return type:

bool

get_policy()[source]

Get the policy model.

Return type:

MLPModel

static construct_algorithm(obs, env, cfg, device)[source]

Construct the PPO algorithm.

Parameters:
  • obs (tensordict.TensorDict)

  • env (VecEnv)

  • cfg (dict)

  • device (str)

Return type:

PPO

broadcast_parameters()[source]

Broadcast model parameters to all GPUs.

Return type:

None

reduce_parameters()[source]

Collect gradients from all GPUs and average them.

This function is called after the backward pass to synchronize the gradients across all GPUs.

Return type:

None
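In practice this synchronization is done with a distributed all-reduce over the gradient buffers (e.g. via `torch.distributed`); the arithmetic it implements is just an element-wise mean across ranks. A pure-Python sketch of that reduction, with flat gradient lists standing in for the real per-parameter tensors:

```python
def average_gradients(per_gpu_grads):
    """Average corresponding gradient entries across GPUs.

    per_gpu_grads: one flat list of gradient values per GPU rank.
    Returns the averaged gradient list every rank would end up with.
    """
    world_size = len(per_gpu_grads)
    return [sum(entries) / world_size for entries in zip(*per_gpu_grads)]
```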

Distillation

class rsl_rl.algorithms.distillation.Distillation[source]

Distillation algorithm for training a student model to mimic a teacher model.

teacher_loaded: bool = False

Indicates whether the teacher model parameters have been loaded.

__init__(student, teacher, storage, num_learning_epochs=1, gradient_length=15, learning_rate=0.001, max_grad_norm=None, loss_type='mse', optimizer='adam', device='cpu', multi_gpu_cfg=None, **kwargs)[source]

Initialize the algorithm with models, storage, and optimization settings.

Parameters:
  • student (MLPModel)

  • teacher (MLPModel)

  • storage (RolloutStorage)

  • num_learning_epochs (int)

  • gradient_length (int)

  • learning_rate (float)

  • max_grad_norm (float | None)

  • loss_type (str)

  • optimizer (str)

  • device (str)

  • multi_gpu_cfg (dict | None)

  • kwargs (dict)

Return type:

None

student: MLPModel

The student model.

teacher: MLPModel

The teacher model.

act(obs)[source]

Sample actions and store transition data.

Parameters:

obs (tensordict.TensorDict)

Return type:

torch.Tensor

process_env_step(obs, rewards, dones, extras)[source]

Record one environment step and update the normalizers.

Parameters:
  • obs (tensordict.TensorDict)

  • rewards (torch.Tensor)

  • dones (torch.Tensor)

  • extras (dict[str, torch.Tensor])

Return type:

None

compute_returns(obs)[source]

No-op since distillation does not use return targets.

Parameters:

obs (tensordict.TensorDict)

Return type:

None

update()[source]

Run optimization epochs over stored batches and return mean losses.

Return type:

dict[str, float]
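The core of the distillation update is a regression loss between the student's actions and the teacher's, selected by the constructor's `loss_type`. A scalar sketch (only `'mse'` is documented above as the default; the `'l1'` branch is an assumption added for illustration):

```python
def distillation_loss(student_actions, teacher_actions, loss_type="mse"):
    """Mean regression loss between student and teacher action vectors."""
    n = len(student_actions)
    diffs = [s - t for s, t in zip(student_actions, teacher_actions)]
    if loss_type == "mse":
        return sum(d * d for d in diffs) / n
    if loss_type == "l1":  # assumed alternative, not confirmed by the docs
        return sum(abs(d) for d in diffs) / n
    raise ValueError(f"unsupported loss_type: {loss_type}")
```

Unlike PPO, no return or advantage targets enter this loss, which is why `compute_returns()` is a no-op for this class.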

train_mode()[source]

Set train mode for the student and keep the teacher in eval mode.

Return type:

None

eval_mode()[source]

Set evaluation mode for student and teacher models.

Return type:

None

save()[source]

Return a dict of all models for saving.

Return type:

dict

load(loaded_dict, load_cfg, strict)[source]

Load specified models from a saved dict.

Parameters:
  • loaded_dict (dict)

  • load_cfg (dict | None)

  • strict (bool)

Return type:

bool

get_policy()[source]

Get the policy model.

Return type:

MLPModel

static construct_algorithm(obs, env, cfg, device)[source]

Construct the distillation algorithm.

Parameters:
  • obs (tensordict.TensorDict)

  • env (VecEnv)

  • cfg (dict)

  • device (str)

Return type:

Distillation

broadcast_parameters()[source]

Broadcast model parameters to all GPUs.

Return type:

None

reduce_parameters()[source]

Collect gradients from all GPUs and average them.

This function is called after the backward pass to synchronize the gradients across all GPUs.

Return type:

None