Algorithms
PPO
- class rsl_rl.algorithms.ppo.PPO[source]
Proximal Policy Optimization algorithm.
- Reference:
Schulman et al. “Proximal policy optimization algorithms.” arXiv preprint arXiv:1707.06347 (2017).
- __init__(actor, critic, storage, num_learning_epochs=5, num_mini_batches=4, clip_param=0.2, gamma=0.99, lam=0.95, value_loss_coef=1.0, entropy_coef=0.01, learning_rate=0.001, max_grad_norm=1.0, optimizer='adam', use_clipped_value_loss=True, schedule='adaptive', desired_kl=0.01, normalize_advantage_per_mini_batch=False, device='cpu', rnd_cfg=None, symmetry_cfg=None, multi_gpu_cfg=None)[source]
Initialize the algorithm with models, storage, and optimization settings.
- Parameters:
actor (MLPModel)
critic (MLPModel)
storage (RolloutStorage)
num_learning_epochs (int)
num_mini_batches (int)
clip_param (float)
gamma (float)
lam (float)
value_loss_coef (float)
entropy_coef (float)
learning_rate (float)
max_grad_norm (float)
optimizer (str)
use_clipped_value_loss (bool)
schedule (str)
desired_kl (float)
normalize_advantage_per_mini_batch (bool)
device (str)
rnd_cfg (dict | None)
symmetry_cfg (dict | None)
multi_gpu_cfg (dict | None)
- Return type:
None
- act(obs)[source]
Sample actions and store transition data.
- Parameters:
obs (tensordict.TensorDict)
- Return type:
torch.Tensor
- process_env_step(obs, rewards, dones, extras)[source]
Record one environment step and update the normalizers.
- Parameters:
obs (tensordict.TensorDict)
rewards (torch.Tensor)
dones (torch.Tensor)
extras (dict[str, torch.Tensor])
- Return type:
None
- compute_returns(obs)[source]
Compute return and advantage targets from stored transitions.
- Parameters:
obs (tensordict.TensorDict)
- Return type:
None
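The return and advantage targets are computed with Generalized Advantage Estimation, controlled by the gamma and lam constructor arguments. A minimal sketch over plain Python lists (the library operates on batched tensors; function and argument names here are illustrative):

```python
def gae_returns(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation (sketch).

    Sweeps backward through the rollout, accumulating the TD residual
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t) with decay gamma * lam,
    and resetting the accumulator at episode boundaries (dones).
    Returns (return targets, advantages), with return = advantage + value.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        next_value = last_value if t == len(rewards) - 1 else values[t + 1]
        not_done = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    returns = [a + v for a, v in zip(advantages, values)]
    return returns, advantages
```

The obs argument supplies the bootstrap value for the final state; here it is passed in directly as last_value.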
- update()[source]
Run optimization epochs over stored batches and return mean losses.
- Return type:
dict[str, float]
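The update step minimizes the PPO clipped surrogate loss for the actor plus a value loss for the critic, governed by clip_param and use_clipped_value_loss. A per-sample scalar sketch (the library computes these over mini-batch tensors; the helper name is illustrative):

```python
def ppo_losses(ratio, advantage, value, return_target, old_value,
               clip_param=0.2, use_clipped_value_loss=True):
    """Per-sample PPO losses (sketch).

    ratio is pi_new(a|s) / pi_old(a|s). The surrogate takes the
    pessimistic minimum of the clipped and unclipped objectives,
    negated so it can be minimized.
    """
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + clip_param), 1.0 - clip_param) * advantage
    surrogate_loss = -min(unclipped, clipped)

    if use_clipped_value_loss:
        # Clip the new value prediction to stay within clip_param of the
        # old estimate, and keep the worse (larger) squared error.
        value_clipped = old_value + max(min(value - old_value, clip_param),
                                        -clip_param)
        value_loss = max((value - return_target) ** 2,
                         (value_clipped - return_target) ** 2)
    else:
        value_loss = (value - return_target) ** 2
    return surrogate_loss, value_loss
```

The total objective would then combine these as surrogate_loss + value_loss_coef * value_loss - entropy_coef * entropy, with gradients clipped to max_grad_norm.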
- load(loaded_dict, load_cfg, strict)[source]
Load the specified models from a saved checkpoint dictionary.
- Parameters:
loaded_dict (dict)
load_cfg (dict | None)
strict (bool)
- Return type:
bool
Distillation
- class rsl_rl.algorithms.distillation.Distillation[source]
Distillation algorithm for training a student model to mimic a teacher model.
- teacher_loaded: bool = False
Indicates whether the teacher model parameters have been loaded.
- __init__(student, teacher, storage, num_learning_epochs=1, gradient_length=15, learning_rate=0.001, max_grad_norm=None, loss_type='mse', optimizer='adam', device='cpu', multi_gpu_cfg=None, **kwargs)[source]
Initialize the algorithm with models, storage, and optimization settings.
- Parameters:
student (MLPModel)
teacher (MLPModel)
storage (RolloutStorage)
num_learning_epochs (int)
gradient_length (int)
learning_rate (float)
max_grad_norm (float | None)
loss_type (str)
optimizer (str)
device (str)
multi_gpu_cfg (dict | None)
kwargs (dict)
- Return type:
None
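With the default loss_type='mse', the student is trained to reproduce the teacher's actions with a mean squared error objective. A scalar sketch over plain lists (the library works on batched tensors; the function name and list inputs are illustrative assumptions):

```python
def distillation_loss(student_actions, teacher_actions, loss_type="mse"):
    """Behavioral-cloning loss between student and teacher actions
    (sketch). Only the 'mse' variant from the constructor default is
    shown; other loss types would branch here."""
    if loss_type == "mse":
        n = len(student_actions)
        return sum((s - t) ** 2
                   for s, t in zip(student_actions, teacher_actions)) / n
    raise ValueError(f"unsupported loss_type: {loss_type}")
```

Gradients from this loss flow only into the student; the teacher is kept frozen in eval mode (see train_mode below).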
- act(obs)[source]
Sample actions and store transition data.
- Parameters:
obs (tensordict.TensorDict)
- Return type:
torch.Tensor
- process_env_step(obs, rewards, dones, extras)[source]
Record one environment step and update the normalizers.
- Parameters:
obs (tensordict.TensorDict)
rewards (torch.Tensor)
dones (torch.Tensor)
extras (dict[str, torch.Tensor])
- Return type:
None
- compute_returns(obs)[source]
A no-op, since distillation does not use return targets.
- Parameters:
obs (tensordict.TensorDict)
- Return type:
None
- update()[source]
Run optimization epochs over stored batches and return mean losses.
- Return type:
dict[str, float]
- train_mode()[source]
Set train mode for the student and keep the teacher in eval mode.
- Return type:
None
- load(loaded_dict, load_cfg, strict)[source]
Load the specified models from a saved checkpoint dictionary.
- Parameters:
loaded_dict (dict)
load_cfg (dict | None)
strict (bool)
- Return type:
bool