How to Harden a Policy for Sim2Real
This guide covers the deployment recipe: which domain randomization (DR) and noise to enable while training, what to dump at export, and how to align the policy on the hardware side. It assumes you can already train and replay a task (see Run RL Experiments).
The goal is a policy that survives the reality gap — inaccurate inertia, friction, PD gains, sensor bias, and actuator imperfections — without being trained on the real robot.
1. Enabling domain randomization while training
DR events are wired into events_cfg and follow the
EventTermCfg calling convention. Sample once per
episode with mode="startup"; sample mid-episode with mode="interval".
| DR event (genelab.mdp.dr / mdp.events) | What it perturbs |
|---|---|
| geom_friction | per-link friction coefficient (ground / feet) |
| body_com_offset | centre-of-mass position (inertial calibration error) |
| body_mass_offset | per-link mass (kg) |
| randomize_joint_stiffness_damping | per-env PD gains (kp / kv): sim-side for implicit-PD, Python gain scale for force-channel |
| randomize_actuator_deadzone | per-joint torque deadzone (stiction / backlash) |
| encoder_bias | constant joint-angle offset (silently shifted zero-point) |
| push_by_setting_velocity (mdp.events) | impulse base-velocity kick; use mode="interval" |
A representative startup block (mirrors examples/unitree/.../g1/env_cfg.py):

    from genelab import mdp
    from genelab.managers import EventTermCfg
    from genelab.managers.scene_entity_cfg import SceneEntityCfg

    events_cfg = {
        "foot_friction": EventTermCfg(
            mode="startup",
            func=mdp.dr.geom_friction,
            params={
                "asset_cfg": SceneEntityCfg(name="robot", link_names=("left_foot", "right_foot")),
                "ranges": (0.3, 1.2),
                "shared_random": True,
            },
        ),
        "base_com": EventTermCfg(
            mode="startup",
            func=mdp.dr.body_com_offset,
            params={
                "asset_cfg": SceneEntityCfg(name="robot", link_names=("torso_link",)),
                "ranges": {0: (-0.025, 0.025), 1: (-0.025, 0.025), 2: (-0.03, 0.03)},
            },
        ),
        "pd_gains": EventTermCfg(
            mode="startup",
            func=mdp.dr.randomize_joint_stiffness_damping,
            params={"stiffness_range": (0.8, 1.2), "damping_range": (0.8, 1.2)},
        ),
        "encoder_bias": EventTermCfg(
            mode="startup",
            func=mdp.dr.encoder_bias,
            params={"asset_cfg": SceneEntityCfg(name="robot"), "bias_range": (-0.015, 0.015)},
        ),
        # Mid-episode push every 5–10 s of sim time.
        "push": EventTermCfg(
            mode="interval",
            interval_range_s=(5.0, 10.0),
            func=mdp.events.push_by_setting_velocity,
            params={"velocity_range": {0: (-0.5, 0.5), 1: (-0.5, 0.5)}},
        ),
    }
Start with mild ranges and keep widening them for as long as same-seed return drops by no more than ~10%; that is the budget the Reference Runs acceptance uses.
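The widening procedure can be sketched as a small loop. This is an illustration only and not part of genelab: `widen_range`, `evaluate_return` and the 25% growth step are hypothetical, and `evaluate_return` stands in for whatever train-and-evaluate run produces a same-seed return for a given range.

```python
from typing import Callable, Tuple

Range = Tuple[float, float]


def widen_range(
    start: Range,
    evaluate_return: Callable[[Range], float],  # hypothetical: same-seed return for a range
    baseline_return: float,
    budget: float = 0.10,  # accept at most a ~10% drop in return
    growth: float = 1.25,  # widen the half-width by 25% per step (arbitrary choice)
    max_steps: int = 10,
) -> Range:
    """Return the widest range tried whose return stays within the budget."""
    accepted = start
    centre = 0.5 * (start[0] + start[1])
    half = 0.5 * (start[1] - start[0])
    for _ in range(max_steps):
        half *= growth
        candidate = (centre - half, centre + half)
        drop = (baseline_return - evaluate_return(candidate)) / abs(baseline_return)
        if drop > budget:
            break  # this range costs too much; keep the previous one
        accepted = candidate
    return accepted


# Toy stand-in for a real evaluation: return falls linearly with range width.
def toy_return(r: Range) -> float:
    return 100.0 - 20.0 * (r[1] - r[0])


result = widen_range((0.9, 1.1), toy_return, baseline_return=toy_return((0.9, 1.1)))
```

In practice each call to the evaluation callback is a full training or fine-tuning run, so a coarse growth factor and a small `max_steps` keep the search affordable.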
2. Corrupting observations the way real sensors do
Set enable_corruption=True on the policy observation group and attach a NoiseCfg per
term. Keep the critic group uncorrupted — it learns from clean state.
| genelab.mdp.noise model | Use for |
|---|---|
| Unoise / Gnoise | baseline additive sensor noise |
| ScaledNoise | scale-factor / gain error (grows with the signal) |
| CorrelatedNoise | temporally correlated (AR(1)) noise: slow, coloured drift |
| BiasDrift | slowly drifting bias (random walk, optionally clamped) |

    from genelab import mdp
    from genelab.managers import ObservationGroupCfg, ObservationTermCfg
    from genelab.mdp.noise import BiasDrift, CorrelatedNoise, Unoise

    # Policy group: corrupted observations, one noise model per term.
    policy = ObservationGroupCfg(
        enable_corruption=True,
        terms={
            "joint_pos": ObservationTermCfg(func=mdp.joint_pos_rel, noise=Unoise(-0.01, 0.01)),
            "joint_vel": ObservationTermCfg(func=mdp.joint_vel_rel, noise=CorrelatedNoise(std=0.5, alpha=0.8)),
            "imu_ang_vel": ObservationTermCfg(
                func=mdp.sensor_data,
                params={"sensor_name": "imu"},
                noise=BiasDrift(drift_std=0.002),
            ),
        },
    )
    # Critic group: same terms, corruption disabled, so it sees clean state.
    critic = ObservationGroupCfg(enable_corruption=False, terms=policy.terms)
IMU sensors carry their own per-env bias — set bias_range_lin_acc / bias_range_ang_acc on
the IMUSensorCfg (resampled each reset) instead of layering BiasDrift on top.
CorrelatedNoise / BiasDrift are stateful and intentionally do not reset on episode boundaries (a real drifting sensor doesn't know about resets).
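To make the two stateful models concrete, here is a minimal standard-library sketch of an AR(1) noise source and a clamped random-walk bias. The class names `Ar1Noise` and `RandomWalkBias` are hypothetical, and the update rules (including the `sqrt(1 - alpha^2)` factor that keeps the stationary standard deviation equal to `std`) are assumptions, not genelab's actual implementation.

```python
import random


class Ar1Noise:
    """Additive AR(1) noise: n[t] = alpha * n[t-1] + sqrt(1 - alpha^2) * w[t]."""

    def __init__(self, std: float, alpha: float, seed: int = 0) -> None:
        self.std = std
        self.alpha = alpha
        self._rng = random.Random(seed)
        self._state = 0.0  # persists across calls and across episode resets

    def __call__(self, value: float) -> float:
        innovation = (1.0 - self.alpha ** 2) ** 0.5 * self._rng.gauss(0.0, self.std)
        self._state = self.alpha * self._state + innovation
        return value + self._state


class RandomWalkBias:
    """Additive bias that performs a random walk, clamped to [-clamp, clamp]."""

    def __init__(self, drift_std: float, clamp: float, seed: int = 0) -> None:
        self.drift_std = drift_std
        self.clamp = clamp
        self._rng = random.Random(seed)
        self._bias = 0.0

    def __call__(self, value: float) -> float:
        self._bias += self._rng.gauss(0.0, self.drift_std)
        self._bias = max(-self.clamp, min(self.clamp, self._bias))
        return value + self._bias


joint_vel_noise = Ar1Noise(std=0.5, alpha=0.8, seed=1)
episode_1 = [joint_vel_noise(0.0) for _ in range(3)]
# No reset between episodes: the second episode continues from the same state.
episode_2 = [joint_vel_noise(0.0) for _ in range(3)]

gyro_bias = RandomWalkBias(drift_std=0.002, clamp=0.01, seed=1)
drifted = [gyro_bias(0.0) for _ in range(1000)]
```

Because the internal state carries over, the first sample of a new episode is correlated with the last sample of the previous one, which is the behaviour the note above describes.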
3. Making the policy respect real actuator limits
Add hardening terms so the policy never learns behaviour the hardware can't reproduce.
- Terminations (genelab.mdp): joint_pos_out_of_limit, joint_vel_out_of_limit, contact_force_limit(sensor_name, max_force). Set a soft velocity limit with ArticulationCfg.joint_vel_limit so joint_vel_out_of_limit and the joint_vel_limits reward become active.
- Rewards (genelab.mdp): applied_torque_l2 (penalize effort), joint_vel_limits (penalize over-speed), alive_bonus (offset per-step penalties), lin_vel_z_l2 / base_height_l2 (discourage bouncing).
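As a rough picture of what these terms compute, the following standalone functions reuse three of the term names but are not the genelab implementations. The exact formulas, and the `soft_ratio` parameter that starts the penalty before the termination threshold, are assumptions made for illustration.

```python
from typing import Sequence


def joint_vel_out_of_limit(joint_vel: Sequence[float], vel_limit: float) -> bool:
    """Termination: True if any joint exceeds the velocity limit."""
    return any(abs(v) > vel_limit for v in joint_vel)


def joint_vel_limits_penalty(
    joint_vel: Sequence[float], vel_limit: float, soft_ratio: float = 0.9
) -> float:
    """Reward term: summed overshoot beyond soft_ratio * vel_limit (0 when inside)."""
    soft_limit = soft_ratio * vel_limit
    return sum(max(0.0, abs(v) - soft_limit) for v in joint_vel)


def applied_torque_l2(torques: Sequence[float]) -> float:
    """Reward term: squared L2 norm of the applied torques."""
    return sum(t * t for t in torques)


vel = [1.0, -9.5, 3.0]
terminated = joint_vel_out_of_limit(vel, vel_limit=10.0)
penalty = joint_vel_limits_penalty(vel, vel_limit=10.0, soft_ratio=0.9)
effort = applied_torque_l2([3.0, -4.0])
```

In this sketch the second joint is inside the hard limit, so the episode continues, but it already pays a penalty for exceeding the soft threshold. Both penalty terms would enter the reward with a negative weight.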
4. Modeling the actuator gap (optional)
With real torque-tracking logs, use MlpResidualActuator — a DCMotorActuator base
plus a TorchScript residual on [pos_error, joint_vel]. Train the residual net downstream and
point MlpResidualActuatorCfg.network_file at the saved .pt. With no file it degrades to a
plain DCMotorActuator.

    from genelab.actuator import MlpResidualActuatorCfg

    robot_cfg.actuators["drive"] = MlpResidualActuatorCfg(
        target_names_expr=(".*_joint",),
        stiffness=35.0,
        damping=0.8,
        effort_limit=80.0,
        velocity_limit=25.0,
        saturation_effort=80.0,
        network_file="assets/actuators/drive_residual.pt",
        residual_scale=0.25,
    )
See Actuators for the TorchScript input/output contract and the full field notes.
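The "DC-motor base plus learned residual" structure can be sketched in plain Python. Everything here is an assumption for illustration: the linear torque-speed clipping, the zero target velocity in the PD law, the way `residual_scale` multiplies the residual, and the `residual_fn` callable that stands in for the TorchScript network. Consult the Actuators page for the real contract.

```python
from typing import Callable, Optional


def _clamp(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))


def dc_motor_torque(
    pos_error: float,
    joint_vel: float,
    *,
    stiffness: float,
    damping: float,
    effort_limit: float,
    velocity_limit: float,
    saturation_effort: float,
) -> float:
    """PD torque clipped by a linear torque-speed curve (assumed motor model)."""
    desired = stiffness * pos_error - damping * joint_vel  # target velocity taken as 0
    max_tau = _clamp(saturation_effort * (1.0 - joint_vel / velocity_limit), 0.0, effort_limit)
    min_tau = _clamp(saturation_effort * (-1.0 - joint_vel / velocity_limit), -effort_limit, 0.0)
    return _clamp(desired, min_tau, max_tau)


def residual_actuator_torque(
    pos_error: float,
    joint_vel: float,
    residual_fn: Optional[Callable[[float, float], float]] = None,
    residual_scale: float = 0.25,
    **motor: float,
) -> float:
    """Base motor torque plus a scaled residual; no residual_fn means plain DC motor."""
    base = dc_motor_torque(pos_error, joint_vel, **motor)
    if residual_fn is None:
        return base
    correction = residual_scale * residual_fn(pos_error, joint_vel)
    return _clamp(base + correction, -motor["effort_limit"], motor["effort_limit"])


MOTOR = dict(stiffness=35.0, damping=0.8, effort_limit=80.0,
             velocity_limit=25.0, saturation_effort=80.0)
plain = residual_actuator_torque(0.1, 0.0, **MOTOR)
corrected = residual_actuator_torque(0.1, 0.0, residual_fn=lambda e, v: 2.0, **MOTOR)
saturated = residual_actuator_torque(10.0, 12.5, **MOTOR)
```

The last call shows why the base model matters even before any residual is trained: at half the velocity limit the available torque is already halved, so a large position error cannot produce the full effort limit.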
5. Exporting a dependency-free policy
The export writes policy.onnx (exported from a pure nn.Module, with no dependency on rsl_rl/skrl/sb3) and
policy.onnx.metadata.json. The exported model bakes in the obs scale + clip, so it takes
raw observations and emits actions. metadata.json records:
- per obs group, dim and the ordered terms (each term's name, dim, scale, clip);
- the action dim / range;
- provenance (task, checkpoint).
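What "bakes in the obs scale + clip" means for the caller can be shown with a toy wrapper. This is a sketch under assumptions: `bake_preprocessing` is a hypothetical helper, the scale-then-clip order is a guess, and a plain Python function stands in for the exported network.

```python
from typing import Callable, List, Sequence

Policy = Callable[[List[float]], List[float]]


def bake_preprocessing(policy: Policy, scale: Sequence[float], clip: Sequence[float]) -> Policy:
    """Wrap a policy so that it accepts raw observations (assumed order: scale, then clip)."""

    def exported(raw_obs: List[float]) -> List[float]:
        processed = [max(-c, min(c, o * s)) for o, s, c in zip(raw_obs, scale, clip)]
        return policy(processed)

    return exported


def identity_policy(obs: List[float]) -> List[float]:
    """Stand-in network that simply echoes its (preprocessed) input."""
    return list(obs)


exported = bake_preprocessing(identity_policy, scale=[0.5, 2.0], clip=[1.0, 1.0])
correct = exported([1.0, 3.0])        # raw sensor values in, as intended
double_scaled = exported([0.5, 6.0])  # caller pre-scaled by mistake
```

The two calls feed the network different inputs for the same physical reading, which is why the scale and clip values in the metadata are there for inspection and should not be reapplied by the caller.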
6. Aligning the deployment side
What the hardware controller must do — and must not do:
- Assemble the obs vector in metadata.json term order, concatenated, as raw values (do not pre-scale; the exported model applies scale/clip internally).
- Reconstruct the joint target from the action: target = default_joint_pos + action_scale * action, using the same action_scale and default pose the env's actuators used. The exported model outputs the raw action only.
- Do not replicate the sim-only corruption on hardware: DR (§1), observation noise (§2), and encoder_bias are training perturbations. The real robot already has its own friction, bias, and sensor noise; re-adding the sim versions doubles the gap.
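The first two rules can be sketched as two small helpers. The metadata layout used here (a list of dicts with `name` and `dim` keys) is a hypothetical stand-in for the real file, and `assemble_obs` / `joint_targets` are illustrative names, not genelab functions.

```python
from typing import Dict, List, Sequence


def assemble_obs(terms: Sequence[dict], readings: Dict[str, Sequence[float]]) -> List[float]:
    """Concatenate raw readings in metadata term order, checking each term's dim."""
    obs: List[float] = []
    for term in terms:
        values = readings[term["name"]]
        if len(values) != term["dim"]:
            raise ValueError(f"{term['name']}: expected dim {term['dim']}, got {len(values)}")
        obs.extend(values)  # raw values: no scaling or clipping here
    return obs


def joint_targets(
    action: Sequence[float], default_joint_pos: Sequence[float], action_scale: float
) -> List[float]:
    """target = default_joint_pos + action_scale * action, per joint."""
    return [q0 + action_scale * a for q0, a in zip(default_joint_pos, action)]


# Hypothetical term list, as it might be read from the metadata file.
terms = [
    {"name": "joint_pos", "dim": 2},
    {"name": "joint_vel", "dim": 2},
    {"name": "imu_ang_vel", "dim": 3},
]
# Sensor readings arrive in whatever order the drivers deliver them.
readings = {
    "imu_ang_vel": [0.01, 0.02, 0.03],
    "joint_vel": [0.5, -0.5],
    "joint_pos": [0.1, -0.1],
}
obs = assemble_obs(terms, readings)
targets = joint_targets([1.0, -1.0], default_joint_pos=[0.1, -0.2], action_scale=0.25)
```

Driving the assembly from the metadata term list, instead of hard-coding the order in the controller, means a re-export with reordered terms cannot silently desynchronize the two sides.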
7. Validating before shipping
- Run genelab eval TASK_ID model_best.pt --episodes 100 for deterministic return / success-rate numbers; compare against the Reference Runs.
- Diff metadata.json obs term order against the hardware obs assembly; a mis-ordered or mis-scaled obs vector is the most common silent deployment failure.
- Sanity-check the exported model in a plain-torch process (load, feed a zero obs, confirm the action shape matches metadata.json).
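The term-order diff is easy to automate. A minimal sketch, assuming both sides can be reduced to an ordered list of (name, dim) pairs; `diff_term_order` is a hypothetical helper and the term lists below are made-up examples.

```python
from itertools import zip_longest
from typing import List, Sequence, Tuple

Term = Tuple[str, int]  # (name, dim)


def diff_term_order(metadata_terms: Sequence[Term], hardware_terms: Sequence[Term]) -> List[str]:
    """Return one message per slot where the two term lists disagree."""
    problems: List[str] = []
    for index, (expected, actual) in enumerate(zip_longest(metadata_terms, hardware_terms)):
        if expected != actual:
            problems.append(f"slot {index}: metadata has {expected}, hardware has {actual}")
    return problems


exported_terms = [("joint_pos", 12), ("joint_vel", 12), ("imu_ang_vel", 3)]
swapped_terms = [("joint_vel", 12), ("joint_pos", 12), ("imu_ang_vel", 3)]

ok = diff_term_order(exported_terms, exported_terms)
bad = diff_term_order(exported_terms, swapped_terms)
```

Note that the swapped case has the same total dimension as the correct one, so a shape check alone would not catch it. Comparing names slot by slot does.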