Eval and Export¶

GeneLab's research-reproducibility tooling lives under genelab.rl.evaluator / eval_callback / exporter and is surfaced as three CLIs that close the train → eval → export loop:

| Command | Purpose | Output |
| --- | --- | --- |
| `genelab eval TASK CKPT` | Deterministic rollout, fixed seed, N episodes | `eval.json` |
| `genelab train ... --eval-every K` | Periodic in-training eval + best-model save | `logs/.../best_model.<ext>` + `best_model_meta.json` |
| `genelab export TASK CKPT` | Backend-agnostic TorchScript / ONNX policy | `policy.{ts,onnx}` + `<file>.metadata.json` |

All three route through the same backend abstraction (InferenceSetup, defined in genelab.rl.backends.base), so they work identically against the rsl_rl, skrl, and sb3 backends.

`genelab eval`¶

Runs a vectorized deterministic rollout and writes a JSON summary:

genelab eval GeneLab-Inverted-Pendulum-v0 logs/rsl_rl/exp1/.../model_500.pt \
    --num-envs 64 --episodes 100 --seed 0 \
    --deterministic --out eval.json

Output:

{
  "task": "GeneLab-Inverted-Pendulum-v0",
  "checkpoint": "logs/.../model_500.pt",
  "num_episodes": 100,
  "metrics": {
    "return_mean": 487.3,
    "return_std": 22.1,
    "length_mean": 998.4,
    "success_rate": 0.96
  },
  "wall_clock_seconds": 18.2,
  "seed": 0,
  "deterministic": true,
  "evaluated_at": "2026-05-20T08:42:11+00:00"
}
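
Since the rollout is seeded and deterministic, a rerun on the same checkpoint would be expected to reproduce the metrics, while wall_clock_seconds and evaluated_at naturally differ. The sketch below compares two payloads on that basis. It assumes only the field names shown above; the helper `same_eval`, its tolerance, and the placeholder task name "T" are hypothetical and not part of GeneLab.

```python
import json
import math

# Fields that legitimately differ between two runs of the same evaluation.
VOLATILE_KEYS = {"wall_clock_seconds", "evaluated_at"}


def same_eval(a: dict, b: dict, rel_tol: float = 1e-6) -> bool:
    """Return True if two eval payloads agree on everything but volatile fields."""
    for key in (a.keys() | b.keys()) - VOLATILE_KEYS:
        if key == "metrics":
            continue  # metrics are compared with a tolerance below
        if a.get(key) != b.get(key):
            return False
    ma, mb = a.get("metrics", {}), b.get("metrics", {})
    if ma.keys() != mb.keys():
        return False
    for name in ma:
        x, y = ma[name], mb[name]
        if x is None or y is None:  # e.g. "success_rate": null
            if x is not y:
                return False
        elif not math.isclose(x, y, rel_tol=rel_tol):
            return False
    return True


# Two abbreviated payloads; "T" is a placeholder task name.
run_a = json.loads('{"task": "T", "seed": 0, "wall_clock_seconds": 18.2,'
                   ' "metrics": {"return_mean": 487.3, "success_rate": null}}')
run_b = json.loads('{"task": "T", "seed": 0, "wall_clock_seconds": 19.0,'
                   ' "metrics": {"return_mean": 487.3, "success_rate": null}}')
print(same_eval(run_a, run_b))  # True: only the wall-clock time differs
```

Whether bit-identical metrics are achievable in practice depends on the simulator and hardware, so the tolerance is a knob to adjust rather than a recommendation.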

Success rate¶

success_rate is computed when the task publishes a per-env bool tensor at extras["is_success"] from ManagerBasedRlEnv.step (gymnasium convention). Tasks opt in by setting self._extras["is_success"] = <(num_envs,) bool tensor> inside a termination or reward term — typically a check against the goal pose for manipulation or a "reached velocity command" check for locomotion.

Tasks that do not publish is_success get success_rate: null in the output, which parses to None in Python; downstream tools (best-model selection, reference-runs tables) should guard against it.
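
One way a downstream tool might guard against the null value is sketched below. Both helpers (`format_success`, `rank_key`) are hypothetical, and ranking by success rate before return is only an illustrative policy, not GeneLab's selection rule.

```python
from typing import Optional


def format_success(rate: Optional[float]) -> str:
    """Render success_rate for a results table; null (None) becomes 'n/a'."""
    return "n/a" if rate is None else f"{rate:.1%}"


def rank_key(metrics: dict) -> tuple:
    """Sort key: higher success_rate first when present, then higher return_mean."""
    rate = metrics.get("success_rate")
    # A missing rate sorts below any real rate instead of raising TypeError
    # when None is compared with a float.
    return (-1.0 if rate is None else rate, metrics["return_mean"])


runs = [
    {"return_mean": 487.3, "success_rate": 0.96},
    {"return_mean": 510.0, "success_rate": None},  # task without is_success
]
best = max(runs, key=rank_key)
print(format_success(best["success_rate"]))     # 96.0%
print(format_success(runs[1]["success_rate"]))  # n/a
```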

`genelab train --eval-every`¶

When --eval-every K is set, training runs in chunks of K iterations. After each chunk the latest checkpoint is loaded into the same backend and a deterministic eval is run (defaulting to 10 episodes at the same num_envs as training). When return_mean improves on the prior best, the checkpoint is copied to <log_dir>/best_model.<ext> and a sibling best_model_meta.json is updated with the eval payload.

genelab train GeneLab-Inverted-Pendulum-v0 \
    --max_iterations 1000 --num_envs 64 --seed 0 \
    --eval-every 100 --eval-episodes 16
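
The best-model rule described above can be sketched in a few lines. This is an illustration under stated assumptions, not GeneLab's implementation: `update_best` is a hypothetical helper, "improves" is read as a strict increase, the payload's "iteration" key is made up for the demo, and the checkpoints are placeholder files.

```python
import json
import shutil
import tempfile
from pathlib import Path
from typing import Optional


def update_best(log_dir: Path, ckpt: Path, payload: dict,
                best: Optional[float]) -> Optional[float]:
    """Keep ckpt as best_model.<ext> if its return_mean beats `best`.

    Returns the best return_mean seen so far (unchanged without improvement).
    """
    score = payload["metrics"]["return_mean"]
    if best is not None and score <= best:
        return best  # no strict improvement: leave the previous best in place
    # Reuse the source checkpoint's extension (.pt, .zip, ...).
    shutil.copyfile(ckpt, log_dir / f"best_model{ckpt.suffix}")
    (log_dir / "best_model_meta.json").write_text(json.dumps(payload, indent=2))
    return score


# Demo with placeholder checkpoints and made-up returns.
with tempfile.TemporaryDirectory() as tmp:
    log_dir = Path(tmp)
    best = None
    for iteration, ret in [(100, 120.0), (200, 310.5), (300, 295.0)]:
        ckpt = log_dir / f"model_{iteration}.pt"
        ckpt.write_bytes(str(iteration).encode())
        payload = {"iteration": iteration, "metrics": {"return_mean": ret}}
        best = update_best(log_dir, ckpt, payload, best)
    kept = (log_dir / "best_model.pt").read_bytes().decode()
    meta = json.loads((log_dir / "best_model_meta.json").read_text())

print(best, kept)  # 310.5 200
```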

Caveats:

- Each chunk closes and rebuilds the Genesis env via the backend's normal train lifecycle. Pick --eval-every ≥ 50 for short tasks so Genesis init time is amortized.
- For off-policy algorithms (SAC / TD3 / DDPG via skrl or sb3), reloading from a checkpoint between chunks loses the replay buffer. Sample efficiency degrades but training still converges.
- best_model.<ext> reuses the source backend's checkpoint format (.pt for rsl_rl / skrl, .zip for sb3). The metadata file records the source iteration, eval seed, episodes, and return statistics.
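
To get a feel for the amortization caveat, the chunking can be modeled with simple arithmetic. The sketch below is a rough estimate only: both helpers are hypothetical, the timings are made-up inputs, the handling of a shorter final chunk is an assumption, and eval time itself is ignored.

```python
from typing import List


def chunk_schedule(max_iterations: int, eval_every: int) -> List[int]:
    """Iterations per chunk: full chunks of eval_every plus a shorter last one."""
    full, rest = divmod(max_iterations, eval_every)
    return [eval_every] * full + ([rest] if rest else [])


def rebuild_overhead(max_iterations: int, eval_every: int,
                     init_seconds: float, seconds_per_iteration: float) -> float:
    """Fraction of wall-clock time spent rebuilding the env (eval time ignored)."""
    rebuilds = len(chunk_schedule(max_iterations, eval_every))
    init_total = rebuilds * init_seconds
    train_total = max_iterations * seconds_per_iteration
    return init_total / (init_total + train_total)


print(chunk_schedule(1000, 100))  # ten chunks of 100 iterations
print(chunk_schedule(250, 100))   # [100, 100, 50]
# Hypothetical timings: 20 s per env build, 0.5 s per training iteration.
print(f"{rebuild_overhead(1000, 100, 20.0, 0.5):.0%}")
print(f"{rebuild_overhead(1000, 10, 20.0, 0.5):.0%}")
```

With these illustrative numbers, dropping K from 100 to 10 raises the share of time spent on env rebuilds from under a third to roughly four fifths, which is the effect the caveat warns about.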

`genelab export`¶

Serializes the actor sub-network to TorchScript or ONNX with per-term obs scale / clip baked into a single forward(raw_obs) -> actions pass. Deployment environments need only torch (TorchScript) or an ONNX runtime; they do not need rsl_rl / skrl / stable_baselines3 at inference time.

# TorchScript
genelab export Genelab-Velocity-Flat-Unitree-G1-v0 logs/.../model_30000.pt \
    --format torchscript --out policy.ts

# ONNX (opset 17 by default)
genelab export Genelab-Velocity-Flat-Unitree-G1-v0 logs/.../model_30000.pt \
    --format onnx --out policy.onnx --opset 17

Note: GeneLab-Franka-Pick-And-Place-v0 is SAC+HER with a goal-conditioned Dict observation. Its exported model takes a single flat obs that is the concatenation of observation + achieved_goal + desired_goal (in that order); see the multi-group metadata below. The cartpole and G1 tasks use a single flat-tensor obs group.

The exporter writes a sibling <output>.metadata.json describing the obs schema. obs_dim is the total flat-input width; each obs_groups entry records its start offset into that flat tensor (so goal-conditioned policies can be sliced back into their sub-spaces):

{
  "task": "Genelab-Velocity-Flat-Unitree-G1-v0",
  "checkpoint": "logs/.../model_30000.pt",
  "obs_dim": 46,
  "obs_groups": {
    "policy": {
      "start": 0,
      "dim": 46,
      "terms": [
        {"name": "joint_pos", "dim": 23, "start": 0, "scale": 1.0, "clip": null},
        {"name": "joint_vel", "dim": 23, "start": 23, "scale": 0.1, "clip": [-2, 2]}
      ]
    }
  },
  "action_dim": 23,
  "action_range": [-1.0, 1.0],
  "normalization_baked": true,
  "is_recurrent": false,
  "format": "torchscript",
  "exported_at": "2026-05-20T08:42:11+00:00",
  "torch_version": "2.4.0"
}
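
The per-term scale and clip fields in this metadata describe the preprocessing that the exported model performs internally. The sketch below shows what such a pass might compute on a flat observation, which can help when cross-checking an exported policy against raw sensor values. It is an assumption-laden illustration: `preprocess` is a hypothetical helper, and applying scale before clip is assumed rather than confirmed by the exporter.

```python
def preprocess(raw_obs, terms):
    """Apply per-term scale, then clip, to a flat observation (list of floats)."""
    out = []
    for term in terms:
        bounds = term["clip"]  # None, or [low, high]
        start, stop = term["start"], term["start"] + term["dim"]
        for x in raw_obs[start:stop]:
            x = x * term["scale"]
            if bounds is not None:
                x = min(max(x, bounds[0]), bounds[1])
            out.append(x)
    return out


# Shrunken stand-in for the "terms" list (2 joints instead of 23).
terms = [
    {"name": "joint_pos", "dim": 2, "start": 0, "scale": 1.0, "clip": None},
    {"name": "joint_vel", "dim": 2, "start": 2, "scale": 0.1, "clip": [-2.0, 2.0]},
]
raw = [0.5, -0.5, 30.0, -5.0]
# joint_pos passes through; 30.0 scales to about 3.0 and is clipped to 2.0.
print(preprocess(raw, terms))
```

Deployment code does not need to run this itself, because the exported forward pass already takes raw observations.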

For a SAC+HER task obs_groups has three entries — e.g. observation (start: 0), achieved_goal (start: 35), desired_goal (start: 38) — and obs_dim is their sum.
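
Slicing a flat observation back into its sub-spaces needs only the start and dim of each group. A minimal sketch follows, assuming the metadata layout shown above; `split_obs` is a hypothetical helper, and the group widths (35, 3, 3) are inferred from the example offsets, with the desired_goal width assumed.

```python
def split_obs(flat_obs, metadata):
    """Slice a flat observation into its obs_groups using start/dim offsets."""
    if len(flat_obs) != metadata["obs_dim"]:
        raise ValueError(
            f"expected {metadata['obs_dim']} values, got {len(flat_obs)}"
        )
    return {
        name: flat_obs[group["start"]: group["start"] + group["dim"]]
        for name, group in metadata["obs_groups"].items()
    }


# Illustrative SAC+HER layout; only the fields needed for slicing are included.
metadata = {
    "obs_dim": 41,
    "obs_groups": {
        "observation": {"start": 0, "dim": 35},
        "achieved_goal": {"start": 35, "dim": 3},
        "desired_goal": {"start": 38, "dim": 3},
    },
}
flat = list(range(41))  # stand-in values so each slice is easy to recognize
groups = split_obs(flat, metadata)
print(groups["achieved_goal"])  # [35, 36, 37]
print(groups["desired_goal"])   # [38, 39, 40]
```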

Deployment-side usage¶

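import json
import torch

m = torch.jit.load("policy.ts")
m.eval()
# obs_dim comes from the sibling metadata file written by the exporter
with open("policy.ts.metadata.json") as f:
    obs_dim = json.load(f)["obs_dim"]
# Raw obs in training-side concatenation order, shape (batch, obs_dim); zeros
# stand in for real readings. The model applies scale/clip itself.
raw_obs = torch.zeros(1, obs_dim)
actions = m(raw_obs)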

For ONNX:

import onnxruntime as ort

sess = ort.InferenceSession("policy.onnx")
# raw_obs: NumPy array of shape (batch, obs_dim), in training-side order
actions = sess.run(None, {"obs": raw_obs.astype("float32")})[0]

What's exported¶

The actor is extracted via a backend-specific shim and wrapped so the call shape is uniform:

- rsl_rl: takes the actor module off the algorithm directly (alg._raw_actor, falling back to alg.actor) and uses its as_jit() export wrapper, which exposes a flat forward(obs) -> deterministic action with the learned obs normalizer baked in. Older releases that kept the actor under alg.actor_critic.actor (or only act_inference) are still supported.
- skrl: wraps agent.policy.act and returns the deterministic mean (the mean_actions key) for GaussianMixin policies.
- sb3: wraps model.policy._predict(obs, deterministic=True), which is uniform across PPO / A2C / SAC / TD3 / DDPG. For goal-conditioned SAC+HER policies the observation space is a Dict (observation / achieved_goal / desired_goal) and the SAC actor consumes all keys, so the wrapper takes the flat concatenation of those sub-spaces (in that order) and rebuilds the Dict before calling the policy. The exported model still has a single flat obs input, and the metadata's obs_groups records each sub-space's start / dim so deployers know the layout.

SB3's ONNX export uses the legacy TorchScript-based exporter (torch.onnx.export(..., dynamo=False)): the torch.export-based default (torch ≥ 2.9) can't trace SAC's Normal distribution construction.

Recurrent (RNN / LSTM / GRU) policy export¶

Setting rnn_type on an RslRlModelCfg ("lstm" or "gru") trains a recurrent policy — it is the single knob, automatically selecting rsl_rl's RNNModel:

RslRlModelCfg(rnn_type="lstm", rnn_hidden_dim=256, rnn_num_layers=1)

genelab export then takes the recurrent path automatically (no extra flags). In the metadata, "is_recurrent" becomes true and a "recurrent" block is added, recording rnn_type, rnn_num_layers, rnn_hidden_dim, the hidden-state shape, and the ONNX port names.

The two formats expose the hidden state differently:

TorchScript keeps the hidden state inside the module, so the call shape stays the single-input MLP form forward(obs) -> actions. The module also exposes a reset() method — call it at every episode boundary to zero the hidden state. The serialized buffer is fixed at batch size 1 (one deployed robot).

import torch

m = torch.jit.load("policy.ts")
m.eval()
m.reset()                       # at the start of each episode
actions = m(raw_obs)            # raw obs, shape (1, obs_dim); hidden state carried internally

ONNX exposes the hidden state explicitly. Inputs are obs, h_in (and c_in for LSTM); outputs are actions, h_out (and c_out for LSTM), each shaped (num_layers, batch, hidden_dim). Thread the returned state back in each step and zero it on episode boundaries:

import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("policy.onnx")
# num_layers and hidden_dim come from the "recurrent" block of the metadata
h = np.zeros((num_layers, 1, hidden_dim), np.float32)
c = np.zeros((num_layers, 1, hidden_dim), np.float32)   # LSTM only
# raw_obs: float32 array of shape (1, obs_dim)
actions, h, c = sess.run(None, {"obs": raw_obs, "h_in": h, "c_in": c})
# GRU: actions, h = sess.run(None, {"obs": raw_obs, "h_in": h})

Play and eval rollouts reset the hidden state automatically for environments whose episode just ended, so stale state never carries across an episode boundary and recurrent eval metrics are not skewed by it.
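
On the deployment side the same bookkeeping has to be done by hand for ONNX. The sketch below shows the threading pattern with plain Python lists and a stub in place of the real session, so it runs without onnxruntime. Everything here is illustrative: `zero_state`, `rollout`, and `counting_step` are hypothetical, the metadata keys are assumed to match the names listed above, and in practice the state would be float32 NumPy arrays with `step` wrapping `sess.run` and renaming h_out / c_out to h_in / c_in.

```python
def zero_state(recurrent: dict, batch: int = 1) -> dict:
    """Zero state shaped (num_layers, batch, hidden_dim), keyed by ONNX input name."""
    layers, hidden = recurrent["rnn_num_layers"], recurrent["rnn_hidden_dim"]

    def zeros():
        return [[[0.0] * hidden for _ in range(batch)] for _ in range(layers)]

    state = {"h_in": zeros()}
    if recurrent["rnn_type"] == "lstm":
        state["c_in"] = zeros()  # LSTM carries a cell state as well
    return state


def rollout(step, recurrent, observations, dones):
    """Thread the returned state back in each step; zero it after a finished episode."""
    state = zero_state(recurrent)
    actions = []
    for obs, done in zip(observations, dones):
        action, state = step(obs, state)
        actions.append(action)
        if done:
            state = zero_state(recurrent)
    return actions


def counting_step(obs, state):
    """Stub policy: the hidden state counts steps taken since the last reset."""
    h = state["h_in"][0][0][0] + 1.0
    new_state = {
        key: [[[h] * len(row) for row in layer] for layer in value]
        for key, value in state.items()
    }
    return h, new_state


recurrent = {"rnn_type": "gru", "rnn_num_layers": 1, "rnn_hidden_dim": 4}
actions = rollout(counting_step, recurrent, [None] * 4, [False, False, True, False])
print(actions)  # [1.0, 2.0, 3.0, 1.0]: the counter restarts after the done flag
```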

Limitations¶

The exported model does not apply observation noise from ObservationTermCfg.noise; noise is part of training only.