Runtime: play and train¶

play runs a registered task. train runs a registered task through a supported runner when the task provides an agent config. The post-training runtime subcommands — eval, export, and benchmark — take the same checkpoint produced by train.

Play¶

genelab play TASK_ID --steps 128            # headless: 128-step smoke rollout
genelab play TASK_ID --vis                   # viewer: run until you close the window
genelab play TASK_ID --vis --max-steps 500   # viewer: stop after 500 steps
genelab play TASK_ID --agent random --steps 128

Policy sources:

Agent	Behavior
`zero`	Zero actions. Default when no checkpoint is given.
`random`	Uniform random actions in `[-1, 1]`.
`trained`	Load a checkpoint and use the runner inference policy.

The policy options (--agent, --checkpoint, --num-envs, --prof*) apply only to RL tasks, i.e. those whose play env config is a ManagerBasedRlEnvCfg. Non-RL scene-playback demos (e.g. GeneLab-Rubiks-Play-v0, GeneLab-Wuji-Hand-Playback-v0), whose config subclasses the base ManagerBasedEnvCfg, run their own built-in playback; if you pass policy options to them, genelab prints a warning and ignores the options. --steps / --vis / --headless / --gpu / --dt and dotted config overrides apply to both kinds of task.

Checkpoint replay:

genelab play TASK_ID \
  --checkpoint logs/rsl_rl/<experiment>/<run>/model_300.pt

Trained playback on a headless server

Trainable tasks enable the Genesis viewer in their play env (vis=play), so play --agent trained opens a window by default and, on a machine without a display, aborts with the error "No display detected". Pass --headless (mutually exclusive with --vis) to force env.simulation.vis=false:

genelab play TASK_ID --agent trained \
  --checkpoint <ckpt> --headless

Headless playback is bounded: with no window to close, it stops after env.simulation.steps steps (set with --steps, default 240) instead of running forever. Pass --max-steps N to override that length.

Playback length: `--steps` vs `--max-steps`¶

The two flags differ by design, and each behaves the same way in RL playback and in the non-RL scene-playback / showcase runners:

	`--steps N`	`--max-steps N`
What it is	Soft config (`env.simulation.steps`)	Hard, genelab-enforced cap
Lives on	The env config (editable in code)	The runner (not stored on the cfg)
With a viewer (`--vis`)	Ignored — runs until you close the window	Stops after `N` steps even with the window open
Headless	Caps the rollout at `N`	Caps the rollout at `N` (wins over `--steps`)
Default	240	unset (soft config decides)

In short: --steps is an advisory length you (or your code) can change or have ignored; --max-steps is a hard ceiling genelab always enforces. To bound a windowed run, reach for --max-steps.
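The rules in the table can be condensed into a small shell function. This is only a sketch of the documented behavior, not genelab's implementation; the name effective_cap and its argument order are hypothetical, chosen for illustration.

```shell
# Sketch of the --steps / --max-steps rules from the table above.
# Hypothetical helper, not part of genelab.
# Usage: effective_cap VIS STEPS MAX_STEPS
#   VIS is "true" or "false"; pass "" for a flag that was not given.
effective_cap() {
  vis=$1
  steps=${2:-240}      # --steps defaults to 240
  max_steps=${3:-}     # --max-steps is unset by default
  if [ -n "$max_steps" ]; then
    echo "$max_steps"  # hard cap: applies with or without a viewer
  elif [ "$vis" = "true" ]; then
    echo "unbounded"   # viewer open and no hard cap: run until the window closes
  else
    echo "$steps"      # headless: the soft length bounds the rollout
  fi
}

effective_cap false "" ""    # prints 240
effective_cap true 128 ""    # prints unbounded
effective_cap true 128 500   # prints 500
effective_cap false 128 64   # prints 64
```

The second call shows why --steps alone cannot bound a windowed run: without --max-steps, the viewer branch ignores the soft length.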

Shortcut flags¶

Both play and train rewrite the following shortcuts into env.simulation.* overrides:

Shortcut	Override
`-v`, `--vis`	`env.simulation.vis=true`
`--headless`	`env.simulation.vis=false` (mutually exclusive with `--vis`)
`--gpu`	`env.simulation.gpu=true`
`--steps N`	play: soft length `env.simulation.steps=N` (ignored with `--vis`; see above); train: alias for `--max_iterations N`
`--dt SECONDS`	`env.simulation.dt=SECONDS`
`--a.b.c VALUE`	Any dotted cfg path

--max-steps N is not an env override: it is a runner flag (the hard playback cap) that play accepts and train does not, so it is not in this table. See Playback length.
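The rewriting described in the shortcut table can be sketched for play as follows. The function expand_shortcuts is a hypothetical illustration of the mapping, not genelab's parser, and it assumes every value-taking flag is followed by its value.

```shell
# Sketch: map play shortcuts to the env.simulation.* overrides listed above.
# Hypothetical helper, not part of genelab; unrecognized tokens are skipped.
expand_shortcuts() {
  while [ $# -gt 0 ]; do
    case $1 in
      -v|--vis)   echo "env.simulation.vis=true" ;;
      --headless) echo "env.simulation.vis=false" ;;
      --gpu)      echo "env.simulation.gpu=true" ;;
      --steps)    echo "env.simulation.steps=$2"; shift ;;
      --dt)       echo "env.simulation.dt=$2"; shift ;;
      --*.*)      echo "${1#--}=$2"; shift ;;   # any dotted cfg path
    esac
    shift
  done
}

expand_shortcuts --headless --gpu --steps 128 --dt 0.005 \
  --env.rewards_cfg.action_rate.weight -0.01
```

The sketch covers play only; under train, --steps is an alias for --max_iterations rather than an env override.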

Train¶

genelab train TASK_ID --num_envs 4096 --max_iterations 300

For distributed training:

genelab train TASK_ID --gpus 4 --num_envs 4096

--num_envs is the total across all ranks and must be evenly divisible by --gpus. Use --num_envs_per_gpu for per-rank semantics (mutually exclusive with --num_envs). Multi-GPU is RSL-RL only; the initial invocation automatically relaunches itself under torchrun.
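The divisibility rule amounts to simple integer arithmetic. The helper below is a hypothetical sketch of that check (envs_per_rank is not a genelab command), useful for sanity-checking a launch before submitting it.

```shell
# Sketch: derive the per-rank env count from the --num_envs total.
# Hypothetical helper, not part of genelab.
# Usage: envs_per_rank NUM_ENVS GPUS
envs_per_rank() {
  num_envs=$1
  gpus=$2
  if [ $((num_envs % gpus)) -ne 0 ]; then
    echo "num_envs=$num_envs is not divisible by gpus=$gpus" >&2
    return 1
  fi
  echo $((num_envs / gpus))
}

envs_per_rank 4096 4   # prints 1024
```

With the example above, --gpus 4 --num_envs 4096 gives each rank 1024 envs, which is what --num_envs_per_gpu 1024 would express directly.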

In-training eval¶

Pass --eval_every K to run a deterministic rollout every K iterations. On improvement, the runner writes best_model.<ext> into --log_dir:

genelab train TASK_ID --eval_every 50 --eval_episodes 20

Option	Meaning (default)
`--eval_every K`	Evaluate every K iterations.
`--eval_episodes N`	Episodes per evaluation (10).
`--eval_num_envs N`	Parallel envs during eval (matches training).
`--eval_seed N`	RNG seed for the eval rollout (0).

Multi-seed train¶

--seeds 1,2,3 fans out the current train invocation into one independent subprocess per seed:

genelab train TASK_ID --seeds 1,2,3,4 --parallel 2 --num_envs 4096

--parallel N caps concurrency (default 1 — sequential).
Each child is invoked with --seed S and --log_dir <parent>/seed_<S>.
Without an explicit --log_dir, the parent is logs/multi-seed/<task_id>/<YYYY-MM-DD_HH-MM-SS>/.
If any seed fails, the command exits non-zero.
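The per-seed arguments follow mechanically from the parent directory and the seed list. The sketch below only prints the arguments each child would receive under the rules above; child_args is a hypothetical helper and the timestamp in the path is a made-up example.

```shell
# Sketch: list the --seed / --log_dir pair handed to each child run.
# Hypothetical helper, not part of genelab.
# Usage: child_args PARENT_LOG_DIR SEEDS   (SEEDS is comma-separated)
child_args() {
  parent=$1
  for seed in $(printf '%s' "$2" | tr ',' ' '); do
    echo "--seed $seed --log_dir $parent/seed_$seed"
  done
}

# The parent path mirrors the documented default layout; the date is illustrative.
child_args logs/multi-seed/TASK_ID/2025-01-01_00-00-00 1,2,3
```

Each printed line corresponds to one subprocess, so four seeds with --parallel 2 would run as two batches of two.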

RL backends¶

The training backend is chosen automatically from the type of the task's agent config — no flag required:

Agent config	Backend	Algorithms
`RslRlOnPolicyRunnerCfg`	`rsl_rl` (default)	PPO
`SkrlAgentCfg`	`skrl`	PPO, A2C, SAC, TD3, DDPG
`Sb3AgentCfg`	`sb3`	PPO, A2C, SAC, TD3, DDPG (+ HER)

The skrl and Stable-Baselines3 backends are optional — install them with the skrl / sb3 extras (uv sync already includes both in this checkout; downstream users run pip install genelab[skrl] or genelab[sb3]). Pick the algorithm via SkrlAgentCfg.algorithm / Sb3AgentCfg.algorithm.

Both skrl and SB3 train in environment timesteps rather than learning iterations, so --max_iterations N sets the timestep budget for those tasks. Multi-GPU (--gpus) is supported by the RSL-RL backend only.

SB3 trains through stable_baselines3.common.vec_env.VecEnv (numpy, CPU), so the SB3 wrapper copies observations to host memory every step — a known cost of pairing SB3 with GeneLab's GPU-vectorized env. Hindsight Experience Replay is available for the off-policy algorithms via Sb3AgentCfg.her, which exposes a goal-conditioned observation and trains through SB3's HerReplayBuffer.

# An Sb3AgentCfg routes through the SB3 backend; the Franka pick-and-place task
# is SAC + HER + lift bonus + FSM demo prefill (see its example page).
GENELAB_SB3_DEMO_PATH=/tmp/franka_pp_demos.npz \
  genelab train GeneLab-Franka-Pick-And-Place-v0 \
  --gpu --num_envs 32 --max_iterations 2000000

Post-training subcommands¶

eval, export, and benchmark all take a registered task plus a checkpoint and reuse the task's play env config.

Eval¶

Deterministic rollout that writes eval.json (return_mean, length_mean, and success_rate if the task publishes extras['is_success']):

genelab eval TASK_ID logs/.../model_300.pt \
  --num-envs 64 --episodes 100 --out eval.json

--deterministic / --stochastic select the policy mode; --max-steps caps the rollout.

Export¶

Export the policy as TorchScript or ONNX (per-term scale/clip baked into the model):

genelab export TASK_ID logs/.../model_300.pt --format onnx --out policy.onnx

A sibling <OUTPUT>.metadata.json records the observation schema.

Benchmark¶

Batch eval driven by a JSON suite, aggregated into one report:

genelab benchmark --suite suite.json --out report.json
genelab benchmark --suite suite.json --reference baseline.json --tolerance 0.1

suite.json is [{"task": ..., "checkpoint": ..., "episodes": ..., "seed": ..., "num_envs": ...}, ...]. With --reference, the command compares return_mean against the baseline and exits non-zero when any task drops more than --tolerance — usable directly as a CI regression gate.
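A minimal suite file might look like the one written below. The keys come from the format described above; the values are placeholders (TASK_ID and the elided checkpoint path must be replaced with real ones), and the value types shown are an assumption rather than a documented schema.

```shell
# Sketch: write a two-entry suite file. All values are placeholders and
# the integer/string types are assumed, not taken from a documented schema.
cat > suite.json <<'EOF'
[
  {"task": "TASK_ID", "checkpoint": "logs/.../model_300.pt", "episodes": 100, "seed": 0, "num_envs": 64},
  {"task": "TASK_ID", "checkpoint": "logs/.../model_300.pt", "episodes": 100, "seed": 1, "num_envs": 64}
]
EOF

# Count the entries as a quick sanity check.
grep -c '"task"' suite.json   # prints 2

# In CI, the benchmark exit status can gate the job directly, e.g.:
#   genelab benchmark --suite suite.json --reference baseline.json --tolerance 0.1
```

Because the command exits non-zero on a regression, no extra parsing of report.json is needed for a pass/fail gate.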

Config overrides¶

Any unknown option after the task id is treated as a dotted config override:

genelab play TASK_ID \
  --env.simulation.dt 0.005 \
  --env.rewards_cfg.action_rate.weight -0.01