Reference Runs¶

The reproducibility ground truth for GeneLab's bundled tasks. It lists, per registered task and per seed, the converged return, the convergence step count, and the wall-clock budget — the numbers to expect from a clone → train → eval against the same configuration.

Reference tasks¶

The tasks tracked here cover GeneLab's bundled locomotion and manipulation lines:

Task ID	Backend (default agent)	Budget	Notes
`GeneLab-Inverted-Pendulum-v0`	rsl_rl PPO	150 iter × 4096 envs	Tiny cartpole; sanity smoke target.
`GeneLab-Double-Inverted-Pendulum-v0`	rsl_rl PPO	300 iter × 4096 envs	Harder cartpole.
`Genelab-Velocity-Flat-Unitree-G1-v0`	rsl_rl PPO	30k iter × 4096 envs	Unitree G1 velocity tracking on flat ground.
`Genelab-Velocity-Rough-Unitree-G1-v0`	rsl_rl PPO	6k iter × 4096 envs	Unitree G1 velocity tracking on a 10-level mixed-terrain curriculum.
`Genelab-Tracking-Flat-Unitree-G1-v0`	rsl_rl PPO	30k iter × 4096 envs	Unitree G1 motion-tracking on flat ground.
`Genelab-Velocity-Flat-Unitree-Go1-v0`	rsl_rl PPO	3k iter × 4096 envs	Unitree Go1 quadruped velocity tracking on flat ground; deployable proprioception-only actor (no base-linear-velocity sensor).
`Genelab-Velocity-Rough-Unitree-Go1-v0`	rsl_rl PPO	(WIP)	Unitree Go1 on the 10-level mixed-terrain curriculum. Not yet a reference — from-scratch 3k iters stalls in a stand-still optimum (~1–2 % of commanded speed); needs a larger budget (~6k) + an easier curriculum bootstrap.
`Genelab-Velocity-Flat-Unitree-Go2W-v0`	rsl_rl PPO	6k+6k+4k iter × 4096 envs (2-stage)	Unitree Go2-W wheeled quadruped, hybrid wheel-leg omnidirectional velocity tracking: crab-walk stage 1 (wheels locked) → rolling stage 2 with mirror-symmetry augmentation + L1 tracking terms. sim2sim-hardened (5-frame stacking, startup DR, action noise & latency).
`GeneLab-Franka-Pick-And-Place-v0`	sb3 SAC + HER	2M timesteps × 64 envs	Goal-conditioned manipulation; needs offline demo prefill (see protocol below).

Reproduction protocol¶

Common path (cartpole and G1 tasks)¶

Cartpole + G1 tasks are rsl_rl PPO; their reference runs use the multi-seed CLI:

# 1. Train three seeds (parallel=3 only for cartpole-sized tasks; G1 needs
#    parallel=1 on a single GPU to avoid OOM).
genelab train <TASK> \
    --seeds 1,2,3 --parallel <P> \
    --log_dir logs/reference/<TASK>/<DATE>

# 2. Deterministic eval against each seed's final checkpoint.
for s in 1 2 3; do
  genelab eval <TASK> \
    "logs/reference/<TASK>/<DATE>/seed_${s}/model_final.pt" \
    --num-envs 64 --episodes 100 --seed 0 \
    --out "logs/reference/<TASK>/<DATE>/seed_${s}/eval.json"
done

eval.json files are the source of truth for the table numbers.
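A minimal sketch of how the per-seed eval.json files could be gathered into table rows. The field names (return_mean, return_std, length_mean, success_rate) appear in this document, but the flat JSON layout and the helper names `collect_eval` and `format_rows` are assumptions, not part of GeneLab.

```python
import json
from pathlib import Path

FIELDS = ("return_mean", "return_std", "length_mean", "success_rate")


def collect_eval(run_dir, seeds=(1, 2, 3)):
    """Read seed_<s>/eval.json under run_dir and return one dict per seed.

    Assumes each file is a flat JSON object; missing keys become None.
    """
    rows = []
    for seed in seeds:
        path = Path(run_dir) / f"seed_{seed}" / "eval.json"
        with path.open() as fh:
            data = json.load(fh)
        row = {"seed": seed}
        row.update({name: data.get(name) for name in FIELDS})
        rows.append(row)
    return rows


def format_rows(rows):
    """Render rows as tab-separated lines; None is printed as 'null'."""
    header = ("seed",) + FIELDS
    lines = ["\t".join(header)]
    for row in rows:
        cells = ("null" if row[k] is None else str(row[k]) for k in header)
        lines.append("\t".join(cells))
    return "\n".join(lines)
```

Calling `print(format_rows(collect_eval("logs/reference/<TASK>/<DATE>")))` with the placeholders filled in would print one line per seed.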

Franka SAC+HER path¶

GeneLab-Franka-Pick-And-Place-v0 is goal-conditioned SAC+HER and needs an offline demo prefill before training, otherwise the cold-start replay buffer never sees a successful trajectory:

# 1. Collect demos via the scripted FSM (one-shot, seed-independent).
#    --num-envs must match the task's train num_envs (currently 64); the
#    prefill loader asserts the shapes match.
python -m genelab_franka.collect_demos \
    --num-envs 64 --steps 1000 \
    --out logs/reference/franka-pp/demos.npz

# 2. Train three seeds — each child reads the demo file via
#    GENELAB_SB3_DEMO_PATH (or set agent.demo_path in cfg).
GENELAB_SB3_DEMO_PATH=logs/reference/franka-pp/demos.npz \
  genelab train GeneLab-Franka-Pick-And-Place-v0 \
    --seeds 1,2,3 --parallel 1 \
    --log_dir logs/reference/franka-pp/<DATE>

# 3. Eval each seed's saved model.zip (SB3's native format).
for s in 1 2 3; do
  genelab eval GeneLab-Franka-Pick-And-Place-v0 \
    "logs/reference/franka-pp/<DATE>/seed_${s}/model.zip" \
    --num-envs 64 --episodes 100 --seed 0 \
    --out "logs/reference/franka-pp/<DATE>/seed_${s}/eval.json"
done

The Franka task cannot currently be exported via genelab export. Export supports flat-tensor observations only, while SAC+HER uses a goal-conditioned Dict observation.

Hardware¶

One CUDA GPU (≥ 12 GB VRAM) for training. CPU-only eval works for the deterministic rollout step but is much slower than GPU-vectorized eval.

Run the sim on the GPU backend

SimulationCfg.gpu defaults to False (CPU backend). With the CPU backend, physics steps on the CPU while the policy and its tensors sit on the GPU, which leaves the GPU idle and makes training ~50–100× slower (contact-heavy tasks like G1 go from a few seconds to hundreds of seconds per iteration). Bundled trainable tasks set gpu=True; custom tasks must do the same. If nvidia-smi shows the training GPU near 0 % utilization during steps, this is almost certainly why.
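As a sketch of that rule, a custom task config can opt into the GPU backend and fail fast if it does not. The `SimulationCfg` below is a stand-in dataclass that models only the `gpu` flag; the real class, its import path, and the `sim` attribute name on the task config are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SimulationCfg:
    """Stand-in for GeneLab's SimulationCfg; only the `gpu` flag is modelled."""
    gpu: bool = False  # documented default: CPU backend


@dataclass
class MyTaskCfg:
    """Hypothetical custom task config that opts into the GPU backend."""
    sim: SimulationCfg = field(default_factory=lambda: SimulationCfg(gpu=True))


def check_gpu_backend(task_cfg):
    """Raise before training if physics would step on the CPU backend."""
    if not task_cfg.sim.gpu:
        raise ValueError("sim.gpu is False: physics would run on the CPU backend")
    return True
```

Running such a check before a long training job turns a silent slowdown into an immediate error.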

Hopper (H100/H200) and multi-GPU caveats

On Hopper (SM 90), set QD_GRAPH=0 (Genesis ships no SM 90 graph_do_while fatbin); this disables CUDA-graph batching and badly slows contact-heavy sims. Prefer a non-Hopper GPU (Ada / Ampere) for locomotion reproduction.
Multi-GPU (genelab train --gpus N) gives little speedup for G1 (per-step cost + PCIe all-reduce dominate). For a multi-seed sweep, run one seed per GPU rather than one seed across many GPUs.
RL training at 4096 envs is largely CPU-bound and wants the whole host; running many such trainings concurrently on one box oversubscribes the CPU and slows them super-linearly. Wall-clock suffers, but contention changes only timing, not what is computed, so reproduced reward numbers are unaffected.
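One seed per GPU can be arranged by pinning each training process to a device with `CUDA_VISIBLE_DEVICES`. The sketch below only builds the per-seed environment and argument list and launches nothing. It assumes that `genelab train` accepts a single-element `--seeds` list and writes into a per-seed subdirectory of `--log_dir`, which this document does not confirm; `seed_per_gpu_plan` is a hypothetical helper.

```python
import os


def seed_per_gpu_plan(task, seeds, gpu_ids, log_dir, hopper=False):
    """Return (env, argv) pairs, one training process per seed and GPU.

    Nothing is started here; each pair can be handed to subprocess.Popen.
    """
    if len(seeds) > len(gpu_ids):
        raise ValueError("need at least one GPU per seed")
    plan = []
    for seed, gpu in zip(seeds, gpu_ids):
        # Each process sees exactly one device.
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
        if hopper:
            env["QD_GRAPH"] = "0"  # required on SM 90 per the caveat above
        argv = ["genelab", "train", task,
                "--seeds", str(seed), "--parallel", "1",
                "--log_dir", log_dir]
        plan.append((env, argv))
    return plan
```

A caller could then run `subprocess.Popen(argv, env=env)` for each pair and wait on all of them.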

Reference numbers¶

The tables below are the v1.0 Genesis numbers. The v0.4.7 numbers are preserved in a "Previous: v0.4.7" admonition next to each task so a re-baseline diff is one scroll away.

Hardware the v1.0 numbers were collected on

Cartpole IP / DIP, Franka — NVIDIA H200 (141 GB HBM3), driver 570.211.01, CUDA 12.8, QD_GRAPH=0 (Hopper requires it — no graph_do_while fatbin for SM 90). Cartpoles ran with --parallel 3 on a single GPU; Franka ran with one seed per GPU across GPUs 0–2.
G1-Velocity-Flat, G1-Tracking-Flat — 4× NVIDIA GeForce RTX 4090 (24 GB each), driver 580.159.03, CUDA 12.8. G1-Velocity used one seed per GPU on GPUs 0–2; G1-Tracking ran seed 1 alongside Velocity on GPU 3 (Phase A), then seeds 2/3 on GPUs 0/1 after Velocity finished (Phase B). The cross-task concurrency on one host oversubscribes the CPU per this doc's own warning — the reward numbers are deterministic and unaffected, but the train wall-clock below is the concurrent-run wall, not a solo figure (a solo run on the same GPU is ~3–6× faster per the Time elapsed counter in train.log).
Go1-Velocity-Flat, Go2W-Velocity-Flat — single NVIDIA GeForce RTX 5060 Ti (16 GB), local workstation. Single-seed runs (seed 42, the task's cfg default) rather than the 3-seed cluster sweep used for the G1 tasks — smoke-grade references, not variance estimates.

`GeneLab-Inverted-Pendulum-v0`¶

Seed	Final `return_mean`	`return_std`	Convergence iter	Train wall-clock	Eval wall-clock
1	39.977	0.002	150	~2.5 min	12.6 s
2	39.994	0.001	150	~2.5 min	12.4 s
3	39.986	0.001	150	~2.5 min	12.6 s

Eval length_mean = 1000.0 for all seeds (episode hits the time-limit cap without falling), so the policy is solved at the budget cap. success_rate is null (task does not publish extras["is_success"]).

Previous: v0.4.7

Seed	Final `return_mean`	`return_std`	Convergence iter	Train wall-clock	Eval wall-clock
1	39.944	0.026	150	~21 min	10.3 s
2	39.978	0.002	150	~20 min	10.1 s
3	39.991	0.001	150	~19 min	10.1 s

`GeneLab-Double-Inverted-Pendulum-v0`¶

Seed	Final `return_mean`	`return_std`	Convergence iter	Train wall-clock	Eval wall-clock
1	59.933	0.020	300	~3.5 min	16.3 s
2	59.914	0.174	300	~3.5 min	16.4 s
3	59.897	0.023	300	~3.5 min	16.4 s

Eval length_mean = 1200.0 for all seeds. success_rate is null (same reason as IP).

Previous: v0.4.7

Seed	Final `return_mean`	`return_std`	Convergence iter	Train wall-clock	Eval wall-clock
1	59.980	0.007	300	~85 min	12.2 s
2	59.986	0.003	300	~88 min	14.2 s
3	59.987	0.002	300	~85 min	12.6 s

`Genelab-Velocity-Flat-Unitree-G1-v0`¶

The actor observation omits base_lin_vel (same change as rough: removed from the actor, kept in the critic; see sim2real). The table below is for that config (commit d653aa9).

Seed	Final `return_mean`	`return_std`	Convergence iter	Train wall-clock	Eval wall-clock
1	112.02	4.85	30 000	~28 h	165.2 s
2	91.996	3.82	30 000	~28 h	162.1 s
3	111.85	5.10	30 000	~28 h	163.6 s

Eval length_mean = 1000.0 for all seeds (play_env episode_length_s = 20 s × 50 Hz; the policy never falls). success_rate is null. Seeds 1/3 match the base_lin_vel v1.0 baseline (112.04 / 113.16); seed 2 lands at 92 — a lower-return but stable policy (length 1000, std 3.8), within the known flat seed spread (the v0.4.7 baseline had seeds at 92–93), not a regression from the obs change. This sweep ran concurrently (staggered) with rough on 4× RTX 4090, so the train wall-clock is the concurrent figure (~28 h order, not a dedicated single-job run).

Eval requires the auto_reset fix (commit d56158c)

genelab eval builds the play_env (auto_reset=False); the evaluator relies on the env auto-resetting terminated sub-envs to collect episodes, so without the fix eval collapses to garbage (degenerate ~1-step episodes). The numbers above are post-fix.

Previous: with base_lin_vel (v1.0 baseline)

Actor includes base_lin_vel.

Seed	Final `return_mean`	`return_std`	Convergence iter	Train wall-clock	Eval wall-clock
1	112.038	4.816	30 000	~28 h	163.5 s
2	112.871	4.918	30 000	~28 h	160.3 s
3	113.163	4.850	30 000	~28 h	160.9 s

Previous: v0.4.7

Seed	Final `return_mean`	`return_std`	Convergence iter	Train wall-clock	Eval wall-clock
1	112.419	4.647	30 000	~18.7 h	143.0 s
2	93.417	3.921	30 000	~20.6 h	161.0 s
3	92.028	4.162	30 000	~19.8 h	156.9 s

`Genelab-Velocity-Rough-Unitree-G1-v0`¶

This task's actor observation omits base_lin_vel (removed from the actor, kept in the critic as a privileged signal; a real G1 has no direct base-linear-velocity sensor — see sim2real). The table below is for that config (commit d653aa9).

Seed	Final `return_mean`	`return_std`	Convergence iter	Train wall-clock	Eval wall-clock
1	82.96	29.20	6 000	~6.9 h	516 s*
2	85.51	21.62	6 000	~6.9 h	198 s
3	85.23	25.40	6 000	~6.9 h	201 s

Eval length_mean per seed = 912 / 951 / 902 (of 1000 max; play_env episode_length_s = 20 s × 50 Hz) — the policy walks ~90–95 % of full episodes on the mixed rough terrain. success_rate is null. Eval terrain seed = 0 (deterministic, though Genesis GPU float non-determinism still leaves ~±2 run-to-run variance on return_mean). The curriculum self-balances at terrain level ~4.5; training runs to convergence at 6k with no de-learning (action std holds at its 0.3 floor throughout). Removing base_lin_vel is performance-neutral — on par with the base_lin_vel v1.0 baseline (admonition below). Hardware: 4× NVIDIA GeForce RTX 4090 (one seed per GPU), run concurrently with the flat sweep; * seed_1's eval shared a GPU with a concurrent flat training so it ran long, seed_2/3 ~200 s on a free GPU.

Eval requires the auto_reset fix (commit d56158c)

genelab eval builds the play_env (auto_reset=False, for teleop), but the evaluator relies on the env auto-resetting terminated sub-envs to collect episodes. Without the fix, rough eval collapses to garbage (return ≈ -2.6, length ≈ 17 — degenerate ~1-step episodes). The numbers above are post-fix.
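The failure mode can be reproduced with a toy model. This is not GeneLab's evaluator: it is a single counter-based sub-env, written only to show why an episode collector that expects auto-reset records a stream of one-step episodes when the env stays in its terminal state.

```python
def collect_episode_lengths(horizon, total_steps, auto_reset):
    """Step one toy sub-env and record the length of every finished episode.

    The toy env reports done once its step counter reaches `horizon`. With
    auto_reset the counter returns to 0 on termination; without it the env
    stays terminal and reports done again on every later step.
    """
    lengths = []
    counter = 0      # env-internal step counter
    episode_len = 0  # collector-side length of the current episode
    for _ in range(total_steps):
        counter += 1
        episode_len += 1
        if counter >= horizon:
            lengths.append(episode_len)
            episode_len = 0
            if auto_reset:
                counter = 0
    return lengths


print(collect_episode_lengths(5, 15, auto_reset=True))
print(collect_episode_lengths(5, 15, auto_reset=False))
```

With auto-reset the collector sees three full-length episodes; without it, one real episode is followed by a one-step "episode" on every remaining step.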

Previous: with base_lin_vel (v1.0 baseline)

Actor includes base_lin_vel; hardware H200 (QD_GRAPH=0, ~2× per-step cost).

Seed	Final `return_mean`	`return_std`	Convergence iter	Train wall-clock	Eval wall-clock
1	83.95	23.86	6 000	~4.8 h	145 s
2	87.17	12.47	6 000	~4.9 h	145 s
3	84.41	17.67	6 000	~4.9 h	145 s

length_mean = 899 / 966 / 935.

Maintainer sweep protocol: genelab train Genelab-Velocity-Rough-Unitree-G1-v0 --seeds 1,2,3 --parallel 1 --log_dir logs/reference/Genelab-Velocity-Rough-Unitree-G1-v0/<DATE>, one seed per RTX 4090 GPU (4090 cluster). Eval with genelab eval ... --num-envs 64 --episodes 100 --seed 0 --out <seed_dir>/eval.json and QT_QPA_PLATFORM=offscreen (headless Qt — recording extras crash without this even after eval forces vis=False).

`Genelab-Tracking-Flat-Unitree-G1-v0`¶

Seed	Final `return_mean`	`return_std`	Convergence iter	Train wall-clock	Eval wall-clock
1	138.444	0.006	30 000	~28 h	227.3 s
2	137.980	0.008	30 000	~29 h	301.6 s
3	138.060	0.004	30 000	~29 h	236.9 s

Eval length_mean = 1500.0. The tracking play_env normally sets episode_length_s = 1e9 for infinite viewer playback; genelab eval clamps that to 30 s, so 30 s × 50 Hz = 1500 steps per episode, all hitting the cap without termination. return_std is very tight for every seed: the converged policy stays on the motion clip for the full 30 s window. success_rate is null.
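The step-count arithmetic can be written down as a small helper. The 30 s clamp and the 50 Hz control rate come from the paragraph above; applying the same clamp to every task, and the function name, are assumptions.

```python
def eval_episode_steps(episode_length_s, control_hz=50.0, clamp_s=30.0):
    """Steps per eval episode: min(episode_length_s, clamp_s) * control_hz."""
    return round(min(episode_length_s, clamp_s) * control_hz)


print(eval_episode_steps(1e9))   # tracking play_env, clamped to 30 s
print(eval_episode_steps(20.0))  # velocity play_env, below the clamp
```

The two calls give 1500 and 1000 steps, matching the length_mean values reported for the tracking and velocity tasks.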

Previous: v0.4.7

Seed	Final `return_mean`	`return_std`	Convergence iter	Train wall-clock	Eval wall-clock
1	137.800	0.005	30 000	~20.8 h	212.8 s
2	138.047	0.004	30 000	~20.6 h	216.8 s
3	138.122	0.007	30 000	~20.9 h	216.0 s

`Genelab-Velocity-Flat-Unitree-Go1-v0`¶

Seed	Final `return_mean`	`return_std`	Convergence iter	Train wall-clock	Eval wall-clock
42	56.533	2.264	3 000	~55 min	99.4 s

Single seed (42, the task's cfg default) on one RTX 5060 Ti — see the hardware note; this is a smoke-grade reference, not the 3-seed sweep, so there is no cross-seed variance figure. Eval length_mean = 1000.0 (play_env episode_length_s = 20 s × 50 Hz — full episodes, no falls). success_rate is null. The actor is proprioception-only (no base linear velocity — real Go1 has no such sensor); an asymmetric critic gets the privileged true base velocity during training. A direction-binned rollout (256 envs, fixed commands) confirms symmetric tracking — forward/backward/lateral/yaw all land at 88–97 % of the commanded speed.

`Genelab-Velocity-Rough-Unitree-Go1-v0` (work in progress)¶

No reference numbers yet. Trained from scratch at 3k iters the policy stalls in a stand-still optimum — it stays upright on the curriculum terrain but translates at only ~1–2 % of the commanded speed (the stand-still posture already collects partial velocity-tracking credit, and 3k iters from scratch never escapes it; the penalties are not the cause — foot-slip / undesired-contact terms stay negligible). The env wiring is correct (asymmetric critic + 187-ray height scan + terrain_levels curriculum, all smoke-tested); it is a training-budget / bootstrapping gap. Converging it likely needs a larger budget (~6k, matching the G1 rough task) plus an easier curriculum level-0 so a from-scratch policy can bootstrap basic locomotion before the terrain hardens. Tracked separately, like the G1 rough task was before it landed.

`Genelab-Velocity-Flat-Unitree-Go2W-v0`¶

Go2-W is a skid-steer wheeled quadruped — its fore-aft wheels cannot roll sideways and cannot scrub-turn slowly (stiction deadband), so lateral / slow-yaw motion must come from legged stepping. The shipped config trains a hybrid wheel-leg policy via a two-stage curriculum plus sim2sim hardening:

Stage 1 (Genelab-Velocity-Flat-Unitree-Go2W-CrabStage1-v0) — wheels locked into rigid feet (lock_wheels=True: wheel action scale 0, damping 20) + Go1-style gait shaping (feet_air_time, feet_slip); the legs learn a symmetric crab-walk / stepping turn from scratch (6k iters).
Stage 2 (this task) — warm-start from stage 1 with the wheels rolling (--checkpoint <stage1>/model_*.pt, 4–6k iters). Key ingredients, all probed against their failure modes: mirror-symmetry data augmentation (rsl_rl Symmetry; without it PPO collapses to a one-sided lateral gait — one direction 64/64, the mirror falling 64/64), L1 tracking-error terms (vy_error_l1 −0.5, wz_error_l1 −0.5; the exp kernel's gradient vanishes once an axis is abandoned), a lateral+yaw-gated feet_air_time (stepping stays rewarded when vy or wz is demanded; pure-vx rolls), and wheel damping 5.0 (a stance wheel commanded to zero actually brakes). Sim2sim hardening rides on top: 5-frame observation stacking (actor input 285 = 57 × 5), startup DR (wheel friction, trunk mass / COM, ±20 % PD gains, encoder bias), per-step action noise + per-env action latency (training-only).
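The vanishing-gradient argument for the L1 terms can be checked numerically. The sketch assumes a tracking kernel of the form exp(-e²/σ) with σ = 0.25; the exact kernel and σ used by the task are not stated here, so both are placeholders. Only the -0.5 weight comes from the text above.

```python
import math


def exp_kernel_grad(error, sigma=0.25):
    """Derivative of exp(-error**2 / sigma) with respect to error.

    sigma is an assumed scale, not a value taken from the task config.
    """
    return -2.0 * error / sigma * math.exp(-error ** 2 / sigma)


def l1_penalty_grad(error, weight=-0.5):
    """Derivative of weight * |error| with respect to error (error != 0)."""
    return weight * math.copysign(1.0, error)


for err in (0.1, 0.5, 2.0):
    print(err, exp_kernel_grad(err), l1_penalty_grad(err))
```

For a small error the exp kernel still provides a usable gradient, but once the error is large (an abandoned axis) it is numerically close to zero, while the L1 term keeps a constant pull of magnitude 0.5 toward the command.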

Seed	Final `return_mean`	`return_std`	Budget	Train wall-clock	Eval wall-clock
42	62.396	2.311	6k (stage 1) + 6k + 4k (stage 2)	~7 h total	21.6 s

Single seed (42) on one RTX 5060 Ti (hardware note) — smoke-grade, not a 3-seed sweep. Eval length_mean = 1000.0 (full episodes, zero falls in 50 episodes). success_rate is null. Per-direction probe (64 envs each, fixed command, deterministic, no auto-reset, median of per-env means): ±vy 96 % (perfectly mirror-symmetric), ±wz 93–94 %, ±vx 94–100 %, slow-yaw wz=0.2 ~77 % (stepping turns; was a 32 % stiction deadband before the air-time gate covered wz). Zero falls across all 576 probe envs.
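The "median of per-env means" statistic used for the probe can be sketched as follows. The input layout (one list of achieved speeds per env) and the helper name are assumptions; the real probe script is not shown in this document.

```python
from statistics import mean, median


def tracking_ratio(per_env_speeds, commanded):
    """Median over envs of each env's mean achieved speed, divided by the command."""
    per_env_means = [mean(speeds) for speeds in per_env_speeds]
    return median(per_env_means) / commanded


# Three toy envs tracking a 1.0 m/s command; the middle one lags badly.
print(tracking_ratio([[0.9, 1.0], [0.5, 0.5], [1.0, 1.0]], commanded=1.0))
```

Taking the median rather than the mean across envs keeps a single outlier env from dominating the per-direction figure. At 64 envs per command, the 576 probe envs correspond to nine probed commands.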

Deployment: actuator gains + stacked obs

The exported policy.ts / policy.onnx expects the 5-frame frame-major stacked observation (push one 57-dim frame per control step, backfill on reset — schema in policy.*.metadata.json) and was trained with wheel velocity gain kv = 5.0 (not the asset default 0.5) — match it on the MuJoCo / hardware side or the wheel response will differ.
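On the deployment side, the stacked observation could be maintained with a small ring buffer like the one below. It is a sketch under assumptions: frames are laid out back to back with the oldest first, and a reset fills every slot with the first frame. The authoritative ordering is whatever policy.*.metadata.json specifies, and `FrameStack` is a hypothetical helper, not a GeneLab class.

```python
from collections import deque


class FrameStack:
    """Keep the last `num_frames` observation frames as one flat vector."""

    def __init__(self, frame_dim=57, num_frames=5):
        self.frame_dim = frame_dim
        self.frames = deque(maxlen=num_frames)

    def reset(self, first_frame):
        """Backfill: fill every slot with the first frame after a reset."""
        self._check(first_frame)
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(list(first_frame))
        return self.stacked()

    def push(self, frame):
        """Append the newest frame; the oldest one drops out."""
        self._check(frame)
        self.frames.append(list(frame))
        return self.stacked()

    def stacked(self):
        """Frame-major layout: whole frames back to back, oldest first (assumed)."""
        return [value for frame in self.frames for value in frame]

    def _check(self, frame):
        if len(frame) != self.frame_dim:
            raise ValueError(f"expected {self.frame_dim} values, got {len(frame)}")
```

Call `reset` once per episode start and `push` once per control step; with the defaults the returned vector has 57 × 5 = 285 entries, matching the actor input size.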

History: single-frame no-DR (51.9) → DR-hardened single-stage (36.1) → hybrid (62.4)

The original single-frame, no-DR config scored 51.9 but transferred poorly to MuJoCo and could neither strafe (lean-only) nor hold in-place rotation. The first hardening pass (5-frame stack + DR, single-stage, 6k iters) scored 36.1 clean — robust but its lateral "tracking" was still body-lean, exposed by per-env probing. The two-stage curriculum + symmetry + L1 terms recover genuine omnidirectional tracking and the best clean return.

`GeneLab-Franka-Pick-And-Place-v0` (SAC+HER, demo-prefilled)¶

Seed	Final `return_mean`	`return_std`	`success_rate`	Convergence timestep	Train wall-clock	Eval wall-clock
1	−6.122	13.307	0.990	2 000 000 (budget cap)	~67 min	10.7 s
2	−7.260	14.695	0.990	2 000 000 (budget cap)	~62 min	10.4 s
3	−8.790	18.275	0.970	2 000 000 (budget cap)	~64 min	11.1 s

Eval length_mean = 100.0 (fixed episode length). success_rate reflects the goal-reach termination from the manipulation task. Cross-seed mean success_rate ≈ 0.983 ± 0.012 (sample standard deviation over the three seeds), meaningfully tighter than the v0.4.7 sweep, where one seed only reached 0.89 because of end-effector drift on harder goal poses. All three v1.0 seeds land in the same tight band.
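The cross-seed summary can be recomputed from the per-seed success_rate columns with the standard library. The values are copied from the v1.0 table above and the v0.4.7 table below; `summarize` is a helper name chosen for this sketch.

```python
from statistics import mean, pstdev, stdev

v1_0 = [0.990, 0.990, 0.970]  # success_rate per seed, v1.0 table
v0_4_7 = [0.89, 1.00, 1.00]   # success_rate per seed, v0.4.7 table


def summarize(values):
    """Return (mean, sample std with n - 1, population std with n)."""
    return mean(values), stdev(values), pstdev(values)


for name, values in (("v1.0", v1_0), ("v0.4.7", v0_4_7)):
    m, s, p = summarize(values)
    print(f"{name}: mean={m:.3f} sample_std={s:.3f} population_std={p:.3f}")
```

`statistics.stdev` divides by n - 1 and `statistics.pstdev` by n; with only three seeds the two differ noticeably, so it helps to state which one a quoted ± figure uses.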

Previous: v0.4.7

Seed	Final `return_mean`	`return_std`	`success_rate`	Convergence timestep	Train wall-clock	Eval wall-clock
1	−19.264	33.334	0.89	2 000 000 (budget cap)	~68 min	15.3 s
2	−4.626	8.297	1.00	2 000 000 (budget cap)	~63 min	14.3 s
3	−4.102	7.644	1.00	2 000 000 (budget cap)	~64 min	18.7 s

Mean success_rate ≈ 0.963 ± 0.052 across the three seeds (population standard deviation; the sample standard deviation, as used for the v1.0 figure, is ≈ 0.064). The two perfect seeds reflect a fully solved policy; the 0.89 seed still misses ~11 % of episodes from end-effector orientation drift on harder goal poses.

Training curves¶

Curves are exported from TensorBoard once the reference runs land. The expected location:

logs/reference/<TASK>/<DATE>/seed_<S>/
├── events.out.tfevents.*   # TensorBoard
├── ckpts/                  # checkpoints (or `model_<N>.pt` under the dir,
│                           #   depending on backend)
├── eval.json               # written by `genelab eval`
└── (optional) curves.png   # screenshot used in this doc

Until the runs are done, this section is intentionally empty — the schema above is what populated PRs should match.
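A populated seed directory could be checked against this layout with a few lines of Python. The sketch accepts any of the checkpoint forms mentioned in this document (a ckpts/ directory, model_<N>.pt files, or SB3's model.zip); `check_seed_dir` is a hypothetical helper, and curves.png is not checked because it is optional.

```python
from pathlib import Path


def check_seed_dir(seed_dir):
    """Return a list of problems found in one seed_<S> directory (empty if none)."""
    seed_dir = Path(seed_dir)
    problems = []
    if not any(seed_dir.glob("events.out.tfevents.*")):
        problems.append("no TensorBoard event file")
    has_checkpoint = (
        (seed_dir / "ckpts").is_dir()
        or any(seed_dir.glob("model_*.pt"))
        or (seed_dir / "model.zip").is_file()
    )
    if not has_checkpoint:
        problems.append("no checkpoint (ckpts/, model_<N>.pt, or model.zip)")
    if not (seed_dir / "eval.json").is_file():
        problems.append("missing eval.json")
    return problems
```

An empty list means the directory has the three required pieces.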

Methodology notes¶

Seeds 1, 2, 3 are GeneLab's canonical triplet. Any task in this doc that ships with different seeds should explain why (e.g. seed 0 hit a degenerate Genesis init on this task).
Eval seed is fixed at 0. This ensures the deterministic eval rollout is the same trajectory across seeds and across re-runs of this protocol — the variance in return_mean then reflects training variance only.
No success_rate for locomotion at this revision. Locomotion tasks ship without extras["is_success"]; the doc reports null rather than inventing a threshold. Manipulation tasks (Franka) emit is_success from the goal-reach termination, so the field is populated there.
Genesis version pin. The version used to produce these numbers is recorded at the top of each eval.json (via evaluated_at and the params/env.json snapshot in the same directory). Re-running with a different Genesis is not expected to reproduce the numbers exactly.
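When checking a reproduction against these tables, a per-seed tolerance comparison is usually more useful than expecting exact equality. The sketch below uses the G1 rough return_mean values from this document; the tolerance of 2.0 mirrors the run-to-run variance quoted for that task and is an assumption for any other task or Genesis version. `out_of_tolerance` is a hypothetical helper.

```python
def out_of_tolerance(reference, reproduced, abs_tol):
    """Return {seed: (reference, reproduced)} for seeds differing by more than abs_tol.

    A seed missing from `reproduced` is reported with None as its value.
    """
    misses = {}
    for seed, ref_value in reference.items():
        new_value = reproduced.get(seed)
        if new_value is None or abs(new_value - ref_value) > abs_tol:
            misses[seed] = (ref_value, new_value)
    return misses


# return_mean per seed from the G1 rough table in this document.
g1_rough_reference = {1: 82.96, 2: 85.51, 3: 85.23}

# Made-up reproduced values, for illustration only.
example = {1: 83.5, 2: 85.0, 3: 80.0}
print(out_of_tolerance(g1_rough_reference, example, abs_tol=2.0))
```

In this made-up example only seed 3 is flagged, since it differs from the reference by more than the tolerance.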

What this doc is not¶

It is not a benchmark suite or leaderboard. The numbers here are GeneLab's own reproducibility check.
It is not a tuning guide. See best-practices/rl-experiments for curriculum, DR, and reward weight choices that are upstream of these numbers.
