Inverted Pendulum
The inverted pendulum example is the recommended first runnable task. It is small enough for smoke tests but exercises the full GeneLab train/play path.
Tasks
| Task id | Description |
|---|---|
| GeneLab-Inverted-Pendulum-v0 | Single pole balancing (rsl_rl PPO). |
| GeneLab-Double-Inverted-Pendulum-v0 | Two-link pole balancing (rsl_rl PPO). |
| GeneLab-Inverted-Pendulum-Skrl-v0 | Single pole balancing: same env, skrl PPO backend. |
| GeneLab-Inverted-Pendulum-Memory-Rnn-v0 | "Recall the target": an LSTM policy on a task that needs memory. |
| GeneLab-Inverted-Pendulum-Memory-Mlp-v0 | Same task, plain MLP: the baseline that structurally can't remember. |
Installing and listing
Without installation:

    PYTHONPATH=examples/inverted_pendulum/src \
      genelab --import genelab_inverted_pendulum.tasks list tasks

Running

    genelab play GeneLab-Inverted-Pendulum-v0 --steps 64
    genelab play GeneLab-Inverted-Pendulum-v0 --vis --steps 500
    genelab train GeneLab-Inverted-Pendulum-v0 --num_envs 64 --max_iterations 2

    # Same env, skrl PPO backend (selected purely by the agent cfg type):
    genelab train GeneLab-Inverted-Pendulum-Skrl-v0 --num_envs 64 --max_iterations 4800

    # skrl names checkpoints agent_<timesteps>.pt under the run's checkpoints/ dir:
    genelab eval GeneLab-Inverted-Pendulum-Skrl-v0 logs/skrl/inverted_pendulum_skrl/<run>/checkpoints/agent_<N>.pt

`genelab train` for the skrl task needs the optional `skrl` dependency installed; registering and listing the task does not.
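
Because the checkpoint file name depends on the timestep count, it can be handy to look up the newest one before calling `genelab eval`. The helper below is a minimal sketch, not part of GeneLab or skrl: the function name `latest_skrl_checkpoint` is hypothetical, and it assumes only the `agent_<timesteps>.pt` naming described above.

```python
import re
from pathlib import Path
from typing import Optional

# Matches the agent_<timesteps>.pt naming described above.
_CKPT_RE = re.compile(r"^agent_(\d+)\.pt$")


def latest_skrl_checkpoint(checkpoint_dir: str) -> Optional[Path]:
    """Return the agent_<N>.pt file with the largest N, or None if there is none.

    Hypothetical helper for illustration; not a GeneLab or skrl API.
    """
    best_path, best_step = None, -1
    for path in Path(checkpoint_dir).glob("agent_*.pt"):
        match = _CKPT_RE.match(path.name)
        if match is None:
            continue  # skip files that don't follow the numeric pattern
        step = int(match.group(1))
        if step > best_step:
            best_path, best_step = path, step
    return best_path
```

The returned path can be passed as the checkpoint argument of `genelab eval`; the `<run>` directory still has to be filled in by hand.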
Recurrent (RNN) memory showcase
GeneLab-Inverted-Pendulum-Memory-Rnn-v0 shows when recurrence actually helps. It is a
"recall the target" task: each episode a random cart target is flashed in the observation
for only the first 5 steps, then removed. The flash is too brief to drive the cart there
while it is visible, so the policy must remember the target and move to it afterwards. A
memoryless MLP, once the cue is gone, has no way to recover the target and falls back to the
centre; an LSTM stores it in its hidden state and drives the cart straight to it. (Velocities
stay observed, so the only thing needing memory is the target.)
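
The cue masking described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the code in `single/memory_env_cfg.py` or `mdp.py`: the function name `target_observation` is hypothetical, and replacing the hidden target with `0.0` is an assumed masking value.

```python
from typing import List

CUE_STEPS = 5  # the cue is visible for the first 5 steps, as described above


def target_observation(targets: List[float], step_counts: List[int],
                       cue_steps: int = CUE_STEPS) -> List[float]:
    """Per-env target channel: the real target during the cue, 0.0 afterwards.

    Hypothetical sketch; the 0.0 mask value is an assumption.
    """
    return [t if s < cue_steps else 0.0 for t, s in zip(targets, step_counts)]


# Env 0 is still inside the cue window (step 4); env 1 has just left it (step 5).
obs = target_observation([0.5, -0.3], [4, 5])
```

After the cue window the target channel carries no information, so any policy that still reaches the target must have stored it internally.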
This is the task category where recurrence is decisive — unlike plain balancing, which a feedforward MLP solves even with velocities or loads hidden, because feedback stabilisation is robust to unobserved state. Memory, not partial observability alone, is what an RNN buys you.
Train both and compare the post-cue tracking error (mean distance from the hidden target):
genelab train GeneLab-Inverted-Pendulum-Memory-Rnn-v0 --num_envs 2048 --max_iterations 400
genelab train GeneLab-Inverted-Pendulum-Memory-Mlp-v0 --num_envs 2048 --max_iterations 400
| Policy | Task | Post-cue tracking error (mean \|cart − target\|) |
|---|---|---|
| LSTM (recurrent) | ...-Memory-Rnn-v0 | ≈ 0.013 (reaches and holds the hidden target) |
| MLP (baseline) | ...-Memory-Mlp-v0 | ≈ 0.78 (can't recall it, sits near centre) |
(Targets are uniform in [-0.8, 0.8]; a policy that ignores the target scores ≈ its spread.)
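
For a single episode, the metric in the table can be computed roughly as follows. This is a sketch: the function name `post_cue_tracking_error` is hypothetical, and how the reported numbers were aggregated across episodes and environments is not specified here.

```python
from typing import Sequence


def post_cue_tracking_error(cart_positions: Sequence[float], target: float,
                            cue_steps: int = 5) -> float:
    """Mean |cart - target| over the steps after the cue has been removed.

    Hypothetical helper; cue_steps=5 follows the task description above.
    """
    post_cue = cart_positions[cue_steps:]
    if not post_cue:
        raise ValueError("episode is shorter than the cue window")
    return sum(abs(x - target) for x in post_cue) / len(post_cue)


# A cart that sits on the target after the cue has zero error;
# one that stays at the centre is off by the full target offset.
on_target = post_cue_tracking_error([0.0] * 5 + [0.5] * 5, target=0.5)
at_centre = post_cue_tracking_error([0.0] * 10, target=0.5)
```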
Why it trains (and earlier memory attempts didn't)
Recurrent PPO uses truncated BPTT of length num_steps_per_env. The memory cfg sets it to
100 — one whole ~100-step episode — so each BPTT window spans the full cue→reach
dependency and the gradient can flow from the recall back to where the target was encoded.
With the default 24-step window the LSTM never learns to associate the two.
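
To make the window argument concrete, the sketch below counts how many post-cue steps share a BPTT window with the last cue step, since gradients from a step can only flow back as far as the start of its own window. It is a simplification with a hypothetical function name, and it assumes windows are aligned to the episode start, which real rollouts do not guarantee.

```python
def fraction_sharing_cue_window(episode_len: int, window: int,
                                cue_steps: int = 5) -> float:
    """Fraction of post-cue steps in the same BPTT window as the last cue step.

    Assumes windows [0, window), [window, 2*window), ... aligned to the
    episode start. Hypothetical illustration, not GeneLab or rsl_rl code.
    """
    last_cue_step = cue_steps - 1
    post_cue = range(cue_steps, episode_len)
    reachable = sum(1 for t in post_cue if t // window == last_cue_step // window)
    return reachable / len(post_cue)


short_window = fraction_sharing_cue_window(episode_len=100, window=24)
full_window = fraction_sharing_cue_window(episode_len=100, window=100)
```

Under these assumptions a 100-step window covers every post-cue step, while a 24-step window leaves most of the episode in windows that no longer contain the cue.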
Code entry points
| File | Role |
|---|---|
| tasks.py | Registers robots, envs, and tasks. |
| single/env_cfg.py | Single-pole manager-based env config. |
| single/memory_env_cfg.py | The "recall the target" env (masked target cue + tracking reward). |
| single/rnn_cfg.py | LSTM + MLP PPO configs for the memory task. |
| double/env_cfg.py | Double-pole manager-based env config. |
| mdp.py | Example-specific reward / termination / memory-task terms. |