PhAIL – Eval

Real, as easily as sim. Run your model on a real arm through the same API as sim – one set of metrics, one comparable number.

One catalog, growing. Real rigs plus the sim benchmarks you already run (LIBERO, RoboCasa, ManiSkill) in the same harness – so you stop rebuilding eval infrastructure for every model and robot.

Numbers you can cite. Blinded, randomized trials, enough of them to mean something – every run saved with multi-view video and telemetry.

Private or public. Keep results to your lab, or put them on the leaderboard alongside OpenPI π_0.5, GR00T, SmolVLA, and ACT.

We’re opening this to early users. Leave your email and what you’d run through it.

See it live: the v1.0 leaderboard, every run, the protocol and data, and the methodology.