A single five-second manipulation episode — robot hand picking up a red cube — viewed through six modalities a foundation model would ingest in parallel. Hover any panel to move a synchronized cursor across all six.

Vision-Language 3 frames + instruction
Pick up the red cube and hand it over.
Proprioception Joint angle (rad)
Action Motor cmd
Memory 12 tokens
pastnow
Cognition tokens (compressed past context)
Tactile Pressure (N)
Torque Joint torque (Nm)
t = 0.00 s Hover any panel — a shared cursor will sweep across all six.
0.0s
5.0s