How to Annotate Multimodal Data for Robotic Foundation Models
The six modalities in a modern VLA stack
RLWRLD's RLDX-1 architecture is a good canonical example because it is described publicly. Its Multi-Stream Action Transformer (MSAT) routes six modalities through dedicated processing streams and lets them interact via joint self-attention rather than collapsing them into one shared representation:
| Modality | What it carries | What annotation looks like |
|---|---|---|
| Vision-Language | Video from the robot's cameras + text instruction | Object identity, spatial layout, language-grounding tags, scene state at start and end |
| Proprioception | Joint positions, velocities — the robot's internal sense of its body | Phase boundaries by joint configuration; gait/motion state labels |
| Action | Motor commands — what the robot is doing | Action primitive labels; intent annotations; success/failure flags |
| Memory | Past cognition features — long-horizon context | Sub-task decomposition; long-horizon goal labels; cross-episode references |
| Tactile | Pressure / texture readings from fingertip sensors | Contact / no-contact events; grip quality; slip detection markers |
| Torque | Joint torques — physical effort applied at each joint | Contact phase transitions; force signatures of grasp engagement and release |
Each modality has its own labeling discipline. Conflating them — having one labeler do all six in a single pass on the video player — is the most common mistake in early-stage VLA annotation builds. The right model is one labeler per stream, with a cross-modal reconciliation pass at the end.
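As a rough illustration of the reconciliation pass, the sketch below groups per-stream labels by the physical event they describe and flags events where the streams disagree by more than a tolerance. The `StreamLabel` record, the `event_id` field, and the 40 ms tolerance are assumptions made for this example, not part of any published tooling.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class StreamLabel:
    event_id: str   # hypothetical shared ID for one physical event
    modality: str   # e.g. "vision", "torque", "tactile"
    labeler: str
    t_ms: float     # timestamp on the shared hardware clock, in milliseconds

def reconcile(labels, tolerance_ms=40.0):
    """Return {event_id: spread_ms} for events whose stream labels disagree."""
    by_event = defaultdict(list)
    for lab in labels:
        by_event[lab.event_id].append(lab.t_ms)
    flagged = {}
    for event_id, times in by_event.items():
        spread = max(times) - min(times)
        if spread > tolerance_ms:
            flagged[event_id] = spread
    return flagged

labels = [
    StreamLabel("ep1-contact", "vision", "alice", 1200.0),
    StreamLabel("ep1-contact", "torque", "bob", 1267.0),
    StreamLabel("ep1-release", "torque", "bob", 3400.0),
    StreamLabel("ep1-release", "tactile", "carol", 3404.0),
]
print(reconcile(labels))  # {'ep1-contact': 67.0}
```

Flagged events go to a human for resolution; the sketch only surfaces the disagreement and does not decide which stream is right.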
The phase boundary problem
The reason multi-stream labeling matters is that the most important events in robotic manipulation are visually ambiguous. A clean example from the RLDX-1 paper: a card-slide pick task, broken into four phases — Approach → Contact → Pick → Hand-Over.
Watch the video alone and you cannot reliably tell when the gripper actually engages the card. The hand is in approximately the right position several frames before it grips, and the contact happens somewhere inside that ambiguity. A labeler relying on video will pick a frame, but a second labeler working from the same video will often pick a different one.
Look at the joint torque trace instead and the contact event is unmistakable — a sharp spike at the millisecond the fingers engage. The torque trace doesn't carry the spatial context that the video does, but for this specific boundary it owns the truth. RLDX-1's Physics Stream consumes torque as its own modality precisely so the model can learn to associate that signature with the contact concept.
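A simple way to pre-seed a contact candidate for the labeler is a threshold detector on the torque trace. The sketch below is one possible approach, not the method used in RLDX-1: the function name, the baseline window, the `k` multiplier, and the `min_run` requirement are all illustrative assumptions. It only proposes a candidate sample; telling a real contact from backlash or sensor drift remains the labeler's job.

```python
def first_contact_index(torque, baseline_n=50, k=6.0, min_run=3):
    """Index of the first sample that starts a sustained torque excursion.

    Baseline mean and standard deviation come from the first `baseline_n`
    samples, assumed to be free motion. A candidate is the first index where
    `min_run` consecutive samples deviate from the baseline mean by more than
    `k` standard deviations. Returns None if no such run exists.
    """
    base = torque[:baseline_n]
    mean = sum(base) / len(base)
    var = sum((x - mean) ** 2 for x in base) / len(base)
    threshold = k * max(var ** 0.5, 1e-9)
    run = 0
    for i in range(baseline_n, len(torque)):
        if abs(torque[i] - mean) > threshold:
            run += 1
            if run == min_run:
                return i - min_run + 1
        else:
            run = 0
    return None

# Synthetic 1 kHz trace: small deterministic ripple, one isolated glitch at
# sample 80, and a sustained step from sample 120 onward (the "contact").
trace = [0.01 * ((7 * i) % 5 - 2) for i in range(200)]
trace[80] += 0.5
for i in range(120, 200):
    trace[i] += 0.5

print(first_contact_index(trace))  # 120, i.e. 120 ms into the trace at 1 kHz
```

Requiring several consecutive samples above threshold is what lets the single-sample glitch at index 80 pass without being reported as contact.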
The principle generalizes: for any phase boundary in a manipulation task, one modality (or occasionally a pair) owns the truth. Label there rather than defaulting to the video.
| Boundary | Authoritative modality | Why |
|---|---|---|
| Approach → Contact | Joint torque | Visual contact is ambiguous; torque spikes the instant fingers engage |
| Contact → Pick | Tactile + joint velocity | Pick is "grip stable + start to move"; tactile confirms stable grip, velocity confirms motion |
| Pick → Transport | Proprioception (joint trajectory) | Defined by end-effector trajectory shape, not visual cue |
| Transport → Hand-Over | Vision-language (human reach detection) | The human's intent enters the scene; vision-language model registers it best |
| Hand-Over → Release | Torque + tactile drop-off | Force and pressure both fall to zero in a characteristic way |
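The table above can be encoded as a lookup that a labeling pipeline consults when it picks the final timestamp for a boundary. This is a minimal sketch: the modality keys and the `resolve_boundary` helper are hypothetical names, and taking the later timestamp when two modalities co-own a boundary is an assumption (the boundary counts as reached only once both conditions hold).

```python
# Boundary -> owning modality or modalities, transcribed from the table.
AUTHORITATIVE = {
    ("approach", "contact"): ("torque",),
    ("contact", "pick"): ("tactile", "joint_velocity"),
    ("pick", "transport"): ("proprioception",),
    ("transport", "hand_over"): ("vision_language",),
    ("hand_over", "release"): ("torque", "tactile"),
}

def resolve_boundary(boundary, candidates_ms):
    """Pick the boundary timestamp from the owning modality or modalities.

    `candidates_ms` maps modality name -> labeled timestamp in milliseconds.
    Labels from non-owning modalities are ignored. With two owners, the later
    timestamp wins (assumption: both conditions must hold).
    """
    owners = AUTHORITATIVE[boundary]
    missing = [m for m in owners if m not in candidates_ms]
    if missing:
        raise ValueError(f"missing authoritative label(s) for {boundary}: {missing}")
    return max(candidates_ms[m] for m in owners)

print(resolve_boundary(("approach", "contact"),
                       {"vision_language": 1200.0, "torque": 1267.0}))  # 1267.0
print(resolve_boundary(("contact", "pick"),
                       {"tactile": 1310.0, "joint_velocity": 1342.0}))  # 1342.0
```

Raising an error when the owning modality has no label keeps a video-only label from silently standing in for the authoritative one.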
Time alignment: the silent killer
Multimodal annotation has one technical prerequisite that, when missing, makes everything else pointless: hardware-clock synchronization across all sensors.
A single video frame at 30 fps is roughly 33 ms. A torque reading at 1 kHz resolves to 1 ms. A tactile event sampled at higher rates is tighter still. If those clocks aren't hardware-synchronized at the capture layer, the labels you produce will drift relative to the events they're marking, and the drift will be silent. The model trains, accuracy plateaus a few points lower than it should, and nothing in your dashboards explains why.
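The arithmetic behind those numbers is worth making explicit, because it shows why the drift is silent. The snippet below uses the 30 fps and 1 kHz rates from the text; the 20 ms offset is a hypothetical value chosen for illustration.

```python
FPS = 30            # camera frame rate from the text
TORQUE_HZ = 1000    # torque sampling rate from the text

frame_period_ms = 1000 / FPS                 # about 33.3 ms between frames
torque_period_ms = 1000 / TORQUE_HZ          # 1 ms between torque samples
torque_samples_per_frame = TORQUE_HZ / FPS   # about 33 torque samples per frame

def offset_in_samples(offset_ms, rate_hz):
    """How many samples of a stream a given clock offset corresponds to."""
    return offset_ms * rate_hz / 1000

print(f"frame period: {frame_period_ms:.1f} ms")
print(f"torque samples per video frame: {torque_samples_per_frame:.1f}")
print(f"a 20 ms offset spans {offset_in_samples(20, TORQUE_HZ):.0f} torque samples "
      f"but only {offset_in_samples(20, FPS):.1f} video frames")
```

A 20 ms offset is less than one video frame, so nobody watching playback would notice it, yet it moves a torque-anchored label by 20 samples.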
Practical rules:
- Sensor clocks must converge to a common reference with sub-five-millisecond drift across an eight-hour session. Verify with an audit recording before any real data collection starts.
- Timestamps must be hardware-stamped at capture, not software-stamped on receipt. Network delay alone can introduce tens of milliseconds of variance.
- Every dataset chunk should carry a synchronization quality flag. Down-weight or drop chunks where drift exceeded tolerance.
- Build a synthetic alignment test into the pipeline: record a known event periodically (e.g., a sharp tap on a fingertip sensor, with a marker visible to the camera) and measure the offset across modalities.
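One way to turn the rules above into a per-session check is to record when each modality registered the same reference event and compare those times against one reference stream. The sketch below assumes torque as the reference and flags the session on drift, meaning the change in offset between the first and last reference event. The `sync_report` name, the event layout, and the example timestamps are all hypothetical.

```python
def sync_report(ref_events, reference="torque", tolerance_ms=5.0):
    """Summarize cross-modal offsets over the reference events of one session.

    `ref_events` is a time-ordered list of dicts mapping modality name to the
    timestamp (ms) at which that modality registered the same physical event.
    """
    offsets = {}  # modality -> offsets relative to the reference stream
    for event in ref_events:
        t_ref = event[reference]
        for modality, t in event.items():
            if modality != reference:
                offsets.setdefault(modality, []).append(t - t_ref)
    drift = {m: series[-1] - series[0] for m, series in offsets.items()}
    worst = max(abs(o) for series in offsets.values() for o in series)
    return {
        "worst_offset_ms": worst,
        "drift_ms": drift,
        "sync_ok": max(abs(d) for d in drift.values()) < tolerance_ms,
    }

# Reference taps at the start, middle, and end of an eight-hour session.
events = [
    {"torque": 1_000.0, "tactile": 1_000.5, "camera": 1_001.0},
    {"torque": 14_400_000.0, "tactile": 14_400_001.0, "camera": 14_400_004.0},
    {"torque": 28_800_000.0, "tactile": 28_800_001.5, "camera": 28_800_007.0},
]
report = sync_report(events)
print(report["drift_ms"])  # {'tactile': 1.0, 'camera': 6.0}
print(report["sync_ok"])   # False: the camera drifted 6 ms over the session
```

The `sync_ok` value is the kind of flag that can be stored with each dataset chunk and used later to drop or down-weight it.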
Annotator skill profile
The biggest hidden cost in VLA annotation is the skill profile of the labeler. Reading a joint torque trace and recognizing the difference between a real contact spike, a gripper backlash event, and sensor drift is a learned skill that takes weeks of structured exposure to real examples. Same for tactile, same for proprioception.
The labeler needs to internalize what a "clean" trace looks like in each modality and develop instinct for what's signal versus noise. Bring people in already comfortable with signal data — engineers, lab technicians, people from instrumentation or biomedical backgrounds — and train them on the specific modalities you're capturing.
Practical guidance:
- Specialize labelers by modality, not by project. A torque-trained labeler stays on torque; a tactile-trained labeler stays on tactile. Cross-training degrades both.
- Pair a senior labeler with each new hire for the first two weeks. Calibration drift in a labeler's mental model is the second-most-common source of dataset noise after timestamp drift.
- Use objective benchmarks for hiring. Show candidates real signal traces with hidden ground truth and measure their accuracy on phase boundary identification before they touch production data.
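A hiring benchmark of this kind can be scored mechanically. The sketch below counts how many hidden ground-truth boundaries a candidate marked within a tolerance, matching each boundary to at most one mark. The function name and the 10 ms tolerance are assumptions for illustration.

```python
def boundary_accuracy(candidate_ms, truth_ms, tolerance_ms=10.0):
    """Fraction of ground-truth boundaries the candidate hit within tolerance.

    Each ground-truth boundary is matched to at most one candidate mark (the
    nearest unused one), so extra marks cannot inflate the score.
    """
    if not truth_ms:
        raise ValueError("truth_ms must contain at least one boundary")
    unused = sorted(candidate_ms)
    hits = 0
    for t in sorted(truth_ms):
        if not unused:
            break
        nearest = min(unused, key=lambda c: abs(c - t))
        if abs(nearest - t) <= tolerance_ms:
            hits += 1
            unused.remove(nearest)
    return hits / len(truth_ms)

truth = [1267.0, 1342.0, 2100.0, 3404.0]
candidate = [1270.0, 1330.0, 2105.0, 3450.0]
print(boundary_accuracy(candidate, truth))  # 0.5
```

In the example the candidate is within 10 ms on two of four boundaries (3 ms and 5 ms off) and misses the other two (12 ms and 46 ms off).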
Memory and long-horizon labels
RLDX-1's Memory Module is worth a paragraph because it surfaces an annotation problem that doesn't exist in single-frame CV work. The Memory Module compresses the vision-language context of past frames into "cognition tokens" and passes them through transformer blocks to maintain long-horizon state — the model's equivalent of remembering what it was doing thirty seconds ago.
For annotation this means tasks have to be labeled at multiple time scales. Frame-level boundaries (contact, release) get one labeling pass. Sub-task boundaries (approach the object, manipulate it, place it down) get another. Whole-episode goals (make a cup of coffee) get a third. The labels at each level have to be consistent with each other — you can't have frame-level labels that contradict sub-task labels, or sub-task labels that don't compose into the whole-episode goal.
Hierarchical label consistency is enforced by tooling, not goodwill. If your annotation platform doesn't support nested labels with automatic consistency checks, you will not get this right by hand.
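To make the idea of an automatic consistency check concrete, here is a minimal sketch of nested labels with a temporal validator. The `Span` structure and the three rules it enforces (events inside their span, children inside their parent, no overlapping siblings) are assumptions for this example. It checks timing only, not whether the sub-tasks semantically add up to the episode goal.

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str
    start_s: float
    end_s: float
    children: list = field(default_factory=list)  # nested Span objects
    events: dict = field(default_factory=dict)    # frame-level marks: name -> time (s)

def check_hierarchy(span, path=""):
    """Return a list of human-readable consistency violations under `span`."""
    here = f"{path}/{span.name}"
    problems = []
    if span.end_s <= span.start_s:
        problems.append(f"{here}: end is not after start")
    for name, t in span.events.items():
        if not span.start_s <= t <= span.end_s:
            problems.append(f"{here}: event '{name}' at {t}s is outside the span")
    prev_end = float("-inf")
    for child in sorted(span.children, key=lambda c: c.start_s):
        if child.start_s < span.start_s or child.end_s > span.end_s:
            problems.append(f"{here}/{child.name}: extends outside parent")
        if child.start_s < prev_end:
            problems.append(f"{here}/{child.name}: overlaps previous sibling")
        prev_end = max(prev_end, child.end_s)
        problems.extend(check_hierarchy(child, here))
    return problems

episode = Span("Make coffee", 0.0, 30.0, children=[
    Span("Take cup", 10.0, 14.0, events={"contact": 12.34}),
    Span("Place under machine", 14.0, 20.0, events={"release": 14.89}),
])
print(check_hierarchy(episode))  # []
```

Moving the `release` mark to 21.0 s, outside its sub-task, makes the checker report one violation for that sub-task.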
Ontology discipline before recording
The most expensive mistake we see in early-stage VLA builds is "we'll figure out the action taxonomy as we go." You won't. You'll discover three months in that you have thousands of demos with inconsistent labels that can't be reconciled without re-collection.
The action ontology, the mode taxonomy, the metadata schema, the modality definitions — all of it has to exist on paper, signed off by both the ML and operations sides, before any sensor records anything. Annotators can fill in fine-grained phase markers after capture. The high-level structure must be locked first.
A useful discipline: any change to the ontology after week one triggers re-labeling of the affected dataset chunks. The cost of that re-labeling is the discipline mechanism — it forces the team to think carefully before changing the schema, instead of changing it ad hoc and hoping nothing breaks.
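The re-labeling rule is easier to enforce if each chunk records which labels it uses, so the affected set can be computed instead of guessed. A minimal sketch, with hypothetical `id` and `labels_used` keys:

```python
def chunks_to_relabel(chunks, changed_labels):
    """Return the IDs of chunks that use any label touched by an ontology change.

    `chunks` is a list of dicts with hypothetical keys "id" and "labels_used".
    """
    changed = set(changed_labels)
    return sorted(c["id"] for c in chunks if changed & set(c["labels_used"]))

chunks = [
    {"id": "chunk-001", "labels_used": ["approach", "contact", "pick"]},
    {"id": "chunk-002", "labels_used": ["transport", "hand_over"]},
    {"id": "chunk-003", "labels_used": ["pick", "transport"]},
]
# Suppose a schema revision splits "pick" into two finer labels.
print(chunks_to_relabel(chunks, ["pick"]))  # ['chunk-001', 'chunk-003']
```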
QA workflow
Cross-modal alignment errors are clustered, not random. A drift in one sensor's clock affects every label produced during that session. A confused labeler will mis-mark every contact event of a particular task type. So QA has to be structural, not statistical:
- For each labeled session, verify time alignment with a known reference event embedded in the recording.
- For each phase boundary type, audit all labels in the modality that owns the boundary — full census, not sample.
- Spot-check cross-modal consistency: when the torque trace says contact and the video says no contact, flag the disagreement and resolve manually.
- Track inter-labeler agreement statistics per labeler per modality. A labeler whose torque-boundary agreement with the consensus drops below a preset threshold gets re-trained or reassigned.
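The per-labeler, per-modality agreement statistic can be computed in a few lines. This sketch assumes several labelers mark the same events, takes the median mark as the consensus, and scores each labeler by the fraction of marks within a tolerance of it. The tuple layout, the 10 ms tolerance, and the 0.8 threshold are illustrative assumptions.

```python
from collections import defaultdict
from statistics import median

def agreement_by_labeler(marks, tolerance_ms=10.0):
    """marks: list of (labeler, modality, event_id, t_ms) tuples.

    Returns {(labeler, modality): fraction of marks within tolerance of the
    median mark for that (modality, event_id)}.
    """
    by_event = defaultdict(list)
    for labeler, modality, event_id, t in marks:
        by_event[(modality, event_id)].append(t)
    consensus = {key: median(ts) for key, ts in by_event.items()}

    hits, totals = defaultdict(int), defaultdict(int)
    for labeler, modality, event_id, t in marks:
        key = (labeler, modality)
        totals[key] += 1
        if abs(t - consensus[(modality, event_id)]) <= tolerance_ms:
            hits[key] += 1
    return {key: hits[key] / totals[key] for key in totals}

marks = [
    ("alice", "torque", "e1", 1267.0), ("bob", "torque", "e1", 1265.0),
    ("carol", "torque", "e1", 1301.0),
    ("alice", "torque", "e2", 3404.0), ("bob", "torque", "e2", 3400.0),
    ("carol", "torque", "e2", 3402.0),
]
scores = agreement_by_labeler(marks)
below = sorted(k for k, v in scores.items() if v < 0.8)
print(below)  # [('carol', 'torque')]
```

Carol's mark on `e1` sits 34 ms from the median, so her torque agreement is 0.5 and she falls below the example threshold.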
Failure modes
| Failure mode | Symptom in training | Root cause in labeling pipeline |
|---|---|---|
| Silent timestamp drift | Accuracy plateaus smoothly during training; no obvious bug | Sensor clocks not hardware-synced; drift accumulates across sessions |
| Visual-only phase labels | Model fails on contact-rich tasks despite tons of data | Labelers used video alone; contact events labeled at the wrong frame |
| Ontology drift mid-collection | Different chunks of dataset use incompatible action labels | Action taxonomy revised during data collection without re-labeling earlier batches |
| Annotator skill mismatch | Phase boundaries inconsistent across labelers | Visual-trained labelers assigned torque/tactile work without physics training |
| QA via random spot-check | Model fails on real deployment but pre-deploy tests passed | Cross-modal alignment errors are clustered, not random — random sampling misses them |
| Flat label structure for long-horizon tasks | Model learns sub-tasks but fails to compose them into longer behaviors | No hierarchical labeling: frame-level, sub-task-level, and episode-level labels are never checked for consistency with each other |
What we'd do on day zero of a new build
If we were setting up annotation for a humanoid VLA foundation model right now, these would be the first three items on the punch list:
- Hardware-clock sync audit, with a recorded test, before any data gets collected. Prove sub-five-millisecond drift across an eight-hour session. Skip this and you will discover the drift during training.
- An action-taxonomy committee, signed off in writing. Two ML engineers plus one operations lead define the phase ontology with worked examples. The document is the labeling rubric. Changes after week one trigger re-labeling — the cost of that re-labeling is the discipline mechanism.
- Specialized labelers by modality, paid accordingly. Not gig-economy clickers. People trained to read each stream's signatures. Smaller team, slower ramp, dramatically higher quality.
Why we're publishing this
The teams that scale fast on VLA work treat labeling as a first-class engineering problem from day zero. The teams that plateau hire more ML engineers when what they actually need is a labeling operation that can read physics. We wrote this playbook because that gap is the single biggest tax on the field right now and we think the answer doesn't have to be proprietary.
If you're building a VLA model and any of this resonates with the pain on your build, talk to us. Multi-stream, time-aligned, physics-literate annotation is the work we know best.
FAQ
Which modality should annotators use to label contact events?
Joint torque, almost always. Visual contact is ambiguous — by the time the gripper appears to engage the object on video, the actual contact has already happened or is still a few frames away. Torque spikes the instant fingers physically engage, with a sharp signature that's hard to mistake. Use video as context, torque as truth.
Can existing computer vision labelers work on torque or tactile data?
Not without retraining. Reading a torque trace and recognizing the difference between a real contact spike, a gripper backlash event, and sensor drift is a learned skill — it takes weeks of structured exposure to real examples. Bring people in already comfortable with signal data (engineers, instrumentation backgrounds) and specialize them by modality, not by project.
What sensor clocks need to be hardware-synchronized for VLA training?
Every sensor whose output the model will consume. At minimum: cameras, joint encoders, torque sensors, tactile sensors. Target sub-five-millisecond drift across an eight-hour session. Software timestamps applied on receipt are not sufficient — network and buffer delay alone introduce tens of milliseconds of variance. Hardware-stamp at capture.
How do you verify time alignment across sensors?
Embed a known reference event into recording sessions, for example a sharp tap on a fingertip sensor with a marker visible to the camera. Measure the offset between modalities at that event. Do it at the start of every session, and audit the offsets weekly. Any session where drift exceeded tolerance gets flagged in metadata and either dropped or down-weighted in training.
What's the difference between frame-level, sub-task, and episode labels?
Frame-level labels mark exact moments — contact at t=12.34s, release at t=14.89s. Sub-task labels group sequences of frames into meaningful operations — "Take cup", "Place under machine". Episode labels describe the whole task — "Make coffee". Modern VLA architectures with memory modules (like RLDX-1's cognition tokens) consume all three levels. Your annotation tooling needs to enforce consistency across them, not just at the frame level.
How do you QA multimodal annotation when random sampling doesn't work?
Cross-modal alignment errors are clustered, not random. Audit by structure instead of sampling: verify time alignment with embedded reference events, full-census every phase boundary in its authoritative modality, spot-check cross-modal disagreements (torque says contact, video says no contact — flag for manual resolution), and track inter-labeler agreement statistics per labeler per modality.