
Consider a robotics system that misidentifies an object during a nighttime test run. The engineering team pulls the logs and sees that the object was present and that all three sensors recorded it. Every annotator followed the guidelines, but no one could identify where the failure originated.
A 2025 review of multi-sensor fusion in Sensors documented that AV perception systems generate data continuously at rates reaching multiple gigabytes per second, even up to terabytes per hour depending on sensor configuration, and confirmed that sensor faults travel downstream into vehicle decisions. A labeling inconsistency moves through the same path. It enters the data quietly, and nothing surfaces it until a model decision does.
The scale of investment in physical AI data infrastructure reflects how much is at stake. According to Market.us, the AI data annotation market is projected to grow from $3 billion in 2025 to $28.5 billion in 2034, with autonomous vehicles and robotics programs among the largest drivers of demand.
Sensors also reports that the largest automotive programs have moved beyond test runs and into production:
- Level 3 conditional automation is available in customer vehicles from Mercedes-Benz and BMW, operating in Germany and in parts of the U.S.
- Waymo operates commercial self-driving ride-sharing services with vehicles it describes as Level 4 autonomous
Those vehicles are encountering real passengers and consequences at every intersection. Somewhere in the decision chain is a label that was applied to a sensor return. That’s the work annotation infrastructure is doing, and the full weight of it only becomes clear when a perception error reaches a deployed system.
“When we talk about multi-sensor annotation for autonomous driving, we’re talking about maintaining consistency across data types that are fundamentally different in structure,” says Steve Nemzer, Senior Director, Artificial Intelligence Research & Innovation at TELUS Digital, who has worked across physical AI annotation programs for autonomous vehicles.
Nemzer adds, “Lidar gives you a sparse point cloud, radar gives you velocity, and a camera gives you pixels. The annotation team has to unify those into a single coherent truth about what’s happening in the scene, and they have to do it at scale, frame by frame, with sub-pixel accuracy. That’s not a task you can fully automate.”
Three Sensors, Three Different Failure Modes
Each sensor fails under different conditions and produces a different class of error, and no single annotator in the pipeline has visibility across all of them. Research from Sensors maps each failure mode precisely:
- Lidar can generate false positive objects in rainy weather as laser beams scatter off wet surfaces
- Radar can return spurious observations from multiple reflections off surfaces, creating clutter that trained models can’t distinguish from real objects
- Cameras can degrade in low light, under glare and in adverse weather conditions
- Spatiotemporal calibration errors can compound under each of these conditions, placing the same physical object at different positions across sensor streams
Each annotator produces a defensible label. The conflict between them is invisible because no one in the pipeline has a cross-modal view. And when programs expand across geographies, sensor configurations and weather conditions, informal coordination breaks down. Surfacing cross-modal conflicts consistently requires that capability to be built into the annotation infrastructure from the start.
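The cross-modal view described here can be sketched in a few lines of code. The sketch below assumes a simplified, hypothetical frame format in which each sensor's detections have already been projected into a shared vehicle coordinate frame; `Detection`, `cross_modal_conflicts` and the 1.5 m match radius are illustrative placeholders, not a production API.

```python
from dataclasses import dataclass
from math import dist

@dataclass
class Detection:
    sensor: str      # "lidar", "radar" or "camera"
    label: str
    position: tuple  # (x, y) in a shared vehicle frame, metres

def cross_modal_conflicts(detections, match_radius=1.5):
    """Flag detections that no other sensor corroborates.

    Each detection is checked against detections from *other* sensors;
    if none lies within match_radius, it is returned as unconfirmed
    and can be routed to a reviewer with a cross-modal view.
    """
    conflicts = []
    for d in detections:
        corroborated = any(
            o.sensor != d.sensor and dist(o.position, d.position) <= match_radius
            for o in detections
        )
        if not corroborated:
            conflicts.append(d)
    return conflicts

frame = [
    Detection("radar", "vehicle", (12.0, 3.0)),    # radar-only return
    Detection("lidar", "pedestrian", (4.0, -1.0)),
    Detection("camera", "pedestrian", (4.2, -0.8)),
]
print(cross_modal_conflicts(frame))  # the uncorroborated radar return surfaces for review
```

A reviewer looking at any single stream would accept all three labels; only the comparison across streams exposes the radar-only return as a conflict worth escalating.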
The Frame That Reaches Production
A construction zone on an urban test route generates a radar return that the automated pipeline reads as a slow-moving vehicle. The camera frame, partially obscured by roadside signage, doesn’t confirm it, and the lidar point cloud shows nothing at that position. The system’s confidence score clears the threshold and the label commits. The annotator reviewing the batch sees the camera frame in isolation; the radar return sits outside their workflow, and the label holds.
That mislabeled object then enters the training dataset. Whether it becomes a recurring pattern depends entirely on how many similar frames follow it into the same batch and whether the annotation infrastructure is built to surface that class of conflict before it reaches the model.
Well-structured annotation operations handle that differently. High-uncertainty cases are routed out of the automated pipeline and directed to annotators with sufficient domain knowledge to understand what the sensor return is actually showing. That judgment feeds back into the model, and expertise lands where it has the most effect.
“Annotation automation hits a wall in those safety-critical edge cases where ambiguity is high,” Nemzer says. “Interpreting the gesture of a crossing guard is far trickier than identifying a yield sign. The most effective annotation processes at scale do not attempt to eliminate human judgment entirely. Automated systems flag high-uncertainty cases using confidence thresholds and disagreement signals, and human-in-the-loop annotators resolve them using structured decision frameworks.”
This is what makes human-in-the-loop a design choice rather than a fallback. An annotation infrastructure that routes high-uncertainty cases to qualified humans in real time, with the domain knowledge to resolve them correctly, produces a materially different quality system than one that adds a human review layer after a predominantly automated pipeline. At the scale physical AI programs require, that routing discipline is the operational challenge TELUS Digital’s physical AI annotation programs are designed around. The difference shows up in model performance on the edge cases automated systems can’t handle.
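That routing discipline can be sketched as a single rule: auto-commit a label only when model confidence clears a threshold and enough independent sensor streams agree, and escalate everything else to a human. The function name and threshold values below are hypothetical placeholders, not TELUS Digital's actual pipeline logic.

```python
def route_label(model_confidence: float, corroborating_sensors: int,
                conf_threshold: float = 0.85, min_sensors: int = 2) -> str:
    """Decide whether a candidate label can commit automatically.

    A label auto-commits only when the model is confident AND at least
    min_sensors independent sensor streams corroborate the object.
    Anything ambiguous routes to a human annotator in real time.
    """
    if model_confidence >= conf_threshold and corroborating_sensors >= min_sensors:
        return "auto_commit"
    return "human_review"

# The construction-zone frame from earlier: model confidence is high,
# but only the radar saw the object, so the label escalates.
print(route_label(model_confidence=0.91, corroborating_sensors=1))  # human_review
print(route_label(model_confidence=0.93, corroborating_sensors=3))  # auto_commit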
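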
Where Did a Product Failure Start?
For programs operating in safety-critical environments, annotation lineage is also a compliance requirement. Several frameworks establish the safety and security baseline:
- ISO 26262, the international standard for road vehicle functional safety, mandates a structured development process with full traceability from system design through deployment
- TISAX governs automotive-specific data handling
- ISO 27001 and SOC 2 set the baseline for information security and audit controls
For leading AV and robotics programs, these credentials are active procurement requirements built into vendor evaluation criteria before a program reaches scale.
What those requirements make possible is the ability to quickly answer where a failure actually started. When a robotics system misidentifies an object during testing, the investigation typically spans multiple potential origins: the annotation guidelines, the annotators who labeled the relevant batch, the quality review that passed it through, a systematic bias that entered the training data weeks earlier in a different environment, the sensor configuration active during collection, or the guideline version in effect at the time and whether it had been updated between that run and the previous one.
“Data lineage isn’t a nice-to-have for safety-critical AI. You need to be able to quickly answer questions like what exact data trained this model, what quality standards did it meet and why did it fail in this specific case, without extensive manual investigation. If you’re having to dig through logs, you’re not ready for production safety-critical work,” Nemzer says.
Programs with lineage built in from the start can trace a model failure back to a specific label and a specific point in the review process. These are architectural decisions, fixed long before a model encounters the conditions that test them, and they determine whether a program reaches production.
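One way to picture the lineage requirement Nemzer describes is as a record committed alongside every label, so that a failure investigation becomes a query rather than log archaeology. The schema below is a hypothetical illustration; the field names and identifiers are assumptions, not a real TELUS Digital or ISO 26262 data model.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LabelLineage:
    """Audit record committed alongside every label (illustrative schema)."""
    label_id: str
    frame_id: str
    annotator_id: str
    guideline_version: str  # guideline revision in effect at labeling time
    sensor_config: str      # sensor configuration active during collection
    review_stage: str       # e.g. "auto", "first_pass", "qa_sample"

def trace_failure(records: list, label_id: str) -> list:
    """Return every lineage record for a label, across all review stages."""
    return [r for r in records if r.label_id == label_id]

# Hypothetical ledger: the same label touched by the automated pipeline
# and then by a QA sampling pass.
ledger = [
    LabelLineage("lbl-481", "frame-0034", "ann-07", "g-3.2", "lidar+radar+cam_v2", "auto"),
    LabelLineage("lbl-481", "frame-0034", "ann-19", "g-3.2", "lidar+radar+cam_v2", "qa_sample"),
]
for record in trace_failure(ledger, "lbl-481"):
    print(record.review_stage, record.guideline_version)
```

With records like these attached at commit time, "what guideline version labeled this frame, and who reviewed it?" is a filter over the ledger rather than a multi-week dig through logs.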
FAQ:
Q: What should enterprise teams look for in a 3D point cloud annotation service for robotics programs?
Lidar returns vary by object material, weather condition, sensor type, and geography. Solid-state and flash lidar produce entirely different artifacts. The differentiator is whether annotators understand those differences and can validate point cloud labels against camera and radar streams before the data touches a model.
Q: What compliance standards should AV and robotics programs hold their annotation partners to?
ISO 26262 mandates full traceability from system design through deployment. ISO 27001 covers information security. TISAX governs automotive-specific data handling. SOC 2 sets the audit control baseline.
Q: How do annotation operations enforce quality standards across large workforces at production scale?
Consistency at production scale has to be built into the operation through active learning, multi-stage quality review and annotation lineage that makes any labeling decision auditable when a perception failure occurs. Manual oversight and informal coordination work at limited scale. They do not survive the expansion to millions of frames across geographies and sensor configurations.