Interview

Building Vision Systems for Millimeter-Perfect Robot Coordination with Hrishikesh Tawade

By Tom Allen

Hrishikesh Tawade specializes in teaching robots to see with precision that matters. He builds AI systems that guide robots swapping electric vehicle batteries with 3-millimeter accuracy in live production environments. His work combines lidar point clouds, RGB-D cameras, and deep learning models that must perform reliably under industrial conditions where lighting shifts, hardware ages, and edge cases appear without warning.

Hrishikesh previously developed pedestrian tracking pipelines at Quanergy Systems using sensor fusion and classical computer vision techniques. His career spans the spectrum from autonomous vehicle perception to precision industrial manipulation, giving him perspective on where traditional algorithms still outperform neural networks and where deep learning has changed the game entirely.

In this conversation, Hrishikesh discusses the technical realities of deploying ML models on embedded systems, the challenges of coordinating multiple AI-driven robots in shared spaces, and why achieving millimeter-level accuracy in production requires more than just training better models. For anyone working at the intersection of computer vision, robotics, and real-time AI systems, this interview offers practical insight into what it takes to move from lab demos to 24/7 industrial deployment.

For those unfamiliar with the field, can you explain what robotic perception means in the context of EV battery swapping and how AI changes the approach compared to traditional rule-based vision systems?

For EV battery swapping, there could be multiple perception tasks depending on the approach used for swapping. But primarily, there is detecting whether a vehicle is present, identifying license plates or other markers (depending on the company) to recognize the vehicle, localizing the car and battery modules with respect to the station, and monitoring the space in and around the swapping station to prevent unwanted intrusion during a swap. Compared to rule-based approaches that rely on edges, templates, and thresholds, using CNNs for these tasks is much more reliable, adaptable, and scalable. CNNs learn features directly from data, making them robust against real-world edge cases like glare, dust, debris on the camera lens, occlusions on license plates, and variations in the battery modules that need to be swapped. For example, when detecting markers on battery modules, even if the markers are scratched or faded, a traditional algorithm would simply fail, but an AI-based approach can tell what the marker is and how confident it is about it.

You’ve implemented deep learning models that need to run in real-time on embedded systems in battery swapping stations. How do you optimize neural networks to achieve 3mm accuracy while meeting strict latency and power constraints?

Optimizing just the neural networks isn’t enough; it is a multi-step process. You have to start with sensing and calibration. Bad calibration is difficult to fix later with neural networks. We calibrated cameras using hand-eye calibration and Brown-Conrady techniques to achieve a reprojection error under 0.3 px. We also had a calibration drift detection system wherein we recalibrated every 6 hours to keep the errors within 3 mm. Then, for detection, we initially used sub-pixel fiducial markers and LED targets, and later shifted to custom-made markers. We also corrected the depth sensors for any thermal drift.
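
As a rough illustration of that calibration step, here is a minimal sketch of intrinsic calibration with OpenCV's Brown-Conrady distortion model plus a reprojection-error check. The checkerboard geometry, image folder, and the 0.3 px threshold are illustrative assumptions, not the exact setup described above.

```python
import cv2
import numpy as np
import glob

# Assumed 9x6 inner-corner checkerboard with 25 mm squares; replace with your target geometry.
PATTERN = (9, 6)
SQUARE_SIZE_M = 0.025

# 3D reference points of the board in its own frame (z = 0 plane).
objp = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2) * SQUARE_SIZE_M

obj_points, img_points = [], []
for path in glob.glob("calib_images/*.png"):  # hypothetical image folder
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, PATTERN)
    if not found:
        continue
    # Refine corner locations to sub-pixel accuracy.
    corners = cv2.cornerSubPix(
        gray, corners, (11, 11), (-1, -1),
        (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
    obj_points.append(objp)
    img_points.append(corners)

# calibrateCamera fits the pinhole model plus Brown-Conrady distortion (k1, k2, p1, p2, k3).
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)

print(f"RMS reprojection error: {rms:.3f} px")
if rms > 0.3:  # threshold analogous to the one mentioned above
    raise RuntimeError("Calibration too coarse; recapture images or flag for recalibration")
```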

For the neural networks, we started with smaller models to meet latency constraints and increased model size only when underfitting occurred. We also used integral heatmaps for sub-pixel keypoints (~0.1 px) and PnP + ICP fusion, which yielded an accurate 6D pose. We also relied on a minimum of 3 markers with decent spacing between them, which further helped average out any errors from ICP. Remember that the markers were custom designed so that enough keypoints remained visible under occlusion for ICP to reach the global minimum.
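
To make the keypoint-to-pose step concrete, here is a minimal sketch that assumes you already have sub-pixel 2D keypoint detections and the corresponding 3D marker geometry. The marker layout, intrinsics, and RANSAC parameters are illustrative, and the ICP refinement stage on the point cloud is omitted.

```python
import cv2
import numpy as np

# Known 3D positions of marker keypoints in the battery-module frame (metres, assumed layout).
object_pts = np.array([
    [0.00, 0.00, 0.0],
    [0.10, 0.00, 0.0],
    [0.10, 0.05, 0.0],
    [0.00, 0.05, 0.0],
], dtype=np.float64)

# Sub-pixel 2D detections from the keypoint network (pixels, illustrative values).
image_pts = np.array([
    [412.3, 230.1],
    [620.7, 228.9],
    [622.4, 341.5],
    [410.8, 343.0],
], dtype=np.float64)

K = np.array([[1400.0, 0.0, 640.0],
              [0.0, 1400.0, 360.0],
              [0.0, 0.0, 1.0]])   # intrinsics from calibration
dist = np.zeros(5)                 # assume keypoints are already undistorted

# RANSAC-PnP gives a coarse 6D pose that ICP on the depth data would then refine.
ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    object_pts, image_pts, K, dist, reprojectionError=2.0)
if not ok:
    raise RuntimeError("PnP failed; not enough consistent keypoints")

R, _ = cv2.Rodrigues(rvec)   # rotation of the marker with respect to the camera
print("t (m):", tvec.ravel())
```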

For edge deployments, we optimized with INT8 quantization-aware training, structured pruning, and distillation from a larger teacher model. One can also use batch-norm fusion, TensorRT compilation, etc., to reduce memory footprint and latency. In the end, we achieved an average 12-18 ms perception latency, 30-40 ms total cycle time (capture + pre-processing + inference + post-processing + sending the message to the server), and consistent <= 3 mm accuracy.
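
As a rough sketch of the quantization-aware training step, here is PyTorch's eager-mode QAT API on a toy stand-in model rather than the actual network described above; the architecture and backend choice are assumptions.

```python
import torch
import torch.nn as nn

# Toy stand-in for a perception backbone; the real network is larger.
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()      # marks the INT8 entry point
        self.conv = nn.Conv2d(3, 8, 3, padding=1)
        self.relu = nn.ReLU()
        self.dequant = torch.quantization.DeQuantStub()  # back to float at the output

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.conv(x))
        return self.dequant(x)

model = TinyNet().train()
model.qconfig = torch.quantization.get_default_qat_qconfig("fbgemm")
torch.quantization.prepare_qat(model, inplace=True)  # inserts fake-quantization observers

# ... run your usual training loop here (forward, loss, backward, optimizer step),
# so the weights adapt to the simulated INT8 rounding ...

model.eval()
int8_model = torch.quantization.convert(model)  # folds observers into real INT8 ops
```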

At Quanergy, you built pedestrian tracking pipelines using techniques like IMM-EKF and JPDA. When do you choose classical algorithms versus deep learning approaches for perception tasks?

Classical tracking algorithms like IMM-EKF, JPDA, etc., are best when you already know how the objects you track (cars, trucks, pedestrians) are going to move, e.g., in a straight line, at constant velocity, or with a constant turn rate. Traditional methods are more predictable and interpretable, which is important for real-time guarantees and safety guarantees. I have also found that algorithms like JPDA handle large numbers of targets (more than 300) with reasonable latency, and they handle messy associations under noisy data decently.
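
For readers unfamiliar with these filters, here is a minimal constant-velocity Kalman filter in NumPy, the kind of motion model an IMM-EKF would mix with constant-turn models. The frame rate and noise values are illustrative.

```python
import numpy as np

dt = 0.1  # assume 10 Hz sensor frames

# State: [x, y, vx, vy]; constant-velocity transition.
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1]], dtype=float)
H = np.array([[1, 0, 0, 0],   # we only measure position
              [0, 1, 0, 0]], dtype=float)
Q = np.eye(4) * 0.05          # process noise (tuned per target class)
R = np.eye(2) * 0.2           # measurement noise of the detector

x = np.zeros(4)               # initial state
P = np.eye(4)                 # initial covariance

def kf_step(x, P, z):
    # Predict with the constant-velocity model.
    x = F @ x
    P = F @ P @ F.T + Q
    # Update with the new position measurement z = [x_meas, y_meas].
    y = z - H @ x                    # innovation
    S = H @ P @ H.T + R              # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)   # Kalman gain
    x = x + K @ y
    P = (np.eye(4) - K @ H) @ P
    return x, P

for z in [np.array([1.0, 0.5]), np.array([1.2, 0.6]), np.array([1.4, 0.7])]:
    x, P = kf_step(x, P, z)
print("estimated position/velocity:", x)
```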

We found deep learning approaches like BEV fusion or graph neural networks handled associations extremely well, as they were able to learn patterns and make associations using appearance similarity. Furthermore, deep learning approaches give you measurement-to-state mapping, e.g., raw camera/point cloud to 3D boxes. They can also decode context, like intent or object classification, much better than traditional algorithms.

Training AI models in a lab is one thing. Deploying them in a 24/7 production environment is another. What are the biggest challenges in getting machine learning models to work reliably when lighting conditions change, hardware degrades, or edge cases appear that weren’t in your training data?

Moving from lab to production is a mix of planning and learning. Over time, collecting data for edge cases will improve your model, but you need strategies and recoveries in place to bridge the gap. Human-in-the-loop approaches, wherever possible, help significantly. Also, if you have a previous system working in production, always deploy the new model in shadow mode first, monitor performance and KPIs, and then do a canary-style deployment to de-risk production uptime.

One needs to log uncertain cases, label them, and retrain the model. In my experience, combining bright, medium, and dark exposures, also known as HDR, helps preserve detail in both the highlights and the shadows of an image. Furthermore, augmenting training data with variations like blur, glare, scratches, etc., pays dividends. We also had a health monitoring service that would detect sensor overheating, blurry lenses, sensor dropouts, latency spikes, FPS drops, etc., to catch issues early. We also followed a regular in-person servicing schedule to detect and replace worn-out parts. For hardware degradation, sensor blackouts, etc., plan for fail-operational or fail-safe modes in advance. Quick model rollback strategies (< 1 min) also need to be in place.
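
A minimal sketch of the "log uncertain cases for labeling" idea follows; the confidence band, paths, and record format are hypothetical.

```python
import json
import time
from pathlib import Path

CONF_LOW, CONF_HIGH = 0.35, 0.80         # hypothetical "not sure" confidence band
LOG_DIR = Path("/data/uncertain_cases")  # hypothetical mount feeding the labeling queue

def maybe_log_for_labeling(frame_id: str, detections: list) -> None:
    """Persist frames whose detections fall in the uncertain band for later labeling."""
    uncertain = [d for d in detections if CONF_LOW <= d["score"] <= CONF_HIGH]
    if not uncertain:
        return
    record = {
        "frame_id": frame_id,
        "timestamp": time.time(),
        "detections": uncertain,
    }
    LOG_DIR.mkdir(parents=True, exist_ok=True)
    with open(LOG_DIR / f"{frame_id}.json", "w") as f:
        json.dump(record, f)

# Example: one faded marker the model is unsure about gets queued for review.
maybe_log_for_labeling("cam0_000123", [{"label": "marker", "score": 0.52}])
```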

Battery swapping requires coordinating multiple AI-driven robotic systems simultaneously. How do you approach multi-agent coordination when each agent is making decisions based on ML models, and what happens when their predictions conflict?

To make multiple robots coordinate, one needs a shared world model. This could be a pose graph or an occupancy grid. The robots populate this model with their state and what they see (perception). The robots also need to be time-synchronized, for which PTP (Precision Time Protocol) can be used. And then one needs a communication layer for collaboration, where gRPC, DDS, etc., can be used.
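
As a toy illustration of the shared world model idea (ignoring time synchronization and the transport layer), here is a minimal in-process occupancy grid that multiple robots could write into; all names, sizes, and resolutions are made up.

```python
import numpy as np

class SharedOccupancyGrid:
    """Toy shared world model: robots mark the cells they currently observe as occupied."""

    def __init__(self, size_m: float = 10.0, resolution_m: float = 0.05):
        n = int(size_m / resolution_m)
        self.resolution = resolution_m
        self.grid = np.zeros((n, n), dtype=np.uint8)  # 0 = free, 1 = occupied
        self.robot_poses = {}                          # robot_id -> (x, y)

    def report(self, robot_id: str, pose_xy: tuple, occupied_points_xy: list) -> None:
        """Each robot pushes its own pose plus the obstacle points it perceives."""
        self.robot_poses[robot_id] = pose_xy
        for x, y in occupied_points_xy:
            i, j = int(x / self.resolution), int(y / self.resolution)
            if 0 <= i < self.grid.shape[0] and 0 <= j < self.grid.shape[1]:
                self.grid[i, j] = 1

    def is_free(self, x: float, y: float) -> bool:
        i, j = int(x / self.resolution), int(y / self.resolution)
        return self.grid[i, j] == 0

world = SharedOccupancyGrid()
world.report("gantry_robot", (1.0, 2.0), [(3.2, 4.1), (3.25, 4.1)])
world.report("battery_shuttle", (5.0, 2.5), [(3.2, 4.15)])
print(world.is_free(3.2, 4.1))   # False: both robots agree this cell is occupied
```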

For multi-agent coordination, there are all kinds of approaches, ranging from centralized optimizers to decentralized and rule-based schemes. For battery swapping, a centralized planner makes more sense; it assigns tasks and solves any optimization problems. One thing to remember, though, is that this creates a SPOF (Single Point Of Failure), and a comms failure can make it fragile as well.

In case there are conflicts, the centralized planner generally prefers the prediction that is least uncertain, or falls back on tie-breaking rules, e.g., priority-based rules, random selection, or workspace/region-based yielding policies.
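
Here is a toy sketch of that conflict-resolution logic, with made-up robot IDs and a simple "lowest uncertainty, then priority" rule; it is not the planner described above.

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    robot_id: str
    target_pose: tuple    # proposed pose of the shared object
    uncertainty: float    # e.g. trace of the pose covariance
    priority: int         # lower number = higher priority (tie-breaker)

def resolve_conflict(predictions: list) -> Prediction:
    """Centralized planner picks the least uncertain prediction, breaking ties by priority."""
    return min(predictions, key=lambda p: (p.uncertainty, p.priority))

winner = resolve_conflict([
    Prediction("arm_A", (1.20, 0.40, 0.10), uncertainty=0.008, priority=2),
    Prediction("arm_B", (1.22, 0.41, 0.10), uncertainty=0.015, priority=1),
])
print("planner trusts:", winner.robot_id)
```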

You’ve worked with point cloud data from lidars, RGB images, and depth sensors. How has deep learning changed the way you process and fuse data from these different modalities compared to classical computer vision pipelines?

Traditionally, sensor fusion meant working directly in measurement space: projecting LiDAR points onto camera images, running ICP to align depth with geometry, or using hand-crafted features like SIFT and edge detectors. These “early fusion” approaches were weak against small calibration errors, glare, occlusion, etc. Deep learning changed this by enabling networks to directly ingest raw RGB, depth, and LiDAR inputs, learning correlations automatically rather than relying on brittle rules.
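
For context, this is roughly what the classical "project LiDAR points onto the image" step looks like, with assumed intrinsics and an assumed LiDAR-to-camera extrinsic transform.

```python
import numpy as np

K = np.array([[1400.0, 0.0, 640.0],
              [0.0, 1400.0, 360.0],
              [0.0, 0.0, 1.0]])           # camera intrinsics (assumed)
T_cam_lidar = np.eye(4)                    # LiDAR -> camera extrinsics from calibration
T_cam_lidar[:3, 3] = [0.0, -0.08, -0.12]   # illustrative lever arm

def project_lidar_to_image(points_lidar: np.ndarray) -> np.ndarray:
    """(N, 3) points in the LiDAR frame -> (M, 2) pixel coordinates, dropping points behind the camera."""
    pts_h = np.hstack([points_lidar, np.ones((len(points_lidar), 1))])  # homogeneous coords
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]
    pts_cam = pts_cam[pts_cam[:, 2] > 0]          # keep points in front of the camera
    uv = (K @ pts_cam.T).T
    return uv[:, :2] / uv[:, 2:3]                 # perspective divide

pixels = project_lidar_to_image(np.array([[5.0, 0.2, 0.1], [8.0, -1.0, 0.3]]))
print(pixels)
```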

The biggest leap, in my view, has been in middle fusion at the feature level. Instead of matching hand-crafted descriptors, each modality has its own backbone (e.g., sparse 3D CNNs for LiDAR, ResNets for images), and their latent features are aligned in a shared space. Techniques like PointPainting (attaching image semantics to LiDAR points), voxel/BEV fusion, or attention-based cross-modal transformers allow the system to learn which visual cues matter for which 3D regions. This has made deep learning approaches robust against occlusions, missing data (due to sensor brownouts, packet drops, etc.), or sensor noise that classical pipelines couldn’t handle.

In the case of late fusion, where we combine detections or classifications from each modality, the traditional approach was mostly rule-based, like weighting outputs by confidence. Deep learning adds learned weighting and uncertainty modeling, so the system knows when to trust LiDAR over vision (fog vs. glare, for instance). Thus, deep learning has shifted fusion from rule-driven heuristics to data-driven representations.

When you’re building perception systems for industrial automation, how do you balance model accuracy, inference speed, and the ability to detect and handle failures? What does your ML ops pipeline look like for robotics applications?

I start by understanding the requirements and constraints of the problem the perception system is trying to solve. Let’s take the example of a defect detection system. We need to understand its acceptable end-to-end latency budget, the acceptable false-positive rate, accuracy, recall, computational limitations, cost limitations, required uptime, etc. This exercise defines the error budget and the acceptable trade-offs. In the case of a defect detection system, accuracy might trump latency.

While selecting a model, I generally prefer the smallest network that meets the bar. Then I compress it further with distillation or quantization. At runtime, I add safeguards like confidence thresholds, temporal filtering (moving average, exponential filter, etc.), and sanity checks to catch drift or bad outputs.
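
A minimal sketch of those runtime safeguards, exponential smoothing plus a crude jump check, follows; the smoothing factor, confidence cutoff, and jump limit are illustrative constants, not values from the deployed system.

```python
class PoseGuard:
    """Exponentially smooths a scalar estimate and rejects implausible jumps."""

    def __init__(self, alpha: float = 0.3, max_jump_mm: float = 5.0):
        self.alpha = alpha              # smoothing factor (assumption)
        self.max_jump_mm = max_jump_mm  # largest believable frame-to-frame change
        self.smoothed = None

    def update(self, value_mm: float, confidence: float):
        # Reject low-confidence outputs or sudden jumps; keep the last good estimate.
        if confidence < 0.5 or (
            self.smoothed is not None
            and abs(value_mm - self.smoothed) > self.max_jump_mm
        ):
            return self.smoothed
        # Exponential moving average of accepted measurements.
        if self.smoothed is None:
            self.smoothed = value_mm
        else:
            self.smoothed = self.alpha * value_mm + (1 - self.alpha) * self.smoothed
        return self.smoothed

guard = PoseGuard()
for v, c in [(100.2, 0.9), (100.5, 0.92), (160.0, 0.95), (100.7, 0.9)]:
    print(guard.update(v, c))   # the 160.0 jump is rejected as a bad output
```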

For robotics applications, my MLOps pipeline looks like the following:

1) Training:

a) Define data and labels clearly.
b) Train with fixed, versioned configs that contain your hyperparameters. Generally, this is in .yaml or .json format.
c) Set random seeds for PyTorch/TensorFlow, NumPy, CUDA, etc., to ensure deterministic runs.
d) I have also found tools like Weights & Biases (wandb) or MLflow very useful for logging configs and metrics. This allows you to compare runs.
e) Also remember to lock the training and inference environments, i.e., Python packages, CUDA/cuDNN, etc.

2) Export & Optimization:

a) Freeze the trained model into ONNX format, which ensures the model can run outside the training framework (a rough sketch of this step appears after this list).
b) To optimize for latency and memory, use TensorRT (for NVIDIA GPUs) or TVM (a cross-hardware compiler).
c) In some cases, I have also used layer fusion, kernel auto-tuning, and precision lowering (FP16/INT8) to optimize the model further.
d) But make sure you validate the optimized model on a held-out dataset to ensure the accuracy drop is < 1%.

3) HIL Testing:

a) In this stage, we deploy the model on the target edge device and feed in recorded sensor data (video, lidar data) at real-time frame rates.
b) Then we measure latency, FPS, power usage, thermal load, outputs, etc., to validate the constraints and requirements.
c) One can also stress test the model under degraded hardware conditions like low battery or a throttled CPU, or inject faults like sensor dropouts, sensor brownouts, etc.

4) Rollout/Deployment:

a) Initially deploy the model in shadow mode, log its predictions, and compare them with live production outputs to detect any drift or false positives.
b) Then release the model to a small subset of robots/stations (canary deployment).
c) This subset can be a percentage of robots (e.g., 1% or 5% of the fleet), a fraction of uptime (e.g., 1% of uptime across all robots), or based on the geographic locations of the robots, low-risk shifts in factories, low-risk production lines, etc.
d) We monitor our KPIs and roll back instantly if metrics exceed thresholds.
e) Only after passing the canary stage do all robots/factories receive the updated model. One can also decide to do this in phased waves to reduce the blast radius.
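
As referenced in step 2 above, here is a hedged sketch of exporting a model to ONNX and sanity-checking that the exported graph matches the original PyTorch outputs. The model, file name, and tolerance are placeholders.

```python
import numpy as np
import torch
import torch.nn as nn
import onnxruntime as ort

# Placeholder for the trained perception network.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU()).eval()
dummy = torch.randn(1, 3, 64, 64)

# 1) Freeze the model into ONNX so it can run outside the training framework.
torch.onnx.export(
    model, dummy, "model.onnx",
    input_names=["image"], output_names=["features"],
    opset_version=13,
)

# 2) Parity check: compare ONNX Runtime output against the original PyTorch output.
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
onnx_out = sess.run(None, {"image": dummy.numpy()})[0]
with torch.no_grad():
    torch_out = model(dummy).numpy()

max_diff = np.abs(onnx_out - torch_out).max()
print(f"max abs difference: {max_diff:.2e}")
assert max_diff < 1e-4, "Exported model drifted from the reference; do not promote it"
```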

Where do you see the intersection of deep learning, robotics, and electric vehicle infrastructure heading? What AI capabilities would need to mature to make autonomous battery swapping scale globally?

For global scaling of swapping stations, robots must become self-reliant. That means auto-calibrating vision and manipulators without manual measurement or the use of built-in markers. To make existing swapping stations compatible with new battery pack shapes, I think foundation models with diffusion- or transformer-based policies will be key, as generic autonomy can adapt quickly to different battery designs and to different tasks along the battery swapping pipeline. Also, since physical swapping stations are costly to build and expensive to maintain (> $500k), high-fidelity digital twins are key to reliably covering edge cases like deformations, end-effector contact issues, etc.

I also strongly believe that the battery-swapping ecosystem needs interoperability standards to create common interfaces between vehicles and stations. This is what will let multiple automakers and infrastructure providers cooperate, making cross-OEM swapping practical worldwide.

You’ve moved from perception for autonomous vehicles to industrial robotics. How do the AI/ML requirements differ between these domains, and what surprised you most about applying deep learning to precision manipulation versus navigation?

In the AV industry, perception tolerances are measured in decimeters at tens of meters, since what matters is detecting cars, pedestrians, and lanes quickly and reliably. Even if bounding boxes shift by a few centimeters, it is fine. Industrial robotics, on the other hand, demands tolerances within millimeters with low jitter. The “closed-world” nature of factories, I would say, makes the conditions the robot operates in a lot more predictable, but that doesn’t mean the challenges are easier. For example, in automotive factories, AMRs and manipulators often have to deal with greasy floors, stray parts, and sparks.

Also, industrial automation is bound by strict safety compliance. The robots and machines must handle safety-critical edge cases out of the box. Compliance in the AV industry, on the other hand, is still evolving. What surprised me the most is how data priorities change. In AV perception, scale wins: massive datasets covering rare long-tail cases are essential. But in manipulation, success comes more from label quality and calibration stability than raw data volume. A small, carefully collected dataset with precise 3D geometry and consistent calibration outperforms large, noisy datasets when the goal is sub-millimeter or even micron-level precision.

That shift taught me that for precision robotics, the bottleneck isn’t “how much data do you have?” but “how trustworthy is your data, and how stable is your sensor calibration?” which is a very different mindset from the AV world. 

You’ve progressed from Computer Vision Intern to Tech Lead, now owning ML model development and mentoring others. What advice would you give to engineers trying to move from implementing existing ML models to architecting AI systems and making decisions about model selection and deployment?

The most significant learning for me was the mindset shift. Instead of asking “Which model should I use?” one should start by collecting requirements, defining scope (from MVP to being production-ready), and then owning a big chunk of the scope. As a lead, your job is not just to deliver a working model but to define a repeatable process that delivers output consistently and can scale.

I always suggest starting small and keeping solutions simple. This lets you ship features and products faster and course-correct faster from what you learn in production. Always remember to design a rollback or safe mode in case things go wrong. You should clearly explain trade-offs to the controls, safety, or production/operations teams and create clear escalation tiers with predefined thresholds so failures can be recovered from faster.
