
A hat looks like a hat from the side, but a flat disc from above. For humans, that’s obvious. For machine learning models trained on 2D images, it’s where understanding breaks down. Phani Harish Wajjala, a Principal Machine Learning Engineer who leads Content Understanding for a large-scale avatar marketplace, spends his days solving exactly this kind of problem. His team builds the AI that determines what a 3D asset is, whether it violates platform policies, and how it should surface to users.
In this interview, Phani breaks down for us why 3D computer vision remains harder than its 2D counterpart, how his team compensates for limited training data, and what he sees as the next frontier for avatar-scale AI.
Phani, how would you explain 3D computer vision to someone outside the field, and why does it matter for virtual environments and digital marketplaces?
I usually explain it by comparing it to how we look at photos. In the early days, computer vision (using algorithms like SIFT or HOG) aimed to identify an object regardless of its position in an image. The big breakthrough came with Deep Learning and AlexNet; we realized that if we threw enough internet data at a model, it could learn to handle those variations naturally.
But 3D is a different beast. We don’t have that same massive library of data, and the math gets much harder. In 2D, a picture is just a picture. In 3D, you have a “frame of reference” problem. A hat looks like a hat from the side, but it might look like a flat disc from the top. If the computer doesn’t understand that those two views are the same object, it breaks.
For a marketplace like Roblox, this matters because we aren’t just selling images; we’re selling things that have to work. If our systems can’t understand the 3D geometry (how a shirt drapes, how an item fits), we can’t effectively moderate it, price it, or recommend it.
You oversee the ‘Content Understanding’ pod. Can you break down what that specifically entails in the context of an avatar marketplace?
The Avatar Marketplace lets users create and sell 3D assets. My pod’s job is basically to make sense of that huge influx of content.
First, there are the fundamentals: Safety and Policy. We have to ensure an asset isn’t violating our rules before anyone sees it. Then there’s taxonomy – figuring out what the item actually is so we can organize the catalog.
But beyond that, our “Content Intelligence” platform generates the signals that run the economy.
You touched on the scarcity of 3D data earlier. In the world of LLMs and 2D image generators, we hear about ‘trillions of parameters’ and massive datasets. How do you build high-performing AI for 3D understanding when you don’t have the internet’s worth of training data?
We can’t just throw everything into one giant model because the data isn’t there. Instead, we use a “model cascade”, chaining smaller, specialized models together with LLMs.
Take asset fit, for example. We don’t need a supercomputer to know if a shirt intersects with a body. We use a fast, geometric model to check that first (using a standard mannequin as an anchor). That filters out the obvious bad fits. Then we pass the tricky ones to an LLM to figure out the nuances.
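To make the geometric stage of that cascade concrete, here is a minimal sketch of a fast interpenetration check, assuming the asset and mannequin meshes are already aligned to a shared rig and that the trimesh library is available. The file paths, sample count, and thresholds are illustrative assumptions, not Roblox’s actual pipeline.

```python
# Hypothetical pre-filter: does the asset's surface sink too far inside the mannequin?
import numpy as np
import trimesh

def grossly_interpenetrates(asset_path, mannequin_path,
                            n_samples=2000, tolerance=0.01, max_ratio=0.05):
    """Return True if too much of the asset surface lies inside the mannequin."""
    asset = trimesh.load(asset_path, force="mesh")
    mannequin = trimesh.load(mannequin_path, force="mesh")

    # Sample points on the asset surface and measure signed distance to the
    # mannequin: positive values mean the point is inside the mannequin volume.
    points, _ = trimesh.sample.sample_surface(asset, n_samples)
    signed = trimesh.proximity.signed_distance(mannequin, points)

    inside_ratio = np.mean(signed > tolerance)
    return inside_ratio > max_ratio

# Obvious bad fits stop here; only the ambiguous cases are passed to the LLM stage.
```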
We do the same for classification. A shoe isn’t always a shoe. If a user puts it on their head, it’s a “hat” with a funny theme. A standard model misses that context. So, we use a pipeline: we find the best camera angle, render it on a mannequin, extract the geometry, and compare it to user search signals to see how the community describes it. Finally, an LLM considers all those signals to determine what the item really is.
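A simplified sketch of that multi-signal classification idea is below. The `render`, `extract_geometry`, `fetch_search_terms`, and `ask_llm` callables are hypothetical placeholders for the rendering, geometry, search-signal, and LLM services the answer describes; the prompt wording is illustrative only.

```python
# Hypothetical pipeline: gather heterogeneous signals, let an LLM make the final call.
from dataclasses import dataclass

@dataclass
class AssetSignals:
    rendered_view: bytes      # best-angle render of the asset on a mannequin
    geometry_summary: dict    # e.g. bounding box, attachment point, symmetry
    search_terms: list[str]   # how the community describes and finds the item

def classify_asset(asset_id, render, extract_geometry, fetch_search_terms, ask_llm):
    signals = AssetSignals(
        rendered_view=render(asset_id),               # pick camera angle, render on mannequin
        geometry_summary=extract_geometry(asset_id),  # raw shape features
        search_terms=fetch_search_terms(asset_id),    # community language signals
    )
    prompt = (
        "Given this mannequin render, the geometry summary "
        f"{signals.geometry_summary}, and user search terms "
        f"{signals.search_terms}, what is this item? "
        "A shoe worn on the head should be classified as a hat."
    )
    # The LLM weighs all signals together; geometry alone would call it a shoe.
    return ask_llm(prompt, image=signals.rendered_view)
```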
You manage a team that sits at the intersection of AI research and product engineering. Those are often two very different cultures. How do you structure your ‘pod’ to ensure research actually makes it into production?
We structure the team around strong ML Engineering principles. We don’t treat research and engineering as separate silos.
First, we focus on translation: taking a product requirement and turning it into a quantifiable math objective. If we can’t measure it, we can’t build it. Second, we keep our architecture modular. Because “Content Understanding” covers so many different problems, we separate our heavy infrastructure from the experimental stuff.
This lets us follow a “Fail Fast” philosophy. A team member can build a prototype module, test it, and either ship it or scrap it without breaking the whole system. We plan for things to change constantly, so we build the systems to be swapped out from day one.
Generative AI is obviously the hot topic. As you build these understanding systems, do you see a future where ‘Content Understanding’ and ‘Content Generation’ merge?
I see understanding as the prerequisite for good generation. Roblox is always trying to make creation easier, but creating for avatars is harder than just making a pretty 3D model because of interaction.
Right now, we handle things like Layered Clothing that deforms on the body. But I hope we move toward a future where creators can generate assets that function like they do in the real world – clothes you can fold, mugs you can drink from, guitars you can play. To generate an object that actually functions correctly, the AI must deeply understand what that object is first.
Your work directly impacts the marketplace. When you improve ‘Content Understanding,’ how does that actually change the experience for a user or a creator on Roblox?
The biggest impact is removing friction when users do something unexpected.
For example, after we opened up UGC creation in 2024, we saw a huge wave of creators making “speech bubbles” – static meshes with text on them. They didn’t fit into our existing categories like “Hat” or “Shirt,” so they were getting buried.
We deployed an “open-set recognition” system, which is basically a model that can detect entirely new clusters of items that it hasn’t seen before. It identified these bubbles, and now we’re launching a dedicated “Props” hierarchy (for speech bubbles, auras, and companions). Instead of fighting the system to sell their items, creators now have a dedicated home for them.
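As a rough illustration of the open-set idea, the sketch below flags items whose embeddings sit far from every known category centroid and then clusters them to surface candidate new categories, the way the speech-bubble wave surfaced. The embeddings, centroids, and thresholds are assumed inputs, and the clustering choice (DBSCAN) is an illustrative stand-in, not necessarily what the production system uses.

```python
# Hypothetical open-set detector: find large clusters the existing taxonomy can't explain.
import numpy as np
from sklearn.cluster import DBSCAN

def find_emerging_categories(embeddings, known_centroids,
                             distance_threshold=0.6, min_cluster_size=50):
    # Distance from each item to its nearest known category centroid.
    dists = np.linalg.norm(
        embeddings[:, None, :] - known_centroids[None, :, :], axis=-1
    ).min(axis=1)

    # Keep only items the current categories do not explain well.
    unknown = embeddings[dists > distance_threshold]
    if len(unknown) == 0:
        return []

    labels = DBSCAN(eps=0.3, min_samples=min_cluster_size).fit_predict(unknown)

    # Each non-noise cluster is a candidate new category (e.g. "speech bubbles").
    return [unknown[labels == c] for c in set(labels) if c != -1]
```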
Safety is obviously paramount for Roblox. Identifying a prohibited image is one thing, but how do you moderate a 3D asset? Are the challenges different when the content has volume and physics?
It’s infinitely trickier. In 2D, what you see is what you get. In 3D, bad actors can hide things inside the geometry or use specific angles to mask violations. We have to use models to predict the “correct” view to capture images for our safety scanners or convert animations into video to check for inappropriate movement.
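One way to picture the multi-view scanning problem is the sketch below: render the asset from many candidate camera poses, score each render with an existing 2D safety model, and judge the asset by its most suspicious viewpoint. The `render` and `safety_scanner` callables and the pose grid are hypothetical placeholders, not the actual view-prediction model described above.

```python
# Hypothetical multi-view scan: a 3D asset is only as safe as its worst viewpoint.
import itertools

def scan_asset_views(asset, render, safety_scanner, n_azimuths=8):
    worst_score, worst_view = 0.0, None
    azimuths = [i * 360 / n_azimuths for i in range(n_azimuths)]
    elevations = [-30, 0, 30]

    for az, el in itertools.product(azimuths, elevations):
        image = render(asset, azimuth=az, elevation=el)  # 2D render of the 3D asset
        score = safety_scanner(image)                    # existing 2D violation score
        if score > worst_score:
            worst_score, worst_view = score, (az, el)

    return worst_score, worst_view
```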
But the hardest part is what I call combinatorial violations, where two items are fine separately, but bad when worn together. We’ve seen cases where users try to spell out slurs across different body parts (like wearing a shirt with letters and accessories with other letters). You can’t catch that by looking at the items in isolation.
To solve this, we treat the outfit like a graph of connected assets. We use graph embeddings to analyze the final combined look and mathematically determine which assets simply cannot be allowed to be equipped together.
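The sketch below is a deliberately simplified version of that idea: it pools per-asset embeddings pairwise and scores the combination, whereas the real approach described above uses graph embeddings over the whole equipped outfit. The `embed` model, `violation_score` classifier, and threshold are all assumptions for illustration.

```python
# Hypothetical combinatorial check: items fine alone can still be blocked as a pair.
import itertools
import numpy as np

def find_blocked_pairs(outfit_assets, embed, violation_score, threshold=0.9):
    """outfit_assets: list of asset ids equipped together on one avatar."""
    embeddings = {a: embed(a) for a in outfit_assets}   # per-asset (node) embeddings
    blocked = []

    for a, b in itertools.combinations(outfit_assets, 2):
        pooled = np.mean([embeddings[a], embeddings[b]], axis=0)  # combined-look representation
        if violation_score(pooled) > threshold:
            blocked.append((a, b))

    return blocked   # pairs that cannot be allowed to be equipped together
```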
Looking at your career, you’ve worked at major tech companies. What makes the engineering culture or the technical challenges at Roblox different from your previous experiences?
It’s the sheer number of unsolved problems. At other big tech companies, innovation often happens on a “lightweight” basis; you’re exploring new ideas, but often as a side project or an optimization.
At Roblox, because we are doing 3D co-experience at this scale, the solutions usually don’t exist yet. We can’t just grab an off-the-shelf model. Trying new solutions isn’t just encouraged here; it’s an operational requirement. It creates a culture where everyone is constantly looking to do things differently because we have to.
Finally, looking ahead to 2026, what is the ‘North Star’ for your team?
My goal is to move from a static catalog to a dynamic one. I want to minimize the effort it takes for creators to showcase their work and for users to find it.
Specifically, I want to change how people search. Right now, you search by keywords: “blue shirt.” In the future, I want users to search by idea. They should be able to describe the avatar they want to be, a mood, a character, a specific aesthetic, and have our system actively discover and assemble the assets to make that happen.
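One way “search by idea” could look in practice is the sketch below: embed a free-form description into the same space as catalog assets and retrieve the closest item for each avatar slot. The `embed_text` function and the per-slot asset embedding matrices are assumed inputs; this is a retrieval sketch, not the discovery system the team is building.

```python
# Hypothetical idea-to-outfit retrieval over a shared text/asset embedding space.
import numpy as np

def assemble_outfit(description, embed_text, asset_embeddings_by_slot, top_k=1):
    """asset_embeddings_by_slot: {"hat": (ids, matrix), "shirt": (ids, matrix), ...}"""
    query = embed_text(description)                      # e.g. "cozy autumn wizard"
    query = query / np.linalg.norm(query)

    outfit = {}
    for slot, (ids, matrix) in asset_embeddings_by_slot.items():
        normed = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
        scores = normed @ query                           # cosine similarity per asset
        best = np.argsort(scores)[::-1][:top_k]
        outfit[slot] = [ids[i] for i in best]             # best match for each slot

    return outfit
```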


