
What talking to three product leaders in AI taught me about AI evaluation

Hey! My name is James Aitkin, and I’ve been working on the Global App Testing “AI Strategy” – our dedicated service to leverage the crowd for AI organizations, AI GroundTruth.

Since we already work with some of the biggest logos in AI, you’d think that would be easy – but AI eval strategy changes every day, so I’m keeping a kind of diary of some of the things I’m learning along the way.

Recently, I spoke to three Product Leaders at AI-first and AI-adopter organizations to try to identify some of the challenges, traumas, and successes of working in AI. Here’s what I learned about evaluation and some further thoughts.

Insight #1: It’s always best to show your working

The company 

An HR technology org selling a B2B product with an AI feature into enterprise clients.

The challenge

This PM identified two challenges. First, he was worried about discriminatory outcomes in the HR use case. Second, he was wrestling with how building effort fits into buying cycles: in his world, something that looks a certain way in January can evolve beyond recognition by June. He needed to move faster for AI product-building to be viable, but he was nervous about shipping given the stakes.

What he learned

The way to accelerate procurement cycles and the way to reduce risk turned out to be the same. He put his AI evaluation process onto the product roadmap and opened it up for clients to inspect. Bringing customers in earlier drew them into the specific tradeoffs between release and evaluation, and facilitated a conversation about where they’d prioritise faster rollout versus waiting for validation to land.

One requirement of some of the larger enterprise buys was to test the product independently in real life, which is where GroundTruth was able to help, and how we got involved in the first place.

Insight #2: Risk is inevitable; what matters is proving readiness in the real world

The company

A startup with “AI” in its title, recently flush with investment.

The challenge

They don’t have the resources to thoroughly evaluate everything they take to market. The competitive pressure is simply too intense.

What he learned

Everything in this world involves risk, this leader explained. It’s not viable to pursue a process that eliminates risk entirely, and non-deterministic products are clearly riskier than those with predictable outputs. But part of the validation process had to be about better product/use-case fit. He wanted to know how well something would work before he shipped it.

His approach was layered: internal dogfooding first, then early-adopter access months ahead of a broader launch, and finally spot checks in specific real-life verticals to make sure the product was finding its market. “The hardest challenge was the final one”, he explained. “Outside-in, real user evidence that bridges the gap without requiring a fresh startup.”

Insight #3: Get ready for regulation, and use real-world validation to lead on it

The company

An enterprise technology organisation with mostly B2B products and services.

The challenge 

Enterprise technology is walking one of the most interesting lines between user trust, evaluation, and use case implementation. The product leader here described a process where context and required safety level are assessed ahead of rollout, and products are sorted into different streams with distinct evaluation criteria. For some of those streams, he valued the “real-life” component of AI evaluation. “Lots of evaluation steps are virtual, because that’s easier to build,” he pointed out.

The insight

The question for this business wasn’t just about today. The market is highly unstable, with lots of entrants, exits, and new technologies. The broader question is about tomorrow: what does a stable market in AI look like, and how can we best position ourselves to be ready for it?

One answer is regulation. If non-deterministic products are as risky as they seem, a move that is both competitive and safe is to proactively identify and adopt evaluation processes that can be advocated to regulators as standard. This was part of the strategic thinking behind the scenes at this organisation, and it’s a space where GroundTruth’s independent, structured, real-world validation methodology becomes especially valuable. It gives organisations not just a way to test today’s deployments, but an evidence framework they can point to when regulators come asking how they know their AI is ready.

My thoughts 

  • Validating “ready for the real world” is the gap no one has closed. Strong benchmarks, clean demos, and internal QA still don’t prove that a customer-facing AI will hold up when live users interact with it. Every organization we spoke with, from frontier labs to deployers, described the same blind spot: they lack independent, outside-in evidence that the experience actually works before they scale it.
  • Brand, trust, and revenue are on the line, not just model performance. The leaders we heard from aren’t worried about abstract AI safety. They’re worried about tone breakdowns, poor escalation, off-brand interactions, and multilingual inconsistencies eroding customer trust in high-visibility moments. The need isn’t to test whether the model can generate a response; it’s to test whether that response feels right.
  • The demand for independent proof is becoming a buying requirement. Enterprise customers are increasingly asking AI providers a pointed question: “How do you know this will work for our brand, our markets, and our customers?” Self-reported evals and vendor assurances aren’t enough. Independent, expert-led, real-world validation is becoming table stakes for deployment confidence.

Do you have some thoughts yourself? We’re open to submissions and suggestions, so you can ping me an email with what you’re thinking at [email protected]
