
AI agents are moving from flashy demos into real products. That shift creates a boring but critical problem: testing. Teams need to know if a change made an agent better, worse, or simply different. Today that work is slow, manual, and easy to mess up.
Scorecard thinks evaluation should run all the time. The startup has raised $3.75 million in seed funding from Kindred Ventures, Neo, Inception Studio, Tekton Ventures, and angels from OpenAI, Apple, Waymo, Uber, Perplexity, and Meta. The pitch: let developers run tens of thousands of tests per day against their agents and ship with evidence, not vibes.
Scorecard is already in use at Thomson Reuters, where it helps test and deploy CoCounsel, the company’s legal AI suite. “The reliability and effectiveness of CoCounsel Core is paramount,” said Tyler Alexander, Director of AI Reliability at Thomson Reuters. “Scorecard enables us to scale our continuous monitoring efforts and make them vastly more efficient.”
What Scorecard does
Scorecard is a managed evaluation engine with a dashboard on top. Teams define test suites in minutes through a no-code interface or a TypeScript SDK. Those suites can model end-to-end scenarios: prompts, tool calls, compliance checks, and performance benchmarks. You can point them at a live agent or a staging environment and then run them at high frequency.
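To make that concrete, here is a minimal sketch of what such a suite could look like in TypeScript. Everything in it is illustrative and assumed for the example rather than taken from Scorecard's documentation: the test-case shape, the runAgent stand-in, and the checks on answers and tool calls.

```typescript
// Illustrative only: this is not Scorecard's SDK, just a generic shape for an
// agent eval suite that checks answers and tool calls and reports pass/fail.
import assert from "node:assert";

interface AgentResult {
  answer: string;
  toolCalls: string[]; // names of tools the agent invoked
}

interface TestCase {
  name: string;
  prompt: string;
  check: (result: AgentResult) => void; // throws on failure
}

// Stand-in for the agent under test (live endpoint or staging deployment).
async function runAgent(prompt: string): Promise<AgentResult> {
  // ...call your agent here; stubbed so the sketch is self-contained...
  return { answer: `stub answer for: ${prompt}`, toolCalls: ["document_search"] };
}

const suite: TestCase[] = [
  {
    name: "contract review cites the termination clause",
    prompt: "Which clause governs termination in the attached MSA?",
    check: (r) => {
      assert.ok(r.toolCalls.includes("document_search"), "expected a document_search tool call");
      assert.match(r.answer, /termination/i, "answer should mention termination");
    },
  },
];

async function main(): Promise<void> {
  let failures = 0;
  for (const test of suite) {
    try {
      test.check(await runAgent(test.prompt));
      console.log(`PASS ${test.name}`);
    } catch (err) {
      failures += 1;
      console.error(`FAIL ${test.name}:`, err);
    }
  }
  // A non-zero exit code lets a CI job block a rollout on failures.
  process.exit(failures === 0 ? 0 : 1);
}

main();
```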
Every run streams into a web UI. You see pass/fail rates, regression alerts, diffs, and trend lines over time. If a new prompt template or model version quietly degrades accuracy on a core workflow, you get a clear signal. If a policy rule fails, you can block a rollout.
The key design choice: wire it into CI/CD. Any code commit, model swap, system prompt change, or configuration tweak can trigger a fresh round of tests. That removes the human bottleneck and keeps quality checks close to the change that caused them.
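In practice, that gate can be as simple as a script that runs after the commit-triggered eval and compares the new results against a stored baseline. The sketch below assumes a hypothetical JSON summary format (eval-results/current.json and baseline.json) and thresholds picked for illustration; it is not Scorecard's actual integration.

```typescript
// Hypothetical regression gate for CI: file names, metrics, and thresholds are
// illustrative assumptions, not part of any documented Scorecard workflow.
import { readFileSync, writeFileSync } from "node:fs";

interface RunSummary {
  passRate: number;     // fraction of suite cases that passed, 0..1
  p95LatencyMs: number; // 95th-percentile agent response latency
}

// In a real pipeline these would come from the eval run the commit triggered.
const current: RunSummary = JSON.parse(readFileSync("eval-results/current.json", "utf8"));
const baseline: RunSummary = JSON.parse(readFileSync("eval-results/baseline.json", "utf8"));

const MAX_PASS_RATE_DROP = 0.02; // tolerate up to 2 points of noise
const MAX_LATENCY_GROWTH = 1.25; // fail if p95 latency grows more than 25%

const regressions: string[] = [];
if (current.passRate < baseline.passRate - MAX_PASS_RATE_DROP) {
  regressions.push(`pass rate fell from ${baseline.passRate} to ${current.passRate}`);
}
if (current.p95LatencyMs > baseline.p95LatencyMs * MAX_LATENCY_GROWTH) {
  regressions.push(`p95 latency rose from ${baseline.p95LatencyMs}ms to ${current.p95LatencyMs}ms`);
}

if (regressions.length > 0) {
  console.error("Blocking rollout:\n" + regressions.map((r) => `  - ${r}`).join("\n"));
  process.exit(1); // a non-zero exit fails the CI job and stops the deploy
}

// Promote the current run as the new baseline once it clears the gate.
writeFileSync("eval-results/baseline.json", JSON.stringify(current, null, 2));
console.log("No regressions detected; rollout can proceed.");
```

A CI system only needs to run a script like this after the eval job; any regression fails the build before the change ships.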
The company also says it ships the first open-source TypeScript library for MCP evals, aimed at teams standardizing on the Model Context Protocol.
Why this matters now
We’re about to see a flood of agents across legal, finance, health, and insurance. Many will handle sensitive tasks. Evaluation today often means ad-hoc scripts, hand-curated datasets, and spreadsheet exports. It can take days or weeks to answer a simple question: “Did we make it better?”
That lag creates real risk. Blind spots slip into production. Compliance and safety issues go undetected. Teams delay features because they can’t get clean readouts. Scorecard’s value prop is speed and repeatability: make it cheap to test constantly so you find problems early and fix them fast.
Founder fit: simulation meets agents
Founder and CEO Darius Emrani led simulation and evaluation programs at Waymo and earlier at Uber. At Waymo, he helped grow the simulation/evaluation team from a few dozen engineers to 200+, and the infrastructure ran millions of scenario evaluations per day. That work supported the first commercial autonomous ride-hail operations, with service across multiple states and tens of millions of rides.
The lesson he brought over: scale and automation beat heroics. “At Waymo, millions of simulations made the difference between a working prototype and a category-defining product,” Emrani told TechCrunch. “With Scorecard we’re giving companies the tools to iterate at unprecedented velocity, reducing months of manual testing to seconds.”
Steve Jang of Kindred Ventures frames the need in simple terms: trust. “With millions of AI agents set to be deployed over the coming years in regulated industries like legal, finance, and healthcare, trust in AI isn’t optional, it’s mission-critical. The Scorecard team’s experience testing and deploying self-driving car systems at Waymo and Uber uniquely equips them to build robust and reliable evals for agents in virtual or physical environments.”
Massive opportunity to improve AI agents
Tech companies are pouring hundreds of billions of dollars a year into developing AI agents, and tools to test and evaluate them will play a massive part in making those agents ready for customers. Scorecard’s bet is that the winning teams will treat evaluation like infrastructure: always on, close to the code, boring by design. If it becomes the default CI layer for agents, the startup will sit in a valuable slot in the stack.
Availability: Developers can learn more and request access at scorecard.io.