
Pioneering AI evaluation company introduces industry-first platform combining observability, evaluation, and guardrails specifically designed for multi-agent systems
SAN FRANCISCO, July 17, 2025 /PRNewswire/ — Galileo, the leading AI reliability platform trusted for evaluations and observability by global enterprises including HP, Twilio, Reddit, and Comcast, today announced the launch of its comprehensive platform update for AI agent reliability, free for developers around the world. As AI agents become increasingly autonomous and multi-step, traditional evaluation tools struggle to detect their complex failure modes. Galileo’s new agent reliability solution is purpose-built for multi-agent AI systems and addresses this critical gap with agentic observability, evaluation, and guardrail capabilities working in concert.
What This Means for Enterprises
With 10% of organizations already deploying AI agents and 82% planning integration within three years, enterprises face a critical challenge: ensuring reliable AI agent performance at scale. Galileo’s platform addresses the high-stakes nature of enterprise AI deployment, where a single agent failure can expose sensitive data, cost real money, or damage customer relationships. Galileo’s new Luna-2 small language models (SLMs) deliver up to 97% cost reduction in production monitoring while enabling real-time protection against failures that could derail enterprise AI initiatives.
Ship Reliable AI Agents
“When your agent fails, you shouldn’t have to become a detective,” said Vikram Chatterji, CEO and Co-founder of Galileo. “Our agent reliability platform, fueled by our world-first Insights Engine, represents a fundamental shift from reactive debugging to proactive intelligence, giving developers the confidence to deploy AI agents that perform reliably in production.”
Enterprise customers and partners are already seeing a significant impact:
MongoDB: “As our customers deploy AI applications at scale, sophisticated monitoring is needed to build trust and reliability into these systems. Galileo’s platform, as part of the MAAP ecosystem, ensures AI applications and agents built on MongoDB can be deployed with added confidence, thanks to its sophisticated monitoring and evaluation capabilities.” – Abhinav Mehla, VP – Global Partner GTM Programs, MongoDB
CrewAI: “Trust doesn’t come from a flashy demo—it comes from agents that deliver the same high-quality results, over and over. That’s why we’ve partnered with Galileo: to help companies move fast and stay reliable. With CrewAI + Galileo, teams can deploy agents that don’t just work once; they work at scale, in the real world, where consistency actually matters.” – João Moura, CEO and Co-founder at CrewAI
Comprehensive Agent Reliability Solution
The platform tackles the unique challenges of agentic AI development, where a single bad action can expose sensitive data or cost real money, requiring guardrails that trigger before tools execute. Galileo’s platform powers custom real-time evaluations and guardrails with new Luna-2 small language models, giving developers targeted visibility into agent behavior across every step, tool call, and output.
Galileo’s Agent Reliability Platform delivers four key capabilities:
1. Agent Observability Reimagined
- Framework-agnostic Graph Engine that renders every branch, decision, and tool call
- Timeline View for execution flow analysis and bottleneck identification
- Conversation View for user-perspective debugging
2. Insights Engine for Automatic Failure Detection
Powered by bespoke evaluation reasoning models, the Insights Engine automatically identifies failure modes and surfaces actionable insights, including:
- Root cause analysis linking errors to exact traces
- Multi-agent coordination analysis
- Tool usage optimization recommendations
- Conversation flow and performance monitoring
3. Scalable Agentic Metrics
Purpose-built metrics covering flow adherence, task completion, conversation quality, and agent efficiency, with support for custom metrics using code-based approaches, LLM-as-a-judge, or Galileo’s new Luna-2 small language models.
4. Real-Time Production Guardrails
Luna-2-powered guardrails enable low-cost, real-time protection against malicious user behavior and agent mistakes without the expense of traditional LLM-based solutions.
Powered by Luna-2: Purpose-Built for Agents
Central to the platform are Galileo’s Luna-2 small language models, specifically designed for always-on agent evaluations. Unlike traditional approaches that rely on expensive, slow LLMs, Luna-2 enables:
- 10-20 sophisticated metrics running simultaneously
- Sub-200ms latency even at 100% sampling rates
- Enterprise-scale production monitoring at up to 97% lower cost
- Session-level metrics that capture the entire agent journey
“Multiturn agents never follow a single script, so your tests can’t either,” explained Atin Sanyal, CTO and Co-founder of Galileo. “Luna-2’s session metrics capture conversation quality, intent changes, efficiency, and compound-request resolution across the whole journey, not just individual turns.”
Enterprise Technology Partner Validation
Outshift by Cisco: “What Galileo is doing with their Luna-2 small language models is amazing. This is a key step to having total, live in-production evaluations and guardrailing of your AI system,” said Giovanna Carofiglio, Distinguished Engineer & Senior Director at Outshift by Cisco.
Elastic: “Galileo’s Luna-2 SLMs and evaluation metrics help developers guardrail and understand their LLM-generated data. Combining the capabilities of Galileo and the Elasticsearch vector database empowers developers to build reliable, trustworthy AI systems and agents.” – Philipp Krenn, Head of DevRel & Developer Advocacy, Elastic
Market Context and Availability
Recent research from Capgemini shows that 10% of organizations already use AI agents, with more than half planning implementation in 2025 and 82% planning integration within three years. As enterprises increasingly deploy autonomous AI systems for customer service, financial operations, and business automation, robust agent reliability becomes critical to avoid becoming one of the 40% of agentic AI projects that Gartner predicts will be canceled by the end of 2027.
The Galileo Agent Reliability Platform is available now as part of Galileo’s free tier, with additional enterprise features available through paid plans. The platform integrates with popular agent frameworks, including CrewAI, LangGraph, OpenAI’s Agents SDK, LlamaIndex, and Amazon Strands, leveraging open standards like OpenTelemetry for maximum compatibility.
To accompany the platform, Galileo today also released v2 of its viral AI agent leaderboard. The leaderboard evaluates models on domain-specific enterprise tasks, using purpose-built agent metrics and datasets covering banking, healthcare, insurance, investments, and telecoms. OpenAI’s GPT-4.1 tops the updated leaderboard, and Kimi K2 leads among open-source models.
About Galileo
Founded by AI veterans from Google AI, Apple Siri, and Google Brain, Galileo’s AI reliability platform is built with observability, evaluations, and guardrails to provide the trust layer for GenAI applications at global enterprises. With more than $68 million raised from investors including Battery Ventures, Scale Venture Partners, Databricks Ventures, Citi Ventures, and Hugging Face CEO Clement Delangue, Galileo is the leading AI research and evaluation organization empowering AI teams of all sizes to build, evaluate, and deploy trustworthy AI applications.
For more information about Galileo’s Agent Reliability Platform, visit galileo.ai or watch the announcement video at https://youtu.be/N_TsQ0sdV5k.
View original content to download multimedia: https://www.prnewswire.com/news-releases/galileo-announces-free-agent-reliability-platform-302508172.html
SOURCE Galileo