
AI models are posting record coding benchmark scores. But while the headlines focus on how reliably these models can write working code from a clean prompt, that’s only one small part of software engineering.
New research from BlueOptima suggests that the reality is more complex than the marketing implies, and the implications reach far beyond software development. BlueOptima’s AI Refactoring Evaluation (BARE) benchmark found that even the best models still struggle with the most common kind of software development work: the work carried out on the workhorse production software that enterprises depend on.
This exposes a broader pattern that is starting to define industry discourse about AI. Our current understanding of the technology’s capabilities is rooted in theoretical and academic contexts and does not reflect performance in the real world, which leads to disappointing ROI and business value.
The Benchmark Illusion
Most widely cited AI benchmarks, such as HumanEval and SWE-bench for coding and GPQA for scientific reasoning, test how well a model can solve well-defined problems in controlled settings. Think of them as standardized tests with clear, correct answers. Models have gotten so good at these, possibly through exposure during training, that many benchmarks are approaching saturation, meaning top models all score similarly high and the test no longer separates them in any meaningful way.
But the kind of software that runs banks, hospitals and the apps on your phone is nothing like a benchmark environment. These enterprise software estates are enormous and messy, with layers of complexity from years of decisions made by dozens of different people. The work of “refactoring,” or maintaining and improving that software, is where the majority of professional coding time actually goes. Studies estimate that 40-80% of software lifecycle costs come from maintenance rather than new development. This is exactly where AI struggles.
BARE tested 57 LLMs on tasks like reducing the complexity of legacy code or improving code structure without changing what it does. Across all models, the average success rate was just 17%. Models scored above 80% on surface checks, such as whether the code is syntactically correct, whether the function signatures still match and whether the result looks reasonable. When the models were asked to actually improve the code without breaking what already worked, success rates dropped below 23%.
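The gap between surface checks and behavior-preserving improvement can be illustrated with a small Python sketch. Everything here is hypothetical and for illustration only: the function `price_after_discount`, the candidate rewrite, the helper names and the test cases are assumptions, and this is not BARE’s actual methodology. The candidate rewrite is shorter, parses cleanly and keeps the same signature, yet it silently drops one branch of the original logic.

```python
import ast
import inspect


def price_after_discount(total, is_member):
    """Original (hypothetical) legacy function with nested branches."""
    if is_member:
        if total > 100:
            return total * 0.8
        else:
            return total * 0.9
    else:
        if total > 100:
            return total * 0.95
        else:
            return total


# Hypothetical "refactored" version, as a model might return it.
# It is simpler, but members spending 100 or less lose their 10% discount.
CANDIDATE_SOURCE = '''
def price_after_discount(total, is_member):
    rate = 0.8 if is_member else 0.95
    return total * rate if total > 100 else total
'''


def load_candidate(source, name):
    """Parse and execute the candidate source, then return the named function."""
    namespace = {}
    exec(compile(ast.parse(source), "<candidate>", "exec"), namespace)
    return namespace[name]


def passes_surface_checks(source, reference):
    """Surface check: the code parses and the signature is unchanged."""
    try:
        candidate = load_candidate(source, reference.__name__)
    except (SyntaxError, KeyError):
        return False
    return inspect.signature(candidate) == inspect.signature(reference)


def preserves_behavior(source, reference, cases):
    """Behavioral check: the candidate matches the original on every case."""
    candidate = load_candidate(source, reference.__name__)
    return all(candidate(*args) == reference(*args) for args in cases)


# Illustrative inputs covering all four branches of the original.
CASES = [(150, True), (150, False), (50, False), (50, True)]

print(passes_surface_checks(CANDIDATE_SOURCE, price_after_discount))  # True
print(preserves_behavior(CANDIDATE_SOURCE, price_after_discount, CASES))  # False
```

A rewrite like this would clear a syntax and signature check while still failing the stricter test of leaving existing behavior intact, which is the kind of difference the two sets of BARE figures point to.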
These findings are in line with what many engineering teams report. AI coding tools work well on contained, low-risk tasks like creating a basic skeleton for a new component, writing boilerplate or generating tests for a function with clear inputs and outputs. But as the task widens, their performance worsens. BARE found models could simplify a single function more than 35% of the time, yet that number dropped to 5% on changes that touched how multiple parts of the system fit together.
Not All Code is Equal
These benchmarks also fail to capture how widely LLM performance varies across the programming languages commonly used in enterprise software development. The variability BARE uncovered was surprising. JavaScript performed best, with models succeeding about 32% of the time. The worst performer was C, where models succeeded only roughly 3-4% of the time. That’s an 8.6x spread.
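The spread can be sanity-checked with quick arithmetic. The sketch below treats the quoted figures as rounded values, which is an assumption: with JavaScript at 32% and C anywhere between 3% and 4%, the ratio falls between 8x and roughly 10.7x, and an 8.6x spread would correspond to a C success rate near 3.7%. That last number is an inference from the rounded figures, not a value taken from the benchmark.

```python
# Rounded success rates as quoted in the text (assumed to be approximate).
js_rate = 0.32
c_low, c_high = 0.03, 0.04

# Smallest and largest spread consistent with the quoted range for C.
spread_min = js_rate / c_high
spread_max = js_rate / c_low

# C success rate implied by the quoted 8.6x spread (an inference only).
implied_c_rate = js_rate / 8.6

print(f"spread range: {spread_min:.1f}x to {spread_max:.1f}x")
print(f"implied C success rate: {implied_c_rate:.1%}")
```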
A straightforward explanation might be that AI models are trained on very large quantities of code scraped from the internet. The more code that exists online in a given language, the better the model tends to perform in it. Since JavaScript powers the web and Python is widely popular, there’s abundant training data available for both. This appears to be compounded by the fact that some languages, such as C, are less widely used and are employed in systems that require deeper understanding of, and reasoning about, hardware and memory. The BARE results bear out these LLM shortcomings.
This means that certain industries can more easily extract value from AI coding tools than others. Web developers may see much greater benefit than engineers writing firmware for medical devices or operating system kernels.
Behind the Plateau
Here’s the BARE benchmark finding with the biggest implications for the broader AI industry: The leading proprietary models are showing diminishing improvements with each new release. Forecasts suggest the ceiling for this generation of models sits around an average attainment of 21% when it comes to real maintainability work, while recent releases have achieved around 17-23%. Open-weight models did not show any consistent improvement over time.
This runs counter to the prevailing narrative in tech, which expects AI to deliver exponential progress. After years of significant benchmark gains, real-world performance on hard engineering tasks appears to be flattening. The old approach of simply making models bigger and feeding them more data is hitting limits for these types of problems.
It’s worth noting that BARE tested raw model capabilities, not the full developer tools built on top of them. Products like GitHub Copilot, Cursor and Claude Code wrap models in orchestration layers that supply additional context, give models access to tools and create feedback loops. These layers do help, but if the underlying models are plateauing, there’s a ceiling on how far the entire system can go.
This may explain why so much of the AI industry’s recent attention and investment has shifted to agentic systems. When it gets harder to squeeze more raw capability out of the core models, the focus naturally moves up the stack. The priorities become smarter orchestration, better context management and more sophisticated coordination between AI and human oversight.
Beyond the Engineering Org
The pressure to demonstrate real and meaningful AI productivity gains is intense, and the major labs have obvious incentives to make everyone feel behind if they’re not posting jaw-dropping numbers. BARE’s findings offer a rare independent counterpoint to that narrative: The evidence suggests that those productivity claims don’t reflect how LLMs actually perform on the complex work that makes up most of an engineer’s day.
LLMs are almost certainly transformative, and AI coding tools are fundamentally changing how software is built. The pace of adoption seems to be faster than for any previous shift in developer tooling. But the value delivered by this technology is uneven, and the anecdotes and headline numbers used to justify investment often don’t survive contact with the messy reality of enterprise software development.
For anyone trying to make sense of where AI is heading, the BARE results suggest a few things worth keeping in mind:
- Oft-cited benchmarks driving public perception of AI progress may be measuring the wrong things. Saturation on a benchmark is different from solving real-world problems.
- The current generation of LLMs may be closer to a capability ceiling on real-world engineering work than marketing suggests.
- The industry’s pivot toward agentic technologies looks like a pragmatic response to diminishing returns in core LLM capabilities.
- Productivity claims should be evaluated against what tools actually do in real environments rather than how they score on standardized tests.
The best scoreboard for AI is whether the work the technology does holds up when it leaves the lab.


