Why Security Benchmarking Has Emerged as Critical for AI in Software Development Tools

By Matias Madou, Co-Founder & CTO, Secure Code Warrior

If you could have a chef in your kitchen to help prepare an elaborate dish, you’d happily accept the assistance, wouldn’t you? Or a qualified contractor to work on your home-improvement projects? Or an office aide to handle often tedious, repetitive tasks while you focus on “big picture” deep thinking?

Since the answer would likely be an emphatic “Oh my, yes!” to all three questions, it should come as no surprise that software developers are rapidly adopting large language model (LLM) tools and other forms of generative artificial intelligence (Gen AI) to help them produce code more efficiently and at a faster pace. In the last five years, nearly two-thirds of IT executives and administrators say their organisation has incorporated these AI tools into the software development lifecycle (SDLC), according to research from KPMG and OutSystems, which offers a low-code, AI-supported platform for building applications.

The Market Is Flooded with AI Tools

Three-quarters of those surveyed say that 10 to 50 percent of the code in their final products is created with Gen AI technologies, and more than nine in ten intend to boost AI investments further. They are doing so because of the clear benefits: four in five report as much as a 50 percent reduction in development time thanks to greater use of AI and automation tools.

With newcomers constantly entering the market, each promising better productivity and output than the last, the increased ubiquity of AI in the SDLC appears inevitable. However, 56 percent of these executives cite data privacy and security concerns as the main barriers to adoption.

Challenges of Assessing the Security of AI-Powered Coding Tools

The concerns are valid. Even when tools claim “improved” protection in new versions, we cannot assume they are secure by default. BaxBench, a coding benchmark that evaluates LLMs for correctness and security, has concluded that no current LLM can generate deployment-ready code: 62 percent of the solutions produced by even the best model are either incorrect or contain a vulnerability. Put differently, only about 38 of every 100 generated solutions are both correct and secure – and among the solutions that are correct, about half are insecure.

These tools often produce insecure, readily exploitable code, which makes compromises far more likely. China’s DeepSeek, for example, has emerged as a popular option – with 5 to 6 million users worldwide – billed as a faster and smarter assistant for software development than well-established LLMs. It is also much more affordable, priced at roughly 1/30th of the cost of similar models.

Yet research has shown that DeepSeek is susceptible to critical risks, such as malware generation (with a failure rate of 93 percent), jailbreaking (91 percent) and prompt injection attacks (86 percent). This performance indicates that DeepSeek is too unsafe for business and enterprise use. Regardless, organisations often continue to use these tools, frequently without the knowledge or approval of executives – a practice commonly referred to as “shadow AI.” Covert use of AI is an enterprise security risk, but rather than outlawing or ignoring it, Application Security (AppSec) leaders should implement strong guidelines and a pre-approved suite of tools.

Developers and Organisations Need a Standard Way to Measure an AI Tool’s Security Standing

Given continuously increasing deployments coupled with precarious protection, security leaders and development teams – and the industry as a whole – must come together to establish a standard for using AI coding tools. Currently, there is no uniform process that teams can easily follow to assess a product’s safety or compare one tool against another. Everyone is proceeding in a different way, and this piecemeal approach can introduce risk in the form of vulnerabilities.

So what should standardisation look like? It starts with benchmarking, which focuses on two key areas:

Tools: Security leaders and development team personnel need to assess the data sources feeding an LLM tool and how the tool generates code. They should also identify the mechanisms in place that either will – or will not – keep cybercriminals from exploiting that code. Then they must compile individual scores for each of these considerations and combine them into an overall security score (a minimal sketch of such scoring appears after this list).

People: Tools are only half of the “team” – developers are the other half. It is essential for them to learn and upskill in security so they can make informed decisions about AI usage that protect code rather than expose it. Toward this goal, their organisations must invest in developer education so developers can write safeguarded code from the start while mitigating vulnerabilities, including those generated by AI coding assistants.

Agile, hands-on learning pathways that teach safe coding patterns prove the most effective here. They help developers learn, test and apply knowledge immediately and in a “real-world” context, so that defensive approaches become second-nature routines. Again, standardisation benchmarks tracking training frequency and impact, team skill levels and vulnerability reductions will enable organisations to measure – and improve – enterprise-wide security maturity.
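To make the scoring idea concrete, here is a minimal sketch in Python of how per-category assessments might be combined into an overall security score for a tool. The category names, weights and approval threshold are hypothetical illustrations rather than an established industry standard, and the same weighted-score pattern could extend to the “People” metrics above (training frequency and impact, skill levels, vulnerability reductions).

    from dataclasses import dataclass

    # Hypothetical benchmark categories mirroring the "Tools" criteria above:
    # data provenance, secure code generation, and safeguards against exploitation.
    # The weights are illustrative assumptions, not an industry standard.
    WEIGHTS = {
        "data_sources": 0.25,        # provenance and quality of the data feeding the LLM
        "code_generation": 0.45,     # share of generated solutions that are correct AND secure
        "exploit_safeguards": 0.30,  # resistance to jailbreaking, prompt injection, malware requests
    }

    @dataclass
    class ToolAssessment:
        name: str
        scores: dict  # each category scored from 0.0 (worst) to 1.0 (best)

        def overall_score(self) -> float:
            """Combine the per-category scores into one weighted security score."""
            return sum(WEIGHTS[category] * self.scores[category] for category in WEIGHTS)

    def approve(tool: ToolAssessment, threshold: float = 0.75) -> bool:
        """Admit a tool to the pre-approved suite only if it clears the bar."""
        return tool.overall_score() >= threshold

    # Example: a candidate whose weak safeguards echo the failure rates cited above.
    candidate = ToolAssessment(
        name="example-llm-assistant",  # hypothetical tool, not a real product
        scores={"data_sources": 0.6, "code_generation": 0.38, "exploit_safeguards": 0.1},
    )
    verdict = "approved" if approve(candidate) else "rejected"
    print(f"{candidate.name}: {candidate.overall_score():.2f} -> {verdict}")

In practice, each sub-score might itself be derived from benchmark runs such as BaxBench’s correctness-and-security tests, with the approval threshold set by AppSec policy.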

Let’s be honest: we would all like a little help from a friendly assistant, at home and at work. But if that chef cooks with harmful ingredients, the contractor commits fraud and the office aide embezzles funds from our employer, then we’re creating more problems than benefits, right?

The same thinking applies to software development: an AI assistant may generate code faster and better, but if it introduces unnecessary and damaging risks in the process, it will only cause headaches in the form of rework and remediation. By taking a standardisation and benchmarking approach, we’ll end up with software that excels in both quality and security – and that’s the best “little help from our friends” we could wish for.
