
A Novel Framework for Secure Integration of Large Language Models into the Software Development Lifecycle
Some 76% of developers now use or plan to use AI coding assistants [1], yet roughly half of AI-generated code snippets contain bugs that could potentially be exploited [2], creating an urgent need for systematic guardrails. This research compiles verifiable evidence showing that while AI assistants boosted developer output by 26% in the largest controlled study [3], they simultaneously introduce security risks, compliance challenges, and technical debt. The tension between velocity gains and quality erosion demands a new blueprint for SDLC integration that treats AI-generated code as fundamentally different from human-written code.
The regulatory landscape is crystallizing rapidly: the EU AI Act’s GPAI obligations took effect in August 2025 [4], federal SBOM and secure-software requirements were restructured across 2025–2026 from a centralized attestation mandate toward an agency-led, risk-based approach [5], and the GitHub Copilot copyright litigation remains unresolved after three years [6]. Meanwhile, enterprise leaders face a practical paradox: 90% of Fortune 100 companies have adopted GitHub Copilot [7], yet 80% of developers admit bypassing security policies for AI-generated code [8].
Adoption Has Reached Critical Mass, with Uneven Enterprise Governance
Developer usage of AI coding tools has crossed the threshold from experimentation to mainstream adoption. The 2024 Stack Overflow Developer Survey reports that 62% of professional developers actively use AI tools, up from 44% in 2023, with 76% either using or planning to use such tools [1]. GitHub Copilot leads with over 20 million cumulative users and 1.3 million paid subscribers, achieving 30% quarter-over-quarter growth through 2025 [9]. Cursor has emerged as the fastest SaaS product to reach $100M ARR, now exceeding $500M ARR with over 1 million daily users [10].
Enterprise adoption presents a more complex picture. While 90% of Fortune 100 companies have deployed Copilot and 50,000+ organizations use it company-wide [7], governance remains fragmented. A GitHub survey found that only 30-40% of organizations actively encourage AI coding tools, with another 29-49% allowing use but providing limited guidance [11]. Most concerningly, 73% of developers report unclear or absent AI policies at their organizations [12], creating conditions for uncontrolled risk exposure.
Industry-specific adoption patterns reveal significant variation. Technology and startup sectors show highest acceptance rates, while financial services accept fewer AI suggestions due to stricter security requirements despite similar productivity gains [9]. Healthcare and insurance demonstrate the lowest acceptance rates, driven by regulatory requirements demanding rigorous validation—a pattern that suggests mature governance may naturally constrain AI adoption velocity.
Productivity Gains Are Real but More Nuanced Than Vendor Claims Suggest
The largest independent randomized controlled trial, conducted across Microsoft, Accenture, and a Fortune 100 company with 4,867 developers, found a 26.08% increase in completed pull requests per week for AI-assisted developers [3]. This Microsoft/MIT/Princeton/Wharton study represents the most rigorous evidence available, using objective output measures rather than self-reported productivity [13].
The productivity distribution matters enormously for planning. Junior developers and those with shorter tenure at their companies experienced 27-39% productivity increases, while senior developers with deep codebase familiarity saw only 7-16% gains [3]. A Google internal RCT with approximately 100 engineers found 21% faster task completion for specific coding tasks—consistent with, though slightly more modest than, the multi-company results [14].
The most striking contradictory finding comes from a July 2025 METR study of 16 experienced open-source developers completing 246 tasks on their own repositories: AI assistance made them 19% slower [15]. This directly contradicts the prevailing narrative, though the study’s focus on highly experienced developers working on familiar codebases may explain the divergence. Developers expected a 24% time saving and afterward perceived a 20% reduction, while actually experiencing a slowdown. That perception gap has significant implications for organizational decision-making.
McKinsey’s analysis suggests productivity gains are highly task-dependent: documentation writing improves by 50%, new code generation by nearly 50%, and code refactoring by nearly 67%, but high-complexity architectural tasks show less than 10% improvement [14]. The Jellyfish/OpenAI research indicates that organizations achieving 80-100% developer adoption see 110%+ productivity gains, but the median enterprise currently uses AI tools for only 10-20% of coding activities [16].
Security Vulnerabilities Pervade AI-Generated Code at Alarming Rates
The Georgetown University Center for Security and Emerging Technology analyzed code from five leading LLMs and found that approximately 48% of code snippets contained bugs that could potentially lead to malicious exploitation, including dereference failures, buffer overflows, and memory leaks [2]. Veracode’s research confirms that AI introduces security vulnerabilities in 45% of cases, with stark variation by vulnerability type: SQL injection shows 80% pass rate, but cross-site scripting shows only 14% pass rate and log injection just 12% [17].
An ACM study analyzing 733 code snippets from GitHub projects found vulnerability rates of 29.5% for Python and 24.2% for JavaScript in AI-generated code [18]. Earlier foundational research by Pearce et al. (2021) found ~40% of 1,689 Copilot-generated programs were vulnerable to MITRE’s CWE Top 25 [18], while Khoury et al. (2023) determined only 5 of 21 ChatGPT-generated programs were initially secure [2].
Specific incidents illustrate the real-world consequences. A U.S. fintech startup suffered a major breach in late 2024 traced to an AI-generated login function that skipped essential input validation, allowing attackers to inject malicious payloads [19]. The “Rules File Backdoor” attack discovered by Pillar Security in March 2025 demonstrated how attackers can inject hidden malicious instructions into configuration files used by Cursor and GitHub Copilot, enabling threat actors to bypass code reviews [20]. Most comprehensively, researchers discovered 30+ security vulnerabilities across AI-powered IDEs in December 2025, resulting in 24 CVE identifiers for issues enabling data exfiltration and remote code execution via prompt injection [21].
The Snyk AI Code Security Report found that 56.4% of developers describe insecure AI suggestions as “common” or “frequent,” yet 76% believe AI code is more secure than human code—a dangerous false perception [8]. Only 10% scan most AI-generated code, and less than 25% use Software Composition Analysis tooling for AI-generated code [8].
Supply Chain Attacks Exploit Package Hallucination at Scale
A March 2025 study by researchers at the University of Texas at San Antonio, Virginia Tech, and University of Oklahoma analyzed 576,000 code samples from 16 LLMs and found that nearly 20% of recommended packages did not exist in any public registry [22]. Open-source models hallucinated packages at a 21.7% rate compared to 5.2% for commercial models, generating 205,474 unique fabricated package names [22]. The 43% repeatability rate of these hallucinations makes them exploitable: attackers can register packages with AI-hallucinated names and wait for developers to install malicious code [23].
Lasso Security demonstrated this attack vector practically: Google Gemini referenced at least one hallucinated package in nearly two-thirds (65%) of all prompts, and when researchers uploaded a “dummy” package with a hallucinated name, it was downloaded over 30,000 times within weeks [24]. Socket Security documented a threat actor named “_Iain” who automated creation of thousands of typosquatted packages using ChatGPT to generate realistic-sounding variant names [23].
The broader supply chain context amplifies these risks. Sonatype’s 2024 State of Software Supply Chain Report logged 512,847+ malicious packages in the past year—a 156% year-over-year increase [25]. Consumer complacency compounds the problem: 80% of dependencies remain un-upgraded for over a year, even though 95% of vulnerable components consumed already have fixed versions available [25].
Training data poisoning presents an emerging systemic risk. Anthropic research with the UK AI Safety Institute and Alan Turing Institute found that just 250 malicious documents can create backdoors in LLMs regardless of model size (from 600M to 13B parameters), contradicting the assumption that larger models require proportionally more poisoned data [26]. Even at poisoning rates below 0.01%, attack success rates remain high when using low-frequency tokens as triggers [26].
Regulatory Requirements Are Crystallizing with Concrete Deadlines
The EU AI Act entered force August 1, 2024, with General-Purpose AI model obligations taking effect August 2, 2025 [4]. While AI coding tools are not explicitly classified as “high-risk” under Annex III, they fall under GPAI provisions if they use models trained on more than 10²³ FLOP [27]. Providers must publish sufficiently detailed summaries of training data content, provide technical documentation per Annex XI specifications, and maintain 10-year record retention for safety and risk management activities [4]. Although the obligations apply from August 2, 2025, the AI Office’s full enforcement powers (including fines) begin August 2, 2026, and GPAI models already on the market before August 2, 2025 have until August 2, 2027 to comply [4].
U.S. federal policy on software supply chain security shifted substantially across 2025 and 2026. Executive Order 14028 (2021) established the foundation, directing NIST to set standards and machine-readable SBOM expectations (SPDX, CycloneDX, or SWID) covering supplier and component names, version identifiers, hashes, dependency relationships, and timestamps [5]. Executive Order 14306 (June 2025) then rescinded key portions of the prior administration’s EO 14144, removing the enhanced, centralized attestation mandates and the CISA validation role [78], and OMB Memorandum M-26-05 (January 2026) rescinded the earlier memoranda (M-22-18 and M-23-16) that had required agencies to collect standardized self-attestation forms [79]. As a result, EO 14028’s core SBOM and secure-development principles remain in effect, but the implementing regime has moved from a uniform federal mandate to a decentralized, agency-led, risk-based model in which agencies may—but are not universally required to—demand SBOMs and attestations [5]. AI-generated code components should still be documented in SBOMs, though standardized AIBOM formats remain nascent [28]. The practical takeaway for organizations is that supply-chain expectations persist and continue to shape procurement, even as the mandatory submission infrastructure has loosened. Black Duck reports that only 51% of organizations always validate external supplier SBOMs, and while 95% use AI for development, only 24% comprehensively evaluate IP, license, security, and quality risks [29].
PCI-DSS considerations crystallized in September 2025 when the PCI Security Standards Council published AI principles explicitly stating that “AI trained to generate functional code may not always generate the most secure code”—emphasizing that secure coding training requirements still apply to AI-generated code [30]. Healthcare organizations face additional HIPAA constraints: AI tools processing PHI require Business Associate Agreements, and organizations cannot input PHI into non-HIPAA-compliant AI services [31].
The GitHub Copilot copyright litigation (Doe v. GitHub) remains ongoing after being filed in November 2022. While 20 of 22 original claims have been dismissed, two claims remain: open-source license violation and breach of contract [6]. A Ninth Circuit appeal was filed September 2024 on certified questions [32], leaving fundamental legal questions unresolved regarding whether AI training on open-source code violates license terms and whether AI-generated output constitutes derivative work [33].
Enterprise Governance Frameworks Are Maturing Across Major Tech Companies
Google implements a three-tiered AI governance structure: product teams with embedded UX, privacy, and trust specialists; a central Responsible Innovation team with dedicated review bodies; and specialized reviews for Cloud, Devices, and Health products [34]. Google’s Secure AI Framework (SAIF) integrates risk management strategies with data governance programs to ensure high-quality, properly sourced training data, with red teaming conducted by external experts against rigorous safety, privacy, and security benchmarks [35].
Microsoft applies a multi-layered defense-in-depth strategy for Copilot security including Zero Trust principles, tenant isolation, end-to-end encryption, and ML-based prompt injection detection [36]. Microsoft Purview provides DLP, sensitivity labels, and compliance monitoring, while SharePoint Advanced Management detects oversharing [37]. The Copilot Control System dashboard launched July 2025 provides centralized governance visibility [38], and internally Microsoft follows a “self-service with guardrails” approach that has generated deployment guides from their experience as the first large enterprise adopter [39].
Meta developed CyberSecEval to evaluate cybersecurity risks in LLM-generated code and Code Shield to reduce potentially insecure suggestions at generation time, reportedly preventing “tens of thousands of potentially insecure suggestions” from Meta’s internal coding LLM [40]. Llama Guard 3 provides multilingual input/output moderation, and Prompt Guard protects against prompt injections for developers building with Llama models [41].
Amazon CodeWhisperer (now Amazon Q Developer) integrates SAST powered by the CodeGuru Detector Library, supporting Java, Python, JavaScript, C#, TypeScript, CloudFormation, Terraform, and AWS CDK [42]. Professional tier users receive 500 security scans per month, with reference tracking linking suggestions that match open-source training data to source repositories for proper attribution [43].
CI/CD Integration Patterns Are Emerging for AI Code Validation
Stage-specific validation has become the dominant pattern for AI code integration. At the source stage, organizations implement linting and syntax validation for AI-generated code through pre-commit hooks enforcing style guidelines [44]. The build stage adds compile-time security analysis, while the test stage encompasses integration tests, load tests, and combined SAST/DAST scans [45]. The deploy stage establishes quality gates that block deployment on security failures [46].
Human-in-the-loop requirements follow confidence-based routing patterns. Triggers for mandatory human review include confidence scores below threshold, validator failures (schema violations, missing citations), high-risk content areas (legal, safety, financial, privacy), enterprise customer tier, and novelty (new feature areas or content domains) [47]. The LangGraph framework with interrupt() functions enables pausing AI agent execution mid-workflow for human approval, with full audit logging of every access request and tool call [48].
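The routing triggers listed above can be sketched as a small decision function. This is a minimal illustration rather than any cited system's implementation: the `Change` fields, the `0.8` threshold, and the reason labels are assumptions chosen for the example.

```python
from dataclasses import dataclass, field

# High-risk content areas named in the text.
HIGH_RISK_AREAS = {"legal", "safety", "financial", "privacy"}


@dataclass
class Change:
    """Hypothetical description of one AI-produced change awaiting routing."""
    confidence: float                                  # model-reported score in [0, 1]
    validator_failures: list[str] = field(default_factory=list)
    content_area: str = "general"
    enterprise_tier: bool = False
    novel_area: bool = False


def review_reasons(change: Change, threshold: float = 0.8) -> list[str]:
    """Return every trigger that applies; an empty list means no mandatory review."""
    reasons = []
    if change.confidence < threshold:
        reasons.append("low_confidence")
    if change.validator_failures:
        reasons.append("validator_failure")
    if change.content_area in HIGH_RISK_AREAS:
        reasons.append("high_risk_area")
    if change.enterprise_tier:
        reasons.append("enterprise_tier")
    if change.novel_area:
        reasons.append("novelty")
    return reasons


def needs_human_review(change: Change, threshold: float = 0.8) -> bool:
    return bool(review_reasons(change, threshold))
```

Returning the list of reasons, rather than a bare boolean, keeps the audit log informative: a reviewer can see which trigger fired for each routed change.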
Progressive delivery patterns apply cognitive safeguards through gradual rollout of AI changes to user subsets, automated rollback mechanisms for model degradation or hallucinations, and continuous monitoring for production data and model performance [45]. CircleCI has documented specific strategies including snapshot testing against golden datasets, hallucination detection through threshold-based validation, and bias/fairness testing before production [46].
The effectiveness evidence is sobering: humans respond to only 56% of AI agent reviews, and only 18% of AI suggestions result in actual code changes—underscoring the necessity of human validation gates [47].
Academic Benchmarks Reveal Capability Gaps and Contamination Challenges
HumanEval, created by OpenAI in 2021, established the foundational benchmark with 164 hand-crafted programming challenges using pass@k metrics for functional correctness [49]. Subsequent extensions include HumanEval-XL (22,080 prompts spanning 23 natural languages and 12 programming languages) [50], HumanEval-V (253 visual reasoning tasks with complex diagrams), and HumanEval Pro (self-invoking code generation). BigCodeBench positions itself as “the next generation of HumanEval” with real-world library dependencies and diverse function calls; it found 149 tasks unsolved by all models in Complete mode and 278 unsolved in Instruct mode [51].
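For readers unfamiliar with pass@k, the sketch below computes the unbiased estimator commonly used with HumanEval-style benchmarks: given n generated samples per problem of which c pass the unit tests, pass@k = 1 - C(n-c, k) / C(n, k). The function name and argument checks are illustrative choices, not taken from any benchmark's codebase.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct.

    n: total samples generated for a problem
    c: number of those samples that passed the tests
    k: evaluation budget (k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples exist, so any draw of k contains a pass.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```

A benchmark score is then the mean of this value over all problems. With 10 samples and 3 passing, `pass_at_k(10, 3, 1)` evaluates to 1 - 7/10, which is the plain pass fraction, as expected for k = 1.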
SWE-bench from Princeton (ICLR 2024 Oral) uses 2,294 real GitHub issues from 12 Python repositories, requiring models to generate patches resolving real-world software issues [52]. SWE-bench Verified (500 human-validated problems with OpenAI collaboration) addresses reliability issues [53], while SWE-bench Pro demonstrates the difficulty gap: as of late 2025, even the strongest frontier models—then GPT-5 (23.3%) and Claude Opus 4.1 (23.1%)—scored far below their 70%+ results on Verified [53]. Absolute Pro scores have since risen as newer models shipped, but the large Verified-to-Pro gap, and its implications for benchmark contamination and real-world capability, remains the durable point.
Data contamination poses a fundamental challenge. SWE-bench+ creates issues after LLM training cutoffs, causing resolution rates to drop from 3.97% to 0.55% on SWE-Agent+GPT-4—suggesting benchmark scores may significantly overstate real-world capabilities [54]. SWE-rebench from Nebius addresses this through continuous updates and contamination tracking [55].
Security-focused benchmarks remain less mature. CodeLMSec systematically evaluates security vulnerabilities in black-box code LLMs [56], while CodeSecEval focuses specifically on secure code generation evaluation. The HalluCode benchmark from Liu et al. (2024) evaluates hallucination recognition across five primary categories, finding that LLMs face great challenges recognizing hallucinations—especially identifying specific types [57].
Formal Verification and RAG Approaches Show Promise but Remain Immature
Clover from Stanford AI Lab demonstrates consistency checking among code, docstrings, and formal annotations using Dafny for deductive verification, achieving 87% acceptance for correct examples and 100% rejection for incorrect code [58]. Astrogator (July 2025) targets Ansible with a formal query language for user intent specification, achieving 83% verification of correct code and 92% identification of incorrect [59]. The LEMUR framework integrates LLMs proposing invariants with automated reasoners verifying them, establishing the first framework with formal calculus for LLM-verifier integration [60].
Retrieval-Augmented Generation shows consistent benefits for code quality. CodeRAG-Bench evaluates across eight coding tasks and finds that gold documents significantly boost GPT-4/GPT-3.5 performance, though a gap remains between oracle retrieval and current models [61]. The EVOR framework’s evolving retrieval approach achieves 2-4× execution accuracy improvement over Reflexion and DocPrompting baselines [62]. For code translation specifically, a RAG-based strategy reduced vulnerability introduction rate by 32.8% [63].
Multi-agent systems represent the frontier of autonomous software engineering. The ACM TOSEM survey on LLM-Based Multi-Agent Systems envisions “Software Engineering 2.0” with fully autonomous, scalable systems [64]. Systems like AutoCodeRover combine spectrum-based fault localization with LLM-guided repair [65], while Agentless from FSE 2025 demonstrates that a simple, non-agentic workflow can match complex agent systems, suggesting the optimal complexity level remains an open question [52].
Technical Debt Accumulation Presents Long-Term Structural Risks
The GitClear 2025 AI Copilot Code Quality Report, analyzing 211 million changed lines of code from 2020-2024, documents an 8-fold increase in duplicated code blocks and projects that code churn (code discarded within two weeks) will double [66]. The report characterizes this as “AI-induced technical debt” accumulating rapidly across the industry [67].
Google’s DORA research reinforces these concerns: 2024 findings showed that a 25% increase in AI usage correlated with a 7.2% decrease in delivery stability [68]. The 2025 update found that a 90% increase in AI adoption corresponded to a 9% climb in bug rate, a 91% increase in code review time, and a 154% increase in PR size, suggesting AI tools may be shifting rather than reducing work [68].
Ox Security’s “Army of Juniors” report (October 2024) characterizes AI code as “highly functional but systematically lacking in architectural judgment,” identifying 10 architecture and security anti-patterns [69]. A Cursor adoption study using difference-in-differences design found transient velocity increases alongside persistent increases in static analysis warnings and code complexity—a pattern consistent with short-term gains creating long-term liabilities [66].
Academic research by Recupito et al. at the University of Salerno confirms that AI technical debt impacts security and maintainability differently than traditional technical debt, suggesting organizations may need distinct tracking and remediation approaches for AI-generated code [70].
Industry Safety Initiatives Are Coalescing Around Open Standards
The OpenSSF Security-Focused Guide for AI Code Assistant Instructions (August 2025) represents the most comprehensive industry guidance, developed with contributors from Microsoft, Google, Red Hat, and the Linux Foundation [71]. Core principles establish that the developer remains in control, engineering best practices always apply, AI code must be assumed to contain bugs/vulnerabilities, and practitioners should ask AI to self-review using Recursive Criticism and Improvement techniques [71].
The guide provides specific instructions for security-conscious prompting: use parameterized queries for database access, never include API keys/secrets in output, use constant-time comparison for security-sensitive operations, prefer popular community-trusted libraries, use official package managers with version pinning, and generate SBOMs in SPDX or CycloneDX format [72]. This explicitly addresses the slopsquatting risk, emphasizing pre-use package evaluation given that 19.7% of AI-proposed packages don’t exist [71].
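Two of the instructions above, parameterized queries and constant-time comparison, are easy to show concretely. The sketch below uses only the Python standard library (`sqlite3` and `hmac`); the table layout and function names are hypothetical, chosen for illustration.

```python
import hmac
import sqlite3


def find_user(conn: sqlite3.Connection, username: str):
    """Look up a user with a bound parameter instead of string formatting."""
    # The "?" placeholder keeps user input out of the SQL text entirely.
    cur = conn.execute(
        "SELECT id, username FROM users WHERE username = ?", (username,)
    )
    return cur.fetchone()


def tokens_match(supplied: str, expected: str) -> bool:
    """Compare secrets in time that does not depend on where they differ."""
    return hmac.compare_digest(supplied.encode(), expected.encode())


# Illustrative in-memory database with a single row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT)")
conn.execute("INSERT INTO users (username) VALUES (?)", ("alice",))
```

With the placeholder in place, an injection-style input such as `alice' OR '1'='1` is treated as a literal username and matches no row.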
The OWASP Top 10 for LLM Applications (2025 Edition) establishes the security taxonomy for AI systems, ranking Prompt Injection as the #1 risk (appearing in over 73% of production AI deployments according to security audits) [73]. Other critical risks include Sensitive Information Disclosure, Supply Chain Vulnerabilities, Data and Model Poisoning, and Excessive Agency [74]. The framework provides structured mitigation guidance for each risk category.
Cisco’s Project CodeGuard (2025) offers an open-source framework for securing AI-generated code with security rules based on OWASP and CWE best practices, providing translators for Cursor, Windsurf, and GitHub Copilot [75]. The OpenSSF AI/ML Security Working Group continues developing model signing specifications through Sigstore, with the LFEL1012 course “Secure AI/ML-Driven Software Development” released October 2025 [71].
Introducing the AI-SDLC Safety Framework: A Novel Methodology for Secure Integration
Based on the evidence compiled in this analysis, I propose the AI-SDLC Safety Framework (ASSF)—a structured methodology for organizations to safely integrate LLM-powered code generation into their software delivery pipelines. ASSF does not introduce new cryptographic or verification primitives; its contribution is compositional—applying established supply-chain controls (drawn from SLSA, the OpenSSF AI code assistant guidance, NIST SSDF, and the OWASP LLM Top 10) to AI-generated code as a distinct, lower-trust class of input, with validation intensity scaled to generation context rather than to authorship. This synthesis of existing controls into an AI-specific trust model is what the framework offers that prior, human-code-oriented frameworks do not.
Framework Pillars
Pillar 1: Tiered Trust Model
AI-generated code should be classified into trust tiers based on generation context:
- Tier 1 (Untrusted): Raw AI output without validation—requires full review pipeline
- Tier 2 (Validated): AI output that has passed automated security scanning and unit tests
- Tier 3 (Attested): Human-reviewed code with cryptographic attestation of review completion
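A minimal sketch of how the three tiers might be assigned in code follows. The names `TrustTier` and `classify` are hypothetical, and the rule that a human attestation cannot promote code that failed automated checks is an assumption of this sketch rather than something the tier definitions state.

```python
from enum import IntEnum


class TrustTier(IntEnum):
    UNTRUSTED = 1   # raw AI output, no validation yet
    VALIDATED = 2   # passed automated security scanning and unit tests
    ATTESTED = 3    # human-reviewed, with an attestation of review completion


def classify(scans_passed: bool, tests_passed: bool, review_attested: bool) -> TrustTier:
    """Map validation evidence to a trust tier."""
    if not (scans_passed and tests_passed):
        # Assumption: attestation alone never lifts code out of Tier 1.
        return TrustTier.UNTRUSTED
    if review_attested:
        return TrustTier.ATTESTED
    return TrustTier.VALIDATED
```

Using an ordered enum lets policy checks be written as comparisons, for example requiring `tier >= TrustTier.VALIDATED` before merge.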
Pillar 2: Mandatory Validation Gates
Each code path must traverse validation stages proportional to risk:
- Pre-commit: Static analysis, secret detection, dependency validation against known-good registries
- Pre-merge: SAST/DAST scanning, license compliance checking, hallucination detection for package names
- Pre-deploy: Integration testing, security regression testing, SBOM generation with AI provenance markers
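The staged gates above could be wired together roughly as follows. This is a sketch under stated assumptions: `run_gates`, `GateResult`, and the toy checks are hypothetical stand-ins, and a real pipeline would call actual scanners instead of the lambdas shown.

```python
from typing import Callable, NamedTuple


class GateResult(NamedTuple):
    stage: str
    check: str
    passed: bool


Check = Callable[[str], bool]  # takes source text, returns True when the check passes


def run_gates(code: str, stages: dict[str, dict[str, Check]]) -> list[GateResult]:
    """Run stages in order; stop after the first stage with a failing check."""
    results: list[GateResult] = []
    for stage, checks in stages.items():
        stage_results = [GateResult(stage, name, check(code)) for name, check in checks.items()]
        results.extend(stage_results)
        if not all(r.passed for r in stage_results):
            break  # later stages are skipped once a stage fails
    return results


# Placeholder checks for illustration only.
EXAMPLE_STAGES: dict[str, dict[str, Check]] = {
    "pre-commit": {"secret_detection": lambda src: "API_KEY=" not in src},
    "pre-merge": {"non_empty": lambda src: bool(src.strip())},
}
```

All checks within a stage run even if one fails, so a developer sees every problem at that stage in one pass, while later and typically more expensive stages are skipped.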
Pillar 3: Provenance Tracking
All AI-generated code must carry metadata indicating:
- Originating model and version
- Prompt template used
- Timestamp of generation
- Validation gates passed
- Human reviewer attestation (if applicable)
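One way to carry this metadata is a small record serialized to JSON alongside the change. The field names below mirror the list above, but the schema itself is an assumption made for illustration; it is not a standardized provenance or AIBOM format.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ProvenanceRecord:
    """Hypothetical provenance metadata for one AI-generated change."""
    model: str
    model_version: str
    prompt_template: str
    generated_at: str                                   # ISO 8601 timestamp, UTC
    gates_passed: list[str] = field(default_factory=list)
    reviewer_attestation: Optional[str] = None          # absent until a human signs off


def new_record(model: str, model_version: str, prompt_template: str) -> ProvenanceRecord:
    """Create a record stamped with the current UTC time."""
    now = datetime.now(timezone.utc).isoformat()
    return ProvenanceRecord(model, model_version, prompt_template, now)


def to_json(record: ProvenanceRecord) -> str:
    """Serialize with sorted keys so records diff cleanly in audits."""
    return json.dumps(asdict(record), sort_keys=True)
```

A pipeline would append each gate name to `gates_passed` as the change advances and fill `reviewer_attestation` only at the human review step.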
Pillar 4: Rollback Capability
Organizations must maintain the ability to:
- Identify all AI-generated code in production
- Revert AI-generated changes independently of human-written code
- Audit AI code contribution over time for technical debt assessment
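These capabilities depend on AI-generated changes being identifiable in version control. The sketch below assumes a team convention of marking such commits with an `AI-Generated: true` trailer in the commit message; that trailer name and the helper functions are hypothetical, and the output is only a list of `git revert` commands for a human to inspect before running.

```python
def is_ai_generated(message: str, trailer: str = "AI-Generated") -> bool:
    """Check the final paragraph of a commit message for the marker trailer."""
    for line in reversed(message.strip().splitlines()):
        if not line.strip():
            break  # trailers live in the last paragraph only
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == trailer.lower():
            return value.strip().lower() == "true"
    return False


def revert_plan(commits: list[tuple[str, str]]) -> list[str]:
    """Build revert commands for marked commits, given (sha, message) pairs newest first."""
    return [f"git revert --no-edit {sha}" for sha, msg in commits if is_ai_generated(msg)]


def ai_share(commits: list[tuple[str, str]]) -> float:
    """Fraction of commits carrying the marker, for audit reporting over time."""
    if not commits:
        return 0.0
    return sum(is_ai_generated(msg) for _, msg in commits) / len(commits)
```

Keeping the commits ordered newest first matters because reverting in that order reduces the chance of conflicts between dependent changes.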
Implementation Tiers
| Maturity Level | Characteristics | Recommended Controls |
| Level 1: Ad-Hoc | No formal AI code policy | Immediate: Block AI tools pending policy development |
| Level 2: Managed | Policies exist but enforcement is manual | Add automated scanning to CI/CD |
| Level 3: Defined | Automated gates with human review | Implement provenance tracking |
| Level 4: Measured | Metrics on AI code quality tracked | Add technical debt dashboards |
| Level 5: Optimized | Continuous improvement loop | Predictive risk modeling |
Operationalizing the Framework: DepsShield Integration
The framework’s dependency-validation pillar can be operationalized using tools such as DepsShield [76], an open-source MCP (Model Context Protocol) server providing real-time Software Composition Analysis. (Disclosure: the author developed DepsShield; it is used here as a reference implementation of the dependency-validation pillar, not as an independent third-party benchmark.) This addresses the 20% package hallucination rate by validating every AI-suggested dependency against known registries before code reaches the repository.
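To make the registry check concrete, here is a minimal sketch of existence validation against PyPI. It is not DepsShield's API: `exists_on_pypi` and `find_missing` are hypothetical helpers that query PyPI's public JSON endpoint, which returns HTTP 404 for unknown projects. The lookup is injectable so the logic can be exercised without network access.

```python
import urllib.error
import urllib.parse
import urllib.request
from typing import Callable, Iterable


def exists_on_pypi(name: str, timeout: float = 5.0) -> bool:
    """Return True if PyPI knows a project with this name."""
    url = f"https://pypi.org/pypi/{urllib.parse.quote(name)}/json"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise  # other HTTP errors are not evidence either way


def find_missing(
    names: Iterable[str],
    exists: Callable[[str], bool] = exists_on_pypi,
) -> list[str]:
    """List the dependency names that the registry lookup cannot find."""
    return [name for name in names if not exists(name)]
```

Existence alone is a weak signal: a previously hallucinated name may already have been registered by an attacker, so a check like this complements vulnerability and typosquatting analysis rather than replacing it.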
Case Study: Demonstrating the Framework with a Proof-of-Concept Implementation
To demonstrate the practical utility of the AI-SDLC Safety Framework, I developed a lightweight validation pipeline implementing the core pillars and measured its effectiveness against realistic AI-generated code samples. The proof-of-concept tool, ASSF Validator, consists of two primary components: a package hallucination detector that validates imports against npm and PyPI registries, and a security scanner targeting OWASP Top 10 vulnerabilities commonly found in AI-generated code. For npm packages, the validator integrates with DepsShield [76], an open-source MCP server providing real-time security intelligence including existence verification, vulnerability detection, and typosquatting identification.
Methodology
The experiment analyzed 12 code samples representing patterns documented in the research literature as typical AI-generated outputs. These samples included Flask applications with SQL injection vulnerabilities (CWE-89), file handlers with path traversal flaws (CWE-22), API clients with hardcoded credentials (CWE-798), shell command execution with injection risks (CWE-78), templates vulnerable to XSS (CWE-79), insecure deserialization patterns (CWE-502), authentication using weak cryptography (CWE-327), SSRF-vulnerable proxy endpoints (CWE-918), and one clean control sample to measure false positive rate. Several samples additionally contained secondary weaknesses detected by the scanner’s broader 15-pattern ruleset—notably log injection (CWE-117) and insecure debug configuration (CWE-489)—which accounts for their appearance in the results below. Samples included both Python (8) and JavaScript (4) files to demonstrate cross-ecosystem validation, with JavaScript samples specifically showcasing DepsShield’s npm integration capabilities.
Each sample was designed to include at least one hallucinated package name (e.g., ai_query_helper, secure_file_utils, quick-api-client) alongside legitimate dependencies, reflecting the documented 20% hallucination rate from published research [22].
Results
| Metric | Baseline (No ASSF) | With ASSF | Improvement |
| Samples that would ship | 12/12 (100%) | 1/12 (8%) | 92% blocked |
| Hallucinated packages detected | 0 | 15 of 30 | — |
| Security findings | 0 | 27 total | — |
| Critical/High severity | 0 | 13 (7 critical, 6 high) | — |
| False positives | — | 0 | — |
Package validation analyzed 30 total packages, confirming 15 as valid (present in registries) and identifying all 15 hallucinated packages that do not exist, a 100% detection rate. Security scanning detected 27 total vulnerabilities: 7 critical severity (CWE-89, CWE-78, CWE-502), 6 high severity (CWE-22, CWE-79, CWE-798, CWE-918), and 14 medium/low severity (CWE-327, CWE-117, CWE-489), flagging 83.3% of samples (10 of 12).
Key Finding
Without ASSF validation gates, all of the vulnerable samples would have shipped unflagged; with ASSF, 11 of the 12 were blocked at the pre-commit stage, and the only sample that passed was the intentionally clean control (SAMPLE-010). Because the samples were deliberately constructed to contain known vulnerabilities and hallucinated packages, these figures should be read as a demonstration that the pipeline reliably detects the patterns it targets—not as an estimate of real-world precision or recall. In particular, a single clean control cannot characterize the false-positive rate, which would require a larger corpus of genuine, mixed AI-generated code.
The 50% hallucination rate observed in the test samples (15 of 30 packages) exceeds the 20% reported in academic literature [22], likely because the samples were specifically constructed to include common AI hallucination patterns. In production environments with mixed AI and human code, overall hallucination rates would be lower but still represent significant supply chain risk.
Implementation Details
The ASSF Validator operates as a CLI tool with optional MCP (Model Context Protocol) integration for AI coding assistants. Registry validation uses a hybrid approach: npm packages are validated through DepsShield’s MCP server, which provides not only existence verification but also vulnerability detection and typosquatting identification, while PyPI validation uses direct registry API calls. This architecture demonstrates how existing security tooling can be composed into validation pipelines. Pattern-based security scanning uses regular expressions targeting 15 CWE patterns, balancing detection speed against comprehensiveness. Tiered blocking logic triggers deployment blocks on critical and high-severity findings while generating warnings for medium/low severity, with a strict mode override available. JSON output includes timestamps and validation status for audit trail requirements.
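The pattern-based scanning and tiered blocking described above can be illustrated with a short sketch. It is not the ASSF Validator's source: the three regular expressions are simplified examples invented here, and only the severity assignments (CWE-78 critical, CWE-798 high, CWE-489 medium) follow the groupings reported in the results.

```python
import re
from typing import NamedTuple


class Finding(NamedTuple):
    cwe: str
    severity: str
    line: int


# Illustrative rules only; a real ruleset would be broader and more precise.
RULES = [
    ("CWE-798", "high",
     re.compile(r"\b(api_key|password|secret)\s*=\s*['\"][^'\"]+['\"]", re.IGNORECASE)),
    ("CWE-78", "critical", re.compile(r"\bos\.system\s*\(")),
    ("CWE-489", "medium", re.compile(r"\bdebug\s*=\s*True\b")),
]

BLOCKING = {"critical", "high"}


def scan(source: str) -> list[Finding]:
    """Apply every rule to every line and record the matches."""
    findings = []
    for lineno, text in enumerate(source.splitlines(), start=1):
        for cwe, severity, pattern in RULES:
            if pattern.search(text):
                findings.append(Finding(cwe, severity, lineno))
    return findings


def should_block(findings: list[Finding], strict: bool = False) -> bool:
    """Block on critical/high findings; in strict mode, block on any finding."""
    return any(strict or f.severity in BLOCKING for f in findings)
```

Line-by-line regular expressions are fast but blind to data flow, which is why such a scanner misses vulnerabilities that only appear when context across statements is considered.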
For AI-assisted development workflows, DepsShield can be enabled directly in coding assistants (Claude Desktop, Cursor, Cline, Windsurf) via MCP configuration, allowing the security validation to occur in real-time as developers accept AI-generated code suggestions.
Limitations
This case study has important constraints. The sample size (n=12) is too small to support statistical inference. Samples were constructed to demonstrate vulnerabilities rather than drawn at random from the output of actual AI tools. Pattern-based scanning misses semantic vulnerabilities that require deeper context analysis. Registry validation requires network access and may time out on large dependency trees. The study did not measure the impact on developer time or workflow friction.
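The network dependence noted above is easier to reason about with a concrete sketch. The example below checks whether a package name exists on PyPI through the public JSON endpoint (https://pypi.org/pypi/NAME/json, which returns HTTP 404 for unknown names) and uses an explicit timeout, reporting "unknown" rather than guessing when the registry cannot be reached. The function names, the three-state result, and the injectable opener parameter are assumptions for illustration; the ASSF Validator's own implementation may differ.

```python
import urllib.error
import urllib.parse
import urllib.request
from typing import Callable, Iterable

PYPI_URL = "https://pypi.org/pypi/{name}/json"


def pypi_status(name: str, timeout: float = 5.0,
                opener: Callable = urllib.request.urlopen) -> str:
    """Return 'exists', 'missing', or 'unknown' for a PyPI package name."""
    url = PYPI_URL.format(name=urllib.parse.quote(name, safe=""))
    try:
        with opener(url, timeout=timeout) as response:
            return "exists" if response.status == 200 else "unknown"
    except urllib.error.HTTPError as err:
        # 404 means the registry answered and the package is not there.
        return "missing" if err.code == 404 else "unknown"
    except OSError:
        # Covers URLError, timeouts, and other network failures.
        return "unknown"


def check_dependencies(names: Iterable[str], **kwargs) -> dict[str, str]:
    """Check each name in turn; one request per package."""
    return {name: pypi_status(name, **kwargs) for name in names}


def is_approved(statuses: dict[str, str]) -> bool:
    """Assumed fail-closed policy: anything not confirmed to exist is rejected."""
    return all(status == "exists" for status in statuses.values())
```

A call such as check_dependencies(["requests", "some-invented-name"], timeout=3.0) issues one request per package, so a large dependency tree multiplies both latency and the chance of a timeout. Distinguishing "unknown" from "missing" lets a pipeline decide separately how to treat a hallucinated package and an unreachable registry.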
Reproducibility
The complete ASSF Validator source code and experiment data are available on GitHub at github.com/ganolmc/assf-validator [77]. The experiment can be reproduced by cloning the repository, installing dependencies via pip, and running the case_study_experiment.py script. Output includes JSON-formatted metrics and detailed per-sample results suitable for further analysis.
Conclusion: A Blueprint for Safe SDLC Integration
The evidence supports four structural conclusions for practitioners implementing AI-augmented software delivery.
First, productivity gains are real but not uniform—organizations should expect 26% output increases as a reasonable baseline, with significantly higher gains for junior developers and routine tasks, and minimal or negative returns for senior developers on familiar codebases [3].
Second, security cannot be assumed—the 45-48% vulnerability rate in AI-generated code demands mandatory security scanning in CI/CD pipelines, with particular attention to XSS (86% failure rate), log injection (88% failure rate), and package hallucination (20% non-existent packages) [2] [17] [22]. Human review gates must be preserved, not automated away.
Third, compliance obligations are real but in flux—EU AI Act GPAI obligations took effect August 2, 2025, with full enforcement powers following in August 2026 [4]; U.S. federal SBOM and secure-development requirements were restructured in 2025–2026 into an agency-led, risk-based model that retains EO 14028’s core principles while relaxing the universal attestation mandate [5]; and unresolved copyright litigation creates ongoing legal exposure [6]. Organizations should still implement license scanning, maintain audit trails of AI tool usage, and be prepared to provide SBOMs and developer attestations on request.
Fourth, governance maturity determines outcomes—the gap between leading enterprises (Google’s three-tiered review [34], Microsoft’s defense-in-depth [36], Meta’s Code Shield preventing “tens of thousands” of insecure suggestions [40]) and the median organization (73% with unclear policies [12], 80% bypassing security [8]) represents the key differentiator between AI tools as productivity multipliers versus risk vectors.
The blueprint for safe integration is not abstinence from AI tools (that ship has sailed, with 76% of developers using or planning to use them [1]) but rather systematic treatment of AI-generated code as untrusted input requiring validation, with governance proportional to the substantial quality and security trade-offs documented in the evidence reviewed here.
About the Author
Mykhailo Hanol is a Software Engineer with 7+ years building secure, scalable web applications across e-commerce, fintech, and blockchain. His expertise spans frontend architecture, security, and AI-assisted development workflows.
His work sits at the frontier where web engineering meets applied AI. He specializes in AI-native software architecture — designing systems where autonomous agents are built into the development process itself, and where software is structured to be built, operated, and extended by intelligent agents rather than humans alone. He’s especially focused on two of the field’s hardest open problems: how AI systems retain persistent, compounding memory, and how autonomous agents can carry out complex, multi-stage work reliably and within safe boundaries.
On the frontend, he focuses on high-performance reactive architecture — fine-grained signal-based reactivity, minimal runtimes, and design systems where accessibility and performance hold by default rather than as an afterthought. Security runs through all of it: encryption, key management, multi-tenant isolation, and defense-in-depth access control, shaped by years in fintech and blockchain where data integrity is non-negotiable.
What ties his work together is the convergence of these three disciplines — reactive web architecture, applied AI, and security — into one practice, aimed at how the next generation of intelligent software gets built: secure by design, fast by default, and made to work hand in hand with autonomous AI.
References
[1] Stack Overflow. “2024 Stack Overflow Developer Survey.” https://survey.stackoverflow.co/2024/
[2] Georgetown University CSET. “Cybersecurity Risks of AI-Generated Code.” November 2024. https://cset.georgetown.edu/publication/cybersecurity-risks-of-ai-generated-code/
[3] InfoQ. “Study Shows AI Coding Assistant Improves Developer Productivity.” September 2024. https://www.infoq.com/news/2024/09/copilot-developer-productivity/
[4] EU Artificial Intelligence Act. “Overview of the Code of Practice.” https://artificialintelligenceact.eu/code-of-practice-overview/
[5] NIST. “Software Security in Supply Chains: Software Bill of Materials (SBOM).” https://www.nist.gov/itl/executive-order-14028-improving-nations-cybersecurity/software-security-supply-chains-software-1
[6] Joseph Saveri Law Firm. “GitHub Copilot Intellectual Property Litigation.” https://www.saverilawfirm.com/our-cases/github-copilot-intellectual-property-litigation
[7] TechCrunch. “GitHub Copilot crosses 20M all-time users.” July 2025. https://techcrunch.com/2025/07/30/github-copilot-crosses-20-million-all-time-users/
[8] Snyk. “AI Code Security Report.” https://www.snyk.io/reports/ai-code-security/
[9] Second Talent. “GitHub Copilot Statistics & Adoption Trends [2025].” https://www.secondtalent.com/resources/github-copilot-statistics/
[10] Dataconomy. “GitHub Copilot Now Has Over 20 Million Users.” July 2025. https://dataconomy.com/2025/07/31/github-copilot-now-has-over-20-million-users/
[11] GitHub Blog. “Research: Quantifying GitHub Copilot’s impact in the enterprise with Accenture.” May 2024. https://github.blog/news-insights/research/research-quantifying-github-copilots-impact-in-the-enterprise-with-accenture/
[12] Stack Overflow Blog. “Developers get by with a little help from AI.” May 2024. https://stackoverflow.blog/2024/05/29/developers-get-by-with-a-little-help-from-ai-stack-overflow-knows-code-assistant-pulse-survey-results/
[13] DX Newsletter. “What three experiments tell us about Copilot’s impact on productivity.” September 2024. https://newsletter.getdx.com/p/copilot-impact-on-productivity
[14] McKinsey. “Unleashing developer productivity with generative AI.” https://www.mckinsey.com/capabilities/mckinsey-digital/our-insights/unleashing-developer-productivity-with-generative-ai
[15] METR. “Early 2025 AI Experienced OS Dev Study.” July 2025. https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/
[16] McKinsey. “Measuring AI in software development: Interview with Jellyfish CEO.” https://www.mckinsey.com/capabilities/mckinsey-technology/our-insights/measuring-ai-in-software-development-interview-with-jellyfish-ceo-andrew-lau
[17] Veracode. “AI-Generated Code Security Risks: What Developers Must Know.” September 2025. https://www.veracode.com/blog/ai-generated-code-security-risks/
[18] PMC/NIH. “A systematic literature review on the impact of AI models on the security of code generation.” https://pmc.ncbi.nlm.nih.gov/articles/PMC11128619/
[19] O’Reilly. “When AI Writes Code, Who Secures It?” https://www.oreilly.com/radar/when-ai-writes-code-who-secures-it/
[20] Pillar Security. “New Vulnerability in GitHub Copilot and Cursor: How Hackers Can Weaponize Code Agents.” March 2025. https://www.pillar.security/blog/new-vulnerability-in-github-copilot-and-cursor-how-hackers-can-weaponize-code-agents
[21] Oligo Security. “OWASP Top 10 LLM, Updated 2025: Examples & Mitigation Strategies.” https://www.oligo.security/academy/owasp-top-10-llm-updated-2025-examples-and-mitigation-strategies
[22] Bleeping Computer. “AI-hallucinated code dependencies become new supply chain risk.” https://www.bleepingcomputer.com/news/security/ai-hallucinated-code-dependencies-become-new-supply-chain-risk/
[23] HackerOne. “Slopsquatting: AI’s Contribution to Supply Chain Attacks.” https://www.hackerone.com/blog/ai-slopsquatting-supply-chain-security
[24] IDC Blog. “Package Hallucination: The Latest, Greatest Software Supply Chain Security Threat?” April 2024. https://blogs.idc.com/2024/04/22/package-hallucination-the-latest-greatest-software-supply-chain-security-threat/
[25] GlobeNewswire. “Sonatype’s 10th Annual State of the Software Supply Chain Report Reveals 156% Surge in Open Source Malware.” October 2024. https://www.globenewswire.com/news-release/2024/10/10/2961239/0/en/Sonatype-s-10th-Annual-State-of-the-Software-Supply-Chain-Report-Reveals-156-Surge-in-Open-Source-Malware.html
[26] ActiveState. “Is AI-Generated Code Poisoning Your Software Supply Chain?” https://www.activestate.com/blog/is-ai-generated-code-poisoning-your-software-supply-chain/
[27] Safe AI Newsletter. “AI Safety Newsletter #59: EU Publishes General-Purpose AI Code of Practice.” https://newsletter.safe.ai/p/ai-safety-newsletter-59-eu-publishes
[28] SAP LeanIX. “SBOMs: What Does EO 14028 Actually Mean For You?” https://www.leanix.net/en/blog/sboms-eo-14028
[29] TechNadu. “Software Supply Chain Security: AI Code Risks, Secure SDLC, SBOM Validation.” https://www.technadu.com/the-imperative-of-software-supply-chain-security-ai-generated-code-risks-secure-sdlc-practices-and-sbom-validation/615999/
[30] PCI Security Standards Council. “AI Principles: Securing the Use of AI in Payment Environments.” September 2025. https://blog.pcisecuritystandards.org/ai-principles-securing-the-use-of-ai-in-payment-environments
[31] PMC/NIH. “AI Chatbots and Challenges of HIPAA Compliance for AI Developers and Vendors.” https://pmc.ncbi.nlm.nih.gov/articles/PMC10937180/
[32] Legal.io. “Judge Throws Out Majority of Claims in GitHub Copilot Lawsuit.” https://www.legal.io/articles/5516216/Judge-Throws-Out-Majority-of-Claims-in-GitHub-Copilot-Lawsuit
[33] Syracuse Law Review. “Update in Copilot Copyright Claim may affect Future Challenges of Artificial Intelligence.” https://lawreview.syr.edu/update-in-copilot-copyright-claim-may-affect-future-challenges-of-artificial-intelligence/
[34] Google AI. “AI Principles.” https://ai.google/responsibility/principles/
[35] Google Cloud. “Gen AI governance: 10 tips to level up your AI program.” https://cloud.google.com/transform/gen-ai-governance-10-tips-to-level-up-your-ai-program
[36] Microsoft Learn. “Security for Microsoft 365 Copilot.” https://learn.microsoft.com/en-us/copilot/microsoft-365/microsoft-365-copilot-ai-security
[37] Microsoft Learn. “Copilot Control System Security and Governance.” https://learn.microsoft.com/en-us/copilot/microsoft-365/copilot-control-system/security-governance
[38] Data Studios. “Microsoft Copilot enterprise security: configurations and best practices in 2025.” https://www.datastudios.org/post/microsoft-copilot-enterprise-security-configurations-and-best-practices-in-2025
[39] Microsoft Inside Track. “How we’re tackling Microsoft 365 Copilot governance internally at Microsoft.” https://www.microsoft.com/insidetrack/blog/how-were-tackling-microsoft-365-copilot-governance-internally-at-microsoft/
[40] Meta AI. “Our responsible approach to Meta AI and Meta Llama 3.” https://ai.meta.com/blog/meta-llama-3-meta-ai-responsibility/
[41] Meta AI. “Expanding our open source large language models responsibly.” https://ai.meta.com/blog/meta-llama-3-1-ai-responsibility/
[42] AWS Documentation. “Security scans – CodeWhisperer.” https://docs.aws.amazon.com/codewhisperer/latest/userguide/security-scans.html
[43] AWS Security Blog. “Use CodeWhisperer to identify issues and use suggestions to improve code security in your IDE.” https://aws.amazon.com/blogs/security/use-codewhisperer-to-identify-issues-and-use-suggestions-to-improve-code-security-in-your-ide/
[44] Augment Code. “5 CI/CD Pipeline Integrations Every AI Coding Tool Should Support.” https://www.augmentcode.com/guides/5-ci-cd-pipeline-integrations-every-ai-coding-tool-should-support
[45] Speedscale. “Testing AI Code in CI/CD Made Simple for Developers.” https://speedscale.com/blog/testing-ai-code-in-cicd-made-simple-for-developers/
[46] Medium. “Building an AI-native CI/CD pipeline: Generative AI for automated code review and security scanning.” October 2025. https://medium.com/@naeemulhaq/building-an-ai-native-ci-cd-pipeline-generative-ai-for-automated-code-review-and-security-scanning-ea6ab8255616
[47] All Days Tech. “Human-in-the-Loop AI Review Queues: Workflow Patterns That Scale (2025).” https://alldaystech.com/guides/artificial-intelligence/human-in-the-loop-ai-review-queue-workflows
[48] Permit.io. “Human-in-the-Loop for AI Agents: Best Practices, Frameworks, Use Cases, and Demo.” https://www.permit.io/blog/human-in-the-loop-for-ai-agents-best-practices-frameworks-use-cases-and-demo
[49] Klu. “HumanEval Benchmark.” https://klu.ai/glossary/humaneval-benchmark
[50] arXiv. “HumanEval-XL: A Multilingual Code Generation Benchmark.” https://arxiv.org/abs/2402.16694
[51] Hugging Face. “BigCodeBench: The Next Generation of HumanEval.” https://huggingface.co/blog/leaderboard-bigcodebench
[52] GitHub. “SWE-bench: Can Language Models Resolve Real-world Github Issues?” https://github.com/SWE-bench/SWE-bench
[53] Scale AI. “SWE-Bench Pro (Public Dataset).” https://scale.com/leaderboard/swe_bench_pro_public
[54] arXiv. “SWE-Bench+: Enhanced Coding Benchmark for LLMs.” https://arxiv.org/html/2410.06992v1
[55] Nebius. “SWE-rebench: A continuously updated benchmark for SWE LLMs.” https://nebius.com/blog/posts/introducing-swe-rebench
[56] arXiv. “Guiding AI to Fix Its Own Flaws: An Empirical Study on LLM-Driven Secure Code Generation.” https://arxiv.org/html/2506.23034v1
[57] ADS/Harvard. “Exploring and Evaluating Hallucinations in LLM-Powered Code Generation.” https://ui.adsabs.harvard.edu/abs/2024arXiv240400971L/abstract
[58] Stanford AI Lab. “Clover: Closed-Loop Verifiable Code Generation.” https://ai.stanford.edu/blog/clover/
[59] arXiv. “Towards Formal Verification of LLM-Generated Code from Natural Language Prompts.” https://arxiv.org/html/2507.13290
[60] NeurIPS MathAI Workshop. “Lemur: Integrating Large Language Models in Automated Program Verification.” https://mathai2023.github.io/papers/28.pdf
[61] CodeRAG-Bench. “Can Retrieval Augment Code Generation?” https://code-rag-bench.github.io/
[62] ARKS. “Retrieval-Augmented Code Generation.” https://arks-codegen.github.io/
[63] arXiv. “When Code Crosses Borders: A Security-Centric Study of LLM-based Code Translation.” https://arxiv.org/html/2509.06504v2
[64] ACM Digital Library. “LLM-Based Multi-Agent Systems for Software Engineering: Literature Review, Vision, and the Road Ahead.” https://dl.acm.org/doi/10.1145/3712003
[65] GitHub. “AwesomeLLM4APR: A Systematic Literature Review on Large Language Models for Automated Program Repair.” https://github.com/iSEngLab/AwesomeLLM4APR
[66] LeadDev. “How AI generated code compounds technical debt.” https://leaddev.com/software-quality/how-ai-generated-code-accelerates-technical-debt
[67] Sonar. “The inevitable rise of poor code quality in AI-accelerated codebases.” https://www.sonarsource.com/blog/the-inevitable-rise-of-poor-code-quality-in-ai-accelerated-codebases/
[68] InfoQ. “AI-Generated Code Creates New Wave of Technical Debt, Report Finds.” November 2025. https://www.infoq.com/news/2025/11/ai-code-technical-debt/
[69] DevOps.com. “AI in Software Development: Productivity at the Cost of Code Quality?” https://devops.com/ai-in-software-development-productivity-at-the-cost-of-code-quality/
[70] ScienceDirect. “Technical debt in AI-enabled systems: On the prevalence, severity, impact, and management strategies for code and architecture.” https://www.sciencedirect.com/science/article/pii/S0164121224001961
[71] OpenSSF. “Security-Focused Guide for AI Code Assistant Instructions.” August 2025. https://best.openssf.org/Security-Focused-Guide-for-AI-Code-Assistant-Instructions
[72] William OGOU Blog. “The Prompt That Turns Your AI Coder into a Security Expert.” https://blog.ogwilliam.com/post/secure-ai-code-assistant-prompts
[73] Obsidian Security. “Prompt Injection Attacks: The Most Common AI Exploit in 2025.” https://www.obsidiansecurity.com/blog/prompt-injection
[74] Snyk Learn. “OWASP Top 10 LLM and GenAI.” https://learn.snyk.io/learning-paths/owasp-top-10-llm/
[75] Cisco Blogs. “Announcing a New Framework for Securing AI-Generated Code.” 2025. https://blogs.cisco.com/ai/announcing-new-framework-securing-ai-generated-code
[76] DepsShield. “Software Composition Analysis for AI Coding Assistants.” https://depsshield.com
[77] M. Hanol. “ASSF Validator.” GitHub repository, 2026. https://github.com/ganolmc/assf-validator
[78] The White House. “Sustaining Select Efforts to Strengthen the Nation’s Cybersecurity and Amending Executive Order 13694 and Executive Order 14144.” Executive Order 14306, June 6, 2025.
[79] U.S. Office of Management and Budget. “Memorandum M-26-05: Software Security and Supply Chain Risk Management.” January 2026.



