Beyond Confidentiality: The AI Data War Between Law Firms and Clients

By Beverly Rich, Vice President of AI & Innovation, UnitedLex

As generative AI transforms the practice of law, an unresolved question looms: who owns and controls the data that fuels these systems? For decades, law firms and corporate legal departments have operated under well-defined boundaries of client confidentiality and work product protection. But as firms begin using AI tools that rely on data aggregation and fine-tuning, those boundaries blur. The value of legal data has shifted from evidentiary substance to strategic infrastructure.

Understanding who can use it, and under what circumstances, is now a defining issue of the modern legal industry. 

The Core Tension: Clients Own the Data and Firms Create the Work Product 

At its simplest, the data powering legal AI falls into two categories: client data and law firm-generated data. Clients own their underlying information, such as contracts, discovery documents, communications, transaction details, and case files, as well as the output and deliverables from outside counsel that they have paid for. Law firms, on the other hand, may own derived work product such as drafts, research notes, and summaries, though those too may be governed by confidentiality and professional conduct rules.

This distinction matters because many law firm AI use cases, such as contract review, litigation analytics, due diligence, and e-discovery, depend on training or otherwise using (e.g., for fine-tuning) large language models (LLMs) and/or retrieval-augmented generation (RAG) pipelines with a mixture of client deliverables and internal work product.

If a law firm builds or fine-tunes an AI model using client data, it could inadvertently violate client confidentiality or intellectual property rights unless expressly permitted. In contrast, in-house legal departments that sit closer to the data source often view that same dataset as a corporate asset.

They are more likely to want to use their data to train proprietary AI tools that enhance decision-making, risk prediction, or portfolio management. 

So, the questions emerge: can both the law firm and the client use the same data to train AI models? What happens if they both do? Is enforcement possible? Probable? The answers may depend less on technology and more on contract language. 

The Contractual Layer: What Provisions Matter 

The key provisions that govern data use in AI are scattered across several types of documents. These typically include engagement letters, outside counsel guidelines (OCGs), vendor and cloud agreements, and AI pilot or development agreements. 

Engagement letters and OCGs set baseline terms around confidentiality, data retention, and use of client information. Increasingly, OCGs include explicit prohibitions on uploading client data into AI systems that might use the data to train underlying models. 

Vendor and cloud agreements determine whether data is stored in private environments, whether it leaves a specified jurisdiction, and whether it may be used to train or improve the provider’s services. AI pilot or development agreements typically define who owns derivative outputs and improvements. 

Key clauses to watch include data ownership and license-back rights, use restrictions, and confidentiality and anonymization standards. 

What Counts as “Safe” Data Use 

Law firms often assume that anonymization resolves data ownership and usage concerns. After stripping identifiers or aggregating data, many believe the resulting dataset can be freely used for internal AI training. In reality, anonymization is a moving target; it does not automatically remove client sensitivity or eliminate contractual restrictions. Even when direct identifiers are removed, matters can remain re-identifiable, particularly when (1) the underlying dispute is public, (2) the dataset is small or unique, or (3) the fact patterns themselves function as identifiers. As a result, anonymized data does not guarantee firm ownership or unrestricted reuse unless the client agreement expressly allows it.
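
To make the re-identification risk concrete, consider a minimal, illustrative check in Python. It assumes a hypothetical tabular matter dataset with made-up fields; the point is simply that combinations of remaining attributes (the quasi-identifiers) can single out a matter even after names and matter numbers are stripped.

```python
# Illustrative k-anonymity check: even with names and matter numbers stripped,
# combinations of remaining attributes ("quasi-identifiers") can single out a
# matter. All field names and records here are hypothetical.
from collections import Counter

records = [
    {"practice_area": "antitrust",  "jurisdiction": "N.D. Cal.", "deal_size": "1B+",  "year": 2023},
    {"practice_area": "antitrust",  "jurisdiction": "N.D. Cal.", "deal_size": "1B+",  "year": 2023},
    {"practice_area": "employment", "jurisdiction": "S.D.N.Y.",  "deal_size": "<10M", "year": 2022},
]

QUASI_IDENTIFIERS = ("practice_area", "jurisdiction", "deal_size", "year")

def k_anonymity(rows, keys):
    """Return the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[k] for k in keys) for row in rows)
    return min(groups.values())

k = k_anonymity(records, QUASI_IDENTIFIERS)
print(f"k-anonymity = {k}")  # k = 1 means at least one matter is unique and potentially re-identifiable
```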

A better lens is data governance, where processing occurs within a firm-controlled or vendor-segregated cloud instance under contractual guarantees that client data will not train external foundation models. It is crucial to note that most current enterprise-grade tools do not use inputs to improve their base models and maintain strict data-isolation controls. Firms should leverage vendor security documentation (SOC 2 Type II reports, ISO 27001 certifications, DPAs with model-training exclusions, and environment architecture diagrams) to dispel this persistent confidentiality concern and separate technical reality from common client fear.

The safest path ultimately relies on consent and transparency, moving beyond reliance on de-identification alone. This means clearly documenting: (1) how data will be used, (2) where it is stored and processed, (3) whether it remains in a single-tenant or region-locked environment, and (4) that no third-party model training or cross-matter data blending occurs. This governance-first approach substantially mitigates risk.
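
One way to operationalize that documentation is to keep a machine-readable record per matter. Below is a minimal sketch in Python; the field names and the consent reference are hypothetical, not drawn from any particular vendor contract or set of OCGs.

```python
# A minimal sketch of recording the governance-first documentation described
# above as a machine-readable record per matter. Field names are hypothetical.
from dataclasses import dataclass

@dataclass
class DataUseRecord:
    matter_id: str
    permitted_uses: list[str]             # (1) how the data will be used
    storage_region: str                   # (2) where it is stored and processed
    single_tenant: bool                   # (3) single-tenant / region-locked environment
    third_party_training_excluded: bool   # (4) no external foundation-model training
    cross_matter_blending_excluded: bool  # (4) no cross-matter data blending
    client_consent_reference: str = ""    # pointer to the engagement letter or OCG clause

record = DataUseRecord(
    matter_id="2024-0001",
    permitted_uses=["contract review", "internal RAG retrieval"],
    storage_region="us-east (region-locked)",
    single_tenant=True,
    third_party_training_excluded=True,
    cross_matter_blending_excluded=True,
    client_consent_reference="Engagement letter, data-use clause (hypothetical)",
)
```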

Policing the Boundary 

Even with clear rules, enforcement is tricky. How can a client verify that its data isn’t being used to train a firm’s internal or vendor model? And how can firms prevent well-intentioned employees from inadvertently breaching these boundaries through tool usage? 

Policing this requires a combination of technical controls (segmented instances, audit logs, and data usage dashboards) and contractual accountability (attestations, audit rights, and breach remedies).  

Firms should implement governance layers that track which datasets are used to fine-tune models, who authorized their use, and whether consent was obtained. From the client’s side, periodic audits or certifications, such as SOC 2 or ISO 27001 attestations, can provide assurance that their data remains quarantined from model improvement cycles. 
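
A simple version of such a governance layer is an append-only audit log of fine-tuning events. The sketch below, with hypothetical file paths and names, records which dataset fed which model, who authorized it, and where consent is documented; a production system would add tamper-evident storage and access controls.

```python
# A minimal sketch of a fine-tuning audit trail: which dataset fed which model,
# who authorized it, and where client consent is documented. Paths and names
# are hypothetical.
import json
import hashlib
from datetime import datetime, timezone

AUDIT_LOG = "finetune_audit.jsonl"

def log_finetune_event(model_name, dataset_path, authorized_by, consent_doc):
    """Append one fine-tuning event, fingerprinting the dataset for later verification."""
    with open(dataset_path, "rb") as f:
        dataset_hash = hashlib.sha256(f.read()).hexdigest()
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model_name,
        "dataset": dataset_path,
        "dataset_sha256": dataset_hash,
        "authorized_by": authorized_by,
        "consent_document": consent_doc,  # e.g., signed client consent or an OCG carve-out
    }
    with open(AUDIT_LOG, "a") as log:
        log.write(json.dumps(event) + "\n")

# Example usage (hypothetical paths and names):
# log_finetune_event("firm-internal-llm-v2", "datasets/redacted_briefs.jsonl",
#                    "AI Governance Committee", "Client consent letter, March 2024")
```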

Is This About Privacy or Power? 

While these debates are often framed in terms of privacy, the deeper issue is control and competitive advantage. Client data represents institutional knowledge about market terms, litigation strategies, and pricing norms. 

Allowing law firms to train AI models on client data could erode that advantage by arming outside counsel, or even competitors, with insights derived from proprietary transactions or disputes. From the law firm's perspective, restricting data use limits the ability to build predictive tools or automate future matters. These limitations may create a temporary asymmetry in which in-house legal teams have richer datasets while firms are left with fragmented or lower-quality data. There may be real opportunity for firms that successfully convert their own work product, rather than client deliverables, into useful training data.

In short, this debate is not just about privacy. It’s about who gets to own the AI learning curve in law. 

How the Tension May Resolve 

Several paths are emerging to balance ownership, innovation, and client protection. Joint development agreements (JDAs) allow clients and firms to co-develop AI tools using shared or partitioned datasets with clearly defined ownership. 

Data escrow or clawback provisions give clients the ability to revoke or require deletion of their data. Federated learning approaches enable firms and clients to train models locally while sharing only model parameters. 
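
To illustrate what "sharing only model parameters" means in practice, here is a toy federated-averaging sketch in NumPy. It is a simplified illustration under assumed data and a linear model, not a production federated-learning stack: each party trains locally, and only parameter updates are aggregated.

```python
# Toy federated averaging (FedAvg): each party trains on its own data and
# shares only parameter updates, never raw documents. Illustrative only.
import numpy as np

def local_update(weights, local_data, lr=0.1):
    """One local training step on a party's own data (toy linear regression)."""
    X, y = local_data
    grad = X.T @ (X @ weights - y) / len(y)   # gradient of mean squared error
    return weights - lr * grad                 # updated weights; raw data never leaves the party

def federated_average(weight_list, sizes):
    """Aggregate parameters only, weighted by each party's dataset size."""
    total = sum(sizes)
    return sum(w * (n / total) for w, n in zip(weight_list, sizes))

rng = np.random.default_rng(0)
global_w = np.zeros(3)
parties = [(rng.normal(size=(20, 3)), rng.normal(size=20)),   # e.g., the client's data
           (rng.normal(size=(50, 3)), rng.normal(size=50))]   # e.g., the firm's work product

for _ in range(10):                            # a few federated rounds
    local_weights = [local_update(global_w, d) for d in parties]
    global_w = federated_average(local_weights, [len(d[1]) for d in parties])

print(global_w)  # shared model parameters; neither party saw the other's raw data
```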

Data licensing frameworks allow clients to license de-identified data back to firms for specific applications. Ultimately, ownership will depend on who invests in curation, who controls access, and who bears compliance risk. 

The Public Data Frontier 

Not all data is subject to ownership constraints. Litigation filings, court decisions, and regulatory materials are generally public, though firms should still tread carefully. 

Even public data can contain personal information protected by privacy laws. Proprietary analysis layered on top of public records can itself become owned intellectual property. 

The rise of “public-plus” datasets (public materials enriched by proprietary tagging or summarization) is creating new commercial opportunities and new conflicts. The line between public record and proprietary insight may be one of the next battlegrounds.

The Way Forward 

Data ownership in the age of AI is not simply a legal drafting challenge; it is a governance challenge. Firms and in-house teams must jointly define what ethical, secure, and value-creating data use looks like. 

The firms that succeed will treat data not merely as fuel but as a strategic asset, leveraging their ability to convert work product into a unique competitive advantage. This ultimately means moving beyond a protectionist mindset to one of proactive data stewardship that will define the next generation of AI-enabled law.  
