
Beyond Confidentiality: The AI Data War Between Law Firms and Clients

By Beverly Rich, Vice President of AI & Innovation, UnitedLex

As generative AI transforms the practice of law, an open question remains: who owns and controls the data that fuels these systems? For decades, law firms and corporate legal departments have operated under well-defined boundaries of client confidentiality and work product protection. But as firms begin using AI tools that rely on data aggregation and fine-tuning, those boundaries blur. The value of legal data has shifted from evidentiary substance to strategic infrastructure.

Understanding who can use it, and under what circumstances, is now a defining issue of the modern legal industry.

The Core Tension: Clients Own the Data and Firms Create the Work Product

At its simplest, the data powering legal AI falls into two categories: "client" data and law firm-generated data. Clients own their underlying information, such as contracts, discovery documents, communications, transaction details, and case files, as well as the outputs and deliverables from outside counsel that they have paid for. Law firms, on the other hand, may own derived work product such as drafts, research notes, and summaries, though those too may be governed by confidentiality and professional conduct rules.

This distinction matters because many law firm AI use cases, such as contract review, litigation analytics, due diligence, and e-discovery, depend on training or otherwise using (e.g., for fine-tuning) large language models (LLMs) and/or retrieval-augmented generation (RAG) pipelines with a mixture of client deliverables and internal work product.

If a law firm builds or fine-tunes an AI model using client data, it could inadvertently violate client confidentiality or intellectual property rights unless expressly permitted. In contrast, in-house legal departments that sit closer to the data source often view that same dataset as a corporate asset.

They are more likely to want to use their data to train proprietary AI tools that enhance decision-making, risk prediction, or portfolio management.

So, the questions emerge: can both the law firm and the client use the same data to train AI models? What happens if they both do? Is enforcement possible? Probable? The answers may depend less on technology and more on contract language.

The Contractual Layer: What Provisions Matter

The key provisions that govern data use in AI are scattered across several types of documents. These typically include engagement letters, outside counsel guidelines (OCGs), vendor and cloud agreements, and AI pilot or development agreements.

Engagement letters and OCGs set baseline terms around confidentiality, data retention, and use of client information. Increasingly, OCGs include explicit prohibitions on uploading client data into AI systems that might use the data to train underlying models.

Vendor and cloud agreements determine whether data is stored in private environments, whether it leaves a specified jurisdiction, and whether it may be used to train or improve the provider's services. AI pilot or development agreements typically define who owns derivative outputs and improvements.

Key clauses to watch include data ownership and license-back rights, use restrictions, and confidentiality and anonymization standards.

What Counts as "Safe" Data Use

Law firms often assume that anonymization resolves data ownership and usage concerns. After stripping identifiers or aggregating data, many believe the resulting dataset can be freely used for internal AI training. In reality, anonymization is a moving target; it does not automatically remove client sensitivity or eliminate contractual restrictions. Even when direct identifiers are removed, matters can remain re-identifiable, particularly when (1) the underlying dispute is public, (2) the dataset is small or unique, or (3) the fact patterns themselves function as identifiers. As a result, anonymized data does not guarantee firm ownership or unrestricted reuse unless the client agreement expressly allows it.
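To make the re-identification risk concrete, here is a minimal Python sketch of a k-anonymity check, one standard way to measure how unique each record is. The matter dataset is entirely hypothetical:

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Smallest group of records sharing the same quasi-identifier values;
    a record in a group of size 1 is trivially re-identifiable."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

# Hypothetical "anonymized" matter summaries: party names stripped, but the
# remaining facts still function as identifiers when the dataset is small.
matters = [
    {"industry": "aerospace", "venue": "E.D. Tex.", "claim": "patent"},
    {"industry": "retail",    "venue": "S.D.N.Y.",  "claim": "contract"},
    {"industry": "retail",    "venue": "S.D.N.Y.",  "claim": "contract"},
]

print(k_anonymity(matters, ["industry", "venue", "claim"]))
# -> 1: the aerospace matter is unique, so anyone who knows the public
# dispute can link the "anonymized" record back to the client.
```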

A better lens is data governance, where processing occurs within a firm-controlled or vendor-segregated cloud instance under contractual guarantees that client data will not train external foundation models. It is crucial to note that most current enterprise-grade tools do not use inputs to improve their base models and maintain strict data-isolation controls. Firms should use vendor security documentation (SOC 2 Type II reports, ISO 27001 certifications, DPAs with model-training exclusions, and environment architecture diagrams) to dispel this persistent confidentiality concern; this distinction separates technical reality from common client fear.

The safest path ultimately relies on consent and transparency, moving beyond reliance on de-identification alone. That means clearly documenting (1) how data will be used, (2) where it is stored and processed, (3) whether it remains in a single-tenant or region-locked environment, and (4) that no third-party model training or cross-matter data blending occurs. This governance-first approach substantially mitigates risk.
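As an illustration only, a hypothetical sketch of how the four documented guarantees above might be encoded as an internal vendor-environment check (every field name here is an assumption, not any vendor's actual API):

```python
from dataclasses import dataclass

@dataclass
class VendorEnvironment:
    """Hypothetical record of a vendor's attested data-handling posture."""
    single_tenant: bool
    region_locked: bool
    trains_foundation_models: bool
    cross_matter_blending: bool
    soc2_type2_current: bool

def failed_guarantees(env: VendorEnvironment) -> list[str]:
    """Return the client guarantees this environment fails; empty if compliant."""
    failures = []
    if not env.single_tenant:
        failures.append("data is not isolated in a single-tenant instance")
    if not env.region_locked:
        failures.append("data may leave the agreed jurisdiction")
    if env.trains_foundation_models:
        failures.append("inputs may train external foundation models")
    if env.cross_matter_blending:
        failures.append("matters are not segregated from one another")
    if not env.soc2_type2_current:
        failures.append("no current SOC 2 Type II attestation on file")
    return failures
```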

Policing the Boundary

Even with clear rules, enforcement is tricky. How can a client verify that its data isn't being used to train a firm's internal or vendor model? And how can firms prevent well-intentioned employees from inadvertently breaching these boundaries through tool usage?

Policing this requires a combination of technical controls (segmented instances, audit logs, and data usage dashboards) and contractual accountability (attestations, audit rights, and breach remedies).

Firms should implement governance layers that track which datasets are used to fine-tune models, who authorized their use, and whether consent was obtained. From the client's side, periodic audits or certifications, such as SOC 2 or ISO 27001 attestations, can provide assurance that their data remains quarantined from model-improvement cycles.
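A minimal sketch of what such a governance layer might record, assuming a simple append-only log (all field names are illustrative, not a reference to any particular product):

```python
import datetime
import json
from dataclasses import asdict, dataclass, field

@dataclass
class DatasetUsageRecord:
    """Hypothetical audit-log entry for a single fine-tuning run."""
    dataset_id: str
    matter_ids: list[str]     # matters whose data the dataset draws on
    model_name: str
    authorized_by: str        # who approved the use
    consent_reference: str    # engagement letter / OCG clause relied upon
    timestamp: str = field(
        default_factory=lambda: datetime.datetime.now(
            datetime.timezone.utc
        ).isoformat()
    )

def log_usage(record: DatasetUsageRecord, path: str = "usage_audit.jsonl") -> None:
    """Append-only JSON Lines log, so a later audit can replay the history."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record)) + "\n")
```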

Is This About Privacy or Power?

While these debates are often framed in terms of privacy, the deeper issue is control and competitive advantage. Client data represents institutional knowledge about market terms, litigation strategies, and pricing norms.

Allowing law firms to train AI models on client data could erode that advantage by arming outside counsel, or even competitors, with insights derived from proprietary transactions or disputes. From the law firm perspective, restricting data use limits their ability to build predictive tools or automate future matters. These limitations may create a temporary asymmetry in which in-house legal teams have richer datasets while firms are left with fragmented or lower-quality data. There may be real value for firms that successfully convert their own work product, rather than client deliverables, into useful training data.

In short, this debate is not just about privacy. It's about who gets to own the AI learning curve in law.

How the Tension May Resolve

Several paths are emerging to balance ownership, innovation, and client protection. Joint development agreements (JDAs) allow clients and firms to co-develop AI tools using shared or partitioned datasets with clearly defined ownership.

Data escrow or clawback provisions give clients the ability to revoke access to their data or require its deletion. Federated learning approaches enable firms and clients to train models locally while sharing only model parameters.
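To illustrate the federated approach, here is a minimal sketch of federated averaging using NumPy and synthetic, hypothetical datasets: each party takes a training step on data that never leaves its environment, and only the updated parameters are shared and averaged.

```python
import numpy as np

def local_step(weights, X, y, lr=0.1):
    """One gradient step of linear regression on a party's private data."""
    grad = 2 * X.T @ (X @ weights - y) / len(y)
    return weights - lr * grad

def federated_round(weights, parties):
    """Each party trains locally; only updated parameters leave its walls."""
    updates = [local_step(weights, X, y) for X, y in parties]
    return np.mean(updates, axis=0)  # average parameters, never raw data

rng = np.random.default_rng(0)
true_w = np.array([1.5, -2.0])
# Hypothetical private datasets: one held by the firm, one by the client.
parties = []
for _ in range(2):
    X = rng.normal(size=(50, 2))
    parties.append((X, X @ true_w + rng.normal(scale=0.1, size=50)))

w = np.zeros(2)
for _ in range(100):
    w = federated_round(w, parties)
print(w)  # approaches true_w although neither dataset was ever shared
```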

Data licensing frameworks allow clients to license de-identified data back to firms for specific applications. Ultimately, ownership will depend on who invests in curation, who controls access, and who bears compliance risk.

The Public Data Frontier

Not all data is subject to ownership constraints. Litigation filings, court decisions, and regulatory materials are generally public, though firms should still tread carefully.

Even public data can contain personal information protected by privacy laws. And proprietary analysis layered on top of public records can itself become protectable intellectual property.

The rise of "public-plus" datasets, public materials enriched by proprietary tagging or summarization, is creating new commercial opportunities and new conflicts. The line between public record and proprietary insight may be one of the next battlegrounds.

The Way Forward

Data ownership in the age of AI is not simply a legal drafting challenge; it is a governance challenge. Firms and in-house teams must jointly define what ethical, secure, and value-creating data use looks like.

The firms that succeed will treat data not merely as fuel but as a strategic asset, leveraging their ability to convert work product into a unique competitive advantage. This ultimately means moving beyond a protectionist mindset to one of proactive data stewardship, which will define the next generation of AI-enabled law.
