I haven’t opened Instagram in months. But when I recently returned to document my founder journey, I found my feed unrecognizable: flooded with AI-generated influencers so realistic they could pass for your college roommate. “I’m fake. Create ads with me,” one captioned provocatively. “Generate fashion model lookbooks with one click,” promised another, showcasing impossibly perfect bodies in designer clothes.
The ability to produce content quickly and efficiently represents genuine innovation. These tools unlock creative expression that pushes beyond human imagination’s limits. But here’s the question we’re not asking: How did AI learn to render humans so perfectly? How can a simple text prompt conjure a photorealistic face, complete with subtle skin texture, natural lighting, and that ineffable quality that makes someone look real? Let’s dive in.
The Machine Learning Behind the Magic
The answer lies in a process called deep learning. AI image models are trained on vast datasets – millions or billions of images paired with text descriptions. Through countless iterations, these models learn to recognize patterns: what “smiling woman in sunlight” looks like, how fabric drapes on a body, the way shadows fall on a cheekbone.
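To make that concrete, here is a deliberately tiny Python sketch of the idea at the heart of diffusion-style image models: corrupt an image with random noise, condition on its caption, and train a network to predict the noise so it can later reverse the process. Everything below is a toy stand-in for illustration, not how any production model is actually built.

```python
import torch
import torch.nn as nn

# Toy stand-ins: a "text encoder" and a "denoiser". Real systems use
# billion-parameter networks; these exist only to show the shape of the loop.
text_encoder = nn.Embedding(1000, 64)               # caption tokens -> vectors
denoiser = nn.Sequential(                           # predicts the added noise
    nn.Linear(3 * 8 * 8 + 64, 256), nn.ReLU(), nn.Linear(256, 3 * 8 * 8)
)
optimizer = torch.optim.Adam(
    list(text_encoder.parameters()) + list(denoiser.parameters()), lr=1e-4
)

def training_step(image, caption_tokens):
    """One step of denoising-diffusion training on a single image-text pair."""
    noise = torch.randn_like(image)                 # sample random noise
    t = torch.rand(1)                               # random noise level in [0, 1)
    noisy = (1 - t) * image + t * noise             # corrupt the image (simplified schedule)
    cond = text_encoder(caption_tokens).mean(dim=0) # pool the caption into one vector
    pred = denoiser(torch.cat([noisy.flatten(), cond]))
    loss = nn.functional.mse_loss(pred, noise.flatten())  # learn to predict the noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# One fake "scraped" training pair: an 8x8 RGB image and a tokenized caption
# (imagine the tokens encode "smiling woman in sunlight").
image = torch.rand(3, 8, 8)
caption = torch.tensor([42, 7, 101])
print(training_step(image, caption))
```

Scale that loop across billions of scraped pairs and millions of GPU-hours, and the patterns the model internalizes (skin texture, lighting, what makes someone look real) come straight from those photos.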
Tools like Midjourney, Stable Diffusion, and DALL-E didn’t invent human faces from scratch. They studied millions of them. Feed a model enough photos labeled “fashion model,” “business professional,” or “casual portrait,” and it learns to generate convincing variations on command.
The technical achievement is remarkable. But here’s where it gets complicated.
The Dataset Built from Scraped Faces
Many foundational AI image models were built on LAION-5B – a dataset of 5.85 billion image-text pairs scraped from the web.[^1] The dataset cost approximately $10,000 to compile, funded primarily by a donation from Stability AI founder Emad Mostaque.[^2]
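What does a web-scale scraped dataset actually look like? Not a folder of images: LAION distributes metadata, essentially a URL plus the alt-text caption found next to each image, packaged in Parquet files. The sketch below is a hedged illustration only; the shard filename is hypothetical, and exact column names vary between releases.

```python
# A hedged look at web-scraped image-text metadata. The filename is
# hypothetical and column names vary across LAION releases; adjust to
# whatever shard you actually have.
import pandas as pd

rows = pd.read_parquet("laion_metadata_shard.parquet")  # hypothetical shard
for _, row in rows.head(3).iterrows():
    # Each record is just a pointer to someone's photo plus its alt text.
    # No consent flag exists at this layer of the pipeline.
    print(row["url"], "|", row["caption"])
```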
Stable Diffusion 1.5, one of the most widely used AI image generation models, was trained on subsets of LAION-5B.[^3] Training it cost roughly $600,000 in compute (256 Nvidia A100 GPUs for 150,000 GPU-hours), and the model has become foundational infrastructure powering thousands of commercial applications.[^3]
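Part of why it became foundational is how little code it takes to build on. As a rough sketch, generating an image from the publicly hosted weights cited above takes a few lines with the open-source diffusers library (assuming a CUDA GPU and the usual Hugging Face setup):

```python
# A minimal generation sketch with Hugging Face's diffusers library.
# Assumes a CUDA GPU; the repo ID matches the model card cited in [^3].
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

# The kind of prompt behind the influencers flooding your feed.
image = pipe("photorealistic portrait of a fashion model, natural light").images[0]
image.save("synthetic_model.png")
```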
In December 2023, Stanford’s Internet Observatory identified 3,226 suspected instances of child sexual abuse material within LAION-5B, with 1,008 externally validated.[^4] Following this discovery, a revised version called Re-LAION-5B was released in August 2024 with 2,236 links removed.[^5]
In September 2024, the Hamburg Regional Court in Germany ruled that LAION’s data collection qualified under the scientific-research exceptions in German copyright law, since the dataset was made freely available to researchers.[^6] The ruling was upheld on appeal in December 2025.[^7] However, the court noted that this protection applies specifically to non-commercial research purposes, not to commercial entities that might use the dataset to train proprietary AI models.[^6]
It’s Like Counterfeiting vs. Licensing – You Need to Know the Difference
Right now, your photos are being scraped into datasets. AI systems are learning the exact geometry of your features, the way light hits your cheekbone, your distinctive presence. Companies are generating synthetic versions of “you” for commercial use without your permission, without compensation, often without you even knowing it happened.
But here’s what most people miss: the problem isn’t AI. The problem is who controls it.
The fashion industry figured this out decades ago. Brands license designs. Agencies negotiate model rates. Production companies pay actors. Everyone understands that professional work deserves professional compensation.
AI doesn’t change that equation. If anything, it creates more opportunities if you’re the one in control. A model who licenses their face for AI training could earn revenue every time it’s used. A photographer could monetize their portfolio for years. An actor could book global campaigns while shooting locally.
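What would that look like mechanically? Here is a minimal sketch of per-use licensing, where the schema and the rate are invented purely for illustration; no real marketplace’s terms are implied.

```python
# A hedged sketch of consent-based, per-use licensing. The schema and the
# rate below are invented for illustration; real terms would be negotiated.
from dataclasses import dataclass

@dataclass
class LicenseTerms:
    licensor: str         # the model, photographer, or actor who owns the asset
    rate_per_use: float   # fee each time the asset is used in training or generation

def settle(terms: LicenseTerms, uses: int) -> float:
    """What the AI company owes the licensor for a billing period."""
    return terms.rate_per_use * uses

terms = LicenseTerms(licensor="model_0042", rate_per_use=0.25)
print(f"Owed to {terms.licensor}: ${settle(terms, uses=1_200):,.2f}")  # $300.00
```

The point isn’t the arithmetic; it’s that usage becomes observable and billable instead of invisible.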
The technology isn’t the threat. The lack of control is.
The Courts Are Starting to Pay Attention
The legal system is catching up, and the penalties are staggering.
In July 2024, Meta agreed to a $1.4 billion settlement with Texas over Facebook’s facial recognition feature – the largest privacy settlement by a single state in US history.[^8] The settlement, to be paid over five years, resolved claims that Meta violated Texas’s Capture or Use of Biometric Identifier Act (CUBI) by running facial recognition software on virtually every face in photos uploaded to Facebook for over a decade without user consent.[^9]
Google followed with a $1.375 billion settlement with Texas, announced in May 2025 and finalized in October 2025, over claims that it unlawfully tracked location data and incognito-mode searches and collected biometric data, including facial geometry.[^10]
Illinois’s Biometric Information Privacy Act (BIPA) has become one of the most powerful tools for protecting facial data. Facebook paid $650 million, TikTok $92 million, and Snapchat $35 million in BIPA settlements.[^11]
The Clearview AI case produced a first-of-its-kind equity-based settlement in March 2025: class members received a 23% stake in Clearview AI rather than cash – approximately $51.75 million based on the company’s January 2024 valuation, implying a total valuation of roughly $225 million.[^12] Clearview had scraped over 50 billion facial images from public sources without consent to build a searchable facial recognition database sold to law enforcement.[^12]
Even more tellingly, the FTC’s action against Rite Aid in December 2023 reinforced the remedy of “algorithmic disgorgement” – forcing companies to delete not just improperly obtained data, but also the AI models trained on that data.[^13] The message is clear: if your training data wasn’t obtained legitimately, your entire model is tainted.
In a landmark copyright case, Anthropic agreed to a $1.5 billion settlement in September 2025 with authors who accused the company of downloading millions of pirated books to train its Claude AI model.[^14] The settlement, which pays approximately $3,000 per book for roughly 500,000 works, represents the largest publicly reported copyright recovery in U.S. history.[^14]
The Consent-Based Alternative Is Already Here
Not every company took the scraping shortcut. Some built their businesses on a radically different model: consent and compensation.
Major content providers are securing substantial licensing deals. News Corp’s five-year content licensing agreement with OpenAI, announced in May 2024, is reportedly worth more than $250 million.[^15] These partnerships demonstrate that AI companies will pay for legitimately sourced training data when required to by law or competitive pressure.
Your Data Is More Valuable Than You Think
Here’s what most people don’t realize: every photo you post, every selfie you share, every casual portrait on your profile represents training data that companies would pay real money for if they weren’t getting it for free.
Professional models and photographers understand this instinctively. But this isn’t just about models or influencers. It’s about anyone whose face, voice, art, or creative work becomes part of the datasets training the next generation of AI. As these systems become more sophisticated, the data they’re built on becomes more valuable and the question of who benefits from that value becomes more urgent.
The Future: Fair Licensing or Continued Extraction?
We stand at a crossroads. One path leads to continued data extraction: companies scraping whatever they can access, fighting billion-dollar lawsuits, and treating creators as raw material rather than partners. The other path leads toward consent-based licensing marketplaces, where data providers receive fair compensation and maintain control over their work.
The technical infrastructure exists. The legal framework is emerging. The compensation models are being tested. What we need now is a fundamental shift in how we think about data ownership in the AI era.
Your Instagram photos, your TikTok videos, your creative work: they’re not just content anymore. They’re training data for the next generation of AI systems. The question isn’t whether AI companies will use them. It’s whether you’ll have a say in how they’re used and whether you’ll be compensated when they are.
The AI models flooding your feed look real because they learned from real people – people who, in many cases, never consented to that use and never saw a cent. As we build the AI-powered future, we can choose to perpetuate that model of extraction, or we can build something better: a system where consent matters, where compensation is fair, and where the people whose data powers AI innovation actually benefit from it.
The choice is ours to make. But we need to make it soon, before the next 5 billion images get scraped and the question becomes moot.
Hannah Choi is the CEO and co-founder of Spotlite Global, a platform protecting creators in the AI era. Spotlite operates on two pillars: a transparent booking platform connecting models and creatives directly with brands across Asia and North America, and IP protection infrastructure that detects unauthorized use of creators’ likenesses and enables consent-based data licensing for AI training. Spotlite serves world-class brands including Sulwhasoo, MUSINSA, Samsung, LG, Hyatt Group, and leading K-beauty companies. Learn more at https://home.spotlite.global/
References
[^1]: Schuhmann, C., Beaumont, R., Vencu, R., et al. (2022). “LAION-5B: An open large-scale dataset for training next generation image-text models.” arXiv:2210.08402. https://arxiv.org/abs/2210.08402
[^2]: “Creating the World’s Largest Open-Sourced Multimodal Dataset.” DeepLearning.AI, October 26, 2022. https://www.deeplearning.ai/the-batch/creating-the-worlds-largest-open-sourced-multimodal-dataset/
[^3]: “Stable Diffusion v1.5 Model Card.” Hugging Face. https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5
[^4]: Thiel, D., et al. “Identifying and Eliminating CSAM in Generative ML Training Data and Models.” Stanford Internet Observatory, December 2023. https://stacks.stanford.edu/file/druid:kh752sm9123/20231214-DataProvenance-CSAM-Report.pdf
[^5]: “Re-LAION-5B: A Dataset of 2 Billion Revised Filtered Image-Text Pairs.” LAION, August 2024. https://laion.ai/blog/revision/
[^6]: Hamburg Regional Court, Case 310 O 227/23, Judgment of September 27, 2024 (Kneschke v. LAION).
[^7]: Hamburg Higher Regional Court, Case 5 U 104/24, Judgment of December 10, 2025.
[^8]: “Attorney General Ken Paxton Secures $1.4 Billion Settlement with Meta.” Texas Attorney General’s Office, July 30, 2024. https://www.texasattorneygeneral.gov/news/releases/attorney-general-ken-paxton-secures-14-billion-settlement-meta
[^9]: “Meta to pay Texas $1.4 billion.” The Texas Tribune, July 30, 2024. https://www.texastribune.org/2024/07/30/texas-meta-facebook-biometric-data-settlement/
[^10]: “Attorney General Ken Paxton Finalizes Historic Settlement with Google and Secures $1.375 Billion.” Texas Attorney General’s Office, October 31, 2025. https://www.texasattorneygeneral.gov/news/releases/attorney-general-ken-paxton-finalizes-historic-settlement-google
[^11]: “Judge Approves $650M Facebook Privacy Lawsuit Settlement.” Associated Press, February 26, 2021. https://apnews.com/article/facebook-lawsuit-privacy-biometric-data-illinois-b09f4e244e2b0a87ec097a753ae24af8
[^12]: “A First in BIPA Litigation: Class Members Receive Equity in Clearview AI.” The National Law Review, March 20, 2025. https://www.natlawreview.com/article/first-bipa-litigation-class-members-receive-equity-clearview-ai
[^13]: “Rite Aid Banned from Using AI Facial Recognition.” Federal Trade Commission, December 19, 2023. https://www.ftc.gov/news-events/news/press-releases/2023/12/rite-aid-banned-using-ai-facial-recognition
[^14]: “Anthropic pays authors $1.5 billion to settle copyright infringement lawsuit.” NPR, September 6, 2025. https://www.npr.org/2025/09/05/nx-s1-5529404/anthropic-settlement-authors-copyright-ai
[^15]: “News Corp Inks OpenAI Licensing Deal Potentially Worth More Than $250 Million.” Variety, June 18, 2024. https://variety.com/2024/digital/news/news-corp-openai-licensing-deal-1236013734/
About the Author
Hannah Choi leads Spotlite’s transformation of the creator economy in the AI age. With a background spanning Korea’s leading impact accelerator MYSC, unicorn startup Hyperconnect, and international expansion at CloudKitchens, she brings deep expertise in building platforms that solve social problems at scale. At MYSC, she advised more than 40 startups on marketing strategy and voluntarily relocated to Southeast Asia to expand the firm’s global networks. Her experience evaluating early-stage impact investments and scaling high-growth companies uniquely positions her to lead Spotlite’s global expansion from 50 models to 3,200+ across multiple markets, while maintaining the company’s core mission of ethical AI development and creator partnership rather than exploitation.

