The large-scale collection of text, images, code, and personal data from the internet by AI companies to train machine learning models — often without consent or compensation.

What is AI Scraping? | Privacy Glossary

Most major AI models were trained on data scraped from the open internet, which can include your blog posts, photos, social media, code, and personal information.

What Gets Scraped

Text: Blog posts, forum comments, social media, news articles, academic papers
Images: Photos, art, diagrams — anything publicly accessible
Code: GitHub repositories, Stack Overflow answers
Personal data: Names, bios, publicly posted contact info
Conversations: Public Discord, Reddit, and forum discussions
Creative work: Stories, music, artwork posted online

Who's Doing It

OpenAI (ChatGPT): Common Crawl, books, websites, Reddit
Google (Gemini): Effectively the entire indexed web
Meta (LLaMA): Public internet data, Instagram, Facebook posts
Anthropic (Claude): Common Crawl, public datasets
Stability AI, Midjourney: Billions of images from the web
Countless startups: Scraping everything they can reach

Why It's a Privacy Problem

No meaningful consent: A photo you posted in 2015 may now be part of a model's training data, and you were never asked
No opt-out that works: Robots.txt is voluntary, and some AI crawlers have been reported to ignore it
Derivative use: Your writing style can be replicated, and your face can appear in generated images
Data laundering: Personal info scraped into a training dataset gets detached from its source, which makes it hard to trace and remove
Copyright gray area: Legal battles ongoing (NYT v. OpenAI, Getty v. Stability AI)
Permanent inclusion: Once data is in a trained model, removing it is technically difficult

How to Reduce Exposure

Set robots.txt to block AI crawlers (GPTBot, CCBot, Google-Extended)
Use "noai" meta tags where supported
Opt out directly — OpenAI, Google, and others offer opt-out forms (limited effectiveness)
Minimize public content that contains personal information
Use image-cloaking tools like Glaze to make your art harder for AI models to mimic, or poisoning tools like Nightshade to disrupt training on your images
Review privacy settings on every platform — some now have "don't use my data for AI" toggles
Support legal action — Class action lawsuits are establishing precedent
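
As a rough sketch of the robots.txt step above, the Python snippet below generates rules that disallow the three crawler tokens named in the list and then checks them with the standard library's robots.txt parser. The helper names build_robots_txt and is_allowed and the example.com URL are illustrative assumptions, not part of any official tool. The check only shows how a compliant crawler would read the file; it does nothing to stop a crawler that ignores robots.txt.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens named in the list above.
AI_CRAWLERS = ["GPTBot", "CCBot", "Google-Extended"]


def build_robots_txt(agents):
    """Return robots.txt text that disallows the whole site for each agent."""
    blocks = [f"User-agent: {agent}\nDisallow: /\n" for agent in agents]
    # All other crawlers (for example ordinary search engines) stay allowed.
    blocks.append("User-agent: *\nAllow: /\n")
    return "\n".join(blocks)


def is_allowed(robots_txt, agent, url):
    """Report how a crawler that honors robots.txt would treat this URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


robots_txt = build_robots_txt(AI_CRAWLERS)
print(robots_txt)

# Hypothetical URL, used only to exercise the rules.
for agent in AI_CRAWLERS + ["Googlebot"]:
    verdict = "allowed" if is_allowed(robots_txt, agent, "https://example.com/blog/post") else "blocked"
    print(f"{agent}: {verdict}")
```

Save the printed rules as robots.txt at the root of your site. Blocking a token such as Google-Extended is meant to affect AI training use only, so regular search indexing should be unaffected, but each vendor defines what its token covers.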

Related Terms

Data Minimization

Large Language Model Privacy

Model Training Data

Right to Be Forgotten
