
What is AI Scraping?

The large-scale collection of text, images, code, and personal data from the internet by AI companies to train machine learning models — often without consent or compensation.

Also known as: AI Data Collection, Training Data Scraping, AI Crawling

Most major AI models were trained largely on data scraped from the open internet, which can include your blog posts, photos, social media, code, and personal information.

What Gets Scraped

  • Text: Blog posts, forum comments, social media, news articles, academic papers
  • Images: Photos, art, diagrams — anything publicly accessible
  • Code: GitHub repositories, Stack Overflow answers
  • Personal data: Names, bios, publicly posted contact info
  • Conversations: Public Discord, Reddit, and forum discussions
  • Creative work: Stories, music, artwork posted online

Who's Doing It

  • OpenAI (ChatGPT): Common Crawl, books, websites, Reddit
  • Google (Gemini): Publicly available web content, potentially most of the indexed web
  • Meta (Llama): Public internet data, public Instagram and Facebook posts
  • Anthropic (Claude): Common Crawl, public datasets
  • Stability AI, Midjourney: Billions of images from the web
  • Countless startups: Scraping everything they can reach

Why It's a Privacy Problem

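  • No meaningful consent: A photo you posted in 2015 may now be part of a model's training data, and nobody asked you
  • No opt-out that works reliably: Robots.txt is voluntary; the major crawlers say they honor it, but nothing enforces compliance and some crawlers ignore it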
  • Derivative use: Your writing style can be replicated, your face can appear in generated images
  • Data laundering: Personal info scraped into datasets is copied and repackaged, so its origin becomes hard to trace and deletion requests hard to enforce
  • Copyright gray area: Legal battles ongoing (NYT v. OpenAI, Getty v. Stability AI)
  • Permanent inclusion: Once data is in a trained model, removing it usually means retraining the model, which is rarely done

How to Reduce Exposure

  1. Set robots.txt to block AI crawlers (GPTBot, CCBot, Google-Extended)
  2. Use "noai" and "noimageai" meta tags where supported (a non-standard convention that only some crawlers and platforms honor)
  3. Opt out directly — OpenAI, Google, and others offer opt-out forms (limited effectiveness)
  4. Minimize public content that contains personal information
  5. Use image-cloaking tools like Glaze (which disrupts style mimicry) or poisoning tools like Nightshade to make your images harder to use for AI training
  6. Review privacy settings on every platform — some now have "don't use my data for AI" toggles
  7. Support legal action: Pending class action lawsuits could establish precedent
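
Step 1 can be sketched in a few lines of Python. This is a minimal illustration, not a definitive setup. It assumes you want to block the three crawler tokens named in step 1 from the whole site while leaving every other crawler allowed. The function names `build_robots_txt` and `is_blocked` and the `example.com` URL are placeholders invented for this example. The check uses the standard library's `urllib.robotparser`, which only shows how a compliant crawler would read the rules. It cannot make a crawler obey them.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens taken from step 1 above; extend the list as needed.
AI_CRAWLERS = ["GPTBot", "CCBot", "Google-Extended"]


def build_robots_txt(agents):
    """Return robots.txt text that disallows the whole site for each agent."""
    blocks = [f"User-agent: {agent}\nDisallow: /" for agent in agents]
    # All other crawlers (for example regular search engines) stay allowed.
    blocks.append("User-agent: *\nAllow: /")
    return "\n\n".join(blocks) + "\n"


def is_blocked(robots_txt, agent, url="https://example.com/blog/post"):
    """Report whether a crawler that follows robots.txt would skip this URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


if __name__ == "__main__":
    robots_txt = build_robots_txt(AI_CRAWLERS)
    print(robots_txt)
    for agent in AI_CRAWLERS + ["Googlebot"]:
        status = "blocked" if is_blocked(robots_txt, agent) else "allowed"
        print(f"{agent}: {status}")
```

Save the generated text as robots.txt at the root of your site. Re-run the check whenever you add a new crawler token, since a typo in a token silently leaves that crawler allowed.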
