What is Anonymization?
The process of permanently removing personally identifiable information from data so that individuals cannot be re-identified, even with additional data.
True anonymization is extremely difficult to achieve. Many datasets released as "anonymized" have later been re-identified by linking them with other data.
Anonymization vs Pseudonymization
- Anonymization: Irreversible — the data can never be linked back to an individual
- Pseudonymization: Reversible — identifiers are replaced but can be restored with a key
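The difference can be illustrated with a minimal sketch of pseudonymization. The `Pseudonymizer` class and its method names are hypothetical, chosen for illustration; here the lookup table plays the role of the key.

```python
import secrets


class Pseudonymizer:
    """Hypothetical helper: swaps identifiers for random tokens.

    The internal lookup table acts as the "key". Whoever holds it
    can restore the original identifiers.
    """

    def __init__(self):
        self._forward = {}  # real identifier -> token
        self._reverse = {}  # token -> real identifier

    def pseudonymize(self, identifier: str) -> str:
        # Reuse the same token for a repeated identifier so records stay linkable.
        if identifier not in self._forward:
            token = secrets.token_hex(8)
            self._forward[identifier] = token
            self._reverse[token] = identifier
        return self._forward[identifier]

    def restore(self, token: str) -> str:
        # Only possible while the lookup table still exists.
        return self._reverse[token]


p = Pseudonymizer()
token = p.pseudonymize("alice@example.com")
original = p.restore(token)  # the mapping is reversible with the table
```

Destroying the table removes the ability to restore identifiers, but that alone does not make the data anonymous: the remaining attributes may still allow re-identification.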
Re-identification Risks
- Netflix Prize movie ratings, released as "anonymous", were de-anonymized by matching them against public IMDb reviews
- "Anonymous" NYC taxi data was re-identified using pick-up/drop-off locations
- Research by Latanya Sweeney estimated that 87% of the U.S. population is uniquely identifiable from 5-digit ZIP code + birth date + gender
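The pattern behind cases like these is a linkage attack: joining a "de-identified" release with a second dataset that shares quasi-identifiers. The sketch below is illustrative only. The function name `link_records`, the field names, and every record are made up for the example.

```python
from collections import defaultdict

# Assumed quasi-identifier columns shared by both datasets.
QUASI_IDENTIFIERS = ("zip", "birth_date", "gender")


def link_records(released, public, keys=QUASI_IDENTIFIERS):
    """Pair each released row with a public row when its
    quasi-identifier combination matches exactly one public record."""
    index = defaultdict(list)
    for person in public:
        index[tuple(person[k] for k in keys)].append(person)

    matches = []
    for row in released:
        candidates = index.get(tuple(row[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match means re-identification
            matches.append((row, candidates[0]))
    return matches


# Fabricated sample data: names removed from the release, kept in the public list.
released = [
    {"zip": "02139", "birth_date": "1960-07-14", "gender": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birth_date": "1985-01-02", "gender": "M", "diagnosis": "flu"},
]
public = [
    {"name": "Jane Roe", "zip": "02139", "birth_date": "1960-07-14", "gender": "F"},
    {"name": "John Doe", "zip": "02139", "birth_date": "1985-01-02", "gender": "M"},
    {"name": "Jim Poe", "zip": "02139", "birth_date": "1985-01-02", "gender": "M"},
]

reidentified = link_records(released, public)
```

In this toy data the first released row links to a single named person, while the second stays ambiguous because two public records share the same combination.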
Techniques
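- K-anonymity: Ensure each record is indistinguishable from at least K-1 other records on its quasi-identifiers
- L-diversity: Ensure each group of indistinguishable records contains at least L distinct values of the sensitive attribute
- Differential privacy: Add calibrated random noise to query results so the output reveals very little about any one individual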
- Data suppression: Remove quasi-identifiers entirely
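As a rough sketch of how the first two properties can be measured, the helpers below group rows by their quasi-identifiers and report the smallest group size (k) and the smallest number of distinct sensitive values in any group (l, in the simple "distinct" sense). The function names, column names, and sample rows are assumptions made for illustration.

```python
from collections import defaultdict


def group_by_quasi_identifiers(rows, quasi_identifiers):
    """Bucket rows that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[q] for q in quasi_identifiers)].append(row)
    return groups


def k_anonymity(rows, quasi_identifiers):
    """Largest k for which the table is k-anonymous (assumes rows is non-empty)."""
    groups = group_by_quasi_identifiers(rows, quasi_identifiers)
    return min(len(group) for group in groups.values())


def l_diversity(rows, quasi_identifiers, sensitive):
    """Smallest count of distinct sensitive values in any group (assumes rows is non-empty)."""
    groups = group_by_quasi_identifiers(rows, quasi_identifiers)
    return min(len({r[sensitive] for r in group}) for group in groups.values())


# Fabricated, already-generalized sample table.
rows = [
    {"zip": "021**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "021**", "age": "30-39", "diagnosis": "asthma"},
    {"zip": "021**", "age": "40-49", "diagnosis": "flu"},
    {"zip": "021**", "age": "40-49", "diagnosis": "flu"},
]

k = k_anonymity(rows, ["zip", "age"])               # 2
l = l_diversity(rows, ["zip", "age"], "diagnosis")  # 1
```

The sample table is 2-anonymous but only 1-diverse: everyone in the 40-49 group has the same diagnosis, so knowing someone is in that group reveals their sensitive value even though no single record can be picked out.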
The Hard Truth
For most practical purposes, if data contains enough attributes to be useful, it contains enough to be re-identified. True anonymization often destroys the utility of the data.
Related Terms
Differential Privacy
A mathematical framework for sharing aggregate information about a dataset while provably protecting the privacy of individual entries.
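One common way to achieve this for a counting query is the Laplace mechanism, sketched below under simplifying assumptions. The function names and the sample data are hypothetical, and the code uses Python's general-purpose `random` module with ordinary floating-point arithmetic, so it illustrates the idea and is not a hardened implementation.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two independent exponentials with mean `scale`
    # follows a Laplace distribution centered at 0 with that scale.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(values, predicate, epsilon, rng=None):
    """Noisy count of values satisfying `predicate`.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1 / epsilon, rng)


# Fabricated sample data: ages of five people.
ages = [23, 35, 41, 52, 67]
noisy = private_count(ages, lambda age: age >= 40, epsilon=1.0)
```

A smaller `epsilon` means more noise and stronger protection; a larger one gives more accurate answers with weaker protection.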
PII (Personally Identifiable Information)
Any data that can be used to identify a specific individual, including name, address, phone number, email, Social Security number, and biometric data.
Pseudonymity
The state of using a consistent fake identity rather than your real name. Unlike anonymity, pseudonymity allows building reputation and history while protecting real-world identity from casual observers.