What is PII Detection in AI?
Protecting personal data in LLM applications
PII detection is the process of automatically identifying personally identifiable information in text. For AI applications, this means scanning LLM inputs and outputs for sensitive data that could violate privacy regulations or expose users to risk.
What is PII detection?
PII detection is the process of automatically identifying personally identifiable information in text data. In AI applications, PII detection scans LLM inputs and outputs for sensitive data like names, Social Security numbers, email addresses, phone numbers, and health information to prevent privacy violations.
Why PII Detection Matters for AI
LLMs can expose PII in several ways:
- Users include personal data in prompts
- Models reproduce PII memorized from training data
- RAG systems retrieve documents containing sensitive information
- Logs capture and store PII without redaction
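The logging risk in the last bullet can be reduced by scrubbing records before any handler writes them. The following is a minimal sketch using Python's standard `logging` module. The `RedactingFilter` class, the `scrub` helper, and the two regular expressions are illustrative assumptions, not part of any particular product, and they cover only emails and SSN-shaped strings.

```python
import logging
import re

# Illustrative patterns only; real deployments need much broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def scrub(text: str) -> str:
    """Replace emails and SSN-shaped strings with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return SSN_RE.sub("[SSN]", text)


class RedactingFilter(logging.Filter):
    """Hypothetical filter that scrubs each record before handlers see it."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Render the message with its args first, then scrub the result,
        # so PII passed as a format argument is also caught.
        record.msg = scrub(record.getMessage())
        record.args = None
        return True  # keep the record, now redacted


logger = logging.getLogger("llm_app")  # assumed logger name
logger.addFilter(RedactingFilter())
```

A filter attached to a logger applies only to records logged directly on that logger, so attaching it to each handler instead may be the safer choice when several loggers share one output.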
Why is PII detection important for AI compliance?
PII detection is essential for compliance with GDPR, HIPAA, CCPA, and other privacy regulations. These laws require organizations to protect personal data, and AI systems can inadvertently expose PII in responses, logs, or training data. Detection enables redaction before storage or transmission.
Types of PII
What types of PII can be detected in LLM outputs?
Common PII types detected include: names, email addresses, phone numbers, Social Security numbers, credit card numbers, addresses, dates of birth, medical record numbers, IP addresses, driver's license numbers, passport numbers, and biometric identifiers. HIPAA defines 18 specific identifiers for healthcare data.
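HIPAA's Safe Harbor de-identification method lists 18 identifiers that make health information individually identifiable, and therefore Protected Health Information (PHI). They fall into categories such as: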
| Category | Examples |
|---|---|
| Direct identifiers | Names, SSN, email, phone |
| Geographic | Street address, city, county, ZIP code (any subdivision smaller than a state) |
| Dates | Birth date, admission date, death date |
| Account numbers | Medical record #, health plan #, account # |
| Device/vehicle IDs | Serial numbers, VIN, license plates |
| Biometric | Fingerprints, voiceprints, full-face photographs |
How PII Detection Works
How does PII detection work in LLM applications?
PII detection uses pattern matching (regex for SSNs, emails, phone numbers), named entity recognition (NER) for names and locations, and machine learning classifiers for context-dependent detection. Modern systems combine these approaches and can auto-redact detected PII before logging or returning responses.
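Pattern matching: Regular expressions detect structured PII such as SSNs (XXX-XX-XXXX), emails, and phone numbers. Credit card numbers (typically 13 to 19 digits) are matched by pattern and then confirmed with a Luhn checksum, which a regex alone cannot compute.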
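As a rough illustration of pattern matching, the sketch below pairs three simplified regular expressions with a Luhn checksum for card candidates. The function names (`luhn_valid`, `find_structured_pii`) and the patterns themselves are assumptions made for this example. Production detectors typically handle many more formats and edge cases.

```python
import re

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# 13 to 19 digits, each optionally preceded by a single space or hyphen.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_structured_pii(text: str) -> list[tuple[str, str]]:
    """Return (label, matched_text) pairs for SSNs, emails, and card numbers."""
    findings = [("SSN", m.group()) for m in SSN_RE.finditer(text)]
    findings += [("EMAIL", m.group()) for m in EMAIL_RE.finditer(text)]
    # Keep a card candidate only when the checksum passes, which filters
    # out most random digit runs such as order numbers.
    findings += [
        ("CREDIT_CARD", m.group())
        for m in CARD_RE.finditer(text)
        if luhn_valid(m.group())
    ]
    return findings
```

The checksum step is what keeps the card pattern usable: a bare digit-run regex would flag far too many non-card numbers.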
Named Entity Recognition: NER models identify names, locations, and organizations that may constitute PII in context.
Context-aware classification: ML models determine whether detected entities are actually PII based on the surrounding context. "John Smith" in a novel isn't PII; "Patient John Smith" in a medical context is.
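To make the idea of context concrete, here is a deliberately simple keyword heuristic standing in for a trained classifier. It is not how an ML model works internally. The cue-word sets, the `looks_like_pii` function, and the window size are all hypothetical choices for illustration.

```python
import re

# Hypothetical cue words; a trained classifier would learn such signals from data.
SENSITIVE_CUES = {"patient", "diagnosis", "dob", "mrn", "account", "ssn", "prescribed"}
FICTION_CUES = {"novel", "character", "chapter", "protagonist", "fictional"}


def looks_like_pii(text: str, entity: str, window: int = 6) -> bool:
    """Return True when sensitive cues outnumber fiction cues near `entity`."""
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    ent = re.findall(r"[A-Za-z']+", entity.lower())
    if not ent:
        return False
    for i in range(len(tokens) - len(ent) + 1):
        if tokens[i:i + len(ent)] != ent:
            continue
        # Collect up to `window` tokens on each side of this mention.
        before = tokens[max(0, i - window):i]
        after = tokens[i + len(ent):i + len(ent) + window]
        context = before + after
        sensitive = sum(t in SENSITIVE_CUES for t in context)
        fiction = sum(t in FICTION_CUES for t in context)
        if sensitive > fiction:
            return True
    return False
```

Note that this sketch treats a name with no surrounding cues as not PII. A more conservative system might default the other way and treat ambiguous mentions as sensitive.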
PII Handling Options
Once PII is detected, systems can:
- Flag — Mark the event for review without blocking
- Redact — Replace PII with placeholders ([EMAIL], [SSN]) before logging
- Block — Prevent the response from being returned to users
- Encrypt — Store PII in encrypted form with access controls
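The first three options above can be sketched as a per-type policy table. Everything named here (`Action`, `POLICY`, `Result`, `apply_policy`, and the two patterns) is a hypothetical construction for illustration, not a description of any vendor's API. Encryption is left out because it depends on key management and a cryptography library outside the scope of a short example.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    FLAG = "flag"
    REDACT = "redact"
    BLOCK = "block"


# Illustrative detectors; real systems cover many more PII types.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

# Hypothetical policy: which action to take for each PII type.
POLICY = {"EMAIL": Action.REDACT, "SSN": Action.BLOCK}


@dataclass
class Result:
    text: Optional[str]  # None when the response is blocked
    flagged: list[str]   # labels of every PII type that was detected
    blocked: bool


def apply_policy(text: str, policy: dict[str, Action] = POLICY) -> Result:
    """Apply the configured action for each PII type found in `text`."""
    flagged: list[str] = []
    for label, pattern in PATTERNS.items():
        if not pattern.search(text):
            continue
        flagged.append(label)
        action = policy.get(label, Action.FLAG)  # unknown types are only flagged
        if action is Action.BLOCK:
            return Result(None, flagged, True)
        if action is Action.REDACT:
            text = pattern.sub(f"[{label}]", text)
    return Result(text, flagged, False)
```

Under these assumptions, `apply_policy("Contact jane@example.com")` returns the text `Contact [EMAIL]` with `EMAIL` flagged, while any input containing an SSN-shaped string is blocked outright.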
DriftRail provides automatic PII detection for 12+ data types with configurable redaction policies, helping organizations maintain compliance without manual review of every interaction.