Data Types
PII in LLM Prompts: How to Detect and Redact Personal Data
Comprehensive guide to the types of PII that commonly appear in LLM prompts, how to detect them reliably, and redaction strategies for DPDP compliance.
Why LLM Prompts Are a PII Hotspot
Users share remarkable amounts of personal data when interacting with AI assistants: they paste documents, copy in form-field contents, and describe personal situations in detail. In an enterprise context, employees often paste sensitive documents into AI assistants without considering what data those documents contain.
In Indian applications specifically: KYC-related queries frequently include Aadhaar and PAN numbers, financial queries include account numbers and UPI IDs, support interactions include order details with delivery addresses, and HR chatbots receive salary and employment data. All of this is DPDP personal data.
High-Frequency PII in Indian LLM Traffic
Based on common patterns in Indian AI deployments: Mobile numbers (highest frequency) — users reference 'my number 9876543210' for account lookups. Email addresses — for account identification. Aadhaar (medium frequency, high severity) — users paste Aadhaar for KYC queries or form-filling assistance. PAN — common in tax and investment queries. UPI IDs — in payment support queries.
Less frequent but high risk: bank account + IFSC combinations in payment setup queries, full delivery addresses in logistics support, medical record numbers in health chatbot interactions, and employee IDs in HR chatbot workflows.
Multi-Layer Detection Architecture
Layer 1 — Structural patterns: regex + validation for known PII formats (Aadhaar format plus Verhoeff checksum, PAN format validation, phone number normalisation). This catches the majority of formatted PII. Layer 2 — Contextual patterns: look for PII-adjacent keywords ('my Aadhaar is', 'account number', 'contact number', 'date of birth', 'address is'). This expands detection to unformatted or partially formatted PII.
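A minimal sketch of Layer 1 in Python. The regexes and function names here are illustrative, not a production ruleset: Aadhaar is matched as 12 digits (first digit 2–9, optional group separators) and then confirmed with the Verhoeff checksum that Aadhaar uses, which rejects most random 12-digit numbers; PAN is matched on its published 5-letter/4-digit/1-letter format only.

```python
import re

# Verhoeff checksum tables: multiplication (_D) and permutation (_P)
_D = [
    [0,1,2,3,4,5,6,7,8,9], [1,2,3,4,0,6,7,8,9,5],
    [2,3,4,0,1,7,8,9,5,6], [3,4,0,1,2,8,9,5,6,7],
    [4,0,1,2,3,9,5,6,7,8], [5,9,8,7,6,0,4,3,2,1],
    [6,5,9,8,7,1,0,4,3,2], [7,6,5,9,8,2,1,0,4,3],
    [8,7,6,5,9,3,2,1,0,4], [9,8,7,6,5,4,3,2,1,0],
]
_P = [
    [0,1,2,3,4,5,6,7,8,9], [1,5,7,6,2,8,3,0,9,4],
    [5,8,0,3,7,9,6,1,4,2], [8,9,1,6,0,4,3,5,2,7],
    [9,4,5,3,1,2,6,8,7,0], [4,2,8,6,5,7,3,9,0,1],
    [2,7,9,3,8,0,6,4,1,5], [7,0,4,6,9,1,3,2,5,8],
]

def verhoeff_valid(digits: str) -> bool:
    """True if the digit string passes the Verhoeff checksum (used by Aadhaar)."""
    c = 0
    for i, ch in enumerate(reversed(digits)):
        c = _D[c][_P[i % 8][int(ch)]]
    return c == 0

# Aadhaar: 12 digits, first digit 2-9, optional space/hyphen between 4-digit groups
AADHAAR_RE = re.compile(r"\b[2-9]\d{3}[ -]?\d{4}[ -]?\d{4}\b")
# PAN: 5 uppercase letters, 4 digits, 1 uppercase letter
PAN_RE = re.compile(r"\b[A-Z]{5}\d{4}[A-Z]\b")

def find_structural_pii(text: str) -> list[tuple[str, str]]:
    """Return (type, matched_text) pairs for formatted PII found in text."""
    hits = []
    for m in AADHAAR_RE.finditer(text):
        raw = re.sub(r"[ -]", "", m.group())
        if verhoeff_valid(raw):  # drop random 12-digit numbers that fail the checksum
            hits.append(("AADHAAR", m.group()))
    hits.extend(("PAN", m.group()) for m in PAN_RE.finditer(text))
    return hits
```

Validating the checksum after the regex match is the point of this layer: the regex alone would flag any 12-digit number, while only one in ten such numbers passes Verhoeff.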
Layer 3 — Semantic detection: use a lightweight classifier trained to identify PII concepts even when not in standard format. Catches cases like 'twelve digit number: two three four five...' or PII described in words. Layer 4 — System prompts: instruct the LLM itself to not reproduce PII from context. Not a primary control (LLMs can be coaxed to ignore this), but adds a defence-in-depth layer.
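Layer 2 can be sketched as a trigger-phrase scan: find the PII-adjacent keywords the text mentions and capture the text immediately following them, which likely holds the value even when it is unformatted. The trigger list and window size below are illustrative assumptions, not a complete ruleset.

```python
import re

# Illustrative trigger phrases that often precede unformatted PII
TRIGGERS = [
    "my aadhaar is", "aadhaar number", "account number",
    "contact number", "date of birth", "address is",
]

def contextual_pii_spans(text: str, window: int = 40) -> list[tuple[str, str]]:
    """Return (trigger, following_text) pairs: the `window` characters
    after each trigger phrase, which likely contain the PII value."""
    lower = text.lower()
    spans = []
    for trig in TRIGGERS:
        for m in re.finditer(re.escape(trig), lower):
            spans.append((trig, text[m.end():m.end() + window].strip()))
    return spans
```

The captured span would then be passed to stricter extraction or to Layer 3's classifier; this layer only narrows where to look.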
Redaction Strategy Selection
For gateway-level redaction (before LLM API call): prefer type-preserving token replacement ([AADHAAR_1], [PHONE_1]) over blank redaction. This lets the LLM understand the context ('the user mentioned their Aadhaar number') without receiving the actual value.
Token indexing: if the LLM response needs to reference the redacted value back (e.g., confirming an account), use indexed tokens — [AADHAAR_1] in the prompt maps back to the original value in your gateway layer, which substitutes it back in the final response shown to the user. The LLM never sees the actual Aadhaar.
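A minimal gateway-side sketch of both ideas, type-preserving tokens plus indexed map-back. The class name and the two regexes are illustrative assumptions; a real gateway would plug in the full detector stack from the layers above.

```python
import re

class TokenRedactor:
    """Replaces detected PII with indexed tokens like [AADHAAR_1],
    remembers the token -> value mapping, and substitutes the original
    values back into the model's response before it reaches the user."""

    PATTERNS = {
        # Aadhaar runs first so 12-digit matches are tokenised before phone matching
        "AADHAAR": re.compile(r"\b[2-9]\d{3}[ -]?\d{4}[ -]?\d{4}\b"),
        "PHONE": re.compile(r"\b[6-9]\d{9}\b"),  # Indian mobile: 10 digits, starts 6-9
    }

    def __init__(self):
        self.mapping: dict[str, str] = {}  # token -> original value
        self.counts: dict[str, int] = {}

    def redact(self, prompt: str) -> str:
        """Replace each PII match with a typed, indexed token."""
        def make_repl(kind):
            def repl(m):
                self.counts[kind] = self.counts.get(kind, 0) + 1
                token = f"[{kind}_{self.counts[kind]}]"
                self.mapping[token] = m.group()
                return token
            return repl
        for kind, pattern in self.PATTERNS.items():
            prompt = pattern.sub(make_repl(kind), prompt)
        return prompt

    def rehydrate(self, response: str) -> str:
        """Substitute original values back into the LLM response."""
        for token, value in self.mapping.items():
            response = response.replace(token, value)
        return response
```

The LLM only ever sees the tokens, yet can still refer to them ("I've noted [PHONE_1] on your account"); the gateway's `rehydrate` step restores the real values in the reply shown to the user, so the raw Aadhaar or phone number never crosses the API boundary.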
Data Types operational checklist
This guide should be reviewed as an operating control, not only as a reference article. The minimum checklist is a data inventory, a stated processing purpose, owner approval, PII detection at the AI boundary, redaction or tokenisation where possible, retention limits, vendor transfer records, and a tested user-rights workflow. This checklist gives engineering and compliance teams a shared language for deciding what must be blocked, what can be allowed in shadow mode, and what needs human review before production release.
For AI systems, the review should include prompts, retrieved context, tool call arguments, model responses, logs, traces, analytics events, exports, and support attachments. Many incidents happen because teams scan only the visible form field while sensitive data moves through background context or observability tooling. CrewCheck's recommended pattern is to place the scanner at the request boundary, record the policy version, and keep audit evidence that shows which identifiers were detected and what action was taken.
A practical rollout starts with representative samples from production-like traffic. Run a DPDP scan, sort findings by identifier sensitivity and blast radius, fix Aadhaar, PAN, financial, health, children's, and precise-location exposure first, then move to consent wording, retention, deletion, and vendor review. Use shadow mode when false positives could disrupt users, and promote to enforcement only after the exceptions have owners and expiry dates.
This page is educational and should be paired with legal review for final policy interpretation. The operational proof should still come from repeatable evidence: scanner results, audit exports, pull-request checks, policy configuration, and a documented owner for the workflow. That combination is what makes the content useful during buyer diligence, board review, regulatory questions, or an incident investigation.
Check your own workflow
Run a free DPDP scan before this risk reaches production.
Scan prompts, logs, documents, and API payloads for Indian PII exposure, missing redaction, and audit gaps.