Aadhaar Detection: Verhoeff Checksums and Why Regex Isn't Enough
Regex finds shapes. Checksums reduce false positives and make PII controls credible.
Harsh · 27 April 2026 · 8 min read
Pattern matching is only the first gate
The naive approach to Aadhaar detection is a regex: twelve consecutive digits. This pattern matches Aadhaar numbers, but it also matches invoice numbers, timestamps, and random test data.
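The second gate is the checksum. The twelfth digit of an Aadhaar number is a Verhoeff check digit computed from the first eleven, so a twelve-digit candidate that fails the check cannot be a valid Aadhaar number. An arbitrary digit string passes a single check digit only about one time in ten, which is why the checksum removes most regex noise. Adding Verhoeff validation reduced false positives by over 90% compared to regex-only detection.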
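The two gates can be sketched in a few lines of Python. This is a minimal illustration, not CrewCheck's implementation: the function names are hypothetical, and the candidate shape (first digit 2 to 9, optional space or hyphen between groups of four) is an assumption layered on the "twelve digits" rule. The tables are the standard Verhoeff multiplication, permutation, and inverse tables.

```python
import re

# Standard Verhoeff tables: D is the dihedral group D5 multiplication table,
# P is the position-dependent permutation, INV maps each element to its inverse.
D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6],
    [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8],
    [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2],
    [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4],
    [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]
P = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
    [5, 8, 0, 3, 7, 9, 6, 1, 4, 2],
    [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
    [9, 4, 5, 3, 1, 2, 6, 8, 7, 0],
    [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
    [2, 7, 9, 3, 8, 0, 6, 4, 1, 5],
    [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
]
INV = [0, 4, 3, 2, 1, 5, 6, 7, 8, 9]

# Assumed shape: 12 digits, first digit 2-9, optional space or hyphen between
# groups of four, and not embedded in a longer run of digits.
AADHAAR_SHAPE = re.compile(r"(?<!\d)[2-9]\d{3}[ -]?\d{4}[ -]?\d{4}(?!\d)")


def verhoeff_valid(digits: str) -> bool:
    """Return True if a digit string (check digit last) passes Verhoeff."""
    c = 0
    for i, ch in enumerate(reversed(digits)):
        c = D[c][P[i % 8][int(ch)]]
    return c == 0


def verhoeff_check_digit(payload: str) -> str:
    """Compute the check digit for a payload; useful for synthetic test data."""
    c = 0
    for i, ch in enumerate(reversed(payload)):
        c = D[c][P[(i + 1) % 8][int(ch)]]
    return str(INV[c])


def find_aadhaar_candidates(text: str) -> list[str]:
    """Return separator-free 12-digit strings that pass shape and checksum."""
    hits = []
    for match in AADHAAR_SHAPE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if verhoeff_valid(digits):
            hits.append(digits)
    return hits
```

Passing the checksum shows only that a string is well formed, not that the number was ever issued, so a hit is still a candidate rather than a confirmed identifier. `verhoeff_check_digit` is included so that test fixtures can be generated synthetically instead of copied from real documents.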
PAN, UPI, and other Indian identifiers
A PAN follows a strict format: five uppercase letters, four digits, one uppercase letter (for example, ABCPE1234F). UPI IDs follow the format user@provider. Indian mobile numbers are 10 digits, the first of which is 6, 7, 8, or 9.
Each identifier has its own validation rules beyond simple pattern matching. CrewCheck implements specific validators for each type.
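As a rough sketch of what "beyond simple pattern matching" can mean, the snippet below pairs each format with one extra rule. The names are hypothetical and this is not CrewCheck's validator set. Two things here go beyond the article and are assumptions: the fourth PAN character is checked against the set of holder-type letters, and a UPI match is rejected when the provider part continues into a dotted domain, so that ordinary email addresses are not reported.

```python
import re

# Shapes follow the formats described above; the boundary rules are assumptions.
PAN_RE = re.compile(r"(?<![A-Z0-9])[A-Z]{5}[0-9]{4}[A-Z](?![A-Z0-9])")
UPI_RE = re.compile(
    r"(?<![\w.-])[A-Za-z0-9._-]{2,256}@[A-Za-z]{2,64}(?![A-Za-z0-9]|\.[A-Za-z])"
)
MOBILE_RE = re.compile(r"(?<!\d)(?:\+91[ -]?)?([6-9]\d{9})(?!\d)")

# Assumption beyond the article: the fourth PAN character encodes the holder
# type (person, company, trust, and so on) and comes from this set.
PAN_HOLDER_TYPES = set("ABCFGHJLPT")


def scan_identifiers(text: str) -> dict[str, list[str]]:
    """Return PAN, UPI, and mobile candidates found in the text."""
    pans = [
        m.group()
        for m in PAN_RE.finditer(text)
        if m.group()[3] in PAN_HOLDER_TYPES
    ]
    upis = [m.group() for m in UPI_RE.finditer(text)]
    # Group 1 is the 10-digit number without any +91 prefix.
    mobiles = [m.group(1) for m in MOBILE_RE.finditer(text)]
    return {"pan": pans, "upi": upis, "mobile": mobiles}
```

For example, `scan_identifiers("mail ravi@example.com")` reports no UPI ID, while `scan_identifiers("pay ravi.kumar@okhdfcbank")` reports one. A production validator would likely also check the provider handle against a maintained list, which this sketch does not attempt.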
Detection as an evidence problem
PII detection is not just about blocking sensitive data. It is about creating evidence that you attempted to detect and redact personal data.
Every PII detection event in CrewCheck is logged in the trust ledger with the type, confidence level, action taken, and the original and redacted text.
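To make the shape of such a record concrete, here is a hypothetical event structure carrying the fields listed above. The class name, field names, the added request hash, and the JSON line format are all assumptions for illustration; the real trust ledger schema is not described in this article.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class DetectionEvent:
    """Hypothetical ledger record; field names are illustrative only."""

    pii_type: str
    confidence: float
    action: str  # for example "redact", "block", or "allow"
    original_text: str
    redacted_text: str
    request_hash: str
    logged_at: str


def make_event(pii_type, confidence, action, original_text, redacted_text):
    """Build an event, hashing the original text so records can be correlated."""
    request_hash = hashlib.sha256(original_text.encode("utf-8")).hexdigest()
    return DetectionEvent(
        pii_type=pii_type,
        confidence=confidence,
        action=action,
        original_text=original_text,
        redacted_text=redacted_text,
        request_hash=request_hash,
        logged_at=datetime.now(timezone.utc).isoformat(),
    )


def to_ledger_line(event: DetectionEvent) -> str:
    """Serialize one event as a single JSON line with stable key order."""
    return json.dumps(asdict(event), sort_keys=True, ensure_ascii=False)
```

Because the record holds the original text, the ledger itself becomes a store of personal data and needs its own access and retention rules.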
How to operationalize this in an Indian AI workflow
Treat this topic as a production workflow, not a policy note. Identify the user action that starts the AI call, the personal or regulated data that can enter the prompt, the model provider that receives it, and the owner responsible for changing the route when something goes wrong. For an Indian product, the data inventory should explicitly cover Aadhaar-like identifiers, PAN, UPI IDs, account numbers, ABHA IDs, mobile numbers, addresses, and mixed-language free text because those are the values that often slip through generic Western scanners.
Once the workflow is named, put the control at the boundary. For CrewCheck, that means routing the model call through the gateway so detection, redaction, rule evaluation, provider choice, and audit logging happen consistently. The important detail is that the control should run on every request, including retries, fallback providers, demos, internal admin tools, and queue workers that call models outside the main web path.
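The "every request, including retries and fallbacks" point is easiest to see in code. The sketch below is not the CrewCheck gateway API: `guarded_call`, the provider callables, and the audit list are hypothetical, and a single mobile-number regex stands in for the full detector. What it illustrates is that redaction happens once, before any provider is tried, and that every attempt leaves an audit event.

```python
import re
from typing import Callable

# Stand-in detector for the sketch: a bare 10-digit Indian mobile number.
MOBILE_RE = re.compile(r"(?<!\d)[6-9]\d{9}(?!\d)")


def guarded_call(
    prompt: str,
    providers: list[Callable[[str], str]],
    audit_log: list[dict],
) -> str:
    """Redact the prompt, then try providers in order, logging every attempt."""
    redacted, count = MOBILE_RE.subn("[MOBILE]", prompt)
    for index, provider in enumerate(providers):
        event = {"provider_index": index, "redactions": count, "payload": redacted}
        try:
            response = provider(redacted)
        except Exception as exc:
            # A failed provider is logged too; the fallback receives the same
            # redacted payload rather than the raw prompt.
            event["outcome"] = f"error: {exc}"
            audit_log.append(event)
            continue
        event["outcome"] = "ok"
        audit_log.append(event)
        return response
    raise RuntimeError("all providers failed")
```

Queue workers, admin tools, and demo scripts would call the same function instead of a provider SDK directly, which is what keeps the control at the boundary.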
| Control point | Evidence to retain | Operational owner |
|---|---|---|
| Pre-prompt scan | PII type, rule ID, request hash, redacted payload | Platform engineering |
| Provider routing | Selected provider, region notes, fallback reason | AI platform owner |
| Post-output scan | Blocked text class, replacement copy, reviewer status | Product owner |
What evidence a buyer should ask for
A serious buyer should ask for evidence that connects the claim to live behavior. For a privacy claim, that means redaction logs, blocked examples, sanitized payloads, and data-retention behavior. For a safety claim, that means red-team cases, circuit-breaker decisions, and output scanning results. For a compliance claim, that means the notice, purpose, rule, and provider route can be reconstructed from the audit trail without waiting for an engineer to open production logs.
The practical standard is whether the team can answer a specific question without manual archaeology: what happened to this request, which rule fired, what data was removed, which provider saw the final payload, who approved the exception, and how long will the record be retained? If that answer requires five tools and a memory of how the system was meant to work, the evidence layer is not ready yet.
- Keep one sample allowed request, one redacted request, and one blocked request for each high-risk AI route.
- Link every public compliance claim to a live page, report export, gateway event, or scanner finding.
- Review DPDP notice language whenever the AI feature changes its purpose, provider, or data fields.
- Retest Hindi, Hinglish, spaced, hyphenated, and word-digit personal-data variants before release.
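The last item in that list can be partly handled by normalizing text before pattern matching. The helper below is a sketch under stated assumptions: it covers Unicode decimal digits (including Devanagari), English digit words, and spaces or hyphens between digits. It does not cover Hindi or Hinglish number words, and the function name is hypothetical.

```python
import re
import unicodedata

DIGIT_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}
WORD_RE = re.compile(r"\b(" + "|".join(DIGIT_WORDS) + r")\b", re.IGNORECASE)


def normalize_digits(text: str) -> str:
    """Return a detection-only copy of text with digit variants flattened."""
    # 1. Map every Unicode decimal digit (for example Devanagari) to ASCII.
    text = "".join(
        str(unicodedata.decimal(ch)) if ch.isdecimal() else ch for ch in text
    )
    # 2. Replace English digit words with the digits they name.
    text = WORD_RE.sub(lambda m: DIGIT_WORDS[m.group().lower()], text)
    # 3. Drop whitespace and hyphens that sit directly between two digits.
    return re.sub(r"(?<=\d)[\s-]+(?=\d)", "", text)
```

The normalized string should feed the detectors only, never the model or the user, because step 3 also joins digits that were legitimately separate (a range such as "12 - 3", for instance). That trade-off raises recall at the cost of more candidates for the checksum and format validators to reject.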
A safe next step
Start with one high-risk path and make it boringly inspectable. Run realistic Indian examples through it, including Aadhaar-like numbers, PAN formats, UPI IDs, mixed-language prompts, and attempts to override system instructions. Check the user-facing response, the gateway event, the dashboard state, and the exportable report. The path is ready only when all four tell the same story.
That narrow verification habit matters more than a large compliance checklist. AI governance fails when teams assume controls are present because the architecture says they are. It becomes trustworthy when the live product can show the exact request, exact decision, exact redaction, exact provider route, and exact evidence behind the claim.
After that, make the check repeatable. Keep the examples in a small regression pack, rerun them before deployment, and compare the result with the public claim you are about to make. If the route, report, or dashboard no longer proves the claim, change the product or change the claim before a customer finds the gap.
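A regression pack can be as small as a list of prompts and the strings that must never reach the provider. The following is one possible shape, with hypothetical names and a stand-in redactor; in practice `redact` would be whatever function or gateway call the deployment actually uses.

```python
import re
from typing import Callable, Optional

# Hypothetical cases: (label, prompt, secret that must not survive redaction).
# A secret of None marks a clean prompt that must pass through unchanged.
REGRESSION_PACK: list[tuple[str, str, Optional[str]]] = [
    ("plain mobile", "call 9876543210 now", "9876543210"),
    ("pan", "my PAN is ABCPE1234F", "ABCPE1234F"),
    ("clean prompt", "summarise this paragraph", None),
]


def run_pack(redact: Callable[[str], str], pack=REGRESSION_PACK) -> list[str]:
    """Return the labels of cases that leaked data or altered clean text."""
    failures = []
    for label, prompt, secret in pack:
        output = redact(prompt)
        if secret is None:
            if output != prompt:
                failures.append(label)
        elif secret in output:
            failures.append(label)
    return failures


def sample_redact(text: str) -> str:
    """Stand-in redactor used only to exercise the pack."""
    text = re.sub(r"(?<!\d)[6-9]\d{9}(?!\d)", "[MOBILE]", text)
    return re.sub(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b", "[PAN]", text)
```

Running `run_pack` in the deployment pipeline and failing the build on a non-empty result is one way to make the rerun automatic rather than a matter of memory.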
The habit is deliberately plain: one workflow, one owner, one evidence trail, one live verification path. That is enough to turn a short article, launch note, or procurement answer into something an operator can actually use when a bank, insurer, hospital, or enterprise SaaS buyer asks for proof.
Internal reference path
Use this article with the DPDP consent management implementation, the Indian PII types reference, and the LLM gateway for DPDP compliance. Those three pages give the legal, data-type, and runtime-control context needed to turn the article into an implementation review.
If the workflow touches banking, lending, insurance, healthcare, education, employment, or public-sector records, add one more internal review step before shipping. Ask whether the prompt uses the minimum data needed, whether the user-facing notice matches the route, whether output scanning runs before the response is shown, and whether an exportable event exists for the buyer, auditor, regulator, or incident commander who will eventually ask for proof during a real review.