Aadhaar Number Detection: Verhoeff Algorithm Deep Dive
How to reliably detect Aadhaar numbers in text using the Verhoeff checksum algorithm — with code examples, false positive handling, and production considerations.
Why Regex Alone Isn't Enough
The naive regex for Aadhaar is \b[2-9][0-9]{11}\b: a 12-digit number whose first digit is 2-9. The problem is that this matches 800 billion possible numbers (8 choices for the first digit, 10 for each of the other 11), including order IDs, transaction references, phone numbers (with country code prefix), and many other 12-digit sequences in your data.
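The over-matching is easy to reproduce. The sketch below applies the regex from the paragraph above to a made-up log line; the sample string and the findCandidates helper are illustrative assumptions, not taken from any real system.

```typescript
// Candidate pattern from the text: 12 digits, first digit 2-9.
const AADHAAR_CANDIDATE = /\b[2-9][0-9]{11}\b/g;

// Returns every substring that looks like an Aadhaar number to the regex alone.
function findCandidates(text: string): string[] {
  const found = text.match(AADHAAR_CANDIDATE);
  return found === null ? [] : found;
}

// Hypothetical log line: an order ID, a transaction reference, and a phone
// number with country code. None of them is labelled as an Aadhaar number.
const sample =
  "order 483920175566 paid via txn 730055120098, callback 919876543210";

const candidates = findCandidates(sample);
console.log(candidates.length); // 3: the regex flags all three values
```

Every 12-digit value in the line is flagged, which is why a second validation step is needed before redacting anything.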
In production systems, regex-only Aadhaar detection produces so many false positives that the system becomes unusable: you end up redacting legitimate business data. The solution is to layer Verhoeff checksum validation on top of the regex. The checksum confirms that the last digit is the correct check digit for the first eleven, so it tests structural validity, not whether the number was actually issued.
The Verhoeff Algorithm
The Verhoeff algorithm is a check-digit scheme built on three lookup tables: a 10×10 multiplication table (the Cayley table of the dihedral group D5), an 8×10 permutation table, and a 10-entry inverse table that is needed only when generating a check digit. To validate an Aadhaar number: (1) reverse the digits of the number, (2) for each position i (0 to 11), look up the permutation of digit[i] in row (i mod 8) of the permutation table, (3) fold that value into a running checksum, starting from 0, using the multiplication table, (4) accept the number only if the final checksum is 0.
This removes about 90% of false positives on arbitrary numeric strings: for any 11-digit prefix, exactly one of the ten possible final digits produces a checksum of 0. UIDAI uses this algorithm, so every genuine Aadhaar number passes, but roughly one in ten arbitrary 12-digit numbers also passes by chance. Checksum validation is therefore a strong filter, not proof, and works best combined with the context signals described below.
Implementation in JavaScript/TypeScript
const MULT_TABLE = [[0,1,2,3,4,5,6,7,8,9],[1,2,3,4,0,6,7,8,9,5],[2,3,4,0,1,7,8,9,5,6],[3,4,0,1,2,8,9,5,6,7],[4,0,1,2,3,9,5,6,7,8],[5,9,8,7,6,0,4,3,2,1],[6,5,9,8,7,1,0,4,3,2],[7,6,5,9,8,2,1,0,4,3],[8,7,6,5,9,3,2,1,0,4],[9,8,7,6,5,4,3,2,1,0]];
This is the D5 multiplication table. Validation also needs the 8×10 permutation table; the inverse table is needed only to generate a check digit, not to verify one. For production code, use the crewcheck-pii-sdk, which handles this natively.
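The table above covers only the multiplication step. A minimal sketch of a complete validator follows, assuming the standard published Verhoeff permutation table. The names verhoeffValid and isPlausibleAadhaar are illustrative and are not taken from crewcheck-pii-sdk.

```typescript
// D5 multiplication table (same values as MULT_TABLE above).
const D: number[][] = [
  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
  [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
  [2, 3, 4, 0, 1, 7, 8, 9, 5, 6],
  [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
  [4, 0, 1, 2, 3, 9, 5, 6, 7, 8],
  [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
  [6, 5, 9, 8, 7, 1, 0, 4, 3, 2],
  [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
  [8, 7, 6, 5, 9, 3, 2, 1, 0, 4],
  [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
];

// Permutation table: row (i mod 8) is applied to the digit at position i,
// where positions are counted from the rightmost digit.
const P: number[][] = [
  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
  [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
  [5, 8, 0, 3, 7, 9, 6, 1, 4, 2],
  [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
  [9, 4, 5, 3, 1, 2, 6, 8, 7, 0],
  [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
  [2, 7, 9, 3, 8, 0, 6, 4, 1, 5],
  [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
];

// True when the running Verhoeff checksum over all digits ends at 0.
function verhoeffValid(digits: string): boolean {
  if (!/^[0-9]+$/.test(digits)) return false;
  let checksum = 0;
  // Walk the digits right to left; i is the position from the right.
  for (let i = 0; i < digits.length; i++) {
    const digit = digits.charCodeAt(digits.length - 1 - i) - 48;
    checksum = D[checksum][P[i % 8][digit]];
  }
  return checksum === 0;
}

// Shape check (12 digits, first digit 2-9) plus the checksum.
// Expects an already normalised string with no spaces or hyphens.
function isPlausibleAadhaar(value: string): boolean {
  return /^[2-9][0-9]{11}$/.test(value) && verhoeffValid(value);
}

console.log(verhoeffValid("2363")); // true: 3 is the check digit for 236
console.log(verhoeffValid("2364")); // false
```

A passing result means the number is structurally consistent, not that UIDAI issued it, so the function name says "plausible" on purpose.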
Normalisation before detection: strip spaces (common in Aadhaar display format '1234 5678 9012'), strip hyphens, strip the word 'Aadhaar' prefix if present. A user might write 'aadhaar: 1234 5678 9012' — normalise to '123456789012' before applying Verhoeff validation.
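The normalisation step can be sketched as below. The function name normaliseAadhaarCandidate and the exact prefix pattern are assumptions for illustration; real input will need a broader set of prefixes and separators.

```typescript
// Strips an optional leading "Aadhaar" label, then spaces and hyphens.
// Returns the bare 12-digit string, or null if the remainder is not 12 digits.
function normaliseAadhaarCandidate(raw: string): string | null {
  const withoutLabel = raw.replace(/^\s*aadhaar\s*[:\-]?\s*/i, "");
  const digitsOnly = withoutLabel.replace(/[\s-]/g, "");
  return /^[0-9]{12}$/.test(digitsOnly) ? digitsOnly : null;
}

console.log(normaliseAadhaarCandidate("aadhaar: 1234 5678 9012")); // "123456789012"
console.log(normaliseAadhaarCandidate("1234-5678-9012")); // "123456789012"
console.log(normaliseAadhaarCandidate("call me later")); // null
```

Returning null for anything that is not exactly 12 digits keeps the Verhoeff step from running on partial or malformed input.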
Handling Aadhaar in Different Contexts
In structured forms: field-level detection is straightforward — fields named aadhaar_number, aadhar, uid, or uid_number are high-probability hits. Apply Verhoeff validation to the value.
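A field-level check might look like the sketch below. The field names come from the paragraph above; the findAadhaarFields helper and its validator parameter are hypothetical, with the validator standing in for whatever Verhoeff check you use.

```typescript
// Field names listed in the text as high-probability Aadhaar fields.
const AADHAAR_FIELD_NAMES = ["aadhaar_number", "aadhar", "uid", "uid_number"];

// Any function that takes 12 bare digits and returns the checksum result.
type ChecksumValidator = (digits: string) => boolean;

// Returns the keys of a record whose name suggests Aadhaar and whose value
// has the right shape and passes the supplied checksum validator.
function findAadhaarFields(
  record: { [key: string]: unknown },
  isChecksumValid: ChecksumValidator
): string[] {
  const hits: string[] = [];
  for (const key of Object.keys(record)) {
    if (AADHAAR_FIELD_NAMES.indexOf(key.toLowerCase()) === -1) continue;
    const digits = String(record[key]).replace(/[\s-]/g, "");
    if (/^[2-9][0-9]{11}$/.test(digits) && isChecksumValid(digits)) {
      hits.push(key);
    }
  }
  return hits;
}
```

Passing the validator in as a parameter keeps the field matching independent of the checksum implementation, so the same helper works with an SDK call or a local function.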
In free text (LLM prompts, support tickets): context clues help. Look for the word 'Aadhaar' or 'UIDAI' adjacent to a 12-digit number. Also detect common misspellings: Adhar, Aadhar, Aadar. In Hindi/regional language text: 'आधार' (Aadhaar in Devanagari) followed by digits.
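The context check can be sketched as a keyword search in a window around each candidate. The 40-character window, the findWithContext name, and the result shape are assumptions chosen for illustration; the keyword list is the one given in the paragraph above.

```typescript
// Keywords from the text, including common misspellings and Devanagari.
const CONTEXT_WORDS = /aadhaar|aadhar|adhar|aadar|uidai|आधार/i;

// 12 digits, first digit 2-9, optionally grouped 4-4-4 by a space or hyphen.
const GROUPED_CANDIDATE = "\\b[2-9][0-9]{3}[\\s-]?[0-9]{4}[\\s-]?[0-9]{4}\\b";

interface ContextHit {
  digits: string; // candidate with separators removed
  hasContext: boolean; // true when a keyword appears near the candidate
}

function findWithContext(text: string, windowSize: number = 40): ContextHit[] {
  const hits: ContextHit[] = [];
  const pattern = new RegExp(GROUPED_CANDIDATE, "g");
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(text)) !== null) {
    // Look at the text on both sides of the match for a keyword.
    const start = Math.max(0, m.index - windowSize);
    const end = m.index + m[0].length + windowSize;
    hits.push({
      digits: m[0].replace(/[\s-]/g, ""),
      hasContext: CONTEXT_WORDS.test(text.slice(start, end)),
    });
  }
  return hits;
}
```

A candidate with hasContext set to false is not necessarily safe; a reasonable policy is to treat context as a confidence boost on top of checksum validation rather than as a requirement.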
Partial Aadhaar: masked Aadhaar shows last 4 digits (XXXX XXXX 1234). This is not a DPDP violation if the masking was done correctly and the full number isn't recoverable. If you find partial Aadhaar in logs, it may indicate correctly masked data — but verify masking was applied before transmission, not after.
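When scanning logs, it helps to report masked and unmasked occurrences separately. The sketch below assumes masking always uses the letter X in the XXXX XXXX 1234 layout shown above; the classifyAadhaarMentions name and the result shape are illustrative, and other mask characters would need their own patterns.

```typescript
// Masked form from the text: eight X characters followed by the last 4 digits.
const MASKED_PATTERN = /\b[Xx]{4}[\s-]?[Xx]{4}[\s-]?[0-9]{4}\b/g;
// Unmasked candidate: 12 digits, first digit 2-9, optional 4-4-4 grouping.
const FULL_PATTERN = /\b[2-9][0-9]{3}[\s-]?[0-9]{4}[\s-]?[0-9]{4}\b/g;

interface AadhaarMentions {
  masked: string[]; // already masked; check where the masking happened
  full: string[]; // unmasked candidates that still need checksum validation
}

function classifyAadhaarMentions(text: string): AadhaarMentions {
  const masked = text.match(MASKED_PATTERN);
  const full = text.match(FULL_PATTERN);
  return {
    masked: masked === null ? [] : masked,
    full: full === null ? [] : full,
  };
}
```

The pattern match only shows that a value is masked at the point of scanning. Whether masking happened before transmission still has to be verified from the pipeline itself.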
Data Types operational checklist
Aadhaar detection should be treated as an operating control, not only as reference material. The minimum checklist is a data inventory, a stated processing purpose, owner approval, PII detection at the AI boundary, redaction or tokenisation where possible, retention limits, vendor transfer records, and a tested user-rights workflow. This checklist gives engineering and compliance teams a shared language for deciding what must be blocked, what can be allowed in shadow mode, and what needs human review before production release.
For AI systems, the review should include prompts, retrieved context, tool call arguments, model responses, logs, traces, analytics events, exports, and support attachments. Many incidents happen because teams scan only the visible form field while sensitive data moves through background context or observability tooling. CrewCheck's recommended pattern is to place the scanner at the request boundary, record the policy version, and keep audit evidence that shows which identifiers were detected and what action was taken.
A practical rollout starts with representative samples from production-like traffic. Run a DPDP scan, sort findings by identifier sensitivity and blast radius, fix Aadhaar, PAN, financial, health, children's, and precise-location exposure first, then move to consent wording, retention, deletion, and vendor review. Use shadow mode when false positives could disrupt users, and promote to enforcement only after the exceptions have owners and expiry dates.
This page is educational and should be paired with legal review for final policy interpretation. The operational proof should still come from repeatable evidence: scanner results, audit exports, pull-request checks, policy configuration, and a documented owner for the workflow. That combination is what makes the content useful during buyer diligence, board review, regulatory questions, or an incident investigation.