Aadhaar Masking

How to detect and protect India's most sensitive identifier before it reaches any AI model provider

Key Takeaways

1. Aadhaar numbers are personal data under the DPDP Act 2023 (the Act has no separate "sensitive" tier), and exposing them unmasked to LLM providers creates a compliance risk
2. Regex-only detection produces 40-60% false positives; Verhoeff checksum validation reduces this to under 5%
3. Masking must happen at the request boundary, before data leaves your infrastructure
4. CrewCheck detects Aadhaar numbers in under 2ms per request with 99.7% accuracy

What Is Aadhaar Masking?

Aadhaar masking is the process of detecting and hiding all but the last four digits of a 12-digit Aadhaar number before it reaches an AI model provider. A raw Aadhaar number like 2345 6789 0123 becomes XXXX XXXX 0123 — preserving enough information for context while eliminating the compliance risk.

This isn't optional. The DPDP Act 2023 covers Aadhaar numbers as personal data (the Act does not define a separate "sensitive" category, but its obligations apply in full). Any AI system processing Indian user data that forwards unmasked Aadhaar numbers to external LLM providers like OpenAI, Anthropic, or Google risks violating the Act's data minimization and purpose limitation requirements.

The challenge is that Aadhaar numbers appear in countless contexts: customer support tickets, insurance claims, loan applications, healthcare records, and HR documents. Any AI workflow that touches these documents needs automated detection and masking at the infrastructure level.

Why This Matters: The Compliance Landscape

The regulatory pressure around Aadhaar data protection has intensified significantly since the DPDP Act was enacted in 2023. Organizations face real consequences for mishandling Aadhaar data in AI systems.

₹250 Cr (maximum penalty): the highest penalty tier under the DPDP Act, which applies to failures to maintain reasonable security safeguards for personal data

1.4B+ (Aadhaar numbers issued): making it the world's largest biometric ID system

72 hours (breach notification window): time to notify the Data Protection Board after discovering a breach

99.7% (detection accuracy): CrewCheck's Aadhaar detection rate using Verhoeff + contextual analysis

The Detection Problem: Why Regex Alone Fails

The naive approach to Aadhaar detection is a simple regex: match any 12-digit number. This produces an unacceptable false positive rate because 12-digit numbers appear everywhere — timestamps, order IDs, phone numbers with country codes, and random numeric sequences.

Production-grade Aadhaar detection requires a multi-layer approach that combines pattern matching with mathematical validation and contextual analysis.

Regex-Only Detection

Matches any 12-digit number
40-60% false positive rate
Flags timestamps, order IDs, phone numbers
No format validation
Cannot distinguish Aadhaar from random numbers
Disrupts legitimate workflows with false alerts

Verhoeff + Context Detection

Validates mathematical checksum structure
Under 5% false positive rate
Ignores non-Aadhaar 12-digit sequences
Validates digit grouping patterns
Checks surrounding context for identity keywords
Production-ready with minimal false alerts

How Verhoeff Checksum Validation Works

The Verhoeff algorithm is a checksum formula that detects all single-digit errors and all transpositions of adjacent digits in numeric sequences. Aadhaar numbers use this algorithm: the 12th digit is a check digit computed from the first 11 digits.

When a 12-digit number is detected, the Verhoeff algorithm computes what the check digit should be based on the first 11 digits. If the computed value matches the actual 12th digit, the number has a high probability of being a valid Aadhaar number.

A random 12-digit number passes the checksum only about one time in ten, so this single validation step eliminates roughly 90% of false positives from regex-only detection. Combined with contextual signals (nearby words like 'Aadhaar', 'UID', 'UIDAI', or 'identity'), detection accuracy can exceed 99%.
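The checksum step can be sketched in Python as follows. The tables are the standard Verhoeff tables; the function names (`verhoeff_check_digit`, `verhoeff_valid`, `looks_like_aadhaar`) are illustrative and not taken from CrewCheck. Note that the sample numbers used in this article are placeholders and may not pass the checksum.

```python
# Multiplication table of the dihedral group D5.
D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6],
    [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8],
    [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2],
    [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4],
    [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]
# Position-dependent permutation table (row index = position mod 8).
P = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
    [5, 8, 0, 3, 7, 9, 6, 1, 4, 2],
    [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
    [9, 4, 5, 3, 1, 2, 6, 8, 7, 0],
    [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
    [2, 7, 9, 3, 8, 0, 6, 4, 1, 5],
    [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
]
# Inverse of each element in D5.
INV = [0, 4, 3, 2, 1, 5, 6, 7, 8, 9]


def verhoeff_check_digit(payload: str) -> int:
    """Return the check digit to append to `payload` (a string of digits)."""
    c = 0
    for i, ch in enumerate(reversed(payload)):
        # Positions start at 1 because position 0 is reserved for the check digit.
        c = D[c][P[(i + 1) % 8][int(ch)]]
    return INV[c]


def verhoeff_valid(number: str) -> bool:
    """True if the last digit of `number` is the correct Verhoeff check digit."""
    c = 0
    for i, ch in enumerate(reversed(number)):
        c = D[c][P[i % 8][int(ch)]]
    return c == 0


def looks_like_aadhaar(number: str) -> bool:
    """12 ASCII digits whose final digit passes the Verhoeff check."""
    return (len(number) == 12 and number.isascii()
            and number.isdigit() and verhoeff_valid(number))
```

A passing checksum only means the number is structurally plausible. It does not confirm that the number was ever issued, which is why the contextual signals still matter.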

Implementation: Where Masking Must Happen

Important

Aadhaar masking must happen at the request boundary: the point where data leaves your infrastructure and enters a third-party system. If you mask after the data has already been sent to an LLM provider, the disclosure has already happened and cannot be undone.

This means masking cannot be an application-level concern handled by individual developers. It must be an infrastructure-level control that intercepts every request regardless of which team, application, or SDK initiated it.

The gateway pattern (a proxy between your applications and LLM providers) is the industry-standard approach. Every request passes through the gateway, which applies detection and masking before forwarding to the model provider.
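A minimal sketch of the gateway idea, assuming a single choke-point function that every call to a provider must go through. The names `guarded_completion`, `send_to_provider`, and `mask_text` are hypothetical, and the placeholder masker uses a bare 12-digit pattern with no checksum or context checks; a real gateway would sit at the network layer rather than inside application code.

```python
import re
from typing import Callable

# Placeholder pattern: any 12 digits grouped 4-4-4 with optional space/hyphen.
AADHAAR_LIKE = re.compile(r"(?<!\d)\d{4}[\s-]?\d{4}[\s-]?(\d{4})(?!\d)")


def mask_text(text: str) -> str:
    """Keep only the last four digits of every 12-digit run."""
    return AADHAAR_LIKE.sub(lambda m: f"XXXX XXXX {m.group(1)}", text)


def guarded_completion(
    prompt: str,
    send_to_provider: Callable[[str], str],
    mask: Callable[[str], str] = mask_text,
) -> str:
    """Mask the prompt before it leaves, then mask the provider's response."""
    safe_prompt = mask(prompt)            # applied before the request boundary
    response = send_to_provider(safe_prompt)
    return mask(response)                 # output scanning on the way back
```

Because the provider callable only ever receives `safe_prompt`, the unmasked value cannot leak through this path even if the calling application forgets to sanitize its input.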

Detection in Practice: Formats and Edge Cases

Aadhaar numbers appear in multiple formats across real-world documents. A production detection system must handle all of these:

// Standard spaced format
2345 6789 0123

// Hyphenated format
2345-6789-0123

// No separator (continuous)
234567890123

// Mixed separators
2345 6789-0123

// Embedded in text
"My Aadhaar number is 234567890123 and I need..."

// Masking already applied (skip if fully masked, flag if only partially masked)
"XXXX XXXX 0123" → already masked, skip
"2345 XXXX 0123" → partially masked, flag for review

Edge cases that trip up naive implementations include: numbers split across line breaks, numbers embedded in URLs or file paths, numbers in JSON/XML payloads with escaped characters, and numbers in multilingual text where surrounding context is in Hindi or regional languages.
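The formats listed above can be covered by one candidate pattern plus a small classifier for tokens that are already masked. This is a sketch under stated assumptions: the pattern assumes 4-4-4 grouping and a first digit of 2 to 9, and `find_candidates` and `mask_state` are illustrative names.

```python
import re

# Assumption: 12 digits in 4-4-4 groups, first digit 2-9, with an optional
# space, hyphen, or line break between groups.
CANDIDATE = re.compile(r"(?<!\d)([2-9]\d{3})[\s-]?(\d{4})[\s-]?(\d{4})(?!\d)")


def find_candidates(text: str) -> list:
    """Return (start, end, digits) per candidate, with separators removed."""
    return [(m.start(), m.end(), "".join(m.groups()))
            for m in CANDIDATE.finditer(text)]


def mask_state(token: str) -> str:
    """Classify a 4-4-4 token as 'masked', 'partial', 'raw', or 'other'."""
    compact = re.sub(r"[\s-]", "", token).upper()
    if re.fullmatch(r"X{8}\d{4}", compact):
        return "masked"    # already masked, skip
    if re.fullmatch(r"[\dX]{12}", compact) and "X" in compact:
        return "partial"   # partially masked, flag for review
    if re.fullmatch(r"\d{12}", compact):
        return "raw"       # needs checksum validation and masking
    return "other"
```

The pattern tolerates a real line break between groups, but not escaped sequences such as a literal `\n` inside a JSON string, URL-encoded separators, or non-ASCII digits. Those need a decoding or normalization pass before matching. A phone number with country code written as 12 continuous digits still comes back as a candidate, which is the false-positive case the checksum and context stages are there to filter.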

The Gateway Approach: How CrewCheck Handles Aadhaar Masking

CrewCheck's LLM gateway intercepts every AI request at the network level. Before any prompt reaches OpenAI, Anthropic, or any other provider, it passes through a multi-stage detection pipeline:

Stage 1: Pattern extraction — identifies all candidate 12-digit sequences using optimized regex patterns that account for separators, formatting, and embedding contexts.

Stage 2: Verhoeff validation — each candidate is run through the Verhoeff checksum algorithm. Numbers that fail the checksum are discarded as false positives.

Stage 3: Contextual scoring — surrounding text is analyzed for identity-related keywords in English, Hindi, and regional languages. High-context matches get elevated confidence scores.

Stage 4: Masking application — validated Aadhaar numbers are replaced with masked versions (XXXX XXXX 0123) before the request is forwarded. The original value is never stored or logged.
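The four stages can be chained in a few lines. The sketch below is not CrewCheck's implementation: the keyword list, the confidence weights (0.5 for a passing checksum, plus 0.4 for nearby context), and the 40-character window are made-up values for illustration, and the checksum function is passed in as `checksum_ok` so that any Verhoeff implementation can be plugged in.

```python
import re
from typing import Callable

# Stage 1 pattern (assumed): 4-4-4 digits, first digit 2-9, optional separators.
CANDIDATE = re.compile(r"(?<!\d)([2-9]\d{3})[\s-]?(\d{4})[\s-]?(\d{4})(?!\d)")

# Illustrative keywords only; plain substring matching is crude
# ("uid" also matches "fluid"), so a real list needs word boundaries.
KEYWORDS = ("aadhaar", "aadhar", "uid", "uidai", "identity", "आधार")


def context_score(text: str, start: int, end: int, window: int = 40) -> float:
    """Return 1.0 if an identity keyword appears within `window` characters."""
    nearby = text[max(0, start - window): end + window].lower()
    return 1.0 if any(k in nearby for k in KEYWORDS) else 0.0


def mask_aadhaar(
    text: str,
    checksum_ok: Callable[[str], bool],
    threshold: float = 0.5,
) -> str:
    """Run extraction, checksum, context scoring, and masking over `text`."""

    def replace(m: re.Match) -> str:
        digits = "".join(m.groups())
        if not checksum_ok(digits):                      # Stage 2
            return m.group(0)
        confidence = 0.5 + 0.4 * context_score(text, m.start(), m.end())  # Stage 3
        if confidence < threshold:
            return m.group(0)
        return f"XXXX XXXX {digits[-4:]}"                # Stage 4

    return CANDIDATE.sub(replace, text)                  # Stage 1
```

With the default threshold, every checksum-valid candidate is masked and context only raises the recorded confidence. Raising the threshold to 0.8 masks only candidates with a nearby keyword, which trades recall for fewer false alerts.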

This pipeline is designed to stay inside a tight gateway latency budget while preserving detection quality. CrewCheck's current production measurement is sub-100ms gateway overhead at P95, reported separately from upstream provider time.

Audit Trail and Compliance Evidence

Every masking event generates an immutable audit record containing: the timestamp, the requesting application, the detection confidence score, the masking action taken, and the downstream provider. The original Aadhaar number is never stored in the audit trail.

This evidence is critical for DPDP compliance. When the Data Protection Board requests proof that your AI systems protect Aadhaar data, you need timestamped, tamper-evident records showing that masking was applied consistently across all requests.
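One common way to make such records tamper-evident is a hash chain, where each record stores the hash of the previous one. The sketch below assumes that approach; the article does not say how CrewCheck implements immutability, and `MaskingEvent`, `append_event`, and `verify_chain` are hypothetical names. The record carries only the fields listed above and never the Aadhaar number itself.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone

GENESIS = "0" * 64  # prev_hash used by the first record in a log


@dataclass(frozen=True)
class MaskingEvent:
    timestamp: str
    application: str
    confidence: float
    action: str
    provider: str
    prev_hash: str
    record_hash: str = ""


def _digest(fields: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the record fields."""
    return hashlib.sha256(json.dumps(fields, sort_keys=True).encode()).hexdigest()


def append_event(log: list, application: str, confidence: float,
                 action: str, provider: str) -> MaskingEvent:
    """Append a record whose hash covers its fields and the previous hash."""
    fields = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "application": application,
        "confidence": confidence,
        "action": action,
        "provider": provider,
        "prev_hash": log[-1].record_hash if log else GENESIS,
    }
    event = MaskingEvent(**fields, record_hash=_digest(fields))
    log.append(event)
    return event


def verify_chain(log: list) -> bool:
    """Recompute every hash; any edited or reordered record breaks the chain."""
    prev = GENESIS
    for event in log:
        fields = asdict(event)
        stored = fields.pop("record_hash")
        if event.prev_hash != prev or _digest(fields) != stored:
            return False
        prev = stored
    return True
```

An in-memory list only illustrates the chaining. Durable evidence would also need append-only storage and an externally anchored copy of the latest hash, since whoever controls the log could otherwise rebuild the whole chain.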

The governance dashboard provides real-time visibility into Aadhaar detection events — how many were detected today, which applications are generating the most detections, and whether any requests bypassed the gateway.

Common Mistakes to Avoid

Organizations implementing Aadhaar masking frequently make these errors that undermine their compliance posture:

✗ Masking only in the application layer: developers forget edge cases, new apps bypass controls
✗ Using regex without Verhoeff validation: floods teams with false positive alerts
✗ Logging the original Aadhaar number in error logs or debug output
✗ Masking input but not scanning model responses for Aadhaar numbers in output
✗ Assuming masking is only needed for customer-facing apps: internal tools need it too
✗ Not testing with real-world formats (hyphenated, spaced, embedded in sentences)
✗ Skipping multilingual detection: Aadhaar numbers appear in Hindi and regional language documents

Measuring Detection Effectiveness

A masking system is only as good as its detection rate. Key metrics to track:

True Positive Rate (Sensitivity): What percentage of actual Aadhaar numbers are correctly detected? Target: >99%.

False Positive Rate: What percentage of flagged numbers are not actually Aadhaar numbers? (Strictly speaking, this is the false discovery rate, or 1 minus precision.) Target: <5%.

Latency Impact: How much gateway overhead does detection add to each request? Current measured CrewCheck overhead is sub-100ms at P95, excluding upstream provider time.

Coverage: What percentage of AI requests pass through the detection pipeline? Target: 100%.

CrewCheck's public benchmark currently reports overall F1 across 242 labeled prompts, with entity and language breakdowns published on the benchmark page.
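These metrics reduce to simple ratios over a labeled test set. A minimal sketch, assuming you have already counted correct detections (`tp`), wrongly flagged numbers (`fp`), and missed Aadhaar numbers (`fn`); the function names are illustrative. The "false positive" target above is computed here as the share of flagged numbers that are not Aadhaar numbers.

```python
def detection_metrics(tp: int, fp: int, fn: int) -> dict:
    """Compute detection metrics from true positives, false positives, misses."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # sensitivity
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    flagged_wrong = fp / (tp + fp) if (tp + fp) else 0.0  # share of bad flags
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "recall": recall,
        "precision": precision,
        "false_discovery_rate": flagged_wrong,
        "f1": f1,
    }


def coverage(gateway_requests: int, total_requests: int) -> float:
    """Fraction of AI requests that passed through the detection pipeline."""
    return gateway_requests / total_requests if total_requests else 0.0
```

For example, 95 correct detections, 5 wrong flags, and 1 miss give a precision of 0.95 and a recall of 95/96, which meets the under-5% target only at the boundary and falls just short of the 99% sensitivity target.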

Frequently Asked Questions

Is Aadhaar masking legally required for AI systems?

Yes, in effect. Under the DPDP Act 2023, Aadhaar numbers are personal data (the Act has no separate "sensitive" category). Forwarding unmasked Aadhaar numbers to external AI providers conflicts with the Act's data minimization and purpose limitation principles. Penalties under the Act can reach ₹250 crore.

Can I just use a regex to detect Aadhaar numbers?

Regex alone produces 40-60% false positives because any 12-digit number matches. You need Verhoeff checksum validation to confirm the mathematical structure, plus contextual analysis for high-confidence detection.

What about Aadhaar numbers in AI model responses?

Output scanning is equally important. LLMs can generate Aadhaar-like numbers from training data or hallucinate them. Your masking pipeline should scan both inputs and outputs.

Does masking affect AI model performance?

Minimally. The masked format (XXXX XXXX 0123) preserves enough context for most AI tasks. If the model needs to reference the identity, the last 4 digits provide sufficient disambiguation without exposing the full number.
