What is Personal Data under the DPDP Act? (With Real Indian Examples)
Understand the DPDP Act's definition of personal data with Aadhaar, PAN, UPI, and AI-generated data examples. Know what you must protect and what falls outside the Act.
The statutory definition (Section 2(t))
Section 2(t) of the DPDP Act defines 'personal data' as any data about an individual who is identifiable by or in relation to such data. This definition is deliberately broad and technology-neutral. It does not require that the individual be directly identified — data that makes someone identifiable, even indirectly or in combination with other data, qualifies as personal data.
The phrase 'in relation to such data' extends coverage significantly. Customer behaviour data that does not contain a name or ID number can still be personal data if it can be combined with other data to identify an individual. Session logs, clickstream data, purchase patterns, and location traces all qualify when they are linked to identifiable individuals in your database.
Critically, the Act covers only digital personal data: data collected in digital form, or collected in non-digital form and digitised subsequently. Handwritten records fall outside the Act until they are digitised. In practice this exclusion is narrow, since nearly all data in modern SaaS products is digital from the moment of creation.
Indian personal data types: the complete list
Indian personal data has patterns distinct from Western equivalents. Aadhaar numbers are 12-digit identifiers issued by UIDAI and governed additionally by the Aadhaar Act 2016. They identify a person on their own; although the DPDP Act does not carve out a separate 'sensitive' category, Aadhaar numbers are high-sensitivity identifiers subject to the Aadhaar Act's own restrictions. PAN (Permanent Account Number) is a ten-character alphanumeric identifier (five letters, four digits, one letter, e.g. ABCDE1234F) that uniquely identifies a taxpayer; any system that processes PAN data is processing personal data.
UPI IDs (user@provider format) link directly to bank accounts and are financial personal data. IFSC codes by themselves identify bank branches, not individuals, but in combination with account numbers they constitute financial personal data. Indian mobile numbers (10 digits starting with 6-9, often prefixed +91) are personal data, since they are assigned to individuals and used for authentication. Voter IDs (three letters followed by seven digits), driving licences (state code plus digits), and passport numbers are all government-issued identifiers that are personal data.
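To make these formats concrete, here is a minimal detector sketch. The patterns are illustrative shape checks only: a production detector would add context rules and checksum validation (Aadhaar numbers carry a Verhoeff check digit), and the UPI pattern deliberately overlaps with email addresses, so telling them apart needs surrounding context. All names here are hypothetical, not a fixed schema.

```python
import re

# Illustrative format patterns for common Indian identifiers.
# These are shape checks only; production detectors add context rules
# and checksum validation (e.g. Aadhaar's Verhoeff check digit).
INDIAN_PII_PATTERNS = {
    "aadhaar": re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}\b"),
    "pan": re.compile(r"\b[A-Z]{5}\d{4}[A-Z]\b"),
    # UPI IDs share their shape with email addresses; context is needed
    # to tell them apart.
    "upi_id": re.compile(r"\b[\w.\-]{2,}@[A-Za-z]{2,}\b"),
    "mobile": re.compile(r"(?<!\d)(?:\+91[ -]?)?[6-9]\d{9}(?!\d)"),
    "voter_id": re.compile(r"\b[A-Z]{3}\d{7}\b"),
}

def detect_identifiers(text: str) -> list[tuple[str, str]]:
    """Return (identifier_type, matched_value) pairs found in text."""
    findings = []
    for label, pattern in INDIAN_PII_PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append((label, match.group()))
    return findings

print(detect_identifiers("Refund to rahul@okhdfcbank, PAN ABCDE1234F"))
# -> [('pan', 'ABCDE1234F'), ('upi_id', 'rahul@okhdfcbank')]
```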
Health data, including diagnoses, prescriptions, lab results, and ABHA IDs from the Ayushman Bharat Digital Mission, warrants heightened protection: sector-specific frameworks treat it as sensitive even though the DPDP Act itself does not define a separate sensitive category. Biometric data (fingerprints, iris scans, facial recognition templates) is similarly high-risk. Financial data, including account numbers, credit scores, transaction histories, and salary information, is personal data that often requires additional protections under sector-specific regulations.
Does AI-generated content contain personal data?
This is one of the most contested questions in AI compliance. If an LLM generates a summary of a customer's support history — 'This customer has three unresolved complaints about billing' — that output is personal data about the customer even though a human did not write it. The Act's definition does not distinguish between human-generated and AI-generated data.
Model outputs that reference specific individuals are personal data. A model that says 'Rahul Sharma from Mumbai complained about...' is processing and generating personal data. A model that provides generic information ('common complaint categories include...') without reference to identifiable individuals is not generating personal data. The distinction is identifiability.
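One practical consequence: model outputs should pass through the same identifier check as inputs before they leave the service. A minimal sketch follows, assuming regex-based detection; note that bare names such as 'Rahul Sharma' need NER or entity matching, which regexes alone cannot provide.

```python
import re

# Stand-in patterns (see the fuller sketch above); a real boundary
# check would use the application's shared detector.
PAN = re.compile(r"\b[A-Z]{5}\d{4}[A-Z]\b")
MOBILE = re.compile(r"(?<!\d)(?:\+91[ -]?)?[6-9]\d{9}(?!\d)")

def redact_output(model_output: str) -> str:
    """Redact identifier-shaped spans in a model response.

    AI-generated text that references identifiable people is personal
    data, so it passes the same boundary check as user input. Names
    without attached identifiers need NER, not regex.
    """
    redacted = PAN.sub("[PAN]", model_output)
    return MOBILE.sub("[MOBILE]", redacted)

print(redact_output("Rahul Sharma (PAN ABCDE1234F, +919876543210) complained."))
# -> "Rahul Sharma (PAN [PAN], [MOBILE]) complained."
```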
More complex: does an embedding of a customer support conversation constitute personal data? An embedding is a vector of floating-point numbers — not human-readable, but potentially reversible. Research has shown that embeddings can be used to recover approximate versions of the original text. Until the Data Protection Board issues guidance, the cautious position is to treat embeddings derived from personal data as personal data themselves, applying appropriate access controls and retention policies.
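Pending that guidance, one way to operationalise the cautious position is to carry provenance and retention metadata on every stored vector. The record shape below is a hypothetical sketch, not a prescribed schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class EmbeddingRecord:
    vector: list[float]
    source_doc_id: str
    derived_from_personal_data: bool  # inherit the source text's status
    delete_after: datetime            # retention mirrors the source policy

def make_record(vector: list[float], source_doc_id: str,
                contains_pii: bool, retention_days: int = 90) -> EmbeddingRecord:
    # A scheduled job deletes records past delete_after, so erasure of
    # the source text also reaches its derived vectors.
    return EmbeddingRecord(
        vector=vector,
        source_doc_id=source_doc_id,
        derived_from_personal_data=contains_pii,
        delete_after=datetime.now(timezone.utc) + timedelta(days=retention_days),
    )
```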
What is NOT personal data under the DPDP Act
Anonymised data falls outside the Act's scope: once data is processed such that no individual can be identified through any means reasonably likely to be used, it is no longer 'personal data' under Section 2(t). However, true anonymisation is difficult to achieve and easy to overclaim; research consistently demonstrates that 'anonymised' datasets can be re-identified when combined with auxiliary information. Aggregate statistics, synthetic data generated without reference to real individuals, and data about organisations (not individuals) generally fall outside the Act.
Non-personal data, meaning data about processes, products, or events that is not about identifiable individuals, is outside the Act's scope. Server performance metrics, application error rates, and aggregate transaction volumes do not contain personal data. However, if any of these streams carry identifiers (user IDs, session tokens, and in some contexts IP addresses), those fields are personal data and must be handled as such.
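That boundary can be enforced mechanically: strip identifier-bearing fields from telemetry before it is treated as non-personal. The field names below are assumptions about a typical event shape, not a fixed schema.

```python
# Fields assumed to carry identifiers in a typical telemetry event;
# adjust to the actual schema.
IDENTIFIER_FIELDS = {"user_id", "session_token", "client_ip", "email"}

def scrub_event(event: dict) -> dict:
    """Drop identifier fields so the rest can be handled as non-personal."""
    return {k: v for k, v in event.items() if k not in IDENTIFIER_FIELDS}

event = {"route": "/checkout", "latency_ms": 412, "user_id": "u_829"}
print(scrub_event(event))  # {'route': '/checkout', 'latency_ms': 412}
```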
Data about deceased individuals is excluded from most data protection frameworks including DPDP, unless the data also concerns living individuals (e.g., medical data about hereditary conditions).
DPDP Act operational checklist
This page should be reviewed as an operating control, not only as a reference article. The minimum checklist is a data inventory, a stated processing purpose, owner approval, PII detection at the AI boundary, redaction or tokenisation where possible, retention limits, vendor transfer records, and a tested user-rights workflow. This checklist gives engineering and compliance teams a shared language for deciding what must be blocked, what can be allowed in shadow mode, and what needs human review before production release.
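One way to make the checklist enforceable rather than aspirational is to encode it as a per-workflow policy record that a release gate can verify. The field names and values below are illustrative, not a CrewCheck schema.

```python
# Illustrative per-workflow checklist record; a CI step or release gate
# refuses to ship while any item is missing.
DPDP_CHECKLIST = {
    "workflow": "support-assistant",
    "data_inventory_doc": "https://wiki.example.com/data-inventory",  # hypothetical
    "processing_purpose": "Summarise support tickets for agents",
    "owner_approved_by": "priya@example.com",  # hypothetical owner
    "pii_detection_at_ai_boundary": True,
    "redaction_or_tokenisation": "tokenise",
    "retention_days": 90,
    "vendor_transfer_records": True,
    "user_rights_workflow_tested": True,
}

missing = [k for k, v in DPDP_CHECKLIST.items() if v in (None, False, "")]
assert not missing, f"DPDP checklist incomplete: {missing}"
```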
For AI systems, the review should include prompts, retrieved context, tool call arguments, model responses, logs, traces, analytics events, exports, and support attachments. Many incidents happen because teams scan only the visible form field while sensitive data moves through background context or observability tooling. CrewCheck's recommended pattern is to place the scanner at the request boundary, record the policy version, and keep audit evidence that shows which identifiers were detected and what action was taken.
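A minimal sketch of that boundary pattern, with a stubbed detector: every request is scanned, the live policy version is pinned into the evidence, and the evidence is written before the payload moves on. The function names and evidence fields are assumptions, not CrewCheck's actual API.

```python
import json
import time
import uuid

POLICY_VERSION = "dpdp-policy-v3"  # illustrative label for the live ruleset

def scan(text: str) -> list[str]:
    """Stand-in for a real detector; returns identifier types found."""
    return ["pan"] if "ABCDE1234F" in text else []

def append_audit_log(evidence: dict) -> None:
    # Stand-in sink: real evidence belongs in durable, access-controlled
    # storage, never in general application logs.
    print(json.dumps(evidence))

def guard_request(payload: str) -> dict:
    findings = scan(payload)
    evidence = {
        "request_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "policy_version": POLICY_VERSION,  # which rules were live
        "detected": findings,              # which identifiers were found
        "action": "block" if findings else "allow",
    }
    append_audit_log(evidence)
    return evidence

guard_request("Customer PAN is ABCDE1234F")
```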
A practical rollout starts with representative samples from production-like traffic. Run a DPDP scan, sort findings by identifier sensitivity and blast radius, fix Aadhaar, PAN, financial, health, children's, and precise-location exposure first, then move to consent wording, retention, deletion, and vendor review. Use shadow mode when false positives could disrupt users, and promote to enforcement only after the exceptions have owners and expiry dates.
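Exception hygiene is what makes the promotion from shadow mode safe. A sketch, assuming a simple in-memory registry: every waiver names an owner and an expiry date, and expired waivers stop suppressing enforcement automatically.

```python
from datetime import date

# Hypothetical exception registry: each waiver carries an owner and an
# expiry so shadow-mode findings cannot be waived indefinitely.
EXCEPTIONS = {
    "billing-export": {"owner": "asha@example.com", "expires": date(2026, 6, 30)},
}

def decide(workflow: str, findings: list[str], mode: str) -> str:
    if not findings:
        return "allow"
    exc = EXCEPTIONS.get(workflow)
    if exc and exc["expires"] >= date.today():
        return "allow-with-exception"
    # Shadow mode records the finding without disrupting the user;
    # enforcement blocks once exceptions have owners and expiry dates.
    return "log-only" if mode == "shadow" else "block"

print(decide("support-assistant", ["aadhaar"], mode="shadow"))   # log-only
print(decide("support-assistant", ["aadhaar"], mode="enforce"))  # block
```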
This page is educational and should be paired with legal review for final policy interpretation. The operational proof should still come from repeatable evidence: scanner results, audit exports, pull-request checks, policy configuration, and a documented owner for the workflow. That combination is what makes the content useful during buyer diligence, board review, regulatory questions, or an incident investigation.
DPDP Act pillar implementation addendum
A pillar page should also connect the legal idea to a concrete implementation path. Start with ownership: name the product owner, engineering owner, security reviewer, and compliance reviewer for this topic. Then map the systems that can create, store, transform, or transmit the relevant personal data. The map should include frontend forms, backend APIs, queues, warehouses, LLM prompts, embedding stores, admin exports, vendor dashboards, and customer-success tooling.
Next, document the lawful purpose and the user-facing notice. The notice should be clear enough that a data principal understands what is processed, why AI may be involved, what categories of personal data are affected, and how consent or withdrawal works. If the workflow supports children, healthcare, financial services, employment, or government delivery, treat that context as higher risk and add stricter review before allowing personal data into model calls.
The engineering control should run before data leaves the application boundary. Scan the full prompt package, not just the user's message. That means system instructions, retrieved snippets, tool outputs, attachments, OCR text, chat history, and structured JSON all need inspection. When a high-confidence identifier is found, redact, tokenise, block, or route to a safer model depending on the policy. Keep the original sensitive value out of general logs unless a protected exception is approved.
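Concretely, the scan has to walk every part of the assembled request, not just the user turn. The package shape below is an assumption about how a typical application might stage its prompt parts.

```python
def scan(text: str) -> list[str]:
    """Stand-in detector; returns identifier types found in the text."""
    return ["aadhaar"] if "9999 8888 7777" in text else []

def scan_prompt_package(package: dict) -> list[tuple[str, list[str]]]:
    """Scan every prompt component, not only the user message."""
    findings = []
    for part in ("system", "history", "retrieved_snippets",
                 "tool_outputs", "attachments_ocr", "user_message"):
        for chunk in package.get(part, []):
            hits = scan(chunk)
            if hits:
                findings.append((part, hits))
    return findings

package = {
    "system": ["You are a support assistant."],
    "retrieved_snippets": ["Customer Aadhaar: 9999 8888 7777"],
    "user_message": ["Summarise this ticket."],
}
print(scan_prompt_package(package))  # [('retrieved_snippets', ['aadhaar'])]
```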
Audit evidence should be designed for reconstruction. A reviewer should be able to answer: when did the request happen, which application sent it, which data type was detected, which rule fired, what action was taken, which provider received the final payload, and who approved any exception. Without that trail, teams are left with policy claims rather than proof. With it, they can respond faster to buyer diligence, internal audits, breach triage, and regulator questions.
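Those reviewer questions map naturally onto a fixed record shape. The dataclass below is one possible layout, with each field answering one question from the paragraph above.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class AuditRecord:
    timestamp: datetime                 # when did the request happen
    application: str                    # which application sent it
    data_type: str                      # which data type was detected
    rule_id: str                        # which rule fired
    action: str                         # redact / tokenise / block / route
    provider: str                       # which provider received the payload
    exception_approver: str | None = None  # who approved any exception
```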
Finally, make the process repeatable. Add sample payloads to tests, run scheduled scans against logs and representative documents, review public guidance pages for accuracy, and keep the DPDP scanner linked from this page so readers can move from learning to action. The goal is not to freeze the system; it is to make every future AI workflow easier to review, safer to launch, and easier to explain.
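For the sample-payloads step, a small parametrised test can pin known-sensitive strings so a detector regression fails CI. The scan stub below stands in for the application's real boundary detector.

```python
import re
import pytest

# Stand-in detector; a real suite imports the application's shared
# boundary detector instead of redefining it here.
def scan(text: str) -> list[str]:
    hits = []
    if re.search(r"\b[A-Z]{5}\d{4}[A-Z]\b", text):
        hits.append("pan")
    if re.search(r"(?<!\d)(?:\+91[ -]?)?[6-9]\d{9}(?!\d)", text):
        hits.append("mobile")
    return hits

SAMPLES = [
    ("PAN ABCDE1234F on file", {"pan"}),
    ("call me at +919876543210", {"mobile"}),
    ("no identifiers here", set()),
]

@pytest.mark.parametrize("payload,expected", SAMPLES)
def test_detector_catches_known_samples(payload, expected):
    assert set(scan(payload)) == expected
```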
Check your own workflow
Run a free DPDP scan before this risk reaches production.
Scan prompts, logs, documents, and API payloads for Indian PII exposure, missing redaction, and audit gaps.