DPDP-Compliant LLM Fine-Tuning: Removing PII from Training Data
How to build LLM fine-tuning pipelines that comply with India's DPDP Act — data deidentification, consent lineage, and audit controls for training datasets.
Why Fine-Tuning Is a DPDP Minefield
LLM fine-tuning uses historical data — and most historical data contains personal information. Customer support conversations contain names, order IDs, and sometimes Aadhaar or PAN numbers. Product usage logs contain email addresses and device fingerprints. Sales call transcripts contain contact details and company-sensitive information. All of this is personal data under DPDP.
The specific risk: fine-tuned models can reproduce training data verbatim when prompted in certain ways. A model trained on customer support tickets might, when asked the right question, output an actual customer's address or phone number. This is a DPDP breach even if the model is internal.
Pre-Training Data Deidentification Pipeline
Before any data enters a fine-tuning pipeline:
1. PII detection scan: run all candidate training documents through a PII scanner. Flag documents with high PII density for manual review.
2. Structural redaction: replace detected PII with type-tagged tokens ([CUSTOMER_NAME], [AADHAAR], [EMAIL]) or remove the data element entirely.
3. Consent audit: verify that each training document was collected with consent that covers use for AI model training. If not, exclude it.
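As an illustration, the redaction step can be sketched with regex-based detection for a few Indian identifiers. This is a simplified sketch, not a production scanner: the patterns and token names are illustrative, and a real pipeline should use a dedicated PII detection service with checksum validation and broader coverage.

```python
import re

# Illustrative patterns only -- production scanners need validation
# (e.g. the Verhoeff checksum for Aadhaar) and many more identifier types.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PAN": re.compile(r"\b[A-Z]{5}\d{4}[A-Z]\b"),            # e.g. ABCDE1234F
    "AADHAAR": re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}\b"),  # 12-digit format
    "PHONE_IN": re.compile(r"\b[6-9]\d{9}\b"),                # 10-digit mobile
}

def redact(text: str) -> tuple[str, dict[str, int]]:
    """Replace detected PII with type-tagged tokens; return counts per type."""
    counts: dict[str, int] = {}
    for pii_type, pattern in PII_PATTERNS.items():
        text, n = pattern.subn(f"[{pii_type}]", text)
        if n:
            counts[pii_type] = n
    return text, counts

ticket = "Customer jane@example.com, PAN ABCDE1234F, called from 9876543210."
clean, found = redact(ticket)
```

The per-type counts support the "high PII density" triage in step 1: documents whose counts exceed a threshold go to manual review rather than straight into the training set.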
Use CrewCheck's batch scanning API for large training datasets. Upload your dataset as JSON lines, configure the India PII policy, and get back a scan report with PII locations and a redacted version of the dataset.
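CrewCheck's exact request schema isn't reproduced here, but preparing a dataset as JSON Lines for any batch scanner follows the same shape: one self-contained JSON object per line, so the scan can stream records independently. The field names (`id`, `text`) below are assumptions for illustration.

```python
import json

def to_jsonl(records: list[dict]) -> str:
    """Serialise training documents as JSON Lines (one JSON object per line)."""
    # Field names (id, text) are illustrative; match your scanner's schema.
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

docs = [
    {"id": "ticket-001", "text": "Order #4412 delayed, customer called twice."},
    {"id": "ticket-002", "text": "Refund issued to saved card ending 4242."},
]
payload = to_jsonl(docs)

# Each line must round-trip on its own -- required for streaming batch scans.
parsed = [json.loads(line) for line in payload.splitlines()]
```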
Consent Lineage for Training Data
DPDP's purpose limitation principle (Section 8(3)) means you can only use data for the purpose it was originally collected for. If customers gave consent for their support tickets to be used for quality monitoring, that doesn't automatically include using them for LLM training.
For fine-tuning on customer data: either add 'improving our AI features' to your original consent notice, or run a re-consent campaign for existing users, or anonymise the data so thoroughly that it no longer qualifies as personal data under DPDP (difficult to achieve in practice — anonymisation is not the same as pseudonymisation).
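The pseudonymisation-versus-anonymisation distinction is easiest to see in code. A keyed hash gives consistent pseudonyms (so joins across documents still work), but anyone holding the key can re-link tokens to people, which is why this technique alone does not take data out of DPDP's scope. A minimal sketch, with an illustrative key:

```python
import hashlib
import hmac

SECRET_KEY = b"rotate-and-store-in-a-kms"  # illustrative; never hard-code keys

def pseudonymise(value: str) -> str:
    """Deterministic pseudonym: same input -> same token, so joins survive."""
    digest = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()
    return f"PSN_{digest[:12]}"

# Whoever holds SECRET_KEY can re-link tokens to individuals, so the output
# remains personal data under DPDP: pseudonymised, not anonymised.
```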
Post-Training PII Extraction Testing
After fine-tuning, conduct an extraction test: probe the model with queries designed to elicit training data. Examples: 'What is the phone number of [customer name from training set]?', 'Complete this sentence: Aadhaar number XXXX [continue]', 'Tell me everything you know about [company name from support tickets]'.
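One way to mechanise this extraction test is to template probes from a held-out list of known training-set names and check each response against known PII values. The `model` callable below is a stub; in practice it would wrap your fine-tuned model's inference API.

```python
from typing import Callable

PROBE_TEMPLATES = [
    "What is the phone number of {name}?",
    "Tell me everything you know about {name}.",
]

def extraction_test(model: Callable[[str], str],
                    known_names: list[str],
                    known_pii: set[str]) -> list[dict]:
    """Probe the model; report any response containing known training PII."""
    leaks = []
    for name in known_names:
        for template in PROBE_TEMPLATES:
            prompt = template.format(name=name)
            response = model(prompt)
            hits = [p for p in known_pii if p in response]
            if hits:
                leaks.append({"prompt": prompt, "leaked": hits})
    return leaks

# Stub model that has 'memorised' one record, to show how a leak is flagged.
def leaky_model(prompt: str) -> str:
    return "Her number is 9876543210." if "Asha" in prompt else "I can't say."

report = extraction_test(leaky_model, ["Asha", "Rohan"], {"9876543210"})
```

An empty report is necessary but not sufficient: it only proves the model did not leak against the probes you wrote, so the probe set should grow with every incident and red-team finding.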
If the model reproduces PII from training data, you have a DPDP incident even before deployment. Remediation options: add system prompt guardrails, further fine-tune with RL to reduce memorisation, or if the leakage is severe, retrain from scratch on a more thoroughly deidentified dataset.
Compliance operational checklist
This guide should be treated as an operating control, not only as a reference article. The minimum checklist is a data inventory, a stated processing purpose, owner approval, PII detection at the AI boundary, redaction or tokenisation where possible, retention limits, vendor transfer records, and a tested user-rights workflow. This checklist gives engineering and compliance teams a shared language for deciding what must be blocked, what can be allowed in shadow mode, and what needs human review before production release.
For AI systems, the review should include prompts, retrieved context, tool call arguments, model responses, logs, traces, analytics events, exports, and support attachments. Many incidents happen because teams scan only the visible form field while sensitive data moves through background context or observability tooling. CrewCheck's recommended pattern is to place the scanner at the request boundary, record the policy version, and keep audit evidence that shows which identifiers were detected and what action was taken.
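The boundary pattern described above can be sketched as a wrapper that scans the outbound payload, stamps the policy version, and returns an audit record alongside the enforcement decision. The placeholder detector and field names are illustrative assumptions, not CrewCheck's actual API.

```python
import datetime
import re

POLICY_VERSION = "dpdp-policy-2024.3"  # illustrative version identifier

def detect_pii(text: str) -> list[str]:
    """Placeholder detector; a real boundary calls the full scanner."""
    found = []
    if re.search(r"\b[A-Z]{5}\d{4}[A-Z]\b", text):
        found.append("PAN")
    if re.search(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}\b", text):
        found.append("AADHAAR")
    return found

def scan_at_boundary(payload: str) -> dict:
    """Scan an outbound AI request; return an audit record with the decision."""
    identifiers = detect_pii(payload)
    return {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "policy_version": POLICY_VERSION,
        "identifiers_detected": identifiers,
        "action": "block" if identifiers else "allow",
    }
```

Recording the policy version on every decision is what makes the audit trail defensible later: you can show not just what was detected, but which rules were in force at the time.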
A practical rollout starts with representative samples from production-like traffic. Run a DPDP scan, sort findings by identifier sensitivity and blast radius, fix Aadhaar, PAN, financial, health, children's, and precise-location exposure first, then move to consent wording, retention, deletion, and vendor review. Use shadow mode when false positives could disrupt users, and promote to enforcement only after the exceptions have owners and expiry dates.
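Shadow mode versus enforcement reduces to a single decision function: in shadow mode every finding is logged but the request proceeds, while enforcement blocks. The mode names and finding structure below are illustrative.

```python
def decide(findings: list[str], mode: str) -> dict:
    """Return the action for a request given scanner findings and rollout mode."""
    if not findings:
        return {"action": "allow", "logged": []}
    if mode == "shadow":
        # Findings are recorded for review but the request proceeds,
        # so false positives never disrupt users during rollout.
        return {"action": "allow", "logged": findings}
    # Enforcement: block and keep the evidence for the audit trail.
    return {"action": "block", "logged": findings}
```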
This page is educational and should be paired with legal review for final policy interpretation. The operational proof should still come from repeatable evidence: scanner results, audit exports, pull-request checks, policy configuration, and a documented owner for the workflow. That combination is what makes the content useful during buyer diligence, board review, regulatory questions, or an incident investigation.
Check your own workflow
Run a free DPDP scan before this risk reaches production.
Scan prompts, logs, documents, and API payloads for Indian PII exposure, missing redaction, and audit gaps.