Data Privacy & Compliance in AI: Building LLM Applications That Respect User Data
A healthcare company builds an AI assistant for doctors. Doctors paste patient notes into the chat to get diagnosis suggestions. Those notes - containing names, conditions, medications, and Social Security numbers - are sent to a third-party LLM API. The API provider’s terms say they may use inputs for model improvement. The company has just violated HIPAA by sharing protected health information with a third party that has no Business Associate Agreement in place, for a purpose the patients never consented to.
This is not a contrived scenario. It happens constantly because AI applications process data in ways traditional software does not: user inputs become model context, context becomes training data (in some agreements), and outputs may contain information from other users’ inputs (in shared fine-tuning or retrieval systems).
Where data flows in LLM applications
```mermaid
graph TD
    subgraph user_data["User Data Flows"]
        UD["User Input<br/>(may contain PII)"]
        CTX["Retrieved Context<br/>(may contain others' PII)"]
        HIST["Conversation History<br/>(accumulates PII over time)"]
    end
    subgraph processing["Processing"]
        PROMPT["Prompt Assembly<br/>(combines all sources)"]
        API["LLM API Call<br/>(data leaves your infrastructure)"]
        OUTPUT["Response Generation<br/>(may leak PII from context)"]
    end
    subgraph storage["Storage"]
        LOGS["Logs & Traces<br/>(full prompts recorded)"]
        VDB["Vector Database<br/>(embedded user content)"]
        MEM["Memory Store<br/>(persistent user data)"]
    end
    UD --> PROMPT
    CTX --> PROMPT
    HIST --> PROMPT
    PROMPT --> API
    API --> OUTPUT
    PROMPT --> LOGS
    UD --> VDB
    OUTPUT --> MEM
    style UD fill:#FCEBEB,stroke:#A32D2D,color:#791F1F
    style API fill:#FAEEDA,stroke:#854F0B,color:#633806
    style LOGS fill:#FCEBEB,stroke:#A32D2D,color:#791F1F
```
Key compliance frameworks
GDPR (EU)
- Right to deletion: Users can request all their data be deleted - including from vector databases, conversation logs, and memory stores
- Data minimization: Only process PII necessary for the task
- Purpose limitation: Data collected for one purpose cannot be used for another (e.g., support conversations used for training)
- Cross-border transfer: Data sent to US-based LLM APIs may violate GDPR unless covered by adequacy decisions or standard contractual clauses
HIPAA (US Healthcare)
- Business Associate Agreement (BAA): Required with any third party processing PHI - including LLM providers
- Minimum necessary: Only share the minimum PHI needed for the task
- De-identification: Remove the 18 identifiers listed in the HIPAA Safe Harbor method before processing, wherever the task does not need them
- Audit trail: Track all access to and processing of PHI
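The minimum-necessary rule can be enforced mechanically with a per-task allowlist of fields. The following is a minimal sketch, assuming patient records are plain dictionaries; the field names and the `TASK_FIELDS` mapping are hypothetical and would come from your own data model.

```python
# Hypothetical per-task allowlists: each task sees only the fields it needs.
TASK_FIELDS = {
    "medical_coding": {"diagnosis_notes", "procedures"},
    "scheduling": {"preferred_times"},
}

def minimum_necessary(record: dict, task: str) -> dict:
    """Return a copy of the record containing only fields allowed for the task."""
    allowed = TASK_FIELDS.get(task, set())  # unknown task -> share nothing
    return {k: v for k, v in record.items() if k in allowed}

record = {
    "name": "Jane Doe",
    "ssn": "000-00-0000",
    "diagnosis_notes": "Type 2 diabetes without complications",
    "procedures": ["HbA1c test"],
}
prompt_fields = minimum_necessary(record, "medical_coding")
```

Field-level filtering does not catch identifiers embedded in free text such as `diagnosis_notes`, so it complements PII redaction rather than replacing it.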
SOC 2
- Data encryption: At rest and in transit
- Access controls: Who can see prompts, logs, and user data
- Monitoring: Detect unauthorized access to AI system data
- Vendor management: Assess LLM provider’s security posture
Practical compliance techniques
PII detection and redaction
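```python
from presidio_analyzer import AnalyzerEngine

# Build the engine once: it loads an NLP model and is expensive to construct.
analyzer = AnalyzerEngine()

def redact_before_llm(text):
    """Replace each detected PII span with its entity type, e.g. [PERSON]."""
    results = analyzer.analyze(text=text, language="en")
    redacted = text
    # Replace from the end of the string so earlier offsets stay valid.
    for result in sorted(results, key=lambda r: r.start, reverse=True):
        redacted = redacted[:result.start] + f"[{result.entity_type}]" + redacted[result.end:]
    return redacted

# Illustrative: "John Smith's SSN is 123-45-6789" → "[PERSON]'s SSN is [US_SSN]"
# (actual detections depend on the recognizers and NLP model installed)
```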
Data processing agreements with LLM providers
| Provider | Zero-data-retention option | BAA available | EU data residency |
|---|---|---|---|
| OpenAI | Yes (API, not ChatGPT) | Yes (Enterprise) | Yes (Azure) |
| Anthropic | Yes (API default) | Yes (via AWS) | Via AWS regions |
| Google | Yes (Vertex AI) | Yes (Vertex) | Yes |
| Azure OpenAI | Yes | Yes | Yes |
Self-hosted models for sensitive data
When data cannot leave your infrastructure:
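- Deploy open-weight models (Llama, Mistral) on your own GPU infrastructure
- If a private network boundary is sufficient, use managed services over private connectivity (Azure OpenAI with private endpoints, Amazon Bedrock through VPC endpoints); these keep traffic off the public internet but are not on-premise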
- Accept the quality/capability tradeoff vs frontier models
Retrieval access controls
In multi-tenant RAG systems, ensure users only retrieve their own data:
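```python
# vector_db and embed are placeholders for your vector store client and
# embedding function; user is assumed to expose .id and .access_level.
def search_with_permissions(query, user):
    results = vector_db.search(
        embedding=embed(query),
        # Apply the filter inside the database query so other tenants'
        # documents are never returned to the application.
        filter={"tenant_id": user.id, "access_level": {"$lte": user.access_level}},
    )
    return results
```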
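Query filters are easy to misconfigure, so a second check on the application side can catch mistakes before results reach the prompt. The following is a minimal sketch of that defense-in-depth step, assuming each search hit is a dictionary carrying `tenant_id` and `access_level` metadata; the hit shape and the `enforce_tenant` name are assumptions, not any particular vector database's API.

```python
def enforce_tenant(hits: list[dict], tenant_id: str, access_level: int) -> list[dict]:
    """Drop any hit the caller is not entitled to see, even if the database returned it."""
    allowed = []
    for hit in hits:
        same_tenant = hit.get("tenant_id") == tenant_id
        # A hit with no access level is treated as maximally restricted.
        level_ok = hit.get("access_level", float("inf")) <= access_level
        if same_tenant and level_ok:
            allowed.append(hit)
    return allowed

hits = [
    {"id": "a", "tenant_id": "t1", "access_level": 1},
    {"id": "b", "tenant_id": "t2", "access_level": 1},  # wrong tenant
    {"id": "c", "tenant_id": "t1", "access_level": 5},  # above caller's level
    {"id": "d", "tenant_id": "t1"},                     # missing level -> denied
]
safe = enforce_tenant(hits, tenant_id="t1", access_level=2)
```

Denying hits with missing metadata is a deliberate choice here: a record that was indexed without a tenant or level should fail closed rather than leak.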
Where compliance gets hard
Training data leakage: If your fine-tuned model was trained on User A’s data, can User B’s queries cause the model to generate User A’s information? This is called memorization, and it is a real risk with small fine-tuning datasets.
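One way to probe for memorization is to plant a unique canary string in each user's fine-tuning data and then check whether the model reproduces it in response to other users' prompts. The sketch below assumes `generate` is any callable that maps a prompt to model output; the canary format and the stand-in model are hypothetical.

```python
from typing import Callable, Iterable

def leaked_canaries(generate: Callable[[str], str],
                    prompts: Iterable[str],
                    canaries: dict[str, str]) -> set[str]:
    """Return the user IDs whose canary string appears in any model output."""
    leaked = set()
    for prompt in prompts:
        output = generate(prompt)
        for user_id, canary in canaries.items():
            if canary in output:
                leaked.add(user_id)
    return leaked

# Canaries planted in each user's training examples before fine-tuning.
canaries = {"user_a": "canary-7f3a91", "user_b": "canary-c02d55"}

def fake_model(prompt: str) -> str:
    # Stand-in for a fine-tuned model that has memorized user_a's data.
    return "Account note: canary-7f3a91" if "account" in prompt else "No data."

result = leaked_canaries(fake_model, ["show account notes", "hello"], canaries)
```

A clean result from a probe like this does not prove the absence of memorization; it only shows that the tested prompts did not surface the canaries.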
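Right to deletion in embeddings: A user requests deletion. You delete their source documents, but the embedding vectors computed from that content still sit in your vector database and can still be retrieved. Every vector derived from their data must be deleted as well, which is only possible if each vector carries metadata linking it to its owner. If their content was also used to fine-tune a model, deleting vectors is not enough - the model may need to be retrained without their data.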
Cross-border complexity: User in Germany → your server in US → LLM API in US → vector DB in EU. Data crosses borders multiple times in one request.
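Listing each hop with its region makes the crossings explicit, since each one needs its own transfer mechanism. The following is a minimal sketch, assuming a hand-maintained list of hops; the hop names and the two-label region scheme are illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hop:
    name: str
    region: str  # e.g. "EU" or "US"

def border_crossings(path: list[Hop]) -> list[tuple[str, str]]:
    """Return (from, to) pairs for each consecutive hop that changes region."""
    return [
        (a.name, b.name)
        for a, b in zip(path, path[1:])
        if a.region != b.region
    ]

# The request path described above.
request_path = [
    Hop("user (Germany)", "EU"),
    Hop("app server", "US"),
    Hop("LLM API", "US"),
    Hop("vector DB", "EU"),
]
crossings = border_crossings(request_path)
```

For this path the function reports two crossings: the user's data leaving the EU for the app server, and the write from the LLM step back into the EU-hosted vector database.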
Observability vs privacy: Full prompt logging (essential for debugging) conflicts with data minimization (only store what is necessary). Solution: redacted logs with an option to temporarily enable full logging for specific investigations.
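Redacted-by-default logging with a scoped override can be sketched with the standard library's `logging.Filter`. This is a minimal sketch, assuming two regex patterns stand in for a fuller PII detector; the `RedactingFilter` class and its `full_logging` flag are hypothetical names.

```python
import logging
import re

SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")

class RedactingFilter(logging.Filter):
    """Redact PII patterns from log messages unless full logging is enabled."""

    def __init__(self, full_logging: bool = False):
        super().__init__()
        self.full_logging = full_logging  # enable only for a scoped investigation

    def filter(self, record: logging.LogRecord) -> bool:
        if not self.full_logging:
            message = record.getMessage()  # merges msg with its args
            message = SSN.sub("[US_SSN]", message)
            message = EMAIL.sub("[EMAIL]", message)
            record.msg, record.args = message, ()
        return True  # never drop the record, only rewrite it

logger = logging.getLogger("llm.prompts")
logger.addFilter(RedactingFilter())
```

Attaching the filter to the logger means every handler downstream sees only the redacted message, so a misconfigured handler cannot write raw prompts to disk.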
How to apply in practice
Assess data sensitivity before choosing architecture. If data is highly sensitive (medical, financial, legal), start with self-hosted models or providers with BAAs and zero-retention agreements. Do not retrofit compliance - design for it.
Implement PII detection at the input layer. Redact PII before it enters any system - model, vector database, or logs. This is the simplest defense with the broadest effect.
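Redacting at a single entry point, then fanning out only the redacted text, keeps raw PII away from every downstream system at once. The sketch below assumes a simple regex `redact` function as a stand-in for a real detector; `handle_input` and the list-based sinks are hypothetical.

```python
import re
from typing import Callable

SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def redact(text: str) -> str:
    # Stand-in redactor; a production system would use a fuller PII detector.
    return SSN.sub("[US_SSN]", text)

def handle_input(raw: str, sinks: list[Callable[[str], None]]) -> str:
    """Redact once at the entry point, then pass only the redacted text onward."""
    clean = redact(raw)
    for sink in sinks:
        sink(clean)
    return clean

# Lists stand in for the model client, the indexing queue, and the log writer.
model_inputs, index_queue, log_lines = [], [], []
handle_input(
    "Patient SSN 123-45-6789 reports chest pain",
    sinks=[model_inputs.append, index_queue.append, log_lines.append],
)
```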
Document your data flows. Map where user data goes: which services process it, where it is stored, and who can access it. GDPR requires records of processing activities (Article 30), and the same map is useful for security audits.
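A data flow map is easier to keep current when it lives in code next to the system it describes. The following is a minimal sketch, assuming a hand-written inventory; the `DataFlow` fields and the example entries are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataFlow:
    source: str
    destination: str
    contains_pii: bool
    stored: bool
    accessible_by: tuple[str, ...]

# Hypothetical inventory mirroring a typical LLM application.
FLOWS = [
    DataFlow("user input", "LLM API", True, False, ("provider",)),
    DataFlow("prompt", "logs", True, True, ("on-call engineers",)),
    DataFlow("user input", "vector database", True, True, ("backend service",)),
    DataFlow("product docs", "vector database", False, True, ("backend service",)),
]

def pii_at_rest(flows: list[DataFlow]) -> list[str]:
    """List destinations where PII is stored, i.e. where deletion and access review apply."""
    return sorted({f.destination for f in flows if f.contains_pii and f.stored})

stores = pii_at_rest(FLOWS)
```

Queries like `pii_at_rest` turn the map into a checklist: every destination it returns needs a deletion path and an access review.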
Use data processing agreements (DPAs) with all LLM providers. Even if using “zero retention” APIs, a formal DPA defines responsibilities if something goes wrong.
Design for deletion from day one. Every piece of user data should be tagged with a user ID. Deletion should be a single operation that cascades through all storage layers: conversations, embeddings, memories, logs.
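A cascade can be sketched as a registry of per-store delete functions that a single call walks through. This is a minimal sketch, assuming in-memory lists stand in for real storage layers and that every record carries a `user_id`; `make_deleter` and `delete_user` are hypothetical names.

```python
from typing import Callable

# In-memory stand-ins for real storage layers; every record carries a user_id.
conversations = [{"user_id": "u1", "text": "hi"}, {"user_id": "u2", "text": "hello"}]
embeddings = [{"user_id": "u1", "vector": [0.1, 0.2]}, {"user_id": "u1", "vector": [0.3, 0.4]}]
memories = [{"user_id": "u2", "fact": "prefers email"}]
logs = [{"user_id": "u1", "line": "prompt sent"}]

def make_deleter(store: list[dict]) -> Callable[[str], int]:
    """Build a delete function for one store that returns how many records it removed."""
    def delete(user_id: str) -> int:
        before = len(store)
        store[:] = [r for r in store if r["user_id"] != user_id]
        return before - len(store)
    return delete

DELETERS = {
    "conversations": make_deleter(conversations),
    "embeddings": make_deleter(embeddings),
    "memories": make_deleter(memories),
    "logs": make_deleter(logs),
}

def delete_user(user_id: str) -> dict[str, int]:
    """Cascade deletion through every registered store and report counts per store."""
    return {name: delete(user_id) for name, delete in DELETERS.items()}

report = delete_user("u1")
```

Returning per-store counts gives the deletion request an auditable record, and a new storage layer only needs one more entry in `DELETERS` to join the cascade.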
FAQ
Q: If an LLM provider says “zero data retention,” is that sufficient for compliance?
It is necessary but not sufficient. Zero retention means they do not store your prompts/responses - but during processing, data exists in their memory. You still need: a DPA, appropriate security certifications (SOC 2), clarity on data handling during processing, and assurance that zero-retention applies to all sub-processors. For regulated industries (healthcare, finance), you likely also need the provider to sign a BAA or equivalent.
Q: Can I use a user’s data to improve my AI system (fine-tuning, RAG) without separate consent?
Under GDPR, generally not - using data for model improvement is a different purpose from delivering the original service, so it needs its own lawful basis. Explicit consent is the clearest basis for training. Even under less strict frameworks, using one user’s data in ways that might influence other users’ experiences requires careful consent management. Best practice: ask for separate consent to “use my data to improve the service” and honor opt-outs by excluding that data from all training and retrieval pipelines.
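Honoring opt-outs in a pipeline comes down to filtering on a consent record before any data is used. The following is a minimal sketch, assuming a dictionary acts as the consent registry; `IMPROVEMENT_CONSENT` and the `Example` type are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Example:
    user_id: str
    text: str

# Hypothetical consent registry: user_id -> agreed to "use my data to improve the service".
IMPROVEMENT_CONSENT = {"u1": True, "u2": False}

def consented_only(examples: list[Example]) -> list[Example]:
    """Keep only examples whose owner has opted in; unknown users count as opted out."""
    return [e for e in examples if IMPROVEMENT_CONSENT.get(e.user_id, False)]

candidates = [
    Example("u1", "How do I reset my password?"),
    Example("u2", "My invoice is wrong."),
    Example("u3", "Cancel my plan."),
]
training_set = consented_only(candidates)
```

Treating a missing consent record as an opt-out keeps the default safe when the registry is incomplete.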
Interview questions
Q: Design a HIPAA-compliant AI medical coding assistant that helps doctors code patient diagnoses. Patient records contain PHI that the AI needs to process.
Architecture: (1) Deploy the model on infrastructure covered by a BAA (Azure OpenAI with a BAA, or a self-hosted model in a HIPAA-eligible cloud environment; there is no official “HIPAA certification”). No third-party API calls without a BAA. (2) Data flow: patient record → PHI detection layer (identify, but do not remove, the PHI needed for coding) → encrypted transmission to the model → response → audit log (encrypted, access-controlled). (3) Access controls: only authenticated doctors with a treatment relationship to the patient can process that patient’s records. Every access is logged. (4) De-identification option: for training or improvement, de-identify records using the HIPAA Safe Harbor method before any use. (5) Retention: keep data only as long as necessary (30 days for audit, then auto-delete unless the clinical record requires longer). (6) Vector database: if using RAG over medical knowledge (not patient data), keep the knowledge base entirely separate from patient data. Never embed patient records in a shared vector space.
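The retention rule in step (5) can be sketched as a purge that a scheduled job runs. This is a minimal sketch, assuming each stored entry is a dictionary with a timezone-aware `created_at`; the 30-day window comes from the design above, while the `clinical_hold` flag and `purge_expired` name are hypothetical.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=30)  # window taken from step (5) of the design

def purge_expired(entries: list[dict], now: datetime) -> list[dict]:
    """Keep entries inside the retention window or flagged for a clinical-record hold."""
    return [
        e for e in entries
        if e.get("clinical_hold", False) or now - e["created_at"] <= RETENTION
    ]

now = datetime(2024, 6, 30, tzinfo=timezone.utc)
records = [
    {"id": 1, "created_at": now - timedelta(days=45)},                         # expired
    {"id": 2, "created_at": now - timedelta(days=45), "clinical_hold": True},  # held
    {"id": 3, "created_at": now - timedelta(days=5)},                          # recent
]
kept = purge_expired(records, now)
```

Passing `now` in as an argument, rather than reading the clock inside the function, keeps the purge deterministic and easy to test.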