Data Privacy & Compliance in AI: Building LLM Applications That Respect User Data

A healthcare company builds an AI assistant for doctors. Doctors paste patient notes into the chat to get diagnosis suggestions. Those notes - containing names, conditions, medications, and social security numbers - are sent to a third-party LLM API. The API provider’s terms say they may use inputs for model improvement. Without a business associate agreement (BAA) in place, the company has just violated HIPAA: it disclosed protected health information (PHI) to an unauthorized third party, which may now use it for a purpose the patient never consented to.

This is not a contrived scenario. It happens constantly because AI applications process data in ways traditional software does not: user inputs become model context, context becomes training data (in some agreements), and outputs may contain information from other users’ inputs (in shared fine-tuning or retrieval systems).

Where data flows in LLM applications

graph TD
  subgraph user_data["User Data Flows"]
    UD["User Input<br/>(may contain PII)"]
    CTX["Retrieved Context<br/>(may contain others' PII)"]
    HIST["Conversation History<br/>(accumulates PII over time)"]
  end
  subgraph processing["Processing"]
    PROMPT["Prompt Assembly<br/>(combines all sources)"]
    API["LLM API Call<br/>(data leaves your infrastructure)"]
    OUTPUT["Response Generation<br/>(may leak PII from context)"]
  end
  subgraph storage["Storage"]
    LOGS["Logs & Traces<br/>(full prompts recorded)"]
    VDB["Vector Database<br/>(embedded user content)"]
    MEM["Memory Store<br/>(persistent user data)"]
  end

  UD --> PROMPT
  CTX --> PROMPT
  HIST --> PROMPT
  PROMPT --> API
  API --> OUTPUT
  PROMPT --> LOGS
  UD --> VDB
  OUTPUT --> MEM

  style UD fill:#FCEBEB,stroke:#A32D2D,color:#791F1F
  style API fill:#FAEEDA,stroke:#854F0B,color:#633806
  style LOGS fill:#FCEBEB,stroke:#A32D2D,color:#791F1F

Key compliance frameworks

GDPR (EU)

Right to erasure (deletion): Users can request that all their data be deleted - including from vector databases, conversation logs, and memory stores
Data minimization: Only process PII necessary for the task
Purpose limitation: Data collected for one purpose cannot be used for another (e.g., support conversations used for training)
Cross-border transfer: Data sent to US-based LLM APIs may violate GDPR unless covered by adequacy decisions or standard contractual clauses
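
Data minimization in particular can be enforced mechanically at prompt assembly. The following is a minimal sketch, not a prescribed design: the task names, field names, and the `ALLOWED_FIELDS` mapping are hypothetical, and a real system would derive them from its own data inventory.

```python
# Hypothetical per-task allowlists: each task sees only the fields it needs.
ALLOWED_FIELDS = {
    "summarize_ticket": {"ticket_text", "product"},
    "draft_reply": {"ticket_text", "product", "first_name"},
}

def minimize(record: dict, task: str) -> dict:
    """Return only the fields the given task is allowed to send to the model."""
    allowed = ALLOWED_FIELDS.get(task, set())  # unknown task -> send nothing
    return {key: value for key, value in record.items() if key in allowed}

record = {
    "ticket_text": "My order arrived damaged.",
    "product": "Desk lamp",
    "first_name": "Ana",
    "email": "ana@example.com",
    "date_of_birth": "1990-01-01",
}

# Only ticket_text and product survive for the summarization task.
prompt_fields = minimize(record, "summarize_ticket")
```

Defaulting to an empty allowlist for unknown tasks means a new feature sends nothing until someone decides which fields it needs.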

HIPAA (US Healthcare)

Business Associate Agreement (BAA): Required with any third party processing PHI - including LLM providers
Minimum necessary: Only share the minimum PHI needed for the task
De-identification: Where the task allows, remove the 18 Safe Harbor identifiers (or use Expert Determination) before processing
Audit trail: Track all access to and processing of PHI
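
As an illustration of the audit-trail item, the sketch below shows one possible way to make an access log tamper-evident by chaining entry hashes. HIPAA does not mandate this particular technique; the `AuditLog` class, its field names, and the example actor and patient references are all assumptions made for the example.

```python
import hashlib
import json
from datetime import datetime, timezone

class AuditLog:
    """Append-only log where each entry includes the hash of the previous one."""

    GENESIS = "0" * 64

    def __init__(self):
        self.entries = []

    @staticmethod
    def _digest(body: dict) -> str:
        return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

    def record(self, actor: str, action: str, patient_ref: str) -> dict:
        body = {
            "ts": datetime.now(timezone.utc).isoformat(),
            "actor": actor,
            "action": action,
            "patient_ref": patient_ref,  # an internal ID, not the PHI itself
            "prev_hash": self.entries[-1]["hash"] if self.entries else self.GENESIS,
        }
        entry = {**body, "hash": self._digest(body)}
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute hashes; an edited entry, or one removed mid-chain, fails."""
        prev_hash = self.GENESIS
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev_hash"] != prev_hash or entry["hash"] != self._digest(body):
                return False
            prev_hash = entry["hash"]
        return True

log = AuditLog()
log.record("dr_lee", "llm_prompt_sent", "patient-0042")
log.record("dr_lee", "llm_response_viewed", "patient-0042")
```

The log stores a reference to the patient rather than the record itself, so the audit trail does not become another copy of the PHI.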

SOC 2

Data encryption: At rest and in transit
Access controls: Who can see prompts, logs, and user data
Monitoring: Detect unauthorized access to AI system data
Vendor management: Assess LLM provider’s security posture

Practical compliance techniques

PII detection and redaction

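from presidio_analyzer import AnalyzerEngine

# Build the engine once: it loads an NLP model and is expensive to construct.
analyzer = AnalyzerEngine()

def redact_before_llm(text):
    """Replace each detected PII span with its entity type, e.g. [PERSON]."""
    results = analyzer.analyze(text=text, language="en")

    redacted = text
    # Replace from the end of the string so earlier offsets stay valid.
    # Overlapping detections are not merged here; presidio_anonymizer handles that case.
    for result in sorted(results, key=lambda r: r.start, reverse=True):
        redacted = redacted[:result.start] + f"[{result.entity_type}]" + redacted[result.end:]

    return redacted

# Intended effect: "John Smith's SSN is 123-45-6789" → "[PERSON]'s SSN is [US_SSN]"
# (exact detections depend on the recognizers and NLP model in use)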
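
Plain redaction discards values that the model’s answer may need to refer back to. A variation is reversible pseudonymization: swap each detected value for a numbered placeholder, keep the mapping on your side, and restore the values in the response. The sketch below is illustrative only. It uses two simple regular expressions as a stand-in detector (an assumption for brevity; a proper PII detector would supply the spans in practice), and the function names and placeholder format are hypothetical.

```python
import re

# Stand-in detectors for illustration; real systems need a proper PII detector.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def pseudonymize(text: str):
    """Replace matches with placeholders like <EMAIL_1>; return text and mapping."""
    mapping = {}  # placeholder -> original value (never leaves your system)
    reverse = {}  # original value -> placeholder, so repeats reuse one placeholder
    counts = {}

    for label, pattern in PATTERNS.items():
        def substitute(match, label=label):
            value = match.group(0)
            if value not in reverse:
                counts[label] = counts.get(label, 0) + 1
                placeholder = f"<{label}_{counts[label]}>"
                reverse[value] = placeholder
                mapping[placeholder] = value
            return reverse[value]

        text = pattern.sub(substitute, text)
    return text, mapping

def restore(text: str, mapping: dict) -> str:
    """Put the original values back into a model response."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text

original = "Contact ana@example.com, SSN 123-45-6789, or ana@example.com."
safe, mapping = pseudonymize(original)
```

Only `safe` is sent to the model; `restore` runs on the response inside your own infrastructure. The mapping is itself sensitive and should be held no longer than the request needs it.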

Data processing agreements with LLM providers

Provider	Zero-data-retention option	BAA available	EU data residency
OpenAI	Yes (API, not ChatGPT)	Yes (Enterprise)	Yes (Azure)
Anthropic	Yes (API default)	Yes (via AWS)	Via AWS regions
Google	Yes (Vertex AI)	Yes (Vertex)	Yes
Azure OpenAI	Yes	Yes	Yes

Self-hosted models for sensitive data

When data cannot leave your infrastructure:

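Deploy open-weight models (Llama, Mistral) on your own GPU infrastructure
If a managed service is acceptable, use private-network deployments (Azure OpenAI with private endpoints, Amazon Bedrock through VPC endpoints); these keep traffic off the public internet, but the data still leaves hardware you control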
Accept the quality/capability tradeoff vs frontier models

Retrieval access controls

In multi-tenant RAG systems, ensure users only retrieve their own data:

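def search_with_permissions(query, user):
    # `vector_db` and `embed` stand in for your vector store client and embedding function.
    # Apply the filter inside the search call; never post-filter results after retrieval.
    results = vector_db.search(
        embedding=embed(query),
        filter={
            "tenant_id": user.tenant_id,
            "access_level": {"$lte": user.access_level},
        },
    )
    return results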
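
To make the filtering behavior concrete, here is a self-contained sketch using a toy in-memory store. `Doc`, `User`, and `search` are hypothetical names, and the two-dimensional vectors stand in for real embeddings. The point is only that the permission filter runs before similarity ranking, so another tenant’s documents never enter the candidate set.

```python
import math
from dataclasses import dataclass

@dataclass
class Doc:
    text: str
    vector: tuple
    tenant_id: str
    access_level: int

@dataclass
class User:
    tenant_id: str
    access_level: int

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def search(docs, query_vector, user, top_k=3):
    # Filter first: only documents this user may see are ever scored.
    visible = [
        doc for doc in docs
        if doc.tenant_id == user.tenant_id and doc.access_level <= user.access_level
    ]
    ranked = sorted(visible, key=lambda doc: cosine(doc.vector, query_vector), reverse=True)
    return ranked[:top_k]

docs = [
    Doc("Acme pricing sheet", (1.0, 0.0), "acme", 1),
    Doc("Acme board minutes", (0.9, 0.1), "acme", 3),
    Doc("Globex pricing sheet", (1.0, 0.0), "globex", 1),
]

# A level-1 Acme user sees neither the Globex document nor the level-3 minutes.
hits = search(docs, (1.0, 0.0), User(tenant_id="acme", access_level=1))
```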

Where compliance gets hard

Training data leakage: If your fine-tuned model was trained on User A’s data, can User B’s queries cause the model to generate User A’s information? This is called memorization, and it is a real risk with small fine-tuning datasets.
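
One rough way to look for memorization is to probe the fine-tuned model and check its outputs for long verbatim overlaps with training records. The sketch below implements only the overlap check, using word n-grams. The function names and the 8-word threshold are assumptions, and `model_output` stands in for text you would obtain from your own model.

```python
def word_ngrams(text: str, n: int) -> set:
    """All runs of n consecutive lowercase words in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def leaked_records(model_output: str, training_records: dict, n: int = 8) -> list:
    """Return IDs of training records sharing any n-word sequence with the output."""
    output_grams = word_ngrams(model_output, n)
    return [
        record_id
        for record_id, text in training_records.items()
        if output_grams & word_ngrams(text, n)
    ]

training_records = {
    "user_a/note_17": "Patient reports recurring migraines since March and takes sumatriptan 50 mg as needed",
    "user_b/note_03": "Customer asked about upgrading the plan before the renewal date next quarter",
}
model_output = "As noted, patient reports recurring migraines since March and takes sumatriptan daily."

flagged = leaked_records(model_output, training_records)
```

Exact n-gram matching misses paraphrased leaks, so a clean result here is weak evidence at best; it is a cheap first check rather than a guarantee.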

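Right to deletion in embeddings: A user requests deletion. You delete their source documents, but the vectors computed from those documents are still in your vector database, and embeddings can reveal a good deal about the text they were derived from. Each vector belongs to a single chunk, so deletion is tractable if every vector is tagged with its owner’s user ID: delete those vectors, then confirm that the index physically removes them (some indexes only mark entries as deleted until the next compaction or rebuild). The rest of the corpus does not need re-embedding. Retraining is only required when the user’s data was used to train or fine-tune a model, because that influence cannot be “un-computed.”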

Cross-border complexity: User in Germany → your server in US → LLM API in US → vector DB in EU. Data crosses borders multiple times in one request.
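
A request path like this can be checked mechanically once each hop is annotated with a region. In the sketch below, the `Hop` type, the region labels, and the `transfer_mechanism` field are all hypothetical. Whether a given transfer is lawful is a legal determination; the code only counts border crossings and flags hops outside the home region that have no documented mechanism.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Hop:
    service: str
    region: str
    transfer_mechanism: Optional[str] = None  # e.g. "SCC"; None if undocumented

def border_crossings(path) -> int:
    """Count how many times consecutive hops sit in different regions."""
    return sum(a.region != b.region for a, b in zip(path, path[1:]))

def hops_needing_review(path, home_region="EU") -> list:
    """Services outside the home region with no documented transfer mechanism."""
    return [
        hop.service for hop in path
        if hop.region != home_region and hop.transfer_mechanism is None
    ]

path = [
    Hop("user browser", "EU"),
    Hop("app server", "US", transfer_mechanism="SCC"),
    Hop("LLM API", "US"),
    Hop("vector DB", "EU"),
]

crossings = border_crossings(path)
to_review = hops_needing_review(path)
```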

Observability vs privacy: Full prompt logging (essential for debugging) conflicts with data minimization (only store what is necessary). Solution: redacted logs with an option to temporarily enable full logging for specific investigations.
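
The redacted-logs approach can be sketched with the standard `logging` module. This is a minimal illustration under stated assumptions: the two regular expressions are stand-in detectors, and `RedactingFilter`, `ListHandler`, and the `full_logging` switch are hypothetical names. A production setup would also record who enabled full logging and when.

```python
import logging
import re

SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

class RedactingFilter(logging.Filter):
    """Redacts PII from log messages unless full logging is explicitly enabled."""

    def __init__(self):
        super().__init__()
        self.full_logging = False  # flip to True only for a scoped investigation

    def filter(self, record: logging.LogRecord) -> bool:
        if not self.full_logging:
            message = record.getMessage()  # applies %-style args first
            message = SSN.sub("[US_SSN]", message)
            message = EMAIL.sub("[EMAIL]", message)
            record.msg, record.args = message, None
        return True  # never drop the record, only rewrite it

class ListHandler(logging.Handler):
    """Collects formatted messages in memory so the effect is easy to inspect."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def emit(self, record):
        self.lines.append(record.getMessage())

logger = logging.getLogger("llm.prompts")
logger.setLevel(logging.INFO)
logger.propagate = False

handler = ListHandler()
redactor = RedactingFilter()
handler.addFilter(redactor)
logger.addHandler(handler)

logger.info("prompt=%s", "Email ana@example.com about SSN 123-45-6789")
```

Attaching the filter to the handler rather than to application code means every message routed through that handler is redacted, including ones written by code added later.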

How to apply in practice

Assess data sensitivity before choosing architecture. If data is highly sensitive (medical, financial, legal), start with self-hosted models or providers with BAAs and zero-retention agreements. Do not retrofit compliance - design for it.

Implement PII detection at the input layer. Redact PII before it enters any system - model, vector database, or logs. This is the simplest defense with the broadest effect.

Document your data flows. Map where user data goes: which services process it, where it is stored, and who can access it. GDPR requires records of processing activities (Article 30), and the same map is useful for security audits.

Use data processing agreements (DPAs) with all LLM providers. Even if using “zero retention” APIs, a formal DPA defines responsibilities if something goes wrong.

Design for deletion from day one. Every piece of user data should be tagged with a user ID. Deletion should be a single operation that cascades through all storage layers: conversations, embeddings, memories, logs.
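
A cascading delete can be sketched as a registry of stores that all expose the same deletion method. The `InMemoryStore` class, the store names, and `delete_user_everywhere` are hypothetical; real stores would be your database, vector index, memory service, and log pipeline, each wrapped to offer an equivalent `delete_user` call.

```python
class InMemoryStore:
    """Toy store: every item is tagged with the user_id that owns it."""

    def __init__(self, name):
        self.name = name
        self.items = []  # list of (user_id, payload)

    def add(self, user_id, payload):
        self.items.append((user_id, payload))

    def delete_user(self, user_id) -> int:
        """Remove all items owned by user_id and return how many were removed."""
        before = len(self.items)
        self.items = [(uid, payload) for uid, payload in self.items if uid != user_id]
        return before - len(self.items)

def delete_user_everywhere(user_id, stores) -> dict:
    """Run deletion in every registered store and report the count per store."""
    return {store.name: store.delete_user(user_id) for store in stores}

stores = [InMemoryStore(name) for name in ("conversations", "embeddings", "memories", "logs")]
conversations, embeddings, memories, logs = stores

conversations.add("u1", "chat #1")
embeddings.add("u1", [0.1, 0.2])
embeddings.add("u2", [0.3, 0.4])
logs.add("u1", "prompt trace")

report = delete_user_everywhere("u1", stores)
```

Returning a per-store count gives you a record to show that the request was carried out in every layer, and a zero where you expected data is a hint that something was stored without a user ID.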

FAQ

Q: If an LLM provider says “zero data retention,” is that sufficient for compliance?

It is necessary but not sufficient. Zero retention means they do not store your prompts/responses after the request completes - but during processing, data still exists in their memory. You still need: a DPA, appropriate security attestations (such as a SOC 2 report), clarity on data handling during processing, and assurance that zero-retention applies to all sub-processors. For healthcare you likely also need the provider to sign a BAA, and other regulated industries such as finance have equivalent contractual requirements.

Q: Can I use a user’s data to improve my AI system (fine-tuning, RAG) without separate consent?

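Under GDPR, generally not. Using data for model improvement is a different purpose from delivering the original service, so it needs its own lawful basis; in practice that usually means separate, explicit consent for training. Retrieval over a user’s own data to serve that same user is normally part of the original service, but exposing it to other users is not. Even under less strict frameworks, using one user’s data in ways that might influence other users’ experiences requires careful consent management. Best practice: ask for separate consent for “use my data to improve the service” and honor opt-outs by excluding that data from all training and shared retrieval pipelines.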
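
Honoring opt-outs is easier when consent is checked at the point where data enters a pipeline. The sketch below is illustrative: the `CONSENTS` registry, the purpose labels, and `eligible_for` are hypothetical names, and a real registry would live in a database with timestamps and consent versions.

```python
from dataclasses import dataclass

@dataclass
class Interaction:
    user_id: str
    text: str

# Hypothetical consent registry: user_id -> purposes the user has agreed to.
CONSENTS = {
    "u1": {"service", "improvement"},
    "u2": {"service"},
}

def eligible_for(purpose: str, interactions) -> list:
    """Keep only interactions whose owner has consented to this purpose."""
    return [
        item for item in interactions
        if purpose in CONSENTS.get(item.user_id, set())  # no record -> excluded
    ]

interactions = [
    Interaction("u1", "How do I export my data?"),
    Interaction("u2", "Please reset my password."),
    Interaction("u3", "Where is my invoice?"),
]

# Only u1 opted in to service improvement; u3 has no consent record at all.
training_set = eligible_for("improvement", interactions)
```

Treating a missing record as "no consent" keeps the default safe when the registry and the interaction store drift out of sync.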

Interview questions

Q: Design a HIPAA-compliant AI medical coding assistant that helps doctors code patient diagnoses. Patient records contain PHI that the AI needs to process.

Architecture:

(1) Deployment: Run the model on infrastructure covered by a BAA (Azure OpenAI with a BAA, or a self-hosted model in a HIPAA-eligible cloud environment). HIPAA has no official certification, so the signed BAA and your own safeguards are what count. No third-party API calls without a BAA.
(2) Data flow: patient record → PHI detection layer (identify, but do not remove, the PHI needed for coding) → encrypted transmission to model → response → audit log (encrypted, access-controlled).
(3) Access controls: Only authenticated doctors with a treatment relationship to the patient can process that patient’s records. Every access is logged.
(4) De-identification option: For training or improvement, de-identify records using the HIPAA Safe Harbor method before any use.
(5) Retention: Keep prompts and responses only as long as necessary (for example, 30 days, then auto-delete unless the clinical record requires longer). Set audit-log retention separately, since audit records are commonly kept far longer under HIPAA documentation-retention policies.
(6) Vector database: If using RAG over medical knowledge (not patient data), keep the knowledge base entirely separate from patient data. Never embed patient records in a shared vector space.