Privacy-First Development - Using Synthetic Data in Healthcare ETL Testing
Testing healthcare ETL pipelines presents a fundamental dilemma: how do you ensure your data transformations work correctly without exposing sensitive patient information? With GDPR fines reaching €20 million and patient trust at stake, we developed a synthetic data generation approach that enables comprehensive testing while maintaining complete isolation from real patient data.
The Privacy Challenge in Healthcare ETL
Healthcare data is among the most sensitive information organizations handle. A typical patient record contains:
- Personal identifiers (names, addresses, ID numbers)
- Medical history and diagnoses
- Treatment details and medications
- Insurance and billing information
- Genetic and biometric data
Under GDPR Article 9, health data is classified as “special category data” requiring the highest level of protection. Traditional approaches to test data - like copying production databases or using data masking - still pose significant risks:
graph TB
subgraph "Traditional Approach - Risk Remains"
PROD[Production Data]
MASK[Masked Data]
TEST[Test Environment]
LEAK[Data Breach]
PROD -->|Copy & Mask| MASK
MASK -->|Still Contains Patterns| TEST
TEST -->|Risk of Re-identification| LEAK
end
SPACER1[ ]
subgraph "Our Approach - Complete Isolation"
SCHEMA[Data Schema]
GEN[Synthetic Generator]
SYNTHETIC[Synthetic Data]
SAFE[Safe Testing]
SCHEMA -->|Extract Structure Only| GEN
GEN -->|Generate Fake Data| SYNTHETIC
SYNTHETIC -->|Zero Real Data| SAFE
end
TEST -.->|VS| SPACER1
SPACER1 -.->|VS| SYNTHETIC
style SPACER1 fill:none,stroke:none
Building a Schema-Driven Synthetic Data Generator
Our solution: extract only the structure and statistical properties of real data, then generate completely artificial test data that maintains the same characteristics.
Step 1: Schema Extraction
First, we analyze the production data structure without touching actual patient information:
def extract_schema(table_name: str) -> dict:
    """
    Extract only structural information from production tables.
    No actual data values are accessed.
    """
    schema = {
        "columns": [],
        "constraints": [],
        "statistics": {}
    }
    # Get column definitions
    column_info = database.get_table_structure(table_name)
    for column in column_info:
        schema["columns"].append({
            "name": column.name,
            "type": column.data_type,
            "nullable": column.is_nullable,
            "max_length": column.max_length
        })
    # Extract value distributions (not actual values)
    schema["statistics"] = {
        "row_count": database.get_row_count(table_name),
        "date_ranges": database.get_date_ranges(table_name),
        "numeric_ranges": database.get_numeric_ranges(table_name)
    }
    return schema
Step 2: Synthetic Data Generation
Using the Faker library and custom generators, we create realistic but completely artificial data:
from faker import Faker
import random
from datetime import datetime, timedelta
fake = Faker(['en_GB'])  # Use appropriate locale
def generate_patient_record(schema: dict) -> dict:
    """
    Generate a completely synthetic patient record
    following the extracted schema structure.
    """
    # Generate consistent patient ID
    patient_id = f"P{fake.random_number(digits=8)}"
    # Create realistic but fake personal details
    record = {
        "patient_id": patient_id,
        "first_name": fake.first_name(),
        "last_name": fake.last_name(),
        "date_of_birth": fake.date_of_birth(minimum_age=0, maximum_age=100),
        "address": fake.street_address(),
        "postal_code": fake.postcode(),
        "phone": fake.phone_number()
    }
    # Generate medical data following statistical patterns
    record["admission_date"] = fake.date_between(
        start_date="-2y",
        end_date="today"
    )
    record["diagnosis_code"] = random.choice([
        "J44.1", "I10", "E11.9", "M79.3", "K21.0"  # Common ICD-10 codes
    ])
    record["treatment_cost"] = round(
        random.lognormvariate(mu=7.5, sigma=1.2), 2  # Realistic cost distribution
    )
    return record
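One detail worth flagging: generate_patient_record accepts the extracted schema but, as written, never consults it. A minimal sketch of how the extracted statistics could feed generation, assuming numeric_ranges maps column names to (min, max) pairs (an assumed format, since the output of get_numeric_ranges isn't shown):
def bounded_value(schema: dict, column: str) -> float:
    """Draw a numeric value inside the range observed in production.
    Assumes schema["statistics"]["numeric_ranges"][column] is a (min, max) pair.
    """
    low, high = schema["statistics"]["numeric_ranges"][column]
    return round(random.uniform(low, high), 2)
record["treatment_cost"] could then be drawn with bounded_value(schema, "treatment_cost") so that synthetic costs stay inside the range observed in production.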
Step 3: Relationship Preservation
Healthcare data has complex relationships that must be maintained in synthetic data: every appointment, for example, must reference a patient that actually exists in the generated dataset.
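The generate_related_datasets function below calls a generate_appointment helper that the post does not define. A minimal sketch of what such a helper might look like, reusing the Faker setup from Step 2; the appointment fields, departments, and durations are illustrative assumptions rather than the original implementation:
def generate_appointment(patient_id: str, after_date) -> dict:
    """Generate a synthetic appointment linked to an existing synthetic patient."""
    return {
        "appointment_id": f"A{fake.random_number(digits=10)}",
        "patient_id": patient_id,  # foreign key into the synthetic patients dataset
        "appointment_date": fake.date_between(start_date=after_date, end_date="today"),
        "department": random.choice(["cardiology", "general_practice", "radiology"]),
        "duration_minutes": random.choice([15, 30, 45, 60])
    }
With a helper along these lines in place, the related datasets can be generated together while keeping every foreign key valid: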
def generate_related_datasets(schemas: dict) -> dict:
    """
    Generate multiple related synthetic datasets
    maintaining referential integrity.
    """
    datasets = {}
    # Generate primary entities first
    patients = [generate_patient_record(schemas["patients"])
                for _ in range(1000)]
    datasets["patients"] = patients
    # Generate related entities with valid foreign keys
    appointments = []
    for patient in patients:
        num_appointments = random.randint(1, 10)
        for _ in range(num_appointments):
            appointment = generate_appointment(
                patient_id=patient["patient_id"],
                after_date=patient["admission_date"]  # field produced by generate_patient_record
            )
            appointments.append(appointment)
    datasets["appointments"] = appointments
    return datasets
Compliance Benefits
This approach provides multiple compliance advantages:
1. Complete GDPR Compliance
According to GDPR Recital 26, the regulation does not apply to anonymous information:
“The principles of data protection should therefore not apply to anonymous information, namely information which does not relate to an identified or identifiable natural person”
Our synthetic data is generated from scratch - it never relates to any real person, eliminating GDPR concerns entirely.
2. Cross-Border Development
With synthetic data, development teams anywhere in the world can work on the ETL pipeline without triggering GDPR’s data transfer restrictions:
graph TB
subgraph "EU Data Center"
PROD[Production Data]
SCHEMA[Schema Only]
PROD -->|Extract Structure| SCHEMA
end
SCHEMA -->|Transfer Schema<br/>No Personal Data| GEN
subgraph "Development Team - Any Location"
GEN[Generator]
TEST[Test Data]
DEV[Development]
GEN -->|Generate Synthetic| TEST
TEST -->|Safe Testing| DEV
end
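To make the transfer step concrete, the only artifact that leaves the EU data centre is the schema itself. A minimal sketch of exporting it as JSON for a remote team; the file path is arbitrary and default=str is an assumption about how dates in the extracted statistics should be serialised:
import json
def export_schema(table_name: str, path: str = "schemas/patients_schema.json") -> None:
    """Serialise the extracted schema for sharing with remote developers.
    The file contains structure and aggregate statistics only, never patient values.
    """
    schema = extract_schema(table_name)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(schema, f, indent=2, default=str)  # default=str stringifies dates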
3. Audit Trail Simplification
Testing with synthetic data dramatically simplifies compliance audits:
def log_data_generation(config: dict) -> None:
    """
    Create audit log for synthetic data generation.
    Simple and transparent for compliance reviews.
    """
    audit_entry = {
        "timestamp": datetime.now().isoformat(),
        "action": "synthetic_data_generation",
        "real_data_accessed": False,
        "personal_data_processed": False,
        "purpose": "ETL pipeline testing",
        "data_volume": config["num_records"],
        "retention_period": "Until test completion"
    }
    audit_logger.log(audit_entry)
Practical Implementation Patterns
Pattern 1: Realistic Data Distributions
Healthcare data often follows specific distributions. We model these statistically:
def generate_realistic_ages(num_patients: int) -> list:
    """
    Generate age distribution matching typical healthcare populations.
    """
    # Bimodal distribution: pediatric and elderly peaks
    ages = []
    for _ in range(num_patients):
        if random.random() < 0.3:
            # Pediatric population
            age = max(0, int(random.normalvariate(8, 5)))
        else:
            # Adult population with elderly bias
            age = min(100, int(random.normalvariate(55, 20)))
        ages.append(age)
    return ages
Pattern 2: Temporal Consistency
Medical events must follow logical time sequences:
def generate_patient_timeline(patient_id: str) -> list:
    """
    Generate consistent timeline of medical events.
    """
    events = []
    # Start with registration
    registration_date = fake.date_between(start_date="-5y")
    events.append({
        "type": "registration",
        "date": registration_date,
        "patient_id": patient_id
    })
    # Generate subsequent events in logical order
    current_date = registration_date
    for _ in range(random.randint(3, 20)):
        days_later = random.randint(1, 90)
        current_date = current_date + timedelta(days=days_later)
        event_type = random.choice([
            "appointment", "lab_test", "prescription", "procedure"
        ])
        events.append({
            "type": event_type,
            "date": current_date,
            "patient_id": patient_id
        })
    return events
Pattern 3: Edge Case Generation
Synthetic data excels at creating edge cases that might be rare in production:
def generate_edge_cases() -> list:
    """
    Generate specific edge cases for thorough testing.
    """
    edge_cases = []
    # Maximum length names
    edge_cases.append({
        "patient_id": "EDGE001",
        "last_name": "A" * 255,  # Maximum varchar length
        "diagnosis": "Z99.9"  # Unusual diagnosis code
    })
    # Boundary dates
    edge_cases.append({
        "patient_id": "EDGE002",
        "admission_date": datetime(1900, 1, 1),  # Minimum date
        "discharge_date": datetime.now()  # Current date
    })
    # Null handling
    edge_cases.append({
        "patient_id": "EDGE003",
        "middle_name": None,
        "secondary_diagnosis": None,
        "insurance_provider": None
    })
    return edge_cases
Testing Strategy with Synthetic Data
Our testing pyramid leverages synthetic data at every level:
graph TB
subgraph "Testing Pyramid"
UNIT[Unit Tests<br/>Individual transformations]
INTEGRATION[Integration Tests<br/>Pipeline segments]
E2E[End-to-End Tests<br/>Full pipeline]
PERF[Performance Tests<br/>Scale validation]
UNIT -->|Small synthetic datasets| INTEGRATION
INTEGRATION -->|Medium synthetic datasets| E2E
E2E -->|Production-scale synthetic data| PERF
end
Unit Testing Example
def test_patient_age_calculation():
    """Test age calculation with synthetic data."""
    # Generate synthetic patient
    patient = {
        "patient_id": "TEST001",
        "date_of_birth": datetime(1990, 1, 1)
    }
    # Test transformation
    calculated_age = calculate_age(
        patient["date_of_birth"],
        reference_date=datetime(2025, 1, 1)
    )
    assert calculated_age == 35
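One level up the pyramid, an integration test pushes a small synthetic batch through a single pipeline segment. The sketch below assumes a hypothetical run_transformation_stage entry point and an age_at_admission output column; both are stand-ins for whatever the real pipeline exposes:
def test_admission_stage_with_synthetic_batch():
    """Integration test: run a small synthetic batch through one pipeline segment."""
    # Fifty fully synthetic patients; the schema argument is unused in the sketch above
    patients = [generate_patient_record(schema={}) for _ in range(50)]
    # run_transformation_stage is a hypothetical entry point for the segment under test
    transformed = run_transformation_stage("admissions", patients)
    # Every record should survive the stage with its derived columns populated
    assert len(transformed) == len(patients)
    assert all("age_at_admission" in row for row in transformed)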
Challenges and Solutions
Challenge 1: Statistical Similarity
Problem: Synthetic data might not capture all statistical properties of real data.
Solution: Implement validation to ensure synthetic data maintains key statistical properties:
from pandas import DataFrame
def validate_synthetic_statistics(real_stats: dict, synthetic_data: DataFrame) -> None:
    """Ensure synthetic data maintains statistical similarity."""
    synthetic_stats = calculate_statistics(synthetic_data)
    for metric in ["mean", "std", "min", "max"]:
        real_value = real_stats[metric]
        synthetic_value = synthetic_stats[metric]
        # Allow 10% deviation
        assert abs(real_value - synthetic_value) / real_value < 0.1
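The calculate_statistics helper used above is not shown in the post; a minimal sketch assuming a pandas DataFrame and a single numeric column of interest (the default column name is an assumption):
import pandas as pd
def calculate_statistics(data: pd.DataFrame, column: str = "treatment_cost") -> dict:
    """Summarise one numeric column so it can be compared against production statistics."""
    series = data[column]
    return {
        "mean": series.mean(),
        "std": series.std(),
        "min": series.min(),
        "max": series.max()
    }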
Challenge 2: Complex Business Rules
Problem: Healthcare has intricate business rules that must be reflected in test data.
Solution: Encode business rules directly in generators:
def apply_billing_rules(record: dict) -> dict:
    """Apply realistic billing rules to synthetic records."""
    # Insurance coverage rules
    if record["insurance_type"] == "PRIVATE":
        record["coverage_percentage"] = random.uniform(0.7, 0.9)
    elif record["insurance_type"] == "PUBLIC":
        record["coverage_percentage"] = 1.0
    else:
        record["coverage_percentage"] = 0.0  # Uninsured: patient pays the full cost
    # Co-pay calculations
    record["patient_copay"] = (
        record["total_cost"] * (1 - record["coverage_percentage"])
    )
    return record
Results and Impact
Implementing synthetic data generation for our healthcare ETL pipeline delivered significant benefits:
- Zero privacy incidents: Complete elimination of personal data exposure risk
- Faster development: Developers can work with test data immediately without access requests
- Better test coverage: Easy generation of edge cases and error conditions
- Simplified compliance: Straightforward audit trail with no personal data processing
- Cost reduction: No need for expensive data masking tools or secure test environments
Conclusion
Privacy-first development using synthetic data transforms ETL testing from a compliance burden into a competitive advantage. By completely separating test data from real patient information, we achieve comprehensive testing while maintaining absolute privacy protection.
For healthcare organizations facing the dual challenges of data privacy regulations and the need for thorough testing, synthetic data generation offers a path forward that satisfies both requirements without compromise.