Privacy-First Development - Using Synthetic Data in Healthcare ETL Testing
Testing healthcare ETL pipelines presents a fundamental dilemma: how do you ensure your data transformations work correctly without exposing sensitive patient information? With GDPR fines reaching €20 million and patient trust at stake, we developed a synthetic data generation approach that enables comprehensive testing while maintaining complete isolation from real patient data.
The Privacy Challenge in Healthcare ETL
Healthcare data is among the most sensitive information organizations handle. A typical patient record contains:
- Personal identifiers (names, addresses, ID numbers)
- Medical history and diagnoses
- Treatment details and medications
- Insurance and billing information
- Genetic and biometric data
Under GDPR Article 9, health data is classified as “special category data” requiring the highest level of protection. Traditional approaches to test data - like copying production databases or using data masking - still pose significant risks:
graph TB
subgraph "Traditional Approach - Risk Remains"
PROD[Production Data]
MASK[Masked Data]
TEST[Test Environment]
LEAK[Data Breach]
PROD -->|Copy & Mask| MASK
MASK -->|Still Contains Patterns| TEST
TEST -->|Risk of Re-identification| LEAK
end
SPACER1[ ]
subgraph "Our Approach - Complete Isolation"
SCHEMA[Data Schema]
GEN[Synthetic Generator]
SYNTHETIC[Synthetic Data]
SAFE[Safe Testing]
SCHEMA -->|Extract Structure Only| GEN
GEN -->|Generate Fake Data| SYNTHETIC
SYNTHETIC -->|Zero Real Data| SAFE
end
TEST -.->|VS| SPACER1
SPACER1 -.->|VS| SYNTHETIC
style SPACER1 fill:none,stroke:none
Building a Schema-Driven Synthetic Data Generator
Our solution: extract only the structure and statistical properties of real data, then generate completely artificial test data that maintains the same characteristics.
Step 1: Schema Extraction
First, we analyze the production data structure without touching actual patient information:
def extract_schema(table_name: str) -> dict:
    """
    Extract only structural information from production tables.
    No actual data values are accessed.
    """
    schema = {
        "columns": [],
        "constraints": [],
        "statistics": {}
    }
    # Get column definitions
    column_info = database.get_table_structure(table_name)
    for column in column_info:
        schema["columns"].append({
            "name": column.name,
            "type": column.data_type,
            "nullable": column.is_nullable,
            "max_length": column.max_length
        })
    # Extract value distributions (not actual values)
    schema["statistics"] = {
        "row_count": database.get_row_count(table_name),
        "date_ranges": database.get_date_ranges(table_name),
        "numeric_ranges": database.get_numeric_ranges(table_name)
    }
    return schema
Step 2: Synthetic Data Generation
Using the Faker library and custom generators, we create realistic but completely artificial data:
from faker import Faker
import random
from datetime import datetime, timedelta
fake = Faker(['en_GB'])  # Use appropriate locale
def generate_patient_record(schema: dict) -> dict:
    """
    Generate a completely synthetic patient record
    following the extracted schema structure.
    """
    # Generate consistent patient ID
    patient_id = f"P{fake.random_number(digits=8)}"
    # Create realistic but fake personal details
    record = {
        "patient_id": patient_id,
        "first_name": fake.first_name(),
        "last_name": fake.last_name(),
        "date_of_birth": fake.date_of_birth(minimum_age=0, maximum_age=100),
        "address": fake.street_address(),
        "postal_code": fake.postcode(),
        "phone": fake.phone_number()
    }
    # Generate medical data following statistical patterns
    record["admission_date"] = fake.date_between(
        start_date="-2y",
        end_date="today"
    )
    record["diagnosis_code"] = random.choice([
        "J44.1", "I10", "E11.9", "M79.3", "K21.0"  # Common ICD-10 codes
    ])
    record["treatment_cost"] = round(
        random.lognormvariate(mu=7.5, sigma=1.2), 2  # Realistic cost distribution
    )
    return record
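One detail worth flagging: generate_patient_record accepts the extracted schema but, as written, never consults it. A minimal sketch of how the extracted statistics could feed generation, assuming numeric_ranges maps column names to (min, max) pairs (an assumed format, since the output of get_numeric_ranges isn't shown):
def bounded_value(schema: dict, column: str) -> float:
    """Draw a numeric value inside the range observed in production.
    Assumes schema["statistics"]["numeric_ranges"][column] is a (min, max) pair.
    """
    low, high = schema["statistics"]["numeric_ranges"][column]
    return round(random.uniform(low, high), 2)
record["treatment_cost"] could then be drawn with bounded_value(schema, "treatment_cost") so that synthetic costs stay inside the range observed in production.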
Step 3: Relationship Preservation
Healthcare data has complex relationships that must be maintained in synthetic data: every appointment, for example, must reference a patient that actually exists in the generated dataset.
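The generate_related_datasets function below calls a generate_appointment helper that the post does not define. A minimal sketch of what such a helper might look like, reusing the Faker setup from Step 2; the appointment fields, departments, and durations are illustrative assumptions rather than the original implementation:
def generate_appointment(patient_id: str, after_date) -> dict:
    """Generate a synthetic appointment linked to an existing synthetic patient."""
    return {
        "appointment_id": f"A{fake.random_number(digits=10)}",
        "patient_id": patient_id,  # foreign key into the synthetic patients dataset
        "appointment_date": fake.date_between(start_date=after_date, end_date="today"),
        "department": random.choice(["cardiology", "general_practice", "radiology"]),
        "duration_minutes": random.choice([15, 30, 45, 60])
    }
With a helper along these lines in place, the related datasets can be generated together while keeping every foreign key valid: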
def generate_related_datasets(schemas: dict) -> dict:
    """
    Generate multiple related synthetic datasets
    maintaining referential integrity.
    """
    datasets = {}
    # Generate primary entities first
    patients = [generate_patient_record(schemas["patients"])
                for _ in range(1000)]
    datasets["patients"] = patients
    # Generate related entities with valid foreign keys
    appointments = []
    for patient in patients:
        num_appointments = random.randint(1, 10)
        for _ in range(num_appointments):
            appointment = generate_appointment(
                patient_id=patient["patient_id"],
                after_date=patient["admission_date"]  # field produced by generate_patient_record
            )
            appointments.append(appointment)
    datasets["appointments"] = appointments
    return datasets
Compliance Benefits
This approach provides multiple compliance advantages:
1. Complete GDPR Compliance
According to GDPR Recital 26, the regulation does not apply to anonymous information:
“The principles of data protection should therefore not apply to anonymous information, namely information which does not relate to an identified or identifiable natural person”
Our synthetic data is generated from scratch - it never relates to any real person, eliminating GDPR concerns entirely.
2. Cross-Border Development
With synthetic data, development teams anywhere in the world can work on the ETL pipeline without triggering GDPR’s data transfer restrictions:
graph TB
subgraph "EU Data Center"
PROD[Production Data]
SCHEMA[Schema Only]
PROD -->|Extract Structure| SCHEMA
end
SCHEMA -->|Transfer Schema<br/>No Personal Data| GEN
subgraph "Development Team - Any Location"
GEN[Generator]
TEST[Test Data]
DEV[Development]
GEN -->|Generate Synthetic| TEST
TEST -->|Safe Testing| DEV
end
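To make the transfer step concrete, the only artifact that leaves the EU data centre is the schema itself. A minimal sketch of exporting it as JSON for a remote team; the file path is arbitrary and default=str is an assumption about how dates in the extracted statistics should be serialised:
import json
def export_schema(table_name: str, path: str = "schemas/patients_schema.json") -> None:
    """Serialise the extracted schema for sharing with remote developers.
    The file contains structure and aggregate statistics only, never patient values.
    """
    schema = extract_schema(table_name)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(schema, f, indent=2, default=str)  # default=str stringifies dates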
3. Audit Trail Simplification
Testing with synthetic data dramatically simplifies compliance audits:
def log_data_generation(config: dict) -> None:
    """
    Create audit log for synthetic data generation.
    Simple and transparent for compliance reviews.
    """
    audit_entry = {
        "timestamp": datetime.now().isoformat(),
        "action": "synthetic_data_generation",
        "real_data_accessed": False,
        "personal_data_processed": False,
        "purpose": "ETL pipeline testing",
        "data_volume": config["num_records"],
        "retention_period": "Until test completion"
    }
    audit_logger.log(audit_entry)
Practical Implementation Patterns
Pattern 1: Realistic Data Distributions
Healthcare data often follows specific distributions. We model these statistically:
def generate_realistic_ages(num_patients: int) -> list:
    """
    Generate age distribution matching typical healthcare populations.
    """
    # Bimodal distribution: pediatric and elderly peaks
    ages = []
    for _ in range(num_patients):
        if random.random() < 0.3:
            # Pediatric population
            age = max(0, int(random.normalvariate(8, 5)))
        else:
            # Adult population with elderly bias
            age = min(100, int(random.normalvariate(55, 20)))
        ages.append(age)
    return ages
Pattern 2: Temporal Consistency
Medical events must follow logical time sequences:
def generate_patient_timeline(patient_id: str) -> list:
    """
    Generate consistent timeline of medical events.
    """
    events = []
    # Start with registration
    registration_date = fake.date_between(start_date="-5y")
    events.append({
        "type": "registration",
        "date": registration_date,
        "patient_id": patient_id
    })
    # Generate subsequent events in logical order
    current_date = registration_date
    for _ in range(random.randint(3, 20)):
        days_later = random.randint(1, 90)
        current_date = current_date + timedelta(days=days_later)
        event_type = random.choice([
            "appointment", "lab_test", "prescription", "procedure"
        ])
        events.append({
            "type": event_type,
            "date": current_date,
            "patient_id": patient_id
        })
    return events
Pattern 3: Edge Case Generation
Synthetic data excels at creating edge cases that might be rare in production:
def generate_edge_cases() -> list:
    """
    Generate specific edge cases for thorough testing.
    """
    edge_cases = []
    # Maximum length names
    edge_cases.append({
        "patient_id": "EDGE001",
        "last_name": "A" * 255,  # Maximum varchar length
        "diagnosis": "Z99.9"  # Unusual diagnosis code
    })
    # Boundary dates
    edge_cases.append({
        "patient_id": "EDGE002",
        "admission_date": datetime(1900, 1, 1),  # Minimum date
        "discharge_date": datetime.now()  # Current date
    })
    # Null handling
    edge_cases.append({
        "patient_id": "EDGE003",
        "middle_name": None,
        "secondary_diagnosis": None,
        "insurance_provider": None
    })
    return edge_cases
Testing Strategy with Synthetic Data
Our testing pyramid leverages synthetic data at every level:
graph TB
subgraph "Testing Pyramid"
UNIT[Unit Tests<br/>Individual transformations]
INTEGRATION[Integration Tests<br/>Pipeline segments]
E2E[End-to-End Tests<br/>Full pipeline]
PERF[Performance Tests<br/>Scale validation]
UNIT -->|Small synthetic datasets| INTEGRATION
INTEGRATION -->|Medium synthetic datasets| E2E
E2E -->|Production-scale synthetic data| PERF
end
Unit Testing Example
def test_patient_age_calculation():
    """Test age calculation with synthetic data."""
    # Generate synthetic patient
    patient = {
        "patient_id": "TEST001",
        "date_of_birth": datetime(1990, 1, 1)
    }
    # Test transformation
    calculated_age = calculate_age(
        patient["date_of_birth"],
        reference_date=datetime(2025, 1, 1)
    )
    assert calculated_age == 35
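One level up the pyramid, an integration test pushes a small synthetic batch through a single pipeline segment. The sketch below assumes a hypothetical run_transformation_stage entry point and an age_at_admission output column; both are stand-ins for whatever the real pipeline exposes:
def test_admission_stage_with_synthetic_batch():
    """Integration test: run a small synthetic batch through one pipeline segment."""
    # Fifty fully synthetic patients; the schema argument is unused in the sketch above
    patients = [generate_patient_record(schema={}) for _ in range(50)]
    # run_transformation_stage is a hypothetical entry point for the segment under test
    transformed = run_transformation_stage("admissions", patients)
    # Every record should survive the stage with its derived columns populated
    assert len(transformed) == len(patients)
    assert all("age_at_admission" in row for row in transformed)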
Challenges and Solutions
Challenge 1: Statistical Similarity
Problem: Synthetic data might not capture all statistical properties of real data.
Solution: Implement validation to ensure synthetic data maintains key statistical properties:
from pandas import DataFrame
def validate_synthetic_statistics(real_stats: dict, synthetic_data: DataFrame) -> None:
    """Ensure synthetic data maintains statistical similarity."""
    synthetic_stats = calculate_statistics(synthetic_data)
    for metric in ["mean", "std", "min", "max"]:
        real_value = real_stats[metric]
        synthetic_value = synthetic_stats[metric]
        # Allow 10% deviation
        assert abs(real_value - synthetic_value) / real_value < 0.1
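The calculate_statistics helper used above is not shown in the post; a minimal sketch assuming a pandas DataFrame and a single numeric column of interest (the default column name is an assumption):
import pandas as pd
def calculate_statistics(data: pd.DataFrame, column: str = "treatment_cost") -> dict:
    """Summarise one numeric column so it can be compared against production statistics."""
    series = data[column]
    return {
        "mean": series.mean(),
        "std": series.std(),
        "min": series.min(),
        "max": series.max()
    }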
Challenge 2: Complex Business Rules
Problem: Healthcare has intricate business rules that must be reflected in test data.
Solution: Encode business rules directly in generators:
def apply_billing_rules(record: dict) -> dict:
    """Apply realistic billing rules to synthetic records."""
    # Insurance coverage rules
    if record["insurance_type"] == "PRIVATE":
        record["coverage_percentage"] = random.uniform(0.7, 0.9)
    elif record["insurance_type"] == "PUBLIC":
        record["coverage_percentage"] = 1.0
    else:
        record["coverage_percentage"] = 0.0  # Uninsured: patient pays the full cost
    # Co-pay calculations
    record["patient_copay"] = (
        record["total_cost"] * (1 - record["coverage_percentage"])
    )
    return record
Results and Impact
Implementing synthetic data generation for our healthcare ETL pipeline delivered significant benefits:
- Zero privacy incidents: Complete elimination of personal data exposure risk
- Faster development: Developers can work with test data immediately without access requests
- Better test coverage: Easy generation of edge cases and error conditions
- Simplified compliance: Straightforward audit trail with no personal data processing
- Cost reduction: No need for expensive data masking tools or secure test environments
Conclusion
Privacy-first development using synthetic data transforms ETL testing from a compliance burden into a competitive advantage. By completely separating test data from real patient information, we achieve comprehensive testing while maintaining absolute privacy protection.
For healthcare organizations facing the dual challenges of data privacy regulations and the need for thorough testing, synthetic data generation offers a path forward that satisfies both requirements without compromise.