Building Modern ETL Pipelines with Dagster - An Asset-Centric Approach

Healthcare data processing presents unique challenges: multiple data sources, stringent compliance requirements, and the critical nature of accuracy. We recently tackled these challenges by building a modern ETL pipeline using Dagster’s asset-centric approach, moving beyond traditional task-based orchestration to create a more maintainable and observable data platform.

The Challenge: Healthcare Data at Scale

Healthcare organizations typically deal with data from numerous sources - electronic health records, billing systems, laboratory results, and administrative databases. Each source has its own format, update frequency, and quality characteristics. Traditional task-based ETL approaches struggle with this heterogeneity: dependencies between jobs are tracked by hand, lineage is opaque, and a green run tells you nothing about the quality of the data it produced.

Enter Dagster: Asset-Centric Orchestration

Unlike traditional workflow orchestrators that focus on tasks, Dagster treats data assets as first-class citizens. Instead of defining “extract_billing_data” and “transform_billing_data” tasks, we define a “billing_data” asset that knows how to materialize itself.

from dagster import MetadataValue, Output, asset
from pandas import DataFrame

@asset
def billing_data(context, firebird_database: DatabaseResource) -> Output[DataFrame]:
    """
    Billing data as a software-defined asset.
    Dagster tracks not just execution, but the actual data produced.
    """
    raw_data = firebird_database.fetch_billing_records()

    # Apply transformations; apply_billing_rules returns the validated
    # frame along with a report of any rejected records
    cleaned_data = clean_billing_records(raw_data)
    validated_data, validation_report = apply_billing_rules(cleaned_data)

    # Return with rich metadata so every materialization is self-describing
    return Output(
        validated_data,
        metadata={
            "record_count": len(validated_data),
            "validation_errors": validation_report,
            "data_preview": MetadataValue.md(validated_data.head().to_markdown()),
        },
    )

Architecture: From Sources to Analytics

Our pipeline architecture leverages Dagster’s powerful abstractions to create a clear, maintainable data flow:

graph TB
    subgraph "Data Sources"
        FB[Firebird DB]
        XML[XML Files]
        XLS[Excel Reports]
    end

    subgraph "Dagster Assets"
        BD[billing_data]
        PD[patient_data]
        PAY[payment_data]
    end

    subgraph "Transformations"
        CLEAN[Data Cleansing]
        VAL[Validation]
        AGG[Aggregations]
    end

    subgraph "Outputs"
        DUCK[DuckDB]
        REPORT[Reports]
        ANALYTICS[Analytics]
    end

    FB --> BD
    XML --> PD
    XLS --> PAY

    BD --> CLEAN
    PD --> CLEAN
    PAY --> CLEAN

    CLEAN --> VAL
    VAL --> AGG

    AGG --> DUCK
    AGG --> REPORT
    DUCK --> ANALYTICS

Key Architectural Benefits

  1. Built-in Data Catalog: Every asset automatically appears in Dagster’s UI with its lineage, last update time, and metadata. No separate catalog tool needed.

  2. Intelligent Re-execution: When upstream data changes, Dagster knows exactly which downstream assets need updating. No manual dependency tracking required.

  3. Resource Abstraction: Database connections, API clients, and other resources are configured once and injected where needed:

from dagster import resource

@resource(config_schema={"connection_string": str})
def database_resource(context):
    # Swap implementations based on config: "mock" for local development
    # and testing, a real connection string for production
    if context.resource_config["connection_string"] == "mock":
        return MockDatabaseResource()
    return ProductionDatabaseResource(
        context.resource_config["connection_string"]
    )
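
Binding resources to assets happens in one place. Here is a minimal sketch of that wiring; the connection string and asset list are placeholders rather than our real configuration:

from dagster import Definitions

defs = Definitions(
    assets=[billing_data],  # plus the downstream assets shown later in this post
    resources={
        # Assets request this resource by parameter name
        "firebird_database": database_resource.configured(
            {"connection_string": "firebird://user:pass@host/db"}  # placeholder
        ),
    },
)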

Real-World Implementation Patterns

Pattern 1: Multi-Source Data Integration

Healthcare data rarely comes from a single source. We implemented a pattern for combining data from Firebird databases, XML feeds, and Excel uploads:

from dagster import AssetIn, asset
from pandas import DataFrame

@asset(ins={"billing": AssetIn("billing_data"), "patients": AssetIn("patient_data")})
def enriched_billing(billing: DataFrame, patients: DataFrame) -> DataFrame:
    """
    Combine billing with patient demographics.
    Dagster loads the latest materialization of both upstream
    assets and records the lineage automatically.
    """
    return billing.merge(patients, on="patient_id", how="left")

Pattern 2: Incremental Processing with Partitions

Processing years of historical data efficiently requires partitioning:

from dagster import DailyPartitionsDefinition, asset

@asset(partitions_def=DailyPartitionsDefinition(start_date="2023-01-01"))
def daily_billing_summary(context) -> DataFrame:
    # Each run materializes exactly one day's partition
    partition_date = context.partition_key
    return process_billing_for_date(partition_date)
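
Partitions also give you backfills and scheduling essentially for free. Here is a sketch of a daily schedule built from the partitioned asset; the job name is illustrative:

from dagster import AssetSelection, build_schedule_from_partitioned_job, define_asset_job

daily_billing_job = define_asset_job(
    "daily_billing_job",
    selection=AssetSelection.assets(daily_billing_summary),
)

# Runs once per partition, shortly after each day's partition window closes
daily_billing_schedule = build_schedule_from_partitioned_job(daily_billing_job)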

Pattern 3: Data Quality Gates

Using Dagster’s asset checks, we enforce quality standards:

from dagster import AssetCheckResult, asset_check

@asset_check(asset=billing_data)
def billing_data_quality(billing_data):
    # Check for required fields
    no_missing_ids = not billing_data["patient_id"].isna().any()

    # Validate business rules
    amounts_non_negative = (billing_data["amount"] >= 0).all()

    # Report a failed check rather than raising, so the result
    # is recorded against the asset in the UI
    return AssetCheckResult(
        passed=bool(no_missing_ids and amounts_non_negative),
        metadata={"rows_validated": len(billing_data)},
    )
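
Not every rule should halt the pipeline. Checks can also downgrade failures to warnings; here is a sketch with an illustrative 1% threshold:

from dagster import AssetCheckResult, AssetCheckSeverity, asset_check

@asset_check(asset=billing_data)
def billing_null_rate(billing_data):
    # Warn, rather than error, when more than 1% of amounts are missing
    null_rate = billing_data["amount"].isna().mean()
    return AssetCheckResult(
        passed=bool(null_rate <= 0.01),
        severity=AssetCheckSeverity.WARN,
        metadata={"null_rate": float(null_rate)},
    )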

Observability: Beyond “Did It Run?”

Traditional orchestrators tell you whether a job succeeded or failed. Dagster provides deep observability into the actual data: record counts, validation results, data previews, and full lineage are captured with every materialization.

The Dagster UI surfaces all of this in a single pane of glass, dramatically reducing debugging time when issues arise.
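
Much of this comes from the Output metadata shown earlier, but assets can also attach metadata imperatively mid-run. Here is a small sketch; load_payment_records is a hypothetical loader standing in for our real extraction code:

from dagster import MetadataValue, asset
from pandas import DataFrame

@asset
def payment_data(context) -> DataFrame:
    records = load_payment_records()  # hypothetical loader
    # Attach metadata to this materialization; it appears in the asset's
    # history in the Dagster UI alongside lineage and timing
    context.add_output_metadata({
        "record_count": len(records),
        "preview": MetadataValue.md(records.head().to_markdown()),
    })
    return records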

Performance Considerations

Moving to an asset-centric model with DuckDB as our analytical database also delivered noticeable performance improvements: DuckDB's columnar, in-process engine is well suited to the aggregation-heavy queries our reporting layer runs, and partitioned assets mean we only recompute the slices of history that actually changed.
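
As a flavor of the analytics layer, here is an illustrative query against the DuckDB store; the file and table names are placeholders, not our actual schema:

import duckdb

con = duckdb.connect("analytics.duckdb")

# Monthly billing totals straight off the materialized summary table
monthly = con.execute("""
    SELECT date_trunc('month', billing_date) AS month,
           sum(amount) AS total_billed
    FROM daily_billing_summary
    GROUP BY 1
    ORDER BY 1
""").fetchdf()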

Lessons Learned

After running this pipeline in production for several months, we have distilled a few key takeaways:

  1. Start with Clear Asset Boundaries: Define assets based on business concepts, not technical implementation details

  2. Invest in Resource Abstraction: Clean separation between business logic and infrastructure makes testing and deployment much easier (see the testing sketch after this list)

  3. Embrace Metadata: Rich metadata on assets pays dividends in debugging and compliance auditing

  4. Plan for Evolution: The asset graph will grow - design with modularity in mind
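
To make point 2 concrete, here is a minimal sketch of a test that materializes billing_data against the mock resource from earlier; the assertions are illustrative:

from dagster import materialize

def test_billing_data_with_mock():
    result = materialize(
        [billing_data],
        resources={
            "firebird_database": database_resource.configured(
                {"connection_string": "mock"}
            ),
        },
    )
    assert result.success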

Conclusion

Adopting Dagster’s asset-centric approach transformed our healthcare ETL pipeline from a complex web of scripts into a maintainable, observable data platform. The combination of software-defined assets, built-in lineage tracking, and powerful orchestration capabilities has reduced our development time by 40% while improving data quality and reliability.

For organizations dealing with complex data orchestration challenges - especially in regulated industries like healthcare - the asset-centric paradigm offers a compelling alternative to traditional task-based ETL.

Further Reading