Building Modern ETL Pipelines with Dagster - An Asset-Centric Approach
Healthcare data processing presents unique challenges: multiple data sources, stringent compliance requirements, and the critical nature of accuracy. We recently tackled these challenges by building a modern ETL pipeline using Dagster’s asset-centric approach, moving beyond traditional task-based orchestration to create a more maintainable and observable data platform.
The Challenge: Healthcare Data at Scale
Healthcare organizations typically deal with data from numerous sources - electronic health records, billing systems, laboratory results, and administrative databases. Each source has its own format, update frequency, and quality characteristics. Traditional ETL approaches often struggle with:
- Data lineage complexity: Understanding how patient data flows through transformations
- Testing difficulties: Validating transformations without exposing sensitive data
- Orchestration overhead: Managing dependencies between hundreds of tasks
- Observability gaps: Knowing not just if a job ran, but what data was actually processed
Enter Dagster: Asset-Centric Orchestration
Unlike traditional workflow orchestrators that focus on tasks, Dagster treats data assets as first-class citizens. Instead of defining “extract_billing_data” and “transform_billing_data” tasks, we define a “billing_data” asset that knows how to materialize itself.
```python
from dagster import asset, Output, MetadataValue
from pandas import DataFrame

@asset
def billing_data(context, firebird_database: DatabaseResource) -> Output[DataFrame]:
    """
    Billing data as a software-defined asset.
    Dagster tracks not just execution, but the actual data produced.
    """
    raw_data = firebird_database.fetch_billing_records()

    # Apply transformations
    cleaned_data = clean_billing_records(raw_data)
    # apply_billing_rules returns the validated frame plus a validation report
    validated_data, validation_report = apply_billing_rules(cleaned_data)

    # Return with rich metadata
    return Output(
        validated_data,
        metadata={
            "record_count": len(validated_data),
            "validation_errors": validation_report,
            "data_preview": MetadataValue.md(validated_data.head().to_markdown()),
        },
    )
```

Architecture: From Sources to Analytics
Our pipeline architecture leverages Dagster’s powerful abstractions to create a clear, maintainable data flow:
```mermaid
graph TB
    subgraph "Data Sources"
        FB[Firebird DB]
        XML[XML Files]
        XLS[Excel Reports]
    end

    subgraph "Dagster Assets"
        BD[billing_data]
        PD[patient_data]
        PAY[payment_data]
    end

    subgraph "Transformations"
        CLEAN[Data Cleansing]
        VAL[Validation]
        AGG[Aggregations]
    end

    subgraph "Outputs"
        DUCK[DuckDB]
        REPORT[Reports]
        ANALYTICS[Analytics]
    end

    FB --> BD
    XML --> PD
    XLS --> PAY
    BD --> CLEAN
    PD --> CLEAN
    PAY --> CLEAN
    CLEAN --> VAL
    VAL --> AGG
    AGG --> DUCK
    AGG --> REPORT
    DUCK --> ANALYTICS
```

Key Architectural Benefits
Built-in Data Catalog: Every asset automatically appears in Dagster’s UI with its lineage, last update time, and metadata. No separate catalog tool needed.
Intelligent Re-execution: When upstream data changes, Dagster knows exactly which downstream assets need updating. No manual dependency tracking required.
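For example, a downstream asset declares its dependency simply by naming the upstream asset as a parameter, and Dagster derives the graph and staleness tracking from that. A minimal sketch with illustrative asset names (not from our production pipeline):

```python
from dagster import asset

@asset
def lab_results():
    # Hypothetical upstream extract standing in for any source asset
    return [{"patient_id": 101, "value": 4.2}]

@asset
def lab_results_summary(lab_results):
    # The parameter name matches the upstream asset key, so Dagster infers
    # the dependency: re-materializing lab_results marks this asset stale
    return {"record_count": len(lab_results)}
```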
Resource Abstraction: Database connections, API clients, and other resources are configured once and injected where needed:
```python
from dagster import resource

@resource(config_schema={"connection_string": str})
def database_resource(context):
    # Configuration decides between production and development:
    # the "mock" connection string swaps in an in-memory fake
    if context.resource_config["connection_string"] == "mock":
        return MockDatabaseResource()
    return ProductionDatabaseResource(
        context.resource_config["connection_string"]
    )
```

Real-World Implementation Patterns
Pattern 1: Multi-Source Data Integration
Healthcare data rarely comes from a single source. We implemented a pattern for combining data from Firebird databases, XML feeds, and Excel uploads:
```python
from dagster import asset, AssetIn
from pandas import DataFrame

@asset(
    ins={
        # Map the friendly parameter names onto the upstream asset keys
        "billing": AssetIn("billing_data"),
        "patients": AssetIn("patient_data"),
    }
)
def enriched_billing(billing: DataFrame, patients: DataFrame) -> DataFrame:
    """
    Combine billing with patient demographics.
    Dagster materializes both upstream assets before this one runs.
    """
    return billing.merge(patients, on="patient_id", how="left")
```

Pattern 2: Incremental Processing with Partitions
Processing years of historical data efficiently requires partitioning:
```python
from dagster import asset, DailyPartitionsDefinition
from pandas import DataFrame

@asset(partitions_def=DailyPartitionsDefinition(start_date="2023-01-01"))
def daily_billing_summary(context) -> DataFrame:
    partition_date = context.partition_key
    # Process only the specific day's data
    return process_billing_for_date(partition_date)
```

Pattern 3: Data Quality Gates
Using Dagster’s asset checks, we enforce quality standards:
```python
from dagster import asset_check, AssetCheckResult

@asset_check(asset=billing_data)
def billing_data_quality(context, billing_data):
    # Check for required fields
    ids_present = not billing_data["patient_id"].isna().any()
    # Validate business rules
    amounts_valid = (billing_data["amount"] >= 0).all()
    # Report pass/fail instead of raising, so violations surface as
    # failed checks in the UI rather than opaque execution errors
    return AssetCheckResult(
        passed=bool(ids_present and amounts_valid),
        metadata={"rows_validated": len(billing_data)},
    )
```

Observability: Beyond “Did It Run?”
Traditional orchestrators tell you a job succeeded or failed. Dagster provides deep observability into the actual data:
- Data Freshness: When was each asset last updated?
- Data Lineage: What upstream changes triggered this update?
- Data Quality: Are validation checks passing?
- Data Volume: How many records were processed?
The Dagster UI provides all this information in a single pane of glass, dramatically reducing debugging time when issues arise.
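Freshness can be enforced as well as observed. As a sketch (the helper below exists in recent Dagster releases; older versions used freshness policies instead), a last-update freshness check for the billing asset might look like:

```python
from datetime import timedelta
from dagster import build_last_update_freshness_checks

# Flag billing_data if it hasn't been re-materialized within 24 hours;
# billing_data refers to the asset defined earlier
freshness_checks = build_last_update_freshness_checks(
    assets=[billing_data],
    lower_bound_delta=timedelta(hours=24),
)
```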
Performance Considerations
Moving to an asset-centric model with DuckDB as our analytical database delivered significant performance improvements:
- Lazy Materialization: Assets only rebuild when their dependencies change
- Parallel Execution: Independent assets process simultaneously
- Efficient Storage: DuckDB’s columnar format reduced storage by 70% (a loading sketch follows this list)
- Query Performance: Analytical queries that took minutes now complete in seconds
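To make the DuckDB piece concrete, here is a minimal loading sketch using the plain duckdb client (table and file names are illustrative; the dagster-duckdb integration can do the same through an IO manager):

```python
import duckdb
import pandas as pd

# Illustrative aggregate; in the pipeline this is an asset's output
summary = pd.DataFrame(
    {"patient_id": [101, 102], "total_billed": [1250.00, 340.50]}
)

con = duckdb.connect("analytics.duckdb")
# DuckDB can scan in-scope pandas DataFrames directly by name
con.execute("CREATE OR REPLACE TABLE billing_summary AS SELECT * FROM summary")
row_count = con.execute("SELECT count(*) FROM billing_summary").fetchone()[0]
```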
Lessons Learned
After running this pipeline in production for several months, key takeaways include:
- Start with Clear Asset Boundaries: Define assets based on business concepts, not technical implementation details
- Invest in Resource Abstraction: Clean separation between business logic and infrastructure makes testing and deployment much easier
- Embrace Metadata: Rich metadata on assets pays dividends in debugging and compliance auditing
- Plan for Evolution: The asset graph will grow, so design with modularity in mind (a wiring sketch follows this list)
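Putting the lessons together, here is a minimal sketch of how the pieces from the earlier snippets could be wired up and smoke-tested. It reuses names from those snippets for illustration and is not our exact production wiring:

```python
from dagster import Definitions, materialize

# Schematic wiring of the assets, checks, and resource defined above
defs = Definitions(
    assets=[billing_data, enriched_billing, daily_billing_summary],
    asset_checks=[billing_data_quality],
    resources={
        "firebird_database": database_resource.configured(
            {"connection_string": "firebird://..."}  # illustrative placeholder
        )
    },
)

def test_billing_data_smoke():
    # The "mock" connection string swaps in MockDatabaseResource, so the
    # asset can be exercised without touching real patient data
    result = materialize(
        [billing_data],
        resources={
            "firebird_database": database_resource.configured(
                {"connection_string": "mock"}
            )
        },
    )
    assert result.success
```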
Conclusion
Adopting Dagster’s asset-centric approach transformed our healthcare ETL pipeline from a complex web of scripts into a maintainable, observable data platform. The combination of software-defined assets, built-in lineage tracking, and powerful orchestration capabilities has reduced our development time by 40% while improving data quality and reliability.
For organizations dealing with complex data orchestration challenges - especially in regulated industries like healthcare - the asset-centric paradigm offers a compelling alternative to traditional task-based ETL.