Databricks finally open-sourced the thing everyone expected them to keep behind the paywall. Spark 4.1 ships with Spark Declarative Pipelines — SDP — and if you've ever fought with DLT pricing tiers, you already know why this deserves attention.

What SDP actually does

Strip away the marketing and you get this: SDP lets you define what your datasets should contain, not how to orchestrate the transformations. You write functions that return DataFrames, slap a decorator on them, and the framework handles dependency resolution, execution ordering, parallelism, and checkpoint management.

Minimum viable example:

from pyspark import pipelines as dp
import pyspark.sql.functions as F  # avoids shadowing the builtin sum()

@dp.table
def orders():
    return (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")  # required by the Kafka source
        .option("subscribe", "orders")
        .load()
    )

@dp.materialized_view
def daily_summary():
    return (
        spark.table("orders")
        .groupBy("order_date")
        .agg(F.sum("amount").alias("total"))
    )

No Airflow DAG file. No dependency declaration. No retry configuration. The framework reads the spark.table("orders") reference in daily_summary, infers the edge in the graph, and runs both nodes in the right order. Independent branches get parallelized without you asking.
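A rough sketch of what that inference amounts to — illustrative names and a deliberately naive source scan, not the framework's internals — is a search for spark.table(...) references followed by a topological sort:

```python
import re
from graphlib import TopologicalSorter

# Hypothetical stand-ins for decorated definitions: name -> function body.
definitions = {
    "orders": 'spark.readStream.format("kafka").load()',
    "daily_summary": 'spark.table("orders").groupBy("order_date")',
    "refunds": 'spark.readStream.format("kafka").load()',
}

def inferred_deps(body):
    # Each spark.table("x") reference becomes an edge in the graph.
    return set(re.findall(r'spark\.table\("([^"]+)"\)', body))

graph = {name: inferred_deps(body) for name, body in definitions.items()}
order = list(TopologicalSorter(graph).static_order())
# "orders" always precedes "daily_summary"; "refunds" has no edges, so it's
# an independent branch a scheduler is free to run in parallel.
```

The real planner works on resolved logical plans rather than regexes, but the shape of the problem — build the edge set, sort it, parallelize the independent branches — is the same.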

Where this came from

If the API looks familiar, you've used Delta Live Tables. SDP is effectively DLT donated to the Apache Spark project — import dlt becomes from pyspark import pipelines as dp, every dlt. becomes dp., and your pure transformation functions survive the migration character-for-character.
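The mechanical part of that migration can be sketched as a two-step rewrite (a simplification — a real migration still deserves a manual review pass):

```python
import re

def migrate(source: str) -> str:
    # Step 1: swap the import statement.
    source = source.replace("import dlt", "from pyspark import pipelines as dp")
    # Step 2: rewrite every remaining dlt. prefix to dp.
    return re.sub(r"\bdlt\.", "dp.", source)

old = "import dlt\n\n@dlt.table\ndef orders():\n    ..."
print(migrate(old))
```

Anything that only touched DataFrames passes through untouched, which is why the transformation functions themselves survive character-for-character.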

The three decorators worth knowing

The API surface is deliberately small.

@dp.table creates a streaming table that processes data incrementally. Only new rows since the last checkpoint get touched. This is your decorator for anything arriving from Kafka, Kinesis, or an append-only file source. The checkpoint management that used to eat an afternoon of debugging? Handled internally.

@dp.materialized_view builds a batch-recomputed dataset. The framework re-executes the full query on each run. Aggregation tables, dimension lookups, feature stores that need a complete refresh — this is where they live.

@dp.append_flow enables fan-in patterns where multiple sources write to one target:

dp.create_streaming_table("all_events")

@dp.append_flow(target="all_events")
def mobile_events():
    return spark.readStream.table("mobile_raw")

@dp.append_flow(target="all_events")
def web_events():
    return spark.readStream.table("web_raw")

Two separate streams merging into a single table. The framework coordinates the checkpoints across both flows. If you've hand-rolled this with Structured Streaming before, you know how much offset-tracking headache that eliminates. The merge logic is exactly the kind of plumbing nobody should write twice.

One subtlety: @dp.materialized_view and @dp.table look interchangeable at first glance. They're not. Materialized views recompute everything; streaming tables process only the delta. Picking the wrong one means either wasted compute (full recompute on append-only data) or stale results (incremental processing on data that gets updated in place). Match the decorator to the data's mutation pattern.
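The distinction is easy to hold onto with a toy model of the two execution styles (illustrative only — not SDP internals):

```python
def streaming_run(rows, checkpoint):
    """Touch only the rows appended since the last checkpoint."""
    return rows[checkpoint:], len(rows)

def materialized_run(rows):
    """Recompute over the entire input, every run."""
    return list(rows)

data = [10, 20, 30]
_, ckpt = streaming_run(data, 0)         # first run processes all three rows
data += [40, 50]
delta, ckpt = streaming_run(data, ckpt)  # second run processes only the two new rows
full = materialized_run(data)            # always processes all five rows
```

If rows in `data` were updated in place rather than appended, `streaming_run` would silently miss the change — that's the stale-results failure mode, and the cue to use a materialized view instead.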

What will bite you

SDP re-evaluates your pipeline code multiple times during planning and execution. Side effects inside decorated functions fire more than once. Never put collect(), count(), or toPandas() inside a definition — the planner runs them during dry-run, which is both slow and semantically wrong.
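A toy model makes the hazard concrete (a stand-in for the planner, not SDP code):

```python
calls = []

def evaluate_definition(fn, passes=2):
    """Stand-in for a planner that re-evaluates pipeline definitions."""
    plan = None
    for _ in range(passes):
        plan = fn()
    return plan

def bad_definition():
    calls.append("eager action fired")  # imagine a collect() or count() here
    return "logical plan"

evaluate_definition(bad_definition)
# calls now holds two entries: the side effect ran once per evaluation.
```

A definition should return a lazy DataFrame and nothing else; any eager action belongs outside the pipeline, in a test or a notebook.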

The dry-run command catches structural problems before they page you at 3 AM:

spark-pipelines dry-run --spec spark-pipeline.yml

It validates syntax, resolves table references, and flags circular dependencies. Run this in CI. A cyclic reference that only surfaces in production is the exact failure mode declarative frameworks are supposed to prevent — but only if you actually use the validation tooling.

Other edges:

  • No PIVOT in SQL pipelines. Annoying if your reporting layer depends on it, but not a deal-breaker for most workloads.

  • Dynamic table creation via for-loops requires the value list to be strictly additive — it can grow between runs but never shrink. Remove a region from the list between runs and the job fails: the planner tracks expected table names and aborts when one disappears.

  • No DDL mutations. DROP VIEW, ALTER TABLE — these don't belong in SDP SQL files. The framework expects pure definitions, not imperative state changes. This trips up teams migrating from traditional SQL-based ETL where the deploy script does schema evolution inline.
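The additive-list constraint on for-loop table generation can be modeled in a few lines (a sketch of the observed behavior, not SDP's actual state tracking):

```python
def plan_tables(regions, previously_seen):
    """Register one hypothetical table per region; fail if any prior name vanished."""
    expected = {f"orders_{r}" for r in regions}
    missing = previously_seen - expected
    if missing:
        raise ValueError(f"expected tables disappeared: {sorted(missing)}")
    return previously_seen | expected

seen = plan_tables(["us", "eu"], set())
seen = plan_tables(["us", "eu", "apac"], seen)  # growing the list is fine
try:
    plan_tables(["us", "apac"], seen)           # dropping "eu" fails the run
except ValueError as e:
    failure = str(e)
```

The practical workaround is the same as in the toy: retire a table explicitly rather than silently deleting its name from the loop's input.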

When switching makes sense (and when it doesn't)

Already on Databricks DLT? Evaluate immediately. The migration is near-trivial — a find-and-replace on imports — and you reclaim control over compute costs. Whether the open-source runtime reaches parity with Lakeflow's monitoring layer is another question. Don't expect the same observability dashboards on day one.

Running Airflow + vanilla Spark jobs that work fine? SDP solves a problem you might not have. Airflow handles inter-task dependencies at the DAG level. Where SDP wins is within a single Spark application — the intra-job dependency graph that your orchestrator can't see. If your Spark jobs are already clean single-purpose transformations, the overhead of adopting a new framework buys you little.

Starting from scratch? This is where SDP is most compelling. The spark-pipelines init command scaffolds a project with a spec file, and the declarative model genuinely shrinks the bug surface. Fewer imperative decisions about checkpoint offsets, write ordering, and retry semantics means fewer ways to corrupt state at 2 AM. The tradeoff is ecosystem maturity — IDE support, testing frameworks, and third-party integrations are months behind what DLT had on Databricks. You're betting on the community filling those gaps.

The real question isn't whether SDP works — it does, and the DLT lineage gives it a head start most new frameworks don't get. The question is whether the open-source ecosystem around it matures fast enough to matter before Databricks makes Lakeflow cheap enough that nobody bothers switching.