Every quarter, someone on the team asks: "Do we really need this Spark cluster?" For most of the jobs running on it, the answer in 2026 is no.
## The 40 GB Threshold
Digital Turbine published a benchmark that captures what a lot of teams have been discovering independently. They ran a filtering-and-aggregation query across 40 GiB of Parquet files — 94 columns, 52 files — on a c2d-standard-16 instance (16 cores, 64 GB RAM).
| Engine | Elapsed Time | RAM Used | Container Image |
|---|---|---|---|
| DuckDB | 3.99 s | 196 MB | 60 MiB |
| chDB | 9.3 s | 2.3 GB | 574 MiB |
| Spark | 16.39 s | 1.6 GB | 428 MiB |
The duck finished in under four seconds. Spark needed sixteen. Not because Spark is bad at this query — it pays upfront for distributed coordination that a 40 GB dataset doesn't need. Serialization, shuffle planning, JVM startup, the driver-executor handshake — all of that overhead exists whether you're processing 40 GB or 40 TB.
The RAM difference is even more telling. DuckDB used 196 MB versus Spark's 1.6 GB. When your K8s CronJob runs hourly, that 8× memory gap maps directly to pod resource requests and node cost. Digital Turbine ended up shipping the duck as a lightweight CronJob — something they couldn't justify with a 428 MiB Spark image eating scheduler resources every few minutes.
## What 1.4 LTS Actually Fixed
The performance advantage isn't new. What's new is that the list of "yeah but can it do X in production?" objections keeps shrinking. Version 1.4 LTS shipped three features that removed the biggest blockers:
**Encryption at rest.** AES-256-GCM with 5–10% overhead, or AES-256-CTR at 2–5%. Block-level encryption covers data, indexes, metadata, the WAL, and temporary spill files. Healthcare and finance teams that previously couldn't touch it now can. The overhead is minimal because the engine leverages hardware-accelerated AES instructions on modern CPUs.
**Iceberg writes.** Direct write-back to Iceberg tables means the engine fits into existing lakehouse architectures without maintaining a separate write path through Spark or Trino.
**MERGE support.** Proper upserts without the delete-then-insert workaround. This was the last feature gap keeping people on Spark for incremental loads.
None of these are flashy. That's the point — they're the boring compliance and integration checkboxes that procurement and platform teams actually care about.
## At 100 GB, It Still Wins
Scaling up to 100 GB of Parquet, the engine still holds a clear advantage — roughly 48 seconds on a single machine versus 90 seconds net runtime on an EMR cluster. But the real savings come from infrastructure you don't run. No cluster to spin up at $50/hour. No Spark session startup tax. No YARN negotiation. You point DuckDB at an S3 path with the `httpfs` extension, run SQL, and get results.
The break-even point where distributed execution starts paying for itself sits somewhere around 500 GB of working set — not input data, but the intermediate results your transformations actually materialize. Most analytics pipelines read 200 GB and aggregate down to 2 GB. That's duck territory.
## The Migration Is Boring
Good. Here's what it looks like in practice:
1. **Profile your bottleneck.** Open your Airflow DAGs and find every PySpark task reading under 100 GB of Parquet. Those are your candidates.
2. **Rewrite as SQL.** Most PySpark jobs are already calling `spark.sql()` internally, so the translation is mechanical. Point `duckdb.connect()` at the same Parquet paths and run the query.
3. **Hand off downstream.** Use `.fetchdf()` for pandas or `.pl()` for Polars. Zero-copy interchange means no serialization penalty.
Two gotchas worth flagging:
Partition pruning matters more here. Spark can brute-force a full scan across nodes. A single-node engine can't afford that luxury. If your Parquet isn't partitioned by your common filter columns (`year=`, `region=`), you'll read far more data than necessary. Run `EXPLAIN ANALYZE` before declaring victory.
OOM behavior differs. Disk spill has improved — tested up to 3.6 TB in TPC-H benchmarks — but when memory truly runs out, the duck fails. Spark's resilient distributed datasets retry failed stages automatically. If your pipeline has unpredictable data volume spikes, that resilience still matters.
## When Spark Earns Its Keep
Multi-pass ML feature engineering over a terabyte. Shuffle-heavy joins across datasets that won't fit on one machine no matter how much RAM you throw at it. Streaming ingestion at high throughput with exactly-once semantics — though Flink is arguably better there.
Don't over-rotate. The question isn't "DuckDB or Spark" as a company-wide decision. It's "which of my current Spark jobs are paying cluster tax for no reason?"
## Run the Numbers on Your Own Stack
One team reported cutting their Snowflake bill by 79% using a DuckDB caching and transformation layer in front of the warehouse. That number is extreme, but the direction is consistent across every migration story I've seen.
A c2d-standard-16 on GCP costs about $0.75/hour. An EMR cluster with a driver and four workers starts at $2.50/hour before EBS. If your pipeline runs for 10 minutes every hour, 24/7, the difference is roughly $540/month versus $1,800/month — and that's before counting engineer-hours spent debugging Spark session configs, executor memory fractions, and dependency conflicts between your driver and workers.
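The arithmetic behind those monthly figures, assuming the stated hourly rates and instances kept warm around the clock for the hourly cadence:

```python
HOURS_PER_MONTH = 24 * 30  # 720, assuming the instances stay warm all month

single_node = 0.75 * HOURS_PER_MONTH  # c2d-standard-16 on GCP
emr_cluster = 2.50 * HOURS_PER_MONTH  # EMR driver + four workers, before EBS

print(f"${single_node:.0f}/month vs ${emr_cluster:.0f}/month")
```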
The duck's operational surface area is a Python import. No cluster to manage. No JVM to tune. No `spark.executor.memoryOverhead` to get wrong at 2 AM when the pipeline falls over.
That tradeoff won't make sense for every workload. But for the 40 GB aggregation job that's been running on a $2.50/hour cluster since 2021? Pull the plug.