JSON columns are a bug we've been treating as a feature for eight years. Last October, Apache Parquet quietly ratified an open Variant specification — and as of this week, the three dominant table formats and the stream processor you probably use all speak it natively.
The news that crystallized this was Alibaba Cloud's Ververica Runtime 11.6.0 release on April 9, which added Variant support alongside its flashier AI functions for multimodal inference. The AI-function bits get the headlines — Qwen-VL running against PDFs inside Flink SQL is a neat demo. Variant matters more for most pipelines, because it signals that even stream processors are done pretending JSON strings are a reasonable data type.
## Why JSON-as-string was always cursed
The pattern is familiar. Your upstream sends you webhooks, your API gateway ships request logs, your mobile SDK emits heterogeneous event payloads. The schema shifts weekly. So you do the lazy thing: declare a STRING column called payload, stuff the raw JSON into it, and push pain downstream to whoever has to query it.
Every read path then does the same thing. Pull the full string off disk, parse it, extract the two fields you actually want. No statistics, no pushdown, no column pruning. The query planner is blind because from its perspective your 4 KB blob is an opaque sequence of bytes. Predicate pushdown can't skip pages it can't reason about.
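The read-path tax is easy to see in miniature: with a STRING column, even a two-field projection pays for a full parse of every blob. A toy Python sketch (the payload shape and field names are invented for illustration):

```python
import json

# A 'payload' column stored as raw JSON text: every query that touches it
# must parse the entire blob, even to read one nested field.
payload = json.dumps({
    "event_type": "purchase",
    "user": {"id": "u-42", "locale": "en-US"},
    "items": [{"sku": "A1", "qty": 2}],  # dead weight for this query
})

def user_id_from_string_column(raw: str) -> str:
    # The equivalent of get_json_object(payload, '$.user.id'):
    # parse everything first, then navigate to the one field we want.
    return json.loads(raw)["user"]["id"]

print(user_id_from_string_column(payload))  # u-42
```

The planner sees none of this: the parse happens per row, after the bytes are already off disk, which is exactly why statistics and pushdown are unavailable.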
Teams learned to paper over this with ELT shortcuts — JSON-flattening staging models in dbt, get_json_object UDFs, materialized views keyed on specific paths. It works until the payload schema changes and your staging model starts silently dropping fields.
## What Variant actually is
A binary encoding with two parts. A metadata section deduplicates field names into a dictionary. A value section tags each value with its type (primitive, array, object, short-string). Field access uses offsets, so you can navigate into a nested structure without parsing the whole blob.
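A toy sketch makes the two-part layout concrete. This is not the real Parquet byte format (the tuples and tags here are invented stand-ins); it only shows the idea that a deduplicated field-name dictionary plus type-tagged values lets a reader jump to a field without re-parsing text:

```python
# Toy model of the Variant layout: a metadata dictionary of deduplicated
# field names, plus a value tree whose nodes carry a type tag and refer
# to field names by dictionary index instead of repeating the strings.
def encode_variant(obj):
    names = []   # metadata section: each field name stored exactly once
    index = {}   # name -> position in the dictionary

    def intern(name):
        if name not in index:
            index[name] = len(names)
            names.append(name)
        return index[name]

    def encode(v):
        if isinstance(v, dict):
            # object: list of (field-name index, encoded child)
            return ("object", [(intern(k), encode(x)) for k, x in v.items()])
        if isinstance(v, list):
            return ("array", [encode(x) for x in v])
        return ("primitive", v)  # type tag + raw value

    return names, encode(obj)

names, value = encode_variant({"user": {"id": "u-42"}, "user_tags": ["a"]})
# names lists each distinct field name once: ['user', 'id', 'user_tags']
```

The real encoding replaces these Python tuples with offsets into a byte buffer, which is what makes "navigate without parsing the whole blob" possible.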
In SQL, the migration looks roughly like this:
```sql
-- before
CREATE TABLE events (
  event_id STRING,
  payload  STRING   -- JSON as text
) USING ICEBERG;

SELECT get_json_object(payload, '$.user.id') AS user_id
FROM events
WHERE get_json_object(payload, '$.event_type') = 'purchase';

-- after (Iceberg v3 / Delta 4 / Spark 4)
CREATE TABLE events (
  event_id STRING,
  payload  VARIANT
) USING ICEBERG;

SELECT payload:user.id::STRING AS user_id
FROM events
WHERE payload:event_type::STRING = 'purchase';
```
Two things changed. Field access uses a path operator the engine understands natively. And the bytes on disk are a compact binary encoding, not UTF-8 text, so both storage footprint and scan cost drop immediately — without any flattening stage.
## The shredding trick is where the wins come from
The headline numbers — Databricks reports 8x faster reads versus JSON strings, 30x with shredding enabled — don't come from the binary encoding alone. They come from shredding.
Shredding is subcolumnarization. The writer identifies fields that appear in most rows (say, event_type, event_ts, user.id) and stores them as dedicated typed Parquet columns, while keeping the residual binary for everything else. You get column-level min/max stats for the shredded fields, so predicate pushdown and row-group skipping work the way they do for normal columns. You also keep schema flexibility for the long tail — new fields nobody asked for yet stay inside the Variant and cost nothing extra at ingest time.
Writes are 20–50% slower with shredding turned on — the writer is doing real work to identify and promote fields. For most analytical workloads that cost is trivial. For high-throughput ingest pipelines it matters and is worth benchmarking on your own data before flipping it on.
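The promotion the writer performs can be sketched in a few lines. This is an illustrative approximation, not the actual Parquet writer algorithm; the 90% frequency threshold and the field names are assumptions made for the example:

```python
# Sketch of shredding: fields present in most rows become dedicated typed
# columns (which get min/max stats); everything else stays in a residual
# blob per row, preserving schema flexibility for the long tail.
def shred(records, threshold=0.9):
    total = len(records)
    counts = {}
    for rec in records:
        for key in rec:
            counts[key] = counts.get(key, 0) + 1

    # Promote fields that appear in at least `threshold` of the rows.
    shredded_keys = {k for k, n in counts.items() if n / total >= threshold}

    columns = {k: [] for k in shredded_keys}
    residual = []  # long-tail fields stay Variant-encoded
    for rec in records:
        for k in shredded_keys:
            columns[k].append(rec.get(k))  # typed column, stats-friendly
        residual.append({k: v for k, v in rec.items() if k not in shredded_keys})
    return columns, residual

cols, rest = shred([
    {"event_type": "purchase", "user_id": "u-1", "coupon": "X"},
    {"event_type": "view", "user_id": "u-2"},
])
# event_type and user_id become real columns; coupon stays residual
```

The two passes over the data in this sketch are also a fair intuition for why shredded writes cost more: the writer has to observe field frequencies before it can lay out the columns.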
## Where support actually stands
The ecosystem moved faster than most teams realize:
| Engine / Format | Variant | Shredding | Notes |
|---|---|---|---|
| Parquet spec | ratified (Oct 2025) | ratified | v2.12.0 / parquet-java 1.16.0 |
| Apache Iceberg v3 | yes | read-only (Spark writer WIP) | v3 tables only |
| Delta Lake 4.0 | yes | yes | DBR 17.2+ / DBSQL 2025.30+ for shredding |
| Apache Spark 4.x | yes | yes | Ships with Delta 4 / Iceberg v3 |
| Flink (VVR 11.6.0) | yes | read path | Alibaba Cloud; Apache Flink OSS pending |
| DuckDB | reader support | partial | Reads Variant-encoded Parquet |
The gap worth noting: open-source Apache Flink doesn't have it yet. If you're running self-managed Flink on EKS, you're waiting. Confluent Cloud and Alibaba's VVR got there first because both had commercial pressure to ship AI-function integrations that emit semi-structured output.
## Should you migrate?
Probably not, if your table is already Iceberg v2 and your queries are working. The switch to v3 is not free — your Trino, Athena, and Snowflake readers all need to understand v3 metadata, and as of this week Snowflake's Iceberg v3 support is still labeled public preview. A mid-quarter table-format upgrade for a performance win that compounds slowly is not the kind of trade most teams should take.
For net-new tables — especially event streams, audit logs, anything flowing in from a webhook gateway — there's no good reason left to reach for STRING when the shape is JSON. Declare the column as VARIANT, let the writer worry about encoding, and get back to the real work.
## One thing that'll bite you
Variant values compare by their binary representation, not their logical JSON equivalent. `{"a":1,"b":2}` and `{"b":2,"a":1}` can be byte-different Variants that query as "not equal." If your ETL relies on equality joins or deduplication across JSON-shaped columns, you need to canonicalize before you cast. That's not a Variant bug — it's a reminder that moving from text to binary changes the identity semantics of your data, and every cutover I've seen has forgotten about it at least once.
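The fix is mechanical but easy to forget. A minimal Python sketch of the canonicalization step, where re-serializing with sorted keys stands in for whatever canonical form your pipeline settles on:

```python
import json

# Byte equality vs. logical equality: the same object with fields in a
# different order serializes to different bytes, just as two logically
# equal Variants can be byte-different.
a = '{"a": 1, "b": 2}'
b = '{"b": 2, "a": 1}'
assert a != b  # byte-different

def canonical(raw: str) -> str:
    # Sort keys and strip whitespace so logically equal payloads
    # produce identical bytes before the cast / join key is built.
    return json.dumps(json.loads(raw), sort_keys=True, separators=(",", ":"))

assert canonical(a) == canonical(b)  # both become '{"a":1,"b":2}'
```

Run the canonicalization on the text before casting to VARIANT (or build join and dedup keys from the canonical form), and the equality semantics survive the cutover.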