Most RAG teams treat embedding freshness the same way they treat data warehouse freshness — schedule a nightly batch job and hope nothing changes too fast. Except each embedding API call costs real money, and the vast majority of those calls recompute vectors for documents nobody touched.
## The batch habit
The typical setup: an Airflow DAG fires at 2 AM, pulls every document from Postgres, chunks them, calls OpenAI's embedding API, upserts the results into your vector store. On a 50,000-document corpus where maybe 500 rows change daily — a 1% churn rate — you're paying for 49,500 calls that produce bit-for-bit identical vectors.
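Stripped of the Airflow boilerplate, that nightly job is a loop that never asks what changed. A minimal sketch with stubbed-out I/O (`fetch_all_documents` and `embed` are illustrative stand-ins, not a real database client or API):

```python
# Naive nightly re-embed: every document, every night, changed or not.

def fetch_all_documents():
    # Stand-in for a SELECT over the documents table.
    return [{"doc_id": i, "text": f"doc {i}"} for i in range(50_000)]

api_calls = 0

def embed(text):
    # Stand-in for the embedding API call; count invocations instead.
    global api_calls
    api_calls += 1
    return [0.0] * 1536

def nightly_reindex():
    for doc in fetch_all_documents():
        vector = embed(doc["text"])
        # upsert(doc["doc_id"], vector) would go here

nightly_reindex()
print(api_calls)  # one call per document, regardless of churn
```

Nothing in the loop knows whether a document changed since the last run, so the call count always equals the corpus size.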
The raw API cost with text-embedding-3-small at $0.02 per million tokens might not break the bank on its own. But the architectural cost is worse: a 24-hour staleness window where your retrieval system serves yesterday's information. For a documentation chatbot over docs that ship weekly, who cares. For a support agent pulling from live ticket histories and CRM notes, that gap is the difference between a helpful answer and a confidently wrong one.
## CDC flips the question
Change Data Capture reverses the problem. Instead of "give me everything, I'll figure out what changed," you subscribe to the change stream. Only modified rows trigger re-embedding.
The architecture has fewer moving parts than you'd expect:
- Source database (Postgres) with logical replication turned on
- A streaming database consuming the CDC feed directly — no Kafka or Debezium sitting in the middle
- An embedding function invoked inside a materialized view
- A vector index that refreshes automatically as new vectors arrive
RisingWave's implementation makes the pattern concrete:
```sql
CREATE MATERIALIZED VIEW document_embeddings AS
SELECT doc_id, title, category, updated_at,
       openai_embedding('text-embedding-3-small',
           title || '. ' || content)::vector(1536) AS embedding
FROM documents;
```
When a row in the upstream documents table changes, RisingWave detects the diff via the WAL, re-embeds just that row, and updates the materialized view. The index stays current within seconds of the source mutation.
## The numbers
|  | Batch (nightly) | Streaming (CDC) |
|---|---|---|
| Embedding API calls/day | 50,000 | ~500 |
| Max staleness | 24 hours | Seconds |
| Scheduling infra | Airflow / cron | None |
| Cost scales with | Corpus size | Change rate |
That last row is the one to stare at. Batch costs grow linearly as your corpus expands. Streaming costs grow with edit velocity. Double your corpus but maintain the same edit frequency? CDC costs barely move. Try saying that about a nightly full-reindex.
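The scaling claim is easy to check with back-of-the-envelope arithmetic. The token count per document and the $0.02-per-million price for text-embedding-3-small are assumptions here; plug in your own numbers:

```python
PRICE_PER_M_TOKENS = 0.02   # assumed text-embedding-3-small pricing, USD
TOKENS_PER_DOC = 500        # assumed average document size

def daily_cost(docs_embedded):
    # Cost of one day's embedding calls, in USD.
    return docs_embedded * TOKENS_PER_DOC * PRICE_PER_M_TOKENS / 1_000_000

corpus, churn = 50_000, 500
print(daily_cost(corpus))        # batch: re-embed everything, daily
print(daily_cost(churn))         # streaming: re-embed the 1% that changed

# Double the corpus at the same edit rate:
print(daily_cost(corpus * 2))    # batch cost doubles
print(daily_cost(churn))         # streaming cost is unchanged
```

The absolute dollar amounts are small at this corpus size; the point is the shape of the curve, which compounds at millions of documents.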
## What breaks
This isn't free. A few things to watch.
Cold start. First-time setup still requires a batch load to embed the full corpus. RisingWave handles this transparently — it snapshots the current table state and then switches to streaming mode. If you're wiring CDC manually with Kafka Connect and a custom consumer, the initial snapshot is your problem to solve.
Provider coupling. Calling openai_embedding() inside a materialized view weds your data pipeline to a single API endpoint. OpenAI goes down, your embedding pipeline stalls. Rate limiting behaves differently during a bulk migration versus trickle updates throughout the day. You want circuit breakers, backpressure handling, and ideally a local model fallback for when the provider has a bad afternoon.
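One hedge against provider coupling is a thin wrapper that retries the remote call and falls back to a local model when the provider stays down. A sketch, where `remote_embed` and `local_embed` are placeholders for the real OpenAI call and a local sentence-transformer:

```python
# Illustrative fallback wrapper; both embed functions are stand-ins.

class ProviderDown(Exception):
    pass

def remote_embed(text):
    # Stand-in for the OpenAI API call; simulates an outage here.
    raise ProviderDown("provider having a bad afternoon")

def local_embed(text):
    # Stand-in for a local model (e.g. a sentence-transformer).
    return [0.0] * 1536

def embed_with_fallback(text, max_retries=2):
    for attempt in range(max_retries):
        try:
            return remote_embed(text), "remote"
        except ProviderDown:
            continue  # real code would back off between attempts
    return local_embed(text), "local"

vector, source = embed_with_fallback("hello")
print(source)  # "local" once retries are exhausted
```

Note the caveat: local and remote models produce incompatible vector spaces, so a fallback only works if you track which model produced each vector and re-embed once the provider recovers.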
Chunking works identically in both models — fixed-size or semantic, the strategy sits upstream of the embedding call. The streaming win is that only modified chunks trigger recomputation. A document with 12 chunks where paragraph 3 changed? One embedding call, not twelve.
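Chunk-level recomputation falls out of keeping a content hash per chunk: only chunks whose hash changed get re-embedded. A minimal sketch (the hashing scheme and helper names are illustrative):

```python
import hashlib

def chunk_hash(text):
    return hashlib.sha256(text.encode()).hexdigest()

def changed_chunks(old_chunks, new_chunks):
    # Compare per-position hashes; only differing chunks need re-embedding.
    old_hashes = [chunk_hash(c) for c in old_chunks]
    return [i for i, c in enumerate(new_chunks)
            if i >= len(old_hashes) or chunk_hash(c) != old_hashes[i]]

doc_v1 = [f"paragraph {i}" for i in range(12)]
doc_v2 = list(doc_v1)
doc_v2[2] = "paragraph 3, now edited"   # only one chunk changed

print(changed_chunks(doc_v1, doc_v2))   # [2]: one embedding call, not twelve
```

Positional comparison breaks down when an edit shifts chunk boundaries, so production versions typically key on stable chunk IDs rather than list position.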
Monitoring shifts shape. You stop watching "did the DAG succeed" and start watching CDC lag, embedding throughput, and API error rates. Different failure modes require different dashboards. If your team's observability story is "check the Airflow UI," plan some migration time for alerting infrastructure too.
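The headline metric on the new dashboard is CDC lag: how far the embedding view trails the source. A toy check, with an assumed 30-second freshness SLO and illustrative timestamp plumbing:

```python
from datetime import datetime, timedelta, timezone

LAG_ALERT = timedelta(seconds=30)  # assumed SLO; tune to your freshness target

def cdc_lag(last_source_commit, last_view_update):
    # Lag = how far the materialized view trails the source's latest commit.
    return max(last_source_commit - last_view_update, timedelta(0))

now = datetime.now(timezone.utc)
lag = cdc_lag(now, now - timedelta(seconds=45))
print(lag > LAG_ALERT)  # True: 45s of lag breaches a 30s SLO
```

In practice the two timestamps come from the replication slot and the streaming database's own metrics, not from application clocks.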
## When batch is actually fine
Not every RAG system needs sub-second freshness. Be honest about your update cadence before rearchitecting.
Batch works when the content update cycle is longer than the staleness window. Product docs that ship biweekly? A nightly job gives you 13 days of margin. Legal documents reviewed quarterly? You could run the job monthly and still be ahead.
Streaming earns its complexity in three scenarios:
- Live knowledge bases — wikis, Confluence spaces, Notion databases — where edits land throughout the day and users expect current answers within minutes
- Customer-facing support — ticket histories, policy changes, product updates that a bot should reflect before the next shift starts
- Compliance workloads — CDC produces a natural audit log of every change and when its embedding was computed, which turns out to be painful to reconstruct from batch job metadata
If your corpus updates less than once a day, keep the cron job. No shame in it.
## The DAG you get to delete
The most tangible payoff isn't freshness or cost — it's killing orchestration code. No retry logic for half-failed embedding batches. No monitoring for DAG duration creep as the corpus doubles year over year. No manual reruns when the 2 AM job slams into an API rate limit and your on-call engineer gets a page they can't do anything about until morning.
The materialized view is the pipeline. The streaming database is the scheduler. If you're already running RisingWave or Materialize for analytics workloads, adding an embedding view is one SQL statement. If you're not running one yet, deploying a streaming database solely for this is probably overkill — but in my experience it doesn't stay a single use case for long.
The broader pattern here is that embedding infrastructure is a data pipeline problem, not an ML problem. The sooner teams treat it that way — with proper CDC, incremental processing, and change-proportional cost — the sooner they stop burning money re-embedding documents that haven't moved.