Rishabh Rahul
← Back to all posts

The 600 Million Offset Trap

October 13, 2025
kafka

One of the scariest things in software engineering is flipping a feature flag and waking up to a monitoring dashboard flashing bright red.

I turned on a pre-existing configuration flag to enable a tenant filter on one of our Kafka consumers. The application didn't crash. No errors in the logs. Just silence — for about a day, until our monitoring tools started screaming. Consumer lag: over 600 million messages.

When you see a number like that, the first move isn't to start fixing things. It's to do some back-of-the-napkin math, because 600 million sounds catastrophic until you actually think about it.

Every message in Kafka gets a unique ID in its partition — an offset. Our average payload was roughly 2 KB. So:

600,000,000 × 2 KB = 1.2 Terabytes

Except our topic had a double-digit number of partitions, each capped at 1–3 GB of storage before Kafka starts deleting old data. Even if every partition was maxed out at 3 GB, we're nowhere near 1.2 TB. The data physically didn't exist anymore. We weren't drowning — we were looking at a ghost.
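That napkin math can be sketched in a few lines. The partition count and retention cap below are hypothetical stand-ins, a generous upper bound on the "double-digit partitions, 1–3 GB each" setup:

```java
public class NapkinMath {
    public static void main(String[] args) {
        // What 600M unacked offsets would weigh if the data still existed.
        long lag = 600_000_000L;
        long avgPayloadBytes = 2_000L;               // ~2 KB per message
        long impliedBacklog = lag * avgPayloadBytes; // 1.2 TB

        // What the topic could physically retain, at a generous upper bound:
        // say 99 partitions (hypothetical), each capped at 3 GB before
        // Kafka's retention policy deletes old segments.
        long partitions = 99;
        long retentionPerPartition = 3_000_000_000L;           // 3 GB
        long maxRetained = partitions * retentionPerPartition; // ~297 GB

        // The implied backlog is ~4x more data than the topic can even hold.
        System.out.println(impliedBacklog > maxRetained); // true
    }
}
```

Even at the most generous estimates, the "backlog" is several times larger than anything the topic could store, so most of those messages had to be gone already.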


The Real Problem Was Already There

The feature flag wasn't the root cause. The code behind it was.

I didn't write the filtering logic; it was inherited, dormant, never actually exercised in production. Whoever wrote it did what most of us do: tested the business logic. Checked that the right messages came through and the wrong ones got dropped. That part worked fine. What nobody tested was what that filtering did to consumer lag.

The code used Spring Kafka's RecordFilterStrategy: a hook that intercepts messages before they reach your listener. If a message doesn't pass the filter, the framework quietly discards it. Clean, invisible, no fuss.
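The wiring looks roughly like this sketch. The bean name, payload type, and tenant check are all illustrative, not our actual code; only `setRecordFilterStrategy` and the "return true to discard" contract come from Spring Kafka itself:

```java
import org.springframework.context.annotation.Bean;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;

class FilteringConfig {

    @Bean
    ConcurrentKafkaListenerContainerFactory<String, String> filteringFactory(
            ConsumerFactory<String, String> consumerFactory) {
        var factory = new ConcurrentKafkaListenerContainerFactory<String, String>();
        factory.setConsumerFactory(consumerFactory);
        // Returning true means "discard this record before the listener sees it".
        factory.setRecordFilterStrategy(record -> !isOurTenant(record.value()));
        return factory;
    }

    private boolean isOurTenant(String payload) {
        return payload.contains("\"tenantId\":\"42\""); // hypothetical tenant check
    }
}
```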

Our consumer was also using manual acknowledgment. Because we were doing heavy batch processing, the code was responsible for explicitly telling Kafka when a batch was done: "finished, move the offset forward."
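A manual-ack batch listener looks something like this sketch (topic name, types, and the processing call are placeholders; the container would be configured with `AckMode.MANUAL` and batch listening enabled):

```java
import java.util.List;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.support.Acknowledgment;

class BatchConsumer {

    @KafkaListener(topics = "events")
    void onBatch(List<ConsumerRecord<String, String>> records, Acknowledgment ack) {
        records.forEach(r -> process(r.value()));
        // Called once the whole batch is done; this is what actually
        // commits the offset and moves it forward.
        ack.acknowledge();
    }

    private void process(String payload) { /* heavy batch work elided */ }
}
```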

Those two things together were the problem.


What Kafka Actually Saw

The RecordFilterStrategy was throwing away messages outside our listener, so our application code never touched them. When we acknowledged a batch, we were only acknowledging the handful of messages that made it through. The millions the framework quietly dropped? Never acknowledged.

From Kafka's perspective, an unacknowledged offset is unfinished work. Lag is just the latest offset minus the last committed offset. We were leaving enormous gaps of silently dropped messages sitting there, uncommitted, and that number kept climbing — all the way to 600 million.
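As a concrete (entirely made-up) illustration of that formula, per partition:

```java
public class LagMath {
    public static void main(String[] args) {
        long logEndOffset = 650_000_000L;       // newest offset the broker has assigned
        long lastCommittedOffset = 50_000_000L; // where our group's commit is stuck
        // Lag counts every offset in between, including the silently
        // filtered messages we never acknowledged.
        long lag = logEndOffset - lastCommittedOffset;
        System.out.println(lag); // 600000000
    }
}
```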


The Fix

We ripped out the RecordFilterStrategy.

Opened the gate, let everything through, and moved the filtering inline, right inside the listener, before processing. Same business logic, same messages getting dropped, same messages getting processed. The only difference was where the drop happened.

With the full batch flowing through our code, the manual acknowledgment at the end covered everything: messages we processed, messages we skipped, all of it. We were telling the broker we'd seen the whole batch and were done with it.
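The fixed listener, as a sketch with the same caveats as before (names and the tenant check are hypothetical):

```java
import java.util.List;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.support.Acknowledgment;

class InlineFilteringConsumer {

    @KafkaListener(topics = "events")
    void onBatch(List<ConsumerRecord<String, String>> records, Acknowledgment ack) {
        for (ConsumerRecord<String, String> record : records) {
            if (!isOurTenant(record.value())) {
                continue; // same drop as before, but now it happens in our code
            }
            process(record.value());
        }
        // One ack covers every record in the batch, processed and skipped
        // alike, so the committed offset advances past the dropped messages.
        ack.acknowledge();
    }

    private boolean isOurTenant(String payload) {
        return payload.contains("\"tenantId\":\"42\""); // hypothetical tenant check
    }

    private void process(String payload) { /* business logic elided */ }
}
```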

The committed offsets caught up. The lag dropped to zero.


I still think about how long that code sat there waiting. Perfectly reasonable in isolation. A ticking clock the moment you paired it with manual acks.

When you inherit code, look at how it interacts with the framework and not just what it does on its own. And if you're managing Kafka offsets manually, you're responsible for the whole stream. Even the stuff you throw away.