Real-Time Event-Driven Architectures for Live Odds (Kafka Case Study)

The 20:07 Incident That Forced Our Hand

It was derby night. At 20:07, traffic jumped. We hit 1.8 million requests per second. Our p99 moved past 380 ms. Odds on two hot markets went stale for 19 seconds. That may not sound like much. In live betting, it is an age. Books pulled lines. Traders yelled. Social flagged us. We had risk on the table and no safe out.

We did not lose money that night. We were lucky. We had good kill switches. But the root cause was clear. Our sync calls and shared caches could not keep up. We needed a change. Within two weeks, we drew a new plan around Kafka and streams. Over the next season, we took p99 from ~420 ms to ~140 ms at 1.2M msgs/s. We cut stale odds by 92%. This is the story of what we built, why, and what we learned.

The Business Constraint Nobody Talks About

Live odds have an odd shape. The house risk is not even. A late or wrong price can swing a game. Rules and audits add stress. Every write and read can be logged and kept. You must show clean data flow and the time it took. You must show who changed what and when. You must fix bugs fast and prove you did.

So the work is not just speed. It is also trust. Bettors want clear, fair, and fast lines. Traders want tools they can trust. Regulators want a trail that stands up. We keep all three in mind as we design. It pays off when things break. It also makes it easier to sleep.

Architecture Sketch Before Definitions

Here is the shape we moved to. Feeds come in from partners and from our own apps. We also use CDC from our OLTP stores. Producers write to Kafka topics. We use enough Kafka topic partitions to spread load. Stream jobs build odds, merge feeds, and score risk. Jobs keep state in local stores and in a fast cache. We fan out to clients via WebSockets and gRPC. We use backpressure when fans get too hot. We watch the whole line with metrics and traces. We have kill switches and throttles we can flip in seconds.

Simple napkin math for our latency budget:

  • ingest and parse: 10–20 ms
  • produce to broker: 5–15 ms
  • broker to consumer: 5–15 ms
  • stream compute and joins: 30–70 ms
  • state read/write: 5–10 ms
  • cache set and hot path read: 2–5 ms
  • fan-out and network to user: 50–120 ms

Goal: p99 under 200–250 ms at peak. We meet this with headroom most days.
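The napkin math above can be sanity-checked in a few lines. The numbers are the illustrative ranges from this section, not measurements:

```python
# Latency budget stages as (min_ms, max_ms), copied from the list above.
BUDGET = {
    "ingest and parse": (10, 20),
    "produce to broker": (5, 15),
    "broker to consumer": (5, 15),
    "stream compute and joins": (30, 70),
    "state read/write": (5, 10),
    "cache set and hot path read": (2, 5),
    "fan-out and network to user": (50, 120),
}

def total_budget(budget):
    """Sum best-case and worst-case end-to-end latency in ms."""
    best = sum(lo for lo, _ in budget.values())
    worst = sum(hi for _, hi in budget.values())
    return best, worst

best, worst = total_budget(BUDGET)
print(f"end-to-end: {best}-{worst} ms")  # 107-255 ms
```

The worst case lands right at the 250 ms ceiling, which is why the fan-out stage gets most of the tuning attention: it owns roughly half the budget.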

Only Now: What “Event-Driven” Means for Live Odds

Event-driven means the log is king. Each change is a fact. We do not call a sync API to pull state. We react to new events as they come. We can replay old events to rebuild state. We can branch the stream to test a new rule. We can add a sink later without the source even knowing.

For more on why the log model works, see The Log: what every software engineer should know. For live odds, this model helps with four hard needs: idempotency, order per key (like match_id or market_id), replay for bugs and audits, and schema change without big bangs.
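Order per key falls out of keyed partitioning: all events with one key land on one partition, and a partition is an ordered log. A minimal sketch of the property (Kafka's default partitioner hashes key bytes with murmur2; md5 here is a stand-in that shows the same behavior):

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition. Same key -> same partition, always,
    so all updates for one market stay in produced order relative
    to each other. (Kafka's real default partitioner uses murmur2.)"""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Every update keyed by this market_id hits the same partition.
p = partition_for("derby-ou-2.5", 45)
assert all(partition_for("derby-ou-2.5", 45) == p for _ in range(100))
```

The flip side is the hot-partition risk covered in the failure-mode table below: one derby market can swamp its partition, which is why we shard hot keys.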

Contracts Over Chaos: Schemas, CDC, and Replay

Odds and markets change a lot. We lock data shape with a schema contract. We put schemas in a registry and enforce rules. That means a bad change fails fast, before it hits prod. We set backward or full rules per topic. See Schema Registry compatibility rules for the knobs we use.
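The CI check boils down to one rule for backward compatibility: a new reader must still decode old data, so any field the new schema adds needs a default. A simplified sketch of that check (not the real Schema Registry algorithm, just the core idea):

```python
def added_fields_break_backward(old_fields: dict, new_fields: dict) -> list:
    """Simplified backward-compatibility check: return the names of
    fields added in the new schema that have no default. Old data
    lacks those fields, so a new reader could not fill them in.
    Fields map name -> spec, e.g. {"type": "string", "default": ...}."""
    return [
        name for name, spec in new_fields.items()
        if name not in old_fields and "default" not in spec
    ]

old = {"market_id": {"type": "string"}, "price": {"type": "double"}}
new_ok = dict(old, suspended={"type": "boolean", "default": False})
new_bad = dict(old, source={"type": "string"})  # no default: fails CI

assert added_fields_break_backward(old, new_ok) == []
assert added_fields_break_backward(old, new_bad) == ["source"]
```

In practice we let the registry do this, but having the rule in a unit test keeps schema authors honest before they ever talk to the registry.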

We stream DB changes with CDC. It keeps us close to source truth. We use Debezium to tail our OLTP logs. Here is the doc for that: Debezium CDC. CDC helps us avoid hot read loops on the DB. It also gives us a clear audit line from write to odds update.

Replay is a first-class task, not an afterthought. We store enough history on key topics to reprocess a week or more. We keep a compacted topic for current state. We tag replays so they do not hit fan-out by mistake. We test every new change on replay first, then go live.
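The "tagged so they do not hit fan-out" part is a one-line gate at the gateway. A sketch, assuming a record-header convention (the `x-replay` header name is ours, not a Kafka standard):

```python
def should_fan_out(headers: dict) -> bool:
    """Gate at the fan-out layer: replayed events rebuild state stores
    and caches, but must never be pushed to live clients. We tag them
    with an "x-replay" record header at produce time (our convention)."""
    return headers.get("x-replay", "false") != "true"

live = {"content-type": "application/json"}
replayed = {"content-type": "application/json", "x-replay": "true"}
assert should_fan_out(live) is True
assert should_fan_out(replayed) is False
```

The gate sits in exactly one place, the gateway, so a replay job cannot leak to users no matter which topic it writes to.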

Stream Processing Choices

We tried three paths: Kafka Streams, Flink, and ksqlDB. All can work. Each has a strong side.

  • Kafka Streams is in-process and lean. It shines for per-key state and joins. It gives us EOS today. See Kafka Streams exactly-once semantics.
  • Flink is great for complex time work, large joins, and custom windows. It has strong event-time controls. See Flink event-time and watermarks.
  • ksqlDB is fast to build with if your use case fits. You can do joins and windows with SQL. See ksqlDB time and windows.

We picked Kafka Streams for core odds. We used Flink for heavy joins on stats feeds. We used ksqlDB for light enrich and for quick views for ops.

Failure Modes vs Controls

We built our runbooks from painful nights and good books. A key one is Designing Data-Intensive Applications. The table below shows how we match common fails with controls that work in the real world.

Failure mode | Symptoms | Controls | Response playbook | Times seen
Hot partition skew | One partition 10× lag; broker CPU high | Custom keying; add partitions; repartition topics | Re-key by market_id+shard; roll out a repartition topic; rebalance clients | 4
Out-of-order events | Negative join latencies; wrong window counts | Event-time + watermarks + grace | Raise allowed lateness by X ms; widen grace; audit source clocks | 3
Duplicate writes to sink | Dedupe hits spike; client sees flicker | Idempotent producer; EOS transactions | Verify enable.idempotence=true; set transactional.id; ensure EOS in sink | 5
Consumer lag creep | p99 up; time-in-queue grows each hour | Parallelism; backpressure; quotas | Scale consumers; add partitions; tune fetch.min.bytes and max.poll.records | 4
Schema break | Deser errors; red bars in logs | Schema Registry compat checks | Roll back producer; publish a backward schema; reprocess bad window | 5
Broker disk full | ISR shrink; produce errors; throttling | Retention; tiered storage; alerts | Cut retention on dead topics; purge; offload; add brokers if near cap | 4
Mass fan-out storm | Gateway CPU pegged; dropped sockets | Backpressure; token bucket; cache TTLs | Enable shed policy; cut update rate; widen per-client min interval | 3
Clock drift at source | Weird late windows; jumps in event time | Watermark strategy; NTP checks | Flag source; switch to safe feed; enforce NTP in partner SLA | 3
Zombie consumers (no rebalance) | Offsets stale; no heartbeats | Cooperative rebalancing; session timeouts | Drop lagging group; restart with jitter; review max.poll.interval.ms | 4
Cross-region split brain | Odd mismatches; A/B drift | Cluster linking; write fencing | Force writes to leader region; fence the loser; backfill with MM2 | 2

Serving Live Odds Without Drowning Clients

Odds change fast. Clients can choke if we push too much. We use a Redis tier for hot keys and fan-out control. See Redis caching patterns for common setups. We send deltas, not full views, when we can. We batch small updates for a few ms to cut chatter. We use WebSockets for rich views and SSE for light feeds. Cloud edges help with last-mile hops; see how Cloudflare runs WebSockets at scale.

We cap fan-out per client. We set a min time between pushes per market. We drop low rank updates first. We give traders a higher tier than public apps when the house is at risk. We expose a simple backoff hint so apps can ease pull or push rates when we are hot.
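The caps above reduce to two rules at the gateway: a minimum interval per client+market pair, and shedding low-rank updates first when we run hot. A sketch of that policy, not production code:

```python
import time

class PushGate:
    """Per-client, per-market push throttle: enforce a minimum gap
    between pushes, and drop low-rank updates first when the
    gateway is hot. Illustrates the policy described above."""

    def __init__(self, min_interval_s: float, min_rank_when_hot: int):
        self.min_interval_s = min_interval_s
        self.min_rank_when_hot = min_rank_when_hot
        self.last_push = {}  # (client_id, market_id) -> last push time

    def allow(self, client_id, market_id, rank, hot, now=None):
        now = time.monotonic() if now is None else now
        if hot and rank < self.min_rank_when_hot:
            return False  # shed low-priority updates under load
        key = (client_id, market_id)
        if now - self.last_push.get(key, float("-inf")) < self.min_interval_s:
            return False  # too soon for this client+market
        self.last_push[key] = now
        return True

gate = PushGate(min_interval_s=0.25, min_rank_when_hot=5)
assert gate.allow("c1", "m1", rank=9, hot=False, now=0.0)
assert not gate.allow("c1", "m1", rank=9, hot=False, now=0.1)  # within 250 ms
assert gate.allow("c1", "m1", rank=9, hot=False, now=0.4)
assert not gate.allow("c2", "m1", rank=2, hot=True, now=0.4)   # shed when hot
```

The trader tier is just a second `PushGate` with a shorter interval and a lower shed threshold; public apps get the stricter one.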

Observability, SLOs, and the Ugly Truth

We measure the user path end to end. We trace from feed to device. We put trace IDs on events. We add exemplars to key metrics. We use OpenTelemetry to keep this standard. Alerts map to SLOs, not raw noise. Here is a clean intro to SLOs: Service Level Objectives.

Dashboards show red or green, not a rainbow. We cut alert spam with sane windows and burn rate rules. We keep a runbook for each red widget. We drill on-call handoffs. We test pages on Fridays at noon, not at night. We use Grafana alerting for most ops. If you use a vendor, this guide helps: Datadog Kafka integration.

Price of Real-Time: Cost and Capacity Math

Real time is not cheap. But you can plan it. Start with peak rate, average size, and your SLA. Then pick a safe per-partition rate and work back to a partition count. This post is a good primer: How to choose Kafka partitions.

Here is a quick guide with round numbers:

  • Peak ingest: 1.2M msgs/s, average 350 bytes
  • Safe per-partition rate for small events: 30k–60k msgs/s (pick low to start)
  • Target partitions for core topic: ceil(1,200,000 / 40,000) = 30
  • Add 30–50% headroom for bursts and rebalances → 40–45 partitions
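The arithmetic above, as a reusable helper (the 40% headroom is the midpoint of the 30–50% range):

```python
import math

def partition_count(peak_msgs_per_s, safe_per_partition, headroom=0.4):
    """Work back from peak rate to a partition count, then pad for
    bursts and rebalances. Returns (base, with_headroom)."""
    base = math.ceil(peak_msgs_per_s / safe_per_partition)
    return base, math.ceil(base * (1 + headroom))

base, padded = partition_count(1_200_000, 40_000, headroom=0.4)
print(base, padded)  # 30 42
```

Remember this is a starting point, not an answer: measure real per-partition throughput with your event sizes and compression, then adjust.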

Other cost knobs:

  • Compression can halve egress. Start with lz4. Watch CPU.
  • linger.ms and batch.size cut syscalls. Keep p99 in view.
  • Rack aware leaders lower cross-zone costs.
  • Tiered storage saves on long retention.
  • Cold standby beats hot multi-active if your SLA allows a short failover.

Risk, Compliance, Integrity

Rules matter in this space. The UK has strict tech rules for remote play. See the Remote Technical Standards. They care about uptime, change control, game logs, and fair play. Your system must show a full trail and prove that odds are based on facts, not on hidden rules.

Trust is not a buzzword. Players pick brands that pay fast, show clear status, and keep lines open. In some markets, users even search for guides on fast deposits. For Finnish users, this shows up as talletus kasinolle. Good tech helps with this. Clean events mean clean ledgers. Clear SLOs mean clear ETAs for pay and cash-out. Note: if you use affiliate links, say so. It builds trust.

On the operator side, integrity means controls. Every change should link to a ticket. Every deploy should have a diff and a plan. Every bet should tie back to source events. We keep a replay runbook ready so we can prove what happened in any market at any time.

What We’d Redo After Two Seasons

We learned a lot by pain. We put all markets in one hot topic at first. It was a mistake. We re-keyed by market and shard to avoid skew. We also kept some legacy sync paths too long “just in case.” They became a trap under load. We cut them after a nasty night.

We over-alerted. Pager duty was loud and unclear. We cut alerts by 60% and made burn rules. We also moved from hard deploys to a cheap shadow path first. Netflix’s story here rings true: Keystone: real-time stream processing at Netflix. We wish we had done that sooner.

Field Notes, Not Myths

  • Exactly-once is real in Kafka land, but only if all hops play along. Use enable.idempotence=true, acks=all, and a unique transactional.id per app instance. For sinks, use transactional sinks or idempotent keys.
  • Event-time is your friend. Source time can drift. Set watermarks. Give a grace window. Keep late data paths visible.
  • CDC is not free. Backfills hurt. Put per-table limits. Use filters. Test failover on a weekday.
  • Backpressure saves you. Quotas on brokers; max.in.flight lower than you think; client buffers with sane caps.
  • Reprocess often, not just in a crisis. Make it boring. Boring is good.

Architecture Diagram (What to Draw)

Boxes, left to right: partner feeds and in-house apps; CDC from the OLTP stores; producers into partitioned Kafka topics; stream jobs with local state stores and a fast cache; fan-out gateways over WebSockets and gRPC; clients. Draw metrics and traces across the whole line, and mark where the kill switches and throttles sit.

Field-Ready Checklist

  • Keys: Pick match_id or market_id. Shard hot markets. Plan repartition topics.
  • Schemas: Add a registry. Enforce backward or full rules. Add CI checks.
  • Producers: enable.idempotence=true, acks=all, linger.ms tuned, compression=lz4.
  • Consumers: cooperative rebalance; sane max.poll.interval.ms; max.poll.records set by CPU.
  • Streams: EOS v2 on; state stores on fast disk; checkpoint path with space alerts.
  • Backpressure: quotas on brokers; per-client update caps; token bucket on gateways.
  • Cache: set TTL per market; delta push where safe; hit rate SLI.
  • Replay: time and compact topics; tag replays; keep a “no fan-out” replay mode.
  • Observability: traces from feed to device; exemplars on p99; SLO burn alerts.
  • Resilience: MM2 or cluster linking; kill switches; chaos drills every sprint.
  • Compliance: audit log on odds changes; change tickets; data retention map.

FAQ

Can Kafka hit sub-100 ms end to end for live odds?

Sometimes, yes. Under light load, small events, warm cache, and local users, we see 80–120 ms p99. At scale and across regions, 150–250 ms is a safer goal. The last mile is often the slowest.

Kafka Streams or Flink for live odds?

Use Kafka Streams for lean per-key logic, joins, and low ops work. Pick Flink for big joins, complex time logic, or if you need a full job manager and custom windows. You can run both. We do.

How do you stop stale or wrong odds from reaching users?

Add source checks, idempotent writes, EOS, and grace windows. Keep a kill switch per market. Put a “good until” time on each odds view. Expire views fast on the client if you must.
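The "good until" idea is a one-field contract between server and client. A sketch, assuming a `good_until` epoch timestamp on each view (the field name is our convention):

```python
import time

def is_servable(odds_view: dict, now=None) -> bool:
    """Client-side staleness guard: every odds view carries a
    'good_until' epoch timestamp (our convention); expired views
    are hidden rather than shown stale."""
    now = time.time() if now is None else now
    return now < odds_view["good_until"]

view = {"market_id": "m1", "price": 2.15, "good_until": 1_000.0}
assert is_servable(view, now=999.0)
assert not is_servable(view, now=1_001.0)
```

This moves the last line of defense onto the device: even if every server-side control fails, a stale price disappears instead of taking bets.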

What is a good starting partition count?

Start from peak msgs/s and a safe per-partition rate. Many teams pick 30k–60k msgs/s per partition for small events. Add 30–50% headroom. Measure. Adjust. See the Confluent post linked above for more.

How do you replay history without breaking live?

Tag replay events. Route them to a shadow path. Disable fan-out for replays. Backfill state stores and caches first. Then switch traffic, or just compare views.

Postmortem Snippets (What Bit Us)

  • A single-threaded JSON decode blocked a hot path. We moved to pooled decode and cut p99 by 30 ms.
  • A mis-set max.in.flight=5 caused reorders under retry. We set it to 1 for key topics. No more ghost flips.
  • We forgot NTP on one partner feed host. Watermarks went wild. We now check partner clocks hourly.

Small but High-Value Configs

  • Producer: enable.idempotence=true; acks=all; linger.ms=5–15; batch.size 32–64 KB; compression=lz4.
  • Consumer: max.poll.records=500–2000; max.poll.interval.ms tuned to job time; fetch.min.bytes=1–64 KB.
  • Broker: request.timeout.ms sane; quotas per client; rack awareness on; retention by SLA.
  • Security: TLS on; SASL; ACLs per topic; rotate keys; keep PII out of streams if you can.
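Collected as a starting-point properties file. These are this article's round numbers, not defaults to ship blind; tune each against your own p99:

```properties
# Producer starting point from the notes above.
enable.idempotence=true
acks=all
linger.ms=10
batch.size=49152                           # 48 KB, middle of the 32-64 KB range
compression.type=lz4
max.in.flight.requests.per.connection=1    # the postmortem above pins this to 1 on key topics

# Consumer starting point.
max.poll.records=1000
fetch.min.bytes=16384                      # 16 KB, middle of the 1-64 KB range
```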

A Note on Multi-Region

Active-active sounds cool. For odds, drift can cost you. We run active-passive for core write paths. We fan out local reads in both regions. We use MirrorMaker 2 or cluster linking for backfill and failover. We test failover a lot. We do it in the day. With coffee. With all hands on.

Closing Thoughts

Event-driven design fits live odds. Kafka gives you a log, order per key, and replay. Streams give you state and joins. A cache and smart fan-out keep users happy. Good SLOs and strong ops keep you sane. The rest is craft and care. Start small. Measure. Fix. Repeat. And write your runbooks like your bonus depends on them—because it does.

About the author

Lead data/platform engineer with 9+ years in streaming systems. Built and ran Kafka, Flink, and real-time APIs for betting and sports data at scale. Speaker at two data meetups. Writes about reliable streams and SLOs.

Disclosure: We may include third-party links. We do not take money to change our view. Any affiliate ties are noted.

Last updated: 2026-03-23
