
Real-Time Data Processing with Kafka and Flink

10 min read
by Akshay
Data Engineering · Kafka · Flink · Distributed Systems


When your data platform needs to process billions of events per day with near real-time latency, traditional batch processing won't cut it. Here's how we built a streaming data platform that scales.

The Challenge

Our requirements:

  • Process 5B+ events daily
  • <1 second end-to-end latency
  • Exactly-once semantics
  • Complex event processing (CEP)
  • Integration with existing data warehouse

Architecture

Event Sources → Kafka → Flink → ClickHouse
                  ↓             ↓
              S3 Archive    Real-time Dashboards

Why Kafka?

Kafka provides the durability and scalability we need (topic setup is sketched after this list):

  • Horizontal scaling: Add brokers as load increases
  • Replication: Fault tolerance built-in
  • Retention: Historical replay for debugging
  • Schema evolution: Integration with Schema Registry
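
To make the replication and retention settings concrete, here is a minimal topic-setup sketch using Kafka's AdminClient. The broker address, topic name, partition count, and retention value are hypothetical placeholders, not our production values.

Properties props = new Properties();
props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092");

try (AdminClient admin = AdminClient.create(props)) {
    // 32 partitions, RF=3 for fault tolerance, 7-day retention for replay
    NewTopic events = new NewTopic("events", 32, (short) 3)
        .configs(Map.of("retention.ms", String.valueOf(7L * 24 * 60 * 60 * 1000)));
    admin.createTopics(List.of(events)).all().get();
}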

Why Flink?

Flink offers true stream processing with:

  • Exactly-once processing: Critical for financial data
  • Event time processing: Handle out-of-order events
  • Rich operators: Windows, joins, CEP patterns (windowing sketched after this list)
  • Savepoints: Zero-downtime deployments
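
Here's the windowing in a minimal sketch: a keyed one-minute tumbling window counting events per user. It assumes an existing DataStream<Event> named events; the Event type and its getUserId() accessor are hypothetical.

// Count events per user per minute of event time
events
    .keyBy(Event::getUserId)
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .aggregate(new AggregateFunction<Event, Long, Long>() {
        public Long createAccumulator() { return 0L; }
        public Long add(Event e, Long acc) { return acc + 1; }
        public Long getResult(Long acc) { return acc; }
        public Long merge(Long a, Long b) { return a + b; }
    })
    .print();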

Key Implementation Details

1. Schema Management

We use Confluent Schema Registry with Avro:

// Configure against the Schema Registry first (placeholder URL); deserialize() returns Object, so cast
KafkaAvroDeserializer deserializer = new KafkaAvroDeserializer();
deserializer.configure(Map.of("schema.registry.url", "http://schema-registry:8081"), false);
GenericRecord record = (GenericRecord) deserializer.deserialize(topic, data);
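
Inside the Flink job itself, the same registry can back the Kafka source via a registry-aware deserialization schema. A minimal sketch, assuming the flink-avro-confluent-registry dependency, a hypothetical registry URL, and an Avro Schema held in eventSchema:

// The schema ID embedded in each record is resolved against the registry
KafkaSource<GenericRecord> source = KafkaSource.<GenericRecord>builder()
    .setBootstrapServers("broker-1:9092")
    .setTopics("events")
    .setGroupId("stream-processor")
    .setValueOnlyDeserializer(
        ConfluentRegistryAvroDeserializationSchema.forGeneric(
            eventSchema, "http://schema-registry:8081"))
    .build();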

2. Watermark Strategy

Handling late events correctly:

// Tolerate up to 20 seconds of out-of-order arrival
WatermarkStrategy<Event> watermarks = WatermarkStrategy
    .<Event>forBoundedOutOfOrderness(Duration.ofSeconds(20))
    .withTimestampAssigner((event, timestamp) -> event.getTimestamp());
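
The strategy is then attached where the stream enters Flink. One caveat: a quiet partition can hold the watermark back for everyone, so we guard with an idleness timeout. The one-minute value and the source name here are assumptions:

DataStream<Event> events = env.fromSource(
    source,
    watermarks.withIdleness(Duration.ofMinutes(1)),  // idle partitions won't stall event time
    "kafka-events");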

3. Exactly-Once Semantics

Configure both Kafka and Flink for exactly-once:

env.enableCheckpointing(60000);  // checkpoint every 60 seconds
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
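
Checkpointing covers Flink's internal state; end-to-end exactly-once into Kafka also needs a transactional sink (and read_committed consumers downstream). A minimal sketch using Flink's KafkaSink, with hypothetical broker, topic, and prefix names:

KafkaSink<String> sink = KafkaSink.<String>builder()
    .setBootstrapServers("broker-1:9092")
    .setRecordSerializer(KafkaRecordSerializationSchema.builder()
        .setTopic("enriched-events")
        .setValueSerializationSchema(new SimpleStringSchema())
        .build())
    .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)  // two-phase commit tied to checkpoints
    .setTransactionalIdPrefix("stream-processor")          // must be unique per job
    .build();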

Scaling Strategies

  1. Partition by key: Ensure related events stay on the same partition (see the sketch after this list)
  2. Replication factor: RF=3 for critical topics
  3. Parallelism: Match Flink parallelism to Kafka partitions
  4. Resource isolation: Separate clusters for different use cases
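
Keyed partitioning in practice: Kafka's default partitioner hashes the record key, so every event for a given key lands on the same partition and preserves order. A producer-side sketch with hypothetical topic, key, and payload:

// Same userId → same partition → per-user ordering preserved
producer.send(new ProducerRecord<>("events", event.getUserId(), payload));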

Monitoring

Critical metrics we track:

  • Consumer lag (per partition; see the lag sketch after this list)
  • Processing latency (end-to-end)
  • Checkpoint duration and failures
  • Data skew across partitions
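
Per-partition lag can be computed by diffing the group's committed offsets against each partition's latest offset. A minimal AdminClient sketch, assuming the job commits offsets on checkpoint and using a hypothetical group id:

try (AdminClient admin = AdminClient.create(props)) {
    Map<TopicPartition, OffsetAndMetadata> committed =
        admin.listConsumerGroupOffsets("stream-processor")
             .partitionsToOffsetAndMetadata().get();
    Map<TopicPartition, OffsetSpec> latest = committed.keySet().stream()
        .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
    // lag = latest end offset - committed offset, per partition
    admin.listOffsets(latest).all().get().forEach((tp, end) ->
        System.out.printf("%s lag=%d%n", tp, end.offset() - committed.get(tp).offset()));
}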

Results

After optimization:

  • Throughput: 60K events/second sustained
  • Latency: p99 < 500ms
  • Availability: 99.95% uptime
  • Cost: 40% reduction vs. previous solution

Lessons Learned

  1. Start small: Begin with a single use case and expand
  2. Monitor everything: You can't fix what you can't measure
  3. Plan for failures: Networks partition, disks fail, bugs happen
  4. Schema evolution: Plan for schema changes from day one
  5. Backpressure: Design for it, test for it, handle it gracefully

The complete code and infrastructure setup are available in my projects.