Reliable real-time event processing with Kafka and Storm

by Matti Pehrs 

Thursday, 12 June 2014 13:00

At Spotify we produce tens of terabytes of data each day across several data centers on multiple countries. Trying to build and maintain a platform to transfer this amount of data each day is not easy. Traditional off-the-shelf messaging products, such as ActiveMQ or RabbitMQ would just break under the load in scenarios of this scale. Recently we deployed our new system based on Apache Kafka and Storm to enable us to transfer and process this data in near real-time.

Even with Kafka and Storm you will only get at-least-once delivery of messages. However our situation needs a stronger delivery guarantee than that. We have implemented an end-to-end acknowledgement subsystem on top of our Kafka that provides at-least-once semantics and supports most failure scenarios that are reasonably likely to occur.

In this presentation we discuss the architecture of our system, the rationale behind it and the challenges getting it into production, covering TCP security issues, JVM GC tuning and monitoring.