Twitter processes billions of events daily and it isn’t easy doing it in real-time. They were using Apache Storm, an open source distributed realtime computation system. But with the increasing scale and data diversity, they couldn’t rely on Apache Storm anymore – so they got their hands dirty building a new system, Twitter Heron, an engine for real-time analytics/stream processing and is fully compatible with Storm APIs for smooth adoption.
It’s also important to understand for what use cases Apache Storm is widely used. Some of it are:
- Real-time stream processing: Updating databases in real-time by continuously processing a stream of data. Storm has an excellent built-in fault-tolerance and can scale well unlike the regular approach of processing real-time stream with a network of queues and workers,.
- Continuous computation: Storm can continuously query and stream the results to clients/listeners in real-time. A good example of it is to stream the topics that trend on Twitter into browsers.
- Distributed RPC: Storm can run an intense query in a parallel manner on the fly. The thought here is that the Storm topology is a distributed function that awaits invocation messages – reactive. When it receives an invocation, it calculates/computes the query and sends back the results. Examples of Distributed RPC are parallelizing search queries or doing set operations on large numbers of large sets.
Key motivations behind building Heron
Weighing the different design choices, Twitter’s engineering team thought it’d be best to rewrite the system from ground up keeping it compatible with Storm APIs, reusing and building upon some of the existing software components within Twitter. They wanted a system that
- scales better
- is easier to debug
- has better performance
- is easier to deploy and manage
- can work well in a shared multi-tenant cluster environment
Twitter Heron moved from a typical thread-based system to a process-based system. To make it efficient, maintainable and get the broader community’s acceptance, they used the industry-standard languages such as Java, C++ and Python.
One key thing to point out about Heron is that it’s designed from ground up for deployment in modern cluster environments by integrating with powerful open source schedulers, such as Mesos, Aurora (Mesos framework for long-running services and cron jobs), REEF (Retainable Evaluator Execution Framework), Slurm (Simple Linux Utility for Resource Management).
Heron satisfies one of the key motivations of “easier to debug” by running each task in isolation (a process of its own).
Heron’s source code is available at https://github.com/twitter/heron