Spark Streaming: Fault-tolerant Streaming Computation at Scale
Many “big data” applications need to act on data arriving in real time.
Running these applications at ever-larger scales requires parallel
execution platforms that automatically handle faults and stragglers.
Unfortunately, current distributed stream processing models provide
fault recovery in an expensive manner, requiring hot replication or long
recovery times, and do not handle stragglers. We propose a new
processing model, discretized streams (D-streams), that overcomes these
challenges. D-streams support a parallel recovery mechanism that
improves efficiency over the traditional replication and upstream backup
schemes in streaming databases, and also handles stragglers. We show
that D-streams can support a rich set of streaming operators while
attaining high per-node throughput similar to single-node systems,
linear scaling to 100 nodes, sub-second latency, and sub-second fault
recovery. Finally, the D-stream model can seamlessly be composed with
batch and interactive query models for clusters (e.g. MapReduce),
enabling rich applications that combine these modes. We have implemented
D-streams in Spark Streaming, an extension to the Spark cluster
Matei Zaharia finishing his PhD at UC Berkeley, where he worked with
Scott Shenker and Ion Stoica on topics in large-scale data processing
and cloud computing. After Berkeley, he will be starting an assistant
professor position at MIT. During his PhD, Matei has also been an active
open source contributor, becoming a committer on the Apache Hadoop
project and starting the Mesos and Spark projects.