Hadoop vs Spark vs Flink

Data Processing

Hadoop Spark Flink
Hadoop was designed for batch processing, that means it takes large dataset in input, all at once, processes it and produces the result. Batch processing is very efficient in processing in high volume data. Depending on the size of the data being processed and the computational power of the system, output can be delayed significantly Apache Spark is also a part of Hadoop Ecosystem, it is a batch processing System at heart too but it also supports stream processing Flink provides single runtime for the streaming and as well batch processing so one common runtime is utilized for data streaming application and batch processing application

Streaming Engine

Hadoop Spark Flink
Map-reduce is batch- oriented processing tool. It takes large dataset in input, all at once, processes it and produces the result Spark Streaming processes data streams in micro-batches, where each batch contains a collection of events that arrived over the batch period. But it is not sufficient for use cases where we need to process large streams of live data and provide results in real time Apache Flink is the true streaming engine that uses streams for workloads: streaming, SQL, micro-batch and batch. Batch is a finite set of streamed data

Data Flow

Hadoop Spark Flink
MapReduce computation dataflow does not have any loops, it is a chain of stages; at each stage you progress forward using output of previous stage and producing input for the next stage. Though Machine Learning algorithm is a cyclic data flow, it is represented as direct acyclic graph inside the spark. Flink takes a different approach than others. It supports controlled cyclic dependency graph in run time. This helps it in representing the Machine Learning algorithms in a very efficient way

Computation Model

Hadoop Spark Flink
MapReduce adopted batch-oriented model. Batch is essentially processing data at rest, taking a large amount of data at once, processing it and then writing out the output Spark has adopted micro-batching. Micro-batches are an essentially “collect and then process” kind of computational model Flink has adopted a continuous flow, operator-based streaming model. A continuous flow operator processes data when it arrives, without any delay in collecting the data or processing the data

Performance

Hadoop Spark Flink
Hadoop supports batch processing only. It doesn’t process streamed data hence overall performance is slower when compared Though Apache Spark has an excellent community background and now It is considered as most matured community. But Its stream processing is not much efficient than Apache Flink as it uses micro-batch processing Overall performance of Apache Flink is excellent as compared to any other data processing system. Apache Flink uses native closed loop iteration operators which makes machine learning and graph processing more faster when we compare Flink and Spark and Hadoop

Memory management

Hadoop Spark Flink
Hadoop provides configurable Memory management. You can do it dynamically or statically Spark provides configurable memory management. The latest release of Spark has moved towards automating memory management Flink provides automatic memory management. It has its own memory management system, separate from Java’s garbage collector

Fault tolerance

Hadoop Spark Flink
MapReduce is highly fault tolerant. There is no need to restart the application from scratch in case of any failure in Hadoop Spark Streaming recovers lost work and with no extra code or configuration, it delivers exactly-once semantics out of the box The fault tolerance mechanism followed by Apache Flink is based on Chandy-Lamport distributed snapshots. The mechanism is lightweight, which results in maintaining high throughput rates and provide strong consistency guarantees at the same time

Scalability

Hadoop Spark Flink
MapReduce has incredible scalability potential and has been used in production on tens of thousands of Nodes Spark is highly scalable, we can keep adding n number of nodes in the cluster. A large known spark cluster is of 8000 nodes Flink is also highly scalable, we can keep adding n number of nodes in the cluster A large known Flink cluster is of thousands of nodes

Iterative Processing

Hadoop Spark Flink
Does not support iterative processing Spark iterates its data in batches In Spark, each iteration has to be scheduled and executed separately Flink iterates data by using its streaming architecture. Flink can be instructed to only process the parts of the data that have actually changed, thus significantly increasing the performance of the job

Language Support

Hadoop Spark Flink
Hadoop Supports Primarily Java, other languages supported are c, c++, ruby, groovy, Perl, python Spark supports java, Scala, python and R. Spark is implemented in scala, it provides API in other languages like Java, Python, and R. Flink Supports java, Scala, python and R. Flink is implemented in java. It does provide Scala API too

Optimization

Hadoop Spark Flink
In MapReduce, jobs have to be manually optimized. There are several ways to optimize the MapReduce Jobs: Configure your cluster correctly, use a combiner , use LZO compression, tune the number of MapReduce Task appropriately and use the most appropriate and compact writable type for your data In Apache Spark, jobs have to be manually optimized. There is a new extensible optimizer, Catalyst, based on functional programming construct in scala. Catalyst’s extensible design had two purposes: First, easy to add new optimization techniques. Second, enable external developers to extend the optimizer catalyst Flink comes with an optimizer that is independent with actual programming interface. The Flink optimizer works similarly to a relational Database Optimizer, but applies these optimizations to the Flink programs, rather than SQL queries

Latency

Hadoop Spark Flink
The MapReduce framework of Hadoop is relatively slower since it is designed to support different format, structure and huge volume of data. That’s why Hadoop has higher latency than both spark and Flink Apache Spark is yet another batch processing system but it is relatively faster than Hadoop MapReduce since it caches much of the input data on memory by RDD and keeps intermediate data in memory itself, eventually writes the data to disk upon completion or whenever required. With minimum efforts in configuration, Apache Flink’s data streaming runtime achieves low latency and high throughput

Processing Speed

Hadoop Spark Flink
MapReduce processes slower than spark and flink. The slowness occurs only because of the nature of the MapReduce based execution, where it produces lots of intermediate data, much data exchanged between nodes, thus causes huge disk IO latency. Furthermore, it has to persist much data in disk for synchronization between phases so that it can support Job recovery from failures. Also, there are no ways in MapReduce to cache all subset of the data in memory Spark processes faster than MapReduce because it caches much of the input data on memory by RDD and keeps intermediate data in memory itself, eventually writes the data to disk upon completion or whenever required. Spark is 10x times faster than mapreduce and this shows how spark is better than Hadoop MapReduce Flink processes faster than Spark because of its streaming architecture. Flink can be instructed to only process the parts of the data that have actually changed, thus significantly increasing the performance of job

Visualization

Hadoop Spark Flink
Hadoop data visualization tool is zoomdata that can connect directly to HDFS as well as to SQL-on-Hadoop technologies such as Impala, Hive, Spark SQL, Presto and more Spark offers a web interface for submitting and executing jobs on which the resulting execution plan can be visualized. Flink and Spark both are integrated to Apache zeppelin It provides data analytics, ingestion, as well as discovery, visualization, and collaboration Flink also offers a web interface for submitting and executing jobs. The resulting execution plan can be visualized on this interface

Recovery

Hadoop Spark Flink
MapReduce is naturally resilient to system faults or failures. It is highly fault tolerant system Spark RDDs allow recovery of partitions on failed nodes by re-computation of the DAG while also supporting a more similar recovery style to Hadoop by way of checkpointing, to reduce the dependencies of RDDs Flink supports checkpointing mechanism that stores the program in the data sources and data sink, the state of window, as well as user-defined state that recovers streaming job after failure

Security

Hadoop Spark Flink
Hadoop supports Kerberos authentication, which is somewhat painful to manage. However, third party vendors have enabled organizations to leverage Active Directory Kerberos and LDAP for authentication. Spark’s security is a bit sparse by currently only supporting authentication via shared secret (password authentication). The security bonus that Spark can enjoy is that if you run Spark on HDFS, it can use HDFS ACLs and file-level permissions. Additionally, Spark can run on YARN to use Kerberos authentication There is user-authentication support in Flink via the Hadoop / Kerberos infrastructure. If you run Flink on YARN, Flink acquires the Kerberos tokens of the user that submits programs, and authenticate itself at YARN, HDFS, and HBase with that.Flink’s upcoming connector, streaming programs can authenticate themselves as stream brokers via SSL

Cost

Hadoop Spark Flink
MapReduce can typically run on less expensive hardware than some alternatives since it does not attempt to store everything in memory As spark requires a lot of RAM to run in-memory, increasing it in cluster, gradually increases its cost. Flink also requires a lot of RAM to run in-memory, so it will increase its cost gradually.

Compatibility

Hadoop Spark Flink
Hadoop MapReduce and Apache Spark are compatible with each other and Spark shares all MapReduce’s compatibilities for data sources, file formats and business intelligence tools via JDBC and ODBC Spark and hadoop are compatible to each other. Spark is compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark’s standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat Flink is a scalable data analytics framework that is fully compatible to Hadoop. It provides a Hadoop Compatibility package to wrap functions implemented against Hadoop’s MapReduce interfaces and embed them in Flink programs

Interactive Mode

Hadoop Spark Flink
MapReduce does not have interactive Mode Spark has an interactive shell to learn how to make the most out of Apache Spark. This is a Spark application written in Scala to offer a command-line environment with auto-completion where you can run ad-hoc queries and get familiar with the features of Spark Flink comes with an integrated interactive Scala Shell. It can be used in a local setup as well as in a cluster setup

Real time Analysis

Hadoop Spark Flink
MapReduce fails when it comes to real-time data processing as it was designed to perform batch processing on voluminous amounts of data It can process real time data ie data coming from the real-time event streams at the rate of millions of events per second It is mainly used for real-time data Analysis Although it also provides fast batch data Processing

Abstraction

Hadoop Spark Flink
In Mapreduce, we don’t have any type of abstraction In Spark, for batch we have Spark RDD abstraction and DStream for streaming which is internally RDD itself In flink, we have Dataset abstraction for batch and DataStreams for the streaming application

Machine Learning

Hadoop Spark Flink
Hadoop requires machine learning tool like Apache Mahout Spark has its own set of machine learning MLlib. Within memory caching and other implementation details, it’s really powerful platform to implement ML algorithms Flink has FlinkML which is Machine Learning library for Flink. It supports controlled cyclic dependency graph in runtime. This makes them represent the ML algorithms in a very efficient way compared to DAG representation

Scheduler

Hadoop Spark Flink
Scheduler in Hadoop becomes the pluggable component. There are two schedulers for multi user workload: fair Scheduler and capacity Scheduler. To schedule complex flows, MapReduce needs an external job scheduler like Oozie. Due to in-memory computation, spark acts its own flow scheduler. Can be configured with YARN Scheduler Flink can use YARN Scheduler but Flink also has its own Scheduler

SQL support

Hadoop Spark Flink
It enables users to run SQL queries using Apache Hive It enables users to run SQL queries using Spark-SQL. Spark provides both Hive like query language and Dataframe like DSL for querying structured data In Flink, Table API is an SQL-like expression language that supports data frame like DSL and it’s still in beta. There are plans to add the SQL interface but not sure when it will land in the framework

Caching

Hadoop Spark Flink
MapReduce cannot cache the data in memory for future requirements Spark can cache data in memory for further iterations which enhance its performance Flink can cache data in memory for further iterations to enhance its performance

Deployment

Hadoop Spark Flink
In Standalone mode, Hadoop is configured to run in a single-node, non-distributed mode. In pseudo Distributed mode, Hadoop runs in a pseudo distributed mode. The difference is that each Hadoop daemon runs in a separate java process in pseudo-distributed mode. Whereas in local mode each Hadoop daemon runs as a single java process. In a fully-distributed mode, all daemons are executed in separate nodes forming a multi-node cluster In addition to running on the Mesos or YARN cluster managers, Spark also provides a simple standalone deploy mode. It can be launched either manually, by starting a master and workers by hand or use our provided launch scripts. It is also possible to run these daemons on a single machine for testing In addition to running on YARN cluster Managers, Flink also provides standalone deploy mode

Duplication elimination

Hadoop Spark Flink
There is no duplication elimination in Hadoop Spark also process every record exactly one time hence eliminates duplication. Apache Flink processes every record exactly one time hence eliminates duplication. Streaming applications can maintain custom state during their computation. Flink’s checkpointing mechanism ensures exactly once semantics for the state in the presence of failures

Window criteria

A data stream needs to be grouped into multiple logical streams on each of which a window operator can be applied.

Hadoop Spark Flink
Hadoop doesn’t support streaming so there is no need of window criteria Spark has time-based window criteria Flink has record-based or any custom user-defined Flink Window criteria

Back pressure Handing

BackPressure refers to the buildup of data at an I/O switch when buffers are full and not able to receive additional data. No additional data packets are transferred until the bottleneck of data has been eliminated or the buffer has been emptied.

Hadoop Spark Flink
Hadoop handles back pressure through Manual Configuration Spark also handles back pressure through Manual Configuration Flink handles back pressure Implicitly through System Architecture

Hardware Requirements

Hadoop Spark Flink
MapReduce runs very well on commodity Hardware Spark needs mid to high-level hardware because Spark cache data in memory for further iterations which enhance its performance Flink also needs mid to High-level Hardware. Flink can also cache data in memory for further iterations which enhance its performance.

High Availability

Hadoop Spark Flink
Configurable in High Availability Mode Configurable in High Availability Mode Configurable in High Availability Mode

Amazon S3 connector
Amazon Simple Storage Service (Amazon S3) is object storage with a simple web service interface to store and retrieve any amount of data from anywhere on the web

Hadoop Spark Flink
Provides Supports for Amazon S3 Connector Provides Supports for Amazon S3 Connector Provides Supports for Amazon S3 Connector

Apache License

All the three are Apache Licensed.
The Apache License, Version 2.0 (ALv2) is a permissive free software license written by the Apache Software Foundation (ASF). The Apache License requires preservation of the copyright notice and disclaimer.


References

Apache Flink
Apache Spark
Apache Hadoop
Cloudera Blog
Hortonworks Blog