Apache Spark

From Verify.Wiki
Jump to: navigation, search

Apache Spark is a powerful open source processing engine built around speed, ease of use, and sophisticated analytics. It was originally developed at UC Berkeley in 2009[1] and later donated to Apache Software foundation.Apache Spark is a general execution engine suitable for both batch as well as real-time jobs unlike MapReduce which is only suited for batch jobs.Spark Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk [2]. Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing.

Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.

Spark runs on Hadoop YARN, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, S3 and even RDBM's.

Since its release, Spark has seen rapid adoption by enterprises across a wide range of industries. Internet powerhouses such as Yahoo, Baidu, and Tencent, have eagerly deployed Spark at massive scale, collectively processing multiple petabytes of data on clusters of over 8,000 nodes. It has quickly become the largest open source community in big data, with over 750 contributors from 200+ organizations[3] .


Spark was initially started by Matei Zaharia at UC Berkeley AMPLab in 2009, and open sourced in 2010 under a BSD license.[4]

In 2013, the project was donated to the Apache Software Foundation and switched its license to Apache 2.0. In February 2014, Spark became a Top-Level Apache Project.[5].A wide range of contributors now develop the project .

Version Original release date Latest Version Release date
0.5 2012-06-12 0.5.1 2012-10-07
0.6 2012-10-14 0.6.1 2012-11-16
0.7 2013-02-27 0.7.3 2013-07-16
0.8 2013-09-25 0.8.1 2013-12-19
0.9 2014-02-02 0.9.2 2014-07-23
1.0 2014-05-30 1.0.2 2014-08-05
1.1 2014-09-11 1.1.1 2014-11-26
1.2 2014-12-18 1.2.2 2015-04-17
1.3 2015-03-13 1.3.1 2015-04-17
1.4 2015-06-11 1.4.1 2015-07-15
1.5 2015-09-09 1.5.2 2015-11-09
1.5.2 2015-09-09 1.6.0 2016-01-04


  • Hadoop Spark has been said to execute batch processing jobs near about 10 to 100 times faster than the Hadoop MapReduce framework just by merely by cutting down on the number of reads and writes to the disc.[6]
  • Spark allows to perform Streaming, Batch Processing and Machine Learning all in the same cluster.With Spark it is possible to control different kinds of workloads, so if there is an interaction between various workloads in the same process it is easier to manage and secure such workloads which come as a limitation with MapReduce.
  • In case of Hadoop MapReduce you just get to process a batch of stored data but with Hadoop Spark it is as well possible to modify the data in real time through Spark Streaming.With Spark Streaming it is possible to pass data through various software functions for instance performing data analytics as and when it is collected.
  • Spark ensures lower latency computations by caching the partial results across its memory of distributed workers unlike MapReduce which is disk oriented completely.
  • Writing Spark is always compact than writing Hadoop MapReduce code.
  • Spark can also be used for graph processing .From advertising to social data analysis, graph processing capture relationships in data between entities, say people and objects which are then are mapped out.[7]
  • Spark is known as the Swiss army knife of Big Data Analytics. It is very popular for its speed, iterative computing and most importantly caching intermediate data in memory for better access.
  • Spark makes the Hadoop ecosystem an even more general-purpose platform that can support more industries and types of problems.
  • Apache Spark can run as standalone or on top of Hadoop YARN or Mesos on-premise or on the cloud. It supports data sources that implement Hadoop InputFormat, so it can integrate with all the data sources and file formats that are supported by Hadoop.Spark also supports JDBC/ODBC drivers.[8]
  • Sparm is fault tolerant,Spark has retries per task and speculative execution—just like MapReduce


  • Since Spark runs a job 100X faster as compared to MapReduce when data is in memory,to take full advantage of speed Spark requires huge amount of RAM to keep data in memory.
  • Lot of development is still in progress including lot of Bug fixes.


  • One of the biggest controversies surrounding spark since it's support for YARN is using it as a replacement for Map Reduce ,though MapReduce has been in production since long now and has sufficed all the batch processing use cases while taking into account speed/security .Since Apache Spark provides a single framework for Batch/Real time analytic and greatly reduces the complexity of complex Map Reduce jobs many companies like Cloudera is thinking of replacing MapReuce with Spark as default processing engine for hadoop[9].
  • Skepticism rises over the ODP Pivotal, one of ODP's founders, spoke about making sure ahead of time that all the pieces work well together, so those who create products for the Hadoop ecosystem don't have to go through an arduous certification process.But Gartner analysts Nick Heudecker and Merv Adrian are not buying the idea that Hadoop distributions are fragmented enough to need that kind of horizontal unification. "This simply institutionalizes a dichotomy in favor of a few favored players," they wrote. "Who wants it? As Cloudera [a Hadoop vendor that is not part of the initiative] suggests, the paying members, and it's not clear who else."

Cloudera voiced its skepticism about the need for a unified base Hadoop distribution in explaining why it was staying out of the fray.


Hadoop is written in Scala programming language hence follows Scala programming syntax.

"Hello World" Example

[10] import scala.Tuple2;

import org.apache.spark.SparkConf;

import org.apache.spark.api.java.JavaPairRDD;

import org.apache.spark.api.java.JavaRDD;

import org.apache.spark.api.java.JavaSparkContext;

import org.apache.spark.api.java.function.FlatMapFunction;

import org.apache.spark.api.java.function.Function2;

import org.apache.spark.api.java.function.PairFunction;

import java.util.Arrays;

import java.util.List;

import java.util.regex.Pattern;

public final class JavaWordCount {

 private static final Pattern SPACE = Pattern.compile(" ");
 public static void main(String[] args) throws Exception {
   if (args.length < 1) {
     System.err.println("Usage: JavaWordCount <file>");
   SparkConf sparkConf = new SparkConf().setAppName("JavaWordCount");
   JavaSparkContext ctx = new JavaSparkContext(sparkConf);
   JavaRDD<String> lines = ctx.textFile(args[0], 1);
   JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
     public Iterable<String> call(String s) {
       return Arrays.asList(SPACE.split(s));
   JavaPairRDD<String, Integer> ones = words.mapToPair(new PairFunction<String, String, Integer>() {
     public Tuple2<String, Integer> call(String s) {
       return new Tuple2<String, Integer>(s, 1);
   JavaPairRDD<String, Integer> counts = ones.reduceByKey(new Function2<Integer, Integer, Integer>() {
     public Integer call(Integer i1, Integer i2) {
       return i1 + i2;
   List<Tuple2<String, Integer>> output = counts.collect();
   for (Tuple2<?,?> tuple : output) {
     System.out.println(tuple._1() + ": " + tuple._2());


  • Writing Map Reduce style programming is by far more simple in Spark ctx.textFile(args[0], 1) define the file path to read from and optional parameter defining number of partitions to create for file which in turn will decide the degree of parallelism ,number of parallel mapper.
  • flatMap is used when there are multiple outputs for a single input in mapper function like in above example it takes line as input splits it on space and then returns list of word in line.
  • mapToPair function is applied on every record thus resulting in tuple of (word,1)
  • reduceByKey function makes sure to get all entries of same word to a single reducer where the sum/frequency of the occurrence of word is then calculated.
  • collect is good to use when testing as it forces all the values to driver program ,can create problem when data is too large.

Best Practices

Spark is often considered to be the next step for Big Data processing beyond Hadoop. While Hadoop’s MapReduce functionality is spread across storage, Spark uses in-memory techniques to provide huge improvements in processing times, in some cases up to 100x the equivalent task in Hadoop.

  • Spark is written in Scala, so new features will be first available for Scala (and Java). Python and R bindings can lag behind on new features from version to version.
  • If you are developing Spark in Java, use Java 8 or above if possible. The new lambda functionality in Java 8 removes a lot of verbosity required in your code.
  • The fundamentals of software development still apply in the Big Data world – Spark can be unit-tested and integration-tested, and code should be reused between streaming and batch jobs wherever possible.
  • When Spark is transferring data over the network, it needs to serialize objects into a binary form. This can have an effect on performance when shuffling or on other operations that require large amounts of data to be transferred. To ameliorate this, first try to make sure that your code is written in a way that minimizes the amount of shuffling that may occur (e.g. only use groupByKey as a last resort, preferring instead to use actions like reduceByKey which perform aggregation as in-place as possible).
  • Consider using Kryo instead of java.io.Serializable for your objects, as it has a more compact binary representation than the standard Java serializer, and is also faster to compress or decompress. For further performance, especially when dealing with billions of objects, you can register classes with the Kryo serializer at start-up, saving more precious bytes.
  • Use connection pools instead of creating dedicated connections when connecting to external data sources – e.g. if you are writing elements from an RDD into a Redis cluster, you might be surprised if it attempts to open 10 million connections to Redis when running on production traffic.
  • Use Spark’s check-pointing features when running streaming applications to ensure recovery from failures. Spark can save checkpoints to local files, HDFS, or S3.
  • With larger datasets (>200Gb), garbage collection on the JVM Spark runs may become a performance issue. In general, switching to the G1 GC over the default ParallelGC will ultimately be more performant. Although, some tuning will be required according to the details of your dataset and application.
  • If possible, use dataframes over RDDs for developing new applications. While dataframes are in the Spark SQL package rather than Spark Core, Databricks has indicated that they will be dedicating significant resources to improving the Catalyst optimizer for generating RDD code. By adopting the dataframe approach of development, your application is likely to benefit from any optimizer improvements in the upcoming development cycles.
  • Remember, Spark Streaming is not a pure streaming architecture. If the microbatches do not provide a low enough latency for your processing, you may need to consider a different framework, e.g. Storm, Samza, or Flink.

Feature Comparison Chart

Storm is designed for real time application whereas Apache Spark is designed to do more than plain data processing as it can make use of existing machine learning libraries and process graphs. Thanks to the high performance of Apache Spark, it can be used for both batch processing and real time processing. Spark provides an opportunity to use a single platform for everything rather than splitting the tasks on different open source platforms-avoiding the overhead of learning and maintaining different platforms.

  1. Micro-batching is a special kind of batch processing wherein the batch size is orders smaller. Windowing becomes easy with micro-batching as it offer stateful computation of data. Storm is a complete stream processing engine that supports micro-batching whereas Spark is a batch processing engine that micro-batches but does not render support for streaming in the strictest sense.
  2. Spark and Storm both provide fault tolerance and scalability but differ in the processing model. Spark streams events in small batches that come in short time window before it processes them whereas Storm processes the events one at a time. Thus, Spark has a latency of few seconds whereas Storm processes an event with just millisecond latency.
  • Hadoop MapReduce is best suited for batch processing not for real time applications/processing whereas spark has great support for both real time as well as batch processing.
  1. Ease of development writing map reduce jobs in Hadoop is more difficult as compared to writing them in Spark as we can express a complex map logic via multiple mapper functions in spark as compared to writing the complex logic in one mapper when using Hadoop MapReduce.
  2. Spark processes in-memory data whereas Hadoop MapReduce persists back to the disk after a map action or a reduce action thereby Hadoop MapReduce lags behind when compared to Spark in this aspect.
  • Apache Flink Flink is primarily a stream processing framework that can look like a batch processor.Flink provides expressive APIs that enable programmers to quickly develop streaming data applications.For extra speed, Flink allows iterative processing to take place on the same nodes rather than having the cluster run each iteration independently.Flink is built to be a good YARN citizen (which Spark has not quite achieved yet), and it can run existing MapReduce jobs directly on its execution engine.
  1. Flink is optimized for cyclic or iterative processes by using iterative transformations on collections. This is achieved by an optimization of join algorithms, operator chaining and reusing of partitioning and sorting. Flink streaming processes data streams as true streams, i.e., data elements are immediately "pipelined" though a streaming program as soon as they arrive. This allows to perform flexible window operations on streams.Spark on the other hand is based on resilient distributed datasets (RDDs). This (mostly) in-memory datastructure gives the power to sparks functional programming paradigm. It is capable of big batch calculations by pinning memory. Spark streaming wraps data streams into mini-batches, i.e., it collects all data that arrives within a certain period of time and runs a regular batch program on the collected data. While the batch program is running, the data for the next mini-batch is collected.
  • Apache Apex is an enterprise-grade native YARN Big Data-in-motion platform that unifies stream processing as well as batch processing.Apex processes Big Data in-motion in a highly scalable, highly performant, fault tolerant, stateful, secure, distributed and an easily operable way. It provides a simple API that enables users to write or re-use generic Java code, thereby lowering the expertise needed to write Big Data applications.Apache Apex includes key features requested by open source developer community that are not available in current open source technologies.
  1. Event processing guarantees
  2. In-memory performance & scalability
  3. Fault tolerance and state management
  4. Native rolling and tumbling window support
  5. Hadoop-native YARN & HDFS implementation

Top Companies Providing Services

  • Databricks was founded out of the UC Berkeley AMPLab by the creators of Apache Spark. They are one of the major provider/contributor to Spark. Databricks also provide professional training for spark and has created a unifying certification process that is designed to be meaningful for application developers and end-customers without being overly burdensome for either.
  • Hortonworks has invested a lot in Spark for Open Enterprise Hadoop so users can deploy Spark-based applications alongside other Hadoop workloads in a consistent, predictable and robust way. Hortonworks believe's that Spark & Hadoop are perfect together.
  • PivotalHD and Spark Unlike a multi-vendor patchwork of heterogeneous solutions, Pivotal brings together an integrated full stack of technologies to allow enterprises to create a Business Data Lake.Pivotal HD 2.0.1 consists of a Hadoop distribution that is compatible with Apache Hadoop 2.x, a market-leading SQL on Hadoop query engine in HAWQ, and GemfireXD for in-memory data serving and ultra-low latency transaction processing capabilities. Together these platforms extend Pivotal’s differentiation in both the Hadoop ecosystem and the more established data warehousing markets, meeting the full spectrum of analytics requirements from batch to ultra-low latency.With Spark, Pivotal aims to further extend this differentiation by leveraging Spark’s cutting edge capabilities and integrating it with the rest of Pivotal’s world-class platform.
  • ClouderaAn integrated part of CDH and supported with Cloudera Enterprise, Spark is the open standard for flexible in-memory data processing for batch, real-time, and advanced analytics. Via the One Platform initiative, Cloudera is committed to helping the ecosystem adopt Spark as a replacement for MapReduce in the Hadoop ecosystem as the default data execution engine for analytic workloads.
  • IBM Spark As part of its commitment to Apache Spark, IBM will:
  1. Open source its breakthrough IBM SystemML machine learning technology and collaborate with Databricks to advance machine learning at the core of the Apache Spark project
  2. Offer IBM Analytics for Apache Spark on IBM Bluemix
  3. Open a Spark Technology Center in San Francisco for the Data Science and Developer community
  4. Educate one million data scientists and data engineers on Apache Spark through extensive partnerships with AMPLab, DataCamp, MetiStream, Galvanize and Big Data University MOOC

Top 5 Recent Tweets

Date Author Tweet
24 Feb 2015 @rxin Spark "hall of fame": 8000+ nodes in a single cluster, 1PB+/day ingest, mapping the brain at scale, shuffling 1PB http://www.slideshare.net/databricks/large-scalesparktalk
31 March 2015 @matei_zaharia Spark turns five years old today! A look back at the first version here: https://databricks.com/blog/2015/03/31/spark-turns-five-years-old.html
1 Oct 2015 @ApacheSpark Spark 1.5.1 is out. We strongly recommend all 1.5.0 users to upgrade. http://spark.apache.org/news/spark-1-5-1-released.html
4 Oct 2015 @cloudera #Hadoop and #Spark: Better Together http://j.mp/1LWC6hY via @SmartDataCo
1 Nov 2015 @rxin Want to write your Spark job in C#? Microsoft just published Spark .NET CLR language binding http://spark-packages.org/package/skaarthik/SparkCLR

Top 5 Lifetime Tweets

Date Author Tweet
1 Jul 2014 @databricks Shark, Spark SQL, Hive on Spark, and the future of SQL on Spark! http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html
3 Dec 2014 @databricks Spark powering an Internet of Things platform: read how @Technicolor @Virdata_IoT implemented Spark for IoT: http://hubs.ly/y0lyC-0
22 Jan 2015 @cloudera @Google’s Dataflow pipeline tool can now run on #Spark, thanks to @Cloudera http://j.mp/1ymAR4x via @VentureBeat
9 Apr 2014 @ApacheSpark Spark 0.9.1 released - http://spark.apache.org/releases/spark-release-0-9-1.html
27 Feb 2014 @ApacheSpark We're proud to announce that Spark has become a top-level Apache project: https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces50


  1. https://databricks.com/spark/about/
  2. http://spark.apache.org/ Spark
  3. DataBricks
  4. http://spark.apache.org/community.html#history
  5. https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces50
  6. https://www.dezyre.com/article/hadoop-mapreduce-vs-apache-spark-who-wins-the-battle/83
  7. http://analyticstraining.com/2014/advantages-apache-spark/
  8. https://www.xplenty.com/blog/2014/11/apache-spark-vs-hadoop-mapreduce/
  9. [1]
  10. [2],SparkhelloWorld
  11. [3]

Verification history