Apache Hadoop

Apache Hadoop is an open-source framework for the distributed processing of large data sets across clusters of computers. It is written in Java and designed to run on commodity hardware. Hadoop helps businesses gain insights from massive volumes of structured and unstructured data.^[1]

At its core, Apache Hadoop has a processing framework, MapReduce (MR), and a storage component, the Hadoop Distributed File System (HDFS). HDFS stores huge amounts of data efficiently by distributing it across the machines of a cluster, and MR processes that data using the computing power of all the machines in the cluster. HDFS splits data into smaller parts and stores them on different machines, which provides failure recovery and high availability. The MR framework divides a job (a program submitted to the cluster) into multiple small tasks and ships the code to different machines, or nodes, where the tasks execute in parallel. Both HDFS and MR scale horizontally: storage and computing power are increased by adding nodes when required. The framework is designed to detect and handle failures at the application layer, thus delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.
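The way HDFS cuts a file into parts and spreads them over nodes can be illustrated with a toy model. The sketch below is not HDFS code: the class name, the 128MB block size, the replication factor of 3 and the round-robin placement are all assumptions made for illustration (real HDFS placement also takes racks into account).

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of block splitting and replica placement; not the real HDFS policy. */
class BlockPlacementSketch {

    /** Number of fixed-size blocks needed to hold a file of the given size. */
    static long blockCount(long fileSizeBytes, long blockSizeBytes) {
        if (fileSizeBytes <= 0) {
            return 0;
        }
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes; // ceiling division
    }

    /**
     * Assigns every block to `replication` distinct nodes in round-robin order.
     * Round-robin is a simplification chosen for this sketch.
     */
    static List<List<Integer>> placeBlocks(long blocks, int nodes, int replication) {
        if (replication > nodes) {
            throw new IllegalArgumentException("replication exceeds node count");
        }
        List<List<Integer>> placement = new ArrayList<>();
        for (long b = 0; b < blocks; b++) {
            List<Integer> replicas = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                replicas.add((int) ((b + r) % nodes));
            }
            placement.add(replicas);
        }
        return placement;
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;  // assumed block size
        long fileSize = 1024L * 1024 * 1024;  // a 1 GB file
        long blocks = blockCount(fileSize, blockSize);
        System.out.println("blocks = " + blocks);  // prints "blocks = 8"
        System.out.println("placement = " + placeBlocks(blocks, 4, 3));
    }
}
```

Because every block lives on several nodes in this model, losing any single node still leaves at least two copies of each block, which is the intuition behind the failure recovery described above.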

The project includes these modules:^[2]

Hadoop Common: The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

History

The use of concurrent processes that communicate by message passing has its roots in operating system architectures studied in the 1960s.^[3]
The study of distributed computing became its own branch of computer science in the late 1970s and early 1980s.^[4]
In 1976, Digital Equipment Corporation created the File Access Listener (FAL), an implementation of the Data Access Protocol that was part of DECnet Phase II and became the first widely used network file system.
In 1985, Sun Microsystems created the Network File System (NFS).
Hadoop was originally implemented at Yahoo, based on papers published by Google in 2003 and 2004.^[5]
Hadoop was created by Doug Cutting and Mike Cafarella at Yahoo in 2005.^[6]
Yahoo used Nutch's storage and processing ideas to form the backbone of Hadoop.
In 2009, Hadoop demonstrated that it could handle data at the scale of billions of web searches by sorting a petabyte of data in less than 17 hours.^[7]
Hadoop YARN was first released as part of Hadoop 0.23 in 2012.^[8]

Strengths

Distributed data: HDFS divides data into smaller parts (blocks) and stores them on different nodes, so a data set can exceed the capacity of a single node as long as it fits within the capacity of the cluster.^[9]^[10]
Distributed computation: a computation is split into independent tasks that are assigned to different nodes in the cluster and run in parallel.
Horizontal scaling: because HDFS and MR are designed for distributed computing, storage and computation capacity can be increased by adding nodes to the cluster as and when required.
Robust coherency model.
Designed with the assumption that any machine can fail, so failures are handled at the application layer.
Replication is handled transparently.
Hadoop Streaming allows programs to be written in languages other than Java, such as C, C++ and C#.
Can be deployed on large clusters of cheap commodity hardware, as opposed to expensive, specialized parallel-processing hardware.
By combining the storage and computation layers, Hadoop makes sure that the processing logic is moved close to the data, rather than the data being moved to the processing logic.
HDFS supplies out-of-the-box redundancy and failover capabilities that require little to no manual intervention.
The hallmark of HDFS is its ability to tackle big data use cases and most of the characteristics that define them (data velocity, variety, and volume).
Cost-effective: Hadoop, including HDFS, is open-source software, which translates into real cost savings for its users.
Flexible: Hadoop enables businesses to easily access new data sources and tap into different types of data (both structured and unstructured) to generate value from that data.

Weaknesses

Not fit for small data: because of its high-capacity design, HDFS cannot efficiently support random reads of small files.
Programming model: developers have to understand the details of implementing Mappers and Reducers, and how they are executed, to take advantage of distributed computing.
Joins of multiple data sets are tricky and slow.
Security: Hadoop's security model is weak when used in complex applications.
Very limited SQL support: there are open-source components that attempt to set up Hadoop as a queryable data warehouse, but these offer very limited SQL support.
Inefficient execution: Hadoop has no notion of a query optimizer, so it cannot pick an efficient cost-based plan for execution.
Jobs submitted to MR run in isolation from each other, which creates problems when one job needs to communicate with another.
MR is best suited to batch applications, not to real-time streaming applications.
The NameNode keeps the metadata of every file in the cluster in memory, so memory use on the NameNode grows with the number of files.
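The last point can be made concrete with a back-of-the-envelope estimate. The sketch below is illustrative only: the class and method names are hypothetical, and the figure of roughly 150 bytes of NameNode memory per namespace object (file or block) is a commonly quoted rule of thumb that does not come from this article, so it is passed in as a parameter rather than treated as a fact about any particular Hadoop version.

```java
/** Rough NameNode memory estimate; the per-object cost is an assumption supplied by the caller. */
class NameNodeMemoryEstimate {

    /**
     * Estimates the bytes of metadata held in memory.
     * Counts one object per file plus one object per block of that file.
     */
    static long estimateBytes(long files, long blocksPerFile, long bytesPerObject) {
        long objects = files + files * blocksPerFile;
        return objects * bytesPerObject;
    }

    public static void main(String[] args) {
        long assumedBytesPerObject = 150; // rule-of-thumb value, not a measured figure
        // Ten million small files, each occupying a single block.
        long bytes = estimateBytes(10_000_000L, 1, assumedBytesPerObject);
        System.out.println(bytes + " bytes"); // prints "3000000000 bytes"
    }
}
```

Under these assumptions, ten million single-block files cost about 3 GB of NameNode memory regardless of how small the files are, which is why many small files are more of a burden than a few large ones.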

Criticism

Hadoop is not a one-size-fits-all solution: it is not suitable for every data analytics use case, especially when real-time data streaming is involved.^[11]
DataStax argues that its Cassandra File System (CFS) is superior to HDFS.^[12]
Data privacy is one of the most important concerns in any organization, and it is lacking in the Hadoop ecosystem.
Big Data may be hype used to sell Hadoop-based computing systems, as Hadoop is argued not to be suited even to terabytes of data.^[13]
With each vendor providing its own flavor of Hadoop, there is strong market demand for an Open Data Platform to prevent vendor lock-in.
MapR takes what is probably the most controversial approach to Hadoop. The company replaces standard HDFS with its own proprietary storage services layer that enables random read-writes and allows users to mount the cluster on NFS.^[14]
One of the biggest controversies around Big Data is the way data is collected and used.^[15]

Syntax

Hadoop is written in the Java programming language, so Hadoop programs follow Java syntax.

"Hello World" Example

Below is the standard wordcount example implemented in Java:^[16]

    package org.verify.wiki;

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class WordCount {

        // Mapper: emits (word, 1) for every token of every input line.
        public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            @Override
            public void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String line = value.toString();
                StringTokenizer tokenizer = new StringTokenizer(line);
                while (tokenizer.hasMoreTokens()) {
                    word.set(tokenizer.nextToken());
                    context.write(word, one);
                }
            }
        }

        // Reducer: sums all counts received for one word.
        public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        // Driver: configures the job on the client and submits it to the cluster.
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Job.getInstance replaces the deprecated "new Job(conf, name)" constructor.
            Job job = Job.getInstance(conf, "wordcount");
            // Tells Hadoop which jar to ship to the nodes.
            job.setJarByClass(WordCount.class);

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            job.setMapperClass(Map.class);
            job.setReducerClass(Reduce.class);

            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);

            // args[0] is the input path, args[1] the output path (which must not exist yet).
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // Exit with a non-zero status if the job fails.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Every MapReduce job consists of three parts:

The driver code: code that runs on the client to configure and submit the job
The Mapper
The Reducer

The driver code runs on the client side; it configures the job and then submits it to the cluster. In the example above, the main() method is the driver. It sets the input and output formats of the job. The input format, together with the input path, defines where the input data is read from (a file or directory on HDFS) and how it is divided into input splits. The output format, together with the output path, defines how and where the output is written. In the example above, TextInputFormat.class is set as the input format and TextOutputFormat.class as the output format.

The Mapper: each Mapper processes a single input split. The InputFormat is a factory for RecordReader objects, which extract (key, value) records from the input source. The number of mappers is determined by the number of input splits.

The Reducer: reducers receive data from the mappers and start executing the reduce method only after all mappers have finished. The framework guarantees that all (key, value) pairs with the same key go to a single reducer. The reduce method receives an Iterable<IntWritable> values argument, which is the collection of all values associated with that key.
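The data flow between the Mapper and the Reducer can be followed without a cluster. The sketch below mimics the map, shuffle and reduce phases of the word count job in plain Java collections; it is a teaching aid under that assumption, not Hadoop code, and the class and method names are made up for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

/** In-memory imitation of the map, shuffle and reduce phases; runs on a single JVM. */
class MapReducePhasesSketch {

    /** Map phase: emit (word, 1) for every token in one input line. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            out.add(new SimpleEntry<>(tokenizer.nextToken(), 1));
        }
        return out;
    }

    /** Shuffle phase: group all values by key, with keys in sorted order. */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce phase: sum the values collected for one key. */
    static int reduce(String key, Iterable<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    /** Runs the three phases in sequence over the given lines. */
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            mapped.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(mapped).entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {be=2, not=1, or=1, to=2}
        System.out.println(wordCount(Arrays.asList("to be or not", "to be")));
    }
}
```

The reduce step only sees a key after the shuffle has gathered every value for it, which mirrors the rule that reducers run their reduce method only after all mappers have finished.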

Best Practices

Input^[17]

Hadoop Map-Reduce is optimized to process large amounts of data. The maps typically process data in an embarrassingly parallel manner, typically at least 1 HDFS block of data, usually 128MB.

By default, the framework processes at most one HDFS file per map. This means that if an application needs to process a very large number of input files, it is better to process multiple files per map via a special input format such as MultiFileInputFormat. This holds even for applications processing a small number of tiny input files: processing multiple files per map is significantly more efficient.
If the application needs to process a very large amount of data, even if it is stored in large files, it is more efficient to process more than 128MB of data per map (see the Maps section).

Coalesce the processing of multiple small input files into a smaller number of maps, and use larger HDFS block sizes for processing very large data sets.
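The effect of coalescing small files can be estimated with simple arithmetic. The sketch below is a simplified model with hypothetical names: it assumes that one map is started per split, that a split never spans two files unless files are combined, and that the split size equals a 128MB block.

```java
/** Simplified estimate of map counts; assumes one map per split and split size == block size. */
class InputSplitEstimate {

    /** Maps needed when every file is split on its own (no combining). */
    static long mapsPerFile(long[] fileSizes, long splitSize) {
        long maps = 0;
        for (long size : fileSizes) {
            if (size > 0) {
                maps += (size + splitSize - 1) / splitSize; // ceiling division per file
            }
        }
        return maps;
    }

    /** Maps needed when small files are packed together into full-sized splits. */
    static long mapsWhenCombined(long[] fileSizes, long splitSize) {
        long total = 0;
        for (long size : fileSizes) {
            total += size;
        }
        return total == 0 ? 0 : (total + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        long[] files = new long[10_000];
        java.util.Arrays.fill(files, mb); // ten thousand 1 MB files
        System.out.println(mapsPerFile(files, 128 * mb));      // prints 10000
        System.out.println(mapsWhenCombined(files, 128 * mb)); // prints 79
    }
}
```

In this model, ten thousand tiny files cost ten thousand short-lived maps when handled one per file, but fewer than a hundred when combined.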

Maps

Having too many maps, or lots of maps with very short run times, is counterproductive.

Unless the application's maps are heavily CPU bound, there is almost no reason to ever require more than 60,000-70,000 maps for a single application.

Also, when processing larger blocks per map, it is important to ensure that the maps have sufficient memory for the sort buffer to speed up the map-side sort. The performance of the application can improve dramatically if the majority of the map output can be held in the map's sort buffer; this will entail larger heap sizes for the map JVM.

Ensure maps are sized so that all of map-outputs can be sorted in one pass by keeping all of them in the sort-buffer.

Applications should use fewer maps to process data in parallel, as few as possible without having really bad failure recovery cases.

Combiner

When used appropriately, a Combiner significantly cuts down the amount of data shuffled from the maps to the reduces.

Combiners help the shuffle phase of the applications by reducing network traffic. However, it is important to ensure that the Combiner does provide sufficient aggregation.
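How much a Combiner helps depends on how many map output records share a key. The sketch below imitates local aggregation for the word count case in plain Java; the class and method names are hypothetical, and the real mechanism in Hadoop is a Reducer class registered through Job.setCombinerClass, which is valid for word count because addition is associative and commutative.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Imitates a sum combiner that runs on the output of a single map task. */
class CombinerSketch {

    /** Collapses repeated (word, 1) records into one (word, count) record per word. */
    static Map<String, Integer> combine(List<String> mapOutputKeys) {
        Map<String, Integer> combined = new LinkedHashMap<>();
        for (String key : mapOutputKeys) {
            combined.merge(key, 1, Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        List<String> mapOutput = Arrays.asList("a", "b", "a", "a", "c", "b");
        Map<String, Integer> combined = combine(mapOutput);
        System.out.println("records without combiner: " + mapOutput.size()); // 6
        System.out.println("records with combiner: " + combined.size());     // 3
        System.out.println(combined); // prints {a=3, b=2, c=1}
    }
}
```

If nearly every key in a map's output is unique, the record count barely shrinks and the Combiner only adds work, which is the "sufficient aggregation" caveat mentioned above.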

Reducer

The efficiency of the reducers is driven to a large extent by the performance of the shuffle.

Having too many or too few reduces is counterproductive:

Too few reduces cause undue load on the node on which the reduce is scheduled — in extreme cases we have seen reduces processing over 100GB per-reduce. This also leads to very bad failure-recovery scenarios since a single failed reduce has a significant, adverse, impact on the latency of the job.

Too many reduces adversely affect the shuffle crossbar. Also, in extreme cases, they result in too many small files being created as the output of the job. This hurts both the NameNode and the performance of subsequent Map-Reduce applications that need to process lots of small files.

Applications should ensure that each reduce processes at least 1-2 GB of data, and at most 5-10 GB of data, in most scenarios.
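The sizing guideline translates into a simple calculation. The helper below is a hypothetical sketch, not part of Hadoop: it derives a reducer count from an estimated shuffle volume and a target amount of data per reduce, and the result would then be passed to Job.setNumReduceTasks in the driver.

```java
/** Derives a reducer count from the data volume; names and targets are illustrative. */
class ReducerSizing {

    static final long GB = 1024L * 1024 * 1024;

    /** Returns enough reducers that none processes more than targetBytesPerReduce. */
    static int suggestReducers(long shuffleBytes, long targetBytesPerReduce) {
        if (targetBytesPerReduce <= 0) {
            throw new IllegalArgumentException("target must be positive");
        }
        if (shuffleBytes <= 0) {
            return 1; // a job still needs one reducer to produce output
        }
        long reducers = (shuffleBytes + targetBytesPerReduce - 1) / targetBytesPerReduce;
        return (int) Math.min(reducers, Integer.MAX_VALUE);
    }

    public static void main(String[] args) {
        // 500 GB of map output, aiming for 5 GB per reduce.
        System.out.println(suggestReducers(500 * GB, 5 * GB)); // prints 100
    }
}
```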

Output

The number of output files of an application grows linearly with the number of reducers configured.

Consider compressing the application's output with an appropriate compressor (weighing compression speed against efficiency) to improve HDFS write performance.

Do not write out more than one output file per reduce; using side-files is usually avoidable. Typically applications write small side-files to capture statistics and the like; counters might be more appropriate if the number of statistics collected is small.

Use an appropriate file format for the output of the reduces. Writing out large amounts of compressed textual data with a codec such as zlib/gzip/lzo is counterproductive for downstream consumers. This is because zlib/gzip/lzo files cannot be split, so the Map-Reduce framework is forced to process the entire file in a single map in the downstream consumer applications. This results in bad load imbalance and bad failure-recovery scenarios on the maps. Using file formats such as SequenceFile or TFile alleviates these problems since they are both compressed and splittable.

Consider using a larger output block size (dfs.block.size) when the individual output files are large (multiple GBs).

Application outputs should be a few large files, each spanning multiple HDFS blocks and appropriately compressed.

Distributed Cache

DistributedCache is a facility provided by the Map/Reduce framework to cache files (text, archives, jars and so on) needed by applications. The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node. Its efficiency stems from the fact that the files are only copied once per job and the ability to cache archives which are un-archived on the slaves.

The DistributedCache is designed to distribute a small number of medium-sized artifacts, ranging from a few MBs to few tens of MBs. One drawback of the current implementation of the DistributedCache is that there is no way to specify map or reduce specific artifacts.

Avoid using the DistributedCache if the file to be shared is larger than 512MB and the application has only a small number of reducers; in such cases the reducers should read the file from HDFS themselves.

Counters

Counters represent global counters, defined either by the Map/Reduce framework or applications. Applications can define arbitrary Counters and update them in the map and/or reduce methods. These counters are then globally aggregated by the framework.

Counters are very expensive since the JobTracker has to maintain every counter of every map/reduce task for the entire duration of the application.

Applications should not use more than 10, 15 or 25 custom counters.

Sampling

A common anti-pattern for applications required to generate fully sorted data is to use a single-reducer, forcing a single, global aggregation. It is very inefficient - this not only puts a significant amount of load on the single node on which the reduce task is executing, but also has very bad failure recovery.

A much better approach is to sample the input and use that to drive a sampling partitioner rather than the default hash partitioner. Thus, one can derive benefits of better load balancing and failure recovery.
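The idea of a sampling partitioner can be sketched in a few lines. The code below is a simplified, single-JVM illustration with hypothetical names: it sorts a sample of keys, picks evenly spaced split points, and routes each key to a partition by binary search, so that every key in partition i sorts before every key in partition i + 1.

```java
import java.util.Arrays;

/** Range partitioning driven by a key sample; a sketch, not Hadoop's own partitioner. */
class SamplingPartitionerSketch {

    /** Picks numPartitions - 1 evenly spaced split points from a sample of keys. */
    static String[] splitPoints(String[] sample, int numPartitions) {
        if (numPartitions < 1 || sample.length < numPartitions) {
            throw new IllegalArgumentException("need at least one sampled key per partition");
        }
        String[] sorted = sample.clone();
        Arrays.sort(sorted);
        String[] points = new String[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            points[i - 1] = sorted[i * sorted.length / numPartitions];
        }
        return points;
    }

    /** Partition index of a key: keys below the first split point go to partition 0. */
    static int partition(String key, String[] points) {
        int idx = Arrays.binarySearch(points, key);
        // Found: the key starts the next partition. Not found: use the insertion point.
        return idx >= 0 ? idx + 1 : -(idx + 1);
    }

    public static void main(String[] args) {
        String[] sample = {"d", "a", "h", "f", "b", "g", "c", "e"};
        String[] points = splitPoints(sample, 4);
        System.out.println(Arrays.toString(points)); // prints [c, e, g]
        for (String key : new String[] {"a", "c", "d", "z"}) {
            System.out.println(key + " -> partition " + partition(key, points));
        }
    }
}
```

Because the partitions cover disjoint, ordered key ranges, each reducer can sort its own range independently and the outputs, read in partition order, form one fully sorted data set without a single global reducer.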

HDFS Operations & JobTracker Operations

Applications should not perform any metadata operations on the file-system from the backend, they should be confined to the job-client during job-submission. Furthermore, applications should be careful not to contact the JobTracker from the backend.

Web-UI

The Web-UI is meant to be used by humans and not by automated processes. Implementing automated processes to screen-scrape the web-ui is strictly prohibited. Some parts of the web-ui, such as browsing of job-history, are very resource-intensive on the JobTracker and could lead to severe performance problems when they are screen-scraped.

Hadoop Related Modules

Many software projects fall under the Hadoop framework, and each of them is built to serve a specific use case or need. It is worth having an idea of the tools, software and frameworks listed below.

Frameworks

Hadoop: This is a software library written in Java used for processing large amounts of data in a distributed environment. It allows developers to setup clusters of computers, starting with a single node that can scale up to thousands of nodes.

Hive: Hive is a data warehousing framework that’s built on Hadoop. It allows for structuring data and querying using a SQL-like language called HiveQL. Developers can use Hive and HiveQL to write complex MapReduce jobs over structured data in a distributed file system. Hive is the closest thing to a relational database in the Hadoop ecosystem.

Pig: Pig is an application for transforming large data sets. Like Hive, Pig has its own SQL-Like language called Pig Latin. Where Hive is used for structured data, Pig excels in transforming semi-structured and unstructured data. Pig Latin allows developers to write complex MapReduce jobs without having to write them in Java.

Flume: Odds are if you are in the Hadoop ecosystem you will need to move around large amounts of data. Flume is a distributed service that helps collect, aggregate and move around large log data. It’s written in Java and typically delivers files directly into HDFS.

Drill: Why not use tools with cool names like drill and drill bits? Apache Drill is a schema-free SQL query engine for data exploration. Drill is listed as real SQL and not just “SQL-like,” which allows developers or analysts to use existing SQL knowledge to begin writing queries in minutes. Apache Drill is extensible with User-Defined Functions.

Kafka: Another great tool for messaging in Hadoop is Kafka. Kafka is used as a queuing system when working with Storm.

Tez: If you’re using YARN, you’ll want to learn about the Tez project. Tez allows for building applications that process DAG (directed acyclic graph) tasks. Basically, Tez allows Hive and Pig jobs to be written with fewer MapReduce jobs, which makes Hive and Pig scripts run faster.

Sqoop: Do you have structured data in a relational database, SQL Server or MySQL, and want to pull that data into your Big Data platform? Well Sqoop can help. Sqoop allows developers to transfer data from a relational database into Hadoop.

Storm: Hadoop works in batch processing, but many applications need real-time processing and this is where Storm fits in. Storm allows for streaming data, so analysis can happen in real-time. Storm boasts a benchmark speed of over a million tuples processed per second, per node.

Ambari: One of the most useful tools you’ll use if you’re administering a Hadoop cluster, Ambari allows administrators to install, manage and monitor Hadoop clusters with a simple Web interface. Ambari provides an easy-to-follow wizard for setting up a Hadoop cluster of any size.

HBase: When developing applications you’ll often want real-time read/write access to your data. Hadoop runs processes in batch and doesn’t allow for modification, and this is what makes HBase so popular. HBase provides the capability to modify data in real time and still run in an HDFS environment.

Mahout: Looking to run Singular Value Decomposition, K-nearest neighbor, or Naive Bayes Classification in a Hadoop environment? Mahout can help. Mahout provides specialized data analysis algorithms that run in a distributed file system. Think of Mahout as a Java library with distributed algorithms to reference in MapReduce jobs.

Zookeeper: Zookeeper provides centralized services for Hadoop cluster configuration management, synchronization and group services. For example, think about how a global configuration file works on a Web application; Zookeeper is like that configuration file, but at a much higher level.

Spark: A real-time general engine for data processing, Spark boasts a speed 100-times faster than Hadoop and works in memory. Spark supports Scala, Python and Java. It also contains a Machine Learning Library (MLlib), which provides scalable machine learning libraries comparable to Mahout.

Feature Comparison

Hadoop has become the de facto standard in research and industry uses of small and large-scale MapReduce. Since its inception, an entire ecosystem has been built around it, including conferences (Hadoop World, Hadoop Summit), books, training, and commercial distributions with support (Cloudera, Hortonworks, MapR). There are, however, alternatives to Hadoop:

BashReduce

Unlike Hadoop, BashReduce is just a script! BashReduce implements MapReduce for standard Unix commands such as sort, awk, grep, join etc. It supports mapping/partitioning, reducing, and merging. The developers note that BashReduce “sort of” handles task coordination and a distributed file system. There is actually no task coordination, as a master process simply fires off jobs and data. There is also no distributed file system at all: BashReduce will distribute files to worker machines, but without a distributed file system there is a lack of fault tolerance, among other things.

Intermachine communication is facilitated with simple passwordless SSH, but there is a large cost associated with transferring files from a master machine to its workers whereas with Hadoop, data is stored centrally in HDFS. Additionally, partition/merge in the standard unix tools is not optimized for this use case, thus the developer had to use a few additional C programs to speed up the process.

Compared to Hadoop, there is less complexity and faster development. The price is a lack of fault tolerance and a lack of flexibility, as BashReduce only works with certain Unix commands. Unlike Hadoop, BashReduce is more of a tool than a full system for MapReduce. BashReduce was developed by Erik Frey et al. of last.fm.

Disco Project

Disco was initially developed by Nokia Research and has been around silently for a few years. Developers write MapReduce jobs in simple, beautiful Python. Disco’s backend is written in Erlang, a scalable functional language with built-in support for concurrency, fault tolerance and distribution — perfect for a MapReduce system! Similar to Hadoop, Disco distributes and replicates data, but it does not use its own file system. Disco also has efficient job scheduling features.

Disco is a pretty standard and powerful MapReduce implementation that removes some of the painful aspects of Hadoop. It likely also gives up persistent fault tolerance, as it relies on a standard filesystem rather than one like HDFS, although Erlang may provide functionality that gives a “good enough” level of fault tolerance for data.

Spark

Spark is one of the newest players in the MapReduce field. Its purpose is to make data analytics fast to write, and fast to run. Unlike many MapReduce systems, Spark allows in-memory querying of data (even distributed across machines) rather than using disk I/O. It is of no surprise then that Spark out-performs Hadoop on many iterative algorithms. Spark is implemented in Scala, a functional object-oriented language that sits on top of the JVM. Similar to other languages like Python, Ruby, and Clojure, Scala has an interactive prompt and users can use Spark to query big data straight from the Scala interpreter.

Spark was developed by the UC Berkeley AMP Lab. Currently, its main users are UC Berkeley researchers and Conviva.

GraphLab

GraphLab was developed at Carnegie Mellon and is designed for use in machine learning. GraphLab’s goal is to make the design and implementation of efficient and correct parallel machine learning algorithms easier. Their website states that paradigms like MapReduce lack expressiveness while lower level tools such as MPI present overhead by requiring the researcher to write code that beats a dead horse.

GraphLab has its own version of the map stage, called the update phase. Unlike MapReduce, the update phase can both read and modify overlapping sets of data. Recall that MapReduce requires data to be partitioned. GraphLab accomplishes this by allowing the user to specify data as a graph where each vertex and edge in the graph is associated memory. The update phases can be chained in such a way such that one update function can recursively trigger other update functions that operate on vertices in the graph. This graph-based approach would not only make machine learning on graphs more tractable, but it also improves dynamic iterative algorithms.

GraphLab also has its own version of the reduce stage, called the sync operation. The results of the sync operation are global and can be used by all vertices in the graph. In MapReduce, output from the reducers is local (until committed) and there is a strict data barrier among reducers. The sync operations are performed at time intervals, and there is not as strong of a tie between the update and sync phases.

GraphLab’s website also contains the original UAI paper and presentation, a document better explaining the abstraction.

HPCC Systems (from LexisNexis)

The project with the least flattering name comes from LexisNexis, which has developed its own framework for massive data analytics. HPCC attempts to make writing parallel-processing workflows easier by using Enterprise Control Language (ECL), a declarative, data-centric language. As a matter of fact, the development team has a converter for translating Pig jobs to ECL. HPCC is written in C++. Some have commented that this will make in-memory querying much faster because objects are less bloated than those originating from the JVM.

Like Hadoop, HPCC already has its own jungle of technologies. HPCC has two “systems” for processing and serving data: the Thor Data Refinery Cluster and the Roxie Rapid Data Delivery Cluster. Thor is a data processor, like Hadoop. Roxie is similar to a data warehouse (like HBase/Hive) and supports transactions. HPCC uses a distributed file system.

Although details are still preliminary as is the system, this certainly has a “feel” for potentially being a solid alternative for Hadoop, but only time will tell.

Top Companies Providing Hadoop Services

There are many commercial Hadoop vendors in the market, offering everything from 100% open-source distributions to their own proprietary implementations of Hadoop, and some of them provide Hadoop as a service. A few of the most popular vendors are:

Amazon Elastic MapReduce (Amazon EMR): A web service that makes it easy to quickly and cost-effectively process vast amounts of data. Amazon EMR simplifies big data processing by providing a managed Hadoop framework that distributes and processes data across dynamically scalable Amazon EC2 instances.^[18]

Hortonworks: Hortonworks features in the Red Herring Top 100 list of winners. It is a pure-play Hadoop company that drives open-source Hadoop distributions in the IT market. The main goal of Hortonworks is to drive all its innovations through the Hadoop open data platform and to build an ecosystem of partners that speeds up Hadoop adoption among enterprises.^[19]

Cloudera: Cloudera ranks at the top of the big data vendors list for making Hadoop a reliable platform for business use since 2008. Cloudera, founded by a group of engineers from Yahoo, Google and Facebook, focuses on providing enterprise-ready Hadoop solutions with additional customer support and training. Cloudera has close to 350 paying customers, including the U.S. Army, Allstate and Monsanto. Some of them boast of deploying 1000 nodes on a Hadoop cluster to run big data analytics on one petabyte of data. Cloudera owes its long-term success to corporate partners (Oracle, IBM, HP, NetApp and MongoDB) that have been consistently pushing its services.^[20]

MapR: MapR has been recognized extensively for its advanced Hadoop distributions, earning a place in the Gartner report “Cool Vendors in Information Infrastructure and Big Data, 2012.” MapR has scored the top place for its Hadoop distributions among all vendors.^[21]

Pivotal: Pivotal HD is 100% Apache Hadoop compliant and supports all Hadoop Distributed File System (HDFS) file formats. Pivotal HD incorporates the common, industry-standardized ODP core containing components such as HDFS, YARN and Ambari. Other standard Hadoop components for scripting, non-relational databases, workflow orchestration, security, monitoring and data processing are included as well.^[22]

IBM InfoSphere: IBM InfoSphere BigInsights is an IBM Hadoop distribution that combines Hadoop with enterprise-grade characteristics. IBM provides BigSheets and BigInsights as a service via its SmartCloud Enterprise infrastructure. With IBM Hadoop distributions, users can set up and move data to Hadoop clusters in no more than 30 minutes, at a data processing rate of 60 cents per Hadoop cluster per hour.^[23]

Microsoft: Forrester rates the Microsoft Hadoop distribution 4/5, based on the vendor's current Hadoop distributions, market presence and strategy, with Cloudera and Hortonworks scoring 5/5. Microsoft is not known for embracing open-source software, but it has made efforts to run this open data platform software on Windows. Microsoft's Hadoop-as-a-service offering is best leveraged through its public cloud product, Windows Azure's HDInsight, which was developed specifically to run on Azure. Another production-ready Microsoft feature, PolyBase, lets users search for information available on SQL Server during the execution of Hadoop queries.^[24]

Top 5 Recent Tweets

Date	Author	Tweet
6 Feb 2015	@Gartner_inc	Digital business transformation, #opendata, #datalakes, #Hadoop meets infrastructure
2 May 2015	@Pivotal	Pivotal welcomes @Hortonworks deeper into the Big Data Suite. HDP now runs HAWQ http://spr.ly/6013f4gW #sql #hadoop
28 Sept 2015	@cloudera	Introducing #Kudu: The New #Hadoop Storage Engine for Fast Analytics on Fast Data http://j.mp/1Wt047M via the @Cloudera VISION blog
14 Oct 2015	@Gartner_inc	Hadoop 2016: Moving Into Mainstream
9 Oct 2014	@Forbes	Thinking about Hadoop? Why your CMO should consider big data as a service instead http://onforb.es/1oRJRXL @SungardA

Top 5 Lifetime Tweets

Date	Author	Tweet
16 Sept 2010	@yahoo	Raymie Stata: Congrats to Yahoo! engineer Nicholas Sze for breaking the world record for computing Pi using Hadoop!
21 Sept 2010	@ydn	Apache #Hadoop divides and conquers, creates longest #pi yet, says @DavidLinthicum of @InfoWorld - http://bit.ly/cvIZq0
5 Jul 2011	@IBMWatson	#Analytics are nothing new, but #Hadoop made orgs of all types realize they can analyze all their data" - @gigaom http://bit.ly/iglzMF
28 Feb 2012	@gigaom	Microsoft’s Hadoop play is shaping up, and it includes Excel http://dlvr.it/1FypXt
4 Nov 2014	@forrester	Predictions 2015: "Hadooponomics" will make enterprise adoption of #Hadoop mandatory. http://s.forr.com/J0FP

References

[1] http://hortonworks.com/hadoop/

[2] https://hadoop.apache.org/

[3] https://prezi.com/9gobleqbzgp-/the-history-evolution-trends-in-distributed-computing/

[4] http://www.uu.edu/dept/compscience/seminar/harwellwhite.pdf

[5] https://books.google.co.in/books?id=axruBQAAQBAJ&pg=PA300&redir_esc=y#v=onepage&q&f=false

[6] http://hortonworks.com/big-data-insights/spotlight-on-the-early-history-of-hadoop/

[7] https://www.quora.com/What-is-the-history-of-Hadoop

[8] https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html

[9] http://hortonworks.com/blog/thinking-about-the-hdfs-vs-other-storage-technologies

[10] http://www.itproportal.com/2013/12/20/big-data-5-major-advantages-of-hadoop/

[11] http://bigdata-madesimple.com/5-controversies-and-debates-around-big-data/

[12] http://www.datastax.com/wp-content/uploads/2012/09/WP-DataStax-HDFSvsCFS.pdf

[13] http://www.ijfeat.org/papers/jan13.pdf

[14] http://wikibon.org/wiki/v/Hadoop:_From_Innovative_Up-Start_to_Enterprise-Grade_Big_Data_Platform

[15] http://data-informed.com/the-5-scariest-ways-big-data-is-used-today/

[16] http://hortonworks.com/hadoop-tutorial/introducing-apache-hadoop-developers/

[17] https://developer.yahoo.com/blogs/hadoop/apache-hadoop-best-practices-anti-patterns-465.html

[18] https://www.dezyre.com/article/-top-6-hadoop-vendors-providing-big-data-solutions-in-open-data-platform/93

[19] https://www.dezyre.com/article/-top-6-hadoop-vendors-providing-big-data-solutions-in-open-data-platform/93

[20] https://www.dezyre.com/article/-top-6-hadoop-vendors-providing-big-data-solutions-in-open-data-platform/93

[21] https://www.dezyre.com/article/-top-6-hadoop-vendors-providing-big-data-solutions-in-open-data-platform/93

[22] https://www.dezyre.com/article/-top-6-hadoop-vendors-providing-big-data-solutions-in-open-data-platform/93

[23] https://www.dezyre.com/article/-top-6-hadoop-vendors-providing-big-data-solutions-in-open-data-platform/93

[24] https://www.dezyre.com/article/-top-6-hadoop-vendors-providing-big-data-solutions-in-open-data-platform/93
