Apache Flume

Follow
Verify
From Verify.Wiki
Jump to: navigation, search

Apache Flume is a distributed, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into the Hadoop Distributed File System (HDFS). It has a simple and flexible architecture based on streaming data flows; and is robust and fault tolerant with tunable reliability mechanisms for fail-over and recovery.

YARN coordinates data ingest from Apache Flume and other services that deliver raw data into an Enterprise Hadoop cluster.[1]

Core Concepts

The purpose of Flume is to provide a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store. The architecture of Flume NG is based on a few concepts that together help achieve this objective. Some of these concepts have existed in the past implementation, but have changed drastically. Here is a summary of concepts that Flume NG introduces, redefines, or reuses from earlier implementation:

Event: A byte payload with optional string headers that represent the unit of data that Flume can transport from it’s point of origination to it’s final destination.

Flow: Movement of events from the point of origin to their final destination is considered a data flow, or simply flow. This is not a rigorous definition and is used only at a high level for description purposes.

Client: An interface implementation that operates at the point of origin of events and delivers them to a Flume agent. Clients typically operate in the process space of the application they are consuming data from. For example, Flume Log4j Appender is a client.

Agent: An independent process that hosts flume components such as sources, channels and sinks, and thus has the ability to receive, store and forward events to their next-hop destination.

Source: An interface implementation that can consume events delivered to it via a specific mechanism. For example, an Avro source is a source implementation that can be used to receive Avro events from clients or other agents in the flow. When a source receives an event, it hands it over to one or more channels.

Channel: A transient store for events, where events are delivered to the channel via sources operating within the agent. An event put in a channel stays in that channel until a sink removes it for further transport. An example of channel is the JDBC channel that uses a file-system backed embedded database to persist the events until they are removed by a sink. Channels play an important role in ensuring durability of the flows.

Sink: An interface implementation that can remove events from a channel and transmit them to the next agent in the flow, or to the event’s final destination. Sinks that transmit the event to it’s final destination are also known as terminal sinks. The Flume HDFS sink is an example of a terminal sink. Whereas the Flume Avro sink is an example of a regular sink that can transmit messages to other agents that are running an Avro source.

These concepts help in simplifying the architecture, implementation, configuration and deployment of Flume.[2]

Features of Flume

Some of the notable features of Flume are as follows −

  • Flume ingests log data from multiple web servers into a centralized store (HDFS, HBase) efficiently.
  • Using Flume, we can get the data from multiple servers immediately into Hadoop.
  • Along with the log files, Flume is also used to import huge volumes of event data produced by social networking sites like Facebook and Twitter, and e-commerce websites like Amazon and Flipkart.
  • Flume supports a large set of sources and destinations types.
  • Flume supports multi-hop flows, fan-in fan-out flows, contextual routing, etc.
  • Flume can be scaled horizontally.


Advantages

Here are the advantages of using Flume −

  • Using Apache Flume we can store the data in to any of the centralized stores (HBase, HDFS).
  • When the rate of incoming data exceeds the rate at which data can be written to the destination, Flume acts as a mediator between data producers and the centralized stores and provides a steady flow of data between them.
  • Flume provides the feature of contextual routing.
  • The transactions in Flume are channel-based where two transactions (one sender and one receiver) are maintained for each message. It guarantees reliable message delivery.
  • Flume is reliable, fault tolerant, scalable, manageable, and customizable.[3]


Disadvantages

Each collection node may face some performance problem when data processing capability of the node is unexpectedly exceeded due to the enormous amount of log workload suddenly coming from data generating agents connected to itself. On the contrary, if the amount of data transmitted to the collection node is too small compared with its data processing capacity, the node may become under-utilized, even remaining almost in the idle state.

This problem of the existing method fundamentally results from not considering the load condition of the task taking collection node. Therefore, when the method attempts to load balance the entire system, its performance may significantly varies depending on the state of the receiving collection node.

If its value is set too high, the method is infrequently invoked even if there are several overloaded nodes. Otherwise, the opposite behavior may occur. Therefore, the Flume requires an effective load balancing method to be able to adapt to dynamic characteristics of incoming workload.

Controversies

None

History

To deal with the streaming of log/data,Cloudera,a service provider of professional services for Hadoop as well as their distribution of hadoop,saw this need,That is when Flume was created which provided standard,simple,robust,flexible,and extensible tool for data ingestion into Hadoop.

Flume was first interdused in Cloudera's CDH3 distribution in 2011.It consisted of a federation of worker daemons (agents) configured from a centralized master (masters) via Zookeeper.

In June,2011,Cloudera moved control of the Flume project to the Apache Foundation.It came out of the incubator status a year later in 2012, during which work had already begun to refactor Flume under the Star-Trek-themed tag,Flume-NG(Flume the Next Generation)[4]

Best Practices

Documentation is the Key

Documentation is the Key to become successful software developer, tester or architect.

Before you start developing small or big software, you should have answer for the following questions:

  • Where is the Requirements Specification?
  • Where is the Impact Analysis Document?
  • Where is the Design Document?
  • Have you documented all the assumptions, limitations properly?
  • Have you done review of all the documents?
  • Did you get sign off on all the documents from all the stakeholders?


Defined standards

Most of the standard software organizations maintain their coding standards. These standards would have been set up by well-experienced software developers after spending years with software development.

A coding standard would fix the rules about various important attributes of the code, few are listed below:

  • File Naming convention
  • Function & Module Naming convention
  • Variable Naming convention
  • History, Indentation, Comments
  • Readability guidelines
  • List of do's and don'ts


Testing

Testing is mandatory after every small or big change.Have to perform testing for each and every change you did in the code.

Top 5 Recent Tweets

Date Author Tweet
16 Feb 2016 @imharit #hadoop #bigdata #programmers #hortonworks #cloudera #flume #apache #developer #blogger #MongoDB #hive #sqoop
14 Feb 2016 ‏@sanghani93 #hadoop #programmer #apache #flume #bigdata #hive #NoSql #Mondaymotivation
12 Feb 2016 ‏@imharit #hadoop #apache #analyzing #flume #bigdata #hive #amazing #hortonworks #zookeeper @HadoopNews @HadoopDaily
11 Feb 2016 @datiobdEng Streaming #Oracle Database Log to Apache #Kafka with Apache #Flume http://ow.ly/Y2Va2 #bigdata via @Dell_Toad
11 Feb 2016 ‏@imharit #apache #hive #flume #programming #hadoop #bigdata #spark #hive #neo4j #ubuntu http://sanghani93.blogspot.in/2016/02/analyse-twitter-data-using- hadoop-flume.html?m=0 … @sanghani93 #SiachenAvalanche


References

  1. http://hortonworks.com/hadoop/flume/
  2. http://blog.cloudera.com/blog/2011/12/apache-flume-architecture-of-flume-ng-2/
  3. http://www.tutorialspoint.com/apache_flume/apache_flume_introduction.htm
  4. https://books.google.co.in/books?id=u1bTBgAAQBAJ&pg=PA154&dq=history+of+apache+flume&hl=en&sa=X&ved=0ahUKEwjXptD4i_nKAhXFRI4KHb3CAXkQ6AEIHDAA#v=onepage&q=history%20of%20apache%20flume&f=false

Verification history

  • This page has not been verified