Apache Kafka
About
"Kafka is a distributed publish-subscribe messaging system that is designed to be fast, scalable, and durable." With real-time data processing challenges growing more complex by the day, these attributes make it a desirable option for data integration. It maintains feeds of messages in topics, which are partitioned and replicated across multiple nodes. Producers write to the topics and consumers read from them.
Messages are just byte arrays, so developers can use them to store objects in any format. Consumers can be organized into consumer groups, with multiple consumers reading from the same topics. Kafka ensures that each consumer in a group reads from a distinct subset of the partitions of the topics it subscribes to, so every message is delivered to only one consumer in a given group, and all messages with the same key are delivered to the same consumer.
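The group semantics above can be illustrated with a small, self-contained Python sketch. It is a toy model rather than Kafka client code: the function names (`partition_for`, `assign_partitions`, `consumer_for`), the CRC32 hash and the round-robin assignment are all assumptions made for illustration, and real Kafka clients use their own partitioner and assignment strategies.

```python
import zlib
from typing import Dict, List


def partition_for(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition; equal keys always map to the same one."""
    # CRC32 is a stand-in hash chosen for this sketch, not Kafka's own partitioner.
    return zlib.crc32(key) % num_partitions


def assign_partitions(num_partitions: int, consumers: List[str]) -> Dict[str, List[int]]:
    """Give every partition to exactly one consumer in the group (round-robin)."""
    assignment: Dict[str, List[int]] = {c: [] for c in consumers}
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment


def consumer_for(key: bytes, num_partitions: int,
                 assignment: Dict[str, List[int]]) -> str:
    """Find the single group member that receives messages with this key."""
    p = partition_for(key, num_partitions)
    for consumer, partitions in assignment.items():
        if p in partitions:
            return consumer
    raise LookupError(f"partition {p} is not assigned to any consumer")


# Hypothetical group of three consumers reading a six-partition topic.
assignment = assign_partitions(6, ["consumer-a", "consumer-b", "consumer-c"])
owner = consumer_for(b"user-42", 6, assignment)
```

Because the partition is derived only from the key, every message keyed `user-42` lands in the same partition and so reaches the same consumer for as long as the assignment stays unchanged.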
Apache Kafka treats each topic partition as a log, assigning every message in a partition a unique offset. Instead of tracking which messages each consumer has read and retaining only the unread ones, it retains all messages for a specified time period and makes consumers responsible for tracking their own position in the log. As a result, Kafka can support a large number of consumers and retain large amounts of data with little overhead.[1]
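A minimal sketch of this log model is shown below, under stated assumptions. The class `PartitionLog` and its methods are hypothetical names invented for this example, the log lives in memory, and timestamps are plain numbers of seconds. It only illustrates the idea of time-based retention combined with consumer-side position tracking, not how a broker stores data.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PartitionLog:
    """Toy append-only log for one partition (illustrative only)."""
    retention_seconds: float
    base_offset: int = 0  # offset of the oldest record still retained
    records: List[Tuple[float, bytes]] = field(default_factory=list)

    def append(self, timestamp: float, value: bytes) -> int:
        """Store a record and return the offset assigned to it."""
        self.records.append((timestamp, value))
        return self.base_offset + len(self.records) - 1

    def expire(self, now: float) -> None:
        """Drop records older than the retention period, read or not."""
        while self.records and now - self.records[0][0] > self.retention_seconds:
            self.records.pop(0)
            self.base_offset += 1

    def read(self, offset: int, max_records: int = 10) -> List[Tuple[int, bytes]]:
        """Return (offset, value) pairs starting at the requested offset."""
        start = max(offset, self.base_offset) - self.base_offset
        chunk = self.records[start:start + max_records]
        return [(self.base_offset + start + i, v) for i, (_, v) in enumerate(chunk)]


log = PartitionLog(retention_seconds=60)
log.append(0, b"a")
log.append(30, b"b")
log.append(90, b"c")

# The consumer, not the log, remembers how far it has read.
position = 0
batch = log.read(position)
position = batch[-1][0] + 1  # next offset to fetch

log.expire(now=100)  # records older than 60 seconds are removed
```

The log keeps no per-consumer state, so adding consumers costs nothing on the storage side. A consumer that starts late simply begins at the oldest offset still retained.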
Key Features
Messaging: Kafka works well as a replacement for traditional message brokers, offering better throughput, built-in partitioning, replication and fault tolerance. This makes it a good solution for applications that require large-scale message or event processing.
Website Activity Tracking: A user activity tracking pipeline can be rebuilt on Kafka as a set of real-time publish-subscribe feeds. These feeds can then be subscribed to for purposes such as real-time monitoring and processing, or loading into data warehousing systems for offline processing and reporting.
Log Aggregation: Kafka abstracts away the details of files and provides a cleaner abstraction of event or log data as a stream of messages, allowing for lower-latency processing, distributed data consumption and better support for diverse data sources. Kafka offers good performance, stronger durability guarantees as a result of replication, and lower end-to-end latency.
Metrics: Kafka can help monitor operational data by aggregating statistics from distributed applications into centralized feeds.
Commit Log: Kafka can serve as an external commit log for distributed systems. The log helps replicate data between nodes and acts as a re-syncing mechanism that lets failed nodes restore their data.
Stream Processing: Kafka allows for stage-wise processing, in which data is consumed from raw data topics and then aggregated, enriched or otherwise transformed into new Kafka topics for later consumption.
Advantages
- Fast: Every Kafka broker is capable of handling hundreds of megabytes of writes and reads per second from numerous clients.
- Scalable: Kafka has been designed to let a single cluster serve as the main data backbone for a large enterprise. It can be transparently and elastically expanded without any downtime. Data streams are partitioned and spread over a cluster of machines, which accommodates streams too large for any single machine and allows for clusters of coordinated consumers.
- Durable: In Kafka, messages are persisted on disk and replicated within the cluster to prevent data loss. Every broker is capable of handling terabytes of messages without impacting performance.
- Distributed Design: Kafka has a modern, cluster-centric design that provides better fault tolerance and durability.
- Guaranteed Ordered Messages: Kafka provides a stronger ordering guarantee for message delivery than a traditional messaging system. Traditional messaging systems hand out messages in order, but those messages are delivered asynchronously to consumers, so they may arrive out of order at different consumers. Kafka guarantees ordered delivery of messages within a partition. If a system requires a total order over all messages, this can be achieved by using a topic with a single partition, at the cost of sacrificing parallelism among message consumers.
- Industry Adoption: Apache Kafka has become a popular messaging system in a short period of time, with a number of organisations like LinkedIn, Tumblr, PayPal, Cisco, Box, Airbnb, Netflix, Square, Spotify, Pinterest, Uber, Goldman Sachs, Yahoo and Twitter among others using it in production systems.[2]
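The ordering guarantee listed above can be made concrete with a short Python sketch. It is an in-memory toy model, not Kafka code: the helpers `produce` and `per_key_values` are hypothetical names, and keyed messages are placed into partitions with a CRC32 hash chosen only for this illustration.

```python
import zlib
from typing import Dict, List, Tuple

Message = Tuple[bytes, int]  # (key, value)


def produce(messages: List[Message], num_partitions: int) -> Dict[int, List[Message]]:
    """Append each message to the partition selected by its key."""
    partitions: Dict[int, List[Message]] = {p: [] for p in range(num_partitions)}
    for key, value in messages:
        # CRC32 is an assumption of this sketch, not Kafka's partitioner.
        partitions[zlib.crc32(key) % num_partitions].append((key, value))
    return partitions


def per_key_values(partitions: Dict[int, List[Message]], key: bytes) -> List[int]:
    """Values for one key, in the order its partition holds them."""
    return [v for part in partitions.values() for k, v in part if k == key]


sent = [(b"a", 1), (b"b", 1), (b"a", 2), (b"c", 1), (b"b", 2), (b"a", 3)]

# Several partitions: order is kept within each partition, hence per key,
# but no order is defined between messages in different partitions.
spread = produce(sent, num_partitions=3)

# One partition: the log holds every message in exactly the order sent.
single = produce(sent, num_partitions=1)
```

With three partitions, the values for key `a` are still read back as 1, 2, 3, while the relative order of messages in different partitions is left undefined. With a single partition the whole sequence is preserved, but only one consumer per group can read it.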