Apache Pig is a high-level platform, built around a procedural data-flow language, that simplifies querying large data sets in Apache Hadoop and MapReduce. Pig has two components: the language itself, called Pig Latin (the people naming Hadoop projects do tend to have a sense of humor), and a runtime environment in which Pig Latin programs are executed. Think of the relationship between a Java application and the Java Virtual Machine (JVM). In this section, we’ll refer to the whole entity simply as Pig. The Pig Latin layer lets you run SQL-like queries on distributed datasets within Hadoop applications. Pig originated as a Yahoo! Research initiative for creating and executing MapReduce jobs on very large data sets, and in 2007 it became an open source project of the Apache Software Foundation.
Pig is an interactive, or script-based, execution environment supporting Pig Latin, a language used to express data flows. The Pig Latin language supports the loading and processing of input data with a series of operators that transform the input data and produce the desired output.
The Pig execution environment has two modes:
Local mode: All scripts are run on a single machine. Hadoop MapReduce and HDFS are not required.
Hadoop mode: Also called MapReduce mode. All scripts are run on a given Hadoop cluster.
The Pig Latin language provides an abstract way to get answers from big data by focusing on the data rather than on the structure of a custom software program. Pig also makes prototyping simple. For example, you can run a Pig script on a small, representative sample of your data to check that you are getting the desired results before you commit to processing all of it.
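As a rough illustration of this prototyping workflow, the Pig Latin sketch below takes a small random sample of an input and prints it. The input path, the tab-separated layout, and the field names are all assumptions made up for this example, and SAMPLE is only one way to obtain a small test set.

```pig
-- Hypothetical input: tab-separated lines of (userid, url, num_bytes).
logs = LOAD 'input/access_log.tsv' USING PigStorage('\t')
       AS (userid:chararray, url:chararray, num_bytes:long);

-- Keep roughly 1% of the rows, chosen at random, for a trial run.
trial = SAMPLE logs 0.01;

-- Print the sampled rows to the console to check them by eye.
DUMP trial;
```

Once the logic looks right on the sample, the SAMPLE step can be removed so that the same script processes the full input.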
Pig programs can be run in three different ways, all of them compatible with local and Hadoop mode:
Script: Simply a file containing Pig Latin commands, identified by the .pig suffix (for example, file.pig or myscript.pig). The commands are interpreted by Pig and executed in sequential order.
Grunt: Grunt is Pig's interactive command interpreter. You type Pig Latin at the Grunt prompt and Grunt executes each command on your behalf. This is very useful for prototyping and “what if” scenarios.
Embedded: Pig programs can be executed as part of a Java program.
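The short script below is a sketch of what the script and Grunt options might look like in practice. The file name myscript.pig and the input path are hypothetical. The same statements could be saved in a file and passed to the pig command, or typed one at a time at the Grunt prompt.

```pig
-- Saved as myscript.pig (hypothetical name), this could be run from a shell:
--   pig -x local myscript.pig       (local mode)
--   pig -x mapreduce myscript.pig   (Hadoop / MapReduce mode)
-- The same statements can also be entered one by one at the grunt> prompt.

-- Load each line of a hypothetical text file as a single string field.
lines = LOAD 'input/sample.txt' AS (line:chararray);

-- Diagnostic operator: show the schema Pig has assigned to the relation.
DESCRIBE lines;

-- Keep only a handful of rows and print them.
few = LIMIT lines 5;
DUMP few;
```

Embedding the same logic in a Java program is not shown here, since it depends on the Pig libraries available in your project.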
Pig Latin has a very rich syntax. It supports operators for the following operations:
- Loading and storing of data
- Streaming data
- Filtering data
- Grouping and joining data
- Sorting data
- Combining and splitting data
Pig Latin also supports a wide variety of types, expressions, functions, diagnostic operators, macros, and file system commands.
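To make the operator categories above more concrete, here is a minimal sketch of a data flow that loads, filters, groups, sorts, and stores data. The paths, the field names, and the idea of summing bytes per user are assumptions invented for illustration, not something taken from a real data set.

```pig
-- Loading: read a hypothetical tab-separated file of (userid, url, num_bytes).
logs = LOAD 'input/access_log.tsv' USING PigStorage('\t')
       AS (userid:chararray, url:chararray, num_bytes:long);

-- Filtering: keep only rows that transferred some data.
nonempty = FILTER logs BY num_bytes > 0;

-- Grouping: collect all rows for each user into one bag.
by_user = GROUP nonempty BY userid;

-- Transform each group into a (userid, total_bytes) pair.
totals = FOREACH by_user GENERATE group AS userid,
                                  SUM(nonempty.num_bytes) AS total_bytes;

-- Sorting: largest totals first.
ranked = ORDER totals BY total_bytes DESC;

-- Storing: write the result to a hypothetical output directory.
STORE ranked INTO 'output/bytes_by_user' USING PigStorage('\t');
```

Each statement names an intermediate relation, so the script reads as a step-by-step data flow rather than a single nested query.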
Pig started out as a research project in Yahoo! Research, where Yahoo! scientists designed it and produced an initial implementation. As explained in a paper presented at SIGMOD in 2008, the researchers felt that the MapReduce paradigm presented by Hadoop “is too low-level and rigid, and leads to a great deal of custom user code that is hard to maintain and reuse.” At the same time they observed that many MapReduce users were not comfortable with declarative languages such as SQL. Thus they set out to produce “a new language called Pig Latin that we have designed to fit in a sweet spot between the declarative style of SQL, and the low-level, procedural style of MapReduce.”
Yahoo! Hadoop users started to adopt Pig. So, a team of development engineers was assembled to take the research prototype and build it into a production-quality product. About this same time, in fall 2007, Pig was open sourced via the Apache Incubator. The first Pig release came a year later in September 2008. Later that same year, Pig graduated from the Incubator and became a subproject of Apache Hadoop.
Early in 2009 other companies started to use Pig for their data processing. Amazon also added Pig as part of its Elastic MapReduce service. By the end of 2009 about half of Hadoop jobs at Yahoo! were Pig jobs. In 2010, Pig adoption continued to grow, and Pig graduated from a Hadoop subproject, becoming its own top-level Apache project.
- Pig Latin is a scripting language: a script is a sequence of statements that Pig executes in order, and the Grunt shell also accepts shell-style file system commands.
- Pig Latin is a command-based language designed specifically for data transformation and for expressing data flows.
"Hello World" Example
The dataset that we are using here is