Apache Hive


The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and to query it using a SQL-like language called HiveQL. At the same time, the language allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express that logic in HiveQL.[1] Built on top of Apache Hadoop™, it provides:

  • Tools to enable easy data extract/transform/load (ETL)
  • A mechanism to impose structure on a variety of data formats
  • Access to files stored either directly in Apache HDFS or in other data storage systems such as Apache HBase
  • Query execution via MapReduce

Hive defines a simple SQL-like query language, called QL, that enables users familiar with SQL to query the data. At the same time, the language allows programmers who are familiar with the MapReduce framework to plug in their custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language. QL can also be extended with custom scalar functions (UDFs), aggregations (UDAFs), and table functions (UDTFs). Hive does not mandate that data be read or written in a "Hive format"; there is no such thing. Hive works equally well on Thrift, control-delimited, or your own specialized data formats.
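
A minimal sketch of plugging a custom UDF into QL; the jar path, class name, and table are hypothetical placeholders, not artifacts shipped with Hive:

[code language=”SQL”]
-- Register a custom scalar UDF packaged in a jar (names are hypothetical):
ADD JAR /tmp/hiveql-udfs.jar;
CREATE TEMPORARY FUNCTION my_lower AS 'com.example.udf.Lower';

-- The function can then be used like any built-in:
SELECT my_lower(name) FROM employees LIMIT 10;
[/code]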

Hive is not designed for OLTP workloads and does not offer real-time queries or row-level updates. It is best used for batch jobs over large sets of append-only data (such as web logs). What Hive values most is scalability (scaling out as machines are added dynamically to the Hadoop cluster), extensibility (via the MapReduce framework and UDFs/UDAFs/UDTFs), fault tolerance, and loose coupling with its input formats.

Components of Hive include HCatalog and WebHCat.

  • HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools (including Pig and MapReduce) to more easily read and write data on the grid.
  • WebHCat provides a service that you can use to run Hadoop MapReduce (or YARN), Pig, or Hive jobs, or to perform Hive metadata operations, using an HTTP (REST-style) interface.

History

Ashish Thusoo and Joydeep Sen Sarma, co-creators of Apache Hive, published an interview in which they discuss the history of Hive. According to Joydeep: "Apache Hive was born out of these dual goals — an SQL-based declarative language that also allowed engineers to be able to plug in their own scripts and programs when SQL did not suffice. It was also built to store centralized meta-data about all the (Hadoop based) data sets in an organization, which was indispensable to creating a data driven organization."[2] Since its first release, Hive has undergone substantial changes. The list below summarizes Hive releases over time:[3]

Version Release Date
0.6.0   2010-10-20
0.7.0   2011-03-20
0.7.1   2011-06-15
0.8.0   2011-12-13
0.8.1   2012-01-26
0.9.0   2012-04-24
0.10.0  2012-12-18
0.11.0  2013-05-11
0.12.0  2013-10-10
0.13.0  2014-04-15
0.13.1  2014-11-16
0.14.0  2014-11-16
1.0.0   2015-02-03
1.1.0   2015-03-09
1.2.0   2015-05-15
1.0.1   2015-05-21
1.1.1   2015-05-21
1.2.1   2015-06-26

Advantages[4]

  • Fits naturally over Hadoop's low-level interfaces.
  • Supports external tables, which make it possible to process data without actually storing it in HDFS.
  • Has a rule-based optimizer for optimizing logical plans.
  • Supports partitioning of data at the table level to improve performance.
  • The metastore (metadata store) is a big plus in the architecture, making lookups easy.
  • Lower learning curve than Pig or MapReduce:[5] HiveQL is much closer to SQL than Pig Latin, so there is less trial and error than with Pig.
  • Hive is built on Hadoop, so it inherits the capabilities Hadoop provides: reliability, high availability, tolerance of node failures, and use of commodity hardware.[6]
  • Database developers need not learn Java programming to write MapReduce programs for retrieving data from a Hadoop system.
  • It takes far less time to write a Hive query than the equivalent MapReduce code. For example, the word-count problem, which takes around 50 lines of MapReduce code, can be written in about 5 lines of HiveQL (see the sketch after this list).[7]
  • It was developed so that people with SQL knowledge can write MapReduce jobs. It supports much of the SQL syntax, which means Hive can be integrated with existing BI tools, so business analysts and non-Java developers can also work on large datasets. Similarly, code previously used in an RDBMS can be reused in Hive (with changes); with raw MapReduce, that is a separate exercise in itself.
  • It is very easy to write queries involving joins (when there are few joins) in Hive. With the equivalent MapReduce code, you have to cache data and perform several operations to reach the same point.
  • It has very low maintenance overhead and is simple to learn and use (low learning curve).
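
As a rough illustration of the brevity claim above, here is the canonical word count written in a few lines of HiveQL; the docs table with a single line column is a hypothetical input:

[code language=”SQL”]
-- Split each line into words with the built-in split(), fan them out with explode(),
-- then aggregate: the whole MapReduce word count in one query.
SELECT word, count(1) AS cnt
FROM (SELECT explode(split(line, '\\s')) AS word FROM docs) w
GROUP BY word
ORDER BY cnt DESC;
[/code]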

Advantages of Hive over SQL [8]

  • Hive has several interesting UDF packages and makes it easy to contribute new UDFs.
  • One can explicitly control the map and reduce transform stages, and even route them through simple scripts written quickly in languages like Python or Perl (see the TRANSFORM sketch after this list).
  • One can also work with many different serialization formats, making it easy to ingest nearly any kind of data. Finally, you can connect Hive queries to several Hadoop packages for statistical processing, including Apache Mahout, RHIPE, RHive, and RHadoop. For these reasons, Hive can improve developer productivity when working with challenging data formats or complex analytical tasks.
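
A minimal sketch of the scripted-transform point above, using Hive's TRANSFORM clause; the script weekday_mapper.py and the u_data table mirror the example in the Hive tutorial and stand in for whatever script and table you actually use:

[code language=”SQL”]
-- Ship a Python script to the cluster and stream rows through it:
ADD FILE weekday_mapper.py;
SELECT TRANSFORM (userid, movieid, rating, unixtime)
       USING 'python weekday_mapper.py'
       AS (userid, movieid, rating, weekday)
FROM u_data;
[/code]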

Disadvantages[9]

  • The batch nature of MapReduce makes Hive perform poorly when you need low-latency execution of simple queries. (Execution engines such as Tez can improve this.)[10]
  • No query planner available to evaluate multiple execution paths for a query and choose the best among them.
  • No support for update and delete.
  • No support for singleton inserts; data must be loaded from a file using the LOAD DATA command (a minimal example follows this list). For reference, INSERT ... VALUES, UPDATE, and DELETE on ACID tables were only added later, in Hive 0.14.
  • No access-control implementation.
  • Correlated subqueries are not supported.
  • Updating data is complicated,[11] mainly because of HDFS: you can add records and you can overwrite partitions, but not update rows in place.
  • No real-time access to data; use other means such as HBase or Impala for that.
  • High latency.
  • You cannot do complicated multi-stage operations in Hive, for example when the output of one job acts as input to another (as a SequenceFileFormat file), or when querying an image file; Hive is not useful there.[12]
  • Hive is useful only if the data is structured. With unstructured data Hive is not a good tool, while with MapReduce you can work on any kind of dataset.
  • Debugging code is very difficult in Hive, while with MapReduce you have the same debugging facilities as in Eclipse.
  • MapReduce is like machine code, so you can do any analytical operation with MR programming.
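
As referenced in the list above, loading is file-oriented rather than row-oriented. A minimal sketch, with hypothetical paths and an events table assumed to be partitioned by dt:

[code language=”SQL”]
-- Bulk-load a local file instead of inserting single rows:
LOAD DATA LOCAL INPATH '/tmp/events.csv' INTO TABLE events;

-- Whole partitions can be overwritten in one shot:
LOAD DATA INPATH '/data/events/2015-06-01'
OVERWRITE INTO TABLE events PARTITION (dt = '2015-06-01');
[/code]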

Controversies

Syntax

"Hello World" Example

Best Practices

  • Partitioning Tables:[13]

Hive partitioning is an effective method to improve query performance on larger tables. Partitioning lets you store data in separate sub-directories under the table location, which greatly helps queries that filter on the partition key(s). Although the selection of a partition key is always a sensitive decision, it should be a low-cardinality attribute; e.g., if your data is associated with a time dimension, then date is a good partition key. Similarly, if the data is associated with location, such as a country or state, it is a good idea to use hierarchical partitions like country/state. A minimal sketch follows.
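
The sketch below uses hypothetical table and column names:

[code language=”SQL”]
-- Each (dt, country) pair becomes its own sub-directory under the table location:
CREATE TABLE page_views (user_id BIGINT, url STRING)
PARTITIONED BY (dt STRING, country STRING);

-- Filtering on the partition keys lets Hive read only the matching sub-directories:
SELECT count(1) FROM page_views WHERE dt = '2015-06-01' AND country = 'US';
[/code]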

  • De-normalizing data:[14]

Normalization is a standard process for modeling data tables with certain rules to deal with redundancy and anomalies. In simpler words, if you normalize your datasets, you end up creating multiple relational tables that must be joined at run time to produce results. Joins are expensive and difficult operations, and are one of the common reasons for performance issues. For that reason, it is a good idea to avoid highly normalized table structures, because they require join queries to derive the desired metrics (a sketch follows below).
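
As a hedged sketch of the idea, a normalized pair of tables can be pre-joined once into a flat table so that downstream queries avoid the join entirely; all table and column names are hypothetical:

[code language=”SQL”]
-- Materialize the join once instead of repeating it in every report query:
CREATE TABLE orders_denorm AS
SELECT o.order_id, o.amount, c.customer_name, c.country
FROM orders o JOIN customers c ON (o.customer_id = c.customer_id);
[/code]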

  • Compress map/reduce output:[15]

Compression techniques significantly reduce intermediate data volume, which in turn reduces the amount of data transferred between mappers and reducers, transfers that generally occur over the network. Compression can be applied to the mapper and reducer output individually. Keep in mind that gzip-compressed files are not splittable, so this should be applied with caution: a compressed file should not be larger than a few hundred megabytes, or it can potentially lead to an imbalanced job. Other compression codec options include Snappy, LZO, and bzip2. Typical settings are sketched below.
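
A sketch of typical settings; the Hadoop property name shown is the Hadoop 2 form, so verify it against your cluster's version:

[code language=”SQL”]
-- Compress intermediate map output with a fast codec such as Snappy:
set hive.exec.compress.intermediate=true;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Optionally compress the final job output as well:
set hive.exec.compress.output=true;
[/code]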

  • Use Tez:[16]

Hive can use the Apache Tez execution engine instead of the venerable MapReduce engine. If it is not turned on by default in your environment, enable Tez by setting the following at the beginning of your Hive query:

[code language=”SQL”] set hive.execution.engine=tez; [/code]

  • Use ORCFile:[17]

Hive supports ORCFile, a table storage format that delivers substantial speed improvements through techniques like predicate push-down, compression, and more.

Using ORCFile for every Hive table should really be a no-brainer, and is extremely beneficial for getting fast response times on your Hive queries.
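
A minimal sketch of creating an ORC-backed table with SNAPPY compression; the table and columns are hypothetical:

[code language=”SQL”]
-- Store the table as ORC, compressed with SNAPPY:
CREATE TABLE page_views_orc (user_id BIGINT, url STRING)
STORED AS ORC tblproperties ("orc.compress" = "SNAPPY");
[/code]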

ORC supports compressed storage (with ZLIB or, as shown above, with SNAPPY) as well as uncompressed storage.

  • Use Vectorization:[18]

Vectorized query execution improves the performance of operations like scans, aggregations, filters, and joins by performing them in batches of 1,024 rows at once instead of a single row at a time.

Introduced in Hive 0.13, this feature significantly improves query execution time and is easily enabled with two parameter settings:

[code language=”SQL”] set hive.vectorized.execution.enabled = true; set hive.vectorized.execution.reduce.enabled = true; [/code]

  • Write good SQL:[19]

SQL is a powerful declarative language. Like other declarative languages, there is more than one way to write a SQL statement. Although each variant may have identical functionality, the statements can have strikingly different performance characteristics.

  • Input Format Selection:[20]

Input formats play a very important role in Hive performance. The primary choices of input format are Text, SequenceFile, RCFile, and ORC. For detailed information on the differences between them, see Hive Input Format Selection.

  • Parallel execution:[21]

Hadoop can execute MapReduce jobs in parallel, and many Hive queries can take advantage of this parallel execution to make fuller use of the Hadoop cluster and reduce query execution time. The configuration in Hive that changes this behavior is merely switching a single flag, SET hive.exec.parallel=true (sketched below). For more information, see Parallel Execution.
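
The flag as it would be set in a session; the companion thread-count property is a standard Hive setting (its default is 8):

[code language=”SQL”]
set hive.exec.parallel=true;
-- Optionally cap how many independent stages may run at once:
set hive.exec.parallel.thread.number=8;
[/code]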

  • Unit Testing:

Unit testing verifies that a small piece of code works exactly as you expect. It makes it easy to find errors early and to modify code with confidence, saving a great deal of time. In Hive, you can unit test UDFs, SerDes, streaming scripts, Hive queries, and more. Several tools help you test Hive queries; some you might want to look at are HiveRunner, Hive_test, and Beetest.

  • Sampling:[22]

Sampling allows users to take a subset of a dataset and analyze it, without having to analyze the entire dataset. With a representative sample, a query can return meaningful results while finishing quicker and consuming fewer compute resources.

Hive offers a built-in TABLESAMPLE clause that allows you to sample your tables. TABLESAMPLE can sample at various granularity levels: it can return only subsets of buckets (bucket sampling), HDFS blocks (block sampling), or the first N records from each input split. Alternatively, you can implement your own UDF that filters out records according to your sampling algorithm. Both built-in forms are sketched below.
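
Hedged sketches of both built-in forms; the table name is hypothetical, and bucket sampling is most efficient when the table is actually bucketed on the sampled column:

[code language=”SQL”]
-- Bucket sampling: hash rows into 32 buckets on rand() and keep the first:
SELECT * FROM page_views TABLESAMPLE (BUCKET 1 OUT OF 32 ON rand()) s;

-- Block sampling: read roughly 0.1 percent of the input data:
SELECT * FROM page_views TABLESAMPLE (0.1 PERCENT) s;
[/code]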

Feature Comparison Chart

Hive vs. Pig

  • Hive is used mainly by data analysts, whereas Pig is generally used by researchers and programmers.
  • Hive is used for completely structured data, whereas Pig is used for semi-structured data.
  • Hive has a declarative SQL-like language (HiveQL), whereas Pig has a procedural data-flow language (Pig Latin).
  • Hive is mainly used for creating reports, whereas Pig is mainly used for programming.
  • Hive operates on the server side of a cluster, whereas Pig operates on the client side of a cluster.
  • Hive is helpful for ETL, but Pig is especially strong at ETL because of its powerful transformation and processing capabilities.
  • Hive can start an optional Thrift-based server that can receive queries from anywhere and execute them directly on the Hive server; this feature is not available with Pig.
  • Hive directly leverages SQL expertise and thus can be learned easily, whereas Pig Latin, although SQL-like, varies to a great extent, so it takes some time and effort to master Pig.
  • Hive uses an exact variation of the SQL DDL language, defining tables beforehand and storing the schema details in a metastore database, whereas Pig has no dedicated metadata database; schemas and data types are defined in the script itself.
  • Hive has a provision for partitions, so you can process a subset of data by date or in alphabetical order, whereas Pig has no notion of partitions, though one might achieve the same effect through filters.
  • Pig supports Avro whereas Hive does not.
  • Pig can be installed more easily than Hive, as it is based entirely on shell interaction.
  • Pig renders users sample data for each scenario and each step of a script through its ILLUSTRATE function, a feature that is not incorporated in Hive.
  • Hive has smart built-in features for accessing raw data, whereas with Pig Latin scripts it is not certain that accessing raw data is as fast as with HiveQL.
  • You can join, order, and sort data dynamically in an aggregated manner with both Hive and Pig; however, Pig additionally provides the COGROUP feature for performing outer joins.

To conclude, having understood the differences between Pig and Hive: both components help achieve the same goals. Pig comes naturally to programmers who prefer scripting, while Hive comes naturally to database developers. When it comes to access options, Hive is said to have more features than Pig. Both projects reportedly have about the same number of committers.[23]


Redshift vs. Hive

  • Performance: Performance tests and benchmarks conducted by independent organizations suggest that Redshift is significantly faster than Hive. A complex query with two joins, run against a million rows, took Hive 182 seconds versus Redshift's 8 seconds; Redshift was roughly 20 times faster in this case. The one caveat in these findings is that Redshift performed best in the terabyte range, while Hive caught up with and surpassed Redshift in the petabyte range. This suggests the results are not set in stone but depend on the needs and size of the data warehouse.
  • Price: Redshift edges out Hive in the pricing comparison as well. According to the same Airbnb performance tests, the Redshift implementation used in the tests would cost $13.60 per hour, while the equivalent Hive-on-Hadoop implementation would cost $57 per hour, making Redshift the clear-cut winner.
  • Ease of use: Redshift lets you get a data warehouse up and running in minutes. It is fully managed and fault-tolerant, supports automated backups, and facilitates fast restores. Hadoop-based Hive, on the other hand, is much more complex and demanding: it supports all of these features, but not as easily or cheaply as Redshift, and the expertise needed for a Hadoop implementation makes any data-warehousing activity a major exercise in planning, development, deployment, and administration. Scaling is where Hive on Hadoop takes the prize: Redshift has an upper limit of 100 nodes and a maximum of 16 TB of storage per node, while there are virtually no limits to scaling Hadoop. Lastly, Redshift uses standard SQL, making it easier to port existing queries without modification, while Hive uses the SQL-like HiveQL. All in all, Hive is a good solution, but if the lack of specialized personnel and ease of use are major concerns, Redshift can't be beat.[24]

Top Companies Providing Hive Services

There are many commercial Apache Hive vendors in the market, offering everything from 100% open-source distributions to proprietary implementations. A few of the most popular vendors are:

Amazon Elastic MapReduce (Amazon EMR) is a web service that makes it easy to quickly and cost-effectively process vast amounts of data. Amazon EMR simplifies big-data processing by providing a managed Hadoop framework that distributes and processes your data across dynamically scalable Amazon EC2 instances.[25]

Hortonworks, a Hadoop vendor, features in the list of Top 100 winners of "Red Herring". Hortonworks is a pure-play Hadoop company that drives open-source Hadoop distributions in the IT market. Its main goal is to drive all of its innovations through the open Hadoop data platform and to build an ecosystem of partners that speeds up Hadoop adoption among enterprises.[26] Since its incubation in 2008, Apache Hive has been considered the de facto standard for interactive SQL queries over petabytes of data in Hadoop. With the completion of the Stinger Initiative, and its next phase, Stinger.next, the Apache community has greatly improved Hive's speed, scale, and SQL semantics. Hive integrates easily with other critical data-center technologies using a familiar JDBC interface.

Hive ODBC Driver for HDP 2.3 (v2.0.5): the Hortonworks Hive ODBC Driver allows you to connect popular Business Intelligence (BI) tools to query, analyze, and visualize data stored within the Hortonworks Data Platform. This driver supports HDP 2.3 and includes support for new SQL features added in HDP 2.3, such as UNION DISTINCT and interval types.

Cloudera ranks at the top of the big-data vendor list, having made Hadoop a reliable platform for business use since 2008. Cloudera, founded by a group of engineers from Yahoo, Google, and Facebook, is focused on providing enterprise-ready Hadoop solutions with additional customer support and training. Cloudera has close to 350 paying customers, including the U.S. Army, Allstate, and Monsanto; some of them boast of deploying 1,000 nodes on a Hadoop cluster to crunch big-data analytics over a petabyte of data. Cloudera owes its long-term success to corporate partners Oracle, IBM, HP, NetApp, and MongoDB, which have consistently pushed its services.[27]

As an integrated part of Cloudera’s platform, users can run batch processing workloads with Apache Hive, while also analyzing the same data for interactive SQL or machine-learning workloads using tools like Impala or Apache Spark — all within a single platform.

Hive also benefits from unified resource management (through YARN), simple deployment and administration (through Cloudera Manager), and shared compliance-ready security and governance (through Apache Sentry and Cloudera Navigator), all critical for running in production.

Pivotal HD is 100% Apache Hadoop compliant and supports all Hadoop Distributed File System (HDFS) file formats. Pivotal HD incorporates the common, industry-standardized ODP core, containing components such as HDFS, YARN, and Ambari. Other standard Hadoop components for scripting, non-relational databases, workflow orchestration, security, monitoring, and data processing are included as well.[28]

Pivotal HD deploys Apache Hive for your Hadoop cluster.

Hive is a data warehouse infrastructure built on top of Hadoop. It provides tools to enable easy data ETL, a mechanism to impose structure on the data, and the capability to query and analyze large datasets stored in Hadoop files.

Hive defines a simple SQL query language, called QL, that enables users familiar with SQL to query the data. At the same time, the language allows programmers who are familiar with the MapReduce framework to plug in their custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language.

Forrester rates the Microsoft Hadoop distribution 4/5, based on the vendor's current Hadoop distribution, market presence, and strategy, with Cloudera and Hortonworks scoring 5/5. Microsoft is an IT organization not known for embracing open-source software solutions, but it has made efforts to run this open data platform software on Windows. Microsoft's Hadoop-as-a-service offering is best leveraged through its public cloud product, Windows Azure's HDInsight, developed particularly to run on Azure. Another production-ready Microsoft feature, PolyBase, lets users query data available on SQL Server during the execution of Hadoop queries.[29]

Microsoft® Hive ODBC Driver: a connector to Apache Hive available as part of HDInsight clusters. The driver enables Business Intelligence, analytics, and reporting on data in Apache Hive.

Top 5 Recent Tweets

Date Author Tweet
6 Feb 2015 @cwsteinbach Glad to see that this meetup is focused on ways to improve the quality of Hive. One sign of a healthy project!
15 Nov 2015 @ApacheHive The next Hive Contributors Meetup is happening tomorrow at Hortonworks HQ. It's not too late to RSVP! http://www.meetup.com/Hive-Contributors-
12 Jan 2015 @ApacheHive February Hive Meetup @ LinkedIn just announced! RSVP here: http://www.meetup.com/Hive-User-Group-Meeting/events/219794523/
3 Dec 2014 @ApacheHive SF Bay Area Hive Meetup at LinkedIn happening Tonight! More details here: http://www.meetup.com/Hive-User-Group-Meeting/events/218628646/
19 Jul 2014 @Thembek 27/34 I obviously love #Hive. The reason is simple - @ApacheHive works! SQL coverage is best. Scale is best. Hive is simple the best free

Top 5 Lifetime Tweets

References

  1. https://hive.apache.org/
  2. https://www.qubole.com/blog/big-data/founders-transformation-hadoop/?nabe=6725258257104896:1&utm_source=Quora&utm_medium=Answer&utm_campaign=Gil&utm_content=History-of-Hive
  3. https://archive.apache.org/dist/hive/
  4. http://www.gise.cse.iitb.ac.in/wiki/images/2/26/Hive.pdf
  5. https://courses.cs.ut.ee/MTAT.08.036/2013_fall/uploads/Main/slides5.pdf
  6. http://www.cloudhadoop.com/2011/11/hive-usage.html#.VpRnkvl97IU
  7. https://www.quora.com/What-are-the-benefits-and-drawbacks-of-using-MapReduce-over-Hive
  8. https://www.quora.com/What-are-the-advantages-of-Hive-over-SQL
  9. http://www.gise.cse.iitb.ac.in/wiki/images/2/26/Hive.pdf
  10. https://www.quora.com/What-are-the-advantages-of-Hive-over-SQL
  11. https://courses.cs.ut.ee/MTAT.08.036/2013_fall/uploads/Main/slides5.pdf
  12. https://www.quora.com/What-are-the-benefits-and-drawbacks-of-using-MapReduce-over-Hive
  13. https://www.qubole.com/blog/big-data/hive-best-practices/
  14. https://www.qubole.com/blog/big-data/hive-best-practices/
  15. https://www.qubole.com/blog/big-data/hive-best-practices/
  16. http://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/
  17. http://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/
  18. http://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/
  19. https://www.qubole.com/blog/big-data/hive-best-practices/
  20. http://www.hadooptpoint.com/apache-hive-10-best-practices/
  21. http://www.hadooptpoint.com/apache-hive-10-best-practices/
  22. https://www.qubole.com/blog/big-data/hive-best-practices/
  23. https://www.dezyre.com/article/difference-between-pig-and-hive-the-two-key-components-of-hadoop-ecosystem/79
  24. https://www.flydata.com/blog/hive-and-redshift-a-brief-comparison/
  25. https://www.dezyre.com/article/-top-6-hadoop-vendors-providing-big-data-solutions-in-open-data-platform/93
  26. https://www.dezyre.com/article/-top-6-hadoop-vendors-providing-big-data-solutions-in-open-data-platform/93
  27. https://www.dezyre.com/article/-top-6-hadoop-vendors-providing-big-data-solutions-in-open-data-platform/93
  28. https://www.dezyre.com/article/-top-6-hadoop-vendors-providing-big-data-solutions-in-open-data-platform/93
  29. https://www.dezyre.com/article/-top-6-hadoop-vendors-providing-big-data-solutions-in-open-data-platform/93
