Apache Hive vs Apache Pig

Apache Hive & Pig try to ease the complexity of writing MapReduce jobs in a programming language like Java by giving the user a set of tools that they may be more familiar with.

What are their Similarities?

  • The raw data is stored in Hadoop’s HDFS and can be in any format, although it is typically a tab-separated text file; internally, both may also make use of Hadoop’s SequenceFile format.

  • Both translate their respective high-level languages into MapReduce jobs

  • Both offer significant reductions in program size over Java

  • Both provide points of extension to cover gaps in functionality

  • Both provide interoperability with other languages

  • Neither supports random reads/writes or low-latency queries

Let’s look into the Differences

PIG

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets. Pig was developed by Yahoo! as a scripting language that can consume any type of data, where the schema of the data can be anything & is defined at load time. The supported data types include:

  • Primitive datatypes

++ Scalars: int, long, float, double
++ Arrays: chararray, bytearray

  • Complex datatypes

++ tuple - An ordered set of fields.
++ bag - A collection of tuples.
++ map - A set of key-value pairs.
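
A minimal Pig Latin sketch of a load-time schema using these types; the file name and field names are hypothetical:

    -- users.txt and its fields are assumed for illustration
    users = LOAD 'users.txt' AS (
        id:int,
        name:chararray,
        score:double,
        tags:bag{t:tuple(tag:chararray)},   -- a bag of tuples
        props:map[]                         -- a map of key-value pairs
    );
    DESCRIBE users;   -- prints the schema declared above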

PIG Components

Pig Latin

Pig’s language layer currently consists of a textual language called Pig Latin, which has the following key properties:

  • Ease of programming. It is trivial to achieve parallel execution of simple, “embarrassingly parallel” data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
  • Optimization opportunities. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
  • Extensibility. Users can create their own functions to do special-purpose processing.

Compiler & Stages of operation

On the FrontEnd

  • The parser transforms a Pig Latin script into a logical plan

  • Semantic checks and optimizations are done in this logical plan

  • The logical plan is then transformed into a physical plan. This physical plan contains the operators that will be applied to the data

  • The MRCompiler then divides this physical plan into a set of MapReduce jobs, producing an MROperPlan

  • This MROperPlan is then optimized

  • Finally, a set of MapReduce jobs is generated by the JobControlCompiler. These are submitted to Hadoop and monitored by the MapReduceLauncher

On the BackEnd

  • Each PigGenericMapReduce.Map, PigCombiner.Combine, and PigGenericMapReduce.Reduce use the pipeline of physical operators constructed in the front end to load, process, and store data
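
These intermediate plans can be inspected from a Pig Latin script with EXPLAIN. A minimal sketch, assuming a hypothetical input file and fields:

    a = LOAD 'input.txt' AS (name:chararray, age:int);
    b = FILTER a BY age > 21;
    c = GROUP b BY name;
    d = FOREACH c GENERATE group, COUNT(b);
    -- prints the logical, physical, and MapReduce plans built by the stages above
    EXPLAIN d;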

Pig has no metadata database; datatypes and schemas are defined within each script.
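
For example, a script can declare the schema at load time or fall back to positional references when no schema is given; the file and field names below are hypothetical:

    with_schema = LOAD 'sales.txt' AS (item:chararray, qty:int, price:double);
    no_schema   = LOAD 'sales.txt';   -- no schema declared; no metastore is consulted
    totals      = FOREACH no_schema GENERATE $0, (int)$1 * (double)$2;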

Join Optimization

  • Replicated Joins

Replicated joins work well if one or more relations are small enough to fit into main memory. In this type of join, the large relation is followed by one or more small relations. The small relations must be small enough to fit into main memory; if they do not, the process fails and an error is generated.

Conditions: Fragment-replicate joins are experimental; we don’t have a strong sense of how small the small relation must be to fit into memory. In a simple test with a query that involves just a JOIN, a relation of up to 100 MB can be used if the process overall gets 1 GB of memory.
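
A minimal sketch of a replicated join; the file names and join key are hypothetical:

    big   = LOAD 'big.txt'   AS (id:int, value:chararray);
    small = LOAD 'small.txt' AS (id:int, label:chararray);
    -- the small relation is loaded into memory on each map task
    joined = JOIN big BY id, small BY id USING 'replicated';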

  • Merge Joins

If both inputs are already sorted on the join key, the data can be joined in the map phase of the MapReduce job. This provides a significant performance improvement compared to passing all of the data through unneeded sort and shuffle phases (see the sketch after the conditions below).

Conditions:

  • The merge join only has two inputs
  • Only inner join is supported
  • Between the load of the sorted input and the merge join statement there can only be filter statements and foreach statements, where the foreach statement should meet the following conditions:

++ There should be no UDFs in the foreach statement
++ The foreach statement should not change the position of the join keys
++ There should be no transformation on the join keys that changes the sort order
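
A minimal sketch of a merge join, assuming both hypothetical inputs are already sorted on id:

    left_sorted  = LOAD 'left_sorted.txt'  AS (id:int, a:chararray);
    right_sorted = LOAD 'right_sorted.txt' AS (id:int, b:chararray);
    -- the join happens in the map phase, skipping the sort and shuffle
    joined = JOIN left_sorted BY id, right_sorted BY id USING 'merge';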

  • Skewed Joins

Skewed join computes a histogram of the key space and uses this data to allocate reducers for a given key. Skewed join does not place a restriction on the size of the input keys. It accomplishes this by splitting the left input on the join predicate and streaming the right input. The left input is sampled to create the histogram. A sketch follows the conditions below.

Conditions:

  • Skewed join works with a two-table inner join; more than two tables are not supported
  • Specifying a three-way (or more) join will fail validation. Such joins have to be broken up into two-way joins
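
A minimal sketch of a skewed join; the file names and join key are hypothetical:

    left_rel  = LOAD 'left.txt'  AS (id:int, a:chararray);
    right_rel = LOAD 'right.txt' AS (id:int, b:chararray);
    -- the left input is sampled and heavily skewed keys are spread across reducers
    joined = JOIN left_rel BY id, right_rel BY id USING 'skewed';
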
No JDBC/ODBC driver

The PigServer class is used to connect to Pig from a Java program

No partitions

Filters can achieve the effect of partitions
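
For example, a filter on a date field can play the role that partition pruning plays in Hive; the file and field names are hypothetical:

    logs  = LOAD 'logs.txt' AS (dt:chararray, level:chararray, msg:chararray);
    today = FILTER logs BY dt == '2015-01-01';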

Server

Pig has no server process and no web UI

User Defined Functions

UDFs in Pig can be implemented by extending any of the abstract classes such as EvalFunc, StoreFunc, LoadFunc, and FilterFunc
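
On the Pig Latin side, a custom function is registered and invoked as sketched below; the jar, package, and class names are hypothetical (the Java class itself would extend EvalFunc):

    REGISTER myudfs.jar;
    DEFINE MyUpper com.example.pig.UpperCase();
    names       = LOAD 'names.txt' AS (name:chararray);
    upper_names = FOREACH names GENERATE MyUpper(name);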