Quick Hit: Common Ways to Interact with Hadoop

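MapReduce: Hadoop's native programming model, where you write the map and reduce functions yourself (usually in Java). Geniuses only. If you are on this page, read the next option!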
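If you are curious anyway, here is a rough sketch of the idea in plain Python. It is not Hadoop's API, just an illustration of the map, shuffle, and reduce steps on a word count; the function names are made up for this example, and a real job would spread these steps across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reducer: collapse all values for one key into a single total."""
    return word, sum(counts)


def word_count(lines):
    """Run the three phases in order on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["the pig and the hive", "the elephant"]))
```

Pig and Hive below do the same kind of job, but you only describe the result you want and they generate the map and reduce steps for you.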

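Pig: a platform whose scripting language is called Pig Latin. It lets you describe your data transformations one step at a time, a bit like writing SQL in pieces. Originally developed at Yahoo. Easy to learn.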

input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);

-- Extract words from each line and put them into a pig bag
-- datatype, then flatten the bag to get one word on each row
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- filter out any words that are just white spaces
filtered_words = FILTER words BY word MATCHES '\\w+';

-- create a group for each word
word_groups = GROUP filtered_words BY word;

-- count the entries in each group
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;

-- order the records by count
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';

Hive: originally built by Facebook, a social networking site (you knew you would learn something). It has a SQL-like language called HiveQL. The queries are translated into MapReduce, Tez, or Spark jobs.

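DROP TABLE IF EXISTS docs;
CREATE TABLE docs (line STRING);

-- load the raw text, one line per row
LOAD DATA INPATH 'input_file' OVERWRITE INTO TABLE docs;

-- split each line on whitespace (the backslash must be doubled inside
-- a HiveQL string), then count how often each word appears
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
 (SELECT explode(split(line, '\\s')) AS word FROM docs) temp
GROUP BY word
ORDER BY word;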

Oozie: an orchestration framework that allows you to string together different MapReduce, Pig, and Hive jobs.
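To make "string together" a bit more concrete, here is a small Python sketch of the underlying idea: each job lists the jobs it has to wait for, and the scheduler works out a valid running order. This is not Oozie's API (real Oozie workflows are defined in XML), and the job names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical job names; each job maps to the set of jobs it must wait for.
workflow = {
    "pig_clean_logs": set(),
    "hive_word_counts": {"pig_clean_logs"},
    "mapreduce_export": {"hive_word_counts"},
}


def run_order(dependencies):
    """Return one valid execution order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for job in run_order(workflow):
        print("would run:", job)
```

The point is only that you declare the dependencies once and the framework takes care of the ordering (and, in Oozie's case, of scheduling and retries).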

 

 
