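MapReduce: the low-level programming model, where you write the map and reduce functions yourself (typically in Java). Geniuses only. If you are on this page, read the next option!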
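For the curious, here is roughly what the model looks like on a single machine. This is a minimal sketch in plain Python, not Hadoop's API: the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample lines are made up for illustration. It counts words, the same task the Pig and Hive examples on this page perform.

```python
import re
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in re.findall(r"\w+", line)]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key (the word)."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups


def reduce_phase(word, counts):
    """Reduce: collapse the values for one key into a single total."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, standing in for files stored in HDFS.
    sample = ["the quick brown fox", "the lazy dog", "the fox"]
    print(word_count(sample))
```

On a real cluster the map and reduce steps run in parallel on many machines and the framework handles the shuffle. The tools below exist so you do not have to write these pieces by hand.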
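Pig: a platform whose scripting language is called Pig Latin. Instead of writing MapReduce code by hand, you describe a data pipeline step by step (load, filter, group, store) and Pig compiles it into jobs for you. Developed at Yahoo. Easy to learn.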
input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);

-- Extract words from each line and put them into a pig bag
-- datatype, then flatten the bag to get one word on each row
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- filter out any words that are just white spaces
filtered_words = FILTER words BY word MATCHES '\\w+';

-- create a group for each word
word_groups = GROUP filtered_words BY word;

-- count the entries in each group
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;

-- order the records by count
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';
Hive: originally built by Facebook, a social networking site (you knew you would learn something). It has a SQL-like language called HiveQL. The queries are translated into MapReduce, Tez, or Spark jobs.
DROP TABLE IF EXISTS docs;
CREATE TABLE docs (line STRING);
LOAD DATA INPATH 'input_file' OVERWRITE INTO TABLE docs;
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
  (SELECT explode(split(line, '\s')) AS word FROM docs) temp
GROUP BY word
ORDER BY word;
Oozie: an orchestration framework that allows you to string together different MapReduce, Pig, and Hive jobs.
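To make "string together" concrete: an orchestrator's core job is to run each step only after the steps it depends on have finished. The sketch below shows that idea in plain Python using the standard-library graphlib module (Python 3.9+). It is not Oozie's API or workflow format, and the job names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each job maps to the set of jobs it must wait for.
workflow = {
    "pig_clean_logs": set(),
    "hive_build_tables": {"pig_clean_logs"},
    "mapreduce_aggregate": {"pig_clean_logs"},
    "hive_report": {"hive_build_tables", "mapreduce_aggregate"},
}


def run_order(dependencies):
    """Return the jobs in an order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


def run_workflow(dependencies, run_job):
    """Call run_job(name) for each job, dependencies first."""
    return [run_job(job) for job in run_order(dependencies)]


if __name__ == "__main__":
    # Stand-in for actually submitting a job to the cluster.
    run_workflow(workflow, lambda job: print("running", job))
```

A real orchestrator adds what this sketch leaves out, such as scheduling, retries, and running independent steps in parallel.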