
1. Start the Pig command line at the dollar ($) prompt.

$ pig

You will get the Grunt prompt: grunt>

2. List the contents of the logdir directory in HDFS.

ls logdir;

Note that Pig lists the directory under your home directory in HDFS, not on the local file system. You will see the two files, logfile1 and logfile2, created in the previous labs. If they are not there, go back to the previous labs and create them.
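You can also run HDFS commands from inside Grunt. The following should give a similar listing (assuming your Pig version supports the fs command):

fs -ls logdir;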

3. Load logfile1 and logfile2 into Pig aliases:

log1 = load 'logdir/logfile1' using PigStorage();

log2 = load 'logdir/logfile2' using PigStorage();

You will see some warnings on type conversion. You can ignore them.
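Optionally, you can verify the load by dumping a small sample. This check is not part of the lab, and note that the dump launches a MapReduce job, since Pig evaluates lazily:

log1sample = LIMIT log1 5;
dump log1sample;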

4. Since this is unstructured data, the entire line is $0. We will extract the log type from character position 24 up to (but not including) position 28 of the line, giving a four-character log type.

log1type = FOREACH log1 GENERATE substring($0,24,28) as logtype;

This gives an error that the function substring is not found. This is because keywords in Pig are not case sensitive, whereas function names are. So let us modify the statement to use capital letters.

log1type = FOREACH log1 GENERATE SUBSTRING($0,24,28) as logtype;

log2type = FOREACH log2 GENERATE SUBSTRING($0,24,28) as logtype;

This works fine with some conversion warnings.
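To see how the positions line up, consider a hypothetical log4j-style line (illustrative only; the lab's actual log format may differ). SUBSTRING uses 0-based indexing with an inclusive start and an exclusive end:

-- 2015-01-01 10:00:00,123 INFO some.Class: message
-- The timestamp prefix "2015-01-01 10:00:00,123 " occupies positions 0 to 23,
-- so SUBSTRING($0,24,28) returns the characters at positions 24 to 27: 'INFO'.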



5. Filter only the log types INFO, WARNING, and ERROR. Since we took only four characters for the log type, we compare only four letters ('INFO', 'WARN', 'ERRO').

log1f = FILTER log1type by (logtype == 'INFO' OR logtype == 'WARN' OR logtype == 'ERRO');

log2f = FILTER log2type by (logtype == 'INFO' OR logtype == 'WARN' OR logtype == 'ERRO');
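As an alternative sketch (not part of the lab), the same filter can be written with a regular expression. MATCHES requires the whole value to match the pattern:

log1f = FILTER log1type by logtype MATCHES 'INFO|WARN|ERRO';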

6. Since we want to count the number of occurrences of each log type, let us group by the log type.

log1grp = group log1f by logtype;

log2grp = group log2f by logtype;

7. Let us check the structure of log1grp and log2grp. You will see that they are nested
structures with log1f nested as a bag inside log1grp and log2f nested as a bag inside
log2grp.

describe log1grp;

describe log2grp;
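The output should look roughly like this (exact formatting varies by Pig version; logtype is chararray because SUBSTRING returns one):

log1grp: {group: chararray,log1f: {(logtype: chararray)}}
log2grp: {group: chararray,log2f: {(logtype: chararray)}}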

8. Now generate the count of each log type:

log1cnt = foreach log1grp generate group as logtype, COUNT(log1f.logtype) as cnt;

log2cnt = foreach log2grp generate group as logtype, COUNT(log2f.logtype) as cnt;

Note that we have used lowercase letters for “foreach” and “generate” as they are keywords, whereas “COUNT” has to be uppercase as it is a function.
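As an aside, Pig also provides COUNT_STAR, which counts every tuple in the bag, whereas COUNT skips tuples whose first field is null. An equivalent sketch for this data (either works here, as there are no nulls after the filter):

log1cnt = foreach log1grp generate group as logtype, COUNT_STAR(log1f) as cnt;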



9. We will store these counts into HDFS, using a comma as the delimiter.

store log1cnt into 'log1cnt' using PigStorage(',');

store log2cnt into 'log2cnt' using PigStorage(',');

Note that Pig uses lazy processing, so a MapReduce job is created only when it sees a dump
or a store command. While processing log1cnt, only the aliases pertaining to log1 are
processed. Similarly, during log2cnt processing, only the aliases for log2 are processed. This
optimization is done by Pig.
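If you want to inspect the plan Pig builds without running it, use the explain command, which prints the logical, physical, and MapReduce plans for an alias:

explain log1cnt;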

10. List the contents of log1cnt and log2cnt.


ls log1cnt
ls log2cnt

You can see that the output is similar to MapReduce output: the output is a directory containing an empty file _SUCCESS and an output file part-r-00000.

11. Check the content of these files:


cat log1cnt/part-r-00000
cat log2cnt/part-r-00000
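Since we stored with PigStorage(','), each line is a comma-separated log type and count. The output will look something like this (values are illustrative only; yours will differ):

INFO,120
WARN,35
ERRO,12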

12. Reload these files into the same aliases as before. These will be read as structured data, so provide the delimiter as well as the schema. We can also specify the directory itself, and Pig will load all the files in the directory. Since _SUCCESS is an empty file, this will not be a problem for us.

log1cnt = load 'log1cnt' using PigStorage(',') as (logtype1: chararray, cnt1: chararray);

log2cnt = load 'log2cnt' using PigStorage(',') as (logtype2: chararray, cnt2: chararray);
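As a variation (not what the lab uses), the counts could be loaded as long instead of chararray, since COUNT returns a long; this avoids casts in any later arithmetic:

log1cnt = load 'log1cnt' using PigStorage(',') as (logtype1: chararray, cnt1: long);
log2cnt = load 'log2cnt' using PigStorage(',') as (logtype2: chararray, cnt2: long);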

13. Join the two log type relations to produce a third relation. Use a full outer join, as we want records from both relations even if there is no match. Note that the join keys must use the field names from the schemas above, logtype1 and logtype2.

logtotal = JOIN log1cnt by logtype1 FULL OUTER, log2cnt by logtype2;



14. Check the structure of the logtotal relation.
describe logtotal;
Note that it has columns from both the log1cnt and log2cnt relations.
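The describe output should look roughly like this, with each field prefixed by the relation it came from (exact formatting varies by Pig version):

logtotal: {log1cnt::logtype1: chararray,log1cnt::cnt1: chararray,log2cnt::logtype2: chararray,log2cnt::cnt2: chararray}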

15. Check the content of the logtotal relation using the dump command. Note that Pig starts MapReduce jobs only on dump or store commands.

dump logtotal;

Note that Pig processes the data from the files upon dump. The output contains fields from both relations.
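As an optional extension (not part of the lab), you could combine the two counts into a single total per log type. The bincond operator (?:) handles the nulls produced by the full outer join, and the casts are needed because the counts were reloaded as chararray:

logsum = foreach logtotal generate
    (logtype1 is not null ? logtype1 : logtype2) as logtype,
    ((cnt1 is null ? 0L : (long)cnt1) + (cnt2 is null ? 0L : (long)cnt2)) as total;
dump logsum;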

16. Exit Pig using the quit command.

quit;

