1|
2014, Cognizant
What is Pig
Apache Pig is a platform for analyzing large data sets. It consists of a high-level language for expressing data analysis programs, coupled with
infrastructure for evaluating these programs.
Originally developed as a research project at Yahoo
Goals: flexibility, productivity and maintainability
Now an open source Apache project
Pig is made up of two pieces:
The language used to express data flows, called Pig Latin.
The execution environment to run Pig Latin programs.
Pig Latin
Pig Latin is a data flow language.
Allows users to describe how data from one or more inputs should be read,
processed, and then stored to one or more outputs in parallel.
Why Pig
Opens the system to non-Java programmers.
Provides common operations like join, group, filter, and sort.
Much simpler than Java
10 lines of Pig Latin do roughly the work of 200 lines of Java.
What took 4 hours to write in Java took 15 minutes in Pig Latin.
Early error checking
Chains multiple MR jobs
What is Pig useful for:
ETL data pipelines
Research on raw data
Iterative processing
Pig Architecture
(Diagram) On the user machine:
user -> Pig Latin program -> Parser -> parsed program -> Pig Compiler (cross-job optimizer) -> execution plan -> MR Compiler -> map-reduce jobs
On the Map-Reduce cluster:
map-reduce jobs (e.g. filter, join) -> output
Grunt
Grunt is Pig's interactive shell.
Start in local mode: $ pig -x local
Start in MapReduce mode (the default): $ pig or $ pig -x mapreduce
Exit with quit or Ctrl-D.
Statements run only when Grunt reaches a store or dump.
HDFS commands
grunt> fs -ls
Also cat, copyFromLocal, copyToLocal, rmr
Commands for controlling Pig and MapReduce
kill jobid
exec [-param param_name = param_value] [-param_file filename] script
run [-param param_name = param_value] [-param_file filename] script
Tuple
Ordered collection of Pig data elements.
Tuples are divided into fields.
E.g. ('bob', 55)
Bag
A bag is an unordered collection of tuples.
{('bob', 55), ('sally', 52), ('john', 25)}
Null
A null data element means the value is unknown:
the data is missing or an error occurred while processing it.
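As a rough analogy only (this is plain Python, not Pig code), the three data types above can be modeled with a Python tuple, a list of tuples whose order is ignored, and None. All variable names here are illustrative assumptions.

```python
# Hypothetical Python analogy for Pig's data model; not Pig code.
person = ("bob", 55)          # tuple: an ordered collection of fields
name, age = person            # fields are addressed by position

# bag: a collection of tuples; the list order carries no meaning
bag = [("bob", 55), ("sally", 52), ("john", 25)]

# Two bags holding the same tuples are the same bag once order is ignored.
same_bag = sorted(bag) == sorted(reversed(bag))

# null: the value is unknown (missing, or it failed during processing)
unknown_age = ("alice", None)
known_ages = [t for t in bag + [unknown_age] if t[1] is not None]
```

The last line shows the usual consequence of nulls: records with an unknown value have to be handled explicitly, here by dropping them.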
Relations
Schema
Users can optionally define the schema of the input data.
Once the schema of the source data is given, Pig infers the schemas of all
intermediate relations.
Why schema?
Scripts are more readable (fields are referenced by alias)
Helps the system validate the input
Similar to a database?
Yes, but here the schema is optional.
The schema is not fixed for a particular dataset; it can be changed.
Example:
-- With schema: fields are referenced by alias
A = LOAD 'input/A' AS (name:chararray, age:int);
B = FILTER A BY age != 20;
-- No schema: fields are referenced by position ($0, $1, ...)
A = LOAD 'input/A';
B = FILTER A BY $1 != '20';
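The difference between the two loads can be sketched in plain Python (an emulation, not Pig code). The tab-separated sample lines are an assumption made for illustration; they are not taken from the slides.

```python
# Hypothetical emulation of loading with and without a schema; not Pig code.
raw_lines = ["bob\t20", "sally\t52"]   # assumed tab-separated input

# No schema: every field is an untyped string, addressed by position.
rows = [line.split("\t") for line in raw_lines]
no_schema = [r for r in rows if r[1] != "20"]

# With schema (name:chararray, age:int): fields get names and types.
typed = [{"name": r[0], "age": int(r[1])} for r in rows]
with_schema = [t for t in typed if t["age"] != 20]
```

Both filters keep the same record, but only the typed version can compare the age as a number and refer to it by name.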
Group
The group statement collects together records
with the same key.
C = GROUP A BY f1;
DUMP C;
(1,{(1,2,3)})
(4,{(4,2,1),(4,3,3)})
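What GROUP produces can be emulated in plain Python (a sketch, not Pig code). The input relation is inferred from the DUMP output above; the helper name group_by is hypothetical.

```python
from collections import defaultdict

# Relation A as implied by the DUMP output; f1 is the first field.
A = [(1, 2, 3), (4, 2, 1), (4, 3, 3)]

def group_by(relation, key_index):
    """Emulate GROUP: return (key, bag_of_matching_tuples) pairs."""
    groups = defaultdict(list)
    for record in relation:
        groups[record[key_index]].append(record)
    # Sorted by key only to make the result deterministic.
    return sorted(groups.items())

C = group_by(A, 0)
```

Each result pairs a key with the bag of whole records that share it, which is why the grouped field appears both as the key and inside every tuple of the bag.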
Filter
The filter statement allows you to select
which records will be retained.
B = FILTER A BY a1 > 1;
(4,2,1)
(4,3,3)
Order by
Sorts a relation based on one or more fields.
D = ORDER A BY a3 DESC;
(4,3,3)
(1,2,2)
(4,2,1)
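FILTER and ORDER can be emulated in a few lines of plain Python (a sketch, not Pig code). The input relation is an assumption, chosen so that it reproduces both outputs shown above.

```python
# Assumed input relation with fields (a1, a2, a3).
A = [(1, 2, 2), (4, 2, 1), (4, 3, 3)]

# FILTER A BY a1 > 1: keep records whose first field exceeds 1.
B = [t for t in A if t[0] > 1]

# ORDER A BY a3 DESC: sort on the third field, largest first.
D = sorted(A, key=lambda t: t[2], reverse=True)
```

Filtering drops records but keeps every field of the ones retained, while ordering keeps all records and only changes their sequence.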
Join
Performs an inner join of two or more relations based on common field values.
A:
(1,2,3)
(4,2,1)
(7,2,5)
B:
(2,4)
(1,3)
(4,9)
X = JOIN A BY a1, B BY b1;
DUMP X;
(1,2,3,1,3)
(4,2,1,4,9)
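The inner join above can be emulated in plain Python with a simple hash join (a sketch, not Pig code and not a claim about how Pig executes joins). The helper name inner_join is hypothetical.

```python
from collections import defaultdict

A = [(1, 2, 3), (4, 2, 1), (7, 2, 5)]   # fields (a1, a2, a3)
B = [(2, 4), (1, 3), (4, 9)]            # fields (b1, b2)

def inner_join(left, right, left_key, right_key):
    """Emulate an inner JOIN: concatenate tuples whose keys are equal."""
    index = defaultdict(list)
    for r in right:
        index[r[right_key]].append(r)
    # Left tuples with no matching key contribute nothing (inner join).
    return [l + r for l in left for r in index.get(l[left_key], [])]

X = inner_join(A, B, 0, 0)
```

The tuple (7,2,5) from A and the tuple (2,4) from B have no partner on the join key, so neither appears in the result.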
Parallel:
SQL      Pig            Example
From     Load file(s)   SQL: from X;
Select
Where    Filter         SQL: where col2>2;
                        Pig: C = filter B by col2 > 2;
SQL        Pig                        Example
Group By   Group + foreach generate   SQL: select col1, col2, sum(col3)
                                           from X group by col1, col2;
                                      Pig: D = group A by (col1, col2);
                                           E = foreach D generate flatten(group), SUM(A.col3);
Having     Filter
Order By
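The group-then-aggregate pattern in the example above can be emulated in plain Python (a sketch, not Pig code). The sample rows are hypothetical and serve only to show the shape of the result.

```python
from collections import defaultdict

# Hypothetical rows with columns (col1, col2, col3).
A = [("a", 1, 10), ("a", 1, 5), ("b", 2, 7)]

# group A by (col1, col2), then SUM(A.col3) within each group.
sums = defaultdict(int)
for col1, col2, col3 in A:
    sums[(col1, col2)] += col3

# flatten(group) turns the (col1, col2) key back into top-level fields.
E = sorted(key + (total,) for key, total in sums.items())
```

Without the flatten step, each result would carry the key as a nested tuple, for example (("a", 1), 15) instead of ("a", 1, 15).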
SQL            Pig                   Example
Distinct       Distinct
Distinct Agg   Distinct in foreach   Pig: [...] L = distinct A.col2;
                                          generate flatten(group), SUM(L); }
Example
Pig Latin
users = load 'users.txt' as (name, age);
usersFltrd = filter users by age >= 18 and age <= 25;
Thank you!!!