
Apache Pig Day 1

2014, Cognizant

You Will Learn About:


Day 1:
Pig Introduction
Grunt
Pig's data model
Introduction to Pig Latin
Example
Day 2:
Advanced Pig Latin
Developing Pig Latin scripts
Writing Pig UDFs and execution


What is Pig
Apache Pig is a platform for analyzing large data sets. It consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating those programs.
Originally developed as a research project at Yahoo
Goals: flexibility, productivity and maintainability
Now an open source Apache project
Pig is made up of two pieces:
The language used to express data flows, called Pig Latin.
The execution environment to run Pig Latin programs.


Pig Latin
Pig Latin is a data flow language.
Allows users to describe how data from one or more inputs should be read,
processed, and then stored to one or more outputs in parallel.

Why Pig
Opens the system to non-Java programmers.
Provides common operations like join, group, filter, and sort.
Much simpler than Java
10 lines of Pig Latin ≈ 200 lines of Java.
What took 4 hours to write in Java took 15 minutes in Pig Latin.
Early error checking
Chains multiple MR jobs
What is Pig useful for:
ETL data pipelines
Research on raw data
Iterative processing


The Anatomy of Pig


How Pig works

[Diagram: Pig resides on the user machine; jobs execute on the Hadoop cluster.]

No need to install anything extra on your Hadoop cluster.


Pig Architecture

[Diagram: the user submits a Pig Latin program to the Parser; the parsed program goes to the Pig Compiler and a cross-job optimizer, producing an execution plan; the MR Compiler turns the plan into map-reduce jobs (joins, filters, etc.) that run on the cluster and produce the output.]

Pig Execution Mode


Local Mode
Launch single JVM
Access local file system
Hadoop Mode
Execute a sequence of MR jobs
Pig interacts with Hadoop master node


Grunt
Grunt is Pig's interactive shell.
Local mode: $ pig -x local
MapReduce mode: $ pig (or $ pig -x mapreduce)
Exit with quit or Ctrl-D.
Nothing executes until a store or dump is issued.
HDFS commands:
grunt> fs -ls
Also cat, copyFromLocal, copyToLocal, rmr
Commands for controlling Pig and MapReduce:
kill jobid
exec [-param param_name=param_value] [-param_file filename] script
run [-param param_name=param_value] [-param_file filename] script
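As a sketch, a short Grunt session might look like this (the NYSE_dividends file name follows the later slides; the grouping step is illustrative):

```pig
grunt> fs -ls                      -- browse HDFS from inside Grunt
grunt> divs = LOAD 'NYSE_dividends' AS (exchange, symbol, date, dividend);
grunt> bysym = GROUP divs BY symbol;
grunt> DUMP bysym;                 -- dump (or store) is what triggers execution
```

Note the difference between exec and run: exec runs a script in a separate context, while run executes it as if typed into the current session, so the script's aliases remain available afterwards.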


Running Pig Scripts
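The screenshot on this slide is not reproduced here. As a sketch, a script saved as (say) dividends.pig — file name and paths are illustrative:

```pig
-- dividends.pig: count dividend records per stock symbol
divs  = LOAD 'NYSE_dividends' AS (exchange, symbol, date, dividend);
bysym = GROUP divs BY symbol;
cnts  = FOREACH bysym GENERATE group, COUNT(divs);
STORE cnts INTO 'counts_out';
```

It can then be run locally with `pig -x local dividends.pig`, or on the cluster with `pig dividends.pig`.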


Pig's Data model


Scalar types
int - 4-byte signed integer
long - 8-byte signed integer
float - 32-bit floating point
double - 64-bit floating point
chararray - string (supports escapes like \t, \n, \u0001)
bytearray - byte[]
Complex types

A relation is a bag (more specifically, an outer bag).


A bag is a collection of tuples.
A tuple is an ordered set of fields.
A field is a piece of data.


Pig's Data model


Map
A map is a set of key-value pairs.
Eg: ['name'#'bob', 'age'#55]

Tuple
Ordered collection of Pig data elements.
Tuples are divided into fields.
Eg: ('bob', 55)
Bag
A bag is an unordered collection of tuples.
{('bob', 55), ('sally', 52), ('john', 25)}
Null
A null data element means the value is unknown:
the data is missing or an error occurred in processing it.


Working with Complex Data Types
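The original slide shows a screenshot; as a hedged sketch (the file and field names are invented for illustration), complex types can be declared in a load schema and accessed like this:

```pig
-- A record containing a map, a tuple, and a bag
players = LOAD 'players.txt' AS (name:chararray,
                                 stats:map[],
                                 position:tuple(x:int, y:int),
                                 games:bag{t:tuple(opponent:chararray, score:int)});

info = FOREACH players GENERATE name,
                                stats#'team',   -- map lookup by key
                                position.x,     -- tuple field by name
                                games.score;    -- project a field from every tuple in the bag
```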


Case-Sensitivity in Pig Latin
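The slide's screenshot is omitted; the rules it illustrates — keywords are case-insensitive, while relation aliases and UDF names are not — can be sketched as:

```pig
a = LOAD 'data' AS (f1:int);            -- LOAD, load, and Load are all accepted
b = FILTER a BY f1 > 0;                 -- but 'a' and 'A' would be different aliases
c = GROUP b BY f1;
d = FOREACH c GENERATE group, COUNT(b); -- COUNT must be uppercase: UDF names are case-sensitive
```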


Common Operators in Pig Latin


Relations


Schema
The user can optionally define the schema of the input data.
Once the schema of the source data is given, the schemas of all intermediate relations are inferred by Pig.
Why schema?
Scripts are more readable (fields referenced by alias)
Helps the system validate the input
Similar to a database?
Yes, but the schema here is optional
The schema is not fixed for a particular dataset; it can change
Example:
--Schema 1
A = LOAD 'input/A' as (name:chararray, age:int);
B = FILTER A BY age != 20;
--No Schema
A = LOAD 'input/A';
B = FILTER A BY $1 != 20;

Pig's Data model


Schema
dividends = load 'NYSE_dividends' as (exchange, symbol, date, dividend);
dividends = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividend:float);


Pig's Data model


Pig will do its best to determine data types based on context.
For example, you can calculate a sales commission as price * 0.1;
in this case, Pig will assume the value is of type double.
However, it is better to specify data types explicitly when possible:
Helps with error checking and optimizations
Easiest to do this upon load, using the format fieldname:type

Choosing the right data type is important to avoid loss of precision.


Handling Invalid data
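The slide's example is an image; as a sketch of the usual pattern (field names assumed from earlier slides), invalid records can be handled by casting and then filtering, since a failed cast yields null:

```pig
raw   = LOAD 'NYSE_dividends' AS (exchange:chararray, symbol:chararray,
                                  date:chararray, dividend:chararray);
typed = FOREACH raw GENERATE exchange, symbol, date,
                             (float)dividend AS dividend;
-- A dividend that cannot be cast to float becomes null; drop those rows
clean = FILTER typed BY dividend IS NOT NULL;
```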


Pig's Data model


Cast
B = FOREACH A GENERATE (int)$0 + 1;


Introduction to Pig Latin


Keywords in Pig Latin are not case-sensitive.
Relation names and UDF names are case-sensitive.
Comments:
A = load 'foo'; --this is a single-line comment
/*
* This is a multiline comment.
*/
B = load /* a comment in the middle */'bar';


Introduction to Pig Latin


Introduction to Pig Latin


Load - Loads data from the file system.
Default: a tab-delimited file read with the default load function PigStorage()
Eg: divs = LOAD '/data/examples/NYSE_dividends';
divs = LOAD 'NYSE_dividends' USING HBaseStorage();
divs = LOAD 'NYSE_dividends' USING PigStorage(',');
Store - Stores or saves results to the file system.
STORE A INTO 'myoutput' USING PigStorage('*');
STORE A INTO 'myoutput' USING HBaseStorage();
Dump - Displays the results on the console.


Introduction to Pig Latin


Foreach
foreach takes a set of expressions and applies them to every record.

Group
The group statement collects together records with the same key.

A = LOAD 'data1' AS (a1:int, a2:int, a3:int);
DUMP A;
(1,2,2)
(4,2,1)
(4,3,3)

X = FOREACH A GENERATE a1, a2;
DUMP X;
(1,2)
(4,2)
(4,3)

C = GROUP A BY a1;
DUMP C;
(1,{(1,2,2)})
(4,{(4,2,1),(4,3,3)})

Filter
The filter statement allows you to select which records will be retained.
B = FILTER A BY a1 > 1;
(4,2,1)
(4,3,3)

Order by
Sorts a relation based on one or more fields.
D = ORDER A BY a3 DESC;
(4,3,3)
(1,2,2)
(4,2,1)

Introduction to Pig Latin


Distinct - Removes duplicate tuples.
A:
(8,3,4)
(1,2,3)
(4,3,3)
(4,3,3)
(1,2,3)
X = DISTINCT A;
DUMP X;
(1,2,3)
(4,3,3)
(8,3,4)

Join - Performs an inner join of two or more relations based on common field values.
A:
(1,2,3)
(4,2,1)
(7,2,5)
B:
(2,4)
(1,3)
(4,9)
X = JOIN A BY a1, B BY b1;
(1,2,3,1,3)
(4,2,1,4,9)


Introduction to Pig Latin


Limit:
divs = load 'NYSE_dividends';
first10 = limit divs 10;

Sample:
divs = load 'NYSE_dividends';
some = sample divs 0.1;

Parallel:
daily = load 'NYSE_daily' as (exchange, symbol, date, volume);
bysymbl = group daily by symbol parallel 10;


Translating SQL Queries to Pig latin


From table (SQL) -> Load file(s) (Pig)
SQL: from X;
Pig: A = load 'mydata' using PigStorage('\t') as (col1, col2, col3);

Select (SQL) -> Foreach ... generate (Pig)
SQL: select col1 + col2, col3 ...
Pig: B = foreach A generate col1 + col2, col3;

Where (SQL) -> Filter (Pig)
SQL: select col1 + col2, col3 from X where col2 > 2;
Pig: C = filter B by col2 > 2;


Translating SQL Queries to Pig latin


Group by (SQL) -> Group + foreach ... generate (Pig)
SQL: select col1, col2, sum(col3) from X group by col1, col2;
Pig: D = group A by (col1, col2);
E = foreach D generate flatten(group), SUM(A.col3);

Having (SQL) -> Filter (Pig)
SQL: select col1, sum(col2) from X group by col1 having sum(col2) > 5;
Pig: F = filter E by $1 > 5;

Order by (SQL) -> Order by (Pig)
SQL: select col1, sum(col2) from X group by col1 order by col1;
Pig: H = ORDER E by $0;

Translating SQL Queries to Pig latin


Distinct (SQL) -> Distinct (Pig)
SQL: select distinct col1 from X;
Pig: I = foreach A generate col1;
J = distinct I;

Distinct Agg (SQL) -> Distinct in foreach (Pig)
SQL: select col1, count(distinct col2) from X group by col1;
Pig: K = foreach D {
L = distinct A.col2;
generate flatten(group), COUNT(L); }


Example

Suppose we have user data in one file and website data in another file.
We need to find the top 5 most visited pages by users aged 18-25.


Pig Latin
users = load 'users.txt' as (name, age);
usersFltrd = filter users by age >= 18 and age <= 25;
pages = load 'pages.txt' as (user, url);
combinedData = join usersFltrd by name, pages by user;
grpData = group combinedData by url;
valCnt = foreach grpData generate group, COUNT(combinedData) as hits;
sortData = order valCnt by hits desc;
top5 = limit sortData 5;
store top5 into 'top5sites.txt';


Word count MR vs Pig
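The slide's side-by-side screenshot is omitted. As a sketch, the Pig side of the comparison — the classic word count that replaces dozens of lines of Java MapReduce — might look like this (the input and output paths are illustrative):

```pig
lines  = LOAD 'input.txt' AS (line:chararray);
-- Split each line into words, one word per record
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS cnt;
STORE counts INTO 'wordcount_out';
```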


Thank you!!!

