
Apache Pig Day 1

2014, Cognizant

You Will Learn About:


Day 1:
Pig Introduction
Grunt
Pig's data model
Introduction to Pig Latin
Example
Day 2:
Advanced Pig Latin
Developing Pig Latin scripts
Writing Pig UDFs and execution


What is Pig
Apache Pig is a platform for analyzing large data sets. It consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating those programs.
Originally developed as a research project at Yahoo
Goals: flexibility, productivity and maintainability
Now an open source Apache project
Pig is made up of two pieces:
The language used to express data flows, called Pig Latin.
The execution environment to run Pig Latin programs.


Pig Latin
Pig Latin is a data flow language.
Allows users to describe how data from one or more inputs should be read,
processed, and then stored to one or more outputs in parallel.

Why Pig
Opens the system to non-Java programmers.
Provides common operations like join, group, filter, and sort.
Much simpler than Java
10 lines of Pig Latin ≈ 200 lines of Java.
What took 4 hours to write in Java took 15 minutes in Pig Latin.
Early error checking
Chains multiple MR jobs
What is Pig useful for:
ETL data pipelines
Research on raw data
Iterative processing


The Anatomy of Pig


How Pig works

[Diagram: Pig resides on the user machine; jobs execute on the Hadoop cluster.]

No need to install anything extra on your Hadoop cluster.


Pig Architecture

[Diagram: the user submits a Pig Latin program to the Parser; the parsed program goes to the Pig Compiler and a cross-job optimizer, producing an execution plan; the MR Compiler turns the plan into map-reduce jobs (joins, filters, etc.) that run on the cluster and produce the output.]

Pig Execution Mode


Local Mode
Launch single JVM
Access local file system
Hadoop Mode
Execute a sequence of MR jobs
Pig interacts with Hadoop master node


Grunt
Grunt is Pig's interactive shell.
Local mode: $ pig -x local
MapReduce mode: $ pig (or $ pig -x mapreduce)
Exit with quit or Ctrl-D.
Nothing executes until a store or dump is issued.
HDFS commands:
grunt> fs -ls
Also cat, copyFromLocal, copyToLocal, rmr
Commands for controlling Pig and MapReduce:
kill jobid
exec [-param param_name=param_value] [-param_file filename] script
run [-param param_name=param_value] [-param_file filename] script
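As a sketch, a short Grunt session might look like this (the NYSE_dividends file name follows the later slides; the grouping step is illustrative):

```pig
grunt> fs -ls                      -- browse HDFS from inside Grunt
grunt> divs = LOAD 'NYSE_dividends' AS (exchange, symbol, date, dividend);
grunt> bysym = GROUP divs BY symbol;
grunt> DUMP bysym;                 -- dump (or store) is what triggers execution
```

Note the difference between exec and run: exec runs a script in a separate context, while run executes it as if typed into the current session, so the script's aliases remain available afterwards.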


Running Pig Scripts
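The screenshot on this slide is not reproduced here. As a sketch, a script saved as (say) dividends.pig — file name and paths are illustrative:

```pig
-- dividends.pig: count dividend records per stock symbol
divs  = LOAD 'NYSE_dividends' AS (exchange, symbol, date, dividend);
bysym = GROUP divs BY symbol;
cnts  = FOREACH bysym GENERATE group, COUNT(divs);
STORE cnts INTO 'counts_out';
```

It can then be run locally with `pig -x local dividends.pig`, or on the cluster with `pig dividends.pig`.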


Pig's Data model


Scalar types
int - 4-byte signed integer
long - 8-byte signed integer
float - 32-bit floating point
double - 64-bit floating point
chararray - string (supports escapes like \t, \n, \u0001)
bytearray - byte[]
Complex types

A relation is a bag (more specifically, an outer bag).


A bag is a collection of tuples.
A tuple is an ordered set of fields.
A field is a piece of data.


Pig's Data model


Map
A map is a set of key-value pairs.
Eg: ['name'#'bob', 'age'#55]

Tuple
Ordered collection of Pig data elements.
Tuples are divided into fields.
Eg: ('bob', 55)
Bag
A bag is an unordered collection of tuples.
{('bob', 55), ('sally', 52), ('john', 25)}
Null
A null data element means the value is unknown:
the data is missing or an error occurred in processing it.


Working with Complex Data Types
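The original slide shows a screenshot; as a hedged sketch (the file and field names are invented for illustration), complex types can be declared in a load schema and accessed like this:

```pig
-- A record containing a map, a tuple, and a bag
players = LOAD 'players.txt' AS (name:chararray,
                                 stats:map[],
                                 position:tuple(x:int, y:int),
                                 games:bag{t:tuple(opponent:chararray, score:int)});

info = FOREACH players GENERATE name,
                                stats#'team',   -- map lookup by key
                                position.x,     -- tuple field by name
                                games.score;    -- project a field from every tuple in the bag
```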


Case-Sensitivity in Pig Latin
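The slide's screenshot is omitted; the rules it illustrates — keywords are case-insensitive, while relation aliases and UDF names are not — can be sketched as:

```pig
a = LOAD 'data' AS (f1:int);            -- LOAD, load, and Load are all accepted
b = FILTER a BY f1 > 0;                 -- but 'a' and 'A' would be different aliases
c = GROUP b BY f1;
d = FOREACH c GENERATE group, COUNT(b); -- COUNT must be uppercase: UDF names are case-sensitive
```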


Common Operators in Pig Latin


Relations


Schema
The user can optionally define the schema of the input data.
Once the schema of the source data is given, the schemas of all intermediate relations are inferred by Pig.
Why schema?
Scripts are more readable (fields referenced by alias)
Helps the system validate the input
Similar to a database?
Yes, but the schema here is optional
The schema is not fixed for a particular dataset; it can change
Example:
--Schema 1
A = LOAD 'input/A' as (name:chararray, age:int);
B = FILTER A BY age != 20;
--No Schema
A = LOAD 'input/A';
B = FILTER A BY $1 != 20;

Pig's Data model


Schema
dividends = load 'NYSE_dividends' as (exchange, symbol, date, dividend);
dividends = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividend:float);


Pig's Data model


Pig will do its best to determine data types based on context.
For example, you can calculate a sales commission as price * 0.1;
in this case, Pig will assume the value is of type double.
However, it is better to specify data types explicitly when possible:
Helps with error checking and optimizations
Easiest to do this upon load, using the format fieldname:type

Choosing the right data type is important to avoid loss of precision.


Handling Invalid data
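The slide's example is an image; as a sketch of the usual pattern (field names assumed from earlier slides), invalid records can be handled by casting and then filtering, since a failed cast yields null:

```pig
raw   = LOAD 'NYSE_dividends' AS (exchange:chararray, symbol:chararray,
                                  date:chararray, dividend:chararray);
typed = FOREACH raw GENERATE exchange, symbol, date,
                             (float)dividend AS dividend;
-- A dividend that cannot be cast to float becomes null; drop those rows
clean = FILTER typed BY dividend IS NOT NULL;
```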


Pig's Data model


Cast
B = FOREACH A GENERATE (int)$0 + 1;


Introduction to Pig Latin


Keywords in Pig Latin are not case-sensitive.
Relation names and UDF names are case-sensitive.
Comments:
A = load 'foo'; --this is a single-line comment
/*
* This is a multiline comment.
*/
B = load /* a comment in the middle */'bar';


Introduction to Pig Latin


Introduction to Pig Latin


Load - Loads data from the file system.
Default: a tab-delimited file read with the default load function PigStorage()
Eg: divs = LOAD '/data/examples/NYSE_dividends';
divs = LOAD 'NYSE_dividends' USING HBaseStorage();
divs = LOAD 'NYSE_dividends' USING PigStorage(',');
Store - Stores or saves results to the file system.
STORE A INTO 'myoutput' USING PigStorage('*');
STORE A INTO 'myoutput' USING HBaseStorage();
Dump - Displays the results on the console.


Introduction to Pig Latin


Foreach
foreach takes a set of expressions and applies them to every record.

Group
The group statement collects together records with the same key.

A = LOAD 'data1' AS (a1:int, a2:int, a3:int);
DUMP A;
(1,2,2)
(4,2,1)
(4,3,3)

X = FOREACH A GENERATE a1, a2;
DUMP X;
(1,2)
(4,2)
(4,3)

C = GROUP A BY a1;
DUMP C;
(1,{(1,2,2)})
(4,{(4,2,1),(4,3,3)})

Filter
The filter statement allows you to select which records will be retained.
B = FILTER A BY a1 > 1;
(4,2,1)
(4,3,3)

Order by
Sorts a relation based on one or more fields.
D = ORDER A BY a3 DESC;
(4,3,3)
(1,2,2)
(4,2,1)

Introduction to Pig Latin


Distinct - Removes duplicate tuples.
A:
(8,3,4)
(1,2,3)
(4,3,3)
(4,3,3)
(1,2,3)
X = DISTINCT A;
DUMP X;
(1,2,3)
(4,3,3)
(8,3,4)

Join - Performs an inner join of two or more relations based on common field values.
A:
(1,2,3)
(4,2,1)
(7,2,5)
B:
(2,4)
(1,3)
(4,9)
X = JOIN A BY a1, B BY b1;
(1,2,3,1,3)
(4,2,1,4,9)


Introduction to Pig Latin


Limit:
divs = load 'NYSE_dividends';
first10 = limit divs 10;

Sample:
divs = load 'NYSE_dividends';
some = sample divs 0.1;

Parallel:
daily = load 'NYSE_daily' as (exchange, symbol, date, volume);
bysymbl = group daily by symbol parallel 10;


Translating SQL Queries to Pig latin


From table (SQL) -> Load file(s) (Pig)
SQL: from X;
Pig: A = load 'mydata' using PigStorage('\t') as (col1, col2, col3);

Select (SQL) -> Foreach ... generate (Pig)
SQL: select col1 + col2, col3 ...
Pig: B = foreach A generate col1 + col2, col3;

Where (SQL) -> Filter (Pig)
SQL: select col1 + col2, col3 from X where col2 > 2;
Pig: C = filter B by col2 > 2;


Translating SQL Queries to Pig latin


Group by (SQL) -> Group + foreach ... generate (Pig)
SQL: select col1, col2, sum(col3) from X group by col1, col2;
Pig: D = group A by (col1, col2);
E = foreach D generate flatten(group), SUM(A.col3);

Having (SQL) -> Filter (Pig)
SQL: select col1, sum(col2) from X group by col1 having sum(col2) > 5;
Pig: F = filter E by $1 > 5;

Order by (SQL) -> Order by (Pig)
SQL: select col1, sum(col2) from X group by col1 order by col1;
Pig: H = ORDER E by $0;

Translating SQL Queries to Pig latin


Distinct (SQL) -> Distinct (Pig)
SQL: select distinct col1 from X;
Pig: I = foreach A generate col1;
J = distinct I;

Distinct Agg (SQL) -> Distinct in foreach (Pig)
SQL: select col1, count(distinct col2) from X group by col1;
Pig: K = foreach D {
L = distinct A.col2;
generate flatten(group), COUNT(L); }


Example

Suppose we have user data in one file and website data in another file.
We need to find the top 5 most visited pages by users aged 18-25.


Pig Latin
users = load 'users.txt' as (name, age);
usersFltrd = filter users by age >= 18 and age <= 25;
pages = load 'pages.txt' as (user, url);
combinedData = join usersFltrd by name, pages by user;
grpData = group combinedData by url;
valCnt = foreach grpData generate group, COUNT(combinedData) as hits;
sortData = order valCnt by hits desc;
top5 = limit sortData 5;
store top5 into 'top5sites.txt';


Word count MR vs Pig
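The slide's side-by-side screenshot is omitted. As a sketch, the Pig side of the comparison — the classic word count that replaces dozens of lines of Java MapReduce — might look like this (the input and output paths are illustrative):

```pig
lines  = LOAD 'input.txt' AS (line:chararray);
-- Split each line into words, one word per record
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS cnt;
STORE counts INTO 'wordcount_out';
```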


Thank you!!!

