Big Data Analysis Workshop:
Introduction to Hadoop and Spark

Drs. Weijia Xu, Ruizhu Huang and Amit Gupta

Data Mining & Statistics Group
Texas Advanced Computing Center
University of Texas at Austin

Sept. 28-29, 2017
Atlanta, GA
About Instructors
•  Weijia Xu
•  Research Scientist, Group Manager, Data Mining & Statistics Group, TACC
•  Trained computational scientist
•  Ph.D. in CS, UT Austin

•  Ruizhu Huang
•  Trained statistician
•  Ph.D. from the University of Washington

•  Amit Gupta
•  HPC Specialist
•  M.S. from the University of Colorado
Texas Advanced Computing Center (TACC)
•  An organized research unit under the University of Texas at Austin

•  R&D in high performance computing and data-driven analysis

•  Service provider for large cyberinfrastructure
TACC systems:
•  Stampede: HPC jobs; 6400+ nodes, 10 PFlops, 14+ PB storage
•  Maverick: visualization & analysis; interactive access, 132 K40 GPUs
•  Lonestar: HTC jobs; 1800+ nodes, 22000+ cores, 146 GB/node
•  Wrangler: data intensive computations; 10 PB storage, high IOPS
•  Stockyard: shared workspace; 20 PB storage, 1 TB per user
•  Corral: data collections; 6 PB storage, databases, iRODS
•  Rodeo/Chameleon: project workspace and cloud services; user VMs
•  Vis Lab: immersive visualization, collaborative touch screens, 3D
•  Ranch: tape archive; 160 PB tape, 1+ PB access cache
Stampede 2
•  Funded by NSF as a renewal of the original Stampede project.
•  The largest XSEDE resource (and the largest university-based system).
•  Phase 1
•  4,204 Intel Xeon Phi “Knights Landing” (KNL) processors (Intel and Dell)
•  ~20 PB (usable) Lustre filesystem (Seagate), 310 GB/s to scratch
•  Intel Omni-Path fabric, fat tree
•  Ethernet fabric and (some) management infrastructure
•  Phase 2
•  1,736 Intel Xeon (Skylake) processors
About the Data Mining & Statistics Group (DMS) @ TACC
•  Computational resources:
•  Static Hadoop cluster: 40 nodes, ~1 PB HDFS
•  Dynamic Hadoop cluster: 1~40 nodes, 4 TB flash storage per node

•  Data analysis software support
•  Hadoop, Spark
•  Hive, HBase, Drill, Storm…
•  Elasticsearch, Solr, Caffe, TensorFlow…
•  R, Python, Zeppelin…

•  Research & development:
•  Big data analysis
•  Data science collaboration
Big Data Analysis Support at TACC

XSEDE
•  Extreme Science and Engineering Discovery Environment.
•  The most powerful collection of integrated advanced digital resources and services in the world, funded by NSF.
•  Consists of supercomputers, high-end visualization, data analysis, and storage resources around the country.
About This Workshop
•  Goal:
•  Introduce the cyberinfrastructure resources available through TACC and XSEDE.
•  Give an overview of big data analysis using Hadoop and Spark.
•  Help domain scientists get started with the right tools, methods, and systems for big data problems.
•  Help existing HPC users use resources more efficiently for data-intensive computing.
•  Format:
•  A wide range of topics broken into multiple sessions.
•  Sessions are related but not strongly dependent.
•  Previously offered as weekly half-day training sessions.
Workshop Overview
•  Session 1: Introduction to Big Data Analysis
•  An introduction to common models and systems used in big data analysis: Hadoop and Spark.

•  Session 2: Developing a scalable application with Spark
•  Key concepts in Spark programming, Scala programming, and using Spark with Python.

•  Session 3: Data Analysis with Spark
•  Common data mining and machine learning methods and how they are supported through Spark.

•  Session 4: Advanced Topics in Big Data Analysis
•  Drill in and expand out: how things work internally and the big data ecosystem.
•  Bring your own data to play.
Day 1 in Detail
•  Session 1: Introduction to Big Data Analysis with Hadoop/Spark
•  8:30 ~ 9:20 Introduction and workshop overview
•  9:30 ~ 10:20 Hadoop/Spark support at TACC and how to access it
•  10:30 ~ 11:20 Introduction to the Hadoop cluster
•  11:20 ~ 12:30 From Hadoop to Spark
•  Session 2: Developing a scalable application with Spark
•  1:30 ~ 2:30 Introduction to Scala
•  2:30 ~ 2:50 Break / hands-on
•  2:50 ~ 3:50 Data handling in Spark
•  3:50 ~ 4:10 Break / hands-on
•  4:10 ~ 5:10 Using Python with Spark
•  5:10 ~ 5:30 Break / hands-on
An Introduction to Big Data Analysis
What is big data?

•  Big data is
•  a lot of data (Volume)
•  accumulating very fast (Velocity)
•  consisting of different types (Variety)
•  often with a lot of noise (Veracity)

•  How big is big data?
•  Big data is “when you have more data than you can handle.”
Let’s consider a simple example: “wordcount”
•  The problem: counting occurrences of unique words in text
•  The input:
•  A text file
•  The output:
•  a list of
   word, number of times the word occurred

•  How would you solve it?

•  But what if the input text file is
•  1 TB?
•  1 PB…?
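For a file that fits comfortably on one machine, a natural answer is to stream through the text and keep a hash map of counts. A minimal single-machine sketch in Java (not part of the original slides; the whitespace tokenization and the file path taken from the command line are illustrative assumptions):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;

public class SerialWordCount {
    public static void main(String[] args) throws Exception {
        Map<String, Integer> counts = new HashMap<>();
        // read the input file line by line (args[0] is the path to the text file)
        try (BufferedReader reader = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = reader.readLine()) != null) {
                for (String word : line.split("\\s+")) {       // split on whitespace
                    if (!word.isEmpty()) {
                        counts.merge(word, 1, Integer::sum);   // increment the count
                    }
                }
            }
        }
        // print one "word, count" pair per line
        counts.forEach((w, c) -> System.out.println(w + ", " + c));
    }
}

This works until the file no longer fits on one disk, the map no longer fits in memory, or a single pass over the data simply takes too long, which is exactly the situation the following slides address.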
Large Data Set = Troubles
Consider a simple problem: sorting data with bubble sort (O(n²)) versus merge sort (O(n log n) on average).
To sort 10K records of 100 bytes each (1 MB total):
Probably not much difference to the user, regardless of the choice of algorithm, given the capability of the CPU.
To sort 10M records of 100 bytes each (1 GB total):
Merge sort can be significantly faster for the user.
What about 10B records of 100 bytes each (1 TB total)?
Can your code still run without problems?
How about 1 trillion records of 100 bytes each (100 TB total)?
It takes roughly 20+ days just to load from a regular SATA drive. A rough comparison of the operation counts follows below.
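As a back-of-envelope check (not from the slides; the rate of one billion comparisons per second is an illustrative assumption): for the 10-billion-record case, n² is about 10^20 comparisons, while n log2 n is about 10^10 × 33 ≈ 3.3 × 10^11 comparisons. At 10^9 comparisons per second, the quadratic algorithm would need about 10^11 seconds (over 3,000 years), while merge sort would need about 330 seconds, a few minutes. At this scale the choice of algorithm is no longer a matter of convenience.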
What’s the largest text data set you know?
•  Google n-gram dataset
•  http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
•  Basically a word count for all the books Google has
•  But broken out for 1- to 5-grams, per language, per year, per book, per page

•  2.2 TB compressed, ~9 TB uncompressed text
•  ClueWeb09
•  https://lemurproject.org/clueweb09/
•  Text from ~1 billion web pages
•  5 TB compressed, ~25 TB uncompressed
•  How many web pages are there in the world?
•  ~5 billion?
The “data” problem
No algorithm can run without accessing the data (likely from a slow medium) at least once.

Each data item needs to be processed at least once.

O(n): usually the minimum number of operations required.
O(n log n): usually among the best algorithms.
O(n²): can be very costly, but common.
>O(n²): probably infeasible for large-scale data.
Challenges with Big Data Analysis
•  The process takes much longer
•  Typical hard drive read speed is about 150 MB/s,
•  so reading 1 TB takes ~2 hours
•  The complexity of the analysis matters
•  Complexity: the time an algorithm needs to finish as a function of the problem size.
•  e.g. wordcount is a “linear” problem.
•  An analysis could require processing time quadratic in the size of the data.
•  An analysis that takes 1 second on 1 GB of data would then require about 11 days on 1 TB (see the arithmetic below).

•  More computational resources are required
•  Storage: up to triple the size of the raw data to store intermediate files, output, etc.
•  Memory: the algorithm may need to hold more information in the heap, e.g. a pair-wise distance matrix among data points.
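The two estimates above work out as follows (a rough sketch, using a factor of 1,000 between GB and TB): at 150 MB/s, reading 1 TB takes about 1,000,000 MB / 150 MB/s ≈ 6,700 seconds, a bit under 2 hours. For a quadratic algorithm, going from 1 GB to 1 TB multiplies the input size by 1,000 and the running time by 1,000², so 1 second becomes about 1,000,000 seconds, roughly 11.6 days.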
Solution?
•  Let’s use more computers to do it in parallel.

•  A great and simple idea.

•  But how? There are many practical issues:
•  How to distribute the data
•  How to keep track of the data
•  How to coordinate the work of concurrent processes
•  How to scale with different resources
•  How to handle failures and errors
•  …
Solution?
•  Traditional HPC and HTC
•  Optimized for compute-intensive tasks

•  Rely on high-end hardware

•  Not optimized for I/O-intensive tasks
Requirements for Supporting Big Data Analysis
•  Scalability
•  Scale up vs. scale out

•  Elasticity
•  Dynamically changing resources

•  Fault tolerance
•  Computation
•  Data

•  Optimized to reduce data transfer
MapReduce Model for Big Data
•  A programming model proposed and used by Google.

•  Serves as a platform for customized computation over large-scale data.

•  At the core, users implement an interface with two main functions:

   map (in_key, in_value) ->
       (out_key, intermediate_value) list
   reduce (out_key, intermediate_value list) ->
       out_value list

•  Data is treated as a collection of key/value pairs.
Data Representation with Key/Value Pairs
•  Each basic data object is treated as a pair of values
•  The first is the key
•  The second is the value

•  Provides a uniform way to represent data

•  Either key or value can be of any data type

•  The “key” is not an identifier
•  Consider it as representing some aspect of the data, such as a group or label
•  The key might be “missing” or empty
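For example, with Hadoop’s default text input each line of a file arrives at the mapper as a pair (byte offset of the line, text of the line); the wordcount mapper shown later then emits intermediate pairs such as (“hadoop”, 1), where the word is the key and the count is the value.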
Map & Reduce Operations
•  Map
•  The computation to be applied to each data object
•  Passes the computation around instead of the data
•  The result is in the form of key/value pairs

•  Reduce
•  A computation over a set of key/value pairs
•  An aggregation function
•  All data in the set share the same key
•  The result is in the form of key/value pairs
Let’s revisit the “wordcount”
•  The problem: counting occurrences of unique words in text
•  The input:
•  A text file
•  The output:
•  a list of
   word, number of times the word occurred

•  How would you solve it?

•  But what if the input text file is
•  1 TB?
•  1 PB…?
WordCount with MapReduce
Read text files and count how often words occur.
The input is a set of text files.
The output is a text file with one line per word: word, count

Map:
Produce a pair (word, 1) for each word occurrence.
Reduce:
For each key (word), sum up the values (counts).
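As a concrete trace (the sample line is made up), for the input line “to be or not to be”:
map emits:      (to, 1) (be, 1) (or, 1) (not, 1) (to, 1) (be, 1)
shuffle groups: (be, [1, 1]) (not, [1]) (or, [1]) (to, [1, 1])
reduce emits:   (be, 2) (not, 1) (or, 1) (to, 2)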

Mapper - Java
Maps input key/value pairs to a set of intermediate key/value pairs.
A class for individual map tasks to run; one mapper task per InputSplit.
The map function is called automatically for each key/value pair.

public static class Map extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);   // emit (word, 1) for each token
        }
    }
}

Reducer - Java
Reduces the set of values that share the same key to a smaller set.
Each reducer processes the subset generated by the partitioner.
Each reducer generates one output file.

public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {   // sum the counts for this word
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
WordCount main - Java
public static void main(String[] args) throws Exception {
Job job = Job.getInstance(new Configuration(), "word count");

job.setJarByClass(WordCount.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
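Once the three classes are compiled and packaged into a jar, the job can be submitted to a Hadoop cluster with, for example (the jar name here is an illustrative assumption):

hadoop jar wordcount.jar WordCount <input dir> <output dir>

where args[0] is the HDFS input directory and args[1] is an output directory that must not already exist.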

Simple?
•  But ...

•  Where is the parallelism?

•  How is the data split?

•  Who assigns the work to each process?

•  How are the map results given to reduce?

•  Where is the output?

•  …
•  The MapReduce model separates the parallel processing from the analysis.

“Big” Ideas in MapReduce
•  Prefer scale out instead of scale up
•  The same code can run on 1 node or 100 nodes.

•  Hide system details from the user
•  Provides abstractions for writing parallel code.

•  Isolates the developer from (and makes the code independent of) system hardware details

•  Allows clusters consisting of different types of hardware
“Big” Ideas in MapReduce
•  Move computation to the data.
•  Do not use or assume a high-bandwidth interconnect between nodes.
•  Avoid or reduce data transfer over the network, which is often the bottleneck.

•  Data is sliced into chunks and replicated across nodes.

•  A uniform platform for large-scale data access and analysis.
From MapReduce to Hadoop
•  Hadoop: an open source implementation of the MapReduce model

Software stack (top to bottom):
User Applications
Algorithms and libraries
Programming Language
Operating system
