• Ruizhu Huang
• Trained statistician
• Ph.D. from the University of Washington
• Amit Gupta
• HPC Specialist
• M.S. from the University of Colorado
Texas Advanced Computing Center (TACC)
• An organized research unit under the University of Texas at Austin
• R&D in high-performance computing and data-driven analysis
• Lonestar: HTC jobs; 1800+ nodes; 22,000+ cores; 146 GB/node
• Wrangler: data-intensive computations; 10 PB storage; high IOPS
• Stockyard: shared workspace; 20 PB storage; 1 TB per user
• Corral: project workspace; data collections; 6 PB storage; databases; iRODS
• Rodeo/Chameleon: cloud services; user VMs
XSEDE
• Extreme Science and Engineering Discovery Environment.
• The most powerful collection of integrated advanced digital resources and services in the world. Funded by the NSF.
• Consists of supercomputers, high-end visualization, and data analysis and storage resources around the country.
About This Workshop
• Goal:
• Introduce the cyberinfrastructure resources available through TACC and XSEDE.
• Give an overview of big data analysis using Hadoop and Spark.
• Help domain scientists get started with the right tools, methods, and systems for big data problems.
• Help existing HPC users use resources more efficiently with data-intensive computing.
• Format:
• A wide range of topics broken into multiple sessions.
• Sessions are related but not strongly dependent.
• Offered as weekly half-day training sessions in the past.
Workshop Overview
• Session 1 : Introduction to Big Data Analysis
• An introduction to common models and systems used in big data analysis: Hadoop and Spark.
• Big data is
• a lot of data (Volume)
• accumulating very fast (Velocity)
• consisting of different types (Variety)
• often noisy (Veracity)
• Big data is “when you have more data than you can handle”.
Let’s consider a simple
example: the “wordcount”
• The problem: counting the occurrences of each unique word in a text
• The input:
• A text file
• The output:
• a list of
word, number of times the word occurred
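On a single machine, this problem can be solved in a few lines. The following is a minimal plain-Java sketch, not from the slides: the class name `SimpleWordCount` and the sample sentence are illustrative.

```java
import java.util.Map;
import java.util.TreeMap;

// Single-machine wordcount: read words from a string and count
// how many times each unique word occurs (no Hadoop involved yet).
public class SimpleWordCount {
    public static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            if (word.isEmpty()) continue;
            counts.merge(word, 1, Integer::sum); // add 1, or start at 1
        }
        return counts;
    }

    public static void main(String[] args) {
        // Output: each word and the number of times it occurred
        System.out.println(count("the quick fox jumps over the lazy dog the"));
        // {dog=1, fox=1, jumps=1, lazy=1, over=1, quick=1, the=3}
    }
}
```

This works until the text no longer fits on one machine, which is exactly where the MapReduce model below comes in.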
• Elasticity
• Dynamically add or remove resources
• Fault tolerance
• Computation
• Data
• Reduce
• A computation over a set of key/value pairs
• An aggregation function
• All data in the set share the same key
• The result is a key/value pair.
Let’s revisit the “wordcount”
• The problem: counting the occurrences of each unique word in a text
• The input:
• A text file
• The output:
• a list of
word, number of times the word occurred
Map:
Produce pairs of (word, count)
Reduce:
For each key (word), sum up the values (counts).
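The two phases above can be simulated in plain Java. This sketch is not Hadoop code (no parallelism, no distributed shuffle); the class name `MapReduceWordCount` and the sample lines are illustrative. It only shows the shape of the computation: map emits (word, 1) pairs, reduce groups pairs by key and sums the values.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sequential simulation of MapReduce wordcount: map over each input
// line, then reduce (group by key and aggregate) over all emitted pairs.
public class MapReduceWordCount {

    // Map: for each word in a line, emit a (word, 1) pair.
    static List<SimpleEntry<String, Integer>> map(String line) {
        List<SimpleEntry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) pairs.add(new SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    // Shuffle + Reduce: group pairs by key (word), sum the values (counts).
    static Map<String, Integer> reduce(List<SimpleEntry<String, Integer>> pairs) {
        Map<String, Integer> result = new TreeMap<>();
        for (SimpleEntry<String, Integer> p : pairs) {
            result.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] lines = {"to be or", "not to be"};
        List<SimpleEntry<String, Integer>> all = new ArrayList<>();
        for (String line : lines) all.addAll(map(line)); // map phase
        System.out.println(reduce(all));                 // reduce phase
        // {be=2, not=1, or=1, to=2}
    }
}
```

In real Hadoop, many map tasks run in parallel over different input splits, and the framework performs the grouping-by-key step between map and reduce.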
Mapper - Java
Maps input key/value pairs to a set of intermediate key/value pairs.
A class for individual map tasks to run.
One mapper task per InputSplit.
The map function is called automatically for each key/value pair.
// Driver: configure and submit the job
// (Map and Reduce are the user-defined Mapper/Reducer classes).
Job job = Job.getInstance(new Configuration(), "wordcount");
job.setJarByClass(WordCount.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
Simple?
• But ...
• …
• The MapReduce model separates the parallel processing from the analysis logic.
“Big” Ideas in MapReduce
• Prefer scaling out instead of scaling up.
• The same code can run with 1 node or 100 nodes.
“Big” Ideas in MapReduce
• Move computation to the data.
• Do not use or assume a high-bandwidth interconnect between nodes.
• Avoid or reduce data transfer over the network, which is often the bottleneck.
From MapReduce to Hadoop
• Hadoop: an open-source implementation of the MapReduce model
User Applications
Programming Language
Operating System