CS525: Special Topics in DBs

Large-Scale Data Management
Introduction & Logistics
Spring 2013
WPI, Mohamed Eltabakh

Theme of this Course

Large-Scale Data Management

Big Data Analytics


Data Science and Analytics
How to manage very large amounts of data and extract value and knowledge from them

Introduction to Big Data
What is Big Data?
What makes data "Big Data"?

Big Data Definition


No single standard definition

Big Data is data whose scale, diversity, and complexity require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it

Characteristics of Big Data:


1-Scale (Volume)
Data Volume
44x increase from 2009 to 2020
From 0.8 zettabytes to 35 zettabytes

Data volume is increasing exponentially

Exponential increase in
collected/generated data
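
As a quick check of the figures above, the following sketch recomputes the overall growth factor and the compound annual growth rate it implies. Only the two volumes and the two years come from the slide; the assumption of a constant yearly rate is ours, made purely for illustration.

```python
# Figures quoted on the slide: 0.8 ZB in 2009, 35 ZB projected for 2020.
start_zb, end_zb = 0.8, 35.0
years = 2020 - 2009  # 11 years

# Overall multiple between the two data points (35 / 0.8 = 43.75, i.e. roughly 44x).
growth_factor = end_zb / start_zb

# Implied compound annual growth rate, ASSUMING the growth rate is constant each year.
annual_rate = growth_factor ** (1 / years) - 1

print(f"overall growth: {growth_factor:.2f}x")
print(f"implied annual growth: {annual_rate:.1%}")
```

Under that constant-rate assumption, a 44x increase over 11 years corresponds to data volume growing by roughly 41% per year.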

Characteristics of Big Data:


2-Complexity (Variety)
Various formats, types, and structures
Text, numerical, images, audio, video, sequences, time series, social media data, multi-dimensional arrays, etc.
Static data vs. streaming data
A single application can generate or collect many types of data
To extract knowledge, all these types of data need to be linked together

Characteristics of Big Data:


3-Speed (Velocity)
Data is being generated fast and needs to be processed fast
Online Data Analytics
Late decisions lead to missed opportunities
Examples
E-Promotions: based on your current location, your purchase history, and what you like, send promotions right now for the store next to you
Healthcare monitoring: sensors monitor your activities and body; any abnormal measurements require immediate reaction

Big Data: 3Vs

Some Make it 4Vs

Harnessing Big Data

OLTP: Online Transaction Processing (DBMSs)

OLAP: Online Analytical Processing (Data Warehousing)

RTAP: Real-Time Analytics Processing (Big Data Architecture & technology)

Who's Generating Big Data

Mobile devices (tracking all objects all the time)
Social media and networks (all of us are generating data)
Scientific instruments (collecting all sorts of data)
Sensor technology and networks (measuring all kinds of data)

Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion

The Model Has Changed


The Model of Generating/Consuming Data has Changed

Old Model: few companies are generating data, all others are consuming data

New Model: all of us are generating data, and all of us are consuming data

What's driving Big Data

Optimizations and predictive analytics
Complex statistical analysis
All types of data, and many sources
Very large datasets
More of a real-time

Ad-hoc querying and reporting
Data mining techniques
Structured data, typical sources
Small to mid-size datasets
Value of Big Data Analytics


Big data is more real-time in nature than traditional DW applications
Traditional DW architectures (e.g. Exadata, Teradata) are not well-suited for big data apps
Shared-nothing, massively parallel processing, scale-out architectures are well-suited for big data apps

Challenges in Handling Big Data

The bottleneck is in technology
New architecture, algorithms, techniques are needed

Also in technical skills
Experts in using the new technology and dealing with big data

What Technology Do We Have For Big Data?

Big Data Technology

What You Will Learn


We focus on Hadoop/MapReduce technology
Learn the platform (how it is designed and how it works)
How big data is managed in a scalable, efficient way

Learn to write Hadoop jobs in different languages
Programming Languages: Java, C, Python
High-Level Languages: Apache Pig, Hive

Learn advanced analytics tools on top of Hadoop
RHadoop: Statistical tools for managing big data
Mahout: Data mining and machine learning tools over big data

Learn state-of-the-art technology from recent research papers


Optimizations, indexing techniques, and other extensions to Hadoop
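
To give a feel for the MapReduce programming model the course focuses on, here is a minimal word-count sketch in plain Python. It is not Hadoop code and does not use any Hadoop API: the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical, and the whole pipeline runs in a single process. It only illustrates the three conceptual steps (map, group by key, reduce) that a framework like Hadoop distributes across machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key (done by the framework in a real system)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce: combine all values for one key into a single result."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


if __name__ == "__main__":
    sample = ["big data is big", "data is everywhere"]
    print(word_count(sample))
```

In a real deployment the map and reduce functions would run in parallel on different nodes, and the grouping step would move data over the network; this sketch keeps only the logic.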

Course Logistics
Web Page: http://web.cs.wpi.edu/~cs525/s13-MYE/
Electronic WPI system: blackboard.wpi.edu
Lectures
Tuesday, Thursday: (4:00pm - 5:20pm)

Textbook & Reading List
No specific textbook
Big Data is a relatively new topic (so no fixed syllabus)

Reading List
We will cover state-of-the-art technology from research papers in major conferences
Many Hadoop-related papers are available on the course website

Related books:
Hadoop: The Definitive Guide [pdf]

Requirements & Grading
Seminar-Type Course
Students will read research papers and present them (Reading List)

Hands-on Course
No written homework or exams
Several coding projects covering the entire semester

Done in teams of two

Requirements & Grading (Cont'd)
Reviews
When a team is presenting (not the instructor), the other students should prepare a review of the presented paper
The course website gives guidelines on how to write good reviews
Reviews are done individually

Late Submission Policy


For Projects

One day late: 10% off the max grade
Two days late: 20% off the max grade
Three days late: 30% off the max grade
Beyond that, no late submission is accepted
Submissions:
Submitted via blackboard system by the due date
Demonstrated to the instructor within the week after
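
The project late policy above can be sketched as a small function. This is one possible reading only: it assumes "10% off the max grade" means that 10% of the maximum grade is subtracted from the earned grade for each day late, and that lateness is counted in whole days. The function name `late_penalty_grade` and the default `max_grade` of 100 are hypothetical and not part of the course material.

```python
def late_penalty_grade(raw_grade, days_late, max_grade=100.0):
    """Apply the project late policy to a raw grade.

    Returns the adjusted grade, or None when the submission is more than
    three days late and therefore not accepted.
    ASSUMPTION: each late day subtracts 10% of max_grade from raw_grade.
    """
    if days_late < 0:
        raise ValueError("days_late must be non-negative")
    if days_late > 3:
        return None  # beyond three days, no late submission is accepted
    penalty = max_grade * (10 * days_late) / 100
    return max(0.0, raw_grade - penalty)


if __name__ == "__main__":
    for days in range(5):
        print(days, late_penalty_grade(90, days))
```

If the policy instead caps the achievable grade (for example, at 90% of the maximum after one day), the penalty line would need to change accordingly.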

For Reviews
No late submissions
Students may skip at most 4 reviews
Submissions:
Given to the instructor at the beginning of class

More about Projects

A virtual machine is created including the needed platform for the projects
Ubuntu OS (Version 12.10)
Hadoop platform (Version 1.1.0)
Apache Pig (Version 0.10.0)
Mahout library (Version 0.7)
RHadoop
In addition to other software packages

Download it from the course website (link)
Username and password will be sent to you

Need Virtual Box (Vbox) [free]

Next Step from You


1. Form teams of two
2. Visit the course website (Reading List); each team selects its first paper to present (first come, first served)
   Send me your top 2/3 choices
3. You have until Jan 20th
   Otherwise, I'll randomly form teams and assign papers
4. Use the Blackboard discussion forum for posts or for searching for teammates

Course Output: What You Will Learn
We focus on Hadoop/MapReduce technology
Learn the platform (how it is designed and how it works)
How big data is managed in a scalable, efficient way

Learn to write Hadoop jobs in different languages
Programming Languages: Java, C, Python
High-Level Languages: Apache Pig, Hive

Learn advanced analytics tools on top of Hadoop
RHadoop: Statistical tools for managing big data
Mahout: Analytics and data mining tools over big data

Learn state-of-the-art technology from recent research papers
Optimizations, indexing techniques, and other extensions to Hadoop