2 System G Team 2016 IBM Corporation
2015
Massive Parallelism
Huge Data Volumes Storage
Data Distribution
High-Speed Networks
High-Performance Computing
Task and Thread Management
Data Mining and Analytics
Data Retrieval
Machine Learning
Data Visualization
http://wikibon.org/wiki/v/Big_Data_Vendor_Revenue_and_Market_Forecast_2013-2017
14 2016 CY Lin, Columbia University
E6893 Big Data Analytics Lecture 1: Overview
Big Data Market further breakdown
http://wikibon.org/wiki/v/Big_Data_Database_Revenue_and_Market_Forecast_2012-2017
USD: billions
Course Grading
3 Homeworks: 50%
-- Individual work; Language Requirement: Java, JavaScript, Python, C/C++, Perl
-- Report and source code
Data Store & Processing using Hadoop
Recommendation, Clustering, Classification using Spark
Graph Database & Analytics and Machine Reasoning using System G
Course Information
Website:
http://www.ee.columbia.edu/~cylin/course/bigdata/
Textbook:
-- None, but reference book(s) and/or articles/papers will be provided each lecture.
Sapphirine Big Data Analytics Open Source Applications
Goal: Create Big Data open source toolsets for various industries (and disciplines)
Professor Lin:
Office Hours:
Thursday 9:40pm - 10:00pm (SIPA 417, lecture room)
Contact: c {dot} lin {at} columbia {dot} edu (the same as <cl300>)
Telephone: 914-945-1897
TAs:
Eric Johnson (efj2106), Munan Cheng (munan.cheng), Kushwanth Shantharam
(kk3098), Rohan Kulkarni (rohan.kulkarni), Gautam Sihag (gautam.sihag), Peiran Zhou
(pz2210), Emily Yao (dy2307), Chuwen Xu (cx2178), and TBA.
Apache Hadoop
The Apache Hadoop project develops open-source software for reliable, scalable,
distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing
of large data sets across clusters of computers using simple programming models. It is
designed to scale up from single servers to thousands of machines, each offering local
computation and storage. Rather than rely on hardware to deliver high-availability, the
library itself is designed to detect and handle failures at the application layer, so delivering
a highly-available service on top of a cluster of computers, each of which may be prone to
failures.
http://hadoop.apache.org
Hadoop-related Apache Projects
http://hortonworks.com/hadoop/hdfs/
MapReduce example
http://www.alex-hanna.com
MapReduce Data Flow
http://www.ibm.com/developerworks/cloud/library/cl-openstack-deployhadoop/
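A MapReduce job is conventionally described as three phases: map, shuffle (group by key), and reduce. The sketch below walks through them for word count in a single Python process. It is plain Python rather than the Hadoop API, and the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are illustrative assumptions.

```python
from collections import defaultdict


def map_phase(documents):
    """Emit a (word, 1) pair for every word in every input document."""
    for doc in documents:
        for word in doc.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Group emitted values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    """Chain the three phases into one job."""
    return reduce_phase(shuffle_phase(map_phase(documents)))


if __name__ == "__main__":
    print(word_count(["big data analytics", "big graph analytics"]))
```

In a real cluster the map and reduce functions run on many machines and the shuffle moves data over the network; here all three run in order in one process so the data flow is easy to follow.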
In-Memory Computing Apache Spark
27 E6895 Advanced Big Data Analytics Lecture 3: Spark and Data Analytics 2015 CY Lin, Columbia University
Spark Core
Home to the API that defines resilient distributed datasets (RDDs) - Spark's main
programming abstraction.
RDD represents a collection of items distributed across many compute nodes that can be
manipulated in parallel.
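To make the idea of a partitioned collection concrete, here is a toy model in plain Python. It is not Spark's API: the class name `ToyRDD` and its methods are hypothetical, and threads on one machine stand in for compute nodes.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyRDD:
    """Hypothetical stand-in for an RDD: a list of partitions processed in parallel."""

    def __init__(self, partitions):
        self.partitions = [list(p) for p in partitions]

    @classmethod
    def parallelize(cls, items, num_partitions):
        items = list(items)
        # Round-robin split so every partition gets a share of the data.
        return cls(items[i::num_partitions] for i in range(num_partitions))

    def map(self, fn):
        # Each partition is transformed independently of the others.
        with ThreadPoolExecutor() as pool:
            mapped = list(pool.map(lambda part: [fn(x) for x in part], self.partitions))
        return ToyRDD(mapped)

    def reduce(self, fn, initial):
        # Fold within each partition, then fold the partial results.
        # `initial` must be an identity element for fn (e.g. 0 for addition).
        def fold(values):
            acc = initial
            for x in values:
                acc = fn(acc, x)
            return acc

        return fold([fold(part) for part in self.partitions])

    def collect(self):
        return [x for part in self.partitions for x in part]


if __name__ == "__main__":
    rdd = ToyRDD.parallelize(range(1, 11), num_partitions=3)
    squares = rdd.map(lambda x: x * x)
    print(squares.reduce(lambda a, b: a + b, 0))
```

The two-level fold in `reduce` mirrors why distributed reductions need an associative operation: partial results are computed per partition and only then combined.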
Judgement
Perception
Reasoning
Strategy
Observation
Memory
System G Graph Computing for Machine Intelligence
IBM System G: a brand name approved by HQ April 2014.
Based on 30+ graph-related projects; 150+ papers; ~40 patents; ~10 & best paper awards; ~$25M Research funding
[Figure labels: Judgement, Perception, Reasoning, Strategy, Observation, Memory]
Visualization: Huge Network Visualization, Network Propagation, Dynamic Network Visualization, Geo Network Visualization, Graphical Model Visualization
Middleware: In-Memory Graph RT Library, Multi-Core Multi-Thread Graph RT Library, Distributed Graph RT Library, GPU Graph Computing Driver
System G Solutions
Analytic Engines: Model validation, Entity, process, Social net analysis, Predictive analysis, Peer group analysis, Sim. what/if analysis
Mobile Cognition Enabling AI right on the Edge
Created novel graph computing and deep learning framework on iOS devices and NAOqi
robots including:
generic object recognition, event recognition, face recognition, visual sentiment
recognition, and document recognition
graph database
Prototype summer 2016 and first version release 3Q2016
Challenges
Multi-Scale Deep Convolutional Neural Network for Fast Object Detection
Demo: Detecting Cars and Pedestrians in complex scenario
Example - Generic Object Recognition running
Demo - CIDetector
Game Theory Tool to construct Strategy
Complete information
Incomplete information
Solve equilibriums
Optimization
Convex/Nonconvex Optimization
Learning
Distributed learning
Mechanism design
Decide when/what/how to communicate with customers
Investment Advisory
Graph Visualizations
Social Media Solution
NoSQL Database
Key-Value Store
Document Store
Tabular Store
Object Database
Graph Database (property graphs, RDF graphs)
Property Graphs
RDF Graphs
[Figure: example graph linking people (Larry Page, born 1973, home Palo Alto; Charles Flint, born 1850, died 1934; Steve Jobs, born 1955, died 2011) as founders or board members of companies (Google, HQ Mountain View, 54,604 employees; IBM, HQ Armonk, 433,362 employees; Apple, HQ Cupertino, 80,000 employees), the companies' industries (Internet, Software, Hardware, Services), and the software they develop (Android version 4.1, preceded by 4.0, Linux kernel, OpenGL graphics; iOS version 7.1, preceded by 7.0, XNU kernel).]
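The two graph models named on this slide can be contrasted with a small sketch in plain Python. A few facts legible in the example figure (Larry Page, Google, Mountain View) serve as data. The edge labels `founder` and `HQ`, the vertex identifiers, and the `ex:` prefix are assumptions, since the figure's labels did not survive extraction intact.

```python
# Property graph: vertices and edges each carry their own key/value properties.
vertices = {
    "larry_page": {"label": "Person", "name": "Larry Page", "born": 1973},
    "google": {"label": "Company", "name": "Google"},
    "mountain_view": {"label": "City", "name": "Mountain View"},
}
edges = [
    {"src": "larry_page", "dst": "google", "label": "founder"},
    {"src": "google", "dst": "mountain_view", "label": "HQ"},
]


def to_triples(vertices, edges, prefix="ex:"):
    """Flatten a property graph into RDF-style (subject, predicate, object) triples."""
    triples = []
    for vid, props in vertices.items():
        for key, value in props.items():
            # Each vertex property becomes a triple with a literal object.
            triples.append((prefix + vid, prefix + key, value))
    for edge in edges:
        # Each edge becomes a triple whose object is another resource.
        triples.append((prefix + edge["src"], prefix + edge["label"], prefix + edge["dst"]))
    return triples


def objects(triples, subject, predicate):
    """Answer the triple pattern (subject, predicate, ?)."""
    return [o for s, p, o in triples if s == subject and p == predicate]


if __name__ == "__main__":
    triples = to_triples(vertices, edges)
    print(objects(triples, "ex:google", "ex:HQ"))
```

The property graph keeps attributes attached to the vertex or edge they describe, while the triple form expresses everything, attributes and relationships alike, as uniform statements.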
Big Data Analytics Example Use Cases
1. Expertise Location
2. Recommendation
3. Commerce
4. Financial Analysis
5. Social Media Monitoring
6. Telco Customer Analysis
7. Watson
8. Data Exploration and Visualization
9. Personalized Search
10. Anomaly Detection (Espionage, Sabotage, etc.)
11. Fraud Detection
12. Cybersecurity
13. Sensor Monitoring (Smarter another Planet)
14. Cellular Network Monitoring
15. Cloud Monitoring
16. Code Life Cycle Management
17. Traffic Navigation
18. Image and Video Semantic Understanding
19. Genomic Medicine
20. Brain Network Analysis
21. Data Curation
22. Near Earth Object Analysis
Dynamic networks of 400,000+ IBMers:
-- Featured in BusinessWeek four times, including as the Top Story of the Week, April 2009
-- Helped IBM earn the 2012 Most Admired Knowledge Enterprise Award
-- Wharton School study: $7,010 gain per user per year using the tool
-- In 2012, contributed about 1/3 of the GBS Practitioner Portal's $228.5 million savings and benefits
-- APQC (WW leader in Knowledge Practice), April 2013: The Industry Leader and Best Practice in Expertise Location
Enhancing: Graph Visualizations, Shortest Paths, Social Capital, Bridges, Hubs, Expertise Search, Graph Search, Graph Recomm.
Finding and Ranking Expertise Social Network Analysis
Decades of Social Science studies demonstrate that (social) network structure is the key indicator determining a
person's influence, organizational operation efficiency, social capital to get help, potential to be successful, etc.
Who are the key bridges? Who has the most connections? How do these experts cluster?
Analogy: Google's founders applied the concept of network analysis to webpages to create a ranking.
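As a rough illustration of the questions above, the sketch below computes degree (who has the most connections) and a PageRank-style score by power iteration on a small made-up network. The people, the edges, the damping factor of 0.85, and the iteration count are illustrative assumptions; this is a textbook formulation, not the method used by the system described in the lecture.

```python
def degree(graph):
    """Number of neighbours per person in an undirected graph of adjacency sets."""
    return {node: len(neighbours) for node, neighbours in graph.items()}


def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank; assumes an undirected graph with no isolated nodes."""
    n = len(graph)
    rank = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        new_rank = {}
        for node in graph:
            # Each neighbour passes on an equal share of its current rank.
            incoming = sum(rank[nb] / len(graph[nb]) for nb in graph[node])
            new_rank[node] = (1 - damping) / n + damping * incoming
        rank = new_rank
    return rank


# Hypothetical network: two triangles connected through "dana".
graph = {
    "ann": {"bob", "cat", "dana"},
    "bob": {"ann", "cat"},
    "cat": {"ann", "bob"},
    "dana": {"ann", "eve"},
    "eve": {"dana", "fay", "gus"},
    "fay": {"eve", "gus"},
    "gus": {"eve", "fay"},
}

if __name__ == "__main__":
    print(degree(graph))
    print(pagerank(graph))
```

Note that "dana" is the only link between the two groups yet has a low degree, which is why bridge-finding needs measures beyond connection counts.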
Independent experts on healthcare
His self-described expertise
The public interest groups he is in
My various paths to Tom. SmallBlue can show the paths to any colleague up to 6 degrees away
How many people are in my personal networks?
What types of unique colleagues can my friend Chris help me connect to?
Analyzing the existing social networks of every employee makes it possible to find the shortest path to any colleague.
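Finding the shortest path to a colleague within a fixed number of degrees can be sketched with a breadth-first search. This is a generic textbook approach under the assumption of an unweighted network, not SmallBlue's actual implementation; the network below and the names "pat" and "lee" are made up for illustration.

```python
from collections import deque


def shortest_path(graph, source, target, max_degrees=6):
    """Breadth-first search for the shortest path from source to target.

    Returns the path as a list of names, or None if the target is unreachable
    or more than max_degrees hops away.
    """
    if source == target:
        return [source]
    parents = {source: None}
    queue = deque([(source, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_degrees:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(node, ()):
            if neighbour in parents:
                continue  # already reached by a path that is at least as short
            parents[neighbour] = node
            if neighbour == target:
                # Walk the parent links back to the source.
                path = [target]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append((neighbour, depth + 1))
    return None


# Hypothetical collaboration network.
network = {
    "me": ["chris", "lee"],
    "chris": ["me", "pat"],
    "pat": ["chris", "tom"],
    "lee": ["me"],
    "tom": ["pat"],
}

if __name__ == "__main__":
    print(shortest_path(network, "me", "tom"))
```

Because breadth-first search explores the network one degree at a time, the first path that reaches the target is guaranteed to be a shortest one.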
[Figure: No. of people (0 to 600) versus recommendation usefulness (>=1 useful to >=5 useful), comparing Collaborative Filtering, Collaborative + Content Filtering, and CBDR against the Community, Personalized Rec., Global, and Non-Personalized upper bounds.]
[Figure: Info Flow. Precision (0 to 0.4) and Recall (0 to 0.08) versus number of recommended users (1 to 4); 1 month; Innovators: 586; Early adopters: 1,170; new docs.]
Data Source: Relationships among 7594 companies, data mining from NYT 1981 ~ 2009
[Figure labels: Team, Account team, Design team, Person, Sociology, Healthcare, CS, SNA, Info, EE, Improve, Sensor]
Detected as top 1 anomaly in Sandy Tweets
Outperform existing approaches by up to 180% (IJCAI 13)
Dynamics of Information Graphs in Social Media
Visual Sentiment and Semantic Analysis
First work in the literature on automatic visual sentiment analysis
Figure labels: Build Sentiment Ontology; Discover sentiment words; Select Adj-Noun Pairs; Performance Filtering; Train Classifiers
Example adjective-noun pairs: MISTY WOODS, SAD EYES
Training from 6 million tags
Detection results of crazy car (100% accuracy, 5 out of 5 correct)
Experiment on Sentiment Detection Accuracy on Twitter:
Text 0.43
Visual 0.70
T+V 0.72
Enhancing:
headache
chill migraine
high fever
stomachache
cough
Graph
Communities
Indexing time Query processing time
Use Case 8: Visualization for Navigation and Exploration
Cluster based huge graph visualization Query based huge graph visualization
http://systemg.ibm.com/apps/whisper/index.html
SocialHelix: Visualization of Sentiment Divergence in Social Media
http://systemg.ibm.com/apps/socialhelix/index.html
Use Case 9: Graph Search
Interest / social network based content recommendations
Info-Socio networks
Graph analysis
query context
ranking, re-ranking
YouTube: 6M Edges
Flickr: 24M Edges
LiveJournal: 72M Edges
System G MapReduce
Execution Time in seconds.
Category 3: Security
Ponzi scheme Detection
Network Info Flow
Ego Net Features
Normal: (1) Clique-like (2) Two-way links
Attacker: Near-Star
Detecting DoS attack
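Reading the slide as saying that normal ego nets are clique-like with two-way links while an attacker's ego net is near-star, the two features can be sketched as follows. The function name `ego_features` and both example graphs are assumptions for illustration; the slide gives no thresholds, so none are applied here.

```python
def ego_features(out_links, ego):
    """Return (density, reciprocity) for the ego net of `ego` in a directed graph.

    out_links maps each node to the set of nodes it links to.
    density: fraction of possible directed links among the ego's neighbours that
             exist (near 1 for clique-like ego nets, near 0 for star-like ones).
    reciprocity: fraction of the ego's neighbours connected by two-way links.
    """
    out_nb = set(out_links.get(ego, ()))
    in_nb = {u for u, targets in out_links.items() if ego in targets}
    neighbours = (out_nb | in_nb) - {ego}
    k = len(neighbours)
    if k == 0:
        return 0.0, 0.0
    # Count directed links that run between two neighbours of the ego.
    between = sum(
        1
        for u in neighbours
        for v in out_links.get(u, ())
        if v in neighbours and v != u
    )
    density = between / (k * (k - 1)) if k > 1 else 0.0
    reciprocity = len((out_nb & in_nb) - {ego}) / k
    return density, reciprocity


# Hypothetical examples: a mutual triangle and a one-way star.
normal = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
attacker = {"x": {"t1", "t2", "t3", "t4"}}

if __name__ == "__main__":
    print(ego_features(normal, "a"))
    print(ego_features(attacker, "x"))
```

In practice such features would be computed for every node and the outliers flagged; how the scores are thresholded or combined is left open here.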
Graph Visualizations
Data sources: Emails, Instant Messaging, Web Access, Click streams, Executed Processes, Feed subscription, Printing, Copying, Database access, Log On/Off
Social sensors capturer
Analytics: Graph analysis, Behavior analysis, Semantics analysis, Psychological analysis
Multimodality Analysis
Detection, Prediction & Exploration Interface
Infrastructure + ~70 Analytics
Story Espionage Example
Unstable Mental status
Planning Attack
Multi-Modality Multi-Layer Understanding of Human
Mapping Espionage, Sabotage, and Fraud Use Cases into
Cognition Layer
Semantics Layer
Concept Layer
Feature Layer
Sensor Layer
Available existing data: HR records, Travel records, Badge/Location records, Phone records, Mobile records
Future additions?: Transmitted images, speech content, video content
[Legend: observations; hidden states]
Example of Graphical Analytics and Provenance
Markov Network, Latent Network, Bayesian Network
Promising results. IBM's system successfully caught the bad guys of the 12 cases: 4 as Top #1, 3 in Top #2-#5, 2 in T[...] #20, 1 in Top #21-#50, and 2 in Top #51-#100. Performer 2 did not report results. Performer 3 reported: 3 of the 12 cases T[...] #50-#100, 6 cases Top #101-#500, and 3 cases beyond Top #501.
Use Case 11: Fraud Detection for Bank
Network Info Flow
Ego Net Features
Normal: (1) Clique-like (2) Two-way links
Attacker: Near-Star
Detecting DoS attack
Bayesian Network
KPI time series (e.g., server performance/load, network performance/load)
Causality analyzer
Varying over time?
KPI (a time series)
(potential) pairwise relationship (e.g., causality)
(e.g., causality)
Graph Visualizations
Bayesian Network
* 3 timesteps * 63 variables
* 3.9 avg states * 4.0 avg indegree
* 16,858 CPT entries
Junction Tree
* 67 cliques
* 873,064 PT entries in cliques
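The gap between the CPT count and the much larger count of table entries in the junction tree cliques comes from how table sizes grow with the number of variables considered jointly. The sketch below shows only the standard counting rules, reading "PT" as potential table (an assumption). It does not attempt to reproduce the slide's totals, which depend on the actual network structure, and the function names are illustrative.

```python
from math import prod


def cpt_entries(node_states, parent_states):
    """Entries in a conditional probability table.

    One probability per state of the node for every joint configuration
    of its parents.
    """
    return node_states * prod(parent_states)


def clique_table_entries(member_states):
    """Entries in a junction-tree clique table.

    One entry per joint configuration of all variables in the clique.
    """
    return prod(member_states)


if __name__ == "__main__":
    # A 4-state node with no parents, then with 4 parents of 4 states each.
    print(cpt_entries(4, []))
    print(cpt_entries(4, [4, 4, 4, 4]))
    # A clique that happens to contain 6 variables of 4 states each.
    print(clique_table_entries([4] * 6))
```

Each extra parent multiplies a CPT by that parent's state count, and junction tree cliques often merge more variables than any single CPT covers, so clique tables can dominate memory use.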
KPI time series (e.g., server performance/load, network performance/load)
Causality analyzer
Varying over time?
KPI (a time series)
(potential) pairwise relationship (e.g., causality)
Select KPI pairs (sampling); Test link existence; Estimate unsampled links based on histo
Overall graph
Category 5: Data Warehouse Augmentation
Graph application
Graph objects
Vertex Correspondence
Attribute Transformation
ARG s (Ys), ARG t (Yt)
Use Case 19: Graph Matching for Genomic Medicine
Ongoing discussions
Sign List
Name (Last, First) UNI Department Degree (yr) Prior School or Company