IBM Big Data Platform Overview

Martin Pavlík, +420 731 435 691, martin_pavlik@cz.ibm.com

January 2013

2013 IBM Corporation

Big Data is a Hot Topic Because Technology Makes it Possible to Analyze ALL Available Data
Cost-effectively manage and analyze all available data in its native form: unstructured, structured, streaming

Data sources: Website, Social Media, Billing, ERP, CRM, RFID, Network Switches

2012 IBM Corporation

BIG DATA is not just HADOOP


Understand and navigate federated big data sources: Federated Discovery and Navigation

Manage & store huge volume of any data: Hadoop File System, MapReduce

Structure and control data: Data Warehousing

Manage streaming data: Stream Computing

Analyze unstructured data: Text Analytics Engine

Integrate and govern all data sources: Integration, Data Quality, Security, Lifecycle Management, MDM

Business-Centric Big Data Enables You to Start With a Critical Business Pain and Expand the Foundation for Future Requirements

Big data isn't just a technology; it's a business strategy for capitalizing on information resources
Getting started is crucial
Success at each entry point is accelerated by products within the Big Data platform
Build the foundation for future requirements by expanding further into the big data platform

1 Unlock Big Data


Customer need
  Understand existing data sources
  Search and navigate data within existing systems
  No copying of data

Value statement
  Get up and running quickly
  Discover and retrieve big data
  Business users can work even with big data sources

Solution
  Vivisimo Velocity, renamed to IBM InfoSphere DataDiscovery

2 Analyze Raw Data


Customer need
  Ingest data as-is into Hadoop
  Combine it with data from the DWH
  Process very large volumes of data

Value statement
  Gain new insight
  Overcome the high cost of converting data from unstructured to structured format
  Experiment with analysis on different data and combine them with other sources

Solution
  IBM InfoSphere BigInsights

Merging the Traditional and Big Data Approaches


Traditional Approach: Structured & Repeatable Analysis
  Business users determine what question to ask
  IT structures the data to answer that question
  Examples: Monthly sales reports, Profitability analysis, Customer surveys

Big Data Approach: Iterative & Exploratory Analysis
  IT delivers a platform to enable creative discovery
  Business explores what questions could be asked
  Examples: Brand sentiment, Product strategy, Maximum asset utilization

InfoSphere BigInsights is more than just HADOOP

IBM InfoSphere BigInsights is much more than HADOOP

IBM Big Data Platform includes much more than IBM InfoSphere BigInsights

Hadoop
Open-source software framework from Apache
Inspired by Google MapReduce and GFS (Google File System)
Core components: HDFS, Map/Reduce

InfoSphere BigInsights
Platform for volume, variety, velocity
  Enhanced Hadoop foundation
Analytics
  Text analytics & tooling
  Application accelerators
Usability
  Web console
  Spreadsheet-style tool
  Ready-made apps
Enterprise class
  Storage, security, cluster management
Integration
  Connectivity to Netezza, DB2, JDBC databases, etc.

Can also run on top of [logo not preserved]

Editions, ordered by breadth of capabilities:
Apache Hadoop
Basic Edition
  Free download
  Integrated install
  Online InfoCenter
  BigData Univ.
Enterprise Edition
  Licensed
  Application accelerators
  Pre-built applications
  Text analytics
  Spreadsheet-style tool
  RDBMS, warehouse connectivity
  Administrative tools, security
  Eclipse development tools
  Performance enhancements
  ....

Spreadsheet-style Analysis
Web-based analysis and visualization

Spreadsheet-like interface
Define and manage long-running data collection jobs
Analyze the content of the text on the pages that have been retrieved

Build a Big Data Program: MapReduce example


Eclipse tools
For Jaql, Hive, Pig, Java MapReduce, BigSheets plug-ins, text analytics, etc.
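
The slide's own MapReduce example was not preserved in this transcript. As a stand-in, here is a minimal sketch of the MapReduce idea in plain Python, assuming a word-count job. It runs in a single process and does not use the Hadoop or BigInsights APIs; the function names (map_phase, shuffle, reduce_phase, word_count) are illustrative only.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group emitted values by key, as the framework would do."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on an in-memory input."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data is not just hadoop", "hadoop is open source"]))
```

On a real cluster the mapper and reducer run in parallel on many nodes and the shuffle moves data between them; the sketch only shows the division of work.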


Jaql: IBM's programming language in the Hadoop world

Jaql is a complete solutions environment supporting all other BigInsights components
  Ad-hoc analysis (BigSheets)
  BigInsights Text Analytics
  Integration (DB2, Netezza, Streams)
  Machine learning (SystemML)
  Statistical analysis (R module)

Integration point for various analytics
  Text analytics
  Statistical analysis
  Machine learning
  Ad-hoc analysis

Integration point for various data sources
  Local and distributed file systems
  NoSQL databases
  Content repositories
  Relational sources (warehouses, operational databases)

[Diagram: Jaql consists of Jaql Modules, Jaql Core Operators, and Jaql I/O, which connects to DFS, NoSQL, RDBMS, and File System]

BigInsights and the data warehouse


[Diagram: Big Data analytic applications work on BigInsights; traditional analytic tools work on the data warehouse. BigInsights feeds the data warehouse through three steps: Filter, Transform, Aggregate.]
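
The Filter, Transform, Aggregate steps named on this slide can be sketched as a small pipeline. This is a minimal illustration in plain Python, not BigInsights or Jaql code; the record layout (customer, bytes, status) and all function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List

# Hypothetical raw records, e.g. parsed web server log lines (not from the slides).
RAW_EVENTS: List[dict] = [
    {"customer": "a", "bytes": "1200", "status": "200"},
    {"customer": "a", "bytes": "300", "status": "500"},
    {"customer": "b", "bytes": "800", "status": "200"},
    {"customer": "a", "bytes": "500", "status": "200"},
]


def filter_events(events: Iterable[dict]) -> Iterator[dict]:
    """Filter: keep only successful requests."""
    return (e for e in events if e["status"] == "200")


def transform_events(events: Iterable[dict]) -> Iterator[dict]:
    """Transform: convert raw string fields into typed values."""
    return ({"customer": e["customer"], "bytes": int(e["bytes"])} for e in events)


def aggregate_events(events: Iterable[dict]) -> Dict[str, int]:
    """Aggregate: total bytes per customer, ready to load into a warehouse table."""
    totals: Dict[str, int] = defaultdict(int)
    for e in events:
        totals[e["customer"]] += e["bytes"]
    return dict(totals)


def summarize(events: Iterable[dict]) -> Dict[str, int]:
    """Chain the three steps in the order shown on the slide."""
    return aggregate_events(transform_events(filter_events(events)))


if __name__ == "__main__":
    print(summarize(RAW_EVENTS))
```

Only the small aggregated result would need to be loaded into the warehouse, while the raw records stay in the Hadoop layer.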

3 Simplify your warehouse


Customer need
  SIGNIFICANTLY improve DWH performance
  Reduce DWH administration costs

Value statement
  Speed: 10-100x better performance
  Simplicity: administration costs reduced by 75%-90%
  Scalability
  Smart system
  In-database analytics
  Out-of-the-box integration with SPSS

Solution
  IBM Netezza, renamed to PureData System for Analytics

Analyst: I need to evaluate the possible relationship between client salary and overdrafts.

IT: OK. We have to evaluate a lot of statistics and set the correct DB indexes and DB partitioning. It will take us 5 days.

IT: Done. You can run your analytical query.

Analyst: Great. Thanks a lot. I'm going to check the results.

Analyst: Great. I can see some nice correlations here. Now I need to look at it from a different perspective.

IT: Noooo!!! It's not possible to work here!

IT: Ohhh, welcome, dear friend. Understood. So, it's ... another 5 days of our work.

And now with Netezza ...


Analyst: I need to evaluate the possible relationship between client salary and overdrafts. I will use Netezza.

Analyst: Great. I can see some nice correlations here. Now I need to look at it from a different perspective. With Netezza I can run the query immediately. The response will come back just as fast.

IT can do something else, much more useful.

Built-In Expertise Makes This as Simple as an Appliance

Dedicated device
Optimized for purpose
Complete solution
Fast installation
Very easy operation
Standard interfaces
Low cost

In October 2012, IBM Netezza was renamed to IBM PureData System for Analytics

Netezza Genesis in T-Mobile CZ

Proof-of-Concept Project
  New Enterprise Data Warehouse platform selection
  Comparison of existing and other platforms
  Selection criteria: Performance, Operational savings

... and the winner was: Netezza

Netezza Genesis in T-Mobile CZ


Expectations
  Significant response improvement
    A faster platform means better report response times
  Direct data availability
    Higher trust in data, one version of truth
    Aggregation reduction
    Any attribute available
  Operational benefits
    Storage savings (no data replicas)
    Administration cost reduction (DBA)
  Infrastructure simplification
    Lower environment complexity

Netezza Genesis in T-Mobile CZ


Project Implementation
  EDW platform migration
    Netezza platform implementation
    ETL graphs/processes redesign
  BI front-end tool migration
    SAP BusinessObjects implementation
    All reports redesign

Main Integration Partner: T-Systems CZ

Netezza Genesis in T-Mobile CZ


Current Status
  All relevant ETL processing redesigned
  Parallel run on the original and Netezza platforms finished
  Netezza is now the only primary platform

Real Netezza experience from T-Mobile Czech Rep.

Reports measured: Workflow Reporting; Invoicing and Payments reporting; Payment discipline of current month invoices; Overdue Debt of Invoices in Current Month; Average Monthly Invoice Figures

Original platform: 33 minutes, 2 hours, 10 hours, 50 minutes
Netezza: 1 minute, 17 seconds, 23 seconds, 38 seconds

RESPONSE TIME MASSIVELY IMPROVED
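
To put the response times on this slide into one unit, the speedup factors can be computed as below. This is a small sketch under a stated assumption: the slide's table layout was lost, so the code assumes that the original-platform times and the Netezza times pair up in the order in which they appear. The variable and function names are illustrative.

```python
from typing import List

# Times from the slide, converted to seconds.
# ASSUMPTION: the i-th original time belongs to the same report as the i-th Netezza time.
ORIGINAL_SECONDS: List[int] = [33 * 60, 2 * 3600, 10 * 3600, 50 * 60]
NETEZZA_SECONDS: List[int] = [60, 17, 23, 38]


def speedups(before: List[int], after: List[int]) -> List[float]:
    """Return how many times faster each 'after' measurement is than 'before'."""
    return [b / a for b, a in zip(before, after)]


if __name__ == "__main__":
    for factor in speedups(ORIGINAL_SECONDS, NETEZZA_SECONDS):
        print(f"{factor:.0f}x faster")
```

If the pairing assumption does not hold, the individual factors change, but every original time on the slide is still far longer than every Netezza time.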

4 Reduce costs with Hadoop


Customer need
  Too much data => too expensive to store and to maintain
  A big portion is kept "just in case"
  The amount of data is still growing => it's getting more expensive
  => too expensive to have all data in a standard DWH

Value statement
  Leverage the parallel processing architecture of Hadoop
  Hadoop uses cheap commodity HW
  Enable business users to keep working in the same or a similar way

Solution
  IBM InfoSphere BigInsights

BigInsights and the data warehouse


[Diagram: Traditional analytic tools (from Cognos BI via Hive JDBC) and Big Data analytic applications access both the Data Warehouse and BigInsights. BigInsights serves as a query-ready archive for cold warehouse data.]

Future: The SQL interface ...

Rich SQL query capabilities
  SQL '92 and 2011 features
  Correlated subqueries
  Windowed aggregates

SQL access to all data stored in InfoSphere BigInsights
  Robust JDBC/ODBC support
  Take advantage of key features of each data source
  Leverage MapReduce parallelism or achieve low latency

[Diagram: Application (SQL Language, JDBC / ODBC Driver) -> JDBC / ODBC Server -> SQL interface Engine -> Data Sources in InfoSphere BigInsights: Hive tables, HBase tables, CSV files]
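
One of the SQL capabilities listed above, the correlated subquery, can be illustrated with a short sketch. It uses Python's built-in sqlite3 module purely as a stand-in SQL engine; it is not the BigInsights SQL interface, and the invoices table and its columns are hypothetical.

```python
import sqlite3
from typing import List, Tuple

Row = Tuple[str, str, float]


def invoices_above_customer_average(rows: List[Row]) -> List[Row]:
    """Return invoices whose amount exceeds that customer's own average amount."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE invoices (customer TEXT, month TEXT, amount REAL)")
    conn.executemany("INSERT INTO invoices VALUES (?, ?, ?)", rows)
    # The inner query refers to the outer row (i.customer): a correlated subquery.
    query = """
        SELECT customer, month, amount
        FROM invoices AS i
        WHERE amount > (
            SELECT AVG(amount) FROM invoices AS j WHERE j.customer = i.customer
        )
        ORDER BY customer, month
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    sample = [
        ("a", "2012-01", 100.0),
        ("a", "2012-02", 300.0),
        ("b", "2012-01", 50.0),
        ("b", "2012-02", 50.0),
    ]
    print(invoices_above_customer_average(sample))
```

Through a JDBC/ODBC driver an application would send a statement of the same shape to the server; only the connection setup would differ.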

5 Analyze Streaming Data


Customer need
  Process and leverage streaming data
  Select valuable data from the data stream for future processing
  Quickly process data that becomes useless if it is not processed immediately

Value statement
  React in real time to take an opportunity before it expires
  Periodically adjust streaming models based on analysis of data at rest

[Diagram: Streaming Data Sources -> Streams Computing -> ACTION]

Solution
  IBM InfoSphere Streams

Why and when to use InfoSphere Streams?


Applications needing on-the-fly processing, filtering and analysis of streaming data
  Sensors: environmental, industrial, GPS, images, videos, ...
  Data exhaust: network data, system logs (web server, app server), ...
  High-rate transaction data: financial transactions, CDRs

At least 2 criteria from the list below should be fulfilled
  Isolation: processing in isolation or in limited windows (time / number of records)
  Non-traditional formats included: spatial data, images, text, voice, ...
  Integration challenges: different connection methods, different data rates, different processing requirements
  Multiple processing nodes: volume / rate very high => scalability required
  Sub-millisecond latency: immediate analysis and response
  Store & mine approach doesn't work: because of the very high volume of data (and its rates)
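
The "limited windows (time / number of records)" criterion can be made concrete with a small sketch of count-based windowing. This is plain Python, not InfoSphere Streams code; the function names, window size, and threshold are illustrative assumptions.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Tuple


def sliding_average(readings: Iterable[float], window_size: int) -> Iterator[float]:
    """Yield the average of the last `window_size` readings after each new reading."""
    window: Deque[float] = deque(maxlen=window_size)  # old readings drop out automatically
    for value in readings:
        window.append(value)
        yield sum(window) / len(window)


def alerts(
    readings: Iterable[float], window_size: int, threshold: float
) -> Iterator[Tuple[int, float]]:
    """Yield (position, average) whenever the windowed average exceeds the threshold."""
    for index, average in enumerate(sliding_average(readings, window_size)):
        if average > threshold:
            yield index, average


if __name__ == "__main__":
    for position, average in alerts([1.0, 2.0, 3.0, 4.0], window_size=2, threshold=2.0):
        print(position, average)
```

Each reading is processed as it arrives and only the current window is kept in memory, which is the property that matters when the full stream is too large to store first and mine later.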

Streams and BigInsights - Integrated Analytics on Data in Motion & Data at Rest
Visualization of real-time and historical insights

InfoSphere Streams: data ingest, preparation, online analysis, model validation

InfoSphere BigInsights, Database & Warehouse: data integration, data mining, machine learning, statistical modeling

[Diagram: 1. Data Ingest (data flow), 2. Bootstrap/Enrich, 3. Adaptive Analytics Model (control flow)]

The Platform Advantage


BENEFITS
  Increase over time

IN DETAIL
  By moving from entry to a 2nd and 3rd project
  Shared components
  Integration
  Lowering deployment costs
  Points of leverage
    Shared text analytics for Streams and BigInsights
    HDFS connectors (data integration (ETL, ...), Streams)
    Accelerators built across multiple engines

[Diagram: IBM Big Data Platform]
  Analytic Applications: BI / Reporting, Exploration / Visualization, Functional App, Industry App, Predictive Analytics, Content Analytics
  Visualization & Discovery, Application Development, Systems Management
  Accelerators
  Hadoop System, Stream Computing, Data Warehouse
  Information Integration & Governance
