CP902: Data Warehousing & Data Mining

What is Data?

What is knowledge / information?


A data warehouse is a repository of selected and adapted operational data that is used to make important business decisions. Data mining is the process of extracting knowledge from large amounts of data.

What is Data Mining?


Data mining is the task of discovering interesting patterns from large amounts of data, where the data can be stored in databases, data warehouses, or other information repositories. Data mining is often defined as finding hidden information in a large database. Alternatively, it has been called exploratory data analysis, data driven discovery, and deductive learning. Data mining involves an integration of techniques from multiple disciplines such as database and data warehouse technology, statistics, machine learning, high-performance computing, pattern recognition, neural networks, data visualization, information retrieval, image and signal processing, and spatial or temporal data analysis.

Different data sources:
-Database
-File
-Web

Terminologies
Data consistency summarizes the validity, accuracy, usability and integrity of related data between applications and across an IT enterprise.

Terminologies
Data integrity: It refers to the validity of data, meaning that the data is consistent and correct. In the data warehousing field, we frequently hear the term "Garbage In, Garbage Out." If there is no data integrity in the data warehouse, any resulting report and analysis will not be useful.

Terminologies
Data integrity: Business rules that dictate the standards for acceptable data. These rules are applied to a database by using integrity constraints and triggers to prevent invalid data entry. Consistency states that only valid data will be written to the database.
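The idea of applying business rules through integrity constraints and triggers can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module; the table, column and trigger names are hypothetical, and the email rule is an assumed business rule chosen only for the example.

```python
import sqlite3

# In-memory database; all names below are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    email       TEXT NOT NULL UNIQUE,
    age         INTEGER CHECK (age >= 0)
);

CREATE TRIGGER customer_email_rule
BEFORE INSERT ON customer
WHEN NEW.email NOT LIKE '%_@_%'
BEGIN
    SELECT RAISE(ABORT, 'invalid email');
END;
""")


def try_insert(email, age):
    """Return True if the row was accepted, False if a rule rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO customer (email, age) VALUES (?, ?)", (email, age)
            )
        return True
    except sqlite3.DatabaseError:
        # Raised for NOT NULL, UNIQUE and CHECK violations and for the trigger.
        return False
```

Here the constraints (NOT NULL, UNIQUE, CHECK) and the trigger both reject invalid rows before they are written, so only valid data reaches the table.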

Terminologies
Data integrity involves three levels in a DW:
-Database level
-ETL process
-Access level

Size of Data
Sources of data: sales, trade, Walmart, product descriptions, customer feedback, company profiles, etc.
According to a survey, the amount of data almost doubles every 18 months.
Units of size: B, KB, MB, GB, TB, PB, EB, ZB, YB
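The doubling claim and the unit ladder can be made concrete with a little arithmetic. The sketch below assumes a clean doubling every 18 months and decimal units (1 KB = 1000 B); the function names are hypothetical.

```python
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def growth_factor(months, doubling_period_months=18):
    """Factor by which data grows if it doubles every doubling period."""
    return 2 ** (months / doubling_period_months)


def humanize(num_bytes, base=1000):
    """Express a byte count in the largest unit that keeps the value below base."""
    value = float(num_bytes)
    for unit in UNITS:
        if value < base or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= base
```

Under these assumptions, `growth_factor(36)` gives a fourfold increase after three years, and `humanize(24 * 10**12)` expresses that byte count in terabytes.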

Data Mart / Wall Mart


A data mart is the access layer of the data warehouse environment that is used to get data out to the users. The data mart is a subset of the data warehouse that is usually oriented to a specific business line or team.
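A data mart as a subset of the warehouse can be sketched with a database view. This is a simplified illustration using sqlite3; real data marts are often separate physical stores, and the table, view and column names here are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE warehouse_sales (
    sale_id       INTEGER PRIMARY KEY,
    business_line TEXT NOT NULL,
    region        TEXT NOT NULL,
    amount        REAL NOT NULL
);
INSERT INTO warehouse_sales (business_line, region, amount) VALUES
    ('grocery', 'north', 120.0),
    ('grocery', 'south', 80.0),
    ('apparel', 'north', 200.0);

CREATE VIEW grocery_mart AS
SELECT sale_id, region, amount
FROM warehouse_sales
WHERE business_line = 'grocery';
""")

# Users of the grocery mart see only the rows for their business line.
grocery_rows = conn.execute(
    "SELECT region, amount FROM grocery_mart ORDER BY region"
).fetchall()
```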

Data Mart / Wall Mart


In some deployments, each department or business unit is considered the owner of its data mart, including all the hardware, software and data. This enables each department to use, manipulate and develop its data any way it sees fit, without altering information inside other data marts or the data warehouse. In other deployments where conformed dimensions are used, this business unit ownership will not hold true for shared dimensions like customer, product, etc.

Example
Dmart across India

Data Warehouse
WHAT IS DW?
HOW DID DW COME INTO EXISTENCE?
WHAT PURPOSE DOES IT SERVE?
HOW DO WE CREATE DW?

WHAT IS Data Warehouse?


DW IS A PROCESS, NOT A PRODUCT: THE PROCESS OF CREATING A WELL-DESIGNED INFORMATION MANAGEMENT SOLUTION THAT ENABLES INFORMATIONAL AND ANALYTICAL PROCESSING WITHOUT THE BARRIERS OF GEOGRAPHY AND ORGANIZATION.

WHAT IS Data Warehouse?


-STORAGE AREA FOR PROCESSED AND INTEGRATED DATA
-ACROSS DIFFERENT SOURCES
-IT MAY BE OPERATIONAL / EXTERNAL DATA
-IT ALLOWS USERS TO EXTRACT REQUIRED DATA FOR BUSINESS ANALYSIS & STRATEGIC DECISION MAKING

WHAT IS Data Warehouse?


CONCEPTUALLY: A DW is a home for secondhand data that originates in either a corporate application or some source external to the company.

WHAT IS Data Warehouse?


FORMALLY: A DW is a stand-alone repository of information integrated from several, possibly heterogeneous, operational DBs. A DW is a repository of subjectively selected and adapted operational data, which can successfully answer any ad hoc, complex, statistical or analytical queries. [It is situated in the DSS (decision support system) of an organization.]

WHAT IS Data Warehouse?


RALPH KIMBALL: Ralph Kimball (born 1944) is an author on the subject of data warehousing and business intelligence. His definition: a data warehouse is a copy of transaction data specifically structured for query and analysis.

WHAT IS Data Warehouse?


WILLIAM H. INMON: William H. Inmon (born 1945) is an American computer scientist, recognized by many as the father of the data warehouse. His definition: a data warehouse is a subject-oriented, nonvolatile, integrated, time-variant collection of data in support of management's decisions.

WHAT IS Data Warehouse?


Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For example, "sales" can be a particular subject.
Integrated: A data warehouse integrates data from multiple data sources. For example, source A and source B may have different ways of identifying a product, but in a data warehouse, there will be only a single way of identifying a product.
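The product-identification example can be sketched as a key mapping applied while loading. The source layouts, identifiers and the mapping table below are all hypothetical; in practice such a mapping would be maintained as part of the integration process.

```python
# Hypothetical source records: source A uses SKU codes, source B numeric ids.
SOURCE_A = [{"sku": "SKU-001", "qty": 3}, {"sku": "SKU-002", "qty": 1}]
SOURCE_B = [{"item_no": 9001, "qty": 2}]

# Assumed mapping from (source, local id) to one warehouse-wide product key.
PRODUCT_KEY = {
    ("A", "SKU-001"): 1,
    ("A", "SKU-002"): 2,
    ("B", 9001): 1,  # the same product as SKU-001 in source A
}


def integrate_products(source_a, source_b):
    """Total quantity per warehouse product key across both sources."""
    totals = {}
    for row in source_a:
        key = PRODUCT_KEY[("A", row["sku"])]
        totals[key] = totals.get(key, 0) + row["qty"]
    for row in source_b:
        key = PRODUCT_KEY[("B", row["item_no"])]
        totals[key] = totals.get(key, 0) + row["qty"]
    return totals
```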

WHAT IS Data Warehouse?


Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older data from a data warehouse. This contrasts with a transaction system, where often only the most recent data is kept. For example, a transaction system may hold the most recent address of a customer, whereas a data warehouse can hold all addresses associated with a customer.
Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a data warehouse should never be altered.
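The customer-address example can be sketched as an append-only history: new addresses are added as new records and old ones are never updated. The class and field names are hypothetical, and this is only one simple way to model the idea.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)  # frozen: a loaded record cannot be modified
class AddressRecord:
    customer_id: int
    address: str
    valid_from: date


class AddressHistory:
    """Append-only store: rows are added, never updated or deleted."""

    def __init__(self):
        self._rows = []

    def load(self, customer_id, address, valid_from):
        self._rows.append(AddressRecord(customer_id, address, valid_from))

    def all_addresses(self, customer_id):
        rows = [r for r in self._rows if r.customer_id == customer_id]
        return [r.address for r in sorted(rows, key=lambda r: r.valid_from)]

    def address_on(self, customer_id, day):
        """Address in effect on a given day, or None if none was known yet."""
        current = None
        for r in sorted(self._rows, key=lambda r: r.valid_from):
            if r.customer_id == customer_id and r.valid_from <= day:
                current = r.address
        return current
```

A transaction system would typically keep only the latest address; here every address remains available, along with the date from which it applied.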

Compare two philosophies


Bill Inmon's paradigm: Data warehouse is one part of the overall business intelligence system. An enterprise has one data warehouse, and data marts source their information from the data warehouse. In the data warehouse, information is stored in 3rd normal form.

Ralph Kimball's paradigm: Data warehouse is the conglomerate of all data marts within the enterprise. Information is always stored in the dimensional model.
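The dimensional model mentioned in Kimball's paradigm is commonly drawn as a star schema: a central fact table with keys pointing to dimension tables. The sketch below uses sqlite3 with hypothetical table names and sample rows; it illustrates the general shape only, not any specific design from either author.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY,
    name        TEXT,
    category    TEXT
);
CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY,
    year     INTEGER,
    month    INTEGER
);
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    date_key    INTEGER REFERENCES dim_date(date_key),
    amount      REAL
);
INSERT INTO dim_product VALUES (1, 'Rice 5kg', 'grocery'), (2, 'T-shirt', 'apparel');
INSERT INTO dim_date VALUES (20240101, 2024, 1), (20240201, 2024, 2);
INSERT INTO fact_sales VALUES
    (1, 20240101, 100.0), (1, 20240201, 50.0), (2, 20240101, 30.0);
""")

# A typical analytical query: join the fact table to a dimension and aggregate.
sales_by_category = conn.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
```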

Compare two philosophies


-There is no right or wrong between these two ideas, as they represent different data warehousing philosophies.

-In reality, the data warehouses in most enterprises are closer to Ralph Kimball's idea.
-This is because most data warehouses started out as a departmental effort, and hence they originated as a data mart.
-Only when more data marts are built later do they evolve into a data warehouse.

Evolution
60s: Batch reports
-hard to find and analyze information
-inflexible and expensive, reprogram every new request

70s: Terminal-based DSS and EIS (executive information systems)
-still inflexible, not integrated with desktop tools

80s: Desktop data access and analysis tools
-query tools, spreadsheets, GUIs
-easier to use, but only access operational databases

90s: Data warehousing with integrated OLAP engines and tools

EVOLUTION

Warehouses are Very Large Databases


[Figure: bar chart of the percentage of respondents (0% to 35%) by warehouse size, initial versus projected 2Q96. Size categories: 5GB, 5-9GB, 10-19GB, 20-49GB, 50-99GB, 100-249GB, 250-499GB, 500GB-1TB. Source: META Group, Inc.]

Very Large Data Bases


Terabytes -- 10^12 bytes: Walmart -- 24 Terabytes
Petabytes -- 10^15 bytes: Geographic Information Systems
Exabytes -- 10^18 bytes: National Medical Records
Zettabytes -- 10^21 bytes: Weather images
Yottabytes -- 10^24 bytes: Intelligence Agency Videos

DW Creation
Construction of a DW requires:
-data cleaning
-data integration
-data consolidation (transformation)
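The three construction tasks can be sketched as small functions applied in sequence. The record layout, the sample sources (CRM and BILLING) and the cleaning rules are hypothetical assumptions made for illustration; real cleaning rules depend on the data.

```python
def clean(rows):
    """Data cleaning: normalize names and drop rows with missing values."""
    cleaned = []
    for row in rows:
        name = (row.get("customer") or "").strip().title()
        amount = row.get("amount")
        if not name or amount is None:
            continue  # unusable row: no customer name or no amount
        cleaned.append({"customer": name, "amount": float(amount)})
    return cleaned


def integrate_sources(*sources):
    """Data integration: combine cleaned rows from several sources."""
    merged = []
    for source in sources:
        merged.extend(clean(source))
    return merged


def consolidate(rows):
    """Data consolidation: summarize amounts per customer."""
    totals = {}
    for row in rows:
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["amount"]
    return totals


# Hypothetical inputs from two operational systems.
CRM = [{"customer": " alice ", "amount": "10.5"}, {"customer": "", "amount": 3}]
BILLING = [{"customer": "ALICE", "amount": 4.5}, {"customer": "bob", "amount": None}]

summary = consolidate(integrate_sources(CRM, BILLING))
```

In this sketch the two spellings of the same customer are unified during cleaning, so consolidation produces a single total for that customer.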

DW Creation
Steps involved in the data warehousing project cycle:
-Requirement Gathering
-Physical Environment Setup
-Data Modeling
-ETL
-OLAP Cube Design
-Front End Development
-Report Development
-Performance Tuning
-Query Optimization
-Quality Assurance
-Rolling out to Production
-Production Maintenance
-Incremental Enhancements

Users
-Knowledge workers [managers, analysts, executives, administrators, etc.] use the warehouse to get summarized data quickly to make strategic decisions.

Types of users:
-Executives and managers
-Power Users
-Support Users