Koneru Lakshmaiah College of Engineering: S.indira Priya Darsini V.chaitanya

Presented by
S. Indira Priya Darsini, ¾ C.S.E., priyadarshni_sikhinam@yahoo.com
V. Chaitanya, ¾ I.S.T, chaitu_8918@yahoo.com
from
Koneru Lakshmaiah College Of Engineering

DATA WAREHOUSING
AND
DATA MINING
ABSTRACT
Data mining, the extraction of hidden predictive information from
large databases, is a powerful new technology with great potential to help
companies focus on the most important information in their data
warehouses. Data mining tools predict future trends and behaviors, allowing
businesses to make proactive, knowledge-driven decisions. The automated,
prospective analyses offered by data mining move beyond the analyses of
past events provided by retrospective tools typical of decision support
systems.
A data warehouse is a computer system designed to give business
decision-makers instant access to information. The warehouse copies its
data from existing systems such as order entry, general ledger, and human
resources and stores it for use by executives rather than programmers.
Data warehouse users use special software that enables them to create
and access information when they need it, rather than on a reporting
schedule defined by the information systems (IS) department.
This paper describes the basic architecture of data warehousing, its
software, and the processes involved in data warehousing. It also
presents different techniques used in data mining.

INTRODUCTION
Databases today can range in size into the terabytes.
Within these masses of data lies hidden information of
strategic importance. Data mining is the process of
uncovering that information. Innovative organisations
worldwide are already using data mining to locate and
appeal to higher-value customers, to reconfigure their
product offerings to increase sales, and to minimize
losses due to error or fraud. Data mining is unusual in
that it extracts information from a database that the
user did not know existed.
The more difficult way to use the results of data mining is to get the user to actually understand
what is going on so that they can take action directly. This involves:
1) Visualizing the data mining output in a meaningful way, and
2) Allowing the user to interact with the visualization so that simple questions can be answered.
Stages of the data mining process
The phases depicted start with the raw data and finish with the extracted knowledge which was
acquired as a result of the following stages:
DATA WAREHOUSING
Data mining potential can be enhanced if the appropriate data has been collected and stored in a data
warehouse. Data warehousing is a technique that makes it possible to extract archived operational data and
overcome inconsistencies between different legacy data formats.
The data warehouse is the logical link between what the managers see in their decision support
EIS applications and the company's operational activities.
Characteristics of a data warehouse

• Subject-oriented: data is organized according to subject instead of application
• Integrated: When data resides in many separate applications in the operational environment,
encoding of data is often inconsistent.
• Time-variant: The data warehouse contains historical data that can be used for comparisons.
• Non-volatile: Data are not updated or changed in any way once they enter the data warehouse,
but are only loaded and accessed.
ARCHITECTURE OF DATA WAREHOUSE SYSTEM:
The Data Warehouse architecture is a method by which the overall structure of data,
communication, processing and presentation for end-user computing within the enterprise can be
represented. The architecture of the data warehouse is implemented layer-wise so as to achieve a
higher degree of abstraction. The layered architecture of a data warehouse is illustrated in the
following diagram.
Figure: Layered Architecture of Data Warehouse

• Information Access Layer
• Data Access Layer
• Process Management Layer
• Data Warehouse Layer
• Data Staging Layer
• Application Messaging Layer
WAREHOUSE SOFTWARE

A warehousing team will require several different types of tools during the course of a
warehousing project. These software products generally fall into one or more of the categories:
Figure: Warehouse software categories (extraction and transformation, warehouse technology, data access and retrieval), with components labelled source data, extraction, transformation, quality assurance, data warehouse, data marts, middleware, OLAP, report writers, EIS/DSS and data mining.
• Extraction and transformation
• Warehouse storage
• Data access and retrieval
Extraction tools
There are two primary methods for extracting data from source systems.
Figure: Bulk extraction and change-based replication, each moving data from the source system to the warehouse.

Bulk extractions:
Change-based replication:
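The two extraction methods can be contrasted in a small sketch. It assumes the usual meaning of the terms: a bulk extraction copies the whole source table on every run, while change-based replication copies only the rows modified since the previous run. The `updated_at` column, the in-memory table and the function names are hypothetical and do not come from any particular extraction tool.

```python
def bulk_extract(source_rows):
    """Bulk extraction: copy every row of the source, changed or not."""
    return [dict(row) for row in source_rows]


def change_based_extract(source_rows, last_run):
    """Change-based replication: copy only rows modified after `last_run`."""
    return [dict(row) for row in source_rows if row["updated_at"] > last_run]


# Hypothetical order-entry table; `updated_at` is an assumed change marker.
source = [
    {"id": 1, "amount": 100, "updated_at": 5},
    {"id": 2, "amount": 250, "updated_at": 13},
    {"id": 3, "amount": 75, "updated_at": 14},
]

full_load = bulk_extract(source)
delta_load = change_based_extract(source, last_run=12)

print(len(full_load), [row["id"] for row in delta_load])
```

The trade-off the sketch points at is volume: the bulk load moves all rows every time, while the change-based load moves only what changed but needs the source to record when each row was modified.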
Transformation tools.
Transformation tools are aptly named; they transform extracted data into the appropriate
format, data structure, and values that are required by the data warehouse.
Data transformations
• Field splitting and consolidation:
• Standardization:
• Data quality tools:
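As a rough illustration of the transformations listed above, the sketch below splits one source field into two, consolidates two source fields into one, and standardizes inconsistent encodings onto a single code. The record layout, the field names and the `GENDER_CODES` mapping are assumptions made for the example only.

```python
# Hypothetical mapping from inconsistent source encodings to one standard code.
GENDER_CODES = {"m": "M", "male": "M", "1": "M", "f": "F", "female": "F", "0": "F"}


def split_name(full_name):
    """Field splitting: break one 'Last, First' field into two fields."""
    last, _, first = full_name.partition(",")
    return first.strip(), last.strip()


def consolidate_phone(area_code, number):
    """Field consolidation: merge two source fields into one target field."""
    return f"{area_code.strip()}-{number.strip()}"


def standardize_gender(raw):
    """Standardization: map any known source encoding onto one code."""
    return GENDER_CODES.get(raw.strip().lower(), "U")  # "U" means unknown


def transform(record):
    """Apply all three transformations to one extracted record."""
    first, last = split_name(record["name"])
    return {
        "first_name": first,
        "last_name": last,
        "phone": consolidate_phone(record["area_code"], record["phone"]),
        "gender": standardize_gender(record["gender"]),
    }


# Illustrative source record, not real data.
source = {"name": "Rao, Anita", "area_code": "0866", "phone": "2577715", "gender": "Female"}
warehouse_row = transform(source)
print(warehouse_row)
```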
Data access and retrieval tools
Data warehouse users derive and obtain information through these types of tools. Data access
and retrieval tools are currently classified into the subcategories below.
• Online analytical processing (OLAP) tools.
• Executive information systems(EIS)
PROCESSES IN DATA WAREHOUSING
The first phase in data warehousing is to "insulate" your current operational
information, i.e., to preserve the security and integrity of mission-critical OLTP applications,
while giving you access to the broadest possible base of data.
DATA MINING ISSUES
1. One of the key issues raised by data mining technology is not a business or technological one,
but a social one. It is the issue of individual privacy.
2. Another issue is that of data integrity. Clearly, data analysis can only be as good as the data
that is being analyzed. A key implementation challenge is integrating conflicting or redundant
data from different sources.
3. A hotly debated technical issue is whether it is better to set up a relational database structure
or a multidimensional one.
4. Finally, there is the issue of cost. While system hardware costs have dropped dramatically
within the past five years, data mining and data warehousing tend to be self-reinforcing.
DATA MINING TECHNIQUES
This section describes some of the most common data mining algorithms in use today. We
have broken the discussion into two sections, each with a specific theme:
• Classical Techniques: Statistics, Neighborhoods and Clustering
• Next Generation Techniques: Trees, Networks and Rules
Classical Techniques
Statistics
By strict definition "statistics" or statistical techniques are not data mining. They were being
used long before the term data mining was coined to apply to business applications.
Nearest Neighbor
Clustering and the Nearest Neighbor prediction technique are among the oldest techniques
used in data mining. Nearest neighbor is a prediction technique that is quite similar to clustering.
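A minimal sketch of nearest neighbor prediction follows. It assumes the standard form of the technique: to predict a value for a new record, find the most similar records in the historical data and reuse their known outcome. The Euclidean distance, the choice of `k`, the majority vote and the sample data are all assumptions of this example.

```python
import math
from collections import Counter


def nearest_neighbor_predict(history, query, k=3):
    """Predict a label for `query` by majority vote of the k closest records."""
    by_distance = sorted(history, key=lambda rec: math.dist(rec[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical history: (age, income in thousands) -> responded to an offer?
history = [
    ((25, 30), "no"),
    ((30, 35), "no"),
    ((45, 80), "yes"),
    ((50, 90), "yes"),
    ((48, 85), "yes"),
]

prediction = nearest_neighbor_predict(history, (47, 82), k=3)
print(prediction)
```

In practice the predictor columns would usually be rescaled first, since an unscaled column with a large numeric range dominates the distance.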
Clustering
Clustering is the method by which like records are grouped together. Usually this is done to
give the end user a high level view of what is going on in the database. Clustering is sometimes used
to mean segmentation.
Name                  Income       Age                  Education       Vendor
Blue Blood Estates    Wealthy      35-54                College         Claritas Prizm™
Shotguns and Pickups  Middle       35-64                High School     Claritas Prizm™
Southside City        Poor         Mix                  Grade School    Claritas Prizm™
Living Off the Land   Middle-Poor  School Age Families  Low             Equifax MicroVision™
University USA        Very low     Young - Mix          Medium to High  Equifax MicroVision™
Sunset Years          Medium       Seniors              Medium          Equifax MicroVision™
Table: Some Commercially Available Cluster Tags
This clustering information is then used by the end user to tag the customers in their
database. Once this is done the business user can get a quick high level view of what is happening
within the cluster. Once the business user has worked with these codes for some time they also begin
to build intuitions about how these different customer clusters will react to the marketing offers
particular to their business.
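The grouping of like records can be sketched with k-means, one common clustering algorithm. The paper does not name an algorithm, so k-means is an assumption here, as are the sample records and the starting centers.

```python
import math


def k_means(points, centers, iterations=10):
    """Assign each point to its nearest center, then move each center to the
    mean of the points assigned to it; repeat a fixed number of times."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        centers = [
            tuple(sum(axis) / len(group) for axis in zip(*group)) if group else center
            for group, center in zip(clusters, centers)
        ]
    return centers, clusters


# Hypothetical customer records: (age, income in thousands).
points = [(22, 15), (24, 18), (23, 20), (60, 40), (65, 42), (62, 38)]
centers, clusters = k_means(points, centers=[(20, 10), (70, 50)])

for center, group in zip(centers, clusters):
    print(center, group)
```

Each resulting group could then be given a descriptive tag, in the same spirit as the commercial cluster tags in the table.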
Next Generation Techniques

The data mining techniques in this section represent the most often used techniques that have
been developed over the last two decades of research. These techniques can be used for either
discovering new information within large databases or for building predictive models.
Decision Trees
A decision tree is a predictive model that, as its name implies, can be viewed as a tree.
Specifically each branch of the tree is a classification question and the leaves of the tree are partitions
of the dataset with their classification. For instance, if we were going to classify customers who churn
(don’t renew their phone contracts) in the cellular telephone industry, a decision tree might look
something like the one in the figure.
Figure: A decision tree is a predictive model that makes a prediction on the basis of a series of
decisions, much like the game of 20 questions.
We may notice some interesting things about the tree:
• It divides up the data on each branch point without losing any of the data (the number of total
records in a given parent node is equal to the sum of the records contained in its two
children).
• The number of churners and non-churners is conserved as you move up or down the tree.
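Both observations can be checked with a small sketch. The customer records and the two branch questions are hypothetical; the point is only that a yes/no question partitions a node, so record counts and churner counts add up across the children.

```python
# Hypothetical records: (contract_years, has_new_phone, churned)
customers = [
    (1, False, True),
    (1, True, False),
    (3, False, True),
    (3, True, False),
    (4, False, False),
    (2, False, True),
]


def split(records, question):
    """Partition records by a yes/no question; each record lands in exactly one child."""
    yes = [r for r in records if question(r)]
    no = [r for r in records if not question(r)]
    return yes, no


def count_churners(records):
    """Count the records whose churned flag is set."""
    return sum(1 for r in records if r[2])


# First branch point: does the customer have an old phone?
old_phone, new_phone = split(customers, lambda r: not r[1])
# Second branch point, on the old-phone node: is the contract shorter than 3 years?
short_contract, long_contract = split(old_phone, lambda r: r[0] < 3)

# Records in the parent equal the sum of records in the two children.
print(len(customers), len(old_phone) + len(new_phone))
# Churners are conserved across the split as well.
print(count_churners(customers), count_churners(old_phone) + count_churners(new_phone))
```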
Neural Networks
Foremost among the advantages of neural networks is their highly accurate predictive models
that can be applied across a large number of different types of problems. True neural networks are
biological systems that detect patterns, make predictions and learn.

The node - This loosely corresponds to the neuron in the human brain.
The link - This loosely corresponds to the connections between neurons (axons, dendrites and
synapses) in the human brain.
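The node and the link can be sketched as a single artificial neuron. The sketch assumes the common formulation in which each link carries a weight, and the node sums its weighted inputs and squashes the result with a sigmoid function. The input values and weights are made up for illustration.

```python
import math


def node_output(inputs, weights, bias=0.0):
    """One node: weighted sum over its incoming links, squashed into (0, 1)."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))


# Hypothetical predictors scaled to the range 0..1 (for example age and income),
# and hand-picked link weights; a real network would learn the weights from data.
inputs = [0.47, 0.65]
weights = [0.7, 0.1]

prediction = node_output(inputs, weights)
print(round(prediction, 3))
```

Learning, in this picture, amounts to adjusting the link weights until the node outputs match the known outcomes in the historical data.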
Which Technique and When?
Some of the criteria that are important in determining the technique to be used are determined by trial
and error. There are definite differences in the types of problems that are most conducive to each
technique but the reality of real world data and the dynamic way in which markets, customers and
hence the data that represents them is formed means that the data is constantly changing. These
dynamics mean that it no longer makes sense to build the "perfect" model on the historical data since
whatever was known in the past cannot adequately predict the future because the future is so unlike
what has gone before.
POTENTIAL APPLICATIONS
Data mining has many and varied fields of application, some of which are listed below.
Retail/Marketing
Banking
Insurance and Health Care
Transportation
Medicine
CONCLUSION
Data Mining is not a new phenomenon. All large organizations already have data warehouses,
but they are just not managing them. The Data Warehousing solution should enhance
intelligence in the decision-making process of an enterprise. Over the next few years, the growth of
data mining is going to be enormous with new products and technologies coming out frequently. In
order to get the most out of this period, it is going to be important that data warehousing and mining
planners and developers have a clear idea of what they are looking for and then choose strategies and
methods that will provide them with performance today and flexibility for tomorrow.
