
Data Warehouse Fundamentals

Chapter 9

Data Mining Basics


Instructor: Paul Chen
Topics
1. How Data Mining Evolved?
2. Decision Processing Overview and Tasks
3. Data Mining, What’s it?
4. Data Mining vs. Data Warehousing
5. How Data Mining Works? And Its Applications
6. Data Mining Operations and Associated Techniques
7. The Data Mining Process
8. Data Mining Tools
9. Data Mining Techniques- A Summary
Topic 1: How Data Mining Evolved?

Many businesses have invested heavily in information
technology to help them manage their businesses more
effectively and gain a competitive edge. Increasingly large
amounts of critical business data are being stored
electronically, and this volume is expected to continue to
grow. Data mining technology is helping companies
leverage their existing data more effectively and obtain
insightful information that gives them a competitive edge.
How Data Mining Evolved?

Time Line
1960s: Data Collection
1970s-80s: RDBMS
1990s: OLAP and DW
Late 1990s to Now: Data Mining
Topic 2: Decision Processing
Overview
 Decision processing systems, and their underlying
analytical applications, provide business users with the
information they need to track and analyze business
trends, and to explore new business opportunities. As
businesses become increasingly competitive and
complex, effective decision processing systems are
essential for success.
The Next Generation of Business
Intelligence
 A decision processing system analyzes business
information captured from operational systems (back-office,
front-office, and e-business applications).
 Distribution of business information to business users
is via corporate intranets and extranets.
 The flow of data can be thought of as an information
supply chain whose objective is to convert operational
data into useful business information.
The Decision Processing Information
Supply Chain
[Figure: information supply chain. Components shown: operational
systems (e-business applications, back-office transaction
applications, front-office applications); external data;
information staging area; DW and business intelligence tools;
analytic applications; business metrics; collaborative office
systems; business decisions.]
Decision Processing—Four Tasks***

 Extracting and transforming information


This involves capturing data from operational systems,
transforming it into business information, and loading
it into a data warehouse information store.

Current extract templates on the market are aimed primarily at
capturing data from ERP (Enterprise Resource Planning)
transaction processing systems (for example, SAP Business
Information Warehouse and PeopleSoft BPM data warehouse).

*** Mentioned in chapter 2


Decision Processing—Four Tasks
(Cont’d)
Managing information

This task encompasses the maintenance of business information in


information stores, and how these information stores are processed by
business intelligence tools and analytic applications.
The cornerstone of decision processing is data warehousing, and
warehouse information stores should be organized and modeled into
relational and multidimensional database products.
Decision Processing—Four Tasks
(Cont’d)
 Analyzing and modeling information
The traditional approach to decision processing is to
build a data warehouse and supply business users
with a set of business intelligence tools (query,
reporting, OLAP and data mining, for example) to
process information in data warehouse information
stores.
A better approach is to employ turn-key and web-based
analytic application packages that are
designed to provide comprehensive analyses for the
business area being researched. Key business
metrics (e.g., revenue dollars per sales rep per day)
are useful.
Decision Processing—Four Tasks
(Cont’d)
 Distributing information

Business intelligence tools and analytic applications distribute information


and the results of analysis operations to business users via standard graphical
and Web interfaces.
To help users uncover and organize this range of business information, an
enterprise information portal (EIP) is required. An EIP provides a single
point of entry to any piece of business information, no matter where it
resides.
The main components of an EIP are an information assistant (Web browser
interface), an information directory, and a subscription facility.
Decision Making Under Risk

 Decisions are made under three sets of conditions:


 Certainty
 The decision makers know everything in advance

of making the decision


 Uncertainty
 The decision makers know nothing about the

probabilities or the consequences of decisions


 Risk
Decision-Making Style
 Decision-making styles of users are categorized as
either
 Analytic or
 Heuristic
Analytic and Heuristic Decision
Making
 Analytical Decision Maker
 Learns by analyzing
 Uses step-by-step procedure
 Values quantitative information and models
 Builds mathematical models and algorithms
 Seeks optimal solution
 Heuristic Decision Maker
 Learns by acting
 Uses trial and error
 Values experiences
 Relies on common sense
 Seeks completely satisfying solution
Topic 3: Data Mining, What’s it?

 Data Mining has been defined as “a decision support


process in which a search is made for patterns of
information in data”. To detect patterns in data, Data
Mining uses sophisticated statistical analysis and modeling
technologies to uncover useful relationships hidden in
databases. It predicts future trends and finds behavior
allowing businesses to make predictive, knowledge-driven
decisions.
Data Mining, What’s it?
 The process of extracting valid, previously unknown,
comprehensible, and actionable information from large
databases and using it to make crucial business
decisions (Simoudis, 1996).

 Involves analysis of data and use of software techniques


for finding hidden and unexpected patterns and
relationships in sets of data.
Data Mining, What’s it?
 Reveals information that is hidden and unexpected, as
there is little value in finding patterns and relationships
that are already intuitive.

 Patterns and relationships are identified by examining


the underlying rules and features in the data.

 Tends to work from the data up, and the most accurate
results normally require large volumes of data to
deliver reliable conclusions.
Data Mining, What’s it?
 Starts by developing an optimal representation of
structure of sample data, during which time knowledge
is acquired and extended to larger sets of data.

 Data mining can provide huge paybacks for companies


who have made a significant investment in data
warehousing.

 Relatively new technology, however already used in a


number of industries.
Topic 4: Data Mining vs. Data
Warehousing
 Data Mining does not require that a Data Warehouse be
built. Often, data can be downloaded from the operational
files to flat files that contain the data ready for the data
mining analysis.

 Data Mining can be implemented rapidly on existing


software and hardware platforms. Data Mining tools can
analyze massive databases to deliver answers to questions
such as, “Which customers are most likely to respond to
my next promotional mailing, and why?”
Data Mining vs. Data
Warehousing
 A major challenge in exploiting data mining is identifying
suitable data to mine.

 Data mining requires a single, separate, clean, integrated, and
self-consistent source of data.

 A data warehouse is well equipped for providing data for mining.

 Data quality and consistency are prerequisites for mining, to
ensure the accuracy of the predictive models. Data warehouses are
populated with clean, consistent data.
Data Mining vs. Data
Warehousing
 Advantageous to mine data from multiple sources to discover as
many interrelationships as possible. Data warehouses contain data
from a number of sources.

 Selecting relevant subsets of records and fields for data mining


requires query capabilities of the data warehouse.

 Results of a data mining study are useful if there is some way to


further investigate the uncovered patterns. Data warehouses
provide capability to go back to the data source.
Topic 5: How Data Mining
Works?
 How exactly is Data Mining able to tell you important
things that you didn’t know, or what is going to happen
next? The technique is called Predictive Modeling, which
in a broad sense is a knowledge discovery process based on
relationships and patterns.

 Modeling is the act of building a model in one situation
where you know the answer and then applying it to another
situation where you don’t.
Examples of Applications of Data
Mining via relationships and patterns
 Retail / Marketing
 Identifying buying patterns of customers
 Finding associations among customer demographic
characteristics
 Predicting response to mailing campaigns
 Market basket analysis
Examples of Applications of Data
Mining via relationships and patterns
 Banking
 Detecting patterns of fraudulent credit card use
 Identifying loyal customers
 Predicting customers likely to change their credit
card affiliation
 Determining credit card spending by customer
groups
Examples of Applications of Data
Mining via relationships and patterns
 Insurance
 Claims analysis
 Predicting which customers will buy new policies.

 Medicine
 Characterizing patient behaviour to predict
surgery visits
 Identifying successful medical therapies for
different illnesses.
Examples of Applications of Data
Mining via relationships and patterns
 Customer profiling: characteristics of good customers are
identified with the goals of predicting who will become
one and helping marketers target new prospects.

 Targeting specific marketing promotions to existing and


potential customers offers similar benefits.

 Market-basket analysis: With Data Mining, companies can


determine which products to stock in which stores, and
even how to place them within a store.
Examples of Applications of Data
Mining via relationships and patterns
 Customer relationship management: by determining the
characteristics of customers who are likely to leave for a
competitor, a company can take action to retain those
customers, which is usually far less expensive than
acquiring new ones.

 Fraud detection: with Data Mining, companies can
identify potentially fraudulent transactions before they
happen.
Topic 6: Data Mining Operations
and Associated Techniques

In previous foils, predictive modeling in essence includes


other operations shown in the above table.
Descriptive: The dealer sold 200 cars last month.
(Operational, OLTP)

Explanatory: For every 1% increase in the interest rate,
auto sales decrease by 5%.
(Traditional DW, OLAP)

Predictive: Predictions about future buyer behavior.
(Data Mining)
Level of Modeling vs. Level of Analytical Processing

Descriptive: simple queries and reports; normalized tables.

Explanatory: “what if” processing (analyze what has previously
occurred to bring about the current state of the data);
denormalized tables; roll-up and drill-down.

Predictive: determine if any patterns exist by reviewing data
relationships; statistical analysis / artificial intelligence;
classification and value prediction.



Predictive Modelling

 Similar to the human learning experience


 uses observations to form a model of the important
characteristics of some phenomenon.
 Uses generalizations of ‘real world’ and ability to fit
new data into a general framework.

 Can analyze a database to determine essential


characteristics (model) about the data set.
Predictive Modelling

 Model is developed using a supervised learning


approach, which has two phases: training and testing.

 Training builds a model using a large sample of


historical data called a training set.
 Testing involves trying out the model on new,
previously unseen data to determine its accuracy
and physical performance characteristics.
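The two phases can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records in `history`, the single `years_renting` attribute, and the simple threshold model are all hypothetical and are not taken from any particular data mining tool.

```python
# Hypothetical historical records: (years_renting, bought_property)
history = [(1, False), (2, False), (3, True), (5, True), (1, False),
           (4, True), (6, True), (2, False), (7, True), (1, False)]

def train(training_set):
    """Training: pick the threshold on years_renting that classifies
    the most training records correctly."""
    best_threshold, best_correct = None, -1
    for threshold in sorted({years for years, _ in training_set}):
        correct = sum((years >= threshold) == bought
                      for years, bought in training_set)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold

def evaluate(threshold, test_set):
    """Testing: fraction of previously unseen records predicted correctly."""
    correct = sum((years >= threshold) == bought for years, bought in test_set)
    return correct / len(test_set)

# Hold back the last three records so the model never sees them in training.
training_set, test_set = history[:7], history[7:]
threshold = train(training_set)
accuracy = evaluate(threshold, test_set)
print(threshold, accuracy)
```

In practice the training set is far larger and the model far richer, but the separation between data used to build the model and data used to judge it is the same.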
Predictive Modelling

 Applications of predictive modelling include customer


retention management, credit approval, cross selling,
and direct marketing.

 Two techniques associated with predictive modelling:


A. classification
B. value prediction, distinguished by nature of the
variable being predicted.
Statistical Analysis of Actual Sales (dollars
and quantities) Relative to These Signage
Variables: A Predictive Modeling Example
 Content
 Frequency
 Depth
 Focus
 Depth
 Scale
 Length
 Location

Statistical Analysis: Correlation, Regression, Experiment Design,
Optimization. Now it goes into real-time analysis.
PREDICTIVE MODELING

 There are two techniques associated with predictive


modeling: classification and value prediction, which
are distinguished by the nature of the variable being
predicted.
Predictive Modelling - Classification

 Used to establish a specific predetermined class for
each record in a database from a finite set of possible
class values.

 Two specializations of classification: tree induction and


neural induction.
Example of Classification using
Tree Induction
Customer renting property > 2 years?
    No: Rent property
    Yes: Customer age > 45?
        No: Rent property
        Yes: Buy property
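The tree on this slide can be read as a pair of nested tests. The sketch below simply hard-codes those two splits; the function and parameter names are illustrative assumptions, and a tree induction tool would derive the splits from training data rather than have them written by hand.

```python
def classify(years_renting: float, age: float) -> str:
    """Apply the two splits of the example tree to one customer record."""
    if years_renting <= 2:        # "Customer renting property > 2 years" is No
        return "Rent property"
    if age <= 45:                 # "Customer age > 45" is No
        return "Rent property"
    return "Buy property"         # both tests are Yes

print(classify(years_renting=3, age=50))
```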


Example of Classification using
Neural Induction
 Each processing unit (circle) in one layer is connected
to each processing unit in the next layer by a weighted
value, expressing the strength of the relationship. The
network attempts to mirror the way the human brain
works in recognizing patterns by arithmetically
combining all the variables with a given data point.

 In this way, it is possible to develop nonlinear


predictive models that ‘learn’ by studying
combinations of variables and how different
combinations of variables affect different data sets.
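As a rough illustration of the weighted connections described above, the following sketch pushes one record through a tiny two-layer network. The weights, the layer sizes, and the sigmoid activation are assumptions chosen for the example; a real tool would learn the weights during training.

```python
import math

def sigmoid(x):
    """Squash a weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights):
    """Each row of `weights` holds the connection weights from every input
    unit into one processing unit of the next layer."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs))) for row in weights]

# Hypothetical weights: 2 inputs -> 2 hidden units -> 1 output unit.
hidden_weights = [[0.5, -0.2], [0.3, 0.8]]
output_weights = [[1.0, -1.0]]

inputs = [1.0, 0.0]
hidden = layer(inputs, hidden_weights)
output = layer(hidden, output_weights)
print(hidden, output)
```

Because each unit applies a non-linear function to its weighted sum, stacking layers lets the network represent non-linear relationships between the input variables.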
Predictive Modelling - Value
Prediction
 Used to estimate a continuous numeric value that is
associated with a database record.

 Uses the traditional statistical techniques of linear


regression and non-linear regression.

 Relatively easy to use and understand.


Predictive Modelling - Value
Prediction
 Linear regression attempts to fit a straight line through
a plot of the data, such that the line is the best
representation of the average of all observations at that
point in the plot.

 The problem is that the technique only works well with
linear data and is sensitive to the presence of outliers
(i.e., data values that do not conform to the expected
norm).
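The sensitivity to outliers is easy to see with a small least-squares fit. The sketch below uses the standard closed-form formulas for simple linear regression; the data points are made up for the illustration.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

clean = [(x, 2 * x + 1) for x in range(1, 6)]  # points exactly on y = 2x + 1
with_outlier = clean + [(6, 40)]               # one value far off the trend

slope_clean, intercept_clean = fit_line(clean)
slope_outlier, intercept_outlier = fit_line(with_outlier)
print(slope_clean, slope_outlier)
```

A single extra point is enough to pull the fitted slope well away from 2, which is why outliers need attention before regression results are trusted.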
Predictive Modelling - Value
Prediction
 Although non-linear regression avoids the main
problems of linear regression, it is still not flexible enough
to handle all possible shapes of the data plot.

 Statistical measurements are fine for building linear
models that describe predictable data points; however,
most data is not linear in nature.
Predictive Modelling - Value
Prediction
 Data mining requires statistical methods that can
accommodate non-linearity, outliers, and non-numeric
data.

 Applications of value prediction include credit card


fraud detection or target mailing list identification.
Database Segmentation

 Aim is to partition a database into an unknown number


of segments, or clusters, of similar records.

 Uses unsupervised learning to discover homogeneous


sub-populations in a database to improve the accuracy
of the profiles.
Database Segmentation

 Less precise than other operations, and thus less sensitive to
redundant and irrelevant features.

 Sensitivity can be reduced by ignoring a subset of the


attributes that describe each instance or by assigning a
weighting factor to each variable.

 Applications of database segmentation include


customer profiling, direct marketing, and cross selling.
Example of Database Segmentation
using a Scatter plot
Database Segmentation
 Associated with demographic or neural clustering
techniques, distinguished by:
 Allowable data inputs
 Methods used to calculate the distance between
records
 Presentation of the resulting segments for analysis.
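To make the idea of segmenting by distance concrete, here is a minimal sketch. It is neither demographic nor neural clustering: it uses a simple "leader" scheme with Euclidean distance, chosen only because it does not need the number of segments in advance. The customer records and the `max_distance` value are hypothetical.

```python
import math

def segment(records, max_distance):
    """Put each record into the first cluster whose leader (its first
    record) lies within max_distance; otherwise start a new cluster."""
    clusters = []
    for record in records:
        for cluster in clusters:
            if math.dist(record, cluster[0]) <= max_distance:
                cluster.append(record)
                break
        else:  # no existing cluster was close enough
            clusters.append([record])
    return clusters

# Hypothetical records: (age, income in thousands)
customers = [(25, 20), (27, 22), (60, 90), (62, 88), (26, 21)]
clusters = segment(customers, max_distance=10)
print(clusters)
```

The choice of distance measure and of which attributes enter it largely determines the segments, which is why the techniques are distinguished by how they calculate distance.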
Example of Database Segmentation
using a Visualization
Link Analysis

 Aims to establish links (associations) between records,


or sets of records, in a database.

 There are three specializations


 Associations discovery
 Sequential pattern discovery
 Similar time sequence discovery

 Applications include product affinity analysis, direct


marketing, and stock price movement.
Link Analysis - Associations
Discovery
 Finds items that imply the presence of other items in
the same event.

 Affinities between items are represented by association


rules.
 e.g. ‘When customer rents property for more than 2
years and is more than 25 years old, in 40% of cases,
customer will buy a property. Association happens
in 35% of all customers who rent properties’.
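The two percentages attached to a rule like this are conventionally called confidence and support. The sketch below computes both under the usual definitions (confidence: how often the conclusion holds when the condition holds; support: how often condition and conclusion hold together across all records). The customer records are invented, so the resulting figures differ from the 40% and 35% quoted above.

```python
# Hypothetical customer records: (years_renting, age, bought_property)
customers = [
    (3, 30, True), (4, 40, False), (5, 28, True), (1, 35, False),
    (3, 22, False), (6, 50, False), (2, 45, False), (4, 33, False),
    (1, 20, False), (7, 60, True),
]

def rule_stats(records):
    """Rule: rents for more than 2 years AND older than 25 => buys property."""
    matches = [r for r in records if r[0] > 2 and r[1] > 25]
    both = [r for r in matches if r[2]]
    support = len(both) / len(records)      # rule holds in this share of all records
    confidence = len(both) / len(matches)   # conclusion holds when the condition does
    return support, confidence

support, confidence = rule_stats(customers)
print(support, confidence)
```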
Link Analysis - Sequential Pattern
Discovery
 Finds patterns between events such that the presence of
one set of items is followed by another set of items in a
database of events over a period of time.

 e.g. Used to understand long term customer buying


behaviour.
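A very small sketch of the idea: for each customer, record which items were bought after which, then count how many customers show each ordered pair. The purchase histories are hypothetical, and real sequential pattern tools handle longer sequences and time windows.

```python
from collections import Counter

# Hypothetical purchase histories, each already ordered by time.
histories = {
    "c1": ["house", "cooker", "freezer"],
    "c2": ["house", "freezer"],
    "c3": ["car", "house", "cooker"],
}

followed_by = Counter()
for events in histories.values():
    seen_pairs = set()  # count each ordered pair once per customer
    for i, earlier in enumerate(events):
        for later in events[i + 1:]:
            seen_pairs.add((earlier, later))
    followed_by.update(seen_pairs)

print(followed_by.most_common(2))
```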
Link Analysis - Similar Time
Sequence Discovery
 Finds links between two sets of data that are time-
dependent, and is based on the degree of similarity
between the patterns that both time series demonstrate.
 e.g. Within three months of buying property, new
home owners will purchase goods such as cookers,
freezers, and washing machines.
Deviation Detection

 Relatively new operation in terms of commercially


available data mining tools.

 Often a source of true discovery because it identifies


outliers, which express deviation from some previously
known expectation and norm.
Deviation Detection

 Can be performed using statistics and visualization


techniques or as a by-product of data mining.

 Applications include fraud detection in the use of credit


cards and insurance claims, quality control, and defects
tracing.
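As one example of the statistical route, the sketch below flags values that lie more than two standard deviations from the mean. The claim amounts and the threshold of 2.0 are assumptions made for the illustration; other deviation measures are equally possible.

```python
import statistics

def find_deviations(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations
    away from the mean of the whole list."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) > threshold * stdev]

# Hypothetical insurance claim amounts
claims = [120, 135, 110, 125, 130, 115, 128, 122, 118, 900]
outliers = find_deviations(claims)
print(outliers)
```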
A Summary: Data-Driven
Techniques*
 Data Visualization

 Decision Trees

 Clustering

 Factor Analysis

 Neural Network

 Association Rules

 Rule Induction

* Based on Sakhr Youness’s book “Professional Data Warehousing with SQL Server 7.0
and OLAP Services”
Data Visualization
A pie chart showing the sales of a product by region is
sometimes much more effective than presenting the same
data in a text or tabular form.

[Pie chart: sales by region. North 39%, West 21%, East 20%,
South 11%, Northeast 9%.]
Decision Tree
Cluster Analysis
[Figure: three customer segments by income: first segment (high
income > 8,000), second segment (8,000 > middle income > 3,000),
third segment (low income < 3,000). Attributes shown: have
children, married, own car, last car is a used one.]
Factor Analysis
 Unlike cluster analysis, factor analysis builds a model from data.
The technique finds underlying factors, also called “latent
variables”, and provides models for these factors based on
variables in the data. For example, a software company is considering a
survey to find out the nine most perceived attributes of one of
their products. They might group these attributes into categories
such as service for technical support, availability of training, and
a help system.

 Factor analysis is used for grouping together products based on a
similarity of buying patterns, so that vendors may bundle several
products and sell them together at a lower price than the sum of
their individual prices.
Neural Networks
Association Rules

 Association models examine the extent to which
values of one field depend on, or are produced by, values of
another field. These models are often referred to as Market Basket
Analysis when they are applied in retail industries to study the
buying patterns of customers, especially in grocery and
retail stores that issue their own credit cards. Charges against
these cards give the store the chance to associate customers’
purchases with their identities, which allows it to study
associations, among other things.
Rule Induction

 This is a powerful technique that generates a large number of rules,
expressed as a set of “if...then” statements, in the pursuit of all possible
patterns in the dataset. For example, if the customer is male, is
between 30 and 40 years of age, and has an income of more than
$20,000 and less than $50,000, then he is likely to be driving a car that
was bought new.
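Once induced, a rule of this kind is straightforward to apply to new records. The sketch below shows only that application step, not the induction itself; the dictionary keys and the `rules` list are hypothetical names for the illustration.

```python
# Each rule pairs an "if" condition with a "then" conclusion.
rules = [
    (lambda c: c["sex"] == "male"
               and 30 <= c["age"] <= 40
               and 20_000 < c["income"] < 50_000,
     "likely drives a car bought new"),
]

def apply_rules(customer):
    """Return the conclusion of every rule whose condition the customer meets."""
    return [conclusion for condition, conclusion in rules if condition(customer)]

print(apply_rules({"sex": "male", "age": 35, "income": 30_000}))
```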
A Summary: Theory-Driven
Techniques
 Correlations

 T-Tests

 Analysis of Variance

 Linear Regression

 Logistic Regression

 Discriminant Analysis

 Forecasting Methods
Topic 7: The Data Mining Process

 Define the problem.


 Select the data.
 Prepare the data.
 Mine the data.
 Deploy the model.
 Take business action.
 Are you ready for Data Mining?
Define the problem
 A successful data mining initiative always starts with
a well-defined project. To ensure that the project produces
incremental value, include an assessment of the status quo
solution and a review of technology, organization, and
business processes.
Select the data

 This step involves defining your data source (not every
data source and record is required). The data is usually
extracted from the source system to a separate server.
Prepare the data

 This step represents up to 80 percent of the total project


effort. For data mining, the data must reside in one flat
table (each record has many columns). In addition to being
the most time consuming, the step is also the most critical.
The resulting models are only as good as the data used to
create them.
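A small sketch of what "one flat table" means in practice: data held in separate customer and order structures is combined into one row per customer, with the order detail summarized into extra columns. The table contents and column names are hypothetical.

```python
# Hypothetical source data: a customer table and an order table.
customers = {1: {"name": "Ann", "age": 34}, 2: {"name": "Bob", "age": 51}}
orders = [(1, 120.0), (1, 80.0), (2, 45.0)]  # (customer_id, amount)

def flatten(customers, orders):
    """Build one wide row per customer, aggregating that customer's orders."""
    rows = []
    for customer_id, attributes in customers.items():
        amounts = [amount for cid, amount in orders if cid == customer_id]
        rows.append({
            "customer_id": customer_id,
            **attributes,
            "order_count": len(amounts),
            "total_spent": sum(amounts),
        })
    return rows

flat_table = flatten(customers, orders)
print(flat_table)
```

Deciding which summaries to derive (counts, totals, recency, and so on) is a large part of why this step dominates the project effort.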
Mine the data

 Typically the easiest and shortest phase, this step involves


applying statistical and AI tools to create mathematical
models. Data mining typically occurs on a server separate
from the data warehousing and other corporate systems.
Deploy the Model

 Model deployment is the process of implementing the


mathematical models into operational systems to improve
business results.
Take Business Action

 Use the deployed model to achieve improved results for the
business problem identified at the beginning of the
process.
Steps to Implement Data Mining
[Figure: prior knowledge; discovery (patterns, relations,
associations, etc.); information model; validation; deployment.]
ARE YOU READY FOR DATA
MINING?
Just because you have a data warehouse doesn’t mean
you’re necessarily ready for data mining. Much of the
work our company does in the data mining arena has
more to do with data mining readiness assessment than
with actually performing data mining.
Metrics you can use to gauge your data
mining readiness
 Do you have a staff of experienced knowledge workers?
 Do you have the data?
 Do you have marketing processes in place that can use this
data?
 Do you have a business champion who can embrace the
process and results?
 Do you have the technology infrastructure to support
advanced analysis?
Topic 8: Data Mining Tools

Data mining tools are typically classified by the type of


algorithm they use to identify hidden patterns. There are
many different algorithms in use, but the four most
popular are association, sequence, clustering (or
segmentation), and predictive modeling.
Data Mining Tools

 There are a growing number of commercial data


mining tools on the marketplace.

 Important characteristics of data mining tools include:


 Data preparation facilities
 Selection of data mining operations
 Product scalability and performance
 Facilities for visualization of results.
Data Mining vs. OLAP

They are two separate breeds of analysis with


entirely different objectives, not to mention
tools, skill sets, and implementation methods.
Data Mining
 With canned reports, ad hoc querying, and
OLAP, the end user defines a hypothesis and
determines which data to examine. With data
mining, the tool identifies the hypothesis, and it
actually tells the user where in the data to start
the exploration process.
Data Mining
Rather than using SQL to filter out values and methodically
reduce the data into a concise answer set, data mining uses
algorithms that exhaustively review the relationships among
data elements to determine if any patterns exist. The whole
purpose of data mining is to yield new business information
that a business person can act on.
OLAP vs. Data Mining Tools
OLAP Tools
 Are ad hoc, shrink-wrapped tools that provide an interface to data
 Are used when you have specific known questions
 Look and feel like a spreadsheet that allows rotation, slicing and graphics
 Can be deployed to a large number of users
Data Mining Tools
 Methods for analyzing multiple data types
-- Regression trees
-- Neural networks
-- Genetic algorithms
 Are used when you don’t know what the questions are
 Usually textual in nature
 Usually deployed to a small number of analysts
Data Mining Tools

 ASSOCIATION

Association, also frequently referred to as "affinity


analysis," reviews numerous sets of items and looks for
common groupings. An example of association is market
basket analysis, which involves reviewing the products
that consumers purchase in a single trip to the grocery
store.
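A minimal sketch of market basket analysis: count how often each pair of products appears in the same basket and keep the pairs seen at least twice. The baskets and the minimum count of 2 are made up for the example; commercial tools use more scalable algorithms.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets, one set of products per shopping trip.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "chips"},
    {"bread", "butter"},
    {"beer", "chips", "milk"},
]

pair_counts = Counter()
for basket in baskets:
    # Sort so that each pair is always counted under the same key.
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

frequent_pairs = {pair: n for pair, n in pair_counts.items() if n >= 2}
print(frequent_pairs)
```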
ASSOCIATION

 Finds items that imply the presence of other items


in the same event.

 Affinities between items are represented by


association rules.
 e.g. ‘When a customer rents property for more than 2
years and is more than 25 years old, in 40% of cases,
the customer will buy a property. This association
happens in 35% of all customers who rent properties’.
Data Mining Tools

 SEQUENCE

Sequential analysis helps data miners identify a set of


order-specific items or events. Association identifies the
existence of patterns or groups of items; sequential
analysis identifies the order of those patterns or groups of
items.
SEQUENCE

 Finds patterns between events such that the presence of


one set of items is followed by another set of items in a
database of events over a period of time.
e.g. Used to understand long term customer buying
behavior.
Link Analysis - Similar Time Sequence
Discovery
 Finds links between two sets of data that are time-
dependent, and is based on the degree of similarity
between the patterns that both time series demonstrate.

e.g. Within three months of buying property, new home


owners will purchase goods such as cookers, freezers, and
washing machines.
Data Mining Tools

 CLUSTERING

Cluster analysis lets the data miner assemble data into


unforeseen groups containing similar characteristics. Also
known as "segmentation," this type of data
mining is probably the most widely used.
CLUSTERING

 Aim is to partition a database into an unknown number of


segments, or clusters, of similar records.

 Uses unsupervised learning to discover homogeneous sub-


populations in a database to improve the accuracy of the
profiles.
Data Mining Tools

 PREDICTIVE MODELING

As the name implies, predictive modeling involves


developing a model from historical data for predicting a
future event. The power of predictive modeling engines is
that they can use a broad range of data attributes to identify
future behavior. Both cluster analysis and predictive
modeling tools identify distinct groups of items with common
attributes; the difference is that predictive modeling focuses
on the likelihood of a particular outcome for a particular
group.
Topic 9: Data Mining Techniques-
A Summary
 Artificial neural networks: Non-linear predictive models that
learn through training and resemble biological neural networks
in structure.
 Decision Trees: Tree-shaped structures that represent sets of
decisions. These decisions generate rules for the classification of a
database.

 Genetic Algorithms: Optimization techniques that use processes
such as genetic combination, mutation, and natural selection in a
design based on the concepts of evolution.

 Rule induction: The extraction of useful if-then rules from data


based on statistical significance.
Data Mining Techniques- A
Summary
 Predictive modeling
 Classification
 Value prediction
 Database segmentation
 Demographic clustering
 Neural clustering
 Link analysis
 Association discovery
 Sequential pattern discovery
 Similar time sequence discovery
 Deviation detection
 Statistics
 Visualization
Two Types of Data Mining Modeling-
Verification and Discovery
 The verification model utilizes a process that looks in a
database to detect trends and patterns in data that will help
answer some specific questions about the business.

 In this mode, the user generates a hypothesis about the
data, issues a query against the data, and examines the
results of the query. Either the results verify the
hypothesis or the user decides that the hypothesis is not
valid.
Verification Model

 In this model, very little information is created in this


extraction process: either the hypothesis is verified or it is
not.

 Common tools used in this mode are queries,
multidimensional analysis and visualization. What they all have
in common is that the user is essentially ‘guiding’ the
exploration of the data being inspected.
Discovery Model

 A more popular model is the Discovery Model that utilizes


a process that looks in a database to discover and/or
predict future patterns. The discovery model is divided
into two modes: “Descriptive” and “Predictive”.
Discovery Model- Descriptive Mode

 The Descriptive mode finds hidden patterns without a


predetermined idea or hypothesis about what the patterns
may be. In other words, the Data Mining software or
program takes the initiative in finding what the interesting
patterns are, without the user thinking of the relevant
questions first. In this mode information is created about
the data with very little or no guidance from the user. The
exploration of the data is done in such a way as to yield as
many useful facts about the data as possible in the shortest
amount of time.
Discovery Model- Predictive Mode

 In the Predictive mode patterns discovered from the database are used
to predict the future patterns or trends. Predictive modeling allows the
user to submit records with some unknown field values, and the
system will guess the unknown values based on previous patterns
discovered from the database.

 In comparing the two models, one can state that “Verification” can be
very inefficient, time-consuming and costly, whereas “Discovery” modeling
can be very efficient and cost-effective, is less dependent on user input, and
increases modeling accuracy.
