
Chapter 1

 
-- Basic Data Mining Tasks
-- Related Concepts
-- Data Mining Techniques
Definition:
Data mining is defined as finding hidden
information in a database.
A general database is accessed as follows:
[Figure: general database access; diagram not recoverable]
Data mining involves a number of algorithms to
accomplish its tasks.
The algorithms examine the data and determine a
model that is closest to the characteristics of the data
being examined.
Data mining algorithms can be characterized as having three parts:
1) Model: The purpose of the algorithm is to fit a model to the data.
2) Preference: Some criteria must be used to prefer one
model over another.
3) Search: All algorithms require some technique to
search the data.
Data Mining Models and Tasks
[Figure: data mining models (predictive and descriptive) and their tasks]
A predictive model makes a prediction based on
previously known results; it uses historical data.
For example, a credit card use might be refused not
because of the user's own credit history, but because
the current purchase is similar to earlier purchases
that were subsequently found to have been made with stolen
cards.
Here the predictive model is used to predict whether
the current purchase is fraudulent.
A descriptive model identifies patterns or
relationships in data.
Classification:
- Maps data into predefined groups or classes.
- It is also referred to as supervised learning
because the classes are defined before
examining the data.
- Examples: deciding whether to make a bank loan and
identifying credit risks.
- Pattern recognition is a type of classification.
In pattern recognition, an input pattern is
classified into one of several classes based
on its similarity to these predefined
classes.
Example:
An airport security screening station is used
to determine whether passengers are potential terrorists or
criminals.
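The idea of classifying an input pattern by its similarity to predefined classes can be sketched as follows. This is only a minimal illustration, assuming each class is represented by a single prototype point and that similarity is Euclidean distance; the class names and prototype values are hypothetical and do not come from the slides.

```python
import math

# Hypothetical predefined classes, each represented by one prototype
# pattern (height in cm, weight in kg). Values are made up.
PROTOTYPES = {
    "small": (150.0, 50.0),
    "medium": (170.0, 70.0),
    "large": (190.0, 95.0),
}


def classify(pattern, prototypes=PROTOTYPES):
    """Return the label of the prototype closest to the input pattern."""
    return min(
        prototypes,
        key=lambda label: math.dist(pattern, prototypes[label]),
    )


print(classify((172.0, 68.0)))
```

The classes exist before any input is examined, which is what makes this a supervised setting.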
Regression:
Regression is used to map a data item to a real-valued
prediction variable.
Regression involves learning the function
that does this mapping.
Regression assumes that the target data fit
some known type of function (e.g., linear,
logistic, etc.).
For example, a professor wants to reach a certain
level of savings.
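A linear fit of this kind can be sketched with ordinary least squares. The code below is an illustration under stated assumptions: the function name fit_line and the yearly savings figures are hypothetical, and a straight line is only one of the function types mentioned above.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx             # slope
    a = mean_y - b * mean_x   # intercept
    return a, b


# Hypothetical savings balances (in thousands) at the end of years 1..5.
years = [1, 2, 3, 4, 5]
savings = [10.0, 12.0, 14.0, 16.0, 18.0]

a, b = fit_line(years, savings)
predicted_year_10 = a + b * 10  # extrapolate the learned function
print(a, b, predicted_year_10)
```

The learned function maps a data item (the year) to a real-valued prediction (the savings balance).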
Time Series Analysis:
The value of an attribute is examined as it varies
over time. The values are usually obtained at evenly
spaced time points (hourly, daily, weekly, etc.).
A time series plot is used to visualize the time
series.
Prediction:
Prediction can be viewed as a type of classification.
The difference is that prediction
predicts a future state rather than a current
state.
For example, predicting flooding.
Clustering:
Clustering is alternatively referred to as
unsupervised learning or segmentation.
Clustering is usually accomplished by
determining the similarity among data on
predefined attributes.

For example, catalogs for different demographic groups.
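Grouping by similarity on a predefined attribute can be sketched as below. This is a simple single-pass, threshold-based grouping chosen for brevity, not a method named in the slides; the attribute (age), the threshold, and the sample values are hypothetical.

```python
def cluster_by_similarity(values, threshold):
    """Group values in one pass.

    A value joins the first cluster whose first member (its
    representative) is within `threshold`; otherwise it starts
    a new cluster.
    """
    clusters = []
    for v in values:
        for cluster in clusters:
            if abs(v - cluster[0]) <= threshold:
                cluster.append(v)
                break
        else:
            # No existing cluster was similar enough.
            clusters.append([v])
    return clusters


ages = [22, 25, 47, 51, 23, 68, 49, 71]
groups = cluster_by_similarity(ages, threshold=5)
print(groups)
```

No class labels are supplied in advance; the groups emerge from the data, which is why clustering is called unsupervised.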


Summarization:
Summarization maps data into subsets with associated
simple descriptions.
Summarization is also called characterization
or generalization.
It extracts or derives representative
information about the database.
For example, one of the many criteria used to compare
universities by U.S. News and World
Report is the average SAT or ACT score.
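Deriving one representative value per subset can be sketched as follows. The university names and scores are hypothetical placeholders, not real data, and the mean is just one possible simple description.

```python
from statistics import mean

# Hypothetical SAT scores of admitted students; not real data.
sat_scores = {
    "University A": [1200, 1350, 1280, 1410],
    "University B": [1100, 1180, 1050, 1230],
}


def summarize(scores_by_group):
    """Map each subset of the data to a simple description: its mean."""
    return {
        group: mean(scores)
        for group, scores in scores_by_group.items()
    }


summary = summarize(sat_scores)
print(summary)
```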
Association Rules:
An association rule is a model that identifies
specific types of data associations.
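The slides do not say how an association is measured. A common choice, assumed here, is to score a rule X => Y by its support and confidence over a set of transactions. The transactions and item names below are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of (X and Y) divided by support of X, for rule X => Y."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical market-basket transactions.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
]

rule_support = support(transactions, {"bread", "butter"})
rule_confidence = confidence(transactions, {"bread"}, {"butter"})
print(rule_support, rule_confidence)
```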
Sequence Discovery:
Sequential analysis is used to determine
sequential patterns in data. These patterns
are based on a time sequence of actions.
They are similar to associations in that
data are found to be related, but the
relationship is based on time.
Data Mining versus Knowledge Discovery in
Databases:
Knowledge discovery in databases (KDD) is the
process of finding useful information and
patterns in data.
Data mining is the use of algorithms
to extract the information and patterns
derived by the KDD process.
KDD is a process that takes data as input and
produces useful information as output.
[Figure: the KDD process, from Database to Result]
The KDD process consists of the following five
steps:
1) Selection
2) Preprocessing
3) Transformation
4) Data mining
5) Interpretation/evaluation

Some Related Concepts
- Database / OLTP
- Fuzzy sets and fuzzy logic
- Information Retrieval
- Decision Support System
- Dimensional Modeling
- Data Warehousing
- OLAP
- Web Search Engine
- Statistics
- Machine Learning
- Pattern Matching
Database/OLTP Systems
- A database contains the data of an organization or
enterprise.
- A database system manages all of this data
according to its data model and the
relationships among its entities.
- To describe the data, a data model is designed.
ER Model Example
[Figure: ER diagram; surviving labels: Employee, Address]
Fuzzy Sets
Fuzzy logic means reasoning with uncertainty.
A fuzzy set is a set of fuzzy values.
- Fuzzy values are approximate values.
Consider a fuzzy set F:
F = { x | x ∈ Z+ and x <= 5 }
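As written, F has a sharp boundary: an integer is either in it or not. In fuzzy set theory, membership is usually expressed as a degree between 0 and 1. The sketch below contrasts the two; the graded function small_membership and its cut-off points (5 and 10) are hypothetical choices made only for illustration.

```python
def in_F(x):
    """Crisp membership in F = { x | x in Z+ and x <= 5 }."""
    return isinstance(x, int) and 0 < x <= 5


def small_membership(x):
    """Hypothetical graded membership in 'small positive numbers'.

    Fully a member up to 5, not a member from 10 on, and
    decreasing linearly in between.
    """
    if x <= 0 or x >= 10:
        return 0.0
    if x <= 5:
        return 1.0
    return (10 - x) / 5


for x in (3, 5, 7, 12):
    print(x, in_F(x), small_membership(x))
```

With the graded version, 7 is "somewhat small" rather than simply excluded, which is the kind of approximate value fuzzy logic reasons about.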
Information Retrieval
[Figure: information retrieval query; diagram not recoverable]

IR query result measures
An IR system consists of a set of documents,
D = { D1, D2, ..., Dn }.
The input to the system is a query q (which contains the
keywords).
The similarity between the query and each
document is then calculated as sim(q, Di).
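The slides do not define sim(q, Di). One possible choice, assumed here purely for illustration, is the Jaccard overlap between the query keywords and the words of each document. The document texts are hypothetical.

```python
def sim(query, document):
    """Jaccard similarity between query keywords and document words."""
    q = set(query.lower().split())
    d = set(document.lower().split())
    if not q or not d:
        return 0.0
    return len(q & d) / len(q | d)


# Hypothetical document collection D = { D1, D2, D3 }.
documents = {
    "D1": "data mining finds hidden information",
    "D2": "fuzzy logic reasons with uncertainty",
    "D3": "database mining of data",
}

query = "data mining"
# Rank documents from most to least similar to the query.
ranked = sorted(
    documents,
    key=lambda name: sim(query, documents[name]),
    reverse=True,
)
print(ranked)
```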
The effectiveness of the system in processing
the query is measured by precision
and recall.
Precision = |Relevant ∩ Retrieved| / |Retrieved|
Recall = |Relevant ∩ Retrieved| / |Relevant|
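Using the standard definitions (precision is the fraction of retrieved documents that are relevant; recall is the fraction of relevant documents that are retrieved), the two measures can be computed as sketched below. The document identifiers and the relevant/retrieved sets are hypothetical.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for one query result."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)  # relevant AND retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical judgment: four documents are relevant to the query,
# and the system returned three documents.
relevant = {"D1", "D2", "D3", "D4"}
retrieved = {"D1", "D2", "D5"}

precision, recall = precision_recall(relevant, retrieved)
print(precision, recall)
```

Returning more documents tends to raise recall but can lower precision, so the two are usually reported together.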
Decision Support System
-Dimensional Modeling
A dimension is a collection of logically
related attributes and is viewed as an axis
for modeling the data.
Example, the time dimension: time, month, year,
decade, century, etc.
Web Search Engine
Web search engines are treated as IR systems.
[Figure: web search engine as an IR system; diagram not recoverable]
Search Engine Limitations
Search engines face several problems:
- Abundance: A single query cannot retrieve all of the
data on the Web.
- Limited coverage: Although search engines are available,
each one searches only a limited portion of the data.
- Limited query: The kinds of queries a search engine accepts are limited.
- Limited customization: The search engine lacks knowledge
about the user.
Machine Learning
Machine learning is the area of AI that examines
how to write programs that can learn.
In data mining, machine learning is used for
prediction or classification.
In data mining applications, the learning is based on some model of the data.
The two types of machine learning are:
- Supervised learning
- Unsupervised learning
Pattern Matching
