2
Why Has Data Mining Appeared?
• Large volumes of data stored by organizations in a
competitive environment combined with advances in
technologies which can be applied to the data
• Background and evolution
– The failure of traditional approaches
• The need for exploratory data analysis
– Niche marketing, customer retention, the internet, online
interaction, scientific discovery
• The means to implement Data Mining
– data warehouses, computing power, effective modelling
approaches
3
Structural Pattern of Data
4
Structural Pattern of Data --cont--
5
Machine Learning
• To learn:
– To get knowledge of by study, experience, or being
taught
– To become aware by information or from observation
– To commit to memory
– To be informed
– To receive instruction
• Learning:
– Things learn when they change their behavior in a way
that makes them perform better in the future
6
Machine Learning --cont--
• Machine Learning is concerned with learning in a
practical, not a theoretical, sense
• Interested in techniques for finding and
describing structural patterns in data as a tool
for helping to explain that data and make
predictions from it
7
Data Mining
• Preliminary Analysis
– Much interesting information can be found by
querying the data set
– May be supported by a visualisation of the data set
• Choose one or more modelling approaches
• There are (at least?) two styles of data mining
– Hypothesis testing
– Knowledge discovery
• The styles and approaches are not mutually
exclusive
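A minimal sketch of preliminary analysis by querying, using Python's built-in sqlite3 module. The purchases table, its columns and its rows are hypothetical and only illustrate the kind of summary query meant above.

```python
import sqlite3

# Hypothetical in-memory data set; table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [("a", "north", 10.0), ("b", "north", 30.0), ("c", "south", 50.0)],
)

# A simple summary query: number of purchases and average amount per region.
rows = conn.execute(
    "SELECT region, COUNT(*), AVG(amount) "
    "FROM purchases GROUP BY region ORDER BY region"
).fetchall()
conn.close()

for region, n_purchases, avg_amount in rows:
    print(region, n_purchases, avg_amount)
```

Summaries like this often suggest which hypotheses are worth testing before any model is chosen.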
8
The Process of Knowledge Discovery
• Pre-processing
– data selection
– cleaning
– coding
• Data Mining
– select a model
– apply the model
• Analysis of results and assimilation
– Take action and measure the results
9
Data Selection
• Identify the relevant data, both internal and
external to the organisation
• Select the subset of the data appropriate for
the particular data mining application
• Store the data in a database separate from
the operational systems
10
Data Pre-Processing
• Cleaning
– Domain consistency: replace values that fall outside
a field's valid domain with null
– De-duplication: customers are often added to the
database (DB) on each purchase transaction
– Disambiguation: highlighting ambiguities for a
decision by the user
• e.g., if names differed slightly but addresses were the
same
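The three cleaning steps above can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: the customer records, the allowed gender codes and the 0.8 name-similarity threshold are all hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations

def enforce_domain(record, field, allowed):
    """Domain consistency: replace an out-of-domain value with None (null)."""
    cleaned = dict(record)
    if cleaned.get(field) not in allowed:
        cleaned[field] = None
    return cleaned

def deduplicate(records, key_fields=("name", "address")):
    """De-duplication: keep the first record seen for each (name, address)."""
    seen, unique = set(), []
    for rec in records:
        key = tuple(rec[f].strip().lower() for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

def ambiguous_pairs(records, threshold=0.8):
    """Disambiguation: flag same-address pairs whose names differ slightly."""
    flagged = []
    for a, b in combinations(records, 2):
        same_address = a["address"].strip().lower() == b["address"].strip().lower()
        similarity = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
        if same_address and threshold <= similarity < 1.0:
            flagged.append((a["name"], b["name"]))  # left for the user to decide
    return flagged

# Hypothetical customer records.
customers = [
    {"name": "John Smith", "address": "1 High St", "gender": "M"},
    {"name": "John Smith", "address": "1 High St", "gender": "M"},
    {"name": "Jon Smith", "address": "1 High St", "gender": "X"},
    {"name": "Mary Jones", "address": "9 Low Rd", "gender": "F"},
]
cleaned = [enforce_domain(c, "gender", {"F", "M"}) for c in deduplicate(customers)]
flagged = ambiguous_pairs(cleaned)
print(cleaned)
print(flagged)
```

The flagged pairs are only reported, not merged, because the slide leaves the final decision to the user.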
11
Data Pre-Processing --cont--
• Enrichment
– Additional fields are added to records from external
sources which may be vital in establishing
relationships.
• Coding
– e.g., take addresses and replace them with regional
codes
– e.g., transform birth dates into age ranges
• It is often necessary to convert continuous data
into range data for categorisation purposes.
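A short sketch of the two coding examples above. The city-to-region lookup, the ten-year age bands and the sample record are assumptions chosen only for illustration.

```python
from datetime import date

# Hypothetical lookup from city to regional code.
REGION_CODES = {"Leeds": "N", "London": "S"}

def age_on(birth_date, today):
    """Age in whole years on the given day."""
    years = today.year - birth_date.year
    if (today.month, today.day) < (birth_date.month, birth_date.day):
        years -= 1  # birthday not yet reached this year
    return years

def age_range(age, width=10):
    """Convert a continuous age into a range label such as '40-49'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def code_record(record, today):
    """Replace the address and birth date with coarser, categorical codes."""
    return {
        "region": REGION_CODES.get(record["city"]),
        "age_range": age_range(age_on(record["birth_date"], today)),
    }

coded = code_record(
    {"city": "Leeds", "birth_date": date(1980, 6, 15)},
    today=date(2024, 1, 1),
)
print(coded)
```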
12
Data Mining Tasks
• Various taxonomies exist. For example, Berry & Linoff define 6 tasks:
– Classification
– Estimation (a.k.a. regression)
– Prediction
– Association Rule Discovery (a.k.a. Affinity Grouping)
– Clustering
– Description
• The tasks are also referred to as operations. Cabena et al. define 4 operations:
– Predictive Modelling
– Database Segmentation (a.k.a. clustering)
– Link Analysis
– Deviation Detection
• Beware! Different authors use different names for the same technique, operation
or task.
13
Classification
• Classification involves considering the
features of some object and then assigning it to
some pre-defined class, for example:
– Spotting fraudulent insurance claims
– Which phone numbers are fax numbers
– Which customers are high-value
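One possible way to assign an object to a pre-defined class is a nearest-neighbour rule, sketched below for the high-value customer example. The technique is swapped in for illustration (the slides do not name one), and the features (annual spend, number of visits) and training records are hypothetical.

```python
import math

def nearest_neighbour(train, features):
    """Return the class label of the training example closest to `features`."""
    closest = min(train, key=lambda example: math.dist(example[0], features))
    return closest[1]

# Hypothetical labelled customers: (annual spend, visits per year) -> class.
train = [
    ((200.0, 2.0), "low-value"),
    ((350.0, 4.0), "low-value"),
    ((2200.0, 25.0), "high-value"),
    ((3100.0, 30.0), "high-value"),
]

label = nearest_neighbour(train, (2500.0, 20.0))
print(label)
```

In practice the features would normally be scaled first, since here annual spend dominates the distance.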
14
Regression
• Regression deals with numerically valued
outcomes rather than the discrete categories
used in classification.
– Estimating the number of children in a family
– Estimating family income
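As a sketch of estimating a numeric outcome, the code below fits a straight line by ordinary least squares. The predictor (years of education) and the income figures are hypothetical; the slides do not say which inputs are used.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: years of education vs family income (in thousands).
years = [10, 12, 14, 16]
income = [20.0, 26.0, 32.0, 38.0]

slope, intercept = fit_line(years, income)
estimate = slope * 13 + intercept  # estimated income for 13 years of education
print(slope, intercept, estimate)
```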
15
Prediction
• Essentially the same as classification and
estimation but involves future behavior
• Historical data is used to build a model
explaining behavior (outputs) for known inputs
• The model developed is then applied to current
inputs to predict future outputs
– Predict which customers will respond to an
advertising promotion
– Classifying loan applications
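The build-then-apply pattern above can be sketched with a deliberately simple model: the historical response rate per customer segment. The segments, the history and the 0.5 threshold are assumptions for illustration only.

```python
from collections import defaultdict

def build_model(history):
    """Learn the response rate per segment from historical (segment, responded) pairs."""
    counts = defaultdict(lambda: [0, 0])  # segment -> [responses, total]
    for segment, responded in history:
        counts[segment][0] += int(responded)
        counts[segment][1] += 1
    return {seg: responses / total for seg, (responses, total) in counts.items()}

def predict(model, segment, threshold=0.5):
    """Predict a future response if the historical rate reaches the threshold."""
    return model.get(segment, 0.0) >= threshold

# Hypothetical outcomes of a past promotion, keyed by age range.
history = [
    ("18-29", True), ("18-29", True), ("18-29", False),
    ("30-49", False), ("30-49", False), ("30-49", True),
]

model = build_model(history)
predictions = {seg: predict(model, seg) for seg in ("18-29", "30-49")}
print(predictions)
```

Segments never seen in the history default to a rate of 0.0, which is a design choice of this sketch rather than a general rule.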
16
Association Rule Discovery
• Association Rule Discovery is also referred to
as Market Basket Analysis or Affinity
Grouping
• A common example is discovering which
items are bought together at the
supermarket. Once this is known, decisions
can be made on, for example:
– how to arrange items on the shelves
– which items should be promoted together
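A minimal sketch of finding items bought together, restricted to pairs of items. It uses the standard support and confidence measures, which the slides do not introduce; the baskets and the 0.5 minimum support are hypothetical.

```python
from collections import Counter
from itertools import combinations

def pair_rules(baskets, min_support=0.5):
    """Return (antecedent, consequent, support, confidence) for frequent pairs."""
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in set(basket))
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(set(basket)), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # fraction of baskets containing both items
        if support >= min_support:
            # confidence = P(consequent in basket | antecedent in basket)
            rules.append((a, b, support, count / item_counts[a]))
            rules.append((b, a, support, count / item_counts[b]))
    return rules

# Hypothetical supermarket baskets.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["bread", "jam"],
    ["milk", "butter"],
]

rules = pair_rules(baskets)
for antecedent, consequent, support, confidence in rules:
    print(f"{antecedent} -> {consequent}: support={support:.2f}, confidence={confidence:.2f}")
```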
17
Clustering
18
Deviation Detection
• Records whose attributes deviate from the norm
by significant amounts are called outliers
• Application areas include:
– fraud detection
– quality control
– tracing defects
• Visualization techniques and statistical
techniques are useful in finding outliers
• A cluster which contains only a few records may
in fact represent outliers
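One simple statistical technique for finding outliers is a z-score test, sketched below. The claim amounts and the threshold of 2 standard deviations are assumptions; other techniques and thresholds are equally valid.

```python
from statistics import mean, pstdev

def outliers(values, z_threshold=2.0):
    """Return the values lying more than z_threshold standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, so nothing deviates
    return [v for v in values if abs(v - mu) / sigma > z_threshold]

# Hypothetical insurance claim amounts; one is far from the rest.
claims = [100, 110, 95, 105, 98, 102, 900]

flagged = outliers(claims)
print(flagged)
```

With very small samples a single extreme value inflates the standard deviation, so the threshold needs to be chosen with the sample size in mind.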
19