Data Mining, Data Pattern, Machine Learning (Week 2

DATA MINING
Data Mining, Data Pattern and Machine Learning
Definition
• “…the analysis of (often large) observational data sets to find
unsuspected relationships and to summarize the data in novel
ways that are both understandable and useful to the data
owner.”
Hand, Mannila & Smyth
• “… an interdisciplinary field bringing together techniques from
machine learning, pattern recognition, statistics, databases,
and visualization to address the issue of information extraction
from large data bases.”
Evangelos Simoudis in Cabena et al.
• “… the extraction of implicit, previously unknown, and
potentially useful information from data.”
Witten & Frank
Why Has Data Mining Appeared?
• Large volumes of data stored by organizations in a
competitive environment combined with advances in
technologies which can be applied to the data
• Background and evolution
– The failure of traditional approaches
• The need for exploratory data analysis
– Niche marketing, customer retention, the internet, online
interaction, scientific discovery
• The means to implement Data Mining
– data warehouses, computing power, effective modelling
approaches
Structural Pattern of Data
Structural Pattern of Data --cont--
Machine Learning
• To learn:
– To get knowledge of by study, experience, or being
taught
– To become aware by information or from observation
– To commit to memory
– To be informed
– To receive instruction
• Learning:
– Things learn when they change their behavior in a way
that makes them perform better in the future
Machine Learning --cont--
• Machine Learning is concerned with learning in
a practical, not a theoretical, sense
• Interested in techniques for finding and
describing structural patterns in data as a tool
for helping to explain that data and make
predictions from it
Data Mining
• Preliminary Analysis
– Much interesting information can be found by
querying the data set
– May be supported by a visualisation of the data set
• Choose one or more modelling approaches
• There are (at least?) two styles of data mining
– Hypothesis testing
– Knowledge discovery
• The styles and approaches are not mutually
exclusive
The Process of Knowledge Discovery
• Pre-processing
– data selection
– cleaning
– coding
• Data Mining
– select a model
– apply the model
• Analysis of results and assimilation
– Take action and measure the results
Data Selection
• Identify the relevant data, both internal and
external to the organisation
• Select the subset of the data appropriate for
the particular data mining application
• Store the data in a database separate from
the operational systems
Data Pre-Processing
• Cleaning
– Domain consistency: replace values that are not valid
for their domain with null
– De-duplication: customers are often added to the
database (DB) on each purchase transaction
– Disambiguation: highlighting ambiguities for a
decision by the user
• e.g., if names differed slightly but addresses were the
same
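The three cleaning steps above can be sketched as follows. This is a minimal illustration only: the record layout, the field names (name, address, age) and the valid age range of 0 to 120 are assumptions made for the example, not part of the original material.

```python
# Hypothetical customer records; field names and values are illustrative only.
records = [
    {"name": "J. Smith", "address": "1 High St", "age": 34},
    {"name": "J. Smith", "address": "1 High St", "age": 34},    # exact duplicate
    {"name": "John Smith", "address": "1 High St", "age": 34},  # same address, name differs
    {"name": "A. Jones", "address": "9 Low Rd", "age": 999},    # age outside the assumed domain
]

def enforce_domain(record, low=0, high=120):
    """Domain consistency: replace an out-of-range age with None (null)."""
    cleaned = dict(record)
    if cleaned["age"] is not None and not (low <= cleaned["age"] <= high):
        cleaned["age"] = None
    return cleaned

def deduplicate(rows):
    """De-duplication: keep the first occurrence of each identical record."""
    seen, unique = set(), []
    for row in rows:
        key = (row["name"], row["address"], row["age"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def flag_ambiguous(rows):
    """Disambiguation: list addresses shared by more than one distinct name."""
    names_by_address = {}
    for row in rows:
        names_by_address.setdefault(row["address"], set()).add(row["name"])
    return {addr: sorted(names) for addr, names in names_by_address.items() if len(names) > 1}

cleaned = deduplicate([enforce_domain(r) for r in records])
ambiguous = flag_ambiguous(cleaned)
```

The ambiguous entries are only reported, not merged, since the slide leaves that decision to the user.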
Data Pre-Processing --cont--
• Enrichment
– Additional fields are added to records from external
sources which may be vital in establishing
relationships.
• Coding
– e.g., take addresses and replace them with regional
codes
– e.g., transform birth dates into age ranges
• It is often necessary to convert continuous data
into range data for categorisation purposes.
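A small sketch of the two coding examples above, assuming a ten-year range width, a fixed reference date, and a made-up lookup table from postcode prefix to regional code (REGION_BY_PREFIX is hypothetical and exists only for this illustration).

```python
from datetime import date

def age_on(birth_date, today):
    """Age in whole years on the reference date."""
    had_birthday = (today.month, today.day) >= (birth_date.month, birth_date.day)
    return today.year - birth_date.year - (0 if had_birthday else 1)

def age_range(age, width=10):
    """Map a continuous age to a range label such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

# Hypothetical lookup from postcode prefix to regional code.
REGION_BY_PREFIX = {"10": "NORTH", "20": "SOUTH"}

def region_code(postcode):
    """Replace a postcode with a regional code ('UNKNOWN' if not mapped)."""
    return REGION_BY_PREFIX.get(postcode[:2], "UNKNOWN")

reference = date(2024, 1, 1)  # fixed reference date so the result is reproducible
customer = {"birth_date": date(1989, 6, 15), "postcode": "20417"}
coded = {
    "age_range": age_range(age_on(customer["birth_date"], reference)),
    "region": region_code(customer["postcode"]),
}
```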
Data Mining Tasks
• Various taxonomies exist, e.g., Berry & Linoff define 6 tasks:
– Classification
– Estimation (a.k.a. regression)
– Prediction
– Association Rule Discovery (a.k.a. Affinity Grouping)
– Clustering
– Description
• The tasks are also referred to as operations. Cabena et al. define 4 operations:
– Predictive Modelling
– Database Segmentation (a.k.a. clustering)
– Link Analysis
– Deviation Detection
• Beware! Different authors use different names for the same technique, operation
or task.
Classification
• Classification involves considering the
features of some object, then assigning it to
some pre-defined class, for example:
– Spotting fraudulent insurance claims
– Which phone numbers are fax numbers
– Which customers are high-value
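As one possible illustration of assigning an object to a pre-defined class, the sketch below uses a 1-nearest-neighbour rule. The slides do not name a specific technique, so this choice, the two features (annual spend, visits per month) and the training data are all assumptions.

```python
import math

# Hypothetical labelled customers: (annual spend, visits per month) -> class.
training = [
    ((5000.0, 8.0), "high-value"),
    ((4200.0, 6.0), "high-value"),
    ((300.0, 1.0), "low-value"),
    ((550.0, 2.0), "low-value"),
]

def classify(features, examples):
    """Assign the class of the closest labelled example (1-nearest neighbour)."""
    nearest = min(examples, key=lambda ex: math.dist(features, ex[0]))
    return nearest[1]

label = classify((4700.0, 7.0), training)
```

In practice the features would usually be scaled first, since here the spend values dominate the distance.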
Regression
• Regression deals with numerically valued
outcomes rather than the discrete categories
used in classification.
– Estimating the number of children in a family
– Estimating family income
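A minimal sketch of estimating a numeric outcome with a least-squares straight line. The predictor (years of education) and the data points are invented for the example; the slides do not say which inputs or which regression method would be used.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical data: years of education vs family income (in thousands).
years = [10, 12, 14, 16]
income = [40.0, 46.0, 52.0, 58.0]

intercept, slope = fit_line(years, income)
estimate = intercept + slope * 13  # estimated income for 13 years of education
```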
Prediction
• Essentially the same as classification and
estimation but involves future behavior
• Historical data is used to build a model
explaining behavior (outputs) for known inputs
• The model developed is then applied to current
inputs to predict future outputs
– Predict which customers will respond to an
advertising promotion
– Classifying loan applications
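The two-step idea above (build a model from historical data, then apply it to current inputs) might look like the sketch below. The model here is deliberately simple, a per-segment response rate with a 0.5 threshold, and the segments and history are hypothetical.

```python
from collections import defaultdict

# Hypothetical historical data: (customer segment, responded to a past promotion).
history = [
    ("student", True), ("student", True), ("student", False),
    ("retired", False), ("retired", False), ("retired", True),
]

def build_model(rows):
    """Learn the historical response rate of each segment."""
    counts = defaultdict(lambda: [0, 0])  # segment -> [responses, total]
    for segment, responded in rows:
        counts[segment][0] += int(responded)
        counts[segment][1] += 1
    return {segment: hits / total for segment, (hits, total) in counts.items()}

def predict(model, segment, threshold=0.5):
    """Predict a future response when the learned rate reaches the threshold."""
    return model.get(segment, 0.0) >= threshold

model = build_model(history)
current_customers = ["student", "retired", "unknown"]
predictions = [predict(model, s) for s in current_customers]
```

Segments never seen in the history default to a rate of 0.0, which is itself an assumption of this sketch.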
Association Rule Discovery
• Association Rule Discovery is also referred to
as Market Basket Analysis, or Affinity
Grouping
• A common example is discovering which
items are bought together at the
supermarket. Once this is known, decisions
can be made on, for example:
– how to arrange items on the shelves
– which items should be promoted together
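To make "items bought together" concrete, the sketch below counts item pairs across a handful of invented baskets and computes the two measures commonly used for association rules, support and confidence. The baskets and function names are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical supermarket baskets.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]

item_counts = Counter(item for basket in baskets for item in basket)
pair_counts = Counter(
    pair for basket in baskets for pair in combinations(sorted(basket), 2)
)

def support(pair):
    """Fraction of all baskets that contain both items."""
    return pair_counts[tuple(sorted(pair))] / len(baskets)

def confidence(antecedent, consequent):
    """Fraction of baskets with the antecedent that also contain the consequent."""
    return pair_counts[tuple(sorted((antecedent, consequent)))] / item_counts[antecedent]

bread_butter_support = support(("bread", "butter"))
butter_given_bread = confidence("bread", "butter")
```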
Clustering
• Clustering is also sometimes referred to as
segmentation (though this has other meanings in
other fields)
• In clustering there are no pre-defined classes. A
similarity measure is used to group records. The user
must attach meaning to the clusters formed
• Clustering often precedes some other data mining
task, for example:
– once customers are separated into clusters, a promotion
might be carried out based on market basket analysis of
the resulting cluster
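A rough sketch of grouping records by similarity without pre-defined classes, using a one-dimensional k-means loop. k-means is only one of several clustering methods and is not named in the slides; the spend figures and starting centroids are invented.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Assign each value to its nearest centroid, then move centroids to group means."""
    centroids = list(centroids)
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous centroid.
        centroids = [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical annual spend per customer; the starting centroids are an arbitrary guess.
spend = [100, 120, 130, 900, 950, 1000]
centroids, clusters = kmeans_1d(spend, centroids=[0, 500])
```

The algorithm only returns two groups of numbers; calling them "occasional" and "frequent" shoppers, for instance, is the meaning the user has to attach.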
Deviation Detection
• Records whose attributes deviate from the norm
by significant amounts are also called outliers
• Application areas include:
– fraud detection
– quality control
– tracing defects
• Visualization techniques and statistical
techniques are useful in finding outliers
• A cluster which contains only a few records may
in fact represent outliers
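One simple statistical way to find records that deviate from the norm is a z-score test, sketched below. The threshold of 2.0 standard deviations and the claim amounts are illustrative assumptions, not values from the slides.

```python
import statistics

def find_outliers(values, threshold=2.0):
    """Return values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - mean) / spread > threshold]

# Hypothetical claim amounts; one value is far from the rest.
claims = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
outliers = find_outliers(claims)
```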
