Escolar Documentos
Profissional Documentos
Cultura Documentos
Tan,Steinbach, Kumar
4/18/2004
What is Data?
An attribute is a property or
characteristic of an object
Attributes
A collection of attributes
describe an object
Object is also known as
record, point, case, sample,
entity, or instance
Tan,Steinbach, Kumar
Taxable
Income Cheat
Yes
Single
125K
No
No
Married
100K
No
No
Single
70K
No
Yes
Married
120K
No
No
Divorced 95K
Yes
No
Married
No
Yes
Divorced 220K
No
No
Single
85K
Yes
No
Married
75K
No
10
No
Single
90K
Yes
60K
10
4/18/2004
Attribute Values
Tan,Steinbach, Kumar
4/18/2004
Measurement of Length
B
7
2
C
D
10
E
15
Tan,Steinbach, Kumar
4/18/2004
Types of Attributes
Ordinal
u
Interval
u
Ratio
u
Tan,Steinbach, Kumar
4/18/2004
Distinctness:
Order:
Addition:
Multiplication:
Tan,Steinbach, Kumar
4/18/2004
Attribute
Type
Description
Examples
Nominal
mode, entropy,
contingency
correlation, c2 test
Ordinal
hardness of minerals,
{good, better, best},
grades, street numbers
median, percentiles,
rank correlation,
run tests, sign tests
Interval
calendar dates,
temperature in Celsius
or Fahrenheit
mean, standard
deviation, Pearson's
correlation, t and F
tests
temperature in Kelvin,
monetary quantities,
counts, age, mass,
length, electrical
current
geometric mean,
harmonic mean,
percent variation
Ratio
Operations
Attribute
Level
Transformation
Comments
Nominal
Ordinal
Interval
new_value =a * old_value + b
where a and b are constants
An attribute encompassing
the notion of good, better
best can be represented
equally well by the values
{1, 2, 3} or by { 0.5, 1,
10}.
Thus, the Fahrenheit and
Celsius temperature scales
differ in terms of where
their zero value is and the
size of a unit (degree).
Ratio
new_value = a * old_value
Discrete Attribute
Has only a finite or countably infinite set of values
Examples: zip codes, counts, or the set of words in a collection of
documents
Often represented as integer variables.
Note: binary attributes are a special case of discrete attributes
Continuous Attribute
Has real numbers as attribute values
Examples: temperature, height, or weight.
Practically, real values can only be measured and represented
using a finite number of digits.
Continuous attributes are typically represented as floating-point
variables.
Tan,Steinbach, Kumar
4/18/2004
x=
1 n
x = xi
n i =1
n
w x
i =1
n
w
i =1
Middle value if odd number of values, or average of the middle two values
otherwise
Mode
Empirical formula:
Tan,Steinbach, Kumar
4/18/2004
positively skewed
Tan,Steinbach, Kumar
symmetric
negatively skewed
4/18/2004
Variance:
1 n
1 n 2 1 n 2
1 n
1
2
2
2
s =
( xi - x ) =
[ xi - ( xi ) ] s = ( xi - ) =
N i =1
N
n - 1 i =1
n - 1 i =1
n i =1
2
2
x
i
2
i =1
Tan,Steinbach, Kumar
4/18/2004
Box Plot
3.9, 4.1, 4.2, 4.3, 4.3, 4.4, 4.4, 4.4, 4.4, 4.5, 4.5, 4.6, 4.7, 4.8, 4.9, 5.0, 5.1
Q1=(4.3+4.3)/2=4.3
Tan,Steinbach, Kumar
Median = 4.4
Q3=(4.7+4.8)/2=4.75
4/18/2004
Record
Data Matrix
Document Data
Transaction Data
Graph
Molecular Structures
Ordered
Spatial Data
Temporal Data
Sequential Data
Tan,Steinbach, Kumar
4/18/2004
Curse of Dimensionality
Sparsity
u
Resolution
u
Tan,Steinbach, Kumar
4/18/2004
Record Data
Taxable
Income Cheat
Yes
Single
125K
No
No
Married
100K
No
No
Single
70K
No
Yes
Married
120K
No
No
Divorced 95K
Yes
No
Married
No
Yes
Divorced 220K
No
No
Single
85K
Yes
No
Married
75K
No
10
No
Single
90K
Yes
60K
10
Tan,Steinbach, Kumar
4/18/2004
Data Matrix
Projection
of y load
Distance
Load
Thickness
10.23
5.27
15.22
2.7
1.2
12.65
6.25
16.22
2.2
1.1
Tan,Steinbach, Kumar
4/18/2004
Document Data
team
coach
pla
y
ball
score
game
wi
n
lost
timeout
season
Document 1
Document 2
Document 3
Tan,Steinbach, Kumar
4/18/2004
Transaction Data
Items
2
3
4
5
Beer, Bread
Beer, Coke, Diaper, Milk
Beer, Bread, Diaper, Milk
Coke, Diaper, Milk
Tan,Steinbach, Kumar
4/18/2004
Graph Data
2
1
5
2
5
Tan,Steinbach, Kumar
<a href="papers/papers.html#bbbb">
Data Mining </a>
<li>
<a href="papers/papers.html#aaaa">
Graph Partitioning </a>
<li>
<a href="papers/papers.html#aaaa">
Parallel Solution of Sparse Linear System of Equations </a>
<li>
<a href="papers/papers.html#ffff">
N-Body Computation and Dense Linear System Solvers
4/18/2004
Chemical Data
Tan,Steinbach, Kumar
4/18/2004
Ordered Data
Sequences of transactions
Items/Events
An element of
the sequence
Tan,Steinbach, Kumar
4/18/2004
Ordered Data
Tan,Steinbach, Kumar
4/18/2004
Ordered Data
Spatio-Temporal Data
Average Monthly
Temperature of
land and ocean
Tan,Steinbach, Kumar
4/18/2004
Data Quality
What kinds of data quality problems?
How can we detect problems with the data?
What can we do about these problems?
Tan,Steinbach, Kumar
4/18/2004
Noise
Outliers
Tan,Steinbach, Kumar
4/18/2004
Missing Values
Tan,Steinbach, Kumar
4/18/2004
Duplicate Data
Examples:
Same person with multiple email addresses
Data cleaning
Process of dealing with duplicate data issues
Tan,Steinbach, Kumar
4/18/2004
Data Preprocessing
Aggregation
Sampling
Dimensionality Reduction
Feature subset selection
Feature creation
Discretization and Binarization
Attribute Transformation
Tan,Steinbach, Kumar
4/18/2004
Aggregation
Purpose
Data reduction
u
Change of scale
u
Tan,Steinbach, Kumar
4/18/2004
Aggregation
Variation of Precipitation in Australia
Sampling
Tan,Steinbach, Kumar
4/18/2004
Sampling
Tan,Steinbach, Kumar
4/18/2004
Types of Sampling
Stratified sampling
Split the data into several partitions; then draw random samples
from each partition
Tan,Steinbach, Kumar
4/18/2004
Sample Size
8000 points
Tan,Steinbach, Kumar
2000 Points
500 Points
4/18/2004
Sample Size
What
Tan,Steinbach, Kumar
4/18/2004
Curse of Dimensionality
When dimensionality
increases, data becomes
increasingly sparse in the
space that it occupies
Tan,Steinbach, Kumar
4/18/2004
Dimensionality Reduction
Purpose:
Avoid curse of dimensionality
Reduce amount of time and memory required by data
mining algorithms
Allow data to be more easily visualized
May help to eliminate irrelevant features or reduce
noise
Techniques
Principle Component Analysis
Singular Value Decomposition
Others: supervised and non-linear techniques
Tan,Steinbach, Kumar
4/18/2004
x1
Tan,Steinbach, Kumar
4/18/2004
x2
e
x1
Tan,Steinbach, Kumar
4/18/2004
Tan,Steinbach, Kumar
4/18/2004
Tan,Steinbach, Kumar
4/18/2004
Redundant features
duplicate much or all of the information contained in
one or more other attributes
Example: purchase price of a product and the amount
of sales tax paid
Irrelevant features
contain no information that is useful for the data
mining task at hand
Example: students' ID is often irrelevant to the task of
predicting students' GPA
Tan,Steinbach, Kumar
4/18/2004
Techniques:
Brute-force approch:
uTry
Embedded approaches:
Feature selection occurs naturally as part of the data mining
algorithm
u
Filter approaches:
u
Wrapper approaches:
Use the data mining algorithm as a black box to find best subset
of attributes
u
Tan,Steinbach, Kumar
4/18/2004
Feature Creation
domain-specific
combining features
Tan,Steinbach, Kumar
4/18/2004
transform
Wavelet
transform
Tan,Steinbach, Kumar
Frequency
4/18/2004
based approach
Tan,Steinbach, Kumar
4/18/2004
Data
Equal frequency
Tan,Steinbach, Kumar
K-means
Introduction to Data Mining
4/18/2004
Attribute Transformation
Tan,Steinbach, Kumar
4/18/2004
v - minA
v' =
(new _ maxA - new _ minA) + new _ minA
maxA - minA
Ex. Let income range $12,000 to $98,000 normalized to [0.0, 1.0].
Then $73,000 is mapped to 73,600 - 12,000 (1.0 - 0) + 0 = 0.716
98,000 - 12,000
v' =
v - A
73,600 - 54,000
= 1.225
16,000
v
v'= j
10
Tan,Steinbach, Kumar
4/18/2004
Similarity
Numerical measure of how alike two data objects are.
Is higher when objects are more alike.
Often falls in the range [0,1]
Dissimilarity
Numerical measure of how different are two data
objects
Lower when objects are more alike
Minimum dissimilarity is often 0
Upper limit varies
Tan,Steinbach, Kumar
4/18/2004
Tan,Steinbach, Kumar
4/18/2004
Euclidean Distance
Euclidean Distance
dist =
( pk - qk )
k =1
Tan,Steinbach, Kumar
4/18/2004
Euclidean Distance
point
p1
p2
p3
p4
p1
p3
p4
1
p2
0
0
y
2
0
1
1
p1
p1
p2
p3
p4
x
0
2
3
5
0
2.828
3.162
5.099
p2
2.828
0
1.414
3.162
p3
3.162
1.414
0
2
p4
5.099
3.162
2
0
Distance Matrix
Tan,Steinbach, Kumar
4/18/2004
Minkowski Distance
dist = ( | pk - qk
k =1
1
r r
|)
Tan,Steinbach, Kumar
4/18/2004
r = 2. Euclidean distance
Tan,Steinbach, Kumar
4/18/2004
Minkowski Distance
point
p1
p2
p3
p4
x
0
2
3
5
y
2
0
1
1
L1
p1
p2
p3
p4
p1
0
4
4
6
p2
4
0
2
4
p3
4
2
0
2
p4
6
4
2
0
L2
p1
p2
p3
p4
p1
p2
2.828
0
1.414
3.162
p3
3.162
1.414
0
2
p4
5.099
3.162
2
0
L
p1
p2
p3
p4
p1
p2
p3
p4
0
2.828
3.162
5.099
0
2
3
5
2
0
1
3
3
1
0
2
5
3
2
0
Distance Matrix
Tan,Steinbach, Kumar
4/18/2004
Mahalanobis Distance
-1
mahalanobis( p, q) = ( p - q) ( p - q)
S j ,k
1 n
=
( X ij - X j )( X ik - X k )
n - 1 i =1
4/18/2004
Mahalanobis Distance
Covariance Matrix:
0.3 0.2
S=
0
.
2
0
.
3
A: (0.5, 0.5)
B: (0, 1)
A
C: (1.5, 1.5)
Mahal(A,B) = 5
Mahal(A,C) = 4
Tan,Steinbach, Kumar
4/18/2004
2.
3.
Tan,Steinbach, Kumar
4/18/2004
2.
Tan,Steinbach, Kumar
4/18/2004
Tan,Steinbach, Kumar
4/18/2004
Tan,Steinbach, Kumar
4/18/2004
Cosine Similarity
Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
d1 d2= 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0)0.5 = (42) 0.5 = 6.481
||d2|| = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2) 0.5 = (6) 0.5 = 2.245
4/18/2004
Tan,Steinbach, Kumar
4/18/2004
Correlation
Correlation measures the linear relationship
between objects
To compute correlation, we standardize data
objects, p and q, and then take their dot product
4/18/2004
Scatter plots
showing the
similarity from
1 to 1.
Tan,Steinbach, Kumar
4/18/2004
2 (chi-square) test
2
(
Observed
Expected
)
c2 =
Expected
The larger the 2 value, the more likely the variables are
related
Tan,Steinbach, Kumar
4/18/2004
Play chess
Sum (row)
250 (90)
200 (360)
450
50 (210)
1000 (840)
1050
Sum (col.)
300
1200
1500
Tan,Steinbach, Kumar
4/18/2004
Tan,Steinbach, Kumar
4/18/2004
Tan,Steinbach, Kumar
4/18/2004
Density
Examples:
Euclidean density
u
Probability density
Graph-based density
Tan,Steinbach, Kumar
4/18/2004
Tan,Steinbach, Kumar
4/18/2004
Tan,Steinbach, Kumar
4/18/2004