
Data Preprocessing

Outline:
1. Why preprocess the data?
2. Descriptive data summarization
3. Data cleaning
4. Data integration and transformation
5. Data reduction
6. Data discretization and concept hierarchy generation
7. Summary

Why Preprocess the Data?

Data in the real world is often:
- Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  - e.g., occupation=
- Noisy: containing errors or outliers
  - e.g., Salary=-10
- Inconsistent: containing discrepancies in codes or names
  - e.g., rating was 1, 2, 3, now rating A, B, C
  - e.g., discrepancy between duplicate records

Where does dirty data come from?
- Human or computer error at data entry
- Errors in data transmission
- Different considerations between the time the data was collected and the time it is analyzed
- Human/hardware/software problems
- Functional dependency violation (e.g., modifying some linked data)

Why is data preprocessing important?
- Dirty data (e.g., duplicate or missing values) may produce misleading statistics
- Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse

Measures of data quality:
- Accuracy
- Completeness
- Consistency
- Timeliness
- Believability
- Value added
- Interpretability
- Accessibility

Broad categories: intrinsic, contextual, representational, and accessibility

Major tasks in data preprocessing:
1. Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
2. Data integration: integration of multiple databases, data cubes, or files
3. Data transformation: normalization and aggregation
4. Data reduction: obtains a reduced representation in volume that produces the same or similar analytical results
5. Data discretization: part of data reduction, of particular importance especially for numerical data

Fig. 2.1

Descriptive Data Summarization

Motivation: to better understand the data (central tendency, variation, and spread)
Dispersion characteristics: quartiles, outliers, variance
Interquartile range: IQR = Q3 - Q1
A measure can be categorized as:
- Distributive
- Algebraic
- Holistic

Distributive measure: a measure (i.e., function) that can be computed for a given data set by partitioning the data into smaller subsets, computing the measure for each subset, and then merging the results in order to arrive at the measure's value for the original (entire) data set.
Examples: sum(), count(), max(), min()

Algebraic measure: a measure that can be computed by applying an algebraic function to one or more distributive measures.
Examples: average (mean) = sum() / count(), i.e., avg(); weighted average; variance

Holistic measure: a measure that must be computed on the entire data set as a whole. It cannot be computed by partitioning the given data into subsets and merging the values obtained for the measure in each subset. Holistic measures are much more expensive to compute than distributive measures.
Example: median
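As a small illustration of the distinction above, the mean (an algebraic measure) can be assembled from the distributive measures sum() and count() computed per partition. The partition layout below is made up for illustration:

```python
# Computing an algebraic measure (the mean) from distributive measures
# (sum and count) obtained independently on each data partition.

def partial_sums(partitions):
    """Distributive measures sum() and count() per subset."""
    return [(sum(p), len(p)) for p in partitions]

def mean_from_partials(partials):
    """Merge the partial results: an algebraic function of distributive measures."""
    total = sum(s for s, _ in partials)
    n = sum(c for _, c in partials)
    return total / n

partitions = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
m = mean_from_partials(partial_sums(partitions))

# Same value as a single pass over the full data set:
flat = [x for p in partitions for x in p]
assert abs(m - sum(flat) / len(flat)) < 1e-12
```

The median, being holistic, admits no such merge: the per-partition medians alone do not determine the global median.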

Mean: an algebraic measure

  sample mean:     \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i    (2.1)

  weighted mean:   \bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}    (2.2)

  population mean: \mu = \frac{\sum x}{N}

Trimmed mean: chop off extreme values (e.g., 2%) before averaging
In SQL: avg()

Median: a holistic measure
The middle value if the number of values is odd, or the average of the middle two values otherwise
Estimated by interpolation for grouped data:

  median = L_1 + \frac{N/2 - (\sum freq)_l}{freq_{median}} \times width    (2.3)

where
  L_1 is the lower boundary of the median interval,
  N is the number of values in the entire data set,
  (\sum freq)_l is the sum of the frequencies of all of the intervals that are lower than the median interval,
  freq_{median} is the frequency of the median interval, and
  width is the width of the median interval.
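Formula (2.3) can be sketched in a few lines; the interval layout below is a made-up example, not from the slides:

```python
# Grouped-data median by interpolation (formula 2.3).

def grouped_median(intervals):
    """intervals: list of (lower_bound, width, freq), sorted ascending."""
    n = sum(f for _, _, f in intervals)
    cum = 0  # running sum of frequencies below the current interval
    for lower, width, freq in intervals:
        if cum + freq >= n / 2:  # this is the median interval
            return lower + (n / 2 - cum) / freq * width
        cum += freq

# e.g., values grouped as [0, 20): 3, [20, 40): 5, [40, 60): 2
m = grouped_median([(0, 20, 3), (20, 20, 5), (40, 20, 2)])
print(m)  # 28.0
```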

Mode: the value that occurs most frequently in the data
Empirical formula for moderately skewed unimodal data:

  mean - mode \approx 3 \times (mean - median)    (2.4)

Symmetric, positively skewed, and negatively skewed data differ in the relative positions of the mean, median, and mode (Fig. 2.2).

Boxplot: the ends of the box are the quartiles, the median is marked, whiskers extend from the box, and outliers are plotted individually.

Variance and standard deviation:

  sample:      s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2
                   = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]

  population:  \sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2
                        = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2

Derivation of the computational form of the sample variance, starting from the definition s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}:

  s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i^2 - 2 x_i \bar{x} + \bar{x}^2\right)
      = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - 2\bar{x}\sum_{i=1}^{n} x_i + n\,\bar{x}^2\right]
      = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{2\left(\sum_{i=1}^{n} x_i\right)^2}{n} + \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right]
      = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]
      = \frac{\sum_{i=1}^{n} x_i^2}{n-1} - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n(n-1)}
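The derivation can be checked numerically: both forms of s^2 agree on any sample. The data below is arbitrary:

```python
# Check that the one-pass computational form of the sample variance equals
# the definitional two-pass form.

def var_definitional(xs):
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

def var_computational(xs):
    n = len(xs)
    sum_sq = sum(x * x for x in xs)   # sum of squares (one pass)
    total = sum(xs)                   # plain sum (same pass)
    return sum_sq / (n - 1) - total * total / (n * (n - 1))

data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
assert abs(var_definitional(data) - var_computational(data)) < 1e-9
```

The computational form needs only sum(x) and sum(x^2), both distributive, which is why the variance is an algebraic measure.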

Properties of the normal distribution curve (\mu: mean, \sigma: standard deviation):
- From \mu-\sigma to \mu+\sigma: contains about 68% of the measurements
- From \mu-2\sigma to \mu+2\sigma: contains about 95% of it
- From \mu-3\sigma to \mu+3\sigma: contains about 99.7% of it

Boxplot Analysis

Five-number summary of a distribution: {Minimum, Q1, Median, Q3, Maximum}
Boxplot:
- The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR
- The median is marked by a line within the box
- Whiskers: two lines outside the box extend to Minimum and Maximum

Example (Fig. 2.3): boxplot of unit prices at Branch 1: Median $80, Q1 $60, Q3 $100; outliers are plotted individually beyond the whiskers.
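A minimal sketch of the five-number summary and the common 1.5 * IQR outlier fence used with boxplots. Quartile conventions vary; this uses a simple median-split rule, and the price list is made up:

```python
# Five-number summary {Minimum, Q1, Median, Q3, Maximum} and IQR-based outliers.

def median(xs):
    xs = sorted(xs)
    n = len(xs)
    mid = n // 2
    return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2

def five_number_summary(xs):
    xs = sorted(xs)
    n = len(xs)
    lower, upper = xs[: n // 2], xs[(n + 1) // 2:]  # median-split quartiles
    return min(xs), median(lower), median(xs), median(upper), max(xs)

def outliers(xs):
    _, q1, _, q3, _ = five_number_summary(xs)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in xs if x < lo or x > hi]

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34, 120]
print(five_number_summary(prices))  # (4, 12.0, 24, 28.5, 120)
print(outliers(prices))             # [120]
```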

Basic Descriptive Data Summaries (graphic displays):
- Histogram / bar chart
- Quantile plot
- Quantile-quantile plot (q-q plot)
- Scatter plot
- Loess plot (Loess = local regression)

Histogram Analysis
Frequency histograms display the frequencies of the classes present in the given data.
http://www.shodor.org/interactivate/discussions/UnivariateBivariate/

Quantile Plot
Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences).
Plots quantile information: for data x_i sorted in increasing order, f_i indicates that approximately 100*f_i% of the data are below or equal to the value x_i.

Quantile-Quantile (Q-Q) Plot
Graphs the quantiles of one univariate distribution against the corresponding quantiles of another.
Allows the user to view whether there is a shift in going from one distribution to another.

Q-Q plots in R:

  y <- rt(200, df = 5)        # sample data (assumed setup line; lost in extraction)
  qqnorm(y)                   # normal quantile-quantile plot
  qqline(y, col = 2)          # reference line
  qqplot(y, rt(300, df = 5))  # Q-Q plot against a second sample

(Figure: Sample Quantiles plotted against Theoretical Quantiles.)

Scatter plot
Provides a first look at bivariate data to see clusters of points, outliers, etc.
Each pair of values is treated as a pair of coordinates and plotted as points in the plane.

Loess Curve
Adds a smooth curve to a scatter plot in order to provide better perception of the pattern of dependence.
A loess curve is fitted by setting two parameters: a smoothing parameter, and the degree of the polynomials that are fitted by the regression.

Summary of graphic data description methods:
- Boxplot: (covered before)
- Quantile plot: each value x_i is paired with f_i, indicating that approximately 100*f_i% of the data are <= x_i
- Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
- Scatter plot: each pair of values is a pair of coordinates, plotted as points in the plane
- Loess (local regression) curve: adds a smooth curve to a scatter plot to provide better perception of the pattern of dependence

Data Cleaning

Importance (p. 61):
"Data cleaning is one of the three biggest problems in data warehousing" (Ralph Kimball)
"Data cleaning is the number one problem in data warehousing" (DCI survey)

Data cleaning tasks: handle missing values, smooth noisy data, and correct inconsistent data (covered in the following slides).

Missing data: many tuples have no recorded value for several attributes, such as customer income in sales data. Missing data may be due to:
- equipment malfunction
- data not entered at the time of entry
- failure to register history or changes of the data

How to handle missing data:
- Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
- Fill in with the attribute mean for all samples belonging to the same class: smarter
- Fill in the most probable value: inference-based, e.g., a Bayesian formula or a decision tree

Noisy Data

Incorrect attribute values may be due to:
- faulty data collection instruments
- data entry problems
- data transmission problems
- technology limitations
- inconsistency in naming conventions

Other data problems which require data cleaning:
- duplicate records
- incomplete data
- inconsistent data

How to handle noisy data:
- Binning: first sort the data and partition it into (equal-frequency) bins; then smooth by bin means, by bin medians, by bin boundaries, etc.
- Regression: smooth by fitting the data to regression functions
- Clustering: detect and remove outliers
- Combined computer and human inspection: detect suspicious values and check by human (e.g., deal with possible outliers)

Simple discretization methods: binning
- Equal-width (distance) partitioning: if A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A)/N
- Equal-depth (frequency) partitioning: each interval contains approximately the same number of samples

Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

* Partition into equal-frequency (equal-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries (each value is replaced by the closer of its bin's boundaries):
  - Bin 1: 4, 4, 4, 15 (boundaries 4 and 15; 8 and 9 are closer to 4)
  - Bin 2: 21, 21, 25, 25 (boundaries 21 and 25; 24 is closer to 25)
  - Bin 3: 26, 26, 26, 34 (boundaries 26 and 34; 28 and 29 are closer to 26)
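The binning example above can be sketched as:

```python
# Equal-frequency binning with smoothing by bin means and by bin boundaries,
# reproducing the price example.

def equal_freq_bins(xs, n_bins):
    xs = sorted(xs)
    size = len(xs) // n_bins
    return [xs[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    # every value in a bin is replaced by the (rounded) bin mean
    return [[round(sum(b) / len(b)) for _ in b] for b in bins]

def smooth_by_boundaries(bins):
    # every value is replaced by the nearer of its bin's min/max boundary
    return [[min(b) if x - min(b) <= max(b) - x else max(b) for x in b]
            for b in bins]

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_freq_bins(prices, 3)
print(smooth_by_means(bins))       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```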

Regression (figure: data points smoothed by fitting the line y = x + 1)

Cluster Analysis (figure: values organized into clusters; points outside the clusters are potential outliers)

Discrepancy detection:
- Use metadata (e.g., domain, range, dependency, distribution)
- Check field overloading
- Check the uniqueness rule, consecutive rule, and null rule
- Use commercial tools:
  - Data scrubbing: use simple domain knowledge (e.g., postal codes, spell-check) to detect errors and make corrections
  - Data auditing: analyze the data to discover rules and relationships and to detect violators (e.g., correlation and clustering to find outliers)

- Data migration tools: allow transformations to be specified
- ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface
- Integration of the two processes: iterative and interactive (e.g., Potter's Wheel, http://control.cs.berkeley.edu/abc/)

Data Integration and Transformation

2.4.1 Data integration

Data integration: combines data from multiple sources into a coherent store
Schema integration: e.g., A.cust-id and B.cust-number; integrate metadata from different sources
Entity identification problem: identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
Detecting and resolving data value conflicts: for the same real-world entity, attribute values from different sources are different. Possible reasons: different representations, different scales, e.g., metric vs. British units

Handling redundancy in data integration
Redundant data often occur when integrating multiple databases:
- The same attribute or object may have different names in different databases
- One attribute may be a derived attribute in another table, e.g., annual revenue
Redundant attributes may be detected by correlation analysis.
Careful integration can help reduce/avoid redundancies and inconsistencies and improve mining speed and quality.

Correlation analysis (numerical data): correlation coefficient (Pearson's product-moment coefficient)

  r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B}
          = \frac{\sum_{i=1}^{n}(a_i b_i) - n\,\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}

where n is the number of tuples, \bar{A} and \bar{B} are the respective means of A and B, \sigma_A and \sigma_B are the respective standard deviations of A and B, and \sum(a_i b_i) is the sum of the AB cross-product.

If r_{A,B} > 0: A and B are positively correlated (A's values increase as B's do); the higher the value, the stronger the correlation.
If r_{A,B} = 0: independent; if r_{A,B} < 0: negatively correlated.
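A small sketch of r_{A,B} using the cross-product form above; the two attribute vectors are made up:

```python
# Pearson correlation coefficient via the cross-product form:
# r = (sum(a_i*b_i) - n*mean_a*mean_b) / ((n-1)*sd_a*sd_b)

def pearson_r(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    # sample standard deviations (n - 1 in the denominator)
    sd_a = (sum((x - mean_a) ** 2 for x in a) / (n - 1)) ** 0.5
    sd_b = (sum((y - mean_b) ** 2 for y in b) / (n - 1)) ** 0.5
    cross = sum(x * y for x, y in zip(a, b)) - n * mean_a * mean_b
    return cross / ((n - 1) * sd_a * sd_b)

a = [1, 2, 3, 4, 5]
b = [2, 4, 6, 8, 10]   # exactly b = 2a, so r should be 1
print(pearson_r(a, b))  # 1.0
```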

Correlation analysis (categorical data): \chi^2 (chi-square) test

  \chi^2 = \sum \frac{(Observed - Expected)^2}{Expected}

The larger the \chi^2 value, the more likely the variables are related.
The cells that contribute the most to the \chi^2 value are those whose actual count is very different from the expected count.
Correlation does not imply causality.

Contingency table (Table 2.2; expected counts in parentheses, calculated based on the data distribution in the two categories):

                 Male        Female      Total
  Fiction        250 (90)    200 (360)    450
  Non_fiction     50 (210)  1000 (840)   1050
  Total          300        1200         1500

Expected counts, e.g.: 300 * 450 / 1500 = 90, and 1200 * 1050 / 1500 = 840

  \chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93

Since \chi^2 = 507.93 > \chi^2_{0.001, df=1} = 10.828, it shows that like_science_fiction and play_chess are correlated in the group.
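The table's chi-square value can be reproduced with a short sketch, deriving expected counts from row and column totals:

```python
# Chi-square statistic for a contingency table of observed counts.

def chi_square(table):
    """table: list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2

observed = [[250, 200],
            [50, 1000]]
print(chi_square(observed))  # about 507.93, matching the slide
```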

Data Transformation
- Smoothing: remove noise from data (binning, regression, clustering)
- Aggregation: summarization, data cube construction
- Generalization: concept hierarchy climbing
- Normalization: scaled to fall within a small, specified range (min-max, z-score, decimal scaling)
- Attribute/feature construction: new attributes constructed from the given ones

Normalization:

1. Min-max normalization to [new_min_A, new_max_A]:

   v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A    (2.11)

   Example: let income range $12,000 to $98,000 be normalized to [0.0, 1.0].
   Then $73,600 is mapped to \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0) + 0 = 0.716

2. Z-score normalization (\mu_A: mean, \sigma_A: standard deviation):

   v' = \frac{v - \mu_A}{\sigma_A}    (2.12)

   Example: let \mu_A = 54{,}000 and \sigma_A = 16{,}000. Then \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225

3. Normalization by decimal scaling:

   v' = \frac{v}{10^j}, where j is the smallest integer such that max(|v'|) < 1    (2.13)
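The three normalizations can be sketched as follows, reproducing the slide's income examples; the decimal-scaling values are a separate made-up illustration:

```python
# Min-max (2.11), z-score (2.12), and decimal-scaling (2.13) normalization.

def min_max(v, lo, hi, new_lo=0.0, new_hi=1.0):
    return (v - lo) / (hi - lo) * (new_hi - new_lo) + new_lo

def z_score(v, mean, std):
    return (v - mean) / std

def decimal_scaling(values):
    j = 0  # smallest j such that max(|v| / 10^j) < 1
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

assert abs(min_max(73_600, 12_000, 98_000) - 0.716) < 1e-3   # the slide's 0.716
assert abs(z_score(73_600, 54_000, 16_000) - 1.225) < 1e-9   # the slide's 1.225
print(decimal_scaling([-986, 917]))  # [-0.986, 0.917]
```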

Data Reduction

Why data reduction?
- A database/data warehouse may store terabytes of data
- Complex data analysis/mining may take a very long time to run on the complete data set

Data reduction: obtain a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results

Data reduction strategies:
1. Data cube aggregation
2. Attribute subset selection: e.g., remove unimportant attributes
3. Dimensionality reduction: e.g., encoding, data compression
4. Numerosity reduction: e.g., fit data into models
5. Discretization and concept hierarchy generation

Data cube aggregation:
- Apex cuboid: the highest-level summarization
- Use the smallest representation which is enough to solve the task
- Queries regarding aggregated information should be answered using the data cube, when possible

Attribute subset selection:
- Select a minimum set of features such that the probability distribution of the different classes, given the values for those features, is as close as possible to the original distribution given the values of all features
- Fewer attributes appear in the discovered patterns, making the patterns easier to understand
- Heuristic methods (due to the exponential number of choices):
  1. Stepwise forward selection
  2. Stepwise backward elimination
  3. Combining forward selection and backward elimination
  4. Decision-tree induction

Heuristic methods:
1. Stepwise forward selection: the procedure starts with an empty set of attributes as the reduced set. The best of the original attributes is determined and added to the reduced set. At each subsequent iteration or step, the best of the remaining original attributes is added to the set.
2. Stepwise backward elimination: the procedure starts with the full set of attributes. At each step, it removes the worst attribute remaining in the set.

3. Combining forward selection and backward elimination: the two methods can be combined so that, at each step, the procedure selects the best attribute and removes the worst from among the remaining attributes.
4. Decision-tree induction: originally intended for classification. Decision tree induction constructs a flowchart-like structure where each internal (nonleaf) node denotes a test on an attribute, each branch corresponds to an outcome of the test, and each external (leaf) node denotes a class prediction. At each node, the algorithm chooses the best attribute to partition the data into individual classes.

Fig. 2.15


Data Compression (figure): the original data is either compressed losslessly, so it can be reconstructed exactly, or lossily, so only an approximation of the original data can be reconstructed.

Data Compression
- String compression: there are extensive theories and well-tuned algorithms; typically lossless, but only limited manipulation is possible without expansion
- Audio/video compression: typically lossy compression, with progressive refinement; sometimes small fragments of the signal can be reconstructed without reconstructing the whole

Wavelet Transformation
- A linear signal processing technique suitable for multi-resolutional analysis
- Compressed approximation: store only the strongest of the wavelet coefficients (data sparsity)
- Similar to the discrete Fourier transform (DFT), but better lossy compression, localized in space
- Computed with a hierarchical pyramid algorithm (padding the data length to a power of 2 when necessary)

Principal Component Analysis (PCA)
Find orthogonal vectors (principal components) that can be best used to represent the data.
Steps:
- Normalize the input data: each attribute falls within the same range
- Compute k orthonormal (unit) vectors, i.e., principal components
- Each input data (vector) is a linear combination of the k principal component vectors
- The principal components are sorted in order of decreasing significance or strength
- Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance (using the strongest principal components, it is possible to reconstruct a good approximation of the original data)
Works for numeric data only; used when the number of dimensions is large.

(Figure: principal components Y1 and Y2 of data plotted in the X1-X2 plane.)
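A minimal PCA sketch following the steps above, in pure Python for 2-D data. The sample points and the power-iteration approach are illustrative choices, not from the slides:

```python
# PCA sketch for 2-D data: center the data, build the covariance matrix, and
# find the strongest principal component by power iteration.

def principal_component(points, iters=100):
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    centered = [(x - mx, y - my) for x, y in points]  # step 1: normalize (center)
    # 2x2 sample covariance matrix
    cxx = sum(x * x for x, _ in centered) / (n - 1)
    cyy = sum(y * y for _, y in centered) / (n - 1)
    cxy = sum(x * y for x, y in centered) / (n - 1)
    # power iteration converges to the eigenvector of the largest eigenvalue,
    # i.e., the direction of greatest variance
    vx, vy = 1.0, 0.0
    for _ in range(iters):
        wx = cxx * vx + cxy * vy
        wy = cxy * vx + cyy * vy
        norm = (wx * wx + wy * wy) ** 0.5
        vx, vy = wx / norm, wy / norm
    return vx, vy

# points lying near the line y = x: the first component should be ~(0.707, 0.707)
pts = [(0, 0.1), (1, 0.9), (2, 2.2), (3, 2.9), (4, 4.1)]
vx, vy = principal_component(pts)
print(round(vx, 2), round(vy, 2))  # 0.71 0.71
```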

Numerosity Reduction
Reduce data volume by choosing alternative, smaller forms of data representation.
- Parametric methods: assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers). Example: log-linear models
- Non-parametric methods: do not assume models. Major families: histograms, clustering, sampling

Regression and Log-Linear Models
- Linear regression: data are modeled to fit a straight line
- Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
- Log-linear model: approximates discrete multidimensional probability distributions

- Linear regression: y = wx + b. The two regression coefficients, w and b, specify the line and are estimated by using the data at hand, applying the least-squares criterion to the known values of Y1, Y2, ..., X1, X2, ....
- Multiple regression: Y = b0 + b1 X1 + b2 X2. Many nonlinear functions can be transformed into the above.
- Log-linear models: the multi-way table of joint probabilities is approximated by a product of lower-order tables.
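The least-squares estimation of w and b can be sketched as follows; the sample data is made up:

```python
# Least-squares fit of y = w*x + b:
#   w = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2),  b = mean_y - w*mean_x

def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - w * mx
    return w, b

xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]          # exactly y = 2x + 1
w, b = fit_line(xs, ys)
print(w, b)  # 2.0 1.0
```

Only the two coefficients need to be stored, which is the numerosity-reduction point: the model parameters stand in for the data.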

Histograms
Divide the data into buckets and store the average (sum) for each bucket.
Partitioning rules:
- Minimize the histogram variance (a weighted sum over the original values that each bucket represents)
- MaxDiff: set a bucket boundary between each pair of adjacent values among the pairs having the largest differences
(Fig. 2.19)

Clustering
Partition data objects into groups or clusters, so that objects within a cluster are similar to one another and dissimilar to objects in other clusters.
Partition the data set into clusters based on similarity, and store the cluster representation only.
Quality measures:
- Diameter
- Centroid distance
Cluster representations can also be stored in hierarchical index tree structures.

Sampling
Obtain a small sample to represent the whole data set N.
Allow a mining algorithm to run in complexity that is potentially sublinear to the size of the data.
Choose a representative subset of the data: simple random sampling may perform poorly in the presence of skew.
Stratified sampling: approximate the percentage of each class (or subpopulation of interest) in the overall database; used in conjunction with skewed data.

(Figure: SRSWOR, simple random sampling without replacement, and SRSWR, simple random sampling with replacement, applied to the raw data.)

(Figure: cluster/stratified sample of the raw data.)

Data Discretization and Concept Hierarchy Generation

Discretization and concept hierarchy generation apply to nominal, ordinal (e.g., rank), and continuous attributes.

Discretization: dividing the range of the attribute into intervals.

Concept hierarchy generation: recursively replace low-level concepts (such as numeric values for age) by higher-level concepts (such as young, middle-aged, or senior).

Concept Hierarchy Generation for Numeric Data

Entropy-Based Discretization (p. 89)
Based on information theory and the concept of information gain; entropy-based discretization is a supervised, top-down splitting technique.
Suppose we want to classify the tuples in D by partitioning on attribute A and some split-point. Ideally, we would like this partitioning to result in an exact classification of the tuples. For example, if we had two classes, we would hope that all of the tuples of, say, class C1 will fall into one partition, and all of the tuples of class C2 into the other. However, this is unlikely: the first partition may contain many tuples of C1, but also some of C2.

Entropy (cont.)
How much more information is still needed for an exact classification after this partitioning? This amount is called the expected information requirement for classifying a tuple in D based on partitioning by A.
To discretize, we want to pick the split-point that gives the minimum expected information requirement (i.e., min(Info_A(D))). This would result in the minimum amount of expected information (still) required to perfectly classify the tuples after partitioning.
The process is applied recursively until some stopping criterion is met. Such a boundary may reduce data size and improve classification accuracy.
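A sketch of picking the split-point with the minimum expected information requirement; the labeled values below are made up:

```python
# Entropy-based choice of a split point for a numeric attribute:
# evaluate candidate split points and keep the one minimizing Info_A(D).
from math import log2

def entropy(labels):
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

def info_after_split(data, split):
    """data: list of (value, class_label). Weighted entropy of both partitions."""
    left = [c for v, c in data if v <= split]
    right = [c for v, c in data if v > split]
    n = len(data)
    return len(left) / n * entropy(left) + len(right) / n * entropy(right)

def best_split(data):
    values = sorted({v for v, _ in data})
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return min(candidates, key=lambda s: info_after_split(data, s))

data = [(1, 'C1'), (2, 'C1'), (3, 'C1'), (8, 'C2'), (9, 'C2'), (10, 'C2')]
print(best_split(data))  # 5.5, which separates the two classes exactly
```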

Interval merging by chi-square analysis
Merge: find the best neighboring intervals and merge them to form larger intervals, recursively.
ChiMerge [Kerber, AAAI 1992; see also Liu et al., DMKD 2002]:
- Initially, each distinct value is considered to be one interval
- Adjacent intervals with low \chi^2 values are merged, since low \chi^2 values for a pair indicate similar class distributions
- Merging proceeds recursively until a predefined stopping criterion is met (such as significance level, max-interval, max inconsistency, etc.)

Segmentation by natural partitioning: the 3-4-5 rule segments numeric data into relatively uniform, natural intervals.
- If an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, partition the range into 3 equi-width intervals
- If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals
- If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals

Example of the 3-4-5 rule (profit data):
- Step 1: Min = -$351, Max = $4,700; Low (5th percentile) = -$159, High (95th percentile) = $1,838
- Step 2: msd = 1,000, so Low is rounded to -$1,000 and High to $2,000
- Step 3: (-$1,000 - $2,000) covers 3 distinct values at the msd, so partition into 3 equi-width intervals: (-$1,000 - 0), (0 - $1,000), ($1,000 - $2,000)
- Step 4: adjust the boundaries to cover Min and Max, giving (-$400 - $5,000) overall, then refine each interval:
  (-$400 - 0): (-$400 - -$300), (-$300 - -$200), (-$200 - -$100), (-$100 - 0)
  (0 - $1,000): (0 - $200), ($200 - $400), ($400 - $600), ($600 - $800), ($800 - $1,000)
  ($1,000 - $2,000): ($1,000 - $1,200), ($1,200 - $1,400), ($1,400 - $1,600), ($1,600 - $1,800), ($1,800 - $2,000)
  ($2,000 - $5,000): ($2,000 - $3,000), ($3,000 - $4,000), ($4,000 - $5,000)

Concept hierarchy generation for categorical data:
- Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts
- Specification of a hierarchy for a set of values by explicit data grouping
- Automatic generation of hierarchies by the analysis of the number of distinct values

Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set: the attribute with the most distinct values is placed at the lowest level of the hierarchy. There are exceptions, e.g., weekday, month, quarter, year.

(Fig. 2.24: country, 15 distinct values, at the top; then province_or_state, city, and street at the lowest level.)


2.7 Summary

Data preparation is a big issue for both data warehousing and data mining. Descriptive data summarization is needed for quality data preprocessing. Data preparation includes data cleaning, data integration, data transformation, data reduction, and discretization. Many methods have been developed, but data preprocessing is still an active area of research.
