You may not use these slides for tutorials, or in a published work (tech report/
conference paper/ thesis/ journal etc). If you wish to do this, email me first, it is
highly likely I will grant you permission.
Outline of Tutorial II
In both shape and time series, we consider:
Novelty detection (finding unusual shapes or subsequences)
Motif discovery (finding repeated shapes or subsequences)
Clustering
Classification
Indexing
Visualizing massive datasets
Open problems to solve
Summary, Conclusions

The Ubiquity of Shape
butterflies, fish, petroglyphs, arrowheads, fruit fly wings, lizards, nematodes, yeast cells, faces, historical manuscripts
[Figure: Drosophila melanogaster]

... problems we would like to be able to solve, then later we will see the necessary tools to solve them
[Figure: time series labeled "Lance Armstrong", 2000 to 2002, with one peak marked "?"; second plot with x-axis 0 to 450]
All our Experiments are Reproducible!

Example 1: Join
Given two data collections, link items occurring in each
[Figure: Limenitis archippus (Limenitidinae) and Danaus plexippus (Danainae); x-axis -2 to 2]
.. so similar in coloration that I will put them both to one*
Example 2: Annotation Given an object of interest, automatically
obtain additional information about it.
.. they would
strike the subtlest
minds with awe*
*Purgatorio -- Canto XII 6
Example 4: Clustering
Given an unlabeled dataset, arrange them into groups by their mutual similarity
There is a special reason why this tree is so tall and inverted*
*Purgatorio -- Canto XXXIII 64
[Figure: clustering tree of reptile skulls; labels: Basal, Iguania, Alligatoridae, Articulate, Crocodylidae, Alligatorinae, Amphisbaenia, Chelonia, Phrynosoma braconnieri]

Example 5: Classification
Given a labeled training set, classify future unlabeled examples
What type of arrowhead is this?
For he is well placed among the fools who does not distinguish one class from another*
*Paradiso -- Canto XIII 115
[Figure: Specimen 20,773; Blythe, California; Baker California]
*Purgatorio -- Canto X 127
*Inferno -- Canto XIX 15
Example(s) 8: Human Motion
The two of us walked on that road*
Join
Annotation
Query-by-Content
Clustering
[Figure labels: Texas Duran Arrowhead]

Two Kinds of Shape Matching
rigid
flexible
Key Ideas: Convert shape to graph/tree
Use graph/tree edit distance to measure similarity
Landmarking
Best Rotation Alignment
Rotation invariant features

Generic Landmarking
Find the major axis of the shape and use that as the canonical alignment
Domain Specific Landmarking
Find some fixed point in your domain, e.g. the nose on a face, the stem of a leaf, the tail of a fish
The only problem with landmarking is that it does not work
[Figure: shapes A, B, C under Generic Landmark Alignment and Best Rotation Alignment]

Rotation invariant features
Possibilities include: ratio of perimeter to area, fractal measures, elongatedness, circularity, min/max/mean curvature, entropy, perimeter of convex hull, aspect ratio and histograms
The problem with rotation invariant features is that in throwing away rotation information, you must invariably throw away useful information
[Figure: clusterings of primate skulls; labels: Red Howler Monkey, Owl Monkey (species unknown), Northern Gray-Necked Owl Monkey, Orangutan, Orangutan (juvenile), Borneo Orangutan, Mantled Howler Monkey; panel titles: Generic Landmark Alignment, Best Rotation Alignment, Histogram]
The easy way to achieve rotation invariance is to hold one time series Q fixed, and compare it to every circular shift of the other time series C. The shifts are the rows of the matrix C:

$$\mathbf{C} = \begin{Bmatrix} c_1, & c_2, & \ldots, & c_{n-1}, & c_n \\ c_2, & \ldots, & c_{n-1}, & c_n, & c_1 \\ \vdots & & & & \vdots \\ c_n, & c_1, & c_2, & \ldots, & c_{n-1} \end{Bmatrix}$$

algorithm: [dist] = Test_All_Rotations(Q,C)
  dist = infinity
  for j = 1 to n
    TempDistance = Some_Dist_Function(Q, Cj)
    if TempDistance < dist
      dist = TempDistance;
    end;
  end;
  return[dist]

The strategy of testing all possible rotations is very very slow
People have suggested various tricks for speedup, like only testing 1 in 5 of the rotations
However there now exists a simple, exact, ultrafast, indexable way to do this*
It sucks being a grad student
*VLDB06: LB_Keogh Supports Exact Indexing of Shapes under Rotation Invariance with Arbitrary Representations and Distance Measures.
The need for rotation invariance shows up in real time series, as in these Star Light Curves

$$\mathbf{C} = \begin{Bmatrix} c_1, & c_2, & \ldots, & c_{n-1}, & c_n \\ c_2, & \ldots, & c_{n-1}, & c_n, & c_1 \\ \vdots & & & & \vdots \\ c_n, & c_1, & c_2, & \ldots, & c_{n-1} \end{Bmatrix}$$

I saw above a million burning lamps, A Sun kindled every one of them, as our sun lights the stars we glimpse on high*
*The Paradiso -- Canto XXIII 28-30

Shape Distance Measures
Speak to me of the useful distance measures
There are but three:
Euclidean Distance
Dynamic Time Warping
Longest Common Subsequence
[Figure: DTW Alignment versus Euclidean Distance for two primate skulls: Mountain Gorilla (Gorilla gorilla beringei) and Red Howler Monkey (Alouatta seniculus seniculus)]
Is man an ape or an angel?
Euclidean Distance Metric
Given two time series Q = q_1 ... q_n and C = c_1 ... c_n, the Euclidean distance between them is defined as:

$$D(Q, C) \equiv \sqrt{\sum_{i=1}^{n} (q_i - c_i)^2}$$

I notice that you Z-normalized the time series first
The next slide shows a useful optimization
[Figure: time series Q and C, x-axis 0 to 100]

Matching skulls is an important problem
LCSS can deal with missing or occluded parts
[Figure: skulls A, B and C with DTW Alignment and LCSS Alignment; annotation: This region will not be matched]
The famous Skhul V is generally reproduced with the missing bones extrapolated in epoxy (A), however the original Skhul V (B) is missing the nose region, which means it will match to a modern human (C) poorly, even after DTW alignment (inset). In contrast, LCSS alignment will not attempt to match features that are outside a matching envelope (heavy gray line) created from the other sequence.

... corresponding data points exceeds r^2, we can safely abandon the calculation
I see, because incremental value is always a lower bound to the final value, once it is greater than the best-so-far, we may as well abandon
Abandon all hope ye who enter here

DTW
[Figure: time series Q and C with warping path w]

$$DTW(Q, C) = \min\left\{ \frac{\sqrt{\sum_{k=1}^{K} w_k}}{K} \right\}$$

This recursive function gives us the minimum cost path

$$\gamma(i,j) = d(q_i, c_j) + \min\{\gamma(i-1,j-1),\ \gamma(i-1,j),\ \gamma(i,j-1)\}$$
Dynamic Time Warping II
There is an important trick to improve accuracy and speed
[Figure: warping path constrained to a window of width r]
This constrained warping, together with a lower bounding trick called LB_Keogh can make DTW thousands of times faster! But don't take my word for it...
"LB_Keogh is fast, because it cleverly exploits global constraints" Christos Faloutsos, PODS 2005

Tests on many diverse datasets
[Figure: Accuracy (90 to 100) against value of r (1 to 100) for FACE (2%) and GUNX (3%)]

and I recognized
Leaf of mine, in whom I found pleasure
as a fish dives through water
the shape of that cold animal which stings and lashes people with its tail *

See the following for more information about constrained warping:
Xi, Keogh, Shelton, Wei & Ratanamahatana (2006). Fast Time Series Classification Using Numerosity Reduction. ICML
Ratanamahatana and Keogh. (2004). Everything you know about Dynamic Time Warping is Wrong.
*Purgatorio -- Canto IX 5, Purgatorio -- Canto XXIII, Purgatorio -- Canto XXVI, Paradiso -- Canto XV 88
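To make the constrained warping and the lower bound concrete, here is a hedged Python sketch. It assumes equal-length series, a warping window given as a number of points r, squared differences as the cell cost, and DTW reported as the square root of the accumulated cost. The function names are hypothetical, and the envelope construction follows the usual description of LB_Keogh rather than any code from the tutorial.

```python
import math

def envelope(c, r):
    """Upper and lower envelope of c within a window of r points each side."""
    n = len(c)
    upper, lower = [], []
    for i in range(n):
        window = c[max(0, i - r): min(n, i + r + 1)]
        upper.append(max(window))
        lower.append(min(window))
    return upper, lower

def lb_keogh(q, c, r):
    """Lower bound on dtw_band(q, c, r): only the parts of q that fall
    outside the envelope of c contribute."""
    upper, lower = envelope(c, r)
    total = 0.0
    for qi, u, l in zip(q, upper, lower):
        if qi > u:
            total += (qi - u) ** 2
        elif qi < l:
            total += (qi - l) ** 2
    return math.sqrt(total)

def dtw_band(q, c, r):
    """DTW restricted to cells with |i - j| <= r (equal lengths assumed)."""
    n = len(q)
    g = [[math.inf] * (n + 1) for _ in range(n + 1)]
    g[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - r), min(n, i + r) + 1):
            cost = (q[i - 1] - c[j - 1]) ** 2
            g[i][j] = cost + min(g[i - 1][j - 1], g[i - 1][j], g[i][j - 1])
    return math.sqrt(g[n][n])
```

In a nearest neighbor search, lb_keogh would be computed first and the full dtw_band call skipped whenever the bound already exceeds the best-so-far distance. With r = 0 both functions reduce to the Euclidean distance.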
[Table rows, column headers not recoverable]
Plane | 7 | 210 | 0.95 | 0.0{3} | 0.55 Markov Descriptor
Fish | 7 | 350 | 11.43 | 9.71{1} | 36.0 Fourier / Power Cepstrum
Note that DTW is sometimes worth the little extra effort
*Purgatorio -- Canto XXIV 117

[Figure: clustering of primate skulls, with the following annotations]
All these are in the genus Cercopithecus, except for the skull identified as being either a Vervet or Green monkey, both of which belong in the Genus of Chlorocebus which is in the same Tribe (Cercopithecini) as Cercopithecus.
These are the same species Bunopithecus hooloc (Hoolock Gibbon)
All these are in the tribe Papionini
These are in the Genus Pongo
All these are in the family Cebidae
These are in the family Lemuridae
These are in the genus Alouatta
These are in the same species Homo sapiens (Humans)

Tribe Papionini
  Genus Papio: baboons
  Genus Mandrillus: Mandrill
Tribe Cercopithecini
  Cercopithecus
    De Brazza's Monkey, Cercopithecus neglectus
    Mustached Guenon, Cercopithecus cephus
    Red-tailed Monkey, Cercopithecus ascanius
  Chlorocebus
    Green Monkey, Chlorocebus sabaceus
    Vervet Monkey, Chlorocebus pygerythrus
Family Cebidae (New World monkeys)
  Subfamily Aotinae
    Aotus trivirgatus
  Subfamily Pitheciinae (sakis)
    Black Bearded Saki, Chiropotes satanas
    White-nosed Saki, Chiropotes albinasus
[Figure: Flat-tailed Horned Lizard, Phrynosoma mcallii, aligned with Dynamic Time Warping]
Unlike the primates, reptiles require warping

OK, let us take stock of what we have seen so far
There are interesting problems in shape/time series mining (motifs, anomalies, clustering, classification, query-by-content, visualization, joins).
Very simple transformations let us treat shapes as time series.
Very simple distance measures (Euclidean, DTW) work very well.

Data Mining is Constrained by Disk I/O
For example, suppose you have one gig of main memory and want to do K-means clustering ...

The Generic Data Mining Algorithm
Create an approximation of the data, which will fit in main memory, yet retains the essential features of interest
Approximately solve the problem at hand in main memory
Some approximations of time series
[Figure: one time series (x-axis 0 to 120) approximated by DFT, DWT, SVD, APCA, PAA, PLA and SYM (symbolic word aabbbccb)]

Time Series Representations
[Figure: hierarchy of representations; recoverable node labels: Piecewise Linear Approximation (Interpolation, Regression), Adaptive Piecewise Constant Approximation, Natural Language, Strings (Symbolic Aggregate Approximation; Non Lower Bounding; Value Based, Slope Based), Orthonormal (Haar, Daubechies dbn n > 1), Bi-Orthonormal (Coiflets, Symlets), Discrete Fourier Transform, Discrete Cosine Transform, Chebyshev Polynomials]

$$D_{LB}(Q, S) \equiv \sqrt{\sum_{i=1} (sr_i - sr_{i-1})(qv_i - sv_i)^2}$$
Lower Bounding functions are known for wavelets, Fourier, SVD, piecewise polynomials, Chebyshev Polynomials and clipped data
... time series, none except SAX allows lower bounding
[Figure: SYM and DFT approximations of a time series]

Why do we care so much about symbolic representations?
Symbolic Representations Allow:
Markov Models
Stealing ideas from text processing/bioinformatics community
etc
How do we obtain SAX?
First convert the time series to PAA representation, then convert the PAA to symbols
It takes linear time
[Figure: time series C, its PAA approximation, and the resulting symbols a, b, c; x-axis 0 to 120]

Note we made two parameter choices
The word size, in this case 8.
The alphabet size (cardinality), in this case 3.

[Figure: the same time series under DFT, PLA, Haar, APCA and SAX, with symbols a to f]
A raw time series of length 128 is transformed into the word ffffffeeeddcbaabceedcbaaaaacddee.
We can use more symbols to represent the time series since each symbol requires fewer bits than real-numbers (float, double)

Recall the Euclidean distance?
Yes, here is the function that lower bounds it for SAX, it is called MINDIST
$\hat{C}$ = bbabcbac
$\hat{Q}$ = bbaccbac

$$MINDIST(\hat{Q}, \hat{C}) \equiv \sqrt{\frac{n}{w}} \sqrt{\sum_{i=1}^{w} \left(dist(\hat{q}_i, \hat{c}_i)\right)^2}$$

dist() can be implemented using a table lookup.
      a      b      c
a     0      0      0.67
b     0      0      0
c     0.67   0      0
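The SAX pipeline described above (z-normalize, reduce to PAA, map PAA values to symbols, compare words with MINDIST) can be sketched in Python. This is an illustrative sketch under assumptions, not the author's released code: the function names are hypothetical, the series length is assumed to be a multiple of the word size, and the breakpoints are assumed to be the equiprobable quantiles of a standard normal distribution, which is where the 0.67 in the lookup table comes from for four symbols.

```python
import math
from bisect import bisect_right
from statistics import NormalDist

def znorm(ts):
    n = len(ts)
    mu = sum(ts) / n
    sd = math.sqrt(sum((x - mu) ** 2 for x in ts) / n)
    return [0.0] * n if sd == 0 else [(x - mu) / sd for x in ts]

def paa(ts, w):
    """Piecewise Aggregate Approximation: the mean of each of w equal segments."""
    n = len(ts)
    if n % w:
        raise ValueError("length must be a multiple of the word size")
    seg = n // w
    return [sum(ts[i * seg:(i + 1) * seg]) / seg for i in range(w)]

def breakpoints(a):
    """a - 1 cut points that split N(0, 1) into a equiprobable regions."""
    return [NormalDist().inv_cdf(k / a) for k in range(1, a)]

def sax(ts, w, a):
    """Convert a time series to a SAX word of w symbols over a letters."""
    bps = breakpoints(a)
    return "".join(chr(ord("a") + bisect_right(bps, v)) for v in paa(znorm(ts), w))

def mindist(word_q, word_c, n, a):
    """Lower bound on the Euclidean distance between the two z-normalized
    series of length n that produced the words."""
    bps = breakpoints(a)
    total = 0.0
    for sq, sc in zip(word_q, word_c):
        lo, hi = sorted((ord(sq) - ord("a"), ord(sc) - ord("a")))
        if hi - lo > 1:                      # adjacent symbols contribute 0
            total += (bps[hi - 1] - bps[lo]) ** 2
    return math.sqrt(n / len(word_q)) * math.sqrt(total)
```

For example, a ramp of eight points with word size 4 and alphabet size 4 becomes "abcd", and its reversal becomes "dcba". MINDIST between those two words is about 2.70, which is below the true Euclidean distance of about 5.66 between the z-normalized series.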
OK, let us have another quick review
Data mining problems are I/O bound
The generic data mining algorithm mitigates the problem, if you can obey the lower bounding requirement.
There is one approximation of time ...

Let us consider the utility of SAX for visualizing time series. We start with an apparent digression, visualizing DNA.
TGGCCGTGCTAGGCCCCACCCCTACCTTGCA
GTCCCCGCAAGCTCATCTGCGCGAACCAGA
ACGCCCACCACCCTTGGGTTGAAATTAAGGA
GGCGGTTGGCAGCTTCCCAGGCGCACGTAC
CTGCGAATAAATAACTGTCCGCACAAGGAGC
CCGACGATAGTCGACCCTCTCTAGTCACGAC
CTACACACAGAACCTGTGCTAGACGCCATGA
GATAAGCTAACACAAAAACATTTCCCACTAC
TGCTGCCCGCGGGCTACCGGCCACCCCTGG
CTCAGCCTGGCGAAGCCGCCCTTCA
[Figure: DNA bitmap grids at levels l=2 and l=3 (l stands for Level); part of the l=3 grid reads GAA GAC GCA GCC TAA TAC TCA TCC / GAG GAT GCG GCT TAG TAT TCG TCT; example cell frequencies 0.20, 0.24, 0.26, 0.30]
OK. Given any DNA string I can make a colored bitmap, so what?
[Figure: a DNA string and its bitmap, color scale 0 to 1; example cell frequencies 0.02, 0.04, 0.09, 0.04, 0.03, 0.07, 0.02, 0.11, 0.03]
[Figure: DNA bitmaps of several species] Note Elephas maximus is the Indian Elephant, Loxodonta africana is the African elephant, Pan troglodytes is the chimpanzee

Two Questions
Can we do something similar for time series?
Would it be useful?
Can we make bitmaps for time series?
Yes, with SAX!
[Figure: a time series (x-axis 0 to 120, y-axis -1.5 to 1.5) converted to the SAX word GTTGACCA over the alphabet A, C, G, T]
Level 1 grid:
A C
G T
Level 2 grid:
AA AC CA CC
AG AT CG CT
GA GC TA TC
GG GT TG TT

[Figure: One Year of Italian Power Demand (0 to 300), January to December, with August marked; bitmaps for May.txt and Sept.txt]
We can test how much ... clustering
[Figure: clustering of ECG bitmaps for the files normal2.txt, normal3.txt, normal4.txt, normal6.txt, normal7.txt, normal12.txt, normal13.txt, normal14.txt, normal15.txt, normal16.txt, normal17.txt, normal18.txt, and of datasets numbered 2 to 20]
Data Key
Cluster 1 (datasets 1 ~ 5): BIDMC Congestive Heart Failure Database (chfdb): record chf02
Cluster 3 (datasets 11 ~ 15): Long Term ST Database (ltstdb): record 20021
Start times at 0, 50, 100, 150, 200, respectively
[Figure: action potential of a normal pacemaker cell, x-axis 0 to 500; labeled stages: ventricular depolarization, plateau stage, repolarization]
[Figure: primate bitmaps; labels: Homo/Pan/Gorilla group, Pongo, Pan]
Here the bitmaps are very ...
Time Series Motif Discovery
(finding repeated patterns)
[Figure: Winding Dataset (The angular speed of reel 2), x-axis 0 to 2500]
... agent to generalize from a set of qualitatively different experiences gleaned from sensors. We see these experiences as motifs. See also Murakami Yoshikazu, Doki & Okuma and Maja J Mataric
In medical data mining, Caraca-Valente and Lopez-Chavarrias have introduced a method for characterizing a physiotherapy patient's recovery based on the discovery of similar patterns. Once again, we see these similar patterns as motifs.
[Figure: Space Shuttle STS-57 Telemetry (Inertial Sensor), x-axis 0 to 1000; a subsequence C and its Trivial Matches in the time series T]

Definition 1. Match: Given a positive real number R (called range) and a time series T containing a subsequence C beginning at position p and a subsequence M beginning at q, if D(C, M) ≤ R, then M is called a matching subsequence of C.
Definition 2. Trivial Match: Given a time series T, containing a subsequence C beginning at position p and a matching subsequence M beginning at q, we say that M is a trivial match to C if either p = q or there does not exist a subsequence M′ beginning at q′ such that D(C, M′) > R, and either q < q′ < p or p < q′ < q.
Definition 3. K-Motif(n,R): Given a time series T, a subsequence length n and a range R, the most significant motif in T (hereafter called the 1-Motif(n,R)) is the subsequence C1 that has the highest count of non-trivial matches (ties are broken by choosing the motif whose matches have the lower variance). The Kth most significant motif in T (hereafter called the K-Motif(n,R)) is the subsequence CK that has the highest count of non-trivial matches, and satisfies D(CK, Ci) > 2R, for all 1 ≤ i < K.

OK, we can define motifs, but how do we find them?
The obvious brute force search algorithm is just too slow
The most referenced algorithm is based on a hot idea from bioinformatics, random projection*, and the fact that SAX allows us to lower bound discrete representations of time series.
* J Buhler and M Tompa. Finding motifs using random projections. In RECOMB'01. 2001.
[Figure: T (m = 1000)]
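To make Definitions 1 to 3 concrete, here is a sketch of the obvious brute force search for the 1-Motif, the one the slides call too slow. It is an illustration under assumptions, not the random projection algorithm: the function names are hypothetical, D is assumed to be the plain Euclidean distance without z-normalization, each run of consecutive matching positions is counted as one non-trivial match, and ties are broken by taking the earliest position rather than by variance.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def count_nontrivial_matches(T, p, n, R):
    """Count non-trivial matches of the subsequence starting at p.

    Walking away from p in each direction, a match only counts once some
    subsequence in between was farther than R (Definition 2).
    """
    C = T[p:p + n]
    count = 0
    for step in (1, -1):
        q = p + step
        separated = False   # True once a subsequence with D > R has been passed
        in_run = False      # True while inside a run of consecutive matches
        while 0 <= q <= len(T) - n:
            if euclidean(C, T[q:q + n]) > R:
                separated, in_run = True, False
            elif separated and not in_run:
                count += 1
                in_run = True
            q += step
    return count

def one_motif(T, n, R):
    """Brute force 1-Motif(n, R): O(m^2) distance computations."""
    best_p, best_count = None, -1
    for p in range(len(T) - n + 1):
        c = count_nontrivial_matches(T, p, n, R)
        if c > best_count:
            best_p, best_count = p, c
    return best_p, best_count
```

Every one of the roughly m^2 subsequence pairs is compared, which is what makes this approach impractical for long series and motivates the SAX-based heuristic.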
We can now use the information in the collision matrix as a heuristic to hunt for likely motifs.
[Figure: matrix of SAX words for subsequences 1, 2, ..., 58, ..., 985 (word positions 1 to 4) and the collision matrix with rows and columns 1, 2, ..., 58, ..., 985]
... solution so it agrees with the solution we would have obtained on the original data
But which approximation should we use?

A Simple Experiment
Planted Motifs
Let us embed two motifs into a random walk time series, and see if we can recover them
[Figure: random walk with planted motifs A, B, C, D; x-axis 0 to 120]
Shape Motifs II
[Figure: Limenitis archippus and Danaus plexippus; x-axis -2 to 2]
We can find shape motifs with ...
[Figure: Staurastrum tetracerum, its time series (x-axis 20 to 240) and SAX symbols a, b, c, d]
Place every circular shift of SAX word in the projection matrix.
[Figure: projection matrix; the word bacb for shape i is inserted with its circular shifts bacb, acbb, cbba, bbac, and the word bcbb for shape j with bcbb, cbbb, bbbc, bbcb]
[Figure: Time improvement over BruteForce (0 to 0.8) against Dataset size (500, 1000, 2000, 4000)]

Giorgio Morandi 1890-1964
Through his simple and repetitive motifs Morandi became an important forerunner of Minimalism. (wikipedia)

What is the most unusual shape in this collection?
This one!
This one is
even more 1st Discord
subtle
Here is a
subset of a
large
collection of
petroglyphs
1st Discord
1st Discord
Finding Discords, Fast
[Distance matrix used in the example]
0    2    4.2  1.1  2.3  8.5
2    0    3    3.2  3.5  8.2
4.2  3    0    1.2  9.2  9.7
1.1  3.2  1.2  0    0.1  7.5
2.3  3.5  9.2  0.1  0    7.6
8.5  8.8  9.7  7.5  7.6  0

Function [ dist, loc ] = Heuristic_Search(S, Outer, Inner )
  best_so_far_dist = 0
  best_so_far_loc = NaN
  for each index p given by heuristic Outer        // begin outer loop
    nearest_neighbor_dist = infinity
    for each index q given by heuristic Inner      // begin inner loop
      if p != q
        // record the distance before testing for the break,
        // so an abandoned p is never accepted below
        if RD(Cp , Cq ) < nearest_neighbor_dist
          nearest_neighbor_dist = RD(Cp , Cq )
        end
        if RD(Cp , Cq ) < best_so_far_dist
          break                                    // break out of inner loop
        end
      end
    end                                            // end inner loop
    if nearest_neighbor_dist > best_so_far_dist
      best_so_far_dist = nearest_neighbor_dist
      best_so_far_loc = p
    end
  end                                              // end outer loop
  return [ best_so_far_dist, best_so_far_loc ]

The code now says: If while searching a given column, you find a distance less than best_so_far_dist then that column cannot have the discord.
The code also uses heuristics to order the search

The Magic Heuristics
In the outer loop, visit the columns in order of the Discord score
In the inner loop, visit the row cells in order of nearest neighbor first
The Magic Heuristics would reduce the time complexity from an O(n^2) algorithm to just O(n)!
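The search can be tried on the 6 by 6 distance matrix from the slide with the Python sketch below. It is an illustration, not the tutorial's code: the function name, the call counter and the `dist` callback (standing in for RD) are assumptions, and the nearest neighbor distance is recorded before the break test so that an abandoned candidate can never be accepted afterwards.

```python
import math

def heuristic_search(n, dist, outer=None, inner=None):
    """Find the discord among n objects.

    dist(p, q) returns the distance between objects p and q; outer and
    inner are the visiting orders (default: 0 .. n-1).
    Returns (discord distance, discord index, number of dist calls).
    """
    outer = list(range(n)) if outer is None else list(outer)
    inner = list(range(n)) if inner is None else list(inner)
    best_dist, best_loc, calls = 0.0, None, 0
    for p in outer:
        nearest = math.inf
        for q in inner:
            if p == q:
                continue
            d = dist(p, q)
            calls += 1
            nearest = min(nearest, d)
            if d < best_dist:       # p has a close neighbor: it cannot be the discord
                break
        if nearest > best_dist:
            best_dist, best_loc = nearest, p
    return best_dist, best_loc, calls

# Distance matrix copied from the slide (rows and columns are the six objects).
D = [
    [0, 2, 4.2, 1.1, 2.3, 8.5],
    [2, 0, 3, 3.2, 3.5, 8.2],
    [4.2, 3, 0, 1.2, 9.2, 9.7],
    [1.1, 3.2, 1.2, 0, 0.1, 7.5],
    [2.3, 3.5, 9.2, 0.1, 0, 7.6],
    [8.5, 8.8, 9.7, 7.5, 7.6, 0],
]
```

With the default ordering the search makes 23 distance calls out of the 30 possible and reports the sixth object (index 5) as the discord at distance 7.5. Visiting that object first in the outer loop, as the outer heuristic would, cuts the work to 10 calls.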
Visiting the columns in approximate order of the Discord score is still very helpful
For the inner loop, we don't really need to visit the rows in order of nearest neighbor first, so long as we find a near enough neighbor early on
We can try to approximate Magic
[Figure: SAX words (caa, cab, caa, ..., acb, bca for subsequences 1, 2, 3, ..., m-1, m) Inserted into array together with their occurrence counts, and an Augmented Trie whose leaves list subsequence positions]
How Fast is Approximately Magic?
On a problem dataset of arrowheads
[Figure: vertical axis 0 to 0.9 against Number of Time Series in database (m), for the Brute Force, Random, Approx. and Optimal orderings]

Which is the odd man out in this collection of Red Passion Flower Butterflies?
[Figure: butterfly shapes and their time series (x-axis 0 to 900); Heliconius melpomene (The Postman) and Heliconius erato (Red Passion Flower Butterfly)]
One of them is not a Red Passion Flower Butterfly. A fact that can be discovered by finding the shape discord

Nematode Discords
Drosophila melanogaster
[Figure: shapes A to G; groups A, D, E and B, C, F, with G singled out]

Fungus Images
Some spores produced by a rust (fungus) known as Gymnosporangium, which is a parasite of apple and pear trees. Note that one spore has sprouted an appendage known as a germ tube, and is thus singled out as the discord.
Time Series Discords in Medical Data
A cardiologist noted subtle anomalies in this dataset. Let us see if the discord algorithm can find them.
Record qtdbsele0606 from the PhysioBank QT Database (qtdb)
[Figure: ECG, x-axis 0 to 2500; zoom-in on the ST Wave, x-axis 0 to 500]
How was the discord able to find this very subtle Premature ventricular contraction? Note that in the normal heartbeats, the ST wave increases monotonically, it is only in the Premature ventricular contractions that there is an inflection. NB, this is not necessarily true for all ECGs

Power Demand
One year's power demand at a Dutch research facility
[Figure: power demand (0 to 2500), January to December, with 1st, 2nd and 3rd Discord marked; labels: Liberation Day, Ascension Thursday, Easter Sunday, Sunday Dec 25]

Sleep Cycles
A time series showing a patient's respiration (measured by thorax extension), as they wake up. A medical expert, Dr. J. Rittweger, manually segmented the data. The 1-discord is a very obvious deep breath taken as the patient opened their eyes. The 2-discord is much more subtle and impossible to see at this scale. A zoom-in suggests that Dr. J. Rittweger noticed a few shallow breaths that indicated the transition of sleeping stages.
Institute for Physiology. Free University of Berlin. Data shows respiration (thorax extension), sampling rate 10 Hz.
[Figure: respiration, x-axis 0 to 2000; segments: Stage II sleep; Eyes closed, awake or stage I sleep; Eyes open, awake]
[Figure: time series, x-axis 0 to 5000, with the Discord marked]
... to see why it is a discord.
Spatially Constrained/Informed Mining of Shapes
[Figure: Iguania: Phrynosoma hernandesi, Phrynosoma douglassii, Phrynosoma taurus, Phrynosoma ditmarsi, Phrynosoma mcallii]
[Figure: arrowheads: Baker California, Rosegate (highly variable), Blythe, California]

Assessing the Significance of Motifs/Discords
The motif and discord algorithms always return some answer, but is the result interesting, or something we should have expected by chance?

Applications!
Annotation of Historical Manuscripts
[Figure: British Desmidiaceae, vol. 2 (1905), plate 41, fig. 5; Micrasterias oscitans]
[Figure: Beet Leafhopper, Circulifer tenellus; labels: Probing, End Probing]
Mining Web Logs
[Figure: web log time series for Tour De France (0 to 5000) and Lance Armstrong (0 to 400), 2000 to 2002, with one Lance Armstrong peak marked "?"]
But what caused the extra interest in Lance Armstrong in August/September 2000?
Example by M. Vlachos

The Last Word
The sun is setting on all other symbolic representations of time series, SAX is the only way to go

We are done!
We have seen that SAX is a very useful tool for solving problems in shape and time series data mining. I will be happy to answer any questions
What are the disadvantages of ...
Eamonn Keogh: UCR
eamonn@cs.ucr.edu

Thanks to my students
Chotirat (Ann) Ratanamahatana (Chulalongkorn University)
Dragomir Yankov (UCR)
Xiaopeng Xi (Yahoo)
>> x=random_walk(40,1);
>> timeseries2symbol(x, 16, 8, 4) Just create random walk of length 40 for testing.
Convert to SAX, with a sliding window of length 16, a word size of 8 and a cardinality of 4
ans =
4 3 2 3 1 1 3 2
4 2 3 2 1 2 3 3
3 2 3 1 1 4 2 4
2 3 2 1 2 2 3 4
2 2 1 1 3 2 3 4
2 1 1 2 2 2 4 4
2 1 1 3 1 3 4 4
1 1 2 2 2 4 4 3
1 1 3 1 3 4 4 2
1 2 2 2 4 4 3 2
1 2 1 3 4 4 2 2
1 2 2 4 4 3 2 1
3 1 3 4 4 2 2 1
2 2 4 4 3 2 1 1
3 3 4 4 2 2 1 1
3 4 4 3 3 2 1 1
3 4 4 2 2 1 1 1
4 4 3 3 2 1 1 1
4 4 3 3 2 2 1 1
4 3 3 2 2 2 1 1
4 4 3 2 2 1 1 2
4 4 3 2 2 1 1 3
4 4 2 2 1 1 2 3
4 3 2 2 1 1 3 3
[Figure: the random walk, x-axis 0 to 40, with the first three SAX words 4 3 2 3 1 1 3 2, 4 2 3 2 1 2 3 3 and 3 2 3 1 1 4 2 4 marked]

Appendix A
Converting a long time series to a time series bitmap (Intelligent Icon)
We need to normalize the matrix z, below is one way to do it such that the min value is 0 and the max value is 1. (matlab code)
There may be better ways to normalize...
>> z=(z-min(min(z)));
>> z=(z/max(max(z)))
z=
1.0000 0.9369 0.8618 0.9696
0.2282 0.7982 0.4575 0.7725
0.6315 0.4701 0.6407 0.1693
0.5018 0.0000 0.8302 0.4156
Map to some colormap, I have done part of the work below...

Hints I
ans =
G T C T A A T C
G C T C A C T T
T C T A A G C A
A T C A C C T G
C C A A T C T G
When counting patterns, don't count patterns that span two lines.
For example, don't count the underlined As (the A that ends line 3 and the A that starts line 4) as an occurrence of AA

Hints II
ans =
G T C T A A T C
G T C T A A T C
G C T C A C T T
T C T A A G C A
A T C A C C T G
C C A A T C T G
Note that here lines 1 and 2 are the same. This can happen a lot, especially with smooth time series and/or a high compression ratio.
The SAX code has an extra parameter that removes these redundant lines. It seems like this makes the Intelligent Icons work better, and it does make the code run a little faster.
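The steps in this appendix (count short patterns inside each SAX word, never across two lines, optionally skip repeated lines, then rescale so the minimum is 0 and the maximum is 1) can be sketched in Python as below. The function names and the `drop_repeats` flag are hypothetical, not the parameters of the released SAX code. Words are assumed to be lists of integers 1 to 4, and the cell layout is assumed to follow the grid shown earlier in the tutorial, where the first symbol picks the quadrant (1 top left, 2 top right, 3 bottom left, 4 bottom right) and each later symbol subdivides it.

```python
def cell(subword):
    """Row and column of a pattern in the 2^L x 2^L bitmap."""
    row = col = 0
    for s in subword:
        k = s - 1                 # symbols 1..4 -> 0..3
        row = 2 * row + k // 2
        col = 2 * col + k % 2
    return row, col

def bitmap(words, level, drop_repeats=False):
    """Normalized pattern-frequency bitmap for a list of SAX words."""
    size = 2 ** level
    grid = [[0.0] * size for _ in range(size)]
    previous = None
    for word in words:
        if drop_repeats and word == previous:
            continue              # skip a line identical to the one before it
        previous = word
        # patterns are taken inside one word only, never across two lines
        for i in range(len(word) - level + 1):
            r, c = cell(word[i:i + level])
            grid[r][c] += 1
    lo = min(min(row) for row in grid)
    hi = max(max(row) for row in grid)
    if hi == lo:
        return [[0.0] * size for _ in range(size)]
    # same rescaling as the MATLAB lines: subtract the min, divide by the new max
    return [[(v - lo) / (hi - lo) for v in row] for row in grid]
```

The resulting matrix of values in [0, 1] can then be passed to any colormap to draw the icon.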
Hints III
For Intelligent Icon the cardinality must be 4
But what is the best sliding window length?
What is the best word size?
At the moment there is no answer to this other than playing with the data (or CV if you have labeled data)
The good news is that once you find good settings for your domain (say ECGs) then the settings should work for all ECGs.
[Figure: random walk, x-axis 0 to 40, annotated with sliding window length, word size = 8, and the words 4 3 2 3 1 1 3 2, 4 2 3 2 1 2 3 3, 3 2 3 1 1 4 2 4]
Heuristics:
The sliding window length should be about twice the length of the natural scale at which the data is interesting. For example, about two heartbeats for cardiology, or for power demand, about two days.
The smoother the data, the smaller you can make the word size.

Appendix: DTW
There are some critical facts about the size of the warping window r.
r can vary from 0% (the special case of Euclidean distance) to 100% (the special case of full DTW).
Without lower bounding, the time taken is approximately linear in r, so r = 5% is about twice as fast as r = 10%.
With lower bounding, the time taken is highly non-linear in r, so r = 5% is perhaps 10 to 100 times as fast as r = 10%.
In general (empirically measured over 35 datasets) the following is true.
If you start with r = 0 and you make it larger, the accuracy improves, then gets worse (see the two examples for FACE and GUN in this tutorial, but it is true for other datasets)
The best accuracy tends to be at a relatively small value for r (usually just 2 to 5%)
For any dataset, the best value for r depends on the size of the training set. For example for CBF with just 20 instances, you might need r = 8%, but with 200 instances you only need 1 or 2%, and with 2,000 instances, you need r = 0% (the Euclidean distance).
How do you find the best choice for r? Use cross validation to test for the best value.
See [a] and [b]
[a] Xiaopeng Xi, Eamonn Keogh, Christian Shelton, Li Wei & Chotirat Ann Ratanamahatana (2006). Fast Time Series Classification Using Numerosity Reduction. ICML
[b] Ratanamahatana, C. A. and Keogh, E. (2004). Everything you know about Dynamic Time Warping is Wrong. Third Workshop on Mining Temporal and Sequential Data, in conjunction with the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2004), August 22-25, 2004 - Seattle, WA
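One way to cross validate the warping window is leave-one-out 1-nearest-neighbor classification over a handful of candidate values of r, as sketched below. This is an assumption-laden illustration and not the procedure of [a] or [b]: the function names are hypothetical, r is given as a percentage of the series length, series are assumed to have equal length, and ties in accuracy are broken in favor of the smallest (fastest) window.

```python
import math

def dtw_window(q, c, r_percent):
    """DTW with a warping window of r_percent of the series length."""
    n = len(q)
    r = int(round(n * r_percent / 100.0))
    g = [[math.inf] * (n + 1) for _ in range(n + 1)]
    g[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - r), min(n, i + r) + 1):
            cost = (q[i - 1] - c[j - 1]) ** 2
            g[i][j] = cost + min(g[i - 1][j - 1], g[i - 1][j], g[i][j - 1])
    return math.sqrt(g[n][n])

def loocv_accuracy(data, labels, r_percent):
    """Leave-one-out accuracy of 1-NN classification under dtw_window."""
    correct = 0
    for i, x in enumerate(data):
        best, predicted = math.inf, None
        for j, y in enumerate(data):
            if i == j:
                continue
            d = dtw_window(x, y, r_percent)
            if d < best:
                best, predicted = d, labels[j]
        correct += predicted == labels[i]
    return correct / len(data)

def best_window(data, labels, candidates=(0, 1, 2, 3, 4, 5, 10)):
    """Return (r_percent, accuracy) with the highest leave-one-out accuracy."""
    scored = [(loocv_accuracy(data, labels, r), r) for r in candidates]
    top = max(acc for acc, _ in scored)
    return min(r for acc, r in scored if acc == top), top
```

On a toy set with two tall bumps and two short bumps at shifted positions, r = 0% gets only half of the leave-one-out predictions right, because the nearest Euclidean neighbor of each tall bump is the short bump at the same position, while a 20% window classifies all four correctly.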