
Important Note

These notes are provisional.
Update notes will be available at the conference, both electronically and (some) paper copies.

Fair Use Agreement

This agreement covers the use of all slides, please read carefully.

You may freely use these slides for teaching, if
You send me an email telling me the class number/university in advance.
My name and email address appears on the first slide (if you are using all or most of the slides), or on each slide (if you are just taking a few slides).

You may freely use these slides for a conference presentation, if


You send me an email telling me the conference name in advance.
My name appears on each slide you use.

You may not use these slides for tutorials, or in a published work (tech report/
conference paper/ thesis/ journal etc). If you wish to do this, email me first, it is
highly likely I will grant you permission.

(c) Eamonn Keogh, eamonn@cs.ucr.edu

Machine Learning in Time Series Databases (and Everything Is a Time Series!)
AAAI Tutorial 2011
Eamonn Keogh, UCR
eamonn@cs.ucr.edu
"Come, we shall learn of the mining of time series"

Outline of Tutorial I

Introduction, Motivation
The ubiquity of time series and shape data
Examples of problems in time series and shape data mining
The utility of distance measurements
Properties of distance measures
Euclidean distance
Dynamic time warping
Longest common subsequence
Why no other distance measures?
Preprocessing the data
Invariance to distortions
Spatial Access Methods and the curse of dimensionality
Generic dimensionality reduction (very briefly)
Discrete Fourier Transform
Discrete Wavelet Transform
Singular Value Decomposition
Adaptive Piecewise Constant Approximation
Piecewise Linear Approximation
Piecewise Aggregate Approximation
Why Symbolic Approximation is different
Why SAX is the best symbolic approximation

Outline of Tutorial II

In both shape and time series, we consider:
Novelty detection (finding unusual shapes or subsequences)
Motif discovery (finding repeated shapes or subsequences)
Clustering
Classification
Indexing
Visualizing massive datasets
Open problems to solve
Summary, Conclusions

The Ubiquity of Shape

butterflies, fish, petroglyphs, arrowheads, fruit fly wings, lizards, nematodes, yeast cells, faces, historical manuscripts
[Figure: example shapes, including Drosophila melanogaster]

The Ubiquity of Time Series

Motion capture, meteorology, finance, handwriting, medicine, web logs, music
[Figure: motion capture traces labeled "Shooting" and "Don't Shoot!", annotated: Hand at rest, Hand moving above holster, Hand moving down to grasp gun, Hand moving to shoulder level]
[Figure: a time series for Lance Armstrong, 2000 to 2002]

Examples of problems in time series and shape data mining

In the next few slides we will see examples of the kind of problems we would like to be able to solve, then later we will see the necessary tools to solve them

All our Experiments are Reproducible!

"People that do irreproducible experiments should be boiled alive"
"Agreed! All experiments in this tutorial are reproducible"

Example 1: Join

Given two data collections, link items occurring in each

We can take two different families of butterflies, Limenitidinae and Danainae, and find the most similar shape between them
Limenitidinae (subset): Adelpha iphiclus, Harma theobene, Aterica galene, Limenitis reducta, Limenitis archippus, Catuna crithea
Danainae (subset): Danaus affinis, Euploea camaralzeman, Greta morgane, Danaus plexippus, Tellervo zoilus, Placidina euryanassa

The most similar pair: Limenitis archippus (Viceroy) and Danaus plexippus (Monarch)
Why would the two most similar shapes also have similar colors and patterns? That can't be a coincidence.
This is an example of Müllerian mimicry
Not Batesian mimicry as commonly believed

".. so similar in coloration that I will put them both to one"*
*Inferno -- Canto XXIII 29
Photo by Lincoln Brower

Example 2: Annotation

Given an object of interest, automatically obtain additional information about it.

Friedrich Bertuch's Bilderbuch für Kinder (Weimar, 1798-1830)
This page was published in 1821
Bilderbuch is a children's encyclopedia of natural history, published in 237 parts over nearly 40 years in Germany.

Suppose we encountered this page and wanted to know more about the insect. The back of the page says "Stockinsekt", which we might be able to parse to "Stick Insect", but what kind? How large is it? Where do they live?

Suppose we issue a query to Google search for "Stick Insect" and further filter the results by shape similarity.

Most images returned by the Google image query "stick insect" do not segment into simple shapes, but some do, including the 296th one.
It looks like our insect is a Thorny Legged Stick Insect, or Eurycantha calcarata, from Southeast Asia.

Note that in addition to rotation invariance our distance measure must be invariant to other differences. The real insect has a tail that extends past his legs, and asymmetric positions of limbs etc.

Example 3: Query by Content

Given a large data collection, find the k most similar objects to an object of interest.

Petroglyphs
They appear worldwide
Over a million in America alone
Surprisingly little known about them

"Petroglyphs are images incised in rock, usually by prehistoric peoples. They were an important form of pre-writing symbols, used in communication from approximately 10,000 B.C.E. to modern times." (Wikipedia)

".. they would strike the subtlest minds with awe"*
"who so sketched out the shapes there?"*
*Purgatorio -- Canto XII 6

Example 4: Clustering

Given an unlabeled dataset, arrange them into groups by their mutual similarity
"There is a special reason why this tree is so tall and inverted"*
*Purgatorio -- Canto XXXIII 64

Example 5: Classification

Given a labeled training set, classify future unlabeled examples
What type of arrowhead is this?
"For he is well placed among the fools who does not distinguish one class from another"*
*Paradiso -- Canto XIII 115

[Figure labels on these two slides: Basal, Articulate, Iguania, Alligatoridae, Crocodylidae, Alligatorinae, Amphisbaenia, Chelonia, Phrynosoma braconnieri]

Example 6: Anomaly Detection (Discords)

Given a large collection of objects, find the one that is most different to all the rest.
A subset of 32,028 images of Drosophila wings
Specimen 20,773
"you are merely like imperfect insects"*
*Purgatorio -- Canto X 127

Example 7: Repeated Pattern Discovery (Motifs)

Given a large collection of objects, find the pair that is most similar.
Blythe, California
Baker, California
"each one is alike in size and rounded shape"*
*Inferno -- Canto XIX 15

Example(s) 8: Human Motion

"The two of us walked on that road"*
Join
Annotation
Query-by-Content
Clustering
Classification
Anomaly Detection
Motif Discovery
*Inferno -- Canto VI
MoCap Image by Meredith & Maddock

Two Kinds of Shape Matching: rigid and flexible

Rigid
Convert shape to pseudo time series or feature vector. Use time series distance measures or vector distance measures to measure similarity.
[Figure: a Texas Duran Arrowhead converted to a time series, horizontal axis 0 to 1]
We only consider this approach in this tutorial.
It works well for the butterflies, fish, petroglyphs, arrowheads, fruit fly wings, lizards, nematodes, yeast cells, faces, historical manuscripts etc discussed at the beginning of this tutorial.

Flexible
Key Ideas: Convert shape to graph/tree
Use graph/tree edit distance to measure similarity
Needed for articulated shapes
Some shapes are already graph like
"Just two edits to change this dog to a cat"*
The shape to graph transformation is very tricky#
We do not further discuss these ideas, see shock graph work of Sebastian, Klein and Kimia* and the work of Latecki# and others

Shape Representations

We can convert shapes into a 1D signal. Thus we can remove information about scale and offset.
[Figure: a shape outline and its 1D signal, horizontal axis 0 to 1200]
There are many other 1D representations of shape, and the algorithms shown in this tutorial can work with any of them

Rotation we must deal with in our algorithms

For virtually all shape matching problems, rotation is the problem
If I asked you to group these reptile skulls, rotation would not confuse you
"it seemed to change its shape, from running lengthwise to revolving round"*

There are two ways to be rotation invariant
1) Landmarking: Find the one "true" rotation
2) Rotation invariant features
*Paradiso -- Canto XXX, 90.

Landmarking

Generic Landmarking
Find the major axis of the shape and use that as the canonical alignment
Domain Specific Landmarking
Find some fixed point in your domain, e.g. the nose on a face, the stem of a leaf, the tail of a fish
Domain specific landmarks include leaf stems, noses, the tip of arrowheads
The only problem with landmarking is that it does not work
[Figure: shapes A, B and C under Generic Landmark Alignment and under Best Rotation Alignment]

Rotation invariant features

Possibilities include:
Ratio of perimeter to area, fractal measures, elongatedness, circularity, min/max/mean curvature, entropy, perimeter of convex hull, aspect ratio and histograms
The problem with rotation invariant features is that in throwing away rotation information, you must invariably throw away useful information
[Figure: histogram and aspect ratio features for monkey and reptile skulls, annotated "works here" and "not here"; aspect ratio values 0.73, 0.49, 0.47, 0.41, 0.54, 0.43]

Best Rotation Alignment

[Figure: primate skulls clustered under Landmarking and under Best Rotation Alignment; labels include Red Howler Monkey, Mantled Howler Monkey, Owl Monkey (species unknown), Northern Gray-Necked Owl Monkey, Orangutan, Orangutan (juvenile), Borneo Orangutan]

The easy way to achieve rotation invariance is to hold one time series (Q) fixed, and compare it to every circular shift of the other time series (C), which is represented by the matrix $\mathbf{C}$

$$\mathbf{C} = \begin{bmatrix} c_1 & c_2 & \cdots & c_{n-1} & c_n \\ c_2 & \cdots & c_{n-1} & c_n & c_1 \\ \vdots & & & & \vdots \\ c_n & c_1 & c_2 & \cdots & c_{n-1} \end{bmatrix}$$

algorithm: [dist] = Test_All_Rotations(Q, C)
dist = infinity
for j = 1 to n
    TempDistance = Some_Dist_Function(Q, Cj)
    if TempDistance < dist
        dist = TempDistance;
    end;
end;
return [dist]

The strategy of testing all possible rotations is very very slow

People have suggested various tricks for speedup, like only testing 1 in 5 of the rotations
However there now exists a simple, exact, ultrafast, indexable way to do this*
"It sucks being a grad student"
*VLDB06: LB_Keogh Supports Exact Indexing of Shapes under Rotation Invariance with Arbitrary Representations and Distance Measures.
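
The brute-force procedure above can be sketched in Python as follows. This is only an illustration of testing every circular shift, not the fast indexable method the slide cites; the function names `euclidean` and `min_rotation_distance` and the toy data are assumptions made for this sketch.

```python
import math

def euclidean(q, c):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((qi - ci) ** 2 for qi, ci in zip(q, c)))

def min_rotation_distance(q, c, dist=euclidean):
    """Hold q fixed and return the smallest distance to any circular shift of c."""
    best = math.inf
    for j in range(len(c)):
        shifted = c[j:] + c[:j]  # row j of the rotation matrix
        best = min(best, dist(q, shifted))
    return best

# Hypothetical toy data: c is q rotated by two positions.
q = [0.0, 1.0, 2.0, 1.0]
c = [2.0, 1.0, 0.0, 1.0]
rotation_invariant = min_rotation_distance(q, c)
plain = euclidean(q, c)
```

With n points per series this costs n distance computations per comparison, which is why the slide calls the strategy very slow.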

The need for rotation invariance shows up in real time series, as in these Star Light Curves

"I saw above a million burning lamps, A Sun kindled every one of them, as our sun lights the stars we glimpse on high"*

$$\mathbf{C} = \begin{bmatrix} c_1 & c_2 & \cdots & c_{n-1} & c_n \\ c_2 & \cdots & c_{n-1} & c_n & c_1 \\ \vdots & & & & \vdots \\ c_n & c_1 & c_2 & \cdots & c_{n-1} \end{bmatrix}$$

*The Paradiso -- Canto XXIII 28-30

Shape Distance Measures

"Speak to me of the useful distance measures"
"There are but three"
Euclidean Distance
Dynamic Time Warping
Longest Common Subsequence

Euclidean Distance works well for matching many kinds of shapes
[Figure: Euclidean Distance between Mantled Howler Monkey (Alouatta palliata) and Red Howler Monkey (Alouatta seniculus seniculus)]

Dynamic Time Warping is useful for natural shapes, which often exhibit intraclass variability
[Figure: DTW Alignment between Lowland Gorilla (Gorilla gorilla graueri) and Mountain Gorilla (Gorilla gorilla beringei)]
"Is man an ape or an angel?"

LCSS can deal with missing or occluded parts

Matching skulls is an important problem
[Figure: skulls A, B and C with DTW Alignment (inset) and LCSS Alignment; annotation: "This region will not be matched"]
The famous Skhul V is generally reproduced with the missing bones extrapolated in epoxy (A), however the original Skhul V (B) is missing the nose region, which means it will match to a modern human (C) poorly, even after DTW alignment (inset). In contrast, LCSS alignment will not attempt to match features that are outside a "matching envelope" (heavy gray line) created from the other sequence.

Euclidean Distance Metric

Given two time series Q = q1…qn and C = c1…cn, the Euclidean distance between them is defined as:

$$D(Q, C) \equiv \sqrt{\sum_{i=1}^{n} (q_i - c_i)^2}$$

[Figure: Q and C, horizontal axis 0 to 100]
"I notice that you Z-normalized the time series first"
The next slide shows a useful optimization
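
As a minimal sketch of the Euclidean distance on Z-normalized series, together with the early abandoning optimization that the deck describes next: the function names, the use of the population standard deviation, and the choice to return infinity when abandoning are assumptions of this example, not details given on the slides.

```python
import math
import statistics

def z_normalize(series):
    """Rescale a series to mean 0 and standard deviation 1."""
    mean = statistics.fmean(series)
    std = statistics.pstdev(series)  # population std; an assumption of this sketch
    if std == 0:
        return [0.0 for _ in series]
    return [(x - mean) / std for x in series]

def euclidean_early_abandon(q, c, best_so_far=math.inf):
    """Return D(Q, C), or math.inf once the running sum exceeds best_so_far squared."""
    limit = best_so_far ** 2
    total = 0.0
    for qi, ci in zip(q, c):
        total += (qi - ci) ** 2
        if total > limit:  # the partial sum is a lower bound on the final sum
            return math.inf
    return math.sqrt(total)

# Z-normalization removes offset and scale: these two series become identical.
same_shape = euclidean_early_abandon(z_normalize([1, 2, 3, 4]), z_normalize([2, 4, 6, 8]))
full = euclidean_early_abandon([0, 0], [3, 4])
abandoned = euclidean_early_abandon([0, 0], [3, 4], best_so_far=4)
```

In a nearest-neighbor search, `best_so_far` would be the smallest distance found so far, so most candidates are rejected after examining only a few points.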

Early Abandon Euclidean Distance

During the computation, if the current sum of the squared differences between each pair of corresponding data points exceeds $r^2$, we can safely abandon the calculation
[Figure: Q and C, horizontal axis 0 to 100; annotation: "calculation abandoned at this point"]
"I see, because the incremental value is always a lower bound to the final value, once it is greater than the best-so-far, we may as well abandon"
"Abandon all hope ye who enter here"

Dynamic Time Warping I

This is how the DTW alignment is found
[Figure: Q and C with the warping path w]

$$DTW(Q, C) = \min \left\{ \sqrt{\sum_{k=1}^{K} w_k} \Big/ K \right\}$$

This recursive function gives us the minimum cost path

$$\gamma(i,j) = d(q_i, c_j) + \min\{ \gamma(i-1,j-1),\ \gamma(i-1,j),\ \gamma(i,j-1) \}$$

Dynamic Time Warping II

There is an important trick to improve accuracy and speed
[Figure: warping constrained to a band of width r; plot of Accuracy (90 to 100) against the value of r (1 to 100) for FACE (2%) and GUNX (3%)]
This constrained warping, together with a lower bounding trick called LB_Keogh, can make DTW thousands of times faster! But don't take my word for it...
"LB_Keogh is fast, because it cleverly exploits global constraints" Christos Faloutsos, PODS 2005
See the below for more information about constrained warping:
Xi, Keogh, Shelton, Wei & Ratanamahatana (2006). Fast Time Series Classification Using Numerosity Reduction. ICML
Ratanamahatana and Keogh. (2004). Everything you know about Dynamic Time Warping is Wrong.

Tests on many diverse datasets

"and I recognized the face"
"Leaf of mine, in whom I found pleasure"
Acer circinatum (Oregon Vine Maple)
"as a fish dives through water"
"the shape of that cold animal which stings and lashes people with its tail"*
*Purgatorio -- Canto IX 5, Purgatorio -- Canto XXIII, Purgatorio -- Canto XXVI, Paradiso -- Canto XV 88
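
To make the constrained warping and the LB_Keogh idea concrete, here is a hedged sketch. It assumes equal-length series, a band constraint |i - j| <= r, squared-difference local cost, and the commonly published envelope form of LB_Keogh; the function names are chosen for this example and the slides do not give this code.

```python
import math

def dtw_band(q, c, r):
    """DTW where the warping path may not stray more than r cells from the diagonal."""
    n = len(q)  # assumes len(q) == len(c)
    gamma = [[math.inf] * (n + 1) for _ in range(n + 1)]
    gamma[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - r), min(n, i + r) + 1):
            d = (q[i - 1] - c[j - 1]) ** 2
            gamma[i][j] = d + min(gamma[i - 1][j - 1], gamma[i - 1][j], gamma[i][j - 1])
    return math.sqrt(gamma[n][n])

def lb_keogh(q, c, r):
    """Lower bound on dtw_band(q, c, r) from the upper/lower envelope of q."""
    n = len(q)
    total = 0.0
    for i, ci in enumerate(c):
        window = q[max(0, i - r): min(n, i + r + 1)]
        upper, lower = max(window), min(window)
        if ci > upper:
            total += (ci - upper) ** 2
        elif ci < lower:
            total += (ci - lower) ** 2
    return math.sqrt(total)

q = [0, 1, 2, 3, 2, 1, 0, 0]
c = [0, 0, 1, 2, 3, 2, 1, 0]  # q delayed by one step
```

With r = 0 the band collapses to the diagonal and `dtw_band` equals the Euclidean distance; with r = 1 the one-step delay is absorbed. Because `lb_keogh` runs in linear time, it can be used to skip the quadratic DTW computation whenever the bound already exceeds the best-so-far distance.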

Name            Classes  Instances  Euclidean Error (%)  DTW Error (%) {r}  Other Techniques
Face            16       2240       3.839                3.170 {3}
Swedish Leaves  15       1125       13.33                10.84 {2}          17.82 Söderkvist
Chicken         5        446        19.96                19.96 {1}          20.5 Discrete strings
MixedBag        9        160        4.375                4.375 {1}          Chamfer 6.0, Hausdorff 7.0
OSU Leaves      6        442        33.71                15.61 {2}
Diatoms         37       781        27.53                27.53 {1}          26.0 Morphological Curvature Scale Spaces
Plane           7        210        0.95                 0.0 {3}            0.55 Markov Descriptor
Fish            7        350        11.43                9.71 {1}           36.0 Fourier/Power Cepstrum

Note that DTW is sometimes worth the little extra effort

"from its stock this tree was cultivated"*
*Purgatorio -- Canto XXIV 117

[Figure: dendrogram of primate skulls, with the following annotations]

All these are in the genus Cercopithecus, except for the skull identified as being either a Vervet or Green monkey, both of which belong in the Genus of Chlorocebus which is in the same Tribe (Cercopithecini) as Cercopithecus.
Tribe Cercopithecini
Cercopithecus
De Brazza's Monkey, Cercopithecus neglectus
Mustached Guenon, Cercopithecus cephus
Red-tailed Monkey, Cercopithecus ascanius
Chlorocebus
Green Monkey, Chlorocebus sabaceus
Vervet Monkey, Chlorocebus pygerythrus

These are the same species: Bunopithecus hooloc (Hoolock Gibbon)

These are in the Genus Pongo

All these are in the family Cebidae
Family Cebidae (New World monkeys)
Subfamily Aotinae
Aotus trivirgatus
Subfamily Pitheciinae (sakis)
Black Bearded Saki, Chiropotes satanas
White-nosed Saki, Chiropotes albinasus

All these are in the tribe Papionini
Tribe Papionini
Genus Papio (baboons)
Genus Mandrillus (Mandrill)

These are in the family Lemuridae

These are in the genus Alouatta

These are in the same species: Homo sapiens (Humans)

Unlike the primates, reptiles require warping
[Figure: Dynamic Time Warping alignment of Flat-tailed Horned Lizard (Phrynosoma mcallii) and Texas Horned Lizard (Phrynosoma cornutum)]

OK, let us take stock of what we have seen so far

There are interesting problems in shape/time series mining (motifs, anomalies, clustering, classification, query-by-content, visualization, joins).
Very simple transformations let us treat shapes as time series.
Very simple distance measures (Euclidean, DTW) work very well.

We are finally ready to see how symbolic representations, in particular SAX, allow us to solve these problems

Data Mining is Constrained by Disk I/O

For example, suppose you have one gig of main memory and want to do K-means clustering
Clustering ¼ gig of data, 100 sec
Clustering ½ gig of data, 200 sec
Clustering 1 gig of data, 400 sec
Clustering 1.1 gigs of data, 20 hours
Bradley, Fayyad & Reina: Scaling Clustering Algorithms to Large Databases. KDD 1998: 9-15

The Generic Data Mining Algorithm

1. Create an approximation of the data, which will fit in main memory, yet retains the essential features of interest
2. Approximately solve the problem at hand in main memory
3. Make (hopefully very few) accesses to the original data on disk to confirm the solution obtained in Step 2, or to modify the solution so it agrees with the solution we would have obtained on the original data

But which approximation should we use?

Some approximations of time series

..note that all except SYM are real valued
[Figure: the same time series (horizontal axis 0 to 120) approximated by DFT, DWT, SVD, APCA, PAA, PLA and SYM; the SYM panel shows the word aabbbccb]

Time Series Representations

Model Based: Hidden Markov Models; Statistical Models
Data Adaptive: Sorted Coefficients; Piecewise Polynomial (Piecewise Linear Approximation: Interpolation, Regression; Adaptive Piecewise Constant Approximation); Singular Value Approximation; Symbolic (Natural Language; Strings: Symbolic Aggregate Approximation, which is Lower Bounding, and Non Lower Bounding: Value Based, Slope Based); Trees
Non Data Adaptive: Wavelets (Orthonormal: Haar, Daubechies dbn n > 1, Coiflets, Symlets; Bi-Orthonormal); Random Mappings; Spectral (Discrete Fourier Transform, Discrete Cosine Transform, Chebyshev Polynomials); Piecewise Aggregate Approximation
Data Dictated: Grid; Clipped Data

The Generic Data Mining Algorithm (revisited)

1. Create an approximation of the data, which will fit in main memory, yet retains the essential features of interest
2. Approximately solve the problem at hand in main memory
3. Make (hopefully very few) accesses to the original data on disk to confirm the solution obtained in Step 2, or to modify the solution so it agrees with the solution we would have obtained on the original data

This only works if the approximation allows lower bounding

What is Lower Bounding?

Lower bounding means the estimated distance in the reduced space is always less than or equal to the distance in the original space.

Raw Data (Q and S):

$$D(Q,S) \equiv \sqrt{\sum_{i=1}^{n} (q_i - s_i)^2}$$

Approximation or "Representation" (Q' and S'):

$$D_{LB}(Q',S') \equiv \sqrt{\sum_{i=1}^{M} (sr_i - sr_{i-1})(qv_i - sv_i)^2}$$

Lower bounding means that for all Q and S, we have: $D_{LB}(Q',S') \le D(Q,S)$

Lower Bounding functions are known for wavelets, Fourier, SVD, piecewise polynomials, Chebyshev Polynomials and clipped data

While there are more than 200 different symbolic or discrete ways to approximate time series, none except SAX allows lower bounding
[Figure: the SYM approximation (word aabbbccb) next to the DFT approximation, horizontal axis 0 to 120]

Why do we care so much about symbolic representations?

Symbolic Representations Allow:
Hashing
Suffix Trees
Markov Models
Stealing ideas from text processing/bioinformatics community
etc

There is one symbolic representation of time series, that allows
Lower bounding of Euclidean distance
Lower bounding of the DTW distance
Dimensionality Reduction
Numerosity Reduction

That representation is SAX
Symbolic Aggregate ApproXimation
[Figure: a time series and its SAX word baabccbc]

How do we obtain SAX?

First convert the time series to PAA representation, then convert the PAA to symbols
It takes linear time
[Figure: time series C (horizontal axis 0 to 120), its PAA approximation, and the symbols a, b, c assigned to each segment, giving the word baabccbc]

Note we made two parameter choices

The word size, in this case 8 (segments 1 to 8).
The alphabet size (cardinality), in this case 3 (symbols a, b, c).
[Figure: the same conversion, giving the word baabccbc]
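
The two-step conversion (time series to PAA, PAA to symbols) can be sketched as below. The slides do not say how the symbol boundaries are chosen; this example assumes the commonly used equiprobable breakpoints under a standard normal distribution after Z-normalization, and it assumes the series length is a multiple of the word size. The function name `sax_word` is chosen for this sketch.

```python
from statistics import NormalDist, fmean, pstdev

def sax_word(series, word_size, alphabet_size):
    """Convert a series to a SAX word: z-normalize, PAA, then discretize."""
    mean, std = fmean(series), pstdev(series)
    z = [(x - mean) / std for x in series]  # assumes a non-constant series
    seg = len(z) // word_size               # assumes an exact multiple
    paa = [fmean(z[k * seg:(k + 1) * seg]) for k in range(word_size)]
    # Assumed breakpoints: cut the standard normal into equiprobable regions.
    breakpoints = [NormalDist().inv_cdf(k / alphabet_size)
                   for k in range(1, alphabet_size)]
    letters = "abcdefghijklmnopqrstuvwxyz"
    # A segment's symbol index is the number of breakpoints below its mean.
    return "".join(letters[sum(v > b for b in breakpoints)] for v in paa)

# Hypothetical toy data: low, middle, high, middle.
word = sax_word([-2, -2, 0, 0, 2, 2, 0, 0], word_size=4, alphabet_size=3)
```

Each step is a single pass over the data, which matches the slide's remark that the conversion takes linear time.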

Visual Comparison

[Figure: one time series approximated by DFT, PLA, Haar, APCA and SAX; the SAX panel has vertical axis -3 to 3 and symbols a to f]
A raw time series of length 128 is transformed into the word "ffffffeeeddcbaabceedcbaaaaacddee".
We can use more symbols to represent the time series since each symbol requires fewer bits than real-numbers (float, double)

SAX Lower Bound to Euclidean Distance Metric

"Recall the Euclidean distance?"

$$D(Q, C) \equiv \sqrt{\sum_{i=1}^{n} (q_i - c_i)^2}$$

[Figure: Q and C, horizontal axis 0 to 100]
"Yes, here is the function that lower bounds it for SAX, it is called MINDIST"
Ĉ = bbabcbac
Q̂ = bbaccbac

$$MINDIST(\hat{Q}, \hat{C}) \equiv \sqrt{\frac{n}{w}} \sqrt{\sum_{i=1}^{w} \left( dist(\hat{q}_i, \hat{c}_i) \right)^2}$$

dist() can be implemented using a table lookup.

dist() table lookup
     a     b     c
a    0     0     0.67
b    0     0     0
c    0.67  0     0
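
The following sketch shows one way MINDIST could be computed and checked against the true Euclidean distance. It assumes equiprobable breakpoints under a standard normal distribution and the usual rule that identical or adjacent symbols are at distance zero; with those assumptions the table values are computed from the breakpoints rather than copied from the slide. All function names are chosen for this example.

```python
import math
from statistics import NormalDist, fmean, pstdev

def z_normalize(series):
    mean, std = fmean(series), pstdev(series)
    return [(x - mean) / std for x in series]  # assumes a non-constant series

def sax_symbols(z, w, a):
    """Map a z-normalized series to w symbol indices in 0..a-1 (0 means 'a')."""
    seg = len(z) // w  # assumes len(z) is a multiple of w
    beta = [NormalDist().inv_cdf(k / a) for k in range(1, a)]
    return [sum(fmean(z[k * seg:(k + 1) * seg]) > b for b in beta) for k in range(w)]

def mindist(q_sym, c_sym, n, a):
    """Lower bound on the Euclidean distance between the original length-n series."""
    beta = [NormalDist().inv_cdf(k / a) for k in range(1, a)]
    total = 0.0
    for r, c in zip(q_sym, c_sym):
        if abs(r - c) > 1:  # identical or adjacent symbols contribute 0
            total += (beta[max(r, c) - 1] - beta[min(r, c)]) ** 2
    return math.sqrt(n / len(q_sym)) * math.sqrt(total)

q = z_normalize([-2, -2, 0, 0, 2, 2, 0, 0])
c = z_normalize([2, 2, 0, 0, -2, -2, 0, 0])
q_sym, c_sym = sax_symbols(q, 4, 3), sax_symbols(c, 4, 3)
lower_bound = mindist(q_sym, c_sym, n=len(q), a=3)
true_dist = math.dist(q, c)
```

The lookup table on the slide corresponds to precomputing the breakpoint differences once per alphabet size, so that each symbol pair costs a single table access.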

OK, let us have another quick review

Data mining problems are I/O bound
The generic data mining algorithm mitigates the problem, if you can obey the lower bounding requirement.
There is one approximation of time series that is symbolic and lower bounding, SAX
Being discrete instead of real valued gives SAX some advantages (which we have yet to see)
We are finally ready to see the utility of SAX

Let us consider the utility of SAX for visualizing time series. We start with an apparent digression, visualizing DNA.

The DNA of two species

TGGCCGTGCTAGGCCCCACCCCTACCTTGCA
GTCCCCGCAAGCTCATCTGCGCGAACCAGA
ACGCCCACCACCCTTGGGTTGAAATTAAGGA
GGCGGTTGGCAGCTTCCCAGGCGCACGTAC
CTGCGAATAAATAACTGTCCGCACAAGGAGC
CCGACGATAGTCGACCCTCTCTAGTCACGAC
CTACACACAGAACCTGTGCTAGACGCCATGA
GATAAGCTAACACAAAAACATTTCCCACTAC
TGCTGCCCGCGGGCTACCGGCCACCCCTGG
CTCAGCCTGGCGAAGCCGCCCTTCA

CCGTGCTAGGGCCACCTACCTTGGTCCG
CCGCAAGCTCATCTGCGCGAACCAGAAC
GCCACCACCTTGGGTTGAAATTAAGGAG
GCGGTTGGCAGCTTCCAGGCGCACGTAC
CTGCGAATAAATAACTGTCCGCACAAGG
AGCCGACGATAAAGAAGAGAGTCGACCT
CTCTAGTCACGACCTACACACAGAACCT
GTGCTAGACGCCATGAGATAAGCTAACA

Are they similar?

l stands for Level

l = 1
A  C
G  T

l = 2
AA  AC  CA  CC
AG  AT  CG  CT
GA  GC  TA  TC
GG  GT  TG  TT

l = 3
AAA  AAC  ACA  ACC  CAA  CAC  CCA  CCC
AAG  AAT  ACG  ACT  CAG  CAT  CCG  CCT
AGA  AGC  ATA  ATC  CGA  CGC  CTA  CTC
AGG  AGT  ATG  ATT  CGG  CGT  CTG  CTT
GAA  GAC  GCA  GCC  TAA  TAC  TCA  TCC
GAG  GAT  GCG  GCT  TAG  TAT  TCG  TCT
GGA  GGC  GTA  GTC  TGA  TGC  TTA  TTC
GGG  GGT  GTG  GTT  TGG  TGT  TTG  TTT

Level 1 values shown for the sequence below:
0.20  0.24
0.26  0.30

CCGTGCTAGGGCCACCTACCTTGGTCCG
CCGCAAGCTCATCTGCGCGAACCAGAAC
GCCACCACCTTGGGTTGAAATTAAGGAG
GCGGTTGGCAGCTTCCAGGCGCACGTAC
CTGCGAATAAATAACTGTCCGCACAAGG
AGCCGACGATAAAGAAGAGAGTCGACCT
CTCTAGTCACGACCTACACACAGAACCT
GTGCTAGACGCCATGAGATAAGCTAACA
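
The grid layout above is recursive: the first letter of a subword picks a quadrant (A top-left, C top-right, G bottom-left, T bottom-right), the next letter picks a quadrant within it, and so on. A minimal sketch of filling such a grid with subword frequencies follows. Counting overlapping subwords and normalizing by the total count are assumptions of this example, as is the function name `dna_bitmap`; the input is assumed to contain only the letters A, C, G, T and to be at least `level` letters long.

```python
def dna_bitmap(sequence, level):
    """Return a 2**level by 2**level grid of subword frequencies."""
    size = 2 ** level
    counts = [[0] * size for _ in range(size)]
    # (row, column) offset of each letter inside a quadrant
    offsets = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}
    total = 0
    for start in range(len(sequence) - level + 1):
        row = col = 0
        for letter in sequence[start:start + level]:
            d_row, d_col = offsets[letter]
            row, col = 2 * row + d_row, 2 * col + d_col  # descend one level
        counts[row][col] += 1
        total += 1
    return [[value / total for value in line] for line in counts]

level_1 = dna_bitmap("ACGT", 1)
level_2 = dna_bitmap("CACA", 2)  # subwords: CA, AC, CA
```

Mapping each frequency to a color then gives the bitmap; two sequences can be compared by comparing their grids cell by cell.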

[Figure: the level 2 frequencies of the sequence below rendered as a colored bitmap on a scale from 0 to 1; the recoverable cell values are 0.02 0.04 0.09 0.04 / 0.03 0.07 0.02 / 0.11 0.03]

CCGTGCTAGGCCCCACCCCTACCTTGCA
GTCCCCGCAAGCTCATCTGCGCGAACCA
GAACGCCCACCACCCTTGGGTTGAAATT
AAGGAGGCGGTTGGCAGCTTCCCAGGCG
CACGTACCTGCGAATAAATAACTGTCCGC
ACAAGGAGCCCGACGATAGTCGACCCTC
TCTAGTCACGACCTACACACAGAACCTG
TGCTAGACGCCATGAGATAAGCTAACA

"OK. Given any DNA string I can make a colored bitmap, so what?"

Two Questions

Can we do something similar for time series?
Would it be useful?

[Figure: bitmaps for several species]
Note Elephas maximus is the Indian Elephant, Loxodonta africana is the African elephant, Pan troglodytes is the chimpanzee
We call these bitmaps Intelligent Icons

Can we make bitmaps for time series?

Yes, with SAX!
[Figure: a time series (horizontal axis 0 to 120, vertical axis -1.5 to 1.5) discretized with the four symbols A, C, G, T, giving the word GTTGACCA]

Level 2 layout:
AA  AC  CA  CC
AG  AT  CG  CT
GA  GC  TA  TC
GG  GT  TG  TT

Time Series Bitmap

While they are all examples of EEGs, example_a.dat is from a normal trace, whereas the others contain examples of spike-wave discharges.

We can further enhance the time series bitmaps by arranging the thumbnails by "cluster", instead of arranging by date, size, name etc
We can achieve this with MDS.
[Figure: One Year of Italian Power Demand, January to December (vertical axis 0 to 300), with August marked; thumbnails Jan.txt, Feb.txt, March.txt, April.txt, May.txt, June.txt, July.txt, August.txt, Sept.txt, Oct.txt, Nov.txt, Dec.txt arranged by cluster]

A well known dataset, Kalpakis_ECG, allegedly contains only ECGs
If we view them as time series bitmaps, a handful stand out
[Figure: thumbnails normal1.txt to normal18.txt]

Some of the data are not heartbeats! They are the action potential of a normal pacemaker cell
[Figure: thumbnails normal1.txt to normal18.txt next to two traces (horizontal axis 0 to 500); labels on the action potential include ventricular depolarization, plateau stage, initial rapid repolarization, and repolarization/recovery phase]

We can test how much useful information is retained in the bitmaps by using only the bitmaps for clustering
[Figure: dendrogram of datasets 1 to 20]

Data Key
Cluster 1 (datasets 1 ~ 5): BIDMC Congestive Heart Failure Database (chfdb): record chf02. Start times at 0, 82, 150, 200, 250, respectively
Cluster 2 (datasets 6 ~ 10): BIDMC Congestive Heart Failure Database (chfdb): record chf15. Start times at 0, 82, 150, 200, 250, respectively
Cluster 3 (datasets 11 ~ 15): Long Term ST Database (ltstdb): record 20021. Start times at 0, 50, 100, 150, 200, respectively
Cluster 4 (datasets 16 ~ 20): MIT-BIH Noise Stress Test Database (nstdb): record 118e6. Start times at 0, 50, 100, 150, 200, respectively

Think of the implications of this, these animals have 3 billion base pairs each, but 64 numbers are enough to cluster them

Intelligent Icons are scale invariant (fractal)
[Figure: Intelligent Icons at l=1, l=2, l=3 and l=4 for Argulus americanus (crustacean) and Homo sapiens (human)]
A dendrogram for 12 mammals created using only the information contained in their 8 by 8 Intelligent Icons. The dendrogram agrees with the modern consensus except the two bifurcations marked with red dots are in the wrong order.
[Figure: dendrogram labels: Placental Mammals, Laurasiatheres, Afrotheres, Perissodactyla, Cetartiodactyla, Cetacea, Primates, Hominidae, Cercopithecidae, Homo/Pan/Gorilla group, Pongo, Pan]

Bitmaps can be used for anomaly detection..

[Figure: an ECG with a Lag window and a Lead window, each summarized by a bitmap]
Here is a Premature Ventricular Contraction (PVC)
Here the bitmaps are almost the same.
Here the bitmaps are very different. This is the most unusual section of the time series, and it coincides with the PVC.

Time Series Motif Discovery (finding repeated patterns)

Winding Dataset (The angular speed of reel 2)
[Figure: the time series, horizontal axis 0 to 2500]
Are there any repeated patterns, of about this length, in the above time series?
[Figure: three occurrences of the repeated pattern, each on a horizontal axis 0 to 140]

Why Find Motifs? I

Finding motifs in motion capture allows efficient editing of special effects, and can be used to allow more natural interactions with video games
Tanaka, Y. & Uehara, K.
Araki, Arita and Taniguchi
Celly, B. & Zordan, V. B.
To see the full video go to.. www.cs.ucr.edu/~eamonn/SIGKDD07/UniformScaling.html
Or search YouTube for "Time series motifs"

Why Find Motifs? II

Mining association rules in time series requires the discovery of motifs. These are referred to as primitive shapes and frequent patterns.
Several time series classification algorithms work by constructing typical prototypes of each class. These prototypes may be considered motifs.
Many time series anomaly/interestingness detection algorithms essentially consist of modeling normal behavior with a set of typical shapes (which we see as motifs), and detecting future patterns that are dissimilar to all typical shapes.
In robotics, Oates et al. have introduced a method to allow an autonomous agent to generalize from a set of qualitatively different experiences gleaned from sensors. We see these "experiences" as motifs. See also Murakami Yoshikazu, Doki & Okuma and Maja J Mataric
In medical data mining, Caraca-Valente and Lopez-Chavarrias have introduced a method for characterizing a physiotherapy patient's recovery based on the discovery of similar patterns. Once again, we see these "similar patterns" as motifs.

[Figure: Space Shuttle STS-57 Telemetry (Inertial Sensor), horizontal axis 0 to 1000, showing a time series T, a subsequence C and its Trivial Matches]

Definition 1. Match: Given a positive real number R (called range) and a time series T containing a subsequence C beginning at position p and a subsequence M beginning at q, if D(C, M) ≤ R, then M is called a matching subsequence of C.

Definition 2. Trivial Match: Given a time series T, containing a subsequence C beginning at position p and a matching subsequence M beginning at q, we say that M is a trivial match to C if either p = q or there does not exist a subsequence M' beginning at q' such that D(C, M') > R, and either q < q' < p or p < q' < q.

Definition 3. K-Motif(n,R): Given a time series T, a subsequence length n and a range R, the most significant motif in T (hereafter called the 1-Motif(n,R)) is the subsequence C1 that has the highest count of non-trivial matches (ties are broken by choosing the motif whose matches have the lower variance). The Kth most significant motif in T (hereafter called the K-Motif(n,R)) is the subsequence CK that has the highest count of non-trivial matches, and satisfies D(CK, Ci) > 2R, for all 1 ≤ i < K.

OK, we can define motifs, but how do we find them?

The obvious brute force search algorithm is just too slow
The most referenced algorithm is based on a hot idea from bioinformatics, random projection*, and the fact that SAX allows us to lower bound discrete representations of time series.
* J Buhler and M Tompa. Finding motifs using random projections. In RECOMB'01. 2001.

A simple worked example of the motif discovery algorithm

The next 4 slides

Assume that we have a time series T of length 1,000, and a motif of length 16, which occurs twice, at time T1 and time T58.
[Figure: T (m = 1000), horizontal axis 0 to 1000, with the subsequence C1 and its SAX word Ĉ1 = a c b a]
Parameters: a = 3 {a,b,c}, n = 16, w = 4

The SAX words Ŝ of all subsequences, one row per starting position (columns 1 to 4):
1    a c b a
2    b c a b
:    : : : :
58   a c c a
:    : : : :
985  b c c c

A mask {1,2} was randomly chosen, so the values in columns {1,2} were used to project matrix into buckets.
[Figure: buckets; rows 1 and 58 share one bucket, rows 2 and 985 share another, row 457 is in a bucket of its own]
Collisions are recorded by incrementing the appropriate location in the collision matrix
[Figure: collision matrix indexed by 1, 2, :, 58, :, 985; entry (58, 1) = 1 and entry (985, 2) = 1]

A mask {2,4} was randomly chosen, so the values in columns {2,4} were used to project matrix into buckets.
[Figure: the same SAX matrix (rows 1: a c b a, 2: b c a b, 58: a c c a, 985: b c c c); rows 1 and 58 again share a bucket]
Once again, collisions are recorded by incrementing the appropriate location in the collision matrix
[Figure: collision matrix; entry (58, 1) = 2 and entry (985, 2) = 1]

We can now use the information in the collision matrix as a heuristic to hunt for likely motifs.
We can use lower bounding to discover at what point that hunt is fruitless
[Figure: collision matrix after many iterations, indexed by 1, 2, :, 58, :, 985, with entries such as 27, 3, 2 and 1]

This is a good example of the Generic Data Mining Algorithm

The Generic Data Mining Algorithm
1. Create an approximation of the data, which will fit in main memory, yet retains the essential features of interest
2. Approximately solve the problem at hand in main memory
3. Make (hopefully very few) accesses to the original data on disk to confirm the solution obtained in Step 2, or to modify the solution so it agrees with the solution we would have obtained on the original data
But which approximation should we use?
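
The bucket-and-count step of the worked example can be sketched as follows. This covers only the collision counting, using the four SAX words shown in the example; it omits trivial-match handling and the final verification against the raw data, and the function name `count_collisions` and the dictionary-based collision "matrix" are assumptions of this sketch.

```python
from collections import defaultdict
from itertools import combinations

def count_collisions(words, masks):
    """Count, for every pair of positions, how often their masked words collide."""
    collisions = defaultdict(int)
    for mask in masks:
        buckets = defaultdict(list)
        for position, word in words.items():
            # Masks use 1-based column numbers, as in the worked example.
            key = "".join(word[k - 1] for k in mask)
            buckets[key].append(position)
        for members in buckets.values():
            for p, q in combinations(sorted(members), 2):
                collisions[(p, q)] += 1
    return dict(collisions)

# SAX words from the worked example (position -> word).
words = {1: "acba", 2: "bcab", 58: "acca", 985: "bccc"}
collisions = count_collisions(words, masks=[(1, 2), (2, 4)])
```

After the two masks, the pair (1, 58) has collided twice and the pair (2, 985) once, matching the example. In the full algorithm the masks would be drawn at random over many iterations, and pairs with the highest counts would be checked first on the original data.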

A Simple Experiment

Let us embed two motifs into a random walk time series, and see if we can recover them
[Figure: Planted Motifs; a random walk (horizontal axis 0 to 1200) with planted occurrences labeled A, B, C and D, and the recovered motif pairs plotted on a horizontal axis 0 to 120]

Shape Motifs I

[Figure: Limenitis archippus and Danaus plexippus; each shape is converted to a time series and then to SAX symbols (a flat region is matched by c)]

Shape Motifs II

We can find shape motifs with only minor modifications:

When converting shape to SAX, try all rotations to find the best fit.

Place every circular shift of the SAX word in the projection matrix.

[Figure: for shape Si with SAX word bacb, the rows bacb, acbb, cbba and bbac are all placed in the projection matrix; for shape Sj, the rows bcbb, cbbb, bbbc and bbcb]

Time improvement over BruteForce

[Figure: average running time as a fraction of brute force (0 to 1) against dataset size (500, 1000, 2000, 4000)]

[Figure: motif examples: Staurastrum tetracerum; graffiti (Vegas Motif); a watershed segmented image, segment one and segment two]

Giorgio Morandi (1890-1964)
Through his simple and repetitive motifs Morandi became an important forerunner of Minimalism. (Wikipedia)
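The circular shift step for shape motifs is easy to make concrete. The sketch below is a minimal illustration in Python; the function names are hypothetical and the words are the ones shown on the slide.

```python
def circular_shifts(word):
    """Return every circular shift of a SAX word, starting with the word itself."""
    return [word[i:] + word[:i] for i in range(len(word))]

def match_under_rotation(word_a, word_b):
    """True if word_b equals some circular shift of word_a."""
    return len(word_a) == len(word_b) and word_b in circular_shifts(word_a)

# Rows that shape Si (SAX word "bacb") contributes to the projection matrix.
rows_for_si = circular_shifts("bacb")
```

Placing all shifts in the matrix multiplies the number of rows by the word length, which is the price paid for rotation invariance at the symbolic level.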

Image Discords

What is the most unusual shape in this collection? This one!

Shape Discord: Given a collection of shapes S, the shape D is the discord of S if D has the largest distance to its nearest match. That is, for every other shape C in S, with MC the nearest match of C and MD the nearest match of D, Dist(D, MD) > Dist(C, MC).

This one is even more subtle. [Figure: 1st Discord]

Here is a subset of a large collection of petroglyphs. Only one image shows an arrow stuck into the sheep. [Figure: 1st Discord]

Image discords are potentially useful in many domains

Most arrowheads are symmetric, but... [Figure: cornertang-dcc.JPG, 1st Discord]

Most red blood cells are round [Figure: cells A to E, 1st Discord (Dacrocyte)]

Finding Image Discords

0    2    4.2  1.1  2.3  8.5
2    0    3    3.2  3.5  8.2
4.2  3    0    1.2  9.2  9.7
1.1  3.2  1.2  0    0.1  7.5
2.3  3.5  9.2  0.1  0    7.6
8.5  8.8  9.7  7.5  7.6  0

Smallest non-diagonal value in each column:
1.1  2    1.2  0.1  0.1  7.5

The code says: Find the smallest (non diagonal) value in each column, the largest of these is the discord.

Function [ dist, loc ] = Discord_Search(S)
  best_so_far_dist = 0
  best_so_far_loc = NaN
  for p = 1 to size (S)                        // begin outer loop
    nearest_neighbor_dist = infinity
    for q = 1 to size (S)                      // begin inner loop
      if p != q                                // Don't compare to self
        if RD(Cp , Cq ) < nearest_neighbor_dist
          nearest_neighbor_dist = RD(Cp , Cq )
        end
      end
    end                                        // end inner loop
    if nearest_neighbor_dist > best_so_far_dist
      best_so_far_dist = nearest_neighbor_dist
      best_so_far_loc = p
    end
  end                                          // end outer loop
  return [ best_so_far_dist, best_so_far_loc ]
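As a sanity check, the brute force search can be run directly on the distance matrix from the slide. The following Python sketch assumes the distance RD(Cp, Cq) is read from column p, row q of that matrix; the names `discord_search`, `M` and `dist` are illustrative, and indices are 0-based, so the sixth column appears as 5.

```python
import math

def discord_search(n, dist):
    """Brute force discord search over n objects; dist(p, q) plays the role of RD."""
    best_so_far_dist, best_so_far_loc = 0.0, None
    for p in range(n):                       # outer loop
        nearest_neighbor_dist = math.inf
        for q in range(n):                   # inner loop
            if p != q:                       # don't compare to self
                nearest_neighbor_dist = min(nearest_neighbor_dist, dist(p, q))
        if nearest_neighbor_dist > best_so_far_dist:
            best_so_far_dist, best_so_far_loc = nearest_neighbor_dist, p
    return best_so_far_dist, best_so_far_loc

# Distance matrix copied from the slide.
M = [
    [0,   2,   4.2, 1.1, 2.3, 8.5],
    [2,   0,   3,   3.2, 3.5, 8.2],
    [4.2, 3,   0,   1.2, 9.2, 9.7],
    [1.1, 3.2, 1.2, 0,   0.1, 7.5],
    [2.3, 3.5, 9.2, 0.1, 0,   7.6],
    [8.5, 8.8, 9.7, 7.5, 7.6, 0],
]

def dist(p, q):
    return M[q][p]  # column p, row q

result = discord_search(len(M), dist)
```

The sixth object has the largest nearest-neighbor distance (7.5), matching the column minima listed on the slide.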

Finding Discords, Fast

0    2    4.2  1.1  2.3  8.5
2    0    3    3.2  3.5  8.2
4.2  3    0    1.2  9.2  9.7
1.1  3.2  1.2  0    0.1  7.5
2.3  3.5  9.2  0.1  0    7.6
8.5  8.8  9.7  7.5  7.6  0

Function [ dist, loc ] = Heuristic_Search(S, Outer, Inner )
  best_so_far_dist = 0
  best_so_far_loc = NaN
  for each index p given by heuristic Outer    // begin outer loop
    nearest_neighbor_dist = infinity
    for each index q given by heuristic Inner  // begin inner loop
      if p != q
        if RD(Cp , Cq ) < nearest_neighbor_dist
          nearest_neighbor_dist = RD(Cp , Cq )
        end
        if RD(Cp , Cq ) < best_so_far_dist
          break                                // Cp cannot be the discord; break out of inner loop
        end
      end
    end                                        // end inner loop
    if nearest_neighbor_dist > best_so_far_dist
      best_so_far_dist = nearest_neighbor_dist
      best_so_far_loc = p
    end
  end                                          // end outer loop
  return [ best_so_far_dist, best_so_far_loc ]

The code now says: if, while searching a given column, you find a distance less than best_so_far_dist, then that column cannot have the discord.

The code also uses heuristics to order the search.

The Magic Heuristics

In the outer loop, visit the columns in order of the Discord score.

In the inner loop, visit the row cells in order of nearest neighbor first.

The Magic Heuristics would reduce the time complexity from O(n^2) to just O(n)!
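To see how much the ordering matters, the sketch below runs the early-abandoning search on the slide's distance matrix and counts distance evaluations. It is an illustration only: the "magic" orderings here are computed by cheating (they use the full matrix, which a real algorithm does not have), and the names `heuristic_search`, `magic_outer` and `magic_inner` are hypothetical.

```python
import math

M = [
    [0,   2,   4.2, 1.1, 2.3, 8.5],
    [2,   0,   3,   3.2, 3.5, 8.2],
    [4.2, 3,   0,   1.2, 9.2, 9.7],
    [1.1, 3.2, 1.2, 0,   0.1, 7.5],
    [2.3, 3.5, 9.2, 0.1, 0,   7.6],
    [8.5, 8.8, 9.7, 7.5, 7.6, 0],
]
n = len(M)

def dist(p, q):
    return M[q][p]  # column p, row q, as on the slide

def heuristic_search(outer, inner):
    """Discord search with early abandoning; also returns the number of distance calls."""
    best_dist, best_loc, calls = 0.0, None, 0
    for p in outer:
        nearest = math.inf
        for q in inner(p):
            if p == q:
                continue
            d = dist(p, q)
            calls += 1
            nearest = min(nearest, d)
            if d < best_dist:
                break  # p has a neighbor closer than the best discord so far
        if nearest > best_dist:
            best_dist, best_loc = nearest, p
    return best_dist, best_loc, calls

def nn_dist(p):
    return min(dist(p, q) for q in range(n) if q != p)

# "Magic" orderings: largest discord score first, nearest neighbor first.
magic_outer = sorted(range(n), key=nn_dist, reverse=True)

def magic_inner(p):
    return sorted((q for q in range(n) if q != p), key=lambda q: dist(p, q))

magic = heuristic_search(magic_outer, magic_inner)
naive = heuristic_search(range(n), lambda p: range(n))
```

On this toy matrix the magic ordering needs 10 distance evaluations (n - 1 for the true discord, then one per remaining column), against 30 for the full brute force scan.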

Approximately Magic Heuristics

We can try to approximate Magic.

Visiting the columns in approximately the order of the Discord score is still very helpful.

For the inner loop, we don't really need to visit the rows in order of nearest neighbor first, so long as we find a near enough neighbor early on.

[Figure: Image 1 is converted to Time Series 1 and then to the SAX word caa (rotation invariance ignored here). The SAX words of observations 1 to m are inserted into an array and into an augmented trie.]
How Fast is Approximately Magic?

On a problem dataset of arrowheads:

If we only see 200 arrowheads, we do 21.8% more work than the Magic algorithm.
For larger arrowhead datasets we get even closer to the Magic algorithm.
In other words, we are doing O(n) work, not O(n^2) work.
Empirically we see similar results for other datasets, but in pathological datasets, we can still be forced to do O(n^2) work.

[Figure: Projectile Points; vertical axis from 0 to 1, against the number of time series in the database (m), for Brute Force, Random, Approx. and Optimal orderings]

Which is the odd man out in this collection of Red Passion Flower Butterflies?

One of them is not a Red Passion Flower Butterfly. A fact that can be discovered by finding the shape discord.

[Figure: Heliconius melpomene (The Postman) and Heliconius erato (Red Passion Flower Butterfly), with their time series (0 to 900)]

Nematode Discords

Though 20,000 species have been classified it is estimated that this number might be upwards of 500,000 if all were known. (Wikipedia)

[Figure: nematode shapes A to F and their time series (0 to 1000), with the 1st Discord marked]

Drosophila melanogaster

A subset of 32,028 images of Drosophila wings. [Figure: wings A, B and C, with the 1st Discord marked]

Fungus Images

Some spores produced by a rust (fungus) known as Gymnosporangium, which is a parasite of apple and pear trees. Note that one spore has sprouted an appendage known as a germ tube, and is thus singled out as the discord.

[Figure: spores A to G, with G the discord]

Time Series Discords

[Figure: Typical Week from the Dutch Power Demand Dataset (0 to 700)]

[Figure: one year's power demand at a Dutch research facility, January to December, with the 1st, 2nd and 3rd Discords marked. Labeled weeks: Liberation Day, Ascension Thursday, Good Friday, Easter Sunday, Sunday Dec 25]

Discords in Medical Data

A cardiologist noted subtle anomalies in this dataset. Let us see if the discord algorithm can find them.

[Figure: Record qtdbsele0606 from the PhysioBank QT Database (qtdb), 0 to 2500, with a zoom-in (0 to 500) on the ST wave]

How was the discord able to find this very subtle premature ventricular contraction? Note that in the normal heartbeats, the ST wave increases monotonically; it is only in the premature ventricular contractions that there is an inflection. NB, this is not necessarily true for all ECGs.

Sleep Cycles

[Figure: respiration time series (0 to 2000), segmented into: Stage II sleep; Eyes closed, awake or stage I sleep; Eyes open, awake. Annotation: Shallow breaths as waking cycle begins]

A time series showing a patient's respiration (measured by thorax extension), as they wake up. A medical expert, Dr. J. Rittweger, manually segmented the data. The 1-discord is a very obvious deep breath taken as the patient opened their eyes. The 2-discord is much more subtle and impossible to see at this scale. A zoom-in suggests that Dr. J. Rittweger noticed a few shallow breaths that indicated the transition of sleeping stages.

Institute for Physiology. Free University of Berlin. Data shows respiration (thorax extension), sampling rate 10 Hz.

Discords in Space Shuttle Marotta Valve Series

Example One
Space Shuttle Marotta Valve Series (0 to 5000)
Poppet pulled significantly out of the solenoid before energizing.
The De-Energizing phase is normal.

Example Two
Space Shuttle Marotta Valve Series (0 to 5000)
Poppet pulled significantly out of the solenoid before energizing.

This discord is subtle, let's zoom in to see why it is a discord.

[Figure: zoom-in on 1500 to 2500 and on 0 to 100, showing the Discord (poppet pulled out of the solenoid before energizing) against the corresponding section of other cycles]

Open Problems

Let us finish with a brief discussion of some open problems worthy of study

Spatially Constrained/Informed Mining of Shapes

[Figure: Iguania: Phrynosoma hernandesi, Phrynosoma douglassii, Phrynosoma taurus, Phrynosoma ditmarsi, Phrynosoma mcallii, with their time series (0 to 1200); Geographic Range of Phrynosoma coronatum]

[Figure: arrowhead types by location (Baker California; Blythe, California): Elko (highly variable), Elko (little variation), Rosegate (little variation), Rosegate (highly variable)]

Assessing the Significance of Motifs/Discords

The motif and discord algorithms always return some answer, but is the result interesting, or something we should have expected by chance?

In a large string database, like this ABBANBCJSMBAVSMABG.. would it be more interesting to find

A motif pair {ABBA, ABBA}
A motif pair {ABBAACCC, ABBBCCCC}

(i.e. shorter but perfect or longer with some misspellings)

Annotation of Historical Manuscripts

[Figure: British Desmidiaceae, vol. 2 (1905), plate 41, fig. 5, matched to Micrasterias oscitans]

This match is based on shape only, the color and texture offer independent evidence

Applications!

[Figure: Beet Leafhopper, Circulifer tenellus; time series annotated with Probing and End Probing]

Mining Web Logs

Search Engine Query Log

[Figure: query frequency from 2000 to 2002 for three queries: LeTour, Tour De France, Lance Armstrong. Example by M. Vlachos]

It makes sense that the bursts for LeTour, Tour de France and Lance Armstrong are all related.

But what caused the extra interest in Lance Armstrong in August/September 2000?

The Last Word

The sun is setting on all other symbolic representations of time series, SAX is the only way to go

We are done!

We have seen that SAX is a very useful tool for solving problems in shape and time series data mining. I will be happy to answer any questions

Eamonn Keogh: UCR
eamonn@cs.ucr.edu

What are the disadvantages of using SAX? There are Nun

Thanks to my students
Chotirat (Ann) Ratanamahatana (Chulalongkorn University)
Dragomir Yankov (UCR)
Li Wei (Google)
Jessica Lin (George Mason University)
Xiaopeng Xi (Yahoo)

Appendix A

Converting a long time series to a time series bitmap (Intelligent Icon)

>> x=random_walk(40,1);
Just create random walk of length 40 for testing.

>> timeseries2symbol(x, 16, 8, 4)
Convert to SAX, with a sliding window of length 16, a word size of 8 and a cardinality of 4.

ans =
4 3 2 3 1 1 3 2
4 2 3 2 1 2 3 3
3 2 3 1 1 4 2 4
2 3 2 1 2 2 3 4
2 2 1 1 3 2 3 4
2 1 1 2 2 2 4 4
2 1 1 3 1 3 4 4
1 1 2 2 2 4 4 3
1 1 3 1 3 4 4 2
1 2 2 2 4 4 3 2
1 2 1 3 4 4 2 2
1 2 2 4 4 3 2 1
3 1 3 4 4 2 2 1
2 2 4 4 3 2 1 1
3 3 4 4 2 2 1 1
3 4 4 3 3 2 1 1
3 4 4 2 2 1 1 1
4 4 3 3 2 1 1 1
4 4 3 3 2 2 1 1
4 3 3 2 2 2 1 1
4 4 3 2 2 1 1 2
4 4 3 2 2 1 1 3
4 4 2 2 1 1 2 3
4 3 2 2 1 1 3 3

[Figure: the random walk x (0 to 40) with the first three SAX words, 4 3 2 3 1 1 3 2, 4 2 3 2 1 2 3 3 and 3 2 3 1 1 4 2 4, drawn over their sliding windows]

I have converted to DNA for visual clarity. Obviously we don't really need to do this.

ans =
G T C T A A T C
G C T C A C T T
T C T A A G C G
C T C A C C T G
C C A A T C T G
C A A C C C G G
C A A T A T G G
A A C C C G G T
A A T A T G G C
A C C C G G T C
A C A T G G C C
A C C G G T C A
T A T G G C C A
A C G G T C A A
C T G G C C A A
T G G T T C A A
T G G C C A A A
G G T T C A A A
G G T T C C A A
G T T C C C A A
G G T C C A A C
G G T C C A A T
G G C C A A C T
G T C C A A T T

Count the frequency of all pairs of basepairs. Below I have just done AA and AC. Assign the results to a matrix z.

Layout of z:
AA AC CA CC
AG AT CG CT
GA GC TA TC
GG GT TG TT

z =
19 10 CA CC
AG AT CG CT
GA GC TA TC
GG GT TG TT
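The counting step can be sketched as follows. This Python fragment is an illustration, not the tutorial's MATLAB code: the function names are hypothetical, the grid layout is the one shown for z above, and the five sample rows are a short example rather than the full output. Pairs are counted within each SAX word only, never across two rows.

```python
from itertools import product

# Cell layout of the 4 x 4 bitmap, as shown for z.
LAYOUT = [
    ["AA", "AC", "CA", "CC"],
    ["AG", "AT", "CG", "CT"],
    ["GA", "GC", "TA", "TC"],
    ["GG", "GT", "TG", "TT"],
]

def count_pairs(rows, alphabet="ACGT"):
    """Count adjacent symbol pairs inside each row (never across rows)."""
    counts = {a + b: 0 for a, b in product(alphabet, repeat=2)}
    for row in rows:
        for a, b in zip(row, row[1:]):
            counts[a + b] += 1
    return counts

def to_bitmap(counts):
    """Arrange the pair counts in the slide's 4 x 4 layout."""
    return [[counts[pair] for pair in line] for line in LAYOUT]

# A short sample of SAX words written as DNA, one string per row.
rows = ["GTCTAATC", "GCTCACTT", "TCTAAGCA", "ATCACCTG", "CCAATCTG"]
counts = count_pairs(rows)
z = to_bitmap(counts)
```

In the sample, the third row ends in A and the fourth begins with A, but that pair is deliberately not counted as an AA.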

We need to normalize the matrix z. Below is one way to do it, such that the min value is 0 and the max value is 1 (matlab code). There may be better ways to normalize...

>> z=(z-min(min(z)));
>> z=(z/max(max(z)))

z =
1.0000 0.9369 0.8618 0.9696
0.2282 0.7982 0.4575 0.7725
0.6315 0.4701 0.6407 0.1693
0.5018 0.0000 0.8302 0.4156

Map to some colormap; I have done some of the work below...

[Figure: the normalized z drawn as a 4 x 4 grid of colored cells, with a colorbar running from 0 to 1]
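For readers without MATLAB, the same min-max normalization can be written in plain Python. This is a minimal sketch: the function name `normalize` and the small count matrix are made up for illustration, and the guard for a constant matrix is an added assumption, not something the slide discusses.

```python
def normalize(z):
    """Rescale a matrix (list of lists) so its smallest entry is 0 and its largest is 1."""
    lowest = min(min(row) for row in z)
    shifted = [[value - lowest for value in row] for row in z]
    highest = max(max(row) for row in shifted)
    if highest == 0:
        return shifted  # constant matrix: everything maps to 0
    return [[value / highest for value in row] for row in shifted]

# Hypothetical pair counts, not taken from the slides.
example_counts = [[19, 10], [1, 4]]
normalized = normalize(example_counts)
```

Each normalized value can then be looked up in whatever colormap is at hand to draw the icon.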

Hints I

ans =
G T C T A A T C
G C T C A C T T
T C T A A G C A
A T C A C C T G
C C A A T C T G

When counting patterns, don't count patterns that span two lines. For example, don't count the underlined As (the A that ends line 3 and the A that begins line 4) as an occurrence of AA.

Hints II

ans =
G T C T A A T C
G T C T A A T C
G C T C A C T T
T C T A A G C A
A T C A C C T G
C C A A T C T G

Note that here lines 1 and 2 are the same. This can happen a lot, especially with smooth time series and/or a high compression ratio.

The SAX code has an extra parameter that removes these redundant lines. It seems like this makes the Intelligent Icons work better, and it does make the code run a little faster.

Hints III

For Intelligent Icon the cardinality must be 4.

But what is the best sliding window length?

What is the best word size?

At the moment there is no answer to this other than playing with the data (or CV if you have labeled data).

The good news is that once you find good settings for your domain (say ECGs) then the settings should work for all ECGs.

Heuristics:

The sliding window length should be about twice the length of the natural scale at which the data is interesting. For example, about two heartbeats for cardiology, or for power demand, about two days.

The smoother the data, the smaller you can make the word size.

[Figure: the random walk (0 to 40) annotated with the sliding window length and word size = 8, and the SAX words 4 3 2 3 1 1 3 2, 4 2 3 2 1 2 3 3 and 3 2 3 1 1 4 2 4]

Appendix: DTW

There are some critical facts about the size of the warping window r.

r can vary from 0% (the special case of Euclidean distance) to 100% (the special case of full DTW).

Without lower bounding, the time taken is approximately linear in r, so r = 5% is about twice as fast as r = 10%.

With lower bounding, the time taken is highly non-linear in r, so r = 5% is perhaps 10 to 100 times as fast as r = 10%.

In general (empirically measured over 35 datasets) the following is true.

If you start with r = 0 and you make it larger, the accuracy improves, then gets worse (see the two examples for FACE and GUN in this tutorial, but it is true for other datasets).

The best accuracy tends to be at a relatively small value for r (usually just 2 to 5%).

For any dataset, the best value for r depends on the size of the training set. For example, for CBF with just 20 instances, you might need r = 8%, but with 200 instances you only need 1 or 2%, and with 2,000 instances, you need r = 0% (the Euclidean distance).

How do you find the best choice for r? Use cross-validation to test for the best value.

See [a] and [b]

[a] Xiaopeng Xi, Eamonn Keogh, Christian Shelton, Li Wei & Chotirat Ann Ratanamahatana (2006). Fast Time Series Classification Using Numerosity Reduction. ICML.
[b] Ratanamahatana, C. A. and Keogh, E. (2004). Everything you know about Dynamic Time Warping is Wrong. Third Workshop on Mining Temporal and Sequential Data, in conjunction with the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2004), August 22-25, 2004, Seattle, WA.
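The role of the warping window r can be illustrated with a small constrained DTW. The Python sketch below is written under assumptions of its own: r is passed as a fraction (0.0 to 1.0) of the longer series' length, the band is the usual Sakoe-Chiba band, squared differences are accumulated and the square root is taken at the end, and no lower bounding is used. The function name `dtw` and the two toy series are illustrative.

```python
import math

def dtw(a, b, r):
    """DTW distance with a warping window of r (fraction of the longer series length)."""
    n, m = len(a), len(b)
    # Window width in samples; it must at least cover the length difference.
    w = max(int(round(r * max(n, m))), abs(n - m))
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j],      # step in a only
                                 cost[i][j - 1],      # step in b only
                                 cost[i - 1][j - 1])  # step in both
    return math.sqrt(cost[n][m])

a = [0, 1, 2, 1, 0]
b = [0, 0, 1, 2, 1]  # the same bump, shifted by one sample

euclidean = dtw(a, b, 0.0)  # r = 0%: only the diagonal is allowed
full = dtw(a, b, 1.0)       # r = 100%: unconstrained DTW
```

With r = 0 the band collapses to the diagonal and the result is the Euclidean distance; widening the band lets the shifted bump align, at the cost of filling in more cells, which is why the running time without lower bounding grows roughly linearly in r.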
