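Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections

Douglass R. Cutting^1, David R. Karger^{1,2}, Jan O. Pedersen^1, John W. Tukey^{1,3}

1 Xerox Palo Alto Research Center, 3333 Coyote Hill Road, Palo Alto
2 Stanford University
3 Princeton University

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.

Abstract

Document clustering has not been well received as an information retrieval tool. Objections to its use fall into two main categories: first, that clustering is too slow for large corpora (with running time often quadratic in the number of documents); and second, that clustering does not appreciably improve retrieval.

We argue that these problems arise only when clustering is used in an attempt to improve conventional search techniques. However, looking at clustering as an information access tool in its own right obviates these objections, and provides a powerful new access paradigm. We present a document browsing technique that employs document clustering as its primary operation. We also present fast (linear time) clustering algorithms which support this interactive browsing paradigm.

1 Introduction

Document clustering has been extensively investigated as a methodology for improving document search and retrieval (see [15] for an excellent review). The general assumption is that mutually similar documents will tend to be relevant to the same queries, and, hence, that automatic determination of groups of such documents can improve recall by effectively broadening a search request (see [11] for a discussion of the cluster hypothesis). Typically a fixed corpus of documents is clustered either into an exhaustive partition or into a hierarchical tree structure. In the case of a partition, queries are matched against clusters and the contents of the best scoring clusters are returned as a result, possibly sorted by score. In the case of a hierarchy, queries are processed downward, always taking the highest scoring branch, until some stopping condition is achieved. The subtree at that point is then returned as a result.

These strategies are essentially variations of near-neighbor search^1, where nearness is defined in terms of the pairwise document similarity measure used to generate the clustering. Indeed, cluster search techniques are typically compared to direct near-neighbor search [9], and are evaluated in terms of precision and recall. Various studies indicate that cluster search strategies are not markedly superior to near-neighbor search and, in some situations, can be inferior. Furthermore, document clustering algorithms are often slow, with quadratic running times. It is therefore unsurprising that cluster search, with its indifferent performance, has not been widely adopted.

1 Also known as vector space search.

Document clustering has also been studied as a method for accelerating near-neighbor search, but the development of fast algorithms for near-neighbor search has decreased interest in this possibility [1].

In this paper, we take a new approach to document clustering. Rather than dismissing document clustering as a poor tool for enhancing near-neighbor search, we ask how clustering can be effective as an access tool in its own right. We describe a document browsing method, called Scatter/Gather, which uses document clustering as its primitive operation. This technique is directed towards information access with non-specific goals and serves as a complement to more focused techniques. To implement Scatter/Gather, fast document clustering is a necessity. We introduce two near linear time clustering algorithms which experimentation has shown to be effective, and discuss reasons for their success.

1.1 Browsing vs Search

The standard formulation of the information access problem presumes a query, the user's expression of an information need. The task is then to search a corpus for documents that match this need. However, it is not difficult to imagine a situation in which it is hard, if not impossible, to formulate such a query. For example, the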
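user may not be familiar with the vocabulary appropriate for describing a topic, or may not wish to commit to a particular choice of words. Indeed, the user may not be looking for anything specific at all, but rather may wish to discover the general information content of the corpus.

Access to a document collection in fact covers an entire spectrum. At one end is a narrowly specified search for a particular document, given something as specific as its title; at the other end is a browsing session with no well defined goal, satisfying a need to learn more about the collection. It is common for a session to move across the spectrum, from browsing to search: the user starts with a partially defined goal which is refined as he finds out more about the document collection. Standard information access techniques tend to emphasize the search end of the spectrum. A glaring example of this emphasis is cluster search, where a technology capable of topic extraction is instead used to assist near-neighbor search.

2 Scatter/Gather Browsing

In the basic iteration of the proposed browsing method, the user is presented with short summaries of a small number of document groups. Initially the system scatters the collection into a small number of document groups, or clusters, and presents short summaries of them to the user. Based on these summaries, the user selects one or more of the groups for further study. The selected groups are gathered together to form a subcollection. The system then applies clustering again to scatter the new subcollection into a small number of document groups, which are again presented to the user. With each successive iteration the groups become smaller, and therefore more detailed. Ultimately, when the groups become small enough, this process bottoms out by enumerating individual documents.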
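The basic iteration described above can be sketched as a short loop. This is only an illustrative outline, not the authors' implementation: the callables `cluster`, `summarize` and `choose`, the default `k=8` (matching the eight clusters of the illustration), and the stopping threshold `small_enough` are all hypothetical names and values introduced here.

```python
def scatter_gather(corpus, cluster, summarize, choose, k=8, small_enough=10):
    """Illustrative Scatter/Gather loop (a sketch, with assumed callables).

    cluster(docs, k)   -> list of document groups (the scatter step)
    summarize(group)   -> short description shown to the user
    choose(summaries)  -> indices of the groups the user selects
    """
    focus = list(corpus)
    while len(focus) > small_enough:
        groups = cluster(focus, k)                       # scatter
        picked = choose([summarize(g) for g in groups])  # user selection
        gathered = [doc for i in picked for doc in groups[i]]  # gather
        # Stop if nothing was selected or the selection did not narrow the focus.
        if not gathered or len(gathered) == len(focus):
            break
        focus = gathered
    return focus  # small enough to enumerate individual documents
```

Passing the clustering routine in as a parameter mirrors the paper's separation between the browsing method and the algorithm used to recluster on the fly.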
2.1 An Illustration

We now describe a Scatter/Gather session, where the text collection consists of about 5000 articles posted to the New York Times News Service during the month of August 1990. This session is summarized in figure 1. Here, to simplify the figure, the clusters have been assigned short labels.
the access
typically
If one has a specific which directs simply eral lays out define one question, the gives that to
specific index,
in figure
passages in one
if one
on the full
interested logical
an overview,
peruses
to find
happened
that
the
application
of conven-
of questions
information topic.
to be described
as
a single
q
searching By system
Even scribe
used
to de-
direct with
we propose
Scatter/Gather, table-of-contents documents; search search similar for user scribe ipate to find user viced used retrieve [7]. and The methods, documents, is directly found focused the document that
cluster-based, navigating
not fail
discuss articles
and events
one such
or more
word-based,
example, never
as near-neighbor component This individual process, method. tool will but can of which
concerning
international
browsing
event. used in discussion may may than with fail of the topic
one or more
further
documents to a more
used to deswitch be used help also the be be serthat With vide outline which In the diately Kuwait, the user Kuwait clusters Scatter/Gather, of the seem example, obvious and corpus. potentially the from big the forced those are Iraq This we antic-
terms,
particular by some
may
to the of the
of interest.
Scatter/Gather of word-based
queries
Germany
considers
many
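[Figure 1: Illustration of Scatter/Gather. The corpus (labeled 1990 in the figure) is scattered into clusters labeled Education, Domestic, Iraq, Arts, Sports, Oil, Germany and Legal. A first gather yields "International Stories", which are scattered into Deployment, Politics, Germany, Pakistan, Africa, Markets, Oil and Hostages. A second gather yields "Smaller International Stories", which are scattered into Trinidad, W. Africa, S. Africa, Security, International, Lebanon, Pakistan and Japan.]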
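The reduced corpus is then reclustered on the fly to produce eight new clusters covering the reduced corpus. Since the reduced corpus contains a subset of the articles, the new clusters reveal a finer level of detail than the original eight. The articles on the invasion and some of the Oil articles have now been separated into clusters concerning the U.S. deployment, the effects of the invasion upon the oil markets, and the hostages.

The user feels her understanding of these stories is adequate, but wishes to find out about events in other corners of the world. She selects the Pakistan cluster, which also contains other foreign stories, as well as a cluster of miscellaneous international articles. Reclustering this smaller selection reveals a number of specific international stories. The user thus learns of a coup and of hostages taken in Trinidad, stories otherwise lost among the major stories of that month.

2.2 Requirements

Scatter/Gather depends on the existence of two facilities. First, since clustering is the basic iteration, there must be an algorithm which can appropriately cluster a large number of documents in a tolerable amount of time (about a minute). Second, given a cluster of documents, some method for describing that group must be specified. The description must be sufficiently revealing to give the user a sense of the topic defined by the group, yet short enough that many descriptions can be appreciated simultaneously.

We will present two algorithms which meet the first requirement. Buckshot is a fast clustering algorithm suitable for the on-the-fly reclustering essential to Scatter/Gather. Fractionation is another, more careful, clustering algorithm whose greater accuracy makes it suitable for the partitioning of the entire corpus which is first presented to the user. We also define the cluster digest, a concise description of a cluster suitable for Scatter/Gather.

3 Document Clustering

Before presenting our document clustering algorithms, we review the terminology established in prior work, and discuss why existing clustering algorithms fail to meet our needs. Throughout this paper, n denotes the number of documents in the collection, and k denotes the desired number of clusters.

In order to cluster documents one must first establish a pairwise measure of document similarity and then define a method for using that measure to partition the collection into clusters of documents. Numerous document similarity measures have been proposed, all of which consider the degree of word overlap between documents. These measures typically represent each document as a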
set of words, sure The tors types) value in the the absence of the does cosine both cosine vectors. coefficients, Willett larity results Two structed. a collection teT, which vidual each chical the build These ering which ment gram). ilarity fusion. from sider erage known forming due space They tively manner, similar that to [15] than not the documents of length
with
frequency overlap
information, between
and
the
course
of selecting
Global
degree
of word
times
of unique
their
in the corpus. reflecting document. of the word, words occur frequency measure, angle
of the vector
theoretical strategies,
bound
of the corresponding use a binary the might value two scale, presence
in which or the
value
is one or zero,
or the value
be some function If a word A poputhe is zero. sparse to unit product Dice the
document, measure,
in a document, between
performance
cosine
mentioned Other
greedy,
partitional
vectors
the two
simply are
proceed
measures
a number is then
word impact
overlap choice
of the final
Each
document seed.
It is can
collecan imthat
on clustering can be coninto clussets, on to most [5, 2], considthe pair docudendrosimas one conthe clustering typically popular optimal [1 O]. characteristics. by iterainto for in in avis
at each stage,
noteworthy be
selection
transformed parof
documents
by recursively
in an application
of subsets.
titioning the
an indiA hierar-
partitioning
algorithm.
to in-
performance
of near-neighbor missed.
called have
docto be
documents. Numerous clustering single-linkage generally of clusters the (which differ greatest then hierarchical algorithms all pairs exhibits group They when document clusters, hierarchical proceed built similarity becomes pair is the including, clustering
near-neighbor
search,
prominently,
so far,
document
this
perspective,
potential
egy are largely number reason pursued niques acheive number over partitional
of documents)
of a previous
strategies
the maximum
We present
algorithms
(complete-linkage),
similarity
as other
speedup
aggregate
quadratic
algorithms
and are
algorithm
These
certain
document
groups They
a single
document
agglomerate
in a greedy
2 Willett [14] discusses an inverted file approach which can ameliorate this quadratic behavior when a large number of small clusters are desired. Unfortunately, when clusters are large enough to contain a large proportion of the terms in the corpus, this approach yields less improvement.
agglomeration
is considered
best or most
criterion.
are considered
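4 Definitions

For each document $\alpha$ in a collection (or corpus) $C$, let $c(\alpha)$ be the set of words, with their frequencies, occurring in that document.^3 Let $V$ be the set of unique words occurring in $C$. Then $c(\alpha)$ can be represented as a vector of length $|V|$:

$$c(\alpha) = \{ f(w_i, \alpha) \}_{i=1}^{|V|}$$

where $w_i$ is the $i$th word in $V$ and $f(w_i, \alpha)$ is the frequency of $w_i$ in $\alpha$.

3 Throughout this paper, lower case Greek letters will be used to denote individual documents, upper case Greek letters will denote sets of documents (document groups), and upper case Roman letters will denote sets of groups.

To measure the similarity between pairs of documents, $\alpha$ and $\beta$, let us employ the cosine between monotone element-wise functions of $c(\alpha)$ and $c(\beta)$. In particular, let

$$s(\alpha, \beta) = \frac{\langle g(c(\alpha)),\, g(c(\beta)) \rangle}{\| g(c(\alpha)) \| \, \| g(c(\beta)) \|}$$

where $g$ is a monotone damping function, $\langle \cdot, \cdot \rangle$ denotes inner product, and $\| \cdot \|$ denotes vector norm. It has been our experience that taking $g$ to be component-wise square root produces better results than the traditional component-wise logarithm.

We can consider similarity to be a function of document profiles $p(\alpha)$, where

$$p(\alpha) = \frac{g(c(\alpha))}{\| g(c(\alpha)) \|}$$

in which case

$$s(\alpha, \beta) = \langle p(\alpha), p(\beta) \rangle = \sum_{i=1}^{|V|} p(\alpha)_i \, p(\beta)_i .$$

Suppose $\Gamma$ is a set of documents, or a document group. A profile can be associated with $\Gamma$ by defining it to be the normalized sum profile of the contained individuals. Let

$$\hat{p}(\Gamma) = \sum_{\alpha \in \Gamma} p(\alpha)$$

be the unnormalized sum profile, and then

$$p(\Gamma) = \frac{\hat{p}(\Gamma)}{\| \hat{p}(\Gamma) \|} .$$

Similarly, the cosine measure can be extended to $\Gamma$ by employing this profile definition:

$$s(\Gamma, x) = \langle p(\Gamma), p(x) \rangle .$$

Sometimes the normalized sum profile is not a good description of group contents, because of documents which lie on the fringes of the group. To solve this we define the trimmed sum profile $p_m(\Gamma)$, built from the $m$ most central documents of the group. For $\alpha$ in $\Gamma$, consider $s(\alpha, \Gamma)$, and let $r_m(\Gamma)$ be the $m$ documents closest to $\Gamma$, namely those with the largest $s(\alpha, \Gamma)$. Then

$$\hat{p}_m(\Gamma) = \sum_{\alpha \in r_m(\Gamma)} p(\alpha) \quad\text{and}\quad p_m(\Gamma) = \frac{\hat{p}_m(\Gamma)}{\| \hat{p}_m(\Gamma) \|} .$$

This computation can be completed in time proportional to $|\Gamma|$.^4 The trimming parameter $m$ may be defined adaptively as some percentage of $|\Gamma|$, or may be fixed.

4 A full sort of the similarities is not required.

4.1 Cluster Digest

Another description of a document group is in some sense dual to its central documents: rather than consider the most central documents, consider those words which appear most central (or topical). Let $t_w(\Gamma)$ be the $w$ highest weighted words in the sum profile. We thus define the cluster digest of a cluster $\Gamma$ to be the two sets $(r_m(\Gamma), t_w(\Gamma))$. The digest can easily be computed in time $O(|\Gamma| + |V|)$, and is in fact the summary description used to present a cluster to a user of Scatter/Gather.

5 Partitional Clustering

Seed-based partitional clustering algorithms have three phases:

1 Find $k$ centers.
2 Assign each document in the collection to a center.
3 Refine the partition so constructed.

The result is a set $P$ of $k$ disjoint document groups such that $\bigcup_{\Pi \in P} \Pi = C$.

The Buckshot and the Fractionation algorithms are both designed to find the initial centers. They can be thought of as rough clustering algorithms; however, their output is only used to define centers. Both assume the existence of some clustering algorithm, which we call the cluster subroutine (see appendix). Each algorithm applies this subroutine over small sets, and builds on its results to find the $k$ centers.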
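As a concrete reading of the definitions in Section 4, the following sketch computes document profiles, their cosine similarity, a normalized sum profile, and the topical-word half of a cluster digest. It is a minimal illustration under stated assumptions, not the authors' code: whitespace tokenization, the dictionary-based sparse vectors, and the function names `profile`, `dot`, `sum_profile` and `topical_words` are all introduced here, and $g$ is taken to be the component-wise square root (any monotone damping function would fit the definition).

```python
import math
from collections import Counter

def profile(text):
    """Unit-length profile p(a) = g(c(a)) / ||g(c(a))||, with g = sqrt (assumed)."""
    counts = Counter(text.lower().split())            # word frequencies c(a)
    damped = {w: math.sqrt(f) for w, f in counts.items()}
    norm = math.sqrt(sum(v * v for v in damped.values()))
    return {w: v / norm for w, v in damped.items()} if norm else {}

def dot(p, q):
    """Sparse inner product <p, q>; for unit profiles this is the cosine s(a, b)."""
    if len(p) > len(q):
        p, q = q, p                                   # iterate over the shorter vector
    return sum(v * q.get(w, 0.0) for w, v in p.items())

def sum_profile(profiles):
    """Normalized sum profile p(G) of a group of document profiles."""
    total = Counter()
    for p in profiles:
        for w, v in p.items():
            total[w] += v                             # unnormalized sum profile
    norm = math.sqrt(sum(v * v for v in total.values()))
    return {w: v / norm for w, v in total.items()} if norm else {}

def topical_words(group_profile, w):
    """The w highest weighted words of a group profile (ties broken alphabetically)."""
    ranked = sorted(group_profile.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:w]]
```

Because every profile has unit length, the similarity of a document to a group reduces to a single sparse inner product with the group's profile, which is what makes the later assignment step cheap.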
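Buckshot applies the cluster subroutine to a random sample to find centers. Fractionation uses successive application of the cluster subroutine over fixed sized groups to find centers. Fractionation is the more accurate of the two, while Buckshot is significantly faster, and hence appropriate for the on-the-fly reclustering required by Scatter/Gather. Fractionation can be used to establish the primary partitioning of the entire corpus, which is displayed to the user first.

5.1 Finding Initial Centers

Buckshot

The idea of the Buckshot algorithm is quite simple. To achieve the desired running time, choose a small random sample of $C$ (of size $\sqrt{kn}$) and apply the cluster subroutine to it. The centers of the clusters found are returned. Since the cluster subroutine is quadratic and is applied to only $\sqrt{kn}$ documents, this algorithm runs in time $O(kn)$.

Since random sampling is employed, the Buckshot algorithm is not deterministic. That is, repeated calls to this algorithm on the same corpus may produce different partitions, although in our experience repeated trials generally produce qualitatively similar partitions.

Fractionation

The Fractionation algorithm finds $k$ centers by initially breaking the corpus $C$ into $n/m$ buckets of a fixed size $m > k$. The cluster subroutine is then applied to each of these buckets separately to agglomerate individuals into document groups such that the reduction in number (from individuals to groups in each bucket) is roughly a factor $\rho$. These groups are now treated as if they were individuals, and the entire process is repeated. The iteration terminates when only $k$ groups remain. Fractionation can be viewed as building a $1/\rho$ branching tree bottom up, where the leaves are individual documents, terminating when only $k$ roots remain.

Suppose the individuals in $C$ are enumerated, so that $C = \{\alpha_1, \alpha_2, \ldots, \alpha_n\}$. This ordering could reflect an extrinsic ordering on $C$, but a better procedure sorts $C$ based on a key which is the word index of the $j$th most common word in each individual. Typically $j$ is a small number, such as three, which favors medium frequency terms. This procedure thus encourages nearby individuals in the corpus ordering to have at least one word in common.

The initial bucketing creates a partition $B = \{\Theta_1, \Theta_2, \ldots, \Theta_{n/m}\}$ of $C$ into buckets of $m$ consecutive individuals. The cluster subroutine is then applied to each bucket $\Theta_i$ separately, agglomerating it into $\rho m$ groups $\Phi_{i,1}, \ldots, \Phi_{i,\rho m}$. The documents contained in each group are then treated as individuals for the next iteration. That is, define

$$C' = \{ \Phi_{i,j} : 1 \le i \le n/m,\ 1 \le j \le \rho m \},$$

taking the $\Phi_{i,j}$ in lexicographic order on $i$ and $j$ as an enumeration of $C'$. The process is then repeated with $C'$ replacing $C$: the $\rho n$ elements of $C'$ are broken into $\rho n / m$ buckets, which are reduced to $\rho^2 n$ groups through separate application of the cluster subroutine, and so on. The process stops at the iteration $j$ at which the number of remaining groups, roughly $\rho^j n$, reaches $k$.

The cluster subroutine is quadratic, so each bucket of size $m$ costs $O(m^2)$, and iteration $i$, which operates on $\rho^i n / m$ buckets, takes time $O(\rho^i n m)$. Since $1 + \rho + \rho^2 + \cdots \le 1/(1-\rho)$, the overall running time is $O(mn)$. Fractionation therefore has rectangular running time if $m = O(k)$.

5.2 Assigning Documents to Centers

Once $k$ centers have been found, and suitable profiles defined for those centers, each document in $C$ must be assigned to one of those centers based on some criterion. The simplest criterion, Assign-to-Nearest, assigns each document to the nearest center. Let $\Gamma_i$, $1 \le i \le k$, be the centers, with profiles $p_m(\Gamma_i)$, and let $\Pi_i$ be the set of documents assigned to the $i$th center: $\alpha \in \Pi_i$ if $i$ maximizes $s(\alpha, \Gamma_i)$, with ties broken by assigning $\alpha$ to the group with lowest index. $P = \{\Pi_1, \ldots, \Pi_k\}$ is then the desired partition. It can be efficiently computed by constructing an inverted map for the $k$ centers $p_m(\Gamma_i)$ and, for each document, simultaneously computing the similarity to all the centers. The cost of this procedure is proportional to $kn$.

5.3 Refinement

Given an initial clustering, it is now desirable to refine it into a better one. As with our initial clustering algorithms, there is a tradeoff between speed and accuracy.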
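The Buckshot procedure of Section 5.1 and the Assign-to-Nearest rule of Section 5.2 can be sketched together as follows. This is a hedged illustration, not the authors' implementation: documents are assumed to be unit-length sparse profiles stored as dictionaries, and `agglomerate` is a deliberately naive stand-in for the cluster subroutine (it merges the two groups with the most similar normalized sum profiles, and is slower than the quadratic routine the paper assumes). All function names are introduced here.

```python
import math
import random

def dot(p, q):
    """Sparse inner product of two dictionary vectors."""
    return sum(v * q.get(w, 0.0) for w, v in p.items())

def normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {w: v / norm for w, v in vec.items()} if norm else dict(vec)

def add(p, q):
    out = dict(p)
    for w, v in q.items():
        out[w] = out.get(w, 0.0) + v
    return out

def agglomerate(profiles, k):
    """Naive stand-in for the cluster subroutine: merge the most similar
    pair of groups until k groups remain; return their normalized centers."""
    groups = [dict(p) for p in profiles]              # unnormalized sum profiles
    while len(groups) > k:
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                s = dot(normalize(groups[i]), normalize(groups[j]))
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        groups[i] = add(groups[i], groups[j])
        del groups[j]
    return [normalize(g) for g in groups]

def assign_to_nearest(profiles, centers):
    """Assign each document index to its most similar center;
    ties go to the center with the lowest index."""
    parts = [[] for _ in centers]
    for idx, p in enumerate(profiles):
        sims = [dot(p, c) for c in centers]
        parts[sims.index(max(sims))].append(idx)
    return parts

def buckshot(profiles, k, rng=random):
    """Cluster a random sample of size about sqrt(k * n), then assign everything."""
    n = len(profiles)
    size = min(n, max(k, math.ceil(math.sqrt(k * n))))
    sample = rng.sample(profiles, size)
    centers = agglomerate(sample, k)
    return assign_to_nearest(profiles, centers)
```

Passing a seeded `random.Random` instance as `rng` makes a run repeatable, which is convenient given that Buckshot is not deterministic.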
the
Assign-tosepaparts
Join The purpose of the groups elements documents words may by their Join refinement digests. operator P that they However, Therefore groups Since, is to merge usefully have of be by definition, never their lists
Nearest
process merges
clusters
are too
topical
17 and
ItU(r)
n tu(A)l of w most A) the words number In large is thus topical some for of words each words the for 17. We
W.
the trimmed
t~ (I) r and
> p, for
topical
we must
compute
to decide number
clusters is typically
Split Split two divides new each document This can (without Buckshot J7~} and partition of the group !J in a partition with P into
words the
the number
of documents,
running
groups. clustering
6
ment the
Application
procedures course The
to
Scatter/Gather
initial possible clustering complete and refinein clustering method. is comconsideration. one can use of of algo-
groups. Iz, ..., union G, = {17,,1, 1,,2} new Buckshot of 17,. The G, s: partition
element is simply
two the in
of these
combinations
P = UG,.
pletely Hence, of Buckshot the overall to N. of this is the procedure on some cluster in the would coherency self similarity to the cluster, to the as well cluster only s(I, split I). poorly criterion. simirequires time proportional compute a slower the tens rithm and Split, the O(k N) tering expense rank of A(I, A(rk)}. then
p,
t=l
Each application proportional that quantity between similarity define: = s(r, P) r). ) in the set score criterion
is available
in advance, the
partition algorithm
to II, 1. Hence, in time A groups One This larity average We thus A(r) Let r(f,, {A(r, The i-(r, does to N. modification simple
computation
can be performed
to improve
initial
of thousands is likely then Join, running and use the perform and time thus
to be too slow even for offline algorithm of refinement operators. the overall it is vital as possible, therefore and We and then have
We thus
is in fact documents
proportional
average
as to the centroid.
Assign-to-Nearest does not session, to run accuracy. procedure, of refinement. clustering, affect
of a document
however, We
for the cluseven at the the that yield quickly Bucka two a readithis it with
as quickly
be the
shot bare
), A(r,),..., would
some
iterations procedure P) not < pk for change criterion only of the split groups such the that cosonably produces minishing By virtue algorithm may in fact worsen the partition
clusters can pull even fuzzier.
refinement
O < p s
1. This
modification since
the can
order
algorithm in time
herence
be computed
proportional
5Excessive Iteration
improving away from
rather than
documents
plated that
Scatter/Gather, be computed
it is more
important of deterpartition
an unrevealing
overall
complexity section
of both is clearly
clustering O(kN).
evaluation information
metrics access
appropriate goals in
to the which
vaguely
de-
in this the
Scatter/Gather
for
procedure large
is small
to permit The
interactive
document
collections. larger
fast
clustering quickly
constant
for offline
manner
applications.
For extremely
6.1
Naturally
examining data set input
Clustered
the consists data
Data
of our separated clusters, then algorithms clusters i.e., both is larger of the than of
be too under
We are working
ations large
to arbitrarily
the assumption
If the
Fractionaprovides al-
similarity,
partition. a corpus documents so long containing a random from each k widely sample of the then
further gorithms,
size centers,
whatever
high
probability
as n >> k in k. This k = 20 or so. if we choose from that In figures pus This about updates Here events a = 5 means in 1000. from that Given tion is the News consists 5000 our during 2 through set 5, we present described during Some the distributed month are 30 megabytes articles about To the full by output 2.1. the New of the The corYork 1990. text due political initial algorithm this line the task, display number parti(figtime of of and line in to Scatter/Gather Times session SeTvice articles. stories. is to learn month. the Buckshot international create clustering for two the goal applied in section
Scatter/Gather
Session
probability k times
of size
individual
of articles of roughly
some
of news
this
with
weve
400 samples
clusters Thus
of centers cluster.
cluster
digest.
contains
the cluster, note actual not that if we have some pair Then same each titles bucket, contains kets, articles gest Next, 4 (African
cluster. in the
6 (Germany,
of documents be a subset
be merged.
of some
one of the
3 (Pakistan,
issues)
incidents so on.
Conclusion
demonstrates information
metaphor contenks
We find
in Trinidad, about
in Liberia,
document
the
clustering
an in-
in Liberia in that
of the articles
cluster
basis, in
and
experience it is difficult
that
easy to use.
undesirable
formally.
of improved
performance
[Figure 2: Initial Scattering. Screen output of the command (time (outline (all-dots tdb))), listing each cluster with its size, topical words and sample article titles. Legible cluster entries include (287) school, program; (1731) year, week; (749) iraq, kuwait, military; (275) film; (481) game, play, coach; (844) price, percent, market, company; and a cluster of size 310.]
[Figure 3: Second Scatter. Screen output of an outline command referencing clusters 2, 5 and 6 of the first scattering. Legible cluster entries include (650) iraq, kuwait, american, official; (586) oil, year, market; and (126) kuwait, iraq, american, year, invasion. Other entries concern elections, German unification, Pakistan and the stock market.]
[Figure 4: Third Scatter. Screen output of an outline command referencing clusters 3 and 4 of the second scattering. Legible topical words include rebel, liberian; south, mandela, africa, african; security; muslim, christian, beirut; government, party, political, military, pakistan, president; and japanese, tokyo, south, korea.]
[Figure 5: Titles of articles in topic 1 from Figure 4. Screen output of the command (print-titles (nth 1 third)). Legible titles include "REBEL LEADER SEIZES ABOUT A DOZEN FOREIGNERS", "WAR THREATENS TO WIDEN AS NEIGHBORING COUNTRIES TAKE SIDES", "REBEL LEADER AGREES TO HOLD CEASE-FIRE" and "OUSTER OF LIBERIAN PRESIDENT NOW SEEMS INEVITABLE".]
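B Group Average Clustering

Here we present a quadratic time greedy global agglomerative clustering algorithm which has given good results in our experience. This algorithm is similar to the one presented in [3].

Let $\Gamma$ be a document group. The average similarity between any two documents in $\Gamma$ is defined to be

$$S(\Gamma) = \frac{1}{|\Gamma|(|\Gamma| - 1)} \sum_{\alpha \in \Gamma} \sum_{\beta \in \Gamma,\ \beta \ne \alpha} s(\alpha, \beta) .$$

Let $G$ be a set of disjoint document groups. The basic iteration of group average agglomerative clustering finds the pair $\Gamma'$ and $\Delta'$ which maximizes $S(\Gamma \cup \Delta)$ over all choices of $\Gamma$ and $\Delta$ from $G$. A new, smaller partition $G'$ is then constructed by merging $\Gamma'$ with $\Delta'$:

$$G' = (G - \{\Gamma', \Delta'\}) \cup \{\Gamma' \cup \Delta'\} .$$

Initially, $G$ is simply a set of singleton groups, one for each individual to be clustered. The iteration terminates (or is truncated) when $|G'| = k$. Note that the output from this procedure is the final flat partition $G'$, rather than a hierarchy of partitions, although the latter could be recovered by recording each pairwise join.

If we employ the cosine inner product similarity measure, the inner maximization can be significantly accelerated. Recall that $\hat{p}(\Gamma)$ is the unnormalized sum profile of $\Gamma$. Then the average pairwise similarity $S(\Gamma)$ is simply related to the inner product $\langle \hat{p}(\Gamma), \hat{p}(\Gamma) \rangle$. That is, since each profile has unit length,

$$\langle \hat{p}(\Gamma), \hat{p}(\Gamma) \rangle = \sum_{\alpha \in \Gamma} \sum_{\beta \in \Gamma} \langle p(\alpha), p(\beta) \rangle = |\Gamma|(|\Gamma| - 1)\, S(\Gamma) + |\Gamma| ,$$

so that

$$S(\Gamma) = \frac{\langle \hat{p}(\Gamma), \hat{p}(\Gamma) \rangle - |\Gamma|}{|\Gamma|(|\Gamma| - 1)} .$$

Similarly, for the union of two disjoint groups, $\Lambda = \Gamma \cup \Delta$,

$$S(\Lambda) = \frac{\langle \hat{p}(\Lambda), \hat{p}(\Lambda) \rangle - (|\Gamma| + |\Delta|)}{(|\Gamma| + |\Delta|)(|\Gamma| + |\Delta| - 1)}$$

where

$$\langle \hat{p}(\Lambda), \hat{p}(\Lambda) \rangle = \langle \hat{p}(\Gamma), \hat{p}(\Gamma) \rangle + 2 \langle \hat{p}(\Gamma), \hat{p}(\Delta) \rangle + \langle \hat{p}(\Delta), \hat{p}(\Delta) \rangle .$$

Therefore, if for every $\Gamma \in G$, $S(\Gamma)$ and $\hat{p}(\Gamma)$ are known, the pairwise merge that will produce the least decrease in average similarity can be found using only inner products of sum profiles, and these quantities can be updated each time a merge is performed.

Further, suppose that for every $\Gamma \in G$ the $\Delta$ such that

$$S(\Gamma \cup \Delta) = \max_{\Lambda \in G,\ \Lambda \ne \Gamma} S(\Gamma \cup \Lambda)$$

were known. Then finding the best pair would simply involve scanning these candidates. Updating these quantities with each merge is straightforward, since only those involving the merged groups need be recomputed. The complexity of the truncated algorithm is $O(n^2)$.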
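The identity relating average pairwise similarity to the sum profile can be checked numerically. The sketch below is an illustration only: it assumes unit-length sparse profiles stored as dictionaries, and the function names are introduced here. It computes the average similarity both directly over all pairs and from the unnormalized sum profile, and scores a candidate merge of two groups without revisiting their individual documents.

```python
import math
from itertools import combinations

def dot(p, q):
    """Sparse inner product of two dictionary vectors."""
    return sum(v * q.get(w, 0.0) for w, v in p.items())

def unit(vec):
    """Scale a nonzero vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return {w: v / norm for w, v in vec.items()}

def sum_profile(profiles):
    """Unnormalized sum profile of a group."""
    total = {}
    for p in profiles:
        for w, v in p.items():
            total[w] = total.get(w, 0.0) + v
    return total

def avg_sim_direct(profiles):
    """Average similarity over all distinct pairs (quadratic in group size)."""
    pairs = list(combinations(profiles, 2))
    return sum(dot(p, q) for p, q in pairs) / len(pairs)

def avg_sim_from_sum(phat, size):
    """Same quantity from the sum profile: (<phat, phat> - size) / (size (size - 1))."""
    return (dot(phat, phat) - size) / (size * (size - 1))

def merged_avg_sim(phat_g, n_g, phat_d, n_d):
    """Average similarity of the union of two disjoint groups,
    using only their sum profiles and sizes."""
    n = n_g + n_d
    inner = dot(phat_g, phat_g) + 2.0 * dot(phat_g, phat_d) + dot(phat_d, phat_d)
    return (inner - n) / (n * (n - 1))
```

Scoring a merge this way costs one sparse inner product between the two sum profiles, rather than a pass over every pair of documents in the union.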
References
[1] Chris Eighth ence Buckley Annual on Research pages and Alan F. Lewit. In ACM Optimizations of the SIGIR in Confer-
[14] P.
Willett.
fast
procedure ~
for
the
calculation 17:53-60,
of similarity Information 1981. [15] of inverted vector searches. and 97-110, Proceedings
coefficients Processing
in automatic Management,
classification.
International
Information
P.
Willett.
Recent A critical
trends review.
in hierarchical
document
Information
1988.
Processing
& Management,
Clustering method. the single-link for Journal Science, P. Willett. Wards in of the Ame?zcan 28:341-344, Hierarchical method. Information Conference 1977. docon ReRetTzeval,
24(5):577-597,
ument
In Proceed-
International
and Development
149-156, 1986. H.C. Luckhurst,
Griffiths,
and
P. Willett.
Using
in document Society
American 1986.
C. Dubes. Hall,
Algorithms Cliffs,
for N.J.
Clustering
Data.
Engelwood
C.J. and D.
van
Rijsbergen.
The retrieval.
clustering Storage
7:217-240, and J.
[7] J.
O.
Pedersen, search: In
a single
Statistic1991. SSL-
Meetings. available
System.
Prentice-
Englewood and
M. J. McGill.
to Modern
McGraw-Hill, an optimally
algorithm Journal,
cluster
method.
Information edition,
Butter-
London,
second
clusthe &
experiments
with
1400 collection.
11:171-182,
Information
1975.
Processing
Document Journal
clustering of Information
using
an inverted 2:223-
Science,