NII International Internship Project: Distributed Data Clustering

Supervisor: Michael HOULE, Visiting Professor
For data mining applications, the size of main memory typically limits the size of the
data sets that can be analyzed. Avoiding the main-memory limitation necessitates a
choice between the use of external memory (disk) and distributed processing (multiple
cores). Either approach also requires a clustering method that is inherently
decomposable. Relatively few parallelizable clustering methods are known, most of
which involve the partitioning of the data set, the independent clustering of each
partition, and the merging of the resulting clusters across all partitions. For data mining
applications, this divide-and-conquer approach has the effect of missing those very
small aggregations (the nuggets of information) that may prove to be the most
valuable to the user. For example, if the data set is partitioned into 10 subsets for
clustering, any aggregation of 30 points would have (on average) 3 points in any
given partition – typically too few to be recognized as a cluster for that partition.
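The arithmetic behind this example can be checked with a short sketch. It assumes that each point is assigned to a partition independently and uniformly at random, which the text above does not state explicitly; the function names and the minimum cluster size of 10 used below are hypothetical, chosen only for illustration.

```python
from math import comb


def expected_points_per_partition(group_size: int, num_partitions: int) -> float:
    """Average number of points of one aggregation that land in a given partition."""
    return group_size / num_partitions


def prob_at_least(group_size: int, num_partitions: int, threshold: int) -> float:
    """Probability that a given partition receives at least `threshold` points of
    the aggregation, ASSUMING independent, uniformly random assignment
    (a binomial model; this assumption is not taken from the project text)."""
    p = 1.0 / num_partitions
    return sum(
        comb(group_size, k) * p**k * (1.0 - p) ** (group_size - k)
        for k in range(threshold, group_size + 1)
    )


# The example from the text: 30 points spread over 10 partitions.
average = expected_points_per_partition(30, 10)  # 3.0

# Hypothetical minimum cluster size of 10: how likely is it that one
# particular partition still sees enough of the aggregation?
chance = prob_at_least(30, 10, 10)
```

Under this random-assignment assumption, the chance that a given partition holds 10 or more of the 30 points is well under 1%, which illustrates why a per-partition clustering step is unlikely to recognize such an aggregation.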
The project will investigate the application of the relevant-set correlation (RSC)
clustering model [1] to the clustering of data from distributed databases, in such a way
that the smallest nuggets of information are still preserved. Developed at NII, RSC is
a generic model for clustering that requires no direct knowledge of the nature or
representation of the data. In lieu of such knowledge, the model relies solely on the
existence of an oracle for queries-by-example that accepts a reference to a data item
and returns a ranked set of items relevant to the query. In principle, the role of the
oracle could be played by any similarity search structure, or even a search engine
whose internal ranking function and relevancy scores are secret. The quality of cluster
candidates, the degree of association between pairs of cluster candidates, and the
degree of association between clusters and data items are all assessed according to the
statistical significance of a form of correlation among pairs of relevant sets and/or
candidate cluster sets.
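The oracle idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the oracle signature (item index and result size in, ranked item indices out), the brute-force Euclidean ranking standing in for a real similarity search structure, and the use of the Pearson correlation between set membership vectors as one possible "form of correlation" between relevant sets. The definitions actually used by RSC are given in [1].

```python
from math import sqrt
from typing import Callable, List, Sequence, Set

# Hypothetical oracle signature: (query item index, k) -> ranked item indices.
Oracle = Callable[[int, int], List[int]]


def make_brute_force_oracle(points: Sequence[Sequence[float]]) -> Oracle:
    """Toy oracle that ranks all items by squared Euclidean distance to the query.
    A clustering method built on the oracle never needs to see `points` itself."""

    def oracle(query: int, k: int) -> List[int]:
        q = points[query]

        def dist(i: int) -> float:
            return sum((a - b) ** 2 for a, b in zip(points[i], q))

        return sorted(range(len(points)), key=dist)[:k]

    return oracle


def set_correlation(a: Set[int], b: Set[int], n: int) -> float:
    """Pearson correlation between the 0/1 membership vectors of sets a and b
    over a data set of n items (an illustrative choice, not taken from the text)."""
    if not (0 < len(a) < n and 0 < len(b) < n):
        raise ValueError("each set must be non-empty and smaller than the data set")
    overlap = len(a & b)
    return (n * overlap - len(a) * len(b)) / sqrt(
        len(a) * len(b) * (n - len(a)) * (n - len(b))
    )


# Two well-separated groups of three points each.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
oracle = make_brute_force_oracle(points)

rel0 = set(oracle(0, 3))  # relevant set of item 0
rel1 = set(oracle(1, 3))  # relevant set of item 1 (same group)
rel3 = set(oracle(3, 3))  # relevant set of item 3 (other group)

same_group = set_correlation(rel0, rel1, len(points))   # 1.0
other_group = set_correlation(rel0, rel3, len(points))  # -1.0
```

Only the oracle is consulted after construction, so the same correlation computation would work unchanged if the toy oracle were replaced by any other ranked-retrieval mechanism.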
Based on the RSC model, a general-purpose scalable clustering heuristic, GreedyRSC,
has already been developed and demonstrated for very large, high-dimensional
datasets, using a fast approximate similarity search structure (the SASH [2]) as the
oracle [1]. The features of GreedyRSC include:
• The ability to scale to large data sets, both in terms of the number of items and
the size of the attribute sets.
• Genericity, in its ability to deal with different types of attributes (categorical,
ordinal, spatial).
• Automatic determination of an appropriate number of clusters, with the user
specifying as input parameters only the minimum desired cluster size and the
maximum allowable correlation (proportion of overlap) between pairs of clusters.
• Robustness with respect to noisy data.
• The ability to identify clusters of any size (as small as three items).
As currently implemented, GreedyRSC is a batch method that makes use of a single
CPU. It is capable of clustering large data sets through the use of external memory
(disk), by breaking the data into several chunks, each of which can reside in main
memory. However, dividing the data into c chunks increases the computational cost
by a factor of c.
The specific goals of this project are:
• To develop a true parallel clustering tool based on the GreedyRSC heuristic,
suitable for use on a network of PCs.
• To demonstrate the efficiency and effectiveness of the parallel GreedyRSC
implementation by an experimental comparison with sequential GreedyRSC and
other clustering methods. In particular, the ability of the method to discover
arbitrarily small clusters is to be demonstrated.
• To make the clustering tool freely available under the GNU public licence.
The ideal duration of this project is 6 months, although visits as short as 5 months
will still be considered. Although it is possible to reduce the length of the internship
after being accepted, it may be difficult to extend the duration beyond that which is
stated in the candidate’s application. Therefore, candidates are strongly advised
to state in their application the longest possible duration for their intended stay at
NII.
3. References
[1] M. E. Houle, "The relevant-set correlation model for data clustering", in Proc. 8th
SIAM International Conference on Data Mining (SDM 2008), pp. 775-786, Atlanta,
GA, USA, 2008.
[2] M. E. Houle and J. Sakuma, "Fast approximate similarity search in extremely
high-dimensional data sets", in Proc. 21st IEEE International Conference on Data
Engineering (ICDE 2005), pp. 619-630, Tokyo, Japan, 2005.
