For data mining applications, the size of main memory typically limits the size of the
data sets that can be analyzed. Avoiding the main-memory limitation necessitates a
choice between the use of external memory (disk) or distributed processing (multiple
cores). Either approach also requires a clustering method that is inherently
decomposable. Relatively few parallelizable clustering methods are known, most of
which involve the partitioning of the data set, the independent clustering of each
partition, and the merging of the resulting clusters across all partitions. For data mining
applications, this divide-and-conquer approach has the effect of missing those very
small aggregations (the nuggets of information) that may prove to be the most
valuable to the user. For example, if the data set is partitioned into 10 subsets for
clustering, any aggregation of 30 points would have (on average) 3 points in any
given partition, typically too few to be recognized as a cluster within that partition.
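The arithmetic behind this example can be checked with a short Python sketch. It assumes that each point is assigned to a partition independently and uniformly at random, which the text above does not state explicitly. The function names and the threshold of 10 points are hypothetical and chosen only for illustration.

```python
from math import comb


def expected_points_per_partition(aggregation_size: int, num_partitions: int) -> float:
    """Average number of points of one aggregation that land in a single partition."""
    return aggregation_size / num_partitions


def prob_partition_gets_at_least(aggregation_size: int, num_partitions: int,
                                 min_points: int) -> float:
    """Probability that one given partition receives at least `min_points` points
    of the aggregation, ASSUMING independent, uniform random assignment
    (a binomial model; this assumption is ours, not the project description's)."""
    p = 1.0 / num_partitions
    return sum(
        comb(aggregation_size, k) * p ** k * (1.0 - p) ** (aggregation_size - k)
        for k in range(min_points, aggregation_size + 1)
    )


if __name__ == "__main__":
    # 30 points spread over 10 partitions: 3 points per partition on average.
    print(expected_points_per_partition(30, 10))
    # Chance that a given partition still holds 10 or more of the 30 points
    # (10 is an arbitrary, hypothetical "recognizable cluster" threshold).
    print(prob_partition_gets_at_least(30, 10, 10))
```

Under this model, a partition rarely holds much more than the average share of a small aggregation, which is why the aggregation tends to go unrecognized within any single partition.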
The project will investigate the application of the relevant-set correlation (RSC)
clustering model [1] to the clustering of data from distributed databases, in such a way
that the smallest nuggets of information are still preserved. Developed at NII, RSC is
a generic model for clustering that requires no direct knowledge of the nature or
representation of the data. In lieu of such knowledge, the model relies solely on the
existence of an oracle for queries by example, which accepts a reference to a data item
and returns a ranked set of items relevant to the query. In principle, the role of the
oracle could be played by any similarity search structure, or even a search engine
whose internal ranking function and relevancy scores are secret. The quality of cluster
candidates, the degree of association between pairs of cluster candidates, and the
degree of association between clusters and data items are all assessed according to the
statistical significance of a form of correlation among pairs of relevant sets and/or
candidate cluster sets.
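The oracle-based view described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy oracle ranks items by distance between one-dimensional integer values, and the correlation shown is the ordinary Pearson correlation between the indicator vectors of two sets. The exact correlation measure and the significance tests used by RSC are those defined in [1] and are not reproduced here.

```python
from math import sqrt
from typing import Callable, Dict, List, Set

# A query-by-example oracle: takes an item reference and a result size,
# returns a ranked list of item references. (Hypothetical interface.)
Oracle = Callable[[str, int], List[str]]


def make_toy_oracle(points: Dict[str, int]) -> Oracle:
    """Build a stand-in oracle that ranks items by |value difference|.
    A real oracle could be any similarity search structure or search engine;
    the clustering side never looks at `points` directly."""
    def oracle(query_id: str, k: int) -> List[str]:
        q = points[query_id]
        # Ties are broken by item identifier to keep the ranking deterministic.
        ranked = sorted(points, key=lambda item: (abs(points[item] - q), item))
        return ranked[:k]
    return oracle


def set_correlation(a: Set[str], b: Set[str], n: int) -> float:
    """Pearson correlation between the 0/1 indicator vectors of sets a and b,
    taken over a data set of n items. Undefined for empty or full sets."""
    if not a or not b or len(a) == n or len(b) == n:
        raise ValueError("correlation is undefined for empty or full sets")
    inter = len(a & b)
    return (n * inter - len(a) * len(b)) / sqrt(
        len(a) * len(b) * (n - len(a)) * (n - len(b))
    )


# Toy data: two well-separated groups of three items each.
points = {"a": 0, "b": 1, "c": 2, "x": 50, "y": 51, "z": 52}
oracle = make_toy_oracle(points)

rel_a = set(oracle("a", 3))  # relevant set of item "a"
rel_b = set(oracle("b", 3))  # relevant set of item "b"
rel_x = set(oracle("x", 3))  # relevant set of item "x"

print(set_correlation(rel_a, rel_b, len(points)))  # same group: 1.0
print(set_correlation(rel_a, rel_x, len(points)))  # different groups: -1.0
```

In this sketch, items whose relevant sets coincide are maximally correlated, while items from the two separated groups have disjoint relevant sets and are maximally anti-correlated. The clustering logic only ever calls the oracle, so the oracle's internal ranking function can remain opaque.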
The ideal duration of this project is 6 months, although visits as short as 5 months
will still be considered. Although it is possible to reduce the length of the internship
after being accepted, it may be difficult to extend the duration beyond that which is
stated in the candidate’s application. Therefore, candidates are strongly advised
to state in their application the longest possible duration for their intended stay at
NII.
3. References
[1] M. E. Houle, "The relevant-set correlation model for data clustering", in Proc. 8th
SIAM International Conference on Data Mining (SDM 2008), pp. 775-786, Atlanta,
GA, USA, 2008.