CIKM 2005 Tutorial

Clustering Large and High-Dimensional Data

Charles Nicholas, Department of Computer Science and Electrical Engineering, UMBC
nicholas@umbc.edu

Jacob Kogan, Department of Mathematics and Statistics, UMBC
kogan@umbc.edu

Marc Teboulle, School of Mathematical Sciences, Tel-Aviv University
teboulle@post.tau.ac.il

This page is under occasional construction. Comments and corrections are welcome!

The current version of the tutorial: Nicholas (pdf), Kogan (pdf), Teboulle (pdf). A two-up handout form of Nicholas (pdf).

From the Tutorial:

Alan F. Smeaton, Gary Keogh, Cathal Gurrin, Kieran McDonald and Tom Sødring, "Analysis of papers from twenty-five years of SIGIR conferences: what have we been doing for the last quarter of a century?", ACM SIGIR Forum, Volume 36, Issue 2, Fall 2002, pages 39-43. (pdf)

C. J. van Rijsbergen, Information Retrieval, Second Edition, page 103.

E. Rasmussen, "Clustering Algorithms", in Information Retrieval: Data Structures and Algorithms, William Frakes and Ricardo Baeza-Yates, editors, Prentice Hall, 1992.

The whisky dendrogram is from http://www.whiskyclassified.com/classification.html
The clustering software used is available from Clustan
(The presenters have no commercial interest in Clustan or any other software vendor mentioned in this tutorial.)

Octave code for single link clustering, complete link clustering, and a comparison of the two. Requires Octave to run. (Could be converted to Matlab with some effort.)
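
For readers without Octave, the difference between the two linkage rules can be sketched in a few lines of Python. This is not the tutorial's Octave code: the function name `agglomerate`, its parameters, and the one-dimensional sample points are assumptions made purely for illustration. Single link measures the distance between two clusters by their closest pair of points, complete link by their farthest pair; both then repeatedly merge the two closest clusters.

```python
from itertools import combinations


def agglomerate(points, k, linkage="single"):
    """Greedy agglomerative clustering of 1-D points down to k clusters.

    Illustrative sketch only: brute-force distances, no tie handling.
    linkage="single" uses the closest pair, anything else the farthest pair.
    """
    clusters = [[p] for p in points]          # start with singleton clusters
    pick = min if linkage == "single" else max

    def dist(a, b):
        # Cluster-to-cluster distance under the chosen linkage rule.
        return pick(abs(x - y) for x in a for y in b)

    while len(clusters) > k:
        # Find the pair of clusters (i < j) with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]   # merge j into i
        del clusters[j]                           # safe because j > i
    return [sorted(c) for c in clusters]


# Hypothetical sample data with steadily growing gaps between neighbours.
points = [0, 10, 21, 33, 46, 65, 70]
single = agglomerate(points, 2, "single")
complete = agglomerate(points, 2, "complete")
print("single link:  ", single)
print("complete link:", complete)
```

On this sample, single link chains the first five points into one long cluster, while complete link splits the data into two more compact groups, which is the usual qualitative contrast between the two rules.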

The www.vivisimo.com meta-search engine, featuring clustering of search results. See also www.iboogie.com. For more information on search engines that cluster their results, see http://www.llrx.com/features/clusteringsearch.htm and
http://www.llrx.com/features/clusteringsearch2.htm

The Reuters collection is available at http://www.daviddlewis.com/resources/testcollections/reuters21578/

Classics:

A. Jain, M. Murty, and P. Flynn, "Data Clustering: A Review", ACM Computing Surveys, 31(3), September 1999. (pdf)
P. Willett, "Recent Trends in Hierarchic Document Clustering: A Critical Review", Information Processing and Management, 24(5), 577-597, 1988.
E. Voorhees, "The Cluster Hypothesis Revisited", Proceedings of SIGIR 1985, 95-104.

Scatter/Gather:

Douglass R. Cutting, David R. Karger, Jan O. Pedersen and John W. Tukey, "Scatter/Gather: a cluster-based approach to browsing large document collections", SIGIR'92.
Douglass R. Cutting, David R. Karger, Jan O. Pedersen, "Constant interaction-time scatter/gather browsing of very large document collections", SIGIR'93
Marti Hearst and Jan Pedersen, "Reexamining the Cluster Hypothesis: Scatter/Gather on Retrieval Results", SIGIR'96. (pdf)

Star Clustering:

Javed Aslam, Alain Leblanc and Cliff Stein, "Clustering Data without Prior Knowledge", Algorithm Engineering: 4th International Workshop, Springer LNCS 1982. (ps)
Javed Aslam, Katya Pelekhov and Daniela Rus, "Using Star Clusters for Filtering", CIKM 2000. (pdf)
Javed Aslam, Katya Pelekhov and Daniela Rus, "Generating, Visualizing and Evaluating High-Quality Clusters for Information Organization", Principles of Digital Document Processing (PODDP'98), Springer LNCS 1481. (ps)
Javed Aslam, Katya Pelekhov and Daniela Rus, "Static and Dynamic Information Organization with Star Clusters", CIKM'98 (ps)
Javed Aslam, Katya Pelekhov and Daniela Rus, "A Practical Clustering Algorithm for Static and Dynamic Information Organization", Proceedings of the Tenth Annual ACM-SIAM Symposium on Discrete Algorithms, 1999 (ps)

Parallel and/or Distributed Clustering:

H. Kargupta, W. Huang, K. Sivakumar and E. Johnson, "Distributed Clustering Using Collective Principal Component Analysis", Knowledge and Information Systems, 3(4), November 2001, 422-448.
Clark Olson, "Parallel Algorithms for Hierarchical Clustering", Parallel Computing 21:1313-1325, 1995. (pdf)
Lots more!

High Dimensionality:

Chen Li, Edward Chang, Hector Garcia-Molina, James Ze Wang and Gio Wiederhold, "Clindex: Clustering for Similarity Queries in High-Dimensional Spaces", Stanford Technical Report SIDL-WP-1998-0100, Feb. 1999. (pdf)
Lots more!

Papers by the Presenters:

Jacob Kogan, Charles Nicholas, and Vladimir Volkovich, "Text Mining with Information-Theoretical Clustering", Computing in Science and Engineering, accepted May 2003. (pdf)
Jacob Kogan, Marc Teboulle, and Charles Nicholas, “Optimization approach to generating families of k-means like algorithms”, Workshop on Clustering High Dimensional Data and its Applications, held in conjunction with the Third SIAM International Conference on Data Mining (SDM 2003), May 3, 2003. (pdf)
Jacob Kogan, Charles Nicholas, and Vladimir Volkovich, “Text Mining with Hybrid Clustering Schemes”, Workshop on Text Mining, held in conjunction with the Third SIAM International Conference on Data Mining (SDM 2003), May 3, 2003, pages 5-16. (pdf)
Jacob Kogan, Marc Teboulle, and Charles Nicholas, “The Entropic Geometric Means Algorithm: An Approach to Building Small Clusters for Large Text Datasets”, International Conference on Data Mining, workshop on Clustering Large Datasets, accepted October 2003. (pdf)
Jacob Kogan, Marc Teboulle, and Charles Nicholas, "Data Driven Similarity Measures for k-Means Like Clustering Algorithms", January 2003. (pdf)

Some Clustering Software:

The clustering software used in the whisky study is available from Clustan

CLUTO is a software package for clustering low- and high-dimensional datasets and for analyzing the characteristics of the various clusters.

Frank Dellaert's clustering software for Matlab.

http://www.kdnuggets.com/software/clustering.html