Clustering Large and High-Dimensional Data
Charles Nicholas, Department of Computer Science and Electrical Engineering,
UMBC
nicholas@umbc.edu
Jacob Kogan, Department of Mathematics and Statistics, UMBC
kogan@umbc.edu
Marc Teboulle, School of Mathematical Sciences, Tel-Aviv University
teboulle@post.tau.ac.il
This page is under occasional construction. Comments and corrections are welcome!
The current version of the tutorial: Nicholas (pdf), Kogan (pdf), Teboulle (pdf).
A two-up handout form of Nicholas (pdf) is also available.
From the Tutorial:
Alan F. Smeaton, Gary Keogh, Cathal Gurrin, Kieran McDonald and Tom Sødring,
"Analysis of papers from twenty-five years of SIGIR conferences: what have
we been doing for the last quarter of a century?", ACM SIGIR Forum, Volume
36, Issue 2, Fall 2002, pages 39-43. (pdf)
C. J. van Rijsbergen, Information Retrieval, Second Edition, page 103.
E. Rasmussen, "Clustering Algorithms", in Information Retrieval:
Data Structures and Algorithms, William Frakes and Ricardo Baeza-Yates,
editors, Prentice Hall, 1992.
The whisky dendrogram is from http://www.whiskyclassified.com/classification.html
The clustering software used is available from Clustan.
(The presenters have no commercial interest in Clustan or any other software
vendor mentioned in this tutorial.)
Octave code for single link clustering, complete
link clustering, and a comparison of the two. Requires Octave
to run. (Could be converted to Matlab with some effort.)
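As a rough illustration of the two linkage rules named above (this is not the presenters' Octave code), the following Python sketch merges clusters greedily until k remain. The function name `agglomerate`, its parameters, and the sample points are all assumptions made for this example. Single link scores a pair of clusters by their closest members; complete link scores it by their farthest members.

```python
from itertools import combinations


def agglomerate(points, k, linkage="single"):
    """Greedy agglomerative clustering sketch (hypothetical helper).

    points  -- list of equal-length numeric tuples
    k       -- number of clusters to stop at
    linkage -- "single" (closest pair) or "complete" (farthest pair)
    """
    def dist(a, b):
        # Euclidean distance between two points
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    # Single link takes the min over cross-cluster pairs, complete link the max.
    pick = min if linkage == "single" else max

    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Choose the two clusters with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: pick(
                dist(a, b) for a in clusters[ij[0]] for b in clusters[ij[1]]
            ),
        )
        # i < j always holds, so deleting j does not shift index i.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Sample 1-D points (made up for illustration) with gradually widening gaps.
pts = [(0.0,), (1.0,), (2.5,), (4.5,), (6.8,)]

single = agglomerate(pts, 2, "single")
# single   -> [[(0.0,), (1.0,), (2.5,), (4.5,)], [(6.8,)]]
complete = agglomerate(pts, 2, "complete")
# complete -> [[(0.0,), (1.0,)], [(2.5,), (4.5,), (6.8,)]]
```

On these points single link chains along the widening gaps and leaves the last point alone, while complete link settles on two more compact groups. The sketch recomputes all pairwise distances on every merge, so it is only suitable for very small inputs.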
The www.vivisimo.com meta-search engine,
featuring clustering of search results. See also www.iboogie.com.
For more information on search engines that cluster their results, see http://www.llrx.com/features/clusteringsearch.htm
and
http://www.llrx.com/features/clusteringsearch2.htm
The Reuters collection is available at http://www.daviddlewis.com/resources/testcollections/reuters21578/
Classics:
- A. Jain, M. Murty, and P. Flynn, "Data Clustering: A Review", ACM Computing
Surveys, 31(3), September 1999. (pdf)
- P. Willett, "Recent Trends in Hierarchic Document Clustering: A Critical
Review", Information Processing and Management, 24(5), 577-597, 1988.
- E. Voorhees, "The Cluster Hypothesis Revisited", Proceedings of
SIGIR 1985, 95-104.
Scatter/Gather:
- Douglass R. Cutting, David R. Karger, Jan O. Pedersen and John W. Tukey,
"Scatter/Gather: a cluster-based approach to browsing large document
collections", SIGIR'92.
- Douglass R. Cutting, David R. Karger, Jan O. Pedersen, "Constant interaction-time
scatter/gather browsing of very large document collections", SIGIR'93
- Marti Hearst and Jan Pedersen, "Reexamining the Cluster Hypothesis: Scatter/Gather
on Retrieval Results", SIGIR'96 (pdf)
Star Clustering:
- Javed Aslam, Alain Leblanc and Cliff Stein, "Clustering Data without
Prior Knowledge", Algorithm Engineering: 4th International Workshop,
Springer LNCS 1982, (ps)
- Javed Aslam, Katya Pelekhov and Daniela Rus, "Using Star Clusters for
Filtering", CIKM 2000, (pdf)
- Javed Aslam, Katya Pelekhov and Daniela Rus, "Generating, Visualizing
and Evaluating High-Quality Clusters for Information Organization", Principles
of Digital Document Processing (PODDP'98), Springer LNCS 1481, (ps)
- Javed Aslam, Katya Pelekhov and Daniela Rus, "Static and Dynamic Information
Organization with Star Clusters", CIKM'98 (ps)
- Javed Aslam, Katya Pelekhov and Daniela Rus, "A Practical Clustering
Algorithm for Static and Dynamic Information Organization", Proceedings
of the Tenth Annual ACM-SIAM Symposium on Discrete Algorithms, 1999 (ps)
Parallel and/or Distributed Clustering:
- H. Kargupta, W. Huang, K. Sivakumar and E. Johnson, "Distributed Clustering
Using Collective Principal Component Analysis", Knowledge and Information Systems,
3(4), November 2001, 422-448
- Clark Olson, "Parallel Algorithms for Hierarchical Clustering",
Parallel Computing 21:1313-1325, 1995 (pdf)
- Lots more!
High Dimensionality:
- Chen Li, Edward Chang, Hector Garcia-Molina, James Ze Wang and Gio Wiederhold,
"Clindex: Clustering for Similarity Queries in High-Dimensional Spaces",
Stanford Technical Report SIDL-WP-1998-0100, Feb. 1999. (pdf)
- Lots more!
Papers by the Presenters:
- Jacob Kogan, Charles Nicholas, and Vladimir Volkovich, "Text Mining
with Information-Theoretical Clustering", Computing in Science and Engineering,
accepted May 2003. (pdf)
- Jacob Kogan, Marc Teboulle, and Charles Nicholas, “Optimization approach
to generating families of k-means like algorithms”, Workshop on Clustering
High Dimensional Data and its Applications, held in conjunction with the Third
SIAM International Conference on Data Mining (SDM 2003), May 3, 2003. (pdf)
- Jacob Kogan, Charles Nicholas, and Vladimir Volkovich, “Text Mining
with Hybrid Clustering Schemes”, Workshop on Text Mining, held in conjunction
with the Third SIAM International Conference on Data Mining (SDM 2003), May
3, 2003, pages 5-16. (pdf)
- Jacob Kogan, Marc Teboulle, and Charles Nicholas, “The Entropic Geometric
Means Algorithm: An Approach to Building Small Clusters for Large Text Datasets”,
International Conference on Data Mining, workshop on Clustering Large Datasets,
accepted October 2003. (pdf)
- Jacob Kogan, Marc Teboulle, and Charles Nicholas, "Data Driven Similarity
Measures for k-Means Like Clustering Algorithms", January 2003. (pdf)
Some Clustering Software:
The clustering software used in the whisky study is available from Clustan.
CLUTO is a software
package for clustering low- and high-dimensional datasets and for analyzing
the characteristics of the various clusters.
Frank Dellaert's clustering software
for Matlab.
http://www.kdnuggets.com/software/clustering.html