clustering-benchmark

Stars
157

Clustering datasets

Datasets

This project contains collection of labeled clustering problems that can be found in the literature. Most of datasets were artificially created.

All datasets can be found link:https://github.com/deric/clustering-benchmark/tree/master/src/main/resources/datasets/artificial[data folder].

2d-10c

[align="right",options="header"] |=== | data points | clusters | dimension | 2990 | 10 | 2 |===

image::https://github.com/deric/clustering-benchmark/blob/images/fig/artificial/2d-10c.png["2d-10c",400,float="left"]

[.float .right]

J. Handl and J. Knowles, “Multiobjective clustering with automatic determination of the number of clusters,” UMIST, Tech. Rep., 2004.

atom

[align="right",options="header"] |=== | data points | clusters | dimension | 800 | 2 | 3 |===

image::https://github.com/deric/clustering-benchmark/blob/images/fig/artificial/atom.png["atom",400,float="left"]

[.float .right]

aggregation

[align="right",options="header"] |=== | data points | clusters | dimension | 788 | 7 | 2 |===

image::https://github.com/deric/clustering-benchmark/blob/images/fig/artificial/aggregation.png[aggregation,400,float="left"]

[.float .right]

Gionis, A., H. Mannila, and P. Tsaparas, Clustering aggregation. ACM Transactions on Knowledge Discovery from Data (TKDD), 2007. 1(1): p. 1-30.

chainlink

[align="right",options="header"] |=== | data points | clusters | dimension | 1000 | 2 | 3 |===

image::https://github.com/deric/clustering-benchmark/blob/images/fig/artificial/chainlink.png["chainlink",400,float="left"]

[.float .right]

Alfred Ultsch, Clustering with SOM: U*C, in Proc. Workshop on Self Organizing Feature Maps ,pp 31-37 Paris 2005.

D31

[align="right",style="asciidoc",options="noborders,wide"] |=== | data points | 3100 | clusters | 31 | dimensions | 2 | image::https://github.com/deric/clustering-benchmark/blob/images/fig/artificial/D31.png["D31",400,float="left"] | * link:https://github.com/deric/clustering-benchmark/blob/master/src/main/resources/datasets/artificial/D31.arff[ARFF] |===

Veenman, C.J., M.J.T. Reinders, and E. Backer, A maximum variance cluster algorithm. IEEE Trans. Pattern Analysis and Machine Intelligence 2002. 24(9): p. 1273-1280.

3MC

[align="right",options="header",style="literal"] |=== | data points | clusters | dimension | 400 | 3 | 2 |===

[.float .right] image::https://github.com/deric/clustering-benchmark/blob/images/fig/artificial/3MC.png["3MC",400,float="left"]

DS577

[align="right",options="header"] |=== | data points | clusters | dimension | 577 | 3 | 2 |===

image::https://github.com/deric/clustering-benchmark/blob/images/fig/artificial/DS577.png["D31",400,float="left"]

[.float .right]

M. C. Su, C. H. Chou, and C. C. Hsieh, “Fuzzy C-Means Algorithm with a Point Symmetry Distance,” International Journal of Fuzzy Systems, vol. 7, no. 4, pp. 175-181, 2005.

cluto-t4_8k

[align="right",options="header"] |=== | data points | clusters | dimension | 8000 | 7 | 2 |===

image::https://github.com/deric/clustering-benchmark/blob/images/fig/artificial/cluto-t4_8k.png["cluto-t4_8k",400,float="left"]

[.float .right]

G. Karypis, “CLUTO A Clustering Toolkit,” Dept. of Computer Science, University of Minnesota, Tech. Rep. 02-017, 2002, available at http://www.cs.umn.edu/ ̃cluto.

Experiments

This project contains set of clustering methods benchmarks on various dataset. The project is dependent on Clueminer project.

in order to run benchmark compile dependencies into a single JAR file:

mvn assembly:assembly

Consensus experiment

allows running repeated runs of the same algorithm:

./run consensus --dataset "triangle1" --repeat 10

by default k-means algorithm is used.

For available datasets see resources folder.