Clustering datasets

Datasets

This project contains a collection of labeled clustering problems that can be found in the literature. Most of the datasets were artificially created.

All datasets can be found in the link:https://github.com/deric/clustering-benchmark/tree/master/src/main/resources/datasets/artificial[data folder].
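The dataset links in this document point to `.arff` files. As a minimal sketch, independent of this project's own code, the following Python snippet parses ARFF text into points and labels. It assumes dense numeric attributes with the cluster label in the last column, which may not hold for every file; the `parse_arff` helper and the sample data are hypothetical.

```python
def parse_arff(text):
    """Parse a dense ARFF document into (attribute_names, points, labels).

    Assumption: every attribute except the last is numeric, and the last
    attribute holds the cluster label.
    """
    names, points, labels = [], [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        lower = line.lower()
        if not in_data:
            if lower.startswith("@attribute"):
                names.append(line.split()[1])  # second token is the attribute name
            elif lower.startswith("@data"):
                in_data = True
            continue
        # Data row: all fields but the last are coordinates, the last is the label.
        *coords, label = [field.strip() for field in line.split(",")]
        points.append([float(c) for c in coords])
        labels.append(label)
    return names, points, labels


# Hypothetical sample, not taken from the repository.
SAMPLE = """\
% toy example
@relation toy
@attribute x numeric
@attribute y numeric
@attribute class {a,b}
@data
0.0,1.0,a
2.5,3.5,b
"""

names, points, labels = parse_arff(SAMPLE)
print(names, points, labels)
```

For real files, an existing reader such as `scipy.io.arff.loadarff` is likely a better choice than a hand-written parser.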

2d-10c

[align="right",options="header"]
|===
| data points | clusters | dimension
| 2990 | 10 | 2
|===

image::https://github.com/deric/clustering-benchmark/blob/images/fig/artificial/2d-10c.png["2d-10c",400,float="left"]

[.float .right]

J. Handl and J. Knowles, “Multiobjective clustering with automatic determination of the number of clusters,” UMIST, Tech. Rep., 2004.

atom

[align="right",options="header"]
|===
| data points | clusters | dimension
| 800 | 2 | 3
|===

image::https://github.com/deric/clustering-benchmark/blob/images/fig/artificial/atom.png["atom",400,float="left"]

[.float .right]

aggregation

[align="right",options="header"]
|===
| data points | clusters | dimension
| 788 | 7 | 2
|===

image::https://github.com/deric/clustering-benchmark/blob/images/fig/artificial/aggregation.png[aggregation,400,float="left"]

[.float .right]

Gionis, A., H. Mannila, and P. Tsaparas, Clustering aggregation. ACM Transactions on Knowledge Discovery from Data (TKDD), 2007. 1(1): p. 1-30.

chainlink

[align="right",options="header"]
|===
| data points | clusters | dimension
| 1000 | 2 | 3
|===

image::https://github.com/deric/clustering-benchmark/blob/images/fig/artificial/chainlink.png["chainlink",400,float="left"]

[.float .right]

Alfred Ultsch, Clustering with SOM: U*C, in Proc. Workshop on Self Organizing Feature Maps, pp. 31-37, Paris, 2005.

D31

[align="right",style="asciidoc",options="noborders,wide"]
|===
| data points | 3100
| clusters | 31
| dimensions | 2
| image::https://github.com/deric/clustering-benchmark/blob/images/fig/artificial/D31.png["D31",400,float="left"]
| * link:https://github.com/deric/clustering-benchmark/blob/master/src/main/resources/datasets/artificial/D31.arff[ARFF]
|===

Veenman, C.J., M.J.T. Reinders, and E. Backer, A maximum variance cluster algorithm. IEEE Trans. Pattern Analysis and Machine Intelligence 2002. 24(9): p. 1273-1280.

3MC

[align="right",options="header",style="literal"]
|===
| data points | clusters | dimension
| 400 | 3 | 2
|===

[.float .right]
image::https://github.com/deric/clustering-benchmark/blob/images/fig/artificial/3MC.png["3MC",400,float="left"]

DS577

[align="right",options="header"]
|===
| data points | clusters | dimension
| 577 | 3 | 2
|===

image::https://github.com/deric/clustering-benchmark/blob/images/fig/artificial/DS577.png["DS577",400,float="left"]

[.float .right]

link:https://github.com/deric/clustering-benchmark/blob/master/src/main/resources/datasets/artificial/DS577.arff[ARFF]

M. C. Su, C. H. Chou, and C. C. Hsieh, “Fuzzy C-Means Algorithm with a Point Symmetry Distance,” International Journal of Fuzzy Systems, vol. 7, no. 4, pp. 175-181, 2005.

cluto-t4_8k

[align="right",options="header"]
|===
| data points | clusters | dimension
| 8000 | 7 | 2
|===

image::https://github.com/deric/clustering-benchmark/blob/images/fig/artificial/cluto-t4_8k.png["cluto-t4_8k",400,float="left"]

[.float .right]

link:https://github.com/deric/clustering-benchmark/blob/master/src/main/resources/datasets/artificial/cluto-t4.8k.arff[ARFF]

G. Karypis, “CLUTO A Clustering Toolkit,” Dept. of Computer Science, University of Minnesota, Tech. Rep. 02-017, 2002, available at http://www.cs.umn.edu/~cluto.

Experiments

This project contains a set of benchmarks of clustering methods on various datasets. The project depends on the Clueminer project.

To run a benchmark, compile the dependencies into a single JAR file:

mvn assembly:assembly

Consensus experiment

This experiment runs the same algorithm repeatedly:

./run consensus --dataset "triangle1" --repeat 10

By default, the k-means algorithm is used.

For available datasets, see the resources folder.
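To illustrate why repeating a run is informative, here is a minimal Python sketch of running k-means several times with different random initialisations and counting how many distinct partitions result. It is not this project's implementation, and how the `consensus` command actually compares runs is not described here; the `kmeans` and `canonical` helpers and the toy points are hypothetical.

```python
import random


def kmeans(points, k, seed, iterations=100):
    """Lloyd's k-means on a list of coordinate tuples; returns one label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k distinct data points
    labels = None
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:  # no assignment changed, so the run has converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def canonical(labels):
    """Relabel clusters by order of first appearance so equal partitions compare equal."""
    mapping = {}
    return tuple(mapping.setdefault(l, len(mapping)) for l in labels)


# Toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]

# Ten runs, each with a different seed and therefore a different initialisation.
runs = [canonical(kmeans(points, k=2, seed=s)) for s in range(10)]
distinct = set(runs)
print(f"{len(distinct)} distinct partition(s) over {len(runs)} runs")
```

On well-separated data like this toy sample, every run should converge to the same partition. On harder datasets the runs can disagree, which is the variation a repeated-run experiment exposes.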
