Natural Breaks classification with the Jenks-Caspall algorithm
This is an attempt to implement the Jenks-Caspall reiterative forcing algorithm.
This script works correctly as is, but it is still in development. Future improvements will concern the class design and the cluster initialization method.
The well-known Natural Breaks classification can be computed with two algorithms:

- Jenks-Fisher, which finds the optimal partition exactly;
- Jenks-Caspall, an iterative heuristic, which is the one implemented here.
There are solid Python implementations of Jenks-Fisher available, but they can't process a large number of values in a reasonable time. The algorithm makes this somewhat inevitable: the number of possible partitions grows combinatorially with the number of values, so finding the optimal one becomes increasingly slow. Software that uses Jenks-Fisher therefore usually samples the data beforehand to keep the running time acceptable. But using an optimal method on only a sample of the data is nonsense unless the sample was taken very carefully.
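For intuition (an illustration only, not part of this script): splitting n ordered values into k contiguous non-empty classes means choosing k-1 break positions among the n-1 gaps between values, i.e. comb(n-1, k-1) candidate partitions:

```python
from math import comb

# Number of candidate partitions of n ordered values into k = 5
# contiguous, non-empty classes: comb(n - 1, k - 1).
for n in (100, 1_000, 10_000):
    print(n, comb(n - 1, 4))
```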
The Jenks-Caspall algorithm was presented in the paper "Error on choroplethic maps: definition, measurement, reduction" (Jenks GF, Caspall FC, 1971).
The algorithm is subdivided into two steps:

- a reiterative cycling phase, similar to K-means, in which values are moved between adjacent classes as long as the total error decreases (a sketch of this phase is given below);
- a forced cycling phase, in which values are forced across class boundaries even when the error temporarily increases, in order to escape local optima.
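Not this script's implementation, just a minimal sketch of the reiterative phase under simplifying assumptions (sorted 1-D data with at least k values, squared-deviation error, NumPy available); the name jenks_caspall_sketch and its details are illustrative:

```python
import numpy as np

def jenks_caspall_sketch(values, k, max_iter=100):
    """Sketch of the reiterative phase; the forcing phase is only indicated."""
    x = np.sort(np.asarray(values, dtype=float))
    n = len(x)
    # Arbitrary initial breaks: split the sorted array into k equal runs.
    breaks = [round(i * n / k) for i in range(1, k)]

    def total_error(brk):
        # Total squared deviation of every value from its class mean.
        bounds = [0] + brk + [n]
        return sum(((x[a:b] - x[a:b].mean()) ** 2).sum()
                   for a, b in zip(bounds[:-1], bounds[1:]))

    # Reiterative cycling: move one value across a class boundary whenever
    # the move lowers the total error; repeat until nothing improves.
    best = total_error(breaks)
    for _ in range(max_iter):
        improved = False
        for j in range(k - 1):
            for delta in (-1, 1):
                trial = breaks.copy()
                trial[j] += delta
                lo = (trial[j - 1] if j > 0 else 0) + 1      # keep every
                hi = (trial[j + 1] if j < k - 2 else n) - 1  # class non-empty
                if lo <= trial[j] <= hi:
                    err = total_error(trial)
                    if err < best:
                        breaks, best, improved = trial, err, True
        if not improved:
            break
    # The forced cycling phase would now push values across boundaries even
    # when the error temporarily rises, to escape local optima.
    return [x[0]] + [x[b] for b in breaks] + [x[-1]]
```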
To find the best partition, Jenks introduced three error criteria:

- tabular error, which measures how far each value lies from the representative value of its class;
- overview error;
- boundary error.
In the original paper, Jenks focused on tabular error, primarily because of its simplicity.
Nowadays, another index, the Goodness of Variance Fit (GVF), is more widely used.
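As a hedged sketch (the exact definitions used by this script or by the paper may differ in detail): tabular error is often read as the sum of absolute deviations of each value from its class mean, and GVF as (SDAM - SDCM) / SDAM, where SDAM is the sum of squared deviations from the array mean and SDCM the sum of squared deviations from the class means:

```python
import numpy as np

def tabular_error(classes):
    # One common reading of tabular error: the sum of absolute deviations
    # of every value from the mean of its class.
    return sum(np.abs(c - c.mean()).sum() for c in classes)

def gvf(values, classes):
    # Goodness of Variance Fit = (SDAM - SDCM) / SDAM; 1.0 is a perfect fit.
    values = np.asarray(values, dtype=float)
    sdam = ((values - values.mean()) ** 2).sum()
    sdcm = sum(((c - c.mean()) ** 2).sum() for c in classes)
    return (sdam - sdcm) / sdam

data = np.array([1.0, 2.0, 9.0, 10.0])
classes = [data[:2], data[2:]]
print(tabular_error(classes), gvf(data, classes))  # 2.0  ~0.985
```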
Notes about K-means
K-means is generally used to classify multi-dimensional data such as 2D points. Traditional K-means implementations can be heavily optimized to process one-dimensional values, which are completely ordered.
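For instance, because 1-D clusters are contiguous intervals, assigning values to their nearest centroid reduces to a binary search against the midpoints between sorted centroids. A small sketch assuming NumPy (assign_1d is a hypothetical helper, not part of this script):

```python
import numpy as np

def assign_1d(values, centroids):
    """Nearest-centroid assignment in 1-D via binary search."""
    c = np.sort(centroids)
    midpoints = (c[:-1] + c[1:]) / 2.0         # boundaries between clusters
    return np.searchsorted(midpoints, values)  # cluster index per value

print(assign_1d(np.array([1.0, 2.0, 9.0, 10.0]), np.array([9.5, 1.5])))  # [0 0 1 1]
```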
There have been many variations of and studies on the K-means algorithm over the past 50 years. One of the most extensively discussed problems concerns the initialization method, because K-means performance is extremely sensitive to the initialization of the cluster centers. A bad initialization can lead to poor convergence speed and bad overall clustering.
One of the more recent proposals to address this problem is the K-means++ algorithm (Arthur and Vassilvitskii, 2007). It initializes the partition with centroids that tend to be far away from each other. The first cluster center is chosen uniformly at random, and each subsequent center is chosen with a probability proportional to its squared distance from the closest existing center.
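A minimal sketch of K-means++ seeding on one-dimensional data (kmeans_pp_init is a hypothetical helper, not this script's API):

```python
import numpy as np

def kmeans_pp_init(values, k, rng=None):
    """K-means++ seeding (Arthur and Vassilvitskii, 2007) on 1-D data."""
    rng = np.random.default_rng(rng)
    values = np.asarray(values, dtype=float)
    centers = [rng.choice(values)]  # first center: uniform at random
    for _ in range(k - 1):
        # Squared distance from every value to its closest existing center.
        d2 = np.min((values[:, None] - np.array(centers)[None, :]) ** 2, axis=1)
        # Draw the next center with probability proportional to d2.
        centers.append(rng.choice(values, p=d2 / d2.sum()))
    return np.sort(np.array(centers))

print(kmeans_pp_init([1.0, 2.0, 9.0, 10.0, 50.0], k=3, rng=0))
```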
Note that random-based initialization will give different results across runs. It is common practice to run K-means multiple times and keep the best resulting partition.
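For example, with SciPy's kmeans2 one might keep the run with the lowest within-class squared error (illustrative only; this script may select its partition differently):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

data = np.random.default_rng(0).normal(size=500)

best = None
for seed in range(10):  # ten independent K-means runs
    centroids, labels = kmeans2(data, 5, minit='++', seed=seed)
    sse = ((data - centroids[labels]) ** 2).sum()  # within-class error
    if best is None or sse < best[0]:
        best = (sse, centroids, labels)

print("best within-class squared error:", best[0])
```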