June 21, 2019

A summary of K-means evaluation strategies

Inertia, or the within-cluster sum of squared distances, is a key measure of how internally coherent the clusters are. It is calculated as the sum of the squared distances between each point and its nearest centroid, so a lower value indicates tighter clusters.
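As a minimal sketch of that calculation, the snippet below computes inertia in plain Python. The function name `inertia` and the toy points and centroids are made up for illustration and do not come from the paper; in scikit-learn, a fitted `KMeans` model exposes the same quantity as its `inertia_` attribute.

```python
def inertia(points, centroids):
    """Sum of squared Euclidean distances from each point to its nearest centroid."""
    total = 0.0
    for p in points:
        # Squared distance from p to every centroid; keep the smallest one.
        total += min(
            sum((x - c) ** 2 for x, c in zip(p, centroid))
            for centroid in centroids
        )
    return total


# Toy data (illustrative only): two tight groups, one centroid per group.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centroids = [(0.0, 1.0), (10.0, 1.0)]

# Every point lies exactly 1 unit from its nearest centroid.
print(inertia(points, centroids))  # 4.0
```

Using a single centroid for the same points gives a much larger value, which is why inertia is usually compared across runs with the same number of clusters.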

The result of clustering should also satisfy homogeneity, which means that each cluster contains only data points that belong to a single class. The score is independent of the actual label values, so permuting the labels does not change it. It ranges from 0.0 to 1.0, where 1.0 indicates perfectly homogeneous clusters.

Completeness measures how well the K-means algorithm assigns all the data points with a given label to the same cluster. Like homogeneity, the score ranges from 0.0 to 1.0.

V-measure is the harmonic mean of homogeneity and completeness, so it is high only when both criteria are satisfied. This score also ranges from 0.0 to 1.0.
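The three label-based scores can be sketched together. The code below assumes the usual entropy-based definitions (homogeneity = 1 − H(C|K)/H(C), completeness = 1 − H(K|C)/H(K), with C the true classes and K the clusters); these formulas are not spelled out in the text above, and the function names and toy labels are illustrative. scikit-learn provides equivalent functions as `homogeneity_score`, `completeness_score` and `v_measure_score` in `sklearn.metrics`.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy H(labels) in nats."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """Conditional entropy H(target | given) in nats."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum((c / n) * log(c / given_counts[g]) for (g, _), c in joint.items())


def homogeneity_completeness_v(true_labels, cluster_labels):
    """Return (homogeneity, completeness, V-measure), each between 0.0 and 1.0."""
    h_c = entropy(true_labels)
    h_k = entropy(cluster_labels)
    # By convention a score is 1.0 when the corresponding entropy is zero.
    hom = 1.0 if h_c == 0 else 1.0 - conditional_entropy(true_labels, cluster_labels) / h_c
    com = 1.0 if h_k == 0 else 1.0 - conditional_entropy(cluster_labels, true_labels) / h_k
    # V-measure: harmonic mean of the two scores.
    v = 0.0 if hom + com == 0 else 2 * hom * com / (hom + com)
    return hom, com, v


# Same grouping with the cluster ids swapped: all three scores are 1.0,
# which shows that the actual label values do not matter.
print(homogeneity_completeness_v([0, 0, 1, 1], [1, 1, 0, 0]))

# Every point in its own cluster: perfectly homogeneous but only half complete.
print(homogeneity_completeness_v([0, 0, 1, 1], [0, 1, 2, 3]))
```

The second call illustrates why V-measure combines both scores: splitting every point into its own cluster maximizes homogeneity while completeness drops, and the harmonic mean penalizes that imbalance.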

The Silhouette Coefficient for a sample is defined as:

s = (b − a) / max(a, b)

where a is the mean intra-cluster distance (the mean distance between the sample and the other points in its own cluster) and b is the mean nearest-cluster distance (the mean distance between the sample and the points of the nearest cluster it does not belong to). The coefficient ranges from −1 to 1, where 1 is the best result and −1 is the worst. The higher the Silhouette Coefficient, the better the samples fit the clusters they were assigned to.
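A small sketch of the per-sample calculation follows. It computes a and b with Euclidean distance and combines them as (b − a) / max(a, b). The function name, the toy points and the choice to score singleton clusters as 0.0 are assumptions made for illustration; scikit-learn offers `silhouette_samples` and `silhouette_score` in `sklearn.metrics` for real use.

```python
from math import dist  # Euclidean distance, Python 3.8+


def silhouette_samples(points, labels):
    """Silhouette Coefficient of each sample; needs at least two clusters."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)

    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for a cluster of one point
            continue
        # a: mean distance to the other points of the same cluster
        # (dist(p, p) is 0, so dividing by len(own) - 1 excludes p itself).
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the points of any other cluster.
        b = min(
            sum(dist(p, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != label
        )
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return scores


# Toy data (illustrative only): two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]

good = silhouette_samples(points, [0, 0, 1, 1])  # sensible assignment
bad = silhouette_samples(points, [0, 1, 0, 1])   # mixes the two groups

print(sum(good) / len(good))  # close to 1
print(sum(bad) / len(bad))    # negative
```

Averaging the per-sample values, as in the last two lines, gives a single score for the whole clustering.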

Source: Li, B.Y. (2018) An Experiment of K-Means Initialization Strategies on Handwritten Digits Dataset. Intelligent Information Management, 10, 43-48.

Source URL: https://www.scirp.org/journal/paperinformation.aspx?paperid=82761

