Data Clustering Methods (Cont.)


Similarity Measures
In order to cluster the items in a data set, some means of quantifying the degree of association between them is required. This may be a distance measure, or a measure of similarity or dissimilarity.


where C is the number of terms that Di and Dj have in common, and A and B are the number of terms in Di and Dj.

An Example of Data Clustering Using MST Algorithms
A minimal spanning tree (MST) of a weighted graph is a collection of edges connecting all the vertexes such that the sum of the weights of the edges is at least as small as the sum of the weights of any other collection of edges connecting all the vertexes. Or an MST is a tree linking N objects with N-1 connections so that there are no loops and the sum of the N-1 dissimilarities is minimized. Once an MST has been constructed, the corresponding single link hierarchy can be generated in O(N2) operations.


      “Take it moment by moment, and you will find that    
      we are all, as I’ve said before, bugs in amber.”    
      ― Kurt Vonnegut