Data Clustering Methods


Nonhierarchical Methods
Partitioning methods divide a data set of N objects into M clusters with no overlap allowed, so each object belongs to exactly one cluster. Similar items are placed in the same cluster, and each cluster may be represented by a centroid, or cluster representative, that is indicative of the characteristics of the items it contains. One simple approach, the single-pass method, proceeds as follows:
  1. Assign the first document D1 as the representative for C1.
  2. For the next document Di, calculate the similarity S with the representative of each existing cluster.
  3. If Smax, the largest of these similarities, is greater than a threshold value St, add Di to the corresponding cluster and recalculate that cluster's representative; otherwise, use Di to initiate a new cluster.
  4. If any items remain to be clustered, return to Step 2.
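
The four steps above can be sketched in Python. This is only an illustration under stated assumptions, not a definitive implementation: the notes do not say how similarity is measured or how the representative is formed, so cosine similarity and a mean-vector centroid are assumed here, and the names `cosine` and `single_pass` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def single_pass(items, threshold):
    """Cluster items in one pass; returns lists of item indices, one per cluster."""
    clusters = []    # member indices of each cluster
    centroids = []   # representative of each cluster (assumed: mean vector)
    for i, item in enumerate(items):
        # Step 2: similarity of the item to every existing representative.
        sims = [cosine(item, c) for c in centroids]
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is not None and sims[best] > threshold:
            # Step 3: Smax exceeds St, so join that cluster and
            # recalculate its representative.
            clusters[best].append(i)
            members = [items[j] for j in clusters[best]]
            centroids[best] = [sum(col) / len(members) for col in zip(*members)]
        else:
            # Step 1 (first item) or Step 3 (no cluster is similar enough):
            # the item starts a new cluster and is its representative.
            clusters.append([i])
            centroids.append(list(item))
    return clusters
```

For example, `single_pass([[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]], 0.8)` groups the first two and the last two vectors. Because items are processed once and in order, the result can depend on the order of the input and on the choice of St.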

Hierarchical Methods
Hierarchical methods produce a nested data set in which pairs of items or clusters are successively linked until every item in the data set is connected. They can be either agglomerative, proceeding through N-1 pairwise joins from an unclustered data set, or divisive, beginning with all objects in a single cluster and proceeding through N-1 divisions of some cluster into smaller clusters. The agglomerative procedure is:
  1. Identify the two closest points (clusters) and combine them in a cluster.
  2. Identify and combine the next two closest points, treating existing clusters as points.
  3. If more than one cluster remains, return to Step 1.
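
The agglomerative steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the notes do not define the distance between two clusters, so single linkage (the smallest Euclidean distance between any two members) is assumed, and the name `agglomerate` is hypothetical. The naive search over all cluster pairs is chosen for readability, not speed.

```python
import math

def agglomerate(points):
    """Return the N-1 merges as (cluster_a, cluster_b, distance) tuples."""
    clusters = [[i] for i in range(len(points))]  # start fully unclustered
    merges = []

    def linkage(a, b):
        # Assumed single linkage: closest pair of members across two clusters.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    # Repeat while more than one cluster remains.
    while len(clusters) > 1:
        # Identify the two closest clusters (single points count as clusters).
        d, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merges.append((clusters[i], clusters[j], d))
        # Combine them and treat the result as one point from now on.
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges
```

Calling `agglomerate([(0, 0), (0, 1), (5, 0), (5, 1.5)])` yields three merges for four points, matching the N-1 pairwise joins described above; the recorded order of merges is the nested hierarchy.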