Slide 6.19: Data clustering methods

Data Clustering Methods

Nonhierarchical Methods
Partitioning methods divide a data set of N objects into M clusters with no overlap allowed, so each object belongs to exactly one cluster. Similar items are placed in the same cluster, and each cluster may be represented by a centroid or cluster representative that is indicative of the characteristics of the items it contains.

1. Assign the first document D₁ as the representative for C₁.
2. For D_i, calculate the similarity S with the representative for each existing cluster.
3. If S_max is greater than a threshold value S_t, add the item to the corresponding cluster and recalculate the cluster representative; otherwise, use D_i to initiate a new cluster.
4. If an item D_i remains to be clustered, return to Step 2.
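
The four steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the slide does not say how similarity is measured or how the representative is recalculated, so cosine similarity and the mean vector (centroid) are assumptions made here, and the names `cosine`, `centroid`, and `single_pass` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Component-wise mean, used here as the cluster representative (assumed)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def single_pass(docs, threshold):
    """Cluster docs in one pass; returns clusters as lists of indices into docs."""
    if not docs:
        return []
    clusters = [[0]]            # Step 1: D1 becomes the representative of C1
    reps = [list(docs[0])]
    for i in range(1, len(docs)):                   # Step 4: repeat for each remaining D_i
        sims = [cosine(docs[i], r) for r in reps]   # Step 2: S against every representative
        best = max(range(len(sims)), key=sims.__getitem__)
        if sims[best] > threshold:                  # Step 3: S_max > S_t
            clusters[best].append(i)
            reps[best] = centroid([docs[j] for j in clusters[best]])
        else:                                       # otherwise D_i starts a new cluster
            clusters.append([i])
            reps.append(list(docs[i]))
    return clusters


docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(single_pass(docs, threshold=0.8))
```

Because each item is compared only with the clusters that exist when it arrives, the result depends on the order of the documents and on the choice of S_t; a threshold above the maximum possible similarity puts every item in its own cluster.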

Hierarchical Methods
Hierarchical methods produce a nested data set in which pairs of items or clusters are successively linked until every item in the data set is connected. They can be either agglomerative, with N-1 pairwise joins beginning from an unclustered data set, or divisive, beginning with all objects in a single cluster and progressing through N-1 divisions, each splitting some cluster into smaller clusters.

1. Identify the two closest points (clusters) and combine them in a cluster.
2. Identify and combine the next two closest points, treating existing clusters as points.
3. If more than one cluster remains, return to Step 1.
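
The agglomerative procedure above can be sketched as a simple loop. The slide does not define the distance between two clusters, so this sketch assumes Euclidean distance between points and the single-link rule (the distance between two clusters is that of their closest pair of members). The names `euclidean`, `single_link`, and `agglomerate` are hypothetical, and the naive search over all pairs is chosen for clarity, not speed.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def single_link(points, c1, c2):
    """Cluster distance under the assumed single-link rule: closest pair of members."""
    return min(euclidean(points[i], points[j]) for i in c1 for j in c2)


def agglomerate(points):
    """Return the merges in order as (cluster_a, cluster_b, distance) tuples."""
    clusters = [(i,) for i in range(len(points))]   # start fully unclustered
    merges = []
    while len(clusters) > 1:                        # "if more than one cluster remains"
        # Identify the two closest clusters (single points count as clusters).
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: single_link(points, pair[0], pair[1]))
        distance = single_link(points, a, b)
        # Combine them into one cluster.
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
        merges.append((a, b, distance))
    return merges


points = [[0.0], [1.0], [5.0], [7.0]]
for a, b, distance in agglomerate(points):
    print(a, b, distance)
```

For N points the loop performs exactly N-1 joins, matching the count given for agglomerative methods, and the recorded merge order is the nested structure that a dendrogram would display. Steps 1 and 2 of the slide collapse into a single loop body here, since each pass simply merges the currently closest pair.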