Successor Variety (Cont.)


Once the successor varieties for a given word have been derived, this information is used to segment the word. There are four ways of doing this:
  1. Using the cutoff method, some cutoff value is selected for successor varieties and a boundary is identified whenever the cutoff is reached

  2. With the peak and plateau method, a segment break is made after a character whose successor variety exceeds that of the character immediately preceding it and the character immediately following it

  3. In the complete word method, a break is made after a segment if the segment is a complete word in the corpus

  4. The entropy method works as follows:
    1. Let |Dαi| be the number of words in a text body beginning with the i-length sequence of letters α.
    2. Let |Dαij| be the number of words in Dαi with the successor j.
    3. The probability that a member of Dαi has the successor j is given by |Dαij|/|Dαi|.
    4. The entropy of Dαi is
       Hαi = −Σj (|Dαij|/|Dαi|) × log2(|Dαij|/|Dαi|), where j = 1,...,26 ranges over the possible successor letters
    A set of entropy measures can be determined for a word, one for each of its prefixes. As in the cutoff method, a cutoff value is selected, and a boundary is identified whenever the cutoff value is reached. Unlike the raw successor variety count, the entropy takes into account how the successor letters are distributed.
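The four methods above can be sketched in Python. This is only a minimal illustration, not a reference implementation: the toy corpus, the function names, and the choice to ignore words that end exactly at the prefix (so they contribute no successor) are all assumptions made for the example.

```python
import math
from collections import Counter

# Hypothetical toy corpus, chosen only for illustration.
CORPUS = {"able", "ape", "beatable", "fixable", "read", "readable",
          "reading", "reads", "red", "rope", "ripe"}

def successor_counts(prefix, corpus):
    """Count each letter that directly follows `prefix` in the corpus.
    Assumption: words equal to the prefix contribute no successor."""
    return Counter(w[len(prefix)] for w in corpus
                   if w.startswith(prefix) and len(w) > len(prefix))

def successor_variety(prefix, corpus):
    """Number of distinct letters that follow `prefix`."""
    return len(successor_counts(prefix, corpus))

def successor_entropy(prefix, corpus):
    """Entropy of the successor-letter distribution after `prefix`."""
    counts = successor_counts(prefix, corpus)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def cut_by_threshold(word, corpus, cutoff, measure=successor_variety):
    """Cutoff method: cut after every prefix whose measure reaches `cutoff`.
    Passing measure=successor_entropy gives the entropy method."""
    return [i for i in range(1, len(word))
            if measure(word[:i], corpus) >= cutoff]

def cut_by_peak_plateau(word, corpus):
    """Cut after a character whose variety exceeds both neighbours'."""
    sv = [successor_variety(word[:i], corpus)
          for i in range(1, len(word) + 1)]
    return [k + 1 for k in range(1, len(sv) - 1)
            if sv[k] > sv[k - 1] and sv[k] > sv[k + 1]]

def cut_by_complete_word(word, corpus):
    """Cut after any proper prefix that is itself a word in the corpus."""
    return [i for i in range(1, len(word)) if word[:i] in corpus]

def split_at(word, cuts):
    """Turn cut positions into the list of segments."""
    bounds = [0, *cuts, len(word)]
    return [word[a:b] for a, b in zip(bounds, bounds[1:])]

segments = split_at("readable", cut_by_peak_plateau("readable", CORPUS))
print(segments)
```

In this toy corpus the prefixes "r" and "read" both have three distinct successors, but the successors of "r" are dominated by one letter, so its entropy is lower. That difference is what the entropy method exploits and what a plain successor variety count cannot see.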
Two criteria are used to evaluate the various segmentation methods:
  1. Number of correct segment cuts divided by total number of cuts made, and
  2. Number of correct segment cuts divided by total number of true boundaries.
Hafer and Weiss found that none of the methods performed perfectly, but that techniques combining several of the methods did best.
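The two evaluation criteria can be computed as follows. This is a small sketch under stated assumptions: cuts and boundaries are represented as character positions within a word, and the function name and the example values are hypothetical.

```python
def evaluate_cuts(predicted, true_boundaries):
    """Return (criterion 1, criterion 2) for one segmented word.

    Criterion 1: correct cuts / total cuts made.
    Criterion 2: correct cuts / total true boundaries.
    Assumption: an empty denominator yields 0.0.
    """
    predicted = set(predicted)
    true_boundaries = set(true_boundaries)
    correct = len(predicted & true_boundaries)
    per_cut = correct / len(predicted) if predicted else 0.0
    per_boundary = correct / len(true_boundaries) if true_boundaries else 0.0
    return per_cut, per_boundary

# Hypothetical case: a method cuts "readable" after positions 1 and 4,
# while the only true boundary is after position 4 ("read" + "able").
scores = evaluate_cuts([1, 4], [4])
print(scores)
```

A method that cuts too often scores well on the second criterion but poorly on the first, and a method that cuts too rarely does the reverse, which is why both are reported.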




      Pessimist: Oh, this can’t get any worse!    
      Optimist: Yes, it can!