Removing Stopwords (Cont.)


There are two ways to filter stoplist words from the token stream produced by a lexical analyzer (the component that converts a sequence of characters into a sequence of tokens): examine the analyzer's output and remove any stopwords found there, or remove stopwords as part of lexical analysis itself. Both are described below.

Examining Lexical Analyzer Output and Removing Stopwords
This approach turns stopword filtering into a standard list-searching problem: every token must be looked up in the stoplist, and removed from further analysis if it is found there. The usual solutions to this problem are all adequate, including binary search trees, binary search of a sorted array, and hashing; of these, hashing is the fastest.
  1. When hashing is used to search a stoplist, the stoplist must first be inserted into a hash table. Each incoming token is then hashed into the table.

  2. If the resulting location is empty, the token is not a stopword, and is passed on; otherwise, comparisons must be made to determine whether the token really matches the entries at that hash table location, since distinct words may hash to the same location.

  3. If it does not, the token is passed on; if it does, the token is a stopword, and is eliminated from the token stream.
This strategy is fast, but it is slowed by the need to re-examine each character of a token to generate its hash value, and by the need to resolve collisions. The sketch below illustrates the approach.
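
To make the procedure concrete, here is a minimal Python sketch of the hash-and-compare steps above. The tiny stoplist, the toy tokenizer, and all names in it are illustrative assumptions, not taken from the text; Python's built-in set is itself a hash table, so the membership test performs exactly the hash, probe, and compare sequence of steps 1 through 3.

    # Illustrative sketch, not a prescribed implementation.
    STOPLIST = {"the", "a", "an", "of", "and", "to", "in"}  # hash table (Python set)

    def tokenize(text):
        """Toy lexical analyzer: emits lowercase alphabetic tokens."""
        token = []
        for ch in text:
            if ch.isalpha():
                token.append(ch.lower())
            elif token:
                yield "".join(token)
                token = []
        if token:
            yield "".join(token)

    def filter_stopwords(tokens):
        """Hash each token into the stoplist table; pass it on unless found."""
        for token in tokens:
            if token not in STOPLIST:  # hash lookup; collisions handled internally
                yield token

    print(list(filter_stopwords(tokenize("The art of war and the art of peace"))))
    # -> ['art', 'war', 'art', 'peace']

Note that the membership test must touch every character of each token a second time to compute its hash value, which is exactly the overhead the next approach removes.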

Removing Stopwords as Part of Lexical Analysis
The hashing strategy can be improved by folding the computation of hash values into the character-by-character processing of lexical analysis. The lexical analysis phase then outputs a hash value along with each token, at only a small increase in cost. Since lexical analysis must be done anyway, and since even a large stoplist can be recognized at almost no extra cost during that pass, this approach is extremely efficient. Furthermore, lexical analyzers that filter stoplists can be generated automatically, which is easier and less error-prone than writing stopword filters by hand. The sketch below illustrates the one-pass idea.
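
Here is a minimal Python sketch of the one-pass strategy. The multiplicative hash, the table size, and the small stoplist are illustrative assumptions rather than anything prescribed by the text; the point is only that the hash value is accumulated during the same character scan that assembles the token, so the stoplist probe requires no second pass over its characters.

    # Illustrative sketch: hash computed during lexing, not after it.
    TABLE_SIZE = 211   # assumed small prime; any reasonable size works
    BASE = 31          # assumed multiplier for the rolling hash

    def build_table(stopwords):
        """Insert the stoplist into a chained hash table."""
        table = [[] for _ in range(TABLE_SIZE)]
        for word in stopwords:
            h = 0
            for ch in word:
                h = (h * BASE + ord(ch)) % TABLE_SIZE
            table[h].append(word)
        return table

    STOP_TABLE = build_table(["the", "a", "an", "of", "and", "to", "in"])

    def tokenize_and_filter(text):
        """Lex and filter in one pass: the hash is updated as each
        character is consumed, so the finished token's hash is free."""
        token, h = [], 0
        for ch in text + " ":                # trailing sentinel closes the last token
            if ch.isalpha():
                ch = ch.lower()
                token.append(ch)
                h = (h * BASE + ord(ch)) % TABLE_SIZE   # incremental hash
            elif token:
                word = "".join(token)
                if word not in STOP_TABLE[h]:  # compare only within one bucket
                    yield word
                token, h = [], 0

    print(list(tokenize_and_filter("To be or not to be")))
    # -> ['be', 'or', 'not', 'be']

A generated lexical analyzer (for example, one built with a tool in the style of lex) can achieve a similar effect by compiling the stoplist directly into its finite automaton, so stopwords are recognized and discarded as part of normal state traversal.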



