Department of Computer Science, University of Utah
Abstract
Many approaches for using special purpose hardware to enhance the performance of an information retrieval system have been proposed or implemented. Rather than using conventional digital computers or parallel processors developed for general applications, these special purpose processors have been designed specifically for particular facets of information retrieval, such as the rapid searching of unstructured data or the manipulation of surrogates or indices for the information in the database. After discussing the reasons why a special purpose processor may be desirable, three different techniques for implementing a text scanner are presented--associative memories, finite state machines, and cellular arrays. The problems of using a conventional microprocessor are discussed, along with possible hardware enhancements to improve its performance. System performance considerations, especially when the use of an index or other surrogate shifts the operation from text scanning to a "seek and search" mode, are covered. Finally, special purpose processors for handling surrogates and the use of optical processing are presented.
The previous chapters have described techniques for implementing information retrieval systems on general purpose computing systems, including parallel machines designed to handle a variety of applications. However, there are times when an information retrieval system implemented using general purpose hardware may not be cost-effective or provide adequate performance. For example, to achieve satisfactory performance most information retrieval systems use some form of index. However, for a database consisting of messages that must be available as soon as they are received, there is no time available to construct, update, and maintain an index. Instead, special purpose searching hardware can provide the desired solution.
Special purpose hardware development for information retrieval has generally fallen into two areas--systems for rapidly searching unstructured data and systems for manipulating surrogates or indices for the information in the database. While most special purpose systems have been implemented using conventional digital logic, recently the use of optical processing has been suggested. The availability of fast microprocessors may also lead to special purpose systems that resemble a conventional workstation, but have additional logic to enhance their processing of information, more efficient connection and use of peripheral devices such as disk drives, or operating system software optimized for the particular application.
The design of a special purpose processor is heavily influenced by the types of operations that will be performed. In the case of text searching, these are the patterns and operators that are allowed when specifying a query. The most basic operation is performing an exact match search for a specified string of characters, which may represent either a single word or phrase. Proper handling of phrases may be more complex than simply searching for the words of the phrase each separated by a blank. Unless the text in the database has been carefully edited so that it fits a standard form, the words of the desired phrase may not be separated only by a single blank, but could be separated by a newline sequence (if the phrase started on one line and ended on another), more than one blank or some other space character (like a horizontal tab), typographic markup (such as changing from roman to italic type), a footnote reference, or other complications. Most special purpose text searchers can only handle the most basic phrase structures.
In addition to exactly matching a string of characters, it is often desirable to specify a more complex match pattern. A regular expression can specify not only a specific character to match at a pattern location, but can indicate a set of characters (such as an upper- or lowercase A, any vowel, the start of a line, or a word delimiter). It also can indicate that a pattern element can optionally occur or can occur more than once, or indicate two or more alternative patterns that are acceptable in a location.
While regular expressions could be used to specify the search pattern, often an information retrieval system will allow only a specific set of operations based on the nature of searching text. The most common of these operations is the don't care or wildcard, which specifies that any character is acceptable in a particular location in the search pattern (single-character or fixed-length don't care, FLDC) or an arbitrary number of characters are acceptable at the location (variable-length don't care, VLDC). The VLDC can come at the start of a word (an initial VLDC, matching a pattern with any prefix), at the end of a word (a terminal VLDC, matching a pattern with any suffix), at both the start and end of a word (matching a specified stem), embedded within a word (an embedded VLDC), or any combination of these. Some text searchers cannot handle VLDCs, or can handle only initial or terminal VLDCs.
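To make the wildcard semantics concrete, the following minimal C sketch (not taken from any of the systems described in this chapter) treats "?" as a fixed-length don't care and "*" as a variable-length don't care when matching a pattern against a single word; actual searchers use their own operator syntax.

#include <stdio.h>

/* Minimal sketch: match one word against a pattern in which '?' is a
 * fixed-length don't care (FLDC) and '*' is a variable-length don't care
 * (VLDC).  The '?'/'*' syntax is an assumption for illustration only. */
static int match(const char *pat, const char *word)
{
    if (*pat == '\0')
        return *word == '\0';
    if (*pat == '*')                      /* VLDC: absorb zero or more chars */
        return match(pat + 1, word) || (*word && match(pat, word + 1));
    if (*word == '\0')
        return 0;
    if (*pat == '?' || *pat == *word)     /* FLDC or exact character match */
        return match(pat + 1, word + 1);
    return 0;
}

int main(void)
{
    printf("%d\n", match("comput*", "computing"));   /* terminal VLDC: 1 */
    printf("%d\n", match("*ware", "hardware"));      /* initial VLDC: 1 */
    printf("%d\n", match("gr?y", "grey"));           /* FLDC: 1 */
    return 0;
}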
One other special pattern may specify that a numeric string should occur at the pattern location, and its value should be within a given range. Another pattern may specify an error tolerance, where the pattern is matched if there are fewer than n characters that don't match. These options are not commonly available on hardware-based text searchers (Yu et al. 1986).
Search expressions often contain two or more patterns combined with Boolean operators, the most common being and, or, and and not. Word proximity, an extended form of and, indicates that two search terms must be located within a specified number of words of each other in the database for a match to occur. Proximity can be either directed (X must be followed within n words by Y) or undirected (X and Y must be within n words of each other, in any order). Like phrases, word proximity is complicated by markup or other information embedded in the text.
A search expression also can specify that a term, phrase, or Boolean combination must occur in particular contexts for a match to occur. For example, a search may look for "hardware and text retrieval in abstract" or "patent and software within 2 sentences." Other operators may expand the number of terms in a query. A macro facility allows the user to include a predefined subexpression into a query, while a thesaurus can automatically form the or of all synonyms of a word. For example, "airplane" might expand to "airplane or plane or 747 or DC-10 . . . ."
Many people have proposed implementations of information retrieval functions using custom hardware. Many of these consider only one aspect of information retrieval, without considering its effect on the overall retrieval system and its performance. Simple searchers, capable of exact matching against a small number of patterns (possibly containing don't cares) have been particularly popular as a VLSI design project. However, only a small number of these have been incorporated into complete information retrieval systems.
The algorithms implemented in hardware differ from the algorithms described in the previous chapters in two fundamental ways. They are much simpler but do many things in parallel or inherently in the logic. The reason for the simplicity is that each alternative in an algorithm often requires its own logic or increases the complexity of the other hardware. This not only complicates the design, but can result in slower performance because of longer data paths and increased loading of output drivers. It is also easier to design a simple function that is then replicated many times than a large single unit incorporating a variety of functions.
Many operations that would be cumbersome on a conventional processor are very simple to perform in custom hardware. As an example, the program to reverse the order of the bits of a word takes a number of steps of isolating, aligning, and replacing each bit of the word. In hardware, it simply takes the changing of the connections from straight-through to reversed. No active logic is required and the operation is done in zero time. If operations do not have to share hardware, they can be done simultaneously (for example, the shift-and-add being done in parallel with the decrementing of a counter during a multiplication). Because of this, most hardware solutions are not simply faster implementations of standard information retrieval algorithms.
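As a concrete contrast, the loop below reverses the bits of a 16-bit word in software; the hardware equivalent is nothing more than sixteen crossed wires. The code is an illustrative sketch, not part of any searcher described here.

#include <stdint.h>
#include <stdio.h>

/* Software bit reversal of a 16-bit word: isolate, align, and replace each
 * bit, one per loop iteration.  In custom hardware the same result is just
 * sixteen crossed wires, with no logic and no delay. */
static uint16_t reverse16(uint16_t w)
{
    uint16_t r = 0;
    for (int i = 0; i < 16; i++) {
        r <<= 1;            /* make room for the next output bit */
        r |= w & 1;         /* copy the current low-order input bit */
        w >>= 1;            /* move to the next input bit */
    }
    return r;
}

int main(void)
{
    printf("0x%04x\n", reverse16(0x0001));   /* prints 0x8000 */
    return 0;
}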
The most commonly proposed special purpose hardware system for text information retrieval is a searcher. The rationale for needing a special search engine is that a conventional processor cannot search rapidly enough to give reasonable response time. For example, egrep on a 12 MIPS RISC processor (Sun SPARCstation-1) can search at about 250 KBytes per second. The searching of a 10 GByte database would take 40,000 seconds (over 11 hours). If a large database resides on a number of disk drives, it will be faster and more efficient not to bring the raw information from the disk into a central processor, but instead search it with a processor closer to the disk. Because each search processor operates only on the data on its own disk, the search time remains the same as the database grows, and the additional disks (each with a searcher) needed to hold the database increase the parallelism of the search.
However, because of the slow searching speeds for a central processor, most commercial information retrieval systems use some sort of inverted file index rather than actually searching the documents. The argument in the past against inverted files was that they substantially increase the storage requirements of a database. The index may take more disks than the database itself (Bird et al. 1978), but this may be less of a concern with lower-cost disk drives. This means that to be competitive, the cost of the search hardware must be comparable to the cost of a disk drive.
However, there are a number of problems associated with inverted files or other document surrogates besides their storage requirements. In many cases, information is discarded to reduce the size of the surrogate. For inverted files, commonly occurring words are generally not included in the index. This means that a search for a phrase like "changing of the guard" really finds anything that has "changing," any two words, and "guard." The phrase "To be or not to be, that is the question" matches any document containing the word "question," since all other words in the phrase are discarded. A similar problem exists for superimposed code words, where the combining of the codes for two words makes it seem like another word is present when it is not.
These artifacts caused by the use of a surrogate tend to confuse a user because it is not clear why a document that doesn't match the query was retrieved. Time is spent trying to understand an irrelevant document, perhaps more time than would be spent reviewing a relevant one. But even if this were not the case, there are other problems associated with the use of a surrogate for locating documents. Time must be spent building and maintaining the surrogate, a nontrivial amount of time for very large databases. Also, a document is not retrievable until it has been entered into the surrogate data structure. This may not be acceptable for a message retrieval system, where incoming documents must be available as soon as they have arrived.
Most proposed text searchers divide the work into three activities: search control, term matching, and query resolution. Search control includes taking a request from the processor running the retrieval system, configuring the search hardware (typically by loading appropriate tables and registers), managing the reading of the data from the disks, and reporting back the results to the retrieval system. This search control is best performed by a conventional microprocessor (or, in the past, a minicomputer).
Searching for particular terms or patterns is performed by the special hardware, with the results passed to a query resolver that handles operations such as Booleans, contexts, and proximities. This considerably simplifies the search patterns. A good example is a query that specifies that five different terms must occur in the same paragraph, in any order. The regular expression for handling this is very complex, consisting of the start-of-paragraph pattern, then all possible orderings of the five terms (120 sequences), then the end-of-paragraph pattern. A better way is for the term matcher to look for the individual terms of the pattern. The query resolver then combines the term matcher results, which indicate that particular terms have been detected, to determine whether the pattern has been matched. Because this latter processing is required only when a term has been found, its real-time requirements are considerably less than those of comparing characters from the disk, and it can be handled by a conventional processor.
In effect, the term matcher converts a document into a list of entries that are a shorthand description of the document as it pertains to the query. The query resolution program can then process this greatly reduced information according to whatever scheme is desired. This could be a Boolean query language enhanced with contexts and word location proximity, weighted term document scoring, or any other technique.
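The sketch below suggests, in simplified C, how a query resolver might consume the term matcher's output; the hit structure and the undirected proximity test are illustrative assumptions, not the design of any particular system.

#include <stdio.h>
#include <stdlib.h>

/* Illustrative query resolution on term-matcher output: the matcher reduces
 * a document to a list of (term, word position) hits, and the resolver
 * decides, for example, whether terms 0 and 1 occur within n words of each
 * other (undirected proximity). */
struct hit { int term; long word_pos; };

static int within_n_words(const struct hit *hits, int nhits, int n)
{
    for (int i = 0; i < nhits; i++)
        for (int j = 0; j < nhits; j++)
            if (hits[i].term == 0 && hits[j].term == 1 &&
                labs(hits[i].word_pos - hits[j].word_pos) <= n)
                return 1;                 /* proximity condition satisfied */
    return 0;
}

int main(void)
{
    struct hit hits[] = { {0, 104}, {1, 107}, {0, 530} };
    printf("%d\n", within_n_words(hits, 3, 5));   /* prints 1 */
    return 0;
}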
Three different techniques have been proposed for implementing the term matcher: associative memories, finite state machines, and cellular arrays (Hollaar 1979). As their performance improves, standard microprocessors, perhaps augmented with simple hardware assists, can also be used for the term matcher.
If commodity disk drives are used to keep costs low, rather than special drives like those with parallel reads, the disk transfer and seek rates will determine the speed of the searcher. There are two possible options: match the term comparator to the nominal transfer rate of the disk (raw read bit rate), or match it to the effective transfer rate (average bit rate when seeks and rotational latency are considered). In either case, the same performance will result, since it is based on how many characters come from the disk in a unit of time. Matching to the effective transfer rate permits a lower comparator speed at the expense of buffer memory. How much lower depends on the seek characteristics for the typical or worst-case queries. Matching to the nominal transfer rate was used for many searcher implementations, because of the high cost of memory for buffers at the time they were proposed.
In an associative memory, each memory word has a comparator. The bit pattern in the comparand register is sent to each of these comparators, which indicate whether it matches their memory word. The comparison can be done in one cycle regardless of the size of the memory, but the cost per bit is considerably higher than for conventional memory because of the logic required. (A dynamic RAM memory cell may need only a single transistor, while an associative memory cell has tens or hundreds of transistors, depending on whether features such as masking to selectively compare bits in a word are implemented.)
If the cost were ignored, an associative memory would seem ideal for finding the locations of a pattern within a database. It is not. First of all, the data must be broken down into fixed-length groups to match the width of the associative memory. This could be done on word delimiters, in which case the width of the associative memory would have to be the length of the longest word and much space would be wasted, or the data could be broken every n characters, packing the associative memory. In the latter case, however, words start in arbitrary positions within the width of the associative memory and will not always line up with the pattern. This can be solved by cycling the pattern through the width of the associative memory, but this is complicated by words that start in one memory location and end in another. Handling of VLDCs, especially embedded ones, is also difficult.
A more cost-effective approach is to store the search terms in the associative memory and shift the data through the comparand register. Essentially, this uses the associative memory as a set of comparators, each programmed with one of the search terms. This approach works well for fully specified terms, ones with fixed-length don't cares, or for initial or terminal VLDCs. Embedded VLDCs cannot be easily handled.
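The following C fragment simulates this organization: the search terms play the role of the associative memory contents, and a sliding window of the most recent input characters acts as the comparand that every term comparator examines at once. The window size and term list are arbitrary choices made for the illustration.

#include <stdio.h>
#include <string.h>

/* Sketch of "terms in the associative memory, data through the comparand."
 * Each stored term has its own comparator; here the comparators are
 * simulated by comparing a sliding character window against every term
 * after each input character arrives. */
#define NTERMS 2
#define MAXLEN 16

static const char *terms[NTERMS] = { "retrieval", "hardware" };

static void scan(const char *text)
{
    char window[MAXLEN + 1] = { 0 };

    for (size_t i = 0; text[i] != '\0'; i++) {
        memmove(window, window + 1, MAXLEN - 1);   /* shift comparand left */
        window[MAXLEN - 1] = text[i];              /* append new character */
        for (int t = 0; t < NTERMS; t++) {         /* all comparators "in parallel" */
            size_t len = strlen(terms[t]);
            if (memcmp(window + MAXLEN - len, terms[t], len) == 0)
                printf("term %d matched ending at offset %zu\n", t, i);
        }
    }
}

int main(void)
{
    scan("special hardware for text retrieval");
    return 0;
}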
Besides the obvious parallel-comparator implementation for an associative memory, two schemes based on hardware hashing of the comparand have been suggested. Bird and Tu (1979) proposed a hashing based on the length of the comparand word and its initial characters, while Burkowski (1982) used a small mapping RAM based on subsets of the search terms selected so that each has a unique bit pattern. While these approaches substantially reduce the cost of implementing the associative memory, they both suffer from the inability to handle embedded VLDCs or classes of characters (any delimiter, a number, any letter) in a pattern.
Finite state automata are capable of recognizing any regular expression. An FSA can be described as a 5-tuple {A, S, M, B, F} where A is the input alphabet, S is a set of elements called states, M is a mapping from AxS into S, B is a member of S called the beginning state, and F is a nonempty subset of S called the final states. The operation of the FSA is simple. At the start, it is in state B. Whenever a character arrives, the mapping M is used to determine the next state of the FSA. This process continues until a final state is reached, indicating the term that has been matched. Implementation is equally simple. A memory holds the mapping function M as an array of states. Each state contains an array with one entry for each character in the input alphabet, indicating the next state. A register holds the memory base address for the current state, and is concatenated with the input character to give the memory address to read to find the next state, which is then loaded into the register.
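A software rendering of this implementation is shown below; the tiny hand-built table recognizes the single term "dog," whereas a real system would generate the mapping from the full set of query terms. The table layout is a simplification made for illustration.

#include <stdio.h>

/* Sketch of the straightforward FSA implementation: the mapping M is a
 * table indexed by (current state, input character), a register holds the
 * current state, and reaching a final state reports a match. */
#define NSTATES 4
#define ALPHABET 128

static unsigned char next_state[NSTATES][ALPHABET];   /* the mapping M */
static unsigned char is_final[NSTATES] = { 0, 0, 0, 1 };

static void build_table(void)
{
    /* all other entries are zero, i.e., return to the beginning state */
    next_state[0]['d'] = 1;      /* beginning state: 'd' starts a match */
    next_state[1]['o'] = 2;      /* saw "d", now want 'o' */
    next_state[1]['d'] = 1;      /* "dd" still leaves a 'd' seen */
    next_state[2]['g'] = 3;      /* saw "do", 'g' completes the term */
    next_state[2]['d'] = 1;      /* "dod" leaves a 'd' seen */
}

int main(void)
{
    const char *text = "the dog barked";
    int state = 0;                              /* state register, start = B */

    build_table();
    for (int i = 0; text[i] != '\0'; i++) {
        state = next_state[state][(unsigned char)text[i]];
        if (is_final[state])
            printf("match ending at offset %d\n", i);
    }
    return 0;
}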
There is a problem with the straightforward FSA implementation--the size of memory required for the mapping function. For a query with 250 terms (as might occur when a thesaurus is used), about 2,000 states would be necessary. For seven-bit input characters, this would require over 3 million bits of memory (2,000 states, each with 128 next-state entries of roughly 12 bits). Although that does not seem high today, 15 years ago it was the same order of magnitude as the memory found on a mainframe computer, and would have cost hundreds of thousands of dollars.
Two ways of reducing the memory requirements of the FSA, at the expense of additional logic, have been proposed. Bird observed that there are two different state transition types in the mapping memory: sequential and index (Mayper et al. 1980). Sequential states are used for matching single characters, and are stored sequentially in memory. Matching the character causes the state register to be advanced by one and a mismatch causes it to go to a specified failure address. For a seven-bit input character, only eight bits are necessary for each sequential state--seven for the match character and one to indicate it is a sequential state.
In those places where there is more than one character of interest, the index state is used. It consists of a code indicating it is an index state and a bit vector indicating the interesting characters in the mapping. A seven-bit input character requires a 128-bit vector. When an index state is processed, the first action is to determine whether the bit corresponding to the input character is set. If not, a transition is made to a default state. If the bit is set, the number of bits set for characters with a lower value is determined and used as an index added to the current state address to give the location of a word that contains the address of the next state. Since there are far more sequential states than index states, the bit requirements are reduced. Extrapolating from the figures in Haskin and Hollaar (1983), approximately 250 Kbits would be necessary for the 250-term query.
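The fragment below sketches how an index state might be processed in software: test the bit for the input character, count the set bits for lower-valued characters, and use that count to select the next-state address. The data layout (four 32-bit words for the 128-bit vector, a dense target table) is an assumption made for the illustration, not Bird's actual encoding.

#include <stdio.h>
#include <stdint.h>

/* Sketch of a Bird-style index state for 7-bit characters.  If the input
 * character's bit is set in the 128-bit vector, the number of set bits for
 * lower-valued characters selects which next-state word to read. */
struct index_state {
    uint32_t vector[4];        /* which characters are "interesting" */
    int      default_state;    /* transition when the bit is not set */
    int      targets[128];     /* next-state addresses, densely packed */
};

static int index_transition(const struct index_state *s, unsigned char c)
{
    if (!(s->vector[c >> 5] & (1u << (c & 31))))
        return s->default_state;

    int rank = 0;                                  /* set bits below c */
    for (unsigned i = 0; i < c; i++)
        if (s->vector[i >> 5] & (1u << (i & 31)))
            rank++;
    return s->targets[rank];
}

int main(void)
{
    struct index_state s = { .vector = {0}, .default_state = 0 };
    s.vector['g' >> 5] |= 1u << ('g' & 31);        /* interested in 'g'... */
    s.vector['t' >> 5] |= 1u << ('t' & 31);        /* ...and in 't' */
    s.targets[0] = 7;                              /* 'g' -> state 7 */
    s.targets[1] = 9;                              /* 't' -> state 9 */

    printf("%d %d %d\n", index_transition(&s, 'g'),
           index_transition(&s, 't'), index_transition(&s, 'x'));  /* 7 9 0 */
    return 0;
}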
There are a number of difficulties with the Bird approach. The first is the dramatic size difference between the sequential and index states (8 bits vs. over 128 bits). This is mitigated by having more than one sequential state in each memory word, although some parts of the word might end up unused, and by operating on nibbles (half bytes) rather than the full character for index states. This reduces the vector to 16 bits and simplifies the hardware necessary to compute the index, at the expense of two index operations per character and a more complex state table.
Another difficulty is that when any search term has an initial VLDC, all states must be index states, substantially increasing the memory requirements. To solve this problem, Bird suggested having two matchers, one for terms that don't have initial VLDCs and a smaller one for those that have them. He also suggested a third FSA to handle phrases, rather than as part of the query resolver.
The other way of reducing the FSA memory requirements was proposed by Haskin (1980) (further described in Haskin and Hollaar [1983] and Hollaar and Haskin [1984]). It consists of dividing the FSA state table into a number of partitions that can then be processed in parallel. The partitions are made so that each state must check only a single character, like the Bird sequential state, to determine whether to go to a specified address or a mismatch state. If there is more than one character that could occur simultaneously, each must be in a different partition. This can be easily determined by an examination of all the terms when the state tables are being constructed. A successful match not only specifies the next state in a partition, but can force another partition into a specified state.
For example, if the terms DOG and DOT were specified, the first state would look for a D and transition to state 2 when it is found. That state would look for an O. However, since the next state needs to look for both a G and a T, a successful match of the O would cause a transition to state 3, which would look for a G, and would force another partition into a state looking for a T. This fork operation consists of forcing the address of a neighbor matcher, since the partitioning technique assures that the neighbor is in an idle state.
A state can be programmed for an exact match, a match ignoring the bit that differentiates between upper- and lowercase characters, or a match if the input character is of a particular class (numeric, alphabetic, delimiter). A startup mechanism is used to simplify the partitioning of the state table, and to reduce the matcher requirements by eliminating starts that would terminate after the initial character of the word. All forms of fixed- and variable-length don't cares can be handled. For the set of 250 terms, approximately 20 character matchers of less than 5 Kbits each are necessary, for a total memory requirement of under 100 Kbits, less than half the bits of the Bird FSA and a thirtieth of the conventional FSA requirements. Moreover, the matchers are cycled only once for each input character, allowing the use of slower memories than would be possible with the Bird FSA.
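The simplified simulation below corresponds to the DOG/DOT example above: two partitions of single-character states, with the match of the O in the first partition forking the second partition into its state that looks for a T. The table layout, state numbering, and mismatch handling are deliberately crude illustrations rather than Haskin's actual design.

#include <stdio.h>

/* Much-simplified simulation of two partitioned character matchers for the
 * terms DOG and DOT.  Each state checks one character; matching 'O' in
 * partition 0 also forks partition 1 into its "look for T" state. */
struct state {
    char match;          /* character this state looks for */
    int  next;           /* state on a match (0 = idle) */
    int  fork_state;     /* state to force into the neighbor partition */
    int  report;         /* term number reported on a match, or -1 */
};

static const struct state part0[] = {
    { 0,   0, 0, -1 },                /* state 0: idle */
    { 'D', 2, 0, -1 },                /* state 1: want 'D' */
    { 'O', 3, 1, -1 },                /* state 2: want 'O', fork neighbor */
    { 'G', 1, 0,  0 },                /* state 3: want 'G' -> report DOG */
};
static const struct state part1[] = {
    { 0,   0, 0, -1 },                /* state 0: idle */
    { 'T', 0, 0,  1 },                /* state 1: want 'T' -> report DOT */
};

int main(void)
{
    const char *text = "DOT";
    int s0 = 1, s1 = 0;               /* partition 0 starts looking for D */

    for (int i = 0; text[i]; i++) {
        char c = text[i];
        int n0 = s0, n1 = s1;

        if (s0 && part0[s0].match == c) {
            if (part0[s0].report >= 0) printf("term %d at %d\n", part0[s0].report, i);
            if (part0[s0].fork_state)  n1 = part0[s0].fork_state;   /* fork */
            n0 = part0[s0].next;
        } else if (s0) n0 = 1;                      /* mismatch: restart */

        if (s1 && part1[s1].match == c) {
            if (part1[s1].report >= 0) printf("term %d at %d\n", part1[s1].report, i);
            n1 = part1[s1].next;
        } else if (s1) n1 = 0;                      /* mismatch: go idle */

        s0 = n0; s1 = n1;
    }
    return 0;
}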
A cellular array uses a large number of very simple matchers connected in a string or as a more complex structure (Hollaar 1979). In its most basic form, the cell is programmed with a single match character. The cell has an input enable line and an output match line, as well as an input that is connected to a bus containing the current input character. Whenever the enable line is true, the cell compares the character on the bus to its match character, and if they are equal, sets its output match line true. FLDCs can be handled by a dummy cell that repeats its enable line delayed by one character time, and VLDCs by a cell that sets its match output whenever its enable line is set and clears it when a word delimiter is found (assuming that VLDCs should not extend across a word). The cells can also be extended to handle matching of classes of characters, like delimiters. Of course, any addition to a cell beyond an exact character match complicates its design and means fewer cells can be placed on an integrated circuit.
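In software, the behavior of a string of such cells can be sketched as follows; the four-cell example is programmed with the term "text," and the registered match outputs become the enable inputs one character time later. The specifics are illustrative only.

#include <stdio.h>
#include <string.h>

/* Sketch of a string of character-match cells for the term "text".  Each
 * cell compares the broadcast input character with its own character when
 * its enable input is true; its match output becomes the next cell's
 * enable input on the following character time. */
#define NCELLS 4

int main(void)
{
    const char cell_char[NCELLS] = { 't', 'e', 'x', 't' };
    const char *input = "free text search";
    int match[NCELLS] = { 0 };                 /* registered match outputs */

    for (size_t i = 0; input[i] != '\0'; i++) {
        int next[NCELLS];
        for (int c = 0; c < NCELLS; c++) {
            int enable = (c == 0) ? 1 : match[c - 1];   /* first cell tied true */
            next[c] = enable && (input[i] == cell_char[c]);
        }
        memcpy(match, next, sizeof match);
        if (match[NCELLS - 1])
            printf("term matched ending at offset %zu\n", i);
    }
    return 0;
}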
One of the first cellular matchers was General Electric's GESCAN, developed in the mid-1960s (Crew et al. 1967). It was capable of searching data stored on special magnetic tape drives at 120,000 characters per second, based on a query programmed into 80 cells. Copeland (1978) proposed a similar cellular system as an extension to a database machine; it was capable of handling a single bounded-length pattern string. Mukhopadhyay (1979) proposed a complex cell and a variety of ways to interconnect the cells to handle regular expressions and much of the pattern matching capabilities of SNOBOL4. However, the number of gates needed for a general interconnection network for n cells is O(n²), so for a reasonable number of cells the cost of the interconnections will exceed the cost of the cells. Lee (1985) (also Lee and Lochovsky [1990]) proposed a cellular system as part of a larger hardware-implemented information retrieval system that placed blocks of the data into the cellular array and broadcast programming instructions based on the search pattern to all the cells. Foster and Kung (1980) proposed a systolic pattern matcher, where both the pattern and the data being searched are moved through the cellular array. This eliminates the need to broadcast the data to every cell, eliminating a heavily loaded data bus.
Cellular arrays are used in two currently marketed search machines. A spin-off of General Electric, GESCAN International, has developed a new implementation of GESCAN that can be attached to a minicomputer or workstation. Rather than use a special tape drive, data is read from the host disk system and transferred to the search logic. The host also acts as the query resolver. Each integrated circuit contains 16 cells, and can operate at 5 MBytes per second. The Fast Data Finder from TRW (Yu et al. 1986) uses a cell that can perform not only single character comparisons, but also includes support for ignoring misspellings and numeric comparisons. Each integrated circuit holds 8 cells, and up to 9,216 cells can be included in a system. It can search at 12.5 MBytes per second, although the disk storing the data often limits its speed.
It was clear a decade ago that available microprocessors could not keep up with the disk for a multiterm match, but the latest RISC-based processors are substantially faster than mainframe computers were only a few years ago, at a fraction of the cost. While their performance depends on the particular algorithm used for searching, looking at the inner loop for the various grep programs, egrep appears to have the tightest loop for complex multiterm searches. On a Sun SPARC, an optimized version takes 10 instructions and 13 cycles. This means that the inner loop searching rate for a 25 MHz SPARCstation-1 is about 1.92 MBytes per second, approximately the nominal disk rate of 1.875 MBytes per second for a 15 MHz drive. A 40 MHz SPARC will run at over 3 MBytes per second.
But even as microprocessors become faster, they still may not be a good choice for handling the basic term matching operation. To achieve a high MIPS rate, the program for the microprocessor must be stored in a high-speed memory. For a 25 MHz machine, it must have a cycle time of less than 40 ns. This means the use of a cache memory to achieve better performance than would be available with dynamic RAM, increasing the cost and complexity of the microprocessor system.
If we look at the inner loop of egrep, we find that it implements a finite state recognizer using two arrays. The first is indexed by the current state and the input character and gives the next state; the second indicates whether a match has been found. For a system handling 250 terms, to accommodate large searches generated by a thesaurus, about 2,000 states would be necessary. An FSA handling 2,000 states and 7-bit characters would take twelve 256K RAMs (2,000 states times 128 characters is about 256K entries, each holding an 11-bit next state plus a match bit) and an 11-bit holding register, costing well under $50.
Of course, the line between the microprocessor and the special search hardware need not be drawn between term matching and query resolution, nor does the entire search have to be performed in the microprocessor. Hardware augmentation can be used to improve the search performance of the microprocessor. For example, a simple startup mechanism such as the one used in the partitioned FSA could initially examine the input characters, with the processor having to work only when the start of a possible match has been found (Hollaar 1991). As microprocessor cost-performance continues to improve, microprocessors will play a larger role in the implementation of special purpose text searchers. These systems will differ from general purpose systems using the same microprocessor by having low-overhead control programs rather than general operating systems like UNIX, simplified interfaces connecting their peripheral devices, file systems optimized for text searching, and some custom logic to assist in the matching process.
Text searchers were proposed as a means of improving the response time over a conventional computer doing the search, without the problems associated with using a surrogate such as an inverted file. Using a special purpose searcher that operates at disk speed, the time necessary to complete a user's query is the time it takes to read all the data from a disk drive. While this obviously varies with the type of disk, for most low-cost, high-capacity disks it is about five minutes. This is far too long for an interactive system, and complicates the search control by requiring that a number of different user queries be combined in a batch to give reasonable performance.
One solution to this problem is to make the search go faster by using higher performance disks, but this substantially increases the system cost because of the low production volumes and resulting higher costs for such disks. Furthermore, the speedup is on the order of 10, while a factor of 100 or more is desirable for an interactive system.
A more reasonable solution is to combine the attributes of searching and using a surrogate to overcome the difficulties with each approach. An inexact surrogate can be used to eliminate documents that have no hope of matching a query. Superimposed codewords will provide a list of documents that is a superset of the documents containing the search terms; if the term doesn't really occur in the document, but is an artifact of the superimposed codeword scheme, it will be eliminated by the search. A fully inverted file, where every term is indexed with its location within a document is not necessary; phrases, contexts, and word location proximities can be handled by the following search.
Rather than the 50 percent to 300 percent overhead for a fully inverted file (Bird et al. 1978), a partially inverted file can have an overhead of less than 5 percent (Hollaar 1991). This is because only one entry is necessary in the list of documents containing a word, no matter how many times the word occurs in the document, and information about the location of the word within the document is not stored. The partially inverted surrogate also provides quick feedback to the user on whether the follow-up search should even be started. It may indicate that too few documents would be searched (possibly none, if a term that is not in the database is used) or too many, based on the user's intuition of what is in the database. The search could then be canceled and the query refined, based on information from the index.
It is not necessary to index a document before it is available. As long as the number of unindexed documents remains low relative to an average search, they can simply be added to the list of documents for every search, making them available as soon as the text is loaded.
When a surrogate is used to reduce the number of documents to be searched, the search goes from a scan to a seek and search mode. This changes the critical disk parameter from its transfer rate to its seek time relative to document transfer time. To see why this is so, consider an ESDI drive like the Maxtor XT-8760, where the seek time and rotational latency to position to the start of a document is about 20 milliseconds. It takes another 25 msec to read 35,000 characters (the size of the average U. S. Patent, for example) off two tracks, for a total of 45 msec. The effective transfer rate is about 750 KBytes per second, 40 percent of the nominal transfer rate of 1,875 KBytes per second. The effective transfer rate will be less for smaller documents, since the seek time will remain the same but the amount of data read will be less. A disk drive with the same seek characteristics, but which could read the data in zero time, would be only 2.25 times faster.
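The arithmetic behind these figures is simply the document size divided by the sum of positioning and read time, as the short calculation below shows for the ESDI example and for the hypothetical zero-read-time drive.

#include <stdio.h>

/* Effective transfer rate in seek-and-search mode: bytes / (seek + read).
 * The figures are the ESDI example from the text (35,000-character
 * document, 20 ms positioning, 25 ms read); the second line shows the
 * hypothetical drive that reads the data in zero time. */
int main(void)
{
    double bytes = 35000.0, seek = 0.020, read = 0.025;

    double actual = bytes / (seek + read);    /* about 778 KBytes/second */
    double zero_read = bytes / seek;          /* seek-limited upper bound */

    printf("effective rate: %.0f KBytes/s\n", actual / 1000);
    printf("zero-read-time: %.0f KBytes/s (%.2fx faster)\n",
           zero_read / 1000, zero_read / actual);
    return 0;
}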
Just as parallel transfer drives are not particularly effective in a seek and search mode, optical drives, with their high seek times, are even more damaging to search performance. If the seek time in the small-document case above were changed to a 150 ms positioning time (typical of today's optical disks), the effective transfer rate would be only 148 KBytes per second, less than 8 percent of the nominal transfer rate.
Substantial improvements can be made by using an appropriate file system. A randomly organized file system, where blocks are placed in any convenient free location on the disk and a pointer (either in the previous block or in a master block) is used to locate the next data block (as is done in most file systems), is convenient for a time-sharing system, where files come and go. However, using our example ESDI disk drive, if we have to do an average seek before reading each 512-byte block, we will have an effective transfer rate of under 50 KBytes per second. The use of a large block size improves this, at the expense of disk capacity lost to internal fragmentation.
The use of a contiguous file system, where each document is stored in consecutive disk blocks, substantially improves disk performance. In one disk revolution after positioning, almost 50 blocks can be read, rather than just one. Since documents are seldom removed (or expanded) after they are loaded into an archival text database, the problems of file reorganization and lost disk space are minimal.
While text searching has been the most commonly proposed information retrieval operation to be implemented using special purpose hardware, a number of people have examined manipulating document surrogates using custom processors. These have included processors for merging the sorted lists of documents in an inverted file index and for searching document signatures to find specified patterns.
The index for an inverted file contains pointers to items within the database. These pointers are generally stored in sorted order to ease their combining to determine the documents that match a particular query. The basic merging operation produces a list of pointers that is the union (for an OR operation), intersection (for an AND), or Boolean difference (for an AND NOT). While this is a simple operation to program on a general purpose computer, the overhead associated with aligning data, processing auxiliary fields, and control flow such as loop counters meant that the mainframe computers of the 1970s were saturated just keeping up with disk reads.
Stellhorn (1977) proposed a system that took blocks of the sorted lists and merged them in a block-parallel, bit-serial fashion. Logic following the merge operation removed either duplicate (for an OR) or duplicate and single entries (for an AND). Hollaar (1978) developed a system based on a simple merge element (both serial and parallel versions were designed), connected by a network or arranged as a binary tree that was programmed to match the merge operation specified by the query. Both of these systems were 10 to 100 times faster than a conventional computer of the time. However, as processors have become less expensive and much faster, special purpose hardware for list merging no longer seems cost-effective.
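For reference, the basic AND merge on two sorted lists of document numbers is the few lines of C below; OR and AND NOT differ only in which entries are kept. The document numbers are, of course, made up for the illustration.

#include <stdio.h>

/* Basic merge step on two sorted lists of document numbers: an AND keeps
 * the intersection, an OR would keep the union, and an AND NOT would keep
 * entries of the first list absent from the second. */
static int and_merge(const int *a, int na, const int *b, int nb, int *out)
{
    int i = 0, j = 0, n = 0;
    while (i < na && j < nb) {
        if (a[i] < b[j])       i++;
        else if (a[i] > b[j])  j++;
        else { out[n++] = a[i]; i++; j++; }    /* in both lists: keep it */
    }
    return n;
}

int main(void)
{
    int a[] = { 3, 17, 42, 97, 105 };          /* documents containing term A */
    int b[] = { 8, 17, 97, 200 };              /* documents containing term B */
    int out[4];

    int n = and_merge(a, 5, b, 4, out);
    for (int k = 0; k < n; k++)
        printf("%d ", out[k]);                 /* prints: 17 97 */
    printf("\n");
    return 0;
}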
A signature file, as discussed in a previous chapter, provides an alternative to storing a list of document pointers for each word in the database. A signature is developed for each document that attempts to capture the contents of the document. These signatures, which are considerably smaller than the actual documents, are then searched for potential matches. These matches can be directly presented to the user, or can be searched to eliminate false matches caused by information dropped when the signatures were created.
The most common signature is based on superimposed codewords. Keywords or word stems are selected from the document and hashed to produce an n-bit vector that has m bits set. These vectors are then ORed together to form the signature. Determining whether a document contains a particular word simply requires checking to see if the bits corresponding to the hash of that word are set in the signature. Note that superimposed codewords do not readily allow don't cares or context and proximity operations, and may produce "false drops" when the bit vectors representing two words are ORed, and the result has all the bits of another word, not present in the document, set.
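The following C sketch shows the scheme with an (arbitrarily chosen) 64-bit signature and three bits set per word; the hash function is an illustrative placeholder rather than one from the literature. Note that the membership test examines only the bits set in the probe word's code, the observation exploited by the bit-serial designs discussed below.

#include <stdio.h>
#include <stdint.h>

/* Sketch of superimposed codewords: n = 64 bits per signature, m = 3 bits
 * set per word.  The hash is an arbitrary illustrative choice. */
static uint64_t word_code(const char *w)
{
    uint64_t code = 0, h = 5381;
    for (int k = 0; k < 3; k++) {                 /* set m = 3 bits */
        for (const char *p = w; *p; p++)
            h = h * 33 + (unsigned char)*p + k;
        code |= 1ULL << (h % 64);
    }
    return code;
}

int main(void)
{
    const char *doc[] = { "special", "hardware", "text", "retrieval" };
    uint64_t sig = 0;

    for (int i = 0; i < 4; i++)
        sig |= word_code(doc[i]);                 /* OR the codes together */

    uint64_t probe = word_code("hardware");
    printf("hardware: %s\n", (sig & probe) == probe ? "maybe present" : "absent");

    probe = word_code("optics");                  /* absent, unless a false drop */
    printf("optics:   %s\n", (sig & probe) == probe ? "maybe present" : "absent");
    return 0;
}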
Ahuja and Roberts (1980) proposed storing the superimposed codeword signatures in an associative memory. To reduce the cost, rather than have comparators for every bit in the associative memory, they divided the signatures into blocks and stored each block in a memory module. During the associative search, a codeword is read from the memory and compared against the desired pattern, the match result is saved, and the next codeword from the memory is examined. This word-serial, bit-parallel approach reduces the hardware requirements to one comparator (as wide as the codeword) per memory module, plus the associated control logic.
Lee (1985) (also Lee and Lochovsky [1990]) observed that it is not necessary to check every bit position of the signatures, but only those that are set in the hash of the word of interest. Since the number of bits set in a hash is small compared to the number of bits in the codeword, this can substantially improve the search performance. Since the same bits in each codeword are being skipped, a word-parallel, bit-serial architecture is best.
The previously discussed special purpose systems were all based on conventional digital electronics. Spurred by work on laser-based compact disk technology and the use of fiber optics for high-speed data transmission, there has been renewed interest in employing optically based approaches in information retrieval. This has included improving the bandwidth of optical mass storage and optical searching of information.
As we previously mentioned, most optical storage systems now in use do not have either the data rate or access speed of magnetic disk memories. This is particularly true for compact disks, which were originally designed to handle the relatively low data rates needed for reproducing audio information. However, there are ways of substantially increasing the data rates of optical disks beyond the standard techniques of increasing the bit density or rotational speed of the disk.
It is difficult to build a magnetic read head that can read data from many adjacent, closely spaced tracks at the same time. However, it is possible to have a laser beam that illuminates many adjacent tracks, with the information from each track falling on a different position of a photodetector array. This allows a byte or word of data to be read out with only minimal problems of alignment or skewed data. In the case of a parallel-head magnetic drive, the relative timing of the bits of the word depends on the mechanical alignment of the different heads, which can change with temperature or movement during seeks.
Even more interesting is the use of a holographic memory, rather than a rotating optical disk. This offers the potential of not only improved data rates, but also better storage densities and access times. Berra et al. (1989) discuss a page hologram memory capable of holding 725 MBytes on a 6-inch square card, and volume holograms capable of holding up to 10^12 bits. Since access to information in the holographic memory does not require moving a read head assembly and waiting for a disk to rotate to the start of the data, access times are reduced and rotational latency is eliminated. Instead, the laser beam can be deflected to the proper position in under 100 msec and all the information on the page will be immediately available.
Although it is possible to convert the optical signals from a laser disk or holographic memory to electrical signals and process them using conventional digital electronics, one also can use optical means for comparing the information against a pattern without the need for conversion. Berra et al. (1989) discuss a number of these techniques, and present an optical full text search design (Mitkas et al. 1989).
Optical comparisons are performed by using an exclusive-OR operation on dual-rail logic (both the true and complement of each bit are available). A 1 bit is stored as a TF pattern, and a 0 bit as FT. A spatial light modulator (SLM), a one- or two-dimensional array that can alter some parameter of the optical signals passing through it, such as intensity or phase, is placed in the light path. In its simplest form, the SLM is programmed to form a mask in which individual bits are either passed (P) or blocked (X). XP is used for comparing against a 1, PX for comparing against a 0, and XX can be used for a don't care in a particular bit position. If all pattern bits match the data bits, no light will be passed. It takes only a lens and a single detector to determine whether there is a match.
For example, if the pattern is 10d1, where d is a don't care, the mask will be XPPXXXXP. For data of 1011 (dual rail form of TFFTTFTF) or 1001 (TFFTFTTF), no light will be passed, while for data 0011 (FTFTTFTF), the resulting light pattern will be FTFFFFFF, so light is passed and a mismatch is indicated.
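This comparison is easy to verify with the short simulation below, which encodes nothing beyond the rule already stated: light reaches the detector only where the mask passes a rail carrying light, and a match is declared when no light gets through.

#include <stdio.h>

/* Simulation of the dual-rail optical comparison: data bits are expanded to
 * TF (1) or FT (0), the mask passes (P) or blocks (X) each rail, and a match
 * is declared only when no light gets through. */
static int light_passes(const char *mask, const char *dual_rail_data)
{
    for (int i = 0; mask[i] != '\0'; i++)
        if (mask[i] == 'P' && dual_rail_data[i] == 'T')
            return 1;                       /* some light reaches the detector */
    return 0;
}

int main(void)
{
    const char *mask = "XPPXXXXP";          /* pattern 10d1 from the text */

    printf("1011: %s\n", light_passes(mask, "TFFTTFTF") ? "mismatch" : "match");
    printf("1001: %s\n", light_passes(mask, "TFFTFTTF") ? "mismatch" : "match");
    printf("0011: %s\n", light_passes(mask, "FTFTTFTF") ? "mismatch" : "match");
    return 0;
}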
The optical text search machine contains a page composer, which generates optical patterns to be fed to the optical comparator. Its main element is a fast optical scanner that converts the sequential (and possibly byte-parallel) input from the optical disk to two-dimensional patterns consisting of shifting sequences of the bytes of text. The optical comparator masks these patterns with an array programmed with all the search terms. Photodetectors sense whether no light passes, indicating a match of a particular term. It is possible to do 100,000 parallel comparisons using an SLM in a single step. Other optical techniques can be employed to allow the multiple input streams to be handled through the same SLM, further increasing its performance.
A number of special purpose hardware systems have been proposed to provide more cost-effective or faster performance for information retrieval systems. Most of these have been text scanners, and several have been used in special applications.
While high-speed microprocessors and general purpose parallel machines can now provide much of the performance that previously required special purpose logic (such as for index merging), it still appears that hardware augmentation to those processors can improve the performance of a large system at a small cost. Special purpose hardware for information retrieval will continue to be an interesting research and development area.
AHUJA, S. R., & C. S. ROBERTS. 1980, "An Associative/Parallel Processor for Partial Match Retrieval Using Superimposed Codes." Proceedings of the 7th Annual Symposium on Computer Architecture, May 6-8, 1980 (published as SIGARCH Newsletter, vol. 8, no. 3) pp. 218-27.
BERRA, P. B., A. GHAFOOR, P. A. MITKAS, S. J. MARCINKOWSKI, & M. GUIZANI. 1989. IEEE Transactions on Knowledge and Data Engineering, 1(1), 111-32.
BIRD, R. M., J. B. NEWSBAUM, & J. L. TREFFTZS. 1978, "Text File Inversion: An Evaluation." Fourth Workshop on Computer Architecture for Non-Numeric Processing, Syracuse University, August 1978 (published as SIGIR vol. 13, no. 2; SIGARCH vol. 7, no. 2; and SIGMOD vol. 10, no. 1), pp. 42-50.
BIRD, R. M., & J. C. TU. 1979. "Associative Crosspoint Processor System." U. S. Patent 4,152,762, May 1, 1979.
BURKOWSKI, F. J. 1982, "A Hardware Hashing Scheme in the Design of a Multiterm String Comparator." IEEE Transactions on Computers, C-31, (9), 825-34.
COPELAND, G. P. 1978, "String Storage and Searching for Data Base Applications: Implementation of the INDY Backend Kernel." Fourth Workshop on Computer Architecture for Non-Numeric Processing, Syracuse University, August 1978 (published as SIGIR vol. 13, no. 2; SIGARCH vol 7, no. 2; and SIGMOD vol. 10, no. 1), pp. 8-17.
CREW, B. L., & M. N. GUNZBURG. 1967, "Information Storage and Retrieval System." U. S. Patent 3,358,270, December 12, 1967.
FOSTER, M. J., & H. T. KUNG. 1980. "Design of Special-Purpose VLSI Chips: Examples and Opinions." Proceedings of the 7th Annual Symposium on Computer Architecture, May 6-8, 1980 (published as SIGARCH Newsletter, vol. 8, no. 3) pp. 300-07.
HASKIN, R. L. 1980. "Hardware for Searching Very Large Text Databases." Ph.D. Thesis, University of Illinois at Urbana-Champaign, August.
HASKIN, R. L., & L. A. HOLLAAR. 1983, "Operational Characteristics of a Hardware-based Pattern Matcher." ACM Transactions on Database Systems, 8(1).
HOLLAAR, L. A. 1978. "Specialized Merge Processor Networks for Combining Sorted Lists." ACM Transactions on Database Systems, 3(3), 272-84.
HOLLAAR, L. A. 1979. "Text Retrieval Computers." Computer, 12(3), 40-52.
HOLLAAR, L. A., & R. L. HASKIN. 1984, "Method and System for Matching Encoded Characters." U. S. Patent 4,450,520, May 22, 1984.
HOLLAAR, L. A. 1991. "Special-Purpose Hardware For Text Searching: Past Experience, Future Potential." Information Processing & Management, Vol. 27, No. 4, pp. 371-78.
LEE, D. L. 1985. "The Design and Evaluation of a Text-Retrieval Machine for Large Databases." Ph.D. Thesis, University of Toronto, September.
LEE, D. L., & F. H. LOCHOVSKY. 1990. "HYTREM--A Hybrid Text-Retrieval Machine for Large Databases." IEEE Transactions on Computers, 39(1), 111-23.
MAYPER, V. Jr., A. L. NAGY, R. M. BIRD, J. C. TU, & L. S. MICHAELS. 1980. "Finite State Automaton with Multiple State Types." U. S. Patent 4,241,402, December 23, 1980.
MITKAS, P. A., P. B. BERRA, & P. S. GUILFOYLE. 1989, "An Optical System for Full Text Search." Proceedings of SIGIR 89.
MUKHOPADHYAY, A. 1979. "Hardware Algorithms for Nonnumeric Computation." IEEE Transactions on Computers, C-28(6), 384-89.
ROBERTS, D. C. 1978. "A Specialized Computer Architecture for Text Retrieval." Fourth Workshop on Computer Architecture for Non-Numeric Processing, Syracuse University, August 1978 (published as SIGIR vol. 13, no. 2; SIGARCH vol. 7, no. 2; and SIGMOD vol. 10, no. 1), pp. 51-59.
STELLHORN, W. H. 1977. "An Inverted File Processor for Information Retrieval." IEEE Transactions on Computers, C-26(12), 1258-67.
YU, K.-I, S.-P. HSU, R. E. HEISS, Jr., & L. Z. HASIUK. 1986. "Pipelined for Speed: The Fast Data Finder System." Quest, Technology at TRW, 9(2), Winter 1986/1987, 4-19.
17.1 INTRODUCTION
17.1.1 Search Patterns
17.1.2 Hardware Implementations
17.2 TEXT SEARCHERS
17.2.1 Searcher Structure
Associative Memories
Finite State Machines
Cellular Arrays
Standard Microprocessors
17.2.2 Search Performance
Seek and search mode
17.3 SURROGATE PROCESSORS
17.3.1 List Merging
17.3.2 Signature Processors
17.4 OPTICAL PROCESSORS
17.4.1 Improved Optical Storage
17.4.2 Optical Data Processing
17.5 SUMMARY
REFERENCES