University of Iowa
Abstract
Thesauri are valuable structures for Information Retrieval systems. A thesaurus provides a precise and controlled vocabulary which serves to coordinate document indexing and document retrieval. In both indexing and retrieval, a thesaurus may be used to select the most appropriate terms. Additionally, the thesaurus can assist the searcher in reformulating search strategies if required. This chapter first examines the important features of thesauri. This should allow the reader to differentiate between thesauri. Next, a brief overview of the manual thesaurus construction process is given. Two major approaches for automatic thesaurus construction have been selected for detailed examination. The first is thesaurus construction from collections of documents, and the second, thesaurus construction by merging existing thesauri. These two methods were selected since they rely on statistical techniques alone and are also significantly different from each other. Programs written in the C language accompany the discussion of these approaches.
Roget's thesaurus is very different from the average thesaurus associated with Information Retrieval (IR) systems accessing machine readable document databases. The former is designed to assist the writer in creatively selecting vocabulary. In IR systems, used to retrieve potentially relevant documents from large collections, the thesaurus serves to coordinate the basic processes of indexing and document retrieval. (The term document is used here generically and may refer to books, articles, magazines, letters, memoranda, and also software.) In indexing, a succinct representation of the document is derived, while retrieval refers to the search process by which relevant items are identified. The IR thesaurus typically contains a list of terms, where a term is either a single word or a phrase, along with the relationships between them. It provides a common, precise, and controlled vocabulary which assists in coordinating indexing and retrieval. Given this objective, it is clear that thesauri are designed for specific subject areas and are therefore domain dependent.
Figure 9.1 displays a short extract from an alphabetical listing of thesaurus terms and their relationships in the INSPEC thesaurus. This thesaurus is designed for the INSPEC domain, which covers physics, electrical engineering, electronics, as well as computers and control. The thesaurus is logically organized as a set of hierarchies. In addition to this hierarchical arrangement, the printed INSPEC thesaurus also includes an alphabetical listing of thesaural terms. Each hierarchy is built from a root term representing a high-level concept in the domain. In Figure 9.1, "computer-aided instruction" is part of the hierarchy whose root node or top term (TT) is "computer applications." The "see also" link leads to cross-referenced thesaural terms. NT suggests a more specific thesaural term; similarly, BT provides a more general thesaural term. RT signifies a related term, and this relationship includes a variety of links such as part-whole and object-property. UF is utilized to indicate the chosen form from a set of alternatives. In the example, "computer-aided instruction" is used for the alternative "teaching machines." The converse of UF is USE, which takes the user from a form that is not part of the vocabulary to a valid alternative. The example shows that "caesium" is the preferred form over "cesium." CC and FC are from the INSPEC classification scheme and indicate the subject area in which the term is used.
cesium
    USE caesium

computer-aided instruction
    see also education
    UF teaching machines
    BT educational computing
    TT computer applications
    RT education
       teaching
    CC C7810C
    FC 7810Cf

Figure 9.1: A short extract from the 1979 INSPEC thesaurus
A carefully designed thesaurus can be of great value. The indexer is typically instructed to select the most appropriate thesaural entries for representing the document. In searching, the user can employ the thesaurus to design the most appropriate search strategy. If the search does not retrieve enough documents, the thesaurus can be used to expand the query by following the various links between terms. Similarly, if the search retrieves too many items, the thesaurus can suggest more specific search vocabulary. In this way the thesaurus can be valuable for reformulating search strategies. It is quite common to provide on-line thesauri, which simplify the query reformulation process. In addition, such query modifications can be system initiated rather than user initiated. However, this obviously requires rather complex algorithms, since the system has to know not only how to reformulate the query but also when to do so.
There exists a vast literature on the principles, methodologies, and problems involved in thesaurus construction. However, only a small portion is devoted to the "automatic" construction of thesauri. This mirrors the current state of the art, marked by an abundance of manually generated thesauri. In fact, there is even much skepticism over the possibility of fully automating this procedure. This is because manual thesauri are highly complex structures which exhibit a variety of relationships, including hierarchical, nonhierarchical, equivalent, and associative ones. The automatic detection of such relationships continues to be challenging. But the reader can be assured that some automatic methodologies have emerged over the last few decades. As in most other subfields of IR, these methods are strongly motivated by statistics. In contrast, manual thesaurus construction is a highly conceptual and knowledge-intensive task and therefore also extremely labor intensive. Consequently, the search for alternative automatic methods is definitely of value.
The rest of this chapter is organized into six sections. The next, section 9.2, describes features of thesauri. Section 9.3 introduces the different approaches for automatic thesaurus construction following a brief description of manual thesaurus construction. Sections 9.4 and 9.5 focus on two major approaches for automatic thesaurus construction. The three C programs included with this chapter are described briefly in section 9.6. The last section presents the conclusions.
9.2 FEATURES OF THESAURI
Some important features of thesauri will be highlighted here. For a more detailed discussion, please consult Aitchison and Gilchrist (1972). The objective is for the reader to be able to compare thesauri. In general, the discussion applies to both manually and automatically generated thesauri. However, differences between the two are also identified where appropriate.
9.2.1 Coordination Level
Coordination refers to the construction of phrases from individual terms. Two distinct coordination options are recognized in thesauri: precoordination and postcoordination. A precoordinated thesaurus is one that can contain phrases. Consequently, phrases are available for indexing and retrieval. A postcoordinated thesaurus does not allow phrases. Instead, phrases are constructed while searching. The choice between the two options is difficult. The advantage in precoordination is that the vocabulary is very precise, thus reducing ambiguity in indexing and in searching. Also, commonly accepted phrases become part of the vocabulary. However, the disadvantage is that the searcher has to be aware of the phrase construction rules employed. Thesauri can adopt an intermediate level of coordination by allowing both phrases and single words. This is typical of manually constructed thesauri. However, even within this group there is significant variability in the level of coordination. Some thesauri may emphasize two- or three-word phrases, while others may emphasize longer phrases. Therefore, it is insufficient to state that two thesauri are similar simply because they follow precoordination. The level of coordination is important as well. It should be recognized that the higher the level of coordination, the greater the precision of the vocabulary but the larger the vocabulary size. It also implies an increase in the number of relationships to be encoded. Therefore, the thesaurus becomes more complex. The advantage in postcoordination is that the user need not worry about the exact ordering of the words in a phrase. Phrase combinations can be created as and when appropriate during searching. The disadvantage is that search precision may fall, as illustrated by the following well-known example from Salton and McGill (1983): the distinction between phrases such as "Venetian blind" and "blind Venetian" may be lost. A more likely example is "library school" and "school library." The problem is that unless search strategies are designed carefully, irrelevant items may also be retrieved. Precoordination is more common in manually constructed thesauri. Automatic phrase construction is still quite difficult, and therefore automatic thesaurus construction usually implies postcoordination. Section 9.4 includes a procedure for automatic phrase construction.
9.2.2 Term Relationships
Term relationships are the most important aspect of thesauri since the vocabulary connections they provide are most valuable for retrieval. Many kinds of relationships are expressed in a manual thesaurus. These are semantic in nature and reflect the underlying conceptual interactions between terms. We do not provide an exhaustive discussion or listing of term relationships here. Instead, we only try to illustrate the variety of relationships that exist. Aitchison and Gilchrist (1972) specify three categories of term relationships: (1) equivalence relationships, (2) hierarchical relationships, and (3) nonhierarchical relationships. The example in Figure 9.1 illustrates all three categories. Equivalence relations include both synonymy and quasi-synonymy. Synonyms can arise because of trade names, popular and local usage, superseded terms, and the like. Quasi-synonyms are terms which for the purpose of retrieval can be regarded as synonymous, for example, "genetics" and "heredity," which have significant overlap in meaning. Another example is the pair "harshness" and "tenderness," which represent different viewpoints of the same property continuum. A typical example of a hierarchical relation is genus-species, such as "dog" and "german shepherd." Nonhierarchical relationships also identify conceptually related terms. There are many examples, including thing--part, such as "bus" and "seat," and thing--attribute, such as "rose" and "fragrance."
Wang, Vandendorpe, and Evens (1985) provide an alternative classification of term relationships consisting of: (1) parts--wholes, (2) collocation relations, (3) paradigmatic relations, (4) taxonomy and synonymy, and (5) antonymy relations. Parts and wholes include examples such as set--element; count--mass. Collocation relates words that frequently co-occur in the same phrase or sentence. Paradigmatic relations relate words that have the same semantic core like "moon" and "lunar" and are somewhat similar to Aitchison and Gilchrist's quasi-synonymy relationship. Taxonomy and synonymy are self-explanatory and refer to the classical relations between terms. It should be noted that the relative value of these relationships for retrieval is not clear. However, some work has been done in this direction as in Fox (1981) and Fox et al. (1988). Moreover, at least some of these semantic relationships are commonly included in manual thesauri. Identifying these relationships requires knowledge of the domain for which the thesaurus is being designed. Most if not all of these semantic relationships are difficult to identify by automatic methods, especially by algorithms that exploit only the statistical relationships between terms as exhibited in document collections. However, it should be clear that these statistical associations are only as good as their ability to reflect the more interesting and important semantic associations between terms.
9.2.3 Number of Entries for Each Term
It is in general preferable to have a single entry for each thesaurus term. However, this is seldom achieved due to the presence of homographs--words with multiple meanings. Also, the semantics of each instance of a homograph can only be contextually deciphered. Therefore, it is more realistic to have a unique representation or entry for each meaning of a homograph. This also allows each homograph entry to be associated with its own set of relations. The problem is that multiple term entries add a degree of complexity in using the thesaurus--especially if it is to be used automatically. Typically the user has to select between alternative meanings. In a manually constructed thesaurus such as INSPEC, this problem is resolved by the use of parenthetical qualifiers, as in the pair of homographs, bonds (chemical) and bonds (adhesive). However, this is hard to achieve automatically.
9.2.4 Specificity of Vocabulary
Specificity of the thesaurus vocabulary is a function of the precision associated with the component terms. A highly specific vocabulary is able to express the subject in great depth and detail. This promotes precision in retrieval. The concomitant disadvantage is that the size of the vocabulary grows since a large number of terms are required to cover the concepts in the domain. Also, specific terms tend to change (i.e., evolve) more rapidly than general terms. Therefore, such vocabularies tend to require more regular maintenance. Further, as discussed previously, high specificity implies a high coordination level which in turn implies that the user has to be more concerned with the rules for phrase construction.
9.2.5 Control on Term Frequency of Class Members
This has relevance mainly for statistical thesaurus construction methods which work by partitioning the vocabulary into a set of classes where each class contains a collection of equivalent terms. Salton and McGill (1983, 77-78) have stated that in order to maintain a good match between documents and queries, it is necessary to ensure that terms included in the same thesaurus class have roughly equal frequencies. Further, the total frequency in each class should also be roughly similar. (Appropriate frequency counts for this include: the number of term occurrences in the document collection; the number of documents in the collection in which the term appears at least once). These constraints are imposed to ensure that the probability of a match between a query and a document is the same across classes. In other words, terms within the same class should be equally specific, and the specificity across classes should also be the same.
9.2.6 Normalization of Vocabulary
Normalization of vocabulary terms is given considerable emphasis in manual thesauri. There are extensive rules which guide the form of the thesaural entries. A simple rule is that terms should be in noun form. A second rule is that noun phrases should avoid prepositions unless they are commonly known. Also, a limited number of adjectives should be used. There are other rules to direct issues such as the singularity of terms, the ordering of terms within phrases, spelling, capitalization, transliteration, abbreviations, initials, acronyms, and punctuation. In other words, manual thesauri are designed using a significant set of constraints and rules regarding the structure of individual terms. The advantage in normalizing the vocabulary is that variant forms are mapped into base expressions, thereby bringing consistency to the vocabulary. As a result, the user does not have to worry about variant forms of a term. The obvious disadvantage is that, in order to be used effectively, the user has to be well aware of the normalization rules used. This is certainly nontrivial and often viewed as a major hurdle during searching (Frost 1987). In contrast, normalization rules in automatic thesaurus construction are simpler, seldom involving more than stoplist filters and stemming. (These are discussed in more detail later.) However, this feature can be regarded both as a weakness and a strength.
These different features of thesauri have been presented so that the reader is aware of some of the current differences between manual and automatic thesaurus construction methods. All features are not equally important, and they should be weighed according to the application for which the thesaurus is being designed. This section also gives an idea of where further research is required. Given the growing abundance of large document databases, it is indeed important to address the gaps between manual and automatic thesauri.
9.3 THESAURUS CONSTRUCTION

9.3.1 Manual Thesaurus Construction
The process of manually constructing a thesaurus is both an art and a science. We present here only a brief overview of this complex process. First, one has to define the boundaries of the subject area. (In automatic construction, this step is simple, since the boundaries are taken to be those defined by the area covered by the document database.) Boundary definition includes identifying central subject areas and peripheral ones since it is unlikely that all topics included are of equal importance. Once this is completed, the domain is generally partitioned into divisions or subareas. Once the domain, with its subareas, has been sufficiently defined, the desired characteristics of the thesaurus have to be identified. Since manual thesauri are more complex structurally than automatic ones, as the previous section has shown, there are more decisions to be made.
Now, the collection of terms for each subarea may begin. A variety of sources may be used for this including indexes, encyclopedias, handbooks, textbooks, journal titles and abstracts, catalogues, as well as any existing and relevant thesauri or vocabulary systems. Subject experts and potential users of the thesaurus should also be included in this step. After the initial vocabulary has been identified, each term is analyzed for its related vocabulary including synonyms, broader and narrower terms, and sometimes also definitions and scope notes. These terms and their relationships are then organized into structures such as hierarchies, possibly within each subarea. The process of organizing the vocabulary may reveal gaps which can lead to the addition of terms; identify the need for new levels in the hierarchies; bring together synonyms that were not previously recognized; suggest new relationships between terms; and reduce the vocabulary size. Once the initial organization has been completed, the entire thesaurus will have to be reviewed (and refined) to check for consistency such as in phrase form and word form. Special problems arise in incorporating terms from existing thesauri which may for instance have different formats and construction rules. At this stage the hierarchically structured thesaurus has to be "inverted" to produce an alphabetical arrangement of entries--a more effective arrangement for use. Typically both the alphabetical and hierarchical arrangements are provided in a thesaurus. Following this, the manually generated thesaurus is ready to be tested by subject experts and edited to incorporate their suggestions.
The above informal description is very sketchy. It is a long process that involves a group of individuals and a variety of resources. Once the thesaurus has been designed and implemented for use within a retrieval system, the next problem is that it needs to be maintained in order to ensure its continued viability and effectiveness. That is, the thesaurus should reflect any changes in the terminology of the area. The problem is that since the older documents are still within the system, the updated thesaurus must also retain the older information. Updates are typically slow and again involve several individuals who regularly review and suggest new and modified vocabulary terms as well as relationships. Therefore, typically a thesaurus evolves with time and slowly responds to changes in the terminology of the subject.
9.3.2 Automatic Thesaurus Construction
In selecting automatic thesaurus construction approaches for discussion here, the criteria used are that they should be quite different from each other in addition to being interesting. Also, they should use purely statistical techniques. (The alternative is to use linguistic methods.) Consequently, the two major approaches selected here have not necessarily received equal attention in the literature. The first approach, on designing thesauri from document collections, is a standard one. The second, on merging existing thesauri, is better known using manual methods. Programs included with this chapter are based on these two major approaches. We also discuss a third automatic approach which is quite novel and interesting, although it is based on tools from expert systems and does not use statistical methods. In this approach, thesauri are built using information obtained from users. We first present a brief overview of each approach and then provide detailed descriptions.
9.3.3 From a Collection of Document Items
Here the idea is to use a collection of documents as the source for thesaurus construction. This assumes that a representative body of text is available. The idea is to apply statistical procedures to identify important terms as well as their significant relationships. It is reiterated here that the central thesis in applying statistical methods is to use computationally simpler methods to identify the more important semantic knowledge for thesauri. It is semantic knowledge that is used by both indexer and searcher. Until more direct methods are known, statistical methods will continue to be used. Work by Soergel (1974) is relevant to this point since it includes an interesting discussion on the various semantic interpretations of significant statistical associations between words. The first two programs, select.c and hierarchy.c, included with this chapter are based on this approach of designing a thesaurus from a collection of documents.
9.3.4 By Merging Existing Thesauri
This second approach is appropriate when two or more thesauri exist for a given subject and need to be merged into a single unit. If a new database can indeed be served by merging two or more existing thesauri, then merging is likely to be more efficient than producing the thesaurus from scratch. This approach has been discussed at some length in Forsyth and Rada (1986). The challenge is that the merger should not violate the integrity of any component thesaurus. Rada has experimented with augmenting the MeSH thesaurus with selected terms from SNOMED (Forsyth and Rada 1986, 216). MeSH stands for Medical Subject Headings and is the thesaurus used in MEDLINE, a medical document retrieval system constructed and maintained by the National Library of Medicine. It provides a sophisticated controlled vocabulary for indexing and accessing medical documents. SNOMED, which stands for Systematized Nomenclature of Medicine, is a detailed thesaurus developed by the College of American Pathologists for use in hospital records. MeSH terms are used to describe documents, while SNOMED terms are for describing patients. Ideally, a patient can be completely described by choosing one or more terms from each of several categories in SNOMED. Both MeSH and SNOMED follow a hierarchical structure. Rada's focus in his experiments has been on developing suitable algorithms for merging related but separate thesauri such as MeSH and SNOMED and also on evaluating the end products. The third program (merge.c) included here implements two different merging algorithms adapted from Rada's work.
9.3.5 User Generated Thesaurus
In this third alternative, the idea is that users of IR systems are aware of and use many term relationships in their search strategies long before these find their way into thesauri. The objective is to capture this knowledge from the user's search. This is the basis of TEGEN--the thesaurus generating system designed by Guntzer et al. (1988). They propose TEGEN as a viable alternative technique for automatic thesaurus construction. The procedure involves examining the types of Boolean operators used between search terms, the type of query modification performed by the user, and so on. User feedback is included to resolve any ambiguities and uncertainties. Feedback is also required to select the specific type of relationship between two terms once it has been decided that the terms are indeed related. Therefore, their approach requires considerable interaction with the user population. TEGEN is designed using production rules which perform a detailed analysis of the user's search pattern. Given that this approach utilizes mainly expert system methodologies, no representative program is included here.
9.4 THESAURUS CONSTRUCTION FROM TEXTS
The overall process may be divided into three stages: (1) Construction of vocabulary: This involves normalization and selection of terms. It also includes phrase construction depending on the coordination level desired. (2) Similarity computations between terms: This step identifies the significant statistical associations between terms. (3) Organization of vocabulary: Here the selected vocabulary is organized, generally into a hierarchy, on the basis of the associations computed in step 2. The first program select.c implements procedures for stages 1 and 2. Program hierarchy.c implements one method for organizing the vocabulary in the third stage.
9.4.1 Construction of Vocabulary
The objective here is to identify the most informative terms (words and phrases) for the thesaurus vocabulary from document collections. The first step is to identify an appropriate document collection. The only loosely stated criterion is that the collection should be sizable and representative of the subject area. The next step is to determine the required specificity for the thesaurus. If high specificity is needed, then the emphasis will be on identifying precise phrases; otherwise, more general terms can be sought. Terms can be selected from titles, abstracts, or even the full text of the documents if available. This initial set of vocabulary terms is now ready for normalization. The simplest and most common normalization procedure is to eliminate very trivial words such as prepositions and conjunctions. For this, an appropriate stoplist of trivial words needs to be constructed. The article by Fox (1989-1990) on the construction of stoplists may be useful here. The next standard normalization procedure is to stem the vocabulary. Stemming reduces each word to its root form. For example, the terms "information," "informing," and "informed" could all be stemmed to the same root "inform." The chapter in this book on stemming algorithms may be consulted for this. The resulting pool of stems is now ready to be analyzed statistically with two objectives: first, to select the most interesting stems, as discussed in the following section, and second, to create interesting phrases for a higher coordination level, as discussed in the section on phrase construction.
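As a simple illustration of the normalization step, the following fragment (our own sketch, not part of select.c; the stoplist shown is only a tiny sample, and select.c assumes that such filtering has already been applied to its input) tests whether a word should be discarded before stemming:

#include <string.h>

/* a minimal stoplist filter; a realistic stoplist would be far larger */
static const char *stoplist[] = { "a", "an", "and", "in", "of", "the", NULL };

static int is_stopword(const char *word)
{
    int i;
    for (i = 0; stoplist[i] != NULL; i++)
        if (strcmp(word, stoplist[i]) == 0)
            return 1;            /* trivial word: discard */
    return 0;                    /* nontrivial word: pass on to the stemmer */
}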
Stem evaluation and selection

There are a number of methods for statistically evaluating the worth of a term. The ones we discuss here are: (1) selection of terms based on frequency of occurrence, (2) selection of terms based on Discrimination Value, and (3) selection of terms based on the Poisson model. Program select.c includes routines for all three methods, which are described briefly below.
Selection by Frequency of Occurrence: This is one of the oldest methods and is based on Luhn's work, which has been extensively discussed in the literature; see, for example, Salton and McGill (1983). The basic idea is that each term may be placed in one of three frequency categories with respect to a collection of documents: high, medium, and low frequency. Terms in the mid-frequency range are the best for indexing and searching. Terms in the low-frequency range have minimal impact on retrieval, while high-frequency terms are too general and negatively impact search precision. Salton recommends creating term classes for the low-frequency terms; however, it is not evident how to do this automatically. High-frequency terms are generally coordinated into phrases to make them more specific (see the later section on phrase construction for this). Threshold frequencies are generally not fixed and are therefore user specified. Program select.c includes a routine for this selection method. Threshold frequencies are specified by the user via the global variables LOW_THRESHOLD and HIGH_THRESHOLD.
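The selection logic itself is straightforward, as the following sketch shows (the threshold values are arbitrary samples; in select.c the frequencies are computed from the inverted file):

#define LOW_THRESHOLD 2              /* user specified; sample values only */
#define HIGH_THRESHOLD 50

enum freq_class { LOW, MEDIUM, HIGH };

/* classify a term by its total frequency of occurrence in the collection */
static enum freq_class classify_term(int collection_freq)
{
    if (collection_freq < LOW_THRESHOLD)
        return LOW;                  /* minimal impact on retrieval: drop */
    if (collection_freq > HIGH_THRESHOLD)
        return HIGH;                 /* too general: coordinate into phrases */
    return MEDIUM;                   /* best candidates for the vocabulary */
}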
Selection by Discrimination Value (DV): DV measures the degree to which a term is able to discriminate or distinguish between the documents of the collection as described by Salton and Yang (1973). The more discriminating a term, the higher its value as an index term. The overall procedure is to compute the average interdocument similarity in the collection, using some appropriate similarity function. Next, the term k being evaluated is removed from the indexing vocabulary and the same average similarity is recomputed. The discrimination value (DV) for the term is then computed as:
DV(k) = (Average similarity without k) - (Average similarity with k)
Good discriminators are those that decrease the average similarity by their presence, that is, those for which DV is positive. Poor discriminators have negative DV, while neutral discriminators have no effect on average similarity. Terms that are positive discriminators can be included in the vocabulary and the rest rejected.
Program select.c includes a routine called dv_all which computes DV for all terms with respect to a collection. In the program, average similarity with all terms intact is referred to as the baseline. The algorithm used is a straightforward one using the method of centroids. In this method the average interdocument similarity is computed by first calculating the average document vector or centroid vector for the collection. This is performed by a routine called centroid. Next, the similarity between every document and this centroid is calculated. This generates the total similarity in the collection, which can then be used to calculate the average similarity. For large collections of documents, a more efficient algorithm such as that based on the cover coefficient concept may be tried as suggested by Can and Ozkarahan (1985). However, this is not done here.
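The following sketch shows the centroid method in miniature (our simplification of the dv_all and centroid routines; a small in-memory document-term matrix is assumed purely for brevity, whereas the program works from its input files):

#include <math.h>

#define NDOCS  3                     /* sample sizes; the program derives */
#define NTERMS 4                     /* these from its input files */

/* cosine similarity between one document and the centroid, ignoring
   term 'skip' (pass skip = -1 to use every term) */
static float sim_to_centroid(float doc[], float centroid[], int skip)
{
    float dot = 0.0f, dlen = 0.0f, clen = 0.0f;
    int k;
    for (k = 0; k < NTERMS; k++) {
        if (k == skip) continue;
        dot  += doc[k] * centroid[k];
        dlen += doc[k] * doc[k];
        clen += centroid[k] * centroid[k];
    }
    if (dlen == 0.0f || clen == 0.0f)
        return 0.0f;
    return dot / (float) sqrt((double) dlen * (double) clen);
}

/* average document-centroid similarity with term 'skip' removed */
static float avg_similarity(float docs[NDOCS][NTERMS], int skip)
{
    float centroid[NTERMS], total = 0.0f;
    int i, k;
    for (k = 0; k < NTERMS; k++) {   /* centroid = average document vector */
        centroid[k] = 0.0f;
        for (i = 0; i < NDOCS; i++)
            centroid[k] += docs[i][k];
        centroid[k] /= (float) NDOCS;
    }
    for (i = 0; i < NDOCS; i++)
        total += sim_to_centroid(docs[i], centroid, skip);
    return total / (float) NDOCS;
}

/* DV(k) = (average similarity without k) - (baseline with all terms) */
static float dv(float docs[NDOCS][NTERMS], int k)
{
    return avg_similarity(docs, k) - avg_similarity(docs, -1);
}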
Selection by the Poisson Method: This is based on the work by Bookstein and Swanson (1974), Harter (1975), and Srinivasan (1990) on the family of Poisson models. The Poisson distribution is a discrete probability distribution that can be used to model a variety of random phenomena, including the number of typographical errors in a page of writing and the number of red cars on a highway per hour. In all the research that has been performed on the family of Poisson models, the one significant result is that trivial words follow a single Poisson distribution, while the distributions of nontrivial words deviate significantly from it. This result is used here to select nontrivial words as thesaurus terms.
The get_poisson_dist routine in the select.c program prints out the distributions in a collection for all terms in the inverted file. Two distributions are produced for each term: the actual distribution and the one expected under the assumption of a single Poisson model. These distributions will have to be compared using the chi-square test to identify any significant differences. The reader is referred to Harter (1975) for information on these chi-square comparisons. Terms whose distributions deviate significantly are selected to be in the vocabulary. The rest may be discarded.
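For reference, the expected distribution under a single Poisson model can be computed as in the following sketch (our illustration; get_poisson_dist prints both distributions, and the chi-square comparison itself is left to the reader):

#include <math.h>

/* expected number of documents (out of n_docs) containing a term exactly
   k times under a single Poisson model; lambda is the term's total
   frequency in the collection divided by n_docs */
static double expected_docs(int n_docs, double lambda, int k)
{
    double p = exp(-lambda);          /* P(freq = 0) */
    int i;
    for (i = 1; i <= k; i++)
        p *= lambda / (double) i;     /* P(freq = k) = e^-lambda lambda^k / k! */
    return (double) n_docs * p;
}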
Phrase construction

This step may be used to build phrases if desired. As mentioned before, this decision is influenced by the coordination level selected. Also, phrase construction can be performed to decrease the frequency of high-frequency terms and thereby increase their value for retrieval. Two methods are described below. Program select.c includes routines for the first method. Given insufficient details for the second method, it is not implemented here. However, since it is an interesting approach, we include a sketchy algorithm.
Salton and McGill Procedure: This is the standard one proposed in Salton and McGill (1983, 133-34) and adapted by Rada in Forsyth and Rada (1986, 200). Their procedure is a statistical alternative to syntactic and/or semantic methods for identifying and constructing phrases. Basically, two general criteria are used. First, the component words of a phrase should occur frequently in a common context, such as the same sentence. More stringent contextual criteria are possible, such as requiring that the words appear within some specified distance of each other. The second general requirement is that the component words should represent broad concepts, and their frequency of occurrence should be sufficiently high. These criteria motivate their algorithm, which is described below.
1. Compute pairwise co-occurrence for high-frequency words. (Any suitable contextual constraint such as the ones above may be applied in selecting pairs of terms.)
2. If this co-occurrence is lower than a threshold, then do not consider the pair any further.
3. For pairs that qualify, compute the cohesion value. Two formulas for computing cohesion are given below. Both ti and tj represent terms, and size-factor is related to the size of the thesaurus vocabulary.
COHESION (ti, tj) = co-occurrence-frequency / sqrt (frequency (ti) * frequency (tj))    (Rada, in Forsyth and Rada 1986, 200)

COHESION (ti, tj) = size-factor * co-occurrence-frequency / (total-frequency (ti) * total-frequency (tj))    (Salton and McGill 1983, 85)
4. If cohesion is above a second threshold, retain the phrase as a valid vocabulary phrase.
We do not include a program for this complete algorithm. However, the routine cohesion in select.c is an implementation of the Rada formula.
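In outline, the Rada formula reduces to a few lines of C (a sketch; the cohesion routine in select.c obtains its frequency counts from the inverted file, whereas here they are passed in directly):

#include <math.h>

/* cohesion between terms ti and tj by the Rada formula */
static float cohesion_value(float cooccurrence_freq,
                            float total_freq_ti, float total_freq_tj)
{
    return cooccurrence_freq /
           (float) sqrt((double) total_freq_ti * (double) total_freq_tj);
}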
Choueka Procedure: The second phrase construction method is based on the work by Choueka (1988). He proposes a rather interesting and novel approach for identifying collocational expressions, by which he refers to phrases whose meaning cannot be derived in a simple way from that of the component words, for example, "artificial intelligence." The algorithm proposed is statistical and combinatorial and requires a large collection (at least a million items) of documents to be effective. The author has been quite successful in identifying meaningful phrases and is apparently extending the algorithm. Unfortunately, some critical details are missing from the paper. Therefore, we do not include an implementation of this approach. However, a sketchy procedure is included since the overall process looks rather interesting.
1. Select the range of length allowed for each collocational expression. Example: two to six words.
2. Build a list of all potential expressions from the collection with the prescribed length that have a minimum frequency (again, a preset value).
3. Delete sequences that begin or end with a trivial word. The trivial words include prepositions, pronouns, articles, conjunctions, and so on.
4. Delete expressions that contain high-frequency nontrivial words. (Steps 3 and 4 are sketched in code following this list.)
5. Given an expression such as a b c d, evaluate any potential subexpressions such as a b c and b c d for relevance. Discard any that are not sufficiently relevant. (It is not clear from the paper how relevance is decided, but perhaps it is also based on frequency.)
6. Try to merge smaller expressions into larger and more meaningful ones. For example, a b c d and b c d may merge to form a b c d. (Again, the exact criteria for allowing a merger are not given.)
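The following fragment suggests how steps 3 and 4 might be realized (our reconstruction; the predicates is_trivial and is_high_frequency stand in for tests whose exact criteria the paper does not specify):

#define MAXWORD 20

extern int is_trivial(const char *word);         /* stoplist membership */
extern int is_high_frequency(const char *word);  /* collection frequency test */

/* steps 3 and 4: keep an expression only if it neither begins nor ends
   with a trivial word and contains no high-frequency nontrivial word */
static int keep_expression(char words[][MAXWORD], int nwords)
{
    int i;
    if (is_trivial(words[0]) || is_trivial(words[nwords - 1]))
        return 0;
    for (i = 0; i < nwords; i++)
        if (!is_trivial(words[i]) && is_high_frequency(words[i]))
            return 0;
    return 1;
}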
The main difference between this procedure and the previous one is that this one considers phrases that have more than two words. It is of course possible to extend the previous procedure to include phrases with more than two words. However, computing cohesion is likely to be more challenging than simply applying the formula recursively over larger sized phrases. Choueka's procedure also allows phrases to be substituted by longer and more general phrases as in step 6. However, the criteria used for allowing this should be carefully formulated.
9.4.2 Similarity Computation
Once the appropriate thesaurus vocabulary has been identified, and phrases have been designed if necessary, the next step is to determine the statistical similarity between pairs of terms. There are a number of similarity measures available in the literature. An extensive study has been done on the comparison of different similarity measures (McGill et al. 1979). Select.c includes two similarity routines:
1. Cosine: which computes the number of documents associated with both terms divided by the square root of the product of the number of documents associated with the first term and the number of documents associated with the second.
2. Dice: which computes the number of documents associated with both terms divided by the sum of the number of documents associated with one term and the number associated with the other.
Either measure can be used to assess the similarity or association between terms in a document collection. The algorithms implemented in the program select.c can be made more accurate by including only those instances in each numerator wherein the two terms co-occur in the same sentence and within some specified distance (that is, in a unit smaller than the entire document), as suggested in Salton and McGill (1983) and Soergel (1974). This makes the criteria for similarity more stringent.
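Expressed over document counts, the two measures can be sketched as follows (an illustration; the actual routines in select.c work from the inverted file, and the Dice formula is written exactly as described above, without the factor of 2 that some definitions place in the numerator):

#include <math.h>

/* 'both' = # documents containing both terms; 'first' and 'second' = */
/* # documents containing each term individually */

static float cosine_sim(int both, int first, int second)
{
    if (first == 0 || second == 0)
        return 0.0f;
    return (float) both / (float) sqrt((double) first * (double) second);
}

static float dice_sim(int both, int first, int second)
{
    if (first + second == 0)
        return 0.0f;
    return (float) both / (float) (first + second);
}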
9.4.3 Vocabulary Organization

Once the statistical term similarities have been computed, the last step is to impose some structure on the vocabulary, which usually means a hierarchical arrangement of term classes. For this, any appropriate clustering program can be used. A standard clustering algorithm generally accepts all pairwise similarity values corresponding to a collection of objects and uses these similarity values to partition the objects into clusters or classes such that objects within a cluster are more similar than objects in different clusters. Some clustering algorithms can also generate hierarchies. It should be noted that there are major differences between available clustering algorithms, and the selection should be made after carefully studying their characteristics.
Given the chapter in this book on clustering, such programs are not included here. Instead, we include the implementation of an alternative simple procedure to organize the vocabulary which is described in Forsyth and Rada (1986, 200-01). This algorithm implemented in hierarchy.c is quite different from the standard clustering algorithms and is based on the following assumptions: (1) high-frequency words have broad meaning, while low-frequency words have narrow meaning; and (2) if the density functions of two terms, p and q (of varying frequencies) have the same shape, then the two words have similar meaning. As a consequence of these two assumptions, if p is the term with the higher frequency, then q becomes a child of p. These two assumptions motivate their entire procedure as outlined below. Besides illustrating the procedure, hierarchy.c also includes appropriate data structures for storing thesauri organized as hierarchies.
1. Identify a set of frequency ranges.
2. Group the vocabulary terms into different classes based on their frequencies and the ranges selected in step 1. There will be one term class for each frequency range.
3. The highest frequency class is assigned level 0, the next level 1, and so on.
4. Parent-child links are determined between adjacent levels as follows. For each term t in level i, compute the similarity between t and every term in level i-1. Term t becomes the child of the most similar term in level i-1. If more than one term in level i-1 qualifies, then each becomes a parent of t. In other words, a term is allowed to have multiple parents. (Step 4 is sketched in code following this list.)
5. After all terms in level i have been linked to level i-1 terms, check level i-1 terms and identify those that have no children. Propagate such terms to level i by creating an identical "dummy" term as its child.
6. Perform steps 4 and 5 for each level starting with level 1.
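Step 4 is the heart of the procedure. In outline it can be coded as follows (a simplified sketch using an in-memory term array and an unspecified similarity function sim; hierarchy.c itself uses the linked inverted file structure shown in the appendix):

#define NTERMS 100                   /* sample capacity */
#define MAXPAR 10                    /* maximum parents per term */

struct simple_term {
    int level;                       /* assigned in steps 1-3 by frequency */
    int parents[MAXPAR];             /* indices of parent terms */
    int nparents;
};

extern struct simple_term terms[NTERMS];
extern float sim(int t, int u);      /* any term-term similarity measure */

/* link each term at level i to the most similar term(s) at level i-1 */
static void link_level(int i)
{
    int t, u;
    for (t = 0; t < NTERMS; t++) {
        float best = -1.0f;
        if (terms[t].level != i)
            continue;
        for (u = 0; u < NTERMS; u++)      /* find the best similarity */
            if (terms[u].level == i - 1 && sim(t, u) > best)
                best = sim(t, u);
        terms[t].nparents = 0;            /* attach every term that ties, */
        for (u = 0; u < NTERMS; u++)      /* allowing multiple parents */
            if (terms[u].level == i - 1 && sim(t, u) == best
                && terms[t].nparents < MAXPAR)
                terms[t].parents[terms[t].nparents++] = u;
    }
}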
9.5 MERGING EXISTING THESAURI

The third program, merge.c, is designed to merge different hierarchies. This is useful when different thesauri (perhaps with different perspectives) are available for the same subject, or when a new subject is being synthesized from existing ones. The algorithm for the program has been adapted from Chapter 14 of Forsyth and Rada (1986), in which experiments in augmenting MeSH and SNOMED have been described. Two different merging algorithms have been implemented. The first, called simple_merge, links hierarchies wherever they have terms in common. The second, called complex_merge, adopts a more interesting criterion: it links terms from different hierarchies if they are similar enough. Here, similarity is computed as a function of the number of parent and child terms in common, and sufficiency is decided on the basis of a preset, user-specified threshold.
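The complex merge criterion can be sketched as follows (our simplification; merge.c derives the parent and child lists from its link files, and THRESHOLD is the user-specified parameter described in the program notes below):

#include <string.h>

#define MAXWORD 20

/* count the terms two name lists have in common */
static int common_count(char list1[][MAXWORD], int n1,
                        char list2[][MAXWORD], int n2)
{
    int i, j, count = 0;
    for (i = 0; i < n1; i++)
        for (j = 0; j < n2; j++)
            if (strncmp(list1[i], list2[j], MAXWORD) == 0) {
                count++;
                break;
            }
    return count;
}

/* link two terms from different hierarchies when their shared parents
   plus shared children reach the preset threshold */
static int similar_enough(int common_parents, int common_children,
                          float threshold)
{
    return (float) (common_parents + common_children) >= threshold;
}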
9.6 BRIEF DESCRIPTION OF C PROGRAMS INCLUDED

Three programs are included at the end of this chapter. The first can be used to select terms and to construct phrases; the second generates (or reads) and stores hierarchies. The third program merges different hierarchies.
9.6.1 Program Select.c
This program contains a variety of routines for the various selection criteria used in designing the thesaurus vocabulary (see Appendix 9.A). It requires two input files: a direct file and an inverted file. Both contain the same information, but arranged differently. The direct file is a listing of document numbers corresponding to the database of interest. Each document number is associated with a set of term and term-weight pairs. A document may be associated with all its component terms or perhaps only a select few. The term-weight represents the strength of association between the document and the term. This term-weight may be assigned manually, or for example be some function of the term's frequency of occurrence in the document. The file is arranged such that the rows pertaining to the same document are grouped together. The inverted file is a listing of terms. Here each term is linked to its associated document numbers and the term-weights. The interpretation of term-weights should be the same in both input files. In fact, the two files should contain identical information. The inverted index is arranged such that rows corresponding to a term are grouped together. In both files, document numbers are represented by integers; the terms are character strings and the weights are decimal numbers. One or more spaces may be used to distinguish between the three. Figure 9.2 below shows a brief extract of both files.
Direct file extract        Inverted file extract

2 math 2.0                 mellitus 1 1.0
2 logic 1.0                logic 2 1.0
1 diabetes 2.0             diabetes 1 2.0
1 mellitus 1.0             math 2 2.0
3 math 1.0                 math 3 1.0

Figure 9.2: Short extracts from both input files
Besides these two files, an output file must also be specified. The user will have to specify four global variables. The first two are MAXWORD, specifying the maximum number of characters in a term, and MAXWDF, specifying the expected maximum frequency of occurrence for a term within a document. The other two parameters are LOW_THRESHOLD and HIGH_THRESHOLD, which are used when partitioning the terms by frequency of occurrence in the collection into HIGH, MEDIUM, and LOW frequency classes.
9.6.2 Program Hierarchy.c
This program can perform two major and separate functions (see Appendix 9.B). First, if given the hierarchical relationships between a set of terms, it records these relationships in its internal inverted file structure. This can then be used for other purposes. Second, it is also capable of generating the hierarchical structure automatically using Rada's algorithm. For this, the input required is the inverted file, which has the same structure as in Figure 9.2. The second input file is a link file, which is a sequence of rows representing link information. A row consists of a parent term followed by any number of spaces and then a child term. This file is used if the link information is simply provided to the program. For this the user will have to set two parameters: MAXWORD, which specifies the maximum size for a term, and NUMBER_OF_LEVELS, which constrains the size of the generated hierarchy.
9.6.3 Program merge.c
This program contains routines to perform the two types of merging functions described (see Appendix 9.C). Four input files are required here: an inverted file and a link file for each thesaurus hierarchy. Their formats are as described before. Two parameters will have to be set: MAXWORD, described before, and THRESHOLD, which specifies the minimum similarity for use in the complex merge routine.
9.7 CONCLUSION
This chapter began with an introduction to thesauri and a general description of thesaural features. Two major automatic thesaurus construction methods have been detailed. A few related issues pertinent to thesauri have not been considered here: evaluation of thesauri, maintenance of thesauri, and how to automate the usage of thesauri. The focus has been on the central issue, which is the construction of thesauri. However, these secondary issues will certainly be important in any realistic situation.
REFERENCES
AITCHISON, J., and A. GILCHRIST. 1972. Thesaurus Construction -- A Practical Manual. London: ASLIB.
BOOKSTEIN, A., and D. R. SWANSON. 1974. "Probabilistic Models for Automatic Indexing." J. American Society for Information Science, 25(5), 312-18.
CAN, F., and E. OZKARAHAN. 1985. Concepts of the Cover-Coefficient-Based Clustering Methodology. Paper presented at the Eighth International Conference on Research and Development in Information Retrieval. Association for Computing Machinery, 204-11.
CHOUEKA, Y. 1988. Looking for Needles in a Haystack OR Locating Interesting Collocational Expressions in Large Textual Databases. Paper presented at the Conference on User-Oriented Content-Based Text and Image Handling, MIT, Cambridge, Mass. 609-23.
FORSYTH, R., and R. RADA. 1986. Machine Learning -- Applications in Expert Systems and Information Retrieval. West Sussex, England: Ellis Horwood Series in Artificial Intelligence.
FOX, E. A. 1981. "Lexical Relations: Enhancing Effectiveness of Information Retrieval Systems." SIGIR Newsletter, 15(3).
FOX, E. A., J. T. NUTTER, T. AHLSWEDE, M. EVENS, and J. MARKOWITZ. 1988. Building a Large Thesaurus for Information Retrieval. Paper presented at the Second Conference on Applied Natural Language Processing. Association for Computational Linguistics, 101-08.
FOX, C. 1989/1990. "A Stop List for General Text." SIGIR Forum, 21(1-2), 19-35.
FROST, C. O. 1987. "Subject Searching in an Online Catalog." Information Technology and Libraries, 6, 60-63.
GUNTZER, U., G. JUTTNER, G. SEEGMULLER, and F. SARRE. 1988. Automatic Thesaurus Construction by Machine Learning from Retrieval Sessions. Paper presented at the Conference on User-Oriented Content-Based Text and Image Handling, MIT, Cambridge, Mass., 588-96.
HARTER, S. P. 1975. "A Probabilistic Approach to Automatic Keyword Indexing. Parts I and II." J. American Society for Information Science, 26, 197-206 and 280-89.
MCGILL, M. et al. 1979. An Evaluation of Factors Affecting Document Ranking by Information Retrieval Systems. Project report. Syracuse, New York: Syracuse University School of Information Studies.
SALTON, G., and M. MCGILL. 1983. Introduction to Modern Information Retrieval. New York: McGraw-Hill.
SALTON, G., and C. S. YANG. 1973. "On the Specification of Term Values in Automatic Indexing." Journal of Documentation, 29(4), 351-72.
SOERGEL, D. 1974. "Automatic and Semi-Automatic Methods as an Aid in the Construction of Indexing Languages and Thesauri." Intern. Classif. 1(1), 34-39.
SRINIVASAN, P. 1990. "A Comparison of Two-Poisson, Inverse Document Frequency and Discrimination Value Models of Document Representation." Information Processing and Management, 26(2), 269-78.
WANG, Y-C., J. VANDENDORPE, and M. EVENS. 1985. "Relationship Thesauri in Information Retrieval." J. American Society for Information Science, 15-27.
APPENDIX
/*
PROGRAM NAME: hierarchy.c
PURPOSE: This program will generate a hierarchy in two ways.
1) It can simply read the parent-child links from an input file and
store the links in the inverted file structure,
OR
2) It can use the Rada algorithm which splits up words
into different frequency groups and then builds links
between them.
INPUT FILES REQUIRED: (Depends on the option selected).
Option 1: requires inverted file and link file.
Option 2: requires inverted file.
1) inverted file: sequences of
term document number weight.
(multiple entries for any term should be grouped together)
2) links file: sequences of
parent term child term
NOTES: Filters such as stop lists and stemmers should be used
before running this program.
PARAMETERS TO BE SET BY USER:
1) MAXWORD: identifies the maximum size of a term
2) NUMBER_OF_LEVELS: specifies the desired number of levels
in the thesaurus hierarchy to be generated, used by the
Rada algorithm
COMMAND LINE: (INPUT & OUTPUT FILES ARE SPECIFIED INTERACTIVELY)
hierarchy
***********************************************************************/
#include <stdio.h>
#include <string.h>
#include <math.h>
#define MAXWORD 20 /* maximum size of a term */
#define NUMBER_OF_LEVELS 10 /* # of levels desired in the thesaurus */
struct doclist { /* sequences of document # and weight pairs */
int doc; /* document number */
float weight; /* term weight in document */
struct doclist *nextdoc; /* ptr. to next doclist record */
} doclistfile;
struct parentlist {
char term[MAXWORD]; /* parent term */
struct invert *parent; /* ptr. to parent term in inverted file */
struct parentlist *nextparent; /* ptr. to next parentlist record */
} parentfile;
struct childlist {
char term[MAXWORD]; /* child term */
struct invert *child; /* ptr. to child term in inverted file */
struct childlist *nextchild; /* ptr. to next childlist record */
} childfile;
struct invert { /* inverted file */
char term[MAXWORD]; /* term */
struct doclist *doc; /* sequences of document # and weight pairs */
struct parentlist *parents; /* ptr. to parent terms */
struct childlist *children; /* ptr. to child terms */
int level; /* thesaurus level based on term frequency */
struct invert *nextterm; /* ptr. to next invert record */
} invfile;
struct invert *startinv; /* ptr. to first record in inverted file */
struct invert *lastinv; /* ptr. to last record in inverted file */
struct doclist *lastdoc; /* ptr. to last document in doclist */
static char currentterm[MAXWORD]; /* tracks current term in inverted file */
static int Number_of_docs; /* total # of documents which is computed */
static struct invert *get_mem_invert ( ); /* these 4 functions obtain memory */
static struct doclist *get_mem_doclist ( ); /* for records; the type of record */
static struct parentlist *get_mem_parentlist ( ); /* is indicated by the name */
static struct childlist *get_mem_childlist ( ); /* of the function */
static FILE *input; /* inverted file */
static FILE *input1; /* link file */
static FILE *output; /* holds any output */
static
float cohesion ( ), /* compute cohesion between two terms */
total_wdf ( ), /* compute total frequency of term in dbse. */
get_freq_range ( );
static
void read_invfile ( ), /* read in the inverted file */
read_links ( ), /* read in the links file */
add_link ( ), /* called within read_links ( ) */
pr_invert ( ), /* print the inverted file */
add_invert ( ), /* called within read_invfile ( ) */
write_levels ( ), /* initialize the levels information */
generate_Rada_hierarchy ( ), /* generate the Rada hierarchy */
get_term_data ( ); /* get basic information about terms */
struct invert *find_term ( ); /* searches for term in inverted file and */
/* returns its address. */
int main (argc)
int argc;
{
char ch, fname[128];
startinv = NULL; lastinv = NULL; lastdoc = NULL;
currentterm [ 0 ] = '\0'; Number_of_docs = 0;
if (argc > 1)
{
(void) printf ("There is an error in the command line\n");
(void) printf ("Correct usage is:\n");
(void) printf ("hierarchy\n");
exit (1);
}
(void) printf ("\nMake a selection\n");
(void) printf ("To simply read links from a link file enter 1\n");
(void) printf ("To use Rada's algorithm to generate links enter 2\n");
(void) printf ("To quit enter 3\n");
(void) printf ("Enter selection: ");
ch=getchar ( );
switch (ch)
{
case '1':
(void) printf ("\nEnter name of inverted file: ");
(void) scanf ("%s", fname);
if ( (input=fopen (fname, "r") ) ==NULL) {
(void) printf ("cannot open file %s\n", fname);
exit (1);
}
(void) printf ("Enter name of link file: ");
(void) scanf ("%s", fname);
if ( (input1=fopen (fname,"r") ) == NULL) {
(void) printf ("cannot open file %s\n",fname);
exit (1);
}
(void) printf ("Enter name of output file: ");
(void) scanf ("%s", fname);
if ( (output=fopen (fname,"w") ) ==NULL) {
(void) printf ("cannot open file %s\n", fname);
exit (1);
}
read_invfile ( );
(void) fprintf (output,"\nINVERTED FILE\n\n");
pr_invert ( );
read_links ( );
(void) fprintf (output,"\nINVERTED FILE WITH LINK INFORMATION\n\n");
pr_invert ( );
(void) fclose (input); (void) fclose (input1); (void) fclose (output);
break;
case '2':
(void) printf ("\nEnter name of inverted file: ");
(void) scanf ("%s", fname);
if ( (input=fopen (fname,"r") ) ==NULL) {
(void) printf ("cannot open file %s\n", fname);
exit (1);
}
(void) printf ("Enter name of output file: ");
(void) scanf ("%s", fname);
if ( (output=fopen (fname,"w") ) ==NULL) {
(void) printf ("cannot open file %s\n", fname);
exit (1);
}
read_invfile ( );
(void) fprintf (output,"\nINVERTED FILE\n\n");
pr_invert ( );
generate_Rada_hierarchy ( );
(void) fprintf (output,"\nINVERTED FILE AFTER GENERATING RADA HIERARCHY\n\n");
pr_invert ( );
(void) fclose (input); (void) fclose (output);
break;
case '3':
exit (0);
}
return (0);
}
/***********************************************************************
read_invfile ( )
Returns: void
Purpose: Read in the inverted file entries from the disk file
**/
static void read_invfile ( )
{
int docid; /* holds current document number */
char temp [MAXWORD]; /* holds current term */
float weight; /* holds current term weight */
struct doclist *p; /* structure to store doc#-weight pair */
(void) fscanf (input,"%s%d%f", temp, &docid, &weight); /* read next line */
while (strlen (temp) > 0)
/* while a legitimate term has been read */
{
if (!strncmp (currentterm, temp, MAXWORD) ) {
/* if this term has previously been entered in inverted file then */
/* only need to attach a doclist record to the same entry */
p = get_mem_doclist ( ); /* get memory for doclist record */
p->doc = docid; /* assign document number */
p->weight = weight; /* assign term weight */
p->nextdoc = NULL;
if (lastdoc) lastdoc->nextdoc = p; /* connect p to the doclist chain
for this term if it already exists */
lastdoc = p; /* set this global variable */
}
else add_invert (docid, temp, weight);
/* else term is a brand new term & need to make a new inverted file entry */
temp [0] = '\0';
(void) fscanf (input,"%s%d%f", temp, &docid, &weight); /* read next line */
}
}
/***********************************************************************
add_invert(docid,temp,weight)
Returns: void
Purpose: Start a new entry in the inverted file. It is called in the
read_invfile function when a new term is read from the input
file.
**/
static void add_invert (docid, temp, weight)
int docid; /* in: document number */
char temp [MAXWORD]; /* in: new index term */
float weight; /* in: index term weight */
{
struct invert *p; /* p will get attached to inverted file */
p = get_mem_invert ( ); /* get memory for p */
(void) strncpy (p->term, temp, MAXWORD); /* copy over the term */
p->parents = NULL; /* to begin this term has no parent terms */
p->children = NULL; /* also no child terms */
p->doc = get_mem_doclist ( ); /* start a doclist structure */
p->doc->doc = docid; /* assign document number */
p->doc->weight = weight; /* assign term weight */
p->doc->nextdoc = NULL;
p->nextterm = NULL;
if (startinv == NULL) startinv = p;
/* if this is the first entry in inverted file, then update global variable */
if (lastinv) lastinv->nextterm = p; /* update ptr. to last inverted file record */
lastinv = p;
lastdoc = p->doc; /* update ptr. to last document */
(void) strncpy(currentterm, temp, MAXWORD); /* update global variable currentterm */
} /* to the new term just added */
/**********************************************************************
read_links( )
Returns: void
Purpose: Read parent-child link information from a file and record
links in the inverted record structure
**/
static void read_links( )
{
char parent[MAXWORD], /* tracks parent term */
child[MAXWORD]; /* tracks child term */
(void) fscanf(input1,"%s%s",parent,child); /* read input line */
while (strlen(parent) > 0)
/* while a legitimate parent has been found */
{
add_link(parent,child); /* this function will add the appropriate links in */
/* the inverted file */
child[0] = '\0'; parent[0] = '\0'; /* now throw out the old parent & child */
(void) fscanf(input1,"%s%s",parent,child); /* read next input line */
}
}
/***********************************************************************
add_link(parent,child)
Returns: void.
Purpose: Used within read_links. Basically, for each parent-child
link specified, it adds the appropriate link information into
the inverted file.
Notes: If a term in the link file is not in the inverted file then
the program will give a suitable message and exit.
**/
static void add_link(parent,child)
char parent[MAXWORD], /* in: holds the parent term */
child[MAXWORD]; /* in: holds the child term */
{
struct invert *p, /* holds add. of parent term in inv. file */
*q; /* holds add. of child term in inv. file */
struct parentlist *new_parent; /* structure used to store parent info. */
struct childlist *new_child; /* structure used to store child info. */
p = find_term(parent); /* find address of parent term */
if (!p) { printf("\nPlease check the input files.\n");
printf("\nParent term %s is not in the inverted file\n",parent);
exit(0);}
q = find_term(child); /* find address of child term */
if (!q) { printf("\nPlease check the input files. Output may be incorrect.\n");
printf("\nChild term %s is not in the inverted file\n",child);
exit(0);}
/* first add parent links for given child */
new_parent = get_mem_parentlist( ); /* get memory for parentlist record */
(void) strncpy(new_parent->term, parent, MAXWORD); /* copy over parent term */
new_parent->parent = p; /* store address of parent term in inverted file */
if (q->parents == NULL) { /* i. e. no parents listed for given child yet */
q->parents = new_parent; /* first parent link made */
new_parent->nextparent = NULL;
}
else { /* at least 1 parent already listed for given child */
new_parent->nextparent = q->parents; /* attach newparent to front of list */
q->parents = new_parent;
}
/* next add child links for given parent */
new_child = get_mem_childlist( ); /* get memory for childlist record */
(void) strncpy(new_child->term,child,MAXWORD); /* copy over child term */
new_child->child = q; /* store address of child term in inverted file*/
if (p->children == NULL) { /* no children listed for given parent yet*/
p->children = new_child; /* first child link made */
new_child->nextchild = NULL;
}
else { /* at least 1 child already listed for given parent */
new_child->nextchild = p->children; /* attach newchild to front of list */
p->children = new_child;
}
}
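/* Example (hypothetical input data): the link file line
       computers algorithm
   makes "computers" a parent of "algorithm". A parentlist record naming
   "computers" is pushed onto the front of the child's q->parents chain,
   and a childlist record naming "algorithm" onto the front of the
   parent's p->children chain; each record also stores the address of
   the other term's inverted file entry. */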
/***********************************************************************
pr_invert( )
Returns: void
Purpose: Print the inverted file. It prints each term, its
associated document numbers, term-weights and parent child terms.
**/
static void pr_invert( )
{
struct invert *inv_addr; /* tracks address of current inv. file record */
struct doclist *doc_addr; /* tracks address of current doclist record */
struct parentlist *parent_addr; /* tracks address of current parentlist record */
struct childlist *child_addr; /* tracks address of current childlist record */
inv_addr = startinv; /* begin at top of inverted file */
while (inv_addr) { /* while a legitimate term ... */
(void) fprintf(output, "TERM: %s\nPARENT TERMS: ", inv_addr->term);
parent_addr = inv_addr->parents; /* find addr. of first parent */
while (parent_addr) { /* printing all parents */
(void) fprintf(output, "%s ", parent_addr->term);
parent_addr = parent_addr->nextparent; /*loop through remaining parents */
}
(void) fprintf(output,"\nCHILD TERMS: ");
child_addr = inv_addr->children; /* find addr. of first child */
while (child_addr) { /* printing all children */
(void) fprintf(output,"%s ",child_addr->term);
child_addr = child_addr->nextchild; /* loop through remaining children */
}
(void) fprintf(output,"\n\n");
(void) fprintf(output," DOCUMENT NUMBER TERM
doc_addr = inv_addr->doc; /* find addr. of first associated doc. */
while (doc_addr) { /* print all docs. and term weights */
(void) fprintf(output," %-30d ",doc_addr->doc);
(void) fprintf(output,"%-10.5f\n", doc_addr->weight);
doc_addr = doc_addr->nextdoc; /* loop through remaining documents */
}
(void) fprintf(output,"\n");
inv_addr = inv_addr->nextterm; /* go to next inverted file entry */
}
}
/***********************************************************************
total_wdf(term)
Returns: float
Purpose: Compute total within document frequency for specified term
in the database
**/
static float total_wdf(term)
char term[MAXWORD]; /* in: term for which total_wdf is required */
{
struct invert *term_addr; /* add. of above term in inverted file */
struct doclist *doc_ad; /* tracks add. of associated doclist record */
float totalwdf; /* tracks total wdf */
totalwdf = 0.0;
term_addr = find_term(term); /* obtain address of the term in inv. file */
if (term_addr) { /* if term was found */
doc_ad = term_addr->doc; /* find address of associated doclist record */
while (doc_ad) {
totalwdf = totalwdf + doc_ad->weight; /* loop through doclist records to */
doc_ad = doc_ad->nextdoc; /* compute the total weight */
}}
else (void) fprintf(output, "Term %s is not in the inverted file. Could lead to problems\n",term);
return(totalwdf);
}
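/* Worked example (hypothetical data): if "algorithm" occurs in doc 1
   with weight 2.0 and in doc 3 with weight 1.0, total_wdf("algorithm")
   returns 3.0, the term's total within document frequency. */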
/***********************************************************************
get_freq_range(minimum,maximum)
Returns: float
Purpose: Compute the difference between the maximum total term frequency
and the minimum total term frequency observed in the inverted file.
**/
static float get_freq_range(minimum, maximum)
float *minimum, /* out: returns minimum totalwdf */
*maximum; /* out: returns maximum totalwdf */
{
struct invert *inv_addr;
float freq, max, min;
inv_addr = startinv; /* begin at top of inverted file */
/* initialize min and max to equal frequency of 1st term in file */
if (inv_addr) {
freq = total_wdf(inv_addr->term);
min = freq; max = freq;
inv_addr = inv_addr->nextterm; /* go to next term in inv. file */
}
while (inv_addr) {
/* while a legitimate term compare with max and min. */
freq = total_wdf(inv_addr->term);
if (freq > max) max = freq;
if (freq < min) min = freq;
inv_addr = inv_addr->nextterm; /* go to next term in inv. file */
}
*minimum = min; *maximum = max;
return(max - min); /* returning the difference */
}
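/* Worked example (hypothetical data): for three terms with total
   frequencies 2.0, 5.0 and 9.0, the function sets *minimum = 2.0 and
   *maximum = 9.0 and returns the range 7.0. */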
/***********************************************************************
write_levels( )
Returns: void
Purpose: Write the level numbers for each term into the inverted file
depending on the total wdf of the term in the database and
the user selected parameter NUMBER_OF_LEVELS.
The level numbers are marked 0, 1, 2, ... etc. Level 0
refers to the highest frequency class, level 1 the next
frequency class etc.
**/
static void write_levels( )
{
int i, /* counter through the different levels */
number; /* holds NUMBER_OF_LEVELS */
float freq, /* holds frequency of term in database */
range, /* holds diff. between highest & lowest freqs. */
current_low, /* tracks lower frequency of current level */
current_high; /* tracks higher frequency of current level */
float high, /* highest term frequency in database */
low; /* lowest term frequency in database */
struct invert *inv_addr; /* tracks current inverted file record */
/* range holds the difference between highest & lowest totalwdf in dbse. */
range = get_freq_range(&low,&high);
number = NUMBER_OF_LEVELS; /* user specified global parameter */
inv_addr = startinv; /* start with the first term in inverted file */
while(inv_addr) { /* while a legitimate term was found */
freq = total_wdf(inv_addr->term);
current_low = low;
for (i=(number-1); i>=0; i--) {
if (i == 0) current_high = high; else current_high = current_low + (range/number);
/* if the term's frequency is within this narrow range, then level = i */
if ((freq >= current_low) && (freq <= current_high)) {
inv_addr->level = i;
break;
}
current_low = current_high; /* loop through the frequency levels */
} /* ending for loop */
inv_addr = inv_addr->nextterm; /* loop through other inv. file terms */
}
}
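/* Worked example (assuming NUMBER_OF_LEVELS = 3): with low = 2.0 and
   high = 14.0 the range is 12.0 and each band is range/number = 4.0
   wide. Scanning i from 2 down to 0 gives bands [2,6], [6,10] and
   [10,14], so a term with total frequency 8.0 lands in level 1 and a
   term with frequency 12.0 in level 0, the highest frequency class. */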
/***********************************************************************
generate_Rada_hierarchy ( )
Returns: void
Purpose: Create the levelptrs data structure and generate the hierarchy
according to Rada's algorithm
**/
static void generate_Rada_hierarchy ( )
{
struct termlist {
struct invert *term; /* pointer to term in inverted file */
int mark; /* equals 1 if term is propagated else 0 */
struct termlist *nextterm; /* pointer to next termlist record */
} termlistfile, *p, *q, *r;
/* levelptrs is an array of pointers. Each slot points to the start of
the chain of termlist records for that level */
struct termlist *levelptrs[NUMBER_OF_LEVELS];
int i;
struct invert *inv_addr; /* tracks current term in inverted file */
float coh, max_cohesion;
write_levels ( ); /* this routine computes and writes the level number for each */
/* term in the inverted file */
for (i=0;i < NUMBER_OF_LEVELS;i++) levelptrs[i] = NULL; /* initializing the array */
/* now create the termlist chain for each level */
inv_addr = startinv; /* start with first term in inverted file */
while (inv_addr) { /* while there is a term there */
p = (struct termlist *)malloc(sizeof(termlistfile)); /* get memory for termlist */
if (!p) {
(void) fprintf(output,"\nout of memory\n");
exit(1);
}
p->term = inv_addr; /* assign the address of term in inverted file */
p->mark = 0; /* initially term not linked */
/* Note: this term has been assigned to a frequency level already. Now, if this */
/* is the first term read for this level then set the appropriate levelptrs entry*/
/* to point to this term */
if (levelptrs[inv_addr->level] == NULL) {
levelptrs[inv_addr->level] = p;
p->nextterm = NULL;
}
else { /* now this is not the first term encountered for this level, so simply */
/* attach it to the front of the chain */
p->nextterm = levelptrs[inv_addr->level];
levelptrs[inv_addr->level] = p;
}
inv_addr = inv_addr->nextterm; /* process next inverted file term */
} /* end while */
/* start with each level and compute max-cohesion with previous level */
for (i=1;i < NUMBER_OF_LEVELS;i++) {
p = levelptrs[i];
while (p) {
max_cohesion = 0.0;
q = levelptrs[i-1]; /* q set to the previous level's first term */
while (q) { /* as long as there are terms in this previous level . . . */
coh = cohesion(p->term->term,q->term->term);
if (coh > max_cohesion) max_cohesion = coh;
q = q->nextterm;
}
/* max_cohesion for terms in p has been computed */
/* adding parent-child links and marking parents as propagated */
q = levelptrs[i-1];
while (q && max_cohesion > 0.0) {
coh = cohesion(p->term->term,q->term->term);
if (coh == max_cohesion) { /* this ensures multiple links possible */
add_link(q->term->term,p->term->term); /* routine adds the actual link */
q->mark = 1; /* to show that parent term has been linked */
}
q = q->nextterm;
} /* end while(q) */
p = p->nextterm; /* go to next term in level i */
max_cohesion = 0.0;
} /* end while(p) */
/* checking all terms in level[i-1] to make sure they have propagated */
q = levelptrs[i-1];
while (q) {
if (q->mark == 0) { /* i.e. term has no child in next level */
q->mark = 1;
r = (struct termlist *)malloc(sizeof(termlistfile));
if (!r) { (void) fprintf(output,"\nout of memory\n");
exit(2);
}
r->term = q->term; r->mark = 0; /* making a copy of term as its dummy child */
r->nextterm = levelptrs[i]; /* inserting r at beginning of level i chain */
levelptrs[i] = r;
}
q = q->nextterm;
}
} /* for */
}
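/* Summary of the routine above: terms are first assigned to frequency
   levels (level 0 = highest frequency class). Then, for each level
   i >= 1, every term is linked as a child of the term(s) on level i-1
   with which it has maximum cohesion; ties produce multiple parents.
   Any level i-1 term that acquires no child is copied into level i as
   its own dummy child, so every term propagates down to the bottom of
   the hierarchy. */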
/***********************************************************************
cohesion(term1, term2)
Returns: float
Purpose: Compute the cohesion between two terms
**/
static float cohesion(term1, term2)
char term1[MAXWORD], /* in: the two terms which are being */
term2[MAXWORD]; /* in: compared to determine cohesion */
{
float l1, /* holds # of documents associated with term 1 */
l2, /* holds # of documents associated with term 2 */
common; /* holds # of documents in common */
get_term_data(term1, term2, &l1, &l2, &common);
return(common/(sqrt(l1 * l2)));
}
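/* Worked example (hypothetical data): if term1 occurs in documents
   {1, 2, 3} and term2 in documents {2, 3, 4}, then l1 = 3, l2 = 3 and
   common = 2, giving cohesion = 2/sqrt(3 * 3) = 0.667 (approx.). */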
/***********************************************************************
get_term_data(term1, term2, l1, l2, common)
Returns: void
Purpose: Given two terms, it determines the number of documents in
each and in common between them.
**/
static void get_term_data(term1, term2, l1, l2, common)
char term1[MAXWORD], /* in: term 1 */
term2[MAXWORD]; /* in: term 2 */
float *l1, /* out: # of documents associated with term 1 */
*l2, /* out: # of documents associated with term 2 */
*common; /* out: # of documents associated with both */
{
struct invert *p, *q; /* holds addresses of both terms in inv. file */
struct doclist *doc_ad1, *doc_ad2; /*tracks addresses of doclists records */
int count1, /* # of documents associated with term 1 */
count2, /* # of documents associated with term 2 */
com; /* # of documents in common */
p = find_term(term1); q = find_term(term2); /* find addresses of terms */
doc_ad1 = p->doc; /* start with doclist record for term1 */
count1 = 0; count2 = 0; com = 0; /* initialize */
/* first get length for document 1 and number of common terms */
while (doc_ad1) {
count1 = count1 +1;
doc_ad2 = q->doc;
/* for each doc. of term1 loop through all docs. of term2 */
while (doc_ad2) {
if (doc_ad1->doc == doc_ad2->doc) { /* if they are the same doc. # */
com = com +1;
break;
}
doc_ad2 = doc_ad2->nextdoc;
}
doc_ad1 = doc_ad1->nextdoc;
}
/* now get length of document 2 */
doc_ad2 = q->doc;
while (doc_ad2) {
count2 = count2 + 1;
doc_ad2 = doc_ad2->nextdoc;
}
*l1 = count1; *l2 = count2; *common = com;
}
/***********************************************************************
*find_term(term)
Returns: address of a struct invert record
Purpose: Search for a specified term in the inverted file and
return address of the corresponding inverted file record.
**/
static struct invert *find_term(term)
char term[MAXWORD]; /* in: term to be located in inverted file */
{
struct invert *inv_addr; /* tracks addr. of current rec. in inv. file */
inv_addr = startinv; /* begin at top of inv. file */
while(inv_addr) {
if (!strcmp(term, inv_addr->term)) return(inv_addr);
inv_addr = inv_addr->nextterm;
}
(void) fprintf(output,"Term %s not found\n",term);
return (NULL);
}
/***********************************************************************
*get_mem_invert( )
Returns: address of a struct invert record
Purpose: dynamically obtain enough memory to store 1 invert record.
**/
static struct invert *get_mem_invert( )
{
struct invert *record;
record = (struct invert *)malloc(sizeof(invfile));
if (!record) {
(void) fprintf(output,"\nout of memory\n");
return (NULL);
}
return(record);
}
/***********************************************************************
*get_mem_doclist( )
Returns: address of a struct doclist record
Purpose: dynamically obtain enough memory to store 1 doclist record.
**/
static struct doclist *get_mem_doclist( )
{
struct doclist *record;
record = (struct doclist *)malloc(sizeof(doclistfile));
if (!record) {
(void) fprintf(output,"\nout of memory\n");
return (NULL);
}
return(record);
}
/***********************************************************************
*get_mem_parentlist()
Returns: address of a struct parentlist record
Purpose: dynamically obtain enough memory to store 1 parentlist record
**/
static struct parentlist *get_mem_parentlist ( )
{
struct parentlist *record;
record = (struct parentlist *)malloc(sizeof (parentfile));
if (!record) {
(void) fprintf (output,"\nout of memory\n");
return (NULL);
}
return(record);
}
/***********************************************************************
*get_mem_childlist()
Returns: address of a struct childlist record
Purpose: dynamically obtain enough memory to store 1 childlist record
**/
static struct childlist *get_mem_childlist( )
{
struct childlist *record;
record = (struct childlist *)malloc(sizeof (childfile));
if (!record) {
(void) fprintf(output,"\nout of memory\n");
return(NULL);
}
return(record);
}
/*
PROGRAM NAME: select.c
PURPOSE: 1) Compute Discrimination Value of terms,
2) Compute Poisson Distributions for terms,
3) Partition terms by their total within document frequencies,
4) Compute cohesion between pairs of terms,
5) Compute Dice's coefficient of similarity between two terms.
INPUT FILES REQUIRED:
1) a direct file, sequences of:
document# term weight
(multiple entries for any document should be grouped together )
2) an inverted file, sequences of
term document# weight
(multiple entries for any term should be grouped together)
NOTES: Filters such as stop lists and stemmers should be used
before running this program.
PARAMETERS TO BE SET BY USER:
1) MAXWORD - maximum size of a term
2) MAXWDF - maximum value expected for the within document
frequency for a term in the collection.
3) LOW_THRESHOLD - threshold for LOW and MID frequency ranges
4) HIGH_THRESHOLD - threshold for MID and HIGH frequency ranges
COMMAND LINE:
select direct_file inverted_file output_file
***********************************************************************/
#include <stdio.h>
#include <string.h>
#include <math.h>
#define MAXWORD 20 /* maximum size of a term */
#define MAXWDF 30 /* maximum WDF for a word in a database */
#define LOW_THRESHOLD 2.0
#define HIGH_THRESHOLD 4.0
struct termlist { /* sequences of term and weight pairs */
char term[MAXWORD]; /* term */
float weight; /* term weight in document */
struct termlist *nextterm; /* ptr. to next termlist record */
} termlistfile;
struct doclist { /* sequences of document # and weight pairs */
int doc; /* document number */
float weight; /* term weight in document */
struct doclist *nextdoc; /* ptr. to next doclist record */
} doclistfile;
struct direct { /* direct file: document to list of terms*/
int docnum; /* document # */
struct termlist *terms; /* sequences of term and weight pairs */
struct direct *next; /* ptr. to next direct record */
} directfile;
struct invert { /* inverted file: term to list of documnts */
char term[MAXWORD]; /* term */
struct doclist *doc; /* sequences of document # and weight pairs */
struct invert *next; /* ptr. to next invert record */
} invfile;
static struct invert *startinv; /* ptr. to first record in inverted file */
static struct invert *lastinv; /* ptr. to last record in inverted file */
static struct doclist *lastdoc; /* ptr. to last document in doclist */
static struct direct *startdir; /* ptr. to first record in direct file */
static struct direct *lastdir; /* ptr. to last record in direct file */
static struct termlist *lastterm; /* ptr. to last term in termlist */
static struct direct *start_centroid; /* ptr. to centroid record */
static FILE *input; /* direct file */
static FILE *input1; /* inverted file */
static FILE *output; /* file to hold all output */
static int currentdoc; /* tracks current document in direct file */
static char currentterm[MAXWORD]; /* tracks current term in inverted file */
static int Number_of_docs; /* total # of documents; computed during input */
static
float av_doc_similarity( ), /* compute average doc. similarity in dbse. */
factorial( ), /* compute factorial of a number */
cosine( ), /* compute cosine between two terms*/
dice( ), /* compute dice between two terms */
total_wdf( ), /* compute total frequency of term in dbse. */
cohesion( ); /* compute cohesion between two terms */
static
void initialize( ), /* initialize files and global variables */
read_directfile( ), /* read in the direct file */
add_direct( ), /* called within read_directfile( ) */
pr_direct( ), /* print the direct file */
read_invfile( ), /* read in the inverted file */
add_invert( ), /* called within read_invfile( ) */
pr_invert( ), /* print the inverted file */
centroid( ), /* compute the document centroid for dbse. */
pr_centroid( ), /* print the document centroid */
get_Poisson_dist( ), /* compute Poisson distributions for terms */
Partition_terms( ), /* partition terms by frequency */
dv_all( ), /* compute discrimination value of terms */
get_doc_data( ), /* get basic info. about documents */
get_term_data( ); /* get basic info. about terms */
static struct direct *get_mem_direct( ); /* these 4 get_mem functions are */
static struct invert *get_mem_invert( ); /* used to obtain memory for a */
static struct doclist *get_mem_doclist( ); /* record. The record type is */
static struct termlist *get_mem_termlist( ); /* obvious from the name */
struct invert *find_term( ); /* searches for term in inverted file */
/* and returns its address */
int main(argc,argv)
int argc;
char *argv[ ];
{
char ch, word1[MAXWORD], word2[MAXWORD];
if (argc!=4)
{
(void) printf ("There is an error in the command line\n");
(void) printf("Correct usage is\n");
(void) printf ("select direct_file inverted_file output_file\n");
exit(1);
}
initialize(argv);
(void) fprintf(output,"\nREADING IN DIRECT FILE\n");
read_directfile( );
(void) fprintf(output,"\nPRINTING DIRECT FILE\n\n");
pr_direct( );
(void) fprintf(output,"\nNUMBER OF DOCUMENTS IS: %d\n\n",Number_of_docs);
(void) fprintf(output,"\nREADING IN INVERTED FILE\n");
read_invfile( );
(void) fprintf(output,"\nPRINTING INVERTED FILE\n");
pr_invert( );
(void) printf("\nPlease make a selection\n\n");
(void) printf ("To compute DV for all terms enter 1\n");
(void) printf ("TO compute Poisson distributions enter 2\n");
(void) printf("To partition terms by frequency enter 3\n");
(void) printf ("To compute cohesion between two terms (for phrase construction) enter 4\n");
(void) printf("To compute Dice's coefficient between two terms enter 5\n");
(void) printf ("To quit enter 6\n\n");
(void) printf("Enter your choice: ");
ch = getchar( );
switch(ch)
{
case '1':
centroid( );
(void) fprintf(output, "\nCENTROID\n\n");
pr_centroid( );
(void) fprintf(output,"\nDISCRIMINATION VALUES FOR ALL TERMS\n\n");
dv_all( );
break;
case '2':
(void) fprintf(output,"\nACTUAL AND POISSON DISTRIBUTIONS OF WITHIN DOCUMENT
FREQUENCIES FOR ALL TERMS\n\n");
(void) fprintf(output,"WDF = Within Document Frequency & #docs = Number of
documents\n\n");
get_Poisson_dist( );
break;
case '3':
(void) printf("Make sure that the threshold parameters are set correctly in
the programm\n");
(void) fprintf(output,"\nPARTITIONING THE TERMS INTO LOW, MEDIUM, HIGH
FREQUENCY CLASSES\n\n");
Partition_terms( );
break;
case '4':
(void) printf("enter first word: ");
(void) scanf("%s", word1);
if (find_term(word1) == NULL) {
printf("sorry, %s is not in the inverted file\n", word1);
break;
}
(void) printf("enter second word: ");
(void) scanf("%s", word2);
if (find_term(word2) == NULL) {
printf("sorry, %s is not in the inverted file\n", word2);
break;
}
(void) fprintf(output, "Cohesion between %s and %s is %f\n", word1, word2,
cohesion(word1, word2));
break;
case '5':
(void) printf("enter first word: ");
(void) scanf ("%s", word1);
if (find_term(word1) == NULL) {
printf("sorry, %s is not in the inverted file\n",word1);
break;
}
(void) printf ("enter second word: ");
(void) scanf("%s", word2);
if (find_term(word2) == NULL) {
printf("sorry, %s is not in the inverted file\n", word2);
break;
}
(void) fprintf(output,"Dice's coefficient between %s and %s is %f\n", word1,
word2,dice(word1,word2));
break;
case '6':
exit(0);
default:
(void) printf("no selection made\n");
}
(void) fclose(input1); (void) fclose(input); (void) fclose(output);
return(0);
}
/***********************************************************************
initialize(argv)
Returns: void
Purpose: Open all required files and initialize global variables
**/
static void initialize(argv)
char *argv[ ]; /* in: holds the three parameters input at the command line */
{
if (( input = fopen(argv[1],"r")) = NULL ) {
(void) printf("couldn't open file %s\n",argv[1]);
exit(1); /* input direct fil */}
if (( input1 = fopen(argv[2],"r")) == NULL ) {
(void) printf("couldn't open file %s\n",argv[2]);
exit(1); /* input inverted file */}
if (( output = fopen(argv[3],"w")) == NULL) {
(void) printf("couldn't open file %s for output\n",argv[3]);
exit(1); /* output file */
}
/* set initial values of global variables */
startinv = NULL; lastinv = NULL; lastdoc = NULL;
startdir = NULL; lastdir = NULL; lastterm = NULL;
start_centroid = NULL;
currentdoc = 0; currentterm[0] = '\0'; Number_of_docs = 0;
}
/***********************************************************************
read_directfile( )
Returns: void
Purpose: Read in the direct file entries from the 1st input file
**/
static void read_directfile( )
{
int docid; /* holds the current document number */
char temp[MAXWORD]; /* holds the current term */
float weight; /* holds the current term weight */
struct termlist *p; /* structure to store the term-weight pair */
(void) fscanf(input,"%d%s%f",&docid,temp,&weight); /* read the next line */
while (docid > 0)
/* while a legitimate document number was found */
{
if (docid == currentdoc) {
/* if this document number has previously been entered in direct file */
/* then only need to attach a termlist record to the same entry */
p = get_mem_termlist( ); /* get memory for a termlist record */
(void) strncpy(p->term,temp,MAXWORD); /* copy the new word over */
p->weight = weight; /* assign the new weight over */
p->nextterm = NULL;
if (lastterm) lastterm->nextterm = p; /* connect p to the termlist
chain for this document */
lastterm = p; /* set this global variable */
}
else { /* else docid represents a new document */
Number_of_docs = Number_of_docs +1; /* increment global variable */
add_direct(docid,temp,weight); /* starts a brand new entry in */
/* the direct file */
}
docid = 0;
(void) fscanf(input,"%d%s%f",&docid,temp,&weight);
}
}
/***********************************************************************
add_direct(docid,temp,weight)
Returns: void
Purpose: Start a new entry in the direct file. It is called in
the read_directfile function when a new document number is
read from the input file.
**/
static void add_direct(docid,temp,weight)
int docid; /* in: new document number */
char temp[MAXWORD]; /* in: index term */
float weight; /* in: index term weight */
{
struct direct *p; /* structure p will be attached to direct file */
p = get_mem_direct( ); /* get memory for p */
p->docnum = docid; /* assign the document number */
p->terms = get_mem_termlist( ); /* get memory for termlist structure */
(void) strncpy(p->terms->term,temp,MAXWORD); /* assign index term to it */
p->terms->weight = weight; /* assign term weight to it */
p->terms->nextterm = NULL; /* current end of termlist */
p->next = NULL; /* current end of direct file */
if (startdir == NULL) startdir = p; /* if this is the very first document
then global variable pointing to start of direct file should be updated */
if (lastdir) lastdir->next = p; /* update pointer to last direct file rec. */
lastdir = p;
lastterm = p->terms; /* update pointer to last term */
currentdoc = docid; /* update the global variable currentdoc to the */
/* document number just added */
}
/***********************************************************************
pr_direct()
Returns: void
Purpose: Print the direct file. It prints sequences of document# term weight.
**/
static void pr_direct ( )
{
struct direct *dir_addr; /* tracks address of current direct file record */
struct termlist *term_addr; /* tracks address of current termlist record */
dir_addr = startdir; /* start with beginning of direct file */
while (dir_addr) { /* check for legitimate direct file record */
(void) fprintf(output,"DOCUMENT NUMBER: %d \n",dir_addr->docnum);
(void) fprintf(output," TERM TERM WEIGHT\n");
term_addr = dir_addr->terms; /* get addr. of first term */
while (term_addr) { /* loop through all the terms */
(void) fprintf(output," %-30s ",term_addr->term);
(void) fprintf(output,"%-10.3f\n",term_addr->weight);
term_addr = term_addr->nextterm; /* go to next term for the doc. */
}
(void) fprintf(output,"\n");
dir_addr = dir_addr->next; /* go to next direct file record */
}
}
/***********************************************************************
read_invfile( )
Returns: void
Purpose: Read in the inverted file entries from 2nd input file
**/
static void read_invfile( )
{
int docid; /* holds current document number */
char temp[MAXWORD]; /* holds current term */
float weight; /* holds current term weight */
struct doclist *p; /* structure to store doc#-weight pair */
(void) fscanf(input1,"%s%d%f",temp,&docid,&weight); /* read next line */
while (strlen(temp) > 0)
/* while its found a legitimate term */
{
if (!strncmp(currentterm,temp,MAXWORD)) {
/* if this term has previously been entered in inverted file */
/* then only need to attach a doclist record to same term entry */
p = get_mem_doclist( ); /* get memory for doclist record */
p->doc = docid; /* assign document number */
p->weight = weight; /* assign weight */
p->nextdoc = NULL;
if (lastdoc) lastdoc->nextdoc = p; /* connect p to the doclist
chain for this term */
lastdoc = p; /* set this global variable */
}
else add_invert(docid,temp,weight);
/* else term is a brand new term & need to make a new inverted file entry */
temp[0] = '\0';
(void) fscanf(input1,"%s%d%f",temp,&docid,&weight); /* read next line */
}
}
/***********************************************************************
add_invert(docid,temp,weight)
Returns: void
Purpose: Start a new entry in the inverted file. It is called in the
read_invfile function when a new term is read from the
input file
**/
static void add_invert(docid,temp,weight)
int docid; /* in: document number */
char temp[MAXWORD]; /* in: new index term */
float weight; /* in: index term weight */
{
struct invert *p; /* structure p will be attached to inverted file */
p = get_mem_invert( ); /* get memory for p */
(void) strncpy(p->term,temp,MAXWORD); /* copy over the term */
p->doc = get_mem_doclist( ); /* start a doclist structure */
p->doc->doc = docid; /* assign document number */
p->doc->weight = weight; /* assign term weight */
p->doc->nextdoc = NULL;
p->next = NULL;
if (startinv == NULL) startinv = p;
/* if this is the first entry in inverted file, then update global var. */
if (lastinv) lastinv->next = p; /* update ptr. to last inverted file record */
lastinv = p;
lastdoc = p->doc; /* update ptr. to last document */
(void) strncpy(currentterm,temp,MAXWORD); /* update global var. currentterm to the */
/* new term just entered */
}
/***********************************************************************
pr_invert( )
Returns: void
Purpose: Print the inverted file. It prints sequences of
term document# weight.
**/
static void pr_invert( )
{
struct invert *inv_addr; /* tracks address of current inverted file record */
struct doclist *doc_addr; /* tracks address of current doclist record */
inv_addr = startinv; /* start with beginning of inverted file */
while (inv_addr) { /* check for legitimate inverted file record */
(void) fprintf(output,"TERM: %s\n",inv_addr-term);
(void) fprintf(output," DOCUMENT NUMBER TERM WEIGHT\n");
doc_addr = inv_addr-doc; /* get addr. of first document */
while (doc_addr) { /*loop through all the associated doc.#s and weights*/
(void) fprintf(output," %-30d ",doc_addr-doc);
(void) fprintf(output,"%-10.5f\n",doc_addr-weight);
doc_addr = doc_addr->nextdoc; /* get addr. of next document */
}
(void) fprintf(output,"\n");
inv_addr = inv_addr->next; /* go to next inverted file record */
}
}
/***********************************************************************
centroid( )
Returns: void
Purpose: Compute and return the centroid for the documents of
the database.
Notes: Centroid is stored as a direct file record. Document
number given to it is Number_of_docs +1. The centroid is
computed by determining for each index term in the inverted
file, its average weight in the database. These average
weights are then stored in the direct file entry for the
centroid. (Note that these average weights are not used
anywhere, but could be used in computing DV for terms).
**/
static void centroid( )
{
struct invert *inv_addr; /* tracks address of current inverted file record */
struct doclist *doc_addr; /* tracks address of current doclist record*/
float total_weight, /* tracks total weight for each term */
av_term_weight; /* holds average term weight for each term */
struct termlist *q, /* structure used to create centroid entry in
the direct file */
*lastterm; /* tracks the last term in the centroid */
start_centroid = get_mem_direct( ); /* centroid stored as direct file record */
start_centroid->docnum = Number_of_docs +1; /* assign its pseudo doc.# */
start_centroid->next = NULL; /* end of direct file chain */
lastterm = NULL;
inv_addr = startinv; /* begin at top of inverted file */
while (inv_addr) { /* while there is a legitimate inv. file record...*/
doc_addr = inv_addr->doc; /* get address of first document */
total_weight = 0.0; /* start with a 0 total weight for this term */
while (doc_addr) { /* if this is a legitimate doc. addr. */
total_weight = total_weight + doc_addr->weight;
/* update total weight for term */
doc_addr = doc_addr->nextdoc; /* loop through all docs. for the term */
}
av_term_weight = total_weight/Number_of_docs;
/* calculating average term wt. */
q = get_mem_termlist( );
(void) strncpy(q->term,inv_addr->term,MAXWORD);
q->weight = av_term_weight;
q->nextterm = NULL;
if (lastterm == NULL) start_centroid->terms = q;
/* if this is the first term entry for the centroid */
else lastterm->nextterm = q;
/* else connect this term to the centroid's termlist chain */
lastterm = q;
inv_addr = inv_addr->next; /* go on to the next inverted file entry */
}
}
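/* Worked example (hypothetical data): with Number_of_docs = 3 and a
   term whose doclist weights are 1.0 and 2.0 (the term is absent from
   the third document), total_weight = 3.0 and the centroid entry for
   that term gets the average weight 3.0/3 = 1.0. */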
/***********************************************************************
pr_centroid( )
Returns: void
Purpose: Print the centroid from the direct file
**/
static void pr_centroid( )
{
struct termlist *term_addr; /* tracks address of current termlist record */
/* note the centroid is given a document number = Number_of_docs + 1 */
/* therefore it may be treated as a special kind of document vector */
if (start_centroid) {
/* if there is a centroid */
(void) fprintf (output,"----------------------------------------\n");
(void) fprintf(output,"TERM WEIGHT \n");
(void) fprintf(output,"-----------------------------------------\n");
term_addr = start_centroid->terms; /* get first term address */
while (term_addr) { /* printing out term and weight pairs */
(void) fprintf(output,"%-30s ",term_addr->term);
(void) fprintf(output,"%-10.5f\n",term_addr->weight);
term_addr = term_addr->nextterm; /* loop through all terms */
}
(void) fprintf(output,"\n");
}
}
/***********************************************************************
get_Poisson_dist( )
Returns: void
Purpose: Get the Poisson distribution data for any term
Notes: This function has two parts:
PART I: Determine the actual within doc. freq. distribution
PART II: Determine the distribution anticipated under the
single Poisson model.
It is assumed that the within document frequency of a term
is stored as the term weight in the inverted file.
**/
static void get_Poisson_dist( )
{
struct invert *inv_addr; /* tracks address of current inverted file record */
struct doclist *doc_ad; /* tracks address of current doclist record */
float dist[MAXWDF][2]; /* store for each term */
/* column 1 = within document frequency (wdf) */
/* column 2 = document frequency */
int i, /* counter to add information to the dist array */
j, /* counter to loop through dist array */
found, /* flag used to match wdf in dist array */
numdocs, /* counter to track # of docs. with the same wdf */
docs_with_term; /* tracks the number of documents having the term */
float first, /* these five local variables are */
second, /* used to determine expected distribution */
result,
exponent,
lambda; /* single Poisson parameter */
inv_addr = startinv; /* start at the beginning of the inverted file */
/* PART I: For each term determine the number of documents in the
collection that have a particular wdf */
while (inv_addr) { /* check for legitimate inv. file record */
docs_with_term = 0;
doc_ad = inv_addr->doc; /* get the first doc. address */
i = 0; /* used to check if this is the very first entry in dist */
while (doc_ad) {
if (i == 0) { /* if first entry in dist */
dist[i][0] = doc_ad->weight; dist[i][1] = 1;
/* assign wdf and doc. frequency = 1 to first row in dist */
i++;
docs_with_term++;
}
else { /* dist already has other entries, hence look for
any previous entries for the same wdf value */
found = 0;
for (j=0;j < i;j++) { /* loop through dist */
if (dist[j][0] == doc_ad->weight) { /* if found the same wdf */
dist[j][1] = dist[j][1] + 1; /* add 1 to the doc. frequency */
found = 1;
docs_with_term++;
}
} /* ending for */
if (found == 0) { /* if not found the same wdf in dist */
/* start new row in dist */
dist[i][0] = doc_ad->weight; dist[i][1] = 1;
i++;
docs_with_term++;
}
} /* ending else */
doc_ad = doc_ad->nextdoc;/* loop through other documents for same term */
} /* ending while */
/* now print out actual distribution information for this term */
(void) fprintf(output,"\nTerm = %s\n",inv_addr->term);
(void) fprintf(output,"\nActual Distribution: ");
(void) fprintf(output," WDF #docs\n");
(void) fprintf(output," 0 %d\n",Number_
of_docs->docs_with_term);
for (j=0;j < i;j++)
(void) fprintf(output," %-16. 0f %-6. 0f\n",dist[j][0]
,dist[j][1]);
/* PART II: */
/* computing lambda - the only single Poisson parameter */
(void) fprintf(output,"\nExpected Distribution: ");
(void) fprintf(output," WDF #docs\n");
/* call the function total_wdf to compute the total frequency of the term */
lambda = (total_wdf(inv_addr->term))/Number_of_docs;
first = exp(-lambda);
numdocs = -1; j = -1;
/* computing document frequency for each within document frequency value */
while (numdocs != 0) {
j = j + 1;
exponent = j; /* type conversion necessary for pow function */
if (j == 0) result = first * Number_of_docs;
else {
second = pow(lambda,exponent);
result = ( ( (first * second)/factorial(j)) * Number_of_docs);
}
if ( (result - floor(result) ) < 0.5) numdocs = floor(result);
else numdocs = ceil(result);
(void) fprintf (output," %-16d %-6d\n",j,numdocs);
}
inv_addr = inv_addr->next; /* continue with the next inverted file term */
} /* end while */
}
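/* Worked example (hypothetical data): with Number_of_docs = 10 and a
   term whose total_wdf is 5.0, lambda = 0.5. The expected number of
   documents with wdf = j is 10 * exp(-0.5) * 0.5^j / j!, i.e. roughly
   6.07, 3.03, 0.76 and 0.13 for j = 0, 1, 2, 3, which the rounding rule
   above reports as 6, 3, 1 and finally 0, ending the loop. */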
/***********************************************************************
factorial(n)
Returns: float
Purpose: Return the factorial of a number. Used in get_poisson_dist
**/
static float factorial(n)
int n; /* in: compute factorial for this parameter */
{
float answer; /* holds the result */
if (n <= 1) return(1.0); /* base case; also guards against n == 0 */
answer = factorial(n-1)*n;
return(answer);
}
/***********************************************************************
total_wdf(term)
Returns: float
Purpose: Compute total frequency in the database for a
specified term using the inverted file.
Notes: It is assumed that the appropriate term frequency is stored
as the term weight in the inverted file. This routine can
also be used to filter out the low and high frequency terms.
The resulting mid frequency terms can be used as input to
the program which generates hierarchies.
**/
static float total_wdf(term)
char term[MAXWORD]; /* in: term for which total frequency is to be found */
{
struct invert *inv_addr; /* tracks current inverted file record */
struct doclist *doc_addr; /* tracks current doclist record */
float total; /* tracks the total frequency */
total = 0.0; /* initial value */
/* function find_term will find out where the term is in the inverted file */
inv_addr = find_term(term);
if (inv_addr) { /* if this term was found in the inverted file */
doc_addr = inv_addr->doc; /* get first associated document address */
while (doc_addr) { /* update the total frequency */
total = total + doc_addr->weight;
doc_addr = doc_addr->nextdoc; /* loop through other associated docs. */
}
}
return(total);
}
/***********************************************************************
Partition_terms( )
Returns: void
Purpose: Assign each term in the inverted file to one class:
HIGH, MEDIUM or LOW frequency depending upon its total
frequency in the collection. This function utilizes
two parameters defined at the top of the program:
LOW_THRESHOLD and HIGH_THRESHOLD, which should be set
by the user.
**/
static void Partition_terms( )
{
struct invert *inv_addr; /* tracks address of current inverted file record */
float total; /* holds total frequency of each term */
inv_addr = startinv; /* start at the beginning of the inverted file */
(void) fprintf(output,"\nTerm - Total Frequency - Frequency Class\n\n");
while (inv_addr) { /* if a legitimate address */
/* compute total frequency for term in collection */
total = total_wdf(inv_addr->term);
(void) fprintf(output,"\n%s - %f -",inv_addr->term, total);
if (total < LOW_THRESHOLD) (void) fprintf(output," LOW\n");
else if (total > HIGH_THRESHOLD) (void) fprintf(output," HIGH\n");
else (void) fprintf(output," MEDIUM\n");
inv_addr = inv_addr->next; /* continue with next inverted file entry */
}
}
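/* Example (assuming the thresholds defined above, LOW_THRESHOLD = 2.0
   and HIGH_THRESHOLD = 4.0): a term with total frequency 1.5 is
   reported as LOW, one with 3.0 as MEDIUM, and one with 5.0 as HIGH. */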
/***********************************************************************
cohesion(term1, term2)
Returns: float
Purpose: Compute the cohesion between two terms
**/
static float cohesion(term1, term2)
char term1 [MAXWORD], /* in: the two terms which are being */
term2 [MAXWORD]; /* in: compared to determine cohesion */
{
float l1, /* holds # of documents associated with term1 */
l2, /* holds # of documents associated with term2 */
common; /* holds # of documents in common */
get_term_data(term1, term2, &l1, &l2, &common);
return(common/(sqrt(l1 * l2) ) );
}
/***********************************************************************
dv_all( )
Returns: void
Purpose: Compute Discrimination Value (DV) for all terms in the
database
Notes: Similarity between two documents as calculated here is a
function of the number of terms in common and the number
of terms in each. Term weights are not involved
**/
static void dv_all( )
{
struct invert *inv_addr; /* tracks address of current inv. file record */
float DV, /* holds computed DV */
baseline; /* holds baseline similarity */
/* first compute baseline similarity */
baseline = av_doc_similarity("-"); /* the dummy term '-' is used for this */
(void) fprintf(output,"--------------------------------------------\n");
(void) fprintf(output, "TERM DV\n");
(void) fprintf(output, "-------------------------------------------\n");
inv_addr = startinv; /* begin at top of inverted file */
while (inv_addr) { /* if legitimate inverted file record */
DV = av_doc_similarity(inv_addr->term) - baseline;
(void) fprintf(output," %-30s %-10.5f \n", inv_addr->term, DV);
inv_addr = inv_addr->next; /* go to next inverted file record */
}
}
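/* Interpretation of the routine above: baseline is the average
   document-centroid similarity with only the dummy term "-" excluded,
   i.e., effectively with all terms present. A term with DV > 0 raises
   the average similarity when it is ignored, so its presence spreads
   the documents apart; such terms are the good discriminators. */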
/***********************************************************************
av_doc_similarity(term)
Returns: float
Purpose: Compute average similarity between each document
and the centroid of the database. The word specified in
term is ignored during computations.
**/
static float av_doc_similarity (term)
char term[MAXWORD]; /* in: term is ignored during computations */
{
struct direct *dir_addr; /* tracks current direct file record */
float dl1, /* holds # of terms in document */
dl2, /* holds # of terms in centroid */
common, /* holds # of terms in common between them */
total_sim; /* holds total similarity between them */
total_sim = 0.0;
dir_addr = startdir; /* begin with first direct file record */
while (dir_addr) {
/* get_doc_data returns #of terms in each document and #of terms in common */
get_doc_data(dir_addr, start_centroid, &dl1, &dl2, &common, term);
total_sim = total_sim + cosine (dl1, dl2, common);
dir_addr = dir_addr->next; /* go to next direct file record */
}
return (total_sim/Number_of_docs);
}
/***********************************************************************
dice (term1, term2)
Returns: float
Purpose: Returns Dice's coefficient of similarity between
any two documents
**/
static float dice (term1, term2)
char term1[MAXWORD], /* in: the two terms that are being compared */
term2[MAXWORD]; /* in: */
{
float l1, l2, common;
get_term_data(term1, term2, &l1, &l2, &common);
if (l1 == 0 || l2 == 0) return(0.0);
return(common/(l1 + l2));
}
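/* Worked example (hypothetical data): for terms indexing document sets
   {1, 2, 3} and {2, 3, 4}, l1 = 3, l2 = 3 and common = 2, so the
   routine returns 2/(3 + 3) = 0.333 (approx.). Note that as written it
   omits the factor of 2 found in the usual statement of Dice's
   coefficient, which would give 2 * 2/(3 + 3) = 0.667. */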
/***********************************************************************
cosine (l1, l2, common)
Returns: float
Purpose: Returns cosine similarity between two documents
**/
static float cosine (l1, l2, common)
float l1, /* in: # of terms associated with document 1*/
l2, /* in: # of terms associated with document 2 */
common; /* in: # of terms in common between them */
{
float temp;
if (l1 == 0 || l2 == 0) return (0.0);
temp = sqrt (l1 * l2);
return(common/temp);
}
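/* Worked example (hypothetical data): for documents with l1 = 4 and
   l2 = 9 terms, sharing common = 3 terms, cosine returns
   3/sqrt(4 * 9) = 3/6 = 0.5. */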
/***********************************************************************
get_doc_data(p, q, l1, l2, common, index)
Returns: void
Purpose: Given two document numbers, it determines the number
of index terms in each and in common. It will exclude the index
term (specified as the last parameter) from consideration. Used
in av_doc_similarity for DV calculations.
**/
static void get_doc_data (p,q, l1,l2, common, index)
char index[MAXWORD]; /* in: term to be excluded from computations */
struct direct *p, *q; /* in: addresses of two documents numbers */
float *l1, *l2; /* out: number of terms in each document */
float *common; /* out: number of terms in common */
{
struct termlist *term_addr1, /* holds address of first docs. termlist*/
*term_addr2; /* holds address of second docs. termlist */
int count1, /* number of terms in first doc. */
count2, /* number of terms in second doc. */
com; /* number of terms in common */
term_addr1 = p->terms;
count1 = 0; count2 = 0; com = 0;
/* first find out number of terms in document 1 & # of common terms */
while (term_addr1) {
if (strncmp(term_addr1->term, index, MAXWORD)) count1 = count1 + 1;
/* if its not the term to exclude */
term_addr2 = q->terms;
while (term_addr2) {
if (!strncmp(term_addr1->term, term_addr2->term, MAXWORD)) {
/* if the two terms are the same */
if (strncmp(term_addr1->term, index, MAXWORD)) com = com + 1;
/* if they do not match the term to exclude */
break;
}
term_addr2 = term_addr2->nextterm;
}
term_addr1 = term_addr1->nextterm;
}
/* now find out number of terms in document 2 */
term_addr2 = q ->terms;
while (term_addr2) {
if (strncmp(term_addr2->term, index, MAXWORD)) count2 = count2 + 1;
term_addr2 = term_addr2->nextterm;
}
*l1 = count1; *l2 = count2; *common = com;
}
/***********************************************************************
get_term_data(term1, term2, l1, l2, common)
Returns: void
Purpose: Get info regarding number of documents in common
between any two terms.
**/
static void get_term_data(term1, term2, l1, l2, common)
char term1[MAXWORD], /* in: term 1 to be compared with */
term2[MAXWORD]; /* in: term 2 */
float *l1, /* out: # of documents associated with 1st term */
*l2, /* out: # of documents associated with 2nd term */
*common; /* out: # of documents in common between them */
{
struct invert *p, /* holds address of first term in inverted file */
*q; /* holds address of second term in inv. file */
struct doclist *doc_ad1, /* tracks doclist for first term */
*doc_ad2; /* tracks doclist for second term */
int count1, /* holds # of documents associated with 1st term */
count2, /* holds # of documents associated with 2nd term */
com; /* holds # of documents common between them */
/* find addresses of both terms in the inverted file */
p = find_term(term1); q = find_term(term2); /* find addresses of terms */
doc_ad1 = p->doc; /* obtain 1st term's doclist address */
count1 = 0; count2 = 0; com = 0;
/* first get # of documents indexed by term 1 & # of docs. in common */
while (doc_ad1) {
count1 = count1 +1;
doc_ad2 = q->doc;
while (doc_ad2) {
if (doc_ad1->doc == doc_ad2->doc) {
/* if the document numbers are the same */
com = com +1;
break;
}
doc_ad2 = doc_ad2->nextdoc;
}
doc_ad1 = doc_ad1->nextdoc;
}
/* now get # of documents indexed by term 2 */
doc_ad2 = q->doc;
while (doc_ad2) {
count2 = count2 + 1;
doc_ad2 = doc_ad2->nextdoc;
}
*l1 = count1; *l2 = count2; *common = com;
}
/***********************************************************************
*find_term(term)
Returns: address of a struct invert record
Purpose: search for a specified term in the inverted file &
return address of the record
**/
struct invert *find_term(term)
char term[MAXWORD]; /* in: term to be located in inverted file */
{
struct invert *inv_addr; /* tracks addr. of current rec. in inv. file*/
inv_addr = startinv; /* begin at top of inv. file */
while(inv_addr) {
if (!strcmp(term,inv_addr->term)) {return(inv_addr);}
inv_addr = inv_addr->next;
}
(void) fprintf(output,"Findterm routine: Term %s not found\n",term);
return(NULL);
}
/***********************************************************************
*get_mem_direct( )
Returns: address of a struct direct record
Purpose: dynamically obtain enough memory to store 1 direct record
**/
static struct direct *get_mem_direct( )
{
struct direct *record;
record = (struct direct *)malloc(sizeof(directfile));
if (!record) {
(void) fprintf(output,"\nout of memory\n");
exit(0);
}
return(record);
}
/***********************************************************************
*get_mem_termlist( )
Returns: address of a struct termlist record
Purpose: dynamically obtain enough memory to store one termlist record
**/
static struct termlist *get_mem_termlist( )
{
struct termlist *record;
record = (struct termlist *)malloc(sizeof (termlistfile));
if (!record) {
(void) fprintf(output,"\nout of memory\n");
exit(0);
}
return(record);
}
/***********************************************************************
*get_mem_invert( )
Returns: address of a struct invert record
Purpose: dynamically obtain enough memory to store one inverted
file record
**/
static struct invert *get_mem_invert( )
{
struct invert *record;
record = (struct invert *)malloc(sizeof(invfile));
if (!record) {
(void) fprintf(output,"\nout of memory\n");
exit(0);
}
return(record);
}
/***********************************************************************
*get_mem_doclist( )
Returns: address of a struct doclist record
Purpose: dynamically obtain enough memory to store one doclist
record
**/
static struct doclist *get_mem_doclist( )
{
struct doclist *record;
record = (struct doclist *)malloc(sizeof(doclistfile));
if (!record) {
(void) (void) fprintf(output,"\nout of memory\n");
exit(0);
}
return(record);
}
/*
PROGRAM NAME: merge.c
PURPOSE: This program is used to merge two separate hierarchies.
The program first reads each inverted file. It then reads in the
corresponding link file which gives parent-child links to build
the hierarchy. It can then perform two different types of mergers:
1) Simple merge in which a point of connection between the two
thesauri is made wherever they have terms in common.
2) Complex merge in which any two terms are connected if they
have sufficiently (above a specified threshold) similar sets
of parent and child terms.
INPUT FILES REQUIRED:
1) inverted file for 1st hierarchy
2) links file for 1st hierarchy
3) inverted file for 2nd hierarchy
4) links file for 2nd hierarchy
An inverted file consists of sequences of
term document number weight
(multiple entries for any term should be grouped together)
A link file consists of sequences of
parent term child term
PARAMETERS TO BE SET BY USER:
1) MAXWORD: which specifies the maximum size of a term
2) THRESHOLD: which specifies the minimum similarity level
for use in complex merge.
COMMAND LINE:
merge inverted_file_1 link_file_1 inverted_file_2 link_file_2 output_file
***********************************************************************/
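/* Example (hypothetical data): if the term "algorithm" appears in both
   hierarchies, simple_merge records a point of connection between the
   two invert records (presumably via their connect fields, declared
   below). complex_merge instead connects terms whose sets of parent
   and child terms are similar above THRESHOLD, as described above. */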
#include <stdio.h>
#include <string.h>
#include <math.h>
#define MAXWORD 20 /* maximum size of a term */
#define THRESHOLD 0.6 /* similarity threshold for complex_merge */
struct doclist { /* sequences of document # and weight */
int doc; /* document number */
float weight; /* term weight in document */
struct doclist *nextdoc; /* ptr. to next doclist record */
} doclistfile;
struct parentlist { /* sequences of parent terms */
char term[MAXWORD]; /* parent term */
struct invert *parent; /* ptr. to parent term in inverted file */
struct parentlist *nextparent;/* ptr. to next parentlist record */
} parentfile;
struct connections { /* holds information about connected terms */
struct invert *termadd; /* address of connected term in inverted file */
struct connections *next_connection; /* ptr. to next connections record */
} connectlist;
struct childlist { /* sequences of child terms */
char term[MAXWORD]; /* child term */
struct invert *child; /* ptr. to child term in inverted file*/
struct childlist *nextchild; /* ptr. to next childlist record */
} childfile;
struct invert { /* inverted file */
char term[MAXWORD]; /* term */
struct doclist *doc; /* sequences of document # and weight */
struct parentlist *parents; /* pointer to list of parent terms */
struct childlist *children; /* pointer to list of children terms */
struct connections *connect; /* pointer to connection in other hierarchy */
struct invert *nextterm; /* ptr. to next invert record */
} invfile;
static struct invert *startinv; /* ptr. to first record in inverted file */
static struct invert *lastinv; /* ptr. to last record in inverted file */
static struct doclist *lastdoc; /* ptr. to last document in doclist */
static struct invert *start_inv1; /* ptr. to the start of 1st inverted file */
static struct invert *start_inv2; /* ptr. to the start of 2nd inverted file */
static FILE *input1; /* first inverted file */
static FILE *input2; /* first link file */
static FILE *input3; /* second inverted file */
static FILE *input4; /* second link file */
static FILE *output; /* holds any outputs */
static char currentterm[MAXWORD]; /* tracks current term in inverted file */
static struct invert *get_mem_invert( ); /* these five get_mem functions */
static struct doclist *get_mem_doclist( ); /* obtain memory dynamically for */
static struct parentlist *get_mem_parentlist( ); /* storing different types of */
static struct childlist *get_mem_childlist( ); /* records. */
static struct connections *get_mem_connections( );
static struct invert *find_term( ); /* searches for term in inverted file and */
/* returns address of the term */
static int compare( );
static
void initialize( ), /* initialize global variables */
open_files( ), /* open files */
read_invfile( ), /* read in the inverted file */
add_invert( ), /* called within read_invfile( ) */
read_links( ), /* read in the links information */
add_link( ), /* called within read_links( ) */
pr_invert( ), /* print the inverted file */
simple_merge( ), /* simple merge between both hierarchies */
complex_merge( ); /* complex merge between hierarchies */
int main(argc,argv)
int argc;
char *argv[ ];
{
char ch;
if (argc!=6)
{
(void) printf( "There is an error in the command line\n");
(void) printf("Correct usage is:\n");
(void) printf("merge inverted_file_1 link_file_1 inverted_file_2 link_file_2 output_file\n");
exit(1);
}
start_inv1 = NULL; start_inv2 = NULL; /* initialize start of both inverted files */
initialize( );
open_files(argv);
(void) fprintf(output, "\nREADING FIRST INVERTED FILE\n");
read_invfile(input1);
start_inv1 = startinv;
(void) fprintf(output, "\nREADING FIRST LINK FILE\n");
read_links(input2, start_inv1);
(void) fprintf(output, "\nPRINTING FIRST INVERTED FILE\n\n");
pr_invert(start_inv1);
/* re-initialize */
initialize( );
(void) fprintf(output, "\nREADING SECOND INVERTED FILE\n");
read_invfile(input3);
start_inv2 = startinv;
(void) fprintf(output, "\nREADING SECOND LINK FILE\n");
read_links(input4, start_inv2);
(void) fprintf(output, "\nPRINTING SECOND INVERTED FILE\n\n");
pr_invert(start_inv2);
(void) printf("Make a selection\n");
(void) printf("To use the simple_merge algorithm, enter 1\n");
(void) printf("To use the complex_merge algorithm, enter 2\n");
(void) printf("\nEnter selection: ");
ch = getchar( );
switch(ch)
{
case '1':
(void) fprintf(output,"\nPERFORMING A SIMPLE MERGE OF THE TWO INVERTED FILES\n");
simple_merge(start_inv1,start_inv2);
(void) fprintf(output,"\nPRINTING FIRST INVERTED FILE AFTER SIMPLE MERGE\n\n");
pr_invert(start_inv1);
(void) fprintf(output,"\nPRINTING SECOND INVERTED FILE AFTER SIMPLE MERGE\n\n");
pr_invert(start_inv2);
break;
case '2':
(void) fprintf(output,"\nPERFORMING A COMPLEX MERGE OF THE TWO INVERTED FILES\n");
complex_merge(start_inv1,start_inv2);
(void) fprintf(output,"\nPRINTING FIRST INVERTED FILE AFTER COMPLEX MERGE\n\n");
pr_invert(start_inv1);
(void) fprintf(output,"\nPRINTING SECOND INVERTED FILE AFTER COMPLEX MERGE\n\n");
pr_invert(start_inv2);
}
(void) fclose(input1);(void) fclose(input2);
(void) fclose(input3);(void) fclose(input4);(void) fclose(output);
return(0);
}
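/***********************************************************************
Usage sketch (the file names here are hypothetical). The program is
invoked as:
merge inv1.txt links1.txt inv2.txt links2.txt out.txt
As the fscanf( ) calls below imply, each line of an inverted file
holds a term, a document number, and a term weight, e.g.
computers 1 0.50
computers 4 0.25
software 1 0.75
and each line of a link file holds a parent term followed by one of
its child terms, e.g.
computers software
**/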
/***********************************************************************
open_files(argv)
Returns: void
Purpose: Open all input & output files
**/
static void open_files(argv)
char *argv[ ];
{
if (( input1 = fopen(argv[1],"r")) == NULL ) {
(void) printf("couldn't open file %s\n",argv[1]);
exit(1); /* inverted file for first thesaurus hierarchy */
}
if (( input2 = fopen(argv[2],"r")) == NULL ) {
(void) printf("couldn't open file %s\n",argv[2]);
exit(1); /* link file for first thesaurus hierarchy */
}
if (( input3 = fopen(argv[3],"r")) == NULL ) {
(void) printf("couldn't open file %s\n",argv[3]);
exit(1); /* inverted file for second thesaurus hierarchy */
}
if (( input4 = fopen(argv[4],"r")) == NULL ) {
(void) printf("couldn't open file %s\n",argv[4]);
exit(1); /* link file for second thesaurus hierarchy */
}
if (( output = fopen(argv[5],"w")) == NULL) {
(void) printf("couldn't open file %s for output \n",argv[5]);
exit(1); /* output file */
}
}
/***********************************************************************
initialize( )
Returns: void
Purpose: Initialize global variables
**/
static void initialize( )
{
startinv = NULL; /* start of inverted file */
lastinv = NULL; /* end of inverted file */
lastdoc = NULL; /* last document considered */
currentterm[0] = '\0'; /* current term being considered */
}
/***********************************************************************
read_invfile(input)
Returns: void
Purpose: Read the inverted file from a disk file
**/
static void read_invfile(input)
FILE *input;
{
int docid; /* holds current document number */
char temp[MAXWORD]; /* holds current term */
float weight; /* holds current term weight */
struct doclist *p; /* structure to hold document number-weight pair */
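/* NOTE: the loop below assumes that all lines for a given term are */
/* adjacent, i.e., that the inverted file is sorted by term; a term */
/* appearing in two separate runs would get two separate entries */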
temp[0] = '\0'; /* ensure temp is empty in case the file has no lines */
(void) fscanf(input, "%s%d%f", temp, &docid, &weight); /* read next line */
while (strlen(temp) > 0) /* while a legitimate line */
{
if (!strncmp(currentterm,temp,MAXWORD)) {
/* if temp is the same as current term then simply add next document-weight info */
p = get_mem_doclist( );
p->doc = docid; /* assign doc. number */
p->weight = weight; /* assign doc. weight */
p->nextdoc = NULL;
if (lastdoc) lastdoc->nextdoc = p; /* connect p to doclist chain for this term */
lastdoc = p; /* set this global variable */
}
else add_invert(docid,temp,weight); /* temp differs from currentterm, so */
/* start a new entry in the inverted file */
temp[0] = '\0'; /* reset temp before the next read */
(void) fscanf(input, "%s%d%f", temp, &docid, &weight); /* read next input line */
}
}
/***********************************************************************
add_invert(docid,temp,weight)
Returns: void
Purpose: Called in read_invfile when a new term is being read from the file.
Starts a new entry in the inverted file.
**/
static void add_invert(docid, temp, weight)
int docid; /* in: document number */
char temp[MAXWORD]; /* in: new index term */
float weight; /* in: index term weight */
{
struct invert *p; /* structure p will be attached to inv. file */
p = get_mem_invert( ); /* get memory for p */
(void) strncpy(p->term,temp,MAXWORD); /* copy over the term */
p->parents = NULL; /* initially term has no parents */
p->children = NULL; /* also no children terms */
p->doc = get_mem_doclist( ); /* get memory for a doclist structure */
p->doc->doc = docid; /* assign the document number */
p->doc->weight = weight; /* assign term weight */
p->doc->nextdoc = NULL;
p->connect = NULL; /* initially this term not connected to any */
/* other in any other hierarchy */
p->nextterm = NULL;
if (startinv == NULL) startinv = p; /* if this is the 1st term in inverted file */
if (lastinv) lastinv->nextterm = p; /* always update lastinv pointer */
lastinv = p;
lastdoc = p->doc;
(void) strncpy(currentterm,temp,MAXWORD); /* update the value of currentterm to the */
/* new term that has just been read */
}
/***********************************************************************
read_links(input,startinv)
Returns: void
Purpose: Add the link information to the inverted file
**/
static void read_links(input,startinv)
FILE *input; /* in: input file */
struct invert *startinv; /* in: start of this inverted file */
{
char parent[MAXWORD], /* holds parent term */
child[MAXWORD]; /* holds child term */
(void) fscanf(input, "%s%s", parent,child); /* read first input line */
while (strlen(parent) > 0 && strlen(child) > 0) /* while non-trivial input */
{
if (!find_term(parent,startinv) || !find_term(child,startinv))
{
(void) printf("Please check your input files\n");
(void) printf("Term %s or term %s is not in inverted file\n",parent,child);
exit(1);
}
add_link(parent,child,startinv); /* this function makes links */
child[0] = '\0'; parent[0] = '\0'; /* throw out old parent & child info */
(void) fscanf(input, "%s%s",parent,child); /* read next line */
}
}
/***********************************************************************
add_link(parent,child,startinv)
Returns: void
Purpose: Function is used within read_links
**/
static void add_link(parent, child, startinv)
char parent[MAXWORD], /* in: specify parent term */
child[MAXWORD]; /* in: specify child term */
struct invert *startinv; /* in: specify start of this inv. file */
{
struct invert *p, *q; /* hold addrs. of both terms in inv. file */
struct parentlist *new_parent; /* structure to hold new parent info. */
struct childlist *new_child; /* structure to hold new child info. */
p = find_term(parent, startinv); /* find address of parent & child */
q = find_term(child, startinv); /* terms in the inverted file */
/* first add parent links for the given child */
new_parent = get_mem_parentlist( ); /* get memory for parent record */
(void) strncpy(new_parent->term, parent, MAXWORD); /* copy over parent term */
new_parent->parent = p; /* store addr.(in inverted file) of parent term */
if (q->parents == NULL) { /* i.e., no parents listed for this child yet */
q->parents = new_parent; /* first parent link made */
new_parent->nextparent = NULL;
}
else { /* at least 1 parent already listed for given child */
new_parent->nextparent = q->parents; /* attach new parent in front of list */
q->parents = new_parent;
}
/* next add child links for given parent */
new_child = get_mem_childlist( ); /* get memory for child record */
(void) strncpy(new_child->term, child, MAXWORD); /* copy over child term */
new_child->child = q; /* store addr. (in inverted file) of child term */
if (p->children == NULL) { /* i.e., no child terms listed for this parent */
p->children = new_child; /* first child link made */
new_child->nextchild = NULL;
}
else { /* at least 1 child already exists for given parent */
new_child->nextchild = p->children;/* attach new child to front of list */
p->children = new_child;
}
}
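/* NOTE: add_link pushes each new parent and child onto the front of its */
/* list, so pr_invert prints them in the reverse of the order in which */
/* the link file supplies them */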
/***********************************************************************
pr_invert(startinv)
Returns: void
Purpose: Print either inverted file. Prints each term, its
associated document numbers, term-weights and parent
and child terms.
**/
static void pr_invert(startinv)
struct invert *startinv; /* in: specifies start of inverted file */
{
struct invert *inv_addr; /* tracks addr. of current inv. file record */
struct doclist *doc_addr; /* tracks addr. of current doclist record */
struct parentlist *parent_addr; /* tracks addr. of current parentlist record */
struct childlist *child_addr; /* tracks addr. of current childlist record */
struct connections *connect_term_add; /* tracks connected terms */
inv_addr = startinv; /* begin at top of inv. file */
while (inv_addr) { /* while a legitimate term ... */
(void) fprintf(output,"TERM: %s\nPARENT TERMS: ", inv_addr->term);
parent_addr = inv_addr->parents; /* find addr. of first parent */
while (parent_addr) { /* printing all parents */
(void) fprintf(output,"%s",parent_addr->term);
parent_addr = parent_addr->nextparent; /* loop through remaining parents */
if(parent_addr) (void) fprintf(output,", ");
}
(void) fprintf(output,"\nCHILD TERMS: ");
child_addr = inv_addr->children; /* find addr. of first child */
while (child_addr) { /* printing all children */
(void) fprintf(output,"%s",child_addr->term);
child_addr = child_addr->nextchild; /* loop through remaining children */
if(child_addr) (void) fprintf(output,", ");
}
(void) fprintf(output,"\nDOCUMENT NUMBER TERM WEIGHT\n");
doc_addr = inv_addr->doc; /* find addr. of first associated doc. */
while (doc_addr) { /* printing all documents */
(void) fprintf(output,"%-30d",doc_addr->doc);
(void) fprintf(output,"%-10.5f\n",doc_addr->weight);
doc_addr = doc_addr->nextdoc; /* loop through remaining docs. */
}
/* if the term is connected then print the connected term(s) from the other hierarchy */
(void) fprintf(output, "CONNECTIONS IN OTHER THESAURUS HIERARCHY:\n");
connect_term_add = inv_addr->connect;
while (connect_term_add)
{
(void) fprintf(output," %s", connect_term_add-termadd-term);
connect_term_add = connect_term_add-next_connection;
if(connect_term_add) (void) fprintf(output,", ");
}
(void) fprintf(output,"\n\n");
inv_addr = inv_addr->nextterm; /* loop to next term in inverted file */
}
}
/***********************************************************************
simple_merge(startinv1, startinv2)
Returns: void
Purpose: In this function, two terms in different hierarchies are merged
if they are identical.
**/
static void simple_merge(startinv1, startinv2)
struct invert *startinv1, /* in: specifies start of 1st inv. file */
*startinv2; /* in: specifies start of 2nd inv. file */
{
struct invert *inv_addr1, *inv_addr2;
struct connections *r1, *r2; /* storage to hold info. about connected terms */
inv_addr1 = startinv1; /* start with top of 1st inv. file */
while (inv_addr1) { /* looking for this term in the other inv. file */
inv_addr2 = find_term(inv_addr1->term, startinv2);
if (inv_addr2) { /* if term was found then update connect */
r1 = get_mem_connections( );
r1->termadd = inv_addr2;
r1->next_connection = inv_addr1->connect;
inv_addr1->connect = r1;
r2 = get_mem_connections( );
r2->termadd = inv_addr1;
r2->next_connection = inv_addr2->connect;
inv_addr2->connect = r2;
}
inv_addr1 = inv_addr1->nextterm;
}
}
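/* NOTE: simple_merge records each match in both directions: the address */
/* of the matching term is pushed onto the front of each term's connect */
/* list, so the connection is visible from either hierarchy */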
/***********************************************************************
complex_merge(startinv1, startinv2)
Returns: void
Purpose: In this routine two terms in different hierarchies are merged
if they have 'similar' parents and children. Similarity is
computed in compare( ) and tested against the constant
THRESHOLD, fixed in advance by the user.
**/
static void complex_merge(startinv1, startinv2)
struct invert *startinv1, /* in: specifies start of 1st inv. file */
*startinv2; /* in: specifies start of 2nd inv. file */
{
struct invert *inv1_addr, /* tracks current term in 1st inv. file */
*inv2_addr; /* tracks current term in 2nd inv. file */
struct connections *r1, *r2; /* tracks connected terms */
int compare ( );
inv1_addr = startinv1; /* begin at top of 1st inv. file */
while(inv1_addr) { /* while addr. legitimate ... */
inv2_addr = startinv2; /* now begin at top of 2nd inv. file */
while(inv2_addr) {
if (compare(inv1_addr, inv2_addr)) { /* this returns 1 if the 2 terms are */
/* similar enough, then connect the two terms */
r1 = get_mem_connections();
r1->termadd = inv2_addr;
r1->next_connection = inv1_addr->connect;
inv1_addr->connect = r1;
r2 = get_mem_connections();
r2->termadd = inv1_addr;
r2->next_connection = inv2_addr->connect;
inv2_addr->connect = r2;
}
inv2_addr = inv2_addr->nextterm; /* loop through 2nd inv. file */
}
inv1_addr = inv1_addr->nextterm; /* loop through 1st inv. file */
}
}
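/* NOTE: complex_merge compares every term of the first inverted file */
/* against every term of the second, so its cost grows as the product */
/* of the two vocabulary sizes */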
/***********************************************************************
compare(p,q)
Returns: int
Purpose: Used to compare two terms for more than just equality.
A similarity value is computed; if it reaches the THRESHOLD
then 1 is returned, else 0.
**/
static int compare(p,q)
struct invert *p, *q; /* addresses of two terms to be compared */
{
struct parentlist *parent1, /* tracks parentlist of 1st term */
*parent2; /* tracks parentlist of 2nd term */
struct childlist *child1, /* tracks childlist of 1st term */
*child2; /* tracks childlist of 2nd term */
float count1, /* tracks # of parents + children of 1st term */
count2, /* tracks # of parents + children of 2nd term */
count; /* tracks # of common parents + children */
count = 0.0; count1 = 0.0; count2 = 0.0; /* initialize all counts */
/* first check # of parents for term p & the # of common parents */
parent1 = p->parents;
while(parent1) { /* parent of 1st term */
count1 = count1 + 1.0;
parent2 = q->parents;
while(parent2) { /* loop through parents of second term */
if(!strncmp(parent1->term,parent2->term,MAXWORD)) count = count + 1.0;
parent2 = parent2->nextparent;
}
parent1 = parent1->nextparent;
}
/* next compute # of parents for term q */
parent2 = q->parents;
while (parent2) { /* loop through parents of 2nd term again */
count2 = count2 + 1.0;
parent2 = parent2->nextparent;
}
/* now check # of children for term p & the # of common children */
child1 = p->children;
while(child1) { /* loop through children of 1st term */
count1 = count1 + 1.0;
child2 = q->children;
while(child2) { /* loop through children of 2nd term */
if (!strncmp(child1->term,child2->term,MAXWORD)) count = count + 1.0;
child2 = child2->nextchild;
}
child1 = child1->nextchild;
}
/* next compute # of children for term q */
child2 = q->children;
while (child2) { /* loop through children of 2nd term */
count2 = count2 + 1.0;
child2 = child2->nextchild;
}
if (count != 0.0) { /* if there is anything in common at all */
if ((count/(sqrt(count1 * count2))) >= THRESHOLD) {
/* printf("value is %f\n",(count/(sqrt(count1*count2)))); */
return(1);
}
else return(0);
}
return(0);
}
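/***********************************************************************
Worked example for compare( ): suppose term p has two parents and two
children (count1 = 4.0), term q has one parent and two children
(count2 = 3.0), and the two terms share one parent and two children
(count = 3.0). The similarity value, which is the cosine coefficient
computed over the combined parent and child sets, is then
count / sqrt(count1 * count2) = 3.0 / sqrt(12.0) = 0.866
so with THRESHOLD set to 0.8, for example, p and q would be connected.
**/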
/**********************************************************************
find_term(term,startinv)
Returns: address of a struct invert record
Purpose: Search for a specified term in the specified inverted file
and return address of the corresponding record. If not
found then returns NULL.
**/
static struct invert *find_term(term, startinv)
char term[MAXWORD]; /* in: term to be searched */
struct invert *startinv; /* in: inverted file to search in */
{
struct invert *inv_addr;
inv_addr = startinv; /* begin at top of inverted file */
while(inv_addr) {
if (!strcmp(term,inv_addr->term)) return(inv_addr);
inv_addr = inv_addr->nextterm;
}
return(NULL);
}
/**********************************************************************
*get_mem_invert( )
Returns: address of a struct invert record
Purpose: dynamically obtain enough memory to store 1 invert record
**/
static struct invert *get_mem_invert ()
{
struct invert *record;
record = (struct invert *)malloc(sizeof(invfile));
if (!record) {
(void) fprintf(output,"\nout of memory\n");
return(NULL);
}
return(record);
}
/***********************************************************************
*get_mem_doclist ()
Returns: address of a struct doclist record
Purpose: dynamically obtain enough memory to store 1 doclist record
**/
static struct doclist *get_mem_doclist()
{
struct doclist *record;
record = (struct doclist *)malloc(sizeof(doclistfile));
if (!record) {
(void) fprintf(output,"\nout of memory\n");
return(NULL);
}
return(record);
}
/***********************************************************************
*get_mem_parentlist()
Returns: address of a struct parentlist record
Purpose: dynamically obtain enough memory to store 1 parentlist record.
**/
static struct parentlist *get_mem_parentlist()
{
struct parentlist *record;
record = (struct parentlist *)malloc(sizeof(parentfile));
if (!record) {
(void) fprintf(output,"\nout of memory\n");
return(NULL);
}
return(record);
}
/****************************************************************************
*get_mem_childlist( )
Returns: address of a struct childlist record
Purpose: dynamically obtain enough memory to store 1 childlist record.
**/
static struct childlist *get_mem_childlist ()
{
struct childlist *record;
record = (struct childlist *)malloc(sizeof(childfile));
if (!record) {
(void) fprintf(output,"\nout of memory\n");
return(NULL);
}
return(record);
}
/***********************************************************************
*get_mem_connections()
Returns: address of a struct connections record
Purpose: dynamically obtain enough memory to store 1 connections record.
**/
static struct connections *get_mem_connections( )
{
struct connections *record;
record = (struct connections *)malloc(sizeof(connectlist));
if (!record) {
(void) fprintf(output,"\nout of memory\n");
return(NULL);
}
return(record);
}