D-Lib Magazine
November 1998
ISSN 1082-9873
WWW -- Wealth, Weariness or Waste
Controlled vocabulary and thesauri in support of online information access
David Batty
CDB Enterprises, Inc.
Adjunct Professor, Graduate School of Library and Information Science
The Catholic University of America
[email protected]
Abstract
This article offers some thoughts on the problems of access to information in a machine-sensible environment, and the potential of modern library techniques to help in solving them. It explains how authors and publishers can make information more accessible by providing indexing information that uses controlled vocabulary, terms from a thesaurus, or other linguistic assistance to searchers and readers.
Introduction
Effective communication in any context is made easier by the use of a common language that both parties understand. In human speech, common language may occasionally be ambiguous or misleading because of the way it is used, but usually the quantity and feedback of exchanged messages allow the parties to communicate with each other. In the world of recorded information, there is not the same opportunity for immediate interaction, feedback, and realization. Writers are not consistent in their use of all the words that might describe or refer to topics, so a searcher who chooses to scan the full text of a document must include in a search statement all the synonyms, related terms, and levels of detail that the author might have used. The searcher is not talking to the author directly but to the document itself, a static instrument. Therefore we need a substitute for the result of the human, spoken interchange: the "Ah, yes, I see. What you mean is ..." That needed substitute is the controlled index language (thesaurus) that the indexer uses to interpret and represent the themes, concepts, and language of the author, and that the searcher uses to interpret and represent sometimes vague expressions of a need to know.
Practical Experience
Unfortunately, in the USA the value of controlled index languages was ignored for many years. Many early online database providers were dissuaded from using a thesaurus for what seemed to be valid economic considerations: fear of the apparent expense of designing and using a thesaurus; and doubt that the size of the database would justify that expense. These apprehensions turned out to be ill-founded. Databases grew to almost unmanageable size; the cost of searching titles and abstracts far exceeded what would have been the investment in the development and use of a controlled index language.
With CD-ROM publication we have at our fingertips physical packages that can contain 650MB in a single unit -- roughly the equivalent of 200,000 digitally stored single-spaced, 8.5"x11" pages, or 10,000 printed book pages stored as graphic images. But this accessibility is misleading, since we cannot flip through a CD-ROM nearly as easily as we can through a printed book or document file. And with the rapid growth of publication on the Web, we have available almost unlimited information resources -- even bearing in mind Sturgeon's Law: nine-tenths of anything is junk. The problem is to find information to meet a need without having to read all of it. Underlying this problem is the question of editorial supervision. Even acknowledging the low cost of CD-ROM production, most CD-ROM publishers exercise care for the intellectual product, as do conventional book publishers and many publishers on the Web. But when anyone can set up an accessible site and load into it anything an untidy mind chooses, the level of control is minimal. Full-text searching will always be valuable for browsing in a file of any size; but in large files, controlled-language searching will always support more efficient retrieval.
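Those storage figures are easy to sanity-check. A minimal sketch, assuming roughly 3,250 characters of text per single-spaced page and about 65KB per compressed page image (both page sizes are assumptions for illustration):

    # Rough check of the CD-ROM capacity figures quoted above.
    CD_BYTES = 650 * 1024 * 1024       # 650MB

    TEXT_PAGE_BYTES = 3250             # per single-spaced text page (assumed)
    IMAGE_PAGE_BYTES = 65 * 1024       # per scanned book page image (assumed)

    print(CD_BYTES // TEXT_PAGE_BYTES)   # 209715 -- on the order of 200,000 pages
    print(CD_BYTES // IMAGE_PAGE_BYTES)  # 10240  -- on the order of 10,000 pages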
The efficiency of controlled-language searching is something that even law firms, long convinced of the efficacy of whole-text searching, are beginning to appreciate. Several studies of litigation support activities have shown that whole-text searching is not as effective as the use of even a simply designed controlled index language; see, for example, "An evaluation of retrieval effectiveness for a full-text document-retrieval system" by David C. Blair and M. E. Maron, and the follow-up articles by Gerard Salton and by Blair and Maron (see References). Litigation support companies hired by law firms to gather discovery documents and have them ready for attorneys at the drop of a question have long used controlled languages, which they often call taxonomies because they use them in hierarchical form. Lawyers have now found that the investment in developing a taxonomy, even for a single major piece of litigation, and in indexing the discovery documents, is much less than the continuing cost of paralegals (or attorneys) doing text searching over and over again -- sometimes for the same documents.
In addition to the economic objections raised by early database providers was the objection that such organization might be too complex for the user -- but people have learned to use and to value other organized reference works: encyclopedias, back-of-the-book indexes, and even the yellow pages of a telephone directory.
Perhaps it is the idea of an intermediary language that is unsettling: "Why can't I go straight to what I want?" Yet nobody, on arriving in a strange city, would expect to go directly to the city hall (even knowing the address) without a map, or without a knowledgeable cab driver (the reference librarian) who already knows the map.
Economic Considerations
The database preparation cost of a free-text or full-text system is minimal, no more than what it costs to put the files into the computer. More terms are available for searching than could normally be afforded in a controlled index language, and those terms are therefore likely to include more precise ones than the controlled index language could contain. The searcher is thus offered a choice of great exhaustivity (range of topics) or great specificity (precise terms), depending on the manipulation of the search terms and the construction of the search statements and search strategy.
Therein lies the main, and two-sided, difficulty. The searcher must trust the author to have included all of the terms necessary to describe the topic. There are databases in all fields in which a writer may not use the main topic term at all. When the Washington Post newspaper went online with its PostHaste system, it asked CDB Enterprises to construct a thesaurus so the articles could be offered in both indexed and full-text forms. Their reason: a natural language search on the word MURDER might retrieve an article on July weather -- "It's murder out there"; and some articles never included a word for the main topic, because it was obvious. Indeed, in that summer, as we worked on their thesaurus, the golfer Arnold Palmer hit two holes-in-one on the same hole on two successive days in a major tournament, and the article describing this unprecedented feat never mentioned the word GOLF. Since writers are not consistent in their use of all the words that can be used to describe topics, the searcher must plan very carefully to include all synonyms and all necessary levels of specificity in searching. Further, simple Boolean intersection of natural language terms can produce imprecise results when two or more terms are present in a record but without the semantic association expected by the user. (A search for window shades including the terms VENETIAN and BLIND might retrieve information on SIGHTLESS PERSONS IN VENICE.) Proximity searching can overcome this problem to some extent, as the sketch below illustrates.
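A minimal sketch of the difference, using invented documents: simple Boolean AND retrieves any record containing both words, while a proximity constraint retrieves only records in which the words occur close together.

    # Boolean AND versus proximity searching (illustrative documents).
    docs = {
        1: "window shades sale venetian blind styles in stock",
        2: "guide services for blind visitors to venetian museums",
    }

    def boolean_and(terms, text):
        # True if every query term occurs anywhere in the record.
        words = text.split()
        return all(t in words for t in terms)

    def proximity(terms, text, window=1):
        # True only if the two terms occur within `window` words of each other.
        words = text.split()
        pos = {t: [i for i, w in enumerate(words) if w == t] for t in terms}
        return any(abs(a - b) <= window
                   for a in pos[terms[0]] for b in pos[terms[1]])

    query = ["venetian", "blind"]
    print([d for d, t in docs.items() if boolean_and(query, t)])  # [1, 2] -- both records
    print([d for d, t in docs.items() if proximity(query, t)])    # [1] -- window shades only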
In other words, effort and time, and therefore cost, saved in database production is passed on to the searcher. Such a situation may satisfy a commercial database provider, by saving expense at the beginning of the process and increasing income at the point of use, but is not likely to satisfy the user. Full-text searching and controlled language searching are not the only access methods; they are used here only to emphasize two, possibly extreme, approaches. Further possibilities will be discussed at the end of the article.
It is not uncommon today for a database to contain records collected and indexed over a span of 50 years by a variety of index languages or, indeed, by none at all. Over time, the working language of a discipline changes and, eventually, reaches the point where a new, consistent index language is needed.
What is a Thesaurus?
A thesaurus, to a layman, is a fat book prepared by somebody called Peter Mark Roget and used by college students to enlarge their vocabulary when writing term papers -- and, often and unfortunately, to vary the representation of the same concept from sentence to sentence. A thesaurus, to an information scientist, is a controlled set of the terms used to index information in a database, and therefore also to search for information in that database, so that the same concepts are represented by the same terms. For many years in this country, thesauri were often presented as alphabetized lists of key terms taken from the documents to be indexed, with references to and from other terms made as necessary. This traditional practice has changed in recent years to a more structured approach based on an analytical technique. Ironically, this means that the original misuse of the word "thesaurus" by information scientists to describe purely alphabetical lists of terms (Roget organized his thesaurus by categories of knowledge, and included an alphabetized list of terms only as an index) has been amended, so that current practice is now closer to Roget's own meaning, embracing both categorization and alphabetical listing.
"Thesaurus" is only one name for a controlled index language or one of its parts; "taxonomy", "classification", and "hierarchy" are others. "Keyword list" should really be used only for a list of terms taken directly from authors' original language, and "ontology" should not be used at all, because it means "the study of being" (Greek ONTO -- (being) + LOGIA (discourse) -- not a possible and mistaken back formation into ONTO (essence) + LOGOS (word) -- as the essence or existence of words).
There are three levels of thesauri:
- Universal, like the Library of Congress Subject Headings, or the Library of Congress Classification scheme, or the Dewey Decimal Classification scheme;
- Broad areas, like the Medical Subject Headings (MeSH) of the U.S. National Library of Medicine, or the Thesaurus of Engineering and Scientific Terms (TEST) originally from the Engineers Joint Council, or the Art and Architecture Thesaurus (AAT) supported by the Getty Trust; and
- Specific areas, like the Transportation Research Thesaurus (TRT) administered by the Transportation Research Board of the National Research Council, or the ERIC Thesaurus on education.
The Power of Structured Language
This article makes an assumption about the nature and appearance of a thesaurus, even though many thesauri have been and will be created that use only some of these features, and in varied forms. The assumption is that a well-developed thesaurus is based on a recognition of a hierarchical structure, or a set of them, in which clusters of concepts that share a common characteristic are organized in families (called "facets"), and represented by natural language terms useful in the context in which the thesaurus will be used.
For example, Wood, Nylon, Steel, Copper, Wool all share the characteristic of being MATERIALS, whatever other characteristics some of them may share, like Combustibility. The terms constitute the beginnings of a MATERIALS facet. Sometimes the terms in a facet can be divided into subfacets by secondary characteristics: Within the MATERIALS facet, Wood and Wool are ORGANIC MATERIALS; Steel and Copper are METALS; and Nylon is a PLASTIC. Facets and subfacets are then arranged as simple hierarchies of terms, from general to specific.
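As a minimal sketch (the representation, not the facet itself, is an assumption), the MATERIALS facet above might be recorded as a small hierarchy that a program can navigate from general to specific:

    # The MATERIALS facet and its subfacets, as described above.
    MATERIALS = {
        "ORGANIC MATERIALS": ["Wood", "Wool"],
        "METALS": ["Steel", "Copper"],
        "PLASTICS": ["Nylon"],
    }

    def hierarchy_path(term):
        # Return the general-to-specific path for a term, or None if absent.
        for subfacet, terms in MATERIALS.items():
            if term in terms:
                return ["MATERIALS", subfacet, term]
        return None

    print(hierarchy_path("Copper"))  # ['MATERIALS', 'METALS', 'Copper']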
Facet procedure has many advantages. By organizing the terms into smaller, related groups, each group of terms can be examined more easily and efficiently for consistency, order, hierarchical relationships, relationships to other groups, and the acceptability of the language used in the terms. The faceted approach is also useful for its flexibility in dealing with the addition of new terms and new relationships. Because each facet can stand alone, changes can usually be made easily in a facet at any time without disturbing the rest of the thesaurus.
The faceted approach combined with the use of notational codes to represent the hierarchies is especially amenable to software applications. CDB Enterprises developed its own software for thesaurus construction: it ensures the integrity of the structure of hierarchies and references between them; it generates printed and on-screen displays of different thesaurus formats; and it even displays the thesaurus for point-and-shoot allocation of terms in online indexing. Nevertheless, the process of facet analysis and the construction and maintenance of a thesaurus is, and must always remain, essentially an intellectual endeavor.
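How notational codes make that mechanical support possible can be sketched with an invented notation (not CDB's actual scheme): if each code extends its parent's code, the integrity of the hierarchy can be checked automatically.

    # Hypothetical notational codes for the MATERIALS facet; each code
    # extends its parent's, so the hierarchy can be verified mechanically.
    codes = {
        "MA": "MATERIALS",
        "MA.OR": "ORGANIC MATERIALS",
        "MA.OR.WD": "Wood",
        "MA.OR.WL": "Wool",
        "MA.ME": "METALS",
        "MA.ME.ST": "Steel",
        "MA.ME.CU": "Copper",
        "MA.PL": "PLASTICS",
        "MA.PL.NY": "Nylon",
    }

    def check_integrity(codes):
        # Every code below a facet root must have its parent present.
        for code in codes:
            if "." in code:
                parent = code.rsplit(".", 1)[0]
                if parent not in codes:
                    raise ValueError("orphan code: " + code)

    check_integrity(codes)  # passes; deleting "MA.ME" would orphan Steel and Copper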
A further and final benefit of the faceted approach becomes apparent in the use of the thesaurus: it is much easier for an indexer or searcher to understand a set of hierarchically organized facets as a conceptual map, which shows the precise level and set of associations of a term. Categories of related terms are easier to negotiate than a long list of alphabetized terms.
Typical thesaurus displays based on a faceted structure are:
- Explicitly hierarchical displays, sometimes called taxonomies,
- Alphabetical displays, showing references between related terms or references from unused synonym terms to the postable or preferred term,
- Rotated displays, in which each word in every natural language term is used as an index term to indicate the several phrases in which it appears (sketched below).
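A rotated display, the last of these, can be generated mechanically. A minimal sketch with invented terms: every word of every multi-word term becomes an alphabetical access point to the phrases that contain it.

    # Build a rotated (permuted) display from multi-word terms.
    terms = ["ORGANIC MATERIALS", "MATERIALS HANDLING", "SHEET MATERIALS"]

    rotated = {}
    for term in terms:
        for word in term.split():
            rotated.setdefault(word, []).append(term)

    for word in sorted(rotated):
        print(word, "->", sorted(rotated[word]))
    # HANDLING -> ['MATERIALS HANDLING']
    # MATERIALS -> ['MATERIALS HANDLING', 'ORGANIC MATERIALS', 'SHEET MATERIALS']
    # ORGANIC -> ['ORGANIC MATERIALS']
    # SHEET -> ['SHEET MATERIALS']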
Other Approaches
Enough has been said above to make it clear that full-text searching is fraught with the risk of overwhelming retrieval, and thus confusion, and that the development and application of controlled index languages (though extremely efficient in use) can be expensive enough to discourage authors or database providers. Are there alternatives?
Hypertext, with its links from an assigned word or phrase in a document to a related document, is attractive, but it places on the author the same burden as indexing the document using a controlled language, and perhaps a heavier one. Increasingly, electronic information providers are using hypertext techniques to navigate their files, both within a file and across files, even for ephemeral information like daily news extracts.
A transparent search thesaurus is a concept that has been discussed in the information science research literature but not yet implemented effectively. It would require the analysis of an extremely large collection of text in machine-readable form, possibly using an approach similar to Lauren Doyle's automatic indexing research on "semantic road maps" in the 1960s, to generate a very large thesaurus with connections of varying intensity between terms. A search entered by a user would assemble all synonyms and tight semantic relationships into strings of OR relationships, and join each string to the other, conceptually separate strings with AND relationships between them. The combined search statement could then be used in full-text searching -- assuming that the text is susceptible to searching by the search engine, and that the search engine works at a speed the user finds tolerable.
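The expansion step just described can be sketched simply (the thesaurus entries here are invented): synonyms are joined with OR, and the conceptually separate groups are joined with AND.

    # Expand a query through a (tiny, invented) search thesaurus:
    # synonyms ORed within a group, groups ANDed together.
    SYNONYMS = {
        "car": ["car", "automobile", "motorcar"],
        "pollution": ["pollution", "emissions", "exhaust"],
    }

    def expand(query_terms):
        groups = []
        for term in query_terms:
            variants = SYNONYMS.get(term, [term])
            groups.append("(" + " OR ".join(variants) + ")")
        return " AND ".join(groups)

    print(expand(["car", "pollution"]))
    # (car OR automobile OR motorcar) AND (pollution OR emissions OR exhaust)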
Conclusion
Perhaps the information environment and content, and the way people expect to approach them, should determine the access mode in a machine-sensible environment. Thus, scholarly and technical information, especially when organized as reports or reflective communication, is probably most susceptible to a combination of a controlled index/search language and full-text searching, because the conceptual structure and the vocabulary are likely to be well organized. Indeed, the authors of such information are motivated to organize and even index their work because of their intention to disseminate the information. Digested information like news bulletins, current awareness commentaries, and even sales catalogs, directories, and handbooks is better suited to hypertext techniques. Undigested and unorganized information that is often the author's self-serving imposition on the Web probably deserves only a full-text search engine general enough to handle more or less anything. As with the old computer-produced rotated term indexes that relied only on the words in the titles of printed documents, the searcher has to work harder and longer than the author or the system, and may find only a little of what the authors produced and did not identify carefully enough.
Information is as accessible, for retrieval and for comprehension, as its author or its publisher makes it. There is a burden of effort in information storage and retrieval that may be shifted from shoulder to shoulder, from author, to indexer, to index language designer, to searcher, to user. It may even be shared in different proportions. But it will not go away. Once we realize and face this, we may begin to make sense of the wealth (or waste) of electronic information.
References
David Batty, "Words, words, words" in "Database Design" column, Database, December 1988, pp. 109-113.
David Batty, "Theasurus construction and maintenance: a survival kit." Database, February 1989, pp. 13-20.
David C. Blair and M. E. Maron, "An evaluation of retrieval effectiveness for a full-text document-retrieval system." Communications of the ACM, March 1985, 28:3, pp. 289-299.
David C. Blair and M. E. Maron, Information Processing and Management, 1990, 26:3, pp. 437-447.
Gerard Salton, Communications of the ACM, July 1986, 29:7, pp. 648-656.
David Batty is President of CDB Enterprises, Inc., a consulting company that specializes in library and information system design, information storage and retrieval systems, and especially the controlled index languages that drive them.
Copyright © 1998 David Batty
hdl:cnri.dlib/november98-batty