D-Lib Magazine
December 1998
ISSN 1082-9873
Evaluating Search Engine Models for Scholarly Purposes
A Report from the Internet Applications Laboratory
Anthony F. Beavers
The University of Evansville
[email protected]
The Internet allows for the efficient dissemination of texts, thereby creating a rich hypertextual environment potentially conducive to the free exchange of ideas in a manner worthy of the modern scholar. However, the fact that any user whatsoever may disseminate texts in this manner presents two distinct problems. First, finding relevant resources on the Internet may take a fair amount of time and, second, once resources are found, determining their reliability is often difficult if the user is not already an expert in the field of the resource under consideration. These problems -- efficiency in searching and academic quality-control -- are surmountable with existing technology, and many laboratories around the world are working hard to shape this technology into a variety of academic information retrieval services.
Some of these efforts depend on developing a system of meta-tags that extend the html markup language to communicate effectively with search engine databases so that no manual data entry is needed. While these tags will ultimately be necessary to make any widespread academic information retrieval system efficient, their use at this point in history suffers from a serious drawback: until a standardized tagging system is accepted and implemented by a large group of users, any search engine that uses such tags will be restricted to a relatively small set of Internet resources. Moreover, the use of meta-tags does not solve the problem of quality-control, so in addition to meta-tags, some means of determining which files to include in a search engine index is needed. The issues of standardized tagging systems and quality-regulating mechanisms are related but independent problems.
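To make the idea concrete, here is a minimal sketch of how a crawler might harvest descriptive meta-tags from a page so that no manual data entry is needed. The Dublin Core-style tag names are illustrative assumptions, not the particular scheme the article has in view.

```python
# A minimal sketch (not IALab code) of harvesting descriptive meta-tags
# from an HTML page; the DC.* names below are illustrative assumptions.
from html.parser import HTMLParser

class MetaTagHarvester(HTMLParser):
    """Collect <meta name="..." content="..."> pairs from a page."""
    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            name, content = attrs.get("name"), attrs.get("content")
            if name and content:
                self.tags[name.lower()] = content

page = """<html><head>
  <meta name="DC.Title" content="The Apology of Socrates">
  <meta name="DC.Creator" content="Plato">
  <meta name="DC.Type" content="primary text">
</head><body>...</body></html>"""

harvester = MetaTagHarvester()
harvester.feed(page)
print(harvester.tags)  # {'dc.title': ..., 'dc.creator': ..., 'dc.type': ...}
```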
In 1995, a small team at the University of Evansville began to address these problems, looking for ways to allow cataloging of the Internet immediately, in its current state of disarray, while preserving some sort of quality-control. Since then their efforts have undergone considerable revision, each time producing a better mechanism than the last. These efforts have now been consolidated into the newly-formed Internet Applications Laboratory (IALab), temporarily funded by the University of Evansville, with the express mission of providing free access to worthy academic resources for the global community. The IALab seeks to do this by developing Internet filtering mechanisms that couple well with search engine technology. One guiding principle of the IALab is that the Internet allows scholars to disseminate their own research in an academically meaningful manner by placing it on servers at their host institutions. Consolidating efforts such as these filtering mechanisms act on that research by adding a procedure of validation or accreditation, ensuring a measure of reliability for users who find resources through one of the filters.
The Argos Model
The first attempt from what is now the IALab to provide a quality-regulating mechanism was Argos (http://argos.evansville.edu), a limited area search engine (LASE) put on-line in October 1996 and dedicated to ancient and medieval studies, though the model is applicable in other disciplines. Argos uses a very simple crawling procedure to limit the scope of return sets to collections of resources hand-selected by scholars working in the field. As an example of its effectiveness, in 1996, AltaVista returned 44,000 hits for a search of the word "Plato," including references to a few software packages, an ale, a consulting firm, a small town in Illinois and the Spanish word for "plate"; Argos returned about 300 hits, all of which were pertinent to the Plato who lived and worked in ancient Greece.
To determine the scope of Argos, we enlisted the help of the major index sites in ancient and medieval studies that were already established on the Internet. These sites consisted primarily of pages of links to hand-selected academic resources. Then we built a web crawler to search each of these "associate sites," plus each page to which they link. Special html extensions were devised for the use of the associate sites whereby they can instruct the crawler not to follow a link or to follow a link to a second or third level, provided that these secondary and tertiary links stay on the same server and are located further down the directory chain. (Why we did this is a story in itself, but one for another article.) The net result of the procedure is that it passes editorial control over the contents of Argos to the editors of the associate sites. When they add a link to their sites, Argos picks it up, and when they remove a link, it automatically falls out of the Argos search window, so that the procedure guarantees the user that any resource found through Argos was selected by a professional academician.
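Those link-following rules can be stated compactly in code. The sketch below is a reconstruction under the assumptions just described, not the Argos crawler itself: a secondary or tertiary link is followed only if the associate site's directives granted the extra depth, and the link stays on the same server, further down the directory chain.

```python
# A reconstruction (not the Argos source) of the crawler's rules for
# secondary and tertiary links, per the description above.
from urllib.parse import urlparse

def may_follow(parent_url: str, link_url: str,
               link_depth: int, granted_depth: int) -> bool:
    """link_depth: 2 = second level, 3 = third level; granted_depth
    comes from the associate site's crawler directives."""
    if link_depth > granted_depth:
        return False
    parent, link = urlparse(parent_url), urlparse(link_url)
    if parent.netloc != link.netloc:          # must stay on the same server
        return False
    parent_dir = parent.path.rsplit("/", 1)[0] + "/"
    return link.path.startswith(parent_dir)   # further down the directory chain

print(may_follow("http://example.edu/plato/index.html",
                 "http://example.edu/plato/texts/apology.html", 2, 3))  # True
print(may_follow("http://example.edu/plato/index.html",
                 "http://elsewhere.org/plato.html", 2, 3))              # False
```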
The model can be varied in ways the IALab has not yet implemented. Voting mechanisms could easily be added whereby a resource must show up in more than one associate site before it is included in the search engine, and a variety of additional html tags could be devised to allow the editors of the associate sites to classify resources or to direct the crawler with finer control. Absent these additions, Argos is limited in its features. It allows users single-word, Boolean and phrase searching, but always across the entire dataset, and since it pulls database information off the pages that are actually searched, the entries in a return set do not share a common format, particularly in their titles.
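As a sketch of the voting idea (hypothetical; it has not been implemented), inclusion could be a simple threshold on how many associate sites list a URL:

```python
# Hypothetical voting mechanism: include a resource in the index only
# if at least `threshold` associate sites link to it.
from collections import Counter

def vote_filter(associate_link_lists, threshold=2):
    votes = Counter(url for links in associate_link_lists for url in set(links))
    return {url for url, n in votes.items() if n >= threshold}

associates = [
    ["http://a.edu/plato.html", "http://b.edu/homer.html"],
    ["http://a.edu/plato.html", "http://c.edu/vergil.html"],
]
print(vote_filter(associates))  # only the Plato page has two votes
```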
In addition, the lack of a common vision among the associate sites makes the return sets uneven in quality. Of course, this may not be a problem with a different editorial board, though the difficulty of securing agreement and cooperation across the Internet on issues such as these should not be underestimated. To this day, one of the associates, in fact the most scholarly and well-established of them, continues to deliver a picture of a dog decked out for Christmas to Argos' quality-controlled database, and it took us four months to get another associate to block a hidden link to a triple-x sex site.
Differences of vision can never be entirely eliminated when Argos is working with many independent associates. In another implementation, which we are calling the single associate model, they disappear. Bernard Hibbitts' Jurist: The Law Professor's Network (http://jurist.law.pitt.edu) is based on this single associate idea. In this case, the IALab sends a special LASE crawler over his site alone, allowing Professor Hibbitts complete editorial control over the contents of the database. This has proven quite effective, though it still has some of the shortcomings of Argos mentioned above.
The general Argos model has the advantage of being easy to implement in other subject areas. To test this, the IALab built Hippias: Limited Area Search of Philosophy on the Internet (http://hippias.evansville.edu), edited by Peter Suber, a professor of philosophy at Earlham College. It was built in one weekend after the initial associate sites had been selected. We would have made many more LASEs based on the Argos model had we had the system resources to support them at the time.
As it turns out, it was for the best that we did not. The IALab's next experiments do not have the limitations of the Argos model; they are based on a database model that allows, from the search-engine side of the equation, for the categorization of links and for rendering the page descriptions in return sets into a standard format. I will say more about this in a moment. In the meantime, I should point out that the Argos model could be made to provide these features as well if there were a standardized system of meta-tags, provided that the authors of indexed files used them consistently. If our experiment with Argos has taught us anything, however, it is that indexing procedures must remain fairly free-form, at least with current technology. Even in the face of clear instructions, it is unlikely that a common usage of standards will emerge soon, even if agreement is ever reached about what the standards should be. The later IALab models take this problem into account as well.
The Noesis Model and the Encyclopedic Vision
The high quality of Argos' return sets compared with those of the major search engines led us to start reconceptualizing the search engine. The term "search engine" is a bit of a misnomer if the device also pre-filters the Internet, and calling a list of links to selected pages an "index" does a disservice to that enterprise as well. A bibliography, when it provides ready access to the sources that it lists, is an encyclopedic collection of content, and when that content is peer-reviewed and rendered searchable by a LASE, the result is an "encyclopedia" that is collectively maintained by scholars around the world.
To demonstrate this, we devised a thought-experiment to show that what stood in the way of this realization was largely a psychological phenomenon. Without advocating that this be the case, we started to imagine how a project such as Argos would appear if all of the pages in a return set carried standardized cataloguing information and were formatted in the same page layout. The result would appear to be a unified effort on the part of scholars around the world to disseminate scholarship freely. This remains the case even though the pages are, in fact, formatted differently. What the experiment shows, however, is that our failure to think along these lines earlier was due only to the psychological expectation of a common format and not to the limitations of the technology. With this in mind, we started to think in smaller terms. How many quality links would it take to make even a large encyclopedia? Nothing on the order of 14,000, then the size of the Argos dataset. At this point, we turned our sights back to the discipline of philosophy. Instead of linking to all the biographies of Plato, for instance, a better service could be provided to users by listing only the two or three best.
So, instead of running the crawler across associate sites, we revised the single-associate model used with the Jurist project and started cataloguing URLs one at a time in our own database. We track the author and his or her institutional affiliation, the title, the resource type (that is, whether the resource is an essay directed at professional audiences, a lecture for undergraduate students, a book review, an image, a primary text or a research tool), and a few other pieces of information. We also added a hierarchical system of classifying resources. Though the prospect of the one-link-at-a-time approach sounded daunting at first, the reality turned out quite the contrary. We wrote a special user interface to allow a development team to catalogue resources. It takes less than a minute to catalogue a link, much less than the time it takes to process a book in a standard library. Furthermore, once a link is catalogued, it is catalogued for the entire Internet world, and we don't have to handle any books. Special procedures allow us to edit this database easily. A robot deals with dead links, and we are writing a variety of procedures to report conditions that suggest when an entire website may have moved, when a resource has changed significantly, and so on.
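The fields just described map naturally onto a simple record structure. The following is an illustrative reconstruction with assumed field names, not the actual Noesis database schema:

```python
# An illustrative reconstruction of a Noesis-style catalogue record;
# field names are assumptions based on the description above.
from dataclasses import dataclass, field

RESOURCE_TYPES = {"essay", "lecture", "book review", "image",
                  "primary text", "research tool"}

@dataclass
class CatalogueRecord:
    url: str
    title: str
    author: str
    affiliation: str
    resource_type: str                        # one of RESOURCE_TYPES
    classification: list[str] = field(default_factory=list)  # tree path

    def __post_init__(self):
        if self.resource_type not in RESOURCE_TYPES:
            raise ValueError(f"unknown resource type: {self.resource_type}")

record = CatalogueRecord(
    url="http://example.edu/essays/forms.html",
    title="Plato's Theory of Forms",
    author="A. Scholar",
    affiliation="Example University",
    resource_type="essay",
    classification=["Ancient Philosophy", "Plato", "Metaphysics"],
)
```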
To filter for quality, we began by classifying only resources written by Ph.D.s in philosophy, though we announced that we would revisit this policy later. The policy is a temporary one, adopted to give us a measure of quality-control. We are now in the process of adding a professional users' module that allows professionals to configure their own personal research link libraries from the Noesis dataset. As they do this, a robot will evaluate their decisions and automatically accredit resources according to a variety of variables that can be manipulated by the site editor. In addition, the personal research modules will be evaluated according to topic, and another robot will classify resources according to how they are actually used by professionals, rather than by imposing an exterior system, like that of the Library of Congress, on them.
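One way to picture the accrediting robot is as a weighted vote over the professionals' personal libraries. The article does not give the actual variables, so the weights and threshold below are assumptions:

```python
# A hedged sketch of the accrediting robot described above; the weights
# and threshold stand in for the editor-tunable variables.
def accredit(personal_libraries, user_weight, threshold=2.0):
    """personal_libraries: {user: set of URLs}; user_weight: {user: float}."""
    scores = {}
    for user, urls in personal_libraries.items():
        for url in urls:
            scores[url] = scores.get(url, 0.0) + user_weight.get(user, 1.0)
    return {url for url, score in scores.items() if score >= threshold}

libraries = {"prof_a": {"http://x.edu/1"},
             "prof_b": {"http://x.edu/1", "http://y.edu/2"}}
weights = {"prof_a": 1.5, "prof_b": 1.0}
print(accredit(libraries, weights))   # only http://x.edu/1 reaches 2.5
```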
The result is what we are calling the Noesis model. Users can see a manifestation of it at http://noesis.evansville.edu. It allows users to search only essays, lectures, book reviews, images, primary texts, or research tools, or any combination of these, starting at any point in our hierarchical tree and moving downwards, using simple word searching or phrase searching with or without Boolean criteria. This makes the site effective for those who are new to philosophy and yet useful for more seasoned academicians. (Noesis: Philosophical Research On-Line does not yet include a topic tree, though one is available at another manifestation of the model, Exploring Plato's Dialogues, at http://plato.evansville.edu. The Plato site cross-links several text files directly with the search engine, thereby producing what we are calling a virtual learning environment. To learn more, see the information file at that site.)
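In code, those search options amount to a pair of filters applied before text matching. Again a hedged sketch with assumed field names, not the Noesis implementation:

```python
# A sketch of the Noesis search filters: restrict hits by resource type
# and by a starting node in the classification tree before matching
# query words against the record.
from collections import namedtuple

Record = namedtuple("Record", "title resource_type classification")

def search(records, query_words, types=None, subtree=None):
    hits = []
    for r in records:
        if types and r.resource_type not in types:
            continue
        if subtree and list(r.classification[:len(subtree)]) != list(subtree):
            continue          # record must sit at or below the chosen node
        if all(w.lower() in r.title.lower() for w in query_words):
            hits.append(r)
    return hits

records = [
    Record("Plato's Theory of Forms", "essay",
           ("Ancient Philosophy", "Plato", "Metaphysics")),
    Record("Forms in Aristotle", "lecture",
           ("Ancient Philosophy", "Aristotle")),
]
# Only essays at or below Ancient Philosophy > Plato:
print(search(records, ["forms"], types={"essay"},
             subtree=("Ancient Philosophy", "Plato")))
```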
The Noesis model overcomes many of the limitations of the Argos model. It is more precise in its filtering mechanism, and it offers a wider array of search options for the user. It also has a higher degree of reliability. Its chief disadvantage is that it requires human intervention to maintain the database. So far, this hasn't been a problem. Two student workers have done it effectively. Given a standard work week, they could easily catalogue over 1,000 links a week, if that were necessary, while maintaining the dataset. As the Internet grows, this may become a problem, and procedures will be needed so that users can modify their entries from their end. We are anticipating this with software innovations for the Goliath Project discussed below. Even so, it is worth pointing out that we have made significant headway in cataloguing the portion of the Internet dedicated to professional philosophy long before any standards have been reached.
David and Goliath
Noesis enacts its quality-control mechanism on the search engine side. The Goliath Project, a joint venture between the IALab and the International Consortium for Alternative Academic Publication (ICAAP), http://www.icaap.org, uses the traditional means of peer-review by indexing only the independent journals that are springing up on the Internet.
Its procedures represent a synthesis of the database model used with Noesis and a meta-tag system developed by an ICAAP team headed by Mike Sosteric, a sociologist at the Centre for Global and Social Analysis, Athabasca University. The crawler mechanism used for the Goliath Project goes by the name of DAVID, a dedicated accrediting variable indexing device. It is accrediting in that it can promise users that any item appearing in a return set has undergone a procedure of true peer-review, and it is variable because it uses a database requiring human intervention for pages without the standardized tags and automatically defaults to a meta-tag system for pages with them. It can easily be adapted to accommodate a variety of meta-tagging systems, thereby allowing full-coverage cataloging of independent periodicals on the Internet long before any universal agreement is reached concerning meta-tagging standards.
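DAVID's "variable" behavior can be pictured as a simple dispatch: pages carrying a recognized meta-tag scheme are catalogued automatically, while the rest are queued for human entry in the database. The scheme prefixes below are illustrative assumptions, not ICAAP's actual tag names:

```python
# A hypothetical reconstruction of DAVID's variable indexing: pages with
# a recognized meta-tag scheme are catalogued automatically; the rest
# are queued for human cataloguing.
KNOWN_SCHEMES = ("icaap.", "dc.")   # illustrative prefixes, assumed

def route(url, meta_tags, auto_queue, human_queue):
    if any(name.startswith(KNOWN_SCHEMES) for name in meta_tags):
        auto_queue.append((url, meta_tags))   # tags drive the record
    else:
        human_queue.append(url)               # needs manual cataloguing

auto, manual = [], []
route("http://journal.org/a1", {"dc.title": "An Article"}, auto, manual)
route("http://journal.org/a2", {}, auto, manual)
print(len(auto), len(manual))   # 1 1
```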
Goliath will allow a full range of search options, like Noesis, but it will cut across all the academic disciplines. Though the software is still under construction, it will be finished by the end of January, at which point we will begin populating the database. By the end of 1999, the collection should be sizeable enough to be significantly useful to professional academicians and student scholars.
The hope of the IALab and the ICAAP is that Goliath will stimulate the proliferation of independent journals on the Internet that operate without economic interest. The technology is inexpensive enough to create an Internet in which quality information is disseminated efficiently to the global community free of charge. In a domain where authors have traditionally not been paid for their contributions to journals, we hope that authors will respond positively to these independent journals as well. Goliath means a wider readership, because access is free and efficient; and because it provides mechanisms for the validation of resources, Internet publication should start to "count" in promotion and tenure decisions. Furthermore, Goliath will work to bridge the gap between the general public and the university, allowing scholars the more traditional role of informing society rather than being subject to its economic whims.
Conclusions
What we have learned from these experiments is that, in no uncertain terms, it is technologically possible and economically feasible to build a system of dissemination for academic resources that is completely administered by the scholarly world without the intervention of economic interests. If the IALab has not yet demonstrated this fully in the concrete, it is only because we have been operating on a very small budget in an inexpensive lab that employs undergraduate interns under the direction of a single faculty advisor. (This should underline the economic feasibility of enterprises like the ones discussed above.) It is not because standards must first be reached for meta-tags, nor is it because the problem is technologically difficult, though a considerable part of the paper paradigm must be rethought. We fully believe that the new Internet technology offers the academic community improvements to the existing system of dissemination, as long as the community does not wait for the corporate sector to solve these problems for it.
Copyright © 1998 Anthony F. Beavers
hdl:cnri.dlib/december98-beavers