D-Lib Magazine
November 1998

ISSN 1082-9873

Defining Collections in Distributed Digital Libraries

Carl Lagoze
Cornell University

David Fielding
Cornell University

1. Order and Chaos in Global Information Space

The World Wide Web provides unprecedented access to globally distributed content. The success of the Web, both in terms of number of resources and use of those resources, is largely due to three defining characteristics of the Web. Because of its universality, anyone can participate in the Web, as author, publisher, or consumer, with a minimal investment of hardware and expertise. Because of its uniformity, resources, services, and users participate on the Web as equals in a common information space. Finally, because of decentralization, the Web is fundamentally anarchistic beyond basic agreement at the technical level on protocols and transport mechanisms.

These principles that are fundamental to the Web's success are also the root of the problems that frequently confound its users. As many have found, universality often translates to "quantity without quality", where content from Nobel Prize winners co-exists with content from the prize winners of a local first-grade writing contest. Uniformity frequently means that specialized and domain-specific tools, technologies, and guidance essential for using many classes of information (e.g., geo-spatial, statistical, scientific) are difficult or impossible to find. Decentralization frequently means that it is difficult to impose the organizational structures necessary for ensuring information integrity -- i.e., reliability and accessibility, security and privacy for content and users, and survivability (preservation) of information.

This apparent paradox in the utility of the Web, in fact, reflects the highly variable manner in which people seek out and use information in their daily lives. At times their motivations may resemble those of shoppers in a busy commercial district hoping to stumble upon the perfect gift. Such serendipitous browsing is often served by the lack of organization in Web space. In contrast, there are other times when people wish to undertake more focused, discipline-specific information tasks or when they wish to purposely screen out certain genres of information (e.g., protect their children from inappropriate content). In these situations greater levels of organization, selection, and specialization than are currently available on the Web are more appropriate.

The challenge in designing digital library architectures and systems is to accommodate these different models of information behavior. Selection, organization, and specialization should be permitted without being imposed. In addition, mechanisms for introducing selection, organization, and specialization should be flexible, extensible, and independent of other characteristics of the digital library, such as how content and services are physically distributed or how and by whom the components of the digital library are managed.

In this paper, we describe a design for a digital library collection service. The collection service is an independent mechanism for introducing structure into a distributed information space. Because it is independent of other services and mechanisms in the digital library, the collection service neither constrains other organizational models nor imposes structure where it is not needed or desired.

The motivation for the collection service design lies in traditions well established in the library community, where collection development serves three important roles:

The collection service architecture adapts these traditional collection-related roles to the distributed and dynamic nature of digital libraries. First, it defines collection membership through criteria rather than containment -- resources become members of the collection because they conform to a set of formal criteria (for example, subject classification, language, or genre). Such criteria allow automatic and/or dynamic selection of resources from a set of distributed information sources, based on either metadata about those resources or the content within the resources themselves. Second, by providing query routing and query pre-processing and post-processing facilities the collection service facilitates resource discovery that is tailored to the characteristics of the collection (rather than to the features of a specific search engine). Finally, the collection service acts as a distributed metadata repository, storing, disseminating, and processing data relevant to the management and administration of objects in the collection.
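Membership by criteria rather than containment can be sketched, under simplifying assumptions, as a predicate applied to resource metadata. The `Resource` record, its field names, and the `make_criteria` and `members` helpers below are hypothetical names chosen for illustration; they are not part of the collection service design.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Resource:
    """Hypothetical metadata record for a resource held in some repository."""
    identifier: str
    subject: str
    language: str


# A collection is defined by a predicate over metadata, not by a list of members.
Criteria = Callable[[Resource], bool]


def make_criteria(subject: str, language: str) -> Criteria:
    """Build a membership test from a subject classification and a language."""
    return lambda r: r.subject == subject and r.language == language


def members(criteria: Criteria, sources: Iterable[Iterable[Resource]]) -> List[Resource]:
    """Select members dynamically from any number of distributed sources."""
    return [r for source in sources for r in source if criteria(r)]


# Two illustrative repositories with mixed content.
repo_a = [Resource("a1", "computer science", "en"), Resource("a2", "economics", "en")]
repo_b = [Resource("b1", "computer science", "de"), Resource("b2", "computer science", "en")]

cs_english = make_criteria("computer science", "en")
print([r.identifier for r in members(cs_english, [repo_a, repo_b])])
```

Because membership is computed from the criteria, a resource added to either repository joins the collection without any change to the collection definition itself.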

The collection service is one of several services in the component-based digital library architecture that we are developing and experimenting with as part of our research. Other services include a repository service for storing digital content, a naming service for registering and resolving unique names for objects, and an index service that processes queries for the discovery of content.

Defining the collection service as a service distinct from other services, notably index and repository services, is significant for a number of reasons.

The remainder of this paper is structured as follows. Section 2, which follows, summarizes the component-based digital library architecture that is the context for the design described in this paper. Section 3 describes a collection abstraction that is appropriate for the new networked information environment. Section 4 describes the status of our implementation of the collection abstraction. We then give some concluding remarks in Section 5.

2. Establishing the Context: Component-based Distributed Digital Libraries

Over the past four years the Cornell Digital Library Research Group (CDLRG) has been researching the technology and deployment of distributed digital libraries. Our work on digital library architecture is based on the following principles:

The initial result of this work is Dienst [LSDK95], the technical foundation for the Networked Computer Science Technical Research Library [DL99] (NCSTRL - pronounced "ancestral"), a digital library of computer science research reports. More recently, we have been designing the Cornell Reference Architecture for Distributed Digital Libraries [LP98] (CRADDL - pronounced "cradle"), a set of components that form the core of a digital library infrastructure. CRADDL is being implemented in the CORBA distributed object framework; CRADDL services are deployed as CORBA objects, and service requests are expressed as method requests to those objects.

CRADDL defines a core set of digital library services, which interact as shown in Figure 1. By core, we mean the set of services that are necessary to provide basic digital library functionality: object naming and storage, object discovery, and user access. Because the architecture is open (its functionality is exposed through service-based protocols), other services can be added to enhance this core functionality.

Figure 1 - Interaction of core digital library services

A brief summary of these services and their interactions is as follows.

Of these five services, only the user interface is accessed directly by a human. The others are used by programs, in particular other CRADDL services, but also by other digital library or publishing systems. This modular design allows easy integration of higher-level digital library services (summarization services, payment services, and the like) with existing CRADDL services, or evolution of existing services as the architecture matures.

The modular design also creates a hierarchy of selection mechanisms in the digital library architecture, which facilitates and encourages the creation of customized digital libraries:

3. Defining a Collection in a Distributed Digital Library

Earlier in this paper, we defined three roles that collection services provide in the traditional library: selection, specialization, and administration. From the standpoint of user visibility, selection dominates these roles; the quality and usefulness of a library is generally determined by the resources available from it. Without a doubt, the models for collection selection and containment, as used in the traditional library sense, are challenged in the digital library.

First, and most obvious, the network makes it irrelevant whether the physical bits that make up a digital resource are located on a disk drive in the library or across the world. In fact, the notion of physical location for an individual resource in a digital library is ill-defined. A single resource, as perceived by a user, may actually be an aggregation of physical bit streams (or programmatically produced bit streams) from widely distributed sources. For example, consider a multimedia encyclopedia, which is a composition of text, images, audio, moving images, and live data feeds that reside on, or are produced from, distributed servers.

A more subtle and, from the policy point of view, more troublesome issue arises from the difficulty of placing a distinct boundary around the resources contained in a digital collection. Consider the problem of linkages among resources. For example, if Object A is contained in a collection, are objects B, C, and D that are linked to from Object A also contained in the collection? If so, are all objects transitively linked to Object A via other objects also contained? This issue has been explored by others as a problem of defining the "control zone" that libraries establish [RA96]. Solutions to this problem have important implications in areas such as legal liability and public service responsibilities.
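The boundary question can be made concrete with a small sketch: depending on how many link hops a policy admits, the set of objects "contained" along with Object A changes. The link graph and the `reachable` helper below are hypothetical and serve only to illustrate the question; the source does not prescribe any particular boundary rule.

```python
from collections import deque
from typing import Dict, List, Optional, Set


def reachable(links: Dict[str, List[str]], start: str,
              max_depth: Optional[int] = None) -> Set[str]:
    """Objects reachable from `start` by following at most `max_depth` links.

    With max_depth=None the full transitive closure is returned.
    """
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue  # boundary reached: do not follow links any further
        for target in links.get(node, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return seen


# Hypothetical link structure: A links to B, C, and D; B links onward to E, and so on.
links = {"A": ["B", "C", "D"], "B": ["E"], "E": ["F"], "D": ["A"]}

print(sorted(reachable(links, "A", max_depth=0)))  # only the object itself
print(sorted(reachable(links, "A", max_depth=1)))  # directly linked objects included
print(sorted(reachable(links, "A")))               # everything transitively linked
```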

Finally, traditional library policies for selection and acquisition are challenged by the different model of "publication" on the Internet. Traditional publishing, that which involves physical media (e.g., books, maps), is characterized by a relatively small number of publishing authorities (due to a high cost of entry) and a relatively low frequency of publication. Standard library practices rely on these characteristics to make the acquisitions and collection administration process manageable. For example, library acquisitions departments may "trust" the quality of certain publishers or series from certain publishers. To shortcut the overhead of item-by-item selection, libraries may in selected cases adopt blanket acquisitions policies for those series.

These selection and acquisition techniques are not appropriate for networked publishing for a number of reasons. On the net, cost of entry is relatively low and, in effect, anyone can become a publisher. There is no way to assume the legitimacy of these publishing authorities. (In fact, recent experiences in Internet news publication have shown that the time pressures of Web publication often challenge the standards of quality of supposedly "legitimate" publishers.) Because of the negligible cost of publishing, the frequency at which new resources appear is orders of magnitude greater than in traditional publishing. Finally, many of these resources are ephemeral, disappearing due to whim or poor administration by the publishing authority.

With these characteristics of networked resources in mind, we suggest the following definitions, both logical and operational, for a digital library collection and containment within that collection.

    1. They direct queries only to those index servers that can return objects in the collection.
    2. They employ filtering techniques, either within the queries or through post-processing the results of the queries, to select only those objects in the respective index servers that fit the collection criteria.
    3. They employ resource discovery aids that are specialized for the collection. Examples of such aids are domain-specific stop-word lists, stemming algorithms, or thesauri.

An example, shown in Figure 2, illustrates both this logical and operational definition. At the bottom of the figure are five repositories that provide access to a number of digital resources. The red-shaded circles in the repositories are resources about computer science and the green-shaded circles are resources about economics. (For the sake of this simple example, we can say that the aboutness of a resource is determined by the value of a controlled-vocabulary metadata field -- e.g., Dublin Core subject -- associated with the resource.) As illustrated, objects fitting these subject classifications are only located within some repositories and are mixed in with objects that do not fit the collection criteria.

Illustrated above the repositories is a set of index servers that download information (via some protocol) from these repositories. Each index server is administratively configured to access only certain repositories (based on quality judgements, licensing agreements, or other reasons). Discovery of resources about computer science involves querying only index servers "1" and "3", incorporating in the query a filter of the form "subject equals computer science". Similarly, discovery of resources about economics involves querying only index servers "2" and "3" with an appropriate query filter.
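The routing and filtering in this example can be sketched as follows. The index server contents, record identifiers, and the `discover` function are hypothetical stand-ins; only the assignment of servers "1" and "3" to computer science and "2" and "3" to economics comes from the example itself.

```python
from typing import Dict, List

# Hypothetical metadata harvested by each of the three index servers.
INDEX_SERVERS: Dict[str, List[dict]] = {
    "1": [{"id": "r1", "subject": "computer science"}, {"id": "r2", "subject": "history"}],
    "2": [{"id": "r3", "subject": "economics"}, {"id": "r4", "subject": "biology"}],
    "3": [{"id": "r5", "subject": "computer science"}, {"id": "r6", "subject": "economics"}],
}

# Each collection names the index servers worth querying and the filter to apply.
COLLECTIONS: Dict[str, dict] = {
    "computer science": {"servers": ["1", "3"], "subject": "computer science"},
    "economics": {"servers": ["2", "3"], "subject": "economics"},
}


def discover(collection: str) -> List[str]:
    """Return identifiers of resources that belong to the named collection."""
    spec = COLLECTIONS[collection]
    hits: List[str] = []
    for server in spec["servers"]:                    # selective query routing
        for record in INDEX_SERVERS[server]:
            if record["subject"] == spec["subject"]:  # "subject equals ..." filter
                hits.append(record["id"])
    return hits


print(discover("computer science"))
print(discover("economics"))
```

Index server "2" is never contacted for a computer science query, and the filter keeps unrelated records on servers "1" and "3" out of the result set.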

Selective query routing and filtering offers the important advantage of facilitating focused resource discovery. The need for focused resource discovery is one of the primary motivators of efforts to establish standards for Web metadata, where the goal is creating mechanisms that enable queries to focus on certain semantic characteristics of a resource (e.g., author, title, date of publication). Our concern here is permitting resource discovery to focus on a specific category of resources. For example, assume that an author "Joseph Halpern" publishes documents in both computer science and economics. The mechanism shown above makes it possible for user interfaces (and users) to designate that a search for documents that match the query "author equals Joseph Halpern" should only return resources from a specific collection, computer science. The combination of semantic focus and collection focus makes it possible for networked resource discovery to move beyond the "high recall without precision" problem that characterizes current web search services.

This definition of collection, and the resources that are members of it, has a number of advantages:

4. Collection Service Implementations

In the previous section we described a collection both conceptually, as a criterion for resource membership, and operationally, as tools for resource discovery. In this section, we describe implementations of this definition. First, we describe the implementation of the collection service in Dienst and its deployment in NCSTRL. Second, we describe work-in-progress to implement this as a CORBA service in a more component-based system.

4.1 The Dienst Collection Service

The NCSTRL collection currently provides access to over 24,000 computer science research reports from over 120 institutions. Discovery of documents and access to those documents involve the interoperation of over forty servers communicating via the Dienst protocol and proxy servers operating through FTP and HTTP.

The NCSTRL collection is logically and administratively divided into publishing authorities. Each publishing authority has control over addition and administration of documents in their own sub-collection repositories. The metadata fields (e.g., title, author, abstract) for each document in these repositories are then indexed by one or more index servers. The metadata is accessed through Dienst protocol requests to the respective repository. The Dienst collection service allows the federation of these index and repository servers into a single uniform collection. The Dienst protocol requests [DIENST] defined for the collection service give access to the following information:

Within NCSTRL, information from the collection service is used by user interface gateways to the collection. Each user interface service is configured with the host and port number of a collection server. Periodically (every hour) each user interface gateway contacts a collection server to obtain collection information, as described above, and stores that information internally in a table. The gateway later uses this table to create a search interface (for example, showing a list of publishing authorities from which a user may choose those to which searches should be restricted) and to determine to which index servers searches should be routed.
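The gateway behavior described here can be sketched as a small cache. This is an illustration under stated assumptions, not the Dienst implementation: the `CollectionCache` class, the shape of the table (publishing authority mapped to index servers), the authority names, and the host names are all hypothetical, and the sketch refreshes lazily when the table is older than an hour rather than on a timer.

```python
import time
from typing import Callable, Dict, List, Optional

# Hypothetical shape of collection information: publishing authority -> index servers.
CollectionInfo = Dict[str, List[str]]


class CollectionCache:
    """Keeps collection information and re-fetches it once it is older than max_age seconds."""

    def __init__(self, fetch: Callable[[], CollectionInfo],
                 max_age: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch          # stands in for a protocol request to the collection server
        self._max_age = max_age
        self._clock = clock
        self._table: Optional[CollectionInfo] = None
        self._fetched_at = 0.0

    def table(self) -> CollectionInfo:
        now = self._clock()
        if self._table is None or now - self._fetched_at >= self._max_age:
            self._table = self._fetch()
            self._fetched_at = now
        return self._table

    def authorities(self) -> List[str]:
        """Publishing authorities to list in the search interface."""
        return sorted(self.table())

    def route(self, chosen: List[str]) -> List[str]:
        """Index servers to query when a search is restricted to the chosen authorities."""
        table = self.table()
        return sorted({server for a in chosen for server in table.get(a, [])})


# Illustrative use with a fake collection server and a controllable clock.
fetch_calls: List[int] = []


def fake_fetch() -> CollectionInfo:
    fetch_calls.append(1)
    return {
        "authority.a": ["index1.example.org"],
        "authority.b": ["index1.example.org", "index2.example.org"],
    }


now = [0.0]
cache = CollectionCache(fake_fetch, clock=lambda: now[0])
print(cache.authorities())
print(cache.route(["authority.a"]))
```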

Figure 3 illustrates the interaction between a collection server, user interface servers, and index servers in Dienst. As shown, each user interface server queries the collection server (via protocol) for collection information. For a specific query, an individual user interface (labeled UI1 in the figure) uses this collection information to determine which index servers should process the query.

The Dienst collection service has a number of limitations. First, collection criteria are hard-wired into the implementation. As described above, the NCSTRL collection is partitioned into sub-collections that correspond to the partitioning of the name space for documents in NCSTRL (i.e., each sub-collection corresponds to a Handle System naming authority). Second, the Dienst protocol and server implementation limits the ability of user interface servers to interact with more than one collection (and its associated set of sub-collections). This has prevented us from expanding the NCSTRL service to include additional scholarly collections; for example, physics, mathematics, etc. Finally, the Dienst architecture incorrectly conflates the functions of the user interface service with query routing. Although the collection service provides information for query routing, the actual dispatch of queries to index servers takes place in the user interface service. This limits our capacity to perform query routing that is highly collection specific.

4.2 The CRADDL Collection Service

CRADDL is a component or service based digital library architecture that we are currently developing as a reference implementation of our research results and as a testbed for future research. Two CRADDL services, the FEDORA repository architecture and the STARTS index server implementation, are in the prototype phase and are available for interoperability testing. In this section, we describe our initial work on the design and prototyping of the collection service.

The CRADDL collection service is implemented as a set of distributed servers that act as a metadata repository for collection specific information, and that perform collection specific query routing in the manner described in Section 3. Each collection service maps to a single collection; in effect, a collection exists and is accessible in the digital library infrastructure if there is a collection service for it.

4.2.1 The collection service and user interface servers

As in Dienst, the main consumers of collection services are user interface gateways. Each user interface gives human-friendly access to one or more collections through interaction with the collection services corresponding to those collections. The interaction between user interface gateways and collection services, through defined protocol requests, involves the exchange of a number of types of metadata about the collection:

In addition, the interaction between a user interface service and the collection service involves submission of query requests and the return of corresponding result sets. This interaction is further described in Section 4.2.2.

Figure 4 gives an example of how a user interface service might use the collection metadata provided by collection servers. The user interface server shown subscribes to three collections: computer science, physics, and economics. People using this user interface can choose the collection to which they wish to direct their search, as shown in the first example form at the bottom left of the figure. (On a help screen, the user interface service might display the collection service-supplied descriptions of the collections.) Once a collection has been chosen (in this case, computer science), a query screen tailored to that collection is presented; the user interface uses information provided by the respective collection service to create this screen. As shown in the bottom right of the figure, this customized screen for computer science might include a choice of ACM classification as a query feature.

Figure 4 - User interface interaction with collection services

One open issue in the interaction between user interface servers and collection services is how user interface servers "learn" about collections and their associated services. We plan to examine methods whereby user interface servers can discover new collections (in the spirit of the old WAIS directory of servers).

4.2.2 Components of the collection service

Each collection service consists of two types of servers: a central collection server (CCS) and one or more collection query routers (CQR).

The CCS serves as the central point of management of the collection. Collection management involves creation and modification of:

Each CQR provides 1) local, replicated access to collection metadata and 2) query routing tailored for local conditions. The former (replication of metadata) makes sense from the standpoint of reliability. The logic for the latter (localized query routing) is as follows. We assume that index servers will be distributed globally with replication of individual index servers. As is well known, global connectivity varies dramatically. We can model patterns of global connectivity through the notion of a connectivity region [LFP98]. A connectivity region is defined as a group of nodes on the network that among themselves have good connectivity, relative to nodes outside of the region. (We note that connectivity regions do not necessarily correspond to geographic regions, due to peculiarities in the global telecommunications networks.) Localized query routing is then defined as dispatching queries, if possible, to those index servers that are within a single connectivity region. In case of index server failure, a backup index server should be chosen from another region with which there is relatively good connectivity.
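Localized query routing with fallback can be sketched as below. The function name, the region labels, and the numeric connectivity scores are hypothetical; in particular, how connectivity between regions is measured is left open here, and a simple "higher is better" score is assumed for illustration.

```python
from typing import Dict, List, Optional, Set


def pick_index_server(
    local_region: str,
    replicas: Dict[str, List[str]],              # region -> replicas of one index server
    connectivity: Dict[str, Dict[str, float]],   # region -> other region -> score (higher is better)
    down: Set[str],                              # servers currently known to have failed
) -> Optional[str]:
    """Prefer a replica in the local region; otherwise fall back to the best-connected region."""
    scores = connectivity.get(local_region, {})
    other_regions = sorted(
        (region for region in replicas if region != local_region),
        key=lambda region: scores.get(region, 0.0),
        reverse=True,
    )
    for region in [local_region, *other_regions]:
        for server in replicas.get(region, []):
            if server not in down:
                return server
    return None  # no replica of this index is reachable


# Illustrative configuration: one replica per region, region A is local.
replicas = {"A": ["a1"], "B": ["b1"], "C": ["c1"]}
connectivity = {"A": {"B": 0.4, "C": 0.9}}

print(pick_index_server("A", replicas, connectivity, down=set()))
print(pick_index_server("A", replicas, connectivity, down={"a1"}))
```

With the local replica available the query stays inside region A; when it fails, the replica in region C is chosen ahead of region B because C has the better assumed connectivity to A.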

The CQR is the mechanism for performing this localized query routing. Each connectivity region for a collection has a corresponding CQR. When a user interface subscribes to a collection it contacts the CCS from which it obtains a list of CQRs. Based on its own analysis of connection characteristics to the available CQRs it then chooses a CQR as its "local" connectivity region. The user interface then uses that CQR for collection specific queries, which are routed by the CQR to index servers in the connectivity region.
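One way a user interface might choose its "local" CQR is sketched below. The text leaves the analysis of connection characteristics unspecified, so the use of median round-trip time is an assumption, as are the `choose_cqr` function, the CQR names, and the `fake_probe` stand-in for a real measurement.

```python
import statistics
from typing import Callable, Dict, Iterable, Optional


def choose_cqr(cqrs: Iterable[str], probe: Callable[[str], float],
               samples: int = 3) -> Optional[str]:
    """Return the CQR with the lowest median round-trip time, or None if none responds."""
    best: Optional[str] = None
    best_rtt = float("inf")
    for cqr in cqrs:
        try:
            rtts = [probe(cqr) for _ in range(samples)]
        except OSError:
            continue  # unreachable CQR: leave it out of consideration
        rtt = statistics.median(rtts)
        if rtt < best_rtt:
            best, best_rtt = cqr, rtt
    return best


# Stand-in for a real measurement, e.g. timing a lightweight protocol request.
# None marks a CQR that cannot be reached.
FAKE_RTT_SECONDS: Dict[str, Optional[float]] = {
    "cqr-region-1": 0.180,
    "cqr-region-2": 0.035,
    "cqr-region-3": None,
}


def fake_probe(cqr: str) -> float:
    rtt = FAKE_RTT_SECONDS[cqr]
    if rtt is None:
        raise OSError(f"{cqr} is unreachable")
    return rtt


# The list of CQRs would be obtained from the CCS when subscribing to the collection.
print(choose_cqr(FAKE_RTT_SECONDS, fake_probe))
```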

Figure 5 - Distributed collection service and connectivity regions

Figure 5 illustrates this regional architecture and service interactions with it. The three gray circles represent connectivity regions. Each region has a collection query router (CQR), pictured in yellow, and a set of index servers, pictured in red. (Note that index servers are assigned to a region in the context of a specific collection. Another collection might assign an index server to a completely different region.) The CQRs communicate with the central collection server (CCS), shown as the blue rectangle, to obtain copies of collection metadata. When a user interface server, shown in green, subscribes to the collection, it first contacts the CCS. Once the user interface chooses a CQR, it then submits queries to that CQR, which then dispatches those queries to regional index servers (the communication links shown as black arrows).

In Dienst and NCSTRL we implemented a limited version of this regional architecture in which regions are statically configured. In reality, connectivity between nodes on the Internet is highly dynamic. The configuration of regions -- the index servers that are members of a region -- should adapt to changing connectivity and server load. We are currently exploring methods for sharing load information among the CQRs and the CCS to allow dynamic region configuration. This information sharing is shown in Figure 5 by the red communication arrows between the CCS and CQRs.

5. Conclusion

The physical proximity or collocation of resources is irrelevant to networked information systems. Globally distributed content can be immediately and uniformly available. The current World Wide Web demonstrates the advantages of such universal access, yet it also shows its flaws. Attributes of the traditional library such as organization, specialization, and selection have been shown, in many situations, to be necessary for effective resource discovery and use.

We have described in this paper a mechanism that facilitates such organization, specialization, and selection in a distributed information space. The logical independence of this mechanism, the collection service, from other digital library services allows the organizational dimension to be independent from the physical distribution of content and the administration of that content, and it allows the coexistence of several organizational schemes. Moreover, it does not prohibit the dissemination of, discovery of, and access to content and services in the relatively chaotic fashion that makes the current World Wide Web such a success.

Finally, we have described two implementations of the collection service. The first, somewhat limited, implementation is deployed as part of the globally distributed NCSTRL collection. The second, more powerful, implementation is currently under development as part of our digital library architecture research.

Acknowledgements

The work described in this paper was funded by the Defense Advanced Research Projects Agency under Grant No. MDA 972-96-1-006 with the Corporation for National Research Initiatives. This paper does not necessarily represent the views of CNRI or DARPA. We would also like to acknowledge the contributions of the other members of the Cornell Digital Library Research Group: Naomi Dushay, Sandra Payette, and Dean Krafft. Finally, we would like to thank Jim Davis, whose initial design of Dienst made this work possible.

References

[LSDK95] C. Lagoze, E. Shaw, J. R. Davis, and D. B. Krafft, "Dienst Implementation Reference Manual", Cornell Computer Science Technical Report TR95-1514, May 1995, http://cs-tr.cs.cornell.edu:80/Dienst/UI/1.0/Display/ncstrl.cornell/TR95-1514.

[DL99] J. R. Davis and C. Lagoze, "NCSTRL: Design and Deployment of a Globally Distributed Digital Library", to appear in IEEE Computer, February 1999.

[LP98] C. Lagoze and S. Payette, "An Infrastructure for Open-Architecture Digital Libraries", Cornell Computer Science Technical Report TR98-1690, June 1998, http://cs-tr.cs.cornell.edu:80/Dienst/UI/1.0/Display/ncstrl.cornell/TR98-1690.

[PL98] S. Payette and C. Lagoze, "Flexible and Extensible Digital Object and Repository Architecture (FEDORA)", Second European Conference on Research and Advanced Technology for Digital Libraries (ECDL98), Heraklion, Crete, September 1998.

[KW95] R. E. Kahn and R. Wilensky, "A Framework for Distributed Digital Object Services", Corporation for National Research Initiatives, http://www.cnri.reston.va.us/cstr/arch/k-w.html.

[DLP98] R. Daniel Jr., C. Lagoze, and S. Payette, "A Metadata Architecture for Digital Libraries", Advances in Digital Libraries 1998, Santa Barbara, April 1998.

[GC97] L. Gravano, K. Chang, H. Garcia-Molina, C. Lagoze, and A. Paepcke, "STARTS: Stanford Protocol for Internet Retrieval and Search", January 1997, http://www-db.stanford.edu/~gravano/starts.html.

[RA96] R. Atkinson, "Library Functions, Scholarly Communication, and the Foundation of the Digital Library: Laying Claim to the Control Zone", The Library Quarterly, July 1996.

[GKR98] D. Gibson, J. Kleinberg, and P. Raghavan, "Inferring Web Communities from Link Topology", in Proceedings of the 9th ACM Conference on Hypertext and Hypermedia, Pittsburgh 1998.

[PMRC98] A. Paepcke, H. Garcia-Molina, G. Rodriquez, and J. Cho, "Beyond Document Similarity: Understanding Value-Based Search and Browsing Technologies", Stanford University Technical Report SIDL-WP-1998-0099.

[LFP98] C. Lagoze, D. Fielding, and S. Payette, "Making Global Digital Libraries Work: Collection Services, Connectivity Regions, and Collection Views", ACM Digital Libraries ’98, Pittsburgh, June 1998.

[DIENST] J. Davis and C. Lagoze, "Dienst Protocol Version 4.1", http://www.cs.cornell.edu/NCSTRL/protocol.html.

[DFL98] N. Dushay, J. L. French, and C. Lagoze, "Distributed Searching: Predicting Performance of Remote Indexers", forthcoming.

[RDF98] O. Lassila and R. R. Swick eds., "Resource Description Framework (RDF) Model and Syntax Specification", W3C Working Draft 08 October 1998, http://www.w3.org/TR/WD-rdf-syntax/.

[RDF98b] D. Brickley, R. V. Guha, and A. Layman eds., "Resource Description Framework (RDF) Schema Specification", W3C Working Draft 14 August 1998, http://www.w3.org/TR/WD-rdf-schema/.

Copyright © 1998 Carl Lagoze and David Fielding


hdl:cnri.dlib/november98-lagoze