Advanced Web Presentation through Data Modeling

An Open Architecture for the Personalized Webs of the Future

Leon Shklar

Bellcore, 445 South St., Morristown, NJ 07960
Computer Science Department, Rutgers University, New Brunswick, NJ 08903

D-Lib Magazine, April 1996

ISSN 1082-9873

1.0 Introduction

The explosion in the amount and variety of information has made it difficult to know what information exists, where it is located, and how it can be retrieved. The explosion has accelerated further with the acceptance of the World Wide Web [ber92], causing a universal rush to create Web pages and use them to provide on-line access to the vast legacy of existing heterogeneous information. Such information ranges from documents in a variety of proprietary representation formats to engineering and financial databases, and is often accessible only through specialized vendor tools and locally developed applications. Moreover, rapidly increasing sophistication in presenting information on the Web is already forcing us to treat FTP and Gopher information sources, and even early HTML pages, as parts of the same legacy.

The main focus of this article is the InfoHarness™ system [shk94,shk95-1], which is designed to provide Web access to existing heterogeneous information without any relocation, reformatting and restructuring of data. InfoHarness has been productized and is now a part of Bellcore's ADAPT/X product line. It has been designed with an open, extensible, and modular architecture. A prototype extension of InfoHarness, called GeoHarness¹, is being developed by the members of the USDAC Consortium² for accessing geospatial data. It is also used to support advanced Web presentation of existing heterogeneous information in other domains, e.g., at Rutgers University for accessing judicial opinions from Federal Appeals courts and the U.S. Supreme Court (see the example in Section 3.3). The current prototype supports the largely automatic generation of InfoHarness and GeoHarness repositories, and provides access to raw data from Mosaic, Netscape and other Web browsers through a gateway program.

In Section 2, we discuss current methods and tools for providing Web access to existing heterogeneous information. We see the most promising approach in building logical data models and using them to support all kinds of sophisticated presentation of the original information on the World Wide Web. Section 3 provides a general description of the InfoHarness system. In Section 3.1, we briefly describe the object model. In Section 3.2, we discuss the generation of InfoHarness repositories, followed by an example in Section 3.3. Section 4.0 provides a brief summary and a discussion of our current work.

2.0 Providing Web Access to Existing Information

There have been numerous attempts to provide partial remedies for data heterogeneity by implementing a variety of ever-changing filters for format conversions. The filters are used to generate HTML documents either dynamically, using the Common Gateway Interface (CGI) mechanism, or off-line. The off-line approach requires substantial human and computing resources for the initial conversion and maintenance of information. Maintaining the repositories presents the additional dilemma of either creating new and updating existing information in HTML, or continuously managing evolving data in multiple formats. The dynamic approach postpones the conversion until the information is requested and eliminates problems with the initial processing and maintenance of information. However, access-time conversion may not be appropriate for some rich document formats (FrameMaker, etc.) for the following reasons:

Slowness of the conversion.
Need to "tune" the filters to support local standards.
Low quality of generated HTML (may often require human post-processing).

Using the Multipurpose Internet Mail Extensions (MIME) [bor93], supported by most Web browsers, helps to avoid data conversion through the use of third-party presentation tools. However, it may require renaming the original files because MIME's type recognition mechanism relies on file extensions. Even though the mapping of file extensions into MIME types is customizable, it is still fixed for every given server, unless the type assignment is performed by some specialized gateway program. Adding support for new MIME types often requires end users to obtain and install third-party tools. Further, MIME alone does not provide any support for logically linking together relevant documents.
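The extension-based type recognition described above can be illustrated with a minimal sketch. It uses Python's standard mimetypes module as a stand-in for a server's mapping table; this is an assumption made for illustration, not a tool discussed in this article. The file names and the application/x-rmail type are hypothetical.

```python
import mimetypes

def guess_with_default(name, default="application/octet-stream"):
    """Return the extension-based guess, or a generic default type."""
    mime_type, _encoding = mimetypes.guess_type(name)
    return mime_type or default

# Hypothetical file names: only the extension influences the guess,
# so a legacy file without an extension falls back to the default.
for name in ("report.html", "figure.gif", "legacy_report"):
    print(name, "->", guess_with_default(name))

# The mapping is customizable, but it then holds for every file served.
# "application/x-rmail" is a made-up type used only for this example.
mimetypes.add_type("application/x-rmail", ".rmail")
print("inbox.rmail ->", guess_with_default("inbox.rmail"))
```

A file such as legacy_report would have to be renamed before a mapping like this could recognize it, which is the limitation noted above.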

Nevertheless, many systems rely on MIME as their primary presentation mechanism. The OMNIS system [cla95] has been designed to provide access to library information that includes both catalogs and digitized texts. The scanned-in documents may contain images, PostScript or other formatted information, and are stored in a database. At presentation time, the OMNIS gateway converts textual information to HTML, while images are converted to common MIME types before being passed to the browser. This is quite feasible because OMNIS has full control over the format and representation of information that is stored in its database.

Harvest [bow94] provides support for extracting summaries from distributed heterogeneous information and for executing searches over these summaries. Once the resources have been identified, the responsibility of accessing them is handed over to the Web browsers. Harvest provides efficient and flexible methods of indexing widely distributed information. MIME mappings are used to provide access to the wide variety of information, so the problems that were described earlier still persist.

There have been a number of attempts to build logical models of distributed heterogeneous information and use these models to support advanced Web presentation. The Multimedia-Oriented Repository Environment (MORE) [eic94] was designed as a set of CGI programs that operate in conjunction with a stock httpd server to provide access to a relational database containing meta-information, which specifies how to retrieve physical data. The meta-information is entered into the database off-line by the human librarians.

WebMake [bae95] introduces methods for building Web structures over existing software, e.g. source and object code for software systems. In WebMake, meta-level structural documents are used to create abstractions by logically combining software modules or other structural documents. A set of tools has been developed to provide a distributed software development environment by utilizing the CGI mechanism. A specialized Web client is required to obtain full access to the WebMake functionality.

HyperG [and95] uses an object-oriented database layer to provide information modeling and model maintenance facilities in addition to integrated attribute and content-based search. The system supports logical grouping of documents into collections that may span multiple HyperG servers. Special cluster collections are used to group together related multimedia and multi-lingual information. HyperG uses its own HyperG Text Format (HTF) that is converted to HTML by the HyperG servers when they respond to HTTP requests.

The objective of the InfoHarness system [shk94, shk95-1] is to provide Web access to large amounts of heterogeneous information in a distributed environment without any relocation, restructuring, or reformatting of data. Like MORE and HyperG, InfoHarness uses metadata for search and retrieval of heterogeneous information (Figure 3). It provides advanced search and browsing capabilities without imposing constraints on information suppliers or creators. InfoHarness utilizes stable abstract class encapsulation and presentation hierarchies that need not be modified to add terminal classes that accommodate new kinds of information and new indexing technologies. InfoHarness provides tools for the automatic generation of metadata based on user inputs and the analysis of existing information.

Closely related to this effort is our work on defining an Information Repository Definition Language (IRDL) [shk95-2] - a high-level language for describing information resources and the desired logical structure of information repositories. The language provides high flexibility in imposing abstractions on heterogeneous information. Presently, the IRDL interpreter generates InfoHarness metadata entities. With the emergence of Web objects, it should become possible to perform the direct generation of Web data structures.

Figure 1. InfoHarness Architecture.

3.0 The InfoHarness System

The main components of the current InfoHarness implementation include (Figure 1):

The InfoHarness server, which uses metadata to traverse, search, and retrieve the original information.
The CGI gateway, which is used to pass requests from HTTP clients to the InfoHarness server (via an HTTP server) and responses back to the clients.
The metadata generator, which supports the off-line generation of the InfoHarness metadata entities representing the desired logical structure and organization of the original information. This metadata is used by the InfoHarness server to support dynamic search and presentation of raw data.

At access-time, the Web clients issue query, traversal, or retrieval requests that are passed on to the gateway, which performs the following operations:

Parses the request, and reads input information when the request is associated with an HTML form.
Establishes a socket connection with the InfoHarness server, generates and sends out a request, and waits for a response.
Parses the response, converts it to a combination of HTML forms and hyperlinks, adds an HTTP header, and passes the transformed response to an HTTP browser.
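The three gateway operations listed above can be sketched as follows. This is an illustrative outline, not the InfoHarness gateway: the wire format (tab-separated key=value pairs in the request, object_id<TAB>title lines in the response), the function names, and the in-process fake server are all assumptions.

```python
import html
import socket
import threading
from urllib.parse import parse_qs

def parse_request(query_string):
    """Step 1: decode form input into a flat dict (first value per field)."""
    return {key: values[0] for key, values in parse_qs(query_string).items()}

def read_all(sock):
    """Read from a socket until the peer closes its sending side."""
    chunks = []
    while True:
        data = sock.recv(4096)
        if not data:
            return b"".join(chunks).decode("utf-8")
        chunks.append(data)

def ask_server(sock, fields):
    """Step 2: send one request line (assumed format) and wait for the reply."""
    line = "\t".join(f"{k}={v}" for k, v in sorted(fields.items())) + "\n"
    sock.sendall(line.encode("utf-8"))
    sock.shutdown(socket.SHUT_WR)  # signal the end of the request
    return read_all(sock)

def to_http_response(server_reply):
    """Step 3: convert 'object_id<TAB>title' lines to hyperlinks plus a header."""
    items = []
    for row in server_reply.splitlines():
        if not row.strip():
            continue
        object_id, title = row.split("\t", 1)
        href = html.escape(object_id, quote=True)
        items.append(f'<li><a href="?id={href}">{html.escape(title)}</a></li>')
    body = "<ul>\n" + "\n".join(items) + "\n</ul>\n"
    return "Content-Type: text/html\r\n\r\n" + body

def fake_server(conn):
    """Stand-in for the repository server; returns two made-up hits."""
    read_all(conn)  # consume the request
    conn.sendall(b"iho-1\tFirst opinion\niho-2\tSecond <draft> opinion\n")
    conn.close()

# Demonstration over a local socket pair instead of a real network server.
gateway_side, server_side = socket.socketpair()
worker = threading.Thread(target=fake_server, args=(server_side,))
worker.start()
reply = ask_server(gateway_side, parse_request("query=habeas+corpus&max=2"))
worker.join()
gateway_side.close()
print(to_http_response(reply))
```

Escaping titles and identifiers before embedding them in HTML keeps raw data from being interpreted as markup by the browser.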

The InfoHarness architecture is open, modular, extensible and scalable. The InfoHarness server implements the abstract class presentation hierarchy that does not have to be modified to support a new data type, or a new indexing technology [shk95-1]. The methods associated with abstract classes are general enough because they are data-driven and can invoke independent programs. The definitions of terminal classes are also data-driven and are not part of the implementation, which makes the system capable of supporting arbitrary information access and management tools (e.g., browsers, indexing technologies, access methods).

Figure 2. Simple and Composite Objects.

3.1 The Object Model

As mentioned earlier, an important advantage of InfoHarness is that it provides access to heterogeneous information without making any assumptions about its location and representation. This is achieved by generating metadata and associating it with the original information. Metadata entities, which encapsulate units of information that are of interest to end-users, are called encapsulation units (EU). An EU may be associated with a file (e.g., a man page), a portion of a file (e.g., a C function), a set of files (e.g., a set of related man pages), or a request for the retrieval of data from an external source (e.g., a database query). For example, a C file and a function that occurs in this file may, in different contexts, each present a unit of interest to end-users. Consequently, they may be encapsulated by separate metadata entities.

An InfoHarness object (IHO) is defined recursively as one of the following:

A simple object, composed of a single encapsulation unit and, optionally, attribute-value pairs.
A collection object, composed of a set of references to other InfoHarness objects (its children) and optional attributes.
A composite object that combines a simple object and a set of references to other InfoHarness objects.

An InfoHarness object contains a unique identifier that is recognized and maintained by the system. Each simple and composite object (Figure 3) stores the possibly remote location of raw data, the logical address of the encapsulated portion of this data (e.g., name of a C function or title of a document section), and typing information that determines the access-time data presentation method. Data location may be expressed by any legal Uniform Resource Locator (URL), which makes it possible not only to model local legacy information but also to create multiple views of the existing Web resources.

For example, an object that encapsulates a C function would be assigned a presentation type C and would contain both the location of a C file containing the function and the name of this function. The type of this object determines the presentation method that would separate out the function at presentation time. In addition, each object may contain an arbitrary number of attribute-value pairs (e.g., owner, last update, security information, decompression method, etc.).
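The object model described above might be represented along the following lines. This is a minimal sketch under stated assumptions: the class names, field names, and the sample URL and function name are hypothetical, and an object that holds an encapsulation unit but no child references is treated here as a simple object.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class EncapsulationUnit:
    location: str                          # any legal URL, possibly remote
    presentation_type: str                 # selects the access-time presentation method
    logical_address: Optional[str] = None  # e.g. a C function name; None = whole resource

@dataclass
class InfoHarnessObject:
    object_id: str                                        # unique identifier
    unit: Optional[EncapsulationUnit] = None              # absent for collections
    children: List[str] = field(default_factory=list)     # ids of referenced objects
    attributes: Dict[str, str] = field(default_factory=dict)

    @property
    def kind(self) -> str:
        """Classify the object as simple, collection, or composite."""
        if self.unit is not None and self.children:
            return "composite"
        if self.unit is not None:
            return "simple"
        return "collection"

# A simple object encapsulating one C function (hypothetical file and name).
c_function = InfoHarnessObject(
    object_id="iho-1",
    unit=EncapsulationUnit("file:///src/parser.c", "C", "parse_header"),
    attributes={"owner": "alice"},
)
library = InfoHarnessObject(object_id="iho-2", children=["iho-1"])
print(c_function.kind, library.kind)
```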

Figure 3. InfoHarness Collections.

An object that contains references to a set of other InfoHarness objects may be either a collection or a composite object. Only composite objects may contain an encapsulation unit (Figure 2). A sample composite object both encapsulates an abstract of a paper and contains references to objects that encapsulate text, HTML, PostScript and LaTeX versions of the full paper. Collection objects may contain references to independent indices that in turn reference their child objects (Figure 3). An index may be created either from the encapsulated contents of child objects or from the values of their attributes (an information source of the index). By an abuse of notation, we will refer to such collection objects as indexed collections, and say that an InfoHarness object belongs to an indexed collection if it is a child of a collection object.

An indexed collection contains information about the index source, type, and the location of the associated index structures. The type ensures proper selection of query and mapping methods, the latter responsible for mapping selected information into InfoHarness objects (Figure 3). Consequently, any indexed collection may make use of external data retrieval methods that are not parts of InfoHarness, making it possible to utilize existing heterogeneous index structures.
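The type-driven selection of query and mapping methods can be sketched as a lookup table. Everything here is an assumption made for illustration: the "keyword" index type, the in-memory index, and the function names are hypothetical and stand in for external retrieval tools.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Toy stand-ins for external index structures and for the hit-to-object table.
TOY_INDEXES = {
    "mem://opinions": {"tax": ["doc1.txt", "doc3.txt"], "patent": ["doc2.txt"]},
}
HIT_TO_OBJECT = {"doc1.txt": "iho-1", "doc2.txt": "iho-2", "doc3.txt": "iho-3"}

def keyword_query(index_location: str, query: str) -> List[str]:
    """Query method: look the term up in the (toy) external index."""
    return TOY_INDEXES[index_location].get(query.lower(), [])

def keyword_mapping(hits: List[str]) -> List[str]:
    """Mapping method: translate raw hits into object identifiers."""
    return [HIT_TO_OBJECT[hit] for hit in hits]

# The index type selects the pair of methods; new types are added as data.
INDEX_TYPES: Dict[str, Tuple[Callable, Callable]] = {
    "keyword": (keyword_query, keyword_mapping),
}

@dataclass
class IndexedCollection:
    index_source: str    # "content" or the name of an attribute
    index_type: str      # key into INDEX_TYPES
    index_location: str  # where the index structures live

def search(collection: IndexedCollection, query: str) -> List[str]:
    query_method, mapping_method = INDEX_TYPES[collection.index_type]
    return mapping_method(query_method(collection.index_location, query))

opinions = IndexedCollection("content", "keyword", "mem://opinions")
print(search(opinions, "Tax"))
```

Because the methods are found through the table, supporting another retrieval tool means registering one more entry; the search function itself is unchanged.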

An InfoHarness repository is a set of objects that are known to a single InfoHarness server. Any object may be a member of an arbitrary number of collections (its parents). An object that has one or more parents always contains the unique object identifiers of its parent objects. An object without a parent cannot be reached from other objects and may only be accessed if it is used as a starting point (or entry point) of the traversal.
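The parent and entry-point relationships can be made concrete with a small sketch. The repository is modeled here as a plain dictionary from object identifier to child identifiers; this representation and the identifiers are assumptions for illustration only.

```python
def parent_links(children_of):
    """Derive child-to-parents links from parent-to-children links."""
    parents = {object_id: set() for object_id in children_of}
    for parent, children in children_of.items():
        for child in children:
            parents.setdefault(child, set()).add(parent)
    return parents

def entry_points(children_of):
    """Objects without parents: reachable only as traversal starting points."""
    links = parent_links(children_of)
    return sorted(object_id for object_id, found in links.items() if not found)

# Hypothetical repository: "c" belongs to two collections, "a" and "b".
repository = {"root": ["a", "b"], "a": ["c"], "b": ["c"], "c": [], "orphan": []}
print(parent_links(repository)["c"])
print(entry_points(repository))
```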

3.2 Building InfoHarness Repositories

The metadata generator (Figure 1) supports the off-line creation of metadata entities. The generator commands either encapsulate the raw data, or group existing objects into sets. In addition, the generator is responsible for the creation of independent indices that reference members of indexed collections.

There are only three different generator commands:

The encapsulate command requires information about the type and location of physical data. It returns a set of InfoHarness objects, each of which encapsulates a portion of the data. The boundaries of these portions are determined by the type. For example, an encapsulation command may refer to the type rmail and the location of an RMAIL file. The output in this example is a set of objects, each of which is associated with a separate mail message.
The group command requires a set of pointers to individual objects and, optionally, the desired type of the index. The command generates an object associated with the collection, as well as parent-child and child-parent relationships between the collection objects and each member of the input set. The optional type parameter determines the technology to be used for indexing physical data associated with member objects. If the type is not specified, no index is created.
The merge command requires an object and a set of references to additional objects. It produces a composite object that encapsulates the same physical data as the input object and contains the mentioned set of references.
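The three generator commands can be sketched as ordinary functions over dictionaries. This is a hedged illustration rather than the actual generator: the "paragraphs" type, the in-memory raw data, the identifier scheme, and the choice to give the merged object a new identifier are all assumptions, and no index is actually built.

```python
import itertools

_ids = itertools.count(1)

def new_id():
    return f"iho-{next(_ids)}"

# Stand-in for raw data that a real generator would read from its location.
RAW = {"mem://notes.txt": "first message\n\nsecond message\n\nthird message\n"}

# The type decides where one encapsulated portion ends and the next begins.
SPLITTERS = {
    "paragraphs": lambda text: [p for p in text.split("\n\n") if p.strip()],
    "whole": lambda text: [text],
}

def encapsulate(data_type, location):
    """Return one object per portion of the data, as delimited by the type."""
    portions = SPLITTERS[data_type](RAW[location])
    return [
        {"id": new_id(), "type": data_type, "location": location,
         "portion": index, "children": [], "parents": []}
        for index, _ in enumerate(portions)
    ]

def group(members, index_type=None):
    """Create a collection object and record links in both directions."""
    collection = {"id": new_id(), "children": [m["id"] for m in members],
                  "parents": [], "index_type": index_type}  # type recorded only
    for member in members:
        member["parents"].append(collection["id"])
    return collection

def merge(obj, references):
    """Create a composite with the data of obj plus a set of references."""
    composite = dict(obj, id=new_id(), parents=[],
                     children=[ref["id"] for ref in references])
    for ref in references:
        ref["parents"].append(composite["id"])
    return composite

messages = encapsulate("paragraphs", "mem://notes.txt")
inbox = group(messages, index_type="full-text")
summary = merge(encapsulate("whole", "mem://notes.txt")[0], messages)
print(len(messages), inbox["children"], summary["children"])
```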

To simplify using the generator, InfoHarness utilizes a number of macros for standard operations. For a more uniform solution to building information repositories, we have designed a high-level Information Repository Definition Language (IRDL), which supports very compact and simple specifications for building large and complex information repositories [shk95-2]. With IRDL, the generator commands are produced by the IRDL Interpreter.

3.3 Sample InfoHarness Repository

To illustrate the concepts discussed in Sections 3.1 and 3.2, we discuss how to use InfoHarness for advanced search and presentation of judicial opinions from the U.S. Supreme Court that are available as a collection of plain text files at ftp.cwru.edu (a live demo is available). Here, information related to a single court case may be distributed between multiple files.

Given the location of the original information, the desired access-time presentation of individual cases, and the desired full-text indexing technology, we have implemented a twenty-line (including declarations) IRDL program that generates the repository of the Supreme Court cases by performing the following steps:

Creates simple objects that encapsulate individual judicial opinions (one per file). The encapsulation method determines the case numbers for the opinions and stores them as attributes of the encapsulating objects.
For each object created in step 1, finds other objects related to the same case, encapsulates them together with the presentation type Case, and excludes them from any further consideration. The presentation method for this type is responsible for generating the internal hyperlinks to individual opinions and the external hyperlinks to related information, such as the Supreme Court photo and bios of the judges (Figure 5).
Creates an indexed collection of the objects created in step 2 using the Latent Semantic Indexing technology [dum88].
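The three steps above can be outlined in Python. This sketch is not the IRDL program and performs no Latent Semantic Indexing; it only records the requested index type. The file names, file contents, header layout, and the regular expression for case numbers are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical files; the real file names and header layout may differ.
FILES = {
    "93-0001.opinion.txt": "No. 93-0001\nOpinion of the Court ...",
    "93-0001.dissent.txt": "No. 93-0001\nDissenting opinion ...",
    "93-0002.opinion.txt": "No. 93-0002\nOpinion of the Court ...",
}
CASE_NUMBER = re.compile(r"No\.\s*(\d+-\d+)")  # assumed header pattern

def encapsulate_opinions(files):
    """Step 1: one simple object per file, case number kept as an attribute."""
    objects = []
    for name, text in sorted(files.items()):
        match = CASE_NUMBER.search(text)
        case = match.group(1) if match else "unknown"
        objects.append({"id": name, "type": "Opinion",
                        "attributes": {"case": case}})
    return objects

def build_cases(opinions):
    """Step 2: group opinions of the same case under one object of type Case."""
    by_case = defaultdict(list)
    for opinion in opinions:
        by_case[opinion["attributes"]["case"]].append(opinion["id"])
    return [{"id": f"case-{number}", "type": "Case", "children": ids}
            for number, ids in sorted(by_case.items())]

def index_cases(cases, index_type="lsi"):
    """Step 3: an indexed collection over the cases (index type recorded only)."""
    return {"id": "supreme-court", "index_type": index_type,
            "children": [case["id"] for case in cases]}

cases = build_cases(encapsulate_opinions(FILES))
collection = index_cases(cases)
print(collection["children"])
print(cases[0]["children"])
```

Grouping through a dictionary keyed by case number has the same effect as excluding already grouped opinions from further consideration: each opinion ends up in exactly one case.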

Figure 4. Query Interface for the Example in Section 3.3.

Figure 5. Data Presentation for the Example in Section 3.3.

In this example, each indexed object is a composite object (Section 3.1). Consequently, when presenting the results of a query (Figure 4), for each case we see not only a hyperlink for its content but also hyperlinks for the individual opinions. The internal hyperlinks for individual opinions are also available when presenting an individual case (Figure 5).

4.0 Conclusions

We have discussed using InfoHarness to provide Web access to existing heterogeneous information. We believe that the system's open and extensible architecture makes it well-positioned to take full advantage of exciting new developments in Web technology. Even though Java [gos95] and other emerging mobile code systems provide new opportunities in presenting information on the Web, we see them as complementary to the new developments in the traditional HTML presentation and the HTTP protocol.

Advancing Web technology is likely to rapidly antiquate the existing Web structures, including images, applets, and static and dynamic HTML pages. These structures represent a tremendous investment and cannot be recreated with every new step in technological advance. Consequently, modeling methods that support advanced presentation of existing heterogeneous information have to progress as well. Similar methods should be applicable to building virtual Webs, with both navigation and presentation controlled by personalized meta-information.