Bellcore,
445 South St., Morristown, NJ 07960
Computer Science Department,
Rutgers University,
New Brunswick, NJ 08903
[email protected]
D-Lib Magazine, April 1996
The explosion in the amount and variety of information has made it
difficult to know what information exists, where it is located, and how
to retrieve it. This explosion has accelerated further with the
acceptance of the World Wide Web [ber92],
causing a universal rush to create Web pages and use them to provide
on-line access to the vast legacy of existing heterogeneous information.
Such information ranges from documents in a variety of proprietary
representation formats to engineering and financial databases, and
is often accessible only through specialized vendor tools and locally
developed applications. Moreover, the rapidly increasing sophistication in
presenting information on the Web is already forcing us to treat
ftp and gopher information sources, and even early HTML pages,
as parts of the same legacy.
The main focus of this article is the InfoHarness(TM) system
[shk94,shk95-1],
which is designed to provide Web access to existing heterogeneous
information without any relocation, reformatting and restructuring of data.
InfoHarness has been productized and is now a part of
Bellcore's ADAPT/X product line.
It has been designed with an open, extensible, and modular architecture.
A prototype extension of InfoHarness, called GeoHarness [1],
is being developed by the members of the
USDAC Consortium [2] for accessing geospatial data.
InfoHarness is also used to support advanced Web presentation of existing
heterogeneous information in other domains, e.g., at Rutgers University for
accessing judicial opinions from Federal Appeals courts and the
U.S. Supreme Court (see the example in Section 3.3).
The current prototype supports the largely automatic
generation of InfoHarness and GeoHarness repositories, and provides
access to raw data from Mosaic, Netscape and other Web browsers through
a gateway program.
In Section 2, we discuss
current methods and tools for providing Web access to existing
heterogeneous information.
We see the most promising approach in building logical data
models and using them to support all kinds of sophisticated
presentation of the original information on the World Wide Web.
Section 3 provides a general description of the InfoHarness system.
In Section 3.1, we briefly describe the object model.
In Section 3.2, we discuss the generation of InfoHarness repositories,
followed by an example in Section 3.3.
Section 4.0 provides a brief summary and a discussion
of our current work.
2.0 Providing Web Access to Existing Information
There have been numerous attempts to provide partial remedies for data
heterogeneity by implementing a variety of ever-changing
filters for format conversions. The filters are used to generate
HTML documents either dynamically, using the
Common Gateway Interface (CGI) mechanism, or off-line. The off-line approach
requires substantial human and computing resources for the initial conversion
and maintenance of information. Maintaining the repositories presents the
additional dilemma of either creating new and updating existing information
in HTML, or continuously managing evolving data in multiple formats.
The dynamic approach postpones the conversion until the information
is requested and eliminates problems with the initial processing and
maintenance of information. However, access-time conversion may not be
appropriate for some rich document formats (FrameMaker, etc.) for the
following reasons:
HTML (may often require human post-processing).
Using the Multipurpose Internet Mail Extensions (MIME) [bor93],
supported by most Web browsers, helps to avoid data conversion
through the use of third-party presentation tools. However, it may require
renaming the original files because MIME's
type recognition mechanism relies on file extensions.
Even though the mapping of file extensions into MIME types is customizable,
it is still fixed for every given server, unless the type
assignment is performed by some specialized gateway program.
Adding support for new MIME
types often requires end users to obtain and install third-party tools.
Further, MIME alone does not provide any
support for logically linking together relevant documents.
Nevertheless, many systems rely on MIME as their primary presentation
mechanism.
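The dependence on file extensions, and the gateway-side workaround, can be
illustrated with a minimal Python sketch. It uses the standard mimetypes
module for the extension lookup; the override table, its path, and the
function name assign_mime_type are assumptions made for illustration and are
not part of any system described here.

```python
import mimetypes

# Hypothetical per-file overrides that a gateway program might keep for
# files whose names carry no useful extension (not from the article).
OVERRIDES = {
    "/archive/opinions/93-1234": "text/plain",
}


def assign_mime_type(path, overrides=OVERRIDES):
    """Return a MIME type for path, preferring an explicit override."""
    if path in overrides:
        return overrides[path]
    # Extension-based recognition: this is the step that fails for files
    # that were never named with a recognizable extension.
    guessed, _encoding = mimetypes.guess_type(path)
    # Fall back to a generic binary type when the extension is unknown.
    return guessed or "application/octet-stream"
```

A file without an extension is typed correctly only when an override exists;
otherwise it falls back to the generic type, which is the situation that
would force a rename under a purely extension-based mapping.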
The OMNIS system [cla95] has been
designed to provide access to library information that includes both
catalogs and digitized texts.
The scanned-in documents may contain images, PostScript or other formatted
information, and are stored in a database. At presentation time, the
OMNIS gateway converts textual information to HTML, while images are
converted to common MIME types before
being passed to the browser. This is quite feasible because OMNIS has full
control over the format and representation of information that is stored
in its database.
Harvest [bow94] provides support for extracting summaries
from distributed heterogeneous information and for executing searches
over these summaries.
Once the resources have been identified, the responsibility of accessing them
is handed over to the Web browsers.
Harvest provides efficient and flexible methods of indexing widely
distributed information. MIME mappings
are used to provide access to the wide variety of information, so the
problems that were described earlier still persist.
There have been a number of attempts to build logical models of
distributed heterogeneous information and use these models to support
advanced Web presentation.
The Multimedia-Oriented Repository Environment (MORE) [eic94]
was designed as a set of CGI programs
that operate in conjunction with a stock httpd server to provide
access to a relational database containing meta-information, which
specifies how to retrieve physical data. The meta-information is
entered into the database off-line by human librarians.
WebMake [bae95]
introduces methods for building Web structures over existing software,
e.g., source and object code for software systems.
In WebMake, meta-level
structural documents are used to create abstractions by
logically combining software modules or other structural documents.
A set of tools has been developed to provide a distributed
software development environment by utilizing the CGI mechanism.
A specialized Web client is required to obtain full access to the
WebMake functionality.
HyperG [and95]
uses an object-oriented database layer to provide information modeling
and model maintenance facilities in addition to integrated attribute
and content-based search. The system supports logical grouping of
documents into collections that may span multiple HyperG servers. Special
cluster collections are used to group together related multimedia
and multi-lingual information. HyperG
uses its own HyperG Text Format (HTF) that is converted to HTML by the
HyperG servers when they respond to HTTP requests.
The objective of the InfoHarness system [shk94, shk95-1]
is to provide Web access to large amounts of
heterogeneous information in a distributed environment without any
relocation, restructuring, or reformatting of data.
Like MORE and HyperG, InfoHarness
uses metadata for search and retrieval of heterogeneous
information (Figure 3).
It provides advanced search and browsing capabilities without
imposing constraints on information suppliers or creators.
InfoHarness utilizes stable abstract class
encapsulation and presentation hierarchies that need not be modified to
add terminal classes that accommodate new kinds of information and new
indexing technologies. InfoHarness
provides tools for the automatic generation of metadata based on user
inputs and the analysis of existing information.
Closely related to this effort is our work on defining an Information
Repository Definition Language (IRDL) [shk95-2], a high-level language
for describing
information resources and the desired logical structure of information
repositories. The language provides high flexibility in imposing
abstractions on heterogeneous information.
Presently, the IRDL interpreter generates InfoHarness
metadata entities. With the emergence of
Web objects, it should become possible to generate
Web data structures directly.
Figure 1. InfoHarness Architecture.
The main components of the current InfoHarness implementation
include (Figure 1):
InfoHarness server, which uses
metadata to traverse, search, and retrieve the original information.
CGI gateway, which is used to pass requests from HTTP clients to
the InfoHarness server (via an HTTP server) and responses
back to the clients.
InfoHarness metadata entities
representing the desired logical structure and organization of the
original information. This metadata is used by the InfoHarness server
to support dynamic search and presentation of raw data.
HTML form.
InfoHarness server, generates
and sends out a request, and waits for a response.
HTML forms and hyperlinks, adds an HTTP header, and passes the
transformed response to an HTTP browser.
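The request and response path through the gateway can be pictured with a
short Python sketch. It is only an illustration of the general CGI pattern
(parse the query, call the server, prepend an HTTP header); the function
names, the "q" parameter, and the stand-in server are all hypothetical and do
not reproduce the actual InfoHarness gateway or its protocol.

```python
import html
from urllib.parse import parse_qs


def handle_request(query_string, query_server):
    """Turn one CGI query string into a complete CGI response string."""
    # Decode the form fields sent by the browser (first value of each key).
    params = {key: values[0] for key, values in parse_qs(query_string).items()}
    # Pass the request to the server and wait for its reply.
    body = query_server(params)
    # A CGI response is a header block, a blank line, and then the body.
    return "Content-Type: text/html\r\n\r\n" + body


def fake_server(params):
    """Stand-in for the back-end server (hypothetical, for illustration)."""
    term = html.escape(params.get("q", ""))
    return "<html><body>Results for " + term + "</body></html>"


response = handle_request("q=contract", fake_server)
```

Keeping the server behind a plain callable is a choice made for the sketch:
it lets the header handling be exercised without any network connection.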
The InfoHarness architecture is open,
modular, extensible and scalable.
The InfoHarness server implements the
abstract class presentation hierarchy that does not have to be modified
to support a new data type, or a new
indexing technology [shk95-1]. The methods associated
with abstract classes are general enough because they are data-driven and
can invoke independent programs.
The definitions of terminal classes are also data-driven and are not part of
the implementation, which makes the system capable of supporting arbitrary
information access and management tools (e.g., browsers, indexing
technologies, access methods).
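One way to picture a data-driven type definition is a table that maps a type
name to an external command, so that supporting a new type means adding a
table entry rather than changing code. The Python sketch below rests on that
assumption only; the table contents, the {path} placeholder, and the function
name are hypothetical and are not the InfoHarness type definition format.

```python
import shlex

# Hypothetical type table. In a data-driven design it would be loaded from
# a configuration file rather than written into the server code.
TYPE_TABLE = {
    "postscript": "ghostview {path}",
    "man": "nroff -man {path}",
}


def presentation_command(type_name, path, table=TYPE_TABLE):
    """Build the argument list of the program that presents path."""
    template = table[type_name]  # raises KeyError for an unknown type
    # Quote the path so that names containing spaces stay one argument.
    return shlex.split(template.format(path=shlex.quote(path)))


# Adding a type is a data change only; presentation_command is untouched.
EXTENDED_TABLE = dict(TYPE_TABLE, dvi="xdvi {path}")
```

The function only builds the command; running it (for example with
subprocess.run) is left out so that the sketch has no external dependencies.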
Figure 2. Simple and Composite Objects.
As mentioned earlier, an important advantage of
InfoHarness is that it provides access
to heterogeneous information without making any assumptions about its
location and representation.
This is achieved by generating metadata and associating it with the original
information. Metadata entities,
which encapsulate units of information that are of interest to end-users,
are called encapsulation units (EU). An EU may be associated with a file
(e.g., a man page), a portion of a file (e.g., a C function), a set of files
(e.g., a set of related man pages), or a request for the retrieval of data
from an external source (e.g., a database query). For example, a C file and
a function that occurs in this file may, in different contexts, each
present a unit of interest to end-users. Consequently, they may be
encapsulated by separate metadata entities.
An InfoHarness object (IHO) is defined
recursively as one of the following:
InfoHarness objects (its children) and optional attributes.
InfoHarness objects.
An InfoHarness object contains a unique
identifier that is recognized and maintained by the system. Each simple
and composite object (Figure 3) stores the possibly remote location of raw
data, the logical address of the encapsulated portion of this data (e.g.,
the name of a C function or the title of a document section),
and typing information that determines
the access-time data presentation method. Data location may be expressed by
any legal Uniform Resource Locator (URL), which makes it possible not only
to model local legacy information but also to create multiple views of the
existing Web resources.
For example, an object that encapsulates a C
function would be assigned a presentation type C
and would contain both the location of a C file containing the function
and the name of this function. The type of this object determines the
presentation method that would separate out the function at presentation
time.
In addition, each object may contain an arbitrary number of attribute-value
pairs (e.g., owner, last update, security information, decompression method,
etc.).
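The fields described above can be gathered into a small record type. The
Python sketch below is an illustration only: the class name, the field
names, the identifier format, and the sample URL are assumptions, and the
actual InfoHarness metadata representation may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class InfoHarnessObject:
    """Illustrative record for one metadata entity (names are hypothetical)."""

    object_id: str                           # unique identifier
    location: Optional[str] = None           # URL of the raw data
    logical_address: Optional[str] = None    # e.g. name of a C function
    presentation_type: Optional[str] = None  # selects the presentation method
    attributes: Dict[str, str] = field(default_factory=dict)
    children: List[str] = field(default_factory=list)  # ids of child objects


# The C function example from the text: the object points at the file and
# names the function, and its type tells the server how to present it.
parse_expr = InfoHarnessObject(
    object_id="iho-0001",
    location="file:///src/parser.c",
    logical_address="parse_expr",
    presentation_type="C",
    attributes={"owner": "kate"},
)
```

Storing child identifiers rather than nested objects is a choice made for
the sketch; it keeps each record flat and lets one object appear under
several parents.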
Figure 3. InfoHarness Collections.
An object that contains references to a set of other
InfoHarness objects may
be either a collection or a composite object. Only composite objects may
contain an encapsulation unit (Figure 2). A sample composite object both
encapsulates an abstract of a paper and contains references to objects that
encapsulate text, HTML, PostScript and LaTeX versions of the full paper.
Collection objects may contain references to independent indices
that in turn reference their child objects (Figure 3). An index may be created
either from the encapsulated contents of child objects or from the values of
their attributes (the information source of the index). By an abuse of
notation, we will refer to such collection objects as indexed collections,
and say that an InfoHarness object belongs
to an indexed collection if it is a child of a collection object.
An indexed collection contains information about the index source, type,
and the location of the associated index structures.
The type ensures proper selection of query and mapping
methods, the latter responsible for mapping selected information into
InfoHarness objects (Figure 3). Consequently,
any indexed collection may make use of external data retrieval methods that
are not parts of InfoHarness,
making it possible to utilize existing heterogeneous index structures.
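The role of the index source and type can be sketched in Python as follows.
This is a minimal sketch under stated assumptions: the class and function
names, the "word" index type, and the in-memory index (standing in for the
location of external index structures) are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class IndexedCollection:
    """Illustrative indexed collection (all names are hypothetical)."""

    index_source: str            # attribute the index was built from
    index_type: str              # selects the query method
    index: Dict[str, Set[str]]   # stands in for the index location
    children: List[str]          # ids of the child objects


def build_word_index(objects, attribute):
    """Map each lower-cased word of an attribute value to object ids."""
    index = defaultdict(set)
    for object_id, attributes in objects.items():
        for word in attributes.get(attribute, "").lower().split():
            index[word].add(object_id)
    return index


def word_query(index, term):
    """Query method for the 'word' index type: returns matching ids."""
    return sorted(index.get(term.lower(), set()))


# The index type selects the query method, so another index technology
# could be supported by registering one more entry here.
QUERY_METHODS = {"word": word_query}


def search(collection, term):
    """Dispatch on the collection's index type and return child ids."""
    return QUERY_METHODS[collection.index_type](collection.index, term)


objects = {
    "iho-1": {"title": "Parser design notes"},
    "iho-2": {"title": "Lexer design"},
}
titles = IndexedCollection(
    "title", "word", build_word_index(objects, "title"), list(objects)
)
```

Here search(titles, "design") yields the identifiers of both objects, which
the caller can then map back to the objects themselves.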
An InfoHarness repository is a set of
objects that are known to a single InfoHarness server.
Any object may be a member of an arbitrary number of collections (its parents).
An object that has one or more parents always
contains the unique object identifiers of its parent objects. An object that
does not have any parent is unreachable and may only be accessed
if used as an initial starting point (or entry point) in the traversal.
3.2 Building InfoHarness Repositories
The metadata generator (Figure 1) supports the off-line creation of metadata
entities. The generator commands either encapsulate the raw data or group
existing objects into sets. In addition, the generator is responsible for
the creation of independent indices that reference members of indexed
collections.
There are only three different generator commands:
InfoHarness objects,
each of which encapsulates a portion of data. Boundaries of these portions
are determined by type. For example, an encapsulation command may refer to
the type rmail and the location of an RMAIL file. The output in this
example is a set of objects, each of which is associated with a separate
mail message.
To simplify using the generator, InfoHarness
utilizes a number of macros for standard operations.
For a more uniform solution to building information repositories,
we have designed a high-level
Information Repository Definition Language (IRDL), which supports
very compact and simple specifications for building large and complex
information repositories [shk95-2]. With IRDL,
the generator commands are produced by the IRDL Interpreter.
3.3 Sample InfoHarness Repository
To illustrate the concepts discussed in Sections 3.1
and 3.2, we discuss how to use InfoHarness for advanced
search and presentation of judicial opinions from the U.S. Supreme Court
that are available as a collection of plain text files at
ftp.cwru.edu (the live demo is available).
Here, information related to a single court case may be distributed
between multiple files.
Given the location of the original information, the desired access-time
presentation of individual cases, and the desired full-text indexing
technology, we have implemented a twenty-line (including declarations)
IRDL program that generates the repository of the Supreme Court cases
by performing the following steps:
Case, and excludes them from any
further consideration. The presentation method for this type is
responsible for generating the internal hyperlinks to individual
opinions and the external hyperlinks to related information
(the Supreme Court photo, bios of the judges, etc.), Figure 5.
In this example, each indexed object is a composite object
(Section 3.1). Consequently, when presenting
the results of a query (Figure 4), for each case we
see not only a hyperlink for its content but also hyperlinks for the
individual opinions. The internal hyperlinks for individual opinions are
also available when presenting an individual case
(Figure 5).
4.0 Conclusions
We have discussed using InfoHarness to
provide Web access to existing heterogeneous information.
We believe that the system's
open and extensible architecture makes it well-positioned to take
full advantage of exciting new developments in Web technology.
Even though Java [gos95] and other emerging
mobile code systems provide new and exciting opportunities in presenting
information on the Web, we see them as complementary to the new developments
in the traditional HTML presentation and the HTTP protocol.
Advancing Web technology is likely to rapidly antiquate the existing Web
structures, including images, applets, and static and dynamic HTML pages.
These structures represent a tremendous investment and cannot be
recreated with every new step in the technological advance.
Consequently, modeling methods that support advanced presentation of
existing heterogeneous information have to progress as well.
Similar methods should be applicable to building virtual Webs,
with both navigation and presentation controlled by personalized
meta-information.
hdl://cnri.dlib/april96-shklar
References
Footnotes
InfoHarness is a Bellcore trademark.
GeoHarness work is a part of the GeoLens project being funded
by a NASA Information Infrastructure Technology and Applications
Program Cooperative Agreement (#NCC5-102), "Usability and Interoperability:
A Dual Strategy for Enabling Broader Public Use of NASA's Remote Sensing
Data on the Internet", Clifford Behrens, Principal Investigator.
Copyright © 1996 Bellcore