Journal of the American Society for Information Science and Technology
(JASIST) -- Table of Contents
Contributed by Richard Hill
American Society for Information Science and Technology
Silver Spring, Maryland, USA
Fax: (301) 495-0810
Phone: (301) 495-0900
[email protected]
VOLUME 52, NUMBER 5
[Note: below, the contents of Bert Boyce's "In This Issue" have been cut into the
Table of Contents.]
CONTENTS
Editorial
In this issue
Bert R. Boyce
Page 369
Research
- A Noninformetric Analysis of the Relationship between Citation Age and Journal
Productivity
L. Egghe
Page 371, Published online 1 February 2001
In this issue Egghe provides an explanation, based upon the central limit
theorem, for a regularity observed by Wallace between citation age and journal
productivity. This implies that there is no informetric explanation, but
rather the observation of a statistical effect. He then examines the Leimkuhler
curve, showing the arcs at the tail to be a mathematical rather than informetric artifact.
The relationship between the fraction of multinational publications of a country
and the country's fractional score is also shown to be probabilistic in nature.
However, the relationship between the Price index and median age requires both
probabilistic and informetric explanation, and the cumulative first citation
distribution seems best explained with a curve incorporating Lotka's exponent
and thus has high informetric value.
- Automatic Cataloguing and Searching for Retrospective Data by Use of OCR
Text
Yuen-Hsien Tseng
Page 378, Published online 5 February 2001
We also include four papers concerning the automatic characterization of documents
and queries. First, using a test collection of 7990 OCR scanned book pages from
500 books in four languages, and 30 queries, 15 content and 15 known item, Tseng
applies variable length n-gram indexing and byte size normalization. Document
terms are weighted at 1 plus the log of term frequency except for those on the
first two pages of a book. These are incremented by eight, not one. Each occurrence
of a query term increments its weight by 1 plus the cube of the n-gram length
minus 1. Known item searches are limited to the first two pages of each book.
Precision and recall results achieved second place in a contest entered. A similar
approach has promise with Chinese text.
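The weighting rules described above can be sketched in a few lines of Python. This is only an illustration, not Tseng's implementation. The function names, the natural-log base, the n-gram length range, and the reading of "the cube of the n-gram length minus 1" as (n - 1) cubed are all assumptions.

```python
import math


def variable_length_ngrams(text, min_n=1, max_n=3):
    """Return all character n-grams of length min_n..max_n (range is assumed)."""
    return [
        text[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(text) - n + 1)
    ]


def document_term_weight(tf):
    """1 plus the log of term frequency (log base is an assumption)."""
    if tf <= 0:
        return 0.0
    return 1.0 + math.log(tf)


def query_term_increment(n):
    """Per-occurrence query weight increment for an n-gram of length n.

    Assumed reading: 1 + (n - 1) ** 3, so longer n-grams count far more.
    """
    return 1 + (n - 1) ** 3
```

Under this reading a single character adds 1 to a query term's weight, a bigram adds 2, and a trigram adds 9, which favors matches on longer strings.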
- An Experimental Study in Automatically Categorizing Medical Documents
Berthier Ribeiro-Neto, Alberto H.F. Laender, and Luciano R.S. de Lima
Page 391, Published online 5 February 2001
In another automatic characterization paper Ribeiro-Neto, et alia, test their
coding algorithm which assigns International Code of Diseases category codes
to medical documents against a file of 20,569 patient records. The ICD codes
are represented as a directed acyclic graph, and supplemented with acronym and
synonym dictionaries for the codes. For each section of each document the acronyms
and synonyms are converted to code strings and root node codes are identified.
A window of document terms around each root node term is created and the longest
path from the graph including these terms is extracted. These codes are assigned
to the document in a ranked order by relative path length for that root.
Using documents with specialist-assigned ICD codes as an ideal set, 19,651
were categorized at between 70 and 80% for all recall levels, while 918 were
not. However, specialists made incorrect assignments in 589 documents, and in
391 made assignments not supported by the text, but that may have been the result
of additional information. In only 158 cases was the algorithm clearly incorrect.
- Automatic Query Expansion via Lexical-Semantic Relationships
Jane Greenberg
Page 402, Published online 9 February 2001
Next, using 42 queries, in the form of Boolean statements with free text
terminology, collected from MBA students and the ABI/Inform database, Greenberg
maps against
the ProQuest Controlled Vocabulary selecting those queries that contained at
least one ProQuest term. These were searched in initial form, a form mapped from
ProQuest, and using expansions that took all synonyms, all narrower terms, all
broader terms, and all related terms. Greenberg conducted all searches on Dialog
and subtracted the initial and mapped results from the other returns to gauge
the expansions' effectiveness. Relevance judgements were made on the basis of
topical matching (aboutness) by the contributors of the queries reviewing the
union set of the responses to the query forms, where each retrieved list was limited
to 15 or fewer citations. If the retrieved set was under 16 all were
presented, and if between 16 and 100 the top 15 ranked by similarity to the query
(Dice Coefficient) were used, while if above 100 a random sample of size one
hundred was used for the similarity ranking. Broader terms and Related terms
each improved recall nearly 100%, while Narrower terms increased the baseline
from .266 to .473. Synonyms improved from the .226 base to .369. The baseline
precision of .794 was reduced to .766 by the use of synonyms, to .733 by the
use of narrower terms, .544 by the use of related terms, and .595 by the use
of Broader terms.
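The rule for limiting each retrieved list can be sketched as follows. This is a hedged illustration, not Greenberg's procedure. Representing queries and citations as sets of terms, and the function names, are assumptions, since the summary does not say how the Dice coefficient was computed.

```python
import random


def dice_coefficient(a, b):
    """Dice similarity between two term sets: 2|A & B| / (|A| + |B|)."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def select_for_judgement(query_terms, retrieved, limit=15, sample_size=100, rng=None):
    """Pick the citations shown to the assessor.

    retrieved is a list of (doc_id, term_set) pairs (an assumed representation).
    - up to `limit` items: present everything
    - more than `sample_size` items: draw a random sample of `sample_size` first
    - otherwise: rank by Dice similarity and keep the top `limit`
    """
    rng = rng or random.Random(0)
    if len(retrieved) <= limit:
        return [doc_id for doc_id, _ in retrieved]
    pool = retrieved
    if len(retrieved) > sample_size:
        pool = rng.sample(retrieved, sample_size)
    ranked = sorted(
        pool,
        key=lambda doc: dice_coefficient(query_terms, doc[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:limit]]
```

Sampling before ranking keeps the similarity computation bounded for very large result sets, at the cost of possibly missing the closest citations.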
- Modeling User Interest Shift Using a Bayesian Approach
Wai Lam and Javed Mostafa
Page 416, Published online 1 February 2001
In a different approach to query modification Lam and Mostafa address information
filter modification in response to changing user needs. Such filters assume a
stability of user need, when in fact, information needs evolve at unpredictable
speeds and in unpredictable ways adding to the normal relevance assessment problems.
Their passive filter stores and ranks material received for later review, building
its profile from a subset of MeSH headings based on user relevance feedback
assessments of documents presented. Documents are classed using a cosine
similarity measure,
the user provides binary interest weights for each class, and their running average
is maintained as the relevance probability of the class which is used to rank
all classes after the first. Positive feedback also modifies a second vector
used to select the initial class, providing a means of relearning of changing
interests. Since this relearning requires considerable iterations, with degraded
interim results, a means of quick shift detection is needed. Using the sequence
of feedback data and Bayes' theorem, with associated costs of a wrong decision,
the posterior probability that a shift has occurred can be computed. An upward
shift will result in the new class and the old most probable class each being
assigned half of the sum of their probabilities. A downward shift of the most
probable class will use the user profile vector to identify the class weights
to sum and distribute, since the class vector values of the other classes will
be near zero. Simulation studies indicate that the system is able to recognize
and correct for interest shifts.
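A rough sketch of the pieces described above is given here: a running average of binary feedback, a Bayes' theorem posterior for a shift, a cost-based decision, and the equal split after an upward shift. It is not Lam and Mostafa's model. The likelihood values are supplied by the caller, and all names, along with the simple expected-cost rule, are assumptions.

```python
def running_average(prev_avg, count, new_value):
    """Update the mean of `count` earlier feedback values with one new value."""
    return prev_avg + (new_value - prev_avg) / (count + 1)


def shift_posterior(prior_shift, lik_shift, lik_no_shift):
    """P(shift | feedback) by Bayes' theorem.

    lik_shift and lik_no_shift are the probabilities of the observed feedback
    sequence with and without a shift (how to model them is left open here).
    """
    evidence = prior_shift * lik_shift + (1 - prior_shift) * lik_no_shift
    if evidence == 0:
        return 0.0
    return prior_shift * lik_shift / evidence


def declare_shift(posterior, cost_false_alarm, cost_miss):
    """Declare a shift when ignoring it has the higher expected cost (assumed rule)."""
    return posterior * cost_miss > (1 - posterior) * cost_false_alarm


def split_on_upward_shift(class_probs, new_class):
    """Give the new class and the old most probable class half of their sum each."""
    old_top = max((c for c in class_probs if c != new_class), key=class_probs.get)
    half = (class_probs[new_class] + class_probs[old_top]) / 2
    updated = dict(class_probs)
    updated[new_class] = half
    updated[old_top] = half
    return updated
```

For example, with a prior of 0.5 and likelihoods 0.8 and 0.2, the posterior probability of a shift is about 0.8, and with equal costs a shift would be declared.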
- General-Purpose Compression for Efficient Retrieval
Adam Cannane and Hugh E. Williams
Page 430, Published online 5 February 2001
The final paper in this issue is concerned with compression techniques that
can speed up retrieval since disc seek and transfer cost savings can exceed
decompression costs. Cannane and Williams describe an algorithm that identifies
unique character
strings occurring at least twice by way of multiple passes, replaces them with
a reference number, and continues to form a hierarchy of longer strings that
may contain references to shorter ones. The process terminates when no further
duplicate strings are to be found. The representation created and an associated
string dictionary allow decompression at any random access point. Using the
Canterbury collection for compression experiments, and the TREC Wall Street
Journal and
WEBDOC files, and databases of genomic records, weather data, and geographic
data, compression is found to be superior to GZIP, COMPRESS, and the Huffman
coding scheme, but not as effective as BZIP2, although decompression is faster
than BZIP2.
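The general idea of replacing repeated strings with references, and building longer strings out of shorter ones, can be illustrated with a simple pair-replacement scheme. This is a swapped-in simplification in the spirit of grammar-based compression, not the algorithm Cannane and Williams describe. All names are hypothetical, and no bit-level encoding is attempted.

```python
from collections import Counter


def _replace_pair(seq, pair, ref):
    """Replace non-overlapping occurrences of `pair` with `ref`; return (seq, count)."""
    out, i, replaced = [], 0, 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(ref)
            i += 2
            replaced += 1
        else:
            out.append(seq[i])
            i += 1
    return out, replaced


def compress(text):
    """Return (symbols, rules); ints in symbols are references into rules."""
    seq = list(text)
    rules = []  # rules[k] = (left, right), each a character or an earlier reference
    while True:
        counts = Counter(zip(seq, seq[1:]))
        for pair, count in counts.most_common():
            if count < 2:
                return seq, rules  # no remaining pair occurs twice
            new_seq, replaced = _replace_pair(seq, pair, len(rules))
            if replaced >= 2:  # overlapping repeats may yield fewer replacements
                rules.append(pair)
                seq = new_seq
                break
        else:
            return seq, rules  # nothing worth replacing was found


def expand(symbol, rules):
    """Expand one symbol on its own, which is what permits random access."""
    if isinstance(symbol, str):
        return symbol
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)


def decompress(seq, rules):
    return "".join(expand(symbol, rules) for symbol in seq)
```

Because every reference expands using only the dictionary, decoding can start at any symbol in the compressed sequence without reading what came before it.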
Book Reviews
- Digital Capital: Harnessing the Power of Business Webs, by Don Tapscott, David
Ticoll, & Alex Lowy
Shana R. Ponelis
Page 438, Published online 1 February 2001
- A Place at the Table: Participating in Community Building, by Kathleen de
la Pena McCook
Marianne Orme
Page 439, Published online 1 February 2001