IRList Digest Tuesday, 15 November 1988 Volume 4 : Issue 55
Today's Topics:
Abstracts - Recent SIGIR Forum
New addresses are
Internet: fox@vtopus.cs.vt.edu
BITNET: foxea@vtcc1.bitnet (replaces foxea@vtvax3)
----------------------------------------------------------------------
Date: Fri, 23 Sep 88 09:24:31 CDT
From: "Dr. Raghavan" <raghavan%raghavansun%usl.csnet@RELAY.CS.NET>
Subject: Abstracts in most recent ACM SIGIR Forum
...
[Note: I have attempted to strip out all format codes except
ones bracketed by "$" for equations. - Ed.]
ABSTRACTS
(Chosen by G. Salton from recent issues of journals in the retrieval area.)
ONLINE TEXT RETRIEVAL VIA BROWSING
J. F. Cove and B. C. Walsh, Department of Computer Science, University of
Liverpool, Liverpool L69 3BX, England.
Browsing refers to information retrieval where the initial search criteria
are generally quite vague. The fundamentals of browsing are explored as a
basis for the creation of an intelligent computer system to assist with the
retrieval of online information. Browsing actions via a computer terminal
are examined, together with new methods of accessing text and satisfying
user queries. Initial tests with a prototype system illustrated the use of
different retrieval strategies when accessing online information of varying
structure. The results suggest the construction of a more intelligent
processing component to provide expanded capabilities for content
extraction and navigation within text documents.
(INFORMATION PROCESSING & MANAGEMENT, Vol. 24, No. 1, pp. 31-37,
1988)
AN APPROACH TO THE EVALUATION OF CATALOG SELECTION SYSTEMS
Caroline M. Eastman, Department of Computer Science, University of South
Carolina, Columbia, SC 29208.
The similarities between classification systems for catalog selection and
information retrieval systems indicate that similar evaluation
methodologies might well be appropriate. The characteristics of
classification systems and of information retrieval systems are
summarized, and two catalog selection systems (GRANT and Grundy) are
presented as examples. The contributions of this article are a discussion
of the system characteristics that allow the use of measures such as recall
and precision in evaluation and a brief overview of related research
within the field of information retrieval.
(INFORMATION PROCESSING & MANAGEMENT, Vol. 24, No. 1, pp. 23-30, 1988)
AN IMPROVED ALGORITHM FOR THE CALCULATION OF EXACT TERM DISCRIMINATION
VALUES
Abdelmoula El-Hamdouchi and Peter Willett, Department of Information Studies,
University of Sheffield, Western Bank, Sheffield S10 2TN, UK.
The term discrimination model provides a means of evaluating
indexing terms in automatic document retrieval systems. This article
describes an efficient algorithm for the calculation of term discrimination
values that may be used when the interdocument similarity measure is the
cosine coefficient and when the document representatives have been
weighted using one particular term-weighting scheme. The algorithm has
an expected running time proportional to $Nn^2$ for a collection of
$N$ documents, each of which has been assigned an average of $n$ terms.
(INFORMATION PROCESSING & MANAGEMENT, Vol. 24, No. 1, pp. 17-22, 1988)
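[Illustrative sketch, not the paper's algorithm: the term discrimination model scores a term by the change in space density observed when the term is removed from the collection. The naive version below measures density as the average cosine similarity of each document to the collection centroid; the paper's contribution is computing the same values far more efficiently under its particular weighting scheme.]

import math
from collections import Counter

def cosine(u, v):
    # Cosine coefficient between two term-weight dictionaries.
    num = sum(w * v.get(t, 0.0) for t, w in u.items())
    den = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return num / den if den else 0.0

def density(docs):
    # Space density: average similarity of each document to the centroid.
    centroid = Counter()
    for d in docs:
        centroid.update(d)
    centroid = {t: w / len(docs) for t, w in centroid.items()}
    return sum(cosine(d, centroid) for d in docs) / len(docs)

def discrimination_value(docs, term):
    # Positive value: removing the term packs documents closer together,
    # so the term is a good discriminator.
    without = [{t: w for t, w in d.items() if t != term} for d in docs]
    return density(without) - density(docs)

docs = [{"retrieval": 2, "index": 1}, {"retrieval": 1, "browse": 1}, {"index": 3}]
for term in ("retrieval", "index", "browse"):
    print(term, round(discrimination_value(docs, term), 3))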
PREDICTING DOCUMENT RETRIEVAL SYSTEM PERFORMANCE: AN EXPECTED
PRECISION MEASURE
Robert M. Losee, Jr., School of Information and Library Studies, University
of North Carolina, Chapel Hill, NC 27514, USA.
Document retrieval systems based on probabilistic or fuzzy logic
considerations may order documents for retrieval. Users then examine the
ordered documents until deciding to stop, based on the estimate that the
highest-ranked unretrieved document is most economically left unretrieved.
We propose an expected precision measure useful in estimating the
performance expected if yet unretrieved documents were to be retrieved,
providing information that may result in more economical stopping
decisions. An expected precision graph, comparing expected precision versus
document rank, may graphically display the relative expected precision of
retrieved and unretrieved documents and may be used as a stopping aid for
online searching of text data bases. The effectiveness of relevance
feedback may be examined as a search progresses. Expected precision values
may also be used as a cutoff for systems consistent with probabilistic
models operating in batch modes. Techniques are given for computing the best
expected precision obtainable and the expected precision of subject
neutral documents.
(INFORMATION PROCESSING & MANAGEMENT, Vol. 23, No. 6, pp. 529-537,
1987)
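[A hedged sketch of the general idea rather than Losee's formulation: given estimated probabilities of relevance for the documents in ranked order, the expected precision of the next few unretrieved documents is the mean of their probabilities, and a searcher might stop when that expectation falls below an economic cutoff.]

def expected_precision(prob_relevance, start, k):
    # Mean estimated probability of relevance of the next k unretrieved documents.
    window = prob_relevance[start:start + k]
    return sum(window) / len(window) if window else 0.0

probs = [0.9, 0.8, 0.75, 0.4, 0.3, 0.1, 0.05]   # hypothetical ranked estimates
cutoff = 0.5                                     # hypothetical economic threshold
for rank in range(len(probs)):
    ep = expected_precision(probs, rank, k=3)
    print("after rank %d: expected precision of next 3 = %.2f" % (rank, ep))
    if ep < cutoff:
        print("stopping here")
        break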
AN ANALYSIS OF APPROXIMATE VERSUS EXACT DISCRIMINATION VALUES
Carolyn J. Crouch, Computer Science Department, Tulane University, New
Orleans, LA 70118, USA.
Term discrimination values have been used to characterize and select
potential index terms for use during automatic indexing. Two basic
approaches to the calculation of discrimination values have been suggested.
These approaches differ in their calculation of space density; one method
uses the average document-pair similarity for the collection and the other
constructs an artificial, ``average'' document, the centroid, and
computes the sum of the similarities of each document with the centroid.
The former method has been said to produce ``exact'' discrimination values
and the latter ``approximate'' values.
This article investigates the differences between the algorithms
associated with these two approaches (as well as several modified versions
of the algorithms) in terms of their impact on the discrimination value
model by determining the differences that exist between the rankings of
the exact and the approximate discrimination values. The experimental
results show that the rankings produced by the exact approach and by a
centroid-based algorithm suggested by the author are highly compatible.
These results indicate that a previously suggested method involving
the calculation of exact discrimination values cannot be recommended in
view of the excessive cost associated with such an approach: the
approximate (i.e., ``exact centroid'') approach discussed in this article
yields a comparable result at a cost that makes its use feasible for any
of the experimental document collections currently in use.
(INFORMATION PROCESSING & MANAGEMENT, Vol. 24, No. 1, pp. 5-16,
1988)
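[Illustrative contrast, not the paper's algorithms: the ``exact'' approach measures space density from all document-pair similarities, while the centroid (``approximate'') approach measures each document's similarity to a single centroid, replacing O(N^2) similarity computations with O(N).]

import itertools, math

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = lambda d: math.sqrt(sum(w * w for w in d.values()))
    n = norm(u) * norm(v)
    return dot / n if n else 0.0

def pairwise_density(docs):
    # "Exact": average similarity over all document pairs.
    pairs = list(itertools.combinations(docs, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

def centroid_density(docs):
    # "Approximate": average similarity of each document to the centroid.
    centroid = {}
    for d in docs:
        for t, w in d.items():
            centroid[t] = centroid.get(t, 0.0) + w / len(docs)
    return sum(cosine(d, centroid) for d in docs) / len(docs)

docs = [{"oil": 2, "well": 1}, {"oil": 1, "drill": 2}, {"pump": 1, "well": 2}]
print(round(pairwise_density(docs), 3), round(centroid_density(docs), 3))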
AN EXPERT SYSTEM FOR MACHINE-AIDED INDEXING
Clara Martinez, John Lucey, and Elliott Linder, American Petroleum
Institute, 156 William St., New York, New York 10038.
The Central Abstracting & Indexing Service of the American Petroleum
Institute (API-CAIS) has successfully applied expert system techniques to
the job of selecting index terms from abstracts of articles appearing
in the technical literature. Using the API Thesaurus as a base, a
rule-based system has been created that has been in productive use since
February 1985. The index terms selected by computer are reviewed by a
human index editor, as are the terms selected by CAIS's human indexers.
After editing, the terms are used for printed indexes and for online
computer searching.
(JOURNAL OF CHEMICAL INFORMATION AND COMPUTER
SCIENCES, Vol. 27, pp. 158-, 1987)
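[Toy illustration only; the API Thesaurus and production rules are far richer than the invented entries below. The sketch shows the basic shape of machine-aided indexing: phrases recognized in an abstract are mapped to preferred thesaurus descriptors, which a human index editor then reviews.]

THESAURUS_RULES = {
    # phrase found in text  ->  preferred descriptor (all entries invented)
    "crude oil": "PETROLEUM",
    "petroleum": "PETROLEUM",
    "catalytic cracking": "CRACKING, CATALYTIC",
    "offshore drilling": "DRILLING, OFFSHORE",
}

def suggest_index_terms(abstract_text):
    text = abstract_text.lower()
    # Check longer phrases first so specific rules win over general ones.
    hits = [term for phrase, term in
            sorted(THESAURUS_RULES.items(), key=lambda kv: -len(kv[0]))
            if phrase in text]
    return sorted(set(hits))

print(suggest_index_terms(
    "Catalytic cracking of crude oil fractions recovered by offshore drilling."))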
HISTORICAL NOTE: INFORMATION RETRIEVAL AND THE FUTURE OF AN ILLUSION
Don R. Swanson, Graduate Library School, University of Chicago, 1100 East
57th Street, Chicago, IL 60637.
More than thirty years ago there was good evidence to suggest that
information retrieval involved conceptual problems of greater subtlety than
is generally recognized. The dramatic development and growth of online
services since then seems not to have been accompanied by much interest
in these conceptual problems, the limits they appear to impose, or the
potential for transcending such limits through more creative use of the
new services.
In this article, I offer a personal perspective on automatic indexing
and information retrieval, focusing not necessarily on the mainstream of
research but on those events and ideas over a 34-year period that have led
to the view stated above, and that have influenced my perception of
important directions for future research.
Some experimental tests of information systems have yielded good
retrieval results and some very poor results. I shall explain why I think
that occurred, why I believe that the poor results merit special
attention, and why we should reconsider a suggestion that Robert
Fairthorne put forward in 1963 to develop postulates of impotence -
statements of what cannot be done. By understanding such limits we are
led to new goals, metaphors, problems, and perspectives.
(JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE, Vol. 39,
No. 2, pp. 92-98, 1988)
SEMIAUTOMATIC DETERMINATION OF CITATION RELEVANCY: A PRELIMINARY
REPORT
G. David Huffman, College of Science and Technology, University of Southern
Mississippi, Hattiesburg, MS 39406.
Technology transfer, research and development, and engineering projects
frequently require in-depth literature reviews. These reviews are
carried out using computerized, bibliographic data bases. The review
and/or searching process involves keywords selected from data base
thesauri. The search strategy is formulated to provide both breadth and
depth of coverage and yields both relevant and nonrelevant citations.
Experience indicated that about 10-20% of the citations are relevant. As
a consequence, significant amounts of time are required to eliminate the
nonrelevant citations. This paper describes statistically based, lexical
association methods which can be employed to determine citation relevance.
In particular, the searcher selects relevant terms from citation-derived
indexes and this information along with lexical statistics is used to
determine citation relevance. Preliminary results are encouraging with
the techniques providing an effective concentration of relevant citations.
(INFORMATION PROCESSING & MANAGEMENT, Vol. 23, No. 6, pp. 573-582,
1987)
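[Hedged sketch of the general approach rather than the paper's statistics: the searcher marks a few index terms as relevant, each citation is scored by its overlap with those terms weighted by inverse collection frequency, and low-scoring citations are screened out.]

import math

citations = {                       # hypothetical citation id -> index terms
    "c1": {"heat transfer", "pipelines", "corrosion"},
    "c2": {"poetry", "metaphor"},
    "c3": {"pipelines", "welding"},
}
relevant_terms = {"pipelines", "corrosion"}   # chosen by the searcher

term_freq = {}
for terms in citations.values():
    for t in terms:
        term_freq[t] = term_freq.get(t, 0) + 1

def score(terms):
    # Sum of inverse-frequency weights over the searcher's relevant terms.
    return sum(math.log(len(citations) / term_freq[t]) + 1.0
               for t in terms & relevant_terms)

ranked = sorted(citations, key=lambda c: score(citations[c]), reverse=True)
print([(c, round(score(citations[c]), 2)) for c in ranked])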
COMPARING RETRIEVAL PERFORMANCE IN ONLINE DATA BASES
Katherine W. McCain, Howard D. White, and Belver C. Griffith, College of
Information Studies, Drexel University, Philadelphia, PA 19104.
This study systematically compares retrievals on 11 topics across
five well-known data bases, with MEDLINE's subject indexing as a focus. Each
topic was posed by a researcher in the medical behavioral sciences. Each
was searched in MEDLINE, EXCERPTA MEDICA, and PSYCINFO, which permit
descriptor searches, and in SCISEARCH and SOCIAL SCISEARCH, which express
topics through cited references. Searches on each topic were made with
(1) descriptors, (2) cited references, and (3) natural language (a
capability common to all five data bases). The researchers who posed the
topics judged the results. In every case, the set of records judged
relevant was used to calculate recall, precision, and novelty ratios.
Overall, MEDLINE had the highest recall percentage (37%), followed by
SSCI (31%). All searches resulted in high precision ratios; novelty
ratios of data bases and searches varied widely. Differences in record
format among data bases affected the success of the natural language
retrievals. Some 445 documents judged relevant were not retrieved from
MEDLINE using its descriptors; they were found in MEDLINE through
natural language or in an alternative data base. An analysis was performed
to examine possible faults in MEDLINE subject indexing as the reason for
their nonretrieval. However, no patterns of indexing failure could be
seen in those documents subsequently found in MEDLINE through
known-item searches. Documents not found in MEDLINE primarily represent
failures of coverage - articles were from nonindexed or selectively
indexed journals. Recommendations to MEDLINE managers include expansion
of record format and modification of journal and article selection policies.
(INFORMATION PROCESSING & MANAGEMENT, Vol. 23, No. 6, pp. 539-553,
1987)
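[The three ratios used in the study, stated as standard formulas; the document sets below are invented. Recall is the fraction of all relevant documents that were retrieved, precision the fraction of retrieved documents that were relevant, and novelty the fraction of relevant retrieved documents not already known to the requester.]

retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d3", "d5", "d6"}
previously_known = {"d1"}

relevant_retrieved = retrieved & relevant
recall = len(relevant_retrieved) / len(relevant)
precision = len(relevant_retrieved) / len(retrieved)
novelty = len(relevant_retrieved - previously_known) / len(relevant_retrieved)
print("recall %.2f  precision %.2f  novelty %.2f" % (recall, precision, novelty))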
STRATEGIES FOR BUILDING DISTRIBUTED INFORMATION RETRIEVAL SYSTEMS
Ian A. Macleod, T. Patrick Martin, Brent Nordin, and John R. Phillips,
Department of Computing and Information Science, Queen's University,
Kingston, Ontario, Canada K7L 3N6
In this article we discuss the need for distributed information retrieval
systems. A number of possible configurations are presented. A general
approach to the design of such systems is discussed. A prototype
implementation is described together with the experiences gained from this
implementation.
(INFORMATION PROCESSING & MANAGEMENT, Vol. 23, No. 6, pp. 511-528,
1987)
OPTIMAL BUCKET SIZE FOR MULTIATTRIBUTE RETRIEVAL IN PARTITIONED FILES
Caroline M. Eastman, Department of Computer Science, University of South
Carolina, Columbia, SC 29208, USA.
The problem of optimal bucket size for multiattribute retrieval in
partitioned files is considered. The query types considered include
exact match queries, range queries, partial match queries, and best
match (including nearest neighbor) queries. The similarities among
formulas which have been derived in several different contexts are
examined.
(INFORMATION SYSTEMS, Vol. 12, No. 4, pp. 375-383, 1987)
FORWARD MULTIDIMENSIONAL SEARCH WITH APPLICATION TO INFORMATION
RETRIEVAL SYSTEMS
Charles X. Durand, Computer and Information Sciences, State University of
New York, College at Potsdam, Potsdam, NY 13676, USA.
A new architecture for information retrieval systems is presented. If
it were implemented, this architecture would allow the system to process
retrieval statements that are equivalent to fuzzily defined queries. The
philosophy on which the centerpiece of this system is based - the document
search module - is fully explained in this paper. The emphasis is placed
on the quick elimination of irrelevant references. A new technique, that
takes into account the user's knowledge to discriminate between documents
before they are actually retrieved from the data base, was developed. The
search technique uses simple computations to select or eliminate potential
candidates for retrieval. Qualitatively, this technique avoids the
shortcomings not only of conventional retrieval techniques but also of
retrieval systems that accept relevance feedback from the user to refine
the search process. No implementation details are included in this article, and
system performance figures are not discussed.
(INFORMATION SYSTEMS, Vol. 12, No. 4, pp. 363-370, 1987)
THE CD-ROM MEDIUM
David H. Davies, Project Manager, Optical Recording Project, 3M Company,
420 North Bernardo Avenue, Mountain View, CA 94043.
This article details the critical elements that make up the CD-ROM
optical disc medium. This includes the basic laser and drive operational
mechanics, the nature of the actual disc itself, the data organization at
the channel code level and at the logical file level, and aspects of error
correction and detection methods used. A brief synopsis of disc fabrication
is presented. The article concludes with descriptions of advances in
the technology currently on the horizon.
(JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE, Vol. 39,
No. 1, pp. 34-42, 1988)
DESIGN CONSIDERATIONS FOR CD-ROM RETRIEVAL SOFTWARE
Edward M. Cichocki and Susan M. Ziemer, I.S. Grupe, Inc., 948 Springer
Drive, Lombard, IL 60148.
The CD-ROM requires a different kind of retrieval system design from
systems on magnetic media because the disc's physical characteristics and
drive differ from those of magnetic media. Retrieval system designers
must be concerned with ways to minimize seeks (access time), transfer large
amounts of data following each seek, store data proximally, and maximize
CD-ROM performance. Three methods to maximize that performance are
described: single key mode, multiple key mode, and inverted file mode.
Well-conceived design and well-executed retrieval systems for CD-ROM
databases can result in performance that equals the state-of-the-art
online systems.
(JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE, Vol. 39,
No. 1, pp. 43-46, 1988)
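[Illustrative inverted-file lookup, not the article's design: the point of the inverted-file mode is that one read of a contiguous index block yields the disc addresses of every matching record, so the drive pays one seek for the index plus one seek per hit instead of scanning the disc.]

records = {                      # hypothetical record address -> record text
    0: "alkanes in crude oil",
    1: "catalysts for reforming",
    2: "alkanes and catalysts",
}

# Built once when the disc is mastered and stored contiguously.
inverted_file = {}
for addr, text in records.items():
    for word in set(text.split()):
        inverted_file.setdefault(word, []).append(addr)

def lookup(word):
    addresses = inverted_file.get(word, [])   # one seek into the index
    return [records[a] for a in addresses]    # one seek per matching record

print(lookup("alkanes"))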
CD-ROM: POTENTIAL MARKETS FOR INFORMATION
Julie B. Schwerin, Info Tech, P.O. Box 633, Pittsfield, VT 05762.
With the availability of CD-ROM, users and producers of information
products are confronted with a new information delivery medium having
different characteristics from anything else that exists today. As this
new medium is being introduced in various markets, we are discovering the
difference between CD-ROM as ``a new way to look at how we produce and
consume information products,'' and ``another variation on a familiar
theme.'' From the beginning, the opportunity for CD-ROM in information
markets has, apart from a few limitations, been characterized as broad
and rich, virtually unlimited in its applications. When approached this
way, CD-ROM challenges current practices of publishing and integrating
information in a fundamental way. As the medium is introduced in markets
today, in its very early stages, it is very limited in its application as
compared with current products and represents more of a variation than a
revolution in information consumption behavior. Yet as users and
producers alike experiment and gain confidence in using CD-ROM, its full
potential will be realized for both users and producers.
(JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE, Vol. 39,
No. 1, pp. 54-57, 1988)
HYPERMEDIA: FINALLY HERE
Tekla S. Perry, Field Editor
Reading a book, listening to music, watching a movie: all these
traditional means of obtaining information are linear. Every reader,
listener, or viewer starts at the beginning and proceeds down the same
path to the predetermined ending.
The thought process, however, is not linear. The mind jumps from present
to past to future, thoughts linked by associations that bring up images,
words, scents, and those haunting melodies that linger in your head for days.
In 1965 Ted Nelson, a writer and philosopher, coined the word hypertext,
which he defined simply as nonlinear reading and writing. He saw computer
networks then being developed as the mechanism for hypertext information
storage and access, and soon expanded his vision to embrace hypermedia -
ways of conveying information that besides text would also incorporate
sounds and images.
Given the computer technology of the 1960s and 1970s, however, hypertext
was not a workable concept. Several recent technological advances
have sparked a new wave of interest in hypertext and hypermedia, beyond
the theoretical. Today's technology can at last create a practical
hypermedia system.
(IEEE SPECTRUM, pp. 38-45, 1987)
PARALLEL TEXT SEARCH METHODS
Gerard Salton and Chris Buckley
A comparison of recently proposed parallel text search methods to
alternative available search strategies that use serial processing
machines suggests parallel methods do not provide large-scale gains in
either retrieval effectiveness or efficiency.
(COMMUNICATIONS OF THE ACM, Vol. 31, No. 2, pp. 202-215, 1988)
FOLIOPUB: A PUBLICATION MANAGEMENT SYSTEM
Johann H. Schlichter and Leslie Jill Miller, Xerox Corporation.
In contrast to desktop publishing systems, production publishing
systems, such as Xyvision, are used for large documents generated by
many people. Possible application areas include in-plant publishing of
technical manuals and complex reports with many types and sources of content.
The tasks of writing, editing, illustrating, and page layout are usually
performed by different people. Thus, production publishing requires a
sophisticated publication management system that coordinates tasks, manages
the data produced by the people performing these tasks, and supports
processing operations.
The prototype system described in this paper captures and tracks input
from 15 authors and graphics specialists and enforces a uniform style on
the 200-page quarterly report produced.
(IEEE COMPUTER, Vol. 21, No. 1, pp. 61-69, 1988)
OPTICAL DISKS BECOME ERASABLE
Robert P. Freese, Alphatronix Inc.
Since the computer was invented, storage and retrieval of digital
information has been a major challenge. Engineers have continually sought
to develop more convenient storage methods to hold more data and make the
data easier to access. Today's technologies include paper, microfilm,
magnetic tape, floppy disks, CD-ROM, and write-once read-many (WORM)
optical disks. With roughly the same 600-Mbyte capacity as CD-ROM, WORM
disks are the closest thing users have so far to erasable optical storage. But
information recorded on a WORM disk can neither be erased nor rerecorded.
Although erasable optical recording has been under discussion for
several years, magneto-optic technology is about to bring this capability
to complete, commercial data-storage systems, including some
for desktop computers and workstations. Data storage may never be the same.
(IEEE SPECTRUM, pp. 41-45, 1988)
INTERMEDIA: THE CONCEPT AND THE CONSTRUCTION OF A SEAMLESS INFORMATION
ENVIRONMENT
Nicole Yankelovich, Bernard J. Haan, Norman K. Meyrowitz, and Steven M.
Drucker, Brown University.
Hypermedia is simply an extension of hypertext that incorporates
other media in addition to text. With a hypermedia system, authors can
create a linked body of material that includes text, static graphics,
animated graphics, video, and sound.
A hypermedia system expressly developed for use in a university setting,
Intermedia, provides a framework for object-oriented, direct
manipulation editors and applications. With it, instructors can construct
exploratory environments for their students as well as use applications
for day-to-day work, research, and writing. Intermedia is also an
environment in which programmers can develop consistent applications, using
object-oriented programming techniques and reusable building blocks.
(IEEE COMPUTER, Vol. 21, No. 1, pp. 81-96, 1988)
FINDING FACTS VS. BROWSING KNOWLEDGE IN HYPERTEXT SYSTEMS
Gary Marchionini and Ben Shneiderman, University of Maryland.
For hypertext and electronic information systems to be effective,
designers must understand how users find specific facts, locate fragments
of text that satisfy information queries, or just browse. Users' success
in information retrieval depends on the cognitive representation (mental
model) of a system's features, which is largely determined by the
conceptual model designers provide through the human-computer interface.
Other determinants of successful retrieval include the users' knowledge
of the task domain, information-seeking experience, and physical setting.
In this article we present a user-centered framework for
information-seeking that has been used in evaluating two hypertext
systems. We then apply the framework to key design issues related to
information retrieval in hypertext systems.
(IEEE COMPUTER, Vol. 21, No. 1, pp. 70-80, 1988)
CREATION AND DISTRIBUTION OF CD-ROM DATABASES FOR THE LIBRARY
REFERENCE DESK
Ron J. Rietdyk, Vice President, SilverPlatter Information Services
Inc., 37 Walnut Street, Wellesley Hills, MA 02181.
SilverPlatter has been delivering CD-ROM products to the library
reference market since August 1986. Before that, the product was tested for
about three months at a limited number of libraries. This article
summarizes our experiences and gives some first observations on the
use of this exciting new technology in libraries. Three important
groups are discussed:
Information Providers
Librarians
End-Users in the library
All three groups have different interests and concerns. A list of
the most significant advantages and objections within each group is given.
The article offers ideas about how to overcome the often very real
objections of the different players in this marketplace.
(JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE, Vol. 39,
No. 1, pp. 58-62, 1988)
TOOLS AND METHODS FOR COMPUTATIONAL LEXICOLOGY
Roy J. Byrd, Nicoletta Calzolari, Martin S. Chodorow, Judith L. Klavans,
Mary S. Neff, and Omneya A. Rizk, IBM T.J. Watson Research Center,
Yorktown Heights, New York 10598.
This paper presents a set of tools and methods for acquiring,
manipulating, and analyzing machine-readable dictionaries. We give several
detailed examples of the use of these tools and methods for particular
analyses. A novel aspect of our work is that it allows the combined
processing of multiple machine-readable dictionaries. Our examples describe
analyses of data from Webster's Seventh Collegiate Dictionary, the
Longman Dictionary of Contemporary English, the Collins bilingual
dictionaries, the Collins Thesaurus, and the Zingarelli Italian
dictionary. We describe existing facilities and results they have
produced as well as planned enhancements to those facilities, particularly
in the area of managing associations involving the senses of polysemous
words. We show how these enhancements expand the ways in which we can
exploit machine-readable dictionaries in the construction of large
lexicons for natural language processing systems.
(COMPUTATIONAL LINGUISTICS, Vol. 13, No. 3-4, pp. 219-240, 1987)
LARGE LEXICONS FOR NATURAL LANGUAGE PROCESSING: UTILIZING THE
GRAMMAR CODING SYSTEM OF LDOCE
Bran Boguraev, University of Cambridge Computer Laboratory, Corn
Exchange Street, Cambridge, CB2 3QG, England.
Ted Briscoe, Department of Linguistics, University of Lancaster,
Bailrigg, Lancaster LA1 4YT, England.
This article focuses on the derivation of large lexicons for natural
language processing. We describe the development of a dictionary
support environment linking a restructured version of the Longman Dictionary
of Contemporary English to natural language processing systems. The
process of restructuring the information in the machine readable version of
the dictionary is discussed. The Longman grammar code system is used to
construct `theory neutral' lexical entries. We demonstrate how such
lexical entries can be put to practical use by linking up the system
described here with the experimental PATR-II grammar development
environment. Finally, we offer an evaluation of the utility of the
grammar coding system for use by automatic natural language parsing systems.
(COMPUTATIONAL LINGUISTICS, Vol. 13, No. 3-4, pp. 203-218, 1987)
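[Hypothetical sketch of the idea; the real LDOCE grammar-code inventory and the paper's mapping are much richer than the invented codes below. Each code attached to a dictionary entry is translated into a theory-neutral feature list that a parser's lexicon can consume.]

CODE_FEATURES = {                 # invented, simplified grammar codes
    "I":  {"cat": "V", "subcat": "intransitive"},
    "T1": {"cat": "V", "subcat": "transitive", "object": "NP"},
    "T5": {"cat": "V", "subcat": "transitive", "object": "that-clause"},
}

def lexical_entries(headword, codes):
    # One feature structure per recognized grammar code.
    return [dict(CODE_FEATURES[c], lemma=headword) for c in codes if c in CODE_FEATURES]

for entry in lexical_entries("believe", ["T1", "T5"]):
    print(entry)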
ONE-PASS TEXT COMPRESSION WITH A SUBWORD DICTIONARY
Matti Jakobsson, University of Vaasa, Raastuvankatu 31, SF-65100,
Vaasa, Finland.
A new one-pass technique for compressing text files is presented as a
modification of the Ziv and Lempel compression scheme. The method replaces
parts of words in a text by references to a fixed-size dictionary which
contains the subwords of the text already compressed. An essential part
of the technique is the concept of reorganization. Its purpose is to
drop from the dictionary the parts which are never used. The
reorganization principle is based on observations of information theory
and structural linguistics. Through reorganization, the method
can adapt to any text file with no a priori knowledge of the nature
of the text.
(JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE, Vol. 39,
No. 4, pp. 262-269, 1988)
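[Hedged sketch of an LZ-style one-pass compressor with a bounded subword dictionary; Jakobsson's reorganization rules are more elaborate than the simple policy shown. Output is a list of (subword index, next character) pairs, and when the dictionary fills up, subwords that were never referenced are dropped.]

def compress(text, max_entries=16):
    dictionary = {"": 0}                 # subword -> index; 0 is the empty prefix
    next_index = 1
    used = {0}                           # indices actually referenced so far
    out, phrase = [], ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch                 # extend the current longest match
            continue
        out.append((dictionary[phrase], ch))
        used.add(dictionary[phrase])
        if len(dictionary) >= max_entries:
            # "Reorganization": drop subwords that have never been used.
            dictionary = {w: i for w, i in dictionary.items() if i in used}
        dictionary[phrase + ch] = next_index
        next_index += 1
        phrase = ""
    if phrase:
        out.append((dictionary[phrase], ""))   # flush a trailing match
    return out

print(compress("abababcabababcabab"))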
------------------------------
END OF IRList Digest
********************