IRList Digest Sunday, 27 Oct 1985 Volume 1 : Issue 17
Today's Topics:
References - Abstracts of Talks at ACM SIGIR 85 in Montreal
----------------------------------------------------------------------
From: "V.J. Raghavan" <ihnp4!sask!regina!raghavan@UCB-VAX>
Date: Tue, 27 Aug 85 16:02:23 cst
Subject: Conf. '85
[This year's Montreal Conference on IR had many interesting papers.
A file, in "SMART" form, of the abstracts, was typed in at Univ.
of Regina and edited at Virginia Tech. Order info for proceedings is:
Proc. of the Eighth Annual Int. ACM SIGIR Conf. on R&D in Inf. Ret.,
ACM Order No. 606850 -- Ed]
.I 1
.T
Graphical Information Resources: Maps and Beyond
.A
MICHAEL LESK
.W
The rise of computer graphics offers a new challenge for information
retrieval: how to search and retrieve information which is partly or
wholly graphical. As an example, procedures for handling geographical
information, such as street maps and directories, are explained. With
this data, it is possible to find routes on maps, retrieve locations
and names of people or businesses, and draw maps. But a comparison
of these programs with programs for face processing or computer
typesetting makes clear how far we are from general purpose routines.
Today, successful graphics routines contain a great deal of local
domain knowledge. There is no analog of the simple keyword systems
that handle textual documents in any subject area. Just as
computational linguists have found that subject matter expertise is
necessary to do really sophisticated processing of English, it seems
also necessary for sophisticated processing of pictures; the difference
is that we don't know how to do unsophisticated processing of graphics.
.I 2
.T
Implications of Boolean Structure for Probabilistic Retrieval
.A
ABRAHAM BOOKSTEIN
.W
The purpose of an information retrieval system is to retrieve documents
in response to a request. However, there are many strategies as to how
this can be accomplished. Most popular are the Boolean systems attached
to the commercial on-line database systems. Less conspicuous are a number
of approaches being developed in university research laboratories, which
are the object of this paper.
This paper is divided into two parts: The first will be a more or less
nontechnical overview of research in IR over the last ten to fifteen
years. The second part, which is a bit more technical and a lot more
speculative, will cover some interesting questions that are now being
studied.
.I 3
.T
Generalized Vector Space Model In Information Retrieval
.A
S. K. M. WONG
WOJCIECH ZIARKO
PATRICK C. N. WONG
.W
In information retrieval, it is common to model index terms and documents
as vectors in a suitably defined vector space. The main difficulty with
this approach is that the explicit representation of term vectors is not
known a priori. For this reason, the vector space model adopted by
Salton for the SMART system treats the terms as a set of orthogonal
vectors. In such a model it is often necessary to adopt a separate,
corrective procedure to take into account the correlations between terms.
In this paper, we propose a systematic method (the generalized vector
space model) to compute term correlations directly from the automatic
indexing scheme. We also demonstrate how such correlations can be
included with minimal modification in the existing vector based
information retrieval systems. The preliminary experimental results
obtained from the new model are very encouraging.
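[The generalized similarity can be illustrated with a small sketch: term
correlations derived from co-occurrence in the term-document collection
fold into the usual inner product as d'Gq. This is an illustrative
formulation, not necessarily the authors' exact construction. -- Ed]

```python
# Sketch: fold term correlations into vector-space similarity.
# G = D^T D (term-term co-occurrence) generalizes the dot product
# d . q into d^T G q, so correlated terms reinforce each other.
# Illustrative only, not the paper's exact generalized model.

def term_correlations(docs):
    """docs: list of term-weight dicts. Returns {(t1, t2): weight}."""
    corr = {}
    for d in docs:
        for t1, w1 in d.items():
            for t2, w2 in d.items():
                corr[(t1, t2)] = corr.get((t1, t2), 0.0) + w1 * w2
    return corr

def generalized_sim(doc, query, corr):
    """Similarity d^T G q using the correlation matrix G."""
    return sum(dw * corr.get((t1, t2), 0.0) * qw
               for t1, dw in doc.items()
               for t2, qw in query.items())

docs = [{"ship": 1.0, "boat": 1.0}, {"ship": 1.0, "ocean": 1.0}]
corr = term_correlations(docs)
# "boat" and "ship" co-occur, so a query on "boat" now also
# rewards documents that only mention "ship".
print(generalized_sim({"ship": 1.0}, {"boat": 1.0}, corr))
```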
.I 4
.T
Handling Multiple Data Bases in Document Retrieval
.A
IAN A. MACLEOD
.W
There is no such thing as a standard document. Bibliographic information
comes in a wide variety of formats. Existing retrieval systems handle
different document styles either by creating an artificial document type
or by providing different and independent data bases. Neither approach
seems satisfactory. In this paper we describe a data model which we feel
is more appropriate for document representation and show that it can handle
the multiple document type problem quite naturally.
.I 5
.T
Theoretical Measures in P/Q Document Spaces
.A
ROBERT R. KORFHAGE
.W
Early work on the use of profiles to aid in the interpretation of queries
in an information retrieval system developed the concepts of retrieval
subspaces that are either ellipsoids or Cassini ovals. This report extends
the theoretical basis of these models, exploring other models and the
effect of different metrics on the conceptual retrieval subspace.
.I 6
.T
Composite Document Extended Retrieval - An Overview
.A
EDWARD A. FOX
.W
'Composite document' is the name given herein to items which possess
some complex internal structure and are made up of a variety of types
of data, including a significant amount of textual material. Examples
are: elaborate bibliographic records, office correspondence or reports,
electronic mail, digests from computer bulletin boards, entries in online
name servers for network users, books, and parts of some expert system
knowledge bases. Investigation of composite documents is being done in
connection with research in passage retrieval, database management, office
modeling of documents, natural language analysis, and information retrieval.
Preliminary results indicate that extending the vector space model leads to
better document representation, and that extending standard Boolean logic
query capabilities results in more effective searches. Further study with
a variety of specially prepared test collections will aid evaluation of the
several models and methods proposed.
.I 7
.T
Automatic Assignment of Soft Boolean Operators
.A
GERARD SALTON
ELLEN VOORHEES
.W
The conventional bibliographic retrieval systems are based on Boolean
query formulations and inverted file implementations. Such systems
provide rapid responses in answer to search queries but they are not
easy to use by uninitiated patrons. An extended Boolean retrieval
strategy has been devised in which the Boolean operators are treated
more or less strictly, depending on the setting of a special parameter,
known as the p-value. The extended system is much more forgiving than
the conventional system, and provides better retrieval effectiveness.
In this study various problems associated with the determination of
appropriate p-values are discussed, and suggestions are made for an
automatic assignment of p-values. Evaluation output is included to
illustrate the operations of the suggested procedures.
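[The behavior of the p-value can be sketched with the standard p-norm
formulas for soft Boolean operators; the paper's procedures for assigning
p-values automatically are not reproduced here. -- Ed]

```python
# Sketch of the p-norm "soft" Boolean operators. The p-value sets
# how strictly OR/AND behave: p=1 averages term weights like a
# vector model; as p grows, OR approaches max and AND approaches
# min, recovering conventional strict Boolean logic.

def p_or(weights, p):
    n = len(weights)
    return (sum(w ** p for w in weights) / n) ** (1.0 / p)

def p_and(weights, p):
    n = len(weights)
    return 1.0 - (sum((1.0 - w) ** p for w in weights) / n) ** (1.0 / p)

w = [0.9, 0.1]              # document weights for two query terms
print(p_or(w, 1))            # 0.5 -- behaves like an average
print(p_and(w, 1))           # 0.5 -- same at p=1
print(p_or(w, 100))          # close to max(w): strict OR
print(p_and(w, 100))         # close to min(w): strict AND
```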
.I 8
.T
Output Ranking Methodology for Document-Clustering-Based
Boolean Retrieval Systems
.A
TADEUSZ RADECKI
.W
During the last two decades or so several alternative approaches to
document clustering have been introduced, each of which constitutes
a basis for a number of various clustering methods. Unfortunately,
these methods are limited in many ways. In particular, most of the
available document clustering procedures have been designed to be
applicable in the cases where both documents and user queries are
represented by sets of weighted or unweighted index terms. However,
in commercial document retrieval systems queries are usually
characterized by Boolean expressions of variables corresponding to
the index terms of interest to the system's users. Therefore, it
would be desirable to extend the conventional approaches to document
clustering with the aim of implementing them in operational
environments. In developing these extended approaches, one of the
most important problems to be solved appears to be related to the
system's adaptation to individual users. More specifically, in the
course of such an adaptation the system should utilize all the
information available about the user's needs, and document and query
representations should be modified accordingly, so that the most
relevant documents could be retrieved. This paper presents an
account of the author's most recent research along these lines.
After providing a background for widely known approaches to
document clustering, rigorous reasoning for extending these
approaches to a conventional Boolean retrieval framework is given.
.I 9
.T
The New Oxford English Dictionary and its Potential Users:
Some Preliminary Comments
.A
JOHN STUBBS
.W
On May 15, 1984, the Oxford University Press formally announced its
intention to computerize the Oxford English Dictionary. The OED and
its Supplement, which comprise approximately 60 million words, will
be entered into a computer; a database design will be created for this
integrated version of the Dictionary; and, in addition to its printed
form, the Dictionary will be available in electronic form.
.I 10
.T
Processing Free-Text Input to Obtain a Database of Medical Information
.A
EMILE C. CHI
CAROL FRIEDMAN
NAOMI SAGER
MARGARET S. LYMAN
.W
The Linguistic String Project of New York University has developed computer
programs that convert the information in free-text documents of a technical
specialty into a structured form suitable for mapping into a relational
database. The processing is based upon the restrictions on the use of
language that are characteristic of the subject matter and the document type.
These restrictions are summarized in a "sublanguage grammar" that provides a
set of word classes and formulas corresponding to the objects and relations
of interest in the domain. The programs are independent of the particular
sublanguage grammar employed. The application to narrative patient records
will be described and the applicability of the methods to other domains
discussed.
.I 11
.T
An Approach to Multikey Sequencing in an Equiprobable Keyterm
Retrieval Situation
.A
B. G. T. LOWDEN
.W
We present a simple generalized technique for sequencing a multi-
attribute file, which can be used in a situation where the query
pattern is unknown and the term content equiprobable. The method
is based on constructing a short spanning path through the records
thereby minimizing the sum of the Hamming distances between them.
Retrieval performance is simulated over a range of query expressions
and results suggest a significant reduction in the number of block
accesses using this technique as compared with records randomly
distributed over the file space.
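[A greedy (not optimal) version of the short-spanning-path idea can be
sketched as follows: order the records so that neighbours differ in few
attribute bits. -- Ed]

```python
# Sketch: order records along a short (greedy, not optimal) spanning
# path so adjacent records differ in few attribute bits; similar
# records then land in the same blocks, which reduces block accesses
# for partial-match queries.

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def greedy_spanning_path(records):
    path = [records[0]]
    remaining = list(records[1:])
    while remaining:
        last = path[-1]
        nxt = min(remaining, key=lambda r: hamming(last, r))
        remaining.remove(nxt)
        path.append(nxt)
    return path

recs = ["1100", "0011", "1110", "0001"]
path = greedy_spanning_path(recs)
print(path)  # similar bit patterns become neighbours
print(sum(hamming(a, b) for a, b in zip(path, path[1:])))
```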
.I 12
.T
Optimization of Inverted Vector Searches
.A
CHRIS BUCKLEY
ALAN F. LEWIT
.W
A simple algorithm is presented for increasing the efficiency of
information retrieval searches which are implemented using inverted
files. This optimization algorithm employs knowledge about the
methods used for weighting document and query terms in order to
examine as few inverted lists as possible. An extension to the
basic algorithm allows greatly increased performance optimization
at a modest cost in retrieval effectiveness. Experimental runs are
made examining several different term weighting models and showing
the optimization possible with each.
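[The flavor of the optimization can be sketched as upper-bound pruning in
the spirit of the paper, not the authors' exact algorithm. The sketch
assumes document term weights lie in [0, 1], so each remaining list can
add at most its query weight to any score; already-seen documents may
still reorder, making this the approximate (faster) variant. -- Ed]

```python
# Sketch: process inverted lists in decreasing order of query-term
# weight and stop once the unseen lists cannot lift any unseen
# document into the top k.

import heapq

def ranked_search(inverted, query, k):
    """inverted: term -> {doc: weight}; query: term -> weight."""
    terms = sorted(query, key=query.get, reverse=True)
    scores = {}
    for i, t in enumerate(terms):
        for doc, w in inverted.get(t, {}).items():
            scores[doc] = scores.get(doc, 0.0) + query[t] * w
        # upper bound on what the remaining (lighter) lists can add
        remaining = sum(query[u] for u in terms[i + 1:])
        top = heapq.nlargest(k, scores.values())
        if len(top) == k and top[-1] >= remaining:
            break          # no unseen document can reach the top k
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

inv = {"cat": {"d1": 0.9, "d2": 0.2},
       "dog": {"d2": 0.8},
       "fish": {"d3": 0.5}}
print(ranked_search(inv, {"cat": 1.0, "dog": 0.5, "fish": 0.1}, 2))
# the "fish" list is never read: d1 and d2 already dominate
```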
.I 13
.T
P-Trees: Storage Efficient Multiway Trees
.A
DAVID M. ARNOW
AARON M. TENENBAUM
CONNIE WU
.W
A new variation of high order multiway tree structures, the P-tree, is
presented. P-trees have average access costs that are significantly
better than those of B-trees and are no worse (and often better) in
storage utilization. Unlike compact B-trees, they can be maintained
dynamically, and unlike dense multiway trees and B-trees, their
associated insertion algorithm, which is also presented, is cheap and
involves (at most) a very localized rearrangement of keys.
.I 14
.T
Efficient Variants of Huffman Codes in High Level languages
.A
Y. CHOUEKA
S. T. KLEIN
T. PERL
.W
Although it is well-known that Huffman Codes are optimal for text
compression in a character per character encoding scheme, they are
seldom used in practical situations since they require a bit-by-bit
decoding algorithm, which has to be written in some assembly language
and will perform rather slowly. A number of methods and some slight
variants are here presented that avoid these difficulties. The decoding
algorithms efficiently process the encoded string on a byte-per-byte
basis, and can be programmed in any high level language. This is achieved
at the cost of reducing the net compression savings by a negligible amount,
and storing some tables in the internal memory. Experimental results are
presented.
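[The byte-per-byte idea can be sketched in a high level language:
precompute, for each decoder state (a position in the Huffman tree) and
each possible input byte, the symbols emitted and the next state, so
decoding is one table lookup per byte. The authors' exact table layouts
and variants differ. -- Ed]

```python
import heapq, itertools

def huffman_tree(freqs):
    """Returns the tree as nested tuples: leaf=symbol, node=(l, r)."""
    count = itertools.count()            # tie-breaker for the heap
    heap = [(f, next(count), s) for s, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(count), (a, b)))
    return heap[0][2]

def codes(tree, prefix=""):
    if not isinstance(tree, tuple):      # leaf: a symbol
        return {tree: prefix}
    d = codes(tree[0], prefix + "0")
    d.update(codes(tree[1], prefix + "1"))
    return d

def build_byte_table(tree):
    """(state, byte) -> (symbols decoded, state after the byte)."""
    states = []
    def collect(n):
        if isinstance(n, tuple):
            states.append(n)
            collect(n[0]); collect(n[1])
    collect(tree)
    table = {}
    for state in states:
        for byte in range(256):
            out, cur = [], state
            for i in range(7, -1, -1):   # walk the byte's 8 bits
                cur = cur[(byte >> i) & 1]
                if not isinstance(cur, tuple):
                    out.append(cur)
                    cur = tree           # restart at the root
            table[(state, byte)] = (out, cur)
    return table

def decode(table, root, data, n_symbols):
    out, state = [], root
    for byte in data:
        symbols, state = table[(state, byte)]
        out.extend(symbols)
    return "".join(out[:n_symbols])      # drop padding artifacts

text = "abracadabra"
tree = huffman_tree({c: text.count(c) for c in set(text)})
code = codes(tree)
bits = "".join(code[c] for c in text)
bits += "0" * (-len(bits) % 8)           # pad to whole bytes
data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
table = build_byte_table(tree)
print(decode(table, tree, data, len(text)))  # abracadabra
```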
.I 15
.T
Optimization of a Hierarchical File Organization for Spelling Correction
.A
TETSURO ITO
CLEMENT T. YU
.W
The application of a hierarchically organized file to spelling check and
correction problems seems to be promising, since it can correct more than
the common typing mistakes. However, the speed of retrieving records
that exactly match the input is rather slow, and the time for computing
similarity values is not negligible. Here some techniques to overcome
these defects are presented along with the approximate estimation of the
improved file search time.
.I 16
.T
Designing an Information Retrieval Interface Based on User Characteristics
.A
CHRISTINE L. BORGMAN
DONALD O. CASE
CHARLES T. MEADOW
.W
With the increasing number of information retrieval systems and databases
available and the increasing demand from end users to use these systems,
a need exists for improved training mechanisms. This paper reports on a
project to develop an integrated online instruction and assistance system
to be used as a "front end" to the U.S. Department of Energy's RECON
retrieval system. The conceptual framework for the interface design will
be based on individual characteristics of its current and prospective
users, predominantly scientists conducting energy research. We will
build a prototype based on information gained from interviews with
scientists using the system (either directly or through a search
intermediary) and interviews with search intermediaries. In addition, we
will outline a separate interface tailored to search intermediary needs,
based on interviews and on known differences between end users and
intermediaries in information seeking behavior and conceptual views of
information systems. The paper reports on research in progress through
the initial interviews. The remaining parts of the project are outlined.
.I 17
.T
Different levels of expertise for an expert system in information retrieval
.A
B. DEFUDE
.W
The aim of this article is to describe the specifications of an Information
Retrieval Expert System (IRES). Expert systems are a recent approach that
seems well suited to building efficient and user-friendly retrieval
systems. The capabilities offered by classical Information Retrieval
Systems (IRS) are not adequate; this inadequacy stems from some important
restrictions:
- a limited indexing vocabulary, very often designed a priori;
- search terms that have to belong to this vocabulary, with only
some simple semantic relations (such as synonymy) usable during
the search.
Consequently, if the terms used in the query are not those used for
indexing, the system does not give a good answer.
.I 18
.T
One-Time Complete Indexing of Text: Theory and Practice
.A
RAYMOND J. D'AMORE
CLINTON P. MAH
.W
N-gram indexing provides a significant alternative to the keyword indexing
and full text scanning methods. N-grams are sets of overlapping word
fragments selected for indexing according to information-theoretic
considerations. They provide complete, fixed indexing of all words, numbers,
and special terms in text. The behavior of such an n-gram system can be
statistically modeled and validated over a wide range of text. The model
provides a descriptive and predictive tool for controlling precision and
recall during retrievals and scaling estimates of relevance to an adaptive,
reference noise distribution for a given collection. Partial inversion
and simple data compression are among the capabilities that make n-gram
systems performance-competitive with other approaches.
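[An illustrative reduction of n-gram indexing: index every word by its
overlapping fragments, so any term, number, or misspelled variant can be
matched without a curated keyword vocabulary. The paper's
information-theoretic n-gram selection is omitted. -- Ed]

```python
# Sketch: index words by overlapping trigrams and rank documents by
# how many of a query term's trigrams they share. Misspellings still
# overlap on most trigrams, so they match the intended documents.

def ngrams(word, n=3):
    padded = f" {word} "                 # mark word boundaries
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def build_index(docs):
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            for g in ngrams(word):
                index.setdefault(g, set()).add(doc_id)
    return index

def search(index, term):
    """Rank documents by shared n-gram count with the term."""
    hits = {}
    for g in ngrams(term.lower()):
        for doc_id in index.get(g, ()):
            hits[doc_id] = hits.get(doc_id, 0) + 1
    return sorted(hits, key=hits.get, reverse=True)

docs = {1: "retrieval of text", 2: "graphic displays"}
index = build_index(docs)
print(search(index, "retrieval"))   # [1]
print(search(index, "retreival"))   # misspelling still finds doc 1
```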
.I 19
.T
Experiments with cited titles for automatic document indexing and similarity
measure in a probabilistic context
.A
K. L. KWOK
.W
Most people will agree that the ultimate goal of an information retrieval
(IR) system would be to incorporate as many artificial intelligence
techniques as possible so that the system would be able to reply to queries
with specific answers (so-called fact retrieval). Current capabilities for
large-scale general-purpose retrieval, however, are still at the document
retrieval level, namely retrieving documents that most probably contain
answers to a query. This may not be a transient stage at all; it has been
pointed out that document retrieval may serve as a pre-processing stage
for the more detailed fact retrieval, even if the problems associated with
the latter are assumed solved.
.I 20
.T
A Learning Algorithm Applied to Document Redescription
.A
MICHAEL D. GORDON
.W
Relevance feedback in document retrieval usually involves an inquirer
revising a query to obtain better retrieval effectiveness [Rocchio] and
[Salton]. The revised query is adjusted to correspond to the descriptions
of known documents which the inquirer has found relevant. Several problems
exist with such methods, however, and will be described in this paper.
.I 21
.T
The Cluster Hypothesis Revisited
.A
ELLEN M. VOORHEES
.W
A new means of evaluating the cluster hypothesis is introduced and the
results of such an evaluation are presented for four collections. The
results of retrieval experiments comparing a sequential search, a cluster-
based search, and a search of the cluster collection in which individual
documents are scored against the query are also presented. These results
indicate that while the absolute performance of a search on a particular
collection is dependent on the pairwise similarity of the relevant documents,
the relative effectiveness of clustered retrieval versus sequential retrieval
is independent of this factor. However, retrieval of entire clusters in
response to a query usually results in a poorer performance than retrieval
of individual documents from clusters.
.I 22
.T
.A
C. T. YU
Y. T. WANG
C. H. CHEN
.W
It is well known [Salt, LaYu] that the retrieval of documents stored in
clusters allows queries to be answered efficiently. Furthermore,
clustered retrieval may sometimes gain in retrieval effectiveness over
relevance term weighting [Crof]. It has also been shown [VanR] that
clustering of terms, when used properly (e.g. in tree dependence model
[HaVa]) will gain significant improvement over retrieval processes based
on the independence of terms [RoSp, YuSa]. Although numerous clustering
algorithms [Salt, VanR, Ever] have been proposed and tested, the
following problems remain.
1) the clustering algorithms are extremely expensive to use, if
the number of objects (documents or terms) is large;
2) when the user environment changes, it is necessary to perform
reclustering;
3) most clustering algorithms cluster objects based on syntactic
relationships between the objects only. Information which
is extractable from previous users is ignored.
.I 23
.T
Concepts of the Cover Coefficient-Based Clustering Methodology
.A
FAZLI CAN
ESEN A. OZKARAHAN
.W
Document clustering has several unresolved problems. Among them are high
time and space complexity, difficulty of determining similarity thresholds,
order dependence, nonuniform document distribution in clusters, and
arbitrariness in determination of various cluster initiators. To overcome
these problems to some degree, the cover coefficient based clustering
methodology has been introduced. The concepts used in this methodology
have created certain new concepts, relationships, and measures such as the
effect of indexing on clustering. These new concepts are discussed. The
result of performance experiments that show the effectiveness of the
clustering methodology and the matching function are also included. In
these experiments, it has been also observed that the majority of the
documents obtained in a search are concentrated in a few clusters
containing a low percentage of documents of the database.
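[One common formulation of the cover coefficient can be sketched as
follows: it estimates how much document i is "covered" by document j
through their shared terms. The full clustering methodology built on it
is omitted. -- Ed]

```python
# Sketch: cover coefficient matrix C from a binary document-term
# matrix D. C[i][j] weights each shared term by how selective it is
# (terms appearing in few documents count more); each row of C sums
# to 1, and C[i][i] measures how distinct document i is.

def cover_coefficients(D):
    """D: binary document-term matrix (list of rows). Returns C."""
    n, m = len(D), len(D[0])
    row = [sum(D[i]) for i in range(n)]                # terms per doc
    col = [sum(D[i][k] for i in range(n)) for k in range(m)]
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = sum(D[i][k] * D[j][k] / col[k]
                    for k in range(m) if col[k])
            C[i][j] = s / row[i] if row[i] else 0.0
    return C

D = [[1, 1, 0],
     [1, 0, 1],
     [0, 0, 1]]
C = cover_coefficients(D)
print([round(x, 2) for x in C[0]])   # row sums to 1
```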
.I 24
.T
The LIVE Project - Retrieval Experiments Based on Evaluation Viewpoints
.A
P. BOLLMANN
F. JOCHUM
U. REINER
V. WEISSMANN
H. ZUSE
.W
The LIVE project (Leistungsbewertung von Information Retrieval VErfahren)
at the Technische Universitaet Berlin, West Germany, is concerned with the
evaluation of information retrieval systems. Two fields are mainly under
investigation. The first is the methodological foundations of retrieval
experiments, where results of the LIVE project fall into three areas:
measurement-theoretic criteria for the application of similarity and
evaluation measures in retrieval experiments have been considered and
developed; statistical principles of experimental design have been applied
to information retrieval; and control structures of factors in retrieval
tests, together with some aspects of statistical models for experimentation
in information retrieval, have been investigated. The second topic of the
LIVE project is the conduct of a retrieval experiment in cooperation with
FIZ4, an information service center for mathematics, physics, and energy
in Karlsruhe, West Germany. Among the many databases which FIZ4 offers,
the LIVE project used the physics database in its retrieval experiment.
.I 25
.T
On Retrieval Tests with an Inhomogeneous Query Collection
.A
FRIEDBERT JOCHUM
.W
Retrieval tests are often performed with an inhomogeneous query collection,
that is, one in which certain properties of the queries vary. This paper
investigates the conditions under which this inhomogeneity may lead to a
loss of validity in retrieval tests and should be controlled by an
experimental design. In particular, the interrelationship between the
experimental design of a retrieval test, the inhomogeneity of the query
collection used, and the measurement-theoretic properties of the evaluation
measure used will be discussed.
.I 26
.T
The Impact of VLSI Generalized Associative Memory Devices on the Intelligence
and Architecture of Computer Systems
.A
D. R. MCGREGOR
J. R. MALONE
M. HENNING
.W
An intelligent system must function at different time-scales:
- fast associative recall of previously encountered information
- more complete, but slower search for hypotheses, checks for
consistency and so on. We must not expect automatic systems
to do these in seconds.
This paper describes a novel database system which uses a new type of
associative memory (now being implemented in VLSI) to improve its speed
of response.
.I 27
.T
A Testbed for Information Retrieval Research:
The Utah Retrieval System Architecture
.A
LEE A. HOLLAAR
.W
The Utah Retrieval System Architecture provides an excellent testbed for
the development and testing of new algorithms or techniques for information
retrieval. URSA is a message-based structure capable of running on a
variety of system configurations, ranging from a single mainframe processor
to a system distributed across a number of dissimilar processors. It can
readily support a variety of specialized backend processors, such as
high-speed printers.
The architecture divides the components of a text retrieval system into
two classes: servers and clients. A triple of servers (index, search, and
document access) for each database provides the capabilities normally
associated with a retrieval system. Possible clients for these servers
include a window-based user interface, whose query language can be easily
modified, a connection to a mainframe host processor, or AI-based query
modification programs that wish to use the database.
Any module in the system can be replaced by a new module using a different
algorithm as long as the new module complies with the message formats for
that function. In fact, with some care this module switch can occur while
the system is running, without affecting the users. A monitor program
collects statistics on all system messages, giving information regarding
query complexity, processing time for each module, queuing times, and
bandwidths between every module.
This paper discusses the background of URSA and its structure, with
particular emphasis on the features that make it a good testbed for
information retrieval techniques.
.I 28
.T
An Integrated Hierarchical File Organization For Data Selection and Retrieval
.A
SAU-LAN LEE
.W
This paper presents an approach, the IHF approach, to organizing data with
a hierarchical structure into one single file while maintaining adequate
representation of the relationships among the data. The IHF approach
requires each data item to be stored in the basic form of a name-value
pair, and consists of assigning each data item an identifier which provides
not only the data item's identification but also an implicit identification
of the hierarchical structural relationships among groups of data items.
Data selection and retrieval operations are effectively implemented with
the concept of masking of relevant identifier bits, without resorting to
any linking devices. Application of the IHF technique to CAFS is examined.
.I 29
.T
RUBRIC: An Environment for Full Text Information Retrieval
.A
RICHARD M. TONG
VICTOR N. ASKMAN
JAMES F. CUNNINGHAM
CARL J. TOLLANDER
.W
This paper describes an ongoing research project that is concerned with
developing a computer-based aid for retrieval from unformatted full text
databases. Unlike other attempts to improve upon Boolean keyword retrieval
systems, our research concentrates on providing an easily used rule-based
language for expressing retrieval concepts. This language draws upon work
both in artificial intelligence and the theories of evidential reasoning to
allow the user to construct queries that have better precision and recall
than more traditional forms. The paper includes a discussion of our formal
model of retrieval, the rule language, and some comparative performance
results.
.I 30
.T
ANNOD - A Navigator of Natural-language Organized (Textual) Data
.A
ROBERT E. WILLIAMSON
.W
ANNOD is the name of a system developed at the National Library of Medicine
(NLM), which implements a set of linguistic and empirical techniques that
permit retrieval of natural language information in response to natural
language queries. The system is based on Dr. Gerard Salton's SMART document
retrieval system and is presently implemented on a mini-computer as part of
an Interactive Text Management System, ITEMS. Actual experience with
retrieval of information from NLM's Hepatitis Knowledge Base (HKB), an
encyclopedic, hierarchical, full-text file, is presented. The techniques
used in ANNOD include: automatic stemming of words, common word deletion,
thesaurus expansion, a complex empirical matching (ranking) algorithm
(similarity measure), and techniques expressly designed to permit rapid
response in a mini-computer environment. Preliminary testing demonstrates
high efficiency in identifying portions of a text which are relevant to users.
.I 31
.T
The User's Mental Model of an Information Retrieval System
.A
CHRISTINE L. BORGMAN
.W
An empirical study was performed to train naive subjects in the use of a
prototype Boolean logic-based information retrieval system on a bibliographic
database. Subjects were undergraduates with little or no prior computing
experience. Subjects trained with a conceptual model of the system performed
better than subjects trained with procedural instructions, but only on complex,
problem-solving tasks. Performance was equal in simple tasks. Differences in
patterns of interaction with the system (based on a stochastic process model)
showed parallel results. Most subjects were able to articulate some
description of the system operation, but few articulated a model similar to
the card catalog analogy provided in training. Eleven of 43 subjects were
unable to achieve minimal competency in system use. The failure rate was
equal between training conditions and genders: the only differences found
between those passing and failing the benchmark test were in academic major and
frequency of library use.
.I 32
.T
A Study of the Relationship Between User Profiles and User Queries
.A
MICHAEL A. SHEPHERD
.W
This research investigates the relationship between a user profile and a
user query by examining the degree of similarity of a user profile and a
query and the degree of similarity between the sets of documents retrieved
by the profile and the query, respectively. This relationship will form
the basis for the integration of a profile and a query into a single
enhanced query to retrieve a set of documents that reflects the similarity
between the sets of documents retrieved by the profile and by the query.
Such an integration will permit the development of an expert search
intermediary in which the user profile acts as a knowledge base in providing
a context for the interpretation of the query to the database.
.I 33
.T
A Conceptual Model and Experiments on How People
Classify and Retrieve Documents
.A
ALBERTO J. CANAS
FRANK SAFAYENI
DAVE W. CONRATH
.W
A better understanding of the way in which people classify
and search for information may be instrumental for designing more effective
automated systems. Cognitive psychologists have dealt with the problem of
categorization for many years (e.g. Rosch, 1978) but not within the context
of storage and retrieval. Their work has been based on the categorization
of simple stimuli (e.g. geometric shapes) and has not addressed the problem
of retrieval after categorization.
The purpose of this paper is twofold:
1) To present a conceptual model of the cognitive processes in the
storage and retrieval of documents.
2) To describe ongoing experiments and their potential implications
in the design of automated systems.
------------------------------
END OF IRList Digest
********************