IRList Digest Volume 2 Number 62
IRList Digest Friday, 28 November 1986 Volume 2 : Issue 62
Today's Topics:
Article - Automatic Indexing of Text For IR: A Conspectus (parts)
News addresses are ARPANET: fox%vt@csnet-relay.arpa BITNET: foxea@vtvax3.bitnet
CSNET: fox@vt UUCPNET: seismo!vtisr1!irlistrq
----------------------------------------------------------------------
Date: Thu, 13 Nov 86 11:49:21 -0100
From: Wyle <seismo!mcvax!ifi.ethz.chunet!wyle>
Subject: DRAFT of conspectus paper follows
...
The conspectus paper is quickly becoming stale news. We just got the
proceedings of the Pisa conference and I would like to add them to this
paper, but my neural network model is taking up all of my non-teaching
time.
Anyway, I think chapter 3 is appropriate for IRlist digest, and I would
really appreciate critique, comments, criticisms, etc.
By the time this paper is distributed, the links between chunet and
csnet, and chunet and arpa should make my e-mail address:
wyle@ifi.ethz.chunet.csnet or
wyle@ifi.ethz.chunet.arpa
but the old addresses:
...!decvax!seismo!mcvax!cernvax!ethz!Wyle
Wyle%ifi.ethz.cernvax.<many domains>
should still work ok.
Without further ado, here is the current DRAFT of my conspectus paper:
[Note: I have included part of section 1, and section 4 (bibliography)
as well as an aid to the reader. I have left in ^L (formfeeds) for
those who print these out and hope that won't hurt others. - Ed]
Automatic Indexing of Text For Information Retrieval: A Conspectus
M.F. Wyle
Institut fuer Informatik
Swiss Federal Institute of Technology
ETH / SOT
8092 Zuerich, Switzerland
1. Introduction
The problems of indexing text have been with human civilization since the
earliest collections of written language. It is not possible to
read everything we would like, nor to find the specific things we would
like to read. In our current information revolution, it is becoming
much easier to store and process larger and larger amounts of text. As
the quantity of text increases, the quality of its indexing must also
increase, in order to discriminate between the concepts contained in the
different texts. It is not clear that the quality of current indexing
methods is adequate to meet the challenge created by our information
revolution.
In this paper, we shall try to summarize the current state of indexing
methods and research. This first section is a very brief summary of
what automatic indexing is, and our perception of current indexing
methods. The second section summarizes recent publications, and the
final section outlines possible directions of future research.
... [Note: the rest of section 1 and all of section 2 are omitted. - Ed]
3. Current Avenues of Research
The cost of computing power continues to drop with the advent of new
technologies. The deficiencies of current text indexing methods, which
today plague only a few users, will soon become apparent to
everyone. The use of write once, read mostly (WORM) media will soon be
common, using compact disc technology. These devices are very dense
(500-600 Mbytes) and inexpensive. It will therefore be quite simple to
have a world telephone book on one 7 cm disk, the collected publications
of the ACM on another, and a large encyclopedia on a third. How can we
index this information effectively? Our conspectus of the current
research does not yield any instant solutions. However, there are some
promising results which deserve further study and analysis.
3.1 Thesauri
The use of thesauri in text indexing is one important area which has not
yet received the attention it deserves. The formal mathematical models
proposed by Schaeuble [Schae86] could be used to construct software
which will ensure the logical consistency of a thesaurus. A consistent
thesaurus could then be used to index text into consistent, specific
concept categories, which may in turn produce great performance improvements.
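One way to picture such consistency-checking software is the following
minimal sketch in Python. It assumes a thesaurus stored as a mapping from
each descriptor to its set of broader terms, and it checks two illustrative
rules: every broader term is itself defined, and no descriptor is
transitively broader than itself. The representation, the function name
find_inconsistencies, and both rules are assumptions made for this example;
they are not the formal model of [Schae86].

```python
def find_inconsistencies(broader):
    """Return a list of problems found in a broader-term hierarchy.

    `broader` maps each descriptor to the set of its broader terms.
    Two hypothetical consistency rules are checked:
      1. every broader term must itself be a descriptor in the thesaurus;
      2. no descriptor may be (transitively) broader than itself.
    """
    problems = []

    # Rule 1: report broader terms that are not defined as descriptors.
    for term, parents in sorted(broader.items()):
        for parent in sorted(parents):
            if parent not in broader:
                problems.append(("undefined", term, parent))

    # Rule 2: depth-first search; a node met again while still ACTIVE
    # (that is, still on the current path) closes a cycle.
    UNSEEN, ACTIVE, DONE = 0, 1, 2
    state = {term: UNSEEN for term in broader}

    def visit(term, path):
        state[term] = ACTIVE
        for parent in sorted(broader[term]):
            if parent not in broader:
                continue  # already reported under rule 1
            if state[parent] == ACTIVE:
                cycle = path[path.index(parent):] + [parent]
                problems.append(("cycle", tuple(cycle)))
            elif state[parent] == UNSEEN:
                visit(parent, path + [parent])
        state[term] = DONE

    for term in sorted(broader):
        if state[term] == UNSEEN:
            visit(term, [term])
    return problems
```

A thesaurus that passes both checks yields an empty list; each reported
tuple names the rule that was violated and the descriptors involved.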
One major drawback in the use of hierarchical dictionaries and thesauri
is inflexibility. Natural language is highly dynamic, and the
descriptors used to convey concepts change. In addition, completely new
concepts and words are introduced into the language at an accelerating
rate. The cost of maintaining a large dictionary or thesaurus is
therefore high. However, new tools [Dome86] are making the maintenance
of such structures much easier.
Most existing thesauri are inconsistent and under-used. In some of the
large bibliographic retrieval systems, indexing includes the manual
assignment of thesaurus descriptors. These systems use thesauri only
to enhance recall. However, thesauri could also be used to enhance
precision through narrower descriptors. Consistent thesauri could be
used to maintain concept consistency in indexing descriptors. The
feasibility of using thesauri more effectively in automatic indexing is
still an open research question.
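The two roles of a thesaurus described above can be illustrated with a
small Python sketch. Adding synonyms to a query widens the set of matching
documents (recall), while replacing a broad descriptor with one of its
narrower descriptors shrinks it (precision). The function names, the
dictionary-based index, and the toy data are all assumptions invented for
this example and do not describe any particular retrieval system.

```python
def retrieve(descriptors, index):
    """Return the ids of documents indexed under any of the descriptors."""
    hits = set()
    for descriptor in descriptors:
        hits |= index.get(descriptor, set())
    return hits


def expand_with_synonyms(terms, synonyms):
    """Recall device: add the thesaurus synonyms of each query term."""
    expanded = set(terms)
    for term in terms:
        expanded |= synonyms.get(term, set())
    return expanded


def narrower_terms(term, narrower):
    """Precision device: list the more specific descriptors under a term."""
    return sorted(narrower.get(term, set()))


# Toy data, invented for this example.
INDEX = {"car": {1, 2}, "automobile": {3}, "sports car": {2}}
SYNONYMS = {"car": {"automobile"}}
NARROWER = {"car": {"sports car", "truck"}}

# Synonym expansion matches more documents than the bare term "car".
broad = retrieve(expand_with_synonyms({"car"}, SYNONYMS), INDEX)
# Narrower descriptors that could be offered to the user.
choices = narrower_terms("car", NARROWER)
# Searching with a narrower descriptor matches fewer documents.
specific = retrieve({"sports car"}, INDEX)
```

Whether such expansion and narrowing can be driven automatically, rather
than by manual descriptor assignment, is the open question raised above.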
3.2 Performance Testing
Many indexing systems and techniques were compared using a standard
document and query base during experiments in the early 1970's
[Salt83]. The AI methods surfacing now have not been evaluated using
this standard, so they cannot be compared with the earlier systems. Another
pressing need in automatic indexing research is the establishment,
implementation, and use of a set of standards for comparing the
performance of the latest indexing systems with each other and with
existing ones.
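As a minimal sketch of what evaluation against a standard document and
query base involves, the Python fragment below scores a retrieval system
by precision and recall, averaged over a fixed query set with known
relevance judgments. The function names, the callable `system`, and the
data layout are assumptions made for this example; they are not the
procedure used in the experiments reported in [Salt83].

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of one retrieved set against the judged set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def evaluate(system, queries, judgments):
    """Average precision and recall of `system` over a fixed query set.

    system:    callable mapping a query text to a collection of document ids
    queries:   dict mapping query id to query text
    judgments: dict mapping query id to the set of relevant document ids
    """
    if not queries:
        raise ValueError("the query set must not be empty")
    scores = [precision_recall(system(text), judgments[qid])
              for qid, text in queries.items()]
    n = len(scores)
    mean_precision = sum(p for p, _ in scores) / n
    mean_recall = sum(r for _, r in scores) / n
    return mean_precision, mean_recall
```

Because the queries and judgments are held fixed, any two systems passed
to evaluate are scored on the same basis, which is the point of a shared
standard.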
Performance experiments are also needed to compare the indexing
capabilities of the emerging parallel hardware. Experiments must be
performed to benchmark new processors and algorithms in terms of their
indexing capabilities, not just their arithmetic processing power.
3.3 Artificial Intelligence applications in text indexing
The SMART [Salt83] system attempted unsuccessfully to use syntactic
analysis methods to recognize phrases in queries and documents, and to
use the phrases as indexing units. Salton concluded that these
syntactic methods do not provide improvements over standard retrieval
using a thesaurus. However, recent work by Smeaton [Smea86] shows
improved retrieval performance by linguistically parsing query and
document text as part of the retrieval strategy. The continuously
improving parsers developed in natural language processing can be used
to better index documents and queries to improve retrieval performance.
Although more limited in scope, an expert system and semantic network
constructed by Shoval [Shov85] uses term relations and search rules to
help a user find appropriate search terms in a query. Similarly, a
knowledge based system approach to document retrieval is presented by
Biswas et al [Bisw85]. An important research area in AI appears to be
the automation of the construction of semantic networks and knowledge
bases.
3.4 Associative Networks
An associative network may be used to address the indexing problem in
the following way:
Construct a large associative network, perhaps using a Boltzmann
[Hint84] model. At key nodes, load descriptors and document identifiers
from a document base. Then load queries as input to the network, and
fix the output at the correct answers to these queries. Allow the
network to settle, and examine the resulting connections. If the
network settles successfully, these connections will in effect implement
an algorithm which performs perfectly on the given queries and
documents. The associations
between nodes may lead to important insights into unconscious or
non-obvious connections between descriptors in a document base. The
network will also "discover" indirect connections between descriptors
which are not apparent but very useful in indexing.
The network itself may be used as an indexing system, or as a method of
automatically assigning relevance weights to descriptors in order to
enhance an existing indexing strategy. Descriptors in the network need
not be limited to word stems. They could be syntactic traces, or
thesaurus terms, or some combination of descriptor types. If a
sufficient set of queries and correct results to a large document
collection is made available, an associative network model may give
enormous insight into descriptor relations.
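The idea of loading queries, fixing the outputs at the correct answers,
and then examining the learned connections can be sketched in Python on a
toy scale. Note that this sketch deliberately swaps in a much simpler
technique than the Boltzmann model of [Hint84]: a single-layer linear
associator trained with the delta rule, with no hidden units and no
stochastic settling. The function names, learning rate, threshold, and
data are assumptions made for illustration only.

```python
def train_associator(queries, targets, n_desc, n_docs, rate=0.1, epochs=200):
    """Learn descriptor-to-document weights with the delta rule.

    queries: list of binary descriptor vectors, each of length n_desc
    targets: list of binary relevance vectors, each of length n_docs
    Returns weights w, where w[j][i] links descriptor i to document j.
    """
    w = [[0.0] * n_desc for _ in range(n_docs)]
    for _ in range(epochs):
        for x, t in zip(queries, targets):
            for j in range(n_docs):
                out = sum(w[j][i] * x[i] for i in range(n_desc))
                err = t[j] - out  # clamp the output toward the correct answer
                for i in range(n_desc):
                    w[j][i] += rate * err * x[i]
    return w


def recall_documents(w, x, threshold=0.5):
    """Return indices of documents whose activation exceeds the threshold."""
    return [j for j, row in enumerate(w)
            if sum(wi * xi for wi, xi in zip(row, x)) > threshold]


# Toy data: descriptors 0 and 1 co-occur in queries answered by document 0,
# and descriptor 2 occurs in queries answered by document 1.
QUERIES = [[1, 1, 0], [0, 0, 1]]
TARGETS = [[1, 0], [0, 1]]
weights = train_associator(QUERIES, TARGETS, n_desc=3, n_docs=2)
```

After training, the rows of weights can be inspected directly: large
entries mark descriptors that the network has associated with a document.
A linear associator of this kind can only reproduce the training answers
when the query patterns do not conflict, which is one reason the text
proposes a richer model.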
4.0 References
[Addi83] Addis, T R and L Johnson, "Knowledge for Machines," in The
Fifth Generation Computer Project, State of the Art
Report, 1983, Pergamon Infotech, Ltd, Maidenhead, Berkshire,
England.
[Bae84] Baertschi, M, "Term Dependence in information retrieval
models," PhD Thesis, ETH, Zuerich, 1984.
[Bisw85] Biswas, G, V Subramanian, and J C Bezdek, "A knowledge
based system approach to document retrieval," Second
Conference on Artificial Intelligence Applications:
The Engineering of Knowledge-Based Systems, IEEE Comput.
Soc. Press, Washington, DC, 11-13 Dec. 1985.
[Bos85] Bose, P K and M Rajinikanth, "KARMA: knowledge-
based assistant to a database system," Second
Conference on Artificial Intelligence Applications:
The Engineering of Knowledge-Based Systems, IEEE
Computer Society Press, Washington, DC, 11-13 Dec.
1985.
[Broo85] Brooks, H M, P J Daniels, and N J Belkin, "Problem
descriptions and user models: developing an intelligent
interface for document retrieval systems," in Advances
in Intelligent Retrieval: INFORMATICS 8.
Proceedings of an Aslib/British Computer Society Joint
Conference, 16-17 April 1985 Oxford, England, p. 191-214,
Aslib, London.
[Brow85] Brownstein, M, "Managing information intelligently
[Quantum's Knowledge Management System],"
Hardcopy, vol. 14, no. 11, pp. 139-141, November 1985.
[Chig85] Chignell, M H, A Loewenthal, and P A Hancock,
"Intelligent interface design," IEEE 1985 Proceedings
of the International Conference on Cybernetics and
Society, p. 620-3, Tucson, (12-15 November 1985).
[Chud84] Chudacek, J, "Non-grammatical Language Processing,"
Preprint Institute TNO for Mathematics, Information
Processing, and Statistics, The Hague, Netherlands (1984).
[Damo85] D'Amore, R J, Mah, C P, "One Time Complete Indexing
of Text: Theory and Practice." Proceedings of the 8th
Annual International ACM SIGIR Conference,
Montreal, (1985).
[DeHe74] De Heer, T, "The Application of the Concept of
Homeosemy to Natural Language Information
Retrieval." Information Processing and Management
vol 18 no 5 (1982).
[Dien85] Diener, R A V, "Relational knowledge structures: a
structural model of information for research and
retrieval," in Challenges to an Information Society.
Proceedings of the 47th ASIS Annual Meeting,
Philadelphia, Pennsylvania, October 1985.
[Dome86] Domenig, M., Shann, P., "Towards a Dedicated Database
Management System for Dictionaries," Proc. 11th International
Conference on Computational Linguistics, August 25-29 1986,
IKP Universitaet Bonn.
[Fren86] Frenkel, K A, "Evaluating Two Massively Parallel
Machines," Comm ACM vol 29 no 8, (August 1986).
[Giri84] Girill, T R, "Online Access Aids for Documentation:
A Bibliographic Outline," ACM SIGUCCS 12th User
Services Conference, Reno, Nevada (12 November 1984).
[Hint84] Hinton, G E, Sejnowski, T J, and Ackley, D H, "Boltzmann
Machines: Constraint Satisfaction Networks that Learn,"
Technical Report CMU-CS-84-119, Carnegie Mellon
University, (May 1984).
[Huff52] Huffman, D, "A Method for the Construction of
Minimum-Redundancy Codes," Proc. IRE v 40 p 1098-1101
(September 1952).
[Jona84] Jonak, Z, "Automatic Indexing of Full Texts,"
Information Processing and Management, vol. 20, no.
5-6, pp. 619-627, 1984.
[Kuhl83] Kuhlen, R, "Natural language research," SIGART
Newsletter, no. 83, pp. 20-21, January 1983.
[Kwok84] Kwok, K L, "A document-document similarity measure based
on cited titles and probability theory, and its application to
relevance feedback retrieval," in Research and Development
in Information Retrieval, The British Computer Society
Workshop Series, University Press, Cambridge, (1984).
[McCu85] McCune, B P, R M Tong, J S Dean, and D G Shapiro,
"RUBRIC: a system for rule-based information
retrieval," IEEE Trans. Software Eng., vol. SE-11, no. 9,
pp. 939-945, Mountain View, CA, September 1985.
[Medl85] Meder, N, "Artificial intelligence as a tool of
classification, or: the network of language games as
cognitive paradigm," Int. Classif. (Germany), vol. 12,
no. 3, pp. 128-132, 1985.
[Mite85] Mitev, N N and S Walker, "Information retrieval aids
in an online public access catalogue: automatic
intelligent search sequencing," in Advances in
Intelligent Retrieval: INFORMATICS 8. Proceedings of
an Aslib/British Computer Society Joint Conference,
16-17 April 1985 Oxford, England, p. 215-26, Aslib,
London, 1985.
[Mris86] Morris, D A, "GEFILE, The Electronic File Cabinet,"
General Electric Company Silicon Systems Technology
Department Press Release, (August 1986).
[Morr85] Morrissey, J M, "Interactive Querying Techniques for
an Office Filing Facility," Information Processing and
Management, vol. 22, no. 2, pp. 121-34, 1986.
[Pate84] Patel-Schneider, P, Brachman, R, Levesque, H, "ARGON:
Knowledge Representation Meets Information
Retrieval," Fairchild Technical Report No 654,
(September 1984).
[Salt83] Salton, G and M J McGill, Introduction to modern
information retrieval, McGraw Hill International Book
Company, Paris, 1983.
[Salt84] Salton, G, "Extended boolean information retrieval - an
outline," in National Online Meeting 1984, ed. T H Hogan,
p 339-346, Learned Information, Inc., Medford (1984).
[Salt86] Salton, G, "Another Look At Automatic Text Retrieval
Systems," Comm ACM vol 29 no 7 (July 1986), p 648-656.
[Schae86] Schaeuble, P, Frei, H P, "Thesauri in Information
Retrieval," unpublished.
[Schn83] Schneider, C, "Syntaktische Relationen in der
automatischen Indexierung zur Relationierung von
Deskriptoren am beispiel juristischer dokumente," PhD
Dissertation, Regensburg, 1983.
[Shov85] Shoval, P, "Principles, procedures, and rules in an
expert system for information retrieval," Information
Processing and Management, vol. 21, no. 6, pp. 475-487,
1985.
[Smea86] Smeaton, A F, "Incorporating Syntactic Information
Into a Document Retrieval Strategy: An Investigation,"
Prepublication.
[Teuf86] Teufel, B, Schmidt, S, "Full Text Retrieval Based on
Syntactic Similarities," unpublished.
[Wong85] Wong, S K M, and Ziarko, W, "On Generalized Vector Space
Model In Information Retrieval," Ann Soc Math Series IV,
Fundam Inf, v 8, no 2, p 253-267, (1985).
[Zamo81] Zamora, E M, Pollack, J J, Zamora, A, "The Use of
Trigram Analysis for Spelling Error Detection,"
Information Processing and Management, vol 17 no 6
(1981).
------------------------------
END OF IRList Digest
********************