IRList Digest Wednesday, 25 Sep 1985 Volume 1 : Issue 11
Today's Topics:
Query - Location of Bruce Croft
- Info desired on projects using NLP to create knowledge base
- Format for SIGIR Forum, Values from cosine correlations
Article - Proposal for SIGIR/SIGDOC Workshop
----------------------------------------------------------------------
From: "Robert B. Allen" <rba%lafite@MOUTON>
Date: Tue, 17 Sep 85 15:59:12 edt
Subject: Bruce Croft
Do you happen to have an electronic address for
Bruce Croft in Dublin?
[Try bcroft@irlearn on BITNET, or bcroft%IRLEARN.BITNET@csnet-relay - Ed]
Thanks
Bob Allen
------------------------------
From: MARS%red.rutgers.edu@CSNET-RELAY
Date: 18 Sep 85 15:30:36 EDT
Subject: NLP for knowledge acquisition
ReSent-From: Ken Laws <Laws@SRI-AI>
ReSent-To: IRList%vpi.csnet@CSNET-RELAY
Hi: I am interested in info about projects which use Natural Language
Processing Techniques to analyse scientific articles or abstracts
with the aim of deriving knowledge bases from them.
I am aware of a few projects in that field (UCLA, IBM
Heidelberg, Leiden University, Chemical Abstracts), but I
would appreciate any further pointers. Please
reply directly to me, and I will summarize to the net. Thanks.
Nicolaas J.I. Mars
[Can anyone at NLM or CMU report on their efforts? Other projects? - Ed]
------------------------------
From: Michael_Gordon%UB-MTS%UMich-MTS.Mailnet@MIT-MULTICS
Date: Mon, 23 Sep 85 11:15:30 EDT
Subject: SIGIR Forum, etc.
...
1) I'm submitting a camera-ready article for Winter's
SIGIR Forum (at the request of Vijay Raghavan). Do you want
it in a special format ("long" paper, two columns) or is a
typo-free laser printer document OK?
[Since printing many pages is expensive, we ask that people either
send us fairly closely typed submissions or else provide machine
readable form (preferably TeX or TROFF so we can reformat). DO NOT
double space. Laser printer or typeset originals, single spaced,
without big margins, are the preferred form. Long paper, 2 cols,
is fine since ACM can reduce and we can save costs.
More comments, Vijay? - Ed]
2) Do you have Vijay's electronic mail address?
[I use ihnp4!sask!regina!raghavan@ucb-vax which should help people
on DARPA Internet and using UUCP. - Ed]
3) I've been performing some (simulated) Cosine-based
retrieval experiments. I consistently see *extremely* high
Cosine scores (often over 0.8) between queries and relevant docs.
The queries are weighted, often employing 10 or more non-zero
terms.
These correlations seem too high.
I'm wondering what the data for some "typical" Cosine-based
retrieval experiments looked like. I've looked in some places
I expected to find such results, but without success. In particular,
what I'd like to see are things like this:
a query that was submitted to the system (all terms, plus weights)
a rank ordering of documents (with the indexed descriptions,
including weights, if any) plus an indication of whether, in fact, the
document was relevant to the query.
I'm interested in seeing a representative sampling of such
data, for weighted and unweighted queries combined with weighted
and unweighted docs. Data showing how Cosines seem to vary with
the number of terms used to index a document (and/or query) are of
interest, too.
If you can point anything out, or forward any data, I'd
appreciate it greatly.
Thanks a lot, Mike
[Similarity depends on the queries and documents, obviously! Fairly
short queries matched against titles can give many hits with
similarity close to 1. Has anyone looked at the distributions
of similarities for test collections, in detail?
I do have data that may help and hope others can send some too,
so you can summarize results. - Ed]
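[Editor's postscript: for readers following Mike's question, the Cosine
correlation itself is easy to state; below is a minimal sketch in modern
Python, with made-up query and document weights rather than data from
any test collection:

```python
import math

def cosine(query, doc):
    """Cosine correlation between two weighted term vectors,
    each represented as a dict mapping term -> weight."""
    dot = sum(w * doc.get(t, 0.0) for t, w in query.items())
    qnorm = math.sqrt(sum(w * w for w in query.values()))
    dnorm = math.sqrt(sum(w * w for w in doc.values()))
    if qnorm == 0.0 or dnorm == 0.0:
        return 0.0
    return dot / (qnorm * dnorm)

# Made-up weighted vectors: the query shares only two of its
# three terms with the document, yet the score is already ~0.87.
q = {"retrieval": 2.0, "cosine": 1.0, "experiment": 1.0}
d = {"retrieval": 3.0, "cosine": 2.0, "index": 1.0}
score = cosine(q, d)
```

Because both vectors are length-normalized, a handful of shared
high-weight terms dominates the score, which is one reason values over
0.8 between queries and relevant documents are not by themselves
alarming. - Ed]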
------------------------------
From: Michael Lesk <lesk%petrus@MOUTON>
Date: Sat, 7 Sep 85 18:28:58 edt
Subject: SIGIR/SIGDOC Workshop
The following is a proposal for a workshop which, although not yet
formally approved, [Note: Diana Patterson of SIGDOC has signed - Ed]
is very likely to take place in Snowbird, Utah, June 30-July 2, 1986.
Chair: Michael Lesk; Local Arrangements: Lee Hollaar; Treasurer: Karen Kukich.
Attendance will be limited to 75; there will be no formal proceedings,
but a report will be written for some ACM publication; a number of
prominent people (Karen Sparck Jones, David McDonald, Donald Walker,
Patricia Wright, etc.) have indicated interest in attending. Comments
on the workshop, or indications of interest, are welcome. Please
notify the chair at: bellcore!lesk, or lesk%bellcore@csnet-relay, or
(if you have current routing tables) lesk@bellcore. Phone: 201-829-4070.
NOTE: I will be on vacation Sept 9 - Oct 4; failure to reply
during those dates merely means your message has not been read!! --
Thanks, Michael Lesk
Writing to be Searched:
A Workshop on Document Generation Principles
As computers learn to write English, and other programs improve
at searching it, both ought to benefit from the people who already
know how to do these jobs. We're proposing a workshop bringing
together AI specialists in document generation, information retrieval
experts, people who know how to write manuals, and those who write
programs to evaluate writing.
Introduction.
In recent years there has been a surge of interest in the use of
computer programs that write English.[1,2,3] Expert systems, for exam-
ple, need to explain what they are doing. Programs are making
increasing strides in fluency, domain coverage, and expressive
power.[4,5] In fact, it is remarkable that there has been a long dis-
cussion over the last ten years about whether or not apes have
mastered language, based on utterances such as ``Please tickle more,
come Roger tickle''[6] while computer programs saying things like
``The market crept upward early in the session yesterday, but stumbled
shortly before trading ended''[7,8] have not impressed the public
nearly as much. But even supposing that computers can now write
English, what should they write?
One obvious answer is computer programming manuals (``if X is
good, recursive X is better''). Today it is more and more important
to have good documentation for the increasingly complex systems, typi-
cally computer based, which now pervade automobiles, airplanes,
military systems, hospitals, telephone companies, and many other areas
of life. The manuals associated with many a microcomputer weigh more
than the computer does. Worse yet, these manuals vary widely in qual-
ity and good manuals are very important for proper use. It is partic-
ularly urgent, for example, that operators of complex military systems
or nuclear power plants be able to find out what to do in emergencies
or other unusual circumstances.
It is also important to consider language generation for other
purposes, such as answering questions or explaining the output of
expert systems. In fact, we should really be considering the entire
information transfer system, in which English serves to represent
knowledge and deliver it to people. The use of knowledge representa-
tion formalisms for the first purpose and of graphical interaction for
the second may greatly affect conventional writing. Indexing tech-
niques, although used now primarily to identify relevant passages, may
also serve the knowledge representation function. Browsing systems
may well produce a need for new kinds of documents (Hypertext,
Polytext) and thus new kinds of writing.
What We Know Today.
Reference manuals are not conventional literary works. Much of
this documentation is never read cover to cover, but is only referred
to as necessary. Thus the indexing of this material is almost as
essential as its composition. Significant strides in this area
have also been made by researchers. It is now possible to design
full-text retrieval systems that accept conventional documents and
questions in natural English,[9] and then retrieve documents or pas-
sages from documents that probably answer the questions. Such systems
are now for sale from several vendors.[10] Meanwhile, the researchers
are exploring the construction and use of thesauri to identify
synonyms and related terms automatically, and the use of feedback to
improve retrieval based on the results of earlier searches.[11] The
introduction of user feedback into retrieval systems and indexing
means that we are moving towards a world in which the process of writ-
ing followed by reading may be replaced with more integrated informa-
tion systems, which respond to questions by generating appropriate
replies on the fly.
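[The feedback process described here is, in the SMART tradition,
usually some variant of Rocchio's formula; the following Python sketch
is an illustration only (the vectors and the alpha/beta/gamma settings
are invented, not taken from any cited experiment):

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio-style relevance feedback: move the query vector toward
    the relevant documents and away from the non-relevant ones.
    All vectors are dicts mapping term -> weight."""
    new_q = {t: alpha * w for t, w in query.items()}
    for doc in relevant:
        for t, w in doc.items():
            new_q[t] = new_q.get(t, 0.0) + beta * w / len(relevant)
    for doc in nonrelevant:
        for t, w in doc.items():
            new_q[t] = new_q.get(t, 0.0) - gamma * w / len(nonrelevant)
    # Negative weights are conventionally clipped to zero.
    return {t: w for t, w in new_q.items() if w > 0.0}

# One round of feedback on an invented single-term query:
query = {"manual": 1.0}
fed_back = rocchio(query,
                   relevant=[{"manual": 1.0, "documentation": 1.0}],
                   nonrelevant=[{"novel": 1.0}])
```

Terms from relevant documents ("documentation" above) enter the query
automatically, which is the sense in which write-then-read is replaced
by an adaptive loop. - Ed]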
Retrieval studies have indicated that certain kinds of vocabulary
control can improve indexing efficiency. Avoiding very common and
broad words, for example, or very infrequent words, makes it easier to
find the correct documents in response to queries. However, this
information is rarely used to affect the writing of documents with the
idea that they can then be indexed more satisfactorily. When new com-
puter systems are being designed, and names must be given to a collec-
tion of invented objects, nobody bothers to consult retrieval experts
to decide on good names, despite their experience with vocabulary con-
trol. The choice of names in fact matters, and people don't agree
very much.[12]
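[The vocabulary-control rule of thumb above (drop the very common and
the very rare terms) can be stated in a few lines; the cutoff fractions
here are illustrative, not values from the literature:

```python
from collections import Counter

def midfrequency_terms(docs, low=0.3, high=0.75):
    """Keep index terms whose document frequency, as a fraction of
    the collection size, lies between the two cutoffs."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # count each term once per document
    return {t for t, k in df.items() if low <= k / n <= high}

# A toy four-document collection:
docs = [["the", "cat", "sat"],
        ["the", "dog", "ran"],
        ["the", "cat", "ran"],
        ["the", "fox"]]
vocab = midfrequency_terms(docs)
```

"the" (present in every document) and the one-occurrence terms are
excluded, leaving the mid-frequency terms that discriminate best
between documents. - Ed]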
Indexers have, for years, been familiar with the problems of
vocabulary control and choice of words to describe subjects. The
methodologies they use, and the similar expertise of lexicographers,
have implications for natural language generation. The problems of
the optimum selection of names both with respect to writing descrip-
tions and to designing the systems being described (where possible)
are not often considered and the choices evaluated. A recent paper
suggesting the use of linguistic principles in designing computer com-
mand languages, however, was greeted with enthusiasm at a meeting of
computer hackers, so that interactions between these fields seem pro-
fitable.[13]
Some interesting retrieval experiments have been run on unusually
formatted or unusually structured texts. The National Library of
Medicine, for example, assembled the Hepatitis Knowledge Base to serve
as an on-line encyclopedia of hepatitis information;[14,15] rather
than a conventional monograph or review article, it is a tree-
structured outline of this subject area, with many cross references.
To date, however, no such experimental document architecture has been
widely accepted. Nor have the more formal AI knowledge representation
languages, despite their obvious promise and their attraction to
information scientists, yet been able to cover a large subject
domain.[16]
There is, of course, an overall process of information transfer
here. Somewhere, there is a database of information; and there are
people who need that information. In between, we produce a document,
which describes the program or the database, and which people can then
refer to. Note that they rarely read it cover to cover, so that con-
ventional writing rules about plot and characterization are often
inappropriate. ``Technical manuals are not bedtime reading.''[17]
These are documents typically meant only for retrieval; and yet they
tend to be written in the same way, and in the same form, as documents
written as literary works. Many of the same problems arise, of
course, with respect to textbooks, handbooks, news articles, and even
ordinary business correspondence. Again, much of it needs indexing
and retrieval. And in many cases it doesn't get indexed or retrieved,
because it is too much work when done by hand.
To take reference manuals as an example: there is no general
agreement on the kind of manual that ought to be provided. Research on the
use of models to teach people about programs has not indicated whether
having a concrete analogy to the program task helps or hinders learn-
ing.[18] The relative value of examples, explanations, and terse
reference summaries is not established by research. And there are
arguments about how long a manual should be, with some computer scien-
tists believing that longer is better and others believing that
shorter is better. Much of the discussion eventually comes down to
``I know good writing when I see it,'' a statement that even if valid
gives little guidance to those trying to produce computer-written
documentation.
Nor are the conventional style checking programs of a great deal
of help. Too many of the programs now sold to assist writers are
trivial Flesch-type readability indexes, or other very low level style
and spelling checkers. There is research, but little production, on
programs to catch grammatical errors; and rating English automatically
for any rhetorical quality is impossible today. We often describe
writing as ``informative,'' ``exciting,'' ``convincing,'' or ``easy to
read,'' but there is no program that can evaluate any of these attri-
butes when given a text.
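[For reference, the Flesch-type index dismissed above as trivial really
is a one-line formula; the constants below are the standard published
ones, while the word, sentence, and syllable counts would come from a
syllable-guessing heuristic that does the actual error-prone work:

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch reading-ease score from raw counts; higher means easier.
    Scores near 60-70 correspond roughly to plain English."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

# 100 words in 5 sentences with 150 syllables:
score = flesch_reading_ease(100, 5, 150)   # about 59.6
```

Nothing in the formula looks at word choice or meaning, which is
exactly the limitation the paragraph above complains of. - Ed]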
It is clear, however, that documentation is very important:
nearly every survey of computer software rates not only the perfor-
mance but also the manual. Users depend critically on the manual, and
the absence of good documentation will often make an otherwise attrac-
tive service or piece of software unusable. One experience indicated
that documentation was the most frequent source of trouble in using a
microcomputer system.[19]
An important avenue now being explored is the use of computers to
generate both their programs and their descriptions from the same for-
mal specification, as in the GIST project of Bill Swartout.[20] This
will at least guarantee the consistency of the program and its manual.
It is important, of course, not to lose more in style and understanda-
bility than is gained in cost and timeliness through the use of
computer-written text.
Much talk about documentation today concerns format: we should
emphasize that the primary purpose of this workshop is to go beyond
that. Page layout is not irrelevant, but it is not a substitute for
good English, and it is unfortunate that the ease with which word pro-
cessors can manipulate format has resulted in much experimentation
with appearance, and less with content. The intent of this workshop
is to deal with rhetoric and semantics, not with page makeup.
Questions.
Is there a better kind of information transfer system that could
be devised? How should expert systems explain what they are doing?
What kind of business and military correspondence systems should be
built in the future? When, as we expect, computer systems will not
only contain but generate their own explanations, should these look
like conventional manuals?[21] Is there a difference, for example,
between the principles of writing documents and writing explanations
for expert systems? Which is more useful? Perhaps only a retrieval
system is needed, with an explanation generator; or perhaps conven-
tional manuals should be written, but designed specially to be
searched rather than read. In practice, the people who produce docu-
mentation are already starting to ask what kinds of formats, and what
kinds of style and typography, are appropriate for documentation any-
way. We'd like to upgrade this discussion to talk about rhetorical
style.
In particular, given programs that can produce reasonably well-
phrased English, it should be possible to turn knobs inside the pro-
grams, and see what effects these have on the larger properties of the
text. This offers the possibility of producing text under very con-
trolled conditions, much better understood than any generation of such
complex material for normal psychological testing. In addition,
merely the presence in a workshop of several different natural
language generation systems, combined with experts in producing actu-
ally useful documents, should be very valuable in discussing what pro-
perties of the systems are connected to what features of the resulting
English.
Workshop Specifics.
In this workshop we will bring together subject specialists in
four main areas:
* Artificial intelligence researchers working in natural language
generation;
* Documentation specialists interested in writing style and qual-
ity, and in the definition of a `good' document;
* Text analysis developers, building programs that analyze text
automatically and try to make value judgments about it; and
* Retrieval experts, who know how to build systems for keyword
matching and retrieval.
Another major area that should be represented, but possibly not until
a later meeting, is computer graphics. The value of illustrations,
diagrams, and charts is unquestioned but it is not clear how we can
integrate graphics with text today.
Here are some examples of interesting comments, mostly recent
research results in the above fields:
1. A high degree of grammatical variation does not seem necessary to
produce natural effects in short paragraphs (as evidenced by
Karen Kukich's stock market report generator).[22]
2. Structure in queries is not very useful in retrieval; unordered
lists of keywords do about as well (Ellen Voorhees and Gerard
Salton).[23]
3. Checking for hackneyed phrases, although seemingly a trivial
operation, is perceived as very valuable by many writers (either
Writer's Workbench, by Nina Macdonald and Lorinda Cherry,[24] or
Epistle, by Lance Miller and George Heidorn).[25]
4. Syntax is much less important for retrieval than semantics; you
need to know what the words mean more than you need to know their
relationship (Harris, Cowie, and Tuttle).[26,27,28]
5. People frequently leaf through manuals, even when tables of con-
tents and indexes are available; documents should be formatted to
cater for this (Patricia Wright).[29,30]
6. Editing manuals to make them suitable for machine translation,
requiring simple language, has turned out to make them better in
the original language as well.
7. Even humor has its place in documentation. ``The grace which
eloquence had failed to work in those men's hearts, had been
wrought by a laugh.'' (Mark Twain). Seriously, although the
point of manuals is not to make the reader laugh, in some com-
puter manuals anything that would keep the reader awake would be
valuable.
A possible new strategy might be to bypass the typical indexing step
of making a list of words to represent a document, each with an
assigned weight, by having the generator select these itself, possibly
with greater accuracy. An improvement in the reverse direction might
be the use of the same vocabulary control data base that is used for
indexing to select the words used in the text.
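[The "list of words, each with an assigned weight" that this paragraph
proposes to bypass is typically produced by a tf-idf style scheme; a
minimal sketch, with invented toy documents:

```python
import math
from collections import Counter

def tfidf_index(docs):
    """Weight each term in each document by term frequency times the
    log of inverse document frequency, one of the classic weighting
    schemes from the automatic-indexing literature."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    index = []
    for doc in docs:
        tf = Counter(doc)
        index.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return index

# Two toy documents; "stock" occurs in both, so its weight is zero.
idx = tfidf_index([["report", "stock", "report"],
                   ["stock", "market"]])
```

A generator that selected its own index terms would produce such
weights directly, instead of leaving them to be estimated from surface
statistics after the fact. - Ed]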
Note that for retrieval, what really matters is the choice of
specific words used to name objects and actions. The structures which
connect these words are less important and have been almost unused in
retrieval systems. Yet, for those who evaluate English, the specific
word choice is almost ignored! Instead, the vast majority of the
effort is spent on syntax. Thus, for documents intended for refer-
ence, almost the entire current literature on automatic checking is
mis-aimed. Moreover, relatively little effort in document generation
as a whole has been spent on choice of specific words or phrases, and
yet this is the most important aspect for retrieval purposes.
We hope that by talking to each other, the generators will dis-
cover that they can significantly increase the utility of their output
without increasing the effort of generating it. And we hope that the
retrieval and analysis experts will learn what it is that they should
be looking for in documents, and increase the performance of their
systems without an increase in cost.
Our best possible outcome, of course, is that the participants
will find something which is not quite a conventional reference
manual, but serves the same purpose and does it better. Whether this
will be a structured document still written in English, or a
question-answering database with an explanation generator, it is
impossible to say. But unless the various groups start talking to one
another, we'll never find out.
Michael Lesk
Bell Communications Research
435 South St., Rm. 2A-385
Morristown, NJ 07960
August 9, 1985
References
1. E. Conklin and D. McDonald, "Salience: The Key to the Selection
Problem in Natural Language Generation," Proc. 20th Meeting ACL,
pp. 129-135, 1982.
2. K. R. McKeown, "The TEXT System for Natural Language Generation:
An Overview," Proc. 20th Meeting ACL, pp. 113-120, Toronto, Ont.,
1982.
3. R. E. Cullingford, M. W. Krueger, M. Selfridge, and M. A. Bien-
kowski, "Automated Explanations as a Component of a Computer-
Aided Design System," IEEE Trans. Sys., Man & Cybernetics, pp.
168-181, 1982.
4. W. C. Mann, "An Overview of the NIGEL Text Generation Grammar,"
Proc. 21st ACL Meeting, pp. 79-84, 1983.
5. A. K. Joshi and B. L. Webber, "Beyond Syntactic Sugar," Proc. 4th
Jerusalem Conf. on Information Technology, pp. 590-594, 1984.
6. S. Chevalier-Skolnikoff, "The Clever Hans Phenomenon, Cuing and
Ape Signing: A Piagetan Analysis of Methods for Instructing
Animals," in The Clever Hans Phenomenon: Communication with
Horses, Whales, Apes and People, ed. Thomas Sebeok and Robert
Rosenthal, vol. 364, pp. 60-93, New York Academy of Sciences,
1981.
7. Karen Kukich, Knowledge-Based Report Generation: A Knowledge-
Engineering Approach to Natural Language Report Generation, Ph.D.
thesis, University of Pittsburgh, 1983.
8. Karen Kukich, "ANA's First Sentences: Sample Output from a
Natural Language Stock Report Generator," Proc. Nat'l Online
Meeting, pp. 271-80, 1983.
9. G. Salton and M. McGill, Introduction to Modern Information
Retrieval, McGraw-Hill, 1983.
10. Among sellers of free text retrieval systems are ``Cucumber
Information Systems'' (5611 Kraft Drive, Rockville, MD 20852) and
``Knowledge Systems, Inc.'' (12 Melrose St., Chevy Chase, MD
20815).
11. G. Salton, The SMART Retrieval System -- Experiments in Automatic
Document Processing, Prentice-Hall, 1971.
12. G. W. Furnas, T. K. Landauer, L. M. Gomez, and S. T. Dumais,
"Statistical Semantics: Analysis of the potential performance of
key-word information systems," Bell Sys. Tech. J., vol. 62, no.
6, pp. 1753-1806, 1983.
13. Marion O. Harris, "Thoughts on an All-Natural User Interface,"
Proc. Summer USENIX Conf., pp. 343-347, Portland, Oregon, June
1985.
14. L. M. Bernstein and R. E. Williamson, "Testing of a Natural
Language Retrieval System for a Full Text Knowledge Base," J.
Amer. Soc. Inf. Sci., vol. 35, no. 4, pp. 235-247, 1984.
15. R. E. Williamson, "ANNOD -- A Navigator of Natural-Language
Organized (Textual) Data," Proc. 8th SIGIR Meeting, pp. 252-266,
Montreal, Quebec, 1985.
16. M. E. Lesk, "Programming Languages for Text and Knowledge Pro-
cessing," Ann. Rev. Inf. Sci. and Tech., vol. 19, pp. 97-128,
1984.
17. Janet Asteroff, "On Technical Writing and Technical Reading,"
Information Technology and Libraries, vol. 4, no. 1, pp. 3-8,
March 1985.
18. Christine Borgmann, "The User's Mental Model of an Information
Retrieval System," Proc. 8th SIGIR Meeting, pp. 268-273, Mont-
real, Quebec, 1985.
19. Marilyn Mantel and Nancy Haskell, "Autobiography of a First-Time
Discretionary Microcomputer User," Human Factors in Computing
Systems: Proc. CHI '83 Conference, pp. 286-290, 1983.
20. Bill Swartout, "GIST English Generator," Proc. AAAI-82, pp. 404-
409, Pittsburgh, Penn., 1982.
21. Ariel Shattan and Jenny Hecker, "Documenting UNIX: Beyond Man
Pages," Proc. Summer USENIX meeting, pp. 437-454, Portland, Ore.,
1985.
22. Karen Kukich, "Design of a Knowledge-Based Report Generator,"
Proc. 21st Meeting ACL, pp. 145-50, 1983.
23. E. Voorhees and G. Salton, "Automatic Assignment of Soft Boolean
Operators," Proc. SIGIR Conf., pp. 54-69, 1985.
24. L. L. Cherry and N. H. Macdonald, "The Unix Writer's Workbench
Software," Byte, vol. 8, no. 10, pp. 241-248, Oct. 1983.
25. G. E. Heidorn, K. Jensen, L. A. Miller, and R. J. Byrd, "The
Epistle Text-Critiquing System," IBM Systems J., vol. 21, no. 3,
pp. 305-326, 1982.
26. M. O. Harris, Howto: An Amateur System for Program Counseling,
private communication, 1983.
27. J. R. Cowie, "Automatic Analysis of Descriptive Texts," Conf. on
Applied Natural Language Processing, pp. 117-123, Santa Monica,
Cal., Feb. 1-3, 1983.
28. M. S. Tuttle, D. D. Sherertz, M. S. Blois, and S. Nelson,
"Expertness from Structured Text? Reconsider: A Diagnostic
Prompting System," Conf. on Applied Natural Language Processing,
pp. 124-131, Santa Monica, Cal., Feb. 1-3, 1983.
29. Patricia Wright, "Manual Dexterity: a user-oriented approach to
creating computer documentation," Human Factors in Computing Sys-
tems: Proc. CHI '83 Conference, pp. 11-18, 1983.
30. T. G. Sticht, "Comprehending Reading at Work," in Cognitive
Processes in Comprehension, ed. M. A. Just and P. A. Carpenter,
Lawrence Erlbaum, 1977.
------------------------------
END OF IRList Digest
********************