IRList Digest Volume 1 Number 18
IRList Digest Monday, 4 Nov 1985 Volume 1 : Issue 18
Today's Topics:
Query - NSF Travel Grant for 1986 R&D in IR Conf. - Pisa?
Interactive index-in-context queries
Call for Papers - ACL 86
Article - Museums on Disc
Announcement - IRList now on Bitnic's database server
----------------------------------------------------------------------
From: Don <kraft@LSU>
Date: Fri, 1 Nov 85 09:17:10 cst
Subject: pisa travel grant announcement
When will IRlist print my Pisa travel grant announcement?
[Note: On Oct. 19 issues 15 and 16 of IRlist were both sent out.
Unfortunately, they were dated Sep instead of Oct, so people may
have been confused, or may have missed one of the two. In any case
in Issue 16 the second entry published is classified as
Call for Papers - Applications for NSF funds to Pisa Conf.
which is a message from Don Kraft of LSU to "Dear World:"
Please note that papers are due Jan 15, 1986. - Ed]
------------------------------
From: Mark Zimmermann <zimmer%lll-tis-a.arpa@CSNET-RELAY>
Date: Wed Oct 30 04:26:42 1985
Subject: interactive index-in-context queries?
Summary: I want a fast, simple, interactive browsing tool, and something like
an extended index-in-context (where the full text of the document is available
instantly when requested from an index item) seems to have a lot of potential.
Questions: --what are the difficulties with this approach?
--do products exist to do this (esp. on Macintosh or Sun)?
--are there good places to read about this in the literature?
Appended below is part of a message I sent out to a friend recently,
describing in more detail the background of the above task. If anybody can
help, please send mail to me here ("zimmer@lll-tis") or at "zim@mitre". Tnx!
^z
**************
Here's my current project -- I want to create a fast, interactive, index-in-context
(maybe this is called a "KWIC" = "Key Word in Context"?) to handle
multi-megabyte data files (e.g., my collection of 3000 msgs from the past
year). I'll describe what I fantasize having on the Mac (which will be
limited to 200K or so files) -- and maybe you can comment or help, if you
like. I think that it's a pretty trivial project on a Mac or Sun (maybe
a few days' work, plus time to improve the user interface?). Very tough to
do the user interface on a non-bit-mapped-screen such as our standard dumb
terminals hooked up to the mainframe at work ....
What it looks like: system has a window with scroll bars that shows a
chunk of the index-in-context -- the alphabetized (ignoring case) words that
are indexed are all lined up in the middle, like:
...azilians have domesticated the aardvark and are using it to ...
...common to be using a pine wood abacus for rapid calculations...
...have domesticated the aardvark and are using it to perform a...
... domesticated the aardvark and are using it to perform a var...
etc. (index words are right here ^^^, of course)
User scrolls around the index, and when something looks potentially
interesting/relevant, clicks on that item and another window opens up
showing a big chunk of text (several dozen lines, at least) around that
point, also in a scrollable window. Everything happens instantly....
One might also be able to edit, in a very simple way, the index -- besides
having a predetermined list of words to ignore (a, an, the...) one might let
the user click on an index entry and then hit backspace, or "cut", to delete
it....
Implementation: I would preprocess the document to remove tabs,
replace <crlf>s with spaces, etc. Then scan through the document and
build up a list (linear array, really) of pointers to the first letter of
each word to be indexed (an address, maybe relative to the first entry
in the document). Then sort that list so that the 10 (or so) letters
after the pointed-to locations come in alphabetical order (ignore case).
NOTE: we ignore word boundaries to save time and simplify. We ignore
"record" boundaries, to save time and simplify. We probably put a bunch
of spaces at the beginning and end of the file, to simplify/eliminate
end effects.
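The indexing steps above can be sketched in a few lines of Python. This is only a minimal sketch under assumptions that are not in the original message: the names `STOP_WORDS`, `KEY_LEN`, `PAD`, `preprocess`, and `build_index` are hypothetical, the ignore list is a three-word sample, and a word start is taken to be any non-delimiter character that follows a space or a left parenthesis.

```python
STOP_WORDS = {"a", "an", "the"}   # sample ignore list (assumption)
KEY_LEN = 10                      # letters compared when sorting
PAD = " " * 40                    # padding to avoid end effects (assumed size)
DELIMS = " ("                     # space and left parenthesis


def preprocess(raw):
    """Replace tabs and line breaks with spaces, then pad both ends."""
    for ch in ("\r\n", "\n", "\r", "\t"):
        raw = raw.replace(ch, " ")
    return PAD + raw + PAD


def build_index(text):
    """Return offsets of indexed word starts, sorted by the KEY_LEN
    characters that follow each offset (case ignored)."""
    pointers = []
    for i in range(1, len(text)):
        if text[i] not in DELIMS and text[i - 1] in DELIMS:
            end = text.find(" ", i)   # always found: text ends in PAD
            word = text[i:end].strip(".,;:!?\"')").lower()
            if word and word not in STOP_WORDS:
                pointers.append(i)
    # The sort key runs past word boundaries, as the note above suggests.
    pointers.sort(key=lambda p: text[p:p + KEY_LEN].lower())
    return pointers


sample = preprocess("The aardvark\tuses an abacus.\r\nAn Aardvark counts.")
for p in build_index(sample):
    print(p, sample[p:p + KEY_LEN])
```

Python's sort is stable, so entries whose first ten characters match stay in document order.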
So, we now have our document (minus <crlf>s) and our sorted index of pointers
into that document. As the user scrolls around in the index, we fetch each
line to be displayed by backing up 40 or so characters from the pointed-to
character, then typing out 80 characters or so. Show result in a window.
Hold everything in memory at once ... if that's not possible, we have
to have some pre-fetching (paging?) set up to get all the text surrounding
the areas being viewed in the index into fast memory before user clicks on
an item to be fetched. If an item is selected, open another window and print
out the text within +-1000 or so bytes of the pointer location.
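The two display operations described above come down to slicing the in-memory text. The following is a minimal sketch, assuming the whole document fits in memory as one string; the function names `kwic_line` and `context` and the sample sentence are hypothetical.

```python
def kwic_line(text, pointer, width=80):
    """Return a `width`-character line in which the indexed word starts
    at the middle column."""
    start = max(0, pointer - width // 2)
    return text[start:start + width]


def context(text, pointer, radius=1000):
    """Return the text within `radius` characters of the pointer."""
    return text[max(0, pointer - radius):pointer + radius]


# Sample document, padded with spaces at both ends to avoid end effects.
text = (" " * 40
        + "Brazilians have domesticated the aardvark and use it daily."
        + " " * 40)
p = text.index("aardvark")
line = kwic_line(text, p)
print(line)
print(context(text, p).strip())
```

The padding matters here: without it, a pointer near the start of the file would be clamped to offset 0 and its word would no longer line up in the middle column.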
DQW suggests perhaps showing in the index only the index word plus a count
of how many occurrences it has, i.e.:
aardvark 1
abacus 2
Akhiezer 17
and 12345
etc., and then with a click expanding a chosen word into its full list of
index items. Might be useful as an option, but I'd delay it until we see
how unwieldy the full way is (with scroll bars, you can skip over dull
zones easily). CS suggests that we could use the above idea to get into
sub-indices -- that is, if one clicked on "Akhiezer" above, one might get
into an index which was sorted by all the words within 100 (or so?) characters
of the occurrence of Akhiezer -- a rather different idea, that might require
more pre-processing or auxiliary index files than I want to tackle right now.
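The word-plus-count view that DQW suggests could be derived from the same pointer list. This is a rough sketch, not a worked-out design: `word_at`, `collapsed_index`, and `expand` are hypothetical names, a word is assumed to be a run of letters and digits, and the sub-index idea from CS is not attempted.

```python
from collections import Counter


def word_at(text, pointer):
    """Return the lower-cased run of letters and digits starting at pointer."""
    end = pointer
    while end < len(text) and text[end].isalnum():
        end += 1
    return text[pointer:end].lower()


def collapsed_index(text, pointers):
    """Return (word, occurrence count) pairs in alphabetical order."""
    counts = Counter(word_at(text, p) for p in pointers)
    return sorted(counts.items())


def expand(text, pointers, word):
    """Return the pointers whose word matches the chosen entry."""
    return [p for p in pointers if word_at(text, p) == word.lower()]


text = "the abacus and the aardvark and an Abacus "
pointers = [i for i in range(len(text))
            if text[i] != " " and (i == 0 or text[i - 1] == " ")]
for word, count in collapsed_index(text, pointers):
    print(word, count)
print(expand(text, pointers, "abacus"))
```

Clicking an entry then corresponds to calling `expand` and showing the full index-in-context lines for the returned pointers.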
The index should only take up 1/2 or so the size of the whole document
(4 bytes/pointer, and the average indexed word is probably at least 4 letters
long, so there should be little chance of the index exceeding the size of
the document).
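The size estimate above can be checked with a little arithmetic. This sketch assumes that each word occupies its letters plus one delimiter byte and that some fraction of the words is skipped by the ignore list; the function name and the sample fraction are assumptions, not figures from the message.

```python
def index_to_document_ratio(pointer_bytes=4, avg_word_len=4,
                            indexed_fraction=1.0):
    """Estimate index size divided by document size."""
    bytes_per_word = avg_word_len + 1   # the word plus one delimiter
    return pointer_bytes * indexed_fraction / bytes_per_word


# Every word indexed, 4-letter average: the index stays below document size.
print(index_to_document_ratio())
# Assuming the ignore list removes 3 words in 8 (a made-up fraction).
print(index_to_document_ratio(indexed_fraction=0.625))
```

Under these assumptions the ratio is 4/5 with every word indexed, and it falls toward 1/2 as the ignore list grows or the average word gets longer.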
A left parenthesis is a good delimiter to use, in addition to a space.
So, any comments? If this already exists commercially, please give me a
pointer to it, so I don't reinvent the whole thing ... I'm told that a KWIC
index is a "standard student project", but that it tends not to be too
useful ... but with the speed and scrollability that I envision, I think
it would be a tremendous tool for me to have a hand in browsing through my
mountains of files.
I'm going to try to do something quick and dirty in MacFORTH to prove
the concept out for this index-context thing ... have discussed it with
a variety of friends, and gotten various responses.... Will forward an
edited excerpt from this note to IRList and info-mac, and see if there is
any help there ....
(zim@mitre or zimmer@lll-tis)
------------------------------
From: Don Walker <walker%mouton.arpa@CSNET-RELAY>
Date: Thu, 31 Oct 85 16:47:40 est
Subject: CALL FOR PAPERS; ACL 1986 Annual Meeting
CALL FOR PAPERS
24th Annual Meeting of the Association for Computational Linguistics
10-13 June 1986, Columbia University, New York, NY, USA
SCOPE: Papers are invited on all aspects of computational linguistics,
including, but not limited to, pragmatics, discourse, semantics, and
syntax; understanding and generating spoken and written language;
linguistic, mathematical, and psychological models of language;
phonetics and phonology; speech analysis, synthesis, and recognition;
translation and translation aids; natural language interfaces; and
theoretical and applications papers of every kind.
REQUIREMENTS: Papers should describe unique work that has not
been submitted elsewhere; they should emphasize completed work rather
than intended work; and they should indicate clearly the state of
completion of the reported results. Authors should send eight copies
of an extended abstract up to eight pages long (single-spaced if
desired) to:
Alan W. Biermann
ACL86 Program Chair
Department of Computer Science
Duke University
Durham, NC 27706, USA
[919:684-3048; awb%duke@csnet-relay]
SCHEDULE: Papers are due by 6 January 1986. Authors will be
notified of acceptance by 25 February. Camera-ready copies of final
papers prepared on model paper must be received by 18 April along with
a signed copyright release statement.
OTHER ACTIVITIES: The meeting will include a program of tutorials and
a variety of exhibits and demonstrations. Anyone wishing to arrange an
exhibit or present a demonstration should send a brief description to
Alan Biermann along with a specification of physical requirements:
space, power, telephone connections, tables, etc.
CONFERENCE INFORMATION: Local arrangements are being handled by Kathy
McKeown and Cecile Paris, Department of Computer Science, Columbia
University, New York, NY 10027; 212:280-8194 and 8125; mckeown and
cecile @columbia-20.arpa. For other information on the conference and
on the ACL more generally, contact Don Walker (ACL), Bell Communications
Research, 445 South Street, MRE 2A379, Morristown, NJ 07960;
201:829-4312; walker@mouton.arpa or walker%mouton@csnet-relay or
bellcore!walker@berkeley.
Program Committee: Alan W. Biermann, Duke University
Kenneth W. Church, AT&T Bell Laboratories
Michael Dyer, University of California at Los Angeles
Carole D. Hafner, Northeastern University
George E. Heidorn, IBM T.J. Watson Research Center
David D. McDonald, University of Massachusetts
Fernando C.N. Pereira, SRI International
Candace L. Sidner, BBN Laboratories
John S. White, Siemens Communication Systems
LSA SUMMER LINGUISTIC INSTITUTE: ACL-86 is scheduled just before the
53rd LSA Institute, which will be held at the Graduate School and
University Center of the City University of New York from 23 June to 31
July. The 1986 Institute is the first to focus on computational
linguistics. During the intervening week, a number of special courses
will be held that should be of particular interest to computational
linguists. For further information contact D. Terence Langendoen, CUNY
Graduate Center, 33 W. 42nd Street, New York, NY 10036; 212:921-9061;
tergc%cunyvm@wiscvm.arpa.
------------------------------
From: Werner Uhrig <CMP.WERNER%r20.utexas.edu@CSNET-RELAY>
Date: Sat 2 Nov 85 14:13:34-CST
Subject: COMPUTERISED ARCHIVES - MUSEUMS ON DISC
[ from "The Economist", Oct 26, 85. page 100 ]
COMPUTERISED ARCHIVES - MUSEUMS ON DISC
Museums and libraries face a dilemma. They wish to preserve their treasures but
they must allow the public access to them. The two jobs are often incompatible.
...
Microfilm provides one answer, but it is inadequate for several reasons. The
film itself is as perishable as paper, while the film-reading machines are
bulky and expensive.
The Smithsonian Institution's Air and Space Museum in Washington (the world's
busiest museum, with 12m visitors last year) thinks it has a better way. Mr.
Hernan Otano, head of the Smithsonian's "advanced projects" division, has
developed a way of recording digitised images of documents that makes them easy
to record and retrieve. It is having a test run on more than 50,000 documents
that make up the archives of Wernher von Braun, the rocket pioneer. By the
middle of next year, you will be able to walk into the Smithsonian and buy one
video disc on which are copies of all von Braun's papers.
The Air and Space Museum has already recorded its collection of nearly 1m
photographs on video discs using a simple analog system, where the photograph
is filmed by a video camera and the image transferred to a video disc ....
....
The handling of originals decreased by 50% in a year after the first disc was
made available. The discs sell for $30 each.
But the Otano project is more ambitious. By turning the images into digital
information, you can transmit them over telephone lines and simultaneously
index the text. It begins with a digitising video camera, which automatically
focuses, adjusts for things like light and paper colour, zooms in or out to
capture the whole document and then makes a black-and-white image consisting
of 4m spots, or pixels, digitally stored. Unlike microfilm the image being
recorded is displayed on a screen so it can be checked. It is then compressed
by a personal computer into 50 kilobytes of information per image by ignoring
large uniform areas.
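The figures quoted above imply a compression ratio of roughly ten to one, if one assumes 1 bit per pixel for the black-and-white image and takes 50 kilobytes as 50,000 bytes. The article does not name the compression method; run-length encoding is used below purely as an illustration of how large uniform areas can be collapsed, and the scan line is invented sample data.

```python
from itertools import groupby


def run_length_encode(bits):
    """Collapse consecutive equal pixels into (value, run length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(bits)]


def run_length_decode(runs):
    """Expand (value, run length) pairs back into the pixel sequence."""
    return [value for value, length in runs for _ in range(length)]


PIXELS = 4_000_000
raw_bytes = PIXELS // 8       # assuming 1 bit per pixel
compressed_bytes = 50_000     # taking 50 kilobytes as 50,000 bytes
ratio = raw_bytes / compressed_bytes
print(raw_bytes, ratio)

# One invented scan line: white margin, a short black stroke, white margin.
scan_line = [0] * 30 + [1] * 4 + [0] * 30
runs = run_length_encode(scan_line)
print(runs)
```

A mostly blank page reduces to a handful of runs per scan line, which is why uniform areas cost almost nothing to store.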
Mr. Otano's group added software that, in the case of printed documents, can
turn any text into an ASCII text string - meaning that a computer can then
recognise the words (it can read 2,000 different typefaces). By looking for
words, it can then make an automatic index of the text in the documents
themselves.
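Once the text is available as ASCII, building a word index is straightforward. The sketch below is a generic inverted index, not the Smithsonian's software, whose design the article does not describe; `build_word_index` is a hypothetical name and the two document texts are invented samples.

```python
import re
from collections import defaultdict


def build_word_index(documents):
    """Map each lower-cased word to the sorted ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


# Invented sample texts standing in for recognised documents.
documents = {
    1: "Rocket engine test report",
    2: "Letter on rocket fuel",
}
index = build_word_index(documents)
print(index["rocket"])
```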
...
And, to give an idea of the scale, Encyclopaedia Britannica could go on a
double-sided disc, as could the contents of 33 filing cabinets. With a
video-disc player and a printer, a museum can have, for a few thousand dollars,
a system that can produce indexed, word-perfect and effectively indestructible
copies of its collection to which anybody can have access merely by buying a
disc.
...
The idea is so simple that museum directors the world over will be kicking
themselves that they did not think of it (the Smithsonian has applied for a
patent). And eventually, enthuses Mr. Otano, colour photographs and paintings
as well as three-dimensional objects (fossils, coins) could be photographed and
digitally stored in the same way. The gargantuan collections of the great
museums - the Smithsonian alone holds 100m items - will then be safe.
------------------------------
From: Henry Nussbacher <HJNCU%CUNYVM.BITNET%wiscvm.wisc.edu@CSNET-RELAY>
Date: Mon, 21 Oct 85 10:49 EDT
Subject: Ir-List now abstracted into Internetwork Database server
As per a recent announcement in Ir-List, the Ir-List digest has now been added
to the Database server at node Bitnic in Bitnet. Refer back to Issue #15
(listed as September 19th - but should have been October 19th) for further
details.
Hank
------------------------------
END OF IRList Digest
********************