IRList Digest Volume 3 Number 21
IRList Digest Thursday, 6 August 1987 Volume 3 : Issue 21
Today's Topics:
Email - Problems, plans for IRList
Address - Dr. M.B. Koll, Personal Library Software
Query - Contact for obtaining SMART
- Suggestions for providing online access to Canadian Tax Act
Seminar - Responsa system demonstration
- Short-context disambiguation in large text databases
News addresses are ARPANET: fox@vtopus.cs.vt.edu BITNET: foxea@vtvax3.bitnet
CSNET: fox@vt UUCPNET: seismo!vtisr1!irlistrq
----------------------------------------------------------------------
Date: Thu, 6 Aug 87 15:50:33 edt
From: fox (Ed Fox)
Subject: electronic mail problems and plans relating to IRList
1. Recent problems
Two weeks ago we had lightning hits that caused around $40K of
damage to our departmental computers. The machine that IRlist is
usually composed on was down for that period, so it has been difficult
to get news out. I will attempt to catch up on this in the next week.
If you sent in news and it does not appear soon, please send your
communication in again, since some messages were lost. I apologize
for any inconvenience.
2. Disappearance of seismo as UUCP connection
By 1 September, the machine called "seismo" that is at the Center for
Seismic Studies will stop serving as a polling center for UUCP mail.
Please stop using seismo!vtisr1!fox as a UUCP address to reach me.
We will have our machine "vtopus" connected to several other UUCP
machines, so fox@vtopus.uucp or an address with the appropriate route
should work as a replacement. I do not encourage UUCP traffic, but if
it is necessary, use vtopus!fox rather than vtisr1!fox since vtisr1 is
becoming more isolated than before.
3. Connection to the ARPANET
By early September there will be some changes, hopefully improvements, with
IRList mail handling. The main point is that our machine "vtopus"
will eventually become the central point for all IRList business. Virginia Tech
is now part of SURANET, which is part of NSFNET, and so we are on the DARPA
Internet. When we get all the addressing and other software issues corrected,
vtopus will be accessible for FTP and other services. I will post information
when it is available and when we have finished testing. At that time,
people who want access to back issues in quantity will be able to get
direct access; up till then I will honor requests for small numbers of
back issues. Later, vtopus will also be on BITNET, so UUCP, ARPANET,
and BITNET mail will be from one place.
4. Interim situation
Meanwhile, please try to send mail to my BITNET address,
foxea@vtvax3.bitnet, which will always remain as an option for
reaching me. ARPANET and CSNET members can reach that with address
foxea%vtvax3.bitnet@wiscvm.wisc.edu and BITNET members can reach it
directly. The address for vtopus is now and will continue to be
fox@vtopus.cs.vt.edu but I prefer it not be used a great deal till our
ARPANET connection is perfected.
5. Help with address changes
Please notify me in advance if you change address or wish to drop
your subscription, unless you are handling these matters with someone
who maintains a local redistribution. Please try to give complete
addresses, and if it is not obvious, indicate if your address is
relative to BITNET or ARPANET or UUCPNET since it is sometimes hard to
reach people. If you stop receiving IRList, be sure to let me know
and we can try to see what happened - I drop people when mailers tell
me messages are not getting through.
Thanks for your patience! - Ed
------------------------------
Date: Thu, 6 Aug 87 15:58:43 edt
From: fox (Ed Fox)
Subject: Announcement from Dr. Matthew B. Koll
Dr. Matthew B. Koll has asked me to announce his new address:
Personal Library Software
15215 Shady Grove Road
Rockville MD 20850
(301) 926-1402
He is no longer with George Mason University, and has shifted efforts
from his former company, KNM Inc., which marketed SIRE, to devote full
time to Personal Library Software. They have a package which is an
enhanced version of SIRE.
Dr. Koll does not now have an ARPANET address, so should be contacted
directly at the address above. He may have openings for experienced C
programmers who are knowledgeable about information retrieval, and
have some background in UNIX.
------------------------------
Date: Fri, 24 Jul 87 16:28:09 PDT
From: George Cross <cross@cs1.wsu.edu>
Subject: SMART
Hi,
Do you have a contact for getting a copy of SMART from Cornell? I remember
seeing a license agreement posted some time ago and Don Kraft ordered one
for LSU. Thanks.
---- George
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
George R. Cross cross@cs1.wsu.edu
Computer Science Department ...!ucbvax!ucdavis!egg-id!ui3!wsucshp!cs1!cross
Washington State University faccross@wsuvm1.BITNET
Pullman, WA 99164-1210 Phone: 509-335-6319 or 509-335-6636
[Note: contact chrisb@cornell.arpa by electronic mail, or write to
Professor Gerard Salton at Cornell. - Ed]
------------------------------
Date: Fri, 10 Jul 87 17:06:49 EDT
From: seismo!mnetor!lsuc!dave
Subject: Indexing of a complex statute for on-line retrieval
We at the Law Society of Upper Canada are responsible for
post-law school legal education in Ontario, both for call to
the Bar (the Law Society governs the legal profession in the
province and admits new members through the Bar Admission Course)
and for continuing legal education.
We've been using CAI for several years, particularly to teach
Canadian income tax law. Our tax courses are taken by over 1,000
students a year plus a number of lawyers and others, and we're
developing more advanced courses for lawyers' use.
We have the opportunity to acquire an on-line version of the
(Canadian) Income Tax Act, a rather massive statute. In its
published version, along with history of changes, regulations
and various minor annotations, it's over 1400 pages. I'm told
the raw on-line data is something like 5-10Mb. The publisher is
interested in us putting the Act up on our system so they can
gain experience in the "electronic publishing" field, and learn
how it might be used and how it can best be organized for retrieval.
They are therefore willing to let us have it for free.
My interest is in making this tremendously useful information
available to people who are on our system anyway for studying
tax through CAI. If the experiment is successful, we might look
to putting other primary and secondary tax sources on-line in the
future.
Ours is a UNIX system, a Perkin-Elmer 3220 (roughly the power of
a VAX-11/750) running UNIX version 7. We're educational source-licensed
for UNIX and can upgrade the license to System V if necessary.
My question is: how should I go about putting the data up on-line?
(We'll be getting the data in raw ASCII form from a different system.)
We don't have a lot of time to devote to this, as we're very busy
with other projects. Are there existing tools I can make use of?
At the most primitive level, I imagine I would just stick the
data into a UNIX file and give people existing tools like "grep"
and "more" for searching and browsing through it. I can imagine
indexing the section and subsection numbers too, perhaps by
location in the file so the user could seek to the right provision
quickly. I'm a real novice in the field of information retrieval,
however.
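The section-number index described above can be sketched in a few lines. This is only an illustration in Python, not a recommendation of a particular tool: the heading pattern (a line starting with a number such as "2" or "2.1"), the function names build_index and read_section, and the sample text are all assumptions made up for the example, and the real Act's layout would need its own pattern.

```python
import io
import re

# Hypothetical heading format: a line starting with a section number such
# as "12." or "12.1", followed by whitespace. The real layout may differ.
SECTION_RE = re.compile(rb"^(\d+(?:\.\d+)*)\.?\s")

def build_index(f):
    """Map each section number to the byte offset of its heading line."""
    index = {}
    offset = f.tell()
    for line in iter(f.readline, b""):
        m = SECTION_RE.match(line)
        if m:
            # Keep the first occurrence if a number appears twice.
            index.setdefault(m.group(1).decode("ascii"), offset)
        offset += len(line)
    return index

def read_section(f, index, number, max_lines=20):
    """Seek straight to a section and return its lines (up to max_lines)."""
    f.seek(index[number])
    lines = []
    for _ in range(max_lines):
        line = f.readline()
        # Stop at end of file or at the next section heading.
        if not line or (lines and SECTION_RE.match(line)):
            break
        lines.append(line.decode("ascii", errors="replace"))
    return "".join(lines)

# Invented placeholder text, not the wording of any real statute.
SAMPLE = b"""1. Short title
This Act may be cited as the Sample Act.
2. Tax payable
An income tax shall be paid as required.
2.1 Taxable income
Taxable income is income plus additions.
"""

act = io.BytesIO(SAMPLE)  # stands in for open("act.txt", "rb")
index = build_index(act)
print(read_section(act, index, "2.1"))
```

The index is built in one pass and could be saved to a small side file, so a lookup costs one seek rather than a scan of the whole text.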
I'd appreciate any suggestions as to (1) quick solutions or existing
tools which will make the data more usable; (2) references to literature
on storage/retrieval of complex statutes; and (3) specific ideas of
more complex indexing or retrieval mechanisms that we might implement
down the road. Many thanks.
David Sherman
Computer Education Facility
The Law Society of Upper Canada
Osgoode Hall
Toronto, Canada M5H 2N6
dave@lsuc.uucp +1 416 947 3466
{ seismo!mnetor pyramid!utai decvax!utcsri ihnp4!utzoo } !lsuc!dave
[Note: There are various retrieval packages that might work. The SMART
system is available from Cornell for a nominal charge, but may not run
on your hardware/software. The Personal Librarian would probably work
and Matt Koll could tell you. See other msgs in this digest for
contact information about these two systems. There are many others
around, and many people working on legal information retrieval - I
hope some will contact you with details and you will let us know what
you decide. - Ed]
------------------------------
Date: Thu, 6 Aug 87 16:49:24 edt
From: fox (Ed Fox)
Subject: Demonstration of RESPONSA System
YOU ARE INVITED TO AN ONLINE DEMONSTRATION OF THE RESPONSA SYSTEM
An advanced full-text retrieval system
(with morphological processing) for
2000 years of Rabbinical Literature
by
Yaacov Choueka
Bell Communications Research
Morristown, New Jersey
(on sabbatical leave from the
Department of Mathematics and Computer Science
Bar-Ilan University, Ramat-Gan, ISRAEL)
WHEN: Wedn. August 12 from 1:30 - 3pm, and 7:30 - 9pm
WHERE: Newman Library, 6th floor board room
WHAT: Come and stop by if you would like to see
* An interesting full-text retrieval system with a
remarkably fast response time (despite some "hostile"
parameters such as the size of the database, the
complexity of the search, the long and not-so-reliable
telephone communications lines to Israel, and the
1200-baud transmission rate).
* An automatically lemmatized (in a context-free sense)
50-million-word corpus (probably the only
lemmatized one of this size in any language).
* A complete morphological component embedded in an
operational retrieval system.
* An online module for accurate and complete
morphological analysis of any word in the language.
* Some beginnings of applications of a short-context
approach (how many different "following neighbors"
are there for a given ambiguous word with 200,000
occurrences? How many of these neighbors occur
more than 1000 times, and which are they? Do they
disambiguate the given word? How can this
information be used in on-line retrieval or dictionary
building contexts?).
WHO:
Dr. Choueka has almost twenty years of experience in
teaching and research in computer science, some of it (in
the early years) in finite automata and formal languages
theory, but most of it in information retrieval,
computational linguistics and text processing. He was
part of the team that initiated the RESPONSA project in 1966,
and has served as its Director and Principal Investigator since 1975.
------------------------------
Date: Thu, 6 Aug 87 16:50:05 edt
From: fox (Ed Fox)
Subject: Seminar on Disambiguation
COMPUTER SCIENCE SEMINAR
McBryde Hall Room 201
Wedn. August 12, 10:15 - 11:30AM
Short Is Beautiful:
Short-context disambiguation in large textual databases
by
Yaacov Choueka
Bell Communications Research
Morristown, New Jersey
(on sabbatical leave from the
Department of Mathematics and Computer Science
Bar-Ilan University, Ramat-Gan, ISRAEL)
ABSTRACT:
Morphological disambiguation (i.e., finding the
intended "correct" meaning of an ambiguous word in a
specific context) is an intellectually challenging and
practically important issue in automatic text processing. One
of the suggested pragmatic approaches, especially viable for
large textual databases, the short-context method, proposes
to use the (very) short context of an ambiguous word as an
adequate vehicle for its disambiguation. An experiment
carefully designed to test this idea and its validity was
developed and applied to a small French corpus some time
ago, and the results were recently reported elsewhere.
Based on the clearly positive outcome of this test, an online
short-context disambiguation program was incorporated as
an operational component in the Responsa full-text retrieval
system (Hebrew, 50 million words), and is now being tested
on a large scale.
Using this program, the user can submit a word W to
the system, which will respond by instantly displaying a list
of all the different right (left) neighbors of W in the
database, together with the neighbor's "local" frequency (its
frequency as a neighbor of W), ranked by the local
frequencies. Preliminary findings show that more often
than not such a short context of the word is enough to
correctly disambiguate its appropriate occurrences. If
needed, however, a right neighbor can itself be expanded
into the set of its own right neighbors, which are then
displayed, giving the set of all the different two-word right
contexts of the word under examination.
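The neighbor lists described above can be illustrated with a small sketch. It is not the Responsa implementation, whose internals are not given here: the function names, the plain whitespace tokenization, and the toy English sentence are assumptions made for the example. Left neighbors would be counted the same way with the pair order reversed.

```python
from collections import Counter

def right_neighbors(tokens, word):
    """Count each token that immediately follows `word` (its local frequency)."""
    return Counter(
        nxt for cur, nxt in zip(tokens, tokens[1:]) if cur == word
    )

def two_word_right_contexts(tokens, word, neighbor):
    """Expand one right neighbor into the two-word right contexts of `word`."""
    return Counter(
        (b, c)
        for a, b, c in zip(tokens, tokens[1:], tokens[2:])
        if a == word and b == neighbor
    )

# Toy English text, invented for illustration; the Responsa corpus is Hebrew.
text = ("the bank of the river was muddy but the bank of england "
        "raised rates and the bank raised fees")
tokens = text.split()

# Neighbors ranked by local frequency, most frequent first.
for neighbor, freq in right_neighbors(tokens, "bank").most_common():
    print(neighbor, freq)

# Further expansion of the neighbor "of" into two-word right contexts.
print(two_word_right_contexts(tokens, "bank", "of").most_common())
```

A scan like this is linear in the corpus size; an operational system over 50 million words would presumably rely on a precomputed index instead, which this sketch does not attempt.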
It was found that, in general, no more than a few
minutes are required for a casual user to decide on the
intended meaning of an ambiguous W in its most frequent
contexts, thus resulting in the immediate disambiguation of
thousands of occurrences of W in the text. When
automatically recorded, the user's decisions can greatly help
in achieving a "context-sensitive" lemmatization of the
corpus, once its "context-free" one has been completed. The
method is also very useful in information retrieval contexts,
where it gives the user an efficient tool for specifying, in a
query with an ambiguous word, which of the word's
contexts should be retrieved, thus greatly enhancing the
precision of the retrieval. Finally, it is expected that by
gradually accumulating these disambiguation decisions in the
appropriate word-entry of the available automatic
dictionary of the language, "local expert systems" for many
ambiguous words will develop, that can greatly facilitate
ambiguity resolution in practical situations.
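As a rough illustration of how recorded decisions might be reused, the sketch below tags every occurrence of an ambiguous word from a table that maps a right neighbor to the meaning a user chose for it. The table layout, the function name disambiguate, the sense labels, and the toy sentence are all hypothetical and are not taken from the system described in the abstract.

```python
def disambiguate(tokens, word, decisions):
    """Return (position, sense) for each occurrence of `word`.

    `decisions` maps a right neighbor to the sense a user chose for it;
    occurrences whose neighbor has no recorded decision get None.
    """
    tagged = []
    for i, tok in enumerate(tokens):
        if tok == word:
            nxt = tokens[i + 1] if i + 1 < len(tokens) else None
            tagged.append((i, decisions.get(nxt)))
    return tagged

# Toy English text and hypothetical user choices, invented for illustration.
tokens = "the bank of the river and the bank raised fees near the bank".split()
decisions = {"of": "shore", "raised": "institution"}

tagged = disambiguate(tokens, "bank", decisions)
print(tagged)

# A query can then keep only the occurrences with the wanted meaning.
hits = [i for i, sense in tagged if sense == "institution"]
print(hits)
```

One decision about a frequent neighbor settles every occurrence that shares it, which is the source of the leverage the abstract describes; occurrences left as None would still need a longer context or a human decision.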
------------------------------
END OF IRList Digest
********************