IRList Digest Volume 4 Number 42
IRList Digest Thursday, 28 July 1988 Volume 4 : Issue 42
Today's Topics:
Email - Address for Charles Meadow
Query - Definition of hypertext/hypermedia
Reply - Suffixing, stemming
Discussion - Metamorph, stemming, online search style
- Online search style
- Metamorph
Announcement - Forum on small-systems database products
- NTIS demo on Japanese research
- Thesis defense on comparing extended Boolean schemes
News addresses are
Internet or CSNET: fox@vtopus.cs.vt.edu or fox@fox.cs.vt.edu
BITNET: foxea@vtvax3.bitnet (soon will be foxea@vtcc1)
----------------------------------------------------------------------
Date: Wed, 27 Jul 88 08:58:06 CST
From: Jeff Huestis <C81350JH@WUVMD>
Subject: Address for Charles T. Meadow
Ed: do you have an email address for Charles T. Meadow? ...
--Jeff
------------------------------
Date: Wed, 27 Jul 88 14:47 EDT
From: VENTURA%21514%atc.bendix.com@RELAY.CS.NET
Subject: What exactly IS "hypertext"/"hypermedia"?
Does anyone have a good (succinct) definition of what hypertext/-media is?
I am trying to figure out whether or not an application I am working on
qualifies.
CA Ventura
------------------------------
Date: Fri, 22 Jul 88 16:42:38 EDT
From: Donna Harman <harman@nav.icst.nbs.gov>
Subject: reply to stemming query in IRDIGEST
[Note: to send mail to Donna, do not use the above address (at
least I could not get it to work) - instead try
harman%icst-nav@icst-osi.arpa
Be careful in later correspondence since "Reply" may use the
one you see above under "From" rather than what I have given. - Ed.]
I don't know how to reply to the IRDIGEST, so I am trying it this way.
[Note: You did fine - use addresses in the header of each IRList
or as explained in the Welcome message. - Ed.]
Reply to the query on suffixing:
In the interest of answering the actual question, I am supplying four
references--my paper on stemming performance, and three papers on
actual algorithms.
Harman D., "A Failure Analysis on the Limitation of Suffixing
in an Online Environment", Proceedings of the Tenth
Annual International Conference on Research and
Development in Information Retrieval, New Orleans, 1987.
Lovins J.B., "Development of a Stemming Algorithm", Mechanical
Translation and Computational Linguistics 11, March 1968.
(this is the description of the Lovins stemming algorithm
which has been extended for use as the SMART stemmer).
Porter M.F. "An Algorithm for Suffix Stripping", Program, Vol 14,
July 1980.
(this is a newer algorithm, removing fewer suffixes)
Ulmschneider J. and Doszkocs T. "A Practical Stemming Algorithm for
Online Search Assistance", Online Review 7(4), 1983.
(this is a description of how to tailor-build a stemming
algorithm for a given collection)
In the interest of rabid discussion on stemming, I will put forth the following
strawman for debate.
Stemming is not an improvement on full word retrieval except in two
situations:
1) storage is a problem--stems store in less space, although the inverted
file is not smaller (same number of postings, just organized under a
smaller number of terms)
2) the number of documents is small and/or recall is much more important
than precision.
Fire away!
[Note: Since there are conflicting results regarding the value of
stemming and that seems to depend on the stemming algorithm and the
collection being used for the tests, why not just try to figure out
what combination of cases is best rather than make such a categorical
statement as you have done above? - Ed.]
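The storage point in the strawman above can be illustrated with a small sketch. It is an assumption-laden toy, not the Lovins, Porter, or SMART stemmer: `toy_stem`, `build_index`, and `count_postings` are hypothetical names, the suffix list is arbitrary, and postings are assumed to be positional (one entry per word occurrence). Under those assumptions, stemming shrinks the number of distinct terms while the total number of postings stays the same.

```python
from collections import defaultdict

def toy_stem(word):
    """Hypothetical suffix stripper for illustration only (not Lovins or Porter)."""
    for suffix in ("ing", "ed", "s"):
        # Strip the first matching suffix, keeping at least three letters.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_index(docs, normalize=lambda w: w):
    """Build an inverted file: term -> list of (doc_id, position) postings."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for pos, word in enumerate(text.lower().split()):
            index[normalize(word)].append((doc_id, pos))
    return index

def count_postings(index):
    """Total number of postings across all terms."""
    return sum(len(postings) for postings in index.values())

docs = [
    "stemming reduces terms",
    "stemmed term lists",
    "the stem of a term",
]

full = build_index(docs)               # one entry per full word
stemmed = build_index(docs, toy_stem)  # one entry per stem

print(len(full), count_postings(full))
print(len(stemmed), count_postings(stemmed))
```

With document-level postings (one entry per term per document) the counts could differ, since conflated words in the same document would merge into a single posting.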
------------------------------
Date: Mon, 25 Jul 88 19:14:54 EDT
From: MARCUS@Lids.mit.edu (Richard Marcus)
Subject: Metamorph Stemming Search Costs and Style
Ed,
I have comments on three subjects in recent IRList Digests
which seem to be interrelated in various ways:
(1) Metamorph -- Ed, I admire your restraint in attempting to report on
this effort which has received so much hype and provided so few technical
details by which to judge it. I don't have any more details on Metamorph
as such, but there was an interesting article in BYTE (May, 1988;
p 297ff) by Roy E. Kimbrell which describes an apparently related
"N-Gram" method attributed to Raymond D'Amore and Clinton Mah of
PAR Government Systems Corp (McLean, VA).
[Note: full address is 1840 Michael Faraday Dr., Suite 300,
Reston, VA 22090-5341 and switchboard is 703/478-9690 - Ed.]
This N-Gram approach uses many
of the Salton SMART techniques (weighted vectors, cosine matching,
clustering, stemming, etc.) but applied to letter strings, or n-grams,
WITHIN words. Although I would argue against statistical, non-word methods
as techniques of CHOICE, at least the methods are reasonably well
explained and some indication of experiments with a test corpus is
given (but no details or comparison with other methods).
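As a rough illustration of the general idea described above (vector matching applied to letter strings within words), the following minimal sketch counts letter trigrams inside each word and compares two texts by cosine similarity. It is not the D'Amore and Mah method or Metamorph: `char_ngrams` and `cosine` are hypothetical names, the choice of n=3 is arbitrary, and raw counts stand in for whatever weighting the actual system uses.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Count overlapping letter n-grams within each word (none span word gaps)."""
    grams = Counter()
    for word in text.lower().split():
        for i in range(len(word) - n + 1):
            grams[word[i:i + n]] += 1
    return grams

def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_a * norm_b)

query = char_ngrams("retrieval")
related = char_ngrams("retrieving")
unrelated = char_ngrams("database")

print(cosine(query, related))    # shares ret, etr, tri, rie, iev
print(cosine(query, unrelated))  # no trigrams in common
```

Because morphological variants share most of their n-grams, this kind of matching gives some of the effect of stemming without an explicit suffix list.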
(2) Ed, your pointers to Aalbersberg [IRLD:4(38)] on stemming were
good starters. Coincidentally, a stemming (conflation) algorithm in the C
programming language is given by Kimbrell in the above-mentioned Byte
article. Let me also add that Julie Lovins, a linguist, developed
a nice stemming algorithm under our Intrex Project (Lovins, Mechanical
Translation 11:22-31[1968]) which has been used to good effect by us
and a number of other organizations. A useful evaluation of the
algorithm was reported by Julie in the Journal of ASIS
[22(1)28-40; January, 1971].
[Note: the Lovins method is the basis for what is used in SMART - Ed.]
One interesting point is how drastically the evaluation depends on
the context. Salton has, I believe, reported on small but significant
effectiveness for simple stemmers in SMART. Donna Harman has reported
(Proceedings 1988 RIAO Conference, pp. 839-848) on experiments with
the NLM IRX system that stemming doesn't help at all. Harman suggests
that the batch-oriented context of IRX might be the reason for the
non-utility of stemming and that an interactive context would probably
yield different results. Our own research supports the latter;
experiments with our highly interactive CONIT system (see, e.g.,
Marcus, Journal ASIS 34(6):381-404; Nov., 1983) have demonstrated
the critical importance of stemming in that context.
(3) Costs Affecting Search Styles -- Bill Joel (supported by
Jeff Huestis) is right on! Cost is a critical component of context.
The Telebase Easynet front end system owes a large part of its success
to techniques for holding down online costs. We have reported
(see, e.g., Marcus, Proceedings ASIS 85; 22:289-292) how cost factors
markedly influence search behavior online. Despite exponential increases
in benefits/costs factors, we have not yet reached the point where
online users can derive anything like the full effectiveness of the
interactive capabilities on computers (although we're working toward
that goal with our 'smart Boolean' approach).
---Dick Marcus, MIT Lab for Information and Decision Systems...
------------------------------
Date: 25 Jul 88 17:03:00 EDT
From: Nahum (N.) Goldmann <ACOUST@BNR.CA>
Subject: Please post. Thanks. (re:Do online costs affect search styl
In response to Dr. Joel's request on IRLIST, the key factor in negotiating
a search online under cost pressure is the KNOWLEDGE OF THE
SEARCH SUBJECT. I discussed this in detail in Chapters 2 and 10
of my book (ONLINE RESEARCH AND RETRIEVAL, TAB Professional
and Reference Books). This knowledge is generally associated with
the END-USER of information, as opposed to the INTERMEDIARY (information
broker). Your analogy with the library is entirely correct, except that a
sane specialist would never ask a librarian to search the stacks
on his/her behalf (precisely because the search has to be interactive).
I believe that negotiating online is better for some (the end-user),
while defining the search beforehand is necessary for others (the intermediary).
Nahum Goldmann
acoust@bnr
Tel. (613)763-2329
------------------------------
Date: Sun, 24 Jul 88 10:10:09 EDT
From: Tung-Ying Chang <EC6C6003@TWNMOE10.bitnet>
Subject: Metamorph
Dear Professor Fox,
I have received volume 4, issues 36-40, and have tried to review the
comments/materials which you mentioned in issue 40. I read the article "Word
ladders and a tower of Babel lead to computational heights defying
assault" in Scientific American, Aug. 1987. I believe this is the
article that Defense Science mentioned in regard to Bell Labs' research.
It gives no technical details, only a general description.
I agree with you that we don't need to discuss commercial systems
unless there is something new. I suspect most of the "new things" are
covered by commercial secrecy. Anyway, I am interested in web structure
and morpheme retrieval. Thank you very much.
Good luck.
Tung-Ying Chang
~ Tung-Ying Chang Professor Fox 7/24/88 Metamorph
------------------------------
Date: Mon, 25 Jul 88 15:04:27 EST
From: "James S. Cowie" <JCOWIE@YALEVM>
Subject: PCDBMS-L at YALEVM
Greetings, IRLIST people...
Just a brief note to inform you that due to a great positive response to
initial inquiries, there now exists a Listserv forum for discussion of
small-systems database products in academic or library contexts. All are
welcome. The new list is PCDBMS-L at YALEVM. Products to be discussed
include Paradox, NotaBene, Quattro, Dbase, Rbase, DataEase, Reflex,
Revelation, etc.
yours truly,
James Cowie
Yale University Library
Systems Office
~ James S. Cowie Irlist 7/25/88 PCDBMS-L
Acknowledge-To: <JCOWIE@YALEVM>
------------------------------
Date: Wed, 27 Jul 88 13:07:59 EDT
From: Edward A. Fox <fox>
Subject: NTIS demonstration
On Friday July 29 at 2pm (in the Idea Salon in CPAP, at
104 Draper Road, Blacksburg, VA) there will be a
demonstration by Tim Feinstein of NTIS of their
system to access Japanese research work. All are invited.
For more information, contact John Dickey, Center for Public
Administration and Policy, VPI&SU (703) 961-5133/5830.
------------------------------
Date: Wed, 27 Jul 88 13:04:49 EDT
From: Edward A. Fox <fox@fox.cs.vt.edu>
Subject: defense
Whay C. Lee will have his MS thesis defense on Friday,
July 29 at 10am in McBryde room 558. The title of
his thesis is "Experimental Comparison of Schemes
for Interpreting Boolean Queries".
All are invited. - Ed Fox
------------------------------
END OF IRList Digest
********************