Copy Link
Add to Bookmark
Report

IRList Digest Volume 2 Number 55

eZine's profile picture
Published in 
IRList Digest
 · 1 year ago

IRList Digest           Thursday, 16 October 1986      Volume 2 : Issue 55 

Today's Topics:
Query - Addresses of Robertson and Croft
- List of outstanding problems in information retrieval
Discussion - SOUNDEX: why not consider Davidson's algorithm ?
- SOUNDEX: what about canonical form for given names etc.?
- SOUNDEX: 92 lines of code included
Article - Abstract on Geometric Analysis of Similarity Measures

News addresses are ARPANET: fox%vt@csnet-relay.arpa BITNET: foxea@vtvax3.bitnet
CSNET: fox@vt UUCPNET: seismo!vtisr1!irlistrq

----------------------------------------------------------------------

Message-Id: <8610130807.AA04199@cernvax.UUCP>
From: Hans-Peter Giger <seismo!mcvax!ethz!hpg>
Date: Mon, 13 Oct 86 08:07:21 -0100
Subject: email adress unknown


Hi
does anybody know the email adresses of the following persons:

S.E Robertson at City University in London
W.B Croft at University of Massachusetts

We are looking for algorithms to help us converting weighted search request into
series of boolean search request. We are using Elementary Logic Conjuncts as
presented by Jamieson and Heine.

Thanks a lot
Adress: mcvax!cernvax!ethz!giger

Hanspeter Giger

[Note: The address for Bruce is
croft%umass-cs.csnet@csnet-relay.arpa
You might contact Chris Buckley (or he might respond to this) about
some of the work at Cornell that may relate. I am not sure if S.E.
Robertson receives this digest; if he does not and is interested
please let him know. - Ed]

------------------------------

Date: Fri, 10 Oct 86 09:45:15 edt
From: choueka@thunder.bellcore.com (Yaacov Choueka)
Message-Id: <8610101345.AA15205@thunder.bellcore.com>
Subject: problems in inf. ret.

Help

I was referred to you by M. Lesk for the following inf.: A few years ago
Salton published a list of outstanding problems in inf. ret. (probably in
Forum but I am not sure).Do you have any ref. on that? (May be a copy?).

Thanks. Y. Choueka (tel. (201) 8295175)

[Note: In JASIS, July 1985, 36(4):268-271, G. Salton talks about some of the
progress that has been made and some of the areas where more work is needed.
In SIGIR Forum 14(4), Summer 83 (also is Proc. 6th Int. ACM SIGIR
Conf. on R&D in IR), M.E. Maron had article "Open Problems in Information
Retrieval."

Perhaps other readers have other references, or would like to send
in comments on problems. There was a good discussion at the ASIS
Annual Meeting sessions run by ACM SIGIR; perhaps one of the panelist
there or a participant would like to summarize the comments made. - Ed]

------------------------------

Date: Sat, 11 Oct 1986 15:37 CST
From: Robert H Greenfield <RHG@UREGINA1>
Subject: Davidson Code

In IRList Vol 2 No 50, R Daniel Johnson comments on Soundex Code
in reply to a request for information from Stephen Page (IRList
Vol 2 No 44).

As a part of my dissertation, I investigated Soundex Code and
compared it to Davidson's code. To put it quickly, the Davidson
Code is easier to calculate and for most purposes superior to
the Soundex. Soundex brought too many non-similar sounding names
together. In some cases, Davidson's code didn't bring the
desired names together. This study looked for duplicate records
in an approximately 25,000 name radiology data base.

I don't understand why Davidson's code is ignored as much as it is.

Davidson L. Retrieval of Misspelled Names in an Airlines
Passenger Record System. Comm ACM; 1962; 5:169-71.

Greenfield RH. An Experiment to Measure the Performance of
Phonetic Key Compression Schemes. Meth Inform Med; 1977 Oct;
16(4):230-3.

R H Greenfield, University of Regina.

PS: Where is Vijay Raghavan hiding these days?
Acknowledge-To: Robert H Greenfield <RHG@UREGINA1>

[Note: Vijay's new network address in Southern Louisiana is
raghavan%usl.csnet@csnet-relay.arpa
I agree with you that people should look at other algorithms besides
Soundex. I had some students compare various methods, and
incorporated them in our version of the SMART system a few years ago.
We did some retrieval tests and found Soundex to give lower precision
but higher recall than either Davidson's 1962 method, or Davidson's
method with "m" and "n" conflated. See article in Proceedings of
ACM SIGIR Conf. in 1985 for a few more details. - Ed]

------------------------------

Message-Id: <MS.V3.10.cfe.0.apollo.139.4@andrew.cmu.edu>
Date: Mon, 13 Oct 86 15:00:29 edt
From: cfe@andrew.cmu.edu (Craig F. Everhart)
Subject: Soundex improvements?


When Knuth introduces Soundex, he says that it works pretty well for
surnames. Are there other, similar, name-compression techniques that are
designed to work well for given names, or for any names? Naively, I could
ask about simple, cheap techniques to map ``Bill'' and ``William'' to the
same canonical representation--a different canonical representation, of
course, from the one to which it maps ``Liz'' and ``Elizabeth'' and
``Libby.''

------------------------------

Date: Thu, 09 Oct 86 09:38:15 O
From: Henry Nussbacher <HANK@BARILVM>
Subject: SOUNDEX

I have been using a SOUNDEX-like program in VM for the past year to
phonetically find user's last names. There is alot of optimizing that
can be done in this area. The program is currently optimized for
American last names and not for European names. Example: A name like
Finkel would be reduced to 2736 but Finkle would become 27361 since the
trailing vowel is deemed to be significant in American names. An Israeli
name like Mizrachi would prefer to have the trailing vowel dropped. I leave
the optimization up to all computer scientists and hackers. For me it
finds a match 98% of the time.

The code is written in Rexx and should be fairly easy to understand for
those that know PL1 or Pascal.

> /*
> This exec will take a single character string that is a person's
> last name and change it into a numeric index. This index can then
> be used to search.
>
> Syntax:
> ======
>
> PHONETIC name
> The EXEC will stack a number when it leaves. It the responsibility
> of the caller to handle the stacked value.
>
> Phonetic rules:
> ==============
>
> 1) 'W' or 'H' at beginning of word is kept, 'W' or 'H' anywhere else
> is thrown out
> 2) Vowel at beginning or end of word is unique, anywhere else it
> is thrown out
>
>
> Logic:
> =====
>
> 1) Uppercase the name
> 2) Change the following letters:
> S,Z to 9
> R to 8
> M,N to 7
> L to 6
> W,H to Z
> D,T to 4
> C,G,J,K,Q,X to 3
> B,F,P,V to 2
> anything else to Y
> 3) Insert an 'X' at the beginning and the end
> 3) Change any 'YX' to '1'
> any 'XY' to '1'
> any 'XZ' to '5'
> 4) Change all single occurrences of 'X', or 'Z' to null
> 5) Reduce all double and triple occurrences of the numbers - 2,3,
> 4,6,7,8,9 to be single digit in length.
> 6) Change all occurrences of 'Y' to null
>
> Example:
> =======
>
> A search of the Weizmann NAMES file for MUSPACHER would find
> NUSSBACHER.
>
>
> Author: Henry Nussbacher
> Date : July 1985
> */

> Arg name /* Pick up the name */
> Upper name /* Upper case it */
> nuname = Translate(name,'YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY222233333344ZZ677899',
> '|@#$%^&*()_-+=!c|\"}{?/`~AEIOUYBFPVCGJKQXDTHWLMNRSZ')
> nuname = 'X'nuname'X' /* Surround the word */
> nuname = Change(nuname,'YX','1')
> nuname = Change(nuname,'XY','1')
> nuname = Change(nuname,'XZ','5')
> nuname = Change(nuname,'Z','')
> nuname = Change(nuname,'X','')
> nuname = Change(nuname,'222','2')
> nuname = Change(nuname,'333','3')
> nuname = Change(nuname,'444','4')
> nuname = Change(nuname,'666','6')
> nuname = Change(nuname,'777','7')
> nuname = Change(nuname,'888','8')
> nuname = Change(nuname,'999','9')
> nuname = Change(nuname,'22','2')
> nuname = Change(nuname,'33','3')
> nuname = Change(nuname,'44','4')
> nuname = Change(nuname,'66','6')
> nuname = Change(nuname,'77','7')
> nuname = Change(nuname,'88','8')
> nuname = Change(nuname,'99','9')
> nuname = Change(nuname,'Y','')
> Queue nuname
> Exit
>
>
> Change: Procedure
> Parse Arg string, old, new
> If old == '' Then Return new||string
> out = ''
> Do While Pos(old,string) ^= 0
> Parse Var string prepart (old) string
> out = out||prepart||new
> End
> Return out||string

------------------------------

Date: Mon, 8 Sep 86 10:15:56 cdt
From: Bill Jones <seismo!mcc.com!mccabe%hulk>
Message-Id: <8609081515.AA09912@hulk>
Subject: abstract submission to IRList

Dear Dr. Fox,

Enclosed please find an abstract of a manuscript "
Pictures of Relevance:
A Geometric Analysis Of Similarity Measures" by myself and George
Furnas. This manuscript will shortly appear as an MCC tech report and
it has also been submitted for possible publication in JASIS. Copies
can be obtained by contacting me (William P. Jones) in any of three
ways:
1. email - send to "
mccabe@mcc.com" ("mccabe" is a family name more
distinctive than "
jones".)
2. phone - 512-338-3326
3. or by writing me at the address below.
I thought it appropriate to give you readers advance notice of this
manuscript. Thank you.



PICTURES OF RELEVANCE: A GEOMETRIC ANALYSIS OF SIMILARITY MEASURES

William P. Jones, Microelectronics and Computer Technology Corporation
P.O. Box 200195, Austin, Texas 78720

George W. Furnas, Bell Communications Research, 435 South St.
Morristown, N. J. 07960


We want computer systems that can help us assess the similarity or
relevance of existing objects (e.g. documents, functions, commands,
etc.) to a statement of our current needs (e.g. the query). Towards
this end, a variety of similarity measures have been proposed. However,
the relationship between a measure's formula and its performance is not
always obvious. A geometric analysis is advanced and its utility
demonstrated through its application to six conventional information
retrieval similarity measures and a seventh spreading activation
measure. All seven similarity measures work with a representational
scheme wherein a query and the database objects are represented as
vectors of term weights. A geometric analysis characterizes each
similarity measure by the nature of its iso-similarity contours in an
n-space containing query and object vectors. This analysis reveals
important differences among the similarity measures under consideration
and suggests conditions under which these differences will have an
empirically verifiable impact upon retrieval performance. The cosine
coefficient, for example, is shown be insensitive to between- object
differences in term relevance while the inner product measure is shown
to be subject to an unbounded influence from individual query terms.
The context-sensitive spreading activation measure may overcome both of
these limitations and deserves further study. The geometric analysis is
intended to complement, and perhaps to guide, the empirical analysis of
similarity measures.

[Note: I have read this and found it quite interesting. They use
equi-similarity contours similar to those we used in the p-norm
extended Boolean query processing studies dating back some 3-4 years,
but have applied that to vector forms. They have added in a taxonomy
of characteristics of measures that sharpens the effects and
differences between measures. And the spreading activation approach
has been receiving a lot of attention in the AI community. - Ed]

------------------------------

END OF IRList Digest
********************

← previous
next →
loading
sending ...
New to Neperos ? Sign Up for free
download Neperos App from Google Play
install Neperos as PWA

Let's discover also

Recent Articles

Recent Comments

Neperos cookies
This website uses cookies to store your preferences and improve the service. Cookies authorization will allow me and / or my partners to process personal data such as browsing behaviour.

By pressing OK you agree to the Terms of Service and acknowledge the Privacy Policy

By pressing REJECT you will be able to continue to use Neperos (like read articles or write comments) but some important cookies will not be set. This may affect certain features and functions of the platform.
OK
REJECT