IRList Digest Volume 2 Number 33
IRList Digest Thursday, 7 August 1986 Volume 2 : Issue 33
Today's Topics:
Discussion - Machine Readable Collins Dict., Job at Leeds Univ.
----------------------------------------------------------------------
Date: 24-JUL-1986 23:09:36
From: RAHTZ%UK.AC.OXFORD.VAX1@AC.UK
Subject: The Machine-Readable Collins English Dictionary, Job at Leeds
The Machine-Readable Collins English Dictionary
Summary of work in progress
Sebastian Rahtz
Department of Computer Studies
University of Southampton
1. Introduction:
This short document summarizes the responses I had to a letter sent
out in June 1986 to all the people who have ordered a tape of Collins
English Dictionary from the Oxford Text Archive (my thanks to Lou
Burnard for the list of names and addresses). I am grateful to all
those who replied to my request for information about how they had
decoded the text, and what they were doing with it; since it was
apparent that quite a lot of work had been done, and that some were
much further on than others, it seemed sensible to send out a summary
of the replies. I have either included sections of electronic mail directly,
or summarized paper mail. ...
2. Philip Taylor, University of London
Date: 1-JUL-1986 11:23:07
From: CHAA006@UK.AC.RHBNC.VAXA
I carried out some work on transliterating the dictionary from
photo-typesetting codes to a more useable form some years ago, when I first received
the tape. I had two objectives:- (1) to provide an online English-language
HELP system, using VMS help, for all entries in the dictionary, and (2) to
integrate the dictionary into the Dennison spelling checker (which also runs
on the VAX). Neither of these projects was 100% successful, but the
intermediate results may be of some use to you. (As part of (1), I also
implemented the core of the IPA on a Mellordate DT80/1 (VT-100 look-alike),
with reasonable success).
I should be happy to pass on all the work I have done, provided only that any
publications resulting from this work acknowledge the various contributors, and
that any further work which you carry out should be equally freely available
among the Academic community.
Philip Taylor (RHBNC, Univ. of London) [CHAA006@UK.AC.RHBNC.VAXB]
2.1 Pascal programs
Here are the more useful files from my work on the Collins English
Dictionary; they are written in Pascal and Macro-32. The programs TYPESET,
DECRYPT and PARSE are good starting points. TYPESET, as is, will produce quite
acceptable output even on unmodified VT100s; if you have any DT80/1s, I can
copy the IPA ROM for you, and the output will then be as close to Collins type-
set form as I was able to achieve within the time available.
If you have no DT80s, I could let you have the IPA in 8*8 dot-matrix form, and
you could burn it into ROMs for whatever devices you do have.
3. Ian Ellis, University of New England
From: ian%oz.neumann@oz.munnari 3-JUL-1986 03:41
Date: Thu, 3 Jul 86 11:30:10 est
Thank you for your letter regarding CED. As yet no one on this Campus has
tried to use CED other than a list of words. We did try to figure some of
the symbols and produce a database but lack of user pressure has allowed
us to put it on the back burner.
Ian Ellis,
Director,
Computer Centre,
University of New England
4. Edward Fox, Virginia Tech
From: vtisr1!fox@gov.css.seismo 3-JUL-1986 08:06
You have hit the jackpot!
I have worked with several students during the last year on
the Collins English Dictionary. One completed his M.S. project
specifically on this.
We are almost done with production of a database, that can be
used from Prolog or from any relational database system, and
probably modified for other systems. I hope to be sending a tape
to Oxford Text Archive by the end of August.
Ed Fox (BITNET[cheapest]:foxea@vtvax3 or foxea%vtvax3.bitnet@wiscvm.arpa;
CSNET:fox@vt;Internet:fox%vtisr1.uucp@seismo.css.gov;UUCP:seismo!vtisr1!fox)
Dr. Edward A. Fox; Dept. of Computer Science; 562 McBryde Hall
Virginia Tech, Blacksburg VA 24061; (703) 961-5113 or 6931
We have done everything EXCEPT for the phonetic and etymology
information - I hope you don't need them! All I have so far is
the MS report - ...
5. David Eckersley, University of Salford
Date: MON, 07 JUL 86 13:57:38 GMT
From: D_ECK@UK.AC.SALFORD.SYSC
University of Salford Computing Services: Dr J B Slater, Director
I reply to your letter of June 24th concerning the Collins English Dictionary
from the Oxford Text Archive. I'm afraid we have for the time being shelved our
plans for using this data. The person who was to do the work left us, and I
have not taken it up. We did not manage to attach any consistent meanings to
the embedded codes in the text.
D Eckersley
(Secretary, IUSC)
6. Eric Atwell, University of Leeds
From: E S Atwell [eric@uk.ac.leeds.ai]
Date: Tue, 15 Jul 86 13:26:11 bst
I'm afraid I haven't done anything of use to
you with the CED tape: I got it mainly to evaluate it and compare
it to the machine-readable versions of two other dictionaries,
the Oxford Advanced Learner's Dictionary (OALD) and the Longman
Dictionary of Contemporary English (LDOCE). I am researching
into aspects of parsing and grammatical analysis of unrestricted
`raw' text, for which a large non-`toy' dictionary is required.
Each word in the dictionary needs detailed grammatical
information; and the grammatical codes used in OALD and LDOCE are
far more refined and detailed than those of CED, so I have
concentrated work on the other two. In fact, LDOCE has already
been converted into a database-type format, and this form is
available for general (including commercial) research, though at
a price - at the Alvey workshop on linguistic theory and computer
applications at UMIST last September, a figure of pounds 30,000 was
mentioned! As an alternative, I have a copy of the OALD tape,
and last year I got one of our undergraduates to attempt a
reformatting of this as a Third Year Project. Unfortunately, he
did not get as far as a form worthy of general distribution, but
after graduating he stayed on here over the summer to finish
parsing the original file; the end result is exemplified by the
sample at the end of this letter.
I am currently trying to get some funding from OUP to carry this
work further (in collaboration with Prof Sampson of the
linguistics dept. and Tony Cowie from our English dept.)
However, if you are committed to using the CED, I suggest you get
in touch with the Speech research group at IBM Scientific Centre
in Winchester; they have extracted a quarter-million wordlist
from CED I believe, with grammatical part-of-speech and phonetic
transcription codes (but with other fields ignored); the CED
phonetic transcriptions are, they say, better than those of OALD
or LDOCE, which is why they are 'out on a limb' in the sense that
most other researchers I know of are using OALD or LDOCE.
Eric Steven Atwell Artificial Intelligence Group
Department of Computer Studies
phone: +44 532 431751 ext 6307/6119 Leeds University
JANET: eric@uk.ac.leeds.ai Leeds LS2 9JT
UUCP: ...!seismo!mcvax!ai.leeds.ac.uk!eric England
EARN/BITNET/ARPA: eric%uk.ac.leeds.ai@rl.earn
EXAMPLE OF PARSED REFORMATTED OALD FILE:
headword :B
alternative spelling of headword :b
pronunciation :bi
+++++++start of pieces+++++++
conjugation or plural label :pl
conjugation or plural spelling :B's
conjugation or plural spelling :b's
pronunciation :biz
__________definition__________
text :the second letter of the English alphabet.
**********end of entry**********
headword :baa
pronunciation :bq
+++++++start of pieces+++++++
word class label :n
__________definition__________
text :cry of a sheep or lamb.
***change in part of speech***
word class label :vi
text :(baaing, baaed or baa'd /bqd/) make this
cry; bleat.
====subentry====
derivative :%@-lamb
word class label :n
---subentry definition---
text :child's word for a sheep or lamb.
**********end of entry**********
headword :baas
pronunciation :bqs
+++++++start of pieces+++++++
word class label :n
__________definition__________
text :(S Africa) boss.
**********end of entry**********
headword :babble
pronunciation :%babl
+++++++start of pieces+++++++
word class label :vi
word class label :vt
__________definition__________
verb pattern :2A
verb pattern :2B
verb pattern :2C
text :talk in a way that is difficult to
understand; make sounds like a b
__________definition__________
verb pattern :6A
verb pattern :15B
====subentry====
idiom :@ (out)
text :, repeat foolishly; tell (a secret):
@ (out) nonsense/secrets.
***change in part of speech***
word class label :n
nountype :U
text :childish or foolish talk; confused talk
not clearly to be understoo
__________definition__________
text :gentle sound of water flowing over
stones, etc.
====subentry====
derivative :bab.bler
pronunciation :%bablE(r)
word class label :n
---subentry definition---
text :person who @s, esp one who tells secrets.
**********end of entry**********
headword :babe
pronunciation :beIb
+++++++start of pieces+++++++
word class label :n
__________definition__________
text :(liter) baby.
__________definition__________
text :inexperienced and easily deceived person.
__________definition__________
text :(US sl) girl or young woman.
**********end of entry**********
headword :babel
pronunciation :%beIbl
+++++++start of pieces+++++++
word class label :n
__________definition__________
text :the Tower of B@, tower built to reach
heaven. (Gen 11).
__________definition__________
text :(sing with indef art) scene of noisy and
confused talking: What a @
**********end of entry**********
headword :ba.boo
alternative spelling of headword :babu
pronunciation :%bqbu
+++++++start of pieces+++++++
word class label :n
__________definition__________
text :(as Hindu title) Mr; Hindu gentleman;
Hindu clerk; (old use, pej) H
**********end of entry**********
headword :ba.boon
pronunciation :bE%bun
US pronunciation :ba-
+++++++start of pieces+++++++
word class label :n
__________definition__________
text :large monkey (of Africa and southern Asia)
with a dog-like face.
cross reference :the illus at ape
**********end of entry**********
headword :baby
pronunciation :%beIbI
+++++++start of pieces+++++++
word class label :n
conjugation or plural label :pl
conjugation or plural spelling :-bies
__________definition__________
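[Illustrative sketch, not part of the original letter.] The sample above
follows a simple line-oriented layout: each field is a label, then " :",
then a value; decorated lines such as "====subentry====" mark structure;
and lines without " :" continue the previous field. A minimal Python
reader for that layout might look like the following. The names
parse_entries, SAMPLE and the "marker" pseudo-label are assumptions made
for this example only, and the layout rules are inferred from the sample
rather than from any OALD documentation.

```python
from typing import List, Tuple

Field = Tuple[str, str]          # (label, value)

END_OF_ENTRY = "**********end of entry**********"
MARKER_CHARS = "+_*=-"           # characters used to decorate structural lines


def parse_entries(text: str) -> List[List[Field]]:
    """Split reformatted dictionary text into entries of (label, value) fields."""
    entries: List[List[Field]] = []
    current: List[Field] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == END_OF_ENTRY:
            if current:
                entries.append(current)
                current = []
            continue
        # Structural lines are wrapped in decoration characters on both sides.
        if line[0] in MARKER_CHARS and line[-1] in MARKER_CHARS:
            current.append(("marker", line.strip(MARKER_CHARS)))
            continue
        label, sep, value = line.partition(" :")
        if sep:
            current.append((label.strip(), value.strip()))
        elif current and current[-1][0] != "marker":
            # No separator: treat the line as a continuation of the last field.
            prev_label, prev_value = current[-1]
            current[-1] = (prev_label, prev_value + " " + line)
        else:
            current.append(("", line))   # unlabelled text; kept rather than dropped
    if current:                          # keep a final entry that was cut short
        entries.append(current)
    return entries


SAMPLE = """\
headword :baa
pronunciation :bq
+++++++start of pieces+++++++
word class label :n
__________definition__________
text :cry of a sheep or lamb.
***change in part of speech***
word class label :vi
text :(baaing, baaed or baa'd /bqd/) make this
cry; bleat.
**********end of entry**********
headword :baas
pronunciation :bqs
**********end of entry**********
"""

entries = parse_entries(SAMPLE)
for entry in entries:
    print(entry[0][1], len(entry))
```

Splitting on the first " :" (space, then colon) rather than on any colon
keeps values such as "confused talking: What a @" intact, since a colon
inside running text is not preceded by a space in the sample.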
6.1 Further remarks
It will be interesting to see what others are doing with CED and
other dictionary tapes, so please do circulate your findings.
You may like to join Euralex, the European association for
lexicography, and find other related work through their bulletin (I
assume you are not already a member as your name did not appear
on the recent membership list). For details contact RRK
Hartmann, Language Centre, Exeter University, Exeter EX4 4QH (no
JANET address that I know of!)
I would also like to hear how your 3rd year project student gets
on. TEFL students might prefer a ``browser aid" for LDOCE or
OALD, as these as specifically designed for 2nd language
learners; in my previous job at Lancaster University, I wrote a
browser aid for the LDOCE which ELT MA students could use. The
speaking CED sounds a great idea. A major problem with
`off-the-shelf' speech synthesisers is that they have no way of
producing varied ``listenable" intonation contours for sentences and
longer texts; but this problem is neatly sidestepped in a talking
dictionary, as most fields (keyword, part of speech, spelling) do
not require smooth continuous speech, and the definition fields
tend to be short sentences or sentence-fragments where a very
simple intonation contour would be quite acceptable to the user.
Even so, as you suggest, it is still quite ambitious for a third
year project!
6.2 An interesting job
From: E S Atwell [eric@uk.ac.leeds.ai] 22-JUL-1986 16:10
Subj: vacancy for NLP/AI/OR Software Engineer
I am collaborating with Professor Sampson on a Parsing research project, and
we have just had the go-ahead to advertise for a software engineer to work
with us on the project. I would be most grateful if you could bring the
following details to the attention of any potential candidates you know of.
********* UNIVERSITY OF LEEDS ****** ANNEALING PARSER PROJECT *********
Applications are invited for a post of SOFTWARE ENGINEER, to work on a project
developing a parser for unrestricted English using the connexionist technique
of simulated annealing. The project (funded by the Joint Speech Research
Unit) is supervised by Prof. Geoffrey Sampson of the Linguistics & Phonetics
Department (where the post will be tenable) and Eric Atwell of the Computer
Studies Department. The person appointed will be working on a SUN-3/52M
Workstation dedicated to his/her use. Candidates should have a good honours
degree; experience with natural language analysis, and of programming in a
Unix environment, will be advantages.
The post is available from 1 October 1986 for a fixed term of up to 3 years.
Starting salary will be within the range 8020 to 9495 pounds (under rev
Other-Related IA Grade, according to age, qualifications, and experience.
Informal enquiries may be made to Prof. Sampson on (0532) 431751 ext.6252;
or by electronic mail to Eric Atwell, eric@Leeds.AI via JANET or
eric%UK.ac.Leeds.AI@RL.EARN via EARN or BITNET. For application forms and
further particulars write to the Registrar, The University, Leeds LS2 9JT,
quoting reference no. 14/20.
****** The closing date for applications is 14 AUGUST 1986 ******
Leeds University is one of the largest and most influential universities in
the country. Leeds itself is the commercial, social and sporting centre for
much of North and West Yorkshire; it has all the facilities you would expect
of a major city, yet the outskirts of Leeds lead directly out onto 2,00 square
miles of outstandingly beautiful countryside. Leeds also offers some of the
cheapest housing in England; for example, pounds 15,000 buys a two-bedroomed
se or a larger terraced house.
Simulated Annealing, a technique originating in statistical mechanics, can be
used in operational research and artificial intelligence in optimisation
problems requiring an efficient search of a very large search space. We plan
to apply this technique to parsing unrestricted English, where the search
space is a set of trees. The appointee will find a stimulating research
environment at Leeds: the University is a thriving centre for research in
Computer Analysis of Language and Speech, Artificial Intelligence, Operational
Research and related areas. In addition to her/his dedicated workstation, the
appointee will have access to a wide range of equipment and software,
including specialist Departmental libraries, a VAX 11/750 dedicated to
Artificial Intelligence research, and a spacious SUN LOUNGE with a network of
Suns, fileserver, laserprinter, and large south-facing windows.
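[Illustrative sketch, not part of the original advertisement.] The
general simulated annealing loop mentioned above can be shown on a toy
problem. The Python below minimises a simple cost over the integers; it
is not the Leeds parser, and it does not search a space of parse trees.
The function name, the geometric cooling schedule, and all parameter
values are assumptions chosen for the example.

```python
import math
import random
from typing import Callable, Tuple


def simulated_annealing(
    cost: Callable[[int], float],
    neighbour: Callable[[int, random.Random], int],
    start: int,
    rng: random.Random,
    steps: int = 5000,
    t_start: float = 10.0,
    t_end: float = 0.01,
) -> Tuple[int, float]:
    """Return the lowest-cost state seen and its cost (steps must be >= 2)."""
    current = best = start
    current_cost = best_cost = cost(start)
    for step in range(steps):
        # Geometric cooling: the temperature falls from t_start to t_end.
        temperature = t_start * (t_end / t_start) ** (step / (steps - 1))
        candidate = neighbour(current, rng)
        candidate_cost = cost(candidate)
        delta = candidate_cost - current_cost
        # Always accept improvements; accept a worse state with
        # probability exp(-delta / temperature).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
    return best, best_cost


def toy_cost(x: int) -> float:
    return float((x - 3) ** 2)           # minimum at x = 3


def toy_neighbour(x: int, rng: random.Random) -> int:
    return x + rng.choice((-1, 1))       # step one unit left or right


best, best_cost = simulated_annealing(toy_cost, toy_neighbour, 50, random.Random(0))
print(best, best_cost)
```

Early on, the high temperature lets the search accept some worse states
and so escape local minima; as the temperature falls, it settles into
accepting only improvements. For a parser, the state would be a
candidate tree and the neighbour function a small change to that tree,
but how those are defined is outside what the advertisement states.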
7. Ron Hardie, Brighton Polytechnic
[summary of letter] has only just started thinking about CED;
interested to hear what others are doing.
Ron Hardie, Department of Modern Languages, Brighton Polytechnic,
Brighton BN1 9PH
8. Herbert Wenzel, Erlangen
[summary of letter] writing text retrieval system for PC, now
integrating dictionary, but has problems physically reading tape (sent
suggestion).
Professor H. Wenzel, Institut für Technische Chemie II,
Egerlandstr. 3, Erlangen, W. Germany.
9. Roger Mitton, University of London
[summary of letter] looked at Collins dictionary but finds the Oxford
Advanced Learner's more useful. Has produced a database from the
OALDCE as part of research into spelling checking, which is available
from the Oxford Text Archive.
Roger Mitton,
Dept Computer Science, Birkbeck College, Malet Street, London WC1E 7HX
------------------------------
END OF IRList Digest
********************