IRList Digest Volume 2 Number 35
IRList Digest Monday, 11 August 1986 Volume 2 : Issue 35
Today's Topics:
Report - File Format for Machine Readable Webster's 7th Coll. Dict.
----------------------------------------------------------------------
Date: Fri, 1 Aug 86 11:28:01 CDT
From: James Peterson <seismo!mcc.com!peterson>
Subject: W7 file format
Since you asked about my report, I am including below the introductory
part (not the appendix). It seems too long for IRList, but as editor,
that is your decision. You may want to cut it even more. jim
[Note: the report below is unedited - it seems about right for an
issue of IRList. Readers not interested in this topic need not read
any more than they wish. - Ed]
Webster's Seventh New Collegiate Dictionary
A Computer-Readable File Format
James L. Peterson
Microelectronics and Computer Technology Corporation (MCC)
Austin, Texas 78759
Section 1. Introduction
A transcript of Webster's Seventh New Collegiate Dictionary [1] is
available in a computer-readable form. This is not just a word list,
but a copy of the entire dictionary including definitions, cross
references, variants, synonyms, and so on. It consists of some
15,696,929 characters, with 68,764 main entries. It could be used for
all forms of text processing, including spelling, hyphenation, syntax,
semantics, and so on.
The original dictionary was keyboarded onto the Q-32 computer at
System Development Corporation (SDC) for a project headed by John
Olney [2]. The dictionary was then heavily edited and moved onto an
IBM 360. Magnetic tapes of this form were moved to IBM T. J. Watson
Research Center and further processed by C. Alberga. A copy of this
was acquired by Robert Amsler, who used the Pocket Dictionary for his
dissertation [3]. We have acquired a copy of the dictionary from
Amsler, and have modified it in many minor ways. This document
describes our version.
The dictionary is such a large collection of text that it is broken
up into 220 files for easier handling. These files reside under the
names d.101, d.102, d.103, ..., d.320. The file names were selected to
require three digits in all cases.
The first letter of each word in a file is the same; that is, when
we switch from words starting with 'D' to words starting with 'E',
we start a new file. Otherwise, files are broken to create roughly
equal-sized files (from 70,000 to 80,000 characters).
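As a minimal sketch of the naming scheme described above, the following Python fragment enumerates the file names. The function name dictionary_file_names is hypothetical and is not part of the distributed files.

```python
def dictionary_file_names(first=101, last=320):
    """Return the dictionary file names d.101 through d.320, in order."""
    return [f"d.{number}" for number in range(first, last + 1)]


names = dictionary_file_names()
# The count follows from the range: 320 - 101 + 1 files.
print(len(names), names[0], names[-1])
```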
Here is a sample of a dictionary file.
F;chase;1;;;vb;;
P;'ch{a-}s
E;ME [italic chasen], fr. MF [italic chasser], fr. (assumed) VL [italic#
captiare] -- more at [mini CATCH]
D;1;a;;vt;to follow rapidly : [mini PURSUE]
D;1;b;;vt;[mini HUNT]
D;1;c;;vt;to follow regularly or persistently with the intention of#
attracting or alluring
L;2;;;[italic obs]
D;2;;;vt;[mini HARASS]
D;3;;;vt;to seek out
D;4;a;;vt;to cause to depart or flee : [mini DRIVE]
L;4;b;;[italic slang]
D;4;b;;vt;to take (oneself) off
D;1;;;vi;to chase an animal, person, or thing
D;2;;;vi;[mini RUSH], [mini HASTEN]
S;0;[mini PURSUE], [mini FOLLOW], [mini TRAIL]:
S;1;[mini CHASE] implies going swiftly after and trying to overtake something#
fleeing or running;
S;2;[mini PURSUE] suggests a continuing effort to overtake, reach, attain;
S;3;[mini FOLLOW] puts less emphasis upon speed or intent to overtake and may#
not imply an awareness on the part of the leader that he is pursued;
S;4;[mini TRAIL] may stress a following of tracks or traces rather than a#
visible object
F;chase;2;;;n;;
D;1;a;;n;the act of chasing : [mini PURSUIT]
D;1;b;;n;[mini HUNTING] -- used with [italic the]
D;1;c;;n;an earnest or frenzied seeking after something desired
D;2;;;n;something pursued
D;3;a;;n;a franchise to hunt within certain limits of land
D;3;b;;n;a tract of unenclosed land used as a game preserve
Each line of the file has a character in column 1 which identifies
the type and format of the line. The following table shows the
number and meaning of each line type.
Frequency   Line type   Meaning
________________________________________________________
   68,764   F           First line, start for a new word
   30,673   E           Etymology
   66,987   P           Pronunciation
    9,959   V           Variant
  140,501   D           Definition, one per line
   19,123   R           Related word
    4,596   X           Cross-Reference
   11,992   L           Label
      835   S           Synonym block
Each line is composed of a number of fields. Fields are separated by a
semicolon and are defined by their position. The first field of each
line is the line type character (F, E, P, V, D, L, R, X, or S, as
given above). The remaining fields depend upon the type of the line.
For example, the second field of an F-line is the main entry word, the
fifth field has hyphenation information, and the sixth has part of
speech information.
The following table shows the contents of the fields for each line type.
F lines - Main entry
F1 - F
F2 - Main entry
F3 - Homograph Number
F4 - Prefix/Suffix/Infix
F5 - Hyphenation
F6 - Part of Speech
F7 - Part of Speech Joiner
F8 - Secondary Part of Speech
E lines - Etymology
E1 - E
E2 - text
P lines - Pronunciation
P1 - P
P2 - text
V lines - Variants
V1 - V
V2 - Variant Word
V3 - Hyphenation
V4 - Variant Level
D lines - Definitions
D1 - D
D2 - Sense number
D3 - Sense letter
D4 - Sense subnumber
D5 - Part of Speech
D6 - Text definition
R lines - Related Words
R1 - R
R2 - Related Word or Phrase
R3 - Hyphenation
R4 - Part of Speech
R5 - Part of Speech Joiner
R6 - Secondary Part of Speech
X lines - Cross Reference
X1 - X
X2 - Word
X3 - superscript, if any
X4 - subscript, if any
X5 - Type of cross reference
X6 - Secondary word
L lines - Labels
L1 - L
L2 - Sense Number
L3 - Sense Letter
L4 - Sense Subnumber
L5 - Label Text
S lines - Synonym Block
S1 - S
S2 - Synonym number
S3 - text
Appendix I discusses each line and field in more detail.
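The positional field layout above can be sketched in Python as follows. This is only an illustration: the dictionary FIELD_NAMES, the key names in it, and the function parse_line are hypothetical names chosen here, not part of the report. The sketch assumes that the last field of each line type is free text that may itself contain semicolons (as the S lines in the sample do), so it splits at most as many times as there are named fields. It expects a complete logical line and raises KeyError for an unknown line type.

```python
# Field names per line type, in positional order (names are illustrative).
FIELD_NAMES = {
    "F": ["type", "main_entry", "homograph_number", "affix", "hyphenation",
          "part_of_speech", "pos_joiner", "secondary_pos"],
    "E": ["type", "text"],
    "P": ["type", "text"],
    "V": ["type", "variant_word", "hyphenation", "variant_level"],
    "D": ["type", "sense_number", "sense_letter", "sense_subnumber",
          "part_of_speech", "text"],
    "R": ["type", "related_word", "hyphenation", "part_of_speech",
          "pos_joiner", "secondary_pos"],
    "X": ["type", "word", "superscript", "subscript", "xref_type",
          "secondary_word"],
    "L": ["type", "sense_number", "sense_letter", "sense_subnumber",
          "label_text"],
    "S": ["type", "synonym_number", "text"],
}


def parse_line(line):
    """Split one logical dictionary line into a dict of named fields."""
    names = FIELD_NAMES[line[0]]
    # Limit the split so semicolons inside the final field are preserved.
    values = line.split(";", len(names) - 1)
    # Pad short lines so every named field is present.
    values += [""] * (len(names) - len(values))
    return dict(zip(names, values))
```

For the sample line "F;chase;1;;;vb;;", parse_line places "chase" in the main entry field and "vb" in the part of speech field.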
Some lines, particularly definitions and synonym blocks, are quite
long and hence it is difficult to fit them on one 80-character line.
Therefore, lines are split whenever necessary so that no physical line
exceeds 80 characters. Lines are always split at a blank, and the
incomplete line is terminated with a sharp or hash mark character
('#'). When processing the dictionary, if a line is terminated with a
# character, then the # character should be replaced by a blank and
the next physical line should be read in and appended to the previous
line.
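The joining rule just described might be implemented along the following lines. The generator name join_continuations is hypothetical. The sketch assumes that a trailing # never occurs as ordinary text at the end of a complete line, and that continuation lines carry no leading blanks that need trimming.

```python
def join_continuations(physical_lines):
    """Yield logical lines, merging physical lines that end with '#'."""
    pending = ""
    for raw in physical_lines:
        line = raw.rstrip("\n")
        if line.endswith("#"):
            # Replace the '#' with a blank and wait for the next line.
            pending += line[:-1] + " "
        else:
            yield pending + line
            pending = ""
    if pending:
        # The input ended in the middle of a continued line.
        yield pending.rstrip()
```

Because it accepts any iterable of strings, the generator can be applied directly to an open file object.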
1.1 Character codes
A major problem with the dictionary is its character set. First, the
dictionary publisher did not feel constrained in his use of
characters, but chose whatever symbols best fit his purpose. Second,
the dictionary was originally encoded in an extended BCD (for the Q-32
computer), then translated into EBCDIC (for the IBM 360/370) and now
has been translated into ASCII. None of these character sets is
completely compatible with the others, nor is any of them sufficient
to represent all of the variation found in the original printed
dictionary. Hence an encoding scheme must be used to expand the set of
representable characters.
This expansion occurs in two independent directions: font
information and special characters.
Font information is represented by use of the square brackets in
ASCII to surround any special font material. Five font types are
recognized: (1) italic, (2) MINI-CAPS, (3) bold, (4) subscripts, and
(5) superscripts. Each is denoted by an identifying keyword
immediately after the opening left square bracket, followed by a
space, followed by the material to be in the defined font, followed by
the closing right square bracket. For example, an italic "was" is
represented as [italic was], while a mini-caps AMBIENT is [mini
AMBIENT] and a bold "syn" is [bold syn]. Superscripts and subscripts may
be italic, mini-caps, or bold, and a few superscripted superscripts
also occur, as in 6.24 {times} 10 [sup 10 [sup 10]].
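Because font spans can nest, a small recursive parser is one way to read them. The sketch below is an illustration under assumptions: the names parse_markup and plain_text are hypothetical, and since only the keywords italic, mini, bold, and sup appear in the text here, the parser accepts any lower-case keyword rather than a fixed list. It returns a nested list in which plain text is a string and a font span is a (keyword, children) pair.

```python
import re

# An opening bracket, a lower-case font keyword, and one space.
OPENING = re.compile(r"\[([a-z]+) ")


def parse_markup(text):
    """Parse '[font content]' markup into nested str / (font, children) nodes."""
    nodes, _ = _parse(text, 0, nested=False)
    return nodes


def _parse(text, pos, nested):
    nodes, start = [], pos
    while pos < len(text):
        ch = text[pos]
        if ch == "[":
            match = OPENING.match(text, pos)
            if match is None:
                raise ValueError(f"no font keyword at offset {pos}")
            if pos > start:
                nodes.append(text[start:pos])
            children, pos = _parse(text, match.end(), nested=True)
            nodes.append((match.group(1), children))
            start = pos
        elif ch == "]":
            if not nested:
                raise ValueError(f"unmatched ']' at offset {pos}")
            if pos > start:
                nodes.append(text[start:pos])
            return nodes, pos + 1
        else:
            pos += 1
    if nested:
        raise ValueError("unclosed '['")
    if pos > start:
        nodes.append(text[start:pos])
    return nodes, pos


def plain_text(nodes):
    """Flatten parsed markup back to text with the font information dropped."""
    return "".join(n if isinstance(n, str) else plain_text(n[1]) for n in nodes)
```

Raising ValueError on unbalanced brackets is a design choice made for this sketch; it doubles as a check for the mismatched brackets that occur in the files.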
The dictionary includes a large number of special symbols which are
not representable in ASCII. These include all the Greek alphabet, the
Hebrew alphabet and many miscellaneous other special symbols. All
special symbols which are not available in ASCII (and some that are)
have been given names, which are represented by the name enclosed in
braces, as {degrees} (for a degree symbol), {times} (for multiplication
represented by a small x), {tau} (for the lower-case Greek letter
tau), and so on. A complete list of these special symbols is in
Appendix II.
Each symbol name has been selected to exclude embedded blanks. Thus
all characters between an opening left brace and its closing right
brace are non-blank. Certain characters which are in ASCII (braces,
brackets, question mark, exclamation mark, and so on) have also been
represented in this way because it allowed them to be used for other
purposes (such as font and special character representation), and they
occurred only infrequently (less than 100 times).
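A program that displays the text might map the braced names to characters of a larger character set. The following sketch assumes Unicode output; the table SYMBOLS holds only the three names mentioned above, with code points chosen here as an assumption, and the function name expand_symbols is hypothetical. The full set of names is in Appendix II and is not reproduced. Names missing from the table are left unchanged.

```python
import re

# Illustrative subset only; the target code points are an assumption.
SYMBOLS = {
    "degrees": "\u00b0",  # degree sign
    "times": "\u00d7",    # multiplication sign
    "tau": "\u03c4",      # Greek small letter tau
}

# A symbol name contains no blanks and no braces.
SYMBOL_RE = re.compile(r"\{([^\s{}]+)\}")


def expand_symbols(text, table=SYMBOLS):
    """Replace each known {name} with its character; leave unknown names alone."""
    return SYMBOL_RE.sub(lambda m: table.get(m.group(1), m.group(0)), text)
```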
Section 2. Errors in W7
While processing W7 both to understand its contents and to put those
contents into a usable form, we encountered a large number of errors.
These errors were of several types:
* Merged illustrations. For example, under false, the illustration was
"< ~ documents ~ teeth >" and should have been "< ~ documents > < ~
teeth >". To correct this, we searched for any line of the form
"<...~...~...>".
* Words containing letters with accents (236 entries). The accent
field was wrong about half the time. The normal problem was that
the accent was on the wrong letter. In these cases, the hyphenation
information generally showed syllables that were two letters too
long.
* Incorrect values in fields. The lists in Appendix I were examined
for rare or inappropriate values; for example, a g in a numeric
field, or a zero in an alphabetic field.
* Mismatched parentheses or brackets. We wrote a program to simply
count parentheses, braces, and brackets. Many were found to be
mismatched.
* Duplicate words. We scanned the text for instances where the same
word occurs twice in a row. The assumption was that these would be
places where the last word on a line of the original input was typed
twice by mistake. Instead we found a large number of places where
"of or" was typed as "or or", "as a" was typed as "a a", and the
first word on an input line was typed twice.
* Incorrect article. We found all occurrences of "a" followed by a word
starting with a vowel, or "an" followed by a word starting with a consonant.
All these errors, once found and verified by visual inspection of
the printed dictionary, were corrected by hand, using a text editor.
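Several of the mechanical checks listed above can be approximated with simple pattern matching. The sketch below is not the program used for this work; the patterns and the name check_line are assumptions made for illustration. It only flags candidates. Legitimate text such as "a union" or "an hour" will also be flagged, which is why visual verification remains necessary. It should be applied to complete logical lines, after any continuation lines have been joined.

```python
import re

# An illustration with two '~' marks inside one pair of angle brackets.
MERGED_ILLUSTRATION = re.compile(r"<[^<>]*~[^<>]*~[^<>]*>")
# The same word twice in a row.
DUPLICATE_WORD = re.compile(r"\b([A-Za-z]+) +\1\b")
# 'a' before a vowel, or 'an' before a consonant.
A_BEFORE_VOWEL = re.compile(r"\ba +[aeiou]", re.IGNORECASE)
AN_BEFORE_CONSONANT = re.compile(r"\ban +[b-df-hj-np-tv-z]", re.IGNORECASE)


def check_line(line):
    """Return a list of suspected problems found in one logical line."""
    problems = []
    if MERGED_ILLUSTRATION.search(line):
        problems.append("merged illustration")
    # Simply count opening and closing delimiters of each kind.
    for opener, closer in ("()", "[]", "{}"):
        if line.count(opener) != line.count(closer):
            problems.append(f"unbalanced {opener}{closer}")
    if DUPLICATE_WORD.search(line):
        problems.append("duplicate word")
    if A_BEFORE_VOWEL.search(line) or AN_BEFORE_CONSONANT.search(line):
        problems.append("suspect article")
    return problems
```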
A last form of error analysis was an attempt to find typographical
errors. The approach was simple: we extracted a list of all
unique words used in the dictionary definitions. This produced a
list of 54,298 words. We compared this list with the list of all
words defined in the dictionary (main entries, variants, and
related words). This reduced our list to 20,292 words that were
used in definitions, but not defined. Many of these were derived
forms of defined words: past tense, plurals, and so on. Doing
some simple suffix analysis, we were left with about 8,000 words.
Most of these were apparently Greek or Latin botanical or
zoological names. Deleting those ending in -ia or -ae and all
words in italics in the dictionary left a list of 2,821 words.
These were checked by hand to produce a list of 903 incorrectly
spelled words. We also found 54 words which were used, but not
defined:
Australasian bloomery broadheaded clothesline crossbeam
darkskinned dinnerware doorbell entranceway equivalve etc
fieldworks flattish foreseen foretold fourpence gunstock hairdress
hindlimbs homeward hyperactivity leftward lightcolored longlegged
lowcut messroom Mr. nailhead neckband noctuid nubby parimutuel
partaken pregenital pyrotechny rangeland rosebush sawteeth
seneschal sheepdogs shorthaired sightsee snowstorm songbook
spinymargined spondumene sulphates supersensitized TV twelvefold
understock upcurved valency workpiece.
We also found a smaller list of words with typographical errors in
the main entry in the computer files.
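The first two filtering steps of this analysis (removing defined words, then removing simple derived forms) can be sketched as a set difference followed by suffix stripping. The suffix list and the names undefined_words and SUFFIXES are assumptions for illustration; the suffix analysis actually used is not specified here. The sketch also assumes the definition text has already had its font markup removed.

```python
import re

WORD_RE = re.compile(r"[A-Za-z]+")
# A deliberately small, assumed suffix list.
SUFFIXES = ("ing", "ed", "es", "s")


def _has_defined_stem(word, defined):
    """True if stripping a suffix (optionally restoring 'e') gives a defined word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            if stem in defined or stem + "e" in defined:
                return True
    return False


def undefined_words(definitions, defined_words):
    """Return words used in definitions that are neither defined nor simple derivations."""
    defined = {w.lower() for w in defined_words}
    used = {w.lower() for text in definitions for w in WORD_RE.findall(text)}
    candidates = used - defined
    return sorted(w for w in candidates if not _has_defined_stem(w, defined))
```

The words that survive both filters are the candidates for checking by hand.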
Of the 903 typographical errors, 543 were the result of a missing
blank between two words. Of the remaining 360, 34% were a missing
letter, 27% were a wrong letter, 20% were an extra letter, and 13%
were the result of transposed letters. The remaining errors were
caused by two extra or two missing letters, or by transposing two
letters around a third. The middle letter in this case was always
a vowel. (For example, 'min' would be typed 'nim'.)
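The error categories above can be told apart mechanically once a typed form has been paired with its correction. The following classifier is a sketch; the function names and category labels are chosen here for illustration, and the pairs in the usage note are used only as sample inputs. It handles single errors only, so two extra or two missing letters fall under "other".

```python
def _deleted_char(longer, shorter):
    """Return the character whose removal turns `longer` into `shorter`, else None."""
    if len(longer) != len(shorter) + 1:
        return None
    for i, ch in enumerate(longer):
        if longer[:i] + longer[i + 1:] == shorter:
            return ch
    return None


def classify_typo(typed, correct):
    """Classify the single-error difference between a typed form and its correction."""
    missing = _deleted_char(correct, typed)
    if missing is not None:
        return "missing blank" if missing == " " else "missing letter"
    if _deleted_char(typed, correct) is not None:
        return "extra letter"
    if len(typed) == len(correct):
        diffs = [i for i in range(len(typed)) if typed[i] != correct[i]]
        if len(diffs) == 1:
            return "wrong letter"
        if len(diffs) == 2:
            i, j = diffs
            swapped = typed[i] == correct[j] and typed[j] == correct[i]
            if swapped and j == i + 1:
                return "transposed letters"
            if swapped and j == i + 2:
                return "transposed around a third letter"
    return "other"
```

For instance, classify_typo("nim", "min") falls in the last transposition category, and classify_typo("clubmosses", "club mosses") is reported as a missing blank.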
We also found 10 cases of typographical errors in the original
printed dictionary. These are
[barranca] gulley => gully
[bitch] doublecross => double-cross
[capsicum] genu => genus
[drift] quantitive => quantitative
[fornication] NCE => New Catholic Edition
[kid] goodhumored => good-humored
[lapse] apostasize => apostatize
[lycopodium] clubmosses => club mosses
[type species] permanenty => permanently
[vanity] knicknack => knickknack
[select] mismatched parenthesis
[terpineol] mismatched parenthesis
[tapa] an Hawaiian (but see 'a Hawaiian' for [luau] and [poi])
[pay] REQUITE in syn not defined
[rude] ROUGH in syn not defined
[Lag b'Omer] 33d => 33rd
[one] 3d-person => 3rd-person
It was interesting to follow these errors through the various
printings and editions of the Merriam-Webster dictionaries. Four
errors were corrected in the 1970 printing of W7, one in the 1971
printing of W7, and one in the 1975 printing of the New Collegiate
Dictionary. Three errors remain in the 1983 printing of the Ninth
New Collegiate ([bitch], [drift], [vanity]).
------------------------------
END OF IRList Digest
********************