Public-Access Computer Systems Review Volume 05 Number 03

  
+ Page 1 + 
  
----------------------------------------------------------------- 
            The Public-Access Computer Systems Review 
  
Volume 5, Number 3 (1994)                          ISSN 1048-6542 
----------------------------------------------------------------- 
  
To retrieve an article file as an e-mail message, send the GET 
command given after the article information to 
listserv@uhupvm1.uh.edu.  (Files are also available from the 
University of Houston Libraries' Gopher server: info.lib.uh.edu, 
port 70.) 
  
                            CONTENTS 
  
COMMUNICATIONS 
  
Using the World-Wide Web to Deliver Complex Electronic Documents: 
Implications for Libraries 
  
     By John Price-Wilkin (pp. 5-21) 
  
     To retrieve this file:   GET PRICEWIL PRV5N3 F=MAIL 
  
The World-Wide Web (also called the Web) is a very promising tool 
for libraries to use to explore the delivery of rich and complex 
documents.  Nevertheless, there are many limitations in the Web's 
HTML markup language and the ability of Web servers to deliver 
structured information.  This paper explores the benefits and 
limitations of the Web in the context of several projects taking 
place at the University of Virginia, both in the Library and in 
the University's Institute for Advanced Technology in the 
Humanities.  A gateway between the Web and the SGML-based PAT 
system that helps to overcome the Web's inherent limitations is 
also described. 
  
+ Page 2 + 
  
----------------------------------------------------------------- 
            The Public-Access Computer Systems Review 
----------------------------------------------------------------- 
  
Editor-in-Chief 
  
Charles W. Bailey, Jr. 
University Libraries 
University of Houston 
Houston, TX 77204-2091 
(713) 743-9804 
Internet: lib3@uhupvm1.uh.edu 
  
Associate Editors 
  
Columns: Leslie Pearse, OCLC 
Communications: Dana Rooks, University of Houston 
  
Editorial Board 
  
Ralph Alberico, University of Texas, Austin 
George H. Brett II, Clearinghouse for Networked Information 
     Discovery and Retrieval 
Priscilla Caplan, University of Chicago 
Steve Cisler, Apple Computer, Inc. 
Walt Crawford, Research Libraries Group 
Lorcan Dempsey, University of Bath 
Pat Ensor, University of Houston 
Nancy Evans, Pennsylvania State University, Ogontz 
Charles Hildreth, READ, Ltd. 
Ronald Larsen, University of Maryland 
Clifford Lynch, Division of Library Automation, 
     University of California 
David R. McDonald, Tufts University 
R. Bruce Miller, University of California, San Diego 
Paul Evan Peters, Coalition for Networked Information 
Mike Ridley, University of Waterloo 
Peggy Seiden, Skidmore College 
Peter Stone, University of Sussex 
John E. Ulmschneider, North Carolina State University 
  
+ Page 3 + 
  
Technical Support 
  
Tahereh Jafari, University of Houston 
  
Publication Information 
  
Published on an irregular basis by the University Libraries, 
University of Houston.  Technical support is provided by the 
Information Technology Division, University of Houston. 
Circulation: 8,202 subscribers in 65 countries (PACS-L) and 2,562 
subscribers in 52 countries (PACS-P). 
  
Back issues are available from listserv@uhupvm1.uh.edu.  To
retrieve a cumulative index to the journal, send the following
e-mail message to the list server: GET INDEX PR F=MAIL.
  
Back issues are also available from the University of Houston 
Libraries' Gopher server.  Point your Gopher client at 
info.lib.uh.edu, port 70, and follow this menu path: 
  
     Looking for Articles 
          Electronic Journals 
               University of Houston Libraries E-Journals 
                    The Public-Access Computer Systems Review 
  
The journal's URL is
gopher://info.lib.uh.edu:70/11/articles/e-journals/uhlibrary/pacsreview.
  
The first three volumes of The Public-Access Computer Systems 
Review are also available in book form from the American Library 
Association's Library and Information Technology Association 
(LITA).  The price of each volume is $17 for LITA members and $20 
for non-LITA members.  All three volumes can be ordered as a set 
for $45 (indicate that you want the PACS Review set, order number 
7712-X).  To order, contact: ALA Publishing Services, Order 
Department, 50 East Huron Street, Chicago, IL 60611-2729, (800) 
545-2433. 
  
+ Page 4 + 
  
----------------------------------------------------------------- 
The Public-Access Computer Systems Review is an electronic 
journal that is distributed on the Internet and on other computer 
networks.  There is no subscription fee. 
     To subscribe, send an e-mail message to 
listserv@uhupvm1.uh.edu that says: SUBSCRIBE PACS-P First Name 
Last Name. 
     The Public-Access Computer Systems Review is Copyright (C) 
1994 by the University Libraries, University of Houston.  All 
Rights Reserved. 
     Copying is permitted for noncommercial use by academic 
computer centers, computer conferences, individual scholars, and 
libraries.  Libraries are authorized to add the journal to their 
collection, in electronic or printed form, at no charge.  This 
message must appear on all copied material.  All commercial use 
requires permission. 
----------------------------------------------------------------- 
  
+ Page 5 + 
  
----------------------------------------------------------------- 
Price-Wilkin, John.  "Using the World-Wide Web to Deliver Complex 
Electronic Documents: Implications for Libraries."  The Public- 
Access Computer Systems Review 5, no. 3 (1994): 5-21.  To 
retrieve this file, send the following e-mail message to 
listserv@uhupvm1.uh.edu: GET PRICEWIL PRV5N3 F=MAIL.  (The file 
is also available from the University of Houston Libraries' 
Gopher server: info.lib.uh.edu, port 70.) 
----------------------------------------------------------------- 
  
1.0  Introduction 
  
The World-Wide Web (also called the Web) is a very promising tool 
for libraries to use to explore the delivery of rich and complex 
documents. [1]  Nevertheless, there are many limitations in the 
Web's HTML markup language and the ability of Web servers to 
deliver structured information.  This paper explores the benefits 
and limitations of the Web in the context of several projects 
taking place at the University of Virginia, both in the Library 
and in the University's Institute for Advanced Technology in the 
Humanities.  A gateway between the Web and the SGML-based PAT 
system that helps to overcome the Web's inherent limitations is 
also described. 
  
2.0  SGML and TEI 
  
The most worthwhile products that libraries can buy are ones that 
conform to standards and are not tied to a specific software 
package or operating system.  These are the only products with 
enduring value.  Certainly, there are exciting electronic 
resources being produced for specific software packages and 
operating systems, but the extent to which libraries can build 
collections of hypertext resources that are usable in the future 
will depend entirely on the conformance of their resources to 
true national and international standards.  The most important 
standard for this discussion is SGML, a standard designed to 
express the organization of documents and to accommodate even the 
most complex multimedia materials. 
  
+ Page 6 + 
  
     A brief (and admittedly superficial) discussion of SGML and 
the Text Encoding Initiative may be helpful.  SGML (Standard 
Generalized Markup Language, ISO 8879) is a standard approved by 
the ISO for the descriptive markup of documents.  The language of 
SGML is sufficiently flexible that the sense of "document" has 
been expanded to include coordinated time-based elements of 
hypermedia (e.g., animated dance, music, and character-based 
score and choreography moving in synchrony at a pace controllable 
by the user).  SGML is not a tag set: there are no pre-set tags.
Instead, SGML is a set of rules (or a grammar) for articulating
a tag vocabulary.  These rules are sufficiently rigorous that
specialized software can check the validity or conformance of a 
document.  The specification of that grammar is a DTD (Document 
Type Definition); the DTD can also function to document many 
decisions about the organization of a text.  Without that 
validity--i.e., without being parsed against a DTD--the document 
is not SGML encoded, although it may share many of the 
characteristics of SGML. 
     For our work at Virginia, the most notable of these 
characteristics has been the descriptive nature of the tagging. 
Rather than saying that an element of the text appears in bold, 
17 point Helvetica, centered at the top of a new page, we use the 
tags to define the function of a textual element (e.g., a title). 
The tag set used must necessarily elaborate the elements of the 
texts we see in an academic environment: a tag set designed for 
articles or documentation, for example, will omit important 
elements needed for encoding poetry.  To serve those needs, the 
Text Encoding Initiative (or TEI) has published a set of 
guidelines for the application of SGML to texts in the 
humanities.  Functions of the text or hypertext, expressed 
descriptively and with a standard language, are freed from the 
constraints of a specific software package or application.  SGML- 
encoded works can serve a variety of functions, depending on the 
user's needs and available software. 
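
     The separation of function from appearance can be
illustrated with a short sketch.  It is not drawn from the
projects described in this paper: the tag names (poem, title,
stanza, line) are hypothetical rather than TEI element names,
and Python's standard XML parser stands in for an SGML parser.
The same descriptively tagged text is put to three uses without
being changed.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup. The tag names are illustrative
# and are not taken from the TEI guidelines.
SOURCE = (
    "<poem><title>The Tyger</title>"
    "<stanza>"
    "<line>Tyger Tyger, burning bright,</line>"
    "<line>In the forests of the night;</line>"
    "</stanza></poem>"
)


def render_plain(root):
    """Render for a character-based display: title in capitals."""
    out = [root.findtext("title").upper(), ""]
    for stanza in root.iter("stanza"):
        out.extend(line.text for line in stanza.iter("line"))
        out.append("")
    return "\n".join(out).rstrip()


def render_html(root):
    """Render for a Web client: the title becomes a heading."""
    parts = ["<H1>%s</H1>" % root.findtext("title")]
    for stanza in root.iter("stanza"):
        lines = "<BR>".join(line.text for line in stanza.iter("line"))
        parts.append("<P>" + lines)
    return "\n".join(parts)


def first_lines(root):
    """An analytical use: collect the first line of every stanza."""
    return [stanza.find("line").text for stanza in root.iter("stanza")]


root = ET.fromstring(SOURCE)
print(render_plain(root))
print(render_html(root))
print(first_lines(root))
```

     Because the markup records what each element is, a new use
requires only a new function, not a re-encoding of the text.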
  
+ Page 7 + 
  
3.0  The Potential of the Web 
  
The Web uses a client/server architecture.  Sophisticated Web 
clients, such as Mosaic, offer an exciting sense of the 
possibilities of electronic publishing on the network.  Several
long-awaited, revolutionary concepts are emerging in every aspect
of the relationship between client, server, and publication.
These concepts are:
  
     o    Open systems--the ability to make resources available 
          to a variety of operating systems and a variety of 
          applications is evident throughout the Web.  Computers 
          running X Windows, Microsoft Windows, and the Macintosh 
          System 7 all participate equally.  In addition to 
          Mosaic, other clients, such as Cello and OmniWeb, are 
          available.  Multimedia tools, such as image viewers, 
          are a matter of personal choice. 
  
     o    Standards--given the Web's use of HTML, the importance 
          of standards is heightened, and HTML is inexorably 
          moving toward greater expressiveness and greater 
          conformance to the SGML standard. 
  
     o    Distributed information--the notion of a universe of 
          distributed information, scattered throughout the 
          Internet while being conceptually linked to other 
          information, is becoming a reality through the use 
          of the Web. 
  
4.0  Representative Web Projects 
  
Over the past two years at the University of Virginia, faculty 
and staff involved in several projects began to develop a variety 
of electronic materials using the SGML standard.  Partly this was 
to serve already apparent needs, but it was also to take 
advantage of the potentials of electronic publishing.  While the 
Library's Electronic Text Center and, later, its Digital Image 
Center began to develop skills in creating electronic materials 
in standard formats for networked access, scholars at the 
Institute for Advanced Technology in the Humanities undertook the 
daunting task of composing advanced, standards-based electronic 
research materials without having the tools with which to publish 
these materials.  With the introduction of Mosaic, the Web was 
quickly seen as a way to deliver these materials, and, with 
relative ease, large bodies of SGML-encoded material were 
converted to HTML for Web access.  In order to focus on 
particular aspects of those projects, the following example 
projects are divided into sections on editions, history, image 
archives, and instruction. 
  
+ Page 8 + 
  
4.1  Editions 
  
In general, the Web offers creators of editions of literary or 
other works the ability to represent a vast, interconnected web 
of scholarly resources in a variety of different ways.  The user 
might view the resources simply, as in an edition of a work 
without the introduction of a critical apparatus.  A more complex 
approach is also possible, with the user following the critical 
apparatus at every turn.  And finally, a rich and scholarly 
approach is possible, allowing the user to view manuscript (or 
printing) evidence or to examine the editor's assessment of the 
evidence by comparing high-quality scans of original pages to the 
marked-up transcriptions.  With proper markup, an edition can be 
viewed in as many ways as the reader desires.  It can be a 
variorum, a study edition, a critical edition, or historical 
evidence.  The form the edition takes is defined by the user's 
needs or preferences. 
  
4.1.1  British Poetry 
  
The British Poetry Archive documents are perhaps the simplest of 
those discussed here.  (The project's URL is
http://www.lib.virginia.edu/etext/britpo/britpo.html.)
     The two texts now available were transcribed by students in 
Jerome McGann's graduate courses.  In addition to the SGML- 
encoded text itself, each work includes material such as 
introductions, notes, and glosses as well as high-quality digital 
facsimiles of pages from the original editions.  The materials 
are freely available on the Internet, and Mr. McGann hopes that 
others will contribute to the archive.  These texts represent the 
simplest of the hypertext editions available on the University of 
Virginia's Web, with supporting materials providing potential 
deviations from an otherwise linear progression.  The texts were 
encoded in TEI-conformant SGML with the assistance of the 
Library's Electronic Text Center, and they were then converted to 
HTML for the purpose of making them available on the Web. 
  
+ Page 9 + 
  
4.1.2  Dante Gabriel Rossetti 
  
To date, the most fully developed project is Jerome McGann's 
ongoing edition--or archive--of the works of Dante Gabriel 
Rossetti.  (The project's URL is
http://jefferson.village.virginia.edu/rossetti/rossetti.html.)
According to McGann, the Rossetti archive is: 
  
     a hypermedia environment for studying the works of the 
     Pre-Raphaelite poet and painter D. G. Rossetti (1828-1882). 
     The archive is a structured database holding digitized 
     images of Rossetti's works in their original documentary 
     forms.  Rossetti's poetical manuscripts, early printed texts 
     --including proofs and first editions--as well as his 
     drawings and paintings are stored in the archive, in full 
     color as needed.  The materials are marked up for electronic 
     search and analysis, and they are supplied with full 
     scholarly annotations and notes. [2] 
  
The organization of the archive is designed to capitalize on the 
uniquely intertwined nature of Rossetti's artistic process, 
linking image to text and text to image.  When Rossetti 
accompanied a painting by sonnets, the poems are included in the 
archive along with an image of the painting.  When Rossetti 
illustrated a poem with a painting, an image of the painting is 
included.  Since Rossetti frequently designed his own editions, 
electronic versions of his print works, with linked text and 
images, are also available.  McGann describes the difficulty of 
studying Rossetti's works in a traditional print environment, and 
then sets about trying to overcome those difficulties by melding 
the resources in a way that allows the reader to follow the 
threads of art, poetry, or translations without losing access to 
the other materials. 
  
4.1.3  Piers Plowman 
  
The third project was begun in the 1994-95 academic year by one 
of the most recent Institute fellows, Hoyt Duggan.  (The 
project's URL is
http://jefferson.village.virginia.edu/piers/archive.goals.html.)
  
+ Page 10 + 
  
     Mr. Duggan, an accomplished editor of Middle English texts, 
created an edition of the Piers Plowman B text using the Web. 
More in the model of the traditional scholarly edition, Mr. 
Duggan's project brings together transcription and facsimile to 
resolve vexing editorial problems. When the scribe uses an 
abbreviation to represent a letter combination (e.g., a barred 
"p" for "pre"), the reader typically wants the editor's best 
judgement in rendering what was intended (i.e., "pre").  Many of 
those decisions deal with unambiguous evidence, and some with 
less certain evidence.  Through SGML, both the suspension or
abbreviation and the editor's reading of the character are
registered.
     To the greatest extent possible, digital facsimiles of all 
seventeen surviving manuscripts will be included.  With facsimile 
evidence, it is always possible to return to something resembling 
the original document to evaluate the editor's decision.  Duggan 
has also found that it is possible to create extremely 
high-resolution images that, with enlargement and other digital 
treatments, can reveal important new information about the 
original composition. 
  
4.2  History 
  
With new technological tools, historians are offered both 
challenges and opportunities.  Electronic resources allow them to 
blend evidence and interpretation in ways that help both student 
and researcher.  A simple approach in using the materials is 
possible, where the reader follows the argument without examining 
evidence.  It is also possible for the reader to examine the 
methodology of the researcher, either to scrutinize the research 
or to be instructed in the methodology of research.  The process 
of bringing evidence and interpretation together brings 
challenges of immense proportions.  For example, the role 
geography plays in defining an event can be brought to bear on 
the problem, but it may involve the use of sophisticated systems 
of geographic analysis.  Two projects at the Institute have used 
many diverse resources to explore their topics, incorporating
nineteenth-century Census data, geographic models, and animated
sequences.
  
4.2.1  Ayers (Valley of the Shadow) 
  
Edward Ayers, a historian of the Civil War and the 
Reconstruction, was one of the Institute's first two fellows. 
(The project's URL is
http://jefferson.village.virginia.edu/vshadow/vshadow.html.)
According to Ayers, the project:
  
+ Page 11 + 
  
     interweaves the histories of places on both sides of the 
     Mason-Dixon line.  It is the story of two communities 
     relatively close to one another, sharing considerable prewar 
     characteristics and similar experiences in the war itself. 
     There was one area in the United States for which that was 
     most clearly the case: the Great Valley that stretched from 
     Pennsylvania, through Maryland and Virginia, into Tennessee. 
     [3] 
  
Ayers focuses on two towns--Staunton, Virginia and Chambersburg, 
Pennsylvania--as representative communities from that Valley that 
served as such an important economic, cultural, and military 
locus of the War.  The Web serves the historical ends by 
balancing narrative--a filtering or interpretation of evidence-- 
with the presentation of that evidence.  Ayers has described one 
dilemma of the historian as a tight-rope act between providing 
access to evidence and creating an organizing argument that does 
not also obscure that evidence.  His approach, providing the 
deepening layers of evidence as "rhizomes" beneath the surface of 
narrative, has been well-supported by the Web. 
  
4.2.2  Dobbins (The Forum at Pompeii) 
  
Dobbins, a classical archaeologist, reconstructs Pompeii from 
archaeological evidence in a virtual space to advance his 
argument.  (The project's URL is
http://jefferson.village.virginia.edu/pompeii/page-1.html.)
     He uses computer-aided design (CAD) tools to bring precision 
to his reconstruction.  Animation is being added to the CAD 
representations to provide a three-dimensional perspective of 
buildings and space.  Structures that are normally seen in 
isolation from each other are assembled in a total vision of 
Pompeii that may suggest a degree of planning and coordination. 
  
4.3  Image Archives 
  
The Digital Image Center's image collections can be seen as 
passive collections of standards-based images.  (The project's 
URL is http://www.lib.virginia.edu/dic/class/arh102.) 
     The image collections are organized to reflect the focus of 
an individual class or an art exhibit.  All of the images are 
TIFF files subjected to JPEG compression.  As such, they can be 
examined with a variety of image tools, ranging from simple 
viewers to software with analytical capabilities.  Most 
importantly, the tool used is largely the choice of the user.  As 
a result of planning and philosophy, all images are durable 
enough to stand close scrutiny: they were scanned in 24-bit color 
at a sufficiently high resolution to be enlarged several times 
without significant degradation. 
  
+ Page 12 + 
  
     The most developed collection is representative of this 
archival philosophy.  William Westphal's graduate architectural 
history course on urban form includes hundreds of architectural 
images, primarily from the Italian Renaissance, organized around 
his lectures.  Students can access these resources at all times 
over the network as well as in a closed classroom environment 
designed to efficiently access the images.  Since they were 
scanned at high resolutions, the images compare favorably with 
the original slides, and they can be examined closely on screen. 
The original slides have frequently degraded or had imperfections 
that were corrected in the scanning process. 
  
4.4  Instruction 
  
The final project demonstrates the instructional capabilities of 
the Web.  (The project's URL is
http://www.lib.virginia.edu/etext/scanner.html.)
     Using the Web to provide access to training materials has 
many strengths. It gives variation to what would otherwise be a 
flat, linear document.  The document is dynamic and can easily 
accommodate other elements as they are created by staff. 
     Scanning text is one of the most repetitive training 
operations provided in the Electronic Text Center.  Unlike 
searching electronic texts, where every research need may entail 
a different approach and different training needs, many of the 
scanning decisions are generalizable and can be represented in a 
training document.  The project's instructional Web pages on 
scanning were designed to reduce the amount of staff intervention 
and give a greater degree of freedom to users. 
  
4.5  Evaluation of the Projects 
  
While the majority of the projects discussed here could be 
supported by numerous stand-alone, operating-system specific 
hypertext products, the Web has several advantages. 
     The projects' electronic resources are widely available on 
the Internet, and users can access them on a variety of computer 
platforms, regardless of the fact that the Web server is running 
on a UNIX computer.  (Attractive graphical Web clients, such as 
Mosaic and OmniWeb, are available for Macintoshes, IBM-compatible 
computers using Microsoft Windows, UNIX computers with X Windows, 
and NeXTs.) 
  
+ Page 13 + 
  
     Another key advantage is that the source material for the 
editions either conforms to or is in the process of being 
composed using international standards; it is marked up to 
suggest the functional characteristics of the collections, rather 
than their representational characteristics.  Elements, such as 
titles, quotations, and headings, are marked to suggest their 
functional role in the document, rather than any presumed display 
value.  Displays depend instead on the capabilities of the user's 
software, which utilizes the functional characteristics of the 
elements to determine how to present the information. 
     This reliance on functional--not representational-- 
characteristics means that the same materials can be used in a 
variety of different ways, supporting the creation of editions 
with other software packages (e.g., Electronic Book Technology's 
DynaText), use with different analytical tools (e.g., 
morphological parsers), and access through different database 
schemes (e.g., text-specific systems or relational database 
managers designed for images).  A high degree of flexibility, 
viability, and multi-platform access can be maintained. 
     Each of the mentioned editions and historical analyses was 
first composed in a very rich SGML format that was designed to 
discriminate between the functional characteristics of low-level 
elements.  They were subsequently converted (as automatically as 
possible) to static HTML versions for use with the Web. 
Elements, such as discrete descriptive bibliographic 
characteristics, become simple list items, and most complex prose 
and verse elements are reduced to paragraphs and line breaks. 
After this conversion, it was discouraging to see that richness 
disappear, but the original document remained unchanged. 
     There is a continued expectation by the scholars who created 
these resources that better tools will be developed to tap the 
inherent complexity of these materials.  The standards-based 
format of the materials ensures that these scholars will be able 
to take advantage of these new tools when they become available. 
  
5.0  The Web as an Authoring and Document Delivery Environment 
  
The authoring and document delivery capabilities of the Web are 
significantly limited for documents of even moderate complexity. 
Authoring for the Web is usually done in HTML.  HTML has many 
virtues, not least of which is its striving for expressiveness 
and SGML validity.  It is, however, an impoverished tag set with 
little ability to reflect the complexities of most of the 
documents discussed earlier, despite their being offered through 
the Web.  It is important to note that the Web is a limited 
document delivery environment.  Its inability to recognize or use 
structural features of documents forces unpleasant administrative 
decisions that will likely restrict the later use of these 
documents. 
  
+ Page 14 + 
  
5.1  HTML's Lack of Expressiveness 
  
The range of HTML tags available to users is limited.  In 
contrast to the hundreds of tags made available by the TEI 
guidelines, roughly two dozen tags are made available in HTML. 
While HTML will be expanded with HTML+ to give greater precision 
in areas such as tabular data, HTML+ cannot be expected to 
provide the breadth needed to support literary and historical 
documents, or even to support standard journal literature. 
     This lack of expressiveness and insufficient breadth of tags 
also leads to the author's inability to differentiate important 
elements with HTML.  In HTML, the same small set of tags is 
necessarily used for diverse sets of elements.  For example, the 
<BR> code (line break) is used for verse lines, table elements, 
stanza divisions, dramatis personae, and many other features.
Authors are also left with little ability to represent the
structural organization of a document.  Where the author wishes
to define a
bounded segment of text, such as a stanza or chapter, no tag is 
available for this purpose.  Instead, authors rely extensively on 
dividing documents into files representing major structural 
divisions.  Elements that are normally defined as structural tags 
in SGML, such as the paragraph (or <P>) tag, are not defined by 
HTML in a way that reliably defines the contents of a paragraph. 
This paucity of tags in HTML results in the author of any 
document of moderate complexity using many tags to effect a 
desired appearance, rather than to characterize the content. 
This type of tagging confuses function and appearance. 
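     A minimal sketch of this loss of information follows.  The
source element names and their mapping to HTML tags are
assumptions made for illustration, not a conversion table used
at Virginia, and Python's standard XML parser stands in for an
SGML parser.  Two functionally different structures, a stanza of
verse and a list of speakers, yield the same sequence of HTML
tags once converted.

```python
import xml.etree.ElementTree as ET

# Hypothetical source elements and the HTML tag an author might
# fall back on for each. Several distinct functions share one tag.
TO_HTML = {
    "line": "BR",
    "cell": "BR",
    "speaker": "BR",
    "stanza": "P",
    "para": "P",
}


def flatten(source):
    """Return the HTML tags that replace the source elements."""
    root = ET.fromstring(source)
    return [TO_HTML[el.tag] for el in root.iter() if el.tag in TO_HTML]


verse = "<doc><stanza><line>a</line><line>b</line></stanza></doc>"
cast = "<doc><para><speaker>a</speaker><speaker>b</speaker></para></doc>"

print(flatten(verse))
print(flatten(cast))
# The two results are identical: nothing in the HTML records
# whether a break once marked a verse line or a speaker.
print(flatten(verse) == flatten(cast))
```

     The conversion runs in one direction only.  Given the HTML
alone, software cannot tell which of the two source structures
it came from.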
     The inability of HTML to represent complexity is often 
closely linked to the inability of Web servers to provide access 
to complex representations of documents.  This inability is 
fundamentally linked to the notion of structure.  Where 
structural distinctions exist in the markup language, there is no 
inherent ability in the Web to deliver that individual element. 
So, for example, HTML defines glossaries and glossary entries, 
but, in order to provide access to an individual glossary entry 
from a hypertext link, the server must send the entire file 
(i.e., the file containing the glossary) to the user.  Smaller 
glossaries cause few problems, but this makes providing access to 
individual "glossary" entries in a document such as the Oxford 
English Dictionary, where all 500 MB would be transferred across 
the network, effectively impossible.  While Web browsers are 
intelligent enough to move automatically within the file to the 
chosen glossary entry, the file transfer paradigm is impractical 
for large-scale information delivery.  Given this, it must also 
be pointed out that there are very few HTML tags that define 
structural relationships.  Structures such as chapters, sections, 
or poems are not represented. 
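
     The cost of the file transfer paradigm can be sketched in a
few lines.  The glossary, its size, and the structure-aware
fetch_element function are hypothetical and chosen only for
illustration; the sketch simply compares how much text is sent
when the whole file is delivered with how much is sent when the
requested entry alone is delivered.

```python
def build_glossary(n_entries):
    """Build a hypothetical HTML glossary, one entry per line."""
    entries = [
        '<DT><A NAME="e%d">term %d</A><DD>definition %d' % (i, i, i)
        for i in range(n_entries)
    ]
    return "<DL>\n" + "\n".join(entries) + "\n</DL>"


def fetch_whole_file(doc, anchor):
    """File transfer paradigm: the server sends the entire file and
    the client then moves to the anchor within it."""
    position = doc.find('NAME="%s"' % anchor)
    return len(doc), position


def fetch_element(doc, anchor):
    """A hypothetical structure-aware server: only the requested
    entry is sent."""
    marker = 'NAME="%s"' % anchor
    for entry in doc.splitlines():
        if marker in entry:
            return len(entry), entry
    raise KeyError(anchor)


glossary = build_glossary(1000)
sent_whole, offset = fetch_whole_file(glossary, "e500")
sent_entry, entry = fetch_element(glossary, "e500")
print("whole file: %d characters sent" % sent_whole)
print("one entry:  %d characters sent" % sent_entry)
```

     The gap between the two figures grows with the size of the
file, which is why the approach fails for a work on the scale
of the OED.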
  
+ Page 15 + 
  
     The Web's deficiency with regard to structural features 
leads to decisions with serious negative administrative 
consequences.  Because the Web does not include structure 
awareness in its protocol and because HTML markup provides so 
little support for structural representation of features, the 
author and the administrator are forced to fragment documents
into sets of reasonably sized components.  In converting the
ARL book University Libraries and Scholarly Communication (URL: 
http://www.lib.virginia.edu/mellon/mellon.html) to HTML, I found 
that, using the Web and HTML alone, it was necessary to divide 
the dozen chapters into separate files.  While this may not sound 
onerous, extending this practice to a large collection of 
documents--or even a small collection of large documents--would 
be very difficult.  An HTML version of the OED would become a set 
of 300,000 files.  Chadwyck-Healey's English Poetry Database 
would become either 2,500 files (if the administrator wished to 
provide access at the volume level) or 65,000 files (if access to 
individual poems were supported).  Even this severe approach does 
not solve needs that might arise for substructures, such as 
quotations and definitions within the OED or specific stanzas 
within a poem. 
  
5.2  Overall Limitations of HTML 
  
For documents of limited complexity, HTML is an effective 
authoring environment; however, it seriously limits the ways in 
which a more complex document or a set of documents can be used. 
No differentiation of important elements (e.g., stanzas and 
subdivisions of prose) can take place, and it will be necessary 
to upgrade the coding of HTML documents within the year. 
     The Web also lacks inherent document management or document 
access capabilities.  In part because of the limitations of the 
markup language and in part because of the design of the 
protocol, there is a paucity of structure represented and no 
structure recognized.  I emphasize "inherent," however, because 
the Web also provides a gateway capability that can more than 
compensate for this deficiency. 
  
6.0  Exploring Alternatives 
  
I have been developing a gateway from the Web to an indexed 
collection of texts in an SGML-aware system to take advantage of 
the complexity of the documents and yet make them available 
through the Web.  The texts are nearly all in fully validated 
SGML tag sets, each with significant expressiveness.  In contrast 
to an HTML collection, potentially consisting of many files 
representing the many component parts of the collection, each 
text is a single file with as many as hundreds of thousands of 
structural components. 
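
     The general idea of such a gateway can be sketched as
follows.  This is not the interface of PAT or of the gateway
described in this paper: the sample text, the tag names, and the
functions build_index, region_to_html, and gateway are all
assumptions made for illustration.  The text stays in a single
file, an index records where each structural component begins
and ends, and only the requested component is converted to HTML
at the moment it is requested.

```python
import re

# A hypothetical single-file collection with structural markup.
TEXT = (
    '<text><poem id="p1"><stanza><l>one</l><l>two</l></stanza></poem>'
    '<poem id="p2"><stanza><l>three</l></stanza></poem></text>'
)


def build_index(text):
    """Map each poem id to the (start, end) offsets of its region."""
    index = {}
    for m in re.finditer(r'<poem id="([^"]+)">.*?</poem>', text, re.S):
        index[m.group(1)] = (m.start(), m.end())
    return index


def region_to_html(region):
    """Convert one region to HTML on request. The source text is
    left unchanged, so its richer markup is not lost."""
    html = re.sub(r"<poem[^>]*>|</poem>|</stanza>", "", region)
    html = html.replace("<stanza>", "<P>")
    html = html.replace("<l>", "").replace("</l>", "<BR>")
    return html


def gateway(text, index, poem_id):
    """Return only the requested poem, converted to HTML."""
    start, end = index[poem_id]
    return region_to_html(text[start:end])


INDEX = build_index(TEXT)
print(gateway(TEXT, INDEX, "p2"))
```

     Because the index points into one file, the collection need
not be split into thousands of files to make individual poems
addressable.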
  
+ Page 16 + 
  
6.1  Collections 
  
Three diverse examples are provided to help understand the nature 
of the collections used in the gateway. 
  
6.1.1  University of Virginia Middle English Collection 
  
The Middle English collection assembled by the University of 
Virginia's Electronic Text Center consists of approximately 
thirty texts in a single file.  (The collection's URL is 
http://etext.virginia.edu/Mideng.query.html.) 
     Texts vary in size from several dozen pages to several 
hundred pages.  One of the Library's smaller collections, it is 
approximately 11 MB of raw text, but it grows as new materials 
become available.  The markup language used is SGML complying 
with the Oxford Text Archive's DTD, a tag set that will 
eventually represent a valid subset of the TEI DTD.  The tags 
differentiate major structural elements, such as tales in the 
Canterbury Tales, bibliographic elements, and elements of 
composition (e.g., verse lines, stanzas, and paragraphs).  Markup 
is rich enough to support a wide range of analytical 
requirements, and the texts have been made available for the 
purpose of analysis to the University of Virginia community for 
much of the past two years.  With the permission of Open Text, 
the Oxford Text Archive, and creators of individual texts, access 
to this collection is unrestricted.  It can be accessed in a 
variety of ways, including the Web. 
  
6.1.2  Chadwyck-Healey English Poetry Database 
  
The Chadwyck-Healey English Poetry Database is purchased on tape 
from the publisher and made available indexed by PAT.  Access to 
this collection is restricted to a consortium of five 
universities in Virginia.  As yet incomplete, the collection 
currently consists of nearly 1,600 works with more than 64,000 
poems and 233,000 pages.  The raw text is relatively large (340 
MB), but, indexed with PAT, searches usually yield results in 
less than one second.  The SGML used with the English Poetry 
Database is a very rich set of tags designed in consultation with 
a TEI representative.  It is more than adequately expressive 
about the poems, including structural markup for poems, poem 
divisions such as stanzas, lineation, and attributes such as 
whether rhyme is used. 
  
+ Page 17 + 
  
6.1.3  Oxford English Dictionary 
  
The Oxford English Dictionary is the largest and arguably the 
most complex resource made available through this service.  The 
570 MB document contains approximately 300,000 entries, many with 
more than fifty subelements.  Strictly speaking, it is not in 
SGML form because it has not been validated against a DTD.  The 
electronic version was, however, designed to take advantage of 
SGML's characteristics, and it significantly benefits from the 
file's structural and descriptive markup. 
  
6.2  Web to PAT Gateway 
  
I have constructed a gateway between the Web and the more 
sophisticated SGML texts using the Web's CGI (Common Gateway 
Interface) and PAT, an SGML-aware text retrieval program.  PAT 
returns text in the richer SGML, and the gateway converts it on 
the fly to HTML, which serves primarily to control the 
appearance of the text on the screen.  This gateway is being 
documented elsewhere (URL: 
http://sansfoy.lib.virginia.edu/pub/www-to-pat/), but several 
facets are relevant to this 
discussion. 
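     The request flow of such a gateway can be illustrated with 
a minimal Python sketch.  It is not the gateway described here: 
the field names "collection" and "phrase," the search_backend 
stub, and the sample data are all assumptions made for the 
illustration, and the tag-stripping step merely stands in for a 
real SGML-to-HTML filter.  The only parts taken from the CGI 
convention itself are that form input arrives as a query string 
and that the program answers with a header, a blank line, and a 
body. 

```python
import html
import re
from urllib.parse import parse_qs


def search_backend(collection: str, phrase: str) -> list[str]:
    """Hypothetical stand-in for the retrieval engine; returns SGML hits."""
    # Invented sample data; a real gateway would query the indexed texts.
    corpus = {"demo": ["<L>a line about a gateway</L>", "<L>another line</L>"]}
    needle = phrase.lower()
    return [hit for hit in corpus.get(collection, []) if needle in hit.lower()]


def strip_tags(sgml: str) -> str:
    """Placeholder for a real filter: drop every tag, keep the text."""
    return re.sub(r"<[^>]+>", "", sgml)


def handle_request(query_string: str) -> str:
    """Build a CGI-style response: header, blank line, HTML body."""
    fields = parse_qs(query_string)
    collection = fields.get("collection", ["demo"])[0]
    phrase = fields.get("phrase", [""])[0]
    hits = search_backend(collection, phrase) if phrase else []
    items = "".join(f"<li>{html.escape(strip_tags(hit))}</li>" for hit in hits)
    body = f"<h1>Results for {html.escape(phrase)}</h1><ul>{items}</ul>"
    return "Content-Type: text/html\r\n\r\n" + body


print(handle_request("collection=demo&phrase=gateway"))
```

In an actual CGI program the query string would typically be 
read from the QUERY_STRING environment variable rather than 
passed in as an argument. 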
  
6.2.1  Expressive Representation of Text is Retained 
  
The original unmodified texts are accessed through the gateway 
without compromising the expressiveness of the original markup. 
Although the sophisticated SGML markup is dynamically rendered as 
HTML as the user retrieves results, the text remains in the 
original rich SGML form behind the Web representation.  Decisions 
about the way that the fuller tag set maps to HTML are registered 
in filters, and, as HTML becomes more expressive, a better match 
between the original tags and the HTML can be made. 
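     The idea of registering mapping decisions in a filter can 
be sketched as a lookup table.  This is a minimal illustration 
under stated assumptions, not the filters used by the gateway: 
the element names (TITLE, STANZA, L) and the HTML chosen for 
each are hypothetical, and the sketch ignores entity references 
and nested markup subtleties. 

```python
import re

# Hypothetical mapping from source element names to HTML start/end
# strings. These names are illustrative assumptions, not the tag set
# of any collection described in the article.
TAG_FILTER = {
    "TITLE": ("<h2>", "</h2>"),
    "STANZA": ("<p>", "</p>"),
    "L": ("", "<br>"),
}

# Matches a start or end tag and captures the slash and element name.
TAG_PATTERN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")


def sgml_to_html(fragment: str) -> str:
    """Rewrite mapped tags as HTML; drop tags that have no mapping."""

    def replace(match: re.Match) -> str:
        closing, name = match.group(1), match.group(2).upper()
        start, end = TAG_FILTER.get(name, ("", ""))
        return end if closing else start

    return TAG_PATTERN.sub(replace, fragment)


sample = "<STANZA N=1><L>first line</L><L>second line</L></STANZA>"
print(sgml_to_html(sample))
```

Because the mapping lives in one table, a more expressive HTML 
would only require changing entries in TAG_FILTER; the source 
text itself is never altered. 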
  
6.2.2  Simple Queries and Simple Access 
  
Users need not be familiar with PAT's query language to search 
texts and take advantage of the structural characteristics of the 
more expressive markup.  A word or phrase search returns 
keywords-in-context (KWIC) views to the user, from which a view 
of larger context is possible.  Eventually, this process may lead 
the user to retrieval of entire sections (e.g., chapters or 
acts).  All expanded views are made from hypertext links that 
initiate structural retrievals such as "the chapter that includes 
this search result." 
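     A keywords-in-context view can be illustrated with a short 
sketch.  It assumes plain text held in memory and a fixed number 
of characters of context on each side; the function name kwic 
and the width parameter are inventions for this example and say 
nothing about how PAT produces its results. 

```python
import re


def kwic(text: str, phrase: str, width: int = 30) -> list[tuple[int, str]]:
    """Return (offset, context line) pairs for each match of phrase."""
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    results = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-justify the left context so the keywords line up.
        line = f"{left:>{width}}[{match.group(0)}]{right}"
        results.append((match.start(), line))
    return results


sample = "the gateway returns text and the gateway filters tags"
for offset, line in kwic(sample, "gateway", width=10):
    print(offset, line)
```

Keeping the offset with each line is what would let a later step 
expand a hit into a larger structural unit. 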
  
+ Page 18 + 
  
6.2.3  Menu-Driven Structural Queries 
  
It is possible to facilitate complex queries through menus.  For 
example, in the OED, the word lookup function facilitated by the 
Web includes queries such as: "give me entries that include my 
word within the Lookup field of the Headword Group field," or 
"give me entries that include my word in the Variant Form field." 
The user is not aware of the complexity of the query taking 
place, but can modify the type of query by selecting different 
variations on the search menus.  Boolean queries that ask for the 
intersection of document structures have been challenging to 
users employing command-line and analytically oriented 
interfaces.  However, through simple fill-out forms and menu 
selections, queries such as "(stanzas including [word/phrase]) 
INTERSECT (stanzas including [word/phrase])" are executed without 
the user needing to understand the system's command syntax. 
While we also offer access through several complex, analytical 
interfaces (PatMotif and PowerSearch from Open Text as well as a 
locally developed VT 100 interface), most users can avoid these 
more complicated interfaces. 
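     The intersection of document structures behind such a form 
can be sketched with ordinary set operations.  This is only an 
illustration: the stanza element name, the helper functions, and 
the sample text are assumptions, substring matching stands in 
for real word indexing, and nothing here reflects PAT's command 
syntax. 

```python
import re

# Assumes non-nested <stanza>...</stanza> regions in the text.
STANZA = re.compile(r"<stanza\b[^>]*>(.*?)</stanza>", re.DOTALL | re.IGNORECASE)


def stanzas_including(text: str, phrase: str) -> set[int]:
    """Return the ordinal numbers of stanzas that contain phrase."""
    needle = phrase.lower()
    return {
        number
        for number, match in enumerate(STANZA.finditer(text))
        if needle in match.group(1).lower()
    }


def intersect_query(text: str, first: str, second: str) -> list[int]:
    """(stanzas including first) INTERSECT (stanzas including second)."""
    return sorted(stanzas_including(text, first) & stanzas_including(text, second))


poem = (
    "<stanza>rose and thorn</stanza>"
    "<stanza>rose alone</stanza>"
    "<stanza>thorn alone</stanza>"
)
print(intersect_query(poem, "rose", "thorn"))
```

A fill-out form would supply only the two phrases; the choice of 
structure and operator stays hidden in the program. 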
  
6.2.4  Access to Structure 
  
Finally, the administrator of a collection need not resort to 
fragmenting files to make it possible to provide access to the 
component parts of a collection.  As mentioned earlier, an HTML 
approach to the OED would require us to divide it into 300,000 
files.  Using this strategy, I was recently able to represent 
the dozens of parts, chapters, sections, and subsections of a 
voluminous SGML technical document, building hypertext links 
and making each component accessible by drawing on the fairly 
rich markup, while the document remained a single file.  Resource 
management is made more reasonable through a system cognizant of 
a file's structure. 
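     One way to picture access to components of a single file 
is an index of region offsets.  The sketch below is a 
simplified illustration, not a description of how PAT indexes 
text: it assumes non-nested elements, and the chapter element, 
function names, and sample document are hypothetical. 

```python
import bisect
import re


def build_region_index(text: str, tag: str) -> list[tuple[int, int]]:
    """List (start, end) offsets of each non-nested <tag>...</tag> region."""
    name = re.escape(tag)
    pattern = re.compile(
        rf"<{name}\b[^>]*>.*?</{name}>", re.DOTALL | re.IGNORECASE
    )
    return [(m.start(), m.end()) for m in pattern.finditer(text)]


def region_containing(regions: list[tuple[int, int]], offset: int):
    """Return the region that encloses offset, or None if there is none."""
    starts = [start for start, _ in regions]
    i = bisect.bisect_right(starts, offset) - 1
    if i >= 0 and regions[i][0] <= offset < regions[i][1]:
        return regions[i]
    return None


document = (
    "<doc><chapter n=1>Alpha text</chapter>"
    "<chapter n=2>Beta gateway text</chapter></doc>"
)
chapters = build_region_index(document, "chapter")
hit = document.find("gateway")
start, end = region_containing(chapters, hit)
print(document[start:end])
```

The file is never split: a component is just a pair of offsets, 
so "the chapter that includes this search result" becomes a 
lookup followed by a slice. 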
  
+ Page 19 + 
  
6.2.5  Future Approaches 
  
This strategy has many possibilities.  Journal literature coded 
in SGML could be accessed in the same way.  For example, a 
journal run marked up according to the 
more elaborate Association of American Publishers DTD could 
return articles to the user through PAT queries.  Another 
approach would facilitate browsing by recognizing the structural 
relationship of author and abstract to article, article to issue, 
issue to volume, and volume to collection.  Throughout, the 
collection would exist as a single file, searchable across all 
articles by a single query.  The collection would not need to be 
compromised by converting the articles to HTML, but would instead 
continue to remain in the more expressive AAP SGML format, 
filtered for display in the process of retrieving information. 
Through this strategy, the Web can be an effective means of 
accessing the original files in a fuller SGML, without resorting 
to fragmenting the material into files corresponding to the 
individual articles or even parts of articles.  Similar 
strategies for books and documentation are possible. 
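     The structural relationships named above can be sketched 
with a small sample.  Several assumptions apply: the element 
names are invented and are not those of the AAP DTD, the sample 
data is made up, and the sample is written as well-formed XML 
only so that Python's standard parser can read it (SGML proper 
would need an SGML-aware parser). 

```python
import xml.etree.ElementTree as ET

# Invented sample: one file holding a whole (tiny) journal run.
SAMPLE = """<collection>
<volume n="1">
<issue n="1">
<article><author>First Author</author><title>Sample Title One</title>
<abstract>An abstract that mentions markup.</abstract></article>
</issue>
<issue n="2">
<article><author>Second Author</author><title>Sample Title Two</title>
<abstract>An abstract that mentions indexing.</abstract></article>
</issue>
</volume>
</collection>"""


def find_articles(root: ET.Element, word: str) -> list[tuple]:
    """Return (volume, issue, author, title) for articles mentioning word."""
    needle = word.lower()
    found = []
    for volume in root.iter("volume"):
        for issue in volume.iter("issue"):
            for article in issue.iter("article"):
                text = " ".join(article.itertext()).lower()
                if needle in text:
                    found.append((
                        volume.get("n"),
                        issue.get("n"),
                        article.findtext("author"),
                        article.findtext("title"),
                    ))
    return found


root = ET.fromstring(SAMPLE)
print(find_articles(root, "indexing"))
```

A single query runs across every article, and each hit carries 
its place in the volume and issue hierarchy, which is the 
information a browsing display would need. 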
  
7.0  What Does the Web Offer Libraries? 
  
The Web is a complex system with great potential and serious 
limitations.  We should use caution as we consider composing in 
HTML: it is a short-term coding strategy.  Documents composed in 
HTML will have limited expressiveness, and, because HTML is not 
yet stable, they are likely to need continuing enhancement to be 
used in the Web.  There is much to be excited about with the Web: 
it is a viable system that suggests what electronic publishing on 
the Internet can be.  We have lacked credible, demonstrable 
examples of standards-based, networked hypertext in the past, and 
the Web has changed that.  There is a great deal of untapped 
potential in the Web.  By exploiting the Web's ability to talk to 
other more sophisticated programs, we can begin to take advantage 
of that potential and make tomorrow's promise real today. 
  
+ Page 20 + 
  
     A subtext of this article has been the importance of 
standards--both employing them in creating hypertexts and 
extending the Web to take greater advantage of them.  Standards 
have been attractive to libraries because they help ensure 
long-term viability.  However, as Jefferson remarked in 1790, 
standards are also an important key to information being 
generally useful, regardless of context: 
  
     Measures, weights and coins, thus referred to standards 
     unchangeable in their nature . . . will themselves be 
     unchangeable.  These standards, too, are such as to be 
     accessible to all persons, in all times and places.  The 
     measures and weights derived from them . . . are within the 
     calculation of every one who possesses the first elements of 
     arithmetic, and of easy comparison, both for foreigners and 
     citizens, with the measures, weights, and coins of other 
     countries. [4] 
  
Notes 
  
1. A version of this article was presented as a paper at the Yale 
Hypertext Conference, May 1994.  An HTML version of the original 
speech, with active links to the resources discussed, is 
available via the World-Wide Web; URL: 
http://sansfoy.lib.virginia.edu/pub/yale.html. 
  
2. Jerome McGann, The Complete Writings and Pictures of Dante 
Gabriel Rossetti: A Hypermedia Research Archive (Charlottesville, 
VA: Institute for Advanced Technology in the Humanities, 
University of Virginia, 1994).  (Electronic document available 
via the World-Wide Web; URL: 
http://jefferson.village.virginia.edu/rossetti/rossetti.html.) 
  
3. Edward Ayers, The Valley of the Shadow: Living the Civil War 
in Pennsylvania and Virginia (Charlottesville, VA: Institute for 
Advanced Technology in the Humanities, University of Virginia, 
1994).  (Electronic document available via the World-Wide Web; 
URL: http://jefferson.village.virginia.edu/vshadow/vshadow.html.) 
  
4. Thomas Jefferson, "Public Papers," in Writings (New York: 
Literary Classics of the U.S., 1984), 410. 
  
  
About the Author 
  
John Price-Wilkin, Systems Librarian for Information Services, 
Alderman Library, University of Virginia, Charlottesville, VA 
22903. Internet: jpw@virginia.edu. 
  
+ Page 21 + 
  
----------------------------------------------------------------- 
The Public-Access Computer Systems Review is an electronic 
journal that is distributed on the Internet and on other computer 
networks.  There is no subscription fee. 
     To subscribe, send an e-mail message to 
listserv@uhupvm1.uh.edu that says: SUBSCRIBE PACS-P First Name 
Last Name. 
     This article is Copyright (C) 1994 by John Price-Wilkin. 
All Rights Reserved. 
     The Public-Access Computer Systems Review is Copyright (C) 
1994 by the University Libraries, University of Houston.  All 
Rights Reserved. 
     Copying is permitted for noncommercial use by academic 
computer centers, computer conferences, individual scholars, and 
libraries.  Libraries are authorized to add the journal to their 
collection, in electronic or printed form, at no charge.  This 
message must appear on all copied material.  All commercial use 
requires permission. 
-----------------------------------------------------------------