+ Page 1 +
-----------------------------------------------------------------
The Public-Access Computer Systems Review
Volume 3, Number 1 (1992) ISSN 1048-6542
-----------------------------------------------------------------
To retrieve an article file as an e-mail message, send the GET
command given after the article information to LISTSERV@UHUPVM1
(BITNET) or LISTSERV@UHUPVM1.UH.EDU (Internet). To retrieve the
article as a file, omit "F=MAIL" from the end of the GET command.
CONTENTS
REFEREED ARTICLES
Analysis of Search Failures in Document Retrieval Systems: A
Review
By Yasar Tonta (pp. 4-53)
To retrieve this file: GET TONTA PRV3N1 F=MAIL
1.0 Introduction
2.0 Overview of a Document Retrieval System
3.0 Search Failure Analysis
3.1 Measures of Retrieval Effectiveness
3.2 Methods of Analyzing Search Failures
3.2.1 Analysis of Search Failures Utilizing Retrieval
Effectiveness Measures
3.2.2 Analysis of Search Failures Utilizing User
Satisfaction Measures
3.2.3 Analysis of Search Failures Utilizing Transaction
Logs
3.2.4 Analysis of Search Failures Utilizing the Critical
Incident Technique
3.3 Summary
4.0 Review of Studies Analyzing Search Failures
4.1 Studies Utilizing Precision and Recall Measures
4.1.1 The Cranfield Studies
4.1.2 Lancaster's MEDLARS Studies
4.1.3 Blair and Maron's Full-Text Retrieval System Study
4.1.4 Markey and Demeyer's Dewey Decimal Classification
Online Project
4.2 Studies Utilizing User Satisfaction Measures
4.3 Studies Utilizing Transaction Logs
4.4 Studies Utilizing the Critical Incident Technique
4.5 Other Search Failure Studies
4.6 Related Studies
5.0 Conclusion
+ Page 2 +
COLUMNS
Public-Access Provocations: An Informal Column
Those Who Don't, Won't
By Walt Crawford (pp. 54-56)
To retrieve this file: GET CRAWFORD PRV3N1 F=MAIL
REVIEWS
Zen and the Art of the Internet: A Beginner's Guide to the
Internet
Reviewed by Billy Barron (pp. 57-59)
To retrieve this file: GET BARRON PRV3N1 F=MAIL
-----------------------------------------------------------------
The Public-Access Computer Systems Review
Editor-in-Chief
Charles W. Bailey, Jr.
University Libraries
University of Houston
Houston, TX 77204-2091
(713) 743-9804
LIB3@UHUPVM1 (BITNET) or LIB3@UHUPVM1.UH.EDU (Internet)
Associate Editors:
Columns: Leslie Pearse, OCLC
Communications: Dana Rooks, University of Houston
Reviews: Roy Tennant, University of California, Berkeley
Editorial Board
Ralph Alberico, University of Texas, Austin
George H. Brett II, University of North Carolina
General Administration
Steve Cisler, Apple
Walt Crawford, Research Libraries Group
Lorcan Dempsey, University of Bath
Nancy Evans, Pennsylvania State University, Ogontz
+ Page 3 +
Charles Hildreth, READ Ltd.
Ronald Larsen, University of Maryland
Clifford Lynch, Division of Library Automation,
University of California
David R. McDonald, Tufts University
R. Bruce Miller, University of California, San Diego
Paul Evan Peters, Coalition for Networked Information
Mike Ridley, University of Waterloo
Peggy Seiden, Pennsylvania State University, New Kensington
Peter Stone, University of Sussex
John E. Ulmschneider, North Carolina State University
Publication Information
Published on an irregular basis by the University Libraries,
University of Houston. Technical support is provided by the
Information Technology Division, University of Houston.
Circulation: 3,906 subscribers in 35 countries (PACS-L) and 230
subscribers in 14 countries (PACS-P).
Back issues are available from LISTSERV@UHUPVM1 (BITNET) or
LISTSERV@UHUPVM1.UH.EDU (Internet). To obtain a list of all
available files, send the following e-mail message to the
LISTSERV: INDEX PACS-L. The name of each issue's table of
contents file begins with the word "CONTENTS."
-----------------------------------------------------------------
-----------------------------------------------------------------
The Public-Access Computer Systems Review is a refereed
electronic journal that is distributed on BITNET, Internet, and
other computer networks. There is no subscription fee.
To subscribe, send an e-mail message to LISTSERV@UHUPVM1 (BITNET)
or LISTSERV@UHUPVM1.UH.EDU (Internet) that says:
SUBSCRIBE PACS-P First Name Last Name. PACS-P subscribers also
receive two electronic newsletters: Current Cites and Public-
Access Computer Systems News.
The Public-Access Computer Systems Review is Copyright (C) 1992
by the University Libraries, University of Houston. All Rights
Reserved.
Copying is permitted for noncommercial use by computer
conferences, individual scholars, and libraries. Libraries are
authorized to add the journal to their collection, in electronic
or printed form, at no charge. This message must appear on all
copied material. All commercial use requires permission.
-----------------------------------------------------------------
+ Page 57 +
-----------------------------------------------------------------
The Public-Access Computer Systems Review 3, no. 1 (1992):
57-59.
-----------------------------------------------------------------
-----------------------------------------------------------------
Kehoe, Brendan P. Zen and the Art of the Internet: A Beginner's
Guide to the Internet. Chester, PA: 1992. [Computer file]
Reviewed by Billy Barron.
-----------------------------------------------------------------
Zen and the Art of the Internet is a new guide to the Internet
that was written by Brendan Kehoe of Widener University. His
goal was to introduce the reader to the resources that are
available on the Internet. At the same time, Kehoe tried to
avoid system-specific information. It should be noted that parts
of Zen and the Art of the Internet were derived from other works.
Zen and the Art of the Internet starts off with a chapter on
network basics. This chapter is a good introduction to the
Internet, but it is not a general guide to networking. Rather,
it is Internet and TCP/IP specific. If this chapter can be
faulted for anything, it is that it oversimplifies some of the
material. On the other hand, it definitely should not scare off
the novice user.
The e-mail and FTP chapters are very good, although they do get
technical at times. The e-mail chapter could be improved by the
addition of a section on etiquette similar to the excellent one
in the FTP chapter.
The Telnet chapter is packed with examples of Telnet-accessible
services, and it explains how to find out about more services. I
was rather disappointed by the omission of any information on
tn3270. A description of how Telnet is different on IBM
mainframes is also needed. These omissions may lead to some
confusion on the part of IBM mainframe users.
Kehoe describes other tools that are available on the Internet.
These descriptions are well-rounded and useful, but Kehoe covers
only the most common tools.
One of the most outstanding sections of Zen and the Art of the
Internet is called "Things You'll Hear About." In a lot of ways,
this chapter is a FAQ (Frequently Asked Questions) document for
the Internet, and it will answer many questions of the new network
user. At the same time, it introduces the novice user to the
folklore of the Internet without being intimidating.
+ Page 58 +
Zen and the Art of the Internet also has useful sections that
contain information about commercial services, other networks,
how to retrieve files, and how to find out more about the
Internet. The USENET chapter does a great job of covering the
most common misconceptions people have about that network.
The document includes a helpful glossary.
The conclusion states "this guide is far from complete--the
Internet changes on a daily (if not hourly) basis." Then Kehoe
goes on to ask for suggestions. For Zen and the Art of the
Internet to be useful in the long run, it will need to be updated
on a fairly regular basis. From what I can tell, it sounds like
Kehoe is planning on doing this. I'm sending in my suggestions,
and I highly recommend you do the same.
Overall, I was very impressed with this document. In fact, the
same day that I downloaded it I had our receptionist make copies
and distribute them to the whole Academic Computing Support
Staff. In a couple of days, I am going to do the same with our
library. My girlfriend's university just got on the Internet and
I'm giving her two sources of information to start with: the
first is HYTELNET and the second is going to be Zen and the Art
of the Internet. It has a few rough spots, but I'm sure that
Kehoe will fix them. The biggest problem is that it paints too
rosy a picture of the Internet, but this kind of document is
intended to get users interested in the network, not to critique
it.
I try to stay ahead of most Internet users in terms of my
knowledge of what's available and how to access it. Well, I
learned a couple of things while reading Zen and the Art of the
Internet, so it is not just for novices. At the same time, it is
easily understandable by novices. My message to Brendan Kehoe
is: Keep up the good work!
Access Instructions
The file is available on host FTP.CS.WIDENER.EDU (147.31.254.132)
in the directory pub/zen and on FTP.UU.NET (137.39.1.9) in the
directory /inet/doc. Although the author reports that he has
signed an agreement with a major publishing house, he has
indicated that the network versions will continue to be
available.
+ Page 59 +
About the Author
Billy Barron
VAX/UNIX Systems Manager
University of North Texas
BILLY@UNT.EDU
-----------------------------------------------------------------
The Public-Access Computer Systems Review is a refereed
electronic journal that is distributed on BITNET, Internet, and
other computer networks. There is no subscription fee.
To subscribe, send an e-mail message to LISTSERV@UHUPVM1 (BITNET)
or LISTSERV@UHUPVM1.UH.EDU (Internet) that says:
SUBSCRIBE PACS-P First Name Last Name. PACS-P subscribers also
receive two electronic newsletters: Current Cites and Public-
Access Computer Systems News.
This article is Copyright (C) 1992 by Billy Barron. All Rights
Reserved.
The Public-Access Computer Systems Review is Copyright (C) 1992
by the University Libraries, University of Houston. All Rights
Reserved.
Copying is permitted for noncommercial use by computer
conferences, individual scholars, and libraries. Libraries are
authorized to add the journal to their collection, in electronic
or printed form, at no charge. This message must appear on all
copied material. All commercial use requires permission.
-----------------------------------------------------------------
+ Page 54 +
-----------------------------------------------------------------
"Public-Access Provocations: An Informal Column." The Public-
Access Computer Systems Review 3, no. 1 (1992): 54-56.
-----------------------------------------------------------------
"Those Who Don't, Won't"
By Walt Crawford
It's all well and good to blather on about refinements and
extensions to online catalogs. More power to those providing
full-text access and adding retrieval for visual materials and
sound. Good access to non-textual material is a vital part of
modern library catalogs.
What provokes me this time, however, has to do with the end-
result of most library catalog searching: text. The provocation
comes in the term "post-literate," particularly as used by those
who say "text-impaired" in the sense of "hung up on text, that
old-fashioned narrow channel of communication."
A Post-Literate Society Would be a Medieval Society
Need I say more? Perhaps--but that heading is the gist of this
diatribe, and my optimistic viewpoint shows up in the word
"would" rather than "will." I don't believe we're headed for a
post-literate society (and desperately hope that we're not), and
feel that librarians should make every effort to see to it that
we're not.
Some of you will already know the prequel to this column's title:
Dick Dougherty's ALA Presidential slogan, "Kids who read,
succeed." As a rejoinder to those who urge us to move smoothly
into a post-literate society, the follow-on is more important:
"Those who don't, won't."
+ Page 55 +
In a society where most people were "visually literate" or "media
literate" but lack solid, well-developed, constantly-used reading
skills, the literate or "text-oriented" minority would be in
control--perhaps not always overtly, but most assuredly where it
counts. In that dystopian vision, we would revert to a medieval
state where the knowledgeable few rule the ignorant masses. This
time the ignorant masses would not think themselves ignorant,
since they would be flooded with "information" through the vastly
richer channels of sight and sound.
I've heard some of the visions of those who believe that text
doesn't matter. Business people will solve problems by working
through virtual-reality embodiments of the situation at hand, or
use video game-like methods to arrive at the best course of
action. Maybe I'm getting gullible in middle age, but (God help
me) I believe that some of these people are serious!
Narrowness Can be a Virtue
Yes, text represents a narrower communication channel than sight
and sound. Another way of saying that is that text provides
specificity. Text is also linear, which makes it ideal for
logical operations, case-building and the other tools of
argument. (For the purposes of this discussion, mathematics is a
special case of text: even narrower, even more specific, and
generally even more linear.)
Of course text isn't all there is. A description of the Sistine
Chapel can't substitute for photographs of the ceiling and chapel
itself, which in turn don't really substitute for being there. A
description of Stravinsky's "Variations on 'Vom Himmel Hoch'"
would be pretty pallid, while the music itself is an astonishing
cross-century blend and a considerable pleasure. I watch network
TV (and I don't mean PBS) and enjoy it; a description of "Evening
Shade" or "Northern Exposure" could surely not replace the
experience itself.
Turning to more "factual" matters, I would never approve a
proposed extension to my house without seeing appropriate
drawings--but I would also never approve the extension without
detailed textual and mathematical support for those drawings, in
the form of firm costs, structural details, and a proper
contract.
+ Page 56 +
Literacy Empowers
Without the ability to read carefully, thoroughly and
thoughtfully, a person will always be at the mercy of others.
That's true of mathematical literacy as well, to be sure: if you
can't do approximate calculations in your head, you have no basis
for challenging a dishonest tradesperson or a simple error in
charging. Reading, thinking, understanding: these provide power,
the power to participate fully in the modern world. In a post-
literate society, only the reading minority would have that
power. That is, I maintain, a future to be avoided rather than
dealt with.
One final note. If you believe that this column is an attack on
multimedia, or that I am arguing that only print is important or
that entertainment is unimportant: go back and read it again,
this time more carefully.
About the Author
Walt Crawford is a Senior Analyst at The Research Libraries Group
(Mountain View, CA), and is Vice President/President-Elect of
the Library & Information Technology Association (LITA), a
division of the American Library Association.
-----------------------------------------------------------------
The Public-Access Computer Systems Review is a refereed
electronic journal that is distributed on BITNET, Internet, and
other computer networks. There is no subscription fee.
To subscribe, send an e-mail message to LISTSERV@UHUPVM1 (BITNET)
or LISTSERV@UHUPVM1.UH.EDU (Internet) that says:
SUBSCRIBE PACS-P First Name Last Name. PACS-P subscribers also
receive two electronic newsletters: Current Cites and Public-
Access Computer Systems News.
This article is Copyright (C) 1992 by Walt Crawford. All Rights
Reserved.
The Public-Access Computer Systems Review is Copyright (C) 1992
by the University Libraries, University of Houston. All Rights
Reserved.
Copying is permitted for noncommercial use by computer
conferences, individual scholars, and libraries. Libraries are
authorized to add the journal to their collection, in electronic
or printed form, at no charge. This message must appear on all
copied material. All commercial use requires permission.
-----------------------------------------------------------------
+ Page 4 +
-----------------------------------------------------------------
Tonta, Yasar. "Analysis of Search Failures in Document Retrieval
Systems: A Review." The Public-Access Computer Systems Review 3,
no. 1 (1992): 4-53. [Refereed article]
-----------------------------------------------------------------
Abstract
This paper examines search failures in document retrieval
systems. Since search failures are closely related to overall
document retrieval system performance, the paper briefly
discusses retrieval effectiveness measures such as precision and
recall. It examines four methods used to study retrieval
failures: retrieval effectiveness measures, user satisfaction
measures, transaction log analysis, and the critical incident
technique. It summarizes the findings of major failure analysis
studies and identifies the types of failures that usually occur
in document retrieval systems.
1.0 Introduction
Online document retrieval systems often fail to retrieve some
relevant documents. More often than not they also retrieve
nonrelevant documents. Such search failures may occur due to a
variety of reasons, including problems with user-system
interfaces, retrieval rules, and indexing languages.
Studying search failures presents extremely complicated problems.
For instance, it is not clear exactly what constitutes a "search
failure." While some researchers study search failures using
retrieval effectiveness measures such as precision and recall,
others prefer using "user satisfaction" as a criterion in
deciding whether a search has failed or not. This paper will
look at various (mostly implied) definitions of "search failure"
and discuss some of the methods used in failure analysis studies.
+ Page 5 +
2.0 Overview of a Document Retrieval System
The principal function of a document retrieval system is to
retrieve all relevant documents from a store of documents, while
rejecting all others. A perfect document retrieval system would
retrieve ALL and ONLY relevant documents. Maron [1] provides a
more detailed description of the document retrieval problem and
depicts the logical organization of a document retrieval system
(see Figure 1).
-----------------------------------------------------------------
Figure 1. Logical Organization of a Conventional Document
Retrieval System. Source: Maron [2].
-----------------------------------------------------------------
|------<-----|
| |
|---------------| |-----V-------| |
| Incoming | | Inquiring | |
| documents | | patron | |
|-------|-------| |-----|-------| |
| | |
|-------V-------| |------------| |-----V-------| |
| Document |---->| Thesaurus |---->| Query | |
| identification| | Dictionary | | formulation | |
| (Indexing) |<----| |<----| | |
|-------|-------| |-----|------| |-----|-------| |
| | | |
|-------V-------| |-----V------| |-----V-------| |
| Index | | Retrieval | | Formal | |
| records |---->| rule |<----| query | |
|---------------| |-----|------| |-------------| |
| |
|--------------->---------------|
-----------------------------------------------------------------
+ Page 6 +
As Figure 1 suggests, the basic characteristics of each incoming
document (e.g., author, title, and subject) are identified during
the indexing process. Indexers may consult thesauri or
dictionaries (controlled vocabularies) in order to assign
acceptable index terms to each document. Consequently, an index
record is constructed for each document for subsequent retrieval
purposes.
A user can identify proper search terms by consulting these index
tools during the query formulation process. After checking the
validity of initial terms and identifying new ones, the user
determines the most promising query terms (from the retrieval
point of view) to submit to the system as the formal query.
However, most users do not know about the tools that they can
utilize to express their information needs, and search failures
can result when the user's vocabulary does not match the system's
vocabulary.
Maron describes the search process as follows:
the actual search and retrieval takes place by matching the
index records with the formal search query. The matching
follows a rule, called "Retrieval Rule," which can be
described as follows: For any given formal query, retrieve
all and only those index records which are in the subset of
records that is specified by that search query [3].
Thus, a document retrieval system consists of (1) a store of
documents (or, representations thereof); (2) a population of
users each of whom makes use of the system to satisfy their
information needs; and (3) a retrieval rule which compares the
representation of each user's query with the representations of
all the documents in the store so as to identify the relevant
documents in the store. There also should be a user interface to
allow users to interact with the system.
In reality, the ideal document retrieval system discussed in this
section does not exist. Document retrieval systems do not
retrieve ALL and ONLY relevant documents, and users may be
satisfied with systems that rapidly retrieve a few relevant
documents.
+ Page 7 +
3.0 Search Failure Analysis
Before reviewing major failure analysis studies, it is helpful to
examine some approaches used in studying search failures in
document retrieval systems and to discuss the various definitions
of "search failure" used by researchers. After all, we cannot
analyze search failures if we do not recognize them.
3.1 Measures of Retrieval Effectiveness
Retrieval effectiveness measures such as "precision" and "recall"
are widely used to evaluate the effectiveness of online document
retrieval systems. A few measures, which are discussed below,
are also used in the study of search failures. This paper will
not review all the measures of retrieval effectiveness suggested
in the literature since they are seldom, if ever, used in the
analysis of search failures.
Precision is defined as the proportion of retrieved documents
which are relevant, whereas recall is defined as the proportion
of relevant documents retrieved [4]. These two measures are
generally used in tandem in evaluating retrieval effectiveness in
document retrieval systems.
Precision can be taken as the ratio of the number of documents
that are judged relevant for a particular query over the total
number of documents retrieved. For instance, if, for a
particular search query, the system retrieves two documents and
the user finds one of them relevant, then the precision ratio for
this search would be 50%.
Recall is considerably more difficult to calculate than precision
because it requires finding relevant documents that will not be
retrieved during users' initial searches [5]. Recall can be
taken as the ratio of the number of relevant documents retrieved
over the total number of relevant documents in the collection.
Take the above example. The user judged one of the two retrieved
documents to be relevant. Suppose that later three more relevant
documents that the original search query failed to retrieve were
found in the collection. The system retrieved only one out of
the four relevant documents from the database. The recall ratio
would then be equal to 25% for this particular search.
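The arithmetic of these two examples can be restated as a short
program sketch. The sketch below (in Python) is purely
illustrative: the document identifiers are hypothetical and do
not come from any system discussed in this paper.

     # Precision and recall for the worked example above.
     # Document identifiers are hypothetical.

     def precision(retrieved, relevant):
         """Proportion of retrieved documents that are relevant."""
         if not retrieved:
             return 0.0
         return len(retrieved & relevant) / len(retrieved)

     def recall(retrieved, relevant):
         """Proportion of relevant documents that are retrieved."""
         if not relevant:
             return 0.0
         return len(retrieved & relevant) / len(relevant)

     retrieved = {"d1", "d2"}              # two documents retrieved
     relevant = {"d1", "d3", "d4", "d5"}   # four relevant documents

     print(precision(retrieved, relevant))   # 0.5, i.e., 50%
     print(recall(retrieved, relevant))      # 0.25, i.e., 25%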
+ Page 8 +
"Fallout" is another measure of retrieval effectiveness. Fallout
can be defined as the ratio of nonrelevant documents retrieved
over all the nonrelevant documents in the collection. The
earlier example also can be used to illustrate fallout. The user
judged one of the two retrieved documents as relevant, and,
later, three more relevant documents that the original query
missed were identified. Further suppose that there are nine
documents in the collection altogether (four relevant plus five
nonrelevant documents). Since the user retrieved one nonrelevant
document out of a total of five nonrelevant ones in the
collection, the fallout ratio would be 20% for this search.
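Continuing the same hypothetical nine-document example, fallout
can be computed once the nonrelevant portion of the collection is
known:

     # Fallout for the worked example above: nonrelevant documents
     # retrieved over all nonrelevant documents in the collection.
     # The nine-document collection is hypothetical.

     collection = {"d1", "d2", "d3", "d4", "d5",
                   "d6", "d7", "d8", "d9"}
     relevant = {"d1", "d3", "d4", "d5"}
     retrieved = {"d1", "d2"}

     nonrelevant = collection - relevant
     fallout = len(retrieved & nonrelevant) / len(nonrelevant)
     print(fallout)   # 0.2, i.e., 20%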
3.2 Methods of Analyzing Search Failures
This section discusses the analysis of search failures using
retrieval effectiveness methods (e.g., recall), user satisfaction
measures, transaction logs, and the critical incident technique.
3.2.1 Analysis of Search Failures Utilizing Retrieval
Effectiveness Measures
If precision and recall are seen as performance measures with the
given definitions, it instantly becomes clear that "performance"
can no longer be defined as a dichotomous concept. As precision
and recall are defined as percentages, we can think of "degrees"
of search failure or success. This view would probably best
reflect different performance levels attained by current document
retrieval systems. No perfect document retrieval system exists;
in reality, all retrieval systems are imperfect, and some perform
better than others.
+ Page 9 +
Performance measures such as precision and recall can be used in
the analysis of search failures.
In the precision example in Section 3.1, only 50% of the
documents retrieved were relevant, resulting in a precision of
50%. If each nonrelevant document that the system retrieves for
a given query represents a search failure, then it is also
possible to think of precision as a measure of search failure:
failure to retrieve relevant documents ONLY. The more
nonrelevant documents the system retrieves for a given query, the
higher the degree of precision failures. If no retrieved
document happens to be relevant, then the precision ratio becomes
zero due to severe precision failures.
In the recall example, the recall ratio was 25%, implying that
the system missed 75% of the relevant documents in the
collection. If each missed relevant document represents a search
failure, then it is possible to think of recall as a measure of
search failure: failure to retrieve ALL relevant documents in the
collection. The more relevant documents the system misses, the
higher the degree of recall failure. If the system fails to
retrieve any relevant documents from the collection, then the
recall ratio becomes zero due to severe recall failures.
Precision and recall are two different quantitative measures of
aggregation of search failures. For convenience, search failures
analyzed using precision and recall are called precision failures
and recall failures.
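In terms of the same kind of illustrative sketch (again with
hypothetical document sets), the two kinds of failure are simply
the two set differences left over after a search:

     # Precision failures: nonrelevant documents retrieved.
     # Recall failures: relevant documents missed.
     # The document sets are hypothetical.

     retrieved = {"d1", "d2"}
     relevant = {"d1", "d3", "d4", "d5"}

     precision_failures = retrieved - relevant
     recall_failures = relevant - retrieved

     print(len(precision_failures))   # 1 nonrelevant item retrieved
     print(len(recall_failures))      # 3 relevant items missed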
Precision failures can easily be detected. They occur when the
user finds some retrieved documents nonrelevant, even if those
documents are assigned the index terms that the user initially
asked for in the search query. Users may feel that index terms
have been incorrectly assigned to documents that are not really
relevant to those subjects.
It should be noted that "relevance" is defined as a relationship
"between a document and a person in search of information" and it
is a function of a large number of variables concerning both the
document (e.g., what it is about, its currency, language, and
date) and the person (e.g., person's education and beliefs) [6].
(For a comprehensive review of the concept of "relevance," see
[7].)
+ Page 10 +
Recall failures mainly occur because index terms that users would
normally utilize to retrieve documents about particular subjects
do not get assigned to documents that are relevant to those
subjects. As stated earlier, detecting recall failures,
especially in large scale document retrieval systems, is much
more difficult. Researchers have therefore used somewhat
different approximations to calculate recall figures in their
experiments.
Although information retrieval textbooks mention "fallout" as a
measure of retrieval effectiveness, the author is not aware of
any experiment where fallout ratio has been successfully
calculated [8]. Calculating the fallout ratio in large
collections is as difficult as calculating the recall ratio, if
not more so. To calculate the fallout ratio,
all nonrelevant documents retrieved during the search must be
identified, all nonrelevant documents in the overall collection
must be found, and the size of the collection must be
established.
It is tempting to say that documents that are not retrieved are
probably not relevant; however, since recall failures do occur in
document retrieval systems, this is not the case. If all of the
unretrieved documents in a collection were scanned, some of them
would be relevant. The fallout ratio could then be calculated.
It should be noted that this method can only be used for specific
queries where the number of relevant documents in the whole
collection is known to be small.
"Fallout failures" do occur constantly in document retrieval
systems even if it is impractical to quantify them. Whenever the
system retrieves too many nonrelevant records, users feel the
consequences of fallout failure. Either they must scan long
lists of useless records (hence "fallout") or abandon the search.
+ Page 11 +
Notice that fallout failures also can be seen as severe precision
failures. Fallout failure has not been adequately studied;
however, it is known that users tend to resist scanning through
screens of retrieved items. For instance, Larson [9] found that
in a large online catalog the average number of records retrieved
was 77.5, but users scanned an average of less than 10 records
per search. It is not clear why the users stopped scanning after
a few records. Some may have been satisfied with the results.
Some users might have abandoned their searches due to frustration
because the system retrieved too many unpromising, nonrelevant
records [10]. It would be interesting to study what percentage
of searches in online catalogs are abandoned because of user
frustration with fallout failures.
It is also theoretically possible to envision "perverse" document
retrieval systems where, for a given query, the system first
would retrieve all nonrelevant documents before it would
eventually retrieve relevant ones [11]. However, in real life,
"perverse" document retrieval systems are unlikely to exist.
Mainly, retrieval effectiveness measures are used to determine
and study three types of search failures: (1) retrieving
nonrelevant documents (precision failures); (2) missing relevant
documents (recall failures); and (3) retrieving too many
unpromising, nonrelevant documents (fallout failures). Failure
analysis aims to find out the causes of these failures so that
existing systems can be improved in a variety of ways.
+ Page 12 +
So far, this paper has examined a few of the measures of
retrieval effectiveness and the ways in which they are used in
the study of search failures. It was noted that document
retrieval systems are not perfect and that we cannot expect them
to achieve, or even approximate, the impossible ideal of
retrieving ALL and ONLY relevant documents in the collection.
Some would argue that users would like to find some relevant
documents, but not necessarily ALL of them, unless (as in rare
cases such as patent searching) ALL are wanted.
Users prefer high precision to high recall. They wish to
retrieve "some good references without having to examine too many
bad ones" [12]. Consequently, it is more important for a
document retrieval system to "distinguish between wanted and
unwanted items" quickly than to retrieve all relevant items in
the collection.
It also should be noted that not everyone is satisfied with the
most commonly used retrieval effectiveness measures (precision
and recall). For instance, Cooper has questioned the use of
recall as a performance measure because it takes into account not
only retrieved documents, but also unretrieved documents. In his
view, this is wasted effort since the relevance of unretrieved
documents has little bearing on the notion of subjective user
satisfaction [13]. He maintains that "an ideal evaluation
methodology must somehow measure the ultimate worth of a
retrieval system to its users in terms of an appropriate unit of
utility" [14].
3.2.2 Analysis of Search Failures Utilizing User Satisfaction
Measures
Some failure analysis studies are based on user satisfaction
measures, rather than on retrieval effectiveness measures.
Although it may at first seem straightforward, analyzing search
failures utilizing user satisfaction measures is a complex
process that provides interesting challenges.
+ Page 13 +
First, defining user satisfaction is difficult. Several authors
have tried to address this issue. Tessier, Crouch, and Atherton
discussed such factors as the search output, the intermediary,
the service policies, and the "library as a whole" as the main
determinants of user satisfaction [15]. Bates examined the
effects of "subject familiarity" and "catalog familiarity" on
search success and found that the former has a slight detrimental
effect, while the latter has a very significant beneficial effect
on search success [16]. Tessier used factor analysis and
multiple regression techniques to study the influence of various
variables on overall search satisfaction. She found that "the
strongest predictors of satisfaction were the precision of
search, the amount of time saved, and the perceived quality of
the database as a source of information" [17]. Hilchey and
Hurych found "a strong positive relationship between perceived
relevance of citations and search value" when they performed a
statistical analysis on the online reference questionnaire forms
returned by the users in a university library [18].
Second, user satisfaction relies heavily on users' judgments
about search failures or successes; however, users' judgments may
be inconsistent for various reasons. For example, Tagliacozzo
found that "MEDLINE was perceived as 'helpful' by respondents
who, in other parts of the questionnaire [used in the author's
research], showed that they had NOT found it particularly useful"
[19, (original emphasis)]. Tagliacozzo warns us: "Caution should
therefore be used in taking the users' judgments at face value,
and in inferring from single responses that their information
needs were, or were not, satisfied by the service" [20].
It follows that it is not usually sufficient to obtain a binary
"Yes/No" response from the user about being satisfied or not
satisfied with the results. Ankeny found that the use of a
two-point (yes-no) scale "appeared to result in inflated success
ratings" [21]. When pressed, users are likely to come up with
further explanations. For example, a user might say: "Yes, in a
way my search was successful even though I couldn't find what I
wanted." A second user might say that a given search was not
successful because "it did not retrieve anything new."
+ Page 14 +
A researcher getting such answers would have a hard time
classifying them. The data gathering tools that the researcher
employs to elicit information from users should be sensitive
enough to handle such answers by asking more detailed questions.
After all, a decision has to be made as to whether a search was
successful or not. Further conditions have been introduced in
some studies
to facilitate this decision-making process. In Ankeny's study,
for example, a successful search has three characteristics:
the patron must indicate that s/he found EXACTLY what was
wanted, that s/he was FULLY satisfied with the search, and
that s/he marked none of the 10 listed reasons for
dissatisfaction where the reasons for dissatisfaction ranged
from "system problems" to "too much information," from
"information not relevant enough" to "need different
viewpoint" [22, (original emphasis)].
Nevertheless, it is still possible that a given search may be a
failure even if answers given by a user met all three of these
conditions. It was noted earlier that users tend to abandon some
searches that retrieve too many items. Many users may prefer to
retrieve a few relevant documents quickly. They would not
consider a search as a "failure" even if the system has missed
some relevant documents (i.e., recall failure).
User satisfaction measures are influenced by both user group and
search goal factors. For example, an undergraduate student
writing a term paper may be satisfied if a search retrieves a few
relevant textbooks. However, the situation is entirely different
for a health professional. This user may want to know everything
about a certain case because missing relevant information may
have serious consequences. For example, a health
professional investigating a medical procedure on "MEDLINE only
found records showing it to be safe, missing the reports of
fatalities associated with the procedure" [23].
+ Page 15 +
The above examples show that some caution is needed when
interpreting users' indication of satisfaction. There are some
published studies that show that "in many cases high levels of
reported end-user 'satisfaction' . . . may not reflect true
success rates" [24]. Furthermore, as Cheney notes, we do not
"know what end users expect of their search results, because no
study has examined end users' expectations of database searching.
Neither has any study examined the actual quality of end-user
search results measured in terms of precision and recall" [25].
So far, the discussion has concentrated on the analysis of search
failures that were based on retrieval effectiveness or "user
satisfaction." As part of a carefully designed and conducted
experiment under "as real-life a situation as possible,"
Saracevic and Kantor studied, among other things, the
relationship between user satisfaction and precision and recall
[26].
Their experiment involved 40 users who each submitted a query
that reflected a real information need. Thirty-nine professional
searchers did online searches on Dialog databases for these
queries. Each query was searched by nine different professionals
and the results were combined for evaluation purposes. The
precision ratio for a given search was estimated as the number of
relevant items retrieved by the search divided by the total
number of items retrieved by the search. Similarly, recall ratio
was estimated as the number of relevant items retrieved by the
search divided by the total number of relevant items in the union
of items retrieved by all searchers for that question [27]. Five
utility measures were used: (1) whether the user's participation
and the resultant information was worth it (on a five-point
scale); (2) time spent; (3) perceived (by the users) dollar value
of the items; (4) whether the information contributed to the
resolution of the research problem (on a five-point scale); and
(5) whether the user was satisfied with the results (on a five-
point scale).
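The pooled estimation of recall described above can be sketched
as follows. The searcher result sets and relevance judgments are
hypothetical stand-ins, not data from the study itself.

     # Pooled recall estimation (hypothetical data): recall for one
     # searcher is measured against the union of relevant items
     # retrieved by all searchers for the same question.

     results_by_searcher = {
         "s1": {"d1", "d2", "d3"},
         "s2": {"d2", "d4"},
         "s3": {"d5"},
     }
     judged_relevant = {"d1", "d2", "d4"}   # the user's judgments

     pool = set().union(*results_by_searcher.values())
     relevant_pool = pool & judged_relevant

     for searcher, retrieved in results_by_searcher.items():
         hits = retrieved & judged_relevant
         precision = len(hits) / len(retrieved)
         recall = len(hits) / len(relevant_pool)
         print(searcher, round(precision, 2), round(recall, 2))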
+ Page 16 +
They found that "searchers in questions where users indicated
high overall satisfaction with results . . . were 2.49 times more
likely to have higher precision" [28]. They interpreted their
findings pertaining to the relationship between utility measures
and retrieval effectiveness measures as follows:
In general, retrieved sets with high precision increased the
chance that users assessed that the results were "worth
more of their time than it took," were "high in dollar
value," contributed "considerably to their problem
resolution," and "were highly satisfactory." On the other
hand, high recall did not significantly affect the odds for
any of those measures. . . . These are interesting findings
in another respect. They indicate that utility of results
(or user satisfaction) may be associated with high
precision, while recall does not play a role that is even
closely as significant. For users, precision seems to be
the king and they indicated so in the type of searches
desired. In a way this points out to the elusive nature of
recall: this measure is based on the assumption that
something may be missing. Users cannot tell what is missing
any more than searchers or systems can. However, users can
certainly tell what is in their hand, and how much is NOT
relevant [29, (original emphasis)].
3.2.3 Analysis of Search Failures Utilizing Transaction Logs
The availability of transaction logs, which record users'
interaction with the document retrieval systems, provides the
opportunity to study and monitor search failures unobtrusively.
Larson states: "Transaction monitoring, in its simplest form,
involves the recording of user interactions with an online
system. More complete transaction monitoring also will record
the system responses and performance data (such as response time
for searches), providing enough information to reconstruct all of
the user's interactions with the system" [30]. This includes
search queries entered, records displayed, help requests, errors,
and the system responses. (For a review of online catalog
transaction log studies, see [31].)
+ Page 17 +
Since transaction logs also contain invaluable information about
failed searches, researchers have been interested in scanning
transaction logs in order to identify failed searches. Several
researchers identified "zero hits" from the transaction logs of
selected online catalogs and looked into the reasons for search
failures [32]. A few others employed the same method when they
studied search failures in MEDLINE [33]. These researchers used
a rather practical definition of search failure when scanning
transaction logs. A search was treated as a failure if it
retrieved no records.
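As a rough illustration of this practical definition, the sketch
below scans a hypothetical transaction log and flags zero-hit
searches; the log format is invented and will differ from system
to system.

     # Flag zero-hit searches in a hypothetical transaction log.
     # Each entry records a query string and its number of hits.

     transaction_log = [
         {"query": "dewey decimal", "hits": 12},
         {"query": "vom himmel hoch stravinsky", "hits": 0},
         {"query": "medlars evaluation", "hits": 3},
     ]

     zero_hit = [e for e in transaction_log if e["hits"] == 0]
     failure_rate = len(zero_hit) / len(transaction_log)

     print(round(failure_rate, 2))            # 0.33
     print([e["query"] for e in zero_hit])    # the failed queries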
Needless to say, the definition of search failure as zero hits is
incomplete since it does not include partial search failures.
More importantly, there is no reason to believe that all
"non-zero hits" searches were successful ones. Such an
assumption would mean that no precision failures occurred in the
systems under investigation! Furthermore, "not all zero hits
represent failures for the patrons . . . It is possible that the
patron is satisfied knowing that the information sought is not in
the database, in which case the zero-hit search is successful"
[34]. Precedent searching in litigation is an example of a
zero-hit search that is successful.
Some newer document retrieval systems such as Okapi and CHESHIRE
can accommodate relevance feedback techniques and incorporate
users' relevance judgments in order to improve retrieval
effectiveness in subsequent iterations [35]. Transaction logs of
such online catalogs also record the user's relevance judgment
for each record that is displayed. Using these logs, the
researcher is able to determine whether the user found a given
record to be relevant or not.
The availability of relevance judgments in transaction logs has
opened up new avenues for studying search failures in online
library catalogs. Researchers are now able to study not only
zero-hit searches, but also failed searches that retrieve
nonrelevant records. Obviously, the rendering of relevance
judgments makes it easier to identify precision failures, but
there still needs to be some kind of mechanism to identify recall
failures.
+ Page 18 +
What constitutes a search failure when the relevance judgment for
each retrieved document is recorded in the transaction log? Some
researchers came up with yet another practical definition of
search failure and analyzed it accordingly. For example, during
the evaluation of the Okapi online catalog, a search was counted as a
failure "if no relevant record appears in the first ten which are
displayed" [36]. This definition of search failure is quite
different from one based on precision and recall. It is
dichotomous, and it assumes that users will scan at least ten
records before quitting. This assumption might be true for some
searches and for some users, but not for all searches and users.
It also downplays the importance of search failures. Searches
retrieving at least one relevant record in ten are considered
"successful" even though the precision rate for such searches is
quite low (10%).
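The dichotomous rule quoted above can be written down directly.
The sketch below is a hypothetical paraphrase of that definition,
not code from the Okapi project; the display list and relevance
judgments are invented.

     # Hypothetical paraphrase of the rule: a search fails if no
     # relevant record appears in the first ten records displayed.

     def failed(displayed, judged_relevant, cutoff=10):
         return not any(doc in judged_relevant
                        for doc in displayed[:cutoff])

     displayed = ["d7", "d9", "d2", "d4", "d1",
                  "d8", "d6", "d3", "d5", "d10"]
     judged_relevant = {"d4"}

     print(failed(displayed, judged_relevant))   # False: one
                                                 # relevant in ten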
Although transaction monitoring offers unprecedented
opportunities to study search failures in document retrieval
systems and provides "highly detailed information about how users
actually interact with an online system, . . . it cannot reveal
their intentions or whether they are satisfied with the results"
[37].
Some of the shortcomings of transaction monitoring in studying
search failures are as follows.
First, it is not clear what constitutes a "search failure" in
transaction logs. As mentioned earlier, defining all zero-hit
searches as search failures has some serious flaws.
Second, transaction logs have very little to offer when studying
recall failures in document retrieval systems. Recall failures
can only be determined by using different methods such as
analysis of search statements, indexing records, and retrieved
documents. Moreover, relevant documents that were not retrieved
in the first place can be found by performing successive
searches in the database.
+ Page 19 +
Third, transaction logs can document search failure occurrences,
but they cannot explain why a particular failure occurred.
Search failures in online catalogs occur for a variety of
reasons, including simple typographical errors, mismatches
between users' search terms and the vocabulary used in the
catalog, collection failures (i.e., requested item is not in the
system), user interface problems, and the way search and
retrieval algorithms function. Further information is needed
about users' needs and intentions in order to find out why a
particular search failed.
Finally, since the users remain anonymous in transaction logs,
analysis of these logs "prevents correlation of results with user
characteristics" [38].
3.2.4 Analysis of Search Failures Utilizing the Critical
Incident Technique
Based on their empirical investigation of tools, techniques, and
methods for the evaluation of online catalogs, Hancock-Beaulieu,
Robertson, and Neilson [39] found that "transaction logs can only
be used as an effective evaluative method with the support of
other means of eliciting information from users." One of the
techniques to elicit information from users about their needs and
intentions is known as the "critical incident technique." Data
gathered through this technique, which is briefly discussed
below, facilitates the study of search failures in document
retrieval systems. When it is used in conjunction with the
analysis of transaction log data, the critical incident technique
permits search failures to be correlated with user
characteristics.
The critical incident technique was first used during World War
II to analyze the reasons that pilot candidates failed to learn
to fly. Since then, this technique has been widely used, not
only in aviation, but also in defining the critical requirements
of and measuring typical performance in the health professions.
Flanagan [40] describes the critical incident technique as
follows:
+ Page 20 +
The critical incident technique consists of a set of
procedures for collecting direct observations of human
behavior in such a way as to facilitate their potential
usefulness in solving practical problems and developing
broad psychological principles. The critical incident
technique outlines procedures for collecting observed
incidents having special significance and meeting
systematically defined criteria.
By an incident is meant any observable human activity
that is sufficiently complete in itself to permit
inferences and predictions to be made about the person
performing the act.
The critical incident technique essentially consists of two
steps: (1) collecting and classifying detailed incident reports,
and (2) making inferences that are based on the observed
incidents.
Recently, the critical incident technique has been used to assess
"the effectiveness of the retrieval and use of biomedical
information by health professionals" [41]. In the same study,
researchers have used this technique to analyze and evaluate
search failures in MEDLINE. Using a structured interview process
that included administering a questionnaire, they asked users to
comment on the effectiveness of online searches that they
performed on the MEDLINE database. Each report obtained through
structured interviews was called an "incident report."
Researchers matched these incident reports against MEDLINE
transaction log records corresponding to each search in order to
find out the actual reasons for search success or failure. These
incident reports provided much sought after information about
user needs and intentions, and they put each transaction log
record in context by linking search data to the searcher.
+ Page 21 +
Although the critical incident technique enables the researcher
to gather information about user needs and intentions so that he
or she can better explain the causes of search failures, it also
has some shortcomings. Information gathered through the critical
incident technique has to be corroborated with transaction log
data. The verification of user satisfaction or dissatisfaction
via transaction log data may provide further clues as to why
searches succeed or fail. However, the researcher may not be
able to confirm each and every user's account of his or her
search from the transaction logs. As the users are usually not
identified in the transaction logs, it is sometimes difficult to
find the search in question in the logs.
There are a variety of reasons for this problem. First, the
user's advance permission has to be sought in order to examine
his or her search(es) in the transaction logs. Second, users may
not be able to recall the details of their searches after the
fact. Third, the logs may not contain enough data about the
search: the items displayed and users' relevance judgments are
not recorded in most transaction logs.
The lack of enough data in transaction logs also influences the
effectiveness of the critical incident technique. The researcher
has to rely a great deal on what the user says about the search.
For instance, if the items displayed by the user along with
relevance judgments are not recorded in the transaction logs, the
researcher will not be able to find the precision ratio.
Furthermore, the critical incident technique per se does not tell
us much about the documents that the user may have missed during
the search: we still have to find out about recall failures using
other methods.
3.3 Summary
This section discussed various methods of analyzing search
failures in document retrieval systems. It emphasized that the
issue of search failure is complex. It demonstrated that no
single method of analysis is sufficient by itself to characterize
all the causes of search failures. The next section will review
the findings of major studies in this area.
+ Page 22 +
4.0 Review of Studies Analyzing Search Failures
Numerous studies have shown that users experience a variety of
problems when they search document retrieval systems, and that
they often fail to retrieve relevant documents. The problems users
frequently encounter when searching, especially in online
catalogs, are well documented in the literature [42]. However,
few researchers have studied search failures directly [43]. What
follows is a brief overview of major studies of search failures
in document retrieval systems. Not surprisingly, the results of
these studies are not directly comparable because they use
different definitions and methods of analysis.
4.1 Studies Utilizing Precision and Recall Measures
Several major studies employed precision and recall measures to
analyze search failures.
4.1.1 The Cranfield Studies
Cyril Cleverdon, who was Librarian of the College of Aeronautics
at Cranfield, England, and his colleagues conducted a series of
studies in the late 1950s and early 1960s to investigate the
performance of indexing systems [44]. They also studied the
causes of search failures in document retrieval systems. This
paper only reviews findings that pertain to search failures.
In the first study (Cranfield I), Cleverdon compared the
retrieval effectiveness of four indexing systems:
the Universal Decimal Classification, an alphabetical subject
index, a special facet classification, and the uniterm system of
co-ordinate indexing. Some 18,000 research reports and
periodical articles in the field of aeronautics were indexed
using these four indexing systems, and 1,200 queries were used in
the tests [45].
The main purpose of the Cranfield I experiment was to test the
ability of each indexing system to retrieve the "source document"
upon which each query was based. Researchers knew beforehand
that "there was at least one document which would be relevant to
each question" [46]. The recall ratio was calculated based on
the retrieval of source documents. However, this recall ratio
should be regarded as a type of "constrained" recall since the
objective was just to find source documents in the collection.
Cranfield I tests have shown that "the general working level of
I.R. systems appears to be in the general area of 60%-90% recall
and 10%-25% of relevance [i.e., precision]" [47].
+ Page 23 +
During the tests, each search was "carried on to the stage where
the source document was retrieved or alternatively the searcher
was unable to devise any further reasonable search programmes"
[48]. Each query was judged to be a success or failure: a search
was a success if the source document was retrieved, a failure if
it was not. Swanson states: "The decision to measure retrieval
success solely in terms of the source document was prompted by an
understandable, though unfortunate, desire to determine whether
any given document was or was not relevant to the question" [49].
Relevant documents other than source documents, which would have
been retrieved during the search, were not taken into account.
The success rate for all searches was found to be 78% [50]; source
documents were successfully retrieved for most search queries.
Cleverdon's analysis of search failures was based on 329
documents and queries. The total number of search failures was
495 [51]. He classified the causes of search failures under four
main headings: (1) question, (2) indexing, (3) searching, and (4)
system. Each heading included further subdivisions to specify
the exact cause(s) of each search failure. For example,
questions could be "too detailed," "too general," "misleading" or
just plain "incorrect." Likewise, insufficient, incorrect, or
careless indexing; insufficient number of entries; and lack of
cross references caused further search failures. Included under
searching were "lack of understanding," "failure to use all
concepts," "failure to search systematically," and "incorrect" or
"insufficient searching." The lack of some features in indexing
systems, such as synonymity and inability to combine particular
concepts, also caused search failures.
The number of failed searches under each subdivision is given in
several tables. The reasons for failures in searches carried out
by the project staff are as follows: questions, 17%; indexing
process, 60%; searching, 17%; and indexing system, 6%. The
percentages of failures in searches performed by the technical
staff (i.e., the end-users) were somewhat higher for searching
(37%).
+ Page 24 +
It appears that well over half of the failures in this study were
caused by the indexing process. Cleverdon summarizes the results
of the analysis of search failures as follows [52]:
The analysis of failures . . . shows most decisively that
the failures were, for more than all other reasons together,
due to mistakes by the indexers or searchers, and that a
third of the failures could have been avoided if the project
staff had indexed consistently, as well as they were capable
of doing. Put another way, this means that in every hundred
documents, the indexers failed to index adequately five
documents, the failure usually consisting of the omission of
some particular concept.
The second study (Cranfield II) conducted by Cleverdon and his
colleagues was an attempt to investigate the performance of
indexing systems based on such factors as the exhaustivity of
indexing and the level of specificity of the terms in the index
language. The test collection consisted of some 1,400 research
reports and periodical articles on the subject of aerodynamics
and aircraft structures. Some 221 queries (all single theme
queries) were obtained from the authors of selected published
papers. However, most tests were based on 42 queries and 200
documents [53].
Precision and recall were used to determine the retrieval
effectiveness of indexing systems. It is difficult to cite a
single performance figure because the Cranfield II experiment
involved a number of different index languages with a large
number of variables. It was found that there exists an inverse
relationship between recall and precision and that "the two
factors which appear most likely to affect performance are the
level of exhaustivity of indexing and the level of specificity of
the terms in the index language" [54]. As noted in the preface
to volume two of the report, a detailed intellectual analysis of
the reasons for search failures was not carried out.
+ Page 25 +
4.1.2 Lancaster's MEDLARS Studies
The Cranfield projects tested retrieval effectiveness in a
laboratory setting, and the size of the test collection was small
(1,400 documents). By contrast, Lancaster studied the retrieval
effectiveness of a large biomedical reference retrieval system
(MEDLARS) in operation [55]. The MEDLARS database (Medical
Literature Analysis and Retrieval System) contained some 700,000
records at that time. Some 300 "real life" queries were obtained
from researchers and were used in the tests.
The retrieval effectiveness of the MEDLARS search service was
measured using precision and recall. The precision ratio was
calculated according to the definition given in section 3.1.
However, it would have been extremely difficult to calculate a
true recall figure in a file of 700,000 records because this
would have meant having the requester examine and judge each and
every document in the collection. Lancaster explains how the
recall figure was obtained:
We therefore estimated the MEDLARS recall figure on the
basis of retrieval performance in relation to a number of
documents, judged relevant by the requester, BUT FOUND BY
MEANS OUTSIDE MEDLARS. These documents could be, for
example,
1. documents known to the requester at the time of
his request,
2. documents found by his local librarian in non-NLM
[National Library of Medicine] generated tools,
3. documents found by NLM in non-NLM-generated tools,
4. documents found by some other information center,
or
5. documents known by authors of papers referred to
by the requester [56, (original emphasis)].
+ Page 26 +
Relevant documents identified by the requester for each query
made up the "recall base" upon which the calculation of the
recall figure was based. An example illustrates how recall was
calculated: suppose the recall base consists of six documents
known to the requester to be relevant before the search. Under
these circumstances, if "only 4 are retrieved, we can say that
the recall ratio for this search is 66%" [57].
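The arithmetic behind this relative recall estimate is simple
enough to sketch in a few lines of Python. The code below is only
an illustration; the function name and document identifiers are
invented and do not come from Lancaster's study:
    def recall_against_base(recall_base, retrieved):
        # Recall estimated against a "recall base" of documents already
        # judged relevant by the requester but found outside the system.
        base = set(recall_base)
        if not base:
            return None  # no basis for an estimate
        return len(base & set(retrieved)) / len(base)
    # Six documents known to be relevant beforehand; four are retrieved.
    base = {"d1", "d2", "d3", "d4", "d5", "d6"}
    retrieved = {"d1", "d2", "d5", "d6", "d9", "d12"}
    print(recall_against_base(base, retrieved))  # 0.666..., cited as 66%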
Based on the results of 299 test searches, Lancaster found that
the MEDLARS Search Service was operating with an average
performance of 58% recall and 50% precision.
Lancaster also studied the search failures using precision and
recall. He investigated recall failures by finding some relevant
documents using sources other than MEDLARS and then checking to
see if the relevant documents had also been retrieved during the
experiment. If some relevant documents were missed, this was
considered as a recall failure and measured quantitatively.
Precision failures were easier to detect since users were asked
to judge the retrieved documents as being relevant or
nonrelevant. If the user decided that some documents were
nonrelevant, this was considered to be a precision failure and
measured accordingly. However, identifying the causes of
precision failures proved to be much more difficult because the
user might have judged a document to be nonrelevant due to index,
search, document, and other characteristics as well as the user's
background and previous experience with the document.
To date, Lancaster's study is the most detailed account of the
causes of search failures that has been attempted. As Lancaster
points out:
The "hindsight" analysis of a search failure is the most
challenging aspect of the evaluation process. It involves,
for each "failure," an examination of the full text of the
document; the indexing record for this document (i.e., the
index terms assigned . . . ); the request statement; the
search formulation upon which the search was conducted; the
requester's completed assessment forms, particularly the
reasons for articles being judged "of no value"; and any
other information supplied by the requester. On the basis
of all these records, a decision is made as to the prime
cause or causes of the particular failure under review [58].
+ Page 27 +
Lancaster found that recall failures occurred in 238 out of 302
searches, while precision failures occurred in 278 out of 302
searches. More specifically, some 797 relevant documents were
not retrieved. More than 3,000 documents that were retrieved
were judged nonrelevant by the requesters. Lancaster's original
research report contains statistics about search failures along
with detailed explanations of their causes.
Lancaster discovered that almost all of the failures could be
attributed to problems with indexing, searching, the index
language, and the user-system interface. For instance, the
indexing subsystem in his research "contributed to 37% of the
recall failures and . . . 13% of the precision failures" [59].
The searching subsystem, on the other hand, was "the greatest
contributor to all the MEDLARS failures, being at least partly
responsible for 35% of the recall failures and 32% of the
precision failures" [60].
4.1.3 Blair and Maron's Full-Text Retrieval System Study
More recently, Blair and Maron [61] conducted a retrieval
effectiveness test on a full-text document retrieval system.
They utilized a database that "consisted of just under 40,000
documents, representing roughly 350,000 pages of hard-copy text,
which were to be used in the defense of a large corporate law
suit" [62]. The tests were based on some 51 queries obtained
from two lawyers.
Precision and recall were used as performance measures in the
Blair and Maron study. The precision ratio was straightforward
to calculate (by dividing the total number of relevant documents
retrieved by the total number of documents retrieved). Blair and
Maron used a different method to calculate the recall ratio. The
way they found unretrieved relevant documents (and thus studied
recall failures) was as follows. They developed "sample frames
consisting of subsets of the unretrieved database" that they
believed to be "rich in relevant documents" and took random
samples from these subsets. Taking samples from subsets of the
database rather than from the entire database was methodologically
advantageous "because, for most queries,
the percentage of relevant documents in the database was less
than 2 percent, making it almost impossible to have both
manageable sample sizes and a high level of confidence in the
resulting Recall estimates" [63].
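The general logic of such a sample-based estimate can be sketched
as follows. This is not Blair and Maron's actual estimator; the
figures and function names are hypothetical and merely show how a
sample's relevance rate might be scaled up to an unretrieved
subset:
    def precision(retrieved, relevant):
        # Precision: relevant documents retrieved / total retrieved.
        retrieved, relevant = set(retrieved), set(relevant)
        return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0
    def estimated_recall(relevant_retrieved, sample_size, sample_relevant,
                         unretrieved_subset_size):
        # Scale the sample's relevance rate up to the whole unretrieved
        # subset to estimate how many relevant documents were missed.
        est_missed = (sample_relevant / sample_size) * unretrieved_subset_size
        return relevant_retrieved / (relevant_retrieved + est_missed)
    # Hypothetical query: 40 relevant documents retrieved; a 200-document
    # sample from a 4,000-document unretrieved subset contains 8 relevant.
    print(estimated_recall(40, 200, 8, 4000))  # 0.2, i.e., 20% recall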
+ Page 28 +
The results of Blair and Maron's tests showed that the mean
precision ratio was 79% and the mean recall ratio was 20% [64].
Blair and Maron found that recall failures occurred much more
frequently than one would expect: the system failed to retrieve,
on the average, four out of five relevant documents in the
database. They showed quite convincingly that high recall
failures can result from free-text queries, where the user's
terminology and that of the system do not match.
Blair and Maron also observed that users involved in their
retrieval effectiveness study believed that "they were retrieving
75 percent of the relevant documents when, in fact, they were
only retrieving 20 percent" [65].
4.1.4 Markey and Demeyer's Dewey Decimal Classification Online
Project
Markey and Demeyer studied the Dewey Decimal Classification (DDC)
system "as an online searcher's tool for subject access,
browsing, and display in an online catalog" [66]. Two online
catalogs were employed in the study: "(1) DOC, or Dewey Online
Catalog, in which the DDC had been implemented as an online
searcher's tool for subject access, browsing, and display; and
(2) SOC, or Subject Online Catalog, in which the DDC had not been
implemented" [67].
They also conducted online retrieval performance tests using
recall and precision measures to reveal problems with online
catalogs and to identify their inadequacies. Precision was
defined in their study as the proportion of unique relevant items
retrieved and displayed. This definition of precision differs
from the one given in Section 3.1 in that it takes into account
only retrieved and displayed items (instead of all retrieved
items) in the calculation of precision ratio. The researchers
made no attempt to have users display and make relevance
assessments about all the retrieved items in order to calculate
the absolute precision ratio [68].
+ Page 29 +
Their estimated recall scores were also based on retrieved and
displayed items only, not on all the relevant items in the
collection. Understandably, they found it impractical to scan
the entire database for every query to find all the relevant
items in the collection. They used an estimated recall formula
"that combined the relevant items retrieved and displayed in the
SOC search for a query and the relevant items retrieved and
displayed in the DOC search for the same query" [69]. In order
to find the estimated recall ratio for each search, the number of
unique relevant items retrieved and displayed in one catalog was
divided by the total number of unique relevant items retrieved
and displayed for the same query in both catalogs. No attempt
was made to find other potentially relevant items in the
database.
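In other words, the relevant items displayed by both catalogs are
pooled, and each catalog's estimated recall is its share of that
pool. A minimal Python sketch, using invented item identifiers
rather than data from the study, follows:
    def estimated_recall(displayed_here, displayed_other):
        # Unique relevant items retrieved and displayed in this catalog,
        # divided by the pooled unique relevant items from both catalogs.
        pool = set(displayed_here) | set(displayed_other)
        if not pool:
            return 0.0
        return len(set(displayed_here)) / len(pool)
    # Hypothetical relevant items displayed for one query:
    doc = {"b1", "b2", "b3"}             # DOC search
    soc = {"b2", "b3", "b4", "b5"}       # SOC search
    print(estimated_recall(doc, soc))    # 0.6 for DOC
    print(estimated_recall(soc, doc))    # 0.8 for SOC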
The estimated recall scores in the study ranged from a low of 44%
to a high of 75%. They found that "searches were likely to
retrieve and display a large proportion of relevant items that
were unique . . . for the same topic in SOC and DOC" even though
DOC's estimated recall was lower than that of SOC [70]. They
also asked users if they were satisfied with the search results,
and "the majority of patrons expressed satisfaction with the
search in the system yielding higher estimated recall" [71]. The
average precision scores ranged from a low of 26% to a high of
65% [72]. Considering that only a fraction of items retrieved in
the searches were actually displayed, the authors noted that
precision was affected by the order in which retrieved items were
displayed. They found precision to be a less reliable criterion
with which to measure the performance of an online catalog [73].
They asked users which system gave more satisfactory results for
their searches and compared users' responses with the precision
scores. They concluded that "there was no relationship between
patrons' search satisfaction and the precision of their online
searches" [74].
Markey and Demeyer also analyzed a total of 680 subject searches
as part of the DDC Online Project and found that 34 of them (5%)
failed. Two major reasons for subject
search failures were identified as follows: (1) the topic was
marginal (35%), and (2) the users' vocabulary did not match
subject headings (24%) [75]. Their research report gives a
detailed account of the failure analysis of different subject
searching options in an online catalog enhanced with a
classification system (DDC) [76].
+ Page 30 +
Markey and Demeyer apparently did not count "zero retrievals" as
search failures. Nor did they include in their analysis partial
search failures that retrieved at least some relevant documents.
Presumably, this is why the number of search failures they
analyzed was relatively low.
4.2 Studies Utilizing User Satisfaction Measures
It was noted earlier (Section 3.2.2) that analyzing search
failures utilizing user satisfaction measures is extremely
complicated. Few researchers have attempted to look at search
failures in light of user satisfaction.
Hilchey and Hurych analyzed 153 online search evaluation forms
returned by the users in a university library [77]. Almost half
of the respondents (47%) found the search results "most
relevant." An additional 32% of the respondents graded the
results as "half relevant." Only 6% found all search results
relevant. In short, 85% of the respondents felt that search
results were at least half relevant. It should be noted that the
return rate in this study was about 10%. Although the authors
claim that the return rate was "unprejudiced in any way," the
returned questionnaires may have come primarily from satisfied
users.
Ankeny reviewed the studies reporting user satisfaction in end-
user search services such as MEDLINE and BRS/After Dark [78].
Most end-users seemed to be satisfied with the online search
services.
Ankeny also reported the results of two studies that he
conducted. In the first study, he surveyed 190 end-users and
found that 78% of the users located what they wanted in two
business databases (DIALOG Business Connection and Dow Jones
News/Retrieval). More than 81% of the users rated the services
favorably by giving "an overall rating of 4 or 5 on the
five-point scale" [79].
+ Page 31 +
In the second study, Ankeny surveyed some 600 end-users. He used
a stricter measure of search success that had a reliability
coefficient of .90. Search success was not measured on a
five-point scale in the second study. Rather, for a search to
qualify as successful, the user had to answer three questions
affirming that he or she was fully satisfied with the search,
found exactly what was desired, and was not dissatisfied in any
way. He states: "Of the 600 searches in the
sample, 233 met all three criteria for complete success and 367
were less than successful, yielding an overall success rate of
38.8 percent" [80]. Reported reasons for dissatisfaction in 367
"less-than-successful" searches were as follows: system problems;
amount, relevancy, or level of the information retrieved; lack of
better printed instructions; and lack of more informed and
accommodating staff.
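Ankeny's stricter criterion amounts to requiring an affirmative
answer to all three questions at once. The sketch below, with
hypothetical field names, shows how such survey responses might be
scored:
    def strictly_successful(response):
        # A search counts as a complete success only if all three
        # criteria are affirmed.
        return all(response[k] for k in ("fully_satisfied",
                                         "found_exactly_what_was_wanted",
                                         "not_dissatisfied_in_any_way"))
    responses = [
        {"fully_satisfied": True, "found_exactly_what_was_wanted": True,
         "not_dissatisfied_in_any_way": True},
        {"fully_satisfied": True, "found_exactly_what_was_wanted": False,
         "not_dissatisfied_in_any_way": True},
    ]
    successes = sum(strictly_successful(r) for r in responses)
    print(successes / len(responses))  # 0.5 here; Ankeny found 233/600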
Kirby and Miller analyzed search failures encountered by MEDLINE
end-users employing the Colleague search software [81]. In order
to find the search successes and failures, end-users compared
their search results with the mediated follow-up search results.
"Successful" and "incomplete" end-user searches were identified
as follows:
"Successful" Colleague searches were those for which the
follow-up search added nothing important, as indicated by
one of two questionnaire responses: "My search gave
satisfactory results, and nothing ESSENTIAL was added by the
second search" . . . or "Neither search provided
satisfactory results." Both responses were regarded as
"successful" in that the end user was no less successful in
meeting the information need than the trained search
analyst. "Incomplete" Colleague searches were those which
had missed important articles, according to end user
questionnaire responses after reviewing the follow-up search
results" [82, (original emphasis)].
However, end-users were not asked to judge each record retrieved
by either search. Rather, "the comparison was based on search
terms and combinations recorded on the follow-up search form, and
on the number of citations printed in the follow-up search" [83].
Kirby and Miller examined 52 searches. Of the 52 searches, 31
were "incomplete." The major cause of search failures (67.7%)
was the search strategy. The rest of the search failures were
due to system mechanics and database selection (22.6% and 9.7%,
respectively).
+ Page 32 +
4.3 Studies Utilizing Transaction Logs
Several researchers have used transaction logs to study search
failures in online catalogs. Dickson [84] studied a sample of
"zero-hit" author and title searches using the transaction log of
Northwestern University Library's online catalog and analyzed why
the searches failed. She found out that about 23% of author
searches and 37% of title searches retrieved nothing.
Misspellings and mistakes in the search formulation were the
major causes of zero-hit searches.
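Once a transaction log is available, identifying zero-hit searches
of this kind is mechanically simple. The sketch below assumes a
hypothetical log exported as a CSV file with "search_type" and
"hits" columns; actual online catalog logs vary widely in format:
    import csv
    from collections import Counter
    def zero_hit_rates(log_path):
        # Tally total and zero-hit searches by search type.
        totals, zero_hits = Counter(), Counter()
        with open(log_path, newline="") as f:
            for row in csv.DictReader(f):
                stype = row["search_type"]      # e.g., "author", "title"
                totals[stype] += 1
                if int(row["hits"]) == 0:
                    zero_hits[stype] += 1
        return {t: zero_hits[t] / totals[t] for t in totals}
    # zero_hit_rates("opac_log.csv") might return, for example,
    # {"author": 0.23, "title": 0.37}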
Jones [85] examined transaction logs of the Okapi online catalog
and identified several unsatisfactory areas in the operation of
Okapi due to, among others, spelling errors, failures in subject
searching, and user-system interface problems. He analyzed some
300 subject searches performed on Okapi and found that 25% of
them failed: "Using relevance assessments based on a display of
the first ten records, the experimenter decided that 62.4% of
searches were almost certainly successful, 13% may have been
successful, 4.5% were collection failures and 25% failed
absolutely" [86].
In a follow-up study, it was found that 17 out of 122 sessions
(or 13.9%) failed in Okapi (including 2 sessions that failed
because the collection did not contain relevant items). (Most
sessions contained more than one search.) In 7 sessions, the
users' vocabulary did not match that of the catalog (e.g.,
"sociology of shopping"). Another 4 sessions failed because the
topics expressed by the users were too specific (e.g., "textile
industry input-output tables"). Two searches failed because the
search statements did not describe the users' needs (e.g., one
user entered
his query simply as "sterling" although the interviewer found out
he was actually looking for "economics--sterling shares and
gold") [87].
The most recent Okapi report states that "the proportion of (non-
aborted) searches which failed to retrieve any records is very
low indeed (3.9% overall)" [88]. The authors of the report claim
that the improvement is primarily due to: (1) Okapi's "best
match" search, and (2) stemming and automatic cross-referencing
[89].
+ Page 33 +
Peters [90] analyzed the transaction logs of a union online
catalog (the University of Missouri Information Network) and
found that 40% of the searches in that catalog produced zero
hits. He classified the causes of search failures under 14
different groups, including typographical and spelling errors
(10.9% and 9.9%, respectively) and the search system itself
(9.7%). Approximately 40% of the failures were collection
failures (i.e., the item sought was not in the database).
However, it should be noted that Peters' study was not based on a
rigorous analysis of zero-hit searches by re-entering queries to
determine the exact causes of failures. Rather, "the analyzers
made intelligent guesses . . . of the probable causes" [91].
Hunter [92] analyzed thirteen hours of transaction logs,
amounting to some 3,700 searches performed in a large academic
library online catalog. She used the same classification schema
as Peters and categorized the causes of search failures under 18
different groups. The overall search failure rate in Hunter's
study was found to be 54.2%. The major causes of search failures
were identified as the controlled vocabulary in subject searching
(29%), the system itself (18%), and the typographical errors
(15%). However, it was not explained in detail what sorts of
controlled vocabulary failures occurred and what the specific
causes were.
C. Walker and her colleagues [93] obtained similar results when
they studied the problems encountered by clinical end-users of
MEDLINE and GRATEFUL MED. They defined search failure, which
they called "unproductive search," as "one that did not retrieve
any citations," and they analyzed 172 such searches [94]. They
found that 48% of the search failures occurred because of some
flaw in the search strategy. The software in use was responsible
for 41% of the search failures. System failures constituted some
11% of all search failures.
Zink [95] analyzed transaction logs of 6,118 searches that took
place on the WolfPAC online catalog at the University of Nevada.
He found that:
+ Page 34 +
more than one of every four (27.81 percent or 1,702)
failed to retrieve at least one bibliographical record.
Subject searches yielded 667 unsuccessful searches, or 39.19
percent of the total number of unsuccessful searches.
Author searches resulted in 250 unsuccessful searches (14.69
percent of the total). Searches by all other criteria
accounted for 300 unsuccessful searches (17.63 percent of
the total) [96].
Collection failures (57.60%), misspellings (18%), and placing
first name "improperly" before last name (15.20%) caused most of
the author search failures. Similar failure rates were also
observed for the title searches (collection failures, 61.86%, and
misspellings, 14.23%). In 111 unsuccessful title searches
(22.89%), searchers seemed to be attempting to find subject or
author information. Sixty-three percent of the subject searches
failed because the user-entered subject words were not
"legitimate" Library of Congress subject headings. Misspellings
and collection failures accounted for 23.24% and 10.64% of all
subject search failures, respectively.
Most of the studies summarized above benefitted from transaction
monitoring to the extent that "zero-hit" searches were identified
from transaction logs [97]. Researchers examined the zero-hit
searches in order to find out why a particular search query
failed to retrieve anything in the database. Unlike Lancaster
[98], they did not attempt to identify the causes of recall and
precision failures.
4.4 Studies Utilizing the Critical Incident Technique
It was mentioned earlier (Section 3.2.4) that Wilson, Starr-
Schneidkraut, and Cooper studied searching in MEDLINE using the
critical incident technique [99]. The researchers first devised
a sampling strategy and developed an interview protocol to elicit
the desired information from the subjects. They then developed
three "frames of reference" to analyze the interview data: "(1)
'Why was the information needed?,' (2) 'How did the information
obtained impact the decision-making of the individual who needed
the information?,' and (3) 'How did the information obtained
impact the outcome of the clinical or other situation that
occasioned the search?'" [100]. After a qualitative analysis of
the critical incident reports, the frames of reference were used
to create three similar taxonomies.
+ Page 35 +
In the same study, they asked users to explain what they needed
the information for and whether they were satisfied with the
search outcome. They used incident forms to record the user's
account of why a particular search failed or succeeded and, with
permission, they tape-recorded the user's comments. They later
tried to match these "incident reports" against MEDLINE
transaction log records for each search in order to find out the
actual reasons for search failures and successes.
They examined some 26 user-designated ineffective incident
reports in order to "characterize the nature of the ineffective
searches, analyze the relationship between what the user said and
what the transaction log said happened during the search, and
ascertain, by performing an analogous MEDLINE search, whether a
search could have been performed which would have met the user's
objective" [101]. Most ineffective searches (23 out of 26) were
identified as such because the users "could not find what they
were looking for and/or could not find relevant materials." An
appendix summarizing the analysis of each ineffective search
accompanied their research report.
After extensive examination of interview transcripts and
transaction logs for ineffective searches, the researchers
concluded that users did not appear to comprehend:
1. How to do subject searching.
2. How MeSH [Medical Subject Headings] works.
3. How they can apply that understanding to map their
search requests into a vocabulary that is likely to
retrieve considerably more relevant materials [102].
It appears that the critical incident technique can successfully
be used to analyze search failures in online catalogs as
well. Matching incident reports against transaction logs is
especially promising. Since the analyst will, through incident
reports, gather contextual data for each search query, more
informed relevance judgments can be made. Furthermore, this
technique also can be utilized to compare user-designated search
effectiveness with that obtained through traditional retrieval
effectiveness measures.
+ Page 36 +
4.5 Other Search Failure Studies
Some experimental studies looked into strict matching failures
that occurred when users tried to do catalog searches.
Gouke and Pease [103] analyzed the success rates of the users in
matching titles and found that the success rate in finding
"nonproblem" titles was 82%, whereas the rate was 48% for
"problem" titles. Almost half of the users failed to match
simple titles in the online catalog for various reasons (e.g.,
titles appearing as subject, hyphenated words, words on stoplist,
foreign titles, and abbreviations).
Alzofon and Van Pulis [104] surveyed 430 users of the LCS online
catalog of the Ohio State University Libraries to identify the
patterns of searching. They also studied the success rates for
known-item and subject searches. They replicated the users'
searches on the catalog and found that the author-title search
had a success rate of 85% compared with 77% for author searches
and 68% for subject searches.
Janosky, Smith, and Hildreth [105] studied the errors that users
made in performing searches in the LCS online catalog of the Ohio
State University Libraries. They recruited 30 volunteer students
who had no prior experience with the online catalog under
investigation. Each student searched the same four queries in the
catalog: one subject search and three known-item searches. The
authors summarize the procedure and results as follows:
They [users] were asked to search until they either found
the item(s) in question or believed that the item(s) was not
present in the library system. They were told that it was
possible that the item in question was not contained in the
library. While searching, subjects were asked to think
aloud . . . . A success rate was computed for each search.
Since all search items were actually in the library system
(subjects were not told this fact), "success" is defined as
correctly locating the information requested about an item .
. . . For the four searches, the success rate ranged from a
high of 58% to a low of 0% [106].
+ Page 37 +
It appears that users experienced serious problems with the
mechanical aspects of searching in this catalog, which in turn
influenced the success rate considerably. For instance,
"HELP-AUTHOR" was the "correct" help command, and users who
entered "HELP AUTHOR" failed to get any help about author
searches (notice the hyphen between the two words). On-screen
and offline instructions in this system that advised users to
type in commands "exactly as listed" did not seem to do much to
help users recover from such search failures. A more forgiving
user interface could easily have prevented similar failures from
occurring in the first place. The authors concluded: "It is not
sufficient to simply tell users that they have made an error.
Failures to deal with the causes of an error often snowballed
into a whole string of misinterpretations, resulting in complete
failures to solve the problem of using LCS" [107].
4.6 Related Studies
A few studies that were not directly concerned with the causes of
search failures, but which nevertheless addressed relevant issues,
are summarized below.
Hildreth considers the "vocabulary" problem as the major
retrieval problem in today's online catalogs and asserts that "no
other issue is as central to retrieval performance and user
satisfaction" [108]. This may be because controlled vocabularies
are far more complicated than users can easily grasp in a short
period of time. Several researchers have found that the lack of
knowledge concerning the Library of Congress Subject Headings
(LCSH) is one of the most important reasons why searches fail in
online catalogs [109]. Larson [110] found that almost half of
all subject searches in MELVYL retrieved nothing. More recently,
Larson [111] analyzed the use of MELVYL over a longer period of
time (six years) and found that there is a significant positive
correlation between the failure rate and the percentage of
subject searching. This confirms the findings of an earlier
formal analysis of factors contributing to success and
satisfaction: "problems with subject searching were the most
important deterrents to user satisfaction" [112].
+ Page 38 +
Larson [113] reviewed the literature on subject search failures
in online catalogs along with remedies offered to reduce subject
search problems. Subject retrieval failures in online catalogs
could be reduced in a number of ways, including assigning more
subject headings to bibliographic records, providing keyword
searching, and enhancing classification retrieval.
Carlyle studied the match between users' vocabulary and LCSH
using transaction logs and found that "single LCSH headings match
user expressions exactly about 47% of the time" [114]. A study
conducted by Van Pulis and Ludy [115] showed that 53% of the
users' terms matched subject headings in the online catalog.
Vizine-Goetz and Markey Drabenstott extracted queries from
transaction logs of three online catalogs (SULIRS, ORION, and
LS/2000) and analyzed them "both by computer and manually to
determine the extent to which they matched subject headings"
[116]. They found that less than half of the subject query terms
exactly matched the Library of Congress subject headings. The
findings suggest that some search failures can be attributed to
controlled vocabularies in online catalogs. However, as the
authors note, "such analyses . . . reveal little about whether
matching terms satisfactorily represent users' topics of
interest" [117].
5.0 Conclusion
It appears that there is no agreed-upon definition of what
constitutes search failure in document retrieval systems. In
part, this is due to the multiplicity of data gathering tools and
techniques used in the analysis of search failures (e.g., the
critical incident technique, controlled experiments, interviews,
questionnaires, talk-aloud techniques, and transaction
monitoring). Different data gathering methods have different
strengths and weaknesses.
+ Page 39 +
Many of the studies reviewed in this paper examined search
failures based on zero retrievals in online catalogs. Partial
search failures have been studied much less frequently.
Experiments that investigate the relationship between search
failures and user needs or characteristics are even scarcer.
This is not surprising because identifying zero retrievals from
transaction logs is relatively easy and inexpensive. By
contrast, analyzing search failures using precision and recall
measures is more expensive and time-consuming. So is the
investigation of user needs and interests, which could help
researchers make more informed judgments about search failures
identified through other means. No single method or technique is
sufficient by itself to analyze all search failures in document
retrieval systems and to interpret the findings.
As for the causes of search failures, transaction logs of the
searches that retrieved nothing in online catalogs reveal that
users have numerous mechanical problems, such as improperly keyed
commands and misspelled words. Such problems can be
alleviated to a certain extent by designing more intuitive user
interfaces that would not only take into account user expertise
and task complexity, but also would give advice and simplify the
user's task [118]. Newer online catalogs are dealing with these
problems by incorporating more sophisticated stemming algorithms
and Soundex-type techniques to correct misspellings.
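As an illustration of the "Soundex-type" approach mentioned above,
the following simplified Soundex sketch reduces a word to a
letter-plus-three-digit code so that many misspellings map to the
same key. It is a simplified rendering of the classic algorithm,
offered only to show the idea:
    def soundex(word):
        # Simplified Soundex: keep the first letter, map remaining
        # consonants to digits, drop vowels, collapse repeats, pad to 4.
        codes = {**dict.fromkeys("bfpv", "1"),
                 **dict.fromkeys("cgjkqsxz", "2"),
                 **dict.fromkeys("dt", "3"), "l": "4",
                 **dict.fromkeys("mn", "5"), "r": "6"}
        word = word.lower()
        if not word:
            return ""
        digits, prev = [], codes.get(word[0], "")
        for ch in word[1:]:
            code = codes.get(ch, "")
            if code and code != prev:
                digits.append(code)
            prev = code
        return (word[0].upper() + "".join(digits) + "000")[:4]
    print(soundex("Smith"), soundex("Smyth"))        # S530 S530
    print(soundex("catalog"), soundex("catalogue"))  # C342 C342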
Transaction log analysis also reveals that users' lack of
knowledge of controlled vocabularies and query languages causes
many search failures and, subsequently, results in user
frustration. Most users are not aware of the role of controlled
vocabularies in document retrieval systems. They do not seem to
understand the structure of rigid indexing and query languages.
Consequently, their search query terms, which are expressed in
their own words, often fail to match the titles and subject
headings of the documents, causing search failures. "Brittle"
query languages based on Boolean logic tend to exacerbate this
situation further, especially for complicated search queries.
+ Page 40 +
Transaction monitoring is the most appropriate technique for
studying search failures whose causes are obvious
(e.g., zero retrievals due to misspellings or collection
failures). However, transaction monitoring seems to be less
efficient in dealing with more complicated failures. For
example, partial failures can be best studied with the help of
the user. After all, the user is the key person in the analysis
of search failures. It is the user who can explain what he or
she was trying to do and whether it was successful. Such input
from the user puts each search into perspective and provides much
needed contextual information. However, users are not identified
in most transaction log studies. Without user
feedback, researchers are faced with the unenviable task of
coming up with a rational explanation as to why a particular
search failed.
Notwithstanding the circumstantial evidence gathered through
various online catalog studies in the past, studies examining the
match between users' vocabulary and that of online document
retrieval systems are scarce. Moreover, the probable effects of
mismatching on search failures are yet to be fully explored.
Users prefer to be able to express their information needs in
natural language, but most contemporary online catalogs cannot
accommodate search requests submitted in natural language form.
However, it is believed that natural language query interfaces
may reduce search failures in document retrieval systems, since
natural language search terms are more likely to match the titles
of the documents in the database. Consequently, the role of
natural language interfaces in reducing search failures in
document retrieval systems needs to be thoroughly studied.
User input should be sought when analyzing search failures with
retrieval effectiveness measures such as precision and recall.
The same can be said for failure analysis studies that are based
on user satisfaction measures. We should strive for full-scale
user involvement as much as possible in every stage of analysis
of search failures. Despite user participation in the evaluation
process, search failures in document retrieval systems are
unlikely to be eliminated altogether. However, only through user
participation will we find the real causes of search failures
and, consequently, build better document retrieval systems.
+ Page 41 +
Notes
1. M. E. Maron, "Probabilistic Retrieval Models," in Progress in
Communication Sciences, vol. 5, ed. Brenda Dervin and Melvin J.
Voigt. (Norwood, NJ: Ablex, 1984), 145-176.
2. Ibid., 155.
3. Ibid.
4. C. J. Van Rijsbergen, Information Retrieval, 2nd ed. (London:
Butterworths, 1979), 10.
5. David C. Blair and M. E. Maron, "An Evaluation of Retrieval
Effectiveness for a Full-Text Document-Retrieval System,"
Communications of the ACM 28 (March 1985): 291.
6. S. E. Robertson, M. E. Maron, and W. S. Cooper, "Probability
of Relevance: A Unification of Two Competing Models for Document
Retrieval," Information Technology: Research and Development 1
(1982): 1.
7. Tefko Saracevic, "Relevance: A Review of and a Framework for
the Thinking on the Notion in Information Science," Journal of
the American Society for Information Science 26 (1975): 321-343.
See also: Michael Eisenberg and Linda Schamber, "Relevance: The
Search for a Definition," in ASIS '88: Proceedings of the 51st
ASIS Annual Meeting, Atlanta, Georgia, October 23-27, 1988, ed.
Christine L. Borgman and Edward Y. H. Pai. (Medford, NJ: Learned
Information, 1988), 164-168.
8. An attempt has been made in Cranfield II to plot
recall/fallout graphs. The size of the collection used in this
experiment was relatively small (1,400 documents) and many tests
were done with 200 documents. Nevertheless, no analysis has been
performed to find out the causes of fallout failures. For
details, see: Cyril Cleverdon, Jack Mills, and Michael Keen,
Factors Determining the Performance of Indexing Systems, Volume
1, Design (Cranfield, England: Aslib, 1966); and Cyril Cleverdon
and Michael Keen, Factors Determining the Performance of Indexing
Systems, Volume 2, Test Results (Cranfield, England: Aslib,
1966).
+ Page 42 +
9. Ray R. Larson, "Between Scylla and Charybdis: Subject
Searching in the Online Catalog," in Advances in Librarianship,
vol. 15, ed. Irene P. Godden (San Diego, CA: Academic Press,
1991), 188. See also: S. E. Wiberley and R. A. Dougherty,
"Users' Persistence in Scanning Lists of References," College &
Research Libraries 49 (1988): 149-156.
10. J. L. Kuhns implied that frustration usually occurs when a
user reaches his or her "futility point" in a given search. The
futility point is defined as "the number of retrieved documents
the inquirer is willing to browse through before giving up his
search in frustration." Source: David C. Blair, "Searching
Biases in Large Interactive Document Retrieval Systems," Journal
of the American Society for Information Science 31 (July 1980):
271.
11. Michael Buckland and Fredric Gey, personal communication,
1991.
12. Robert Wages, "Can Easy Searching be Good Searching? A Model
for Easy Searching," Online 13 (May 1989): 80.
13. William S. Cooper, "On Selecting a Measure of Retrieval
Effectiveness," Journal of the American Society for Information
Science 24 (1973): 87-100, 413-424. Compare this with: Dagobert
Soergel, "Is User Satisfaction a Hobgoblin?," Journal of the
American Society for Information Science 27 (July-August 1976):
256-259.
14. Ibid., 88.
15. Judith A. Tessier, Wayne W. Crouch, and Pauline Atherton,
"New Measures of User Satisfaction with Computer-Based Literature
Searches," Special Libraries 68 (November 1977): 383-389.
16. Marcia J. Bates, "Factors Affecting Subject Catalog Search
Success," Journal of the American Society for Information Science
28 (May 1977): 161-169.
17. Mark T. Kinnucan, "The Size of Retrieval Sets," Journal of
the American Society for Information Science 43 (January 1992):
73.
+ Page 43 +
18. Susan E. Hilchey and Jitka M. Hurych, "User Satisfaction or
User Acceptance? Statistical Evaluation of an Online Reference
service," RQ 24 (Summer 1985): 455.
19. Renata Tagliacozzo, "Estimating the Satisfaction of
Information Users," Bulletin of the Medical Library Association
65 (April 1977): 248.
20. Ibid.
21. Melvon L. Ankeny, "Evaluating End-User Services: Success or
Satisfaction," Journal of Academic Librarianship 16 (January
1991): 356.
22. Ibid., 354. See also: Ethel Auster and Stephen B. Lawton,
"Search Interview Techniques and Information Gain as Antecedents
of user satisfaction with Online Bibliographic Retrieval,"
Journal of the American Society for Information Science 35 (March
1984): 90-103.
23. Sandra R. Wilson, Norma Starr-Schneidkraut, and Michael D.
Cooper, Use of the Critical Incident Technique to Evaluate the
Impact of MEDLINE. (Palo Alto, CA: American Institutes for
Research, 1989), AIR-64600-9/89-FR. For hypothetical examples as
to the importance of unretrieved but relevant documents, see:
Soergel, "Is User Satisfaction a Hobgoblin?," 258-259.
24. Ankeny, "Evaluating End-User Services," 356.
25. Debora Cheney, "Evaluation-Based Training: Improving the
Quality of End-User Searching," Journal of Academic Librarianship
17 (July 1991): 155.
26. Tefko Saracevic and Paul Kantor, "A Study of Information
Seeking and Retrieving. II. Users, Questions, and
Effectiveness," Journal of the American Society for Information
Science 39 (May 1988): 177-196.
+ Page 44 +
27. Tefko Saracevic, Paul Kantor, Alice Y. Chamis, and Donna
Trivison, "A Study of Information Seeking and Retrieving. I.
Background and Methodology," Journal of the American Society for
Information Science 39 (May 1988): 161-176. Note that it is not
discussed in this paper how they calculated the precision/recall
ratios and what figures (i.e., number of records (a) retrieved,
(b) relevant, (c) not relevant) they obtained. As they stressed
several times in their report, the recall figures they obtained
were not absolute but comparative. For a more detailed account,
see Part II of their article.
28. Saracevic and Kantor, "A Study of Information Seeking and
Retrieving. Part II," 193.
29. Ibid.
30. Ray R. Larson, "The Decline of Subject Searching: Long Term
Trends and Patterns of Index Use in an Online Catalog," Journal
of the American Society for Information Science 42 (April 1991):
198.
31. Charles W. Simpson, "OPAC Transaction Log Analysis: The
First Decade," in Advances in Library Automation and Networking,
vol. 3, ed. Joe A. Hewitt (Greenwich, Conn.: JAI Press, 1989),
35-67.
32. J. Dickson, "Analysis of User Errors in Searching an Online
Catalog," Cataloging & Classification Quarterly 4 (Spring 1984):
19-38; Thomas A. Peters, "When Smart People Fail: An Analysis of
the Transaction Log of an Online Public Access Catalog," Journal
of Academic Librarianship 15 (November 1989): 267-273; Rhonda N.
Hunter, "Successes and Failures of Patrons Searching the Online
Catalog at a Large Academic Library: A Transaction Log Analysis,"
RQ 30 (Spring 1991): 395-402; and Steven D. Zink, "Monitoring
User Search Success through Transaction Log Analysis: the WolfPac
Example," Reference Services Review 19 (1991): 49-56.
+ Page 45 +
33. Martha Kirby and Naomi Miller, "MEDLINE Searching on
Colleague: Reasons for Failure or Success of Untrained End
Users," Medical Reference Services Quarterly 5 (1986): 17-34; and
Cynthia J. Walker et al., "Problems Encountered by Clinical End
Users of MEDLINE and GRATEFUL MED," Bulletin of the Medical
Library Association 79 (January 1991): 67-69.
34. Hunter, "Successes and Failures," 401.
35. Stephen Walker and Micheline Hancock-Beaulieu, Okapi at
City: An Evaluation Facility for Interactive Information
Retrieval (London: The British Library, 1991), British Library
Research Report 6056; and Ray R. Larson, "Classification
Clustering, Probabilistic Information Retrieval and the Online
Catalog," Library Quarterly 61 (April 1991): 133-173.
36. Stephen Walker and Richard M. Jones, Improving Subject
Retrieval in Online Catalogues, 1: Stemming, Automatic Spelling
Correction and Cross-Reference Tables (London: The British
Library, 1987), 139, British Library Research Paper 24. See
also: R. Jones, "Improving Okapi: Transaction Log Analysis of
Failed Searches in an Online Catalogue," Vine no. 62 (1986):
3-13.
37. Larson, "The Decline of Subject Searching," 198.
38. Sharon Seymour, "Online Public Access Catalog User Studies:
A Review of Research Methodologies, March 1986-November 1989,"
Library and Information Science Research 13 (1991): 97.
39. Micheline Hancock-Beaulieu, Stephen Robertson and Colin
Neilson, "Evaluation of Online Catalogues: Eliciting Information
from the User," Information Processing & Management 27 (1991):
532.
40. John C. Flanagan, "The Critical Incident Technique,"
Psychological Bulletin 51 (1954): 327.
41. Wilson, Starr-Schneidkraut and Cooper, Use of the Critical
Incident Technique to Evaluate the Impact of MEDLINE, 2.
+ Page 46 +
42. Sammy R. Alzofon and Noelle Van Pulis, "Patterns of
Searching and Success Rates in an Online Public Access Catalog,"
College & Research Libraries 45 (March 1984): 110-115; Marcia J.
Bates, "Subject Access in Online Catalogs: a Design Model,"
Journal of American Society for Information Science 37 (1986):
357-376; Christine L. Borgman, "Why are Online Catalogs Hard to
Use? Lessons Learned from Information-Retrieval Studies," Journal
of American Society for Information Science 37 (1986): 387-400;
Pauline A. Cochrane and Karen Markey, "Catalog Use Studies Since
the Introduction of Online Interactive Catalogs: Impact on Design
for Subject Access," Library and Information Science Research 5
(1983): 337-363; Mary Noel Gouke and Sue Pease, "Title Searches
in an Online Catalog and a Card Catalog: A Comparative Study of
Patron Success in Two Libraries," Journal of Academic
Librarianship 8 (July 1982): 137-143; Charles R. Hildreth,
Intelligent Interfaces and Retrieval Methods for Subject
Searching in Bibliographic Retrieval Systems (Washington, DC:
Cataloging Distribution Service, Library of Congress, 1989);
Beverly Janosky, Philip J. Smith, and Charles Hildreth, "Online
Library Catalog Systems: An Analysis of User Errors,"
International Journal of Man-Machine Studies 25 (1986): 573-592;
Neal N. Kaske, A Comprehensive Study of Online Public Access
Catalogs: an Overview and Application of Findings (Dublin, OH:
OCLC, 1983), OCLC Research Report # OCLC/OPR/RR-83-4; Cheryl
Kern-Simirenko, "OPAC User Logs: Implications for Bibliographic
Instruction," Library Hi Tech 1 (1983): 27-35; Ray R. Larson,
"Workload Characteristics and Computer System Utilization in
Online Library Catalogs" (Ph.D. diss., University of California
at Berkeley, 1986); Gary S. Lawrence, V. Graham, and H. Presley,
"University of California Users Look at MELVYL: Results of a
Survey of Users of the University of California Prototype Online
Union Catalog," Advances in Library Administration 3 (1984):
85-208; Karen Markey, Subject Searching in Library Catalogs:
Before and After the Introduction of Online Catalogs (Dublin, OH:
OCLC, 1984); Karen Markey, "Users and the Online Catalog: Subject
Access Problems," in The Impact of Online Catalogs, ed. J.R.
Matthews. (New York: Neal-Schuman, 1986), 35-69; Joseph K.
Matthews, A Study of Six Public Access Catalogs: a Final Report
Submitted to the Council on Library Resources, Inc. (Grass
Valley, CA: J. Matthews and Assoc., Inc., 1982); Joseph Matthews,
Gary S. Lawrence, and Douglas Ferguson, eds., Using Online
Catalogs: a Nationwide Survey. (New York: Neal-Schuman, 1983);
and Chih Wang, "The Online Catalogue, Subject Access and User
Reactions: A Review," Library Review 34 (1985): 143-152.
+ Page 47 +
43. Examples of such studies are (in chronological order): Cyril
W. Cleverdon, Report on the Testing and Analysis of an
Investigation into the Comparative Efficiency of Indexing
Systems (Cranfield, England: Aslib, 1962); Cleverdon, Mills and
Keen, Factors Determining the Performance of Indexing Systems,
Volume 1, Design; Cleverdon and Keen, Factors Determining the
Performance of Indexing Systems, Volume 2, Test Results; F. W.
Lancaster, Evaluation of the MEDLARS Demand Search Service.
(Washington, DC: US Department of Health, Education and Welfare,
1968); F. W. Lancaster, "MEDLARS: Report on the Evaluation of Its
Operating Efficiency," American Documentation 20 (1969): 119-142;
Dickson, "Analysis of User Errors in Searching an Online
Catalog"; Blair and Maron, "An Evaluation of Retrieval
Effectiveness for a Full-Text Document-Retrieval System"; Jones,
"Improving Okapi: Transaction Log Analysis of Failed Searches in
an Online Catalogue"; Karen Markey and Anh N. Demeyer, Dewey
Decimal Classification Online Project: Evaluation of a Library
Schedule and Index Integrated into the Subject Searching
Capabilities of an Online Catalog (Dublin, OH: OCLC, 1986),
Report Number: OCLC/OPR/RR-86-1; Kirby and Miller, "MEDLINE
Searching on Colleague"; S. Walker and Jones, Improving Subject
Retrieval in Online Catalogues; Wilson, Starr-Schneidkraut, and
Cooper, Use of the Critical Incident Technique to Evaluate the
Impact of MEDLINE; Simone Klugman, "Failures in Subject
Retrieval," Cataloging & Classification Quarterly 10 (1989): 9-
35; Peters, "When Smart People Fail"; Ankeny, "Evaluating End-
User Services: Success or Satisfaction"; Hunter, "Successes and
Failures"; C. Walker et al., "Problems Encountered by Clinical
End Users of MEDLINE and GRATEFUL MED"; and Zink, "Monitoring
User Search Success through Transaction Log Analysis: the WolfPac
Example."
44. Cleverdon, Report on the Testing and Analysis; Cleverdon,
Mills and Keen, Factors Determining the Performance of Indexing
Systems, Volume 1, Design; and Cleverdon and Keen, Factors
Determining the Performance of Indexing Systems, Volume 2, Test
Results.
45. Cleverdon, Report on the Testing and Analysis, 1.
46. Ibid., 8-9.
+ Page 48 +
47. Ibid., 89. The design and findings of the Cranfield I
experiment have been criticized by many authors. For example,
see: Don R. Swanson, "The Evidence Underlying the Cranfield
Results," Library Quarterly 35 (1965): 1-20. For a review of the
Cranfield tests, see: Karen Sparck Jones, "The Cranfield Tests,"
in Information Retrieval Experiment, ed. Karen Sparck Jones
(London: Butterworths, 1981), 256-284.
48. Ibid., 11.
49. Swanson, "The Evidence Underlying the Cranfield Results,"
5.
50. This percentage was obtained by averaging the figures given
in the fifth column of Table 3.1 of Cleverdon, Report on the
Testing and Analysis, 22.
51. This summary is based on Cleverdon, Report on the Testing
and Analysis, Chapter 5. The report also includes the complete
summary of the analysis of search failures (Appendix 5A) and
"some examples of the complete analysis of the individual
documents" (Appendix 5B).
52. Ibid., 88.
53. Cleverdon, Mills, and Keen, Factors Determining the
Performance of Indexing Systems, Volume 1, Design; and Cleverdon
and Keen, Factors Determining the Performance of Indexing
Systems, Volume 2, Test Results.
54. Cleverdon and Keen, Factors Determining the Performance of
Indexing Systems, Volume 2, Test Results, i ("Summary"). For the
detailed performance figures along with recall/precision graphs,
see volume 2 of the full report.
55. Lancaster, Evaluation of the MEDLARS Demand Search Service.
56. Ibid., 16, 19.
57. Ibid., 19-20.
58. Lancaster, "MEDLARS: Report on the Evaluation of Its
Operating Efficiency," 123.
+ Page 49 +
59. Ibid., 127.
60. Ibid., 131.
61. Blair and Maron, "An Evaluation of Retrieval Effectiveness
for a Full-Text Document-Retrieval System."
62. Ibid., 290-291.
63. Ibid., 291-293.
64. Ibid., 293.
65. Ibid., 295.
66. Markey and Demeyer, Dewey Decimal Classification Online
Project, 1.
67. Ibid., 109.
68. Ibid., 162.
69. Ibid., 144.
70. Ibid., 146.
71. Ibid., 149.
72. Ibid., 165, Table 42.
73. Ibid., 162.
74. Ibid., 166.
75. Ibid., 182.
76. Ibid.; especially, see Chapter 8, 173-291.
77. Hilchey and Hurych, "User Satisfaction or User Acceptance?"
78. Ankeny, "Evaluating End-User Services," 352-354.
79. Ibid., 354.
+ Page 50 +
80. Ibid.
81. Kirby and Miller, "MEDLINE Searching on Colleague."
82. Ibid., 20.
83. Ibid.
84. Dickson, "Analysis of User Errors in Searching an Online
Catalog," 26.
85. Jones, "Improving Okapi: Transaction Log Analysis of Failed
Searches in an Online Catalogue."
86. Ibid., 7-8.
87. S. Walker and Jones, Improving Subject Retrieval in Online
Catalogues, 117-119.
88. S. Walker and Hancock-Beaulieu, Okapi at City, 30. The
authors also surveyed the users to find out if they were
satisfied with their search results using a five-point
satisfaction scale. Ninety-five out of a total of 120 users (or
80%) indicated that they were satisfied with the search outcome
(they marked 4 or 5 on the scale), 19 users (or 16%) "had some
reservations" (i.e., they marked 3 on the scale), and 6 users (or
4%) "were negative" (i.e., they marked 1 or 2). Ibid., 24-25.
89. Ibid., 31.
90. Peters, "When Smart People Fail."
91. Ibid., 270.
92. Hunter, "Successes and Failures."
93. C. Walker, et al., "Problems Encountered by Clinical End
Users of MEDLINE and GRATEFUL MED."
94. Ibid., 68.
+ Page 51 +
95. Zink, "Monitoring User Search Success."
96. Ibid., 51.
97. The following studies should be exempted from this as their
analyses were not based on zero-hit searches only: Jones,
"Improving Okapi: Transaction Log Analysis of Failed Searches in
an Online Catalogue"; S. Walker and Jones, Improving Subject
Retrieval in Online Catalogues; and S. Walker and
Hancock-Beaulieu, Okapi at City.
98. Lancaster, Evaluation of the MEDLARS Demand Search Service.
99. Wilson, Starr-Schneidkraut and Cooper, Use of the Critical
Incident Technique to Evaluate the Impact of MEDLINE.
100. Ibid., 5.
101. Ibid., 81.
102. Ibid., 83-84.
103. Gouke and Pease, "Title Searches in an Online Catalog and a
Card Catalog," 139.
104. Alzofon and Van Pulis, "Patterns of Searching and Success
Rates in an Online Public Access Catalog," 113.
105. Janosky, Smith and Hildreth, "Online Library Catalog
Systems: An Analysis of User Errors."
106. Ibid., 576.
107. Ibid., 591.
108. Hildreth, Intelligent Interfaces and Retrieval Methods for
Subject Searching in Bibliographic Retrieval Systems, 69.
+ Page 52 +
109. Bates, "Subject Access in Online Catalogs: a Design Model";
Borgman, "Why are Online Catalogs Hard to Use? Lessons Learned
from Information-Retrieval Studies"; David R. Gerhan, "LCSH in
vivo: Subject Searching Performance and Strategy in the OPAC
Era," Journal of Academic Librarianship 15 (1989): 83-89;
Klugman, "Failures in Subject Retrieval"; David Lewis, "Research
on the Use of Online Catalogs and Its Implications for Library
Practice," Journal of Academic Librarianship 13 (1987): 152-157;
Karen Markey, "Users and the Online Catalog: Subject Access
Problems," in The Impact of Online Catalogs, ed. J.R. Matthews.
(New York: Neal-Schuman, 1986), 35-69; Wang, "The Online
Catalogue, Subject Access and User Reactions: A Review."
110. Larson, "Between Scylla and Charybdis: Subject Searching in
the Online Catalog," 181.
111. Larson, "The Decline of Subject Searching," 208.
112. University of California Users Look at MELVYL: Results of a
Survey of Users of the University of California Prototype Online
Union Catalog. (Berkeley, CA: The University of California,
1983), 97.
113. Larson, "Classification Clustering, Probabilistic
Information Retrieval and the Online Catalog," 136-144.
114. Allyson Carlyle, "Matching LCSH and User Vocabulary in the
Library Catalog," Cataloging & Classification Quarterly 10
(1989): 37.
115. Noelle Van Pulis and L. E. Ludy, "Subject Searching in an
Online Catalog with Authority Control," College & Research
Libraries 49 (1988): 528-529.
116. Diane Vizine-Goetz and Karen Markey Drabenstott, "Computer
and Manual Analysis of Subject Terms entered by Online Catalog
Users," in ASIS '91: Proceedings of the 54th ASIS Annual Meeting.
Washington, DC, October 27-31, 1991, ed. Jose-Marie Griffiths
(Medford, NJ: Learned Information, 1991), 157.
+ Page 53 +
117. Ibid., 161.
118. Michael K. Buckland and Doris Florian, "Expertise, Task
Complexity, and Artificial Intelligence: A Conceptual Framework,"
Journal of the American Society for Information Science 42
(October 1991): 635-643.
Acknowledgements
The helpful comments and suggestions of the referees are
gratefully acknowledged.
About the Author
Yasar Tonta, Ph.D. candidate, School of Library and Information
Studies, University of California, Berkeley, CA 94720.
-----------------------------------------------------------------
The Public-Access Computer Systems Review is a refereed
electronic journal that is distributed on BITNET, Internet, and
other computer networks. There is no subscription fee.
To subscribe, send an e-mail message to LISTSERV@UHUPVM1 (BITNET)
or LISTSERV@UHUPVM1.UH.EDU (Internet) that says:
SUBSCRIBE PACS-P First Name Last Name. PACS-P subscribers also
receive two electronic newsletters: Current Cites and Public-
Access Computer Systems News.
This article is Copyright (C) 1992 by Yasar Tonta. All Rights
Reserved.
The Public-Access Computer Systems Review is Copyright (C) 1992
by the University Libraries, University of Houston. All Rights
Reserved.
Copying is permitted for noncommercial use by computer
conferences, individual scholars, and libraries. Libraries are
authorized to add the journal to their collection, in electronic
or printed form, at no charge. This message must appear on all
copied material. All commercial use requires permission.
-----------------------------------------------------------------