Machine Learning List: Vol. 1 No. 9
Tuesday, Oct 2, 1989
Contents:
Report on the IJCAI-89 Workshop on Knowledge Discovery in Databases
IJCAI ML papers
The Machine Learning List is moderated. Contributions should be relevant to
the scientific study of machine learning. Mail contributions to ml@ics.uci.edu.
Mail requests to be added or deleted to ml-request@ics.uci.edu. Back issues
of Volume 1 may be FTP'd from /usr2/spool/ftp/pub/ml-list/V1/<N> or N.Z where
N is the number of the issue; Host ics.uci.edu; Userid & password: anonymous
----------------------------------------------------------------------
From: Gregory Piatetsky-Shapiro <gps0@gte.COM>
Subject: Report on the IJCAI-89 Workshop on Knowledge Discovery in Databases
Date: Thu, 28 Sep 89 11:09:15 EDT
Gregory Piatetsky-Shapiro (gps0%gte.com@relay.cs.net)
GTE Laboratories, 40 Sylvan Road, Waltham MA 02254
----------
Computers promise fountains of wisdom, but only deliver a flood of data.
--------- (a frustrated MIS executive)
1. Background
=============
The current growth in the size and number of large databases creates both the
need and an opportunity for extracting knowledge from them. Recent results have
been reported on extracting medical diagnostic rules, drug side effects,
classes of stars, rules for expert systems, and rules for semantic query
optimization. Several commercial systems for discovery in databases have
appeared in the last two years.
Thus the timing was right for this workshop, which brought together many
leading researchers in Machine Learning, Expert Databases, Knowledge
Acquisition, Fuzzy Sets, and other areas. The interaction revealed different
viewpoints on Knowledge Discovery: while ML people look at a database as a flat
"data set", something to be exploited, ignoring transactions, security, and
updates, DB people see databases as active entities that should be enriched by
integrating the discovered knowledge into the database or meta-database. The
workshop also demonstrated a great variety of existing approaches to machine
discovery.
I was the workshop chairman, and the program committee consisted of Jaime
Carbonell (Carnegie Mellon), William Frawley (GTE Laboratories), Kamran Parsaye
(IntelligenceWare, Los Angeles), J. Ross Quinlan (University of Sydney,
Australia), Michael Siegel (MIT), and Ramasamy Uthurusamy (GM Research
Laboratories). We received 69 submissions from 12 countries. Of these, 39
submissions from 9 countries were accepted. Nine very interesting papers were
presented in three sessions: Data-Driven Discovery Methods, Knowledge-Based
Approaches, and Systems and Applications.
Space does not permit discussion of workshop papers, so I will
only describe the panel discussion with Pat Langley (then UCI, now NASA),
Larry Kerschberg (George Mason U.) and Quinlan.
2. Panel Discussion
===================
What came out is that "Machine Discovery" is a very promising direction that
will become more important, because there will be more domains for which there
are no human experts. There was general agreement that we need to develop more
efficient algorithms and apply them to real databases. There was also agreement
that we need to deal with uncertainty, and that more expressive languages (e.g.
first-order) are a good research direction.
3. Use of Background Knowledge in Discovery
============================================
There was, however, a lively disagreement on the use of background knowledge.
Tom Mitchell pointed out that some problems are so large (e.g. molecule
segmentation in Meta-Dendral) that domain constraints (e.g. double bonds don't
break) are needed to limit the search space. Quinlan suggested that if we can
design efficient algorithms that search well, we should try to avoid such
constraints, because they limit what we can find; it is OK to use background
knowledge only if it is verified by the data.
Jaime Carbonell said that he really liked constraints, and that in his latest
domain -- logistics planning -- the size of the search space may be 10^300
without constraints. No system, no matter how efficient, can search that space.
It is an issue of being able to tackle a much larger problem by putting in
background knowledge (e.g. "trucks don't drive over water") and thereby
narrowing the search, at the risk of missing some interesting solutions (e.g. a
truck can drive over a frozen lake in the winter). We should be able to play
both sides of this trade-off.
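To make that trade-off concrete, here is a minimal route-search sketch in
Python; the map, the "no water" constraint, and all names are invented for
illustration. Pruning with the constraint shrinks what the search explores,
but it also hides the frozen-lake shortcut.

    # Hypothetical road network: edges are labelled 'road' or 'water'.
    EDGES = {
        ("depot", "town"):  "road",
        ("depot", "lake"):  "water",   # the direct crossing over the lake
        ("lake",  "city"):  "water",
        ("town",  "hills"): "road",
        ("hills", "city"):  "road",
    }

    def neighbors(node, allow_water):
        """Successors, optionally pruned by the 'trucks avoid water' constraint."""
        for (a, b), kind in EDGES.items():
            if a == node and (allow_water or kind != "water"):
                yield b

    def routes(start, goal, allow_water, path=()):
        """Enumerate all acyclic routes from start to goal."""
        path = path + (start,)
        if start == goal:
            yield path
            return
        for nxt in neighbors(start, allow_water):
            if nxt not in path:
                yield from routes(nxt, goal, allow_water, path)

    # Constrained search is smaller, but misses the (frozen) lake route:
    print(list(routes("depot", "city", allow_water=False)))
    # [('depot', 'town', 'hills', 'city')]
    print(list(routes("depot", "city", allow_water=True)))
    # [('depot', 'town', 'hills', 'city'), ('depot', 'lake', 'city')]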
Langley suggested dealing with this trade-off by developing incremental
learning systems that can discover new things and reuse them in further
discovery, thereby bootstrapping themselves. Such systems can be started with
little or no background knowledge and eventually reach a good level. However,
it took scientists several hundred years to discover the present knowledge, so
the proposed incremental system may run for quite a while.
4. Good application domains
===========================
Quinlan pointed out that a minimum requirement for a good domain is having
measurements of *important* parameters. An example of a bad domain is off-shore
oil drilling rigs that collect and send enormous amounts of data, which are
just measurements of various things on the platform. Nobody knows whether they
are relevant to the production of the platform or not. To go looking in a
database like that, which is very large, is probably not a good idea. The
medical domain is suitable for discovery because we have many medical
databases, and considerable medical expertise goes into deciding what is
recorded in them.
Kerschberg suggested applying discovery to "legacy systems". These are old
systems, frequently over twenty years old, for which the people who maintained
them are no longer available. There may still be COBOL applications running
against databases, but nobody knows much about the applications or the
databases. There are companies that maintain such legacy systems, and they do
it by reverse engineering, analyzing both the databases and the programs.
Philip Schrodt from the University of Missouri described the Inter-University
Consortium for Political and Social Research, which has about 2000 data sets,
mostly social survey data from the last twenty years. There is probably plenty
to discover there, and ML researchers can get access to this data.
Langley observed that machine learning has had the most success with simple
diagnostic tasks. Domains like medical diagnosis, or diagnosis of problems
with cars, are where most of the initial high-payoff applications are likely
to be.
Barry Silverman suggested that there are many government databases with real
discovery problems. For example, the IRS databases contain many interesting
relationships, and the FBI has a database with records of every airplane
crash. Analysis of this database may contribute to improved safety in the
air.
5. Directions for future research
====================================
Quinlan said that there are many interesting directions, e.g. incremental
algorithms vs. batch algorithms, or very fast algorithms. Almost anything in
this area has some substance. A bad direction would be to prove that some
algorithm is better than ID3. A more serious example of bad research is trying
to squeeze out the last one percent of performance.
Langley addressed methodological concerns. He suggested that we ought to build
on what has been done before, before coming up with new algorithms.
We shouldn't run a system on a couple of examples and see "if this is
good". Rather, the discovered knowledge should be brought back into the
database to see whether it is useful.
We should build tightly integrated systems - this will force us to
generalize our theory.
Finally, he called for more tools, tested and documented, to be made available
to other people. This, of course, will make it easier for other people to run
comparative studies (they will also show how bad your system is, but that is
what science is all about).
Kerschberg took an engineering view. He suggested taking some of the existing
algorithms and seeing how they scale up on large systems - very large
databases. It is not so important how long it takes to discover knowledge,
because it can be done off-line. But once the knowledge is discovered, try to
bring it back into the database.
It was also suggested that interactive tools should be considered, where a
"knowledge analyst" works together with a machine. Algorithms need to be
re-examined from this point of view (e.g. a neural network may need to
generate explanations from its weights).
Some of the research issues that were not adequately addressed in this
workshop, but (in my opinion) will become more important in the future are:
* Using parallel machines for discovery
* Better representation of the discovered results (e.g. visual,
  natural language, animation, etc.)
Finally, discovery systems should be applied to large, real data sets and
judged on whether they can make useful and significant discoveries.
It will be hard, but I am optimistic!
----------------------------------------------------------------------
Subject: IJCAI ML papers
Date: Tue, 03 Oct 89 11:43:16 -0700
From: Michael Pazzani <pazzani@ICS.UCI.EDU>
Message-ID: <8910031143.aa13004@ICS.UCI.EDU>
The two papers I liked best at IJCAI were Mooney, R., "The effect of rule use
on the utility of explanation-based learning", and Tambe, M. & Rosenbloom, P.,
"Eliminating expensive chunks by restricting expressiveness".
Both papers deal with the utility issue in learning. (Note that I did
not say the utility issue in explanation-based learning. The utility
issue arises in any system that incorporates redundant rules, whether
they are learned by an empirical or an explanation-based component, or
hand-coded by a knowledge engineer.) The most interesting aspect of
these papers is that the authors identified reasons that some kinds of
redundant knowledge are expensive and found a mechanism to incorporate
some "inexpensive" chunks into the performance system. While I feel
that there is value to statistical techniques for eliminating expensive
chunks, I think that more work needs to be done on analytical
techniques that avoid the creation of chunks that are likely to be
expensive or that restrict the use of expensive chunks.
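As a rough illustration of what such a statistical technique has to weigh,
here is a minimal sketch with invented numbers (not taken from either paper):
a learned chunk earns its keep only if the search time it saves on the
problems where it fires outweighs the match cost it adds on every problem
where it is merely tried.

    from dataclasses import dataclass

    @dataclass
    class ChunkStats:
        match_cost: float   # avg. time spent testing the chunk's conditions
        savings: float      # avg. search time saved when the chunk applies
        applications: int   # problems on which the chunk actually fired
        attempts: int       # problems on which it was matched at all

        def estimated_utility(self) -> float:
            # benefit on problems it solved, minus matching overhead everywhere
            return (self.savings * self.applications
                    - self.match_cost * self.attempts)

    cheap = ChunkStats(match_cost=0.2, savings=5.0, applications=40, attempts=100)
    expensive = ChunkStats(match_cost=3.0, savings=5.0, applications=5, attempts=100)

    for name, chunk in (("cheap", cheap), ("expensive", expensive)):
        utility = chunk.estimated_utility()
        print(name, round(utility, 1), "keep" if utility > 0 else "discard")
    # cheap 180.0 keep
    # expensive -275.0 discard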
It is clear that humans maintain a large amount of special purpose
redundant knowledge (e.g., I can sound out the word "bat", but
normally during reading this is not necessary.) Most psychologists I
have spoken with cannot even comprehend why there would be a utility
problem. They assume that, in humans, recognition is an extremely
efficient process and chaining is a rather slow process. In our
current computational models, if there is a "recognition" process at
all, it's equivalent to backward chaining with a depth limit of one.
I feel that a deeper understanding of recognition, and of the process by which
one procedure operationalizes a concept for a separate procedure (rather than
for a resource-limited copy of the search procedure), will allow machine
learning systems to incorporate redundant knowledge, and that these two papers
are an important step toward that goal.
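To make the recognition-versus-chaining contrast above concrete, here is a
minimal sketch with invented rules and facts: backward chaining under a depth
limit, where depth one corresponds to bare recognition, and where it is
exactly a compiled, redundant rule that lets recognition alone succeed.

    RULES = {                         # conclusion -> list of conditions
        "word_bat": ["b", "a", "t"],
        "b": ["sees_letter_b"],
        "a": ["sees_letter_a"],
        "t": ["sees_letter_t"],
    }
    FACTS = {"sees_letter_b", "sees_letter_a", "sees_letter_t"}

    def prove(goal, depth):
        """Backward chaining with a depth limit; depth=1 is bare recognition."""
        if goal in FACTS:
            return True
        if depth == 0 or goal not in RULES:
            return False
        return all(prove(cond, depth - 1) for cond in RULES[goal])

    print(prove("word_bat", depth=1))   # False: recognition alone fails
    print(prove("word_bat", depth=2))   # True: one level of chaining succeeds

    # A compiled (redundant) rule makes depth-1 recognition sufficient:
    RULES["word_bat_compiled"] = ["sees_letter_b", "sees_letter_a", "sees_letter_t"]
    print(prove("word_bat_compiled", depth=1))   # True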
----------------------------------------------------------------------
END of ML-LIST 1.9