Machine Learning List: Vol. 2 No. 2
Wednesday, Jan 10, 1990
Contents:
Over/under generalization in ID3
COLT
The Machine Learning List is moderated. Contributions should be relevant to
the scientific study of machine learning. Mail contributions to ml@ics.uci.edu.
Mail requests to be added or deleted to ml-request@ics.uci.edu. Back issues
may be FTP'd from ics.uci.edu in /usr2/spool/ftp/pub/ml-list/V<X>/<N> or <N>.Z
where X and N are the volume and number of the issue; ID & password: anonymous
----------------------------------------------------------------------
Date: Mon, 8 Jan 90 19:33:46 EST
From: Rob Holte <HOLTE%UOTTAWA.bitnet@ugw.utcs.utoronto.ca>
Subject: Over/under generalization in ID3
In Vol. 2 No. 1, Dietterich described the following problem, which
Quinlan and Sage have also observed:
GIVEN:
two classes, M and F; the training set contains Many examples of M
and Few examples of F. Test examples can be misclassified in two ways:
(mF) Guess M when F is the correct class
(fM) Guess F when M is the correct class
PROBLEM:
one type of error (misclassification) occurs much more often than the other.
Let me note in passing that the data presented does not agree on which of
the two types of error is the more common:
Dietterich reports many more errors of type mF than fM.
Quinlan and Sage don't report numbers of errors, but error RATES.
In Quinlan's table the error rate in the column "Less Frequent Class"
is uniformly much higher than in the other column (N.B. the column
headings name the class that is guessed, not the class that is
correct (private communication)). To convert these rates to numbers
of misclassifications one must multiply them by the number of test
examples classified by each disjunct (the disjunct's "test
coverage"). Assuming test coverage is not affected by the class
associated with a disjunct, Quinlan and Sage are reporting many more
errors of type fM than mF.
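To make the conversion concrete, here is a minimal sketch (in Python;
the numbers and names are mine, purely illustrative, not Quinlan's
actual data):

    def error_counts(disjuncts):
        """disjuncts: list of (guessed_class, test_coverage, error_rate).
        A disjunct that guesses M and errs commits an mF error (the
        true class was F); one that guesses F and errs commits an fM
        error (the true class was M)."""
        counts = {"mF": 0.0, "fM": 0.0}
        for guessed, coverage, rate in disjuncts:
            counts["mF" if guessed == "M" else "fM"] += coverage * rate
        return counts

    # With equal test coverage per disjunct (the assumption above), the
    # higher rate in the F column translates directly into more fM errors:
    print(error_counts([("M", 500, 0.05), ("F", 500, 0.40)]))
    # -> {'mF': 25.0, 'fM': 200.0}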
In any case, this problem is not the same as the problem of small
disjuncts described at IJCAI by myself, Liane Acker and Bruce Porter.
The problem of small disjuncts, as we defined it, is simply the
problem that error rate increases dramatically, and often suddenly, as
training coverage decreases. Quinlan's data is a perfect illustration
(the pattern is extremely evident in both columns; the variable "n"
is "training coverage") and shows that the small disjunct problem
arises
regardless of class distribution (our IJCAI results were based on data
in which examples were almost evenly divided between the two classes).
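For concreteness, the pattern can be exhibited by pooling test errors
over small versus large disjuncts; a sketch (Python; the data layout
is hypothetical, not anyone's actual experiment):

    def error_rate_by_coverage(disjuncts, threshold=7):
        """disjuncts: list of (training_coverage, test_coverage,
        test_errors). Pools the test error rate over small vs. large
        disjuncts, where "small" means training coverage <= threshold."""
        def pooled(group):
            tested = sum(t for _, t, _ in group)
            errors = sum(e for _, _, e in group)
            return errors / tested if tested else 0.0
        small = [d for d in disjuncts if d[0] <= threshold]
        large = [d for d in disjuncts if d[0] > threshold]
        return {"small": pooled(small), "large": pooled(large)}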
The problem of small disjuncts is not related to class distribution in
any simple way. First, the disjuncts associated with the less
populous class (F) need not be small (recall that "small" means that
the number of training examples covered by the disjunct (its "training
coverage") is below a certain threshold (eg. 7)). Conversely, many of
the disjuncts associated with the more populous class could be small
- -- this is the case in Quinlan's data. Secondly, the class
distribution of examples at each leaf of a decision tree is always
highly asymmetric -- the whole point of the information-theoretic
attribute selection criterion is to create high asymmetry as quickly
as possible (see the sketch at the end of this message). So it's not
clear to me that there should be any simple relation between the
error rate of a disjunct and the class distribution of the whole
population. Thirdly, let me repeat a point
I stressed in the IJCAI presentation (less so in the written version).
The problem of small disjuncts cannot be solved by data-oriented
methods (such as the one Dietterich suggests): it must be solved by
using a more appropriate bias.
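(The sketch promised above: a minimal Python illustration of why the
information-theoretic criterion drives leaf class distributions toward
asymmetry -- the attribute it selects is exactly the one whose
branches have the most skewed, i.e. lowest-entropy, class
distributions. Names are mine, not from any system discussed here.)

    import math

    def entropy(labels):
        """Shannon entropy (bits) of a list of class labels."""
        n = len(labels)
        probs = [labels.count(c) / n for c in set(labels)]
        return -sum(p * math.log2(p) for p in probs)

    def information_gain(examples, attr):
        """examples: list of (attribute_dict, class_label) pairs.
        Gain is highest when splitting on attr yields the most
        asymmetric (lowest-entropy) class distribution per branch."""
        labels = [c for _, c in examples]
        remainder = 0.0
        for v in set(ex[attr] for ex, _ in examples):
            branch = [c for ex, c in examples if ex[attr] == v]
            remainder += len(branch) / len(examples) * entropy(branch)
        return entropy(labels) - remainder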
-- Rob Holte   holte@uottawa.bitnet
Computer Science Department
University of Ottawa
----------------------------------------------------------------------
Subject: Over/under generalization in ID3
Date: Tue, 09 Jan 90 11:21:39 -0800
From: "David W. Aha" <aha@ICS.UCI.EDU>
> Has anyone else noticed this phenomenon? Has anyone done a systematic
> study of it or found any ways to overcome it? Perhaps this is the
> same as the problem of "small disjuncts" that Holte et al discussed at
> IJCAI-89. It seems like any method that computed the marginal
> probabilities of 1 and 0 (e.g., naive Bayes) should be able to avoid
> this problem.
> Suggestions?
I observed this in both our instance-based learning (IBL) algorithms and C4
a few years ago. I designed our recent IBL algorithms to be sensitive to this.
1. IB3, our noise-tolerant algorithm (IJCAI 89), employs a method to
distinguish noisy instances from others. It compares
the classification accuracy of an instance (defined as the percentage
of correct classification guesses a saved instance makes on subsequently
presented, nearby training instances) with its class's observed
relative frequency. We then use a confidence interval of proportions
test to determine whether the instance's accuracy is significantly
better or worse than its class's frequency. So: an instance with
a 75% accuracy looks great if its class's observed relative frequency is
low, but not good if its class's frequency is, say, 90%.
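For illustration, here is one way such a test can look (a Python
sketch; I use a Wilson score interval and an arbitrary z value, which
may not match the exact formula in the IJCAI paper):

    import math

    def wilson(p_hat, n, z):
        """Wilson score confidence interval for a proportion."""
        if n == 0:
            return (0.0, 1.0)
        center = p_hat + z * z / (2 * n)
        spread = z * math.sqrt(p_hat * (1 - p_hat) / n
                               + z * z / (4 * n * n))
        denom = 1 + z * z / n
        return ((center - spread) / denom, (center + spread) / denom)

    def judge(inst_correct, inst_tries, cls_count, cls_total, z=0.9):
        """Accept an instance whose accuracy interval lies wholly above
        its class's frequency interval; discard if wholly below.
        z=0.9 is illustrative, not the paper's setting."""
        acc_lo, acc_hi = wilson(inst_correct / inst_tries, inst_tries, z)
        frq_lo, frq_hi = wilson(cls_count / cls_total, cls_total, z)
        if acc_lo > frq_hi:
            return "accept"     # significantly better than base rate
        if acc_hi < frq_lo:
            return "discard"    # significantly worse: likely noise
        return "undecided"

    # 75% accuracy looks great against a 10% base rate, poor against 90%:
    print(judge(15, 20, 10, 100))   # -> accept
    print(judge(15, 20, 90, 100))   # -> discard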
2. IB4, our algorithm that learns attribute relevancies (MLW 89), updates its
attribute weights more quickly from instances that are members
of low-frequency concepts than from instances that are members of
high-frequency concepts. Informal lesion studies showed me that learning
rate increased dramatically for highly skewed distributions when this
approach was used.
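A hypothetical rendering of that idea (my reading of the mechanism,
not the actual IB4 update rule; the names and the step-size form are
assumptions):

    def update_weights(weights, attr_sim, correct, class_freq, rate=0.1):
        """weights: dict attr -> weight in [0, 1].
        attr_sim: dict attr -> similarity in [0, 1] between the new
        instance and the saved neighbor that classified it.
        class_freq: observed relative frequency of the instance's class."""
        step = rate * (1.0 - class_freq)  # rarer class => larger step
        for a, sim in attr_sim.items():
            # Credit attributes that agreed when the guess was correct,
            # and attributes that disagreed when it was wrong.
            target = sim if correct else 1.0 - sim
            weights[a] += step * (target - weights[a])
        return weights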
My suggestions:
1. Yes, count this as an example in which marginal probabilities of 0 and
1 are used to avoid this problem.
2. Learning algorithms must explicitly attend to this problem (especially
incremental learners). A quality-ensuring function (on parts of the
learned concept description) should be a function of concept distribution
skew (among other variables).
3. Learning rate from training instances in low-frequency concepts
should be higher than for high-frequency concepts, but there are caveats.
Rob Holte confirmed to me that our _earlier_ algorithms failed to work well on
small probability disjuncts. We've now tackled this problem, but have only
empirical analyses (at the moment). Personally, I think the MIRO/ECM approach
(Drastal et al., MLW 89) seems most promising: in the limit, domain knowledge
is required to efficiently distinguish noise from exceptions (or small
disjuncts). Of course, obtaining the domain knowledge is the real problem.
----------------------------------------------------------------------
Date: Wed, 10 Jan 90 14:35:39 EST
From: fulk@cs.rochester.EDU
Subject: COLT
CALL FOR PAPERS
COLT '90
Third Workshop on Computational Learning Theory
Rochester, NY
August 6 - August 8, 1990
The third workshop on Computational Learning Theory will be held in
Rochester, NY. The conference will be jointly sponsored by SIGACT and
SIGART, and is expected to be similar in style to the previous such
workshops held at MIT and UC/Santa Cruz. Registration at COLT '90 is open.
It is expected that most papers will present rigorous formal analyses
of theoretical issues in Machine Learning. Possible topics include,
but are not limited to: resource and robustness analysis of learning
algorithms, general learnability and non-learnability results in new and
existing computational learning models, theoretical comparisons among
learning models, and papers that connect learning theory with work in
robotics, neural nets, pattern recognition and cryptography. R. Freivalds
(Latvian State University, Riga) has agreed to present an invited talk;
the program committee may schedule additional invited talks.
Authors should submit an extended abstract that consists of:
A) a cover page with the title, authors' names,
(postal and e-mail) addresses, and a 200-word summary;
B) a body not longer than 10 pages in twelve-point font.
Be sure to include a clear definition of the model used, an overview
of the results, and some discussion of their significance, including
comparison to other work. Proofs or proof sketches should be included
in the technical section. Authors should send 10 copies of their
abstract to
John Case
COLT '90
Department of Computer and Information Sciences
103 Smith Hall
University of Delaware
Newark, DE 19716.
The deadline for receiving submissions is April 9, 1990. This deadline
is FIRM. Authors will be notified by May 22, final camera-ready papers
will be due June 18, and this deadline is ABSOLUTE. The proceedings will
be published by Morgan-Kaufmann. For further information about submissions
contact John Case (telephone: 302-451-2711, email: case@cis.udel.edu).
Chair and local arrangements: Mark A. Fulk (U. Rochester).
Program committee:
J. Case (Delaware, chair),
D. Angluin (Yale),
E. Baum (NEC Research, Princeton),
S. Ben David (Technion, Israel),
M. Fulk (U. Rochester),
D. Haussler (UC Santa Cruz),
L. Pitt (U. Illinois),
R. Rivest (MIT),
C. Smith (Maryland),
S. Weinstein (U. Pennsylvania).
Note: papers that have appeared in journals or that are being submitted
to other conferences are not appropriate for submission to COLT with the
exception of papers submitted to the IEEE 30th Symposium on Foundations of
Computer Science (FOCS).
A joint submission policy coordinated with FOCS permits authors to send
a paper to both conferences; in the event that both conferences accept the
paper, it will be published in the FOCS proceedings, the authors will be
invited to give a talk at both conferences, and a short (one-page) abstract
will be printed in the COLT proceedings.
As the FOCS decisions may be quite late, authors of dual submissions
will be asked to send the abstract with their final copy, so as to
allow the publisher to substitute the abstract upon receiving word of
the FOCS decision.
It is, of course, required that authors notify both committees of the
dual submission by including a note in the cover letters.
----------------------------------------------------------------------
END of ML-LIST 2.2