Copy Link
Add to Bookmark
Report
Machine Learning List Vol. 2 No. 01
Machine Learning List: Vol. 2 No. 1
Monday, Jan 8, 1990
Contents:
Over/under generalization in ID3
Lexical databases
Empirical Discovery Programs
The Machine Learning List is moderated. Contributions should be relevant to
the scientific study of machine learning. Mail contributions to ml@ics.uci.edu.
Mail requests to be added or deleted to ml-request@ics.uci.edu. Back issues
may be FTP'd from ics.uci.edu in /usr2/spool/ftp/pub/ml-list/V<X>/<N> or N.Z
where X and N are the volume and number of the issue; ID & password: anonymous
- - ----------------------------------------------------------------------
Date: Wed, 27 Dec 89 11:34:10 PST
From: tgd@CS.ORST.EDU (Tom Dietterich)
Subject: Over/under generalization in ID3
In a study that I am performing, I've noticed the following
phenomenon. Suppose I have a training set (and test set) containing
many more positive examples (class=1) than negative examples
(class=0). In my case, the ratio is about 8:1. The concept learned
by ID3 tends to overgeneralize in this case. On the test set, it will
make many more errors where it predicts a 1 but a 0 is desired (as
opposed to errors where it predicts a 0 but a 1 is desired).
Has anyone else noticed this phenomenon? Has anyone done a systematic
study of it or found any ways to overcome it? Perhaps this is the
same as the problem of "small disjuncts" that Holte et al discussed at
IJCAI-89. It seems like any method that computed the marginal
probabilities of 1 and 0 (e.g., naive Bayes) should be able to avoid
this problem.
Suggestions?
- - --Tom
- - ----------------------------------------------------------------------
From: munnari!cluster.cs.su.oz.au!quinlan@uunet.uu.NET (Ross Quinlan)
Date: Tue, 2 Jan 90 13:41:36 +1100
Subject: Over/under generalization in ID3
I've noticed the same thing. Suppose the number of training examples
matched by a disjunct is n, with e of them in error. If the classes
are asymmetric, the Laplace estimate of the error rate as (e+1)/(n+2)
underestimates for the less frequent class and overestimates for the
more frequent, particularly for small values of n. The phenomenon is
therefore relevant to the problem of small disjuncts discussed at IJCAI
by Rob Holte, Liane Acker and Bruce Porter.
I tried experiments with a couple of domains: prob-disj (2 classes
distributed 85%-15%) and lost3ply (74%-26%). There were 27 repetitions
of the following:
* split data randomly into training/test sets (70%-30%)
* grow a tree
* calculate the error rate of each leaf (disjunct) on unseen cases
* plot as a function of n & class
For instance, the leaves in which e=0 give the following results for the
lower values of n in the prob-disj case:
Average error rate on unseen cases
Less Frequent More Frequent
Class: Class:
n=1 61% 22%
n=2 42% 13%
n=3 39% 8%
n=4 20% 5%
n=5 22% 5%
n=6 (none) 5%
n=7 11% 4%
n=8 7% 4%
n=9 0% 0%
n=10 0% 0%
The results on lost3ply were similar. I developed a model for the
error rate based on the frequency of each class in the description
space "around" each leaf; this gave a better fit than the Laplace estimate,
but is still not entirely satisfactory. That's about as far as I got
before being distracted by other stuff. I still think there are some
very interesting findings lurking in the underbrush.
Ross
- - ----------------------------------------------------------------------
Subject: Over/under generalization in ID3
Date: Wed, 03 Jan 90 14:06:18 -0800
From: sage@harlie.ICS.UCI.EDU (Stephanie Sage)
You suggest that the over-generalization problem you noticed with ID3
might not occur in a simple Bayes classifier since it stores the
marginal probabilities of the classes. In general, I believe this is
so. However, this problem does occur with the Bayesian algorithm when
the skewed class probabilities *interact* with inaccurate
instance-given-class probabilities (which are inaccurate due to noise,
attribute dependence, unrepresentative sample, or whatever). This
interaction results in instances of the low frequency class being more
likely to be classified as instances of the high frequency class (what
you call over-generalization). If you are interested, I can try to
reproduce my results and/or demonstrate by example the conditions
under which the problem occurs.
I agree that Holte, Acker, & Porter's work on small disjuncts is relevant.
I would also suggest the following paper:
Spackman, Kent A. (1989). Signal detection theory: Valuable tools for
evaluating inductive learning. {\it Proceedings of the Sixth International
Workshop on Machine Learning.} Ithaca, NY: Morgan Kaufmann.
This paper describes the use of signal detection methodology to evaluate
a system's performance in terms of both "true positives" (+ instances
classified as +) and "false positives" (- instances classified as +, i.e.
errors of commission). This is relevant to your problem since the
over-generalization effect you note is precisely the problem of "false
positives". Spackman suggests the use of a parameter to control the
sensitivity of a learning algorithm to positive instances (reduced
sensitivity will alleviate over-generalization). He does not address the
implementation of such a parameter. Also, the paper refers to the
over-generalization problem in general, and does not address specific causes
of that problem (e.g. uneven distribution of instances across categories).
- - --stephanie
- - ----------------------------------------------------------------------
Date: Wed, 3 Jan 90 10:04 GMT
From: Padraig Cunningham <CNNNGHMP%vax1.tcd.ie@CUNYVM.CUNY.EDU>
Subject: Lexical databases
I am currently involved in research into knowledge acquisition from
text and, since issues of scale are so important, I would like to get
a complete lexical database as soon as possible. Has anyone any
information on what lexical databases are available, either in the
public domain or commercially?
Padraig Cunningham
Hitachi Dublin Laboratory
Trinity College Dublin
Ireland
- - ----------------------------------------------------------------------
From: "R. Bharat Rao" <bharat@gaea.cs.uiuc.EDU>
To: ml@ICS.UCI.EDU
Subject: Request for Empirical Discovery Programs
As part of my Ph.D. thesis I need to do empirical discovery, and
discover relationships between groups of variables. These variables
may be real-valued, nominal, discrete (different domains), or
combinations of these types, and varying amounts of background
knowledge (often none) will be provided. The data also will be (known
to be) noisy in certain domains.
I am looking for programs that can do empirical discovery - as many as
I can find - and I wish to apply them to domains the programs are
suited to.
If you have the source code, or pointers to the code + literature on
discovery programs that I could use, I would be very grateful if you
could e-mail me the details at the address above.
I will summarize and post the results.
Thanks,
- - -Bharat
R.Bharat Rao
E-Mail: bharat@gaea.cs.uiuc.edu
US Mail:AI Group, Beckman Institute, Univ. of Illinois, Urbana-Champaign.
- - ----------------------------------------------------------------------
END of ML-LIST 2.1