Machine Learning List: Vol. 6 No. 22
Tuesday, August 9, 1994

Contents:
A guide to the literature on learning graphical models
Inductive Learning Competitions
CFP: Genetic Algorithms Conference (ICGA6)
CFP : SDAIR95
EuroCOLT95: Call for Papers
IEEE Symposium on INTELLIGENCE in NEURAL AND BIOLOGICAL SYSTEMS
Incremental Learning (3)
Overlooked issues concerning the conservation law
Conservation Law (4)



The Machine Learning List is moderated. Contributions should be relevant to
the scientific study of machine learning. Mail contributions to ml@ics.uci.edu.
Mail requests to be added or deleted to ml-request@ics.uci.edu. Back issues
may be FTP'd from ics.uci.edu in pub/ml-list/V<X>/<N> or N.Z where X and N are
the volume and number of the issue; ID: anonymous PASSWORD: <your mail address>
URL- http://www.ics.uci.edu/AI/ML/Machine-Learning.html

----------------------------------------------------------------------

Date: Mon, 8 Aug 94 18:14:33 PDT
From: Wray Buntine <wray@ptolemy-ethernet.arc.nasa.gov>
Subject: A guide to the literature on learning graphical models


A guide to the literature on learning graphical models
Wray L. Buntine
RIACS at NASA Ames Research Center
wray@kronos.arc.nasa.gov

This literature review discusses different methods under the general rubric
of learning probabilistic graphical models from data. Because many problems
in artificial intelligence, statistics and neural networks can be
represented as a probabilistic graphical model, this area provides a
unifying perspective on learning. There are probably sections of the
literature I have missed out. Apologies to those concerned.

The literature review, compiled August 1994, contains 75 references.

It is available via FTP from:
FTP site: ack.arc.nasa.gov
FTP file: /pub/buntine/graphbib.ps.Z
or can be viewed from World Wide Web (WWW) using
URL: ftp://ack.arc.nasa.gov/pub/buntine/graphbib.ps.Z

------------------------------

Date: Wed, 3 Aug 94 17:11:14 BST
From: David.Page@comlab.ox.ac.uk
Subject: Inductive Learning Competitions

UPDATE ON INDUCTIVE LEARNING COMPETITIONS

Donald Michie, Stephen Muggleton
David Page and Ashwin Srinivasan
Oxford University Computing Laboratory, UK.


Regrettably, the train files issued for these competitions were incorrect
in some respects and not always consistent between the graphics
representation of a given train and its Prolog representation. The files
are accordingly withdrawn with apologies, the relevant fixes have been
made in the train generator and drawing program, and replacement files
have been created. The compressed tar file trains.tar.Z at the ftp site
has been changed to contain the replacement files, including the
corrected Prolog generator and concept tester. The errors affected
bucket-shaped cars and also restricted ``double'' to short rectangular
cars in the graphics but not in the Prolog. We thank those who sent email
messages reporting these errors.

All entries already received that were based on either the original Prolog
files or the original graphics files will be scored on that basis. There
will be no question of penalising any such entry for something not the
entrant's fault. Entries received after today's date (August 3,
1994), however, should be based on the replacement files now in place and
will be assessed accordingly. We have also plugged gaps and ambiguities in the
earlier account of complexity scoring. The file ml-chall.ps in
trains.tar.Z has been modified to provide the details of this more
precise definition of complexity.


URL = ftp://ftp.comlab.ox.ac.uk/pub/Packages/ILP/trains.tar.Z
FTP site = ftp.comlab.ox.ac.uk
FTP file = pub/Packages/ILP/trains.tar.Z

------------------------------

Subject: CFP: Genetic Algorithms Conference (ICGA6)
Date: Tue, 09 Aug 94 09:49:39 -0600
From: Robert Elliott Smith <rob@comec4.mh.ua.edu>



Call for Papers

ICGA-95

The Sixth International Conference on
Genetic Algorithms

15-20 July, 1995
University of Pittsburgh
Pittsburgh, Pennsylvania

The Sixth International Conference on Genetic Algorithms (ICGA-95)
will be held July 15-20, 1995, at the University of Pittsburgh.
This meeting brings together an international community from academia,
government, and industry interested in algorithms suggested by the
evolutionary process of natural selection, such as genetic algorithms,
evolution strategies, evolutionary programming, and genetic programming.
Topics of particular interest include mathematical descriptions of
evolutionary computation; function optimization, machine learning and
neural networks that include evolutionary computation methods;
and papers discussing how methods of evolutionary computation are related
to biological and cognitive systems, including artificial life and
classifier systems.

Papers describing significant, unpublished research in these areas are
solicited. Authors must submit four (4) complete copies of their
paper (hardcopy only), received by January 16, 1995, to the Program
Chair:

Larry J. Eshelman
Philips Laboratories
345 Scarborough Road
Briarcliff Manor, NY 10510
(914) 945-6491

Papers should be no longer than 10 pages, single-spaced, and printed
using 12 pt. type. Electronic submissions will not be accepted.
The first page should include the title and short abstract but not the
names of the paper's authors or their addresses, to allow for anonymous
peer review. Please include a separate page with the authors' names as
well as the title and abstract. This page should also include the name,
physical address and e-mail address of the corresponding author. (It will
be assumed that the first author is the corresponding author unless
otherwise specified.)

Evaluation criteria include the significance of results, originality, and
the clarity and quality of the presentation. Questions on the conference
program and submission should be directed to lje@philabs.philips.com.
Other questions should be directed to rob@comec4.mh.ua.edu.

Important Dates:

January 16, 1995: Submissions must be received
April 5, 1995: Notification to authors mailed
May 5, 1995: Revised, final camera-ready paper due
July 15-20, 1995: Conference dates

ICGA-95 Conference Committee:

Conference Chair: Steven F. Smith, Carnegie Mellon
Publicity: Robert E. Smith, University of Alabama
Program Chair: Larry J. Eshelman, Philips Laboratories
Local Arrangements: Alice E. Smith, University of Pittsburgh


------------------------------

Date: Mon, 8 Aug 94 16:58 EDT
From: David Lewis <lewis@research.att.com>
Subject: CFP : SDAIR95

Call for Papers

Fourth Annual Symposium
on Document Analysis and Information Retrieval (SDAIR '95)

April 24-26, 1995
Desert Inn Hotel, Las Vegas, Nevada

Conference Chair:
Donna Harman
National Institute of Standards and Technology

SCOPE

The purpose of this symposium is to present the results of current
research and to stimulate the exchange of ideas in the general field
of Document Understanding. Papers on all aspects of document analysis
and information retrieval are solicited, with particular emphasis on:

Document Analysis
Multilingual OCR
Language identification
Multilingual character sets
Domain specific dictionaries / lexicons
Logical structure recognition
Recognition of tables and equations
Recognition of maps and mechanical drawings

Information Retrieval
Full-text retrieval
Retrieval from structured documents
Text categorization
Evaluation of IR systems
Image and multimedia retrieval
Language-specific influences on retrieval
Text representation

The two themes to be highlighted at this year's symposium are the
intersection of document analysis and information retrieval, and the
ramifications of multilingual data in both fields.

SUBMISSIONS

Please send seven copies of complete papers, with the author's name,
address, telephone number, fax number, and e-mail address, to the
appropriate Program Chair:

Larry Spitz, Chair (Doc. Analysis)
or
David D. Lewis, Chair (Info. Ret.)
c/o Information Science Research Institute
University of Nevada, Las Vegas
4505 Maryland Parkway
Box 454021
Las Vegas, NV 89154-4021

The papers should be no longer than 20 double-spaced pages or 5,000
words. Papers which have already appeared in journals or published
conference proceedings should not be submitted. Both camera ready and
machine readable copies of the accepted papers will be required. The
proceedings will be available at the conference.

CONFERENCE TIMETABLE
Papers Due October 1, 1994
Notification To Authors December 1, 1994
Camera Ready Copy February 1, 1995


PROGRAM COMMITTEE

Document Analysis
=================
Larry SPITZ, Fuji Xerox (chair)
Henry BAIRD, AT&T Bell Labs
Andreas DENGEL, DFKI
Hiromichi FUJISAWA, Hitachi
Jonathon HULL, SUNY Buffalo
Junichi KANAI, UNLV
Juergen SCHUERMANN, Daimler Benz
Suzanne TAYLOR, Unisys
Karl TOMBRE, INRIA

Information Retrieval
=====================
David LEWIS, AT&T Bell Labs (chair)
Christopher BUCKLEY, Cornell
Kenneth CHURCH, AT&T Bell Labs
Robert KORFHAGE, U. Pittsburgh
Fausto RABITTI, CNR-IEI
Kazem TAGHVA, UNLV
TOKUNAGA Takenobu, Tokyo Inst. Tech.
Howard TURTLE, West Publishing
Peter WILLETT, U. Sheffield
Ross WILKINSON, RMIT




------------------------------

Date: Mon, 8 Aug 1994 01:17:59 +0200
From: Paul.Vitanyi@cwi.nl
Subject: EuroCOLT95: Call for Papers



CALL FOR PAPERS---EuroCOLT '95
Second European Conference on
Computational Learning Theory

Barcelona, Spain
March 13-15, 1995

The Second European Conference on Computational Learning Theory
(EuroCOLT'95) will be held at the Universitat Politecnica
de Catalunya, Barcelona, Spain, from Monday, March 13, through Wednesday,
March 15, 1995.
(The inaugural European conference on Computational Learning Theory
was held 20--22 December 1993 at Royal Holloway, University of London.)

The EuroCOLT 95 conference is sponsored by the EATCS,
by the European Union through NeuroCOLT ESPRIT Working Group Nr. 8556,
and by IFIP through SSGFCS WG 14.2 and SSGFCS WG 14.4.


We invite papers in all areas that relate directly to the
analysis of learning algorithms and the theory of machine
learning, including artificial and biological neural networks,
genetic and evolutionary algorithms, robotics, pattern recognition,
inductive logic programming, inductive inference, information
theory, decision theory, Bayesian/MDL estimation, statistical
physics, and cryptography. We plan to include work in progress
of both theoretical and experimental nature,
with poster sessions to foster discussion, and authoritative invited
lectures. We look forward to a lively, interdisciplinary meeting.


Abstract Submission: Authors should submit thirteen copies
(preferably two-sided copies) of an extended abstract to be
received by Thursday, September 21, 1994, to


Paul Vitanyi - EuroCOLT'95
CWI
Kruislaan 413
1098 SJ Amsterdam
The Netherlands

An abstract must be received by September 21, 1994 (or
postmarked September 14 and sent airmail, or sent overnight
delivery on September 20). This deadline is FIRM! Papers that have
appeared in journals or other conferences, or that are being
submitted to other conferences, are not appropriate for
submission to EuroCOLT. Authors in countries where it is difficult
to duplicate manuscripts may submit a single copy or submit by email.

Abstract Format: The abstract should consist of a cover page with
title, authors' names, postal and e-mail addresses, and a 200-word
summary. The body of the abstract should be no longer than
10 pages with roughly 35 lines/page in 12-point font. Papers
deviating significantly from this length constraint will not be
considered. The body should include a clear definition of the
theoretical model used, an overview of the results, and some
discussion of their significance, including comparison to other
work. Proofs or proof sketches should be included in the
technical section. Experimental results are welcome, but are
expected to be supported by theoretical analysis.

Notification: Authors will be notified of acceptance or rejection
by a letter mailed on or before Monday, October 31, with possible
earlier notification via e-mail. Final camera-ready papers will
be due on Thursday, December 15. The proceedings will be published
by Springer-Verlag, and will be available at the meeting.

Program Chair: Paul Vitanyi (CWI, Amsterdam, Netherlands,
e-mail paulv@cwi.nl).

Conference and Local Arrangements Co-Chairs: Felipe Cucker (University
Pompeu Fabra, Barcelona, e-mail cucker@upf.es) and Ricard Gavalda
(Universitat Politecnica de Catalunya, Barcelona, e-mail gavalda@lsi.upc.es).

Program Committee:

M. Anthony (LSE, Univ. London, UK), E. Baum (NEC Research Inst., Princeton),
N. Cesa-Bianchi (Univ. Milano, Italy), J. Koza (Stanford Univ, Palo Alto, USA),
M. Li (Univ. Waterloo, Canada), S. Muggleton (Oxford University, UK),
W. Maass (TU Graz, Austria), J. Rissanen (IBM Almaden, USA),
H.-U. Simon (Univ. Dortmund, Germany), K. Yamanishi (NEC, Princeton, USA),
L. Valiant (Harvard Univ, Cambridge, USA),
P. Vitanyi (CWI/Univ. Amsterdam, Netherlands), R. Freivalds (Univ. Riga, Latvia)





------------------------------

Date: Mon, 1 Aug 94 13:47:17 EDT
From: INBS-conference <inbs@bingsuns.cc.binghamton.edu>
Subject: IEEE Symposium on INTELLIGENCE in NEURAL AND BIOLOGICAL SYSTEMS


C A L L for P A P E R S

First International IEEE Symposium on


INTELLIGENCE in NEURAL AND BIOLOGICAL SYSTEMS


May 29-31, 1995, Washington DC Area, USA

Sponsored by: IEEE Computer Society;


In Cooperation with: AAAS Society; AAAI Society; SMC Society


AIMS and SCOPE of the Symposium

Intelligence in artificial neural networks and the computational
evolution of biological systems are two very important and very
active research areas, which offer and promise many practical
applications to scientists and other professionals in industry
and government. In response to this demand, the INBS
Symposium offers a theoretical and practical forum for the study of
evolutionary and intelligence processes in both artificial and
biological systems and of the interaction between these fields.

Some Topics

. EVOLUTIONARY COMPUTING (DNA sequence processing, Genome Processes,
DNA Topologies, Synthesis of DNA, Formal Linguistics of DNA,
Structure/Function Correlation, Computational Genetics)
. BIOTECHNOLOGY AND APPLICATIONS (Mapping the Genome, Human Genome,
Molecular Computing, Limitations of Biological Models)
. GENETIC ALGORITHMS (Clustering, Optimization, Searching, Programming, etc)
. LANGUAGE UNDERSTANDING (Natural Languages, NL Translation,
Text Abstraction, Computational Linguistics)
. LEARNING AND PERCEPTION (Supervised, Unsupervised, Hybrid,
Understanding, Planning, Interpretation)
. NEUROSCIENCE (Adaptive Control Models, Fuzzy & Probabilistic Models,
Hybrid Models, Dynamic Neural and Neurocomputing Models,
Self-Organized Models)

SOME DISTINGUISHED KEYNOTE SPEAKERS

A.Apostolico, Europe; S.Arikawa, Japan; S.Hameroff, USA


PROGRAM COMMITTEE
N.Bourbakis, BU-SUNY, USA, Chair
R.Brause, UG, Germany
K.DeJong, GMU, USA
J.Shavlik, UWM, USA
C.Koutsougeras, TU, USA
H.Kitano, Japan
M.Perlin, CMU, USA
H.Pattee, BU, USA
D.Schaffer, Philips Lab, USA
D.Searls, UPA, USA
J.Collado-Vides, UNAM
T.Yokomori, UEC, Japan
S.Rasmussen, Los Alamos NL, USA
G.Paun, Romania
A.Restivo, U. Palermo, Italy
M.Crochemore, U. Paris, France
D.Perrin, U. Paris, France
R.Reynolds, WSU, USA
M.Conrad, WSU, USA
M.Kameyama, TU, Japan
J.Nikolis, UP, Greece
T.Head, BU, USA, Vice Chair
C.Benham, MS, USA
R.VanBuskirk, BU-SUNY, USA
E.Dietrich, BU-SUNY, USA
S.Koslow, NIH, USA
M.Huerta, NIH, USA
B.Punch, MSU, USA

Local Arrangement/Publicity Committee
Cynthia Shapiro
Sukarno Mertoguno
James Gattiker
Ali Moghaddamzadeh

Publication/Registration Committee
D.I. Kavraki



INFORMATION FOR AUTHORS
Authors are requested to submit four copies (in English) of
their typed complete manuscript (25 pages max), or an
extended summary (5-10 pages max) by Nov. 31, 1994, to
N. G. Bourbakis, Binghamton University, T.J. Watson School,
AAAI Lab, NY 13902, Tel: 607-777-2165, 607-771-4033;
Fax: 607-777-4822, E-mail: Bourbaki@BingSuns.CC.Binghamton.edu,
or INBS@Bingsuns.CC.Binghamton.edu.
Each manuscript submitted to INBS must indicate the most relevant
areas and include the complete address of at least one of the
authors. Notification of acceptance: Jan. 31, 1995. Final
copies of the papers accepted by INBS are due March 21, 1995.


------------------------------

From: Sreerama Murthy <murthy@blaze.cs.jhu.edu>
Date: Wed, 3 Aug 94 10:19:14 EDT
Subject: Incremental learning


Rob Holte points out an apparent "paradox" in incremental learning.

We have a finite dataset from which we draw one example at a time.
The examples we have not yet drawn define the generalization accuracy
at any moment in time. <some matter deleted> In other words, if the
hypothesis is correct on the example, generalization accuracy drops!
On the other hand, if the new example is one the hypothesis classifies
incorrectly, C remains the same and generalization accuracy rises!
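
A small numeric sketch (not from the original posts) makes the
arithmetic concrete: let U be the number of unseen examples remaining
and C the number of them the current hypothesis classifies correctly,
so that generalization accuracy is C/U.

    # Drawing an example the hypothesis gets right removes one from both
    # C and U, so accuracy falls; drawing one it gets wrong removes one
    # from U only, so accuracy rises.
    C, U = 80, 100
    print(C / U)              # 0.8    before drawing anything
    print((C - 1) / (U - 1))  # ~0.798 after drawing a correctly classified example
    print(C / (U - 1))        # ~0.808 after drawing an incorrectly classified example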

The problem here seems to be the definition of generalization accuracy.
More precisely, generalization accuracy, as Holte defines it, seems to
make sense only when there is an infinite repository of (possibly
redundant) examples. When it is known that the collection of *all possible*
instances is finite, a more appropriate measure seems to be the accuracy
on the entire data, seen or unseen. For example, if we have a classifier
that is correct on all possible instances, except the only remaining unseen
instance, it is not correct to say that the classifier has zero accuracy
on the concept.

-Sreerama K. Murthy murthy@cs.jhu.edu



------------------------------

From: Robert Holte <holte@csi.uottawa.ca>
Subject: Incremental Learning
Date: Tue, 2 Aug 1994 19:49:05 -0400



Hi, thanks for your email about the "paradox" I described in the ML-list.
As several people have raised similar points, I thought I'd answer them
all at once. I think everything will be of interest to everyone, but from
time to time I might make specific points to specific people, so watch for
your name!!

The people involved are:

Peter Clark pclark@cs.utexas.edu
Sreerama Murthy murthy@cs.jhu.edu
Michael Pazzani pazzani@pan.ICS.UCI.EDU
Mark Plutowski pluto@cs.ucsd.edu
Jude Shavlik shavlik@cs.wisc.edu

The point that was common to pretty much all the responses I got was
put this way by Sreerama Murthy:

The problem here seems to be the definition of generalization accuracy.
More precisely, generalization accuracy, as Holte defines it, seems to
make sense only when there is an infinite repository of (possibly
redundant) examples. When it is known that the collection of *all possible*
instances is finite, a more appropriate measure seems to be the accuracy
on the entire data, seen or unseen. For example, if we have a classifier
that is correct on all possible instances, except the only remaining unseen
instance, it is not correct to say that the classifier has zero accuracy
on the concept.

Like this quote, you all say that generalization accuracy is not what
ought to be measured. Most of you suggest specific alternative measures,
which vary in minor details but essentially amount to a PAC-like sampling
framework in which the training set is a bona fide sample, and accuracy is
defined as the probability of drawing an example that is correctly
classified by the current hypothesis (I'll call this "overall accuracy").

Your arguments in favour of measuring overall accuracy and not generalization
accuracy I agree with entirely when I'm wearing my pragmatic hat: we don't
care where accuracy comes from as long as we get it. But as a scientist
studying induction, I do want to know where the accuracy is coming from --
how much is due to sheer memorization, how much to generalization
(I shall leave deduction out of this note, for simplicity)? Cullen Schaffer
tells me this is his position too: he did not mean to say that
generalization accuracy is the only thing that should be measured; he meant
that we should break down overall accuracy into its components.

It seems to me that since our chosen field of study is induction, not rote
memorization, it is generalization accuracy that we ought to be
particularly interested in (in our role as scientists), not overall accuracy.
One further reason why we ought to be interested in analyses of
generalization accuracy is that our standard experimental method
(PARTITION a dataset into training and testing parts) measures
generalization accuracy, not overall accuracy. A few of our datasets happen
to contain a few duplicate records, but those are the exception; as a rule,
our standard methodology, on our standard datasets, is measuring
generalization accuracy.

Cullen's Law and my "paradox" point out that generalization accuracy
behaves rather differently than overall accuracy. But for me, the most
intriguing result pertaining to generalization accuracy is that famous
PAC result (which I'll severely paraphrase assuming you all will recognize
it) that with a polynomial-size sample, ANY hypothesis (from a finite
set of possible hypotheses) consistent with the training data will have
error less than epsilon (with high probability). Since this applies to ANY
hypothesis, it applies to a hypothesis with a generalization accuracy of 0;
i.e. rote memorization PAC-learns, we don't need to generalize at all.
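
For reference, the usual textbook form of that result for a finite
hypothesis class H (stated here for concreteness; the paragraph above
only paraphrases it): if the number of training examples satisfies

    m \ge \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right),

then with probability at least 1 - \delta every hypothesis in H that is
consistent with the training data has true error at most \epsilon.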

Two tiny technical points to close off. First, the analysis I did was
independent of the number of training or testing examples -- it holds
from the first example to the last. Second, I said that, if you want a
monotone increasing learning curve, you MUST change your hypothesis
whenever you get a "confirming" example. I did not say how to do this
change in such a way that your accuracy actually does increase (I haven't
a clue how to guarantee increasing accuracy); all I said was that if you
don't change you can be sure your accuracy will decrease.

Mark Plutowski's note included several tantalizing results from a NIPS
paper of his (reference below). I hope Mark won't mind my appending
some excerpts from his message.

Mark -- please send me a copy of the paper!!

Rob Holte
Computer Science Dept.
University of Ottawa
Ottawa, Ontario
CANADA K1N 6N5



excerpts from the message of Mark Plutowski pluto@cs.ucsd.edu

... the "paradox" [will] disappear
under weak conditions on the data generating process and model [1].

... Given the assumptions employed in [1],
so long as the size of the validation set grows
fast enough relative to the size of the training set, AND if
the examples are i.i.d., sampled from the same space from which
the training examples are drawn, then hold-out set cross-validation
is an accurate and precise estimator of generalization,
(asymptotically, in a statistical sense) where here generalization
is measured by a criterion that gives the average mean square error,
averaged over training sets of a particular size. [1]

In the case that the training set size is small, and delete-1
cross-validation needs to be used, delete-1 cross-validation is not
over-optimistic, in that (again, asymptotically, in a statistical
sense) it is not a downwardly biased estimate of generalization.
(Another result found in [1].)

The important distinction between these last two results and the
paradoxical situation you describe is in
(a) the theoretical framework used to model learning
and
(b) the criterion used to measure generalization.

... If your algorithm can memorize 99.99% of the available examples,
simply because they are available and you have sufficient memory to store
them, AND your algorithm fails on the remaining 0.01% of the examples,
is the algorithm 99.99% accurate (i.e., with 0.01% error over test examples),
or, 0% accurate (100% error over test examples)?

... As measured in [1], the generalization error in the paradox you
describe would be 0.01%, not 100%. This is because the
test examples are i.i.d., drawn from the same space as the
training examples are drawn. Hence, the paradox you
describe highlights the importance of the i.i.d. assumption,
as well as the stationarity assumption (training and test
examples are drawn from the same space), both of which are often
employed (explicitly or implicitly) in much of learning theory.
Conversely, the cross-validation measure would NOT be an accurate
estimator of the measure of generalization you employ, and the
reason why is that one of the key assumptions is violated.

... REFERENCE:

[1] Plutowski, Sakata, and White.
"Cross-validation estimates integrated mean squared error." NIPS 6.



------------------------------

Date: Sat, 30 Jul 1994 17:56:53 +0900
From: Craig Hicks <hicks@cs.titech.ac.jp>
Subject: Incremental Learning


Robert Holte writes:

>This leads to the counterintuitive result that, if I want my accuracy
>always to increase, then when I get a training example "confirming" my
>hypothesis I MUST CHANGE my hypothesis, and when I get an example
>"refuting" my hypothesis, I need not change my hypothesis!

Let's be optimistic! When we get a training example "confirming" our
hypothesis let's just assume that C=U so that C/U = 1 and
(C-1)/(U-1)=1. If it works, don't mess with it!

------------------------------

Date: Sun, 31 Jul 94 23:00:11 MDT
From: dhw@santafe.edu
Subject: Overlooked issues concerning the conservation law.


This message is broken up into two parts, delineated by lines of
= symbols. The first is a discussion of some important theoretical
issues concerning the no-free-lunch/conservation (nfl) results that
have been lost in the current discussion. The second part focusses
on a message Tom Dietterich just posted, drawing attention to a
number of points in that message that might be misleading.

The result is, I'm afraid, that this is a rather lengthy message. My
hope is that people will find the information density high enough
to warrant wading through most of it. Crucial points are set off
by asterisks.

I would like to thank Michael Pazzani for helping me tailor this
message for the ML community.

===============================================================

1) An important point is being lost in the current discussion
of the nfl results. This is the fact that the results deal with
off-training set error. This is the essential new feature that
distinguishes these results from all the previous work (like
Dietterich's) that has been alluded to.

This feature means we aren't giving an algorithm credit for simple
memorization of what it's seen before. Using the nfl results one can
show that when such credit is removed, many of the positive learning results
from computational learning theory that have been referred to in Tom's (and
other people's) messages either get much weaker or disappear entirely.
In particular, not giving credit for memorization has this
effect on many of the VC results, PAC results, and statistical physics
of learning results.

Since nobody's disputed the nfl theorems, I presume everyone agrees
with this important implication concerning computational learning
theory?

*******
* Aside from real-world considerations, the nfl results have
* important implications for the theory of supervised learning.
*******

2) Conventionally, test sets are generated by rerunning the process
that generated the training set, i.e., the test set is generated in
an independent identically distributed (iid) running of that process.
Accordingly, overlap is allowed between the training and testing
sets when one uses iid error as one's performance measure (as opposed
to off-training set error). This means one gets credit (in a low
noise scenario) for simple memorization of the elements in the
training set.
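
A minimal sketch of that distinction (illustrative only, not from the
original message): a pure memorizer scored on an iid test set, where
overlap with the training set is allowed, versus on off-training-set
examples only.

    import random
    random.seed(0)

    instances = list(range(200))                       # small instance space
    target = {x: random.randint(0, 1) for x in instances}
    train = random.sample(instances, 100)              # noise-free training set
    memory = {x: target[x] for x in train}             # memorize, nothing more

    def predict(x):
        return memory.get(x, random.randint(0, 1))     # guess on unseen instances

    iid_test = [random.choice(instances) for _ in range(10000)]  # overlap allowed
    off_train = [x for x in instances if x not in memory]

    print(sum(predict(x) == target[x] for x in iid_test) / len(iid_test))
    # roughly 0.75: half the iid test points are memorized, the rest are guesses
    print(sum(predict(x) == target[x] for x in off_train) / len(off_train))
    # roughly 0.5: chance performance off the training set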

(Aside: both off-training set generation of test sets and iid
generation are "valid". They simply address different issues.)

Now if one has iid generation of test sets (rather than
off-training set generation), so you get credit for memorization,
then one CAN get something-for-nothing results without making
any assumptions about the target. In fact, that's much of what VC
theory is all about!

To give just two examples of such results (neither from VC theory): Some
members of both the PAC and Bayesian communities have claimed to have
first-principles (!!!!!) proofs of Occam's razor. That is, proofs that
purport to make no assumptions about the target. There are a number of
debatable points in those claims, but at a minimum, they clearly
seem to be at odds with the nfl results. Yet very few people
have disputed these claims, and/or tried to clarify them in terms
of the nfl kind of thinking that (judging from the messages that
have been posted to this list) everyone already accepted.

******
* A question for those who "already accepted" the nfl results:
* Do they or do they not believe there are any subtleties related to
* the nfl results lurking in those "first principles proofs" of
* Occam's razor?
******

In general, previous theory not only didn't support the nfl theorems
(as has been suggested); it provided many results that would seem to
contradict the nfl theorems, until one realizes that those results are
based on a different error function. The nfl theorems are something
entirely new.

3) I re-emphasize that VC results giving one confidence in one's
generalization based only on behavior on the training set make NO
assumptions whatsoever about the real world. Yet the nfl kind of
thinking that "everyone already accepted" says that such assumptions
are absolutely critical. What gives? Clearly, there are many formal issues
to be disentangled here that people have previously been unaware of.

4) I would like to (again) second Cullen's statements, this
time concerning the implications of the no-free-lunch theorems
for the validity of real-world learning. *Obviously* practical
learning is based on assumptions about the real world that violate
the uniformity condition underlying the nfl theorems. The central point of
those theorems is to show that such an assumption *must always* be
present if we are to have any belief in learnability in the problem
at hand. (Cullen's discussion of C4.5 makes this point well.)

*****
* However, to give just one example, nobody has yet delineated
* just what those assumptions are for the technique of cross-validation.
* One shouldn't be too glib in saying "yeah, yeah, you gotta make
* assumptions; I already knew that, and I know the ones I'm making"
.
*****

5) A number of the correspondents have insinuated that, in effect,
the uniformity condition behind the nfl theorems has measure 0, and
therefore shouldn't be of concern. However if one relaxes uniformity,
one can just as easily create scenarios in which one's favorite algorithm
performs *worse* than random as scenarios in which it performs
better. That is the underlying reason why the uniformity condition is
of interest: not because we believe that the universe actually is
random. (This is a point Cullen touches on.)

6) Finally, Robert Holte's "paradox" is exactly correct (and of course
not a paradox). The effect he discusses is closely tied to the
fact that learning curves can get *worse* for increased training set
size, even for the Bayes optimal generalizer, if one focusses on
off-training set behavior. (Essentially, to get such "bad" curves one
should have the sampling distribution favor those points giving
Robert's paradox.)


============================================================


Tom Dietterich writes:

>>>
We never apply machine learning algorithms to data unless we believe
there is a pattern in that data...
The conservation law is averaging ... over all possible target concepts ...
This is equivalent (as Mike Kearns observed at ML94) to permitting the
teacher to label .... examples by flipping a coin... we
can all agree that data sets constructed this way do not contain any
"pattern".
>>>

The "coin-flipping" view of how the results are derived is well known.
In fact, it was the initial intuition for why the results (and all
their implications) must hold.

One should be careful though with the rest of Tom's comment. In particular,
based on his first quoted sentence it seems that he uses "data set" to
mean training set. (Though this isn't crystal clear.) Then with
his last sentence he seems to imply that the condition underlying
the conservation law can be dismissed based on lack of patterns in
that data set.

However the important point is that *it doesn't matter whether
the data set contains a pattern or not*. (!!!) Or even whether the
test set also contains a pattern. (The no-free-lunch results do NOT
rest on the assumption that there is no pattern in the test set at
hand. In fact, they average over cases involving all possible patterns.)

What matters is whether the two patterns are similar.

The whole point is that the patterns (if any) over the test set
need not correspond in any manner to those over the training set.

This is true even in the data-averaged case (i.e., where one averages
over training sets while keeping the target fixed), which is where
things start to get counter-intuitive, PAC-type intuition starts
to break down, etc.

***

Tom goes on to write:

>>>
And
of course algorithms embodying these assumptions work quite well on a
wide range of real problems, which provides evidence that these
assumptions hold for these problems.
>>>>

Again, one must be careful. For simplicity, equate the
validity of the assumptions behind the algorithms with how well those
algorithms perform on the problems at hand. Then either Tom's statement
is trivial (if your algorithm performed well on training set A and test
set B, that "provides evidence" that it performs well on training set A and
test set B), or it is not true.

To make the second possibility more precise: it does not matter how
well your algorithm did on training set A and test set B, as far as
some new test set C (generated by the same problem that gave us A
and B, but with no overlap with A or B) is concerned. Or more
precisely still, one has to make some sort of assumptions - which
nobody has ever even delineated - for behavior with A and B to correlate
with C.

In short, the "assumptions behind the algorithms" need not carry over to
novel sets C generated by the same problem.

That's what the no-free-lunch/conservation results tell us.


***

Tom then writes:

>>>
on occasion I have encountered a
learning problem where algorithms such as nearest neighbor and C4.5 do
not do as well as I would like ...
This pushes me to shift my weak assumptions so that they
match better the target concepts of interest.
>>>

I presume that Tom learned that the algorithms weren't doing well
by using a validation set ... So he is implicitly saying that
he will use the assumptions behind cross-validation (which have
never been explicitly delineated) to test his other assumptions.

The implication is that his other assumptions might be in error.
But if so, then could not his assumptions behind cross-validation
also be in error? Remember, the no-free-lunch theorems say that
cross-validation fails as much as it succeeds.

Indeed, I have stumbled upon cases where cross-validation failed
badly. (I was not trying to find such cases when this happened.)
I talked about them in my PhD work that was published in Complex
Systems about 5 years ago.


****

Tom ends with:

>>>
"What is the weakest assumption
that I am willing to make about the target concept?"
You can then
apply computational learning theory to determine whether learning is
possible under this assumption.
>>>

No you can't. Not if you're interested in off-training set behavior,
and not if you wish to use current computational learning theory.

Such current theory has not been extended to off-training set
behavior, and the extension is not trivial.




David Wolpert

------------------------------

Date: Thu, 4 Aug 94 20:58:02 EDT
From: gordon@aic.nrl.navy.mil
Subject: Conservation Law

Schaffer's Conservation Law is a counting argument that does not
consider distributions over situations. Our goal is to characterize
precisely when one gets zero EXPECTED generalization performance (EGP)
over all test cases (instances), and thus we have to consider the
distribution over situations.

In a previous message we stated:

> It seems apparent to us that only a uniform distribution over all possible
> situations would make the counting argument completely true in the world.

Actually, there are infinitely many possible distributions that yield
zero EGP. In a nutshell, you will get zero EGP for every learner L over all
test cases iff each test case is equally likely to be positive or negative.
For the sake of brevity we omit our proof, although we will be happy to
mail it to anyone who is interested.
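
The "if" direction, at least, can be sketched in one line (this is not
the omitted proof, just an illustration): if, given whatever hypothesis
h the learner produced, each of the m test cases x_i is equally likely
to be labelled positive or negative, then

    \mathrm{E}[\mathrm{acc}(h)] = \frac{1}{m}\sum_{i=1}^{m}\Pr[t(x_i)=h(x_i)]
                                = \frac{1}{m}\sum_{i=1}^{m}\frac{1}{2} = \frac{1}{2},

so the expected generalization performance E[acc(h)] - 1/2 is zero for
every learner.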

This agrees with what Tom Dietterich said on ml-list:

> is equivalent (as Mike Kearns observed at ML94) to permitting the
> teacher to label the test set examples by flipping a coin.

Like Tom, we don't believe a distribution meeting the above criterion is
what we encounter in the world. If the distribution does not meet the above
criterion, there exist learners that can exhibit positive or negative
expected generalization performance.

Diana Gordon
William Spears

------------------------------

Date: Wed, 3 Aug 94 10:08:21 EDT
From: hartley@aic.nrl.navy.mil
Subject: Conservation Law


The conservation law, though clearly true, may or may not be relevant
to the real world. In the paper, and in some of the
discussion on this list, Schaffer tries to use examples to show that it
is. I propose to PROVE that it is not.

I will show that, the conservation law notwithstanding, there MAY exist
a universal generalization algorithm (with better than chance
performance). I say "may" because I will not prove that one does
exist (and certainly will not produce one), but only that its
existence is not ruled out by the conservation law.

The key is in the definition of the word "universal." There are two
relevant definitions:

Applicable independent of any assumptions.
and
Applicable throughout the entire universe.

The first is what most mathematicians mean by universal, and by this
definition the conservation law rules out any useful generalization.

However when asking about the real world it is the second definition
that is important. What could happen in other conceivable universes
is of no possible interest to us.

The definitions are different because not all concepts that exist in a
mathematical sense can be generated by any mechanism within our
universe. Schaffer is summing over the wrong set.

To see that this is true, consider the attribute vectors {A1, A2, ..., Am}.
If there are k (binary, for the sake of argument) attributes, then the
number of different attribute vectors is m = 2^k. There are then 2^(2^k)
possible (deterministic) concepts. For a mechanism to generate an
arbitrary concept from this set, it needs at least 2^k bits of
storage. A mechanism with b<2^k bits can only generate 2^b < 2^(2^k)
different concepts, because that is all it can represent.

Now the universe (visible to date; I do not wish to discuss
cosmology) contains, say, 2^(200+-50) atoms; the exact number is not
important. Assuming each atom forms one bit and the entire universe
conspires to generate a learning situation, it cannot generate more
than 2^(2^200) different concepts.

Therefore if k, the number of attributes, is more than a couple of
hundred, there will be some concepts that CANNOT be generated. In
fact MOST concepts will be of this kind. But the conservation law
gives all these impossible concepts the same weight as the ones that
can actually occur.
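
To quantify "most": of the 2^(2^k) definable concepts, a b-bit
mechanism can generate at most 2^b, i.e. a fraction of at most

    \frac{2^{b}}{2^{2^{k}}} = 2^{\,b - 2^{k}},

which for b on the order of 2^200 bits and k of a few hundred attributes
is astronomically small.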

It is as if the law of conservation of momentum required the total
momentum change to be zero when summed not over all objects that are
present, but over all conceivable (but not necessarily realizable)
objects.

The total generalization score of a given algorithm over the smaller
set of realizable concepts may still be zero, or it may not be. The
conservation law tells us nothing about this. If there exists an
algorithm for which the sum is sufficiently different from zero this
would be a universal generalization algorithm in any useful sense of
the phrase.

The above calculations are conservative. You will usually be
justified in assuming that whatever generated the concept to be
learned is much smaller than the entire universe. There may be
limits on what concepts can be produced beyond the simple counting
argument. There may also be universal laws concerning the
probability of different concepts, which would mean that the average
generalization performance for practical purposes should be a weighted
sum.

Ralph Hartley
hartley@aic.nrl.navy.mil


------------------------------

Date: Mon, 8 Aug 1994 10:39:43 +1000
From: Wayne Jenkins <wtj@cs.rmit.oz.au>
Subject: Conservation Paper

The conservation law has a straightforward intuitive appeal.

Let a1, ..., am be the set of attribute vectors not included in the training
set. A learning algorithm chooses some labelling drawn from {0,1}^m as
the hypothesis h. The target function t is also a labelling drawn from
{0, 1}^m. Suppose that the generalization accuracy of h with respect
to t is y. Another target function t' is constructed as the bitwise
negation of the labelling t. Suppose that the generalization accuracy of h
with respect to t' is z. It is easily shown that y + z = 1 and hence that
(y - 0.5) + (z - 0.5) = 0 (i.e. the generalization performances cancel). Regardless
of the hypothesis chosen, each target function t will have a counterpart t'
such that y + z = 1. It follows that the sum of the generalization
performance across all target functions will be zero (i.e. each target
function has a counterpart).
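
A tiny enumeration (an illustrative sketch, not from the original post)
checks the pairing argument numerically for m = 4 unseen attribute
vectors and an arbitrary fixed hypothesis:

    from itertools import product

    m = 4                                # unseen attribute vectors a1..a4
    h = (0, 1, 1, 0)                     # any fixed hypothesis labelling

    total = 0.0
    for t in product((0, 1), repeat=m):  # every possible target labelling
        accuracy = sum(hi == ti for hi, ti in zip(h, t)) / m
        total += accuracy - 0.5          # generalization performance vs. chance

    print(total)                         # prints 0.0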

I don't find this a negative result but rather an encouraging one.
As previous discussions on this thread have suggested, if the probability
of occurrence of target functions in our physical world is not uniform,
then it would be possible to construct learning algorithms that
exploit this regularity and hence have positive generalization
performance. I fail to see why Cullen has not addressed this point
in his ML94 paper. He calculates the expected generalization performance
across all attribute vectors not in the training set and across all training
sets. I would have thought that the obvious extension would be to
calculate the expected value of generalization performance across
all class probability vectors. I suspect that if this were done then
it would show that it is possible to construct robust learners for
a given (non-uniform) distribution over class probability vectors.


Cullen has further stated in this thread that

In building algorithms, we naturally do make distributional
assumptions and try to allocate positive generalization performance to
concepts we think likely and negative generalization performance to
those we think unlikely. To me, the main point of the conservation
law paper is to draw attention to the fact that this is ~all we are
doing.

I agree strongly with this statement. The conservation paper has made the
arguments in Wolpert's papers even clearer. We need to make distributional
assumptions about the class probability vectors (target functions) to enable
learning algorithms to have positive generalization performance.

Wayne.

------------------------------

From: Cullen Schaffer <schaffer@roz.hunter.cuny.edu>
Date: Fri, 5 Aug 94 14:40:29 EDT
Subject: Conservation Law


Mike Pazzani suggested the following in a recent note to me:

Perhaps you can help with one issue that continues to be raised in
questions to ML-LIST. In particular, are there any circumstances
(e.g., if the learner knows something about the distribution of
concepts or the distribution of examples) under which the
generalization performance of a learner can be greater than 0? If so,
can you give a few examples of circumstances in which the Conservation
Law does not apply?

The conservation law, being a law, always applies. The question is
only whether it tells us something useful. To take an extreme
example, suppose we have an attribute vector space with ten binary
attributes and hence 2^2^10 definable concepts. Suppose further that
we know that only two of these concepts actually arise in our
application domain: either instances are all positive or they're all
negative. The obvious induction algorithm for this problem--which
looks at one example and predicts whichever class it sees forever
after--will clearly have a generalization accuracy of 100 percent.

However, it's still true that the sum of the generalization
performance of this algorithm over all of the 2^2^10 definable
concepts is zero. The conservation law applies; we just don't find
what it says to be interesting in this case.
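
A toy version of this example (an illustrative sketch, not from the
original message), small enough to enumerate: k = 2 binary attributes,
4 instances, 2^(2^2) = 16 definable concepts, and a learner that sees
one training instance and predicts its class for every unseen instance.
It is 100 percent accurate on the two constant concepts, yet its
generalization performance summed over all 16 concepts is still zero.

    from fractions import Fraction
    from itertools import product

    instances = list(range(4))     # instance 0 is the single training example
    unseen = instances[1:]

    total = Fraction(0)
    for c in product((0, 1), repeat=4):        # every definable concept
        prediction = c[0]                      # memorize the training label
        acc = Fraction(sum(c[i] == prediction for i in unseen), len(unseen))
        total += acc - Fraction(1, 2)          # performance relative to chance
        if c in ((0, 0, 0, 0), (1, 1, 1, 1)):
            print("constant concept", c, "accuracy", acc)   # accuracy 1

    print("sum over all 16 concepts:", total)                # 0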

Since it seems to be causing so much trouble, let me make it perfectly
clear that

The conservation law makes no statement about AVERAGE generalization
performance. In particular, it does not rule out a positive average.
If we know something a priori about the relationship between
attributes and class--or equivalently about the concept
distribution--we can (and routinely do) apply that knowledge successfully.

When I personally discuss the relevance of the conservation law, I'm
not asking about average performance at all. Instead, I'm wondering
about the existence of certain types of individual problems. The
conservation law tells us

1. That every learner performs worse than chance in generalization on
some problems and

2. That improving the generalization performance of a learner in some
situations entails degrading it in others.

My question is whether these facts ever apply for real learners and
problems of interest. In other words,

1. Are there "real-world" problems where important learners like
C4.5 perform worse than chance in generalization?

2. Are there "real-world" problems where important enhancements
like cross-validation-based decision-tree pruning decrease
generalization performance?

Theory has absolutely nothing to say about these questions. The
conservation law notwithstanding, it's perfectly possible for the
answer to be no in both cases. On the other hand, the empirical fact
is that the answer in both cases is yes.

Moreover, it's my opinion--not based on either theory or
observation--that there are probably lots of important problems of
this kind and that the main reason we aren't familiar with them is
that research effort has gone almost exclusively into finding positive
examples in which intuitively satisfying induction algorithms perform
better than chance and intuitively satisfying algorithm enhancements
increase performance.

If the whole field were now to spend ten years looking for examples of
the opposite kind based on current learners, I think we'd turn up a
lot of them. And I think the fact of finding so many--assuming I'm
right--would give us a very different perspective on "progress" in the
field.

Cullen









------------------------------

From: Steven Salzberg <salzberg@blaze.cs.jhu.edu>
Date: Tue, 9 Aug 94 11:19:18 EDT
Subject: "conservation law" discussion


I find it rather astonishing that this discussion has gone on so long.
This is probably a consequence of the unusually entertaining way in
which Cullen Schaffer presented his results at ML94. At the risk of
prolonging the discussion much longer, but with the hope of injecting
some moderation, I would like to make a few comments.

The "conservation law" and its proof, as stated in the ML94 paper,
are, to me, quite intuitive. (One reason Schaffer gives for the
significance of his result is that it is counterintuitive; this
depends on each person's intuitions, of course.) By making all
possible class assignments equally likely, the summation obviously
will turn out to be zero. This was observed by Diana Gordon and Bill
Spears in an earlier posting on ml-list:

2) However, we have serious doubts about its usefulness. This law appears
to be essentially a counting argument, where we count over all situations
S. Thus, this law says something about the real world if all possible
situations occur, and if all possible situations occur with
equal probability.

Tom Dietterich pointed out further evidence that the law is
unsurprising:

Yes, of course the law is correct. Furthermore, it is just a
corollary of the results from PAC learning theory.

I have discussed this "law" with a few theoreticians, and they agree
with Tom. I have also concluded that it would be embarrassing if
someone from the experimental machine learning community tried to make
a big deal of this result. Thus I'm a bit dismayed to see this
discussion continuing for so long in ml-list. Basically, the result
should not be surprising to anyone, and if it is, it means the
community has not thought very carefully about what it's doing.
(Some of the discussion has made it clear, however, that at least some
ML researchers did not find the law surprising.)

David Wolpert, arguing the other way, commented that:

2) Another important point is that PAC is quite limited in its
theoretical view of the world, whereas the no-free-lunch/conservation
results hold in a very broad context.

This criticism of PAC is completely unjustified by the actual
mathematical result. PAC results typically describe specific concept
classes, but I would argue that this makes them *more* relevant than
Schaffer's or Wolpert's results. If I want to study a specific
problem, then I'm looking at a specific concept class, not the space
of all concepts.

Finally, Schaffer commented that:

***************************************************************
* 2. I don't view the conservation law as a negative result. *
***************************************************************

I guess the fault lies with me for starting my talk by summing up
several hundred years of philosophy with the statement "induction is
impossible."
[...]

The talk was very entertaining, but there is no doubt that Cullen
presented it as an astounding negative result, one that should
surprise us all and make us re-think our research. I recall that the
very first slide said simply, "fasten your seat belts." So it's
rather surprising to see the statement above, apparently reversing the
main claim of the talk. To Cullen's credit, the paper is relatively
modest in its claims and arguments, and does not include the statement
"induction is impossible." Perhaps now we can settle down and move
ahead with our research on trying to develop algorithms for inducing
the kinds of concepts that occur in the various real-world domains
that concern us.

Steven Salzberg


