Machine Learning List: Vol. 6 No. 29
Sunday, November 27, 1994
Contents:
UCI Machine Learning Repository
KDD Tutorial Overheads Available by FTP
Special Issue on Lazy Learning (AI Review Journal)
AAAI Symposium: Genetic Programming (Nov. 95) CFP
Machine Learning Conference Final Call for Papers
Active Learning Symposium: preliminary Call for Participation
OFF-TRAINING SET ERROR
Responses about C4.5 on a PC
The Machine Learning List is moderated. Contributions should be relevant to
the scientific study of machine learning. Mail contributions to ml@ics.uci.edu.
Mail requests to be added or deleted to ml-request@ics.uci.edu. Back issues
may be FTP'd from ics.uci.edu in pub/ml-list/V<X>/<N> or N.Z where X and N are
the volume and number of the issue; ID: anonymous PASSWORD: <your mail address>
URL: http://www.ics.uci.edu/AI/ML/Machine-Learning.html
----------------------------------------------------------------------
Subject: UCI Machine Learning Repository
Date: Thu, 17 Nov 1994 22:35:18 -0800
From: "Patrick M. Murphy" <pmurphy@focl.ICS.UCI.EDU>
The following is a list of databases that have recently been
added to the UCI Machine Learning Repository.
Any comments or donations would be greatly appreciated
(ml-repository@ics.uci.edu).
Patrick M. Murphy (Site Librarian)
David W. Aha (Off-Site Assistant)
P.S. Thank you to all those who donated these databases.
* Isolet Spoken Letter Recognition database
Created by Ron Cole and Mark Fanty
Donated by Tom Dietterich
This data set was generated as follows. 150 subjects spoke the
name of each letter of the alphabet twice. Hence, we have 52
training examples from each speaker. The speakers are grouped
into sets of 30 speakers each. 6238 + 1559 instances, 26 classes
(one for each letter). All 617 attributes are real-valued scaled
from -1.0 to 1.0. No missing values.
* Sponge Database
Created by Iosune Uriz and Marta Domingo
Donated by Javier Bejar and Ulises Cortes
The problem is to classify Atlantic-Mediterranean marine sponges.
There are 76 instances described using 45 nominal and numeric
attributes (some missing values).
* Badges Database
Created and donated by Haym Hirsh
This problem was generated for attendees of MLC94 to solve.
Instances are described using a sequence of characters (the
attendee's name). There are 294 instances and 2 classes.
* Chess Endgame Database for White King and Rook against Black King
Created by Michael Bain and Arthur van Hoff
Donated by Michael Bain
A KRK database was described by Clarke (1977). The current database
was described and used for machine learning experiments in Bain (1992;
1994). It should be noted that our database is not guaranteed correct,
but the class distribution is the same as Clarke's database. In (Bain
1992; 1994) the task was classification of positions in the database
as won for white in a fixed number of moves, assuming optimal play by
both sides. The problem was structured into separate sub-problems by
depth-of-win, ordered draw, zero, one, ..., sixteen. When learning
depth d, all examples at depths > d are used as negatives. Quinlan (1994)
applied Foil to learn a complete and correct solution for this task.
28056 instances, 6 nominal features.
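For concreteness, here is a minimal sketch of the depth-d labeling
scheme just described. Two details are assumptions, not taken from the
description above: draws are coded with a sentinel deeper than any
win, and positions at depths < d are simply excluded (the text only
states that depths > d serve as negatives).

    /* Sketch of the depth-d sub-problem labeling described above.
       Assumed: DRAW sentinel; exclusion of positions shallower than d. */
    #include <stdio.h>

    #define DRAW 999  /* hypothetical sentinel: draws count as deeper than any win */

    /* 1 = positive, 0 = negative, -1 = excluded from the depth-d task */
    int label_for_depth(int depth_of_win, int d)
    {
        if (depth_of_win == d) return 1;  /* won for white in exactly d moves */
        if (depth_of_win > d)  return 0;  /* deeper wins and draws are negatives */
        return -1;                        /* shallower wins: left out here */
    }

    int main(void)
    {
        int depths[] = { 0, 3, 5, 16, DRAW };
        int i;
        for (i = 0; i < 5; i++)
            printf("depth %3d -> label %2d for the d=5 sub-problem\n",
                   depths[i], label_for_depth(depths[i], 5));
        return 0;
    }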
* Document Understanding Database
Created and donated by Donato Malerba.
There are five concepts, expressed as predicates, to be learned.
They concern five logical components that it is possible to
identify in a sample of business letters, namely sender, receiver,
logotype, reference number and date. The problem is complicated
by the presence of dependencies among concepts. The problem can
be cast as a multiple predicate learning problem. Experimental
results show that learning contextual rules, that is rules
in which concept dependencies are explicitly considered,
leads to better results.
* C++ utility to convert the artificial character database instances
into a 12x8 pixel array.
Created and donated by Scott Derrick
* MUSK databases (2)
Created by AI Group at Arris Pharmaceutical Corporation
Donated by Tom Dietterich
This dataset describes a set of molecules, some of which were judged
by human experts to be musks and the remaining were judged to be non-musks.
The goal is to learn to predict whether new molecules will be musks or
non-musks. However, the 166 features that describe these molecules depend
upon the exact shape, or conformation, of the molecule. Because bonds can
rotate, a single molecule can adopt many different shapes. To generate
this data set, the low-energy conformations of the molecules were generated
and then filtered to remove highly similar conformations. A feature vector
was extracted that described each remaining conformation.
This many-to-one relationship between feature vectors and molecules
is called the "multiple instance problem". When learning a
classifier for this data, the classifier should classify a molecule
as "musk" if ANY of its conformations is classified as a musk. A
molecule should be classified as "non-musk" if NONE of its
conformations is classified as a musk. 476 and 6,598 instances.
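For illustration, a minimal sketch of this bag-level rule;
classify_conformation() is a hypothetical stand-in for whatever
single-instance classifier has been learned over the 166 features,
and its threshold here is purely illustrative.

    /* Bag-level rule stated above: a molecule is "musk" if ANY
       conformation is classified musk, "non-musk" if NONE is. */
    #include <stdio.h>
    #include <stddef.h>

    #define N_FEATURES 166

    /* hypothetical per-conformation classifier: 1 = musk, 0 = non-musk */
    static int classify_conformation(const double f[N_FEATURES])
    {
        return f[0] > 0.5;  /* placeholder decision rule */
    }

    static int classify_molecule(double conf[][N_FEATURES], size_t n)
    {
        size_t i;
        for (i = 0; i < n; i++)
            if (classify_conformation(conf[i]))
                return 1;   /* any musk conformation => molecule is musk */
        return 0;           /* no musk conformation  => molecule is non-musk */
    }

    int main(void)
    {
        static double molecule[3][N_FEATURES];  /* three conformations, zeroed */
        molecule[2][0] = 0.9;                   /* make one conformation "musk" */
        printf("molecule is %s\n",
               classify_molecule(molecule, 3) ? "musk" : "non-musk");
        return 0;
    }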
* German Credit Database (from Statlog project)
Created by Professor Dr. Hans Hofmann
Donated by Gholamreza Nakhaeizadeh
This dataset classifies people described by a set of attributes as good or
bad credit risks. Comes in two formats (one all numeric). Also comes with
a cost matrix.
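As a sketch of how such a cost matrix is applied when scoring a
classifier (the matrix entries below are illustrative placeholders,
not the values shipped with the database; the point is only that the
two error types are penalized differently):

    #include <stdio.h>

    enum { GOOD = 0, BAD = 1 };

    static double total_cost(const int *truth, const int *pred, int n,
                             double cost[2][2])
    {
        double sum = 0.0;
        int i;
        for (i = 0; i < n; i++)
            sum += cost[truth[i]][pred[i]];  /* indexed [actual][predicted] */
        return sum;
    }

    int main(void)
    {
        /* illustrative: accepting a bad risk costs more than rejecting a good one */
        double cost[2][2] = { { 0.0, 1.0 },
                              { 5.0, 0.0 } };
        int truth[] = { GOOD, BAD, BAD, GOOD };
        int pred[]  = { GOOD, GOOD, BAD, BAD };
        printf("total cost = %.1f\n", total_cost(truth, pred, 4, cost));
        return 0;
    }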
* Qualitative Structure Activity Relationships (QSARs)
Donated by Ross King
Learning Qualitative Structure Activity Relationships (QSARs). Two sets of
datasets for the Inhibition of Dihydrofolate Reductase by Pyrimidines, and
the inhibition of Dihydrofolate Reductase by Triazines. Data comes in
multiple formats, one for Inductive Logic Programming (ILP), one for
propositional machine learning discrimination and one for propositional
machine learning regression.
* Moral Reasoner Database & Theory
Created by T.R. Shultz & J.M. Daley
Donated by James Wogulis
This is a rule-based model that qualitatively simulates moral reasoning.
The model was intended to simulate how an ordinary person, down to about
age five, reasons about harm-doing. The horn-clause theory and the 202
instances are the same as were used in (Wogulis, 1994). Theory includes
negated literals.
------------------------------
Date: Thu, 17 Nov 1994 17:03:33 +1100 (EST)
From: Xindong Wu <xindong@insect.sd.monash.edu.au>
Subject: KDD Tutorial Overheads Available by FTP
KDD Tutorial Overheads Available by FTP
The overheads for the AI'94 Tutorial on Intelligent Learning Database
Systems are now available by anonymous ftp from coral.cs.jcu.edu.au in
pub/HCV/KDD.ps.
The following is an outline of the tutorial.
Knowledge acquisition from databases is a research frontier for both
database technology and machine learning (ML) techniques, and has
seen sustained research over recent years. It also acts as a link
between the two fields, thus offering a dual benefit. Firstly,
since database technology has already found wide application in many
fields, ML research obviously stands to gain from this greater
exposure and established technological foundation. Secondly, ML
techniques can augment the ability of existing database systems to
represent, acquire, and process the kinds of expertise that form
part of the semantics of many advanced applications
(e.g. CAD/CAM). This full-day tutorial presents and discusses
techniques for the following 3 interconnected phases in constructing
intelligent learning database systems: (1) Translation of standard
database information into a form suitable for use by a rule-based
system; (2) Using machine learning techniques to produce rule bases
from databases; and (3) Interpreting the rules produced to solve
users' problems and/or reduce data spaces. The tutorial suits a wide
audience, including postgraduate students and industrial practitioners
from the database, expert systems, and machine learning fields.
Comments and suggestions for improvements are solicited!
Xindong Wu
------------------------------
From: aha@aic.nrl.navy.mil
Subject: Special Issue on Lazy Learning (AI Review Journal)
Date: Fri, 18 Nov 1994 16:58:09 -0500 (EST)
AI REVIEW JOURNAL
Special Issue on
**LAZY LEARNING**
Traditional learning algorithms compile data into abstractions in the
process of inducing concept descriptions. Lazy learning algorithms,
also known as instance-based, memory-based, exemplar-based,
experience-based, and case-based, instead delay this process and
represent concept descriptions with the data itself. Lazy learning
has its roots in disciplines such as pattern recognition, cognitive
psychology, statistics, cognitive science, robotics, and information
retrieval. It has received increasing attention in several AI
disciplines during the past decade as researchers have explored issues
on massively parallel approaches, cost sensitivity, matching
algorithms for use with symbolic and structured data representations,
formal analyses, rule extraction, feature selection, interaction with
knowledge-based systems, integration with other learning/reasoning
approaches, and numerous application-specific issues. Many reasons
exist for this level of activity: these algorithms are relatively easy
to present and analyze, are easily applied, have promising performance
on some measures, and are the basis for today's commercially popular
case-based reasoning systems. In view of their growing popularity, a
special issue of the AI Review Journal, planned for the fall of
1995, will be devoted to topics related to lazy learning approaches.
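As a concrete rendering of the idea, here is a minimal
1-nearest-neighbor sketch, the most familiar lazy learner: "training"
merely stores the data, and all computation is deferred to query
time. The toy data and Euclidean distance are illustrative choices
only, not anything mandated by this call.

    #include <stdio.h>

    #define N_TRAIN 4
    #define N_DIM   2

    static const double train_x[N_TRAIN][N_DIM] = {
        { 0.0, 0.0 }, { 0.1, 0.2 }, { 0.9, 1.0 }, { 1.0, 0.8 }
    };
    static const int train_y[N_TRAIN] = { 0, 0, 1, 1 };

    static int predict_1nn(const double q[N_DIM])
    {
        int i, j, best = 0;
        double best_d2 = 1e30;
        for (i = 0; i < N_TRAIN; i++) {
            double d2 = 0.0;
            for (j = 0; j < N_DIM; j++) {
                double diff = q[j] - train_x[i][j];
                d2 += diff * diff;
            }
            if (d2 < best_d2) { best_d2 = d2; best = i; }
        }
        return train_y[best];  /* the stored data IS the concept description */
    }

    int main(void)
    {
        double q[N_DIM] = { 0.8, 0.9 };
        printf("query classified as %d\n", predict_1nn(q));
        return 0;
    }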
Papers are solicited in, but not limited to, the following areas:
* Novel algorithms and approaches for lazy learning, particularly
those that address relevant open issues, introduce a novel approach
for solving them, include rigorous evaluations with alternative
approaches, and investigate the cause and scope of the benefits
displayed by the new approach. We also welcome such submissions
that highlight algorithms which, though successfully employed in
other areas, are not well known within the AI community.
* Mathematical analyses of lazy learning algorithms, focusing on
their computational behavior; average case analyses are especially
encouraged.
* Descriptions of mature applications of lazy learning technologies,
focusing on their evaluation in an embedded system and contrasting
them with previous such applications.
* Novel integrations in embedded systems with a focus on the benefits
derived by using lazy learning approaches.
* Discussion and evaluation of hypotheses concerning the benefits
and/or limitations of lazy (vs. eager) learning approaches.
* Organized surveys on lazy learning approaches that span several
disciplines, focusing on recent progress in each, and including
a framework that can be used to compare, contrast, and communicate
different approaches with respect to the objectives of each discipline.
The Artificial Intelligence Review serves as a forum for the work of
researchers and application developers from Artificial Intelligence,
Cognitive Science and related disciplines. The Review publishes
state-of-the-art research and applications, and critical evaluations
of techniques and algorithms from these fields. The Review also
presents refereed survey and tutorial articles, as well as reviews and
commentary on topics from these disciplines.
***Instructions for Submitting Papers***
Full-length manuscripts should be no more than 30 printed pages
(approximately 15,000 words) with a 12-point font and 18-point
spacing, including figures and references. Technical note submissions
of half this length are also encouraged. Submissions must not have
appeared in, nor be under consideration by, other journals. Include a
separate page specifying the title and giving the preferred address of
the contact author for correspondence (including email address, postal
address, FAX number, and telephone number). Send FOUR copies of each
submission to the guest editor listed below. For additional
information, contact the guest editor.
***Important Dates***
Manuscripts Due: March 15, 1995
Acceptance Notification: May 15, 1995
Final manuscript due: June 25, 1995
Publication date of issue: November 15, 1995
**Guest Editor***
David W. Aha
Navy Center for Applied Research in AI
Code 5510
Naval Research Laboratory
4555 Overlook Ave, SW
Washington, D.C. 20375 USA
aha@aic.nrl.navy.mil
(202) 767-9006 (Voice)
(202) 767-3172 (FAX)
------------------------------
Date: Fri, 11 Nov 94 18:59:25 EST
From: Eric Siegel <evs@cs.columbia.edu>
Subject: AAAI Symposium: Genetic Programming (Nov. 95) CFP
******************* Call for Participation ****************************
GENETIC PROGRAMMING
1995 AAAI Fall Symposium Series
Cambridge, Massachusetts
November 10 - 12, 1995 (Friday-Sunday)
Chairs: Eric V. Siegel, Columbia University
John R. Koza, Stanford University
Committee: Lee Altenberg, Duke University
David Andre, Stanford University
Robert Collins, USAnimation, Inc.
Frederic Gruau, Stanford University
Kim Kinnear, Adaptive Computing Technology
Brij Masand, GTE Labs
Sid R. Maxwell, Borland International
Conor Ryan, University College Cork
Andy Singleton, Creation Mechanics, Inc.
Walter Alden Tackett, Neuromedia
Astro Teller, Carnegie Mellon University
Genetic programming (GP) extends the genetic algorithm to the domain of
computer programs. In genetic programming, populations of programs are
genetically bred to solve problems. Genetic programming can solve problems
of system identification, classification, control, robotics, optimization,
game-playing, and pattern recognition.
Starting with a primordial ooze of hundreds or thousands of randomly
created programs composed of functions and terminals appropriate to the
problem, the population is progressively evolved over a series of
generations by applying the operations of Darwinian fitness proportionate
reproduction and crossover (sexual recombination).
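As one concrete piece of this loop, here is a minimal sketch of
fitness-proportionate ("roulette-wheel") selection; crossover over
program trees is omitted, and the fitness values are invented for
illustration.

    #include <stdio.h>
    #include <stdlib.h>

    /* Pick an individual with probability proportional to its fitness. */
    static int select_proportionate(const double *fitness, int n)
    {
        double total = 0.0, spin, acc = 0.0;
        int i;
        for (i = 0; i < n; i++)
            total += fitness[i];
        spin = total * rand() / ((double)RAND_MAX + 1.0);
        for (i = 0; i < n; i++) {
            acc += fitness[i];
            if (spin < acc)
                return i;
        }
        return n - 1;  /* guard against rounding at the top of the wheel */
    }

    int main(void)
    {
        double fitness[4] = { 1.0, 3.0, 5.0, 1.0 };  /* made-up values */
        int counts[4] = { 0, 0, 0, 0 }, i;
        for (i = 0; i < 10000; i++)
            counts[select_proportionate(fitness, 4)]++;
        for (i = 0; i < 4; i++)
            printf("individual %d selected %d times\n", i, counts[i]);
        return 0;
    }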
Topics of interest for the symposium include:
The theoretical basis of genetic programming
Applications of genetic programming
Rigorousness of validation techniques
Hierarchical decomposition, e.g. automatically defined functions
Competitive coevolution
Automatic parameter tuning
Representation issues
Genetic operators
Establishing standard benchmark problems
Parallelization techniques
Innovative variations
The format of the symposium will encourage interaction and discussion, but
will also include formal presentations. Persons wishing to make a
presentation should submit an extended abstract of up to 2500 words of
their work in progress or completed work. For those abstracts accepted,
full papers will be due at a date closer to the symposium.
Persons not wishing to make a presentation are asked to submit a one-page
description of their research interests since there may be limited room for
participation.
Submit your abstract or one-page description as plain text electronically
by Friday April 14, 1995, with a hard-copy backup to:
Eric V. Siegel
AAAI GP Symposium Co-Chair
Columbia University
Department of Computer Science
500 W 120th Street
New York, NY 10027, USA
fax: 212-666-0140
e-mail: evs@cs.columbia.edu
Sponsored by the American Association for Artificial Intelligence
445 Burgess Drive
Menlo Park, CA 94025
(415) 328-3123
sss@aaai.org
------------------------------
From: Jeff Schlimmer - Faculty <schlimme@eecs.wsu.edu>
Subject: Machine Learning Conference Final Call for Papers
Date: Mon, 21 Nov 1994 16:59:12 -0800 (PST)
CALL FOR PAPERS
Twelfth International Conference on Machine Learning
Tahoe City, California
July 9-12, 1995
The Twelfth International Conference on Machine Learning (ML95)
will be held at the Granlibakken Resort in Tahoe City, California
during July 9-12, 1995, with informal workshops and tutorials on July
9. We invite paper submissions from researchers in all areas of
machine learning. The conference will include presentations of
refereed papers and invited talks.
The Eighth Conference on Computational Learning Theory (COLT 95)
will be held from July 5-8 at the University of California-Santa Cruz;
carpools will be arranged to shuttle participants from COLT 95 to
ML95.
REVIEW CRITERIA
Each submitted paper will be reviewed by at least two members of
the program committee and will be judged on significance, originality,
and clarity. Papers submitted to the conference should differ
substantially from those submitted to other conferences.
PAPER FORMAT
Submissions must be clearly legible, with good quality print.
Papers are limited to a total of twelve (12) pages, EXCLUDING title
page and bibliography, but INCLUDING all tables and figures. Papers
must be printed on 8-1/2 x 11 inch paper or A4 paper using 12 point
type (10 characters per inch) with no more than 38 lines per page and
75 characters per line (e.g., LaTeX 12 point article style). The title
page must include an abstract and email and postal addresses of all
authors. Papers that do not follow this format will not be reviewed. To save
paper and postage costs please use DOUBLE-SIDED printing.
REQUIREMENTS FOR SUBMISSION
Send four (4) copies of each submitted paper to one of the
conference co-chairs. Papers must be received by
FEBRUARY 7, 1995 .
Electronic or FAX submissions are not acceptable. Notification of
acceptance or rejection will be mailed to the first (or designated)
author by March 22, 1995. Camera-ready accepted papers are due on
April 25, 1995.
INFORMAL WORKSHOPS AND TUTORIALS
Proposals for informal workshops and tutorials are invited in all
areas of machine learning. Send a two (2) page description of the
proposed workshop or tutorial, its objectives, organizer(s), and
expected number of attendees to the workshop and tutorial chair. For
tutorials, provide previous teaching experience. Workshop proposals
must be received by DECEMBER 15, 1994. Tutorial proposals must be
received by JANUARY 4, 1995. Current workshop and tutorial details are
available online via the World-Wide Web in the URL
http://grad.csee.usf.edu/aipage.html .
Conference Co-Chairs
Armand Prieditis
Dept. of Computer Science
University of California
Davis, CA 95616
priediti@cs.ucdavis.edu
Stuart Russell
Computer Science Div.
University of California
Berkeley, CA 94720
russell@cs.berkeley.edu
Program Committee
Yuichiro Anzai, Keio U.
Chris Atkeson, Georgia Tech.
Francesco Bergadano, U. Torino
Lashon Booker, Mitre
Ivan Bratko, J. Stefan Inst.
Wray Buntine, NASA Ames
Claire Cardie, Cornell U.
Jason Catlett, AT & T Bell Labs
Gerald DeJong, U. Illinois
Tom Dietterich, Oregon State U.
Charles Elkan, UC San Diego
Oren Etzioni, U. Washington
Usama Fayyad, JPL
Andrew Golding, Mitsubishi
Russ Greiner, Siemens
Lisa Hellerstein, Northwestern U
Michael Jordan, MIT
Leslie Kaelbling, Brown U.
Simon Kasif, Princeton U.
Sridhar Mahadevan, U. South Fla.
Chris Matheus, GTE
Melanie Mitchell, Santa Fe Inst.
Ray Mooney, UT Austin
Andrew Moore, CMU
Stephen Muggleton, Oxford U.
Michael Pazzani, UC Irvine
Ed Pednault, AT & T Bell Labs
Cullen Schaffer, Hunter College
Andreas Stolcke, SRI
Devika Subramanian, Cornell U.
Rich Sutton, GTE
Prasad Tadepalli, Oregon State U.
Gerald Tesauro, IBM
Sebastian Thrun, U. Bonn
Manuela Veloso, CMU
David Wilkins, U. Illinois
Stephan Wrobel, GMD
Kenji Yamanishi, NEC Princeton
Workshop and Tutorial Chair
Sridhar Mahadevan
Department of Computer Science and Engineering
University of South Florida
4202 East Fowler Avenue, EBG 118
Tampa, Florida 33620
mahadeva@csee.usf.edu
Publicity Chair
Jeff Schlimmer
School of EE & CS
Washington State University
Pullman, WA 99164
schlimmer@eecs.wsu.edu
Local Arrangements
Debbie Chadwick
Department of Computer Science
University of California
Davis, CA 95616
chadwick@cs.ucdavis.edu
GENERAL INQUIRIES
Please send general inquiries to ml95@cs.ucdavis.edu .
To receive future conference announcements please send a note to
the publicity chair. Current conference information is available
online via the World-Wide Web in the URL
http://www.eecs.wsu.edu/~schlimme/ml95.html . This announcement is
also available in PostScript form in the URL
file://ftp.eecs.wsu.edu/pub/ml95/call-for-papers.ps .
------------------------------
Date: Mon, 21 Nov 94 12:55:06 EST
From: David Cohn <cohn@psyche.mit.edu>
Subject: Active Learning Symposium: preliminary Call for Participation
Call for Participation
AAAI Fall Symposium 1995 on Active Learning
November 10 - 12, 1995
Massachusetts Institute of Technology
Cambridge, MA
SYMPOSIUM TOPIC
An active learning system is one that can influence the training data
it receives by actions or queries to its environment. Properly
selected, these actions can drastically reduce the amount of data and
computation required by a machine learner.
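As one concrete example of such a query strategy (pool-based
uncertainty sampling, just one of many approaches within the
symposium's scope), the sketch below queries the unlabeled example
whose predicted class probability is closest to 0.5; predict_prob()
is a hypothetical stand-in for the current learner.

    #include <stdio.h>
    #include <math.h>

    #define N_POOL 5

    /* hypothetical current-model probabilities for the unlabeled pool */
    static double predict_prob(int i)
    {
        static const double p[N_POOL] = { 0.95, 0.10, 0.52, 0.80, 0.30 };
        return p[i];
    }

    static int select_query(int n)
    {
        int i, best = 0;
        double best_gap = 1.0;
        for (i = 0; i < n; i++) {
            double gap = fabs(predict_prob(i) - 0.5);
            if (gap < best_gap) { best_gap = gap; best = i; }
        }
        return best;  /* the most uncertain example: ask for its label next */
    }

    int main(void)
    {
        printf("query example %d\n", select_query(N_POOL));
        return 0;
    }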
Active learning has been studied independently by researchers in
machine learning, neural networks, robotics, computational learning
theory, experiment design, information retrieval, and reinforcement
learning, among other areas. This symposium will bring researchers
together to clarify the foundations of active learning and point out
synergies to build on.
PARTICIPATION
The Symposium on Active Learning will be held as part of the AAAI Fall
Symposium Series, and will be limited to between forty and sixty
participants.
Potential participants should submit a short position paper (at most
two pages) discussing what they could contribute to a dialogue on
active learning and/or what they hope to learn by participating.
Suggested topics include:
Theory: What are the important results in the theory of active
learning and what are important open problems? How much guidance
does theory give to application?
Algorithms: What successful algorithms have been found for active
learning? How general are they? For what tasks are they appropriate?
Evaluation: How can accuracy, convergence, and other properties of
active learning algorithms be evaluated when, for instance,
data is not sampled randomly?
Taxonomy: What kinds of information are available to learners
(e.g. membership vs. equivalence queries, labeled vs. unlabeled data)
and what are the ways learning methods can use them? What are the
commonalities among methods studied by different fields?
Papers should be sent by APRIL 14, 1995 to:
David D. Lewis lewis@research.att.com
AT&T Bell Laboratories
600 Mountain Ave.; Room 2C-408
Murray Hill, NJ 07974-0636
Electronic mail submissions are strongly preferred.
In addition to invited participants, a limited number of other
interested parties will be able to register in each symposium on a
first-come, first-served basis. Registration will be available by
1 August, 1995. To obtain registration information write to the
AAAI at 445 Burgess Drive, Menlo Park, CA 94025 (fss@aaai.org).
SYMPOSIUM STRUCTURE
The symposium will be broken into sessions, each dedicated to a major
theme identified within the position papers. Sessions will begin with
a background presentation by an invited speaker, followed by brief
position statements from selected participants. A significant portion
of each session will be reserved for group discussion, guided by a
moderator and focused on the core issue for the session. The final
session of the symposium will accommodate new issues that are raised
during sessions.
RELEVANT DATES
April 14, 1995 Submissions for the symposia are due
May 19, 1995 Notification of acceptance
September 1, 1995 Working notes for symposium distributed
November 10-12, 1995 Symposium held at MIT
Organizing Committee:
David A. Cohn (co-chair), MIT, cohn@psyche.mit.edu; David D. Lewis
(co-chair), AT&T Bell Labs, lewis@research.att.com; Kathryn Chaloner,
U. Minnesota; Leslie Pack Kaelbling, Brown U.; Robert Schapire, AT&T
Bell Labs; Sebastian Thrun, U. Bonn; Paul Utgoff, U. Mass Amherst.
Sponsored by the American Association for Artificial Intelligence
445 Burgess Drive
Menlo Park, CA 94025
(415) 328-3123
fss@aaai.org
------------------------------
Date: Fri, 18 Nov 94 17:36:37 MST
From: David Wolpert <dhw@santafe.edu>
Subject: OFF-TRAINING SET ERROR
*** Paper Announcement ***
OFF-TRAINING SET ERROR AND A PRIORI DISTINCTIONS BETWEEN LEARNING ALGORITHMS
by David H. Wolpert
Abstract: This paper uses off-training set (OTS) error to investigate
the assumption-free relationship between learning algorithms. It is
shown, loosely speaking, that for any two algorithms A and B, there
are as many targets (or priors over targets) for which A has lower
expected OTS error than B as vice-versa, for loss functions like
zero-one loss. In particular, this is true if A is cross-validation
and B is "anti-cross-validation" (choose the generalizer with largest
cross-validation error). On the other hand, for loss functions other
than zero-one (e.g., quadratic loss), there are a priori distinctions
between algorithms. However, even for such loss functions, any
algorithm is equivalent on average to its "randomized" version, and in
this sense still has no first-principles justification in terms of
average error. On the other hand, it is shown that (for example)
cross-validation may have better minimax properties than
anti-cross-validation, even for zero-one loss. This paper also
analyzes averages over hypotheses rather than targets. Such analyses
hold for all possible priors. Accordingly they prove, as a particular
example, that cross-validation cannot be justified as a Bayesian
procedure. In fact, for a very natural restriction of the class of
learning algorithms, one should use anti-cross-validation rather than
cross-validation (!). This paper ends with a discussion of the
implications of these results for computational learning theory. It is
shown that one cannot say: if the empirical misclassification rate is
low, the VC dimension of your generalizer is small, and the training
set is large, then with high probability your OTS error is
small. Other implications for "membership queries" algorithms and
"punting" algorithms are also discussed.
The paper can be retrieved by anonymous ftp to ftp.santafe.edu. Go to
the subdirectory pub/dhw_ftp. Compressed postscript of the paper is in
nfl.ps.Z, and uuencoded compressed postscript in nfl.ps.Z.encoded.
Please tell me if you have any problems printing out the paper.
Comments are explicitly solicited.
David Wolpert
The Santa Fe Institute
1399 Hyde Park Road
Santa Fe, NM, 87501, USA
(505) 984-8800 (voice); (505) 982-0565 (FAX).
------------------------------
Date: Fri, 18 Nov 94 09:24:33 EST
From: Jeff Goldman <goldman@coyote.stap34.aar.wpafb.af.mil>
Subject: Responses about C4.5 on a PC
Here is the summary information about running C4.5 on a PC. Again, thank
you all for responding.
Sincerely,
Jeff Goldman
goldmanj@aa.wpafb.af.mil
Answers From:
#-----------------------------------------------------------------------------
Floor Verdenius IN%"F.Verdenius@ato.agro.nl"
ATO-DLO, Dept. Systems Research
PO box 17
6700 AA Wageningen Tel (+31) 8370 75307
The Netherlands Fax (+31) 8370 12260
#------------------------------------------------------------------------------
We are using C4.5 on a PC. We used the original code that comes with Quinlan's
book "C4.5: Programs for Machine Learning", published by Morgan
Kaufmann. This version, written for UNIX, requires only a little adaptation to
run on a DOS machine (we used both Borland's C++ and Watcom's C++, and it
works), and it runs on small datasets, so you should not face any major
problem in applying it to small datasets.
As soon as you use C4.5 in iterative processes or apply it to larger
datasets, however, the program crashes due to memory problems. The
main reason for this is that the DOS C/C++ compilers lack garbage
collection.
C4.5 is very liberal in allocating memory for temporary structures. This
causes fragmentation of the heap, which leads to the crash. As these
structures are also frequently freed, the total actual memory usage at the
time of the crash stays limited (so there is still memory available at the
moment of the crash). I consider this a major shortcoming of C4.5.
One of our students, Ceel Rozenboom, is currently working on overcoming
this limitation by applying garbage collection and/or memory
prestructuring techniques to avoid such crashes. One of his problems is
that there is no standard garbage collector for C/C++ under DOS (any
suggestions are welcome).
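For illustration, a minimal sketch of the memory-prestructuring idea
mentioned above: draw all temporary allocations from one preallocated
arena and release them in a single reset, so repeated malloc()/free()
cycles cannot fragment the heap. This shows the general technique
only; it is not code from C4.5 or from the project described.

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct {
        char  *base;
        size_t size, used;
    } Arena;

    static int arena_init(Arena *a, size_t size)
    {
        a->base = malloc(size);   /* one large block, obtained once */
        a->size = size;
        a->used = 0;
        return a->base != NULL;
    }

    static void *arena_alloc(Arena *a, size_t n)
    {
        void *p;
        n = (n + 7) & ~(size_t)7;               /* keep allocations aligned */
        if (a->used + n > a->size) return NULL; /* arena exhausted */
        p = a->base + a->used;
        a->used += n;
        return p;
    }

    static void arena_reset(Arena *a) { a->used = 0; }  /* drop all temporaries at once */

    int main(void)
    {
        Arena a;
        int pass;
        if (!arena_init(&a, 4096)) return 1;
        for (pass = 0; pass < 3; pass++) {   /* e.g., one pass per iteration */
            double *tmp = arena_alloc(&a, 100 * sizeof(double));
            if (tmp != NULL) tmp[0] = 0.0;   /* ... build temporary structures ... */
            arena_reset(&a);                 /* no per-object free(), no fragmentation */
        }
        free(a.base);
        printf("done\n");
        return 0;
    }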
I hope this (especially the first part) answers your questions.
Good luck.
Jeff: How is the 8 character filename limitation handled?
Do I need to go and rename all of the files
(including temporary files created by c4.5)?
Floor:
Concerning your question on file names: We have no problems leaving filenames
and extensions as they are. C4.5 uses filenames and extensions similar to the
general DOS format *.*. DOS simply truncates filenames longer than 8
characters and extensions longer than 3. The only precautions, of course, are
to make sure that no dots appear in the filename (so C4.5 becomes C45) and to
ensure that filenames and extensions are unique within their first 8 and 3
characters.
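A small illustration of that truncation rule (the filenames below are
made up for the example): two files collide under DOS exactly when
their first 8 name characters and first 3 extension characters match.

    #include <stdio.h>
    #include <string.h>

    static void dos_truncate(const char *name, const char *ext, char out[13])
    {
        sprintf(out, "%.8s.%.3s", name, ext);   /* 8.3 form, at most 12 chars */
    }

    int main(void)
    {
        char a[13], b[13];
        dos_truncate("labor-negotiations", "data", a);
        dos_truncate("labor-negs", "data", b);
        printf("%s vs %s: %s\n", a, b,
               strcmp(a, b) == 0 ? "collide" : "distinct");
        return 0;
    }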
Answer from:
# ------------------------------------------------------------------- #
# Jerzy Surma #
# Department of Computer Science #
# University of Economics #
# ul. Komandorska 118/120 Fax: ++48-71-672778 #
# 53-345 Wroclaw, P O L A N D Email: surma@unix.ok.ae.wroc.pl #
# ------------------------------------------------------------------- #
I have tried to make a PC version of Quinlan's C4.5 program.
I used Borland C++ ver. 3.1. First of all I had to add
prototype definitions for all functions, and I changed a random
function. I had some problems with real numbers, so I had to
change the float type to double in some cases. I obtained
an EXE file that worked well for small learning sets (e.g. golf),
but I got errors (e.g. null pointer assignment) with bigger
learning sets (e.g. labor-neg).
------------------------------
End of ML-LIST (Digest format)
****************************************