Machine Learning List: Vol. 1 No. 1
Sunday, July 9, 1989
Contents:
ML List Announcement
Biotech data for use with machine learning algorithms (Hunter)
Experimentation (Nordhausen)
Grand Challenges (Pazzani)
The Machine Learning List is moderated. Contributions should be relevant to
the scientific study of machine learning. Mail contributions to ml@ics.uci.edu.
Mail requests to be added or removed to ml-request@ics.uci.edu.
----------------------------------------------------------------------
This issue of ML-LIST is being mailed to anyone who requested it by sending
mail to ml@ics.uci.edu and to everyone who attended either of the two recent
learning workshops at Cornell. Future issues will be mailed only to those
who have requested them. Please send all future requests to be added to or
removed from ML-LIST to ml-request@ics.uci.edu. Local redistribution is
encouraged.
----------------------------------------------------------------------
Date: Thu, 6 Jul 89 14:39:17 ADT
From: Larry Hunter <HUNTER@HUNTERMAC.nlm.nih.GOV>
Subject: Biotech data for use with machine learning algorithms
As I discussed in my talk at the recent Machine Learning Workshop, I believe
that (1) the quality of a machine learning algorithm ought to be measured
primarily by the significance of its novel discoveries, and that (2) molecular
biology provides datasets with the potential for current ML algorithms to make
novel discoveries. Mike Pazzani asked if I would post to this bulletin board
some material about how to get access to these datasets.
There are several molecular biology databases that are freely available to the
machine learning community. In general, databases are distributed both by the
maintainers and by others who have obtained copies. Maintainers generally
charge a media and handling fee. There are generally no restrictions on
redistribution, so it may be cheaper to copy a database from someone who
already has one. Since the databases are large, downloading them over the
internet is frowned on, but it is possible to query some of the databases
by email.
In addition to having the raw data in your hands, some of you may find it
helpful to acquire some "background knowledge," i.e. some comprehensible
readings on molecular biology. I would suggest Mark Ptashne's "A Genetic
Switch: Gene Control and Phage Lambda" (Blackwell Scientific Publications,
Boston, MA, 1986) as an excellent, accessible example of both the kinds of
knowledge relevant to solving molecular biology questions, and of the reasoning
processes of biologists. A somewhat idiosyncratic, but very useful collection
of algorithms for analyzing genetic and protein sequence data can be found in
Russell Doolittle's "Of URFs and ORFs: A Primer on How To Analyze Derived Amino
Acid Sequences" (University Science Books, Mill Valley, CA, 1987). A good,
detailed text on protein structure for those with some chemistry background is
Georg Schulz & Heiner Schirmer's "Principles of Protein Structure"
(Springer-Verlag, New York, NY, 1978).
There are hundreds of potentially useful databases. There is a loosely
organized group of computer scientists and biologists working on what they call
"The Matrix of Biological Knowledge," a set of tools for integrating and using
these many databases. They are running a conference this August 18-19 in New
Hampshire. Try contacting Karen Gruskin, Molecular Biology Computer Research
Resource, Dana-Farber Cancer Institute, 44 Binney St. Boston, MA 02115 for
more information about the conference.
What follows is the most current information I have about the major molecular
biology databases. The databases are updated on the order of every 6 weeks,
often with megabytes of new data (notice the large sizes in ()'s next to each
database below). Also, responsibility for maintaining the databases is in a
state of flux, so if you have trouble finding something, let me know and I'll
try to help you out.
A good place to start is a database that lists many of the molecular biology
databases, where to get them, what's in them, etc. This database of databases
is called BDIR (Biological Database Information Resource), and is maintained by:
Kathleen Arnett
Bioinformatics Division
American Type Culture Collection
12301 Parklawn Dr.
Rockville, MD 20852-1776
(301) 231-5585
KA3@NIHCU.BITNET
Perhaps the most central of all the biology databases is Genbank, which
contains well over 10,000 DNA sequences (> 78M).
GenBank:
c/o Intelligenetics
700 E. El Camino Real
Mountain View, CA 94040
(415)962-7364
genbank@bionet-20.bio.net (warning: bionet-20 disappears 30 September 1989)
Protein Identification Resource (PIR): protein sequence database (>16M)
c/o National Biomedical Research Foundation
Georgetown University Medical Center
3900 Reservoir Rd. NW
Washington, DC 20007
(202)625-2121
pirmail@gunbrf.bitnet
Protein Data Bank (PDB, also called Brookhaven): crystallographic data about
the shapes of about 100 proteins. An important cross-reference with PIR for
structural prediction work (>55M)
PDB c/o Brookhaven National Laboratory
Chemistry Dept Bldg. 555
Upton, NY 11973
(516)282-4382
no email I've been able to find.
EMBL: the European version of GenBank, with some complementary data (>58M)
c/o European Molecular Biology Laboratory
Postfach 10 22 09
6900 Heidelberg 1
West Germany
datalib@embl.bitnet
Also, the National Library of Medicine is interested in supporting machine
learning research in molecular biology domains. You might try talking to Peter
Clepper in the extramural programs biomedical information support branch (they
give grants, graduate student fellowships, etc.) (301) 496-4221.
This should be enough to get you off the ground. There will be an AAAI
symposium on AI and molecular biology next spring, just in time to present your
early results...
Have fun!
Larry Hunter
National Library of Medicine
Lister Hill Center, MS-54
Bethesda, MD 20894
(301) 496-9300
hunter@mcs.nlm.nih.gov
----------------------------------------------------------------------
Date: Fri, 07 Jul 89 15:10:40 -0700
From: Bernd Nordhausen <bernd@ci4.ICS.UCI.EDU>
Subject: experimentation
Some thoughts about experimentation in machine learning:
To say it up front: I am for experimental validation of ML systems, and you
will see learning curves in my own work. I think it is valuable that the
machine learning community has started to think about validating systems
through experiments. But I caution that while experimentation is
worthwhile, and we should definitely pursue it, it is not the answer to all
of our validation problems.
The introduction of comparative studies (i.e., using the same data to
evaluate and compare different systems) was progress. However, for
many systems the domains are so different that no common data is
available. Thus people create their own domains and dependent measures to
experimentally validate their systems. In these cases there CAN be a
danger that experimental results become less meaningful,
because too many hidden factors might be involved in the experiments.
Furthermore, I argue that not every claim has to be substantiated by a
learning curve. Especially at a workshop, it is OK to make a claim
and say "I have preliminary results which suggest this claim,"
without a learning curve to confirm the assertion. Sometimes learning
curves can do more harm than good. A learning curve
should imply a careful evaluation, not just some preliminary
results.
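To make the term concrete for readers new to it: a learning curve plots
predictive performance against the number of training examples. The short
Python sketch below is only an illustration of the kind of evaluation a
careful learning curve implies; the dataset is synthetic and the
majority-class "learner" is a deliberately trivial stand-in for a real
system. The point is the procedure: repeated random train/test splits at
each training-set size, with accuracy averaged over trials.

    import random
    from statistics import mean

    def majority_class_learner(train):
        # Deliberately trivial "learner": predict the most common training label.
        labels = [y for _, y in train]
        most_common = max(set(labels), key=labels.count)
        return lambda x: most_common

    def accuracy(model, test):
        return mean(1.0 if model(x) == y else 0.0 for x, y in test)

    def learning_curve(data, learner, train_sizes, trials=20,
                       test_fraction=0.3, seed=0):
        # Mean test accuracy at each training-set size, averaged over
        # repeated random train/test splits.
        rng = random.Random(seed)
        curve = []
        for n in train_sizes:
            scores = []
            for _ in range(trials):
                shuffled = data[:]
                rng.shuffle(shuffled)
                cut = int(len(shuffled) * (1 - test_fraction))
                pool, test = shuffled[:cut], shuffled[cut:]
                model = learner(pool[:n])
                scores.append(accuracy(model, test))
            curve.append((n, mean(scores)))
        return curve

    if __name__ == "__main__":
        # Synthetic placeholder data: x is a random number, the label is x > 0.5.
        rng = random.Random(1)
        data = [(x, x > 0.5) for x in (rng.random() for _ in range(300))]
        for n, acc in learning_curve(data, majority_class_learner,
                                     [5, 10, 25, 50, 100]):
            print("training examples: %4d   mean accuracy: %.2f" % (n, acc))
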
These are my 2 Pfennigs worth. I am interested to hear from other people what
they think about the subject of experimentation, so let the flames roll.
Bernd Nordhausen
----------------------------------------------------------------------
Date: Sunday, 09 Jul 89 11:10:46 -0700
From: Michael Pazzani <pazzani@ICS.UCI.EDU>
Subject: Grand Challenges
At the recent Machine Learning Workshop, Tom Dietterich gave an interesting
talk on Grand Challenges for Machine Learning. One of the challenges
was learning from natural language texts. In some ways, I think it would
be a good idea for people in machine learning to look at natural language
processing. Some of the most advanced work on knowledge representation and
inference has been motivated by natural language understanding problems.
Unlike many explanation-based learning systems, very few NLP systems use
Horn clauses to represent knowledge or backward-chaining depth-first search
as an inference process.
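For readers less familiar with that style of inference, here is a minimal
Python sketch of backward-chaining, depth-first search over propositional
Horn clauses; the knowledge base is invented purely for illustration and is
not drawn from any particular EBL or NLP system.

    # Propositional Horn-clause knowledge base: each conclusion maps to the
    # alternative premise lists that establish it; facts have an empty list.
    RULES = {
        "bird(tweety)": [[]],
        "small(tweety)": [[]],
        "not_penguin(tweety)": [["small(tweety)"]],
        "flies(tweety)": [["bird(tweety)", "not_penguin(tweety)"]],
    }

    def prove(goal, rules, seen=frozenset()):
        # Depth-first backward chaining: a goal holds if some rule for it
        # has all of its premises provable; `seen` guards against cycles.
        if goal in seen:
            return False
        for premises in rules.get(goal, []):
            if all(prove(p, rules, seen | {goal}) for p in premises):
                return True
        return False

    print(prove("flies(tweety)", RULES))   # True
    print(prove("flies(opus)", RULES))     # False -- no rule concludes it
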
However, many hard problems in learning from natural language text are
pure natural language problems that have very little to do with
learning (e.g., word sense disambiguation, finding the referents of
noun phrases, etc.). Until progress is made in NLP, I doubt that much
progress can be made in learning from natural language texts. Perhaps
a better challenge for the learning community, as opposed to the natural
language community, would be to incorporate hand-coded representations
of natural language texts into a large memory. How much progress
has the CYC project made on this?
Michael Pazzani
----------------------------------------------------------------------
End of ML-LIST 1.1