Neuron Digest Thursday, 10 Nov 1988 Volume 4 : Issue 21
Today's Topics:
How to ensure that back-propagation separates separable patterns
reply on back-propagation fails to separate paper
Re: reply on back-propagation fails to separate paper
Re: reply on back-propagation fails to separate paper
Re: reply on back-propagation fails to separate paper
LMS fails even in separable cases
ANNs for Image classification (Remote Sensing)
INNS paper reference
looking for papers
Political science viewed as a Neural Net :-)
polyvalued digits in classifier messages?
Production rule LHS pattern matching with neural nets
Response wanted on neural net user interfaces
Send submissions, questions, address maintenance and requests for old issues to
"neuron-request@hplabs.hp.com" or "{any backbone,uunet}!hplabs!neuron-request"
------------------------------------------------------------
Subject: How to ensure that back-propagation separates separable patterns
From: Geoffrey Hinton <hinton@ai.toronto.edu>
Date: Thu, 27 Oct 88 15:39:44 -0400
[[ Editor's Note: This series of entries is from a mailing list of active
Connectionist researchers. For you readers who are pursuing different
aspects of ANNs, you might be interested. -PM ]]
We have recently obtained the following results, and would like to know if
anyone has seen them previously written up. The description is informal
but accurate:
There has been some interest in understanding the behavior of
backpropagation in feedforward nets with no hidden neurons. This is a
particularly simple case that does not require the full power of
back-propagation (it can also be approached from the point of view of
perceptrons), but we believe that it provides useful insights into the
structure of the error surface, and that understanding this case is a
prerequisite for understanding the more general case with hidden layers.
In [1], the authors give examples illustrating the fact that while a
training set may be separable, a net performing backpropagation (gradient)
search may get stuck in a solution which fails to separate. The objective
of this note is to point out that if one uses instead a threshold
procedure, where we do not penalize values ``beyond'' the targets, then
such counterexamples cease to exist, and in fact that one has a convergence
theorem that closely parallels that for perceptrons: the continuous
gradient adjustment procedure is such that, from any initial weight
configuration, a separating set of weights is obtained in finite time.
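[[ Editor's illustration: a minimal Python sketch of the "threshold" idea just described, for a single sigmoid output unit. The function names and the 0.2/0.8 target limits are illustrative assumptions, not the authors' exact procedure; the point is only that output values at or "beyond" the target limit incur no penalty. ]]

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lms_error(y, t):
        # standard LMS cost: penalizes any deviation from the target,
        # including overshooting it
        return 0.5 * (y - t) ** 2

    def threshold_error(y, t_low=0.2, t_high=0.8, positive=True):
        # zero penalty once the output is at or beyond its target limit:
        # above t_high for a positive example, below t_low for a negative one
        if positive:
            shortfall = max(t_high - y, 0.0)
        else:
            shortfall = max(y - t_low, 0.0)
        return 0.5 * shortfall ** 2

    x = np.array([1.0, -0.5, 1.0])   # input with a bias component
    w = np.array([0.2, 0.1, -0.3])   # current weights
    y = sigmoid(w @ x)
    print(lms_error(y, 1.0), threshold_error(y, positive=True))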
We also show how to modify the example given in [2] to conclude that there
is a training set consisting of 125 binary vectors and a network
configuration for which there are nonglobal local minima, even if threshold
LMS is used. In this example, the training set is of course not separable.
Finally, we compare our results to more classical pattern recognition
theorems, in particular to the convergence of the relaxation procedure for
threshold LMS with linear output units ([3]), and show that the continuous
adjustment approach is essential in the study of the nonlinear case.
Another essential difference from the linear case is that in the latter,
nonglobal local minima cannot occur even if the data is not separable.
References:
[1] Brady, M., R. Raghavan and J. Slawny, ``Backpropagation fails to separate
where perceptrons succeed,'' submitted for publication. Summarized version in
``Gradient descent fails to separate,'' in {\it Proc. IEEE International
Conference on Neural Networks}, San Diego, California, July 1988, Vol. I,
pp.649-656.
[2] Sontag, E.D., and H.J. Sussmann, ``Backpropagation can give rise to
spurious local minima even for networks without hidden layers,'' submitted.
[3] Duda, R.O., and P.E. Hart, {\it Pattern Classification and Scene
Analysis}, Wiley, New York, 1973.
Geoff Hinton
Eduardo Sontag
Hector Sussmann
------------------------------
Subject: reply on back-propagation fails to separate paper
From: Richard Lippmann <rpl@ll-sst.ARPA>
Date: Fri, 28 Oct 88 09:42:23 -0400
Geoff,
We came up with the same conclusion a while ago when some people
were worried about the performance of back propagation but never published
it. Back propagation with limits seems to converge correctly for those
contrived deterministic cases where minimizing total squared error does not
minimize the percent patterns classified correctly. The limits cause the
algorithm to change from an LMS mean-square-minimizing approach to a
perceptron-like error-corrective approach. Typically, however, the
difference in percent patterns classified correctly between the local and
global solutions in those cases tends to be small. In practice, we found
that convergence for the one contrived case we tested with limits took
rather a long time.
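[[ Editor's illustration: a rough Python sketch of the "limits" Rich describes, showing how the delta rule stops correcting once an output is at or beyond its target, which gives the perceptron-like, error-corrective behavior. The +1/-1 target coding and the learning rate are assumptions made only for this example. ]]

    import numpy as np

    def lms_update(w, x, t, eta=0.1):
        # plain LMS delta rule: always pushes the output toward the exact target
        y = w @ x
        return w + eta * (t - y) * x

    def limited_update(w, x, t, eta=0.1):
        # with limits: no correction once the output is at or beyond the target,
        # much like the perceptron error-correction rule
        y = w @ x
        if t * y >= 1.0:             # already at or beyond a +1/-1 target
            return w
        return w + eta * (t - y) * x

    w = np.zeros(3)
    x = np.array([1.0, 0.5, -1.0])   # input with a bias component
    print(lms_update(w, x, +1.0))
    print(limited_update(w, x, +1.0))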
I have never seen this published and it would be good to see your
result published with a convergence proof. I have also seen little
published on the effect of limits on performance of classifiers or on final
weight values.
Rich
------------------------------
Subject: Re: reply on back-propagation fails to separate paper
From: mehra@aquinas.csl.uiuc.edu (Pankaj Mehra)
Date: Fri, 28 Oct 88 11:44:02 -0500
Hi everybody.
When I heard Brady et al.'s talk at ICNN-88, I thought that the
results simply pointed out that a correct approach to classification may
not give equal importance to all training samples. As is well-known,
classical back-prop converges to a separating surface that depends on the
LMS error summed uniformly over all training samples.
I think that the new results provide a case for attaching more
importance to the elements on concept boundaries. I have been working on
this problem (of trying to characterize "boundary" elements) off and on,
without much success. Basically, geometric characterizations exist but they
are too complex to evaluate. What is interesting, however, is the fact
that the complexity of learning (and hence the time for convergence) depends on the
nature of the separating surface. Theoretical results also involve similar
concepts, e.g. VC-dimension.
Also notice that if one could somehow "learn" the characteristic of
boundary elements, then one could ignore a large part of the training
sample and still converge properly using a threshold procedure like that
suggested in Geoff's note.
Lastly, since back-prop is not constrained to always use LMS as the
error function, one wonders if there is an intelligent method (that can be
automated) for constructing error functions based on the complexity of the
separating surface.
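[[ Editor's illustration: a hypothetical Python sketch of the kind of non-uniform error function suggested above, in which samples near the current decision boundary are weighted more heavily than in the uniform LMS sum. The inverse-margin weighting is purely illustrative, not a published scheme. ]]

    import numpy as np

    def boundary_weighted_error(w, X, T, eps=0.1):
        # X: (n, d) inputs, T: (n,) targets in {0, 1}, w: (d,) weights
        y = 1.0 / (1.0 + np.exp(-(X @ w)))   # sigmoid outputs
        margin = np.abs(X @ w)               # rough distance to the boundary
        weight = 1.0 / (margin + eps)        # boundary samples count more
        return 0.5 * np.sum(weight * (y - T) ** 2)

    X = np.array([[1.0, 0.2], [1.0, 2.0], [1.0, -1.5]])
    T = np.array([1.0, 1.0, 0.0])
    w = np.array([0.1, 0.4])
    print(boundary_weighted_error(w, X, T))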
- - Pankaj Mehra
{mehra%aquinas@uxc.cso.uiuc.edu}
------------------------------
Subject: Re: reply on back-propagation fails to separate paper
From: sontag@fermat.rutgers.edu (Eduardo Sontag)
Date: Sat, 29 Oct 88 10:50:52 -0400
Mark,
Re your question to Mehra about complexity of boundaries and VC dimension,
we just had a talk at Rutgers yesterday by Eric Baum
(baum@pupgg.princeton.edu) about this. You should ask him for copies of
his papers on the subject, which also contain references to Valiant's and
other related work.
I think that the relation between Brady et al.'s results and ours can be
explained better in terms of threshold vs nonthreshold costs than in terms
of relative weightings of terms.
- -eduardo
Eduardo D. Sontag
Rutgers Center for Systems and Control (SYCON)
Rutgers University
(sontag@fermat.rutgers.edu)
------------------------------
Subject: Re: reply on back-propagation fails to separate paper
From: terry@cs.jhu.edu (Terry Sejnowski <terry@cs.jhu.edu>)
Date: Mon, 31 Oct 88 19:35:09 -0500
The proceedings of the CMU Connectionist Models Summer School has two
papers on optimal choice of training set based on "critical" or "boundary"
patterns: Karen Huyser on Boolean functions and Subutai Ahmad on the
majority function.
The proceedings are available from Morgan Kaufmann.
Terry
------------------------------
Subject: LMS fails even in separable cases
From: neural!jsd@hplabs.HP.COM (John Denker)
Date: Wed, 02 Nov 88 11:25:35 -0500
Yes, we noticed that a Least-Mean-Squares (LMS) network even with no hidden
units fails to separate some problems. Ben Wittner spoke at the IEEE NIPS
meeting in Denver, November 1987, describing !two! failings of this type.
He gave an example of a situation in which LMS algorithms (including
ordinary versions of back-prop) are metastable, i.e. they fail to separate
the data for certain initial configurations of the weights. He went on to
describe another case in which the algorithm actually !leaves! the solution
region after starting within it.
He also pointed out that this can lead to learning sessions in which the
categorization performance of back-prop nets (with or without hidden units)
is not a monotonically improving function of learning time.
Finally, he presented a couple of ways of modifying the algorithm to
prevent these problems, and proved a convergence theorem for the modified
algorithms. One of the key ideas is something that has been mentioned in
several recent postings, namely, to have zero penalty when the training
pattern is well-classified or "beyond".
We cited Minsky & Papert as well as Duda & Hart; we believe they were
more-or-less aware of these bugs in LMS, although they never presented
explicit examples of the failure modes.
Here is the abstract of our paper in the proceedings, _Neural Information
Processing Systems -- Natural and Synthetic_, Denver, Colorado, November
8-12, 1987, Dana Anderson Ed., AIP Press. We posted the abstract back in
January '88, but apparently it didn't get through to everybody. Reprints
of the whole paper are available.
Strategies for Teaching Layered Networks Classification Tasks
Ben S. Wittner (1)
John S. Denker
AT&T Bell Laboratories
Holmdel, New Jersey 07733
ABSTRACT: There is a widespread misconception that the delta-rule is in
some sense guaranteed to work on networks without hidden units. As
previous authors have mentioned, there is no such guarantee for
classification tasks. We will begin by presenting explicit
counter-examples illustrating two different interesting ways in which the
delta rule can fail. We go on to provide conditions which do guarantee
that gradient descent will successfully train networks without hidden units
to perform two-category classification tasks. We discuss the
generalization of our ideas to networks with hidden units and to
multi-category classification tasks.
(1) Currently at NYNEX Science and Technology / 500 Westchester Ave.
White Plains, NY 10604
------------------------------
Subject: ANNs for Image classification (Remote Sensing)
From: csrobe@cs.wm.edu (Chip Roberson)
Date: Sun, 06 Nov 88 21:33:15 -0500
Keywords: Artificial Neural Networks, Remote Sensing, Image Processing,
Landsat, Land Use
I am beginning to do some image processing of remote sensing data to do
land use studies. Much of my work will include taking Landsat and SPOT
raster data (as well as vector data from various sources) and classifying the
spectral signatures so we can determine how land segments are used
(vegetation, barren, stripped, fresh water, etc.).
What I am wondering is what neural computation techniques have been used in
this domain? I have seen ANNs used as maximum-likelihood
classifiers for pattern recognition, but has anyone done any specific
applications that are related to what I'm attempting to do? Are there any
interesting issues regarding this kind of image processing?
I'm interested in receiving references, comments, wild theories, names,
phone numbers, caveats, random thoughts, and questions.
Many Thanks,
- -chip
Chip Roberson csrobe@cs.wm.edu [128.239.1.30]
Dept. of Computer Science csrobe@icase.edu
College of William and Mary #csrobe@wmmvs.bitnet
Williamsburg, VA 23185 ...!uunet!pyrdc!gmu90x!wmcs!csrobe
(804) 253-4393
------------------------------
Subject: INNS paper reference
From: Rockwell.henr@Xerox.COM
Date: 01 Nov 88 10:45:00 -0500
Reply-To: rockwell.henr@xerox.com
Organization: Xerox
I'm looking for a reference for a paper presented at the INNS conference.
Someone spoke about it briefly during another talk. The paper dealt with
boundary cases for training data and how network performance was increased
by inclusion of as many boundary cases as possible.
Does this ring any bells or are there any other papers on this subject that
are known?
------------------------------
Subject: looking for papers
From: aam9n@uvaee.ee.virginia.EDU (Ali Minai)
Organization: EE Dept, U of Virginia, Charlottesville
Date: 03 Nov 88 00:36:18 +0000
I am looking for two papers by Lapedes and Farber, and will be grateful if
some kind soul out there could send me copies. The papers are:
GENETIC DATA BASE ANALYSIS WITH NEURAL NETS.
and
NONLINEAR SIGNAL PROCESSING USING NEURAL NETS.
Both are by A. Lapedes & R. Farber, and both were presented to the IEEE
Conference on Neural Information Processing Systems--Natural and Synthetic.
Any info about how to obtain the proceedings for this conference will also be
appreciated.
Thanks in advance
Ali Minai,
Dept. of Electrical Engg.
Thornton Hall
University of Virginia
Charlottesville, VA 22901
aam9n@uvaee.ee.Virginia.EDU
aam9n@maxwell.acc.Virginia.EDU
------------------------------
Subject: Political science viewed as a Neural Net :-)
From: Scott.Fahlman@B.GP.CS.CMU.EDU
Date: Tue, 25 Oct 88 21:24:04 -0400
[[ Editor's note: I would have liked to get this off before last Tuesday,
but it's better late than never. After all, Canada is having their
election shortly, as well. Given that most elections generate more heat
than light, I'd say a Boltzmann Machine model may be more apropos. -PM ]]
I was thinking about the upcoming U.S. election today, and it occurred to
me that the seemingly useless electoral college mandated by the U.S.
constitution might actually be of some value. A direct democratic election
is basically a threshold decision function with lots of inputs and with
fixed weights; add the electoral college and you've got a layered network
with fifty hidden units, each with a non-linear threshold function.
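[[ Editor's illustration: in the same playful spirit, a toy Python sketch of the two architectures: a direct election as a single threshold unit over all voters, and the electoral college as a layer of winner-take-all hidden units whose outputs are weighted by electoral votes. All numbers are made up. ]]

    import numpy as np

    def direct_election(votes):
        # single threshold unit over all voters with fixed, equal weights
        return int(np.sum(votes) > 0)            # votes are +1 / -1

    def electoral_college(votes_by_state, electoral_votes):
        # each state is a hidden unit with a hard nonlinearity; the output
        # unit weights the hidden units by that state's electoral votes
        hidden = [1 if np.sum(v) > 0 else -1 for v in votes_by_state]
        return int(np.dot(hidden, electoral_votes) > 0)

    votes_by_state = [np.array([+1, +1, -1]),    # three tiny "states"
                      np.array([-1, -1, -1]),
                      np.array([+1, -1, +1])]
    electoral_votes = [3, 10, 5]
    print(direct_election(np.concatenate(votes_by_state)),
          electoral_college(votes_by_state, electoral_votes))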
A direct election can only choose a winner based on some linearly separable
function of voter opinions. You would expect to see complex issues
forcibly projected onto some crude 1-D scale (e.g. "liberal" vs.
"conservative" or "wimp" vs. "macho"). With a multi-layer decision network
the system should be capable of performing a more complex separation of the
feature space. Though they lacked the sophisticated mathematical theories
available today, the designers of our constitution must have sensed the
severe computational limitations of direct democracy and opted for the more
complex decision system. Unfortunately, we do not seem to be getting the
full benefit of this added flexibility.
What the founding fathers left out of this multi-layer network is a
mechanism for adjusting the weights in the network based on how well the
decision ultimately turned out. Perhaps some form of back-propagation
would work here. It might be hard to agree on a proper error measure, but
the idea seems worth exploring. For example, everyone who voted for Nixon
in 1972 should have the weight of his future votes reduced by epsilon;
a large momentum term would be added to the reduction for those people who
had voted for Nixon previously. The reduction would be greater for voters
in states where the decision was close (if any such states can be found).
There is already a mechanism in place for altering the output weights of
the hidden units: those states that correlate positively with the ultimate
decision end up with more political "clout", and then with more defense-related
jobs. This leads to an influx of people and ultimately to more electoral
votes for that state. Some sort of weight-decay term would be needed to
prevent a runaway process in which all of the people end up in California.
We might also want to add more cross-connections in the network. At
present, each voter affects only one hidden unit, the state where he
resides. This somewhat limits the flexibility of the learning process in
assigning arbitrary functions to the hidden units. To fix this, we could
allow voters to register in more than one state. George Bush has five or
six home states; why not make this option available to all voters?
More theoretical analysis of this complex system is needed. Perhaps NSF
should fund a center for this kind of thinking. The picture is clouded by
the observation that individual voters are not simple predicates: most of
them have a rudimentary capacity for simple inference and in some cases
they even exhibit a form of short-term learning. However, these minor
perturbations probably cancel out on the average, and can be treated as
noise in the decision units. Perhaps the amount of noise can be
manipulated to give a crude approximation to simulated annealing.
- -- Scott
------------------------------
Subject: polyvalued digits in classifier messages?
From: brian@caen.engin.umich.edu (Brian Holtz)
Date: Sat, 05 Nov 88 20:43:57 -0500
Does anyone know of any references that describe classifier systems whose
messages are composed of digits that may take more than two values? For
instance, I want to use a genetic algorithm to train a classifier system to
induce lexical gender rules in Latin. Has any work been done on managing
the complexity of going beyond binary-coded messages, or (better yet)
encoding characters in messages in a useful, non-ASCIIish way?
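[[ Editor's illustration: a small Python sketch of a classifier-system condition matcher generalized from the usual binary-plus-wildcard alphabet {0,1,#} to digits that may take more than two values. The encoding shown is an assumption for illustration only, not a reference to any published system. ]]

    def matches(condition, message, wildcard='#'):
        # each position of the condition is either a digit or the wildcard;
        # it matches a message of equal length position by position
        if len(condition) != len(message):
            return False
        return all(c == wildcard or c == m for c, m in zip(condition, message))

    # e.g. digits 0-4 encoding declension class, gender, and ending type
    print(matches('2#1', '241'))   # True
    print(matches('2#1', '301'))   # False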
Brian Holtz
------------------------------
Subject: Production rule LHS pattern matching with neural nets
From: pratt@paul.rutgers.edu (Lorien Y. Pratt)
Date: Fri, 28 Oct 88 16:54:11 -0400
Hello,
I'm interested in any work involving matching the left-hand side of a
production rule using a neural network. I imagine one could also use a
representation language other than first-order logic for the LHS, one
which might be more amenable to a neural network approach.
Thanks (again!) for any pointers,
Lori
- -------------------------------------------------------------------
Lorien Y. Pratt Computer Science Department
pratt@paul.rutgers.edu Rutgers University
Busch Campus
(201) 932-4634 Piscataway, NJ 08854
------------------------------
Subject: Response wanted on neural net user interfaces
From: cs162faw@sdcc18.ucsd.EDU (Phaedrus)
Organization: University of California, San Diego
Date: Wed, 02 Nov 88 02:36:39 +0000
I am an undergraduate student who is currently helping to design/test a
user-interface for a neural-net software/hardware package. The package in
question is from HNC, and currently they are using a simple menu-based
system called Netset. HNC is presently developing an applications language
called Axon.
I know there are many other products on the market that simulate neural
networks (whether in hardware or software). I would be interested in hearing
from anyone who has used either menu-based or language-like systems, and what
they liked/disliked about them.
Our particular goal is to make the interface easy to use, so any comments
about other packages' user interfaces would be VERY much appreciated. Also,
please mention how proficient you already are with neural networks, your level
of programming skill, and any other relevant background.
If anyone out there has used Netset, comments about it would especially be
appreciated.
Please reply by e-mail. Thank you in advance.
No signature file....yet. James Shaw
------------------------------
End of Neurons Digest
*********************