Neuron Digest Friday, 10 Nov 1989 Volume 5 : Issue 44
Today's Topics:
Administrivia
learning a step function using back propagation
Re: Step Function
Re: Re: Step Function
Re: Step Function
Re: Re: Step Function
Step,approximations,etc
Re: Step Function
Re: Step Function
Re: Step Function
Re: Step Function
Re: Step Function
Re: Step Function
Re: Step Function
Re: Step Function
Re: Step Function
Re: Step Function
Re: Step function
Re: Step Function
Re: Step Function
Re: Re: Step Function
Re: Step Function
Re: Re: Step Function
Send submissions, questions, address maintenance and requests for old issues to
"neuron-request@hplabs.hp.com" or "{any backbone,uunet}!hplabs!neuron-request"
Use "ftp" to get old issues from hplpm.hpl.hp.com (15.255.176.205).
------------------------------------------------------------
Subject: Administrivia
From: "Neuron-Digest Moderator -- Peter Marvit" <neuron@hplabs.hp.com>
Date: Fri, 10 Nov 89 15:48:55 -0800
As usual, there have been more messages than Digests, especially from the
USENET feed of comp.ai.neural-nets; I may never get out from the flurry.
Nonetheless, this Digest (and possibly the next) is an interesting series
of exchanges which took place several months ago. I've done minimal
editing, mostly to remove unneeded re-quotations. Any lack of continuity
should therefore be attributed to me. I've made few editorial comments,
since the speakers seem more than capable. As usual, my editorial marks or
comments are enclosed by double brackets [[ like this ]].
-Peter Marvit
Immoderator
------------------------------
Subject: learning a step function using back propagation
From: rohit vidwans <cs.utexas.edu!ut-emx!walt.cc.utexas.edu!rohit@TUT.CIS.OHIO-STATE.EDU>
Organization: The University of Texas at Austin, Austin, Texas
Date: 06 Aug 89 23:09:39 +0000
I am trying to teach a network to accept as input a step function and
return as output the same step function, or a scaled version of it. I
have tried single and multiple inputs and outputs with multiple hidden
layers for my network, but am having no luck whatsoever. I am using back
propagation learning, and my hidden layers use the sigmoidal function.
If anybody has any suggestions to make, please post a reply on the net or
send mail to the above address.
Neural Network Group
Chemical Engineering Department
University of Texas at Austin
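[[ Editor's note: for readers who want to reproduce the setup, here is a
minimal sketch of what Rohit describes -- one input, one sigmoid hidden
layer, linear output, plain gradient-descent back propagation fit to
samples of a unit step. All hyperparameters (8 hidden units, learning
rate 0.5, epoch count) are my illustrative choices, not his. ]]

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Dense sampling of the target, especially around the jump at x = 0.
x = np.linspace(-1.0, 1.0, 201).reshape(-1, 1)
t = (x > 0.0).astype(float)                  # unit step target

H = 8                                         # hidden units
W1 = rng.normal(0, 1, (1, H)); b1 = np.zeros(H)
W2 = rng.normal(0, 1, (H, 1)); b2 = np.zeros(1)
lr = 0.5

for epoch in range(20000):
    h = sigmoid(x @ W1 + b1)                  # forward: hidden layer
    y = h @ W2 + b2                           # forward: linear output
    err = y - t                               # derivative of 0.5 * MSE
    dW2 = h.T @ err / len(x); db2 = err.mean(0)
    dh = (err @ W2.T) * h * (1 - h)           # backprop through sigmoid
    dW1 = x.T @ dh / len(x); db1 = dh.mean(0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

y = sigmoid(x @ W1 + b1) @ W2 + b2
print("max |y - t| =", float(np.abs(y - t).max()))

[[ With a linear output unit the fit typically succeeds away from the
jump; the residual error concentrates at x = 0, which is the point the
discussion below turns on. ]]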
------------------------------
Subject: Re: Step Function
From: ck@rex.cs.tulane.edu (Cris Koutsougeras)
Organization: Computer Science Dept., Tulane Univ., New Orleans, LA
Date: Wed, 23 Aug 89 09:19:37 +0000
With back prop, only an analytic (continuously differentiable) function
can be learned. The perfect step function is not, and therefore it will
not be learned. A close approximation can be learned if your training set
has enough samples around the point of the jump and, of course, adequate
coverage of the rest of the total interval. Look at it from another point
of view: no matter how many units (nodes) you put in your net, you are
never going to have enough non-linearity to match the perfect step
function. If you want, I can send you a paper on some related experiments
from our group at Tulane U.
C. Koutsougeras
------------------------------
Subject: Re: Re: Step Function
From: heirich@beowulf.ucsd.edu (Alan Heirich)
Organization: EE/CS Dept. U.C. San Diego
Date: 24 Aug 89 03:18:10 +0000
[[ Regarding Cris Koutsougeras' response ]]
This statement makes me uncomfortable. It has been proven (White) that a
two-layer feed-forward network with a sigmoid activation function (such as
you find in standard back propagation networks) can approximate any Borel
measurable function to an arbitrary degree of precision. So, for all
intents and purposes, such a network can match a perfect step function to
any discernible precision. On the other hand, I have not seen any
literature proving bounds on learnability. This is obviously a very
important area, but if people are still quibbling about the prevalence of
local minima in error landscapes, I can't imagine that anyone has proven
such a result about which functions are learnable!
Alan Heirich Comp. Sci. & Eng., Cognitive Science
C-014 University of California, San Diego 92093
heirich@cs.ucsd.edu
aheirich@ucsd.bitnet
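[[ Editor's note: the approximation half of Heirich's point can be seen
with a single sigmoid unit: raising the input weight (gain) k drives
sigmoid(kx) toward the unit step everywhere outside an arbitrarily small
window around the jump. A quick numeric check (illustrative only, not
White's construction):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.linspace(-1, 1, 2001)
x = x[np.abs(x) > 0.01]          # exclude a small window around the jump
step = (x > 0).astype(float)
for k in (10.0, 100.0, 1000.0):
    print(k, float(np.abs(sigmoid(k * x) - step).max()))
# the worst-case error outside the window shrinks toward 0 as k grows

]]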
------------------------------
Subject: Re: Step Function
From: apr@cbnewsl.ATT.COM (anthony.p.russo)
Organization: AT&T Bell Laboratories
Date: 24 Aug 89 13:43:32 +0000
[[ Regarding Alan Heirich's response ]]
Some things have been proven not to be learnable, e.g. X-OR.
On the other hand, I'm sure no one has proven whether step-function
scaling is learnable or not, but intuitively it seems that it is.
Perhaps not by backprop ...
There is, happily, a small body of research in this area, but not nearly
enough. The best reference I can give is:
L.G. Valiant, "A Theory of the Learnable," Communications of the ACM,
vol. 27, no. 11, 1984.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~ Tony Russo " Surrender to the void." ~
~ AT&T Bell Laboratories ~
~ apr@cbnewsl.ATT.COM ~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
------------------------------
Subject: Re: Re: Step Function
From: demers@beowulf.ucsd.edu (David E Demers)
Organization: EE/CS Dept. U.C. San Diego
Date: Fri, 25 Aug 89 04:39:32 +0000
>> [...] I have not seen any literature
>> proving bounds on learnability. This is obviously a very important area,
   [I think Alan may mean "learnability of the step function" here.]
>Some things have been proven not to be learnable, e.g. X-OR.
Surely you jest. XOR is not learnable by a neural net with
no hidden layers, but is certainly a learnable function.
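[[ Editor's note: DeMers' claim is easy to verify by construction -- a
2-2-1 threshold network with hand-set weights computes XOR exactly, so
the function is certainly representable with one hidden layer. A sketch
(these weights are one valid choice among many):

import numpy as np

def step(z):
    return (z > 0).astype(int)

W1 = np.array([[1.0, 1.0],        # column 0: weights into an OR unit
               [1.0, 1.0]])       # column 1: weights into an AND unit
b1 = np.array([-0.5, -1.5])       # thresholds 0.5 (OR) and 1.5 (AND)
W2 = np.array([1.0, -2.0])        # output computes OR and-not AND
b2 = -0.5

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = step(np.array([a, b]) @ W1 + b1)
    print(a, b, "->", int(step(h @ W2 + b2)))
# prints 0, 1, 1, 0 -- exactly XOR

]]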
>On the other hand, I'm sure no one has proven whether step-function
>scaling is learnable or not, but intuitively it seems that it is.
>[...] The best reference I can give is:
>L.G. Valiant, "A Theory of the Learnable," Communications of the ACM,
>vol. 27, no. 11, 1984.
Also interesting are:
Ehrenfeucht, Haussler, Kearns & Valiant, "A General Lower Bound on the
Number of Examples Needed for Learning" appearing in Information and
Computation, 1988 [I just have a preprint...]
and
Haussler, "Quantifying Inductive Bias: AI Learning Algorithms and Valiant's
Learning Framework", Artificial Intelligence 36 (1988).
The proceedings of the 1988 workshop on Computational Learning Theory are
available from Morgan Kaufman. One of my summer projects was to read
several of the papers from the workshop... maybe at Christmas...
But back to the original subject -
I think Alan Heirich is right that the step function may be APPROXIMATED to
arbitrary precision by a net with one hidden layer and sigmoid transfer
functions. Backpropagation may not be an adequate learning method; however
the cited work by White (appearing in the current issue of Neural Networks,
I believe { vol 2 no. 3}) shows that the function IS learnable. This is
stronger than showing that a net can approximate it.
So, given the function and given a bound on the error, there is a net which
can approximate the function to that bound, and an algorithm to learn the
mapping. Our job is to discover them :-)
Dave DeMers
demers@cs.ucsd.edu
Dept. of Computer Science & Engineering
UCSD
La Jolla, CA 92093
------------------------------
Subject: Step,approximations,etc
From: Cris Koutsougeras <rex!ck@G.MS.UKY.EDU>
Organization: Computer Science Dept., Tulane Univ., New Orleans, LA
Date: 29 Aug 89 06:28:57 +0000
Let me make this a collective response to keep the number of
messages smaller ...
>From: kroger@titan.tsd.arlut.utexas.edu (Jim Kroger)
>
>Hi, Chris. I am very interested in understanding what you mean by
>learning a function (continuous or otherwise) in the following....
Well, it is understood that the Rumelhart net and back prop constitute
a general tool for curve fitting. Given a certain configuration of the net
and its weights/thresholds/biases, one can derive (but do not try it if you
have many units) a closed-form expression for the input-output function.
That is the function "learned" by the process of adaptation. Now, the
process of adaptation is based on the samples of the training set. These
samples can be seen not only as "input pattern - output class" but more
generally as "input vector values - output vector values". In the case we
were discussing, teaching a function means sampling a given function and
using the samples as the training set. The expectation is to see the net
learn the sampled function while trying to fit a curve to the samples. In
other words, the net's final transfer function should be mathematically
equivalent to (or an approximation of) the sampled one. From there all the
difficult problems start, which are basically due to the degree of
non-linearity needed/existing, etc.
>From: Ali Minai <aam9n@uvaee.ee.virginia.edu>
>
>This is in response to your posting on the net about neural nets
>learning a step function. If you have any results on this, I would
>be very interested in references, comments etc. Thanks.
All those who have sent me personal requests will get my response. I
cannot respond "yesterday" since classes are starting now, but I will. It
would save some effort, though, if you are able to retrieve the following
publications. If you are not, then let me know.
George, R., B. Geraci, R. Shrikanth and C. Koutsougeras, "A Methodical
Study of the Rumelhart Model", 5th IASTED Intern. Conf. Expert Systems and
Neural Networks. Hawaii, Aug. 1989.
The effect of more or less units than required in a (general) network is
discussed in :
Koutsougeras, C. and C.A. Papachristou, "A Neural Network Model for
Discrete Mappings", in Proceedings of the IEEE International Conference on
Languages for Automation (LFA '88), August 1988.
Other references :
Duda, R.O. and P.E. Hart, "Pattern Classification and Scene Analysis,"
John Wiley & Sons, 1973.
Koutsougeras, C. and C.A. Papachristou, "Training of A Neural
Network Model for Pattern Classification Based on an Entropy
Measure", in Proceedings ofthe IEEE International Conference on
Neural Networks (ICNN '88), IEEE, July 1988.
Also Alan Heirich writes :
>This statement makes me uncomfortable. [[ etc...]
That is what I was trying to point out. The more non-linearity you
introduce, the better the precision, but never an exact fit. The amount of
non-linearity can be increased by introducing more units per layer or more
layers, although additional layers can easily get the net out of hand. In
fact, Ken-ichi Funahashi, in "On the approximate realization of continuous
mappings by neural networks" (Neural Networks, Vol. 2, 1989), has proven
that arbitrary approximation is possible with only one hidden layer. That
is, one only needs to add units in the hidden layer. Nevertheless, we
agree on the arbitrary approximation; the point here is that it is an
approximation and not an exact fit for the particular function.
Cris Koutsougeras
Computer Science Dpt
Tulane University
------------------------------
Subject: Re: Step Function
From: "anthony.p.russo" <att!cbnewsl!apr@UCBVAX.BERKELEY.EDU>
Organization: AT&T Bell Laboratories
Date: 29 Aug 89 11:40:29 +0000
> Surely you jest. XOR is not learnable by a neural net with
> no hidden layers, but is certainly a learnable function.
X-OR is not learnable. If you are given the first three entries in the
truth table, you could not possibly generalize to the last entry with any
confidence. That is, of the two possibilities for the last entry, neither
is preferable.
X-OR is *memorizable*, not learnable.
In either case, learnability is not a requirement for neural net
applicability. Nets memorize well.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~ Tony Russo " Surrender to the void." ~
~ apr@cbnewsl.ATT.COM ~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
------------------------------
Subject: Re: Step Function
From: ackley@wind.bellcore.com (David H Ackley)
Organization: Bellcore, Morristown, NJ
Date: 30 Aug 89 07:29:49 +0000
In article<1667@cbnewsl.ATT.COM> apr@cbnewsl.ATT.COM (anthony.p.russo) writes:
>X-OR is not learnable. If you are given the first three entries in the truth
>table, you could not possibly generalize to the last entry with
>any confidence.
This renders the term "learnable" meaningless. Over the space of boolean
functions, your claim is equally true of all of them. Without some kind of
preference structure or bias over the space, both outputs are equally
likely for the last row, regardless of the rest of the table. If you do
allow such biases, I can pick one (say, a "parity bias") that makes XOR
"learnable".
Absolutely, generalization is a big, important, hard problem --- but it's
different from learnability, and it's not why XOR is famous.
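[[ Editor's note: Ackley's point can be made concrete by enumeration.
Given three rows of a two-input truth table, two boolean functions remain
consistent with the data, and only a prior preference -- here a "parity
bias" -- selects XOR. A small sketch:

from itertools import product

# Each candidate function is a 4-tuple of outputs for the inputs
# (0,0), (0,1), (1,0), (1,1); there are 2**4 = 16 of them.
funcs = list(product([0, 1], repeat=4))
inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]

examples = {(0, 0): 0, (0, 1): 1, (1, 0): 1}     # first three XOR rows
consistent = [f for f in funcs
              if all(f[inputs.index(x)] == y for x, y in examples.items())]
print(consistent)        # [(0, 1, 1, 0), (0, 1, 1, 1)] -- XOR and OR

# Without a bias, both completions of the last row are equally good.
# A "parity bias" prefers the function equal to a XOR b everywhere:
parity = tuple(a ^ b for a, b in inputs)
choice = parity if parity in consistent else consistent[0]
print(choice)            # (0, 1, 1, 0) -- the bias singles out XOR

]]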
| David Ackley Cognitive Science Research Group |
|"To act or Bell Communications Research Inc.|
| to react, ackley@flash.bellcore.com|
| that is the question" ...!bellcore!ackley|
------------------------------
Subject: Re: Step Function
From: apr@cbnewsl.ATT.COM (anthony.p.russo)
Organization: AT&T Bell Laboratories
Date: 30 Aug 89 12:55:37 +0000
[[ Regarding David Ackley's response ]]
> This renders the term "learnable" meaningless.
I agree that X-OR is learnable *with* bias. That, however, is not the
definition of "learnable" in the strictest sense. I won't attempt to prove
or disprove your claim regarding all boolean functions, but why must
boolean functions be learnable at all? When you saw X-OR for the first
time, you most likely *memorized* the truth table. If not, then after *all
four* entries, you determined the rule: if both inputs are equal, output
"0"; else output "1". Could you have determined this rule with just
three examples? No. How would you have chosen a bias?
Perhaps our definitions of "learnable" are different. Mine is that, with a
fraction of the possible samples, one can generalize to 100% accuracy.
Otherwise, for example, if after 99 of 100 samples one cannot correctly
predict the last output with 100% confidence, nothing has been learned at
all about the function in question. If it does take 100% of the
possibilities to learn something, then I claim it has not been learned, but
rather *memorized*.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~ Tony Russo " Surrender to the void." ~
~ apr@cbnewsl.ATT.COM put disclaimer here. ~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
------------------------------
Subject: Re: Step Function
From: mathew@jane.Jpl.Nasa.Gov (Mathew Yeates)
Organization: Image Analysis Systems Grp, JPL
Date: 30 Aug 89 16:23:45 +0000
Under this [[ Russo's ]] definition, no function is learnable. Considering
a function as a set of ordered pairs, seeing some subset of the ordered
pairs can never lead to complete knowledge of the entire set. At least
for a mere mortal. A rigorous definition of learnability from examples can
be found in Valiant's "A Theory of the Learnable".
------------------------------
Subject: Re: Step Function
From: dg1v+@andrew.cmu.edu (David Greene)
Organization: Graduate School of Industrial Administration, Carnegie Mellon, Pittsburgh, PA
Date: 30 Aug 89 17:07:35 +0000
[[ Regarding Russo's definition of learnability ]]
I believe you are confusing 'sample' with 'census'. If the total
population of examples is 100 (and assuming no noise), then perhaps your
statement would be appropriate. However, if the 100 represents a sample,
then the issue of learnability after 99 examples from that sample is
dependent on the size of the total population and the degree of confidence
you require. (e.g., Valiant's PAC model of learnability -- Valiant, L. G.,
"A Theory of the Learnable," Communications of the ACM, 27 (11): 1134-1142,
November 1984)
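[[ Editor's note: to make the "degree of confidence" quantitative, the
PAC framework Greene cites gives a standard sufficient sample size for a
finite hypothesis class and a consistent, noise-free learner; a small
calculation (the numbers are purely illustrative):

import math

def pac_sample_bound(h_size, eps, delta):
    """Examples sufficient so that any hypothesis consistent with all of
    them has true error <= eps with probability >= 1 - delta."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / eps)

# e.g., all boolean functions of 4 inputs: |H| = 2**(2**4) = 65536
print(pac_sample_bound(2 ** 16, eps=0.05, delta=0.01))   # prints 314

]]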
-David
--------------------------------------------------------------------
David Perry Greene || ARPA: dg1v@andrew.cmu.edu
GSIA /Robotics || dpg@isl1.ri.cmu.edu
Carnegie Mellon Univ. || BITNET: dg1v%andrew@vb.cc.cmu.edu
Pittsburgh, PA 15213 || UUCP: !harvard!andrew.cmu.edu!dg1v
--------------------------------------------------------------------
"You're welcome to use my opinions, just don't get them all wrinkled."
------------------------------
Subject: Re: Step Function
From: pluto@beowulf.ucsd.edu (Mark E. P. Plutowski)
Organization: EE/CS Dept. U.C. San Diego
Date: 31 Aug 89 04:22:41 +0000
I would suggest you introduce one of two things into your definition of
"learnability." Either 1) make it dependent on a particular architecture,
or, 2) make it dependent upon a complexity measure, such as time needed to
learn, number of units, or number of weights.
It makes sense to discuss learnability when we talk about a set of concepts
(e.g., functions) given a hypothesis space with which to learn the concepts
(e.g., threshold-logic). Then, it is an issue whether the concept in
question can be represented within the hypothesis space -- and then, we can
discuss whether it can be learned given a feasible amount of resources.
------------------------------
Subject: Re: Step Function
From: apr@cbnewsl.ATT.COM (anthony.p.russo)
Organization: AT&T Bell Laboratories
Date: 31 Aug 89 11:22:21 +0000
[[ Replying to Mathew Yeates ]]
Yes, I should stick to Valiant's paper. However, one thing: a known
function is, in most cases, more than a set of ordered pairs. It is a set
of ordered pairs related in some way. Discover the relation and no further
examples are needed to have complete knowledge of the entire set. The
question of "learnability" is whether the function, or rule, can possibly
be extracted from a set of samples *with probability of error less than
some value*. This is one mistake I made when I said "100%
confidence"--that's not possible.
However, getting back to X-OR: there the probability of error is 0.5,
clearly too high to consider it learnable.
Aside: someone commented that learnability should be defined dependent on
architecture. I don't see why.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~ Tony Russo " Surrender to the void." ~
~ apr@cbnewsl.ATT.COM put disclaimer here ~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
------------------------------
Subject: Re: Step Function
From: pluto@beowulf.ucsd.edu (Mark E. P. Plutowski)
Organization: EE/CS Dept. U.C. San Diego
Date: 31 Aug 89 17:01:20 +0000
>... someone commented that learnability should be defined dependent
>on architecture. I don't see why.
I mentioned this for concreteness' sake. More generally, you want to know
whether the set of hypotheses your learning protocol has at its disposal
can represent the function you wish to learn to a satisfactory level of
approximation - is the hypothesis space sufficient to characterize the set
of concepts you wish to learn? A particular architecture can represent a
subset of some concept space - this is its hypothesis space - and it may
be capable of effectively learning only some subset of its hypothesis space.
In the case you mention, a single 2-input threshold gate cannot learn x-or
because it cannot represent x-or. A 2-input threshold gate network with at
least 1 hidden unit may not be able to effectively (feasibly,
satisfactorily) learn x-or, perhaps due to the ineffectiveness of the
learning law.
With the definition of learning you initially proposed, we can say very
little about learning - except that it will be impossible for all practical
purposes.
------------------------------
Subject: Re: Step Function
From: David H Ackley <wind!ackley@FLASH.BELLCORE.COM>
Organization: Bellcore, Morristown, NJ
Date: 31 Aug 89 18:02:23 +0000
In article <1697@cbnewsl.ATT.COM> apr@cbnewsl.ATT.COM (anthony.p.russo) writes:
>Yes, I should stick to Valiant's paper. However, one thing: a known function
^^^^^^^^^^^^^^
>is, in most cases, more than a set of ordered pairs. It is a set of ordered
>pairs related in some way.
By referring to "known functions" you are implicitly incorporating a bias
in your notion of learnability --- i.e., you are preferring functions that
have some (presumably simple) relation between input and output. I say
"presumably simple" because I imagine you want to rule out properties like
"appearing in the same truth table" as sufficient grounds for being
"related in some way".
Can you make your criteria for "known-ness" explicit?
| David Ackley Cognitive Science Research Group |
| "There are Bell Communications Research Inc.|
| no facts, ackley@flash.bellcore.com|
| only factors" ...!bellcore!ackley|
------------------------------
Subject: Re: Step Function
From: apr@cbnewsl.ATT.COM (anthony.p.russo)
Organization: AT&T Bell Laboratories
Date: 01 Sep 89 13:52:34 +0000
I could have omitted the word "known" and had a perfectly legitimate
statement. That is, ANY deterministic FUNCTION is a set of related ordered
pairs, whether known or not.
> Can you make your criteria for "known-ness" explicit?
No. You bring up a good point, because your argument is really, "What
functions shall we consider in the hypothesis space?"
I can't tell you; this appears to be getting very deep. On the surface, it
seems that biases are !necessary! for learning anything at all. If so,
then the biases are probably hard-wired and not learned, since they would
have to be learned in terms of other biases, etc.
Does this make sense to anyone else, or have I gone off the deep end?
~ tony ~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~ Tony Russo " Surrender to the void." ~
~ apr@cbnewsl.ATT.COM put disclaimer here ~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
------------------------------
Subject: Re: Step function
From: danforth@riacs.edu (Douglas G. Danforth)
Organization: Research Institute for Advanced Computer Science
Date: 01 Sep 89 22:18:45 +0000
[[ Regarding Tony Russo's previous message ]]
Two comments:
(1) If you will allow me to modify your words, it is interesting that
bias can also be looked upon as a "basis". By choosing a biased
view of the world one does at least have a place to stand. If an
organism has a collection of biased views then there is a
possibility that the collection can "span" a space and act as a
basis for representing any other biased view in that space.
(2) In a deep way we are all limited by the sensory space in which we
live. The representation of a function which spans a space larger
than we are aware can only be approximated by functions within the
smaller space. We are blind to that which we do not know. We
know we don't know some things. However, there are things we
don't know we don't know. Very humbling. (See Fernando Flores and
Terry Winograd's book [[ "Understanding Computers and Cognition" ]].)
--------------------------------------
Doug Danforth
danforth@riacs.edu
--------------------------------------
------------------------------
Subject: Re: Step Function
From: bill@boulder.Colorado.EDU
Organization: University of Colorado, Boulder
Date: 03 Sep 89 18:56:49 +0000
Is it necessary to have a bias in order to be able to learn?
Not always. If the training set includes an example of every possible
input, then the device only needs to be able to "memorize" the correct
responses -- it doesn't need a preset bias.
If, though, there are inputs not included in the training set, then the
device can only produce responses by somehow "generalizing" from what it
has seen, and the rules it uses for generalization constitute a bias. (I
actually prefer the word "heuristic" to "bias". What I am saying here is
that any device that learns from incomplete data must use some sort of
heuristic.)
For devices that learn solely from examples -- which includes all the
neural networks I've seen -- that's all there is to it. It's not a deep
question at all.
It may be more interesting to wonder about more "intelligent" devices,
which are capable of learning from commentary as well as examples.
("Commentary" is, e.g., your English teacher telling you that all proper
nouns should be capitalized.) Even this, though, does not change the
general principle. A comment may be thought of as a constraint upon the
set of possible responses. If all of the examples, plus all of the
constraints, do not force a unique response to every possible input, then
the device, in order to pick a response, must use some sort of bias.
------------------------------
Subject: Re: Step Function
From: apr@cbnewsl.ATT.COM (anthony.p.russo)
Organization: AT&T Bell Laboratories
Date: 05 Sep 89 11:26:00 +0000
> Is it necessary to have a bias in order to be able to learn?
> Not always. [...]
From what I have thought about and read here, it seems that learning
actually constitutes learning *biases* and nothing else (unless it is
memorization).
*IF* this is true (and I'm not sure it is), then for a classifier, from an
information-theoretic point of view, it then makes the most sense to give
the network examples that are on the borderline of a rule or class. These
"borderline" patterns should contain more information about biases than
good examples of a class.
In a way this makes sense -- it is akin to telling the network only where
the boundaries between classes are.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~ Tony Russo " Surrender to the void." ~
~ apr@cbnewsl.ATT.COM put disclaimer here ~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
------------------------------
Subject: Re: Re: Step Function
From: pluto@beowulf.ucsd.edu (Mark E. P. Plutowski)
Organization: EE/CS Dept. U.C. San Diego
Date: Tue, 05 Sep 89 20:12:11 +0000
> Not always. If the training set includes an example of every
>possible input, then the device only needs to be able to "memorize"
>the correct responses -- it doesn't need a preset bias.
This is a strong form of a learning bias. The device assumes that
everything it sees during training is relevant, and sufficient.
You might be interested in some approaches to learning theory in which the
device has access to a teacher that can provide more than just an error
signal ... e.g., once a hypothesis is formed by the device, the teacher,
rather than just saying whether or not the hypothesis is correct for given
values of inputs, can provide counterexamples to the device. Of course,
this requires a teacher with expert knowledge! And the ability to
ascertain what hypothesis the device currently entertains.
Applying this idea to neural networks is difficult. The question is: how
to apportion this type of training signal to the appropriate units in the
network, and, once received by the pertinent units, what to make of it.
Any ideas on how to backpropagate such training information? "Commentary"
feedback can apply to the entire hypothesis formed by the device, not just
its performance on a particular input.
------------------------------
Subject: Re: Step Function
From: chrisley@kanga.uucp (Ron Chrisley UNTIL 10/3/88)
Organization: Xerox Palo Alto Research Center
Date: 06 Sep 89 01:05:30 +0000
[[ Tony Russo writes about borderline patterns, et alia ]]
This is exactly the learning procedure behind Kohonen's LVQ2 (Learning
Vector Quantization) algorithm. See his paper in ICNN '88. He derives the
optimality of this method from Bayesian arguments. I do not see how the
fact that generalization = bias implies the optimality of learning the
boundary conditions, and would be very interested in having you elaborate
on why you think it might.
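[[ Editor's note: for readers unfamiliar with LVQ2, here is a schematic
sketch of the update Chrisley refers to. The "window" test is what makes
training concentrate on samples near class boundaries -- Russo's
"borderline patterns". The learning rate and window width below are
illustrative defaults, not Kohonen's published settings.

import numpy as np

def lvq2_update(x, label, codebook, classes, lr=0.05, window=0.3):
    """One LVQ2-style update on the two codebook vectors nearest to x."""
    d = np.linalg.norm(codebook - x, axis=1)
    i, j = np.argsort(d)[:2]                   # two nearest prototypes
    # window test: x must fall near the midplane between the two
    s = (1.0 - window) / (1.0 + window)
    in_window = min(d[i] / d[j], d[j] / d[i]) > s
    if in_window and classes[i] != classes[j]:
        if classes[i] == label:                # nearest one is correct
            correct, wrong = i, j
        elif classes[j] == label:              # second nearest is correct
            correct, wrong = j, i
        else:
            return                             # neither matches: no update
        codebook[correct] += lr * (x - codebook[correct])
        codebook[wrong]   -= lr * (x - codebook[wrong])

]]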
Ron Chrisley
------------------------------
Subject: Re: Re: Step Function
From: bill@boulder.Colorado.EDU
Organization: University of Colorado, Boulder
Date: Wed, 06 Sep 89 02:24:07 +0000
In principle, if you can come up with the training information, you can
use back-prop (or something very much like it) to apply it.
Back-prop is essentially gradient descent for an error function. Usually
the error function is either the total squared error in the output layer
for a randomly chosen input (this is "online" back-prop), or the average
total squared error for a set of inputs (which is "batch-mode" back-prop).
But as far as the mathematics is concerned, the error function can be
anything you please.
For a neural network, the "hypothesis" it forms is encoded by its
weights, and any sort of "comment" you please can be jammed into the
error function, if only you can formulate it as a numerical measure
calculable from the weights.
In fact, a few people have already been trying to do that sort of thing.
As you probably know, a persistent problem with back-prop is that it tends
to give networks that don't generalize very well to new inputs. One hope
for getting better performance is to augment the error function with an
extra term representing the "complexity" of the network, since it seems
intuitively reasonable that simpler networks should generalize better. It
isn't obvious how best to measure complexity: maybe the sum of the
magnitudes of all the weights; or maybe the number of weights exceeding
some fixed threshold; or maybe something else. This is work in progress,
but there have been some promising beginnings. (Perhaps somebody directly
involved can comment.)
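[[ Editor's note: a minimal sketch of "jamming a comment into the error
function" as described above, using the sum-of-magnitudes complexity
measure the poster suggests. The penalty coefficient lam is a
hypothetical choice; in gradient descent the penalty simply adds a term
to each weight gradient:

import numpy as np

def penalized_error(y, t, weights, lam=1e-3):
    """Total squared output error plus a complexity penalty: the sum of
    the magnitudes of all weights, scaled by lam."""
    return 0.5 * np.sum((y - t) ** 2) + lam * sum(np.abs(W).sum() for W in weights)

def penalty_gradient(W, lam=1e-3):
    """Extra gradient term contributed by the complexity penalty; add it
    to the usual back-prop gradient for the weight matrix W."""
    return lam * np.sign(W)

Penalizing |w| pushes small weights to exactly zero, effectively pruning
connections; penalizing w**2 instead ("weight decay") merely shrinks
them. Both are instances of the augmented error the poster describes. ]]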
------------------------------
End of Neuron Digest
*********************