Neuron Digest Friday, 20 Jan 1989 Volume 5 : Issue 5
Today's Topics:
Administrivia
Weight decay information
Send submissions, questions, address maintenance and requests for old issues to
"neuron-request@hplabs.hp.com" or "{any backbone,uunet}!hplabs!neuron-request"
ARPANET users can get old issues via ftp from hplpm.hpl.hp.com (15.255.16.205).
------------------------------------------------------------
Subject: Administrivia
From: Peter Marvit <neuron-request@hplabs.hp.com>
Date: Fri, 20 Jan 89 10:00:00 PST
[[ This issue is devoted to a single topic. I am grateful to John Kruschke
who compiled all this information. I've left his format intact, since I
don't want to break any undigestifiers. The separation is ^L (control-L)
and "- --------...". -PM ]]
------------------------------
Subject: Weight decay information
From: kruschke@cogsci.berkeley.edu (John Kruschke)
Date: Tue, 03 Jan 89 00:30:12 -0800
Here is the compilation of responses to my request for info on weight decay.
I have kept editing to a minimum, so you can see exactly what the author of
the reply said. Where appropriate, I have included some comments of my own,
set off in square brackets. The responses are arranged into three broad
topics: (1) Boltzmann-machine related; (2) back-prop related; (3) psychology
related.
Thanks to all, and happy new year! --John
- -----------------------------------------------------------------
ORIGINAL REQUEST:
I'm interested in all the information I can get regarding WEIGHT DECAY in
back-prop, or in other learning algorithms.
*In return* I'll collate all the info contributed and send the compilation
out to all contributors.
Info might include the following:
REFERENCES:
- Applications which used weight decay
- Theoretical treatments
Please be as complete as possible in your citation.
FIRST-HAND EXPERIENCE
- Application domain, details of I/O patterns, etc.
- exact decay procedure used, and results
(Please send info directly to me: kruschke@cogsci.berkeley.edu
Don't use the reply command.)
T H A N K S ! --John Kruschke.
- -----------------------------------------------------------------
From: Geoffrey Hinton <hinton@ai.toronto.edu>
Date: Sun, 4 Dec 88 13:57:45 EST
Weight-decay is a version of what statisticians call "Ridge Regression".
We used weight-decay in Boltzmann machines to keep the energy barriers
small. This is described in section 6.1 of:
Hinton, G. E., Sejnowski, T. J., and Ackley, D. H. (1984)
Boltzmann Machines: Constraint satisfaction networks that learn.
Technical Report CMU-CS-84-119, Carnegie-Mellon University.
I used weight decay in the family trees example. Weight decay was used to
improve generalization and to make the weights easier to interpret (because,
at equilibrium, the magnitude of a weight = its usefulness). This is in:
Rumelhart, D.~E., Hinton, G.~E., and Williams, R.~J. (1986)
Learning representations by back-propagating errors.
{\it Nature}, {\bf 323}, 533--536.
I used weight decay to achieve better generalization in a hard
generalization task that is reported in:
Hinton, G.~E. (1987)
Learning translation invariant recognition in a massively
parallel network. In Goos, G. and Hartmanis, J., editors,
{\it PARLE: Parallel Architectures and Languages Europe},
pages~1--13, Lecture Notes in Computer Science,
Springer-Verlag, Berlin.
Weight-decay can also be used to keep "fast" weights small. The fast
weights act as a temporary context. One use of such a context is
described in:
Hinton, G.~E. and Plaut, D.~C. (1987)
Using fast weights to deblur old memories.
{\it Proceedings of the Ninth Annual Conference of the
Cognitive Science Society}, Seattle, WA.
- --Geoff
- -----------------------------------------------------------------
[In his lecture at the International Computer Science Institute,
Berkeley CA, on 16-DEC-88, Geoff also mentioned that weight decay is
good for wiping out the initial values of weights so that only the
effects of learning remain.
In particular, if the change (due to learning) on two weights is the
same for all updates, then the two weights converge to the same
value. This is one way to generate symmetric weights from
non-symmetric starting values.
--John]
- -----------------------------------------------------------------
From: Michael.Franzini@SPEECH2.CS.CMU.EDU
Date: Sun, 4 Dec 1988 23:24-EST
My first-hand experience confirms what I'm sure many other people have told
you: that (in general) weight decay in backprop increases generalization.
I've found that it's particularly important for small training sets, and its
effect diminishes as the training set size increases.
Weight decay was first used by Barak Pearlmutter. The first mention of
weight decay is, I believe, in an early paper of Hinton's (possibly the
Plaut, Nowlan, and Hinton CMU CS tech report), and it is attributed to
"Barak Pearlmutter, Personal Communication" there.
The version of weight decay that (I'm fairly sure) all of us at CMU use is
one in which each weight is multiplied by 0.999 every epoch. Scott Fahlman
has a more complicated version, which is described in his QUICKPROP tech
report. [QuickProp is also described in his paper in the Proceedings of the
1988 Connectionist Models Summer School, published by Morgan Kaufmann.
--John]
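[ A minimal sketch, in modern Python, of the multiplicative decay described
above; the loop and sizes are made up for illustration and are not the CMU
code:

  import numpy as np

  def apply_weight_decay(weights, decay=0.999):
      # Multiplicative decay: every weight shrinks slightly toward zero.
      # 'decay' is the per-epoch multiplier mentioned above (0.999 here);
      # values closer to 1.0 decay more slowly.
      return weights * decay

  # Hypothetical training loop: decay is applied once per epoch, after the
  # ordinary back-prop updates for that epoch.
  rng = np.random.default_rng(0)
  W = rng.normal(scale=0.5, size=(24, 10))    # illustrative weight matrix
  for epoch in range(100):
      # ... back-prop weight updates would go here ...
      W = apply_weight_decay(W, decay=0.999)

]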
The main motivation for using it is to eliminate spurious large weights
which happen not to interfere with recognition of training data but would
interfere with recognizing testing data. (This was Barak's motivation for
trying it in the first place.) However, I have heard more theoretical
justifications (which, unfortunately, I can't reproduce.)
In case Barak didn't reply to your message, you might want to contact him
directly at bap@cs.cmu.edu.
--Mike
- -----------------------------------------------------------------
From: Barak.Pearlmutter@F.GP.CS.CMU.EDU
Date: 8 Dec 1988 16:36-EST
We first used weight decay as a way to keep weights in a Boltzmann machine
from growing too large. We added a term to the thing being minimized, G, so
that
G' = G + 1/2 h \sum_{i<j} w_{ij}^2
where G' is our new thing to minimize. This gives
\partial G'/\partial w_{ij} = \partial G/\partial w_{ij} + h w_{ij}
which is just weight decay with some mathematical motivation. As Mike
mentioned, I was the person who thought of weight decay in this context (in
the shower no less), but parameter decay has been used forever, in adaptive
control for example.
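[ For illustration, a small sketch (not Barak's code) of how the penalty term
turns into weight decay in a plain gradient-descent update:

  import numpy as np

  def decayed_gradient(grad_G, W, h=1e-4):
      # Gradient of G' = G + 0.5 * h * sum(w**2): the penalty contributes
      # h * w to each weight's gradient, which is exactly weight decay
      # once the weights are updated by w -= lr * dG'/dw.
      return grad_G + h * W

  # Illustrative update step with made-up numbers
  lr = 0.1
  W = np.array([[0.5, -1.2], [2.0, 0.1]])
  grad_G = np.array([[0.05, -0.02], [0.3, 0.0]])   # stand-in for the real gradient of G
  W -= lr * decayed_gradient(grad_G, W, h=1e-4)

]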
It sort of worked okay for Boltzmann machines, but works much better in
backpropagation. As a historical note, I should mention that there were some
competing techniques for keeping weights small in Boltzmann machines, such
as Mark Derthick's "differential glommetry", in which the effective target
temperature of the wake phase is higher than that of the sleep phase. I
don't know if there is an analogue for this in backpropagation, but there
certainly is for mean field theory networks.
Getting back to weight decay, it was noted immediately that G has the unit
"bits" while $w_{ij}^2$ has the unit "weight^2", sort of a problem from a
dimensional analysis point of view. Solving this conundrum, Rick Szeliski
pointed out that if we're going to transmit our weights by telephone and
know a priori that weights have gaussian distributions, so
P(w_{ij}=x) \propto e^{-1/2 h x^2}
where h is set to get the correct variance, then transmitting a weight w
will take $1/2 h w^2$ bits (plus a constant), which we can add to G with
dimensional confidence.
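[ Spelling out that argument with my own constants: if

  P(w) = \sqrt{h / 2\pi} \; e^{-\frac{1}{2} h w^2},

then the optimal code length for a weight is

  -\log_2 P(w) = \frac{h w^2}{2 \ln 2} + \mbox{const},

i.e. proportional to $\frac{1}{2} h w^2$. Since G is also measured in bits,
the sum G + \frac{1}{2} h \sum_{i<j} w_{ij}^2 is dimensionally consistent
(the factor of $\ln 2$ can be absorbed into h). ]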
Of course, this argument extends to fast/slow split weights nicely; the
other guy already knows the slow weights, so we need transmit only the fast
weights.
By "ridge regression" I guess Geoff means that valleys in weight space that
cause the weights to grow asymptotically are made to tilt up after a while,
so that the asymptotic tailing off is eliminated. It's like adding a bowl
to weight space, so minima have to be within the bowl.
An interesting side effect of weight decay is that, once we get to a
minimum, where $\partial G'/\partial w_{ij} = 0$, we have
w_{ij} \propto - \partial G/\partial w_{ij}
so we can do a sort of eyeball significance analysis, since a weight's
magnitude is proportional to how sensitive the error is to changing it.
- -----------------------------------------------------------------
From: russ%yummy@gateway.mitre.org (Russell Leighton)
Date: Mon, 5 Dec 88 09:17:56 EST
We always use weight decay in backprop. It is particularly important in
escaping local minima. Decay moves the transfer functions of all the
semi-linear (sigmoidal) nodes toward their linear region. The important point
is that all nodes move proportionally, so no information in the weights is
"erased" but only scaled. When the nodes that have trapped the system in the
local minimum are scaled enough, the system moves onto a different trajectory
through weight space. Oscillations are still possible, but are less likely.
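[ A small numerical illustration of that point (mine, not the authors'):
shrinking a unit's incoming weights shrinks its net input, which moves the
sigmoid back toward its nearly linear region, where the derivative is largest:

  import numpy as np

  def sigmoid(x):
      return 1.0 / (1.0 + np.exp(-x))

  rng = np.random.default_rng(0)
  w = rng.normal(scale=4.0, size=8)     # large weights: unit near saturation
  x = rng.uniform(-1.0, 1.0, size=8)    # an arbitrary input pattern

  for scale in (1.0, 0.5, 0.1):         # progressively stronger decay
      net = np.dot(scale * w, x)
      slope = sigmoid(net) * (1.0 - sigmoid(net))   # sigmoid derivative
      print(f"scale={scale:4.1f}  net={net:+7.3f}  slope={slope:.3f}")

As the scale shrinks, the slope approaches its maximum of 0.25, i.e. the unit
behaves more and more linearly. ]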
We use decay with a process we call "shaping" (see Wieland and Leighton,
"Shaping Schedules as a Method for Accelerating Leanring", Abstracts of the
First Annual INNS Meeting, Boston, 1988) that we use to speed learning of
some difficult problems.
ARPA: russ%yummy@gateway.mitre.org
Russell Leighton
MITRE Signal Processing Lab
7525 Colshire Dr.
McLean, Va. 22102
USA
- -----------------------------------------------------------------
From: James Arthur Pittman <hi.pittman@MCC.COM>
Date: Tue, 6 Dec 88 09:34 CST
Probably he will respond to you himself, but Alex Weiland of MITRE presented
a paper at INNS in Boston on shaping, in which the order of presentation of
examples in training a back-prop net was altered to reflect a simpler rule
at first. Over a number of epochs he gradually changed the examples to
slowly change the rule to the one desired. The nets learned much faster than
if he just tossed the examples at the net in random order. He told me that
it would not work without weight decay. He said their rule-of-thumb was that
the decay should give the weights a half-life of 2 to 3 dozen epochs (usually a
value such as 0.9998). But I neglected to ask him if he felt that the number
of epochs or the number of presentations was important. Perhaps if one had
a significantly different training set size, that rule-of-thumb would be
different?
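[ As a quick check on that rule of thumb: a per-update multiplier d gives a
half-life of \ln(1/2) / \ln(d) updates, so d = 0.9998 halves the weights after
roughly 3465 updates. That matches "2 to 3 dozen epochs" only if the decay is
applied per pattern presentation, with on the order of a hundred presentations
per epoch; which is exactly the presentation-versus-epoch question left open
above. ]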
I have started some experiments similar to his shaping, using some random
variation of the training data (where the random variation grows over time).
Weiland also discussed this in his talk. I haven't yet compared decay with
no-decay. I did try (as a lark) using decay with a regular (non-shaping)
training, and it did worse than we usually get (on the same data and the same
network type/size/shape). Perhaps I was using a stupid decay value (0.9998,
I think) for that situation.
I hope to get back to this, but at the moment we are preparing for a
software release to our shareholders (MCC is owned by 20 or so computer
industry corporations). In the next several weeks a lot of people will go
on Christmas vacation, so I will be able to run a bunch of nets all at once.
They call me the machine vulture.
- -----------------------------------------------------------------
From: Tony Robinson <ajr@digsys.engineering.cambridge.ac.uk>
Date: Sat, 3 Dec 88 11:10:20 GMT
Just a quick note in reply to your message to `connectionists' to say that I
have tried to use weight decay with back-prop on networks with order 24 i/p,
24 hidden, 11 o/p units. The problem was vowel recognition (I think); it
was about 18 months ago, and the problem was of the unsolvable type (i.e.
non-zero final energy).
My conclusion was that weight decay only made matters worse, and my
justification (to myself) for abandoning weight decay was that you are not
even pretending to do gradient descent any more, and any good solution
formed quickly becomes garbaged by scaling the weights.
If you want to avoid hidden units sticking at their limiting values, why not
use hidden units with no limiting values? For instance, I find the activation
function f(x) = x * x works better than f(x) = 1.0 / (1.0 + exp(- x))
anyway.
Sorry I haven't got anything formal to offer, but I hope these notes help.
Tony Robinson.
- -----------------------------------------------------------------
From: jose@tractatus.bellcore.com (Stephen J Hanson)
Date: Sat, 3 Dec 88 11:54:02 EST
Actually, "costs" or "penalty" functions are probably better terms. We had a
poster last week at NIPS that discussed some of the pitfalls and advantages
of two kinds of costs. I can send you the paper when we have a version
available.
Stephen J. Hanson (jose@bellcore.com)
- -----------------------------------------------------------------
[ In a conversation in his office on 06-DEC-88, Dave Rumelhart
described to me several cost functions he has tried.
The motive for the functions he has tried is different from the motive
for standard weight decay. Standard weight decay,
\sum_{i,j} w_{i,j}^2 ,
is used to *distribute* weights more evenly over the given connections,
thereby increasing robustness (cf. earlier replies).
He has tried several other cost functions in an attempt to *localize*, or
concentrate, the weights on a small subset of the given connections. The
goal is to improve generalization. His favorite is
\sum_{i,j} ( w_{i,j}^2 / ( K + w_{i,j}^2 ) )
where K is a constant, around 1 or 2. Note that this function is negatively
accelerating, whereas standard weight decay is positively accelerating.
This function penalizes small weights (proportionally) more than large
weights, just the opposite of standard weight decay.
He has also tried, with less satisfying results,
\sum_{i,j} ( 1 - \exp( -\alpha w_{i,j}^2 ) )
and
\sum_{i,j} \ln ( K + w_{i,j}^2 ).
Finally, he has tried a cost function designed to make all the fan-in
weights of a single unit decay, when possible. That is, the unit is
effectively cut out of the network. The function is
\sum_i (\sum_j w_{i,j}^2) / ( K + \sum_j w_{i,j}^2 ).
Each weight is thereby penalized (inversely) proportionally to the total
fan-in weight of its node.
--John ]
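[ For concreteness, a sketch (mine, not Rumelhart's code) of his favorite
penalty and the gradient term that would be added to each back-prop weight
update:

  import numpy as np

  def localizing_penalty(W, K=1.0):
      # Cost sum_{i,j} w^2 / (K + w^2): unlike standard decay, the marginal
      # penalty falls off for large weights, so the network is encouraged to
      # concentrate its weights on a few connections.
      return np.sum(W**2 / (K + W**2))

  def localizing_penalty_grad(W, K=1.0):
      # d/dw [ w^2 / (K + w^2) ] = 2*K*w / (K + w^2)^2, elementwise.
      return 2.0 * K * W / (K + W**2) ** 2

  # Illustrative update: add the (scaled) penalty gradient to the error gradient
  W = np.array([[0.1, 3.0], [-0.5, 0.02]])
  grad_error = np.zeros_like(W)            # stand-in for the back-prop gradient
  cost_weight = 0.01                       # relative weighting of the penalty
  W -= 0.1 * (grad_error + cost_weight * localizing_penalty_grad(W, K=1.0))

]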
- -----------------------------------------------------------------
[ This is also a relevant place to mention my paper in the Proceedings of
the 1988 Connectionist Models Summer School, "Creating local and distributed
bottlenecks in back-propagation networks". I have since developed those
ideas, and have expressed the localized bottleneck method as gradient
descent on an additional cost term. The cost term is quite general, and
some forms of decay are simply special cases of it. --John]
- -----------------------------------------------------------------
From: john moody <moody-john@YALE.ARPA>
Date: Sun, 11 Dec 88 22:54:11 EST
Scalettar and Zee did some interesting work on weight decay with back prop
for associative memory. They found that a Unary Representation emerged (see
Baum, Moody, and Wilczek; Bio Cybernetics Aug or Sept 88 for info on Unary
Reps). Contact Tony Zee at UCSB (805)961-4111 for info on weight decay
paper.
--John Moody
- -----------------------------------------------------------------
From: gluck@psych.Stanford.EDU (Mark Gluck)
Date: Sat, 10 Dec 88 16:51:29 PST
I'd appreciate a copy of your weight decay collation. I have a paper in MS
form which illustrates how adding weight decay to the linear-LMS one-layer
net improves its ability to predict human generalization in classification
learning.
mark gluck
dept of psych
stanford univ,
stanford, ca 94305
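[ For reference, a minimal sketch (mine, not Gluck's model) of what linear-LMS
with weight decay amounts to:

  import numpy as np

  def lms_decay_step(w, x, target, lr=0.05, decay=0.001):
      # One delta-rule update on a linear unit, plus weight decay, which
      # shrinks every weight slightly toward zero on each step.
      error = target - np.dot(w, x)
      return w + lr * error * x - decay * w

  w = np.zeros(4)
  x = np.array([1.0, 0.0, 1.0, 1.0])       # an illustrative stimulus pattern
  for _ in range(200):
      w = lms_decay_step(w, x, target=1.0)

]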
- -----------------------------------------------------------------
From: INAM000 <INAM%MCGILLB.bitnet@jade.berkeley.edu> (Tony Marley)
Date: SUN 04 DEC 1988 11:16:00 EST
I have been exploring some ideas re COMPETITIVE LEARNING with "noisy
weights" in modeling simple psychophysics. The task is the classical one of
identifying one of N signals by a simple (verbal) response, e.g. the stimuli
might be squares of different sizes, and one has to identify the presented
one by saying the appropriate integer. We know from classical experiments
that people cannot perform this task perfectly once N gets larger than about
7, but performance degrades smoothly for larger N.
I have been developing simulations where the mapping is learnt by
competitive learning, with the weights decaying/varying over time when they
are not reset by relevant inputs. I have not got too many results to date,
as I have been taking the psychological data seriously, which means worrying
about reaction times, sequential effects, "end effects" (stimuli at the end
of the range more accurately identified), range effects (increasing the
stimulus range has little effect), etc.
Tony Marley
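[ A rough sketch (my own reading, not Marley's simulations) of competitive
learning in which weights decay and drift when their unit does not win:

  import numpy as np

  def competitive_step(W, x, rng, lr=0.1, decay=0.01, noise=0.005):
      # Winner-take-all: the best-matching unit moves toward the input;
      # all other units' weights decay slightly and pick up a little noise,
      # standing in for "weights decaying/varying when not reset".
      winner = int(np.argmax(W @ x))
      losers = np.arange(W.shape[0]) != winner
      W = W.copy()
      W[losers] *= (1.0 - decay)
      W[losers] += noise * rng.normal(size=W[losers].shape)
      W[winner] += lr * (x - W[winner])
      return W

  rng = np.random.default_rng(0)
  W = rng.uniform(size=(10, 5))            # 10 response categories, 5 input features
  x = rng.uniform(size=5)                  # an illustrative stimulus
  for _ in range(100):
      W = competitive_step(W, x, rng)

]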
- -----------------------------------------------------------------
From: aboulanger@bbn.com (Albert Boulanger)
Date: Fri, 2 Dec 88 19:43:14 EST
This one concerns the Hopfield model. In
James D Keeler,
"Basin of Attraction of Neural Network Models",
Snowbird Conference Proceedings (1986), 259-264,
it is shown that the basins of attraction become very complicated as
the number of stored patterns increases. He uses a weight modification
method called "unlearning" to smooth out these basins.
Albert Boulanger
BBN Systems & Technologies Corp.
aboulanger@bbn.com
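[ For readers unfamiliar with "unlearning": a rough sketch of the usual
recipe (my own, assuming the Hopfield/Feinstein/Palmer form, not taken from
Keeler's paper). The net is relaxed from a random state to whatever attractor
it finds, often a spurious one, and a small outer-product term for that state
is then subtracted, flattening its basin:

  import numpy as np

  def unlearn_step(W, rng, eps=0.01, settle_steps=200):
      n = W.shape[0]
      s = rng.choice([-1, 1], size=n)              # random starting state
      for _ in range(settle_steps):                # asynchronous settling
          i = rng.integers(n)
          s[i] = 1 if np.dot(W[i], s) >= 0 else -1
      W = W - eps * np.outer(s, s)                 # unlearn the found attractor
      np.fill_diagonal(W, 0.0)
      return W

  rng = np.random.default_rng(0)
  patterns = rng.choice([-1, 1], size=(5, 32))     # five stored patterns
  W = (patterns.T @ patterns) / 32.0               # standard Hebbian storage
  np.fill_diagonal(W, 0.0)
  for _ in range(50):
      W = unlearn_step(W, rng)

]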
- -----------------------------------------------------------------
From: Joerg Kindermann <unido!gmdzi!joerg@uunet.UU.NET>
Date: Mon, 5 Dec 88 08:21:03 -0100
We used a form of weight decay not for learning but for recall in multilayer
feedforward networks. See the following abstract. Input patterns are treated
as ``weights'' coming from a constant valued external unit.
If you would like a copy of the technical report, please send e-mail to
joerg@gmdzi.uucp
or write to:
Dr. Joerg Kindermann
Gesellschaft fuer Mathematik und Datenverarbeitung
Schloss Birlinghoven
Postfach 1240
D-5205 St. Augustin 1
WEST GERMANY
Detection of Minimal Microfeatures by Internal Feedback
J. Kindermann & A. Linden
Abstract
We define the notion of minimal microfeatures and introduce a new method of
internal feedback for multilayer networks. Error signals are used to modify
the input of a net. When combined with input DECAY, internal feedback allows
the detection of sets of minimal microfeatures, i.e. those subpatterns which
the network actually uses for discrimination. Additional noise on the
training data increases the number of minimal microfeatures for a given
pattern. The detection of minimal microfeatures is a first step towards a
subsymbolic system with the capability of self-explanation. The paper
provides examples from the domain of letter recognition.
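[ My reading of the procedure as a sketch (the report itself may differ): the
weights of a trained two-layer net are frozen, the error is propagated all the
way back to the input units, and the input is updated by gradient descent plus
a decay term, so that input components the network does not need fade away:

  import numpy as np

  def sigmoid(x):
      return 1.0 / (1.0 + np.exp(-x))

  def internal_feedback(x, W1, W2, target, steps=200, lr=0.5, decay=0.02):
      x = x.astype(float).copy()
      for _ in range(steps):
          h = sigmoid(W1 @ x)                      # hidden activations
          y = sigmoid(W2 @ h)                      # output activations
          delta_out = (y - target) * y * (1.0 - y) # back-propagated error...
          delta_hid = (W2.T @ delta_out) * h * (1.0 - h)
          grad_x = W1.T @ delta_hid                # ...down to the input units
          x -= lr * grad_x + decay * x             # error step plus input decay
      return x

  rng = np.random.default_rng(0)
  W1, W2 = rng.normal(size=(6, 16)), rng.normal(size=(3, 6))
  pattern = rng.integers(0, 2, size=16).astype(float)
  target = sigmoid(W2 @ sigmoid(W1 @ pattern))     # treat the net's own output as "correct"
  minimal = internal_feedback(pattern, W1, W2, target)

]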
- -----------------------------------------------------------------
From: Helen M. Gigley <hgigley@note.nsf.gov>
Date: Mon, 05 Dec 88 11:03:23 -0500
I am responding to your request even though my use of decay is not with
respect to learning in connectionist-like models. My focus has been on a
functioning system that can be lesioned.
One question I have is what is the behavioral association to weight decay?
What aspects of learning is it intended to reflect? I can understand that
activity decay over time of each cell is meaningful and reflects a cellular
property, but what is weight decay in comparable terms?
Now, I will send you offprints of my work if you would like, and am including
a list of several publications which you may be able to peruse. The model,
HOPE, is a hand-tuned structural connectionist model that is designed to
enable lesioning without redesign or reprogramming to study possible
processing causes of aphasia. Decay factors as an integral part of dynamic
time-dependent processes are one of several aspects of processing in a
neural environment which potentially affect the global processing results
even though they are defined only locally. If I can be of any additional
help please let me know.
Helen Gigley
References:
Gigley, H.M. Neurolinguistically Constrained Simulation of Sentence
Comprehension: Integrating Artificial Intelligence and Brain Theory.
Ph.D. Dissertation, UMass/Amherst, 1982. Available University
Microfilms, Ann Arbor, MI.
Gigley, H.M. HOPE--AI and the dynamic process of language behavior.
in Cognition and Brain Theory 6(1) :39-88, 1983.
Gigley, H.M. Grammar viewed as a functioning part of a cognitive
system. Proceedings of ACL 23rd Annual Meeting, Chicago, 1985.
Gigley, H.M. Computational Neurolinguistics -- What is it all about?
in IJCAI Proceedings, Los Angeles, 1985.
Gigley, H.M. Studies in Artificial Aphasia--experiments in processing
change. In Journal of Computer Methods and Programs in Biomedicine,
22 (1): 43-50, 1986.
Gigley, H.M. Process Synchronization, Lexical Ambiguity Resolution,
and Aphasia. In Steven L. Small, Garrison Cottrell, and Michael
Tanenhaus (eds.), Lexical Ambiguity Resolution, Morgan Kaufmann, 1988.
- -----------------------------------------------------------------
From: bharucha@eleazar.Dartmouth.EDU (Jamshed Bharucha)
Date: Tue, 13 Dec 88 16:56:00 EST
I haven't tried weight decay but am curious about it. I am working on
back-prop learning of musical sequences using a Jordan-style net. The
network develops a musical schema after learning lots of sequences that have
culture-specific regularities. I.e., it learns to generate expectancies for
tones following a sequential context. I'm interested in knowing how to
implement forgetting, whether short term or long term.
Jamshed.
------------------------------
End of Neurons Digest
*********************