
LETTER Communicated by Anthony Zador

Choice and Value Flexibility Jointly Contribute to the Capacity of a Subsampled Quadratic Classifier

Panayiota Poirazi
Bartlett W. Mel
Department of Biomedical Engineering, University of Southern California, Los Angeles,
CA 90089, U.S.A.

Biophysical modeling studies have suggested that neurons with active
dendrites can be viewed as linear units augmented by product terms that
arise from interactions between synaptic inputs within the same dendritic
subregions. However, the degree to which local nonlinear synaptic inter-
actions could augment the memory capacity of a neuron is not known
in a quantitative sense. To approach this question, we have studied the
family of subsampled quadratic classifiers: linear classifiers augmented
by the best k terms from the set of K = (d² + d)/2 second-order prod-
uct terms available in d dimensions. We developed an expression for the
total parameter entropy, whose form shows that the capacity of an SQ
classifier does not reside solely in its conventional weight values—the
explicit memory used to store constant, linear, and higher-order coef-
ficients. Rather, we identify a second type of parameter flexibility that
jointly contributes to an SQ classifier’s capacity: the choice as to which
product terms are included in the model and which are not. We validate
the form of the entropy expression using empirical studies of relative
capacity within families of geometrically isomorphic SQ classifiers. Our
results have direct implications for neurobiological (and other hardware)
learning systems, where in the limit of high-dimensional input spaces
and low-resolution synaptic weight values, this relatively little explored
form of choice flexibility could constitute a major source of trainable
model capacity.

1 Introduction

Experimental evidence continues to accumulate indicating that the dendritic
trees of many neuron types are electrically “active”; they contain
voltage-dependent ionic conductances that can lead to full regenerative
propagation of action potentials and other dynamical phenomena (for recent
reviews see Johnston, Magee, Colbert, & Christie, 1996; Stuart, Spruston,
Sakmann, & Hausser, 1997; Stuart, Spruston, & Hausser, 1999). Although
our state of knowledge regarding the functions of dendritic trees is largely
confined to the realm of hypothesis (Koch, Poggio, & Torre, 1982; Rall &

Neural Computation 12, 1189–1205 (2000) © 2000 Massachusetts Institute of Technology

[Figure 1 schematic: a dendritic tree receiving inputs A, B, and C, with hypothetical somatic responses to the paired inputs A+B and A+C.]

Figure 1: Conceptual illustration of location-dependent nonlinear interactions
in an active dendritic tree. Two inputs, A and B, activated at high frequency on
a dendritic branch containing a thresholding nonlinearity could be sufficient to
cause the branch (and the cell) to “fire” repeatedly, while coactivation of more
distantly separated inputs, A and C, might generate only a rare suprathreshold
event at any location. This type of nonlinear interaction between A and B is
sometimes loosely termed multiplicative.

Segev, 1987; Shepherd & Brayton, 1987; Borg-Graham & Grzywacz, 1992;
Mel, 1992a, 1992b, 1993, 1994, 1999; Zador, Claiborne, & Brown, 1992; Yuste
& Tank, 1996; Segev & Rall, 1998; Mel, Ruderman, & Archie, 1998a, 1998b),
results of existing experimental and modeling studies strongly suggest that
electrical compartmentalization provided by the dendritic branching struc-
ture, coupled with local thresholding provided by active dendritic channels,
together are likely to have a profound impact on the way in which synap-
tic inputs to a dendritic tree are integrated to produce a cell’s output (see
Figure 1).
A key challenge in deciphering the functions of individual neurons in-
volves properly characterizing the form of nonlinear interactions between
synapses within a dendritic compartment, and to this end, a variety of bio-
physical modeling studies have identified possible functional roles for such
interactions (Koch et al., 1982; Rall & Segev, 1987; Shepherd & Brayton, 1987;
Borg-Graham & Grzywacz, 1992; Koch & Poggio, 1992; Zador et al., 1992;
Mel, 1992a, 1992b, 1993). In this same vein, two recent biophysical model-
ing studies (Mel et al., 1998a, 1998b) have shown that the time-averaged
response of a cortical pyramidal cell with weakly excitable dendrites can
mimic the output of a quadratic “energy” model, a formalism that has been
used to model a variety of receptive field properties in visual cortex (Pollen
& Ronner, 1983; Adelson & Bergen, 1985; Heeger, 1992; Ohzawa, DeAngelis,
& Freeman, 1997). This close connection between dendritic and receptive
field nonlinearities is highly suggestive and supports the notion that ac-
tive dendrites could contribute to a diverse set of neuronal functions in the
mammalian central nervous system that involve a common, quasi-static
multiplicative nonlinearity of relatively low order (for further discussion,
see Mel, 1999).
The potential contributions of active dendritic processing to the brain’s
capacity to learn and remember constitute a fundamental problem in neuroscience
that has thus far received relatively little attention. In one study address-
ing the issue of neuronal capacity, Zador and Pearlmutter (1996) computed
the Vapnik-Chervonenkis (1971) dimension of an integrate-and-fire neu-
ron with a passive dendritic tree. However, location-dependent nonlinear
synaptic interactions, which are likely to occur within the dendrites of many
neuron types, were not considered in this study.
To begin to address this issue, we previously developed a spatially ex-
tended single-neuron model called the “clusteron,” which explicitly in-
cluded location-dependent multiplicative synaptic interactions in the neu-
ron’s input-output function (Mel, 1992a) and illustrated how a neuron with
active dendrites could act as a high-dimensional quadratic learning ma-
chine. The clusteron abstraction was conceived primarily to help quantify
the impact of local dendritic nonlinearities on the usable memory capacity of
a cell; an earlier biophysical modeling study had demonstrated empirically
that this impact could be significant (Mel, 1992b). However, an additional
insight gained from this work was that a cell with s input sites is likely
capable of representing only a small fraction of the O(d²) second-order in-
teraction terms available among d input lines, since in the brain, it is often
reasonable to assume that d is very large. In short, if a neuron does indeed
pull out higher-order features in its dendrites, it is likely to have to choose
very carefully among them.
Inspired in part by this observation, our goal here has been to quantify the
degree to which inclusion of a subset of the best available higher-order terms
augments the capacity of a subsampled quadratic (SQ) classifier relative to
its linear counterpart. Although it has long been appreciated that inclusion
of higher-order terms can increase the power of a learning machine for both
regression and classification (Poggio, 1975; Barron, Mucciardi, Cook, Craig,
& Barron, 1984; Giles & Maxwell, 1987; Ghosh & Shin, 1992; Guyon, Boser,
& Vapnik, 1993; Karpinsky & Werther, 1993; Schurmann, 1996), and algo-
rithms for learning polynomials have often included heuristic strategies for
subsampling the intractable number of higher-order product terms, existing
theory bearing on the learning capabilities of quadratic classifiers is limited.
Cover (1965) showed that an rth order polynomial classifier consisting of
all r-wise products of the input variables in d dimensions has a natural
separating capacity of
2\binom{d+r}{r},

or twice the number of parameters. However, an SQ classifier with k nonzero
product terms, which would appear at first to have a Cover capacity of
2(1 + d + k), in fact contains additional covert parameters relating to the
choice of nonzero coefficients to be included in the classifier, since this choice
is a bona fide form of model flexibility that can be used to fit training data.
Viewed in another way, the SQ classifier with k nonzero terms also implicitly
specifies zero values on the product terms that were not chosen, so that the
capacity of an SQ classifier must necessarily take account of these additional
internal degrees of freedom.
Our specific objective here has been to obtain an expression that quan-
tifies the separate contributions to the capacity of an SQ classifier that
arise from parameter value versus choice flexibility. Beyond the theoreti-
cal significance of this question and its broader implications for the learn-
ing performance of classifiers that subsample from large sets of nonlin-
ear basis functions, our hope has been that this analysis and its subse-
quent refinements will lead to further insights into the potential contri-
butions of active dendritic processing to the memory capacity of neural
tissue.

2 Toward the Capacity of a Subsampled Quadratic Classifier

We consider the family of SQ classifiers in which only k of the K = (d² + d)/2
available second-order terms in d dimensions are allowed to take on nonzero
values. In the special case where k = 0, we are left with a pure linear
classifier, while k = K corresponds to a full quadratic classifier. Crucially,
the k nonlinear terms included in the model are not chosen at random, but
are chosen instead to be the best subset of k terms—that which minimizes
training set error.
The capacity of a learning machine depends on the number of distinct
input-output functions that can be expressed, which in the case of a classi-
fier corresponds to the number of different labelings (dichotomies) that can
be assigned to N patterns in d dimensions (Cover, 1965; Vapnik & Chervo-
nenkis, 1971). Although a general expression for the number of dichotomies
realizable by an SQ classifier is difficult to obtain as a function of N, d, and
k, an upper bound on the total parameter entropy of an SQ classifier taken
over the space of randomly drawn training sets may be expressed as

B(k, d) = (1 + d + k)\,w + \log_2 \binom{K}{k}, \qquad (2.1)

where the first term measures the entropy contained in the stored param-
eter values, with w an unknown representing the average number of bits
of value entropy per parameter, and the second term measures the entropy
contained in the choice of higher-order terms, assuming that all combinations
of k higher-order terms are chosen equally often. The factorization of
value versus choice entropy assumes that the choice of higher-order terms,
and the values assigned to them, are statistically independent.

[Figure 2 plots bits used in classifiers against the product term ratio p = k/K (%) for d = 10, 20, and 30, with the weight-bit and choice-bit contributions marked.]

Figure 2: SQ capacity curves were generated using equation 2.1 with w = 4.
Dashed lines indicate the contribution of explicit coefficients to the bit total,
while the additional increment is provided by the combinatorial term in equation
2.1 pertaining to the choice of nonzero coefficients over the available collection
of product terms. Both terms are O(d²).
The relative magnitudes of value versus choice capacity are shown in
Figure 2, where B(k, d) is plotted as a function of the product term ratio
p = k/K, for w = 4. The contribution of the weight values to the bit total is
given by the dashed line connecting the p = 0% and p = 100% points on each
curve. The contribution of the choice term has the form of an upside-down
“U” and can be seen as the increment to the bit count above the dashed
line.1
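
For illustration, a minimal numerical sketch of equation 2.1 (Python; the function name sq_capacity_bits is ours, and w = 4 bits per coefficient is assumed, as in Figure 2):

import math

def sq_capacity_bits(d, k, w=4.0):
    # Upper bound on the parameter entropy of an SQ classifier (equation 2.1).
    # d: input dimension; k: number of second-order terms retained;
    # w: assumed average bits of value entropy per explicit coefficient.
    K = (d * d + d) // 2                      # all available product terms
    value_bits = (1 + d + k) * w              # constant, linear, and k product coefficients
    choice_bits = math.log2(math.comb(K, k))  # entropy of choosing which k terms to keep
    return value_bits, choice_bits

# Example: keeping the k = 10 best of K = 55 product terms in d = 10 dimensions
value, choice = sq_capacity_bits(10, 10)
print(value, choice, value + choice)          # 84 value bits plus roughly 35 choice bits

Even for modest k, the choice term contributes a substantial fraction of the total, as Figure 2 shows.
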
When the mapping between learning problems and classifier parameters
is deterministic (i.e., when there always exists a single best setting of
classifier parameters to minimize training error, and when all classifier
states are equally likely), equation 2.1 represents the mutual information
contained in the classifier parameters about training set labels. Unfortunately,
equation 2.1 cannot be directly used to predict error rates for SQ classifiers.
First, the value of w is generally unknown a priori. Second, actual error rates
do not depend on parameter entropy alone, but depend also on geometric
factors that determine the suitability of the classifier’s parameters to specific
training set distributions.

1 The expression for B assumes that the value bits and the choice bits are independent
of each other, which holds only when the k best product terms chosen for inclusion in
the classifier are unlikely to include any zero values. This assumption breaks down as
k → K, in which limit some of the k largest learned coefficients will have values near zero.
This leads in turn to an overprediction of classifier capacity according to equation 2.1 and
the resulting slight nonmonotonicity seen in Figure 2 for large p values. A more intricate
counting method would be needed to separate these two sources of classifier capacity for
large p values.
Following the maxim that capacity is closely linked to trainable degrees
of freedom, however, and the principle that capacity and error rates should
be related to each other in consistent (if unknown) ways, we conjectured
that there could exist subfamilies of isomorphic SQ classifiers wherein the
relative capacity of any two classifiers i and j, at any error rate and for
any training set distribution, is given by the ratio of their respective B(k, d)
values. Specifically, a family of isomorphic SQ classifiers C is defined such
that for any two classifiers {i, j} ∈ C, the relative number of patterns N
learnable by each classifier at error rate ε obeys the relation

R_{i/j} = \frac{N_i}{N_j} = \frac{B_i}{B_j}, \quad \text{for all } \varepsilon. \qquad (2.2)

Put another way, two isomorphic classifiers should yield equal error rates
when asked to learn an equivalent number of patterns per parameter bit, for
any input distribution. We describe several tests of this scaling conjecture
in the following sections.
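
Under the conjecture of equation 2.2, the predicted relative capacity of two classifiers follows directly from their bit totals. A sketch reusing sq_capacity_bits from above (the (d, k) pairs are taken from the p-matched families of Figure 7; the helper name is ours):

def capacity_ratio(d_i, k_i, d_j, k_j, w=4.0):
    # Predicted ratio of learnable pattern counts, R_{i/j} = B_i / B_j (equation 2.2).
    return sum(sq_capacity_bits(d_i, k_i, w)) / sum(sq_capacity_bits(d_j, k_j, w))

# Two p-matched classifiers, (d, k) = (20, 46) versus (10, 12):
print(capacity_ratio(20, 46, 10, 12))   # predicted factor by which learnable N increases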

2.1 Special Cases: Linear and Full Quadratic Classifiers. The scaling
relation expressed in equation 2.2 may be easily verified in the limiting cases
of k = 0 and k = K, which correspond to linear and full quadratic classifiers,
respectively. In both cases, the combinatorial term representing the choice
capacity in equation 2.1 vanishes. Assuming that the value-entropy scaling
factor w is always equal for the two classifiers at a given error rate, the
capacity ratio for two isomorphic “choiceless” classifiers indexed by i and j
simplifies to the ratio of their explicit parameter counts:

R_{i/j} = \frac{1 + d_i}{1 + d_j} \qquad (2.3)

for the linear case and


R_{i/j} = \frac{\binom{d_i + 2}{2}}{\binom{d_j + 2}{2}} = \frac{d_i^2 + 3d_i + 2}{d_j^2 + 3d_j + 2} \qquad (2.4)

for the full quadratic case. Both outcomes are consistent with the fact that
the natural separating capacity for both linear and full polynomial discrim-
inant functions is directly proportional to the number of free parameters
available to the classifier (Cover, 1965). Numerical simulations and sup-
porting calculations confirmed that this capacity scaling at every error rate
holds for linear and quadratic classifiers in 10, 20, and 30 dimensions (see
Figure 3).
The superposition of performance curves for linear classifiers in 10, 20,
and 30 dimensions shown in Figure 3C validates the assumption in equa-
tion 2.3 that w is independent of d. Thus, randomly labeled samples drawn
from a spherical gaussian distribution evoke the same number of bits of
entropy in a linear classifier, per parameter, regardless of the dimension in
which the classifier sits. A statistical mechanical interpretation of this fact
is given in Riegler & Seung (1997).
Similarly, the scaling of full quadratic performance curves shown in Fig-
ure 3C implies that the value of w is independent of d within this fam-
ily of learning machines as well. However, it is important to note that the
value of w is not necessarily consistent between the two families, since
the graphs of error rate versus training set size for linear and quadratic
classifiers are not related by any simple scaling factor (see Figure 3C).
In fact, full quadratic classifiers performed better per parameter than lin-
ear classifiers for small training sets—in this case, up to six patterns per
model parameter. The opposite was true for large training sets for which
the training set covariance matrix approaches the identity, thus depleting
any information about class boundaries contained in the O(d²) quadratic
parameters.
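
As a concrete illustration of the kind of experiment summarized in Figure 3, the following sketch (ours, not the original code; an ordinary least-squares fit stands in for the numerically optimized discriminant) estimates training error for a linear classifier on randomly labeled spherical gaussian patterns:

import numpy as np

def linear_training_error(d, N, trials=20, seed=0):
    # Average training-set error of a least-squares linear discriminant on N
    # randomly labeled patterns drawn from a d-dimensional spherical gaussian.
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(trials):
        X = rng.standard_normal((N, d))
        y = rng.choice([-1.0, 1.0], size=N)
        Xb = np.hstack([np.ones((N, 1)), X])        # constant term plus d linear terms
        w, *_ = np.linalg.lstsq(Xb, y, rcond=None)  # MSE-optimal weight vector
        errs.append(np.mean(np.sign(Xb @ w) != y))
    return float(np.mean(errs))

# Error climbs toward the 50% chance level as N exceeds the classifier's capacity.
for N in (11, 22, 50, 100, 200):
    print(N, linear_training_error(10, N))

The analogous full quadratic curve is obtained by expanding each pattern with its second-order product terms before fitting.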

2.2 Capacity Scaling for General SQ Classifiers. We tested the scaling
relation in equation 2.2 for SQ classifiers in the general case when
0 < k < K. In all simulations reported here, the coefficients for the k
nonzero second-order terms were determined as follows. A conjugate gra-
dient algorithm was used to train a full quadratic classifier to minimize
mean squared error (MSE) over the training set. Given the sphericity of
the training set distribution, the selection of the k best-product terms after
training could be made based on weight magnitude, a strategy that is not
valid in general (see Bishop, 1995). However, in this case, it was verified
empirically that the increase in MSE or classification error when a single
weight was set to zero grew roughly monotonically with the weight mag-
nitude (see Figure 4). The pruned classifier, including only the k largest
second-order terms, was retrained to minimize MSE. During this second
round of training, the output of the classifier was passed through a sig-
moidal thresholding function (1 − e^{−x/s})/(1 + e^{−x/s}) with slope s = 0.51. In
some cases, the resulting value parameters were binarized to take on values
of ±1.
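
A rough numpy sketch of this prune-and-retrain procedure (a stand-in only: ordinary least squares replaces the conjugate-gradient MSE minimization, and the sigmoidal output stage and optional binarization are omitted; helper names are ours):

import numpy as np

def quadratic_features(X):
    # Constant, linear, and all K = (d² + d)/2 second-order product terms.
    N, d = X.shape
    prods = [X[:, i] * X[:, j] for i in range(d) for j in range(i, d)]
    return np.hstack([np.ones((N, 1)), X, np.column_stack(prods)])

def train_sq(X, y, k):
    # Fit a full quadratic classifier, keep the k largest second-order weights,
    # and refit the pruned model.
    N, d = X.shape
    Phi = quadratic_features(X)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)              # first training round
    keep = np.argsort(np.abs(w[1 + d:]))[::-1][:k] + 1 + d   # k best product terms
    cols = np.concatenate([np.arange(1 + d), keep])          # constant + linear + chosen
    w2, *_ = np.linalg.lstsq(Phi[:, cols], y, rcond=None)    # second training round
    return cols, w2

def sq_predict(X, cols, w):
    return np.sign(quadratic_features(X)[:, cols] @ w)

For example, cols, w = train_sq(X, y, k=10) followed by np.mean(sq_predict(X, cols, w) != y) gives the training error used to construct curves like those in Figure 5A.
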
Graphs of error rate versus training set size for SQ classifiers are shown
in Figure 5A for a range of k values in 10 dimensions. The capacity boost
resulting from inclusion of varying numbers of second-order terms to a
linear classifier was culled from this graph by choosing a fixed error rate

[Figure 3 panels: (A) model fit to performance curves for linear classifiers, d = 10; (B) model fit to performance curves for quadratic classifiers, d = 10; (C) scaling of performance curves across dimension, d = 10, 20, 30. Each panel plots classification error (%) against training patterns (A, B) or training patterns per model parameter (C).]

Figure 3: We trained linear and full quadratic classifiers using randomly labeled
patterns drawn from a d-dimensional zero-mean spherical gaussian distribution
and measured average classification error versus training set size. (A) A sim-
ple analytical model (crosses) fit numerical simulation results (squares) in the
asymptote of large N (see the appendix). Dashed lines indicate VC dimension
1 + d and Cover capacity 2(1 + d) for the perceptron, which correspond to 1%
and 7% error rates, respectively. Error rates slightly above the theoretical mini-
mum (0% at VC dimension) were due to imperfect control of the learning-rate
parameter. (B) A Bayes-optimal classifier based on estimated covariance matri-
ces (crosses) provided a control for the performance of a numerically optimized
full quadratic classifier (squares); the performance curves again merge in the
limit of large N. (C) Scaling of linear and quadratic error rates across dimension,
when plotted against training patterns per model parameter (1 + d for linear,
1 + d + (d² + d)/2 for quadratic).

[Figure 4 panels: (A) distribution of learned weight values (d = 10, N = 100), relative probability versus weight value for linear and quadratic terms; (B) squared-error increase versus weight magnitude for individually pruned weights.]

Figure 4: Distributions of learned weight values and increments to MSE with
pruning of individual weights. (A) Similar distribution of linear versus second-
order weight values after training a full quadratic classifier. (B) Effect of pruning
individual weights on the MSE. For each data point, x-value indicates weight
magnitude; y-value indicates the increment to MSE when that weight was set to
0 (without further retraining). As expected from the symmetry of the training
distribution, an increase in MSE is roughly monotonic with weight magnitude
for both linear and second-order weights. A similar relationship was seen for
increments in classification error as a function of weight magnitude.

and recording the associated capacities for the linear versus various SQ
curves (see Figure 5B). The contribution of the choice flexibility is evi-
dent here, leading to capacity boosts in excess of that predicted by ex-
plicit parameter counts. For example, at a fixed 1% error rate in 10 di-
mensions, addition of the 10 best quadratic terms to the 11 linear and
constant terms boosted the trainable capacity by more than a factor of
3, whereas the number of explicit parameters grew by less than a factor
of 2.
By analogy with the special cases of linear and full quadratic classifiers,
we sought to identify subfamilies of isomorphic SQ classifiers within which
the relative capacity of any two classifiers was given by Ri/j . In doing so,
however, two complications arise. First, it is a priori unclear what condi-
tion(s) on d and k bind together families of isomorphic SQ classifiers in
general, if such families exist. Second, even if such families do exist, w no
longer cancels in the expression for Ri/j , as it does for both linear and full
quadratic classifiers. In cases where w is unknown, therefore, evaluation of
Ri/j is prevented.
Regarding the criterion for classifier isomorphism, we found empirically
that error rates for SQ classifiers i and j in dimensions di and dj could be
brought into register across dimension with a simple capacity scaling, as

[Figure 5 panels: (A) classification error (%) versus training patterns for SQ classifiers in d = 10 with k = 0 (linear), 10, 20, 30, 40, and 55 (full quadratic); (B) capacity ratio (SQ/linear) versus classification error (%).]

Figure 5: Performance of SQ classifiers. (A) Performance curves for various
values of k in a 10-dimensional input space. (B) Ratio of the memory capacity
of SQ classifiers in 10 dimensions relative to a linear classifier (k = 0), plotted as
a function of the error rate.

long as the relation

\frac{k_i}{K_i} = \frac{k_j}{K_j} \qquad (2.5)

was obeyed. This suggests that a key geometric invariance in the SQ family
involves the proportion of higher-order terms included in the classifier (see
Figure 6B). This rule also operates correctly for the special cases of linear
(k = 0) and full quadratic (k = K) classifiers. By contrast, another plausible
condition on d and k failed to bind together sets of isomorphic SQ classifiers:
classifiers that use the same fraction of their complete parameter set (see
Figure 6A).
Having established that two SQ classifiers with equal product term ratios
p = k/K produce performance curves that are isomorphic up to a scaling of
their capacities, the key question remaining was whether the scaling fac-
tor would be predicted by Ri/j . Assuming wi = wj as for the linear and full
quadratic cases, we note that the relative capacity of two isomorphic SQ clas-
sifiers is again independent of w and approaches (di/dj)² for large di and dj. This
occurs because as di and dj grow large, the ratio of the two value capacities
considered separately, and the two choice capacities considered separately,
both individually approach (di/dj)². However, for the lower-dimensional cases
included in our empirical studies, the capacity ratio Ri/j does not entirely
shed its dependence on w and p. We found that a value of w = 4 produced
good fits to the empirically determined scaling factors for p-matched SQ

[Figure 6 panels (20-d and 30-d classifiers compared to a 10-d reference): (A) parameters used in the target dimension versus parameters used in 10 dimensions, as a percentage of total parameters; (B) product term ratio in the target dimension versus product term ratio p in 10 dimensions.]

Figure 6: Relationship between SQ classifiers in different dimensions whose
performance curves were isomorphic. (A) The nonlinear relation shown indicates
that no simple proportionality exists between the total parameter counts
(1 + d + k) used by two SQ classifiers whose performance curves were found
to be isomorphic up to an x-axis scaling. 20-d (crosses) and 30-d (squares)
classifiers were compared to a 10-d reference case. (B) In contrast, a linear relation
(equality) holds between the product term ratios p = k/K for two isomorphic SQ
classifiers.

classifiers in 10, 20, and 30 dimensions, for a wide range of p values tested,
and for training sets drawn from both gaussian and uniform distributions2
(see Figure 7).
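
To make this normalization concrete, a small sketch (reusing sq_capacity_bits from Section 2; w = 4 for analog weights, and the rounded k values only approximately match the (d, k) pairs listed in Figure 7):

def p_matched_family(p, dims=(10, 20, 30), w=4.0):
    # For a fixed product term ratio p = k/K, return (d, k, total parameter bits).
    family = []
    for d in dims:
        K = (d * d + d) // 2
        k = round(p * K)                            # p-matched number of product terms
        family.append((d, k, sum(sq_capacity_bits(d, k, w))))
    return family

print(p_matched_family(0.21))          # analog weights, w = 4
print(p_matched_family(0.21, w=1.0))   # binary weights, w = 1, as in Figure 8

Dividing training set size by the bit totals returned here gives the x-axis normalization used in Figures 7 and 8.
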
To eliminate uncertainty in the value of w, we trained binary-valued SQ
classifiers so that w could be assigned a value of 1 in the expression for Ri/j .
In this case, SQ classifiers were trained as before, but prior to testing, the
real-valued coefficients were mapped to ±1 according to their sign. This
radical quantization of weight values led to large increases in error rates
and gross alterations in the form of the error curves; compare Figure 8A to
Figure 7A. However, Ri/j could now be evaluated with no free parameters
and continued to produce nearly perfect superpositions of error rate curves
within isomorphic families (see Figure 8).
The nonmonotonicities seen in these binary-valued cases are of unknown
origin. However, given that they scale across dimension, we infer they arise
from consistent, family-specific idiosyncrasies in the number, and inde-
pendence, of the dichotomies realizable for particular training set distri-
butions.

2 Fits using w = 3 were nearly as good; the scaling factor’s weak dependence on w is

explained by the fact that w appears in both the numerator and denominator of Ri/j .

[Figure 7 panels, plotting classification error (%) against training patterns per parameter bit: (A) analog-weight SQ curves scale across dimension, with a p ≈ 21% family at (d, k) = (10, 12), (20, 46), (30, 90) and a p ≈ 45% family at (10, 25), (20, 95), (30, 210); (B) curves for uniform versus gaussian training distributions at p ≈ 18%, (d, k) = (10, 10), (20, 40).]

Figure 7: Error rates for SQ classifiers in different dimensions could be superimposed
when matched for product term ratio p = k/K. (A) Error rates were equivalent
for isomorphic SQ classifiers in 10, 20, and 30 dimensions when training
set size was normalized by capacity as given by equation 2.1, with w = 4. The two
groups of curves shown are for p ≈ 21% and p ≈ 45%. The approximation
arises from quantization of k values. The different shapes of the two groups of
curves imply that classifiers with different p-ratios perform differently per bit in
learning the same training set. (B) When training patterns were drawn from a
uniform distribution, normalized capacity curves for isomorphic SQ classifiers
continued to match across dimension, although the shapes of the curves were
different from those produced by the same classifiers trained with gaussian
samples.

3 Discussion

We derived a simple expression for the total parameter flexibility of a subsampled
quadratic classifier, with separate factors for value versus choice
capacity. To validate the form of the expression, we (i) identified families of
geometrically isomorphic SQ classifiers, with linear and full quadratic dis-
criminant functions as special cases; (ii) measured relative capacities within
these families; (iii) demonstrated that equation 2.2 predicted capacity scal-
ing factors for SQ classifiers in 10 to 30 dimensions with one free parameter
for classifiers with real-valued coefficients, and with no free parameters for
classifiers with binary-valued coefficients; and (iv) showed that for classi-
fiers with real-valued or binary-valued coefficients, both the predicted and
the empirically derived capacity scaling factors were consistent for two very
different training set distributions, suggesting that the expression for capac-
ity scaling in equation 2.2 may hold for any well-behaved input distribution.
The quantification of capacity in equation 2.1 is not yet supported by an
exact function counting theorem as has been worked out for linear and full

[Figure 8 panels, plotting classification error (%) against training patterns per parameter bit for binary-weight SQ classifiers: (A) curves scale across dimension for the p ≈ 21% family, (d, k) = (10, 12), (20, 46), (30, 90), and the p ≈ 45% family, (10, 25), (20, 95), (30, 210); (B) curves for uniform versus gaussian distributions differ in shape, p ≈ 18%, (d, k) = (10, 10), (20, 40).]

Figure 8: Error curves for p-matched SQ classifiers trained on gaussian-distributed
patterns and constrained to use binary-valued ±1 weights. Superposition of
curves was achieved using capacity scaling factor Ri/j with no free parameters,
since w could be set to 1 in this case.

rth-order polynomials (Cover, 1965), nor has it been derived from a fun-
damental statistical mechanical perspective as has been done, for example,
for the simple perceptron (Riegler & Seung, 1997)—and we cannot predict
how difficult it will be to see through either of these approaches. In its favor,
however, the simple form of equation 2.1, coupled with the condition for
classifier isomorphism given in equation 2.5, together account for much of
the empirical variance in model capacity expressed by this rather complex
class of learning machines. It is therefore likely that these two quantitative
relations will be echoed in future theorems and proofs bearing on the class
of subsampled polynomial classifiers.
While choice capacity is in some sense invisible, it is nonetheless very
real. Thus, when a single optimally chosen nonlinear basis function is added
to a linear model, the system behaves as if it gained not only an additional
w bits of value, or coefficient, capacity, but a bonus capacity of log₂(K) bits
owing to the choice that was made. When K is large and w is small, this
bonus can be highly significant. It seems plausible that the bit-counting
approach of equation 2.1 could be useful in quantifying the capacity of
other types of classifiers that adaptively subsample from large families of
nonlinear basis functions—for example, a classifier that draws an optimal
set of k radial basis functions from a complete set of K basis functions tiling
an input space.
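
A brief numerical illustration of this bonus (the value d = 1000 is hypothetical, not taken from the paper):

import math
d = 1000                  # hypothetical number of input lines
K = (d * d + d) // 2      # 500,500 candidate second-order terms
print(math.log2(K))       # about 18.9 bits of choice capacity per optimally chosen term

With w on the order of a few bits, as assumed above, the choice bonus per added term can thus exceed the value capacity of the coefficient itself.
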
The practical significance of choice capacity is most acute for hardware-
based learning systems, possibly including the brain, where in the limits
of low-resolution weight values (Petersen, Malenka, Nicoll, & Hopfield,
1998) and very high-dimensional input spaces, learning capacity may be
dominated by the flexibility to choose from a huge set of candidate nonlinear
basis functions. As a corollary, the capacity of such learning systems could
be significantly underestimated if based solely on counts of existing units
or synaptic connections—a strategy that, by contrast, is appropriate when
used to estimate the capacity of a fully connected neural network (Baum &
Haussler, 1989).

Appendix: Derivation of Formula for Error Rate of Spherical Perceptron in Limit of Large N

We derived an expression for perceptron error rates using a simple geometric
argument valid in the limit of large, randomly labeled training sets.
When N patterns are drawn from a zero-mean spherical gaussian distribu-
tion G and randomly labeled with ±1, the resulting positive and negative
training sets Tpos and Tneg may each be modeled as spherical gaussian blobs
Gpos and Gneg , with means Xpos and Xneg slightly shifted from the origin. The
sample means Xpos and Xneg are themselves distributed as d-dimensional
gaussians, but with reduced variance given by
\sigma' = \sigma / \sqrt{N/2}. \qquad (A.1)

Given that most of the mass of a high-dimensional spherical gaussian with
variance σ² is contained in a thin shell around radius σ√d, we may view
Xpos and Xneg as randomly oriented vectors of length

L = \sigma' \sqrt{d} = \sigma \sqrt{2d/N}. \qquad (A.2)

Since in high dimension Xpos and Xneg are also orthogonal, the expected
distance between them is given by ΔX = √2 L. The optimal linear discrimi-
nant function is thus a hyperplane that bisects, and is perpendicular to, the
line connecting Xpos and Xneg . Given that Tpos and Tneg have equal prior prob-
ability, the expected classification error is given by the integrated probabil-
ity contained in Gpos (or Gneg ) lying beyond the discriminating hyperplane.
Since Gpos is factorial, this calculation reduces to a complementary error
function—the total probability lying beyond the distance ΔX/2 = σ√(d/N)
from the origin of a 1-dimensional gaussian. Thus, in the limit of high di-
mension with N ≫ d, the expected classification error is
CE = \frac{1}{2}\,\operatorname{erfc}\!\left(\sqrt{\frac{d}{2N}}\right), \qquad (A.3)

using the standard substitution z = x/(\sqrt{2}\,\sigma), and the complementary error
function defined by \operatorname{erfc}(z) = \frac{2}{\sqrt{\pi}} \int_z^{\infty} e^{-t^2}\, dt.

In a variant of this classification problem, the N training samples drawn
from G are all assigned to Tpos, while Tneg is taken to be G itself. This problem
thus requires responding yes to the largest possible fraction of training
samples, while saying no to the largest possible fraction of the probability
density contained in G. Assuming equal priors for positive and negative
exemplars during testing, the optimal separating hyperplane in this case
cuts halfway through, and perpendicular to, the line connecting Xpos to the
origin. This reduces ΔX, and hence the z score in equation A.3, by a factor
of 2, to give a classification error of CE = \frac{1}{2}\,\operatorname{erfc}\!\left(\sqrt{d/4N}\right). This was
dubbed the “déjà vu” learning problem and analyzed by Pearlmutter (1992)
using a somewhat different approach.
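
In code, the two closed-form error rates derived in this appendix are simply (a sketch; math.erfc is the standard-library complementary error function):

import math

def random_label_error(d, N):
    # Equation A.3: expected training error of the optimal linear discriminant
    # on N randomly labeled d-dimensional spherical gaussian patterns, N >> d.
    return 0.5 * math.erfc(math.sqrt(d / (2.0 * N)))

def deja_vu_error(d, N):
    # The variant above: all N samples labeled positive, with G itself negative.
    return 0.5 * math.erfc(math.sqrt(d / (4.0 * N)))

# For d = 10, both approach the 50% chance level as N grows:
for N in (100, 500, 2000):
    print(N, random_label_error(10, N), deja_vu_error(10, N))

The first of these is the analytical model plotted as crosses in Figure 3A.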

Acknowledgments

Thanks to Barak Pearlmutter, Dan Ruderman, Chris Williams, and Lyle
Borg-Graham for helpful comments and suggestions. Funding for this work
was provided by the NSF, the ONR, and a Myronis fellowship.

References

Adelson, E., & Bergen, J. (1985). Spatiotemporal energy models for the percep-
tion of motion. J. Opt. Soc. Amer., A2, 284–299.
Barron, R., Mucciardi, A., Cook, F., Craig, J., & Barron, A. (1984). Adaptive
learning networks: Development and applications in the United States of
algorithms related to GMDH. In S. Farlow (Ed.), Self-organizing methods in
modeling. New York: Marcel Dekker.
Baum, E., & Haussler, D. (1989). What size net gives valid generalization? Neural
Computation, 1, 151–160.
Bishop, C. (1995). Neural networks for pattern recognition. Oxford: Oxford Univer-
sity Press.
Borg-Graham, L., & Grzywacz, N. (1992). A model of the direction selectiv-
ity circuit in retina: Transformations by neurons singly and in concert.
In T. McKenna, J. Davis, & S. Zornetzer (Eds.), Single neuron computation
(pp. 347–375). Orlando, FL: Academic Press.
Cover, T. (1965). Geometrical and statistical properties of systems of linear in-
equalities with applications in pattern recognition. IEEE Trans. Electronic Com-
puters, 14, 326–334.
Ghosh, J., & Shin, Y. (1992). Efficient higher order neural networks for classifi-
cation and function approximation. International Journal of Neural Systems, 3,
323–350.
Giles, C., & Maxwell, T. (1987). Learning, invariance, and generalization in high-
order neural networks. Applied Optics, 26, 4972–4978.
Guyon, I., Boser, B., & Vapnik, V. (1993). Automatic capacity tuning of very large
VC-dimension classifiers. In S. Hanson, J. Cowan, & C. Giles (Eds.), Advances in
neural information processing systems, 5 (pp. 147–155). San Mateo, CA: Morgan
Kaufmann.
Heeger, D. (1992). Half-squaring in responses of cat striate cells. Visual Neurosci.,
9, 427–443.
Johnston, D., Magee, J., Colbert, C., & Christie, B. (1996). Active properties of
neuronal dendrites. Ann. Rev. Neurosci., 19, 165–186.
Karpinsky, M., & Werther, T. (1993). VC dimension and uniform learnability of
sparse polynomials and rational functions. SIAM J. Comput., 22, 1276–1285.
Koch, C., & Poggio, T. (1992). Multiplying with synapses and neurons. In
T. McKenna, J. Davis, & S. Zornetzer (Eds.), Single neuron computation
(pp. 315–345). Orlando, FL: Academic Press.
Koch, C., Poggio, T., & Torre, V. (1982). Retinal ganglion cells: A functional
interpretation of dendritic morphology. Phil. Trans. R. Soc. Lond. B, 298, 227–
264.
Mel, B. W. (1992a). The clusteron: Toward a simple abstraction for a complex
neuron. In J. Moody, S. Hanson, & R. Lippmann (Eds.), Advances in neural
information processing systems, 4 (pp. 35–42). San Mateo, CA: Morgan Kauf-
mann.
Mel, B. W. (1992b). NMDA-based pattern discrimination in a modeled cortical
neuron. Neural Comp., 4, 502–516.
Mel, B. W. (1993). Synaptic integration in an excitable dendritic tree. J. Neuro-
physiol., 70(3), 1086–1101.
Mel, B. W. (1994). Information processing in dendritic trees. Neural Computation,
6, 1031–1085.
Mel, B. W. (1999). Why have dendrites? A computational perspective. In G. Stu-
art, N. Spruston, & M. Häusser (Eds.), Dendrites (pp. 271–289). Oxford: Oxford
University Press.
Mel, B. W. , Ruderman, D. L., & Archie, K. A. (1998a). Toward a single cell
account of binocular disparity tuning: An energy model may be hiding in
your dendrites. In M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.), Advances in
neural information processing systems, 10 (pp. 208–214). Cambridge, MA: MIT
Press.
Mel, B. W. , Ruderman, D. L., & Archie, K. A. (1998b). Translation-invariant
orientation tuning in visual “complex” cells could derive from intradendritic
computations. J. Neurosci., 17, 4325–4334.
Ohzawa, I., DeAngelis, G., & Freeman, R. (1997). Encoding of binocular disparity
by complex cells in the cat’s visual cortex. J. Neurophysiol., 77.
Pearlmutter, B. A. (1992). How selective can a linear threshold unit be? In Inter-
national Joint Conference on Neural Networks. Beijing, PRC: IEEE.
Petersen, C. C. H., Malenka, R. C., Nicoll, R. A., & Hopfield, J. J. (1998). All-or-
none potentiation at CA3–CA1 synapses. Proc. Natl. Acad. Sci. USA, 95, 4732–
4737.
Poggio, T. (1975). Optimal nonlinear associative recall. Biol. Cybern., 9, 201.
Pollen, D., & Ronner, S. (1983). Visual cortical neurons as localized spatial fre-
quency filters. IEEE Trans. Sys. Man Cybern., 13, 907–916.
Rall, W., & Segev, I. (1987). Functional possibilities for synapses on dendrites
and on dendritic spines. In G. Edelman, W. Gall, & W. Cowan (Eds.), Synaptic
function (pp. 605–636). New York: Wiley.
Riegler, P., & Seung, H. S. (1997). Vapnik-Chervonenkis entropy of the spherical
perceptron. Phys. Rev. E, 55, 3283–3287.
Schurmann, J. (1996). Pattern classification: A unified view of statistical and neural
approaches. New York: Wiley.
Segev, I., & Rall, W. (1998). Excitable dendrites and spines—earlier theoretical
insights elucidate recent direct observations. Trends Neurosci., 21, 453–460.
Shepherd, G., & Brayton, R. (1987). Logic operations are properties of computer-
simulated interactions between excitable dendritic spines. Neurosci., 21, 151–
166.
Stuart, G., Spruston, N., Sakmann, B., & Hausser, M. (1997). Action potential
initiation and backpropagation in neurons of the mammalian CNS. TINS, 20,
125–131.
Stuart, G., Spruston, N., & Hausser, M. (Eds.). (1999). Dendrites. Oxford: Oxford
University Press.
Vapnik, V. N., & Chervonenkis, A. Y. (1971). On the uniform convergence of
relative frequencies of events to their probabilities. Theory of Probability and
Its Applications, 16, 264–280.
Yuste, R., & Tank, D. W. (1996). Dendritic integration in mammalian neurons, a
century after Cajal. Neuron, 16, 701–716.
Zador, A., Claiborne, B., & Brown, T. (1992). Nonlinear pattern separation in
single hippocampal neurons with active dendritic membrane. In J. Moody,
S. Hanson, & R. Lippmann (Eds.), Advances in neural information processing
systems, 4 (pp. 51–58). San Mateo, CA: Morgan Kaufmann.
Zador, A., & Pearlmutter, B. A. (1996). VC dimension of an integrate-and-fire
model neuron. Neural Computation, 8, 611–624.

Received July 1, 1998; accepted July 20, 1999.
