

Journal of Computer and System Sciences 68 (2004) 205–234


http://www.elsevier.com/locate/jcss

More efficient PAC-learning of DNF with membership queries under the uniform distribution

Nader H. Bshouty,a Jeffrey C. Jackson,b and Christino Tamonc

a Computer Science Department, Technion - Israel Institute of Technology, Haifa 32000, Israel
b Mathematics and Computer Science Department, Duquesne University, 600 Forbes Avenue, Pittsburgh, PA 15282-1754, USA
c Department of Mathematics and Computer Science, Clarkson University, Potsdam, NY 13699-5815, USA

Received 3 December 2001; revised 15 June 2003; accepted in revised form 17 October 2003

Abstract

An efficient algorithm exists for learning disjunctive normal form (DNF) expressions in the uniform-distribution PAC learning model with membership queries (J. Comput. System Sci. 55 (1997) 414), but in practice the algorithm can only be applied to small problems. We present several modifications to the algorithm that substantially improve its asymptotic efficiency. First, we show how to significantly improve the time and sample complexity of a key subprogram, resulting in similar improvements in the bounds on the overall DNF algorithm. We also apply known methods to convert the resulting algorithm to an attribute-efficient algorithm. Furthermore, we develop a technique for lower bounding the sample size required for PAC learning with membership queries under a fixed distribution and apply this technique to produce a lower bound on the number of membership queries needed for the uniform-distribution DNF learning problem. Finally, we present a learning algorithm for DNF that is attribute efficient in its use of random bits.
© 2003 Elsevier Inc. All rights reserved.

Keywords: Probably approximately correct learning; Membership queries; Disjunctive normal form; Uniform distribution; Fourier transform

This material is based upon the work supported by the National Science Foundation under Grants CCR-9800029, CCR-9877079, and CCR-0209064.

Corresponding author.
E-mail address: jackson@math.duq.edu (J.C. Jackson).

0022-0000/$ - see front matter © 2003 Elsevier Inc. All rights reserved.
doi:10.1016/j.jcss.2003.10.002


1. Introduction
Jackson [7] gave the first polynomial-time PAC learning algorithm for DNF with membership queries under the uniform distribution. However, the algorithm's time and sample complexity make it impractical for all but relatively small problems. The algorithm is also not particularly efficient in its use of random bits.

In this paper we significantly improve the time, sample, and random bit complexity of Jackson's Harmonic Sieve. Furthermore, by applying existing techniques, we show that the algorithm can be made attribute efficient. Attribute efficient learning algorithms are standard PAC algorithms with the additional constraint that the sample complexity of the algorithm must be polynomial in $\log n$ (where $n$ represents the total number of attributes) and in all other standard PAC parameters, including the number of attributes that are relevant to the target, which we will denote by $r$.
Specifically, with $s$ representing the size of the DNF expression to be learned and $1 - \epsilon$ the desired accuracy of the learned hypothesis, we show that the sample complexity can be reduced from the $\tilde{O}(ns^4/\epsilon^8)$ implicit in Jackson's Harmonic Sieve algorithm to $\tilde{O}(rs^2/\epsilon^4)$. Similarly, the time bound can be reduced from $\tilde{O}(ns^{10}/\epsilon^{12})$ to $\tilde{O}(rs^6/\epsilon^4)$. Other aspects of the algorithm, such as the form and size of the hypothesis produced, are not adversely affected by these changes. Furthermore, at the expense of producing a more complex hypothesis, the sample and time complexity can be reduced to $\tilde{O}(rs^2/\epsilon^2)$ and $\tilde{O}(rs^6/\epsilon^2)$, respectively, by employing a recent observation of Klivans and Servedio [8].
We obtain our main results by improving on a key Fourier-based subprogram of the Harmonic Sieve. Specifically, the Sieve relies on an algorithm developed by Goldreich and Levin [6] that, given the ability to query a Boolean function $f$ over $n$ Boolean inputs at a polynomial number of points, finds the parity functions (over subsets of the inputs) that are best correlated with $f$ with respect to the uniform distribution. (The Goldreich–Levin algorithm is often referred to in the learning-theoretic literature as the Kushilevitz–Mansour algorithm, as the latter authors were the first to apply it to larger learning theory questions [9].) As applied in the Sieve, the Goldreich–Levin algorithm runs in time $\tilde{O}(ns^6)$ and sample complexity $\tilde{O}(ns^4)$. The Harmonic Sieve actually needs a slightly modified version of this algorithm and performs additional processing, further increasing the overall bounds on DNF learning.
Levin subsequently developed an alternative algorithm for the parity-learning problem that offers some potential computational benefits [10]. However, a straightforward implementation of the algorithm within the Harmonic Sieve gives a time bound of $\tilde{O}(n^2 s^2 + n s^4)$ and sample bound of $\tilde{O}(n^2 s^2)$, which while better than Goldreich–Levin in terms of $s$ are worse in terms of $n$.

We build on Levin's work, developing an algorithm that, when used as a subprogram of the Harmonic Sieve, has time bound and sample complexity $\tilde{O}(ns^2)$ (or $\tilde{O}(rs^2)$ if a relevant attribute detection scheme is applied), improving on both Goldreich–Levin and Levin. This improved algorithm for locating well-correlated parity functions may be of interest beyond our application of it to DNF learning.
Along these lines, it should be noted that our algorithm is another illustration of the close connections between results in cryptography/derandomization and learning (see [8] for more examples). Since Goldreich and Levin, and subsequently Levin, developed their algorithms for the purpose of proving properties of hard-core predicates, it is natural to ask if our algorithm leads to new cryptographic results. While our improvements of Levin's algorithm do imply stronger hard-core properties than given in Levin's paper [10], similar improvements have already been observed by Goldreich using different means [5].
We also develop a new technique for finding a lower bound on the sample size needed to learn classes in the PAC-learning model with membership queries. We apply this technique to find a lower bound of roughly $\Omega(s \log r)$ for PAC-learning DNF with membership queries under the uniform distribution. There is thus some gap between this and our upper bound.
Finally, we develop some general derandomization techniques and apply them to obtain a learning algorithm for DNF that is attribute efficient in its use of random bits.

The remainder of the paper is organized as follows. We first give necessary definitions and some background theorems. The key problem analyzed in this paper, which we call the weak parity problem, is then defined, and Levin's original algorithm for solving this problem is presented in detail. We next develop two improvements on Levin's algorithm. The performance of the final improved Levin algorithm when it operates as a subprogram of the Harmonic Sieve is then analyzed. Next, we present our lower bound argument and approach to improving randomness efficiency. Finally, suggestions for further work close the paper.

2. Definitions and notation


Let $n$ be some positive integer and let $[n] = \{1, 2, \ldots, n\}$. We consider Boolean functions of the form $f : \{0,1\}^n \to \{-1,1\}$, the class $C_n$ of such functions, and the countable union of such classes $C = \bigcup_{n \geq 0} C_n$. In this paper we will focus on the class of DNF expressions. A DNF formula is a disjunction of terms, where each term is a conjunction of literals. A literal is either a variable or its negation. The DNF-size of a function $f$ is the minimal number of terms in any DNF formula representing $f$. Since every Boolean function over $\{0,1\}^n$ can be expressed as a DNF formula, when we speak later of learning DNF we will mean learning the class of all possible Boolean functions in time bounded by (among other things) a polynomial in the DNF-size of the target function.

For $a \in \{0,1\}^n$, denote the $i$th bit of $a$ by $a_i$. The Hamming weight of $a$, i.e., the number of ones in $a$, is denoted $wt(a)$. For $i \in [n]$, the unit vector $e_i$ is the vector of all zeros except for the $i$th bit, which is one. For $I \subseteq [n]$, the vector $e_I$ denotes the vector of all zeros except at the bit positions indexed by $I$, where they are ones. We denote the bitwise exclusive-or between two vectors $a, b \in \{0,1\}^n$ by $a \oplus b$. The dot product $a \cdot b$ is defined as $\sum_{i=1}^n a_i b_i$. When dealing with subsets, we identify them with their characteristic vectors, i.e., subsets of $[n]$ with vectors of $\{0,1\}^n$. So for two subsets $A, B$, the symmetric difference of $A$ and $B$ is denoted by $A \oplus B$.

If $f(n) = O(g(n))$ and $g'(n)$ is $g(n)$ with all logarithmic factors removed, then we write $f(n) = \tilde{O}(g'(n))$. This extends to $k$-ary functions for $k > 1$ in the obvious way. The function $\mathrm{sign}(x)$ returns $1$ if $x$ is positive and $-1$ if $x$ is negative.
Let $f, h$ be Boolean functions. We say that $h$ is an $\epsilon$-approximator for $f$ under distribution $D$ if $\Pr_D[f \neq h] < \epsilon$. We also use the notation $f \triangle h$ to denote the symmetric difference between $f$ and $h$, i.e., $\{x : f(x) \neq h(x)\}$. The example oracle for $f$ with respect to $D$ is denoted by $EX(f, D)$. This oracle returns the pair $(x, f(x))$, where $x$ is drawn from $\{0,1\}^n$ according to distribution $D$. The membership oracle for $f$ is denoted by $MEM(f)$. On input $x \in \{0,1\}^n$, this membership oracle returns $f(x)$. The Probably Approximately Correct (PAC) learning model [13] is defined as follows. A class $C$ of Boolean functions is called PAC-learnable if there is an algorithm $A$ such that for any positive $\epsilon$ (accuracy) and $\delta$ (confidence), for any $f \in C$, and for any distribution $D$, with probability at least $1 - \delta$ the algorithm $A(EX(f, D), \epsilon, \delta)$ produces an $\epsilon$-approximator for $f$ with respect to $D$ in time polynomial in the size $s$ of $f$, $n$, $1/\epsilon$, and $1/\delta$. We call a concept class weakly PAC-learnable if it is PAC-learnable with $\epsilon = 1/2 - 1/\mathrm{poly}(n, s)$. The $\epsilon$-approximator for $f$ in this case is called a weak hypothesis for $f$. If $C$ is PAC-learnable by an algorithm $A$ that uses the membership oracle then $C$ is said to be PAC-learnable with membership queries.
A variable or input $x_i$ to a function $f$ is relevant if $f(a) \neq f(a \oplus e_i)$ for some $a \in \{0,1\}^n$. If there exists a function $\phi(n)$ so that $C$ is PAC-learnable by an algorithm $A$ that asks $\mathrm{poly}(r, s, \phi(n))$ examples and queries, where $r$ is the number of relevant variables of the target function $f \in C$, then $C$ is said to be $\phi(n)$-attribute efficient PAC-learnable with membership queries. A class is attribute-efficient PAC-learnable if it is $\log n$-attribute efficient PAC-learnable. If the algorithm $A$ outputs an $\epsilon$-approximator $h$ of size $\mathrm{poly}(r, s, \phi(n))$, then $C$ is said to be PAC-learnable attribute efficiently with small hypothesis.
The Fourier transform of a function $f$ mapping $\{0,1\}^n$ to the reals is defined as follows. Let $\hat{f}(a) = \mathbf{E}_x[f(x)\chi_a(x)]$ be the Fourier coefficient of $f$ at $a \in \{0,1\}^n$, where $\chi_a(x) = (-1)^{a \cdot x}$ and the expectation is taken with respect to the uniform distribution over $\{0,1\}^n$. It is easily shown that any such function $f$ can be represented as $f(x) = \sum_a \hat{f}(a)\chi_a(x)$, since the functions $\chi_a$, $a \in \{0,1\}^n$, form an orthonormal basis for the real-valued functions over $\{0,1\}^n$. When dealing with a real-valued function $g : \{0,1\}^n \to \mathbb{R}$, the notation $|g|$ denotes $\max\{|g(x)| : x \in \{0,1\}^n\}$.
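As a small worked example (ours, not taken from the paper), consider the two-variable function $f : \{0,1\}^2 \to \{-1,1\}$ with $f(x) = -1$ exactly when $x_1 = x_2 = 1$:

% Worked example (ours): Fourier expansion of a two-variable +/-1-valued function.
\[
  f(x) \;=\; \tfrac{1}{2}
        \;+\; \tfrac{1}{2}\,\chi_{10}(x)
        \;+\; \tfrac{1}{2}\,\chi_{01}(x)
        \;-\; \tfrac{1}{2}\,\chi_{11}(x),
\]
% so \hat{f}(00) = \hat{f}(10) = \hat{f}(01) = 1/2 and \hat{f}(11) = -1/2,
% which can be checked by evaluating both sides on all four inputs.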
Next we define some notation from coding theory [15]. Let $\Sigma$ be a finite alphabet of size $m$. A code $C$ of word length $n$ is a subset of $\Sigma^n$. The distance between two codewords $x, y \in C$ is defined as $d(x, y) = |\{i \in [n] : x_i \neq y_i\}|$ and the minimum distance of $C$ is defined as $d(C) = \min\{d(x, y) \mid x, y \in C, x \neq y\}$. A $q$-ary $(n, d)$-code is a code over an alphabet of size $q$ of word length $n$ and minimum distance $d$. In most cases, the alphabet $\Sigma$ is a finite field $\mathbb{F}_q$ of size $q$ and the code $C$ is a linear subspace of $\mathbb{F}_q^n$; in this case, $C$ is called a $q$-ary linear $[n, k, d]$-code if $C$ has dimension $k$ as a subspace. The generator matrix $G$ of a $q$-ary linear $[n, k, d]$-code $C$ is an $n \times k$ matrix over $\mathbb{F}_q$ whose columns form a basis of $C$; each codeword in $C$ is of the form $Gx$, for some $x \in \mathbb{F}_q^k$. Finally, a code $C$ is called asymptotically good if, as $n \to \infty$, both the rate $k/n$ and $d/n$ are bounded away from zero.
3. Sample size theorems
We make frequent use of the following two theorems to derive sample sizes sufficient to estimate the mean of a random variable to a specified accuracy with a given confidence level.
Lemma 1 (Hoeffding). Let $X_1, X_2, \ldots, X_m$ be independent random variables all with mean $\mu$ such that for all $i$, $a \leq X_i \leq b$. Then for any $\lambda > 0$,
$$\Pr\left[\left|\frac{1}{m}\sum_{i=1}^m X_i - \mu\right| \geq \lambda\right] \leq 2e^{-2\lambda^2 m/(b-a)^2}.$$

It follows that the sample mean of $m = (b - a)^2 \ln(2/\delta)/(2\lambda^2)$ independent random variables having common mean $\mu$ will, with probability at least $1 - \delta$, be within an additive factor of $\lambda$ of $\mu$.
Lemma 2 (Bienaymé–Chebyshev). Let $X_1, X_2, \ldots, X_m$ be pairwise independent random variables all with mean $\mu$ and variance $\sigma^2$. Then for any $\lambda > 0$,
$$\Pr\left[\left|\frac{1}{m}\sum_{i=1}^m X_i - \mu\right| \geq \lambda\right] \leq \frac{\sigma^2}{m\lambda^2}.$$

It follows that the sample mean of $m = \sigma^2/(\delta\lambda^2)$ random variables as described in the lemma will, with probability at least $1 - \delta$, be within an additive factor of $\lambda$ of $\mu$.
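The two sample-size rules above translate directly into numbers; the following minimal sketch (ours, with illustrative function names) computes them:

import math

def hoeffding_sample_size(a, b, lam, delta):
    """Samples sufficient for the mean of independent variables in [a, b]
    to be within lam of the true mean with probability >= 1 - delta."""
    return math.ceil((b - a) ** 2 * math.log(2 / delta) / (2 * lam ** 2))

def chebyshev_sample_size(var, lam, delta):
    """Samples sufficient for the mean of pairwise-independent variables
    with variance var to be within lam of the true mean w.p. >= 1 - delta."""
    return math.ceil(var / (delta * lam ** 2))

# Example: estimating a {-1,+1}-valued mean to within 0.1 with 95% confidence.
print(hoeffding_sample_size(-1, 1, 0.1, 0.05))   # 738
print(chebyshev_sample_size(1.0, 0.1, 0.05))     # 2000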
4. Weak parity learning
While our ultimate goal is to show how to improve the running time and randomization aspects of the Harmonic Sieve algorithm for learning DNF expressions, the core of our speed-up result lies in improving a key procedure of the Sieve. This procedure is used to weakly learn a function $f : \{0,1\}^n \to \{-1,1\}$ using a hypothesis drawn from the class of parity functions $P = \{\chi_a, -\chi_a \mid a \in \{0,1\}^n\}$. We will refer to this as the problem of weak parity learning; it is defined formally below.

In this section we will study the weak parity learning problem and present an algorithm that noticeably improves on the previous best asymptotic time bounds of other algorithms for solving this problem. Specifically, letting $s$ represent the smallest number of terms in any DNF representation of the target $f$, the time bounds for previous algorithms were $\tilde{O}(ns^6)$ [6] and $\tilde{O}(n^2 s^2 + n s^4)$ [10]. The time bound for our new algorithm is $\tilde{O}(ns^2)$.
Further below, we will begin our development of a more efficient weak parity algorithm by reviewing Levin's algorithm [10] for this problem. Levin's algorithm has sample and time complexity bounds dependent on $n^2$ as well as other factors. After describing Levin's algorithm, we will show how to modify it to reduce the bounds to a linear dependence on $n$. Finally, we give a further modification that improves the time bound's dependence on the DNF-size $s$ of the target function.

Before developing these algorithms, we give a formal definition of the problem to be solved and show how it is related to PAC learning, providing a rationale for our name for this problem.
4.1. The weak parity problem
Definition 1 ($\theta$-heavy Fourier coefficient). A Fourier coefficient $\hat{f}(a)$ of a function $f : \{0,1\}^n \to \{-1,1\}$ is said to be $\theta$-heavy if $|\hat{f}(a)| \geq \theta$.
Definition 2 (Weak parity learning). Given a positive real value $\theta$ and a membership oracle $MEM(f)$ for an unknown target function $f : \{0,1\}^n \to \{-1,1\}$, the weak parity learning problem consists of producing a set that is either empty or that contains exactly one index (frequency) of a Fourier coefficient of $f$. In particular, the set must contain the index of a coefficient that is at least $\theta/3$-heavy if $f$ has a $\theta$-heavy coefficient. On the other hand, if all of the coefficients of $f$ are less than $\theta/3$-heavy, then the set must be empty. Finally, if $f$ has no $\theta$-heavy coefficient but does have a coefficient that is at least $\theta/3$-heavy, the set must either contain the index of one such coefficient or be empty.
Definition 3 (Weak parity algorithm). A weak parity algorithm is an algorithm that, given $MEM(f)$ and $\theta$ as above along with a positive real value $\delta$, succeeds with probability at least $1 - \delta$ at solving the weak parity learning problem within a specified run time bound that can depend polynomially on $n$, $\theta^{-1}$, and $\log \delta^{-1}$.
Lemma 3. If $A(MEM(f), \theta, \delta)$ is a weak parity algorithm then there is an algorithm $A'(MEM(f), \delta)$ that learns a $(1/2 - \Omega(1/s))$-approximator to $f$ with respect to the uniform distribution, where $s$ is the least number of terms in any DNF representation of the target function $f$ (the DNF-size of $f$). Furthermore, $A'$ runs in time bounded by $O(\log^2 s)$ times the running time of $A(MEM(f), 1/(2s+1), \delta)$. The hypothesis produced by $A'$ will come from $P$.
In short, given a weak parity algorithm as defined above, we can construct an algorithm that weakly learns using $P$ as the hypothesis class and that has the same run time bound as $A(MEM(f), 1/(2s+1), \delta)$ to within logarithmic factors. Therefore, every weak parity algorithm as defined above is the essence of an algorithm that weakly learns in the standard PAC sense and that outputs a parity function as its hypothesis. Thus, while we develop weak parity algorithms below that accept an arbitrary threshold value $\theta$ as a parameter, our run time bounds for the algorithms will be stated in terms of $s$ by applying the substitution $\theta = 1/O(s)$ and including the $O(\log^2 s)$ factor inherent in our construction of $A'$.
Proof of Lemma 3. First note that every DNF function $f$ has a Fourier coefficient of magnitude at least $1/(2s+1)$ [7]. We will use this fact to show that any weak parity algorithm $A(MEM(f), \theta, \delta)$ can be converted to an algorithm $A'(MEM(f), \delta)$ that with probability at least $1 - \delta$ runs within the stated time bound and produces the index $a$ of a Fourier coefficient such that $|\hat{f}(a)| = \Omega(1/s)$. We will then observe that the $\Omega(1/s)$-heavy Fourier coefficient produced by $A'$ corresponds to a parity function that weakly approximates $f$ with respect to the uniform distribution.

We construct $A'$ from $A$ as follows. Algorithm $A'$ consists of running $A$ repeatedly with different arguments each time, the $i$th time with arguments $A(MEM(f), 2^{1-i}, \delta/2^i)$. That is, $A'$ implements a guess-and-double strategy. The algorithm continues to run instances of $A$ until one of these runs returns a Fourier coefficient. Notice that on run number $\lceil \log_2(2s+1) \rceil + 1$ (if the algorithm calls $A$ this many times), $A$ will be called with its $\theta$ parameter assigned a value at most $1/(2s+1)$ and its $\delta$ parameter assigned a value at most $\delta/(2(2s+1))$. Therefore, this run will with probability at least $1 - \delta/(2(2s+1))$ successfully return an $\Omega(1/s)$-heavy Fourier coefficient and run within time $O(\log s)$ times the time bound on $A(MEM(f), 1/(2s+1), \delta)$, since $A$ by definition has a time bound polynomial in $\log \delta^{-1}$.

Furthermore, since the preceding $O(\log s)$ runs of $A$ use larger parameter values, they will with high probability successfully complete and run within the same time bound as this final run. In fact, our choice of values for the $\delta$ parameters in each run guarantees that all of the runs of $A$, including this final run, will successfully complete within the time bound with probability at least $1 - \delta$. Therefore, if one of these earlier runs was to return an index that presumably corresponded to an $\Omega(1/s)$-heavy Fourier coefficient, with probability at least $1 - \delta$ this returned index would in fact correspond to such a heavy coefficient. Summarizing, this algorithm $A'$ performs as was claimed above.

To see that a $1/s$-heavy Fourier coefficient corresponds to a weakly approximating parity function, we consider the definition of Fourier coefficients. That is, if $a$ is an index such that $|\hat{f}(a)| \geq 1/s$, then by definition we have that $|\mathbf{E}_{x \sim U_n}[f(x)\chi_a(x)]| \geq 1/s$ (here $U_n$ denotes the uniform distribution over $\{0,1\}^n$). Note that, since $f$ and $\chi_a$ are both functions mapping to $\{-1,1\}$, $f(x)\chi_a(x) = 1$ if and only if $f(x) = \chi_a(x)$. Therefore, $\mathbf{E}_{x \sim U_n}[f(x)\chi_a(x)] = 2\Pr_{x \sim U_n}[f(x) = \chi_a(x)] - 1$. So if $|\mathbf{E}_{x \sim U_n}[f(x)\chi_a(x)]| \geq 1/s$ then we will have $\Pr_{x \sim U_n}[f(x) = h(x)] \geq 1/2 + 1/(2s)$ for either $h(x) = \chi_a(x)$ or $h(x) = -\chi_a(x)$. This analysis obviously generalizes for an $\Omega(1/s)$-heavy coefficient. □
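A minimal sketch (ours; `A` stands for any weak parity algorithm with the interface described above) of the guess-and-double wrapper $A'$ used in the proof:

def guess_and_double(A, mem_f, delta):
    # Run A with geometrically shrinking threshold and confidence parameters
    # until some run returns a nonempty set of candidate indices.  Termination
    # relies on the target having a 1/(2s+1)-heavy coefficient, as every DNF
    # of size s does.
    i = 1
    while True:
        result = A(mem_f, theta=2.0 ** (1 - i), delta=delta / 2 ** i)
        if result:            # nonempty set: a candidate heavy index
            return result
        i += 1                # halve theta and the allowed failure probability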

4.2. Weak parity learning using Levin's algorithm


In this section we describe Levin's algorithm [10] for solving a problem closely related to the weak parity learning problem. We then show a straightforward way of converting Levin's algorithm to one that explicitly solves the weak parity learning problem as we have defined it. Levin's algorithm differs from weak parity algorithms as described above in that it returns a short list of Fourier coefficient indices that with high probability contains the index of at least one $\theta$-heavy coefficient, if the target $f$ has such a coefficient. To convert this to a weak parity algorithm we must either extract a single $\theta/3$-heavy coefficient from the list or, if appropriate, return the empty set. In this section, we will use a straightforward but computationally expensive method to search the list produced by Levin's algorithm for a heavy coefficient. Some of our later speed-up will come from taking a more sophisticated approach to this part of the problem.
4.2.1. Levin's algorithm
The specific problem solved by Levin's algorithm is the following: given a membership oracle $MEM(f)$ for some function $f$ and given a positive threshold $\theta$ such that there is at least one $\theta$-heavy coefficient, with probability at least $1/2$ return a list of $O(n/\theta^2)$ Fourier indices that contains at least one $\theta$-heavy coefficient. (The well-known Goldreich–Levin algorithm used in the original Harmonic Sieve solves a similar problem but uses a very different approach.)
We begin our development of Levin's algorithm by assuming for the moment that we have guessed that $\hat{f}(a)$ is a $\theta$-heavy coefficient and that we want to verify our guess. Typically, we might perform this verification by drawing a sample $X$ of $x \in \{0,1\}^n$ uniformly at random and computing $\sum_{x \in X} f(x)\chi_a(x)/|X|$. Applying the Hoeffding bound (Lemma 1), it follows that $|X|$ on the order of $\hat{f}^{-2}(a)$ will give a good estimate of $\hat{f}(a)$ with high probability. However, notice that we do not need a completely uniform distribution to produce a coarse estimate with reasonably high probability. In particular, if we draw the examples $X$ from any pairwise independent distribution, then we can apply Chebyshev bounds (Lemma 2) and get that for $|X| \geq 2n/\hat{f}^2(a)$, with probability at least $1 - 1/(2n)$, $\mathrm{sign}(\sum_{x \in X} f(x)\chi_a(x)) = \mathrm{sign}(\hat{f}(a))$. Thus a polynomial-size sample suffices to find the sign of the coefficient with reasonably high probability, even using a distribution which is only pairwise rather than mutually independent.

One way to generate a pairwise independent distribution is by choosing, for fixed $k > 1$, a random $n$-by-$k$ 0-1 matrix $R$ and forming the set $Y = \{Rp \mid p \in \{0,1\}^k - \{0^k\}\}$ (the arithmetic in the matrix multiplication is performed modulo 2). This set $Y$ is pairwise independent because each $n$-bit vector in $Y$ is a linear combination of random vectors, so knowing any one vector $Rp$ gives no information about what any of the remaining vectors might be, even if $p$ is known. Thus if we take $k = \lceil \log_2(1 + 2n/\hat{f}^2(a)) \rceil$ and form the set $Y$ as above then with probability at least $1 - 1/(2n)$ over the random draw of $R$, $\mathrm{sign}(\sum_{x \in Y} f(x)\chi_a(x)) = \mathrm{sign}(\hat{f}(a))$.
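A minimal sketch (ours, with hypothetical names) of this pairwise-independent sample construction:

import itertools
import numpy as np

def pairwise_independent_sample(n, k, rng=None):
    # Draw a random n-by-k 0/1 matrix R and form Y = { Rp mod 2 : p != 0^k }.
    rng = rng or np.random.default_rng()
    R = rng.integers(0, 2, size=(n, k))
    Y = []
    for p in itertools.product([0, 1], repeat=k):
        if any(p):                      # skip the all-zeros vector
            Y.append(tuple(R.dot(p) % 2))
    return R, Y

R, Y = pairwise_independent_sample(n=8, k=4)
print(len(Y))                           # 2**4 - 1 = 15 pairwise-independent points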
We will actually be interested in functions that vary slightly from $f$. Specifically, for $1 \leq i \leq n$, define $f_i(x) \equiv f(x \oplus e_i)$ and note that
$$\hat{f}_i(a) = \mathbf{E}_x[f_i(x)\chi_a(x)] = \mathbf{E}_x[f(x \oplus e_i)\chi_a(x)] = \mathbf{E}_x[f(x)\chi_a(x \oplus e_i)] = \mathbf{E}_x[f(x)\chi_a(x)\chi_a(e_i)] = (-1)^{a_i}\mathbf{E}_x[f(x)\chi_a(x)] = (-1)^{a_i}\hat{f}(a).$$
(The third equality follows by a change of variables.) Since each of the $\hat{f}_i(a)$ coefficients has the same magnitude as $\hat{f}(a)$, it follows that for any fixed $i$ and for $Y$ and $k$ as above,
$$\mathrm{sign}\left(\sum_{x \in Y} f(x \oplus e_i)\chi_a(x)\right) = (-1)^{a_i}\mathrm{sign}(\hat{f}(a)) \qquad (1)$$
with probability over the uniform random choice of $R$ of at least $1 - 1/(2n)$. Therefore, with probability at least $1/2$, (1) holds simultaneously for all $i$.
Note that instead of summing over $x$ above, we could rewrite this as a sum over $p \in P$, where $P = \{0,1\}^k - \{0^k\}$. Define $f_{R,i}(p) \equiv f(Rp \oplus e_i)$. Then with probability at least $1/2$ over the choice of $R$, for all $i$ we have
$$(-1)^{a_i}\mathrm{sign}(\hat{f}(a)) = \mathrm{sign}\left(\sum_{p \in P} f(Rp \oplus e_i)\chi_a(Rp)\right) = \mathrm{sign}\left(\sum_{p \in P} f_{R,i}(p)(-1)^{a^T R \cdot p}\right) = \mathrm{sign}\left(\sum_{p \in P} f_{R,i}(p)\chi_{a^T R}(p)\right). \qquad (2)$$

Now fix $z \in \{0,1\}^k$ and notice that by the definition of the Fourier transform,
$$\hat{f}_{R,i}(z) = 2^{-k}\sum_{p \in \{0,1\}^k} f_{R,i}(p)\chi_z(p). \qquad (3)$$
If $z = a^T R$ then the sum on the right-hand side of (3) is almost identical to the sum in (2). The only difference is that the $0^k$ vector has been included in (3).
To handle this off-by-one problem, we make $k$ larger than needed for purposes of applying the Chebyshev bound. Specifically, let $k = \lceil \log_2(1 + 2n/\hat{f}^2(a)) \rceil + 1$, so $|Y| = 2^k - 1 \geq 4n/\hat{f}^2(a)$. This $Y$ is sufficiently large so that it is still the case that with probability at least $1/2$, for all $i$,
$$(-1)^{a_i}\mathrm{sign}(\hat{f}(a)) = \mathrm{sign}\left(\sum_{p \in \{0,1\}^k} f_{R,i}(p)\chi_{a^T R}(p)\right).$$
To see this, note that our sample size $|Y|$ is now twice that required by the Chebyshev lemma. Now if the sample is twice as large as required to obtain the correct sign (i.e., twice as large as is required to be within $|\hat{f}(a)|$ of the true mean value), this means that, for fixed $i$, with probability at least $1 - 1/(2n)$,
$$\left|\frac{\sum_{p \in P} f_{R,i}(p)\chi_{a^T R}(p)}{|Y|} - \hat{f}_i(a)\right| \leq \frac{|\hat{f}_i(a)|}{2}. \qquad (4)$$
Also, based on the bound on $|Y|$ given above, for all $n > 0$ and any fixed $i$ we have $1/|Y| < |\hat{f}_i(a)|/4$. It follows that with probability at least $1 - 1/(2n)$ the sign of the sum in (4) will agree with the sign of $\hat{f}_i(a)$ even if an additional term with incorrect sign is added to the sum. Therefore, for $z = a^T R$, with probability at least $1/2$ over the choice of $R$, $(-1)^{a_i}\mathrm{sign}(\hat{f}(a)) = \mathrm{sign}(\hat{f}_{R,i}(z))$ holds for all $i$.
Up until now we have been assuming that the index $a$ of a $\theta$-heavy Fourier coefficient of $f$ is known. We are now ready to show how to use the observations above to, with probability at least $1/2$, find a list of coefficients containing such an $a$.

First, notice that each of the $n$ functions $f_{R,i}$ is a function on $k$ bits, so the entire truth table for each of these functions is of size $2^k \leq 4 + 8n/\hat{f}^2(a)$. This is polynomial in $\hat{f}^{-1}(a)$, where $\hat{f}(a)$ is of magnitude at least $\theta$, since $\hat{f}(a)$ is assumed to be a $\theta$-heavy Fourier coefficient. Therefore, with constant probability we can efficiently produce a list of length $2^k$ containing the index of a $\theta$-heavy Fourier coefficient as follows. First, select $R$ at random and compute exactly the complete Fourier transforms of each of the $n$ functions $f_{R,i}$. If the Fast Fourier Transform algorithm is used (see, e.g., [1]) then each of these transforms can be computed in time $k2^k$. Next, we turn this collection of $n$ Fourier transforms on $k$-bit functions into a two-dimensional $2^k$-by-$n$ table, each column consisting of one of the Fourier transforms. Each row of this table is then converted to an $n$-bit Fourier index by mapping each value in a row to 0 if the value is positive and 1 otherwise. Then, based on our earlier analysis, with probability at least $1/2$ the row corresponding to $a^T R$ contains the index $a$ if $\hat{f}(a) > 0$ and the ones complement of $a$ otherwise.
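The following minimal sketch (ours) illustrates the core step just described, assuming `f` is a {-1,+1}-valued Python callable on length-n 0/1 vectors; the choice of k and the off-by-one handling of Fig. 1 are omitted, and the helper names are illustrative:

import itertools
import numpy as np

def walsh_hadamard(values):
    # Fast Walsh-Hadamard transform of a length-2^k array; dividing by 2^k
    # at the end yields the Fourier coefficients of the k-bit function.
    a = np.array(values, dtype=float)
    h = 1
    while h < len(a):
        for i in range(0, len(a), 2 * h):
            x = a[i:i + h].copy()
            y = a[i + h:i + 2 * h].copy()
            a[i:i + h] = x + y
            a[i + h:i + 2 * h] = x - y
        h *= 2
    return a / len(a)

def levin_candidates(f, n, k, rng=None):
    # Column i of the table holds the Fourier transform of the k-bit function
    # f_{R,i}(p) = f(Rp XOR e_i); each row decodes to one candidate n-bit index.
    rng = rng or np.random.default_rng()
    R = rng.integers(0, 2, size=(n, k))
    points = [np.array(p, dtype=int) for p in itertools.product([0, 1], repeat=k)]
    table = np.zeros((2 ** k, n))
    for i in range(n):
        e_i = np.zeros(n, dtype=int)
        e_i[i] = 1
        vals = [f((R.dot(p) % 2) ^ e_i) for p in points]
        table[:, i] = walsh_hadamard(vals)
    # Positive entries map to bit 0, the rest to bit 1; with probability >= 1/2
    # some row is a heavy index a (or its complement, resolved by later testing).
    return [tuple(int(v <= 0) for v in row) for row in table]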


Fig. 1 shows Levin's algorithm for finding a set containing a $\theta$-heavy Fourier coefficient with probability at least $1/2$. One minor concern addressed in the given implementation of the algorithm is the time required to flip an input bit, i.e., the time required for the operation $x \oplus e_i$. For consistency with our subsequent algorithms, we assume in our presentation of Levin's algorithm that this operation will be performed as part of each membership query, and therefore assume unit time for this operation. This seems a reasonable assumption, as a simple modification to the membership oracle can accommodate this bit-flipping operation with a constant additional time cost per each access to an input bit.

Specifically, the new membership oracle can be thought of as adding an interrupt handler to the original membership oracle. Each time the original oracle requests the value of an input bit, say bit $j$, the interrupt handler fires. If the second (bit number) argument to the new oracle is equal to $j$, then the handler returns to the original oracle the complemented value of bit $j$. Otherwise, it returns unchanged the value of bit $j$.
In summary, Levin's algorithm produces a list of $O(n/\theta^2)$ vectors in time $\tilde{O}(n^2/\theta^2)$, and with probability at least $1/2$ one of these vectors is the index of a $\theta$-heavy Fourier coefficient if such a coefficient exists. We next turn to converting this algorithm to a weak parity algorithm.
4.2.2. Producing a weak parity algorithm from Levin's algorithm
If Levin's algorithm is run $\log_2(2/\delta)$ times then with probability at least $1 - \delta/2$ the union of the sets of vectors returned by the runs contains the index of a $\theta$-heavy coefficient, if the target function has such a coefficient. However, a weak parity algorithm as we have defined it must return a single heavy coefficient (not necessarily fully $\theta$-heavy, but at least $\theta/3$-heavy) if such a coefficient exists, and must return no coefficient if all coefficients are light. In this section we show how to implement a testing phase that post-processes the set produced by Levin's algorithm, resulting in a weak parity algorithm.

Fig. 1. Levin's algorithm.

Fig. 2. Simple testing phase algorithm.

The simple testing strategy applied in this section is illustrated in Fig. 2. The primary input to the algorithm is a set $S$ representing the union of the sets produced by $\log_2(2/\delta)$ independent runs of Levin's algorithm. The testing algorithm first draws uniformly at random a single set $T$ of vectors to be used as input to the target function. Next, it calculates the sample mean of the function $f\chi_a$ for each index $a$ in $S$. The size of $T$ is chosen (by application of the Hoeffding bound) such that with probability at least $1 - \delta$ all of these estimates will be within $\theta/3$ of their true mean values. Therefore, if there is at least one index in $S$ representing a $\theta$-heavy coefficient then with probability at least $1 - \delta$ some such index will pass the test at line 4. Similarly, with the same probability all indices representing Fourier coefficients $b$ such that $|\hat{f}(b)| < \theta/3$ will fail the test. Therefore, the algorithm performs as claimed in the figure. Note, however, that the coefficient returned is only guaranteed to be $\theta/3$-heavy, not $\theta$-heavy as Levin's algorithm guarantees.
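A minimal sketch (ours; the constants and the exact acceptance threshold used in Fig. 2 may differ) of this testing phase:

import numpy as np

def test_candidates(f, n, candidates, theta, delta, rng=None):
    # Hoeffding: |T| = O(log(|S|/delta) / theta^2) keeps every estimate of
    # E[f * chi_a] within theta/3 of its true value w.p. >= 1 - delta.
    rng = rng or np.random.default_rng()
    m = int(np.ceil(18 * np.log(2 * max(len(candidates), 1) / delta) / theta ** 2))
    T = rng.integers(0, 2, size=(m, n))
    fvals = np.array([f(x) for x in T])
    for a in candidates:
        chi = (-1.0) ** (T.dot(np.array(a)) % 2)
        if abs(np.mean(fvals * chi)) >= 2 * theta / 3:
            return [a]          # with high probability a theta/3-heavy index
    return []                   # no heavy coefficient found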
Overall, then, by following multiple calls to Levin's algorithm with this testing step (using $\delta/2$ as the confidence parameter), we have a weak parity algorithm. Because the set $S$ that is input to the testing algorithm will be of size $\tilde{O}(n/\theta^2)$, the overall algorithm runs in time $\tilde{O}(n^2/\theta^2 + n/\theta^4)$. It can also be verified that the sample complexity is $\tilde{O}(n^2/\theta^2)$.

In the next subsection we improve on this algorithm, essentially reducing the $n^2$ terms to linear factors of $n$.
4.3. Improving on Levin's algorithm
The observation that $\hat{f}_i(a) = (-1)^{a_i}\hat{f}(a)$ was crucial to the success of Levin's algorithm. Our basic approach to improving on Levin's algorithm is to extend this observation from the case of a single input bit being flipped to multiple bit flips and to use an algorithm based on multiple bit flips to infer the results we would get from the individual bit flips used in Levin's algorithm. By performing this process multiple times, we can employ a voting strategy that reduces the run time dependence on $n$ in the new algorithm.
As an example of the multiple bit flipping strategy, let $\theta$ be a fixed threshold value and let $a$ be the index of a $\theta$-heavy Fourier coefficient. For any distinct fixed $i, j \in [n]$, the earlier analysis can be extended in a straightforward way to show that
$$\hat{f}(a) = \mathbf{E}_x[f(x \oplus e_{\{i,j\}})\chi_a(x \oplus e_{\{i,j\}})] = (-1)^{a_i}(-1)^{a_j}\mathbf{E}_x[f(x \oplus e_{\{i,j\}})\chi_a(x)].$$
More generally, for any fixed $I \subseteq [n]$, $\hat{f}(a) = \mathbf{E}_x[f(x \oplus e_I)\chi_a(x)]\prod_{i \in I}(-1)^{a_i}$. Therefore, for pairwise independent $Y$ as before we have that for a given $i$ and $j$,
$$\mathrm{sign}\left(\sum_{x \in Y} f(x \oplus e_{\{i,j\}})\chi_a(x)\right) = (-1)^{a_i}(-1)^{a_j}\mathrm{sign}(\hat{f}(a))$$
with probability at least $1 - 1/(2n)$. We also have that
$$\mathrm{sign}\left(\sum_{x \in Y} f(x \oplus e_j)\chi_a(x)\right) = (-1)^{a_j}\mathrm{sign}(\hat{f}(a))$$
with the same probability. Thus with probability $1 - 1/n$ both of these sums have the correct sign and can be used to solve for the value of $a_i$ (the sign of the product of the two sums gives $(-1)^{a_i}$). This gives a way to determine the value of $a_i$ that is distinct from Levin's original method. We can do something similar using $a_k$ rather than $a_j$ for some $k \neq j$, giving us yet another way to compute $a_i$'s value.
If there were sufficient independence between these different ways of arriving at values of $a_i$, then we could use many such calculations and take their majority vote to arrive at a good estimate of the value of $a_i$. The Hoeffding bound would then show that we could tolerate a much smaller probability of success for each of the individual estimates of mean values, specifically constant probability of success rather than $1 - 1/(2n)$. Furthermore, if this probability was not dependent on $n$, then working back through the earlier analysis we see that the variable $k$ used in Levin's algorithm also would not depend on $n$. This in turn would result in an overall reduction in the run time bound's dependence on $n$.
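The following minimal sketch (ours; Fig. 3 performs this computation for all candidate rows at once via the FFT, and the uniform choice of the sets $I$ here is an illustrative assumption) shows the voting principle for a single candidate row $z$ and bit position $i$:

import numpy as np

def vote_bit(f, R, z, i, num_votes, rng=None):
    # For z = a^T R with a heavy, the product of the correlation sums for a
    # random I and for I xor {i} has sign (-1)^{a_i} with constant probability,
    # so a majority vote over several independent I's recovers bit a_i.
    rng = rng or np.random.default_rng()
    n, k = R.shape
    points = [np.array(list(np.binary_repr(j, k)), dtype=int) for j in range(1, 2 ** k)]
    chi_z = [(-1) ** int(z.dot(p) % 2) for p in points]
    votes = 0
    for _ in range(num_votes):
        I = rng.integers(0, 2, size=n)
        Ii = I.copy()
        Ii[i] ^= 1
        s_I = sum(c * f((R.dot(p) % 2) ^ I) for p, c in zip(points, chi_z))
        s_Ii = sum(c * f((R.dot(p) % 2) ^ Ii) for p, c in zip(points, chi_z))
        votes += np.sign(s_I * s_Ii)    # +1 suggests a_i = 0, -1 suggests a_i = 1
    return 0 if votes > 0 else 1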
These observations form the basis for the algorithm shown in Fig. 3. First, notice that for $k = \lceil \log_2(1 + 1/(c\theta^2)) \rceil$ we have that, for $a$ such that $|\hat{f}(a)| > \theta$, with probability at least $1 - c$ over the choice of $R$,
$$\mathrm{sign}\left(\sum_{x \in Y} f(x)\chi_a(x)\right) = \mathrm{sign}(\hat{f}(a)).$$
Furthermore, for any fixed $I \subseteq [n]$, we can similarly apply Chebyshev to $f(x \oplus e_I)$ and get that with the same probability $1 - c$ over uniform random choice of $R$,
$$\mathrm{sign}\left(\sum_{x \in Y} f(x \oplus e_I)\chi_a(x)\right) = \mathrm{sign}(\hat{f}(a))\prod_{j \in I}(-1)^{a_j}. \qquad (5)$$
Fig. 3. Improved Levin algorithm.

This means that the expected fraction of $I$'s that fail to satisfy (5) for uniform random choice of $R$ is at most $c$. Therefore, by Markov's inequality, the probability of choosing an $R$ such that a $2c$ or greater fraction of the $I$'s fail to satisfy (5) is at most $1/2$. We will call such an $R$ bad for $a$ and all other $R$'s good for $a$. Furthermore, for any fixed $i \in [n]$ and for any $R$ that is good for $a$, at most a $2c$ fraction of the $I$'s fail to satisfy the following equality:
$$\mathrm{sign}\left(\sum_{x \in Y} f(x \oplus e_I \oplus e_i)\chi_a(x)\right) = \mathrm{sign}(\hat{f}(a))\prod_{j \in I \oplus \{i\}}(-1)^{a_j}. \qquad (6)$$

This follows because for any fixed $R$ there is a one-to-one correspondence between the $I$'s that fail to satisfy this equality and those that fail to satisfy (5). Therefore, combining (5) and (6), for fixed $i$, the probability over random choice of $R$ is also at most $1/2$ that for a greater than $4c$ fraction of the $I$'s, either of the following conditions holds (each condition is a conjunction of two relational expressions):
$$\mathrm{sign}\left(\sum_{x \in Y} f(x \oplus e_I)\chi_a(x)\right) \neq \mathrm{sign}(\hat{f}(a))\prod_{j \in I}(-1)^{a_j} \quad\text{and}\quad \mathrm{sign}\left(\sum_{x \in Y} f(x \oplus e_I \oplus e_i)\chi_a(x)\right) = \mathrm{sign}(\hat{f}(a))\prod_{j \in I \oplus \{i\}}(-1)^{a_j},$$
or
$$\mathrm{sign}\left(\sum_{x \in Y} f(x \oplus e_I)\chi_a(x)\right) = \mathrm{sign}(\hat{f}(a))\prod_{j \in I}(-1)^{a_j} \quad\text{and}\quad \mathrm{sign}\left(\sum_{x \in Y} f(x \oplus e_I \oplus e_i)\chi_a(x)\right) \neq \mathrm{sign}(\hat{f}(a))\prod_{j \in I \oplus \{i\}}(-1)^{a_j}.$$

This in turn implies that for fixed $i$,
$$\Pr_R\left[\Pr_I\left[\mathrm{sign}\left(\sum_{x \in Y} f(x \oplus e_I)\chi_a(x) \cdot \sum_{x \in Y} f(x \oplus e_I \oplus e_i)\chi_a(x)\right) \neq (-1)^{a_i}\right] \geq 4c\right] \leq \frac{1}{2}.$$
So, for fixed $i$ and $R$'s that are good for $a$, a random choice of $I$ has probability at least $1 - 4c$ of giving the correct sign for $a_i$ and therefore probability at most $4c$ of giving the incorrect sign. If the correct sign is $1$, then for a good $R$ and any fixed $i$,
$$\mathbf{E}_I\left[\mathrm{sign}\left(\sum_{x \in Y} f(x \oplus e_I)\chi_a(x) \cdot \sum_{x \in Y} f(x \oplus e_I \oplus e_i)\chi_a(x)\right)\right] \geq 1 - 8c.$$
Similarly, a correct sign of $-1$ gives expected value bounded by $-1 + 8c$. So by Hoeffding, if we estimate this expected value by taking a sum over $t$ randomly chosen $I$'s, for
$$t \geq \frac{2\ln(4n)}{(1 - 8c)^2}, \qquad (7)$$
then the sign of the estimate will be $(-1)^{a_i}$ with probability at least $1 - 1/(2n)$. Therefore, for any $R$ good for $a$, with probability at least $1/2$ the sign estimates of $(-1)^{a_i}$ will be correct simultaneously for all $n$ possible values of $i$. In this case, we will discover all $n$ bits of the index $a$ of the $\theta$-heavy Fourier coefficient $\hat{f}(a)$. Since we have probability at least $1/2$ of choosing a good $R$, overall we succeed at finding $a$ with probability at least $1/4$ over the random choice of $R$ and of the $t$ values of $I$.
One final detail that must be addressed is that while the above analysis is in terms of sums of the form $\sum_{x \in Y}$, the implicit sums computed by the algorithm, using the FFT, effectively include the bit-vector $x = 0^n$. As with the analysis of Levin's original algorithm, we can overcome this problem by using a somewhat larger $k$ value than the above analysis would indicate. It is easy to verify that the $k$ given in Fig. 3 is sufficient to overcome this off-by-one problem.
This algorithm has sample and time complexity $\tilde{O}(n/\theta^2)$. However, as before, this algorithm is not by itself a weak parity algorithm. To find a single Fourier coefficient that is $\theta/3$-heavy again requires testing possibly each of the $2^k = O(\theta^{-2})$ coefficients returned by the improved Levin's algorithm, and the simple test phase as given in Fig. 2 takes time $\tilde{O}(\theta^{-2})$ per coefficient. So the time complexity of the resulting weak parity algorithm is now $\tilde{O}(n/\theta^2 + \theta^{-4})$. While an improvement over the bound for the previous weak parity algorithm, this is still not as good as the sample complexity bound, which can be seen to be $\tilde{O}(n/\theta^2)$. It would of course be nice to get a similar bound on the time complexity, and in fact this is what we do next.


4.4. Improving the testing phase


In this section, we will show how to further modify Levin's algorithm so that for a given threshold $\theta$ it produces a list containing all $\theta$-heavy coefficients of the target $f$ and, importantly, only $\theta/3$-heavy coefficients, still in time $\tilde{O}(n/\theta^2)$. To convert this to a weak parity algorithm is then simply a matter of choosing one of the elements of the list arbitrarily, so the testing phase is no longer needed. Thus the resulting weak parity algorithm has an overall time bound of $\tilde{O}(n/\theta^2)$.
The basic idea is that we will now tighten up our estimates of $\mathbf{E}[f(x \oplus e_I)\chi_a(x)] = \prod_{j \in I}(-1)^{a_j}\hat{f}(a)$. Earlier, we were only interested in getting the sign of our estimates of this expectation correct. Now, we will want to obtain estimates of this expectation that are (with high probability) within $\theta/3$ of the true value. Because for each choice of $R$, $\hat{f}_{R,I}(a^T R)$ is (approximately) an estimate of $\mathbf{E}[f(x \oplus e_I)\chi_a(x)]$, we can then decide whether or not a given $z$ in the algorithm of Fig. 3 corresponds to a heavy $a$: loosely, if $|\hat{f}_{R,I}(z)| < 2\theta/3$ then $z$ does not correspond to a $\theta$-heavy $a$, and we will eliminate the corresponding $a(z)$ from the output list of coefficients. Otherwise, the corresponding $a(z)$ is $\theta/3$-heavy, and we leave it in the list.
Actually, what we have just outlined will not quite work, because we cannot afford to estimate the expectation for each value of $I$ to within $\theta/3$ and still reach our desired time bound. Instead, because we are not really interested in the values of the expectation for individual $I$'s but are merely using various $I$'s to obtain an estimate of the magnitude of $\hat{f}(a)$, we can once again use a voting scheme. An argument similar to the earlier one will then show that this voting scheme succeeds within the required time bounds.

Finally, we must also eliminate from the final list of coefficients any stray coefficients produced by bad choices of $R$ or of a set of $I$'s. That is, it is quite possible that particular choices of $R$ or of a set of $I$'s could result in our tests failing to reject a coefficient that is less than $\theta/3$-heavy. However, it is much less likely that this same non-heavy coefficient will consistently appear to be $\theta/3$-heavy for a number of independently chosen $R$'s and sets of $I$'s. Therefore, we can once again employ voting to eliminate these non-heavy coefficients from the final output list.
Formalizing these ideas, we arrive at the algorithm of Fig. 4. In what follows we will outline the correctness of this algorithm. First consider two coefficients $a$ and $b$ where $a$ is $\theta$-heavy and $b$ is $\theta$-light, that is, $|\hat{f}(b)| < \theta/3$ (notice the 3 in this definition). Choose $k = \lceil \log_2(1 + 1/(c\theta^2)) \rceil$ for some constant $c$, let $R$ be a uniform random $n$-by-$k$ matrix as before, and also as before take $Y = \{Rp \mid p \in \{0,1\}^k - \{0^k\}\}$. Then define
$$\Delta_{R,I,d} \equiv \left|\frac{1}{2^k - 1}\sum_{x \in Y} f(x \oplus e_I)\chi_d(x) - \hat{f}(d)\chi_d(e_I)\right|,$$
where $d$ is an $n$-bit vector. Chebyshev's inequality then gives that for fixed $I$ and any fixed $d$,
$$\Pr_R[\Delta_{R,I,d} \geq \theta/3] \leq 9c.$$
By a Markov argument as before, this implies that for fixed $d$ and any constant $c_1 > 1$,
$$\Pr_R\left[\Pr_I[\Delta_{R,I,d} \geq \theta/3] \geq 9c_1 c\right] \leq \frac{1}{c_1}. \qquad (8)$$

Fig. 4. An algorithm incorporating the magnitude test.

Similarly, for any constant $c_2 > 1$,
$$\Pr_R\left[\Pr_I[\Delta_{R,I,d} \geq \theta] \geq c_2 c\right] \leq \frac{1}{c_2}. \qquad (9)$$
For fixed $d$ and $\theta$, we call $R$ magnitude good for $d$ if both $\Pr_I[\Delta_{R,I,d} \geq \theta/3] \leq 9c_1 c$ and $\Pr_I[\Delta_{R,I,d} \geq \theta] \leq c_2 c$. Note that the probability of drawing a magnitude good $R$ for any fixed $d$ is at least $1 - (1/c_1 + 1/c_2)$ by inequalities (8) and (9).
Next, we define a voting procedure, which we would like to produce $1$ if its $k$-bit input $z$ corresponds to a $\theta$-heavy coefficient and $-1$ if $z$ corresponds to a $\theta$-light coefficient. We will see below that the following procedure frequently (over choices of $R$ and $I$) performs in just this way:
$$V_{R,I}(z) = \begin{cases} 1 & \text{if } |\hat{f}_{R,I}(z)| \geq \dfrac{2(2^k - 1)\theta/3 + \mathrm{sign}(\hat{f}_{R,I}(z))\,f(e_I)}{2^k}, \\ -1 & \text{otherwise.} \end{cases} \qquad (10)$$

Definition 4. Let $\mathcal{R}$ be the multiset of $R$'s generated on line 10 by a single run of the algorithm. Let $d \in \{0,1\}^n$ be a $\theta$-heavy (resp. $\theta$-light) coefficient, let $R \in \mathcal{R}$ be magnitude good for $d$, and let $z_d \equiv d^T R$. Then the pair $(z_d, R)$ is called a $(d, R)$-decisive pair. For a $\theta$-heavy (resp. $\theta$-light) coefficient $d$, $Z_d$ represents the set of all $(d, R)$-decisive pairs.
It is convenient to define a partial function $s : \{0,1\}^n \to \{-1,1\}$ that produces $1$ if its input is the index of a $\theta$-heavy coefficient, $-1$ if its input corresponds to a $\theta$-light coefficient, and is otherwise undefined. Then if $d$ represents a $\theta$-heavy or $\theta$-light coefficient, we will say that a $(d, R)$-decisive pair votes correctly given $I$ if $V_{R,I}(z_d) = s(d)$. We will next show that every $(d, R)$-decisive pair will with high probability vote correctly given random $I$.
Lemma 4. If $(z_d, R)$ is a $(d, R)$-decisive pair then $s(d)\,\mathbf{E}_I[V_{R,I}(z_d)] \geq 1 - 18c_1 c$.
Proof. We prove the lemma for $d$ $\theta$-heavy, which we represent by the symbol $a$. The proof for $\theta$-light $b$ consists of showing that $\mathbf{E}_I[V_{R,I}(z_b)] \leq -1 + 18c_1 c$, which is very similar and is omitted.

Recall that
$$\hat{f}_{R,I}(z) = \frac{1}{2^k}\sum_{p \in \{0,1\}^k} f_{R,I}(p)\chi_z(p).$$
Also, for any $k$-bit $z$, $f_{R,I}(0^k)\chi_z(0^k) = f(e_I)$. So
$$2^k|\hat{f}_{R,I}(z_a)| - \mathrm{sign}(\hat{f}_{R,I}(z_a))f(e_I) = \mathrm{sign}(\hat{f}_{R,I}(z_a))\sum_{p \in \{0,1\}^k - \{0^k\}} f_{R,I}(p)\chi_{z_a}(p) = \mathrm{sign}(\hat{f}_{R,I}(z_a))\sum_{x \in Y} f(x \oplus e_I)\chi_a(x).$$
Therefore, $V_{R,I}(z_a)$ will be $1$ if and only if $\mathrm{sign}(\hat{f}_{R,I}(z_a))\sum_{x \in Y} f(x \oplus e_I)\chi_a(x)/(2^k - 1) \geq 2\theta/3$. Furthermore, since $0 < 1/(2^k - 1) < 2\theta/3$ for $k$ and $c$ as defined in Fig. 4, and because also $2^k|\hat{f}_{R,I}(z_a)| - \mathrm{sign}(\hat{f}_{R,I}(z_a))f(e_I) \geq -1$ (since $f \in \{-1,1\}$), it follows that
$$\left|\sum_{x \in Y} f(x \oplus e_I)\chi_a(x)\right|\Big/(2^k - 1) \geq \frac{2\theta}{3} \;\Rightarrow\; \left|\sum_{x \in Y} f(x \oplus e_I)\chi_a(x)\right| > 1 \;\Rightarrow\; \left|2^k|\hat{f}_{R,I}(z_a)| - \mathrm{sign}(\hat{f}_{R,I}(z_a))f(e_I)\right| > 1 \;\Rightarrow\; 2^k|\hat{f}_{R,I}(z_a)| - \mathrm{sign}(\hat{f}_{R,I}(z_a))f(e_I) > 1 \;\Rightarrow\; \mathrm{sign}(\hat{f}_{R,I}(z_a))\sum_{x \in Y} f(x \oplus e_I)\chi_a(x) > 0 \;\Rightarrow\; \mathrm{sign}(\hat{f}_{R,I}(z_a)) = \mathrm{sign}\left(\sum_{x \in Y} f(x \oplus e_I)\chi_a(x)\right)$$
and therefore
$$\left|\sum_{x \in Y} f(x \oplus e_I)\chi_a(x)\right|\Big/(2^k - 1) \geq \frac{2\theta}{3} \;\Rightarrow\; \mathrm{sign}(\hat{f}_{R,I}(z_a))\sum_{x \in Y} f(x \oplus e_I)\chi_a(x)\Big/(2^k - 1) \geq \frac{2\theta}{3}.$$
The converse is also clearly true. Therefore,
$$V_{R,I}(z_a) = 1 \iff \left|\sum_{x \in Y} f(x \oplus e_I)\chi_a(x)\right|\Big/(2^k - 1) \geq \frac{2\theta}{3}.$$
Now for $\theta$-heavy $a$ and any $R$ and $I$ such that $\Delta_{R,I,a} < \theta/3$, it follows that $|\sum_{x \in Y} f(x \oplus e_I)\chi_a(x)/(2^k - 1)| \geq 2\theta/3$. Also, by definition, if $R$ is magnitude good for $a$ then $\Pr_I[\Delta_{R,I,a} \geq \theta/3] < 9c_1 c$. Therefore, for $a$ $\theta$-heavy and $R$ magnitude good for $a$, the probability over random choice of $I$ of a $1$ vote is at least $1 - 9c_1 c$ and the probability of a $-1$ vote is at most $9c_1 c$, establishing the claim that for such an $a$ and $R$, $\mathbf{E}_I[V_{R,I}(z_a)] \geq 1 - 18c_1 c$. □
Now assume that for fixed $R$ and fixed $0 < \theta < 1$ and $0 < \delta < 1$, a multiset $M$ of $m \geq 2\ln(2/\delta_1)/(1 - 18c_1 c)^2$ $I$'s is drawn uniformly at random. Then by the Hoeffding bound, any $(d, R)$-decisive pair will with probability at least $1 - \delta_1$ vote correctly given $I$ for a majority of the $I$'s in $M$. We say in this case that the $(d, R)$ pair votes correctly over $M$. Symbolically, if $(z_d, R)$ is a $(d, R)$-decisive pair and $M$ is a randomly selected multiset of at least $m$ $I$'s,
$$\Pr_M\left[\mathrm{sign}\left(\sum_{I \in M} V_{R,I}(z_d)\right) = s(d)\right] \geq 1 - \delta_1.$$
Now note that an $R$ which is magnitude good is also good in essentially the sense used in the previous section. That is, for any $R$ that is magnitude good for $\theta$-heavy $a$, with probability at least $1 - c_2 c$ over random choice of $I$,
$$\mathrm{sign}\left(\sum_{x \in Y} f(x \oplus e_I)\chi_a(x)\right) = \mathrm{sign}(\hat{f}(a))\,\chi_a(e_I).$$


Thus, applying the analysis of the previous section, it is easily seen that
$$\mathbf{E}_{I \in T}\left[\mathrm{sign}\left(\sum_{x \in Y} f(x \oplus e_I)\chi_a(x) \cdot \sum_{x \in Y} f(x \oplus e_I \oplus e_i)\chi_a(x)\right)\right] \geq 1 - 4c_2 c.$$
So by Hoeffding, if $T$ is a randomly chosen set of at least $t$ $I$'s (for $t$ as shown in Fig. 4) and if $a$ is $\theta$-heavy, if the $R$ chosen at line 10 is magnitude good for $a$, and if $(R, a)$ votes correctly over $T$, then with probability at least $1 - \delta_2$ over the choice of the set $T$, $a$ will be added to the candidate list $L$ of coefficients by an execution of the statements at lines 11 through 28 of the algorithm.
Summarizing, assume that we choose a single set (call it $T$ for consistency with the algorithm of the previous section) that contains at least $\max(m, t)$ randomly chosen $I$'s. Then a $\theta$-heavy coefficient $a$ will be added to the candidate list $L$ when $R$ is magnitude good for $a$ and both the $V(z_a) > 0$ test at line 24 and the decoding of $a$ at line 25 succeed. For $T$ chosen as above and random $R$, this occurs with probability at least $(1 - 1/c_1 - 1/c_2)(1 - \delta_1 - \delta_2)$. On the other hand, a $\theta$-light coefficient $b$ will be added to $L$ only if $R$ is not magnitude good for $b$ or if it is magnitude good but the test $V(z_b) > 0$ incorrectly succeeds. In fact, if $R$ is not fully magnitude good but satisfies only $\Pr_I[\Delta_{R,I,b} \geq \theta/3] < 9c_1 c$ (which is true with probability at least $1 - 1/c_1$ over choice of $R$), then $b$ will be added to $L$ with probability at most $\delta_1$. Therefore, a $\theta$-light $b$ is added to $L$ with probability at most $\delta_1 + 1/c_1$ over the choice of $T$ and $R$. Note also that the test at line 26 precludes a $\theta$-light coefficient from being added to the list more than once for a given choice of $R$ and $T$.
Therefore, in order to detect the presence of $\theta$-light $b$'s in the candidate list, we will randomly choose multiple $R$'s and $T$'s, use each pair to add a set of coefficients to $L$, and then consider the sample frequency of each coefficient $d$ appearing in $L$. Since $R$ and $T$ are chosen independently each time, for a given $d$ each pass of the algorithm effectively produces an independent sample of a random $\{0,1\}$-valued variable having a mean value that is the probability that $d$ is added to $L$ over random choice of $R$ and $T$. If $c_1$, $c_2$, $\delta_1$, and $\delta_2$ are chosen appropriately, there will be a significant gap between the $L$-inclusion probability for any $\theta$-heavy coefficient $a$ and any $\theta$-light coefficient $b$. Thus once again Hoeffding can be used to choose an appropriately large sample of size $\ell$ that can with high probability be used to differentiate between all $\theta$-light coefficients in $L$ and all of the $\theta$-heavy coefficients.

Specifically, the gap $G$ between the probability of occurrence of $\theta$-heavy and $\theta$-light coefficients will be at least $(1 - \delta_1 - \delta_2)(1 - 1/c_1 - 1/c_2) - (\delta_1 + 1/c_1)$. So we might choose the parameter $\lambda$ for the Hoeffding bound to be $G/2$. We would then like to apply the Hoeffding bound to select a random sample of size $\ell$ sufficiently large so that with probability at least $1 - \delta$ the probability of occurrence for all coefficients is estimated to within $\lambda$. Since there are at most $3\ell/(c\theta^2)$ coefficients in $L$, the union bound indicates that an $\ell$ of at least $\ln(6\ell/(c\delta\theta^2))/(2\lambda^2)$ would be sufficient.
However, there is a potential flaw in this approach having to do with the choice of $\lambda$: we do not know a priori which coefficients will appear in $L$. In particular, we will be estimating the probability of occurrence for those $\theta$-light coefficients that appear at least once in $L$, so these $\theta$-light coefficients will have sample mean at least $1/\ell$ even if the true mean is extremely small or even $0$. In effect, the sample mean for $\theta$-light coefficients is biased by at most an additive $1/\ell$ factor. Notice also that for $\ell$ at least $\ln(6\ell/(c\delta\theta^2))/(2\lambda^2)$ and given the constraints on $c$ (see lines 1 and 2 of the algorithm), $1/\ell \leq \lambda^2/2$. Thus if we choose $\lambda$ such that $\lambda^2/2 + 2\lambda$ is equal to $G$ then, by the Hoeffding bound, removing all coefficients that occur fewer than $\ell[(1 - \delta_1 - \delta_2)(1 - 1/c_1 - 1/c_2) - \lambda]$ times in $L$ removes all of the $\theta$-light coefficients and none of the $\theta$-heavy ones, with probability at least $1 - \delta$.
Finally, it can be verified that there are constant values $c_1$, $c_2$, $\delta_1$, and $\delta_2$ satisfying the given constraints (line 1). For example, we can choose $c_1 = 4$, $c_2 = 18$, $\delta_1 = 1/61$, $\delta_2 = 1/25$. With these constants fixed, $G = 7/18$, and all constraints are satisfied. Ideally, given $\theta$ and $\delta$, the constant values (including $c$) should be chosen to minimize $\ell \max(m, t)\, n 2^k k$, which is roughly the number of operations performed by the innermost loop of the algorithm.
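As a quick numeric check (ours, assuming the gap expression given above), the example constants do yield $G = 7/18$ exactly:

from fractions import Fraction as F

c1, c2, d1, d2 = F(4), F(18), F(1, 61), F(1, 25)
G = (1 - d1 - d2) * (1 - 1 / c1 - 1 / c2) - (d1 + 1 / c1)
print(G, G == F(7, 18))   # 7/18 True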
5. Learning DNF
In this section, we will plug the improved weak parity algorithm developed above into a version of the Harmonic Sieve, an algorithm for learning DNF with respect to the uniform distribution [7]. Before we can do this, we need to generalize the weak parity problem to a non-Boolean setting. The Sieve can then make use of the resulting generalized weak parity algorithm in order to more efficiently learn DNF.
5.1. Solving the non-Boolean weak parity problem
Earlier, we introduced the weak parity learning problem and provided several algorithms for solving it. Until now, it was assumed that the target function $f$ was Boolean. However, the definitions of $\theta$-heavy, weak parity learning, and weak parity algorithm immediately generalize to the case in which the target $g$ is real-valued. We will argue here that the weak parity algorithms developed in the previous section can also be generalized to solve the weak parity problem for non-Boolean target $g$. (Recall that we define $|g| = \max\{|g(x)| : x \in \{0,1\}^n\}$.)
Lemma 5. Let $g : \{0,1\}^n \to \mathbb{R}$ be any real-valued function such that $|g| \geq 1$ and let $\theta > 0$ be any value such that there is at least one Fourier coefficient $\hat{g}(a)$ such that $|\hat{g}(a)| \geq \theta$. Let $B$ be such that $|g| \leq B$ and define the function $h = g/B$ and the threshold $\theta' = \theta/B$. Then if $MEM(h)$, $\theta'$, and (for the third algorithm) $\delta > 0$ are input, then:

* Levin's algorithm runs in time $\tilde{O}(n^2 B^2/\theta^2)$ and with probability at least $1/2$ returns a set containing an $n$-bit vector $a$ such that $|\hat{g}(a)| \geq \theta$.
* The improved Levin's algorithm runs in time $\tilde{O}(nB^2/\theta^2)$ and with probability at least $1/4$ returns a set containing an $n$-bit vector $a$ such that $|\hat{g}(a)| \geq \theta$.
* Levin's algorithm with magnitude testing runs in time $\tilde{O}(nB^2/\theta^2)$ and with probability at least $1 - \delta$ returns a set containing all $n$-bit vectors $a$ such that $|\hat{g}(a)| \geq \theta$ and no $n$-bit vectors $b$ such that $|\hat{g}(b)| < \theta/3$.
Proof. It is easily seen that all of the proofs obtained thus far for Boolean $f$ also apply to $h$. In particular, note that in applying the Chebyshev bound with Boolean $f$ we used only the inequality $\sigma^2 \leq \mathbf{E}[f^2] \leq 1$, and that this inequality holds for $h$ as well. Also, all of the earlier Fourier-based relationships for $f$ also hold for $h$, as they did not rely on the magnitude of $f$.


Therefore, if we run any of the original algorithms on an oracle for $h = g/B$ using a threshold $\theta' = \theta/B$ then the algorithm will run in the given time bound and with the given probability produce a set of coefficients containing one or more $\theta'$-heavy coefficients for $h$.

Next, it is easily verified that the Fourier transform is linear, and therefore that for all $a \in \{0,1\}^n$, $\hat{h}(a) = \hat{g}(a)/B$. Thus, a coefficient is $\theta'$-heavy for $h$ if and only if it is $\theta$-heavy for $g$. □
0

Therefore, given a membership oracle for $g$ and a bound on its magnitude, it is a simple matter to simulate an oracle for $h$ and use the earlier algorithms to solve the weak parity problem for $g$. We now employ this to learn DNF.
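A minimal sketch (ours; `weak_parity` is a hypothetical name standing for any of the three routines above) of this reduction:

def scaled_oracle(mem_g, B):
    # Wrap a membership oracle for a real-valued g with |g(x)| <= B as an
    # oracle for h = g / B; call the Boolean-case routine with theta' = theta/B.
    return lambda x: mem_g(x) / B

# e.g. candidates = weak_parity(scaled_oracle(mem_g, B), theta / B, delta)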
5.2. Weak parity learning and DNF
The original Harmonic Sieve uses a non-Boolean version of a weak parity algorithm due to Goldreich and Levin [6] as the basis for a weak parity learning algorithm (using a construction similar to the one of Lemma 3) in a boosting-based algorithm for learning DNF. The generalized Goldreich–Levin algorithm runs in time $\tilde{O}(n|g|^6/\theta^6)$ [7] vs. $\tilde{O}(n|g|^2/\theta^2)$ for the final version of Levin's algorithm developed above. In the original Sieve, the weak parity algorithm is called with a simulated membership oracle $MEM(g)$ such that $|g| = O(1/\epsilon^{2+\alpha})$ for arbitrarily small constant $\alpha$, and $\theta = O(1/s)$, where $s$ is the minimum number of terms in any DNF representation of the target function. Thus in terms of the PAC parameters $\epsilon$ and $s$, the weak parity algorithm runs in time roughly $\tilde{O}(ns^2/\epsilon^4)$.
The time to simulate $MEM(g)$ is bounded by the maximal time to simulate any weak hypothesis times the number of boosting rounds, since the definition of $g$ depends primarily on the definition of a boosting distribution $D_i$, which in turn depends primarily on counting the number of weak hypotheses that correctly label an instance. The boosting algorithm used by the original Sieve uses $\tilde{O}(s^2)$ rounds, and the time to evaluate a weak hypothesis is just the number of bits in a parity function. When learning DNF with respect to uniform we can assume that there are no terms larger than $O(\log(s/\epsilon))$ [14]. This in turn implies that for some constant $k_t$ we can assume that there will be a weak-approximating parity function with at most $k_t \log(s/\epsilon)$ relevant variables, since a parity over a subset of variables in some term of the DNF is a weak approximator [7]. And it is easy to see that we can modify the weak learning algorithm to output only such parity functions, without any impact on the time bound of the weak learner. Therefore, the overall time to simulate $MEM(g)$ for any $g$ considered by the Sieve is $\tilde{O}(s^2)$, and the overall time for weakly learning $g$ from a membership oracle for the true target $f$ is roughly $\tilde{O}(ns^4/\epsilon^4)$.

Since the weak parity algorithm dominates the time of the inner loop of the Sieve algorithm, and the boosting loop is executed $\tilde{O}(s^2)$ times, replacing the original weak parity algorithm with the new one reduces the time bound on the Sieve from roughly $\tilde{O}(ns^{10}/\epsilon^{12})$ to $\tilde{O}(ns^6/\epsilon^4)$ (the extra $s^2$ factor in the original bound here vs. Jackson's published bound [7] reflects his overlooking the time required to simulate $MEM(g)$).
As Klivans and Servedio have pointed out [8], replacing the original Sieve's boosting algorithm with an alternative boosting algorithm can produce further improvement. In particular, one of Freund's boosting algorithms [4] (called $B_{Comb}$ by Klivans and Servedio) will call the weak parity algorithm with $|g|$ bounded by $\tilde{O}(1/\epsilon)$ and $\theta$ bounded as above, yet still runs for only $\tilde{O}(s^2)$ boosting stages. This brings the time bound for the overall algorithm down to $\tilde{O}(ns^6/\epsilon^2)$, but at the expense of a somewhat more complex hypothesis than the one produced by the original Sieve. The original algorithm produces a threshold of parity functions, while the modified algorithm will produce a threshold of thresholds of parities. In fact, the top-level threshold in the hypothesis produced using $B_{Comb}$ may also have random variables as inputs, so this hypothesis is not necessarily even deterministic.
5.3. Sample complexity
The final non-Boolean weak parity algorithm has sample complexity $\tilde{O}(nB^2/\theta^2)$, where $B$ is an upper bound on $|g|$. Note that in terms of its sampling of the target function $g$, this algorithm is oblivious in the sense that the algorithm makes membership queries on $MEM(g)$ without considering the target function itself: the queries are dictated by the random choices of the matrices $R$ and sets $T$ of $n$-bit vectors. Furthermore, the Harmonic Sieve simulates the membership oracle for $g$ directly from the membership oracle for the DNF target $f$. That is, for any $x$, the only query to $MEM(f)$ needed to compute $MEM(g)(x)$ is $MEM(f)(x)$. Also, the boosting algorithm does not require that its $k_b=\tilde{O}(s^2)$ executions of the weak parity algorithm have independent probabilities of failure. Instead, it is enough to have each execution fail with probability at most $\delta/k_b$, and then the union bound (which does not require independence) will guarantee an overall probability of success of at least $1-\delta$.

So by drawing a single multiset $M$ of $R$'s and $T$'s sufficient to ensure that one execution of the weak parity algorithm succeeds with probability at least $1-\delta/k_b$, $M$ plus the single set $Q$ of membership queries to $f$ dictated by $M$ can be used for all executions of the weak parity algorithm, and the overall Sieve will still succeed with probability at least $1-\delta$. Since the size of the sample required for a single execution of the weak parity algorithm depends logarithmically on its $\delta$ parameter, the size of $Q$ will still be $\tilde{O}(nB^2/\theta^2)$. In the context of the Harmonic Sieve using $B_{Comb}$, this corresponds to a sample size of $\tilde{O}(ns^2/\epsilon^2)$.
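The reuse idea can be pictured as follows; this is a purely schematic Python sketch of our own, with `draw_random_choices`, `queries_from`, and `weak_parity_step` standing in for machinery defined elsewhere in the algorithm, not actual routines from the paper.

```python
def sieve_with_shared_sample(mem_f, k_b, draw_random_choices, queries_from, weak_parity_step):
    """Schematic only: draw the random (R, T) choices once, query f once on the
    induced query set Q, and reuse the cached answers in all k_b boosting stages."""
    M = draw_random_choices()              # sized so one stage fails w.p. at most delta / k_b
    Q = queries_from(M)                    # the single set of membership queries to f
    cache = {x: mem_f(x) for x in Q}
    mem_f_cached = lambda x: cache[x]      # MEM(f) answered from the cache only

    weak_hypotheses = []
    for stage in range(k_b):
        # MEM(g) for this stage's g is simulated from the cached MEM(f) answers alone,
        # so no fresh queries to f are needed; the union bound over the k_b stages
        # keeps the overall failure probability at most delta.
        weak_hypotheses.append(weak_parity_step(mem_f_cached, M, stage))
    return weak_hypotheses
```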

5.4. Attribute efficient learning


As has been noted by others (see [3] and the references therein), the relevant variables of a target function can, in many learning models, be located relatively easily when a membership oracle is available. We briefly outline here an argument showing that any algorithm using membership queries to learn a projection-closed function class (i.e., a class closed under partial assignment) with respect to the uniform distribution can be made attribute efficient. Our algorithm for finding relevant variables quickly is based on an idea noted by Angluin et al. [2]. The analysis in this section, when applied to the algorithm given earlier for learning DNF, implies that the factor of $n$ in the above soft-O bounds can effectively be replaced by a factor of $r$.

The idea is that we will incrementally build up a set $V$ containing relevant variables. For each candidate $V$ we will use random sampling to attempt to verify that a projection of the target function $f$ on $V$ is, with high probability, very consistent with $f$ over the entire $n$-dimensional space. Once we find a $V$ that passes this test, we can run the learning algorithm on a simulated membership oracle for the projection of $f$ onto $V$, since this projection will be a DNF expression of size no more than the size of $f$. On the other hand, each $V$ that fails the test will allow us to rapidly find another relevant variable to add to $V$.
More specifically, to test a particular $V$ we will first uniformly at random choose a bit-vector $x$ of length $|V|$ and a bit-vector $y$ of length $n-|V|$ and query $f$ on the $n$-bit vector formed by assigning $x$ to the variables in $V$ and $y$ to the variables in $[n]\setminus V$. Denote the value returned by the oracle by $f(x,y)$. For each such pair $(x,y)$ we compare $f(x,y)$ with $f(x,0)$. We repeat this for $c=(2/\epsilon)\ln(2n/\delta)$ random choices of $(x,y)$'s. If any one of these queries produces a value $f(x,y)\neq f(x,0)$ then we have a witness that there is a relevant variable outside of $V$. We then flip half of the 1 bits in $y$, producing $y'$, and query $f$ on $(x,y')$. The result will differ from either $f(x,0)$ or from $f(x,y)$. In either case, we will have two vectors that are half as far apart in Hamming distance as $0$ and $y$ are. Repeating this bit-flipping procedure at most $\log n$ times identifies a relevant variable that is not in $V$. If all tests result in equality, then with probability at least $1-\delta/(2n)$ we have $\Pr_{x,y}[f(x,y)\neq f(x,0)]\leq\epsilon/2$. At this point we learn $f(x,0)$ to within an error of $\epsilon/2$ with probability at least $1-\delta/2$.
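The bit-flipping step is a binary search along a Hamming path between $y$ and $0$. The following Python sketch is our illustration of that search (the oracle interface `mem_f` and the dictionary encoding of assignments are our assumptions, not the paper's notation).

```python
def find_relevant_variable(mem_f, x_assign, y_witness):
    """Binary search for a relevant variable outside V, given a witness pair with
    f(x, y) != f(x, 0).  Here `x_assign` maps variables in V to bits, `y_witness`
    maps variables outside V to bits, and `mem_f` is a membership oracle taking a
    complete assignment as a dict; these interfaces are our own conventions."""
    def query(y_bits):
        assignment = dict(x_assign)
        assignment.update(y_bits)
        return mem_f(assignment)

    lo = {i: 0 for i in y_witness}        # the all-zero setting of the variables outside V
    hi = dict(y_witness)                  # the witness setting: f differs between lo and hi
    f_lo, f_hi = query(lo), query(hi)
    assert f_lo != f_hi
    while True:
        diff = [i for i in hi if hi[i] != lo[i]]
        if len(diff) == 1:
            return diff[0]                # flipping this one variable changes f: it is relevant
        mid = dict(lo)
        for i in diff[:len(diff) // 2]:   # flip half of the bits on which lo and hi differ
            mid[i] = hi[i]
        f_mid = query(mid)
        if f_mid != f_lo:                 # f changes between lo and mid: recurse on that half
            hi, f_hi = mid, f_mid
        else:                             # otherwise f changes between mid and hi
            lo, f_lo = mid, f_mid
```

Each iteration halves the Hamming distance between the two endpoints, so at most $\log n$ membership queries are made beyond the initial pair.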
The probability that this learning strategy fails to produce a hypothesis $h$ that is an $\epsilon$-approximator to $f$ is at most, using the union bound, $\delta$. Overall, then, this testing phase consists of at most $r\leq n$ rounds, with each round adding one relevant variable to $V$. Each round consists of $O((1/\epsilon)\log(n/\delta))$ queries followed by at most $\log n$ additional queries. So we can replace $n$ with $r$ in the earlier bounds at the expense of an additive factor of $\tilde{O}((r/\epsilon)\log^2 n)$, with the soft-O hiding the logarithmic factors involving $1/\delta$.
6. Lower bounds
In this section we give some lower bounds on the sample complexity for learning classes of
Boolean functions under fixed distributions. We prove the following results.
Theorem 6. Let $C$ be a class of Boolean functions, $D$ a fixed probability distribution over $\{0,1\}^n$, and $0<\epsilon,\delta<1$ be fixed. Also, let $C_\epsilon\subseteq C$ be such that for every $f_1,f_2\in C_\epsilon$ we have $\Pr_D[f_1\neq f_2]\geq 2\epsilon$. Any PAC-learning algorithm with membership queries that learns $C$ under the distribution $D$ with accuracy $\epsilon$ and with confidence $1-\delta$ uses at least
$$m_1+m_2\;\geq\;\ell\;=\;\log_2|C_\epsilon|-\log_2\frac{1}{1-\delta}$$
queries, where $m_1$ is the number of random examples and $m_2$ is the number of membership queries.
Proof. Let $A^f_{r,s}$ be a randomized algorithm that uses a sequence of random bits $r$, is given a set $s$ of $m_1$ random examples, and asks $m_2$ membership queries to $f$, where $m_1+m_2<\ell$. Suppose for every $f\in C_\epsilon$ we have
$$\Pr_{s,r}\left[D\!\left(A^f_{r,s}\,\Delta\,f\right)\leq\epsilon\right]\geq 1-\delta.$$
This implies that there is a specific sequence $r_0$ of bits and a specific set $s_0$ of $m_1$ examples such that
$$D\!\left(A^f_{r_0,s_0}\,\Delta\,f\right)\leq\epsilon \qquad (11)$$
for at least $(1-\delta)|C_\epsilon|$ of the functions in $C_\epsilon$. Since $A_{r_0,s_0}$ asks fewer than $\ell$ queries and each query gives a response in $\{0,1\}$, the algorithm can output fewer than $2^{\ell}=(1-\delta)|C_\epsilon|$ distinct hypotheses, so there must be two functions $f_1$ and $f_2$ in $C_\epsilon$ satisfying (11) for which the algorithm outputs the same hypothesis $A^{f_1}_{r_0,s_0}=A^{f_2}_{r_0,s_0}=h$. Now
$$2\epsilon<D(f_1\,\Delta\,f_2)\leq D(h\,\Delta\,f_1)+D(h\,\Delta\,f_2),$$
and therefore there is an $i$ such that $D(h\,\Delta\,f_i)>\epsilon$. This is a contradiction. $\Box$
We will need the following lemma from coding theory (based on the Varshamov–Gilbert bound; see, e.g., [15]).

Lemma 7. Fix any integer $m>2$ and let $\Sigma$ be an alphabet with $|\Sigma|=m$ symbols. Then for any $0<c<1$ and any integer $n\geq 1$ there is a code $L\subseteq\Sigma^n$ of minimum distance $cn$ and size $|L|\geq(m^{1-c}/2)^n$.
Proof. The lemma follows from a simple counting argument. Since each $a\in\Sigma^n$ has at most
$$1+(m-1)\binom{n}{1}+(m-1)^2\binom{n}{2}+\cdots+(m-1)^{cn}\binom{n}{cn}$$
elements of $\Sigma^n$ within distance $cn$ of it, there is a code of size
$$|L|\;\geq\;\frac{m^n}{1+(m-1)\binom{n}{1}+(m-1)^2\binom{n}{2}+\cdots+(m-1)^{cn}\binom{n}{cn}}$$
and minimum distance $cn$. For $m>2$ we have
$$\frac{m^n}{1+(m-1)\binom{n}{1}+(m-1)^2\binom{n}{2}+\cdots+(m-1)^{cn}\binom{n}{cn}}\;\geq\;\frac{m^n}{m^{cn}2^n}\;=\;\left(\frac{m^{1-c}}{2}\right)^{\!n}. \qquad\Box$$
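As a quick numerical sanity check of the counting argument (our illustration, not part of the paper), the following Python snippet compares a greedily built code over a small alphabet with the volume bound used in the proof and with the $(m^{1-c}/2)^n$ lower bound.

```python
from itertools import product
from math import comb

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

def greedy_code(m, n, min_dist):
    """Greedily pick words over an m-ary alphabet that are pairwise >= min_dist apart."""
    code = []
    for w in product(range(m), repeat=n):
        if all(hamming(w, c) >= min_dist for c in code):
            code.append(w)
    return code

m, n, c = 4, 6, 0.5
d = int(c * n)
code = greedy_code(m, n, d)
ball = sum((m - 1) ** i * comb(n, i) for i in range(d + 1))   # ball volume from the proof
print(len(code), m ** n / ball, (m ** (1 - c) / 2) ** n)       # greedy size >= m^n/ball >= (m^{1-c}/2)^n
```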

We now show that in order to learn any class over $r$ relevant variables containing all functions of DNF-size at most $s$ with sufficiently small $\epsilon$ and $\delta$ we need sample size nearly $s\log_2 r$.

Theorem 8. Any algorithm for learning, with respect to the uniform distribution and with $\epsilon<1/8$ and $\delta<1/2$, any class of Boolean functions over $r$ relevant variables that contains all functions of DNF-size at most $s$ requires sample size $\Omega(s\log(r-\log s))$.
Proof. Assume for simplicity, and justifiably since our result is asymptotic, that $s$ is a power of 2. Let $u=r-\log_2 s$ and note that $u$ is positive, since every Boolean function on $r$ variables can be represented as a DNF with fewer than $2^r$ terms: just represent each 1 entry in the truth table over the $r$ relevant variables as a term, unless all entries are 1, in which case the constant function can be used. Also let $t=\log_2 s$, and let $x_1,\ldots,x_t$ and $y_1,\ldots,y_u$ be the $r$ variables of the DNF. For $a\in\{0,1\}^t$ we define $x^a=x_1^{a_1}\cdots x_t^{a_t}$, where $x_i^{a_i}=x_i$ if $a_i=1$ and $x_i^{a_i}=\bar{x}_i$ if $a_i=0$.


Let $S=\{0,1,y_1,\ldots,y_u,\bar{y}_1,\ldots,\bar{y}_u\}$. Now define the set of DNF formulas
$$C'=\left\{\,\bigvee_{a\in\{0,1\}^t}x^a y_a \;:\; (y_a)_{a\in\{0,1\}^t}\in S^{2^t}\right\}.$$
That is, each DNF expression in $C'$ has $2^t$ terms, each containing the $t$ $x$ variables plus one symbol from $S$ (either a $y$ variable or a constant). Furthermore, each term in one of these expressions sets the senses (positive or negated) of the $x$ variables differently, and all possible senses are represented. Thus the number of terms in each DNF expression in $C'$ is $2^t=s$, and therefore all of the functions represented in $C'$ have DNF-size at most $s$. By Lemma 7, for any $c<1$ there is $L\subseteq S^s$ of minimum distance $cs$ and size
$$|L|\geq\left(\frac{(2u+2)^{1-c}}{2}\right)^{\!s}.$$
Fix such an $L$ and define a new set $C$ that is a subset of $C'$:
$$C=\left\{\,\bigvee_{a\in\{0,1\}^t}x^a y_a \;:\; (y_a)_{a\in\{0,1\}^t}\in L\right\}.$$
For $y\in L$ we write $f_y=\bigvee_{a\in\{0,1\}^t}x^a y_a$. That is, $f_y$ represents the DNF expression in which the $i$th symbol in the string $y$ is the value of the $y_a$ variable in the $i$th term of the DNF.

Now notice that for every $y^1$ and $y^2$ in $L$ we have $f_{y^1}\oplus f_{y^2}=f_{y^1\oplus y^2}$. This follows from the fact that $f_y$ can also be written as $\bigoplus_{a\in\{0,1\}^t}x^a y_a$, since the terms $x^a y_a$ are satisfied by pairwise disjoint sets of assignments. Now
$$\Pr[f_{y^1}\neq f_{y^2}]=\Pr[f_{y^1}\oplus f_{y^2}=1]=\Pr[f_{y^1\oplus y^2}=1]=E[f_{y^1\oplus y^2}]=\frac{1}{s}\sum_{a\in\{0,1\}^t}E\!\left[y^1_a\oplus y^2_a\right]\;\geq\;\frac{c}{2}.$$
The next-to-last equality holds because, under the uniform distribution, exactly one of the patterns $x^a$ is satisfied, each with probability $1/s$, independently of the $y$ variables. The final inequality holds because $y^1$ and $y^2$ differ in at least $cs$ entries and each $y^1_a\oplus y^2_a$ that is not identically zero has expectation at least $1/2$.
By Theorem 6, any PAC learning algorithm using membership queries with $\epsilon<c/4$ and, say, $\delta<1/2$ for this class needs
$$\Omega\!\left(\log\left(\frac{(2u+2)^{1-c}}{2}\right)^{\!s}\right)$$
queries, for any $c<1$. Choosing $c$ to be a constant, say $1/2$, gives a sample complexity of $\Omega(s\log(r-\log s))$. $\Box$
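To make the construction concrete (our illustration only), the following Python snippet builds two of the DNFs $f_y$ for toy parameters $t=2$, $u=3$ and verifies by exhaustive evaluation that they disagree on at least a $c/2$ fraction of assignments when the strings differ in a $c$ fraction of their positions.

```python
from itertools import product

# Toy parameters: s = 2^t terms, u extra y variables, so r = t + u relevant variables.
t, u = 2, 3
s = 2 ** t

def eval_symbol(sym, y_bits):
    """A symbol is 0, 1, ('y', i), or ('ybar', i), as in the alphabet S of the proof."""
    if sym in (0, 1):
        return sym
    kind, i = sym
    return y_bits[i] if kind == 'y' else 1 - y_bits[i]

def f(y_string, x_bits, y_bits):
    """f_y(x, y): only the term whose x^a pattern matches x_bits contributes its symbol."""
    a = sum(b << i for i, b in enumerate(x_bits))   # index of the satisfied term
    return eval_symbol(y_string[a], y_bits)

# Two strings over S that differ in every one of the s positions (so c = 1 here).
y1 = [('y', 0), ('y', 1), 0,        ('ybar', 2)]
y2 = [('ybar', 0), 1,     ('y', 2), ('y', 2)]

disagree = sum(f(y1, x, y) != f(y2, x, y)
               for x in product((0, 1), repeat=t)
               for y in product((0, 1), repeat=u))
print(disagree / 2 ** (t + u))   # at least c/2 = 1/2; here it is 0.75
```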


7. Randomness-efficient Levin algorithm


In [9] Kushilevitz and Mansour showed how to derandomize another algorithm for solving the same parity-finding problem solved by Levin, but they assumed that a quantity called the $L_1$-norm of the target function is polynomially bounded and is known. However, the $L_1$-norm is not polynomially bounded for the class of polynomial-size DNF expressions [11]. In this section we describe an algorithm for learning DNF with respect to uniform that is (attribute) efficient in terms of its use of random bits (its randomness complexity) as well as its sample complexity.

Randomization is used several times in the improved Levin's algorithm (given in Fig. 3). First, the method uses the random Boolean matrix $R\in\{0,1\}^{n\times k}$ to create the pairwise independent sample. Second, it requires the set $I$ of $t=O(\ln n)$ random vectors used for majority voting. Third, it uses randomness for Hoeffding sampling to verify that a candidate vector is $\theta$-heavy. For the first case, which is the main focus of this section, we show that the use of small-bias probability distributions [12] can reduce the number of row bits used by $R$ from $nk$ bits to essentially $k\log n$ bits. For the second case, Goldreich [5] has a method based on error-correcting codes which can decrease the column bits of $R$ (which directly impacts the sample size used) from $O(n/\theta^2)$ to $O(1/\theta^2)$. We begin by describing the technique to reduce the row size of the random matrix $R$ used in the basic algorithm (given in Fig. 1) and then show how this applies to DNF learning.
We utilize the so-called biased distributions introduced by Naor and Naor [12]. A probability distribution $D:\{0,1\}^n\to[0,1]$ is called $\lambda$-bias if for all $a\in\{0,1\}^n\setminus\{0^n\}$ we have $|\hat{D}(a)|\leq\lambda 2^{-n}$, where $\hat{D}(a)$ is the Fourier coefficient of the real-valued function $D$ over $\{0,1\}^n$ at $a$. Naor and Naor [12] gave an explicit construction of a $\lambda$-bias probability distribution of size $O((n/\lambda)^2)$, so $O(\log(n/\lambda))$ random bits suffice for sampling from this distribution. Recall that Levin's algorithm creates the $n\times k$ random Boolean matrix $R$ by selecting each entry randomly and independently; hence it needs $kn$ random bits. We modify this algorithm by choosing $k$ independent columns of $R$ according to a biased distribution $D$ over $\{0,1\}^n$. In this way, the number of random bits required is $O(k\log(n/\lambda))$ if a $\lambda$-bias $D$ is used.
We formalize the ideas sketched above. A sequence of random variables $X_1,\ldots,X_m$ is called pairwise $\delta$-dependent if for every $1\leq i\neq j\leq m$ and for every $a,b$, we have $|\Pr[X_i=a,X_j=b]-\Pr[X_i=a]\Pr[X_j=b]|\leq\delta$. Next we state an observation on pairwise dependent random variables and extend Chebyshev's inequality for these types of random variables.

Claim 9. Let $X_1,\ldots,X_m\in\{-1,+1\}$ be pairwise $\delta$-dependent random variables. Then $|E[X_iX_j]-E[X_i]E[X_j]|\leq 4\delta$.
Lemma 10. Let $X_1,\ldots,X_m\in\{-1,+1\}$ be pairwise $\delta$-dependent random variables. Suppose that for each $i\in[m]$, $E[X_i]=\mu$ and $\mathrm{Var}[X_i]=\sigma^2$. Then
$$\Pr\left[\mathrm{sign}\left(\frac{1}{m}\sum_{i=1}^m X_i\right)\neq\mathrm{sign}(\mu)\right]\;\leq\;\frac{\sigma^2+4m\delta}{m\mu^2}. \qquad (12)$$


Proof. If the sign of $\frac{1}{m}\sum_i X_i$ differs from the sign of $\mu$, then $\left|\frac{1}{m}\sum_i X_i-\mu\right|\geq|\mu|$, so we start with an upper bound for the left-hand side of (12):
$$\Pr\left[\left|\frac{1}{m}\sum_i X_i-\mu\right|\geq|\mu|\right]\leq\frac{1}{\mu^2}E\left[\left(\frac{1}{m}\sum_i X_i-\mu\right)^{\!2}\right]=\frac{1}{m^2\mu^2}E\left[\left(\sum_i(X_i-\mu)\right)^{\!2}\right].$$
Then using straightforward algebra, we get
$$\frac{1}{m^2\mu^2}\left[\sum_i E[(X_i-\mu)^2]+\sum_{i\neq j}E[(X_i-\mu)(X_j-\mu)]\right]=\frac{1}{m^2\mu^2}\left[\sum_i\mathrm{Var}[X_i]+\sum_{i\neq j}\left(E[X_iX_j]-\mu^2\right)\right],$$
which, by Claim 9, is bounded from above by $(\sigma^2+4m\delta)/(m\mu^2)$. $\Box$
Notice that by setting $\delta=1/(4m)$ and using the fact that $\sigma^2=1-\mu^2\leq 1$, we obtain an upper bound of $2/(m\mu^2)$ on the left-hand side of the Chebyshev bound stated above. This is only larger by a multiplicative factor of 2 than the bound obtained from Levin's original analysis.

We will show that by choosing $k$ independent column vectors from $\{0,1\}^n$ according to a $\lambda$-bias distribution $D$, the sequence of random variables $X_i=Rp_i$, where $p_i\in\{0,1\}^k\setminus\{0^k\}$, is pairwise $(4\lambda/2^n)$-dependent. So, as long as $4\lambda/2^n\leq 1/(2m)$, the analysis in Section 4.2 still holds. In our case, choosing $\lambda$ to be a fixed small constant will suffice.
Claim 11. Let $R$ be an $n$-by-$k$ matrix with $\{0,1\}$-entries constructed by selecting $k$ random column vectors from $\{0,1\}^n$ according to a $\lambda$-bias distribution $D$ over $\{0,1\}^n$. Let $X_i=Rp_i$, for $p_i\in\{0,1\}^k\setminus\{0^k\}$. Then $X_1,\ldots,X_{2^k-1}$ are pairwise $(4\lambda/2^n)$-dependent random variables.
Proof. We observe first that if $D$ is a $\lambda$-bias distribution over $\{0,1\}^n$ then $|D(x)-2^{-n}|\leq\lambda$. Recall that $|\hat{D}(a)|\leq\lambda/2^n$ for all $a\neq 0^n$ and $\hat{D}(0^n)=2^{-n}$, so
$$|D(x)-2^{-n}|=\left|\sum_{a\neq 0^n}\hat{D}(a)\chi_a(x)\right|\leq\sum_{a\neq 0^n}|\hat{D}(a)|\leq\lambda.$$
Let $p$ and $q$ be any elements of $\{0,1\}^k\setminus\{0^k\}$. We need to prove that for any $p\neq q$ and any $a,b\in\{0,1\}^n$,
$$|\Pr[Rp=a,Rq=b]-\Pr[Rp=a]\Pr[Rq=b]|\leq 4\lambda/2^n.$$
Consider first $\Pr[Rp=a]$. Assume without loss of generality that $p_k\neq 0$. Then after choosing the first $k-1$ columns of $R$, the value of the last column is uniquely determined for the equation $Rp=a$ to hold; say the last column must equal $\alpha\in\{0,1\}^n$. Hence $\Pr[Rp=a]=D(\alpha)$.

Now consider $\Pr[Rp=a,Rq=b]$, with $p\neq q\in\{0,1\}^k$ and $a,b\in\{0,1\}^n$. We assume without loss of generality that the determinant of the matrix
$$\begin{pmatrix}p_{k-1}&q_{k-1}\\p_k&q_k\end{pmatrix}$$
is nonzero, where arithmetic is over $F_2$. After choosing the first $k-2$ columns, there is a unique solution for the $(k-1)$th and $k$th columns, say $\alpha$ and $\beta$. Then $\Pr[Rp=a,Rq=b]=D(\alpha)D(\beta)$, since we draw independent columns. The difference between two quantities of the form $D(\alpha)D(\beta)$ is at most the difference between $(2^{-n}-\lambda)^2$ and $(2^{-n}+\lambda)^2$, which is at most $4\lambda/2^n$. This completes the proof. $\Box$
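As an illustration of Claim 11 (ours, not from the paper), the following brute-force Python check computes the exact pairwise dependence of the variables $X_p=Rp$ when the columns of $R$ are drawn i.i.d. from a given distribution $D$ on $\{0,1\}^n$, and compares it with the $4\lambda/2^n$ bound; the toy distribution used is a small perturbation of uniform, not the Naor–Naor construction.

```python
from itertools import product

n, k = 3, 2
N = 2 ** n

# A toy distribution D on {0,1}^n (indexed by integers 0..2^n - 1).
D = [1.0 / N] * N
D[0] += 0.01
D[5] -= 0.01

def bias(D):
    """Smallest lambda with |D^(a)| <= lambda * 2^-n for all a != 0, where
    D^(a) = 2^-n * sum_x D(x) chi_a(x)."""
    lam = 0.0
    for a in range(1, N):
        coeff = sum(D[x] * (-1) ** bin(x & a).count("1") for x in range(N)) / N
        lam = max(lam, abs(coeff) * N)
    return lam

def column_combo(cols, p):
    """X_p = Rp over F_2: the XOR of the columns selected by the nonzero bits of p."""
    out = 0
    for j in range(k):
        if (p >> j) & 1:
            out ^= cols[j]
    return out

worst = 0.0
for p, q in product(range(1, 2 ** k), repeat=2):
    if p == q:
        continue
    joint = {}
    marg_p, marg_q = [0.0] * N, [0.0] * N
    for cols in product(range(N), repeat=k):     # exact enumeration of the k i.i.d. columns
        w = 1.0
        for c in cols:
            w *= D[c]
        xp, xq = column_combo(cols, p), column_combo(cols, q)
        joint[(xp, xq)] = joint.get((xp, xq), 0.0) + w
        marg_p[xp] += w
        marg_q[xq] += w
    for a in range(N):
        for b in range(N):
            worst = max(worst, abs(joint.get((a, b), 0.0) - marg_p[a] * marg_q[b]))

print(worst, 4 * bias(D) / N)   # observed pairwise dependence vs. the 4*lambda/2^n bound
```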
Levin's algorithm consumes $nk$ random bits (to choose the matrix $R$). For learning DNF, this costs $O(n\log s)$ random bits per boosting round and, with $\tilde{O}(s^2)$ rounds, the total number of random bits used by the Harmonic Sieve is $\tilde{O}(ns^2)$. Using biased distributions to generate $R$, we replace the factor $n$ by $\log n$. Furthermore, we can apply the argument sketched in Section 5.4 to replace $n$ with $r$ (the number of relevant variables).
Here we mention Goldreich's [5] idea of using error-correcting codes for improving the modified Levin algorithm (given in Fig. 3). His technique can eliminate the additional random bits required in our version in Section 4.3 and will achieve the same reduction in sample size (from $O(n/\theta^2)$ to $O(1/\theta^2)$).

In Goldreich's scheme, we take an asymptotically good binary linear $(t,n,d)$-code $C$, i.e., $t=O(n)$ (constant rate) and $d/n=\Omega(1)$ (can tolerate a constant fraction of errors). Since $C$ is a binary linear code, it has a generator matrix $G\in\{0,1\}^{t\times n}$; the encoding of a string $x\in\{0,1\}^n$ is $Gx$. So the $k$th bit of $Gx$ is simply $\chi_{I_k}(x)$, where $I_k$ is the subset specified by the $k$th row of $G$. The fact that $C$ can tolerate a constant fraction of errors implies that we can replace the $1/(2n)$ upper bound on the failure probability of approximating each bit of the $\theta$-heavy coefficient with
$$\Pr\left[\mathrm{sign}\left(\sum_{y\in Y}f(y\oplus e_{I_k})\chi_a(y)\right)\neq\chi_{I_k}(a)\,\mathrm{sign}\!\left(\hat{f}(a)\right)\right]\leq\frac{1}{c}$$
for some constant $c>0$ and for each row $I_k$ of $G$. Thus the sample size is $|Y|\geq c/\theta^2$. One can view this improvement also as a reduction in the column size of the matrix $R$ used in Levin's algorithm, since now $R$ can be chosen using $O(\log(n/\lambda)\cdot\log(1/\theta^2))$ random bits. This coding scheme is explicit and efficient since the class of Justesen codes provides such a family of asymptotically good binary linear codes with an efficient decoding algorithm. We summarize the overall derandomized algorithm in Fig. 5; this algorithm can also be adapted so that it is attribute efficient.
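The decoding view of Goldreich's scheme can be illustrated as follows (our sketch; the code used below is a trivial repetition-style linear code with a brute-force nearest-codeword decoder, standing in for the Justesen codes and efficient decoder mentioned above): each row $I_k$ of the generator matrix yields one, possibly erroneous, estimate of the code bit $\chi_{I_k}(a)$, and decoding recovers $a$ as long as fewer than half the minimum distance of the estimates are wrong.

```python
from itertools import product
import random

n, copies = 4, 5
G = [[1 if col == (row % n) else 0 for col in range(n)] for row in range(n * copies)]

def encode(msg):
    """Code bit k is the parity <G[k], msg> over F_2, i.e., chi_{I_k}(msg)."""
    return tuple(sum(g * m for g, m in zip(row, msg)) % 2 for row in G)

codewords = {encode(msg): msg for msg in product((0, 1), repeat=n)}
d = min(sum(cw) for cw in codewords if any(cw))          # minimum distance of the linear code

a = (1, 0, 1, 1)                                         # the unknown heavy coefficient's index
noisy = list(encode(a))
for j in random.sample(range(len(G)), (d - 1) // 2):     # estimation errors below the decoding radius
    noisy[j] ^= 1

best = min(codewords, key=lambda cw: sum(x != y for x, y in zip(cw, noisy)))
print(codewords[best] == a)                              # True: decoding corrects the bad estimates
```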

8. Future work
While we have made progress in improving the efficiency of DNF learning, we have focused specifically on improving the subprogram for finding heavy Fourier coefficients. As noted already, Klivans and Servedio [8] have had some success at improving the efficiency of the original Harmonic Sieve algorithm for DNF learning by using a different boosting technique than the one used originally in the Sieve. However, the hypothesis produced is more complex than the one produced by the Sieve, and is not guaranteed to be deterministic. Is there a boosting algorithm that can be used to achieve the best time and sample complexity bounds given in this paper while producing hypotheses in the same function class as the original Sieve?


Fig. 5. Derandomized Levin's algorithm.

One goal in this research is to develop a practical DNF learning algorithm for large problems. Although the final algorithm we present is reasonably efficient asymptotically, the combination of nontrivial log factors and somewhat large constants hidden by the asymptotic analysis calls into question the algorithm's practical usefulness. This raises at least two questions: can the current algorithm be tuned to perform well on large empirical problems, and if not, how can the run time for DNF learning be improved?

Another area for future study is closing the significant gap between our lower and upper bounds for the sample complexity of learning DNF.
Acknowledgments
The authors thank Ming-Chih Chen for discussions on derandomization of Levin's algorithm. The authors are also grateful to the anonymous referee of the conference version of this paper who, among other things, reminded them that attribute efficiency is easy to achieve in the learning model studied in this paper. The journal reviewers also uncovered a number of small problems and in other ways helped to improve this work.
References
[1] A.V. Aho, J.E. Hopcroft, J.D. Ullman, The Design and Analysis of Computer Algorithms, Addison-Wesley, Reading, MA, 1974.
[2] D. Angluin, L. Hellerstein, M. Karpinski, Learning read-once formulas with queries, J. ACM 40 (1) (1993) 185–210.
[3] N.H. Bshouty, L. Hellerstein, Attribute-efficient learning in query and mistake-bound models, J. Comput. System Sci. 56 (3) (1998) 310–319.
[4] Y. Freund, An improved boosting algorithm and its implications on learning complexity, in: Proceedings of the Fifth Annual Workshop on Computational Learning Theory, Pittsburgh, PA, 1992, pp. 391–398.
[5] O. Goldreich, Modern Cryptography, Probabilistic Proofs and Pseudorandomness, in: Algorithms and Combinatorics, Vol. 17, Springer, Berlin, 1999.
[6] O. Goldreich, L. Levin, A hard-core predicate for all one-way functions, in: Proceedings of the 21st Annual ACM Symposium on the Theory of Computing, Seattle, WA, 1989, pp. 25–32.
[7] J.C. Jackson, An efficient membership-query algorithm for learning DNF with respect to the uniform distribution, J. Comput. System Sci. 55 (3) (1997) 414–440.
[8] A. Klivans, R. Servedio, Boosting and hard-core sets, in: Proceedings of the 40th Annual Symposium on Foundations of Computer Science, New York, NY, 1999.
[9] E. Kushilevitz, Y. Mansour, Learning decision trees using the Fourier spectrum, SIAM J. Comput. 22 (6) (1993) 1331–1348.
[10] L. Levin, Randomness and non-determinism, J. Symbolic Logic 58 (3) (1993) 1102–1103.
[11] Y. Mansour, An $n^{O(\log\log n)}$ learning algorithm for DNF under the uniform distribution, in: Proceedings of the Fifth Annual Conference on Computational Learning Theory, Pittsburgh, PA, 1992, pp. 53–61.
[12] J. Naor, M. Naor, Small-bias probability spaces: efficient constructions and applications, SIAM J. Comput. 22 (4) (1993) 838–856.
[13] L. Valiant, A theory of the learnable, Comm. ACM 27 (11) (1984) 1134–1142.
[14] K. Verbeurgt, Learning DNF under the uniform distribution in quasi-polynomial time, in: Proceedings of the Third Annual Conference on Computational Learning Theory, Rochester, NY, 1990, pp. 314–326.
[15] J.H. van Lint, Introduction to Coding Theory, Springer, Berlin, 1982.
