Learning Boolean Formulae
Abstract
Efficient distribution-free learning of Boolean formulae from positive and negative examples is considered. It is shown that classes of formulae that are efficiently learnable from only positive examples or only negative examples have certain closure properties. A new substitution technique is used to show that in the distribution-free case learning DNF (disjunctive normal form formulae) is no harder than learning monotone DNF. We prove that monomials cannot be efficiently learned from negative examples alone, even if the negative examples are uniformly distributed. It is also shown that if the examples are drawn from uniform distributions then the class of DNF in which each variable occurs at most once is efficiently learnable, while the class of all monotone Boolean functions is efficiently weakly learnable (i.e., individual examples are correctly classified with a probability larger than $\frac{1}{2} + \frac{1}{p}$, where $p$ is a polynomial in the relevant parameters of the learning problem). We then show an equivalence between the notion of weak learning and the notion of group learning, where a group of examples of polynomial size, either all positive or all negative, must be correctly classified with high probability.
Key words: Machine Learning, Inductive Inference.
Earlier versions of most of the results presented here were described in "On the learnability of Boolean formulae", by M. Kearns, M. Li, L. Pitt and L. Valiant, Proceedings of the 19th A.C.M. Symposium on the Theory of Computing, 1987, pp. 285-295. Earlier versions of Theorems 15 and 22 were announced in "Cryptographic limitations on learning Boolean formulae and finite automata", by M. Kearns and L. Valiant, Proceedings of the 21st A.C.M. Symposium on the Theory of Computing, 1989, pp. 433-444.
† This research was done while the author was at Harvard University and supported by grants NSF-DCR-8600379, ONR-N00014-85-K-0445, and an A.T. & T. Bell Laboratories Scholarship. Author's current address: AT&T Bell Laboratories, Room 2A-423, 600 Mountain Avenue, P.O. Box 636, Murray Hill, NJ 07974-0636.
‡ This research was done while the author was at Harvard University and supported by grants NSF-DCR-8600379, DAAL03-86-K-0171, and ONR-N00014-85-K-0445. Author's current address: Department of Computer Science, University of Waterloo, Waterloo, Ontario N2L 3G1 Canada.
§ Supported by grants NSF-DCR-8600379 and ONR-N00014-85-K-0445. Author's address: Aiken Computation Laboratory, Harvard University, Cambridge, MA 02138.
1 Introduction
We study the computational feasibility of learning Boolean expressions from examples. Our goals are to prove results and develop general techniques that shed light on the boundary between the classes of expressions that are learnable in polynomial time and those that are apparently not. The elucidation of this boundary, for Boolean expressions and other knowledge representations, is an example of the potential contribution of complexity theory to artificial intelligence.
We employ the distribution-free model of learning introduced in [V84]. A more complete discussion and justification of this model can be found in [V85, BEHW86, KLPV87b]. The paper [BEHW86] includes some discussion that is relevant more particularly to infinite representations, such as geometric ones, rather than the finite case of Boolean functions.
The results of this paper fall into four categories: tools for determining efficient learnability, a negative result, distribution-specific positive results, and an equivalence between two models of learnability.
The tools for determining efficient learnability are of two kinds. In Section 3.1 we discuss closure under Boolean operations on the members of the learnable classes. The assumption that the classes are learnable from positive examples only or from negative examples only is sometimes sufficient to ensure closure. In Section 3.2, we give a general substitution technique. It can be used to show, for example, that if disjunctive normal form (DNF) formulae are efficiently learnable in the monotone case, then they are also efficiently learnable in the unrestricted case.
In Section 4 we prove a negative result. We show that for purely information-theoretic reasons, monomials cannot be learned from negative examples alone, regardless of the hypothesis representation, and even if the distribution on the negative examples is uniform. This contrasts with the fact that monomials can be learned from positive examples alone [V84], and can be learned from very few examples if both kinds are available [H88, L88].
The class of DNF expressions in which each variable occurs at most once is called $\mu$DNF. In Section 5, we consider learning $\mu$DNF and the class of arbitrary monotone Boolean functions, under the restriction that both the positive and negative examples are drawn from uniform distributions. We show that $\mu$DNF is learnable (with $\mu$DNF as the hypothesis representation) in this distribution-specific case. In the distribution-free setting, learning $\mu$DNF with $\mu$DNF as the hypothesis representation is known to be NP-hard [PV88]. We also show that arbitrary monotone Boolean functions are weakly learnable under uniform distributions, with a polynomial-time program as the hypothesis representation. In weak learning, it is sufficient to deduce a rule by which individual examples are correctly classified with a probability $\frac{1}{2} + \frac{1}{p}$, where $p$ is a polynomial in the parameters of the learning problem. We conclude in Section 6 by showing that weak learning
is equivalent to the group learning model, in which it is sufficient to deduce a rule that determines with accuracy $1 - \epsilon$ whether a large enough set of examples contains only positive or only negative examples, given that one of the two possibilities holds. This equivalence holds in both the distribution-free case and under any restricted class of distributions, thus yielding a group learning algorithm for monotone Boolean functions under uniform distributions. In the distribution-free case, monotone functions are not weakly learnable by any hypothesis representation, independent of any complexity-theoretic assumptions.
(see e.g. [GJ79]), and denote by $|x|$ and $|c|$ the length of these encodings measured in bits.
Parameterized representation classes. We will often study parameterized classes of representations. Here we have a stratified domain $X = \bigcup_{n \ge 1} X_n$ and representation class $C = \bigcup_{n \ge 1} C_n$. The parameter $n$ can be regarded as an appropriate measure of the complexity of concepts in $C$, and we assume that for a representation $c \in C_n$ we have $pos(c) \subseteq X_n$ and $neg(c) = X_n - pos(c)$. For example, $X_n$ may be the set $\{0,1\}^n$, and $C_n$ the class of all Boolean formulae over $n$ variables. Then for $c \in C_n$, $pos(c)$ would contain all satisfying assignments of the formula $c$.
Efficient evaluation of representations. If $C$ is a representation class over $X$, we say that $C$ is polynomially evaluatable if there is a (probabilistic) polynomial-time evaluation algorithm $A$ that on input a representation $c \in C$ and a domain point $x \in X$ outputs $c(x)$.
Samples. A labeled example from a domain $X$ is a pair $\langle x, b \rangle$, where $x \in X$ and $b \in \{0,1\}$. A labeled sample $S = \langle x_1, b_1 \rangle, \ldots, \langle x_m, b_m \rangle$ from $X$ is a finite sequence of labeled examples from $X$. If $C$ is a representation class, a labeled example of $c \in C$ is a labeled example of the form $\langle x, c(x) \rangle$, where $x \in X$. A labeled sample of $c$ is a labeled sample $S$ where each example of $S$ is a labeled example of $c$. In the case where all labels $b_i$ or $c(x_i)$ are 1 (respectively, 0), we may omit the labels and simply write $S$ as a list of points $x_1, \ldots, x_m$, and we call the sample a positive (negative) sample.
We say that a representation $h$ and an example $\langle x, b \rangle$ agree if $h(x) = b$; otherwise they disagree. We say that a representation $h$ and a sample $S$ are consistent if $h$ agrees with each example in $S$.
We think of the target distributions as representing the "real world" distribution of the environment in which the learning algorithm must perform. For instance, suppose that the target concept were that of "dangerous situations". Certainly the situations "oncoming tractor" and "oncoming taxi" are both contained in this concept. However, a child growing up in a rural environment is much more likely to witness the former event than the latter, and the situation is reversed for a child growing up in an urban environment. These differences in probability are reflected in different target distributions for the same underlying target concept. Furthermore, since we rarely expect to have precise knowledge of the target distributions at the time we design a learning algorithm, ideally we seek algorithms that perform well under any target distributions.
Given a fixed target representation $c \in C$, and given fixed target distributions $D^+_c$ and $D^-_c$, there is a natural measure of the error (with respect to $c$, $D^+_c$ and $D^-_c$) of a representation $h$ from a representation class $H$. We define $e^+_c(h) = D^+_c(neg(h))$ and $e^-_c(h) = D^-_c(pos(h))$. Note that $e^+_c(h)$ (respectively, $e^-_c(h)$) is simply the probability that a random positive (negative) example of $c$ is identified as negative (positive) by $h$. If both $e^+_c(h) < \epsilon$ and $e^-_c(h) < \epsilon$, then we say that $h$ is an $\epsilon$-good hypothesis (with respect to $D^+_c$ and $D^-_c$); otherwise, $h$ is $\epsilon$-bad. We define the accuracy of $h$ to be the value $\min(1 - e^+_c(h), 1 - e^-_c(h))$.
When the target representation $c$ is clear from the context, we will drop the subscript $c$ and simply write $D^+, D^-, e^+$ and $e^-$.
In the definitions that follow, we will demand that a learning algorithm produce with high probability an $\epsilon$-good hypothesis regardless of the target representation and target distributions. While at first this may seem like a strong criterion, note that the error of the hypothesis output is always measured with respect to the same target distributions on which the algorithm was trained. Thus, while it is true that certain examples of the target representation may be extremely unlikely to be generated in the training process, these same examples intuitively may be "ignored" by the hypothesis of the learning algorithm, since they contribute a negligible amount of error. Continuing our informal example, the rural child may never be shown an oncoming taxi as an example of a dangerous situation, but provided he remains in the environment in which he was trained, it is unlikely that his inability to recognize this danger will ever become apparent.
Learnability. Let $C$ and $H$ be representation classes over $X$. Then $C$ is learnable from examples by $H$ if there is a (probabilistic) algorithm $A$ with access to POS and NEG, taking inputs $\epsilon, \delta$, with the property that for any target representation $c \in C$, for any target distributions $D^+$ over $pos(c)$ and $D^-$ over $neg(c)$, and for any input values $0 < \epsilon, \delta < 1$, algorithm $A$ halts and outputs a representation $h_A \in H$ that with probability greater than $1 - \delta$ satisfies
(i) $e^+(h_A) < \epsilon$
and
(ii) $e^-(h_A) < \epsilon$.
We call $C$ the target class and $H$ the hypothesis class; the output $h_A \in H$ is called the hypothesis of $A$. $A$ will be called a learning algorithm for $C$. If $C$ and $H$ are polynomially evaluatable, and $A$ runs in time polynomial in $\frac{1}{\epsilon}, \frac{1}{\delta}$ and $|c|$ then we say that $C$ is polynomially learnable from examples by $H$; if $C$ is parameterized we also allow the running time of $A$ to have polynomial dependence on $n$. We will drop the phrase "from examples" and simply say that $C$ is learnable by $H$, and $C$ is polynomially learnable by $H$. We say $C$ is polynomially learnable to mean that $C$ is polynomially learnable by $H$ for some polynomially evaluatable $H$. We will sometimes call $\epsilon$ the accuracy parameter and $\delta$ the confidence parameter.
Thus, we ask that for any target representation and any target distributions, a learning algorithm finds an $\epsilon$-good hypothesis with probability at least $1 - \delta$. A major goal of research in this model is to discover which representation classes $C$ are polynomially learnable.
We will sometimes bound the probability that a learning algorithm fails to output an $\epsilon$-good hypothesis by separately analyzing a constant number $k$ of different ways the algorithm might fail, and showing that each of these occurs with probability at most $\frac{\delta}{k}$. Thus, we will use the expression "with high probability" to mean with probability at least $1 - \frac{\delta}{k}$ for an appropriately large constant $k$, which for brevity may sometimes be left unspecified.
We refer to this model as the distribution-free model, to emphasize that we seek algorithms that work for any target distributions. We also occasionally refer to this model as strong learnability, in contrast with the notion of weak learnability defined below.
Weak learnability. We will also consider a distribution-free model in which the hypothesis of the learning algorithm is required to perform only slightly better than random guessing. Let $C$ and $H$ be representation classes over $X$. Then $C$ is weakly learnable from examples by $H$ if there is a polynomial $p$ and a (probabilistic) algorithm $A$ with access to POS and NEG, taking input $\delta$, with the property that for any target representation $c \in C$, for any target distributions $D^+$ over $pos(c)$ and $D^-$ over $neg(c)$, and for any input value $0 < \delta < 1$, algorithm $A$ halts and outputs a representation $h_A \in H$ that with probability greater than $1 - \delta$ satisfies
(i) $e^+(h_A) < \frac{1}{2} - \frac{1}{p(|c|)}$
and
(ii) $e^-(h_A) < \frac{1}{2} - \frac{1}{p(|c|)}$.
Thus, the accuracy of $h_A$ must be at least $\frac{1}{2} + \frac{1}{p(|c|)}$. $A$ will be called a weak learning algorithm for $C$. In the case that the target class $C$ is parameterized, we allow the polynomial $p$ in conditions (i) and (ii) to depend on $n$. If $C$ and $H$ are polynomially evaluatable, and $A$ runs in time polynomial in $\frac{1}{\delta}$ and $|c|$ (as well as polynomial in $n$ if $C$ is parameterized), we say that $C$ is polynomially weakly learnable by $H$, and $C$ is polynomially weakly learnable if it is weakly learnable by $H$ for some polynomially evaluatable $H$.
While the motivation for weak learning may not be as apparent as that for strong learning,
we may intuitively think of weak learning as the ability to detect some slight bias separating
positive and negative examples. In fact, Schapire [S89] has shown that in the distribution-free
setting, polynomial-time weak learning is equivalent to polynomial-time learning.
Positive-only and negative-only learning algorithms. We will sometimes study learning
algorithms that need only positive examples or only negative examples. If A is a learning
algorithm for a representation class C , and A makes no calls to the oracle NEG (respectively,
POS ), then we say that A is a positive-only (negative-only) learning algorithm, and C is
learnable from positive examples (learnable from negative examples). Analogous definitions
are made for weak learnability. Note that although the learning algorithm may receive only
one type of example, the hypothesis output must still be accurate with respect to both the
positive and negative distributions.
Many learning algorithms in the distribution-free model are positive-only or negative-only.
The study of positive-only and negative-only learning is important for at least two reasons.
First, it helps to quantify more precisely what kind of information is required for learning var-
ious representation classes. Second, it is crucial for applications where, for instance, positive
examples are rare but must be classified accurately when they do occur.
Distribution-specific learnability. The models for learnability described above demand that a learning algorithm work regardless of the distributions on the examples. We will sometimes relax this condition, and consider these models under restricted classes of target distributions, for instance the class consisting only of the uniform distribution. Here the definitions are the same as before, except that we ask that the performance criteria for learnability be met only under these restricted target distributions.
Sample complexity. Let $A$ be a learning algorithm for a representation class $C$. Then we denote by $s_A(\epsilon, \delta)$ the number of calls to the oracles POS and NEG made by $A$ on inputs $\epsilon, \delta$; this is
a worst-case measure over all possible target representations in $C$ and all target distributions $D^+$ and $D^-$. In the case that $C$ is a parameterized representation class, we also allow $s_A$ to depend on $n$. We call the function $s_A$ the sample complexity or sample size of $A$. We denote by $s^+_A$ and $s^-_A$ the number of calls of $A$ to POS and NEG, respectively. Note that here we have defined the sample complexity deterministically, since all of the algorithms we give use a number of examples depending only on $\epsilon$, $\delta$ and $n$. In general, however, we may wish to allow the number of examples drawn to depend on the coin flips of the learning algorithm and the actual sequence of examples seen; in this case the sample complexity would be an expected value rather than an absolute value. Following the sample complexity lower bound of Theorem 12 in Section 4, we discuss how to adapt the proof of this theorem and other known deterministic sample complexity lower bounds to yield lower bounds on the expected number of examples.
Chernoff bounds. We shall make extensive use of the following bounds on the area under the tails of the binomial distribution. For $0 \le p \le 1$ and $m$ a positive integer, let $LE(p, m, r)$ denote the probability of at most $r$ successes in $m$ independent trials of a Bernoulli variable with probability of success $p$, and let $GE(p, m, r)$ denote the probability of at least $r$ successes. Then for $0 \le \beta \le 1$,
Fact CB1. $LE(p, m, (1 - \beta)mp) \le e^{-\beta^2 mp/2}$
and
Fact CB2. $GE(p, m, (1 + \beta)mp) \le e^{-\beta^2 mp/3}$.
These bounds in the form they are stated are from [AV79]; see also [C52]. Although we will make frequent use of Fact CB1 and Fact CB2, we will do so in varying levels of detail, depending on the complexity of the calculation involved. However, we are primarily interested in the Chernoff bounds for the following consequence of Fact CB1 and Fact CB2: given an event $E$ of probability $p$, we can obtain an estimate $p'$ of $p$ by drawing $m$ points from the distribution and computing the frequency $p'$ with which $E$ occurs in this sample. Then for $m$ polynomial in $\frac{1}{p}$ and $\frac{1}{\delta}$, say $m = O(\frac{1}{p}\log\frac{1}{\delta})$, $p'$ satisfies $\frac{p}{2} < p' < 2p$ with probability at least $1 - \delta$. If we also allow $m$ to depend polynomially on $\frac{1}{\epsilon}$, say $m = O(\frac{1}{\epsilon^2}\log\frac{1}{\delta})$, we can obtain an estimate $p'$ such that $p - \epsilon < p' < p + \epsilon$ with probability at least $1 - \delta$.
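As an illustration of how this consequence is used throughout the paper, the following sketch estimates the probability of an event by sampling; the helper names (`draw_sample`, `in_event`) are hypothetical, and the sample size uses one standard additive (Hoeffding-style) constant rather than the exact figures from Facts CB1 and CB2.
```python
import math
import random

def estimate_probability(draw_sample, in_event, eps, delta):
    """Estimate Pr[E] to within +/- eps with confidence 1 - delta.

    draw_sample: zero-argument function returning one point from the
                 target distribution (e.g., a stand-in for the POS oracle).
    in_event:    predicate deciding whether a point lies in the event E.
    The sample size m = O((1/eps^2) log(1/delta)), as in the discussion above.
    """
    m = math.ceil((1.0 / (2 * eps ** 2)) * math.log(2.0 / delta))
    hits = sum(1 for _ in range(m) if in_event(draw_sample()))
    return hits / m

# Example: estimate the chance that a uniformly random 10-bit vector
# has its first bit set (true value 1/2).
if __name__ == "__main__":
    draw = lambda: [random.randint(0, 1) for _ in range(10)]
    print(estimate_probability(draw, lambda v: v[0] == 1, eps=0.05, delta=0.01))
```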
Notational conventions. Let $E(x)$ be an event and $\psi(x)$ a random variable that depend on a parameter $x$ that takes on values in a set $X$. Then for $X' \subseteq X$, we denote by $\mathbf{Pr}_{x \in X'}(E(x))$ the probability that $E$ occurs when $x$ is drawn uniformly at random from $X'$. Similarly, $\mathbf{E}_{x \in X'}(\psi(x))$ is the expected value of $\psi$ when $x$ is drawn uniformly at random from $X'$. We also need to work with distributions other than the uniform distribution; thus if $P$ is a distribution over $X$ we use $\mathbf{Pr}_{x \in P}(E(x))$ and $\mathbf{E}_{x \in P}(\psi(x))$ to denote the probability of $E$ and the expected value of $\psi$, respectively, when $x$ is drawn according to the distribution $P$. When $E$ or $\psi$ depend on several parameters that are drawn independently from different distributions we use multiple subscripts. For example, $\mathbf{Pr}_{x_1 \in P_1, x_2 \in P_2, x_3 \in P_3}(E(x_1, x_2, x_3))$ denotes the probability of event $E$ when $x_1$ is drawn from distribution $P_1$, $x_2$ from $P_2$, and $x_3$ from $P_3$, independently.
3 Useful Tools for Distribution-free Learning
In this section we describe some general tools for determining polynomial-time learnability in
the distribution-free model. These tools fall into two classes: closure theorems for polynomially
learnable representation classes, and reductions between learning problems via variable substitution.
We also discuss some applications of these techniques. Although we are primarily interested here in
polynomial-time learnability, the results presented in this section are easily generalized to algorithms
with higher time complexities.
fall in $pos(c_1)$; thus with probability exceeding $1 - s_{A_1}\left(\frac{\epsilon}{4ks_{A_1}}\right) \ge 1 - \frac{\epsilon}{4k}$, all $s_{A_1}$ positive examples provided to the simulation of $A_1$ fall in $neg(h_2) \cap pos(c_1)$ as claimed.
Thus if $h_1$ is the hypothesis of $A_1$ following the simulation, then with high probability $h_1$ satisfies $e^-(h_1) < \frac{\epsilon}{k}$, and also
$\mathbf{Pr}_{x \in D^+}(x \in neg(h_1) \cap neg(h_2))$
$= \mathbf{Pr}_{x \in D^+}(x \in neg(h_1) \cap neg(h_2) \cap pos(c_1)) + \mathbf{Pr}_{x \in D^+}(x \in neg(h_1) \cap neg(h_2) \cap neg(c_1))$
$\le \mathbf{Pr}_{x \in D^+}(x \in neg(h_1) \mid x \in neg(h_2) \cap pos(c_1)) + \mathbf{Pr}_{x \in D^+}(x \in neg(h_2) \cap neg(c_1))$
$\le \frac{\epsilon}{k} + \frac{2\epsilon}{ks_{A_1}}.$
Setting $h_A = h_1 \vee h_2$, we have $e^+(h_A) < \epsilon$ and $e^-(h_A) < \epsilon$, as desired. Note that the time required by this simulation is polynomial in the time required by $A_1$ and the time required by $A_2$.
The following dual to Theorem 1 has a similar proof:
Theorem 2 Let $C_1$ be polynomially learnable by $H_1$, and let $C_2$ be polynomially learnable by $H_2$ from positive examples. Then $C_1 \wedge C_2$ is polynomially learnable by $H_1 \wedge H_2$.
As corollaries we have that the following classes of Boolean formulae are polynomially learnable:
Corollary 3 For any fixed $k$, let $k\mathrm{CNF} \vee k\mathrm{DNF} = \bigcup_{n \ge 1}(k\mathrm{CNF}_n \vee k\mathrm{DNF}_n)$. Then $k\mathrm{CNF} \vee k\mathrm{DNF}$ is polynomially learnable by $k\mathrm{CNF} \vee k\mathrm{DNF}$.
Corollary 4 For any fixed $k$, let $k\mathrm{CNF} \wedge k\mathrm{DNF} = \bigcup_{n \ge 1}(k\mathrm{CNF}_n \wedge k\mathrm{DNF}_n)$. Then $k\mathrm{CNF} \wedge k\mathrm{DNF}$ is polynomially learnable by $k\mathrm{CNF} \wedge k\mathrm{DNF}$.
Proofs of Corollaries 3 and 4 follow from Theorems 1 and 2 and the algorithms in [V84] for learning $k$CNF from positive examples and $k$DNF from negative examples. Note that the algorithms obtained in Corollaries 3 and 4 use both positive and negative examples. Following Theorem 12 of Section 4 we show that the representation classes $k\mathrm{CNF} \vee k\mathrm{DNF}$ and $k\mathrm{CNF} \wedge k\mathrm{DNF}$ require both positive and negative examples for polynomial learnability, regardless of the hypothesis class. We note that the more recent results of Rivest [R87] imply that the above classes are learnable by the class of $k$-decision lists, a class that properly includes them.
Under the stronger assumption that both C1 and C2 are learnable from positive examples, we
can prove the following result, which shows that the classes that are polynomially learnable from
positive examples are closed under conjunction of representations.
Theorem 5 Let $C_1$ be polynomially learnable by $H_1$ from positive examples, and let $C_2$ be polynomially learnable by $H_2$ from positive examples. Then $C_1 \wedge C_2$ is polynomially learnable by $H_1 \wedge H_2$ from positive examples.
Proof: Let $A_1$ be a polynomial-time positive-only algorithm for learning $C_1$ by $H_1$, and let $A_2$ be a polynomial-time positive-only algorithm for learning $C_2$ by $H_2$. We describe a polynomial-time positive-only algorithm $A$ for learning $C_1 \wedge C_2$ by $H_1 \wedge H_2$ that uses $A_1$ and $A_2$ as subroutines.
Let $c = c_1 \wedge c_2$ be the target representation in $C_1 \wedge C_2$, where $c_1 \in C_1$ and $c_2 \in C_2$, and let $D^+$ and $D^-$ be the target distributions on $pos(c)$ and $neg(c)$. Since $pos(c) \subseteq pos(c_1)$, $A$ can use $A_1$ to learn a representation $h_1 \in H_1$ for $c_1$ using the positive examples from $D^+$ generated by POS. $A$ simulates algorithm $A_1$ with accuracy parameter $\frac{\epsilon}{2}$ and confidence parameter $\frac{\delta}{2}$, and obtains $h_1 \in H_1$ that with high probability satisfies $e^+(h_1) \le \frac{\epsilon}{2}$. Note that although we are unable to bound $e^-(h_1)$ by $\frac{\epsilon}{2}$, we must have
$\mathbf{Pr}_{x \in D^-}(x \in pos(h_1) - pos(c_1)) = \mathbf{Pr}_{x \in D^-}(x \in pos(h_1) \mbox{ and } x \in neg(c_1)) \le \mathbf{Pr}_{x \in D^-}(x \in pos(h_1) \mid x \in neg(c_1)) \le \frac{\epsilon}{2}$
since $A_1$ must work for any fixed distribution on $neg(c_1)$. Similarly, $A$ simulates algorithm $A_2$ with accuracy parameter $\frac{\epsilon}{2}$ and confidence parameter $\frac{\delta}{2}$ to obtain $h_2 \in H_2$ that with high probability satisfies $e^+(h_2) \le \frac{\epsilon}{2}$ and $\mathbf{Pr}_{x \in D^-}(x \in pos(h_2) - pos(c_2)) \le \frac{\epsilon}{2}$. Then we have
$e^+(h_1 \wedge h_2) \le e^+(h_1) + e^+(h_2) \le \epsilon.$
We now bound $e^-(h_1 \wedge h_2)$ as follows:
$e^-(h_1 \wedge h_2)$
$= \mathbf{Pr}_{x \in D^-}(x \in pos(h_1 \wedge h_2) - pos(c_1 \wedge c_2))$
$= \mathbf{Pr}_{x \in D^-}(x \in pos(h_1) \cap pos(h_2) \cap neg(c_1 \wedge c_2))$
$= \mathbf{Pr}_{x \in D^-}(x \in pos(h_1) \cap pos(h_2) \cap (neg(c_1) \cup neg(c_2)))$
$= \mathbf{Pr}_{x \in D^-}(x \in (pos(h_1) \cap pos(h_2) \cap neg(c_1)) \cup (pos(h_1) \cap pos(h_2) \cap neg(c_2)))$
$\le \mathbf{Pr}_{x \in D^-}(x \in pos(h_1) \cap pos(h_2) \cap neg(c_1)) + \mathbf{Pr}_{x \in D^-}(x \in pos(h_1) \cap pos(h_2) \cap neg(c_2))$
$\le \mathbf{Pr}_{x \in D^-}(x \in pos(h_1) \cap neg(c_1)) + \mathbf{Pr}_{x \in D^-}(x \in pos(h_2) \cap neg(c_2))$
$= \mathbf{Pr}_{x \in D^-}(x \in pos(h_1) - pos(c_1)) + \mathbf{Pr}_{x \in D^-}(x \in pos(h_2) - pos(c_2))$
$\le \frac{\epsilon}{2} + \frac{\epsilon}{2} = \epsilon.$
The time required by this simulation is polynomial in the time taken by A1 and A2 .
The proof of Theorem 5 generalizes to allow any polynomial number $p$ of conjuncts of representations in the target class. Thus, if $C_1, \ldots, C_p$ are polynomially learnable from positive examples
Is C1 ∨ C2 polynomially learnable by C1 ∨ C2? | C1 polynomially learnable by C1 from POS | C1 polynomially learnable by C1 from NEG | C1 polynomially learnable by C1 from POS and NEG
C2 polynomially learnable by C2 from POS | NP-hard in some cases | YES, from POS and NEG | NP-hard in some cases
C2 polynomially learnable by C2 from NEG | YES, from POS and NEG | YES, from NEG | YES, from POS and NEG
C2 polynomially learnable by C2 from POS and NEG | NP-hard in some cases | YES, from POS and NEG | NP-hard in some cases
Figure 1: Polynomial learnability of C1 ∨ C2 by C1 ∨ C2.
Is C1 ∧ C2 polynomially learnable by C1 ∧ C2? | C1 polynomially learnable by C1 from POS | C1 polynomially learnable by C1 from NEG | C1 polynomially learnable by C1 from POS and NEG
C2 polynomially learnable by C2 from POS | YES, from POS | YES, from POS and NEG | YES, from POS and NEG
C2 polynomially learnable by C2 from NEG | YES, from POS and NEG | NP-hard in some cases | NP-hard in some cases
C2 polynomially learnable by C2 from POS and NEG | YES, from POS and NEG | NP-hard in some cases | NP-hard in some cases
Figure 2: Polynomial learnability of C1 ∧ C2 by C1 ∧ C2.
3.2 Reductions between Boolean formulae learning problems via variable sub-
stitutions
In traditional complexity theory, the notion of polynomial-time reducibility has proven extremely useful for comparing the computational difficulty of problems whose exact complexity or tractability is unresolved. Similarly, in computational learning theory, we might expect that given two representation classes $C_1$ and $C_2$ whose polynomial learnability is unresolved, we may still be able to prove conditional statements to the effect that if $C_1$ is polynomially learnable, then $C_2$ is polynomially learnable. This suggests a notion of reducibility between learning problems.
In this section we describe polynomial-time reductions between learning problems for classes of
Boolean formulae. These reductions are very general and involve simple variable substitutions. Sim-
ilar transformations have been given for the mistake-bounded model of learning in [L88]. Recently
the notion of reducibility among learning problems has been elegantly generalized and developed
into a complexity theory for polynomial-time learnability [PW88].
If $F = \bigcup_{n \ge 1} F_n$ is a parameterized class of Boolean formulae, we say that $F$ is naming invariant if for any formula $f(x_1, \ldots, x_n) \in F_n$, we have $f(x_{\sigma(1)}, \ldots, x_{\sigma(n)}) \in F_n$, where $\sigma$ is any permutation of $\{1, \ldots, n\}$. We say that $F$ is upward closed if for $n \ge 1$, $F_n \subseteq F_{n+1}$.
We note that all of the classes of Boolean formulae studied here are both naming invariant and upward closed. For a simple example, let us consider the class of monotone DNF $M_n$ over $n$ variables. The formula $x_1 + x_2$ is in $M_k$ for all $k \ge 2$, and so is $x_2 + x_1$.
Theorem 7 Let $F = \bigcup_{n \ge 1} F_n$ be a parameterized class of Boolean formulae that is naming invariant and upward closed. Let $G$ be a finite set of Boolean formulae over a constant number of variables $k$. Let $F'_n$ be the class of formulae obtained by choosing any $f(x_1, \ldots, x_n) \in F_n$ and substituting for one or more of the variables $x_i$ in $f$ any formula $g_i(x_{i_1}, \ldots, x_{i_k})$, where $g_i \in G$, and each $x_{i_j} \in \{x_1, \ldots, x_n\}$ (thus, the formula obtained is still over $x_1, \ldots, x_n$). Let $F' = \bigcup_{n \ge 1} F'_n$. Then if $F$ is polynomially learnable, $F'$ is polynomially learnable.
Proof: Let $A$ be a polynomial-time learning algorithm for $F$. We describe a polynomial-time learning algorithm $A'$ for $F'$ that uses algorithm $A$ as a subroutine. For each formula $g_i \in G$, $A'$ creates $n^k$ new variables $z^i_1, \ldots, z^i_{n^k}$. The intention is that $z^i_j$ will simulate the value of the formula $g_i$ when $g_i$ is given the $j$th choice in some canonical ordering of $k$ (not necessarily distinct) inputs from $x_1, \ldots, x_n$. Note that there are exactly $n^k$ such choices.
Whenever algorithm $A$ requests a positive or negative example, $A'$ takes a positive or negative example $(v_1, \ldots, v_n) \in \{0,1\}^n$ of the target formula $f'(x_1, \ldots, x_n) \in F'_n$. Let $c^i_j \in \{0,1\}$ be the value assigned to $z^i_j$ by the simulation described above. Then $A'$ gives the example
$(v_1, \ldots, v_n, c^1_1, \ldots, c^1_{n^k}, \ldots, c^{|G|}_1, \ldots, c^{|G|}_{n^k})$
to algorithm $A$. Since $f'(x_1, \ldots, x_n)$ was obtained by substitutions on some $f \in F_n$, and since $F$ is naming invariant and upward closed, there is a formula in $F_{n + |G|n^k}$ that is consistent with all the examples we generate by this procedure (it is just $f'$ with each occurrence of the formula $g_i$ replaced by the variable $z^i_j$ that simulates the inputs to the occurrence of $g_i$). Thus $A$ must output an $\epsilon$-good hypothesis
$h_A(x_1, \ldots, x_n, z^1_1, \ldots, z^1_{n^k}, \ldots, z^{|G|}_1, \ldots, z^{|G|}_{n^k}).$
We then obtain an $\epsilon$-good hypothesis over $n$ variables by defining
$h_{A'}(v_1, \ldots, v_n) = h_A(v_1, \ldots, v_n, c^1_1, \ldots, c^1_{n^k}, \ldots, c^{|G|}_1, \ldots, c^{|G|}_{n^k})$
for any $(v_1, \ldots, v_n) \in \{0,1\}^n$, where each $c^i_j$ is computed as described above.
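A minimal sketch of the example transformation used in this proof: the substitution set is passed as a list `G` of Python predicates over $k$ bits and the oracle is abstracted away, so the names here are illustrative rather than part of the paper's construction.
```python
from itertools import product

def expand_example(v, G, k):
    """Map an n-bit example v to the (n + |G| * n^k)-bit example fed to A.

    For each formula g in G and each of the n^k ordered choices of k input
    indices (the canonical ordering used in the proof), append the bit
    obtained by evaluating g on those positions of v.
    """
    n = len(v)
    extra = []
    for g in G:                                   # one block of n^k new variables per g
        for idxs in product(range(n), repeat=k):  # the n^k canonical choices
            extra.append(int(g(*(v[i] for i in idxs))))
    return list(v) + extra

# Example with G = {AND of two variables}, k = 2, n = 3.
G = [lambda a, b: a and b]
print(expand_example([1, 0, 1], G, k=2))
```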
Note that if the learning algorithm A for F uses only positive examples or only negative ex-
amples, this property is preserved by the reduction of Theorem 7. As a corollary of Theorem 7 we
have that for most natural Boolean formula classes, the monotone learning problem is no harder
than the general learning problem:
Corollary 8 Let $F = \bigcup_{n \ge 1} F_n$ be a parameterized class of Boolean formulae that is naming invariant and upward closed. Let monotone $F$ be the class containing all monotone formulae, i.e., formulae containing no negations, in $F$. Then if monotone $F$ is polynomially learnable, $F$ is polynomially learnable.
Proof: In the statement of Theorem 7, let $G = \{\bar{y}\}$. Then all of the negated literals $\bar{x}_1, \ldots, \bar{x}_n$ can be obtained as instances of the single formula in $G$.
Theorem 7 says that the learning problem for a class of Boolean formulae does not become
harder if an unknown subset of the variables is replaced by a constant-sized set of formulae whose
inputs are unknown. The following result says this is also true if the number of substitution formulae
is larger, but the order and inputs are known.
Theorem 9 Let $F = \bigcup_{n \ge 1} F_n$ be a parameterized class of Boolean formulae that is naming invariant and upward closed. Let $p(n)$ be a fixed polynomial, and let the description of the $p(n)$-tuple $(g^n_1, \ldots, g^n_{p(n)})$ be computable in polynomial time on input $n$ in unary, where each $g^n_i$ is a Boolean formula over $n$ variables. Let $F'_n$ consist of formulae of the form
$f(g^n_1(x_1, \ldots, x_n), \ldots, g^n_{p(n)}(x_1, \ldots, x_n))$
where $f \in F_{p(n)}$. Let $F' = \bigcup_{n \ge 1} F'_n$. Then if $F$ is polynomially learnable, $F'$ is polynomially learnable.
Proof: Let $A$ be a polynomial-time learning algorithm for $F$. We describe a polynomial-time learning algorithm $A'$ for $F'$ that uses algorithm $A$ as a subroutine. Similar to the proof of Theorem 7, $A'$ creates new variables $z_1, \ldots, z_{p(n)}$. The intention is that $z_i$ will simulate $g^n_i(x_1, \ldots, x_n)$.
When algorithm $A$ requests a positive or a negative example, $A'$ takes a positive or negative example $(v_1, \ldots, v_n) \in \{0,1\}^n$ of the target formula $f'(x_1, \ldots, x_n) \in F'_n$ and sets $c_i = g^n_i(v_1, \ldots, v_n)$. $A'$ then gives the vector $(c_1, \ldots, c_{p(n)})$ to $A$. As in the proof of Theorem 7, $A$ must output an $\epsilon$-good hypothesis $h_A$ over $p(n)$ variables. We then define $h_{A'}(v_1, \ldots, v_n) = h_A(c_1, \ldots, c_{p(n)})$, for any $(v_1, \ldots, v_n) \in \{0,1\}^n$, where each $c_i$ is computed as described above.
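The transformation here is even simpler than the one in Theorem 7; a sketch, assuming the substitution formulae are supplied as a list `g` of Python functions on the whole input vector (hypothetical names).
```python
def project_example(v, g):
    """Map an n-bit example v to the p(n)-bit example (g_1(v), ..., g_{p(n)}(v))."""
    return [int(gi(v)) for gi in g]

# Example: p(n) = 2n projections that simply duplicate each input bit.
n = 3
g = [(lambda v, j=j: v[j % n]) for j in range(2 * n)]
print(project_example([1, 0, 1], g))
```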
If the learning algorithm A for F uses only positive examples or only negative examples, this
property is preserved by the reduction of Theorem 9.
Corollary 10 Let $F = \bigcup_{n \ge 1} F_n$ be a parameterized class of Boolean formulae that is naming invariant and upward closed and in which formulae in $F_n$ are of length at most $q(n)$ for some fixed polynomial $q(\cdot)$. Let $\mu F$ consist of all formulae in $F$ in which each variable occurs at most once. Then if $\mu F$ is polynomially learnable, $F$ is polynomially learnable.
Proof: Let $f \in F$, and let $l$ be the maximum number of times any variable occurs in $f$. Then in the statement of Theorem 9, let $p(n) = q(n)$ and $g^n_{in+j} = x_j$ for $0 \le i \le l - 1$ and $1 \le j \le n$.
Corollaries 8 and 10 are particularly useful for simplifying the learning problem for classes whose
polynomial-time learnability is in question. For example:
Corollary 11 If monotone DNF (respectively, monotone CNF) is polynomially learnable, then
DNF (CNF) is polynomially learnable.
It is important to note that the substitutions suggested by Theorems 7 and 9 and their corol-
laries do not preserve the underlying target distributions. For example, it does not follow from
Corollary 11 that if monotone DNF is polynomially learnable under uniform target distributions
(as is shown in Section 5) then DNF is polynomially learnable under uniform distributions. The re-
sults presented in this section should also work for other concept classes whenever similar conditions
hold, as pointed out by one referee.
4 A Negative Result
A number of polynomial-time learning algorithms in the literature require only positive examples
or only negative examples. Among other issues, this raises the question of whether every polyno-
mially learnable class is polynomially learnable either from positive examples only or from negative
examples only.
In this section we prove a superpolynomial lower bound on the number of examples required for
learning monomials from negative examples. The proof can actually be tightened to give a strictly
exponential lower bound, and the proof technique has been generalized in [G89]. A necessary
condition for (general) learning from positive examples is given in [S90]. Our bound is information-
theoretic in the sense that it holds regardless of the computational complexity and hypothesis class
of the negative-only learning algorithm and is independent of any complexity-theoretic assumptions.
By duality, we obtain lower bounds on the number of examples needed for learning disjunctions
from positive examples, and it follows from our proof that the same bound holds for learning
from negative examples any class properly containing monomials (e.g., kCNF ) or for learning from
positive examples any class properly containing disjunctions (e.g., kDNF ). In fact, these results
hold even for the monotone versions of these classes. We apply our lower bound to show that the
polynomially learnable class kCNF _ kDNF requires both positive and negative examples, thus
answering negatively the question raised above.
Theorem 12 Let $A$ be a negative-only learning algorithm for the class of monotone monomials, and let $s^-_A$ denote the number of negative examples required by $A$. Then for any $n$ and for $\epsilon$ and $\delta$ sufficiently small constants, $s^-_A(n) = \Omega(2^{n/4})$.
Proof: Let us fix $\epsilon = \delta$ to be a sufficiently small constant, and assume for contradiction that $A$ is a negative-only learning algorithm for monotone monomials such that $s^-_A(n) \le 2^{n/4} = L(n)$. Call a monomial monotone dense if it is monotone and contains at least $\frac{n}{2}$ of the variables. Let $T$ be an ordered sequence of (not necessarily distinct) vectors from $\{0,1\}^n$ such that $|T| = L(n)$, and let $\mathcal{T}$ be the set of all such sequences. If $c$ is a monotone dense monomial, then define $u_c \in \{0,1\}^n$ to be the unique vector such that $u_c \in pos(c)$ and $u_c$ has the fewest bits set to 1. We say that $T \in \mathcal{T}$ is legal negative for $c$ if $T$ contains no vector $v$ such that $v \in pos(c)$.
We first define target distributions for a monotone dense monomial $c$. Let $D^+(u_c) = 1$ and let $D^-$ be uniform over $neg(c)$. Note that $u_c \notin pos(h)$ implies $e^+(h) = 1$, so any $\epsilon$-good $h$ must satisfy $u_c \in pos(h)$. For $T \in \mathcal{T}$ and $c$ a monotone dense monomial, define the predicate $P(T,c)$ to be 1 if and only if $T$ is legal negative for $c$ and, when $T$ is received by $A$ as a sequence of negative examples for $c$ from NEG, $A$ outputs an hypothesis $h_A$ such that $u_c \in pos(h_A)$. Note that this definition assumes that $A$ is deterministic. To allow probabilistic algorithms, we simply change the definition to $P(T,c) = 1$ if and only if $T$ is legal negative for $c$ and, when $T$ is given to $A$, $A$ outputs an hypothesis $h_A$ such that $u_c \in pos(h_A)$ with probability at least $\frac{1}{2}$, where the probability is taken over the coin tosses of $A$.
Suppose we draw $v$ uniformly at random from $\{0,1\}^n$. Fix any monotone dense monomial $c$. We have
$\mathbf{Pr}_{v \in \{0,1\}^n}(v \in pos(c)) \le 2^{n/2} \cdot \frac{1}{2^n} = \frac{1}{2^{n/2}}$
since at most $2^{n/2}$ vectors can satisfy a monotone dense monomial. Thus, if we draw $L(n)$ points uniformly at random from $\{0,1\}^n$, the probability that we draw some point satisfying $c$ is at most $\frac{L(n)}{2^{n/2}} \le \frac{1}{2}$ for $n$ large enough. By this analysis, we conclude that the number of $T \in \mathcal{T}$ that are legal negative for $c$ must be at least $\frac{|\mathcal{T}|}{2}$. Since $D^-$ is uniform and $A$ is a learning algorithm, at least $\frac{|\mathcal{T}|}{2}(1 - \delta) = \frac{|\mathcal{T}|}{2}(1 - \epsilon)$ of these must satisfy $P(T,c) = 1$. Let $M(n)$ be the number of monotone dense monomials over $n$ variables. Then summing over all monotone dense monomials, we obtain
$\frac{|\mathcal{T}|}{2}(1 - \epsilon)M(n) \le \sum_{T \in \mathcal{T}} N(T)$
where $N(T)$ is defined to be the number of monotone dense monomials satisfying $P(T,c) = 1$. From this inequality, and the fact that $N(T)$ is always at most $M(n)$, we conclude that at least $\frac{1}{8}$ of the $T \in \mathcal{T}$ must satisfy $N(T) \ge \frac{1}{8}(1 - \epsilon)M(n)$. Since $D^-$ is uniform, and since at least a fraction $1 - \frac{L(n)}{2^{n/2}}$, for large $n$, of the $T \in \mathcal{T}$ are legal negative for the target monomial $c$, $A$ has probability at least $\frac{1}{16}$ of receiving a $T$ with such a large $N(T)$ for $n$ large enough. But then the hypothesis $h_A$ output by $A$ has at least $\frac{1}{8}(1 - \epsilon)M(n)$ positive examples by definition of the predicate $P$. Since the target monomial $c$ has at most $2^{n/2}$ positive examples,
$e^-(h_A) \ge \frac{\frac{1}{8}(1 - \epsilon)M(n) - 2^{n/2}}{2^n}$
and this error must be less than $\epsilon$. But this cannot be true for $\epsilon$ a small enough constant and $n$ large enough. Thus, $A$ cannot achieve arbitrarily small error on monotone dense monomials, and the theorem follows.
An immediate consequence of Theorem 12 is that monomials are not polynomially learnable
from negative examples (regardless of the hypothesis class). This is in contrast to the fact that
monomials are polynomially learnable (by monomials) from positive examples [V84]. It also follows
that any class that contains the class of monotone monomials (e.g., kCNF ) is not polynomially
learnable from negative examples. Further, Theorem 12 implies that the polynomially learnable
classes kCNF _ kDNF and kCNF ^ kDNF of Corollaries 3 and 4 require both positive and negative
examples for polynomial learnability. The same is true for the class of decision lists studied by
Rivest [R87].
We also note that it is possible to obtain similar but weaker results with simpler proofs. For
instance, since 2-term DNF is not learnable by 2-term DNF unless NP = RP , it follows from
Theorem 1 that monomials are not polynomially learnable by monomials from negative examples
unless NP = RP . However, Theorem 12 gives a lower bound on the sample size for negative-only
algorithms that holds regardless of the hypothesis class, and is independent of any complexity-
theoretic assumption.
By duality we have the following lower bound on the number of positive examples needed for
learning monotone disjunctions.
Corollary 13 Let $A$ be a positive-only learning algorithm for the class of monotone disjunctions, and let $s^+_A$ denote the number of positive examples required by $A$. Then for any $n$ and for $\epsilon$ and $\delta$ sufficiently small constants, $s^+_A(n) = \Omega(2^{n/4})$.
We conclude this section with a brief discussion of how the lower bound of Theorem 12 and similar lower bounds can be used to obtain lower bounds on the expected number of examples for algorithms whose sample size may depend on coin flips and the actual sequence of examples received. We first define what we mean by the expected number of examples. Let $A$ be a (randomized) learning algorithm for a class $C$, let $\tilde{r}$ be an infinite sequence of bits (interpreted as the random coin tosses for $A$), and let $\tilde{w}$ be an infinite sequence of alternating positive and negative examples of some $c \in C$. Then we define $s_A(\epsilon, \delta, \tilde{r}, \tilde{w})$ to be the number of examples read by $A$ (where each request for an example results in either the next positive or next negative example being read from $\tilde{w}$) on inputs $\epsilon$, $\delta$, $\tilde{r}$ and $\tilde{w}$. The expected sample complexity of $A$ is then the supremum over all $c \in C$ and all target distributions $D^+$ and $D^-$ for $c$ of the expectation $\mathbf{E}(s_A(\epsilon, \delta, \tilde{r}, \tilde{w}))$, where the infinite bit sequence $\tilde{r}$ is drawn uniformly at random and the infinite example sequence $\tilde{w}$ is drawn randomly according to the target distributions $D^+$ and $D^-$.
The basic format of the proof of the lower bound of Theorem 12 (as well as those of [BEHW86] and [EHKV88]) is to give specific distributions such that a random sample of size at most $B$ has probability at least $p$ of causing any learning algorithm to fail to output an $\epsilon$-good hypothesis. To obtain a lower bound on the expected sample complexity, let $A$ be any learning algorithm, and let $q$ be the probability that $A$ draws fewer than $B$ examples when run on the same distributions that were given to prove the deterministic lower bound. Then the probability that algorithm $A$ fails to output an $\epsilon$-good hypothesis is bounded below by $pq$. Since $A$ is a learning algorithm we must have $pq \le \delta$, so $q \le \frac{\delta}{p}$. This gives a lower bound of $(1 - \frac{\delta}{p})B$ on the expected sample complexity. Since the value of $p$ proved in Theorem 12 is $\frac{1}{16}$, we immediately obtain an asymptotic lower bound of $\Omega(n^k)$ for any constant $k$ on the expected sample complexity of any negative-only learning algorithm for monomials.
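The argument of the preceding paragraph can be summarized in a single line; the display below merely restates the bounds just derived, with $p$, $q$ and $B$ as above:
$$\mathbf{Pr}[\,A \mbox{ fails}\,] \ge pq, \qquad pq \le \delta \;\Longrightarrow\; q \le \frac{\delta}{p}, \qquad \mathbf{E}[\mbox{examples drawn}] \ge (1 - q)B \ge \Bigl(1 - \frac{\delta}{p}\Bigr)B.$$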
5.1 A polynomial-time weak learning algorithm for all monotone Boolean func-
tions under uniform distributions
The key to our algorithm will be the existence of a single input bit that is slightly correlated with
the output of the target function.
For $T \subseteq \{0,1\}^n$ and $u, v \in \{0,1\}^n$ define
$u \oplus v = (u_1 \oplus v_1, \ldots, u_n \oplus v_n)$
and $T \oplus v = \{u \oplus v : u \in T\}$. For $1 \le i \le n$ let $e(i)$ be the vector with the $i$th bit set to 1 and all other bits set to 0.
The following lemma is due to Aldous [A86].
Lemma 14 [A86] Let $T \subseteq \{0,1\}^n$ be such that $|T| \le \frac{2^n}{2}$. Then for some $1 \le i \le n$,
$|T \oplus e(i) - T| \ge \frac{|T|}{2n}.$
Theorem 15 The class of all monotone Boolean functions is polynomially weakly learnable under uniform $D^+$ and uniform $D^-$.
Proof: Let $f$ be any monotone Boolean function on $\{0,1\}^n$. First assume that $|pos(f)| \le \frac{2^n}{2}$. For $v \in \{0,1\}^n$ and $1 \le i \le n$, let $v_{i=b}$ denote $v$ with the $i$th bit set to $b \in \{0,1\}$, i.e., $v_i = b$.
Now suppose that $v \in \{0,1\}^n$ is such that $v \in neg(f)$ and $v_j = 1$ for some $1 \le j \le n$. Then $v_{j=0} \in neg(f)$ by monotonicity of $f$. Thus for any $1 \le j \le n$ we must have
$\mathbf{Pr}_{v \in D^-}(v_j = 1) \le \frac{1}{2} \qquad (2)$
since $D^-$ is uniform over $neg(f)$.
Let $e(i)$ be the vector satisfying $|pos(f) \oplus e(i) - pos(f)| \ge \frac{|pos(f)|}{2n}$ in Lemma 14 above. Let $v \in \{0,1\}^n$ be any vector satisfying $v \in pos(f)$ and $v_i = 0$. Then $v_{i=1} \in pos(f)$ by monotonicity of $f$. However, by Lemma 14, the number of $v \in pos(f)$ such that $v_i = 1$ and $v_{i=0} \in neg(f)$ is at least $\frac{|pos(f)|}{2n}$. Thus, we have
$\mathbf{Pr}_{v \in D^+}(v_i = 1) \ge \left(1 - \frac{1}{2n}\right)\frac{1}{2} + \frac{1}{2n} = \frac{1}{2} + \frac{1}{4n}. \qquad (3)$
Similarly, if $|neg(f)| \le \frac{2^n}{2}$, then for any $1 \le j \le n$ we must have
$\mathbf{Pr}_{v \in D^+}(v_j = 0) \le \frac{1}{2} \qquad (4)$
and for some $1 \le i \le n$,
$\mathbf{Pr}_{v \in D^-}(v_i = 0) \ge \frac{1}{2} + \frac{1}{4n}. \qquad (5)$
Note that either $|pos(f)| \le \frac{2^n}{2}$ or $|neg(f)| \le \frac{2^n}{2}$.
We use these differences in probabilities to construct a polynomial-time weak learning algorithm $A$. $A$ first assumes $|pos(f)| \le \frac{2^n}{2}$; if this is the case, then Equations 2 and 3 must hold. $A$ then attempts to find an index $1 \le k \le n$ satisfying
$\mathbf{Pr}_{v \in D^+}(v_k = 1) \ge \frac{1}{2} + \frac{1}{8n}. \qquad (6)$
The existence of such a $k$ is guaranteed by Equation 3. $A$ finds such a $k$ with high probability by sampling POS enough times according to Fact CB1 and Fact CB2 to obtain an estimate $p$ of $\mathbf{Pr}_{v \in D^+}(v_k = 1)$ satisfying
$\mathbf{Pr}_{v \in D^+}(v_k = 1) - \frac{1}{16n} < p < \mathbf{Pr}_{v \in D^+}(v_k = 1) + \frac{1}{16n}.$
If $A$ successfully identifies an index $k$ satisfying Equation 6, then the hypothesis $h_A$ is defined as follows: given an unlabeled input vector $v$, $h_A$ flips a biased coin and with probability $\frac{1}{32n}$ classifies $v$ as negative. With probability $1 - \frac{1}{32n}$, $h_A$ classifies $v$ as positive if $v_k = 1$ and as negative if $v_k = 0$. It is easy to verify by Equations 2 and 6 that this is a randomized hypothesis meeting the conditions of weak learnability.
If $A$ is unable to identify an index $k$ satisfying Equation 6, then $A$ assumes that $|neg(f)| \le \frac{2^n}{2}$, and in a similar fashion proceeds to form a hypothesis $h_A$ based on the differences in probability of Equations 4 and 5.
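A sketch of the POS-based branch of this weak learner, assuming a hypothetical `draw_pos()` oracle that returns $n$-bit vectors; the sample size is illustrative rather than the exact Chernoff-bound count, and the symmetric NEG-based branch is omitted.
```python
import random

def weak_learn_monotone(draw_pos, n, samples=None):
    """Find an index k with empirical Pr_{D+}[v_k = 1] >= 1/2 + 1/(8n),
    then return the randomized hypothesis described in the proof of
    Theorem 15 (the branch assuming |pos(f)| <= 2^n / 2)."""
    m = samples if samples is not None else 512 * n * n   # illustrative size
    counts = [0] * n
    for _ in range(m):
        v = draw_pos()
        for i in range(n):
            counts[i] += v[i]
    k = max(range(n), key=lambda i: counts[i])
    if counts[k] / m < 0.5 + 1.0 / (8 * n):
        return None   # fall back to the NEG-based branch (not shown)

    def hypothesis(v):
        # With probability 1/(32n) classify as negative outright;
        # otherwise predict the value of bit k.
        if random.random() < 1.0 / (32 * n):
            return 0
        return 1 if v[k] == 1 else 0
    return hypothesis
```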
It is shown in [BEHW86] (see also [EHKV88]) that the number of examples needed for learning (and therefore the computation time required) is bounded below by the Vapnik-Chervonenkis dimension of the target class; furthermore, this lower bound is proved using the uniform distribution over a shattered set and holds even for the weak learning model in the case of superpolynomial Vapnik-Chervonenkis dimension. From this it follows that the class of monotone Boolean functions is not polynomially weakly learnable under arbitrary target distributions (since the Vapnik-Chervonenkis dimension of this class is exponential in $n$) and that the class of all Boolean functions is not polynomially weakly learnable under uniform target distributions (since the entire set $\{0,1\}^n$ is shattered). It can also be shown that the class of monotone Boolean functions is not polynomially (strongly) learnable under uniform target distributions. To see this, consider only those monotone functions defined by an arbitrary set of vectors with exactly half the bits on. The positive examples are the vectors that can be obtained by choosing one of the vectors in the defining set and turning 0 or more of its off bits on. This is clearly a monotone function, and it is not possible to achieve an error $\epsilon$ smaller than roughly $\frac{1}{2\sqrt{n}}$ on the uniform distribution: the vectors with half the bits on constitute an $\Omega(\frac{1}{\sqrt{n}})$ fraction of the distribution, and the target function is truly random on these vectors. Thus, Theorem 15 is optimal in the sense that generalization in any direction (uniform distributions to arbitrary distributions, weak learning to strong learning, or monotone functions to arbitrary functions)
results in intractability.
Analysis of Substep 1.1. For each $i$, if the variable $x_i$ is not in any monomial of $f$,
$\mathbf{Pr}_{v \in D^+}(v_i = 0) = \mathbf{Pr}_{v \in D^+}(v_i = 1) = \frac{1}{2}$
since $D^+$ is uniform. Now suppose that variable $x_i$ appears in a significant monomial $m$ of $f$. Note that, since the length of a significant monomial is at least $r \log n$, we have
$\frac{1}{2} \ge \mathbf{Pr}_{v \in D^+}(v_i = 1 \mid v \in neg(m)) \ge \frac{2^{r \log n - 1} - 1}{2^{r \log n} - 1}.$
Then we have
$\mathbf{Pr}_{v \in D^+}(v_i = 1) = \mathbf{Pr}_{v \in D^+}(v_i = 1 \mid v \in pos(m))\,\mathbf{Pr}_{v \in D^+}(v \in pos(m)) + \mathbf{Pr}_{v \in D^+}(v_i = 1 \mid v \in neg(m))\,\mathbf{Pr}_{v \in D^+}(v \in neg(m)) \qquad (8)$
$\ge \mathbf{Pr}_{v \in D^+}(v \in pos(m)) + \frac{2^{r \log n - 1} - 1}{2^{r \log n} - 1}\left(1 - \mathbf{Pr}_{v \in D^+}(v \in pos(m))\right) \ge \frac{1}{2} + \frac{1}{8n^{d+1}} - O\left(\frac{1}{n^{2d-1}}\right).$
Thus there is a difference of $\Omega(\frac{1}{n^{d+1}})$ between the probability that a variable appearing in a significant monomial is set to 1 and the probability that a variable not appearing in $f$ is set to 1. Using Facts CB1 and CB2, we can determine with high probability whether $x_i$ appears in a significant monomial of $f$ by drawing a polynomial number of examples from POS. Notice that if $x_i$ appears in a monomial that is not significant then it really does not matter: in this process we might also pick up some of the variables that occur in insignificant monomials, and miss some others, but these variables do not matter to us and this will not affect Substeps 1.2 and 1.3.
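A sketch of Substep 1.1, with a hypothetical `draw_pos()` oracle standing in for POS; the threshold and sample size are left as parameters, since their exact values come from the $\Omega(\frac{1}{n^{d+1}})$ gap and Facts CB1 and CB2.
```python
def find_significant_variables(draw_pos, n, threshold, samples):
    """Return the indices i whose empirical Pr_{D+}[v_i = 1] exceeds
    1/2 + threshold; by the analysis above these are, with high
    probability, the variables of significant monomials (possibly plus
    some variables of insignificant monomials, which do not matter)."""
    counts = [0] * n
    for _ in range(samples):
        v = draw_pos()
        for i in range(n):
            counts[i] += v[i]
    return [i for i in range(n) if counts[i] / samples > 0.5 + threshold]
```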
Analysis of Substep 1.2. For each pair of variables xi and xj that appear in some monomial of
f (as decided in Substep 1.1), we now decide whether they appear in the same monomial of f .
Lemma 17 If variables $x_i$ and $x_j$ appear in the same monomial of $f$, then
$\mathbf{Pr}_{v \in D^+}(v_i = 1 \mbox{ or } v_j = 1) = \frac{3}{4} + \frac{1}{2}\left(\mathbf{Pr}_{v \in D^+}(v_i = 1) - \frac{1}{2}\right) \pm O\left(\frac{1}{n^{r-1}}\right).$
Proof: Since $x_i$ and $x_j$ appear in the same monomial of $f$ and appear only once in $f$, we have $\mathbf{Pr}_{v \in D^+}(v_i = 1) = \mathbf{Pr}_{v \in D^+}(v_j = 1)$ since $D^+$ is uniform. Let $m$ be the monomial of $f$ in which $x_i$ and $x_j$ appear, and let $E_1$ be the event that $m$ is satisfied. Let $E_2$ be the event that at least one monomial of $f$ besides (but possibly in addition to) $m$ is satisfied. Note that $\mathbf{Pr}_{v \in D^+}(E_1 \cup E_2) = 1$. Using the facts that, since $D^+$ is uniform, $\mathbf{Pr}_{v \in D^+}(E_1 \cap E_2) \le \frac{1}{n^{r-1}}$ (because given that a positive example already satisfies a monomial of $f$, the remaining variables are independent and uniformly distributed) and $\mathbf{Pr}_{v \in D^+}(v_i = 1) = \mathbf{Pr}_{v \in D^+}(E_1) + \frac{1}{2}(1 - \mathbf{Pr}_{v \in D^+}(E_1))$, and by Equation 7, we have
$\mathbf{Pr}_{v \in D^+}(v_i = 1 \mbox{ or } v_j = 1)$
$= \mathbf{Pr}_{v \in D^+}(v_i = 1 \mbox{ or } v_j = 1 \mid E_1)\mathbf{Pr}_{v \in D^+}(E_1) + \mathbf{Pr}_{v \in D^+}(v_i = 1 \mbox{ or } v_j = 1 \mid E_2)\mathbf{Pr}_{v \in D^+}(E_2) - O\left(\frac{1}{n^{r-1}}\right)$
$= \mathbf{Pr}_{v \in D^+}(E_1) + \frac{3}{4}(1 - \mathbf{Pr}_{v \in D^+}(E_1)) \pm O\left(\frac{1}{n^{r-1}}\right)$
$= \frac{3}{4} + \frac{1}{4}\mathbf{Pr}_{v \in D^+}(E_1) \pm O\left(\frac{1}{n^{r-1}}\right)$
$= \frac{3}{4} + \frac{1}{2}\left(\mathbf{Pr}_{v \in D^+}(v_i = 1) - \frac{1}{2}\right) \pm O\left(\frac{1}{n^{r-1}}\right).$ (Lemma 17)
Lemma 18 If variables $x_i$ and $x_j$ appear in different monomials of $f$, then
$\mathbf{Pr}_{v \in D^+}(v_i = 1 \mbox{ or } v_j = 1) = \frac{3}{4} + \frac{1}{2}\left(\mathbf{Pr}_{v \in D^+}(v_i = 1) - \frac{1}{2}\right) + \frac{1}{2}\left(\mathbf{Pr}_{v \in D^+}(v_j = 1) - \frac{1}{2}\right) \pm O\left(\frac{1}{n^{r-1}}\right).$
Proof: Let $E_1$ be the event that the monomial $m_1$ of $f$ containing $x_i$ is satisfied, and $E_2$ the event that the monomial $m_2$ containing $x_j$ is satisfied. Let $E_3$ be the event that some monomial other than (but possibly in addition to) $m_1$ and $m_2$ is satisfied. Note that $\mathbf{Pr}_{v \in D^+}(E_1 \cup E_2 \cup E_3) = 1$. Then similar to the proof of Lemma 17, we have
$\mathbf{Pr}_{v \in D^+}(v_i = 1 \mbox{ or } v_j = 1)$
$= \mathbf{Pr}_{v \in D^+}(v_i = 1 \mbox{ or } v_j = 1 \mid E_1)\mathbf{Pr}_{v \in D^+}(E_1) + \mathbf{Pr}_{v \in D^+}(v_i = 1 \mbox{ or } v_j = 1 \mid E_2)\mathbf{Pr}_{v \in D^+}(E_2) + \mathbf{Pr}_{v \in D^+}(v_i = 1 \mbox{ or } v_j = 1 \mid E_3)\mathbf{Pr}_{v \in D^+}(E_3) - O\left(\frac{1}{n^{r-1}}\right)$
$= \mathbf{Pr}_{v \in D^+}(E_1) + \mathbf{Pr}_{v \in D^+}(E_2) + \frac{3}{4}\mathbf{Pr}_{v \in D^+}(E_3) - O\left(\frac{1}{n^{r-1}}\right)$
$= \mathbf{Pr}_{v \in D^+}(E_1) + \mathbf{Pr}_{v \in D^+}(E_2) + \frac{3}{4}\left(1 - \mathbf{Pr}_{v \in D^+}(E_1) - \mathbf{Pr}_{v \in D^+}(E_2)\right) \pm O\left(\frac{1}{n^{r-1}}\right)$
$= \frac{3}{4} + \frac{1}{4}\left(\mathbf{Pr}_{v \in D^+}(E_1) + \mathbf{Pr}_{v \in D^+}(E_2)\right) \pm O\left(\frac{1}{n^{r-1}}\right)$
$= \frac{3}{4} + \frac{1}{2}\left(\mathbf{Pr}_{v \in D^+}(v_i = 1) - \frac{1}{2}\right) + \frac{1}{2}\left(\mathbf{Pr}_{v \in D^+}(v_j = 1) - \frac{1}{2}\right) \pm O\left(\frac{1}{n^{r-1}}\right).$ (Lemma 18)
From Equation 8, Lemma 17, Lemma 18 and the fact that if $x_i$ and $x_j$ appear in the same monomial of $f$, then $\mathbf{Pr}_{v \in D^+}(v_i = 1) = \mathbf{Pr}_{v \in D^+}(v_j = 1)$, we have that there is a difference of $\Omega(\frac{1}{n^{d+1}} - \frac{1}{n^{r-1}})$ between the value of $\mathbf{Pr}_{v \in D^+}(v_i = 1 \mbox{ or } v_j = 1)$ in the two cases addressed by Lemmas 17 and 18. Thus we can determine whether $x_i$ and $x_j$ appear in the same monomial by drawing a polynomial number of examples from POS, using Facts CB1 and CB2.
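Substep 1.2 can be sketched in the same style: for each candidate pair, estimate $\mathbf{Pr}_{v \in D^+}(v_i = 1 \mbox{ or } v_j = 1)$ and compare it against the two values predicted by Lemmas 17 and 18. The oracle `draw_pos()` and the decision rule below are illustrative stand-ins for the exact Chernoff-bound test.
```python
def same_monomial(draw_pos, i, j, p_i, p_j, samples):
    """Decide whether x_i and x_j lie in the same monomial of f.

    p_i and p_j are previously estimated values of Pr_{D+}[v_i = 1] and
    Pr_{D+}[v_j = 1]; we estimate Pr_{D+}[v_i = 1 or v_j = 1] and pick
    whichever prediction (Lemma 17 vs. Lemma 18) is closer."""
    hits = 0
    for _ in range(samples):
        v = draw_pos()
        if v[i] == 1 or v[j] == 1:
            hits += 1
    p_or = hits / samples
    same_pred = 0.75 + 0.5 * (p_i - 0.5)                        # Lemma 17
    diff_pred = 0.75 + 0.5 * (p_i - 0.5) + 0.5 * (p_j - 0.5)    # Lemma 18
    return abs(p_or - same_pred) <= abs(p_or - diff_pred)
```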
In Step 2, we draw a polynomial number of examples from both POS and NEG to test whether the hypothesis $h_A$ produced in Step 1 is $\epsilon$-good, again using Facts CB1 and CB2. If it is determined that $h_A$ is not $\epsilon$-good, then $A$ guesses that the assumption made in Step 1 is not correct, and therefore that there is a monomial in $f$ which is of length at most $r \log n$. This implies that all the monomials of length larger than $2r \log n$ are not significant. Therefore in Step 3 we assume that all the monomials in $f$ are shorter than $2r \log n$. We use only the negative examples.
Analysis of Substep 3.1. If variable $x_i$ does not appear in any monomial of $f$, then
$\mathbf{Pr}_{v \in D^-}(v_i = 0) = \frac{1}{2} \qquad (9)$
since $D^-$ is uniform.
Lemma 19 If variable $x_i$ appears in a significant monomial of $f$, then
$\mathbf{Pr}_{v \in D^-}(v_i = 0) \ge \frac{1}{2} + \frac{1}{2(n^{2r} - 1)}.$
Proof: Let $l$ be the number of literals in the monomial $m$ of $f$ that variable $x_i$ appears in. Then in a vector $v$ drawn at random from $D^-$, if some bit of $v$ is set such that $m$ is already not satisfied, the remaining bits are independent and uniformly distributed. Thus
$\mathbf{Pr}_{v \in D^-}(v_i = 0) = \frac{2^{l-1}}{2^l - 1} = \frac{1}{2} + \frac{1}{2(2^l - 1)}.$
Since $l \le 2r \log n$, the claim follows. (Lemma 19)
By Equation 9 and Lemma 19, there is a difference of $\Omega(\frac{1}{n^{2r}})$ between the probability that a variable in a significant monomial of $f$ is set to 0 and the probability that a variable not appearing in $f$ is set to 0. Thus we can draw a polynomial number of examples from NEG, and decide whether variable $x_i$ appears in some significant monomial of $f$, using Facts CB1 and CB2.
Analysis of Substep 3.2. We have to decide whether variables $x_i$ and $x_j$ appear in the same monomial of $f$, given that each appears in some monomial of $f$.
Lemma 20 If variables $x_i$ and $x_j$ are not in the same monomial of $f$, then
$\mathbf{Pr}_{v \in D^-}(v_i = 0 \mbox{ and } v_j = 0) = \mathbf{Pr}_{v \in D^-}(v_i = 0)\,\mathbf{Pr}_{v \in D^-}(v_j = 0).$
Proof: If $x_i$ and $x_j$ do not appear in the same monomial, then they are independent of each other with respect to $D^-$ since each variable appears only once in $f$. (Lemma 20)
Lemma 21 If variables $x_i$ and $x_j$ appear in the same monomial of $f$, then
$\mathbf{Pr}_{v \in D^-}(v_i = 0 \mbox{ and } v_j = 0) = \frac{1}{2}\mathbf{Pr}_{v \in D^-}(v_i = 0).$
Proof:
$\mathbf{Pr}_{v \in D^-}(v_i = 0 \mbox{ and } v_j = 0) = \mathbf{Pr}_{v \in D^-}(v_i = 0)\,\mathbf{Pr}_{v \in D^-}(v_j = 0 \mid v_i = 0).$
But $\mathbf{Pr}_{v \in D^-}(v_j = 0 \mid v_i = 0) = \frac{1}{2}$. (Lemma 21)
By Lemmas 19, 20 and 21 we have that there is a difference of $\Omega(\frac{1}{n^{2r}})$ in the value of $\mathbf{Pr}_{v \in D^-}(v_i = 0 \mbox{ and } v_j = 0)$ in the two cases addressed by Lemmas 20 and 21, since
$\mathbf{Pr}_{v \in D^-}(v_i = 0) \ge \frac{1}{2} + \frac{1}{2(n^{2r} - 1)}.$
Thus we can test whether $x_i$ and $x_j$ appear in the same monomial of $f$ by drawing a polynomial number of examples from NEG, using Facts CB1 and CB2. This completes the proof of Theorem 16.
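The NEG-based test of Substep 3.2 admits the same kind of sketch: it compares the joint frequency of zeros against the product predicted by Lemma 20 and the halved marginal predicted by Lemma 21. The oracle `draw_neg()` is a hypothetical stand-in for NEG and the decision rule is illustrative.
```python
def same_monomial_neg(draw_neg, i, j, samples):
    """Decide from negative examples whether x_i and x_j share a monomial."""
    zi = zj = zij = 0
    for _ in range(samples):
        v = draw_neg()
        zi += (v[i] == 0)
        zj += (v[j] == 0)
        zij += (v[i] == 0 and v[j] == 0)
    p_i, p_j, p_ij = zi / samples, zj / samples, zij / samples
    independent_pred = p_i * p_j        # Lemma 20: different monomials
    shared_pred = 0.5 * p_i             # Lemma 21: same monomial
    return abs(p_ij - shared_pred) <= abs(p_ij - independent_pred)
```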
The results of [PV88] show that k-term DNF is not learnable by k-term DNF unless NP =
RP . However, the algorithm of Theorem 16 outputs an hypothesis with at most the same number
of terms as the target formula. This is an example of a class for which learning under arbitrary
target distributions is NP -hard, but learning under uniform target distributions is tractable.
all drawn from $D^+$ (respectively, $D^-$) are given to $h_A$, then $h_A$ classifies this group as positive (negative) with probability at least $1 - \epsilon$. If $A$ runs in polynomial time, we say that $C$ is polynomially group learnable.
Theorem 22 Let C be a polynomially evaluatable parameterized Boolean representation class.
Then C is polynomially group learnable if and only if C is polynomially weakly learnable.
Proof: (If) Let $A$ be a polynomial-time weak learning algorithm for $C$. We construct a polynomial-time group learning algorithm $A'$ for $C$ as follows: $A'$ first simulates algorithm $A$ and obtains an hypothesis $h_A$ that with probability $1 - \delta$ has accuracy $\frac{1}{2} + \frac{1}{p(|c|, n)}$ for some polynomial $p$. Now given $m$ examples that are either all drawn from $D^+$ or all drawn from $D^-$, $A'$ evaluates $h_A$ on each example. Suppose all $m$ points are drawn from $D^+$. Assuming that $h_A$ in fact has accuracy $\frac{1}{2} + \frac{1}{p(|c|, n)}$, the probability that $h_A$ is positive on fewer than $\frac{1}{2}$ of the examples can be made smaller than $\frac{\epsilon}{2}$ by choosing $m$ to be a large enough polynomial in $\frac{1}{\epsilon}$ and $p(|c|, n)$ using Fact CB1. On the other hand, if all $m$ examples are drawn from $D^-$ then the probability that $h_A$ is positive on more than $\frac{1}{2}$ of the examples can be made smaller than $\frac{\epsilon}{2}$ for $m$ large enough by Fact CB2. Thus, if $h_A$ is positive on more than $\frac{m}{2}$ of the $m$ examples, $A'$ guesses that the sample is positive; otherwise, $A'$ guesses that the sample is negative. The probability of misclassifying the sample is then at most $\epsilon$, so $A'$ is a group learning algorithm.
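The "if" direction is simply a majority vote of the weak hypothesis over the group; a minimal sketch, with `weak_h` standing for $h_A$ and `group` for the $m$ examples (both names are illustrative).
```python
def classify_group(weak_h, group):
    """Label a group of examples, all positive or all negative, by taking
    a majority vote of the weak hypothesis over the group."""
    positives = sum(1 for x in group if weak_h(x) == 1)
    return 1 if positives > len(group) / 2 else 0
```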
(Only if) Let A be a polynomial-time group learning algorithm for C . We use A as a subroutine
in a polynomial-time weak learning algorithm $A'$. Suppose algorithm $A$ is run to obtain with high probability an $\frac{\epsilon}{2}$-good hypothesis $h_A$ for groups of size $l = p(\frac{2}{\epsilon}, \frac{1}{\delta}, |c|, n)$ all drawn from $D^+$ or all drawn from $D^-$, for some polynomial $p$.
Note that although $h_A$ is guaranteed to produce an output only when given $l$ positive examples or $l$ negative examples, the probability that $h_A$ produces an output when given a mixture of positive and negative examples is well defined. Thus for $0 \le i \le l$, let $q_i$ denote the probability that $h_A$ is positive when given as input a group whose first $i$ examples are drawn from $D^+$ and whose last $l - i$ examples are drawn from $D^-$. Then since $h_A$ is an $\frac{\epsilon}{2}$-good hypothesis we have $q_0 \le \frac{\epsilon}{2}$ and $q_l \ge 1 - \frac{\epsilon}{2}$. Thus,
$(q_l - q_0) \ge (1 - \frac{\epsilon}{2}) - \frac{\epsilon}{2} = 1 - \epsilon.$
Then
$1 - \epsilon \le (q_l - q_0) = (q_l - q_{l-1}) + (q_{l-1} - q_{l-2}) + \cdots + (q_1 - q_0).$
This implies that for some $1 \le j \le l$, $(q_j - q_{j-1}) \ge \frac{1 - \epsilon}{l} \ge \frac{1}{2l}$ for $\epsilon \le \frac{1}{2}$.
Let us now fix $\epsilon = \frac{1}{2}$. Algorithm $A'$ first runs algorithm $A$ with accuracy parameter $\frac{\epsilon}{2}$. $A'$ next obtains an estimate $q'_i$ of $q_i$ for each $0 \le i \le l$ that is accurate within an additive factor of $\frac{1}{16l}$, that is,
$q_i - \frac{1}{16l} \le q'_i \le q_i + \frac{1}{16l}. \qquad (10)$
This is done by repeatedly evaluating $h_A$ on groups of $l$ examples in which the first $i$ examples are drawn from $D^+$ and the rest are drawn from $D^-$, and computing the fraction of runs for which $h_A$ evaluates as positive. These estimates can be obtained in time polynomial in $l$ and $\frac{1}{\delta}$ with high probability using Facts CB1 and CB2.
Now for the $j$ such that $(q_j - q_{j-1}) \ge \frac{1}{2l}$, the estimates $q'_j$ and $q'_{j-1}$ will have a difference of at least $\frac{1}{4l}$ with high probability. Furthermore, for any $i$, if $(q_i - q_{i-1}) \le \frac{1}{8l}$ then with high probability the estimates satisfy $(q'_i - q'_{i-1}) \le \frac{1}{4l}$. Let $k$ be the index such that $(q'_k - q'_{k-1})$ is the largest separation between adjacent estimates. Then we have argued that with high probability, $(q_k - q_{k-1}) \ge \frac{1}{8l}$.
The intermediate hypothesis h of A' is now defined as follows: given an example whose classification is unknown, h constructs l input examples for h_A consisting of k - 1 examples drawn from D^+, the unknown example, and l - k examples drawn from D^-. The prediction of h is then the same as the prediction of h_A on this constructed group. The probability that h predicts positive when the unknown example is drawn from D^+ is then q_k, and the probability that h predicts positive when the unknown example is drawn from D^- is q_{k-1}.
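The intermediate hypothesis h can be pictured as follows (a sketch under the same hypothetical interfaces as above):

    def intermediate_h(group_h, draw_pos, draw_neg, k, l, x):
        """Predict the label of x by embedding it at position k of a group:
        k - 1 fresh positive examples, then x, then l - k fresh negative
        examples.  The group is positive with probability q_k if x ~ D^+,
        and with probability q_{k-1} if x ~ D^-."""
        group = ([draw_pos() for _ in range(k - 1)] + [x]
                 + [draw_neg() for _ in range(l - k)])
        return group_h(group)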
The first problem with h is that new examples need to be drawn from D^+ and D^- each time an unknown point is classified. This sampling is eliminated as follows: for U a fixed sequence of k - 1 positive examples of the target representation and V a fixed sequence of l - k negative examples, define h(U, V) to be the hypothesis h described above using the fixed constructed samples consisting of U and V. Let p^+(U, V) be the probability that h(U, V) classifies a random example drawn from D^+ as positive, and let p^-(U, V) be the probability that h(U, V) classifies a random example drawn from D^- as positive. Then for U drawn randomly according to D^+ and V drawn randomly according to D^-, define the random variable R(U, V) = p^+(U, V) - p^-(U, V). Since the expectation of p^+(U, V) is q_k and the expectation of p^-(U, V) is q_{k-1}, the expectation of R obeys E(R(U, V)) ≥ 1/(8l) = 2γ, where we set γ = 1/(16l).
However, it is always true that R(U, V) ≤ 1. Thus, let r be the probability that R(U, V) ≥ γ. Then we have r + (1 - r)γ ≥ E(R(U, V)) ≥ 2γ. Solving, we obtain r ≥ γ/(1 - γ) ≥ γ = 1/(16l).
Thus, to fix this first problem, A' repeatedly draws U from D^+ and V from D^- until R(U, V) ≥ γ. By the above argument, this takes only O((1/γ) log(1/δ)) draws to succeed with probability exceeding 1 - δ. Note that A' can test whether R(U, V) ≥ γ is satisfied in polynomial time by sampling.
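The search for a good hardwired pair can be sketched as follows; estimate_R is a hypothetical subroutine that estimates R(U, V) = p^+(U, V) - p^-(U, V) by sampling, and gamma stands for 1/(16l).

    def find_hardwired_pair(draw_pos, draw_neg, estimate_R, k, l, gamma, max_draws):
        """Repeatedly draw U (k - 1 positives) and V (l - k negatives) until
        the estimated advantage R(U, V) is at least gamma.  Since
        Pr[R(U, V) >= gamma] >= gamma, of order (1/gamma) log(1/delta) draws
        suffice with probability at least 1 - delta."""
        for _ in range(max_draws):
            U = [draw_pos() for _ in range(k - 1)]
            V = [draw_neg() for _ in range(l - k)]
            if estimate_R(U, V) >= gamma:
                return U, V
        raise RuntimeError("no suitable (U, V) found; increase max_draws")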
The (almost) final hypothesis h(U_0, V_0) simply "hardwires" the successful U_0 and V_0, leaving
one input free for the example whose classification is to be predicted.
The second problem is that we still need to "center" the bias of the hypothesis h(U_0, V_0), that is, to modify it to provide a slight advantage over random guessing on both the distributions D^+ and D^-. Since U_0 and V_0 are now fixed, let us simplify our notation and write p^+ = p^+(U_0, V_0) and p^- = p^-(U_0, V_0). We assume that 1/2 ≥ p^+ > p^-; the other cases can be handled by a similar analysis (although for the case p^+ > 1/2 > p^-, it may not be necessary to center the bias). Recall that we know the separation (p^+ - p^-) is "significant" (that is, at least γ = 1/(16l)).
To center the bias, the final hypothesis h' of A' will be randomized and behave as follows: on any input x, h' first flips a coin whose probability of heads is p*, where p* will be determined by the analysis. If the outcome is heads, h' immediately outputs 1. If the outcome is tails, h' uses h(U_0, V_0) to predict the label of x.
Now if x is drawn randomly from D^+, the probability that h' classifies x as positive is p* + (1 - p*)p^+. If x is drawn randomly from D^-, the probability that h' classifies x as positive is p* + (1 - p*)p^-. To center the bias, the desired conditions on p* are
    p* + (1 - p*)p^+ = 1/2 + β

and

    p* + (1 - p*)p^- = 1/2 - β

for some "significant" quantity β. Solving, we obtain

    p* = (1 - p^+ - p^-) / (2 - p^+ - p^-)

and

    β = (1 - p*)(p^+ - p^-) / 2.
Now it is easily verified that p* ≤ 1/2, and thus β ≥ (p^+ - p^-)/4 ≥ 1/(64l). Thus from sufficiently accurate estimates of p^+ and p^-, A' can determine an accurate approximation to p* and thus center the bias.
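The bias-centering step is simple arithmetic on p^+ and p^-; the sketch below (not from the original text) computes p* and the resulting advantage β and can be used to check the two centering conditions numerically.

    def center_bias(p_plus, p_minus):
        """Given 1/2 >= p_plus > p_minus, return the coin bias p_star and the
        advantage beta such that
            p_star + (1 - p_star) * p_plus  == 1/2 + beta
            p_star + (1 - p_star) * p_minus == 1/2 - beta
        """
        p_star = (1 - p_plus - p_minus) / (2 - p_plus - p_minus)
        beta = (1 - p_star) * (p_plus - p_minus) / 2
        return p_star, beta

    # For example, p_plus = 0.4 and p_minus = 0.3 give p_star = 3/13 and
    # beta = 1/26, and indeed 3/13 + (10/13) * 0.4 = 0.5 + 1/26.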
7 Concluding Remarks
A great deal of progress has been made towards understanding the complexity of learning in our
model since the results presented here were first announced. The interested reader is encouraged
to consult the proceedings of the annual workshops on Computational Learning Theory (Morgan
Kaufmann Publishers) for further investigation.
Perhaps the most important remaining open problem suggested by this research is whether the
class of polynomial size DNF formulae is learnable in polynomial time in our model.
8 Acknowledgements
We would like to thank Lenny Pitt for helpful discussions on this paper and Umesh Vazirani for
many ideas on learning under uniform distributions. We also thank the two very careful referees
who have corrected many errors and suggested many improvements.
References
[A86] D. Aldous.
On the Markov chain simulation method for uniform combinatorial distributions and
simulated annealing.
University of California at Berkeley Statistics Department,
technical report number 60, 1986.
[AV79] D. Angluin, L.G. Valiant.
Fast probabilistic algorithms for Hamiltonian circuits and matchings.
Journal of Computer and System Sciences,
18, 1979, pp. 155-193.
[BI88] G.M. Benedek, A. Itai.
Learnability by fixed distributions.
Proceedings of the 1988 Workshop on Computational Learning Theory,
Morgan Kaufmann Publishers, 1988, pp. 80-90.
[BEHW86] A. Blumer, A. Ehrenfeucht, D. Haussler, M. Warmuth.
Classifying learnable geometric concepts with the Vapnik-Chervonenkis dimension.
Proceedings of the 18th A.C.M. Symposium on the Theory of Computing,
1986, pp. 273-282.
[C52] H. Chernoff.
A measure of asymptotic efficiency for tests of a hypothesis based on the sum of
observations.
Annals of Mathematical Statistics,
23, 1952, pp. 493-509.
[EHKV88] A. Ehrenfeucht, D. Haussler, M. Kearns, L.G. Valiant.
A general lower bound on the number of examples needed for learning.
Proceedings of the 1988 Workshop on Computational Learning Theory,
Morgan Kaufmann Publishers, 1988, pp. 139-154.
[GJ79] M. Garey, D. Johnson.
Computers and intractability: a guide to the theory of NP-completeness.
Freeman, 1979.
[G89] M. Gereb-Graus.
Complexity of learning from one-sided examples.
Harvard University, unpublished manuscript, 1989.
[H88] D. Haussler.
Quantifying inductive bias: AI learning algorithms and Valiant's model.
Artificial Intelligence,
36(2), 1988, pp. 177-221.
[HKLW88] D. Haussler, M. Kearns, N. Littlestone, M. Warmuth.
Equivalence of models for polynomial learnability.
Proceedings of the 1988 Workshop on Computational Learning Theory,
Morgan Kaufmann Publishers, 1988, pp. 42-55, and
University of California at Santa Cruz Information Sciences Department,
technical report number UCSC-CRL-88-06, 1988.
[HSW88] D. Helmbold, R. Sloan, M. Warmuth.
Bootstrapping one-sided learning.
Unpublished manuscript, 1988.
[KLPV87a] M. Kearns, M. Li, L. Pitt, L.G. Valiant.
On the learnability of Boolean formulae.
Proceedings of the 19th A.C.M. Symposium on the Theory of Computing,
1987, pp. 285-295.
[KLPV87b] M. Kearns, M. Li, L. Pitt, L.G. Valiant.
Recent results on Boolean concept learning.
Proceedings of the 4th International Workshop on Machine Learning,
Morgan Kaufmann Publishers, 1987, pp. 337-352.
[KV89] M. Kearns, L.G. Valiant.
Cryptographic limitations on learning Boolean formulae and finite automata.
Proceedings of the 21st A.C.M. Symposium on the Theory of Computing,
1989, pp. 433-444.
[L88] N. Littlestone.
Learning quickly when irrelevant attributes abound: a new linear threshold algorithm.
Machine Learning,
2(4), 1988, pp. 245-318.
[N87] B.K. Natarajan.
On learning Boolean functions.
Proceedings of the 19th A.C.M. Symposium on the Theory of Computing,
1987, pp. 296-304.
[PV88] L. Pitt, L.G. Valiant.
Computational limitations on learning from examples.
Journal of the A.C.M.,
35(4), 1988, pp. 965-984.
[PW88] L. Pitt, M.K. Warmuth.
Reductions among prediction problems: on the difficulty of predicting automata.
Proceedings of the 3rd I.E.E.E. Conference on Structure in Complexity Theory,
1988, pp. 60-69.
[R87] R. Rivest.
Learning decision lists.
Machine Learning,
2(3), 1987, pp. 229-246.
[S89] R. Schapire.
The strength of weak learnability.
Machine Learning,
5(2), 1990, pp. 197-227.
[S90] H. Shvayster.
A necessary condition for learning from positive examples.
Machine Learning,
5(1), 1990, pp. 101-113.
[V84] L.G. Valiant.
A theory of the learnable.
Communications of the A.C.M.,
27(11), 1984, pp. 1134-1142.
[V85] L.G. Valiant.
Learning disjunctions of conjunctions.
Proceedings of the 9th International Joint Conference on Artificial Intelligence,
1985, pp. 560-566.