
Learning Boolean Formulae

Michael Kearns†   Ming Li‡   Leslie Valiant§


AT&T Bell Laboratories University of Waterloo Harvard University

Abstract
Efficient distribution-free learning of Boolean formulae from positive and negative examples
is considered. It is shown that classes of formulae that are efficiently learnable from only
positive examples or only negative examples have certain closure properties. A new substitution
technique is used to show that in the distribution-free case learning DNF (disjunctive normal
form formulae) is no harder than learning monotone DNF. We prove that monomials cannot be
efficiently learned from negative examples alone, even if the negative examples are uniformly
distributed. It is also shown that if the examples are drawn from uniform distributions then the
class of DNF in which each variable occurs at most once is efficiently learnable, while the class
of all monotone Boolean functions is efficiently weakly learnable (i.e., individual examples are
correctly classified with probability larger than 1/2 + 1/p, where p is a polynomial in the relevant
parameters of the learning problem). We then show an equivalence between the notion of weak
learning and the notion of group learning, where a group of examples of polynomial size, either
all positive or all negative, must be correctly classified with high probability.
Key words: Machine Learning, Inductive Inference.
*Earlier versions of most of the results presented here were described in "On the learnability of Boolean formulae",
by M. Kearns, M. Li, L. Pitt and L. Valiant, Proceedings of the 19th A.C.M. Symposium on the Theory of Computing,
1987, pp. 285-295. Earlier versions of Theorems 15 and 22 were announced in "Cryptographic limitations on learning
Boolean formulae and finite automata", by M. Kearns and L. Valiant, Proceedings of the 21st A.C.M. Symposium
on the Theory of Computing, 1989, pp. 433-444.
†This research was done while the author was at Harvard University and supported by grants NSF-DCR-8600379,
ONR-N00014-85-K-0445, and an A.T. & T. Bell Laboratories Scholarship. Author's current address: AT&T Bell
Laboratories, Room 2A-423, 600 Mountain Avenue, P.O. Box 636, Murray Hill, NJ 07974-0636.
‡This research was done while the author was at Harvard University and supported by grants NSF-DCR-8600379,
DAAL03-86-K-0171, and ONR-N00014-85-K-0445. Author's current address: Department of Computer Science,
University of Waterloo, Waterloo, Ontario N2L 3G1 Canada.
§Supported by grants NSF-DCR-8600379 and ONR-N00014-85-K-0445. Author's address: Aiken Computation
Laboratory, Harvard University, Cambridge, MA 02138.

1 Introduction
We study the computational feasibility of learning Boolean expressions from examples. Our goals
are to prove results and develop general techniques that shed light on the boundary between the
classes of expressions that are learnable in polynomial time and those that are apparently not. The
elucidation of this boundary, for Boolean expressions and other knowledge representations, is an
example of the potential contribution of complexity theory to artificial intelligence.
We employ the distribution-free model of learning introduced in [V84]. A more complete discussion
and justification of this model can be found in [V85, BEHW86, KLPV87b]. The paper
[BEHW86] includes some discussion that is relevant more particularly to infinite representations,
such as geometric ones, rather than the finite case of Boolean functions.
The results of this paper fall into four categories: tools for determining efficient learnability,
a negative result, distribution-specific positive results, and an equivalence between two models of
learnability.
The tools for determining efficient learnability are of two kinds. In Section 3.1 we discuss closure
under Boolean operations on the members of the learnable classes. The assumption that the classes
are learnable from positive examples only or from negative examples only is sometimes sufficient to
ensure closure. In Section 3.2, we give a general substitution technique. It can be used to show, for
example, that if disjunctive normal form (DNF) formulae are efficiently learnable in the monotone
case, then they are also efficiently learnable in the unrestricted case.
In Section 4 we prove a negative result. We show that for purely information-theoretic reasons,
monomials cannot be learned from negative examples alone, regardless of the hypothesis representation,
and even if the distribution on the negative examples is uniform. This contrasts with the
fact that monomials can be learned from positive examples alone [V84], and can be learned from
very few examples if both kinds are available [H88, L88].
The class of DNF expressions in which each variable occurs at most once is called μDNF. In
Section 5, we consider learning μDNF and the class of arbitrary monotone Boolean functions,
under the restriction that both the positive and negative examples are drawn from uniform distributions.
We show that μDNF is learnable (with μDNF as the hypothesis representation) in
this distribution-specific case. In the distribution-free setting, learning μDNF with μDNF as the
hypothesis representation is known to be NP-hard [PV88]. We also show that arbitrary monotone
Boolean functions are weakly learnable under uniform distributions, with a polynomial-time program
as the hypothesis representation. In weak learning, it is sufficient to deduce a rule by which
individual examples are correctly classified with probability 1/2 + 1/p, where p is a polynomial in
the parameters of the learning problem. We conclude in Section 6 by showing that weak learning
is equivalent to the group learning model, in which it is sufficient to deduce a rule that determines
with accuracy 1 − ε whether a large enough set of examples contains only positive or only negative
examples, given that one of the two possibilities holds. This equivalence holds in both the
distribution-free case and under any restricted class of distributions, thus yielding a group learning
algorithm for monotone Boolean functions under uniform distributions. In the distribution-free
case, monotone functions are not weakly learnable by any hypothesis representation, independent
of any complexity-theoretic assumptions.

2 Definitions and Notation


In this section we give definitions for the model of machine learning we study. This model was first
defined in [V84].

2.1 Representing subsets of a domain


Concept classes and their representation. Let X be a set called a domain. We may think of
X as containing encodings of all objects of interest to us in our learning problem. The goal
of a learning algorithm is then to infer some unknown subset of X , called a concept, chosen
from a known concept class. We might imagine a child attempting to learn to distinguish
chairs from non-chairs among the set X of all the physical objects in its environment. This
particular concept is but one of many concepts in the class, each of which the child might be
expected to learn and each of which is a set of related and somehow "interesting" objects.
For example, another concept might consist of all metal objects in the environment. We
would not expect a randomly chosen subset of objects to be an "interesting" concept, since
as humans we do not expect these objects to bear any natural relation to one another.
For computational purposes we always need a way of naming or representing concepts. Thus,
we formally define a representation class over X to be a pair (σ, C), where C ⊆ {0,1}* and
σ is a mapping σ : C → 2^X. For c ∈ C, σ(c) is called a concept over X; the image space
σ(C) is the concept class that is represented by (σ, C). For c ∈ C, we define pos(c) = σ(c)
(the positive examples of c) and neg(c) = X − σ(c) (the negative examples of c). The domain
X and the mapping σ will usually be clear from the context, and we will simply refer to
the representation class C. We will sometimes use the notation c(x) to denote the value
of the characteristic function of σ(c) on the domain point x; thus x ∈ pos(c) (respectively,
x ∈ neg(c)) and c(x) = 1 (c(x) = 0) are used interchangeably. We assume that domain points
x ∈ X and representations c ∈ C are efficiently encoded using any of the standard schemes
(see e.g. [GJ79]), and denote by |x| and |c| the length of these encodings measured in bits.
Parameterized representation classes. We will often study parameterized classes of representations.
Here we have a stratified domain X = ∪_{n≥1} X_n and representation class C = ∪_{n≥1} C_n.
The parameter n can be regarded as an appropriate measure of the complexity of concepts
in σ(C), and we assume that for a representation c ∈ C_n we have pos(c) ⊆ X_n and
neg(c) = X_n − pos(c). For example, X_n may be the set {0,1}^n, and C_n the class of all Boolean
formulae over n variables. Then for c ∈ C_n, σ(c) would contain all satisfying assignments of
the formula c.
Efficient evaluation of representations. If C is a representation class over X, we say that C
is polynomially evaluatable if there is a (probabilistic) polynomial-time evaluation algorithm
A that on input a representation c ∈ C and a domain point x ∈ X outputs c(x).
Samples. A labeled example from a domain X is a pair ⟨x, b⟩, where x ∈ X and b ∈ {0,1}.
A labeled sample S = ⟨x_1, b_1⟩, ..., ⟨x_m, b_m⟩ from X is a finite sequence of labeled
examples from X. If C is a representation class, a labeled example of c ∈ C is a labeled
example of the form ⟨x, c(x)⟩, where x ∈ X. A labeled sample of c is a labeled sample S
where each example of S is a labeled example of c. In the case where all labels b_i or c(x_i) are
1 (respectively, 0), we may omit the labels and simply write S as a list of points x_1, ..., x_m,
and we call the sample a positive (negative) sample.
We say that a representation h and an example ⟨x, b⟩ agree if h(x) = b; otherwise they
disagree. We say that a representation h and a sample S are consistent if h agrees with each
example in S.
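As a concrete illustration of these definitions (our own Python sketch, not part of the paper), the fragment below evaluates a monomial as the characteristic function of its positive examples and checks a hypothesis for consistency with a labeled sample; all identifiers are illustrative.

    # Illustrative sketch of the definitions above: a representation evaluated
    # as the characteristic function c(x), and consistency with a labeled sample.
    from typing import Dict, List, Tuple

    Example = Tuple[Tuple[int, ...], int]      # (domain point x in {0,1}^n, label b)

    class Monomial:
        """A conjunction of literals, e.g. {1: 1, 3: 0} means x1 AND (NOT x3)."""
        def __init__(self, literals: Dict[int, int]):
            self.literals = literals           # variable index -> required value

        def __call__(self, x: Tuple[int, ...]) -> int:
            return int(all(x[i - 1] == b for i, b in self.literals.items()))

    def consistent(h, sample: List[Example]) -> bool:
        # h and S are consistent if h agrees with every labeled example in S
        return all(h(x) == b for x, b in sample)

    # c = x1 AND (NOT x3) over n = 3 variables, and a labeled sample of c.
    c = Monomial({1: 1, 3: 0})
    S = [((1, 0, 0), 1), ((1, 1, 1), 0), ((0, 1, 0), 0)]
    assert consistent(c, S)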

2.2 Distribution-free learning


Distributions on examples. On any given execution, a learning algorithm for a representation
class C will be receiving examples of a single distinguished representation c ∈ C. We call
this distinguished c the target representation. Examples of the target representation are
generated probabilistically as follows: let D_c^+ be a fixed but arbitrary probability distribution
over pos(c), and let D_c^− be a fixed but arbitrary probability distribution over neg(c). We
call these distributions the target distributions. When learning c, learning algorithms will be
given access to two oracles, POS and NEG, that behave as follows: oracle POS (respectively,
NEG) returns in unit time a positive (negative) example of the target representation, drawn
randomly according to the target distribution D_c^+ (D_c^−).

We think of the target distributions as representing the "real world" distribution of the environment
in which the learning algorithm must perform. For instance, suppose that the target
concept were that of "dangerous situations". Certainly the situations "oncoming tractor" and
"oncoming taxi" are both contained in this concept. However, a child growing up in a rural
environment is much more likely to witness the former event than the latter, and the situation
is reversed for a child growing up in an urban environment. These differences in probability
are reflected in different target distributions for the same underlying target concept. Furthermore,
since we rarely expect to have precise knowledge of the target distributions at the
time we design a learning algorithm, ideally we seek algorithms that perform well under any
target distributions.
Given a fixed target representation c ∈ C, and given fixed target distributions D_c^+ and D_c^−,
there is a natural measure of the error (with respect to c, D_c^+ and D_c^−) of a representation h
from a representation class H. We define e_c^+(h) = D_c^+(neg(h)) and e_c^−(h) = D_c^−(pos(h)). Note
that e_c^+(h) (respectively, e_c^−(h)) is simply the probability that a random positive (negative)
example of c is identified as negative (positive) by h. If both e_c^+(h) < ε and e_c^−(h) < ε, then
we say that h is an ε-good hypothesis (with respect to D_c^+ and D_c^−); otherwise, h is ε-bad. We
define the accuracy of h to be the value min(1 − e_c^+(h), 1 − e_c^−(h)).
When the target representation c is clear from the context, we will drop the subscript c and
simply write D^+, D^−, e^+ and e^−.
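For intuition, the error rates e^+(h) and e^−(h) can be estimated empirically by sampling the two oracles; the short Python sketch below (ours, with illustrative oracle interfaces) does exactly this.

    # Sketch: empirical estimates of e+(h) = Pr_{x~D+}[h(x)=0] and
    # e-(h) = Pr_{x~D-}[h(x)=1] from m draws of each oracle.
    import random

    def estimate_errors(h, POS, NEG, m):
        e_plus = sum(1 for _ in range(m) if h(POS()) == 0) / m
        e_minus = sum(1 for _ in range(m) if h(NEG()) == 1) / m
        return e_plus, e_minus

    # Toy usage: target c(x) = x1 on {0,1}^4, with D+ and D- uniform over
    # pos(c) and neg(c) respectively; here h happens to equal the target.
    n = 4
    POS = lambda: tuple([1] + [random.randint(0, 1) for _ in range(n - 1)])
    NEG = lambda: tuple([0] + [random.randint(0, 1) for _ in range(n - 1)])
    h = lambda x: x[0]
    print(estimate_errors(h, POS, NEG, m=1000))        # approximately (0.0, 0.0)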
In the definitions that follow, we will demand that a learning algorithm produce with high
probability an ε-good hypothesis regardless of the target representation and target distributions.
While at first this may seem like a strong criterion, note that the error of the hypothesis
output is always measured with respect to the same target distributions on which the algorithm
was trained. Thus, while it is true that certain examples of the target representation
may be extremely unlikely to be generated in the training process, these same examples intuitively
may be "ignored" by the hypothesis of the learning algorithm, since they contribute
a negligible amount of error. Continuing our informal example, the rural child may never be
shown an oncoming taxi as an example of a dangerous situation, but provided he remains
in the environment in which he was trained, it is unlikely that his inability to recognize this
danger will ever become apparent.
Learnability. Let C and H be representation classes over X. Then C is learnable from examples
by H if there is a (probabilistic) algorithm A with access to POS and NEG, taking inputs
ε, δ, with the property that for any target representation c ∈ C, for any target distributions
D^+ over pos(c) and D^− over neg(c), and for any input values 0 < ε, δ < 1, algorithm A halts
and outputs a representation h_A ∈ H that with probability greater than 1 − δ satisfies
(i) e^+(h_A) < ε
and
(ii) e^−(h_A) < ε.
We call C the target class and H the hypothesis class; the output h_A ∈ H is called the
hypothesis of A. A will be called a learning algorithm for C. If C and H are polynomially
evaluatable, and A runs in time polynomial in 1/ε, 1/δ and |c| then we say that C is polynomially
learnable from examples by H; if C is parameterized we also allow the running time of A to
have polynomial dependence on n. We will drop the phrase "from examples" and simply say
that C is learnable by H, and C is polynomially learnable by H. We say C is polynomially
learnable to mean that C is polynomially learnable by H for some polynomially evaluatable
H. We will sometimes call ε the accuracy parameter and δ the confidence parameter.
Thus, we ask that for any target representation and any target distributions, a learning
algorithm finds an ε-good hypothesis with probability at least 1 − δ. A major goal of research
in this model is to discover which representation classes C are polynomially learnable.
We will sometimes bound the probability that a learning algorithm fails to output an ε-good
hypothesis by separately analyzing a constant number k of different ways the algorithm might
fail, and showing that each of these occurs with probability at most δ/k. Thus, we will use the
expression "with high probability" to mean with probability at least 1 − δ/k for an appropriately
large constant k, which for brevity may sometimes be left unspecified.
We refer to this model as the distribution-free model, to emphasize that we seek algorithms
that work for any target distributions. We also occasionally refer to this model as strong
learnability, in contrast with the notion of weak learnability defined below.
Weak learnability. We will also consider a distribution-free model in which the hypothesis of
the learning algorithm is required to perform only slightly better than random guessing.
Let C and H be representation classes over X. Then C is weakly learnable from examples
by H if there is a polynomial p and a (probabilistic) algorithm A with access to POS and
NEG, taking input δ, with the property that for any target representation c ∈ C, for any
target distributions D^+ over pos(c) and D^− over neg(c), and for any input value 0 < δ < 1,
algorithm A halts and outputs a representation h_A ∈ H that with probability greater than
1 − δ satisfies
(i) e^+(h_A) < 1/2 − 1/p(|c|)
and
(ii) e^−(h_A) < 1/2 − 1/p(|c|).
Thus, the accuracy of h_A must be at least 1/2 + 1/p(|c|). A will be called a weak learning algorithm
for C. In the case that the target class C is parameterized, we allow the polynomial p in
conditions (i) and (ii) to depend on n. If C and H are polynomially evaluatable, and A runs
in time polynomial in 1/δ and |c| (as well as polynomial in n if C is parameterized), we say
that C is polynomially weakly learnable by H, and C is polynomially weakly learnable if it is
weakly learnable by H for some polynomially evaluatable H.
While the motivation for weak learning may not be as apparent as that for strong learning,
we may intuitively think of weak learning as the ability to detect some slight bias separating
positive and negative examples. In fact, Schapire [S89] has shown that in the distribution-free
setting, polynomial-time weak learning is equivalent to polynomial-time learning.
Positive-only and negative-only learning algorithms. We will sometimes study learning
algorithms that need only positive examples or only negative examples. If A is a learning
algorithm for a representation class C , and A makes no calls to the oracle NEG (respectively,
POS ), then we say that A is a positive-only (negative-only) learning algorithm, and C is
learnable from positive examples (learnable from negative examples). Analogous definitions
are made for weak learnability. Note that although the learning algorithm may receive only
one type of example, the hypothesis output must still be accurate with respect to both the
positive and negative distributions.
Many learning algorithms in the distribution-free model are positive-only or negative-only.
The study of positive-only and negative-only learning is important for at least two reasons.
First, it helps to quantify more precisely what kind of information is required for learning var-
ious representation classes. Second, it is crucial for applications where, for instance, positive
examples are rare but must be classified accurately when they do occur.
Distribution-specific learnability. The models for learnability described above demand that
a learning algorithm work regardless of the distributions on the examples. We will sometimes
relax this condition, and consider these models under restricted classes of target distributions,
for instance the class consisting only of the uniform distribution. Here the definitions are the
same as before, except that we ask that the performance criteria for learnability be met only
under these restricted target distributions.
Sample complexity. Let A be a learning algorithm for a representation class C. Then we denote
by s_A(ε, δ) the number of calls to the oracles POS and NEG made by A on inputs ε, δ; this is
a worst-case measure over all possible target representations in C and all target distributions
D^+ and D^−. In the case that C is a parameterized representation class, we also allow s_A to
depend on n. We call the function s_A the sample complexity or sample size of A. We denote
by s_A^+ and s_A^− the number of calls of A to POS and NEG, respectively. Note that here we
have defined the sample complexity deterministically, since all of the algorithms we give use
a number of examples depending only on ε, δ and n. In general, however, we may wish to
allow the number of examples drawn to depend on the coin flips of the learning algorithm
and the actual sequence of examples seen; in this case the sample complexity would be an
expected value rather than an absolute value. Following the sample complexity lower bound
of Theorem 12 in Section 4, we discuss how to adapt the proof of this theorem and other
known deterministic sample complexity lower bounds to yield lower bounds on the expected
number of examples.
Chernoff bounds. We shall make extensive use of the following bounds on the area under the
tails of the binomial distribution. For 0 ≤ p ≤ 1 and m a positive integer, let LE(p, m, r)
denote the probability of at most r successes in m independent trials of a Bernoulli variable
with probability of success p, and let GE(p, m, r) denote the probability of at least r successes.
Then for 0 ≤ β ≤ 1,
Fact CB1. LE(p, m, (1 − β)mp) ≤ e^{−β²mp/2}
and
Fact CB2. GE(p, m, (1 + β)mp) ≤ e^{−β²mp/3}.
These bounds in the form they are stated are from [AV79]; see also [C52]. Although we
will make frequent use of Fact CB1 and Fact CB2, we will do so in varying levels of detail,
depending on the complexity of the calculation involved. However, we are primarily interested
in the Chernoff bounds for the following consequence of Fact CB1 and Fact CB2: given an
event E of probability p, we can obtain an estimate p′ of p by drawing m points from the
distribution and computing the frequency p′ with which E occurs in this sample. Then for m
polynomial in 1/p and 1/δ, say m = O((1/p) log(1/δ)), p′ satisfies p/2 < p′ < 2p with probability
at least 1 − δ. If we also allow m to depend polynomially on 1/β, say m = O((1/β²) log(1/δ)),
we can obtain an estimate p′ such that p − β < p′ < p + β with probability at least 1 − δ.
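The estimation procedure just described is easy to state in code. The sketch below is our own illustration; the constants inside m_multiplicative and m_additive are merely generous choices consistent with the stated asymptotic sample sizes, not constants taken from the paper.

    # Sketch of the estimation routine above: draw m points, return the
    # frequency p' of the event. The helper functions give sample sizes of
    # the forms m = O((1/p) log(1/delta)) and m = O((1/beta^2) log(1/delta)).
    import math
    import random

    def estimate(event, m):
        """event() returns 1 if the event occurred on a fresh draw, else 0."""
        return sum(event() for _ in range(m)) / m

    def m_multiplicative(p_lower, delta):
        # enough draws so that p/2 < p' < 2p with probability >= 1 - delta,
        # provided the true probability p is at least p_lower
        return math.ceil((24.0 / p_lower) * math.log(2.0 / delta))

    def m_additive(beta, delta):
        # enough draws so that p - beta < p' < p + beta with probability >= 1 - delta
        return math.ceil((3.0 / beta ** 2) * math.log(2.0 / delta))

    # Usage: estimate an event of true probability 0.1 to within 0.02.
    event = lambda: int(random.random() < 0.1)
    print(estimate(event, m_additive(beta=0.02, delta=0.05)))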
Notational conventions. Let E(x) be an event and ψ(x) a random variable that depend on a
parameter x that takes on values in a set X. Then for X′ ⊆ X, we denote by Pr_{x∈X′}(E(x))
the probability that E occurs when x is drawn uniformly at random from X′. Similarly,
E_{x∈X′}(ψ(x)) is the expected value of ψ when x is drawn uniformly at random from X′. We also
need to work with distributions other than the uniform distribution; thus if P is a distribution
over X we use Pr_{x∈P}(E(x)) and E_{x∈P}(ψ(x)) to denote the probability of E and the expected
value of ψ, respectively, when x is drawn according to the distribution P. When E or ψ
depend on several parameters that are drawn independently from different distributions we use
multiple subscripts. For example, Pr_{x_1∈P_1, x_2∈P_2, x_3∈P_3}(E(x_1, x_2, x_3)) denotes the probability
of event E when x_1 is drawn from distribution P_1, x_2 from P_2, and x_3 from P_3, independently.

2.3 Some representation classes


We now define some of the restricted Boolean formula representation classes whose learnability we
will study. Here the domain X_n is always {0,1}^n and the mapping σ simply maps each formula to
its set of satisfying assignments. The classes defined below are all parameterized; for each class we
define the subclasses C_n, and then C is defined by C = ∪_{n≥1} C_n.
Monomials: The representation class M_n consists of all conjunctions of literals (monomials) over
the Boolean variables x_1, ..., x_n.
kCNF: For constant k, the representation class kCNF_n consists of Boolean formulae of the form
C_1 ∧ ⋯ ∧ C_l, where each C_i is a disjunction of at most k literals over the Boolean variables
x_1, ..., x_n. No a priori bound is placed on l. Note that M_n = 1CNF_n.
kDNF: For constant k, the representation class kDNF_n consists of Boolean formulae of the form
T_1 ∨ ⋯ ∨ T_l, where each T_i is a conjunction of at most k literals over the Boolean variables
x_1, ..., x_n. No a priori bound is placed on l.
k-clause CNF: For constant k, the representation class k-clause-CNF_n consists of all conjunctions
of the form C_1 ∧ ⋯ ∧ C_l for l ≤ k, where each C_i is a disjunction of literals over the
Boolean variables x_1, ..., x_n.
k-term DNF: For constant k, the representation class k-term-DNF_n consists of all disjunctions
of the form T_1 ∨ ⋯ ∨ T_l for l ≤ k, where each T_i is a monomial over the Boolean variables
x_1, ..., x_n.
CNF: The representation class CNF_n consists of all formulae of the form C_1 ∧ ⋯ ∧ C_l, where
each C_i is a disjunction of literals over the Boolean variables x_1, ..., x_n. No a priori bound is
placed on l.
DNF: The representation class DNF_n consists of all formulae of the form T_1 ∨ ⋯ ∨ T_l, where
each T_i is a conjunction of literals (a monomial) over the Boolean variables x_1, ..., x_n. No a priori
bound is placed on l.
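To fix the encodings, here is a small Python sketch (ours; the data layout is an illustrative choice) of how formulae in these classes can be represented and evaluated on assignments in {0,1}^n.

    # Illustrative encodings of monomials and DNF formulae with the
    # evaluation map c(x) on assignments in {0,1}^n.
    from typing import List, Tuple

    Literal = Tuple[int, bool]     # (variable index 1..n, negated?)
    Term = List[Literal]           # a conjunction of literals (a monomial)

    def eval_term(term: Term, x: Tuple[int, ...]) -> bool:
        return all((x[i - 1] == 0) if neg else (x[i - 1] == 1) for i, neg in term)

    def eval_dnf(terms: List[Term], x: Tuple[int, ...]) -> bool:
        # DNF_n: T1 v ... v Tl, each Ti a monomial
        return any(eval_term(t, x) for t in terms)

    def is_kdnf(terms: List[Term], k: int) -> bool:
        # membership test for kDNF_n: every term has at most k literals
        return all(len(t) <= k for t in terms)

    # f = (x1 AND NOT x2) OR x3 over n = 3 variables; f is a 2DNF formula.
    f = [[(1, False), (2, True)], [(3, False)]]
    assert is_kdnf(f, 2)
    assert eval_dnf(f, (1, 0, 0)) and not eval_dnf(f, (0, 1, 0))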

3 Useful Tools for Distribution-free Learning
In this section we describe some general tools for determining polynomial-time learnability in
the distribution-free model. These tools fall into two classes: closure theorems for polynomially
learnable representation classes, and reductions between learning problems via variable substitution.
We also discuss some applications of these techniques. Although we are primarily interested here in
polynomial-time learnability, the results presented in this section are easily generalized to algorithms
with higher time complexities.

3.1 Closure theorems for polynomially learnable classes: composing learning algorithms
Suppose that C1 is polynomially learnable by H1, and C2 is polynomially learnable by H2. Then
it is easy to see that the class C1 ∪ C2 is polynomially learnable by H1 ∪ H2: we first assume that
the target representation c is in the class C1 and run algorithm A1 for learning C1. We then test
the hypothesis h1 output by A1 on a polynomial-size random sample of c to determine with high
probability if it is ε-good (this can be done using Fact CB1 and Fact CB2). If h1 is ε-good, we
halt; otherwise, we run algorithm A2 for learning C2 and use the hypothesis h2 output by A2. This
algorithm demonstrates one way in which existing learning algorithms can be composed to learn
more powerful representation classes, and it generalizes to any polynomial number of unions of
polynomially learnable classes. Are there more interesting ways to compose learning algorithms?
In this section we describe techniques for composing existing learning algorithms to obtain new
learning algorithms for representation classes that are formed by performing logical operations on
the members of the learnable classes. We apply the results to obtain polynomial-time learning
algorithms for two classes of Boolean formulae not previously known to be polynomially learnable.
Recently in [HSW88] a general composition technique has been proposed and carefully analyzed in
several models of learnability.
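The union construction sketched in the opening paragraph of this subsection can be written down directly; the Python fragment below is our own illustration, with the learner and oracle interfaces, the test sample size, and the way δ is split all being illustrative assumptions rather than the paper's exact choices.

    # Sketch of the union construction: assume the target is in C1, run A1,
    # test its hypothesis on fresh examples, and fall back to A2 otherwise.
    def learn_union(A1, A2, POS, NEG, eps, delta, test_size):
        """A1, A2: callables (POS, NEG, eps, delta) -> hypothesis."""
        h1 = A1(POS, NEG, eps, delta / 3)

        # Estimate e+(h1) and e-(h1); Facts CB1 and CB2 bound the probability
        # that these estimates are misleading for a suitable test_size.
        e_plus = sum(1 for _ in range(test_size) if h1(POS()) == 0) / test_size
        e_minus = sum(1 for _ in range(test_size) if h1(NEG()) == 1) / test_size
        if e_plus < eps / 2 and e_minus < eps / 2:
            return h1                            # h1 looks eps-good; halt
        return A2(POS, NEG, eps, delta / 3)      # otherwise the target should be in C2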
If c1 ∈ C1 and c2 ∈ C2 are representations, the representation c1 ∨ c2 is defined by pos(c1 ∨ c2) =
pos(c1) ∪ pos(c2). Similarly, pos(c1 ∧ c2) = pos(c1) ∩ pos(c2). We then define C1 ∨ C2 = {c1 ∨ c2 :
c1 ∈ C1, c2 ∈ C2} and C1 ∧ C2 = {c1 ∧ c2 : c1 ∈ C1, c2 ∈ C2}.
Theorem 1 Let C1 be polynomially learnable by H1, and let C2 be polynomially learnable by H2
from negative examples. Then C1 ∨ C2 is polynomially learnable by (H1 ∨ H2) ∪ H2.
Proof: Let A1 be a polynomial-time algorithm for learning C1 by H1, and A2 a polynomial-time
negative-only algorithm for learning C2 by H2. We describe a polynomial-time algorithm A for
learning C1 ∨ C2 by (H1 ∨ H2) ∪ H2 that uses A1 and A2 as subroutines.
Let c = c1 ∨ c2 be the target representation in C1 ∨ C2, where c1 ∈ C1 and c2 ∈ C2, and let
D^+ and D^− be the target distributions on pos(c) and neg(c), respectively. Let s_{A_1} be the number
of examples needed by algorithm A1 in the simulation described below (s_{A_1} will depend on n, |c1|
and the accuracy and confidence parameters used in the simulation).
Since neg(c) ⊆ neg(c2), the distribution D^− may be regarded as a distribution on neg(c2), with
D^−(x) = 0 for x ∈ neg(c2) − neg(c). Thus A first runs the negative-only algorithm A2 to obtain
a representation h2 ∈ H2 for c2, using the examples generated from D^− by NEG. This simulation
is done with accuracy parameter ε²/(k s_{A_1}) and confidence parameter δ/5, where k is a constant that can
be determined by applying Fact CB1 and Fact CB2 in the analysis below. A2 then outputs an
h2 ∈ H2 satisfying with high probability e^−(h2) < ε²/(k s_{A_1}). Note that although we are unable to
bound e^+(h2) directly (because D^+ is not a distribution over pos(c2)), the fact that the simulation
of the negative-only algorithm A2 must work for any target distribution on pos(c2) implies that h2
must satisfy with high probability

    Pr_{x∈D^+}(x ∈ neg(h2) and x ∈ pos(c2)) ≤ Pr_{x∈D^+}(x ∈ neg(h2) | x ∈ pos(c2)) < ε²/(k s_{A_1}).    (1)
A next attempts to determine if e^+(h2) < ε. A takes m = O((1/ε) ln(1/δ)) examples from POS and
uses these examples to compute an estimate p for the value of e^+(h2). Using Fact CB1 it can be
shown that if e^+(h2) ≥ ε, then with high probability p > ε/2. Using Fact CB2 it can be shown that
if e^+(h2) ≤ ε/4, then with high probability p ≤ ε/2. Thus, if p ≤ ε/2 then A guesses that e^+(h2) ≤ ε. In
this case A halts with h_A = h2 as the hypothesis.
On the other hand, if p > ε/2 then A guesses that e^+(h2) ≥ ε/4. In this case A runs A1 in order
to obtain an h1 that is ε-good with respect to D^− and also with respect to that portion of D^+
on which h2 is wrong. More specifically, A runs A1 with accuracy parameter ε/k and confidence
parameter δ/5 according to the following distributions: each time A1 calls NEG, A supplies A1 with
a negative example of c drawn according to the target distribution D^−; each such example is also
a negative example of c1 since neg(c) ⊆ neg(c1). Each time A1 calls POS, A draws from the target
distribution D^+ until a point x ∈ neg(h2) is obtained. Since the probability of drawing such an x
is exactly e^+(h2), if e^+(h2) ≥ ε/4 then the time needed to obtain with high probability a sample of
size s_{A_1} of such x is polynomial in 1/ε, 1/δ and s_{A_1} by Fact CB1.
We now argue that with high probability, the s_{A_1} positive examples provided to A1 by this
simulation are in fact drawn according to the distribution D^+ restricted not just to the set neg(h2),
but in fact restricted to the smaller set neg(h2) ∩ pos(c1). By Equation 1 we have that a single draw
from D^+ has probability at most ε²/(k s_{A_1}) of falling in neg(h2) ∩ pos(c2). Thus, under the assumption
that e^+(h2) ≥ ε/4, the probability that a draw from D^+ restricted to neg(h2) in fact falls in pos(c2)
is at most (ε²/(k s_{A_1}))/(ε/4) = 4ε/(k s_{A_1}). Now if a point drawn from D^+ does not fall in pos(c2) then it must
fall in pos(c1); thus with probability exceeding 1 − s_{A_1} · (4ε/(k s_{A_1})) ≥ 1 − ε, all s_{A_1} positive examples
provided to the simulation of A1 fall in neg(h2) ∩ pos(c1) as claimed.
Thus if h1 is the hypothesis of A1 following the simulation, then with high probability h1 satisfies
e^−(h1) < ε/k, and also

    Pr_{x∈D^+}(x ∈ neg(h1) ∩ neg(h2))
        = Pr_{x∈D^+}(x ∈ neg(h1) ∩ neg(h2) ∩ pos(c1)) + Pr_{x∈D^+}(x ∈ neg(h1) ∩ neg(h2) ∩ neg(c1))
        ≤ Pr_{x∈D^+}(x ∈ neg(h1) | x ∈ neg(h2) ∩ pos(c1)) + Pr_{x∈D^+}(x ∈ neg(h2) ∩ neg(c1))
        ≤ ε/k + ε²/(k s_{A_1}).

Setting h_A = h1 ∨ h2, we have e^+(h_A) < ε and e^−(h_A) < ε, as desired. Note that the time required
by this simulation is polynomial in the time required by A1 and the time required by A2.
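The simulation in the proof of Theorem 1 has the following overall shape in code (our sketch; the subroutine interfaces, the estimation sample size m, and the exact accuracy and confidence parameters are illustrative stand-ins for the quantities analyzed above).

    # Sketch of algorithm A from the proof of Theorem 1.
    def learn_disjunction(A1, A2_neg_only, POS, NEG, eps, delta, s_A1, k=20, m=2000):
        # Step 1: learn c2 from negative examples only, with a small accuracy parameter.
        h2 = A2_neg_only(NEG, eps ** 2 / (k * s_A1), delta / 5)

        # Step 2: estimate e+(h2) from m draws of POS.
        p = sum(1 for _ in range(m) if h2(POS()) == 0) / m
        if p <= eps / 2:
            return h2                               # guess that e+(h2) <= eps and halt

        # Step 3: run A1, giving it NEG directly and POS filtered to neg(h2).
        def POS_filtered():
            while True:                             # expected O(1/e+(h2)) draws per example
                x = POS()
                if h2(x) == 0:
                    return x
        h1 = A1(POS_filtered, NEG, eps / k, delta / 5)

        return lambda x: h1(x) or h2(x)             # the hypothesis h1 v h2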
The following dual to Theorem 1 has a similar proof:
Theorem 2 Let C1 be polynomially learnable by H1, and let C2 be polynomially learnable by H2
from positive examples. Then C1 ∧ C2 is polynomially learnable by H1 ∧ H2.
As corollaries we have that the following classes of Boolean formulae are polynomially learnable:
Corollary 3 For any fixed k, let kCNF ∨ kDNF = ∪_{n≥1}(kCNF_n ∨ kDNF_n). Then kCNF ∨ kDNF
is polynomially learnable by kCNF ∨ kDNF.

Corollary 4 For any fixed k, let kCNF ∧ kDNF = ∪_{n≥1}(kCNF_n ∧ kDNF_n). Then kCNF ∧ kDNF
is polynomially learnable by kCNF ∧ kDNF.
Proofs of Corollaries 3 and 4 follow from Theorems 1 and 2 and the algorithms in [V84] for
learning kCNF from positive examples and kDNF from negative examples. Note that algorithms
obtained in Corollaries 3 and 4 use both positive and negative examples. Following Theorem 12 of
Section 4 we show that the representation classes kCNF ∨ kDNF and kCNF ∧ kDNF require both
positive and negative examples for polynomial learnability, regardless of the hypothesis class. We
note that the more recent results of Rivest [R87] imply that the above classes are learnable by the
class of k-decision lists, a class that properly includes them.
Under the stronger assumption that both C1 and C2 are learnable from positive examples, we
can prove the following result, which shows that the classes that are polynomially learnable from
positive examples are closed under conjunction of representations.

Theorem 5 Let C1 be polynomially learnable by H1 from positive examples, and let C2 be polynomially
learnable by H2 from positive examples. Then C1 ∧ C2 is polynomially learnable by H1 ∧ H2
from positive examples.
Proof: Let A1 be a polynomial-time positive-only algorithm for learning C1 by H1, and let A2 be
a polynomial-time positive-only algorithm for learning C2 by H2. We describe a polynomial-time
positive-only algorithm A for learning C1 ∧ C2 by H1 ∧ H2 that uses A1 and A2 as subroutines.
Let c = c1 ∧ c2 be the target representation in C1 ∧ C2, where c1 ∈ C1 and c2 ∈ C2, and let D^+
and D^− be the target distributions on pos(c) and neg(c). Since pos(c) ⊆ pos(c1), A can use A1
to learn a representation h1 ∈ H1 for c1 using the positive examples from D^+ generated by POS.
A simulates algorithm A1 with accuracy parameter ε/2 and confidence parameter δ/2, and obtains
h1 ∈ H1 that with high probability satisfies e^+(h1) ≤ ε/2. Note that although we are unable to
bound e^−(h1) by ε/2, we must have

    Pr_{x∈D^−}(x ∈ pos(h1) − pos(c1)) = Pr_{x∈D^−}(x ∈ pos(h1) and x ∈ neg(c1))
        ≤ Pr_{x∈D^−}(x ∈ pos(h1) | x ∈ neg(c1)) ≤ ε/2

since A1 must work for any fixed distribution on neg(c1). Similarly, A simulates algorithm A2 with
accuracy parameter ε/2 and confidence parameter δ/2 to obtain h2 ∈ H2 that with high probability
satisfies e^+(h2) ≤ ε/2 and Pr_{x∈D^−}(x ∈ pos(h2) − pos(c2)) ≤ ε/2. Then we have

    e^+(h1 ∧ h2) ≤ e^+(h1) + e^+(h2) ≤ ε.
We now bound e^−(h1 ∧ h2) as follows:

    e^−(h1 ∧ h2)
        = Pr_{x∈D^−}(x ∈ pos(h1 ∧ h2) − pos(c1 ∧ c2))
        = Pr_{x∈D^−}(x ∈ pos(h1) ∩ pos(h2) ∩ neg(c1 ∧ c2))
        = Pr_{x∈D^−}(x ∈ pos(h1) ∩ pos(h2) ∩ (neg(c1) ∪ neg(c2)))
        = Pr_{x∈D^−}(x ∈ (pos(h1) ∩ pos(h2) ∩ neg(c1)) ∪ (pos(h1) ∩ pos(h2) ∩ neg(c2)))
        ≤ Pr_{x∈D^−}(x ∈ pos(h1) ∩ pos(h2) ∩ neg(c1)) + Pr_{x∈D^−}(x ∈ pos(h1) ∩ pos(h2) ∩ neg(c2))
        ≤ Pr_{x∈D^−}(x ∈ pos(h1) ∩ neg(c1)) + Pr_{x∈D^−}(x ∈ pos(h2) ∩ neg(c2))
        = Pr_{x∈D^−}(x ∈ pos(h1) − pos(c1)) + Pr_{x∈D^−}(x ∈ pos(h2) − pos(c2))
        ≤ ε/2 + ε/2 = ε.
The time required by this simulation is polynomial in the time taken by A1 and A2 .
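In code, the positive-only composition of Theorem 5 is just the following few lines (our sketch; learner interfaces are illustrative).

    # Sketch of the positive-only algorithm A from the proof of Theorem 5:
    # run each subroutine with accuracy eps/2 and confidence delta/2 on the
    # same positive oracle and intersect the two hypotheses.
    def learn_conjunction_pos_only(A1_pos, A2_pos, POS, eps, delta):
        h1 = A1_pos(POS, eps / 2, delta / 2)   # valid since pos(c) is contained in pos(c1)
        h2 = A2_pos(POS, eps / 2, delta / 2)   # valid since pos(c) is contained in pos(c2)
        return lambda x: h1(x) and h2(x)       # the hypothesis h1 ^ h2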
The proof of Theorem 5 generalizes to allow any polynomial number p of conjuncts of representations
in the target class. Thus, if C1, ..., Cp are polynomially learnable from positive examples
by H1, ..., Hp respectively, then the class C1 ∧ ⋯ ∧ Cp is polynomially learnable by H1 ∧ ⋯ ∧ Hp
from positive examples.

    C1 ∨ C2 polynomially          C1 polynomially          C1 polynomially          C1 polynomially
    learnable by C1 ∨ C2?         learnable by C1          learnable by C1          learnable by C1
                                  from POS                 from NEG                 from POS and NEG
    ------------------------------------------------------------------------------------------------
    C2 polynomially learnable     NP-hard in some cases    YES, from POS and NEG    NP-hard in some cases
    by C2 from POS

    C2 polynomially learnable     YES, from POS and NEG    YES, from NEG            YES, from POS and NEG
    by C2 from NEG

    C2 polynomially learnable     NP-hard in some cases    YES, from POS and NEG    NP-hard in some cases
    by C2 from POS and NEG

Figure 1: Polynomial learnability of C1 ∨ C2 by C1 ∨ C2.
We can also prove the following dual to Theorem 5:
Theorem 6 Let C1 be polynomially learnable by H1 from negative examples, and let C2 be polynomially
learnable by H2 from negative examples. Then C1 ∨ C2 is polynomially learnable by H1 ∨ H2
from negative examples.
Again, if C1, ..., Cp are polynomially learnable from negative examples by H1, ..., Hp respectively,
then the class C1 ∨ ⋯ ∨ Cp is polynomially learnable by H1 ∨ ⋯ ∨ Hp from negative examples,
for any polynomial value of p.
We can also use Theorems 1, 2, 5 and 6 to characterize the conditions under which the class
C1 ∨ C2 (respectively, C1 ∧ C2) is polynomially learnable by C1 ∨ C2 (respectively, C1 ∧ C2). Figures 1
and 2 summarize this information, where a "YES" entry indicates that for C1 and C2 polynomially
learnable as indicated, C1 ∨ C2 (respectively, C1 ∧ C2) is always polynomially learnable by C1 ∨ C2
(respectively, C1 ∧ C2), and an entry "NP-hard" indicates that the learning problem is NP-hard for
some choice of C1 and C2. All NP-hardness results follow from the results of [PV88]. For learning
problems, since we also allow randomized algorithms, these NP-hardness results usually are taken
to mean that if any such problem is solved in random polynomial time, i.e. in RP, then all NP
problems are in RP, that is, NP = RP.

    C1 ∧ C2 polynomially          C1 polynomially          C1 polynomially          C1 polynomially
    learnable by C1 ∧ C2?         learnable by C1          learnable by C1          learnable by C1
                                  from POS                 from NEG                 from POS and NEG
    ------------------------------------------------------------------------------------------------
    C2 polynomially learnable     YES, from POS            YES, from POS and NEG    YES, from POS and NEG
    by C2 from POS

    C2 polynomially learnable     YES, from POS and NEG    NP-hard in some cases    NP-hard in some cases
    by C2 from NEG

    C2 polynomially learnable     YES, from POS and NEG    NP-hard in some cases    NP-hard in some cases
    by C2 from POS and NEG

Figure 2: Polynomial learnability of C1 ∧ C2 by C1 ∧ C2.

3.2 Reductions between Boolean formulae learning problems via variable substitutions
In traditional complexity theory, the notion of polynomial-time reducibility has proven extremely
useful for comparing the computational difficulty of problems whose exact complexity or tractability
is unresolved. Similarly, in computational learning theory, we might expect that given two represen-
tation classes C1 and C2 whose polynomial learnability is unresolved, we may still be able to prove
conditional statements to the effect that if C1 is polynomially learnable, then C2 is polynomially
learnable. This suggests a notion of reducibility between learning problems.
In this section we describe polynomial-time reductions between learning problems for classes of
Boolean formulae. These reductions are very general and involve simple variable substitutions. Sim-
ilar transformations have been given for the mistake-bounded model of learning in [L88]. Recently
the notion of reducibility among learning problems has been elegantly generalized and developed
into a complexity theory for polynomial-time learnability [PW88].
If F = ∪_{n≥1} F_n is a parameterized class of Boolean formulae, we say that F is naming invariant
if for any formula f(x_1, ..., x_n) ∈ F_n, we have f(x_{π(1)}, ..., x_{π(n)}) ∈ F_n, where π is any permutation
of {1, ..., n}. We say that F is upward closed if for n ≥ 1, F_n ⊆ F_{n+1}.
We note that all of the classes of Boolean formulae studied here are both naming invariant
and upward closed. For a simple example, consider the class of monotone DNF formulae over n
variables: the formula x_1 + x_2 belongs to this class for every n ≥ 2, and so does x_2 + x_1.

Theorem 7 Let F = ∪_{n≥1} F_n be a parameterized class of Boolean formulae that is naming invariant
and upward closed. Let G be a finite set of Boolean formulae over a constant number of variables
k. Let F′_n be the class of formulae obtained by choosing any f(x_1, ..., x_n) ∈ F_n and substituting
for one or more of the variables x_i in f any formula g_i(x_{i_1}, ..., x_{i_k}), where g_i ∈ G, and each
x_{i_j} ∈ {x_1, ..., x_n} (thus, the formula obtained is still over x_1, ..., x_n). Let F′ = ∪_{n≥1} F′_n. Then if
F is polynomially learnable, F′ is polynomially learnable.
Proof: Let A be a polynomial-time learning algorithm for F. We describe a polynomial-time
learning algorithm A′ for F′ that uses algorithm A as a subroutine. For each formula g_i ∈ G, A′
creates n^k new variables z^i_1, ..., z^i_{n^k}. The intention is that z^i_j will simulate the value of the formula
g_i when g_i is given the jth choice in some canonical ordering of k (not necessarily distinct) inputs
from x_1, ..., x_n. Note that there are exactly n^k such choices.
Whenever algorithm A requests a positive or negative example, A′ takes a positive or negative
example (v_1, ..., v_n) ∈ {0,1}^n of the target formula f′(x_1, ..., x_n) ∈ F′_n. Let c^i_j ∈ {0,1} be the
value assigned to z^i_j by the simulation described above. Then A′ gives the example

    (v_1, ..., v_n, c^1_1, ..., c^1_{n^k}, ..., c^{|G|}_1, ..., c^{|G|}_{n^k})

to algorithm A. Since f′(x_1, ..., x_n) was obtained by substitutions on some f ∈ F_n, and since F
is naming invariant and upward closed, there is a formula in F_{n+|G|n^k} that is consistent with all
the examples we generate by this procedure (it is just f′ with each occurrence of the formula g_i
replaced by the variable z^i_j that simulates the inputs to the occurrence of g_i). Thus A must output
an ε-good hypothesis

    h_A(x_1, ..., x_n, z^1_1, ..., z^1_{n^k}, ..., z^{|G|}_1, ..., z^{|G|}_{n^k}).

We then obtain an ε-good hypothesis over n variables by defining

    h_{A′}(v_1, ..., v_n) = h_A(v_1, ..., v_n, c^1_1, ..., c^1_{n^k}, ..., c^{|G|}_1, ..., c^{|G|}_{n^k})

for any (v_1, ..., v_n) ∈ {0,1}^n, where each c^i_j is computed as described above.
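The example transformation used by A′ is mechanical; the Python sketch below (ours) expands a point v ∈ {0,1}^n with the value of every g ∈ G on every ordered choice of k coordinates, using itertools.product as one convenient canonical ordering.

    # Sketch of the example expansion in the proof of Theorem 7.
    from itertools import product

    def expand_example(v, G, k):
        """v: a point in {0,1}^n; G: list of functions g(b1, ..., bk) -> 0/1.
        Returns the expanded example (v, c^1_1, ..., c^|G|_{n^k}) fed to A."""
        n = len(v)
        expanded = list(v)
        for g in G:
            for choice in product(range(n), repeat=k):   # the n^k choices of inputs
                expanded.append(int(g(*(v[j] for j in choice))))
        return tuple(expanded)

    # The hypothesis over n variables is then h_A'(v) = h_A(expand_example(v, G, k)).
    g_and = lambda a, b: a and b                         # an illustrative G = {AND}, k = 2
    print(len(expand_example((1, 0, 1), [g_and], 2)))    # 3 + 1 * 3^2 = 12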
Note that if the learning algorithm A for F uses only positive examples or only negative ex-
amples, this property is preserved by the reduction of Theorem 7. As a corollary of Theorem 7 we
have that for most natural Boolean formula classes, the monotone learning problem is no harder
than the general learning problem:
Corollary 8 Let F = ∪_{n≥1} F_n be a parameterized class of Boolean formulae that is naming in-
variant and upward closed. Let monotone F be the class containing all monotone formulae, i.e.,
formulae containing no negations, in F . Then if monotone F is polynomially learnable, F is
polynomially learnable.

16
Proof: In the statement of Theorem 7, let G = {¬y}. Then all of the negated literals ¬x_1, ..., ¬x_n
can be obtained as instances of the single formula in G.
Theorem 7 says that the learning problem for a class of Boolean formulae does not become
harder if an unknown subset of the variables is replaced by a constant-sized set of formulae whose
inputs are unknown. The following result says this is also true if the number of substitution formulae
is larger, but the order and inputs are known.
Theorem 9 Let F = ∪_{n≥1} F_n be a parameterized class of Boolean formulae that is naming invariant
and upward closed. Let p(n) be a fixed polynomial, and let the description of the p(n)-tuple
(g^n_1, ..., g^n_{p(n)}) be computable in polynomial time on input n in unary, where each g^n_i is a Boolean
formula over n variables. Let F′_n consist of formulae of the form

    f(g^n_1(x_1, ..., x_n), ..., g^n_{p(n)}(x_1, ..., x_n))

where f ∈ F_{p(n)}. Let F′ = ∪_{n≥1} F′_n. Then if F is polynomially learnable, F′ is polynomially
learnable.
Proof: Let A be a polynomial-time learning algorithm for F. We describe a polynomial-time
learning algorithm A′ for F′ that uses algorithm A as a subroutine. Similar to the proof of Theorem 7,
A′ creates new variables z_1, ..., z_{p(n)}. The intention is that z_i will simulate g^n_i(x_1, ..., x_n).
When algorithm A requests a positive or a negative example, A′ takes a positive or negative
example (v_1, ..., v_n) ∈ {0,1}^n of the target formula f′(x_1, ..., x_n) ∈ F′_n and sets c_i = g^n_i(v_1, ..., v_n).
A′ then gives the vector (c_1, ..., c_{p(n)}) to A. As in the proof of Theorem 7, A must output an ε-good
hypothesis h_A over p(n) variables. We then define h_{A′}(v_1, ..., v_n) = h_A(c_1, ..., c_{p(n)}), for any
(v_1, ..., v_n) ∈ {0,1}^n, where each c_i is computed as described above.
If the learning algorithm A for F uses only positive examples or only negative examples, this
property is preserved by the reduction of Theorem 9.
Corollary 10 Let F = ∪_{n≥1} F_n be a parameterized class of Boolean formulae that is naming invariant
and upward closed and in which formulae in F_n are of length at most q(n) for some fixed
polynomial q(·). Let μF consist of all formulae in F in which each variable occurs at most once.
Then if μF is polynomially learnable, F is polynomially learnable.
Proof: Let f ∈ F, and let l be the maximum number of times any variable occurs in f. Then in
the statement of Theorem 9, let p(n) = q(n) and g^n_{in+j} = x_j for 0 ≤ i ≤ l − 1 and 1 ≤ j ≤ n.
Corollaries 8 and 10 are particularly useful for simplifying the learning problem for classes whose
polynomial-time learnability is in question. For example:

Corollary 11 If monotone DNF (respectively, monotone CNF) is polynomially learnable, then
DNF (CNF) is polynomially learnable.

It is important to note that the substitutions suggested by Theorems 7 and 9 and their corol-
laries do not preserve the underlying target distributions. For example, it does not follow from
Corollary 11 that if monotone DNF is polynomially learnable under uniform target distributions
(as is shown in Section 5) then DNF is polynomially learnable under uniform distributions. The re-
sults presented in this section should also work for other concept classes whenever similar conditions
hold, as pointed out by one referee.

4 A Negative Result
A number of polynomial-time learning algorithms in the literature require only positive examples
or only negative examples. Among other issues, this raises the question of whether every polyno-
mially learnable class is polynomially learnable either from positive examples only or from negative
examples only.
In this section we prove a superpolynomial lower bound on the number of examples required for
learning monomials from negative examples. The proof can actually be tightened to give a strictly
exponential lower bound, and the proof technique has been generalized in [G89]. A necessary
condition for (general) learning from positive examples is given in [S90]. Our bound is information-
theoretic in the sense that it holds regardless of the computational complexity and hypothesis class
of the negative-only learning algorithm and is independent of any complexity-theoretic assumptions.
By duality, we obtain lower bounds on the number of examples needed for learning disjunctions
from positive examples, and it follows from our proof that the same bound holds for learning
from negative examples any class properly containing monomials (e.g., kCNF ) or for learning from
positive examples any class properly containing disjunctions (e.g., kDNF ). In fact, these results
hold even for the monotone versions of these classes. We apply our lower bound to show that the
polynomially learnable class kCNF ∨ kDNF requires both positive and negative examples, thus
answering negatively the question raised above.
Theorem 12 Let A be a negative-only learning algorithm for the class of monotone monomials,
and let s_A^− denote the number of negative examples required by A. Then for any n and for ε and δ
sufficiently small constants, s_A^−(n) = Ω(2^{n/4}).
Proof: Let us fix ε = δ to be a sufficiently small constant, and assume for contradiction that A
is a negative-only learning algorithm for monotone monomials such that s_A^−(n) ≤ 2^{n/4} = L(n). Call
a monomial monotone dense if it is monotone and contains at least n/2 of the variables. Let T be an
ordered sequence of (not necessarily distinct) vectors from {0,1}^n such that |T| = L(n), and let Γ
be the set of all such sequences. If c is a monotone dense monomial, then define u_c ∈ {0,1}^n to be
the unique vector such that u_c ∈ pos(c) and u_c has the fewest bits set to 1. We say that T ∈ Γ is
legal negative for c if T contains no vector v such that v ∈ pos(c).
We first define target distributions for a monotone dense monomial c. Let D^+(u_c) = 1 and let
D^− be uniform over neg(c). Note that u_c ∉ pos(h) implies e^+(h) = 1, so any ε-good h must satisfy
u_c ∈ pos(h). For T ∈ Γ and c a monotone dense monomial, define the predicate P(T, c) to be 1 if
and only if T is legal negative for c and when T is received by A as a sequence of negative examples
for c from NEG, A outputs a hypothesis h_A such that u_c ∈ pos(h_A). Note that this definition
assumes that A is deterministic. To allow probabilistic algorithms, we simply change the definition
to P(T, c) = 1 if and only if T is legal negative for c, and when T is given to A, A outputs a
hypothesis h_A such that u_c ∈ pos(h_A) with probability at least 1/2, where the probability is taken
over the coin tosses of A.
Suppose we draw v uniformly at random from {0,1}^n. Fix any monotone dense monomial c.
We have

    Pr_{v∈{0,1}^n}(v ∈ pos(c)) ≤ 2^{n/2} · (1/2^n) = 1/2^{n/2}

since at most 2^{n/2} vectors can satisfy a monotone dense monomial. Thus, if we draw L(n) points
uniformly at random from {0,1}^n, the probability that we draw some point satisfying c is at most
L(n)/2^{n/2} ≤ 1/2 for n large enough. By this analysis, we conclude that the number of T ∈ Γ that are legal
negative for c must be at least |Γ|/2. Since D^− is uniform and A is a learning algorithm, at least
(|Γ|/2)(1 − δ) = (|Γ|/2)(1 − ε) of these must satisfy P(T, c) = 1. Let M(n) be the number of monotone
dense monomials over n variables. Then summing over all monotone dense monomials, we obtain

    (|Γ|/2)(1 − ε)M(n) ≤ Σ_{T∈Γ} N(T)
where N(T) is defined to be the number of monotone dense monomials satisfying P(T, c) = 1.
From this inequality, and the fact that N(T) is always at most M(n), we conclude that at least 1/8
of the T ∈ Γ must satisfy N(T) ≥ (1/8)(1 − ε)M(n). Since D^− is uniform, and since at least a fraction
1 − L(n)/2^{n/2}, for large n, of the T ∈ Γ are legal negative for the target monomial c, A has probability
at least 1/16 of receiving a T with such a large N(T) for n large enough. But then the hypothesis h_A
output by A has at least (1/8)(1 − ε)M(n) positive examples by definition of the predicate P. Since
the target monomial c has at most 2^{n/2} positive examples,

    e^−(h_A) ≥ ((1/8)(1 − ε)M(n) − 2^{n/2}) / 2^n

and this error must be less than ε. But this cannot be true for ε a small enough constant and n
large enough. Thus, A cannot achieve arbitrarily small error on monotone dense monomials, and
the theorem follows.
An immediate consequence of Theorem 12 is that monomials are not polynomially learnable
from negative examples (regardless of the hypothesis class). This is in contrast to the fact that
monomials are polynomially learnable (by monomials) from positive examples [V84]. It also follows
that any class that contains the class of monotone monomials (e.g., kCNF ) is not polynomially
learnable from negative examples. Further, Theorem 12 implies that the polynomially learnable
classes kCNF ∨ kDNF and kCNF ∧ kDNF of Corollaries 3 and 4 require both positive and negative
examples for polynomial learnability. The same is true for the class of decision lists studied by
Rivest [R87].
We also note that it is possible to obtain similar but weaker results with simpler proofs. For
instance, since 2-term DNF is not learnable by 2-term DNF unless NP = RP , it follows from
Theorem 1 that monomials are not polynomially learnable by monomials from negative examples
unless NP = RP . However, Theorem 12 gives a lower bound on the sample size for negative-only
algorithms that holds regardless of the hypothesis class, and is independent of any complexity-
theoretic assumption.
By duality we have the following lower bound on the number of positive examples needed for
learning monotone disjunctions.
Corollary 13 Let A be a positive-only learning algorithm for the class of monotone disjunctions,
and let s_A^+ denote the number of positive examples required by A. Then for any n and for ε and δ
sufficiently small constants, s_A^+(n) = Ω(2^{n/4}).
We conclude this section with a brief discussion of how the lower bound of Theorem 12 and
similar lower bounds can be used to obtain lower bounds on the expected number of examples for
algorithms whose sample size may depend on coin flips and the actual sequence of examples received.
We first define what we mean by the expected number of examples. Let A be a (randomized)
learning algorithm for a class C, let r̃ be an infinite sequence of bits (interpreted as the random
coin tosses for A), and let w̃ be an infinite sequence of alternating positive and negative examples
of some c ∈ C. Then we define s_A(ε, δ, r̃, w̃) to be the number of examples read by A (where each
request for an example results in either the next positive or next negative example being read from
w̃) on inputs ε, δ, r̃ and w̃. The expected sample complexity of A is then the supremum over all
c ∈ C and all target distributions D^+ and D^− for c of the expectation E(s_A(ε, δ, r̃, w̃)), where the
infinite bit sequence r̃ is drawn uniformly at random and the infinite example sequence w̃ is drawn
randomly according to the target distributions D^+ and D^−.
The basic format of the proof of the lower bound of Theorem 12 (as well as those of [BEHW86]
and [EHKV88]) is to give specific distributions such that a random sample of size at most B has
probability at least p of causing any learning algorithm to fail to output an ε-good hypothesis. To
obtain a lower bound on the expected sample complexity, let A be any learning algorithm, and let
q be the probability that A draws fewer than B examples when run on the same distributions that
were given to prove the deterministic lower bound. Then the probability that algorithm A fails to
output an ε-good hypothesis is bounded below by pq. Since A is a learning algorithm we must have
pq ≤ δ, so q ≤ δ/p. This gives a lower bound of (1 − δ/p)B on the expected sample complexity. Since
the value of p proved in Theorem 12 is 1/16, we immediately obtain an asymptotic lower bound of
Ω(n^k) for any constant k on the expected sample complexity of any negative-only learning algorithm
for monomials.
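Written out, the calculation in the preceding paragraph is the following one-line bound on the expected sample complexity (with B, p and q as defined above):

    % pq <= delta forces q <= delta/p, and therefore
    \mathbf{E}[s_A] \;\ge\; B \cdot \Pr[s_A \ge B] \;=\; (1 - q)\,B \;\ge\; \Bigl(1 - \frac{\delta}{p}\Bigr) B .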

5 Distribution-specific Learning in Polynomial Time


It has been shown elsewhere that for several natural representation classes the learning problem is
computationally intractable (modulo various complexity-theoretic or cryptographic assumptions),
in some cases even if we allow arbitrary polynomially evaluatable hypothesis representations (see
e.g. [KV89, PV88, PW88] for hardness results in the distribution-free model). In other cases,
most notably the class of unrestricted DNF formulae, researchers have been unable to provide firm
evidence for either the polynomial-time learnability or the intractability of learning. Given this
state of affairs, we seek to obtain partial positive results by weakening our demands on a learning
algorithm, thus making the computational problem easier (in cases such as Boolean formulae, where
we already have strong evidence for intractability) or the mathematical problem easier (in cases
such as DNF, where essentially nothing is currently known). This approach has been pursued
in at least two directions: by providing learning algorithms with additional information about
the target concept in the form of queries, and by relaxing the demand for performance against
arbitrary target distributions to that of performance against specific natural distributions. In this
section we describe results in the latter direction. Distribution-specific learning is also considered
in [BI88, N87].
We describe polynomial-time algorithms for learning under uniform distributions representa-
tion classes for which the learning problem under arbitrary distributions is either intractable or
unresolved. We begin with an algorithm for weakly learning the class of all monotone Boolean
functions under uniform target distributions.

5.1 A polynomial-time weak learning algorithm for all monotone Boolean functions under uniform distributions
The key to our algorithm will be the existence of a single input bit that is slightly correlated with
the output of the target function.
For T ⊆ {0,1}^n and u, v ∈ {0,1}^n define

    u ⊕ v = (u_1 ⊕ v_1, ..., u_n ⊕ v_n)

and T ⊕ v = {u ⊕ v : u ∈ T}. For 1 ≤ i ≤ n let e^{(i)} be the vector with the ith bit set to 1 and all
other bits set to 0.
The following lemma is due to Aldous [A86].
Lemma 14 [A86] Let T ⊆ {0,1}^n be such that |T| ≤ 2^{n}/2. Then for some 1 ≤ i ≤ n,

    |T ⊕ e^{(i)} − T| ≥ |T|/(2n).
Theorem 15 The class of all monotone Boolean functions is polynomially weakly learnable under
uniform D^+ and uniform D^−.
Proof: Let f be any monotone Boolean function on {0,1}^n. First assume that |pos(f)| ≤ 2^{n}/2.
For v ∈ {0,1}^n and 1 ≤ i ≤ n, let v_{i=b} denote v with the ith bit set to b ∈ {0,1}, i.e., v_i = b.
Now suppose that v ∈ {0,1}^n is such that v ∈ neg(f) and v_j = 1 for some 1 ≤ j ≤ n. Then
v_{j=0} ∈ neg(f) by monotonicity of f. Thus for any 1 ≤ j ≤ n we must have

    Pr_{v∈D^−}(v_j = 1) ≤ 1/2    (2)

since D^− is uniform over neg(f).
Let e^{(i)} be the vector satisfying |pos(f) ⊕ e^{(i)} − pos(f)| ≥ |pos(f)|/(2n) in Lemma 14 above. Let
v ∈ {0,1}^n be any vector satisfying v ∈ pos(f) and v_i = 0. Then v_{i=1} ∈ pos(f) by monotonicity
of f. However, by Lemma 14, the number of v ∈ pos(f) such that v_i = 1 and v_{i=0} ∈ neg(f) is at
least |pos(f)|/(2n). Thus, we have

    Pr_{v∈D^+}(v_i = 1) ≥ (1 − 1/(2n))(1/2) + 1/(2n) = 1/2 + 1/(4n).    (3)

Similarly, if |neg(f)| ≤ 2^{n}/2, then for any 1 ≤ j ≤ n we must have

    Pr_{v∈D^+}(v_j = 0) ≤ 1/2    (4)

and for some 1 ≤ i ≤ n,

    Pr_{v∈D^−}(v_i = 0) ≥ 1/2 + 1/(4n).    (5)

Note that either |pos(f)| ≤ 2^{n}/2 or |neg(f)| ≤ 2^{n}/2.
We use these differences in probabilities to construct a polynomial-time weak learning algorithm
A. A first assumes $|pos(f)| \ge 2^n/2$; if this is the case, then Equations 2 and 3 must hold. A then
attempts to find an index $1 \le k \le n$ satisfying
$$\Pr_{v \in D^+}(v_k = 1) \ge \frac{1}{2} + \frac{1}{8n}. \quad (6)$$
The existence of such a k is guaranteed by Equation 3. A finds such a k with high probability
by sampling POS enough times according to Fact CB1 and Fact CB2 to obtain an estimate p of
$\Pr_{v \in D^+}(v_k = 1)$ satisfying
$$\Pr_{v \in D^+}(v_k = 1) - \frac{1}{16n} < p < \Pr_{v \in D^+}(v_k = 1) + \frac{1}{16n}.$$
If A successfully identifies an index k satisfying Equation 6, then the hypothesis $h_A$ is defined as
follows: given an unlabeled input vector v, $h_A$ flips a biased coin and with probability $\frac{1}{32n}$ classifies
v as negative. With probability $1 - \frac{1}{32n}$, $h_A$ classifies v as positive if $v_k = 1$ and as negative if
$v_k = 0$. It is easy to verify by Equations 2 and 6 that this is a randomized hypothesis meeting the
conditions of weak learnability.
If A is unable to identify an index k satisfying Equation 6, then A assumes that $|neg(f)| \ge 2^n/2$,
and in a similar fashion proceeds to form a hypothesis $h_A$ based on the differences in probability
of Equations 4 and 5.
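To make the structure of A concrete, here is a minimal sketch in Python (not from the original paper): the example oracles POS and NEG, the sample size m, and the acceptance threshold are assumptions standing in for the Chernoff-bound bookkeeping of Facts CB1 and CB2.

    import random

    def weak_learn_monotone(POS, NEG, n, m):
        # POS() / NEG() are assumed to return uniform random positive /
        # negative examples of the target f as 0/1 lists of length n;
        # m is assumed large enough that each empirical frequency is
        # within 1/(16n) of its mean with high probability.
        pos_sample = [POS() for _ in range(m)]
        ones = [sum(v[k] for v in pos_sample) / m for k in range(n)]
        for k in range(n):
            # Accepting at 1/2 + 3/(16n) certifies Pr[v_k = 1] >= 1/2 + 1/(8n).
            if ones[k] >= 0.5 + 3.0 / (16 * n):
                def h(v, k=k):
                    # With probability 1/(32n) classify negative, else follow bit k.
                    return 0 if random.random() < 1.0 / (32 * n) else v[k]
                return h
        # Otherwise assume |neg(f)| >= 2^n / 2 and use Equations 4 and 5.
        neg_sample = [NEG() for _ in range(m)]
        zeros = [sum(1 - v[k] for v in neg_sample) / m for k in range(n)]
        for k in range(n):
            if zeros[k] >= 0.5 + 3.0 / (16 * n):
                def h(v, k=k):
                    # With probability 1/(32n) classify positive, else follow bit k.
                    return 1 if random.random() < 1.0 / (32 * n) else v[k]
                return h
        # For a monotone target one of the two searches succeeds with high
        # probability; this fallback only guards against sampling failure.
        return lambda v: random.randint(0, 1)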
It is shown in [BEHW86] (see also [EHKV88]) that the number of examples needed for learning
(and therefore the computation time required) is bounded below by the Vapnik-Chervonenkis
dimension of the target class; furthermore, this lower bound is proved using the uniform distribution
over a shattered set and holds even for the weak learning model in the case of superpolynomial
Vapnik-Chervonenkis dimension. From this it follows that the class of monotone Boolean
functions is not polynomially weakly learnable under arbitrary target distributions (since the
Vapnik-Chervonenkis dimension of this class is exponential in n) and that the class of all Boolean functions
is not polynomially weakly learnable under uniform target distributions (since the entire set $\{0,1\}^n$
is shattered). It can also be shown that the class of monotone Boolean functions is not polynomially
(strongly) learnable under uniform target distributions. To see this, consider only those monotone
functions defined by an arbitrary set of vectors with exactly half the bits on. The positive
examples are the vectors that can be obtained by choosing one of the vectors in the defining set and
turning 0 or more of its off bits on. This is clearly a monotone function, and it is not possible to
achieve $\epsilon = o(\frac{1}{2\sqrt{n}})$ on the uniform distribution: the vectors with half the bits on constitute $\Omega(\frac{1}{\sqrt{n}})$
of the distribution, and the target function is truly random on these vectors. Thus, Theorem 15
is optimal in the sense that generalization in any direction (uniform distributions to arbitrary
distributions, weak learning to strong learning, or monotone functions to arbitrary functions)
results in intractability.
5.2 A polynomial-time learning algorithm for μDNF under uniform distributions
We now give a polynomial-time algorithm for learning DNF in which each variable occurs at most
once (μDNF) under uniform target distributions. Recall that in the distribution-free setting, this
learning problem is as hard as the general DNF learning problem by Corollary 11.
Theorem 16 μDNF is polynomially learnable by μDNF under uniform $D^+$ and uniform $D^-$.
Proof: Let $f = m_1 + \cdots + m_s$ be the target μDNF formula over n variables, where each $m_i$ is a
monomial. Let d be such that $n^d = 1/\epsilon$, for $n \ge 2$. We say that a monomial m appearing in f is
significant if $\Pr_{v \in D^+}(v \in pos(m)) \ge \frac{\epsilon}{4n} = \frac{1}{4n^{d+1}}$. Note that $s \le n$ since no variable appears twice
in f. Thus the error on $D^+$ incurred by ignoring all monomials that are not significant is at most ε/4.
We now give an outline of the learning algorithm and then show how each step can be implemented
and prove its correctness.
For simplicity, we describe our algorithm under the assumption that the target formula is
monotone. It will be easy to see afterwards that this restriction can be removed,
because the algorithm decides separately for each variable whether or not the variable appears in the
formula, and with high probability it never mistakenly includes a variable that is absent from the target
formula. Since a μDNF formula can never include both a variable and its negation, the algorithm is
easily modified to handle the non-monotone case.
Algorithm A:
Step 1. Assume that every significant monomial in f has at least $r \log n$ literals for $r = 2d$. This
step will learn an approximation for f using only positive examples if this assumption is true.
If this assumption is not true, then we will discover this in Step 2, and learn correctly in Step
3 (using only negative examples). The substeps of Step 1 are as follows:
Substep 1.1. For each i, use positive examples to determine whether the variable $x_i$ appears
in one of the significant monomials of f. (With high probability, this step will find all
variables in significant monomials, some variables in insignificant monomials, and no
variables not in f.)
Substep 1.2. For each i, j such that variables $x_i$ and $x_j$ were determined in Substep 1.1 to
appear in some monomial of f, use positive examples to decide whether they appear in
the same significant monomial.
Substep 1.3. Form a μDNF hypothesis $h_A$ in the obvious way (a concrete sketch of this grouping
step is given just after the outline).
Step 2. Decide whether $h_A$ is an ε-good hypothesis by testing it on a polynomial number of
positive and negative examples. If it is decided that $h_A$ is ε-good, stop and output $h_A$.
Otherwise, guess that the assumption of Step 1 is not correct and go to Step 3.
Step 3. Assuming that some significant monomial in f is shorter than $r \log n$, we can also assume
that all the monomials are shorter than $2r \log n$, since the longer ones are not significant. We
use only negative examples in this step. The substeps are:
Substep 3.1. For each i, use negative examples to determine whether variable $x_i$ appears in
some significant monomial of f. (Again, this might accidentally include some variables
in insignificant monomials, as in Substep 1.1.)
Substep 3.2. For each i, j such that variables $x_i$ and $x_j$ were determined in Substep 3.1 to
appear in some monomial, use negative examples to decide if they appear in the same
monomial.
Substep 3.3. Form a μDNF hypothesis $h_A$ in the obvious way and stop.
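The "obvious way" of forming the hypothesis in Substeps 1.3 and 3.3 is to group the selected variables into monomials according to the pairwise same-monomial decisions. A minimal self-contained sketch (not from the paper; it simply takes the outcomes of the statistical tests below as inputs) is:

    def build_mu_dnf(variables, same_monomial_pairs):
        # Group variables into monomials: two variables end up in the same
        # monomial exactly when some chain of pairwise tests links them.
        parent = {i: i for i in variables}

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]   # path halving
                i = parent[i]
            return i

        for i, j in same_monomial_pairs:
            parent[find(i)] = find(j)           # union

        groups = {}
        for i in variables:
            groups.setdefault(find(i), []).append(i)
        return [sorted(g) for g in groups.values()]

    def evaluate_mu_dnf(monomials, v):
        # h_A(v) = 1 iff some monomial has all of its variables set to 1 in v
        # (monotone case).
        return int(any(all(v[i] == 1 for i in m) for m in monomials))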
Throughout the following analysis, we will make use of the following fact: let $E_1$ and $E_2$ be
events over a probability space, and let $\Pr(E_1 \cup E_2) = 1$ with respect to this probability space.
Then for any event E, we have
$$\Pr(E) = \Pr(E|E_1)\Pr(E_1) + \Pr(E|E_2)\Pr(E_2) - \Pr(E|E_1 \cap E_2)\Pr(E_1 \cap E_2) \quad (7)$$
$$= \Pr(E|E_1)\Pr(E_1) + \Pr(E|E_2)(1 - \Pr(E_1) + \Pr(E_1 \cap E_2)) - \Pr(E|E_1 \cap E_2)\Pr(E_1 \cap E_2)$$
$$= \Pr(E|E_1)\Pr(E_1) + \Pr(E|E_2)(1 - \Pr(E_1)) + K,$$
where $|K| \le \Pr(E_1 \cap E_2)$.
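The first line of Equation 7 is just inclusion-exclusion: since $\Pr(E_1 \cup E_2) = 1$,
$$\Pr(E) = \Pr(E \cap (E_1 \cup E_2)) = \Pr(E \cap E_1) + \Pr(E \cap E_2) - \Pr(E \cap E_1 \cap E_2),$$
and writing each term as a conditional probability times the probability of the conditioning event gives the first line; the remaining lines substitute $\Pr(E_2) = 1 - \Pr(E_1) + \Pr(E_1 \cap E_2)$ and collect the two $\Pr(E_1 \cap E_2)$ terms into K.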
In Step 1, we draw only positive examples. Since there are at most n (disjoint) monomials
in f, and we assume that the size of each monomial is at least $r \log n$, the probability that a
positive example of f drawn at random from $D^+$ satisfies 2 or more monomials of f is at most
$\frac{n}{2^{r \log n}} = \frac{1}{n^{r-1}} \ll \epsilon$. Therefore, in the following analysis, we restrict our attention to positive
examples of f which satisfy precisely one monomial of f. Intuitively, this restriction is not a
problem since positive examples that satisfy more than one monomial are very rare (they occur with
probability much less than ε). Thus we can carry out the analysis as if they were not there; in the
end these vectors can only contribute an error much smaller than ε.
Analysis of Substep 1.1. For each i, if the variable $x_i$ is not in any monomial of f,
$$\Pr_{v \in D^+}(v_i = 0) = \Pr_{v \in D^+}(v_i = 1) = \frac{1}{2}$$
since $D^+$ is uniform. Now suppose that variable $x_i$ appears in a significant monomial m of f. Note
that, since the length of a significant monomial is at least $r \log n$, we have
$$\frac{2^{r \log n - 1} - 1}{2^{r \log n} - 1} \le \Pr_{v \in D^+}(v_i = 1 \mid v \in neg(m)) \le \frac{1}{2}.$$
Then we have
$$\Pr_{v \in D^+}(v_i = 1) = \Pr_{v \in D^+}(v_i = 1 \mid v \in pos(m))\Pr_{v \in D^+}(v \in pos(m)) \quad (8)$$
$$\qquad + \Pr_{v \in D^+}(v_i = 1 \mid v \in neg(m))\Pr_{v \in D^+}(v \in neg(m))$$
$$\ge \Pr_{v \in D^+}(v \in pos(m)) + \frac{2^{r \log n - 1} - 1}{2^{r \log n} - 1}\left(1 - \Pr_{v \in D^+}(v \in pos(m))\right) \ge \frac{1}{2} + \frac{1}{8n^{d+1}} - O(\frac{1}{n^{2d-1}}).$$
Thus there is a difference of $\Omega(\frac{1}{n^{d+1}})$ between the probability that a variable appearing in a
significant monomial is set to 1 and the probability that a variable not appearing in f is set to 1.
Using Facts CB1 and CB2, we can determine with high probability if $x_i$ appears in a significant
monomial of f by drawing a polynomial number of examples from POS. Notice that if $x_i$ appears
only in monomials that are not significant then it really does not matter: in this process we may also
pick up some of the variables that appear only in insignificant monomials, and miss some others,
depending on their biases. These variables do not matter to us, and this will not affect Substeps
1.2 and 1.3.
Analysis of Substep 1.2. For each pair of variables $x_i$ and $x_j$ that appear in some monomial of
f (as decided in Substep 1.1), we now decide whether they appear in the same monomial of f.
Lemma 17 If variables $x_i$ and $x_j$ appear in the same monomial of f, then
$$\Pr_{v \in D^+}(v_i = 1 \text{ or } v_j = 1) = \frac{3}{4} + \frac{1}{2}\left(\Pr_{v \in D^+}(v_i = 1) - \frac{1}{2}\right) \pm O(\frac{1}{n^{r-1}}).$$
Proof: Since $x_i$ and $x_j$ appear in the same monomial of f and appear only once in f, we have
$\Pr_{v \in D^+}(v_i = 1) = \Pr_{v \in D^+}(v_j = 1)$ since $D^+$ is uniform. Let m be the monomial of f in which $x_i$
and $x_j$ appear, and let $E_1$ be the event that m is satisfied. Let $E_2$ be the event that at least one
monomial of f besides (but possibly in addition to) m is satisfied. Note that $\Pr_{v \in D^+}(E_1 \cup E_2) = 1$.
We use the following facts: since $D^+$ is uniform, $\Pr_{v \in D^+}(E_1 \cap E_2) \le \frac{1}{n^{r-1}}$ (because given that a positive
example already satisfies a monomial of f, the remaining variables are independent and uniformly
distributed), and $\Pr_{v \in D^+}(v_i = 1) = \Pr_{v \in D^+}(E_1) + \frac{1}{2}(1 - \Pr_{v \in D^+}(E_1))$. By these facts and Equation 7, we have
$$\Pr_{v \in D^+}(v_i = 1 \text{ or } v_j = 1)$$
$$= \Pr_{v \in D^+}(v_i = 1 \text{ or } v_j = 1 \mid E_1)\Pr_{v \in D^+}(E_1) + \Pr_{v \in D^+}(v_i = 1 \text{ or } v_j = 1 \mid E_2)\Pr_{v \in D^+}(E_2) - O(\frac{1}{n^{r-1}})$$
$$= \Pr_{v \in D^+}(E_1) + \frac{3}{4}(1 - \Pr_{v \in D^+}(E_1)) \pm O(\frac{1}{n^{r-1}})$$
$$= \frac{3}{4} + \frac{1}{4}\Pr_{v \in D^+}(E_1) \pm O(\frac{1}{n^{r-1}})$$
$$= \frac{3}{4} + \frac{1}{2}\left(\Pr_{v \in D^+}(v_i = 1) - \frac{1}{2}\right) \pm O(\frac{1}{n^{r-1}}). \quad \text{(Lemma 17)}$$
Lemma 18 If variables $x_i$ and $x_j$ appear in different monomials of f, then
$$\Pr_{v \in D^+}(v_i = 1 \text{ or } v_j = 1) = \frac{3}{4} + \frac{1}{2}\left(\Pr_{v \in D^+}(v_i = 1) - \frac{1}{2}\right) + \frac{1}{2}\left(\Pr_{v \in D^+}(v_j = 1) - \frac{1}{2}\right) \pm O(\frac{1}{n^{r-1}}).$$
Proof: Let $E_1$ be the event that the monomial $m_1$ of f containing $x_i$ is satisfied, and $E_2$ the event
that the monomial $m_2$ containing $x_j$ is satisfied. Let $E_3$ be the event that some monomial other
than (but possibly in addition to) $m_1$ and $m_2$ is satisfied. Note that $\Pr_{v \in D^+}(E_1 \cup E_2 \cup E_3) = 1$.
Then, similarly to the proof of Lemma 17, we have
$$\Pr_{v \in D^+}(v_i = 1 \text{ or } v_j = 1)$$
$$= \Pr_{v \in D^+}(v_i = 1 \text{ or } v_j = 1 \mid E_1)\Pr_{v \in D^+}(E_1) + \Pr_{v \in D^+}(v_i = 1 \text{ or } v_j = 1 \mid E_2)\Pr_{v \in D^+}(E_2)$$
$$\qquad + \Pr_{v \in D^+}(v_i = 1 \text{ or } v_j = 1 \mid E_3)\Pr_{v \in D^+}(E_3) - O(\frac{1}{n^{r-1}})$$
$$= \Pr_{v \in D^+}(E_1) + \Pr_{v \in D^+}(E_2) + \frac{3}{4}\Pr_{v \in D^+}(E_3) - O(\frac{1}{n^{r-1}})$$
$$= \Pr_{v \in D^+}(E_1) + \Pr_{v \in D^+}(E_2) + \frac{3}{4}\left(1 - \Pr_{v \in D^+}(E_1) - \Pr_{v \in D^+}(E_2)\right) \pm O(\frac{1}{n^{r-1}})$$
$$= \frac{3}{4} + \frac{1}{4}\left(\Pr_{v \in D^+}(E_1) + \Pr_{v \in D^+}(E_2)\right) \pm O(\frac{1}{n^{r-1}})$$
$$= \frac{3}{4} + \frac{1}{2}\left(\Pr_{v \in D^+}(v_i = 1) - \frac{1}{2}\right) + \frac{1}{2}\left(\Pr_{v \in D^+}(v_j = 1) - \frac{1}{2}\right) \pm O(\frac{1}{n^{r-1}}). \quad \text{(Lemma 18)}$$
From Equation 8, Lemma 17, Lemma 18 and the fact that if $x_i$ and $x_j$ appear in the same monomial
of f, then $\Pr_{v \in D^+}(v_i = 1) = \Pr_{v \in D^+}(v_j = 1)$, we have that there is a difference of $\Omega(\frac{1}{n^{d+1}} - \frac{1}{n^{r-1}})$
between the value of $\Pr_{v \in D^+}(v_i = 1 \text{ or } v_j = 1)$ in the two cases addressed by Lemmas 17 and 18.
Thus we can determine whether $x_i$ and $x_j$ appear in the same monomial by drawing a polynomial
number of examples from POS using Facts CB1 and CB2.
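Concretely, Substeps 1.1 and 1.2 reduce to frequency estimation on a positive sample. The following sketch is not from the paper: the oracle POS, the sample size m, and the particular acceptance thresholds (taken to be midpoints of the separations established above) are all assumptions.

    def positive_phase_tests(POS, n, d, m):
        # POS() is assumed to return a uniform random positive example as a
        # 0/1 list of length n; m is assumed large enough (via Facts CB1 and
        # CB2) for every estimate below to be accurate to o(1/n^(d+1)).
        sample = [POS() for _ in range(m)]

        # Substep 1.1: keep x_i iff Pr[v_i = 1] is noticeably above 1/2.
        p1 = [sum(v[i] for v in sample) / m for i in range(n)]
        gap = 1.0 / (16.0 * n ** (d + 1))   # roughly half of the 1/(8 n^(d+1)) bias
        variables = [i for i in range(n) if p1[i] >= 0.5 + gap]

        # Substep 1.2: compare Pr[v_i = 1 or v_j = 1] with the two values
        # predicted by Lemmas 17 and 18 and pick the closer one.
        same_pairs = set()
        for a, i in enumerate(variables):
            for j in variables[a + 1:]:
                p_or = sum(1 for v in sample if v[i] == 1 or v[j] == 1) / m
                same_val = 0.75 + 0.5 * (p1[i] - 0.5)          # Lemma 17
                diff_val = same_val + 0.5 * (p1[j] - 0.5)      # Lemma 18
                if p_or < (same_val + diff_val) / 2.0:
                    same_pairs.add((i, j))
        return variables, same_pairs

The pair (variables, same_pairs) can then be fed to the grouping routine sketched after the algorithm outline.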
In Step 2, we draw a polynomial number of examples from both POS and NEG to test if the
hypothesis $h_A$ produced in Step 1 is ε-good, again using Facts CB1 and CB2. If it is determined that
$h_A$ is not ε-good, then A guesses that the assumption made in Step 1 is not correct, and therefore that
there is a monomial in f which is of length at most $r \log n$. This implies that all the monomials of
length larger than $2r \log n$ are not significant. Therefore in Step 3 we assume that all the monomials
in f are shorter than $2r \log n$. We use only the negative examples.
Analysis of Substep 3.1. If variable $x_i$ does not appear in any monomial of f, then
$$\Pr_{v \in D^-}(v_i = 0) = \frac{1}{2} \quad (9)$$
since $D^-$ is uniform.
Lemma 19 If variable $x_i$ appears in a significant monomial of f, then
$$\Pr_{v \in D^-}(v_i = 0) \ge \frac{1}{2} + \frac{1}{2(n^{2r} - 1)}.$$
Proof: Let l be the number of literals in the monomial m of f that variable $x_i$ appears in. Then
in a vector v drawn at random from $D^-$, if some bit of v is set such that m is already not satisfied,
the remaining bits are independent and uniformly distributed. Thus
$$\Pr_{v \in D^-}(v_i = 0) = \frac{2^{l-1}}{2^l - 1} = \frac{1}{2} + \frac{1}{2(2^l - 1)}.$$
Since $l \le 2r \log n$, the claim follows. (Lemma 19)
By Equation 9 and Lemma 19, there is a difference of $\Omega(\frac{1}{n^{2r}})$ between the probability that a
variable in a significant monomial of f is set to 0 and the probability that a variable not appearing
in f is set to 0. Thus we can draw a polynomial number of examples from NEG, and decide if
variable $x_i$ appears in some significant monomial of f, using Facts CB1 and CB2.
Analysis of Substep 3.2. We have to decide whether variables $x_i$ and $x_j$ appear in the same
monomial of f, given that each appears in some monomial of f.
Lemma 20 If variables $x_i$ and $x_j$ are not in the same monomial of f, then
$$\Pr_{v \in D^-}(v_i = 0 \text{ and } v_j = 0) = \Pr_{v \in D^-}(v_i = 0)\Pr_{v \in D^-}(v_j = 0).$$
Proof: If $x_i$ and $x_j$ do not appear in the same monomial, then they are independent of each
other with respect to $D^-$, since each variable appears only once in f. (Lemma 20)
Lemma 21 If variables $x_i$ and $x_j$ appear in the same monomial of f, then
$$\Pr_{v \in D^-}(v_i = 0 \text{ and } v_j = 0) = \frac{1}{2}\Pr_{v \in D^-}(v_i = 0).$$
Proof:
$$\Pr_{v \in D^-}(v_i = 0 \text{ and } v_j = 0) = \Pr_{v \in D^-}(v_i = 0)\Pr_{v \in D^-}(v_j = 0 \mid v_i = 0).$$
But $\Pr_{v \in D^-}(v_j = 0 \mid v_i = 0) = \frac{1}{2}$. (Lemma 21)
By Lemmas 19, 20 and 21 we have that there is a difference of $\Omega(\frac{1}{n^{2r}})$ in the value of
$\Pr_{v \in D^-}(v_i = 0 \text{ and } v_j = 0)$ in the two cases addressed by Lemmas 20 and 21, since
$$\Pr_{v \in D^-}(v_i = 0) \ge \frac{1}{2} + \frac{1}{2(n^{2r} - 1)}.$$
Thus we can test if $x_i$ and $x_j$ appear in the same monomial of f by drawing a polynomial number
of examples from NEG using Facts CB1 and CB2. This completes the proof of Theorem 16.
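The negative-example tests of Substeps 3.1 and 3.2 have the same shape as the positive-example ones; again the sketch below is not from the paper, and the oracle NEG, the sample size m, and the midpoint thresholds are assumptions.

    def negative_phase_tests(NEG, n, r, m):
        # NEG() is assumed to return a uniform random negative example as a
        # 0/1 list of length n; r = 2d as in Step 1, and m is assumed large
        # enough for estimates accurate to o(1/n^(2r)).
        sample = [NEG() for _ in range(m)]

        # Substep 3.1: keep x_i iff Pr[v_i = 0] is noticeably above 1/2 (Lemma 19).
        p0 = [sum(1 - v[i] for v in sample) / m for i in range(n)]
        gap = 1.0 / (4.0 * n ** (2 * r))    # a constant fraction of the separation
        variables = [i for i in range(n) if p0[i] >= 0.5 + gap]

        # Substep 3.2: Pr[v_i = 0 and v_j = 0] is p0[i]*p0[j] for different
        # monomials (Lemma 20) but only p0[i]/2 for the same monomial
        # (Lemma 21), which is smaller; test against the midpoint.
        same_pairs = set()
        for a, i in enumerate(variables):
            for j in variables[a + 1:]:
                p00 = sum(1 for v in sample if v[i] == 0 and v[j] == 0) / m
                if p00 < (p0[i] * p0[j] + 0.5 * p0[i]) / 2.0:
                    same_pairs.add((i, j))
        return variables, same_pairs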
The results of [PV88] show that k-term DNF is not learnable by k-term DNF unless NP =
RP. However, the algorithm of Theorem 16 outputs a hypothesis with at most the same number
of terms as the target formula. This is an example of a class for which learning under arbitrary
target distributions is NP-hard, but learning under uniform target distributions is tractable.
6 Equivalence of Weak Learning and Group Learning
In this section we prove the equivalence of the model of weak learning with another model which
we call group learning. Informally, in group learning we ask that the learning algorithm output
a hypothesis that is 1 − ε accurate in classifying a polynomial-size group of examples that are
either all positive or all negative. Thus, the basic model of (strong) learnability is a special case of
group learning where the group size is 1. The question we wish to address here is whether learning
becomes easier in some cases if the group size is allowed to be larger.
Recently it has been shown by Schapire [S89] that in the distribution-free setting, polynomial-
time weak learning is in fact equivalent to polynomial-time strong learning. His proof gives a
recursive technique for taking an algorithm outputting hypotheses with accuracy slightly above
1/2 and constructing hypotheses of accuracy 1 − ε. This result combined with ours shows that
group learning is in fact equivalent to strong learning. Thus, allowing the hypothesis to accurately
classify only larger groups of (all positive or all negative) examples does not increase what is
polynomially learnable. These results also demonstrate the robustness of our underlying model of
learnability, since it is invariant under these apparently significant but reasonable modifications.
Related equivalences are given in [HKLW88].
Our equivalence proof also holds in both directions under fixed target distributions: thus, C is
polynomially group learnable under a restricted class of distributions if and only if C is polynomially
weakly learnable under these same distributions. As an immediate corollary, we have by the results
of Section 5.1 that the class of all monotone Boolean functions is group learnable in polynomial
time under uniform distributions. Furthermore, since it was argued in Section 5.1 that the class
of all monotone Boolean functions cannot be strongly learned in polynomial time under uniform
distributions, there cannot be a distribution-preserving reduction of the strong learning model
to the weak learning model. Thus, the problem of learning monotone functions under uniform
distributions exhibits a trade-off: we may either have accuracy slightly better than guessing on
single examples, or high accuracy on large groups of examples, but not high accuracy on single
examples.
Our formal definitions are as follows: for p any fixed polynomial in 1/ε, 1/δ and |c|, the hypothesis
$h_A$ of learning algorithm A is now defined over $X^{p(1/\epsilon, 1/\delta, |c|)}$. We ask that if $p(1/\epsilon, 1/\delta, |c|)$ examples
all drawn from $D^+$ (respectively, $D^-$) are given to $h_A$, then $h_A$ classifies this group as positive
(negative) with probability at least 1 − ε. If A runs in polynomial time, we say that C is polynomially
group learnable.
Theorem 22 Let C be a polynomially evaluatable parameterized Boolean representation class.
Then C is polynomially group learnable if and only if C is polynomially weakly learnable.
Proof: (If) Let A be a polynomial-time weak learning algorithm for C. We construct a polynomial-
time group learning algorithm A' for C as follows: A' first simulates algorithm A and obtains a
hypothesis $h_A$ that with probability 1 − δ has accuracy $\frac{1}{2} + \frac{1}{p(|c|, n)}$ for some polynomial p. Now
given m examples that are either all drawn from $D^+$ or all drawn from $D^-$, A' evaluates $h_A$ on
each example. Suppose all m points are drawn from $D^+$. Assuming that $h_A$ in fact has accuracy
$\frac{1}{2} + \frac{1}{p(|c|, n)}$, the probability that $h_A$ is positive on fewer than $\frac{1}{2}$ of the examples can be made smaller
than ε/2 by choosing m to be a large enough polynomial in 1/ε and $p(|c|, n)$, using Fact CB1. On the
other hand, if all m examples are drawn from $D^-$ then the probability that $h_A$ is positive on more
than $\frac{1}{2}$ of the examples can be made smaller than ε/2 for m large enough by Fact CB2. Thus, if $h_A$
is positive on more than $\frac{m}{2}$ of the m examples, A' guesses that the sample is positive; otherwise, A'
guesses that the sample is negative. The probability of misclassifying the sample is then at most
ε, so A' is a group learning algorithm.
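In code, the (If) direction is a simple majority vote over the group; the sketch below is only illustrative (h_weak stands for the hypothesis returned by the weak learner, and the group size is assumed to have been chosen as in the proof).

    def classify_group(h_weak, group):
        # group: a list of examples, all drawn from D+ or all drawn from D-.
        # Predict "positive" iff h_weak labels more than half of them positive.
        votes = sum(h_weak(x) for x in group)
        return int(votes > len(group) / 2.0)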
(Only if) Let A be a polynomial-time group learning algorithm for C. We use A as a subroutine
in a polynomial-time weak learning algorithm A'. Suppose algorithm A is run to obtain with high
probability an ε/2-good hypothesis $h_A$ for groups of size $l = p(2/\epsilon, 1/\delta, |c|, n)$ all drawn from $D^+$ or all
drawn from $D^-$, for some polynomial p.
Note that although $h_A$ is only guaranteed to behave correctly when given l positive examples
or l negative examples, the probability that $h_A$ outputs "positive" when given a mixture of positive
and negative examples is still well defined. Thus for $0 \le i \le l$, let $q_i$ denote the probability that $h_A$
is positive when given as input a group whose first i examples are drawn from $D^+$ and whose last
$l - i$ examples are drawn from $D^-$. Then since $h_A$ is an ε/2-good hypothesis we have $q_0 \le \epsilon/2$ and
$q_l \ge 1 - \epsilon/2$. Thus,
$$(q_l - q_0) \ge (1 - \epsilon/2) - \epsilon/2 = 1 - \epsilon.$$
Then
$$1 - \epsilon \le (q_l - q_0) = (q_l - q_{l-1}) + (q_{l-1} - q_{l-2}) + \cdots + (q_1 - q_0).$$
This implies that for some $1 \le j \le l$, $(q_j - q_{j-1}) \ge \frac{1 - \epsilon}{l} \ge \frac{1}{2l}$ for $\epsilon \le \frac{1}{2}$.
Let us now fix ε = 1/2. Algorithm A' first runs algorithm A with accuracy parameter ε/2. A' next
obtains an estimate $q_i'$ of $q_i$ for each $0 \le i \le l$ that is accurate within an additive factor of $\frac{1}{16l}$, that
is,
$$q_i - \frac{1}{16l} \le q_i' \le q_i + \frac{1}{16l}. \quad (10)$$
This is done by repeatedly evaluating $h_A$ on groups of l examples in which the first i examples are
drawn from $D^+$ and the rest are drawn from $D^-$, and computing the fraction of runs for which $h_A$
evaluates as positive. These estimates can be obtained in time polynomial in l and 1/δ with high
probability using Facts CB1 and CB2.
Now for the j such that $(q_j - q_{j-1}) \ge \frac{1}{2l}$, the estimates $q_j'$ and $q_{j-1}'$ will have a difference
of at least $\frac{1}{4l}$ with high probability. Furthermore, for any i, if $(q_i - q_{i-1}) \le \frac{1}{8l}$ then with high
probability the estimates satisfy $(q_i' - q_{i-1}') \le \frac{1}{4l}$. Let k be the index such that $(q_k' - q_{k-1}')$ is the
largest separation between adjacent estimates. Then we have argued that with high probability,
$(q_k - q_{k-1}) \ge \frac{1}{8l}$.
The intermediate hypothesis h of A' is now defined as follows: given an example whose
classification is unknown, h constructs l input examples for $h_A$ consisting of $k - 1$ examples drawn from
$D^+$, the unknown example, and $l - k$ examples drawn from $D^-$. The prediction of h is then the
same as the prediction of $h_A$ on this constructed group. The probability that h predicts positive
when the unknown example is drawn from $D^+$ is then $q_k$, and the probability that h predicts positive
when the unknown example is drawn from $D^-$ is $q_{k-1}$.
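As a sketch (not from the paper; h_A, POS and NEG are the group hypothesis and the example oracles, and fresh examples are drawn on every call, which is exactly the problem addressed next):

    def intermediate_hypothesis(h_A, POS, NEG, l, k, x):
        # Embed the unknown example x at position k of a group consisting of
        # k-1 fresh positive examples followed by l-k fresh negative examples.
        # If x ~ D+ the group looks like the i = k hybrid, if x ~ D- it looks
        # like the i = k-1 hybrid, so h_A says "positive" with probability
        # q_k or q_{k-1} respectively.
        group = [POS() for _ in range(k - 1)] + [x] + [NEG() for _ in range(l - k)]
        return h_A(group)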
The first problem with h is that new examples need to be drawn from $D^+$ and $D^-$ each time
an unknown point is classified. This sampling is eliminated as follows: for U a fixed sequence
of $k - 1$ positive examples of the target representation and V a fixed sequence of $l - k$ negative
examples, define h(U, V) to be the hypothesis h described above using the fixed constructed samples
consisting of U and V. Let $p^+(U, V)$ be the probability that h(U, V) classifies a random example
drawn from $D^+$ as positive, and let $p^-(U, V)$ be the probability that h(U, V) classifies a random
example drawn from $D^-$ as positive. Then for U drawn randomly according to $D^+$ and V drawn
randomly according to $D^-$, define the random variable $R(U, V) = p^+(U, V) - p^-(U, V)$. Then the
expectation of R obeys $E(R(U, V)) \ge 2\beta$ where $\beta = \frac{1}{8l}$.
However, it is always true that $R(U, V) \le 1$. Thus, let r be the probability that $R(U, V) \ge \beta$.
Then we have $r + (1 - r)\beta \ge 2\beta$. Solving, we obtain $r \ge \beta = \frac{1}{8l}$.
Thus, to fix this first problem, A' repeatedly draws U from $D^+$ and V from $D^-$ until $R(U, V) \ge \beta$.
By the above argument, this takes only $O(\frac{1}{\beta}\log\frac{1}{\delta})$ draws to succeed with probability exceeding 1 − δ.
Note that A' can test whether $R(U, V) \ge \beta$ is satisfied in polynomial time by sampling.
The (almost) final hypothesis $h(U_0, V_0)$ simply "hardwires" the successful $U_0$ and $V_0$, leaving
one input free for the example whose classification is to be predicted.
The second problem is that we still need to "center" the bias of the hypothesis $h(U_0, V_0)$, that
is, to modify it to provide a slight advantage over random guessing on both the distributions $D^+$
and $D^-$. Since $U_0$ and $V_0$ are now fixed, let us simplify our notation and write $p^+ = p^+(U_0, V_0)$
and $p^- = p^-(U_0, V_0)$. We assume that $\frac{1}{2} \ge p^+ > p^-$; the other cases can be handled by a similar
analysis (although for the case $p^+ > \frac{1}{2} > p^-$, it may not be necessary to center the bias). Recall
that we know the separation $(p^+ - p^-)$ is "significant" (that is, at least $\frac{1}{8l}$).
To center the bias, the final hypothesis h' of A' will be randomized and behave as follows: on
any input x, h' first flips a coin whose probability of heads is $p^*$, where $p^*$ will be determined by
the analysis. If the outcome is heads, h' immediately outputs 1. If the outcome is tails, h' uses
$h(U_0, V_0)$ to predict the label of x.
p + (1 , p)p,. To center the bias, the desired conditions on p are
p + (1 , p )p+ = 12 +
and
p + (1 , p )p, = 12 ,
for some \signi cant" quantity . Solving, we obtain
, p+ , p,
p = 21 , p+ , p,
and  + ,
= (1 , p )(2p , p ) :
Now it is easily veri ed that p  12 , and thus  321 l . Thus from suciently accurate estimates
of p+ and p, , A0 can determine an accurate approximation to p and thus center the bias.
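The centering step is pure arithmetic; the following sketch (not from the paper) computes p* and γ from estimates of p+ and p− using the formulas above and checks the two defining conditions.

    def center_bias(p_plus, p_minus):
        # Assumes 1/2 >= p_plus > p_minus.  Returns (p_star, gamma) such that
        # outputting 1 with probability p_star and otherwise deferring to
        # h(U0, V0) accepts with probability 1/2 + gamma on D+ and
        # 1/2 - gamma on D-.
        p_star = (1.0 - p_plus - p_minus) / (2.0 - p_plus - p_minus)
        gamma = (1.0 - p_star) * (p_plus - p_minus) / 2.0
        assert abs(p_star + (1 - p_star) * p_plus - (0.5 + gamma)) < 1e-12
        assert abs(p_star + (1 - p_star) * p_minus - (0.5 - gamma)) < 1e-12
        return p_star, gamma

For example, center_bias(0.4, 0.3) returns approximately (0.231, 0.038), and indeed 0.231 + 0.769 · 0.4 ≈ 0.538 = 1/2 + 0.038.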
7 Concluding Remarks
A great deal of progress has been made towards understanding the complexity of learning in our
model since the results presented here were first announced. The interested reader is encouraged
to consult the proceedings of the annual workshops on Computational Learning Theory (Morgan
Kaufmann Publishers) for further investigation.
Perhaps the most important remaining open problem suggested by this research is whether the
class of polynomial size DNF formulae is learnable in polynomial time in our model.
8 Acknowledgements
We would like to thank Lenny Pitt for helpful discussions on this paper and Umesh Vazirani for
many ideas on learning under uniform distributions. We also thank the two very careful referees
who have corrected many errors and suggested many improvements.
References
[A86] D. Aldous.
On the Markov chain simulation method for uniform combinatorial distributions and
simulated annealing.
University of California at Berkeley Statistics Department,
technical report number 60, 1986.
[AV79] D. Angluin, L.G. Valiant.
Fast probabilistic algorithms for Hamiltonian circuits and matchings.
Journal of Computer and System Sciences,
18, 1979, pp. 155-193.
[BI88] G.M. Benedek, A. Itai.
Learnability by fixed distributions.
Proceedings of the 1988 Workshop on Computational Learning Theory,
Morgan Kaufmann Publishers, 1988, pp. 80-90.
[BEHW86] A. Blumer, A. Ehrenfeucht, D. Haussler, M. Warmuth.
Classifying learnable geometric concepts with the Vapnik-Chervonenkis dimension.
Proceedings of the 18th A.C.M. Symposium on the Theory of Computing,
1986, pp. 273-282.
[C52] H. Chernoff.
A measure of asymptotic efficiency for tests of a hypothesis based on the sum of
observations.
Annals of Mathematical Statistics,
23, 1952, pp. 493-509.
[EHKV88] A. Ehrenfeucht, D. Haussler, M. Kearns, L.G. Valiant.
A general lower bound on the number of examples needed for learning.
Proceedings of the 1988 Workshop on Computational Learning Theory,
Morgan Kaufmann Publishers, 1988, pp. 139-154.
[GJ79] M. Garey, D. Johnson.
Computers and intractability: a guide to the theory of NP-completeness.
Freeman, 1979.
[G89] M. Gereb-Graus.
Complexity of learning from one-sided examples.
Harvard University, unpublished manuscript, 1989.
[H88] D. Haussler.
Quantifying inductive bias: AI learning algorithms and Valiant's model.
Artificial Intelligence,
36(2), 1988, pp. 177-221.
[HKLW88] D. Haussler, M. Kearns, N. Littlestone, M. Warmuth.
Equivalence of models for polynomial learnability.
Proceedings of the 1988 Workshop on Computational Learning Theory,
Morgan Kaufmann Publishers, 1988, pp. 42-55, and
University of California at Santa Cruz Information Sciences Department,
technical report number UCSC-CRL-88-06, 1988.
[HSW88] D. Helmbold, R. Sloan, M. Warmuth.
Bootstrapping one-sided learning.
Unpublished manuscript, 1988.
[KLPV87a] M. Kearns, M. Li, L. Pitt, L.G. Valiant.
On the learnability of Boolean formulae.
Proceedings of the 19th A.C.M. Symposium on the Theory of Computing,
1987, pp. 285-295.
[KLPV87b] M. Kearns, M. Li, L. Pitt, L.G. Valiant.
Recent results on Boolean concept learning.
Proceedings of the 4th International Workshop on Machine Learning,
Morgan Kaufmann Publishers, 1987, pp. 337-352.
[KV89] M. Kearns, L.G. Valiant.
Cryptographic limitations on learning Boolean formulae and finite automata.
Proceedings of the 21st A.C.M. Symposium on the Theory of Computing,
1989, pp. 433-444.
[L88] N. Littlestone.
Learning quickly when irrelevant attributes abound: a new linear threshold algorithm.
Machine Learning,
2(4), 1988, pp. 245-318.
[N87] B.K. Natarajan.
On learning Boolean functions.
Proceedings of the 19th A.C.M. Symposium on the Theory of Computing,
1987, pp. 296-304.
[PV88] L. Pitt, L.G. Valiant.
Computational limitations on learning from examples.
Journal of the A.C.M.,
35(4), 1988, pp. 965-984.
[PW88] L. Pitt, M.K. Warmuth.
Reductions among prediction problems: on the difficulty of predicting automata.
Proceedings of the 3rd I.E.E.E. Conference on Structure in Complexity Theory,
1988, pp. 60-69.
[R87] R. Rivest.
Learning decision lists.
Machine Learning,
2(3), 1987, pp. 229-246.
[S89] R. Schapire.
The strength of weak learnability.
Machine Learning,
5(2), 1990, pp. 197-227.
[S90] H. Shvayster.
A necessary condition for learning from positive examples.
Machine Learning,
5(1), 1990, pp. 101-113.
[V84] L.G. Valiant.
A theory of the learnable.
Communications of the A.C.M.,
27(11), 1984, pp. 1134-1142.
[V85] L.G. Valiant.
Learning disjunctions of conjunctions.
Proceedings of the 9th International Joint Conference on Artificial Intelligence,
1985, pp. 560-566.