Cast your mind back to the beginning of this book, where I said that the behavior of systems depends on their architectures, tasks, and environments. In the intervening chapters I presented methods for discovering and demonstrating how factors interact to influence behavior. Some factors are features of architectures (e.g., whether a part-of-speech tagger uses tag-repair heuristics) and some are features of tasks (e.g., whether one therapy will serve for several diseases). Increasingly, as AI systems become embedded in real-world applications, we must consider environmental factors, such as the mean time between tasks, noise in data, and so on. With the toolbox of methods from the previous chapters, we can detect faint hints of unsuspected factors and amplify them with well-designed experiments. We can demonstrate gross effects of factors and dissect them into orthogonal components; and we can demonstrate complex interactions between factors. Having done our reductionist best to understand individual and joint influences on behavior, we are left with something like a carefully disassembled mantel clock. We know the pieces and how they connect, we can differentiate functional pieces from ornamental ones, and we can identify subassemblies. We even know that the clock loses time on humid afternoons, and that humidity, not the time of day, does the damage. But we don't necessarily have an explanation of how all the pieces work and how humidity affects them. We need to complement reductionism with synthesis. Having discovered the factors that influence performance, we must explain how they all work together. Such explanations are often called models.
Different sciences have different modeling tools, of course, and some regard programs as models. For example, cognitive scientists offer programs as models of specific behaviors, such as verbal learning, or even as architectures of cognition (Newell, 1990; Anderson, 1983). It's difficult to think of a science that doesn't use computer-based models or simulations. But within AI we tend to view the behaviors of programs not as models of phenomena, but as phenomena in their own right. When we speak
310 Modeling
and p and c as features of the environment, whereas I is the optimal behavior, the optimal monitoring interval.
In specifying what our agent should do, our model is normative. Normative models are not common in artificial intelligence, partly because optimality usually requires exhaustive search (an impossibility for most AI problems), and partly because our intellectual tradition is based in Simon's notion of bounded rationality and his rejection of normative models in economics. The models discussed in this chapter are performance models: They describe, predict, and explain how systems behave.
A good performance model is a compromise. It should fit but not overfit sample data; that is, it shouldn't account for every quirk in a sample if doing so obscures general trends or principles. A related point is that a good model compromises between accuracy and parsimony. For instance, multiple regression models of the kind described in section 8.6 can be made more accurate by adding more predictor variables, but the marginal utility of predictors decreases, while the number of interactions among predictors increases. These interactions must be interpreted or explained. Other compromises pertain to the scope and assumptions that underlie models. For example, the scope of our illustrative model I = (2c/up)^.5 is just those situations in which the probability p of a message arriving is stationary, unchanging over time. But because a model for nonstationary probabilities is more complicated, we use the stationary model in nonstationary situations, where its accuracy is compromised but sufficient for heuristic purposes.
Good models capture essential aspects of a system. They don't merely simplify; they assert that some things are important and others are not. For example, our illustrative model asserts that only three quantities and a few assumptions are important, and every other aspect of any monitoring situation is unimportant. This is quite a claim. Quite apart from its veracity, we must appreciate how it focuses and motivates empirical work. It says, “find out whether the assumptions are warranted; see whether c, u, and p are truly the only important factors.” Lacking a model, what can we say of an empirical nature? That our communications assistant appears to work; that its users are happy; that its success might be due to anything. In short, as a model summarizes the essential aspects and interactions in a system, it also identifies assumptions and factors to explore in experiments.
Artificial intelligence researchers can be divided into those who use models and those who don't. The former group concerns itself with theoretical and algorithmic issues while the latter builds relatively large, complicated systems. I drew these conclusions from my survey of 150 papers from the Eighth National Conference on Artificial Intelligence (Cohen, 1991). I find them disturbing because surely architects and systems engineers need models the most. Think of any complicated artifacts—
In a complicated system, some design decisions have theoretical weight while others are expedient. If you write a paper about your system or describe it to a colleague, you emphasize the important decisions. You say, “It's called ‘case-based’ reasoning, because it re-uses cases instead of solving every problem from first principles. The two big design issues are retrieving the best cases and modifying them for new problems.” You don't talk about implementation details because, although they influence how your system behaves, you regard them as irrelevant. Allen Newell made this important distinction in his 1982 paper, The Knowledge Level:
The system at the knowledge level is the agent. The components at the knowledge level are goals, actions and bodies. ... The medium at the knowledge level is knowledge (as might be suspected). Thus, the agent processes its knowledge to determine the actions to take. Finally, the behavior law is the principle of rationality: Actions are selected to attain the agent's goals.

The knowledge level sits in the hierarchy of systems levels immediately above the symbol level.

As is true of any level, although the knowledge level can be constructed from the level below (i.e., the symbol level), it also has an autonomous formulation as an independent level. Thus, knowledge can be defined independent of the symbol level, but it can also be reduced to symbol systems.

The knowledge level permits predicting and understanding behavior without having an operational model of the processing that is actually being done by the agent. The utility of such a level would seem clear given the widespread need in AI's affairs for distal prediction, and also the paucity of knowledge about the internal workings of humans. ... The utility is also clear in designing AI systems, where the internal mechanisms are still to be specified. To the extent that AI systems successfully approximate rational agents, it is also useful for predicting and understanding them. Indeed the usefulness extends beyond AI systems to all computer programs. (1982, pp. 98–108)
level of abstraction, (3) can be executed to permit direct observation of the behavior it predicts, (4) permits clear separation of the essential, theoretically-motivated components from the inessential but necessary algorithmic details” (Cooper et al., 1992, p. 8). Sceptic is syntactically very much like Prolog, and above-the-line components of Sceptic models are collections of Prolog-like rules. To illustrate, I will extract Sceptic rules for SOAR's elaboration phase: When SOAR gets some input from its environment, it puts tokens representing the input into working memory, after which it searches long-term memory for productions that match the contents of working memory. In Sceptic, this phase is implemented as follows:
elaboration_phase:
    true
    -> mark_all_wmes_as_old,
       input_cycle,
       continue_cycling_if_not_quiescent

continue_cycling_if_not_quiescent:
    not(quiescent)
    -> elaboration_cycle,
       output_cycle,
       input_cycle,
       continue_cycling_if_not_quiescent

elaboration_cycle:
    generate_unrefracted_instantiation(Instantiation, FiringGoal),
    not(im_match(Instantiation, FiringGoal))
    -> im_make(Instantiation, FiringGoal),
       fire_production(Instantiation, FiringGoal)
The first rule defines the elaboration phase: It marks all current working memory elements as old, then it gets input from the environment and adds it to working memory, and then it executes the rule continue_cycling_if_not_quiescent. This rule tests for quiescence (a state in which no new productions are triggered by the contents of working memory), and if quiescence has not been achieved it does several other things pertinent to instantiating and executing productions from long term memory. For instance, the elaboration cycle generates an instantiation and firing goal for a production and, if it is not already in instantiation memory (tested by im_match), it puts it there and then fires the production.
Five Sceptic rules are required to implement the above-the-line model of SOAR's elaboration phase, and several below-the-line functions must be supplied as well.
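The control flow these rules encode can be sketched in ordinary procedural code. The Python fragment below is only an illustrative analogue, not part of Sceptic or SOAR; the representation of productions as (condition, token) pairs and all names are invented for the example.

```python
def elaboration_phase(working_memory, productions, new_input):
    """Sketch of the elaboration phase as rendered by the Sceptic rules:
    add input to working memory, then fire matching productions until
    quiescence. Productions are (condition, token) pairs, where condition
    is a frozenset of tokens that must all be in working memory."""
    wm = set(working_memory) | set(new_input)   # the input cycle
    fired = set()   # instantiation memory (cf. im_make and the im_match test)
    while True:
        # generate unrefracted instantiations: productions whose conditions
        # match working memory and that have not already fired
        ready = [(cond, token) for cond, token in productions
                 if cond <= wm and (cond, token) not in fired]
        if not ready:          # quiescence: no new productions triggered
            return wm
        for inst in ready:
            fired.add(inst)    # record the instantiation
            wm.add(inst[1])    # fire the production: add its token
```

Chained productions fire across successive iterations: given productions a→b and b→c and input {a}, the phase runs until quiescence with working memory {a, b, c}.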
Programs as Models, Executable Specifications and Essential Miniatures
Figure 8.1 The cost of tagging errors as a function of the number of tag-trigrams learned.
Unfortunately, this result says little about how the confusion depends on other factors. For instance, OTB improved its performance by learning; did it learn to avoid confusing nouns and noun modifiers? Figure 8.1 shows, in fact, that the differences between lax and stringent scores were most pronounced when OTB had learned fewer than two hundred tag-trigrams. After that, the mean difference hovers around 5 percent. For simplicity, call the mean difference between lax and stringent scores the cost of noun/noun modifier tagging errors. We will derive a model that relates this cost to OTB's learning. Even though figure 8.1 shows clearly that the relationship is nonlinear, I will develop a linear model; later, I will transform Lehnert and McCarthy's data so a linear model fits it better.
that the student's mean squared error is the sample variance of the y scores, s². It should also be clear that s² is a lower bound on one's ability to predict the y value of a coordinate, given a corresponding x value. One's mean squared error should be no higher than s², and might be considerably lower if one knows something about the relationship between x and y.
For instance, after looking carefully at the scatterplot in figure 8.2, you guess the following rule:

Your Rule: y_i = 7.27x_i + 27.

Now you are asked, “What y value is associated with x = 1?” and you respond “y = 34.27.” The difference between your answer and the true value of y is called a residual. Suppose the residuals are squared and summed, and the result is divided by N − 1. This quantity, called the mean squared residual, is one assessment of the predictive power of your rule. As you would expect, the mean squared residual associated with your rule is much smaller than the sample variance; in fact, your rule gives the smallest possible mean squared residual. The line that represents your rule in figure 8.3 is called a regression line, and it has the property of being the least-squares fit to the points in the scatterplot. It is the line that minimizes mean squared residuals.
For simple regression, which involves just one predictor variable, the regression line y = bx + a is closely related to the correlation coefficient. Its parameters are:

b = r (s_y / s_x),    a = ȳ − b x̄    (8.1)
319 Cost as a Function of Learning: Linear Regression
where r is the correlation between x and y, and s_x and s_y are the standard deviations of x and y, respectively. Note that the regression line goes through the mean of x and the mean of y. A general procedure to solve for the parameters of the regression line is described in section 8A.
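These parameter formulas are easy to sketch in code. The helper name and test data below are mine, not from the text:

```python
import math

def regression_params(xs, ys):
    """Slope b = r * (s_y / s_x) and intercept a = ybar - b * xbar,
    per the parameter formulas for simple regression."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - ybar) ** 2 for y in ys) / (n - 1))
    r = sum((x - xbar) * (y - ybar)
            for x, y in zip(xs, ys)) / ((n - 1) * sx * sy)
    b = r * sy / sx
    a = ybar - b * xbar          # forces the line through (xbar, ybar)
    return b, a

b, a = regression_params([1, 2, 3, 4], [2.0, 4.1, 5.9, 8.0])
```

Because a = ȳ − b x̄, the fitted line necessarily passes through the point of means, as noted above.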
As we saw earlier, without a regression line the best prediction of y is ȳ, yielding deviations (y_i − ȳ) and mean squared deviation s_y² = Σ(y_i − ȳ)²/(N − 1). Each deviation is a mistake, an incorrect stab at the value of y. Lacking a regression line, s_y² represents the “average squared mistake.” The predictive power of the regression line is usually expressed in terms of its ability to reduce these mistakes.
Recall how one-way analysis of variance decomposes variance into a systematic component due to a factor and an unsystematic “error” component (see chapter 6). In an essentially identical manner, linear regression breaks deviations (y_i − ȳ) (and thus variance) into a component due to the regression line and a residual, or error, component. This is illustrated in figure 8.4 for two points and their predicted values, shown as filled and open circles, respectively. The horizontal line is ȳ and the upward-sloping line is the regression line. The total deviation y_i − ȳ is divided into two parts:

(y_i − ȳ) = (ŷ_i − ȳ) + (y_i − ŷ_i).

If (y_i − ŷ_i) were zero for all points, then all points would fall exactly on the regression line; that is, the filled and open circles in figure 8.4 would coincide.
The sum of squared deviations can also be broken into two parts:

SS_total = SS_reg + SS_res.

In words, the total sum of squares comprises the sum of squares for regression plus the sum of squares residual. The proportion of the total variance accounted for by the regression line is

r² = SS_reg / SS_total.    (8.2)
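These identities are easy to verify numerically. A sketch with invented data (not Lehnert and McCarthy's):

```python
# Verify SS_total = SS_reg + SS_res and r^2 = SS_reg / SS_total
# on a small invented data set.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 2.9, 5.2, 6.8, 9.1]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n

# least-squares line
b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
    / sum((x - xbar) ** 2 for x in xs)
a = ybar - b * xbar
preds = [b * x + a for x in xs]

ss_total = sum((y - ybar) ** 2 for y in ys)
ss_reg = sum((p - ybar) ** 2 for p in preds)     # due to the regression line
ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))  # residual component
r_squared = ss_reg / ss_total   # proportion of variance accounted for
```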
It is a simple matter to fit a regression line to Lehnert and McCarthy's data. The line is shown in the left panel in figure 8.5, which relates the number of tag-trigrams OTB has learned (horizontal axis) to the cost of noun/noun modifier tagging errors (vertical axis). Although this line is the best possible fit to the data (in the least-squares sense; see the appendix to this chapter), it accounts for only 44 percent of the variance in the cost of tagging errors.
In general, when a model fails to account for much of the variance in a performance variable, we wonder whether the culprit is a systematic nonlinear bias in our data,
Transforming Data for Linear Models
Figure 8.5 Linear regression and residuals for Lehnert and McCarthy's data. (Both panels: horizontal axis is tag-trigrams learned, 0–900.)
outliers, or unsystematic variance. Systematic bias is often seen in a residuals plot, a scatterplot in which the vertical axis represents residuals y_i − ŷ_i from the regression line and the horizontal axis represents the predictor variable x. The right panel of figure 8.5 shows a residuals plot for the regression line in the left panel. Clearly, there is much structure in the residuals. As it happens, the residuals plot shows nothing that wasn't painfully evident in the original scatterplot: The regression line doesn't fit the data well because the data are nonlinear. In general, however, residuals plots often disclose subtle structure because the ranges of their vertical axes tend to be narrower than in the original scatterplot (e.g., see figure 8.11.e). Conversely, if a regression model accounts for little variance in y (i.e., r² is low) but no structure or outliers are seen in the residuals plot, then the culprit is plain old unsystematic variance, or noise.
Thus they can change the shape of a function, even straightening a distinctly curved one.
Consider how to straighten Lehnert and McCarthy's data, reproduced in the top-left graph in figure 8.6. Let y denote the performance variable and x denote the level of learning. The function plunges from y = .375 to y = .075 (roughly) as x increases from 0 to 100, then y doesn't change much as x increases further. Imagine the function is plotted on a sheet of rubber that can be stretched or compressed until the points form a straight line. What should we do? Four options are:

• Compress the region above y = .1 along the y axis.
• Stretch the region below y = .1 along the y axis.
• Compress the region to the right of x = 100 along the x axis.
• Stretch the region to the left of x = 100 along the x axis.
A log transform stretches the distances between small values and compresses the distances between large ones; for example, log(20) − log(10) = .3 whereas log(120) − log(110) = .038, so a difference of ten for large values is compressed relative to the same difference between small values. Figure 8.6 shows three log transformations of y and x. The top-right graph shows a plot of x against log(y); as expected, points with smaller y values are spread out more along the y axis, and you can see more detail than when the points were scrunched together. A regression line fit to these data explains more of the variance, too. The regression of log(y) on x explains 66.5 percent of the variance in log(y). The bottom-left graph shows a regression of y on log(x). As expected, points with small x values are spread wider by the transformation, and points with large x values are compressed. Unfortunately, this creates a dense mass of points at the right of the graph, making it very difficult to see any details, although the variance in y accounted for by log(x) is pretty high, r² = 82.3 percent. Finally, we can transform both x and y, yielding the graph at the lower right of figure 8.6. Notice that the worrying clump of points has been spread apart along the y axis because they had low y values. Nearly 90 percent of the variance in log(y) is explained by log(x).
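The effect of such transformations on r² is easy to demonstrate. In this sketch the data are synthetic, a power law y = 3x^−.5 that loosely mimics the plunge-then-flatten shape; they are not Lehnert and McCarthy's numbers.

```python
import math

def r_squared(xs, ys):
    """Squared correlation between xs and ys."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)

xs = [10, 50, 100, 200, 400, 700, 900]
ys = [3 * x ** -0.5 for x in xs]    # plunges early, then levels off

raw = r_squared(xs, ys)                        # poor linear fit
loglog = r_squared([math.log(x) for x in xs],
                   [math.log(y) for y in ys])  # essentially perfect
```

A power law is exactly linear in log-log coordinates, so the transformed fit is nearly perfect, while the raw linear fit accounts for only about half the variance.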
The log transform is but one of many used by data analysts for a variety of purposes. In addition to straightening functions, transformations are also applied to make points fall in a more even distribution (e.g., not clumped as in the lower left graph in figure 8.6), to give them a symmetric distribution along one axis, or to make the variances of distributions of points roughly equal. Some of these things are done to satisfy assumptions of statistical tests; for instance, regression and analysis of variance assume equal variance in distributions of y values at all levels of x. In his book,
[Display of re-expression functions, including powers of x such as x' = −1/x, x' = x, and x' = x².]
Figure 8.7 shows how these re-expressions (or transformations) stretch a function in different parts of its range. The straight line is, of course, x' = x. To pull apart large values of x but leave small values relatively unchanged, you would use a power function with an exponent greater than one, such as x' = x². You can see that the deviation between x and x' = x² increases with x; therefore, x' = x² spreads points farther apart when they have big x values. Alternatively, to pull apart small values of x and make larger values relatively indistinguishable, you would
Figure 8.7 Graphing transformation functions illustrates their effects on different parts of the range.
One can derive confidence intervals (secs. 4.6, 5.2.3) for several parameters of regression equations, including

• the slope of the regression line
• the y intercept
• the population mean of y values.

Confidence intervals for the slope of the regression line will illustrate the general approach. See any good statistics text (e.g., Olson, 1987, pp. 448–452) for confidence intervals for other parameters.
325 Confidence Intervals for Linear Regression Models
Here, x̄ is the statistic that estimates μ, and σ̂_x̄ is the estimated standard error of the sampling distribution of x̄, which is a t distribution. Under the assumption that the predicted variable y is normally distributed with equal variance at all levels of the predictor variable x, confidence intervals for parameters of regression lines have exactly the same form. This is the 95 percent confidence interval for the parametric slope of the regression line, β:

b − t_.025 σ̂_b ≤ β ≤ b + t_.025 σ̂_b.

As always, the tricky part is estimating the standard error of the statistic; the estimate is

σ̂_b = √( (SS_res / (N − 2)) / Σ(x_i − x̄)² ).
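A sketch of this parametric interval in Python; the data are invented, and the critical value t(.025) for N − 2 = 10 degrees of freedom is taken from a t table:

```python
import math

# Invented sample with true slope near 2
xs = list(range(1, 13))
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 20.1, 21.8, 24.2]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n

sxx = sum((x - xbar) ** 2 for x in xs)
b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
a = ybar - b * xbar

# Estimated standard error of b: sqrt(mean square residual / sum of squares of x)
ss_res = sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))
se_b = math.sqrt((ss_res / (n - 2)) / sxx)

t_crit = 2.228          # t(.025) for 10 degrees of freedom, from a table
ci_low, ci_high = b - t_crit * se_b, b + t_crit * se_b
```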
Figure 8.9 Scatterplot of WindSpeed and FinishTime showing heteroscedasticity. The solid line is the regression line for the sample of 48 points. The dotted lines are bootstrap confidence intervals on the slope of the regression line.
The assumptions that underlie parametric confidence intervals are often violated. The predicted variable sometimes isn't normally distributed, and even when it is, the variance of the normal distribution might not be equal at all levels of the predictor variable. For example, figure 8.9 shows a scatterplot of FinishTime predicted by WindSpeed in forty-eight trials of the Phoenix planner. The variances of FinishTime at each level of WindSpeed are not equal, suggesting that they might not be equal in the populations from which these data were drawn (a property called heteroscedasticity, as opposed to homoscedasticity, or equal variances). This departure from the assumptions of parametric confidence intervals might be quite minor and unimportant, but even so, the assumptions are easily avoided by constructing a bootstrap confidence interval, as described in section 5.2.3. The procedure is:
327 The Significance of a Predictor
Figure 8.10 A bootstrap sampling distribution of the slope of the regression line. The original sample comprises the 48 data shown in figure 8.9.
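In outline, the bootstrap interval resamples the 48 (x, y) pairs with replacement, refits the slope to each bootstrap sample, and takes percentiles of the resulting distribution as the interval. A Python sketch with invented heteroscedastic data (the Phoenix data are not reproduced here):

```python
import random

def slope(pairs):
    """Least-squares slope for a list of (x, y) pairs."""
    n = len(pairs)
    xbar = sum(x for x, _ in pairs) / n
    ybar = sum(y for _, y in pairs) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in pairs)
    sxx = sum((x - xbar) ** 2 for x, _ in pairs)
    return sxy / sxx

random.seed(0)
# invented data: true slope 2, noise growing with x (heteroscedastic)
data = [(x, 2.0 * x + random.gauss(0, x)) for x in range(1, 49)]

boot_slopes = sorted(
    slope([random.choice(data) for _ in data])   # resample 48 pairs with replacement
    for _ in range(1000)
)
ci_low, ci_high = boot_slopes[24], boot_slopes[974]  # ~95 percent interval
```

Because each bootstrap sample resamples whole (x, y) pairs, the interval makes no normality or equal-variance assumption about the predicted variable.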
However, r² leaves some doubt as to whether its value could be due to chance. In sections 4.5 and 5.3.3 I described tests of the hypothesis that the correlation is zero,