Cast your mind back to the beginning of this book, where I said that the behavior of systems depends on their architectures, tasks, and environments. In the intervening chapters I presented methods for discovering and demonstrating how factors interact to influence behavior. Some factors are features of architectures (e.g., whether a part-of-speech tagger uses tag-repair heuristics) and some are features of tasks (e.g., whether one therapy will serve for several diseases). Increasingly, as AI systems become embedded in real-world applications, we must consider environmental factors, such as the mean time between tasks, noise in data, and so on. With the toolbox of methods from the previous chapters, we can detect faint hints of unsuspected factors and amplify them with well-designed experiments. We can demonstrate gross effects of factors and dissect them into orthogonal components; and we can demonstrate complex interactions between factors. Having done our reductionist best to understand individual and joint influences on behavior, we are left with something like a carefully disassembled mantel clock. We know the pieces and how they connect, we can differentiate functional pieces from ornamental ones, and we can identify subassemblies. We even know that the clock loses time on humid afternoons, and that humidity, not the time of day, does the damage. But we don't necessarily have an explanation of how all the pieces work and how humidity affects them. We need to complement reductionism with synthesis. Having discovered the factors that influence performance, we must explain how they all work together. Such explanations are often called models.
Different sciences have different modeling tools, of course, and some regard programs as models. For example, cognitive scientists offer programs as models of specific behaviors, such as verbal learning, or even as architectures of cognition (Newell, 1990; Anderson, 1983). It's difficult to think of a science that doesn't use computer-based models or simulations. But within AI we tend to view the behaviors of programs not as models of phenomena, but as phenomena in their own right. When we speak
310 Modeling
and p and c as features of the environment, whereas I is the optimal behavior, the optimal monitoring interval.
In specifying what our agent should do, our model is normative. Normative models are not common in artificial intelligence, partly because optimality usually requires exhaustive search (an impossibility for most AI problems), and partly because our intellectual tradition is based in Simon's notion of bounded rationality and his rejection of normative models in economics. The models discussed in this chapter are performance models: They describe, predict, and explain how systems behave.
A good performance model is a compromise. It should fit but not overfit sample data; that is, it shouldn't account for every quirk in a sample if doing so obscures general trends or principles. A related point is that a good model compromises between accuracy and parsimony. For instance, multiple regression models of the kind described in section 8.6 can be made more accurate by adding more predictor variables, but the marginal utility of predictors decreases, while the number of interactions among predictors increases. These interactions must be interpreted or explained. Other compromises pertain to the scope and assumptions that underlie models. For example, the scope of our illustrative model I = (2c/up)^.5 is just those situations in which the probability p of a message arriving is stationary, unchanging over time. But because a model for nonstationary probabilities is more complicated, we use the stationary model in nonstationary situations, where its accuracy is compromised but sufficient for heuristic purposes.
Good models capture essential aspects of a system. They don't merely simplify; they assert that some things are important and others are not. For example, our illustrative model asserts that only three quantities and a few assumptions are important, and every other aspect of any monitoring situation is unimportant. This is quite a claim. Quite apart from its veracity, we must appreciate how it focuses and motivates empirical work. It says, “find out whether the assumptions are warranted; see whether c, u, and p are truly the only important factors.” Lacking a model, what can we say of an empirical nature? That our communications assistant appears to work; that its users are happy; that its success might be due to anything. In short, as a model summarizes the essential aspects and interactions in a system, it also identifies assumptions and factors to explore in experiments.
Artificial intelligence researchers can be divided into those who use models and those who don't. The former group concerns itself with theoretical and algorithmic issues while the latter builds relatively large, complicated systems. I drew these conclusions from my survey of 150 papers from the Eighth National Conference on Artificial Intelligence (Cohen, 1991). I find them disturbing because surely architects and systems engineers need models the most. Think of any complicated artifacts—
In a complicated system, some design decisions have theoretical weight while others are expedient. If you write a paper about your system or describe it to a colleague, you emphasize the important decisions. You say, “It's called ‘case-based’ reasoning, because it re-uses cases instead of solving every problem from first principles. The two big design issues are retrieving the best cases and modifying them for new problems.” You don't talk about implementation details because, although they influence how your system behaves, you regard them as irrelevant. Allen Newell made this important distinction in his 1982 paper, The Knowledge Level:
The system at the knowledge level is the agent. The components at the knowledge level are goals, actions and bodies. ... The medium at the knowledge level is knowledge (as might be suspected). Thus, the agent processes its knowledge to determine the actions to take. Finally, the behavior law is the principle of rationality: Actions are selected to attain the agent's goals.

The knowledge level sits in the hierarchy of systems levels immediately above the symbol level.

As is true of any level, although the knowledge level can be constructed from the level below (i.e., the symbol level), it also has an autonomous formulation as an independent level. Thus, knowledge can be defined independent of the symbol level, but it can also be reduced to symbol systems.

The knowledge level permits predicting and understanding behavior without having an operational model of the processing that is actually being done by the agent. The utility of such a level would seem clear given the widespread need in AI's affairs for distal prediction, and also the paucity of knowledge about the internal workings of humans. ... The utility is also clear in designing AI systems, where the internal mechanisms are still to be specified. To the extent that AI systems successfully approximate rational agents, it is also useful for predicting and understanding them. Indeed the usefulness extends beyond AI systems to all computer programs. (1982, pp. 98–108)
level of abstraction, (3) can be executed to permit direct observation of the behavior it predicts, (4) permits clear separation of the essential, theoretically-motivated components from the inessential but necessary algorithmic details” (Cooper et al., 1992, p. 8). Sceptic is syntactically very much like Prolog, and above-the-line components of Sceptic models are collections of Prolog-like rules. To illustrate, I will extract Sceptic rules for SOAR's elaboration phase: When SOAR gets some input from its environment, it puts tokens representing the input into working memory, after which it searches long-term memory for productions that match the contents of working memory. In Sceptic, this phase is implemented as follows:
elaboration_phase:
    true
    -> mark_all_wmes_as_old,
       input_cycle,
       continue_cycling_if_not_quiescent

continue_cycling_if_not_quiescent:
    not(quiescent)
    -> elaboration_cycle,
       output_cycle,
       input_cycle,
       continue_cycling_if_not_quiescent

elaboration_cycle:
    generate_unrefracted_instantiation(Instantiation, FiringGoal),
    not(im_match(Instantiation, FiringGoal))
    -> im_make(Instantiation, FiringGoal),
       fire_production(Instantiation, FiringGoal)
The first rule defines the elaboration phase: It marks all current working memory elements as old, then it gets input from the environment and adds it to working memory, and then it executes the rule continue_cycling_if_not_quiescent. This rule tests for quiescence (a state in which no new productions are triggered by the contents of working memory), and if quiescence has not been achieved it does several other things pertinent to instantiating and executing productions from long term memory. For instance, the elaboration cycle generates an instantiation and firing goal for a production and, if it is not already in instantiation memory (tested by im_match), it puts it there and then fires the production.
Five Sceptic rules are required to implement the above-the-line model of SOAR's elaboration phase, and several below-the-line functions must be supplied as well.
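The control flow these rules encode can be sketched in ordinary procedural code. The Python fragment below is only an illustrative analogue, not part of Sceptic or SOAR; the representation of productions as (condition, token) pairs and all names are invented for the example.

```python
def elaboration_phase(working_memory, productions, new_input):
    """Sketch of the elaboration phase as rendered by the Sceptic rules:
    add input to working memory, then fire matching productions until
    quiescence. Productions are (condition, token) pairs, where condition
    is a frozenset of tokens that must all be in working memory."""
    wm = set(working_memory) | set(new_input)   # the input cycle
    fired = set()   # instantiation memory (cf. im_make and the im_match test)
    while True:
        # generate unrefracted instantiations: productions whose conditions
        # match working memory and that have not already fired
        ready = [(cond, token) for cond, token in productions
                 if cond <= wm and (cond, token) not in fired]
        if not ready:          # quiescence: no new productions triggered
            return wm
        for inst in ready:
            fired.add(inst)    # record the instantiation
            wm.add(inst[1])    # fire the production: add its token
```

Chained productions fire across successive iterations: given productions a→b and b→c and input {a}, the phase runs until quiescence with working memory {a, b, c}.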
Programs as Models, Executable Specifications and Essential Miniatures
Figure 8.1 The cost of tagging errors as a function of the number of tag-trigrams learned.
Unfortunately, this result says little about how the confusion depends on other factors. For instance, OTB improved its performance by learning; did it learn to avoid confusing nouns and noun modifiers? Figure 8.1 shows, in fact, that the differences between lax and stringent scores were most pronounced when OTB had learned fewer than two hundred tag-trigrams. After that, the mean difference hovers around 5 percent. For simplicity, call the mean difference between lax and stringent scores the cost of noun/noun modifier tagging errors. We will derive a model that relates this cost to OTB's learning. Even though figure 8.1 shows clearly that the relationship is nonlinear, I will develop a linear model; later, I will transform Lehnert and McCarthy's data so a linear model fits it better.
that the student's mean squared error is the sample variance of the y scores, s². It should also be clear that s² is a lower bound on one's ability to predict the y value of a coordinate, given a corresponding x value. One's mean squared error should be no higher than s², and might be considerably lower if one knows something about the relationship between x and y.
For instance, after looking carefully at the scatterplot in figure 8.2, you guess the following rule:

Your Rule: y_i = 7.27x_i + 27.

Now you are asked, “What y value is associated with x = 1?” and you respond “y = 34.27.” The difference between your answer and the true value of y is called a residual. Suppose the residuals are squared and summed, and the result is divided by N − 1. This quantity, called the mean squared residual, is one assessment of the predictive power of your rule. As you would expect, the mean squared residual associated with your rule is much smaller than the sample variance; in fact, your rule gives the smallest possible mean squared residual. The line that represents your rule in figure 8.3 is called a regression line, and it has the property of being the least-squares fit to the points in the scatterplot. It is the line that minimizes mean squared residuals.
For simple regression, which involves just one predictor variable, the regression line y = bx + a is closely related to the correlation coefficient. Its parameters are:

b = r (s_y / s_x),    a = ȳ − b x̄    (8.1)
319 Cost as a Function of Learning: Linear Regression
where r is the correlation between x and y, and s_x and s_y are the standard deviations of x and y, respectively. Note that the regression line goes through the mean of x and the mean of y. A general procedure to solve for the parameters of the regression line is described in section 8A.
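These parameter formulas are easy to sketch in code. The helper name and test data below are mine, not from the text:

```python
import math

def regression_params(xs, ys):
    """Slope b = r * (s_y / s_x) and intercept a = ybar - b * xbar,
    per the parameter formulas for simple regression."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - ybar) ** 2 for y in ys) / (n - 1))
    r = sum((x - xbar) * (y - ybar)
            for x, y in zip(xs, ys)) / ((n - 1) * sx * sy)
    b = r * sy / sx
    a = ybar - b * xbar          # forces the line through (xbar, ybar)
    return b, a

b, a = regression_params([1, 2, 3, 4], [2.0, 4.1, 5.9, 8.0])
```

Because a = ȳ − b x̄, the fitted line necessarily passes through the point of means, as noted above.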
As we saw earlier, without a regression line the best prediction of y is ȳ, yielding deviations (y_i − ȳ) and mean squared deviation s_y² = Σ(y_i − ȳ)²/(N − 1). Each deviation is a mistake, an incorrect stab at the value of y. Lacking a regression line, s_y² represents the “average squared mistake.” The predictive power of the regression line is usually expressed in terms of its ability to reduce these mistakes.
Recall how one-way analysis of variance decomposes variance into a systematic component due to a factor and an unsystematic “error” component (see chapter 6). In an essentially identical manner, linear regression breaks deviations (y_i − ȳ) (and thus variance) into a component due to the regression line and a residual, or error, component. This is illustrated in figure 8.4 for two points and their predicted values, shown as filled and open circles, respectively. The horizontal line is ȳ and the upward-sloping line is the regression line. The total deviation y_i − ȳ is divided into two parts:

(y_i − ȳ) = (ŷ_i − ȳ) + (y_i − ŷ_i).

If (y_i − ŷ_i) were zero for all points, then all points would fall exactly on the regression line; that is, the filled and open circles in figure 8.4 would coincide.
The sum of squared deviations can also be broken into two parts:

SS_total = SS_reg + SS_res.

In words, the total sum of squares comprises the sum of squares for regression plus the sum of squares residual. The proportion of the total variance accounted for by the regression line is

r² = SS_reg / SS_total.    (8.2)
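These identities are easy to verify numerically. A sketch with invented data (not Lehnert and McCarthy's):

```python
# Verify SS_total = SS_reg + SS_res and r^2 = SS_reg / SS_total
# on a small invented data set.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 2.9, 5.2, 6.8, 9.1]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n

# least-squares line
b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
    / sum((x - xbar) ** 2 for x in xs)
a = ybar - b * xbar
preds = [b * x + a for x in xs]

ss_total = sum((y - ybar) ** 2 for y in ys)
ss_reg = sum((p - ybar) ** 2 for p in preds)     # due to the regression line
ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))  # residual component
r_squared = ss_reg / ss_total   # proportion of variance accounted for
```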
It is a simple matter to fit a regression line to Lehnert and McCarthy's data. The line is shown in the left panel in figure 8.5, which relates the number of tag-trigrams OTB has learned (horizontal axis) to the cost of noun/noun modifier tagging errors (vertical axis). Although this line is the best possible fit to the data (in the least-squares sense; see the appendix to this chapter), it accounts for only 44 percent of the variance in the cost of tagging errors.
In general, when a model fails to account for much of the variance in a performance variable, we wonder whether the culprit is a systematic nonlinear bias in our data,
Transforming Data for Linear Models
Figure 8.5 Linear regression and residuals for Lehnert and McCarthy's data. (Both panels: horizontal axis is tag-trigrams learned, 0–900.)
outliers, or unsystematic variance. Systematic bias is often seen in a residuals plot, a scatterplot in which the vertical axis represents residuals y_i − ŷ_i from the regression line and the horizontal axis represents the predictor variable x. The right panel of figure 8.5 shows a residuals plot for the regression line in the left panel. Clearly, there is much structure in the residuals. As it happens, the residuals plot shows nothing that wasn't painfully evident in the original scatterplot: The regression line doesn't fit the data well because the data are nonlinear. In general, however, residuals plots often disclose subtle structure because the ranges of their vertical axes tend to be narrower than in the original scatterplot (e.g., see figure 8.11.e). Conversely, if a regression model accounts for little variance in y (i.e., r² is low) but no structure or outliers are seen in the residuals plot, then the culprit is plain old unsystematic variance, or noise.
Thus they can change the shape of a function, even straightening a distinctly curved one.
Consider how to straighten Lehnert and McCarthy's data, reproduced in the top-left graph in figure 8.6. Let y denote the performance variable and x denote the level of learning. The function plunges from y = .375 to y = .075 (roughly) as x increases from 0 to 100, then y doesn't change much as x increases further. Imagine the function is plotted on a sheet of rubber that can be stretched or compressed until the points form a straight line. What should we do? Four options are:

• Compress the region above y = .1 along the y axis.
• Stretch the region below y = .1 along the y axis.
• Compress the region to the right of x = 100 along the x axis.
• Stretch the region to the left of x = 100 along the x axis.
A log transform stretches the distances between small values and compresses the distances between large ones; for example, log(20) − log(10) = .3 whereas log(120) − log(110) = .038, so a difference of ten for large values is compressed relative to the same difference between small values. Figure 8.6 shows three log transformations of y and x. The top-right graph shows a plot of x against log(y); as expected, points with smaller y values are spread out more along the y axis, and you can see more detail than when the points were scrunched together. A regression line fit to these data explains more of the variance, too. The regression of log(y) on x explains 66.5 percent of the variance in log(y). The bottom-left graph shows a regression of y on log(x). As expected, points with small x values are spread wider by the transformation, and points with large x values are compressed. Unfortunately, this creates a dense mass of points at the right of the graph, making it very difficult to see any details, although the variance in y accounted for by log(x) is pretty high, r² = 82.3 percent. Finally, we can transform both x and y, yielding the graph at the lower right of figure 8.6. Notice that the worrying clump of points has been spread apart along the y axis because they had low y values. Nearly 90 percent of the variance in log(y) is explained by log(x).
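The effect of such transformations on r² is easy to demonstrate. In this sketch the data are synthetic, a power law y = 3x^−.5 that loosely mimics the plunge-then-flatten shape; they are not Lehnert and McCarthy's numbers.

```python
import math

def r_squared(xs, ys):
    """Squared correlation between xs and ys."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)

xs = [10, 50, 100, 200, 400, 700, 900]
ys = [3 * x ** -0.5 for x in xs]    # plunges early, then levels off

raw = r_squared(xs, ys)                        # poor linear fit
loglog = r_squared([math.log(x) for x in xs],
                   [math.log(y) for y in ys])  # essentially perfect
```

A power law is exactly linear in log-log coordinates, so the transformed fit is nearly perfect, while the raw linear fit accounts for only about half the variance.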
The log transform is but one of many used by data analysts for a variety of purposes. In addition to straightening functions, transformations are also applied to make points fall in a more even distribution (e.g., not clumped as in the lower left graph in figure 8.6), to give them a symmetric distribution along one axis, or to make the variances of distributions of points roughly equal. Some of these things are done to satisfy assumptions of statistical tests; for instance, regression and analysis of variance assume equal variance in distributions of y values at all levels of x. In his book,
[Display of re-expression functions, including powers of x such as x' = −1/x, x' = x, and x' = x².]
Figure 8.7 shows how these re-expressions (or transformations) stretch a function in different parts of its range. The straight line is, of course, x' = x. To pull apart large values of x but leave small values relatively unchanged, you would use a power function with an exponent greater than one, such as x' = x². You can see that the deviation between x and x' = x² increases with x; therefore, x' = x² spreads points farther apart when they have big x values. Alternatively, to pull apart small values of x and make larger values relatively indistinguishable, you would
Figure 8.7 Graphing transformation functions illustrates their effects on different parts of the range.
One can derive confidence intervals (secs. 4.6, 5.2.3) for several parameters of regression equations, including

• the slope of the regression line
• the y intercept
• the population mean of y values.

Confidence intervals for the slope of the regression line will illustrate the general approach. See any good statistics text (e.g., Olson, 1987, pp. 448–452) for confidence intervals for other parameters.
325 Confidence Intervals for Linear Regression Models
Here, x̄ is the statistic that estimates μ, and σ̂_x̄ is the estimated standard error of the sampling distribution of x̄, which is a t distribution. Under the assumption that the predicted variable y is normally distributed with equal variance at all levels of the predictor variable x, confidence intervals for parameters of regression lines have exactly the same form. This is the 95 percent confidence interval for the parametric slope of the regression line, β:

b − t_.025 σ̂_b ≤ β ≤ b + t_.025 σ̂_b.

As always, the tricky part is estimating the standard error of the statistic; the estimate is

σ̂_b = √( (SS_res / (N − 2)) / Σ(x_i − x̄)² ).
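A sketch of this parametric interval in Python; the data are invented, and the critical value t(.025) for N − 2 = 10 degrees of freedom is taken from a t table:

```python
import math

# Invented sample with true slope near 2
xs = list(range(1, 13))
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 20.1, 21.8, 24.2]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n

sxx = sum((x - xbar) ** 2 for x in xs)
b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
a = ybar - b * xbar

# Estimated standard error of b: sqrt(mean square residual / sum of squares of x)
ss_res = sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))
se_b = math.sqrt((ss_res / (n - 2)) / sxx)

t_crit = 2.228          # t(.025) for 10 degrees of freedom, from a table
ci_low, ci_high = b - t_crit * se_b, b + t_crit * se_b
```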
Figure 8.9 Scatterplot of WindSpeed and FinishTime showing heteroscedasticity. The solid line is the regression line for the sample of 48 points. The dotted lines are bootstrap confidence intervals on the slope of the regression line.
The assumptions that underlie parametric confidence intervals are often violated. The predicted variable sometimes isn't normally distributed, and even when it is, the variance of the normal distribution might not be equal at all levels of the predictor variable. For example, figure 8.9 shows a scatterplot of FinishTime predicted by WindSpeed in forty-eight trials of the Phoenix planner. The variances of FinishTime at each level of WindSpeed are not equal, suggesting that they might not be equal in the populations from which these data were drawn (a property called heteroscedasticity, as opposed to homoscedasticity, or equal variances). This departure from the assumptions of parametric confidence intervals might be quite minor and unimportant, but even so, the assumptions are easily avoided by constructing a bootstrap confidence interval, as described in section 5.2.3. The procedure is:
327 The Significance of a Predictor
Figure 8.10 A bootstrap sampling distribution of the slope of the regression line. The original sample comprises the 48 data shown in figure 8.9.
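In outline, the bootstrap interval resamples the 48 (x, y) pairs with replacement, refits the slope to each bootstrap sample, and takes percentiles of the resulting distribution as the interval. A Python sketch with invented heteroscedastic data (the Phoenix data are not reproduced here):

```python
import random

def slope(pairs):
    """Least-squares slope for a list of (x, y) pairs."""
    n = len(pairs)
    xbar = sum(x for x, _ in pairs) / n
    ybar = sum(y for _, y in pairs) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in pairs)
    sxx = sum((x - xbar) ** 2 for x, _ in pairs)
    return sxy / sxx

random.seed(0)
# invented data: true slope 2, noise growing with x (heteroscedastic)
data = [(x, 2.0 * x + random.gauss(0, x)) for x in range(1, 49)]

boot_slopes = sorted(
    slope([random.choice(data) for _ in data])   # resample 48 pairs with replacement
    for _ in range(1000)
)
ci_low, ci_high = boot_slopes[24], boot_slopes[974]  # ~95 percent interval
```

Because each bootstrap sample resamples whole (x, y) pairs, the interval makes no normality or equal-variance assumption about the predicted variable.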
However, r² leaves some doubt as to whether its value could be due to chance. In sections 4.5 and 5.3.3 I described tests of the hypothesis that the correlation is zero,