10 Models of discrete and limited dependent variables
This chapter deals with models for discrete and limited dependent variables. Discrete dependent variables arise naturally from discrete-choice models in which individuals choose from a finite or countable number of distinct outcomes and from count processes that record how many times an event has occurred. Limited dependent variables have a restricted range, such as the wage or salary income of non-self-employed individuals, which runs from 0 to the highest level recorded.1 Discrete and limited dependent variables cannot be modeled by linear regression. These models require more computational effort to fit and are harder to interpret. This chapter discusses models of binary choice, which can be fitted by binomial logit or probit techniques. The following section takes up their generalization to ordered logit or ordered probit, in which the response is one of a set of values from an ordered scale. I then present techniques appropriate for truncated and censored data and their extension to sample-selection models. The final section of the chapter considers bivariate probit and probit with selection.
10.1 Binomial logit and probit models

We could develop a behavioral model of each of these phenomena, including several explanatory factors (we should not call them regressors) that we expect to influence the respondent's answer to such a question. But we should readily spot the flaw in the linear probability model

r_i = x_i β + u_i        (10.1)

where we place the Boolean response variable in r and regress it upon a set of x variables. All the observations we have on r are either 0 or 1, and they may be viewed as the ex post probabilities of responding "yes" to the question posed. But the predictions of a linear regression model are unbounded, and the model of (10.1), fitted with regress, can produce negative predictions and predictions exceeding unity, neither of which can be considered probabilities. Because the response variable is bounded, restricted to take on values of {0,1}, the model should generate a predicted probability that individual i will choose to answer "yes" rather than "no". In such a framework, if β_j > 0, individuals with high values of x_j will be more likely to respond "yes", but their probability of doing so must respect the upper bound. For instance, if higher disposable income makes a new car purchase more probable, we must be able to include a wealthy person in the sample and find that his or her predicted probability of new car purchase is no greater than 1. Likewise, a poor person's predicted probability must be bounded by 0. Although we can fit (10.1) with OLS, the model is likely to produce point predictions outside the unit interval. We could arbitrarily constrain them to either 0 or 1, but this linear probability model has other problems: the error term cannot satisfy the assumption of homoskedasticity. For a given set of x values, there are only two possible values for the disturbance, -xβ and (1 - xβ): the disturbance follows a binomial distribution. Given the properties of the binomial distribution, the variance of the disturbance process, conditioned on x, is

Var(u | x) = xβ(1 - xβ)
No constraint can ensure that this quantity will be positive for arbitrary x values. Therefore, we cannot use regression with a binary-response variable but must follow a different strategy. Before developing that strategy, let us consider another formulation of the model from an economic standpoint.
A latent-variable formulation is a useful way to approach such an econometric model. Express (10.1) as

y_i* = x_i β + u_i        (10.2)
where y* is an unobservable magnitude, which can be considered the net benefit to individual i of taking a particular course of action (e.g., purchasing a new car). We cannot observe that net benefit, but we can observe the outcome of the individual having followed the decision rule

y_i = 0 if y_i* <= 0
y_i = 1 if y_i* > 0        (10.3)
That is, we observe that the individual did (y = 1) or did not (y = 0) purchase a new car in 2005. We speak of y* as a latent variable, linearly related to a set of factors x and a disturbance process u. In the latent model, we model the probability of an individual making each choice. Using (10.2) and (10.3), we have

Pr(y_i = 1 | x) = Pr(y_i* > 0) = Pr(u_i > -x_i β)        (10.4)

and, for a disturbance with a symmetric distribution and CDF Ψ(·),

Pr(y_i = 1 | x) = Ψ(x_i β)        (10.5)
We can estimate the parameters of binary-choice models by using maximum likelihood techniques.3 For each observation, the log of the probability of observing y conditional on x may be written as

ℓ_i = y_i log Ψ(x_i β) + (1 - y_i) log{1 - Ψ(x_i β)}

and the log likelihood of the sample, L(β) = Σ_i ℓ_i, is maximized with respect to the k elements of β.
The two common estimators of the binary-choice model are the binomial probit and binomial logit models. For the probit model, Ψ(·) is the CDF of the normal distribution (Stata's normal() function). For the logit model, Ψ(·) is the CDF of the logistic distribution:4

Ψ(x_i β) = exp(x_i β)/{1 + exp(x_i β)}
The CDFs of the normal and logistic distributions are similar. In the latent-variable model, we must assume that the disturbance process has a known variance, σ_u². Unlike the linear regression problem, we do not have enough information in the data to estimate its magnitude.
3. For a discussion of maximum likelihood estimation, see Greene (2003, chap. 17) and Gould, Pitblado, and Sribney (2006). 4. The probability density function of the logistic distribution, which is needed to calculate marginal effects, is ψ(z) = exp(z)/{1 + exp(z)}².
Because we can divide (10.2) by any positive σ without altering the estimation problem, σ is not identified; σ is set to one for the probit model and to π/√3 in the logit model. The logistic distribution has fatter tails, resembling the Student t distribution with 7 degrees of freedom.5 The two models will produce similar results if the distribution of sample values of y_i is not too extreme. However, a sample in which the proportion y_i = 1 (or the proportion y_i = 0) is very small will be sensitive to the choice of CDF. Neither of these cases is really amenable to the binary-choice model. If an unusual event is modeled by y_i, the naive model that it will not happen for anyone is hard to beat. The same is true for an event that is almost ubiquitous: the naive model that predicts that all people have eaten a candy bar at some time in their lives is accurate. We can fit these binary-choice models in Stata with the commands probit and logit. Both commands assume that the response variable is coded with zeros indicating a negative outcome and a positive, nonmissing value corresponding to a positive outcome (i.e., I purchased a new car in 2005). These commands do not require that the variable be coded {0,1}, although that is often the case.
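Because the two estimators normalize σ differently, their slope coefficients differ roughly by the factor π/√3 ≈ 1.81 (in practice, the empirical ratio is often closer to 1.6). A minimal sketch of this comparison, assuming the womenwk variables used later in this section:

. quietly probit work age married children education
. matrix bprobit = e(b)                  // save the probit coefficient vector
. quietly logit work age married children education
. display _b[age]/bprobit[1,1]           // ratio of logit to probit slope on age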
Via the chain rule, the effect of an increase in x_j on the probability is the product of two factors: the effect of x_j on the latent variable and the derivative of the CDF evaluated at y_i*. The latter term, ψ(·), is the probability density function of the distribution. In a linear regression model, the coefficient β_j measures the marginal effect ∂y/∂x_j, and that effect is constant over the sample. In a binary-outcome model, a change in factor x_j does not induce a constant change in Pr(y = 1 | x) because Ψ(·) is a nonlinear function of x. As discussed above, one of the reasons that we use Ψ(·) in the binary-outcome model is to keep the predicted probabilities inside the interval [0,1]. This boundedness property of Ψ(·) implies that the marginal effects must go to zero as the absolute value of x_j gets large. Choosing smooth distribution functions, like the normal and logistic, implies that the marginal effects vary continuously with each x_j.
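As a concrete sketch of this calculation for the probit model (again assuming the womenwk variables introduced below), the marginal effect of a continuous x_j is β_j ψ(x_i β), which we can compute observation by observation:

. quietly probit work age married children education
. predict double xbhat, xb                           // linear prediction x*beta
. generate double me_age = _b[age]*normalden(xbhat)  // marginal effect of age for each woman
. summarize me_age                                   // its mean is the average marginal effect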
5. Other distributions, including nonsymmetric distributions, may be used in this context. For example, Stata's cloglog command (see [R] cloglog) fits the complementary log-log model Pr(y = 1 | x) = 1 - exp{-exp(xβ)}.
"0.1.2
Binomial probit
Stata's probit command reports the maximum likelihood estimates of the coefficients. We can also use dprobit to display the marginal effect ∂Pr(y = 1 | x)/∂x_j, that is, the effect of an infinitesimal change in x_j.6 We can use probit with no arguments following a dprobit command to "replay" the probit results in this format. Using probit this way does not affect the z statistics or p-values of the estimated coefficients. Because the model is nonlinear, the dF/dx reported by dprobit will vary through the sample space of the explanatory variables. By default, the marginal effects are calculated at the multivariate point of means but can be calculated at other points via the at() option. After fitting the model with either probit or logit, we can use mfx to compute the marginal effects. A probit estimation followed by mfx calculates the dF/dx values (identical to those from dprobit). We can use mfx's at() option to compute the effects at a particular point in the sample space. As discussed in section 4.7, mfx can also calculate elasticities and semielasticities. By default, the dF/dx effects produced by dprobit or mfx are the marginal effects for an average individual. Some argue that it would be preferable to compute the average marginal effect: that is, the average of each individual's marginal effect. The marginal effect computed at the average x is different from the average of the marginal effects computed at the individual x_i. Increasingly, current practice is moving to looking at the distribution of the marginal effects computed for each individual in the sample. Stata does not have such a capability, but a useful margeff routine written by Bartus (2005) adds this capability for probit, logit, and several other Stata commands discussed in this chapter (although not dprobit). Its dummies() option signals the presence of categorical explanatory variables. If some explanatory variables are integer variables, the count option should be used. After fitting a probit model, the predict command, with the default option p, computes the predicted probability of a positive outcome. Specifying the xb option calculates the predicted value of y*. The following example uses a modified version of the womenwk dataset, which contains information on 2,000 women, 657 of whom are not recorded as wage earners. The indicator variable work is set to zero for the nonworking and to one for those reporting positive wages.
. use http://www.stata-press.com/data/imeus/womenwk, clear
. summarize work age married children education

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        work |      2000       .6715    .4697852          0          1
         age |      2000      36.208     8.28656         20         59
     married |      2000       .6705    .4701492          0          1
    children |      2000      1.6445    1.398963          0          5
   education |      2000      13.084    3.045912         10         20
6. Because an indicator variable cannot undergo an infinitesimal change, the default calculation for such a variable is the discrete change in the probability when the indicator is switched from 0 to 1.
We fit a probit model of the decision to work depending on the woman's age, marital status, number of children, and level of education.7
. probit work age married children education

Probit regression                                 Number of obs   =       2000
(iteration log and summary statistics omitted)

        work |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0347211   .0042294     8.21   0.000     .0264317    .0430105
     married |   .4308575    .074208     5.81   0.000     .2854125    .5763025
    children |   .4473249   .0287417    15.56   0.000     .3909922    .5036576
   education |   .0583645   .0109742     5.32   0.000     .0368555    .0798735
       _cons |  -2.467365   .1925633   -12.81   0.000    -2.844782   -2.089948
Surprisingly, more children in the household increase the likelihood that the woman will work. mfx computes marginal effects at the multivariate point of means, or we could generate them by using dprobit for the estimation.
. mfx compute

Marginal effects after probit
      y  = Pr(work) (predict)
         =  .71835948

    variable |      dy/dx    Std. Err.      z    P>|z|          X
-------------+----------------------------------------------------
         age |    .011721      .00142     8.25   0.000     36.208
    married* |    .150478      .02641     5.70   0.000      .6705
    educat~n |   .0197024       .0037     5.32   0.000     13.084
    children |   .1510059      .00922    16.38   0.000     1.6445
-------------+----------------------------------------------------
(*) dy/dx is for discrete change of dummy variable from 0 to 1
(95% confidence intervals omitted)
The marginal effects imply that married women have a 15% higher probability of labor force participation, whereas a marginal change in age from the average of 36.2 years is associated with a 1% increase in participation. Bartus's margeff routine computes average marginal effects, each of which is slightly smaller than that computed at the point of sample means by mfx.
. margeff, count

Average marginal effects on Prob(work==1) after probit
Variables treated as counts: age children education

    variable |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |      .0100    .0011512     8.70   0.000                .0122742
     married |      .1292    .0225035     5.74   0.000                 .173382
    children |      .1181    .0057959    20.38   0.000                .1294947
   education |      .0168    .0030558     5.49   0.000                .0227591
When the logistic CDF is used in (10.5), the probability of y = 1, conditioned on x, is π_i = exp(x_i β)/{1 + exp(x_i β)}. Unlike the CDF of the normal distribution, which lacks a closed-form inverse, this function can be inverted to yield

log{π_i/(1 - π_i)} = x_i β        (10.6)
This expression is termed the logit of π_i, which is a contraction of the log of the odds ratio. The odds ratio reexpresses the probability in terms of the odds of y = 1. The logit is not defined for microdata in which y_i equals zero or one, but it is well defined for averages of such microdata. For instance, in the 2004 U.S. presidential election, the ex post probability of a Massachusetts resident voting for John Kerry according to cnn.com was 0.62, with a logit of log{0.62/(1 - 0.62)} = 0.4895. The probability of that person voting for George W. Bush was 0.37, with a logit of -0.5322. Say that we had such data for all 50 states. It would be inappropriate to use linear regression on the probabilities voteKerry and voteBush, just as it would be inappropriate to run a regression on the voteKerry and voteBush indicator variables of individual voters. We can use glogit (grouped logit) to produce weighted least-squares estimates for the model on state-level data. As an alternative, we can use blogit to produce maximum likelihood estimates of that model on grouped (or "blocked") data, or we could use the equivalent commands gprobit and bprobit to fit a probit model to grouped data. What if we have microdata in which voters' preferences are recorded as indicator variables, for example, voteKerry = 1 if that individual voted for John Kerry, and zero otherwise? Instead of fitting a probit model to that response variable, we can fit a logit model with the logit command. This command will produce coefficients that, like those of probit, express the effect on the latent variable y* of a change in x_j; see (10.6). As with dprobit, we can use logistic to compute coefficients that express the effects of the explanatory variables in terms of the odds ratio associated with that explanatory factor. Given the algebra of the model, the odds ratio is merely exp(β_j) for the jth coefficient estimated by logit and may also be requested by specifying the or option on the logit command. Logistic regression is intimately related to the binomial logit model and is not an alternative econometric technique to logit. The documentation for logistic states that the computations are carried out by calling logit.
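A minimal sketch of that equivalence, assuming the work equation from above:

. quietly logit work age married children education
. logit, or                        // replay the coefficients as odds ratios
. display exp(_b[children])        // reproduces the odds ratio reported for children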
As with probit, by default predict after logit calculates the probability of a positive outcome. mfx produces marginal effects expressing the effect of an infinitesimal change in each x_j on the probability of a positive outcome, evaluated by default at the multivariate point of means. We can also calculate elasticities and semielasticities. We can use Bartus's margeff routine to calculate the average marginal effects over the sample observations after either logit or logistic.
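For instance, a sketch of the elasticity calculation after logit, under the same assumed specification:

. quietly logit work age married children education
. mfx compute, eyex     // elasticities: percent change in Pr(work) per percent change in each x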
. logit work age married children education

(logit output: all four slope coefficients are positive and significant at the 0.1% level, with upper 95% bounds .0720833 (age), .9896549 (married), .8654827 (children), and .1348089 (education); the constant is negative, with upper 95% bound -3.508462)
Although the logit coefficients' magnitudes differ considerably from their probit counterparts, the marginal effects at the multivariate point of means are similar to those computed after probit.
8. An approach similar to the Davidson-MacKinnon J test described in section 4.5.5 has been proposed but has been shown to have low power.
. mfx compute

Marginal effects after logit
      y  = Pr(work) (predict)
         =  .72678688

(dy/dx values and standard errors omitted; the effects of age, married, children, and education are each significant at p < 0.001, evaluated at X = 36.208, .6705, 1.6445, 13.084)
We illustrate the at() option, evaluating the estimated logit function at children = 0. The magnitudes of each of the marginal effects are increased at this point in the x space, with the effect of an additional year of education being almost 25% higher (0.0241 versus 0.0195) for the childless woman.
. mfx compute, at(children=0)
warning: no value assigned in at() for variables age married education;
    means used for age married education

Marginal effects after logit
      y  = Pr(work) (predict)
         =  .43074191

(dy/dx values omitted; each effect is significant at p < 0.001, evaluated at X = 36.208, .6705, 0, 13.084)
(*) dy/dx is for discrete change of dummy variable from 0 to 1
We can test for appropriate specification of a subset model, as in the regression context, with the test command. The test statistics for exclusion of one or more explanatory variables are reported as χ² rather than F statistics because Wald tests from ML estimators have large-sample χ² distributions. We can apply the other postestimation commands (tests of linear expressions with test or lincom, and tests of nonlinear expressions with testnl or nlcom) the same way as with regress. How can we judge the adequacy of a binary-choice model fitted with probit or logit? Just as the "ANOVA F" tests a regression specification against the null model in which all regressors are omitted, we may consider a null model for the binary-choice specification to be Pr(y = 1) = ȳ. Because the mean of an indicator variable is the sample proportion of 1s, it may be viewed as the unconditional probability that y = 1.9 We can contrast that with the conditional probabilities generated by the model that takes into account the explanatory factors x. Because the likelihood function for the null model can readily be evaluated in either the probit or logit context, both
9. For instance, the estimate of the constant in a constant-only probit model is invnormal(ȳ).
commands produce a likelihood-ratio test,10 LR chi2(k - 1), where (k - 1) is the number of explanatory factors in the model (presuming the existence of a constant term). As mentioned above, the null model is hard to beat if ȳ is very close to 0 or 1. Although this likelihood-ratio test provides a statistical basis to reject the null model versus the fitted model, there is no measure of goodness of fit analogous to R² for linear regression. Stata produces a measure called Pseudo R2 for both commands and for all commands estimated by maximum likelihood; see [R] maximize. Let L1 be the log-likelihood value for the fitted model, as presented on the estimation output after convergence. Let L0 be the log-likelihood value for the null model excluding all explanatory variables. This quantity is not displayed but is available after estimation as e(ll_0). The LR chi2(k - 1) likelihood-ratio test is merely 2(L1 - L0), and it has a large-sample χ²(k - 1) distribution under the null hypothesis that the explanatory factors are jointly uninformative.
If we rearrange the log-likelihood values, we may define the pseudo-R² as 1 - L1/L0, which like the regression R² is on a [0,1] scale, with 0 indicating that the explanatory variables failed to increase likelihood and 1 indicating that the model perfectly predicts each observation. We cannot interpret this pseudo-R², as we can for linear regression, as the proportion of variation in y explained by x, but in other aspects it does resemble an R² measure.11 Adding more explanatory factors to the model does not always result in perfect prediction, as it does in linear regression. In fact, perfect prediction may inadvertently occur because one or more explanatory factors are perfectly correlated with the response variable. Stata's documentation for probit and logit discusses this issue, which Stata will detect and report. Several other measures based on the predictions of the binary-choice model have been proposed, but all have their weaknesses, particularly if there is a high proportion of 0s or 1s in the sample. estat gof and estat clas compute many of these measures. With a constant term included, the binomial logit model will produce an average of predicted probabilities equal to the sample proportion ȳ, as does regression, but that outcome is not guaranteed in the binomial probit model.
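As a brief sketch (assuming the work equation fitted earlier), both quantities can be recovered from the stored results:

. quietly logit work age married children education
. display "LR chi2(4) = " 2*(e(ll) - e(ll_0))    // likelihood-ratio statistic
. display "pseudo-R2  = " 1 - e(ll)/e(ll_0)      // matches the header of the logit output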
10.2 Ordered logit and probit models

The binary-choice models of the previous section consider only two outcomes, determined by a single threshold: we observe y = 1 if y* > 0. The ordered-choice model generalizes this concept to the notion of multiple
thresholds. For instance, a variable recorded on a five-point Likert scale will have four thresholds over the latent variable. If y* <= κ1, we observe y = 1; if κ1 < y* <= κ2, we observe y = 2; if κ2 < y* <= κ3, we observe y = 3, and so on, where the κ values are the thresholds. In a sense, this is imprecise measurement: we cannot observe y* directly, but only the range in which it falls. Imprecise measurement is appropriate for many forms of microeconomic data that are "bracketed" for privacy or summary reporting purposes. Alternatively, the observed choice might reveal only an individual's relative preference.
The parameters to be estimated are a set of coefficients β corresponding to the explanatory factors in x, as well as a set of (I - 1) threshold values κ corresponding to the I alternatives. In Stata's implementation of these estimators in oprobit and ologit, the actual values of the response variable are not relevant. Larger values are taken to correspond to higher outcomes. If there are I possible outcomes (e.g., 5 for the Likert scale), a set of threshold coefficients or cutpoints {κ1, κ2, ..., κ_(I-1)} is defined, where κ0 = -∞ and κ_I = ∞. The model for the jth observation defines

Pr(y_j = i) = Pr(κ_(i-1) < x_j β + u_j <= κ_i) = Ψ(κ_i - x_j β) - Ψ(κ_(i-1) - x_j β)
where the probability that individual j will choose outcome i depends on the product x_j β falling between cutpoints (i - 1) and i. This is a direct generalization of the two-outcome binary-choice model, which has one threshold at zero. As in the binomial probit model, we assume that the error is normally distributed with variance unity (or distributed logistic with variance π²/3 for ordered logit). Prediction is more complex in ordered probit (logit) because there are I possible predicted probabilities corresponding to the I possible values of the response variable. The default option for predict is to compute predicted probabilities. If I new variable names are given in the command, they will contain the probability that i = 1, the probability that i = 2, and so on. The marginal effects of an ordered probit (logit) model are also more complex than their binomial counterparts because an infinitesimal change in x_j will not only change the probability within the current cell (for instance, if κ2 < y* <= κ3) but will also make it more likely that the individual crosses the threshold into the adjacent category. Thus if we predict the probabilities of being in each category at a different point in the sample space (for instance, for a family with three rather than two children), we will find that those probabilities have changed, and the larger family may be more likely to choose the jth response and less likely to choose the (j - 1)st response. We can calculate the average marginal effects with margeff. We illustrate the ordered probit and logit techniques with a model of corporate bond ratings. The dataset contains information on 98 U.S. corporations' bond ratings and financial characteristics, where the bond ratings run from AAA (excellent) to C (poor). The integer codes underlying the ratings increase in the quality of the firm's rating, such that an increase in the response variable indicates that the firm's bonds are a more
attractive investment opportunity. The bond rating variable (rating83c) is coded as integers 2-5, with 5 corresponding to the highest quality (AAA) bonds and 2 to the lowest. The tabulation of rating83c shows that the four ratings categories contain a similar number of firms. We model the 1983 bond rating as a function of the firm's income-to-asset ratio in 1983 (ia83: roughly, return on assets) and the change in that ratio from 1982 to 1983 (dia). The income-to-asset ratio, expressed as a percentage, varies widely around a mean of 10%.
. tabulate rating83c

  rating83c |      Freq.     Percent        Cum.
------------+-----------------------------------
     BA_B_C |         26       26.53       26.53
        BAA |         28       28.57       55.10
       AA_A |         15       15.31       70.41
        AAA |         29       29.59      100.00
------------+-----------------------------------
      Total |         98      100.00
We fit the model with ologit; the model's predictions are quantitatively similar if we use oprobit.
. ologit rating83c ia83 dia, nolog

Ordered logistic regression                       Number of obs   =         98
Log likelihood = -127.27146

   rating83c |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        ia83 |   .0939166   .0296196     3.17   0.002     .0358633    .1519699
         dia |  -.0866925   .0449789    -1.93   0.054    -.1748496    .0014646
-------------+----------------------------------------------------------------
       /cut1 |  -.1853053   .3571432                     -.8852931    .5146825
       /cut2 |   1.185726   .3882098                      .4248489    1.946603
       /cut3 |   1.908412   .4164895                      1.092108    2.724717
ia83 has a significant positive effect on the bond rating, but somewhat surprisingly the change in that ratio (dia) has a negative effect. The model's ancillary parameters /cut1 to /cut3 indicate the thresholds for the ratings categories. Following the ologit estimation, we use predict to compute the predicted probabilities of achieving each rating. We then examine the firms that were classified as most likely to have an "AAA" (excellent) rating and a "BA_B_C" (poor quality) rating, respectively. Firm 31 has a 75% predicted probability of being rated "AAA", whereas firm 67 has a 72% predicted probability of being rated "BA" or below. The former probability is in accordance with the firm's rating, whereas the latter is a substantial misclassification. However, many factors enter into a bond rating, and that firm's level and change of net income combined to produce a very low prediction.
. predict spBA_B_C spBAA spAA_A spAAA
(option pr assumed; predicted probabilities)
. summarize spAAA, meanonly
. list sp* rating83c if spAAA==r(max)

     +------------------------------------------------------------+
     |  spBA_B_C      spBAA     spAA_A      spAAA     rating83c   |
     |------------------------------------------------------------|
 31. |  .0388714   .0985567   .1096733   .7528986           AAA   |
     +------------------------------------------------------------+
Economic research also uses response variables that represent unordered discrete alternatives, which call for multinomial models. For a discussion of how to fit and interpret unordered discrete-choice models in Stata, see Long and Freese (2006).

10.3 Truncated regression and tobit models
10.3.1 Truncation
Some LDVs are generated by truncated processes. For truncation, the sample is drawn from a subset of the population so that only certain values are included in the sample; for the excluded units, we lack observations on both the response variable and the explanatory variables. For instance, we might have a sample of individuals who have a high school diploma, some college experience, or one or more college degrees. The sample has been generated by interviewing only those who completed high school. This is a truncated sample, relative to the population, in that it excludes all individuals who have not completed high school. The excluded individuals are not likely to have the same characteristics as those in our sample. For instance, we might expect average or median income of dropouts to be lower than that of graduates.
The effect of truncating the distribution of a random variable is clear. The expected value or mean of the truncated random variable moves away from the truncation point, and the variance is reduced. Descriptive statistics on the level of education in our sample should make that clear: with the minimum years of education set to 12, the mean education level is higher than it would be if high school dropouts were included, and the variance will be smaller. In the subpopulation defined by a truncated sample, we have no information about the characteristics of those who were excluded. For instance, we do not know whether the proportion of minority high school dropouts exceeds the proportion of minorities in the population. We cannot use a sample from this truncated population to make inferences about the entire population without correcting for those excluded individuals' not being randomly selected from the population at large. Although it might appear that we could use these truncated data to make inferences about the subpopulation, we cannot even do that. A regression estimated from the subpopulation will yield coefficients that are biased toward zero (or attenuated) as well as an estimate of σ_u² that is biased downward. We are dealing with a truncated normal distribution, where y_i = x_i β + u_i is observed only if it exceeds τ. We can define

α_i = (τ - x_i β)/σ_u,    λ(α_i) = φ(α_i)/{1 - Φ(α_i)}
where σ_u is the standard error of the untruncated disturbance u, φ(·) is the normal density function, and Φ(·) is the normal CDF. The expression λ(α_i) is termed the inverse Mills ratio (IMR). Standard manipulation of normally distributed random variables shows that

E[y_i | y_i > τ] = x_i β + σ_u λ(α_i)        (10.7)
Stata's truncreg command fits this truncated regression model; its ll(#) option indicates truncation from below, as in the education example truncated at 12 years. Upper truncation can be handled with the ul(#) option; for instance, we may have a sample of individuals whose income is recorded up to $200,000. We can specify both lower and upper truncation by combining the options. In the example below, we consider a sample of married women from the laborsub dataset whose hours of work (whrs) are truncated from below at zero. Other variables of interest are the number of preschool children (kl6), number of school-aged children (k618), age (wa), and years of education (we).
. use http://www.stata-press.com/data/imeus/laborsub, clear
. summarize whrs kl6 k618 wa we

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        whrs |       250      799.84                       0       4950
         kl6 |       250        .236                       0          3
        k618 |       250       1.364                       0          8
          wa |       250       42.92                      30         60
          we |       250      12.352                       5         17

(standard deviations lost in reproduction)
To illustrate the consequences of ignoring truncation, we fit a model of hours worked with OLS, including only working women.
. regress whrs kl6 k618 wa we if whrs > 0

(OLS output for the 150 working women omitted; Total SS = 102120099 with 149 df, Total MS = 685369.794)
We now refit the model with truncreg, taking into account that 100 of the 250 observations have zero recorded whrs:
. truncreg whrs kl6 k618 wa we, ll(0)

(coefficient estimates omitted)

        whrs |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
sigma        |
       _cons |   983.1262   94.44303    10.42   0.000     798.6213    1168.831
Some of the attenuated coefficient estimates from regress are no more than half as large as their counterparts from truncreg. The parameter sigma _cons, comparable to the Root MSE in the OLS regression, is considerably larger in the truncated regression, reflecting its downward bias in a truncated sample. We can use the coefficient estimates and marginal effects from truncreg to make inferences about the entire population, whereas we should not use the results from the misspecified regression model for any purpose.
10.3.2 Censoring
Censoring is another common mechanism that restricts the range of dependent variables. Censoring occurs when a response variable is recorded as an arbitrary value whenever it lies beyond the censoring point. In the truncated case, we observe neither the dependent nor the explanatory variables for individuals whose y_i lies in the truncation region. In contrast, when the data are censored we do not observe the value of the dependent variable for individuals whose y_i is beyond the censoring point, but we do observe the values of the explanatory variables. A common example of censoring is "top coding", which occurs when a variable that takes on values of x or more is recorded as x. For instance, many household surveys top code reported income at $150,000 or $200,000. There is some discussion in the literature about how to interpret some LDVs that appear to be censored. As Wooldridge (2002) points out, censoring is a problem with how the data were recorded, not how they were generated. For instance, in the above top-coding example, if the survey administrators chose not to top code the data, the data would not be censored. In contrast, some LDVs result from corner solutions to choice problems. For example, the amount an individual spends on a new car in a given year may be zero or positive. Wooldridge (2002) argues that this LDV is a corner solution, not a censored variable. He also shows that the object of interest for a corner-solution model can be different from that for a censored model. Fortunately, both the censoring
and corner-solution motivations give rise to the same ML estimator. Furthermore, the same Stata postestimation tools can be used to interpret the results from censored and corner-solution models.
A solution to the problem with censoring at 0 was first proposed by Tobin (1958) as the censored regression model; it became known as "Tobin's probit" or the tobit model.13 The model can be expressed in terms of a latent variable:
y_i* = x_i β + u_i
y_i = y_i* if y_i* > 0
y_i = 0 otherwise
y_i contains either zeros for nonpurchasers or a positive dollar amount for those who chose to buy a car last year. The model combines aspects of the binomial probit for the distinction of y_i = 0 versus y_i > 0 and the regression model for E[y_i | y_i > 0, x_i]. Of course, we could collapse all positive observations on y_i to 1 and treat this as a binomial probit (or logit) estimation problem, but doing so would discard the information on the dollar amounts spent by purchasers. Likewise, we could throw away the y_i = 0 observations, but we would then be left with a truncated distribution, with the various problems that creates.14 To take account of all the information in y_i properly, we must fit the model with the tobit estimation method, which uses maximum likelihood to combine the probit and regression components of the log-likelihood function. We can express the log likelihood of a given observation as

ℓ_i = I(y_i = 0) log{1 - Φ(x_i β/σ)} + I(y_i > 0) [log φ{(y_i - x_i β)/σ} - log σ]
where I(·) = 1 if its argument is true and is zero otherwise. We can write the likelihood function, summing ℓ_i over the sample, as the sum of the probit likelihood for those observations with y_i = 0 and the regression likelihood for those observations with y_i > 0. We can define tobit models with a threshold other than zero. We can specify censoring from below at any point on the y scale with the ll(#) option for left censoring. Similarly, the standard tobit formulation may use an upper threshold (censoring from above, or right censoring) using the ul(#) option to specify the upper limit. Stata's tobit command also supports the two-limit tobit model, where observations on y are censored from both left and right, by specifying both the ll(#) and ul(#) options.
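A minimal sketch of these option combinations (y, x1, and x2 are hypothetical variables):

. tobit y x1 x2, ll(0)            // left-censored at zero, the classic tobit
. tobit y x1 x2, ul(200)          // right-censored at 200 (e.g., top-coded data)
. tobit y x1 x2, ll(0) ul(200)    // two-limit tobit, censored at both ends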
13. The term "censored regression" is now more commonly used for a generalization of the tobit model in which the censoring values may vary from observation to observation. See [R] cnreg. 14. The regression coefficients estimated from the positive y observations will be attenuated relative to the tobit coefficients, with the degree of bias toward zero increasing in the proportion of "limit observations" in the sample.
Even with one censoring point, predictions from the tobit model are complex: we may want to calculate the regression-like xb with predict, but we could also compute the predicted probability that y (conditional on x) falls within a particular interval (which may be open ended on the left or right).15 We can do so with the pr(a,b) option, where arguments a, b specify the limits of the interval; the missing-value code (.) is taken to mean infinity (of either sign). Another predict option, e(a,b), calculates E[x_i β + u_i | a < x_i β + u_i < b]. Last, the ystar(a,b) option computes the prediction from (10.8): a censored prediction, where the threshold is taken into account.
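A sketch of these options in sequence, assuming the tobit wage model fitted below:

. quietly tobit lwf age married children education, ll(0)
. predict double p_pos, pr(0,.)        // Pr(lwf > 0 | x)
. predict double e_cond, e(0,.)        // E[lwf | lwf > 0, x]
. predict double y_cens, ystar(0,.)    // censored prediction, threshold taken into account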
The marginal effects of the tobit model are also complex. The estimated coefficients are the marginal effects of a change in x_j on y*, the unobservable latent variable:

∂E[y* | x]/∂x_j = β_j

but the marginal effect on the observed y is attenuated by the probability of being uncensored:

∂E[y | x]/∂x_j = β_j × Pr(a < y* < b)
where a, b are defined as above for predict. For instance, for left censoring at zero, a = 0, b = +∞. Since that probability is at most unity (and will be reduced by a larger proportion of censored observations), the marginal effect of x_j is attenuated from the reported coefficient toward zero. An increase in an explanatory variable with a positive coefficient implies that a left-censored individual is less likely to be censored: the predicted probability of a nonzero value will increase. For an uncensored individual, an increase in x_j will imply that E[y | y > 0] will increase. So, for instance, a decrease in the mortgage interest rate will allow more people to be homebuyers (since many borrowers' incomes will qualify them for a mortgage at lower interest rates) and allow prequalified homebuyers to purchase a more expensive home. The marginal effect captures the combination of those effects. Since newly qualified homebuyers will be purchasing the cheapest homes, the effect of the lower interest rate on the average price at which homes are sold will incorporate both effects. We expect that it will increase the average transactions price, but because of attenuation, by a smaller amount than the regression function component of the model would indicate. We can calculate the marginal effects with mfx or, for average marginal effects, with Bartus's margeff. For an empirical example, we return to the womenwk dataset used to illustrate binomial probit and logit. We generate the log of the wage (lw) for working women and set lwf equal to lw for working women and zero for nonworking women.16 We first fit the model with OLS, ignoring the censored nature of the response variable:
15. For more information, see Greene (2003, 764-773). 16. This variable creation could be problematic if recorded wages less than $1.00 were present in the data, but in these data the minimum wage recorded is $5.88.
. use http://www.stata-press.com/data/imeus/womenwk, clear
. regress lwf age married children education

      Source |       SS       df       MS              Number of obs =    2000
-------------+------------------------------           F(  4,  1995) =  134.21
       Model |  937.873188     4  234.468297           Prob > F      =  0.0000
    Residual |  3485.34135  1995  1.74703827           R-squared     =  0.2120
-------------+------------------------------           Adj R-squared =  0.2104
       Total |  4423.21454  1999  2.21271363           Root MSE      =  1.3218

(coefficient table omitted)
Refitting the model as a tobit and indicating that lwf is left-censored at zero with the ll(0) option yields
. tobit lwf age married children education, ll(0)

Tobit regression                                  Number of obs   =       2000
Log likelihood = -3349.9685

         lwf |      Coef.   Std. Err.      t    P>|t|
-------------+----------------------------------------
         age |    .052157   .0057457     9.08   0.000
     married |   .4841801   .1035188     4.68   0.000
    children |   .4860021   .0317054    15.33   0.000
   education |   .1149492   .0150913     7.62   0.000
       _cons |  -2.807696   .2632565   -10.67   0.000
-------------+----------------------------------------
      /sigma |   1.872811

  Obs. summary:        657  left-censored observations at lwf<=0
                      1343     uncensored observations
                         0  right-censored observations
The tobit estimates of lwf show positive, significant effects for age, marital status, the number of children, and the number of years of education. We expect each of these factors to increase the probability that a woman will work, as well as to increase her wage conditional on employment status. Following tobit estimation, we first generate the marginal effects of each explanatory variable on the probability that an individual will have a positive log(wage) by using the pr(a,b) option of predict.
. mfx compute, predict(pr(0,.))

Marginal effects after tobit
      y  = Pr(lwf>0) (predict, pr(0,.))

(dy/dx values omitted; evaluated at X = 36.208, .6705, 1.6445, 13.084)
(*) dy/dx is for discrete change of dummy variable from 0 to 1
We then calculate the marginal effect of each explanatory variable on the expected log wage, given that the individual has not been censored (i.e., was working). These effects, unlike the estimated coefficients from regress, properly take into account the censored nature of the response variable.
. mfx compute, predict(e(0,.))

Marginal effects after tobit
      y  = E(lwf | lwf>0) (predict, e(0,.))
         =  2.3102021

(dy/dx values omitted)
Since the tobit model has a probit component, its results are sensitive to the assumption of homoskedasticity. Robust standard errors are not available for Stata's tobit command, although bootstrap or jackknife standard errors may be computed with the vce() option. The tobit model imposes the constraint that the same set of factors x determine both whether an observation is censored (e.g., whether an individual purchased a car) and the value of a noncensored observation (how much a purchaser spent on the car). Furthermore, the marginal effect is constrained to have the same sign in both parts of the model. A generalization of the tobit model, often termed the Heckit model (after James Heckman), can relax this constraint and allow different factors to enter the two parts of the model. We can fit this generalized tobit model with Stata's heckman command, as described in the next section of this chapter.
10.4 Incidental truncation and sample selection

For truncation, the sample is drawn from a subset of the population and does not contain observations on the dependent or independent variables for any other subset of the population. For example, a truncated sample might include only individuals with a permanent mailing address and exclude the homeless. For incidental truncation, the
sample is representative of the entire population, but the observations on the dependent variable are truncated according to a rule whose errors are correlated with the errors from the equation of interest. We do not observe y because of the outcome of some other variable, which generates the selection indicator, s. To understand the issue of sample selection, consider a population model in which the relationship between y and a set of explanatory factors x can be written as a linear model with additive error u:

y_i = x_i β + u_i        (10.9)

That error is assumed to satisfy the zero-conditional-mean assumption of (4.2). Now consider that we observe only some of the observations on y_i, for whatever reason, and that indicator variable s_i equals 1 when we observe both y_i and x_i and is zero otherwise. If we merely run a regression using the observations available
in the full sample, those observations with missing values of y_i (or any elements of x_i) will be dropped from the analysis. We can rewrite this regression as

s_i y_i = s_i x_i β + s_i u_i        (10.10)
The OLS estimator of (10.10) will yield the same estimates as that of (10.9). They will be unbiased and consistent if the error term s_i u_i has zero mean and is uncorrelated with each element of x_i. For the population, these conditions can be written as

E[s u] = 0,    E[(s x_j)(s u)] = E[s x_j u] = 0
because s² = s. This condition differs from that of a standard regression equation (without selection), where the corresponding zero-conditional-mean assumption requires only that E[x u] = 0. In the presence of selection, the error process u must be uncorrelated with s x. Consider the source of the sample-selection indicator s_i. If that indicator is purely a function of the explanatory variables in x, we have exogenous sample selection. If the explanatory variables in x are uncorrelated with u, and s is a function of the x variables, then it too will be uncorrelated with u, as will the product s x. OLS regression estimated on the subset will yield unbiased and consistent estimates. For instance, if gender is one of the explanatory variables, we can estimate separate regressions for men and women with no difficulty. We have selected a subsample based on observable characteristics; e.g., s_i identifies the set of observations for females. We can also consider selection of a random subsample. If our full sample is a random sample from the population and we use Stata's sample command to draw a 10%, 20%, or 50% subsample, estimates from that subsample will be consistent as long as estimates from the full sample are consistent. In this case, s_i is set randomly, as sketched below.
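A minimal illustration of random selection (y, x1, and x2 are hypothetical variables):

. set seed 20061001    // any seed; fixed here only for reproducibility
. sample 20            // keep a random 20% of the observations
. regress y x1 x2      // remains consistent, though less precise, than the full-sample regression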
If s_i is set by a rule, such as s_i = 1 if y_i <= c, then as in section 10.3.1, OLS estimates will be biased and inconsistent. We can rewrite the rule as s_i = 1 if u_i <= (c - x_i β), which makes it clear that s_i must be correlated with u_i. As shown above, we must use the truncated regression model to derive consistent estimates.
Incidental truncation means that we observe y_i based not on its value but rather on the observed outcome of another variable. For instance, we observe the hourly wage when an individual participates in the labor force. We can imagine fitting a binomial probit or logit model that predicts the individual's probability of participation. In this circumstance, s_i is set to zero or one based on the factors underlying that decision:
y_i = x_i β + u_i        (10.11)
s_i = I(z_i γ + v_i >= 0)        (10.12)
where we assume that the explanatory factors in x satisfy the zero-conditional-mean assumption E[x u] = 0. The I(·) function equals 1 if its argument is true and is zero otherwise. We observe y_i if s_i = 1. The selection function contains a set of explanatory factors z, which must be a superset of x. For us to identify the model, z contains all x but must also contain additional factors that do not appear in x.17 The error term in the selection equation, v, is assumed to have a zero conditional mean, E[z v] = 0, which implies that E[x v] = 0. We assume that v follows a standard normal distribution. Incidental truncation arises when there is a nonzero correlation between u and v. If both these processes are normally distributed with zero means, the conditional expectation E[u | v] = ρv, where ρ is the correlation of u and v. From (10.11),

E[y | z, v] = x β + ρv
The conditional expectation E[v | z, s] for s_i = 1, the case of observability, is merely λ, the IMR defined in section 10.3.1. Therefore, we must augment (10.11) with that term:

E[y | z, s = 1] = x β + ρ λ(z γ)        (10.14)
If ρ ≠ 0, OLS estimates from the incidentally truncated sample will not consistently estimate β unless the IMR term is included. Conversely, if ρ = 0, that OLS regression will yield consistent estimates.
The IMR term includes the unknown population parameters γ, which may be estimated by a binomial probit model, Pr(s = 1 | z) = Φ(z γ), over the entire sample. With estimates of γ, we can compute the IMR term for each observation for which y_i is observed (s_i = 1) and fit the model of (10.14). This two-step procedure, based on the work of Heckman (1976), is often termed the Heckit model; a sketch of its logic appears below. Instead, we can use a full maximum-likelihood procedure to jointly estimate β, γ, and ρ.
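A minimal sketch of the two steps, assuming the womenwk variables from the example that follows (heckman ..., twostep performs these steps with properly corrected standard errors; the second-step OLS standard errors below are not valid):

. quietly probit work age married children education    // first step: selection equation
. predict double zg, xb
. generate double imr = normalden(zg)/normal(zg)        // inverse Mills ratio
. regress lw education age children imr if work==1      // second step: IMR-augmented wage equation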
17. As Wooldridge (2006) discusses, when z contains the same variables as x, the parameters are theoretically identified, but this identification is usually too weak to be practically applied.
The Heckman selection model in this context is driven by the notion that some of the z factors for an individual are different from the factors in x. For instance, in a wage equation, the number of preschool children in the family is likely to influence whether a woman participates in the labor force but might be omitted from the wage determination equation: it appears in z but not x. We can use such factors to identify the model. Other factors are likely to appear in both equations. A woman's level of education and years of experience in the labor force will likely influence her decision to participate as well as the equilibrium wage that she will earn in the labor market. Stata's heckman command fits the full maximum-likelihood version of the Heckit model with the following syntax:
heckman depvar [indepvars] [if] [in], select(varlist2)
where indepvars specifies the regressors in x and varlist2 specifies the list of z factors expected to determine the selection of an observation as observable. Unlike with tobit, where the depvar is recorded at a threshold value for the censored observations, we should code the depvar as missing (.) for those observations that are not selected.18 The model is fitted over the entire sample and gives an estimate of the crucial correlation ρ, along with a test of the hypothesis that ρ = 0. If we reject that hypothesis, a regression of the observed depvar on indepvars will produce inconsistent estimates of β.19
The heckman command can also generate the two-step estimator of the selection model (Heckman 1979) if we specify the twostep option. This model is essentially the regression of (10.7) in which the IMR has been estimated as the prediction of a binomial probit (10.12) in the first step and used as a regressor in the second step. A significant coefficient on the IMR, denoted lambda, indicates that the selection model must be used to avoid inconsistency. The twostep approach, computationally less burdensome than the full maximum-likelihood approach used by default in heckman, may be preferable in complex selection models.20 The example below revisits the womenwk dataset used to illustrate tobit. To use these data in heckman, we define lw as the log of the wage for working women and as missing for nonworking women. We assume that marital status affects selection (whether a woman is observed in the labor force) but does not enter the log(wage) equation. All factors in both the log(wage) and selection equations are significant. By using the selection model, we have relaxed the assumption that the factors determining participation and the wage are identical and of the same sign. The effect of more children increases the probability of selection (participation) but decreases the predicted wage, conditional on participation. The likelihood-ratio test for ρ = 0 rejects its null, so that estimates ignoring selection would be inconsistent.
18. An alternative syntax of heckman allows for a second dependent variable: an indicator that signals which observations of depvar are observed. 19. The output produces an estimate of /athrho, the hyperbolic arctangent of ρ. That parameter is entered in the log-likelihood function to enforce the constraint that -1 < ρ < 1. The point and interval estimates of ρ are derived from the inverse transformation. 20. For more details on the two-step versus maximum likelihood approaches, see Wooldridge (2002, 560-566).
. heckman lw education age children, select(age married children education)

(maximum-likelihood heckman output omitted)
We also use the heckman two-step procedure, which makes use of the IMR from a probit equation for selection.
. heckman lw education age children, select(age married children education) twostep

Heckman selection model -- two-step estimates     Number of obs   =       2000
(regression model with sample selection)          Censored obs    =        657
                                                  Uncensored obs  =       1343
                                                  Wald chi2(6)    =          .
                                                  Prob > chi2     =          .

             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
lw           |  (coefficients on education, age, children, and _cons omitted)
select       |  (coefficients on age, married, children, education, and _cons
             |   omitted)
-------------+----------------------------------------------------------------
mills        |
      lambda |   .1822815   .0638285     2.86   0.004       .05718     .307383
-------------+----------------------------------------------------------------
         rho |
       sigma |
      lambda |   .1822815   .0638285
Although it also provides consistent estimates of the selection model's parameters, we see a qualitative difference in the log(wage) equation: the number of children is not significant in this formulation of the model. The maximum likelihood formulation, when computationally feasible, is attractive, not least because it can generate interval estimates of the selection model's ρ and σ parameters.
10.5 Bivariate probit and probit with selection

The bivariate probit model extends the probit framework to a pair of latent variables, each following the single-equation form of (10.2):

y_1* = x_1 β_1 + u_1
y_2* = x_2 β_2 + u_2        (10.15)

where the errors u_1, u_2 are jointly standard normal with correlation ρ. The observable counterparts to the two latent variables y_1*, y_2* are y_1, y_2. These variables are observed as 1 if their respective latent variables are positive and zero otherwise.
One formulation of this model, termed the seemingly unrelated bivariate probit model in biprobit, is similar to the SUR model that I presented in section 9.4. As in the regression context, we can view the two probit equations as a system and estimate them jointly; joint estimation improves efficiency when ρ ≠ 0 but does not affect the consistency of the individual probit equations' estimates. However, consider one common formulation of the bivariate probit model, because it is similar to the selection model described above. Consider a two-stage process in which the second equation is observed conditional on the outcome of the first. For example, some fraction of patients diagnosed with circulatory problems undergoes multiple-bypass surgery (y_1 = 1). For each patient, we record whether he or she died within 1 year of the surgery (y_2 = 1). The y_2 variable is available only for those patients who are postoperative. We do not have records of mortality among those who chose other forms of treatment. In this context, the reliance of the second equation on the first is an issue of partial observability, and if ρ ≠ 0 it will be necessary to take both equations' factors into account to generate consistent estimates. That correlation of errors may be likely in that unexpected health problems that caused the physician to recommend bypass surgery may recur and kill the patient. As another example, consider a bank deciding to extend credit to a small business. The decision to offer a loan can be viewed as y_1 = 1. Conditional on that outcome, the borrower will or will not default on the loan within the following year, where a default is recorded as y_2 = 1. Those potential borrowers who were denied cannot be observed defaulting because they did not receive a loan in the first stage. Again the disturbances impinging upon the loan offer decision may well be correlated (here negatively) with the disturbances that affect the likelihood of default. Stata can fit these two bivariate probit models with the biprobit command, as sketched below. The seemingly unrelated bivariate probit model allows x_1 ≠ x_2, but the alternative form that we consider here allows only one varlist of factors that enter both equations. In the medical example, this varlist might include the patient's body mass index (a measure of obesity), indicators of alcohol and tobacco use, and age, all of which might affect both the recommended treatment and the 1-year survival rate. With the partial option, we specify that the partial observability model of Poirier (1981) be fitted.
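A minimal sketch of the two command forms (y1, y2, x1, x2, and x3 are hypothetical variables):

. biprobit (y1 = x1 x2) (y2 = x2 x3)    // seemingly unrelated form; regressor lists may differ
. biprobit y1 y2 x1 x2 x3, partial      // Poirier partial-observability form; one shared varlist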
10.5.1 Binomial probit with selection

Closely related to the bivariate probit with partial observability is the binomial probit with selection model. This formulation, first presented by Van de Ven and Van Praag (1981), has the same basic setup as (10.15) above: the latent variable y_1* depends on factors x, and the binary outcome y_1 = 1 arises when y_1* > 0. However, y_1j is observed only when y_2j = I(z_j γ + v_j > 0) = 1, that is, when the selection equation generates a value of 1. In the earlier example, this could be viewed as y_2 indicating whether the patient underwent bypass surgery. We observe the following year's health outcome only for those patients who had the
surgical procedure. As in (10.15), there is a potential correlation (ρ) between the errors of the two equations. If that correlation is nonzero, estimates of the y_1 equation will be biased unless we account for the selection. Here that suggests that focusing only on the patients who underwent surgery (for whom y_2 = 1) and studying the factors that contributed to survival is not appropriate if the selection process is nonrandom. In the medical example, selection is likely nonrandom in that those patients with less serious circulatory problems are not as likely to undergo heart surgery.
I illustrate one form of this model with the Federal Reserve Bank of Boston HMDA dataset21 (Munnell et al. 1996), a celebrated study of racial discrimination in banks' home mortgage lending. Of the 2,380 loan applications in this subset of the dataset, 88% were granted, as approve indicates. For those 2,095 loans that were approved and originated, we may observe whether they were purchased in the secondary market by the quasi-government mortgage finance agencies Fannie Mae (FNMA) or Freddie Mac (FHLMC). The variable fanfred indicates that 33% (698) of those loans were sold to Fannie or Freddie. We seek to explain whether certain loans were attractive enough to the secondary market to be resold, as a function of the loan amount (loanamt), an indicator of above-average vacant properties in that census tract (vacancy), an indicator of above-average median income in that tract (med_income), and the appraised value of the dwelling (appr_value). The secondary market activity is observable only if the loan was originated. The selection equation contains an indicator for black applicants, applicants' income (appl_income), and their debt-to-income ratio (debt_inc_r) as predictors of loan approval.
. use http://www.stata-press.com/data/imeus/hmda, clear
. rename s6 loanamt
. rename vr vacancy
21. Under the Home Mortgage Disclosure Act of 1975, as amended, institutions regulated by HMDA must report information on the disposition of every mortgage application and purchase as well as provide data on the race, income, and gender of the applicant or mortgagor.
. summarize fanfred loanamt vacancy med_income appr_value black appl_income debt_inc_r

(summary statistics omitted)
. heckprob fanfred loanamt vacancy med_income appr_value, select(approve = black appl_income debt_inc_r)

             |      Coef.         z    P>|z|     [95% Conf. Interval]
-------------+--------------------------------------------------------
fanfred      |
     loanamt |  -.0026434     -3.29   0.001     -.0042169   -.0010698
     vacancy |  -.2163306     -3.55   0.000     -.3358488   -.0968124
  med_income |   .2671338      2.99   0.003      .0920407    .4422269
  appr_value |  -.0014358     -2.82   0.005     -.0024351   -.0004364
       _cons |   .1684829      1.43   0.154     -.0631954    .4001612
-------------+--------------------------------------------------------
approve      |
       black |  -.7343534                       -.8947921   -.5739147
 appl_income |  -.0006596                       -.0011221   -.0001971
  debt_inc_r |  -.0262367                       -.033379    -.0190944
       _cons |   2.236424                        1.977844    2.495004
-------------+--------------------------------------------------------
     /athrho |  -.6006626                      -1.132311    -.0690146
         rho |  -.5376209                       -.8118086   -.0689052

(standard errors and remaining test statistics omitted)
The model is successful, indicating that the secondary market sale is more likely to take place for smaller-value loans (or properties). The probability is affected negatively by nearby vacant properties and positively by higher income in the neighborhood. In
the selection equation, the original researchers' finding of a strong racial effect on loan approvals is borne out by the sign and significance of the black coefficient. Applicants' income has an (unexpected) negative effect on the probability of approval, although the debt-to-income ratio has the expected negative sign. The likelihood-ratio test of independent equations conclusively rejects that null hypothesis, with an estimated rho of -0.54 between the two equations' errors, indicating that ignoring the selection into approved status would render the estimates of a univariate probit equation for fanfred biased and inconsistent.
Exercises
1. In section 10.3.1, we estimated an OLS regression and a truncated regression from the laborsub sample of 250 married women, 150 of whom work. This dataset can be treated as censored in that we have full information on nonworking women's characteristics. Refit the model with tobit and compare the results to those of OLS.
2. In section 10.3.2, we fitted a tobit model for the log of the wage from womenwk, taking into account a zero wage recorded by 1/3 of the sample. Create a wage variable in which wages above $25.00 per hour are set to that value and missing wage is set to zero. Generate the log of the transformed wage, and fit the model as a two-limit tobit. How do the tobit coefficients and their marginal effects differ from those presented in section 10.3.2?
3. Using the dataset http://www.stata-press.com/data/r9/school.dta, fit a bivariate probit model of private (whether a student is enrolled in private school) and vote (whether the parent voted in favor of public school funding). Model the first response variable as depending on years and logptax, the tax burden; and estimate the second response variable as depending on those factors plus loginc. Are these equations successful? What do the estimate of ρ and the associated Wald test tell you?

4. Using the HMDA dataset from section 10.5.1, experiment with alternative specifications of the model for loan approval (approve = 1). Should factors such as the loan amount or the ratio of the loan amount to the appraised value of the property be entered in the loan approval equation? Test an alternative heckprob model with your revised loan approval equation.