Stata Excel
Getting Started:
1. Go to Stata prompt and click on Intercooled Stata
2. In the command line type: set mem 5000k (then press enter)
Note: for very large files: set mem 500000k (press enter)
then type: set matsize 150 (press enter - this allows up to 150 variables in a model)
3. Click on file (upper left corner)
4. Click on open
5. Click on down arrow to use either the c or a drives
6. Click so that the desired drive reveals its files
7. Click on the file you want to load
8. To execute an operation (e.g., ordered probit), type the estimation command in
the command line. For example, type: regress (now click on the variable names
as listed on the left side of the page, beginning with the dependent variable) and
press enter. Note that replacing regress with fit will deliver many useful
diagnostics; in regression and in dichotomous logit/probit a constant is
automatically included. Other estimators: logit (for the odds ratio instead of
the log of the odds that logit yields, replace logit with logistic), probit (for
marginal effects replace probit with dprobit), oprobit and ologit (ordered
probit/logit), mlogit (multinomial logit), nlogit (nested logit) and tobit.
If using tobit, after the last independent variable type a comma and
the letters ll and ul (e.g., tobit ratio ada pover, ll ul). This censors
the model at the lower limit and upper limit (i.e., uses the lowest
and highest values as the censoring points). You can also choose a
censoring point yourself [e.g., ll(17)]. If censoring is on only one side, you
may use just one censoring point (e.g., tobit ratio ada pover, ll). After
running a tobit model, type quadchk (if these results differ greatly
from the tobit results it means that you probably shouldn't use the
tobit results). After probit or logit commands you can eliminate all
the convergence output by placing nolog after the comma at the end of the
command line: probit vote ada99 bush00, nolog
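For concreteness, a minimal sketch of a few of these commands typed in the command line (the variable names vote, ada99, bush00, ratio, ada and pover are taken from the examples above and assumed to exist in your dataset):
    regress ratio ada pover
    probit vote ada99 bush00, nolog
    dprobit vote ada99 bush00
    tobit ratio ada pover, ll ul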
Long Command Lines: If the command you are entering in a do file is so long that
it will not fit on one line, then type /// at the end of the first line and continue
the command on the next line (i.e., the line below).
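For example (a sketch in a do file, assuming these variables exist in the dataset):
    regress ratio ada85 par85 pover elderi whitei ///
        nameri defconp pres repv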
Reconfigure Screen: (1) click on Prefs; (2) click on Manage Preferences; (3)
Saving a Log of Results: After the last results type: log close. You should then
find this file on the f drive. Highlight all the results and switch the font to
Courier New 9 and it will line up correctly.
Sampling: to randomly select 10.5% of the cases from your dataset type:
sample 10.5. You can then save this smaller sample (e.g., for Small Stata).
Finding Variable Descriptions: after reading in a Stata dataset type: describe.
If the person who created the dataset included variable descriptions, this
command will produce them.
Select Cases by Scores on a Variable: logit nafta avmich if divrk>25 (only
uses cases where divrk is greater than 25; use >=25 for 25 or greater) If
selecting by a particular year you need to use two consecutive equal
signs. Thus to list the scores on variable race for 1990 type: list race if
year==1990 (two consecutive equal signs). To select on two variables at
the same time use &: logit nafta avmich if party==1 & south==0. Missing
data are a great problem with this procedure. Note that the if qualifier goes
before the comma and any options come after it. For example, to specify
lower and upper censoring points in tobit and use only observations where
the variable state equals 1:
tobit ratio ada85 par85 if state==1, ll ul
You can also select cases by scores on a variable using the keep
command prior to statistical analysis. For example, to select cases
scoring 2 on variable brown (with possible scores of 1, 2 and 3) you
could use the following command: keep if (brown==2). To use cases with
scores of 1 and 2 on brown type: keep if (brown==1 | brown==2). To use
cases with a score of less than 10,000 on a continuous variable income
type: keep if (income<10000), or for 10,000 and less type:
keep if (income<=10000). Make sure you don't save the data afterwards,
because you'll permanently lose all the dropped observations. If you are
using a do file, you can prevent permanent loss of data by having the last
command reload the original dataset.
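A minimal do-file sketch of this idea (mydata.dta, income, ada and pover are hypothetical names; brown is the variable from the example above):
    * select cases, run the analysis, then reload the original data
    use "C:/mydata.dta", clear
    keep if (brown==2)
    regress income ada pover
    use "C:/mydata.dta", clear    // last command reloads the original dataset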
Select Cases by Observation: In a data set with 100 observations, to use
observations 1, 25-29 and 34-100 type: drop in 2/24 (press enter)
drop in 30/33 (press enter) and then run regression
Deleting Observations from a Dataset: Read the dataset into Stata. In a
dataset in which year was a variable and I wanted to take data from the
years 1985, 1987, 1988, 1991, 1992, 1995, 1996, 2000 and 2005 from a
dataset that was annual from 1880 to 2008 I did the following:
drop if year<1985
drop if year==1986
drop if year==1989
drop if year==1990
You get the picture (I did not know how to drop consecutive years, e.g.,
1989 and 1990, in one command; one possibility is sketched below). When
you've deleted all the years you don't want, you will be left with those you
do want. Then, using the data editor, you can cut and paste the new
dataset into Excel. You might google both drop and keep in Stata. There
may be easier ways to do this than that described above.
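One possible way to drop (or keep) several years in a single command, sketched here assuming year is a numeric variable, is to combine conditions with | (or) or to use the inrange() and inlist() functions:
    drop if year==1989 | year==1990
    * or, equivalently:
    drop if inrange(year, 1989, 1990)
    * or keep only the years you want in one command:
    keep if inlist(year, 1985, 1987, 1988, 1991, 1992, 1995, 1996, 2000, 2005)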
Stacking Data - How to Change It: To change a dataset stacked by state (e.g.,
observations 1-20 are 20 consecutive annual observations on state #1,
with observation 21 being the first observation on state #2) to a dataset
stacked by year (e.g., observations 1-50 are the scores for 1985 on each
of the 50 states; stnum = state number), type: sort year stnum
Transpose Rows and Columns: xpose, clear. Since the xpose command
eliminates letter names for variables, it might be useful to run xpose
on a backup file while keeping a primary file containing the
variable names. You could then transpose the variable name column or
row in Excel and cut and paste the variable names into the xpose file.
Show scores on a particular observation: to show the score on variable dlh for
the 19th state (stcode is the variable name) and the year 1972 (year as the
variable name) type: list dlh if stcode==19 & year==1972
Entering a Series of Independent Variables: You can enter a series of
consecutive independent variables by typing the first variable, a hyphen, and
the last variable. For example, suppose you have 50 state dummy
variables that appear in consecutive order in your Stata dataset (e.g., ala,
ak, etc. through wy). Instead of entering each independent variable by
name, you could type: ala-wy and every variable beginning with ala and
ending with wy would be entered.
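A sketch (taxrate is a hypothetical dependent variable):
    regress taxrate ala-wy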
String Variables to Numeric Variables: Stata cannot use variables which
appear in the Data Editor in red (i.e., string variables - letters or numbers
stored as text) in statistical commands. To convert a red/string variable,
d3, to a numeric variable (also named d3), destring it into a new variable
(called place in the example ahead), drop the original d3, set a new d3
equal to place, and then drop place. The word force is an option of
destring, not a variable name. Proceed as follows:
destring d3, generate(place) force
drop d3
gen d3=place
drop place
Mathematical Procedures: addition +; subtraction -; multiplication: *; division /
Recoding a Variable: To convert a score of 1 into 2 and vice versa on variable
grhcum type: recode grhcum (1=2) (2=1). You can also make multiple
nonconsecutive changes in one command. For example, if you have a
variable vote with 4 categories of responses (0, 1, 2, and 3) and want
0, 1 and 3 read as 0 while 2 is read as 1, type the following:
recode vote (3=0) (1=0) (2=1). To combine two consecutive categories
use /; thus, to have 1 and 2 read as 1 type: recode cons (1/2 = 1). To
recode a variable that ranged from 0 to 1 (and assumed any value in
between) into just 0 and 1, I typed the following:
recode demcont (.0001/.5 = 0) (.5001/1.0 = 1). Note: the recode command
does not recognize < and >.
Cross Tabulation and Measures of Association: to cross tabulate two
variables type: tabulate tax cons, row column all (you may need to recode
one or both variables first). For a three variable table either type:
tabulate tax1 cons1 if party==1, row column all
or you need to sort by the control variable. For example, to use the two
variables above controlling for party type: sort par85 (press enter)
by par85: tabulate grh85 grh87, row column all exact (press enter)
Correlation: correlate tax cons (to correlate tax and cons can add more
variables)
Partial Correlation: pcorr tax cons party stinc
Kendall's tau: ktau tax cons (can add more variables)
Spearman rank correlation: spearman tax cons (can add more variables)
Gamma: see Cross Tabulation and Measures of Association above, or type:
tabulate tax cons, gamma (or: tab tax cons, gam).
Note: you can only use two variables at a time and you may need to
recode before obtaining a gamma statistic. You can have 5 categories
per variable, but I don't know how many more categories are allowed.
If you use the procedure listed at the beginning of Cross Tabulation and
Measures of Association you can avoid recodes.
Cronbach's Alpha: Cronbach's alpha examines reliability by determining the
internal consistency of a test, or the average correlation of items
(variables) within the test. In Stata, the alpha command conducts the
reliability test. For example, suppose you wish to test the internal
reliability of ten variables, v1 through v10. You could run the following:
alpha v1-v10, item. In this example, the item option displays the effect of
removing each item from the scale. If you want to see whether a group of
items can reasonably be thought to form an index/scale you could also use
Cronbach's alpha. For example: alpha a3e a3g a3j a3o, c i. The alpha
score in the Test scale row (alpha is in the far right column and Test
scale is a row) should be about .80 (the maximum is 1.0) to show a high
degree of reliability of the components. However, William Jacoby said
the .80 threshold is very high. He would have gone as low as .70 (but would
never use a scale with a reliability below .5, because you'd have more
error variance than substantive variance). If the variables are measured
on different scales you may want to standardize them. If so, add s
to the above command (i.e., alpha a3e a3g a3j a3o, c i s). Since the
alpha reported for a variable is what the Test scale alpha would be if
that variable were deleted, you can raise the Test scale alpha by
deleting any variable whose score in the alpha column is greater than
the alpha in the Test scale row. You can make the scale into a variable
by typing: alpha a3e a3g a3j a3o, c gen(anscale). Note: anscale is an
arbitrary name (you can pick any name you want; it will now appear as a
variable). If you want to exclude respondents with a particular score on
a variable (e.g., using scores 1 and 2 on variable petition but excluding 3)
then do the following: alpha a3e a3g a3j a3o if petition==1 | petition==2, c i s.
For a better understanding see Intermediate Social Statistics: Lecture 6.
Scale Construction by Thomas A.B. Snijders, saved as an Adobe file:
StataMokkenCronbach.
Mokken Scaling: Mokken scaling produces a hierarchical (cumulative) scale in
which the most infrequently endorsed items feature at the top (Frank Doyle,
et al., Exhaustion, Depression and Hopelessness in Cardiac Patients: A
Unidimensional Hierarchy of Symptoms Revealed by Mokken Scaling,
Royal College of Surgeons in Ireland, 2011, pp. 29-30).
Loevinger coefficients: Mokken (1971) proposed to measure the quality of the
pair of items i, j by the Loevinger coefficient
Hij = 1 - Observed Nij(1,0) / Expected Nij(1,0)
where Nij(1,0) counts the Guttman errors (a positive response to the less
frequently endorsed item i combined with a negative response to the more
frequently endorsed item j), and the expected value is calculated under the
null model that the items are independent. If no errors are observed, Hij = 1;
if as many errors are observed as expected under independence, then Hij = 0.
For example, with two items with means Xi = 0.2 and Xj = 0.6, for a sample
size of n = 100 the expected table is:
            Xj = 0   Xj = 1   Total
  Xi = 0      32       48       80
  Xi = 1       8       12       20
  Total       40       60      100
There are 8 expected errors in the above table (the Xi = 1, Xj = 0 cell). Now
suppose the observed errors were only 2 (i.e., 2 in the cell that contains 8).
Then Hij = 1 - (2/8) = 0.75. Thus, a good scale should have Loevinger H
coefficients that are large enough for all pairs i, j with i < j. Rules of thumb
that have been found useful are as follows:
  Hij < 0.3 indicates poor/no scalability;
  0.3 < Hij < 0.4 indicates useful but weak scalability;
  0.4 < Hij < 0.5 indicates medium scalability;
  0.5 < Hij indicates good scalability.
Similarly, Loevinger's coefficients can be defined for all pairwise errors for
a given item (Hi) and for all pairwise errors for the entire scale (H).
Although you can run the procedure without specifying a value for
Loevinger's H, you can set a level as in the following command (c is the
value set): msp a3f a3h a3i a3k, c(.4)
Below are some additional commands that can be used:
msp a3a-a3o, c(.4)
msp a3a-a3o, pairwise c(.4)
loevh a3a-a3o, pairwise
Mokken scaling can be used in a confirmatory way, with a given set of
items (where the order can be determined empirically), as well as in an
exploratory way. In the exploratory method, a set of items is given, and
the goal is to find a well-scalable subset. This is done by first finding the
pair with the highest Hij as the starting point for the scale, and by then
consecutively adding items that have the highest values of the Loevinger
coefficients with the items already included in the scale. This procedure
can then be repeated with the remaining items to find a further scale
among those. The reliability can also be estimated from the inter-item
correlations. The Mokken scaling module is not part of the normal Stata
program and must be downloaded. In the command line type: findit msp
Multidimensional Scaling: Assume we have information about the American
electorate's perceptions of thirteen prominent political figures from the
period of the 2004 presidential election.
Stepwise Regression: you can have Stata estimate a model and then re-estimate it,
deleting variables that are less significant than a selected threshold. For
example: stepwise, pr(.2) hierarchical: regress tax cons party stinc
would mean Stata would estimate the model with all three independent
variables and then re-estimate excluding any independent variable that
was not significant at the .20 level. Can be used with probit, logit, etc.
Regression Diagnostics: Run a regression with regress. Now in the command
line type: dfbeta (you'll see a DFbeta variable created for each independent
variable; type list followed by the name of the one you are interested in).
For other diagnostics, run a regression and then, for standardized residuals,
type: predict esta if e(sample), rstandard (in the command line). You will see
esta appear in the variable list. Now type: list esta and you will see the
values. For studentized residuals try the following after a regression:
predict estu if e(sample), rstudent (estu will now appear as a variable). For
Cook's distance type: predict cooksd if e(sample), cooksd after running a
regression.
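A combined sketch of these diagnostics, using the variable names from the correlation examples above (tax, cons, party and stinc are assumed to exist):
    regress tax cons party stinc
    dfbeta                                   // one DFbeta variable per independent variable
    predict esta if e(sample), rstandard     // standardized residuals
    predict estu if e(sample), rstudent      // studentized residuals
    predict cooksd if e(sample), cooksd      // Cook's distance
    list esta estu cooksd in 1/10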
Multicollinearity: after running a regression using regress, type: vif (or: estat vif)
in the command line. Subtract the number in the 1/VIF column from 1
to obtain the proportion of variation in that independent variable which is
explained by all the other independent variables. In the VIF column, numbers
above 30 indicate high variance inflation (i.e., high multicollinearity). This
doesn't work after probit/logit. Since at this point you're only interested in
multicollinearity, re-estimate the probit/logit equation as a regression and then
follow the procedure above.
Autocorrelation/Correlogram: regdw (replaces the regress command and
executes the Durbin-Watson test). The data need to be dated. For example,
if your data are annual and you have a variable called year, then before
you run a regression type: tsset year and press enter (or, after running a
regression with regress, type dwstat). The command corc (replaces the
regress command) executes the Cochrane-Orcutt correction for first-order
autocorrelation (note the data must be dated; see the regdw discussion
above). You can keep the first observation by using Prais-Winsten
(replace regress with prais). You can obtain a correlogram and specify
the number of lags. For example, to obtain a correlogram for the variable
top1 with 12 lags type: corrgram top1, lags(12)
Heteroscedasticity: run the regression replacing regress with fit and then, as the
next command, type: hettest and press enter. If you have significant
heteroscedasticity, use a robust estimation option. Thus, for a robust
regression type: rreg tax cons party stinc (note that rreg is robust regression,
which downweights outliers; for heteroscedasticity-robust standard errors
you can instead add the vce(robust) option to regress, as sketched below).
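A sketch of both steps (a Breusch-Pagan test after an ordinary regression, then heteroscedasticity-robust standard errors via vce(robust); tax, cons, party and stinc are the example variables used above):
    regress tax cons party stinc
    estat hettest                               // Breusch-Pagan test (or just: hettest)
    regress tax cons party stinc, vce(robust)   // heteroscedasticity-robust standard errors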
Lagged Independent Variable: You can lag a variable one time period by typing
l. in front of the variable. Thus, l.ussr is a one period lag of ussr, while l2.ussr
is a two period lag. You can also do this by typing: gen xussr = ussr[_n-1],
which will create a new lagged variable, xussr. Remember that your data must
be dated (see the regdw discussion under Autocorrelation above). Lagging will
cost one data point (when you run the regression it will use one fewer
observation).
Likelihood Ratio Test: the following example, which tests a full model against a
smaller nested model (in place of a test for the equality of two R squareds), is
from page 144 of J. Scott Long and Jeremy Freese, Regression Models for
Categorical Dependent Variables Using Stata, 2nd ed.
probit involvem repcont demcont ablegal fund1 catholic
estimates store fullmodel
probit involvem ablegal fund1 catholic
estimates store smallmodel
lrtest fullmodel smallmodel
Weights/Downloading Data: If you are downloading data and Stata refuses
to accept the file by saying weights not allowed (or something like that), put
quotation marks around the file name. Thus, if the file name is test, then in the
command line type: use "C:/test" (press enter). Put the quotation marks around
everything except the word use (thus: use "C:/test", not use C:/test).
Word Responses into Numerical Responses: If the responses are words
(e.g., strongly agree, etc.) and you want to convert them to numerical
values, one suggestion is to cut and paste the dataset into Excel and use
the Find and Replace option; you can ask Excel to find words and
then replace them with numbers. You can see if there is a numerical code
that the words translate into (e.g., strongly agree becomes 1, etc.) by the
following procedure: (1) click on Data at the top of the screen; (2) click
on Data Editor; (3) I think you can choose either Edit or Browse; (4)
click on Tools; (5) click on Value Labels; (6) click on Hide All Value
Labels - numbers should appear at this point. There is a way to
permanently convert words into numbers. Go to the data editor (choose edit,
not browse) and select (i.e., highlight) the variable (one variable at a
time) you are interested in (e.g., q1). Right click the mouse and choose
Value Labels, then choose Assign Value Label to Variable 'q1' and,
finally, choose None. This will erase the labels for the variable and leave
the numeric values.
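Not covered above, but Stata's encode command is another way to convert a variable whose values are stored as words into a numeric variable; a minimal sketch (q1 is the string variable, q1num an arbitrary new name):
    encode q1, generate(q1num)    // numeric codes, with the words kept as value labels
    drop q1
    rename q1num q1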
Merging Datasets:
Dear Stata Technical Support,
I'm a Stata version 11 user (serial number: 30110511993). I'm trying to
merge two state files (statedata.dta and kellywitco.dta). Statedata is annual data stacked
by state from 1880 to 2008 while kellywitco.dta is annual data stacked by state from 1975
to 2006 (e.g., annual data for 1975 through 2006 for Alabama, then 1975 through 2006
for Alaska, etc. - both datasets are stacked this way but the statedata.dta series is a much
longer time frame). The states are in the same order in both data sets. The variable "year"
is common to both datasets as is the variable "stnum" (for the number of the states and
the state numbers are the same in both data sets). Could you write out
the specific commands I should use to merge them? If you give "general directions" I'll
never be able to figure it out. Please write out the commands. Thanks for your help.
Chris Dennis
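A sketch of the kind of commands support might suggest for this situation, assuming each file contains exactly one observation per stnum-year combination (merge 1:1 requires Stata 11 or later):
    use "statedata.dta", clear
    merge 1:1 stnum year using "kellywitco.dta"
    tab _merge        // shows how many observations matched across the two files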
Omitted Variables Test: run a regression with fit instead of regress and then,
as the next command, type: ovtest (executes the Ramsey RESET test).
Predicting Scores from a Combination of Scores on the Independent
Variables that are Not in the Data Set: Since you will need a data set
that contains the values of the independent variables, (1) start by going
into Excel and creating a file that contains the variable names and the
desired scores on the independent variables (remember that row #1
contains the variable names, so row #2 will be the first row with scores
on the independent variables). [Note on Excel: (1) ctrl z (i.e., control z)
will undo what you last did in Excel; (2) to repeat a score down a column,
click once, drag to highlight, and use ctrl d to duplicate the number.]
(2) save as both an Excel file and a tab delimited text file: click on
save and save as an Excel file to the c drive, then click on save as,
highlight tab delimited text file and save to the c drive. (3) go into Stata
and bring in the tab delimited text file you just created by typing: insheet
using "C:/deterp.txt" (you need the quotation marks; deterp is the name
of the file). (4) now save this file as a Stata file by clicking on file (upper
left corner) and then save as (make sure that the c drive is highlighted
and you have put the file name without an extension in the file name
box, such as: deterp). (5) now bring in the data set on which the coefficient
values are to be estimated (i.e., not the set you just created and saved in
Stata). (6) run the equation that you want to use to estimate the
coefficients. (7) bring in the set you created by typing: use "C:/deterp.dta".
(8) type: predict yhat. (9) click on window and then click on data editor
and the yhat column will show the predicted values.
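The same workflow typed as commands (a sketch; deterp comes from the text above, while analysis.dta and the regression variables are hypothetical placeholders):
    insheet using "C:/deterp.txt", clear      // the file with the desired X values
    save "C:/deterp.dta"
    use "C:/analysis.dta", clear              // the dataset used to estimate the coefficients
    regress tax cons party stinc
    use "C:/deterp.dta", clear
    predict yhat                              // predictions from the last estimates
    list yhat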
Multinomial Logit/Probit: Stata automatically selects the most frequently
occurring outcome as the base outcome (i.e., the outcome all other
outcomes are compared to). If you want to designate category 2 as the
base, type: mlogit grhcum ccus86, base(2)
Event History/Hazard Model/Survival Analysis:
Note for State Politics Research: In analyses of states over time (i.e.,
consecutive yearly observations on a state) use probit or xtprobit (I
believe you'll get the same results) instead of a continuous time procedure
(e.g., Cox regression), because time is not continuous (i.e., many state
legislatures only meet for several months of the year - William Berry
suggests this). A good robustness check is to use the cluster option
discussed below, by state [i.e., at the end of the command line put a
comma and then cluster(state)].
From Tom Hayes on the Cox model and testing to see if it's appropriate:
**** sets the data for duration
stset durat, failure(adoptinc)
*** runs the stcox model; do not include the DV, just type stcox and then the IVs
stcox incomerel urban fiscal elect1 elect2 previousa ideology demcont ///
repcont top1 south
***** this test lets us know if the appropriate model is Cox regression. If any
variables come out significant, it may not be the best model *****
stphtest, detail
****test equality of survivor functions
sts test adoptinc
Unit Root Tests for Panel Data: search by typing: findit unit root (try xtfisher).
Cameron and Trivedi (pp. 267-268) say, "A serious shortcoming of the standard
Hausman test is that it requires the RE estimator to be efficient. This in turn
requires that the alpha and e (Greek letters in the original) are i.i.d. (independent
and identically distributed)," an invalid assumption if cluster-robust standard errors
for the RE estimator differ substantially from default standard errors. A
user-written version of the robust Hausman test can be executed as follows:
xtreg top1 demcont repcont top1lag, re vce(cluster id)
xtoverid (when I ran this test I received the following error message: saved RE
estimates are degenerate (sigma_u=0) and equivalent to pooled OLS)
Other Diagnostic Tests:
Regardless of whether you use fixed effects or random effects (just replace fe
with re in the commands ahead), the following might be useful additional
models to estimate:
1. If the Hausman test results suggest that you should use a random effects
model, the LM test helps you decide whether you should use OLS instead of
random effects. The null hypothesis is that there is no variation among
units (states in this example - i.e., no panel effect).
xtreg top1 demcont repcont top1lag, re
xttest0
If the number to the right of Prob > chi2 is .05 or lower, reject the null
hypothesis of no variation between entities in favor of the alternative
hypothesis of variation between entities (and stay with random effects). If
the null hypothesis is not rejected, run OLS (e.g., the reg command).
xttest1 (you can run this immediately after xttest0)
xttest1 is an extension of xttest0. It offers several specification tests for
error-component models. It includes the Breusch and Pagan (1980)
Lagrange multiplier test for random effects; the Baltagi-Li (1995) test for
first-order serial correlation; the Baltagi-Li (1991) joint test for serial
correlation and random effects; and the family of robust tests in Bera,
Sosa-Escudero, and Yoon (2001). The procedure handles unbalanced
panels as long as there are no "gaps" in the series; that is, individual time
series may differ in their start and end period but cannot have missing
values in intermediate periods. Consider the standard error-component
model allowing for possible first-order serial correlation:
y[i,t] = a + B*x[i,t] + u[i] + e[i,t]
e[i,t] = rho*e[i,t-1] + v[i,t]
permit lagged variables; you would need to create a lagged variable under
some other name (not a name Stata would read as a lag operator, such as
l.demcont). Also, abar (another test for autocorrelation - see above) will
allow lagged variables.
4. If the results of the Hausman test indicate you should use a fixed effects model,
the following test is useful. According to Baltagi, cross-sectional
dependence is a problem in macro panels with long time series (over 20-30
years). This is not much of a problem in micro panels (few years and a
large number of cases). The null hypothesis in the B-P/LM test of
independence is that residuals across entities are not correlated. The
command to run this test is xttest2 (run it after xtreg, fe):
xtreg top1 demcont repcont top1lag, fe
xttest2
If the number to the right of Pr is less than .05 reject the null hypothesis
that residuals across entities are independent (i.e., uncorrelated). When I
ran the test above, I received an error message that read: too few
common observations across panel. no observations. Rejection of the
null hypothesis could lead to using the robust standard errors model
shown below. Also, see next test discussed.
5. As mentioned above, cross-sectional dependence is more of an issue in
macro panels with long time series (over 20-30 years) than in micro
panels. The Pesaran CD (cross-sectional dependence) test is used to test
whether the residuals are correlated across entities. Cross-sectional
dependence (also called contemporaneous correlation) can lead to bias in
test results. The null hypothesis is that residuals are not correlated.
xtreg top1 demcont repcont top1lag, fe
xtcsd, pesaran abs
If the number to the right of Pr is less than .05, reject the null hypothesis
that the residuals are not correlated. When I ran this test I received the
following error message: The panel is highly unbalanced. Not enough
common observations across panel to perform Pesaran's test. insufficient
observations. Had cross-sectional dependence been present, Hoechle
suggests using Driscoll and Kraay standard errors via the command
xtscc.
xtscc top1 demcont repcont top1lag, fe
Note: even though I received the error message above when I ran the
xtcsd test, I did not receive an error message when I executed the xtscc
command.
Possible Models:
1.The model with heteroscedasticity:
xtreg top1 demcont repcont top1lag, fe vce(robust)
Note: The vce(robust) option should be used with caution. It is robust in
the sense that, unlike default standard errors, no assumption is made
about the functional form (A. Colin Cameron and Pravin K. Trivedi,
Microeconometrics Using Stata, revised ed., p. 334). From their
discussion, this can lead to problems. This is a reason for the cluster
option discussed ahead.
2.The model with first-order autocorrelation:
xtregar top1 demcont repcont top1lag, fe
3.The model with both first-order autocorrelation and heteroscedasticity:
xtreg top1 demcont repcont top1lag, fe vce(cluster id)
If you have reason to believe the disturbances are related to one of the
variables, use that variable after cluster. For example, there may be
correlation within states but not across states. Thus, observations in
different clusters (e.g., states) are independent, but observations within the
same cluster (e.g., state) are not independent.
xtreg top1 demcont repcont top1lag, fe vce(cluster stnum)
Note: if you replace stnum with id (cluster id) you get very different
standard errors. The id is supposed to represent an individual (whereas
i.i.d. means independent and identically distributed). I would've thought it
would be an individual state and, thus, would be the same as (cluster
stnum), but this is not the case.
4. The model with heteroscedastic, contemporaneously (cross-sectionally)
correlated, and AR(1) autocorrelated disturbances [Beck & Katz's
panel corrected standard errors]. Assumptions: (1) if I assume no
heteroscedasticity - panels(uncorrelated); (2) if I assume the variances
differ for each unit - panels(hetero); (3) if I assume that the error terms of
the units are correlated across panels - panels(correlated). The
correlation(psar1) option makes the AR(1) parameter panel specific. If the
number of time periods is not much larger than the number of
states, use the more restrictive corr(ar1) option.
If you want random effects omit the dummy variables (al-wi) in the above
command lines. Much of the above discussion was taken from
Microeconometrics Using Stata, revised edition, by A. Colin Cameron and Pravin
K. Trivedi, Stata Press, 2010 and Panel Data Analysis Fixed and Random
Effects (using Stata 10) by Oscar Torres-Reyna (available at
www.princeton.edu/~otorres).
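A minimal sketch of a panel corrected standard errors command with an AR(1) correction, using the variables from the earlier panel examples (the data must first be declared as a panel; the model shown is illustrative, not the one from the discussion above):
    tsset stnum year
    xtpcse top1 demcont repcont top1lag, correlation(ar1)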
Stationarity in Panel Models/Error Correction Models: You need the xtfisher
command, which can be downloaded (findit xtfisher). This is a test to see if a
variable is stationary. The null hypothesis is that it has a unit root; rejecting
the null hypothesis of a unit root means the variable is stationary. With error
correction models a useful test is to see if the residuals are stationary. If so, you
don't need to worry about some important problems. After estimating the model,
obtain the residuals (note that predict yhat gives the fitted values; to get the
residuals use the residuals option, e.g., predict ehat, residuals after regress).
Then type: xtfisher ehat
This will yield results for no lags. To specify the lag length change the command
to read: xtfisher ehat, lag(2) (for two time periods). If you reject the null with
confidence then the residuals are stationary. The results should show:
Ho: unit root
Prob > chi2 = 0.000 (i.e., less than a .001 probability of a unit root)
Models/Commands that run but not sure why they should be used:
xtreg top1 demcont repcont top1lag, i(id) fe
reg top1 demcont repcont top1lag i.year i.stnum
xtreg top1 demcont repcont top1lag i.year, fe
xtreg top1 demcont repcont top1lag i.year i.stnum, fe
xtgls top1 demcont repcont top1lag, i(id)
Discussion
Long panels are where time is much greater than the number of units (e.g.,
states). Short panels are the opposite. Use fixed-effects (FE) whenever you are
only interested in analyzing the impact of variables that vary over time. FE
explore the relationship between predictor and outcome variables within an entity
(country, person, company, etc.). Each entity has its own individual
characteristics that may or may not influence the predictor variables (for
example, being male or female could influence the opinion toward a certain
issue, or the political system of a particular country could have some effect on
trade or GDP, or the business practices of a company may influence its stock
price). When using FE we assume that something within the individual may
impact or bias the predictor or outcome variables and we need to control for this.
This is the rationale behind the assumption of correlation between the entity's
error term and the predictor variables. FE remove the effect of those time-invariant
characteristics from the predictor variables so we can assess the predictors' net
effect. Another important assumption of the FE model is that those time-invariant
characteristics are unique to the individual and should not be correlated with
other individual characteristics. Each entity is different, therefore the entity's
error term and the constant (which captures individual characteristics) should
not be correlated with the others. If the error terms are correlated then FE is not
suitable, since inferences may not be correct and you need to model that
relationship (probably using random effects); this is the main rationale for the
Hausman test (presented later in this document). Control for time effects (in
our example, year dummy variables) whenever unexpected variation or special
events may affect the dependent variable. The fixed-effects model controls for all
time-invariant differences between the individuals, so the estimated coefficients
of the fixed-effects models cannot be biased because of omitted time-invariant
characteristics (like culture, religion, gender, race, etc.). One side effect of the
features of fixed-effects models is that they cannot be used to investigate
time-invariant causes of the dependent variables. Technically, time-invariant
characteristics of the individuals are perfectly collinear with the person (or entity)
dummies. Substantively, fixed-effects models are designed to study the causes of
changes within a person (or entity). A time-invariant characteristic cannot cause
such a change, because it is constant for each person.
One alternative to a fixed effects model is a random effects model. The rationale
behind the random effects model is that, unlike the fixed effects model, the variation
across entities is assumed to be random and uncorrelated with the predictor or
independent variables included in the model: "the crucial distinction between
fixed and random effects is whether the unobserved individual effect embodies
elements that are correlated with the regressors in the model, not whether these
effects are stochastic or not" (Greene, 2008, p. 183). If you have reason to believe
that differences across entities have some influence on your dependent variable,
then you should use random effects. An advantage of random effects is that you
can include time-invariant variables (e.g., gender); in the fixed effects model these
variables are absorbed by the intercept. The random effects model is:
Yit = a + B*Xit + uit + eit (u is the between-entity error and e is the within-entity
error)
Random effects assume that the entity's error term is not correlated with the
predictors, which allows time-invariant variables to play a role as explanatory
variables. In random effects you need to specify those individual characteristics
that may or may not influence the predictor variables. The problem with this is
that some variables may not be available, therefore leading to omitted variable
bias in the model.
The between estimator uses only between or cross-section variation in the data.
Because only cross-section variation in the data is used, the coefficients of any
individual-invariant regressors, such as time dummies, cannot be identified. It is
seldom used.
To decide between fixed and random effects you can run a Hausman test, where
the null hypothesis is that the preferred model is random effects and the
alternative is fixed effects (see Greene, 2008, chapter 9). It basically tests
whether the unique errors (ui) are correlated with the regressors; the null
hypothesis is that they are not. Run a fixed effects model and save the estimates,
then run a random effects model and save the estimates, then perform the test
(the Hausman test is near the beginning of the TSCS material).
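A minimal sketch of that sequence, using the variables from the earlier panel examples:
    xtreg top1 demcont repcont top1lag, fe
    estimates store fixed
    xtreg top1 demcont repcont top1lag, re
    estimates store random
    hausman fixed random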
Older Discussion/Comments from Neal Beck:
I think it is desirable to use the panel corrected standard errors that Beck and
Katz (APSR, 1995) developed. Stata has the necessary commands. The
impression I got from the Stata manual was that you should only use panel
corrected standard errors if the number of time periods is equal to, or greater
than, the number of panels. In a study of the 50 states over 23 years, I would
seem to violate this assumption. I asked Neal Beck about it and here is his reply:
"PCSEs work in this case. The only issue is T (i.e., time periods) being large
enough (say over 15). Clearly you have that. FGLS (Parks) does not work here,
but you do not care." In a subsequent message in which I asked about using
PCSEs with both random effects and fixed effects models, Beck replied as
follows: "PCSEs are totally orthogonal to the question of effects. For bigger T
TSCS, re and fe are pretty similar. All the issues where re wins are for small T.
But whether or not you need effects has literally nothing to do with PCSEs." In a
later e-mail Beck mentioned that panel corrected standard errors cannot really be
done with random effects models.
Time Series Models
Dickey-Fuller Unit Root Test: dfuller loginc2, regress
Note: adding regress in the above command means you will
receive the regression results in addition to the Dickey-Fuller test
results.
Correlogram of ACF and PACF: corrgram loginc2
Note: to obtain pointwise confidence intervals try
the following: ac loginc2, needle (for the ACF)
or: pac loginc2, needle (for the PACF)
Also: for a correlogram of first differences try:
corrgram D.loginc2 (Note: there is no space between D.
and loginc2)
ARIMA Models:
Structural (i.e., substantive independent variables with an arima
error process with both one autoregressive and one moving
average term) arima loginc2 dempreslag, ar(1) ma(1)
Non-Structural Model (i.e., the current value of the dependent variable
is explained entirely in terms of an autoregressive and moving average
process, e.g.: arima loginc2, ar(1) ma(1))
Bootstrapped Standard Errors: Calculating the standard errors for a model
multiple times will produce different estimates each time because different
bootstrap samples are drawn each time. This leads to the question of how to
report results in journal articles and presentations. This is an issue with any
simulation-based technique, such as conventional bootstrapping
or Markov Chain Monte Carlo (MCMC) methods. The important points to
note are that the analyst controls the number of simulations and
adding more simulations brings the estimate closer to the true value.
In the case of BCSE, increasing the number of bootstrap replications
(B) shrinks the variation between calculations. From this, the analyst
could choose the number of digits to round off from each standard
error estimate when reporting results, and increase B until multiple
calculations return the same rounded value. In addition, the seed of
the random number generator can be set to duplicate the same
bootstrap samples in later iterations. In the article, I identify some
differences that make BCSE preferable to the CLRT method: (1) CLRT
does not calculate a full covariance matrix of the parameter estimates,
(2) CLRT is driven by a much different philosophy than the other
methods I examine, and (3) CLRT is more difficult to implement than
BCSE.
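One way to do this in Stata is to set the seed before bootstrapping (a sketch; the seed value, the number of replications, and the regression variables are arbitrary choices, not taken from the article):
    set seed 12345
    bootstrap, reps(1000): regress tax cons party stinc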
STATA DO Files
The do file below was for an annual data set (1976-2003) stacked by
state (i.e., observation #1 was Alabama for 1976, observation #2 was
Alabama for 1977, etc.) where I wanted the average scores on id1 and
id3 over the 1980-89 period for all states, and the same averages for
the 1990-99 period.
use "C:\EriksonWrightMcIverUpdate.dta", clear
keep stateid year id1 id3
sort stateid year
list in 1/10
foreach val in 1 3 {
egen eighties_id`val' = total(id`val') ///
if year > 1979 & year < 1990, by(stateid)
egen nineties_id`val' = total(id`val') ///
if year > 1989 & year < 2000, by(stateid)
sort stateid eighties_id`val'
by stateid : replace eighties_id`val' = ///
eighties_id`val'[_n-1] if eighties_id`val' == .
sort stateid nineties_id`val'
by stateid : replace nineties_id`val' = ///
nineties_id`val'[_n-1] if nineties_id`val' == .
replace eighties_id`val' = eighties_id`val'/10
replace nineties_id`val' = nineties_id`val'/10
}
gen ideology8089 = eighties_id1 - eighties_id3
gen ideology9099 = nineties_id1 - nineties_id3
drop eight* nine*
The following commands were used to find the number of years between
1989 and 2002 that the Democrats were in control (i.e., had a majority
of both houses of the state legislature and the governorship) in each
state. Stnum was simply the state number (i.e., Alabama was state #1)
use "C:\StatePoliticalData19492006.dta", clear
gen flag = uhdem>uhrep & lhdem>lhrep & demgov==1 & year>1988 & year<2003
egen demcontrol=total(flag), by(stnum)
tabulate stnum, summ(demcontrol)
The following commands were used to obtain the average Democratic
strength in each states government over the 1989-2002 period (lower
house worth 25%, upper house worth 25% and the governorship worth 50% i.e., the maximum value would be 1.0)
use "C:\StatePoliticalData19492006.dta", clear
gen dem_strength = .
levelsof stnum
foreach st in `r(levels)' {
local total_percent = 0
foreach val in lh uh {
qui sum `val'rep if stnum == `st' & year>1988 & year<2003
local reps = r(sum)
qui sum `val'dem if stnum == `st' & year>1988 & year<2003
local dems = r(sum)
local total = `reps' + `dems'
local demp = (`dems'/`total') * .25
local total_percent = `total_percent' + `demp'
}
qui count if demgov == 1 & stnum == `st' & year>1988 & year<2003
local demgs = r(N)
qui count if stnum == `st' & year>1988 & year<2003
zmass zmich zminn zmiss zmissouri zmont zneb znev znhamp zjersey ///
zmexico znyork zncar zndak zohio zok zore zpenn zrhode zscar ///
zsdak ztenn ztex zutah zvmont zvirg zwash zwvirg zwis, ///
correlation(ar1)
xtpcse sptaxrat xsptaxrat pover elderi whitei nameri defconp dhousecc dsenc ///
dhouseccc dsenccc dhouseccp dsenccp pres repv child5t17 margin3less ///
zala zalaska zariz zark zcal zcolo zconn zdel zflor ///
zgeor zhaw zidaho zillin zindiana ziowa zkent zkan zmaine zlouis zmary ///
zmass zmich zminn zmiss zmissouri zmont zneb znev znhamp zjersey ///
zmexico znyork zncar zndak zohio zok zore zpenn zrhode zscar ///
zsdak ztenn ztex zutah zvmont zvirg zwash zwvirg zwis, ///
correlation(ar1) rhotype(tscorr)
DO File: Averaging over a Term: To average a variable (here stideoan) over each
four-year period and have the average score read for each year (i.e., the average
of 1973, 1974, 1975 and 1976 read for each of those same four years) type:
use c:/StataTechnicalSupportVersion, clear
set more off
gen term=0
forvalues i = 1973(4)2008 {
replace term=1 if year==`i'
}
bysort stnum: replace term=sum(term)
bysort stnum term: egen term_avg=mean(stideoan)
set more on
exit
The variable term_avg was what you were after.
DO File: Partisan Strength Variables:
Creating Alt/Lowry's Political Control Variables (Demcont = 1 = Democratic
governor plus Democratic control of both houses of the legislature, 0 =
other; Repcont = 1 = Republican governor plus Republican control of both
houses of the legislature, other = 0; DS = 1 = Democratic governor and Split
Legislature one house majority Democratic and the other majority
Republican, 0 = other; DR = 1= Democratic Governor and Republican
control of both houses of the legislature, other = 0; RS = 1 = Republican
governor and one house majority Republican and the other house majority
Democratic, other = 0; and RD = 1 = Republican governor and Democratic
majorities in both houses of the legislature, other = 0).
use "F:\statedata8.dta", clear
tsset stnum year, yearly
//The following commands generate Alt/Lowry Political Variables and Unified
//Partisan Control Variables at the State Level
drop demcont repcont ds dr rs rd
drop lhdemp uhdemp lhdemc uhdemc lhrepp uhrepp lhrepc uhrepc
gen lhtot = lhdem + lhrep
gen lhdemp = lhdem/lhtot
gen lhrepp= lhrep/lhtot
gen uhtot = uhdem + uhrep
gen uhdemp = uhdem/uhtot
gen uhrepp= uhrep/uhtot
gen lhdemc= lhdemp
recode lhdemc (.0000/.5 = 0) (.5001/1 = 1)
gen uhdemc= uhdemp
recode uhdemc (.0000/.5 = 0) (.5001/1 = 1)
gen demcont = lhdemc + uhdemc + demgov
recode demcont (.0001/2.999 = 0) (3 = 1)
EXCEL
When preparing an Excel file for Stata, make sure that only the top
row contains the variable names (letters). Stata will read the top row as the
variable names, but a second row of names would be read as the first scores
on the variables. Additionally, if you are going to read an Excel file into
Stata, save the Excel file as both an Excel file and a tab delimited text file.
The tab delimited text file is the file Stata will read.
Sort: If sorting for a class, delete the names column. Then highlight
the rest of the spreadsheet, click on Data, click on Sort, click on
Sort by (choose the column you want the data sorted on the basis
of, e.g., ID), and it should go from lowest to highest in the column
you selected. If this doesn't work, try what is written below.
To sort from, for example, lowest to highest scores in a column do
the following: (1) put the column you wish to sort on the basis of in
column A (i.e., the furthest column to the left); (2) highlight column
A; (3) click on the heading entitled Data; (4) click on Sort; (5)
choose the sorting criteria (usually lowest to highest is the default
option, i.e., to sort from lowest to highest you should be able to
click on OK). YOU MAY NEED TO CHOOSE EXPAND THE
CURRENT SELECTION; if that doesn't work, check the option OTHER
than Expand the Current Selection.
Reading an Excel file into STATA: Make sure that any missing data are
denoted with a period (.), which is Stata's missing data code. Save the
Excel file as a tab delimited text file to the c drive (the tab delimited
option is in the lower part of the gray box that will appear when you save
the file by clicking on Save As in the upper left corner). You do not need a
file extension. Note: a gray box in the middle of the screen will say that
the file is not saved as an Excel file; click on yes. Go into Stata. In the
command line type: set mem 5000k and press enter. In the command
line type: insheet using "C:/senate1.txt" (you need both the quotation
marks and the txt on the end). Save as a Stata file by clicking in the
upper left corner on file and then save as, and make sure you remove
the * and have .dta as the file extension. Thus, to save as a file called
senate you would need the file name box to read: senate.dta (no asterisk).
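The same steps can also be typed in the command line (a sketch):
    set mem 5000k
    insheet using "C:/senate1.txt", clear
    save "C:/senate.dta"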
Converting a STATA file into Excel: read the data file into Stata; click on
Window (top of the screen); click on Data Editor; click on the upper left
corner to highlight and keep highlighting as you move right; click on Edit
(top of screen); click on copy (or control c, i.e., Ctrl-c); then go into
Excel and paste (or control v, i.e., Ctrl-v). Check to see that dots
(i.e., .) representing missing data are in the new Excel file. If not, follow
the Recoding and Missing Data section ahead.
Text to Columns (separating downloaded data that all landed in one column):
(1) make column A as wide to the right as you can (i.e., put all the data as
well as names that might appear on the left side of the page in column A -
it will be very wide - only the data will end up being transferred); (2)
highlight the data you are interested in; (3) click on Data at the top of the
screen; (4) click on Text to Columns; (5) click on Next (Fixed Width should
be the default, which will have a dot in it); (6) click on Finish. You may need
to repeat this procedure because of how the data are blocked. For example,
you may have data on all 50 states with left/right lines separating one group
of years from another. In such a circumstance you need to go below each
set of left/right lines (i.e., you don't want to highlight the left/right lines or
other such writing or figures, just the data you want to use), highlight the 50
state scores, and then repeat the process for another year or set of years.
Note: If the above does not work on any part of your document try the
following: (1) repeat steps 1-5 (i.e., through click on Next); (2) in the
Data Preview box click between the numbers on what appears to be a
ruler, at points that place vertical lines isolating the columns you want to
use (i.e., say at 15, then 30, etc.) - go as far to the right as your data
allow; (3) click on Next; (4) click on Finish. If this doesn't work, then
repeat the initial steps 1-4 (i.e., through Text to Columns), change
from Fixed Width to Delimited, and proceed as immediately above. Also,
you might need to copy (i.e., highlight) what you want to move, click on
copy, then paste special and click on values. I've been able to move
stuff from a downloaded Excel file into another Excel file by this method.
Copy and Paste Non-Consecutive Rows or Columns (e.g., transforming
data organized by year to data organized by state, in which case you
might want to copy and paste lines 2, 52, 102, etc.): If you keep the
control key down, you can highlight non-consecutive rows or columns.
Unfortunately, when you then try the copy or cut option, it won't work. So
you need an alternative strategy. Try the following: (1) duplicate the entire
file so you won't harm the original; (2) take the unit you are trying to stack by
(e.g., state or year) and copy the names of the various units so they are
on both the left and right side of the spreadsheet (i.e., it is much easier to
know what unit you're on if the information is on both sides of the
spreadsheet); (3) highlight all the information between consecutive
observations on the unit (e.g., if you have data for the 50 states by year,
observation 1 is Alabama in year 1920 and observation 2 is Alaska in 1920,
and you want to convert this so that observation 2 is Alabama in 1921, then
highlight all rows between Alabama in 1921 and Alabama in 1922), and
then right click on copy and then paste (e.g., using control v) into the
new document.
Removing a Symbol (e.g., $) from a Column of Data: (1) highlight the row
or column the symbol you want to remove is in; (2) click on Format; (3)
click on Cells; (4) highlight Number (to convert to just numbers); (5) click
on OK.
Reducing the Size of a Cell (e.g., the data take two rows for each cell and
one of the rows is blank): (1) select all the cells that have information; (2) go to
the Edit menu; (3) within this menu go to Clear; (4) from the list select
Formats (the cells should now be broken into two; next we will delete
the blank cells); (5) go to the Edit menu; (6) within this menu select Go
To; (7) on the Go To menu click Special; (8) on the list select Blanks;
(9) press OK; (10) go to the Edit menu; (11) within this menu select
Delete; (12) in the delete box select Shift cells up; (13) press OK.
Interpolation: Let us say that you are working with annual time series data and
have scores at decade-wide intervals (e.g., census figures every 10
years). If you want to average the change over the intervening years,
proceed as follows: (1) find the amount (or rate) of change over the time
interval you are interested in (e.g., if there is a 2 point change over 10
years, this would be .2 of a point per year); (2) in the first cell in which you
lack data, type a formula such as: =A2-.2 (this indicates a decrease of 2
tenths of a unit per year from the score in cell A2; assuming cell A1
contains the name of the variable and cell A2 has the first score, you
type the preceding formula in cell A3, i.e., the first cell without data;
other mathematical signs are * for multiply, + for add and / for division) and
press enter; (3) highlight the cell you just typed the formula in (e.g., A3)
and as many cells as you want this same amount of change to extend; (4)
click on Edit; (5) click on Fill; (6) click on Down or whatever
direction you are going; (7) click somewhere else to remove the highlight.
Statistics in Excel: (1) open an Excel file; (2) click on Tools; (3) click on Data
Analysis; (4) click on Regression; (5) make sure that the dependent
variable is either in column A or the last column on the right; (6) it is
important to remember that Excel does not read letter variable names.
Also, the variables need to be by column (i.e., one column per variable).
For example, suppose your first column (i.e., column A in Excel) contains
100 observations on a variable named CONS. Excel can't read the
variable name CONS. Since the variable (i.e., column) contains 100
scores beginning with the second row and ending with the 101st row, enter
the following in the box: $A$2:$A$101 (i.e., Excel will read cells A2
through A101). If CONS were the dependent variable (i.e., y) in a
regression, then $A$2:$A$101 would go in the y box. In my case, the
independent variables were in columns A through C and the dependent
variable was in column D. It would appear handy to have the independent
variables next to each other. Thus, if you have 4 variables (i.e., the
dependent variable and three independent variables), make your
dependent variable either the first, or last, column. In the Input Y Range