Readme PDF
The Size Distribution of Firms and Industrial Water Pollution: A Quantitative Analysis of China
Replication Instructions
This note explains, step by step, how to replicate all the results in The Size Distribution of Firms and Industrial Water Pollution: A Quantitative Analysis of China. We begin with housekeeping information in Section I. Section II explains in detail how to replicate all the results in the main text. Section III shows how the results in the appendices can be replicated. The data and code can be downloaded from the AEA Data and Code Repository (OPENICPSR-112005).
I. Housekeeping Information
A. Overview
To replicate all the results in the paper, three software packages are used. Specifically, Stata is used for initial data processing. All the empirical results are then produced using R. MATLAB is used to compute all the quantitative results. There is no particular scientific reason that we use one statistical package (Stata) for data cleaning and another (R) for empirical analysis.1 Veterans of Stata or R should be able to replicate all the empirical results with just one package, using our source code as a reference. In addition, readers without a MATLAB license can use GNU Octave instead.
We use three datasets in our analysis: the National General Survey of Pollution Sources (NGSPS), the National Economic Census (CNEC) and the Statistics of U.S. Businesses (SUSB).
• The NGSPS is a confidential dataset housed at the Ministry of Ecology and Environment (MEE henceforth). The dataset is subject to regulation by The Law of the People’s Republic of China on Guarding State Secrets because it contains sensitive information. Please refer to Regulation on Archiving the National General Survey of Pollution Sources Data (State Environmental Protection Administration [2007] No. 187). The data can only be accessed within the MEE and institutions under its direct supervision. Researchers affiliated with these institutions should submit an application for access following the internal procedure of the MEE. This is how we accessed the data.
∗
Please send all correspondence to: zjutangxin@gmail.com
1 Because Stata has limited backward compatibility and the foreign package of R supports only .dta files saved by Stata up to Version 12, the reader would need Stata 12 to run the code without any modification. Readers with a more recent version of Stata would need either to modify the .do files everywhere .dta files are read and written, or to use an alternative R package (for instance readstata13) to convert the data in cpsc_convert.R and cec_convert.R. We have not tested our code under such environments. However, none of our results should be affected.
2 AMERICAN ECONOMIC JOURNAL: MACROECONOMICS
Individual researchers not affiliated with the MEE or institutions under its direct supervision need to submit the application through the MEE's official portal for requesting Information Access at the following URL: http://www.mee.gov.cn/weihu/201706/t20170625_416641.shtml. The URL was retrieved on September 15th, 2019 and may change in the future. All applications are subject to official approval by the General Office of the Ministry of Ecology and Environment.
For an official introduction to the NGSPS and declassified information at the aggregate level, interested readers can refer to the book Data Collection of the National General Survey of Pollution Sources (2011), which is publicly available. We provide a summary in English in Online Appendix A of our paper.
• The CNEC is also a confidential dataset, belonging to the National Bureau of Statistics of China. Many universities and institutions hold legal copies of the dataset, however. Our copy was obtained from Fudan University.
• The SUSB is publicly available from the Census Bureau’s dedicated website. The URL, retrieved on September 15th, 2019 and subject to possible change in the future, is https://www.census.gov/programs-surveys/susb.html.
In the replication files, we cannot provide the micro-level data of the NGSPS and CNEC because of non-disclosure agreements with the MEE and Fudan University. We therefore post only the source code that generates the results. However, to facilitate replication by readers who have access to the datasets, we also post the log files produced when we ran the analysis on our computer. More details on the log files are provided at the end of this note.
B. File Structures
Readers will receive a number of files for replicating the results. To ease digestion, we briefly introduce the structure of the folders and files. As a typographical convention, we use blue Computer Modern typewriter type to indicate folders and files. The parent folder is referred to as ./.
• ./Empirical/:
• ./Quantitative/
– ./twosectors.m: computes the equilibrium of the two-sector model with perfect substitution used in the main text and saves the results in .mat format.
– ./fcn2.m: computes the excess demand at a given vector of prices; called by twosectors.m.
– ./compute_tables.m: computes the raw data used to calculate Tables 4, 5 and 6 in the main text.
– ./plot_figures.m: plots Figures 5 and 6 in the main text.
– ./twosectors_ces.m: computes the equilibrium of the two-sector model with constant elasticity of substitution used in Appendix J and saves the results in .mat format.
– ./fcnces.m: computes the excess demand at a given vector of prices; called by twosectors_ces.m.
– ./compute_tables_ces.m: computes the raw data used to calculate Table J.1 in Appendix J.
– ./misc.m: draws Figure D.2 and computes the decomposition in Sections IV.A and IV.B of the main text.
All file references in the source code are specified as relative paths. Please set the working directory of your software to the folder that hosts the file being run. For example, if the package is saved locally at C:/Users/AEJ/pollution_rep/, then to replicate the empirical results, the working directory of both Stata and R should be set to C:/Users/AEJ/pollution_rep/Empirical, while the MATLAB working directory should be set to C:/Users/AEJ/pollution_rep/Quantitative when replicating the quantitative results.
C. Data Cleaning
The NGSPS comes as five Stata .dta files. keynum.dta contains the information on the key sources, while reg1.dta to reg4.dta hold the information on the regular sources. There are four files for the regular sources because of technical limitations at the time we initially applied for the data. The reader will likely receive a single file for the regular sources now. These files should be placed under ./Empirical/Data/. Executing cpsc_raw.do combines the four regx.dta files into regall.dta and labels all the variables in keynum.dta and regall.dta. It also generates another file, allfirms.dta, in which the two files are combined. All three files are saved under ./Empirical/Data/.
Our copy of the CNEC has two Stata .dta files. cec2004_large_full.dta holds the observations that overlap with the commonly used 2004 Annual Surveys of Industrial Firms (ASIF), while cec2004_small.dta contains small firms not surveyed in the annual ASIF. Again, they should be placed under ./Empirical/Data/. The file cec_clean.do labels all the variables. Upon completion, the original unlabeled files are replaced by files with the same names.
With these two steps completed, the reader needs to convert the data from the Stata .dta format to the internal binary R format .RData. The file cpsc_convert.R converts keynum.dta to KEYFIRM_R.RData, regall.dta to REGFIRM_R.RData and allfirms.dta to ALLFIRM_R.RData. Similarly, cec_convert.R converts cec2004_small.dta to DSMALL_R.RData and cec2004_large_full.dta to DLARGE_R.RData. It then extracts only the variables used in the paper and combines the two .RData files into one file, CNEC_avgp.RData. All the files are saved under ./Empirical/Data/ as before.
The SUSB data are directly downloaded from the Census website as susb04.csv. We use the dataset in
its original form.
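Before running the cleaning scripts, it can help to confirm that the raw inputs are in place. The following is a minimal sketch in Python and is not part of the replication package. The file names are those mentioned in the text above, with underscores assumed in the multi-word names; the helper name missing_raw_files is hypothetical.

```python
from pathlib import Path

# Raw inputs named in the text; all are expected under ./Empirical/Data/.
# Assumption: multi-word file names use underscores.
EXPECTED_RAW_FILES = [
    "keynum.dta",
    "reg1.dta",
    "reg2.dta",
    "reg3.dta",
    "reg4.dta",
    "cec2004_large_full.dta",
    "cec2004_small.dta",
    "susb04.csv",
]


def missing_raw_files(data_dir):
    """Return the expected raw files that are absent from data_dir."""
    data_dir = Path(data_dir)
    return [name for name in EXPECTED_RAW_FILES
            if not (data_dir / name).is_file()]
```

Calling missing_raw_files("./Empirical/Data") from the parent folder returns an empty list when every file is present. Readers who received a single file for the regular sources would need to adjust the list accordingly.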
In what follows, we will only use the files converted by R. A list of the essential files is provided below. All the files should be placed under ./Empirical/Data/. For ease of reference, we use the two-letter acronyms in parentheses when referring to the data files later.
• DLARGE_R.RData (DL): A subset of CA which overlaps with the 2004 ASIF.
We try to organize our source code by the sections of the paper as much as we can. This is not always possible, however, or would create unnecessary confusion. We do break the code into sections within each file, grouping together tasks that are logically connected. The code is written such that each section is self-contained, so readers do not have to run a file from the very beginning if they only want to replicate a certain result.2 Table 1 summarizes the correspondence between the results in the paper (tables, figures, regression results, in-text numbers, etc.) and the sections of the code files that compute them, as well as the data dependencies. To help readers navigate the code visually, important code snippets are always marked by a 76-character line of =. These are the longest lines in the source code and should be the most visually noticeable.
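The separator convention described above also makes the section boundaries easy to locate programmatically. The sketch below, in Python, is only an illustration and is not part of the replication package. It assumes that a marker is any line containing a run of at least 60 = characters, since the 76 characters may include a comment prefix; the function name section_marker_lines is hypothetical.

```python
import re

# Assumption: a section marker is any line with a run of at least 60 "="
# characters. The threshold is below 76 in case the comment prefix counts
# toward the 76 characters mentioned in the text.
MARKER = re.compile(r"={60,}")


def section_marker_lines(source_text):
    """Return the 1-based line numbers of separator lines in source_text."""
    return [number
            for number, line in enumerate(source_text.splitlines(), start=1)
            if MARKER.search(line)]


# A small made-up file body with two separator lines.
example = "\n".join([
    "# " + "=" * 74,
    "# Section 1",
    "x <- 1  # a short == comparison is not a marker",
    "# " + "=" * 74,
])
print(section_marker_lines(example))  # [1, 4]
```

To use it on a real file, read the file into a string first, for example with pathlib.Path(...).read_text().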
We now explain how to replicate the results in the main text step-by-step. The results are presented in the
order of appearance in the main text.
1. Table 1 is produced by the first part of Descriptive.R. All the numbers are printed to the terminal. Row 1 comes first, followed by Rows 2 and 3 in order.
2. Regression (1) and Figure 1 are generated by Section 1 of Empirical_AEJ.R. The regression results are printed to the terminal. Figure 1 is saved at ./Empirical/Results/Figure1.pdf.
3. Regressions (2) and (3), the un-numbered regression in the part of Section I.B titled The Role of End-of-pipe Treatment Technologies, and Table 2 are produced by Section 2 of Empirical_AEJ.R. As one exception to the self-contained design, Section 2 relies on Section 1 of Empirical_AEJ.R. The information in Table 2 is printed to the terminal in the order of Columns 2, 1, 3 and 4. The numbers in Columns 1 and 2 are taken directly from the output. Columns 3 and 4 are calculated by normalizing the physical equipment level to 100. Regression results are then printed to the terminal: first the un-numbered regression, followed by Regressions (2) and (3) in order.
4. Figure 2 is produced by Section 3 of Empirical_AEJ.R. The two panels are exported as ./Empirical/Results/Figure2_Left.pdf and Figure2_Right.pdf.
2 That said, readers should always check that all the required R packages are loaded.
QI ET AL.: FIRM SIZE DISTRIBUTION AND POLLUTION REPLICATION 5
5. The in-text number of the accounting exercise can be calculated from the results of executing Accounting.R. The 33% reduction in the last paragraph of the part of Section I.C titled An Accounting Exercise can then be calculated as the weighted average of the variable ppar, with Row 1 of Table 1 as the weights. Notice that the variable ppar is exactly Row 3 of Table F.1. According to Row 1 of Table 1, the five industries combined contribute 77% of total COD emissions. Assuming that the industries emitting the other 23% of COD remain intact, the 25% reduction in average intensity is simply calculated as 1 − (67% × 77% + 100% × 23%) ≈ 25%.
6. Figure 3 is produced by the second part of Section 4 of Empirical_AEJ.R. The two panels are exported as ./Empirical/Results/Figure3_Left.pdf and Figure3_Right.pdf.
7. To compute the quantitative results, the reader needs to execute twosectors.m several times. There are four cases to be computed: the benchmark case, the no distortion case, the regulation case and the flat tax case. The reader needs to set the parameters in twosectors.m manually for each case, namely the distortions tauzd and tauzc, the regulation intensity xi and the file name at the very end. The code is initially set up to compute and save the benchmark results. It contains all the relevant statements, so the reader only needs to comment and un-comment the relevant sections, which are easy to locate by searching for the variable names.
(a) Execute twosectors.m. Some information is printed to the terminal. Please refer to the log file quantitative_main_log.pdf for details. The results are saved in ./Quantitative/Results/benchmark_new.mat.
(b) Set tauzd and tauzc to zero, and the export destination to ./Results/notax_new.mat. The results will be saved in this file after execution. This case is labeled as Case (i) in the paper.
(c) Set tauzd and tauzc back to the benchmark level, change xi to 0.355, and set the export destination to ./Results/regulation_new.mat. The results will be saved in this file after execution. This case is labeled as Case (ii) in the paper.
(d) Set tauzd and tauzc to the flat tax level, revert xi to 0.23, and set the export destination to ./Results/flattax_new.mat. The results will be saved in this file after execution. This case is labeled as Case (i’) in the paper.
8. With all the *.mat files computed, executing plot_figures.m produces Figures 5 and 6. The corresponding files are saved under ./Quantitative/Results/. The files are Figure5_Left.pdf, Figure5_Right.pdf, Figure6_TopLeft.pdf, Figure6_TopRight.pdf, Figure6_BotLeft.pdf and Figure6_BotRight.pdf.
9. Tables 4, 5 and 6 are calculated using the output of compute_tables.m. Notice that Tables 4 and 6 are presented in relative changes and share a similar structure. After the input files at the beginning of compute_tables.m are changed (in particular to the four .mat files just computed), compute_tables.m prints the raw data of the corresponding entries in the same structure as the tables in the paper. Table 5 can be taken directly from the block labeled Table 5 in the output of compute_tables.m, run with the corresponding cases as input. Entries in Table 4 can be calculated from the Table 4 output for the benchmark case and Cases (i) and (ii), while those in Table 6 can be calculated from the Table 4 output for the benchmark case and Case (i’). The calculations of the relative changes are not automated, so the reader has to do them manually, as we did. We apologize for the inconvenience.
The in-text decomposition of the total reduction into the contributions of the size distribution and technology adoption can be performed using misc.m. The reader needs to change the second input file to either notax_new.mat or regulation_new.mat, depending on the case to compute. The contribution of technology adoption is printed to the terminal; the residual is the contribution of the size distribution.
10. Several calibration targets are computed from the data as well. This last item explains where they are computed.
• The share of firms using clean technology, 57%, is calculated in Section 2 of Empirical_AEJ.R. The variable clean_share holds the number.
• The average adoption cost of clean technology as a ratio of output for clean firms, 2.5%, is calculated at the end of Section 2 of Empirical_AEJ.R. The number is printed to the terminal directly.
• The value of φ1 is computed in the first part of Section 4 of Empirical_AEJ.R. The variable phi_quant holds the value.
• The firm size and employment distributions [the light (yellow) bars in Figure 5] are computed at the end of Section 3 of Empirical_AEJ.R. The variable distchn is first used to store the employment distribution and then the firm size distribution.
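The arithmetic behind the accounting exercise in item 5 can be checked in a few lines. This is a minimal sketch in Python using only the percentages quoted in the text (a 77% emission share for the five industries and a 33% intensity reduction within them); the function name is hypothetical and the snippet is not part of the replication package.

```python
def aggregate_intensity_reduction(covered_share, reduction_in_covered):
    """Overall reduction when only the covered industries change.

    covered_share: share of total COD emissions from the covered industries.
    reduction_in_covered: proportional reduction within those industries.
    The remaining industries are assumed to stay at 100% of their level.
    """
    remaining = ((1.0 - reduction_in_covered) * covered_share
                 + 1.0 * (1.0 - covered_share))
    return 1.0 - remaining


# 1 - (67% x 77% + 100% x 23%), as in item 5.
reduction = aggregate_intensity_reduction(0.77, 0.33)
print(f"{reduction:.1%}")  # 25.4%
```

The result, about 25.4%, rounds to the 25% reported in the text. It equals 33% × 77%, since the unaffected industries contribute no change.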
That concludes our instructions for replicating all the results in the main text.
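Because the four runs of twosectors.m in item 7 are configured by hand, a small checklist can help track which result files already exist before plot_figures.m and compute_tables.m are run. The sketch below is in Python and is not part of the replication package. The case settings summarize items 7(a) to 7(d); the numerical benchmark and flat tax values of tauzd and tauzc are in the source code and are deliberately not reproduced here. File names assume underscores, and the helper name pending_cases is hypothetical.

```python
from pathlib import Path

# One entry per run of twosectors.m described in item 7.
# "distortions" records the prescribed setting of tauzd and tauzc in words;
# the actual numerical values live in twosectors.m and are not guessed here.
CASES = [
    {"label": "benchmark", "distortions": "benchmark", "xi": 0.23,
     "output": "benchmark_new.mat"},
    {"label": "Case (i)", "distortions": "zero", "xi": 0.23,
     "output": "notax_new.mat"},
    {"label": "Case (ii)", "distortions": "benchmark", "xi": 0.355,
     "output": "regulation_new.mat"},
    {"label": "Case (i')", "distortions": "flat tax", "xi": 0.23,
     "output": "flattax_new.mat"},
]


def pending_cases(results_dir):
    """Return the labels of cases whose .mat output is not yet present."""
    results_dir = Path(results_dir)
    return [case["label"] for case in CASES
            if not (results_dir / case["output"]).is_file()]
```

Calling pending_cases("./Quantitative/Results") returns an empty list once all four cases have been computed and saved.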
We now proceed with replicating the results in the appendices step-by-step. The results are presented in the
order of appearance in the appendices. The correspondence between the results in the appendices and the
section of code files is summarized in Table 2.
1. Table A.1 is produced by the second part of Descriptive.R. All numbers are printed to the terminal.
2. Figure B.1 is generated by Section 1 of Empirical_Appendix.R. The four panels are saved respectively as ./Empirical/Results/FigureB1_TopLeft.pdf, FigureB1_TopRight.pdf, FigureB1_BotLeft.pdf and FigureB1_BotRight.pdf.
3. Section 2 of Empirical_Appendix.R produces the results in Appendix C.1. The beginning of that section first runs the regression of intensity on size by industry. The regression results printed to the terminal are exactly those reported in Table C.1. The five panels of Figure C.1 are then exported to ./Empirical/Results/ with the names FigureC1_TopLeft.pdf, FigureC1_TopRight.pdf, FigureC1_MidLeft.pdf, FigureC1_MidRight.pdf and FigureC1_BotLeft.pdf. Notice that the bottom-right panel is simply Figure 1 in the main text. The code then repeats the exercise for the manufacturing sector as a whole. Results of the un-numbered regressions on Pages 7, 9 and 10 (the last one) are then printed to the terminal, with Figure C.2 exported to ./Empirical/Results/FigureC2.pdf.
The three un-numbered regressions on the intensity-size relationship by the type of equipment are estimated in Section 2 of Empirical_AEJ.R.
4. The three panels of Figure D.1 are produced by Section 4 of Empirical_Appendix.R. The figures are exported as ./Empirical/Results/FigureD1_TopLeft.pdf, FigureD1_TopRight.pdf and FigureD1_Bot.pdf.
5. Figure D.2 is generated by the last part of misc.m. This requires the results from the benchmark case, ./Quantitative/Results/benchmark_new.mat. The figure is saved as ./Quantitative/Results/FigureD2.eps.
6. Results in Appendix D.2 are computed by Section 3 of Empirical_Appendix.R. Specifically, the four panels of Figure D.3 are first plotted and saved as ./Empirical/Results/FigureD3_TopLeft.pdf, FigureD3_TopRight.pdf, FigureD3_BotLeft.pdf and FigureD3_BotRight.pdf. A series of results is then printed to the terminal: first the in-text numbers for the mean ratio of adoption cost to output for firms with output in each quintile (22%, 11%, 6.7%, 5.0% and 2.7%), second the coefficients of the un-numbered regression, and finally the results of Table D.3.
7. Panels in Figure E.1 are produced by Section 5 of Empirical_Appendix.R. The five panels are saved as ./Empirical/Results/FigureE1_TopLeft.pdf, FigureE1_TopRight.pdf, FigureE1_MidLeft.pdf, FigureE1_MidRight.pdf and FigureE1_BotLeft.pdf. As with Figure C.1, the bottom-right panel is simply the left panel of Figure 2 in the main text.
8. The accounting exercises in Appendix F are performed by Accounting.R. Variables pmed, preg and
ppar are respectively for Rows 1, 2 and 3 in Table F.1.
9. Finally, to reproduce Table J.1, one essentially follows the same logic as in the main text: execute twosectors_ces.m several times with different parameter values, print the raw numbers for the corresponding table entries and then manually compute the percentage changes in Table J.1. There are two general cases, CES = 1.5 and CES = 3.0. Three parameters are “CES-specific:” sigmaces, ke and varces. They are grouped together in the code, under the snippets marked % CES = 1.5 and % CES = 3.0. The reader also needs to modify tauzd and tauzc for the different scenarios. As before, all the statements have been coded, and the reader only needs to comment and un-comment the relevant snippets of the code. Likewise, the instruction that saves the results at the end of the file needs to be changed. One complication is that the reader needs to set the initial guess for the market clearing prices. For each case, the initial guesses are stored in the code snippets at the beginning of the section Solve the equilibrium wage. The program is reasonably robust to different initial guesses. However, if readers do not use the exact initial guesses we use, there may be very minor numerical differences in some of the variables. Similarly, if the reader uses another version of MATLAB (we use 2016a) or Octave, such numerical differences will probably show up as well.
(a) By default, twosectors_ces.m is set up to produce the benchmark results with CES = 1.5. Executing the file prints some information to the terminal; again, please refer to the log file quantitative_ces_log.pdf for details. The equilibrium results are then saved under ./Quantitative/Results/benchmark_ces.mat. This case is called the benchmark.
(b) Set tauzd to zero and leave tauzc untouched, then change the initial guesses and the export destination. The results for eliminating only the distortions in the polluting sector are saved under ./Quantitative/Results/nodirty_ces.mat. This is Case (ii) in Appendix J.
(c) Set both tauzd and tauzc to zero, then change the initial guesses and the export destination. The results for the no distortion case are saved under ./Quantitative/Results/notax_ces.mat. This case is labeled as Case (i) in Appendix J.
10. With all the *.mat files computed, compute_tables_ces.m computes, as before, the raw data used in calculating the entries in Table J.1. The results printed to the terminal can then be used to compute the percentage changes between equilibria in Table J.1. In particular, with the three cases at hand, the upper panel of Table J.1 can be reproduced.
11. Repeating Steps 9 and 10 with the parameters set to CES = 3.0 then produces the lower panel of Table J.1, which completes the replication of all the results in our paper.
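Since the percentage changes between equilibria are computed by hand from the raw numbers printed to the terminal, a short helper can reduce transcription errors. The following is a minimal sketch in Python, not part of the replication package. It assumes the change is measured relative to the benchmark value. The variable names and numbers in the example are made-up placeholders, not results from the paper.

```python
def percent_change(benchmark, counterfactual):
    """Percentage change from the benchmark to the counterfactual value."""
    return 100.0 * (counterfactual - benchmark) / benchmark


def percent_changes(benchmark_values, counterfactual_values):
    """Apply percent_change to every entry shared by the two dictionaries."""
    return {name: percent_change(benchmark_values[name],
                                 counterfactual_values[name])
            for name in benchmark_values
            if name in counterfactual_values}


# Placeholder numbers for illustration only; they are NOT from the paper.
raw_benchmark = {"output": 80.0, "emission": 200.0}
raw_counterfactual = {"output": 100.0, "emission": 150.0}
print(percent_changes(raw_benchmark, raw_counterfactual))
# {'output': 25.0, 'emission': -25.0}
```

The reader would replace the placeholder dictionaries with the raw entries printed by compute_tables.m or compute_tables_ces.m for the benchmark and the relevant case.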
Right before our final submission, we did a final test run of all the programs. We saved all the log files of this particular run in .pdf format. The correspondence between the log files and the source code is as follows. The original log files are saved as text files with the .log extension and the same names as the source files. For instance, executing the source code cpsc_raw.do generates a text log file cpsc_raw.log, which we converted to cpsc_raw_log.pdf manually.
The log files for the quantitative exercises are not generated automatically. We manually copied and pasted the intermediate results sent to the terminal into two text files: quantitative_main.log for the perfect substitution case in the main text and quantitative_ces.log for the constant elasticity of substitution case in the appendix. We then converted the two files to .pdf format.