
Review of Accounting Studies

https://doi.org/10.1007/s11142-020-09563-8

Using machine learning to detect misstatements

Jeremy Bertomeu1 · Edwige Cheynel1 · Eric Floyd2 · Wenqiang Pan3

© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract
Machine learning offers empirical methods to sift through accounting datasets with a
large number of variables and limited a priori knowledge about functional forms. In
this study, we show that these methods help detect and interpret patterns present in
ongoing accounting misstatements. We use a wide set of variables from accounting,
capital markets, governance, and auditing datasets to detect material misstatements.
A primary insight of our analysis is that accounting variables, while they do not
detect misstatements well on their own, become important with suitable interactions
with audit and market variables. We also analyze differences between misstate-
ments and irregularities, compare algorithms, examine one-year- and two-year-ahead
predictions, and interpret groups at greater risk of misstatements.

Keywords Restatement · Manipulation · Earnings management ·
Machine learning · Data analytics · Regression tree · Misstatement · Irregularity ·
Fraud · Prediction · SEC · Enforcement · Gradient boosted regression tree ·
Data mining · Accounting · Detection · AAERs

JEL Classification C63 · D83 · G38 · K22 · K42 · M41

Machine learning is a broad discipline that has designed learning algorithms that
can drive cars, recognize spoken language, and discover hidden regularities in grow-
ing volumes of data. Archival financial research relies on data streams capturing
firm characteristics, governance attributes, audit reports, market data, and accounting

We gratefully thank B. Cadman, P. Dechow, C. Lennox, S.X. Li, D. Macciocchi, M. Plumlee, X.
Peng, and seminar participants at LSE, University of Utah, the USC-UCLA-UCSD-UCI conference,
MIT, and the CMU Accounting Mini Conference for valuable feedback. We also thank J. Engelberg
for the many suggestions that were central in seeding the project.
 Jeremy Bertomeu
bjeremy@wustl.edu

Extended author information available on the last page of the article.



variables. Machine learning algorithms detect complex patterns in this data, select the
best variables to explain an outcome variable, and uncover suitable combinations of
variables to make accurate out-of-sample predictions. These algorithms are a key to
unlocking large and growing financial data sources to make better predictions
and smarter decisions. This paper offers preliminary steps to applying this technol-
ogy in accounting. We motivate the method by answering a practical question: How
do we detect ongoing accounting misstatements?
We focus on restatements filed under Item 4.02(a), “Non-Reliance on Previously Issued Finan-
cial Statements or a Related Audit Report or Completed Interim Review,” which are,
in principle, restatements that materially affect the interpretation of accounting num-
bers by investors. These misstatements differ from irregularities because they need
not be fraudulent nor carry evidence of managerial intent. Still, the audit procedures
in place should have prevented these misstatements from occurring. The reasons for
such misstatements can be complex and related to a large set of factors. Relative to
a linear statistical model, machine learning algorithms are best suited for problems
in which the set of variables, their interactions, and the mapping into outcomes are
not theoretically obvious. For our research question, many accounting numbers could
serve as input variables to aid in detecting misstatements. An increase in accruals,
for example, may either reflect growth or a firm overstating its earnings. Many other
characteristics, such as industry, changes in balance sheet accounts (including cash
or inventories), or special transactions, such as leases, do not have a clear theoretical
relationship with misstatements.
Intuitively, a machine learning algorithm can be taught to replicate the process
of a researcher seeking a good empirical model for the relation between outcomes
and input variables. This search involves many moving parts, such as finding the
right specification and interactions between variables, which variables to select, and
when to stop searching. The relation between the dependent and independent vari-
ables need not be monotonic, and the interactions between variables are a priori
unknown. Machine learning algorithms capture the relationship between misstate-
ments and financial data, enabling auditors, managers, regulators and investors to
better evaluate the predictors of misstatements, thereby offering a starting point for
taking preventive actions.
For our baseline analysis, we use the gradient boosted regression tree (GBRT)
developed by Friedman (2001) and Friedman et al. (2001). This algorithm is a
regression method that explains the probabilities of misstatements; it has the further
conceptual advantage of relating to traditional regression methods, such as ordinary
least squares or logit, which allows us to compare its performance to traditional
research designs in capital markets. We also examine a number of alternative machine
learning approaches drawn from the wider literature in computer science. We find
that several other methods perform well at detecting misstatements. Through vari-
ous performance metrics, we illustrate how the choice of algorithm may depend on
the intended use of the output of the model, between a focus on estimating a poste-
rior expectation with a regression method (e.g., a conditional probability of having
a misstatement) and a focus on classifying observations into misstatement and non-
misstatement firms. By design, GBRT appears to perform best on a likelihood-based
measure of performance (the pseudo R²); that is, in the sense of explaining posterior
expectations about the occurrence of misstatements. However, classification algorithms
perform better at separating misstatement and nonmisstatement firms if a large
enough fraction of firms can be inspected.
The top seven variables identified by the baseline model with the most influ-
ence on detecting misstatements are % of soft assets, bid-ask spread, non-audit fees
divided by total fees, a dummy for qualified opinion over internal controls, changes
in operating lease activity, short interest, and stock return volatility. Surprisingly,
while governance, audit, and market variables separately have the largest influence
on the detection of misstatements, the cumulative importance of accounting variables
is larger than that of any other group. But a model using only accounting variables performs notably
worse than a model using only market-based or audit-based variables. Hence the
algorithm suggests that accounting is an important factor when considered with other
variables, especially audit variables that capture uncertainties and risks in accounting
numbers.
The potential complementarities between accounting and other variables are intu-
itive: auditing variables can capture whether the auditing profession has the ability to
discipline firms when it comes to reporting reliable information. Accounting infor-
mation is incomplete either because accounting rules set boundaries on what can or
cannot be included as an asset or liability, or because managers choose accounting estimates
about their receivables, inventory write-downs, valuation of securities, impairments
of long-term assets, etc. Market variables can partly fill the gap and account for inter-
nally developed intangibles and contingencies not present in the balance sheets as
well as events that affect the company in a more timely fashion.
To gain more intuition on the interactions between these prominent variables, we
use the algorithm InTrees developed by Deng (2018). InTrees extracts simplified
rules based on a set of thresholds implied by the GBRT ensemble learning model.
We vary the proportion of the sample that can be selected by the rule for inspection
and the number of variables that can be used. If the fraction of the population inspected is low, the
rule prescribes selecting firms with qualified opinions because the event is rare and
meaningful. If the fraction inspected is above 5%, large non-audit fees indicate
high risks of misstatements. Accounting and market variables in combination with
the audit variables help refine the search.
Machine learning has attracted considerable interest in social sciences and is unde-
niably enriching the analysis of data. Although the basic principles of data reduction
are not new in accounting research, machine learning offers new ways to find patterns
in accounting numbers and help regulators monitor reporting practices. A study by
Dechow et al. (2011) is an archetype of research in this area. It develops a prediction
model that outputs a scaled logistic probability of accounting irregularities for each
firm-year using financial statement variables. Irregularities, captured as Accounting
and Auditing Enforcement Releases (AAERs) issued by the SEC, are detected with a
summary F-score measure obtained from a logistic regression. Machine learning can
be viewed as a generalization of this approach, which embeds the model selection
problem in a data-mining algorithm.
Naturally, other studies have examined measures that predict misstatements or
irregularities, such as nonfinancial measures, corporate governance, incentive com-
pensation, and audit variables. Beneish (1999) develops an alternative scoring
mechanism to detect overstatements of earnings using financial statement ratios.
Several studies focus specifically on the incremental predictive power of various vari-
ables suggested by theory, in particular, deferred tax liabilities (Ettredge et al. 2008),
audit effort (Lobo and Zhao 2013) and accounting quality (Hribar et al. 2014). Like
our work, this research aims to uncover fundamental patterns that help detect and
stop ongoing misstatements.
At the intersection of accounting and computer science, there have been several
recent studies using machine learning in the context of SEC enforcement releases. To
our knowledge, the first studies in accounting are by Perols (2011) and Perols et al.
(2016), which use machine learning to predict Accounting and Auditing Enforcement
Releases. A recent study by Bao et al. (2020) further extends this methodology by
using a wider set of ratios and variables and comparing various machine learning
methods.
This literature is also part of the effort in computer science to develop models of
financial irregularities, as initiated by the early studies by Green and Choi (1997)
and Fanning and Cogger (1998) and recently by Lin et al. (2003), Ragothaman and
Lavin (2008), Whiting et al. (2012), Gupta and Gill (2012), Abbasi et al. (2012),
and Kim et al. (2016). To our knowledge, while most studies in this area focus on
irregularities noted by the SEC (AAERs), the study closest to ours is by Dutta et al.
(2017). They also incorporate misstatements and AAERs using a two-step proce-
dure where they remove less significant and redundant explanatory variables before
running the various machine learning algorithms. Our baseline method is based on
gradient-boosted regression trees that combine multiple decision trees. In addition to
their variables, we also incorporate directly into the algorithms corporate governance
data, information asymmetry measures, and auditing variables.
The growth of research using machine learning indicates the importance of the
approach. We hope to contribute to this growing literature along three important
dimensions. First, we focus on material non-reliance restatements under Item 4.02(a),
which are important events both to an audit committee and to investors concerned
about overall accounting quality when numbers may be corrected on an ex-post basis.
Thus our research question is not restricted to frauds but also comprises material mis-
statements inasmuch as their occurrence reduces timeliness and transparency. From a
policy perspective, corrections are often caused by low quality estimates: our results
also motivate a greater adoption of machine learning tools to verify and supplement
managerial estimates (Ding et al. 2020).
Second, the scope of our research question is broader because material misstate-
ments are more prevalent than irregularities, which provides additional motivation
to use machine learning methods that require larger samples. Restatements are, of
course, not entirely free from the detection problem and also feature an imbal-
anced distribution of occurrence versus nonoccurrence observations; however, there
are empirically about 10 times more material restatements than irregularities (3,599
selected restatement firm-years over 2001-2014).
Third, our approach enables us to exploit the richness in restatement data that
would be difficult to replicate with SEC enforcement actions. The Audit Analytics
database reports per-period restatement effects on earnings since 2001, which allows
us to quantify the magnitude and timing of restatements. We keep this information as
model validation, as we fit the occurrence of misstatements but not their magnitude;
thus we can examine whether misstatements detected by machine learning are larger
or detected earlier than with other models. To connect to the growing research on
irregularities, we also examine whether the detected restatements appear to link to
irregularities. In particular, because of the larger number of misstatements, a model
trained with misstatements is about as good at detecting AAERs as a model trained
with AAERs and considerably better at detecting joint occurrences of misstatements
and AAERs.
The remainder of the paper is organized as follows. Section 1 briefly introduces
GBRT. Section 2 describes the data and research design. Section 3 discusses the most
important variables used in the model. Section 4 presents comparisons with linear
models and classification models. Section 5 studies and compares implications of
machine learning algorithms on AAERs and misstatements. Section 6 performs sev-
eral additional analyses and applies a method to better interpret the results provided
by GBRT.

1 Overview of machine learning

We use the gradient boosted regression tree (GBRT) developed by Friedman (2001)
for our main analysis. Reviews of applications of the method are given by Guelman (2012),
Zhang and Haghani (2015), and Kleinberg et al. (2017). We begin with a nontechnical
overview of the main building blocks of this algorithm.
As a preliminary step in most machine learning exercises, the sample is divided
into two subsamples: the training and validation sample, which serves to fit the
model, and the test sample, which serves to evaluate the predictive ability of the
model.
GBRT has two components that can be described independently. The regression
tree is a model in which the set of predictive variables is sequentially partitioned into
regions, where each region provides one average predicted number for the outcome
variable. The predictive variables are first split into two regions to maximize pre-
dictive accuracy. In the next steps, a region is then split again, optimally to improve
predictive power. One advantage of a regression tree over, say, a linear regression is
that the tree will sequentially pick the variables that best explain an outcome vari-
able. The most important variables will typically be the initial splits in the tree, and
the following splits will build interactions with other variables. The approach also
implements a variable selection procedure in each tree, with benefits similar to dif-
ferent regularization techniques, because the set of variables used in the regression
model is controlled by the number of splits.
A second part of GBRT is the novel aspect introduced by Friedman (2001): gra-
dient boosting. Intuitively, boosting generalizes supervised learning with a single
tree to iteratively fit multiple trees. The idea for boosting is to compute the residu-
als from the tree and apply another tree on the residuals to improve the accuracy of
the overall model. The residuals are then re-partitioned over a potentially different
set of variables and interactions. To gain some intuition, the first tree might capture
the first-order economic factor behind an outcome variable but miss on second or
third-order factors; the second and third trees (and onward) will capture these factors
better than the first tree by focusing on what remains unexplained by the first tree.
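To make the residual-fitting loop concrete, the following minimal Python sketch implements boosting for a squared-error loss (the paper's setting instead uses a likelihood-based loss for misstatement probabilities; function and variable names are illustrative, not the authors' code):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_trees(X, y, n_trees=100, depth=4, shrinkage=0.01):
    """Minimal gradient boosting with squared-error loss: each new tree
    is fit to the residuals left unexplained by the ensemble so far."""
    pred = np.full(len(y), float(np.mean(y)))  # start from the unconditional mean
    trees = []
    for _ in range(n_trees):
        residuals = y - pred                      # what the ensemble still misses
        tree = DecisionTreeRegressor(max_depth=depth).fit(X, residuals)
        pred += shrinkage * tree.predict(X)       # add a damped correction
        trees.append(tree)
    return trees, pred
```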
When implementing the regression tree, GBRT requires two parameters: the num-
ber of splits (depth of tree) and the fraction of the sample to be used in fitting the tree
(bagging). Boosting has two additional parameters: the number of trees to use and the
shrinkage parameter, which is defined as the percentage of the prediction of a new
tree to incorporate into the model. A higher value of these parameters increases the
fit of the model but can lead to overfitting, which occurs when the machine learning
model explains noise in the training sample, instead of the underlying relationship
that is generalizable to the test sample. Friedman (2001) recommends using weak
learners, defined as trees with four to eight splits, bagging between 50% and 80% of
observations and a shrinkage parameter no greater than 10%. The preferred number
of trees can depend heavily on the nature of the data. In summary, GBRT depends on
four fitting parameters: tree depth, bagging, number of trees, and shrinkage.
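As an illustration, the four parameters map directly onto arguments of scikit-learn's GradientBoostingClassifier; the values below are placeholders within the recommended ranges, not the settings selected in this paper:

```python
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(
    max_depth=6,         # tree depth: splits per weak learner
    subsample=0.7,       # bagging: fraction of observations drawn for each tree
    n_estimators=1000,   # number of trees
    learning_rate=0.05,  # shrinkage: weight on each new tree's prediction
)
```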
With a set of independent observations, a common procedure to tune these param-
eters is to proceed by k-fold cross-validation: partition the training sample randomly into k
subsamples (folds) and fit the model by excluding one of the folds, which is then
used to evaluate performance. The researcher repeats this procedure, rotating the
excluded fold and selecting parameters maximizing average performance. However,
our dataset is a panel, and therefore rotating the folds would imply using data known
from future events to detect current misstatements.1 So, for the purpose of the panel,
we fit the model using years 2001-2009 and evaluate the performance in a validation
sample composed of years 2010-2011. We search these parameters within the range
suggested by Friedman (2001), varying bagging over 50%, 60%, 70%, and 80%,
shrinkage over 1%, 5%, and 10%, and tree depth from 1 to 12, and increasing the number
of trees until the pseudo R² is maximized or reaches a plateau. The algorithm
is explained in more detail in Appendix B.
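A minimal sketch of this panel-aware grid search, assuming arrays X, y, and years are in memory; candidates are scored with a McFadden-style pseudo R², and, for brevity, the number of trees is held fixed rather than optimized as in the paper:

```python
import itertools
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss

def pseudo_r2(y_true, p_hat):
    """McFadden pseudo R^2: 1 - loglik(model) / loglik(null)."""
    ll_model = -log_loss(y_true, p_hat, normalize=False)
    ll_null = -log_loss(y_true, np.full(len(y_true), y_true.mean()),
                        normalize=False)
    return 1.0 - ll_model / ll_null

train = years <= 2009                      # fit on 2001-2009
valid = (years >= 2010) & (years <= 2011)  # evaluate on 2010-2011

best_score, best_params = -np.inf, None
for bag, shrink, depth in itertools.product([0.5, 0.6, 0.7, 0.8],
                                            [0.01, 0.05, 0.10],
                                            range(1, 13)):
    m = GradientBoostingClassifier(subsample=bag, learning_rate=shrink,
                                   max_depth=depth, n_estimators=2000)
    m.fit(X[train], y[train])
    score = pseudo_r2(y[valid], m.predict_proba(X[valid])[:, 1])
    if score > best_score:
        best_score, best_params = score, (bag, shrink, depth)
```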
We focus on GBRT as a baseline because it is closest in spirit to a logistic regres-
sion, to the extent that it directly yields a probability as the output. By analogy to a
logit, we first report a purely statistical measure of fit based on a pseudo R² in a test
sample. We then compare the area under the receiver operating characteristic (ROC)
curve, which plots the rate of true misstatements identified by the algorithm against
the rate of firms inspected with no misstatement (false positives). The area under this
curve, denoted AUC, is a commonly used summary of the performance of the algorithm.
In essence, the AUC is equivalent to the probability that a randomly chosen true positive
is ranked higher than a randomly chosen true negative (Fawcett 2006).
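The rank interpretation of the AUC can be checked directly: the brute-force computation below (quadratic in sample size, so purely illustrative) matches scikit-learn's roc_auc_score, with ties credited one half:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_by_ranks(y, p):
    """Probability that a random true positive outranks a random true negative."""
    pos, neg = p[y == 1], p[y == 0]
    wins = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return wins + 0.5 * ties

# For any binary labels y and scores p: auc_by_ranks(y, p) == roc_auc_score(y, p)
```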

2 Data and research design

We use the Audit Analytics Non-Reliance Restatement database updated on January 15,
2019. Non-reliance restatements refer to restatements that undermine the previous or

1 Another possible solution is to fit the model using a rolling window or exclude observations from firms
used to build the model. However, both of these choices severely restrict the sample effectively used for
cross-validation.
current financial statements or both, due to material accounting misstatements, and
must be restated in an 8-K item 4.02 filing.2 Non-reliance misstatements exclude
nonmaterial errors, such as out-of-period adjustments, and revision restatements,
such as voluntary or mandatory changes in accounting standards. The restatement
dataset covers all SEC registrants who have disclosed a financial statement restate-
ment in electronic filings since 1 January 2001. Audit Analytics tracks the restated
years corresponding to each filing. Our original sample consists of 23,760 unique
restatement files.
The sample selection is outlined in Table 1. We drop restatements for firms that
are not traded after the end of 2007 on one of the three main exchanges or can-
not be matched to accounting data from Compustat. We require CRSP share code
= 10, 11, 12. We drop restatement filings with a missing restated start- or end-year
and restatements tied to SAB 108 and FIN 48, which are one-time regulations that
caused an unusual number of restatements around 2006: SAB 108 is a correction to
balance sheet discrepancies emerging from prior restatements, and FIN 48 is a new
standard requiring retrospective recognition of tax risks. We define material restate-
ments as those that satisfy at least one of the following conditions: (1) has an income
effect greater than 1% (relative to average total assets), (2) has a three-day cumula-
tive return around the restatement announcement date that is less than -10%, or (3) is
investigated by the SEC or other regulators.
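As a sketch, the three materiality screens could be coded against a table of restatement filings as follows; the pandas column names are hypothetical stand-ins for the Audit Analytics and CRSP fields described above:

```python
import pandas as pd

# df: one row per restatement filing (hypothetical column names)
material = (
    (df["income_effect"].abs() / df["avg_total_assets"] > 0.01)  # condition (1)
    | (df["car_3day_announcement"] < -0.10)                      # condition (2)
    | (df["sec_investigation"] == 1)                             # condition (3)
)
material_restatements = df[material]
```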
Since our database starts in 2001 and ends in 2014, we keep firm-years between
2001 and 2014, which also provides time for the restatement to be filed (Chef-
fers et al. 2010; Lobo and Zhao 2013). In our sample, about 88.46% (38.44%) of
misstatements were filed within two years after the end (start) of the event.
Our research design uses financial ratios as predictor variables, which often
require two consecutive years of nonmissing asset data in Compustat. Therefore firm-
years without a positive total asset value in the prior year are dropped. We also require
firms to have at least a one-year return history in CRSP. We remove firm-years that
include a misstatement and an announcement of restatement within the same year.
After imposing these selection criteria, our final sample consists of 3,599 restatement
firm-years and 54,354 Compustat firm-years.
Table 2 reports several summary statistics. Panel A of Table 2 describes firm-
year characteristics. Total assets, market value, and book value (as originally issued)
are lower during misstated firm years. Panel B of Table 2 describes the cumulative
effect of a restatement event on accounting numbers as well as the restated amount
for each year impacted by the restatement provided by Audit Analytics.3 Income

2 In a previous version of the manuscript, we focused on all restatements, including restatements not
reported in 8-K; however, many of these restatements need not include large events. We thank Andy
Imdieke for this suggestion.
3 Audit Analytics provides the restated amount for each year only for the five most recent years impacted by
the restatement. However, firms’ restatements can often impact more than five years of financial data. The
impact on accounting numbers prior to the most recent five years is usually reported as a cumulative charge
to retained earnings, and, in practice, firms need not retrospectively adjust all prior years. To account for
this, we assume that the cumulative effect to retained earnings is distributed evenly across the misstatement
span identified in the restatement filing. If the span is missing, we allocate the unexplained cumulative
change to the year prior to the last year with an income effect.
effects of restatements are summarized in Table 2. We winsorize the distribution of
income effects at the 1% and 99% levels. Of the restatements with a negative income
effect, the average effect is a decrease in net income by $15.2 million or 2.23% of
average assets. For restatements with a positive income effect, the corresponding
average is an increase in net income by $9.0 million or 1.93% of average assets.
Overall the average across the full sample is a loss of $7.4 million corresponding to
0.96% of average assets. We keep both positive and negative income effects in the
sample since our purpose is to detect misstatements, which may occur either during
an overstatement or when the overstatement reverses.
Table 3 shows the number of restated firms over time. The percentage of restated
firm-years varies between 3.3% and 12.0%. Table 4 provides some additional infor-
mation about restated years across industries (panel A) and as a function of market
size (panel B). In panel A of Table 4, we document that three industries (retail, com-
puters, and services) are disproportionately represented in the misstatement sample,
relative to their share of the total population. Lastly, panel B of Table 4 reports a
peak in restatements for intermediate-size firms, suggesting a nonlinear relationship
between size and misstatement propensity.
As in a logit, a regression model such as GBRT yields estimated probabilities of a
misstatement for a given firm-year in the sample. However, unlike a linear regression,
GBRT will search through a large class of possible models, which could overfit the
sample if, say, the number of trees or the depth of the trees is set too high. To tune the
parameters in GBRT (number of trees, tree depth, learning and bagging), we separate
the data into a training sample (years 2001-2009), a validation sample (2010-2011),
and a test sample (2012-2014), as indicated in Fig. 1. The training sample is used to
build the model for any given set of parameters, and its performance is assessed on
the validation sample. We choose the parameters that yield the highest pseudo R².
The prediction model is then estimated on the validation and training sample (2001-
2011). The performance of this chosen model is then analyzed on the test sample
(2012-2014).

Fig. 1 Schematic for data flow and use: the full dataset of 54,354 firm-years (2001-2014) is split into a training sample of 35,148 firm-years (2001-2009), a validation sample of 7,883 firm-years (2010-2011) used to select the optimal parameters, and a test sample of 11,323 firm-years (2012-2014) on which the results of our analysis are evaluated

Note that this approach can be applied to any prediction model. For example,
Dechow et al. (2011) estimate an additive logistic class of models using the train-
ing sample to reduce the set of predictive variables. They proceed by eliminating
variables from a full (kitchen-sink) model, or vice-versa, adding variables sequen-
tially from a sparse model. Relative to a class of logistic models, machine learning
organizes this process of selection over a larger set of variables. The regression
tree component of GBRT also allows for a search over non-additive effects and
interactions, which are excluded by assumption in a logistic regression.
We build an extensive dataset of over 100 potential predictor variables obtained
from public records, as defined in Table 5. We include financial variables (Dechow
et al. 2011), audit variables (Frankel et al. 2002; DeFond et al. 2002; Johnson et al.
2002; Romanus et al. 2008), credit rating variables (Avramov et al. 2009), opinion
divergence variables (Garfinkel 2009), and corporate governance variables (Larcker
et al. 2007). We use SIC codes to identify 15 major industries. We also include the
auditor opinion, an indicator variable for the existence of a management forecast,
analyst consensus forecast, short interest, and indicator variables for foreign firms
and current or past restatement announcements. In the baseline model, we use the
lag one-year value for the mean-adjusted absolute value of DD and studentized DD
(Dechow and Dichev 2002) residuals to avoid using information known in the future
in the prediction model. Each of the continuous variables is winsorized at the 1% and
99% levels. We set missing values to zero and, for any variable with more than a
10% missing rate, include an indicator variable equal to one when the value is missing.
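A minimal pandas sketch of this preprocessing step, not the authors' code:

```python
import pandas as pd

def preprocess(df, continuous_predictors):
    """Winsorize at the 1% and 99% levels; encode missing values as zero,
    adding an indicator when a variable is missing more than 10% of the time."""
    out = df.copy()
    for col in continuous_predictors:
        lo, hi = out[col].quantile([0.01, 0.99])
        out[col] = out[col].clip(lower=lo, upper=hi)  # winsorize (NaNs pass through)
        if out[col].isna().mean() > 0.10:
            out[col + "_missing"] = out[col].isna().astype(int)
        out[col] = out[col].fillna(0)
    return out
```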
Several studies on AAERs use machine learning tools, such as ensemble learn-
ing, to classify a subset of firm-years into frauds, implying that the output is in terms
of a binary classification (Perols et al. 2016; Bao et al. 2020). These methods dif-
fer from regression methods in that the algorithm is optimized to correctly classify
irregularities (AAERs) in datasets constructed with a balanced number of misstating
and non-misstating firms. Balancing is critical in a classification algorithm because
the preferred classification with heavily imbalanced data will involve classifying
all observations into one group. We compare GBRT to other common methods in
machine learning, including classification methods such as RUSBoost.4 Regression
methods, such as logistic regressions or GBRT, fit the likelihood of misstatements
and typically do not require balancing.
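For intuition, the balancing step that gives RUSBoost its name can be sketched as below; RUSBoost proper re-draws a balanced sample inside each boosting iteration, so this shows only the undersampling component (names illustrative):

```python
import numpy as np

def undersample(X, y, seed=0):
    """Random undersampling: keep all misstatement firm-years and draw an
    equal number of non-misstatement firm-years at random."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    neg = rng.choice(np.flatnonzero(y == 0), size=len(pos), replace=False)
    idx = np.concatenate([pos, neg])
    return X[idx], y[idx]
```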
We extend the evaluation procedure to include economically interpretable evalu-
ations, which are meant to capture an upper bound on what a regulator might detect
under ideal conditions. To be specific, we assume that the SEC inspects a fraction

4 RUSBoost refers to random undersampling, such that balanced samples are constructed by randomly
drawing from the sample. With a heavily imbalanced dataset, however, nonrandom undersampling may
perform better than random sampling. In untabulated results, we used the sampling method of Perols et al.
(2016), but it did not perform better than RUSBoost in our dataset. Under this alternate sampling method,
the AUC is 69.4%, and the detection rate of restatements is 60.0% and of AAERs is 81.1%. This method
ranks better than logistic models but slightly worse than GBRT and random undersampling in Tables 10
and 15. An important difference is that there are more material misstatements than AAERs, so the benefits
of nonrandom sampling to alleviate imbalance are more muted.
of all the firms and can perfectly detect an inspected misstating firm. We vary the
probability of inspection to examine which algorithm performs best and then further
evaluate the model based on its ability to detect large versus small misstatements,
detect misstatements early, and detect misstatements that ultimately lead to an SEC
enforcement action. We vary the rate of inspection from 1%, consistent with the
rate of AAERs, to 10%, consistent with the rate of SEC investigations measured in
Blackburne et al. (2020).

Fig. 2 Pseudo R² across tuning parameters: average validation pseudo R² by tree depth (shrinkage = 1%, bagging = 70%), by shrinkage (bagging = 70%, tree depth = 9), and by bagging fraction (shrinkage = 1%, tree depth = 9)

3 Detection model

As a preliminary to our analysis, Fig. 2 documents the results of our validation to
select the four parameters. We select the optimal combination of parameters from three
shrinkage × four bagging × 12 tree-depth values, or 144 combinations, and, for each com-
bination, optimize the number of trees. From this validation, we select a tree depth of
nine, a number of trees equal to 6,376, a shrinkage of 1%, and bagging of 70%.
The algorithm measures the predictive power of a variable, as in the case of a
logistic regression. Each split in a regression tree increases the log likelihood of the
regression tree model. Summing the log-likelihood increases that are due to a given
variable across all trees yields the influence of that variable. Table 7 reports the influence
of predictors with importance higher than 1.5%.5 Although we use 102 predictors in
total, the top 10 (20) predictors account for 26.27% (47.42%) of the total influence
of our variables. Many of the variables used in the F-score model of Dechow et al.
(2011) are prominent in Table 7. The top seven variables identified by GBRT are % of
soft assets, bid-ask spread, non-audit fees divided by total fees, a dummy for qualified
opinion over internal controls, changes in operating lease activity, short interest, and
stock return volatility.6

5 We report the summary statistics of the important predictors in Table 6.

Fig. 3 Cumulative importance per type of variable (accounting, governance, audit, market, and business environment)
Of interest, accrual-based metrics commonly used in accounting research do not
seem to be the leading variables selected by the algorithm. Only working capital
accruals are listed in the top 20 most important variables, and the remaining accrual
measures are in the top 40 most important variables. However, a closer inspection
reveals a more nuanced answer. Double-entry bookkeeping links multiple accounting variables,
so that one accounting variable may not carry large significance on its own, but
combining it with other accounting variables might help predict misstatements. Similarly,
summary measures, such as accruals, contain information already captured by other
variables.
To address the total importance of all accounting variables, we classify pre-
dictors into five groups: accounting (i.e., constructed from accounting numbers),
governance, audit, market information (including analysts and credit ratings), and
business environment (i.e., all other non-accounting variables including industry
fixed effects).7 We then total the influence by group in Fig. 3. Surprisingly, while
governance, audit, and market variables individually have the largest influence, the
cumulative effect of accounting variables is larger than that of any other group, attaining
36.2% of total importance. Overall, accounting variables remain the most important
source of information to detect misstatements.
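A sketch of this grouping, assuming a fitted scikit-learn model and a hypothetical dict `group` mapping each predictor name to one of the five categories; note that scikit-learn's importances are impurity-based rather than the log-likelihood gains defined above, and the half-weighting of hybrid variables described in footnote 7 is ignored for simplicity:

```python
import pandas as pd

importances = pd.Series(model.feature_importances_, index=predictor_names)
by_group = importances.groupby(lambda name: group[name]).sum()
print(by_group.sort_values(ascending=False))  # cumulative importance per category
```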

6 As in any multivariate descriptive analysis with multiple correlated variables, interpretation requires some
caution since the method may select one variable over another for reasons that relate primarily to the fitting
procedure. Later on, we list the set of important variables in other methods and observe that, while many
variables are common to multiple algorithms, there are also some differences.
7 For variables combining market and accounting information, such as book-to-market and earnings-to-
price, we allocate half weight to each category.



Accounting may also be important because of its standalone role in detecting
misstatements or because it helps build decision trees involving other types of vari-
ables. To assess these interactions, in Table 8 we re-estimate the model using only
accounting variables versus other combinations of variables. Accounting variables
alone achieve a catch rate of restatements of 41.6%, slightly lower than governance
(42.5%) and audit variables (48.4%) taken in isolation. Audit
variables by themselves are the leading group, which is consistent with the central
role of audit procedures to certify the reliability of the data. The value of accounting
variables mainly emerges from their relation with other variables, and cumulatively they
form the most important group in the full model.
To explain this feature of the analysis, note that many other sources of information
likely complement accounting information (Cheynel and Levine 2020). Internally
developed intangible assets and contingencies (e.g., pending lawsuits) are typically
not present in accounting information but are reflected in market variables. Further,
financial analysts and informed traders tend to convey private information through
their reports and trading that can contribute to predicting misstatements beyond
accounting variables. Audit and governance variables also interact with accounting
information, consistent with enforcement playing a major role in interpreting and
guaranteeing the quality of accounting numbers.
In Table 9, we compare the means of the most important variables for firms with
restatements to firms without restatements. The difference between the means of the
two samples indicates that the variable predicts misstatements in univariate analy-
ses. If the means are similar, the test is not significant and suggests a nonmonotonic
relationship between that variable and predicted restatements. All but one (bid-ask
spread) of the 10 most important variables return significant differences in means.
We can also interpret why these variables may predict misstatements in univari-
ate tests. As observed by Barton and Simko (2002), firms with a high proportion of
soft assets feature a lower likelihood of misstatements, which one could interpret as
greater discretion allowed under GAAP.8 Higher non-audit fees may indicate greater
complexities in the business and lower independence (Kornish and Levine 2004),
both of which predict misstatements. Qualified audit opinions can indicate disagree-
ments with the auditor and often precede restatement announcements. Operating
leases keep obligations off balance sheets, such that large changes in these accounts
may indicate liquidity constraints. Higher leverage in misstating firms further signals
liquidity constraints. The short interest captures short-sellers’ anticipation of mis-
statements that are likely to occur with higher levels of uncertainty measured by stock
return volatility. Consistent with models of audit risk (Laux and Newman 2010), we
expect auditors’ pricing to be associated with a client’s risk of misstatement.
Lastly, the bid-ask spread is the second most important variable, but it is not
significantly different across the two samples. The bid-ask spread is driven by infor-
mation asymmetries across investors (Glosten and Milgrom 1985), which not only
stem from misstatement risk but from many other sources of private information. If

8 The theoretical model of Bertomeu and Marinovic (2015) also predicts this relation, as firms that
endogenously retain more soft assets tend to be more credible.


private information reduces market liquidity, this may reduce the sensitivity of price
to accounting reports, which in turn can reduce the benefit of accounting discretion
(Samuels et al. 2018).

4 Model comparisons

We next examine the performance of the model and compare GBRT with the back-
ward logistic model of Dechow et al. (2011) and other machine learning models.9 All
results are evaluated on the test dataset, which has 11,323 firm-years (4,278 unique
firms) and 428 restatement firm-years (255 unique firms).
The AUC, defined as the area under the receiver operating characteristic (ROC)
curve, is the probability that a classifier ranks a randomly chosen occurrence of mis-
statement higher than a randomly chosen occurrence of no-misstatement. We plot the
ROC curve in Fig. 4, which shows the fraction of true positives caught for a given
fraction of false positives.

Fig. 4 Catch rate as a function of the rate of false positives (ROC curve) for GBRT, Random Forest, RUSBoost, and the backward logit
In Table 10, RUSBoost performs better than GBRT under the AUC criterion,
and both perform better than the logistic model. In the ROC curve, the difference
is greatest for intermediate levels of false positives. This result is intuitive because
classification methods, such as RUSBoost, maximize the fit in constructed balanced
samples and will work better when evaluated in imbalanced samples if the rate of

9 We only document the results with the backward logistic model because the forward logistic and simple
logistic models exhibit the same results. Backward and forward logistic models are much more sparse;
that is, they use fewer variables than GBRT and simple logistic models. However, they do not appear
to perform better than a simple logistic model. This finding suggests that complex interactions in the
entire population of potential predictors capture misstatements.
false positives is high. GBRT appears to perform well on a likelihood-based measure
of performance (the pseudo R²); that is, in the sense of explaining posterior expectations
about the occurrence of misstatements.10
The ROC is difficult to interpret in heavily imbalanced problems because such
problems do not have an intuitive target rate of false positives. To address this, we
report an alternative performance metric, which can be computed for a given fraction
of firms inspected. The Fβ-score (Rijsbergen and Joost 2004) weights the fraction
of detected misstatements to total inspected firm-years (i.e., precision) and the fraction
of detected misstatements to total misstatements (i.e., recall). With β > 0, the Fβ-score
is defined by

Fβ = (1 + β²) × (precision × recall) / (β² × precision + recall).
We report the three most common Fβ-scores in Fig. 5: F1, defined as equal weight on
precision and recall; F0.5, defined as higher weight on precision (i.e., inspections are

10 We report in Table 10 bootstrapped standard errors, retraining and testing the model 200 times on ran-
domly drawn datasets. Differences between the performance of most models tend to be greater than two
standard errors, indicating these differences are significant. In untabulated analyses, we bootstrapped
differences between model performance and confirm that differences between models are significantly
different from zero at conventional levels.

very costly); and F2, defined as higher weight on recall (i.e., undetected misstatements
are very costly). The x-axis in Fig. 5 represents the inspection rate (i.e., the fraction
of the population inspected).

Fig. 5 F0.5, F1, and F2 scores as a function of the inspection rate for GBRT, Random Forest, RUSBoost, and the backward logit
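For a given inspection rate, the Fβ-score follows directly from the definition above; a sketch with illustrative names:

```python
import numpy as np

def f_beta_at_rate(y, p, rate, beta):
    """Inspect the top `rate` fraction of firm-years ranked by predicted
    probability p and compute the resulting F-beta score."""
    k = int(np.ceil(rate * len(p)))
    inspected = np.argsort(p)[::-1][:k]   # highest-risk firm-years first
    caught = y[inspected].sum()
    precision = caught / k                # detected / inspected
    recall = caught / y.sum()             # detected / total misstatements
    if precision + recall == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
```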
Across all specifications, GBRT and RUSBoost outperform logit models. The F0.5
score is maximized by Random Forest at an inspection rate of about 2%, primar-
ily due to the difficulty of detecting misstatements accurately. The other algorithms
exhibit very similar patterns. Likewise, all algorithms follow the same trends for the
F1 score. The F1 score peaks at low levels of inspection rates between 2% and 3%,
and the curve flattens over inspection rates from 3% to 15%. The F2 score, which
assigns greater weight to recall, prescribes higher levels of inspections. Classification
methods, such as Random Forest and RUSBoost, perform better and suggest optimal
inspection rates at 20%.
We consider a third criterion based on inspection rates implied by SEC enforce-
ment actions (AAERs). We match the 3,599 misstated firm years in our full sample
of 54,354 firm years to AAERs and obtain 385 misstated firm years connected to
an AAER (see Table 14 for details). Hence the SEC inspected at least 385 out of
54,354, which is about 0.7% of the sample if we assume a perfect detection model
and no more than 385/3, 599 = 10.7% if we assume an entirely random inspec-
tion. In a recent study, Blackburne et al. (2020) obtain the list of all targets of a
formal SEC review from a Freedom of Information Act request. Their evidence
points to 10% of firms being reviewed, and, given that not all reviews may find a
misstatement, the effective rate of inspections should be somewhere between 1%
and 10%.
We select the top 1% through the top 10% of predicted misstatement probabilities
of firm-years in a given year and report catch rates in Table 11. For nearly the entire
range, RUSBoost has the highest catch rates, ranging from 6.3% to 20.8% for inspec-
tion rates below 4% and steadily performs better with catch rates between 23.8% and
36.4% for inspection rates between 5% and 10%. In general, Random Forest performs
second best, followed closely by GBRT. The logistic model exhibits similar catch
rates to machine learning algorithms for inspection rates lower than 4%. However,
the gap widens for high inspection rates. Using an inspection rate of 10% as a refer-
ence, machine learning algorithms can catch a significant share of misstatements:
between 33.4% and 36.4%.
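A sketch of this catch-rate computation, selecting the top predicted probabilities within each year (hypothetical column names):

```python
import pandas as pd

def catch_rate(df, rate):
    """Share of misstatement firm-years caught when the top `rate` fraction of
    predicted probabilities is inspected within each year; df has hypothetical
    columns year, p_hat, and misstated."""
    def top_slice(g):
        return g.nlargest(int(round(rate * len(g))), "p_hat")
    inspected = df.groupby("year", group_keys=False).apply(top_slice)
    return inspected["misstated"].sum() / df["misstated"].sum()
```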
The metrics reported until this point assume low inspection rates. Alternatively,
we consider performance metrics based on an interpretation of the SEC mandated
reviews. Officially, the SEC must review financial statements once every three years
(every year for large filers), which translates to inspecting roughly one-third of
the sample each year. Under the assumption of ideal inspections, we would be
able to detect misstatements in the top third of the firms most likely to feature a
misstatement.
Panel A of Table 12 displays the results for the different methods when we classify
the top-third-predicted probability of firm-years within a given year as “inspected”
firms and count the number of actual misstated years that are detected between
models. Out of 428 restatement firm-years, GBRT detects 64.3% of restatements, a
nontrivial improvement of about 16.1% over backward logistic regressions.11 RUS-
Boost achieves the highest detection rate at 72.2%.12
Table 12 further evaluates the performance of GBRT and other models on two
other criteria: whether it predicts larger misstatements (panel C) and whether it
catches misstatement strings earlier (panel D). As noted in panel A of Table 12,
GBRT and RUSBoost capture more misstatements, but panel C demonstrates that the
misstatements are smaller than those detected by a logistic regression. Therefore the
incremental misstatements detected by GBRT and RUSBoost, relative to a logistic
regression, tend to be smaller by about $1 million, suggesting decreasing returns to
more sophisticated methods for data analysis.
Panel D of Table 12 evaluates the performance of the models at catching misstate-
ments as early as possible. First, we compute the number of years since the start of a
string of misstatements and when the misstatements are detected by the model. Sec-
ond, we assess whether the models detect the misstatements closer to the restatement
filing date. We document mixed results on these metrics: machine learning algo-
rithms detect misstatements later, relative to the start of the restatement string, but
earlier, relative to the restatement filing date. Composition effects may be at play: the
smaller incremental misstatements detected by the algorithms are likely harder to detect.
In addition, algorithms are likely to be better at detecting patterns preceding a restate-
ment announcement (for example, changes in audit or governance variables) rather
than only the misstatement proper.
The three primary methods implemented up to this point (GBRT, RUSBoost,
and Random Forest) combine multiple trees to improve the performance of the
model. Next, we compare the performance of these algorithms to several clas-
sic methods used in machine learning in Table 13: k-nearest neighbors, linear
discriminant analysis (LDA), naive Bayes, support vector machines (SVM), and
classification trees (CT). We tune key parameters of these models using the same
procedure as in the baseline models, implying (i) k = 26 for the number of neigh-
bors in k-nearest neighbors, (ii) γ = 0.01 and c = 50 for SVM, and (iii) 187
splits for the classification tree. Overall, the results confirm that methods com-
bining trees, while more computing-intensive, perform slightly better. Interestingly,
the performance of most of the methods noted above falls between that of the logistic
regression and our baseline methods, with k-nearest neighbors and SVM working
best.

11 In untabulated results, we also estimate the model by separating the restatement sample into positive and
negative period income effects, under the conjecture that positive effects may reflect reversals or incen-
tives to influence the stock price downward. See Kasznik (1999) for an extensive continuing literature. We
divide restatements into three categories: negative income effects (overstatement), zero income effects,
and positive income effects (understatement). We then build three models and predict the probability of
overstatement, understatement, and a zero income effect separately. We do not find any notable improve-
ment to predictive power in the test sample, likely because these alternative methods reduce the size of the
dataset used to estimate the model.
12 In Panel B of Table 12, we obtain similar results after excluding firms with restatements in the training
sample. Machine learning algorithms continue to perform better, compared to the logistic model, but
feature lower catch rates.

5 SEC enforcement actions

5.1 Detecting AAERs

We match detected misstatements to AAERs because presumably misstatements tied
to an AAER are likely the most severe (Perols et al. 2016; Bao et al. 2020). The
Berkeley Haas Center for Financial Reporting and Management dataset consists of
AAERs between May 17, 1982, and Sept. 30, 2016. We manually expand this data
to Jan. 15, 2019, following the steps in Dechow et al. (2011). Panel A of Table 14
presents the sample selection of AAERs. Compared to a final sample of 3,599 restate-
ment firm-years, the number of AAER firm-years is small: about 385 firm-years or
10.7% of restatement firm-years. This small percentage is not surprising because the
role of the SEC is partly dissuasive: by catching a fraction of misstatements, the
SEC enforcement aims to reduce incentives to misstate without incurring excessive
enforcement costs. Panel B of Table 14 reports the income effect of AAER firm-
years. Compared to restatement firm-years in Table 2, most AAER firm-years relate
to overstatements and have a much larger income effect.
In Table 15, we report the detection rates of AAER firm-years in our test data. Our
test data set has 37 AAER firm-years, 8.64% out of 428 restatement firm-years. The
partial overlap between AAERs and restatements means that the information con-
veyed by AAERs differs somewhat from the information conveyed by misstatements.
The overall detection rates of AAER firm-years are higher than those of restate-
ment firm-years. GBRT continues to perform better than the logistic specification at
detecting misstatements tied to AAERs. Both RUSBoost and Random Forest catch a
few more AAERs than GBRT, and machine learning methods outperform the logistic
model.13

5.2 Reconciliation to AAER irregularities

We provide further analyses to reconcile misstatements and irregularities described
in AAER releases. Research has found that a number of accounting variables appear
to be important for predicting AAERs. In Table 16, we compare the top 10 most
important variables across the three baseline models and between restatements and
AAERs.
Variables in italics are in the list of at least two out of three models. For restatement
data, over half of the variables are important in at least two models, with % soft assets
and bid-ask spread being important in all three. Similar patterns can be seen
in the AAER dataset, where % soft assets and non-audit fees are important variables
in all models. Variables in bold indicate (within the same model) that the variable
is important for both misstatements (panel A) and AAERs (panel B). Many of the
variables important for misstatements are also important for AAERs. For GBRT, six

13 In untabulated analyses, we compute the number of caught misstatements at least a year before the
AAERs. Out of 29 misstatements caught by GBRT in the test sample, they relate to 20 AAER filings,
and all of them are detected at least a year (often more than a year) before the AAER is filed.

out of the 10 most important variables in the restatement sample are important in the
AAER sample as well. For Random Forest (RUSBoost), four (five) out of 10 variables
are in both lists. We also observe that variables that tend to be used across methods
also tend to be used across panels A and B. Overall, six variables appear to be impor-
tant across methods and for both AAERs and misstatements: % soft assets, bid-ask
spread, non-audit fees to total fees, short interest, stock returns, and percent of total
audit fees.
If misstatements caught by the model are of a more serious nature, they may help
predict AAERs. To test this, we consider whether misstatements help predict AAERs.
We also ask whether the restatement model predicts AAERs better than a model
trained only with AAERs. On the one hand, there are more restatements than AAERs,
so restatement data can help train the model. On the other hand, if misstatements
are largely disconnected from irregularities causing AAERs, we should expect data
using misstatements to do poorly, relative to using AAERs.
Panel A in Table 17 reveals that the misstatement model does about the same at
predicting AAERs as a model using AAERs only (“AAER model”). For regression
methods, catch rates are slightly higher with the AAER model, but for classifi-
cation methods, catch rates are slightly higher with the misstatement model. This
finding suggests that the machine learning approach applied to misstatements also
captures more serious irregularities. However, the algorithms may also capture the
SEC inspection model, given that the SEC is more likely to check for irregularities
in firms that restate, or vice-versa, given that restatements may have been triggered
by SEC inquiries.
In panel B of Table 17, we consider catch rates of AAERs occurring jointly with
a misstatement. These can be interpreted as more serious irregularities, requiring a
retrospective change in financial statements.14 Panel B shows that the misstatement
model does reasonably well at catching AAERs connected to misstatements: in fact,
it does better than the AAER model across all specifications.
In summary, while most misstatements are not irregularities, the results suggest
that severe misstatements are easier to detect than errors; therefore the prediction of
the misstatement model has some ability to yield catch rates in line with a model
estimated from irregularity data. Within the narrower question of misstatement-
AAER pairs, the misstatement model performs better than the AAER model.

6 Further analyses

6.1 Predicting versus detecting misstatements

In our baseline model, we use GBRT to detect ongoing misstatements before they are
announced, presumably because ongoing misstatements produce a track of evidence

14 We still estimate the models as in panel A using the entire population of misstatements and AAERs. One
alternative would have been to estimate a model using only AAER-misstatement pairs as irregularities.
However, the number of observations here becomes too small to build a model with reasonable out-of-
sample performance.

that will affect accounting numbers and other variables. However, a misstatement
may also be predicted because of persistent risk factors that would cause certain
classes of businesses to be more prone to misstatements, regardless of whether a
misstatement is actually occurring.
To assess predictive ability over longer horizons, we train and estimate the GBRT
model using lagged variables (t = −1) to predict misstatements in the next year.
The variables are used before the occurrence of the misstatement year. In Fig. 6,
we calculate the probability of misstatement as assessed by the model for firms that
misstated versus firms that did not misstate. A large difference between the probabil-
ities implies that the model assigns a large revision of beliefs about the occurrence
of misstatements in the next year, suggesting that the model captures misstatement
risk factors. We repeat this exercise using two-year lagged variables (t = −2) and
compare these results to our baseline (t = 0).

Fig. 6 Probability of misstatements at t = −2, t = −1, and t = 0 for firms with and without restatements
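Constructing the lagged predictor sets amounts to shifting each firm's history within the panel, as in this sketch (hypothetical column names):

```python
import pandas as pd

# panel: one row per firm-year with hypothetical columns firm_id and year
panel = panel.sort_values(["firm_id", "year"])
lag1 = panel.groupby("firm_id")[predictors].shift(1)  # inputs for the t = -1 model
lag2 = panel.groupby("firm_id")[predictors].shift(2)  # inputs for the t = -2 model
```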
As reported in Table 18, the model performs best at assessing misstatements using
current explanatory variables. We find that the model can somewhat predict a mis-
statement in the next year but that this predictive power is lower for horizons of two
years or more. Hence, while contemporaneous risk factors may indeed explain the
ability of the model to predict misstatements, persistent risk factors do not appear to
do well at predicting misstatements. We also note that the predictive power one
year ahead appears to be driven by the fact that many misstatement strings span two
years, implying that the model may be detecting a current misstatement and using it
to predict the next misstatement.15
We also examine in Table 19 the nature of the variables that help predict mis-
statements one or two years ahead. For both predictions, short interest becomes

15 In untabulated results, we find very low predictive ability when we predict the first misstatement year.

the most important variable, indicating that markets serve an additional role in antic-
ipating future misstatements. As is also intuitive, qualified audit opinions speak to
current misstatements, not to the risk of future misstatements, and are no longer
one of the most important variables (even though audit risk factors, such as high
non-audit fees and audit fees, remain important predictors). Leverage and financ-
ing cash flows are important to detect current misstatements but are less critical for
predicting misstatements. Working capital accruals are important in both detection and one-
year-ahead prediction but not in two-year-ahead prediction, which we interpret as
reversals of accruals indicating continuing misstatements across two years. Lastly,
with a two-year horizon, we start seeing a greater influence from industry character-
istics, indicating businesses likely to be more at risk of misstating due to the nature
of their operations.

6.2 Interpretation and high-risk regions

Machine learning algorithms exhibit a known trade-off between ease of interpretation
and performance. A logistic regression, for example, features a closed-form
relationship between independent variables and the probability of misstatements. By
contrast, ensemble learning methods, such as GBRT, combine thousands of trees,
implying a black box between the model prediction and the variables. To speak to
this limitation, we implement the algorithm InTrees by Deng (2018). InTrees extracts
tens of thousands of possible simplified rules based on a set of thresholds implied
by the ensemble learning model. In Table 20, we report rules obtained by applying
InTrees to GBRT.
InTrees generates rules characterized by the number of variables used, the fraction
of the population subject to the rule (the rule frequency), the predicted probability
of misstatement conditional on the rule, and the error rate of the rule. In our
research design, it is intuitive to focus on high-risk firms, that is, firms that
are likely to feature a high probability of misstatement, so we report rules with the
greatest predicted probability.
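As a minimal sketch of the underlying idea (ours, assuming a fitted scikit-learn tree ensemble rather than the R implementation of Deng (2018)), every root-to-leaf path of every tree yields a candidate rule, a conjunction of variable-threshold conditions whose frequency and misstatement rate can then be measured on the data:

```python
import numpy as np

def extract_rules(model, feature_names):
    """Enumerate every root-to-leaf path of every tree as a candidate rule;
    `model` is a fitted sklearn gradient-boosting model (hypothetical here)."""
    rules = []
    for est in model.estimators_.ravel():          # individual regression trees
        t = est.tree_
        def walk(node, conds):
            if t.children_left[node] == -1:        # leaf: store the path
                rules.append(conds)
                return
            name, thr = feature_names[t.feature[node]], t.threshold[node]
            walk(t.children_left[node], conds + [(name, "<=", thr)])
            walk(t.children_right[node], conds + [(name, ">", thr)])
        walk(0, [])
    return rules

def rule_stats(rule, X, y, feature_names):
    """Rule frequency and misstatement rate of the firm-years matching `rule`."""
    mask = np.ones(len(X), dtype=bool)
    for name, op, thr in rule:
        col = X[:, feature_names.index(name)]
        mask &= (col <= thr) if op == "<=" else (col > thr)
    return mask.mean(), (y[mask].mean() if mask.any() else np.nan)
```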
We build four regions, varying rule frequency from 1% to 20% (plus or minus
0.5%). For each region, we consider the length of variable-value pairs or thresholds
of three and five.16 We first run InTrees on the top 10 most important variables,
which facilitates the interpretation in terms of several key variables. While a com-
plete theoretical analysis of these rules would go beyond our current study, these
analyses suggest several important interactions. Except for one rule, accounting vari-
ables are always part of a rule with another variable indicating higher audit risks,
consistent with the complementarity between the information in accounting numbers
and audit characteristics. For a low rule frequency of 1%, rules select firms with

16 InTrees can imply redundant conditions if an inequality is repeated twice or is a subset of another

inequality. In these cases, we only report the stricter condition.



more hard assets and qualified opinions on controls, which may lead to situations in
which transactions relating to hard assets are incorrectly processed (incorrect capi-
talization, depreciation, inventory valuations, etc.). For rule frequencies above 5%, a
large proportion of non-audit fees indicates a high risk of misstatements. Large non-
audit fees might impede the auditors’ independence and compromise the integrity of
the financial information.
When considering all variables, audit variables remain very present regardless of
the rule frequency and, in combination with other variables, signal a misstatement.
For rule frequencies below 10%, governance variables help detect misstatements.
Accounting variables contribute to detecting misstatements for rule frequencies
above 5%. In nearly all regions, market variables, in combination with other
variables, gauge market uncertainty: the presence of short selling helps narrow
down the population more likely to misstate. Industry characteristics can refine
the search. For rule frequencies below 5%, the computer industry is prone to
misstatements; for high frequencies at 20%, the pharmaceutical industry is
excluded.

7 Conclusion

Machine learning methods in accounting are still in their infancy, and it remains an
open question how accounting researchers should use and think about what we learn
from these methods above and beyond detection. We present here some preliminary
evidence of the type of insights available from applying the method to a classic
question in accounting research: the detection of misstatements. But evaluating the
method in terms of detection alone is likely reductive. Being primarily a descriptive
tool for data exploration, machine learning can illuminate patterns in the data
without assuming a particular theory. Therefore, while atheoretical by design, it
can paradoxically guide researchers to key variables that should be part of
theoretical causal models.
The insights are likely to be of interest to different users of accounting information.
First, the SEC, as well as other regulatory bodies, might find immediate use for
a model that helps identify firms with accounting misstatements more accurately.
Machine learning tools are unlikely to replace other sources of information, such as
institutional industry knowledge, but they are inexpensive (i.e., they rely on public
sources of information and limited human expertise) and can provide early warning
signs. Second, academic research in accounting has been facing growing datasets,
and machine learning offers a methodology that provides new means to explore
complex patterns in the data. Third, although machine learning is becoming common
in practice, practitioners will hopefully find our results useful for analyzing data
for decision-making purposes.
Appendix A: Omitted Tables

Table 1 Sample selection

Sample selection Compustat firm-years Restatement firm-years Restatement filings

Original restatement filings sample 23,760


Merged to Compustat with nonmissing CIK 133,047 18,720
Main US exchange after December 2007 70,901 11,941
Remove missing start or end year 70,901 23,772 9,967
Remove SAB 108 and FIN 48 records 70,901 20,239 7,403
Remove nonmaterial restatements 70,901 9,278 2,971
Unique firm-years 70,901 8,147
Firm-years between fiscal year 2001-2014 70,901 5,679
Firm-years with two-year asset value 65,380 5,217
Firm-years with one-year return history 55,145 4,390
Remove firm-years including a misstatement
and an announcement of restatement
during the same year 54,354 3,599
Table 2 Key summary statistics

Panel A: Firm-years characteristics


Variable Misstatement firm-years No-Misstatement Firm-years Compustat Firm-years
Number 3,599 50,755 54,354
Total assets (in $ millions) 7,135 9,627 9,462

Market value (in $ millions) 3,117 4,126 4,059


Book value (in $ millions) 1,189 1,713 1,679

Panel B: Restatement income effects


Restatement income effect Freq Percent Average (in $ millions) Average (scaled by average assets)
Negative 2,201 61.2 -15.2 -2.23%
Zero 649 18.0 0 0
Positive 749 20.8 9.0 1.93%
Total 3,599 100 -7.4 -0.96%
Table 3 Distribution of restated firms by year

Fiscal year Number of restated firms Percentage Number of Compustat firms Percentage of Compustat firms

2001 392 10.9 3,435 11.4


2002 417 11.6 3,485 12.0
2003 414 11.5 3,565 11.6
2004 417 11.6 3,680 11.3
2005 340 9.5 3,886 8.7
2006 234 6.5 4,132 5.7
2007 202 5.6 4,388 4.6
2008 173 4.8 4,389 3.9
2009 191 5.3 4,188 4.6
2010 199 5.5 3,996 5.0
2011 192 5.3 3,887 4.9
2012 171 4.8 3,815 4.5
2013 133 3.7 3,731 3.6
2014 124 3.5 3,777 3.3
Total 3,599 100.0 54,354 6.6

Table 4 Frequency of firm-year restatements

Panel A: Frequency of firm-year restatements by industry


Industry Restatement firms Compustat population
Agriculture 0.2 0.2
Mining & Construction 3.6 3.1
Food & Tobacco 2.2 2.1
Textiles and Apparel 1.0 1.0
Lumber, Furniture, & Printing 1.1 2.4
Chemicals 1.4 2.3
Refining & Extractive 4.6 4.1
Durable Manufacturers 17.3 18.3
Computers 20.2 14.0
Transportation 5.1 5.3
Utilities 1.9 3.2
Retail 12.6 7.9
Services 12.3 8.5
Banks & Insurance 9.1 20.3
Pharmaceuticals 7.5 7.4
Total 100 100

Panel B: Frequency of firm-year restatements by size deciles


Decile rank of market value of Compustat population Frequency Percentage
1 313 8.7
2 304 8.5
3 343 9.5
4 370 10.3
5 385 10.7
6 437 12.1
7 433 12.0
8 425 11.8
9 313 8.7
10 276 7.7
Total 3,599 100.0
Table 5 Variable definitions

Variable Abbreviation Calculation

WC accruals WC acc [(Current Assets- Cash and Short-term Investments)-(Current Liabilities-Debt in


Current Liabilities-Taxes Payable)]/ Average Total Assets
RSST accruals rsst ac (WC+NCO+FIN)/Average Total Assets, where WC=(Current Assets-Cash and
Short-term Investments)-(Current Liabilities-Debt in Current Liabilities); NCO=(Total
Assets-Current Assets-Investments and Advances)-(Total Liabilities-Current Liabilities-
Long-term Debt); FIN=(Short-term Investments+Long-term Investments)-(Long-term
Debt+Debt in Current Liabilities+Preferred Stock)
Change in receivables ch rec Change in Accounts Receivable/Average Total Assets
Change in inventory ch inv Change in Inventory/Average Total Assets
% Soft assets soft assets (Total Assets-PP&E-Cash and Cash Equivalent)/Total Assets
Modified Jones model discretionary accruals da The modified Jones model discretionary accruals are estimated cross-sectionally each
year using all firm-year observations in the same two-digit SIC code: WC Accruals
= α + β(1/Beginning Assets) + γ(ΔSales − ΔReceivables)/Beginning Assets + ρ PPE/Beginning
Assets + ε. The residuals are used as the modified Jones model discretionary accruals.
Performance- matched discretionary accruals dadif The difference between the modified Jones discretionary accruals for firm i in year t
and the modified Jones discretionary accruals for the matched firm in year t, following
Kothari et al. (2005); each firm-year observation is matched with another firm from the
same two-digit SIC code and year with the closest return on assets.
Mean-adjusted absolute value of DD residuals resid The following regression is estimated for each two-digit SIC industry: ΔWC = b0 +
b1 CFOt−1 + b2 CFOt + b3 CFOt+1 + ε. The mean absolute value of the residual is
calculated for each industry and is then subtracted from the absolute value of each firm's
observed residual.
Studentized DD residuals sresid Scales each residual by its standard error from the industry-level regression
Change in cash sales ch cs Percentage change in cash sales (Sales − ΔAccounts Receivable)
Change in cash margin ch cm Percentage change in cash margin, where cash margin is measured as 1−[(Cost of Goods
Sold−ΔInventory+ΔAccounts Payable)/(Sales−ΔAccounts Receivable)]
Table 5 (continued)

Variable Abbreviation Calculation

Change in return on assets ch roa (Earningst /Average Total Assetst )- (Earningst−1 /Average Total Assetst−1 )
Change in free cash flows ch fcf Change in (Earnings − RSST Accruals)/Average Total Assets
Deferred tax expense tax Deferred tax expense for year t/ total assets for year t-1
Existence of operating leases leasedum An indicator variable coded 1 if future operating lease obligations are greater than zero
Change in operating leases activity oplease The change in the present value of future noncancelable operating lease obligations
deflated by average total assets
Expected return on pension plan assets pension Expected return on pension plan assets
Change in expected return on pension plan assets ch pension Change in the expected return on pension plan assets

Abnormal change in employees ch emp Percentage change in the number of employees-percentage change in assets
Abnormal change in order backlog ch backlog Percentage change in order backlog -percentage change in sales
Actual issuance issue An indicator variable coded 1 if the firm issues securities during year t
Ex ante financing need exfin An indicator variable coded 1 if (CFO − past three-year average capital expenditures)/
Current Assets < −0.5
Level of finance raised cff Financing Activities Net Cash Flow /Average total assets
Leverage leverage Long-term Debt/Total Assets
Market-adjusted stock return rett Annual buy-and-hold return inclusive of delisting returns minus the annual buy-and-hold
value-weighted market return
Lagged Market-adjusted stock return rett−1 Previous year’s annual buy-and-hold return inclusive of delisting returns minus the annual
buy-and-hold value-weighted market return
Book-to-market bm Equity/Market Value
Earnings-to-price ep Earnings/Market Value
Agriculture ind1 SIC 100-999
Mining and Construction ind2 SIC 1000-1299 and 1400-1999
Table 5 (continued)

Variable Abbreviation Calculation

Food and Tobacco ind3 SIC 2000-2141


Textiles and Apparel ind4 SIC 2200-2399
Lumber, Furniture, Printing ind5 SIC 2400-2796
Chemicals ind6 SIC 2800-2824 and 2840-2899
Refining and Extractive ind7 SIC 1300-1399 and 2900-2999
Durable Manufacturers ind8 SIC 3000-3569, 3580-3669, and 3680-3999
Computers ind9 SIC 3570-3579, 3670-3679, and 7370-7379
Transportation ind10 SIC 4000-4899
Utilities ind11 SIC 4900-4999
Retail ind12 SIC 5000-5999
Services ind13 SIC 7000-7369 and 7380-9999
Banks and Insurance ind14 SIC 6000-6999
Pharmaceuticals ind15 SIC 2830-2836 and 3829-3851
% Audit Committee Affiliated AC Affiliate The fraction of the audit committee that is comprised of affiliated (gray) directors
Audit Committee Chair Affiliated AC Chair Affiliate An indicator variable equal to one if the chairperson of the audit committee is affiliated
and zero otherwise
Number of Audit Committee meetings AC Meetings The number of audit committee meetings
Audit Committee size ACsize The number of directors serving on the audit committee
% Affiliated Own AffiliateOwn The fraction of outstanding shares held by the average affiliated director
% Affiliated Appointed Appointed Affiliate The fraction of affiliated directors who were appointed by existing insiders
% Outsiders Appointed Appointed Outside The fraction of outside directors who were appointed by existing insiders
% Board Inside Board Inside The fraction of board comprised of insider (executive) directors
Table 5 (continued)

Variable Abbreviation Calculation

Number of board meetings Board Meetings The number of board meetings


Board size Boardsize The number of directors serving on the board
% Busy Affiliate Busy Affiliate The fraction of affiliated directors who serve on four or more other boards
% Busy Insider Busy Insider The fraction of insider directors who serve on two or more other boards
% Busy Outsider Busy Outsider The fraction of outside directors who serve on four or more other boards
% Compensation Committee Affiliated CC Affiliate The fraction of the compensation committee that is comprised of affiliated directors
Compensation Committee Chair Affiliated CC Chair Affiliate An indicator variable equal to one if the chairperson of the compensation committee is
affiliated and zero otherwise

Number of Compensation Committee meetings CC Meetings The number of compensation committee meetings
Compensation Committee size CCsize The number of directors serving on the compensation committee
Insider Chairman Insider Chairman An indicator variable equal to one if an executive holds the position of chairperson of the
board and zero otherwise
Lead director Lead Director An indicator variable equal to one if there is a lead director on the board and zero
otherwise
% Outsiders own Outsider Own The fraction of outstanding shares held by the average outside director
Big 4 auditor big4 An indicator variable equal to one if the firm’s auditor is a Big 4 firm and zero otherwise
Non-audit fees / total fees feeratio Ratio of non-audit fees to total fees
Percentile rank of non-audit fees by auditor ranknon Percentile rank of non-audit fees by auditor
Percentile rank of audit fees by auditor rankaud Percentile rank of audit fees by auditor
Percentile rank of total fees by auditor ranktot Percentile rank of total fees by auditor
Log of non-audit fees non audit fees Log value of non-audit fees
Log of audit fees audit fees Log value of audit fees
Log of total fees total fee Log value of total audit fees
Table 5 (continued)

Variable Abbreviation Calculation

Firm age age Firm age since listing


Auditor tenure audit tenure Auditor tenure
Auditor change audit ch An indicator variable coded 1 if the firm changes auditor
Going-concern opinion going concern An indicator variable coded 1 if the firm gets a going-concern opinion from auditor
Qualified opinion auop An indicator variable coded 1 if the firm gets a qualified opinion from auditor
Qualified opinion (internal control) auopic An indicator variable coded 1 if the firm gets a qualified opinion in term of internal
control from auditor
Beat analyst forecast beat An indicator variable coded 1 if the firm’s earnings meet or beat the analyst forecast
Analyst forecast error af error Absolute value of difference between actual earnings and analyst forecast consensus,
deflated by stock price at the fiscal year-end
Analyst forecast dispersion disper Standard deviation of analyst forecast / mean of analyst forecast
Bid-ask spread spread Past 252 days average bid-ask spread
Stock return volatility vol Past 252 days stock return volatility
Short interest short Past 12 months average short interest percentage from Compustat
Management forecast mf An indicator variable coded 1 if the firm issues a management forecast
Foreign firm foreign An indicator variable coded 1 if the firm is a foreign firm
Announcement of restatement in the same year res an0 An indicator variable coded 1 if the firm announces a restatement
Announcement of restatement one year before res an1 An indicator variable coded 1 if the firm announces a restatement one year before
Announcement of restatement two years before res an2 An indicator variable coded 1 if the firm announces a restatement two years before
Announcement of restatement three years or more before res an3 An indicator variable coded 1 if the firm announces a restatement three years or more
before
Missing variable: board size missing Board An indicator variable coded 1 if missing board size number in Equilar
Missing variable: lag DD residuals missing DD An indicator variable coded 1 if missing lag DD residuals
Table 5 (continued)

Variable Abbreviation Calculation

Missing variable: discretionary accruals missing DA An indicator variable coded 1 if missing discretionary accruals

Missing variable: deferred tax expense missing TAX An indicator variable coded 1 if missing deferred tax expense
Missing variable: change in order backlog missing OB An indicator variable coded 1 if missing change in order backlog
Missing variable: pension return missing Pension An indicator variable coded 1 if missing pension return
Missing variable: analyst forecast missing AF An indicator variable coded 1 if missing analyst forecast
Missing variable: credit ratings missing Rating An indicator variable coded 1 if missing credit ratings
Missing variable: internal control opinion missing auopic An indicator variable coded 1 if missing opinion or unaudited internal control
Missing variable: going-concern opinion missing going-concern An indicator variable coded 1 if missing going concern opinion
Table 6 Summary statistics of predictors in Table 7

Variable Mean SD P1 P10 P25 P50 P75 P90 P99

Panel A: Accounting variables (including BM and EP)


% Soft assets 0.576 0.281 0 0.152 0.357 0.609 0.821 0.934 0.978
Change in operating leases activity 0.003 0.025 -0.09 -0.013 -0.002 0 0.005 0.021 0.124
Leverage 0.156 0.186 0 0 0.001 0.089 0.252 0.415 0.856
Level of finance raised 0.037 0.191 -0.312 -0.099 -0.043 0 0.049 0.201 1.026
WC accruals 0.002 0.06 -0.22 -0.052 -0.014 0 0.02 0.06 0.213
Book-to-market 0.669 0.614 0 0.139 0.29 0.52 0.84 1.295 3.755
Change in inventory 0.005 0.035 -0.128 -0.018 -0.001 0 0.009 0.035 0.154
Earnings-to-price -0.073 0.414 -2.961 -0.262 -0.033 0.038 0.066 0.096 0.264
Lag mean-adjusted absolute value of DD residuals -0.132 0.336 -1.968 -0.342 -0.134 -0.032 0 0.01 0.34
Change in cash margin -0.107 2.468 -14.116 -0.991 -0.212 -0.01 0.107 0.54 12.134
RSST accruals 0.014 0.162 -0.579 -0.135 -0.023 0 0.067 0.164 0.588
Change in cash sales 0.047 1.257 -6.573 -0.471 -0.088 0.051 0.203 0.536 6.652
Deferred tax expense 0 0.019 -0.091 -0.012 -0.002 0 0.004 0.015 0.069
Lag studentized DD residuals 0.003 0.399 -1.487 -0.313 -0.083 0 0.076 0.327 1.545
Change in receivables 0.014 0.059 -0.172 -0.038 -0.008 0.005 0.03 0.075 0.255
Change in return on assets -0.001 0.033 -0.135 -0.027 -0.007 0 0.006 0.024 0.139
Performance matched discretionary accruals -0.008 0.569 -2.58 -0.161 -0.041 0 0.039 0.153 2.456
Modified Jones model discretionary accruals 0.093 0.708 -1.365 -0.128 -0.021 0 0.056 0.212 3.818
Table 6 (continued)

Variable Mean SD P1 P10 P25 P50 P75 P90 P99

Panel B: Audit variables


Non audit fee / total fee 0.2 0.18 0 0 0.056 0.157 0.295 0.462 0.762
Qualified opinion (internal control) 0.03 0.17 0 0 0 0 0 0 1
Log of non audit fee 10.646 3.932 0 0 10.166 11.546 12.864 14.02 16.065
Percentile rank of audit fee by auditor 0.669 0.225 0 0.364 0.505 0.694 0.857 0.951 1
Percentile rank of non audit fee by auditor 0.627 0.266 0 0.2 0.458 0.67 0.847 0.947 1
Percentile rank of total fee by auditor 0.668 0.225 0 0.362 0.503 0.691 0.857 0.951 1
Log of total fee 13.526 1.983 0 11.884 12.649 13.622 14.548 15.482 17.18

Log of audit fee 13.264 1.964 0 11.599 12.361 13.374 14.31 15.204 16.828
Panel C: Market variables (excluding BM and EP)
Bid ask spread 0.011 0.017 0 0.001 0.001 0.004 0.014 0.031 0.095
Short interest 0.036 0.047 0 0 0.002 0.019 0.049 0.095 0.237
Stock return volatility 0.033 0.019 0.009 0.014 0.019 0.028 0.042 0.059 0.109
Lag one year return 0.089 0.574 -0.813 -0.454 -0.228 0 0.252 0.658 2.94
Return 0.068 0.567 -0.861 -0.484 -0.256 -0.023 0.246 0.64 2.836
Analyst forecast errors 0.096 0.63 0 0 0 0.001 0.006 0.036 3.661
Panel D: Business variables
Abnormal change in employees -0.042 0.278 -1.347 -0.272 -0.119 -0.025 0.058 0.2 0.888
Firm age 18.042 15.679 2 4 7 13 24 39 80
Panel E: Governance variables
% Outsiders own 0.023 0.057 0 0 0 0.004 0.016 0.055 0.368
% Outsiders Appointed 0.186 0.334 0 0 0 0 0.273 0.909 1

Table 7 Importance of predictors (importance higher than 1.5%)

Predictor Importance Cumulative importance

% Soft assets 3.25 3.25


Bid ask spread 2.91 6.17
Non-audit fee / total fee 2.76 8.93
Qualified opinion (internal control) 2.72 11.65
Change in operating lease activity 2.69 14.34
Short interest 2.57 16.91
Stock return volatility 2.40 19.31
Log of non-audit fee 2.35 21.66
Percentile rank of audit fee by auditor 2.30 23.97
Leverage 2.30 26.27
Level of finance raised 2.24 28.50
Abnormal change in employees 2.18 30.69
WC accruals 2.17 32.86
% Outsiders own 2.14 35.00
Book-to-market 2.13 37.13
Change in inventory 2.13 39.25
Lag one year return 2.10 41.35
Earnings-to-price 2.08 43.43
Return 2.01 45.44
Lag mean-adjusted absolute value of DD residuals 1.98 47.42
% Outsiders appointed 1.97 49.39
Analyst forecast errors 1.94 51.33
Change in cash margin 1.92 53.25
RSST accruals 1.92 55.17
Firm age 1.92 57.09
Change in cash sales 1.91 59.00
Deferred tax expense 1.90 60.90
Lag studentized DD residuals 1.83 62.74
Percentile rank of non audit fee by auditor 1.82 64.56
Percentile rank of total fee by auditor 1.79 66.35
Log of total fee 1.70 68.05
Log of audit fee 1.69 69.74
Change in receivables 1.69 71.43
Change in return on assets 1.60 73.03
Performance matched discretionary accruals 1.59 74.62
Modified Jones model discretionary accruals 1.55 76.17
Table 8 Models using subgroups of variables

Model R2 AUC Catch rate of restatement Catch Rate of AAER Importance of accounting variables

Business only 0.8% 55.5% 37.1% 56.8% 0%


Governance only 6.0% 57.2% 42.5% 35.1% 0%

Market only 5.0% 60.6% 46.5% 45.9% 0%


Audit only 8.5% 61.7% 48.4% 48.6% 0%
Accounting only 3.5% 58.3% 41.6% 59.5% 100 %
Accounting+ Business 4.7% 61.6% 45.1% 59.5% 82.8%
Accounting+ Governance 7.2% 62.0% 46.7% 45.9% 74.4%
Accounting + Market 5.3% 62.0% 46.7% 51.4% 55.2%
Accounting + Audit 10.1% 66.5% 53.7% 62.2% 68.6%
Full model 14.1% 72.8% 64.3% 78.4% 36.2%
Table 9 Difference of means test between restatement and nonrestatement firm-years

Res Non Res Res - Non Res


Predictors Mean Mean Diff in mean Two-tailed p-value

% Soft assets 55.2% 57.7% -2.5% 0.0001


Bid ask spread 1.2% 1.1% 0.0% 0.1904
Non-audit fee / total fee 25.3% 19.6% 5.7% 0.0001
Qualified opinion (internal control) 11.3% 2.4% 8.9% 0.0001
Change in operating lease activity 0.6% 0.3% 0.3% 0.0001
Short interest 3.7% 3.5% 0.2% 0.0280
Stock return volatility 3.8% 3.3% 0.5% 0.0001
Log of non audit fee 11.1 10.6 0.5 0.0001
Percentile rank of audit fee by auditor 69.8% 66.7% 3.1% 0.0001
Leverage 16.9% 15.5% 1.3% 0.0001
Table 10 Fit and accuracy of the different models

Model GBRT Random forest RUSBoost Backward logistic

Panel A: Pseudo R 2 and AUC in full test dataset (11,323 firm-years), bootstrap standard errors in parentheses
Pseudo R 2 14.1% 12.6% n.a.a 10.8%
s.e. (0.50%) (0.47%) (0.43%)
AUC 72.8% 76.1% 76.3% 67.1%

s.e. (0.70%) (0.58%) (0.73%) (0.59%)


Panel B: Pseudo R 2 and AUC excluding firms with restatements in the training sample (9,298 firm-years), bootstrap standard errors in parentheses
Pseudo R 2 21.9% 12.9% n.a. 21.0%
s.e. (0.67%) (0.83%) (0.84%)
AUC 67.9% 70.0% 70.4% 67.3%
s.e. (0.91%) (0.69%) (1.07%) (0.92%)

a Boosted trees do not have a simple expression to map scores to probabilities, so we do not report a Pseudo R 2 for RUSBoost. As further analyses, we mapped the RUSBoost
score to a Pseudo R 2 by running a spline regression of restatement occurrence on scores and setting the minimum (maximum) probability equal to 0.01 (0.99). We set those
values to obtain an upper bound on the implied pseudo R 2 as they appear to maximize R 2 out-of-sample. This approach returns a pseudo R 2 of 7.1%, lower than GBRT.
This mapping confirms that RUSBoost, which is a classification algorithm not meant to estimate conditional expectations, does worse than a regression model in fitting
expectations.

Table 11 Catch rate when using small bandwidths

Inspection rate GBRT Random forest RUSBoost Backward logistic

1% 7.9% 8.4% 6.3% 7.0%


2% 11.9% 14.0% 13.6% 12.9%
3% 17.1% 17.5% 17.8% 15.4%
4% 21.7% 20.6% 20.8% 18.2%
5% 22.9% 23.1% 23.8% 19.9%
6% 25.7% 25.9% 26.6% 22.0%
7% 27.1% 28.3% 28.0% 23.8%
8% 29.7% 30.1% 30.8% 25.7%
9% 32.7% 31.5% 33.2% 26.2%
10% 33.4% 35.3% 36.4% 27.8%
Table 12 Performance of the models for the top 1/3 predicted probabilities of firm-years

Model GBRT Random forest RUSBoost Backward Logistic

Panel A: Detection rate for the top 1/3 predicted probabilities of firm-years
Catch rate 64.3% 70.8% 72.2% 55.4%
Catch number 275 303 309 237
Panel B: Detection rate for the top 1/3 predicted probabilities of firm-years excluding firms with restatement in the training dataset
Catch rate 55.8% 61.0% 61.6% 54.7%
Catch number 96 105 106 94
Panel C: Restatement income effect for the top 1/3 predicted probabilities of firm-years
Number of unique firm-years 98 126 132 60
Average absolute income effect (million) 8.5 12.3 8.9 9.8
Average scaled absolute income effect 1.22% 1.24% 1.19% 1.99%
Number of negative firm-years 58 79 76 32
Average income effect (million) −10.7 −15.5 −10.8 −13.8
Average scaled income effect −1.45% −0.92% −0.89% −2.50%
Number of positive firm-years 19 27 29 12
Average income effect (million) 11.0 12.2 12.1 12.6
Average scaled income effect 1.85% 3.08% 3.11% 3.28%
Panel D: Average detection times relative to start year and filing year for the top 1/3 predicted probabilities of firm-years
Nb. Mean Nb. Mean Nb. Mean Nb. Mean
Relative to filing year 224 1.88 239 1.92 241 1.97 197 1.77
Relative to starting year 224 1.74 239 1.67 241 1.63 197 1.57

Table 13 Performance of alternative methods

Model AUC R2 Catch restatements Catch AAERs

k-nearest neighbors 70.1% n.a. 57.5 % 81.1%


Linear discriminant analysis 66.3% 10.3% 55.4% 62.2%
Naive Bayes 66.3% n.a. 54.2% 67.6%
Support vector machines 68.0% 7.9% 58.9% 86.5%
Classification trees 63.0% n.a. 45.1% 51.4%

Table 14 AAERs sample selection and description

Panel A: AAERs sample selection


Number of AAERs Firm-years Number
All AAERs firm-years from 2001-2014 865
Less: restatement sample selection filters (373)
Less: are not in restatement files (107)
Total 385

Panel B: Income effects of AAERs


Income effect Freq Percent Average Average (scaled
(in million) by average asset)
Negative 302 78.4 −42.2 −2.48%
Zero 36 9.4 0 0
Positive 47 12.2 22.4 1.49%
Total 385 100 −30.4 −1.76%

Table 15 AAERs catch rates on test dataset (11,323 firm years)

Model Catch Percentage Total AAER Firm-years

GBRT 29 78.4% 37
Random forest 35 94.6% 37
RUSBoost 33 89.2% 37
Backward logistic 23 62.2% 37

Table 16 Top 10 variables (bold: same variable across panels; italics: same variable in at least two models)

GBRT Random forest RUSBoost

Panel A: Restatement models


% Soft assets Bid-ask spread % Soft assets
Bid-ask spread Chg. in operating leases Return
Non-audit fee / total fee Non-audit fee / total fee Lag one year return
Qualified opinion (controls) % Soft assets Bid-ask spread
Chg. in operating leases Level of finance raised Auditor tenure
Short interest Lag one year return Book-to-market
Stock return volatility Stock return volatility Firm age
Log of non-audit fee Perc. rank of audit fee by auditor Level of finance raised
Perc. rank of audit fee by auditor Earnings-to-price Short interest
Leverage Perc. rank of total fee by auditor Change in receivables
Panel B: AAER models
% Soft assets Non-audit fee / total fee Auditor tenure
Non-audit fee / total fee % Soft assets Return
Return Perc. rank of audit fee by auditor Lag one year return
Log of non-audit fee Log of non audit fee % Soft assets
Perc. rank of audit fee by auditor Bid-ask spread Change in receivables
Short interest Perc. rank of total fee by auditor WC accruals
Deferred tax expense Log of total fee Log of non-audit fee
Bid-ask spread Short interest Chg. in cash sales
Lag one year return Perc. rank of non-audit fee by auditor Chg. in cash margin
Firm age Log of audit fee Chg. in operating leases

Table 17 Performance on catching AAERs

Model built based on: Model R2 AUC Catch rate of AAERs

Panel A: Performance on catching all AAER


GBRT 10.8% 75.7% 68.3%
AAERs Backward logistic 4.4% 65.9% 57.1%
RUSBoost 76.7% 73.0%
Random forest 12.4% 77.5% 71.4%
GBRT 71.4% 63.5%
Restatements Backward Logistic 66.5% 55.6%
RUSBoost 75.2% 74.6%
Random forest 78.8% 74.6%

Panel B: Performance on catching AAER-restatement intersection


GBRT 15.9% 75.2% 64.9%
AAERs Backward logistic 11.8% 67.8% 62.2%
RUSBoost 79.8% 83.8%
Random forest 15.1% 83.0% 81.1%
GBRT 81.1% 78.4%
Restatement Backward logistic 73.1% 62.2%
RUSBoost 83.8% 89.2%
Random forest 88.0% 94.6%

Table 18 Results from predicting ahead models

Model R2 AUC Catch rate of restatement Catch rate of AAER

Current year 14.8% 72.3% 63.5% 82.8%


One-year-ahead 11.7% 68.0% 56.9% 79.3%
Two-year-ahead 7.5% 59.9% 42.2% 48.3%

Table 19 Importance of predictors from predicting ahead model (top 15)

Predictor Importance Cumulative importance

Panel A: One-year ahead


Short interest 3.26 3.26
% Soft assets 3.25 6.51
Stock return volatility 3.24 9.74
Non-audit fee / total fee 3.09 12.84
Bid-ask spread 3.01 15.85
Change in operating leases activity 2.62 18.48
Lag one year return 2.56 21.04
Percentile rank of audit fee by auditor 2.36 23.41
Book-to-market 2.30 25.70
Log of non-audit fee 2.28 27.98
Return 2.23 30.21
WC accruals 2.20 32.41
% Outsiders own 2.19 34.60
Abnormal change in employees 2.17 36.76
Change in inventory 2.14 38.91

Panel B: Two-year ahead


Short interest 10.34 10.34
Stock return volatility 10.00 20.34
Non-audit fee / total fee 5.13 25.47
Industry: Banks & Insurance 4.16 29.62
Percentile rank of total fee by auditor 3.72 33.35
Bid-ask spread 3.54 36.89
Industry: Computers 3.21 40.09
Log of total fee 2.95 43.04
Log of non-audit fee 2.69 45.74
% Outsiders appointed 2.43 48.17
Change in operating leases activity 2.23 50.40
Missing or unaudited internal control 2.23 52.63
Percentile rank of audit fee by auditor 2.09 54.73
Industry: Retail 1.88 56.61
% Board Inside 1.78 58.39
Table 20 InTrees for top 10 and all variables

Max var Frequency of rule Variables Predicted Error rate

Top 10 variables
3 0.005 % soft assets≤0.513 & Qualified opinion (controls)= 1 & Leverage≤0.017 40.89% 24.17%
5 0.006 0.011<% soft assets≤0.602 & Bid-ask spread>0.005 & Qualified opinion (controls)=1 & Pct. rank of audit fee≤0.987 37.50% 23.44%
3 0.051 % soft assets>0.088 & Non-audit fee / total fee>0.323 & Chg. in operating lease>0.008 16.30% 13.64%
5 0.047 Non-audit fee / total fee>0.307 & Qualified opinion (controls)=0 & Chg. in operating lease>-0.018 & Short interest rate≤0 16.86% 14.02%
& Non audit fee>11.833
3 0.104 Non-audit fee / total fee>0.401 & Qualified opinion (controls)=0 & Stock return volatility>0.024 14.54% 12.43%
5 0.104 % soft assets≤0.886 & Non-audit fee / total fee>0.399 & Qualified opinion (controls)=0 & Chg. in operating lease≤0.033 11.11% 9.87%
& Stock return volatility≤0.053
3 0.199 Non-audit fee / total fee>0.249 & Chg. in operating lease≤0.008 & Stock return volatility>0.02 10.20% 9.16%
5 0.196 Non-audit fee / total fee>0.247 & Qualified opinion (controls)=0 & Chg. in operating lease≤0.032 & Stock return 10.93% 9.74%
volatility>0.02 & Pct. rank of audit fee>0.401
All variables
3 0.006 Earnings to price > -0.57 & Analyst forecast errors > 0.025 & Qualified opinion (controls) = 1 38.06% 23.57%
5 0.009 Short interest rate ≤0 & Qualified opinion (controls) = 0 & Total audit fees > 13.191 & % Board insiders > 0.087 & 39.36% 23.87%
Computers = 1
3 0.05 Operating lease = 1 & % appointed outsiders > 0.095 & Computers = 1 19.93% 15.96%
5 0.045 % soft assets ≤ 0.874 & Expected return on pension plan assets ≤ 0.065 & Pct. rank of audit fee > 0.603 & %
appointed outsiders > 0.095 & Comp. committee size ≤ 2
3 0.098 Stock return volatility > 0.023 & Pct. rank of audit fee > 0.663 & % appointed outsiders > 0.226 15.69% 13.23%
5 0.105 Short interest rate ≤ 0 & Credit rating ≤ 17.5 & Stock return volatility > 0.028 & Analyst forecast error ≤ 0.361 & Past 14.30% 12.26%
restatement announcement (2y) = 0
3 0.199 Financing cash flow > 0.006 & Earnings to Price ≤ 0.041 & Total audit fees ≤ 13.878 10.09% 9.07%
5 0.196 Change in cash sales ≤ 0.42 & Qualified opinion (controls) = 0 & Pct. Rank of audit fee > 0.473 & missing opinion 12.15% 10.67%
(controls) = 1 & Pharmaceuticals = 0

Appendix B: An illustration of GBRT


We now offer a more formal presentation of the main steps of the gradient-boosting
algorithm. Let us hold the depth of the tree as fixed and denote $\varphi(x; S)$ as a
mapping that yields a vector of predicted values for some variable to be predicted $z$,
given $x$, under the regression tree applied to sample $S = (z_i, x_i)_{i=1}^{N}$, where $(z_i)_{i=1}^{N}$
is a set of observations to be predicted. Note that we focus here on the first-order
aspects as we use it in this study; the reader should refer to Friedman (2001) for a
general description that would apply to a broader class of datasets.
Let us subsequently set a loss function $L(y, \hat{y})$ for the boosting algorithm, where $y$
is any variable that could be predicted and $\hat{y}$ is a prediction. For a typical application
with continuous variables, we could set a quadratic loss $L^q(y, \hat{y}) = (y - \hat{y})^2/2$;
for our purpose, given that we will predict probabilities, we will use a logistic loss
function $L(y, \hat{y}) = \log(1 + \exp(-2y\hat{y}))$. Note that, in the quadratic case, the residual
$y - \hat{y}$ is also given by the derivative in the second component of the loss function,
$L_2^q(y, \hat{y}) = y - \hat{y}$; gradient boosting builds on this intuition to compute a local
version of the residual for a nonquadratic loss function by computing $L_2(y, \hat{y}) =
2y/(1 + \exp(2y\hat{y}))$. Note here that a correct prediction $y = \hat{y}$ reduces $L_2$, as it
would in the case of a residual in a linear regression. So, for later use, let us interpret
$L_2(y, \hat{y})$ as a measure of the component of $y$ unexplained by $\hat{y}$.
In pseudo-code, the algorithm will operate as follows.
1. Initialize the analysis with a single prediction for the entire sample that minimizes the logistic loss function; that is,
$$\hat{y}_i^0 = \frac{1}{2}\log\frac{1 + \bar{y}}{1 - \bar{y}},$$
where $\bar{y}$ is the sample mean.
2. For each step $m \in [1, M]$, where $M$ is the number of trees:
2.1. Compute the residuals of the previous step and redefine this residual as the variable to be explained:
$$z_i^m = L_2(y_i, \hat{y}_i^{m-1}).$$
2.2. Fit a regression tree to $S^m = (z_i^m, x_i)_{i=1}^{N}$, where $S^m$ is a subset drawn randomly according to parameter (d). The prediction is updated to
$$\hat{y}_i^m = \hat{y}_i^{m-1} + \nu\,\varphi(x_i; S^m),$$
where $\nu$ is the speed-of-learning parameter (c).
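For concreteness, here is a minimal sketch of these steps in Python (ours, not the authors' code), applied to the 16-observation example of Table 21 developed below; it relies on scikit-learn's regression trees, which select the same cutoffs as the hand calculations that follow because each best split is unique here, and it uses the leaf-wise logistic update $\varphi$ presented later in this appendix with $\nu = 1$:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# The 16 firm-years of Table 21: misstatement coded +1/-1, with E/P and % soft assets.
y = np.array([-1, -1, -1, -1, 1, 1, -1, -1,
              1, 1, 1, -1, 1, 1, 1, 1], dtype=float)
ep = np.tile([0.03, 0.05, 0.07, 0.09], 4)
soft = np.repeat([0.2, 0.4, 0.6, 0.8], 4)
X = np.column_stack([ep, soft])

# Step 1: initialize at half the log-odds of the sample mean (0.125657 here).
ybar = y.mean()
yhat = np.full(16, 0.5 * np.log((1 + ybar) / (1 - ybar)))

for m in range(2):                       # M = 2 trees
    # Step 2.1: residuals L_2(y, yhat) under the logistic loss (column 4 when m = 0).
    z = 2 * y / (1 + np.exp(2 * y * yhat))
    # Step 2.2: fit a three-leaf regression tree to the residuals, then apply
    # the leaf-wise logistic update (nu = 1, i.e., no shrinkage).
    tree = DecisionTreeRegressor(max_leaf_nodes=3).fit(X, z)
    leaf = tree.apply(X)                 # region assigned to each observation
    for r in np.unique(leaf):
        idx = leaf == r
        yhat[idx] += z[idx].sum() / (np.abs(z[idx]) * (2 - np.abs(z[idx]))).sum()

print(np.round(1 / (1 + np.exp(-2 * yhat)), 3))   # probabilities of column (8)
```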
We illustrate next how gradient boosting trees work using a simple numerical
example, where we set the number of trees to M = 2 and the tree depth to 2. In
Table 21, we create 16 firm-year observations, coding a misstatement as 1 and non-
misstatements as −1 (column 1). Of course, this is only meant to carry the analysis
by paper and pencil and is in no way a reasonable sample size to conduct a machine
learning analysis. The dependent variable will be whether there is a misstatement,
and we use two explanatory variables: earnings-to-price ratio and % of soft assets
(columns 2 and 3).

Table 21 Example - sample 16 observations, nine of which have misstatements, with two observable
explanatory variables E/P and Soft

Misstatement E/P Soft $z_i^1$ $\hat{y}_i^1$ $p_i^1$ $\hat{y}_i^2$ $p_i^2$


(1) (2) (3) (4) (5) (6) (7) (8)

−1 0.03 0.2 −1.125 −1.017 0.116 −0.968 0.126


−1 0.05 0.2 −1.125 −1.017 0.116 −0.968 0.126
−1 0.07 0.2 −1.125 −1.017 0.116 −0.968 0.126
−1 0.09 0.2 −1.125 −1.017 0.116 −1.942 0.020
1 0.03 0.4 0.875 1.015 0.884 1.063 0.893
1 0.05 0.4 0.875 1.015 0.884 1.063 0.893
−1 0.07 0.4 −1.125 −0.001 0.499 0.048 0.524
−1 0.09 0.4 −1.125 −0.001 0.499 −0.926 0.136
1 0.03 0.6 0.875 1.015 0.884 1.063 0.893
1 0.05 0.6 0.875 1.015 0.884 1.063 0.893
1 0.07 0.6 0.875 −0.001 0.499 0.048 0.524
−1 0.09 0.6 −1.125 −0.001 0.499 −0.926 0.136
1 0.03 0.8 0.875 1.015 0.884 1.889 0.978
1 0.05 0.8 0.875 1.015 0.884 1.889 0.978
1 0.07 0.8 0.875 −0.001 0.499 0.873 0.851
1 0.09 0.8 0.875 −0.001 0.499 0.873 0.851

Note that, in this created data set, the pattern that ties the variables to misstate-
ments is visually self-evident. Plotting the data in Fig. 7, it is clear that misstatements
have a monotonic relationship with the two variables. But this relationship is not lin-
ear, and there is an interaction between the two variables such that the cutoff for soft
assets does not change at low levels of earnings-to-price. Our objective will be to show

[Figure 7: scatter plot of the 16 observations, with EP on the horizontal axis (0.01 to 0.11) and Soft on the vertical axis (0 to 0.9).]

Fig. 7 Graphical analysis: misstatements as a function of explanatory variables



how machine learning can uncover this pattern that we have ourselves “learned” from
visual inspection over the graph of observations.
Following the pseudo-code, we first calculate the sample mean as $\bar{y} = 0.125$ and
initialize our first predictor for the whole sample given by
$$\hat{y}_i^0 = \frac{1}{2}\log\frac{1 + 0.125}{1 - 0.125} = 0.125657.$$

Then we calculate the residuals for each of the 16 observations, given by
$$z_i^1 = L_2(y_i, 0.125657) = \frac{2y_i}{1 + \exp(2y_i\hat{y}_i^0)},$$

as calculated in column 4 of Table 21. The next step is to fit the regression tree to $z_i^1$,
which involves finding one variable and one cutoff, such that all observations where
the variable is at one side of the cutoff are given the same prediction. Of note, the
regression tree does not use the same loss function as the boosting because the fitted
$z_i^1$ are derivatives that need not be bounded. In particular, the loss function used for
this step is simply the quadratic loss function $L^q$, and the best cutoff minimizes this
loss.
In this example, there are two variables, and each has four values, so we could
choose to place the cutoff on one of two variables and at three locations per variable,
for six possible choices of the cutoff. For each possible cutoff, we compute a prediction
$\bar{z}_i^1$ as the average for all observations at one side of the cutoff and sum over the
loss function $L^q(z_i^1, \bar{z}_i^1)$. Table 22 provides the loss function for every possible
cutoff. Here, the best cutoff is to partition as a function of soft assets greater
than or equal to 0.4, with 4 observations below this cutoff and 12 above.
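Continuing the sketch above (same y, ep, and soft arrays), Table 22 can be reproduced by brute force; the first-round residuals take only the values 0.875 and −1.125:

```python
z1 = np.where(y > 0, 0.875, -1.125)       # first-round residuals (column 4)
for name, col in [("EP", ep), ("Soft", soft)]:
    for cut in np.unique(col)[1:]:        # three interior cutoffs per variable
        lo, hi = z1[col < cut], z1[col >= cut]
        sse = ((lo - lo.mean()) ** 2).sum() + ((hi - hi.mean()) ** 2).sum()
        print(f"{name} >= {cut:.2f}: left mean {lo.mean():.3f}, "
              f"right mean {hi.mean():.3f}, squared loss {sse:.3f}")
# Soft >= 0.4 attains the minimum squared loss of 9, as in Table 22.
```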
Given that the number of leaves is set to be equal to three, we need to apply a
second (and final) cutoff to obtain three end leaves. We proceed similarly in Table 23,
except that there are now two choices as to whether to apply the cutoff on the left or
right side of Soft ≥ 0.4 and then five choices for the cutoff for the two variables if we
place the cutoff on the right node versus three if we place the cutoff on the left node.
Note that, for the left node, all outcome variables have the same value equal to zero

Table 22 Squared loss for various cutoffs (first branch)

Threshold Left mean $\bar{z}_i^1$ Right mean $\bar{z}_i^1$ Squared loss

EP ≥ 0.09 0.208 −0.625 13.667


EP ≥ 0.07 0.375 −0.375 13.5
EP ≥ 0.05 0.375 −0.125 15
Soft ≥ 0.8 −0.292 0.875 11.667
Soft ≥ 0.6 −0.625 0.625 9.5
Soft ≥ 0.4 −1.125 0.375 9

Table 23 Squared loss for various cutoffs (second branch)

Node Threshold Left mean $\bar{z}_i^1$ Right mean $\bar{z}_i^1$ Sum loss

EP ≥ 0.09 0.653 −0.458 6.222


EP ≥ 0.07 0.875 −0.125 6
right EP ≥ 0.05 0.875 0.208 8
Soft ≥ 0.8 0.125 0.875 7.5
Soft ≥ 0.6 −0.125 0.625 7.5
EP ≥ 0.09
left EP ≥ 0.07 N/A N/A N/A
EP ≥ 0.05

so that the cutoff itself does not increase explanatory power; hence we know, for this
case, that the best cutoff must be on the right node.
Note that a variable that has been used in a prior cut can be used again in later steps,
thus creating potentially complex interactions or nonlinearities. For this example, the
best cutoff point is located at EP ≥ 0.07, which leads to a partition of the sample
into three regions, as illustrated in Fig. 8.
Having completed the first tree, we compute an adjustment to our initial prediction
from step 2.2, that is,
$$\hat{y}_i^m = \hat{y}_i^{m-1} + \varphi(x_i; S_0),$$
where $\varphi(\cdot)$ is based on the logistic functional
$$\varphi(x_i; S_0) = \frac{\sum_{(z_i, x_i) \in R^1(x_i)} z_i}{\sum_{(z_i, x_i) \in R^1(x_i)} |z_i|\,(2 - |z_i|)},$$

where $R^1(x_i)$ indicates the set of observations classified in the same region by the
tree. Column 5 in Table 21 provides an updated estimate after the first tree. Note

[Figure 8: the first tree splits the 16 observations on Soft ≥ 0.4 (4 observations with a squared loss of 0 versus 12 observations with a squared loss of 9) and then splits the right node on EP ≥ 0.07 (6 and 6 observations); a companion panel shows the implied partition of the (EP, Soft) plane.]

Fig. 8 First regression tree - complete



[Figure 9: the second tree splits the 16 observations on Soft ≥ 0.8 (12 versus 4 observations) and then splits the left node on EP ≥ 0.09 (9 and 3 observations); a companion panel shows the implied partition of the (EP, Soft) plane.]

Fig. 9 Second tree

that these refer to the term inside the logistic function. To make a prediction, we
need to convert this term into a probability (column 6) using a standard logistic
transformation.17
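To illustrate the update with the numbers in Table 21 (our arithmetic): the left leaf of the first tree contains the four observations with $z_i^1 = -1.125$, so
$$\varphi = \frac{4 \times (-1.125)}{4 \times 1.125 \times (2 - 1.125)} \approx -1.143, \qquad \hat{y}_i^1 = 0.1257 - 1.143 \approx -1.017,$$
and the implied probability is $1/(1 + \exp(-2\hat{y}_i^1)) \approx 0.116$, matching columns 5 and 6.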
Since we have set the number of trees to two, we need to repeat these steps a
second time, using an updated set of residuals
$$z_i^2 = L_2(y_i, \hat{y}_i^1).$$
Skipping the calculations (which are conceptually identical), the second tree illus-
trated in Fig. 9 is now given by a cutoff on both high E/P and soft assets, which
unsurprisingly corresponds to the area of predictors that was partly misclassified by
the first tree.
After computing the corresponding $\varphi(x_i; S_1)$, where $S_1 = (z_i^2, x_i)$ is the sample
for the second tree, and mapping back to an updated value for $\hat{y}_i^2$ and the implied
probability, we report in column 8 of Table 21 the resulting prediction. As a result
of the two trees, the data is now classified into seven regions.
To further illustrate the tree component of the analysis, we estimate in Fig. 10
a single regression tree by partitioning the probability of misstatements into 10
nodes. The probability of misstatements across nodes varies from 2% to 50%. At the
first branch of the tree, the most important variable is whether the auditor has
issued a qualified opinion on internal controls. As expected, qualified opinions tend
to be more frequent in the presence of an ongoing misstatement. Conditional on a
qualified opinion, the firms with higher book-to-market and higher deferred tax
expenses tend to be the most likely to misstate, thus suggesting disagreements about
deferred taxes.

17 This result coincides with the Stata package Boost, with code boost Res EP Soft, distribution(logistic) train(1) bag(1) interaction(2) maxiter(1) shrink(1) predict(pred).



[Figure 10: ten-leaf regression tree. The root splits on the internal control opinion. On the unqualified branch, the tree splits on non-audit fee / total fee (40.2%), then on % outsiders appointed (39.2%) and stock return volatility (2.4%), with further splits on the Banks & Insurance and Computers industry indicators. On the qualified branch, the tree splits on book-to-market (40.7%), then on % board insiders (11.8%) and deferred tax expense (2.7%). Leaf probabilities across the tree range from 0.0202 to 0.5135.]

Fig. 10 Regression tree

Conditional on an unqualified opinion, by contrast, a very different set of variables
predicts misstatement, with firms with high non-audit fees and high stock return
volatility having a high probability of misstating.
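As a sketch of how such a single interpretable tree can be fit (with randomly generated placeholder data standing in for the paper's firm-year panel; the column abbreviations follow Table 5, and none of the output replicates Fig. 10):

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)            # placeholder data, not the paper's sample
X_df = pd.DataFrame({"auopic": rng.integers(0, 2, 1000).astype(float),
                     "feeratio": rng.random(1000),
                     "bm": 2 * rng.random(1000)})
y01 = rng.integers(0, 2, 1000).astype(float)   # 0/1 misstatement indicator

# A 10-leaf regression tree; each leaf mean estimates a misstatement probability.
single = DecisionTreeRegressor(max_leaf_nodes=10).fit(X_df, y01)
print(export_text(single, feature_names=list(X_df.columns)))
```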

References

Abbasi, A., Albrecht, C., Vance, A., Hansen, J. (2012). Metafraud: a meta-learning framework for
detecting financial fraud. MIS Quarterly, 36(4), 1293–1327.
Avramov, D., Chordia, T., Jostova, G., Philipov, A. (2009). Credit ratings and the cross-section of stock
returns. Journal of Financial Markets, 12(3), 469–499.
Bao, Y., Ke, B., Li, B., Yu, J.J., Zhang, J. (2020). Detecting accounting fraud in publicly traded U.S.
firms using a machine learning approach. Journal of Accounting Research, 58(1), 199–235.
Barton, J., & Simko, P.J. (2002). The balance sheet as an earnings management constraint. The Accounting
Review, 77(s-1), 1–27.
Beneish, M.D. (1999). The detection of earnings manipulation. Financial Analysts Journal, 55(5), 24–36.
Bertomeu, J., & Marinovic, I. (2015). A Theory of hard and soft information. The Accounting Review,
91(1), 1–20.
Blackburne, T., Kepler, J., Quinn, P., Taylor, D. (2020). Undisclosed SEC investigations. Management
Science, forthcoming.
Cheffers, M., Whalen, D., Usvyatsky, O. (2010). 2009 financial restatements: A nine year comparison.
Audit Analytics Sales (February).
Cheynel, E., & Levine, C. (2020). Public disclosures and information asymmetry: A theory of the mosaic.
The Accounting Review, 95(1), 79–99.
Dechow, P.M., & Dichev, I.D. (2002). The quality of accruals and earnings: The role of accrual estimation
errors. The Accounting Review, 77(s-1), 35–59.
Dechow, P.M., Ge, W., Larson, C.R., Sloan, R.G. (2011). Predicting material accounting misstatements.
Contemporary Accounting Research, 28(1), 17–82.
DeFond, M.L., Raghunandan, K., Subramanyam, K.R. (2002). Do non–audit service fees impair auditor
independence? evidence from going concern audit opinions. Journal of Accounting Research, 40(4),
1247–1274.
Deng, H. (2018). Interpreting tree ensembles with inTrees. International Journal of Data Science and
Analytics, pp 1–11.

Ding, K., Lev, B., Peng, X., Sun, T., Vasarhelyi, M.A. (2020). Machine learning improves accounting
estimates. Review of Accounting Studies, pp 1–37.
Dutta, I., Dutta, S., Raahemi, B. (2017). Detecting financial restatements using data mining techniques.
Expert Systems with Applications, 90, 374–393.
Ettredge, M.L., Sun, L., Lee, P., Anandarajan, A.A. (2008). Is earnings fraud associated with high deferred
tax and/or book minus tax levels?. Auditing: A Journal of Practice & Theory, 27(1), 1–33.
Fanning, K.M., & Cogger, K.O. (1998). Neural network detection of management fraud using published
financial data. Intelligent Systems in Accounting, Finance & Management, 7(1), 21–41.
Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861–874.
Frankel, R.M., Johnson, M.F., Nelson, K.K. (2002). The relation between auditors’ fees for nonaudit
services and earnings management. The Accounting Review, 77(s-1), 71–105.
Friedman, J.H. (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics,
pp 1189–1232.
Friedman, J., Hastie, T., Tibshirani, R. (2001). The Elements of Statistical Learning Vol. 1. New York:
Springer series in statistics.
Garfinkel, J.A. (2009). Measuring investors’ opinion divergence. Journal of Accounting Research, 47(5),
1317–1348.
Glosten, L.R., & Milgrom, P.R. (1985). Bid, ask and transaction prices in a specialist market with
heterogeneously informed traders. Journal of Financial Economics, 14(1), 71–100.
Green, B.P., & Choi, J.H. (1997). Assessing the risk of management fraud through neural network
technology. Auditing: A Journal of Practice & Theory, 16, 14–28.
Guelman, L. (2012). Gradient boosting trees for auto insurance loss cost modeling and prediction. Expert
Systems with Applications, 39(3), 3659–3667.
Gupta, R., & Gill, N.S. (2012). A solution for preventing fraudulent financial reporting using descriptive
data mining techniques. International Journal of Computer Applications.
Hribar, P., Kravet, T., Wilson, R. (2014). A New measure of accounting quality. Review of Accounting
Studies, 19(1), 506–538.
Johnson, V.E., Khurana, I.K., Kenneth Reynolds, J. (2002). Audit-firm tenure and the quality of financial
reports. Contemporary Accounting Research, 19(4), 637–660.
Kasznik, R. (1999). On the association between voluntary disclosure and earnings management. Journal
of Accounting Research, 37(1), 57–81.
Kim, Y.J., Baik, B., Cho, S. (2016). Detecting financial misstatements with fraud intention using multi-
class cost-sensitive learning. Expert Systems with Applications, 62, 32–43.
Kleinberg, J., Lakkaraju, H., Leskovec, J., Ludwig, J., Mullainathan, S. (2017). Human decisions and
machine predictions. The Quarterly Journal of Economics, 133(1), 237–293.
Kornish, L.J., & Levine, C.B. (2004). Discipline with common agency: The case of audit and nonaudit
services. The Accounting Review, 79(1), 173–200.
Larcker, D.F., Richardson, S.A., Tuna, I. (2007). Corporate governance, accounting outcomes, and
organizational performance. The Accounting Review, 82(4), 963–1008.
Laux, V., & Newman, P.D. (2010). Auditor liability and client acceptance decisions. The Accounting
Review, 85(1), 261–285.
Lin, J.W., Hwang, M.I., Becker, J.D. (2003). A Fuzzy neural network for assessing the risk of fraudulent
financial reporting. Managerial Auditing Journal, 18(8), 657–665.
Lobo, G.J., & Zhao, Y. (2013). Relation between audit effort and financial report misstatements: Evidence
from quarterly and annual restatements. The Accounting Review, 88(4), 1385–1412.
Perols, J. (2011). Financial statement fraud detection: An analysis of statistical and machine learning
algorithms. Auditing: A Journal of Practice & Theory, 30(2), 19–50.
Perols, J.L., Bowen, R.M., Zimmermann, C., Samba, B. (2016). Finding needles in a haystack: Using data
analytics to improve fraud prediction. The Accounting Review, 92(2), 221–245.
Ragothaman, S., & Lavin, A. (2008). Restatements due to improper revenue recognition: a neural networks
perspective. Journal of Emerging Technologies in Accounting, 5(1), 129–142.
Romanus, R.N., Maher, J.J., Fleming, D.M. (2008). Auditor industry specialization, auditor changes, and
accounting restatements. Accounting Horizons, 22(4), 389–413.
Samuels, D., Taylor, D.J., Verrecchia, R.E. (2018). Financial misreporting: Hiding in the shadows or in
plain sight? Working paper.
van Rijsbergen, C.J. (2004). The geometry of information retrieval. Cambridge University Press.

Whiting, D.G., Hansen, J.V., McDonald, J.B., Albrecht, C., Steve Albrecht, W. (2012). Machine learning
methods for detecting patterns of management fraud. Computational Intelligence, 28(4), 505–527.
Zhang, Y., & Haghani, A. (2015). A gradient boosting method to improve travel time prediction.
Transportation Research Part C: Emerging Technologies, 58, 308–324.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.

Affiliations

Jeremy Bertomeu1 · Edwige Cheynel1 · Eric Floyd2 · Wenqiang Pan3

Edwige Cheynel
echeynel@wustl.edu
Eric Floyd
ejfloyd@ucsd.edu
Wenqiang Pan
wenqiang.pan1994@gmail.com
1 Olin Business School, Washington University, Saint Louis, MO, USA
2 Rady School of Management, University of California San Diego, La Jolla, CA, USA
3 Columbia Business School, Columbia University, New York City, NY, USA
