Using machine learning to detect misstatements
Jeremy Bertomeu · Edwige Cheynel · Eric Floyd · Wenqiang Pan
https://doi.org/10.1007/s11142-020-09563-8
Abstract
Machine learning offers empirical methods to sift through accounting datasets with a
large number of variables and limited a priori knowledge about functional forms. In
this study, we show that these methods help detect and interpret patterns present in
ongoing accounting misstatements. We use a wide set of variables from accounting,
capital markets, governance, and auditing datasets to detect material misstatements.
A primary insight of our analysis is that accounting variables, while they do not
detect misstatements well on their own, become important with suitable interactions
with audit and market variables. We also analyze differences between misstatements and irregularities, compare algorithms, examine one-year-ahead and two-year-ahead predictions, and interpret groups at greater risk of misstatements.
Machine learning is a broad discipline that designs learning algorithms able to drive cars, recognize spoken language, and discover hidden regularities in growing volumes of data. Archival financial research relies on data streams capturing firm characteristics, governance attributes, audit reports, market data, and accounting variables. Machine learning algorithms detect complex patterns in these data, select the best variables to explain an outcome variable, and uncover suitable combinations of variables to make accurate out-of-sample predictions. These algorithms are a key to unlocking large, and growing, financial data sources to make better predictions and smarter decisions. This paper offers preliminary steps toward applying this technology in accounting. We motivate the method by answering a practical question: how do we detect ongoing accounting misstatements?
We focus on restatements filed under Item 4.02(a), “Non-Reliance on Previously Issued Financial Statements or a Related Audit Report or Completed Interim Review,” which are, in principle, restatements that materially affect the interpretation of accounting numbers by investors. These misstatements differ from irregularities because they need not be fraudulent or carry evidence of managerial intent. Still, the audit procedures in place should have prevented these misstatements from occurring. The reasons for such misstatements can be complex and related to a large set of factors. Relative to a linear statistical model, machine learning algorithms are best suited for problems in which the set of variables, their interactions, and the mapping into outcomes are not theoretically obvious.
serve as input variables to aid in detecting misstatements. An increase in accruals,
for example, may either reflect growth or a firm overstating its earnings. Many other
characteristics, such as industry, changes in balance sheet accounts (including cash
or inventories), or special transactions, such as leases, do not have a clear theoretical
relationship with misstatements.
Intuitively, a machine learning algorithm can be taught to replicate the process
of a researcher seeking a good empirical model for the relation between outcomes
and input variables. This search involves many moving parts, such as finding the right specification and interactions between variables, deciding which variables to select, and knowing when to stop searching. The relation between the dependent and independent variables need not be monotonic, and the interactions between variables are a priori unknown. Machine learning algorithms capture the relationship between misstatements and financial data, enabling auditors, managers, regulators, and investors to better evaluate the predictors of misstatements, thereby offering a starting point for taking preventive actions.
For our baseline analysis, we use the gradient boosted regression tree (GBRT)
developed by Friedman (2001) and Friedman et al. (2001). This algorithm is a
regression method that explains the probabilities of misstatements; it has the further
conceptual advantage of relating to traditional regression methods, such as ordinary
least squares or logit, which allows us to compare its performance to traditional
research designs in capital markets. We also examine a number of alternative machine
learning approaches drawn from the wider literature in computer science. We find
that several other methods perform well at detecting misstatements. Through various performance metrics, we illustrate how the choice of algorithm may depend on the intended use of the output of the model, between a focus on estimating a posterior expectation with a regression method (e.g., a conditional probability of having a misstatement) and a focus on classifying observations into misstatement and non-misstatement firms. By design, GBRT appears to perform best on a likelihood-based performance measure (the pseudo R²), that is, in the sense of explaining posterior probabilities of misstatement. We fit the occurrence of misstatements but not their magnitude; thus we can examine whether misstatements detected by machine learning are larger
or detected earlier than with other models. To connect to the growing research on
irregularities, we also examine whether the detected restatements appear to link to
irregularities. In particular, because of the larger number of misstatements, a model
trained with misstatements is about as good at detecting AAERs as a model trained
with AAERs and considerably better at detecting joint occurrences of misstatements
and AAERs.
The remainder of the paper is organized as follows. Section 1 briefly introduces
GBRT. Section 2 describes the data and research design. Section 3 discusses the most
important variables used in the model. Section 4 presents comparisons with linear
models and classification models. Section 5 studies and compares implications of
machine learning algorithms on AAERs and misstatements. Section 6 performs several additional analyses and applies a method to better interpret the results provided by GBRT.
We use the gradient boosted regression tree (GBRT) developed by Friedman (2001) for our main analysis. Reviews of applications of the method include Guelman (2012), Zhang and Haghani (2015), and Kleinberg et al. (2017). We begin with a nontechnical overview of the main building blocks of this algorithm.
As a preliminary activity to most machine learning exercises, the first step is to divide the sample into subsamples: the training and validation samples serve to fit and tune the model, and the test sample serves to evaluate the predictive ability of the model.
GBRT has two components that can be described independently. The regression tree is a model in which the set of predictive variables is sequentially partitioned into regions, where each region provides one average predicted number for the outcome variable. The predictive variables are first split into two regions to maximize predictive accuracy. In the next steps, a region is then split again, optimally, to improve predictive power. One advantage of a regression tree over, say, a linear regression is that the tree will sequentially pick the variables that best explain the outcome variable. The most important variables will typically form the initial splits in the tree, and the following splits will build interactions with other variables. The approach also implements a variable selection procedure in each tree, with benefits similar to different regularization techniques, because the set of variables used in the regression model is controlled by the number of splits.
A second part of GBRT is the novel aspect introduced by Friedman (2001): gradient boosting. Intuitively, boosting generalizes supervised learning with a single tree to iteratively fit multiple trees. The idea of boosting is to compute the residuals from the tree and apply another tree to the residuals to improve the accuracy of the overall model. The residuals are then re-partitioned over a potentially different set of variables and interactions. To gain some intuition, the first tree might capture the first-order economic factor behind an outcome variable but miss second- or third-order factors; the second and third trees (and onward) will capture these factors better than the first tree by focusing on what remains unexplained by the first tree.
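The residual-fitting logic can be sketched in a few lines; this is an illustrative simplification on simulated data (plain residuals under squared loss, whereas GBRT proper fits gradients of the deviance when modeling probabilities, as in the worked example of the appendix).

```python
# Minimal sketch of boosting: fit a small tree, compute residuals, fit the
# next tree on those residuals, and accumulate shrunken predictions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(size=(1000, 2))
y = np.sin(6 * X[:, 0]) + (X[:, 1] > 0.5) + rng.normal(0, 0.1, 1000)

pred = np.full_like(y, y.mean())       # start from the sample mean
shrinkage = 0.1                        # fraction of each new tree retained
for _ in range(100):
    resid = y - pred                   # what earlier trees left unexplained
    tree = DecisionTreeRegressor(max_depth=3).fit(X, resid)
    pred += shrinkage * tree.predict(X)

print("in-sample R2:", 1 - np.var(y - pred) / np.var(y))
```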
When implementing the regression tree, GBRT requires two parameters: the number of splits (depth of tree) and the fraction of the sample to be used in fitting each tree (bagging). Boosting has two additional parameters: the number of trees to use and the shrinkage parameter, defined as the percentage of the prediction of a new tree incorporated into the model. Higher values of these parameters increase the fit of the model but can lead to overfitting, which occurs when the machine learning model explains noise in the training sample instead of the underlying relationship that generalizes to the test sample. Friedman (2001) recommends using weak learners, defined as trees with four to eight splits, bagging between 50% and 80% of observations, and a shrinkage parameter no greater than 10%. The preferred number of trees can depend heavily on the nature of the data. In summary, GBRT depends on four fitting parameters: tree depth, bagging, number of trees, and shrinkage.
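For concreteness, here is a sketch of how the four parameters would map onto a standard library implementation; scikit-learn's GradientBoostingClassifier is an assumed stand-in, and the values shown follow Friedman's recommended ranges rather than the tuned values.

```python
# How the four GBRT fitting parameters map onto a common implementation
# (illustrative configuration only, not the paper's tuned model).
from sklearn.ensemble import GradientBoostingClassifier

gbrt = GradientBoostingClassifier(
    max_depth=6,         # tree depth: number of splits per weak learner
    subsample=0.7,       # bagging: fraction of the sample used per tree
    n_estimators=500,    # number of trees
    learning_rate=0.05,  # shrinkage: weight on each new tree's prediction
)
```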
With a set of independent observations, a common procedure to tune these parameters is k-fold cross-validation: partition the training sample randomly into k subsamples (folds) and fit the model excluding one of the folds, which is then used to evaluate performance. The researcher repeats this procedure, rotating the excluded fold and selecting the parameters that maximize average performance. However, our dataset is a panel, and rotating the folds would imply using data known from future events to detect current misstatements.1 So, for the purposes of the panel, we fit the model using years 2001-2009 and evaluate performance in a validation sample composed of years 2010-2011. We search these parameters within the ranges suggested by Friedman (2001), varying bagging over 50%, 60%, 70%, and 80%, shrinkage over 1%, 5%, and 10%, and tree depth from 1 to 12, and increasing the number of trees until the pseudo R² is maximized or reaches a plateau. The algorithm is explained in more detail in Appendix B.
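A minimal sketch of this panel-based tuning loop, assuming a hypothetical DataFrame `panel` with `year` and `misstated` columns and a list `features` of predictor names (the number of trees is fixed here for brevity, rather than grown to a plateau).

```python
# Tune (bagging, shrinkage, depth) on a time-ordered split: fit on
# 2001-2009, score on 2010-2011, so no future data leaks into training.
from itertools import product
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss

train = panel[panel.year.between(2001, 2009)]
valid = panel[panel.year.between(2010, 2011)]

best, best_score = None, -float("inf")
for bag, shrink, depth in product([0.5, 0.6, 0.7, 0.8],
                                  [0.01, 0.05, 0.10],
                                  range(1, 13)):
    model = GradientBoostingClassifier(
        subsample=bag, learning_rate=shrink,
        max_depth=depth, n_estimators=1000,
    ).fit(train[features], train.misstated)
    # McFadden pseudo R2: one minus the ratio of model to null log-likelihood.
    ll = -log_loss(valid.misstated, model.predict_proba(valid[features])[:, 1])
    ll0 = -log_loss(valid.misstated, [train.misstated.mean()] * len(valid))
    score = 1 - ll / ll0
    if score > best_score:
        best, best_score = (bag, shrink, depth), score
```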
We focus on GBRT as a baseline because it is closest in spirit to a logistic regression, to the extent that it directly yields a probability as the output. By analogy to a logit, we first report a purely statistical measure of fit based on a pseudo R² in a test sample. We then compare the area under the receiver operating characteristic (ROC) curve, which plots the rate of true misstatements identified by the algorithm against the rate of non-misstating firms flagged for inspection. The area under this curve, denoted AUC, is a commonly used summary of the performance of the algorithm. In essence, the AUC is equivalent to the probability that a randomly chosen true positive is ranked higher than a randomly chosen true negative (Fawcett 2006).
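A short sketch of both metrics on the test sample; `model`, `test`, and `features` are hypothetical names carried over from the tuning sketch above.

```python
# Score the fitted model out of sample: AUC summarizes the ROC curve, i.e.,
# the probability a true positive outranks a true negative.
from sklearn.metrics import roc_auc_score, roc_curve

probs = model.predict_proba(test[features])[:, 1]  # predicted misstatement probabilities
auc = roc_auc_score(test.misstated, probs)         # area under the ROC curve
fpr, tpr, _ = roc_curve(test.misstated, probs)     # points tracing the ROC curve
```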
We use the Audit Analytics Non-Reliance Restatement database, updated on January 15, 2019. Non-reliance restatements refer to restatements that undermine the previously issued financial statements or a related audit report or completed interim review.2
1 Another possible solution is to fit the model using a rolling window or exclude observations from firms
used to build the model. However, both of these choices severely restrict the sample effectively used for
cross-validation.
2 In a previous version of the manuscript, we focused on all restatements, including restatements not
reported in 8-K; however, many of these restatements need not include large events. We thank Andy
Imdieke for this suggestion.
3 Audit Analytics provides the restated amount for each year only for the five most recent years impacted by
the restatement. However, firms’ restatements can often impact more than five years of financial data. The
impact on accounting numbers prior to the most recent five years is usually reported as a cumulative charge
to retained earnings, and, in practice, firms need not retrospectively adjust all prior years. To account for
this, we assume that the cumulative effect to retained earnings is distributed evenly across the misstatement
span identified in the restatement filing. If the span is missing, we allocate the unexplained cumulative
change to the year prior to the last year with an income effect.
[Figure: sample construction. The data comprise 54,354 firm-years (2001-2014), split into a training sample of 35,148 firm-years (2001-2009) used with the predictors to fit the prediction model, a validation sample of 7,883 firm-years (2010-2011), and a test sample of 11,323 firm-years (2012-2014) on which we report the results of our analysis]
Note that this approach can be applied to any prediction model. For example, Dechow et al. (2011) estimate an additive logistic class of models using the training sample to reduce the set of predictive variables. They proceed by eliminating variables from a full (kitchen-sink) model or, vice versa, by adding variables sequentially to a sparse model. Relative to a class of logistic models, machine learning organizes this process of selection over a larger set of variables. The regression tree component of GBRT also allows for a search over non-additive effects and interactions, which are excluded by assumption in a logistic regression.
We build an extensive dataset of over 100 potential predictor variables obtained
from public records, as defined in Table 5. We include financial variables (Dechow
et al. 2011), audit variables (Frankel et al. 2002; DeFond et al. 2002; Johnson et al.
2002; Romanus et al. 2008), credit rating variables (Avramov et al. 2009), opinion
divergence variables (Garfinkel 2009), and corporate governance variables (Larcker
et al. 2007). We use SIC codes to identify 15 major industries. We also include the
auditor opinion, an indicator variable for the existence of a management forecast,
analyst consensus forecast, short interest, and indicator variables for foreign firms
and current or past restatement announcements. In the baseline model, we use the one-year lagged value of the mean-adjusted absolute value of DD and studentized DD residuals (Dechow and Dichev 2002) to avoid using information known only in the future in the prediction model. Each of the continuous variables is winsorized at the 1% and 99% levels. We set missing values to zero but include an indicator variable equal to one when the variable is missing, for any variable with more than a 10% missing rate.
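A sketch of this preparation step in pandas-style code, where `df`, `continuous_cols`, and `predictor_cols` are hypothetical names.

```python
# Winsorize each continuous predictor at the 1st/99th percentiles; for any
# variable missing in more than 10% of firm-years, add a missing indicator;
# then set remaining missing values to zero.
for col in continuous_cols:
    lo, hi = df[col].quantile([0.01, 0.99])
    df[col] = df[col].clip(lower=lo, upper=hi)

for col in predictor_cols:
    if df[col].isna().mean() > 0.10:
        df["missing_" + col] = df[col].isna().astype(int)
    df[col] = df[col].fillna(0)
```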
Several studies on AAERs use machine learning tools, such as ensemble learning, to classify a subset of firm-years into frauds, implying that the output is a binary classification (Perols et al. 2016; Bao et al. 2020). These methods differ from regression methods in that the algorithm is optimized to correctly classify irregularities (AAERs) in datasets constructed with a balanced number of misstating and non-misstating firms. Balancing is critical in a classification algorithm because, with heavily imbalanced data, the preferred classification will involve classifying all observations into one group. We compare GBRT to other common methods in machine learning, including classification methods such as RUSBoost.4 Regression methods, such as logistic regressions or GBRT, fit the likelihood of misstatements and typically do not require balancing.
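As a one-shot illustration of the sampling step (RUSBoost redraws such balanced samples within each boosting round, so this is only a sketch of the idea), with hypothetical `train` and `misstated` names.

```python
# Random undersampling: keep all misstating firm-years and an equal-sized
# random draw of non-misstating ones, then shuffle the balanced sample.
import pandas as pd

pos = train[train.misstated == 1]
neg = train[train.misstated == 0].sample(len(pos), random_state=0)
balanced = pd.concat([pos, neg]).sample(frac=1, random_state=0)
```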
We extend the evaluation procedure to include economically interpretable evaluations, which are meant to capture an upper bound on what a regulator might detect under ideal conditions. To be specific, we assume that the SEC inspects a fraction of all the firms and can perfectly detect an inspected misstating firm. We vary the probability of inspection to examine which algorithm performs best and then further evaluate the model based on its ability to detect large versus small misstatements, detect misstatements early, and detect misstatements that ultimately lead to an SEC enforcement action. We vary the rate of inspection from 1%, consistent with the rate of AAERs, to 10%, consistent with the rate of SEC investigations measured by Blackburne et al. (2020).
4 RUSBoost refers to random undersampling, such that balanced samples are constructed by randomly drawing from the sample. With a heavily imbalanced dataset, however, nonrandom undersampling may perform better than random sampling. In untabulated results, we used the sampling method of Perols et al. (2016), but it did not perform better than RUSBoost in our dataset. Under this alternate sampling method, the AUC is 69.4%, and the detection rate of restatements is 60.0% and of AAERs 81.1%. This method ranks better than logistic models but slightly worse than GBRT and random undersampling in Tables 10 and 15. An important difference is that there are more material misstatements than AAERs, so the benefits of nonrandom sampling to alleviate imbalance are more muted.
[Figure: average pseudo R² in the validation sample, (left) by tree depth with shrinkage = 1% and bagging = 70%, and (right) by shrinkage with bagging = 70% and tree depth = 9; the pseudo R² ranges from roughly 8% to 11%]
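A sketch of this inspection-based evaluation, flagging the top fraction of predicted probabilities within each year and computing the share of misstatements caught (hypothetical `test` frame with `year`, `prob`, and `misstated` columns).

```python
# Catch rate under perfect inspection: inspect the top `rate` fraction of
# predicted probabilities each year; count the share of misstatements caught.
import numpy as np

def catch_rate(df, rate):
    flagged = df.groupby("year")["prob"].rank(pct=True) > 1 - rate
    return df.loc[flagged, "misstated"].sum() / df["misstated"].sum()

for rate in np.arange(0.01, 0.11, 0.01):
    print(f"inspect {rate:.0%}: catch {catch_rate(test, rate):.1%}")
```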
3 Detection model
[Fig. 3 Total variable importance by predictor group (accounting, governance, audit, market, business), ranging from 0% to 45%]
The most important individual variables include % soft assets, bid-ask spread, non-audit fees divided by total fees, a dummy for a qualified opinion over internal controls, changes in operating lease activity, short interest, and stock return volatility.6
Of interest, accrual-based metrics commonly used in accounting research do not
seem to be the leading variables selected by the algorithm. Only working capital
accruals are listed in the top 20 most important variables, and the remaining accrual
measures are in the top 40 most important variables. However, a closer inspection reveals a more nuanced answer. Double-entry bookkeeping links multiple accounting variables, so that one accounting variable may not carry large significance on its own, but combining it with other accounting variables might help predict misstatements. Similarly, summary measures, such as accruals, contain information already captured by other variables.
To assess the total importance of all accounting variables, we classify predictors into five groups: accounting (i.e., constructed from accounting numbers), governance, audit, market information (including analysts and credit ratings), and business environment (i.e., all other non-accounting variables, including industry fixed effects).7 We then total the influence by group in Fig. 3. Surprisingly, while governance, audit, and market variables individually have the largest influence, the cumulative effect of accounting variables is larger than that of any other group, attaining 36.2% of total importance. Overall, accounting variables remain the most important source of information to detect misstatements.
6 As in any multivariate descriptive analysis with multiple correlated variables, interpretation requires some
caution since the method may select one variable over another for reasons that relate primarily to the fitting
procedure. Later on, we list the set of important variables in other methods and observe that, while many
variables are common to multiple algorithms, there are also some differences.
7 For variables combining market and accounting information, such as book-to-market and earnings-to-price, […]
8 The theoretical model of Bertomeu and Marinovic (2015) also predicts this relation, as firms that […]
[…] private information reduces market liquidity, this may reduce the sensitivity of price to accounting reports, which in turn can reduce the benefit of accounting discretion (Samuels et al. 2018).
4 Model comparisons
We next examine the performance of the model and compare GBRT with the backward logistic model of Dechow et al. (2011) and other machine learning models.9 All results are evaluated on the test dataset, which has 11,323 firm-years (4,278 unique firms) and 428 restatement firm-years (255 unique firms).
The AUC, defined as the area under the receiver operating characteristic (ROC) curve, is the probability that a classifier ranks a randomly chosen occurrence of misstatement higher than a randomly chosen occurrence of no misstatement. We plot the ROC curve in Fig. 4; it shows the fraction of true positives caught for a given fraction of false positives.
In Table 10, RUSBoost performs better than GBRT under the AUC criterion, and both perform better than the logistic model. In the ROC curve, the difference is greatest for intermediate levels of false positives. This result is intuitive because classification methods, such as RUSBoost, maximize the fit in constructed balanced samples and will work better when evaluated in imbalanced samples if the rate of […]
9 We only document the results with the backward logistic model because the forward logistic and simple logistic models exhibit the same results. Backward and forward logistic models are much more sparse; that is, they use fewer variables than GBRT and the simple logistic model. However, they do not appear to perform better than a simple logistic model. This finding suggests that complex interactions among the entire population of potential predictors capture misstatements.
[Fig. 5 Fβ scores (F0.5, F1, and F2) plotted against the inspection rate for GBRT, Random Forest, RUSBoost, and Backward Logit]
10 We report in Table 10 bootstrapped standard errors, retraining and testing the model 200 times on randomly drawn datasets. Differences between the performance of most models tend to be greater than two standard errors, indicating that these differences are significant. In untabulated analyses, we bootstrapped differences between model performance and confirm that differences between models are significantly different from zero at conventional levels.
We also report Fβ scores: F0.5 places a higher weight on precision (i.e., false inspections are very costly) and F2 a higher weight on recall (i.e., undetected misstatements are very costly). The x-axis in Fig. 5 represents the inspection rate (i.e., the fraction of the population inspected).
Across all specifications, GBRT and RUSBoost outperform logit models. The F0.5 score is maximized by Random Forest at an inspection rate of about 2%, primarily due to the difficulty of detecting misstatements accurately. The other algorithms exhibit very similar patterns. Likewise, all algorithms follow the same trends for the F1 score. The F1 score peaks at low inspection rates, between 2% and 3%, and the curve flattens over inspection rates from 3% to 15%. The F2 score, which assigns greater weight to recall, prescribes higher levels of inspection. Classification methods, such as Random Forest and RUSBoost, perform better and suggest optimal inspection rates around 20%.
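For reference, a sketch of the Fβ computation at a given inspection rate, treating the top fraction of predicted probabilities as inspected (all names hypothetical).

```python
# F-beta evaluation: flag the top `rate` fraction of predicted probabilities
# as inspected and score the implied classification with F0.5, F1, and F2.
import numpy as np
from sklearn.metrics import fbeta_score

def f_scores(y_true, probs, rate):
    cutoff = np.quantile(probs, 1 - rate)   # inspection threshold
    y_hat = probs >= cutoff                 # firms flagged for inspection
    return [fbeta_score(y_true, y_hat, beta=b) for b in (0.5, 1.0, 2.0)]
```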
We consider a third criterion based on inspection rates implied by SEC enforcement actions (AAERs). We match the 3,599 misstated firm-years in our full sample of 54,354 firm-years to AAERs and obtain 385 misstated firm-years connected to an AAER (see Table 14 for details). Hence the SEC inspected at least 385 out of 54,354 firm-years, about 0.7% of the sample, if we assume a perfect detection model, and no more than 385/3,599 = 10.7% if we assume an entirely random inspection. In a recent study, Blackburne et al. (2020) obtain the list of all targets of a formal SEC review from a Freedom of Information Act request. Their evidence points to 10% of firms being reviewed, and, given that not all reviews may find a misstatement, the effective rate of inspections should be somewhere between 1% and 10%.
We select the top 1% through the top 10% of predicted misstatement probabilities of firm-years in a given year and report catch rates in Table 11. For nearly the entire range, RUSBoost has the highest catch rates, ranging from 6.3% to 20.8% for inspection rates below 4%, and it steadily performs better with catch rates between 23.8% and 36.4% for inspection rates between 5% and 10%. In general, Random Forest performs second best, followed closely by GBRT. The logistic model exhibits catch rates similar to the machine learning algorithms for inspection rates lower than 4%. However, the gap widens for high inspection rates. Using an inspection rate of 10% as reference, machine learning algorithms can catch a significant share of misstatements, between 33.4% and 36.4%.
The metrics reported until this point assume low inspection rates. Alternatively,
we consider performance metrics based on an interpretation of the SEC mandated
reviews. Officially, the SEC must review financial statements once every three years
(every year for large filers), which translates to inspecting roughly one-third of
the sample each year. Under the assumption of ideal inspections, we would be
able to detect misstatements in the top third of the firms most likely to feature a
misstatement.
Panel A of Table 12 displays the results for the different methods when we classify the top-third predicted probabilities of firm-years within a given year as “inspected” firms and count the number of actual misstated years detected by each model. Out of 428 restatement firm-years, GBRT detects 64.3% of restatements.
11 In untabulated results, we also estimate the model by separating the restatement sample into positive and negative period income effects, under the conjecture that positive effects may reflect reversals or incentives to influence the stock price downward. See Kasznik (1999) for an extensive continuing literature. We divide restatements into three categories: negative income effects (overstatement), zero income effects, and positive income effects (understatement). We then build three models and predict the probability of overstatement, understatement, and a zero income effect separately. We do not find any notable improvement in predictive power in the test sample, likely because these alternative methods reduce the size of the dataset used to estimate the model.
12 In Panel B of Table 12, we obtain similar results after excluding firms with restatements in the training
sample. Machine learning algorithms continue to perform better, compared to the logistic model, but
feature lower catch rates.
13 In untabulated analyses, we compute the number of caught misstatements at least a year before the AAERs. Out of 29 misstatements caught by GBRT in the test sample, these relate to 20 AAER filings, and all of them are detected at least a year (often more than a year) before the AAER is filed.
[…] out of the 10 most important variables in the restatement sample are important in the AAER sample as well. For Random Forests (RUSBoost), four (five) out of 10 variables are in both lists. We also observe that variables that tend to be used across methods also tend to be used across panels A and B. Overall, six variables appear to be important across methods and for both AAERs and misstatements: % soft assets, bid-ask spread, non-audit fees to total fees, short interest, stock returns, and the percentile rank of total audit fees.
If the misstatements caught by the model are of a more serious nature, they may help predict AAERs. To test this, we ask whether the restatement model predicts AAERs better than a model trained only with AAERs. On the one hand, there are more restatements than AAERs, so restatement data can help train the model. On the other hand, if misstatements are largely disconnected from the irregularities causing AAERs, we should expect a model trained with misstatements to do poorly relative to one trained with AAERs.
Panel A in Table 17 reveals that the misstatement model does about as well at predicting AAERs as a model using AAERs only (the “AAER model”). For regression methods, catch rates are slightly higher with the AAER model, but for classification methods, catch rates are slightly higher with the misstatement model. This finding suggests that the machine learning approach applied to misstatements also captures more serious irregularities. However, the algorithms may also capture the SEC's inspection model, given that the SEC is more likely to check for irregularities in firms that restate, or vice versa, given that restatements may have been triggered by SEC inquiries.
In panel B of Table 17, we consider catch rates of AAERs occurring jointly with
a misstatement. These can be interpreted as more serious irregularities, requiring a
retrospective change in financial statements.14 Panel B shows that the misstatement
model does reasonably well at catching AAERs connected to misstatements: in fact,
it does better than the AAER model across all specifications.
In summary, while most misstatements are not irregularities, the results suggest that severe misstatements are easier to detect than errors; the misstatement model therefore yields catch rates in line with a model estimated from irregularity data. Within the narrower question of misstatement-AAER pairs, the misstatement model performs better than the AAER model.
6 Further analyses
In our baseline model, we use GBRT to detect ongoing misstatements before they are announced, presumably because ongoing misstatements leave a trail of evidence
14 We still estimate the models as in panel A using the entire population of misstatements and AAERs. One
alternative would have been to estimate a model using only AAER-misstatement pairs as irregularities.
However, the number of observations here becomes too small to build a model with reasonable out-of-
sample performance.
[Fig. 6 Average predicted probability of misstatement (0% to 9%) at horizons t = −2, t = −1, and t = 0 for non-restatement versus restatement firms]
that will affect accounting numbers and other variables. However, a misstatement
may also be predicted because of persistent risk factors that would cause certain
classes of businesses to be more prone to misstatements, regardless of whether a
misstatement is actually occurring.
To assess predictive ability over longer horizons, we train and estimate the GBRT model using lagged variables (t = −1) to predict misstatements in the next year. The predictors are thus measured before the occurrence of the misstatement year. In Fig. 6, we calculate the probability of misstatement as assessed by the model for firms that misstated versus firms that did not misstate. A large difference between the probabilities implies that the model assigns a large revision of beliefs about the occurrence of misstatements in the next year, suggesting that the model captures misstatement risk factors. We repeat this exercise using two-year lagged variables (t = −2) and compare these results to our baseline (t = 0).
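A sketch of the lag construction, shifting each firm's predictors back one or two years so that only earlier information enters the model (`panel`, `firm_id`, and `features` are hypothetical names).

```python
# Build one- and two-year lagged predictors within each firm; rows in the
# first year(s) of a firm's history get missing values by construction.
panel = panel.sort_values(["firm_id", "year"])
for lag in (1, 2):
    lagged = panel.groupby("firm_id")[features].shift(lag)
    panel[[f"{c}_lag{lag}" for c in features]] = lagged.to_numpy()
```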
As reported in Table 18, the model performs best when assessing misstatements using current explanatory variables. We find that the model can somewhat predict a misstatement in the next year but that this predictive power is lower for horizons of two years or more. Hence, while contemporaneous risk factors may indeed explain the ability of the model to detect misstatements, persistent risk factors do not appear to do well at predicting future misstatements. We also note that the predictive power one year ahead, although lower than in the contemporaneous model, appears to be driven by the fact that many misstatement strings span two years, implying that the model may be detecting a current misstatement and using it to predict the next misstatement.15
We also examine in Table 19 the nature of the variables that help predict misstatements one or two years ahead. For both horizons, short interest becomes the most important variable, indicating that markets serve an additional role in anticipating future misstatements. As is also intuitive, qualified audit opinions speak to current misstatements, not to the risk of future misstatements, and are no longer among the most important variables (even though audit risk factors, such as high non-audit fees and audit fees, remain important predictors). Leverage and financing cash flows are important for detecting current misstatements but are less critical for predicting future ones. Working capital accruals are important in both detection and one-year-ahead prediction but not in two-year-ahead prediction, which we interpret as reversals of accruals indicating continuing misstatements across two years. Lastly, with a two-year horizon, we start seeing a greater influence from industry characteristics, indicating businesses likely to be more at risk of misstating due to the nature of their operations.
15 In untabulated results, we find very low predictive ability when we predict the first misstatement year.
[…] more hard assets and qualified opinions on controls, which may lead to situations in which transactions relating to hard assets are incorrectly processed (incorrect capitalization, depreciation, inventory valuation, etc.). For rule frequencies above 5%, a large proportion of non-audit fees indicates a high risk of misstatements. Large non-audit fees might impede the auditors' independence and compromise the integrity of the financial information.
When considering all variables, regardless of the rule frequency, audit variables are still very present and, in combination with other variables, signal a misstatement. For rule frequencies below 10%, governance variables help detect misstatements. Accounting variables contribute to detecting misstatements for rule frequencies above 5%. In nearly all regions, market variables, in combination with other variables, can gauge market uncertainty. The presence of short selling helps narrow down the population more likely to misstate. Industry characteristics can refine the search. For rule frequencies below 5%, the computer industry is prone to misstatements. For high frequencies, around 20%, the pharmaceutical industry is excluded.
16 inTrees can imply redundant conditions if an inequality is repeated twice or is a subset of another […]
7 Conclusion
Machine learning methods in accounting are still in their infancy, and it remains an open question how accounting researchers should use and think about what we learn from these methods above and beyond detection. We present here some preliminary evidence of the type of insights available from applying the method to a classic question in accounting research: the detection of misstatements. But evaluating the method in terms of detection alone is likely reductive. Being primarily a descriptive tool for data exploration, machine learning can illuminate patterns in the data without assuming a particular theory. Therefore, while atheoretical by design, it can paradoxically guide researchers to key variables that should be part of theoretical causal models.
The insights are likely to be of interest to different users of accounting information. First, the SEC, as well as other regulatory bodies, might find immediate use for a model that helps identify firms with accounting misstatements more accurately. Machine learning tools are unlikely to replace other sources of information, such as institutional industry knowledge, but they are inexpensive (i.e., they rely on public sources of information and limited human expertise) and can provide early warning signs. Second, academic research in accounting has been facing growing datasets, and machine learning offers a methodology that provides new means to explore complex patterns in the data. Third, although machine learning is becoming common in practice, practitioners will hopefully find our results useful for analyzing data for decision-making purposes.
Appendix A: Omitted Tables
[Table: restatements by fiscal year; columns: fiscal year, number of restated firms, percentage, number of Compustat firms, percentage of Compustat firms]
Change in return on assets (ch roa): (Earnings_t / Average total assets_t) − (Earnings_{t−1} / Average total assets_{t−1})
Change in free cash flows (ch fcf): (Earnings − RSST accruals) / Average total assets
Deferred tax expense (tax): Deferred tax expense for year t / Total assets for year t−1
Existence of operating leases (leasedum): An indicator variable coded 1 if future operating lease obligations are greater than zero
Change in operating lease activity (oplease): The change in the present value of future noncancelable operating lease obligations, deflated by average total assets
Expected return on pension plan assets (pension): Expected return on pension plan assets
Change in expected return on pension plan assets (ch pension): Change in the expected return on pension plan assets
Abnormal change in employees (ch emp): Percentage change in the number of employees − percentage change in assets
Abnormal change in order backlog (ch backlog): Percentage change in order backlog − percentage change in sales
Actual issuance (issue): An indicator variable coded 1 if the firm issues securities during year t
Ex ante financing need (exfin): An indicator variable coded 1 if (CFO − average capital expenditures over the past three years) / Current assets < −0.5
Level of finance raised (cff): Financing activities net cash flow / Average total assets
Leverage (leverage): Long-term debt / Total assets
Market-adjusted stock return (ret_t): Annual buy-and-hold return, inclusive of delisting returns, minus the annual buy-and-hold value-weighted market return
Lagged market-adjusted stock return (ret_{t−1}): Previous year's annual buy-and-hold return, inclusive of delisting returns, minus the annual buy-and-hold value-weighted market return
Book-to-market (bm): Book value of equity / Market value
Earnings-to-price (ep): Earnings / Market value
Agriculture (ind1): SIC 100-999
Mining and Construction (ind2): SIC 1000-1299 and 1400-1999
Number of compensation committee meetings (CC Meetings): The number of compensation committee meetings
Compensation committee size (CCsize): The number of directors serving on the compensation committee
Insider chairman (Insider Chairman): An indicator variable equal to one if an executive holds the position of chairperson of the board and zero otherwise
Lead director (Lead Director): An indicator variable equal to one if there is a lead director on the board and zero otherwise
% Outsiders own (Outsider Own): The fraction of outstanding shares held by the average outside director
Big 4 auditor (big4): An indicator variable equal to one if the firm's auditor is a Big 4 firm and zero otherwise
Non-audit fees / total fees (feeratio): Ratio of non-audit fees to total fees
Percentile rank of non-audit fees by auditor (ranknon): Percentile rank of non-audit fees by auditor
Percentile rank of audit fees by auditor (rankaud): Percentile rank of audit fees by auditor
Percentile rank of total fees by auditor (ranktot): Percentile rank of total fees by auditor
Log of non-audit fees (non audit fees): Log value of non-audit fees
Log of audit fees (audit fees): Log value of audit fees
Log of total fees (total fee): Log value of total audit fees
Missing variable: discretionary accruals (missing DA): An indicator variable coded 1 if discretionary accruals are missing
Missing variable: deferred tax expense (missing TAX): An indicator variable coded 1 if deferred tax expense is missing
Missing variable: change in order backlog (missing OB): An indicator variable coded 1 if change in order backlog is missing
Missing variable: pension return (missing Pension): An indicator variable coded 1 if pension return is missing
Missing variable: analyst forecast (missing AF): An indicator variable coded 1 if analyst forecast is missing
Missing variable: credit ratings (missing Rating): An indicator variable coded 1 if credit ratings are missing
Missing variable: internal control opinion (missing auopic): An indicator variable coded 1 if the internal control opinion is missing or unaudited
Missing variable: going-concern opinion (missing going-concern): An indicator variable coded 1 if the going-concern opinion is missing
Table 6 Summary statistics of predictors in Table 7
Log of audit fee 13.264 1.964 0 11.599 12.361 13.374 14.31 15.204 16.828
Panel C: Market variables (excluding BM and EP)
Bid ask spread 0.011 0.017 0 0.001 0.001 0.004 0.014 0.031 0.095
Short interest 0.036 0.047 0 0 0.002 0.019 0.049 0.095 0.237
Stock return volatility 0.033 0.019 0.009 0.014 0.019 0.028 0.042 0.059 0.109
Lag one year return 0.089 0.574 -0.813 -0.454 -0.228 0 0.252 0.658 2.94
Return 0.068 0.567 -0.861 -0.484 -0.256 -0.023 0.246 0.64 2.836
Analyst forecast errors 0.096 0.63 0 0 0 0.001 0.006 0.036 3.661
Panel D: Business variables
Abnormal change in employees -0.042 0.278 -1.347 -0.272 -0.119 -0.025 0.058 0.2 0.888
Firm age 18.042 15.679 2 4 7 13 24 39 80
Panel E: Governance variables
% Outsiders own 0.023 0.057 0 0 0 0.004 0.016 0.055 0.368
% Outsiders Appointed 0.186 0.334 0 0 0 0 0.273 0.909 1
Table 10 Model comparisons (pseudo R², AUC, catch rate of restatements, catch rate of AAERs, and importance of accounting variables); columns: GBRT, Random Forest, RUSBoost, Backward Logistic
Panel A: Pseudo R² and AUC in full test dataset (11,323 firm-years), bootstrap standard errors in parentheses
Pseudo R²: 14.1% (0.50%), 12.6% (0.47%), n.a.(a), 10.8% (0.43%)
AUC: 72.8%, 76.1%, 76.3%, 67.1%
(a) Boosted trees do not have a simple expression to map scores to probabilities, so we do not report a pseudo R² for RUSBoost. As further analysis, we mapped the RUSBoost score to a pseudo R² by running a spline regression of restatement occurrence on scores and setting the minimum (maximum) probability equal to 0.01 (0.99). We set those values to obtain an upper bound on the implied pseudo R², as they appear to maximize the R² out of sample. This approach returns a pseudo R² of 7.1%, lower than GBRT's. This mapping confirms that RUSBoost, a classification algorithm not meant to estimate conditional expectations, does worse than a regression model in fitting expectations.
Panel A: Detection rate for the top 1/3 predicted probabilities of firm-years (GBRT, Random Forest, RUSBoost, Backward Logistic)
Catch rate: 64.3%, 70.8%, 72.2%, 55.4%
Catch number: 275, 303, 309, 237
Panel B: Detection rate for the top 1/3 predicted probabilities of firm-years, excluding firms with a restatement in the training dataset
Catch rate: 55.8%, 61.0%, 61.6%, 54.7%
Catch number: 96, 105, 106, 94
Panel C: Restatement income effect for the top 1/3 predicted probabilities of firm-years
(Model: number caught, catch rate, total)
GBRT: 29, 78.4%, 37
Random Forest: 35, 94.6%, 37
RUSBoost: 33, 89.2%, 37
Backward Logistic: 23, 62.2%, 37
Table 16 Top 10 variables (bold: same variable across panels; italics: same variable in at least two models)

Top 10 variables
Depth 3, frequency 0.005: % soft assets ≤ 0.513 & Qualified opinion (controls) = 1 & Leverage ≤ 0.017 [40.89%, 24.17%]
Depth 5, frequency 0.006: 0.011 < % soft assets ≤ 0.602 & Bid-ask spread > 0.005 & Qualified opinion (controls) = 1 & Pct. rank of audit fee ≤ 0.987 [37.50%, 23.44%]
Depth 3, frequency 0.051: % soft assets > 0.088 & Non-audit fee / total fee > 0.323 & Chg. in operating lease > 0.008 [16.30%, 13.64%]
Depth 5, frequency 0.047: Non-audit fee / total fee > 0.307 & Qualified opinion (controls) = 0 & Chg. in operating lease > −0.018 & Short interest rate ≤ 0 & Non-audit fee > 11.833 [16.86%, 14.02%]
Depth 3, frequency 0.104: Non-audit fee / total fee > 0.401 & Qualified opinion (controls) = 0 & Stock return volatility > 0.024 [14.54%, 12.43%]
Depth 5, frequency 0.104: % soft assets ≤ 0.886 & Non-audit fee / total fee > 0.399 & Qualified opinion (controls) = 0 & Chg. in operating lease ≤ 0.033 & Stock return volatility ≤ 0.053 [11.11%, 9.87%]
Depth 3, frequency 0.199: Non-audit fee / total fee > 0.249 & Chg. in operating lease ≤ 0.008 & Stock return volatility > 0.02 [10.20%, 9.16%]
Depth 5, frequency 0.196: Non-audit fee / total fee > 0.247 & Qualified opinion (controls) = 0 & Chg. in operating lease ≤ 0.032 & Stock return volatility > 0.02 & Pct. rank of audit fee > 0.401 [10.93%, 9.74%]

All variables
Depth 3, frequency 0.006: Earnings to price > −0.57 & Analyst forecast errors > 0.025 & Qualified opinion (controls) = 1 [38.06%, 23.57%]
Depth 5, frequency 0.009: Short interest rate ≤ 0 & Qualified opinion (controls) = 0 & Total audit fees > 13.191 & % Board insiders > 0.087 & Computers = 1 [39.36%, 23.87%]
Depth 3, frequency 0.05: Operating lease = 1 & % appointed outsiders > 0.095 & Computers = 1 [19.93%, 15.96%]
Depth 5, frequency 0.045: % soft assets ≤ 0.874 & Expected return on pension plan assets ≤ 0.065 & Pct. rank of audit fee > 0.603 & % appointed outsiders > 0.095 & Comp. committee size ≤ 2 [20.02%, 16.01%]
Depth 3, frequency 0.098: Stock return volatility > 0.023 & Pct. rank of audit fee > 0.663 & % appointed outsiders > 0.226 [15.69%, 13.23%]
Depth 5, frequency 0.105: Short interest rate ≤ 0 & Credit rating ≤ 17.5 & Stock return volatility > 0.028 & Analyst forecast error ≤ 0.361 & Past restatement announcement (2y) = 0 [14.30%, 12.26%]
Depth 3, frequency 0.199: Financing cash flow > 0.006 & Earnings to price ≤ 0.041 & Total audit fees ≤ 13.878 [10.09%, 9.07%]
Depth 5, frequency 0.196: Change in cash sales ≤ 0.42 & Qualified opinion (controls) = 0 & Pct. rank of audit fee > 0.473 & Missing opinion (controls) = 1 & Pharmaceuticals = 0 [12.15%, 10.67%]
Table 21 Example: a sample of 16 observations, nine of which have misstatements, with two observable explanatory variables, E/P and Soft

Note that, in this constructed data set, the pattern that ties the variables to misstatements is visually self-evident. Plotting the data in Fig. 7, it is clear that misstatements have a monotonic relationship with the two variables. But this relationship is not linear, and there is an interaction between the two variables such that the cutoff for soft assets does not change at low levels of earnings to price.
[Fig. 7 Scatter plot of the 16 observations over EP (0.01 to 0.11) and Soft (0 to 0.9)]
Our objective will be to show how machine learning can uncover the pattern that we have ourselves "learned" from visual inspection of the graph of observations.
Following the pseudo code, we first calculate the sample mean as $\bar{y} = 0.125$ and initialize our first predictor at a predicted probability for the whole sample given by
$$\hat{y}_i^0 = \frac{1}{2}\log\frac{1 + 0.125}{1 - 0.125} = 0.125657.$$
We then compute the pseudo-residuals
$$z_i^1 = L_2(y_i, 0.125657) = \frac{2y_i}{1 + \exp(2y_i \hat{y}_i^0)},$$
as calculated in column 4 of Table 21. The next step is to fit the regression tree to $z_i^1$, which involves finding one variable and one cutoff, such that all observations where the variable is on one side of the cutoff are given the same prediction. Of note, the regression tree does not use the same loss function as the boosting step, because the fitted $z_i^1$ are derivatives that need not be bounded. In particular, the loss function used for this step is simply the quadratic loss function $L_q$, and the best cutoff minimizes this loss.
In this example, there are two variables, and each has four values, so we could choose to place the cutoff on one of two variables and at three locations per variable, for six possible choices of the cutoff. For each possible cutoff, we compute a prediction $\bar{z}^1$ as the average for all observations on one side of the cutoff and sum the loss function $L_q(z_i^1, \bar{z}^1)$ over observations. Table 22 provides the loss function for every possible cutoff. Here, the best cutoff is to partition as a function of soft assets greater than or equal to 0.4, with 4 observations below this cutoff and 12 above.
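To make the mechanics concrete, a short script reproduces this first boosting step on made-up data in the spirit of Table 21; because the table's exact values are not reproduced here, the selected split will differ from the one above.

```python
# First boosting step for a binary outcome coded in {-1, 1}: initialize the
# score from the sample mean, compute pseudo-residuals, then scan every
# candidate cutoff for the split minimizing within-region squared loss.
import numpy as np

y = np.array([1] * 9 + [-1] * 7)               # 9 misstatements out of 16
ep = np.tile([0.01, 0.03, 0.07, 0.11], 4)      # hypothetical E/P values
soft = np.repeat([0.2, 0.4, 0.6, 0.8], 4)      # hypothetical soft assets
X = np.column_stack([ep, soft])

f0 = 0.5 * np.log((1 + y.mean()) / (1 - y.mean()))   # = 0.125657
z = 2 * y / (1 + np.exp(2 * y * f0))                 # pseudo-residuals

best = min(
    ((v, c) for v in range(2) for c in np.unique(X[:, v])[1:]),
    key=lambda vc: sum(
        ((z[side] - z[side].mean()) ** 2).sum()
        for side in (X[:, vc[0]] < vc[1], X[:, vc[0]] >= vc[1])
    ),
)
print("split on variable", best[0], "at cutoff", best[1])
```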
Given that the number of leaves is set equal to three, we need to apply a second (and final) cutoff to obtain three end leaves. We proceed similarly in Table 23, except that there are now two choices as to whether to apply the cutoff on the left or right side of Soft ≥ 0.4, and then five choices of cutoff over the two variables if we place it on the right node versus three if we place it on the left node. Note that, for the left node, all outcome variables have the same value, equal to zero, so that a cutoff there does not increase explanatory power; hence we know, for this case, that the best cutoff must be on the right node.
[Table 23 columns: node, threshold, left mean $\bar{z}_i^1$, right mean $\bar{z}_i^1$, sum of losses]
Note that a variable that has been used in a prior cut can be used again in later steps, thus creating potentially complex interactions or nonlinearities. For this example, the best cutoff point is located at EP ≥ 0.07, which leads to a partition of the sample into three regions, as illustrated in Fig. 8.
Having completed the first tree, we compute an adjustment to our initial prediction from step 2.2, that is,
$$\hat{y}_i^m = \hat{y}_i^{m-1} + \phi(x_i; S),$$
where $R^1(x_i)$ indicates the set of observations classified in the same region by the tree. Column 5 in Table 21 provides an updated estimate after the first tree. Note that these values refer to the term inside the logistic transformation. To make a prediction, we need to convert this term into a probability (column 6) using a standard logistic transformation.17
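For reference, under Friedman's (2001) binomial deviance with outcomes coded $y_i \in \{-1, 1\}$ (our reading of the setup above), the score maps to a probability as
$$p_i = \frac{1}{1 + \exp(-2\hat{y}_i)},$$
which is consistent with the initialization $\hat{y}_i^0 = \frac{1}{2}\log\frac{1+\bar{y}}{1-\bar{y}}$ used in the first step.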
[Fig. 8 First tree: the 16 observations split into Soft < 0.4 (4 observations, square loss = 0) and Soft ≥ 0.4 (12 observations, square loss = 9); the latter region splits into EP < 0.07 (6 observations) and EP ≥ 0.07 (6 observations)]
[Fig. 9 Second tree: the 16 observations split into Soft < 0.8 (12 observations) and Soft ≥ 0.8 (4 observations); the former region splits into EP < 0.09 (9 observations) and EP ≥ 0.09 (3 observations)]
Since we have set the number of trees to two, we need to repeat these steps a second time, using an updated set of residuals $z_i^2$.
Skipping the calculations (which are conceptually identical), the second tree, illustrated in Fig. 9, is given by cutoffs on both high E/P and soft assets, which unsurprisingly correspond to the area of predictors partly misclassified by the first tree.
After computing the corresponding $\phi(x_i; S^1)$, where $S^1 = (z_i^2, x_i)$ is the sample for the second tree, and mapping back to an updated value for $\hat{y}_i^2$ and the implied probability, we compute in column 8 of Table 21 the resulting prediction. As a result of the two trees, the data are now classified into seven regions.
To further illustrate the tree component of the analysis, we estimate in Fig. 10 a single regression tree by partitioning the probability of misstatements into 10 nodes. The probability of misstatements across nodes varies from 2% to 50%. In the first branch of the tree, the most important variable is whether the auditor has issued a qualified opinion. As expected, qualified opinions tend to be more frequent in the presence of an ongoing misstatement. Conditional on a qualified opinion, firms with higher book-to-market and higher deferred tax expenses tend to be the most likely to misstate, suggesting disagreements about deferred taxes.
17 This result coincides with the Stata package Boost, with code boost Res EP Soft , distribution(logistic)
[Fig. 10 Single regression tree: splits on % outsiders appointed (< 39.2% / ≥ 39.2%), stock return volatility (< 2.4% / ≥ 2.4%), % board insiders (< 11.8% / ≥ 11.8%), and deferred tax expense (< 2.7% / ≥ 2.7%)]
References
Abbasi, A., Albrecht, C., Vance, A., Hansen, J. (2012). Metafraud: a meta-learning framework for
detecting financial fraud. MIS Quarterly, 36(4), 1293–1327.
Avramov, D., Chordia, T., Jostova, G., Philipov, A. (2009). Credit ratings and the cross-section of stock
returns. Journal of Financial Markets, 12(3), 469–499.
Bao, Y., Ke, B., Li, B., Yu, Y.J., Zhang, J. (2020). Detecting accounting fraud in publicly traded US firms using a machine learning approach. Journal of Accounting Research, 58(1), 199–235.
Barton, J., & Simko, P.J. (2002). The balance sheet as an earnings management constraint. The Accounting
Review, 77(s-1), 1–27.
Beneish, M.D. (1999). The detection of earnings manipulation. Financial Analysts Journal, 55(5), 24–36.
Bertomeu, J., & Marinovic, I. (2015). A theory of hard and soft information. The Accounting Review, 91(1), 1–20.
Blackburne, T., Kepler, J., Quinn, P., Taylor, D. (2020). Undisclosed SEC investigations. Management Science, forthcoming.
Cheffers, M., Whalen, D., Usvyatsky, O. (2010). 2009 financial restatements: A nine year comparison.
Audit Analytics Sales (February).
Cheynel, E., & Levine, C. (2020). Public disclosures and information asymmetry: A theory of the mosaic.
The Accounting Review, 95(1), 79–99.
Dechow, P.M., & Dichev, I.D. (2002). The quality of accruals and earnings: The role of accrual estimation
errors. The Accounting Review, 77(s-1), 35–59.
Dechow, P.M., Ge, W., Larson, C.R., Sloan, R.G. (2011). Predicting material accounting misstatements.
Contemporary Accounting Research, 28(1), 17–82.
DeFond, M.L., Raghunandan, K., Subramanyam, K.R. (2002). Do non–audit service fees impair auditor
independence? evidence from going concern audit opinions. Journal of Accounting Research, 40(4),
1247–1274.
Deng, H. (2018). Interpreting tree ensembles with inTrees. International Journal of Data Science and Analytics, pp 1–11.
Ding, K., Lev, B., Peng, X., Sun, T., Vasarhelyi, M.A. (2020). Machine learning improves accounting
estimates. Review of Accounting Studies, pp 1–37.
Dutta, I., Dutta, S., Raahemi, B. (2017). Detecting financial restatements using data mining techniques.
Expert Systems with Applications, 90, 374–393.
Ettredge, M.L., Sun, L., Lee, P., Anandarajan, A.A. (2008). Is earnings fraud associated with high deferred
tax and/or book minus tax levels?. Auditing: A Journal of Practice & Theory, 27(1), 1–33.
Fanning, K.M., & Cogger, K.O. (1998). Neural network detection of management fraud using published
financial data. Intelligent Systems in Accounting, Finance & Management, 7(1), 21–41.
Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861–874.
Frankel, R.M., Johnson, M.F., Nelson, K.K. (2002). The relation between auditors’ fees for nonaudit
services and earnings management. The Accounting Review, 77(s-1), 71–105.
Friedman, J.H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5), 1189–1232.
Friedman, J., Hastie, T., Tibshirani, R. (2001). The Elements of Statistical Learning Vol. 1. New York:
Springer series in statistics.
Garfinkel, J.A. (2009). Measuring investors’ opinion divergence. Journal of Accounting Research, 47(5),
1317–1348.
Glosten, L.R., & Milgrom, P.R. (1985). Bid, ask and transaction prices in a specialist market with
heterogeneously informed traders. Journal of Financial Economics, 14(1), 71–100.
Green, B.P., & Choi, J.H. (1997). Assessing the risk of management fraud through neural network technology. Auditing: A Journal of Practice & Theory, 16, 14–28.
Guelman, L. (2012). Gradient boosting trees for auto insurance loss cost modeling and prediction. Expert
Systems with Applications, 39(3), 3659–3667.
Gupta, R., & Gill, N.S. (2012). A solution for preventing fraudulent financial reporting using descriptive
data mining techniques. International Journal of Computer Applications.
Hribar, P., Kravet, T., Wilson, R. (2014). A new measure of accounting quality. Review of Accounting Studies, 19(1), 506–538.
Johnson, V.E., Khurana, I.K., Kenneth Reynolds, J. (2002). Audit-firm tenure and the quality of financial
reports. Contemporary Accounting Research, 19(4), 637–660.
Kasznik, R. (1999). On the association between voluntary disclosure and earnings management. Journal
of Accounting Research, 37(1), 57–81.
Kim, Y.J., Baik, B., Cho, S. (2016). Detecting financial misstatements with fraud intention using multi-
class cost-sensitive learning. Expert Systems with Applications, 62, 32–43.
Kleinberg, J., Lakkaraju, H., Leskovec, J., Ludwig, J., Mullainathan, S. (2017). Human decisions and
machine predictions. The Quarterly Journal of Economics, 133(1), 237–293.
Kornish, L.J., & Levine, C.B. (2004). Discipline with common agency: The case of audit and nonaudit
services. The Accounting Review, 79(1), 173–200.
Larcker, D.F., Richardson, S.A., Tuna, I. (2007). Corporate governance, accounting outcomes, and organizational performance. The Accounting Review, 82(4), 963–1008.
Laux, V., & Newman, P.D. (2010). Auditor liability and client acceptance decisions. The Accounting
Review, 85(1), 261–285.
Lin, J.W., Hwang, M.I., Becker, J.D. (2003). A fuzzy neural network for assessing the risk of fraudulent financial reporting. Managerial Auditing Journal, 18(8), 657–665.
Lobo, G.J., & Zhao, Y. (2013). Relation between audit effort and financial report misstatements: Evidence
from quarterly and annual restatements. The Accounting Review, 88(4), 1385–1412.
Perols, J. (2011). Financial statement fraud detection: An analysis of statistical and machine learning
algorithms. Auditing: A Journal of Practice & Theory, 30(2), 19–50.
Perols, J.L., Bowen, R.M., Zimmermann, C., Samba, B. (2016). Finding needles in a haystack: Using data
analytics to improve fraud prediction. The Accounting Review, 92(2), 221–245.
Ragothaman, S., & Lavin, A. (2008). Restatements due to improper revenue recognition: a neural networks
perspective. Journal of Emerging Technologies in Accounting, 5(1), 129–142.
Romanus, R.N., Maher, J.J., Fleming, D.M. (2008). Auditor industry specialization, auditor changes, and
accounting restatements. Accounting Horizons, 22(4), 389–413.
Samuels, D., Taylor, D.J., Verrecchia, R.E. (2018). Financial misreporting: Hiding in the shadows or in plain sight? Working paper.
van Rijsbergen, C.J. (2004). The geometry of information retrieval. Cambridge University Press.
Whiting, D.G., Hansen, J.V., McDonald, J.B., Albrecht, C., Steve Albrecht, W. (2012). Machine learning
methods for detecting patterns of management fraud. Computational Intelligence, 28(4), 505–527.
Zhang, Y., & Haghani, A. (2015). A gradient boosting method to improve travel time prediction.
Transportation Research Part C: Emerging Technologies, 58, 308–324.
Affiliations
Edwige Cheynel
echeynel@wustl.edu
Eric Floyd
ejfloyd@ucsd.edu
Wenqiang Pan
wenqiang.pan1994@gmail.com
1 Olin Business School, Washington University, Saint Louis, MO, USA
2 Rady School of Management, University of California San Diego, La Jolla, CA, USA
3 Columbia Business School, Columbia University, New York City, NY, USA