Data Science Article
Abstract
In Data Science and Data Analysis, as in most technical or quantitative fields of inquiry, there is an
important distinction between understanding the theoretical underpinnings of the methods and knowing
how and when to best apply them to practical situations.
The successful transition from clean pedagogical toy examples to messy situations can be complicated
by a misunderstanding of what a useful and insightful solution looks like in a non-academic context.
In this report, we provide examples of data analysis and quantitative methods applied to “real-life”
problems. We emphasize qualitative aspects of the projects as well as significant results and conclusions,
rather than explain the algorithms or focus on theoretical matters.
Keywords
case studies, data science, machine learning, data analysis, statistical analysis, quantitative methods
1 Centre for Quantitative Analysis and Decision Support, Carleton University, Ottawa
2 Department of Mathematics and Statistics, University of Ottawa, Ottawa
3 Idlewyld Analytics and Consulting Services, Wakefield, Canada
Email: patrick.boily@carleton.ca
Contents

1. Classification: Tax Audits
2. Sentiment Analysis: BOTUS and Trump & Dump
3. Clustering: The Livehoods Project
4. Association Rules: Danish Medical Data

The case studies were selected primarily to showcase a wide breadth of analytical methods; they are not meant to represent a complete picture of the data analysis landscape. In some instances, the results were published in peer-reviewed journals or presented at conferences. In each case, we provide the:

objective;
methodology;
advantages or disadvantages of specific methods;
procedures and results;
evaluation and validation;
project summary, and
challenges and pitfalls, etc.
1. Classification: Tax Audits

Large gaps between revenue owed (in theory) and revenue collected (in practice) are problematic for governments. Revenue agencies implement various fraud detection strategies (such as audit reviews) to bridge that gap. Since business audits are rather costly, there is a definite need for algorithms that can predict whether an audit is likely to be successful or a waste of resources.

Title Data Mining Based Tax Audit Selection: A Case Study of a Pilot Project at the Minnesota Department of Revenue [1]

Authors Kuo-Wei Hsu, Nishith Pathak, Jaideep Srivastava, Greg Tschida, Eric Bjorklund

Date 2015

Sponsor Minnesota Department of Revenue (DOR)

Methods classification, data mining

Objective The U.S. Internal Revenue Service (IRS) estimated that there were huge gaps between revenue owed and revenue collected for 2001 and for 2006. The project's goals were to increase efficiency in the audit selection process and to reduce the gap between revenue owed and revenue collected.

Methodology

1. Data selection and separation: experts selected several hundred cases to audit and divided them into training, testing, and validation sets.
2. Classification modeling using MultiBoosting, Naïve Bayes, C4.5 decision trees, multilayer perceptrons, support vector machines, etc.
3. Evaluation of all models by testing on the testing set. Models originally performed poorly on the testing set until it was realized that the size of the business being audited had an effect on the model accuracy: the task was split in two parts, to model large businesses and smaller businesses separately.
4. Model selection and validation by comparing the estimated accuracy of the different classification model predictions against the actual field audits. Ultimately, MultiBoosting with Naïve Bayes was selected as the final model; the combination also suggested some improvements to increase audit efficiency.

Data The data consisted of selected tax audit cases from 2004 to 2007, collected by the audit experts, which were split into training, testing, and validation sets:

the training data set consisted of Audit Plan General (APGEN) Use Tax audits and their results for the years 2004-2006;
the testing data consisted of APGEN Use Tax audits conducted in 2007 and was used to test or evaluate models (for Large and Smaller businesses) built on the training dataset,
while validation was assessed by actually conducting field audits on predictions made by models built on 2007 Use Tax return data processed in 2008.

None of the sets had records in common (see Figure 1).

Strengths and Limitations of Algorithms

The Naïve Bayes classification scheme assumes independence of the features, which rarely occurs in real-world situations. Furthermore, this approach tends to introduce bias to classification schemes. In spite of this, classification models built using Naïve Bayes have a successful track record.
MultiBoosting is an ensemble technique that forms a committee (i.e. a group of classification models) and uses group wisdom to make a prediction; it differs from other ensemble techniques in that it forms a committee of sub-committees (i.e. a group of groups of classification models), which tends to reduce both the bias and the variance of predictions.

Procedures Classification schemes need a response variable for prediction: audits which yielded more than $500 per year in revenues during the audit period were Good; the others were Bad. The various models were tested and evaluated by comparing the performances of the manual audits (which yield the actual revenue) and the classification models (the predicted classification).

The procedure for manual audit selection in the early stages of the study required:

1. Department of Revenue (DOR) experts selecting several thousand potential cases through a query;
2. DOR experts further selecting several hundred of these cases to audit;
3. DOR auditors actually auditing the cases, and
4. calculating audit accuracy and return on investment (ROI) using the audit results.

Once the ROIs were available, data mining started in earnest. The steps involved were:

1. Splitting the data into training, testing, and validation sets.
2. Cleaning the training data by removing inadequate cases.
3. Building (and revising) classification models on the training dataset. The first iteration of this step introduced a separation of models for larger businesses and relatively smaller businesses according to their average annual withholding amounts (the threshold value that was used is not revealed in [1]).
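The modeling step can be illustrated with a minimal Naïve Bayes classifier for the Good/Bad response variable, written in pure Python. The features and training records below are entirely hypothetical (the study's actual features were selected with domain experts, and MultiBoosting would be layered on top of such base classifiers); this is only a sketch of the technique.

```python
from collections import Counter, defaultdict

# Hypothetical audit records: (features, label). Labels follow the study's
# response variable: audits yielding > $500/year are "Good", the rest "Bad".
train = [
    ({"size": "large", "sector": "retail"}, "Good"),
    ({"size": "large", "sector": "services"}, "Good"),
    ({"size": "small", "sector": "retail"}, "Bad"),
    ({"size": "small", "sector": "services"}, "Bad"),
    ({"size": "large", "sector": "retail"}, "Good"),
    ({"size": "small", "sector": "retail"}, "Good"),
]

def fit(train):
    """Estimate class priors and per-feature value counts for each class."""
    priors = Counter(label for _, label in train)
    counts = defaultdict(Counter)  # (label, feature) -> Counter of values
    for feats, label in train:
        for f, v in feats.items():
            counts[(label, f)][v] += 1
    return priors, counts

def predict(priors, counts, feats, alpha=1.0):
    """Naïve Bayes score: P(label) * prod_f P(value_f | label), with Laplace
    smoothing (each feature here takes 2 possible values, hence alpha * 2)."""
    scores = {}
    total = sum(priors.values())
    for label, n in priors.items():
        score = n / total
        for f, v in feats.items():
            c = counts[(label, f)]
            score *= (c[v] + alpha) / (sum(c.values()) + alpha * 2)
        scores[label] = score
    return max(scores, key=scores.get)

priors, counts = fit(train)
print(predict(priors, counts, {"size": "large", "sector": "services"}))
```

The conditional-independence assumption discussed above is visible in the product over features: each feature contributes its likelihood separately, regardless of the other features.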
Figure 1. Data sources for APGEN mining [1]. Note the 6 final sets which feed the Data Analysis component.
Figure 2. The feature selection process [1]. Note the involvement of domain experts.
4. Selecting separate modeling features for the APGEN Large and Small training sets. The feature selection process is shown in Figure 2.
5. Building classification models on the training dataset for the two separate classes of business (using C4.5, Naïve Bayes, multilayer perceptrons, support vector machines, etc.), and assessing the classifiers using precision and recall with improved estimated ROI:

Efficiency = ROI = (Total revenue generated) / (Total collection cost). (1)

Results, Evaluation and Validation The models that were eventually selected were combinations of MultiBoosting and Naïve Bayes (C4.5 produced interpretable results, but its performance was shaky).

For APGEN Large (2007), experts had put forward 878 cases for audit (495 of which proved successful), while the classification model suggested 534 audits (386 of which proved successful). The theoretical best process would find 495 successful audits in 495 audits performed, while the manual audit selection process needed 878 audits in order to reach the same number of successful audits.

For APGEN Small (2007), 473 cases were recommended for audit by experts (only 99 of which proved successful); in contrast, 47 out of the 140 cases selected by the classification model were successful. The theoretical best process would find 99 successful audits in 99 audits performed, while the manual audit selection process needed 473 audits in order to reach the same number of successful audits.

In both cases, the classification model improves on the manual audit process: roughly 685 data mining audits would be required to reach 495 successful audits of APGEN Large (2007), and 295 would be required to reach 99 successful audits for APGEN Small (2007), as can be seen in Figure 3.

Figure 3. Audit resource deployment efficiency [1]. Top: APGEN Large (2007). Bottom: APGEN Small (2007). In both cases, the Data Mining approach was more efficient (the slope of the Data Mining vector is "closer" to the Theoretical Best vector than is the Manual Audit vector).
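The audit-efficiency comparison can be reproduced with a few lines of Python: scaling each method's observed success rate to the target number of successful audits recovers the approximate 685 and 295 figures quoted in the text.

```python
def audits_needed(target_successes, successes, audits):
    """Expected number of audits needed to reach the target number of
    successful audits, assuming the observed success rate holds."""
    success_rate = successes / audits
    return target_successes / success_rate

# APGEN Large (2007): the model found 386 successes in 534 suggested audits;
# the manual process needed 878 audits for 495 successes.
large = audits_needed(495, successes=386, audits=534)

# APGEN Small (2007): 47 successes in 140 suggested audits, vs 473 manual
# audits for 99 successes.
small = audits_needed(99, successes=47, audits=140)

print(round(large), round(small))  # roughly 685 and 295
```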
Table 1 presents the confusion matrices for the classification model on both the APGEN Large (2007) and APGEN Small (2007) data. Columns and rows represent predicted and actual results, respectively. The revenue R and collection cost C entries can be read as follows: the 47 successful audits which were correctly identified by the model for APGEN Small (2007) correspond to cases consuming 9.9% of collection costs but generating 42.5% of the revenues. Similarly, the 281 bad audits correctly predicted by the model represent notable collection cost savings: they are associated with 59.4% of collection costs but generate only 11.1% of the revenues.

Once the testing phase of the study was completed, the DOR validated the data mining-based approach by using the models to select cases for actual field audits in a real audit project. The prior success rate of audits for APGEN Use Tax data was 39%, while the model was predicting a success rate of 56%; the actual field success rate was 51%.

Take-Aways A substantial number of models were churned out before the team made a final selection. Past performance of a specific model family in a previous project can be used as a guide, but it provides no guarantee regarding its performance on the current data – remember the No Free Lunch (NFL) Theorem [2]: nothing works best all the time!

There is a definite iterative feel to this project: the feature selection process could very well require a number of visits to domain experts before the feature set yields promising results. This is a valuable reminder that the data analysis team should seek out individuals with a good understanding of both the data and the context. Another consequence of the NFL theorem is that domain-specific knowledge has to be integrated in the model in order to beat random classifiers, on average [3].

Finally, this project provides an excellent illustration that even slight improvements over the current approach can find a useful place in an organization – data science is not solely about Big Data and disruption!

Table 1. Confusion matrices for audit evaluation [1]. Top: APGEN Large (2007). Bottom: APGEN Small (2007). R stands for revenues, C for collection costs.
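The counts quoted in the text fully determine the APGEN Small (2007) confusion matrix (140 cases predicted Good, 99 cases actually Good, 473 cases in total), from which the precision and recall used to assess the classifiers follow directly:

```python
# APGEN Small (2007) confusion matrix, reconstructed from the counts above.
tp = 47                  # predicted Good, actually Good
fp = 140 - tp            # predicted Good, actually Bad -> 93
fn = 99 - tp             # predicted Bad, actually Good -> 52
tn = 473 - tp - fp - fn  # predicted Bad, actually Bad

precision = tp / (tp + fp)  # fraction of recommended audits that paid off
recall = tp / (tp + fn)     # fraction of the good audits the model recovered

print(f"tn = {tn}, precision = {precision:.3f}, recall = {recall:.3f}")
```

Note that the reconstruction is internally consistent: the 281 correctly predicted bad audits mentioned in the text come out as the true-negative count.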
In 2013, the BBC reported on various ways in which social media giant Twitter was changing the world, detailing specific instances in the fields of business, politics, journalism, sports, entertainment, activism, arts, and law [9].

Authors Tradeworx (BOTUS), T3 (Trump & Dump)

Date 2017

Sponsor NPR's podcast Planet Money (BOTUS)
The reasons are varied (see Figures 6 and 7), but the most
important setback was that @realDonaldTrump had not
made a single valid tweet about a public company whose
stock BOTUS could trade during the stock market business
hours. Undeterred, Planet Money relaxed its trading strat-
egy: if @realDonaldTrump tweets during off-hours, BOTUS
will short the stock at the market’s opening bell.
This is a risky approach, and so far it has not proven very effective: a single trade of Facebook stock, on August 23rd, resulted in a loss of $0.30 (see Figure 7).
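The relaxed strategy amounts to a simple decision rule, sketched below in Python. The rule, thresholds, and helper are illustrative assumptions only; the actual bots relied on proper sentiment models and brokerage integrations not described here.

```python
from datetime import time

MARKET_OPEN, MARKET_CLOSE = time(9, 30), time(16, 0)

def trade_decision(tweet_time, mentions_public_company, sentiment):
    """Relaxed BOTUS-style rule: act immediately during market hours; if the
    tweet arrives off-hours, queue the trade for the opening bell instead of
    skipping it. A negative sentiment score triggers a short."""
    if not mentions_public_company:
        return "no trade"
    action = "short" if sentiment < 0 else "buy"
    if MARKET_OPEN <= tweet_time <= MARKET_CLOSE:
        return f"{action} now"
    return f"{action} at opening bell"

print(trade_decision(time(6, 45), True, -0.8))  # early-morning negative tweet
```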
Take-Aways As a text analysis and scenario analysis project,
both BOTUS and Trump & Dump are successful – they
present well-executed sentiment analyses, and a simulation
process that finds an optimal trading strategy. As predictive
tools, they are sub-par (as far as we can tell), but for rea-
sons that (seem to) have little to do with data analysis per se.
Figure 8. Some livehoods in metropolitan Pittsburgh, PA: in Shadyside/East Liberty, Lawrenceville/Polish Hill, and South
Side. Municipal borders are shown in black.
Strengths and Limitations of the Approach

The technique used in this study is agnostic towards the particular source of the data: it is not dependent on meta-knowledge about the data.
The algorithm may be prone to "majority" bias, consequently misrepresenting or hiding minority behaviours.
The data are based on a limited sample of check-ins shared on Twitter and are therefore biased towards the types of places that people typically want to share publicly.
Tuning the clusters is non-trivial: experimenter bias may combine with "confirmation bias" of the interviewees in the validation stage.

Procedures The Livehoods project uses a spectral clustering model to provide structure for local urban areas (UAs), grouping close Foursquare venues into clusters based on both the spatial proximity between venues and the social proximity derived from the distribution of people that check in to them.

The guiding principle of the model is that the "character" of an UA is defined both by the types of venues it contains and by the people who frequent them as part of their daily activities. These clusters are referred to as Livehoods, by analogy with more traditional neighbourhoods.

Let V be a list of Foursquare venues, A the associated affinity matrix representing a measure of similarity between each pair of venues, and G(A) the graph obtained from A by linking each venue to its m nearest neighbours. Spectral clustering is implemented by the following algorithm:

1. Compute the diagonal degree matrix D_ii = Σ_j A_ij;
2. Set the Laplacian matrix L = D − A and L_norm = D^(−1/2) L D^(−1/2);
3. Find the k smallest eigenvalues of L_norm, where k is the index which provides the biggest jump in successive eigenvalues of L_norm, in increasing order;
4. Find the eigenvectors e_1, ..., e_k of L corresponding to the k smallest eigenvalues;
5. Construct the matrix E with the eigenvectors e_1, ..., e_k as columns;
6. Denote the rows of E by y_1, ..., y_n, and cluster them into k clusters C_1, ..., C_k using k-means. This will induce a clustering A_1, ..., A_k defined by A_i = {j | y_j ∈ C_i}.
7. For each A_i, let G(A_i) be the subgraph of G(A) induced by the vertices in A_i. Split G(A_i) into connected components. Add each component as a new cluster to the list of clusters, and remove the subgraph G(A_i) from the list.
8. Let b be the area of the bounding box containing the coordinates of the set of venues V, and b_i be the area of the box containing A_i. If b_i/b > τ, delete cluster A_i, and redistribute each of its venues v ∈ A_i to the closest A_j under the distance measurement.

Results, Evaluation and Validation The parameters used for the clustering were m = 10, k_min = 30, k_max = 45, and τ = 0.4. The results for three areas of the city are shown in Figure 8. In total, 9 livehoods have been identified and validated by 27 Pittsburgh residents (see Figure 8; the original report has more information on the interview process).

Municipal Neighborhood Borders: livehoods are dynamic, and evolve as people's behaviours change, unlike the fixed neighbourhood borders set by the city government.
Demographics: the interviews displayed strong evidence that the demographics of the residents and visitors of an area often play a strong role in explaining the divisions between livehoods.
Development and Resources: economic development can affect the character of an area. Similarly, the resources (or lack thereof) provided by a region have a strong influence on the people that visit it, and hence on its resulting character. This is assumed to be reflected in the livehoods.
Geography and Architecture: the movement of people through a certain area is presumably shaped by its geography and architecture; livehoods can reveal this influence and the effects it has on visiting patterns.

Take-Away k-means is not the sole clustering algorithm in applications!

References

[1] Chen, Z., Caverlee, J., Lee, K., Su, D.Z. [2011], Exploring millions of footprints in location sharing services, ICWSM.
[2] Cranshaw, J., Schwartz, R., Hong, J.I., Sadeh, N. [2012], The Livehoods Project: Utilizing Social Media to Understand the Dynamics of a City, International AAAI Conference on Weblogs and Social Media, p.58.

4. Association Rules: Danish Medical Data

Title Temporal disease trajectories condensed from population-wide registry data covering 6.2 million patients

Authors Anders Boeck Jensen, Pope L. Moseley, Tudor I. Oprea, Sabrina Gade Ellesøe, Robert Eriksson, Henriette Schmock, Peter Bjødstrup Jensen, Lars Juhl Jensen, and Søren Brunak

Date 2014

Sponsor Danish National Patient Registry

Methods association rules mining, clustering

Objective Estimating disease progression (trajectories) from the current patient state is a crucial notion in medical studies. Trajectories have so far only been analyzed for a small number of diseases, or using large-scale approaches without consideration for time spans exceeding a few years. Using data from the Danish National Patient Registry (an extensive, long-term data collection effort by Denmark), this study finds connections between different diagnoses, and shows how the presence of a diagnosis at some point in time might allow for the prediction of another diagnosis at a later point in time.

Methodology The following methodological steps were taken:

1. compute the strength of correlation for pairs of diagnoses over a 5-year interval (on a representative subset of the data);
2. test diagnosis pairs for directionality (one diagnosis repeatedly occurring before the other);
3. determine reasonable diagnosis trajectories (thoroughfares) by combining smaller (but frequent) trajectories with overlapping diagnoses;
4. validate the trajectories by comparison with non-Danish data;
5. cluster the thoroughfares to identify a small number of central medical conditions (key diagnoses) around which disease progression is organized.

Data The Danish National Patient Registry is an electronic health registry containing administrative information and diagnoses, covering the whole population of Denmark, including private and public hospital visits of all types: inpatient (overnight stay), outpatient (no overnight stay), and emergency. The data set covers 14.9 years from January '96 to November '10 and consists of 68 million records for 6.2 million patients.

Challenges and Pitfalls

Access to the National Patient Registry is protected and could only be granted after approval by the Danish Data Registration Agency and the National Board of Health.
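The first two methodological steps (pairwise co-occurrence over a 5-year window, then a directionality test) can be sketched on toy patient histories. The diagnosis codes and records below are invented for illustration; the actual study worked with registry-wide ICD diagnoses and statistical tests rather than raw counts.

```python
from collections import Counter
from itertools import combinations

# Toy patient histories: lists of (diagnosis, year of first occurrence).
patients = [
    [("COPD", 2001), ("pneumonia", 2003)],
    [("COPD", 1999), ("pneumonia", 2002)],
    [("pneumonia", 2005), ("COPD", 2006)],
    [("COPD", 2000)],
]

WINDOW = 5  # years; the study correlated diagnosis pairs over a 5-year interval

pair_counts, directed = Counter(), Counter()
for history in patients:
    # Sorting by year makes each pair come out in temporal order (d1 before d2).
    for (d1, t1), (d2, t2) in combinations(sorted(history, key=lambda x: x[1]), 2):
        if t2 - t1 <= WINDOW:
            pair_counts[frozenset((d1, d2))] += 1  # undirected co-occurrence
            directed[(d1, d2)] += 1                # d1 observed before d2

pair = frozenset(("COPD", "pneumonia"))
print(pair_counts[pair], directed[("COPD", "pneumonia")], directed[("pneumonia", "COPD")])
```

A strongly asymmetric directed count (here COPD preceding pneumonia more often than the reverse) is the kind of signal that, after proper significance testing, would make an ordered pair a candidate edge in a disease trajectory.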
Figure 9. The COPD cluster showing five preceding diagnoses leading to COPD and some of the possible outcomes;
Cerebrovascular cluster with epilepsy as key diagnosis.