
DATA SCIENCE REPORT SERIES

Data Analysis Case Studies


Patrick Boily¹,²,³

Abstract
In Data Science and Data Analysis, as in most technical or quantitative fields of inquiry, there is an
important distinction between understanding the theoretical underpinnings of the methods and knowing
how and when to best apply them to practical situations.
The successful transition from clean pedagogical toy examples to messy situations can be complicated
by a misunderstanding of what a useful and insightful solution looks like in a non-academic context.
In this report, we provide examples of data analysis and quantitative methods applied to “real-life”
problems. We emphasize qualitative aspects of the projects as well as significant results and conclusions,
rather than explain the algorithms or focus on theoretical matters.
Keywords
case studies, data science, machine learning, data analysis, statistical analysis, quantitative methods
¹ Centre for Quantitative Analysis and Decision Support, Carleton University, Ottawa
² Department of Mathematics and Statistics, University of Ottawa, Ottawa
³ Idlewyld Analytics and Consulting Services, Wakefield, Canada
Email: patrick.boily@carleton.ca

Contents

1 Classification: Tax Audits
2 Sentiment Analysis: BOTUS and Trump & Dump
3 Clustering: The Livehoods Project
4 Association Rules: Danish Medical Data

The case studies were selected primarily to showcase a wide breadth of analytical methods, and are not meant to represent a complete picture of the data analysis landscape. In some instances, the results were published in peer-reviewed journals or presented at conferences. In each case, we provide the:

project title and citation references;
author(s) and publication date;
sponsors (if there were any), and
methods that were used.

Depending on the case study, some of the following items are also provided:

objective;
methodology;
advantages or disadvantages of specific methods;
procedures and results;
evaluation and validation;
project summary, and
challenges and pitfalls, etc.

Since the various sponsoring organizations have not always allowed the dissemination of specific results (for a variety of reasons), we have opted to follow their lead; when such results are available, the interested reader can consult them in the appropriate publications or presentations.

1. Classification: Tax Audits

Large gaps between revenue owed (in theory) and revenue collected (in practice) are problematic for governments. Revenue agencies implement various fraud detection strategies (such as audit reviews) to bridge that gap. Since business audits are rather costly, there is a definite need for algorithms that can predict whether an audit is likely to be successful or a waste of resources.

Title Data Mining Based Tax Audit Selection: A Case Study of a Pilot Project at the Minnesota Department of Revenue [1]

Authors Kuo-Wei Hsu, Nishith Pathak, Jaideep Srivastava, Greg Tschida, Eric Bjorklund

Date 2015

Sponsor Minnesota Department of Revenue (DOR)

Methods classification, data mining

Objective The U.S. Internal Revenue Service (IRS) estimated that there were huge gaps between revenue owed and revenue collected for 2001 and for 2006. The project’s goals were to increase efficiency in the audit selection process and to reduce the gap between revenue owed and revenue collected.

Methodology

1. Data selection and separation: experts selected several hundred cases to audit and divided them into training, testing and validating sets.
2. Classification modeling using MultiBoosting, Naïve Bayes, C4.5 decision trees, multilayer perceptrons, support vector machines, etc.
3. Evaluation of all models was achieved by testing the model on the testing set. Models originally performed poorly on the testing set until it was realized that the size of the business being audited had an effect on the model accuracy: the task was split in two parts, to model large businesses and smaller businesses separately.
4. Model selection and validation was done by comparing the estimated accuracy between different classification model predictions and the actual field audits. Ultimately, MultiBoosting with Naïve Bayes was selected as the final model; the combination also suggested some improvements to increase audit efficiency.

Data The data consisted of selected tax audit cases from 2004 to 2007, collected by the audit experts, which were split into training, testing and validation sets:

the training data set consisted of Audit Plan General (APGEN) Use Tax audits and their results for the years 2004-2006;
the testing data consisted of APGEN Use Tax audits conducted in 2007 and was used to test or evaluate models (for Large and Smaller businesses) built on the training dataset, and
validation was assessed by actually conducting field audits on predictions made by models built on 2007 Use Tax return data processed in 2008.

None of the sets had records in common (see Figure 1).

Strengths and Limitations of Algorithms

The Naïve Bayes classification scheme assumes independence of the features, which rarely occurs in real-world situations. Furthermore, this approach tends to introduce bias to classification schemes. In spite of this, classification models built using Naïve Bayes have a successful track record.

MultiBoosting is an ensemble technique that uses committees (i.e. groups of classification models) and group wisdom to make a prediction; it differs from other ensemble techniques in that it forms a committee of sub-committees (i.e. a group of groups of classification models), which has a tendency to reduce both the bias and the variance of predictions.

Procedures Classification schemes need a response variable for prediction: audits which yielded more than $500 per year in revenues during the audit period were Good; the others were Bad. The various models were tested and evaluated by comparing the performances of the manual audits (which yield the actual revenue) and the classification models (the predicted classification).

The procedure for manual audit selection in the early stages of the study required:

1. Department of Revenue (DOR) experts selecting several thousand potential cases through a query;
2. DOR experts further selecting several hundreds of these cases to audit;
3. DOR auditors actually auditing the cases, and
4. calculating audit accuracy and return on investment (ROI) using the audit results.

Once the ROIs were available, data mining started in earnest. The steps involved were:

1. Splitting the data into training, testing, and validating sets.
2. Cleaning the training data by removing inadequate cases.
3. Building (and revising) classification models on the training dataset. The first iteration of this step introduced a separation of models for larger businesses and relatively smaller businesses according to their average annual withholding amounts (the threshold value that was used is not revealed in [1]).
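The Naïve Bayes component of the final model can be illustrated with a minimal hand-rolled sketch. The features, counts, and add-one smoothing below are invented for illustration – the actual APGEN features and thresholds are not disclosed in [1], and the production model wraps Naïve Bayes inside MultiBoosting, which is not shown here.

```python
from collections import Counter, defaultdict

# Hypothetical audit records: (business_size, filing_pattern) -> label.
train = [
    (("large", "irregular"), "Good"),
    (("large", "regular"), "Good"),
    (("large", "irregular"), "Good"),
    (("small", "regular"), "Bad"),
    (("small", "regular"), "Bad"),
    (("small", "irregular"), "Good"),
]

def fit_nb(data):
    """Estimate class priors and per-feature value counts."""
    priors = Counter(label for _, label in data)
    likes = defaultdict(Counter)  # (feature_index, label) -> value counts
    for feats, label in data:
        for i, v in enumerate(feats):
            likes[(i, label)][v] += 1
    return priors, likes

def predict_nb(priors, likes, feats):
    """Score each class by prior x product of add-one-smoothed likelihoods."""
    total = sum(priors.values())
    best, best_score = None, -1.0
    for label, count in priors.items():
        score = count / total
        for i, v in enumerate(feats):
            seen = likes[(i, label)]
            score *= (seen[v] + 1) / (sum(seen.values()) + 2)
        if score > best_score:
            best, best_score = label, score
    return best

priors, likes = fit_nb(train)
print(predict_nb(priors, likes, ("large", "irregular")))  # Good
```

The independence assumption mentioned above is visible in `predict_nb`: each feature contributes a separate factor to the score, with no interaction terms.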

P.Boily, 2017 Page 2 of 13



Figure 1. Data sources for APGEN mining [1]. Note the 6 final sets which feed the Data Analysis component.

Figure 2. The feature selection process [1]. Note the involvement of domain experts.

4. Selecting separate modeling features for the APGEN Large and Small training sets. The feature selection process is shown in Figure 2.
5. Building classification models on the training dataset for the two separate classes of business (using C4.5, Naïve Bayes, multilayer perceptrons, support vector machines, etc.), and assessing the classifiers using precision and recall with the improved estimated ROI:

Efficiency = ROI = Total revenue generated / Total collection cost    (1)

Results, Evaluation and Validation The models that were eventually selected were combinations of MultiBoosting and Naïve Bayes (C4.5 produced interpretable results, but its performance was shaky).

For APGEN Large (2007), experts had put forward 878 cases for audit (495 of which proved successful), while the classification model suggested 534 audits (386 of which proved successful). The theoretical best process would find 495 successful audits in 495 audits performed, while the manual audit selection process needed 878 audits in order to reach the same number of successful audits.

For APGEN Small (2007), 473 cases were recommended for audit by experts (only 99 of which proved successful); in contrast, 47 out of the 140 cases selected by the classification model were successful. The theoretical best process would find 99 successful audits in 99 audits performed, while the manual audit selection process needed 473 audits in order to reach the same number of successful audits.

In both cases, the classification model improves on the manual audit process: roughly 685 data mining audits would be required to reach 495 successful audits of APGEN Large (2007), and 295 would be required to reach 99 successful audits for APGEN Small (2007), as can be seen in Figure 3.
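Equation (1) requires collection costs, which are not published; assuming a roughly equal cost per audit, the hit rates implied by the counts above can be compared directly:

```python
# (audits performed, successful audits), from the APGEN 2007 figures above.
results = {
    "APGEN Large": {"manual": (878, 495), "model": (534, 386)},
    "APGEN Small": {"manual": (473, 99), "model": (140, 47)},
}

for dataset, procs in results.items():
    for proc, (performed, successful) in procs.items():
        # Hit rate: successful audits per audit performed -- a proxy for
        # equation (1) under the equal-cost-per-audit assumption.
        rate = successful / performed
        print(f"{dataset} {proc}: {rate:.1%} successful")
```

The model's hit rate dominates the manual one in both strata (roughly 72% vs 56% for Large, 34% vs 21% for Small), which is the slope comparison depicted in Figure 3.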


Figure 3. Audit resource deployment efficiency [1]. Top: APGEN Large (2007). Bottom: APGEN Small (2007). In both cases, the Data Mining approach was more efficient (the slope of the Data Mining vector is “closer” to the Theoretical Best vector than is the Manual Audit vector).

Table 1 presents the confusion matrices for the classification model on both the APGEN Large (2007) and APGEN Small (2007) data. Columns and rows represent predicted and actual results, respectively. The revenue R and collection cost C entries can be read as follows: the 47 successful audits which were correctly identified by the model for APGEN Small (2007) correspond to cases consuming 9.9% of collection costs but generating 42.5% of the revenues. Similarly, the 281 bad audits correctly predicted by the model represent notable collection cost savings: these are associated with 59.4% of collection costs but generate only 11.1% of the revenues.

Once the testing phase of the study was completed, the DOR validated the data mining-based approach by using the models to select cases for actual field audits in a real audit project. The prior success rate of audits for APGEN Use Tax data was 39% while the model was predicting a success rate of 56%; the actual field success rate was 51%.

Take-Aways A substantial number of models were churned out before the team made a final selection. Past performance of a specific model family in a previous project can be used as a guide, but it provides no guarantee regarding its performance on the current data – remember the No Free Lunch (NFL) Theorem [2]: nothing works best all the time!

There is a definite iterative feel to this project: the feature selection process could very well require a number of visits to domain experts before the feature set yields promising results. This is a valuable reminder that the data analysis team should seek out individuals with a good understanding of both the data and the context. Another consequence of the NFL theorem is that domain-specific knowledge has to be integrated in the model in order to beat random classifiers, on average [3].

Finally, this project provides an excellent illustration that even slight improvements over the current approach can find a useful place in an organization – data science is not solely about Big Data and disruption!
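The APGEN Small (2007) counts quoted above pin down the full confusion matrix by subtraction, assuming the model was scored on the same 473 expert-recommended cases; the bookkeeping can be checked in a few lines:

```python
# APGEN Small (2007): 473 expert-recommended cases, 99 truly Good;
# the model flagged 140, of which 47 were Good. The 281 correctly
# rejected Bad audits quoted in the text then follow by subtraction.
total, actual_good = 473, 99
flagged, true_pos = 140, 47

false_pos = flagged - true_pos                        # wasted audits
false_neg = actual_good - true_pos                    # missed Good audits
true_neg = total - true_pos - false_pos - false_neg   # should recover 281

precision = true_pos / flagged       # share of flagged audits that pay off
recall = true_pos / actual_good      # share of Good audits actually found
print(true_neg, round(precision, 3), round(recall, 3))  # 281 0.336 0.475
```

The recovered 281 true negatives match Table 1, which is a useful consistency check on the reported figures.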


Table 1. Confusion matrices for audit evaluation [1]. Top: APGEN Large (2007). Bottom: APGEN Small (2007). R stands for
revenues, C for collection costs.

References

[1] Hsu, K.W., Pathak, N., Srivastava, J., Tschida, G., Bjorklund, E. [2015], Data Mining Based Tax Audit Selection: A Case Study of a Pilot Project at the Minnesota Department of Revenue, in: Abou-Nasr, M., Lessmann, S., Stahlbock, R., Weiss, G. (eds), Real World Data Mining Applications, Annals of Information Systems, v.17, Springer. doi:10.1007/978-3-319-07812-0_12
[2] Wolpert, D.H. [1996], The Lack of a priori Distinctions Between Learning Algorithms, Neural Computation, v.8, n.7, pp.1341-1390, MIT Press. doi:10.1162/neco.1996.8.7.1341
[3] Wolpert, D.H., Macready, W.G. [2005], Coevolutionary Free Lunches, IEEE Transactions on Evolutionary Computation, v.9, n.6, pp.721-735, IEEE Press. doi:10.1109/TEVC.2005.856205

2. Sentiment Analysis: BOTUS and Trump & Dump

In 2013, the BBC reported on various ways in which social media giant Twitter was changing the world, detailing specific instances in the fields of business, politics, journalism, sports, entertainment, activism, arts, and law [9].

It is not always clear what influence Twitter users have, if any, on world events or business and cultural trends; it was once thought (perhaps without appropriate evidence) that entertainers, athletes, and celebrities, that is to say, users with extremely high followers/following ratios, wielded more “influence” on the platform than world leaders [1]. Certainly, such users continue to be among the most popular – as of September 13, 2017, Twitter’s 40 most-followed accounts tend to belong to entertainers, celebrities, and athletes, with a few exceptions [15].

One account has recently bridged the gap between celebrity and politics in an explosive manner: @realDonaldTrump, which belongs to the 45th President of the United States of America, has maintained a very strong presence on Twitter. As of September 13, 2017, the account had 38,205,766 followers, and it was the 26th most-followed account on the planet, having produced 35,755 tweets since it was activated in March 2009 [15].

Titles BOTUS [5], Trump & Dump Bot [16]

Authors Tradeworx (BOTUS), T3 (Trump & Dump)

Date 2017

Sponsor NPR’s podcast Planet Money (BOTUS)


Methods sentiment analysis, social media monitoring, AI, real-time analysis, simulations

Objective There is some evidence to suggest that tweets from the 45th POTUS may have an effect on the stock market [8, 10]. Can sentiment analysis and AI be used to take real-time advantage of the tweets’ unpredictable nature? Let’s take a look at bots built for that purpose by NPR’s Planet Money and by T3 (an Austin advertising agency).

Methodology Tradeworx followed these steps:

1. Data collection: tweets from @realDonaldTrump are collected for analysis.
2. Sentiment analysis of tweets: each tweet is given a sentiment score on the positive/negative axis.
3. Validation: the sentiment analysis scoring must be validated by observers: are human-identified positive or negative tweets correctly identified as such by BOTUS?
4. Identification of the company in a tweet: is the tweet even about a company? If so, which one?
5. Determining the trading universe: are there companies that should be excluded from the bot’s trading algorithms?
6. Classifying tweets as “applicable” or “unapplicable”: is a tweet’s sentiment strong enough for BOTUS to engage the trading strategy?
7. Determining a trading strategy: how soon after a flagged tweet does BOTUS buy a company’s stock, and how long does it hold it for?
8. Testing the trading strategy on past data: how would BOTUS have fared from the U.S. Presidential Election to April 2017? What are BOTUS’ limitations?

T3’s Trump and Dump uses a similar process (see Figure 4).

Figure 4. T3’s Trump and Dump process [16].

Data The data consists of:

tweets by @realDonaldTrump (from around Election Day 2016 through the end of March 2017 for BOTUS; no details are given for T3) (see Figure 5 for a sample);
a database of publicly traded companies, such as can be found at [3, 14, 17], although which of these were used, if any, is not specified (no explicit mention is made for BOTUS), and
stock market data for real-time pricing (Google Finance for T3) and backcasting simulation (for BOTUS, source unknown).

It is not publicly known whether the bots upgrade their algorithms by including new data as time passes.

Strengths and Limitations of Algorithms and Procedure

In sentiment analysis, an algorithm analyzes documents in an attempt to identify the attitude they express or the emotional response they seek. It presents numerous challenges, mostly related to the richness and flexibility of human languages and their syntax variations, the context-dependent meaning of words and lexemes, the use of sarcasm and figures of speech, and the lack of perfect inter-rater reliability among humans [12]. As it happens, @realDonaldTrump is not much of an ironic tweeter – when he uses “sad”, “bad” and “great”, he usually means “sad”, “bad” and “great” in their most general sense. This greatly simplifies the analysis.

The bots have to learn to recognize whether a tweet is directed at a publicly traded company or not. In certain cases, the ambiguity can be resolved relatively easily with an appropriate training set (Apple the company vs. apple the food item, say), but no easy solutions were found in others (Tiffany the company vs. Tiffany the daughter, for example). Rather than have humans step in and instruct BOTUS when it faces uncertainty (which would go against the purpose of the exercise), a decision was made to exclude these cases from the trading universe. What T3’s bot does is not known.

Once the bot knows how to rate @realDonaldTrump’s tweets and to identify when he tweets about publicly-traded companies, the next question is to determine what the trading strategy should be. If the tweet’s sentiment is negative enough, T3 shorts the company’s

stock.¹ Of course, this requires first purchasing the stock (so that it can be shorted). Planet Money’s
decision was similar: buy once the tweet is flagged,
and sell right away... but what does “right away”
mean in this context? There is a risk involved: if the
stock goes back up before BOTUS has had a chance
to purchase the low-priced stock, it will lose money.
To answer that question, Tradeworx simulated the
stock market over the last few months, introducing
the tweets, and trying out different trading strategies.
It turns out that, in this specific analysis, “right away”
can be taken to be 30 minutes after the tweet.
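The grid search over cover delays that Tradeworx ran can be imitated on synthetic data; the price path below (a dip bottoming out 30 minutes after the tweet) is invented for illustration, whereas the real simulations replayed historical market data:

```python
# Toy backtest: a tweet at minute 0 depresses a synthetic stock price,
# which bottoms out at minute 30 and then recovers. We short at t=0 and
# test which cover (buy-back) delay would have been most profitable.
def price(t):
    """Synthetic post-tweet price: linear dip to minute 30, slow recovery."""
    base = 100.0
    dip = -8.0 * min(t, 30) / 30 + 8.0 * max(t - 30, 0) / 60
    return base + dip

short_price = price(0)  # sell borrowed stock at the pre-dip price
profits = {delay: short_price - price(delay) for delay in (5, 15, 30, 60, 90)}
best_delay = max(profits, key=profits.get)
print(best_delay)  # 30: covering at the bottom of the synthetic dip
```

Waiting too long is penalized exactly as described above: once the stock recovers past the shorting price, the trade loses money.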
Results, Evaluation and Validation For a trading bot, the
validation is in the pudding, as they say – do they make
money? T3’s president says that their bot is profitable (they
donate the proceeds to the ASPCA) [16]: for instance, they
netted a return of 4.47% on @realDonaldTrump’s Delta
tweet (see Figure 5); however, he declined to provide spe-
cific numbers (and made vague statements about providing
monthly reports, which I have not been able to locate) [11].

The BOTUS process was more transparent, and we can


point to Planet Money’s transcript for a discussion on sen-
timent analysis validation (comparing BOTUS’s sentiment
rankings with those provided by human observers, or run-
ning multiple simulations to determine the best trading
scenario) [5] – but it suffers from a serious impediment:
as of roughly 4 months after going online, it still had not
made a single trade [4]!

The reasons are varied (see Figures 6 and 7), but the most
important setback was that @realDonaldTrump had not
made a single valid tweet about a public company whose
stock BOTUS could trade during the stock market business
hours. Undeterred, Planet Money relaxed its trading strat-
egy: if @realDonaldTrump tweets during off-hours, BOTUS
will short the stock at the market’s opening bell.
This is a risky approach, and so far it has not proven very effective: a single trade of Facebook stock, on August 23rd, resulted in a loss of $0.30 (see Figure 7).
Take-Aways As a text analysis and scenario analysis project,
both BOTUS and Trump & Dump are successful – they
present well-executed sentiment analyses, and a simulation
process that finds an optimal trading strategy. As predictive
tools, they are sub-par (as far as we can tell), but for rea-
sons that (seem to) have little to do with data analysis per se.

Unfortunately, this is not an atypical feature of descrip-


tive data analysis: we can explain what has happened (or
what is happening), but the modeling assumptions are not
always applicable to the predictive domain.
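A bare-bones version of the lexicon-based scoring both bots rely on might look as follows; the word lists are tiny stand-ins for a real sentiment lexicon of the kind discussed in [12]:

```python
# Minimal lexicon-based sentiment scorer: count positive and negative
# words and return (positives - negatives) / number of words scored.
POSITIVE = {"great", "terrific", "amazing", "win", "strong"}
NEGATIVE = {"sad", "bad", "terrible", "disaster", "failing"}

def sentiment(tweet):
    """Score a tweet in [-1, 1]; 0.0 when no lexicon word is present."""
    words = [w.strip(".,!?").lower() for w in tweet.split()]
    hits = [1 if w in POSITIVE else -1
            for w in words if w in POSITIVE | NEGATIVE]
    return sum(hits) / len(hits) if hits else 0.0

print(sentiment("Boeing is doing a great, terrific job!"))   # 1.0
print(sentiment("Delta computers are failing. Sad! Bad!"))   # -1.0
```

Every challenge listed earlier (sarcasm, context, figures of speech) is invisible to such a scorer, which is why human validation of the rankings was a required step.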
Figure 5. Examples of @realDonaldTrump tweets involving Delta, Toyota Motor, L.L.Bean, Ford, Boeing, Nordstrom.

¹ It sells the stock when the price is high, that is, before the tweet has had the chance to bring the stock down, and it repurchases it once the price has been lowered by the tweet, but before the stock has had the chance to recover.


Figure 7. BOTUS reporting on its trades.


Figure 6. BOTUS reporting on its trades.


References

[1] Arthur, C. [2010], The Big Bang Visualisation of the Top 140 Twitter Influencers, retrieved from TheGuardian.com on September 13, 2017.
[2] Basu, T. [2017], NPR’s Fascinating Plan to Use A.I. on Trump’s Tweets, retrieved from inverse.com on September 12, 2017.
[3] Company List: NASDAQ, NYSE, & AMEX Companies, retrieved from NASDAQ.com on September 15, 2017.
[4] Dieker, N. [2017], Planet Money’s BOTUS Bot Has Yet to Make a Single Stock Trade, retrieved from Medium.com’s The Billfold on September 12, 2017.
[5] Goldmark, A. [2017], Episode 763: BOTUS, Planet Money podcast, retrieved from NPR.org’s Planet Money on September 12, 2017.
[6] Green, S. [2017], Trump Tweets: Separate Positive and Negative Tweets, retrieved from Green Analytics on September 15, 2017.
[7] Greenstone, S. [2017], When Trump Tweets, This Bot Makes Money, retrieved from NPR.org on September 12, 2017.
[8] Ingram, M. [2017], Here’s What a Trump Tweet Does to a Company’s Share Price, retrieved from Fortune.com on September 15, 2017.
[9] Lee, D. [2013], How Twitter Changed the World, Hashtag-by-Hashtag, retrieved from BBC.com on September 13, 2017.
[10] McNaney, B. [2017], A Negative Trump Tweet About Your Company Is An Eye Opener, Not A Crisis, retrieved from The Buzz Bin on September 22, 2017.
[11] Mettler, K. [2017], ‘Trump and Dump’: When POTUS Tweets and Stocks Fall, This Animal Charity Benefits, retrieved from the Washington Post on September 19, 2017.
[12] Ogneva, M. [2010], How Companies Can Use Sentiment Analysis to Improve Their Business, retrieved from Mashable on September 17, 2017.
[13] Popper, N. [2017], A Little Birdie Told Me: Playing the Market on Trump Tweets, retrieved from The New York Times on September 22, 2017.
[14] List of Publicly Traded Companies, retrieved from InvestorGuide.com on September 15, 2017.
[15] Top 100 Most Followed Twitter Accounts, retrieved from Twitter Counter on September 13, 2017.
[16] Trump & Dump Bot: Analyzes Tweets, Short Stocks, retrieved from T3 on September 14, 2017.
[17] The World’s Biggest Public Companies List – Forbes 2000, retrieved from Forbes.com on September 15, 2017.

3. Clustering: The Livehoods Project

When we think of similarity at the urban level, we typically think in terms of neighbourhoods. Is there some other way to identify similar parts of a city?

Title The Livehoods Project: Utilizing Social Media to Understand the Dynamics of a City [2]

Authors Justin Cranshaw, Raz Schwartz, Jason I. Hong, Norman Sadeh

Date 2012

Sponsors National Science Foundation, Carnegie Mellon’s CyLab, Army Research Office, Alfred P. Sloan Foundation, CMU/Portugal ICTI, with additional support from Google, Nokia, and Pitney Bowes

Methods spectral clustering, social dynamics

Objective The project aims to draw the boundaries of livehoods, areas of similar character within a city, by using clustering models. Unlike static administrative neighbourhoods, the livehoods are defined based on the habits of the people who live there.

Methodology The case study introduces a spectral clustering model (the method will be described later) to discover the distinct geographic areas of the city based on its inhabitants’ collective movement patterns. Semi-structured interviews are used to explore, label and validate the resulting clusters, as well as the urban dynamics that shape them. Livehood clusters are built and defined using the following methodology:

1. a geographic distance is computed based on pairs of check-in venues’ coordinates;
2. social similarity between each pair of venues is computed using cosine measurements;
3. spectral clustering produces candidate livehoods clusters;
4. interviews are conducted with residents in order to validate the clusters discovered by the algorithm.

Data The data comes from two sources, combining approximately 11 million Foursquare (a recommendation site for venues based on users’ experiences) check-ins from the dataset of Chen et al. [1] and a new dataset of 7 million Twitter check-ins downloaded between June and December of 2011. For each check-in, the data consists of the user ID, the time, the latitude and longitude, the name of the venue, and its category.

In this case study, livehood clusters from Pittsburgh, Pennsylvania, are examined using 42,787 check-ins of 3840 users at 5349 venues.
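Steps 2 and 3 above can be sketched on a toy check-in matrix, assuming numpy is available; the venue names, check-in counts, and the choice of k = 2 are invented for illustration, and the full algorithm (nearest-neighbour graph, eigengap selection, post-processing) is described later in this section.

```python
import numpy as np

# Rows are venues, columns are users, entry = number of check-ins.
checkins = np.array([
    [5, 4, 0, 0],   # cafe A   } frequented by users 0-1
    [4, 5, 1, 0],   # bar B    }
    [0, 0, 5, 4],   # gym C    } frequented by users 2-3
    [0, 1, 4, 5],   # diner D  }
], dtype=float)

# Social similarity: cosine of the venues' user-distribution vectors.
unit = checkins / np.linalg.norm(checkins, axis=1, keepdims=True)
A = unit @ unit.T            # affinity matrix
np.fill_diagonal(A, 0.0)

# Spectral step: normalized Laplacian, then embed venues using the
# eigenvectors attached to the smallest eigenvalues.
deg = A.sum(axis=1)
L = np.diag(deg) - A
d_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
L_norm = d_inv_sqrt @ L @ d_inv_sqrt
eigvals, eigvecs = np.linalg.eigh(L_norm)   # ascending eigenvalues
embedding = eigvecs[:, :2]                  # k = 2 for this toy example

# With such a clean block structure, the sign of the second eigenvector
# (the Fiedler vector) already separates the two venue groups; the
# general algorithm clusters the embedded rows with k-means instead.
labels = (embedding[:, 1] > 0).astype(int)
print(labels[0] == labels[1], labels[2] == labels[3], labels[0] != labels[2])
```

The point of the embedding is that venues sharing visitors end up close together regardless of their geographic coordinates, which is exactly what lets livehoods cut across municipal borders.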


Figure 8. Some livehoods in metropolitan Pittsburgh, PA: in Shadyside/East Liberty, Lawrenceville/Polish Hill, and South
Side. Municipal borders are shown in black.

Strengths and Limitations of the Approach

The technique used in this study is agnostic towards the particular source of the data: it is not dependent on meta-knowledge about the data.

The algorithm may be prone to “majority” bias, consequently misrepresenting or hiding minority behaviours.

The data are based on a limited sample of check-ins shared on Twitter and are therefore biased towards the types of places that people typically want to share publicly.

Tuning the clusters is non-trivial: experimenter bias may combine with “confirmation bias” of the interviewees in the validation stage.

Procedures The Livehoods project uses a spectral clustering model to provide structure for local urban areas (UAs), grouping close Foursquare venues into clusters based on both the spatial proximity between venues and the social proximity derived from the distribution of people that check in to them.

The guiding principle of the model is that the “character” of a UA is defined both by the types of venues it contains and by the people who frequent them as part of their daily activities. These clusters are referred to as Livehoods, by analogy with more traditional neighbourhoods.

Let V be a list of Foursquare venues, A the associated affinity matrix representing a measure of similarity between each pair of venues, and G(A) the graph obtained from A by linking each venue to its nearest m neighbours. Spectral clustering is implemented by the following algorithm:

1. Compute the diagonal degree matrix D_ii = ∑_j A_ij;
2. Set the Laplacian matrix L = D − A and L_norm = D^(−1/2) L D^(−1/2);
3. Find the k smallest eigenvalues of L_norm, where k is the index which provides the biggest jump in the successive eigenvalues of L_norm, taken in increasing order;
4. Find the eigenvectors e_1, ..., e_k of L_norm corresponding to the k smallest eigenvalues;


5. Construct the matrix E with the eigenvectors e_1, ..., e_k as columns;
6. Denote the rows of E by y_1, ..., y_n, and cluster them into k clusters C_1, ..., C_k using k-means. This induces a clustering A_1, ..., A_k defined by A_i = { j | y_j ∈ C_i };
7. For each A_i, let G(A_i) be the subgraph of G(A) induced by the vertices of A_i. Split G(A_i) into connected components, add each component as a new cluster to the list of clusters, and remove the subgraph G(A_i) from the list;
8. Let b be the area of the bounding box containing the coordinates of the set of venues V, and b_i be the area of the box containing A_i. If b_i/b > τ, delete cluster A_i and redistribute each of its venues v ∈ A_i to the closest A_j under the distance measurement.

Results, Evaluation and Validation The parameters used for the clustering were m = 10, k_min = 30, k_max = 45, and τ = 0.4. The results for three areas of the city are shown in Figure 8. In total, 9 livehoods were identified and validated by 27 Pittsburgh residents (see Figure 8; the original report has more information on the interview process).

Municipal Neighborhood Borders: livehoods are dynamic, and evolve as people’s behaviours change, unlike the fixed neighbourhood borders set by the city government.

Demographics: the interviews displayed strong evidence that the demographics of the residents and visitors of an area often play a strong role in explaining the divisions between livehoods.

Development and Resources: economic development can affect the character of an area. Similarly, the resources (or lack thereof) provided by a region have a strong influence on the people that visit it, and hence on its resulting character. This is assumed to be reflected in the livehoods.

Geography and Architecture: the movements of people through a certain area are presumably shaped by its geography and architecture; livehoods can reveal this influence and the effects it has on visiting patterns.

Take-Away k-means is not the sole clustering algorithm in applications!

References

[1] Chen, Z., Caverlee, J., Lee, K., Su, D.Z. [2011], Exploring Millions of Footprints in Location Sharing Services, ICWSM.
[2] Cranshaw, J., Schwartz, R., Hong, J.I., Sadeh, N. [2012], The Livehoods Project: Utilizing Social Media to Understand the Dynamics of a City, International AAAI Conference on Weblogs and Social Media, p.58.

4. Association Rules: Danish Medical Data

Title Temporal disease trajectories condensed from population-wide registry data covering 6.2 million patients

Authors Anders Boeck Jensen, Pope L. Moseley, Tudor I. Oprea, Sabrina Gade Ellesøe, Robert Eriksson, Henriette Schmock, Peter Bjødstrup Jensen, Lars Juhl Jensen, and Søren Brunak

Date 2014

Sponsor Danish National Patient Registry

Methods association rules mining, clustering

Objective Estimating disease progression (trajectories) from the current patient state is a crucial notion in medical studies. Trajectories have so far only been analyzed for a small number of diseases, or using large-scale approaches without consideration for time scales exceeding a few years. Using data from the Danish National Patient Registry (an extensive, long-term data collection effort by Denmark), this study finds connections between different diagnoses, and studies how the presence of a diagnosis at some point in time might allow for the prediction of another diagnosis at a later point in time.

Methodology The following methodological steps were taken:

1. compute the strength of correlation for pairs of diagnoses over a 5-year interval (on a representative subset of the data);
2. test diagnosis pairs for directionality (one diagnosis repeatedly occurring before the other);
3. determine reasonable diagnosis trajectories (thoroughfares) by combining smaller (but frequent) trajectories with overlapping diagnoses;
4. validate the trajectories by comparison with non-Danish data;
5. cluster the thoroughfares to identify a small number of central medical conditions (key diagnoses) around which disease progression is organized.

Data The Danish National Patient Registry is an electronic health registry containing administrative information and diagnoses, covering the whole population of Denmark, including private and public hospital visits of all types: inpatient (overnight stay), outpatient (no overnight stay) and emergency. The data set covers 14.9 years, from January ’96 to November ’10, and consists of 68 million records for 6.2 million patients.

Challenges and Pitfalls

Access to the National Patient Registry is protected and could only be granted after approval by the Danish Data Registration Agency and the National Board of Health.


Gender-specific differences in diagnostic trends are clearly identifiable (pregnancy and testicular cancer do not have much cross-appeal). But many diagnoses were found to exclusively (or at least, predominantly) be made in different sites (inpatient, outpatient, emergency ward), which suggests the importance of stratifying by site as well as by gender.

In the process of forming small diagnosis chains, it became necessary to compute the correlation using large groups for each pair of diagnoses. To compensate for multiple testing over close to 1 million pairs and obtain a significant p-value, more than 80 million samples would have been required for each pair. This would have translated to a few thousand years’ worth of computer running time. In order to avoid this pitfall, a pre-filtering step was included. Pairs included in the trajectories were eventually validated using the full sampling procedure, however.

Project Summary and Results The dataset was reduced to 1,171 significant trajectories. These thoroughfares were clustered into patterns centred on 5 key diagnoses central to disease progression: diabetes, chronic obstructive pulmonary disease (COPD), cancer, arthritis, and cerebrovascular disease.

Early diagnoses for these central factors can help reduce the risk of adverse outcomes linked to future diagnoses of other conditions. Three author quotes illustrate the importance of these results:

“The research could yield tangible health benefits as we move beyond one-size-fits-all medicine.” – L.J. Jensen

“The sooner a health risk pattern is identified, the better we can prevent and treat critical diseases.” – S. Brunak

“Instead of looking at each disease in isolation, you can talk about a complex system with many different interacting factors. By looking at the order in which different diseases appear, you can start to draw patterns and see complex correlations outlining the direction for each individual person.” – L.J. Jensen

Among the specific results, the following “surprising” insights were found:

a diagnosis of anemia is typically followed months later by the discovery of colon cancer;
gout was identified as a step on the path toward cardiovascular disease, and
COPD is under-diagnosed and under-treated.

The disease trajectory clusters for two key diagnoses are shown in Figure 9.

References

[1] Jensen, A.B., Moseley, P.L., Oprea, T.I., Ellesøe, S.G., Eriksson, R., Schmock, H., Jensen, P.B., Jensen, L.J., Brunak, S. [2014], Temporal Disease Trajectories Condensed from Population-Wide Registry Data Covering 6.2 Million Patients, Nature Communications.
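The directionality test in step 2 of the methodology can be sketched on toy records; the patient histories and dates below are invented for illustration, and the actual study applies correlation pre-filtering and a proper statistical test rather than the crude dominance ratio used here:

```python
from collections import Counter
from datetime import date

# Toy patient histories: patient -> list of (diagnosis, date) events.
histories = {
    "p1": [("anemia", date(2001, 3, 1)), ("colon cancer", date(2001, 9, 1))],
    "p2": [("anemia", date(2003, 5, 1)), ("colon cancer", date(2004, 1, 1))],
    "p3": [("anemia", date(2005, 2, 1)), ("colon cancer", date(2005, 6, 1))],
    "p4": [("colon cancer", date(2002, 7, 1)), ("anemia", date(2003, 2, 1))],
}

def ordered_pairs(events):
    """Yield each diagnosis pair (a, b) where a was recorded before b."""
    by_date = sorted(events, key=lambda e: e[1])
    for i, (a, _) in enumerate(by_date):
        for b, _ in by_date[i + 1:]:
            yield (a, b)

counts = Counter()
for events in histories.values():
    counts.update(ordered_pairs(events))

# Call a pair "directional" if one ordering clearly dominates the other.
a_then_b = counts[("anemia", "colon cancer")]
b_then_a = counts[("colon cancer", "anemia")]
print(a_then_b, b_then_a, a_then_b > 2 * b_then_a)  # 3 1 True
```

Chaining such directional pairs with overlapping diagnoses is what produces the trajectories, and ultimately the thoroughfares, described above.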


Figure 9. The COPD cluster showing five preceding diagnoses leading to COPD and some of the possible outcomes;
Cerebrovascular cluster with epilepsy as key diagnosis.

