The State-Of-The-Art in Predictive Visual Analytics
Yafeng Lu¹, Rolando Garcia¹, Brett Hansen¹, Michael Gleicher², and Ross Maciejewski¹
1 School of Computing, Informatics, and Decision Systems Engineering, Arizona State University, Tempe, AZ, USA
2 Department of Computer Sciences, University of Wisconsin, Madison, WI, USA
Abstract
Predictive analytics embraces an extensive range of techniques including statistical modeling, machine learning, and data
mining and is applied in business intelligence, public health, disaster management and response, and many other fields. To
date, visualization has been broadly used to support tasks in the predictive analytics pipeline. Primary uses have been in
data cleaning, exploratory analysis, and diagnostics. For example, scatterplots and bar charts are used to illustrate class
distributions and responses. More recently, extensive visual analytics systems for feature selection, incremental learning, and
various prediction tasks have been proposed to support the growing use of complex models, agent-specific optimization, and
comprehensive model comparison and result exploration. Such work is being driven by advances in interactive machine learning
and the desire of end-users to understand and engage with the modeling process. In this state-of-the-art report, we catalogue
recent advances in the visualization community for supporting predictive analytics. First, we define the scope of predictive
analytics discussed in this article and describe how visual analytics can support predictive analytics tasks in a predictive visual
analytics (PVA) pipeline. We then survey the literature and categorize the research with respect to the proposed PVA pipeline.
Systems and techniques are evaluated in terms of their supported interactions, and interactions specific to predictive analytics
are discussed. We end this report with a discussion of challenges and opportunities for future research in predictive visual
analytics.
© 2017 The Author(s). Computer Graphics Forum © 2017 The Eurographics Association and John Wiley & Sons Ltd.

niques to improve users' performance, perception, and externalization of insights.

In this state-of-the-art report, we catalogue recent advances in the visualization community for supporting predictive analytics. First, we define the scope of predictive analytics as used in this article and describe how visual analytics can support predictive analytics tasks in a predictive visual analytics pipeline. We then describe our method of surveying the literature and explain our categorization of the research with respect to the proposed PVA pipeline. Systems and techniques are then evaluated in terms of their supported interactions, and interactions specific to predictive analytics are discussed. We end this report with a discussion of challenges and opportunities for future research in predictive visual analytics. The main contributions of this article include:

• A definition of the PVA pipeline and its relationship with predictive analytics,
• A categorization of PVA methods as they relate to stages of the proposed pipeline,
• A categorization of the interaction techniques used in PVA systems,
• An analysis of existing trends and relationships upon analyzing the proposed categories of the PVA pipeline and the interactions, and
• A discussion of future challenges and opportunities in PVA.

In order to guide other researchers interested in predictive visual analytics, we have also developed an interactive web-based browser available at http://vader.lab.asu.edu/pva_browser where surveyed papers can be explored and filtered by the categories proposed in this article.

In this survey, we have limited our study to works focusing on enabling predictive analysis that were published by the visual analytics community. The papers we have collected explicitly relate to at least one step in the PVA pipeline (Section 4) and cover a variety of predictive analytics techniques ranging from general methods (including regression, classification, and clustering) to data-specific methods (including methods for text, time, and spatial data). While there are a variety of papers in the visualization community that cover topics on the periphery of predictive analytics, such as uncertainty analysis (e.g., [BHJ∗14]) and ensemble visualization (e.g., [OJ14]), we consider these papers to be out of the scope of this survey as the emphasis is on interactive analytics for prediction.

2. Background

Predictive analytics is a core research topic with roots in statistics, machine learning, data mining, and artificial intelligence. Starting with an informal survey of predictive analytics researchers, we captured a variety of definitions of predictive analytics. Definitions ranged from broad (every machine learning technique is predictive analytics) to narrow (making empirical predictions for the future [SK10]). In this section, we provide our definition of predictive analytics that serves to bound the scope of this paper. We present brief definitions of basic concepts related to predictive analytics and introduce how visualization can be used to augment the predictive analytics process.

2.1. Predictive Analytics

Predictive analytics covers the practice of identifying patterns within data to predict future outcomes and trends. With respect to analytics, three common terms are descriptive, prescriptive, and predictive analytics. Descriptive analytics focuses on illustrating what has happened and what the current status of a system is. Prescriptive analytics uses data to populate decision models that produce optimal (or near-optimal) decisions of what should be done, and predictive analytics applies algorithms to extrapolate and model the future based on available data. In this sense, we can think of descriptive analytics as a passive, historical analysis; prescriptive analytics as active analysis suggesting how to exploit new opportunities; and predictive analytics as an intersecting medium where historical data is used to produce knowledge of what is likely to happen as a means of driving decisions. Arguably, the main tasks in predictive analytics relate to numerical predictions (where the most common predictive analytics methods are regressions) and categorical predictions (where the most common methods focus on classification and clustering) [KJ13]. As an introduction to predictive analytics techniques, we provide brief definitions of regression, classification, and clustering.

Regression analysis is a statistical technique for modeling the relationships between variables [MPV15]. Specifically, regression analysis focuses on understanding how a dependent variable changes when a predictor (or independent variable) is changed. Linear regression is perhaps one of the most common predictive analytics techniques available to analysts, with implementations in many common software packages such as Excel, SAS [Ins13], and JMP [SAS12]. Much of its power comes from the interpretability of the model, where relationships tend to be readily explorable by end users. Depending on the relationships between variables, regression models can be linear or non-linear, and, to explore local patterns, segmented or piecewise regression models can be used. Challenges include data transformation, feature selection, model comparison, and avoiding over-fitting, and widely used techniques to address these challenges include stepwise feature screening and comparing models through performance measures, such as the p-value and R².

Classification broadly covers the problem of identifying which category a new observation belongs to based on information from a training data set in which observations have known category memberships. Classifiers learn patterns using the data attribute features from the training set, and these patterns can be applied to unknown instances to predict their categories. Well-known classification methods include Bayesian classification, logistic regression, decision trees, support vector machines (SVM), and artificial neural networks (ANN). Challenges with classification include learning from large and/or streaming data (e.g., real-time security classification [Sut14]), defining proper cost functions in model optimization for domain-specific tasks where the error cost varies across instances, obtaining enough labeled data samples, understanding what characteristics the models have learned, and avoiding over-fitting.

Similar to classification, clustering also attempts to categorize a new observation into a class membership. However, clustering is an unsupervised method that discovers the natural groupings of a data set with unknown class labels. Clustering has been widely used in pattern recognition, information retrieval, and bioinformatics, and popular applications include gene sequence analysis, image segmentation, document summarization, and recommender systems. Challenges with clustering include feature extraction due to high dimensionality and unequal length of feature vectors, metric learning, and clustering evaluation due to unknown ground truth. We consider clustering a prediction task given the current use of clustering for prediction [TPH15, KPJ04] along with a variety of work in visualization focused on clustering analysis.

For the purposes of this report, we consider predictive analytics to be the method of analysis in the process of prediction modeling, which consists of building and assessing a model aimed at making empirical predictions [SK10]. Predictive analytics overlaps with the process of knowledge discovery, but the emphasis is on predictions, specifically forecasts of the future, unknown, and 'what if' scenarios [Eck07, Sie13]. The goal of prediction modeling is to make statements about an unknown or uncertain event, which can be numerical (prediction), categorical (classification), or ordinal (ranking) [SK10]. In our context, we consider a paper to fall into the scope of predictive analytics if it satisfies the following conditions:

1. The analysis process has a clear prediction goal. While open-ended explorations and exploratory data analysis play a role in predictive analysis, the task must be to ultimately make a statement about an unknown event.
2. The analysis process uses quantitative algorithms, such as statistical methods, machine learning models, and data mining techniques, to make grounded and reasonable predictions. Our focus is data-driven, as opposed to theory-driven, models.
3. The prediction results and the prediction model itself have a means of being evaluated.

Finally, if the model developed only extracts or explains features, patterns, correlations, and causalities but does not make reference to future predictions or 'what if' scenarios, it is not considered to fall under our scope of predictive analytics. The reason for our chosen scope is that in order to make a prediction, the model needs to be applied to unknowns.

2.2. Predictive Visual Analytics

Predictive analytics approaches primarily rely on a four-step process of data cleaning, feature selection, modeling, and validation [HPK11, PC05]. We broadly consider predictive visual analytics to cover the domain of visualization methods and techniques that have been used to support these predictive analytics process steps. For the purposes of this report, we consider a paper to fall into the scope of predictive visual analytics if the paper satisfies the following conditions:

1. The predictive visual analytics method is specific to prediction problems, not only confirmatory or explanatory problems. This means the task of the predictive visual analytics system, method, or technique is to support analysts in making predictions.
2. The predictive visual analytics method enables the user to interact with at least one step in the predictive analytics process through exploratory visualization (as opposed to traditional interactions in user interfaces such as save and undo).
3. The predictive visual analytics method supports both prediction and visual explanation, which allows analysts either to improve model performance with respect to the general accuracy or to improve an analyst's understanding of the modeling process and output.

Moreover, predictive visual analytics methods should share the same goal as predictive analytics methods, which is to make accurate predictions. In addition, predictive visual analytics could also focus on improving users' satisfaction and confidence in the resulting predictions. While decision-making systems overlap with this definition, we do not specifically survey such tools as this falls more in the realm of prescriptive analytics. If a system supports decision making, the decision has to be made directly from predictive algorithms for the system to be categorized in our survey.

To further clarify the scope of predictive visual analytics in this paper, it is important to note that there are visual analytics papers that are related to predictive analytics but are considered to be out of the scope of this paper. Specifically, visual analytics works that use predictive analysis methods for guiding the design of a visualization (e.g., placing a flow field [AKK∗13]) are not part of our definition, as the goal of such papers is not to make a prediction but to use prediction to help improve the rendering process. Another example of a related, but excluded, work is that by Tzeng and Ma [TM05], which proposed a visualization method to explain the behavior of artificial neural networks (ANNs). This paper focused on the design of a visualization but provided (to our knowledge) no interactive analytics for the classification using the ANN. Such methods that do not provide interactivity are also considered to be outside of the scope of our definition of predictive visual analytics. Similarly, uncertainty visualization [BHJ∗14] is also excluded from this article. While uncertainty analysis is a critical part of predictive analytics, the addition of such a topic would prove too broad.

3. Methodology

There are many visual analytics techniques and systems focused on modeling, prediction, and decision making that integrate machine learning, data mining, and statistics. The increasing coverage indicates the importance of this topic. To begin our survey, we first performed a preliminary literature review comprising expert-recommended papers [Sie13, Eck07, SK10, Shm10] in the field and established the definitions and scope of our survey as defined in Section 2. Once our scope and definition were clear, we collected papers from digital archives, which we filtered by keywords to produce a more manageable sample set. We manually searched for relevant papers in this sample set, which we then read and classified. This procedure is illustrated in Figure 1.

3.1. Paper Collection & Sample Set Extraction

We collected 6,144 papers published in major visualization journals and proceedings (specifically, IEEE VIS, EuroVis, PacificVis, ACM CHI, IEEE Transactions on Visualization and Computer Graphics, and Computer Graphics Forum) from 2010 to the present. Next, we filtered our sample set to retain only papers containing keywords including forecast or predict and visual in their title or abstract (422 papers). Then, we added to our sample set
Figure 1: Flowchart illustrating our survey method. Papers and paper sources are colored gray; processes are colored black. Two important
cycles exist in this diagram. First is the “read – follow citation link – read” cycle (marked in orange). This cycle illustrates the iterative
process by which we collected relevant papers for the literature review. Second is the scope redefinition and paper re-classification cycle
(marked in blue). This cycle represents the incremental process by which we delineated the scope of predictive visual analytics for the survey
and improved our classification. By “sample set,” we refer to the sample set of papers collected during this process.
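The keyword-filtering step in Figure 1 can be sketched in a few lines. The paper records, field names, and `matches_filter` helper below are hypothetical illustrations of the rule described in Section 3.1 (a prediction keyword plus "visual" in the title or abstract), not the authors' actual tooling.

```python
# Hypothetical sketch of the Section 3.1 keyword filter: keep a paper
# only if its title or abstract contains a prediction keyword
# ("forecast" or "predict") together with the keyword "visual".
PREDICTION_KEYWORDS = ("forecast", "predict")

def matches_filter(paper):
    """`paper` is a dict with assumed 'title' and 'abstract' fields."""
    text = (paper.get("title", "") + " " + paper.get("abstract", "")).lower()
    has_prediction = any(kw in text for kw in PREDICTION_KEYWORDS)
    return has_prediction and "visual" in text

# Invented example records, for illustration only.
papers = [
    {"title": "Visual Analytics for Box Office Prediction", "abstract": "..."},
    {"title": "A Survey of Graph Layout", "abstract": "..."},
]
sample_set = [p for p in papers if matches_filter(p)]
```

Because "prediction" contains the substring "predict", substring matching also catches morphological variants such as "forecasting" or "predictive".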
any paper published in IEEE VAST on or after the year 2010, if the paper was not already in our selection (158 papers). Thus, we reduced our sample set to a more manageable size of 580 papers, which formed our long list for manual classification. Then, to further refine our sample, we pulled the abstracts for these papers and assigned them for reading and filtering to three of the authors, each. During the literature review process, we added new and relevant papers to our sample set by following citation links.

3.2. Expert Filtering & Sample Set Refinement

Three of the authors read each of the 580 abstracts. After reading an abstract, we marked whether the corresponding paper was in the scope of predictive visual analytics. We achieved an average pairwise percent agreement score of 90.66%. We collectively inspected any paper that caused a disagreement and found that the primary source of disagreement was that one author considered risk assessment to lie in the scope of predictive visual analytics. After discussion, we concluded that risk assessment would be too broad for our scope of predictive visual analytics. After we resolved our disagreements, we were left with a sample set of 57 papers. The sample set was distributed in equal parts among the three authors, where 6 of the papers assigned to a reader overlapped with another reader. As such, each author read 25 papers, and coding results were collected and discussed to minimize the disagreements. As papers were categorized, citations were also explored to expand our coverage. The final sample in this survey consists of 72 PVA papers. To support the coding results, most of the papers are discussed as examples in corresponding sections in this paper.

3.3. Paper Reading & Sample Set Classification

During the literature review process, each author was instructed to classify each paper from three aspects: the type of the model integrated in the PVA work, the predictive analytics step involved in the PVA work, and the interaction techniques supported for predictive analytics. For paper categorization, we utilized quantitative content analysis [RLF14]. We began by classifying papers using the proposed PVA pipeline defined by Lu et al. [LCM∗16] and the interaction categorization defined by Yi et al. [YaKSJ07].

An initial categorization focused on model types including numerical and categorical, supervised and unsupervised, regression, classification, and clustering; however, many of these techniques proved to be sparsely represented in the literature and were grouped into an "other" category. Furthermore, the predictive visual analytics pipeline considered initially had only four steps (data preprocessing, feature engineering, modeling, and model selection and validation). As we discussed and refined our PVA definitions, model selection and validation was expanded into two different pieces of our PVA pipeline (result exploration & model selection
and validation). Similarly, the seven interaction types we started our categorization with were select, explore, reconfigure, encode, abstract/elaborate, filter, and connect [YaKSJ07]. After the reading and discussion, the final categories were expanded to include shepherd. The final paper coding scheme is shown in Table 1. The detailed coding results, together with prediction tasks of regression, classification, and clustering, are provided for each paper in Table 3 (see Appendix A). Additionally, we specified the model types compatible with each paper (e.g., agglomerative hierarchical clustering [BDB15]) and the visualization techniques used.

Table 1: Predictive Visual Analytics Papers Coding Scheme

Aspect          | Category
----------------|---------------------------------------------------------------
PVA Pipeline    | Data Preprocessing; Feature Engineering; Modeling; Result Exploration & Model Selection; Validation
Interaction     | Select; Explore; Reconfigure; Encode; Abstract/Elaborate; Filter; Connect; Shepherd
Prediction Task | Regression; Classification; Clustering; Other

4. PVA Pipeline

We consider predictive visual analytics to fall squarely under the umbrella of human-machine integration and believe that a key aspect of such work should be supporting model comprehensibility [Gle16]. We define predictive visual analytics as visualization techniques that are directly coupled (through user interaction) to the predictive analytics process. The four steps of the predictive analytics pipeline (data preprocessing, feature engineering, model building, and model selection and validation) serve as a basis for defining the PVA pipeline (Figure 2). Our definition of the PVA pipeline is further informed by the knowledge discovery process of Pirolli and Card [PC05] and a variety of recent surveys on topics ranging from visual analytics pipelines and frameworks [CT05, KMS∗08, WZM∗16] to human-centered machine learning [BL09, SSZ∗16, TKC17] to knowledge discovery.

As a starting point for defining the PVA pipeline, we began with the four-step pipeline of predictive analytics and general data mining [HPK11] consisting of: Data Preprocessing, Feature Engineering, Model Building, and Model Selection and Validation. Chen et al. [LCM∗16] extended this pipeline by adding an Adjustment Loop and a Visualization step, allowing for the application of different visual analytics methods within the general data mining framework. Similar to previously proposed frameworks, our PVA pipeline is also built on top of the typical process of knowledge discovery. In this paper, we extend and modify Chen et al.'s pipeline, splitting the process into five steps: Data Preprocessing, Feature Engineering, Modeling, Result Exploration and Model Selection, and Validation, as shown in Figure 2. We separate the step Model Selection and Validation into two distinct steps, Result Exploration and Model Selection and Validation, and we represent the interactive analytics loop by bidirectionally connecting the first four steps with Visual Analytics. Our proposed pipeline highlights two specific aspects of PVA systems:

• Visual Analytics can be integrated into any of the first four steps iteratively so that these steps need not proceed in a specific order in every iteration.
• In the validation step, model testing can be applied. Model testing can use statistical tests or visual analytics approaches. If visual analytics is used, users are able to go back to the first four steps after validation, but the integration level must be shallow to prevent overfitting and conflation of testing and training data.

Figure 2: The predictive visual analytics pipeline has two parts, where the left components are interactively connected by visual analytics and the right component is a further validation step. Visual Analytics is added between Data Preprocessing, Feature Engineering, Modeling, and Result Exploration and Model Selection, where the order of interaction is left to the system design or end-user.

Given the tight coupling of interaction in the pipeline, we also provide a detailed categorization of interactions found in the predictive visual analytics literature, building off of Yi et al.'s interaction taxonomy [YaKSJ07]. To illustrate the difference between PVA and the general predictive modeling process, we summarize the goals of predictive analytics and compare those to what we see as the goals of predictive visual analytics in Table 2.

To date, few visual analytics systems have been developed to support the entire PVA pipeline. Some examples of the most comprehensive PVA systems are presented in Figure 3. Heimerl et al. [HKBE12] discussed three text document classification methods covering the first four steps in the PVA pipeline and used statistical model validation on the test dataset to demonstrate the effectiveness of the interactive visual analytics method (Figure 3a). iVisClassifier [CLKP10] (Figure 3b) proposed a classification system based on linear discriminant analysis (LDA). This system emphasizes the data preprocessing step by providing parallel coordinates plots, heat maps, and reconstructed images for the user to explore
Table 2: The goal of each step in the PVA pipeline and the general predictive analytics procedure.

Step                                   | PA                           | PVA Exclusive
---------------------------------------|------------------------------|----------------------------------------------------
Overall Goal                           | Make Prediction              | Support Explanation
Data Preprocessing                     | Clean and format data        | Summarize and overview the training data
Feature Selection and Generation       | Optimize prediction accuracy | Support reasoning and domain knowledge integration
Modeling                               | Optimize prediction accuracy | Support reasoning and domain knowledge integration
Result Exploration and Model Selection | Model quality analysis       | Get insights; select the proper model; feedback for model updates
Validation                             | Test for overfitting         | Get insights from other datasets
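The Validation row of Table 2 ("test for overfitting") corresponds to the standard hold-out protocol: fit on the training portion only, then compare training and held-out accuracy. The following is a minimal, hypothetical sketch using an invented one-dimensional threshold classifier, not a technique from any surveyed system.

```python
# Minimal sketch of hold-out validation: fit on training data only and
# check for overfitting by comparing training and held-out accuracy.
def split(data, test_fraction=0.25):
    cut = int(len(data) * (1 - test_fraction))
    return data[:cut], data[cut:]   # assumes data is already shuffled

def fit_threshold(train):
    """Toy 1D classifier: predict label 1 when x exceeds a learned threshold."""
    ones = [x for x, y in train if y == 1]
    zeros = [x for x, y in train if y == 0]
    return (min(ones) + max(zeros)) / 2.0

def accuracy(threshold, data):
    return sum((x > threshold) == (y == 1) for x, y in data) / len(data)

# Invented (x, label) pairs for illustration.
data = [(0.1, 0), (0.2, 0), (0.35, 0), (0.4, 0),
        (0.6, 1), (0.7, 1), (0.8, 1), (0.9, 1)]
train, test = split(data)
t = fit_threshold(train)
train_acc = accuracy(t, train)
test_acc = accuracy(t, test)
```

A large gap between `train_acc` and `test_acc` is the usual quantitative signal of overfitting that the pipeline's validation step is meant to surface.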
Figure 3: Examples of PVA systems representing the entire PVA pipeline. (a) A document classifier training system [HKBE12] which includes document preprocessing, classification feature engineering, visual analytics supported active learning, result analysis after each iteration, and testing and validation. (b) iVisClassifier [CLKP10] supports data encoding and other optional preprocessing steps, visualizes reduced dimensions and cluster structures from linear discriminant analysis (LDA), but lacks the validation step. (c) Scatter/Gather Clustering [HOG∗12] has predefined data preprocessing and feature extraction steps and supports an interactive clustering process by updating the model with the expected number of clusters controlled by the user.
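The user-controlled cluster count in Figure 3c can be illustrated with a plain k-means loop. This is a generic stand-in for the interaction pattern, not the actual Scatter/Gather algorithm: each time the user picks a new expected number of clusters k, the model is simply re-run.

```python
# Generic k-means sketch (1D for brevity) of the Figure 3c interaction:
# the "user" controls k, and the clustering model is refit on demand.
def kmeans_1d(points, k, iters=20):
    pts = sorted(points)
    # Spread the initial centroids over the sorted data range.
    if k > 1:
        centroids = [pts[i * (len(pts) - 1) // (k - 1)] for i in range(k)]
    else:
        centroids = [pts[len(pts) // 2]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in pts:
            nearest = min(range(k), key=lambda c: abs(p - centroids[c]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [1.0, 1.2, 0.8, 10.0, 10.4, 9.6]     # invented data
two_centroids, two_clusters = kmeans_1d(points, k=2)
three_centroids, three_clusters = kmeans_1d(points, k=3)
```

In an interactive system, the refit triggered by a new k would also update the linked views so the user can compare the resulting groupings.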
the data structure with reduced dimensions. iVisClassifier's feature engineering and modeling steps are embedded in the classification process and involve significant manual work where users need to label the unknown data and trigger a new round of LDA by removing/adding labeled instances. However, iVisClassifier has only been demonstrated using case studies without a well-established testing and validation step. Scatter/gather clustering [HOG∗12] (Figure 3c) supports interactive clustering as part of the modeling phase where users can set soft constraints on the clustering method and compare results. However, its data preprocessing and feature extraction steps, although included in the system, are transparent to the user.

Other representative examples that cover the complete PVA process include a visual analytics framework for box office prediction [LKT∗14] allowing users to iterate through each step; a predictive policing visual analytics framework [MMT∗14]; Peak-Preserving [HJM∗11] time series prediction; and iVisClustering [LKC∗12], which implements a document clustering system based on latent Dirichlet allocation topic modeling (distinct from the LDA used in iVisClassifier). iVisClustering enables the user to control the parameters and update the clustering model via a Term-Weight view. It also supports relations analysis between clusters and documents by using multiple views to visualize the results.

Our analysis of papers indicates that visual analytics developers tend to focus only on portions of the PVA pipeline. Even in cases when all of the steps of the pipeline are found within a single system, several steps will often lack a direct connection to any visualization. Instead, many steps are often left as black-boxes to the user in order to focus on a subset of steps within the PVA pipeline. The most commonly neglected steps tend to be the data preprocessing step and the formal validation step that utilizes testing of the prediction model. The formal validation step is quite rare in visual analytics papers, though simple performance measures are often reported. Given the lack of full pipeline support, we categorized the surveyed papers according to which of the PVA pipeline steps are supported.

4.1. Data Preprocessing

Data Preprocessing has two objectives. The first objective is to understand the data, and the second objective is to prepare the data for analysis. Typical preparation approaches include data cleaning, encoding, and transformation. Examples of systems where data preprocessing is firmly integrated into the predictive analysis loop include the work by Krause et al. [KPS16], which presents a visual analytics system, COQUITO, focusing on cohort construction by iteratively updating queries through a visual interface (Figure 4). The usability of this system has been demonstrated in diabetes diagnosis and social media pattern analysis. Other examples include the Peak-Preserving time series prediction by Hao et al. [HJM∗11], which supports noise removal prior to building prediction models for seasonal time series data. Lu et al. [LWM14] propose a system for predicting box-office revenue from social media data. Their system allows users to refine features by deleting noisy Twitter data, which then updates the feature values. Other systems, such as iVisClustering [LKC∗12] and iVisClassifier [CLKP10], mention the data encoding process (i.e., given a text document, iVisClustering encodes the document set as a term-document matrix using a
Figure 5: Examples of feature engineering in the PVA pipeline. (a) INFUSE [KPB14] presented a novel feature glyph to visualize different measures of a feature as part of classification's cross-validation. This system supports reordering and other measure inspection for feature selection. (b) Segmented linear regression [MP13] is supported in this visual analytics system with 1D and 2D views of regression performance on simple features and pairwise feature interactions. (c) May et al. [MBD∗11] use dependencies and interdependencies between feature space and data space to guide feature selection.
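The segmented linear regression of Figure 5b can be approximated by a simple breakpoint search. This is a hedged sketch of the general idea, not the method of [MP13]: fit an ordinary least-squares line to each side of a candidate breakpoint and keep the split with the lowest total squared error.

```python
# Sketch of segmented (two-piece) linear regression on invented data:
# exhaustively try each breakpoint and keep the one minimizing the
# combined sum of squared errors (SSE) of the two per-segment OLS fits.
def ols(xs, ys):
    """Closed-form simple linear regression; returns (slope, intercept, sse)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    var = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var if var else 0.0
    intercept = my - slope * mx
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, sse

def best_breakpoint(xs, ys):
    best = None
    for i in range(2, len(xs) - 1):          # need >= 2 points per segment
        sse = ols(xs[:i], ys[:i])[2] + ols(xs[i:], ys[i:])[2]
        if best is None or sse < best[1]:
            best = (xs[i], sse)
    return best

xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 1, 2, 3, 3, 3, 3, 3]   # invented data whose slope changes at x = 3
bp, sse = best_breakpoint(xs, ys)
```

A visual analytics system would instead let the user inspect the per-segment fits and move the breakpoint interactively rather than relying on the error criterion alone.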
Figure 7: Examples of result exploration for predictive analytics. (a) Afzal et al. [AME11] presented a decision history tree view to show prediction models' results as a branching time path. Users were allowed to add/remove mitigation measures for epidemic modeling. (b) Slingsby et al. [SDW11] present interactive graphics with color-coded maps and parallel coordinates to explore uncertainty in area classification results from the Output Area Classification (OAC). (c) Rauber et al. [RFFT17] use projections to visualize the similarities between artificial neurons and reveal the inter-layer evolution of hidden layers after training.
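The neuron similarities projected in Figure 7c are defined over activation vectors: each neuron is described by its activations on a set of inputs, and neurons with similar vectors should land close together in the projection. The sketch below illustrates one such similarity measure on invented activation data; the actual work projects these vectors with a dimensionality-reduction technique rather than reporting raw pairwise similarities.

```python
# Sketch of the neuron representation behind Figure 7c:
# activations[i][j] = activation of neuron i on input sample j (invented),
# and cosine similarity quantifies how alike two neurons' behaviors are.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

activations = [
    [0.9, 0.1, 0.8, 0.2],   # neuron A
    [0.8, 0.2, 0.9, 0.1],   # neuron B: behaves much like A
    [0.1, 0.9, 0.2, 0.8],   # neuron C: behaves oppositely
]
sim_ab = cosine(activations[0], activations[1])
sim_ac = cosine(activations[0], activations[2])
```

A projection that preserves these similarities would place neurons A and B near each other and neuron C far away, which is the kind of structure the hidden-layer views aim to reveal.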
data labels while constructing the tree. Backtracking in the tree construction phase is also supported. More recent work includes BaobabView [VDEvW11] (Figure 6), which supports manual decision tree construction through visual analytics. BaobabView enables the model developer to grow the tree, prune branches, split and merge nodes, and tune parameters as part of the tree construction process. As shown in Figure 6, the color and the width of each branch represent the class and the sample size, respectively. Users are able to manually choose the splitting attributes and the split values at the nodes. Additionally, neural networks and support vector machines have also been incorporated into predictive visual analytics work where the focus is on enabling users to understand the black-box modeling process of these algorithms [LSL∗ 16, TM05, CCWH08].

What we find in the modeling stage of the PVA pipeline is that a major focus is on both model configuration and model comprehensibility. Currently, some of the most popular classification algorithms are inherently black-box in nature, which has led researchers to ask how and why certain algorithms come to their conclusions. Challenges here include how much of the model should be open and configurable to the user and what the best sets of views and interactions are for supporting modeling. Again, we see a relatively tight coupling of this stage of the PVA pipeline with the feature engineering stage. This is likely due to the iterative nature of the knowledge foraging process [PC05].

4.4. Result Exploration and Model Selection

Once a model is generated, the next step in the PVA pipeline is to explore the results and compare the performance among several model candidates (if more than one model is generated). In this step, scatterplots, line charts, and other diagnostic statistical graphics are often the primary means of visualization, and many variations of these statistical graphics have been proposed, e.g., the line chart with confidence ranges and future projections [BZS∗ 16], the node-link layout for hierarchical clustering results [BvLH∗ 11], etc. In this phase, systems tend to support connect interactions to highlight and link relationships to explore and compare the outputs of the modeling process under different feature inputs.

Examples of result exploration in PVA include Afzal et al. [AME11], who present a decision history tree view to analyze disease mitigation measures. Users can analyze the future course of epidemic outbreaks and evaluate potential mitigation strategies by flexibly exploring the simulation results and analyzing the local effects in the map view (Figure 7a). Different paths can be displayed, revealing prediction outcomes under different settings by deploying selected strategies. In this way, the user can explore model results and decide which strategy to use while comparing multiple cases.

Slingsby et al. [SDW11] present geodemographic classification results using a map view, parallel coordinates, and a hierarchical rectangular cartogram. The parallel coordinates view is used to drive the Output Area Classification model and compare classification results given different parameterizations. The classification results and their uncertainty are also visualized on a map view, as shown in Figure 7b. Color is used to indicate the class and lightness is used to represent the uncertainty on the map; lighter areas tend to be less typical of their allocated class. Rauber et al. [RFFT17] use dimension reduction techniques to project data instances and neurons in multilayer perceptrons and convolutional neural networks to present both the classification results and the relationships between artificial neurons (Figure 7c). Dendrogramix [BDB15] interactively visualizes clustering results and data patterns from agglomerative hierarchical clustering (AHC) by combining a dendrogram and a similarity matrix. iVisClustering [LKC∗ 12] visualizes the output of Latent Dirichlet Allocation by displaying cluster relationships based on keyword similarity in a node-link cluster tree view. Users can explore the model results, and interactions support model refinement. Alsallakh et al. propose the confusion wheel [AHH∗ 14] to visualize true positive, false positive, false negative, and true negative classification results.

Along with result exploration, model selection methods have
been employed to compare prediction results and model quality under different parameterizations. For example, Squares [RAL∗ 17] is a visual analytics technique to visualize the performance of multiclass classification results at both the instance level and the class level (Figure 8). Squares uses small multiples to support the analysis of the probability distribution of each class label in a classification, and it uses squares to visualize the prediction result and error type for each instance. Pilhöfer et al. [PGU12] use Bertin's Classification Criterion to optimize the display order of nominal variables and better interpret and compare clustering results from different models. Other techniques have explored methods for visually comparing clustering results under different parameters [MMT∗ 14, ZLMM16, ZM17] in geographical displays.

Figure 8: An example of model selection. Squares [RAL∗ 17] uses small multiples composed of grids of different colors and visual textures to display the distribution of probabilities in classification.

From our survey, we observe that there are many PVA works supporting result exploration and model selection. However, one under-supported topic is model comparison, i.e., comparing the results of two different classifiers such as a decision tree and a support vector machine. Furthermore, we note that many systems also have a distinct lack of provenance and history support. Result exploration often gets tied into the feature engineering process as many systems have been developed for feature steering and selection. As features are modified, results from the model are presented. Without the ability to save results, however, comparison can be difficult even within a model.

4.5. Validation

Finally, once a model is generated and the results are explored, validation is performed to test model quality. After training, held-out data (i.e., data that has not been used in the first four stages of the PVA pipeline) can be used to evaluate the performance of the model. This step is critical to verify the adequacy of the model and that the model generalizes well to future or unknown data. Statistical measures such as accuracy, precision, recall, mean square error (MSE), and predicted R² are commonly used to evaluate model performance. Currently, we do not consider the user's enjoyment measurement [BAL∗ 15] as part of the validation step in the PVA pipeline, but we argue that such measures should be an integral part of a PVA system. Similarly, efficiency and scalability are also not considered.

Common visualizations used in machine learning and data mining for model validation include residual plots, the receiver operating characteristic (ROC) curve, and the auto-correlation function (ACF) plot. Hao et al. [HJM∗ 11] present a visual analytics approach for peak-preserving predictions where they visualize the certainty of the model on future data using a line chart with a certainty band. This work explores model accuracy on the training time series data by using color codes. K-fold cross-validation is used in INFUSE [KPB14] for feature selection, and Andrienko et al. [AAR∗ 09] apply the classifier to a new set of large-scale trajectories and calculate the mean distance of the class members to the prototype for validation.

Validations in PVA systems are also often done through case studies. An example case study was the 2013 VAST Challenge on box office predictions [KBT∗ 13, LWM13], where participating teams submitted predictions of future box-office revenues and ratings of upcoming movies using their visual analytics systems (over the course of 23 weeks). The performance of these tools has been reported in follow-up papers [LWM14, EAJS∗ 14] and provides insights into the current design space of PVA. Other works include statistical tests for validating their PVA approaches. For example, BaobabView [VDEvW11] compares its classification accuracy to the automatic implementation of C4.5 [Qui14] on an evaluation set. Heimerl et al. [HKBE12] separate training and test data and provide a detailed performance comparison of the three models they discussed to illustrate that the user-driven classification model outperforms the others. Similar examples can be found in works from Ankerst et al. [AEEK99], Kapoor et al. [KLTH12], and Seifert and Granitzer [SG10].

What we observe in the survey is that validation is perhaps the most under-served stage in the PVA pipeline. In many PVA systems, the user is allowed to interact until the model outputs match their expectation; however, such a process is dangerous as it allows the user to inject their own biases into the process. More research should be done to explore the extent to which humans should be involved in the predictive analytics loop. This requires validation on the user side and methods for measuring a user's model comprehension. Insight generation should also be considered alongside measures of the predictive accuracy of the model.

5. Interactions in PVA

In addition to sorting papers by the stages that they support in the PVA pipeline, we also want to consider the types of interactions that are supported. We recognize that the types of interaction being supported can have both exploratory and explanatory value to the user while promoting the broader goal of making accurate predictions. We sort the PVA papers based on the interaction categories proposed by Yi et al. [YaKSJ07] and propose an additional interaction category, Shepherd. By Shepherd, we mean that the interactions could enable the user to guide the modeling process either directly or indirectly. This interaction type partially includes the annotation and labeling interaction type proposed by Sacha et al. [SZS∗ 16] (excepting those for information enrichment). During our classification, we also considered categorizing papers based on the use of semantic interaction [EFN12]. Given that semantic interaction intersects with multiple interaction types (e.g., searching, highlighting, annotating, and rearranging) we have chosen not to add this
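The statistical measures named in Section 4.5 (accuracy, precision, recall, and MSE) have direct definitions; below is a minimal sketch for binary labels (function and variable names are illustrative, not taken from any surveyed system):

```python
# Sketch of the validation measures from Section 4.5, computed on held-out
# data. Labels are assumed to be 0/1; "positive" selects the positive class.

def confusion_counts(y_true, y_pred, positive=1):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return tp, fp, fn, tn

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision(y_true, y_pred):
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp) if tp + fp else 0.0

def recall(y_true, y_pred):
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if tp + fn else 0.0

def mse(y_true, y_pred):
    # For regression-style outputs rather than class labels.
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
```

The four confusion counts are also exactly what displays such as the confusion wheel [AHH∗ 14] and Squares [RAL∗ 17] break down per class and per instance.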
Figure 9: User-interaction examples in PVA systems. (a) An example of select and elaborate: A map view shows selected flight trajectories in black. Details are displayed in a textbox overlay. The user can select flight trajectories and view them in a different color at the forefront [BJA∗ 15]. (b) An example of explore: Here, a panel displays a simulated ocean surface in 3D. The user can explore the surface by panning [HMC∗ 13]. (c) An example of reconfigure: A parallel coordinates plot shows the five most similar objects highlighted in red. The user can reconfigure the display by selecting the primary axis from the drop-down menu [LKT∗ 14].
Figure 10: An example of the encode interaction. An icon-based visualization of multidimensional clusters [CGSQ11] shows how the icon encodes skewness information for different values of s. The icon on the right encodes both kurtosis and skewness information. The user can interact with the visualization to enable or disable position and shape encoding of skewness and kurtosis.

Figure 11: A hybrid visualization of hierarchical clustering results demonstrating the cluster-folding feature of Dendrogramix [BDB15]. The folding interaction is an instance of abstraction.
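The cluster folding shown in Figure 11 can be thought of as toggling the granularity at which a cluster hierarchy is displayed. A minimal sketch of that idea (not Dendrogramix's actual implementation; the class and method names are assumptions):

```python
# Minimal sketch of fold/unfold on a cluster hierarchy: folding a cluster
# abstracts its members into one aggregate node; unfolding elaborates them.

class Cluster:
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []   # leaves have no children
        self.folded = False

    def toggle_fold(self):
        self.folded = not self.folded

    def visible_labels(self):
        # A folded (or leaf) cluster is shown as a single node;
        # an unfolded cluster shows its descendants instead.
        if self.folded or not self.children:
            return [self.label]
        return [l for c in self.children for l in c.visible_labels()]
```

Calling `toggle_fold` on a subtree corresponds to the abstract interaction; calling it again (unfolding) corresponds to elaborate.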
Figure 13: Examples of shepherd interactions. (a) A multidimensional scaling scatterplot. The user can manipulate the visualization to reweight the distance function so that it better reflects her mental model [BLBC12]. (b) After the distance function is adjusted, the same red and blue points marked in the left figure now appear in the right figure with red and blue halos. (c) ManiMatrix, an interactive system that allows users to directly manipulate the confusion matrix to specify preferences and explore the classification space [KLTH10].
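The distance-function reweighting in Figure 13a-b can be sketched as a per-feature weighted Euclidean distance whose weights are adjusted from user feedback. The update rule below is an illustrative stand-in, not the actual method of [BLBC12]:

```python
# Sketch of interactive distance reweighting: when the user asserts that two
# points should be closer, features on which they differ are down-weighted,
# and the weights are renormalized. The specific update rule is assumed.

from math import sqrt

def weighted_distance(p, q, w):
    return sqrt(sum(wi * (pi - qi) ** 2 for wi, pi, qi in zip(w, p, q)))

def update_weights(w, p, q, rate=0.5):
    diffs = [(pi - qi) ** 2 for pi, qi in zip(p, q)]
    raw = [wi / (1.0 + rate * d) for wi, d in zip(w, diffs)]
    total = sum(raw)
    return [r / total * len(raw) for r in raw]
```

After the update, re-running the projection with the new weights moves the two marked points closer, which is the effect illustrated by the halos in Figure 13b.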
filter interaction on both feature selection and data sample selection in model building. To be specific, users can brush on the axis shown in Figure 9c to filter out movies with a feature value out of the brushed range.

5.6. Abstract/Elaborate

Abstract/Elaborate interactions enable the user to view the data representation at various levels of granularity. As data sets become larger, methods for aggregating and abstracting the data become critical to provide an overview at different stages in the PVA pipeline. An example of Abstract/Elaborate in the PVA pipeline is Dendrogramix [BDB15], a hybrid tree-matrix visualization that superimposes the relationship between individual objects onto the hierarchy of clusters. Dendrogramix enables users to explain why particular objects belong to a particular cluster and to blend information from both clusters and individual objects, which was not well supported by previous cluster visualization techniques. As shown in Figure 11, users can label clusters to generate a folded cluster containing these sub-classes, and we consider this interaction an instance of abstract. Users are also allowed to unfold the cluster by clicking on it, and we consider this interaction an instance of elaborate.

5.7. Connect

Connect interactions highlight links and relationships between entities or bring related items into view. Additionally, connect can be used to highlight features of the same entity distributed throughout a visualization grouped by features, such as a treemap [CGSQ11], or to highlight the same item across multiple coordinated views [BWMM15]. As shown in Figure 12, the treemap displays the multidimensional clusters of people based on their risk of having different diseases. Users can click on a region in the treemap to highlight other regions associated with the same person. Other examples of connect in PVA include the co-cluster analysis of bipartite graphs [XCQS16]. In this system, the user clicks on a record to select all related records and highlight all selected records across multiple views. This type of interaction is also commonly used in feature and data subspace search to help users understand clustering results. Work by Tatu et al. [TMF∗ 12] enables the connect interaction when selected clusters of objects are highlighted by colors among other subspaces and coordinated views for comparative exploration of their grouping structures.

5.8. Shepherd

The final interaction category is shepherd, which enables the user to guide the modeling process. Such guidance can be direct or indirect. Direct shepherding includes choosing model parameter settings (such as choosing the number of clusters in k-means) and model type selection (such as switching from k-means to hierarchical clustering). Indirect shepherding includes strong constraints (e.g., setting distance thresholds [AAR∗ 09], redefining distance functions interactively as in Figures 13a and 13b [BLBC12], or manipulating the confusion matrix as in Figure 13c) and soft constraints (e.g., model changing direction such as the expected classification distribution [KLTH12]) from the visual interface. While our definition of this category is broad enough to encompass annotation interactions, such as the expert labeling of automatically misclassified data points, and other interactions that re-draw decision boundaries [MvGW11], we feel this needed to be a distinct category as the act of directing a model is unique to our context. For example, changing a feature value to update the model is one intuitive way of exploring "what if" questions in predictive analytics. We think there is a meaningful difference, as far as interaction is concerned, between changing the value of a target feature (class) and any other feature. Furthermore, if we are willing to concede that annotation
or class labeling (in specific instances) is a shepherd interaction, then it follows that tweaking the features of a data point to gain a deeper understanding of the effect that a particular feature has on prediction is also an instance of shepherd.

To illustrate this kind of interaction, consider Prospector, a visual analytics system for inspecting black-box machine learning models [KPN16]. Prospector enables the user to change the feature values of any data point to explore how that point's probability of belonging to some class is affected. Another example of shepherd can be found in Xu et al. [XCQS16], which proposes an interactive co-clustering visualization to facilitate the identification of node clusters formed in a bipartite graph. Their system allows the user to adjust the node grouping to incorporate their prior knowledge of the domain and supports both direct and indirect shepherd. Directly, the user can split and merge the clusters. Indirectly, the user can provide explicit feedback on the cluster quality, and the system will use the feedback to learn a parameterization of the co-clustering algorithm that aligns with the user's mental model.

Figure 14: Visual summary of the co-occurrence and correlation analysis of categories. (a) The co-occurrence matrix shows how often categories of interactions, model types used, and stages of the predictive visual analytics pipeline overlapped in our classification. (b) The correlation matrix shows the relationship between categorization terms. Blue indicates positive correlation, red indicates negative correlation, and white indicates no correlation. Note that Result Exploration refers to the Result Exploration and Model Selection step of the PVA pipeline due to the space limitations of the graph.

6. Discussion

After several iterations of paper selection and refinement, we have been able to summarize the state-of-the-art in predictive visual analytics. In doing so, we have defined a pipeline for predictive visual analytics and described user interactions with different roles in terms of predictive modeling. In addition, we have investigated how the prediction tasks and interactions correlate with stages in the predictive visual analytics pipeline. Specifically, we have counted the number of times that categories co-occurred in our classification scheme and calculated a Pearson correlation coefficient for the pairwise categories based on our 72 labeled papers. Figure 14 provides a visual summary of the co-occurrence and correlation analysis with a symmetric matrix view. In summary, some interesting patterns found from these two analyses include:

• Function-related interactions, such as select, filter, connect, and shepherd, are more fundamental to PVA and have been implemented more often than encode and reconfigure.
• PVA stages are not often jointly supported, and the modeling stage is relatively uncorrelated to all other stages.
• The data preprocessing and validation stages are less supported than other stages.

Co-Occurrence Between Categories: In Figure 14a, each cell describes the number of co-occurrences between two categories, where the top left block represents interaction co-occurrence, the middle block represents PVA pipeline stage co-occurrence, and the lower left block represents co-occurrence between the types of prediction tasks. The colors on the diagonal provide insight into the frequency at which a category is likely to be encountered in our data.

With respect to the interactions, we find that select, explore, connect, and filter are the most widely used types of interaction and show strong co-occurrence in the data. Together they cover 59 out of 72 papers. We see relatively fewer PVA methods that take advantage of reconfigure, encode, and abstract/elaborate interactions. Finally, given the modeling nature of PVA, we see strong support for shepherd interactions as well, with 32 papers. The co-occurrence of shepherd with other interaction types is evenly distributed across all
Figure 15: Our web-based Predictive Visual Analytics Browser supports PVA paper organization, categorization, quick search, and filtering.
other interaction types. We note that some interaction types are represented much more than others in the survey. This may be due to the fact that functions related to data exploration and model building (which require a lot of select, filter, connect, and shepherd interactions) are more fundamental to the PVA pipeline. Interactions that drastically modify the underlying visualization (reconfigure and encode) are supported far less often in the papers surveyed.

The co-occurrence matrix also confirms that current PVA works have tended to focus on modeling and result exploration. These stages of the pipeline are represented in 60 out of 72 papers. We also note that the validation stage is rarely covered. This may indicate that, much like data preprocessing, the steps that go into model validation require interactions and views that are disjoint from other steps in the PVA pipeline. We also note that there is very little strong co-occurrence between different stages in the PVA pipeline. This indicates that a strong coupling between steps in the PVA pipeline is still an ongoing research challenge.

Finally, observations on the modeling types indicate little co-occurrence between the models. This is unsurprising, as different models tend to require different visualization approaches within the PVA pipeline, and individual PVA systems usually utilize just one type of model (with very few exceptions). We also note that there are no overly dominant interaction type associations with the various model categories.

Correlation Between Categories: In Figure 14b, each cell represents the correlation between two categories. Blue indicates positive correlation, red indicates negative correlation, and white indicates no correlation. Darker shades of blue and red correspond to larger positive and negative correlations, respectively. As in the co-occurrence matrix, we order the correlation matrix by interactions, PVA pipeline stages, and the model types.

From this analysis, we find that the four interaction types select, explore, connect, and filter are highly correlated, indicating that if one of these interactions is supported, the other three interactions are also likely to be supported. Positive correlations between select, explore, connect, and filter and the PVA pipeline stage of result exploration and model selection can also be observed, indicating that these interactions are highly relevant to this stage of the pipeline. Among other stages in the PVA pipeline, positive correlations can be observed between data preprocessing and feature engineering (0.29) as well as between result exploration and model selection and validation (0.23). Surprisingly, the modeling stage is found to be relatively uncorrelated to all other stages in the PVA pipeline, indicating that more support is needed. Among the 18 papers having validation, 14 of them co-occur with result exploration and model selection. Similarly, among the 20 papers on data preprocessing, 12 also include feature engineering. This pattern conforms with the typical knowledge discovery process, where feature engineering can be intertwined with data preprocessing (e.g., feature extraction in image processing and text analysis). The negative correlation between modeling and result exploration and model selection (-0.23) indicates that the current stages of the PVA pipeline are supported rather disjointly.

The shepherd interaction category has a positive correlation with the modeling stage; 24 out of the 32 shepherd papers co-occur with modeling. One example of the exceptions is Prospector [KPN16], where shepherd is used to tweak the features for "what if" analysis and understanding the effect of the features on the predictions. The connect interaction is positively correlated to the result exploration and model selection stage (0.40). Shepherd interactions are mostly used for modeling with a positive correlation (0.45). Feature engineering tends to use more filter and abstract/elaborate interactions. The interaction categories and pipeline steps are generally
7. Future Directions in PVA

Our analysis of the predictive visual analytics literature identified a number of trends in the current state-of-the-art. Here, we use that analysis to identify some key challenges for future work.

7.1. Knowledge Generation and Integration in PVA

Our definition of predictive analytics (Section 1) focuses on the use of data-centric methods for modeling. However, such modeling is rarely purely data-driven: human knowledge is valuable in both directions. First, human knowledge can be put into the models to improve their performance. Second, constructed models can be used by stakeholders to gain knowledge, although by our definition the goal of knowledge generation always occurs alongside the primary goal of making predictions. Predictive visual analytics offers the opportunity to help with knowledge transfer in both directions.

Knowledge acquisition in modeling: Visual analytics owes its success, in part, to its ability to integrate expert knowledge and tacit assumptions into the analysis process. The interaction techniques of visual analytics systems form an interface between the expert's mental ontology and the computer. Thus, the expert is able to supply missing information. We found four knowledge types that are generally well supported by interactive systems: taxonomic, relational, germane, and hazy, listed in decreasing order of exactitude (Figure 16). In what follows, we provide PVA system examples for each knowledge type.

Taxonomic knowledge is the most exact and is sufficient for assigning the relevant class or classes at the desired level of granularity to a known data point. The expert is thus able to rely on a mental concept hierarchy. This type of knowledge enables the expert to generate training data for a supervised model [SG10, HNH∗ 12], correct model classification errors [BTRD15, MW10], and create more or fewer class labels as needed for the prediction task [AEK00].

Less exact, relational knowledge is sufficient for grouping data points [DDJB13, BO13, GRM10, BAL∗ 15], merging or splitting clusters [XCQS16], validating point-to-point proximity [BLBC12], and making queries about similar points [LKT∗ 14]. It is knowledge about what is similar or different, what is near or far, and what is closely related. Relational knowledge is the most common type of knowledge integrated in PVA systems and is especially useful for unsupervised learning.

Germane knowledge results from experience-based intuition. It is informal and either impossible to articulate or hard to articulate without deep reflection. Broadly, it is knowledge about what is relevant. Germane knowledge forms a basis for setting good model hyper-parameters, choosing appropriate thresholds or cut-offs, selecting relevant ranges or data subsets [BJA∗ 15, MBD∗ 11], and excluding outliers from the analysis process [BWMM15].

Hazy knowledge, as the name suggests, is the least exact type of knowledge. It enables the expert to feel satisfied or dissatisfied with the direction and progress of the analysis [HOG∗ 12], hint at error tolerance levels [KLTH10], and provide feedback about the interestingness of different models or visualizations through system-usage patterns [SSZ∗ 16].

Providing human knowledge in the predictive model construction process is clearly valuable. However, opening the black box of predictive models for human intervention is not without issues. By giving users the option to integrate their domain knowledge, we have also allowed them to inject bias into the model. What is the point of using technology to learn something new if you are bending it to fit your pre-existing notions? More seriously, how can we regulate or constrain knowledge integration so that we get the benefits of domain knowledge, social and emotional intuition, and
minimize the costs of introducing bias? How much human-in-the-loop is the right amount? A recent study [DSM16] ran experiments on incentivized forecasting tasks where participants could choose to use forecasting outputs from an algorithm or provide their own inputs. The study found that letting people adjust an imperfect algorithm's forecasts increased both their chances of using the algorithm and their satisfaction with the results. However, the authors also found that participants in the study often worsened the algorithm's forecasts when given the ability to adjust them. This further brings into question how much interaction should be provided in the PVA pipeline. Results from the forecasting study also indicated that people were insensitive to the amount that they could adjust the forecasts, which may indicate that interaction as a placebo could be an option. Given these results, it is clear that more studies are needed to provide clear guidelines for predictive visual analytics methodologies.

Gaining Knowledge from Models: Predictive visual analytics approaches can also help users gain knowledge from the constructed models. Ideally, a good PVA system should enable the expert to climb the knowledge hierarchy introduced above. For example, suppose Darcy is analyzing a database of people's names [GRM10]. She views a visual cluster layout and begins grouping names together based on her relational knowledge. Darcy knows that Charlie is more like Matthew and less like Yao; moreover, she sees that the clustering algorithm positions Charlie and Matthew near each other and away from Yao, but Yao appears near Liu. Soon, she realizes that the clustering algorithm is grouping people's names by nationality. If Darcy is able to identify the nationality of each cluster of names, she has gained taxonomic knowledge.

The set of analyses an expert can perform by the type of her knowledge increases monotonically from hazy to taxonomic (Figure 16); that is, an expert who has taxonomic knowledge can also group points together (relational), exclude outliers (germane), and feel satisfied or dissatisfied with the progress and direction of the analysis (hazy). Of course, an expert can at once have different levels of knowledge about different things: for example, a forensic analyst can have taxonomic knowledge about the different establishments a suspect entered, but only germane knowledge about the suspect's behavioral patterns.

Looking at recent work in the visual analytics community, we identify two main roles that PVA techniques generally play in extracting human knowledge from predictive models.

1. PVA methods provide multiple exploration views and models to enable data exploration and hypothesis generation. The injec-

merge and split clusters, query about similar points and can correct classification errors or add labels to the data.

7.2. User Types Supported in PVA

While knowledge generation and integration are critical, the ability to climb the knowledge hierarchy may also be directly related to the type of user interacting with the system. Different users/analysts have different knowledge to contribute to the predictive analytics process, and they also have different demands of the system. As such, an ongoing challenge in predictive visual analytics is how to tailor systems for certain classes of users and what interactions within the PVA pipeline should be instantiated or hidden to reduce the potential for incorrect modeling. We identify three types of users to consider when developing PVA methods, based on their knowledge:

• End-users are experts in neither the domain nor the prediction methodology. They usually lack the necessary knowledge to understand advanced prediction models, and they may not be interested in mastering the technical knowledge that is auxiliary or accidental to the analysis. Examples of PVA systems that support end users include the works by Lee et al. [LKC∗ 12] and van den Elzen et al. [vdEHBvW16]. Heimerl et al.'s work [HKBE12] has a low demand for the user's knowledge as long as they have experience using web search engines.
• Domain experts master the knowledge in a particular field but generally are not experts in predictive modeling. Examples of PVA systems that support domain experts include the works by Zhang et al. [ZYZ∗ 16], Jean et al. [JWG16], and Migut and Worring [MW10].
• Modeling experts are the analysts that know the general process and reasoning behind predictive modeling but lack the specialized knowledge for specific applications. Examples of PVA applications that support modeling experts include the works by van den Elzen and van Wijk [VDEvW11] and Mühlbacher et al. [MPG∗ 14].

Our survey found that the majority of PVA methods being developed focus on domain users. Given that a major goal of predictive visual analytics is to increase algorithmic performance via domain knowledge injection by the user, this is not surprising. However, systems that are useful for domain experts may not be useful for others. There is a need for explainability in science not only for experts, but also for the general public. As such, further research into how users with different backgrounds and goals interact with such systems is also an open area for exploration. Should all knowledge injection techniques be open for normal users or should there be techniques specific to domain users? Is there a hierarchy of which stages in the PVA pipeline to open based on user type? Does the
tion of the domain knowledge is carried out indirectly through importance of explainability vary for different users as well? These
the exploration as interesting patterns or discoveries are found. questions require further research in predictive visual analytics.
We would classify these interactions as supporting hazy and ger-
mane knowledge. Here, analysts can reflect about the model and
7.3. XAI and PVA
choose appropriate data ranges.
2. PVA methods focus on opening the black box of prediction Data and computational resources are rapidly becoming more
models. The goal is to improve understandability of the mod- widely available. A central trend in modeling has been to employ
eling process and to enable domain knowledge injection into increasingly large and sophisticated models that exploit these. This
the modeling process. We would classify this as supporting re- trend is typified by “deep learning” approaches that apply large
lational knowledge and taxonomic knowledge where analysts scale network models to modeling problems such as prediction.
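As a minimal, self-contained sketch of one XAI idea relevant to this section, explaining a black-box predictor by fitting a simple, readable model to its behavior around a point of interest (the approach popularized by LIME), consider the following toy example. The predictor `black_box` and all names are hypothetical, and the sketch uses only the Python standard library; it is an illustration of the local-surrogate idea, not any surveyed system's actual implementation.

```python
import random

# Hypothetical black-box predictor: an analyst can query it but not read it.
def black_box(x1, x2):
    return 1.0 if (x1 * x1 + 0.5 * x2 > 1.0) else 0.0

def local_surrogate(x1, x2, n=500, radius=0.1, seed=0):
    """Fit a small linear surrogate around (x1, x2) by ordinary least
    squares on perturbed samples: the core of local explanation methods."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        p1 = x1 + rng.uniform(-radius, radius)
        p2 = x2 + rng.uniform(-radius, radius)
        xs.append((1.0, p1, p2))  # bias term plus the two features
        ys.append(black_box(p1, p2))
    # Solve the 3x3 normal equations (X^T X) w = X^T y by Gaussian elimination.
    a = [[sum(r[i] * r[j] for r in xs) for j in range(3)] for i in range(3)]
    b = [sum(r[i] * y for r, y in zip(xs, ys)) for i in range(3)]
    for i in range(3):
        for j in range(i + 1, 3):
            f = a[j][i] / a[i][i]
            a[j] = [v - f * u for v, u in zip(a[j], a[i])]
            b[j] -= f * b[i]
    w = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        w[i] = (b[i] - sum(a[i][j] * w[j] for j in range(i + 1, 3))) / a[i][i]
    return w  # [bias, weight_x1, weight_x2]: readable local feature weights

# Near the decision boundary at (1.0, 0.0), increasing either feature pushes
# the black box toward class 1, so both local weights come out positive.
bias, w1, w2 = local_surrogate(1.0, 0.0)
```

A PVA system would surface such weights visually; the point here is only that the surrogate, unlike the black box it queries, has directly readable parameters.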
Such approaches are often able to leverage large data to achieve impressive performance. However, this performance is not without a cost: such models are large, complex, and constructed automatically (especially in terms of feature engineering), making them difficult to interpret. While the predictive results may be accurate, if the generated model lacks interpretable meaning, then its predictive power is hampered [Pea03]. Interpretability is an important concern whenever AI techniques are utilized, and this problem is exacerbated with the emergence of deep models. Research addressing the challenges of interpreting complex models is often referred to as explainable AI (or XAI for short). Interpretable models can serve many goals for a variety of stakeholders [Gle16].

An example of the need for explainable AI is the self-driving car. Google's self-driving car project utilizes machine learning in order to generate models that can accurately process and respond to input from the car's sensors [Gui11]. The self-driving cars have now logged over 2 million miles on public roads with only a couple dozen accidents, only one of which was caused by the autonomous vehicle [Hig16]. This is an impressive safety record, but given the complexity of input and response the cars need to handle, it cannot be known whether the cars will respond well in every situation. This is a prime example of an accurate predictive model that lacks interpretability in a domain where the interpretability of the model is of grave importance.

There has been a perceived trade-off between model interpretability and performance; however, there may be other pathways to improving interpretability besides using simpler models with poorer performance. In terms of self-driving cars, it is conceivable that some safety performance would be traded for a simpler model that makes it easier to draft legislation and comply with regulations concerning autonomous vehicles [Sch16]; however, it is preferable to have both safety and comprehensibility. Research in explainable AI has explored approaches including generating descriptions of complex models and interpreting complex models through a series of simpler ones (e.g., LIME [HAR∗16]).

Predictive visual analytics is currently being used to make complex models generated by black-box AI techniques simpler to interpret, thus allowing for both accuracy and comprehensibility. PVA systems can provide tools that aid in supervised learning to generate interpretable classification models [HKBE12, AHH∗14] or to allow the injection of domain knowledge during the construction of decision trees [VDEvW11]. However, PVA must find ways to scale to the increasingly large and complex models that are in use for emerging applications. This will require the integration of analysis methods emerging from XAI research, as well as the development of more scalable interaction and visual paradigms. One source of ideas in this direction is to consider the entirety of the modeling process, not just examine the internals of the constructed models.

7.4. A Summary of Challenges in PVA

From our survey analysis and internal discussions, we have identified some key themes for future PVA research:

Integrating User Knowledge: With the integration of users' knowledge, bias may also be imported. As such, predictive visual analytics needs to capture and communicate not only data biases but also human interaction biases within the system. For real-world predictions, factors such as human sympathy and social knowledge are critical for predictions such as stock market performance, security attacks, and elections. The challenge is in adapting predictive visual analytics methods to include these factors, which are difficult to digitize and use in automatic models but may be critical in helping people make use of predictions.

Scaling to Larger and More Complex Models: As modeling is applied to more complex situations, new challenges emerge. For example, the arsenal of modeling techniques to address the wide range of needs is growing rapidly, making choosing proper model types challenging for analysts. Predictive visual analytics should enable users not only to compare different parameterizations of one model type but also to perform between-model comparison.

Historically, PVA systems have focused on simpler model types, such as decision trees, but they are currently evolving to consider more complex models, such as neural networks. PVA systems must not only meet user needs in model complexity, but also help in determining how to balance model complexity for better performance against the amount of knowledge users can add through greater comprehension.

As models grow larger and more complex, efficiency also becomes an issue. Much of the power of PVA comes from interactivity, which can be difficult to maintain as the modeling computations become more time consuming. Similarly, the visual approaches employed in PVA systems must be made to scale: modern data sets quickly grow beyond what can be presented directly in a visualization.

Better Model Validation Support: Model validation is a key part of predictive modeling, but it is not well supported by current PVA tools. Cross-validation and other introductory methods need further visual support, and as validation experiments become increasingly complex (e.g., Monte Carlo bootstrapping approaches), support for understanding these validations will become a key challenge. In addition, predictive visual analytics must also be concerned with validating the visualizations and interactions proposed.

Improving the User Experience: Users have difficulty understanding complex prediction models, and they can also have difficulty understanding and using complex PVA systems. While complex tasks are usually supported comprehensively by complex systems, new approaches must make this functionality available to a broad spectrum of users. A key aspect of improving the usability of PVA approaches will be to better consider and support both the analysis workflow and cooperation amongst analysts.

8. Acknowledgement

Some of the material presented here was supported by the NSF under Grant Nos. 1350573 and IIS-1162037 and in part by the U.S. Department of Homeland Security's VACCINE Center under Award Number 2009-ST-061-CI0001.

© 2017 The Author(s)
Computer Graphics Forum © 2017 The Eurographics Association and John Wiley & Sons Ltd.

Yafeng Lu, Rolando Garcia, Brett Hansen, Michael Gleicher & Ross Maciejewski / The State-of-the-Art in Predictive Visual Analytics

References

[AAR∗09] ANDRIENKO G., ANDRIENKO N., RINZIVILLO S., NANNI M., PEDRESCHI D., GIANNOTTI F.: Interactive Visual Clustering of
Large Collections of Trajectories. In IEEE Symposium on Visual Analytics Science and Technology (2009), IEEE, pp. 3–10. 8, 10, 13, 16, 23
[AEEK99] ANKERST M., ELSEN C., ESTER M., KRIEGEL H.-P.: Visual Classification: An Interactive Approach to Decision Tree Construction. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (1999), ACM, pp. 392–396. 8, 10, 23
[AEK00] ANKERST M., ESTER M., KRIEGEL H.-P.: Towards an Effective Cooperation of the User and the Computer for Classification. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (New York, NY, USA, 2000), KDD '00, ACM, pp. 179–188. 16, 23
[AHH∗14] ALSALLAKH B., HANBURY A., HAUSER H., MIKSCH S., RAUBER A.: Visual Methods for Analyzing Probabilistic Classification Data. IEEE Transactions on Visualization and Computer Graphics 20, 12 (2014), 1703–1712. 9, 18, 23
[AKK∗13] AUER C., KASTEN J., KRATZ A., ZHANG E., HOTZ I.: Automatic, Tensor-Guided Illustrative Vector Field Visualization. In IEEE Pacific Visualization Symposium (2013), IEEE, pp. 265–272. 3
[AKMS07] ASSENT I., KRIEGER R., MÜLLER E., SEIDL T.: VISA: Visual Subspace Clustering Analysis. ACM SIGKDD Explorations Newsletter 9, 2 (2007), 5–12. 23
[AME11] AFZAL S., MACIEJEWSKI R., EBERT D. S.: Visual Analytics Decision Support Environment for Epidemic Modeling and Response Evaluation. In IEEE Conference on Visual Analytics Science and Technology (2011), IEEE, pp. 191–200. 1, 7, 9, 23
[BAL∗15] BROOKS M., AMERSHI S., LEE B., DRUCKER S. M., KAPOOR A., SIMARD P.: FeatureInsight: Visual Support for Error-Driven Feature Ideation in Text Classification. In IEEE Symposium on Visual Analytics Science and Technology (2015), IEEE, pp. 105–112. 7, 10, 16, 23
[BDB15] BLANCH R., DAUTRICHE R., BISSON G.: Dendrogramix: A Hybrid Tree-Matrix Visualization Technique to Support Interactive Exploration of Dendrograms. In IEEE Pacific Visualization Symposium (2015), IEEE, pp. 31–38. 5, 9, 12, 13, 23
[BHJ∗14] BONNEAU G.-P., HEGE H.-C., JOHNSON C. R., OLIVEIRA M. M., POTTER K., RHEINGANS P., SCHULTZ T.: Overview and State-Of-The-Art of Uncertainty Visualization. In Scientific Visualization. Springer, 2014, pp. 3–27. 2, 3
[BJA∗15] BUCHMÜLLER J., JANETZKO H., ANDRIENKO G., ANDRIENKO N., FUCHS G., KEIM D. A.: Visual Analytics for Exploring Local Impact of Air Traffic. In Computer Graphics Forum (2015), vol. 34, Wiley Online Library, pp. 181–190. 11, 16, 23
[BL09] BERTINI E., LALANNE D.: Surveying the Complementary Role of Automatic Data Analysis and Visualization in Knowledge Discovery. In Proceedings of the ACM SIGKDD Workshop on Visual Analytics and Knowledge Discovery: Integrating Automated Analysis with Interactive Exploration (2009), ACM, pp. 12–20. 5
[BLBC12] BROWN E. T., LIU J., BRODLEY C. E., CHANG R.: Dis-Function: Learning Distance Functions Interactively. In IEEE Symposium on Visual Analytics Science and Technology (2012), IEEE, pp. 83–92. 7, 8, 11, 13, 16, 23
[BO13] BRUNEAU P., OTJACQUES B.: An Interactive, Example-Based, Visual Clustering System. In 17th International Conference on Information Visualization (July 2013), pp. 168–173. 8, 16, 23
[BPFG11] BERGER W., PIRINGER H., FILZMOSER P., GRÖLLER E.: Uncertainty-Aware Exploration of Continuous Parameter Spaces Using Multivariate Prediction. In Computer Graphics Forum (2011), vol. 30, Wiley Online Library, pp. 911–920. 11, 23
[BTRD15] BABAEE M., TSOUKALAS S., RIGOLL G., DATCU M.: Visualization-Based Active Learning for the Annotation of SAR Images. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 8, 10 (2015), 4687–4698. 16, 23
[But13] BUTLER D.: When Google Got Flu Wrong. Nature 494, 7436 (2013), 155. 1
[BvLBS11] BREMM S., VON LANDESBERGER T., BERNARD J., SCHRECK T.: Assisted Descriptor Selection Based on Visual Comparative Data Analysis. In Computer Graphics Forum (2011), vol. 30, Wiley Online Library, pp. 891–900. 7, 23
[BvLH∗11] BREMM S., VON LANDESBERGER T., HESS M., SCHRECK T., WEIL P., HAMACHER K.: Interactive Visual Comparison of Multiple Trees. In IEEE Conference on Visual Analytics Science and Technology (2011), IEEE, pp. 31–40. 9, 23
[BWMM15] BRYAN C., WU X., MNISZEWSKI S., MA K.-L.: Integrating Predictive Analytics Into a Spatiotemporal Epidemic Simulation. In IEEE Symposium on Visual Analytics Science and Technology (2015), IEEE, pp. 17–24. 11, 13, 16, 23
[BYJ∗13] BARLOWE S., YANG J., JACOBS D. J., LIVESAY D. R., ALSAKRAN J., ZHAO Y., VERMA D., MOTTONEN J.: A Visual Analytics Approach to Exploring Protein Flexibility Subspaces. In IEEE Pacific Visualization Symposium (2013), IEEE, pp. 193–200. 11
[BZS∗16] BADAM S. K., ZHAO J., SEN S., ELMQVIST N., EBERT D.: TimeFork: Interactive Prediction of Time Series. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems (2016), ACM, pp. 5409–5420. 9, 23
[CCWH08] CARAGEA D., COOK D., WICKHAM H., HONAVAR V.: Visual Methods for Examining SVM Classifiers. In Visual Data Mining. Springer, 2008, pp. 136–153. 9, 23
[CGSQ11] CAO N., GOTZ D., SUN J., QU H.: DICON: Interactive Visual Analysis of Multidimensional Clusters. IEEE Transactions on Visualization and Computer Graphics 17, 12 (2011), 2581–2590. 12, 13, 23
[CLKP10] CHOO J., LEE H., KIHM J., PARK H.: iVisClassifier: An Interactive Visual Analytics System for Classification Based on Supervised Dimension Reduction. In IEEE Symposium on Visual Analytics Science and Technology (Oct 2010), pp. 27–34. 1, 5, 6, 23
[Cra08] CRANOR L. F.: A Framework for Reasoning About the Human in the Loop. UPSEC 8 (2008), 1–15. 1
[CT05] COOK K. A., THOMAS J. J.: Illuminating the Path: The Research and Development Agenda for Visual Analytics. Tech. rep., Pacific Northwest National Laboratory (PNNL), Richland, WA (US), 2005. 5
[DDJB13] DUDAS P. M., DE JONGH M., BRUSILOVSKY P.: A Semi-Supervised Approach to Visualizing and Manipulating Overlapping Communities. In 17th International Conference on Information Visualisation (2013), University of Pittsburgh. 16, 23
[DPD∗15] DIEHL A., PELOROSSO L., DELRIEUX C., SAULO C., RUIZ J., GRÖLLER M., BRUCKNER S.: Visual Analysis of Spatio-Temporal Data: Applications in Weather Forecasting. In Computer Graphics Forum (2015), vol. 34, Wiley Online Library, pp. 381–390. 11, 23
[DSM16] DIETVORST B. J., SIMMONS J. P., MASSEY C.: Overcoming Algorithm Aversion: People Will Use Imperfect Algorithms If They Can (Even Slightly) Modify Them. Management Science (2016). 1, 17
[EAJS∗14] EL-ASSADY M., JENTNER W., STEIN M., FISCHER F., SCHRECK T., KEIM D.: Predictive Visual Analytics: Approaches for Movie Ratings and Discussion of Open Research Challenges. In An IEEE VIS 2014 Workshop: Visualization for Predictive Analytics (2014). 10
[Eck07] ECKERSON W. W.: Predictive Analytics: Extending the Value of Your Data Warehousing Investment. TDWI Best Practices Report. Q 1 (2007), 2007. 3
[EFN12] ENDERT A., FIAUX P., NORTH C.: Semantic Interaction for Visual Text Analytics. In SIGCHI Conference on Human Factors in Computing Systems (2012), ACM, pp. 473–482. 10
[GBP∗13] GOSINK L., BENSEMA K., PULSIPHER T., OBERMAIER H., HENRY M., CHILDS H., JOY K. I.: Characterizing and Visualizing Predictive Uncertainty in Numerical Ensembles Through Bayesian Model
Averaging. IEEE Transactions on Visualization and Computer Graphics 19, 12 (2013), 2703–2712. 23
[Gle16] GLEICHER M.: A Framework for Considering Comprehensibility in Modeling. Big Data (2016). 1, 5, 18
[GRM10] GARG S., RAMAKRISHNAN I. V., MUELLER K.: A Visual Analytics Approach to Model Learning. In IEEE Symposium on Visual Analytics Science and Technology (Oct 2010), pp. 67–74. 16, 17, 23
[Gro13] GROENFELDT T.: Kroger Knows Your Shopping Patterns Better Than You Do, October 2013. [Online; posted 27-October-2013]. 1
[Gui11] GUIZZO E.: How Google's Self-Driving Car Works. IEEE Spectrum Online, October 18 (2011). 18
[GWR09] GUO Z., WARD M. O., RUNDENSTEINER E. A.: Model Space Visualization for Multivariate Linear Trend Discovery. In IEEE Symposium on Visual Analytics Science and Technology (2009), IEEE, pp. 75–82. 7, 23
[HAR∗16] HENDRICKS L. A., AKATA Z., ROHRBACH M., DONAHUE J., SCHIELE B., DARRELL T.: Generating Visual Explanations. In European Conference on Computer Vision (2016), Springer, pp. 3–19. 18
[Hig16] HIGGINS T.: Google's Self-Driving Car Program Odometer Reaches 2 Million Miles. The Wall Street Journal (2016). 18
[HJM∗11] HAO M. C., JANETZKO H., MITTELSTÄDT S., HILL W., DAYAL U., KEIM D. A., MARWAH M., SHARMA R. K.: A Visual Analytics Approach for Peak-Preserving Prediction of Large Seasonal Time Series. In Computer Graphics Forum (2011), vol. 30, Wiley Online Library, pp. 691–700. 6, 10, 23
[HKBE12] HEIMERL F., KOCH S., BOSCH H., ERTL T.: Visual Classifier Training for Text Document Retrieval. IEEE Transactions on Visualization and Computer Graphics 18, 12 (2012), 2839–2848. 5, 6, 8, 10, 16, 17, 18, 23
[HMC∗13] HÖLLT T., MAGDY A., CHEN G., GOPALAKRISHNAN G., HOTEIT I., HANSEN C. D., HADWIGER M.: Visual Analysis of Uncertainties in Ocean Forecasts for Planning and Operation of Off-Shore Structures. In IEEE Pacific Visualization Symposium (2013), IEEE, pp. 185–192. 7, 11, 23
[HNH∗12] HÖFERLIN B., NETZEL R., HÖFERLIN M., WEISKOPF D., HEIDEMANN G.: Inter-Active Learning of Ad-Hoc Classifiers for Video Visual Analytics. In IEEE Symposium on Visual Analytics Science and Technology (2012), IEEE, pp. 23–32. 8, 16, 23
[HOG∗12] HOSSAIN M. S., OJILI P. K. R., GRIMM C., MÜLLER R., WATSON L. T., RAMAKRISHNAN N.: Scatter/Gather Clustering: Flexibly Incorporating User Feedback to Steer Clustering Results. IEEE Transactions on Visualization and Computer Graphics 18, 12 (Dec 2012), 2829–2838. 6, 8, 16, 23
[HPK11] HAN J., PEI J., KAMBER M.: Data Mining: Concepts and Techniques. Elsevier, 2011. 3, 5
[IMI∗10] INGRAM S., MUNZNER T., IRVINE V., TORY M., BERGNER S., MÖLLER T.: DimStiller: Workflows for Dimensional Analysis and Reduction. In IEEE Symposium on Visual Analytics Science and Technology (2010), IEEE, pp. 3–10. 7, 23
[Ins13] INSTITUTE S.: SAS 9.4 Language Reference Concepts, 2013. 2
[JKM12] JANKOWSKA M., KEŠELJ V., MILIOS E.: Relative N-Gram Signatures: Document Visualization at the Level of Character N-Grams. In IEEE Symposium on Visual Analytics Science and Technology (2012), IEEE, pp. 103–112. 23
[JWG16] JEAN C. S., WARE C., GAMBLE R.: Dynamic Change Arcs to Explore Model Forecasts. In Computer Graphics Forum (2016), vol. 35, Wiley Online Library, pp. 311–320. 17
[KBT∗13] KRÜGER R., BOSCH H., THOM D., PÜTTMANN E., HAN Q., KOCH S., HEIMERL F., ERTL T.: Prolix-Visual Prediction Analysis for Box Office Success. In IEEE Conference on Visual Analytics Science and Technology (2013). 10
[KJ13] KUHN M., JOHNSON K.: Applied Predictive Modeling. Springer, 2013. 2
[KK15] KUCHER K., KERREN A.: Text Visualization Techniques: Taxonomy, Visual Survey, and Community Insights. In IEEE Pacific Visualization Symposium (April 2015), IEEE, pp. 117–121. 16
[KLTH10] KAPOOR A., LEE B., TAN D., HORVITZ E.: Interactive Optimization for Steering Machine Classification. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (2010), ACM, pp. 1343–1352. 7, 13, 16, 23
[KLTH12] KAPOOR A., LEE B., TAN D. S., HORVITZ E.: Performance and Preferences: Interactive Refinement of Machine Learning Procedures. In AAAI (2012), Citeseer. 10, 13, 23
[KMS∗08] KEIM D. A., MANSMANN F., SCHNEIDEWIND J., THOMAS J., ZIEGLER H.: Visual Analytics: Scope and Challenges. In Visual Data Mining. Springer, 2008, pp. 76–90. 5
[KPB14] KRAUSE J., PERER A., BERTINI E.: INFUSE: Interactive Feature Selection for Predictive Modeling of High Dimensional Data. IEEE Transactions on Visualization and Computer Graphics 20, 12 (2014), 1614–1623. 7, 8, 10, 23
[KPHH11] KANDEL S., PAEPCKE A., HELLERSTEIN J., HEER J.: Wrangler: Interactive Visual Specification of Data Transformation Scripts. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (2011), ACM, pp. 3363–3372. 7
[KPJ04] KING A. D., PRŽULJ N., JURISICA I.: Protein Complex Prediction via Cost-Based Clustering. Bioinformatics 20, 17 (2004), 3013–3020. 3, 23
[KPN16] KRAUSE J., PERER A., NG K.: Interacting With Predictions: Visual Inspection of Black-Box Machine Learning Models. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems (2016), ACM, pp. 5686–5697. 7, 14, 15, 23
[KPS16] KRAUSE J., PERER A., STAVROPOULOS H.: Supporting Iterative Cohort Construction With Visual Temporal Queries. IEEE Transactions on Visualization and Computer Graphics 22, 1 (2016), 91–100. 6, 7, 23
[LCM∗16] LU J., CHEN W., MA Y., KE J., LI Z., ZHANG F., MACIEJEWSKI R.: Recent Progress and Trends in Predictive Visual Analytics. Frontiers of Computer Science (2016). 4, 5
[LKC∗12] LEE H., KIHM J., CHOO J., STASKO J., PARK H.: iVisClustering: An Interactive Visual Document Clustering via Topic Modeling. In Computer Graphics Forum (2012), vol. 31, Wiley Online Library, pp. 1155–1164. 6, 9, 17, 23
[LKKV14] LAZER D., KENNEDY R., KING G., VESPIGNANI A.: The Parable of Google Flu: Traps in Big Data Analysis. Science 343, 6176 (2014), 1203–1205. 1
[LKT∗14] LU Y., KRÜGER R., THOM D., WANG F., KOCH S., ERTL T., MACIEJEWSKI R.: Integrating Predictive Analytics and Social Media. In IEEE Symposium on Visual Analytics Science and Technology (2014), IEEE, pp. 193–202. 6, 7, 11, 12, 16, 23
[LSL∗16] LIU M., SHI J., LI Z., LI C., ZHU J., LIU S.: Towards Better Analysis of Deep Convolutional Neural Networks. IEEE Transactions on Visualization and Computer Graphics PP, 99 (2016), 1–1. 9, 24
[LSP∗10] LEX A., STREIT M., PARTL C., KASHOFER K., SCHMALSTIEG D.: Comparative Analysis of Multidimensional, Quantitative Data. IEEE Transactions on Visualization and Computer Graphics 16, 6 (Nov 2010), 1027–1035. 24
[LWM13] LU Y., WANG F., MACIEJEWSKI R.: VAST 2013 Mini-Challenge 1: Box Office Vast-Team Vader. In IEEE Conference on Visual Analytics Science and Technology (2013). 10
[LWM14] LU Y., WANG F., MACIEJEWSKI R.: Business Intelligence From Social Media: A Study From the VAST Box Office Challenge. IEEE Computer Graphics and Applications 34, 5 (2014), 58–69. 6, 10, 24
[MBD∗11] MAY T., BANNACH A., DAVEY J., RUPPERT T., KOHLHAMMER J.: Guiding Feature Subset Selection With an Interactive Visualization. In IEEE Symposium on Visual Analytics Science and Technology (2011), IEEE, pp. 111–120. 7, 8, 12, 16, 24
[MBH∗12] MEYER J., BETHEL E. W., HORSMAN J. L., HUBBARD S. S., KRISHNAN H., ROMOSAN A., KEATING E. H., MONROE L., STRELITZ R., MOORE P., ET AL.: Visual Data Analysis as an Integral Part of Environmental Management. IEEE Transactions on Visualization and Computer Graphics 18, 12 (2012), 2088–2094. 11, 24
[MDK10] MAY T., DAVEY J., KOHLHAMMER J.: Combining Statistical Independence Testing, Visual Attribute Selection and Automated Analysis to Find Relevant Attributes for Classification. In IEEE Symposium on Visual Analytics Science and Technology (2010), IEEE, pp. 239–240. 24
[MHR∗11] MACIEJEWSKI R., HAFEN R., RUDOLPH S., LAREW S. G., MITCHELL M. A., CLEVELAND W. S., EBERT D. S.: Forecasting Hotspots – A Predictive Analytics Approach. IEEE Transactions on Visualization and Computer Graphics 17, 4 (2011), 440–453. 24
[MMT∗14] MALIK A., MACIEJEWSKI R., TOWERS S., MCCULLOUGH S., EBERT D. S.: Proactive Spatiotemporal Resource Allocation and Predictive Visual Analytics for Community Policing and Law Enforcement. IEEE Transactions on Visualization and Computer Graphics 20, 12 (2014), 1863–1872. 6, 10, 24
[MP13] MÜHLBACHER T., PIRINGER H.: A Partition-Based Framework for Building and Validating Regression Models. IEEE Transactions on Visualization and Computer Graphics 19, 12 (2013), 1962–1971. 7, 8, 24
[MPG∗14] MÜHLBACHER T., PIRINGER H., GRATZL S., SEDLMAIR M., STREIT M.: Opening the Black Box: Strategies for Increased User Involvement in Existing Algorithm Implementations. IEEE Transactions on Visualization and Computer Graphics 20, 12 (2014), 1643–1652. 17
[MPK∗13] MACIEJEWSKI R., PATTATH A., KO S., HAFEN R., CLEVELAND W. S., EBERT D. S.: Automated Box-Cox Transformations for Improved Visual Encoding. IEEE Transactions on Visualization and Computer Graphics 19, 1 (2013), 130–140. 7
[MPV15] MONTGOMERY D. C., PECK E. A., VINING G. G.: Introduction to Linear Regression Analysis. John Wiley & Sons, 2015. 2
[MvGW11] MIGUT M., VAN GEMERT J., WORRING M.: Interactive Decision Making Using Dissimilarity to Visually Represented Prototypes. In IEEE Symposium on Visual Analytics Science and Technology (2011), IEEE, pp. 141–149. 13
[MW10] MIGUT M., WORRING M.: Visual Exploration of Classification Models for Risk Assessment. In 2010 IEEE Symposium on Visual Analytics Science and Technology (Oct 2010), pp. 11–18. 16, 17, 24
[NHM∗07] NAM E. J., HAN Y., MUELLER K., ZELENYUK A., IMRE D.: Clustersculptor: A Visual Analytics Tool for High-Dimensional Data. In IEEE Symposium on Visual Analytics Science and Technology (2007), IEEE, pp. 75–82. 8, 24
[OJ14] OBERMAIER H., JOY K. I.: Future Challenges for Ensemble Visualization. IEEE Computer Graphics and Applications 34, 3 (2014), 8–11. 2
[PBK10] PIRINGER H., BERGER W., KRASSER J.: HyperMoVal: Interactive Visual Validation of Regression Models for Real-Time Simulation. In Computer Graphics Forum (2010), vol. 29, Wiley Online Library, pp. 983–992. 1
[PC05] PIROLLI P., CARD S.: The Sensemaking Process and Leverage Points for Analyst Technology as Identified Through Cognitive Task Analysis. In Proceedings of International Conference on Intelligence Analysis (2005), vol. 5, pp. 2–4. 3, 5, 9
[PDF∗11] PATEL K., DRUCKER S. M., FOGARTY J., KAPOOR A., TAN D. S.: Using Multiple Models to Understand Data. In IJCAI Proceedings of International Joint Conference on Artificial Intelligence (2011), vol. 22, Citeseer, p. 1723. 7, 24
[Pea03] PEARL J.: Comments on Neuberg's Review of Causality. Econometric Theory 19, 04 (Jun 2003). 18
[PFP∗11] PAIVA J. G., FLORIAN L., PEDRINI H., TELLES G., MINGHIM R.: Improved Similarity Trees and Their Application to Visual Data Classification. IEEE Transactions on Visualization and Computer Graphics 17, 12 (2011), 2459–2468. 24
[PGU12] PILHÖFER A., GRIBOV A., UNWIN A.: Comparing Clusterings Using Bertin's Idea. IEEE Transactions on Visualization and Computer Graphics 18, 12 (Dec 2012), 2506–2515. 10, 11, 24
[PSPM15] PAIVA J. G. S., SCHWARTZ W. R., PEDRINI H., MINGHIM R.: An Approach to Supporting Incremental Visual Data Classification. IEEE Transactions on Visualization and Computer Graphics 21, 1 (2015), 4–17. 8, 24
[Qui14] QUINLAN J. R.: C4.5: Programs for Machine Learning. Elsevier, 2014. 10
[RAL∗17] REN D., AMERSHI S., LEE B., SUH J., WILLIAMS J. D.: Squares: Supporting Interactive Performance Analysis for Multiclass Classifiers. IEEE Transactions on Visualization and Computer Graphics 23, 1 (2017), 61–70. 10, 24
[RCMM∗16] RAIDOU R., CASARES-MAGAZ O., MUREN L., VAN DER HEIDE U., RØRVIK J., BREEUWER M., VILANOVA A.: Visual Analysis of Tumor Control Models for Prediction of Radiotherapy Response. Computer Graphics Forum 35, 3 (2016), 231–240. 11, 24
[RD00] RHEINGANS P., DESJARDINS M.: Visualizing High-Dimensional Predictive Model Quality. In Proceedings of Visualization 2000 (2000), IEEE, pp. 493–496. 24
[RFFT17] RAUBER P. E., FADEL S. G., FALCAO A. X., TELEA A. C.: Visualizing the Hidden Activity of Artificial Neural Networks. IEEE Transactions on Visualization and Computer Graphics 23, 1 (2017), 101–110. 9, 24
[RLF14] RIFF D., LACY S., FICO F.: Analyzing Media Messages: Using Quantitative Content Analysis in Research. Routledge, 2014. 4
[SAS12] SAS PUBLISHING CO.: JMP Ten Modeling and Multivariate Methods, 2012. 2
[Sch16] SCHERER M. U.: Regulating Artificial Intelligence Systems: Risks, Challenges, Competencies, and Strategies. SSRN Electronic Journal (2016). 18
[SDW11] SLINGSBY A., DYKES J., WOOD J.: Exploring Uncertainty in Geodemographics With Interactive Graphics. IEEE Transactions on Visualization and Computer Graphics 17, 12 (2011), 2545–2554. 9, 24
[SG10] SEIFERT C., GRANITZER M.: User-Based Active Learning. In IEEE International Conference on Data Mining Workshops (2010), IEEE, pp. 418–425. 10, 16, 24
[Shm10] SHMUELI G.: To Explain or to Predict? Statistical Science (2010), 289–310. 3
[Sie13] SIEGEL E.: Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die. John Wiley & Sons, 2013. 3
[SK10] SHMUELI G., KOPPIUS O.: Predictive Analytics in Information Systems Research. Robert H. Smith School Research Paper No. RHS (2010), 06–138. 2, 3
[SS02] SEO J., SHNEIDERMAN B.: Interactively Exploring Hierarchical Clustering Results [Gene Identification]. Computer 35, 7 (2002), 80–86. 24
[SS05] SEO J., SHNEIDERMAN B.: A Rank-By-Feature Framework for Interactive Exploration of Multidimensional Data. Information Visualization 4, 2 (2005), 96–113. 7
[SSZ∗16] SACHA D., SEDLMAIR M., ZHANG L., LEE J., WEISKOPF D., NORTH S., KEIM D.: Human-Centered Machine Learning Through Interactive Visualization: Review and Open Challenges. In Proceedings of the 24th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (2016). 5, 16
[Sut14] SUTHAHARAN S.: Big Data Classification: Problems and Challenges in Network Intrusion Prediction With Machine Learning. ACM SIGMETRICS Performance Evaluation Review 41, 4 (2014), 70–73. 2
[SZS∗16] SACHA D., ZHANG L., SEDLMAIR M., LEE J. A., PELTONEN J., WEISKOPF D., NORTH S., KEIM D. A.: Visual Interaction With Dimensionality Reduction: A Structured Literature Analysis. IEEE Transactions on Visualization and Computer Graphics (2016). 7, 10
[TFA∗ 11] TAM G. K. L., FANG H., AUBREY A. J., G RANT P. W., [ZM17] Z HANG Y., M ACIEJEWSKI R.: Quantifying the Visual Impact
ROSIN P. L., M ARSHALL D., C HEN M.: Visualization of Time-Series of Classification Boundaries in Choropleth Maps. IEEE Transactions on
Data in Parameter Space for Understanding Facial Dynamics. Computer Visualization and Computer Graphics 23, 1 (2017), 371–380. 10
Graphics Forum 30, 3 (2011), 901–910. 7, 24 [ZYZ∗ 16] Z HANG C., YANG J., Z HAN F. B., G ONG X., B RENDER
[TKC17] TAM G. K., KOTHARI V., C HEN M.: An Analysis of Machine- J. D., L ANGLOIS P. H., BARLOWE S., Z HAO Y.: A Visual Analyt-
And Human-Analytics in Classification. IEEE Transactions on Visual- ics Approach to High-Dimensional Logistic Regression Modeling and
ization and Computer Graphics 23, 1 (2017), 71–80. 5 Its Application to an Environmental Health Study. In IEEE Pacific Visu-
alization Symposium (2016), IEEE, pp. 136–143. 17
[TM05] T ZENG F.-Y., M A K.-L.: Opening the Black Box-Data Driven
Visualization of Neural Networks. In IEEE Visualization. (2005), IEEE,
pp. 383–390. 3, 9
[TMF∗ 12] TATU A., M AASS F., FÄRBER I., B ERTINI E., S CHRECK T.,
S EIDL T., K EIM D.: Subspace Search and Visualization to Make Sense
of Alternative Clusterings in High-Dimensional Data. In IEEE Confer-
ence on Visual Analytics Science and Technology (2012), IEEE, pp. 63–
72. 1, 13, 24
[TPH15] T RIVEDI S., PARDOS Z. A., H EFFERNAN N. T.: The Utility of
Clustering in Prediction Tasks. arXiv preprint arXiv:1509.06163 (2015).
3
[TPRH11] T URKAY C., PARULEK J., R EUTER N., H AUSER H.: In-
teractive Visual Analysis of Temporal Cluster Structures. In Computer
Graphics Forum (2011), vol. 30, Wiley Online Library, pp. 711–720. 24
[USK14] U ENAKA T., S AKAMOTO N., KOYAMADA K.: Visual Analysis
of Habitat Suitability Index Model for Predicting the Locations of Fish-
ing Grounds. In IEEE Pacific Visualization Symposium (2014), IEEE,
pp. 306–310. 24
[vdEHBvW16] VAN DEN E LZEN S., H OLTEN D., B LAAS J., VAN W IJK
J. J.: Reducing Snapshots to Points: A Visual Analytics Approach to
Dynamic Network Exploration. IEEE transactions on visualization and
computer graphics 22, 1 (2016), 1–10. 17
[VDEvW11] VAN D EN E LZEN S., VAN W IJK J. J.: Baobabview: Inter-
active Construction and Analysis of Decision Trees. In IEEE Symposium
on Visual Analytics Science and Technology (2011), IEEE, pp. 151–160.
8, 9, 10, 17, 18, 24
[WFZ∗ 15] WANG Y., FAN C., Z HANG J., N IU T., Z HANG S., J IANG
J.: Forecast Verification and Visualization Based on Gaussian Mixture
Model Co-Estimation. In Computer Graphics Forum (2015), vol. 34,
Wiley Online Library, pp. 99–110. 24
[WWN∗ 15] WATANABE K., W U H.-Y., N IIBE Y., TAKAHASHI S., F U -
JISHIRO I.: Biclustering Multivariate Data for Correlated Subspace Min-
ing. In IEEE Pacific Visualization Symposium (2015), IEEE, pp. 287–
294. 24
[WYG∗ 11] W EI J., Y U H., G ROUT R. W., C HEN J. H., M A K.-L.:
Dual Space Analysis of Turbulent Combustion Particle Data. In IEEE
Pacific Visualization Symposium (2011), IEEE, pp. 91–98. 24
[WZM∗ 16] WANG X.-M., Z HANG T.-Y., M A Y.-X., X IA J., C HEN W.:
A Survey of Visual Analytic Pipelines. Journal of Computer Science and
Technology 31, 4 (2016), 787–804. 5
[XCQS16] X U P., C AO N., Q U H., S TASKO J.: Interactive Visual Co-
Cluster Analysis of Bipartite Graphs. In IEEE Pacific Visualization Sym-
posium (2016), IEEE, pp. 32–39. 13, 14, 16, 24
[YaKSJ07] Y I J. S., AH K ANG Y., S TASKO J., JACKO J.: Toward a
Deeper Understanding of the Role of Interaction in Information Visual-
ization. IEEE transactions on visualization and computer graphics 13, 6
(2007), 1224–1231. 4, 5, 10, 11
[ZLH∗ 16] Z HOU F., L I J., H UANG W., Z HAO Y., Y UAN X., L IANG
X., S HI Y.: Dimension Reconstruction for Visual Exploration of Sub-
space Clusters in High-Dimensional Data. In IEEE Pacific Visualization
Symposium (2016), IEEE, pp. 128–135. 7, 8, 24
[ZLMM16] Z HANG Y., L UO W., M ACK E., M ACIEJEWSKI R.: Visu-
alizing the Impact of Geographical Variations on Multivariate Cluster-
ing. In Computer Graphics Forum (2016), vol. 35, Wiley Online Library,
pp. 101–110. 10
© 2017 The Author(s)
Computer Graphics Forum © 2017 The Eurographics Association and John Wiley & Sons Ltd.
Yafeng Lu, Rolando Garcia, Brett Hansen, Michael Gleicher & Ross Maciejewski / The State-of-the-Art in Predictive Visual Analytics
Table 3: Results from the PVA categorization scheme. Papers are encoded according to their coverage of the PVA pipeline and interactions.
Note that Result Exploration refers to the Result Exploration and Model Selection step of the PVA pipeline due to space limitations.
Columns (* = supported): Data Preprocessing | Feature Engineering | Modeling: Classification, Regression, Clustering, Other | Result Exploration | Validation | Interactions: Select, Explore, Reconfigure, Encode, Abstract/Elaborate, Filter, Connect, Shepherd
[AAR∗ 09] * * * * * * *
[AEEK99] * * * * * *
[AEK00] * * * *
[AHH∗ 14] * * * * * * *
[AKMS07] * * * * *
[AME11] * * * * * * *
[BAL∗ 15] * * * * * * *
[BDB15] * * * * * *
[BJA∗ 15] * * * * * * * *
[BLBC12] * * * * * * *
[BO13] * * *
[BPFG11] * * * * * * *
[BTRD15] * * *
[BvLBS11] * * * *
[BvLH∗ 11] * * * * *
[BWMM15] * * * * * * *
[BZS∗ 16] * * *
[CCWH08] * * * *
[CGSQ11] * * * * * * * * *
[CLKP10] * * * * * * * * *
[DDJB13] * * * * * *
[DPD∗ 15] * * * * * *
[GBP∗ 13] * * *
[GRM10] * * * *
[GWR09] * * * * * * * * *
[HJM∗ 11] * * * * * * * * *
[HKBE12] * * * * * * * * * * * *
[HMC∗ 13] * * * * * * *
[HNH∗ 12] * * * * * * * * * *
[HOG∗ 12] * * * * * * * * *
[IMI∗ 10] * * * * * * *
[JKM12] * * * * *
[KLTH10] * * *
[KLTH12] * * * * *
[KPB14] * * * * *
[KPJ04] * * * * * * *
[KPN16] * * * * * * * * *
[KPS16] * * * * * *
[LKC∗ 12] * * * * * * * * *
[LKT∗ 14] * * * * * * * * * * *
Columns (* = supported): Data Preprocessing | Feature Engineering | Modeling: Classification, Regression, Clustering, Other | Result Exploration | Validation | Interactions: Select, Explore, Reconfigure, Encode, Abstract/Elaborate, Filter, Connect, Shepherd
[LSL∗ 16] * * * * * * * * * *
[LSP∗ 10] * * * * *
[LWM14] * * * * * * * *
[MBD∗ 11] * * * * * * * * * * *
[MBH∗ 12] * * * *
[MDK10] * * * * * *
[MHR∗ 11] * * * * *
[MMT∗ 14] * * * * * * * * * * * *
[MP13] * * * * * * *
[MW10] * * * * * * *
[NHM∗ 07] * * * * *
[PDF∗ 11] * * * * * *
[PFP∗ 11] * * * * * *
[PGU12] * * *
[PSPM15] * * * * * * *
[RAL∗ 17] * * * * * * * *
[RCMM∗ 16] * * * * *
[RD00] * * * *
[RFFT17] * * * * * *
[SDW11] * * *
[SG10] * * * *
[SS02] * * * * * * * *
[TFA∗ 11] * * * * * * * *
[TMF∗ 12] * * * * * * * *
[TPRH11] * * * * *
[USK14] * * * * *
[VDEvW11] * * * * * *
[WFZ∗ 15] * * * *
[WWN∗ 15] * * * * * *
[WYG∗ 11] * * * * * *
[XCQS16] * * * * * * * * * *
[ZLH∗ 16] * * * * * *