Querying and Re-Using Workflows with VisTrails

Carlos E. Scheidegger Huy T. Vo David Koop Juliana Freire Cláudio T. Silva


SCI Institute and School of Computing – University of Utah
{cscheid, hvo, dakoop, juliana, csilva}@cs.utah.edu

ABSTRACT
We show how workflow systems can be augmented to leverage provenance information to enhance usability. In particular, we will demonstrate new mechanisms and intuitive user interfaces designed to allow users to query workflows by example and to refine workflows by analogies. These techniques are implemented in VisTrails, an open-source provenance-enabled scientific workflow system that can be combined with a wide range of tools, libraries, and visualization systems. We will show different scenarios where these techniques can be used to simplify the notoriously hard tasks of creating and refining workflows.

Categories and Subject Descriptors
H.4 [Information Systems Applications]: General

General Terms
Algorithms, Human Factors, Management

Keywords
scientific workflows, visualization, query-by-example, provenance, analogy

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SIGMOD’08, June 9–12, 2008, Vancouver, BC, Canada.
Copyright 2008 ACM 978-1-60558-102-6/08/06 ...$5.00.

1. INTRODUCTION
Computing has been an enormous accelerator to science and has led to an information explosion in many different fields. To analyze and understand scientific data, complex computational processes must be assembled, often requiring the combination of loosely-coupled resources, specialized libraries, and grid and Web services. These processes may generate even more final and intermediate data products, adding to the overflow of information scientists need to deal with. Ad-hoc approaches to data exploration (e.g., Perl scripts) have been widely used in the scientific community, but have serious limitations. In particular, scientists and engineers need to expend substantial effort managing data (e.g., scripts that encode computational tasks, raw data, data products, and notes) and recording provenance information so that basic questions can be answered, such as: Who created this data product and when? When was it modified and by whom? What was the process used to create the data product? Were two data products derived from the same raw data? This process is not only time-consuming but also error-prone.

Workflow systems have therefore grown in popularity within the scientific community (see e.g., [3, 8, 9]). Not only do they support the automation of repetitive tasks, but they can also capture complex analysis processes at various levels of detail and systematically capture provenance information for the derived data products.

While significant progress has been made in unifying computations under the workflow umbrella, workflow systems are notoriously hard to use. They have a steep learning curve: users need to learn programming languages, programming environments, specialized libraries, and best practices for constructing workflows. Often, workflow engineers who have programming expertise construct workflows at the request of scientists. Whereas this modus operandi is acceptable for tasks that require few workflows that will be run many times, the same cannot be said of tasks that are exploratory in nature. For the latter, a workflow (or a set of workflows) must be iteratively refined as a scientist formulates and tests hypotheses. For example, while mining or creating visualizations of a dataset, a user needs to experiment with different parameter values as well as with different techniques. This process requires domain expertise that a workflow engineer does not have, and it requires expertise in building workflows which a domain scientist seldom has. To bridge this gap, we need systems that facilitate the process of constructing and refining workflows. Usability and the ability to cater to a broad set of users with varying levels of experience are of utmost importance for workflow systems.

The VisTrails system [1, 7] represents our initial attempt to provide support for tasks that involve data exploration through workflows. VisTrails is an open-source provenance-enabled scientific workflow middleware which can be combined with a wide range of tools, libraries and visualization systems. A new concept we introduced with VisTrails is the notion of provenance of workflows [5]. In contrast to previous workflow systems which maintained provenance only for the data products generated by workflows, VisTrails treats the workflows themselves as first-class data items and keeps their provenance. We have shown that the provenance of how workflows evolve over time enables a series of operations which simplify exploratory processes, for example: scientists can easily navigate through the space of workflows created for a given exploration task; visually compare workflows and their results; and explore large parameter spaces.

In this demonstration, we show how VisTrails unobtrusively tracks detailed provenance information for exploratory tasks and leverages this data to enhance usability. In particular, we will demonstrate new mechanisms and intuitive user interfaces we designed that allow users to query workflows by example and to refine workflows by analogy [7]. Query-by-example simplifies the task of locating existing workflows with certain structures or parameter settings by allowing the user to specify a query exactly as she would design that feature in a workflow. Using the analogy mechanism, a user can modify a workflow without having to directly edit the workflow specification: she can specify a change by selecting two workflows that exhibit the “before” and “after” and apply a similar change to a third workflow. These interfaces form the basis of an infrastructure that makes it possible for scientists to learn by example; expedites their scientific training; and potentially reduces their time to insight.

Figure 1: The version tree stores the complete evolution of a collection of workflows. Each node corresponds to a workflow; the edges show how the workflows are derived.

2. THE VISTRAILS SYSTEM
VisTrails combines features of both workflow and visualization systems. Like workflow systems, it allows loosely-coupled resources such as specialized libraries, grid, and Web services to be used in concert. It parallels some visualization systems by providing mechanisms to perform parameter explorations and visually compare multiple results. But unlike these systems, VisTrails was designed to manage exploratory activities, where computational tasks are iteratively refined as users formulate and test hypotheses. A distinguishing feature of VisTrails is a comprehensive provenance infrastructure. The system maintains detailed history information about the steps followed and data derived in the course of an exploratory task, and provides novel operations and user interfaces for users to explore and re-use this information.

The change-based provenance model adopted by VisTrails maintains the sequence of actions that are applied to workflows (e.g., the addition of a module, the modification of a parameter, etc.), akin to a database transaction log. These changes are sufficient to determine the provenance of data products, and they also contain information about how the workflows evolve over time. The change-based model is both simple and compact: it uses substantially less space than the alternative of storing multiple versions of a workflow. The model is also extensible. The underlying algebra of actions can be customized to support provenance capture at different granularities. We refer to the detailed provenance of the workflow evolution as a visual trail, or a vistrail.

A tree-based view of a vistrail allows a scientist to return to a previous version in an intuitive way, to undo bad changes, to compare different workflows, and to be reminded of the actions that led to a particular result (see Figure 1). This, combined with a caching strategy that eliminates redundant computations, allows the scientist to efficiently explore a large number of related workflows and their results [1].

3. EXPLORING WORKFLOWS
Below we give an overview of features of VisTrails that allow users to query and re-use provenance information. A more detailed description is given in [7].

3.1 Query by Example
A major problem with many computational tasks is that code is written once and rarely re-used. Often, code is tailored to a specific problem, and in trying to solve a new problem, it is hard to locate existing code that is relevant. Workflows alleviate this problem in part by promoting the adoption of a service-oriented architecture which leads to re-use. But to enable users to find existing code, a query interface is needed.

Workflows are represented as graphs: modules connected by input/output ports, which carry the data type and meaning. Consequently, using text-based languages (e.g., SQL) is not desirable, because it effectively requires that a subgraph query be encoded as text. The VisTrails query-by-example mechanism eliminates the need to learn a new query language or decompose workflow graphs into SQL syntax: users build queries exactly as they would build pieces of a workflow. In addition to being able to define the structure of the workflow, users can choose to search specific parameters with a set of filters. Finally, the results are displayed visually: each workflow version that matches is highlighted along with the corresponding portion of each matching workflow.

3.2 Workflow Analogies
In exploratory tasks, it is important to understand what the differences between workflows are, especially if multiple people are collaboratively exploring data. Computing the differences between two workflows by considering their underlying graph structure turns out to be impractical (the problem can be reduced to subgraph isomorphism). However, using the change-based provenance model, this problem becomes much easier and can be solved in linear time [5]. VisTrails provides a visual difference mechanism that allows users to compare two workflows by coloring modules and connections according to which workflow they belong to. If modules occur in both, any parameter differences between them can be displayed. This difference is extremely useful for users because it offers a template for applying a similar set of changes to another workflow.

Figure 2: Given a workflow (A) that renders a protein, another (B) that generates an HTML report for a protein, and a third (C) that dynamically obtains protein data and produces an improved protein rendering, we generate a new workflow (D) that produces a web report using improved rendering via a workflow analogy.

Workflow analogies automate these changes by flexibly updating workflows according to a change-based template. Just as in the vocabulary analogy “dog is to puppy as cat is
to ?”, workflow analogies discover the relationship between the first two entities and construct an answer by applying this difference to the third entity. As illustrated in Figure 2, we accomplish this by: determining the changes that were made from the linked pair; performing a match between the two “starting” workflows; mapping the differences through the derived match; and applying the new set of changes to the third workflow to produce a new workflow.

Workflow Differences. The first ingredient in computing analogies is a method for computing differences between a pair of workflows. As described earlier, VisTrails can efficiently compute the difference between the two linked workflows by determining the sequence of changes from one workflow to another.

Workflow Matching. In addition to finding the changes we wish to apply, we need to identify a correspondence between the two starting workflows. Because in VisTrails workflows are directed acyclic graphs, this problem is equivalent to graph matching. Unfortunately, this problem is NP-Complete and cannot be efficiently approximated within a subpolynomial factor [2]. However, because modules have well-defined semantics, we can model this in a probabilistic manner with a good likelihood for success. We need to balance local compatibility between modules with the similarity of the global topologies. In order to do so, we first score each pair of modules based on how well they match, and then diffuse these compatibility measures through the product graph of the two workflows. The score of a pair of modules depends on the inputs these modules accept and the outputs they produce. We diffuse this score using an algorithm reminiscent of PageRank [4]. In our algorithm, this diffusion is performed on the product graph, similar to the similarity flooding strategy to match database schemas [6].

Applying the Analogy. After computing the difference ∆(A, B) and the matching M(A, C), we need to translate the difference according to the matching M. To do so, we need to translate each individual change through the derived matching. Specifically, if one change is to add a connection between modules a and b, and the matching specifies that a and b in workflow A are equivalent to modules c and d in workflow C, the change becomes adding a connection between modules c and d. The translated changes are then applied to C to create a new pipeline D. Figure 2 shows an example of the entire process. Of course, there are cases where analogies do not make sense. Some workflows cannot be matched because there is not enough information or because they are too different. In addition, some actions will not make sense after translation; such changes are discarded. However, even when they are not perfect, analogies can provide a useful starting point for users trying to incorporate new techniques.

4. DEMONSTRATION OVERVIEW
In this demonstration, we will show the power of using analogies and query by example in workflow systems by presenting a set of examples that emphasize the usability of the techniques, and how they enable knowledge re-use in the composition of complex workflows. We will use scenarios from real applications in cosmology, environmental observation systems, bioinformatics and radiation oncology treatment planning.

One of the key applications of the interfaces described above is the semi-automatic addition of new functionality to an existing workflow. We will demonstrate how this can be achieved by querying a database of existing workflows and selecting a pair of workflows whose difference encompasses the new feature to be added to a third workflow. We will also show how complex workflows can be constructed by a sequence of analogy steps. Finally, we will discuss the robustness and applicability of analogies in real applications.

VisTrails can be downloaded from http://www.vistrails.org. A video demonstrating the analogies and query-by-example mechanisms is available at http://www.cs.utah.edu/~juliana/videos/vistrails-analogies.mov.

5. REFERENCES
[1] L. Bavoil, S. Callahan, P. Crossno, J. Freire, C. Scheidegger, C. Silva, and H. Vo. Vistrails: Enabling interactive multiple-view visualizations. In Proceedings of IEEE Visualization, pages 135–142, 2005.
[2] J. Hastad. Clique is hard to approximate within n^(1−ε). Acta Mathematica, 182:105–142, 1999.
[3] The Kepler Project. http://kepler-project.org.
[4] A. N. Langville and C. D. Meyer. Google’s PageRank and Beyond: The Science of Search Engine Rankings. Princeton University Press, 2006.
[5] J. Freire, C. T. Silva, S. P. Callahan, E. Santos, C. E. Scheidegger, and H. T. Vo. Managing rapidly-evolving scientific workflows. In International Provenance and Annotation Workshop (IPAW), LNCS 4145, pages 10–18, 2006. Invited paper.
[6] S. Melnik, H. Garcia-Molina, and E. Rahm. Similarity flooding: A versatile graph matching algorithm and its application to schema matching. In ICDE, pages 117–128, 2002.
[7] C. E. Scheidegger, H. T. Vo, D. Koop, J. Freire, and C. T. Silva. Querying and creating visualizations by analogy. IEEE Transactions on Visualization and Computer Graphics, 13(6):1560–1567, 2007. Papers from the IEEE Information Visualization Conference 2007.
[8] The Taverna Project. http://taverna.sourceforge.net.
[9] The VisTrails Project. http://www.vistrails.org.
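
As a rough illustration of the change-based model (Section 2) and the analogy steps (Section 3.2), the following Python sketch stores workflows as a tree of actions, computes the difference between two linked versions, translates it through a matching, and applies it to a third workflow. It is not the VisTrails API: the names `Action`, `Vistrail`, `apply_action`, and `translate`, the two action kinds, and the module names are all hypothetical. It also makes two simplifying assumptions: B is derived from A by additions only, and the matching M(A, C) is supplied by hand rather than computed by the diffusion-based matching described above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple


@dataclass(frozen=True)
class Action:
    """One change to a workflow (hypothetical, simplified action algebra)."""
    kind: str                 # "add_module" or "add_connection"
    args: Tuple[str, ...]     # module name, or (source, target) module names


def apply_action(action: Action, modules: Set[str],
                 connections: Set[Tuple[str, ...]]) -> None:
    """Replay a single action onto an in-memory workflow."""
    if action.kind == "add_module":
        modules.add(action.args[0])
    elif action.kind == "add_connection":
        connections.add(tuple(action.args))


class Vistrail:
    """Version tree: every version stores only its parent and one action."""

    def __init__(self) -> None:
        self.parent: Dict[int, Optional[int]] = {0: None}  # 0 = empty workflow
        self.action: Dict[int, Action] = {}

    def add_version(self, parent: int, action: Action) -> int:
        version = len(self.parent)
        self.parent[version] = parent
        self.action[version] = action
        return version

    def path(self, version: int) -> List[int]:
        """Versions from just below the root down to `version`."""
        chain = []
        while version != 0:
            chain.append(version)
            version = self.parent[version]
        return chain[::-1]

    def materialize(self, version: int):
        """Rebuild a workflow by replaying its actions from the root."""
        modules: Set[str] = set()
        connections: Set[Tuple[str, ...]] = set()
        for v in self.path(version):
            apply_action(self.action[v], modules, connections)
        return modules, connections

    def diff(self, a: int, b: int) -> List[Action]:
        """Actions leading from a to b (assumes b is a descendant of a)."""
        seen = set(self.path(a))
        return [self.action[v] for v in self.path(b) if v not in seen]


def translate(delta: List[Action], matching: Dict[str, str]) -> List[Action]:
    """Map each change through the matching; discard changes that cannot be mapped."""
    added = {act.args[0] for act in delta if act.kind == "add_module"}
    translated = []
    for act in delta:
        if act.kind == "add_module":
            translated.append(act)          # new modules carry over unchanged
            continue
        mapped = [m if m in added else matching.get(m) for m in act.args]
        if None in mapped:
            continue                        # makes no sense after translation
        translated.append(Action(act.kind, tuple(mapped)))
    return translated


vt = Vistrail()
# Workflow A: Reader -> Renderer
v = vt.add_version(0, Action("add_module", ("Reader",)))
v = vt.add_version(v, Action("add_module", ("Renderer",)))
a = vt.add_version(v, Action("add_connection", ("Reader", "Renderer")))
# Workflow B: A plus an HTML report fed by the renderer
v = vt.add_version(a, Action("add_module", ("HTMLReport",)))
b = vt.add_version(v, Action("add_connection", ("Renderer", "HTMLReport")))
# Workflow C: a separate branch with different modules
v = vt.add_version(0, Action("add_module", ("Downloader",)))
v = vt.add_version(v, Action("add_module", ("BetterRenderer",)))
c = vt.add_version(v, Action("add_connection", ("Downloader", "BetterRenderer")))

# Hand-written matching M(A, C); in the paper this is computed automatically.
matching = {"Reader": "Downloader", "Renderer": "BetterRenderer"}

d = c
for act in translate(vt.diff(a, b), matching):
    d = vt.add_version(d, act)              # workflow D is a new branch under C
modules, connections = vt.materialize(d)
```

In this toy run, the connection Renderer -> HTMLReport in the difference becomes BetterRenderer -> HTMLReport in D, mirroring the "modules a and b become c and d" example in the text. A fuller version would also need to handle deletions, parameter changes, and differences between versions on separate branches.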
