Special Issue: The First Provenance Challenge
Moreau, Luc; Ludäscher, Bertram
doi: 10.1002/cpe.1233
The first Provenance Challenge was set up in order to provide a forum for the community to understand the capabilities of different provenance systems and the expressiveness of their provenance representations. To this end, a functional magnetic resonance imaging workflow was defined, which participants had to either simulate or run in order to produce some provenance representation, from which a set of identified queries had to be implemented and executed. Sixteen teams responded to the challenge, and submitted their inputs. In this paper, we present the challenge workflow and queries, and summarize the participants' contributions. Copyright © 2007 John Wiley & Sons, Ltd.
Automatic capture and efficient storage of e‐Science experiment provenance
Moreau, Luc; Ludäscher, Bertram
doi: 10.1002/cpe.1235
For the first provenance challenge, we introduce a layered model to represent workflow provenance that allows navigation from an abstract model of the experiment to instance data collected during a specific experiment run. We outline modest extensions to a commercial workflow engine so it will automatically capture provenance at workflow runtime. We also present an approach to store this provenance data in a relational database. Finally, we demonstrate how core provenance queries in the challenge can be expressed in SQL and discuss the merits of our layered representation. Copyright © 2007 John Wiley & Sons, Ltd.
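To make the idea of a layered relational representation concrete, here is a minimal sketch in Python using the standard `sqlite3` module. It is not the authors' schema: the table names (`step`, `step_run`, `artifact_use`), the step names, and the file names are all hypothetical, chosen only to show how an abstract layer (the workflow steps) can be linked to an instance layer (what one run actually read and wrote), and how a lineage query can then be written in plain SQL.

```python
import sqlite3

# Hypothetical two-layer schema (an assumption for illustration):
#   step         - abstract layer: the steps defined in the workflow model
#   step_run     - instance layer: one execution of a step in a given run
#   artifact_use - instance layer: which files each execution read or wrote
SCHEMA = """
CREATE TABLE step (step_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE step_run (
    run_id  INTEGER PRIMARY KEY,
    step_id INTEGER REFERENCES step(step_id)
);
CREATE TABLE artifact_use (
    run_id   INTEGER REFERENCES step_run(run_id),
    artifact TEXT,
    role     TEXT CHECK (role IN ('input', 'output'))
);
"""


def build_demo_db():
    """Create an in-memory database holding one small, made-up run."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO step VALUES (?, ?)",
        [(1, "align_warp"), (2, "reslice"), (3, "softmean")],
    )
    conn.executemany(
        "INSERT INTO step_run VALUES (?, ?)", [(10, 1), (11, 2), (12, 3)]
    )
    conn.executemany(
        "INSERT INTO artifact_use VALUES (?, ?, ?)",
        [
            (10, "anatomy1.img", "input"),
            (10, "warp1.warp", "output"),
            (11, "warp1.warp", "input"),
            (11, "resliced1.img", "output"),
            (12, "resliced1.img", "input"),
            (12, "atlas.img", "output"),
        ],
    )
    return conn


def steps_leading_to(conn, artifact):
    """Return the names of all steps in the derivation history of `artifact`."""
    sql = """
    WITH RECURSIVE upstream(artifact) AS (
        SELECT ?
        UNION
        -- walk from an artifact to the inputs of the run that produced it
        SELECT i.artifact
        FROM upstream u
        JOIN artifact_use o ON o.artifact = u.artifact AND o.role = 'output'
        JOIN artifact_use i ON i.run_id = o.run_id AND i.role = 'input'
    )
    SELECT DISTINCT s.name
    FROM upstream u
    JOIN artifact_use o ON o.artifact = u.artifact AND o.role = 'output'
    JOIN step_run r ON r.run_id = o.run_id
    JOIN step s ON s.step_id = r.step_id
    ORDER BY s.name
    """
    return [row[0] for row in conn.execute(sql, (artifact,))]


if __name__ == "__main__":
    conn = build_demo_db()
    print(steps_leading_to(conn, "atlas.img"))
    # → ['align_warp', 'reslice', 'softmean']
```

The final join from `step_run` back to `step` is what the layering buys: the recursive part works purely on instance data, while the answer is reported in terms of the abstract workflow model.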
A Semantic Web approach to the provenance challenge
Moreau, Luc; Ludäscher, Bertram
doi: 10.1002/cpe.1238
Provenance is critically important for scientific workflow systems, as it allows users to verify data, repeat experiments, and discover dependencies. The Semantic Web is a natural fit for representing provenance, as it contains explicit support for representing and inferring connections between data and processes, as well as for adding annotations to data. In this article, we present a Semantic Web approach to the Provenance Challenge (Concurrency Computat.: Pract. Exper. 2007; DOI: 10.1002/cpe.1233). We use web services, ontologies, OWL reasoners, triple stores, and the SPARQL query language to implement the workflow, represent the data and the connections within it, and execute queries. We successfully implemented and answered all of the challenge queries. The flexibility of the Semantic Web also makes it quite easy to convert different provenance systems' data representation to a form we can work with. We illustrate this by integrating data from the PASS approach into our system, and successfully executing all of the challenge queries on it as well. Copyright © 2007 John Wiley & Sons, Ltd.
Query capabilities of the Karma provenance framework
Moreau, Luc; Ludäscher, Bertram
doi: 10.1002/cpe.1229
Provenance metadata in e‐Science captures the derivation history of data products generated from scientific workflows. Provenance forms a glue linking workflow execution with associated data products, and finds use in determining the quality of derived data, tracking resource usage, and verifying and validating scientific experiments. In this article, we discuss the scope of provenance collected in the Karma provenance framework used in the LEAD Cyberinfrastructure project, distinguishing provenance metadata from generic annotations. We further describe our approaches to querying for different forms of provenance in Karma in the context of queries in the first provenance challenge. We use an incremental, building‐block method to construct provenance queries based on the fundamental querying capabilities provided by the Karma service centered on the provenance data model. This has the advantage of keeping the Karma service generic and simple while still supporting a wide range of queries. Karma successfully answers all but one challenge query. Copyright © 2007 John Wiley & Sons, Ltd.
gLite Job Provenance—a job‐centric view
Moreau, Luc; Ludäscher, Bertram
doi: 10.1002/cpe.1252
Job Provenance (JP), part of the gLite Grid middleware, is a service that keeps a long‐term trace of completed computations for further reference. It is a job‐centric service, keeping records about the job life cycle, its environment, inputs/outputs, user parameters, and annotations. During the first provenance challenge, we explored the relation between a specific job‐centric Grid‐oriented provenance and a more general data provenance approach. The challenge represents a use case which emphasizes fields that were not priorities in the original JP design. However, we proved that the design is sufficiently general to cope with this mode of use. We also identified several areas where it is feasible to extend the current implementation. Copyright © 2007 John Wiley & Sons, Ltd.
Mining Taverna's semantic web of provenance
Moreau, Luc; Ludäscher, Bertram
doi: 10.1002/cpe.1231
Taverna is a workflow workbench developed as part of the UK's myGrid project. Taverna's provenance model captures both internal provenance locally generated in Taverna and external provenance gathered from third‐party data providers. This model also supports overlaying secondary provenance over the primary logs and lineage. This design is motivated by the particular properties of bioinformatics data and services used in Taverna. A Semantic Web of provenance, Ouzo, is built to combine these different kinds of provenance by means of semantic annotations. This paper shows how Ouzo can be mined by a provenance usage component, Provenance Query and Answer (ProQA). ProQA supports provenance retrievals as well as provenance abstraction, aggregation, and semantic reasoning. ProQA is implemented as a suite of APIs which can be deployed as provenance services to compose system provenance workflows that analyse experiment results using the provenance records. We show how these features of Taverna's provenance support us in answering the questions from the provenance challenge workshop and a set of additional provenance queries. Copyright © 2007 John Wiley & Sons, Ltd.
Tackling the Provenance Challenge one layer at a time
Moreau, Luc; Ludäscher, Bertram
doi: 10.1002/cpe.1237
VisTrails is a new workflow and provenance management system that provides support for scientific data exploration and visualization. Whereas workflows have been traditionally used to automate repetitive tasks, for applications that are exploratory in nature, change is the norm. VisTrails uses a new change‐based provenance mechanism, which was designed to handle rapidly evolving workflows. It uniformly and automatically captures provenance information for data products and for the evolution of the workflows used to generate these products. In this paper, we describe how the VisTrails provenance data are organized in layers and present a first approach for querying these data that we developed to tackle the Provenance Challenge queries. Copyright © 2007 John Wiley & Sons, Ltd.
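One way to picture a change‐based provenance mechanism is as a tree of versions in which each version stores only the change relative to its parent, so that any workflow version can be rebuilt by replaying changes from the root. The sketch below is an illustration under that assumption, not VisTrails' actual data model or API; the `version_tree` layout, the action names `add` and `delete`, and the module names are all hypothetical.

```python
def materialize(version_tree, version_id):
    """Rebuild the set of modules present in a given workflow version.

    version_tree maps a version id to (parent_id, action), where action is
    ("add", module) or ("delete", module) and the root has parent_id None.
    (Hypothetical layout, used only for this sketch.)
    """
    # Collect the actions on the path from the requested version up to the root.
    chain = []
    current = version_id
    while current is not None:
        parent, action = version_tree[current]
        chain.append(action)
        current = parent

    # Replay them in root-to-leaf order.
    modules = set()
    for kind, module in reversed(chain):
        if kind == "add":
            modules.add(module)
        elif kind == "delete":
            modules.discard(module)
        else:
            raise ValueError(f"unknown action kind: {kind!r}")
    return modules


if __name__ == "__main__":
    tree = {
        1: (None, ("add", "read_volume")),
        2: (1, ("add", "isosurface")),
        3: (2, ("add", "render")),
        4: (2, ("delete", "isosurface")),  # a second branch explored from version 2
    }
    print(sorted(materialize(tree, 3)))  # → ['isosurface', 'read_volume', 'render']
    print(sorted(materialize(tree, 4)))  # → ['read_volume']
```

Because versions 3 and 4 share the prefix up to version 2, the tree records both the workflows themselves and how one exploration branched from another.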
Automatic capture and reconstruction of computational provenance
Moreau, Luc; Ludäscher, Bertram
doi: 10.1002/cpe.1247
The Earth System Science Server (ES3) project is developing a local infrastructure for managing Earth science data products derived from satellite remote sensing. By ‘local,’ we mean the infrastructure that a scientist uses to manage the creation and dissemination of her own data products, particularly those that are constantly incorporating corrections or improvements based on the scientist's own research. Therefore, in addition to being robust and capacious enough to support public access, ES3 is intended to be flexible enough to manage the idiosyncratic computing ensembles that typify scientific research. Instead of specifying provenance explicitly with a workflow model, ES3 extracts provenance information automatically from arbitrary applications by monitoring their interactions with their execution environment. These interactions (arguments, file I/O, system calls, etc.) are logged to the ES3 database, which assembles them into provenance graphs. These graphs resemble workflow specifications, but are really reports—they describe what actually happened, as opposed to what was requested. The ES3 database supports forward and backward navigation through provenance graphs (i.e. ancestor/descendant queries), as well as graph retrieval. Copyright © 2007 John Wiley & Sons, Ltd.
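The step from logged interactions to a navigable provenance graph can be sketched as follows. This is an illustration only and does not reflect the ES3 database or its logging format: the event tuples `(process, action, path)`, the function names, and the sample file names are assumptions. A read adds an edge from file to process, a write adds an edge from process to file, and ancestor/descendant queries become graph reachability.

```python
from collections import defaultdict, deque


def build_graph(events):
    """Assemble parent/child adjacency maps from logged I/O events.

    events: iterable of (process, action, path), action being "read" or
    "write" (a hypothetical log format for this sketch).
    """
    parents = defaultdict(set)   # node -> nodes it directly depends on
    children = defaultdict(set)  # node -> nodes directly derived from it
    for process, action, path in events:
        if action == "read":
            src, dst = path, process
        elif action == "write":
            src, dst = process, path
        else:
            raise ValueError(f"unknown action: {action!r}")
        parents[dst].add(src)
        children[src].add(dst)
    return parents, children


def reachable(adjacency, start):
    """Breadth-first search: every node reachable from `start`."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


if __name__ == "__main__":
    log = [
        ("p1", "read", "raw.dat"),
        ("p1", "write", "calibrated.dat"),
        ("p2", "read", "calibrated.dat"),
        ("p2", "write", "map.png"),
    ]
    parents, children = build_graph(log)
    print(sorted(reachable(parents, "map.png")))   # backward: ancestors
    # → ['calibrated.dat', 'p1', 'p2', 'raw.dat']
    print(sorted(reachable(children, "raw.dat")))  # forward: descendants
    # → ['calibrated.dat', 'map.png', 'p1', 'p2']
```

Since the graph is built only from observed events, it describes what actually ran rather than what a workflow specification requested.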
Addressing the provenance challenge using ZOOM
Moreau, Luc; Ludäscher, Bertram
doi: 10.1002/cpe.1232
ZOOM* UserViews presents a model of provenance for scientific workflows that is simple, generic, and yet sufficiently expressive to answer questions of data and step provenance that have been encountered in a large variety of scientific case studies. In addition, ZOOM builds on the concept of composite step‐classes—or sub‐workflows—which is present in many scientific workflow systems to develop a notion of user views. This paper discusses the design and implementation of ZOOM in the context of the queries posed by the provenance challenge, and shows how user views affect the level of granularity at which provenance information can be seen and reasoned about. Copyright © 2007 John Wiley & Sons, Ltd.
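The effect of a user view on granularity can be illustrated with a small sketch. It is a simplification and not ZOOM's actual algorithm: the function name `apply_user_view`, the edge representation, and the step names are hypothetical. Steps that a user groups into the same composite step‐class are collapsed into one node, so dependencies internal to the composite disappear from what the user sees.

```python
def apply_user_view(edges, view):
    """Collapse step-level dependencies into a coarser, user-level graph.

    edges: set of (producer_step, consumer_step) pairs.
    view:  dict mapping a step to the composite it belongs to; steps not
           listed remain visible as themselves.
    (Both structures are assumptions made for this sketch.)
    """
    collapsed = set()
    for src, dst in edges:
        a = view.get(src, src)
        b = view.get(dst, dst)
        if a != b:  # edges inside one composite are hidden from the user
            collapsed.add((a, b))
    return collapsed


if __name__ == "__main__":
    edges = {("s1", "s2"), ("s2", "s3"), ("s3", "s4")}
    view = {"s2": "Align", "s3": "Align"}
    print(sorted(apply_user_view(edges, view)))
    # → [('Align', 's4'), ('s1', 'Align')]
```

With an empty view the function returns the original edges, which corresponds to seeing provenance at the finest level of granularity.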
From computation models to models of provenance: the RWS approach
Moreau, Luc; Ludäscher, Bertram
doi: 10.1002/cpe.1234
Scientific workflows often benefit from or even require advanced modeling constructs, e.g. nesting of subworkflows, cycles for executing loops, data‐dependent routing, and pipelined execution. In such settings, an often overlooked aspect of provenance takes center stage: a suitable model of provenance (MoP) for scientific workflows should be based upon the underlying model of computation (MoC) used for executing the workflows. We can derive an adequate MoP from a MoC (such as Kahn's process networks) by taking into account the assumptions that a MoC entails, and by recording the observables which it affords. In this way, a MoP captures or at least better approximates ‘real’ data dependencies for workflows with advanced modeling constructs. As a specific instance, we elaborate on the Read–Write–ReSet model, a simple and flexible MoP suitable for a number of different MoCs. Copyright © 2007 John Wiley & Sons, Ltd.