Data provenance allows scientists to validate their models and to investigate the origin of unexpected values. It can also serve as a replication recipe for output data products. However, capturing provenance requires enormous effort from scientists in terms of time and training. First, they need to design the workflow of the scientific model, i.e., workflow provenance; in practice, scientists may not document any workflow provenance before model execution because they lack the time and training to do so. Second, they need to capture provenance while the model is running, i.e., fine-grained data provenance. Explicitly documenting fine-grained provenance is not feasible because provenance data consume massive amounts of storage in applications, including those in the geoscience domain, where data arrive and are processed continuously. In this paper, we propose an inference-based framework that provides both workflow and fine-grained data provenance at minimal cost in terms of time, training, and disk consumption. The proposed framework is applicable to any given scientific model and can handle different model dynamics, such as variation in processing time and in the arrival pattern of input data products. Our evaluation of the framework in a real use case with geospatial data shows that it is relevant and suitable for scientists in the geoscientific domain.
|Number of pages|18|
|Journal|IEEE Transactions on Geoscience and Remote Sensing|
|Publication status|Published - Nov 2013|
- Provenance Graph
- Data Provenance
- Geoscience Applications