Abstract
Tremendous volumes of data have been captured, archived and analyzed. Sensors, algorithms and processing systems for transforming and analyzing the data are evolving over time. Web portals and services can create transient data sets on demand. Data are transferred from organization to organization, with additional transformations at every stage. Provenance in this context refers to the source of data and a record of the process that led to its current state. It encompasses the documentation of a variety of artifacts related to particular data. Provenance is important for understanding and using scientific datasets, and critical for independent confirmation of scientific results. Managing provenance throughout scientific data processing has gained interest lately, and a variety of approaches exist. Large-scale scientific datasets consisting of thousands to millions of individual data files and processes offer particular challenges. This paper uses the analogy of art history provenance to explore some of the concerns of applying provenance tracking to earth science data. It also illustrates some of the provenance issues with examples drawn from the Ozone Monitoring Instrument (OMI) Data Processing System (OMIDAPS) (Tilmes et al. 2004), run at NASA’s Goddard Space Flight Center by the first author.
Notes
Cool URIs don’t change, http://www.w3.org/Provider/Style/URI
The graph notation follows the Open Provenance Model (Moreau et al. 2008a); arrows point from artifacts back to the inputs from which the artifacts were derived.
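The edge direction described in this note can be illustrated with a minimal sketch. It is not taken from the paper or from any Open Provenance Model library: the artifact names and the `lineage` helper are hypothetical, and only derivation edges between artifacts are modeled (processes and agents are omitted for brevity).

```python
from collections import deque

# Hypothetical artifact names, chosen only for illustration.
# Each key is an artifact; each value lists the inputs it was derived from,
# so every edge points from an artifact back to its inputs.
derived_from = {
    "level2_product": ["level1b_radiances", "algorithm_v1"],
    "level1b_radiances": ["level0_raw", "calibration_table"],
}


def lineage(artifact, edges):
    """Return every upstream artifact reachable by following edges backward."""
    seen = []
    queue = deque(edges.get(artifact, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # already visited via another derivation path
        seen.append(node)
        queue.extend(edges.get(node, []))
    return seen


if __name__ == "__main__":
    for upstream in lineage("level2_product", derived_from):
        print(upstream)
```

Storing edges in the "derived from" direction makes the common provenance question (what did this artifact come from?) a simple backward traversal from the artifact of interest.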
References
Bose R, Frew J (2005) Lineage retrieval for scientific data processing: a survey. ACM Comput Surv 37(1):1–28. doi:10.1145/1057977.1057978
da Silva PP, McGuinness DL, Fikes R (2006) A proof markup language for Semantic Web services. Inf Syst 31(4–5):381–395. doi:10.1016/j.is.2005.02.003, http://www.sciencedirect.c7cb2466e94e825, the Semantic Web and Web Services
Freire J, Missier P, Moreau L, Schreiber A, Mattoso M, Silva CT (2008) Provenance and annotation of data and processes, vol 5272/2008. Springer, Berlin. doi:10.1007/978-3-540-89965-5
Heinis T, Alonso G (2008) Efficient lineage tracking for scientific workflows. In: SIGMOD ’08: proceedings of the 2008 ACM SIGMOD international conference on management of data. ACM, New York, pp 1007–1018. doi:10.1145/1376616.1376716
Moreau L, Ludäscher B, Altintas I, Barga RS, Bowers S, Callahan S, Chin GJ, Clifford B, Cohen S, Cohen-Boulakia S, Davidson S, Deelman E, Digiampietri L, Foster J, Freire I, Frew J, Futrelle J, Gibson T, Gil Y, Goble C, Golbeck J, Groth P, Holland DA, Jiang S, Kim J, Koop D, Krenek A, McPhillips T, Mehta G, Miles S, Metzger D, Munroe S, Myers J, Plale B, Podhorszki N, Ratnakar V, Santos E, Scheidegger C, Schuchardt K, Seltzer M, Simmhan YL, Silva C, Slaughter P, Stephan E, Stevens R, Turi D, Vo H, Wilde M, Zhao J, Zhao Y (2007) Special issue: the first provenance challenge. Concurr Comput: Practice and Experience 20(5):409–418. doi:10.1002/cpe.1233
Moreau L, Freire J, Futrelle J, Mcgrath R, Myers J, Paulson P (2008a) The open provenance model: an overview. Provenance and annotation of data and processes, pp 323–326. doi:10.1007/978-3-540-89965-5_31
Moreau L, Groth P, Miles S, Vazquez-Salceda J, Ibbotson J, Jiang S, Munroe S, Rana O, Schreiber A, Tan V, Varga L (2008b) The provenance of electronic data. Commun ACM 51(4):52–58. doi:10.1145/1330311.1330323
Nurmi D, Wolski R, Grzegorczyk C, Obertelli G, Soman S, Youseff L, Zagorodnov D (2009) The eucalyptus open-source cloud-computing system. In: CCGRID ’09: proceedings of the 2009 9th IEEE/ACM international symposium on cluster computing and the grid. IEEE Computer Society, Washington, DC, pp 124–131. doi:10.1109/CCGRID.2009.93
Simmhan YL, Plale B, Gannon D (2005) A survey of data provenance in e-science. SIGMOD Rec 34(3):31–36. doi:10.1145/1084805.1084812
Suarez-Sola I, Davey A, Hourcle JA (2008) What are we tracking ... and why? AGU Fall Meeting Abstracts, pp C1047+
Tilmes C, Linda M, Fleig A (2004) Development of two Science Investigator-led Processing Systems (SIPS) for NASA’s Earth Observation System (EOS). In: Geoscience and remote sensing symposium, 2004. IGARSS ’04. Proceedings. 2004 IEEE International, vol 3, pp 2190–2195. doi:10.1109/IGARSS.2004.1370795
Acknowledgement
Thanks to the NASA MODIS and OMI Data Processing teams.
Additional information
Communicated by: Thomas Narock
About this article
Cite this article
Tilmes, C., Yesha, Y. & Halem, M. Tracking provenance of earth science data. Earth Sci Inform 3, 59–65 (2010). https://doi.org/10.1007/s12145-010-0046-3
DOI: https://doi.org/10.1007/s12145-010-0046-3