Barateiro, "Digital preservation of heterogeneous data", TCDL Bulletin 5.2 (2009)

Digital Preservation of Heterogeneous Data

José Barateiro

Instituto Superior Técnico, Lisbon, Portugal
jbarateiro@lnec.pt

Abstract

Digital preservation aims at maintaining digital objects and making data accessible over long periods of time. The emerging context of e-Science imposes new scenarios and new requirements for digital preservation. This paper presents the main issues, current solutions and prominent projects for digital preservation and e-Science data management. Finally, it presents the problem to be addressed, the solutions to explore, the proposed methodology and expected contributions of the associated PhD work.

Keywords

Digital Libraries, Digital Preservation, Data Grids, e-Science, Interoperability

1 Introduction

Artifacts stored on physical media, like printed materials or drawings carved in stone, can survive for centuries. Such centuries-old contents are testimonials of past generations and an important asset for the future. In contrast with physical information, digital materials require the continuous execution of management and preservation tasks in order to remain interpretable in the future.

The increasing use of technology in professional and personal activities has resulted in a continuous growth of the volume of valuable digital materials. A significant part of those materials must be preserved in order to provide crucial knowledge for future generations.

Today’s organizations usually rely on isolated information systems to produce, manage and exploit large amounts of heterogeneous data. When the information is managed by integrated information systems, the integration is based on processes defined to assure the interoperability¹ of a specific set of operations, without any guarantee that the integration of systems will be preserved in the future. Furthermore, associated documents, such as technical reports, may be produced and managed outside the information systems. Non-digitized documents may also contain valuable data and may be associated with digital information handled by current information systems.

To run their data analysis applications efficiently (e.g., decision support systems, data mining, customer relationship management) and extract valuable information from their repositories, organizations must manage both data and the relations between data. However, heterogeneous scenarios such as the one described above severely limit interoperability and information sharing between systems and processes, requiring additional downstream tasks and data manipulation to obtain the intended knowledge. This lack of interoperability reduces the efficiency of organizations.

For instance, the real scenario of the Portuguese National Laboratory of Civil Engineering² includes several sources of technical and scientific data, such as data captured by sensors located at strategic points of dams, bridges or other structures. This information is generated, captured and managed by different processes and different systems, using different formats (e.g., different file schemas to store the same type of information). Thus, data consolidation and sharing are difficult, since no adequate policies and/or infrastructures are defined.

In these types of scenarios, it is crucial to provide solutions supporting spatial interoperability (reuse and sharing of processes and data between different systems), and supporting temporal interoperability (long-term preservation) of data. The interoperability problem becomes even more complex due to the specific data structures that must be analyzed in order to generate relevant information. Consequently, it is crucial to map and understand data structures and schemas, guarantee that the same information is only generated once (avoiding the costs and risks associated with data redundancy), assure the sharing of data between systems with different logics and, finally, provide the long-term preservation of valuable data.
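The mapping of different schemas onto one shared representation can be illustrated with a minimal sketch. The two source formats, the field names and the `Reading` record are all hypothetical and are not taken from any LNEC system; the point is only that two files storing the same type of information under different schemas can be reduced to a single record, so that the same measurement is kept once.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Reading:
    """Common record shared by all systems (hypothetical schema)."""
    sensor_id: str
    timestamp: datetime
    value: float


def from_system_a(row: dict) -> Reading:
    # Hypothetical system A: dictionary rows with ISO 8601 timestamps.
    return Reading(
        sensor_id=row["sensor"],
        timestamp=datetime.fromisoformat(row["time"]),
        value=float(row["value"]),
    )


def from_system_b(line: str) -> Reading:
    # Hypothetical system B: "id;epoch_seconds;value" text lines.
    sensor_id, epoch, value = line.strip().split(";")
    return Reading(
        sensor_id=sensor_id,
        timestamp=datetime.fromtimestamp(int(epoch), tz=timezone.utc),
        value=float(value),
    )


a = from_system_a({"sensor": "D1-07", "time": "2008-01-15T10:00:00+00:00", "value": "12.5"})
b = from_system_b("D1-07;1200391200;12.5\n")

# Both sources describe the same measurement, so a set keeps it only once.
consolidated = {a, b}
```

Once both sources are expressed in the common record, duplicates can be detected by plain equality, which is one simple way to avoid redundant copies of the same information.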

The active management and assessment of data over the entire Information Life-Cycle, from the production and management phase to the dissemination of information, is crucial to providing interoperability and, consequently, contributes to the efficient and effective execution of current and future organizational processes. Moreover, information life-cycle management must be a cross-cutting concern within the organization, in order to support data sharing and interoperability between different systems, where the same information must represent the same knowledge.

Digital preservation aims at ensuring interoperability in the time dimension (interoperate with the future), that is, to guarantee that data or digital objects remain authentic and accessible to users over a long period of time, maintaining the value of data. Achieving this goal may require a large investment in infrastructure for storing data, management, maintenance, etc.

The complexity of digital preservation increases with the fact that each type of digital information has its own particularities and special requirements. For instance, the preservation of audio files requires compression and encoding/decoding information not needed in the preservation of XML files.

Several communities, like biology, medicine, engineering or physics, manage large amounts of scientific information. It usually includes large datasets of structured data (e.g., data captured by sensors), physical or mathematical simulations and several highly specialized documents reporting the work and conclusions of researchers.

The above mentioned information can be represented in a wide range of formats (e.g., a researcher can use her own input and output formats, and also her own program to produce simulations) and include a high level of relations that are not expressed in the data model. Moreover, the collaborative environment of the scientific community, and associated services and infrastructures, usually known as e-Science (or enhanced Science) [13], involves the requirement of interoperability and data sharing.

The remainder of this paper is organized as follows. First, section 2 surveys the most prominent works in digital preservation. Section 3 illustrates the proposed research methodology and expected contributions of the ongoing work in the scope of this PhD. Finally, section 4 presents the main conclusions of this paper.

2 State of the art

The concept of digital preservation is relatively new and together with some related terms, such as digital curation, it is still evolving. It is important to remark that these terms can be differently interpreted by different communities or disciplines.

Thus, in the scope of this PhD, digital curation includes digital preservation, as it is broader in the sense that it also concerns a degree of added value and knowledge in the life-cycle management [3]. For instance, digital curation includes issues in data creation (e.g., control of the application of standards), data selection (e.g., selection of representative sub-sets of huge data sets), and management to provide easy discovery, retrieval and sharing.

Digital preservation aims at optimizing information life-cycle management over long periods of time (possibly indefinitely). To allow the future use of information, digital preservation solutions must assure access to data from multiple and heterogeneous systems (including unpredictable future systems), which requires sharing and interoperability between information systems.

This section intends to survey the most prominent initiatives in the digital preservation arena. The remainder of this section is organized as follows. First, sections 2.1 and 2.2 provide an overview of national and international initiatives in digital preservation, respectively. Second, section 2.3 presents some relevant e-Science projects. Finally, an overview of other related work, especially solutions and tools for digital preservation, is provided in section 2.4.

2.1 National initiatives

The National Archives of the United States started the ERA project [6] in 1998. The ERA project is concerned with developing a system that assures the preservation of and access to electronic records (e.g., word-processing documents, emails) independently of the hardware and software configurations. Consequently, it intends to guarantee the availability of and access to electronic records for long periods of time.

It is important to remark that the ERA project includes the collaboration of several entities, like governmental agencies, other national archives, commercial industry and the scientific community. An important result of this joint collaboration was the development of a common framework of terms and concepts for long-term preservation, aligned with the reference model for an open archival information system (OAIS) [15]. The OAIS provides a high level conceptual model, designed to support static processes and static information types for long-term preservation.

The Federal Ministry of Education and Research of Germany is funding two initiatives for long-term preservation [1]: (i) the NESTOR project (Network of Expertise in Long-Term Storage of Digital Resources) has been established to foster a collaborative infrastructure of long-term preservation and exchange expertise between cultural heritage institutions; (ii) the KOPAL project (Kooperativer Aufbau eines Langzeitarchivs digitaler Informationen or Co-operative Development of a long-term Digital Information Archive) [22] intends to develop a technological solution for a long-term archival system.

The PANDORA project (Preserving and Accessing Networked Documentary Resources of Australia) [16] was established initially by the National Library of Australia in 1996 and now is maintained in collaboration with other Australian libraries and cultural organizations. The PANDORA project organizes a collection of techniques (e.g., software to process word formats), policies and workflows designed to achieve digital preservation. It collects online publications and web sites relevant to Australia or produced by an Australian author, providing long-term access to the collected contents.

The PADI project (Preserving Access to Digital Information) [8] is also a National Library of Australia initiative. The PADI website provides a gateway for cooperation, containing collections of resources dealing with digital preservation, such as policies, reports and journal articles. The main goals of this cooperation gateway are to provide an agile mechanism to support the development of guidelines and strategies for digital preservation, to provide a forum for cross-sector cooperation on digital preservation activities, to maintain a public web site for promotion and information purposes and, finally, to dynamically identify and promote significant activities. PADI allows registered contributors from all around the world to add resources to the PADI repository, making them available to the community.

In Portugal, the National Archives (Direcção Geral de Arquivos), in collaboration with the University of Minho, is developing the RODA project [2], which intends to create a prototype to assure the preservation of digital documents and records from the Public Administration, in contexts where data is stored in relational databases, images and text.

2.2 International projects

Digital preservation systems require specific metadata. The PREMIS initiative (PREservation Metadata Implementation Strategies) focuses on concepts for digital preservation [11], defining a dictionary of elements for preservation metadata. Its main goal is to support the description of digital objects within an archive, as well as the exchange of objects and their associated metadata between archives and archival systems.

The European Union is increasingly interested in digital issues, especially in digital preservation. Currently, the European Union is funding the CASPAR project (Cultural, Artistic, and Scientific knowledge for Preservation, Access and Retrieval), the Planets project, the DELOS Network of Excellence on Digital Libraries and the DPE project (Digital Preservation Europe).

The CASPAR project mainly intends to implement, extend and validate the OAIS reference model, to enhance techniques to extract representation information of digital contents, and to design services supporting long-term preservation independent of the hardware and software configuration.

The Planets project includes partners from European national libraries, archives and research institutions. The Plato tool [18], developed within Planets, enables organizations to define, evaluate and execute preservation plans. The final goal of Planets is to integrate the preservation planning, characterization of digital objects and preservation actions in order to form a long-term preservation framework.

An objective of DELOS is to provide a unified and comprehensive methodological framework over the life-cycle of digital information. As a result, it may ensure that the design of digital libraries addresses preservation issues. It has a strong emphasis on training and intends to develop testbeds, metrics and tools to evaluate digital preservation strategies.

DPE continues the earlier work of the ERPANET project (Electronic Resource Preservation and Access Network). It intends to promote collaboration between national and international digital preservation initiatives across European research. This collaboration is to be achieved through a coherent platform for cooperation, exchange and dissemination of research results.

2.3 About e-Science

e-Science problems involve observational data (sensor data, data streams), simulation data (generated by very large computations), and experimental data. Some examples of e-Science domains include physics, bioinformatics, geographical sciences and engineering. For all these domains, the technological infrastructure to handle such collaboration and data management is referred to as the Data Grid [9].

The Earth System Grid II project is funded by the US Department of Energy and intends to transform climate data into community resources. It is a large production distributed system, storing high-resolution long-duration simulations performed with advanced climate models that produce many terabytes of output. The Earth System Grid allows worldwide access to researchers at laboratories, universities and other institutions. Users can locate desired datasets, with the option of browsing metadata of over 130 TB of data, representing the past six years of joint climate modeling experiments.

The GEODISE project (Grid Enabled Optimization and Design Search for Engineering) is funded by the UK Engineering and Physical Sciences Research Council. It intends to provide a Grid-based framework for computation and intensive data management. It plans to automatically archive engineering data, especially CAD and simulation data. It has a strong focus on dissemination capabilities, using knowledge management and ontology techniques to allow users to extract knowledge, instead of just data. GEODISE uses Web standards such as XML, XSD, WSDL and UDDI to access and interchange data.

Sensor networks are able to provide measurements at temporal and spatial granularities previously impossible with manual observation [4]. For example, the LifeUnderYourFeet project [19] provides a wireless sensor network for soil observation, where raw data is periodically captured by sensors and automatically stored in a central database. This project will use approximately 2000 sensors monitoring air and soil temperature and moisture, light and gases. The expected data growth is about 200 million measurements per year, depending on the periodicity of the data capturing.
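As a rough consistency check on these figures, the implied sampling interval can be derived with a few lines of arithmetic. The calculation assumes, which the cited report may not, that every sensor contributes one measurement per sample and that all sensors sample at the same uniform rate throughout a 365-day year.

```python
# Figures quoted in the text.
SENSORS = 2000
MEASUREMENTS_PER_YEAR = 200_000_000

# Assumption: a 365-day year and uniform sampling on every sensor.
SECONDS_PER_YEAR = 365 * 24 * 3600

per_sensor_per_year = MEASUREMENTS_PER_YEAR / SENSORS
interval_seconds = SECONDS_PER_YEAR / per_sensor_per_year
interval_minutes = round(interval_seconds / 60, 2)

print(per_sensor_per_year)  # 100000.0 measurements per sensor per year
print(interval_minutes)     # 5.26 minutes between samples
```

Under these assumptions, the quoted growth corresponds to each sensor reporting roughly every five minutes; halving the sampling period would double the yearly volume.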

In the US, the National Science Foundation supports several relevant e-Science projects, such as (i) National Virtual Observatory for astronomical archives, (ii) TeraGrid, including several open scientific datasets, (iii) Long Term Ecological Research, for ecological processes over long temporal and broad spatial scales, (iv) Alliance for Cellular Signaling, including datasets of proteins and cellular signaling and (v) SIOExplorer, handling Oceanography maps and documents.

2.4 Some related techniques

Data grid infrastructures are mainly used in the management, processing and sharing of data. However, these data processing features may be used and extended to build a technological infrastructure for digital preservation on top of a data grid solution. In [20], the authors present a survey of data grid solutions and propose a taxonomy for data grids, organized in four major topics: (i) organization (organizational characteristics of data grids), (ii) data transport (data model, security, fault tolerance, transfer modes), (iii) data replication, and (iv) scheduling (resource allocation).

The iRODS data grid middleware is being developed at the San Diego Supercomputer Center. It can be used to map preservation actions into operations in data grid storage systems. The iRODS system is an open-source storage solution for data grids based on a client-server architecture with distributed computational resources and storage. A database system is used to maintain attributes, state of data and operations. A rule system is used to enforce and execute adaptive rules. This system belongs to the class of adaptive middleware systems, since it allows users to alter software functionality without any recompilation [17].

Management policies are mapped into rules that invoke and control operations (micro-services) on remote storage media. Rules can be used, for example, for access control or for access to other grid systems. The functionality of the middleware can be extended by composing new rules and policies.
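The idea of policies expressed as rules that trigger micro-services can be sketched generically. The code below is not iRODS rule syntax and does not use any iRODS API; the registry, the rule format and the two micro-services are hypothetical stand-ins, chosen only to show how behaviour can be changed by adding rules as data rather than by modifying the engine.

```python
# Registry of micro-services: small named operations (all hypothetical).
MICRO_SERVICES = {}


def micro_service(name):
    """Decorator that registers a function under a micro-service name."""
    def register(func):
        MICRO_SERVICES[name] = func
        return func
    return register


@micro_service("checksum")
def verify_checksum(obj):
    # Placeholder: a real service would recompute and compare a digest.
    obj["checksum_verified"] = True


@micro_service("replicate")
def replicate(obj):
    # Placeholder: a real service would copy the object to another resource.
    obj.setdefault("replicas", []).append("archive-store")


# A rule = triggering event + condition + ordered list of micro-services.
RULES = [
    {
        "event": "put",
        "condition": lambda obj: obj["collection"] == "sensor-data",
        "actions": ["checksum", "replicate"],
    },
]


def apply_rules(event, obj):
    """Run every matching rule and return the names of the services invoked."""
    fired = []
    for rule in RULES:
        if rule["event"] == event and rule["condition"](obj):
            for name in rule["actions"]:
                MICRO_SERVICES[name](obj)
                fired.append(name)
    return fired


stored = {"collection": "sensor-data"}
invoked = apply_rules("put", stored)
```

Appending a new dictionary to `RULES` at run time changes what happens on the next `put`, which mirrors, in a very reduced form, the claim that functionality can be altered without recompilation.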

iRODS uses the storage provided by the local file system, creating a virtual file system on top of it. That virtualization creates infrastructural independence, since logical names are given to files, users and resources.
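A minimal sketch of this kind of virtualization, assuming a simple in-memory catalogue, is given below. The class, its methods and the example paths are hypothetical and do not reproduce the iRODS catalogue; they only show why a logical name shields users from changes in the physical location.

```python
class LogicalNamespace:
    """Maps stable logical names to changeable physical locations."""

    def __init__(self):
        self._replicas = {}  # logical name -> list of physical paths

    def register(self, logical_name, physical_path):
        self._replicas.setdefault(logical_name, []).append(physical_path)

    def migrate(self, logical_name, old_path, new_path):
        # The storage changes; the logical name seen by users does not.
        paths = self._replicas[logical_name]
        paths[paths.index(old_path)] = new_path

    def resolve(self, logical_name):
        return list(self._replicas[logical_name])


ns = LogicalNamespace()
ns.register("/archive/dams/d1/2008.csv", "/mnt/disk1/a93f.dat")
ns.migrate("/archive/dams/d1/2008.csv", "/mnt/disk1/a93f.dat", "/mnt/tape7/a93f.dat")
locations = ns.resolve("/archive/dams/d1/2008.csv")
```

References held by users or by other records keep pointing at `/archive/dams/d1/2008.csv` after the migration, which is the property that matters for long-term preservation across storage generations.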

In [5], the authors state that preservation and data sharing involve not just technical issues, but also social and cultural processes. Indeed, social and cultural factors determine which information should be preserved for the long term and how the information can be archived. Moreover, the privileges to access and maintain the archived information must be defined. Consequently, a “long-term agreement” must be established to govern the cooperation and sharing of information between the involved participants.

The XML format simplifies interoperability between systems, since it facilitates the migration of information, is human-readable and supports international character sets (Unicode). Moreover, a well-defined XML Schema allows the validation of the contents of each XML file. Thus, several projects like DiVA [14] are based on XML to define a strategy for interoperability and preservation of digital data [12]. In DiVA, every object has a persistent unique identifier and a checksum to validate the integrity of contents. Associated metadata is also defined in XML, allowing, for instance, the identification of the producer of a document, the title, language, etc.
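The combination of an identifier, a checksum and XML metadata can be sketched as follows. This is not DiVA's actual record format: the element names, the use of a random UUID as identifier and the choice of SHA-256 are all assumptions made for illustration, since the source does not state which identifier scheme or digest algorithm DiVA uses.

```python
import hashlib
import uuid
import xml.etree.ElementTree as ET


def describe(content: bytes, title: str, producer: str, language: str) -> ET.Element:
    """Build a small XML metadata record for a digital object."""
    # Assumption: a random UUID stands in for a persistent identifier.
    record = ET.Element("object", id=f"urn:uuid:{uuid.uuid4()}")
    ET.SubElement(record, "title").text = title
    ET.SubElement(record, "producer").text = producer
    ET.SubElement(record, "language").text = language
    # Assumption: SHA-256 as the integrity checksum.
    checksum = ET.SubElement(record, "checksum", algorithm="SHA-256")
    checksum.text = hashlib.sha256(content).hexdigest()
    return record


def is_intact(content: bytes, record: ET.Element) -> bool:
    """Recompute the digest and compare it with the stored checksum."""
    return hashlib.sha256(content).hexdigest() == record.findtext("checksum")


data = b"<report>Dam D1 inspection</report>"
record = describe(data, "Inspection report", "LNEC", "en")
print(ET.tostring(record, encoding="unicode"))
```

Calling `is_intact(data, record)` after any later copy or migration gives a simple fixity check: a single changed byte in the content produces a different digest and the comparison fails.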

Standards for describing data [21] (e.g., binary data), like the Data Format Description Language, can only characterize the structures of data. The relationships between the structures, or relationships with other digital materials, cannot be represented.
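What a structural description of binary data provides, and what it leaves out, can be illustrated with Python's standard `struct` module, used here as a stand-in and not as an implementation of the Data Format Description Language. The record layout and field names are hypothetical.

```python
import struct

# Hypothetical layout: little-endian, 4-byte unsigned sensor id,
# 8-byte signed Unix timestamp, 4-byte float reading (16 bytes in total).
RECORD = struct.Struct("<Iqf")
FIELDS = ("sensor_id", "timestamp", "value")


def decode(buffer: bytes) -> list:
    """Decode a buffer of fixed-size records into dictionaries."""
    return [dict(zip(FIELDS, values)) for values in RECORD.iter_unpack(buffer)]


raw = RECORD.pack(7, 1200391200, 12.5) + RECORD.pack(7, 1200391500, 12.75)
records = decode(raw)
```

The description is enough to recover the fields of each record, but nothing in it says which structure the sensor was attached to, which simulation consumed these readings, or which report interprets them; those relationships have to be preserved by other means.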

3 Proposed research and methodology

This PhD will focus on the digital preservation of scientific data (highly heterogeneous) and related information like documents and metadata. The generic problem of digital preservation in e-Science scenarios will be analyzed and defined, identifying related problems and issues. Methodological proposals will be provided, encompassing best practices and techniques to address digital preservation issues. The main purpose will be to provide a comprehensive description and analysis of the problem of digital preservation in e-Science scenarios, with recommendations about how to address it in some of these scenarios. The proposed results will be supported by evidence and by analysis of their application in real cases.

A special set of scenarios will be provided by the Portuguese National Laboratory of Civil Engineering. A first scenario may be used as a development case, where the proposed solution must assure the spatial and temporal interoperability of the related information. At least one further scenario will be used to validate the solutions proposed during the development case, in order to guarantee their applicability, i.e., the proposed solution must be applicable in several scenarios instead of being specific to a single case. Both scenarios may include heterogeneous data, including born-digital and digitized information, managed by distinct processes within the same organization.

Examples of born-digital data include data stored in relational databases, ASCII files and XML files (for example, bridge observation data stored in a relational database, data logs recorded by sensors, etc.). Moreover, related contextual information, like technical reports produced by experts or photographs captured during inspections (e.g., to document a specific anomaly), may exist in digital or only in physical format.

Current approaches and e-Science projects like those described in section 2.3 are designed to manage a specific type of digital information (e.g., sensor data, simulation datasets). Moreover, the contents are preserved as single entities. e-Science and heterogeneous domains require the preservation of context, that is, the preservation of the relations between data and between the structures of data. Furthermore, the increasingly global collaboration of people and resources involved in e-Science domains requires not only interoperability features, but also the capacity to handle large amounts of data generated by different (possibly heterogeneous) sources [9]. To address these issues, this PhD intends to model the preservation of multiple types of heterogeneous data and their relations as a single DataSpace [7]. For instance, it may be crucial to preserve the environment (representation of known properties of digital contents and creation processes) of a specific simulation based on inputs provided by some records of a sensor dataset. Consequently, I will focus not only on the preservation of materials produced in e-Science domains, but also on the preservation of the relations between these materials and other contents (digital or even non-digital) required to correctly interpret the preserved data in the future.
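A minimal sketch of the kind of structure this implies is given below: heterogeneous items, digital or not, are registered with descriptive properties and linked by named relations, so that the context needed to interpret one item can be collected by following those relations. The class, the relation names and the example items are hypothetical; this is an illustration of the idea and not the model that the PhD work will propose.

```python
from collections import defaultdict


class DataSpace:
    """Heterogeneous items plus the named relations between them."""

    def __init__(self):
        self.items = {}                    # item id -> descriptive properties
        self.relations = defaultdict(set)  # item id -> {(relation, target id)}

    def add_item(self, item_id, **properties):
        self.items[item_id] = properties

    def relate(self, source, relation, target):
        if source not in self.items or target not in self.items:
            raise KeyError("both items must be registered first")
        self.relations[source].add((relation, target))

    def context_of(self, item_id):
        """Return every item reachable from item_id by following relations."""
        seen, stack = set(), [item_id]
        while stack:
            current = stack.pop()
            for _, target in self.relations[current]:
                if target not in seen:
                    seen.add(target)
                    stack.append(target)
        seen.discard(item_id)
        return seen


space = DataSpace()
space.add_item("sensor-2008", kind="sensor dataset", format="relational database")
space.add_item("simulator-v1", kind="program", format="binary")
space.add_item("simulation-42", kind="simulation output", format="ASCII")
space.add_item("report-17", kind="technical report", format="paper")  # non-digital

space.relate("simulation-42", "usesInput", "sensor-2008")
space.relate("simulation-42", "producedBy", "simulator-v1")
space.relate("report-17", "documents", "simulation-42")

needed = space.context_of("report-17")
```

In this example, interpreting the paper report in the future requires the simulation it documents, and through it the sensor dataset and the program that produced the simulation, even though the four items differ in type and format.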

In summary, this work will contribute to the understanding of digital preservation in the emerging context of e-Science. To start, I intend to analyze the context in detail, identify its main scopes and related requirements, and model the generic problem domain and ontology. After that, I intend to propose solutions to address the main aspects of the problem. I intend to contribute conceptual models of solutions, and to demonstrate them in real cases, namely in the scope of the Portuguese National Laboratory of Civil Engineering (as well as in other national and international e-Science scenarios). A final important contribution will be the analysis of solutions to represent the relationships between preserved data, in ways that future applications can understand.

4 Conclusions and scope

Current e-Science initiatives using data grids to manage large amounts of data are designed to provide interoperability and collaboration, rather than long-term preservation. Moreover, they do not make it possible to manage and preserve relations between different types of heterogeneous (possibly domain-specific) data. In the scope of my PhD, I intend to provide solutions for digital preservation in the emerging context of e-Science. These solutions should handle heterogeneous types of contents (e.g., data sets, documents, images, CAD), manage relations between contents, and preserve the production context.

This PhD is being pursued in the Doctoral Programme in Information Systems and Computer Engineering, at Instituto Superior Técnico, Portugal, under the supervision of Prof. José Borbinha. It formally started in January 2008 and is expected to conclude in December 2010.

The outputs of this PhD will be used in the scope of the Portuguese national project GRITO (ref. FCT GRID/GRI/81872/2006) and the European project SHAMAN - Sustaining Heritage Access through Multivalent ArchiviNg (IP Grant Agreement 216736, 7th Framework Programme / ICT-1-4.1 Digital libraries and technology-enhanced learning).

Acknowledgments

This work is supported by an individual grant from FCT (SFRH/BD/23405/2005) and by LNEC.

Footnotes

1	As defined by the Institute of Electrical and Electronics Engineers (IEEE), interoperability is the ability of two or more systems or components to exchange information and to use the information that has been exchanged [10].

2	LNEC - Laboratório Nacional de Engenharia Civil

References

[1]	R. Altenhöner. NESTOR and KOPAL: two national initiatives to ensure long-term accessibility of digital documents in Germany, June 2005.

[2]	F. Barbedo, L. Corujo, L. Faria, R. Castro, M. Ferreira, and J. Ramalho. Roda: Repositório de objectos digitais autênticos. In 9º Congresso Nacional de Bibliotecários, Arquivistas e Documentalistas, Ponta Delgada, Portugal, 2007.

[3]	N. Beagrie. Digital curation for science, digital libraries, and individuals. The International Journal of Digital Curation, Issue 1, Volume 1, Autumn 2006.

[4]	D. Cuff, M. Hansen, and J. Kang. Urban sensing: Out of the woods. Communications of the ACM, 51(3):24–33, March 2008.

[5]	M. Day, M. Pennock, and J. Allinson. Co-operation for digital preservation and curation: Collaboration for collection development in institutional repository networks. In DigCCurr2007 - International Symposium in Digital Curation, Chapel Hill, April 2007.

[6]	M. Folk and B. Barkstrom. Attributes for long-term preservation of scientific and engineering data in digital libraries, 2003.

[7]	M. J. Franklin, A. Y. Halevy, and D. Maier. From databases to dataspaces: a new abstraction for information management. SIGMOD Record, 34(4):27–33, 2005.

[8]	M. Hanley. PADI (preserving access to digital information) and safekeeping. HEP Libraries Webzine, Issue 9, February 2004.

[9]	T. Hey and A. E. Trefethen. The UK e-Science core program and the grid. In International Conference on Computational Science (1), pages 3–21, 2002.

[10]	IEEE. IEEE Standard Computer Dictionary: A Compilation of IEEE Standard Computer Glossaries, 1990.

[11]	B. Lavoie and R. Gartner. Preservation metadata. DPC Technology Watch Report No. 05-01, September 2005.

[12]	R. McGrath. XML and scientific file formats. Geological Society of America Annual Meeting, November 2003.

[13]	S. Miles, S. C. Wong, W. Fang, P. Groth, K.P. Zauner, and L. Moreau. Provenance-based validation of e-Science experiments. Web Semant., 5(1):28–38, 2007.

[14]	E. Muller, U. Klosa, S. Andersson, and P. Hansson. The DiVA project: development of an electronic publishing system. D-Lib Magazine, Volume 9, November 2003.

[15]	OAIS. Reference model for an open archival information system (adopted as ISO 14721:2003). Blue Book (Standard), Issue 1, January 2002.

[16]	M. Phillips. Pandora, Australia’s web archive, and the digital archiving system that supports it. DigiCULT.info, Issue 6, December 2003.

[17]	A. Rajasekar, M. Wan, R. Moore, and W. Schroeder. A prototype rule-based distributed data management system. In HPDC workshop on Next Generation Distributed Data Management, Paris, France, 2006.

[18]	S. Strodl, C. Becker, R. Neumayer, and A. Rauber. How to choose a digital preservation strategy: evaluating a preservation planning procedure. In JCDL, pages 29–38, 2007.

[19]	K. Szlavecz, A. Terzis, S. Ozer, R. Musaloiu-E, J. Cogan, S. Small, R. Burns, J. Gray, and A. Szalay. Life under your feet: An end-to-end soil ecology sensor network, database, web server, and analysis service. Microsoft Technical Report MSR TR 2006 90, June 2006.

[20]	S. Venugopal, R. Buyya, and K. Ramamohanarao. Taxonomy of data grids for distributed data sharing, management, and processing. ACM Computing Surveys, 38(1):153, March - June 2006.

[21]	M. Westhead, T. Wen, and R. Carroll. Describing data on the grid. In GRID, pages 134–140, 2003.

[22]	T. Wollschläger. KOPAL: a cooperative approach to develop a long-term digital information archive. In International Coalition of Library Consortia, Rome, 2006.