Manguinhas, "Achieving Semantic Interoperability using Model Descriptions", TCDL Bulletin 6.2 (2010)

Achieving Semantic Interoperability using Model Descriptions

Hugo Manguinhas

INESC-ID – Instituto de Engenharia
de Sistemas e Computadores
Apartado 13069, 1000-029 Lisboa, Portugal
hugo.manguinhas@ist.utl.pt

Abstract

The integration and reuse of descriptive metadata available through the Internet is becoming essential. Reusing this descriptive metadata, created in multiple contexts and by a wide range of entities and communities, creates a problem of semantic interoperability. When two processes need to exchange information, a level of semantic interoperability is reached when both of them are able to find a common understanding of the information being interchanged. The assumption for this dissertation is that this understanding can be achieved by aligning the source and target information models (and the underlying data models), using the rules modeling the relationships and equivalences between them (mappings). This dissertation will focus on using Metadata Registries as the means to make these models ubiquitous throughout and between organizations, and to allow for a better integration of interoperability functionality into applications.

Categories and Subject Descriptors

D.2.12 [Software Engineering]: Interoperability – Data mapping

I.6.1 [Simulation and Modeling]: General.

I.2.4 [Artificial Intelligence]: Knowledge Representation Formalisms and Methods - Representation languages, Representations (procedural and rule-based), Semantic networks.

H.2.1 [Database Management]: Logical Design – Data models, Schema and subschema.

H.2.7 [Database Management]: Database Administration - Data dictionary/directory.

General Terms

Management, Standardization, Languages, Design, Measurement.

Keywords

Interoperability, Semantic Interoperability, Data interoperability, Metadata Registries, Conceptual Models, Data Models, Ontologies, Mapping Models.

1 Introduction

The rapid growth of resources available on the Internet is making it a valuable source of descriptive metadata. The integration and reuse of this metadata, created in multiple contexts and by a wide range of entities and communities, creates a problem of semantic interoperability.

In general terms, a problem of interoperability in Information Systems can occur when there is a need to exchange information between two processes, either within one system (manipulation of different information entities, with semantic overlap but created in different contexts) or between distributed systems. A level of semantic interoperability is reached when both processes find a common understanding of the information being interchanged; more precisely, when the information model is shared between the two processes and the information entities being interchanged are unambiguously defined.

An information model is defined here as an abstract description of how information is represented in a specific context (a business area, an organization or a community). When two processes need to interchange information, they rely on data representations to encode that information in a specific syntax and/or storage method; such a representation is commonly called a data model or schema.

When the information model behind the information being interchanged is not shared by the two participating processes, this information must be aligned. The assumption for this dissertation is that this alignment can be achieved by taking advantage of the rules modeling the relationships and equivalences (mappings) between the information models. These relationships are defined in the context of a mapping model, which can then be used for cross-walking between the two information models.

The information needed to describe these models (information model and data model) has been called metadata (as opposed to the term “data”, which refers to the data objects representing the instances of the information entities). From this arose the concept of Metadata Registries (MDR), defined as information systems to manage and publish metadata. An MDR provides a unified view and promotes a common understanding of information managed within and across organizations.

This dissertation will focus on the representation of data and information models, along with both the visual and formal representations of the mappings that may be established between them. It will also focus on using MDRs as the means to make these models ubiquitous throughout and between organizations, and to allow for a better integration of interoperability functionality into applications.

The remainder of this paper is organized as follows: Section 2 explains the problem of semantic interoperability; Section 3 explains the assumptions of this dissertation for achieving this interoperability; Section 4 summarizes the current knowledge of the problem domain, as well as the state of the art for solutions; Section 5 identifies the significant problems in the field of research; Section 6 formulates the research question addressed in this dissertation; Section 7 explains the motivation of the author; Section 8 presents the research plans and ideas; and finally, Section 9 describes the expected contributions of the author to the research area.

2 About Interoperability

The Institute of Electrical and Electronics Engineers (IEEE) in 1990 defined interoperability as “the ability of two or more systems or components to exchange information and to use the information that has been exchanged” [1]. This is a simple definition which gave way to others that introduce other dimensions to interoperability.

According to ISO/IEC 2382-01, Information Technology Vocabulary, Fundamental Terms, interoperability is defined as follows: "The capability to communicate, execute programs, or transfer data among various functional units in a manner that requires the user to have little or no knowledge of the unique characteristics of those units" [2]. This definition introduces a human factor to interoperability.

Another definition comes from the National Information Standards Organization (NISO) that defines interoperability as “the ability of multiple systems with different hardware and software platforms, data structures, and interfaces to exchange data with minimal loss of content and functionality” [3]. This definition introduces the idea of a lossless communication between information systems.

After all these efforts to define interoperability, a more refined view was proposed in 2005, coming from simulation theory. This new definition considers a 7-tiered model which measures the degree of conceptual alignment between two information systems by looking at the data being transferred between them. This model is called the Levels of Conceptual Interoperability Model (LCIM) [4] and states that semantic interoperability is reached when “the meaning of the data is shared and the content of the information exchange requests are unambiguously defined”.

3 Achieving Interoperability

An information model is an abstract description of how information is represented in a specific context (a business area, an organization or a community), and therefore consists of the definition of the relevant terms, the relations between them, and the vocabularies or rules used to assign values to them. Associated with it is the concept of a data model (schema), which is defined as the representation of an information model in a specific syntax (XML Schema, SQL-DDL, etc.) and/or storage method (file, database, etc.), and which defines the data elements and the arrangements between them.

Looking at these concepts, the definition provided in the last section for semantic interoperability can be reformulated, to mean that semantic interoperability is reached when the information model is shared and the information entities being interchanged are unambiguously defined. This last definition for semantic interoperability is the one adopted in this dissertation. For the sake of simplicity, the remainder of this paper will refer to interoperability to mean only semantic interoperability defined in these terms. This dissertation assumes that a technical level of interoperability is assured between the two systems, meaning that “a communication infrastructure is established and underlying networks and communication protocols are unambiguously defined” as described in [4].

In order to reach this level of interoperability, the two participating processes must find a common understanding of the information being interchanged. In other words, the two processes must share the same information model, or the part of it that is essential for their interaction.

As an example, imagine the following scenario: a travel agency wants to offer a service that allows users to travel to any location in Europe by any kind of transportation (air, land and sea), choosing the cheapest and most efficient transportation possible. To do this, the agency wants to design a process that interacts with web services provided by transportation companies to identify routes, prices, schedules and bookings so as to offer the best possible travel alternative to the user. When evaluating the existing web services of the transportation companies, the agency realizes that although almost all the web services conform to the standard protocols (e.g. SOAP over TCP/IP) and transport formats (e.g. XML), they all use different data formats (e.g. XML documents structured according to different schemas). In this scenario, although the information models behind these different data formats are, in most cases, not the same, they are sufficiently similar for the travel agency to identify the routes, their prices and the existing schedules and bookings. To solve this, the travel agency would need to develop and maintain translations from all these heterogeneous sources of data, which makes it a high-cost solution.

When the information model behind the data being interchanged is not shared by the two participating processes, the information models must be aligned. To achieve this alignment, the two data models (and their intrinsic information models) and the rules modeling the relationships and equivalences between them must be known by the two processes. These rules are commonly called mappings [5] and may be defined within the context of a mapping model. Mappings are particularly important for cross-walking between the information models. By cross-walking, the participating processes can align each other's data and thus use it in an effective way.
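
As a minimal sketch of this idea, assuming two hypothetical flat record formats from the travel agency scenario, a mapping model can be held as data and applied by a generic cross-walking function. The field names, the `Mapping` structure and the `MAPPINGS` table below are illustrative assumptions and are not taken from any standard or from the registry discussed in this paper.

```python
from typing import Any, Callable, Dict, List, NamedTuple


class Mapping(NamedTuple):
    """One rule relating a source element to a target element (hypothetical structure)."""
    source: str                    # element name in the source data model
    target: str                    # element name in the target data model
    convert: Callable[[Any], Any]  # value transformation applied while cross-walking


# Hypothetical mapping model between an airline schema and the agency's own schema.
MAPPINGS: List[Mapping] = [
    Mapping("dep_airport", "origin", str.upper),
    Mapping("arr_airport", "destination", str.upper),
    Mapping("fare_cents", "price_eur", lambda cents: cents / 100),
]


def crosswalk(record: Dict[str, Any], mappings: List[Mapping]) -> Dict[str, Any]:
    """Translate a source record into the target model; unmapped or absent elements are skipped."""
    return {m.target: m.convert(record[m.source]) for m in mappings if m.source in record}


airline_record = {"dep_airport": "lis", "arr_airport": "cdg", "fare_cents": 12950}
agency_record = crosswalk(airline_record, MAPPINGS)
print(agency_record)
```

Because the rules are kept apart from the code that applies them, supporting a further transportation company would, under these assumptions, only require a new mapping table rather than a new translation program.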

The information needed to describe these models (information model and data model) has been called metadata (as opposed to the term “data”, which refers to the data objects representing the instances of the information entities). From this arose the concept of Metadata Registries (MDR), defined as information systems to manage and publish metadata.

An MDR is a central location (e.g. a repository) in an organization where metadata is stored and maintained in a controlled environment. The “registry” part implies that metadata is subject to registration within the Registry. Registration specifies the set of rules, operations, and procedures that apply to an MDR, accomplishing three main goals: identification, provenance, and monitoring quality. Identification is accomplished by assigning a unique identifier (within the registry) to each registered object. Provenance addresses the source of the metadata and the object described. Monitoring quality ensures that the metadata does the job it is designed to do. The registration and administration functions of an MDR are what separate an MDR from a database of metadata.

An MDR provides a unified view and promotes a common understanding of the information managed in an organization. It assists organizations in sharing and exchanging mutually agreed information. It can also promote harmonization, standardization, use, and reuse of the information throughout and between organizations.

An MDR promotes interoperability by using a common reference model (metamodel, or equivalently, a “metadata model”) for the registration of the data model and the context where it should be used. It also registers version information about the data model and its relations, whether these hold between different versions of the same data model or between different data models.
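
The registration functions described above can be illustrated with a small sketch. It is a deliberately simplified, hypothetical registry: the class names, the `MDR-0001` identifier scheme and the two status values are assumptions made for illustration and do not reproduce the ISO/IEC 11179 metamodel.

```python
from dataclasses import dataclass
from itertools import count
from typing import Dict, List, Optional


@dataclass
class RegisteredModel:
    """A registered data model description (hypothetical, simplified)."""
    identifier: str                    # identification: unique within the registry
    name: str
    version: str
    submitter: str                     # provenance: who supplied the metadata
    status: str = "recorded"           # quality monitoring: a simple lifecycle flag
    supersedes: Optional[str] = None   # relation to the previous version, if any


class MetadataRegistry:
    def __init__(self) -> None:
        self._items: Dict[str, RegisteredModel] = {}
        self._sequence = count(1)

    def register(self, name: str, version: str, submitter: str,
                 supersedes: Optional[str] = None) -> str:
        """Register a model and return the identifier assigned by the registry."""
        if supersedes is not None and supersedes not in self._items:
            raise KeyError(f"unknown identifier: {supersedes}")
        identifier = f"MDR-{next(self._sequence):04d}"
        self._items[identifier] = RegisteredModel(
            identifier, name, version, submitter, supersedes=supersedes)
        return identifier

    def approve(self, identifier: str) -> None:
        """Mark a registered model as having passed review."""
        self._items[identifier].status = "approved"

    def get(self, identifier: str) -> RegisteredModel:
        return self._items[identifier]

    def lineage(self, identifier: str) -> List[str]:
        """Follow 'supersedes' links from a version back to the first registered one."""
        versions: List[str] = []
        current: Optional[str] = identifier
        while current is not None:
            item = self._items[current]
            versions.append(item.version)
            current = item.supersedes
        return versions


registry = MetadataRegistry()
first = registry.register("TimetableSchema", "1.0", "rail-operator")
second = registry.register("TimetableSchema", "1.1", "rail-operator", supersedes=first)
registry.approve(second)
print(second, registry.get(second).status, registry.lineage(second))
```

The identifier, submitter and status fields correspond loosely to the three goals of registration, and the `supersedes` link is one possible way to record relations between versions of the same data model.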

4 State of the Art

This section summarizes some of the state-of-the-art solutions to problems of interoperability.

4.1 Data Integration

The problem of interoperability from the data perspective is an established problem within the database management community, where it is known as the Data Integration problem.

Data warehouses are typically assembled from a variety of data sources with different formats and purposes. The process of bringing all the data together in a standard, homogeneous environment is commonly called ETL (Extract, Transform and Load). An ETL process involves extracting data from outside sources, transforming it to fit operational needs (ensuring the quality of the data), and loading it into the end target (a database).
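
The three ETL steps can be sketched as follows, using only the Python standard library. The source data, the column names, the `fares` table and the quality rule (dropping rows without a price) are all assumptions chosen for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical source data; in practice this would be read from an external system.
SOURCE_CSV = """route,price,currency
LIS-CDG, 129.50 ,EUR
LIS-MAD,,EUR
OPO-LHR,89.00,EUR
"""


def extract(text):
    """Extract: read the raw rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: clean values, split the route and enforce a simple quality rule."""
    clean = []
    for row in rows:
        price = row["price"].strip()
        if not price:
            continue  # assumed quality rule: rows without a price are discarded
        origin, destination = row["route"].strip().split("-")
        clean.append((origin, destination, float(price)))
    return clean


def load(rows, connection):
    """Load: write the transformed rows into the target database."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS fares (origin TEXT, destination TEXT, price REAL)")
    connection.executemany("INSERT INTO fares VALUES (?, ?, ?)", rows)
    connection.commit()


connection = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), connection)
loaded = connection.execute(
    "SELECT origin, destination, price FROM fares ORDER BY price").fetchall()
print(loaded)
```

Note that the transformation step encodes knowledge about the source (for instance, that a route is written as two codes joined by a hyphen) that is not stated anywhere in the source format itself.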

The de facto industry standard model for data integration platforms is the Common Warehouse Metamodel (CWM). It is a standard for the interchange of warehouse metadata and is probably the best example to date of applying the MDA paradigm (described in Section 4.3) to an application area. The CWM provides a framework for representing metadata about data sources, data targets, transformations and analysis, as well as the processes and operations that create and manage warehouse data and provide lineage information about its use. The CWM characterizes the ETL process as transformations between models detailing warehouse metadata and uses a specific transformation package (although the CWM is based on MDA, it does not use QVT).

On another front, some ETL tools have started to migrate into Enterprise Application Integration (EAI), or even Enterprise Service Bus, systems that now cover much more than just the extraction, transformation, and loading of data.

Figure 1 – Example of a mapping between two data models using the Microsoft BizTalk Mapper.

4.2 Information Systems Integration

Interoperability is also an established problem in Information Systems, arising when applications are connected within the same organization or across different organizations. The first case is commonly referred to as Enterprise Application Integration (EAI) [6], and is the process of linking together applications within a single organization in order to simplify and automate business processes, while at the same time avoiding sweeping changes to the existing applications or data structures. The second case is commonly known as Business-to-Business (B2B) Integration [7].

In both domains, the integration of applications is normally addressed by creating an environment where business processes are designed and run, and can communicate with services running on different environments. This is accomplished by creating adapters for each service being integrated. Each adapter provides data encoded in a particular data model and relies on transformation facilities to transform between the data models along the business process. A common practice to reduce the number of transformations is to stipulate a common data model to be used along the process. Transformations between data models often involve datatype conversion, mathematical calculations, string manipulation, splitting/merging of data, or even complex algorithms for data recreation.
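
The benefit of stipulating a common data model can be made concrete with a short sketch. With n data models, direct translation needs one transformation per ordered pair, n(n-1) in total, whereas routing through a common model needs only two per data model, 2n in total. The adapter names and record fields below are hypothetical.

```python
from typing import Callable, Dict

Record = Dict[str, object]

# Hypothetical adapters: each converts between one service's model and the common model.
TO_COMMON: Dict[str, Callable[[Record], Record]] = {
    "rail": lambda r: {"origin": r["from"], "destination": r["to"]},
    "air": lambda r: {"origin": r["dep"], "destination": r["arr"]},
}
FROM_COMMON: Dict[str, Callable[[Record], Record]] = {
    "rail": lambda r: {"from": r["origin"], "to": r["destination"]},
    "air": lambda r: {"dep": r["origin"], "arr": r["destination"]},
}


def translate(record: Record, source: str, target: str) -> Record:
    """Translate between any two models by passing through the common model."""
    return FROM_COMMON[target](TO_COMMON[source](record))


def pairwise_count(n: int) -> int:
    """Directed transformations needed when every model maps to every other one."""
    return n * (n - 1)


def hub_count(n: int) -> int:
    """Directed transformations needed when every model maps to and from a common model."""
    return 2 * n


print(translate({"dep": "LIS", "arr": "CDG"}, "air", "rail"))
for n in (3, 5, 10):
    print(n, pairwise_count(n), hub_count(n))
```

The saving only appears once more than three data models are involved; for ten models the count drops from 90 to 20 transformations.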

An interesting feature often provided by these applications is the ability to establish mappings between data models using visual tools. These tools are often very intuitive and simple to use, requiring very little knowledge to define a transformation between data models. The mappings are then used by the transformation facilities to transform data from one data model to another. Figure 1 shows an example of mappings using the Microsoft BizTalk Mapper.

4.3 Model Driven Architecture

The Object Management Group (OMG) is a consortium founded by some of the biggest companies in the industry. The OMG's mission is to help solve integration problems by supplying open, vendor-neutral interoperability specifications.

In 1997, the OMG issued several important specifications that encompass a new vision for solving integration problems, the Model Driven Architecture (MDA). The MDA was developed with the purpose of supporting interoperability with specifications that address integration through the entire systems life cycle: from business modeling to system design, to component construction, to assembly, integration, deployment, management, and evolution. The primary goals of MDA are portability, interoperability, and reusability.

The MDA specification is based on a set of core technologies. The Meta Object Facility (MOF) provides a standard way for describing the models used within the MDA, in other words it defines the metamodel. It also provides the basis for model/metadata interchange and interoperability, and is the mechanism through which models are encoded in the XML Metadata Interchange (XMI). XMI is the standard interchange format used between various tools, repositories and middleware.

Figure 2 – Overview of the three languages included in QVT.

A complementary specification to MOF is Query/View/Transformation (QVT), which defines a standard way of transforming between two models defined in MOF. QVT defines not one but three domain-specific languages, named Relations, Core and Operational Mappings, which are organized in a layered architecture (see Figure 2). Relations and Core are declarative languages at two different levels of abstraction, with a normative mapping between them. The Relations language has a textual and a graphical concrete syntax. The Operational Mappings language is an imperative language that extends both Relations and Core. Its syntax provides constructs commonly found in imperative languages (loops, conditions, etc.). Finally, QVT also includes, as part of the specification, a mechanism called BlackBox for invoking transformation facilities expressed in other languages (XSLT, XQuery). One interesting idea behind QVT is that the transformation program is itself considered a model and, as a consequence, also conforms to a MOF metamodel. More precisely, this means that the abstract syntax of QVT should conform to a MOF metamodel.

Presently there are several products (commercial or open source) that claim compliance with the QVT standard. Some of these products are SmartQVT, Eclipse M2M, OptimalJ, MediniQVT, ModelMorf and ATL.

4.4 Ontologies

The most referenced definition of ontology is the one given by Gruber in [8]: “an ontology is an explicit specification of a conceptualization”, which was slightly modified by Borst [9] to “Ontologies are defined as a formal specification of a shared conceptualization”. Nevertheless, this last definition can be further refined into the definition proposed by the ODM project [10], which is the one adopted in this dissertation:

“An ontology defines the common terms and concepts (meaning) used to describe and represent an area of knowledge. An ontology can range in expressivity from a taxonomy (knowledge with minimal hierarchy or a parent/child structure), to a thesaurus (words and synonyms), to a conceptual model (with more complex knowledge), to a logical theory (with very rich, complex, consistent, and meaningful knowledge)”

The area of artificial intelligence concerned with the representation of ontologies is commonly known as knowledge representation (KR). The fundamental goal of KR is to represent knowledge in a manner that facilitates inference (i.e. drawing conclusions) from that knowledge. To accomplish this, some kind of formal logic is used to supply the semantics of how reasoning is applied to the knowledge.

The most commonly used family of KR languages based on formal logics is that of first-order logic languages. First-order logic is a formal system in which formulas are built from predicates applied to variables that can be quantified. Examples of such languages are the Knowledge Interchange Format (KIF) and, more recently, the Common Logic (CL) developed by ISO, which is intended to be the successor to KIF.

Another relevant family of KR languages is that of frame-based languages (or simply frames) [13], which define a data structure that is roughly similar to the object-oriented paradigm. They represent classes (called frames) with certain properties (called attributes or slots), which can refer to other frames (relationships). Frame-based languages (e.g. KL-ONE [11]) are not based on formal logics, but efforts have been made to integrate them with formal logic-based semantics (e.g. F-Logic).

To overcome the lack of formal logics in some KR languages, a new family of languages was designed as an extension of frames: the description logics (DL). A description logic language describes a domain using concept descriptions that can be translated into first-order predicate logic.
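
As an illustration of this translation, the following block shows the standard first-order reading of two common description logic constructs, concept inclusion and existential restriction, followed by one worked example. The concept and role names (Flight, Route, Airline, operatedBy) are hypothetical and chosen only to match the travel scenario of Section 3.

```latex
% Standard first-order reading of two description logic constructs
\begin{align*}
C \sqsubseteq D \quad &\mapsto \quad \forall x\,\bigl(C(x) \rightarrow D(x)\bigr) \\
(\exists R.C)(x) \quad &\mapsto \quad \exists y\,\bigl(R(x,y) \wedge C(y)\bigr)
\end{align*}
% Worked example with hypothetical concept and role names
\[
\mathit{Flight} \sqsubseteq \mathit{Route} \sqcap \exists\,\mathit{operatedBy}.\mathit{Airline}
\;\mapsto\;
\forall x\,\Bigl(\mathit{Flight}(x) \rightarrow \mathit{Route}(x) \wedge
\exists y\,\bigl(\mathit{operatedBy}(x,y) \wedge \mathit{Airline}(y)\bigr)\Bigr)
\]
```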

Two further families have been used for the representation and interchange of knowledge: semantic networks and topic maps. Their simple structure and expressive power allow them to be used for the representation of other, more complex languages (e.g. DL languages). A semantic network [12] is a directed graph consisting of vertices (which represent concepts) and edges (which represent semantic relations between the concepts). A topic map is a structure expressing the “aboutness” of information using topics (concepts), associations (relations between the concepts) and occurrences (information relevant to a particular topic).
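
A semantic network as defined above can be sketched as a labeled directed graph. The `SemanticNetwork` class, the relation names and the concepts below are hypothetical; the `ancestors` method shows one simple kind of inference, following a chosen relation transitively.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple


class SemanticNetwork:
    """A directed graph whose vertices are concepts and whose edges are labeled relations."""

    def __init__(self) -> None:
        # (subject, relation) -> set of objects reachable by one edge
        self._edges: Dict[Tuple[str, str], Set[str]] = defaultdict(set)

    def add(self, subject: str, relation: str, obj: str) -> None:
        self._edges[(subject, relation)].add(obj)

    def objects(self, subject: str, relation: str) -> Set[str]:
        """Concepts directly related to `subject` through `relation`."""
        return set(self._edges.get((subject, relation), set()))

    def ancestors(self, concept: str, relation: str = "is_a") -> Set[str]:
        """All concepts reachable by following `relation` edges any number of times."""
        seen: Set[str] = set()
        stack = [concept]
        while stack:
            node = stack.pop()
            for parent in self.objects(node, relation):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


net = SemanticNetwork()
net.add("Ferry", "is_a", "Ship")
net.add("Ship", "is_a", "Vehicle")
net.add("Ferry", "operates_on", "Sea")
print(sorted(net.ancestors("Ferry")))
```

Each edge here is a subject, relation and object triple, which is also the basic unit of the RDF model mentioned below.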

Important work forming the foundation of the semantic web has been developed under the World Wide Web Consortium (W3C). It includes the Resource Description Framework (RDF), which is a semantic network; RDF Schema and the Web Ontology Language (OWL), which belong to the description logics family; and, more recently, the Simple Knowledge Organization System (SKOS), which is a language for defining thesaurus-like ontologies for the semantic web.

4.5 Conciliation efforts

In recent years, some efforts have been made to integrate the different approaches to the representation of knowledge coming from the different fields of research. The idea is to allow ontology-based applications to reuse knowledge independently of the language (metamodel) used for its representation.

The most relevant work is currently being done in the context of the MDA, under the name Ontology Definition Metamodel (ODM) [10], which aims at providing a metamodel based on MOF for the most important KR languages (RDF, OWL, TopicMaps and CL). This enables any ontology defined with these languages to be imported into an MDA system and transformed to any other metamodel when required.

4.6 MDR related standards

The ISO/IEC JTC1 SC32 WG2 develops standards for metadata and related technology, namely ISO/IEC 11179, which is the standard that defines the concepts behind MDR, addressing “the semantics of data”, “the representation of data”, and “the registration of the descriptions of that data”.

The ISO/IEC 11179 standard is complemented by other standards, of which the most relevant are: ISO/IEC 20944, Metadata Registries Interoperability and Bindings (MDR-IB), which covers the formalized representation of information, APIs (which specify a binding in programming terms) and protocols (which specify formalized communications); and ISO/IEC 19763, Metamodel Framework for Interoperability (MFI), for evaluating interoperability with other metamodels (e.g. MOF, ontology languages, ODM).

5 Evaluation

This section evaluates the significant problems in the fields of research identified in the previous section.

In ETL tools, and particularly in the CWM, interoperability can be achieved only in the specific context of an integration process and when the data sources come from Database Management Systems (DBMS). Moreover, DBMS models often do not represent all the rules embodied in the information model. This may cause interoperability issues when establishing the transformations for ETL, since a proper understanding of the data being transformed is required. Although the CWM is based on MDA, it does not formally specify the rules between the data model and the information model.

In an EAI scenario, although the integration is made within the same organization, the data model being used may not be understood in the same way throughout that organization. Moreover, the integration process itself is often not shared within the organization, so there is no way to ensure that it is performed correctly. This is even more critical when integration is made between different organizations.

The MDA approach to interoperability relies on the definition of an architecture based on models. Nevertheless, it does not mandate how these models are to be used in an operational scenario. Consequently, when an application receives data encoded in a model different from the one it expects, how will that data be used? Looking at the MDA support for transformation (QVT), it has not been shown whether transformations can actually be applied between instances of two different models. In addition, the expressive power of QVT, particularly of QVT Relations, must be evaluated, as well as the level of knowledge required for a professional to apply it.

Knowledge representation languages, particularly first-order logic and description logics, provide the tools for a formal representation of data and information models. These languages may also be used for the representation of mapping information, even though the extent to which they can represent it still needs to be better evaluated.

Finally, the MDR-related standards offer a good starting point for making data, information and mapping models ubiquitous inside and outside an organization. Nevertheless, this specification created a new metamodel without considering previous research in the field of knowledge representation or the new meta-modeling approaches from the MDA. Although new efforts have been undertaken to align this specification with MOF and ontology languages, some issues remain in the ability of the metamodel to properly represent them. These issues emerge from a comparison between the ISO 11179 metamodel and MOF, which shows that some basic meta-elements are missing in the ISO 11179 metamodel.

6 Problem Statement

The problem being addressed in this dissertation consists of helping Information Systems achieve a level of semantic interoperability when interchanging information between two processes. The assumption for this dissertation is that semantic interoperability is reached when the two participating processes share a common understanding of the information model behind the data being interchanged. This dissertation will focus on the representation and management of data and information models within Metadata Registries, along with both the visual and formal representations of the mappings that may be established between them. The concepts discussed in this dissertation will be applied to processes dealing with bibliographic information contained within libraries, archives and museums.

7 Motivation

The work being developed in the context of this dissertation was motivated by, and is being performed in the scope of, the EuropeanaConnect, EuDML and SHAMAN projects, all partially supported by the European Commission.

8 Research Plan

This dissertation is based on the assumption that metadata registries can be used to achieve interoperability by supporting, in a controlled environment, the storage and maintenance of the metadata required for describing the information, data and mapping models. To achieve this, metadata registries rely on a common reference model (metamodel) for the registration of these models and the context of their use.

Currently, an MDR is being developed in the context of the EuropeanaConnect and EuDML projects, to manage the several data models being used by the different data providers and also to manage the mapping models defined to map between them. These projects will provide a rich set of heterogeneous data models that will allow for a good evaluation of the ability of the existing metamodels to represent the required models. This evaluation will include the MDR family of standards, and will also consider what is being done in the context of ODM, the MDA-related standards (with special attention to MOF), description logic languages such as OWL, and first-order logic.

Looking at the representation of mappings within a mapping model, the work being done in the context of QVT seems relevant, but its expressiveness needs to be better evaluated. Nevertheless, a mapping language with a correspondence to first-order logic seems to be the most promising approach, because of its ability to express a wide range of rules, since it is a Turing-complete language. In addition, first-order logic follows basic human reasoning and thus enables the construction of visual tools that allow non-experts to create mappings, as seen in EAI applications. Another key feature to consider is the ability to plug new functions into the mapping model, as found in the QVT BlackBox.
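
The ability to plug new functions into a mapping model can be sketched as follows. This is not QVT and does not use its BlackBox mechanism; it is a hypothetical design in which mapping rules are plain data (here JSON) that refer to functions by name, and new functions are made available through a small registry. The rule format, the function names and the bibliographic fields are assumptions made for illustration.

```python
import json

# Registry of functions that mapping rules may refer to by name (hypothetical design).
FUNCTIONS = {}


def plugin(name):
    """Decorator that makes a function available to mapping rules under `name`."""
    def decorator(func):
        FUNCTIONS[name] = func
        return func
    return decorator


@plugin("concat")
def concat(*values):
    return " ".join(values)


@plugin("year_of")
def year_of(date):
    return int(date[:4])  # assumes an ISO-style date string such as "1843-09-01"


# Mapping rules stored as data, so that they could be kept in a registry and exchanged.
RULES = json.loads("""
[
  {"target": "creator", "function": "concat", "sources": ["forename", "surname"]},
  {"target": "year", "function": "year_of", "sources": ["issued"]}
]
""")


def apply_rules(record, rules):
    """Build the target record by calling each rule's function on its source values."""
    return {
        rule["target"]: FUNCTIONS[rule["function"]](*(record[s] for s in rule["sources"]))
        for rule in rules
    }


result = apply_rules(
    {"forename": "Ada", "surname": "Lovelace", "issued": "1843-09-01"}, RULES)
print(result)
```

Under this design, adding a new transformation amounts to registering one more function, while the rules themselves remain declarative and could be produced by a visual tool.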

Finally, to complete the circle, this dissertation will work on the specification of new services to allow a proper integration of interoperability functionality into applications. The technologies considered range from WSDL to REST services, and also include UDDI and ebXML for their publication.

9 Contributions to the Problem

The main contributions to the research area will be: first, an evaluation of the existing standards for the representation of models; second, a specification for a mapping model that allows the creation of mappings using simple visual representations; third, a specification of the services needed to integrate this approach to semantic interoperability into applications; and finally, an implementation framework where all these concepts are put together.

10 Acknowledgments

This dissertation has José Borbinha as its advisor and is hosted by the Lisbon Technical University, “IST – Instituto Superior Técnico”, which is the degree granting institution and “INESC-ID – Instituto de Engenharia de Sistemas e Computadores” as the research institute, and is currently being supported by the “FCT – Fundação para a Ciência e Tecnologia” (the Portuguese Foundation for Science and Technology), under the contract SFRH/BD/62520/2009.

References

[1]	Institute of Electrical and Electronics Engineers (1990). IEEE Standard Computer Dictionary: A Compilation of IEEE Standard Computer Glossaries. New York.
[2]	International Organization for Standardization. ISO/IEC 2382-01, Information Technology Vocabulary, Fundamental Terms.

[3]	NISO (National Information Standards Organization). (2004). Understanding metadata. Bethesda, MD: NISO Press. Available: http://www.niso.org/standards/resources/UnderstandingMetadata.pdf.
[4]	Turnitsa, C.D. (2005). Extending the Levels of Conceptual Interoperability Model. Proceedings IEEE Summer Computer Simulation Conference, IEEE CS Press.

[5]	Baca, M., Gill, T., Gilliland, A.J. & Woodley, M.S. (2000). Introduction to metadata: pathway to digital information. Online edition 2.1. Glossary. Available: http://www.getty.edu/research/conducting_research/standards/intrometadata/glossary.html.

[6]	Linthicum, D. S. (2000). Enterprise application integration. Addison-Wesley. ISBN: 0-201-61583-5.

[7]	Bussler, C. (2003). B2B Integration - Concepts and Architecture. Springer. ISBN: 978-3-540-43487-0.

[8]	Gruber, T. R. (1993). A translation approach to portable ontology specifications. Knowledge Acquisition, 5:199–220.

[9]	Borst, W. N. (1997). Construction of Engineering Ontologies. PhD thesis, University of Twente, Enschede.

[10]	Object Management Group (2009). Ontology Definition Metamodel (ODM). Version 1.0.
[11]	Brachman, R.J. and Schmolze, J. (1985). An Overview of the KL-ONE Knowledge Representation System, Cognitive Sci 9(2).

[12]	Shapiro, S. (1987). Semantic Networks. Encyclopedia of Artificial Intelligence. Wiley.
[13]	Minsky, M. (1975). A Framework for Representing Knowledge. Patrick Henry Winston (ed.), The Psychology of Computer Vision. McGraw-Hill, New York (U.S.A.).