TCDL Bulletin
Agents and Ontologies working together to Federate Digital Libraries
Francisco J. Rodríguez Martínez
Abstract
My work presents a general architecture for federating Digital Libraries on the Web, based on ontologies and software agents. Ontologies provide a useful way to integrate different information collections stored under different schemas. The architecture describes several kinds of agents: interface agents, broker agents, DL agents, etc. A prototype will be implemented in JADE using OWL ontologies. Harvesting protocols will also be compared with federation systems.
Introduction
The increasingly large amount of accessible document information (e.g., published literature, scanned images, and microfilms) confronts researchers with the challenge of dealing with a very large volume of data. With the advance of digitisation and other technologies, historic collections in traditional libraries, universities and research departments are producing a huge amount of experimental data, diverse new discoveries, and related publications. Historic research depends on typologies, comparisons, and relationships with information from other documents, libraries, projects/sites, etc.; such research can therefore only be conducted in real time if the systems that support these digital repositories make them interoperable.
Over the past ten years, the Database Lab of the University of A Coruña, Spain, has developed several Digital Libraries as the basis of research on historic and linguistic documents, among others. Currently, there are three thematic Digital Libraries with collections of different historic documents [18]: the Digital Library of the Spanish Emblem Books from the XVIth-XVIIth Centuries, the Digital Library of Emblem Books translated into Spanish, and the Success History Documents from the XVIIIth Century.
A prototype to federate these and other real Digital Libraries (see appendix) will be implemented as a result of the evaluation method described in this thesis.
Motivation
A real problem
The three Digital Libraries discussed in this paper are monolithic. They were built in isolation using different requirements, technologies, interfaces and protocols. They were adapted to meet individual requirements, and they use different schemes to organize and store documents and their metadata. History researchers (the present users) achieve interoperability among these and other digital libraries manually, by querying each interface one by one. The drawbacks of having to query each relevant document database separately make the need for a federated system obvious: a system that integrates a number of document databases (those that agree to be queried in a federated way) under a single user interface reached through a single URL.
Currently, we are investigating the design of an abstract general architecture for an efficient Digital Library federation that provides interoperability services for history researchers. Such a system will allow the user to get information from all the document databases in the federation simply by writing a single query. The query will be redirected to the set of document databases in the federation that are relevant to it. In addition, the system will integrate the answers from each of those digital libraries so that they can be presented to the user through the User Interface. Note that the interface of each Digital Library will also be presented to the user (not only our system interface), because the contents of the databases are the intellectual property of each Library, and their corporate look-and-feel must be preserved. Note also that some Digital Libraries may have to be federated without the full collaboration of their staff (non-cooperative components).
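The query flow described above (one query, routed only to the relevant libraries, with the answers merged for the user) can be sketched roughly as follows. This is an illustrative Python sketch, not the thesis prototype, which is planned in JADE. The names `Record`, `LibraryWrapper` and `federated_search`, the `topics` relevance test, the sample titles and the example.org URLs are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass(frozen=True)
class Record:
    library: str  # which federated library returned the record
    title: str
    url: str      # link back to the library's own interface


@dataclass
class LibraryWrapper:
    name: str
    topics: Set[str]                       # hypothetical coverage description
    search: Callable[[str], List[Record]]  # library-specific search routine


def relevant_libraries(topic: str, federation: List[LibraryWrapper]) -> List[LibraryWrapper]:
    """Select only the libraries that declare coverage of the query topic."""
    return [lib for lib in federation if topic in lib.topics]


def federated_search(topic: str, text: str, federation: List[LibraryWrapper]) -> List[Record]:
    """Forward one query to every relevant library and merge the answers."""
    merged: List[Record] = []
    for lib in relevant_libraries(topic, federation):
        try:
            merged.extend(lib.search(text))
        except Exception:
            # An unavailable or non-cooperative library must not break the query.
            continue
    return sorted(merged, key=lambda r: (r.title.lower(), r.library))


def make_static_search(library: str, titles: List[str]) -> Callable[[str], List[Record]]:
    """Build a toy search routine over a fixed list of placeholder titles."""
    def search(text: str) -> List[Record]:
        return [
            Record(library, title, f"https://example.org/{library}/{i}")
            for i, title in enumerate(titles)
            if text.lower() in title.lower()
        ]
    return search


def broken_search(text: str) -> List[Record]:
    raise ConnectionError("library unreachable")


FEDERATION = [
    LibraryWrapper("emblems-es", {"emblems"},
                   make_static_search("emblems-es", ["Sample emblem book A", "Sample treatise B"])),
    LibraryWrapper("emblems-translated", {"emblems"},
                   make_static_search("emblems-translated", ["Sample emblem book C"])),
    LibraryWrapper("news", {"news"},
                   make_static_search("news", ["Sample emblem pamphlet D"])),
    LibraryWrapper("offline", {"emblems"}, broken_search),
]

results = federated_search("emblems", "emblem", FEDERATION)
for record in results:
    print(record.library, "|", record.title, "|", record.url)
```

Each merged record keeps a link back to its source library, in line with the requirement that every library's own interface remains visible. The relevance test here is a plain set lookup; in the proposed architecture that decision would presumably rest on the ontologies rather than on a fixed topic list.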
Digital Libraries research motivation
The research motivation for this work is based on recognized, current research issues for Digital Libraries. For instance, in June 2001, the DELOS Network of Excellence organized a brainstorming workshop in Italy on "Digital Library Research Directions" [10]. A brainstorming report was published as the result of three days of intensive workshop discussions. From this report we can extract the following bases of our research:
Background
This work draws on several fields of the computing literature. In this section, I discuss previous work on Digital Library architectures, the use of ontologies in federation systems, and related multi-agent systems.
Digital Library Architectures
According to [10], the client-server and 3-tier architectures typically found in current digital library systems are not adequate to provide the functionality required by the high-level vision for the digital libraries of the future. The basic system architecture [39] of future Digital Libraries should instead explore open, component-based and multi-tier architectures. Current Digital Libraries can be studied as distributed open information systems. They enable cooperation and sharing of different resources, and a distributed system should be perceived by its users as a single, integrated computing facility [14][46]. Openness of Digital Libraries implies that they use established and well-defined protocols.
Two concepts arise in current architecture definitions [40]: components and services. Our system has to deal with and implement both. Components are reusable and replaceable units of software (or hardware); a reusable component can be developed as a subpart of different components [9]. A replaceable component is one that can be substituted without modifications to the component or the existing system [32]. "Services", on the other hand, is a more ambiguous term. Service-oriented computing has recently emerged as the new computing paradigm in the networked world. Vinoski [33] defines service-oriented architectures as three-step interaction models. The most typical service that a digital library provides is enabling users to search and retrieve information. Examples of relevant services in digital libraries include search services, browsing facilities, retrieval services, and authorization and access management services.
Some examples of digital library architectures are:
MARIAN.
Multiple Access and Retrieval of Information with AnnotatioNs (MARIAN) is a DL system with an extensible set of services, including browsing, searching, retrieving, automatic collection building, and uniform preservation over networked collections. MARIAN started as an Open Public Access Catalog (OPAC) and later evolved to become a complete, modern DL. MARIAN is a monolithic DL with a layered architecture [15][27].
NDLTD. Networked Digital Library of Theses and Dissertations (NDLTD) is an initiative for creating a DL of theses and dissertations on a world-wide basis [12]. The collections are maintained by individual universities. Every university that wants to make its theses and dissertations globally accessible can implement data providers to expose the related metadata in standard formats (Dublin Core, ETD-MS, etc.). The metadata is then harvested into a Union Catalog, on top of which various services are provided to users of the NDLTD. Some of the services provided include searching, browsing, viewing new items in the collection, reviewing items, etc. [34].
Digital Library architectures have also been explored in many other projects. The University of Michigan Digital Library focused on the use of cooperating agents in a heterogeneous digital library system. A different metaphor was investigated in the Stanford Digital Library Project, through the definition of a set of protocols commonly referred to as the Infobus [1][28]. The Dienst protocol [24] and its descendants like NCSTRL [4] and OpenDLib [11] are other systems for distributed digital libraries, and more recently the Open Archives Initiative [6] has emerged as a possible framework for the sharing and reuse of metadata in a range of applications.
Harvesting or Federation?
There are two common approaches used in Digital Libraries to provide services to users when the content is distributed among different Digital Libraries: harvesting and federation.
In federation, a search request is distributed in turn to the other Digital Libraries. The results are combined and returned to the initiator of the request. Alternatively, in federated search, a client or Digital Library front-end must broadcast queries and collect results directly. Federation has the advantage of distributing the responsibility of searching across various DLs, but it depends on the high availability of the underlying Digital Libraries. In addition, there may be significant latency in generating the search results.
Harvesting, on the other hand, is the process of retrieving the metadata from other Digital Libraries and then providing services on the local copy of the harvested metadata. Such an approach is used in major commercial search engines like Yahoo or Google. Since the search is performed on a local copy of the metadata, results can be returned with very low latency. Moreover, since harvesting is carried out on a periodic basis, it is still possible to search a Digital Library's metadata even when that library is unavailable [34].
Federated systems
Several research systems address similar issues in federating different systems through semantic interoperability and information mediation. Wiederhold [41] proposed the concept of a mediator as a way of formulating the semantic information necessary to integrate heterogeneous sources and to make sense of a collection of potentially incomplete and inconsistent data and inherently incompatible programs.
TSIMMIS PROJECT. The Stanford-IBM Manager of Multiple Information Sources [7] is a system for integrating information. It offers a data model and a common query language that are designed to support the combining of information from many different sources. It also offers tools for automatically generating the components that are needed to build systems for integrating information.
The TSIMMIS architecture uses the mediator architecture [42], and the principal components are as described below. A lightweight object model called OEM (Object Exchange Model) and a common query language called LOREL (Lightweight Object Repository Language) are used to link components and to query substructures in OEM objects.
SIMS PROJECT. The goal of the SIMS project [2] is to provide intelligent access to heterogeneous distributed information sources, while isolating users and application programs from the need to be aware of the location of the sources, their query languages, organizations, etc. A model of the application domain is created using a knowledge representation system to establish a fixed vocabulary describing objects in the domain, their attributes, and the relationships among them. Queries to SIMS are written in a high-level uniform language that is independent of the information sources. Queries do not need to contain information describing which sources are relevant, where they are located, or how the information obtained from different sources should be combined or manipulated.
OBSERVER SYSTEM. OBSERVER [29] is an instantiation of the abstract metadata-based architecture for information brokering. Brokering is performed at the level of information content and at the level of vocabulary. The components are: a Query Processor, which represents the broker vocabulary identified in the abstract metadata-based architecture; and an Inter-ontology Relationships Manager and a Metadata System, responsible for enabling brokering at the information content level.
INFOSLEUTH SYSTEM. InfoSleuth [5] is an agent-based system for information gathering and analysis. InfoSleuth uses domain-specific ontologies as the basis for brokering across multiple information sources containing information that can be described by a common domain ontology.
Harvesting systems
The OAI-PMH uses harvesting as its approach.
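To make the harvesting approach concrete: OAI-PMH requests are ordinary HTTP requests whose responses are XML documents. The sketch below builds a `ListRecords` request URL and extracts identifiers and Dublin Core titles from a response, so that later searches run on the local copy. The repository address `http://example.org/oai`, the sample response and the helper names are assumptions for illustration. A real harvester would also need to handle resumption tokens and error responses, which are omitted here.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional
from urllib.parse import urlencode

# XML namespaces used by OAI-PMH 2.0 responses and by Dublin Core elements.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc",
                     from_date: Optional[str] = None) -> str:
    """Build the URL of a ListRecords request (optionally incremental)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        params["from"] = from_date
    return base_url + "?" + urlencode(params)


# Hand-written sample response; a real one would come from the HTTP request.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample emblem book</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <record>
      <header><identifier>oai:example.org:2</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample news pamphlet</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def parse_records(xml_text: str) -> Dict[str, str]:
    """Map each record identifier to its Dublin Core title."""
    root = ET.fromstring(xml_text)
    index: Dict[str, str] = {}
    for record in root.findall(".//oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        title = record.findtext(".//dc:title", default="", namespaces=NS)
        index[identifier] = title
    return index


def search_local(index: Dict[str, str], word: str) -> List[str]:
    """Search the harvested copy; no remote library is contacted."""
    return sorted(i for i, title in index.items() if word.lower() in title.lower())


local_index = parse_records(SAMPLE_RESPONSE)
print(list_records_url("http://example.org/oai", from_date="2004-01-01"))
print(search_local(local_index, "emblem"))
```

Because `search_local` only reads the harvested dictionary, it keeps working when the source repository is down, which is the availability argument made for harvesting.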
The assumption is that, although harvesting places an increased burden on the service provider, this can be overcome by increasing the service provider's resources, thereby leveraging the other advantages of the harvesting approach, such as increased availability and low latency.
The Open Archives Initiative develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content [6]. Early efforts to achieve interoperability amongst DLs led to the creation of custom solutions such as the Dienst system [24] and Z39.50 [3]. However, end users were faced with multiple search interfaces, making resource discovery harder, and there was no machine-based way of sharing metadata. In response to this problem, the Santa Fe meeting led to substantial support for the OAI, which promotes interoperability by developing OAI-PMH as an open standard. OAI-PMH is a low-cost mechanism for harvesting metadata records from one system to another [24]. The underlying technologies are HTTP (Hyper-Text Transfer Protocol) and XML: OAI-PMH uses HTTP for sending requests and receiving XML responses.
ODL (Open Digital Libraries) is a network of extended Open Archives that work together to supply the services required by information seekers [34]. CITIDEL (Computing and Information Technology Interactive Education Digital Library) [8] is part of the National Science Foundation's (NSF) National Science Digital Library (NSDL) program and is a Digital Library with collections and resources on computer science. It provides services useful to different "societies". Some of the major services that CITIDEL provides are searching, browsing using multiple classification schemes, binding, annotation, etc. In CITIDEL many services are implemented using ODL components.
Ontologies
Gruber [16] provides one of the most cited definitions of ontology as "a formal, explicit specification of a shared conceptualization".
A conceptualization is an abstract, simplified view of the world. Ontologies are designed to enable knowledge sharing and reuse. In that context, an ontology is a specification used for making ontological commitments. An ontological commitment is an agreement to use a vocabulary in a way that is consistent with the theory specified by an ontology. Several other definitions of ontologies can be found. They range from a formal logic perspective at one end to a simple vocabulary of terms at the other, so there is clearly a diversity of opinion about what an ontology is.
Taxonomic classification is frequently used [29]. Labrou and Finin [23] support the idea that Internet directories like the Open Directory and Yahoo! can also serve as important taxonomic references for communication among different agents [45]. Nicola Guarino argues that there needs to be a formal approach to designing and testing taxonomies [31]. The rationale is that errors in the design of the backbone taxonomy will create multiple problems when trying to use these ontologies later. Guarino introduces identity, rigidity, and unity, which describe properties of taxonomic concepts and can be attached to those concepts. Examples of existing ontologies or sources of knowledge are: the Dublin Core Metadata Initiative, WordNet (and EuroWordnet) [13], the Open Directory and OpenCyc.
Finally, the Semantic Web initiative will provide users with a set of standard mark-up languages for expressing ontologies of various levels of complexity. Any organization could pick up a basic taxonomy of terms, cross-link it, and enrich it with axioms. The ontology layer is the central and most important piece in the Semantic Web framework. The goal is to provide a language that can be used by applications that need to understand the content of information, instead of just understanding the human-readable presentation of content.
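As a rough illustration of why a shared vocabulary helps when collections use different schemas, the sketch below maps the field names of two libraries onto three Dublin Core element names (`title`, `creator`, `date`). The library names and local field names are hypothetical. A flat term mapping like this is far simpler than an OWL ontology, which would also capture class hierarchies and axioms.

```python
from typing import Dict

# Hypothetical local schemas: each library names the same notions differently.
# The shared terms on the right are Dublin Core element names.
MAPPINGS: Dict[str, Dict[str, str]] = {
    "library_a": {"titulo": "title", "autor": "creator", "anno": "date"},
    "library_b": {"work_title": "title", "written_by": "creator", "printed": "date"},
}


def to_shared(library: str, record: Dict[str, str]) -> Dict[str, str]:
    """Rewrite a local record using the shared vocabulary.

    Fields without a mapping are dropped, since nothing is known about them.
    """
    mapping = MAPPINGS[library]
    return {mapping[field]: value for field, value in record.items() if field in mapping}


def translate_query(library: str, shared_query: Dict[str, str]) -> Dict[str, str]:
    """Rewrite a query phrased in shared terms into one library's local fields."""
    inverse = {shared: local for local, shared in MAPPINGS[library].items()}
    return {inverse[term]: value for term, value in shared_query.items() if term in inverse}


record_a = to_shared("library_a", {"titulo": "Sample emblem book", "autor": "Anon.", "signatura": "X-1"})
record_b = to_shared("library_b", {"work_title": "Sample emblem book", "printed": "1600"})
query_b = translate_query("library_b", {"creator": "Anon."})

print(record_a)
print(record_b)
print(query_b)
```

Once both records are expressed in the same terms they can be compared or merged directly, and one query written against the shared terms can be rewritten for each library in turn.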
Two major languages can be found at this level: DAML+OIL [19] and the recently published W3C Web Ontology Language. The Web Ontology Language (OWL) is being designed by the W3C Web Ontology Working Group and can be considered the successor of DAML+OIL. In this work OWL will be used as the basis for ontology definition.
Agents
The Agent concept is a powerful abstraction tool in software development, which facilitates the construction of distributed, intelligent and robust systems. Agents are defined by the attributes they must exhibit: autonomy, reactivity, social ability and proactivity, besides other aspects such as rationality, mobility, etc. The solution to a complex problem (like digital library federation) cannot be provided by a single agent, but by a set of agents that interact and cooperate in order to achieve the system objectives. In our architecture several kinds of agents have been described: interface agents, broker agents, DL agents, etc. The use of several agents naturally represents system decentralization and stresses communication in managing dependencies.
Agents represent an important advance in the level of abstraction in the software development process. They may be used for the design and implementation of complex (distributed) software systems [21]. There is currently no universal definition of the Agent concept, but a widely used one states that an intelligent agent is a computer system capable of flexible, autonomous action in some environment; in this definition, flexible means that the agent is reactive, pro-active and social [43].
Intelligent Information Agents can be defined as computational software systems that access multiple heterogeneous and geographically distributed information sources in order to help users in the process of searching for relevant information [22]. These agents provide transparent access to many different information sources, simplifying the problem of managing a large amount of data.
This implies that agents must interpret the user request, analyse it and translate it correctly for each source, retrieve the desired information, integrate the results and return them to the user. Preferably the entire process should be done in a just-in-time fashion.
Information Agents may be classified according to different criteria. Depending on their ability to co-operate with each other in the execution of tasks, agents may be non-cooperative or cooperative. This aspect distinguishes two kinds of systems: those formed by a single agent and those composed of a society of interacting agents. Agents of either type may also be classified as rational, which are utilitarian in an economic sense; adaptive, which are able to adapt themselves to changes in the environment; or mobile, which travel autonomously throughout the Internet [22].
RETSINA is an open multi-agent system that performs goal-directed information retrieval and integration in support of a variety of decision-making tasks [35]. The process of mediation in RETSINA basically relies on service matchmaking. The specification of capabilities and services is based on an Agent Capability Description Language (ACDL) called LARKS (Language for Advertisement and Request for Knowledge Sharing). Application domain knowledge in agent advertisements and requests can currently be specified as local ontologies written in a specific concept language, ITL, or by using WordNet.
InfoSleuth (also described under federated systems) is an agent-based system that can be configured to perform many different information management activities in a distributed environment [5]. It is composed of seven kinds of software agents that together provide a number of complex query services, which involve resolving ontology-based queries over dynamically changing resources.
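Service matchmaking of the kind these systems rely on can be sketched in a few lines: provider agents advertise what they can do, and a middle agent matches requests against those advertisements. The Python below is a minimal sketch under simplifying assumptions. The class and agent names are hypothetical, capabilities are reduced to a service name plus a set of concept labels, and matching is plain set inclusion, which is much cruder than a capability description language such as LARKS.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Advertisement:
    agent: str                # name of the provider agent
    service: str              # kind of service offered, e.g. "search"
    concepts: FrozenSet[str]  # ontology concepts the agent can answer about


class Broker:
    """Middle agent that stores advertisements and matches requests to them."""

    def __init__(self) -> None:
        self._advertisements: List[Advertisement] = []

    def advertise(self, advertisement: Advertisement) -> None:
        self._advertisements.append(advertisement)

    def match(self, service: str, concepts: Iterable[str]) -> List[str]:
        """Return the agents offering `service` that cover every requested concept."""
        wanted = frozenset(concepts)
        return sorted(
            ad.agent
            for ad in self._advertisements
            if ad.service == service and wanted <= ad.concepts
        )


broker = Broker()
broker.advertise(Advertisement("dl-agent-emblems", "search", frozenset({"Emblem", "Author"})))
broker.advertise(Advertisement("dl-agent-news", "search", frozenset({"NewsPamphlet", "Author"})))
broker.advertise(Advertisement("dl-agent-images", "browse", frozenset({"Emblem"})))

print(broker.match("search", ["Author"]))
print(broker.match("search", ["Emblem", "Author"]))
```

A request for search over `Author` matches both search agents, while adding `Emblem` narrows the match to one. An ontology-aware matchmaker could additionally accept an agent that advertises a more general or more specific concept than the one requested.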
The InfoSleuth architecture uses a process of matchmaking: the broker agent maintains a knowledge base with information about all the agents in the system and their capabilities, and uses this knowledge to match required services with provider agents.
Java Agent DEvelopment Framework (JADE) is an agent-oriented tool, implemented in Java and FIPA-compliant. It is composed of an agent platform (execution environment) and a set of packages that provide the basic support for constructing multi-agent systems. In our system JADE will be used as the agent development framework.
Research questions and methodology
We investigate how ontologies and agents can also be used as an infrastructure for developing applications that intelligently support a historic researcher (our user) in various tasks. In particular, the following key questions will be answered:
Objectives
The major contributions of this thesis will be:
References
[1] Andreas Paepcke, Michelle Q. Wang Baldonado, Chen-Chuan K. Chang, Steve Cousins, and Hector Garcia-Molina. Using Distributed Objects to Build the Stanford Digital Library Infobus. Computer, 32(2):80-87, 1999.
[2] Arens Y., Chee C., Hsu C. and Knoblock C. "Retrieving and Integrating Data from Multiple Information Sources". International Journal of Intelligent and Cooperative Information Systems, 1993.
[3] ANSI/NISO. Information Retrieval (Z39.50): Application Service Definition and Protocol Specification (ANSI/NISO Z39.50-1995). Bethesda, MD, NISO Press, 1995.
[4] Barry M. Leiner. The NCSTRL Approach to Open Architecture for the Confederated Digital Library. D-Lib Magazine, December 1998. <http://www.dlib.org/dlib/december98/leiner/12leiner.html>.
[5] Bayardo, R. J. et al. (1997). Infosleuth: Agent-based semantic integration of information in open and dynamic environments. In ACM SIGMOD, pp. 195-206.
[6] Carl Lagoze and Herbert Van de Sompel. The Open Archives Initiative: Building a Low-Barrier Interoperability Framework. In Proceedings of the first ACM/IEEE-CS Joint Conference on Digital Libraries, pages 54-62. ACM Press, 2001.
[7] Chawathe S., Garcia-Molina H., Hammer J., Ireland K., Papakonstantinou Y., Ullman J. and Widom J. "The TSIMMIS Project: Integration of Heterogeneous Information Sources", Stanford University, 1994.
[8] CITIDEL (Computing and Information Technology Interactive Education Digital Library), 2004. <http://www.citidel.org>.
[9] Clemens Szyperski. Component Software: Beyond Object-Oriented Programming. ACM Press / Addison Wesley, 1999.
[10] Digital Libraries: Future Research Directions for a European Research Programme. Workshop report, ERCIM, 2002.
[11] Donatella Castelli and Pasquale Pagano. OpenDLib: A Digital Library Service System. In Maristella Agosti and Constantino Thanos, editors, Proceedings of the 6th European Conference on Research and Advanced Technology for Digital Libraries, ECDL 2002, number 2458 in Lecture Notes in Computer Science, pages 292-308. Springer-Verlag, 2002.
[12] Fox, E. A., Eaton, J. L., McMillan, G., Kipp, N. A., Mather, P., McGonigle, T., Schweiker, W., and Devane, B. "Networked Digital Library of Theses and Dissertations: an international effort unlocking university resources", in D-Lib Magazine, 1997. <http://www.dlib.org/dlib/september97/theses/09fox.html>.
[13] G. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. Miller. Introduction to wordnet: An on-line lexical database. International Journal of Lexicography, 3:245-264, 1990.
[14] George Coulouris, Jean Dollimore, and Tim Kindberg. Distributed Systems: Concepts and Design. Addison-Wesley, 1998.
[15] Goncalves, M. A., Mather, P., Wang, J., Zhou, Y., Luo, M., Richardson, R., Shen, R., Xu, L., and Fox, E. A. Java MARIAN: From an OPAC to a Modern Digital Library System. Presented at the 9th String Processing and Information Retrieval Symposium (SPIRE 2002), Lisbon, Portugal, 2002.
[16] Gruber, T. (1994). Toward Principles for the Design of Ontologies Used for Knowledge Sharing. In International Journal on Human Computer Systems, 43 (5/6), pp. 907-928.
[17] History and development of OAI-PMH, 2004. <http://www.oaforum.org/tutorial/english/page2.htm>.
[18] <http://rosalia.dc.fi.udc.es/emblemas>.
[19] J. Hendler and D. McGuinness. The DARPA agent markup language. IEEE Intelligent Systems, 15(6):72-73, Nov./Dec. 2000.
[20] J. Hendler. Agents and the Semantic Web. IEEE Intelligent Systems, 16(2), 2001.
[21] Jennings, N.R. (1999). Agent-Based Computing: Promise and Perils. Proceedings of the 16th International Joint Conference on Artificial Intelligence (IJCAI-99), pp. 1429-1436.
[22] Klusch, M. (1999). Intelligent Information Agents - Agent-Based Information Discovery and Management on the Internet. Berlin: Springer Verlag.
[23] Labrou, Y., Finin, T. & Peng, Y. (1999). Agent Communication Languages: The Current Landscape. In IEEE Intelligent Systems, pp. 45-52.
[24] Lagoze, C. and Davis, J. R. Dienst - An Architecture for Distributed Document Libraries, in Communications of the ACM, vol. 38(4), pp. 47, 1995.
[25] Lagoze, C., Sompel, H. V. d., Nelson, M., and Warner, S. The Open Archives Initiative Protocol for Metadata Harvesting - Version 2.0, Open Archives Initiative, 2002. <http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm>.
[26] M. Erdmann and R. Studer. Ontologies as conceptual models for XML documents. In Proceedings of the 12th Workshop for Knowledge Acquisition, Modeling and Management (KAW'99), Banff, Canada, October 1999.
[27] MARIAN Digital Library Information System, 2004. <http://www.dlib.vt.edu/products/marian.html>.
[28] Martin Roscheisen, Michelle Baldonado, Kevin Chang, Luis Gravano, Steven Ketchpel, and Andreas Paepcke. The Stanford InfoBus and Its Service Layers: Augmenting the Internet with Higher-Level Information Management Protocols. Technical report, Stanford, 1997.
[29] Mena E., Kashyap V., Sheth A. and Illarramendi A. "OBSERVER: An approach for query processing in global information systems based on interoperation across pre-existing ontologies". Proceedings of the First IFCIS International Conference on Cooperative Information Systems, 1996.
[30] N. Guarino and C. Welty. Evaluating ontological decisions with ontoclean. Communications of the ACM, 45(2), February 2002.
[31] N. Guarino and C. Welty. A formal ontology of properties. In R. Dieng and O. Corby, editors, 14th European Conference on Artificial Intelligence, 187.
[32] Robert C. Seacord. Replaceable Components and the Service Provider Interface. In J. Dean and A. Gravel, editors, Proceedings of the First International Conference on COTS-Based Software Systems, ICCBSS 2002, number 2255 in Lecture Notes in Computer Science, pages 222-233. Springer-Verlag, February 4-6 2002.
[33] Steve Vinoski. Web Services Interaction Models: Current Practice. IEEE Internet Computing, 6(3):89-91, June 2002.
[34] Suleman, H. Open Digital Libraries. Ph.D. Dissertation. Department of Computer Science, Virginia Polytechnic Institute and State University, Blacksburg, VA, 2002. <http://scholar.lib.vt.edu/theses/available/etd-11222002-155624/>.
[35] Sycara, K., Paolucci, M., van Velsen, M., & Giampapa, J. (2001). The RETSINA MAS Infrastructure. Robotics Institute Technical Report # CMU-RI-TR-01-05.
[36] T. Berners-Lee, J. Hendler, and O. Lassila. The Semantic Web. Scientific American, pages 28-37, May 2001.
[37] T. Gruber. Towards Principles for the Design of Ontologies Used for Knowledge Sharing. In N. Guarino and R. Poli, editors, Formal Ontology in Conceptual Analysis and Knowledge Representation, 1993. Kluwer Academic Publishers.
[38] W3C. Extensible Markup Language (XML) 1.0, 2000. <http://www.w3.org/TR/2000/REC-xml-20001006> (as of April, 2004).
[39] William Y. Arms, Christophe Blanchi, and Edward A. Overly. An Architecture for Information in Digital Libraries. D-Lib Magazine, 3(2), February 1997. <http://www.dlib.org/dlib/february97/cnri/02arms1.html>.
[40] William Y. Arms. Key Concepts in the Architecture of the Digital Library. D-Lib Magazine, July 1995. <http://www.dlib.org/dlib/July95/07arms.html>.
[41] Wiederhold G. "Mediators in the Architecture of Future Information Systems". IEEE Computer, Vol.25 No.3, March 1992, pages 38-49.
[42] Wiederhold G., "Interoperation, Mediation, and Ontologies". Stanford University. 1994.
[43] Wooldridge, M & Jennings, N.R. (1995). Intelligent Agents: Theory and Practice. The Knowledge Engineering Review, 10(2), pp. 115-152.
[44] Workshop on Applications of Ontologies and Problem-Solving Methods, pages 97, 112, August 2000.
[45] Y. Labrou and T. Finin. Yahoo! as an ontology: Using yahoo! Categories to describe documents. In Proceedings of the Eight International Conference of Information Knowledge Management (CIKM 1999), pages 180-187, USA, October 1999.
[46] Zahir Tari and Omran Bukhres. Fundamentals of Distributed Object Systems: The CORBA Perspective. Wiley Series on Parallel and Distributed Computing. John Wiley & Sons, 2001.
© Copyright 2006 Francisco J. Rodríguez Martínez