Zhang, "XML Retrieval for Digital Archives", TCDL Bulletin 6.1 (2010)

XML Retrieval for Digital Archives

Junte Zhang

Archives and Information Studies
University of Amsterdam
j.zhang@uva.nl

Abstract

Nowadays, digital archival access is facilitated by deploying archival finding aids in the metadata standard Encoded Archival Description (EAD) in XML. Studies in XML Information Retrieval (XML IR) have shown that structural cues in XML can be used to provide focused access to information by retrieving not the full-text document but only the requested part of it. This PhD project aims to understand how (online) information access to archival materials works, and to empirically evaluate the effectiveness of this access in order to improve it, in particular the value of focused access using the structural cues given in EAD. This document gives a high-level outline of the research, starting with the research objectives and contributions. The background is illustrated by relevant archival and XML IR literature, which also delineates the scope of the research. This leads to the results so far and an outlook on future work.

Introduction

Cultural heritage (CH) information from libraries, museums and archives can increasingly be found online. In the past, physical CH materials, such as books, paintings or personal letters, were cataloged in paper finding aids by curators to allow for access and use. For example, a user looking for a personal letter in the archive of a historical person had to travel to that archive and find the letter by consulting paper finding aids and the archivist. Nowadays, with the advent of digital finding aids that provide online access to these physical artifacts, records are also found using a computer. Archivists create digital finding aids to help users find and understand the archival information that they need. Many of these finding aids are converted from paper resources, but an increasing number are born digital. The premise is that this is a major benefit for users, as CH materials can potentially be disclosed more effectively, in terms of both time and effort, using these digital assets. How to use these resources effectively remains an open question.

The international standard Encoded Archival Description (EAD, [21]) is used to encode archival descriptions (metadata) about materials in digital finding aids. EAD is compliant with the General International Standard Archival Description (ISAD(G), [10]). Archives, as well as manuscript libraries and museums, are expanding their digital resources by adopting EAD in eXtensible Markup Language (XML, [4]) to describe their assets, and publishing these digital assets online. This means that structural CH information in XML can be exploited on the Web by web services that could facilitate improved information access for users. It also means that information in archival records can be found online, just a few clicks away.

That is the context of this research. Research positioned at the intersection of archival science and computer science is scant. However, this does not mean that the two research fields are disjoint. This paper outlines a dissertation project positioned at the intersection of both domains. The research objectives and contributions are outlined first. The relevant related work within the proposed scope is then discussed, providing the context of the research objectives and contributions. Finally, the (preliminary) results are presented, and an outlook on the remaining research and future prospects is provided.

Research Objectives and Contributions

For searching and gaining access to archival materials using archival finding aids formatted in EAD/XML, XML Retrieval [17] can be used, including the traditional empirical and experimental evaluation methodology to determine the quality of this access. From the archival point of view, published work using this research methodology to evaluate archival access is very scant [9]. Several approaches have been developed to gain access to logically structured documents in XML Retrieval. The conjecture in XML Retrieval is that the less user interaction is needed to achieve the same results, the better. Is this proposition also valid when dealing with archival finding aids? These documents can be extremely lengthy, with thousands of pages or more, and are logically but complexly structured, with theoretically unlimited depth of nesting. This leads to the main research objective.

Study effective retrieval techniques tailored to XML retrieval of archival finding aids in EAD, taking into account different users and contexts, and the textual context of the unit to return.

It has to be empirically established whether XML Retrieval techniques can fruitfully be applied to archival finding aids. If they can, the effectiveness of applying these techniques needs to be investigated in order to answer whether they are also fruitful for improving archival access in particular; this is an intricate open question. This leads to the following research contributions, which serve the main research objective and are expected to be:

(a) an overview of the current state of the art in digital information access to archival materials.
(b) an analysis on the evaluation of the different retrieval approaches in (a).
(c) a formalization based on the findings of (a) and (b) that results in the design and implementation of an archival IR system.
(d) an extension of (c) by employing the system to aggregate contextual information and enrich the metadata with link detection techniques.
(e) a thorough evaluation of the system that is built in (c) and its extension in (d), specifically
1. experimental designs for (c) and (d).
2. the evaluation measures for (c) and (d).
(f) the engineering of a re-usable test collection for (e), based on real-world e-mail requests and search log files, collected over multiple years, and obtained from the National Archives of the Netherlands.

The project is set up modularly with a few dependencies, notably the system building in (c) and the test collection building in (f). The project is (quantitatively) experimental and empirical in research methodology, based on re-using established theoretical work, and involves system building. Empirical evaluation as established in IR will complement the philosophical ideas as expressed by archivists and librarians and subsequently projected in an information representation model as EAD.

The information retrieval effectiveness of using EAD for XML Retrieval is evaluated standalone by (automatically) constructing a novel archival IR test collection (intrinsic evaluation). Besides evaluating the system in its natural archival environment, it will also be evaluated extrinsically using other datasets and domains, notably, test collections built by international evaluation initiatives, such as INEX (extrinsic evaluation). Moreover, the system will be adapted to different user types, and the effectiveness for archival access will be evaluated online with users.

Related Work

The research builds on existing, established work, with these main pillars:

Archival theory has been heavily swayed by the 19th-century view of organizations, which established the importance of a mono-hierarchical structure in which units are subordinate to no more than one higher unit [2]. Moreover, classical organizational theory assumes that an organization is autonomous and sovereign [2]. Archives have also adopted this view, as described in the first manual on archiving, published by the Dutch Association of Archivists in 1898. Due to this manual, the concept of provenance became an accepted notion among archivists, with two notable principles. First, the creator-centric Principle of Provenance (see conceptualization in Fig. 1(a)); this principle is concerned with the origins of records, and can refer to the archival creator, to information about the history of the ownership or custody of a record, or notably, to the idea that the archive of a person, family or corporate body cannot be mixed with the records of other records creators. Second, the corollary document-centric Principle of the Sanctity of Original Order (see conceptualization in Fig. 1(b)), which states that records should be kept in the order in which they were originally arranged [13]. Additionally, archival descriptions proceed from the broader to the narrower (more detailed) levels of description using the hierarchical structure, which means that the narrower levels of description inherit the features of the broader levels of description (see Fig. 1(c)).


(a) Provenance.	(b) Original Order.	(c) Inheritance.
Fig. 1. Archival principles conceptualized as retrieval approaches.

Traditionally, Archival Metadata Retrieval Systems (indirectly) used the Provenance method (Bearman and Lytle [2], [19]). For subject retrieval, the queries are matched with the set of words in the provenance information, such as administrative histories or biographies. Archival metadata retrieval can also be facilitated by examining the content of the files, an approach derived from librarianship. The queries are matched with terms from an index or (card) catalog; this approach is item-oriented to the virtual exclusion of the provenance information [19]. Retrieval of archives is also possible by using contextual information that exists externally, or by using the internal structure between finding aids. Context can refer to the relationship between different components and items in a collection as defined by the archival creator, which makes an archive especially valuable; the alternative would be a loss of quality [29], to wit: "Reliable information becomes unreliable information, high quality information degenerates to information of poorer quality; archives degenerate to documentary collections, evidence turns into documentation, documents into loose data."

Published research that empirically or experimentally deals with the Evaluation of Archival Metadata Retrieval is scant [9]. Experiments that examine the retrieval performance potential of archival finding aids specifically in EAD are almost non-existent. The first study in the archival field that empirically tested different subject retrieval methods was [19]. Subsequently, a few studies tested the effects of some external context knowledge on retrieval, such as controlled vocabulary terms [23] or document-collection granularity [9]. The retrieval of online archival finding aids (not in EAD) was examined by Feeney [7], who counted the number of finding aids returned by Web search engines using different types of query reformulations, i.e. keyword, phrase, and Boolean searches with topical subject and name headings as queries. The retrieval experiments of Tibbo and Meho [2001] on finding aids as full-text HTML documents on the World Wide Web pointed to the effectiveness of phrases for the retrieval of finding aids in six IR systems. However, they specifically did not investigate EAD, either in SGML or XML, as finding aids in either format were not indexed by the search engines, and thus not visible.

Many ideas on Evaluation in IR can be traced back to the work process of a librarian working with card indexes using library classification schemes [24]. The Cranfield experiments were run, and the methodology for IR experimentation was defined, in the 1950s [24]. Much of the experimentation was focused on building the ideal 'test collection', i.e. compiling the set of documents that was considered relevant and had to be returned by the system [14]. A notable collaborative ('pooling') example was and is the Text REtrieval Conference (TREC) [24]. Other examples of test collections are the Cystic Fibrosis database [27] for a specialized closed domain (biology) and WT10g [1] for the Web in general. IR experimentation by re-using existing test collections is also possible [24]; however, such a collection does not yet exist in the archival domain.

Archival Authority Records can be used to move beyond the strictly hierarchical organization of archival descriptions, and to share interrelated information between archives, as well as museums and libraries, using these records of creators [22]. Earlier, Bearman and Lytle [2] had already proposed moving beyond the mono-hierarchical representation towards a poly-hierarchical or network model. Encoded Archival Context: Corporate Bodies, Persons, and Families (EAC-CPF) is an ongoing initiative to design and implement a standard for encoding detailed descriptions of record creators [22]. It is based on the International Standard Archival Authority Record for Corporate Bodies, Persons, and Families (ISAAR(CPF), [11]). The Linking and Exploring Authority Files (LEAF) project has set out to provide a framework for international, collaborative work in the sector of authority data with respect to authority control [15].

Results and Prospectives

README Prototype System

A preliminary result is an implementation of our XML IR system using archival finding aids: README (Retrieving Encoded Archival Descriptions More Effectively), an online archival information system that retrieves information within archives by exploiting their granularity and structure in XML. The system is based on MonetDB with the XQuery front-end Pathfinder [3] and the information retrieval module PF/Tijah [8]. All finding aids are indexed into a single main-memory XML database, because commonly used relational databases do not provide a perfect solution for storing this type of data. XML databases such as MonetDB have been developed for this purpose, and allow for powerful XQuery querying of the information using the richness of the XML structure as expressed in EAD.

We have identified and implemented three approaches that query the index at different granularities, namely the whole archive as one unit, any element in the archive as a unit, and an aggregation-based approach that is a mix of the former two [37]. The XML database is queried using XQuery extended with Narrowed Extended XPath I (NEXI) [31]. First, for whole-archive ranking, the root element (thus the full-text document) is provided as the target element; see Fig. 2(a). Second, for any-element relevance ranking, no structural hint in the form of a target element is provided, hence any EAD element can be retrieved, identified by its absolute XPath, such as /EAD[1]/ARCHDESC[1]/DSC[2]/C01[4]/C02[8]/DID[3]/UNITTITLE[1]; see Fig. 2(b). Third, as explained in [33], the aggregation-based approach optimally utilizes the context of the archive; see Fig. 2(c).

(a) Document retrieval.

(b) Element retrieval.

(c) Aggregation-based retrieval.

Fig. 2. README with query "koude oorlog" (cold war).
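
The three granularities described above can be illustrated with a minimal Python sketch. It is not the README implementation, which runs on MonetDB/XQuery with PF/Tijah and NEXI; the toy finding aid, the length-normalised term-frequency score, and the way the aggregation mixes document and element results are all assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Toy finding aid: element names follow EAD, the content is invented.
TOY_EAD = """
<ead>
  <archdesc>
    <did><unittitle>Archive of the Ministry of Foreign Affairs</unittitle></did>
    <dsc>
      <c01><did><unittitle>Correspondence on trade</unittitle></did></c01>
      <c01>
        <did><unittitle>Reports on the cold war</unittitle></did>
        <c02><did><unittitle>Cold war briefings 1950</unittitle></did></c02>
      </c01>
    </dsc>
  </archdesc>
</ead>
""".strip()


def walk(elem, path):
    """Yield (absolute_xpath, element) for elem and all of its descendants."""
    yield path, elem
    seen = Counter()
    for child in elem:
        seen[child.tag] += 1
        yield from walk(child, f"{path}/{child.tag}[{seen[child.tag]}]")


def score(elem, terms):
    """Assumed toy score: query-term frequency divided by element length."""
    words = " ".join(elem.itertext()).lower().split()
    return sum(words.count(t) for t in terms) / len(words) if words else 0.0


def document_retrieval(root, terms):
    """Whole archive as one unit: only the root element is a candidate."""
    return [(f"/{root.tag}[1]", score(root, terms))]


def element_retrieval(root, terms):
    """Any element as unit: rank every element that matches the query."""
    hits = [(path, score(e, terms)) for path, e in walk(root, f"/{root.tag}[1]")]
    return sorted((h for h in hits if h[1] > 0), key=lambda h: (-h[1], h[0]))


def aggregated_retrieval(archives, terms, k=2):
    """Assumed mix: rank archives as wholes, then list the best k elements in each."""
    results = []
    for name, root in archives.items():
        doc_score = document_retrieval(root, terms)[0][1]
        if doc_score > 0:
            results.append((name, doc_score, element_retrieval(root, terms)[:k]))
    return sorted(results, key=lambda r: -r[1])


root = ET.fromstring(TOY_EAD)
query = ["cold", "war"]
for path, s in element_retrieval(root, query)[:3]:
    print(f"{s:.2f}  {path}")
```

With this toy score, short and focused elements outrank the archive as a whole, while the aggregation keeps such elements grouped under the archive they belong to.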

Test Collection Building

Another major contribution is the development of a re-usable test collection for evaluating archival IR systems. We developed an assessment system that allows the user to assess relevant information within a finding aid and enables ad hoc topic development in conjunction with the README system. Relevance is based on a narrative in which the information need is clearly stated, including what is considered relevant. It is determined by locating particular units (elements) of archival materials that are likely to contain the sought answer. But what to retrieve? Descriptions of individual files and records tend to be very succinct, seldom more than a single sentence. Additionally, a finding aid also contains contextual background descriptions of the archive, which may directly contain relevant information.

Besides the focus on ad hoc topics, we will also investigate the use of known-item topics (partly) based on real-world e-mail reference requests, and test the effectiveness of a massive test collection that is automatically derived from search log files obtained from the National Archives of the Netherlands, spanning multiple years. The search log files also give further insight into how users search and browse on a popular archival website.
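
As a rough illustration of how relevance judgments might be derived automatically from search logs, the following Python sketch groups clicks by normalised query. The log record layout, the finding-aid identifiers, and the click-count threshold are hypothetical; the actual log format and derivation procedure are not described here.

```python
from collections import defaultdict

# Hypothetical log records: (session id, query, clicked finding-aid id or None).
LOG = [
    ("s1", "koude oorlog", "ead-001"),
    ("s2", "koude oorlog", "ead-001"),
    ("s2", "koude oorlog", "ead-007"),
    ("s3", "Koude  Oorlog", None),  # query without a click
    ("s4", "voc", "ead-042"),
]


def normalise(query):
    """Lower-case the query and collapse runs of whitespace."""
    return " ".join(query.lower().split())


def derive_qrels(log, min_sessions=1):
    """Map each query to the finding aids clicked in at least min_sessions sessions."""
    clicks = defaultdict(lambda: defaultdict(set))
    for session, query, doc in log:
        if doc is not None:
            clicks[normalise(query)][doc].add(session)
    return {
        query: {doc for doc, sessions in docs.items() if len(sessions) >= min_sessions}
        for query, docs in clicks.items()
    }


for query, docs in derive_qrels(LOG).items():
    print(query, sorted(docs))
```

Raising `min_sessions` trades coverage for confidence: fewer finding aids are judged relevant, but each judgment is backed by more than one user session.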

Retrieval Evaluation Experiments

The retrieval effectiveness of systems has not been deeply explored until now in the field of archival documentation and retrieval. The project has already commenced with testing archival principles and practices in terms of retrieval. It is pivotal here that the autonomy of an archive is respected. The tiniest modification, such as re-arrangement, may have the most far-reaching and potentially devastating results [13], such as losing information if the organic quality of archives is violated [19]. These are the fundamentals of curation, i.e. preserving archives while at the same time allowing and facilitating their handling and use [13]. Archival finding aids, both in paper and digital form, are surrogates of an archive. Therefore, the same set of principles is applied in a system as well (see conceptualization in Fig. 1).

With a controlled retrieval experiment, the effects of using Original Order in the ranking have been tested. Our preliminary results [35] show that relevance ranking is of paramount importance, although Original Order helps the retrieval of the first few results due to Cluster Hypothesis effects [12]; this adds motivation to test the retrieval effects of other types of results clustering. The prospect is to continue to conduct further experiments based on improved and larger test collections that we have constructed. Gradually, this will also lead to improved archival metadata retrieval systems.
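
The contrast between relevance ranking and Original Order can be made concrete with a small Python sketch. The hits, scores, and relevance judgments below are invented toy data and do not reproduce the reported experiment; the hybrid that re-sorts the top-n relevance hits into original order is likewise only an assumed variant for illustration.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    position: int   # index of the element in the finding aid's original order
    score: float    # retrieval score from some ranking function
    relevant: bool  # toy relevance judgment


# Invented toy data, for illustration only.
HITS = [
    Hit(0, 0.1, False),
    Hit(1, 0.9, True),
    Hit(2, 0.8, True),
    Hit(3, 0.2, False),
    Hit(4, 0.7, False),
    Hit(5, 0.3, True),
]


def by_relevance(hits):
    """Rank by descending retrieval score."""
    return sorted(hits, key=lambda h: -h.score)


def by_original_order(hits):
    """Present hits in the order in which the archive was arranged."""
    return sorted(hits, key=lambda h: h.position)


def top_n_in_original_order(hits, n):
    """Assumed hybrid: keep the n best-scoring hits, then restore original order."""
    return by_original_order(by_relevance(hits)[:n])


def precision_at_k(ranking, k):
    """Fraction of the first k results that are relevant."""
    return sum(h.relevant for h in ranking[:k]) / k


for name, ranking in [
    ("relevance", by_relevance(HITS)),
    ("original order", by_original_order(HITS)),
    ("top-3 in original order", top_n_in_original_order(HITS, 3)),
]:
    print(name, precision_at_k(ranking, 2))
```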

Evaluation with User Studies

Traditional IR metrics may not be sufficient to capture or quantify the experiences of users interacting with the README system. That is why user studies can be set up in alignment with retrieval experiments. The system has been preliminarily evaluated with nine users; more details on this study can be found in [6]. In this study, the element ranking approach was appreciated the least of the three systems, and the aggregation-based approach the most. The prospect in terms of user studies is to map retrieval variables visually to user interfaces based on the existing README system as independent variables, with search efficiency (e.g. time, number of clicks) and search effectiveness (e.g. relevance, confidence) as dependent variables, and to conduct online surveys with large user bases and different user types.

Link Detection with LARA

The use of link detection techniques also touches on knowledge management and hypertext, adding to the interdisciplinary character of the research. The first steps have been taken on EAD by detecting 'person links' using simple regular expressions and text processing techniques in conjunction with an IR system [34]. This can be framed as an extension of the README system: LARA, the acronym for Linking Authority Records Automatically. This extension will use the standard README architecture to find links and provide archival context with automatically compiled authority records. The effectiveness of LARA can be assessed in a system evaluation with IR metrics such as Precision, using user assessments as ground truth. The system has been evaluated extrinsically on a different dataset, namely Wikipedia, in the Link the Wiki (LTW) track of INEX, and its effectiveness has been compared with other systems [35]. To measure the importance of LARA for the retrieval of archival descriptions, additional user studies can be set up to determine the effectiveness of this system approach for archival access.
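
A minimal Python sketch of regular-expression-based person link detection, followed by a Precision computation against assessments, is given below. The pattern, the toy descriptions, and the set of names judged correct are assumptions for illustration; they are not the expressions or data used in LARA.

```python
import re
from collections import defaultdict

# Assumed pattern: capitalised first name, optional Dutch particles, capitalised surname.
PERSON = re.compile(r"\b[A-Z][a-z]+(?:\s(?:van|de|der|den))*\s[A-Z][a-z]+\b")

# Toy descriptive units keyed by a shortened element path; names and text are invented.
UNITS = {
    "c01[1]/did[1]/unittitle[1]": "Letter from Jan de Vries to Anna Bakker",
    "c01[2]/did[1]/unittitle[1]": "Minutes signed by Anna Bakker in The Hague",
}


def detect_person_links(units):
    """Map each detected name to the paths of the units that mention it."""
    index = defaultdict(list)
    for path, text in units.items():
        for name in PERSON.findall(text):
            index[name].append(path)
    return dict(index)


def precision(detected, correct):
    """Fraction of detected names that assessors judged to be persons."""
    detected = set(detected)
    return len(detected & set(correct)) / len(detected) if detected else 0.0


links = detect_person_links(UNITS)
judged_correct = {"Jan de Vries", "Anna Bakker"}  # stand-in for user assessments
for name, paths in links.items():
    print(name, paths)
print("precision:", round(precision(links, judged_correct), 2))
```

Such a simple pattern also matches capitalised place names like "The Hague", which is why an evaluation against assessments is needed before detected names are compiled into authority records.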

Conclusion

This paper presented the outline and scope of an ongoing dissertation project about system building of an XML IR system on archival finding aids. The primary aim is to deepen the understanding of archival information retrieval, and importantly, to improve it. Three main contributions of this work have been laid out: (a) IR experimentation with the system, primarily by treating key archival curation principles as independent variables; (b) the building of a re-usable test collection for archival finding aids in EAD to evaluate retrieval effectiveness; (c) the design and testing of a prototype system for authority files that allows information access beyond the single apparatus feature of archival finding aids. The outlook is to consolidate the research plan by putting more focus on IR evaluation, and eventually, to critically reflect on the applicability of the main findings beyond the archival domain.

Acknowledgments

I would like to thank the international panel at the Doctoral Consortium of ECDL 2009 in Corfu, Greece for their feedback. I also gratefully thank Henny van Schie of the National Archives of the Netherlands for providing valuable data. This research is supervised by Jaap Kamps, and supported by the Netherlands Organization for Scientific Research (NWO) under grant #639.072.601.

Bibliography

[1]	P. Bailey, N. Craswell, and D. Hawking, "Engineering a multi-purpose test collection for web retrieval experiments," Information Processing & Management, vol. 39, no. 6, 2003, pp. 853–871.

[2]	D. Bearman and R. Lytle, "The Power of the Principle of Provenance," Archivaria, vol. 21, Winter issue, 1986, pp. 14–27.

[3]	P. A. Boncz, T. Grust, M. van Keulen, S. Manegold, J. Rittinger, and J. Teubner, "MonetDB/XQuery: A Fast XQuery Processor Powered by a Relational Engine," Proc. SIGMOD’06, ACM, 2006, pp. 479–490.

[4]	T. Bray, J. Paoli, C.M. Sperberg-McQueen, E. Maler, and F. Yergeau, "Extensible markup language (xml) 1.0 (fifth edition)," 2008. Available at http://www.w3.org/TR/REC-xml/.

[5]	W. M. Duff and P. Stoyanova, "Transforming the Crazy Quilt: Archival Displays from a User’s Point of View," Archivaria, vol. 45, Spring issue, 1998, pp. 44–79.

[6]	K.N. Fachry, J. Kamps, and J. Zhang, "Access to archival material in context," Proc. IIiX’08, ACM, 2008, pp. 102–109.

[7]	K. Feeney, "Retrieval of Archival Finding Aids Using World-Wide-Web Search Engines," The American Archivist, vol. 62, Fall issue, 1999, pp. 206–228.

[8]	D. Hiemstra, H. Rode, R. van Os, and J. Flokstra, "PF/Tijah: text search in an XML database system," Proc. OSIR’06, 2006, pp. 12–17.

[9]	T. Hutchinson, "Strategies for Searching Online Finding Aids: A Retrieval Experiment," Archivaria, vol. 44, Fall issue, 1997, pp. 72–101.

[10]	International Council on Archives, "ISAD(G): General International Standard Archival Description, Second edition," Technical report, 1999. http://www.ica.org/sites/default/files/isad_g_2e.pdf

[11]	International Council on Archives, "ISAAR(CPF): International Standard Archival Authority Record for Corporate Bodies, Persons, and Families, Second edition," Technical report, 2004. http://www.ica.org/sites/default/files/ISAAR2EN.pdf

[12]	N. Jardine and C. van Rijsbergen, "The use of hierarchical clustering in information retrieval," Information Storage and Retrieval, vol. 7, 1971, pp. 217–240.

[13]	H. Jenkinson, "Reflections of an Archivist," Contemporary Review, vol. 165, 1944, pp. 355–361.

[14]	K. Jones and C. J. van Rijsbergen, "Information retrieval test collections," Journal of Documentation, vol. 32, no. 1, 1976, pp. 59–75.

[15]	M. Kaiser, H.-J. Lieder, K. Majcen, and H. Vallant, "New Ways of Sharing and Using Authority Information: The LEAF Project," D-Lib Magazine, vol. 9, no. 11, 2003. http://www.dlib.org/dlib/november03/lieder/11lieder.html

[16]	K. Kiesling, "Metadata, metadata, everywhere - but where is the hook?" OCLC Systems & Services, vol. 17, 2001, pp. 84–88.

[17]	M. Lalmas, "XML Information Retrieval," Encyclopedia of Library and Information Sciences, Taylor & Francis Group, 2009.

[18]	R.R. Larson and R. Sanderson, "Cheshire3: retrieving from tera-scale grid-based digital libraries," Proc. SIGIR’06, ACM, 2006, p. 730.

[19]	R.H. Lytle, "Intellectual Access to Archives: I. Provenance and Content Indexing Methods of Subject Retrieval," American Archivist, vol. 43, Winter issue, 1980, pp. 64–75.

[20]	W. Meier, "eXist: An open source native XML database," Web, Web-Services, and Database Systems, LNCS 2593, Springer, 2003, pp. 169–183.

[21]	D.V. Pitti, "Encoded archival description: The development of an encoding standard for archival finding aids," American Archivist, vol. 60, Summer issue, 1997, pp. 268–283.

[22]	D.V. Pitti, "Creator Description: Encoded Archival Context," Cataloging & Classification Quarterly, vol. 38, no. 3, 2004, pp. 201–226.

[23]	F. Ribeiro, "Subject Indexing and Authority Control in Archives: The Need for Subject Indexing in Archives and for an Indexing Policy Using Controlled Language," Journal of the Society of Archivists, vol. 17, no. 1, 1996, pp. 27–54.

[24]	S. Robertson, "On the history of evaluation in IR," Journal of Information Science, vol. 34, no. 4, 2008, pp. 439–456.

[25]	S.W. Schwartz, C.J. Prom, C.A. Rishel, and K.J. Fox, "Archon: A unified information storage and retrieval system for lone archivists, special collections librarians and curators," Canadian Journal of Library and Information Practice and Research, vol. 2, no. 2, 2007. http://www.criticalimprov.com/index.php/perj/article/viewArticle/246/520

[26]	M. Sévigny and F. Clavaud, "PLEADE – EAD for the Web," DigitCULT.Info, vol. 6, 2003, pp. 16–18.

[27]	W.M. Shaw, J.B. Wood, R.E. Wood, and H.R. Tibbo, "The cystic fibrosis database: content and research opportunities," Library and Information Science Research, vol. 13, 1991, pp. 347–366.

[28]	T. Stasinopoulou, L. Bountouri, C. Kakali, I. Lourdi, C. Papatheodorou, M. Doerr, and M. Gergatsoulis, "Ontology-based metadata integration in the cultural heritage domain," Proc. ICADL’07, LNCS 4822, Springer, 2007, pp. 165–175.

[29]	T. Thomassen, "A first introduction to archival science," Archival Science, vol. 1, no. 4, 2001, pp. 373–385.

[30]	H.R. Tibbo, "Primarily history: historians and the search for primary source materials," Proc. JCDL ’02, ACM, 2002, pp. 1–10.

[31]	A. Trotman and B. Sigurbjörnsson, "Narrowed Extended XPath I (NEXI)," Proc. INEX’04, Springer, 2004, pp. 16–40.

[32]	E. Yakel, "Encoded Archival Description: Are Finding Aids Boundary Spanners or Barriers for Users?," Journal of Archival Organization, vol. 2, no. 1, 2004, pp. 63–77.

[33]	J. Zhang, K. N. Fachry, and J. Kamps, "Access to Archival Finding Aids: Context Matters," Proc. ECDL’08, LNCS 5173, Springer, 2008, pp. 455–457.

[34]	J. Zhang, K. N. Fachry, and J. Kamps, "Automatic link-detection in encoded archival descriptions," Proc. DH’08, 2008, pp. 226–228.

[35]	J. Zhang and J. Kamps, "A content-based Link Detection Approach Using the Vector Space Model," Proc. INEX’08, LNCS 5631, Springer, 2009, pp. 395–400.

[36]	J. Zhang and J. Kamps, "Focused search in digital archives," Proc. WISE’09, LNCS 5802, Springer, 2009, pp. 463–471.

[37]	J. Zhang and J. Kamps, "Searching Archival Finding Aids: Retrieval in Original Order?," Proc. ECDL’09, LNCS 5714, Springer, 2009, pp. 447–450.