Automatic Creation and Enhancement of Metadata for Cultural Heritage
Metadata about Persons and Metadata for Different User Groups
This paper proposes an approach for extracting metadata from texts relating to images of cultural heritage. Although metadata has become increasingly important for effective information retrieval, manual construction of metadata is difficult, laborious, time consuming and expensive. In metadata for the fine arts, the people related to an artifact and their roles are important for describing its nature and value. For example, an artifact is ordered by a person with power, such as a king, noble or priest, for a certain purpose; its meaning, expressed by its creator, is mediated by an art dealer or connoisseur and received by the user or museum visitor. In information extraction, I emphasize the roles that individuals play with respect to works and the relationships between individuals. Moreover, I propose a method for paraphrasing metadata or descriptions according to the user's level of knowledge and experience.
There are currently numerous digital library projects underway throughout the world. These projects are developing collections of electronic resources and providing access to them for information seekers. In 2003, Japan's Agency for Cultural Affairs launched the Cultural Heritage Online1 project to provide a portal site for access to various digitized cultural heritage objects in the collections of museums, archives and related organizations. Metadata has become increasingly important for effective search of digitized cultural heritage images. Metadata is like a library index, including title, creator, creation date, etc., and it provides multiple access points that enhance the usability of the search system. Moreover, according to Kando et al., metadata is essential for linking multiple versions of the same objects from the same collections, and it can provide detailed description frameworks for each community as well as general frameworks for resource discovery across different communities at the same time. However, the information that comprises the metadata, such as owner, creative method or usage, is usually written as free-text description; only the title or period is sometimes assigned explicitly as metadata. In fact, manual construction of such detailed metadata for an enormous number of objects is too laborious and time consuming to be realistic. In those cases, automatic metadata extraction from natural language is critical for metadata enhancement.
1.1 Advantages of Hierarchical Faceted Metadata Categorization
Many researchers have demonstrated that multifaceted categorization is an effective method for searching cultural data [4-8]. On the one hand, Kando pointed out that one of the biggest problems in information retrieval (IR) for novice users is that they often cannot express their information needs appropriately as queries when they start the retrieval. By using metadata as a navigational structure, users can begin with just one feature of a broader topic and move to a more precise search by obtaining relevant information through continuous searching and browsing. Moreover, seamless integration of search and browsing functionality supports interactive exploration. The image access interface allows users to navigate a large collection in a flexible manner using hierarchical faceted metadata. On the other hand, the FLAMENCO Search Interface Project conducted a usability test in which art history students explored a collection of fine arts images, comparing keyword-based search with search by hierarchical faceted metadata. Although keyword search was faster, the study results showed that a category-based approach is a more successful way to provide access to art image collections. Thus, in the cultural domain, hierarchical faceted metadata categorization is suitable not only for novice users but for practiced information seekers as well.
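The browsing behaviour described above, starting from one broad feature and narrowing step by step, can be illustrated with a minimal sketch. It is not the FLAMENCO implementation; the records, facet names and facet values are invented for illustration, and each facet value is assumed to be a path from a broad category to a narrower one.

```python
from collections import Counter

# Hypothetical records; each facet value is a path from broad to narrow.
RECORDS = [
    {"title": "Tea bowl",
     "facets": {"period": ("Momoyama",), "type": ("ceramics", "tea ware")}},
    {"title": "Folding screen",
     "facets": {"period": ("Momoyama",), "type": ("painting", "screen")}},
    {"title": "Haniwa figure",
     "facets": {"period": ("Kofun",), "type": ("ceramics", "haniwa")}},
]

def refine(records, selected):
    """Keep records whose facet paths start with every selected path prefix."""
    def ok(record):
        for facet, prefix in selected.items():
            value = record["facets"].get(facet)
            if value is None or value[:len(prefix)] != prefix:
                return False
        return True
    return [r for r in records if ok(r)]

def facet_counts(records, facet, depth=1):
    """Count records per facet value at a given depth, to offer as next steps."""
    return Counter(r["facets"][facet][:depth]
                   for r in records if facet in r["facets"])

# Start broad (all ceramics), inspect the remaining periods, then narrow.
step1 = refine(RECORDS, {"type": ("ceramics",)})
options = facet_counts(step1, "period")
step2 = refine(step1, {"period": ("Momoyama",)})
```

At each step the user sees only the facet values that still lead to at least one object, so a search can begin without a well-formed query.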
Metadata has become increasingly important for effective information retrieval, but manual construction of metadata is labor intensive, time consuming and expensive. Although many researchers engaged in automatic metadata extraction aim to create richer metadata at a lower cost and at a faster pace than is possible with human-assigned metadata, no such research has yet addressed culturally related Japanese texts.
Therefore, I became interested in metadata enhancement by extracting information from free-text descriptions in order to facilitate searching cultural data. In particular, I concentrated on metadata enhancement of the PERSONs related to the cultural heritage content, and their ROLEs, because PERSONs related to creation, ownership and relevant historical events convey the nature and value of the artifacts, and novice users use well-known persons as clues to understand the historical and chronological relationships between information objects.
Another problem with metadata about cultural heritage is that the terminology and descriptions are too often technical and difficult for users who are not professionals in the domain, especially children or people from different cultures. Metadata and descriptions created with nonprofessional users in mind are very limited and exist for only a few objects. Thus, the automatic paraphrasing of metadata descriptions into multiple versions according to the different knowledge levels of users is a critical aspect of metadata enhancement.
2. Research Questions
My research focus is on Automatic Metadata Creation involving extracting metadata from related texts, enriching it and modifying it; I am especially interested in automatic metadata extraction that focuses on individuals (persons) related to the cultural resource and on multiple metadata creation for different user groups.
2.1 Extracting PERSONs and their ROLEs
An investigation of the online collection databases of museums and galleries in Japan revealed that, in the cultural domain, although the role of individuals is important in terms of art history and art documentation, there is no system for using information on related people for searching, except for the few systems that have a list of creators linking to their works.2 PERSON is an important factor as metadata from three aspects: i) art history, ii) art documentation and iii) cognition of time and space.
i.) art history
The PERSONs related to cultural resources are not only the creators but also others who are related in some way to the creation of the resource, its ownership and relevant historical events. These persons also convey the nature and value of artifacts. For example, the tea sets used by Sen Rikyu, the most celebrated tea master, are very valuable in the context of the tea ceremony. Nakamura and Kishi emphasize the importance of individuals in explaining the arts within the framework of communication theory in their book for students of Japanese art. According to Nakamura and Kishi, art is a chain of events in which the CREATOR, on the order of the OWNER, creates a WORK based on some rules; the WORK is distributed by a respectable AGENT and consumed by the RECEIVER.
Of course, it is important to explain the aesthetic aspects of the WORK at the center of that framework. However, it is also important to know that the WORK, as a medium, was ordered by somebody for some purpose, expressed by its creator, mediated by an art dealer or connoisseur, and received by the user or museum visitor.
ii.) art documentation
These elements are defined as categories in CDWA (Categories for the Description of Works of Art).3 For CDWA, it is important to qualify or describe the nature of the relationship between one person, or corporate body, and another. CDWA defines categories describing individuals or groups of people and their role in the life and purpose of the work. In addition, CIDOC-CRM4 Version 3.4.9 defines 84 Entities as concepts and has 141 Properties to describe the relationships between Entities. In this context, the role of a "Person" is "Production" of a "Man-Made Object". Normally, such a "Person" is called the "Creator"; that is, the term "Creator" combines the "Person" and his/her role. Therefore, by untying metadata elements into their component concepts, every data item can be mapped onto relationships between concepts.
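As an illustration of what untying a metadata element might look like, the sketch below splits a flat "Creator" element into a person, a production event and a man-made object. The relation labels (`is_a`, `carried_out_by`, `has_produced`), the identifiers and the `ROLE_TO_EVENT` table are placeholders chosen for readability; they are not official CIDOC-CRM property or class identifiers.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str
    relation: str
    obj: str

# Hypothetical mapping from a flat role element to the event that carries the role.
ROLE_TO_EVENT = {"Creator": "Production"}

def untie(work_id, element, person_id):
    """Split a flat element such as Creator=<person> into event-centred triples."""
    event_class = ROLE_TO_EVENT[element]
    event_id = f"{work_id}/{event_class.lower()}"
    return [
        Triple(event_id, "is_a", event_class),
        Triple(event_id, "carried_out_by", person_id),
        Triple(event_id, "has_produced", work_id),
        Triple(person_id, "is_a", "Person"),
        Triple(work_id, "is_a", "Man-Made Object"),
    ]

# Identifiers are invented for the example.
triples = untie("work:001", "Creator", "person:example")
```

Because the role is attached to an event rather than to the person, further roles such as owner or agent could be added as additional event types without changing the record for the person or the work.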
iii.) cognition of time and space
In terms of information retrieval, Miwa et al. revealed in their research that novice users use well-known persons as clues to understand historical and chronological relationships during searches related to time and space. For example, novice users can imagine the atmosphere of a period by knowing the links between the individuals of that period. Also, links between objects and familiar persons such as a celebrated hero or heroine can evoke the user's interest in the object.
Thus, the information about PERSONs and their ROLEs is a significant factor for information retrieval in terms of description and discovery of artifacts. However, a full text search cannot analyze their roles in context, and the information described in the different databases is not homogeneous even if they annotate the same work. Extracting and storing the information as metadata enables integration of the information scattered in multiple databases.
2.2 Multiple Versions of Metadata for Different User Groups
The next source of search difficulty is the complexity of the explanations of artifacts. The terminology and descriptions of cultural contents are generally too technical and difficult for students or nonprofessional users to understand or search, because they are written by domain experts. Multiple expressions are therefore required to explain one concept according to users' level of experience or background (for example, experts and non-experts, adults and children, people from different cultures). Although museums prepare brochures for children for some special exhibitions, such brochures are very limited, and extending similar explanations to all types of users would be a huge task. Fujita pointed out that automatic paraphrasing technology could bridge gaps between authors and readers, and it is one of the biggest issues in Natural Language Processing. In reading assistance, it is critical that a paraphrase of an input sentence or text actually improves its comprehensibility [24, 28]. However, although some groups have been researching text simplification for learners of Japanese or for deaf readers, paraphrasing aimed at children, including paraphrasing for educational purposes, has yet to be developed except for Kanji (Chinese characters). Kanji does present a critical readability issue in Japanese, but vocabulary, syntax and sentence structure are also important factors that determine the difficulty of a description. Paraphrasing at all of these linguistic levels is required to support reading.
3. Related Work
3.1 Metadata Extraction
Many methods have been proposed for metadata extraction [9-14], and these can be divided into two categories: 1) the machine learning approach using SVM (Support Vector Machines) and 2) the rule-based approach using formatting information or Natural Language Processing techniques.
Metadata Extraction with Machine learning approach
Han et al. proposed a metadata extraction method using Support Vector Machines (SVM). They viewed the problem as that of classifying the lines in a document into the categories of metadata and proposed using SVM as the classifier. They used linguistic features such as word-specific and line-specific features.
Metadata Extraction using formatting rules
Kawtrakul and Yingsaeree proposed automatic metadata extraction from research papers using rules on formatting information. Experimental results have shown that using the proposed framework greatly reduces the labor involved in the metadata creation process.
Liddy et al. performed metadata extraction from educational materials using rule-based natural language processing technology. The results of their experiments show high performance in both quantity and quality, indicating that automatic metadata generation can produce acceptable results for subject metadata.
3.2 Paraphrase for Reading Assistance
Although the term paraphrase covers a very wide range of phenomena and is used for various purposes, I focus on lexical and structural paraphrase for reading assistance with linguistic knowledge but without referring to the communicative situation.
Conversion from Kanji to Hiragana
Enomoto et al. developed a system to convert Chinese characters (KANJI) into the Japanese phonetic syllabary (HIRAGANA).
Table 1. Examples of Conversion between KANJI and HIRAGANA
The Agency for Cultural Affairs5 designates 1,945 Chinese characters for everyday use as 'Common Kanji' (called Jo-yo Kanji in Japanese), and among those, 1,006 frequently used Chinese characters are designated as 'Educational Kanji' taught in elementary school. An appendix of "the Government Course Guidelines", "Educational Chinese Character Tables for Each Grade", designates the Kanji taught in each grade of elementary school.
Table 2. Number of Educational Chinese Characters Taught in Elementary School
In their system, the level of conversion is set according to student grade level by using dictionaries that are based on educational Chinese character tables for each grade. The system enables the numerous web contents, which are written without considering junior readers and include lots of Kanji, to be used for educational purposes.
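Grade-dependent conversion of this kind can be sketched as follows. This is not the system of Enomoto et al.; it is a minimal illustration that assumes the text is already segmented into words and that a word is replaced by its Hiragana reading whenever it contains a Kanji taught above the reader's grade. The two dictionaries are tiny illustrative excerpts, and the names `KANJI_GRADE`, `READINGS` and `convert` are hypothetical.

```python
# Illustrative excerpts only; a real system would load the full per-grade tables
# and a full reading dictionary.
KANJI_GRADE = {"山": 1, "寺": 2, "仏": 5, "像": 5}
READINGS = {"山": "やま", "寺": "てら", "仏像": "ぶつぞう"}

def is_kanji(ch):
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"

def convert(tokens, grade):
    """Replace each word containing Kanji above `grade` with its Hiragana reading."""
    out = []
    for tok in tokens:
        # Kanji missing from the table are treated as beyond every grade.
        too_hard = any(is_kanji(c) and KANJI_GRADE.get(c, 99) > grade for c in tok)
        # Words without a known reading are left unchanged.
        out.append(READINGS.get(tok, tok) if too_hard else tok)
    return "".join(out)

tokens = ["山", "の", "寺", "の", "仏像"]
for_grade_1 = convert(tokens, 1)
for_grade_2 = convert(tokens, 2)
for_grade_6 = convert(tokens, 6)
```

Working at the word level rather than character by character matters because the reading of a compound is a property of the whole word, not of its individual Kanji.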
However, 'Common Kanji' takes into consideration everyday usage, such as that in official documents, newspapers and broadcasting, but it excludes specialized domains such as science, technology or the arts, and all named entities. Thus, cultural heritage descriptions involve many difficult Kanji outside 'Common Kanji'. Moreover, many Japanese words are formed as compounds of Kanji, and the pronunciation is determined word by word. Thus, even if each Kanji in a compound word is very familiar, the word itself may well have a highly specialized meaning and pronunciation. For example, mizura (mi-zu-ra), an ancient Japanese hairstyle, is written as a compound of 'Educational Kanji', but it is a technical term in archaeology and too unfamiliar for non-professionals in the domain to pronounce or understand.
Thus, Kanji is one of the major factors in determining the difficulty of Japanese terms.
Inui et al. reported on a research project on text simplification for congenitally deaf people, who tend to have difficulties in reading and writing text. They proposed an approach to readability assessment that involved surveying teachers in schools for the deaf to discover problematic parts of texts and paraphrasing alternatives. This approach suggests an essential aspect of paraphrase: it should be tailored to each purpose and situation.
Besides helping special groups such as the deaf, support for foreign people (learners of Japanese) is another aim of text simplification. Murayama et al. argued that important information such as "caution" or "danger" should be written according to a guideline that ensures its readability. They proposed a standard for Japanese sentences that defines three levels of readability for four factors: 1) Kanji, 2) vocabulary, 3) grammar and 4) quantitative complexity. This division of factors provides a useful foundation for constructing a model of paraphrase for reading assistance.
4. Proposed Methodology
4.1 Data Collection
We have a database of cultural heritage data for Cultural heritage Education using digital Archives with Cross (X) ing heterogeneous information objects and crossing different communities (CEAX),6 a project on "Research on Supporting Discovery Learning Using Cultural Digital Archives". For the CEAX database, the Tokyo National Museum contributed cultural heritage images with detailed annotation prepared for experts, together with annotation for elementary school children produced by specially trained writers who manually rewrote the original annotation. We now have parallel corpora covering 80 objects at two different user levels, experts and children, and the database is being added to continually.
4.2 Automatic Metadata Extraction by NLP
Most of the metadata concerning cultural heritage consists of Named Entities (NEs), such as "ARTIFACT", "PERSON", "ORGANIZATION", "PERIOD", etc.7 Many researchers have reported on Japanese NE extraction [15-19] and have taken part in the Information Retrieval and Extraction Exercise (IREX);8 the two main methods are rule-based and supervised machine learning. Rule-based methods use extraction rules (parts of speech, word-linking rules) and a pattern dictionary of NE clues (e.g., Mr. <person>, <organization> hospital), but they are not sufficient for extracting NEs that appear without such clues, while supervised machine learning methods can be applied to a new item or domain by preparing correct (annotated) data.
However, terms and personal names in the cultural domain are characteristically already listed in a thesaurus or dictionary, and many fixed expressions are used to explain artifacts. Moreover, it is essential to analyze construction patterns in order to extract expressions of relations between NEs. Therefore, I shall use construction patterns involving relations between NEs as rules.
In the first phase I analyze the texts and formulate construction pattern rules, and in the next phase I will implement an automatic extraction system using these rules.
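The idea of combining a dictionary of known names with construction patterns can be sketched as below. This is an illustrative sketch only: the rules for the actual system are still being formulated, the patterns here are English glosses standing in for Japanese fixed expressions, and the gazetteer, the pattern list and the example sentence are all invented.

```python
import re

# Hypothetical gazetteer; in practice names would come from a thesaurus or dictionary.
PERSONS = ["Sen Rikyu", "Toyotomi Hideyoshi"]

# Hypothetical construction patterns; {P} marks the slot for a PERSON.
PATTERNS = [
    ("CREATOR", r"(?:made|painted|created) by {P}"),
    ("OWNER", r"(?:owned|commissioned|ordered) by {P}"),
    ("RECEIVER", r"used by {P}"),
]

def extract_roles(text):
    """Return (person, role) pairs for every construction pattern found in text."""
    person_alt = "(?P<person>" + "|".join(re.escape(p) for p in PERSONS) + ")"
    found = []
    for role, template in PATTERNS:
        regex = re.compile(template.replace("{P}", person_alt))
        for match in regex.finditer(text):
            found.append((match.group("person"), role))
    return found

# Invented sentence, used only to exercise the patterns.
sentence = "This tea bowl was used by Sen Rikyu and later owned by Toyotomi Hideyoshi."
pairs = extract_roles(sentence)
```

A full-text search would find both names in the sentence but could not tell them apart by role; the pattern attached to each name is what turns a mention into PERSON and ROLE metadata.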
4.3 Metadata Description Paraphrase
On the one hand, automatic ontology-based knowledge extraction and description generation has been researched in the cultural domain [21,22]. Geurts et al. proposed a method of automatic presentation generation from semantic graphs. The method makes it possible to create a genre of discourse (e.g., biography) and a document structure (e.g., multimedia presentation) according to a user's requirements by using three kinds of ontologies. However, the ontological approach requires an enormous amount of metadata defining very detailed information, such as the material of the artifact, the subject depicted, related historical events and related works. It is difficult to assign all of this information as metadata.
On the other hand, monolingual paraphrasing is one of the big issues in the natural language processing (NLP) domain [23,24]. It is defined as alternative ways of conveying the same contents, and it can be viewed as a special case of translation in the sense that both transform a given expression into a different expression, while preserving meaning as much as possible.
I apply a method to paraphrase existing descriptions into multiple explanations according to users' needs, because numerous descriptions written by domain experts exist and they are very valuable.
Position of My Research
Table 3 illustrates the classification of paraphrase on five axes; 1) Genre of document, 2) Purpose, 3) Target user, 4) Direction of paraphrase and 5) Linguistic level.
Table 3. 5-Dimensional Model of Paraphrase and Position of the Research. Axes: 1) Genre of document, 2) Purpose, 3) Target user, 4) Direction of paraphrase, 5) Linguistic level.
My research aims to transform technical descriptions of cultural heritage to make them more understandable and to provide reading assistance to children at all linguistic levels. The shaded part of Table 3 depicts the sections covered by my research.
Notable points are as follows. First, it is paraphrase with an educational purpose, considering not only Kanji but all linguistic levels. Second, it aims to paraphrase a description into a compact and clear one, avoiding lengthy explanation in terms of quantitative complexity. Third, it aims to paraphrase technical terms within the running text, with readability in mind, rather than by linking to a dictionary.
In the first phase I analyze the parallel corpora to acquire the paraphrasing rule and construct the patterns for automatic paraphrase, and then evaluate the model with other corpora.
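One simple way to collect candidate lexical paraphrasing rules from such parallel descriptions is to align the expert and child versions of a sentence and record the spans that were replaced. The sketch below uses `difflib.SequenceMatcher` from the Python standard library for the alignment. It is an assumption-laden illustration, not the method adopted in this research: the sentence pair is an invented English gloss, and real Japanese text would first need morphological analysis to obtain tokens.

```python
from collections import Counter
from difflib import SequenceMatcher

def replacement_rules(expert_tokens, child_tokens):
    """Return (expert span, child span) pairs for every replaced token span."""
    matcher = SequenceMatcher(a=expert_tokens, b=child_tokens, autojunk=False)
    rules = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            rules.append((" ".join(expert_tokens[i1:i2]),
                          " ".join(child_tokens[j1:j2])))
    return rules

def collect_rules(pairs):
    """Count candidate rules over many (expert, child) token-list pairs."""
    counts = Counter()
    for expert_tokens, child_tokens in pairs:
        counts.update(replacement_rules(expert_tokens, child_tokens))
    return counts

# Invented English gloss of one expert/child sentence pair.
expert = "the vessel was excavated from a tumulus".split()
child = "the pot was dug up from an old tomb".split()
rules = replacement_rules(expert, child)
rule_counts = collect_rules([(expert, child)])
```

Counting candidates over the whole corpus gives a rough way to separate recurring rewrites from one-off ones; structural paraphrases that reorder or split sentences would need a richer alignment than this token-level diff.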
5. Conclusions and Future Work
This paper proposes that faceted metadata is useful for searching data related to cultural heritage and that automatic extraction is required to enhance metadata. Faceted metadata helps users who cannot find the proper words for their query. In particular, I focus on the role of individuals (Persons) in relation to artifacts and on the relationships between individuals, because such information is important in aspects of both fine art and art documentation. In addition, automatically simplifying complex description based on extracted metadata is another way to support nonprofessional users.
To date, for metadata extraction, I have processed descriptions from the viewpoint of Named Entity extraction, taking the Japanese dependency structure into account, and paying particular attention to key words such as personal names and words expressing the roles of individuals. From those results, I have formulated rules using the correct data. The next step is implementation and evaluation by comparing the results with the correct data.
With regard to paraphrase, so far I have analyzed paraphrasing rules by comparing the expert and child descriptions in the parallel corpora. The next phase of my research is to model the rules acquired from that process and to evaluate them.
3. CDWA (Categories for the Description of Works of Art): <http://www.getty.edu/research/conducting_research/standards/cdwa/index.html>.
5. National Language Information System by Agency of Cultural Affairs: <http://www.bunka.go.jp/kokugo/frame.asp?tm=20060618082803/>.
7. MUC definition: <http://www.itl.nist.gov/iaui/894.02/related_projects/muc/index.html/>.
1. E.D. Liddy, E. Allen, S. Harwell, S. Corieri, O. Yilmazel, N.E. Ozgencil, A. Diekema, N.J. McCracken, J. Silverstein, S.A. Sutton: Automatic Metadata Generation & Evaluation. In the 25th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2002), August 11-15, 2002, Tampere, Finland: 401-402.
2. N. Kando, J. Adachi: Cultural Heritage Online: Information Access across Heterogeneous Cultural Heritage in Japan. In the Proceedings of International Symposium on Digital Libraries and Knowledge Communities in Networked Information Society DLKC'04: March 2-5 2004, Tsukuba, Japan (Invited). <http://www.kc.tsukuba.ac.jp/dlkc/>.
3. M. Miwa, N. Kando: Naive Ontology for Concepts of Time and Space for Searching and Learning, Submitted to Information Seeking in Context Conference 2006 (ISIC 2006), July 19-21, 2006, Sydney (to appear).
4. N. Kando, T. Kanazawa, A. Miyazawa: Retrieval of Web Resources Using a Fusion of Ontology-based and Content-based Retrieval with the RS Vector Space Model on a Portal for Japanese Universities and Academic Institutes, In the Proceedings of 39th Hawaii International Conference on System Sciences, January 4-7, 2006 [CD-ROM].
5. J. English, M. Hearst, R. Sinha, K. Swearingen, K. Yee: Flexible search and navigation using faceted metadata. In the 25th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2002), August 11-15, 2002, Tampere, Finland.
6. J. English, M. Hearst, R. Sinha, K. Swearingen, K. Yee: Hierarchical Faceted Metadata in Site Search Interfaces. In the Proceedings of the SIGCHI conference on Human factors in computing systems, April 5-10, 2003, Florida, USA: 628-639.
7. B.H. Kwasnik: The Role of Classification in Knowledge Representation and Discovery-1, Library Trends(online), Summer, 1999, <http://www.findarticles.com/p/articles/mi_m1387/is_1_48/ai_57046525>.
8. P. Yee, K. Swearingen, K. Li, M. Hearst: Faceted Metadata for Image Search and Browsing, In the Proceedings of the SIGCHI conference on Human factors in computing systems, April 5-10, 2003, Florida, USA: 401-408.
9. J. Greenberg, A. Crystal: Iterative Design of Metadata Creation Tools for Resource Authors, In the Proceedings of the DC-2003 (2003 Dublin Core Conference), September 28 - October 2, 2003, Seattle, USA: 49-58.
10. H. Han, C.L. Giles, E. Manavoglu, H. Zha: Automatic Document Metadata Extraction using Support Vector Machines, In the Proceedings of the ACM/IEEE Joint Conference on Digital Libraries, Washington, USA, 2003: 37-48.
11. Y. Hu, H. Li, Y. Cao, D. Meyerzon, Q. Zheng: Automatic Extraction of Titles from General Documents using Machine Learning. In the Proceedings of JCDL 2005 (the Joint Conference on Digital Libraries), June 7-11, 2005, Denver, USA : 145-154.
12. M. Kan, J.L. Klavans, K. McKeown: Using the Annotated Bibliography as a Resource for Indicative Summarization, In the Proceedings of LREC 2002 (Third International Conference on Language Resources and Evaluation), May 29-31, 2002, Canary Island, Spain : 1746-1752.
13. A. Kawtrakul, C. Yingsaeree: A Unified Framework for Automatic Metadata Extraction from Electronic Document. In the Proceedings of IADLC2005 (The International Advanced Digital Library Conference), August 25-26, 2005, Nagoya. Japan: 71-77.
14. W. Paik, S. Yilmazel, E. Brown, M. Poulin, S. Dubon, C. Amice: Applying Natural Language Processing (NLP) Based Metadata Extraction to Automatically Acquire User Preferences. In the Proceedings of K-CAP 2001 (the First International Conference on Knowledge Capture), October 21-23, 2001, Victoria, Canada : 116-122.
15. M. Asahara, Y. Matsumoto: Japanese Named Entity Extraction with Redundant Morphological Analysis. In the Proceedings of HLT-NAACL 2003 (Human Language Technology conference), May 27-June1, 2003, Edmonton, Canada: 8-15.
16. H. Isozaki: Japanese Named Entity Recognition based on a Simple Rule Generator and Decision Tree Learning. In the Proceedings of ACL 2001(Association for Computational Linguistic, 39th Annual Meeting and 10th Conference of the European Chapter), July 9-11, 2001, Toulouse, France: 306-313.
17. S. Sekine and Y. Eriguchi: Japanese Named Entity Extraction Evaluation - Analysis of Results, In the Proceedings of The 18th International Conference on Computational Linguistics, July 31-August 4, 2000, Saarbrücken, Germany: 1106-1110.
18. I. Watanabe, F. Masui, J. Fukumoto: Progress of Precision and Usability on NexT (Named Entity Extraction Tool), In the Proceedings of 10th Annual Conference of the Association of Natural Language Processing Japan, March 16-18, 2004, Tokyo, Japan: 413-415 (in Japanese).
19. T. Iwakura, K. Tsukamoto, N. Igata, H. Tsuda: Extracting Metadata from Text and Application to KnowWho, In the Proceedings of 18th Annual Conference of the Japanese Society for Artificial Intelligence, June 2-4, 2004, Kanazawa, Japan: 3H3-04 (in Japanese).
20. K. Nakamura, F. Kishi: Nihon Bijyutu wo Manabu Hito no tameni (For the Students of Japanese Art), 2001, Sekai Shiso-sya, Tokyo (in Japanese).
21. J. Geurts, S. Bocconi, J. Ossenbruggen, and L. Hardman: Towards Ontology-driven Discourse: From Semantic Graphs to Multimedia Presentations. In Second International Semantic Web Conference ISWC2003. <http://www.cwi.nl/~media/publications/iswc2003.pdf>.
22. S. Kim, H. Alani, W. Hall, P. Lewis, D. Millard, N. Shadbolt, and M. Weal: Artequakt: Generating Tailored Biographies with Automatically Annotated Fragments from the Web, In the Proceedings of the Semantic Authoring, Annotation and Knowledge Markup Workshop in the 15th European Conf. on Artificial Intelligence, July 2002, Lyon, France: 1-6.
23. A. Fujita, K. Inui and Y. Matsumoto: Analysis and Modularization of Text Revision Processes for Text Simplification, In the Proceedings of 65th Annual Conference of Information Processing Society of Japan, March 25-27, 2003, Tokyo, Japan: 1T6-4 (in Japanese).
24. K. Inui and A. Fujita: A Survey on Paraphrase Generation and Recognition, Journal of Information Processing Society of Japan, Vol. 11, No. 5, 2004: 151-198 (in Japanese).
25. S. Enomoto, M. Murota and Y. Shimizu: Development of KANJI-KANA Auto-Translation Server which Corresponds the ON-KUN Reading and Mentions the FURIGANA, IEICE technical report. Education technology, Vol.99, 1991: 69-76 (in Japanese).
26. K. Inui, A. Fujita, T. Takahashi and R. Iida: Text simplification for reading assistance: a project note, In the Proceedings of 2nd International Workshop on Paraphrasing: Paraphrase Acquisition and Applications (IWP), July 11 2003, Sapporo, Japan: 9-16.
27. S. Sato, M. Tsuchiya, M. Murayama, M. Asaoka and Q. Wang: Standardization of Japanese Sentences, Special Interest Group Note (NLP) of Information Processing Society of Japan, Vol.2003, No.4(2003-NL-153): 133-140 (in Japanese).
28. K. Yamamoto: Present and Future of Paraphrase, Workshop in 7th Annual Conference of the Association for Natural Language Processing, March 30, 2001, Tokyo, Japan: 93-93 (in Japanese).
29. N. Kaji, M. Okamoto, S. Kurohashi: Paraphrase Declinable Words from Vocabulary in Written Document into Vocabulary of Speech in WWW, 2004, Natural Language Processing, Vol. 11, No. 5, 2004: 19-37 (in Japanese).
© Copyright 2007 Satoko Fujisawa