Eckert, "A methodology for supervised automatic document annotation", TCDL Bulletin 4.2 (2008)

A methodology for supervised automatic document annotation

Kai Eckert

Computer Science Institute
University of Mannheim
http://ki.informatik.uni-mannheim.de

Abstract:

This paper outlines a proposal for a PhD thesis. The goal is the development of a new methodology to supervise and assess a thesaurus-based automatic indexing process in a document retrieval system. Similar to existing approaches for semi-automatic indexing, where automatic annotations are used only as a preliminary result and hint for a human indexer, the approach presented herein relies on a human expert to supervise the automatic indexer.

The methodology consists of new analysis methods combined with advanced visualization techniques and enables the human expert to identify concrete problems in a given environment of a document base, a thesaurus and an indexer.

We present the background for the thesis that forms the basis for the problem definition. After a review of the current state of the art, our approach and the research methodology are described. The current status of our work is presented, together with the results of a proof of concept study that has been done as one of the first steps.

1 Background

Currently, library inquiries are usually performed via Online Public Access Catalogs (OPACs). The first OPACs either attempted to emulate the familiar card catalog in its new online form or adopted the model familiar to online database searchers of commercial search services [10]. They were soon replaced by the so-called second-generation OPACs, which combined these two approaches and significantly enhanced the search possibilities (but also the complexity) for the user. Borgman performed several studies on the problems of OPAC users between 1986 and 1996. She concludes that few improvements, if any, were made in this interval to improve the usability of catalog systems. Users still need assistance to translate their questions into a structured query that can be interpreted by the retrieval system [2]. The systems are mainly usable by expert librarians, not by typical library users. The system design should follow the users' search behavior, not the other way round. These results are affirmed by various other studies, for example by Lombardo and Condic [13], who find that users prefer and use new features in an OPAC system such as remote access, but consider search enhancements such as multiple keyword search options or subject and author hyperlinks rather irrelevant.

With the widespread use of internet search engines, a new component comes into play: users are now not only inexperienced with the use of OPACs, they also have expectations toward an OPAC that result from their experiences with search engines like Google. Yu and Young [29] describe this development in depth and suggest that OPACs have to implement search engine features like natural language search with keywords, relevance feedback, spelling corrections and relevance-ranked output. Similar statements can be found in Campbell and Fast [3]; they see a huge potential for innovations in the complementary relationship between catalogs and search engines.

But search engines not only have an impact on usability expectations; today they are an inherent part of any information search. According to Rosa [20], "89% of college student information searches begin with a search engine." But what are the differences between OPACs and search engines? In 2003, Eversberg [7] stated that "catalogs and search engines are juxtaposed in a pears vs. apples comparison." But he also admits that "there are, however, widening 'grey' areas: Genuine internet resources are being cataloged to enrich catalogs. And search engines index files that contain book reviews, abstracts, whole chapters, descriptions, etc."

Since then, the transitions between catalogs and search engines have become more and more fluid. In 2004, Google started Google Books¹ (formerly known as Google Print) and Google Scholar,² two new services that provided access to documents that had previously been available only via library catalogs. There are visions of digital libraries where the user can search and browse the whole inventory and access all documents (and audio files, movies, ...) with one click at any time and any place in the world.

With the full text or abstracts of most documents available for indexing, it is questionable whether we need subject headings at all today. For keyword search in document titles, this was examined by Gross and Taylor [9]. They found that subject headings improve the search results significantly: over 30% of relevant documents would not be found with a keyword search that is limited to the title.

Automated thesaurus-based indexing (described for example by Ferber [8]) is an approach that seems to fill the gap between catalog systems and full-text search engines. Thesauri are hierarchical structures of concepts that build a controlled vocabulary. In libraries this vocabulary is used for document indexing with subject headings, for example the Library of Congress Subject Headings (LCSH) or the Medical Subject Headings (MeSH). Every occurrence of a meaningful word is mapped to the appropriate concept in the thesaurus, which is then used to index the document. In a perfect world, this approach would overcome most weaknesses of full-text search and could be seen as an automatic assignment of subject headings without any human intervention. But it is arguable whether the achievable quality is high enough to meet the requirements of a subject inquiry in a library.
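The mapping step described above can be illustrated with a minimal sketch. It is not the indexer of any system cited in this paper: the toy thesaurus, the function names and the single-word matching are assumptions made purely for illustration, and real indexers additionally handle multi-word terms, normalization and disambiguation.

```python
import re
from collections import Counter

# Hypothetical toy thesaurus: concept -> terms (labels and synonyms).
THESAURUS = {
    "Equidae": ["donkeys", "mules", "zebras"],
    "Neoplasms": ["tumor", "tumors", "cancer"],
}

def build_term_index(thesaurus):
    """Invert the thesaurus into a lookup from lower-cased term to concept."""
    return {term.lower(): concept
            for concept, terms in thesaurus.items()
            for term in terms}

def index_document(text, term_index):
    """Count, per concept, how many words of the text match one of its terms."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(term_index[w] for w in words if w in term_index)

term_index = build_term_index(THESAURUS)
annotations = index_document(
    "Zebras and donkeys rarely develop this tumor.", term_index)
print(dict(annotations))
```

Even this simple form shows why the quality of the result depends on the thesaurus: every term listed for a concept becomes a trigger for that concept, whether or not the word is used in the intended sense.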

Another application for automatic indexing is to use the results only as hints for a human indexer. This way, the time to index a document is reduced, and therefore such semi-automatic indexing systems are in use today. But with this approach, a human indexer is still needed who processes each document manually.

2 Problem Definition

As stated above, the current use of automatic indexers as a preprocessing step for manual annotations does not scale. So there still is a gap between supported manual indexing and automatic indexing systems. While the improvement of the indexing system is part of active research, it is questionable whether the improvement of only one factor will lead to the desired quality.

Based on this experience, there are two cornerstones for this thesis: First, the focus lies on the improvement of the whole system, consisting of a concrete indexer, a concrete thesaurus and a concrete document base. Second, the human expert can and will be an integral part of the indexing process, as with semi-automatic indexing. The latter is especially important, as the human indexer works around weaknesses of the thesaurus when indexing manually or suggests improvements to the thesaurus maintainer if a work-around is not adequate. These are tasks that cannot be performed by an automatic indexing system.

These considerations lead to the following research questions, which are to be answered by this thesis:

How can the quality of annotations be measured?
Is it possible to visualize the annotation results globally?
What kind of problems affect the annotation results?
Is it possible to detect these problems? Automatically?
Which analysis methods and visualizations support the human detection of these problems?
Is it possible to provide (proposed) solutions automatically?
Is the quality of retrieval results improved significantly by the overall approach?

3 State of the art

Several approaches exist for automatic document indexing with subject headings. There are commercial systems that perform this task, as well as technological studies that have not yet left the research labs. All of them have in common that they rely on the quality of the underlying thesaurus and face various problems that generally belong to aspects of natural language processing.

Examples of commercial applications are Transinsight³ with the search engine GoPubMed⁴ and customers like Elsevier, Free University of Berlin and the Christian-Albrechts University of Kiel, or Collexis⁵ with customers like Berkeley University, San Francisco University of California or University of South Carolina. There is a free solution for digital libraries provided by CERN, the CDS Invenio Document Server⁶. The CDS contains a module for automatic indexing with a thesaurus that is presented by Montejo-Raez [16].

The Indexing Initiative of the National Library of Medicine investigates methods whereby automated indexing methods partially or completely substitute for current indexing practices⁷. The Medical Text Indexer (MTI) is presented by Aronson et al. in [1] and uses machine learning techniques to learn from manually indexed documents. The application as supporting system for human indexers is well studied; results of a user survey are presented by Ruiz and Aronson [22].

Neveol et al. present an automatic indexer that is also intended to be used as preliminary indexer or final indexer after human revision [17]. It is embedded in the CISMeF project⁸, a quality-controlled subject gateway, catalog and index of French-language Health Internet resources.

With CADIS, the Computer Aided Document Indexing System, Kolar et al. present another approach that is strictly focussed on the support of human indexers [12].

Another motivation for the development and improvement of thesaurus-based indexers arises from the multilingual approach of some thesauri. If documents in different languages are mapped to the same concepts of a multilingual thesaurus, cross-lingual document retrieval becomes possible. The promises and problems of this approach were investigated by Ferber [8]. The results of an automatic indexer that is trained on manually indexed documents in different languages are presented by Pouliquen et al. [18]. An interactive approach for multilingual document indexing is described by Schiel and de Sousa [23].

Finally, Ruch presents a different approach of assigning concepts to documents using a combination of a pattern matcher and a vector-space retrieval engine [21].

To date, no single approach has turned out to be the best. As long as the results of automatic indexers are not as good as required, the question is if and how the existing problems can be at least circumvented, if not solved. The focus of the above-mentioned approaches generally lies on a special kind of application, mostly even on a special thesaurus and/or a special document base (such as the annotation of Medline abstracts with MeSH headings).

A common problem faced by everybody who wants to improve an indexing system is the choice of a quality measure that can be used to quantify the improvements. Generally, the quality of automatic indexers is evaluated by precision and recall with manual annotations as gold standard, as shown by Neveol et al. [17] or Aronson et al. [1]. Presumably, a refinement of the precision and recall measures is more suitable for this evaluation, as there are gradations between correct and wrong annotations. Such generalizations of precision and recall to graded measurements have been proposed by Kekäläinen and Järvelin [11], Maynard [14] and Euzenat [6].
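As a minimal sketch of this evaluation, the following code computes precision and recall for one document against a manual gold standard. The graded variant is only one possible relaxation with a user-supplied similarity function; it is an illustrative assumption and does not reproduce the specific definitions in [11], [14] or [6]. The `PARENT` map and the partial credit of 0.5 are hypothetical.

```python
def precision_recall(automatic, manual):
    """Classic precision and recall of automatic vs. manual annotations."""
    automatic, manual = set(automatic), set(manual)
    hits = len(automatic & manual)
    precision = hits / len(automatic) if automatic else 0.0
    recall = hits / len(manual) if manual else 0.0
    return precision, recall

def graded_precision_recall(automatic, manual, similarity):
    """Relaxed variant: each concept earns its best similarity to the other set."""
    automatic, manual = set(automatic), set(manual)

    def best(concept, others):
        return max((similarity(concept, o) for o in others), default=0.0)

    precision = (sum(best(a, manual) for a in automatic) / len(automatic)
                 if automatic else 0.0)
    recall = (sum(best(m, automatic) for m in manual) / len(manual)
              if manual else 0.0)
    return precision, recall

# Hypothetical hierarchy: partial credit for a direct parent/child match.
PARENT = {"Zebras": "Equidae"}

def similarity(a, b):
    if a == b:
        return 1.0
    if PARENT.get(a) == b or PARENT.get(b) == a:
        return 0.5
    return 0.0

p, r = precision_recall({"Equidae", "Neoplasms", "Liver"},
                        {"Neoplasms", "Liver", "Humans", "Mice"})
gp, gr = graded_precision_recall({"Zebras"}, {"Equidae"}, similarity)
print(p, r, gp, gr)
```

With an exact-match similarity the graded variant reduces to the classic measures, so both can be reported side by side.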

While such quality measures may be used to get an overall assessment of the indexing quality, they cannot be used to identify problematic areas that might be responsible for a lack of quality.

A platform for automatic indexing with integrated thesaurus maintenance (IKEM platform) is presented by Vervenne [27]. The IKEM platform is a very promising and holistic approach to focus on the whole indexing process and not only on a subset. The maintenance is for example supported by providing the human expert with terms from the indexed documents that might be meaningful, but could not be assigned to existing thesaurus concepts.

Advanced methods that follow the same approach of identification of meaningful terms based on Latent Semantic Indexing and Multidimensional Scaling are presented by Weiner [28].

While the complete replacement of manual indexers in libraries by automatic annotation systems is not likely and not wanted, the combination of automatic and manual indexing for the improvement of the document retrieval is very promising. The impact of additional automatic annotations to the results of a document retrieval system is investigated by Shin et al. [25].

4 Approach and Expected Contribution

**Figure 1:** Schematic view of the process assessment

Figure 1 presents the architecture of a typical thesaurus-based semantic search system. Such a system consists of two parts. The first one implements the actual search and retrieval functionality that takes a user query as input, compares it to document annotations and computes a ranked set of results. The second part, which is the focus of our work, is concerned with the creation of document annotations. This process relies on three different components:

the document set to be annotated
the thesaurus that provides index terms
the indexing method used and its ability to deal with ambiguities in the text.

With this thesis, the following approach is taken: The indexing process is extended with a separate assessment step that is independent of the indexing system used. This assessment step is again performed by a human indexer, but the goal is to provide different analyses and sophisticated visualizations of the result so that concrete weaknesses of either the indexer or the thesaurus can be identified and remedied. The idea is to enable a human indexer to keep track of the indexing process without checking every document and without relying only on control samples.

In Section 7, we give an in-depth example of such an analysis that has already been developed and evaluated as a proof of concept. In this case, the idea is to use a treemap to visualize the results of a new analysis approach, called IC Difference Analysis. This analysis uses the difference between the information content of a given concept and an expected information content based either on a heuristic or a manually created gold standard.

The methodology is focused primarily on the thesaurus, not the document base. This way, the approach is scalable with increasing amounts of documents. The thesaurus is the constant factor in the process, represents the subject domain of the documents and is well known by the human expert. Identifying necessary changes to the thesaurus is essential and thus it is worthwhile to focus the efforts of the human expert on it.

The contribution of the thesis consists in the development of the above sketched methodology. A framework is implemented that supports the development of further analysis methods based on this methodology. An important aspect of the research is the use of different analysis and visualization techniques. It is expected that different human experts prefer different views on the analysis results and that they differ in their valuation of the different analysis methods.

This approach is orthogonal to the development and further improvement of automatic indexing systems and will help other researchers to identify concrete problems in an arbitrary setup and environment.

By the combination of various analysis methods with advanced visualization techniques, the assessment of indexing results is clearly improved. Precision and Recall measures are still needed to roughly quantify any achieved improvements, but they lack the possibilities of our approach to formulate statements about concrete weaknesses in a setup.

While there exist solutions like the IKEM platform [27] as well as various commercial systems for automatic indexing and thesaurus maintenance, the methodology developed in this thesis is another new building block to the ideal automatic indexing system.

5 Research Methodology

In this thesis, the following steps are to be taken:

5.0.0.1 Theoretic Considerations

After a sound problem definition and the formulation of the research questions, the various problems that may occur when using an automatic indexer need to be identified. The planned framework that implements the methodology has to be sketched and basic assumptions about the requirements have to be made.

5.0.0.2 Analysis of existing solutions

Existing solutions will be analyzed and evaluated to ensure that the methodology supports and extends existing solutions and contributes new achievements.

5.0.0.3 Definition of Quality Measures

To assess the developed tools and the possible improvement of the whole system, quality measures need to be defined. Starting with simple precision and recall against manual annotations as gold standard, the prospects of generalized precision and recall have to be evaluated.

5.0.0.4 Proof of concept

As there is no experience yet as to whether the planned approach is feasible, a proof of concept study is conducted prior to the complete elaboration of the implementation concept. A prototype is implemented with a first implementation of an analysis approach. Based on the practical results, the problem definition and theoretic considerations may need an adjustment. The goal is to prepare the ground for the next research steps.

5.0.0.5 Definition of interfaces

As the framework needs to be independent of the employed indexer and thesaurus, interfaces have to be developed for both of them. This includes the identification of requirements and criteria for a common thesaurus format and the minimum feature set of an automatic indexer. The SKOS format seems to emerge as a standard format for thesaurus representation, but it has to be investigated whether it fulfills the to-be-defined criteria. The framework also needs an interface for the document base. Last but not least, the results of the indexing process should be represented in a machine-readable form to allow for the integration with other systems and further examinations by other researchers.
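To make the intended separation concrete, the following sketch shows what such interfaces could look like. Since the interfaces are still to be defined, every class, method and field name here is a hypothetical placeholder, and the in-memory implementations exist only to exercise the sketch; JSON is used merely as one example of a machine-readable result format.

```python
import json
from typing import Dict, Iterable, Protocol, Set, Tuple

class Thesaurus(Protocol):
    def concepts(self) -> Iterable[str]: ...
    def children(self, concept: str) -> Iterable[str]: ...
    def labels(self, concept: str) -> Iterable[str]: ...

class DocumentBase(Protocol):
    def documents(self) -> Iterable[Tuple[str, str]]: ...  # (id, text)

class Indexer(Protocol):
    def annotate(self, text: str, thesaurus: Thesaurus) -> Set[str]: ...

def run_indexing(docs: DocumentBase, indexer: Indexer,
                 thesaurus: Thesaurus) -> Dict[str, Set[str]]:
    """Annotate every document; the framework only sees the interfaces."""
    return {doc_id: indexer.annotate(text, thesaurus)
            for doc_id, text in docs.documents()}

def to_json(annotations: Dict[str, Set[str]]) -> str:
    """Serialize the indexing result for exchange with other systems."""
    return json.dumps({d: sorted(c) for d, c in annotations.items()},
                      sort_keys=True)

# Toy in-memory implementations (illustrative only).
class DictThesaurus:
    def __init__(self, data):  # concept -> (labels, children)
        self._data = data
    def concepts(self):
        return self._data.keys()
    def children(self, concept):
        return self._data[concept][1]
    def labels(self, concept):
        return self._data[concept][0]

class ListDocumentBase:
    def __init__(self, docs):
        self._docs = docs
    def documents(self):
        return self._docs

class LabelMatchIndexer:
    def annotate(self, text, thesaurus):
        words = set(text.lower().split())
        return {c for c in thesaurus.concepts()
                if any(l.lower() in words for l in thesaurus.labels(c))}

thesaurus = DictThesaurus({
    "Mammals": (["mammals"], ["Equidae"]),
    "Equidae": (["zebras", "donkeys"], []),
})
docs = ListDocumentBase([("d1", "zebras are mammals"), ("d2", "liver enzymes")])
result = run_indexing(docs, LabelMatchIndexer(), thesaurus)
print(to_json(result))
```

Because `run_indexing` depends only on the three protocols, any indexer, thesaurus format or document store that offers these methods could be plugged in without changing the assessment code.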

5.0.0.6 Framework Design

Based on the preliminary work, the assessment framework has to be designed. The primary goal is the best possible support of the development of new analysis methods and the possibility for fast prototyping. Considerations have to be made about performance and usability issues, as most experiments need to involve a human tester.

5.0.0.7 Method Development

First experiments led to the conclusion that the combination of advanced analysis with adequate visualization techniques is a promising general approach. It is not expected that there is one magic bullet that depicts all weaknesses of a system. Instead, different analyses and visualizations will be the building blocks of a toolbox for the human expert. Ideally, an analysis approach is found for every possible problem identified before.

5.0.0.8 Evaluation

Finally, the application of the approach to a real-life system needs to be evaluated. This evaluation will involve a survey of human indexing experts and an analysis of retrieval results within the prototype system. Interviews are used to determine the special needs of different experts and to improve the employed analysis methods and visualizations. They are also needed to evaluate the different methods regarding their overall usefulness.

6 Status

The problem definition was carried out in the last months, as well as the development of first analysis methods and experiments on the impact of similarity measures, as mentioned above. The impact of different similarity measures on the results of precision and recall is presented in [15].

The concept was proved with a prototype implementation (see next section) containing the IC Difference Analysis combined with a treemap visualization as a first set of building blocks. The approach was presented at K-CAP '07 and received the best paper award [4]. The prototype will be demonstrated for the first time at JCDL '08 [5].

Currently, the interfaces for the framework are being defined. In parallel, new analysis approaches are developed and prepared, and the use of self-organizing maps for the visualization of analysis results is under investigation.

7 Proof of concept study

One analysis method that is currently under development is the IC Difference Analysis, a thesaurus evaluation that is based on a combination of statistical measures and appropriate visualization techniques. It supports the detection of potential problems in a thesaurus that affect the automatic indexing and is presented in [4].

7.0.0.1 Statistical measure

The IC Difference Analysis works on the thesaurus and enables the human expert to detect unbalanced hierarchy structures as well as terms that are more often or less often used in indexing than expected.

The decision whether a term is used more often for indexing than expected depends on our expectation about the frequency with which it should occur. This in turn depends on the level of the thesaurus hierarchy at which it is situated. Generally, more common concepts should have a higher frequency than special concepts. To take this into account, the difference of Information Contents as a distance measure (referred to as IC Difference Analysis) is proposed:

$$D_{IC_a}(c) = IC_a(c) - IIC(c) \qquad (1)$$

where $IC_a(c)=-\log P_a(c)$ is the information content of a concept $c$ with respect to the automatically annotated document set, as proposed by [19]. $P_a(c)$ is determined by the frequency of documents that are annotated with concept $c$.

The idea of an Intrinsic Information Content (IIC), i.e. an information content that is determined only by means of the thesaurus structure itself, was introduced by [24].

The IIC of a concept $c$, denoted as $IIC(c)$, is defined as

$$IIC(c) = -\log\left(\frac{hypo(c)+1}{max}\right) \qquad (2)$$

with $hypo(c)$ as the number of hyponyms (i.e. child nodes) of a given concept $c$ and $max$ as the number of concepts in the whole thesaurus.

The Intrinsic Information Content is used as reference and compared to the Information Content based on the concept frequency in the document base.
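The following sketch computes the information content from annotation frequencies, the intrinsic information content from the thesaurus structure, and their difference as in Equation 1. The toy thesaurus and the annotated documents are invented for illustration. Two details are assumptions of this sketch rather than statements of the paper: the hyponym count includes all concepts below a concept (so that the root receives an IIC of 0), and document frequencies are counted directly, without propagating annotations to parent concepts.

```python
import math

# Hypothetical toy thesaurus: concept -> direct children.
CHILDREN = {
    "Animals": ["Mammals", "Birds"],
    "Mammals": ["Equidae", "Rodents"],
    "Birds": [],
    "Equidae": [],
    "Rodents": [],
}

def hypo(concept, children=CHILDREN):
    """Number of concepts below `concept` (assumption: all descendants)."""
    return sum(1 + hypo(child, children) for child in children[concept])

def iic(concept, children=CHILDREN):
    """Intrinsic information content from the thesaurus structure alone."""
    return -math.log((hypo(concept, children) + 1) / len(children))

def ic(concept, annotated_docs):
    """Information content from the share of documents carrying the concept."""
    n = sum(1 for concepts in annotated_docs if concept in concepts)
    return -math.log(n / len(annotated_docs)) if n else math.inf

def ic_difference(concept, annotated_docs, children=CHILDREN):
    """Negative values: the concept is used more often than expected."""
    return ic(concept, annotated_docs) - iic(concept, children)

# Hypothetical automatic annotations of four documents.
DOCS = [{"Equidae"}, {"Equidae", "Rodents"}, {"Equidae"}, {"Birds"}]

for concept in ("Equidae", "Rodents"):
    print(concept, round(ic_difference(concept, DOCS), 3))
```

In this toy setting Equidae is assigned to three of four documents although it is a leaf concept, so its difference is clearly more negative than that of Rodents. This is the kind of outlier the analysis is meant to expose.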

The IC Difference Analysis is not limited to the automatic annotations used in Equation 1; it can be applied to the manually selected keywords as well. In this case, the Information Content is based on the frequency of documents that are manually annotated with the given keyword:

$$D_{IC_m}(k) = IC_m(k) - IIC(k) \qquad (3)$$

An interesting variant is the replacement of the IIC by the manually selected keywords as reference:

$$D_{IC_{am}}(c) = IC_a(c) - IC_m(c) \qquad (4)$$

This measure shows deviations between manually and automatically assigned keywords and therefore directly points to potential problems in the automatic indexing process.
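As a sketch of how this variant could be used in practice, the code below compares the information content of each concept under automatic and under manual annotation of the same documents and ranks the concepts by the size of the deviation. The annotation data are invented, and restricting the ranking to concepts that occur in both annotation sets is a simplification of this sketch (otherwise one of the information contents would be infinite).

```python
import math

def information_content(concept, annotated_docs):
    """-log of the share of documents annotated with the concept."""
    n = sum(1 for concepts in annotated_docs if concept in concepts)
    return -math.log(n / len(annotated_docs)) if n else math.inf

def rank_deviations(automatic, manual):
    """Concepts sorted by |IC_a(c) - IC_m(c)|, largest deviation first."""
    shared = set().union(*automatic) & set().union(*manual)
    scores = {c: information_content(c, automatic)
                 - information_content(c, manual)
              for c in shared}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)

# Hypothetical annotations of the same four documents.
AUTOMATIC = [{"Equidae", "Liver"}, {"Equidae"}, {"Equidae", "Liver"}, {"Liver"}]
MANUAL = [{"Liver"}, {"Equidae"}, {"Liver"}, {"Liver"}]

ranking = rank_deviations(AUTOMATIC, MANUAL)
for concept, deviation in ranking:
    print(concept, round(deviation, 3))
```

A strongly negative value, as for Equidae here, means that the automatic indexer assigns the concept far more often than the human indexers did.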

In principle, the measures in Equations 1, 3 and 4 are used for the same purpose: to detect deviations between the actual and the expected information content of a concept. Depending on the information available (thesaurus structure, manual annotations, automatic annotations) and the target of the evaluation (manual or automatic annotations), the different measures can be used for this purpose.

7.0.0.2 Interactive Visualization

A major challenge in supporting thesaurus maintenance is to provide adequate tool support that guides the user to potential problems in a thesaurus based on the measures described above. In particular we have to find a way to provide the user with a view on the thesaurus that encodes the overall structure of the thesaurus or selected parts of it and the evaluation results for the different concepts in the thesaurus.

Treemaps, presented by Shneiderman [26], are a representation designed for human visualization of complex traditional tree structures: arbitrary trees are shown with a 2-d space-filling representation. With a treemap, two additional aspects can be displayed besides the thesaurus structure. One is represented by the size of the partitions, the other by their color⁹.

The hierarchy is visualized through the nesting of areas. The color of the different areas is used to represent the result of the different measures introduced above.
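The nesting of areas can be sketched with the simple slice-and-dice layout, in which the split direction alternates with each level of the hierarchy. This is an illustrative sketch, not the layout used in the prototype. The use of the leaf count as tile size, the color scale (blue for a negative IC difference, red for a positive one) and the saturation limit are assumptions of this example.

```python
def size(node):
    """Tile weight: number of leaves below the node (an assumption)."""
    children = node.get("children", [])
    return sum(size(child) for child in children) if children else 1

def layout(node, x, y, w, h, depth=0, out=None):
    """Slice-and-dice: split the rectangle among children, alternating axis."""
    out = [] if out is None else out
    out.append((node["name"], x, y, w, h))
    children = node.get("children", [])
    total = sum(size(child) for child in children)
    offset = 0.0
    for child in children:
        share = size(child) / total
        if depth % 2 == 0:  # even depth: split along the x axis
            layout(child, x + offset * w, y, share * w, h, depth + 1, out)
        else:               # odd depth: split along the y axis
            layout(child, x, y + offset * h, w, share * h, depth + 1, out)
        offset += share
    return out

def color(ic_difference, limit=2.0):
    """Map an IC difference to RGB: white at 0, blue below, red above."""
    t = max(-1.0, min(1.0, ic_difference / limit))
    fade = int(255 * (1 - abs(t)))
    return (fade, fade, 255) if t < 0 else (255, fade, fade)

# Hypothetical thesaurus fragment.
TREE = {"name": "Animals", "children": [
    {"name": "Mammals", "children": [{"name": "Equidae"}, {"name": "Rodents"}]},
    {"name": "Birds"},
]}

tiles = {name: rect for name, *rect in layout(TREE, 0, 0, 300, 100)}
for name, rect in tiles.items():
    print(name, [round(v, 1) for v in rect])
```

Each child rectangle lies inside the rectangle of its parent, so the hierarchy remains visible while the color channel stays free for the analysis result.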

**Figure 2:** MeSH IC Difference Analysis on automatically selected concepts

Figure 2 shows the result of this analysis for the example document base. The tile representing the concept Chordata stands out in deep blue. Figure 3 zooms into the Mammals concept for further analysis. There are several blue areas and two dark blue concepts. One of them is Equidae, also known as horse-like animals. The result shows an information content that is far too low for this concept, which means that the concept occurs with a very high frequency in the document base. As the document base is not dedicated to horse diseases, this indicates a problem.

**Figure 3:** MeSH Analysis of the node representing Mammals

In the MeSH Thesaurus the following terms are defined for the concept Equidae: Asses, Donkeys, Mules, Zebras, Ass, Equus asinus (horses themselves are contained in a subconcept of Equidae). The problem lies in the term Ass, as the language normalizer used by the indexer interpreted every occurrence of the word "as" as the singular form of Ass.
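How such a collision can arise is illustrated by the following sketch. The normalizer shown here is a deliberately naive, hypothetical one that strips a trailing "s"; it is not the normalizer used by the indexer in this study, whose internals are not described here. The stop word filter is likewise only one conceivable remedy.

```python
def naive_singular(word):
    """Hypothetical normalizer: lower-case and strip one trailing 's'."""
    word = word.lower()
    return word[:-1] if len(word) > 2 and word.endswith("s") else word

# Terms of the concept Equidae as listed in the text.
EQUIDAE_TERMS = ["Asses", "Donkeys", "Mules", "Zebras", "Ass", "Equus asinus"]
normalized_terms = {naive_singular(term) for term in EQUIDAE_TERMS}

def false_friends(text, terms, stop_words=frozenset()):
    """Words of the text that match a normalized term of the concept."""
    return [w for w in text.split()
            if w.lower() not in stop_words and naive_singular(w) in terms]

sentence = "Liver enzymes were measured as described"
hits = false_friends(sentence, normalized_terms)
filtered = false_friends(sentence, normalized_terms, stop_words={"as"})
print(hits, filtered)
```

In this sketch the term Ass is reduced to "as" and therefore matches one of the most frequent English function words, which would inflate the frequency of Equidae across the whole document base.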

7.0.0.3 Evaluation

**Figure 4:** Interactive tool

Users need some time to become familiar with the treemap visualization. A major drawback of treemaps is that the user may lose orientation in the hierarchy, as the visualization cannot provide information about the environment of the currently selected top concept. To deal with this problem, the prototype (Figure 4) combines a common hierarchical tree view with the treemap visualization. This allows interactive navigation through the thesaurus hierarchy without losing orientation.

After some familiarization, the test users were able to detect various concepts that need special attention in the automatic indexing process. Besides the concepts that are mistakenly assigned (like "the lead" or "the will" for "to lead" and "to be", which require part-of-speech recognition), they also came up with concepts that are assigned as check-tags by human indexers. These are commonly used concepts to describe a text, like geographic locations or the kind of persons for which a document is applicable (male, female, children, adults, ...). If it is required to assign such concepts automatically, the indexer probably needs additional knowledge about these concepts, as in many cases they are not directly mentioned in the documents.

The result of the analysis depends strongly on the employed baseline (gold standard or heuristic) and, especially in the case of the heuristic, on a balanced structure of the thesaurus. While various problematic concepts can be identified this way, it is not possible to determine whether all problematic concepts have been found. The more the information content of a concept differs from the expected value, the better its visibility. So the IC Difference Analysis is especially useful to quickly determine the most obvious outliers.

Even more important for most experts seems to be the possibility to view "their" thesaurus from a new perspective. For example, when they use manually assigned concepts as input for the information content and use the intrinsic information content as baseline, they see how concepts in the thesaurus are actually used.

8 Conclusion

This paper is a proposal for the development of "A methodology for supervised automatic document annotation". The problem definition and the review of the state of the art presented herein are the first steps taken in this direction. These steps are presented in order to obtain comments on the intended approach and helpful suggestions regarding the further work.

The next step will be the implementation of the sketched framework that enables the development of further analysis methods and experiments.

Bibliography

[1]	A. R. Aronson, J. G. Mork, C. W. Gay, S. M. Humphrey, and W. J. Rogers, "The NLM Indexing Initiative's Medical Text Indexer," in Medinfo 2004, M. Fieschi, E. Coiera, and Y. J. Li, Eds. IOS Press, 2004, pp. 268-272.

[2]	C. L. Borgman, "Why are Online Catalogs Still Hard to Use?" Journal of the American Society for Information Science, vol. 47, no. 7, pp. 493-503, Dec. 1996.

[3]	D. G. Campbell and K. V. Fast, "Panizzi, Lubetzky, and Google: How the Modern Web Environment is Reinventing the Theory of Cataloguing," Canadian Journal of Information & Library Sciences, vol. 28, no. 3, pp. 25-38, 2004.

[4]	K. Eckert, H. Stuckenschmidt, and M. Pfeffer, "Interactive thesaurus assessment for automatic document annotation," in Proceedings of The 4th International Conference on Knowledge Capture (K-CAP 2007), New York, NY: ACM, 2007, pp. 103-110.

[5]	K. Eckert, H. Stuckenschmidt, and M. Pfeffer, "Semtinel: Interactive Supervision of Automatic Indexing," in Proceedings of the 8th ACM/IEEE-CS Joint Conference on Digital Libraries, New York, NY: ACM, 2008, p. 425.

[6]	J. Euzenat, "Semantic Precision and Recall for Ontology Alignment Evaluation," in Proceedings of the International Joint Conferences on Artificial Intelligence, 2007, pp. 348-353.

[7]	B. Eversberg, "Grundsätze und Ziele des Katalogisierens," in Deutscher Bibliothekartag in Augsburg: Die Bibliothek zwischen Autor und Leser [Principles and Goals of Cataloguing, German Librarians' Annual Conference], vol. 92, H. Benkert, B. Rosenberger, and W. Dittrich, Eds. Klostermann, 2003. English translation available: http://www.allegro-c.de/formate/tlcse.htm.

[8]	R. Ferber, "Automated Indexing with Thesaurus Descriptors: A Co-occurrence Based Approach to Multilingual Retrieval," Proceedings of the First European Conference on Research and Advanced Technology for Digital Libraries. Lecture Notes In Computer Science, vol. 1324, pp. 233-252, 1997.

[9]	T. Gross and A. G. Taylor, "What Have We Got to Lose? The Effect of Controlled Vocabulary on Keyword Searching Results," College and Research Libraries, vol. 66, no. 3, pp. 212-230, May 2005.

[10]	C. R. Hildreth, "Online Catalog Design Models: Are We Moving in the Right Direction?" Palmer School of Library and Information Science, Long Island University, Technical Report, Aug. 1995. Available: http://myweb.cwpost.liu.edu/childret/clr-opac.html

[11]	J. Kekäläinen and K. Järvelin, "Using Graded Relevance Assessments in IR Evaluation," Journal of the American Society for Information Science and Technology, vol. 53, no. 13, pp. 1120-1129, 2002.

[12]	M. Kolar, I. Vukmirovic, B. D. Basic, and J. Snajder, "Computer-aided Document Indexing System," in 27th International Conference on Information Technology Interfaces, 2005.

[13]	S. V. Lombardo and K. S. Condic, "Empowering Users with a New Online Catalog," Library Hi Tech, vol. 18, pp. 130-141, 2000.

[14]	D. Maynard, "Benchmarking Ontology-based Annotation Tools for the Semantic Web," in UK e-Science Programme All Hands Meeting (AHM2005) Workshop: "Text Mining, e-Research and Grid-enabled Language Technology," 2005.

[15]	D. Maynard, S. Dasiopoulou, S. Costache, K. Eckert, H. Stuckenschmidt, M. Dzbor, and S. Handschuh, "D1.2.2.1.3 Benchmarking of Annotation Tools," Knowledge Web Project, University of Sheffield, Technical Report, 2007. Available: http://knowledgeweb.semanticweb.org/semanticportal/deliverables/D1.2.2.1.3.pdf

[16]	A. Montejo-Raez, "Toward Conceptual Indexing Using Automatic Assignment of Descriptors," in Proceedings of the AM 2002 Workshop on Personalization Techniques in Electronic Publishing, Malaga, Spain, May 2002.

[17]	A. Neveol, A. Rogozan, and S. Darmoni, "Automatic Indexing of Online Health Resources for a French Quality Controlled Gateway," Information Processing and Management: an International Journal, vol. 42, no. 3, pp. 695-709, 2006.

[18]	B. Pouliquen, R. Steinberger, and C. Ignat, "Automatic Annotation of Multilingual Text Collections with a Conceptual Thesaurus," in Ontologies and Information Extraction. Workshop at EUROLAN 2003: The Semantic Web and Language Technology - Its Potential and Practicalities. Bucharest, July 28-August 8, 2003.

[19]	P. Resnik, "Using Information Content to Evaluate Semantic Similarity in a Taxonomy," in Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI-95), 1995.

[20]	C. D. Rosa, College Students' Perceptions of Libraries And Information Resources, OCLC, Dublin, Ohio USA, 2006.

[21]	P. Ruch, "Automatic assignment of biomedical categories: toward a generic approach," Bioinformatics, vol. 22, no. 6, pp. 658-664, 2006.

[22]	M. E. Ruiz and A. Aronson, User-centered Evaluation of the Medical Text Indexing (MTI) System, Technical Report, National Library of Medicine, 2007.

[23]	U. Schiel and I. M. S. F. de Sousa, "Interactive Indexing of Documents with a Multilingual Thesaurus," in Effective Databases for Text & Document Management, Hershey, PA, USA: IGI Publishing, 2003, pp. 24-35.

[24]	N. Seco, T. Veale, and J. Hayes, "An Intrinsic Information Content Metric for Semantic Similarity in Wordnet," in Proceedings of the 16th European Conference on Artificial Intelligence, Valencia, Spain, 2004, pp. 1089-1090.

[25]	K. Shin, S.-Y. Han, and A. F. Gelbukh, "Balancing manual and automatic indexing for retrieval of paper abstracts," Lecture Notes in Computer Science. Text, Speech and Dialogue (TSD), vol. 3206, pp. 203-210, 2004.

[26]	B. Shneiderman, "Tree Visualization with Tree-maps: 2-d Space-filling Approach," ACM Transactions on Graphics (TOG), vol. 11, no. 1, pp. 92-99, 1992.

[27]	D. Vervenne, "Advanced Document Management Through Thesaurus-based Indexing: the IKEM Platform," CWI Quarterly, vol. 12, no. 2, pp. 159-172, 1999.

[28]	J. M. Weiner, "Differences in Indexing Term Vocabularies and Agreement with Subject Specialists," Electronic Journal of Academic and Special Librarianship, vol. 6, no. 1-2, 2005.

[29]	H. Yu and M. Young, "The Impact of Web Search Engines on Subject Searching in OPAC," Information Technology & Libraries, vol. 23, no. 4, pp. 168-180, 2004.

Footnotes

¹ http://books.google.com/

² http://scholar.google.com/

³ http://www.transinsight.com/

⁴ http://www.gopubmed.com/

⁵ http://www.collexis.com/

⁶ http://cdsware.cern.ch/

⁷ http://ii.nlm.nih.gov/

⁸ http://www.cismef.org/

⁹ Since the visualization relies on the use of colors as an integral part, the figures in the remainder of this paper should be viewed in full color as otherwise important information is lost. Full color versions of all figures are provided at http://www.kaiec.org