IEEE TCDL Bulletin

Volume 3   Issue 2
Summer 2007


Unsupervised Structure Discovery for Biodiversity Information

[DEMO at JCDL 2006]

Hong Cui
University of Western Ontario
North Campus Building Room 240
London, Ontario, Canada N6A 5B7

Richard M. McCourt
Academy of Natural Sciences
19th and the Parkway
Philadelphia, PA 19103

Monique Feist
Institut des Sciences de l'Evolution
Université Montpellier II
Place E. Bataillon
F-34095 Montpellier, France


The project presented here concerns the design of XML Schemas for biodiversity information in legacy formats. To mobilize these information resources, two approaches have been taken: 1) fitting the biodiversity information into a predefined structure, typically that of a relational database, using techniques such as information extraction [1]; and 2) making explicit the inherent yet implicit semantic structure of the documents, using techniques such as XML tagging [2, 3]. Information extraction typically assumes a predefined template, while XML tagging assumes a DTD or XML Schema. Little has been discussed in the literature regarding the validity of these structural constructs. We argue that 1) for morphological descriptions, the complete representation of the original information in the new structures is crucial for the task of specimen identification; and 2) manually enumerating all characters described in any reasonably large collection of taxonomic descriptions is not feasible. Although the templates or schemas reported in the literature capture the most prominent characters, they may or may not represent the content of a particular collection well. We propose an unsupervised learning method to facilitate the discovery of the semantic structure of a collection of morphological descriptions. The application is intended to help taxonomists design a complete schema by visualizing the concepts described in a collection, disambiguating the extracted concepts, and tracing back to any mention of a concept in the description text.
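One way to picture the trace-back capability mentioned above is an index from candidate concepts to the sentences that mention them. The Python sketch below is illustrative only and is not the project's learning method: the sample descriptions, the function name `build_concept_index`, and the heuristic that the first word of each sentence names the organ being described are all assumptions made for this example.

```python
import re
from collections import defaultdict

# Hypothetical sample descriptions; not taken from the project's collection.
DESCRIPTIONS = {
    "taxon_a": "Leaves alternate, ovate. Petals 5, white. Stamens numerous.",
    "taxon_b": "Leaves opposite, linear. Petals 4, yellow.",
}


def build_concept_index(descriptions):
    """Map each candidate concept to the (doc_id, sentence) pairs mentioning it.

    Assumption: the first word of a sentence names the organ described.
    This is a stand-in heuristic, not an unsupervised learning algorithm.
    """
    index = defaultdict(list)
    for doc_id, text in descriptions.items():
        # Split after each full stop that is followed by whitespace.
        for sentence in re.split(r"(?<=\.)\s+", text.strip()):
            match = re.match(r"[A-Za-z]+", sentence)
            if match:
                concept = match.group(0).lower()
                index[concept].append((doc_id, sentence))
    return dict(index)


index = build_concept_index(DESCRIPTIONS)
for concept in sorted(index):
    # Each concept keeps pointers back to every sentence that mentions it.
    print(concept, len(index[concept]))
```

Listing the keys of such an index gives a rough inventory of candidate schema elements, and each entry points back to the source sentences, which is the kind of evidence a taxonomist would need when deciding whether two extracted terms denote the same concept.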

Figure 1. Poster.

Instructions for Demo.


[1] Abascal, R. and Sánchez. (1999). X-tract: Structure Extraction from Botanical Textual Descriptions. In Proceedings of SPIRE/CRIWG 2-7.

[2] Cui, H., & Heidorn, P.B. (2006). The reusability of induced knowledge for the automatic semantic markup of taxonomic descriptions. Journal of the American Society for Information Science and Technology. (Accepted.)

[3] Cui, H. (2005). Floras in XML: Greenstone Digital Collection of Flora of China, Flora of North America, and Flora of North Central Texas.


© Copyright 2007 Hong Cui, Richard M. McCourt, and Monique Feist
Some or all of these materials were previously published in the Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries, ACM 1-59593-354-9.
