Unsupervised Structure Discovery for Biodiversity Information
[DEMO at JCDL 2006]
Richard M. McCourt
The project presented here concerns the design of XML schemas for biodiversity information held in legacy formats. Two approaches have been taken to mobilize these information resources: 1) fitting the biodiversity information into a predefined structure, typically that of a relational database, using techniques such as information extraction; and 2) making explicit the inherent yet implicit semantic structure of the documents, using techniques such as XML tagging [2, 3]. Information extraction typically assumes a predefined template, while XML tagging assumes a DTD or XML schema. The validity of these structural constructs has received little discussion in the literature. We argue that 1) for morphological descriptions, the complete representation of the original information in the new structure is crucial for the specimen identification task, and 2) manually enumerating all characters described in any reasonably large collection of taxonomic descriptions is not feasible. Although the templates and schemas reported in the literature capture the most prominent characters, they may or may not represent the content of a particular collection well. We propose an unsupervised learning method to facilitate the discovery of the semantic structure of a collection of morphological descriptions. The application is intended to help taxonomists design a complete schema by visualizing the concepts described in a collection, disambiguating the extracted concepts, and tracing each concept back to its mentions in the description text.
 Abascal, R. and Sánchez. (1999). X-tract: Structure Extraction from Botanical Textual Descriptions. In Proceedings of SPIRE/CRIWG, 2-7.
 Cui, H., & Heidorn, P.B. (2006). The reusability of induced knowledge for the automatic semantic markup of taxonomic descriptions. Journal of the American Society for Information Science and Technology. (Accepted, preprint at <http://hong.fims.uwo.ca/Research/jasist06.pdf>.)
© Copyright 2007 Hong Cui, Richard M. McCourt, and Monica Feist