Volume 5 Issue 1
Spring 2009
ISSN 1937-7266

An Application for Semantic Markup of Biodiversity Documents

Hong Cui

School of Information Resources and Library Science
The University of Arizona
520-621-3565
hongcui@email.arizona.edu

MARTT II is a machine-learning-based semantic markup system that can be used to convert free-text biodiversity documents to XML for digital libraries. It is built on the MARTT engine described in [1], with new components such as a parallel markup engine that uses the unsupervised learning algorithm described in [2]. The system thus supports two automated markup engines: one requires training examples and the other does not. It also supports functions such as training data preparation and markup correction. Currently, MARTT II can mark up a taxonomic description to the clause level (e.g., tagging the clause "Tepals 2-5, free or connate." as <tepals>) without any human intervention. Within a year or two, we hope the program will be able to mark up descriptions at the character level (e.g., recognizing that the tepal count is 2-5 and that the tepals are free or more or less connate).
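To make the clause-level target format concrete, the following Python sketch wraps each clause of a description in an XML element. It is only an illustration of the output shape, not the MARTT II engines: neither the supervised nor the unsupervised learning algorithm is reproduced here. The function names, the rule of naming the tag after the first word of the clause, the period/semicolon clause splitter, and the enclosing <description> element are all assumptions made for this example.

```python
import re
from xml.sax.saxutils import escape


def mark_up_clause(clause: str) -> str:
    """Wrap one clause in an element named after its first word.

    Hypothetical heuristic for illustration only; MARTT II learns
    its tags rather than reading them off the first word.
    """
    match = re.match(r"\s*([A-Za-z]+)", clause)
    tag = match.group(1).lower() if match else "unknown"
    # Escape &, <, > so the clause text is safe inside XML.
    return f"<{tag}>{escape(clause.strip())}</{tag}>"


def mark_up_description(description: str) -> str:
    """Split a description into clauses and mark up each one."""
    # Naive splitting (an assumption): break after '.' or ';' followed by whitespace.
    clauses = [c for c in re.split(r"(?<=[.;])\s+", description.strip()) if c]
    body = "".join(mark_up_clause(c) for c in clauses)
    # <description> is an assumed wrapper element, not taken from the source.
    return f"<description>{body}</description>"


if __name__ == "__main__":
    print(mark_up_clause("Tepals 2-5, free or connate."))
```

Applied to the clause quoted above, mark_up_clause returns <tepals>Tepals 2-5, free or connate.</tepals>. A first-word rule like this fails as soon as a clause opens with a modifier or omits the organ name, which is the kind of case a learned markup engine is meant to handle.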

REFERENCES

[1] Cui, H., & Heidorn, P.B. (2007). The reusability of induced knowledge for the automatic semantic markup of taxonomic descriptions. Journal of the American Society for Information Science and Technology, 58(1), 133-149.

[2] Cui, H. (2008). Unsupervised learning for semantic markup of biodiversity literature. JCDL 2008.