Tang, "Automatic Extraction of Morphological Information from Botanical Collections", TCDL Bulletin 5.1 (2009)

Automatic Extraction of Morphological Information
from Botanical Collections

Xiaoya Tang

School of Library and Information Management
Emporia State University
1200 Commercial St., Campus Box 4025
Emporia, KS 66801
1-620-341-5071
xtang@emporia.edu


Large volumes of full-text collections exist in the biological sciences. Publications such as field guides, floras, and faunas are written to provide critical information for field identification of species. However, access to such collections has been difficult. Users tend to search with specific morphological information, and current systems based on statistical models usually handle such queries poorly [3]. Automatically extracting this information from documents, and then allowing users to search with it, should therefore improve access to such collections.

Flora of North America (FNA) is a full-text collection widely used in botany. Information such as plant distribution, genus, and species, together with morphological information such as leaf shape, margin, dimension, leaf arrangement, and fruit/nut shape, is critical for species identification. This study adapts and enhances WHISK, an information extraction (IE) system created by Soderland [2] to extract rental information from short newspaper ads, and uses it to extract such information from FNA. The system uses machine learning techniques to create extraction rules automatically from manually created examples, and then applies those rules to new documents. This study examines the system and makes adaptations and enhancements so that it not only works on the FNA collection but can also be adapted more easily to new text collections. Issues encountered in adapting the IE system are also examined, with the intent of improving its portability to other domains.

WHISK is enhanced and generalized through the following changes. First, a pre-processing step is added to remove unexpected trivial text formatting. Second, pattern recognition is generalized to accommodate the more complicated text characteristics of botanical descriptions: longer sentences, a broader vocabulary, and greater variation in structure among documents than in the target text of the original WHISK system. Third, the system is generalized so that it can use important text structures when creating extraction rules. Finally, a new knowledge base is created that contains the botanical terms added during the training process. Terms in the knowledge base are categorized into semantic classes.
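As a rough illustration of how a knowledge base of semantic classes can drive extraction patterns, the following Python sketch applies hand-written rules to a cleaned-up description. It is not the WHISK implementation: WHISK induces its rules from tagged examples, whereas the rules here are written by hand. The names KNOWLEDGE_BASE, preprocess, compile_rule, and extract, the term lists, and the sample sentence are all assumptions made for this example.

```python
import re

# Hypothetical knowledge base: semantic class -> member terms.
# The classes and terms are illustrative, not taken from the study.
KNOWLEDGE_BASE = {
    "LeafShape": ["ovate", "lanceolate", "elliptic", "oblong"],
    "LeafMargin": ["entire", "serrate", "dentate"],
}


def preprocess(text):
    """Collapse line breaks and repeated whitespace (assumed 'trivial formatting')."""
    return re.sub(r"\s+", " ", text).strip()


def compile_rule(anchor, semantic_class, slot):
    """Build a regex: anchor word, any text within the sentence, then a class member."""
    terms = sorted(KNOWLEDGE_BASE[semantic_class], key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in terms)
    return re.compile(
        rf"\b{re.escape(anchor)}\b[^.]*?\b(?P<{slot}>{alternation})\b",
        re.IGNORECASE,
    )


def extract(text, rules):
    """Apply each rule to the cleaned text and return the first match per slot."""
    cleaned = preprocess(text)
    results = {}
    for slot, pattern in rules.items():
        match = pattern.search(cleaned)
        if match:
            results[slot] = match.group(slot).lower()
    return results


# Hand-written rules standing in for rules a learner would induce.
RULES = {
    "leaf_shape": compile_rule("blades", "LeafShape", "leaf_shape"),
    "leaf_margin": compile_rule("margins", "LeafMargin", "leaf_margin"),
}

sample = "Leaf blades ovate to\n  lanceolate, 3-8 cm; margins serrate."
print(extract(sample, RULES))
```

Keeping the vocabulary in the knowledge base rather than in the rules means that adding a term to a class extends every rule that refers to that class, which is one plausible reason such a design ports more easily to a new collection.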

Extraction performance is evaluated following the MUC evaluations [1], using the measures of precision and recall. The overall results are satisfactory. Precision is higher than 80% for every type of information except fruit/nut shape. Recall is lower than precision: for most types it is 57.04% or higher, mostly between 60% and 80%, though recall for leaf arrangement is much lower. The differences in performance across types of information are mainly due to differences in the text characteristics associated with each type. The evaluation results demonstrate that automatic extraction of morphological information from full-text collections is promising. They also indicate that the approach generalizes to similar collections in the same domain, and even to different domains once the pattern recognition and knowledge base are adapted to the new domain.
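For readers unfamiliar with the two measures, the sketch below computes precision and recall from a set of extracted items and a set of manually created answer keys. It is a simplification: it counts only exact matches and ignores the partial-credit categories of the full MUC scoring method. The function name precision_recall and the example data are assumptions made for illustration and are not results from the study.

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) over exact-match items.

    precision = correct extractions / all extractions
    recall    = correct extractions / all answer-key items
    """
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical answer key and system output: (document, slot, value) triples.
gold = {
    ("doc1", "leaf_shape", "ovate"),
    ("doc1", "leaf_margin", "serrate"),
    ("doc2", "leaf_shape", "lanceolate"),
    ("doc2", "leaf_arrangement", "alternate"),
}
extracted = {
    ("doc1", "leaf_shape", "ovate"),
    ("doc2", "leaf_shape", "lanceolate"),
    ("doc2", "leaf_margin", "entire"),
}

p, r = precision_recall(extracted, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

In this made-up example two of the three extractions are correct and two of the four answer-key items are found, so precision exceeds recall, the same direction of gap the evaluation reports.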

REFERENCES

[1] Chinchor, N., and Dungca, G. Four scores and seven years ago: The scoring method for MUC-6. In Proceedings of the Sixth Message Understanding Conference (MUC-6), 1996, 33-38. Morgan Kaufmann, San Mateo, CA.

[2] Soderland, S. Learning information extraction rules for semi-structured and free text. Machine Learning, Feb. 1999, 34, 1-3, 233-272.

[3] Tang, X., and Heidorn, P. B. The loss of domain knowledge in user search queries: A query log analysis of a botanical retrieval system. In Proceedings of the Annual Meeting of the American Society for Information Science and Technology, October 19-24, 2007, Milwaukee, WI.