Automatic Extraction of Table Metadata from Digital Documents

TCDL Bulletin
Volume 3 Issue 2
Summer 2007

Ying Liu, Prasenjit Mitra, C. Lee Giles, Kun Bai
College of Information Sciences and Technology
Pennsylvania State University
University Park, Pennsylvania, 16802
{yliu, pmitra, giles, kbai}@ist.psu.edu

Tables are used to present, list, summarize, and structure important data in documents. In digital libraries, extracting this data automatically and understanding the structure and content of tables are important to many applications. Tables appear in many media, including HTML, PDF, and images. Researchers in the table-understanding field have largely focused on algorithms that detect table boundaries and analyze table structure in one specific document medium. Limited attention has been paid to enabling table search and table exchange.

In this paper, we propose a set of medium-independent table metadata and build the metadata foundation needed to represent, index, search, and exchange tables. The metadata includes not only information about the table itself but also the auxiliary information that current research ignores, such as table footnotes and referenced annotations. With these metadata, we can build shareable databases that enable more specific searching of tabular data. Moreover, the automatic identification, extraction, and search of table contents can be made more precise. The table metadata make it easier for users to find target tables across several media, reverse-engineer the original table layout, store the extracted data in a database, present it in the same medium as the original or a different one, and query the table contents if required. An automatic table metadata extraction algorithm is designed and tested on PDF documents.
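
To make the idea of a medium-independent metadata record concrete, the following is a minimal sketch and not the schema proposed in the paper. The class name, every field name, and the search helper are assumptions chosen for illustration. The only elements taken from the abstract are that the record covers the table itself plus its footnotes and referenced annotations, and that it is meant to support searching across media.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TableMetadata:
    """Hypothetical record for one extracted table; all field names are assumptions."""
    document_id: str
    medium: str                       # source medium, e.g. "PDF", "HTML", "image"
    caption: str
    column_headers: List[str]
    rows: List[List[str]]             # cell text, one inner list per table row
    footnotes: List[str] = field(default_factory=list)
    referenced_annotations: List[str] = field(default_factory=list)


def search_tables(tables: List[TableMetadata], keyword: str) -> List[TableMetadata]:
    """Return tables whose caption, headers, footnotes, or annotations contain the keyword."""
    needle = keyword.lower()
    hits = []
    for table in tables:
        searchable = [
            table.caption,
            *table.column_headers,
            *table.footnotes,
            *table.referenced_annotations,
        ]
        # Case-insensitive substring match over the metadata text only,
        # so the query does not depend on the table's source medium.
        if any(needle in text.lower() for text in searchable):
            hits.append(table)
    return hits
```

Because the record stores the medium as one plain field alongside the others, the same query can run over tables extracted from PDF, HTML, or images. A real system would likely use an index instead of this linear scan.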

For a larger view of Figure 1, click here.

© Copyright 2007 Ying Liu, Prasenjit Mitra, C. Lee Giles, and Kun Bai
Some or all of these materials were previously published in the Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries, ACM 1-59593-354-9.
