On Utilization of Information Extracted from Graph Images
in Digital Documents
Biomedical Knowledge Engineering Laboratory
Pennsylvania State University
saurabh@psu.edu
ABSTRACT
Most search engines index the textual content of documents in digital libraries. However, scholarly articles often report important findings in figures for visual impact, and the contents of these figures are not indexed. Scientists often want to compare their experimental results with previously reported data, so searching for and extracting the data reported in figures is an important problem. To the best of our knowledge, no tool exists to automatically extract data from figures in digital documents. If we can extract data from these images automatically and store them in a database, an end-user can query and join the data from multiple digital documents simultaneously and efficiently. This research proposes a framework based on image analysis and machine learning to extract information from graph images and utilize that information for search. The automatic detection of graph images, the extraction of information from the identified images, and the building of indices to support search are some of the high-level challenges that need to be addressed first. This abstract addresses these issues and reports on the current status of the research.
General Terms
Information Extraction, Machine Learning, Metadata
1. INTRODUCTION
A wide variety of quantitative information is summarized and visually presented using graphs (or chart images), including scientific results, business performance reports, demographic distributions, time series, etc. The embedded information is invaluable in that, once extracted, the data can be indexed and the end-user gains the ability to query it. However, in order to extract the data from the figures without manual intervention, we must identify graph images, extract the text from each image in order to formulate its metadata (e.g., legend information in the case of pie-charts, 2-d plots, etc.), identify the data region, separate the data symbols from the text in the legend, and extract the data represented in the image. Performing all of these steps automatically with high precision is a challenging problem, and we believe this is the first attempt to achieve this goal.
2. PROBLEM FORMULATION
Conceptually, graph images can refer to any diagram that displays quantitative information, and they are clearly distinguished from camera-taken pictures. In our case, however, graph images correspond to 2-dimensional plots, which include pie-charts, histograms, curve plots and scatter plots, as well as 3-dimensional images. Figure 1 shows a hierarchical representation of these classes, which can be extended to accommodate other classes once a consensus on the hierarchy is developed.
Figure 1: Hierarchy of scientific graphs
In almost every domain, ranging from chemistry to economics, that uses a digital library to manage its literature, this class of graph images is used to show the results of a particular experiment or study. Although the data shown in these images are easily understood by humans, they are not available to machines for processing, let alone searching. Often, post-doctoral researchers (e.g., in the field of chemistry) are employed to re-engineer this information in order to make comparisons between different studies. With the proposed tool in place, this effort can be minimized while the search capabilities of digital libraries are enhanced as well.
Figure 2: A sample 2-d plot describing the rate of a particular reaction under certain specific conditions
Each of the four classes has characteristic features that distinguish it from other images. For example, pie-charts can be recognized by a large circular structure in the image, a 2-dimensional plot can be characterized by two perpendicular straight lines, and so on. To put forth a formal treatment of one of these classes, let us define a 2-dimensional plot. A 2-d figure contains three regions: 1) the X-axis region, containing the X-axis labels and numerical units, i.e., the area below the horizontal axis in Fig. 2; 2) the Y-axis region, containing labels and numerical units, i.e., the area to the left of the vertical axis in Fig. 2; and 3) the curve region, which contains the legend text, i.e., the text that semantically defines the data points used in the plot, or any other text, together with the data, represented by data points in the case of curve-fitted plots or by the curve itself otherwise. Theoretically, a 2-d figure depicts a functional distribution of the form y_i = f_i(x) under condition w_i, where the Y-axis and X-axis labels hold the text for y and x, and the legend text provides the particulars of the conditions w_i. The values of these functions are represented by the data points or the curve in the plot. We define these textual values as the metadata for the plot. In addition, our metadata for the figure includes the figure caption. This metadata can further be linked to the metadata of the document in an object-oriented way.
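To make the formulation concrete, the sketch below shows one possible in-memory representation of this plot metadata and its link to the document's metadata. All field names and example values are illustrative assumptions, not part of the system described here.

```python
# A minimal sketch of the plot metadata defined above, linked to the parent
# document's metadata in an object-oriented way. Field names are illustrative.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class DocumentMetadata:
    document_id: str
    title: str = ""

@dataclass
class PlotMetadata:
    x_label: str                     # text for x (X-axis region)
    y_labels: List[str]              # text for each y_i (Y-axis region)
    legend: List[str]                # legend text: the conditions w_i
    caption: str                     # figure caption, part of the metadata
    document: Optional[DocumentMetadata] = None  # link to document metadata

# Hypothetical instance for a figure like Fig. 2.
meta = PlotMetadata(x_label="time / s", y_labels=["reaction rate"],
                    legend=["catalyst A", "catalyst B"],
                    caption="Rate of a reaction under specific conditions")
```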
Based upon the above formulation, we have three specific problems: 1) 2-d plot identification, 2) metadata extraction from the 2-d plot and its indexing, and 3) data extraction from the 2-d plot. Extending this formal treatment to the other classes is an extension of the current work.
3. RELATED WORK
Work related to 2-dimensional images as a class of graph images is identified below; it remains to be extended to the other classes of graph images. Related work consists of two parts: 1) categorization of images into predefined types, and 2) information retrieval from images and indexing to support search. To the best of our knowledge, no prior standard exists for the metadata of graph images.
Image categorization bears a direct relationship to image annotation and document image understanding. Image annotation refers to tagging images with keywords that are representative of the semantics of their content [16], while document image understanding aims at processing the raster form of image documents, e.g., scanned documents, with the goal of tagging the image with a high-level semantic representation [4]. Generally, statistical models are used to classify images based upon the features present in the image and the text describing the image in the document. There has been extensive work on image understanding, content-based structure analysis [16] [14], and indexing and retrieval methods for partially understood images [5]. Domain knowledge is also commonly employed in image understanding tasks, e.g., parallel-line detection in automated form processing using the hidden Markov model [22]. People identification in images likewise utilizes both text-based and content-based analysis [15].
The image categorization part of our work bears a similarity to image understanding, but we are only interested in deciding whether a given image contains a 2-d plot. Li et al. [11] developed wavelet-transform-based, context-sensitive algorithms to perform texture-based analysis of images and separate camera-taken pictures from non-pictures. Based upon this framework, Lu et al. [12] developed an automatic image categorization system for digital library documents that categorizes images into multiple classes within the non-picture class, e.g., diagrams, 2-d figures, 3-d figures, and others. We show significant improvements in detecting 2-d figures by replacing or adding certain features used in [12].
Content-based image search and retrieval of image objects has been extensively studied [4] [5]. Using image understanding techniques, features such as texture or color are extracted from image objects, and a domain-specific indexing scheme is applied to provide efficient search for related image objects. For instance, Manjunath et al. [13] utilize texture features to index arbitrary textured images, and Smith et al. [18] utilize the spatial layout of regions to index medical images and maps. We are interested in the text present in the figure because we believe the text in the different labels (i.e., legend, axis labels, caption) is the most descriptive of the figure in general. Previous work in locating text in images consists of different application-based approaches such as page segmentation [10], address block location [9], form processing [21], and color image processing [7]. Text can be located in an image using two primary methods [10]. The first treats text as a textured region in the image and applies well-known texture filters such as Gabor filtering [13], Gaussian filtering [20], spatial variance [23], etc. The second method uses connected component analysis [21] [9] [10] [19] on localized text regions and is applicable to binary images, i.e., images having two pixel intensity levels. Generally, the two methods are used in conjunction: texture analysis first isolates suspected text regions, which are then refined by connected component analysis on the binarized regions. These methods are highly application-sensitive, and image structure drawn from domain knowledge is generally exploited.
Below, the method for information extraction from 2-d plots is described.
4. METHOD
4.1 Overview
The system uses a machine-learning-based classifier to identify which figures in a document are 2-d plots. An identified image is then segmented into the three regions defined earlier. The algorithm performs connected component analysis to label each connected component in the three regions so that its shape and position can be analyzed. Next, candidate text components are identified based upon their mutual positioning and spacing. This identification rests on the intuition that two characters appearing in the same string are very likely to be placed next to each other, with spacing roughly equal to that between any two characters in any other text string in the figure. In the next stage, we identify the data points in the curve region. This is achieved by removing the lines from the region in such a way that only the data points are left. Fig. 3 depicts the whole process as a flow chart.
Figure 3: Process flow of information extraction from 2-dimensional plot
4.2 Identification of 2-d plots
Image segment features: Li et al. [11] have proposed an image segmentation algorithm that divides an image into small non-overlapping blocks and uses the wavelet coefficients of each block as a localized feature to obtain global information about the text, background and picture regions. Lu et al. [12] found these localized features to be very effective in separating photo from non-photo images as well. Since 2-d plots are a subset of non-photo images, we use these features, and we found them to be very effective in distinguishing 2-d plots from other images. Lu et al. [12] have noted that the finer aspects of colors and shades do not contribute heavily towards identifying the "semantic type" of a figure, while discarding them reduces computation and memory requirements. Therefore, before extracting the image segment features, we converted each image to grayscale (portable gray map, PGM) format.
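As a rough illustration of such block-wise wavelet features, the sketch below computes detail-subband energies over non-overlapping blocks of a grayscale image using PyWavelets; the exact feature definitions in [11, 12] differ in detail, so this is only a sketch of the idea.

```python
# Sketch: block-wise wavelet texture features in the spirit of [11, 12].
# Assumes numpy and PyWavelets (pywt); block size and aggregation are
# illustrative choices, not the cited works' exact parameters.
import numpy as np
import pywt

def block_wavelet_features(gray, block=16):
    """gray: 2-d numpy array of a grayscale (PGM-style) image in [0, 255]."""
    h, w = gray.shape
    feats = []
    for i in range(0, h - block + 1, block):          # non-overlapping blocks
        for j in range(0, w - block + 1, block):
            tile = gray[i:i + block, j:j + block].astype(float)
            _, (cH, cV, cD) = pywt.dwt2(tile, "haar")  # one-level 2-d DWT
            # Energy in each detail subband characterizes local texture.
            feats.append([np.mean(cH ** 2), np.mean(cV ** 2), np.mean(cD ** 2)])
    if not feats:
        return np.zeros(6)
    feats = np.asarray(feats)
    # Aggregate block statistics into a fixed-length global descriptor.
    return np.concatenate([feats.mean(axis=0), feats.std(axis=0)])
```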
Axes features: 2-d figures range from curve-fitted plots to histograms and pie-charts. However, we are interested in 2-d plots that graph the variation of one variable with respect to another, and the presence of coordinate axes is certainly a distinguishing feature of such plots. We apply the Hough transform [6] to the binarized image to obtain positional information about the longest straight lines (their mutual orientation and their positions in the image) and use these as features.
Text features: From our observations, we found that authors commonly employ certain terms when writing captions for their 2-d plots that are used less frequently in captions of other types of figures. For instance, the frequently occurring set of words includes distribution, slope, axes, plot, range, etc. We use these words to form boolean features when training our classifier.
Support Vector Machines (SVMs) [1] are increasingly used in both 2-class and multi-class classification for their robustness and computational efficiency compared to other machine learning techniques. We train an SVM classifier on the features described above. We found that a linear kernel with the C-parameter set to 1.0 was best suited for our purpose.
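The sketch below shows one way the boolean caption-word features could be assembled and fed, together with the other feature vectors, to a linear SVM with C = 1.0. It uses scikit-learn for brevity, whereas our experiments used the libSVM package [3]; the feature-assembly details are assumptions for illustration.

```python
# Sketch: boolean caption-word features plus a linear SVM (C = 1.0).
# Uses scikit-learn as a stand-in for libSVM [3].
import numpy as np
from sklearn.svm import SVC

# Cue words observed to occur frequently in 2-d plot captions.
CUE_WORDS = ["distribution", "slope", "axes", "plot", "range"]

def caption_features(caption):
    """Boolean vector: does each cue word occur in the caption?"""
    words = caption.lower().split()
    return np.array([float(w in words) for w in CUE_WORDS])

def train_classifier(X, y):
    """X: rows are concatenated [image-segment | axes | caption] features;
    y: 1 for 2-d plots, 0 otherwise (from manual tagging)."""
    clf = SVC(kernel="linear", C=1.0)
    return clf.fit(X, y)
```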
4.3 Plot Segmentation
Plot segmentation is the process of identifying and separating the three regions of a plot defined earlier. Specifically, the algorithm must locate the coordinate axes in the plot, since the axes act as the global feature that identifies a 2-d plot. Profiling (or image signature) [17] is an important technique for detecting global features when a rough measure of the image structure is known beforehand. This method is particularly suitable in our setting because almost all 2-d figures have two perpendicular lines, the axes of the figure. The profile of a binarized image is calculated by counting all the foreground pixels along a specified image axis. In an image containing a single 2-d plot, the peak in the vertical profile corresponds to the Y-axis and the peak in the horizontal profile corresponds to the X-axis. The horizontal profile of the sample 2-d plot in fig. 2 is shown in fig. 4(b). However, profile-based axis detection requires the plot axes to be aligned with the image axes; in the presence of noise, e.g., in scanned images, the axes might not be perfectly perpendicular or aligned with the image axes, which may cause the profiling technique to fail. Therefore, we perform a preprocessing step that enhances the global features of the plot: it identifies the potential axis lines in the 2-d plot and reconstructs them aligned with the image axes. To determine the axis lines, one needs to find a pair of lines, each covering a maximal set of collinear foreground pixels in the binarized image, such that the two lines are almost orthogonal. This can easily be achieved by applying the Hough transformation to the original image. We refer the reader to [6] for the details of the Hough transformation and briefly explain our axes reconstruction step below.
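Before turning to the reconstruction step, a minimal sketch of profile-based axis detection itself is given below, under the assumption that the axes are image-aligned and the binarized image is a 0/1 matrix.

```python
# Sketch: profile-based axis detection on a binarized 0/1 image.
# Assumes image-aligned axes; the Hough-based reconstruction described
# below handles the misaligned case.
import numpy as np

def find_axes_by_profile(binary):
    vertical_profile = binary.sum(axis=0)    # foreground count per column
    horizontal_profile = binary.sum(axis=1)  # foreground count per row
    y_axis_col = int(np.argmax(vertical_profile))    # peak column -> Y-axis
    x_axis_row = int(np.argmax(horizontal_profile))  # peak row -> X-axis
    return x_axis_row, y_axis_col
```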
The Hough transformation converts a binarized image from its x-y coordinate space to ρ-θ space (the parametric space for straight lines), where ρ, θ are the solutions to the equation x cos θ + y sin θ = ρ. Thus, every point in ρ-θ space corresponds to a set of collinear points in x-y space, and peaks in ρ-θ space correspond to long line segments in x-y space. Since the axis lines in a 2-d plot are among the longest straight lines, their corresponding peaks are the brightest (or dominant) in ρ-θ space. Moreover, with this parameterization, the brightest peaks closest to 0 and π/2 on the θ-axis correspond to the vertical and horizontal axis lines, respectively. The value of ρ for these peaks gives the perpendicular distance of the corresponding line from the origin in x-y space.
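The sketch below illustrates this peak-picking in ρ-θ space using OpenCV's standard Hough transform; the vote threshold and angular tolerance are illustrative assumptions.

```python
# Sketch: locating the axis lines as dominant peaks in rho-theta space with
# OpenCV's standard Hough transform. With x*cos(theta) + y*sin(theta) = rho,
# theta near 0 is a vertical line (Y-axis) and theta near pi/2 a horizontal
# line (X-axis); rho is the line's distance from the origin.
import numpy as np
import cv2

def find_axes_by_hough(binary_uint8, angle_tol=0.1):
    """binary_uint8: binarized image, foreground nonzero, dtype uint8."""
    lines = cv2.HoughLines(binary_uint8, rho=1, theta=np.pi / 360, threshold=100)
    if lines is None:
        return None, None
    x_axis = y_axis = None
    for rho, theta in lines[:, 0]:           # strongest peaks come first
        if y_axis is None and abs(theta) < angle_tol:
            y_axis = rho                     # near-vertical line: Y-axis
        if x_axis is None and abs(theta - np.pi / 2) < angle_tol:
            x_axis = rho                     # near-horizontal line: X-axis
        if x_axis is not None and y_axis is not None:
            break
    return x_axis, y_axis
```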
4.4 Text Detection
Extracting text blocks from figures is more challenging than extracting text blocks from the raster image of a document because, in 2-d figures, a) the distribution of text is sparse, and b) data points and lines introduce noise, especially for the legend detection scheme. Note that a legend is usually a mixed block containing symbols representing different data types as well as text, and it has to be extracted as a single block. Therefore, the usual profile-based text-block detection techniques employed for extracting text from documents [9] [10] cannot be used. Because data and text both occur in a 2-d figure, it cannot be processed directly by a standard optical character recognition (OCR) system. Considering these factors, we perform a connected component analysis to identify individual letters as a preprocessing step, recognizing text blocks that can then be sent to a standard OCR tool.
Usually, the text present in a 2-d figure does not contain texture, i.e., significant intensity variation within the body of the text. Therefore, the loss of information is almost negligible if the image pixel intensities are converted to binary values. To perform the component analysis, we apply a connected-component labeling scheme [8] that labels all the distinct connected components in the image. Text possesses certain spatial properties, such as horizontal alignment and a characteristic spacing between characters, which distinguish the characters of a text string from other components in the image. These spatial characteristics can be utilized in the component analysis to provide probable locations of text in the 2-d plot. Specifically, we employ fuzzy rules based upon the spatial alignment of the characters to cluster the components into potential strings of text, and then apply OCR.
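The sketch below gives a simplified version of this step using SciPy's connected-component labeling; the hard alignment-and-spacing rule is a crude stand-in for the fuzzy rules used in the actual system, and the thresholds are assumptions.

```python
# Sketch: connected-component labeling followed by a crude alignment rule:
# components whose bounding boxes share a baseline and are closely spaced
# are grouped into a candidate text string for OCR.
from scipy import ndimage

def candidate_text_strings(binary, max_gap=10, max_dy=3):
    labeled, _ = ndimage.label(binary)         # 4-connectivity by default
    boxes = ndimage.find_objects(labeled)      # (row slice, col slice) pairs
    # Represent each component by (baseline row, col_start, col_end).
    comps = sorted(((b[0].stop, b[1].start, b[1].stop) for b in boxes),
                   key=lambda t: (t[0], t[1]))
    if not comps:
        return []
    strings, current = [], [comps[0]]
    for c in comps[1:]:
        prev = current[-1]
        same_line = abs(c[0] - prev[0]) <= max_dy   # aligned baselines
        close = 0 <= c[1] - prev[2] <= max_gap      # roughly uniform spacing
        if same_line and close:
            current.append(c)
        else:
            strings.append(current)
            current = [c]
    strings.append(current)
    return strings   # each group can be cropped and sent to an OCR tool
```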
4.5 Data Points Detection
Scatter and curve-fitted plots contain geometrical shapes that serve as data points. Our next step after text block detection is to locate these data points. Isolating the data points from the curves is essential for performing shape detection and mapping shapes to the legend. A reasonable heuristic, that curves have a pixel width similar to that of the axes, can be utilized to filter out the lines in the curve region. We extend the basic idea of the k-median filtering algorithm [17] to perform this operation. In summary, the k-median algorithm is a first-order pixel-noise filtering technique that removes isolated foreground pixels. A raster scan of the image is performed with a square window of size (k * k), where k = 2 * w + 1 and w is set depending upon the noise in the image, e.g., w = 1 for 1-pixel noise. At each window position, the intensity of the central pixel in the window is replaced with the median intensity value of all the pixels lying in that window.
The k-median filtering algorithm uses a 2-dimensional window, which is unable to preserve the contours of 2-dimensional shapes because pixels at the edges of a shape are surrounded by a majority of background pixels. This problem can be overcome by taking a 1-dimensional filter and treating the line width as the noise intensity. Therefore, we choose two windows of size (1 * k) and (k * 1), where k = 2 * w + 1, and perform a raster scan of the image with each. Using two 1-dimensional windows preserves even the pixels at the edges of two-dimensional data points reasonably well, but removes narrow lines from the figure (as desired). Since the pixel width and orientation of the data points must differ from those of the curve, and the average pixel width of the curve is close to the axis width, w is set to at most the axis width. We calculate the pixel width of the axes during the plot segmentation stage using the profile of the 2-d plot.
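A minimal sketch of the two 1-dimensional median filters is given below, assuming SciPy; combining the two passes with a pixel-wise minimum is one way to realize the raster scans described above.

```python
# Sketch: two 1-d median filters that strip strokes of width <= w (axes,
# curves) while preserving blob-shaped data points. A pixel is kept only
# if it survives both windows, i.e. it does not belong to a thin stroke
# in either direction.
import numpy as np
from scipy.ndimage import median_filter

def remove_thin_lines(binary, w):
    """binary: 0/1 numpy array; w: line width estimated from the axis profile."""
    k = 2 * w + 1
    keep_h = median_filter(binary, size=(1, k))  # removes thin vertical strokes
    keep_v = median_filter(binary, size=(k, 1))  # removes thin horizontal strokes
    return np.minimum(keep_h, keep_v)            # data points survive, lines do not
```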
4.6 Data Point Disambiguation
Overlapping data points occur frequently in 2-d plots, and identifying each individual data point and its coordinates is a difficult task. We apply simulated annealing (SA) to resolve individual data points within a region of overlap. SA is a stochastic method, based on the Metropolis algorithm, often used in non-convex optimization problems. It bears a close similarity to annealing (i.e., slow cooling) in metallurgical processes. By analogy with its physical counterpart, the optimal configuration (lowest energy E_min) is approached as the temperature T is lowered. In accordance with the Metropolis algorithm, higher-energy configurations (E_f > E_i) are occasionally accepted with probability e^(-(E_f - E_i)/T). The details of the algorithm are presented below.
We start with an initial configuration consisting of a large, randomly selected number of candidate shapes, where the candidates are shapes of data points previously identified in the 2-d plot using standard shape detection methods [17], and target the arbitrary shape of the overlapped data points. The positions of the candidate shapes are chosen arbitrarily, bounded by the height and width of the curve region of the original 2-d plot. We thus obtain two 0/1 matrices, one for the generated configuration and one for the original configuration of overlapping data points. The cost function, i.e., the difference between the Gramian matrices of the overlapped configuration and the generated configuration, is then calculated and optimized iteratively. Since we start with a randomly large number of individual shapes, the extra shapes need to be pruned during the iterations while keeping those shapes that lead to the optimal configuration. This is achieved as follows. The coordinates of the candidate shapes are given random fluctuations within the image boundary. In addition, point types are swapped, much as in combinatorial optimization problems such as the TSP. Finally, the Euclidean distance between candidate shapes, i.e., the sum of the squared distances between corresponding pixels of the two shapes, is used as a measure for removing identical types that overlap: identical shapes that overlap very closely give no additional information, so one of them can be discarded.
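The sketch below shows a heavily simplified version of this annealing loop: the cost is a plain pixel mismatch count rather than the Gramian-based cost described above, and the pruning and type-swapping moves are omitted for brevity. The initial temperature and iteration count echo the experimental settings reported later, but all details are illustrative.

```python
# Sketch: simplified simulated annealing for overlapping data points.
# target: 0/1 matrix of the overlap region; shape_masks: 0/1 masks of the
# candidate shapes detected earlier. Cost is pixel mismatch (a stand-in for
# the Gramian-based cost); pruning and type swaps are omitted.
import math
import random
import numpy as np

def render(shape_masks, positions, h, w):
    """Stamp each shape mask at its (row, col) offset into an h-by-w canvas."""
    canvas = np.zeros((h, w), dtype=int)
    for mask, (r, c) in zip(shape_masks, positions):
        mh, mw = mask.shape
        if 0 <= r <= h - mh and 0 <= c <= w - mw:
            canvas[r:r + mh, c:c + mw] |= mask
    return canvas

def anneal(target, shape_masks, t0=0.4, cooling=0.999, iters=10_000):
    h, w = target.shape
    pos = [(random.randrange(h), random.randrange(w)) for _ in shape_masks]
    cost = np.abs(render(shape_masks, pos, h, w) - target).sum()
    t = t0
    for _ in range(iters):
        i = random.randrange(len(pos))               # perturb one shape
        old = pos[i]
        pos[i] = (old[0] + random.randint(-2, 2), old[1] + random.randint(-2, 2))
        new_cost = np.abs(render(shape_masks, pos, h, w) - target).sum()
        # Metropolis rule: always accept improvements; accept a worse
        # configuration with probability exp(-(E_f - E_i) / T).
        if new_cost > cost and random.random() >= math.exp((cost - new_cost) / t):
            pos[i] = old                             # reject the move
        else:
            cost = new_cost
        t *= cooling                                 # annealing schedule
    return pos, cost
```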
Carnevali et al. [2] applied simulated annealing to recognize known sets of shapes in a noisy image. However, to the best of our knowledge, the application of simulated annealing to disambiguate overlapping shapes is a novel contribution.
5. EXPERIMENTS
In this section, we report the results of evaluating the new features for 2-d plot identification and the data point disambiguation algorithm. The data set used in our experiments consists of publications randomly selected from the web site of the Royal Society of Chemistry (www.rsc.org) and computer science publications randomly selected from the CiteSeer digital library [8]. Fig. 4 shows the result of applying our algorithm to fig. 2.
Figure 4: Information Extraction process on a sample 2-d figure
5.1 2-d figure Classification
For our classification experiments, we extracted the images from the above-mentioned documents and had them manually tagged by two volunteers as 2-d or non-2-d. Our set consists of 2494 images, of which 734 are 2-d plots. As mentioned previously, we train a linear SVM (with C = 1.0) on this dataset.
Features    % 3-fold CV accuracy
Only IS     85.24
Only CT     78.3
IS + CA     85.85
CT + CA     80.67
IS + CT     85.85
All         88.25

Table 1: Cross-validation accuracies
Class      Non 2-d   2-d
Non 2-d    1393      67
2-d        82        452

Table 2: Confusion matrix (training set)
5.1.1 Feature extraction
Table 1 shows the 3-fold cross-validation accuracies with different combinations of features. We use the following abbreviations: IS for image segment features, CT for caption text, and CA for the coordinate axes. The confusion matrix over a sample test set is shown in Table 3. For comparison, we also show the confusion matrix over the training set in Table 2. The libSVM software was used for support vector classification [3].
Class      Non 2-d   2-d
Non 2-d    273       27
2-d        66        134

Table 3: Confusion matrix (sample test set)
5.2 Plot Segmentation and Text Extraction
Experiments on 2-d line plots were conducted in Octave and C/C++ by converting the thresholded images into data matrices, so that each pixel coordinate corresponds to a row and column index of a data matrix. Using the text detection and overlapping text separation algorithms (see the technical report for details), segmentation tasks were performed on a random subset of the above-mentioned dataset. Fig. 4 shows that our algorithm performs the initial segmentation task very precisely, though precision drops for subsequent steps, e.g., text identification. The latter step is also pictured in Fig. 4 and is reliable where it is needed most, in the data-plotting region. We sampled 504 2-d figures for text block detection. The algorithm segments each plot into three regions. Axis detection is 99% accurate, even in the presence of noise. We show the match results for the sample of 504 2-d plots in Table 4. We consider an X or Y label or legend text block correctly identified if, on average, at least 70% of the identified letters in the block are correctly extracted.
              Total   # Correct   % Recall
X Labels      504     428         84.9
Y Labels      504     441         87.5
Legend Text   504     398         79.0

Table 4: Experimental results of the text block detection algorithm
After filtering out the text blocks using the text block detection algorithm, we ran our data extraction algorithm on the rest of each 2-d figure, using the 504 2-d figures that contained scatter or curve-fitted plots. We call a figure an accept if more than 90% of its data points are correctly extracted with their shapes preserved. Table 5 shows the recall of the data extraction algorithm.
Table 5: Experimental results of Data Extraction Algorithm
5.3 Data Point Disambiguation
For the purposes of our experiment, 90 × 90 pixel images of overlapping points were generated randomly using two shape types, a diamond (A) and a triangle (B). Fig. 5 gives an example of a pixel region containing overlapping data points and the corresponding machine-learnt version; Table 6 details the experimental parameters and results corresponding to fig. 5.
Table 7 gives the overall results of these experiments using an annealing constant of 0.4 and 10k iterations. In accordance with the SA algorithm, as the annealing schedule is slowed and the number of iterations increased, the recall approaches 100%. A slower annealing schedule and more iterations are required as the pixel region and the number of possible distinct data points grow. Nevertheless, the results are promising in that data that would traditionally be considered lost is recovered with fairly high accuracy.
Iterations   Temp.   Type   Offset (target)   Offset (recovered)
10k          0.4     A      (11, 39)          (11, 40)
                            (35, 19)          (34, 20)
                            (19, 4)           (20, 3)
                     B      (21, 35)          (22, 35)
                            (10, 18)          (10, 17)

Table 6: Example parameters for simulated annealing applied to the data point disambiguation problem
Figure 5: Examples of overlapping data points (left) and machine learnt versions (right)
Shape     Total   # Correct   % Recall
Diamond   72      64          88.9

Table 7: Experimental results for data-point disambiguation
6. FURTHER WORK
This ongoing research aims to deal with graph images in digital documents. So far, we have outlined a system that can identify 2-d plots in digital documents and extract data from the identified plots.
Several concerns remain for 2-d plot image processing, such as discerning hollow data points from line data, which proves almost as difficult as the overlap between text and data points. For the data point disambiguation problem, error handling needs to be introduced to decide when simulated annealing is appropriate. Similarly, disambiguation is necessary for overlapping characters within text that are not resolved via connected component labeling or other means, which is, to the best of our knowledge, an open problem. However, the main focus of this research will be the extension of the current algorithms to the other classes of graph images, along with the incorporation of the extracted information into a search engine.
7. REFERENCES
[1] C. J. C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition," Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121–167, 1998.
[2] P. Carnevali, L. Coletti, and S. Patarnello, "Image Processing by Simulated Annealing," IBM Journal of Research and Development, vol. 29, no. 6, pp. 569–579, 1985.
[3] C.-C. Chang and C.-J. Lin, "LIBSVM: A Library for Support Vector Machines." Available: http://fbim.fh-regensburg.de/~saj39122/Diplomarbeiten/Miklos/SVM%20Toolboxes/libsvm.pdf
[4] R. Datta, J. Li, and J. Z. Wang, "Content-based Image Retrieval: Approaches and Trends of the New Age," in Proceedings of the 7th ACM SIGMM International Workshop on Multimedia Information Retrieval, New York, NY: ACM, 2005, pp. 253–262.
[5] D. Doermann, "The Indexing and Retrieval of Document Images: A Survey," Computer Vision and Image Understanding, vol. 70, no. 3, pp. 287–298, 1998.
[6] R. O. Duda and P. E. Hart, "Use of the Hough Transformation to Detect Lines and Curves in Pictures," Communications of the ACM, vol. 15, no. 1, pp. 11–15, Jan. 1972.
[7] L. A. Fletcher and R. Kasturi, "A Robust Algorithm for Text String Separation from Mixed Text/Graphics Images," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 10, no. 6, pp. 910–918, 1988.
[8] C. L. Giles, K. D. Bollacker, and S. Lawrence, "CiteSeer: An Automatic Citation Indexing System," in Proceedings of the Third ACM Conference on Digital Libraries, New York, NY: ACM, 1998, pp. 89–98.
[9] A. K. Jain and S. K. Bhattacharjee, "Address Block Location on Envelopes Using Gabor Filters," Pattern Recognition, vol. 25, no. 12, pp. 1459–1477, Dec. 1992.
[10] A. K. Jain and B. Yu, "Document Representation and Its Application to Page Decomposition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 3, pp. 294–308, Mar. 1998.
[11] J. Li and R. M. Gray, "Context-based Multiscale Classification of Document Images Using Wavelet Coefficient Distributions," IEEE Transactions on Image Processing, vol. 9, no. 9, pp. 1604–1616, Sept. 2000.
[12] X. Lu, P. Mitra, J. Z. Wang, and C. L. Giles, "Automatic Categorization of Figures in Scientific Documents," in Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries, New York, NY: ACM, 2006, pp. 129–138.
[13] B. S. Manjunath and W.-Y. Ma, "Texture Features for Browsing and Retrieval of Image Data," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 18, no. 8, pp. 837–842, Aug. 1996.
[14] S. Mao, A. Rosenfeld, and T. Kanungo, "Document Structure Analysis Algorithms: A Literature Survey," in Proceedings of SPIE Document Recognition and Retrieval X, vol. 5010, 2003, pp. 197–207.
[15] M. Naaman, R. B. Yeh, H. Garcia-Molina, and A. Paepcke, "Leveraging Context to Resolve Identity in Photo Albums," in Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries, New York, NY: ACM, 2005, pp. 178–187.
[16] D. Niyogi and S. N. Srihari, "Knowledge-based Derivation of Document Logical Structure," in Proceedings of the Third International Conference on Document Analysis and Recognition (ICDAR), vol. 1, Washington, DC: IEEE Computer Society, 1995, p. 472.
[17] M. Seul, L. O'Gorman, and M. J. Sammon, Practical Algorithms for Image Analysis: Description, Examples, and Code. New York, NY: Cambridge University Press, 2000.
[18] J. R. Smith and S.-F. Chang, "VisualSEEk: A Fully Automated Content-based Image Query System," in Proceedings of the Fourth ACM International Conference on Multimedia, New York, NY: ACM, 1996, pp. 87–98.
[19] Y. Y. Tang, S.-W. Lee, and C. Y. Suen, "Automatic Document Processing: A Survey," Pattern Recognition, vol. 29, no. 12, pp. 1931–1952, 1996.
[20] V. Wu, R. Manmatha, and E. M. Riseman, "Finding Text in Images," in Proceedings of the Second ACM International Conference on Digital Libraries, New York, NY: ACM, 1997, pp. 3–12.
[21] B. Yu and A. K. Jain, "A Generic System for Form Dropout," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 18, no. 11, pp. 1127–1134, Nov. 1996.
[22] Y. Zheng, H. Li, and D. Doermann, "A Parallel-line Detection Algorithm Based on HMM Decoding," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 5, pp. 777–792, May 2005.
[23] Y. Zhong, K. Karu, and A. K. Jain, "Locating Text in Complex Color Images," in Proceedings of the Third International Conference on Document Analysis and Recognition (ICDAR), vol. 1, Washington, DC: IEEE Computer Society, 1995, pp. 146–149.