IEEE TCDL Bulletin
Volume 2, Issue 2 (2006)

Visualization and Classification of Documents

A New Probabilistic Model for Automated Text Classification

Giorgio Maria Di Nunzio
Department of Information Engineering, University of Padua
Via Gradenigo, 6/a, 35131 Padova, Italy
dinunzio@dei.unipd.it

 

Abstract

A novel two-dimensional representation of documents may be used to visualize and classify textual documents, and to build classification schemes for browsing, exploring, and retrieving resources from Digital Library Management Systems (DLMS). We propose a mathematical formalization of the two-dimensional approach in which documents' coordinates are defined as probabilities. An extension of the two-dimensional classification algorithm, named selective re-projection, is also proposed. Experiments confirmed that this approach is a viable alternative to classical text classification methods. In particular, the low space and time requirements and the intuitive two-dimensional plots make this approach a valuable support for the analysis of textual collections.

1  Introduction

DLMS generally manage collections of multi-media digitized data and include components that perform the storage, access, retrieval, and analysis of these collections. Multi-media digital libraries require solutions for content organization, access, and interaction. In particular, solutions for interfaces that help users query and browse results are important and critical subjects of investigation [1]. In this context, DLMS designers not only deal with increasingly complex data sources but are also continually confronted with the complexity of different user needs and abilities [2]. For this reason, a taxonomy (or subject hierarchy, classification scheme) may be used to browse, explore, and retrieve resources from DLMS. Interfaces based on this kind of taxonomy are typically used when users cannot easily express their information needs in terms of keywords, or when the specific request is not yet clear in their mind [3].

Building taxonomies or classification schemes is a general classification problem where objects having attributes (or features) are placed into categories based on the values of these attributes. When the objects to be managed are of a textual nature, this particular instance of the classification problem is called Automatic Text Categorization (ATC): the words of a document are interpreted as features, and entire documents are interpreted as numerical vectors [4]. The length of a document vector is determined by the number of words in the vocabulary chosen to represent the corpus being classified. The size of the vocabulary may be considerable and easily reach hundreds of thousands of terms. With standard classification techniques, processing the collection and training the system becomes extremely costly in computational terms; hence, a reduction of the original space is needed. The choice of the number of words that constitute the vocabulary used to represent the documents is a crucial and challenging point. Feature selection is a widely used approach to reduce the size of the vocabulary [5,6,7]. In the literature, different statistical methods, such as Naive Bayes [8,9], ridge logistic regression [10], and support vector machines [11], showed that it is possible to achieve a good tradeoff between efficiency and effectiveness through feature selection approaches.

The question of choosing an appropriate number of features also arises when a compact visualization of textual documents is needed. In scientific visualization and information visualization, graphical models are typically constructed from measured or simulated data representing objects or concepts associated with phenomena from the physical world [12]. Several techniques for visualizing multidimensional data sets have been studied [13], since, in general, there is no standard mapping into Cartesian coordinate systems. Despite the several visualization approaches in the literature, there are very few examples of applications where text classification methods are used; self-organizing maps are probably the most widely used visualization technique for text classification [14]. However, visualization methods provide only limited forms of guidance, visual or non-visual, for the user when training and testing an ATC system.

1.1  Background and Motivations

The main motivation of this work is to formalize mathematically the two-dimensional representation of documents and to present it as a probabilistic model. This model takes inspiration from some of the ideas of Bayesian classifiers, such as the use of a collection of training examples to estimate the parameters of the model; however, the main assumptions and the definition of the probability of words and documents are distinct from those of the classical Naive Bayes models, in order to tackle the problem of word selection and visualization. In particular, the approach proposed in the paper has three significant advantages, which we summarize here:

  • There is no explicit need for feature selection to reduce dimensionality, since documents are represented by only two dimensions.
  • The space required to store objects is very limited as well as the computational cost for training the classifier.
  • Collections can be visualized in two dimensions to analyze performance and find possible relations among categories.

The two-dimensional representation of documents is based on the definition of two parameters for each word of the vocabulary, named presence and expressiveness. These two intuitive measures aim to capture the fact that the importance of a word (or a term) t in a generic category c is given not only by its relative frequency in the documents of c, but also by its relative frequency in the documents of all the other categories [15]. It is the combination of these two pieces of information that makes t gain or lose strength in a particular category: the relative importance of a term in a category is discredited (or given credence) by an amount proportional to its relative importance in the other categories. The weights of the words, that is, the combinations of presence and expressiveness, are in turn combined to form the two coordinates of a document; in this way, documents are represented in the two-dimensional space with the expectation of finding the documents of a particular category separated from the other documents [16,17]. The classification of documents is performed by means of an efficient learning algorithm [18] whose performance is within 4 percentage points of the state of the art, a difference that is not statistically significant [19].

2  Mathematical Formalization

We shall use the following general definitions for the formalization: we assume a set of predefined categories $C = \{c_1, c_2, \ldots, c_i, \ldots, c_n\}$, and indicate with $d_{j,i}$ the $j$-th document belonging to the $i$-th category (e.g. $c_i = \{d_{1,i}, d_{2,i}, \ldots, d_{j,i}, \ldots, d_{N_i,i}\}$), the generic category $c_i$ containing $N_i$ documents. The set of distinct terms in a category $c_i$ is $T_i$, while a generic term of the vocabulary $V$ is $t \in \bigcup_{i=1}^{n} T_i$. Moreover, we indicate with $c_{i|t} \subseteq c_i$ the subset of category $c_i$ whose elements are the documents of $c_i$ in which the term $t$ appears at least once (e.g. $c_{1|t} = \{d_{2,1}, d_{5,1}, d_{11,1}\}$); the cardinality of this subset for a generic $c_{i|t}$ is indicated with $N_{i|t}$. Finally, the category of interest will usually be indicated as $c_i$, and its complement as $\overline{c}_i = C - c_i$. The words word, term, and feature are used in this paper as synonyms.

The model presented will make three assumptions:

  • Documents are represented by means of binary vectors (a word is either present or not in a document).
  • The probability of each word occurring in a document is independent of the occurrence of other words in a document (the word independence assumption).
  • Categories are equally probable (the relative number of documents in each category, according to the training set, is not considered).

2.1  Presence and Expressiveness

As already mentioned, the basic idea of presence and expressiveness of a term is the following: given a category $c_i$, the importance of a generic term $t$ is given not only by its relative frequency in the documents of $c_i$, but also by its relative frequency in the documents of all the other categories $\overline{c}_i$. The presence of a term $t$ in the category $c_i$, that is $P(t, c_i)$, is defined as the ratio $N_{i|t}/N_i$ between the number of documents of $c_i$ in which term $t$ appears at least once and the total number of documents of $c_i$. A straightforward frequentist interpretation of this ratio says that it is the probability that a document randomly drawn from $c_i$ contains $t$ at least once; let us call this probability $P[t \in c_i]$. Keeping in mind the assumptions made earlier, we can easily express the general presence of a term $t$ in a set of categories $G \subset C$ as the probability that a document randomly drawn from $G$ contains $t$ at least once. In fact, by using the total probability theorem, we get:


$$P[t \in G] = \sum_{j \in G} P[t \in G \mid c_j]\,P[c_j] = \frac{1}{|G|}\sum_{j \in G} P[t \in c_j] = \frac{\sum_{j \in G} N_{j|t}/N_j}{|G|} = P(t,G)\,,$$

where we used the equality $P[t \in G \mid c_j] = P[t \in c_j]$, since the probability that a document randomly drawn from $G$ contains $t$ at least once, when $c_j$ is the observed category in $G$, equals the probability that a document randomly drawn from $c_j$ contains $t$ at least once.

Similarly, the general expressiveness of a term $t$ in the set of categories $G$, that is $E(t,G)$, is given by


$$E(t,G) = 1 - P(t, C-G) = P\big[\{t \in C-G\}^{\mathrm{C}}\big] = P[t \notin C-G]\,, \qquad (1)$$

where $\{t \in C-G\}^{\mathrm{C}}$ denotes the complement of the set $\{t \in C-G\}$. Formula (1) shows that the general expressiveness of a term $t$ in the set of categories $G$ equals the probability that a document randomly drawn from $C-G$ does not contain term $t$.
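
As a concrete illustration, the following sketch computes presence and expressiveness on a toy corpus; the corpus, the category names, and the helper names are illustrative assumptions, not part of the original experiments. Documents are represented as sets of distinct terms, in line with the binary representation assumed by the model.

```python
# Minimal sketch of presence P(t, G) and expressiveness E(t, G).
# Toy corpus, category names and helper names are illustrative assumptions.

# A document is the set of distinct terms it contains (binary representation).
corpus = {
    "grain": [{"wheat", "export", "tonnes"}, {"wheat", "crop"}],
    "trade": [{"export", "tariff"}, {"tariff", "deficit"}, {"export", "deficit"}],
    "money": [{"rate", "bank"}, {"bank", "deficit"}],
}

def presence(t, G):
    """P(t, G): average over the categories of G of N_{j|t} / N_j."""
    ratios = [sum(1 for d in corpus[c] if t in d) / len(corpus[c]) for c in G]
    return sum(ratios) / len(G)

def expressiveness(t, G):
    """E(t, G) = 1 - P(t, C - G): probability that a document drawn at random
    from the complementary categories does not contain t."""
    complement = [c for c in corpus if c not in G]
    return 1.0 - presence(t, complement)

print(presence("export", ["trade"]))        # 2/3: two of the three trade documents contain "export"
print(expressiveness("export", ["trade"]))  # 0.75 = 1 - (1/2 + 0/2) / 2
```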

2.2  Term Weighting

Given a generic term $t$ in a category $c_i$, we want to give this term an importance proportional both to its presence in $c_i$ (how many documents of $c_i$ contain $t$) and to its expressiveness (how many documents of $\overline{c}_i$ do not contain $t$). In general, we define the weight of a term $t$ in a set of categories $G$, with respect to the complementary set $C-G$, as:


$$W(t,G) = P(t,G) \cdot E(t,G)\,.$$

Again, this weight can be interpreted as a probability. To show this fact, consider the following experiment: let us randomly draw a document from $G$ and, independently, a document from $C-G$. To model this experiment mathematically, we can consider the probability space $S$ as the Cartesian product of two independent spaces $S_G$ and $S_{C-G}$, modeling the random drawing of a document from $G$ and from $C-G$, respectively [20]. Now, the probability of drawing both a document containing term $t$ at least once from $G$ and a document not containing term $t$ from $C-G$ is given by:


$$P[t \in G,\ t \notin C-G] = P[t \in G]\,P[t \notin C-G] = P(t,G)\cdot E(t,G) = W(t,G)\,,$$

which proves our statement.
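
A small numerical check of this interpretation (again on illustrative toy data, not from the paper): the analytical weight $W(t,G) = P(t,G)\cdot E(t,G)$ should match the empirical frequency of drawing a document containing $t$ from $G$ and, independently, a document not containing $t$ from $C-G$.

```python
import random

# Hedged numerical check (toy data): W(t, G) = P(t, G) * E(t, G) should match
# the empirical frequency of drawing a document containing t from G and,
# independently, a document not containing t from C - G.
corpus = {
    "grain": [{"wheat", "export", "tonnes"}, {"wheat", "crop"}],
    "trade": [{"export", "tariff"}, {"tariff", "deficit"}, {"export", "deficit"}],
    "money": [{"rate", "bank"}, {"bank", "deficit"}],
}
t, G = "export", ["trade"]
comp = [c for c in corpus if c not in G]

def presence(t, cats):
    return sum(sum(1 for d in corpus[c] if t in d) / len(corpus[c]) for c in cats) / len(cats)

w = presence(t, G) * (1.0 - presence(t, comp))         # analytical W(t, G), 0.5 for this toy corpus

trials, hits = 100_000, 0
for _ in range(trials):
    d_pos = random.choice(corpus[random.choice(G)])     # draw a document from G (categories equally probable)
    d_neg = random.choice(corpus[random.choice(comp)])  # independent draw from C - G
    hits += (t in d_pos) and (t not in d_neg)
print(w, hits / trials)                                 # the two values should be close
```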

2.3  Documents' Coordinates

The aim of the coordinates of a document is to combine the weights of the words of the document according to the following principle: the more a document contains words discriminative for category $c_i$, the higher the abscissa; the more a document contains words discriminative for the complement $\overline{c}_i$, the higher the ordinate. A document's coordinates can be obtained by defining a proper additive measure over the vocabulary $V$; we use $W(t,G)$ to define this measure. Given a generic subset $d = \{t_{j_1}, t_{j_2}, \ldots, t_{j_k}\}$ of $V$, i.e. a document, the measure of $d$ in the set of categories $G$ is defined as


$$m_G[d] = \frac{\sum_{l=1}^{k} W(t_{j_l}, G)}{\sum_{i=1}^{|V|} W(t_i, G)}\,. \qquad (2)$$

Now we can define the coordinates of a document $d$ when the category $c_i$ is examined:


$$X_i(d) = m_{c_i}[d]\,, \qquad Y_i(d) = m_{\overline{c}_i}[d]\,.$$

The aim of the definition of $m_G[d]$ is to obtain a high $X_i(d)$ and a low $Y_i(d)$ if $d$ belongs with high probability to category $c_i$, and the opposite if $d$ with high probability does not belong to $c_i$. Theoretical justifications for this definition can be found in an implicit assumption of terms as disjoint events, and in the use of a weighting factor.
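
The following sketch computes the measure of Formula (2) and the resulting coordinates $X_i(d)$ and $Y_i(d)$ on the same kind of illustrative toy data; all names and data are assumptions of the sketch.

```python
# Sketch of the coordinates X_i(d), Y_i(d) of Section 2.3 on illustrative toy data;
# the corpus, the category names and the sample document are assumptions.
corpus = {
    "grain": [{"wheat", "export", "tonnes"}, {"wheat", "crop"}],
    "trade": [{"export", "tariff"}, {"tariff", "deficit"}, {"export", "deficit"}],
    "money": [{"rate", "bank"}, {"bank", "deficit"}],
}
vocabulary = set().union(*(d for docs in corpus.values() for d in docs))

def presence(t, cats):
    return sum(sum(1 for d in corpus[c] if t in d) / len(corpus[c]) for c in cats) / len(cats)

def weight(t, cats):
    comp = [c for c in corpus if c not in cats]
    return presence(t, cats) * (1.0 - presence(t, comp))   # W(t, G) = P(t, G) * E(t, G)

def measure(doc, cats):
    """m_G[d], Formula (2): normalized sum of the weights of the terms of d."""
    total = sum(weight(t, cats) for t in vocabulary)
    return sum(weight(t, cats) for t in doc if t in vocabulary) / total

def coordinates(doc, ci):
    comp = [c for c in corpus if c != ci]
    return measure(doc, [ci]), measure(doc, comp)           # (X_i(d), Y_i(d))

print(coordinates({"export", "tariff", "deficit"}, "trade"))
```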

3  Training the Classifier: Selective Re-projection

Once the documents are projected in the two-dimensional space, it is possible to classify them using the algorithms we proposed in past works [15,16,18]. The general idea is that, given a point $p(0,q)$ close to the origin ($|q| < 0.1$), there exists, according to some criterion, a straight line that separates the documents of category $c_i$ from the remaining documents; its equation is


$$Y_{i,\mathrm{opt}} = m_{\mathrm{opt}} \cdot X_i + q_{\mathrm{opt}}\,,$$

where $m_{\mathrm{opt}}$ and $q_{\mathrm{opt}}$ are an appropriate angular coefficient and intercept, respectively. In the following paragraphs, a novel approach that exploits overlapping categories, that is, categories that have a similar distribution on the plane, is presented and tested to verify whether performance increases.
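
The learning algorithm of [18] that finds $m_{\mathrm{opt}}$ and $q_{\mathrm{opt}}$ is not restated here; as a hedged stand-in, the following sketch simply sweeps a grid of candidate lines $Y = mX + q$ with $|q| < 0.1$ and keeps the one with the best training F1. The grid ranges, the names, and the brute-force strategy are assumptions of this sketch, not the method of [18].

```python
import numpy as np

# Stand-in sketch for fitting a separating line in the two-dimensional space:
# brute-force grid search over candidate lines Y = m*X + q (with |q| < 0.1),
# keeping the line with the best training F1.

def f1(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_line(X, Y, positive,
              m_grid=np.linspace(0.1, 10.0, 200),
              q_grid=np.linspace(-0.09, 0.09, 19)):
    """X, Y: arrays of document coordinates; positive: boolean array, True for documents of c_i.
    A document is accepted when it lies below the line, i.e. Y <= m*X + q."""
    best = (None, None, -1.0)
    for m in m_grid:
        for q in q_grid:
            pred = Y <= m * X + q
            tp = int(np.sum(pred & positive))
            fp = int(np.sum(pred & ~positive))
            fn = int(np.sum(~pred & positive))
            score = f1(tp, fp, fn)
            if score > best[2]:
                best = (m, q, score)
    return best   # (m_opt, q_opt, training F1)

# Example usage: m_opt, q_opt, train_f1 = best_line(X, Y, labels == "trade")
```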

Given a category of interest $c_i$, if one projects each category of the set $\overline{c}_i = C - c_i$ one at a time, it is possible to find one category (or possibly more than one, or even none) that clearly overlaps with $c_i$. Figure 1 shows an example where the documents of a category $c_k \in \overline{c}_i$ overlap with $c_i$. If one removed the overlapping category, the positive and negative documents would be almost linearly separable in two dimensions. This peculiar behavior suggested a second learning phase, named selective re-projection, that can be summarized with the following steps:

  1. Find overlapping categories.
    • Find the category $c_k \in \overline{c}_i$ that overlaps most with $c_i$;
    • Remove $c_k$ from the set $\overline{c}_i$;
    • Find the optimal separating line $Y_{i,\mathrm{opt}}^{c_k}$, where the superscript $c_k$ indicates that the overlapping category was removed from the set $\overline{c}_i$.
  2. Create a secondary two-dimensional space.
    • Create a new secondary two-dimensional space with $C' = \{c_i, c_k\}$, where the category of interest is still $c_i$ while the complementary set is only $c_k$;
    • Find the optimal separating line $Y_{i,\mathrm{opt}}^{*}$ of the secondary two-dimensional space.

figure/over-category-thumb.png

Figure 1: Training documents of the Reuters-21578 collection projected in the two-dimensional space. Positive documents are those of the category of interest $c_i$; negative documents are those of the set $C - c_i$. Documents highlighted with circles are those of a category $c_k \in C - c_i$ that overlaps most with $c_i$.

To find the category that overlaps most with $c_i$, for each $c_j \in \overline{c}_i$ we calculate the best separating line between the pair $(c_i, c_j)$. Among all the considered pairs, there is one, say $(c_i, c_k)$, that produces the lowest classification accuracy; that particular category $c_k$ is the candidate to be removed from the set $\overline{c}_i$ in order to build the secondary space. Once category $c_k$ has been found, together with the separating line $Y_{i,\mathrm{opt}}^{c_k}$, the second step builds the new secondary two-dimensional space with $C' = \{c_i, c_k\}$. This means recomputing the parameters of the coordinates considering $C'$ as the training set and finding the new separating line $Y_{i,\mathrm{opt}}^{*}$ in the secondary space. In Fig. 2 the documents of the two overlapping categories of Fig. 1 which fall below the separating line (not shown in the figure) have been re-projected in the secondary space. The figure clearly shows a better separation between the two categories.
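
A minimal sketch of this first step follows; it assumes a callable `separation_score(ci, cj)` is available, for instance the training F1 of the best line separating the two categories (as in the stand-in sketch of the previous section). The function names are illustrative.

```python
# Hedged sketch of step 1 of selective re-projection: among the categories of the
# complement of c_i, pick the one whose pairwise separation from c_i is worst.
# `separation_score(ci, cj)` is an assumed callable, e.g. the training F1 of the
# best separating line between the two categories.

def find_overlapping_category(ci, categories, separation_score):
    """Return the category c_k != c_i with the lowest pairwise separation score."""
    candidates = [c for c in categories if c != ci]
    return min(candidates, key=lambda cj: separation_score(ci, cj))
```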

Testing whether a document $d$ belongs to $c_i$ requires four steps: project $d$ with the original coordinates, check whether $d$ is below the optimal line $Y_{i,\mathrm{opt}}^{c_k}$, re-project $d$ in the new space, and check whether $d$ is below the new optimal line $Y_{i,\mathrm{opt}}^{*}$. If all these conditions are true, then the document is assigned to $c_i$.
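
This four-step test can be sketched as follows; the two coordinate functions and the two fitted lines are assumed to come from the training phase, and all names are illustrative.

```python
# Sketch of the four-step test of the selective re-projection approach.

def below(x, y, line):
    m, q = line                            # a line is the pair (angular coefficient, intercept)
    return y <= m * x + q

def assign_to_ci(doc, coords, line_primary, coords_secondary, line_secondary):
    x, y = coords(doc)                     # 1. project d with the original coordinates
    if not below(x, y, line_primary):      # 2. check against Y_i,opt^{c_k}
        return False
    x2, y2 = coords_secondary(doc)         # 3. re-project d in the secondary space
    return below(x2, y2, line_secondary)   # 4. check against Y_i,opt^*
```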

figure/reprojection.png

Figure 2: The documents of the two overlapping categories of Fig. 1 which fall below the separating line $Y_{i,\mathrm{opt}}^{c_k}$ have been re-projected in the secondary space.

4  Experimental Analysis

To make the evaluation of the two-dimensional probabilistic model comparable to most published results on ATC, the Reuters-21578¹ and RCV1² collections were chosen as benchmarks. The experimentation in this paper used the ModApte split of Reuters-21578, and only the 10 most frequent of its 135 potential categories are examined; these 10 categories account for 7,193 training documents and 2,787 test documents. A sample of 16 categories out of the 103 potential categories of RCV1 was chosen; these 16 categories approximately follow the distribution of the documents in the whole collection [7]. The training set (30,529 documents) consisted of the first month of the Reuters year (August 20, 1996 to September 19, 1996), while the test set (36,243 documents) was the last Reuters month (July 20, 1997 to August 19, 1997); a similar choice of months was made in one of the experiments of [22].

Some text preprocessing was done on the documents: a first cleaning step removed all punctuation marks and converted all letters to lowercase; a stoplist of 232 words was used to remove the most frequent words of the English language; finally, the English Porter stemmer³ was used as the only method to reduce the space of terms. Numerical tokens were removed in two experiments out of three, to see the difference in performance when the whole alpha-numeric vocabulary is used.
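
A minimal sketch of such a pipeline is given below, assuming the NLTK implementation of the Porter stemmer; the paper's 232-word stoplist is not reproduced, so a tiny illustrative stoplist stands in for it.

```python
import re
from nltk.stem import PorterStemmer   # the Porter stemmer mentioned above

# Hedged sketch of the preprocessing pipeline; the 232-word stoplist used in the
# paper is not reproduced here, so a tiny illustrative stoplist stands in for it.
STOPLIST = {"the", "of", "and", "to", "a", "in", "is", "for", "on", "that"}
stemmer = PorterStemmer()

def preprocess(text, keep_numbers=False):
    """Lowercase, drop punctuation, remove stopwords, stem; optionally keep numeric tokens."""
    pattern = r"[a-z0-9]+" if keep_numbers else r"[a-z]+"
    tokens = re.findall(pattern, text.lower())
    return {stemmer.stem(tok) for tok in tokens if tok not in STOPLIST}

print(preprocess("Wheat exports rose 10% in 1996, traders said."))
```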

Standard IR evaluation measures were used. Recall $r_i$ and Precision $p_i$ are defined for each category $c_i$ as (using the same notation as [4]):


$$r_i = \frac{TP_i}{TP_i + FN_i}\,, \qquad p_i = \frac{TP_i}{TP_i + FP_i}\,,$$

where $TP_i$ (true positives) is the number of documents correctly classified under category $c_i$, and $FN_i$ (false negatives) and $FP_i$ (false positives) are defined accordingly. A measure which combines $r_i$ and $p_i$ is the $F_\beta$ function [23]:


$$F_\beta = \frac{(\beta^2 + 1)\cdot p_i \cdot r_i}{\beta^2 \cdot p_i + r_i}\,, \qquad (3)$$

where the parameter $\beta$ is used to weight Recall and Precision differently, or equally. The macro-averaged F1 was calculated to evaluate the performance of the classifier over the whole set of categories. The F1 measure was also used to measure the accuracy of the classifier when finding the overlapping category.
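
For reference, a short sketch of these measures follows; the per-category contingency counts are purely illustrative.

```python
# Sketch of the evaluation measures: per-category precision, recall, F_beta and
# the macro-averaged F1. The contingency counts are purely illustrative.

def precision_recall(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

def f_beta(p, r, beta=1.0):
    if p == 0.0 and r == 0.0:
        return 0.0
    return (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)   # Formula (3)

counts = {"earn": (96, 3, 4), "acq": (80, 9, 11), "corn": (12, 4, 8)}   # (TP, FP, FN)
per_category_f1 = {c: f_beta(*precision_recall(*v)) for c, v in counts.items()}
macro_f1 = sum(per_category_f1.values()) / len(per_category_f1)
print(per_category_f1, round(macro_f1, 3))
```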

The SVMLight package⁴, a widely used implementation of Support Vector Machines (SVM), was employed with the default parameters (linear kernel) as a baseline for comparison. In this case, each word was weighted using a standard TfIdf weighting scheme (tfc according to the notation of [24]), and no feature selection was performed.

Tables 1 and 2 show the F1 performance of the different approaches with respect to the SVM baseline. We used k-fold cross-validation with k = 5, and ran each cross-validation 10 times; the tables report the average performance. Table 1 reports the performance comparison on Reuters-21578. For each category, the F1 measure of the two-dimensional (standard) approach (Normal), of the selective re-projection approach (Re-proj), and of the SVM baseline is reported. The last line presents the macro-averaged F1 values for each approach. A category-by-category investigation shows that the re-projection in the new space usually improves performance. The maximum gain is obtained on category corn (+34.8%), while on category interest there is a loss of 1.5%. The overall gain on the F1 measure is +7.8%. Moreover, significance tests (using the signed rank test) show that the difference in performance between Re-proj and SVM is not appreciable, and the two approaches may be considered comparable. Table 2 presents the performance comparison on RCV1 using a vocabulary with and without numerical tokens. There is a positive improvement in both experiments, although it is much more appreciable when numerical tokens are removed. When only words are used, the overall gain on the F1 measure is +7.8%; when the whole vocabulary is used, the overall gain is +3.2%. Again, significance tests confirmed that there is no difference between the Re-proj and SVM outputs; nevertheless, it can be noticed that when the whole vocabulary is used SVM performs slightly better.
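
The signed rank test mentioned above can be run, for instance, with scipy.stats.wilcoxon on paired per-category F1 values; the following sketch uses the Re-proj and SVM columns of Table 1 and is only meant to illustrate the test, not to reproduce the exact experimental protocol.

```python
from scipy.stats import wilcoxon

# Hedged sketch of the significance test: the signed rank test applied to paired
# per-category F1 values of Re-proj and SVM (values taken from Table 1 below).
reproj = [0.971, 0.939, 0.862, 0.916, 0.874, 0.860, 0.739, 0.843, 0.867, 0.804]
svm    = [0.987, 0.950, 0.762, 0.929, 0.885, 0.776, 0.737, 0.859, 0.805, 0.854]
stat, p_value = wilcoxon(reproj, svm)
print(p_value)   # a large p-value indicates no significant difference
```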

Table 1: Performance comparison on Reuters-21578
category Normal Re-proj   SVM  
earn 0.946 0.971 0.987
acq 0.855 0.939 0.950
money-fx 0.759 0.862 0.762
grain 0.889 0.916 0.929
crude 0.824 0.874 0.885
trade 0.797 0.860 0.776
interest 0.750 0.739 0.737
wheat 0.784 0.843 0.859
ship 0.845 0.867 0.805
corn 0.596 0.804 0.854
F1 macro 0.805 0.868 0.855

 

Table 2: Performance comparison on RCV1
 tokens without numbers   all tokens  
category Normal Re-proj  SVM  Normal Re-proj  SVM
godd 0.344 0.779 0.221 0.4809 0.4851 0.1895
c313 0.078 0.239 0.044 0.0862 0.2037 0.0222
c15 0.803 0.862 0.860 0.9310 0.9418 0.9746
e121 0.606 0.671 0.660 0.7452 0.7412 0.7742
e132 0.939 0.916 0.977 0.4541 0.4885 0.5179
ghea 0.730 0.729 0.818 0.5759 0.6706 0.6586
gobit 0.516 0.850 0.097 0.4941 0.2162 0.0000
gpol 0.898 0.776 0.978 0.7997 0.7645 0.8594
gspo 0.697 0.932 0.938 0.9417 0.9616 0.9812
m14 0.970 0.987 0.990 0.8912 0.8845 0.9749
m143 0.428 0.709 0.661 0.7071 0.8830 0.9349
c183 0.787 0.462 0.920 0.7207 0.7018 0.7596
e13 0.744 0.825 0.773 0.7804 0.6952 0.9056
e21 0.712 0.840 0.894 0.6480 0.7958 0.8905
e142 0.600 0.563 0.480 0.4167 0.5385 0.2727
c13 0.559 0.652 0.704 0.5605 0.6088 0.6946
F1 macro 0.670 0.737 0.748 0.6632 0.6848 0.7026

5  Conclusions

This paper presented a mathematical formalization of the two-dimensional text categorization approach. A probabilistic definition of documents' coordinates has been proposed, and an extension of the two-dimensional classification algorithm, named selective re-projection, was introduced in order to improve performance. The system was empirically evaluated on the Reuters-21578 and Reuters Corpus Volume 1 benchmarks using standard evaluation measures. Significance tests confirmed that there is no difference in performance between the two-dimensional approach and state-of-the-art SVM. Therefore, this approach proved to be a viable alternative to classical text classification methods. In particular, the low space and time requirements and the intuitive two-dimensional plots make this approach a valuable support for the analysis of textual collections. Many important paths of research remain open: one is to find a better justification for the use of an additive measure to define documents' coordinates. We are currently working on the definition of categories as sets of disjoint concepts (words) in order to infer the probabilities of terms in a better way. Moreover, it is interesting to note that the ratio between W(t,G) and W(t,C-G) is very close to the definition of the Odds Ratio of a term t, apart from the logarithm. This may suggest an implicit relation between the two-dimensional model and the feature selection approach.

Acknowledgment

The author would like to give the most sincere thanks to Luca Pretto for the fruitful cooperation in carrying out together the formalization part of this work, and especially for his help and willingness to study and analyze appropriate solutions for the probabilistic interpretation of the two-dimensional model.

Many thanks also to Norbert Fuhr and Dunja Mladenic for their valuable support and suggestions as tutors during and after the Doctoral Consortium in Vienna.

References

[1]
Catarci, T., Christodoulakis, S., Del Bimbo, A., eds.: 7th International Workshop of the EU Network of Excellence DELOS on Audio-Visual Content and Information Visualization in Digital Libraries (AVIVDiLib`05), Cortona (AR), Italy, Centromedia, Capannori (Lucca), Italy (2005).
[2]
Adams, A., Blandford, A.: Digital libraries' support for the user's `information journey'. In: 5th ACM/IEEE Joint Conference on Digital Libraries (JCDL-05), Denver, CO, USA, ACM Press, New York, NY, USA (2005) 160-169.
[3]
Krowne, A., Halbert, M.: An initial evaluation of automated organization for digital library browsing. In: 5th ACM/IEEE Joint Conference on Digital Libraries (JCDL-05), Denver, CO, USA, ACM Press, New York, NY, USA (2005) 246-255.
[4]
Sebastiani, F.: Machine learning in automated text categorization. ACM Computing Surveys 34 (2002) 1-47.
[5]
Yang, Y., Pedersen, J.O.: A comparative study on feature selection in text categorization. In Fisher, D.H., ed.: Proceedings of the Fourteenth International Conference on Machine Learning (ICML-97), Nashville, Tennessee, US, Morgan Kaufmann Publishers, San Francisco, US (1997) 412-420.
[6]
Forman, G.: An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research 3 (2003) 1289-1305.
[7]
Mladenic, D., Brank, J., Grobelnik, M., Milic-Frayling, N.: Feature selection using linear classifier weights: Interaction with classification models. In: Proceedings of the 27-th International ACM SIGIR (SIGIR-04), Sheffield, UK, ACM Press, New York, NY, USA (2004) 234-241.
[8]
Eyheramendy, S., Lewis, D.D., Madigan, D.: On the naive bayes model for text categorization. In: Proceedings of the 9-th International Workshop Artificial Intelligence and Statistics (AISTATS-03), Key West, Florida, US (2003).
[9]
McCallum, A., Nigam, K.: A comparison of event models for naive bayes text classification. In: AAAI/ICML-98 Workshop on Learning for Text Categorization, Madison, WI, USA, AAAI Press (1998) 41-48 Technical Report WS-98-05.
[10]
Zhang, T., Oles, F.J.: Text categorization based on regularized linear classification methods. Information Retrieval 4 (2001) 5-31.
[11]
Joachims, T.: Text categorization with support vector machines: Learning with many relevant features. In Nédellec, C., Rouveirol, C., eds.: Proceedings of the Tenth European Conference on Machine Learning (ECML-98), Chemnitz, DE, Springer Verlag, Heidelberg, DE (1998) 137-142.
[12]
Keim, D.A.: Information visualization and visual data mining. IEEE Transactions on Visualization and Computer Graphics 8 (2002) 1-8.
[13]
Keim, D.A., Kriegel, H.P.: Visualization techniques for mining large databases: A comparison. IEEE Transactions on Knowledge and Data Engineering 8 (1996) 923-938.
[14]
Kohonen, T., Kaski, S., Lagus, K., Salojärvi, J., Honkela, J., Paatero, V., Saarela, A.: Self organization of a massive document collection. IEEE Transactions on Neural Networks 11 (2000) 574-585.
[15]
Di Nunzio, G.M., Micarelli, A.: Does a new gaussian weighting approach perform well in text categorization? In Gottlob, G., Walsh, T., eds.: Proceedings of the Eighteenth International Joint Conference of Artificial Intelligence (IJCAI-03), Acapulco, Mexico, Morgan Kaufmann Publishers, San Francisco, US (2003) 581-586.
[16]
Di Nunzio, G.M.: A bidimensional view of documents for text categorisation. In: Proceedings of the 26-th European Conference on Information Retrieval (ECIR-04). Number 2997 in LNCS, Sunderland, UK, Springer Verlag (2004) 112-126.
[17]
Di Nunzio, G.M.: 3-d environment to represent textual documents for duplicate detection and collection examination. In Catarci, T., Christodoulakis, S., Del Bimbo, A., eds.: 7th International Workshop of the EU Network of Excellence DELOS on Audio-Visual Content and Information Visualization in Digital Libraries (AVIVDiLib`05), Cortona (AR), Italy, Centromedia, Capannori (Lucca), Italy (2005) 12-21.
[18]
Di Nunzio, G.M., Micarelli, A.: Pushing "underfitting" to the limit: Learning in bidimensional text categorization. In: Proceedings of the 16th European Conference on Artificial Intelligence (ECAI-04), Valencia, Spain, IOS Press (2004) 465-469.
[19]
Di Nunzio, G.M.: Cross-comparison for two-dimensional text categorization. In: Proceedings of the 11-th Symposium on String Processing and Information Retrieval (SPIRE-04). Number 3246 in LNCS, Padova, Italy, Springer Verlag (2004) 125-126.
[20]
Feller, W.: An Introduction to Probability Theory and Its Applications. Third edn. Volume I. John Wiley & Sons, New York (1968).
[21]
Ross, S.: Introduction to Probability and Statistics for Engineers and Scientists. Academic Press (2000).
[22]
Damerau, F.J., Zhang, T., Weiss, S.M., Indurkhya, N.: Text categorization for a comprehensive time-dependent benchmark. Information Processing and Management 40 (2004) 209-221.
[23]
van Rijsbergen, C.J.: Information Retrieval. Second edn. Dept. of Computer Science, University of Glasgow, Glasgow, UK (1979).
[24]
Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Information Processing & Management 24 (1988) 513-523.

Footnotes:

1. http://www.daviddlewis.com/resources/testcollections/reuters21578/
2. http://about.reuters.com/researchandstandards/corpus/
3. http://www.tartarus.org/~martin/PorterStemmer/
4. http://svmlight.joachims.org/
 

© Copyright 2006 Giorgio Maria Di Nunzio
