IEEE TCDL Bulletin
Volume 2, Issue 2 (2006)

Visualization and Classification of Documents

A New Probabilistic Model for Automated Text Classification

Giorgio Maria Di Nunzio
Department of Information Engineering, University of Padua
Via Gradenigo, 6/a, 35131 Padova, Italy
dinunzio@dei.unipd.it

 

Abstract

A novel two-dimensional representation of documents may be used to visualize and classify textual documents, and to build classification schemes for browsing, exploring, and retrieving resources from Digital Library Management Systems (DLMS). We propose a mathematical formalization of the two-dimensional approach in which documents' coordinates are defined as probabilities. An extension of the two-dimensional classification algorithm, named selective re-projection, is also proposed. Experiments confirmed that this approach is a viable alternative to classical text classification methods. In particular, the low space and time requirements and the intuitive two-dimensional plots make this approach a valuable support for the analysis of textual collections.

1  Introduction

DLMS generally manage collections of multi-media digitized data and include components that perform the storage, access, retrieval, and analysis of these collections. Multi-media digital libraries require solutions for content organization, access, and interaction. In particular, solutions for interfaces that help users query and browse results are important and critical subjects of investigation [1]. In this context, DLMS designers not only deal with increasingly complex data sources but are also continually confronted with the complexity of different user needs and abilities [2]. For this reason, a taxonomy (or subject hierarchy, classification scheme) may be used to browse, explore, and retrieve resources from DLMS. Interfaces based on this kind of taxonomy are typically used when users cannot easily express their information needs in terms of keywords, or when the specific request is not yet clear in their mind [3].

Building taxonomies or classification schemes is a general classification problem where objects having attributes (or features) are placed into categories based on the values of these attributes. When the objects to be managed are of a textual nature, this particular instance of the classification problem is called Automatic Text Categorization (ATC): the words of a document are interpreted as features, and entire documents are interpreted as numerical vectors [4]. The length of a document vector is determined by the number of words in the vocabulary chosen to represent the corpus being classified. The size of the vocabulary may be considerable and easily reach hundreds of thousands of terms. With standard classification techniques, processing the collection and training the system becomes extremely costly in computational terms; hence, a reduction of the original space is needed. The choice of the number of words that constitute the vocabulary used to represent the documents is a crucial and challenging point. Feature selection is a widely used approach to reduce the size of the vocabulary [5,6,7]. In the literature, different statistical methods, such as Naive Bayes [8,9], ridge logistic regression [10], and support vector machines [11], showed that it is possible to achieve a good tradeoff between efficiency and effectiveness through feature selection approaches.

The question of choosing an appropriate number of features also arises when a compact visualization of textual documents is needed. In scientific visualization and information visualization, graphical models are typically constructed from measured or simulated data representing objects or concepts associated with phenomena from the physical world [12]. Several techniques for visualizing multidimensional data sets have been studied [13], since, in general, there is no standard mapping into Cartesian coordinate systems. Despite the several visualization approaches in the literature, there are very few examples of applications where text classification methods are used; self-organizing maps are probably the most widely used visualization technique for text classification [14]. However, visualization methods provide only limited forms of guidance, visual or non-visual, for the user when training and testing an ATC system.

1.1  Background and Motivations

The main motivation of this work is to formalize mathematically the two-dimensional representation of documents and to present it as a probabilistic model. This model takes inspiration from some of the ideas of Bayesian classifiers, such as the use of a collection of training examples to estimate the parameters of the model; however, the main assumptions and the definition of the probability of words and documents are distinct from those of the classical Naive Bayes models, in order to tackle the problem of word selection and visualization. In particular, the approach proposed in the paper has three significant advantages, which we summarize here:

  • There is no explicit need for feature selection to reduce dimensionality, since documents are represented by only two dimensions.
  • The space required to store objects is very limited as well as the computational cost for training the classifier.
  • Collections can be visualized in two dimensions to analyze performance and find possible relations among categories.

The two-dimensional representation of documents is based on the definition of two parameters for each word of the vocabulary, named presence and expressiveness. These two intuitive measures aim to capture the fact that the importance of a word (or a term) t in a generic category c is given not only by its relative frequency in the documents of c, but also by its relative frequency in the documents of all the other categories [15]. It is the combination of these two pieces of information that makes t gain or lose strength in a particular category: the relative importance of a term in a category is discredited (or given credence) by an amount proportional to its relative importance in the other categories. The weights of the words, that is, the combinations of presence and expressiveness, are in turn combined to form the two coordinates of a document; in this way, documents are represented in the two-dimensional space with the expectation of finding the documents of a particular category separated from the other documents [16,17]. The classification of documents is performed by means of an efficient learning algorithm [18] whose performance is within 4 percentage points of the state of the art, a difference that is not statistically significant [19].

2  Mathematical Formalization

We shall use the following general definitions for the formalization: we assume a set of predefined categories $C = \{c_1, c_2, \ldots, c_i, \ldots, c_n\}$, and indicate with $d_{j,i}$ the $j$-th document belonging to the $i$-th category (e.g. $c_i = \{d_{1,i}, d_{2,i}, \ldots, d_{j,i}, \ldots, d_{N_i,i}\}$), the generic category $c_i$ containing $N_i$ documents. The set of distinct terms in a category $c_i$ is $T_i$, while a generic term of the vocabulary $V$ is $t \in \bigcup_{i=1}^{n} T_i$. Moreover, we indicate with $c_{i|t} \subseteq c_i$ the subset of category $c_i$ whose elements are the documents of $c_i$ in which the term $t$ appears at least once (e.g. $c_{1|t} = \{d_{2,1}, d_{5,1}, d_{11,1}\}$); the cardinality of this subset for a generic $c_{i|t}$ is indicated with $N_{i|t}$. Finally, the category of interest will usually be indicated as $c_i$, and its complement as $\overline{c}_i = C - c_i$. The words word, term, and feature are used in this paper as synonyms.

The model presented will make three assumptions:

  • Documents are represented by means of binary vectors (a word is either present or not in a document).
  • The probability of each word occurring in a document is independent of the occurrence of other words in a document (the word independence assumption).
  • Categories are equally probable (the relative number of documents in each category, according to the training set, is not considered).

2.1  Presence and Expressiveness

As already mentioned, the basic idea of presence and expressiveness of a term is the following: given a category $c_i$, the importance of a generic term $t$ is given not only by its relative frequency in the documents of $c_i$, but also by its relative frequency in the documents of all the other categories $\overline{c}_i$. The presence of a term $t$ in the category $c_i$, that is $P(t, c_i)$, is defined as the ratio $N_{i|t}/N_i$ between the number of documents of $c_i$ in which term $t$ appears at least once and the total number of documents of $c_i$. A straightforward frequentist interpretation of this ratio says that it is the probability that a document randomly drawn from $c_i$ contains $t$ at least once; let us call this probability $P[t \in c_i]$. Keeping in mind the assumptions made earlier, we can easily express the general presence of a term $t$ in a set of categories $G \subset C$ as the probability that a document randomly drawn from $G$ contains $t$ at least once. In fact, by using the total probability theorem, we get:


$$P[t \in G] = \sum_{j \in G} P[t \in G \mid c_j]\,P[c_j] = \frac{1}{|G|}\sum_{j \in G} P[t \in c_j] = \frac{\sum_{j \in G} N_{j|t}/N_j}{|G|} = P(t,G)\,,$$

where we used the equality $P[t \in G \mid c_j] = P[t \in c_j]$, since the probability that a document randomly drawn from $G$ contains $t$ at least once, when $c_j$ is the observed category in $G$, equals the probability that a document randomly drawn from $c_j$ contains $t$ at least once.

Similarly, the general expressiveness of a term $t$ in the set of categories $G$, that is $E(t,G)$, is given by


$$E(t,G) = 1 - P(t, C-G) = P\big[\{t \in C-G\}^{\mathrm{C}}\big] = P[t \notin C-G]\,, \qquad (1)$$

where $\{t \in C-G\}^{\mathrm{C}}$ denotes the complement of the set $\{t \in C-G\}$. Formula (1) shows that the general expressiveness of a term $t$ in the set of categories $G$ equals the probability that a document randomly drawn from $C-G$ does not contain term $t$.
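
As a concrete illustration, the following sketch computes presence and expressiveness on a toy corpus; the corpus, the category names, and the helper names are illustrative assumptions, not part of the original experiments. Documents are represented as sets of distinct terms, in line with the binary representation assumed by the model.

```python
# Minimal sketch of presence P(t, G) and expressiveness E(t, G).
# Toy corpus, category names and helper names are illustrative assumptions.

# A document is the set of distinct terms it contains (binary representation).
corpus = {
    "grain": [{"wheat", "export", "tonnes"}, {"wheat", "crop"}],
    "trade": [{"export", "tariff"}, {"tariff", "deficit"}, {"export", "deficit"}],
    "money": [{"rate", "bank"}, {"bank", "deficit"}],
}

def presence(t, G):
    """P(t, G): average over the categories of G of N_{j|t} / N_j."""
    ratios = [sum(1 for d in corpus[c] if t in d) / len(corpus[c]) for c in G]
    return sum(ratios) / len(G)

def expressiveness(t, G):
    """E(t, G) = 1 - P(t, C - G): probability that a document drawn at random
    from the complementary categories does not contain t."""
    complement = [c for c in corpus if c not in G]
    return 1.0 - presence(t, complement)

print(presence("export", ["trade"]))        # 2/3: two of the three trade documents contain "export"
print(expressiveness("export", ["trade"]))  # 0.75 = 1 - (1/2 + 0/2) / 2
```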

2.2  Term Weighting

Given a generic term $t$ in a category $c_i$, we want to give this term an importance proportional both to its presence in $c_i$ (how many documents of $c_i$ contain $t$) and to its expressiveness (how many documents of $\overline{c}_i$ do not contain $t$). In general, we define the weight of a term $t$ in a set of categories $G$, with respect to the complementary set $C-G$, as:


$$W(t,G) = P(t,G) \cdot E(t,G)\,.$$

Again, this weight can be interpreted as a probability. To show this fact, consider the following experiment: let us randomly draw a document from $G$ and, independently, a document from $C-G$. To model this experiment mathematically, we can consider the probability space $S$ as the Cartesian product of two independent spaces $S_G$ and $S_{C-G}$, modeling the random drawing of a document from $G$ and from $C-G$, respectively [20]. Now, the probability of drawing both a document containing term $t$ at least once from $G$ and a document not containing term $t$ from $C-G$ is given by:


$$P[t \in G,\ t \notin C-G] = P[t \in G]\,P[t \notin C-G] = P(t,G)\cdot E(t,G) = W(t,G)\,,$$

which proves our statement.
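
A small numerical check of this interpretation (again on illustrative toy data, not from the paper): the analytical weight $W(t,G) = P(t,G)\cdot E(t,G)$ should match the empirical frequency of drawing a document containing $t$ from $G$ and, independently, a document not containing $t$ from $C-G$.

```python
import random

# Hedged numerical check (toy data): W(t, G) = P(t, G) * E(t, G) should match
# the empirical frequency of drawing a document containing t from G and,
# independently, a document not containing t from C - G.
corpus = {
    "grain": [{"wheat", "export", "tonnes"}, {"wheat", "crop"}],
    "trade": [{"export", "tariff"}, {"tariff", "deficit"}, {"export", "deficit"}],
    "money": [{"rate", "bank"}, {"bank", "deficit"}],
}
t, G = "export", ["trade"]
comp = [c for c in corpus if c not in G]

def presence(t, cats):
    return sum(sum(1 for d in corpus[c] if t in d) / len(corpus[c]) for c in cats) / len(cats)

w = presence(t, G) * (1.0 - presence(t, comp))         # analytical W(t, G), 0.5 for this toy corpus

trials, hits = 100_000, 0
for _ in range(trials):
    d_pos = random.choice(corpus[random.choice(G)])     # draw a document from G (categories equally probable)
    d_neg = random.choice(corpus[random.choice(comp)])  # independent draw from C - G
    hits += (t in d_pos) and (t not in d_neg)
print(w, hits / trials)                                 # the two values should be close
```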

2.3  Documents' Coordinates

The aim of the coordinates of a document is to combine the weights of the words of the document according to the following principle: the more a document contains words discriminative for category $c_i$, the higher the abscissa; the more a document contains words discriminative for the complement $\overline{c}_i$, the higher the ordinate. A document's coordinates can be obtained by defining a proper additive measure over the vocabulary $V$; we use $W(t,G)$ to define this measure. Given a generic subset $d = \{t_{j_1}, t_{j_2}, \ldots, t_{j_k}\}$ of $V$, i.e. a document, the measure of $d$ in the set of categories $G$ is defined as


$$m_G[d] = \frac{\sum_{l=1}^{k} W(t_{j_l}, G)}{\sum_{i=1}^{|V|} W(t_i, G)}\,. \qquad (2)$$

Now we can define the coordinates of a document $d$ when the category $c_i$ is examined:


$$X_i(d) = m_{c_i}[d]\,, \qquad Y_i(d) = m_{\overline{c}_i}[d]\,.$$

The aim of the definition of $m_G[d]$ is to obtain a high $X_i(d)$ and a low $Y_i(d)$ if $d$ belongs with high probability to category $c_i$, and the opposite if $d$ with high probability does not belong to $c_i$. Theoretical justifications for this definition can be found in an implicit assumption of terms as disjoint events, and in the use of a weighting factor.
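
The following sketch computes the measure of Formula (2) and the resulting coordinates $X_i(d)$ and $Y_i(d)$ on the same kind of illustrative toy data; all names and data are assumptions of the sketch.

```python
# Sketch of the coordinates X_i(d), Y_i(d) of Section 2.3 on illustrative toy data;
# the corpus, the category names and the sample document are assumptions.
corpus = {
    "grain": [{"wheat", "export", "tonnes"}, {"wheat", "crop"}],
    "trade": [{"export", "tariff"}, {"tariff", "deficit"}, {"export", "deficit"}],
    "money": [{"rate", "bank"}, {"bank", "deficit"}],
}
vocabulary = set().union(*(d for docs in corpus.values() for d in docs))

def presence(t, cats):
    return sum(sum(1 for d in corpus[c] if t in d) / len(corpus[c]) for c in cats) / len(cats)

def weight(t, cats):
    comp = [c for c in corpus if c not in cats]
    return presence(t, cats) * (1.0 - presence(t, comp))   # W(t, G) = P(t, G) * E(t, G)

def measure(doc, cats):
    """m_G[d], Formula (2): normalized sum of the weights of the terms of d."""
    total = sum(weight(t, cats) for t in vocabulary)
    return sum(weight(t, cats) for t in doc if t in vocabulary) / total

def coordinates(doc, ci):
    comp = [c for c in corpus if c != ci]
    return measure(doc, [ci]), measure(doc, comp)           # (X_i(d), Y_i(d))

print(coordinates({"export", "tariff", "deficit"}, "trade"))
```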

3  Training the Classifier: Selective Re-projection

Once the documents are projected in the two-dimensional space, it is possible to classify them using the algorithms we proposed in past works [15,16,18]. The general idea is that, given a point $p(0,q)$ close to the origin ($|q| < 0.1$), there exists, according to some criterion, a straight line that separates the documents of category $c_i$ from the remaining documents; its equation is


$$Y_{i,\mathrm{opt}} = m_{\mathrm{opt}} \cdot X_i + q_{\mathrm{opt}}\,,$$

where $m_{\mathrm{opt}}$ and $q_{\mathrm{opt}}$ are an appropriate angular coefficient and intercept, respectively. In the following paragraphs, a novel approach that exploits overlapping categories, that is, categories that have a similar distribution on the plane, is presented and tested to verify whether performance increases.
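
The learning algorithm of [18] that finds $m_{\mathrm{opt}}$ and $q_{\mathrm{opt}}$ is not restated here; as a hedged stand-in, the following sketch simply sweeps a grid of candidate lines $Y = mX + q$ with $|q| < 0.1$ and keeps the one with the best training F1. The grid ranges, the names, and the brute-force strategy are assumptions of this sketch, not the method of [18].

```python
import numpy as np

# Stand-in sketch for fitting a separating line in the two-dimensional space:
# brute-force grid search over candidate lines Y = m*X + q (with |q| < 0.1),
# keeping the line with the best training F1.

def f1(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_line(X, Y, positive,
              m_grid=np.linspace(0.1, 10.0, 200),
              q_grid=np.linspace(-0.09, 0.09, 19)):
    """X, Y: arrays of document coordinates; positive: boolean array, True for documents of c_i.
    A document is accepted when it lies below the line, i.e. Y <= m*X + q."""
    best = (None, None, -1.0)
    for m in m_grid:
        for q in q_grid:
            pred = Y <= m * X + q
            tp = int(np.sum(pred & positive))
            fp = int(np.sum(pred & ~positive))
            fn = int(np.sum(~pred & positive))
            score = f1(tp, fp, fn)
            if score > best[2]:
                best = (m, q, score)
    return best   # (m_opt, q_opt, training F1)

# Example usage: m_opt, q_opt, train_f1 = best_line(X, Y, labels == "trade")
```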

Given a category of interest $c_i$, if one projects each category of the set $\overline{c}_i = C - c_i$ one at a time, it is possible to find one category (or possibly more than one, or even none) that clearly overlaps with $c_i$. Figure 1 shows an example where the documents of a category $c_k \in \overline{c}_i$ overlap with $c_i$. If one removed the overlapping category, the positive and negative documents would be almost linearly separable in two dimensions. This peculiar behavior suggested a second learning phase, named selective re-projection, that can be summarized with the following steps:

  1. Find overlapping categories.
    • Find the category $c_k \in \overline{c}_i$ that overlaps most with $c_i$;
    • Remove $c_k$ from the set $\overline{c}_i$;
    • Find the optimal separating line $Y_{i,\mathrm{opt}}^{c_k}$, where the superscript $c_k$ indicates that the overlapping category was removed from the set $\overline{c}_i$.
  2. Create a secondary two-dimensional space.
    • Create a new secondary two-dimensional space with $C' = \{c_i, c_k\}$, where the category of interest is still $c_i$ while the complementary set is only $c_k$;
    • Find the optimal separating line $Y_{i,\mathrm{opt}}^{*}$ of the secondary two-dimensional space.

figure/over-category-thumb.png

Figure 1: Training documents of the Reuters-21578 collection projected in the two-dimensional space. Positive documents are those of the category of interest $c_i$; negative documents are those of the set $C - c_i$. Documents highlighted with circles are those of a category $c_k \in C - c_i$ that overlaps most with $c_i$.

To find the category that overlaps most with $c_i$, for each $c_j \in \overline{c}_i$ we calculate the best separating line between the pair $(c_i, c_j)$. Among all the considered pairs, there is one, say $(c_i, c_k)$, that produces the lowest classification accuracy; that particular category $c_k$ is the candidate to be removed from the set $\overline{c}_i$ in order to build the secondary space. Once category $c_k$ has been found, together with the separating line $Y_{i,\mathrm{opt}}^{c_k}$, the second step builds the new secondary two-dimensional space with $C' = \{c_i, c_k\}$. This means recomputing the parameters of the coordinates considering $C'$ as the training set and finding the new separating line $Y_{i,\mathrm{opt}}^{*}$ in the secondary space. In Fig. 2 the documents of the two overlapping categories of Fig. 1 which fall below the separating line (not shown in the figure) have been re-projected in the secondary space. The figure clearly shows a better separation between the two categories.
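
A minimal sketch of this first step follows; it assumes a callable `separation_score(ci, cj)` is available, for instance the training F1 of the best line separating the two categories (as in the stand-in sketch of the previous section). The function names are illustrative.

```python
# Hedged sketch of step 1 of selective re-projection: among the categories of the
# complement of c_i, pick the one whose pairwise separation from c_i is worst.
# `separation_score(ci, cj)` is an assumed callable, e.g. the training F1 of the
# best separating line between the two categories.

def find_overlapping_category(ci, categories, separation_score):
    """Return the category c_k != c_i with the lowest pairwise separation score."""
    candidates = [c for c in categories if c != ci]
    return min(candidates, key=lambda cj: separation_score(ci, cj))
```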

Testing whether a document $d$ belongs to $c_i$ requires four steps: project $d$ with the original coordinates, check whether $d$ is below the optimal line $Y_{i,\mathrm{opt}}^{c_k}$, re-project $d$ in the new space, and check whether $d$ is below the new optimal line $Y_{i,\mathrm{opt}}^{*}$. If all these conditions are true, then the document is assigned to $c_i$.
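
This four-step test can be sketched as follows; the two coordinate functions and the two fitted lines are assumed to come from the training phase, and all names are illustrative.

```python
# Sketch of the four-step test of the selective re-projection approach.

def below(x, y, line):
    m, q = line                            # a line is the pair (angular coefficient, intercept)
    return y <= m * x + q

def assign_to_ci(doc, coords, line_primary, coords_secondary, line_secondary):
    x, y = coords(doc)                     # 1. project d with the original coordinates
    if not below(x, y, line_primary):      # 2. check against Y_i,opt^{c_k}
        return False
    x2, y2 = coords_secondary(doc)         # 3. re-project d in the secondary space
    return below(x2, y2, line_secondary)   # 4. check against Y_i,opt^*
```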

figure/reprojection.png

Figure 2: The documents of the two overlapping categories of Fig. 1 which fall below the separating line $Y_{i,\mathrm{opt}}^{c_k}$ have been re-projected in the secondary space.

4  Experimental Analysis

To make the evaluation of the two-dimensional probabilistic model comparable to most published results on ATC, the Reuters-21578¹ and RCV1² collections were chosen as benchmarks. The experimentation in this paper used the ModApte split of Reuters-21578, and only the 10 most frequent of its 135 potential categories are examined; these 10 categories account for 7,193 training documents and 2,787 test documents. A sample of 16 categories out of the 103 potential categories of RCV1 was chosen; these 16 categories approximately follow the distribution of the documents in the whole collection [7]. The training set (30,529 documents) consisted of the first month of the Reuters year (August 20, 1996 to September 19, 1996), while the test set (36,243 documents) was the last Reuters month (July 20, 1997 to August 19, 1997); a similar choice of months was made in one of the experiments of [22].

Some text preprocessing was done on the documents: a first cleaning step removed all punctuation marks and converted all letters to lowercase; a stoplist of 232 words was used to remove the most frequent words of the English language; finally, the English Porter stemmer³ was used as the only method to reduce the space of terms. Numerical tokens were removed in two experiments out of three, to see the difference in performance when the whole alpha-numeric vocabulary is used.
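
A minimal sketch of such a pipeline is given below, assuming the NLTK implementation of the Porter stemmer; the paper's 232-word stoplist is not reproduced, so a tiny illustrative stoplist stands in for it.

```python
import re
from nltk.stem import PorterStemmer   # the Porter stemmer mentioned above

# Hedged sketch of the preprocessing pipeline; the 232-word stoplist used in the
# paper is not reproduced here, so a tiny illustrative stoplist stands in for it.
STOPLIST = {"the", "of", "and", "to", "a", "in", "is", "for", "on", "that"}
stemmer = PorterStemmer()

def preprocess(text, keep_numbers=False):
    """Lowercase, drop punctuation, remove stopwords, stem; optionally keep numeric tokens."""
    pattern = r"[a-z0-9]+" if keep_numbers else r"[a-z]+"
    tokens = re.findall(pattern, text.lower())
    return {stemmer.stem(tok) for tok in tokens if tok not in STOPLIST}

print(preprocess("Wheat exports rose 10% in 1996, traders said."))
```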

Standard IR evaluation measures were used. Recall $r_i$ and Precision $p_i$ are defined for each category $c_i$ as (using the same notation as [4]):


$$r_i = \frac{TP_i}{TP_i + FN_i}\,, \qquad p_i = \frac{TP_i}{TP_i + FP_i}\,,$$

where $TP_i$ (true positives) is the number of documents correctly classified under category $c_i$, and $FN_i$ (false negatives) and $FP_i$ (false positives) are defined accordingly. A measure which combines $r_i$ and $p_i$ is the $F_\beta$ function [23]:


$$F_\beta = \frac{(\beta^2 + 1)\cdot p_i \cdot r_i}{\beta^2 \cdot p_i + r_i}\,, \qquad (3)$$

where the parameter $\beta$ is used to weight Recall and Precision differently, or equally. The macro-averaged F1 was calculated to evaluate the performance of the classifier over the whole set of categories. The F1 measure was also used to measure the accuracy of the classifier when finding the overlapping category.
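
For reference, a short sketch of these measures follows; the per-category contingency counts are purely illustrative.

```python
# Sketch of the evaluation measures: per-category precision, recall, F_beta and
# the macro-averaged F1. The contingency counts are purely illustrative.

def precision_recall(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

def f_beta(p, r, beta=1.0):
    if p == 0.0 and r == 0.0:
        return 0.0
    return (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)   # Formula (3)

counts = {"earn": (96, 3, 4), "acq": (80, 9, 11), "corn": (12, 4, 8)}   # (TP, FP, FN)
per_category_f1 = {c: f_beta(*precision_recall(*v)) for c, v in counts.items()}
macro_f1 = sum(per_category_f1.values()) / len(per_category_f1)
print(per_category_f1, round(macro_f1, 3))
```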

The SVMLight package⁴, a widely used implementation of Support Vector Machines (SVM), was employed with the default parameters (linear kernel) as a baseline for comparison. In this case, each word was weighted using a standard TfIdf weighting scheme (tfc according to the notation of [24]), and no feature selection was performed.

Tables 1 and 2 show the F1 performance of the different approaches with respect to the SVM baseline. We used k-fold cross-validation with k = 5, and ran each cross-validation 10 times; the tables report the average performance. Table 1 reports the performance comparison on Reuters-21578. For each category, the F1 measure of the two-dimensional (standard) approach (Normal), of the selective re-projection approach (Re-proj), and of the SVM baseline is reported. The last line presents the macro-averaged F1 values for each approach. A category-by-category investigation shows that the re-projection in the new space usually improves performance. The maximum gain is obtained on category corn (+34.8%), while on category interest there is a loss of 1.5%. The overall gain on the F1 measure is +7.8%. Moreover, significance tests (using the signed rank test) show that the difference in performance between Re-proj and SVM is not appreciable, and the two approaches may be considered comparable. Table 2 presents the performance comparison on RCV1 using a vocabulary with and without numerical tokens. There is a positive improvement in both experiments, although it is much more appreciable when numerical tokens are removed. When only words are used, the overall gain on the F1 measure is +7.8%; when the whole vocabulary is used, the overall gain is +3.2%. Again, significance tests confirmed that there is no difference between the Re-proj and SVM outputs; nevertheless, it can be noticed that when the whole vocabulary is used SVM performs slightly better.
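
The signed rank test mentioned above can be run, for instance, with scipy.stats.wilcoxon on paired per-category F1 values; the following sketch uses the Re-proj and SVM columns of Table 1 and is only meant to illustrate the test, not to reproduce the exact experimental protocol.

```python
from scipy.stats import wilcoxon

# Hedged sketch of the significance test: the signed rank test applied to paired
# per-category F1 values of Re-proj and SVM (values taken from Table 1 below).
reproj = [0.971, 0.939, 0.862, 0.916, 0.874, 0.860, 0.739, 0.843, 0.867, 0.804]
svm    = [0.987, 0.950, 0.762, 0.929, 0.885, 0.776, 0.737, 0.859, 0.805, 0.854]
stat, p_value = wilcoxon(reproj, svm)
print(p_value)   # a large p-value indicates no significant difference
```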

Table 1: Performance comparison on Reuters-21578
category Normal Re-proj   SVM  
earn 0.946 0.971 0.987
acq 0.855 0.939 0.950
money-fx 0.759 0.862 0.762
grain 0.889 0.916 0.929
crude 0.824 0.874 0.885
trade 0.797 0.860 0.776
interest 0.750 0.739 0.737
wheat 0.784 0.843 0.859
ship 0.845 0.867 0.805
corn 0.596 0.804 0.854
F1 macro 0.805 0.868 0.855

 

Table 2: Performance comparison on RCV1
 tokens without numbers   all tokens  
category Normal Re-proj  SVM  Normal Re-proj  SVM
godd 0.344 0.779 0.221 0.4809 0.4851 0.1895
c313 0.078 0.239 0.044 0.0862 0.2037 0.0222
c15 0.803 0.862 0.860 0.9310 0.9418 0.9746
e121 0.606 0.671 0.660 0.7452 0.7412 0.7742
e132 0.939 0.916 0.977 0.4541 0.4885 0.5179
ghea 0.730 0.729 0.818 0.5759 0.6706 0.6586
gobit 0.516 0.850 0.097 0.4941 0.2162 0.0000
gpol 0.898 0.776 0.978 0.7997 0.7645 0.8594
gspo 0.697 0.932 0.938 0.9417 0.9616 0.9812
m14 0.970 0.987 0.990 0.8912 0.8845 0.9749
m143 0.428 0.709 0.661 0.7071 0.8830 0.9349
c183 0.787 0.462 0.920 0.7207 0.7018 0.7596
e13 0.744 0.825 0.773 0.7804 0.6952 0.9056
e21 0.712 0.840 0.894 0.6480 0.7958 0.8905
e142 0.600 0.563 0.480 0.4167 0.5385 0.2727
c13 0.559 0.652 0.704 0.5605 0.6088 0.6946
F1 macro 0.670 0.737 0.748 0.6632 0.6848 0.7026

5  Conclusions

This paper presented a mathematical formalization of the two-dimensional text categorization approach. A probabilistic definition of documents' coordinates has been proposed, and an extension of the two-dimensional classification algorithm, named selective re-projection, was introduced in order to improve performance. The system was empirically evaluated on the Reuters-21578 and Reuters Corpus Volume 1 benchmarks using standard evaluation measures. Significance tests confirmed that there is no difference in performance between the two-dimensional approach and state-of-the-art SVM. Therefore, this approach proved to be a viable alternative to classical text classification methods. In particular, the low space and time requirements and the intuitive two-dimensional plots make this approach a valuable support for the analysis of textual collections. Many important paths of research remain open: one is to find a better justification for the use of an additive measure to define documents' coordinates. We are currently working on the definition of categories as sets of disjoint concepts (words) in order to infer the probabilities of terms in a better way. Moreover, it is interesting to note that the ratio between W(t,G) and W(t,C-G) is very close to the definition of the Odds Ratio of a term t, apart from the logarithm. This may suggest an implicit relation between the two-dimensional model and the feature selection approach.

Acknowledgment

The author would like to give the most sincere thanks to Luca Pretto for the fruitful cooperation in carrying out together the formalization part of this work, and especially for his help and willingness to study and analyze appropriate solutions for the probabilistic interpretation of the two-dimensional model.

Many thanks also to Norbert Fuhr and Dunja Mladenic for their valuable support and suggestions as tutors during and after the Doctoral Consortium in Vienna.

References

[1]
Catarci, T., Christodoulakis, S., Del Bimbo, A., eds.: 7th International Workshop of the EU Network of Excellence DELOS on Audio-Visual Content and Information Visualization in Digital Libraries (AVIVDiLib`05), Cortona (AR), Italy, Centromedia, Capannori (Lucca), Italy (2005).
[2]
Adams, A., Blandford, A.: Digital libraries' support for the user's `information journey'. In: 5th ACM/IEEE Joint Conference on Digital Libraries (JCDL-05), Denver, CO, USA, ACM Press, New York, NY, USA (2005) 160-169.
[3]
Krowne, A., Halbert, M.: An initial evaluation of automated organization for digital library browsing. In: 5th ACM/IEEE Joint Conference on Digital Libraries (JCDL-05), Denver, CO, USA, ACM Press, New York, NY, USA (2005) 246-255.
[4]
Sebastiani, F.: Machine learning in automated text categorization. ACM Computing Surveys 34 (2002) 1-47.
[5]
Yang, Y., Pedersen, J.O.: A comparative study on feature selection in text categorization. In Fisher, D.H., ed.: Proceedings of the Fourteenth International Conference on Machine Learning (ICML-97), Nashville, Tennessee, US, Morgan Kaufmann Publishers, San Francisco, US (1997) 412-420.
[6]
Forman, G.: An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research 3 (2003) 1289-1305.
[7]
Mladenic, D., Brank, J., Grobelnik, M., Milic-Frayling, N.: Feature selection using linear classifier weights: Interaction with classification models. In: Proceedings of the 27-th International ACM SIGIR (SIGIR-04), Sheffield, UK, ACM Press, New York, NY, USA (2004) 234-241.
[8]
Eyheramendy, S., Lewis, D.D., Madigan, D.: On the naive bayes model for text categorization. In: Proceedings of the 9-th International Workshop Artificial Intelligence and Statistics (AISTATS-03), Key West, Florida, US (2003).
[9]
McCallum, A., Nigam, K.: A comparison of event models for naive bayes text classification. In: AAAI/ICML-98 Workshop on Learning for Text Categorization, Madison, WI, USA, AAAI Press (1998) 41-48 Technical Report WS-98-05.
[10]
Zhang, T., Oles, F.J.: Text categorization based on regularized linear classification methods. Information Retrieval 4 (2001) 5-31.
[11]
Joachims, T.: Text categorization with support vector machines: Learning with many relevant features. In Nédellec, C., Rouveirol, C., eds.: Proceedings of the Tenth European Conference on Machine Learning (ECML-98), Chemnitz, DE, Springer Verlag, Heidelberg, DE (1998) 137-142.
[12]
Keim, D.A.: Information visualization and visual data mining. IEEE Transactions on Visualization and Computer Graphics 8 (2002) 1-8.
[13]
Keim, D.A., Kriegel, H.P.: Visualization techniques for mining large databases: A comparison. IEEE Transactions on Knowledge and Data Engineering 8 (1996) 923-938.
[14]
Kohonen, T., Kaski, S., Lagus, K., Salojärvi, J., Honkela, J., Paatero, V., Saarela, A.: Self organization of a massive document collection. IEEE Transactions on Neural Networks 11 (2000) 574-585.
[15]
Di Nunzio, G.M., Micarelli, A.: Does a new gaussian weighting approach perform well in text categorization? In Gottlob, G., Walsh, T., eds.: Proceedings of the Eighteenth International Joint Conference of Artificial Intelligence (IJCAI-03), Acapulco, Mexico, Morgan Kaufmann Publishers, San Francisco, US (2003) 581-586.
[16]
Di Nunzio, G.M.: A bidimensional view of documents for text categorisation. In: Proceedings of the 26-th European Conference on Information Retrieval (ECIR-04). Number 2997 in LNCS, Sunderland, UK, Springer Verlag (2004) 112-126.
[17]
Di Nunzio, G.M.: 3-d environment to represent textual documents for duplicate detection and collection examination. In Catarci, T., Christodoulakis, S., Del Bimbo, A., eds.: 7th International Workshop of the EU Network of Excellence DELOS on Audio-Visual Content and Information Visualization in Digital Libraries (AVIVDiLib`05), Cortona (AR), Italy, Centromedia, Capannori (Lucca), Italy (2005) 12-21.
[18]
Di Nunzio, G.M., Micarelli, A.: Pushing "underfitting" to the limit: Learning in bidimensional text categorization. In: Proceedings of the 16th European Conference on Artificial Intelligence (ECAI-04), Valencia, Spain, IOS Press (2004) 465-469.
[19]
Di Nunzio, G.M.: Cross-comparison for two-dimensional text categorization. In: Proceedings of the 11-th Symposium on String Processing and Information Retrieval (SPIRE-04). Number 3246 in LNCS, Padova, Italy, Springer Verlag (2004) 125-126.
[20]
Feller, W.: An Introduction to Probability Theory and Its Applications. Third edn. Volume I. John Wiley & Sons, New York (1968).
[21]
Ross, S.: Introduction to Probability and Statistics for Engineers and Scientists. Academic Press (2000).
[22]
Damerau, F.J., Zhang, T., Weiss, S.M., Indurkhya, N.: Text categorization for a comprehensive time-dependent benchmark. Information Processing and Management 40 (2004) 209-221.
[23]
van Rijsbergen, C.J.: Information Retrieval. Second edn. Dept. of Computer Science, University of Glasgow, Glasgow, UK (1979).
[24]
Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Information Processing & Management 24 (1988) 513-523.

Footnotes:

1. http://www.daviddlewis.com/resources/testcollections/reuters21578/
2. http://about.reuters.com/researchandstandards/corpus/
3. http://www.tartarus.org/~martin/PorterStemmer/
4. http://svmlight.joachims.org/
 

© Copyright 2006 Giorgio Maria Di Nunzio
