Angrosh, "Modelling Argumentation Structures in Scientific Discourse through Context Identification: Towards Intelligent Information Retrieval Systems", TCDL Bulletin 6.2 (2010)

Modelling Argumentation Structures in Scientific Discourse through
Context Identification: Towards Intelligent Information Retrieval Systems

M.A. Angrosh

Department of Information Science, University of Otago, Dunedin, NZ
angrosh@infoscience.otago.ac.nz

Abstract

Identification of contexts associated with sentences is becoming increasingly necessary for developing intelligent information retrieval systems. The present study is designed to identify contexts associated with sentences in scientific articles and use them for providing value-added information services. The key objectives of the study are to (1) develop an argumentative framework for modelling sentences in research articles; (2) carry out an inter-rater reliability study for establishing the framework; (3) extract and classify contextual information associated with sentences; (4) automate the process of sentence extraction and feature definition; (5) convert contextual data into semantic metadata; and (6) demonstrate the value of semantic data through developing a semantic web application. The present paper provides details about the methodology adopted for achieving these objectives and draws upon some of the key initial results obtained in the study.

Keywords

Argumentation Models, Citation Classification, Knowledge-based Information Systems

1 Introduction

For centuries now, research articles have played an important role in communicating scientific research. Ever since the first scientific journal ‘Philosophical Transactions’ was published by the Royal Society of London in the year 1665 [1], the number of scientific journals has increased exponentially. The advent of electronic publishing further fuelled this process, leading to a considerable increase in scientific publications. However, this exponential growth has also resulted in serious challenges for the scientific community. The rapid growth of scientific literature has made it difficult for scientists to access all relevant articles in their fields, and it is becoming next to impossible to read them all. Though a range of search mechanisms are provided by current search engines and information providers, most of these services are centered on either full-text search or limited to bibliographic metadata, which fail to answer researchers’ fundamental questions. For instance, at an elementary level, a researcher would be interested in knowing and understanding the related work cited in a paper.

Further, the researcher would also be interested in gaining knowledge about the shortcomings of related work cited by the author. Answers to such fundamental questions are key to progressing research, yet there are currently hardly any systems capable of answering them.

On the other hand, focused on providing efficient search solutions, researchers in the fields of Machine Learning, Text Mining, Natural Language Processing and Ontology Learning have been working on novel forms of information extraction from research documents [2-3]. The challenge is to obtain a representation of knowledge embedded in the documents and use the same for information retrieval. Researchers have also been toying with the idea of viewing research articles as argumentation structures [4]. The goal has been to represent research articles as argumentation structures, which would facilitate providing value-added information services. On a different front, citation analysis has been extensively studied for establishing links between researchers and research articles.

While bibliometric measures based on citations, such as the Impact Factor, are used to measure the impact of a researcher's work by how often it is cited [5-6], these measures have been criticized for being purely quantitative and because many citations are made out of "politeness, policy and piety" [7]. Identification of explicit relationships between citing paper and cited work has remained a challenge. Furthermore, in recent times, the emergence of newer technologies such as the Semantic Web and Web 2.0 has helped in interlinking various data sources through linked data initiatives [8], which facilitate powerful navigation between varied data sources.

Against this scenario, the present research is taken up for developing intelligent information retrieval systems based on the contextual information of sentences in research articles. The research study employs techniques from different areas identified above in order to achieve the defined objectives. For instance, while machine learning techniques are used for identifying contexts associated with sentences, we employ Semantic Web technologies for developing an information system using the contextual data. We also explore the use of argumentation models and classification of citation sentences for developing the rhetorical model in research articles.

The principal research questions formulated for the study are:

How can we identify contexts associated with sentences in research articles?
How can we model argumentative structures in research articles?
How can we use the representation of argumentation structures for building intelligent information systems?

2 Related Work

There have been efforts in modelling information contained in research articles for information retrieval. Sillince proposed an argumentation-based indexing method for information retrieval from learned articles [9]. Buckingham Shum et al. proposed an argumentation scheme for modelling the rhetorical pattern in research articles [10]. The study implemented a client-server system called ClaiMaker, which facilitated distributed modelling of documents in literature and provided a variety of services for browsing and analysing the emergent conceptual graphs. Sereno et al. developed ClaimSpotter, an environment for supporting sense making with knowledge triples [11]. De Waard presented a pragmatic structure for research articles [12]. Further, De Waard et al. proposed a model for research publications to structure scientific articles, based on the rhetorical structure ubiquitous in science papers [13]. Teufel et al. developed an annotation scheme of seven categories for building automatic abstracting systems for scientific research articles [14]. Further, Teufel and Moens noted a common rhetorical pattern of scientific argumentation in the introduction section of research articles [15].

Citations form an integral part of research writing. While laying the foundation of academia, citations provide for referencing other works, i.e., works that are already known, established or conceptual thought, for progressing one's ideas. Apart from providing valuable bibliometric measures, citations have been keenly studied with a focus on identifying the reasons for citations and, more recently, experiments on the automated classification and identification of contexts associated with citations have been undertaken. As early as 1971, Weinstock put forward as many as 15 categories describing the different reasons for citations [16]. The first approach towards a formal classification of citations was proposed by Moravcsik and Murugesan [17]. Nanba and Okumura summed up the various reasons for citations as identified by Weinstock into three fundamental categories for automatic generation of review articles [18]. Garzone and Mercer presented a system for citation classification that relied on characteristic syntactic structure to determine citation category [19]. Mercer and Di Marco extended this idea to use fine-grained cue phrases within citation sentences as a stylistic basis for categorization [20]. More recently, Le et al. employed Maximum Entropy Markov Models (MEMMs) for classifying citation sentences [21]. With a focus on detecting emerging trends, the authors proposed a classification scheme comprising six categories for classifying citation types. Based on findings from Swales [22] that scientific argument follows a general rhetorical structure, Teufel et al. introduced a citation annotation scheme involving 12 categories [23].

There has been a considerable amount of work on developing ontologies for scientific discourse. The SWAN project (Semantic Web Applications in Neuromedicine) aims to develop a practical, common, semantically-structured framework for scientific discourse [24]. The project is currently applied to the field of Alzheimer's Disease research and is expected to be applied to other domains. The project has also developed the Citations ontology, which defines a set of entities useful for referencing scientific publications [25]. The present study uses these ontologies for defining the entities of scientific publications.

While most of the aforementioned studies limited their focus to classifying citation sentences, the present study, besides classifying citations, also focuses on classifying sentences adjacent to citations. The final objective of this study is to classify citations appearing throughout the article and use the classified data for developing intelligent information systems. Furthermore, we view the classification task as a sequential classification problem. The sequential classification is achieved by using conditional random fields (CRFs), a probabilistic framework for labelling sequential data based on supervised learning. Recently, CRFs have been successfully applied to various tasks such as part-of-speech (POS) tagging [26], named entity recognition [27], table extraction from government reports [28] and noun phrase segmentation [29].

3 Objectives of the Study

The following are the key objectives of the present study:

Propose an argumentation framework for classifying sentences in research articles.
Study the inter-rater reliability of the proposed framework.
Extract and classify contextual information from sentences in research articles based on the proposed argumentative framework.
Automate the process of sentence extraction, feature definition and tagging classifier results.
Convert contextual data into semantic metadata for harnessing optimum value from the classified data.
Demonstrate the value of semantic data through developing a semantic web application.

4 Methodology

The following sections describe the methodology adopted with respect to each of the objectives defined in the study.

Propose an argumentation framework for classifying sentences in research articles.
Research articles are primarily argumentative in nature. Citations, which form the foundations of academic writing, are mainly used for progressing one’s arguments. The key focus of this objective is to identify the notion of argument associated with citation sentences. Established models of argumentation such as the Toulmin Method of Argumentation [30] will be studied to assess their application in the context of scientific discourse.
Study the inter-rater reliability of the proposed framework.
The following steps would be carried out in order to achieve this objective:
1. Carry out a user study for obtaining feedback about the proposed framework.
  Appropriate techniques would be developed for obtaining user feedback about the developed framework. To this end, we intend to use Acrobat JavaScript to enable users to annotate the PDF documents based on the developed framework. Appropriate dialog boxes would be created for collecting user data.
2. Evaluate inter-rater reliability statistics on the collected data.
  Inter-rater reliability measures such as Cohen’s Kappa and Krippendorff’s Alpha provide the degree of agreement among different raters [31]. We intend to evaluate these statistics by using the collected data in order to establish the proposed framework.
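As a minimal sketch of the agreement statistic mentioned above, Cohen's Kappa for two raters can be computed as the observed agreement corrected for the agreement expected by chance. The label names and the two annotation lists below are hypothetical and are not data from the study.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label distribution.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations of six sentences by two raters.
rater_a = ["background", "citation", "citation", "shortcoming", "outcome", "citation"]
rater_b = ["background", "citation", "shortcoming", "shortcoming", "outcome", "citation"]
kappa = cohens_kappa(rater_a, rater_b)
```

With these toy annotations the raters agree on five of six sentences, and kappa works out to 20/26, roughly 0.77. Krippendorff's Alpha generalises to more than two raters and to missing annotations, and is not covered by this sketch.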
Identify contextual information associated with sentences in research articles based on the proposed argumentative framework.
The key focus of this objective is to employ standard machine learning techniques for identifying contextual information associated with sentences in research articles. More specifically, the present study would examine the performance of sequential classifiers such as Hidden Markov Models, Maximum Entropy Markov Models and Conditional Random Fields. To this end, we propose to develop a standard training dataset comprising features for sentences in research articles.
Automate the process of sentence extraction, feature definition and tagging classifier results.
While the initial process of term categorization and feature definition is performed manually, the study proposes to automate this process through developing appropriate Python modules. While the Natural Language Toolkit (NLTK) would be used for sentence segmentation and other NLP tasks, Regular Expressions would be employed for term identification. Appropriate techniques would be developed for sentence extraction, feature definition and tagging classifier results to related sentences.
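The following is a minimal sketch of what such a module might look like. To stay self-contained it uses a naive regular-expression sentence splitter as a stand-in for NLTK's sentence segmentation, and it assumes bracketed numeric citation markers such as [9] or [5-6]. The function names and patterns are illustrative assumptions, not the study's implementation.

```python
import re

# Naive splitter (an assumption standing in for NLTK): break after ., ! or ?
# when followed by whitespace and a capital letter.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")
# Numeric citation markers such as [9], [5-6] or [2, 3].
CITATION_MARKER = re.compile(r"\[\d+(?:\s*[-,]\s*\d+)*\]")


def split_sentences(paragraph):
    """Split a paragraph into sentences with the naive boundary pattern."""
    return [s.strip() for s in SENTENCE_BOUNDARY.split(paragraph) if s.strip()]


def has_citation(sentence):
    """True when the sentence contains a bracketed numeric citation marker."""
    return CITATION_MARKER.search(sentence) is not None


paragraph = (
    "Sillince proposed an argumentation based indexing method [9]. "
    "The method targets learned articles. "
    "However, citation measures have been criticized [5-6]."
)
sentences = split_sentences(paragraph)
flags = [has_citation(s) for s in sentences]
```

The naive splitter will mis-split after abbreviations that are followed by a capitalised word (for example "et al. The"), which is one reason a trained segmenter such as the one in NLTK is preferable in practice.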
Convert contextual data into semantic metadata for harnessing optimum value from the classified data.
The study proposes to use Semantic Web technologies for developing the resulting contextual knowledge into semantic metadata. This would facilitate developing intelligent information retrieval systems capable of providing value added information services.
Demonstrate the value of semantic data through developing a Semantic Web application.
The study proposes to develop a Semantic Web Application for exploiting the developed contextual knowledge.

5 Expected Outcome

The present study is expected to result in the following:

An argumentation framework for modelling sentences in scientific research articles, which could be used for information management and retrieval.
A training dataset comprising labelled sentences for classification of the context of sentences in research articles.
An implemented application of CRF for sequential classification of sentences in articles and an understanding of this technique’s performance compared to other classification techniques.
An ontology-based information retrieval system using the classified data. Amongst others, the system would be able to answer the following queries:
1. Identify current work outcomes and current work shortcomings of a research article
2. Identify the contexts in which related works are cited
3. Identify the shortcomings of a specific work as identified by different authors
4. Identify contrasting statements about related works in an article
5. Sketch the pattern of argumentation for a particular research idea
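To make the third query above concrete, here is a minimal sketch that groups shortcoming sentences about one cited work by the citing article. The record layout, identifiers and label names are hypothetical placeholders, and a plain in-memory list stands in for the ontology-based store.

```python
from collections import defaultdict

# Hypothetical records: (citing article, cited work, context label, sentence text).
RECORDS = [
    ("articleA", "workX", "shortcoming", "Placeholder sentence A1 about workX."),
    ("articleB", "workX", "description", "Placeholder sentence B1 about workX."),
    ("articleB", "workX", "shortcoming", "Placeholder sentence B2 about workX."),
    ("articleA", "workY", "shortcoming", "Placeholder sentence A2 about workY."),
]


def shortcomings_of(cited_work, records=RECORDS):
    """Group shortcoming sentences about one cited work by the citing article."""
    by_article = defaultdict(list)
    for citing, cited, label, text in records:
        if cited == cited_work and label == "shortcoming":
            by_article[citing].append(text)
    return dict(by_article)


result = shortcomings_of("workX")
```

Here `result` maps each citing article to the shortcoming sentences it contains about `workX`, which is the shape of answer the query asks for.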

6 Results Achieved So Far

The following sections provide details about the results achieved so far.

6.1 Framework for Related Work Sections

Working in the direction of developing an argumentation framework for research articles, we initially considered modeling Related Work sections in research articles as a first step. Accordingly, we have developed a generic framework based on the rhetorical patterns observed in these sections as shown in Figure 1. We propose to extend this framework for modelling full articles.

Figure 1: Rhetorical Pattern in Related Work Section

As shown in Figure 1, the section usually begins with a background sentence and continues with citation sentences pointing to a specific related work in the context of the research paper. The citation sentences either simply describe the related work or refer to its outcome or strengths. Citation sentences are followed by sentences that describe the cited work in more detail. Following such sentences, the author generally points out shortcomings in the related work. Besides identifying shortcomings, there may also be sentences referring to work that contrasts with the cited work. After identifying shortcomings, the author usually proceeds to point out alternate works which overcome the identified shortcomings. These may again be citations or regular sentences which refer to alternate approaches for a given study. Finally, in relation to the description of a cited work, the reference to its outcome or strengths, its shortcomings, and alternate approaches, the authors describe the outcomes of the current work. The authors also sometimes mention shortcomings of the current work.

6.2 Classification Schema for Related Work Sections

This framework resulted in a classification schema for sentences in related work sections, which is tabulated in Table 1.

Table 1: Labels for Sentences in Related Work Section

The categories resulting from the classification schema were used for labeling sentences in the related work section and for developing our training data for carrying out machine learning experiments.

6.3 Feature Definition

We distinguished between two kinds of features for each sentence viz., citation features and sentence features. While citation features denote the presence of a citation in sentences, sentence features are defined based on the generalization strategy adopted in the study. The generalization strategy categorized certain kinds of words appearing in sentences and defined features for presence of terms in these categories. The categories included Background Terms, Subject of Inquiry Terms, Outcome Terms, Strength Terms, Shortcoming Terms, Subjective Pronoun Terms, Words of Stress, Alternate Approach Terms, Result Terms and Contrasting Terms.
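A minimal sketch of this kind of feature definition is given below. The term lists are hypothetical examples covering only four of the categories named above, since the study's actual lists are not reproduced here, and the bracketed-number citation pattern is likewise an assumption.

```python
import re

# Hypothetical term lists; the study's actual lists are not reproduced here.
TERM_CATEGORIES = {
    "background_term": ["there has been", "in recent years"],
    "outcome_term": ["proposed", "developed", "presented"],
    "shortcoming_term": ["however", "fails to", "limited to"],
    "contrasting_term": ["in contrast", "unlike"],
}
# Assumed citation style: bracketed numbers such as [14] or [5-6].
CITATION_MARKER = re.compile(r"\[\d+(?:\s*[-,]\s*\d+)*\]")


def sentence_features(sentence, use_citation_feature=True):
    """Binary features: one per term category, plus an optional citation flag."""
    text = sentence.lower()
    features = {}
    for category, terms in TERM_CATEGORIES.items():
        features[category] = any(
            re.search(r"\b" + re.escape(term) + r"\b", text) for term in terms
        )
    if use_citation_feature:
        features["has_citation"] = CITATION_MARKER.search(sentence) is not None
    return features


example = "However, the approach in [14] is limited to abstracts."
features = sentence_features(example)
```

Calling `sentence_features(example, use_citation_feature=False)` yields sentence features only, which mirrors the distinction between a dataset built from sentence features alone and one that also includes citation features.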

6.4 Developing Training Data

The initial training dataset was developed from 50 research articles randomly selected from the Lecture Notes in Computer Science (LNCS) collection at springerlink.com [32]. The training set of 50 research articles yielded 200 paragraphs which had a total of 1063 sentences. More specifically, two different datasets were prepared from the training set. While the first dataset was developed using only sentence features, the second dataset was developed using both citation features and sentence features.

6.5 Experiments and Results with Conditional Random Fields

We carried out experiments with Conditional Random Fields (CRFs) [26], which are undirected graphical models used for computing the conditional probability of values on designated output nodes given feature values assigned to a sequence of input nodes. We chose CRFs as they offered several advantages over other sequential models such as Hidden Markov Models (HMMs) and Maximum Entropy Markov Models (MEMMs).
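To illustrate what a linear-chain CRF computes, the sketch below scores a label sequence as a sum of feature weights and transition weights, then normalises over every possible label sequence. The two labels and all weights are hand-picked assumptions rather than learned values, and the brute-force normalisation is only feasible for toy inputs; practical implementations use dynamic programming instead.

```python
import math
from itertools import product

LABELS = ["citation", "shortcoming"]

# Hypothetical weights, chosen by hand for illustration (not learned).
EMISSION = {  # weight for (observed feature, label)
    ("has_citation", "citation"): 2.0,
    ("shortcoming_term", "shortcoming"): 2.0,
}
TRANSITION = {  # weight for (previous label, current label)
    ("citation", "shortcoming"): 1.0,
    ("shortcoming", "citation"): 0.5,
}


def score(features_per_sentence, labels):
    """Unnormalised log-score of one label sequence."""
    total = 0.0
    for t, (features, label) in enumerate(zip(features_per_sentence, labels)):
        total += sum(EMISSION.get((f, label), 0.0) for f in features)
        if t > 0:
            total += TRANSITION.get((labels[t - 1], label), 0.0)
    return total


def conditional_probability(features_per_sentence, labels):
    """p(labels | features), normalised by brute force over all label sequences."""
    n = len(features_per_sentence)
    z = sum(
        math.exp(score(features_per_sentence, list(seq)))
        for seq in product(LABELS, repeat=n)
    )
    return math.exp(score(features_per_sentence, labels)) / z


# Two sentences: the first has a citation, the second a shortcoming term.
observed = [{"has_citation"}, {"shortcoming_term"}]
best = max(
    product(LABELS, repeat=len(observed)),
    key=lambda seq: conditional_probability(observed, list(seq)),
)
```

Because the whole sequence is normalised at once, the transition weight from a citation sentence to a shortcoming sentence influences both labels jointly, which is the property that makes the model suited to sentences whose context depends on their neighbours.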

We performed a 10-fold cross validation on each dataset. We used Mallet [33], a Java-based package that provides an implementation of linear chain CRF algorithms for working with sequential data. The results showing precision, recall and F-score for each of the classification labels are tabulated in Table 2. More details about modelling related work sections and using CRFs for achieving context identification can be found in [34].
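The evaluation procedure can be sketched as follows. The fold construction assumes, purely for illustration, that folds are formed over paragraphs and are contiguous; the unit and the sampling used in the study are not stated here. The gold and predicted labels are hypothetical.

```python
def k_fold_indices(n_items, k=10):
    """Split range(n_items) into k contiguous test folds of near-equal size."""
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_items) if i < start or i >= start + size]
        folds.append((train, test))
        start += size
    return folds


def per_label_scores(gold, predicted):
    """Precision, recall and F-score for each label."""
    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        tp = sum(g == label and p == label for g, p in zip(gold, predicted))
        fp = sum(g != label and p == label for g, p in zip(gold, predicted))
        fn = sum(g == label and p != label for g, p in zip(gold, predicted))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f_score = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        scores[label] = (precision, recall, f_score)
    return scores


folds = k_fold_indices(200, k=10)  # assumption: one item per paragraph
gold = ["citation", "citation", "shortcoming", "outcome"]  # hypothetical labels
predicted = ["citation", "shortcoming", "shortcoming", "outcome"]
scores = per_label_scores(gold, predicted)
```

In a full run, the classifier would be trained on each `train` split and the scores computed on the matching `test` split, then averaged across the ten folds.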

Table 2: Classification results of the Classifier

6.6 Sentence Context Ontology and Linked Data

We have also developed the Sentence Context Ontology for modelling the contextual relations identified above as shown in Figure 2. The ontology was developed using the Protégé Ontology Editor and Knowledge-base Framework [35]. Based on the developed ontology, we have also set up a Linked Data server for developing the contextual data in Resource Description Framework (RDF) form. We used the D2R Server [36] for achieving this task.
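As an illustration of what contextual data in RDF form might look like, the sketch below serialises one classified sentence as N-Triples. It is not how D2R Server works (D2R maps a relational database to RDF); it only shows the shape of the resulting data. The namespaces, class name and property names are hypothetical placeholders, not the Sentence Context Ontology's actual terms.

```python
# Hypothetical namespaces: placeholders, not the study's actual ontology IRIs.
SCO = "http://example.org/sentence-context#"
DATA = "http://example.org/data/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def escape_literal(text):
    """Escape backslashes, double quotes and newlines for an N-Triples literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def sentence_to_ntriples(article_id, index, text, label):
    """Describe one classified sentence as three N-Triples statements."""
    subject = f"<{DATA}{article_id}/sentence/{index}>"
    return [
        f"{subject} <{RDF_TYPE}> <{SCO}Sentence> .",
        f"{subject} <{SCO}hasContext> <{SCO}{label}> .",
        f'{subject} <{SCO}hasText> "{escape_literal(text)}" .',
    ]


triples = sentence_to_ntriples(
    "article42", 3, 'The method "X" fails on long texts.', "Shortcoming"
)
```

Each sentence becomes a resource with its own IRI, so other datasets can link to it, which is the property that linked data navigation relies on.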

6.7 Ontology-based Information Retrieval System

The created RDF data was used along with Semantic Web Server (SEWESE) [37], a JSP/Servlet environment, for developing an ontology-based retrieval system. The system provided user interfaces for queries and navigation of semantic data. Screenshots of the developed information system are provided in Figures 3(a) and 3(b). More details about semantic modelling of the contextual data can be found in [38].

Figure 2: Sentence Context Ontology

Figure 3(a) and Figure 3(b): Screenshots of the developed Ontology-based Information Retrieval System

7 References

[1]	Fathom, "The Scientific Article: From Galileo's New Science to the Human Genome - Session 2: Origin and Early Years," http://www.fathom.com/course/21701730/index.html, 2000.

[2]	R. Kosala and H. Blockeel, "Web mining research: A survey," ACM SIGKDD Explorations Newsletter, vol. 2, pp. 1-15, 2000.

[3]	A. H. F. Laender, B. A. Ribeiro-Neto, A.S. da Silva and J.S. Teixeira, "A brief survey of web data extraction tools," ACM Sigmod Record, vol. 31, pp. 84-93, 2002.

[4]	S. B. Shum, "The Roots of Computer Supported Argument Visualization," in Visualizing argumentation: software tools for collaborative and educational sense-making, P. Kirschner, S. B. Shum and C. Carr, Eds. London;New York: Springer, 2003, pp. 3-24.

[5]	C. L. Borgman and J. Furner, "Scholarly communication and bibliometrics," Annual Review of Information Science and Technology, vol. 36, pp. 3-72, 2002.

[6]	T. Luukkonen, "Is scientists' publishing behaviour reward-seeking?," Scientometrics, vol. 24, pp. 297-319, 1992.

[7]	J. M. Ziman, Public knowledge: an essay concerning the social dimension of science: Cambridge University Press, 1968.

[8]	Linked Data – Connecting Distributed Data across the Web. http://linkeddata.org/

[9]	J. A. A. Sillince, "Argumentation-based indexing for information retrieval from learned articles," Journal of Documentation, vol. 48, pp. 387-405, 1992.

[10]	S.B. Shum, V. Uren, G. Li, B. Sereno and C. Mancini, "Modeling naturalistic argumentation in research literatures: Representation and interaction design issues," International Journal of Intelligent Systems, vol. 22, pp. 17-48, 2007.

[11]	B. Sereno, S.B. Shum and E. Motta, "ClaimSpotter: an environment to support sense-making with knowledge triples," in Proc. Tenth Intl. Conf. on Intelligent User Interfaces, 2005, pp. 199-206.

[12]	A. de Waard, "A pragmatic structure for research articles," in Proceedings of the 2nd international conference on Pragmatic web, 2007, pp. 83-89.

[13]	A. De Waard, L. Breure, J.G. Kircz and H. van Oostendorp, "Modeling rhetoric in scientific publications," in Current Research in Information Sciences and Technologies, V.P. Guerrero Bote, Ed. Badajoz, Spain: Open Institute of Knowledge, 2006, pp. 352-356.

[14]	S. Teufel, J. Carletta and M. Moens, "An annotation scheme for discourse-level argumentation in research articles," in Ninth Conf. on European Chapter of the ACL, 1999, pp. 110-117.

[15]	S. Teufel and M. Moens, "What's yours and what's mine: determining intellectual attribution in scientific text," in Proceedings of the 2000 Joint SIGDAT conference on Empirical methods in natural language processing and very large corpora: held in conjunction with the 38th Annual Meeting of the Association for Computational Linguistics, vol. 13, 2000, pp. 9-17.

[16]	M. Weinstock, "Citation indexes," Encyclopedia of library and information science, A. Kent and H. Lancour, Eds. vol. 5, NY: Dekker, 1971, pp. 16–40.

[17]	M. J. Moravcsik and P. Murugesan, "Some results on the function and quality of citations," Social Studies of Science, vol. 5, pp. 86-92, 1975.

[18]	H. Nanba and M. Okumura, "Towards multi-paper summarization using reference information," in Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, 1999, pp. 926-931.

[19]	M. Garzone and R. E. Mercer, "Towards an automated citation classifier," Lecture notes in computer science, pp. 337-346, 2000.

[20]	R. E. Mercer and C. Di Marco, "The importance of fine-grained cue phrases in scientific citations," Lecture notes in computer science, pp. 550-556, 2003.

[21]	M. Le, T. Ho and Y. Nakamori, "Detecting citation types using finite-state machines," Advances in Knowledge Discovery and Data Mining, pp. 04-09, 2006.

[22]	J. Swales, "Citation analysis and discourse analysis," Applied Linguistics, vol. 7, pp. 39-56, 1986.

[23]	S. Teufel, A. Siddharthan and D. Tidhar, "An annotation scheme for citation function," in Proc. Seventh SIGdial Workshop on Discourse and Dialogue, pp. 80-87, 2006.

[24]	Semantic Web Applications in Neuromedicine (SWAN) Ontology. W3C Interest Group Note 20 October 2009. Ciccarese, P. (ed.). http://www.w3.org/TR/hcls-swan/

[25]	Citations Ontology Specification. Semantic Web Applications in Neuromedicine (SWAN) Ontology. Ciccarese, P. (ed.). http://swan.mindinformatics.org/spec/1.2/citations.html

[26]	J. D. Lafferty, A. McCallum and F.C.N. Pereira, "Conditional random fields: Probabilistic models for segmenting and labeling sequence data," in Proc. Eighteenth Intl. Conf. on Machine Learning, pp. 282-289, 2001.

[27]	A. McCallum and W. Li, "Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons," in Proc. Seventh Conf. on Natural Language Learning at HLT-NAACL, pp. 188-191, 2003.

[28]	D. Pinto, A. McCallum, X. Wei and W.B. Croft, "Table extraction using conditional random fields," in Proc. 26th Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, pp. 235-242, 2003.

[29]	F. Sha and F. Pereira, "Shallow parsing with conditional random fields," in Proc. of 2003 Conf. of North American Chapter of the ACL on HLT, Volume 1, pp. 134-141, 2003.

[30]	J. D. Trent, "Toulmin's model of an argument: An examination and extension," Quarterly Journal of Speech, vol. 54, pp. 252-259, 1968.

[31]	Krippendorff’s Alpha. Wikipedia. http://en.wikipedia.org/wiki/Krippendorff%27s_Alpha

[32]	Springerlink.com http://www.springerlink.com/home/main.mpx

[33]	MALLET – MAchine Learning for LanguagE Toolkit. http://mallet.cs.umass.edu/

[34]	M.A. Angrosh, S. Cranefield and N. Stanger, "Context Identification of Sentences in Related Work Sections using a Conditional Random Field: Towards intelligent digital libraries," in Proc. of 10th Annual Joint Conf. on Digital Libraries, pp. 293-302, 2010.

[35]	Protégé: Ontology Editor and Knowledge-based Framework. http://protege.stanford.edu/

[36]	D2R Server. http://www4.wiwiss.fu-berlin.de/bizer/d2r-server/

[37]	SEWESE. JSP/Servlet Environment to build Semantic Web Applications. http://www-sop.inria.fr/teams/edelweiss/wiki/wakka.php?wiki=Sewese

[38]	M.A. Angrosh, S. Cranefield and N. Stanger, "Ontology-based Modelling of Related Work Sections in Research Articles: Using CRFs for developing semantic data based information retrieval systems," in 6th Intl. Conf. on Semantic Systems, 2010.