Volume 6 Issue 2
Fall 2010
ISSN 1937-7266

Multi-document Summarization of Information Science Research Papers

Kokil Jaidka

Wee Kim Wee School of Communication & Information
Nanyang Technological University
+65 - 67904460


Abstract

This project proposes to generate literature reviews from a set of research papers by modeling human summarization behavior. The key features of natural literature reviews will be identified through a macro-level and clause-level discourse analysis. Also, information selection strategies will be designed based on the mapping of referenced information to source documents. Preliminary results have helped to characterize the styles and discourse structure of literature reviews. These findings will be exploited to design templates for automatic content generation.

Categories and Subject Descriptors

I.2.7 [Artificial Intelligence]: Natural Language Processing – discourse, language generation, text analysis.

General Terms

Algorithms, Languages.


Keywords

Multi-document summarization, literature reviews, natural language processing, discourse analysis, rhetorical structure.

1. Introduction

The overall aim of this study is to develop a method to automatically summarize a set of research papers and generate a summary that mimics the characteristics of human literature reviews. To accomplish this, an analysis of the characteristics and structure of naturally-written literature reviews has to be carried out, and human strategies of information selection and summarization need to be understood. The research questions posed are:

  • How are literature reviews written, and what are their linguistic and functional characteristics?
  • How can a computer program be designed to automatically generate a human-style literature review from a set of related research papers?

The first step in this research project is to analyze how naturally-written literature reviews are organized and structured, and to identify the linguistic devices researchers use to make them coherent and cohesive. The next step is to study how researchers choose and extract relevant information from source papers. These findings will be used to design templates with the characteristics of human literature reviews. For this purpose, the following tasks have to be carried out:

  • To determine the discourse structure and functional structure of human-generated literature reviews in journal articles reporting information science research.
  • To determine the linguistic devices (e.g. cohesive links) used to construct a cohesive and coherent literature review text.
  • To develop a method to identify and extract relevant information from a set of source journal articles to serve the functions of a literature review.
  • To develop a text generation method to model the extracted information as a cohesive and coherent literature review.

2. Studies of Human Summarizing

In this study, the aim is to generate literature reviews from research papers in the field of information science. The target literature review is of the kind typically published as a section in a journal paper. Such literature reviews are written to distinguish what has been done from what needs to be done, to establish the context of the problem, to rationalize the significance of the problem, to synthesize earlier results and ideas, to gain a new perspective on an old problem, and so on [1] (p. 27). They are typically tedious to write because they involve the analysis of multiple studies and back-and-forth comparisons between them. An automatically generated multi-document summary is expected to be useful to researchers when a human-constructed literature review is not available. Moreover, an automatic summarization program that can be customized to model different kinds of literature reviews for different purposes may be even more useful. Such automatic literature reviews can suggest ways of synthesizing a set of related research papers and give authors ideas for writing their own literature review.

There have been some studies of human abstracting [2-3], examining how professional abstractors and authors write abstracts, and the characteristics of the resulting abstracts. Such studies provide insight into the human abstracting process and useful guidelines for auto-summarization systems as well. Cremmins [2] introduced an analytical reading model for abstract writing comprising three stages: retrieval reading, a rapid reading of the text; creative reading, a rereading of the text; and critical reading, the editing and proofreading of the abstract.

In their study of six expert abstractors, Endres-Niggemeyer et al. [3] concluded that the human summarization process comprises three stages. Through their work, they identified the abstracting strategies typically followed by professional summarizers. The stages and the strategies described by them are: document exploration, involving the genre and scheme analysis of the original text; relevance assessment, involving the identification of relevant excerpts for the summary from the original text; and summary production, involving cut-and-paste operations through a syntactic- and semantic-level representation.

A detailed description of the discourse structure of academic literature was first provided by Swales [4]. The Create a Research Space (CARS) model proposed by Swales described the discourse structure of the introduction section of research articles. It described the schematic structure of introductions as the application of three “moves”. Moves are the rhetorical movements in text that define its schematic structure and create a logical progression [4]. Each move may employ several “steps” to achieve its purpose. The model describes the schema of typical introduction sections as follows:

  • Move 1: Establishing a territory, by making topic generalizations and reviewing items of previous research
  • Move 2: Establishing a niche, by counter-claiming, indicating a gap or raising questions
  • Move 3: Occupying the niche, by outlining the purpose of the study and its findings

Using the rhetorical moves described by Swales and extended by Bunton [5] in a later study, Kwan [6] sought to apply the CARS model to literature review chapters and compare it with the rhetorical move structure of introduction chapters. She also identified “strategies”, the ways in which writers fulfil each rhetorical move. She found that the schema of a literature review chapter usually follows an Introduction-Body-Conclusion structure. Moves 1 and 2 of Bunton’s extended CARS model are usually applied recursively, but Move 3 is rarely used. One new type of move encountered was the writer’s affirmative stance towards the surveyed research. Also, none of the move elements followed a specific progression in the layout. These studies provide a preliminary insight into human summarizing strategies. In the current study, however, the rhetorical analysis aims to uncover the specific linguistic realizations used to frame the author’s argument in literature reviews, because automatic summarization requires a method to synthesize a literature review from its extracted form in ways that model typical linguistic expressions.

On the other hand, an algorithm for selecting information will be based on the findings from mapping information from literature reviews to the source papers to address the following questions:

  • Where do researchers select information from, to include in a literature review? For example, abstract, conclusion, etc.
  • What types of information do they select? For example, methodology, topic description, etc.
  • How do they transform that information? For example, paraphrasing, cut-pasting, etc.
  • Why do they choose this information? For example, justification, instantiation, etc.
  • How do they fulfil the functions of a literature review, using this information? For example, to delineate a literature gap, provide an overview of research, etc.

There has been some research on the individual questions listed above. Jing and McKeown [7] conducted a study identifying the kinds of transformations performed on source sentences before they are included in a summary. The findings were used to train a Hidden Markov Model classifier to detect source transformations, with the expectation that studying cut-and-paste transformations would improve the performance of automatic text summarizers. Their study focused on news-related, event-based summarizing at the syntactic level; the present study, in contrast, will map information to source text at different semantic levels and focuses specifically on summarizing strategies in information science literature reviews.
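
As a rough illustration of the cut-paste phenomenon Jing and McKeown studied (not their actual HMM classifier), a summary sentence that reuses a long contiguous run of source words might be flagged as cut-paste rather than paraphrase; the threshold below is an assumed parameter:

```python
def longest_common_run(summary_tokens, source_tokens):
    """Length of the longest contiguous token sequence shared by both texts."""
    best = 0
    for i in range(len(summary_tokens)):
        for j in range(len(source_tokens)):
            k = 0
            while (i + k < len(summary_tokens) and j + k < len(source_tokens)
                   and summary_tokens[i + k] == source_tokens[j + k]):
                k += 1
            best = max(best, k)
    return best

def transformation_type(summary_sentence, source_sentence, cutpaste_threshold=5):
    """Heuristic label: a long verbatim run suggests cut-paste, else paraphrase."""
    s = summary_sentence.lower().split()
    t = source_sentence.lower().split()
    return "cut-paste" if longest_common_run(s, t) >= cutpaste_threshold else "paraphrase"
```

A real classifier would, of course, learn such decisions from annotated summary-source pairs rather than rely on a fixed threshold.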

3. Multi-document Summarization

This study designs its approach with its ultimate goal in mind: to automatically generate a specific genre of discourse, the literature review, modeled on human writing characteristics. A few studies have attempted to summarize single or multiple research papers [8-10], but none yet yields summaries similar to human-generated literature reviews.

Text summarization has been defined as “the process of distilling the most important information from a source (or sources) to produce an abridged version for a particular user and task” [11] (p. ix). Most of the work in text summarization has focused on single-document summarization, that is, producing an abstract of a single document. Multi-document summarization, on the other hand, seeks to generate an integrated summary of a set of related documents. These techniques have commonly been applied to corpora of news articles to generate time- or event-based summaries [12]. They have also found some application in summarizing medical corpora [13-14].

The automatic summarization process comprises four main steps: text pre-processing, information selection and integration, summary generation and post-processing [11]. The novel approaches proposed in this system lie mainly in the information selection and integration stage, where information will be extracted at different semantic levels, and in the summary generation stage, where the literature review will be drafted from rhetorical templates. In previous research, important information has been identified at the sentence level using rhetorical relations indicating the source paper's objectives and contributions [9], and at the clausal level by extracting research concepts (nouns) and their relationships expressed in the text [15]. Important information can be selected based on its relevance to the subject of the final summary by measuring the correlation between a candidate topic and the source content [11], the summary topic [16] or the user query [17]. Unique information across the source documents can be identified by comparing source sentences, for example by taking the negative log of a likelihood ratio that measures semantic similarity [18], or by measuring the relative information gain of information with respect to the surrounding text [19]. For this system, the aim is to combine several of these strategies to select common as well as unique information from the source documents and generate a comparative literature review.
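
One of the selection strategies above can be sketched with a simple bag-of-words similarity: sentences in one paper that closely match sentences in another are treated as common information, and the rest as unique. This is an illustrative stand-in, not the log-likelihood measure of [18], and the threshold is an assumed parameter:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def common_and_unique(doc_a_sents, doc_b_sents, threshold=0.5):
    """Split doc A's sentences into those echoed in doc B (common) and the rest (unique)."""
    bows_b = [Counter(s.lower().split()) for s in doc_b_sents]
    common, unique = [], []
    for sent in doc_a_sents:
        bow = Counter(sent.lower().split())
        best = max((cosine(bow, other) for other in bows_b), default=0.0)
        (common if best >= threshold else unique).append(sent)
    return common, unique
```

Common sentences would feed comparative statements ("both studies ..."), while unique sentences would supply contrastive ones.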

The other important stage in multi-document summarization is summary generation. It is expected that this system will follow Saggion and Lapalme’s approach [10] of generating content by mapping information into templates incorporating linguistic, conceptual and relational information. However, their output was a domain-specific, single-document indicative abstract which would not fulfil the functions of a literature review. Nevertheless, their strategy will be adopted in order to use rhetorical relations to model the source corpus. This would be a useful way to represent and compare research descriptions and would also aid in summary generation on the basis of the methodological similarities and differences encountered.

The proposed system will pose different challenges from previously reported research because of its objective to automate the critique and comparison of reported research and to emulate human writing behaviour. The scope of the study is limited to information science research papers. The functional and linguistic elements of human literature reviews will be studied and the final review generated would be a comparative account of the research, rather than an indicative overview.

4. Proposed System

The proposed literature review generation system is shown in Figure 1. The novel approaches in this system would mainly be in the information selection and integration stage to select information from different semantic levels, and the rhetorical function implementation stage where the literature review will be drafted. These would draw from the analysis of human summarization strategies as described above. The literature review needs to be generated in a concise, coherent and cohesive form, closely matching the human style of composition and logic, which would be quite distinct from the traditional summary generation techniques. The pre-processing and post-processing techniques applied on the final summary will follow the traditional approaches described for extractive summarization systems. The readability of the literature review could also be enhanced by incorporating typical introductory sentences in the template and reusing them to provide a general idea of the topics being discussed, as in “Previous research in the field of ___” and so on. Besides this, a comparative account of cited studies may require the application of a unification grammar and linearization to correctly handle sentence formation and morphological structure.
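
The template-reuse idea described above can be sketched as slot-filling over canned rhetorical frames; the template wordings, slot names and example content below are hypothetical, not those of the actual system:

```python
# Hypothetical rhetorical templates; real templates would come from the
# discourse analysis of human-written literature reviews.
TEMPLATES = {
    "introduce_topic": "Previous research in the field of {topic} has examined {aspect}.",
    "delineate_gap": "However, {gap} has received little attention.",
    "compare_studies": "{study_a} and {study_b} differ in {dimension}.",
}

def realize(function_name, **slots):
    """Fill one rhetorical function's template with extracted content."""
    return TEMPLATES[function_name].format(**slots)

# Draft a two-sentence fragment by chaining rhetorical functions.
draft = " ".join([
    realize("introduce_topic", topic="multi-document summarization",
            aspect="news and medical corpora"),
    realize("delineate_gap", gap="the literature review genre"),
])
```

A fuller system would then apply the unification grammar and linearization mentioned above to smooth agreement and morphology across filled slots.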

Figure 1. The proposed literature review generation system

A content analysis of literature review sections of information science research papers was conducted. The main objective of this analysis was to understand human abstracting behavior and typify the discourse structure of information science literature reviews. Based on these findings, generic templates will be designed for future use in the automatic summarization system. For the data analysis, information science research papers were downloaded from online journal websites. Their literature review sections were carefully analyzed at three levels of detail to understand the features of the discourse structure and what they signified about human abstracting techniques.

5. Preliminary Results

The context of my study is automatic summarization and the ultimate goal is to engineer a method to automatically summarize research and to generate a literature review of it. Typically, this task is done by researchers who follow certain strategies and techniques while drafting a literature review.

A study of these techniques would provide valuable insights for automatic summarization systems as well. Besides human abstracting techniques, a study of the structure of real literature reviews would be of potential use in creating a literature review rubric. It may also be useful in estimating the quality of auto-generated literature reviews.

In this context, first the analysis of literature reviews was conducted to determine the discourse structure and functional structure of human-generated literature reviews in journal articles reporting information science research. This was followed by a study of the information selection and summarization strategies followed by writers of literature reviews, which will be emulated in the summarization algorithm. The following paragraphs describe the findings of preliminary studies.

5.1 Findings about Literature Review Discourse Structure

5.1.1 Sentence-level discourse structure

A coding scheme was developed to annotate the macro-level organization of literature reviews, focusing on the types of information. An XML schema was constructed to define the valid XML document structure used to represent the structure of a literature review, including the expected elements and their hierarchical relationships.

A set of macro-discourse elements that make up information science literature reviews were defined. Broadly, two types of functional elements were identified:

  • Descriptive elements, which describe topics, studies and concepts at different levels of detail.
  • Meta-elements, which indicate the reviewer’s comments and critique of cited studies.

It was observed that literature reviews may be either descriptive or integrative and their discourse structures are correspondingly different [20-21]. Both types of information science literature reviews follow a hierarchical structure and have a typical composition of discourse elements. On the basis of these findings, generic templates have been designed for the literature review generation system; one such template is seen in Figure 2.
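
The coding scheme might be represented along the following lines; the XML tags, attributes and example sentences below are an illustrative stand-in for the actual schema, which is not reproduced here:

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation distinguishing the two functional element types
# described above: descriptive elements and meta-elements.
SAMPLE = """
<literature_review type="integrative">
  <descriptive level="topic">Summarization research spans news and medicine.</descriptive>
  <descriptive level="study">Teufel and Moens classified rhetorical status.</descriptive>
  <meta>However, no system targets the literature review genre.</meta>
</literature_review>
"""

def element_counts(xml_text):
    """Return the review type and a tally of its top-level discourse elements."""
    root = ET.fromstring(xml_text)
    counts = {}
    for child in root:
        counts[child.tag] = counts.get(child.tag, 0) + 1
    return root.get("type"), counts
```

Tallies like these make it easy to verify the observation that integrative reviews carry proportionally more meta-elements than descriptive ones.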

5.1.2 Clause-level structure of a literature review

Rhetorical structure is the organization of rhetorical relations in a text, traditionally represented as a tree highlighting the relationships between the text’s constituent clauses. A template structure of literature reviews can be applied in the summarization system to automatically generate a literature review. With this goal in mind, the study of document macrostructure was taken a step further by examining literature review text at the inter-clausal level.

In this phase of discourse analysis, lexical patterns and rhetorical functions, which signaled the purpose of the macro-level elements, were identified from the text. The purpose was to uncover the structural elements’ primary rhetorical purpose in a literature review. This analysis identified 34 rhetorical functions as well as a variety of linguistic expressions for realizing these functions. It was observed that for descriptive literature reviews, writers choose more descriptive rhetorical functions and consequently more detail-oriented linguistic patterns, whereas for integrative literature reviews, they choose more of the critical rhetorical functions and consequently more argument-oriented linguistic patterns.

For the purpose of literature review generation, the system will initially be designed to realize commonly occurring rhetorical functions such as to introduce a topic, state a common topic, delineate a research gap, compare studies and so on, through a variety of linguistic expressions. In future work, it may be feasible to apply more or all of the rhetorical functions identified.
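
Cue-phrase matching is one plausible way to detect such rhetorical functions in text; the three functions and patterns below are hypothetical examples, not the 34 functions identified in the analysis:

```python
import re

# Hypothetical cue phrases; the actual linguistic realizations identified in
# the discourse analysis are not reproduced here.
CUE_PATTERNS = {
    "delineate_gap": re.compile(r"\b(however|little attention|has not been)\b", re.I),
    "compare_studies": re.compile(r"\b(in contrast|whereas|similarly)\b", re.I),
    "introduce_topic": re.compile(r"\b(previous research|prior work|studies of)\b", re.I),
}

def detect_functions(sentence):
    """Return the rhetorical functions whose cue phrases appear in the sentence."""
    return [name for name, pat in CUE_PATTERNS.items() if pat.search(sentence)]
```

Run in reverse, the same table can drive generation: pick a function, then choose one of its linguistic realizations.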

5.1.3 Intra-clause-level structure of a literature review

To study the intra-clause-level discourse structure, a previously tested approach will be adopted [15]. In that study, Ou et al. investigated four kinds of information embedded at the clausal level:

  • Research concepts, which are the subject nouns or noun phrases.
  • Relationships, which relate a pair of concepts or specify their attributes that are of particular interest in the research.
  • Contextual relations, which place the research in the context of a framework.
  • Research methods, which are mentions of the methods used by the researcher.

This approach will be extended in order to analyze the structure of the sentences describing research findings in the source papers.
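
A toy version of concept extraction can be sketched with a stopword-based chunker; a real implementation, as in [15], would rely on part-of-speech tagging rather than this assumed stopword list:

```python
# Illustrative stand-in for noun-phrase extraction: runs of non-stopwords are
# treated as rough candidate research concepts.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "for", "and", "to", "is",
             "are", "was", "were", "using", "with", "between", "we"}

def candidate_concepts(sentence, min_len=2):
    """Return runs of >= min_len consecutive non-stopwords as concept phrases."""
    tokens = [t.strip(".,;") for t in sentence.lower().split()]
    concepts, run = [], []
    for tok in tokens + [""]:          # empty sentinel flushes the last run
        if tok and tok not in STOPWORDS:
            run.append(tok)
        else:
            if len(run) >= min_len:
                concepts.append(" ".join(run))
            run = []
    return concepts
```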

5.2 Findings about Information Selection Strategies

In the next phase, the goal is to identify the strategies followed by human reviewers to select information from research papers and organize and synthesize it in a literature review. The purpose of this phase of analysis is two-fold:

  • To find out which parts of a source research paper researchers select information from
  • To find out what types of information are selected

From a preliminary analysis, the following typical information selection strategies were observed:

  • Researchers show a marked preference for selecting information from certain sections of the source paper such as its abstract, conclusion and methodology.
  • Descriptive elements of literature reviews usually contain information that is cut-and-pasted or paraphrased from the source papers.
  • Integrative elements of literature reviews usually summarize information through inferring and generalization techniques for a higher-level overview.

6. Evaluation

Researchers have typically used two methods to quantify the performance of summarization programs. One is to compare the machine’s output with an “ideal” hand-written summary produced by an editor or a domain expert; this is called intrinsic evaluation and is the more widely used of the two. The other, extrinsic evaluation, measures the usefulness of a summary in helping someone perform an information processing task [22]. Both methods are known to be very sensitive to basic parameters of the experimental context, such as summary length. Manual evaluation was the accepted paradigm in the Document Understanding Conferences, which standardized evaluation for multi-document summarization as a pair-wise comparison of generic and task-based summaries against reference summaries drafted by ten evaluators or abstractors [23]. However, it has been argued that human-written summaries tend to disagree with each other with regard to comprehensiveness and information overlap [24]. To avert this problem, the ROUGE measures were developed and have become a popular paradigm for automatic evaluation [25]; ROUGE is a recall-based n-gram co-occurrence metric measuring content overlap between automatic and reference summaries. Similarly, BLEU [26] is a precision-based mixed n-gram metric which favours longer n-grams to reward grammaticality and word order.
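
The recall-oriented n-gram overlap at the heart of ROUGE-N can be sketched as follows (a minimal illustration, not the official ROUGE toolkit, which also handles stemming, stopwords and multiple references):

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    """Recall-style score: clipped candidate n-gram matches over total reference n-grams."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum(min(count, cand[g]) for g, count in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0
```

Because the denominator is the reference's n-gram count, a short candidate that covers much of the reference content still scores well, which is why ROUGE is described as recall-based.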

Figure 2. One of the possible macro-level templates for the literature review generation system


The measures discussed above are traditionally used to evaluate the completeness of an automatic summary. However, the nature of my research problem suggests that its evaluation should center on “human-ness”, or how similar the output is to a human literature review. This implies that its significant qualities should primarily be cohesiveness, a sound rhetorical structure and easy readability, so the evaluation should focus on discourse quality rather than comprehensiveness. A manual evaluation can measure these features and is thus the appropriate choice for my system. The summarization system will be evaluated at two levels:

  • Surface level evaluation: evaluating the coherence and cohesion of the literature review
  • Content evaluation: evaluating the completeness of the content of the literature review

The quality of the literature review generated may depend on many variables, such as the number of source articles and the length and type of literature review to be generated. Each of the two levels of evaluation will be conducted using a panel of information science researchers. The surface-level evaluation will use metrics to judge the readability of the final literature review. The content-level evaluation will assess the completeness of the information contained in the final literature review. The following evaluation questions will be answered:

  • How logical is the structure of the literature review?
  • How cohesive is the text of the literature review?
  • How much of the source content is included in the final summary?
  • How much of the content matches the content deemed as important by the panel?

Another limitation of the evaluation may be the unavailability of reference summaries against which to judge the automatic literature review. The evaluation process may therefore judge the comprehensiveness and the discourse quality of the final literature review rather than compare it with other sample literature reviews.

7. Forthcoming Challenges

The results of the literature review generation stage will serve as a proof of concept to affirm theories about human abstracting behavior and their emulation in automatic summarization. It will also test the performance of existing methods in generating a summary with a specific discourse structure. However, due to the broad scope of the research problem, several challenges remain:

  • Information required is at the conceptual level. It would be a challenging task to develop an automatic information selection strategy aimed at synthesizing a critique or comparison. This is because the information applied in this case is conceptual and cannot be uncovered by selecting sentences or clauses. This challenge can be overcome by picking out concept-level information present within clauses and using these concepts as arguments for synthesizing critiques and comparisons.
  • Discourse template is only indicative. The generic templates provided are merely indicative of a set of possible discourse structures. It is not clear which kind of structure should be preferred over others, and when. It is also not clear which discourse elements should be chosen over others while customizing a template. In such a situation, different choices would yield different styles and different qualities of literature reviews.
  • Choices in representing information. Each rhetorical function has several linguistic manifestations; it is not clear which linguistic expressions should be used when. Also, the user may want to focus on a single rhetorical function in the generation stage. These are user-dependent variables that can be added in order to customize the type and scope of the final literature review.
  • Generating human-style integrative discourse elements. The rhetorical function implementation stage would be the most complex task since a method is required to generate abstractive text comparing and criticizing research studies. This will be done by comparing information concepts stored in lexical chains. The final output literature review is expected to be close to a human-written literature review in its layout and rhetorical organization. This is a big challenge as the current state-of-the-art does not mimic human-style presentation and discourse structure; however, the goal of this project is to attempt to emulate them by studying human summarizing strategies in detail.
  • Evaluation. Typical methods of evaluation have compared automatic summaries against reference summaries written by humans to judge their completeness. However, there is no gold standard against which literature reviews can be compared and evaluated for their “human-ness”. So the judges may need to evaluate the comprehensiveness and the discourse quality of the final literature review rather than rate it relative to other sample literature reviews.
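
The lexical-chain comparison mentioned above can be sketched as a greedy grouping of related terms; the relatedness table below is an assumed stand-in for WordNet-style relations, which a real system would use instead:

```python
# Hypothetical relatedness table; real lexical chains are built over
# WordNet synonymy/hypernymy relations, not a hand-written dictionary.
RELATED = {
    "summary": {"abstract", "summarization"},
    "corpus": {"dataset", "documents"},
}

def related(a, b):
    """Two terms are related if identical or linked in the relatedness table."""
    return a == b or b in RELATED.get(a, set()) or a in RELATED.get(b, set())

def build_chains(terms):
    """Greedily attach each term to the first chain containing a related member."""
    chains = []
    for term in terms:
        for chain in chains:
            if any(related(term, member) for member in chain):
                chain.append(term)
                break
        else:
            chains.append([term])
    return chains
```

Chains built from two source papers could then be aligned to find shared concepts worth comparing in the generated review.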

8. Conclusion

After considering the forthcoming challenges, it is apparent that it may be necessary to scale down the goals of this project and to focus the auto-summarization task on generating text for only a few discourse elements of a literature review. Given the large scope of the project, it may not be feasible to generate an entire literature review automatically. Since every element of a literature review has a unique representation style, it may be more prudent to develop an auto-summarization method that generates one or a few types of discourse elements, leaving other aspects for future work. Also, in order to focus on perfecting the proposed approach for rhetorical function implementation, other phases in the summarization process may need to be conducted manually. This would test the performance of the summarization approach under the assumption that the pre-processing steps function perfectly.

The main contribution of this research would be towards finding out how humans conduct literature reviews and how the process may be emulated by machines. The discourse structure and the rhetorical functions identified towards this end will aid linguists and information scientists alike.

The techniques developed for multi-document summarization through this study would contribute to applications of auto-summarization to research information; if successful, this work would also yield a viable research tool for reviewing large bodies of literature. Applications can also be imagined in web-based databases and digital libraries, giving users a convenient and speedy method for browsing available content.

9. Acknowledgements

The author would like to thank her dissertation supervisors, Christopher S.G. Khoo and Jin-Cheon Na, at the Wee Kim Wee School of Communication and Information, Nanyang Technological University. Their continuing support and guidance is invaluable to the progress of this project.

10. References

[1] C. Hart, Doing a Literature Review. London: Sage, 1998.
[2] E. T. Cremmins, The Art of Abstracting. Arlington, VA: Information Resources Press, 1992.
[3] B. Endres-Niggemeyer, E. Maier, and A. Sigel. “How to implement a naturalistic model of abstracting: four core working steps of an expert abstractor,” Inf. Process. Manage., vol. 31, no. 5, pp. 631-674, Sep. 1995.
[4] J. Swales, Genre Analysis: English in academic and research settings. New York, NY: Cambridge University Press, 1990.
[5] D. Bunton, “Generic moves in Ph.D Introduction chapters”. In Academic Discourse, J. Flowerdew, Ed. London: Longman Publishing Group, 2002.
[6] B.S.C. Kwan. “The schematic structure of literature reviews in doctoral theses of applied linguistics,” English for Specific Purposes, vol. 25, no. 1, pp. 30-55, 2006.
[7] H. Jing and K.R. McKeown, “The decomposition of human-written summary sentences,” in Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1999, pp. 129-136.
[8] J. Kupiec, J. Pedersen, and F. Chen, “A trainable document summarizer,” in Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1995, pp. 68-73.
[9] S. Teufel and M. Moens. “Summarizing scientific articles: experiments with relevance and rhetorical status,” Comput. Linguist., vol. 28, no. 4, pp. 409-445, Dec. 2002.
[10] H. Saggion and G. Lapalme. “Generating indicative-informative summaries with sumUM,” Comput. Linguist., vol. 28, no. 4, pp. 497-526, Dec. 2002.
[11] I. Mani and M. T. Maybury, “Introduction,” in Advances in Automatic Text Summarization, I. Mani and M.T. Maybury, Eds. Cambridge, MA: MIT Press, 1999.
[12] B. J. Dorr and T. Gaasterland. “Exploiting aspectual features and connecting words for summarization-inspired temporal-relation extraction,” Inf. Process. Manage., vol. 43, no. 6, pp. 1681-1704, Nov. 2007.
[13] K.R. McKeown, S. Chang, J. Cimino, S. Feiner, C. Friedman, L. Gravano, et al., “PERSIVAL, a system for personalized search and summarization over multimedia healthcare information,” in Proceedings of the 1st ACM/IEEE-CS Joint Conference on Digital Libraries, 2001, pp. 331-340.
[14] N. Elhadad, M. Kan, J. L. Klavans, and K. R. McKeown. “Customization in a unified framework for summarizing medical literature,” Artif. Intell. Med., vol. 33, no. 2, pp. 179-198, Feb. 2005.
[15] S. Ou, C.S. Khoo, and D.H. Goh. “Design and development of a concept-based multi-document summarization system for research abstracts,” J. Inf. Sci., vol. 34, no. 3, pp. 308-326, June 2008.
[16] D. Zajic, B. J. Dorr, J. Lin, R. Schwartz. “Multi-candidate reduction: Sentence compression as a tool for document summarization tasks”. Inf. Process. Manage., vol. 43, no. 6, pp. 1549-1570, Nov. 2007.
[17] T. He, W. Shao, H. Xiao, and P. Hu, “The Implementation of a Query-Directed Multi-document Summarization System,” in Proceedings of the Sixth International Conference on Advanced Language Processing and Web Information Technology, 2007, pp. 105-110.
[18] J.D. Schlesinger, D.P. O’Leary, and J.M. Conroy, “Arabic/English Multi-document Summarization with CLASSY – The Past and the Future,” Lecture Notes in Computer Science, vol. 4919, 2010, pp. 568-581.
[19] R. Feldman and J. Sanger, The Text Mining Handbook: Advanced Approaches to Analyzing Unstructured Data. Cambridge, England: Cambridge University Press, 2006.
[20] C. Khoo, J.-C. Na, and K. Jaidka. “Analysis of the Macro-Level Discourse Structure of Literature Reviews,” Online Inf. Rev., in press.
[21] K. Jaidka, C. Khoo, and J.-C. Na, “Imitating Human Literature Review Writing: An Approach to Multi-Document Summarization,” in Proceedings of the 12th International Conference on Asia-Pacific Digital Libraries, 2010, pp. 116-119.
[22] P. Jackson and I. Moulinier, Natural Language Processing for Online Applications: Text Retrieval, Extraction and Categorization. Amsterdam: John Benjamins Publishing Company, 2002.
[23] DUC, The Document Understanding Conference, 2002. http://duc.nist.gov
[24] A. Nenkova, R. Passonneau, and K.R. McKeown. “The Pyramid Method: Incorporating human content selection variation in summarization evaluation,” ACM Trans. Speech Lang. Process, vol. 4, no. 2, May 2007
[25] G. Doddington, “Automatic evaluation of machine translation quality using n-gram co-occurrence statistics,” in Proceedings of the Second international Conference on Human Language Technology Research, 2002, pp. 138-145.
[26] K. Papineni, S. Roukos, T. Ward, and W. Zhu, “BLEU: a method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, 2002, pp. 311-318.