Investigating and Assisting Interactive Document Triage
in Electronic Environments
Centre for Human Computer Interaction Design
School of Informatics
City University London
Fernando.Loizides.1@soi.city.ac.uk
Abstract
Document triage is the critical point when information seekers first decide the relevance of a document to their information need [1]. This process is not yet clearly understood. To support the effective search for information within digital societies, an examination and formalisation of this complex process is undertaken. In this abstract we present findings from user studies, which guide the formulation of models for the document triage process. Prototype high-fidelity software tools to assist information seekers are also introduced and demonstrated. Future work is outlined, aiming to refine the understanding and development of this field of research.
1 Introduction
Document triage takes place mostly before the documents are read in depth. It usually occurs after an automatic information retrieval task in which a series of documents are returned in one form or another. Document triage exists in informal scenarios as well as formal settings. Generally, after an information seeker enters a query in a search engine, the search engine returns a series of results. From this point on the information seeker is given very little assistance in exploring the returned documents and making the final decision of which to accept or reject. We use the term document to describe entities containing textual and image information. The main focus of the project, however, lies within the electronic domain.
Research Aims and Motivation Although document triage plays a critical role in the information seeking process, it has received very little direct attention, as we will see in the related work section. In general, research into document triage is sporadic at best, with work focusing mainly on the relevance decisions directly from automatic information retrieval systems. Sometimes this stage is bypassed or dismissed completely. Work targeting document triage directly has focused mainly on elements of a document and cognition rather than how information seekers interact with these. Document triage is a highly interactive process and we do not yet know how users encounter, process and use the data provided.
The aim of this research is to gain insight into the internal workings of this process by studying the behaviour of information seekers during document triage. By observing what information seekers actually do we can characterize the steps in the process rather than relying on what people say or think they do. It is by understanding and linking this behaviour to relevance decisions that we can infer how these decisions are influenced. It is the aim of this research to derive navigational behavioural models for document triage. Each will focus on specific elements of the document triage process. We will also study how interactive tools help, or in some cases hinder, effective document triage. We can then proceed to identify weaknesses and provide software aids to assist the document triage process.
Research Scope Although we touch upon the physical medium, the main area of research lies within document triage in the electronic domain. Most triage is now performed using electronic means and it is for this reason that it can give us more insight and a representative data pool. A suitable starting point (and the main focus of this research) targets scholars searching academic documents. This gives us a more controlled and well understood environment, which we can use effectively to represent realistic conditions. Although this is the main study scope, results are cross examined with other user groups, either by related work or by internal research, to validate findings and identify differences with the wider information seeker group. This is discussed in further detail in the methodology section (see Section 4). Using exploratory findings we form a navigational behavioural model for document triage, which depicts common patterns and predicts the actions and outcomes of information seekers. By focusing and studying each part of the model we can verify and refine the details. The model is discussed in detail in Section 3.
Once behaviours and weaknesses in the triage process have been identified we develop and test interactive tools and technological aids intended to assist information seekers as well as researchers. These tools are under development and are described again in Section 3. EmpTriage is a purpose-built document reader aimed at researchers studying how information seekers search for information within documents. It is customized to gather empirical data in real time while a user performs triage on individual documents. SmartFind is a within-document search facility, which builds on the basic search feature, commonly known as Control-F. It allows the information seeker to perform a much more dynamic search to find areas of interest within a document and therefore better assess its relevance. The third aid will utilize information visualization tools (custom made or otherwise). These tools will allow the researcher to identify ways of portraying areas of interest and the relationships that exist between these and relevance decisions during document triage. By reverse engineering this information, it is then possible to adapt this tool to provide relevant information to the seeker that will assist him or her in distinguishing relevant from non-relevant documents.
Abstract Structure This extended abstract is divided into four subsequent sections. Section 2 presents the state of the art research relating to document triage as well as background work that the project is built upon. It identifies significant problems in the current area of research upon which this research intends to build and improve. Section 3 reports work already performed and describes the findings so far. Section 4 describes the methodology that has been used thus far and that will be used for the future of the work. Section 5 concludes with a brief summary of the discussion in this abstract and reminds the reader of the contributions to the research area this project aims to achieve.
2 Related Work
The work presented here falls under the larger umbrella of information seeking. Several models exist explaining this area, both from a behavioural and a cognitive approach. Gary Marchionini describes the information seeking process using eight sub-processes, spanning from the recognition and acceptance of an information problem to the "stop searching" sub-process [7]. The document triage process lies within the domain of the last three sub-processes: Examine Results, Extract Info, Reflect/Stop. Marchionini identifies that the way in which the results of a query are returned affects both the information seeker's relevance judgments and how the results are examined. This, combined with the seeker's personal information infrastructure, drives the seeker towards his or her goal. The need for effective document triage is made clear. "Information must be examined and assessed during browsing. The information seeker's challenge is to make rapid judgments of relevance, and the designers' challenge is to provide flexible display facilities for examination and assessment."
There exists literature that hints at behaviour indirectly related to document triage. A study of information seeking tasks ranging from the input of query terms to relevance decisions and information retrieval tools gives us insight into patterns. Furthermore, users often identify elements that attract or repel information seekers from exploring information or behaving in certain ways. Liebscher claims that information seekers will go through substantial sets of textual or graphic documents if they are given appropriate display and control mechanisms [13]. Further evidence that interaction benefits the document triage activity is found in [10], which states "user activity with a document was positively correlated with an overall assessment of document value".
We know that users enter between two and four queries per session (with the exception of queries of a sexual nature) [11]. When these queries return results, information seekers deal with the information returned in specific ways. For example, users rarely change the default values of an IR system to alter the results. When the advanced features are used, they are used incorrectly about a third of the time. This testifies to a) the unwillingness of users to alter settings and b) the obscurity of how to use these extra settings. Some in information retrieval believe that information seekers would generally benefit from training in using systems. When users are asked whether they believe additional training would be of benefit, however, they respond negatively, saying that they do not think of it as beneficial and that they would avoid this sort of training. It is therefore reasonable to assume that any triage tool should either have fixed settings or contain an automatically adaptive user interface. Furthermore, the larger the number of results returned, the less likely it is for information seekers to view the actual documents [8]. There is an abundance of literature to suggest this. It is rare for information seekers to navigate past the second result list page [11] except in special circumstances [5]. The reasons given for this are hypothetical and state that "perhaps perseverance wanes after the second page; users have no need to continue because ranking algorithms do such a good job at placing relevant retrieved Web pages on the top, or they turn to an alternative web search engines to satisfy their information need."
A model closely related to the document triage process was discussed by Peiling Wang in [12]. This cognitive model portrays how users deal with the selection of documents throughout the information seeking process and not from every individual search. Furthermore, participants were encouraged to read the documents in further depth than the definition of document triage and only the cognitive aspect is discussed in detail, gathered from think aloud comments.
3 Current Results
The initial stage of research involves understanding the document triage process. It should be noted that the work summarised in this paragraph was part of an M.Res thesis and not of the Ph.D. It is however the starting point of the research and so it is suitable to mention the work here. In summary, the findings identified common as well as inconsistent behaviour between mediums. A model was suggested that encompassed cognitive and procedural insight into the document triage process. Details can be found in the thesis volume [3] and in [2].
Empirical Data Following the original study it was considered appropriate to perform a further user study in a similar setting with more specific goals in mind, as well as a more empirical and detailed approach to data gathering. For this purpose a bespoke PDF reader (EmpTriage) was created. It allowed detailed information to be logged in XML format in real time and analysed later. Details registered included position within the document, zoom levels and speed of scrolling. It also gave the investigator full control to pause and resume logging when participants were not actively engaged with the system, to prevent false data. Twenty participants with previous experience in reading academic documents were recruited and given a set of documents to triage with regard to a specific information need. The participants used the custom-made document reader in a laboratory-based study that gives us an initial assessment of the influence of visual features within documents on user behaviour.
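As an illustration of the kind of logging described above, the following is a minimal sketch in Python. It is not the EmpTriage implementation: the class name, the XML element and attribute names, and the use of pixels and seconds as units are all assumptions made for the example. It records position and zoom, derives scroll speed from consecutive samples, and honours an investigator-controlled pause.

```python
import xml.etree.ElementTree as ET


class TriageLogger:
    """Collects viewer samples and serialises them as XML (hypothetical schema)."""

    def __init__(self, participant_id, document_id):
        self.root = ET.Element("session", participant=participant_id, document=document_id)
        self.paused = False
        self._last = None  # (timestamp, position) of the previous logged sample

    def pause(self):
        # Investigator pauses logging; forget the last sample so that the
        # idle gap is not later mistaken for a slow scroll.
        self.paused = True
        self._last = None

    def resume(self):
        self.paused = False

    def record(self, timestamp, position, zoom):
        """Log one sample. timestamp in seconds, position in pixels (assumed units)."""
        if self.paused:
            return None
        speed = 0.0
        if self._last is not None:
            elapsed = timestamp - self._last[0]
            if elapsed > 0:
                speed = abs(position - self._last[1]) / elapsed
        self._last = (timestamp, position)
        return ET.SubElement(
            self.root,
            "event",
            time=f"{timestamp:.3f}",
            position=str(position),
            zoom=f"{zoom:.2f}",
            scrollSpeed=f"{speed:.1f}",
        )

    def to_xml(self):
        return ET.tostring(self.root, encoding="unicode")
```

A session would call `record` on every viewer update and write `to_xml()` to disk at the end; samples offered while paused are simply dropped.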
The main aim of the current research is to identify user patterns during the practice of document triage. These can be behavioural, procedural or cognitive in nature. By understanding the internal workings employed by document triage we can speculate and predict users’ actions. Since the document triage stage is a prerequisite to the relevance decision process, we can use these patterns to determine the effects between the two stages. Furthermore, we are able to then see how different tools (already existing or custom made), disciplines and situations influence the document triage behaviour and cause users to deviate from their normal activities, either in a positive or negative manner.
By extracting the high level navigational patterns of the participants when viewing documents, we are able to distinguish patterns in navigational behaviour common to all participants. Such behaviour consists of the linear progression through a list of documents for triage rather than a random pick. Furthermore, when looking in detail into the navigation of each individual document we witness recurring navigational behaviours. The most common is a behaviour dubbed step navigation. Here the user begins triage by a prolonged view of the initial page before linearly scrolling the rest of the document, stopping on areas of interest for short periods of time. Other popular navigational behaviours include the flatline effect, where the user simply reads the first page; the mountain effect, where users begin the common step navigation but subsequently reproduce the reflective behaviour returning to the beginning of the document; and the begin and end, where the user reads the initial page and the conclusion section.
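The four named behaviours can be told apart from a simple trace of the pages a participant viewed. The sketch below is only one possible set of heuristics, written for illustration; the function name, the input format and the rules themselves are assumptions and do not reproduce the coding scheme used in the study.

```python
def classify_navigation(pages_viewed, last_page):
    """Label a per-document page trace with one of the patterns named in the text.

    pages_viewed: page numbers in the order they were shown (1 = first page).
    last_page: page holding the conclusion (assumed to be known in advance).
    The rules are illustrative heuristics, not the study's coding scheme.
    """
    # Collapse consecutive repeats so that dwelling on a page counts once.
    trace = [p for i, p in enumerate(pages_viewed) if i == 0 or p != pages_viewed[i - 1]]
    if not trace or trace[0] != 1:
        return "other"
    if trace == [1]:
        return "flatline"  # only the first page was read
    if trace == [1, last_page] and last_page > 2:
        return "begin and end"  # jump from the first page straight to the conclusion
    peak = trace.index(max(trace))
    # True when the trace climbs one page at a time up to its furthest point.
    rising = all(b == a + 1 for a, b in zip(trace[:peak + 1], trace[1:peak + 1]))
    if rising and peak == len(trace) - 1:
        return "step"  # linear progression, never turning back
    if rising and trace[-1] == 1:
        return "mountain"  # linear progression, then a return to the start
    return "other"
```

For example, the trace `[1, 1, 2, 3, 1]` collapses to `[1, 2, 3, 1]` and is labelled as the mountain effect.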
We also look at the impact of other visual factors, such as images and headings within the document, on viewing times in order to discern whether or not they are likely to have an effect on relevance decision making. Findings suggest that document length is a hindrance: the level of detail with which a page is viewed depends on its location in the document. In other words, the further into a document a page lies, the less likely it is that an information seeker will view it, and those who do view it do so for a shorter time. Featureless views such as plain text receive very little attention and, being viewed while scrolling, are unlikely to contribute to the decision. Headings, pictures and emphasised text attract the user's attention and are considered to have a modest impact on the decision process. The conclusion section seems to play a summarising and verifying role in the decision. A deeper discussion and further details of the results can be found in [4]. One common and predominant finding between the two observational user studies presented in this section is the importance of the first page of a document. Participants were found to devote much of their time, and in many cases all of their time, to the scrutiny of the initial page of the document. Since the first page includes the main title, the abstract and, in many cases, the bulk of the introduction, it is very difficult to assess precisely which features information seekers focus on and for how long. For this reason, a new user study is to be conducted which uses an eye-tracker to record the fixations of an information seeker's gaze accurately. This study is discussed in the next section.
Within-Document Searching Another common fact that surfaced from the participants’ comments from both user studies was the perceived usefulness of the common search feature also known as Control-F. Users are aware of its existence and claim it to be of utmost importance and beneficial to any document triage task. They also claim to often use it to find information they need both in unfamiliar and familiar documents. However, what people say they do and what they actually do are not always consistent. This proved to be the case here, with participants rarely utilising the search feature and stopping after the first search result when they actually did. It became evident that there seemed to be serious limitations in the capabilities of the feature, which lacked the ability to facilitate the needs of the information seekers.
In light of these findings, further scrutiny of the use of the basic search feature was undertaken in two further experimental settings. Eight participants were asked to perform a detailed diary study of their search habits in terms of triage of information. Participants were interviewed before, half-way through and after the study for qualitative feedback. If the topic of Ctrl-F and its usefulness did not surface from the participants' comments, a direct question was asked to elicit details. Care was taken, however, not to overemphasize the importance of Ctrl-F and thus affect the study. The participants were allowed between one week and one month to complete the diary study so that they could work at their own pace and the study would be as unintrusive as possible. Interestingly, none of the participants in the diary study recorded using Ctrl-F.
Further to the diary participants, eight experts in the field of computer science were chosen and directly introduced to the idea of the limitations of the basic search feature. A high fidelity prototype of an advanced search feature was created and introduced to them. This prototype conducts a search on a text document using the term frequency x inverse document frequency (TF-IDF) calculation to rank set document parts based on the query terms of the user. It then returns the highest ranked results first and cycles through the remainder. The matching words are highlighted for easier location, and the Porter stemming algorithm is applied to allow for variation and flexibility in word matching. Participants were asked to complete a series of search tasks to find both specific and more general information so as to test the prototype in both scenarios. They were subsequently given the basic search feature to compare the results of the searches and were permitted to ask any questions while they thought aloud in parallel with their actions. In general the experts agreed that the search capabilities of the basic feature are limited in scope. Stemming was considered a great advantage, as was the ability to rank document passages on individual words rather than an exact phrase match. More details on the studies can be found in [6].
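The ranking step described above can be sketched as follows. This is an illustrative outline only, not the prototype's code: the function names are hypothetical, the smoothed IDF formula is one common variant chosen for the example, and `crude_stem` is a deliberately simple suffix stripper standing in for the Porter algorithm, which is too long to reproduce here.

```python
import math
import re
from collections import Counter


def crude_stem(word):
    """Very rough suffix stripping; a placeholder, NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lower-case the text, keep alphabetic words and stem each one."""
    return [crude_stem(w) for w in re.findall(r"[a-z]+", text.lower())]


def rank_passages(passages, query):
    """Return indices of matching passages, highest summed TF-IDF first."""
    docs = [Counter(tokenize(p)) for p in passages]
    n = len(docs)
    query_terms = set(tokenize(query))
    scores = []
    for i, counts in enumerate(docs):
        total = sum(counts.values()) or 1
        score = 0.0
        for term in query_terms:
            df = sum(1 for d in docs if term in d)
            if df == 0:
                continue
            idf = math.log((1 + n) / (1 + df)) + 1.0  # smoothed IDF (assumed variant)
            score += (counts[term] / total) * idf
        scores.append((score, i))
    # Highest score first; ties keep document order; non-matching passages dropped.
    ordered = sorted(scores, key=lambda s: (-s[0], s[1]))
    return [i for score, i in ordered if score > 0]
```

Because query and passage words pass through the same stemmer, a query for "search" also matches "searching" and "searched", which is the flexibility the experts valued. A search facility would then step through the returned indices in order, highlighting the matched words in each passage.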
Formulating a Navigational Behavioural Model As mentioned before, one of the main aims of this project is to create valid models for the document triage process. By gathering the quantitative and qualitative data described above we are able to begin the formulation of one of the models, namely that of the behavioural actions of the information seeker when performing triage on an individual academic document. Before we attempt to generalise the model to represent all disciplines and levels, however, it is vital to study its effectiveness in a range of scenarios. For this reason two further studies were conducted to fortify the existing results.
The first study involved extending the diary studies mentioned in the previous subsection to include 12 more participants from a wider spectrum of disciplines. Participants' ages ranged from 12 to 50, and their disciplines included law, criminology, computer science, engineering, physics, accounting and finance, art, interior design, graphic design and fashion. Unemployed users as well as students under the age of 18 attending school were also selected. Although some of the documents involved were largely unrelated to academic documents, structural elements of academic documents such as the headings and the abstract can correspond to titles and tables of contents respectively.
Attention was also given to an often neglected group, namely that of novice information seekers. We selected 20 participants from a computer science background with no previous experience in academic research, and introduced them to the academic research methodology. After a brief acquaintance with a well-known digital library, they were asked to write and justify a proposal with references within 5 days. Screen capture, notes and interviews were taken throughout the study. We found evidence that the title plays an even more important role in relevance decisions for novice information seekers than the complete abstract. It was also found that during the initial stages participants would triage the documents on the first results page before entering another query. However, as participants became more familiar with the library and their subject area they would continue to scan three to five pages deep within the results list, contradicting previous work in [11] in this specific case. More detail on this study can be found in [5].
Fig. 1. Navigational Behavioural Model for Single Document Triage
By gathering and analysing all the data mentioned above, we introduce a navigational behavioural model for document triage. This model can be found in Figure 1. In short, we see here that information seekers performing document triage will always navigate to the first page of the document as a starting point for their triage activity. From there they are likely to take one of three courses of action. The first is to proceed linearly, skimming the document pages. The second is to navigate to the conclusion part of the document and the third is to end the triage process without further navigation. When participants are skimming the within-document features, such as the images and headings of the document, they are likely to continue doing so until the conclusion is reached or to navigate back to the initial page. It is rare for users to stop navigation during this period. The conclusion section of a document is a sought-after part of the document in terms of triage navigation. From there users either overshoot to the references section, end the search or return to the initial page for further triage. It is also possible, but less probable, that users return to the within-document features from this point.
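The transitions described in this paragraph can be written down as a small directed graph, which makes it straightforward to check whether an observed navigation path is consistent with the model. The sketch below is illustrative only: the state names are invented for the example, the rare case of stopping while skimming is deliberately left out, and no probabilities are attached because none are given in the text.

```python
# States and allowed moves, as read from the paragraph above (state names are assumptions).
TRANSITIONS = {
    "first_page": {"skim_features", "conclusion", "end"},
    "skim_features": {"conclusion", "first_page"},  # stopping here is described as rare
    "conclusion": {"references", "end", "first_page", "skim_features"},
    "references": set(),  # what follows an overshoot is not specified in the text
    "end": set(),
}


def follows_model(path):
    """Return True if the path starts on the first page and uses only listed moves."""
    if not path or path[0] != "first_page":
        return False
    return all(nxt in TRANSITIONS.get(cur, set()) for cur, nxt in zip(path, path[1:]))
```

Paths that fail the check are not necessarily errors; they mark sessions worth inspecting by hand when refining the model.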
4 Research Methodology
Future research into the area of document triage targets more specific points to extract further details and to refine the process. These include a series of observational lab studies, diary studies and high fidelity prototyping with system evaluation. As seen in the previous studies, quantitative as well as a wealth of qualitative information is obtained. Empirical data is handled with appropriate statistical analysis whereas qualitative information is useful in order to introduce grounded theory to the document triage modelling process.
Developing Advanced Within-Document Searching Tools The next stage of the research aims to build upon the results specific to the basic search feature (Control-F). A new approach to within-document searching is that of passage retrieval. In short, the document is divided into sections, by paragraph or some other measure, and each section is given a ranking. The proposition is for a novel method that scores sections by the number of words matching the query terms. The reasoning is that partitioning a document by a fixed unit, such as a set number of words, paragraphs or pages, gives sections whose length is fixed in advance. When the number of relevant words is then divided by the total number of words in a section, the result is often unrepresentative because a single relevant region may have been split across two sections. By instead deriving section boundaries from the distance between matching query terms, we can create variable length sections, each representing a region densely populated with relevant terms. Large non-relevant sections are ignored and therefore calculation times are also reduced. By building a high fidelity prototype we can empirically and qualitatively test the effectiveness of this method using a laboratory based observational study simulating a real triage scenario.
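One way to read the proposal is sketched below: query-term hits that lie close together are merged into a single passage, and long stretches without hits never form a passage at all. This is an interpretation for illustration, not the method under development; the function name, the `max_gap` threshold, exact-word matching without stemming, and the ordering by hit count and then density are all assumptions.

```python
import re


def dense_passages(text, query_terms, max_gap=5):
    """Group query-term hits that lie within max_gap words of each other.

    Returns (start, end, hits, density) tuples over word indices, ordered by
    hit count and then density. max_gap and the ordering are assumptions.
    """
    words = re.findall(r"\w+", text.lower())
    terms = {t.lower() for t in query_terms}
    hit_positions = [i for i, w in enumerate(words) if w in terms]

    passages = []
    start = prev = None
    count = 0
    for i in hit_positions:
        if start is None:
            start, prev, count = i, i, 1
        elif i - prev <= max_gap:
            prev, count = i, count + 1  # close enough: extend the current passage
        else:
            passages.append((start, prev, count, count / (prev - start + 1)))
            start, prev, count = i, i, 1  # too far: begin a new passage
    if start is not None:
        passages.append((start, prev, count, count / (prev - start + 1)))

    return sorted(passages, key=lambda p: (-p[2], -p[3]))
```

Only the positions of matching words are ever visited after tokenisation, which reflects the point made above that large non-relevant sections add no ranking cost.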
Small Screen Triage Mobile devices are becoming increasingly ubiquitous. One hindrance in their use, however, is the limited screen size. There is as yet no research into how users go about triaging the results of a search or how they look for information within a document on a small screen device. This part of the project replicates the experiments mentioned in Section 3 to study the behaviour and navigation of users on a small screen device. It is appropriate to conduct a laboratory based observational study with think aloud qualitative feedback and screen recording software for qualitative data gathering.
Eye Tracking Document Triage Behaviour We have mentioned the overwhelming evidence for the importance of the first page of a document with regard to the triage activity and relevance decisions of information seekers. Due to the lack of precise readings, however, it was not possible in any of the studies to distinguish the exact areas that participants attended to or for how long. It is therefore the aim of this part of the project to investigate the gaze and attention of the user with an eye-tracker while he or she performs document triage. Although the initial page is of utmost importance and interest, the laboratory experiment will gather information on the complete triage activity, covering the whole document.
Fast Triage In earlier work Saracevic experimented with relevance decisions by gradually uncovering parts of documents to information seekers and recording the relevance rating after each new unveiling [9]. It was found that the majority of information seekers make a decision early on and do not often change their minds after more evidence is provided. It is therefore the aim of this part of the research to answer the question of how much of a role time plays in the decision process of document triage. Participants in an observational study will go through different documents, some with a short time limit and some without, and give a relevance rating for each; a Latin square design will be used to reduce bias from presentation order. The hypothesis is that there will be no relevant difference between decisions made within a strict time frame and those made with unlimited time. This will help in the development of more relevant software and tools to aid more accurate and rapid decision making during document triage.
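A Latin square for counterbalancing presentation order can be generated as in the following sketch. It builds the simple cyclic square, in which every condition appears exactly once in each row and each column; the function names and the condition labels in the usage note are assumptions, and the actual study design may differ.

```python
def latin_square(conditions):
    """Cyclic Latin square: row r is the condition list rotated left by r places."""
    n = len(conditions)
    return [[conditions[(r + c) % n] for c in range(n)] for r in range(n)]


def assign_orders(participants, conditions):
    """Give each participant one row of the square, cycling through the rows."""
    square = latin_square(conditions)
    return {p: square[i % len(square)] for i, p in enumerate(participants)}
```

For instance, `assign_orders(["p1", "p2", "p3", "p4"], ["timed", "untimed"])` alternates which condition each participant meets first. A cyclic square balances the position of each condition but not which condition precedes which, so carry-over effects would need a different construction.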
Formulating and Refining the Models After all the information is gathered, the already existing models will be refined to incorporate the new data into behaviour and cognition. Furthermore, in light of new data more specialised models can be formulated, based on the data, which deal with specialised scenarios such as those of small screens or triage of the initial document page. Diary studies can enhance these results by allowing the recording of data in information seekers’ natural environment.
5 Summary and Contributions
Document triage is a process by which an information seeker first rates a document based on his or her information need. Although the area is severely lacking in research, we extract valuable insights from other related areas of information seeking. We investigate and assess current methods of document triage performed by users and model their behaviour. By doing so we can better facilitate this process through the creation and evaluation of software aids that help information seekers perform more effective document triage. Thus far we have examined information seekers in natural surroundings through diary studies and performed several laboratory based observational studies, leading to qualitative as well as quantitative results on triage behaviour. Areas of importance in terms of relevance decisions and navigation were identified and explained, leading to the extraction of a navigational behavioural model. Bespoke document reader software was created that assists investigators in gathering empirical data on the navigational patterns of users. Further, an advanced within-document searching feature was prototyped and evaluated following more specific studies into the limitations of the basic search feature. The specialised, and often neglected, group of novice information seekers was also studied by observation in a controlled environment. Future plans have been formulated, including the use of an eye tracker to detail navigational and cognitive behaviour and specialised studies into triage on small screen devices and under time limits. Alternative algorithms for within-document searching are under development and are to be evaluated.
References
[1] R. Badi, S. Bae, J. M. Moore, K. Meintanis, A. Zacchi, H. Hsieh, F. Shipman, and C. C. Marshall. "Recognizing user interest and document value from reading and organizing activities in document triage". In IUI '06: Proceedings of the 11th international conference on Intelligent user interfaces, New York, NY, USA, 2006, pp. 218-225.
[2] G. Buchanan and F. Loizides. "Investigating Document Triage on Paper and Electronic Media". Proceedings of the European Conference on Research and Advanced Technology for Digital Libraries, no. 35, 2007, pp. 416-427.
[3] F. Loizides. Exploring Document Triage, Improving Information Seeking Methods. Swansea University, 2008.
[4] F. Loizides and G. Buchanan. "An Empirical Study of User Navigation during Document Triage". Research and Advanced Technology for Digital Libraries, 2009, pp. 138-149.
[5] F. Loizides and G. Buchanan. "What patrons want: Supporting interaction for novice information seeking scholars". In Joint Conference on Digital Libraries, 2009.
[6] F. Loizides and G. R. Buchanan. "The myth of find: user behaviour and attitudes towards the basic search feature". In JCDL '08: Proceedings of the 8th ACM/IEEE-CS joint conference on Digital libraries, New York, NY, USA, 2008, pp. 48-51.
[7] G. Marchionini. Information Seeking in Electronic Environments. Cambridge University Press, 1995.
[8] K. Markey. "Twenty-five years of end-user searching, Part 1: Research findings". J. Am. Soc. Inf. Sci. Technol., vol. 58, 2007, pp. 1071-1081.
[9] T. Saracevic. "Comparative effect of titles, abstracts and full texts on relevance judgments". Proceedings of the American Society for Information Science, 6, 1969, pp. 293-299.
[10] S. Bae, R. Badi, K. Meintanis, J. M. Moore, A. Zacchi, H. Hsieh, C. C. Marshall, and F. M. Shipman. "Effects of Display Configurations on Document Triage". Human-Computer Interaction - INTERACT 2005, 2005, pp. 130-143.
[11] A. Spink, B. J. Jansen, D. Wolfram, and T. Saracevic. "From E-Sex to E-Commerce: Web Search Changes". Computer, vol. 35, 2002, pp. 107-109.
[12] P. Wang and M. D. White. "A cognitive model of document use during a research project. Study II. Decisions at the reading and citing stages". J. Am. Soc. Inf. Sci., vol. 50, 1999, pp. 98-114.
[13] X. Wang, P. Liebscher, and G. Marchionini. "Improving information-seeking performance in hypertext: roles of display format and search strategy". Computer Science Tech. Rep. Series CS-TR-2006 CAR-TR-353, College Park, University of Maryland, Baltimore, MD, 1988.