Search Tasks and Retrieval Strategies in Structured Information Retrieval

TCDL Bulletin
Current 2006
Volume 2 Issue 2

Search Tasks and Retrieval Strategies in Structured Information Retrieval

Georgina Ramírez
CWI
P.O. Box 94079
1090GB Amsterdam
The Netherlands
georgina@cwi.nl

Abstract

Structured information retrieval studies the combination of the content and the structure information of documents to perform different IR tasks. Different approaches such as structure or context weighting (e.g., [26], [12]) make use of the structural information of documents to improve information retrieval effectiveness. However, most of these studies do not take any contextual information into account, e.g., information about the user or the search task, and they use the same processing strategy across all types of queries. This work aims to identify the relationship between user search tasks and strategies on the use of structural information when performing those tasks. The theoretical part of this research consists of three main phases designed to acquire a good understanding of (1) the nature of user search tasks on structured documents, (2) the types of structural information and its role in retrieval strategies, and (3) the formalization of a model to correlate both user search tasks and retrieval strategies. Two different experimental studies are planned. The first one investigates the combination of evidence from different strategies for the defined user tasks, and the second one investigates how to use relevance feedback techniques to refine the structural information for a given search task.

1. Motivation

The growing amount of structured information available, e.g., web pages and XML documents, poses interesting new challenges to different information seeking research communities like digital libraries or information retrieval (IR). On one hand, the structure of the documents provides a new source of information that retrieval systems may exploit to improve their search effectiveness. On the other hand, the appearance of new query languages that work on structure provides users with a more powerful tool to express complex and specific needs.

This possibility of querying using structural constraints requires itself an understanding of two main issues that differ from traditional IR. On one hand, due to differences in the structure of the documents, some extra knowledge (e.g., semantic mapping) might be needed when querying heterogeneous sources. On the other hand, the knowledge that the user has of the structure of the documents, might lead to different interpretations of the structural constraints of the query.

Surface features, i.e., anything other than content information, and, in particular, structural information, i.e., structural markup within a document or a collection, have been poorly used by information retrieval systems. Although some work has been done on using structural information to address some of the issues from above, how to choose and use the right surface features for a given user task is still an open research question.

Furthermore, although there is a growing interest in the information retrieval field in using user context information to improve retrieval effectiveness, a gap still exists between the advances in information seeking research regarding user tasks and behaviour and the work done to apply these studies in information retrieval systems [17]. This work investigates this relationship in the context of structured information retrieval.

On the other hand, studies done in other areas of information retrieval, like web search, have shown that the use of more specialized retrieval strategies that combine different sources of evidence for different categorizations of user tasks and intentions, improves system effectiveness significantly (e.g., [16]).

The main hypothesis of the present work is that the different types of information that can be extracted from the structural components of documents and from other surface features can be treated as multiple sources of information. The best combination of evidence from these sources will be determined by the different types of user tasks and intentions, as happens already in different information retrieval tasks. Therefore, once we acquire a better understanding of surface features and study how they can help the system to perform several types of information needs, information retrieval systems will be able to use this extra source of information more effectively.

2. Research plan

Several aspects of the information seeking process on structured information need to be analyzed and understood to be able to develop the work presented. The five main steps of this research plan are described next in sub-sections 2.1 through 2.5.

2.1. Definition of a taxonomy of user tasks

Because the aim of any information system is to be able to answer effectively different search tasks, it is important to understand the nature of these tasks. The main hypothesis on this part of the research is that the type of search tasks users perform on structured collections are the same as those done on plain text collections. The difference between the two fields is that, when querying structured data, users can provide the system with extra information about the type and location of the information for which they are looking and that the knowledge the users have of the structure of the documents might differ in different degrees.

As an example, consider a user interested in finding a book review that discusses context in IR. A user familiar with the structure of the collection (e.g., the librarian) could know that generally book reviews appear in sections of documents titled "new books", "book review" or "bookshelf". He or she might then pose the following query¹:

//article[about(.//atl, "new book" "book review" bookshelf )]//sec[about(., context IR)]²

While a user less familiar with the collection would probably simply ask:

//sec[about(., book review context IR)]

The structural constraints of the librarian may help the retrieval system to perform a better search and maybe even reduce the search space and therefore should be treated in a stricter way. However, in the case of the inexperienced end user, we might not want the retrieval system to use the structural constraints at all. If we would restrict the search and use only the sections to find the query terms we might not find the desired information, since the exact phrase "book review" might not appear in either the section title or the section body. We believe that the knowledge users have about the structure of the documents should be an important contextual factor when choosing a retrieval strategy, and therefore, it should be added as a new dimension in the existing search task classifications.

In this part of the research, two main research questions are investigated:

Can we apply existing categorization of user tasks to structured information retrieval?
How can we extend these categorizations to include the different levels of knowledge a user might have on the structure of the documents?

2.2. Definition of a taxonomy of types and uses of structure.

Structured documents provide information systems with an extra source of information. A good understanding of the nature of this structure and its possible uses is needed in order to make optimal use of this information. The main hypotheses here are that different types of structure have been created for different purposes. Therefore, the use that information retrieval systems can make of the structure can be also clustered into different views (dimensions), e.g., semantic or organizational.

As an example, consider the structure of an XML document from the INEX collection (shown in Figure 1).

Image showing the structure of an XML document

Figure 1 Overview of the structure of an XML document from the INEX collection.

Each document in the INEX collection contains some metadata such as the journal and the date the article was published (grayed area in the figure). Apart from the metadata tags, the documents contain other types of tags. Some of these tags were added in order to give some format to the document (style tags) such as it (italics) or li (list item), and others were used to give extra information about the part of the document they describe (semantic tags) such as au (author) or bib (bibliography). Another type of mark-up could be the one that gives some structural division to the text (organizational tags) such as p (paragraphs) or sec (sections). Furthermore, another type of structural information that can be used for retrieval are the surface features, such as the location of the file within the collection or the file name.

Information retrieval approaches that use structural information tend to use all types of mark-up in the same way. We argue that different types of structural information can be optimally employed using different retrieval techniques. For instance, length normalization can be better used on organizational mark-up, and it has no sense to use it with titles or author names. On the other hand, techniques that give a higher weight to terms that appear in certain elements generally use style components such as title or italics or metadata, but not the organizational tags.

Different research questions are investigated in this part of the research:

Can we define a fixed set of independent dimensions from the structural features of the documents?
Can we classify the existing strategies into these set of dimensions?
Are there other strategies systems could use regarding structure?

2.3. Formalization of a model to correlate user tasks and strategies.

Once the classifications of user tasks and possible strategies are defined, we need to connect both of them and formalize a prescriptive model that defines which strategies for the possible dimensions can be used by the retrieval system for each of the user tasks. The hypothesis here is that different user needs will require different structural features. For example, a query from a user searching for pictures might require being processed strictly regarding the type of retrieval unit (pictures) and some semantic matching could be added (e.g., pictures or drawings or figures). However, this query most probably will not require any length normalization. The research question we investigate in this part is:

Can we correlate, according to their properties, different user needs with the different uses of structural features?

2.4. Experimentation on the combination of evidence from different dimensions.

Different experiments need to be done in order to estimate the best way to combine the evidence from the different dimensions used in the different user tasks. A hypothesis in this part of the research is that the problem we address of combination of evidence is similar to the one of combining evidence from different classifiers [10]. We investigate the following research questions:

Can we cast the problem of combining the different dimensions to the one of combining classifiers?
Can we estimate the best parameterization for the combination of evidence dependent on the user task?

2.5. Experimentation on the use of structural relevance feedback.

As the complexity of an information need increases, systems need to be able to process any information users might provide, e.g., by using special interfaces or by relevance feedback strategies. The hypothesis in this experimental part of the research is that, in the same way as content is redefined on a relevance feedback process, relevant structural information can also be used to update the parameters of the combination of evidence during different search tasks. We investigate the following research questions:

Which type of structural information can be extracted from a relevance feedback process?
How can we use the relevant structural information to update the parameterization of the combination of evidence and improve the overall effectiveness of the retrieval system?

3. System and evaluation

Once the model is defined, it will be implemented on top of an IR system that uses a probabilistic language modeling approach to information retrieval [22]. Different benchmarks will be used to test the effectiveness of the approach. The user tasks from several tracks in these benchmarks will be manually classified into the defined task categorization and processed with the system. The different tracks have been chosen for providing different types of information needs and different types of structured data. The main characteristics of these tracks are highlighted in Table 1:

Table 1. Track's characteristics.

Testbed and Track	Features
INEX: Adhoc	Content oriented and content and structure queries.
INEX: Heterogeneous	Different search tasks on a heterogeneous collection.
INEX: Multimedia	Search task on a collection with different multimedia data.
INEX:Relevance Feedback	Adhoc task with relevance feedback.
TREC: Enterprise search	Different user search tasks on different types of resources.

4. Background and Related Work

A review on theoretical models and frameworks for information seeking and retrieval is presented in [18]. Although many studies have been done on understanding and modeling user needs and information seeking behaviour within the information science community, traditional information retrieval systems pretty much ignore the user. However, several efforts have been made towards defining information retrieval models and systems that take users and context into account (e.g., [13], [4]), and a growing interest within the IR community on these user aspects is leading to an increasing number of studies on user tasks and seeking behaviours. Some of these recent studies can be found in [14], and some of the specific work that inspired the research presented in this paper is described in the following paragraphs.

Bhavnani et al. [6] define a framework for IR tasks and strategies with the goal of training users on the use of effective strategies for information searches. They also analyze different studies on the categorization of tasks and IR strategies. Broder [9] introduces and analyses a taxonomy of web searches and shows how search engines evolve towards dealing better with these web-specific needs. Croft [10] analyses some work done in the area of combining evidence in information retrieval and shows how this problem can be modeled as the one of combining the outputs of multiple classifiers. Kang et al. [16] define another taxonomy for web searches and report improved effectiveness when using different strategies for each of the categorized web searches. In Bomhoff et al. [7], the same taxonomy is extended to represent other information retrieval tasks. Although the authors do not show significant improvements yet, they believe that if different types of queries (user intentions) are defined and processed accordingly, the retrieval effectiveness may increase.

As mentioned earlier, different studies on the use of structure to improve retrieval effectiveness exist. In the few years of INEX existence [11], several XML retrieval approaches have been presented. Studies have been done on the use of the structural relationship between elements [12, 28], mentioned previously as context weighting. Another group of researchers uses structure weighting in order to give importance to certain types of elements [26]. To address the problem of what is the best retrieval unit for each query, some works choose the approach of defining a subset of possible elements to be retrieved [29, 23] whereas some others, as in other information retrieval areas, use length normalization [15]. Work done by the author on this area includes the study and implementation of some of these strategies [21] [22].

Outside the area of XML retrieval, surface features have been studied mainly in the context of web retrieval. A host of work exists that studies ways to exploit the hyperlink structure between documents. See, for example, [8, 19, 5]. Kraaij et al. [20] demonstrate that using information obtained from URL-length can improve performance when querying for homepages. Also in the HARD track at TREC [3], surface features are studied, but there the features describe characteristics of the searcher rather than the documents. In other information retrieval areas the only surface feature that has been used widely is document length, which is typically used for normalization.

Different relevance feedback techniques have been used in IR systems. Although extensively used, these techniques focus uniquely on the content part of a document. A survey of these techniques applied to different information retrieval models is presented in [27]. In [24] and [28], existing relevance feedback algorithms are applied to query on XML documents. All these approaches are using content-oriented feedback, whereas the feedback this research intends to study is based on structural features. Some work that the author has done in this area is explained in [25], where a first approach on using different structural features on a relevance feedback process is presented.

5. Conclusions

In this paper, I have summarized the main motivations, hypotheses and research questions of my research proposal. I conclude with the main points where the presented research might contribute to the information seeking community and, in particular, to the IR field:

Investigation on the use of structural features for effective information retrieval and relevance feedback.
Definition of the set of features that an information system should support in order to use structural information effectively.
Study and development of an experimental approach to correlate user needs and strategies for structured information retrieval.

About the author

Georgina Ramirez received her Master's degree on Computer Science Engineering from the Barcelona Faculty of Informatics (FIB) at the Technical University of Catalonia (UPC). She did her final master's thesis at the Centre for Mathematics and Computer Science (CWI), in Amsterdam, were she worked on efficient implementations for a k-NN search algorithm for image retrieval. Currently, she is working on her Ph.D. at CWI, studying the use of structural features for effective content-oriented XML retrieval and defining a model to correlate user search tasks and retrieval strategies with respect to the use of structural information.

Georgina Ramirez is member of the Database Architectures and Information Access group (INS1) at CWI, the National Research Institute for Mathematics and Computer Science in the Netherlands. The work carried out in the research team INS1 is focused on bridging the gap between database architectures developed for applications in a peer-2-peer (P2P) setting and the need to scale algorithms for multimedia information retrieval.

Notes

1. Queries are expressed in NEXI [1], INEX's query language. NEXI is a subset of XPath extended with an about clause used for ranking purposes.

2. article title, section.

References

[1] INEX'03 Guidelines for Topic Development. In Norbert Fuhr, Saadia Malik, and Mounia Lalmas, editors, INEX 2003 Workshop Proceedings, pages 192-199, 2003. <http://inex.is.informatik.uni-duisburg.de:2003/proceedings.pdf>.

[2] INEX 2004 Relevance Assessment Guide. In Norbert Fuhr, Saadia Malik, and Mounia Lalmas, editors, INEX 2004 Workshop Proceedings, 2004. Notebook paper.

[3] J. Allan. HARD Track Overview in TREC 2003; high accuracy retrieval from documents. In the Twelfth Text Retrieval Conference, 2003.

[4] N.J. Belkin, R.N. Oddy, and H.M. Brooks. Ask for Information Retrieval: Part I and II. Journal of Documentation, 38(2 and 3), 1982.

[5] K. Bharat and M. R. Henzinger. Improved Algorithms for Topic Distillation in a Hyperlinked Environment. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 104-111, 1998.

[6] S. K. Bhavnani, K. Drabenstott, and D. Radev. Towards a Unified Framework of IR Tasks and Strategies. In Proceedings of ASIST'2001, pages 340-354, 2001.

[7] M. Bomhoff, T. Huibers, and P. van der Vet. User Intentions in Information Retrieval. In Proceedings of the Fifth Dutch-Belgian Information Retrieval Workshop, 2005.

[8] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1-7):107-117, 1998.

[9] A. Broder. A taxonomy of web search. SIGIR Forum, 36(2):3-10, 2002.

[10] W. B. Croft. Combining Approaches to Information Retrieval. In Advances in Information Retrieval: Recent Research from CIIR, chapter 1, pages 1-36. Kluwer Academic Publishers, 2000.

[11] N. Fuhr, N. Gövert, G. Kazai, and M. Lalmas. INEX: INitiative for the Evaluation of XML retrieval. In Proceedings of the SIGIR 2002 Workshop on XML and Information Retrieval, 2002.

[12] N. Fuhr and K. Großjohann. Xirql: A Query Language for Information Retrieval in XML Documents. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pages 172-180, 2001.

[13] P. Ingwersen. Information Retrieval Interaction. London: Taylor Graham, 1992.

[14] P. Ingwersen, K. van Rijsbergen, N. Belkin, and B. Larsen, editors. IRiX: ACM SIGIR 2004 Workshop on Information Retrieval in Context, 2004. <http://ir.dcs.gla.ac.uk/context/IRinContext_WorkshopNotes_SIGIR2004.pdf>.

[15] J. Kamps, M. de Rijke, and B. Sigurbjörnsson. Length Normalization in XML Retrieval. In Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, pages 80-87. ACM Press, 2004.

[16] I. Kang and G. Kim. Query Type Classification for Web Document Retrieval. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in Information retrieval, pages 64-71. ACM Press, 2003.

[17] K. Järvelin and P. Ingwersen. Information Seeking Research Needs Extension Towards Tasks and Technology. In Information Research, 10(1) paper 212. 2004. <http://InformationR.net/ir/10-1/paper212.html>.

[18] K. Järvelin and T.D. Wilson. On Conceptual Models for Information Seeking and Retrieval Research. In Information Research, 9(1) paper 163. 2003. <http://InformationR.net/ir/9-1/paper163.html>.

[19] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604-632, 1999.

[20] W. Kraaij, T. Westerveld, and D. Hiemstra. The Importance of Prior Probabilities for Entry Page Search. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 27-34. ACM Press, 2002.

[21] J. List, V. Mihajlovic, A. P. de Vries, G. Ramirez, and D. Hiemstra. The TIJAH XML-IR System at INEX 2003. In Norbert Fuhr, Saadia Malik, and Mounia Lalmas, editors, INEX 2003 Workshop Proceedings, pages 102-109, 2003. <http://inex.is.informatik.uni-duisburg.de:2003/proceedings.pdf>.

[22] J. List, V.Mihajlovic, G. Ramirez, A.P. de Vries, D. Hiemstra, and H.E. Blok. TIJAH: Embracing IR Methods in XML Databases. Information Retrieval Journal. Volume 8, Number 4, pages 547-570. Kluwer Academic Publishers, December 2005.

[23] Y. Mass and M. Mandekbrod. Retrieving the Most Relevant XML components. In Norbert Fuhr, Saadia Malik, and Mounia Lalmas, editors, INEX 2003 Workshop Proceedings, 2003. <http://inex.is.informatik.uni-duisburg.de:2003/proceedings.pdf>.

[24] Y. Mass and M. Mandelbrod. Relevance Feedback for XML Retrieval. In Norbert Fuhr, Saadia Malik, and Mounia Lalmas, editors, INEX 2004 Workshop Proceedings, 2004. notebook paper.

[25] V. Mihajlovic, G. Ramirez, A.P. de Vries, , D. Hiemstra, and H.E. Blok. TIJAH at INEX 2004. Modeling Phrases and Relevance Feedback. In Norbert Fuhr, Saadia Malik, and Mounia Lalmas, editors, INEX 2004 Workshop Proceedings, 2004. notebook paper.

[26] P. Ogilvie and J. Callan. Using Language Models for Flat Text Queries in XML Retrieval. In Norbert Fuhr, Saadia Malik, and Mounia Lalmas, editors, INEX 2003 Workshop Proceedings, 2003. <http://inex.is.informatik.uni-duisburg.de:2003/proceedings.pdf>.

[27] I. Ruthven and M. Lalmas. A Survey on the Use of Relevance Feedback for information access systems. Knowl. Eng. Rev., 18(2):95-145, 2003.

[28] B. Sigurbjörnsson, J. Kamps, and M. de Rijke. An Element-based Approach to XML Retrieval. In N. Fuhr, M. Lalmas, and S. Malik, editors, Proceedings of the Second Workshop of the INitiative for the Evaluation of XML retrieval (INEX), ERCIM Publications, 2004.

[29] A. Trotman and R. A. O'Keefe. Identifying and Ranking Relevant Document Elements. In Norbert Fuhr, Saadia Malik, and Mounia Lalmas, editors, INEX 2003 Workshop Proceedings, 2003. <http://inex.is.informatik.uni-duisburg.de:2003/proceedings.pdf>.

Top | Contents
Previous Article
Next Article
Home | E-mail the Editor