IEEE TCDL Bulletin

TCDL Bulletin
Fall 2007
Volume 3   Issue 3


Semantic Components

A model for enhancing retrieval of domain-specific information

Susan L. Price
Department of Computer Science
Portland State University
Portland, Oregon 97207-0751
<prices at>



Domain specialists often have very targeted information needs. Digital libraries that leverage knowledge about information organization in a domain are likely to more effectively and efficiently meet the needs of expert users. We propose a model for representing document content, called semantic components. The semantic components model can affect multiple stages of the retrieval process: indexing, querying, document ranking, and document representation in results lists. In this paper we briefly describe the model, present preliminary results, and propose experiments that will provide evidence regarding the feasibility and potential usefulness of the model.


1 Introduction and Background

Kirsten, a family practitioner, wants to refer her patient to a cardiologist because standard medication is not controlling his arrhythmia. She needs to know how to arrange the referral and what tests must be done in advance. Kirsten does not need a collection of documents that describes various arrhythmias; she needs the one document that contains the referral guidelines applicable to her practice.

Domain specialists, such as physicians, often have very targeted information needs. Furthermore, they have extensive knowledge about what types of documents are created in the domain and how information is typically organized and expressed within documents. Digital libraries that leverage knowledge about information organization in a domain are likely to more effectively and efficiently meet the needs of expert users.

A digital library pertinent to a particular domain contains a collection of documents and a retrieval system that provides access to those documents. The retrieval system typically has an index, consisting of a representation for each document, a query module that accepts user requests in a query language that is understood by the retrieval system, and a search module that matches the user requests to document representations. Document representations typically consist either of words extracted from the document or of keywords assigned from a controlled vocabulary appropriate to the domain. Both types of representations have advantages and disadvantages. We propose a new model, called semantic components, for representing documents in domain-specific digital libraries; it complements keyword and full-text indexing and leverages information about document types and the kinds of information they contain. In the remainder of this paper we briefly describe the semantic components model, describe how we envision it being used, and outline a planned body of work to investigate the feasibility and potential usefulness of this model in digital libraries.

2 Proposed Research

The documents in a domain-specific collection can be classified into document types by considering the type of topic that is the main focus of a document. For example, a collection of health-related documents could include documents about diseases (one document type) and documents about medications (another document type). Documents about a particular type of topic tend to contain information about a finite set of aspects of the topic type. We refer to the selected aspects for a document type as semantic components. For example, documents about diseases often contain information about etiology and treatment (semantic components) and documents about medications often contain information about dosage and side effects (also semantic components).

A semantic component instance is one or more segments of text that contain information about a particular aspect (semantic component) of the concept instance (a topic) that is the document focus. The segments of text that comprise a semantic component instance in a document can vary in length and may or may not be contiguous. Text may belong to more than one semantic component instance; therefore, semantic component instances may overlap, or one instance may be contained, or nested, inside another. The semantic component itself is the type (that is, a label indicating the type) for the semantic component instances that correspond to a particular aspect. In the example above, etiology and treatment are semantic components in documents about diseases. The text segments that describe the etiology and treatment of a particular disease are instances of the etiology and treatment semantic components, respectively.
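As a concrete illustration, the definition above can be sketched as a simple data structure in which each instance is a component label over one or more character-offset segments. This is only a sketch; the class and field names are our own illustrations, not part of the model's specification.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

@dataclass
class ComponentInstance:
    component: str                   # semantic component label, e.g. "etiology"
    segments: List[Tuple[int, int]]  # (start, end) character offsets; may be
                                     # discontiguous and may overlap other instances

@dataclass
class IndexedDocument:
    doc_id: str
    doc_type: str                    # document type, e.g. "disease"
    instances: List[ComponentInstance] = field(default_factory=list)

    def components_present(self) -> Set[str]:
        """The set of semantic components instantiated in this document."""
        return {inst.component for inst in self.instances}

doc = IndexedDocument("d1", "disease", [
    ComponentInstance("etiology", [(120, 480)]),
    ComponentInstance("treatment", [(500, 900), (1400, 1620)]),  # discontiguous
    ComponentInstance("referral", [(800, 900)]),                 # nested inside treatment
])
print(sorted(doc.components_present()))  # ['etiology', 'referral', 'treatment']
```

Note that nothing in the structure forbids overlap or nesting, and text outside every segment simply belongs to no semantic component instance, matching the definition above.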

Semantic components bear similarities to facets identified during facet analysis of a topic [1]. We do not use the term facet because its use might imply that semantic components result from performing a facet analysis of the concept represented by the document class, and because not all semantic components correspond to facets. We identify sets of semantic components for document classes because we believe they will be useful for searching, not because they adhere to principles of classification such as exhaustivity or mutual exclusivity. In fact, our definition of semantic components explicitly states that semantic component instances may overlap and that not all text in a document need belong to a semantic component instance. Some semantic components correspond to facets of the concept represented by the document class. For example, diagnosis and treatment are facets of diseases and are also likely candidates for semantic components of documents about diseases. Semantic component instances may, however, contain information that is not traditionally considered part of a facet of the topic. For example, an instance of a diagnosis semantic component may contain information that is applicable only to the locale where the document was created and expected to be used, such as the availability of various tests and procedures in a given hospital or region. Semantic components may also group together two or more concepts that might be considered separate facets of a concept represented by the document class. Since semantic components are intended to facilitate retrieval of information, not to describe the domain itself, knowledge of the contents of a document collection and the types of information needs common among users of the collection may lead to varying degrees of specificity among the semantic components for a document class. For example, an analyst might decide to collapse pathophysiology and epidemiology into a single semantic component of documents about diseases if the users are busy practicing physicians who rarely query for these kinds of information, whereas she might make them separate components if the user population regularly searches for these kinds of information, or if these information types are prominent in the document collection being analyzed.

Some components exist because they appear as prominent items of information within document text even though they resemble the kinds of information that often appear as metadata, or could be fields in a database, such as location, responsible official, and date in Environmental Analysis documents. We refer to these as fact-oriented semantic components. Semantic components can also contain information that is specific to the collection or user environment, such as practical information (where to go) and aftercare (length of hospitalization, number and timing of follow-up appointments) in documents that describe invasive medical procedures. These components, like the fact-oriented semantic components, correspond less naturally to facets, although one might be able to coerce them into a structure of facets and subfacets.

Information about semantic components in documents can be useful for information retrieval in three ways. First, just knowing which semantic components are present in a document can help a searcher decide which documents in a results set are most likely to be useful. A list of the semantic components present in each document provides a short synopsis of document content. Second, a searcher can specify that his search is for documents about a topic that contain a particular semantic component. Third, the searcher can search for terms within semantic components. For example, the hypothetical searcher introduced in Section 1 might request documents about arrhythmia that contain the term "echocardiogram" within a referral semantic component instance to find out whether her patient should have an echocardiogram (an ultrasound examination of the heart) before seeing the cardiologist.
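The third kind of search reduces to a containment test over a span-based representation like the one sketched earlier. The function and the document snippet below are hypothetical illustrations, not drawn from any actual portal.

```python
def term_in_component(text, instances, component, term):
    """instances: list of (component_label, [(start, end), ...]) character spans.
    Return True if the term occurs inside any segment of the named component."""
    term = term.lower()
    for label, segments in instances:
        if label != component:
            continue
        for start, end in segments:
            if term in text[start:end].lower():
                return True
    return False

text = ("Arrhythmias are disturbances of heart rhythm. "
        "Referral: the patient should have an echocardiogram before the visit.")
# mark everything from "Referral" onward as a referral component instance
instances = [("referral", [(text.index("Referral"), len(text))])]

print(term_in_component(text, instances, "referral", "echocardiogram"))   # True
print(term_in_component(text, instances, "treatment", "echocardiogram"))  # False
```

The same term matched against the whole document would also hit discussions of echocardiograms outside the referral context; restricting the match to the component instance is what makes the query more targeted.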

2.1 Research Questions

In order to use semantic components for retrieving documents, semantic component instances must be identified in documents, and information requests must be expressed so that semantic component information can be used to match documents to requests. This work will address four questions about semantic components.

  1. Can document types and semantic components be identified for particular domain-specific document collections?
  2. Can searchers express information needs using document types and semantic components?
  3. How easily can semantic components be identified and represented in an index?
  4. Are semantic components useful for retrieving documents?

2.1.1 Can document types and semantic components be identified for particular domain-specific document collections?

We have preliminary evidence to indicate that identification of document types and semantic components is relatively straightforward, at least for domain experts. We have analyzed three collections of documents, two in the medical domain and one in the natural resources management domain [2, 3]. These analyses are discussed in Section 3.1. We are considering tools that could facilitate the analytic process for librarians or domain experts who might be assigned the task of analyzing a document collection and specifying the semantic components appropriate for each document type. Evaluating the process and associated tools will be challenging. An evaluation could address both the ease of analyzing a document collection and the quality of the resulting analysis. The ease of performing an analysis could be assessed with respect to the time required and the degree of difficulty perceived by the analyst. Evaluating the quality of an analysis would require either subjective judgments by another domain expert or inferences drawn from a retrieval study that would be influenced by multiple other factors.

2.1.2 Can searchers express information needs using document types and semantic components?

We are approaching this question in two ways. First, we mapped an existing taxonomy of questions asked by primary care physicians to the document types and semantic components we found in two collections of documents intended for physicians [3]. In this study, we used information needs that had already been abstracted into generic questions and then were analyzed by a researcher very familiar with the notion of semantic components. The results of the study are summarized in Section 3.2. We will also assess how searchers, newly introduced to semantic components, express information needs using this model. In the searching experiment described in Section 4, we will evaluate how searchers express information needs using semantic components, both with regard to how well the queries retrieve relevant documents, and how the search formulations compare among subjects searching on the same topic. We will also assess how well the searchers themselves felt they were able to express their information needs.

2.1.3 How easily can semantic components be identified and represented in an index?

This general question has several associated research questions. Semantic components will not be a useful model for retrieval if the presence (or absence) and location of semantic components in documents cannot be identified with a reasonable amount of resources and a reasonable degree of quality. Our first investigations will assess the identification of semantic components by human indexers and will include a comparison to keyword indexing. The resources to be measured will be time and perceived cognitive demand (or "difficulty" of the task). We will assess the quality of indexing with respect to inter-indexer consistency and also compare indexer judgments to a "standard" created by consensus between two members of the research team, a domain expert and an information scientist. While indexing consistency does not measure indexing quality (indexing can be consistently incorrect), we believe it will provide interesting and useful information. Consistency is likely to reflect the degree of difficulty of the task, and frequent users of a system are more likely to be able to adjust their searching behavior to accommodate imperfect indexing if it is consistent. Directly measuring indexing effectiveness (whether an indexed document is correctly retrieved every time it is relevant to a query [4]) is not feasible, but the searching study described in Section 4 will evaluate indexing effectiveness in combination with other factors.

Our goals for this part of the work are to answer the following more specific questions:

  1. Can human indexers identify semantic components more quickly than they can assign indexing keywords?
  2. Is designation of semantic components more consistent than assignment of keywords?
  3. Is designation of semantic components more "accurate" (when compared to a researcher-generated gold standard) than assignment of keywords?
  4. Do indexers find the identification of semantic components to be easier (less cognitively demanding) than assignment of keywords?

Evaluation of semantic component indexing is a significant challenge. Although we frame the problem as classification, recall and precision-based metrics that are often used for document classification are not adequate to evaluate the similarity of locations in the text between two instances of semantic component indexing and often do not account for the probability of similar classifications occurring by chance. We are investigating measurements of agreement based on two main approaches: binary classification of each unit of text in the document (independent of location) and Krippendorff's alpha for measuring agreement of segmented (and categorized) text in content analysis tasks [5, 6].
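In the spirit of the first approach, agreement can be computed over binary per-unit judgments with a correction for chance. The sketch below uses Cohen's kappa on hypothetical sentence-level labels; it is only illustrative and is not the Krippendorff's alpha variant cited above, which additionally accounts for segment boundaries.

```python
def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two indexers' binary judgments
    (1 = unit belongs to the component, 0 = it does not)."""
    assert labels_a and len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # expected chance agreement from each indexer's marginal label proportions
    pa, pb = sum(labels_a) / n, sum(labels_b) / n
    chance = pa * pb + (1 - pa) * (1 - pb)
    return (observed - chance) / (1 - chance)

# hypothetical per-sentence judgments for one semantic component
indexer_a = [1, 1, 0, 0, 1, 0, 0, 0]
indexer_b = [1, 0, 0, 0, 1, 0, 0, 1]
print(round(cohens_kappa(indexer_a, indexer_b), 3))  # 0.467
```

The raw observed agreement here is 0.75, but because most sentences belong to no component, much of that agreement would occur by chance; the chance correction is what makes such per-unit measures informative for sparse labels.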

2.1.4 Are semantic components useful for retrieving documents?

Evaluation of the usefulness of our semantic component model for searching is arguably the most important, and complex, piece of this work. Clearly for this model to attain widespread use it needs to provide some benefit to the searcher. Our goal for this part of the work is to answer the following questions:

  1. How do searchers use semantic components to express a search task?
  2. Do searchers feel it is easy to express searches using semantic components?
  3. When searchers are presented with a list of the semantic components present in each document that appears in the search result list, possibly supplemented with additional information about each semantic component instance, such as its length and an excerpt of text, do they find it easier to select which documents to view?
  4. Does the use of semantic components for searching produce a better search result, compared to a system that uses traditional queries and indexing (probably simple natural language queries and full text indexing), with respect to document ranking (as determined by an appropriate metric, such as Mean Average Precision, and relevance judgments that are specific to the actual search task and not just topic relevance)?
  5. Do searchers find what they determine to be a suitable answer more quickly with a system that uses semantic components than with a traditional system?

3 Preliminary Results

3.1 Analysis of Document Collections

We have identified two main approaches to identifying the set of document types: (1) an analysis based on document sampling and (2) re-using existing document types. We also have identified two stages of the analysis with respect to semantic components.

The first document collection we analyzed is from the national Danish Health Portal [7]. Although the website allows users to filter searches based on a list of document types, we chose to ignore existing metadata and analyze the collection using document sampling. Because we did not have access to an enumeration of document identifiers, we sampled the document collection by issuing queries using nonspecific terms, such as the Danish words for "or" and "it", and then used a pseudo-random number generator to select a document from the results list. We filtered some of the searches by Danish region (an option provided by the existing search interface) to ensure inclusion of at least one document from each contributing region. We also ensured inclusion of documents intended for a variety of audiences by randomly choosing some search terms from each of the three controlled vocabularies that are used to index many of the documents. For each of the 72 documents in the sample, we outlined the content, listing the types of information present. From the outlines we developed a classification scheme based on the type of main topic (such as disease, procedure, or medication) and the primary intended audience (healthcare professional or patient).
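The sampling procedure can be sketched roughly as follows. The search function here is a toy stand-in for the portal's query interface, and the seeded generator stands in for whatever pseudo-random selection was actually used.

```python
import random

def sample_documents(search, query_terms, seed=0):
    """Issue one nonspecific query per term and pseudo-randomly pick one
    document identifier from each non-empty results list."""
    rng = random.Random(seed)  # seeded for reproducibility
    sample = []
    for term in query_terms:
        results = search(term)  # list of document identifiers
        if results:
            sample.append(rng.choice(results))
    return sample

# toy stand-in for the portal's search interface ("eller" and "det" are
# the Danish words for "or" and "it")
fake_index = {"eller": ["d1", "d2", "d3"], "det": ["d2", "d4"]}
picked = sample_documents(fake_index.get, ["eller", "det", "xyz"])
print(picked)
```

Sampling via nonspecific query terms of this kind approximates random selection only to the extent that nearly every document matches some term, which is why the region and vocabulary filters described above were needed to correct for coverage gaps.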

The second collection consisted of documents that are candidates for inclusion in a prototype digital library system developed in a different project. The documents are all produced and used by U.S. government agencies that manage public lands. Many are intended to document various types of analyses and decisions and must conform to document types that are specified by the government agencies. We are particularly interested in documents mandated by the National Environmental Policy Act (NEPA). Document content is highly specified by regulations to ensure compliance with legal requirements. For these documents we re-used the existing document types and focused our attention on identifying semantic components within two document types, Decision Notice and Environmental Analysis.

Our analyses of semantic components associated with document types have consisted of two stages. The first is an enumeration and description of information types commonly found in each document type. The second stage is a refinement based on anticipated information needs and search tasks given the domain and intended users of the document collection. Our refinements were based in part on early feedback that a shorter list of semantic components would be more usable for both indexers and searchers. Figure 1 and Figure 2 show the initial list and refined list of semantic components for two document types from two domains.

The refined list of semantic components for documents about diseases was based on information about the information needs and work tasks of Danish family practitioners. Physicians typically need to find very targeted information, such as referral guidelines, while they are seeing patients. Because they are less likely to search for general information about diseases, we collapsed several of the initial semantic components into a single component. We also expanded the descriptions and added examples to clarify the kinds of information that are included in each semantic component.

The second example is from Decision Notice (DN) documents. The initial list reflects content present in documents and specified in instructions to government agency employees for preparing documents of this type. The refinement is based on anticipated information needs of users. For example, DNs must contain information about administrative review, but the content is stereotypical and similar across documents. The words within administrative review instances are unlikely to discriminate one DN from another DN, and searchers are unlikely to want to search within an administrative review semantic component. On the other hand, issues are very important in this context even though they do not appear as a separate section in guidelines for preparation of DNs. Issues, raised either by agency personnel or by the public, are what drive the decision makers to choose one alternative over another. A decision maker is likely to want documents about decisions made when facing the same issues he is considering. These examples illustrate the importance of tailoring the semantic components, not only to a domain, but also to a particular collection of documents and the expected information needs of its users.


Figure 1. Semantic components for documents about a clinical problem written for health professionals. The table on the left shows the initial list of semantic components and the table on the right shows the refined list.

3.2 Expression of Information Needs Using the Semantic Components Model

Ely and colleagues used data from observational studies to create and refine a hierarchical taxonomy of questions asked by primary care physicians. Each category in the taxonomy contains one or more generic questions. (An example of a generic question is "What is the cause of symptom x?"). Each category also has a frequency value based on the number of questions from the observational study that were assigned to the category [8, 9]. We analyzed two collections of documents written for physicians with respect to document type and semantic components:

  1. We used our analysis of the documents from the national Danish Health Portal [7].
  2. We analyzed a subset of documents from UpToDate© [10], a commercially produced resource for practicing physicians.

We then mapped the question categories in the taxonomy to each document collection to determine what fraction of the categories could be mapped to at least one document type/semantic component combination in each resource. We successfully mapped 36 of 50 question categories to one resource and 34 of 50 to the other. Weighted by the frequency of the question categories in the taxonomy, the mapped categories covered over 92% of the questions asked [3]. An additional five categories could be partially, or possibly, mapped to each resource. Successful mapping indicates only that one or more semantic components can easily be identified as most likely to satisfy the information need, not that answers to all instantiations of the generic questions would be found.
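The frequency-weighted coverage figure is a computation of the following shape; the category frequencies below are made-up numbers for illustration, not the taxonomy's actual data.

```python
def weighted_coverage(categories):
    """categories: iterable of (frequency, mapped) pairs, where frequency is
    the number of observed questions assigned to the category and mapped
    indicates whether it mapped to a document type/semantic component pair."""
    total = sum(freq for freq, _ in categories)
    covered = sum(freq for freq, mapped in categories if mapped)
    return covered / total

# illustrative: the frequent categories mapped; a few rare ones did not
categories = [(120, True), (85, True), (60, True), (10, False), (5, False)]
print(round(weighted_coverage(categories), 3))  # 0.946
```

As the example shows, coverage weighted by question frequency can substantially exceed the raw fraction of mapped categories (here 3 of 5) when the unmapped categories are rarely asked.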


Figure 2. Semantic components for Decision Notices. The table on the left shows the initial list of semantic components and the table on the right shows the refined list.

Figure 3 shows an example of a category mapping. Note that the category can be mapped in two directions, depending on whether a target document is primarily about the drug or the problem it treats. Categories that failed to map included two very general questions with no semantic component ("What is [condition|test] x?"), and questions for which the requested information types were not observed in the documents sampled from the collections.

Figure 3. A mapping of one question category. The variables x and y indicate values that should either correspond to the focus of the document or be present in instances of the semantic component. If a category has multiple generic questions, there can be mappings with and without variables.

4 Experiments

We have recently completed an indexing study in which 16 subjects, who currently perform keyword indexing for documents as a part of their job, were each asked to index 12 documents, six with keywords and six with semantic components. We systematically varied the order of document presentation and indexing technique to achieve a balanced design. Data collected included the actual indexing, time required to index each document, and subjective data regarding ease, confidence, satisfaction, and general feedback and suggestions. The participants recorded all indexing on paper to avoid confounding the results with user interface issues. We plan to measure inter-indexer consistency and will also assess indexing quality by comparing the experimental indexing to a reference indexing produced by consensus between an indexing expert and a physician. Data analysis is in the early stages; we will report the results at a later time.

We also recently completed a searching experiment showing that semantic components enhanced the retrieval of domain-specific documents [11]. We compared a retrieval system that used semantic components to supplement free-text and manually assigned keyword indexing with a traditional retrieval system using natural language queries and a combination of full-text and keyword indexing. We created two interfaces to a copy of the 24,712 documents available from the portal in July 2006: (1) a basic interface that allowed simple natural language queries and two of the filters provided by the portal, and (2) an interface that provided the same facilities as the basic interface plus the ability to further specify the query using semantic components. A subset of the documents was manually indexed using semantic components. Thirty Danish family physicians each completed four realistic search scenarios, two using the basic interface and two using the interface with semantic components. The search tasks were formulated through a collaboration between a Danish physician, who is both a family practitioner and a researcher, and an information scientist. We logged the search queries, search results, pages viewed, and timestamps for each action. We also used questionnaires and interviews to assess the searchers' subjective reactions to using semantic components.

5 Related Work

This work builds on large bodies of previous work in information retrieval and information science. We focus here on three major areas:

  1. efforts to leverage knowledge about document genres and organization for information retrieval,
  2. efforts to classify (or categorize) documents in information retrieval systems, and
  3. efforts to select and use sections of text within documents (subdocuments) for retrieval-related tasks.

A number of authors have suggested using document genre to improve information retrieval (for example, [12-14]). The term genre, traditionally used to describe literary and artistic works, has also been used to describe categories of organizational communications [15], documents in digital libraries [13], and web pages [16], although there does not seem to be a precise and universally accepted definition of genre. Orlikowski and Yates describe genres of organization communication (such as business letters and annual reports) as being "characterized by a socially recognized communicative purpose and common aspects of form" [15]. Attempts to automatically classify document collections on the basis of genre generally rely on identifying attributes, or facets, that can be used to create a genre classification system. Documents are assigned to genres based on the values for those attributes. The document types in our model are akin to genres, although we prefer not to use the term genre until we have refined our notion of document types and are convinced that genre is an appropriate term. For some familiar genres, it is easy to suggest semantic components whereas for others it is not. For example, recipes typically have ingredients and cooking instructions, but what about letters or emails? All three genres have an identifiable form and purpose. The difference seems to be that recipes are both specific to a domain (cooking) and have a predictable type of topic (a dish), whereas knowing that a document is a letter or an email gives us little clue about the types of information likely to be present.

Turner et al. created a model in the public health domain in which genre was one component. They used content analysis and a study with expert users to identify key elements in public health gray literature that could be extracted to create useful document surrogates to populate a searchable database. Document type, such as newsletters, guidelines, and data sets, was just one of the key elements [17]. Some of the key elements in the proposed surrogate, such as description of the problem, description of the intervention, and target population are similar to semantic components but were not linked to particular document types.

Studying and exploiting the organization of information in specific document types is not new. For example, Dillon showed that experienced users of research articles have a mental model of research article organization that allows them to impose an ordering on fragments of text that have been excerpted from an article and presented in scrambled order [18]. Bishop studied how academic researchers disaggregate and re-aggregate journal articles into useful subdocument components [19]. Both studies support readers' ability to manipulate and use subdocument components in the context of a familiar document model, but neither specifically addresses information retrieval uses. Purcell et al. developed context models of three types of medical research articles that could be used in representing documents in a retrieval system. A context model is basically an outline of the types of information that appear in a particular type of document, such as case presentation and case discussion in a case report and methods and results in a clinical research article. Sentences in documents can be assigned to a context using a markup language [20]. This context-based representation is very similar to the semantic components model, although the contexts they identified are more closely tied to the document organization that is characteristic in medical research articles than are some of the semantic components we identified in the document collections we analyzed. To our knowledge, Purcell's context model has not been applied to document types outside biomedical research articles.

Document classification is the process of assigning documents to one or more categories that have been predefined, either manually or automatically. For the purposes of information retrieval, classification is usually performed either to associate documents with metadata (including indexing terms from controlled vocabularies) or to assign documents to categories in a browsable hierarchy. Sebastiani provides an extensive review of machine learning techniques for automated text categorization [21]. We frame semantic component indexing as a classification task in the sense that each unit of text is classified as belonging, or not belonging, to an instance of a particular semantic component. The subdocument granularity of semantic component indexing is different from most text classification tasks because the semantic component label applies to a designated subset of the original document, not to the entire document.

A variety of other tasks related to information retrieval involve text at subdocument granularity. We briefly discuss five related tasks: content analysis, text segmentation, passage retrieval, novelty detection, and information extraction. These tasks differ with respect to the importance of detecting the location and boundaries of subdocument text, the importance of characterizing or labeling the content in the subdocument text, and the purpose of the task (that is, how the results of the task are used).

Content analysis is arguably the task most closely related to semantic component indexing. Content analysis, frequently used in social science research, is the systematic evaluation of the content of various forms of communication. It typically involves coding (labeling) units of information within a message (content analysis may be applied to information types other than text, such as audio and video). The coding scheme may be predefined or may be developed as part of the research. For example, a study of the effects of television on children might require coding the content of various television shows. While the underlying medium may have logical units, such as words or video frames, coding generally results in segmenting the message into variable length pieces corresponding to the analysis. Both the assigned code and the location of various coded segments, including the assigned boundaries, are important. When demarcation and labeling of segments using a defined coding scheme is applied to text, the task is almost identical to semantic component indexing. The differences between the two tasks are the purpose – content analysis is a research technique whereas semantic component indexing is intended to enhance information retrieval – and the model. Semantic component indexing occurs in the context of a document class/semantic component model whereas the coding scheme will vary across different research projects and may or may not involve classification of the document to be coded. Because comparing two coding instances, such as to establish the reliability of a coding scheme and/or a coder, involves comparing both the codes assigned and the similarity of the locations that have been coded, we plan to draw heavily on previous work to establish metrics for evaluating content analysis.

Text segmentation is the task of dividing text into sections based on changes in topic or subtopic. It has been studied in the context of several problems: dividing previously undifferentiated streams of text, such as concatenated news stories, into their components [22], possibly as part of a topic detection and tracking effort [23]; dividing documents into sections corresponding to subtopics to aid in information retrieval [24] or display of retrieval results [25]; and preprocessing text in a summarization system [26]. Text segmentation can be linear or hierarchical. Linear segmentation typically assigns each unit of text to exactly one contiguous segment. The task is to correctly find segment boundaries, and evaluations have focused on measuring the correctness of automatically placed boundaries. Semantic component indexing is similar in that we try to find sections of documents that pertain to specific aspects of the main topic, but unlike most text segmentation tasks, the list of aspects of interest (semantic components) is defined in advance based on the document class. Unlike segments resulting from text segmentation tasks, semantic component instances within a single document can be discontiguous and can overlap with other semantic component instances.
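Boundary-based evaluation of linear segmentation can be sketched with a simplified, WindowDiff-style score (this is our illustrative simplification, not the exact published formulation):

```python
def window_diff(ref, hyp, k):
    """Simplified WindowDiff-style score.

    ref and hyp are 0/1 lists marking whether a segment boundary follows
    each text unit. Slide a window of k positions and count windows where
    the reference and hypothesis boundary counts disagree; 0.0 means the
    hypothesis boundaries match the reference exactly."""
    n = len(ref)
    assert n == len(hyp) and 0 < k <= n
    disagreements = sum(
        1 for i in range(n - k + 1)
        if sum(ref[i:i + k]) != sum(hyp[i:i + k])
    )
    return disagreements / (n - k + 1)
```

Window-based scores of this kind give partial credit to boundaries that are close to, but not exactly at, the reference position, which a strict exact-match count would not.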

Using text segmentation for information retrieval (IR) is one example of a broader group of passage retrieval techniques, in which documents are split into a set of passages and similarity to the query is computed for each passage instead of for whole documents. Liu and Croft classify approaches to splitting documents into passages as structural, semantic, window-based, and arbitrary [27]. Semantic component instances might be considered a form of semantic passages, although not all document text is necessarily included in any of the semantic component instances. A more significant difference between our approach and passage retrieval is that we propose to use information about semantic component instances to supplement, not replace, whole-document retrieval techniques.
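As a toy illustration of the window-based approach (with made-up window sizes and a simple term-overlap score, not the language models of [27]):

```python
def passages(words, size, step):
    """Yield overlapping fixed-size word windows (a window-based split)."""
    for start in range(0, max(1, len(words) - size + step), step):
        yield words[start:start + size]

def best_passage_score(doc_text, query, size=8, step=4):
    """Score a document by its best passage: the fraction of query terms
    appearing in the highest-scoring window."""
    q = set(query.lower().split())
    words = doc_text.lower().split()
    return max(len(q & set(w)) / len(q) for w in passages(words, size, step))
```

In a semantic-components setting, a passage score like this would supplement a whole-document score rather than replace it, for example by combining the two in a weighted sum.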

Novelty detection is similar to text segmentation in that the goal is to find instances of different subtopics, but the focus of novelty detection is on the different subtopics, not their location within documents. The TREC novelty task focused on finding sentences that were both relevant to a topic and novel, given the sentences that have already been seen [28]. Semantic component indexing differs from novelty detection because the aspects of interest, semantic components, are defined in advance and because the locations of the semantic component instances are important.
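A minimal sketch of the sentence-level task, using an illustrative word-overlap notion of relevance and novelty (not the techniques actually used by TREC participants):

```python
def novel_sentences(sentences, topic_terms, threshold=0.5):
    """Return sentences that are relevant (share a topic term) and novel
    (word overlap with every previously seen relevant sentence is below
    the threshold)."""
    seen, out = [], []
    for s in sentences:
        words = set(s.lower().split())
        if not (words & topic_terms):
            continue  # not relevant to the topic
        overlap = max((len(words & p) / len(words | p) for p in seen),
                      default=0.0)
        if overlap < threshold:
            out.append(s)
        seen.append(words)
    return out
```

Note that location plays no role here: the output is a set of sentences, whereas semantic component indexing must also record where in the document each instance occurs.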

Information extraction (IE) is a somewhat different subdocument-level task. Information extraction systems identify certain types of information in unstructured text, such as entities, facts, and events, and extract the information into databases or templates. Cardie points out that IE is inherently domain-specific since systems typically identify domain-specific relations among entities in the text [29]. Some of the semantic components we have identified in Decision Notices, such as Responsible Official and Date, are discrete, fact-oriented bits of information that would be suitable for extraction. Instances of other semantic components, such as Issues in Decision Notices or Management in Documents about Clinical Problems, tend to be more diffusely distributed in the text and less amenable to identification using IE pattern-matching techniques.
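For the discrete, fact-oriented components, IE-style pattern matching is straightforward to sketch. The field labels and patterns below are hypothetical; real Decision Notices may phrase these fields differently:

```python
import re

# Hypothetical label-based patterns for two discrete semantic components.
PATTERNS = {
    "Responsible Official": re.compile(r"Responsible Official:\s*(.+)"),
    "Date": re.compile(r"Date:\s*([A-Z][a-z]+ \d{1,2}, \d{4})"),
}

def extract_fields(text):
    """Fill a flat template with the first match for each pattern,
    or None when a field is absent."""
    return {field: (m.group(1).strip() if (m := pat.search(text)) else None)
            for field, pat in PATTERNS.items()}
```

Diffuse components such as Issues or Management have no such anchoring label or regular surface form, which is why this style of extraction does not carry over to them.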

6 Conclusions

Semantic components offer searchers a way to add further specificity to queries, which may allow search engines to more successfully match documents to the searcher's information need. This dissertation will explore issues surrounding developing the sets of document classes and semantic components appropriate to a document collection, expressing information needs using document classes and semantic components, (manually) indexing documents using semantic components, and the effects of semantic component queries and indexing on searching. We have outlined our approach to investigating each of these subtopics and presented a few preliminary results.

In addition, we see areas of future work that will be beyond the scope of the dissertation, including automating semantic component indexing, investigating alternative versions of the model, and studying various ways of presenting and using semantic components in a searching interface.

7 Acknowledgments

This work was supported in part by the National Science Foundation, grant numbers 0514238, 0511050, and 0534762 and by the National Library of Medicine Training Grant 5-T15-LM07088. Any opinions, findings, conclusions, or recommendations expressed here are those of the author and do not necessarily reflect the views of the National Science Foundation.

The work described here is joint work with my advisor, Lois Delcambre, at Portland State University and Marianne Lykke Nielsen at the Royal School of Library and Information Science in Denmark. The indexing and searching studies were done with the support and collaboration of the national Danish health portal. Tim Tolle provided expert assistance with the analysis of the NEPA documents.


1. Louise Spiteri. A simplified model for facet analysis. Canadian Journal of Information and Library Science, 23, pp. 1-30, 1998. Reprinted by the Information Architecture Institute at <>.

2. Susan L Price, et al. Using Semantic Components to Facilitate Access to Domain-Specific Documents in Government Settings. In The 7th Annual International Conference on Digital Government Research (dg.o), San Diego, California, May 21-24, 2006.

3. Susan L Price, Lois M Delcambre, and Marianne Lykke Nielsen. Using semantic components to express clinical questions against document collections. In International Workshop on Healthcare Information and Knowledge Management (HIKM 2006), Arlington, VA, November 11, 2006.

4. L Rolling. Indexing Consistency, Quality and Efficiency. Information Processing & Management, 17, pp. 69-76, 1981.

5. Klaus Krippendorff. Content analysis: An introduction to its methodology. Second ed. Sage Publications: Thousand Oaks, CA, 2004.

6. Klaus Krippendorff. Measuring the reliability of qualitative text analysis data. Quality and Quantity, 38, pp. 787-800, 2004.

7. URL: <>, Accessed: June 15, 2006.

8. JW Ely, et al. Analysis of questions asked by family doctors regarding patient care. BMJ, 319(7206), pp. 358-361, 1999.

9. JW Ely, et al. A taxonomy of generic clinical questions: classification study. BMJ, 321(7258), pp. 429-432, 2000.

10. UpToDate. URL: <>, Accessed: February 22, 2005.

11. Susan L Price, et al. Semantic components enhance retrieval of domain-specific documents. In Conference on Information and Knowledge Management (CIKM 2007), Lisbon, Portugal, November 6-9, 2007 (accepted).

12. Kevin Crowston and Barbara H Kwasnik. Can document-genre metadata improve information access to large digital collections? Library Trends, 52(2), 2003.

13. Andreas Rauber and Alexander Müller-Kögler. Integrating automatic genre analysis into digital libraries. In Proceedings of the 1st ACM/IEEE-CS joint conference on digital libraries (JCDL '01), Roanoke, VA, 2001.

14. Luanne Freund, Elaine G Toms, and Charles LA Clarke. Modeling task-genre relationships for IR in the workplace. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval SIGIR '05, Salvador, Brazil, August 2005.

15. Wanda J Orlikowski and JoAnn Yates. Genre repertoire: The structuring of communicative practices in organizations. Administrative Science Quarterly, 39, pp. 541-574, 1994.

16. Kevin Crowston and Marie Williams. Reproduced and emergent genres of communication on the World-Wide Web. The Information Society, 16(3), pp. 201-215, 2000.

17. Anne M Turner, et al. Modeling public health interventions for improved access to the gray literature. J Med Libr Assoc, 93(4), pp. 487-494, 2005.

18. A Dillon. Reader's models of text structures: the case of academic articles. International Journal of Man-Machine Studies, 35, pp. 913-925, 1991.

19. AP Bishop. Document structure and digital libraries: How researchers mobilize information in journal articles. Information Processing & Management, 35, pp. 225-279, 1999.

20. Gretchen P Purcell, Glenn D Rennels, and Edward H Shortliffe. Development and evaluation of a context-based document representation for searching the medical literature. International Journal on Digital Libraries, 1, pp. 288-296, 1997.

21. Fabrizio Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1), pp. 1-47, 2002.

22. Jay M Ponte and W Bruce Croft. Text segmentation by topic. In Proceedings of the First European Conference on Research and Advanced Technology for Digital Libraries, Lecture Notes in Computer Science 1324, pp. 113-125, 1997.

23. James Allan, et al. Topic Detection and Tracking Pilot Study Final Report. In Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, pp. 194-218, Lansdowne, VA, February 8-11, 1998.

24. MA Hearst and C Plaunt. Subtopic structuring for full length document access. In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 59-68, Pittsburgh, PA, 1993.

25. Marti A Hearst. TextTiling: Segmenting text into multi-paragraph subtopic passages. Computational Linguistics, 23(1), pp. 33-64, 1997.

26. Min-Yen Kan, Judith L Klavans, and Kathleen R McKeown. Linear segmentation and segment significance. In Proceedings of the Sixth Workshop on Very Large Corpora, pp. 197-205, Montreal, Canada, 1998.

27. Xiaoyong Liu and W Bruce Croft. Passage Retrieval Based On Language Models. In Proceedings of the Eleventh International Conference on Information and Knowledge Management (CIKM '02), pp. 375-382, McLean, VA, November 4-9, 2002.

28. Ian Soboroff. Overview of the TREC 2004 Novelty Track. In Proceedings of the Thirteenth Text Retrieval Conference (TREC 2004).

29. Claire Cardie. Empirical methods in information extraction. AI Magazine, 18(4), pp. 65-79, 1997.


© Copyright 2007 Susan L. Price
