Volume 5 Issue 3
Winter 2009
ISSN 1937-7266

Non-topical Classification for Healthcare Information

Emi Ishita

Surugadai University
698 Azu, Hanno, Saitama, Japan 357-8555


Abstract

My research purpose is to classify text found in online healthcare resources into the non-topical categories of credibility, reading level, perspective, and intended audience. My ultimate goal is to display resources with aspect labels for both topical and non-topical categories in an end-user information system. The key enabling objective on which I focus is to discover which features are useful for specific non-topical classification tasks.

Categories and Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval – content analysis and indexing.

General Terms

Measurement, Experimentation.


Keywords

Non-topical classification, Intended audience, Credibility, Reading level, Subjectivity.


1. Introduction

Many kinds of information are available online. A general search engine shows documents that are (hopefully) relevant, or that at least include some of the search terms that the user inputs. These results help users to find documents with a specific topical focus. However, it is difficult to know from search results alone about other important aspects, such as which audience a document was written for, whether the information presented in the document is credible, or what perspective the document takes in presenting that information. Assessing such factors manually can be time consuming, and can sometimes require skills that the user may lack. If search results could be displayed with non-topical categories, users could find the type of information that they seek more easily. Non-topical classification is the key enabler for such a capability.

Text classification research has focused principally on topical classification, but classification based on non-topical factors such as source authority, sentiment, and reading level has attracted increasing attention in recent years. In my research, the non-topical categories of interest are intended audience, credibility, reading level and subjectivity. Classification techniques based on machine learning are applied. By using one collection with four categories, I expect to be able to compare the differences in optimal feature selection, both among non-topical categories and between non-topical and topical categories.


2. Proposed System

This section presents an outline of my proposed system. An illustrative example is shown in Figure 1. In that example, the document is displayed with intended audience, credibility, reading level and subjectivity, and also with a topical category and a topical sub-category. Users are able to access documents by topical categories and sub-categories using search terms, for example, “breast cancer” and “treatment.” Existing search engines show results based on topical relevance and other features (e.g., Google leverages the link structure of the Web to estimate authority, which captures some but not all aspects of credibility).

Different kinds of users have different kinds of information needs. For example, a user who is a breast cancer patient might want to read another patient’s experience in order to understand that patient’s feelings, even if the document lacks credibility in the traditional fact-oriented sense. In Figure 1, Documents A and B have the same topical category and the same topical sub-category. Document A describes treatment of breast cancer and is written by a breast cancer expert. Document B, by contrast, was written by a patient with breast cancer. This is an example of different perspectives. With regard to reading level, a user who is not a native speaker of the document language may prefer simpler forms of expression and more general vocabulary. Non-topical categories can, therefore, help users sort search results in ways that better match their information needs.

In the remainder of this section, I describe my research design, how I plan to address non-topical categorization, and the evaluation methods that I plan to use.


                        Document A            Document B
Topical category        Breast cancer         Breast cancer
Topical sub-category
Non-topical category:
  Intended audience     Lay person            Lay person
  Credibility
  Reading level
  Perspective           Doctor’s report       Patient’s experience
Figure 1. Illustrating the use of non-topical categories.

2.1 Research design

A sketch of the key steps in applying machine learning to classifier design is shown in Figure 2. Before proceeding with classifier design, however, the following points need to be addressed:

  • What topics to focus on?
  • What languages to focus on?
  • What non-topical categories to recognize?

Which non-topical categories might prove useful depends, of course, on the topic. For example, a description of a patient’s experience could be very useful for medical topics, but perhaps less so for financial topics. In order to focus on a well-defined set of non-topical categories, some topical domain or domains must therefore be chosen. My research focuses on healthcare information as a broad topic domain. A 2005 Pew study reported on a November 2004 survey that “Eight out of ten (79%) American Internet users have searched for information on at least one major health topic online, statistically the same portion (80%) as in 2002. That translates to about 95 million American adults (18+ years) who use the Internet to find health information (a small increase from 2002 because the Internet population has grown a bit)” [1]. Clearly, it is very popular to look for healthcare information online. A domain-specific information system for healthcare information could therefore be immediately useful. Breast cancer will initially be the specific topical focus of my prototype system. Fixing only one topic at the outset of my work is useful, since it serves to shape the non-topical category set more narrowly and thus simplifies design and evaluation. After evaluation, I may add additional medical conditions. And, of course, similar techniques could be applied to other medical conditions, and to other domains, in future work.

I have chosen to focus on English documents initially because vastly more healthcare information is written in English than in any other language. If time allows, I may replicate that work in Japanese in order to begin to characterize the effects of language and cultural characteristics on non-topical classifier design.

The remainder of this section addresses the choice of non-topical categories, classifier design, and evaluation design.

Figure 2. Research design.

2.2 Non-Topical Categories and Feature Design

The initial non-topical categories for breast cancer in English documents will be intended audience, credibility, reading level and perspective. There has been considerable research on topical classification, and feature design is therefore well understood for that case; term frequency is by far the most widely used feature for topical classification. Feature design is, however, an underexplored issue for non-topical classification.

2.2.1 Intended Audience

There are many kinds of users who look for online healthcare information (e.g., physicians, nurses, pharmacists, medical students, patients, potential patients, and families or friends of patients). Medical documents written for an expert may use terminology in ways that would be hard for a lay person to understand. I therefore intend to classify the intended audience as a binary variable: expert or lay person. Readers with specialized medical training will be considered experts; others will be considered lay persons. Intrinsic evaluation of the intended audience classifier will be used to measure classification effectiveness. In the ultimate system, information access by lay persons will be the focus, so this classifier will then be used as a filter.

For example, the "flu" may be familiar to lay persons, while experts use more specific terms like H1N1, H9N2, H5N1, and H2N2. A term list of medical words may be useful for identifying documents that are intended for experts. I plan to identify words that are used more frequently in academic documents by searching academic documents about breast cancer in PubMed [2] using "Breast Cancer" as the query.
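As a concrete sketch, expert-indicative terms could be ranked by how much more frequently they occur in an academic corpus than in a lay corpus. The tiny corpora below are invented stand-ins for PubMed abstracts and consumer-oriented pages, and the smoothing and cutoff are illustrative assumptions:

```python
from collections import Counter

def expert_term_ratios(expert_docs, lay_docs, min_count=2):
    """Rank terms by how much more often they appear in expert text.

    expert_docs / lay_docs are lists of pre-tokenized documents
    (lists of lowercase word tokens).  Add-one smoothing avoids
    division by zero for terms absent from the lay corpus.
    """
    expert = Counter(t for doc in expert_docs for t in doc)
    lay = Counter(t for doc in lay_docs for t in doc)
    n_expert = sum(expert.values())
    n_lay = sum(lay.values())
    ratios = {}
    for term, count in expert.items():
        if count < min_count:           # ignore very rare terms
            continue
        p_expert = (count + 1) / (n_expert + 1)
        p_lay = (lay[term] + 1) / (n_lay + 1)
        ratios[term] = p_expert / p_lay
    return sorted(ratios, key=ratios.get, reverse=True)

# Toy example: invented snippets standing in for the two corpora.
expert_docs = [["adjuvant", "chemotherapy", "regimen", "her2"],
               ["her2", "metastatic", "carcinoma", "adjuvant"]]
lay_docs = [["breast", "cancer", "treatment", "doctor"],
            ["cancer", "doctor", "flu", "treatment"]]
terms = expert_term_ratios(expert_docs, lay_docs)
```

With these toy inputs, only the repeated expert-only terms survive the frequency cutoff; a real term list would of course be built from much larger retrieved collections.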

2.2.2 Credibility

Credibility, as used here, is an assessment of whether the information content of a document is reliable or believable; in a sense, it is a measure of quality. Naturally, many users who look for healthcare information will prefer reliable information. It can be difficult, however, for some users to judge whether a document is credible or not.

There has been some research on how users assess or judge the credibility of general Web sites. Fogg et al. [3, 4] conducted an online study to investigate how different elements of a Web site affect people’s perceptions of credibility. They asked over 1,400 people in the U.S. and Europe to evaluate 51 Web sites using a questionnaire. Their results identified seven main elements for evaluating credibility: real-world feel, ease of use, expertise, trustworthiness, tailoring, commercial implications, and amateurism. Moreover, they identified specific high-level features (which they called “items”) on which the assessment of each element could be based. For example, “the site lists the organization’s physical address” and “the site gives a contact phone number and email address” were indicative of real-world feel, “the site looks professionally designed” was indicative of ease of use, and for trustworthiness, indicative features included “the site is linked to by a site you think is believable” and “the site states its policy and content.” Some of these high-level features can serve as starting points for the design of automatically detected (lower-level) features for credibility classification.

Fukushima and Utsumi [5] proposed a computational method for predicting the credibility of Japanese Web pages using the content of the Web page and other associated information. Sixty-eight features were designed (after a survey and questionnaire), and 40 features were selected as a basis for predicting credibility. Examples of these 40 features include whether the site links to other sites or other sources, whether the update date is displayed, and whether a search function is provided.

O’Grady [6] developed a theoretical framework for depiction of credibility in healthcare Web sites and introduced some related terminology and implementation details. In her paper, “quality criteria items” (i.e., features) include content, authorship, advertising, currency, scope, contact information, and legal issues.

This review of the literature suggests that the following initial set of features might generally be useful for credibility classification:


  • Domain name matches known credible sources (e.g., domain names that end in .gov or .edu).
  • Contact information (e.g., a postal address, a phone number and/or an email address) is provided.
  • Information about the author is provided (e.g., a profile or an affiliation).


  • The date on which the page was created or updated is provided.
  • Updated information is clearly identified (e.g., using “new” or “update”).


  • There are links to other resources or Web pages.
  • A formal mode of expression is used.
  • There are few or no links to advertisements.
  • There are few or no dead links.

The key contributions of my research will be to extend this list with additional domain-specific features, and then to perform feature selection in ways that are well matched to the domain, the task, and the characteristics of available classification techniques.

2.2.3 Reading Level

Although I plan to focus on documents for which a lay person is the intended audience, that still subsumes quite a broad range of abilities and skills. In particular, there are different levels of reading ability resulting from factors such as education, age, and language proficiency. By annotating search results with an indication of reading level, users should be able to more easily identify suitable documents.

For example, descriptions of the flu written for children (in KidsHealth [7]) and for adults (in MedlinePlus [8]) are shown in Figures 3 and 4, respectively. The description for children in Figure 3 uses simpler expressions and avoids difficult words like “symptoms” or “gastroenteritis” that are used in MedlinePlus.

In contrast to credibility, which relies heavily on features related to the design, function, and technical characteristics of a document, classification of reading level typically relies more strongly on words and grammatical structures found in the document. For example, Heilman et al. [9] developed a measure of “readability” based on a combination of lexical and grammatical features derived from sub-trees of syntactic parses.

What Is the Flu?

Influenza (say: in-floo-en-zah) is also called the flu. It's an infection that causes fever, chills, cough, body aches, headaches, and sometimes earaches or sinus problems. The flu is caused by the influenza virus (say: vy-rus). A virus is a microorganism (say: my-croh-or-gah-nih-zum), which means it's so small that you can't see it without a strong microscope.

For most people, the flu is a drag, but it goes away in a week or two. But for some people, the flu can make them very sick.

Those groups include:

  • babies and kids younger than age 5
  • people older than 50
  • adults and kids who have health problems, such as diabetes and asthma

Figure 3. The description of the flu on the KidsHealth site.


Also called: Grippe, Influenza

Flu is a respiratory infection caused by a number of viruses. The viruses pass through the air and enter your body through your nose or mouth. Between 5% and 20% of people in the U.S. get the flu each year. The flu can be serious or even deadly for elderly people, newborn babies and people with certain chronic illnesses.

Symptoms of the flu come on suddenly and are worse than those of the common cold. They may include

  • Body or muscle aches
  • Chills
  • Cough
  • Fever
  • Headache
  • Sore throat

Is it a cold or the flu? Colds rarely cause a fever or headaches. Flu almost never causes an upset stomach. And "stomach flu" isn't really flu at all, but gastroenteritis.

Figure 4. The description of the flu on the MedlinePlus site.

I therefore plan to initially try the following features:

  • sentence length
  • word length
  • the number of syllables in a word
  • word difficulty
  • grammatical structure
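The first four features above reduce to simple counts over words and sentences; grammatical structure would require a parser and is omitted here. A minimal sketch follows, in which the syllable heuristic, the three-syllable "hard word" cutoff, and the example sentences are all illustrative assumptions:

```python
import re

VOWEL_GROUPS = re.compile(r"[aeiouy]+")

def count_syllables(word):
    """Crude syllable estimate: runs of vowels, with a silent-e fix."""
    word = word.lower()
    groups = len(VOWEL_GROUPS.findall(word))
    if word.endswith("e") and groups > 1:   # e.g. "make" -> 1 syllable
        groups -= 1
    return max(groups, 1)

def readability_features(text):
    """Sentence length, word length, syllables, and word difficulty."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = [count_syllables(w) for w in words]
    return {
        "avg_sentence_len": len(words) / len(sentences),
        "avg_word_len": sum(len(w) for w in words) / len(words),
        "avg_syllables": sum(syllables) / len(syllables),
        # "Difficult" words approximated as three or more syllables.
        "hard_word_rate": sum(s >= 3 for s in syllables) / len(words),
    }

kids = readability_features("The flu can make you sick. Rest a lot.")
adult = readability_features(
    "Influenza is a contagious respiratory illness caused by viruses.")
```

On these two invented sentences, the child-oriented text scores lower on sentence length and hard-word rate, which is the kind of separation a reading-level classifier would exploit.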

2.2.4 Subjectivity

There are many kinds of authors online, and they write their documents from different viewpoints. For example, doctors writing for lay readers might be expected to write factual accounts about what is known, while patients writing for other patients might instead write more experiential, and hence more subjective, accounts. Moreover, some documents (e.g., advertisements) might be written with advocacy as their goal. In this research, I plan to focus on distinguishing objectivity from subjectivity.

Detection of subjectivity can draw on existing work in sentiment analysis, genre classification, and “style” classification. Sentiment analysis is presently a hot topic in text classification research (e.g., Su and Markert [10] have characterized the subjectivity of word senses). Kessler et al. [11] used structural, lexical, character-level and derivative information as cues to identify genres. Argamon et al. [12] developed lexical features for stylistic classification based on a taxonomy of semantic functions for certain words or phrases.

Hiroshima et al. [13] presented a method for searching the Web for sentences expressing opinions. They collected declaratively subjective clues from opinion-expressing sentences in Japanese Web pages retrieved with opinion search queries. For example, they collected words such as "should," "think," "possibly," "glad," and "wondering." These words were extracted from Japanese Web pages on topics such as products, entertainment, and facilities. Of course, there might be more appropriate words to express an author’s feelings regarding a healthcare topic. I plan to select such words from the training data.
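Once such clue words are selected, a simple document-level feature is the fraction of tokens that are clue words. The clue list below is a small invented placeholder for the list that would actually be learned from labeled training data, and the example sentences are fabricated:

```python
import re

# Illustrative placeholder; the real list would be learned from training data.
SUBJECTIVE_CLUES = {"should", "think", "possibly", "glad", "wondering",
                    "feel", "hope", "afraid"}

def subjectivity_score(text):
    """Fraction of tokens that are subjective clue words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in SUBJECTIVE_CLUES)
    return hits / len(tokens)

# Invented patient-style and doctor-style sentences.
patient = subjectivity_score("I think you should ask; I feel glad I did.")
doctor = subjectivity_score(
    "Tamoxifen blocks estrogen receptors in breast tissue.")
```

The score itself would not be a classifier, but it is the kind of lexical feature that a subjectivity classifier could combine with structural and stylistic cues.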

2.3 Classifier Design

Developing new classifier designs is not a focus of my work, so I intend to use the following well-understood classifier designs that are based on supervised machine learning:

  • Support Vector Machine (SVM)
  • k-Nearest Neighbor (kNN)
  • Maximum Entropy (MaxEnt)

I have experience with SVM and kNN classifiers from recent work on non-topical classification for other applications [14].
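Of the three designs, kNN is simple enough to sketch directly; SVM and MaxEnt would in practice come from an established toolkit. The sketch below classifies bag-of-words documents by majority vote over the k nearest training documents under cosine similarity. The toy training documents and labels are invented for illustration:

```python
from collections import Counter
import math

def bow(text):
    """Bag-of-words representation as a term-frequency Counter."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def knn_classify(train, doc, k=3):
    """train: list of (Counter, label) pairs; doc: Counter.
    Majority vote among the k most similar training documents."""
    ranked = sorted(train, key=lambda ex: cosine(ex[0], doc), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Invented labeled examples for the intended-audience task.
train = [
    (bow("adjuvant chemotherapy regimen dosage"), "expert"),
    (bow("her2 receptor metastatic carcinoma"), "expert"),
    (bow("my doctor said the treatment went well"), "lay"),
    (bow("i was scared before my surgery"), "lay"),
]
label = knn_classify(train, bow("the doctor said my surgery went well"), k=3)
```

The same train/classify interface would apply unchanged when swapping in SVM or MaxEnt implementations, which is what makes comparing the three designs on one collection straightforward.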

2.4 Evaluation

Informative, replicable, and affordable evaluation of classification accuracy requires a test collection. Building a test collection involves two steps: (1) document selection and (2) ground truth annotation. For document selection, many kinds of resources are available online, including institutional resources (e.g., MedlinePlus [8]), question answering sites for healthcare information (e.g., the health category of Yahoo! Answers [15]), archives of patient-oriented mailing lists (e.g., the Breast Cancer Mailing List Archives [16]), and blogs in which patients describe their personal experiences. I plan to assemble documents by constructing a moderately large set of 20-30 queries (e.g., “breast cancer treatment,” “breast cancer surgery,” and “breast cancer testing”) that are expected to yield overlapping result sets from general-purpose search engines. These queries will initially be based on literature reviews and on the observed structure of prominent institutional healthcare Web sites and online communities, with iterative reformulation used as needed. For each query, as many of the top 1,000 search engine results as possible will be downloaded, and the result sets will be merged.
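The merging step amounts to deduplicating URLs across queries while remembering each page's best rank. The sketch below assumes a light URL normalization (lowercased host, trailing slash and fragment dropped) as the notion of "same page"; the example URLs are invented:

```python
from urllib.parse import urlparse, urlunparse

def normalize(url):
    """Light normalization so one page fetched via two queries counts once."""
    p = urlparse(url)
    return urlunparse((p.scheme.lower(), p.netloc.lower(),
                       p.path.rstrip("/"), p.params, p.query, ""))

def merge_result_sets(result_sets):
    """Merge ranked result lists from several queries, keeping the
    best (lowest) rank at which each distinct URL appeared."""
    best = {}
    for results in result_sets:
        for rank, url in enumerate(results, start=1):
            key = normalize(url)
            if key not in best or rank < best[key]:
                best[key] = rank
    return sorted(best, key=best.get)

# Invented result lists from two hypothetical queries.
merged = merge_result_sets([
    ["http://A.example/page/", "http://b.example/x"],
    ["http://a.example/page", "http://c.example/y"],
])
```

Interval sampling for annotation, as described below, can then be applied directly to this merged ranking.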

I will then train human assessors to assign the specific topical and non-topical labels of interest to each page in samples whose size and composition are selected to balance affordability and insight. I plan to automatically divide the collection into two groups: documents written for experts and documents written for lay persons. I will ask the human assessors to assign non-topical categories to pages that are automatically detected as being intended for lay persons. If too many documents are assigned to that category, I will use interval sampling based on the rankings from the search engines. Accuracy, precision, recall, and the F-measure will be used to evaluate the system.
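For a binary category these measures reduce to simple counts over the assessors' gold labels and the classifier's predictions. A sketch, with invented gold and predicted labels for the intended-audience task:

```python
def prf(gold, predicted, positive):
    """Accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive == p for g, p in pairs)          # true positives
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    accuracy = sum(g == p for g, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

# Invented labels for five documents.
gold = ["expert", "lay", "lay", "expert", "lay"]
pred = ["expert", "lay", "expert", "expert", "lay"]
acc, p, r, f = prf(gold, pred, positive="expert")
```

Reporting both precision and recall matters here because the intended-audience classifier is used as a filter: low recall on lay documents would silently remove material the end users most need.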


3. Conclusion

I have described the outline of my research. My goal is to classify text found in online healthcare resources by the non-topical categories of credibility, reading level, perspective, and intended audience. My ultimate goal is to display resources with aspect labels for both topical and non-topical categories in an end-user information system. This research is at an early stage, and there has been prior work on each non-topical category. My principal goal is, therefore, to identify for each non-topical category which features are most useful for classification.


References

[1] Fox, S. “Health information online: eight in ten Internet users have looked for health information online, with increased interest in diet, fitness, drugs, health insurance, experimental treatments, and particular doctors and hospitals.” Washington, DC: Pew Internet & American Life Project, 2005.
[2] PubMed, URL: http://www.ncbi.nlm.nih.gov/pubmed/
[3] Fogg, B. J., Kameda, T., Boyd, J., Marshall, J., Sethi, R., Sockol, M., and Trowbridge, T. Stanford-Makovsky Web Credibility Study 2002: Investigating What Makes Web Sites Credible Today. Stanford University, 2002.
[4] Fogg, B. J., Marshall, J., Laraki, O., Osipovich, A., Varma, C., Fang, N., Paul, J., Rangnekar, A., Shon, J., Swani, P., and Treinen, M. “What makes a Web site credible? A report on a large quantitative study.” Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems, 61-68, 2001.
[5] Fukushima, T. and Utsumi, A. “Automatic estimation of Web page credibility.” Journal of Japan Society for Fuzzy Theory and Intelligent Informatics, Vol.19, No.3, 239-249, 2007 [in Japanese].
[6] O’Grady, Laura. “Future directions for depicting credibility in health care web sites.” International Journal of Medical Informatics. Vol.75, No.1, 58-65, 2006.
[7] KidsHealth, URL: http://kidshealth.org
[8] Medline Plus, URL: http://medlineplus.gov/
[9] Heilman, M., Collins-Thompson, K., and Eskenazi, M. "An Analysis of Statistical Models and Features for Reading Difficulty Prediction." Proceedings of the Third Workshop on Innovative Use of NLP for Building Educational Applications, Association for Computational Linguistics, 2008.
[10] Su, F. and Markert, K. “From words to senses: A case study of subjectivity recognition.” Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), 825-832, 2008.
[11] Kessler, B., Nunberg, G. and Schütze, H. “Automatic detection of text genre.” Proceedings of the Eighth Conference of the European Chapter of the Association for Computational Linguistics, 32-38, 1997.
[12] Argamon, S., Whitelaw, C., Chase, P., Hota, S. R., Garg, N. and Levitan, S. "Stylistic text classification using functional lexical features." Journal of the American Society for Information Science and Technology. Vol.58, No.6, 802-822, 2007.
[13] Hiroshima, N., Yamada, S., Furuse, O. and Kataoka, R. “Searching for sentences expressing opinions by using declaratively subjective clues.” Proceedings of the Workshop on Sentiment and Subjectivity in Text, 39-46, 2006.
[14] Fleischmann, K.R., Oard, D.W., Cheng, A.-S., Wang, P., & Ishita, E. “Automatic Classification of Human Values: Applying Computational Thinking to Information Ethics.” Proceedings of the 72nd Annual Meeting of the American Society for Information Science and Technology, Vancouver, BC, Canada, 2009. (to appear).
[15] Yahoo! Answers, URL: http://answers.yahoo.com/
[16] Breast Cancer Mailing List Archives, URL: http://bclist.petebevin.com/