Characterizing the Quality of Resources in Educational Digital Libraries
with Natural Language Processing and Machine Learning
Philipp G. Wetzler
University of Colorado
594 UCB
Boulder, CO, USA
philipp.wetzler@colorado.edu
Abstract
With the rise of community-generated web content, the need for automatic characterization of resource quality has grown, particularly in the realm of educational digital libraries. I demonstrate how machine learning approaches to automatic quality characterization can be made tractable by identifying concrete factors of quality for web-based educational resources. Using data from previous studies, we selected a set of key indicators of quality that were commonly identified by educators. We developed a set of annotation guidelines and annotated a corpus of 1000 digital resources for the presence or absence of these key quality indicators, achieving high agreement among annotators. Initial machine learning models trained from this corpus were able to identify some indicators of quality with as much as an 18% improvement over the baseline. This document discusses the approaches I will take during my dissertation research to improve on those models and apply them in an educational digital library setting.
1 Overview
Facilitating students’ learning efforts and supporting teachers in providing a rich educational landscape to their students by leveraging technology has been identified as one of the great challenges of today [9]. Educational Digital Libraries like the National Science Digital Library (NSDL), the Digital Library for Earth System Education (DLESE) and MERLOT have emerged as a cornerstone of such cyber-learning efforts. A central goal of such libraries is to develop and curate collections of “high quality” web-based resources useful for teaching and learning across a wide range of grade levels and educational settings. These resources are created by individuals and institutions across a range of academic, government and non-profit sectors, and submitted to libraries for further dissemination into educational settings.
Vetting the quality of submitted resources is often a critical issue for educational libraries, but one that is very challenging in practice to carry out reliably and efficiently at the necessary scale. Interest in scalable methods for evaluating quality has grown with the rise of user- and community-generated content. Wikipedia, which depends entirely on user contributions, has spawned numerous research efforts exploring algorithmic approaches for automatically assessing article quality [1, 12, 3].
In addition to scalability, libraries face another challenge: the definition of quality is contextual. It depends on the alignment between the user constituency being served, the educational setting where the resource is deployed, and the intended purpose of the resource. For example, Connexions is an online project aimed at efficiently sharing modular scholarly content that is collaboratively created. Acknowledging that there are multiple perspectives on quality, the service allows third parties to create lenses, which provide different views onto collections. For instance, a professional society has created a lens to view resources that have been vetted by its own peer review processes. Others have created different lenses based on algorithmic assessments of resource popularity.
1.1 Research Questions
I am exploring algorithmic approaches for characterizing resource quality to address the challenges of scalability and of multiple perspectives. Grounded in prior work to identify dimensions of quality associated with specific educational purposes, I am developing algorithms capable of characterizing resources according to these dimensions. The main goal of my work is not to produce an overall quality assessment, but to create a rich, multi-dimensional characterization of different aspects of a resource’s quality. My goals can be summarized in these four research questions:
- What are key markers of high quality resources in educational digital libraries?
- How can these markers be accurately detected using natural language processing techniques?
- How can the result of such an analysis be employed to help in real-world tasks that require quality considerations?
- Having selected a use case, how effective is such a system in achieving its goal?
The following section discusses my overall approach, gives an outline of the methodology I employ to achieve my research goals and lists the anticipated contributions resulting from these efforts.
1.2 Research Methodology
Automatically detecting markers of quality in a resource is the core of my research; but in answering my research questions, I intend to address this problem not in isolation, but by first firmly rooting the approach in a study of human expert processes, and, once I’ve created promising models, applying and evaluating them in a real-world scenario. This human-centered approach is less common in the field of computational semantics, but has been successfully applied by my colleagues to related problems ([5],[7]).
Study of Expert Processes
In order to systematically approach the characterization of the quality of digital library resources, the specific aspects that are relevant in the context in which a resource may be evaluated must be identified. While prior work has suggested broad frameworks for defining information quality, such efforts are not specific enough to the domain of educational documents, which has its own criteria for what constitutes a “high quality” resource in a learning environment. I’m basing my research on a study conducted by Bethard et al., which examined the cognitive processes experienced educators followed in assessing resources for in-classroom use and identified the characteristics they used most commonly. Unlike previous studies, the researchers narrowed their analysis down to the various low-level indicators that the experts routinely referred to in making their decisions, and identified the set of such quality indicators that was most predictive of expert quality assessments. The study is discussed in detail in [11] and [2].
Creating a Training Corpus
The supervised machine learning algorithms that I use to assess the presence or absence of the quality indicators work by first constructing a statistical model of resources that do, and do not, exhibit that indicator. This is accomplished by analyzing a large corpus of resources, some of which are known to exhibit the indicator, and some of which do not. To create such a corpus, I conducted an annotation project with my colleagues, in which we selected 1000 educational resources from the Digital Library for Earth System Education (DLESE) and asked experts to assess each of them for the presence or absence of seven quality indicators identified by the expert study. This annotation project, together with the expert study, addresses research question one; it also facilitates the building of computational models, thus contributing to research question two. This is also discussed in [2].
Building Machine Learning Models
I use supervised machine learning algorithms to build models of the human annotators’ judgment on the seven indicators by statistically analyzing the human annotation on the training corpus. Using these models, similar judgments can be made automatically on previously unseen resources. In Section 2 I will describe my efforts to date to compile a feature set that allows high accuracy models to be built. I also show preliminary results and an analysis of the errors the current system makes. Finally, I will discuss ways in which I intend to improve the models’ performance.
Evaluation of Practical Usefulness
Quality is a very task-specific and subjective property. While the quality indicators I’m focusing on were derived from an analysis of the cognitive processes of human experts, that alone is not enough to show the usefulness of a system that can characterize a resource based on the indicators.
The approach discussed here does not aim to only make high-quality / low-quality distinctions; rather, the goal is to produce a quality profile that covers a range of dimensions of quality and can be flexibly applied according to the needs of the task. To evaluate the effectiveness of this approach I want to identify a real-world task in the context of educational digital libraries that can make use of this multi-dimensional characterization of a resource, and for which I can conduct an objective evaluation with the help of digital library experts. Section 4 discusses my thoughts on potential scenarios for such an evaluation.
1.3 Anticipated Contributions
- The completed annotation project provides a corpus of educational resources, which are tagged for seven quality indicators, as well as partial information about which of the linked pages are part of the resource. This data will be useful for further work in this domain.
- In the course of building computational models I will create various generalizable components for feature extraction and feature encoding. They are built on ClearTK [10], an open-source toolkit for natural language processing, and will thus be reusable by other researchers.
- Automatically detecting high level semantic properties of text, such as the markers of quality, has not been extensively studied. In the course of this work I will identify which machine learning features can be applied successfully to this and similar problems.
- Little is known about how to systematically apply a rich characterization of a resource’s quality to improve digital libraries’ utility on tasks their users are faced with. In addressing my research questions three and four I expect to improve our understanding of these issues, which will serve as a basis for future research.
2 Computational Models
We annotated a corpus of 1000 digital resources for the presence or absence of seven quality indicators (for details, see [2]); 800 of these were designated as the training / development set, the remaining 200 as the test set for the final evaluation. In the following I will outline my computational approach to building machine learning models to recognize these quality indicators. I will then present preliminary results and an error analysis, followed by a discussion of further ideas that I intend to explore. This implementation uses the ClearTK toolkit for statistical natural language processing [10] to process and extract features from a resource and SVMlight [8] for training machine learning models.
2.1 The Machine Learning System
Table 1 lists the current set of features, which are strongly based on prior work in document classification. The feature extraction process results in one numeric vector per digital library resource. During development the machine learning system analyzes these vectors, identifies which features are predictive of each quality indicator and which should be ignored, and so learns a statistical model for each of the seven indicators. During evaluation the machine learning system decides if the indicators are present or absent in a resource by applying those models to the vector generated from that resource.
I use the SVMlight package [8]. The training parameters are chosen using five-fold cross validation. The results reported below were achieved using a linear kernel SVM.
Feature
Bag-of-words
Bag-of-bigrams
TF-IDF
Resource URL
URLs linked to
URLs of images
Google PageRank
Alexa TrafficRank
Table 1: The current set of features
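To make this training setup concrete, the following is a minimal sketch in Python, using scikit-learn as a stand-in for the actual ClearTK and SVMlight pipeline described above; the texts, labels, and indicator names shown are placeholder data, not the annotated corpus.

```python
# Minimal sketch of the per-indicator training setup. scikit-learn stands in
# for the actual ClearTK + SVMlight pipeline; the texts and labels below are
# placeholders, not the annotated DLESE corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = [
    "This activity lists learning objectives and step-by-step instructions.",
    "A collection of photographs of rock formations.",
]
indicator_labels = {
    "has_instructions": [1, 0],     # 1 = indicator present, 0 = absent
    "has_learning_goals": [1, 0],
}

models = {}
for indicator, y in indicator_labels.items():
    # One linear-kernel SVM per quality indicator, trained on TF-IDF weighted
    # word unigrams and bigrams (the bag-of-words / bag-of-bigrams features).
    # The real system adds further features such as URLs and link ranks.
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
    model.fit(texts, y)
    models[indicator] = model

print(models["has_instructions"].predict(["Instructions: hand out the worksheet."]))
```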
2.2 Preliminary Results
In order to evaluate the effectiveness of my system, I trained and evaluated models on the training set portion of the corpus (800 resources) using cross-validation. I then compared the results to a simple baseline: always assume the most common case, ignoring the resource. For example, the has instructions indicator is present in 39% of resources. If I always assumed that a resource has no instructions, I would be correct in 61% of cases. An effective machine learning model will show significant improvement over this trivial baseline. Table 2 shows the accuracy when predicting each indicator using the baseline models and the machine learning models.
These results are encouraging in that even using very basic features the system was able to classify some of the indicators fairly well.
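To illustrate the comparison against the majority-class baseline, here is a small sketch using randomly generated placeholder data; the 39% rate mirrors the has instructions example above.

```python
# Illustrative comparison of the majority-class baseline against a linear SVM
# under five-fold cross-validation. X and y are random placeholder data, not
# the annotated corpus; with purely random features the SVM should not beat
# the baseline, which is exactly what this comparison is meant to detect.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = rng.rand(800, 50)                    # 800 resources, 50 hypothetical features
y = (rng.rand(800) < 0.39).astype(int)   # ~39% of resources exhibit the indicator

baseline = DummyClassifier(strategy="most_frequent")  # always predict "absent"
svm = LinearSVC()

print("baseline accuracy:", cross_val_score(baseline, X, y, cv=5).mean())
print("SVM accuracy:     ", cross_val_score(svm, X, y, cv=5).mean())
```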
2.3 Error Analysis
In order to better understand the current weaknesses and strengths of my models I conducted a study in cooperation with my colleagues at Digital Learning Sciences / UCAR to analyze the errors the system makes. For the purpose of this study we ignored the indicators has sponsor and appropriate for age, because the annotated data provides insufficient variation to conclusively train and evaluate models. I randomly split the set of 800 annotated training resources into a smaller training set (650 instances) and test set (150 instances), trained quality indicator models on the training set and ran them on the test set. By comparing the models’ results with the manual annotation on the test set I identified resources where two or more of the remaining five quality indicator models produced an apparently incorrect result, giving me a total of 39 resources with between 2 and 4 errors each.
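This selection step can be summarized in the following sketch; the resource identifiers, indicator names, and labels are hypothetical stand-ins for the held-out test set.

```python
# Sketch of selecting resources for the error analysis: keep those where two
# or more of the five indicator models disagree with the human annotation.
# The ids, indicators, and labels below are hypothetical placeholders.
gold = {
    "resource-001": {"has_instructions": 1, "has_learning_goals": 0, "identifies_age_range": 1},
    "resource-002": {"has_instructions": 0, "has_learning_goals": 1, "identifies_age_range": 1},
}
predicted = {
    "resource-001": {"has_instructions": 0, "has_learning_goals": 1, "identifies_age_range": 1},
    "resource-002": {"has_instructions": 0, "has_learning_goals": 1, "identifies_age_range": 1},
}

flagged = [
    rid for rid in gold
    if sum(gold[rid][ind] != predicted[rid][ind] for ind in gold[rid]) >= 2
]
print(flagged)  # resources with two or more apparent errors, to be reviewed by the expert
```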
A DLESE curation expert who had not taken part in the original annotation project was asked to analyze each of the errors on each of these 39 resources. For each error the expert completed an online questionnaire consisting of both enumerated choice and open-ended questions. This questionnaire asked the expert:
- which is correct: the human annotation or the automatic model’s result?
- is the indicator clearly present or absent, or is it ambiguous?
- are there cues in the text which clearly signal the indicator and should have been found; or are the cues implied, but not explicitly stated; or are the cues present in graphic elements or other parts not examined by the system (e.g. flash, images, etc.)?
Early results confirm that in 55% of the cases where the model did not detect the presence of an indicator, cues were clearly present in the text and should have been found. As an example, the Heat Transfer and El Niño resource clearly lists “curriculum standards”, but the algorithms did not recognize these when assessing the has learning goals indicator.
In approximately 29% of the cases where the model incorrectly detected the presence of an indicator the expert noted that there was text in the resource which could be mistakenly understood to signal the indicator. For example, for the resource Fossils in the Field the machine learning model predicted that the resource had the has instructions indicator. The expert noted that there were parts of the text that superficially may have been mistaken as instructions, stating that “the resource is a discussion of pedagogy” and “the words used provide ideas for instructing students […] but there are no instructions about using this resource as a professional reference.”
Approximately 15% of the time when the human annotator found the identifies age range indicator to be present while the model did not detect it, the cues were buried in graphic elements that are not examined by the algorithms. For example, the EarthStorm – Relative Humidity & Dew Point resource lists the grade ranges it supports in clickable buttons and not in the text of the resource.
This kind of information will help me target my future efforts more effectively, and to better understand how quality indicators are encoded within a resource.
3 Next Steps
I have presented results that show the feasibility of modeling quality indicators with natural language processing and machine learning techniques. The current results show success on some of the indicators, but performance is still poor on others. The error analysis so far already shows that the current feature set still misses many cues that, to human readers, are readily available in the text; next I will complete the error analysis and attempt to identify how cues are commonly missed by the features. Guided by these insights, and in order to improve performance across all indicator models, I intend to explore a larger set of features that aims to capture the structure of the content and to identify the more important concepts within a resource. I plan to implement these new features in two stages:
3.1 Stage 1: Structural Features
Resources that are cataloged by DLESE and other libraries are for the most part in HTML format, potentially linking to PDF files or containing rich media. The common approach is to run a simple parser that strips out HTML tags and discards presumed irrelevant portions of a page. Various implementations are available that perform this task (e.g. TagSoup, Beautiful Soup). My current system combines the TagSoup parser with a simple, manually compiled set of rules to arrive at a plain-text representation of the resource. This representation is then fed into the feature extraction process.
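As a rough illustration of this preprocessing step, the sketch below uses Beautiful Soup (one of the parsers mentioned above) instead of the TagSoup pipeline actually in use, and reduces the manually compiled rules to simply dropping script and style elements.

```python
# Rough sketch of reducing an HTML resource to a plain-text representation
# using Beautiful Soup; the actual system combines the TagSoup parser with a
# manually compiled rule set.
from bs4 import BeautifulSoup

def html_to_text(html):
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):   # drop obviously non-textual elements
        tag.decompose()
    return soup.get_text(separator="\n", strip=True)

html = "<html><body><h1>Heat Transfer</h1><p>Students will model ocean warming.</p></body></html>"
print(html_to_text(html))
```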
The plain-text representation still contains many things that are not part of the textual content of the web page. In particular, no attempt is made to distinguish between navigation elements, boilerplate (e.g. page headers or footers, copyright notices), advertisements, and educational content. A web page offers many visual cues (different fonts, font sizes, visual separation, headings / paragraphs, …) to help the user identify these parts and navigate the text; in the plain-text format, those cues are lost.
CLEANEVAL [4], organized by ACL SIGWAC in 2007, was a contest of automatic systems for preparing web sites for use in linguistic and language technology research. The contest focused on removing all HTML and JavaScript code, boilerplate, advertisements, and other content that was deemed not part of the “proper, coherent content” of the page. Paragraphs, headings and list elements were marked. A follow-up contest appears to be planned for 2009, but no details have been released. Arguably, the boilerplate parts of a page (title, copyright statements), navigation elements (e.g. a navigation bar on the side) and other parts that are not proper content still contain information that may be useful for identifying quality indicators; for example, the boilerplate sections are exactly where one would expect a resource to identify its publisher. Thus I would like to improve on CLEANEVAL’s approach by not simply throwing away those sections, but by categorizing them into content classes, allowing features to focus on the specific information they contain.
I have built a simple prototype system, which splits a web page along HTML block element tags (for example div, p and table tags). Using a simple ad-hoc ruleset, the resulting HTML fragments are then classified as paragraphs, headings, navigation, etc., based on features such as the number of words, capitalization, a language model and the amount of text within links. The performance of this prototype system is promising, but more work is needed to make the classification accurate and robust. This I intend to achieve by replacing the ad-hoc ruleset with a simple machine learning classifier; a relatively small amount of manually annotated training data, in combination with the published data from CLEANEVAL, should be sufficient.
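A simplified version of this segmentation and rule-based classification might look like the sketch below; the block tags, thresholds, and content classes are illustrative stand-ins for the actual ad-hoc ruleset (which also uses a language model).

```python
# Simplified sketch of the prototype: classify every HTML block element on a
# page with a few ad-hoc rules. The thresholds and content classes here are
# illustrative only, not the actual ruleset.
from bs4 import BeautifulSoup

BLOCK_TAGS = ["div", "p", "table", "ul", "h1", "h2", "h3"]

def classify_fragment(tag):
    text = tag.get_text(" ", strip=True)
    words = text.split()
    link_words = sum(len(a.get_text(" ", strip=True).split()) for a in tag.find_all("a"))
    if not words:
        return "empty"
    if link_words / len(words) > 0.5:
        return "navigation"                 # mostly link text
    if tag.name in ("h1", "h2", "h3"):
        return "heading"
    return "paragraph"

page = ("<div><h1>Fossils in the Field</h1><p>This activity asks students to "
        "describe fossils found on site.</p>"
        "<div><a href='/home'>Home</a> <a href='/about'>About</a></div></div>")
soup = BeautifulSoup(page, "html.parser")
for block in soup.find_all(BLOCK_TAGS):
    print(classify_fragment(block), "|", block.get_text(" ", strip=True)[:40])
```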
With this system in place I will explore a range of features that make use of the resulting information. For example, instead of having one bag-of-words feature over all the content, there can be multiple bag-of-words features for the different content classes, allowing the has sponsor indicator model to focus on the title of the page instead of its content, or the organized for learning goals model to focus on the navigation bar. Similarly, a feature may focus on the first few paragraphs within a resource, where one would expect to find instructions that help the user approach a resource effectively. With the added information a model for the has instructions indicator may be less likely to be distracted by other sections of the resource that use similar terminology.
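One way to realize such class-specific features, sketched here under the assumption that each resource has already been segmented into labeled fragments, is to prefix every token with its content class before vectorizing, so the same word in a heading and in a navigation bar becomes two distinct features.

```python
# Sketch of content-class-specific bag-of-words features: prefix each token
# with the class of the fragment it came from. The fragments and class names
# are hypothetical output of the segmentation step.
from sklearn.feature_extraction.text import CountVectorizer

def class_prefixed_tokens(fragments):
    # `fragments` is a list of (content_class, text) pairs for one resource.
    return " ".join(f"{cls}::{tok.lower()}"
                    for cls, text in fragments
                    for tok in text.split())

resource = [("heading", "Learning Goals"),
            ("navigation", "Home Goals Contact"),
            ("paragraph", "Students will explain heat transfer")]

vectorizer = CountVectorizer(token_pattern=r"\S+")
X = vectorizer.fit_transform([class_prefixed_tokens(resource)])
print(vectorizer.get_feature_names_out())
```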
3.2 Stage 2: Semantic Features
Table 1 listed the features my system currently uses. The content-based features, such as bag-of-words and bag-of-bigrams, rely solely on counting words that show up in the training set. These features are frequently used successfully for document classification tasks, and they achieve moderately good results in my system, as presented in Section 2.2; but they are very limited in the information they capture about the content of a resource. For example, this leads to problems when a resource uses slightly different terminology than previously seen resources to describe, say, learning goals. The automatic system ignores the new words, because they haven’t been used when talking about learning goals before; a human reader, on the other hand, could use a rich understanding of the words’ meaning to recognize that the new words are talking about the same thing. The Heat Transfer and El Niño resource mentioned in the error analysis provides a concrete example: the resource referred to learning goals as “curriculum standards”, as opposed to other, more common phrases, such as “education standards” or “learning objectives”. Similarly, a human reader is able to pick out the most relevant parts of a resource when making an assessment and is not likely to be distracted by single words in less important sections; based on the bag-of-words feature alone, a machine learning system cannot make that distinction. The current feature set captures lexical aspects of a resource, not semantic ones.
In stage 2 I intend to first explore how well-known semantic analyses from the field of computational semantics can be applied effectively to online resources. Named entity recognition (NER), in particular, is a promising approach: a NER system identifies phrases within text that refer to names of, e.g., organizations or locations. Such information would be useful in determining if a resource has a prestigious sponsor, for example. Other approaches I may employ include document summarization, cohesion metrics, and language models.
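For example, an off-the-shelf named entity chunker could flag candidate sponsor organizations; the sketch below uses NLTK’s chunker purely as an illustration and assumes its standard models have been downloaded.

```python
# Illustration of using an off-the-shelf named entity recognizer (NLTK's
# chunker) to pull out candidate sponsor organizations. Assumes the standard
# NLTK models ('punkt', 'averaged_perceptron_tagger', 'maxent_ne_chunker',
# 'words') have already been downloaded.
import nltk

text = "This activity is provided by the National Oceanic and Atmospheric Administration."
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)))

organizations = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees()
    if subtree.label() == "ORGANIZATION"
]
print(organizations)   # candidate evidence for the 'has sponsor' indicator
```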
Another idea that I want to pursue involves identifying the key concepts in each paragraph to make the conceptual organization of a resource more accessible to the machine learning algorithm. Lexical methods were used in [5] to identify overarching key concepts within a set of resources. Building on known computational semantics techniques like named entity recognition and coreference resolution, I will try to capture more fine-grained, discourse level concepts. Such information could serve to identify a paragraph’s main topics; this could be useful, for example, to distinguish between a resource that has individual sections dedicated to each learning goal (and would thus be considered organized for learning goals), and a resource that has no such structure.
4 Potential Use Cases
A more fine-grained quality profile of a resource that can be automatically created for large numbers of unseen resources, such as this research aims to provide, would certainly be applicable to a wide variety of tasks that experts and users of educational digital libraries are faced with. Still, while the experts I have worked with agree that there is much potential in the idea, a concrete scenario for applying the data the quality indicator models provide has yet to be defined. In collaboration with my colleagues at Digital Learning Sciences / UCAR, I’m planning to consider a range of scenarios in the context of DLESE to which the results of this research could be applied, and in which its effectiveness and usefulness could be evaluated. Out of these, I plan to choose one as a case study and to evaluate the results of my models. In the following I will outline some possible use cases.
Scaffolding Expert Reviews
Current DLESE review processes require the expert reviewers to manually evaluate a large number of resources. During review, the expert is confronted with the resource and little information to get them started. To help them identify problem areas more quickly, the digital library review system could flag potential problems for the reviewer to check, e.g. “this resource does not appear to indicate a target age range”. The reviewer would then check if that assessment is accurate and decide if the identified flaw is relevant in that particular instance. Hopefully such a system would shorten the average time required to review a resource. There are, of course, concerns with biasing the reviewer in one way or another before they’ve had a chance to form their own opinion.
Quality Indicators to Improve Search
The quality indicators can be combined into a quality score (which could be tweaked to be more or less sensitive to each individual indicator). Such a score could be used as a factor in a ranking scheme when searching the digital library, boosting resources of presumed high quality and making it more likely that they show up near the top of the results list.
Alternatively, instead of attempting to integrate the output of the quality indicator models into an overall quality score, the search interface could allow a user searching for a digital library resource for a specific purpose to indicate how important factors such as presence of instructions or a prestigious sponsor are; the result list can then be modified accordingly.
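A minimal sketch of such a scoring scheme, assuming each indicator model outputs a confidence in [0, 1] and weights come either from library defaults or from a user’s stated preferences, is shown below.

```python
# Minimal sketch of combining indicator confidences into a single quality
# score for ranking; the indicator names, confidences, and weights are
# hypothetical. Weights could be library defaults or user supplied via the
# search interface.
def quality_score(indicator_confidences, weights):
    total = sum(weights.values())
    return sum(weights[name] * indicator_confidences.get(name, 0.0)
               for name in weights) / total

confidences = {"has_instructions": 0.9, "has_sponsor": 0.2, "identifies_age_range": 0.7}
default_weights = {"has_instructions": 1.0, "has_sponsor": 1.0, "identifies_age_range": 1.0}
teacher_weights = {"has_instructions": 2.0, "has_sponsor": 0.5, "identifies_age_range": 1.5}

print(quality_score(confidences, default_weights))
print(quality_score(confidences, teacher_weights))   # emphasizes presence of instructions
```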
Quality Indicators to Educate Users
Another alternative is to provide direct quality indicator feedback to end users to support their cognitive processes. Research has shown that people are quick to trust information they find online without questioning its validity as a source [6]. Indicating potential issues with a resource to users may sensitize them to quality problems in resources they are evaluating. For example, indicating that a resource has no identified sponsor may help make students aware that they need to identify the source of a statement to evaluate its credibility. Similarly, indicating other potential quality problems may train users to be aware of such issues and not just take what they read at face value. To evaluate the effectiveness of such a system a user study would need to be done, comparing user behavior during and after a session using a quality indicator based system that indicates potential problems (e.g. “This page doesn’t seem to indicate a sponsor. Make sure you know who is providing this information.”) versus a system that, say, randomly prompts the user to inquire into the quality of a resource (e.g. “Make sure that you know who provided the sources you use!”).
References
[1] B. Thomas Adler and Luca de Alfaro. A content-driven reputation system for the Wikipedia. In Proceedings of the 16th International Conference on World Wide Web, pages 261–270, Banff, Alberta, Canada, 2007. ACM.
[2] Steven Bethard, Philipp G. Wetzler, Kirsten Butcher, James H. Martin, and Tamara Sumner. Automatically characterizing resource quality for educational digital libraries. In Proceedings of the 9th ACM/IEEE-CS Joint Conference on Digital Libraries, 2009. To be published.
[3] Joshua E. Blumenstock. Size matters: Word count as a measure of quality on Wikipedia. In Proceedings of the 17th International World Wide Web Conference, pages 1095–1096, New York, NY, USA, 2008. ACM.
[4] CLEANEVAL home page. http://cleaneval.sigwac.org.uk/, October 2008.
[5] Sebastian de la Chica. Generating Conceptual Knowledge Representations to Support Students Writing Scientific Explanations. PhD thesis, University of Colorado, 2008.
[6] Leah Graham and Panagiotis Takis Metaxas. “Of course it’s true; I saw it on the Internet!”: Critical thinking in the Internet era. Communications of the ACM, 46(5):70–75, 2003.
[7] Qianyi Gu. Personalized information searching to support intentional learning. To be published.
[8] Thorsten Joachims. Making large-scale support vector machine learning practical. In Advances in Kernel Methods: Support Vector Learning, pages 169–184. MIT Press, 1999.
[9] National Academy of Engineering. Introduction to the grand challenges for engineering, February 2008. http://www.engineeringchallenges.org/cms/8996/9221.aspx.
[10] Philip V. Ogren, Philipp G. Wetzler, and Steven Bethard. ClearTK: A UIMA toolkit for statistical natural language processing. In UIMA for NLP Workshop at the Language Resources and Evaluation Conference (LREC), 2008.
[11] Philipp G. Wetzler, Steven Bethard, Kirsten Butcher, James H. Martin, and Tamara Sumner. Automatically assessing resource quality for educational digital libraries. In Proceedings of the 3rd ACM Workshop on Information Credibility on the Web, 2009. To be published.
[12] Honglei Zeng, Maher Alhossaini, Li Ding, Richard Fikes, and Deborah McGuinness. Computing trust from revision history. In Proceedings of the 2006 International Conference on Privacy, Security and Trust, October 2006.