PanAnthropon: e-Knowledge Portal for Digital Humanities
toward Semantic Exploration and Visualization of
Intellectual, Cultural, & Scientific Connections
College of Information Science and Technology
Drexel University
Philadelphia, PA, USA
sofia.j.athenikos@acm.org
http://www.pages.drexel.edu/~sj94/
ABSTRACT
Due to its unique structural features and rich user-generated content, Wikipedia is being increasingly recognized as a valuable knowledge source that can be exploited for various applications. Nevertheless, the mode of information search and retrieval on Wikipedia remains that of conventional keyword-based search and retrieval of a list of articles ranked in terms of keyword matching. In this paper I propose to create a Web-based knowledge portal, based on the information extracted from Wikipedia, which enables semantics-based and visually-enhanced search and exploration involving the Western intellectual heritage, in particular, explicit/implicit intellectual/cultural connections among influential thinkers and ideas that have exerted major formative influences on Western civilization. The proposed research extends my pilot study on the WikiPhiloSofia project, which concerned extraction, analysis, and visualization of semantic and hyperlink information concerning 300 philosophers using Wikipedia. By exploiting the synergy between the Social Web and the Semantic Web and by prominently employing visualization as an effective mode of information presentation, the proposed project contributes to a paradigm shift from conventional information retrieval toward the next generation of Web-based information services/systems that seamlessly incorporate semantics, aesthetics, and edutainment, with useful applications for digital humanities and beyond.
ACM Classification
I 2.4 [Artificial Intelligence]: Knowledge Representation Formalisms and Methods – semantic networks; E 1.1 [Data Structures]: graphs and networks; H 3.3 [Information Storage and Retrieval]: Information Search and Retrieval – query formulation, search process; H 3.5 [Information Storage and Retrieval]: Online Information Services – web-based services; H 3.7 [Information Storage and Retrieval]: Digital Libraries – collection, systems issues; H 5.1 [Information Interfaces and Presentation]: Multimedia Information Systems - hypertext navigation and maps; H 5.4 [Information Interfaces and Presentation]: Hypertext/Hypermedia – navigation, user issues.
General Terms:
Algorithms, Design, Human Factors.
Keywords:
Digital Humanities, Digital Library, Information Visualization, Wikipedia, Web 2.0, Social Web, Semantic Web, Web 3.0, e-Learning, Social Network Analysis, HCI, User Experience Design.
1. INTRODUCTION
The Western intellectual heritage, stretching across nearly 3 millennia, consists of an extended genealogy of thinkers (philosophers, theologians, scholars, writers, artists, scientists, etc.) and corresponding ideas (concepts, theories, doctrines, ideologies, etc.) that have exerted major formative influences on Western civilization itself, directly or indirectly shaping its social, cultural, political, and economic underpinnings. The rich fabric of the Western intellectual/cultural tradition can be viewed from diverse angles, not only from the vertical, chronological perspective but also from the horizontal, interdisciplinary perspective, i.e., not only from the point of view of tracing the emergence and evolution of ideas and thinkers but also from the point of view of traversing the boundaries of various disciplines, subdisciplinary branches, and schools of thought, in search of the explicit/implicit connections that exist among the thinkers and among the ideas that have shaped the Western world as we know it.
The age of the Internet, or the age of information, has created the largest data repository the world has ever known, namely, the World Wide Web, with numerous Web sites boasting huge repositories of all forms of digitized data including digitized books, journals, films, artworks, etc. In particular, the advent of the so-called Web 2.0 [1] or Social Web has generated massive amounts of collaboratively-constructed data in all forms and on all subjects. The prime example of this novel phenomenon on the Web is the well-known Wikipedia (http://www.wikipedia.org/), a free, Web-based, multilingual encyclopedia project described as “the free encyclopedia that anyone can edit”. As of the time of this writing, July 2009, the English version of Wikipedia (http://en.wikipedia.org/) alone contains nearly 3 million articles on a variety of topics ranging from art and history to science and technology. On the other hand, there is also an ongoing movement toward the creation of what is called the Semantic Web [2], i.e., a transition from the Web of unstructured data that are understandable only by humans to the Web of knowledge consisting of semi-structured data the semantics of which can be understood by machines, whereby it becomes possible to automatically connect related pieces of data across data repositories, as exemplified by the linked data initiative (http://linkeddata.org/). The Social Web and the Semantic Web may well converge in the Social Semantic Web or Web 3.0.
How are the 3 millennia of the Western intellectual/cultural heritage covered in the 3 million articles of Wikipedia? What kinds of novel discoveries and interesting insights can be brought about by connecting and comparing the Wikipedia data with the data from other, more scholarly data repositories on the Web? The main objective of the doctoral thesis project hereby proposed, entitled PanAnthropon, is to extract, analyze, and visualize meaningful and interesting facts, relations, and connections involving the influential thinkers and ideas constituting the Western intellectual heritage, via the automatic processing of hyperlinks and semantic data from Wikipedia and complementary data sources. By doing so, I aim at creating a useful Web-based portal for students of humanities as well as the general public. The proposed project will build upon the preliminary results that I have achieved with 300 Wikipedia articles on major philosophers within a pilot project called WikiPhiloSofia (aka The WikiPhil Portal) (http://research.cis.drexel.edu:8080/sofia/WPS/) [3] [4] [5] [6] [7] [8], extending the scope of the latter in terms of data sources and domains of application.
The remainder of this paper is organized as follows: In Section 2, I describe the motivation behind the proposed research project, by illustrating the limitations of conventional information retrieval on Wikipedia. In Section 3, I describe my previous research concerning the WikiPhiloSofia project, in which I applied semantic information extraction and interactive information visualization to the domain of philosophy by using Wikipedia. In Section 4, I describe my proposed PanAnthropon project as a large-scale extension of the pilot project, laying out research questions, methodologies, and plans. In Section 5, I discuss related work. Finally, I conclude the paper in Section 6.
2. MOTIVATING EXAMPLE
In this section, I present a motivating example, which points to the inadequacy of conventional information retrieval in meeting users’ information seeking needs (on Wikipedia in particular and on the Web in general), and which thus motivates my proposal for a Web-based portal for semantics-based search and exploration.
Suppose a user wishes to learn about the philosophers directly or indirectly influenced by Immanuel Kant, a major figure in German idealism. The user can do a simple search on Wikipedia with a string “philosophers influenced by Kant”, as shown in Figure 1.
Figure 1. Search on philosophers influenced by Kant.
The results the user would get from such a simple keyword-based search are shown in Figures 2 and 3. As shown, the user will be told that no Wikipedia article page whose title exactly matches the search string exists, and will then be given a long list of articles whose titles and/or textual content contain strings that partially match the search string. Note that the article on Kant appears as the second one on the list of articles shown in Figure 3, following an article on Husserl, due to the fact that the relevance of an article to the user query is measured by the degree of (partial) string matching. (Incidentally, the Husserl page entry would appear on top even if the user did a search on a different philosopher, e.g., Plato, due to the fact that the article has a section title that contains the string “Philosophers Influenced by”.) The search results would be similar even if the user did an “advanced search” as shown in Figure 4.
Figure 2. Summary of search results on Kant.
Figure 3. List of articles shown as search results.
Figure 4. Advanced search on philosophers influenced by Kant.
The user could, of course, go to the Wikipedia article page on Kant, which contains an infobox that includes information on the influence relationships, as shown in Figure 5. The ‘influenced’ attribute, however, may have non-person and/or non-philosopher entities as values. Besides, the user could not explore indirect, extended influence relationships and networks involving Kant simply by looking at the list in the infobox.
Figure 5. Part of infobox on Kant.
3. PRELIMINARY RESEARCH
In this section, I describe my preliminary research involving a pilot project, entitled WikiPhiloSofia (aka The WikiPhil Portal) [3] [4] [5] [6] [7] [8], of which the proposed thesis project constitutes a large-scale extension.
The WikiPhiloSofia project was concerned with extracting, analyzing, and visualizing the facts, relations, and networks involving 300 major philosophers using Wikipedia. The objective was to create a Web-based learning/knowledge portal that enables semantics-based and visually-enhanced search and exploration on the philosophers, which goes beyond the paradigm of conventional keyword-based query and retrieval.
The research questions that I addressed in the WikiPhiloSofia project are as follows:
- What kinds of interesting structures emerge from the information extracted from Wikipedia article pages on philosophers?
- How well do the structures represent the intellectual/semantic connections among philosophers?
- How can the information be effectively presented so as to facilitate discovery and exploration of the facts, relations, and networks involving philosophers?
I implemented the project in Java. Below I first describe data extraction from Wikipedia. I then present the Web portal interface that I created to enable semantics-based search and exploration on the extracted data. Finally, I illustrate presenting the results of user queries via interactive information visualization.
3.1 Wikipedia Data Extraction
I used the English version of Wikipedia as the sole data source. Data extraction proceeded as follows:
- Step 1: I obtained a chronological list of 300 major philosophers (including influential theologians, writers, scientists, etc.) from Wikipedia’s “Timeline of Western Philosophers” page (http://en.wikipedia.org/wiki/Timeline_of_Western_philosophers).
- Step 2: I extracted data on the hyperlink connections and academic/biographical facts on the philosophers from their Wikipedia article pages, and stored the data in a MySQL database.
- Step 3: I retrieved information needed for visualization by querying the database, and stored the results as XML files marked up with GraphML (http://graphml.graphdrawing.org/) and TreeML (http://www.nomencurator.org/InfoVis2003/download/treeml.dtd).
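Step 3 can be illustrated with a minimal Python sketch of writing weighted link data as GraphML. The pilot project itself was implemented in Java; the function name `to_graphml`, the `weight` key, and the sample edges are assumptions introduced here for illustration only.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"


def to_graphml(edges):
    """Serialize (source, target, weight) tuples as a GraphML string."""
    ET.register_namespace("", GRAPHML_NS)
    root = ET.Element(f"{{{GRAPHML_NS}}}graphml")
    # Declare an integer edge attribute that holds the link count.
    ET.SubElement(root, f"{{{GRAPHML_NS}}}key", {
        "id": "weight", "for": "edge",
        "attr.name": "weight", "attr.type": "int",
    })
    graph = ET.SubElement(
        root, f"{{{GRAPHML_NS}}}graph", {"edgedefault": "directed"})
    # Emit each distinct endpoint once as a node.
    for name in sorted({n for s, t, _ in edges for n in (s, t)}):
        ET.SubElement(graph, f"{{{GRAPHML_NS}}}node", {"id": name})
    for s, t, w in edges:
        edge = ET.SubElement(
            graph, f"{{{GRAPHML_NS}}}edge", {"source": s, "target": t})
        data = ET.SubElement(edge, f"{{{GRAPHML_NS}}}data", {"key": "weight"})
        data.text = str(w)
    return ET.tostring(root, encoding="unicode")


# Illustrative sample data, not taken from the actual dataset.
sample_xml = to_graphml([("Kant", "Hegel", 3), ("Kant", "Newton", 4)])
```

A file produced this way should be loadable by any GraphML-aware toolkit, although the attribute names a given visualization tool expects may differ.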
Data extraction from Wikipedia was done by directly parsing the downloaded XHTML pages. The academic/biographical facts extracted during Step 2 were obtained from infoboxes/wikitables contained in the Wikipedia pages for some philosophers, which have semi-structured and semi-standardized formats consisting of attribute–value (or predicate–object) pairs. The extracted data were augmented by automatically inferred data in cases where relational information involving a pair of philosophers (e.g., influenced-by/influenced, teacher/student) appears on only one philosopher’s page. Table 1 summarizes the types of semantic and linkage information extracted concerning the philosophers.
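The inference of missing reciprocal relations can be sketched as follows. This is a minimal Python illustration under stated assumptions: the triple layout, the `INVERSE` mapping, and the function name are hypothetical, and only the influence pair is shown (a teacher/student pair could be added to the mapping in the same way).

```python
# Hypothetical mapping from a relation to its reciprocal relation.
INVERSE = {
    "influenced": "influenced by",
    "influenced by": "influenced",
}


def add_inverse_triples(triples):
    """Return the input triples plus any reciprocal triples that are missing.

    Each triple is (subject, predicate, object), read as an infobox attribute
    of the subject, e.g. ("Kant", "influenced", "Fichte").
    """
    known = set(triples)
    inferred = set()
    for subject, predicate, obj in known:
        inverse = INVERSE.get(predicate)
        if inverse and (obj, inverse, subject) not in known:
            inferred.add((obj, inverse, subject))
    return known | inferred


# Illustrative input: the relation is stated on one page only.
augmented = add_inverse_triples([("Kant", "influenced", "Fichte")])
```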
The data extracted from Wikipedia were stored in five MySQL tables. Table PHIL contains statistical data concerning each philosopher (page) as well as alternative names and page URLs. The other four tables, PHILINFO, LINKS, PHILLINKS, and CATEGORIES contain information on academic/biographical facts, hyperlink connections, and categories in the form of semantic triples, i.e., Subject–Predicate–Object. The database schema consists of ‘philID’ and ‘philName’, which correspond to Subject, ‘attribute’, which corresponds to Predicate, and ‘valueID’, ‘valueName’, and ‘valueURL’, which correspond to Object. Tables LINKS and PHILLINKS also contain a field to store hyperlink counts for the sake of measuring link connection strength.
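As a rough sketch of the triple-style layout described above, the following Python snippet creates a table resembling LINKS and queries it. It uses an in-memory SQLite database as a stand-in for MySQL; the column types, the `linkCount` column name, and the sample row are assumptions rather than the actual schema.

```python
import sqlite3


def create_links_table(conn):
    """Create a LINKS-like table laid out as Subject, Predicate, Object."""
    conn.execute("""
        CREATE TABLE LINKS (
            philID    INTEGER,  -- Subject
            philName  TEXT,     -- Subject
            attribute TEXT,     -- Predicate
            valueID   INTEGER,  -- Object
            valueName TEXT,     -- Object
            valueURL  TEXT,     -- Object
            linkCount INTEGER   -- hyperlink count (hypothetical column name)
        )""")


conn = sqlite3.connect(":memory:")
create_links_table(conn)
# Illustrative row; the IDs and the attribute label are made up.
conn.execute(
    "INSERT INTO LINKS VALUES (?, ?, ?, ?, ?, ?, ?)",
    (1, "Immanuel Kant", "out-link", 2, "Isaac Newton",
     "http://en.wikipedia.org/wiki/Isaac_Newton", 4),
)
# Retrieve a philosopher's out-links, strongest first.
rows = conn.execute(
    "SELECT philName, valueName, linkCount FROM LINKS "
    "WHERE philName = ? ORDER BY linkCount DESC",
    ("Immanuel Kant",),
).fetchall()
```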
Table 2 shows the basic statistics concerning the 300 philosopher dataset obtained through the procedure described above. As shown, while there exists a high number of hyperlink connections among the 300 philosopher pages, there are a few pages that do not contain any out-links to the other philosopher pages and/or do not receive any in-links from them. Also, only 192 philosopher pages contain infoboxes that summarize academic/biographical facts, which is (at least in part) why there are some philosophers that are shown to have no influenced/influenced-by relations.
Table 1. Types of information extracted
Table 2. Basic statistics on the 300 philosopher dataset
I also examined the distributional characteristics and the network characteristics involving the dataset, and partially compared the dataset against a subset of Thomson Reuters’ Arts & Humanities Citation Index (AHCI) (http://www.thomsonreuters.com/products_services/scientific/Arts_Humanities_Citation_Index) dataset that contains 1.26 million records covering the period of 1988-1997. However, here I omit detailed discussion of the characteristics of the dataset obtained.
3.2 Web Portal Interface
I created a prototype Web portal interface where the user can issue queries on the facts, relations, and networks involving the 300 philosophers and explore the results, which are displayed using diverse modalities of interactive information visualization, as will be illustrated in the next subsection. Figure 6 shows the homepage of the WikiPhiloSofia portal (http://research.cis.drexel.edu:8080/sofia/WPS/).
Figure 6. Homepage of the WikiPhiloSofia portal.
Upon entering the portal, the user is presented with a main menu shown in Figure 7, where the user can select the focus of search and exploration. In case the user chooses to focus on one philosopher, for example, the user is then directed to the query menu shown in Figure 8, where the user can select a philosopher to focus on and a facet to explore concerning the chosen philosopher. Suppose the user selects a philosopher, e.g., Kant, and selects “Academic/Biographical Facts” as the facet for exploration. The user is then directed to the query result visualization menu shown in Figure 9, where the user can explore academic/biographical facts concerning the selected philosopher using a radial graph view, a graph view, or a tree view, by clicking on an appropriate link.
Figure 7. Main menu of user options.
Figure 8. Menu for query on one philosopher.
Figure 9. Result page for query on one philosopher’s academic/biographical facts.
Table 3 summarizes various query options and result display modalities.
Table 3. Options for semantics-based and visually-enhanced search and exploration
3.3 Visualization of Facts, Relations, & Networks
Here I illustrate how the results of various user queries on the philosophers are presented via interactive visualization, implemented using the Prefuse information visualization toolkit (http://prefuse.org/). Since I used a query on Kant as a motivating example showing the disadvantages of keyword-based search on Wikipedia, I will again use Kant as the focus (or one of the foci) of each query, for comparison purposes.
3.3.1 Focus on One Philosopher
Figure 10 presents a tree view representing academic/biographical facts on Kant, which, at the moment, shows notable ideas that Kant is known for. Figure 11 shows a colored graph view representing outgoing hyperlink connections from Kant. The node colors are differentiated according to the link strength (i.e., link count) so that the user can easily identify philosophers that Kant is more strongly connected to via out-links. Figure 12 shows a tree view representing 6-degree influences originating from Kant, which allows the user to explore many alternative influence paths involving Kant beyond that of direct (1st-degree) connection.
Figure 10. Academic/biographical facts on Kant.
Figure 11. Out-link connections from Kant.
Figure 12. Extended 6-degree influences originating from Kant.
3.3.2 Focus on Two Philosophers
Figure 13 shows direct relations between Kant and Hegel. Figure 14 shows philosophers who were influenced by Kant and/or Heidegger, including Sartre, who was influenced by both. Figure 15 shows commonalities between Kant and Nietzsche, in particular, the fact that both philosophers were influenced by Spinoza, with whom they both have bi-directional hyperlink connections.
Figure 13. Relations between Kant and Hegel.
Figure 14. Influences by Kant and/or Heidegger.
Figure 15. Commonalities between Kant and Nietzsche.
3.3.3 Focus on All Philosophers
The method that I use to effectively visualize a network, in particular, in order to highlight the most significant nodes and their interconnections, is a graph simplification method that I have developed, called the Strongest Link Paths (SLP) [6]. Even though SLP is simpler than other well-known graph scaling methods such as pathfinder network [9] or main path analysis [10], I have found that, by applying SLP, I can not only achieve substantial data reduction but also obtain a meaningful representation of the dominant figures and their connections within the network of philosophers (as well as within different social networks).
Here I use two variations of SLP. For strongest hyperlink/influence networks, I use the straightforward version of SLP by selecting, for each node, only the link with the highest hyperlink/influence count. In the case of networks emerging from extended hyperlink/influence relations, I add all 1st-degree links and then, for each subsequent degree, I add links only to nodes not already covered, thereby eliminating overlapping edges.
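The two variations can be sketched in Python as follows. This is one possible reading of the description above, not the implementation from [6]: the function names, the input layouts, and the alphabetical tie-breaking rule are assumptions.

```python
def strongest_links(weights):
    """Variation 1: keep, for each source node, only its strongest out-link.

    `weights` maps (source, target) to a hyperlink/influence count. Ties go
    to the alphabetically first target (an assumption made for determinism).
    """
    best = {}
    for (source, target), count in sorted(weights.items()):
        if source not in best or count > best[source][1]:
            best[source] = (target, count)
    return {(source, target) for source, (target, _) in best.items()}


def non_overlapping_paths(adjacency, root, max_degree):
    """Variation 2: expand degree by degree from `root`, adding a link only
    if its target node has not already been covered."""
    covered = {root}
    edges = []
    frontier = [root]
    for _ in range(max_degree):
        next_frontier = []
        for node in frontier:
            for neighbor in adjacency.get(node, []):
                if neighbor not in covered:
                    covered.add(neighbor)
                    edges.append((node, neighbor))
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return edges


# Generic node labels; the counts and links are illustrative only.
kept = strongest_links({("A", "B"): 2, ("A", "C"): 5, ("B", "A"): 1})
tree = non_overlapping_paths(
    {"K": ["F", "H"], "F": ["H", "S"], "H": ["M"]}, "K", 3)
```

In the second call, the link from F to H is dropped because H was already reached directly from K, so each node appears exactly once in the resulting structure.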
The graph that results from applying SLP by selecting only the links with the highest hyperlink/influence counts consists of distinct clusters clearly separated from one another. Figure 16 shows a close-up of one of the largest clusters in the strongest bi-link network, which centers on Hegel with Marx and Kant. The graph that results from applying SLP by eliminating overlapping edges consists of one large cluster, which in turn consists of subclusters. Figure 17 presents a radial graph representing non-overlapping extended influences originating from Kant, which shows that the philosophers directly or indirectly influenced by Kant, alternatively represented via a tree view in Figure 12, can in fact be reached within 3 degrees of separation.
Figure 16. Bi-link cluster with Hegel—Marx—Kant.
Figure 17. Extended influences from Kant.
4. PROPOSED PROJECT
In this section, I describe the proposed PanAnthropon project as an extension of the pilot study project described in the previous section. First, I discuss the implications and limitations of the WikiPhiloSofia project. Next, based on the assessment, I present the research questions, methodologies, and plans for the proposed project. Finally, I conclude this section with a discussion of the significance and expected contributions of the proposed thesis project.
4.1 Implications/Limitations of Pilot Study
Here I discuss some of the implications and limitations of the WikiPhiloSofia project, which naturally suggest the directions for the proposed research.
Implications of the results of the pilot study project:
While the WikiPhiloSofia project concerned one particular domain, namely, philosophy, and involved a small number of entities (300 philosophers), I nevertheless obtained meaningful and interesting results as far as the dataset was concerned. For example, by applying the Strongest Link Paths (SLP) method for graph simplification, I generated philosopher network graphs consisting of distinct clusters representing coherent intellectual themes and connections while highlighting the most significant philosophers in the given networks (as exemplified in Figure 16).
Overall, the results obtained so far imply that even the simple hyperlink structure, as an embodiment of a huge amount of latent human annotation [11], can reflect meaningful intellectual connections among philosophers and that the structured templates available in Wikipedia (i.e., infoboxes/wikitables), as a prime example of emergent structural standardization taking place within the collaboratively-constructed corpus, can serve as a valuable source of semantic knowledge concerning philosophers beyond the semantically-neutral hyperlink data. In addition, my analysis of the characteristics of the 300 philosopher dataset as obtained from Wikipedia, and partial comparison against a subset of Thomson Reuters AHCI dataset, have revealed interesting insights and parallels.
By extracting hyperlinks and semantic data from Wikipedia and by presenting the extracted information via interactive visualization, the WikiPhiloSofia project has shown promise for creating a useful portal that enables semantics-based and visually-enhanced discovery and exploration, which goes beyond the conventional keyword-based search and retrieval, for intellectual/cultural heritage resources. The merits and potentials of the WikiPhiloSofia project, as such, have been duly recognized across diverse research communities involving Digital Humanities, Digital Libraries, Information Retrieval, Information Visualization, Hypertext/Hypermedia, and Wikis/Wikipedia.
Limitations of the pilot project concern the following aspects:
- Domains of application: As mentioned above, the pilot project was limited to one domain of choice. As such, the question remains as to whether or not similar results can be replicated with respect to another domain and/or across domains.
- Data sources: The pilot project used Wikipedia as the single data source, without exploring ways to connect and complement the data obtained from Wikipedia with those obtained from other, more scholarly resources.
- Data extraction/storage methods: In the pilot project, data were extracted by directly parsing the XHTML pages downloaded from Wikipedia. The project did not examine other ways of extracting data via intermediate representations, e.g., Wikipedia SQL/XML dumps, the Wikipedia XML corpus [12], or the DBpedia knowledge base (http://dbpedia.org/). Relatedly, the pilot project, while aiming to enable semantics-based search and exploration, did not consider or utilize the data storage mechanisms (i.e., triple stores) and/or knowledge representation formalisms (i.e., RDF (http://www.w3.org/RDF/) or OWL (http://www.w3.org/TR/owl-features/)) that have become widely used standards within the Semantic Web community.
4.2 Research Questions, Methodologies, Plans
Based on the analysis of the implications and limitations of the WikiPhiloSofia project, here I present research questions, proposed methodologies, and tentative plans for the proposed PanAnthropon project.
4.2.1 Objectives and Research Questions
The main objective of the PanAnthropon project is to create a Web-based knowledge portal that allows humanities students and scholars to explore the landscape of the Western intellectual/cultural heritage from diverse perspectives and using different granularities of focus, e.g., by focusing on thinkers or by focusing on ideas, by focusing on a particular thinker or by focusing on the networks of thinkers, by tracing the evolution of ideas vertically or by exploring the connections among ideas horizontally, etc.
The main research questions that I will address in the project are as follows:
- Q1: Domains of application: Can the overall approach adopted in the pilot project be extended to other domains besides philosophy?
- Q2: Data sources: What other data sources besides Wikipedia can be incorporated in order to provide more value-added data service?
- Q3: Data extraction/storage methods: How can relevant semantic data from heterogeneous datasets be efficiently extracted, stored, and connected in order to enable uniform, federated semantic search and exploration?
4.2.2 Data Sources and Methodologies
4.2.2.1 Data Sources
The main data source for the proposed project will be the English version of Wikipedia, as in the pilot project. In the proposed project, however, I will extend the scope of coverage, or domain of application, beyond that of philosophers to include other influential thinkers (theologians, scholars, writers, artists, scientists, etc.) and ideas (concepts, theories, doctrines, ideologies, etc.) that constitute the Western intellectual/cultural heritage. Also, I will exploit the organizational structure and textual content of Wikipedia articles, besides hyperlinks and infoboxes/wikitables, in order to extract appropriate contextual information concerning each data piece, as I will explain in more detail when I discuss data extraction methods below.
As a complementary data source concerning the philosophy domain in particular, I will use a prominent expert-created philosophical resource on the Web, namely, the Stanford Encyclopedia of Philosophy (SEP) (http://plato.stanford.edu/), which consists of articles on philosophers and philosophical concepts that are written, maintained, and kept up to date by an expert or group of experts in the field.
4.2.2.2 Implementation Methodologies
4.2.2.2.1 Data Extraction
One of the limitations of the pilot project was that it extracted the data from scratch by parsing XHTML pages, without exploring ways to utilize intermediate representations of the Wikipedia data. To address this limitation, I have considered the feasibility/desirability of utilizing the aforementioned Wikipedia XML/SQL dumps and/or the DBpedia knowledge base. The latter currently consists of approximately 274 million RDF triples (or “facts”) describing 2.6 million “things” (including 213,000 persons), which have been extracted from multilingual versions of Wikipedia, as well as 609,000 links to images, 3,150,000 links to external Web pages, 4,878,100 external links to other RDF datasets, 415,000 Wikipedia categories, and 75,000 YAGO (http://www.mpi-inf.mpg.de/yago-naga/yago/) [13] categories.
However, it has come to my attention that the data contained in the DBpedia knowledge base include obviously incorrect and/or inconsistent information concerning even the most significant intellectual figures. For example, according to the DBpedia page on Plato (http://dbpedia.org/page/Plato), partially shown in Figure 18, the values corresponding to the property (attribute) “is dbpedia-owl:influenced of” mostly consist of those who were influenced by Plato, such as Aristotle, Schopenhauer, Augustine, etc., although they also include those who influenced Plato (as should be the case), such as Heraclitus, Parmenides, and Protagoras. In contrast, the values corresponding to the property “is dbpprop:influenced of” mostly consist of those who influenced Plato (as should be the case), with the exception of Gustavo Breno, as shown in Figure 19. On the other hand, the properties “is dbpedia-owl:influences of” and “is dbpprop:influences of” both have as their values those who were influenced by Plato (as should be the case), but the two lists of values are not identical and neither of them completely contains the other one.
Figure 18. Part of DBpedia page on Plato (1).
Figure 19. Part of DBpedia page on Plato (2).
Inaccuracies such as the one involving the properties “is dbpedia-owl:influenced of” and “is dbpprop:influenced of” may have been caused in part by the fact that some infoboxes in Wikipedia articles contain ambiguous attributes. For example, the infobox shown in Figure 20, from the Plato article page in Wikipedia, contains an attribute called “Influences and Followers”, which includes as its values both those who influenced Plato and those who were influenced by Plato. However, the main biography infobox on the same Wikipedia page, which I used in my pilot project to extract academic/biographical facts on Plato, unambiguously distinguishes between influences to Plato (with the attribute “influenced by”, corresponding to “is dbpedia-owl:influenced of” and “is dbpprop:influenced of”) and influences from Plato (with “influenced”, corresponding to “is dbpedia-owl:influences of” and “is dbpprop:influences of”), as shown in Figure 21.
Figure 20. Part of infobox on Plato.
Figure 21. Part of main infobox on Plato.
Besides the aforementioned inaccuracy/inconsistency of the data in DBpedia, there is another reason why I have found both DBpedia and the Wikipedia XML/SQL dumps less appropriate for the purposes of my proposed project: I intend to extract useful contextual information from each Wikipedia article page, together with hyperlinks. I have therefore decided to extract data from Wikipedia as well as from SEP by directly parsing XHTML pages.
The data extraction process will consist of three phases, as illustrated in Figure 22.
Figure 22. Data extraction process.
Phase 1: Construction of Thinker/Idea Lists
This phase will consist of the following steps:
- Step 1: Create a seed list of thinkers by consulting various lists of people on Wikipedia.
- Step 2: Extract hyperlinks, which point to thinkers or ideas, excluding other types of hyperlinks, from the Wikipedia article pages concerning the thinkers identified in Step 1.
- Step 3: Iteratively construct/expand a list of thinkers and a list of ideas by adding the thinkers/ideas corresponding to the hyperlinks extracted in Step 2 (and by extracting hyperlinks from the Wikipedia article pages corresponding to the latter).
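The iterative expansion in Steps 1 to 3 can be sketched in Python as a round-by-round traversal. The callables `get_links` and `classify`, as well as the `max_rounds` stopping rule, are hypothetical placeholders: the text above does not specify how hyperlink targets are typed or when the expansion stops.

```python
def expand_lists(seed, get_links, classify, max_rounds=3):
    """Grow a thinker list and an idea list from a seed list of thinkers.

    get_links(page) returns the hyperlink targets found on `page`;
    classify(target) returns "thinker", "idea", or anything else
    (other link types are excluded).
    """
    thinkers, ideas = set(seed), set()
    frontier = list(seed)
    for _ in range(max_rounds):
        next_frontier = []
        for page in frontier:
            for target in get_links(page):
                kind = classify(target)
                if kind == "thinker" and target not in thinkers:
                    thinkers.add(target)
                    next_frontier.append(target)
                elif kind == "idea" and target not in ideas:
                    ideas.add(target)
                    next_frontier.append(target)
        if not next_frontier:  # nothing new was discovered in this round
            break
        frontier = next_frontier
    return thinkers, ideas


# Toy link graph and type table with generic labels, for illustration only.
links = {"A": ["B", "x", "P"], "B": ["A", "y"], "x": ["C"]}
kinds = {"A": "thinker", "B": "thinker", "C": "thinker",
         "x": "idea", "y": "idea", "P": "place"}
found_thinkers, found_ideas = expand_lists(
    ["A"], lambda page: links.get(page, []), kinds.get)
```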
The Wikipedia article pages will be locally stored for the purpose of data extraction and processing.
Phase 2: Data Extraction from Wikipedia
From each Wikipedia article page stored from the previous phase, I will extract hyperlinks including those contained in infoboxes/wikitables, as in the pilot project. Unlike in the pilot project, however, I will also extract information concerning the context surrounding each hyperlink. Such additional pieces of information should greatly facilitate search and exploration of relevant information concerning a given subject.
To illustrate the enhancement that will be brought about by including the contextual information, let us again consider Figure 11, which shows those philosophers to whom Kant is connected via outgoing hyperlinks. One can see, for example, that the Kant page contains 4 repeated hyperlinks directed to the page on Newton. One cannot see, however, the multiple, hence possibly distinct, contexts of discussion surrounding those multiple hyperlink connections between Kant and Newton. It is easy to imagine how such background information could be useful for users.
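A minimal Python sketch of recording context for each hyperlink follows. It assumes that the context consists of the nearest preceding section heading and the enclosing sentence, that headings are `h2`/`h3` elements, and that body text sits in `p` elements; the class and function names and the sample markup are hypothetical, and the sentence splitting is deliberately naive.

```python
import re
from html.parser import HTMLParser


class LinkContextParser(HTMLParser):
    """Collect (href, section title, enclosing sentence) for paragraph links."""

    def __init__(self):
        super().__init__()
        self.section = ""        # text of the most recent heading
        self.in_heading = False
        self.in_para = False
        self.para_text = []      # text chunks of the current paragraph
        self.para_links = []     # (href, anchor text) in the current paragraph
        self.href = None
        self.anchor = []
        self.records = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h2", "h3"):
            self.in_heading, self.section = True, ""
        elif tag == "p":
            self.in_para, self.para_text, self.para_links = True, [], []
        elif tag == "a" and self.in_para:
            self.href, self.anchor = dict(attrs).get("href"), []

    def handle_data(self, data):
        if self.in_heading:
            self.section += data
        elif self.in_para:
            self.para_text.append(data)
            if self.href is not None:
                self.anchor.append(data)

    def handle_endtag(self, tag):
        if tag in ("h2", "h3"):
            self.in_heading = False
        elif tag == "a" and self.href is not None:
            self.para_links.append((self.href, "".join(self.anchor)))
            self.href = None
        elif tag == "p" and self.in_para:
            text = "".join(self.para_text).strip()
            sentences = re.split(r"(?<=[.!?])\s+", text)
            for href, anchor_text in self.para_links:
                # Simplification: take the first sentence with the anchor text.
                sentence = next((s for s in sentences if anchor_text in s), "")
                self.records.append((href, self.section.strip(), sentence))
            self.in_para = False


def extract_link_contexts(xhtml):
    parser = LinkContextParser()
    parser.feed(xhtml)
    return parser.records


# Made-up sample markup with two links to the same page in different contexts.
sample = ('<h2>Life</h2><p>Kant read <a href="/wiki/Isaac_Newton">Newton</a> '
          'early. He taught for many years.</p>'
          '<h2>Work</h2><p>Later he cited '
          '<a href="/wiki/Isaac_Newton">Newton</a> again.</p>')
contexts = extract_link_contexts(sample)
```

With such records, repeated hyperlinks to the same target remain distinguishable by the section and sentence in which each one occurs.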
Table 4 shows the types of contextual information to be extracted or assigned for each hyperlink. As shown, the object corresponding to each hyperlink will be assigned an entity type, e.g., Thinker, Idea, Organization, Place, Time, Event, Work, etc., which will later be used in marking up each Wikipedia article page during the stage of data processing.
Table 4. Types of contextual information
In addition to hyperlinks and contextual information, I will also extract images, together with captions, in order to incorporate image search functionality in the Web portal.
The data extracted during this phase will be stored in relational tables in a MySQL database, as in the pilot project. However, the stored data can readily be converted to RDF triples, as I will explain later when I discuss the data processing stage.
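One possible shape for such a relational table is sketched below. It is an assumption made for illustration only: the column names are hypothetical, the fields are limited to those mentioned in the text (entity type, data source, section title, enclosing sentence), the row values are placeholders, and Python's built-in `sqlite3` module stands in for MySQL so that the sketch is self-contained.

```python
import sqlite3

# Hypothetical schema: one row per hyperlink occurrence, with its context.
SCHEMA = """
CREATE TABLE hyperlink (
    id          INTEGER PRIMARY KEY,
    source_page TEXT NOT NULL,   -- page containing the link
    target_page TEXT NOT NULL,   -- page the link points to
    target_type TEXT,            -- entity type: Thinker, Idea, Place, ...
    data_source TEXT NOT NULL,   -- e.g. 'Wikipedia' or 'SEP'
    section     TEXT,            -- enclosing section title
    sentence    TEXT             -- enclosing sentence
);
"""

conn = sqlite3.connect(":memory:")  # in-memory stand-in for the MySQL database
conn.executescript(SCHEMA)

# Placeholder rows: two occurrences of the same link in different contexts.
rows = [
    ("Kant", "Newton", "Thinker", "Wikipedia", "Section A", "Placeholder sentence 1."),
    ("Kant", "Newton", "Thinker", "Wikipedia", "Section B", "Placeholder sentence 2."),
]
conn.executemany(
    "INSERT INTO hyperlink "
    "(source_page, target_page, target_type, data_source, section, sentence) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    rows,
)

# Retrieve the distinct contexts behind the repeated Kant -> Newton link.
contexts = conn.execute(
    "SELECT section FROM hyperlink "
    "WHERE source_page = ? AND target_page = ? ORDER BY id",
    ("Kant", "Newton"),
).fetchall()
```

Storing one row per occurrence, rather than one row per unique link, is what makes the multiple contexts of a repeated hyperlink recoverable.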
Phase 3: Data Extraction from SEP
From SEP article pages corresponding to the thinkers and ideas identified in Phase 1, I will extract hyperlinks including those directed to related entries inside SEP as well as those pointing to other internet resources outside SEP. Given that the main content of SEP articles usually contains very few hyperlinks, I will exploit the hyperlinks extracted from Wikipedia in extracting additional information from SEP in order to more closely connect relevant pieces of information from both data sources. Specifically, given a Wikipedia article page concerning a thinker or an idea, and given the list of (unique) hyperlinks originating from the page, I will extract contextual information (namely, section/subsection titles and the enclosing sentence) from a corresponding SEP article page, for each hyperlink, supposing that the object (i.e., a thinker or an idea) corresponding to the hyperlink is mentioned in the SEP article. The data extracted from SEP will also be stored in the MySQL database, with appropriate fields to record the data source and context of information.
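The lookup of contextual information for a given entity name can be sketched as follows. This is a rough illustration under stated assumptions: the article is assumed to be already split into (section title, text) pairs, sentence splitting is a naive regular expression, matching is by exact name only, and the function name `find_contexts` and the sample text are hypothetical.

```python
import re


def find_contexts(sections, name):
    """Return (section title, enclosing sentence) pairs that mention `name`.

    sections: list of (title, text) pairs for one article (assumed input shape).
    """
    pattern = re.compile(r"\b" + re.escape(name) + r"\b")
    hits = []
    for title, text in sections:
        # Naive split: ., ! or ? followed by whitespace ends a sentence.
        for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
            if pattern.search(sentence):
                hits.append((title, sentence))
    return hits


# Placeholder article content, not taken from any real SEP entry.
sample_sections = [
    ("1. Section one", "Placeholder sentence mentioning Newton. Another sentence."),
    ("2. Section two", "No relevant name appears here."),
]
newton_hits = find_contexts(sample_sections, "Newton")
```

A production version would need to handle name variants (e.g., surname only, or alternative spellings) and abbreviations that defeat the naive sentence splitter.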
4.2.2.2.2 Data Processing
During this stage, the hyperlinks inside the locally stored Wikipedia article pages will be annotated with RDFa (http://www.w3.org/TR/xhtml-rdfa-primer/) markup specifying the entity types of the objects of reference. An appropriate visualization tool, or possibly a Web browser plug-in, can then be created during the next stage of data presentation so that only those hyperlinks corresponding to a certain entity type are highlighted at a given time according to user selection. This will allow the users to quickly focus on selected hyperlinks while viewing the enhanced Wikipedia page.
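The annotation step can be illustrated with a small sketch that adds the RDFa `typeof` attribute to each hyperlink whose target has a known entity type. The vocabulary prefix `pa`, the function name, and the sample markup are assumptions made for illustration; the regular expression only handles anchors whose first attribute is `href`, so a real implementation would more likely use an HTML parser.

```python
import re


def annotate_links(html, entity_types, prefix="pa"):
    """Add an RDFa `typeof` attribute to links whose target has a known type.

    entity_types maps an href value to an entity type name, e.g. "Thinker".
    `prefix` is a hypothetical vocabulary prefix.
    """
    def add_type(match):
        href = match.group(1)
        etype = entity_types.get(href)
        if etype is None:
            return match.group(0)  # leave untyped links unchanged
        return f'<a href="{href}" typeof="{prefix}:{etype}"'

    return re.sub(r'<a href="([^"]*)"', add_type, html)


sample_html = (
    '<p><a href="/wiki/Hegel">Hegel</a> taught in '
    '<a href="/wiki/Berlin">Berlin</a>; see also '
    '<a href="/wiki/Other">other</a>.</p>'
)
sample_types = {"/wiki/Hegel": "Thinker", "/wiki/Berlin": "Place"}
annotated = annotate_links(sample_html, sample_types)
```

With markup of this kind, a stylesheet or plug-in could select links by entity type, for instance through a CSS attribute selector such as `a[typeof="pa:Thinker"]`.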
An additional data processing task that may be carried out during this stage is concerned with converting the data stored in the MySQL database to RDF triples (and subsequently storing the RDF triples in a triple store) so that they can be linked with other data resources on the Semantic Web. The conversion can be done via reification of (the main facet of) a given fact, adopting an approach similar to the one used in the YAGO [13] project. For example, suppose the database contains the following piece of information concerning a particular hyperlink from the Kant page to the Hegel page:
Figure 23. Database entry on a hyperlink from Kant to Hegel.
An RDF triple can be created in the (pseudo) form of <Kant>–<Out-Linked To>–<Hegel>, which will be assigned a unique ID, e.g., 100. Additional triples can be created to represent the context information, e.g., <Fact-100>–<Source>–<Wikipedia>, <Fact-100>–<Section>–<Abstract>, etc. (The triple <Hegel>–<Type>–<Thinker>, identifying Hegel as one belonging to the list of thinkers, would have already been created at the start of the conversion process, prior to the processing of individual hyperlinks.)
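The reification just described can be sketched in a few lines. This is an illustrative sketch rather than the conversion procedure itself: triples are represented as plain Python tuples instead of RDF terms, and the function name and argument layout are assumptions; the example values are the ones used in the text above.

```python
def reify_hyperlink(fact_id, subject, obj, context):
    """Turn one hyperlink record into a base triple plus context triples.

    Returns (fact node, base triple, list of context triples), where the
    context triples take the fact node as their subject.
    """
    fact_node = f"Fact-{fact_id}"
    base = (subject, "Out-Linked To", obj)
    context_triples = [(fact_node, key, value) for key, value in context.items()]
    return fact_node, base, context_triples


fact_node, base, context_triples = reify_hyperlink(
    100, "Kant", "Hegel", {"Source": "Wikipedia", "Section": "Abstract"}
)
```

Because the context triples hang off the fact identifier rather than off Kant or Hegel, several hyperlinks between the same pair of pages can each carry their own section and sentence.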
4.2.2.2.3 Data Presentation
Once the data are extracted and properly processed, I will build the Web portal where the users can explore the data using various analysis/visualization tools. Preliminary conceptual design tasks that will be carried out for the construction of the Web portal include:
- Define/refine various perspectives, foci, and facets for exploring the data.
- Develop/refine various modalities (both textual and graphical) for presenting the data.
- Consider incorporating (keyword-based) search capabilities for additional data sources.
Implementation of the Web portal will then proceed according to the design decisions.
4.2.2.3 Evaluation Methods
Once the Web portal is created and ready to be publicly deployed, I will engage potential users in order to validate the meaningfulness of the data generated and to evaluate the usefulness of the Web portal. For this purpose, I will seek expert opinions on some of the representative results obtained concerning the facts, relations, and networks involving influential thinkers and ideas and/or have the system used and evaluated by instructors/students of relevant college courses.
4.2.3 Plans and Timelines
Upon the successful completion of my thesis proposal, I will embark upon implementing the system. I will conduct interviews, surveys, or experiments to evaluate the system as it becomes ready to be deployed as a publicly-accessible resource. Finally, I will conclude the project with the final defense of the dissertation. The estimated timeframe for completion is around the end of 2010.
4.3 Significances and Expected Contributions
4.3.1 Technical Significance and Contributions
The proposed project is a point of convergence between the Social Web (Web 2.0) and the Semantic Web, toward the emerging Web 3.0, and between information extraction/retrieval and information visualization, toward the next generation of online information systems that incorporate information aesthetics and edutainment. The project will tackle the task of deriving explicit/implicit semantics and making semantic connections across domain areas and across heterogeneous data sources. The project will demonstrate the utility of semantics-based search and retrieval for generating more focused and useful results to better satisfy users’ information seeking needs. The project will also explore innovative modalities of interactive visualization to effectively and aesthetically represent the evolving content.
4.3.2 Intellectual Significance and Contributions
The project contributes to digital humanities research by applying large-scale data extraction and visualization to intellectual/cultural heritage materials. Its intellectual significance lies in generating and visualizing the emergent landscape of the Western intellectual/cultural heritage as rich semantic networks involving influential thinkers and their ideas. The big-picture views that survey this landscape and connect disparate elements, made possible by the large-scale datasets, will allow humanities students and scholars to discover and explore connections across timelines and across disciplinary/subdisciplinary boundaries, thereby gaining insights that would not otherwise be possible.
4.3.3 Practical Significance and Contributions
The outcome of the proposed project, namely, the Web-based learning/knowledge portal, will be made freely accessible to all interested users. The practical significance of the project lies in the utility of the Web portal, which will serve as a resource for humanities students and scholars as well as the general public, for example, as a pedagogical resource for instructors teaching (undergraduate) courses on intellectual heritage, philosophy, general liberal arts, etc. Such courses can typically cover only a very limited number of intellectual/cultural figures and their works, without sufficient time for surveying the milieu in which those figures and works are located. The Web portal can serve as a useful aid, providing much-needed background information.
Consider, for example, a course on Western intellectual heritage offered by the University of Maryland (http://www.honors.umd.edu/HONR218C/). According to the syllabus, students are required to read, among other texts, The Prince by Niccolò Machiavelli. By utilizing the Web portal, students will easily be able to explore the connections between Machiavelli and other influential thinkers – e.g., the fact that Machiavelli was influenced by Plato, also on the reading list – and to gain insights into the political/cultural milieu of the times – e.g., by learning about the relation between Machiavelli and the influential Medici family, to whom the book is dedicated. The students will also be able to search for images on Machiavelli or read an expert-written synopsis, all with a single click, with the Web portal serving as an integrated tool for learning.
5. RELATED WORK
Corresponding to the rapid, exponential growth of the size of Wikipedia, recent years have witnessed an increasing number of computer and information science researchers working on various projects concerned with Wikipedia. In particular, researchers working in the fields of natural language processing (NLP), text mining, information extraction, question answering, etc. have explored various ways to exploit the vast amount of lexical, semantic, and encyclopedic knowledge contained in Wikipedia. In addition, some Semantic Web researchers have turned to Wikipedia for clues to resolving what has been considered the main roadblock on the way toward the realization of the Semantic Web vision, namely, the knowledge acquisition bottleneck due to the scarcity of structured data available on the Web. The reason behind Wikipedia’s fast becoming a prominent semantic knowledge source, besides the huge size of its user-generated content on a wide range of topics, consists in its unique set of structural features, including a hierarchical category structure, a dense hyperlink structure, and structured templates such as infoboxes and wikitables.
Extracting information from Wikipedia: Table 5 summarily shows some of the prominent approaches toward utilizing Wikipedia as a knowledge source for extracting various kinds of lexical/semantic information.
Table 5. Approaches toward utilizing Wikipedia as a lexical/semantic knowledge resource
Theme | Researcher | Task/Technique/System |
Extraction of Semantic Data from Wikipedia | Chernov et al. [14] | Measures for computing the strength of semantic relationships between Wikipedia categories. |
Extraction of Semantic Data from Wikipedia | Auer and Lehmann [15] | Extraction of structured semantic triples from Wikipedia templates. |
Extraction of Semantic Data from Wikipedia | Wu and Weld [16] | Automatic infobox and link generation, based on a machine-learning approach (KYLIN). |
Using Wikipedia for NLP Tasks | Strube and Ponzetto [17] [18] | Comparison of Wikipedia and WordNet as a resource for computing term relatedness (WikiRelate!). |
Using Wikipedia for NLP Tasks | Milne [19] | Technique for computing term relatedness exploiting the hyperlink structure of Wikipedia (WLVM). |
Using Wikipedia for NLP Tasks | Ruiz-Casado et al. [20] | Association of Wikipedia concept definition entries with WordNet synsets. |
Using Wikipedia for NLP Tasks | Ponzetto and Strube [21] [18] | Use of features derived from WordNet and Wikipedia for coreference resolution. |
Using Wikipedia for Topic Identification | Schönhofen [22] | Use of titles and categories of Wikipedia articles for document topic identification. |
Using Wikipedia for Topic Identification | Yang et al. [23] | Tool for assisting Topic Maps design by extracting a collection of topics from the TOC and text of Wikipedia articles. |
Taxonomy/Ontology Extraction from Wikipedia | Ponzetto and Strube [18] [24] | Automatic extraction of is-a taxonomy from Wikipedia categories. |
Taxonomy/Ontology Extraction from Wikipedia | Suchanek et al. [13] | Automatic extraction of ontology consisting of entities and facts extracted from Wikipedia and WordNet (YAGO). |
Semantic Query and Retrieval using Wikipedia Data | Milne et al. [25] [26] [27] | Search engine (Koru) based on corpus-specific thesauri constructed from Wikipedia (WikiSauri). |
Semantic Query and Retrieval using Wikipedia Data | Auer and Lehmann [15] | OntoWiki and a graph pattern builder for facet-based browsing and querying of the semantic triples extracted from Wikipedia. |
Semantic Query and Retrieval using Wikipedia Data | Suchanek et al. [13] | SPARQL-like query interface for querying YAGO ontology extracted from Wikipedia and WordNet. |
Most of the approaches summarized above (except those concerned with NLP tasks) are concerned with structural mining as opposed to content mining, i.e., with exploiting structural features in extracting information from Wikipedia rather than using its textual content, to employ the distinction made by Zesch et al. [28]. For example, Milne et al. [26] [27] extracted corpus-specific thesauri from Wikipedia by exploiting its hyperlinks, categories, and redirects, and built a lexical knowledge-based search engine called Koru. Ponzetto and Strube [18] [24] derived a taxonomy consisting of is-a relations by treating Wikipedia’s category structure as a conceptual network. Auer and Lehmann [15] derived more than 8 million RDF statements from Wikipedia templates. Suchanek et al. [13] constructed a large-scale ontology or semantic knowledge base called YAGO (Yet Another Great Ontology), consisting of 2 million entities and 20 million facts, by exploiting the structural features of Wikipedia and the lexical information in WordNet; YAGO has been incorporated in the DBpedia knowledge base. Both Suchanek et al. and Auer and Lehmann present (demonstrative) interfaces for querying the huge amount of facts extracted from Wikipedia by using a query pattern similar to SPARQL (http://www.w3.org/TR/rdf-sparql-query/), consisting of three fields corresponding to Subject, Predicate, and Object. The proposed project is related to these projects insofar as it exploits structural features in extracting information concerning intellectual heritage materials. The proposed project is distinguished from the aforementioned projects, however, by its prominent incorporation of interactive visualization as an effective mode of knowledge representation and by its focus on deriving intellectual/cultural/conceptual connections among the entities of interest.
Furthermore, in contrast to Suchanek et al.’s and Auer and Lehmann’s approaches, which are closer to the proposed project in terms of general objectives/approaches, the Web portal constructed from the proposed project will be able to offer dual advantages both in terms of a more user-friendly search interface, which does not require the users to learn the syntax of SPARQL, and in terms of more useful search results, which can provide relevant contextual information in response to user queries.
Enhancing Wiki(pedia) by embedding semantic features: Table 6 summarily shows some of the approaches toward enhancing Wikipedia in particular and wikis in general by embedding explicit semantic features.
Table 6. Approaches toward enhancing wiki(pedia)
Theme | Researcher | Task/Technique/System |
Semantic Wiki(pedia) | Krötzsch et al. [29]; Völkel et al. [30] | Semantic MediaWiki for semantically annotating typed links and attributes. |
Incorporating Ontology in Semantic Wiki | Vrandečić and Krötzsch [31] | Extension of Semantic MediaWiki to enable import/export of ontological data. |
Incorporating Ontology in Semantic Wiki | Nixon and Simperl [32] | Makna wiki for distributed knowledge engineering based on ontologies. |
Reasoning in Semantic Wiki | Vrandečić and Krötzsch [31] | Integration of extended Semantic MediaWiki and KAON2 reasoner to offer ontological reasoning. |
Reasoning in Semantic Wiki | Krötzsch et al. [33] | Reasoning tasks relevant for semantic wikis. Scenarios for the application of reasoning. |
Multimedia Extension | Nixon and Simperl [32] | MultiMakna wiki for semantic knowledge management of the multimedia content. |
The proposed project is not focused on enhancing Wikipedia as such. However, annotating relevant Wikipedia articles with RDFa markups specifying the entity types corresponding to objects pointed to by hyperlinks and presenting the enhanced articles based on the semantic markup represent an approach toward “semantifying” Wikipedia. While the proposed project may not explicitly use a reasoner, it will still be able to incorporate automatic inference to a certain extent. First of all, the relational information extracted from Wikipedia concerning a pair of thinkers/ideas will be augmented by automatically inferred data, as in the pilot project, using the inverses of the properties (attributes) involved. (For example, the fact that “A influenced B” implies that “B was influenced by A” even if the latter is not explicitly stated.) Also, the Web portal interface will incorporate the capability to answer queries concerning indirect relationships between a pair of thinkers/ideas by using transitive inference, which, again, was already implicitly incorporated in the pilot project. (For example, from the fact that “A influenced B” and “B influenced C”, it can be inferred that “A (indirectly) influenced C”, even though such inference may need to be qualified, considering the fact that B may have received influences from multiple thinkers and that, as such, the context of the influence from B to C may rather involve (a) different thinker(s) other than A.) Lastly, the proposed project, while not aiming at creating a system for managing multimedia content, will still provide the capability to search for images extracted from Wikipedia concerning a given thinker/idea.
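The two kinds of inference mentioned above can be sketched as follows. This is a minimal illustration, not the pilot project's code: facts are represented as (subject, predicate, object) tuples, the inverse-property table contains only the one pair used in the examples above, and the function names are hypothetical. The transitive query returns the whole chain rather than a yes/no answer, so that the intermediate thinkers remain visible and the indirect influence can be qualified accordingly.

```python
from collections import deque

# Hypothetical inverse-property table (only the pair discussed in the text).
INVERSE = {"influenced": "influenced by"}


def add_inverses(facts, inverse=INVERSE):
    """Return the facts plus the inverse of every fact with a known inverse."""
    out = set(facts)
    for s, p, o in facts:
        if p in inverse:
            out.add((o, inverse[p], s))
    return out


def influence_path(facts, start, goal, predicate="influenced"):
    """Shortest chain start -> ... -> goal along `predicate`, or None."""
    graph = {}
    for s, p, o in facts:
        if p == predicate:
            graph.setdefault(s, []).append(o)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


facts = {("A", "influenced", "B"), ("B", "influenced", "C")}
augmented = add_inverses(facts)
chain = influence_path(facts, "A", "C")
```

Here `augmented` additionally contains the facts that B was influenced by A and C by B, and `chain` lists A, B, and C in order, exposing B as the intermediary.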
Visualizing Wikipedia: Most approaches that apply various forms of information visualization to Wikipedia have been mainly concerned with visualizing the activity patterns of content contributors (e.g., [34] [35] [36] [37]) or with visualizing the conceptual structure of Wikipedia itself (e.g., [38] [39]). In contrast, the proposed project is concerned with visualizing the facts, relations, and networks involving the thinkers and ideas that are discussed in Wikipedia articles, which indirectly reflect both the conceptual structure of Wikipedia and the perceptions of its contributors.
Searching Wikipedia: Besides the academic research projects discussed so far, there are some existing systems on the Web that are related to Wikipedia. In particular, various systems have been developed for the purpose of providing users with alternative methods for searching the content of Wikipedia. Those mentioned in [40] include: Wikiwix (http://www.wikiwix.com/), which provides a single interface for searching articles and images across all Wikipedia sites (e.g., Wikiquote, Wiktionary, Wikinews, etc.) at the same time; Similpedia (http://www.similpedia.org/), which allows users to find related articles on Wikipedia, given a URL or a paragraph of text; and Wikiwax (http://www.wikiwax.com/), which shows a dynamic list of suggestions while the user types keywords to find a Wikipedia article. Wikiwix and Wikiwax are both based on the simple keyword-based search model. Wikiwix and Similpedia both present the query results in a familiar, Google-like fashion, even though Wikiwix also shows some related entries from each of the Wikipedia sites, some of them relevant, others grossly irrelevant. (Wikiwax simply directs the user to an article page, given the user selection.)
More relevant to the proposed project are Powerset (http://www.powerset.com/) and WikiMindMap (http://www.wikimindmap.org/), also mentioned in [40]. While it also has a keyword-based search interface, Powerset presents a given Wikipedia article along with certain facts derived from the article, supposedly by using sophisticated NLP, which include both individual entities with assigned types (i.e., “Things”, “Actions”, etc.) and triples consisting supposedly of Subject, Predicate, and Object. However, the entities include many irrelevant ones and the triples mostly consist of vague or meaningless statements (e.g., “perspective influences philosophy”, “pursue asks questions”), which are of no great help to the users in finding relevant information. WikiMindMap, also based on keyword-based search, presents a Wikipedia article matching the user query in the interactive visual form of a mind map consisting of branches representing section titles and leaves representing hyperlinks under a given section, somewhat similar to the fisheye tree view representation that I used in the pilot project. I think such an overview can be useful for users and thus plan to incorporate a similar feature in the Web portal.
Freebase: I conclude the discussion of related work by mentioning a system that has been seen as related to the proposed project. Freebase (http://www.freebase.com/) is an open, community-constructed online database containing millions of facts in dozens of categories, collected from various sources, including, prominently, Wikipedia. The Freebase page on Kant, for example, presents basic biographical info, quotes, and influenced-by/influenced relations. The basic difference between Freebase and the proposed project consists in the fact that the latter does not aim at collaborative resource construction. Beyond the basic difference, a possible similarity between the two may be found in the fact that Freebase also aims to allow users to make connections across different categories. However, the Web portal created from the proposed project will provide the users with enhanced capabilities of exploration by prominently employing interactive information visualization as well as by providing more detailed semantic and contextual information.
6. CONCLUSION
In this paper I have proposed creating a Web-based knowledge portal, based on the data extracted from Wikipedia (and the Stanford Encyclopedia of Philosophy), which allows humanities students and scholars to explore the landscape of the Western intellectual/cultural heritage using diverse perspectives, foci, and facets, via a user-friendly interface that supports semantics-based and visually enhanced search and exploration. The provision of capabilities both to survey overarching big-picture views connecting disparate elements and to focus on details concerning individual elements will allow the users to gain insights that would not otherwise be possible. The Web portal, as such, will serve as a useful teaching/learning/research tool for digital humanities. Furthermore, the methodologies used in the proposed PanAnthropon project can, with little modification, readily be applied to domains other than the humanities.
REFERENCES
[1] | O'Reilly, T. What is Web 2.0: design patterns and business models for the next generation of software. 30 September 2005. Available at http://oreilly.com/web2/archive/what-is-web-20.html. Last accessed 30 July 2009. |
[2] | Berners-Lee, T., Hendler, J., Lassila, O. The Semantic Web. Scientific American, 5 (May 2001). Available at http://www.scientificamerican.com/article.cfm?id=the-semantic-web. Last accessed 30 July 2009. |
[3] | Athenikos, S.J. Interactive visualization and exploration of information on philosophers (and artists, scholars, & scientists) in an e-learning portal for digital humanities. Presented at the 2009 Symposium on Interactive Visual Information Collections and Activity (IVICA 2009) (Austin, TX, USA, 19 June 2009). Proceedings available at http://www.csdl.tamu.edu/~shipman/IVICA/2009proceedings.pdf. Last accessed 30 July 2009. |
[4] | Athenikos, S.J. WikiPhiloSofia and PanAnthropon: extraction and visualization of facts, relations, and networks for a digital humanities knowledge portal. Poster. Presented at the ACM Student Research Competition at the 20th ACM Conference on Hypertext and Hypermedia (Hypertext 2009) (Torino, Italy, 29 June - 1 July 2009). |
[5] | Athenikos, S.J., and Lin, X. The WikiPhil Portal: extraction, analysis, and visualization of philosophical connections using Wikipedia. Poster. Won student poster award at the Fall 2008 North East Database and Information Retrieval Day (University of Pennsylvania, Philadelphia, PA, USA, 14 October 2008). |
[6] | Athenikos, S.J., and Lin, X. The WikiPhil Portal: visualizing meaningful philosophical connections. Presented at the 2008 Chicago Colloquium on Digital Humanities and Computer Science (DHCS 2008) (University of Chicago, Chicago, IL, USA, 1-3 November 2008). In Proceedings of the Chicago Colloquium on Digital Humanities and Computer Science, 1, 1 (July 2009). Available at https://letterpress.uchicago.edu/index.php/jdhcs/article/view/5/60. Last accessed 30 July 2009. |
[7] | Athenikos, S.J., and Lin, X. WikiPhiloSofia: extraction and visualization of facts, relations, and networks concerning philosophers using Wikipedia. Presented at the 2009 Digital Humanities Conference (DH 2009) (University of Maryland, MD, USA, 22-25 June 2009). In Digital Humanities 2009 Conference Abstracts, 56-62. Available at http://www.mith2.umd.edu/dh09/wp-content/uploads/dh09_conferencepreceedings_final.pdf. Last accessed 3 August 2009. |
[8] | Athenikos, S.J., and Lin, X. Visualizing intellectual connections among philosophers using the hyperlink & semantic data from Wikipedia. Poster. Presented at the 5th International Symposium on Wikis and Open Collaboration (WikiSym 2009) (Orlando, FL, USA, 25-27 October 2009). |
[9] | Schvaneveldt, R.W., Durso, F.T., and Dearholt, D.W. Network structures in proximity data. In The Psychology of Learning and Motivation: Advances in Research and Theory, vol. 24, G. Bower (ed.). Academic Press, New York, NY, 1989, 249-284. |
[10] | Hummon, N.P., and Doreian, P. Connectivity in a citation network: the development of DNA theory. Social Networks, 11 (1989), 39-63. |
[11] | Chakrabarti, S., Dom, B.E., Ravi Kumar, S., Raghavan, P., Rajagopalan, S., Tomkins, A.D., Gibson, D., and Kleinberg, J. Mining the Web’s link structure. Computer, 32, 8 (1999), 60-67. |
[12] | Denoyer, L., and Gallinari, P. The Wikipedia XML corpus. ACM SIGIR Forum, 41, 1 (June 2006), 64-69. |
[13] | Suchanek, F.M., Kasneci, G., and Weikum, G. YAGO: a core of semantic knowledge unifying WordNet and Wikipedia. In Proceedings of the 16th International Conference on World Wide Web (WWW 2007) (Banff, Alberta, Canada, May 2007). ACM Press, New York, NY, 2007, 697-706. |
[14] | Chernov, S., Iofciu, T., Nejdl, W., and Zhou, X. Extracting semantic relationships between Wikipedia categories. In Proceedings of the First Workshop on Semantic Wikis (SemWiki 2006) at the 3rd European Semantic Web Conference (ESWC 2006) (Budva, Montenegro, 11-14 June 2006). 2006, 153-163. |
[15] | Auer, S., and Lehmann, J. What have Innsbruck and Leipzig in common?: extracting semantics from wiki content. In Proceedings of 4th European Semantic Web Conference (ESWC 2007) (Innsbruck, Austria, 3-7 June 2007). 2007. |
[16] | Wu, F., and Weld, D.S. Autonomously semantifying Wikipedia. In Proceedings of the 16th ACM Conference on Information and Knowledge Management (CIKM '07) (Lisbon, Portugal, 6-9 November 2007). ACM Press, New York, NY, 2007, 41-50. |
[17] | Strube, M., and Ponzetto, S.P. WikiRelate!: computing semantic relatedness using Wikipedia. In Proceedings of the 21st National Conference on Artificial Intelligence (AAAI-06) (Boston, MA, USA, 10-16 July 2006). AAAI Press, Menlo Park, CA, 2006, 1419-1424. |
[18] | Ponzetto, S.P. Creating a knowledge base from a collaboratively generated encyclopedia. In Proceedings of the Doctoral Consortium of the 2007 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL 2007) (Rochester, NY, USA, 22-27 April 2007). Association for Computational Linguistics, 2007, 9-12. |
[19] | Milne, D. Computing semantic relatedness using Wikipedia link structure. In Proceedings of the 5th New Zealand Computer Science Research Student Conference (NZCSRSC'07) (Hamilton, New Zealand, 10-13 April 2007). 2007. |
[20] | Ruiz-Casado, M., Alfonseca, E., and Castells, P. Automatic assignment of Wikipedia encyclopedic entries to WordNet synsets. In Proceedings of the 3rd Atlantic Web Intelligence Conference (AWIC 2005) (Lodz, Poland, 6-9 June 2005). 2005. |
[21] | Ponzetto, S.P., and Strube, M. Exploiting semantic role labeling, WordNet and Wikipedia for coreference resolution. In Proceedings of the 2006 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL 2006) (New York, NY, USA, 4-9 June 2006). Association for Computational Linguistics, 2006, 192-199. |
[22] | Schönhofen, P. Identifying document topics using the Wikipedia category network. In Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2006) (Hong Kong, China, 18-22 December 2006). IEEE Computer Society, 2006, 456-462. |
[23] | Yang, J., Han, J., Oh, I., and Kwak, M. Using Wikipedia technology for topic maps design. In Proceedings of the 45th ACM Annual Southeast Regional Conference (ACMSE 2007) (Winston-Salem, NC, USA, 23-24 March 2007). Curran Associates, Inc., 2007, 106-110. |
[24] | Ponzetto, S.P., and Strube, M. Deriving a large scale taxonomy from Wikipedia. In Proceedings of the 22nd National Conference on Artificial Intelligence (AAAI 2007) (Vancouver, Canada, July 2007). AAAI Press, Menlo Park, CA, 2007, 1440-1445. |
[25] | Milne, D., Medelyan, O., and Witten, I.H. Mining domain-specific thesauri from Wikipedia: a case study. In Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2006) (Hong Kong, China, 18-22 December 2006). IEEE Computer Society, 2006, 442-448. |
[26] | Milne, D., Witten, I.H., and Nichols, D.M. Extracting corpus specific knowledge bases from Wikipedia. Working Paper. Department of Computer Science, University of Waikato, New Zealand, June 2007. |
[27] | Milne, D., Witten, I.H., and Nichols, D.M. A knowledge-based search engine powered by Wikipedia. In Proceedings of the 16th ACM Conference on Information and Knowledge Management (CIKM '07) (Lisbon, Portugal, 6-8 November 2007). ACM, 2007, 445-454. |
[28] | Zesch, T., Gurevych, I., and Mühlhäuser, M. Analyzing and accessing Wikipedia as a lexical semantic resource. In Proceedings of the Biannual Conference of the Society for Computational Linguistics and Language Technology (Tübingen, Germany, 2007). 2007, 213-221. |
[29] | Krötzsch, M., Vrandečić, D., and Völkel, M. Wikipedia and the Semantic Web - the missing links. In Proceedings of the First International Wikimedia Conference (Wikimania 2005) (Frankfurt, Germany, 4-8 August 2005). 2005. |
[30] | Völkel, M., Krötzsch, M., Vrandečić, D., Haller, H., and Studer, R. Semantic Wikipedia. In Proceedings of the 15th International Conference on World Wide Web (WWW 2006) (Edinburgh, Scotland, 23-26 May 2006). ACM Press, New York, NY, 2006, 585-594. |
[31] | Vrandečić, D., and Krötzsch, M. Reusing ontological background knowledge in semantic wikis. In Proceedings of the First Workshop on Semantic Wikis (SemWiki 2006) at the 3rd European Semantic Web Conference (ESWC 2006) (Budva, Montenegro, 11-14 June 2006). 2006, 16-30. |
[32] | Nixon, L.J.B., and Simperl, E.P.B. Makna and MultiMakna: towards semantic and multimedia capability in wikis for the emerging Web. In Proceedings of the 2006 Conference on Semantic Systems in the Knowledge Society – Challenges and Opportunities (Semantics 2006) (Vienna, Austria, 28-30 November 2006). 2006. |
[33] | Krötzsch, M., Schaffert, S., and Vrandecic, D. Reasoning in semantic wikis. In Proceedings of the Third International Summer School on Reasoning Web (Reasoning Web 2007) (Dresden, Germany, 3-7 September 2007). LNCS 4636. Springer-Verlag, Berlin; Heidelberg, 2007, 310–329. |
[34] | Biuk-Aghai, R.P. Visualizing co-authorship networks in online wikipedia. In Proceedings of the International Symposium on Communication and Information Technologies (ISCIT 2006) (Bangkok, Thailand, 18-20 October 2006). 2006, 737-742. |
[35] | Gawryjolek, J., and Gawrysiak, P. The analysis and visualization of entries in wiki services. In Advances in Intelligent Web Mastering. Springer, Berlin/Heidelberg, 2007, 118-123. |
[36] | Suh, B., Chi, E.H., Kittur, A., and Pendleton, B.A. Lifting the veil: improving accountability and social transparency in Wikipedia with WikiDashboard. In Proceedings of the 26th CHI Conference (CHI 2008) (Florence, Italy, 5-10 April 2008). ACM Press, New York, NY, 2008. |
[37] | Wattenberg, M., Viégas, F., and Hollenbach, K. Visualizing activity on Wikipedia with chromograms. In Proceedings of the 11th IFIP TC13 International Conference on Human-Computer Interaction (INTERACT 2007) (Rio de Janeiro, Brazil, 10-14 September 2007). LNCS 4663, Part II, C. Baranauskas et al. (eds.). Springer, Berlin/Heidelberg, 2007, 272-287. |
[38] | Holloway, T., Božičević, M., and Börner, K. Analyzing and visualizing the semantic coverage of Wikipedia and its authors. Complexity, 12, 3 (2007), 30-40. |
[39] | Lee, K., Kim, H., Jang, C., and Kim, H.-J. Folksoviz: a subsumption-based folksonomy visualization using Wikipedia texts. In Proceedings of the 17th International World Wide Web Conference (WWW 2008). ACM Press, New York, NY, 2008, 1093-1094. |
[40] | Catone, J. Top 10 ways to search Wikipedia. 21 May 2008. Available at http://www.readwriteweb.com/archives/top_10_ways_to_search_wikipedia.php. Last accessed 3 August 2009. |