Biswas, "Semantic Indexing for Intention-based Retrieval", TCDL Bulletin 5.3 (2009)

Semantic Indexing for Intention-based Retrieval

Amitava Biswas

Department of Computer Science & Engineering
Texas A&M University
College Station, TX, USA
amitabi@cs.tamu.edu

ABSTRACT

To enable intention-based search, computers have to compare the meaning of user’s search intentions and data items being searched. Traditional information retrieval systems fail to adequately represent, compare and discern composite and finer meanings. This research proposes to develop techniques to represent and compare composite meanings. The proposed contribution are: (1) an algebraic theory, design of an data structure and related algorithms for representing hierarchically composed meaning as a tensor which is amenable to efficient semantic similarity comparison; and (2) techniques to construct a scalable and self-organizing meaning based index system that can seamlessly federate and integrate multiple distributed repositories.

Keywords

Intention-based search and retrieval, meaning representation and comparison, meaning based indexing.

1. INTRODUCTION

1.1 Current Trends and Requirements

Digital libraries are a kind of Information Retrieval (IR) system that is expected to satisfy specific requirements, which include, but are not limited to:

Meaningful search: Today users expect more meaningful search capability that is beyond simple keyword matching based searches. For example, when an user wants to do a research on “causes of glucose metabolism disorders” (the intention), a research publication on the role of specific gene PTPN22 in causing diabetes should be returned even if there are no text keywords that are common between the returned text and the user’s search intention.
More precise retrieval: Users prefer a smaller set of more relevant results.
Large scale: Today’s digital libraries are often rapidly growing collection of billions of data objects consuming peta (10E15) to exabytes (10E18) of space [1].
Distributed organization: Contemporary digital libraries are federated collections of multiple large databases managed by multiple parties.
Collection of heterogeneous objects: Modern digital libraries have to host digital content (objects) having multiple formats ranging from structured SQL records to free-form natural language texts, image, video, audio, binary files.

1.2 Intention-based retrieval

To satisfy the first two requirements, we need superior “meaning centric” (semantic) information retrieval (IR) technologies that can capture the meaning of the user’s search intention and the objects, and then compare the meaning of the intention against the objects’ meanings to identify relevant objects (Fig. 1). We call this kind of IR intention-based retrieval. When this kind of meaning comparison is used instead of plain query and object/metadata text string matching, then the results will have higher relevance to users.

Figure 1. Processes behind intention-based

To satisfy the last three requirements as mentioned in section 1.1, the intention-based retrieval system will also need a new kind of scalable index system that will simultaneously achieve two things: (1) identify useful objects based on meaning based comparisons; and (2) federate over large numbers of distributed collections (or indexes) of heterogeneous objects.

The “meaning” which is being discussed here is composite meaning of the entire object (or the intention) instead of simple meanings of individual index terms or keywords in the text (the text object or annotation of a non-text object). Human beings naturally think in terms of composite meanings which are hierarchically composed from simpler elementary meaning [2-5]. For example, the composite meaning conveyed by the phrase “color of the leaf is green” is comprehended and visualized in terms of elementary meanings such as: the physical object “leaf”, its “color” and the instance of color “green”, etc. Each elementary meaning itself can be a hierarchical composition of more elementary meanings.

1.3 Limitations of Existing Technologies

Existing public domain IR technologies do not have an effective approach for the required kind of searching based on composite meanings. This inadequacy is partly due to the limited ability of existing semantic technologies to represent and quickly compare composite meanings to discern finer differences in meanings. Humans use such meanings in natural language annotations and text objects. Hence there is a need for speedy, unified, generative mechanism that can simultaneously: (1) compose elementary meanings to represent composite meaning; and (2) recognize the semantic (meaning) relationships between dissimilar text strings to yield consistent results unlike table 1 & 2. Table 1 illustrates the limitations of a key word matching based search system, whereas table 2 shows limitations of a popular search engine which appears to be using latent semantic indexing.

Table 1. Limitations of the scientific publications search at Pubmed [6] (as of July, 08)

Keyword used in “All Database” searching	Objects returned	Comments
“diabetes” AND “PPAR”	1979	Indicates that objects are available
“auto immune disease” AND “PPAR”	0	This result should be a superset of (not less than) the result above.
“diabetes” AND “nuclear receptor”	406	Should not give less than 1st row

Note: “PPAR” is a “nuclear receptor”, “diabetes” is an “auto immune disease”

Table 2. Limitations of a popular internet search engine (as of Apr 24, 09)

Keywords used	Top 20 results	Comments
rodent supplier	Some are relevant	Indicates that objects are available
mouse supplier	Irrelevant	Returns computer mouse suppliers
supplier animal mouse medical experimentation	Irrelevant	Fails to return relevant results even though disambiguated by adding “animal”

1.4 Required Technologies

To materialize an intention-based retrieval system we need to extend the traditional IR paradigm (Fig. 2) with respect to three crucial technological components:

The key which is used to retrieve objects (items).
Comparison of keys.
The index, which reduces the actual number of comparisons required during the search.

Figure 2. The search paradigm.

In the traditional IR paradigm, a database is a collection of searchable objects and key pairs as shown in Fig 2. Generally the key is a number, text string, a set or vector of keywords which represent a compact description of an object. The query has a structure which is similar to that of the object key (set, vector, etc.), and it is effectively compared against all object keys to identify the matching objects using the index. To extend this paradigm for intention-based searching, the key should be a “semantic descriptor” data structure that represents the meaning of user’s intention and the object. The key comparison should ascertain the similarity between meanings represented by two semantic descriptors, whereas the index should be a scalable infrastructure designed to improve the search speed (response) and recall.

2. RESEARCH OBJECTIVES

To materialize the needed technologies, we need to develop the followings:

A psychologically realistic semantic descriptor data structure that can represent composite meanings, by enabling hierarchical meaning composition.
A method to automatically generate meaning representation data structures for given natural language (English) texts.
An efficient technique to compare two data structure for the similarity of their meanings.
A scalable approach that can allow meaning-based indexing of heterogeneous objects stored in distributed repositories.

Automatic generation of the semantic descriptor data structure from text is important because many useful objects are text objects. Often a text description about an object is available rather than its structured metadata. In addition a user is more likely to prefer providing a text description of an object rather that it's structured metadata.

Meaning similarity comparison is the key process here which involves significant amount of computation and hardware. If this similarity comparison can be made efficient, then that would make the operation much faster, as well as reduce hardware and energy consumption. Today the data centers that host large scale search engines and digital libraries are largely limited by the power and cost related concerns [7, 8]. Therefore time-, memory- and power-efficient meaning comparison technique is much needed.

3. EXISTING STATE OF ART

3.1 Meaning Representation Models

3.1.1 “Meaning” in cognition & linguistics

The subsequent sub-sections present how human beings comprehend and convey meanings [9]. These understandings are necessary to design realistic meaning representations.

3.1.1.1 Sense of meaning

“Meaning” denotes ideas and thoughts in human mind that are usually conveyed by natural language representations. Human beings convey and comprehend this meaning using concepts. A concept is mental representation of an abstract idea or a physical object (e.g. Car) that is stored, recognized, comprehended and manipulated in human memory in terms of its attributes (e.g. wheel, engine, transportation), which are used as manipulation handles [10]. A concept can be either generalized or specified to a class having only a single object (e.g. “President Kennedy’s car”). Comparison of meaning actually means comparison of concepts.

3.1.1.2 Complex concepts

Composite meaning conveyed by natural language texts involves complex concepts. Behavioral, neuro-scientific [2, 3] and linguistic [4, 5] studies indicate that humans combines elementary thoughts and ideas to generate more complex ones (composition of concepts), which forms the basis of reasoning, learning [11] and language comprehension ability [12]. Therefore to be psychologically realistic, the descriptor data structure should enable hierarchical concept composition.

3.1.1.3 Simple principles of meaning composition

“Simpler syntax hypothesis” [5] and “parallel architecture” theory [11] argue and [13, 14] provide evidence that in the human mind, a simple collection of elementary meanings is good enough to represent a complex meaning. Take, for example, the language of children, pidgin speakers, late-language learners [13] (e.g. “walk street, go store”) and in the usage of English compound words [14] (e.g. “winter weather skin troubles”), where the words convey elementary meanings.

3.1.1.4 Generative mechanism of meaning composition

“Parallel architecture” theory proposes that a generative mechanism for semantics [11] exists in the mind that generate complex semantic structures based on two simple rules. The first rule allows representation of complex meaning as a collection of elementary meanings, and the second one allows composition of collection of meanings to form higher level collections. Therefore a hierarchical collection (tree structure) of elementary concepts / meanings (a toy example as in Fig. 3) can be used to represent a complex concept/composite meaning within computers [9].

Figure 3. Hierarchical collection of meanings [9]

3.1.2 Existing meaning representation methods

3.1.2.1 Set based (Boolean) model

Here the object’s key descriptor is a set of index terms or keywords [15]. The query implies a set which is expressed as Boolean conjunction, disjunction and negation of index terms. This scheme represents meaning too broadly and inaccurately.

3.1.2.2 Vector based models and associated techniques

Different kinds of vector models are available. Some models involve large dimensional vector spaces, where each basis vector corresponds to a real world term, concept or a phrase that are either selected from the text description of the object (or the text object itself, e.g. traditional models as in [16]) or from the object’s metadata. Other models consider smaller number of basis vectors, each of which corresponds to a statistically dominant dimension (e.g., latent semantic dimensions as in Latent Semantic Indexing (LSI) [17]). In all these vector models, the magnitude of semantic similarity between two vectors V1 and V2 that represent two meanings is computed as the dot (cosine) product V1 • V2. A similarity value of zero means that two vectors are dissimilar (orthogonal) and a higher value (< 1) indicates more similarity.

The traditional term-based vector model does not address the synonymy and polysemy problems [17]. The basic LSI approach recognizes the meaning context better than term-based vector models, but it ignores new infrequently used terms that might convey important meaning information (e.g. in niche scientific domains). The relative superiority of one of these two kinds over the other is still not fully resolved [17], therefore there might be a benefit in using them both together. Vector model based search systems often need additional techniques, for example methods to identify topic concepts (to generate metadata) from the text [18, 19] or disambiguate terms to resolve the polysemy problem [20], etc.

Vector-based models are computationally expensive. These models cannot represent complex descriptions that are based on complex meanings (concepts) and fail to discern between text descriptions that have common keywords/terms but have very different meanings [21]. This is because vector-based meaning representation, being a flat “bag of words”, does not capture the required composition information to represent composite meanings. There is a growing volume of work [20-22] that strives to address some of these limitations and improve performance of vector models.

3.1.2.3 Graph based model

Based on the theory in [23], a composite meaning can be represented as a graph where each elementary concept is a node and the relationships between the elementary concepts are denoted as edges connecting those concepts [24]. It might be possible to compare complex concepts with graph comparison techniques like in [25], however these are ungainly and computation-intensive.

3.1.2.4 Galois lattice model

In Galois lattice based design, composition and comparison techniques by [26, 27] are based on Formal Concept Analysis (FCA) theory. These schemes do not seem to have any support from cognitive science. This descriptor design assumes an over-simplistic taxonomy, which is insufficient to describe a real world context.

3.1.3 Mainstream semantic web standards

Mainstream semantic web standards like RDF, RDFa, microdata, microformat, and HTML5 [28-32] provide ways to represent meaning in terms of attributes (object) and their relationships (predicate) with the subject. These representations are more structured and hence more computable to enable formal reasoning/inferencing capabilities. These aspects can enable more precise searching but these semantic web techniques are also expensive in terms of computations.

3.2 Meaning based Indexing

3.2.1 Inverted index in vector model based search systems

Set and vector based IR systems have their own inverted index scheme, which is a colossal incidence matrix of terms (as rows), documents (as columns) and/or latent semantic vectors (in case of LSI). Similarly, proposals on large (and expensive) Galois lattices as concept based indexes are available in [33]. Both of these approaches do not incorporate composite meanings.

3.2.2 Distributed index

Large indexes often need resources that are beyond the memory and processing capacity of a single computer node. Hence large indexes are implemented on a distributed system [7]. In such systems, capacity scaling of indexes is carried out by slicing the index horizontally (randomly assigning objects to slices) (Fig. 4). Each index slice is implemented on a separate computer (node) or a pool of inexpensive nodes. An object is randomly assigned to any index slice (pool), a query has to be broadcasted to all index server slices/pools to identify relevant objects. Some index pools will return local storage addresses (e.g. document ids) of matching objects, if available, to the document server, and the document server will present the object URIs to the user, who can retrieve the objects using the URIs. These details are available in [7].

Figure 4. Index distribution and query delivery in a typical scaled index system

4. RESEARCH DESCRIPTION

The following sub-sections present our ongoing work to address the four research objectives identified in section 2. We first explain the broader context of this research in terms of the proposed search system and the search process.

4.1 Overview of the search system and processes

4.1.1 Proposed Meaning-based Distributed Index

Three different requirements are addressed in the proposed index system. The index should: (1) enable meaning based searching of heterogeneous objects; (2) federate over large numbers of distributed collections (or independent smaller indexes); and (3) be scalable. Therefore the index should use meaning comparison to enable index lookup. To enable federation of individual objects or object collections stored in local storage systems or which remain scattered all over the internet, the index should store objects addresses that are either local storage addresses or URLs depending upon the implementation. To achieve scalability, we propose that a scalable, distributed meaning based index be built by a large number of small index nodes, where each index node specializes in indexing only a certain kind of objects (vertical slicing as opposed to horizontal slicing mentioned in section 3.2.2). Each index node may be a single node or a small pool of computers. These specialized index nodes may apply custom indexing and search strategies suited to the domain, search context, type of data objects, etc., to yield superior search performance.

Here similar objects are assigned to a common index slice, one key problem is how to selectively route a search query based on meaning, to a specific computer which may host the portion of the index that is the relevant to the given query. This selective query routing is a more resource and energy efficient alternative than the query broadcasting approach, therefore the later kind of index scaling system is proposed. Selective query routing avoids exerting multiple nodes than what is absolutely necessary and therefore obviates the need for high capacity nodes/pools and saves hardware and electrical energy.

4.1.1.1 Using Semantic Routers to implement a distributed index

To selectively deliver a query to the right index node based on meaning, we propose that a special query routing component called Semantic Router be used. A semantic router is an application level message router. It is different from network routers used for packet routing at lower networking protocol levels [34]. Fig. 9 illustrates a scalable index system that use meaning based (vertical) index slicing technique and a semantic router to deliver queries to index nodes that can best handle the queries.

In this system (Fig. 5), a user can pose a search query (arrow “i” in Fig. 5) to a mediating user interface (UI), which constructs a suitable search/query key and submits an ‘object search request’ message with this key to the semantic router (arrow “ii”). The search key represents the meaning of the desired object to an extent what a user can define. The semantic router will forward the query to single or a few selected index nodes (e.g. index server 2 in Fig. 5), which will finally return local storage addresses of matching objects, if available (arrow “iii”), to the document server. The document server will present the object URIs to the user (arrow “iv”), who can retrieve the objects using the URIs.

Figure 5. Proposed index scaling and meaning based query routing

The index node destination to forward the query message is decided using a table lookup mechanism (semantic lookup). For example, in Fig. 5, the semantic router receives a message and finds its key “K” to be most similar to its semantic routing table row key (semantic descriptor) “D2” which has a corresponding address belonging to index server node 2. The semantic router will forward the query message to IN3. In case of a lack of a routing table match, the query is broadcasted to all or few selected addresses depending on the policy. An index node specializes in indexing certain kind of objects having similar meanings. The description of its specialty is represented by a semantic descriptor that acts as the row key in the semantic routing table. As the query message key semantic descriptor represents its meaning, so in this way the query gets routed based on its meaning. This meaning based selective query forwarding approach avoids the situation where multiple copies of the query have to be forwarded to each individual horizontal slice of the index as in [7]. The technique to distribute index entries based on meaning is described in [35]. For further scalability of the distributed index, multiple routers can be used as also explained in [35].

4.1.1.2 Design of the semantic descriptor and comparison technique

Design of a semantic router system will involve: (1) a design for semantic descriptor data structure that can represent meaning for query routing purpose; (2) semantic descriptor comparison technique; and (3) a method to speed up the comparison. The semantic descriptor design will involve more than one method to represent meaning due to pros and cons of each method explained in section 3.1.2. It appears that a better choice is to use a mix of three vector space models (for speed advantage): term-based vector model; latent semantic indexing; and our proposed tensor based models (explained in details in section 4.3). To determine the optimum method to derive a single similarity values from these three models, a large scale user based experimental study is required. However for sake of simplicity at this moment we will consider separate approaches for fast comparison according to the three aforementioned models. The idea is that once a similarity value composition method is decided, then the final similarity values can be computed from the three individual methods. So we will assume that the semantic descriptor will have three sub-components: (1) an infinite dimensional but sparse term based vector; (2) a limited dimensional non-sparse latent semantic vector; and (3) an infinite dimensional sparse vector based on the proposed tensor model. As the first two components are well known, so henceforth we will only discuss techniques to generate and compare tensor based component.

4.1.1.3 Inserting an object in the index

To insert an object in the index, first the semantic descriptor (key) for the object is generated by the indexer system, then this key along with the object’s URI is submitted to the semantic router. Using the index system’s search capability the recipient router will identify a most suitable index node which is interested in hosting the object’s key-address pair permanently in its index and request the index node to register the key-URI pair.

4.2 Overview of the Search Process

Fig. 6 presents an overview of the search process [9], which is different than traditional implementation (i.e. inverted indexes) of a vector based search systems. This search process will benefit from a new kind of meaning representation scheme that is proposed in this research. The search process involves seven systems (shown by boxes with broken boundaries): (1) user’s desktop terminal; (2) search engine’s user interface; (3) object storage platform; (4) index generator system that generates semantic descriptors (keys) for the stored object; (5) index storage which stores all key-address pair mappings; (6) search engine/index core which carries out the index lookup for a given query by comparing it against object keys; and (7) the key comparator. The index storage may be distributed for scalability according to the approach discussed in the earlier sections. The key comparator will be located at the semantic routers and also in the index nodes where meaning comparisons will be carried out for query routing and index lookup.

We propose that the key comparator be implemented by using a specialized hardware accelerator (e.g. FPGA/ASIC based coprocessor on PCI-E card inside a desktop) for: speed, smaller hardware investment and energy efficiency. Design of such hardware is possible based on the approaches proposed by this research [36]. In Fig. 6, the arrows indicate flow of data entities (shown as boxes with continuous lines).

Figure 6. Overview of the search process [9]

Here we assume that an object is either a text or has a text description provided by the object owner/publisher, which captures the meaning of the object. The final forms of the object and query keys as used for comparison in the hardware comparator are generated from the text through steps I to IV.

In step I, the meaning of the text (a complex concept) is captured and represented as concept tree by a manual or automated process. In step II, an algebraic (tensor) representation of this concept tree is generated to enable cosine similarity comparison of two trees. In step III, this derived tensor is encoded in the Coefficient table. This coefficient table is used as the format to store the key in the index (Fig. 6). In step IV this table is extended to a Bloom filter based data structure to enable fast parallel computation of cosine similarity using a simple hardware accelerator. During the query key generation (left side of Fig. 6), steps I to III take place within the search engine user interface. For object key generation (right side of Fig. 6), steps I to III takes place within the indexer system. Step IV takes place within the search engine/index core. Step III uses hash functions and bloom filters [37] that are specific to a search engine and indexer implementation. Sometimes there is also a need to find objects that are similar to a given object (e.g. as in recommender systems in Amazon, Pubmed [6]). In such cases, the given object’s coefficient table is used as the query coefficient table.

The cosine comparison for all the three the descriptor components can be carried out in the hardware comparator based on strict string (or vector label) matching. There is no need to check for the semantic similarity of different keywords from ontology or keyword co-occurrence matrix during the dot product computation hence the process is straightforward and speedy. However step I & II involve natural language processing and semantic relationships between words and phrases.

4.3 Meaning Representation

4.3.1 Concept tree representation

To explain the concept tree we have considered a bioscience publication as an object in Fig. 7, which presents its title along with its corresponding concept tree representation. For such objects, the abstract of the publication can be used as the text description.

Figure 7. Generation of concept tree from text

In Fig. 7, each complex concept is defined by collection of elementary concepts. These elementary concepts are considered as the attributes of the complex concept, therefore taken together they can denote the complex concept. The composed concepts are underlined and the basic concepts are shown in bold as the leaves. Basic concepts have been defined as controlled terms in domain ontologies like Gene Ontology [39], Disease ontology [40], etc. Basic concepts are denoted by these terms (character strings).

The given publication is a narrative about the gene PTPN22 whose malfunction causes diabetes and other autoimmune diseases. Therefore the publication is expressed as a concept (shown at the top of the tree) which is composed of two child concepts: the resulting disease “Type 1 diabetes mellitus” and the gene “PTPN22”. The next two levels of the tree describe the concepts that represent the gene and its function (encoding of the protein “Lyp”, which binds with another molecule “Csk”). In general the specific rules of constructing concept trees will depend on this kind of domain knowledge models and these rules can be codified as standard composition templates to be used to represent the corresponding concepts (Fig. 7). Standard templates will enable use of common set of rules (domain knowledge) to construct concept trees and ensure proper meaning representation and similarity comparison.

In the search process, as illustrated in Fig. 7, the ontology and templates are specific to the search engine user interface and indexer. Using the available ontology and knowledge base (construction templates, etc.) the search user interface and indexer system will make the best effort to iron out variations and ambiguities that might be present in user provided terms and construct the right kind of concept tree from the given text. Design of techniques to generate concept trees from given texts for a given domain will require some research effort.

In the simple example in Fig. 7, “Type 1 diabetes mellitus” (hyponym) is a specific instance of “Auto immune disease” (hypernym), so only the hyponym was considered here at this time. Method to incorporate hypernyms as extension to the basic method is discussed in [9, 38], along with the structure of the coefficient table.

4.3.2 Tensor representation of concept tree

Each composition in a concept tree is a set and the concept tree is actually a nested set (set of sets) having hierarchical structure (Fig. 8). Such a nested set or a tree structure is expressed as a tensor which is sum of scalar weighted polyadic products (polyads) [41] of the basic basis vectors that represent the elements at the lowest/innermost levels of the nested set (leaves of the tree).

Fig. 8 illustrates a tensor representation of a simple tree (for the shaded portion of the tree in Fig. 8). Any particular composition (or a set) in the hierarchical structure is represented as a tensor, which is a function of all the elementary meanings contained in the set. Here the elements (say A, B, C…) are considered as tensors themselves, and their composition (or the set) is expressed as {A, B, C, ….}, where the curly brackets “{…}” denote a certain function of: A, B,C,…. [9, 38]. This function can express the set as a mix of conjunction and disjunction of the constituent elements. A tensor algebra was developed to solve this problem. This algebra, the explanation of the delimiter vectors “” and “” and the generative method to produce the scalar coefficients and the polyads, have been presented in details in [9, 38].

Figure 8. Tensor representation of concept tree

4.3.3 Descriptor data structure

The tensor representation of the entire concept tree is encoded in a compact Bloom filter [37] based semantic descriptor data structure in two steps (step III and IV). The algorithm for step III, that encodes the tensor to a coefficient table has been presented in [9] and the algorithm for step IV is explained in [38].

4.4 Descriptor Comparison

The semantic similarity between two descriptors is computed by taking the dot product of the two tensors that represent them. For infinite dimensional vector models this pose a certain problem. In these models, the vector/tensor representation is a sparse single dimensional matrix. To compute the dot product of two of such matrices, one must identify the matching dimensions (basis vectors) and then multiply their corresponding coefficients. To speedup this entire computation, a special algorithm and bloom filter based data structure [38] is utilized. This algorithm is designed for hardware implementation and this hardware implementation can provide several order magnitude (> 10E4) of speedup over software implementation [9, 36, 44]. The speed gain and power savings aspect has been verified by circuit level design and simulation in [36, 44].

4.5 Salient properties of the Meaning Composition Framework

The tensor based tree comparison has some useful properties which are illustrated below. To explain these properties, the terminologies and metrics used to compare two trees are explained in Fig. 9. Here, the number of “overlap” indicates amount of leaves present in similar locations in both trees. Number of “noise” denotes the count of leaves that are not common and hence considered as noise when the other tree is taken as the reference one. Number of “displaced” signifies the same for displaced leaves (e.g. “C” in Fig. 9). Based on these three numbers (Overlap, Noise, Displaced), three ratios, one for each number, are derived by dividing them with the total number of leaves in the two trees taken together (number of elements in the union of set of leaves). For example, if L1 is the set of leaves in the first tree and L2 is for the second tree then the overlap ratio is given by Overlap / ¦L1 U L2¦. The two other ratios are similarly computed. These ratios are analogous to “Jaccard similarity” [43] measure used to compare two sets.

Figure 9. Terminologies for tree comparison

4.5.1.1 Property I: Tensor based comparison is consistent with other tree comparison metric

In each row of table 3, two concept trees are compared using three methods: tensor; vector; and the metrics described above. The trees in all these rows are constructed using two different kinds of tree composition templates. Both these kinds have two levels of compositions. For both, the top level composition is skewed towards conjunction. In one case the lower level composition is equally balanced between conjunction and disjunction, and in the other case it is skewed towards conjunction. The method to set the behaviors of the compositions are explained in [9, 38].

The trees/descriptors (D¹, D²) in row # 1 to 6 in table 3, are composed using the first kind of template described above, and the ones in row # 7 to 11 are based on the second kind of template. We shall show later that the trees in row # 1 to 6 can be templates for disambiguating a concept (e.g. A= “store”) by declaring a set of alternative attributes (e.g. C= “sale”, D= “purchase”, E= “supply”, F= “buy”), or for specific definition of a meaning.

Table 1 illustrates that for different kinds of compositions, the tensor similarity is consistent with the favorable effect of overlap and unfavorable effect of noise and displaced ratios. The similarity values derived according to the tensor model indicates that the tree similarity decreases with decrease with overlap ration and increase of noise and displaced ratios. However the similarity according to the vector model is not consistent with the ratios.

Table 3. Consistency between tensor and ratio based tree comparison metric

4.5.1.2 Property II: Composition information is included (conjunction)

Row 6 and 11 of table 3 above, illustrates the composition property of the tensor based model. These two comparisons show that that tensor similarity measure can distinguish trees with similar leaves having different compositions, but vector based similarity can not. These show that tensors do a better job in discerning dissimilar compositions (trees) and meanings.

4.5.1.3 Property III: An incomplete set of elementary meanings can identify the composite meaning

Two similar composite meanings may be expressed by two different but overlapping set of elementary meanings (i.e. they share many common elements) and yet they will be recognized as similar ones by the tensor model (see row 1 & 2 in table 3 above), as in case of vector model. This property is useful to identify similarity between contexts which are described by a slightly different set of elementary meanings.

4.5.1.4 Property IV: Higher level compositions are more important

The differences or similarities of elements at higher level compositions in a tree have larger impact on the similarity of the entire tree. The tensor based comparison in table 4 below illustrates this (only dissimilarity case is shown). In row 1, the trees have dissimilarity in the higher level of the composition whereas in the row 2, the dissimilarity is at the lower level. All compositions are uniform mix of conjunction and disjunction compositions. The real world analogy of this property is that two objects will be considered similar if the big picture meaning of objects are similar even though the finer detailed meanings may be somewhat different.

Table 4. Importance of higher level compositions

4.6 Applications of the proposed framework

The proposed tensor model has certain strengths that can complement the existing traditional vector or LSI models, when used along with them. So we propose that a combination of LSI, term based vector model and the proposed tensor model is used together to represent meaning of a object. In this regards, the meaning composition framework and the tensor approach can be used for different purposes, as follows:

The meaning composition framework can be used to represent composite meanings using the full-fledged tensor model, as explained earlier.
Alternately the proposed framework can be also used to bolster the traditional (term based) vector models and improve its semantic performance. This can be achieved by selectively replacing some of the basis vectors by tensors for small compositions or trees where the leaves may be text terms or from controlled vocabulary. These can be used to:

Define a term by its attributes. Compositions having form {<item>, <attributes>} can be used instead of only <item> to define the term. E.g. {“LyP”, “Csk”} to denote the function of “LyP” in terms of relationship with “Csk” instead of only “LyP”. The curly bracket notation “{ , , …}” was explained earlier in section 4.3.2.
Define the specific usage context for a term. Composition template to be used is in the form: {<item>, <context attributes>}. E.g., {“diabetes”, “PTPN22”} to denote the cause aspect of the disease instead of only “diabetes”.
Aid term disambiguation. E.g., instead of “mouse”, use {“mouse”, “animal”} or {“mouse”, “computer device”}.

5. EVALUATION PLAN & PRELIMINARY RESULTS

5.1 Meaning representation and comparison

5.1.1 Evaluation Plan

The focus of this research is to evaluate whether concept trees (nested or hierarchical sets) can improve the quality of meaning representation (i.e. produce better meaning comparison) for semantic searching, when used along with existing vector models. Though it is natural to think about an ultimate test which should evaluate the performance of a usable search (information retrieval) system built using this meaning representation technique, however, such straightforward evaluation will not be entirely fair. This is because a usable search system, where users will pose real life queries and get results, needs several other associated mechanisms in addition to the meaning representation and comparison technique. The performance of such system will depend on the combined performance of these mechanisms that has nothing to do with this concept tree meaning representation scheme.

Therefore such evaluation is likely to get muddied by the variations in implementation of the associated mechanisms. In addition the design issues which are going to make the comparison evaluations difficult are: quality and completeness of the ontology (vocabulary, tacit knowledge models that manifest as tree composition templates, etc.); quality of the text to tree conversion mechanisms; choice of the corpus; design of the search user interface and query construction technique; etc. These factors are likely to inject significant noise in the results to make them unusable for a fair and conclusive comparison. Moreover several other innovations such as the design of query generation scheme, and user interaction techniques, etc., are also necessary to build a usable search system. All these will need significant research and system development time and manpower investments before the constructed system can generate any useful data for the required evaluation.

Therefore, it makes more sense to carry out the evaluation in a phased manner. These evaluation phases will yield preliminary data, which may not be conclusive but they can be early indicators. Based on the preliminary results one can decide whether to pursue (or modify) the idea of concept tree, to advance additional research tasks, undertake system development effort, and proceed towards final stages of evaluation. Here the following three evaluation phases are proposed.

5.1.1.1 Phase I: Simulation based verification of salient properties of the tensor model

In this phase the salient properties of the tensor model, as identified in section 4.5, would be verified for a large number of trees. These trees will be synthesized (machine generated) with required degree of randomness and then compared against one another. Using these sample trees the metrics as presented earlier in table 3 in section 4.5.1.1, will be computed and compared. Conclusions will be drawn using statistical hypothesis testing.

5.1.1.2 Phase II: Evaluation of the meaning discriminating power of the tensor model using human subjects

This user based study will evaluate whether the concept tree meaning representation and the tensor model working together can discriminate objects more precisely when used along with traditional term based vector model as in [16] and Latent Semantic Indexing (LSI) [17]. Traditional term based vector model and LSI are chosen for this comparison because they are well established and proven techniques used in popular search engines. The relative superiority of one of these two kinds over the other are not yet conclusive [17], hence the need to compare against both. Here we want to test the hypothesis that meaning representation and comparison performance can be improved when the tensor model is used along with the existing vector models. The implied assumption is that if meaning representation and comparison can be improved then that will lead to superior search performance.

Here we assume that a human being is the best judge to discern the similarities or dissimilarities of meaning of given text objects, so human judgment is taken as the ideal case. The success of meaning representation is evaluated by comparing how effectively meanings can be matched or distinguished when these four competing approaches are used:

Tensor model along with traditional vector model (called “tensor + vector” model) where the similarity metric is average of tensor and vector based similarities;
Tensor model along with LSI (“tensor + LSI” model), where the similarity metric is average of similarities from both models;
Traditional term based vector model (“vector” model); and
Latent Semantic Indexing (“LSI”).

From a corpus of text objects, four objects are considered to form six text object pairs and a human subject will be asked to rank these six pairs according to their similarities. The size of the experiment is limited to six pairs to make the task less arduous to the human subject and keep the results reliable (humans are prone to make mistakes when stressed with large numbers of pairs). Based on the similarity values (computed by four machine based methods) between these object descriptor pairs, a computer system will rank the pairs and compare them against the human based rankings. A hypothetical sample set of data are presented in table 5 for the purpose of illustration.

Table 5. Hypothetical sample data set for comparing different approaches

Here the intention is to verify whether the rankings of “tensor + vector” model follows more monotonically with the human rankings compared to the rankings generated by the “vector” model. The same for “tensor+LSI” and “LSI” models will be done. To actually do this, from the table mentioned above, the non-parametric Kendall Tau correlation [42] between these rankings will be computed for all four machine based rankings against the human rankings. Then the differences of the correlations for the “tensor + vector” and the “vector” model will be computed, it will be called as tensor-to-vector correlation difference. Similarly we will compute the correlation difference between the “tensor + LSI” and the “LSI” model.

Here one needs to test the hypothesis that the tensor-to-vector correlation difference is significantly greater than zero, or not. If it is greater than zero, then one can conclude that rankings generated by the “tensor+vector” model follow human ranking with greater fidelity compared to the “vector” model. If this is true, then one can conclude that there is a benefit in using the tensor model to improve the performance of the existing search system which use traditional vector based model. The same hypothesis testing will be carried out for the tensor-to-LSI correlation difference to evaluate the benefit of using tensor based model to improve the LSI model. For testing these hypotheses, multiple (m) human subjects will be used, and each subject will be given the same multiple sets (s) of six object pairs to rank. All these correlation differences from multiple users and object pair sets will be considered as sample data (total mXs samples).

For these experiments, the text objects from certain scientific domain will be selected based on availability of high quality well documented ontology for those domains and relative ease of capturing domain knowledge into tree composition templates. Biomedical science publications from Pubmed [6] are likely object candidates.

5.1.1.3 Phase III: Evaluation of the retrieval power when similar objects are used as queries

To carry out experiments for this phase, a rudimentary search (information retrieval) system will be built with a large collection of objects which are drawn from Pubmed [6]. The size of the collection will be limited but it will be larger in size compared to the size of the object set used in phase II evaluation. Instead of queries, a selected set of object descriptors will be used as query keys. These descriptors will be constructed by the four models: (1) “tensor + vector”; (2) “tensor + LSI”; “vector”; and (4) “LSI”, as described earlier, in section 5.1.1.2. This will allow a fair degree of assessment of how much the tensor model can improve retrieval performance, while avoiding the need for chosen query formation, user interaction techniques and query generation system design and implementation, etc. issues. Human subjects will score the search results by a 1 to 5 scale (very dissimilar to very similar), and the usual precision and recall metrics will be computed with a cut off score of 3 (results with score 3 and above will be considered relevant). The statistics of the scores and the search performance metrics will be compared for all four cases (“tensor + vector”, “tensor + LSI”, “vector”, “LSI”).

5.1.1.4 Evaluation of the high speed dot product computation algorithm

The dot product comparison algorithm as presented in [39] can improve the speed of the computation when implemented on a custom fine grained parallel hardware. So a practical hardware design which implements this algorithm and shows superior speedup can substantiate the superiority of the proposed algorithm.

5.1.2 Preliminary Results

The tensor comparison algorithm (as in [39]) was implemented in hardware (Verilog code) and its performance was compared against that of an efficient dot product computation software code running on a traditional server class processor (Intel Xeon 3Ghz, up to 4 instructions per cycle/IPC) (Table 4) [37].

Table 6. Benefits of the key comparison algorithm implemented on hardware [37]

The proposed hardware provides a speedup of 45,853. It takes 119 clock cycles to compute a dot product compared to 5.4 million clock cycles taken by a software code executing on an Intel Xeon which is a representative server class processor. This is also true for other traditional processors, because their IPCs (<10) are not large enough to compete with million order speedup. This establishes that the proposed algorithm enables fast descriptor (meaning) comparison. When used, this hardware co-processor can increase the dot product processing capacity of a semantic router by several order times.

6. EXPECTED CONTRIBUTIONS

The research is proposed to develop foundational semantic technologies to enable superior composite meaning based information retrieval. Specifically the contributions will be: (1) an algebraic theory; (2) design of an data structure and related algorithms for representing hierarchically composed meaning; (3) an efficient technique to compare two data structures for similarity of their meanings; and (4) techniques to construct a scalable and self-organizing meaning based index which is can seamlessly federate and integrate multiple distributed repositories of heterogeneous digital content. This work is novel because: (1) it proposes to represent composite meaning in manner which is psychologically realistic compared to the existing approaches; (2) it provides a time efficient meaning comparison algorithm; and (3) a meaning-based scalable indexing scheme.

7. ACKNOWLEDGMENTS

My thanks to my supervisor, Dr. Rabi Mahapatra, of Computer Science department of Texas A&M University, for his support and guidance in this work, and to my colleagues: Suneil Mohan, Aalap Tripathy and Jagannath Panigrahy, for their help in carrying out the experiments. This work has been supported by National Science Foundation’s Strategic Technologies for Cyber Infrastructure program award number 0823020.

8. REFERENCES

[1]	Chamberlain, R.D., et al. The Mercury System: Exploiting Truly Fast Hardware for Data Search. Int. Workshop on Storage Network Architecture and Parallel I/OS, New Orleans, LA, USA, 2003.

[2]	Bai, C., et al. Semantic composition engenders an N400: evidence from Chinese compounds. Neuro Report, 19(6), 2008, 695.

[3]	Piango, M. M. The neural basis of semantic compositionality. In session hosted by the Yale Interdepartmental Neuroscience Program, Yale University, 2006.

[4]	Murphy, G. Comprehending complex concepts. Cognitive Science 12(4), 1988, 529–562.

[5]	Culicover P.W. and R. Jackendoff. The simpler syntax hypothesis. Trends in Cognitive Science, 10(9), 2006, 413-8.

[6]	Pubmed, http://www.ncbi.nlm.nih.gov/pubmed/

[7]	Barroso, L., Dean, J, and Hoelzle, U. Web Search for a Planet: The Google Cluster Architecture, IEEE Micro 23(2), 2003.

[8]	Botelho, B. “Gartner predicts data center power and cooling crisis”, June 2007, http://searchdatacenter.techtarget.com/news/article/ 0,289142,sid80_gci1260874,00.html Accessed March 13th, 2009.

[9]	Biswas, A., Mohan, S., Tripathy, A., Panigrahy, J., Mahapatra, R. Semantic Key for Meaning Based Searching, The 3rd IEEE International Conference on Semantic Computing (IEEE ICSC), Berkeley, CA, USA, Sept 2009.

[10]	Saffran, E. The Organization of Semantic Memory: In Support of a Distributed Model. Brain and Language, 71(1), 2000, 204–212.

[11]	Culicover, P. W. and Jackendoff, R. Simpler Syntax. Oxford: Oxford University Press, 2005.

[12]	Kuperberg, G. Neural mechanisms of language comprehension: Challenges to syntax. Brain Research, 2007, 1146:23–49.

[13]	Bickerton, D. Language & Species, The University of Chicago Press, Chicago & London, 1990.

[14]	Jackendoff, R. Compounding in the Parallel Architecture and Conceptual Semantics. In The Oxford Handbook of Compounding, R. Lieber and P. Stekauer, eds., Oxford: Oxford University Press, 2009.

[15]	Wikipedia, Information retrieval, 2009, http://en.wikipedia.org/w/index.php?title=Information_retrieval&oldid=274343961

[16]	Salton, G., and Buckley, C. . Term-weighting approaches in automatic text retrieval. In Information Processing & Management, vol 24 no. 5, 1988, 513–523.

[17]	Rosario, B. “Latent Semantic Indexing: An overview”, School of Information Management & Systems, U.C. Berkeley, Spring, 2000, http://people.ischool.berkeley.edu/~rosario/projects/LSI.pdf

[18]	Wikipedia,Latent Dirichlet allocation, Dirchelet, 2009, http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation

[19]	Coursey, K., R. Mihalcea, and W. Moen, Using Encyclopedic Knowledge for Automatic Topic Identification, Thirteenth Conference on Computational Natural Language Learning (CoNLL), Boulder, Colorado, June 2009, pp. 210–218.

[20]	Widdows, D., A Mathematical Model for Context and Word-Meaning. Modeling and Using Context: Lecture Notes In Computer Science, 2003, pp. 369-382.

[21]	Mitchell, J. and Lapata, M. Vector-based models of semantic composition. In Proc of ACL-08, 2008, pp. 236–244.

[22]	Augello, A., G. Vassallo, S. Gaglio, and G. Pilato, Sentence induced transformations in "conceptual" spaces, IEEE Int. Conf. on Semantic Computing, 2008, pp. 34-41.

[23]	Murphy, G.L. and D.L. Medin. The role of theories in conceptual coherence. In Psychological Review, 92, 1985, pp. 289–316.

[24]	Andersen, C. A Computational Model of Complex Concept Composition. Master’s thesis, Dept CS, Univ Texas at Austin, 1996.

[25]	Ogata H., W. Fujibuchi, S. Goto, and M. Kanehisa. A heuristic graph comparison algorithm and its application to detect functionally related enzyme clusters, Nucleic Acids Research, vol. 28, No. 20, 2000, 4021-4028.

[26]	Rajapske, R., and M. Denham. Text retrieval with more realistic concept matching and reinforcement learning, In Inform Process Manag, vol.42, 2006, 1260-1275.

[27]	Qi, J., L. Wei, and Y. Bai. Composition of Concept Lattices, Machine Learning and Cybernetics, 2008 International Conference on, 2008.

[28]	Wikipedia, Resource Description Framework, 2009, http://en.wikipedia.org/wiki/Resource_Description_Framework

[29]	Wikipedia, RDFa, 2009, http://en.wikipedia.org/wiki/RDFa

[30]	W3C, HTML 5: A vocabulary and associated APIs for HTML and XHTML, Editor's Draft, 9 August 2009, http://dev.w3.org/html5/spec/Overview.html

[31]	Wikipedia, Microformat, 2009, http://en.wikipedia.org/wiki/Microformat

[32]	Wikipedia, HTML5, 2009, http://en.wikipedia.org/wiki/HTML_5

[33]	Priss, U. Lattice-based Information Retrieval, Knowledge Organization, Vol. 27, 3, 2000, p. 132-142. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.60.2145

[34]	Wikipedia, OSI model, 2009, http://en.wikipedia.org/wiki/OSI_model

[35]	Biswas, A., S. Mohan, and R. Mahapatra. Search Co-ordination by Semantic Routed Network, IEEE CommSoc's 18th International Conference on Computer Communication and Network (ICCCN), San Francisco, CA, Aug 2009.

[36]	Mohan, S., A. Biswas, A. Tripathy and R. Mahapatra, “Meaning Comparison in Hardware”, Technical Report, Dept. of CS, Texas A&M University, 2009.

[37]	Broder, A., Mitzenmacher, M. Network applications of Bloom filters: A survey. In Internet Mathematics, Vol.1, no.4, 2002, pp.485-509.

[38]	Biswas, A., et al. Representation and Comparison of Complex Concepts for Semantic Routed Network, ICDCN’09, Hyderabad, 2009.

[39]	Gene Ontology, http://www.geneontology.org/

[40]	Disease Ontology, http://diseaseontology.sourceforge.net/

[41]	Irgens, F. Tensors. Continuum Mechanics, Springer, 2008.

[42]	Abdi, H. Kendall Rank Correlation. In Encyclopedia of Measurement and Statistics, Sage, CA, 2007, pp. 530-532.

[43]	Jaccard, P. Étude comparative de la distribution florale dans une portion des Alpes et des Jura, Bulletin del la Société Vaudoise des Sciences Naturelles 37, 1901, pp. 547-579.

[44]	S. Mohan, A. Biswas, A. Tripathy and R. Mahapatra, A Parallel Architecture for Meaning Comparison, to appear in 24th IEEE International Parallel and Distributed Processing Symposium 2010.

Editor's note, May 13, 2010
Two corrections have been made to this article after its initial publication:

In the third paragraph of Section 3.1.2.2, reference [44] was incorrect; it has been corrected and is now [17].
In table 4, the numbers in the rows were interchanged and have now been corrected.