Volume 6 Issue 1
Spring 2010
ISSN 1937-7266

A Query Language for Digital Libraries

Jitao Yang

Laboratoire de Recherche en Informatique
Université Paris-Sud, 91405 Orsay Cedex, France


We introduce a data model for digital libraries and an associated query language for the discovery of objects of interest based on content or description. We also outline possible mappings of the query language to existing standards, namely SPARQL. This paper is a short description of my doctoral work, under way, at the University of Paris South.


Data model, Digital libraries, Query language, SPARQL.

1. Introduction and objectives

We introduce a data model for digital libraries and an associated query language for discovering objects of interest based on content or description. We follow the approach introduced in [5, 4] and we also outline possible mappings of the data model into RDF and of the query language to SPARQL. A longer version of this work can be found in [7]. The basic notion of the model is that of a digital object (or simply object). Intuitively, we think of a digital object as a piece of information in digital form such as a PDF document, a JPEG image, a URI and so on. As such, a digital object can be processed by a computer; for instance it can be stored in memory and displayed on a screen.

More formally, let O stand for a collection of objects. We assume O to be non-empty and countable. We assume that each digital object can be viewed using an appropriate mechanism (e.g., a PDF document can be printed, a JPEG image can be displayed on a screen, a URI can be shown as a string of characters).

Given an object o O, we use view(o) to denote the result of viewing o. We make the assumption that each object is viewable and that an object is always viewed in one and the same way. Formally, this means that view is a total function having the set O as domain. The range of the view function, as well as the technical details of how it is implemented, are outside the scope of our model.

In what follows we shall illustrate how digital objects are used to model digital libraries, highlighting the roles that they play in the various components that make up a DL.

The rest of the paper is structured as follows: Section 2 introduces digital libraries and Sections 3 and 4 describe the main components of a digital library, namely the repository and the knowledge base. Section 5 contains the definition of the query language, while section 6 gives examples of mapping the query language to SPARQL. Section 7 concludes.

2. Digital libraries

Roughly speaking, what we call a digital library (DL for short) is a set of digital objects and services that allow a community of users to access and re-use the digital objects. In particular, each object in the DL is associated with a view, a content, a number of descriptions and a number of versions. The users of the library should be able to perform the following tasks:

  • describe an object of interest according to the vocabulary of the community;
  • discover objects of interest based on content and/or description;
  • view the content of a discovered object;
  • identify an object of interest, in the sense of assigning to it an identity;
  • re-use objects in a different context (e.g. by adding them to the content of existing objects or by creating new complex objects);
  • create versions.

We note that the concept of DL constitutes a major departure from that of a traditional information system. Indeed, a traditional information system contains representations of real-world entities (such as employees and departments), while a DL contains the entities themselves.

We view a digital library (DL) as consisting of two parts: the DL repository, and the DL knowledge base. Informally, the DL repository holds the content of the DL (as a set of structured objects and their versions) and the DL knowledge base holds the descriptions of objects and the knowledge enriching them. In order to define these two parts formally, we need the notions of schema and database that we define now.

2.1 Schemas

Figure 1 shows the meta-schema graph (abbreviated as MSG), a labeled, directed graph giving the structure of a schema. The nodes in the MSG indicate schema entities, while the arcs indicate schema relations, as follows: an arc labeled l from node m to node n indicates a relation named l, having m as domain and n as range. Specifically, the entities of our schemas are classes and properties, and there are two main kinds of relations:

  • the relations capturing taxonomies of classes (isac) and of properties (isap), the former being defined over classes, the latter over properties;
  • the relations expressing the domain (dom) and the range (ran) of properties. Both these relations have properties as domains and classes as range.

Formally, a schema is a function σ that associates:

  • each node n of MSG with a finite subset σ(n) of digital objects, σ(n) ⊆ O,
  • and each arrow f : X → Y of MSG with a set-valued function σ(f) :

Set-valued functions are a convenient way of expressing binary relations; thus, σ(f) can also be seen as a binary relation having σ(X) as domain and σ(Y) as range.

Notice that we do not require σ(class) to be disjoint from σ(property). Indeed, it does not appear sensible that an object is both a class and a property; however, if an application needs such an object, it may have it; no ambiguity will arise, as context will always determine whether the property or the class is meant.

Fig.1: The metaschema graph

Since in a DL there will be several schemas, we will use a function sform to associate a digital object to a schema. Formally, for each object s def (sform), sform(s) is the schema identified by s. For notational convenience, we use the schema identifier as a function symbol; thus s(class) denotes the class set of the schema identified by s, and the same holds for all the other elements of a schema. Moreover:

  • each digital object c s(class) will be called a class in s; if c s (isac)(c′), we will say that c′ is a sub-class of c in s. Analogously, each digital object p s(property) will be called a property in s; if p s(isap)(p′), we will say that p′is a sub-property of p in s;
  • if c s(dom)(p) we say that c is a domain of p in s; analogously, if c s(ran)(p) we say that c is a range of p in s.

When no ambiguity may arise, we will omit the specification of the schema (e.g. using class in place of s(class)).

The classes and the properties of a schema may be shared with one or more other schemata, already in use. In fact, the elements of popular schemata, such as RDFS or the Dublin Core Metadata Set, are typically re-used in many schema definitions.

2.2 Databases

Given a schema s, a database over s is a function δs that associates:

  • each class c in s with a finite subset δs(c) of digital objects, δs(c) ⊆ O; δs(c) is called the extension of c in δs and any member of it is called an instance of c in δs;
  • each property p in s with a set-valued function δs(p) over the digital objects, δs(p) : O →P(O).

In order to identify databases, we will adopt the same convention as for schemas, using a function dform to associate a digital object to a database. Formally, for each object d def (dform), dform(d) is the database identified by d. For notational convenience, we use the schema identifier as a function symbol; thus d(c) and d(p) denote respectively the set of objects assigned to class c and the set-valued function assigned to property p by the database identified by d.

A database d over a schema s is a model of s if it satisfies the following conditions:

  1. if c is a sub-class of c′ in s, then d(c) ⊆ d(c′);
  2. if p is a sub-property of p′ in s, then d(p) ⊆ d(p′);
  3. if c is a domain of property p in s, then def(d(p)) ⊆ d(c);
  4. if c is a range of property p in s, then range(d(p)) ⊆ d(c).

Several well-known concepts fall under this definition of a database. For instance, an authority file can be seen as a database since it gives the instances of a given appellation class. Similarly, a product catalogue can be seen as a database over a schema consisting of a class together with properties describing the characteristics of a product.

As we shall see next, a description is a very special kind of database as well.

2.3 Descriptions

Descriptions support several important activities concerning the DL content, such as the interpretation, the discovery, and the management of content, to mention a few of the most relevant ones.

Descriptions are defined over a very special kind of schema, which we call description schema. A description schema s is a schema whose properties all have the same domain, called the description class. Any database over a description schema is called a description database. In a description database, any instance of the description class is called a description identifier. Finally, a description identifier along with the property values defined on it, form a description.

It is important to note that:

  • we leave open the possibility of structuring classes and properties in a taxonomy within the description schema;
  • although the instances of a description class all share the same schema, each one of them identifies a different description, regardless if its property values.

A description is meant to give the important characteristics of some object. Through the values of those characteristics, the description is connected to a body of knowledge stored in various databases, for instance over the web. Note that a description is a database, although a very small one.

The association between objects and their descriptions is done in a DL repository, as defined below.

Fig. 2: The repository schema

3. The DL Repository

The DL repository is a database over the schema of Figure 2, which we call the repository schema. According to this schema, a DL repository consists of a set of digital objects, over which the following properties are defined:

  • the content function cont : each object in the repository can be associated to a set of objects called its content;
  • the description function desc : each object in the repository can be associated to a set of objects, each identifying a description of the original object;
  • the version function vers : each object in the repository can be associated to a set of objects, each identifying a version of the original object.

For the purposes of this paper, the version function will no longer be considered.

3.1 The content function

We define content over O to be a set-valued function cont on O : cont : O →P(O) such that for each object o def (cont), the image of o under cont, cont(o), is a finite, possibly empty set of objects that we shall call the content of o. We shall call each object o in def (cont) an identifier.

The intention behind this definition of content is to capture the way documents are actually obtained from digital objects. Indeed, starting with a content cont(o), one can obtain several different documents, each document corresponding to a rendering of cont(o) on a specific device. However, in order for the rendering to be possible, additional information is required. For instance, in HTML this rendering-specific information is given by the tags that surround content (pieces of text and references); in turn, these tags are used by browsers for obtaining a visual rendering of the page; such rendering is what we call a document. Other languages, such as PDF, use a different syntax to obtain an analogous effect. For the purposes of this paper, we shall restrict ourselves to content functions as a convenient abstraction of documents, and ignore the technical details for obtaining actual documents from these functions.

A few additional remarks are in order here: 1) In the definition of content, we do not exclude the case in which o cont(o), and in particular when cont(o) = {o}. In so doing, we aim at capturing a very general notion of document, including, e.g., web pages. However, there are cases where we may not want the object to belong to its own content, e.g., for objects representing books. 2) We emphasize that in our model content is a highly dynamic notion. Indeed, as we shall see later, the content function can change as the result of insertions and deletions of objects. This means that an object which is an identifier at one point in time, might no longer be an identifier at some later point due to updates. As a consequence, the notion of identifier is a role that a digital object can play at different points in time. For instance, the famous painting “Guernica” might serve (in digital form) as the identifier of a collection of Picasso’s paintings, while being an element of the same collection; likewise, a “typical” identifier such as a URI can be used as content, for instance as an element in a bookmark. 3) We do not require cont to be injective because synonymy is an important property in DLs; in particular, it allows users to give different names to one and the same content, thus contributing to the personalization of the information stored in the DL.

Inactive objects (i.e. digital objects that are not in the domain nor in the range of the content function) are objects that are not used currently in the library, but nevertheless are part of the DL’s vocabulary and as such they are available to the user community. They may enter the content function either as identifiers or as elements of content at any later point in time (thus becoming active).

An initial object (i.e. an object that is in the domain of the content function but not in its range) can be thought of as the unique (up to synonymy) identifier of a meaningful unit of information, known in the context of DLs under the name of collection. The important property of a collection is that it does not belong to the content of any other object. Even if one or more elements of the collection might be found in other contents, the collection itself (as a whole) does not appear in any other content. A special category of initial objects are those objects o having an empty content, i.e. such that cont(o) = ∅. An object of this kind can be thought of as the identifier of a currently empty collection, and plays an important role in the dynamics of content; indeed, complex objects are created empty and are then populated by inserting objects into the content of their identifiers.

A terminal object (i.e. an object that is in the range of the content function but not in its domain), can be thought of as “pure” content object, in that it is not associated to any object. A terminal object contributes to the DL by its visualization only (i.e. by applying to it the function view).

Finally, an intermediate object (i.e. an active object that is neither initial nor terminal) can be thought of as an intermediate element of a complex structure, such as a sub-collection or a meaningful part of a complex digital object. Notice that our model gives the freedom of using any digital object (e.g. a video stream) as an intermediate object.

In real applications, content is an important dimension of objects, which is used for functionality concerning, among others, the organizational (access restrictions, responsbilities, digital rights) and economical (billing schemas) aspects of a DL. We do not model these aspects here, but are aware of their existence and therefore organize the model in a way that allows to represent the relevant information.

From our own point of view, complex objects can only be understood in terms of their components; this makes the content function a necessary element of the model of a DL containing complex objects which are composed from other, simpler objects; in this sense, the content function is different from any other function that can be defined on an object, for instance from the object’s description.

3.2 The description function

We define description over O to be a set-valued function desc on O :

such that for each object o def (desc), the image of o under desc, desc(o), is a finite, possibly empty set of objects, called description identifiers. We shall call desc(o) the descriptions of o.

4. The DL Knowledge Base

As already mentioned, the DL knowledge base (KB for short) is a database consisting of the object descriptions and the knowledge about the values in the descriptions, or any other information accessible through those values. For instance, the description of an object representing the painting Mona Lisa might be identified by o and consist of the following (property: value) pairs:

creator: Leonardo da Vinci, type: oil painting, creation: event123;

where event123 is an object representing the event that brought Mona Lisa into existence. Now, event123 could be described as an instance of a class Event, with the following property values:

actor: Leonardo da Vinci, time period: 1503 - 1506, place: 7003163;

where 7003163 identifies Florence in the Getty Thesaurus of Geographic Names, which is an external source accessible for instance via the web1. Note that the description of event123 might have been added to the KB by a curator in order to enrich the information about Mona Lisa. Moreover, if “oil painting” is a class in the KB schema and a sub-class of “painting”, this will further enrich the description of Mona Lisa. This way of enriching the original description of Mona Lisa can continue but might lead to unexpected places (c.f. surfing); for example, starting with Mona Lisa one might land on the description of Florence’s main football team, Fiorentina.

The KB schema can be thought of as the union of the schemas used for enriching the KB (like in the previous example). These schemas can range from very simple, such as classification schemas, subject heading systems, or taxonomies, to very articulated, such as for instance the CIDOC CRM [1].

The extraction of information from the DL is done using a query language that we define in the next section.

5. The query language

The basic idea underlying the query language is that a query stated against a DL returns as a result a schema and an associated database. Before we introduce the language formally, let us see a few examples of queries.

The basic building blocks of our language are atomic queries of the form:(s p o) , where each of the s, p and o can be a variable or a digital object. Here are some concrete examples and their intuitive explanations.

The atomic query (Bob friend x) is interpreted as “find all friends of Bob”, or more formally, “find the image of Bob under friend”.

Similarly, (x friend y) is interpreted as “find all pairs (x,y) such that y is a friend of x”, or more formally, “retrieve the whole function friend”.

Finally, (Bob x Alice) is interpreted as “find all properties relating Bob to Alice”, or more formally, “find all properties x such that Alice is in the image of Bob under x”.

Formally, a query is a pair (s,δ) where:

  • s is a new schema, which may include classes and properties which are elements of some other schema;
  • δ is a function associating every class in s with a class expression and every property in s with a property expression.

A class expression has the form: , where:

  • n ≥ 0;
  • ◊ is either the existential (∃) or the universal (∀) quantifier, and
  • C(y,x1,…,xn) is a combination of atomic expressions via the classical Boolean connectives ∧, ∨, ¬, in which the variables y,x1,…,xn occur; an atomic expression is a triple of the form: (s p o), where each of s, p and o can be a variable amongst the y,x1,…,xn or a digital object.

Analogously, a property expression has the form: , where n and ◊ are as above, and P(y1,y2,x1,…,xn) is a combination of atomic expressions via the Boolean connectives, in which the variables 1,y2,x1,…,xn occur.

Informally, given a database d, the domain of d is the set D ⊆ O of all digital objects occurring in it. A class expression denotes the set of objects o D which make the expression ◊x1…◊xnC(o,x1,…,xn) true in d, while a property expression denotes a binary relation (or set-valued function), that is a set of pairs of objects (o1,o2) which make ◊x1…◊xn P(o1,o2,a1,…,an) true in d. The notion of truth is quite straightforward:

  • a triple (s p o) is true in a database d, d(s p o), if (s,o) d(p).
  • truth with respect to the Boolean connectives and the quantifiers is defined as in the standard semantics of first-order logic.

Formally, the semantics of a class expression κ = ◊x1…◊xn C(y,x1,…,xn) in a database d, ε(κ,d), is the set of objects given by:

Analogously, the semantics of a property expression in a database d, is the binary relation given by:

Finally, given a query q = (s,δ) and a database d, the answer of q on d is the minimal model of the database ε ∘ δ, where ε is the semantics in the database d of each class and property expression occurring in δ. Note that ε∘δ is a database over s. For more details see [7].

We will now outline the mapping of the model and the query language into RDF [3, 2] and SPARQL [6], respectively, using examples.

6. Mapping to SPARQL

The general idea of mapping our DL Query language to SPARQL is 1) mapping a database d and its schema to an RDF graph r; and 2) mapping a query q to a SPARQL query s such that the answer of q on d is the same as the evaluation of s on r; the process is shown in Figure 3.

Fig3. Mapping the DL Query language to SPARQL

Based on the above process, we give below the mappings to SPARQL of the query examples mentioned in Section 5. Suppose we have the following data stored in DL:

Bob friend {Alice,Tom}, Bob fullname “Bob Smith”;

Then we have:
1)“Find all friends of Bob” by our query language becomes: (Bob friend ?x);
2)The query results are: {“Alice”, “Tom”};
3) Mapping Bob’s data stored in DL to RDF/XML is as follows:
<rdf:RDF xmlns:rdf=“http://www.w3.org/1999/02/22-rdf-syntax-ns#”
<rdf:Description rdf:about=”http://example.org/Bob”>
<dc:fullname>Bob Smith</dc:fullname>

4) Querying the RDF/XML in 3) by SPARQL is as follows:
PREFIX dc:<http://purl.org/dc/elements/1.1/>
PREFIX ex:<http://example.org/>
SELECT ?friend
WHERE { ex:Bob dc:friend ?friend }

5) The query results are the same like using our query: {“Alice”, “Tom”};
6) Then, we can map our DL Query language in 1) to the SPARQL in 4).
In the same way, we can map the query (?x friend ?y) to SPARQL as follows:
PREFIX dc:<http://purl.org/dc/elements/1.1/>
SELECT ?name, ?friend
WHERE { ?name dc:friend ?friend }.

From the above examples, we can see that it is easy to map our query language composed of ∃ and ∧ to SPARQL; but for mapping a query composed of ∨, ¬ and ∀, further investigation is needed.

7. Conclusions

We have seen a formal model of digital libraries based on two main notions, content and descriptions, and an associated query language. The main contribution is independence from specific languages but at the same compliance with relevant standards. As an evidence of this claim, we have outlined a way of mapping our query language to SPARQL. Further work is needed to complete this mapping and to derive algorithms for query evaluation.

8. References

[1] Martin Doerr. The CIDOC conceptual reference module: An ontological approach to semantic interoperability of metadata. AI Magazine, 24(3):75–92, 2003.
[2] Patrick Hayes. RDF Semantics. W3C Recommendation, WWW Consortium, February 2004. http://www.w3.org/TR/rdf-mt/.
[3] Frank Manola and Eric Miller. RDF Primer. W3C Recommendation, WWW Consortium, February 2004. http://www.w3.org/TR/rdf-primer/.
[4] Carlo Meghini and Nicolas Spyratos. Modelling the web. Invited talk at the Fourth Franco-Japanese Workshop ISIP: Information Search, Integration and Personalization, Paris, France, 6-8 October 2008. http://tao.lri.fr/tiki-index.php?page=Third+Franco-Japanese+Workshop.
[5] Carlo Meghini and Nicolas Spyratos. Rationale and some principles for a vldl data model.n Invited talk at the Very Large Digital Libraries (VLDL 2008) Workshop, held in conjunction with ECDL2008, the 12th European Conference on Research and Advanced Technology for Digital Libraries, Aarhus (Denmark), September 2008.
[6] Eric Prud’hommeaux and Andy Seaborne. SPARQL Query Language for RDF. W3C Recommendation, WWW Consortium, January 2008. http://www.w3.org/TR/rdf-sparql-query/
[7] Nicolas Spyratos, Carlo Meghini, and Jitao Yang. A data model for digital libraries. Research report, LRI, 2009. To appear.