Digital libraries are organized, focused collections of information. They are focused on a particular topic or theme, and good digital libraries will articulate the principles governing what is included. They are organized to make information accessible in particular, well-defined ways, and good ones will include a description of how the information is organized (Lesk, 1997).
The Greenstone digital library software is intended to help users construct simple collections of information very quickly. Indeed, only a few minutes of the user's time are needed to set up a collection based on a standard design and initiate the building process. Collections may be large: some comprise Gbytes of text and millions of documents. Furthermore, even larger volumes of information may be associated with a collection, typically audio, image, and video, with textual metadata. Once initiated, the mechanical process of building the collection may take from a few moments for a tiny collection to several hours for a multi-Gbyte one, or perhaps even a day if it involves many different full-text indexes.
Naturally, collections that have idiosyncratic requirements (as most large collections do) take longer to set up, and the design and debugging process can take several days, or weeks if iterative usability testing is involved. The Greenstone designers wholeheartedly endorse Alan Kay's maxim that "simple things should be simple, complex things should be possible" (Davidson, 1993).
The role of metadata
A digital library's organization is reflected in the interface that it presents to users. Much of the organization rests on metadata: structured information about the information resources the library contains. Metadata is the stuff in the traditional card catalogs of bricks-and-mortar libraries (whether computerized or not). It is "structured" in that it can be meaningfully manipulated without understanding its content. For example, given a collection of source documents, bibliographic information about each document would be metadata for the collection. The structure is made plain in terms of which pieces of text represent author names, which represent titles, and so on. The notion of metadata is not absolute but relative: it is only really meaningful in a context that makes clear what the data itself is (Lagoze and Payette, 2000). For example, given a collection of bibliographic information, metadata might comprise information about each bibliographic item, such as who compiled it and when.
The use of metadata as the raw material of organization is really the defining characteristic of digital libraries: it is what distinguishes them from other collections of online information. It is metadata that allows new material to be sited within a library and hooked into existing structures in such a way that it immediately enjoys first-class status as a member of the library. Adding new material to ordinary online information collections requires manually linking it in with existing material, but the only manual work needed when adding new items to a digital library is to determine metadata values for each one. If a standard metadata scheme is used, even that may be unnecessary: the information may already be available from another source.
Customization in Greenstone
With Greenstone, users can design their collections individually, typically by taking an existing collection that closely matches their needs and adapting its structure as necessary. The resulting design is recorded in a short file called the "collection configuration file." It specifies such things as the collection's title, its creator's email address, a description of the purpose and principles governing what is included, what input file types should be included in the collection, where the metadata comes from and what form it takes, and how the collection will look to the user. Most of the customization that non-programming users perform in Greenstone takes place in this file. What is possible depends crucially on the availability of metadata, and the structures defined are only produced if appropriate metadata is provided.
Searching the full text of all documents in the collection is a basic facility, included by default in all collections. Collection designers can determine whether searching should be on a paragraph, section, or whole-document level (this affects the scope of matches to a given query). They can also ask for full-text indexes to be built on metadata items (e.g., titles, authors). They can split the collection into sub-collections and allow each to be searched individually, or use an automatic language identification facility to restrict searches by language, or specify that certain other collections are to be searched whenever this one is.
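The effect of the indexing level on the scope of matches can be illustrated with a minimal Python sketch. It is not Greenstone's indexing code: the nested-list document layout, the unit identifiers, and the function names `split_units` and `search_all_terms` are all assumptions made for the example.

```python
def split_units(documents, level):
    """Yield (unit_id, text) pairs at the requested granularity.

    `documents` maps a document id to a list of sections, and each
    section is a list of paragraph strings (a hypothetical layout).
    """
    for doc_id, sections in documents.items():
        if level == "document":
            yield doc_id, " ".join(" ".join(paragraphs) for paragraphs in sections)
        elif level == "section":
            for s, paragraphs in enumerate(sections, 1):
                yield f"{doc_id}.{s}", " ".join(paragraphs)
        elif level == "paragraph":
            for s, paragraphs in enumerate(sections, 1):
                for p, paragraph in enumerate(paragraphs, 1):
                    yield f"{doc_id}.{s}.{p}", paragraph
        else:
            raise ValueError(f"unknown level: {level}")


def search_all_terms(documents, level, query):
    """Return the ids of units that contain every term of the query."""
    terms = set(query.lower().split())
    return [unit_id
            for unit_id, text in split_units(documents, level)
            if terms <= set(text.lower().split())]


documents = {
    "d1": [["soil erosion control", "planting trees on slopes"],
           ["water harvesting"]],
}
for level in ("document", "section", "paragraph"):
    print(level, search_all_terms(documents, level, "soil trees"))
```

In this toy collection the query "soil trees" matches at the document and section levels, but not at the paragraph level, because the two words occur in different paragraphs of the same section.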
Greenstone includes predefined browsing structures based on certain kinds of metadata. For example, any textual metadata can be presented as an alphabetically sorted list. The list can be tabbed into alphabetic ranges, which are chosen automatically to include a reasonable number of documents in each range. The ranges are presented horizontally at the top of the screen. Date metadata can be presented in a list that allows selection by year (horizontally) and month (vertically). Metadata that has a hierarchical structure, such as library classifications, can be presented as a tree whose nodes open to reveal the data beneath. In this case, the user must provide an auxiliary file giving labels for intermediate nodes of the hierarchy (e.g., subject headings corresponding to each classification number). Underlying these structures is an internal scheme of "horizontal" and "vertical" lists that are combined appropriately by the browsing mechanism.
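As an illustration of how alphabetic ranges might be chosen automatically, here is a small Python sketch. It is a greedy stand-in rather than Greenstone's own classifier: the function names and the `max_per_range` parameter are hypothetical, and titles are assumed to be non-empty strings.

```python
from itertools import groupby


def range_label(first, last):
    """Label a range 'A' for a single letter or 'A-C' for several."""
    return first if first == last else f"{first}-{last}"


def alphabetic_ranges(titles, max_per_range=3):
    """Group titles into consecutive alphabetic ranges.

    Titles sharing an initial letter always stay together; letters are
    merged into one range until the next letter would push the range
    past max_per_range titles.
    """
    ordered = sorted(titles, key=str.casefold)
    by_letter = [(letter, list(group))
                 for letter, group in groupby(ordered, key=lambda t: t[0].upper())]

    ranges = []                      # list of (label, titles) pairs
    current, first, last = [], None, None
    for letter, group in by_letter:
        if current and len(current) + len(group) > max_per_range:
            ranges.append((range_label(first, last), current))
            current, first = [], None
        if first is None:
            first = letter
        last = letter
        current = current + group
    if current:
        ranges.append((range_label(first, last), current))
    return ranges


titles = ["Apples", "Bees", "Cattle", "Dairy", "Eggs", "Farming", "Forestry"]
for label, members in alphabetic_ranges(titles):
    print(label, members)
```

Each label corresponds to one tab in the horizontal list, and the titles in its range form the vertical list shown when that tab is selected.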
Hierarchical phrase browsing
A novel kind of browsing is through an interactive interface to a phrase hierarchy that has been extracted automatically from the full text of a document collection (Gutwin et al., 1999). It is designed to resemble a paper-based subject index or thesaurus. The user enters an initial word into a search box, and a list of phrases containing this word is shown. These phrases are minimal ones: each can be further expanded (by clicking it) into a list of phrases that contain the original one. This allows hierarchical access to the lexical content of a document collection. Ultimately, the user reaches a leaf of the hierarchy, which takes the user straight to the unique document containing that phrase.
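The expansion step can be sketched in Python as follows. This is a deliberately naive illustration based on word n-grams, not the phrase extraction method of Gutwin et al. (1999); the names `build_phrase_index` and `expand`, and the `max_len` limit, are assumptions made for the example.

```python
from collections import defaultdict


def build_phrase_index(docs, max_len=3):
    """Map every word n-gram (1..max_len words) to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        words = text.lower().split()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                index[tuple(words[i:i + n])].add(doc_id)
    return index


def expand(index, phrase):
    """Return the phrases one word longer than `phrase` that contain it.

    The result maps each longer phrase to the sorted ids of the
    documents in which it occurs.
    """
    target = tuple(phrase.lower().split())
    n = len(target)
    longer = {}
    for candidate, doc_ids in index.items():
        extends = candidate[:n] == target or candidate[1:] == target
        if len(candidate) == n + 1 and extends:
            longer[" ".join(candidate)] = sorted(doc_ids)
    return dict(sorted(longer.items()))


docs = {
    "d1": "forest management plan",
    "d2": "community forest management",
    "d3": "forest fire",
}
index = build_phrase_index(docs)
print(expand(index, "forest"))
print(expand(index, "forest management"))
```

Expanding "forest" lists three two-word phrases; expanding "forest management" leads to two longer phrases, each of which occurs in exactly one document and so acts as a leaf.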
Although designed for use with the full document text, hierarchical phrase browsing is also useful on certain kinds of textual metadata, such as titles or key phrases.
In Greenstone, format statements control the presentation to the user of each "screen" that the system generates. Format statements can be used to determine how target documents are displayed (whether they are preceded by a title, for example, or indented). They control the search results page, where they determine (for example) what metadata is presented as a "snippet" that stands for matching documents, whether it should be preceded by an appropriate document icon, whether it should be a hyperlink, and what the target is for that link. In collections that provide different versions of a document (e.g., Word and the HTML that has been extracted from it), icons for both versions are often presented in the search results list so that users can choose which one they see. This is accomplished using a format statement. Format statements also apply to the browsing mechanisms mentioned above. They can be used to control how both "horizontal" and "vertical" nodes are laid out. Thus one could embolden the A-Z tabs in an alphabetically split list, or apply different formatting to the vertical and horizontal lists in a hierarchical structure.
Format statements are basically HTML, but with some additional facilities. For example, metadata values (or even the full document text) can be interpolated into a format statement. There is a conditional mechanism you can use to alter the format depending on the contextual situation. This allows you to change the presentation of any Greenstone screen depending on the metadata values of returned documents. For example, in the search results some collections show a list of enclosing section headings within which a "hit" is nested, and the format may be different if the hit occurs at the top level of a document.
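To make the idea of interpolation plus a conditional concrete, here is a minimal Python sketch of a format-statement-like template. The bracketed `[Name]` placeholders and the `{If}{[Name],then,else}` form are modelled only loosely on Greenstone's format language and should be read as assumptions; the `render` function and the sample template are hypothetical, and the sketch does not handle nested conditionals or commas inside the branches.

```python
import html
import re

# {If}{[Name],text-if-present,text-if-absent} (no nesting, no commas in the first branch)
IF_PATTERN = re.compile(r"\{If\}\{\[([\w.]+)\],([^,{}]*),([^{}]*)\}")
# [Name] is replaced by the value of the metadata element called Name
META_PATTERN = re.compile(r"\[([\w.]+)\]")


def render(template, metadata):
    """Expand conditionals, then interpolate HTML-escaped metadata values."""
    def choose(match):
        name, if_present, if_absent = match.groups()
        return if_present if metadata.get(name) else if_absent

    def interpolate(match):
        # Missing metadata is rendered as an empty string.
        return html.escape(metadata.get(match.group(1), ""))

    text = IF_PATTERN.sub(choose, template)
    return META_PATTERN.sub(interpolate, text)


SEARCH_RESULT = "<b>[Title]</b>{If}{[Creator], by [Creator],}"

print(render(SEARCH_RESULT, {"Title": "Forest Soils", "Creator": "A. Author"}))
print(render(SEARCH_RESULT, {"Title": "Forest Soils"}))
```

The same template yields a different snippet depending on whether the document carries the (hypothetical) `Creator` metadata, which is the kind of metadata-dependent presentation described above.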
The format language is rather arcane (in future versions we plan to use XSLT to provide greater flexibility, but it post-dates Greenstone's original design). However, the web pages that the system presents are not pre-stored but generated on the fly as needed, and the format mechanism operates at run-time. This makes it relatively easy to debug format statements: changes take effect immediately and the result can be viewed instantly.
Greenstone is a multilingual digital library system: currently there are interfaces in Arabic, Chinese, Czech, Dutch, English, French, Galician, German, Hebrew, Indonesian, Italian, Kazakh, Maori, Portuguese, Russian, Spanish, and Turkish. To accommodate these variants, and to allow the language interfaces to be updated when new facilities are added, all web pages are passed through a macro expansion phase before being displayed. This means that a new language can be added by providing a new set of language-specific macros, a task that has been performed many times by people with no expertise in Greenstone.
The macro facility is an extension of HTML that includes the ability to define macros and perform textual substitution. The way the digital library functionality is hooked into the user interface is through "dynamic macros" whose expansions are determined by the system (in terms of other macros). For example, the search widget is generated by a dynamic macro. Thus users can incorporate this widget into web pages of their own design, provided they go through the macro expansion phase. A total of about twenty dynamic macros provides access to Greenstone's full user interface functionality.
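A macro expansion phase of this kind can be sketched in a few lines of Python. The sketch assumes macros are written as `_name_`, represents dynamic macros as callables supplied by the system, and uses hypothetical macro names throughout; it illustrates textual substitution in general rather than Greenstone's actual macro processor.

```python
import re

MACRO = re.compile(r"_([A-Za-z]+)_")


def expand_macros(text, macros, dynamic=None, max_passes=10):
    """Repeatedly substitute _name_ macros until the text stops changing.

    `macros` maps names to replacement text (which may itself contain
    macros); `dynamic` maps names to zero-argument callables whose
    result is computed at expansion time. Unknown macros are left as is.
    """
    dynamic = dynamic or {}

    def replace(match):
        name = match.group(1)
        if name in dynamic:
            return dynamic[name]()
        return macros.get(name, match.group(0))

    for _ in range(max_passes):
        expanded = MACRO.sub(replace, text)
        if expanded == text:
            return expanded
        text = expanded
    raise ValueError("macro expansion did not settle; possible cycle")


# Language-specific macros: only the text macro differs between languages.
english = {"textsearch": "Search",
           "header": "<h1>_collectionname_: _textsearch_</h1>"}
french = {**english, "textsearch": "Rechercher"}

# A dynamic macro whose expansion is supplied by the system at run time.
dynamic = {"collectionname": lambda: "Demo"}

print(expand_macros("_header_", english, dynamic))
print(expand_macros("_header_", french, dynamic))
```

Adding a language in this sketch means supplying another dictionary of text macros; the page templates and the dynamic macros stay unchanged.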
It is the macro language that is employed in format statements, and the metadata-substitution and conditional mechanisms mentioned above are actually embedded into this language.
Customizing Greenstone's "building" phase
Open source software permits the ultimate in customizability: changing the source code. Then, anything is possible (in principle!). Greenstone operates in two phases: collection building, which is performed offline and creates the data structures necessary to support searching and browsing, and the online business of serving the collection to users. The collection building phase is written in Perl, and some customization is achieved by making modifications to this code.
All document and metadata formats are processed by Perl modules called "plug-ins", and all browsing structures are created by modules called "classifiers". Altering existing plug-ins and classifiers, often in very small ways, provides a useful degree of customization. For example, a plug-in can define a new metadata type, which is interpreted at display time by a format statement. This enables communication between the original source document environment and what is served to users. For example, if a document is provided in different forms, these can be shown in the search results list by defining new metadata elements that contain appropriate URLs.
Similar small modifications can be made to classifiers. For example, one collection comprises mainly books but also has a few issues of different magazines. Rather than appearing in a title browser under the individual magazine name, the collection designer wanted all magazines to appear under a separate tab at the end of the A-Z array called "Magazines". When clicked, a list of magazine names appears, and clicking one of these leads to a list of issues of that magazine. This was accomplished by combining the functionality of the A-Z list and hierarchical browsers, and in fact took just a few extra lines of Perl in the appropriate classifier's implementation.
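The resulting browsing structure can be illustrated with a short Python sketch. Greenstone's classifiers are Perl modules, so this is only a model of the idea: the `Type` and `Issue` metadata names and the function `classify_titles` are assumptions invented for the example.

```python
from collections import defaultdict


def classify_titles(documents):
    """Build an A-Z title list with a trailing 'Magazines' tab.

    Books are filed under the initial letter of their title. Documents
    whose (hypothetical) Type metadata is 'magazine' are instead grouped
    by magazine name, with their issues listed beneath each name.
    """
    tabs = defaultdict(list)        # initial letter -> book titles
    magazines = defaultdict(list)   # magazine name -> issue labels
    for doc in documents:
        if doc.get("Type") == "magazine":
            magazines[doc["Title"]].append(doc["Issue"])
        else:
            tabs[doc["Title"][0].upper()].append(doc["Title"])

    structure = [(letter, sorted(tabs[letter])) for letter in sorted(tabs)]
    if magazines:
        nested = {name: sorted(issues) for name, issues in sorted(magazines.items())}
        structure.append(("Magazines", nested))
    return structure


documents = [
    {"Title": "Beekeeping Basics"},
    {"Title": "Aquaculture"},
    {"Title": "Farm Monthly", "Type": "magazine", "Issue": "1998-02"},
    {"Title": "Farm Monthly", "Type": "magazine", "Issue": "1998-01"},
]
for tab, contents in classify_titles(documents):
    print(tab, contents)
```

The letter tabs hold flat lists, while the final tab holds a two-level hierarchy, which mirrors the combination of the A-Z list and the hierarchical browser described above.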
Altering the run-time system
The part of Greenstone that serves collections to users is called the "receptionist," and sometimes one has to resort to changing the receptionist code to achieve the desired level of customization. This rarely involves large changes, but creates software management difficulties in dealing with different parallel versions.
Our system development strategy is to accept the inevitability of occasionally having to build a special-purpose collection-dependent receptionist to achieve some desired features, and to note what is required with a view to incorporating it as an option within the standard Greenstone code.
There is a wide variety of different ways in which a digital library may be customized, and virtually every collection has its own idiosyncratic requirements. Although a basic Greenstone collection of new material with a standard look and feel can be set up in just a few minutes, most users require far more personalization. Of course, as the number of collections grows and the variety of styles increases, it becomes more likely that some existing collection will match new requirements.
Greenstone incorporates customization mechanisms at many levels. One difficulty with such a rich system is producing good, up-to-date documentation. In fact, from a user's point of view the chief bottleneck in customization is documentation, not the facilities that are provided. Consequently, collection builders need constant access to advice and assistance from others in order to continue learning how to tailor the software to meet ever-changing requirements. There is a lively email discussion group for assistance with Greenstone; participants come from over 50 different countries.
Digital libraries have the advantage over other interactive systems that their user interfaces are universally based on metadata. Metadata is the glue that allows new documents to be added and immediately become first-class citizens. It is also the key to user interface customization, and Greenstone incorporates a range of mechanisms at different levels to capitalize on this.
The Greenstone Digital Library software has grown out of the stimulating research environment of the New Zealand Digital Library project, and I gratefully acknowledge the profound influence of all project members.
Davidson, C. (1993) "The man who made computers personal." New Scientist, No. 1978, pp. 32-35; June.
Gutwin, C., Paynter, G.W., Witten, I.H., Nevill-Manning, C. and Frank, E. (1999) "Improving browsing in digital libraries with keyphrase indexes." Decision Support Systems, Vol. 27, No. 1/2, pp. 81-104; November.
Lagoze, C. and Payette, S. (2000) "Metadata: Principles, practices and challenges." In Moving theory into practice: digital imaging for libraries and archives, edited by A.R. Kenney and O.Y. Rieger. Research Libraries Group, Mountain View, CA, pp. 84-100.
Lesk, M. (1997) Practical digital libraries: Books, bytes, and bucks. Morgan Kaufmann, San Francisco.
Witten, I.H. and Bainbridge, D. (2003) How to build a digital library. Morgan Kaufmann, San Francisco, CA.
Ian H. Witten