A four-layer model for image-based editions

Perhaps the most iconic sort of project in the literary digital humanities is the electronic edition. Unfortunately, these projects, which seek to preserve and provide access to important and endangered cultural artifacts, are themselves endangered. Centuries of experimentation with the production and preservation of paper have generated physical artifacts that, although fragile, can be placed in specially controlled environments and more or less ignored until a researcher wants to see them. By contrast, only the most rudimentary procedures exist for preserving digital artifacts, and most require regular care by specialists who must convert, transfer, and update files into formats readable by new technologies, which are usually not backwards compatible. A new model is required. The multi-layered model pictured here will, we believe, be attractive to the community of digital librarians and scholars, because it clearly defines the responsibilities of each party and requires each to do only what it does best.

Level 1: Digitization of source materials

[Figure: Four-layered model for image-based editions]

The creation of an electronic edition often begins with the transfer of analog objects to binary, computer-readable files. Over the last ten years, these content files (particularly image files) have proven to be among the most stable in digital collections. While interface code must regularly be updated to conform to the requirements of new operating systems and browser specifications, text and image file formats remain relatively unchanged, and even 20-year-old GIFs can be viewed on most modern computers. The problem, then, lies not so much in the maintenance of these files as in their curation and distribution. For various reasons (mostly bureaucratic and pecuniary rather than technical), libraries have often attempted to limit access to digital content to paths that pass through proprietary interfaces. This protectionist approach to content prevents scholars from using the material in unexpected (though perhaps welcome) ways, and it also endangers the continued availability of the content as the software that controls the proprietary gateways becomes obsolete. Moreover, these limitations are rarely able to prevent those with technical expertise (sometimes only the ability to read JavaScript code) from accessing the content in any case, and so nothing is gained, and (potentially) everything is lost, by this approach.

More recently, projects like the Homer Multitext Project, the Archimedes Palimpsest, and the Shakespeare Quartos Archive have taken a more liberal approach to the distribution of their content. While each provides an interface specially designed for the needs of its audience, the content providers have also made their images available under a Creative Commons license at stable and open URIs. Granting agencies could require that content providers commit to maintaining their assets at stable URIs for a specified period of time (perhaps 10-15 years). At the end of this period, the content provider would have the opportunity either to renew the agreement or to move the images to a different location. The formats used should be as open and as commonly used as possible. Ideally, the library should also provide several formats for each item in the collection. A library might, for instance, choose to provide a full-size 300 MB uncompressed TIFF image, a slightly smaller JPEG2000 image served via a Djatoka installation, or a set of tiles for use by “deep zooming” image viewers such as OpenLayers.
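As a rough illustration of what a level 1 commitment might look like in practice, the sketch below records several formats for one item along with the date until which the provider has promised to keep the URIs stable. The class names, the example.org URIs, the license string, and the dates are all hypothetical; no existing project is known to use this structure.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Derivative:
    """One openly available rendition of a digitized item."""
    media_type: str  # e.g. "image/tiff" or "image/jp2"
    uri: str         # stable, open URI (hypothetical in this sketch)


@dataclass
class ImageAsset:
    """A level 1 item: an identifier, a license, and its renditions."""
    identifier: str
    license: str
    stable_until: date  # end of the provider's stability commitment
    derivatives: List[Derivative] = field(default_factory=list)

    def uri_for(self, media_type: str) -> Optional[str]:
        """Return the URI of the first rendition in the given format."""
        for derivative in self.derivatives:
            if derivative.media_type == media_type:
                return derivative.uri
        return None

    def commitment_active(self, today: date) -> bool:
        """True while the provider is still bound to keep the URIs stable."""
        return today <= self.stable_until


# Hypothetical example item with two formats and a 15-year commitment.
asset = ImageAsset(
    identifier="Hamlet_Q1_bodley_co1_001",
    license="Creative Commons",
    stable_until=date(2025, 1, 1),
    derivatives=[
        Derivative("image/tiff", "https://images.example.org/Hamlet_Q1_bodley_co1_001.tif"),
        Derivative("image/jp2", "https://images.example.org/Hamlet_Q1_bodley_co1_001.jp2"),
    ],
)
```

Nothing here depends on an interface: an application at a higher level only needs the URIs, and it can check the commitment date to decide when to look for a renewed or relocated copy.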

Level 2: Metadata

The files and directories in level 1 should be as descriptive as possible and named using a regular and easily identifiable progression (e.g. “Hamlet_Q1_bodley_co1_001.tif”); however, all metadata external to the file itself should be considered part of level 2. Following Greene and Meissner’s now-famous principle of “More Product, Less Process”, we propose that all but the most basic work of identifying content should be located in the second level of the model, and possibly performed by institutions or individuals not associated with the content provider at level 1. The equipment for digitizing most analog material is now widely available, and many libraries have developed relatively inexpensive and efficient procedures for the work, but in many cases there is a considerable lag between the moment the digital surrogates are generated and the moment they are made publicly available. Many content providers feel an obligation to ensure that their assets are properly cataloged and labeled before making them available to their users. While the impulse towards quality assurance and thorough work is laudable, a perfectionist policy that delays publication of preliminary work is better suited to immutable print media than to an extensible digital archive. In our model, content providers need not wait to provide content until it has been processed and cataloged.
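A regular naming progression of the kind suggested above has the advantage that it can be read by a machine as well as by a person. The following is a minimal sketch, assuming that the example name breaks down into work, edition, holding library, copy, and page sequence. That breakdown is our guess for the purposes of illustration, and the field names are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout: work_edition_repository_copy_sequence.extension
# (an illustrative reading of "Hamlet_Q1_bodley_co1_001.tif").
NAME_PATTERN = re.compile(
    r"^(?P<work>[A-Za-z]+)"
    r"_(?P<edition>[A-Za-z0-9]+)"
    r"_(?P<repository>[A-Za-z]+)"
    r"_(?P<copy_id>[A-Za-z0-9]+)"
    r"_(?P<sequence>\d{3})"
    r"\.(?P<extension>[a-z0-9]+)$"
)


class ImageName(NamedTuple):
    work: str
    edition: str
    repository: str
    copy_id: str
    sequence: int
    extension: str


def parse_image_name(filename: str) -> Optional[ImageName]:
    """Split a level 1 filename into its parts, or return None if the
    name does not follow the assumed convention."""
    match = NAME_PATTERN.match(filename)
    if match is None:
        return None
    fields = match.groupdict()
    return ImageName(
        work=fields["work"],
        edition=fields["edition"],
        repository=fields["repository"],
        copy_id=fields["copy_id"],
        sequence=int(fields["sequence"]),
        extension=fields["extension"],
    )
```

A check like this gives a content provider the "most basic work of identification" almost for free: files that fail to parse can be flagged before publication, and everything richer is left to level 2.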

Note also that debates about the proper choice or use of metadata may be contained at this level without delaying at least basic access to the content. By entirely separating metadata and content, we permit multiple transcriptions and metadata sets (perhaps with conflicting interpretations) to point to the same item’s URI. Rather than providing, for example, a single transcription of an image (inevitably the work of the original project team, reflecting a set of scholarly presuppositions and biases), this model allows those with objections to a particular transcription to generate another, competing one. Each metadata set is equally privileged by the technology, allowing users, rather than content providers, to decide which metadata set is most trustworthy or usable.
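The separation described above can be sketched in a few lines. In this minimal example, which assumes nothing about any particular metadata standard, each transcription simply carries the URI of the image it describes, and any number of transcriptions may target the same URI. The class names, authors, readings, and the example.org URI are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Transcription:
    """A level 2 record that points at a level 1 image by its URI."""
    target_uri: str  # stable URI of the image being transcribed
    author: str      # the person or project responsible for this reading
    text: str


class MetadataLayer:
    """Stores competing records per image without ranking them."""

    def __init__(self) -> None:
        self._by_target: Dict[str, List[Transcription]] = defaultdict(list)

    def add(self, record: Transcription) -> None:
        """Register a record alongside any others for the same image."""
        self._by_target[record.target_uri].append(record)

    def for_target(self, uri: str) -> List[Transcription]:
        """Return every record for an image, in the order it was added."""
        return list(self._by_target.get(uri, []))


# Two hypothetical teams transcribe the same page differently.
PAGE = "https://images.example.org/Hamlet_Q1_bodley_co1_001.tif"
layer = MetadataLayer()
layer.add(Transcription(PAGE, "original project team", "reading A"))
layer.add(Transcription(PAGE, "independent scholar", "reading B"))

for record in layer.for_target(PAGE):
    print(record.author, "->", record.text)
```

The store keeps both readings and expresses no preference between them, so the choice of which to trust is left to the user.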

In my next blog entry I will discuss the next (and final) two layers of this model: interfaces and user-generated data.
