Layers 3 and 4

In my last blog entry I detailed the first two layers of a four-layer model for electronic editions and archives. The final two layers are described below:

Level 3: Interface layer

While stacks of multimedia files and transcripts in open repositories would, in some ways, improve the current state of digital libraries, interfaces are required if users are to do anything but access content one file at a time.  Of course, interfaces can be very expensive to develop and tend to become obsolete quickly. Unfortunately, funding for interface development rarely lasts longer than a year or two, so the cost of maintaining a large code base usually falls to the hosting institution, which seldom has the resources to do so adequately.  A new system and standard for interface development are required if interfaces are to be built and maintained sustainably.

Code modularization and reusability have long been ideals in software development, but they have been realized only in limited ways in the digital humanities.  Several large infrastructure projects, most notably SEASR, seek to provide a sustainable model for interoperable digital humanities tools, but have yet to achieve wide-scale adoption.  Our model will follow the example of SEASR, but because our scope is limited to web-based editions and archives, we can impose code constraints that more broadly conceived projects could not (and should not).

We propose a code framework for web-based editions, first implemented in JavaScript using the popular jQuery library, but adaptable to other languages when the prevailing winds of web development change.  An instance of this framework is composed of a manifest file (probably in XML or JSON format) that identifies the locations of the relevant content and any associated metadata, and a core file (similar to, but considerably leaner than, the jQuery.js file at the heart of that library) with a system of “hooks” onto which developers might hang widgets they develop for their own editions.  A widget, in this context, is a program with limited functionality that provides well-defined responses to specific inputs.  For example, one widget might accept as input a set of manuscript images and return a visualization of data about the handwriting present in the document.  Another might simply adapt a deep-zooming application, such as OpenLayers, for viewing high-resolution images and linking them to a textual transcript.  Each widget should depend only on the core file and, if applicable, the content and other input data; no widget should directly depend on any other.  If data must be passed from one widget to the next, the first widget should communicate with the core file, which can then call an instance of the second.
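To make this a little more concrete, here is a minimal sketch of what we have in mind. None of this code exists yet: the file names (manifest.json, core.js), the tileCore object, and its init/register/send functions are placeholders of our own invention, and the registration pattern shown is only one of several we are considering.

manifest.json (hypothetical):

    {
      "title": "Example Edition",
      "images": ["images/page-001.jpg", "images/page-002.jpg"],
      "transcripts": ["tei/page-001.xml", "tei/page-002.xml"],
      "metadata": "metadata/edition.json"
    }

core.js (hypothetical sketch, assumes jQuery):

    var tileCore = (function ($) {
      var widgets = {};     // registered widgets, keyed by name
      var manifest = null;  // the loaded edition manifest

      return {
        // load the manifest, then hand it to every registered widget
        init: function (manifestUrl) {
          $.getJSON(manifestUrl, function (data) {
            manifest = data;
            $.each(widgets, function (name, widget) {
              widget.init(manifest);
            });
          });
        },

        // widgets register themselves with the core; they never call one another directly
        register: function (name, widget) {
          widgets[name] = widget;
        },

        // data passed from one widget to another is routed through the core
        send: function (targetName, data) {
          if (widgets[targetName] && widgets[targetName].receive) {
            widgets[targetName].receive(data);
          }
        }
      };
    })(jQuery);

    // an example widget: it depends only on the core and its own input
    tileCore.register("imageViewer", {
      init: function (manifest) {
        // e.g. hand manifest.images to a deep-zoom viewer such as OpenLayers
      },
      receive: function (data) {
        // e.g. highlight an image region that another widget selected via tileCore.send()
      }
    });

Routing everything through the core in this way is what keeps widgets independent of one another, so that an edition can drop or swap a widget without breaking the rest.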

It should be noted that we are, in fact, proposing to build something like a content management system at a time when the market for such systems is very crowded.  Nonetheless, experience with the major systems (Omeka, Drupal, Joomla, etc.) has convinced us that while a few provide some of the functionality we require, none are suited for managing multimedia scholarly editions.  Just as Omeka clearly serves a different purpose and audience than Drupal, so will our system meet the similar yet nonetheless distinct needs of critical editors.

Level 4: User-generated data layer

Many recent web-based editions have made use of “web 2.0” technologies that allow users to generate data connected to the content.  In many ways, this is the most volatile data in current digital humanities scholarship, often stored in hurriedly constructed databases on servers where scale and long-term data storage have been considered in only the most cursory fashion.  Further, the open nature of these sites means that it is often difficult to separate data generated by inexperienced scholars completing a course assignment from that of experts whose contributions represent real advances in scholarship. Our framework proposes the development of repositories of user-generated content, stored in a standard format, which will be maintained and archived.  Of course, storing the data of every user who ever used any of the collections in the framework is impossible. We therefore propose that projects launch “sandbox” databases, from which the best user-generated content may be selected for inclusion and “publication” in larger repositories.  In some cases, these repositories may also store scholarly monographs that include content from a set of archives. Subscription fees may be charged for access to these collections to ensure their sustainability.
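For illustration only, a single piece of user-generated content in such a repository might be stored as something like the record below. The field names and the sandbox/published status flag are hypothetical, meant only to suggest what a “standard format” carrying provenance and curation information could look like:

    // a hypothetical user-generated annotation, sketched as a JavaScript object literal
    var annotation = {
      id: "annotation-0001",
      target: "tei/page-001.xml#line-12",   // the passage or image region being annotated
      body: "The hand changes here; possibly a second scribe.",
      creator: "jsmith",
      created: "2010-03-15T10:22:00Z",
      status: "sandbox"   // changed to "published" if selected for the larger repository
    };

Under this sketch, curation would amount to little more than flipping the status flag (and copying the record) when a project editor promotes an item out of the sandbox.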

Conclusion

It should be noted that much in the above model is already practiced by some of the best electronic editing projects.  However, the best practices have not been articulated in a generalized way.  Although we feel confident our model is a good one, it would be the height of hubris to call it “best practice” without further vetting from the community.  That, dear reader, is where you come in.  The comments are open.


3 Comments

  1. Posted March 16, 2010 at 12:02 pm

    Of course, storing the data of every user who ever used any of the collections in the framework is impossible.

    I was really surprised and puzzled by this statement. Outside of the problems of spam, how much load from user interactions do you really envision? Even the smallest nugget of user-generated data can be useful; for example, recording user click history can connect scholars who are researching similar things.

  2. Doug
    Posted March 23, 2010 at 11:08 am

    Ben,

    I think that sort of storage would be great, and maybe possible, but if you consider how much space server logs can use (after four months online I think ours is already over 8GB), an individual server would quickly be overwhelmed if it tried to record every single interaction for a large set of high-traffic sites. But your point is taken; at the moment relatively few users contribute content to scholarly sites (maybe more a problem of interface than interest, though).

    Doug

  3. Posted March 23, 2010 at 1:00 pm

    The example I gave was pretty extreme, but I chose it because I wanted to point out the dangers in discarding user-created data as useless, irrelevant, or bulky. I’ve been tracking each user mouse-click in a DB table in FromThePage rather than in user logs, and after a year of light use 1) there are fewer than sixty thousand non-spider records, and 2) I really wish I’d also captured the referrer field.

    My real concern is that imposing such a limitation in early-stage guidelines feels a lot like premature optimization. A lot of software projects (my own included) suffer from the temptation to spend undue time and effort on scalability when they don’t even have any users yet. In addition to the opportunity cost associated with this, I also worry that you’re erecting barriers to adoption by encouraging sites to take on yet more work: constructing sandboxes and pruning/selecting user-generated data are not effortless tasks, and may be unnecessary.
