MITH Digital Stewardship Series – Maryland Institute for Technology in the Humanities (https://mith.umd.edu)

Reckoning with Digital Projects: MITH Makes a Roadmap
https://mith.umd.edu/reckoning-with-digital-projects-mith-makes-a-roadmap/ (Thu, 04 Oct 2018)


In February of 2018, MITH spent dedicated time talking about the sustainability of digital projects with a team from the University of Pittsburgh's Visual Media Workshop (VMW) as part of a focused user testing session for The Socio-Technical Sustainability Roadmap. The research project that produced the Roadmap was led by Alison Langmead, with Project Managers Aisling Quigley (2016-17) and Chelsea Gunn (2017-18). The final goal of that project was to create a digital sustainability roadmap for developers and curators of digital projects to follow. The work was initially based on what the project team discovered during its NEH-funded project, "Sustaining MedArt." In this blog post, which is a late entry in MITH's Digital Stewardship Series from 2016, I'm going to talk a bit about what I discovered while walking one of MITH's projects through the roadmap, how I synthesized our discoveries into a concrete tool MITH can use with the roadmap going forward, and how this has changed some of my conceptions about digital sustainability practices.

The process of walking a future digital project through the roadmap can be completed either in a single eight-hour session or in two four-hour sessions; we chose the latter, with each attending member focusing on a different MITH project they were developing or working on. During the process, you work through three sections, each with different modules pertaining to aspects of a project's future sustainability prospects. I opted to use a project for which we were awaiting funding at the time, Unlocking the Airwaves: Revitalizing an Early Public and Educational Radio Collection. Although significant time and effort went into developing the grant proposal for Airwaves, which included a section on sustainability, the Roadmap process cemented how much more concretely we could have been thinking through these issues, and how better planning for those components from the start would have led to better management of the project. In fact, one finding that Langmead and her team made as they developed and tested the roadmap is that thinking through the project management aspects of a digital project is a necessary first step before the remaining sections of roadmap exercises can be worked through effectively. So as they went along, they added several elements and exercises to Sections A and B that force users to pinpoint the structural elements of their project, such as access points, deliverables, workflows, intellectual goals, data flow, and anticipated digital lifespan. This kind of work is essentially an extension of a project charter, which often includes a lot of these same basic concepts. In fact, Module B1 of the roadmap encourages users to create or reference existing charters, and stresses that using the roadmap in conjunction with a charter enhances the usefulness of both tools.

The lifespan questions in Section A were eye-opening, because although the need to ask them seems obvious – How long do you want your project to last? Why have you chosen this lifespan? – I think we as stewards of digital information feel compelled to predict unrealistically long lifespans, which Langmead and her collaborators define as “BookTime:”

“BookTime” is a term we have coined to denote a project lifespan equivalent to, “As long as a paper-based codex would last in the controlled, professional conditions of a library.” It may often be assumed that this is coterminous with “Forever,” but that belief relies heavily on a number of latent expectations about the nature of libraries, the inherent affordances of paper and glue, and other infrastructural dependencies.

The module asks us to acknowledge that not every digital project can realistically span decades into the future, and that sometimes this honesty is better for both the project and your team. The module also leverages concepts such as 'graceful degradation' and 'Bloom-and-Fade,' both of which, in moments of dark humor, felt similar to planning for a project's hospice care or estate. "It's okay, everything dies, let's just be open in talking about it and how we'll get through it together." Humor aside, it was a useful exercise for me to acknowledge that time, change, and entropy will stand in the way of a project achieving BookTime, and that that IS, in fact, okay.

The other two sections and exercises that I felt were the most useful, and that provided the core structural materials on which to base a sustainability plan, were Sustainability Priorities (Section A4) and Technological Infrastructure (Sections B2 and B3). In the former, we were asked to list out the core structural components of a project "without which your project simply would not be your project," and to list them in order of priority. This could include things such as, but not limited to, authority records, curated access points, facets, geo-spatial data, or digitized materials. We were also asked to define the communities that each component served. In the latter, we were asked to list out every single technological component of the project, from Google Drive, to Trello, to IIIF servers, to the university's digital repository, define the function(s) of each, and assign the project team members responsible for each. Then we were asked to realistically assess how long each technology was guaranteed to be funded, as well as "how the duration of the funding for members of your project team compares with the duration of the funding for technologies they maintain, keeping in mind that funding discrepancies may require special considerations and/or contingency plans to ensure uninterrupted attention." Again, at first glance, much of this may seem very logical and obvious, but actually doing these exercises is illuminating (and sometimes sobering).

After Sections A and B force you to have a reckoning with the deep dark potential (good and bad) of your project, Section C focuses on applying the National Digital Stewardship Alliance (NDSA)'s Levels of Preservation to your identified structural components. The Levels of Preservation are a set of recommendations that cover the digital preservation spectrum in six core areas: Access, Backing up Work, Permissions, Metadata, File Formats, and Data Integrity. For each of these areas, the roadmap defines four 'levels' of commitment and what each of those levels really means. For example, Level 1 for Data Integrity involves designating which project members have credentials for certain accounts and services, and who has read/write/move/delete authorization. Levels 2-3 require the ability to repair data and to create fixity information for stable files, and Level 4 specifies checking that fixity data in response to specific events or activities. After defining your current and anticipated levels in each area, you're asked to define concrete actions your team would need to undertake in order to achieve your desired level. Once again, these exercises encourage expectation management, with comments like "Please note! Reaching Level 4 sustainability practices is not the goal. Your work here is to balance what your project needs with the resources (both in terms of technology and staff) that you have." It also notes that it is "absolutely okay" to decide that your project will choose Level 0 for any one of these areas, consciously choosing not to engage with that area and using the resources you have to focus on what your team wants to prioritize.
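To make the Data Integrity levels a bit more concrete, here is a minimal sketch of what creating and later checking fixity information can look like with standard command-line tools. The directory and file names are hypothetical, and the Roadmap itself does not prescribe any particular tooling:

     # Levels 2-3: record a SHA-256 checksum for every stable master file
     find masters/ -type f -exec sha256sum {} \; > fixity.sha256

     # Level 4: re-check that fixity data in response to a specific event,
     # such as a storage migration; only changed or missing files are reported
     sha256sum --check --quiet fixity.sha256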

Module A3 in written form

After the two four-hour meetings, my brain was full and I was full of new ideas about my project that probably should have already occurred to me, but that only coalesced in any meaningful way by walking through the roadmap process. I’ve also been around long enough to know that the giddy enthusiasm that comes after a meeting like this can die on the vine if those ideas aren’t transformed into actionable items and documented somewhere. I did have the printed roadmap modules and exercises with my written answers on them, and Langmead and her team were clear that if we wanted to merely file (or scan) those written documents and stop there, that was fine. But written in the final module of the roadmap is the recommendation that after its completion, “make sure that you store the documentation for this, and all other, STSR modules in one of your reliable sites of project documentation.” So after several months of contemplation, I finally determined that MITH’s most reliable current site of project documentation is Airtable, which we’ve been using more and more to track aspects of different projects.

Airtable is an online relational database application that looks and functions like a spreadsheet in its default 'Grid' UI, but which also has more robust relational functions allowing you to meaningfully connect data between different tables/worksheets. As opposed to merely entering my answers to each module/exercise, I opted to begin by actually moving references and links to all the roadmap's sections and modules into two tables in Airtable, so that the full text of each module was easily at hand for reference. I also included base, table, and column descriptions at all levels (the rough equivalent of Excel comments), which explain how information should be entered or give sample entries. The base description also provides an overview of this whole exercise, and gives attribution to the project in the format requested by Langmead and her team.

There are descriptions throughout with details on how to utilize each table or field. Click on the ‘i’ Info button to display them.

The Roadmap's project team provided actual spreadsheets for certain exercises, and I uploaded those as new tables in Airtable, modifying them as needed to connect/link with other tables. For example, in the Technological Infrastructure table (which includes all the various technologies used by your project), the 'Project Member Responsible' column is linked to the Project Team table. So after you've entered the data for each, you can go back to the Project Team table and see all the tech components each member is responsible for, rolled up in a linked record field. There's also a reference table listing out the definitions of Levels 1-4 for each of the six NDSA areas, so when you're deciding what to enter in the Sustainability Levels table, you can instantly reference that table and choose an appropriate level for each area. After crafting the 'template,' I tested its usability by entering all the data from Unlocking the Airwaves that I'd written down. Doing that revealed a few tweaks and bottlenecks that needed ironing out, and I went back and modified the template. See below for a few more screenshots of the completed template.

So now we’ve got the roadmap data for Unlocking the Airwaves saved in a reliable site of project documentation. MITH team members are now encouraged (but not required) to use the template as we develop new projects, and it’s available to anyone else who’d like to request a blank duplicated copy. Dr. Langmead also provided a gentle but useful reminder that there is inherent risk in picking and using any such technology for this purpose, since platforms like Airtable may not always remain available. She suggested that we include a mention along the lines of “The inclusion of Airtable in your project’s suite of technologies should be considered carefully (in line with the work done in Modules A5 and B2)” in the intro description text for the base, which we did.

In a way this was also a sense-making exercise: by taking all the roadmap data and turning it into structured data, I'd not only be able to sync up all these components in my head and turn them into actionable tasks, but I'd also better retain the information. Anyone who has transformed, mapped, or structured previously unstructured data knows that by doing these tasks, you become much more intimately connected to your data. But what I think really appeals to me about the roadmap process is the mindfulness aspect. It encourages participants to think beyond the theoretical concepts of sustainability and actually apply them, write them down, look at them, consider their implications, and be honest about project expectations as aligned with available resources. In a world of overtapped resources and academic and bureaucratic hurdles, that's an incredibly valuable skill to have.

The Digital Dialogues Collection, chronicling a slice of the digital humanities since 2005
https://mith.umd.edu/the-digital-dialogues-collection-chronicling/ (Mon, 08 Aug 2016)


This is the 6th post in MITH’s Digital Stewardship Series. In this post, MITH’s summer intern David Durden discusses his work on MITH’s audiovisual collection of historic Digital Dialogues events.

The Digital Dialogues series showcases many prominent figures from the digital humanities community (e.g., Tara McPherson, Mark Sample, Trevor Owens, Julia Flanders, and MITH’s own Matthew Kirschenbaum) speaking about their research on digital culture, tools and methodologies, and the interlocking concerns of the humanities and computing.

As mentioned in my earlier post, the nature of this collection presents several challenges to preservation and access as the series continues into the future. As with many collections that are the focus of digital curation, the topics and subject matter covered in the Digital Dialogues continuously evolve and change over the course of the series. The collection itself is a record of the evolution of the digital humanities, the growth of MITH, and the rapid development of digital technologies, e.g., audio podcasts, multimedia podcasts, and HD web-hosted video.

My project was intended to help MITH balance the challenges of proper storage of existing content with the challenges of developing sustainable workflows for the dissemination of current and future content. Prior to this project, the Digital Dialogues collection was dispersed among several locations, representing different workflows, available technologies and access platforms over time. There have been 193 Digital Dialogues since September of 2005. There are recordings of 129 of these—78 recorded on video, and 51 recorded on audio (only). Access copies for videos and audio tracks were hosted in a variety of locations, such as Vimeo, the Internet Archive, or an Amazon S3 server instance. Source and project files were located on a combination of the internal drive for MITH's iMac video editing station, an external hard drive, and a separate local server. After the completion of this project, the preservation, storage and accessibility of all Digital Dialogues content have been streamlined. Source and project files are now organized in a set file directory structure and stored redundantly on two separate local drives, and all access copies are available through a single source—Vimeo—making it easier for users to have access to the entire collection. Due to weekly upload limits imposed by Vimeo, there are currently 71 videos uploaded, and 45 more videos are in the upload queue and will be available soon.

Over the course of this project, I was involved in editing and exporting videos, updating the MITH site, and preparing digital content for long-term storage, but I did manage to find some time to actively engage with the sheer volume of content that exists within the collection. Several Digital Dialogues were in line with my own research interests and hobbies, so I was able to engage with the collection as both a curator and a researcher, and watched these videos in their entirety.

Here are a few (only a tiny sample) of my favorites:

Spectacular Stunts and Digital Detachment: Connecting Effects to Affects in US Car Movies, by Caetlin Benson-Allott

"It's too Dangerous to Go Alone! Take This." Powering Up for Videogame Preservation, by Rachel Donahue

Richard Freedman's Digital Dialogue, whose page pairs the talk with a Storify recap of the resources he referenced

These three are of personal interest to me, but each video also represents the variety of content that the Digital Dialogues has to offer. Additionally, the Donahue and Freedman pieces represent other ways that MITH is distributing content associated with each Digital Dialogue. Rachel Donahue's Digital Dialogue page, in addition to the video of her presentation, offers her slide deck for download in PDF format. Richard Freedman's Digital Dialogue page features a Storify recap with links to resources referenced in his presentation that are inaccessible from the video alone.

Featured video: “It’s too Dangerous to Go Alone! Take This.” Powering Up for Videogame Preservation

Title slide from Rachel Donahue’s Digital Dialogue

I am an avid fan and player of videogames, which is why I chose to highlight the talk in this video. Rachel Donahue worked on a Library of Congress-sponsored project, Preserving Virtual Worlds (PVW), which focused on the complexities of preserving the digital content of videogames (the Preserving Virtual Worlds website can only be viewed through the Internet Archive, but the project report is available here).

Donahue’s talk explains the methodology devised by PVW to determine the ‘how’ and ‘why’ of videogame preservation, which isn’t as straightforward as I originally thought. She begins with a simple explanation of what it is exactly that PVW’s videogame preservation focused on: videogames that were originally for computer or dedicated consoles, such as the Super Nintendo Entertainment System. This talk represents a wide range of preservation activities and approaches at the highest level. Donahue proceeds to explain that the problems inherent in videogame preservation stem from the existence of different preservation priorities from different members of the gaming community, e.g.: developers, players, and archivists. These sub-groups often overlap and further complicate the process. The player and developer communities may disagree about what the most important aspects of the game are, and in reference to the game Oregon Trail, Donahue states,

“if you talk to a lot of people about the Oregon Trail, and ask them ‘what do you most remember about the Oregon Trail, what do you think is most important to the Oregon Trail?’, and they’re going to say things like, dysentery, trying to shoot squirrels, making it to Independence Rock before July 4th, fjording the river, having enough axles in your pack, having enough stuff in general without weighing down your oxen so much that they can’t move; maybe if you’re a little bit more observant you might think, ‘problematic portrayal of Native Americans,’ but you’re not going to say, ‘data model.’ I don’t think anybody thinks about the data model, but if you talk to the creators of the Oregon Trail, they are in fact going to say, ‘the data model, the statistics, those are the most important parts of the game.”

Oregon Trail (photo credit: mygeekwisdom.com)

Videogames often have a multiplayer component that is a source of nostalgia for players. When comparing the gameplay between two-player Super Mario Bros., which can be preserved through software emulation or preservation of original hardware, to online play in Halo 3, which required servers operated by Microsoft in addition to the hardware and software components, one can quickly see how the ‘what’ of videogame preservation can imply drastically different things to groups within the community. Donahue also mentions that there are often unique trends and quirks for specific games within the player community which are not always preservable (such as ‘bunny hopping’ in Quake).

A variety of questions must be answered before preservation activities can move forward. The most important question is: “what exactly are we preserving?” Aside from content, videogames are data, software, hardware, unique storage media, and peripherals such as controllers. Each element of a videogame system may require a specific skillset in order to achieve any sort of reliable preservation. In the case of hardware and circuit boards, basic knowledge of electronics and computer repair may be required; when using emulation, scripting skills will inevitably be required. Videogame preservation also demands a distinction to be made between original hardware preservation and software emulation–what is the minimum level of preservation for a videogame? The question of what to save is most certainly a philosophical one: is it the aesthetic of the original object and the experience of playing the game in its original state, or will any experience involving the loose entity of the game be acceptable?

The Retrode (retrode.org) is a device that allows for hardware emulation using original videogame cartridges.

Donahue exhibits several surveys created to gauge the focus of preservation activities. For the curator or archivist, survey questions were more technical, and a few examples are ‘can the game be played’, ‘do you have the equipment to emulate’, and ‘will you provide a complete videogame experience, or will you just preserve the artifacts?’ For players, the questions are more rooted in videogame culture, for example, ‘what is the core of the game and what does it mean’, ‘what contributes to the success of a franchise’, ‘what is the importance of multiplayer’, and ‘is this a good game or a milestone game’?

Donahue and the PVW project made great strides in articulating the specific needs of videogame preservation as well as providing the groundwork for establishing preservation standards for an often overlooked and misunderstood part of our culture. This is just one of many interesting and unique Digital Dialogues within the collection – to view more, visit the Past Digital Dialogue Schedules page, where you can browse through all previous seasons and explore.

A Decade of Digital Dialogues Event Recordings and the Challenges of Implementing a Retroactive Digital Asset Management Plan
https://mith.umd.edu/decade-digital-dialogues-event-recordings-challenges-implementing-retroactive-digital-asset-management-plan/ (Thu, 14 Jul 2016)

This is the 5th post in MITH’s Digital Stewardship Series. In this post, MITH’s summer intern David Durden discusses his work on MITH’s audiovisual collection of historic Digital Dialogues events.

I was brought on as a summer intern at MITH to work on a digital curation project involving Digital Dialogues, MITH’s signature events program featuring speakers from around the U.S., and occasionally beyond, which has been running for eleven years. The Digital Dialogues events program has documented the development of the digital humanities as well as the ideas and work of several of the pioneers of the field. However, as the digital humanities grew and developed, so did the technology used to record and edit the Digital Dialogues. This digital record must be curated and preserved in order to ensure that the Digital Dialogues events are accessible for many years to come.

Staying current with changes in digital audio and video recording and editing resulted in a variety of media sources, file types, storage locations, and web-hosting services. MITH currently has a workflow for recent and future Digital Dialogues that ensures proper storage of raw video, systematized file-naming conventions, standards for video editing and the creation of web content, and redundant storage. This plan, in some form, must be retroactively applied to almost a decade of content.

Since I was dealing with a variety of locations for content, the first task at hand was to consolidate media from all storage locations and resolve discrepancies and duplications. This meant aggregating all available content from an editing workstation, an external drive, an AWS server, and a local server. Once all the content was funneled into a single location, I began the slow and tedious process of comparing files and folders. I was able to separate usable media from everything else and began moving content into a well-organized master directory that will be cloned into redundant storage for preservation. Future workflows will avoid such discrepancies by having content imported, named, organized, and edited on the local workstation and then copied to external storage, preventing duplication and accidental changes to archived content.
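The post doesn't name the tools used for that comparison, but as a rough sketch of the general approach, checksums can do much of the work of spotting exact duplicates across an aggregated staging area (the paths below are hypothetical):

     # record an MD5 checksum for every file pulled into the staging area
     find /staging/digital-dialogues -type f -exec md5sum {} \; > manifest.txt

     # sort by checksum and print only groups of identical files; each group is an
     # exact duplicate set, so only one copy needs to reach the master directory
     sort manifest.txt | uniq --check-chars=32 --all-repeated=separate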

An example of the future data flow for Digital Dialogues videos

MITH had been successfully saving multiple copies of files across different storage devices, but many of these files reflected outdated workflows and there were often several versions of the same file. The recording of Digital Dialogues went through several technological evolutions and left behind a messy file structure. Some source files were saved, others are missing. Some final product videos and recordings were duplicated across local storage devices, others exist solely in the Internet Archive and other web-hosting services. MITH's early Digital Dialogues provide an example of the danger inherent in relying on single storage locations and web-hosting services to archive digital assets. The file compression used by many services, as well as the possibility of service interruption, makes web-hosting a 'front-end access-only' form of digital storage. The important thing to emphasize here is that once digital source media is lost, it is usually lost forever, which is why it is always necessary and recommended to have a data management plan ready at the onset of any digital project.

Data storage isn't the only challenge the Digital Dialogues collection presents, since the collection has moved through different A/V editing workflows and standards. The Digital Dialogues transitioned from audio recording to video recording, as well as from using iMovie to Adobe Premiere to edit video, a transition that has left a considerable number of useless project files lingering about. The differences between the two video editing suites are considerable and present several challenges to long-term functionality. Adobe Premiere and iMovie handle the import of source media very differently. Premiere doesn't actually import the source media, but instead creates a link to the file using a system path, which results in project files that are only a few hundred kilobytes in size. iMovie, however, stores a copy of the original media as well as a variety of program-specific data, which greatly increases the size of the project folder. Additionally, Adobe Premiere allows for backwards compatibility to some degree, whereas iMovie does not, making Premiere a better choice for long-term functionality of project files.

The links that Adobe Premiere creates to source media are problematic because, if the source media changes location or filename, the links are effectively broken and media must be relocated before any editing can occur. However, as long as the source media is preserved and is identifiable, it is a simple task to point Premiere to the correct location of the source. To ensure MITH's future access to working project files (which is important if a derivative is lost and needs to be regenerated, or video formatting needs to be updated for a website), I created a well-organized and descriptively named directory containing all project files and associated linked media. The current editing and curation plan stores each Digital Dialogue event in a folder containing the source media and the edited derivative. Before transferring any source media, an appropriate directory is created to store the files. Files are then transferred from an external storage device or camera to the video editing iMac workstation and stored in the appropriate event folder. The event folders are named using the following convention:

‘YYYYMMDD_SpeakerNameInCamelCase_AdditionalSpeakersSeparatedByUnderscores’.

Events are organized by season (e.g., Spring 2016) and stored in a season folder using the following convention:

‘YYYY-Season-Semester’.

All events for a season will be edited in a single Adobe Premiere project file that is located within the season folder. This reduces the amount of project files to manage and also streamlines the video editing process.
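To illustrate these conventions, a season folder might end up looking something like the sketch below; the speaker names, dates, and file names are invented, and the exact expansion of the folder patterns is one reading of the conventions above:

     2016-Fall-Semester/
         Fall2016_DigitalDialogues.prproj          (single Premiere project for the season)
         20160912_JaneSmith/
             20160912_JaneSmith_raw.mov            (source media from the camera)
             20160912_JaneSmith_edited.mp4         (edited derivative for upload)
         20160926_JohnDoe_JaneSmith/
             20160926_JohnDoe_JaneSmith_raw.mov
             20160926_JohnDoe_JaneSmith_edited.mp4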

Example of a well-organized Digital Dialogue season folder

Another part of this project consisted of editing previous content to conform to current standards. Due to the variety of files that existed, both formats and duplicates, I decided to prioritize raw footage (or the highest quality derivative that I could discover) for archiving and the creation of new videos. Provided that usable media was accessible, videos currently on the MITH website are being updated to reflect proper MITH logos and branding, as well as title slates with appropriate attributions to speakers, dates and talk titles. There are also many years of Digital Dialogues recorded as audio, which are in the process of being exported to a standardized video format so that the majority of Digital Dialogues will be accessible to the user through one hosting service (Vimeo). At the end of the project, I will have created or recreated around 105 videos, streamlined and documented any changes to MITH’s audiovisual workflows, and ensured proper digital stewardship of an important collection of digital humanities scholarship. My second and final blog post in this series will highlight some of the more interesting content in this collection.
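The post doesn't say which tool handles that audio-to-video export; as one hedged example, ffmpeg can pair a static title slate with an audio recording to produce a standard video file (the filenames here are hypothetical):

     # loop a single title-slate image for the length of the audio track and encode
     # the result as H.264 video with AAC audio, a format suitable for web hosting
     ffmpeg -loop 1 -i title_slate.png -i 20061002_SpeakerName.wav \
            -c:v libx264 -tune stillimage -pix_fmt yuv420p -c:a aac -shortest \
            20061002_SpeakerName_edited.mp4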


The Web's Past is Not Evenly Distributed
https://mith.umd.edu/webs-past-not-evenly-distributed/ (Fri, 27 May 2016)

This is the 4th post in MITH’s Digital Stewardship Series. In this post Ed Summers discusses ways to align your content with the grain of the Web so that it can last (a bit) longer.

If I had to guess I would bet you found the document you are reading right now by following a hyperlink. Perhaps it was a link in a Twitter or Facebook status update that a friend shared? Or maybe it was a link from our homepage, or from some other blog? It's even possible that you clicked on the URL for this page in an email you received. We're not putting MITH URLs on the sides of buses, on signs, or in magazines (yet), but people have been known to do that sort of thing. At any rate, we (MITH) are glad you arrived, because not all links on the Web lead somewhere. Some links lead into dead ends, to nowhere–to the HTTP 404. Here is how historian Jill Lepore describes the Web in her piece The Cobweb:

The Web dwells in a never-ending present. It is—elementally—ethereal, ephemeral, unstable, and unreliable. Sometimes when you try to visit a Web page what you see is an error message: Page Not Found. This is known as link rot, and it’s a drag, but it’s better than the alternative.

I like this description, because it hints at something fundamental about the origins of the Web: if we didn’t have a partially broken Web, where content is constantly in flux and sometimes breaking, it’s quite possible we wouldn’t have a Web at all.

Broken By Design

When Tim Berners-Lee created the Web at CERN in 1989 he had the insight to allow anyone to link to anywhere else. This was a significant departure from previous hypertext systems, which featured link databases that ensured pages were interlinked properly. Even when the Web only existed on his NeXT workstation, Berners-Lee was already thinking of a World Wide Web (WWW) where authors could create links without needing to ask for permission, just like they used words. Berners-Lee understood that in order for the Web to grow he had to cede control of the link database. Here is Berners-Lee writing about these early days of the Web:

For an international hypertext system to be worthwhile, of course, many people would have to post information. The physicist would not find much on quarks, nor the art student on Van Gogh, if many people and organizations did not make their information available in the first place. Not only that, but much information — from phone numbers to current ideas and today’s menu — is constantly changing, and is only as good as it is up-to-date. That meant that anyone (authorized) should be able to read it. There could be no central control. To publish information, it would be put on any server, a computer that shared its resources with other computers, and the person operating it defined who could contribute, modify, and access material on it. (Weaving the Web, p. 38-39).

So in a way the Web is necessarily broken by design–there is no central authority making sure all the links work. The Web’s decentralization is one of the primary factors that allowed it to germinate and grow. I don’t have to ask for permission to create this link to Wikipedia or to anywhere else on the Web. I simply put the URL in my HTML and publish it. Similarly, no one has to ask me for permission to link here. An obvious consequence of this freedom is that if the page I’ve linked to happens to be deleted or moved somewhere else, then my link will break.

Fast forward through 25 years of exponential growth and we have a Web where roughly 5% of the URLs collected using the social bookmarking site Pinboard break per year. URLs shared in social media are estimated to fare even worse with about 11% lost per year. In 2013 a group at Harvard University discovered that 50% of the links in Supreme Court opinions no longer link to the originally cited information. To paraphrase William Gibson, the past isn’t evenly distributed.

There is no magic incantation that will make your Web content permanent. At the moment we rely on the perseverance and care of Web archivists like the Internet Archive, or your own organization’s Web archiving team, to collect what they can. Perhaps efforts like IPFS will succeed in layering new, more resilient protocols over or underneath the Web. But for now you are most likely stuck with the Web we’ve got, and if you’re like MITH, lots of it.

Fortunately, over the last 25 years Web developers and information architects have developed some useful techniques for working with this fundamental brokenness of the Web. I'm not sure if these are best practices as much as they are practical recipes. If you tend a corner of the Web, here are some very practical tools available to you for helping mitigate some of the risks of link rot.

Names

There are only two problems in computer science: cache invalidation, naming things and off-by-one errors. (Phil Karlton by way of Martin Fowler)

Naming things is hard partly because of semantic drift. The things our names seem to refer to have a tendency to shift and change over time. For example if I create a blog at wordpress.example.org and a few years later decide to use a different blogging platform, I’m kind of stuck with a hostname that no longer fits my website. When creating content on the Web it’s important to be attentive to the hostnames you pick for your websites. Unfortunately most of us don’t have time machines to jump in to see what the future is going to look like. But we do have memories of the often tangled paths that lead to the present state of a project. Draw upon these histories when naming your websites. Turn naming into a cooperative and collaborative exercise.

Naming on the Web is ultimately about the Domain Name System (DNS). DNS is a distributed database that maps a hostname like mith.umd.edu to its IP address 174.129.6.250, which is a unique identifier for a computer that is connected to the Internet. Think of DNS as you would your address book, which lets you look up the phone number or mailing address for a person you know using their name. When your friend moves from one place to another, or changes their phone number you update your address book with the latest information. Establishing a DNS name for your website like mith.umd.edu lets you move the actual machine around, say from your campus IT department, to a cloud provider like Amazon Web Services, which changes its IP address, but without changing its name. DNS is there to provide stability to the Web and the underlying Internet.
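You can watch this name-to-address lookup happen yourself with an ordinary DNS query tool such as dig; the address returned when you run it may of course differ from the one quoted above:

     dig +short mith.umd.edu
     # prints the IP address (e.g. 174.129.6.250) that your browser actually connects to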

DNS is distributed in that there are actually many, many address books or zones that are hierarchically arranged in a pyramid like structure, with one root zone at the top. In the case of mith.umd.edu the address book is managed by the University of Maryland. In order to change the hostname mith.umd.edu to point at another IP address I would need to contact an administrator at UMD and request that it be changed. Depending on my trust relationship with this administrator they might or might not fulfill my request. Sometimes you might want to purchase a new domain for your site like namingthingsishard.org from a DNS Registrar. After you paid the registrar an annual fee, you would be able to administer the address book yourself. But with great power comes great responsibility.

Given its role in making Internet names more stable it’s ironic that DNS is often blamed as a leading factor for link rot. It’s true, if you register a new domain and forget to pay your bill, a cybersquatter can snatch it up and hold it hostage. Alternative systems like Handles, Digital Object Identifiers and more recently NameCoin or IPFS have been developed partly out of an inherent distrust of DNS. These systems offer various improvements over DNS, but in some ways recapitulate the very problems that DNS itself was designed to solve. DNS has its warts, but it has been difficult to unseat because of its ubiquitous use on the Internet. Here are a few things you can do when working with DNS to keep your websites available over time:

  • Pick a hostname you and your team can collectively live with, at least for a bit. Think of it as a community decision.
  • If your new website fits logically in a domain that you already have access to use it instead of registering a new domain name. This way you have no new bills to pay, or management to do.
  • If you register a new domain keep your user registration account and contact information up to date. When a bill needs to be paid that’s who the registrar is going to get in touch with. When staff come and go make sure the contact information is changed appropriately.
  • Pick a registrar that lets you lock the domain to prevent unwanted updates and transfers.
  • If your registrar allows you to enable two-factor authentication do it. You’ll feel safer the next time you log in to make changes.
  • If the creator of a website has moved on to another organization, transfer the ownership of the domain to them. They are the ones making sure it stays online, so it makes sense for them to own and administer the domain.

Redirects

A more practical solution than minting the perfect and unchanging name for your website is to accept that things change, and to let people know when these changes occur. An established way to announce a name change on the Web is to use the modest HTTP redirect. Think of an HTTP redirect as the forwarding address you give the post office when you move. The post office keeps your change of address on file, and forwards mail on to your new address, for a period of time (typically a year). The medium is different but the mechanics are quite similar, as your browser seamlessly follows a redirect from the old location of a document to its new location.

The Hypertext Transfer Protocol (HTTP) defines the rules for how Web browsers and servers talk to each other. Web browsers make HTTP requests for Web pages using their URL, and Web servers send HTTP responses to those requests. Each type of response has a unique three digit code assigned to it. I already mentioned the 404 Not Found error above, which is a type of HTTP response that is used when the server cannot locate a resource with the URL that was requested. When the server returns a 200 OK response the browser knows everything is fine and that it can display the page. Another class of HTTP responses servers can send are redirects, such as 301 Moved Permanently and 302 Found.

301 Moved Permanently is your friend in situations where a given website has moved to another location, as when a domain name is changed, or when the content of a website has been redesigned. Consider the example above when wordpress.example.org no longer works because another blogging platform is now being used. You can set up your webserver so that all requests for wordpress.example.org redirect permanently to blog.example.org.

The actual mechanisms for doing the redirect vary depending on the type of Web server you are running. For example, if you use the Apache Web server you can enable the mod_rewrite module and then put a file named .htaccess containing this in your document root:

RewriteEngine On
RewriteBase /
# requests arriving for the old hostname are redirected, path intact,
# to the same path on the new hostname with a 301 Moved Permanently
RewriteCond %{HTTP_HOST} ^wordpress\.example\.org$ [NC]
RewriteRule ^(.*)$ http://blog.example.org/$1 [L,R=301]

The point of this example isn’t to explain how mod_rewrite or Apache work, but merely to highlight one way (among many) of doing a permanent redirect. Whatever webserver you happen to use is bound to support HTTP redirects in some fashion. When things move from one place to another try to let people know that the website has moved permanently elsewhere. When people visit the old location their browser will seamlessly move on to the new location. Web search engines like Google also will notice the change in location and update their index appropriately, since they don’t want their search results to send people down blind alleys. Consider creating a terms of service for your website where you commit to serving redirects for a period of time, say one year, to give people a chance to update their links.
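However you configure it, it's worth checking the redirect from a client's point of view. One quick way to do that is with curl, using the hypothetical hostnames from the example above (the path is made up as well):

     # -s silences progress output, -I asks for headers only, -L follows redirects
     curl -sIL http://wordpress.example.org/2015/06/some-post/
     # the first response should be a 301 Moved Permanently whose Location header
     # points at http://blog.example.org/2015/06/some-post/, followed by a 200 OK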

Proxies

Another pivot point for managing your Web content are reverse proxies. Remember those HTTP requests and responses mentioned earlier? Reverse proxies receive HTTP requests from a client, forward them on to another server and then send the response they receive back to the original client. That’s kind of complicated so here’s an example. Recently at MITH we moved many of our web properties from a single machine running on campus to Amazon Web Services, a.k.a. The Cloud. We had many WordPress, Drupal and Omeka websites running on the single machine and wanted to disaggregate them so they could run on separate Amazon EC2 instances. The reason for doing this was largely to allow them to be maintained independently of each other. We wanted to make this process as painless as possible by avoiding major changes to our URL namespaces. For example the MITH Vintage Computing Omeka site lived at:

http://mith.umd.edu/vintage-computers/

and we didn’t want to have to move it to somewhere like:

http://omeka.mith.umd.edu/vintage-computers/

Certainly the old locations could have been permanently redirected to the new locations for some period of time. But doing that properly for over 100 project websites was a bit daunting. Instead we chose to use a reverse proxy server called Varnish to manage the traffic to and from our new servers. An added benefit to using a reverse proxy like Varnish is that it will cache content where appropriate, which greatly improves user experience when pages have been previously requested. In fact speeding up websites is the primary use case for Varnish. Another advantage to using something like Varnish is that its configuration becomes an active map of your organization’s Web landscape. You can use the configuration as documentation for the web properties you manage. Here’s a partial view of our setup now:

Diagram: a partial view of MITH's reverse proxy setup

Reverse proxies are an extremely useful tool for managing your Web namespace. They allow you to present a simplified namespace of your websites to your users, while also giving you a powerful mechanism to grow and adapt your backend infrastructure over time.

Cool URIs

When a link breaks it’s not just the hostname or domain name in the URL that can be at fault. What is much more common is for the path or query components of the link URL to change. Unless a URL is for a website’s homepage almost all URLs have a path, which is the part that immediately follows the hostname, for example:

http://mith.umd.edu/vintage-computers/items/show/9

This URL identifies the record in our Vintage Computers website for this Apple IIe computer that has been signed by the author Bruce Sterling:

Photo: the Apple IIe in MITH's Vintage Computers collection, signed by Bruce Sterling

In addition, a URL may have a query component, which is the portion of the URL that starts with a question mark. Here’s an example of a URL that identifies a search for the word apple in the Vintage Computing website:

http://mith.umd.edu/vintage-computers/items/browse?search=apple

The path and query portions of a URL are much more susceptible to change because they uniquely identify the location of a document on the Web server, or (more often the case) the type of Web application that is running on the server. Consider what might happen if we auction off the Apple IIe (never!) and delete record 9 from our Omeka instance. Omeka will respond with a 404 Not Found response. Or perhaps we upgrade to a version of Omeka that uses a new search query such as ?q=apple instead of ?search=apple. In this case the URL will no longer identify the search results.

Even though they are highly sensitive to change, there are some rules of thumb you can follow to help ensure that your URLs don't break over time. A useful set of recommendations can be found in a document called Cool URIs Don't Change, written by Tim Berners-Lee himself back in 1998, which recommends you avoid URLs that contain:

  • filename extensions: http://example.org/search.php
  • application names: http://example.org/omeka/
  • access metadata: http://example.org/public/item/1
  • status metadata: http://example.org/drafts/item/1
  • user metadata: http://example.org/timbl/notes/

Ironically enough the URL for Cool URIs Don’t Change is http://www.w3.org/Provider/Style/URI.html which is counter to the first recommendation since it uses the .html filename extension. Remember, these are not hard and fast rules–they are simply helpful pointers to things that are more susceptible to change in URLs.

In fact you may have already heard of Cool URIs by a more popular name: permalinks. In 2000, just as blogging was becoming popular, Paul Bausch, Matt Haughey and Ev Williams at Blogger needed a way to reference older posts that had cycled off a blog's homepage. Necessity was the mother of invention, and so the idea of the permalink was born. A permalink uniquely identifies a blog post which can be used when referencing older content in the archive. The idea is now so ubiquitous it is difficult to recognize as the innovation it was then.

When you create a post on WordPress, or send a tweet on Twitter, your piece of content is automatically assigned a new URL that others can use when they want to link to it. The permalink was put to work in 2001 when Wikipedia launched, and every topic got its own URL. There are now close to 5 million articles in English Wikipedia. Previously many websites shrouded their URLs with complex query strings, which exposed the internals of the programs that made the content available, and contributed to their transience. Clean URLs inspire people to link to them with some expectation that they will be managed and persistent. Consider these two examples:

https://en.wikipedia.org/wiki/Permalink

and:

https://twitter.com/jack/status/20

compared with:

http://www.example.com/login.htm?ts=1231231232222&st=223232&page=32&ap=123442&whatever=somewordshere

When you are creating your website it’s a good idea to be conscious of the URLs you are minting on the Web. Do they uniquely identify content in a simple and memorable way? Can you name the types of resources that the URL patterns will identify? Could you conceivably change the Web framework or content management system being used without needing to change all the URLs? Consider creating a short document that details how your organization uses its URL namespaces, and its commitment to maintaining resources over time. You won’t be going out on a limb, because these ideas aren’t particularly new: check out what Dan Cohen and Roy Rosenzweig said over 10 years ago about Designing for the History of the Web in their book Digital History.

Web Archives

In his first post in this series Trevor referred to the distinction between active and inactive records in archival theory, and how it informs MITH's approach to managing its Web properties. Active records are records that remain in day-to-day use by their creators. Inactive records on the other hand are largely kept for historical research purposes. An analogy can be made here to dynamic and static websites. Both types of websites are important, but each has fundamentally different uses, architectural constraints and affordances.

Over the last 15 years MITH has used a variety of Web content management systems (CMS) as part of our projects: Drupal, WordPress, Omeka, MediaWiki, and even some homegrown ones. A CMS is very useful during the active development of a project because it makes it easy for authenticated users to create and edit Web content using their Web browser. But at a certain point a project transitions (slowly or quickly) into maintenance mode. All good grant funding must come to an end. Key project participants can move on to new projects and new jobs at other institutions. The urgent need for the project to have a dynamic website can be reduced, over time, to basically nothing. The website's value is now as a record of past activity and work, rather than the purpose it was initially serving.

When you notice this transition in the life cycle of one of your projects there’s an opportunity to take a snapshot of the website so that the content remains available, and preserve what there is of the server side code and data. The venerable wget command line utility can create a mirror of a website. Here’s an example command you could use to harvest a website at http://example.com/app/

     wget --warc-file example \
          --mirror \
          --page-requisites \
          --html-extension \
          --convert-links \
          --wait 1 \
          --execute robots=off \
          --no-parent \
          http://example.com/app/

Once the command completes you will see a directory called example.com that contains static HTML, JavaScript, CSS and image files for the website. You should be able to take the contents of the directory and move it up to your server as a static representation of the website. In addition you will have an example.warc.gz that contains all the HTTP requests and responses that went into building the mirror. The WARC file can be transferred to a web archive viewing application like the Wayback Machine or pywb for replay.
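With pywb, for example, you can stand up a local replay environment around that WARC file in just a few commands (the collection name here is hypothetical; check the pywb documentation for current usage):

     pip install pywb
     wb-manager init mith-snapshots              # create a new local collection
     wb-manager add mith-snapshots example.warc.gz
     wayback                                     # then browse http://localhost:8080/mith-snapshots/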

One caveat here is that this snapshot will be missing any server side functionality that was present and not browsable via an HTML hyperlink. So for example a search form will no longer work, because it was dependent on a user entering in a search query into a box, and submitting it to the server for processing. Once you have a static version of your website the server side code for performing the query, and generating the results, will no longer be there. To simplify the user experience you may want to disable any of this functionality prior to crawling. If giving up the server side functionality compromises the functionality of the snapshot it may not be suitable for archiving in this way.

If your institution subscribes to a web archiving service provider such as Archive-It you may be able to nominate your website for archiving. Once the site has been archived you can redirect (as mentioned earlier) traffic to the new location provided by your vendor. Additionally you can rely on what is available of your website at the Internet Archive, although you may need to do some work to verify that all the content you need is there. Both approaches will require you to make sure your website's robots.txt is configured to allow crawling so that the Web archiving bots can access all the relevant pages.

Static Sites

What if it wasn't necessary to archive websites? What if they were designed to be more resistant to change? In the early days of the Web people composed static HTML documents by hand. It was easy to view source, copy and paste a snippet of HTML, save it in a file, and move it up to a webserver for the world to read. But it didn't take long before we were creating server-side programs to generate Web pages dynamically based on content stored in databases and personalized services like authentication and user preferences. While the dynamic, server-side-driven Web is still prevalent, over time there has been a slight shift back towards so-called static websites for certain use cases.

For example when the New York Times needed to respond to thousands of requests per second for results on election night they used static websites. When healthcare.gov was being launched to educate millions of people about the Affordable Care Act in 2013, the only part of the site with 100% uptime was the thousands of pages being managed as a static website. When lots of people come looking for a page it's important not to need to connect to a database, query it, and generate some HTML using the query results. This is known as the so-called thundering herd problem. The fewer moving parts there are, the faster the page can be sent to the user.

Strangely the same property of simplicity that is useful for performance reasons has positive side effects for preservation as well. The server side code that talks to databases to dynamically build HTML or JSON documents is extremely dependent on environmental conditions such as: operating system, software libraries, programming language, computational and network resources, etc. This code is often custom made, configured for the task at hand and can contain undiscovered bugs that compromise functionality and security. Even when using off the shelf software like WordPress or Drupal it’s important to keep these Web applications up to date, since spammers and crackers scan the Web looking for stale versions to exploit.

Read-only static sites largely sidestep these problems, since the only software in use is a Web browser and a Web server running in its simplest mode, serving up documents. Both are battle tested by millions of people every day. Of course it’s not really practical to hand code an entire website in HTML, so there are now many different static site generators available that build the site once using templates, includes, configuration files and plugins. Once built, the website can be published simply by copying the files to a Web server. StaticGen is a directory of static site generators that lists over 100 projects across 25 different programming language environments.

The next time you are creating a website, ask yourself whether you absolutely need the site to be dynamic. Can you generate your site automatically using a tool like Jekyll, and layer in dynamic functionality such as commenting with Disqus or search with Google Site Search? If these questions intrigue you, you may be interested in a relatively new Digital Humanities community group for Minimal Computing.
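To give a flavor of the static site workflow, here is a minimal sketch using Jekyll; it assumes a Ruby environment with RubyGems available, and the site name is arbitrary:

  # install the generator
  gem install jekyll bundler

  # scaffold a new site and build it into the _site directory
  jekyll new my-site
  cd my-site
  bundle install
  bundle exec jekyll build

  # the contents of _site are plain files, ready to copy to any web server
  ls _site

The output in _site is just HTML, CSS and JavaScript, so publishing it is nothing more than copying files.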

Export

The last lever I’m going to cover here is data export. If you decide to build your website using a content management system (WordPress, Drupal, Omeka, etc.) or a third party provider (Twitter, Facebook, Medium, Tumblr, SquareSpace, etc.), be sure to explore what options are available for getting your content out. WordPress, for example, lets you export your site content as an augmented RSS feed known as WXR. Twitter, Facebook and Medium allow you to download a zip file containing a miniature static site of all your contributions that you can open with your browser.
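In WordPress the export lives under Tools > Export in the admin interface. If WP-CLI happens to be installed on the server, a sketch of the same export from the command line might look like this (the output directory is illustrative, and the command should be run from the WordPress installation directory):

  # write a WXR (WordPress eXtended RSS) file to the given directory
  wp export --dir=/tmp/wxr-backup

  # the result is one or more XML files you can keep, version, or re-import elsewhere
  ls /tmp/wxr-backup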

Being able to export your content from one site to the next is extremely important for the long term access to your data. In many ways the central challenge we face in the preservation of Web content, and digital content generally, is the recognition that each system functions as a relay in a chain of systems that make the content accessible. As the aphorism goes: data matures like wine, applications like fish. As you build your website be mindful of how your data is going in, and how it can come out, so it can get into the next system. If there is an export mechanism try it out and see how well it works. If you don’t have a clear story for how your content is going to come out, maybe it’s not a good place to put it.

It may seem obvious, and perhaps old school, but the simplest way for a website to export your data is to make it easily crawlable and presentable on the Web as a network of interlinked HTML documents. If you have a website that was designed with longevity in mind a tool like wget can easily mirror it, and has the added benefit of making it crawlable by search engines like Google or services like the Internet Archive. Stanford University has created a guide for how to make your website archivable.
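For a site built this way, a single wget invocation along the following lines is often enough to pull down a browsable copy; example.com stands in for your own domain, and the exact flags you need may vary:

  # --mirror           recursive download suited to mirroring a whole site
  # --page-requisites  fetch the images, CSS and JavaScript needed to render pages
  # --adjust-extension save files with .html extensions where appropriate
  # --convert-links    rewrite links so the copy is browsable offline
  wget --mirror --page-requisites --adjust-extension --convert-links --no-parent https://example.com/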

Use Your Illusion

Hopefully this list of tools and levers for working with broken links on the Web has been helpful. It was not meant to be an exhaustive list of the options that are available, but rather another chapter in a continuing conversation about how we can work on and with the Web while being mindful of sustainability and preservation. As Dan Connolly, one of the designers of the Web, once said:

The point of the Web arch[itecture] is that it builds the illusion of a shared information space.

This illusion is maintained through the attention and craft of Web authors and publishers all around the world. Maybe you are one of them. As we put new content on the Web and maintain older websites, we are doing our parts to sustain and deepen this hope for universal access to knowledge. Our knowledge of the past has always been mediated by the collective care of those who care to preserve it, and the Web is no different.

Many thanks to Trevor Muñoz, Kari Kraus and Matt Kirschenbaum for reading drafts of this post and their suggestions.

Stewarding MITH’s History: A New Window Into Our Past
https://mith.umd.edu/stewarding-mith-history/
Fri, 04 Mar 2016 09:30:16 +0000


This is the third post in MITH’s series on stewarding digital humanities scholarship.

No doubt you’ve noticed that the MITH website looks a little different these days. We’re proud of this latest refresh of the site’s design, which brings a number of updates such as responsive design, better usability on mobile devices, and reorganized pages for featuring talks from our Digital Dialogues series. The overall process was led by our designer Kirsten Keister but involved everyone at MITH.

This post is about one aspect of the redesign that I took a leading role in and that relates to MITH’s ongoing work to steward the digital humanities research that’s been done here. We wanted to improve the tools that visitors to our site can use to search and browse the history of MITH’s research. The result is our new “Research” page.

Understanding the Challenge

Creating the “Research” page involved an interesting confluence of challenges and approaches: part data curation, part records management, part appraisal, and part user experience design. My role was to translate the strategy for digital stewardship that Trevor outlined into the design process Kirsten was heading up. At the same time, I coordinated my research through MITH’s records with the work Porter was doing to locate and reorganize legacy data from websites and projects that had become obsolete or been decommissioned but were still stored on MITH’s old servers. Throughout the process, I drew on some of my archival training to consider issues of appraisal and documentation. Also, as MITH’s project manager, I focused on communication and transparency to keep everyone collaborating efficiently.

A crucial insight—which we only recognized once we were well into the project—is that what we were doing with the “Research” page was less a redesign of existing content than a fundamental shift in how we conceive of this part of MITH’s web presence: it now encompasses a data curation mission. We were creating something new with this iteration of the “Research” page because MITH had never intended to use the website as a complete catalogue of the center’s many projects. As you can see browsing through the old versions of the site, what has become the “Research” page started out as something more like news: “What is MITH working on these days?” The purpose of creating pages for different projects and linking out to their other manifestations on the web was, and is, first about publicity. We want people who hear a talk or read a blog post to be able to search our site and find at least some information about a project. Successive redesigns grouped this information together and started using “research” or “projects” as part of the information architecture of the site.

We can be pretty confident that no one set out to create a public catalogue or registry of MITH projects and websites. So capturing a fuller picture of MITH’s output over the years is a mission that the website has grown into, along with MITH’s own longstanding commitment to digital stewardship and to representing the history of digital humanities as a field. As part of this growing process, we’ve added many more metadata fields to the content management system that runs this site, we’ve collated and cross-referenced data that wasn’t previously part of the site, and we’ve worked to organize the data and make it available through a new, more usable interface.

MITH’s Data: Record Retention and Appraisal

First, there was a need to consolidate and cross-reference data related to the history of MITH’s DH scholarship. Although we did have individual project pages online for the vast majority of current and former MITH projects, we also knew that projects may have become less publicly visible over time for such reasons as: sites becoming vulnerable to hacking and needing to be taken offline; side effects of past site migrations and upgrades; staff turnover leading to knowledge gaps or miscommunication about project updates; and, no doubt, some inevitable oversight or human error.

To start the process of gathering and updating data about MITH’s research, I started making a list of locations where MITH history was located:

  1. The current MITH website;
  2. Past iterations of the MITH website on the Wayback Machine;
  3. MITH’s electronic records;
  4. MITH’s paper records in Special Collections/University Archives;
  5. Former MITH Research tracking documents.

Record retention is a tricky business in general, as is the appraisal of organizational records. I liked to keep in mind Dennis Meissner’s warning that “too great an abstraction is an evil,” suggesting that an imaginative archivist could find some reason for the retention of every document, thus reducing appraisal to the level of an intellectual game. We cannot retain everything. Despite always having data curation and archiving experts on staff or working with us, there has never been a dedicated MITH Archivist. As with most organizations, the decision to keep or discard institutional records is often a choice made in a particular moment, mostly factoring in current assessments of need. So while we have a fairly complete portrait of our activities drawing on all five of the sources above, that portrait had to be painstakingly assembled over a series of months.

I downloaded all available project and event metadata from the current MITH WordPress site and set it up in a dedicated shared spreadsheet. Then the MITH staff met and talked through our ideas about the design of the revamped interface, determining what an ideal set of core project metadata elements would be, and what information would display where on the new site. We looked at what metadata was already present on the site, and Kirsten created a separate tab on the spreadsheet to track existing and new metadata elements.

Archival documentation found in MITH’s paper archives: an early outline of possible components for the Disability Studies Academic Community (DISC) website.

The easiest additions were projects which had been on former versions of the MITH site, now archived in the Wayback Machine. Since the descriptions and information in these had already been curated and vetted by former MITH staff, it was typically in fine shape for porting over as-is, filling in holes with other research. MITH’s electronic records provided much of the information about post-2009 projects that needed additional metadata.

Among MITH’s records were a number of different “tracking documents,” for example, documentation of the contents of a server as of a particular month and year. These could be helpful, but because they were often created for a very specific purpose they could also obscure conclusions, particularly where there was evidence of a development site for a project that never came to fruition, or that was only there to test alternative versions of a current site. File paths from these documents were helpful in working with Porter to track down projects on legacy servers, but then I often had to track down all the basic metadata on a project in the MITH paper records.

The MITH paper records were a particular challenge because a) they were interesting and could lead to distracting perusal of old correspondence and documentation, and b) they contained entire folders on projects which got very far in the development process but never came to fruition. Folders for projects that did come to fruition often contained all the same types of records and data, so unless I checked in with the project director it was difficult to determine when to stop going down the rabbit hole.

Disambiguation and Taxonomies

The process of cycling through all of the above sources tended to be iterative, requiring back-and-forth rechecking and cross-referencing of conflicting data. But in the end I filled in all the new metadata elements for current projects, and added the full set of metadata elements for a total of 24 projects that will be getting “new” representation in our redesigned research-page-as-catalog.

When it came to enabling visitors to benefit from all of this data curation, Raff pitched in to develop a small JavaScript application for the “Research” page to enable faceted browsing. For example, although projects often refer to the funder or sponsor in the description text, we’d never tracked that data in a separate field. And although we’d used tags to assign keywords to specific projects, the main reason for using the tags was to connect a project page to related blog posts, and over time the list of tags had become a bit spotty and random. So if a user wanted to explore the history of awarded MITH grant projects and then filter down to specific research topics, this information would need to be populated across all current MITH projects as well as the projects we added through the consolidation process. To facilitate this, we developed a new, hierarchical taxonomy of Topic tags in a third spreadsheet tab. You can see the result of this re-tagging and re-organization in the options for faceted browsing on the left-hand side of the new “Research” page.

Conclusions & Takeaways

This process was rewarding in that I acquired a unique vantage point on MITH’s history, including the quantification and distinction of the types of research we’ve really specialized in over the years. There were also many fascinating insights, such as learning about MITH’s role in the development of early online educational technology (see the Spain/Online project), and MITH’s role in helping develop Disability Studies as a recognized academic discipline (DISC: A Disability Studies Academic Community). Lastly, I was delighted to learn more about MITH’s history of organizing spoken word poetry and other community events (read about the Gwendolyn Brooks Poetry Slam in 2004, which featured the involvement of local high school students, and a community Q&A with Irvin Kershner, director of The Empire Strikes Back, after a public screening of the film).

With the new interface going live in conjunction with the latest redesign of the MITH website, I hope that our community finds as much inspiration digging into MITH’s history as I have.

Hacking MITH’s Legacy Web Servers: A Holistic Approach to Preservation on the Web
https://mith.umd.edu/hacking-miths-legacy-servers/
Wed, 08 Jul 2015 15:33:30 +0000

Editor’s note— This is the second post in MITH’s series on stewarding digital humanities scholarship.

In September of 2012 MITH moved from its long-time home in the basement of the McKeldin Library on the University of Maryland campus to a newly renovated, and considerably better lit, location next to Library Media Services in the Hornbake Library. If you’ve had a chance to visit MITH’s current location, then you’ve likely noticed its modern, open, and spacious design. And yet, for all its comforts, for all its natural light streaming in from the windows that comprise its northern wall, I still find myself missing our dark corner of the McKeldin basement from time to time: its cubicles, its cramped breakroom, Matthew Kirschenbaum’s cave-like office with frankensteinian hardware filling every square inch, and especially its oddly shaped conference room, packed to the gills and overflowing into the hallway every Tuesday at 12:30 for Digital Dialogues.

In preparation for the move, we delved into those nooks and crannies to inventory the computers and other equipment that had accumulated over the years. MITH is a place that is interested in the materiality of computing—and it shows. Boxes of old media could be found under one cubicle, while a stack of keyboards—whose corresponding computers had long since been recycled—could be found under another. A media cabinet near the entrance contained a variety of game systems from the Preserving Virtual Worlds project, and legacy computer systems, now proudly displayed on MITH’s “spline,” jockeyed for pride of place on the breakroom table. Tucked away here and there in MITH’s McKeldin offices was a host of retired computer hardware—old enough to have been replaced, but still too modern to merit a listing in MITH’s Vintage Computers collection. Among these systems were two large, black IBM server towers, twice the size and weight of your typical PC. As I lugged them onto the cart in preparation for the move, I couldn’t help but wonder what these servers had been used for, when and why they had been retired, and what data might still be recoverable from them.

A few months ago, I got the chance to find out when I was asked to capture disk images of these servers (a disk image is a sector-by-sector copy of all the data that reside on a storage medium such as a hard disk or CD-ROM). Disk images of these servers would significantly increase access to the data they contained, and also make it possible for MITH to analyze them using the BitCurator suite of digital forensics tools developed by MITH and UNC’s School of Information and Library Science. With BitCurator we would be able to, among other things, identify deleted or hidden files, search the disk images for particular or sensitive information, and generate human and machine-readable reports detailing file types, checksums and file sizes. The tools and capabilities provided by BitCurator offered MITH an opportunity to revisit its legacy web servers and retrospectively consider preservation decisions in a way that was not possible before. However, before we could conduct any such analysis, we first had to be able to access the data on the servers and capture that data in disk image form. It is tackling that specific challenge, accessing and imaging the servers’ hard drives, that I want to focus on in this blog post.

As Trevor described in his recent post on MITH’s digital preservation practices, servers are one important site where complex institutional decisions about digital curation and preservation are played out. As such, I see two communities that have a vested interest in understanding the particular preservation challenges posed by server hardware. First, the digital humanities community where DH centers routinely host their own web content and will need, inevitably, to consider migration and preservation strategies associated with transitioning web infrastructure. These transitions may come in the form of upgrading to newer, more powerful web servers, or migrating from self-hosted servers to cloud-based virtual servers. In either event, what to do with the retired servers remains a critical question. And second, the digital preservation community who may soon see (if they haven’t already) organizations contributing web, email or file servers as important institutional records to be archived along with the contents of their filing cabinets and laptops. Given the needs of these two communities, I hope this blog post will begin a larger conversation about the how and why of web server preservation.

Khelone and Minerva

The two systems represented by those heavy, black IBM towers were a development server named “Khelone” and a production server named “Minerva.” These machines were MITH’s web servers from 2006 to 2009. When they were retired, the websites they hosted were migrated to new, more secure servers. MITH staff (primarily Greg Lord and Doug Reside) developed a transition plan where they evaluated each website on the servers for security and stability and then, based on that analysis, decided how each website should be migrated to the new servers. Some sites could be migrated in their entirety, but a number of websites had security vulnerabilities that dictated only a partial migration. (Later in the summer MITH’s Lead Developer Ed Summers will share a little more about the process of migrating dynamic websites.) After the websites had been migrated to the new servers, Khelone and Minerva, which had been hosted in the campus data center, were collected and returned to MITH. The decision to hold on to these servers showed great foresight and is what allowed me to go back six years after they had been retired and delve into MITH’s digital nooks and crannies.

Khelone & Minerva – MITH’s web servers, c. 2009

Imaging and analyzing Khelone and Minerva’s hard drives is a task I eagerly took up because I have been interested in exploring how a digital forensics approach to digital preservation would differ between personal computing hardware (desktop computers, laptops and mobile devices) and server hardware. As I suspected, what I found was that there are both hardware and software differences between the two types of systems, and these differences significantly affect how one might go about preserving a server’s digital contents. In this post I will cover four common features of server hardware that may impede the digital humanities center manager’s or digital archivist’s ability to capture disk images of a server’s hard drives (I will write more about post-imaging BitCurator tools in subsequent posts). Those features are: 1) the use of SCSI hard drives, 2) drive access without login credentials, 3) accessing data on a RAID array, and 4) working with logical volumes. If those terms don’t mean anything to you, fear not! I will do my best to explain each of them and how a digital archivist might need to work with or around them when working with data held on server hardware.

This topic is necessarily technical, but for those less interested in the technical details than the broader challenge of preserving web server infrastructure, I have included TL;DR (Too Long; Didn’t Read) summaries of each section below.

Server hardware vs. PC hardware

Workflows designed to capture born-digital content on carrier media (hard drives, floppy disks, optical disks, etc.) typically focus on collecting data from personal computing devices. This makes sense both because the majority of carrier media comes from personal computing and also because it fits the familiar framework of collecting the papers of an individual. The presumption in these workflows is that the carrier media can be removed from the computer, connected to a write blocker, and accessed via a digital forensics workstation. (See Marty Gengenbach’s paper “Mapping Digital Forensics Workflows in Collecting Institutions” for example workflows.) Once connected to the workstation, the digital archivist can capture a disk image, scan the contents of the drive for viruses, search for personally identifiable information, generate metadata reports on the drive’s contents, and ultimately transfer the disk image and associated metadata to a digital repository. The challenge posed by server hardware, however, is that if the hard drives are removed from their original hardware environment, they can become unreadable, or the data they contain may only be accessible as raw bitstreams that cannot be reconstructed into coherent files. (I’m using “servers” as a generic term for different server types such as web, email, file, etc.) These factors necessitate a different approach. Specifically, when archiving born-digital content on server hardware, the safest (and sometimes only) way to access the server’s hard drives is to do so while they are still connected to the server.

As a general rule, commercial servers and personal computers use different data bus types (the data bus is the system component that moves data from a disk drive to the CPU for processing). And unfortunately, currently available write blockers and disk enclosures/docking stations do not support the data bus type commonly used in server hardware. Over the years there have been a number of different data bus types. From the late 1980s to the early 2000s most desktop systems included hard drives that used a bus type called IDE (Integrated Device Electronics). That bus technology has since given way to Serial ATA (or SATA), which is found in almost all computers today. Servers, however, almost always included hard drives that used a SCSI data bus because of SCSI’s higher sustained data throughput. So while it’s common for a write blocker to have pinout connections for both IDE and SATA (see the WiebeTech ComboDock described in my blog post covering digital forensics hardware), there are none on the market that support SCSI devices (I’d be happily proven wrong if anyone knows of a write blocker that does). This means that if you want to capture data from a legacy server with SCSI hard drives, instead of removing the drives and connecting them to a digital forensics workstation via a write blocker—as one would do with IDE or SATA hard drives—your best, and maybe only, solution is to leave the drives in the server, boot it up, and capture the data from the running system.

There is an alternative approach that might make sense if you anticipate working with SCSI hard drives regularly. Special expansion cards (called “controller cards”) can be purchased that allow SCSI drives to connect to a typical desktop computer. Adding such a card to your digital curation workstation may be worthwhile if you anticipate processing data on SCSI hard drives on a regular basis.

TL;DR #1: Servers almost always use SCSI hard drives, not the more common IDE or SATA drives. Write blockers and disk enclosures/docking stations do not support SCSI drives, so if you want to access the data contained on a server, the best solution is to access them via the server itself.

The LiveCD Approach

How then to best access the data on hard drives still attached to the server? There are two options: The first is simply to boot up the computer, log in, and either capture a disk image or copy over selected files. The second option is what is called a “liveCD” approach. In this case you put the liveCD (or DVD as the case may be) in the drive and have the server boot up off of that disk. The system will boot into a temporary environment where all of the components of the computer are running except the hard drives. In this temporary environment you can capture disk images of the drives or mount the drives in a read-only state for appraisal and analysis.

From a digital forensics perspective, the first option is problematic, bordering on outright negligent. Booting directly from the server’s hard drives means that you will, unavoidably, write data to the drive. This may be something as seemingly innocuous as writing to the boot logs as the system starts up. But what if one of the things you wanted to find out was the last time the server had been up and running as a server? Booting from the server’s hard drive will overwrite the boot logs and replace the date of the last active use with the present date. Further, capturing a disk image from an actively mounted drive means that any running processes may be writing data to the drive during the imaging process, potentially overwriting log files and deleted files.

It may also be impossible to log into the server at all. It is common for servers to employ a “single sign on” technology where the server authenticates a user via an authentication server instead of the local system. This was the case with Khelone and Minerva, making the liveCD approach the only viable course of action.

Booting from a liveCD is a common digital forensics approach because even though the system is up and running, the hard drives are inaccessible—essentially absent from the system unless the examiner actively mounts them in the temporary environment. There are a number of Linux-based liveCDs, including BitCurator’s, which would have been the ideal liveCD for my needs. Unfortunately, the old MITH servers had CD-ROM, not DVD, drives, and the BitCurator liveCD is DVD sized at 2.5GB. An additional impediment was that BitCurator is built on the 64-bit version of Ubuntu Linux, while the Intel Xeon processors in these servers only supported 32-bit operating systems.

With BitCurator unavailable I chose to use Ubuntu Mini Remix, a version of Ubuntu Linux specifically pared down to fit on a CD. Ubuntu has an official minimal liveCD/installation CD, but it is a bit too minimal and doesn’t include some features necessary for the disk capture work I needed to do.
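To give a flavor of the capture step, here is a minimal sketch of imaging a drive from within a live environment like this one. It assumes an external USB drive is attached to receive the image; the device names (/dev/sda, /dev/sdc1) and filenames are illustrative and must be confirmed on the actual machine before running anything:

  # list the drives the live environment can see (nothing is mounted yet)
  sudo fdisk -l

  # mount external storage to hold the image
  sudo mkdir /mnt/capture
  sudo mount /dev/sdc1 /mnt/capture

  # capture a raw, sector-by-sector image of the first internal drive,
  # padding unreadable sectors so the image stays aligned
  sudo dd if=/dev/sda of=/mnt/capture/khelone-sda.dd bs=4M conv=noerror,sync

  # record a checksum so the image can be verified later
  md5sum /mnt/capture/khelone-sda.dd > /mnt/capture/khelone-sda.dd.md5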

TL;DR #2: Use a “liveCD” to gain access to legacy servers. Be careful to note the type of optical media drive (CD or DVD), whether the system can boot from a USB drive, and whether or not the server supports 64-bit operating systems, so you can choose a liveCD that will work.

Redundant Array of Inexpensive Disks (RAID)

Once the server is up and running on the liveCD, you may want to either capture a disk image or mount the hard drives (read only, of course) to appraise their content. On personal computing hardware this is fairly straightforward. However, servers have a need for both fault tolerance and speed, which leads most server manufacturers to pair multiple drives together to create what’s called a RAID, or Redundant Array of Inexpensive Disks. There are a number of different RAID types, but for the most part a RAID does one of three things:

  1. “Stripe” data between two or more disks for increased read/write times (called RAID 0)
  2. “Mirror” data across two or more disks to build redundancy (called RAID 1, see diagram below)
  3. Data striping with parity checks to prevent data loss (RAID 2-6)

 

Diagram of RAID 1 Array. Photo credit Intel

Regardless of the RAID type, in all cases a RAID makes multiple disks appear to be a single disk. This complicates the work of the digital archivist significantly, because when hard drives are configured in a RAID they may not be able to stand alone as a single disk, particularly in the case of RAID 0 where data is striped between the disks in the array.

As with SCSI hard drives, virtually all servers configure their hard drives in a RAID, and the old MITH servers were no exception. It is easy to determine whether a server’s hard drives are configured in a RAID by typing “fdisk -l”, which reads the partition information from each hard drive visible to the operating system. The “fdisk -l” command will print a table that gives details on each drive and partition. Drives that are part of a RAID will be labeled “Linux raid autodetect” under the “system” column of the table. What is less apparent is which type of RAID the server is using. To determine the RAID type I used an application called “mdadm” (used to administer Linux multiple device, or “md”, arrays), which, once downloaded and installed, revealed that the drives on Khelone and Minerva were configured in a RAID 1 (mirroring). There were four drives on each server, with each drive paired with an identical drive so that if one failed, the other would seamlessly kick in and allow the server to continue functioning without any downtime.
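As a rough sketch of that inspection process (the device names /dev/sda1, /dev/sdb1 and /dev/md0 are illustrative and will differ from server to server):

  # list partitions; RAID members show up as "Linux raid autodetect"
  sudo fdisk -l

  # examine a member partition to learn the RAID level and array UUID
  sudo mdadm --examine /dev/sda1

  # if needed, reassemble the array read-only so its contents can be appraised
  sudo mdadm --assemble --readonly /dev/md0 /dev/sda1 /dev/sdb1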

Because the drives were mirroring rather than striping data, it was possible to essentially ignore the RAID, capture a disk image of the first drive in the array, and still have a valid, readable disk image. This is only the case, however, when dealing with RAID 1 (mirroring). If a server employs a RAID type that stripes data, such as RAID 0, then you must reassemble the full RAID in order to have access to the data. If you image a single drive from a RAID 0, you essentially get only half of the data, resulting in so many meaningless ones and zeros.

TL;DR #3: Servers frequently deploy a redundancy technology called RAID to ensure the seamless rollover from a failed drive to a backup. When creating disk images of server hard drives one must first identify the type of RAID being used and from that information determine whether to capture a disk image of the raw device (the unmounted hard drive), or reassemble the RAID and capture the disk image from the disks running in the RAID.

Logical Volumes

The fourth and final server technology I’ll discuss in this post is what is called a “logical volume.” For simplicity’s sake, I’ll draw a distinction between a “physical volume” and a “logical volume.” A physical volume would be all the space on a single drive; its size (its volume) is limited by its physical capacity. If I connect a physical drive to my computer and format it, its volume will only ever be what its physical capacity allows. A logical volume, by comparison, has the capacity to span multiple drives to create a single, umbrella-like volume. The volume’s size can then be expanded by adding additional drives to the logical volume (see the diagram below). In practice this means that, like a RAID array, a logical volume allows the user to connect multiple hard disks to a system but have the operating system treat them as a single drive. This capability allows the user to add space to the logical volume on an as-needed basis, so server administrators create logical volumes on servers where they anticipate the need to add more drive space in the future.

Diagram of a Logical Volume. Image credit The Fedora Project

The hard drives on Minerva, the production server, were configured in a logical volume as well as a RAID. This meant that in addition to reconstructing the RAID with mdadm, I had to download LVM (the Logical Volume Manager) tools, which I then used to mount and access the contents of the logical volume. While I generally advocate the use of disk images for preservation, in this case it may make more sense to capture a logical copy of the data (that is, just the data visible to the operating system and not a complete bitstream). The reason for this is that in order to access the data from a disk image, you must once again use LVM to mount the logical volume, and this additional step may be difficult for future users. It is, however, possible to mount a logical volume contained on a disk image in BitCurator, which I’ll detail in a subsequent post.
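A sketch of that sequence from the live environment might look like the following; the volume group and logical volume names shown here are illustrative, and the scan commands report the actual names to use:

  # scan for physical volumes, volume groups, and logical volumes
  sudo pvscan
  sudo vgscan
  sudo lvscan

  # activate the volume group reported by the scans
  sudo vgchange -ay vg_minerva

  # mount a logical volume read-only for appraisal
  sudo mkdir /mnt/minerva
  sudo mount -o ro /dev/vg_minerva/lv_root /mnt/minerva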

TL;DR #4: A logical volume is a means of making multiple drives appear to be a single volume. If a server employs a logical volume, digital archivists should take that fact into account when they decide whether to capture a forensic disk image or a logical copy of the data on the drive.

Revisiting Digital MITH

So why go through this? Why heft these old servers out of the storage closet and examine images of systems long-since retired? For me and for MITH it comes down to understanding our digital spaces much like our physical spaces. We have all had that moment when, for whatever reason, we find ourselves digging through boxes for that thing we think we need, only to find that thing we had forgotten about, but in fact need more than whatever it was we started digging for in the first place. So it was with Khelone and Minerva; what began as something of a test platform for BitCurator tools opened up a window to the past, to digital MITH circa 2009. And like their physical corollary, Khelone and Minerva were full of nooks and crannies. These servers hosted course pages for classes taught by past and present University of Maryland faculty, the personal websites of MITH staff, and versions of all the websites hosted by MITH in 2009, including those that weren’t able to be migrated to the new servers due to the security concerns mentioned above. In short, these servers were a snapshot of MITH in 2009—the technologies they were using, the fellows they were working with, the projects they were undertaking, and more. In this case the whole was truly greater than the sum of its parts (or the sum of its hosted websites). These servers—now accessible to MITH’s current managers as disk images—are a digital space captured in time that tell us as much about MITH as they do about any of the projects the servers hosted.

Understanding web servers in this way has significant implications for digital humanities centers and how they preserve project websites as well as their own institutional history. An atomized approach to preserving project websites decontextualizes them from the center’s oeuvre. Any effort to capture a representative institutional history must demonstrate the interrelated web of projects that define the center’s scholarship. Elements such as overlaps between participants, technologies that were shared or expanded upon between projects, funder relationships, and project partnerships, to name a few, form a network of relationships that is visible when websites are viewed in their original server context. However, this network becomes harder to see the moment a website is divorced from its server. In MITH’s case, examination of these disk images showed the planning and care with which staff had migrated individual digital projects. Nonetheless, having this additional, holistic view of previous systems is a welcome new capability. It is perhaps asking too much of a DH center to spend already limited system administration resources creating, say, biannual disk images of their web servers. However, when those inevitable moments of server migration and transition come about, they can be approached as an opportunity to capture a digital iteration of the center and its work, and in so doing, hold on to that context.

From a digital preservation perspective, I believe that we need to adopt a holistic approach to archiving web servers. When we capture a website, we have a website, and if we capture a bunch of websites we have… a bunch of websites, which is fine. But a web server is a different species; it is a digital space that at any given moment can tell us as much about an organization as a visit to their physical location. A disk image of an institution’s web server, then, is more than just a place to put websites—it is a snapshot of the organization’s history that is, in its way, still very much alive.

Stewarding Digital Humanities Work on the Web at MITH
https://mith.umd.edu/stewarding-digital-humanities-work-on-the-web-at-mith/
Mon, 15 Jun 2015 10:00:25 +0000

A digital humanities center is nothing if not a site of constant motion: staff, directors, fellows, projects, partners, tools, technologies, resources, and (innumerable) best practices all change over time, sometimes in quite unpredictable ways. As small, partly or wholly soft-funded units whose missions involve research, or teaching, or anchoring a local interest community, digital humanities centers face fundamental challenges involving the long-term digital stewardship of the work they help to produce.

The importance of stewarding digital scholarship will only grow and the work will need to be shared by the entire digital humanities community. Founded sixteen years ago in 1999, MITH is proud of the way it has faced and continues to face these challenges. We would like to take this opportunity to document our practices in a series of blog posts, beginning with this one, in the hope of providing a clear and potentially useful record of our principles for digital stewardship, the issues we’ve faced, and our practices for dealing with them.

In this initial post, we’ll provide an overview of the actions MITH has taken to steward the variety of digital humanities work created here. In doing so, we’ll articulate the underlying principles that have guided our decisions and present some key lessons we’ve learned. Finally, we’ll point out some areas where further work is needed by stakeholders across the wider digital humanities community.

What are we stewarding?

To document digital stewardship practices effectively, we need to be as clear as possible about what we are stewarding. In MITH’s case, we are concerned with actions taken to care for and manage over time the digital humanities work publicly shared by staff, students, faculty, librarians and others who have been affiliated with MITH. Given the era of MITH’s founding, the World Wide Web has been the chief means for making such work public. So as to have a compact way of referring to “digital humanities work made publicly available via the technologies of the web”, we will use the term “website” throughout this initial discussion of MITH’s stewardship activities. However, we purposefully understand the referents of “website” to encompass things ranging from collections of a few hyperlinked documents to complex applications and virtual worlds. In other contexts, these things might also be discussed as “databases”, “digital archives”, “tools”, or “projects”. (In other words, the use of the term “website” is a convenience based on the fact that most of the digital work we’re interested in was delivered using the web.) We acknowledge, furthermore, that the “websites” we are stewarding may represent or document the bulk of the intellectual, emotional, and technical labor of a person or a project, or they may represent very little of that labor—like the varying portions of icebergs buoyant enough to rise above the water line.

At one level, then, the digital humanities work we’re discussing comprises aggregations of computer files: documents and data in various formats, layers of interlocking software, and so on. At another level, this work comprises tacit knowledge of its creators—including knowledge located in the interrelationships between people. This work also comprises elements of how it has been received since it was made public—its position in networks of (hyper-) links and citations, for example. In discussing stewardship we mean the process of assessing, accounting for, and making decisions about how to represent all of these elements of the digital work people affiliated with MITH have created, within the constraints of available resources and from whatever situated vantage point each of us inevitably occupies.

Principle: active records and archival records

One important principle for MITH’s stewardship activities is acknowledging that we manage digital materials of two different major types. To borrow from the parlance of archives and records management, we care for digital humanities websites as both “active records” and “archival records”. By active records, we mean websites that are still in regular use by creators or their successors to do digital humanities work. Such use could involve adding additional data, sharing with students in an ongoing course, or incorporating a site into a new project or publication. By archival records, we generally mean websites that are no longer the active project of the scholars who created them—for instance, if they are no longer being updated with new data—but which have ongoing value and interest from elements of the community beyond the original scholar or team. The actual age of a particular resource is not significant in this distinction: older work is not automatically “archival,” just as recent work can be active briefly and become “archival” in a short amount of time (work related to events like conferences is a good example).

One important implication that follows from our decision to think of MITH’s digital materials as records with active and archival modes is that this principle makes explicit the balance a digital humanities center must strike. Websites have a lifecycle, and different choices and actions are appropriate at different stages of that lifecycle. On the one hand, for websites as active records, we want to maintain working systems in place to enable digital humanities work by specific members of our community with whom we have ongoing collaborative relationships. For websites as archival records, on the other hand, our aim may be less to manage active systems than to see that representations of earlier work are preserved in some form as evidence of people’s scholarship and of MITH’s own history.

MITH’s stewardship in practice

The rubric of stewardship that MITH has outlined for itself encompasses both active management and digital preservation. The locus of this discussion of specific stewardship actions will be the MITH servers—as institutional spaces into which, over time, digital materials were transferred so as to be made public as part of digital humanities work.

Systems Administration and “Backups”

MITH has, for many years, as part of our active management, “backed up” the contents of our running server machines to guard against sudden data loss. At present, MITH pays for a service hosted by the University of Maryland Division of Information Technology for this purpose. These backups guard against data loss from a short-term failure of the servers—so that the latest copy of the full system can be restored to a recent running state. This backup service also ensures that copies of the data on MITH’s machines are duplicated at an additional location on the University of Maryland campus and to magnetic tape for storage at a geographically remote location from the campus (as current best practice suggests).

Active management to support digital materials necessarily involves change—for one thing, physical machines wear out and become obsolete. The longer a website’s life, the more certain it is that those digital materials will need to be migrated from one system to another in response to the obsolescence of the underlying physical machines. Over its sixteen-year history, MITH has used a succession of physical machines as servers. As machines approached their end of life, MITH would migrate the content of servers—the digital materials over which MITH exercised stewardship—to new hardware. As an example of this process, and in the interest of precision, we’ll discuss actions undertaken on two generations of such systems: machines that hosted MITH’s projects from 2006 to 2009, and those that have hosted MITH’s projects since 2009. The two physical servers that hosted MITH projects from 2006 to 2009 also contain data representing an older generation of machines—copies of MITH’s digital work back to 2003. Given that the outgoing and the successor machines have run versions of the same operating system, migration refers to copying those parts of the system where users (rather than system utilities) have stored data. These include file directories associated with the accounts of people who were granted access to MITH machines, file directories utilized by the web server (containing the HTML and related documents and source code for various websites), as well as databases and configuration files. The copying—migration—of these key files was part of the management of the MITH servers as active systems. The goal was to replicate functionality from one generation of physical hardware to the next. As part of server migrations, MITH also created disk images, that is, complete copies, of the servers’ hard drives, and took other actions (about which we’ll say more below) to help ensure preservation of data.
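For readers who want a concrete picture, a minimal sketch of that kind of copying on a Linux server might look like the lines below. The hostname, paths, and the assumption that the sites used MySQL databases are all illustrative rather than a record of MITH’s actual procedure:

  # copy user home directories and the web root to the successor machine
  rsync -az /home/ newserver:/home/
  rsync -az /var/www/ newserver:/var/www/

  # dump the databases so they can be reloaded on the new server (credentials omitted)
  mysqldump --all-databases > /tmp/all-databases.sql
  rsync -az /tmp/all-databases.sql newserver:/tmp/

  # copy key configuration files, for example the web server's
  rsync -az /etc/apache2/ newserver:/etc/apache2/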

For the majority of the digital work created at MITH, this represents the complete story. Basic active management has kept these websites, some quite old, online.

Security

The life stories of digital materials can become more complex for a number of reasons. Security problems are one complication facing stewards of digital materials. In the case of digital humanities websites, the most common security problems involve SQL injection attacks, in which interactive components of a site can be compromised to run malicious software code, often for the purpose of sending spam messages or disrupting regular network traffic. MITH has direct experience with these challenges. At the request of the University of Maryland’s network administrators, and with assistance from staff of the Division of Information Technology and paid external computer security consultants, MITH conducted a thorough review of all its projects in 2009 after several years of progressively increasing instability due to security problems. This review was conducted as part of migrating to new, clean systems, and some sites were deemed too insecure to return to full, online availability. Compressed archives of projects, even those with security vulnerabilities, were copied to new machines and external storage media. No files were ever deleted, but the process of responding to security concerns nonetheless created a category of projects that could not simply be reproduced online following the server migrations. These projects have needed to be managed in alternative ways. By evaluating some sites as archival records, not solely as active projects, MITH staff decided to re-mount static resources—HTML pages, images, and so on—without the accompanying databases, to lessen future security risk while preserving evidence of these projects. The 2008 Digital Diasporas Symposium is one example of this approach. A database had been part of the original site for the purpose of accepting registration information but no longer needed to be online after the end of the conference. In other cases, addressing security or other problems would require making choices or potentially altering features of sites. Where a site was no longer the active project of the scholar who created it (she/he has “completed” it and moved on), MITH has decided not to make substantial changes unilaterally.

Migrations and Transfers

Digital materials have also sometimes been migrated to other systems not managed by MITH, for example because a project director moved to another institution and wanted to transfer stewardship of the project. As a principle of stewardship, we believe that it is acceptable—even necessary—for the responsibility for digital work to change over time. This principle entails sharing copies of data with project creators and potentially others. At MITH, we consider sharing complete copies of digital materials (including the databases and all the other components) part of our stewardship in three specific ways: first, so that project creators have additional backup copies of their own, to manage as they see fit; second, so that those who request access can view materials from sites with security vulnerabilities offline; and third, so that stewardship of a site can be transferred to a new institution. MITH has made a practice of offering copies of digital material for personal backups when a funding period (for externally-supported projects), or a fellowship (for internally-funded projects), ends. Also, in 2007, those who had produced digital work at MITH up to that time were offered copies of their data to take for their own personal use. In these cases, project creators are provided with compressed archives of files copied from the MITH servers and MITH also retains copies. In the second case, of materials from sites that are offline due to security vulnerabilities, we have provided copies of files in response to specific requests on an ad hoc basis. When project directors or others choose to take over responsibility for a particular website, thereby ending MITH’s involvement in its active and ongoing management, MITH will link to or redirect traffic to the new versions of a digital project while also retaining copies of the earlier data created at MITH for preservation.

Collaborative Work

MITH’s experience with providing access copies of digital projects suggests issues that the community as a whole needs to grapple with further. As a principle, project creators should have access to their own work to use or build upon. Yet, almost all digital humanities projects are works of collaborative authorship and represent the efforts of not only project directors but staff members, graduate students, and others. Best practices for project charters probably need to indicate who can make decisions over the long-term. It is unwieldy to ask, long afterwards, all the graduate students who have worked on a project, for example, how to handle providing copies of materials but, at the same time, MITH takes seriously principles of respect for all contributors, as expressed in initiatives like the Collaborators’ Bill of Rights. When third parties, not the original creators, want complete copies of a site the prospects become more complex yet. Solutions to many of these kind of conundrums involve using open licenses for content and source code. Other conundrums may be trickier—if materials are being transferred to someone other than the original creator should project files be reviewed, as is the process in many analog archives, so that personal information (if any) or database credentials could be removed or reset?

Repositories and Archives

In addition to managing digital materials as part of active servers, MITH has taken steps to document projects created here as part of a broader stewardship strategy that also includes digital preservation activities. In 2012, MITH staff deposited a collection of materials with the institutional repository for the University of Maryland and we continue to make this a part of our project process. At around the same time, MITH moved its physical offices. In the course of the move, staff deposited boxes of physical materials documenting MITH’s activities with the University Archives. Also in 2012, MITH staff worked with the University of Maryland Libraries’ Special Collections and University Archives to ensure that internet addresses where most MITH projects are found would be regularly crawled by a web archiving tool (Archive-It) paid for through the Libraries’ collection budget. By adding an active partnership with the Libraries to MITH’s strategy we could be confident that our specific content would be collected and that collection would occur more regularly than could be expected from waiting for archiving software operated by the Internet Archive and other organizations to visit MITH sites.

To be sure, web archive versions of sites are not always identical to the original ones but we think they represent an important element of digital stewardship planning. Maintaining a live website, keeping it online and accessible at its original location with its complete original functionality, is not digital preservation but active management. A stewardship strategy predicated entirely on active management is unsustainable. For one, such a strategy is too expensive and labor-intensive given the limited resources of a digital humanities center such as MITH. Larger memory organizations such as libraries should aim to collect widely and reflect the diversity of practice of digital humanities work. The economies of collecting and preserving such work at scale also militate against stewardship strategies that depend on managing each digital humanities website (only) as an active system. Finally, active management of digital humanities websites, where they are embedded in a working web server, exposes them to ongoing risk of corruption and human error. We cannot ignore, for the convenience of present use, the aspect of preservation that involves removing materials from their original situation and relocating them where actions can be taken for their long term survival even if this entails changing the experience of using them. A serious discussion of digital stewardship must incorporate consideration of how best to sunset projects as active sites. For all these reasons, the role of web archives as they relate to the future use and preservation of digital humanities work is an area where there is much work still to do—and may be the subject of a future post.

Additional Digital Preservation Strategies

Finally, as we mentioned above, MITH has collaborated on a number of important digital preservation research projects that have shaped how we preserve MITH's own outputs. First, this research agenda has helped us recruit staff, students, and faculty interested in digital preservation challenges. MITH faculty and staff have authored books, articles, conference papers, and reports; convened summit meetings; organized conferences; led training institutes; built tools; disseminated resources; and spoken widely on the importance of digital preservation and data curation. Second, by working on these digital preservation research projects, people at MITH have gained particular technical skills and conceptual approaches to problems. We'll offer just one brief example of how this has played out. Though the physical machines that immediately preceded our current generation of servers were decommissioned about six years ago, MITH has retained that hardware, as well as various internal and external hard drives from machines MITH staff and faculty have used. Credit this decision to the habits of thought and practice that MITH's engagement with digital preservation has cultivated. Our work on digital preservation has led us to consider the physical nature of even things like websites, and to consider how original hardware may represent a crucial element for preservation. A few months ago, MITH asked Porter Olsen, former Community Lead of the BitCurator project and a Graduate Assistant at MITH, to apply skills he learned from his work here to help us evaluate whether new tools and capabilities could be applied to curating and preserving digital humanities materials from systems like MITH's old web servers. We know from our research and our experience that, no matter how thoughtfully we've crafted policies and procedures and no matter how carefully we've acted, digital stewardship is a complex and challenging endeavor that requires us to keep learning and trying to improve.
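One modest example of the habits this work instills is routine fixity checking of retained data. The Python sketch below writes a checksum manifest for a directory of files; the directory name is hypothetical, and this is not the BitCurator workflow itself, only the kind of baseline record that makes later verification of disk images and project data possible.

import csv
import hashlib
from pathlib import Path

def sha256(path, chunk_size=1 << 20):
    """Compute a SHA-256 checksum without loading the whole file into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_manifest(data_dir, manifest_path):
    """Record path, size, and checksum for every file under data_dir."""
    with open(manifest_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["path", "bytes", "sha256"])
        for path in sorted(Path(data_dir).rglob("*")):
            if path.is_file():
                writer.writerow([str(path), path.stat().st_size, sha256(path)])

if __name__ == "__main__":
    # "server-images" is a hypothetical directory of retained disk images and project data.
    write_manifest("server-images", "manifest.csv")

Re-running a script like this later and comparing manifests is a simple way to confirm that retained drives and images have not silently changed.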

In the next post in the series, Porter will describe in more detail how he was able to use the BitCurator tools, which MITH helped develop, to collect and preserve additional information. In other future posts, Stephanie Sapienza, MITH’s Project Manager, will discuss work she’s been leading to revamp the section of our website where we document all of MITH’s work in order to make it easier to find information about previous projects. MITH’s Lead Developer Ed Summers will post some lessons learned about best practices for migrating complex and dynamic websites. And, since preserving computing hardware as well as software figures in the story of MITH’s digital stewardship practices thus far, we’ll also consider the preservation implications of the increasing move to “cloud” computing.

(Provisional) conclusions

There are no ready-made solutions and no repositories ready to accept many of the kinds of complex objects digital humanists produce, so every stewardship strategy falls somewhere along a spectrum of benefits and tradeoffs. Complex work carried out by multiple people over long spans of time, during which our collective knowledge of best practices was itself evolving, is difficult to judge according to a binary of success or failure, presence or erasure. Attempting to do so vitiates the value of open and detailed discussion of this work for a community genuinely interested in preserving the fullest history of digital humanities work. It also runs counter to the conclusions generated by MITH's own active research in digital preservation and to the theoretical and methodological writings of several of its staff and administration.

We would not claim that this process of caring for the range of MITH’s digital humanities websites has been flawlessly executed at every step. During server migrations, in particular, sites have sometimes gone offline due to a misconfiguration of one kind or another. When we discover these errors or when they have been pointed out to us, we’ve investigated and restored the online availability of materials where possible. Where this has not been possible because earlier work would need new investment to fix security vulnerabilities or make other substantial changes, we have documented projects and retained data—and even hardware. One thing the history of MITH’s digital work should suggest is that there are important distinctions between preserving data, maintaining or sustaining specific computing systems, and providing varying levels of access (online vs. offline, original vs. migrated).

The expectations of access for archival material differ from those for active material. Just as with analog collections, it may be necessary to find archival "websites" in different locations than their active counterparts; archival websites might take slightly different forms; additional requests or effort might be necessary to access them. We at MITH recognize that it is frustrating to find that digital work once available on the web is offline, or that the experience of working with it has changed. As a community, we have tended to expect that digital work is either present online or simply gone. At MITH, where we have digital work that is preserved but not online, this ingrained expectation is a challenge. For more and more digital work, we think there must be a middle, archival way. How do we begin to incorporate it into our practices in satisfying ways? Most researchers have encountered analog collections that are unprocessed, or materials that are out for repair and preservation. At the moment, this is perhaps the best comparison for the state of a few of MITH's digital outputs, yet we are continuously working to develop and improve our own practices of curation and preservation.

At MITH, we have chosen a stewardship strategy that entails both actively managing systems that still run a variety of websites and taking actions to preserve copies and alternate representations of those sites: retaining compressed offline archives of project data, capturing the sites through web archiving, and depositing supplementary materials and documentation with both our (digital) institutional repository and our (analog) university archives.

We value the tradition of work that has been accomplished at MITH over the last 16 years. Digital humanities has a history, and indeed multiple histories, not just in terms of the intellectual pedigree of the phrase or the concept, but in the legacy of material work that has been performed in its name. MITH's research has addressed issues of gender and diversity, among other core humanist concerns. MITH has also attracted and recruited people interested in digital preservation, and has actively collaborated on a range of digital preservation projects and initiatives. These intertwined aspects of our work have particular resonance now, given that digital humanities practitioners are placing increased emphasis on recovering narratives and origin stories for the field that are more diverse than some stakeholders seem interested in acknowledging. Stewarding the collective history of work at MITH thus helps make visible the diverse history of the digital humanities.

Click here to access subsequent posts in our series on this topic.

The post Stewarding Digital Humanities Work on the Web at MITH appeared first on Maryland Institute for Technology in the Humanities.
