Porter Olsen – Maryland Institute for Technology in the Humanities
https://mith.umd.edu

Hacking MITH’s Legacy Web Servers: A Holistic Approach to Preservation on the Web
https://mith.umd.edu/hacking-miths-legacy-servers/
Wed, 08 Jul 2015

Editor’s note— This is the second post in MITH’s series on stewarding digital humanities scholarship.

In September of 2012 MITH moved from its long-time home in the basement of the McKeldin Library on the University of Maryland campus to a newly renovated, and considerably better lit, location next to Library Media Services in the Hornbake Library. If you’ve had a chance to visit MITH’s current location, then you’ve likely noticed its modern, open, and spacious design. And yet, for all its comforts, for all its natural light streaming in from the windows that comprise its northern wall, I still find myself missing our dark corner of the McKeldin basement from time to time: its cubicles, its cramped breakroom, Matthew Kirschenbaum’s cave-like office with frankensteinian hardware filling every square inch, and especially its oddly shaped conference room, packed to the gills and overflowing into the hallway every Tuesday at 12:30 for Digital Dialogues.

In preparation for the move, we delved into those nooks and crannies to inventory the computers and other equipment that had accumulated over the years. MITH is a place that is interested in the materiality of computing—and it shows. Boxes of old media could be found under one cubicle, while a stack of keyboards—whose corresponding computers had long since been recycled—could be found under another. A media cabinet near the entrance contained a variety of game systems from the Preserving Virtual Worlds project, and legacy computer systems, now proudly displayed on MITH’s “spline,” jockeyed for pride of place on the breakroom table. Tucked away here and there in MITH’s McKeldin offices was a host of retired computer hardware—old enough to have been replaced, but still too modern to merit a listing in MITH’s Vintage Computers collection. Among these systems were two large, black IBM server towers, twice the size and weight of your typical PC. As I lugged them onto the cart in preparation for the move, I couldn’t help but wonder what these servers had been used for, when and why they had been retired, and what data might still be recoverable from them.

A few months ago, I got the chance to find out when I was asked to capture disk images of these servers (a disk image is a sector-by-sector copy of all the data that reside on a storage medium such as a hard disk or CD-ROM). Disk images of these servers would significantly increase access to the data they contained, and also make it possible for MITH to analyze them using the BitCurator suite of digital forensics tools developed by MITH and UNC’s School of Information and Library Science. With BitCurator we would be able to, among other things, identify deleted or hidden files, search the disk images for personal or sensitive information, and generate human- and machine-readable reports detailing file types, checksums, and file sizes. The tools and capabilities provided by BitCurator offered MITH an opportunity to revisit its legacy web servers and retrospectively consider preservation decisions in a way that was not possible before. However, before we could conduct any such analysis, we first had to be able to access the data on the servers and capture that data in disk image form. It is that specific challenge, accessing and imaging the servers’ hard drives, that I want to focus on in this blog post.

As Trevor described in his recent post on MITH’s digital preservation practices, servers are one important site where complex institutional decisions about digital curation and preservation are played out. As such, I see two communities that have a vested interest in understanding the particular preservation challenges posed by server hardware. First, the digital humanities community, where DH centers routinely host their own web content and will inevitably need to consider migration and preservation strategies when transitioning web infrastructure. These transitions may come in the form of upgrading to newer, more powerful web servers, or migrating from self-hosted servers to cloud-based virtual servers. In either event, what to do with the retired servers remains a critical question. And second, the digital preservation community, who may soon see (if they haven’t already) organizations contributing web, email, or file servers as important institutional records to be archived along with the contents of their filing cabinets and laptops. Given the needs of these two communities, I hope this blog post will begin a larger conversation about the how and why of web server preservation.

Khelone and Minerva

The two systems represented by those heavy, black IBM towers were a development server named “Khelone” and a production server named “Minerva.” These machines were MITH’s web servers from 2006 to 2009. When they were retired, the websites they hosted were migrated to new, more secure servers. MITH staff (primarily Greg Lord and Doug Reside) developed a transition plan where they evaluated each website on the servers for security and stability and then, based on that analysis, decided how each website should be migrated to the new servers. Some sites could be migrated in their entirety, but a number of websites had security vulnerabilities that dictated only a partial migration. (Later in the summer MITH’s Lead Developer Ed Summers will share a little more about the process of migrating dynamic websites.) After the websites had been migrated to the new servers, Khelone and Minerva, which had been hosted in the campus data center, were collected and returned to MITH. The decision to hold on to these servers showed great foresight and is what allowed me to go back six years after they had been retired and delve into MITH’s digital nooks and crannies.

Khelone & Minerva – MITH’s web servers, c. 2009

Imaging and analyzing Khelone and Minerva’s hard drives is a task I eagerly took up because I have been interested in exploring how a digital forensics approach to digital preservation would differ between personal computing hardware (desktop computers, laptops, and mobile devices) and server hardware. As I suspected, what I found was that there are both hardware and software differences between the two types of systems, and these differences significantly affect how one might go about preserving a server’s digital contents. In this post I will cover four common features of server hardware that may impede the digital humanities center manager’s or digital archivist’s ability to capture disk images of a server’s hard drives (I will write more about post-imaging BitCurator tools in subsequent posts). Those features are: 1) the use of SCSI hard drives, 2) drive access without login credentials, 3) accessing data on a RAID array, and 4) working with logical volumes. If those terms don’t mean anything to you, fear not! I will do my best to explain each of them and how a digital archivist might need to work with or around them when capturing data held on server hardware.

This topic is necessarily technical, but for those less interested in the technical details than the broader challenge of preserving web server infrastructure, I have included TL;DR (Too Long; Didn’t Read) summaries of each section below.

Server hardware vs. PC hardware

Workflows designed to capture born-digital content on carrier media (hard drives, floppy disks, optical disks, etc.) typically focus on collecting data from personal computing devices. This makes sense both because the majority of carrier media comes from personal computing and also because it fits the familiar framework of collecting the papers of an individual. The presumption in these workflows is that the carrier media can be removed from the computer, connected to a write blocker, and accessed via a digital forensics workstation. (See Marty Gengenbach’s paper “Mapping Digital Forensics Workflows in Collecting Institutions” for example workflows.) Once connected to the workstation, the digital archivist can capture a disk image, scan the contents of the drive for viruses, search for personally identifiable information, generate metadata reports on the drive’s contents, and ultimately transfer the disk image and associated metadata to a digital repository. The challenge posed by server hardware, however, is that if the hard drives are removed from their original hardware environment, they can become unreadable, or the data they contain may only be accessible as raw bitstreams that cannot be reconstructed into coherent files. (I’m using “servers” as a generic term for different server types such as web, email, file, etc.) These factors necessitate a different approach: when archiving born-digital content on server hardware, the safest (and sometimes only) way to access the server’s hard drives is while they are still connected to the server.
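As a rough sketch of that personal-computer workflow, the commands below show the imaging and fixity steps as they might look on a Linux forensics workstation. This is an illustration only: the device name /dev/sdb and the file names are hypothetical, and the source drive is assumed to be attached through a hardware write blocker.

  # Source drive attached through a write blocker, visible as /dev/sdb (hypothetical)
  sudo dd if=/dev/sdb of=accession-001.dd bs=4M conv=noerror,sync
  md5sum accession-001.dd > accession-001.dd.md5   # record fixity for the captured image
  # Virus scans, PII searches, and metadata reports would then be run against the image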

As a general rule, commercial servers and personal computers use different data bus types (the data bus is the system component that moves data from a disk drive to the CPU for processing). And unfortunately, currently available write blockers and disk enclosures/docking stations do not support the data bus type commonly used in server hardware. Over the years there have been a number of different data bus types. From the late 1980s to the early 2000s most desktop systems included hard drives that used a bus type called IDE (Integrated Device Electronics). That bus technology has since given way to Serial ATA (or SATA), which is found in almost all computers today. However, servers almost always included hard drives that used a SCSI data bus because of SCSI’s higher sustained data throughput. So, while it’s common for a write blocker to have pinout connections for both IDE and SATA (see the WiebeTech ComboDock described in my blog post covering digital forensics hardware), there are none on the market that support any type of SCSI device (I’d be happily proven wrong if anyone knows of a write blocker that supports SCSI devices). This means that if you want to capture data from a legacy server with SCSI hard drives, instead of removing the drives and connecting them to a digital forensics workstation via a write blocker—as one would do with IDE or SATA hard drives—your best, and maybe only, solution is to leave the drives in the server, boot it up, and capture the data from the running system.

There is an alternative approach that might make sense if you anticipate working with SCSI hard drives regularly: special expansion cards (called “controller cards”) can be purchased that allow SCSI drives to connect to a typical desktop computer. Adding such a card to your digital curation workstation is worth considering if you expect to process data from SCSI hard drives on a regular basis.

TL;DR #1: Servers almost always use SCSI hard drives, not the more common IDE or SATA drives. Write blockers and disk enclosures/docking stations do not support SCSI drives, so if you want to access the data contained on a server, the best solution is to access the drives via the server itself.

The LiveCD Approach

How then to best access the data on hard drives still attached to the server? There are two options: The first is simply to boot up the computer, log in, and either capture a disk image or copy over selected files. The second option is what is called a “liveCD” approach. In this case you put the liveCD (or DVD as the case may be) in the drive and have the server boot up off of that disk. The system will boot into a temporary environment where all of the components of the computer are running except the hard drives. In this temporary environment you can capture disk images of the drives or mount the drives in a read-only state for appraisal and analysis.
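To make the liveCD approach concrete, here is a minimal sketch of imaging and read-only appraisal from the temporary environment. The device names (/dev/sda for a server drive, /dev/sdc1 for an external target drive) and the mount points are hypothetical and would need to be adjusted to the system at hand.

  # Run from the liveCD's temporary environment; device names are hypothetical
  sudo mkdir -p /mnt/target /mnt/appraisal
  sudo mount /dev/sdc1 /mnt/target                 # external drive that will hold the image
  sudo dd if=/dev/sda of=/mnt/target/server-sda.dd bs=4M conv=noerror,sync
  sudo mount -o ro /dev/sda1 /mnt/appraisal        # or mount a partition read-only for appraisal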

From a digital forensics perspective, the first option is problematic, bordering on outright negligent. Booting directly from the server’s hard drives means that you will, unavoidably, write data to the drive. This may be something as seemingly innocuous as writing to the boot logs as the system starts up. But what if one of the things you wanted to find out was the last time the server had been up and running as a server? Booting from the server’s hard drive will overwrite the boot logs and replace the date of the last active use with the present date. Further, capturing a disk image from an actively mounted drive means that any running processes may be writing data to the drive during the imaging process, potentially overwriting log files and deleted files.

It may also be impossible to log into the server at all. It is common for servers to employ a “single sign-on” technology where the server authenticates a user via an authentication server instead of the local system. This was the case with Khelone and Minerva, making the liveCD approach the only viable course of action.

Booting from a liveCD is a common digital forensics approach because even though the system is up and running, the hard drives are inaccessible—essentially absent from the system unless the examiner actively mounts them in the temporary environment. There are a number of Linux-based liveCDs, including BitCurator’s, which would have been the ideal choice for my needs. Unfortunately, the old MITH servers had CD-ROM, not DVD, drives, and the BitCurator liveCD is DVD-sized at 2.5GB. An additional impediment was that BitCurator is built on the 64-bit version of Ubuntu Linux, while the Intel Xeon processors in these servers only supported 32-bit operating systems.

With BitCurator unavailable, I chose to use Ubuntu Mini Remix, a version of Ubuntu Linux specifically pared down to fit on a CD. Ubuntu has an official minimal liveCD/installation CD, but it is a bit too minimal and doesn’t include some features necessary for the disk capture work I needed to do.
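Before settling on a liveCD, it is worth checking what the target hardware can support. The two checks below are a sketch that assumes you can get some minimal Linux environment running on the server (or on an identical machine); the output will vary by system.

  grep -q ' lm ' /proc/cpuinfo && echo "64-bit capable CPU" || echo "32-bit only"
  grep 'Can read DVD' /proc/sys/dev/cdrom/info     # does the optical drive read DVDs?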

TL;DR #2: Use a “liveCD” to gain access to legacy servers. Be careful to note the type of optical media drive (CD or DVD), whether the system can boot from a USB drive, and whether or not the server supports 64-bit operating systems so you can choose a liveCD that will work.

Redundant Array of Inexpensive Disks (RAID)

Once the server is up and running on the liveCD, you may want to either capture a disk image or mount the hard drives (read-only, of course) to appraise their content. On personal computing hardware this is fairly straightforward. However, servers need both fault tolerance and speed, which leads most server manufacturers to combine multiple drives to create what’s called a RAID, or Redundant Array of Inexpensive Disks. There are a number of different RAID types, but for the most part a RAID does one of three things:

  1. “Stripe” data between two or more disks for increased read/write speeds (called RAID 0)
  2. “Mirror” data across two or more disks to build redundancy (called RAID 1, see diagram below)
  3. Stripe data with parity checks to prevent data loss (RAID 2-6)

 

Diagram of RAID 1 Array. Image credit: Intel

Regardless of the RAID type, a RAID makes multiple disks appear to be a single disk. This complicates the work of the digital archivist significantly because when hard drives are configured in a RAID, they may not be able to stand alone as a single disk, particularly in the case of RAID 0, where data is striped between the disks in the array.

As with SCSI hard drives, virtually all servers configure their hard drives in a RAID, and the old MITH servers were no exception. It is easy to determine whether a server has hard drives configured in a RAID by typing “fdisk -l”, which reads the partition information from each hard drive visible to the operating system. The “fdisk -l” command will print a table that gives details on each drive and partition. Drives that are part of a RAID will be labeled “Linux raid autodetect” under the “System” column of the table. What is less apparent is which type of RAID the server is using. To determine the RAID type, I used an application called “mdadm” (Multi Disk Administration), which, once downloaded and installed, revealed that the drives on Khelone and Minerva were configured in a RAID 1 (mirroring). There were four drives on each server, with each drive paired with an identical drive so that if one failed, the other would seamlessly kick in and allow the server to continue functioning without any downtime.
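As a minimal sketch of those two checks, with hypothetical device names (the partitions to examine will depend on what fdisk reports):

  sudo fdisk -l                                    # look for partitions labeled "Linux raid autodetect"
  sudo mdadm --examine /dev/sda1                   # prints the RAID superblock, including the RAID level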

Because the drives were mirroring rather than striping data, it was possible to essentially ignore the RAID, capture a disk image of the first of the drives that constitute the array, and still have a valid, readable disk image. This is only the case with RAID 1 (mirroring), however. If a server employs a RAID type that stripes data, such as RAID 0, then you must reassemble the full RAID in order to have access to the data. If you image a single drive from a RAID 0, you essentially get only half of the data, resulting in so many meaningless ones and zeros.
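For arrays that stripe data, one approach is to reassemble the array in the live environment and image the assembled device rather than the individual drives. The sketch below assumes a two-disk array; the member partitions, the md device name, and the target path are hypothetical, and the --readonly flag asks mdadm to start the array without writing to it.

  sudo mdadm --assemble --readonly /dev/md0 /dev/sda1 /dev/sdb1
  cat /proc/mdstat                                 # confirm the array is assembled
  sudo dd if=/dev/md0 of=/mnt/target/server-md0.dd bs=4M conv=noerror,sync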

TL;DR #3: Servers frequently deploy a redundancy technology called RAID to ensure the seamless rollover from a failed drive to a backup. When creating disk images of server hard drives one must first identify the type of RAID being used and from that information determine whether to capture a disk image of the raw device (the unmounted hard drive), or reassemble the RAID and capture the disk image from the disks running in the RAID.

Logical Volumes

The fourth and final server technology I’ll discuss in this post is what is called a “logical volume.” For simplicity’s sake, I’ll draw a distinction between a “physical volume” and a “logical volume.” A physical volume would be all the space on a single drive; its size (its volume) is limited by its physical capacity. If I connect a physical drive to my computer and format it, its volume would only ever be what its physical capacity allowed. A logical volume, by comparison, has the capacity to span multiple drives to create a single, umbrella-like volume. The volume’s size can now be expanded by adding additional drives to the logical volume (see the diagram below). In practice this means that, like a RAID array, a logical volume allows the user to connect multiple hard disks to a system but have the operating system treat them as a single drive. This capability allows the user to add space to the logical volume on an as-needed basis, so server administrators create logical volumes on servers where they anticipate the need to add more drive space in the future.

Diagram of a Logical Volume. Image credit: The Fedora Project

The hard drives on Minerva, the production server, were configured in a logical volume as well as a RAID. This meant that in addition to reconstructing the RAID with mdadm, I had to download LVM (the Logical Volume Manager), which I then used to mount the logical volume and access the contents of the drives. While I generally advocate the use of disk images for preservation, in this case it may make more sense to capture a logical copy of the data (that is, just the data visible to the operating system and not a complete bitstream). The reason for this is that in order to access the data from a disk image, you must once again use LVM to mount the logical volume. This additional step may be difficult for future users. It is, however, possible to mount a logical volume contained on a disk image in BitCurator, which I’ll detail in a subsequent post.
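As a sketch of what that looks like in practice, the commands below (from the lvm2 toolset) scan for volume groups, activate one, and mount a logical volume read-only. The volume group and logical volume names shown are hypothetical; pvscan and lvscan report the actual names on a given system.

  sudo pvscan                                      # find LVM physical volumes (here, on the assembled RAID device)
  sudo vgscan                                      # find volume groups
  sudo vgchange -ay VolGroup00                     # activate the volume group (name is hypothetical)
  sudo lvscan                                      # list the logical volumes it contains
  sudo mount -o ro /dev/VolGroup00/LogVol00 /mnt/appraisal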

TL;DR #4: A logical volume is a means of making multiple drives appear to be a single volume. If a server employs a logical volume, digital archivists should take that fact into account when they decide whether to capture a forensic disk image or a logical copy of the data on the drive.

Revisiting Digital MITH

So why go through this? Why heft these old servers out of the storage closet and examine images of systems long since retired? For me and for MITH it comes down to understanding our digital spaces much like our physical spaces. We have all had that moment when, for whatever reason, we find ourselves digging through boxes for that thing we think we need, only to find that thing we had forgotten about but in fact need more than whatever it was we started digging for in the first place. So it was with Khelone and Minerva; what began as something of a test platform for BitCurator tools opened up a window to the past, to digital MITH circa 2009. And like their physical counterparts, Khelone and Minerva were full of nooks and crannies. These servers hosted course pages for classes taught by past and present University of Maryland faculty, the personal websites of MITH staff, and versions of all the websites hosted by MITH in 2009, including those that could not be migrated to the new servers due to the security concerns mentioned above. In short, these servers were a snapshot of MITH in 2009—the technologies they were using, the fellows they were working with, the projects they were undertaking, and more. In this case the whole was truly greater than the sum of its parts (or the sum of its hosted websites). These servers—now accessible to MITH’s current managers as disk images—are a digital space captured in time that tells us as much about MITH as it does about any of the projects the servers hosted.

Understanding web servers in this way has significant implications for digital humanities centers and how they preserve project websites as well as their own institutional history. An atomized approach to preserving project websites decontextualizes them from the center’s oeuvre. Any effort to capture a representative institutional history must demonstrate the interrelated web of projects that define the center’s scholarship. Elements such as overlaps between participants, technologies that were shared or expanded upon between projects, funder relationships, and project partnerships, to name a few, form a network of relationships that is visible when websites are viewed in their original server context. However, this network becomes harder to see the moment a website is divorced from its server. In MITH’s case, examination of these disk images showed the planning and care with which staff had migrated individual digital projects. Nonetheless, having this additional, holistic view of previous systems is a welcome new capability. It is perhaps asking too much of a DH center to spend already limited system administration resources creating, say, biannual disk images of their web servers. However, when those inevitable moments of server migration and transition come about, they can be approached as an opportunity to capture a digital iteration of the center and its work, and in so doing, hold on to that context.

From a digital preservation perspective, I believe that we need to adopt a holistic approach to archiving web servers. When we capture a website, we have a website, and if we capture a bunch of websites we have… a bunch of websites, which is fine. But a web server is a different species; it is a digital space that at any given moment can tell us as much about an organization as a visit to its physical location. A disk image of an institution’s web server, then, is more than just a place to put websites—it is a snapshot of the organization’s history that is, in its way, still very much alive.

MITH and UNC SILS release BitCurator 1.0 and look ahead to the future
https://mith.umd.edu/mith-unc-sils-release-bitcurator-1-0-look-ahead-future/
Wed, 08 Oct 2014

This has been a wild month for the BitCurator project here at MITH. First of all, as the grant-funded portion of the BitCurator project has drawn to a close, we have established a member-based consortium to be the ongoing home of the BitCurator environment. The BitCurator Consortium (BCC) will be a member-led organization that picks up where the grant-funded work left off. In my conversations with BitCurator users I like to emphasize this point: the BitCurator project has not ended just because the grant period is over. Far from it! The outreach, training, and other community engagement efforts of the BitCurator team over the last year have established an active and growing user base committed to the ongoing development of the BitCurator environment. What’s more, both the UNC School of Information and Library Science (SILS) and MITH are charter members of the BitCurator Consortium and will continue to be actively engaged in the project. You can learn more about the consortium and see a list of member organizations that have already joined on the BitCurator Consortium page of the BitCurator website.

In addition to the establishment of the BCC, we were proud to announce the release of BitCurator 1.0 last week. This is a major milestone for us and represents the culmination of three years of dedicated development work. The BitCurator 1.0 release includes a major update to the BitCurator Disk Image Access tool, which allows users to view the full content of a disk image, including hidden and deleted files, and export those files to a directory. Other recent updates include a revised safe-mounting tool that allows users to mount disks in read-only mode, the inclusion of the Library of Congress’s Bagger tool, and an updated version of bulk_extractor. To learn more and to download the latest version of BitCurator, please visit our wiki at wiki.bitcurator.net.

As the BitCurator Community Lead over the past year, I have had the pleasure of getting to know the members of this community and learning about the challenges they face as they begin processing their born-digital collections. It has been rewarding to see the adoption of the BitCurator environment as an important part of addressing those challenges and to see the BitCurator community take a leadership role in this area. I look forward to seeing both the BitCurator environment and the BitCurator community grow and develop as we take this next and necessary step in the maturation of the BitCurator project.

Digital Curation Workstation
https://mith.umd.edu/digital-curation-workstation/
Mon, 26 Nov 2012

A few weeks ago I began putting together MITH’s new digital curation workstation. The primary reason for the workstation was to build a testbed for the BitCurator environment, an open source suite of digital forensics (DF) tools that have been repurposed for the curation of born-digital materials. While there are commercial DF workstations available on the market (for example, see Digital Intelligence’s FRED system), their cost can be prohibitive, especially compared to the ever-diminishing cost of desktop workstations.

What I wanted to come up with was a system under $1000 that would allow access to as many forms of digital media as possible. After researching the different options I ultimately chose a Windows 7 64-bit workstation with an i7 Intel processor, 24GB of RAM, and a 2TB SATA hard drive. When building a digital curation workstation running BitCurator, perhaps the most critical component is the system RAM. This is because BitCurator is designed to run in a virtual machine on a host operating system (hosts can be Windows, OSX, or Linux). As a general rule, the more RAM you have to dedicate to a virtual machine, the better performance you can expect. However, these specs reflect an optimal configuration; if you want to repurpose an existing workstation instead of buying a new one, you can run the BitCurator environment on any PC with a 64-bit capable CPU (most Intel and AMD CPUs have been able to run 64-bit operating systems for the last few generations), 2GB of RAM, and a 250GB SATA hard drive.
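For those running the environment as a virtual machine, the RAM is dedicated to the VM through the hypervisor’s settings. As a hedged sketch only, assuming VirtualBox as the hypervisor (the post does not name one) and a VM called “BitCurator” (a hypothetical name), the allocation might look like this:

  # Assumes VirtualBox; the VM name and the amounts are hypothetical
  VBoxManage modifyvm "BitCurator" --memory 8192 --cpus 4   # dedicate 8GB of RAM and 4 CPUs to the VM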

Choosing the system components was just the first step in building our digital curation workstation, however. The primary challenge most digital archivists face is getting physical access to their media. To address this, the MITH digital curation workstation includes an “All-in-One” memory card reader which allows access to everything from Sony Memory Sticks to micro SD cards. For an optical drive, you want a drive that can read as many formats as possible, so a drive that can read Blu-ray disks and is backwards compatible with older DVDs and CD-ROMs is ideal (you’ll want to check and make sure your drive can effectively read burned media as well). Zip disks, though comparatively short-lived, are still common enough that a born-digital curation workstation would be incomplete without a Zip drive. USB Zip drives are still available new from the manufacturer, and are also easily found on auction sites such as eBay.

Perhaps the media types that present the biggest challenge are 3.5” and 5.25” floppy disks. For 3.5” floppy disks I recommend a USB 3.5” floppy disk drive because 1) they are still readily available, and 2) USB devices integrate into the BitCurator environment more easily than those connected to a floppy disk controller on the motherboard. For the 5.25” drive we used the FC5025 device by Device Side, a USB piece of hardware that, when coupled with a 5.25” drive, allows a Windows, Mac or Linux PC to read a wide variety of 5.25” floppy disk formats, including Apple DOS 3.2 and 3.3, MS DOS, Commodore 1541, and Atari 810–just to name a few (see the above link for a full listing). For more on accessing 3.5” and 5.25” floppy disks, I recommend an article by Doug Reside (MITH alum and digital curator at the NY Public Library) titled “Digital Archaeology: Recovering your Digital History”. Doug’s article is particularly helpful because not only does he outline the various media types and their required drives, but he also tells you where you can find the drives themselves, many of which are no longer manufactured.

Once complete, we didn’t have long to wait before using our new digital curation workstation. Travis Brown, one of the directors here at MITH, came to me with a stack of 5.25” floppy disks containing some of Neil Fraistat’s early work on Percy Shelley’s manuscripts. Back in the early 90s, Neil had transcribed the Prometheus Unbound portions of Percy Shelley’s notebooks (ms Shelley e.1, e.2, e.3), painstakingly recreating in WordPerfect 4.2 each of Shelley’s hand-marked notations. For example, lines that were struck through in the manuscript were likewise struck through in the WordPerfect documents, along with word changes and emendations. If we could recover Neil’s early digital transcriptions, they could serve as a foundation for the work being done on the Shelley-Godwin Archive, another project here at MITH. Using the FC5025, we were able to access the disks and copy the file contents onto the workstation’s hard drive. We then used the current version of WordPerfect to convert the transcriptions into a modern document format. The time saved by recovering Neil’s original transcriptions goes beyond just that needed to retype the original documents; it includes the careful validation work done by Neil’s collaborators at the Bodleian Library. Taken together, their work represents a significant example of early digital humanities work, work that is now available to us because of the tools described above. From here the electronic transcriptions will be used to form a foundation for further TEI encoding and be an important part of the Shelley-Godwin Archive.

This is just one example of how we here at MITH anticipate being able to use our new digital curation workstation, and, we think, makes a pretty compelling case for similar workstations being an essential part of any digital humanities center.

Porter Olsen is a Ph.D. candidate in the University of Maryland Department of English and a Graduate Research Assistant with the BitCurator project at MITH.

An Early Look at the BitCurator Environment
https://mith.umd.edu/an-early-look-at-the-bitcurator-environment/
Fri, 09 Nov 2012

Roughly one year ago members of the BitCurator Professional Experts Panel (PEP) met at the Maryland Institute for Technology in the Humanities (MITH) to help further refine the scope and priorities of the BitCurator project, and ensure that our efforts would have “real world” usefulness for archivists and librarians who are responsible for born-digital materials. The PEP meeting, along with a similar meeting in January of the Development Advisory Group, produced two significant results: first, revisions to a product requirements document that outlined the work to be done on the BitCurator project, including an architecture overview and feature descriptions; and second, a collection of detailed workflows that have helped us to identify where BitCurator can best fit into and enhance curatorial practices. The upcoming one-year anniversary of these initial meetings makes this a good time to take a look at how the BitCurator project has progressed and where we’re headed in the near future.

The BitCurator development team is making available a test release of the BitCurator Environment, which can now be downloaded from the BitCurator wiki. The BitCurator Environment is a fully functioning Linux system built on Ubuntu 12.04 that has been customized to meet the needs of archivists and librarians, and it can be run either as a stand-alone operating system or as a virtual machine. Once installed, the BitCurator Environment includes a number of digital forensics tools that can be integrated into digital curation workflows. A sampling of those tools includes:

  • Guymager: a tool for creating disk images in one of three commonly used disk image formats (dd, E01, and AFF).
  • Custom Nautilus scripts: a collection of enhancements to Ubuntu’s default file browser that allow users to quickly generate checksums, identify file types, safely mount drives, and more.
  • bulk_extractor: a tool that locates personally identifiable information (PII) and then generates reports on that information in both human- and machine-readable formats (see the example below).
  • Ghex: an open source hex editor that allows users to view a file in hexadecimal format.

The BitCurator environment will make additional tools available in later releases.
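As an illustration of how one of these tools is typically run, here is a minimal sketch of a bulk_extractor invocation from the command line; the disk image name and output directory are hypothetical.

  bulk_extractor -o be-reports accession-001.dd    # writes feature files (email addresses, credit card numbers, etc.) to be-reports/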

The BitCurator team has also been developing various forms of documentation to complement the product development. On the BitCurator wiki you can find documentation that introduces virtual machines, instructs users on how to install the BitCurator environment, and gives detailed configuration instructions on sharing devices and files between host and virtual machines. We are also currently working on developing documentation that outlines use-case scenarios for digital archivists using the tools mentioned above.

It is not enough, of course, to simply build tools and a wiki page and hope users will come find our software. The BitCurator team has also been actively promoting the BitCurator Environment through lectures, panel discussions, conference talks, posters, and publications. Recent examples include presentations from Kam Woods (BitCurator technical lead), Cal Lee and Matthew Kirschenbaum (BitCurator Co-PIs) on BitCurator and digital forensics at this year’s Society of American Archivists conference in San Diego; and presentations by Cal at Archiving 2012 in Copenhagen, Denmark, Memory of the World in the Digital Age in Vancouver, Canada, the International Congress on Archives in Brisbane, Australia, and to the staff of the National Library of Australia in Canberra. In addition, team members Alex Chassanoff and Porter Olsen will present a poster on integrating digital forensics into born-digital workflows at the upcoming ASIS&T conference. We have also recently published an article in D-Lib Magazine titled “BitCurator: Tools and Techniques for Digital Forensics in Collecting Institutions.”

We have also been incorporating BitCurator elements into professional education offerings.  Cal has developed a one-day continuing education course called “Digital Forensics for Archivists” as part of the Digital Archives Specialist (DAS) curriculum of the Society of American Archivists (SAA).  Matt Kirschenbaum and Naomi Nelson (BitCurator PEP member) have been offering a course called “Born-Digital Materials: Theory & Practice” as part of the Rare Book School (RBS) at the University of Virginia.  Both the SAA and RBS courses serve as excellent mechanisms for raising awareness about BitCurator’s offerings and eliciting needs and perceptions from working professionals.

These are just a few examples of the work done by the BitCurator team to get the word out about BitCurator and our work on bringing digital forensics tools and techniques to the digital curation community. For a full list of BitCurator-related publications and presentations, please visit our project website at www.bitcurator.net.

As we look forward into the next few months, the BitCurator team has a number of goals and benchmarks that we will be working towards, chief among them being the release of the BitCurator beta later this fall. We are also organizing the second annual meeting of our Development Advisory Group for January 2013, where we will elicit feedback from DAG members on our releases to date. The day before the DAG meeting will be CurateGear 2013 on January 9 in Chapel Hill, where members of the DAG and many other experts will give presentations and run demos of software to support digital curation. And finally, we are currently in the process of applying for funding for phase two of the BitCurator project to support additional product development and further efforts to engage with working professionals who could benefit from implementation of the BitCurator tools. We invite those who are interested, especially those in collecting institutions working with born-digital materials, to follow our progress at www.bitcurator.net, or follow us on Twitter at @bitcurator. For those who would like to jump right in and start working with the BitCurator Environment, you can do so at wiki.bitcurator.net and join the BitCurator Users List.

If you have questions about the BitCurator project or the role of digital forensics methods in born-digital curation, please feel free to ask them in the comments section below.

Porter Olsen is a Ph.D. candidate in the University of Maryland Department of English and a Graduate Research Assistant at MITH.

BitCurator is Designing Curation Tools for Use
https://mith.umd.edu/bitcurator-is-designing-curation-tools-for-use/
Thu, 12 Jan 2012

Over the weekend, Matt Kirschenbaum and I traveled to UNC Chapel Hill in order to meet with the BitCurator Development Advisory Group (DAG). By design, our meeting with the DAG coincided with Curate Gear, a conference sponsored by the UNC Chapel Hill School of Information and Library Science designed to bring together scholars, software developers, and archivists to discuss tools and ongoing research focused on the unique challenges of digital curation. Curate Gear was very informative, and showed the need for effective tools to deal with the ever-growing collection of digital artifacts that archiving institutions collect daily. Attending the Curate Gear talks and demos emphasized to me just how immediate the problem of digital curation is. It is not, as they say, academic.

One of the DAG members, Dr. William Underwood, who also presented at Curate Gear, spoke of his work with the George H.W. Bush presidential library and their need to process an extensive collection of existing floppy diskettes. Yes, that’s right, we’re still trying to work through the first Bush administration’s digital records from twenty years ago. This is just one of the many examples at Curate Gear of how pressing this issue is. And not just in academia: the challenge of archiving and making digital records available affects businesses, government agencies, lending libraries, and private collections. Think even briefly about the amount of data that now sits unprocessed in shoe boxes in library storage, and you’ll immediately see the need for projects such as BitCurator.

The next day we walked from the hotel across the UNC campus on a surprisingly warm and sunny morning (at least compared to Maryland) to begin the day-long meeting with the DAG members. What struck me from the beginning was just how engaged and invested the DAG members were. Even as Cal Lee (BitCurator PI and professor at SILS) introduced BitCurator to the group, we began to receive valuable feedback and insights into both the software development process and the particular challenges we might face with the BitCurator project. Professor Geoffrey Brown of Indiana University, for example, offered incisive comments on the importance of maintaining a tight focus on the specific set of problems BitCurator is designed to address. Other DAG members were likewise generous with their comments, which led to extensive discussions on a range of topics that included:

  • The role of BitCurator in a broader ecology of digital archiving tools
  • Properly scoping the project so that it will be able to deliver on the objectives laid out in the design documentation
  • Defining the intended user base for BitCurator
  • GUI and command line interfaces
  • Identifying and sequestering private or otherwise sensitive data during the curation process
  • Education and documentation requirements
  • Outreach and long-term support for the BitCurator project

Again, I was impressed with the interest and engagement in BitCurator, which in my opinion went beyond professional courtesy (though there was that in spades) and demonstrated the very real need for this project. Put another way, BitCurator is a tool that the DAG members want to succeed not simply as a matter of academic curiosity, but because they want to be able to put it to use themselves.

Going forward, the BitCurator team will revise the design documents based on the DAG members’ feedback and then begin the development process. One of the key tasks ahead is building a corpus of “real world” archive materials against which we can test BitCurator. As Matt observed at one point, there are any number of DH tools that look and sound promising, yet languish on (the digital equivalent of) dusty shelves either because they don’t actually address the intended user’s problems or because nobody knows of them. Part of my job, then, will be to make sure that we get more than just good ideas out of the DAG members; we’ll need their bits, too.
