Josh Westgard

About Josh Westgard

Grad Student in Information Management at UMD's iSchool, GA in Digital Stewardship at UMD Libraries, and aficionado of early medieval manuscripts.

Seeing the Forest through the Thees (and Thous)

My initial reaction to Ramsay’s statement is that for me nothing quite induces the defamiliarization of textuality like invoking the ostranenie of Russian formalists. I’d like to see someone explain that passage in Upgoerfive! That having been said, I found this week’s exercises quite thought provoking and exciting. As soon as I began my first attempts to create word clouds with Augustine’s <i>Confessions</i>, I knew there were going to be problems with my particular translation, the language of which is extremely antiquated. Because of the language, my initial Wordle showed “Thee”, “Thou”, and “Thy” to be the most common words (because they are not in their basic stoplists, of course, even though a modern translator would say “you” and “your”).  Further examination revealed that there were a large number of other very common words in archaic forms in my text.
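The stoplist effect described above can be illustrated with a minimal Python sketch. The stoplist here is a tiny hypothetical one (Wordle's actual list is not reproduced), and the function name `top_words` is my own; the point is only that a list built around modern English filters out "you" and "your" while letting "thou" and "thy" through.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stoplist of modern function words.
# Real word-cloud tools ship much longer lists; this one is an assumption.
BASIC_STOPLIST = {
    "the", "and", "of", "to", "a", "in", "i", "is", "be",
    "you", "your", "that", "my", "not", "for",
}

def top_words(text, stoplist=BASIC_STOPLIST, n=5):
    """Return the n most frequent words that are not on the stoplist."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in stoplist)
    return counts.most_common(n)

# Modern pronouns are filtered out; their archaic equivalents are counted.
print(top_words("Thou and you, Thy and your, Thy", n=2))
```

Because the filter works on exact word forms, an antiquated translation needs either a longer stoplist or a normalization pass before counting.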

Through some trial and error, and using a text editor with advanced Grep capability to perform some batch replace procedures on my text file, I managed to generate a more satisfactory result. The Wordle and WordItOut versions seemed quite similar in my case. And even though WordItOut seems to offer somewhat easier manipulation of the final output, I’m posting the Wordle because I agree with others that they tend to look better:

Wordle

I found this to be a surprisingly good encapsulation of many of the main themes of the Confessions. Putting the resulting words into UpGoerFive resulted in the following list of words used frequently in my text that were not among the more commonly used in English today:

nor, unto, lord, earth, soul, whom, heaven, itself, therefore, neither, behold, joy, spirit, whence, flesh, holy, certain, unless

Here we can see that the archaic language is still apparent, even after my attempts to modernize the most frequently used archaic words. “Nor”, “unto”, and “whom” should probably be on the stoplist, since the ideas that my old translation is expressing with them would probably be expressed with stoplist words in a translation written today. But if we look past those words, the remaining results are reasonably instructive, and a machine trying to ‘comprehend’ what the Confessions are about would have a reasonably easy time of it, I suspect.
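For readers who would rather script the batch replacement than do it in a text editor, here is a rough Python sketch of the same idea. The mapping is a hypothetical example, not the list of substitutions I actually made, and it is crude: "thine" can also mean "yours", and archaic verb forms such as "hast" are left alone.

```python
import re

# Hypothetical mapping of archaic forms to modern ones (an assumption,
# not the substitutions actually applied to the text).
ARCHAIC_TO_MODERN = {
    "thee": "you",
    "thou": "you",
    "thy": "your",
    "thine": "your",
    "unto": "to",
}

# \b anchors keep the match to whole words, so "Thyself" is not touched.
_PATTERN = re.compile(
    r"\b(" + "|".join(ARCHAIC_TO_MODERN) + r")\b", re.IGNORECASE
)

def modernize(text):
    """Replace whole-word archaic forms, preserving an initial capital."""
    def swap(match):
        word = match.group(0)
        modern = ARCHAIC_TO_MODERN[word.lower()]
        return modern.capitalize() if word[0].isupper() else modern
    return _PATTERN.sub(swap, text)

print(modernize("I cried unto Thee with thy help"))
```

Normalizing the text first and extending the stoplist afterwards keeps the two concerns separate: the first changes the words, the second only changes what gets counted.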

The CLAWS tagger seems quite powerful, though its results didn’t immediately speak to me. I did notice that it seems to have mis-identified Augustine’s use of “times” as a preposition. CLAWS becomes particularly powerful, it would seem to me, if one were to convert the results list to a spreadsheet that can be easily sorted by part of speech. TAPOR likewise looks like a very powerful toolset; if I’m not mistaken, its concordance generator could accomplish Father Busa’s entire project in a matter of a few minutes, assuming one had the works of Aquinas available in text files.
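The spreadsheet conversion mentioned above might look something like the following Python sketch. It assumes the tagger output is in a horizontal `word_TAG` layout with tokens separated by whitespace; the sample tokens and tags are illustrative only, and the function names are my own.

```python
import csv
import io

def tagged_to_rows(tagged_text):
    """Split 'word_TAG' tokens into (word, tag) pairs, sorted by tag then word."""
    rows = []
    for token in tagged_text.split():
        # rpartition splits on the last underscore, so a word that itself
        # contains an underscore still keeps its tag intact.
        word, sep, tag = token.rpartition("_")
        if sep and word and tag:  # skip tokens without a word_TAG shape
            rows.append((word, tag))
    return sorted(rows, key=lambda row: (row[1], row[0].lower()))

def rows_to_csv(rows):
    """Serialize the rows as CSV text that a spreadsheet can open."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["word", "tag"])
    writer.writerows(rows)
    return buffer.getvalue()

# Illustrative input only; the tags are assumed, not copied from real output.
sample = "wept_VVD at_II certain_JJ times_NNT2"
print(rows_to_csv(tagged_to_rows(sample)))
```

Once the rows are grouped by tag, a suspicious tagging (such as a noun turning up among the prepositions) is much easier to spot than in running text.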

Ultimately, though, coming back to the question of defamiliarization of the text, this week’s exercises proved to me that there is something valuable in breaking our texts down in this way — even if I’m not sure I see where this is all headed just yet.  Text mining procedures like these seem to be taking apart the forest and sorting the trees by species, size, age, etc.  Surely that would be useful information for a biologist studying the forest, but how we will get from stacks of trees over to understanding biodiversity still remains unclear to me.

What has William Morris to do with DH?

A brief recommendation: UMD Libraries’ Special Collections is currently featuring an exhibit (“How We Might Live: The Vision of William Morris,” Sept. 2012-July 2013) on the life and works of William Morris, the 19th-century English author, designer, socialist, and (arguably most famously, though perhaps I’m not objective on this point) founder of the Kelmscott Press and printer of the Kelmscott Chaucer. As a medievalist with a particular interest in manuscript studies, I’ve long found Morris’s work appealing and admired his taste. For example, what lover of books would not appreciate the discussion of the relative aesthetic merits of various typefaces and guidelines for margin widths found in his “The Ideal Book”? That having been said, though, I never found Morris particularly relevant to my own work, at least not until I read Bethany Nowviskie’s very thoughtful MLA talk, “Resistance in the Materials” (posted here on her blog). Nowviskie uses a quotation from Morris as a jumping off point for discussing the role of craft and collaboration in DH, as well as for some reflections on the casualization of the academic workforce. Not only is her essay directly pertinent to our discussion of making and building in DH, but for me reading it also gave new relevance to UMD’s Morris exhibition. In particular, it got me thinking about the tension between the hand- and machine-crafted object in Morris’s work, and about the resonance of his attempts to translate both the aesthetics and the ethics of the hand-crafted book into the technological context of printing. In that sense his work now strikes me as particularly relevant to our moment, when at times the future of books as physical objects seems to be in doubt, not to mention the viability of a career devoted to writing and studying them. But rather than take my word for it, why not read the essay, and take in the exhibition, for yourself?

Download and Read: Augustine’s Confessions Online

For this exercise I wanted to choose something with a long and complex history that would be relevant to my interests, but which also had enough cultural significance to be of interest to a wider audience. I settled on Augustine’s Confessions, his autobiographical masterpiece written at the end of the 4th century, in which he recounts his early life and conversion to Christianity. As with any work written before the age of print, the Confessions came to life and first circulated in manuscript form (examples of which can also be found online, for example this digitized microfilm of Troyes, Bibl. mun., 473, and this digital facsimile of a Villanova MS). The work made its way into print at an early date, and was translated from the original Latin into English at least as early as the first half of the 17th century. Perhaps the most accessible edition of the Latin text is that in the Patrologia Latina (32.659 ff.). The PL (the publication of which in the mid-19th century has to be one of the most successful acts of serial plagiarism ever perpetrated) retains its relevance today as a kind of least common denominator of editions. But as you might expect, over the years there have been numerous editions of the text, not to mention translations into various languages. Not knowing what other sorts of exercises might be in store in the coming weeks for the texts we choose to investigate, I decided to focus my efforts on the English versions of the Confessions. And rather than attempting to compile a comprehensive survey of all the various versions that might be out there on the web, I decided just to approach each of the four search interfaces with some common terms (viz., author: Augustine; title: Confessions; and where possible limiting the results to English language hits available in full text), and see what each one returned.

Project Gutenberg

Project Gutenberg returns just four hits in response to a search on “Augustine Confessions”, including one hit each of the English and the Latin text, as well as two anthologies that contain excerpts from the text.  Gutenberg’s English text (available here) was first released in 2002, and is a version of the translation of Edward Pusey from the Library of the Fathers series, a series of translations of patristic texts published in the 19th century by members of the Oxford Movement of High-Church Anglicanism. This is an influential translation, and it will make repeated appearances below.  The text is available in six formats: HTML, ePub, Kindle, Plucker, QiOO, and plain text (UTF-8).  The HTML version is XHTML, and seems to have been carefully proofed. This version also contains some useful additional encoding such as paragraph numbering.  The text can be read online, or downloaded in any of the six available formats. It is in the public domain, and is here released under a Project Gutenberg license, which allows the end-user to use the text for just about any non-commercial purpose.  There isn’t any obvious way to mark up or otherwise correct the text and re-submit it back to the project.

Google Books

Searching Google Books using the terms described above returns 25 hits when limited to those available in full, ranging in date from 1770 to 1912. Closer examination reveals that many of these 25 are in fact duplicates, and others are irrelevant volumes of a multi-volume series (The Nicene and Post-Nicene Fathers [NPNF]), only one volume of which contains the text under investigation. Among the ‘good’ hits are a copy from the Loeb Classical Library; a reprint of Pusey’s translation in a series called the Harvard Classical Texts; and a translation by J. G. Pilkington in the aforementioned NPNF series. The volume edited by Temple Scott (1900), which was scanned from a Harvard copy, is in fact a re-issue of Pusey’s translation, while the translation by W. H. Hutchings, scanned from a copy in the Bodleian, purports to be a new translation, albeit of only ten books rather than the full text’s thirteen (the final three books of the Confessions, more philosophical than autobiographical, are sometimes left out). There are a number of other versions available on Google besides these. The volumes on Google are available in a variety of formats both directly on Google Books and through the Google eBook feature, including several formats designed for e-readers and for online reading. While the quality of those that rely on page images is generally good, the OCR versions remain quite error-laden. For example, this passage chosen more or less at random: “$e tnbefff$s against tTie SOone of enucatinp; f&e BUT woe to thee, thou torrent of human custom!” (p. 23). If one examines the page-images it is immediately apparent why the text is so corrupt in the first half of the passage: it is an epigram printed in a gothic font. But because of anomalies like this, the poor quality of the OCR would make it difficult and dangerous to use Google’s text for any serious purpose (over and above the fact that there is no obvious easy way to download the entire book in plain text format). There are some nice features of the Google reader such as the ability to create notes and mark up the text with highlighting in different colors. Google’s terms of service would seem to allow download and reuse of their content in a variety of forms.
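One rough way to gauge how error-laden an OCR text is would be to count tokens that contain characters no ordinary word should have. The Python sketch below is an assumed heuristic of my own, not a feature of any of these sites, and it is far from complete: it flags tokens like “f&e” but misses garbles that happen to consist entirely of letters, such as “enucatinp”.

```python
# Punctuation that may legitimately sit at the edges of a word.
EDGE_PUNCT = ".,;:!?\"'()[]“”‘’"

def looks_garbled(token):
    """Flag a token whose core contains something other than letters,
    apostrophes, or hyphens. Plain numbers are let through."""
    core = token.strip(EDGE_PUNCT)
    if not core or core.isdigit():
        return False
    return not all(ch.isalpha() or ch in "'’-" for ch in core)

def garbled_share(text):
    """Return the fraction of whitespace-separated tokens that look garbled."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(looks_garbled(t) for t in tokens) / len(tokens)

passage = ("$e tnbefff$s against tTie SOone of enucatinp; f&e "
           "BUT woe to thee, thou torrent of human custom!")
print([t for t in passage.split() if looks_garbled(t)])
print(round(garbled_share(passage), 2))
```

A dictionary lookup would catch more of the damage, but even this crude count is enough to compare the OCR quality of two scans of the same book.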

Internet Archive

Perhaps the most interesting and unique offering at the Internet Archive is the very first one among the initial hits: a complete audio book from Librivox.  The experience of listening to the Confessions read aloud probably more closely approximates how the text was experienced through much of its early history, when even private reading was often done aloud, than many of the printed versions.  Many of the versions available on IA are copies of books digitized by Google. Pusey’s and Pilkington’s translations are here, but also a version of the text translated into Hebrew that I found on no other site. The IA’s versions are available for free download in a variety of formats, including formats for various e-readers, as a PDF, and as a single plain text file.  Unfortunately, the plain text version is full of OCR errors (not least the common failure to segregate headers, footnotes, and main text), and would require significant clean up to be useful for any serious purpose.  Many of the IA books are listed as not in copyright or with no known copyright restrictions, and can be downloaded freely in various formats.  In addition, descriptive information about the scanned books can be contributed by users through openlibrary.org, and problems can be reported to IA through a link on their site.  IA’s online reader is perhaps the best interface of any of the available online readers.

Hathi Trust

Finally, a search of the Hathi Trust using the same terms described above returns 19 hits, including many of the same translations available via Google (in fact, the watermarks reveal that many of these are Google’s scans). As one might expect, the metadata for Hathi Trust books are generally fuller and more precise than Google’s. Another useful feature is the ability to download citation information. Plain text is available, but only on a page-by-page basis, and even the PDF download of a full book in the public domain requires authentication. According to the access and use policy, users are asked not to use the Google-digitized books for commercial purposes or to re-host them, but the books are otherwise free to use for non-commercial, educational purposes.

 

In conclusion, I would note that the plain-text version of Pusey’s translation available through Project Gutenberg is probably the most useful of all the free online versions of the text, simply because of its flexibility. None of the foregoing discussion takes into account the accuracy either of the translations or of the editions upon which they were based.

Yes, there’s an award for that…

You can now cast your vote for the best digital projects and contributions to the field of DH in 2012.  Voting is open to anyone.  To learn more about these new awards, see the slate of nominees in various categories, and ultimately cast your vote, go to: http://dhawards.org/dhawards2012/voting/

But the ballot is good for more than just voting; it seems to me that it could also serve as a nice introduction to current work in the field. The slate of nominees was distilled from public submissions by a nominating committee, and includes MITH’s own Amanda Visconti as well as the Bamboo DiRT project.

The voting is open to anyone, and it will be interesting to see how the awards play out, given that there is no way to enforce that voters actually look at all the nominations (ah, democracy…).  The question of this being just a popularity contest is confronted in the Awards FAQ (http://dhawards.org/faqs/):

Doesn’t that just turn it into a popularity contest? In some ways, yes, it does. The other alternative would be to have the winners decided by a shadowy oligarchy. DH Awards was set up intentionally as a community-nominated and community-voted form of recognition. If we start controlling who has the right to vote it undermines this.

This is, I think, a conundrum worthy of some further discussion. Are there really only these two choices (= popularity contest or shadowy oligarchy)?  What are awards determined by this procedure likely to reward?  Is there a better way to choose projects for recognition?  What additional importance does this selection procedure lend to the social aspects of DH?

Is there such a thing as analog humanities?

For me it still feels premature to attempt my own definition of DH at this stage, but taking a cue from the agile development school, I guess I should get a working definition on screen and then iterate as the semester goes on.

Before I do that, however, a few words by way of introduction:  I am a student in the MIM program in the iSchool, and I also have a second (or first?) life as a medieval historian, having completed a Ph.D. in history at UNC-Chapel Hill in 2006.  Prior to my doctoral studies, in the late 90s I took an M.A. in Medieval Studies from Western Michigan University, which is where I first started working on digital projects, doing some web design for the Medieval Institute and SGML tagging for an electronic review journal (The Medieval Review, or TMR, see http://quod.lib.umich.edu/t/tmr/).  Back in those days — ‘the before times’ my kids like to call them — TMR’s cubicle also housed a special UNIX terminal, the sole purpose of which was to serve images from something called “The Electronic Beowulf” — still available, now in its 3rd edition! See http://ebeowulf.uky.edu/studyingbeowulfs/overview. The Electronic Beowulf’s images were  too large to be opened on a typical PC of that time, but today I’m sure could be handled by the average smartphone.  After I moved to Chapel Hill, digital skills were mainly a way to make ends meet between teaching assistantships rather than an integral part of my dissertation research, though already then I was starting to recognize how important and useful digital libraries could be.  For someone who primarily studies manuscripts, most of which are housed in European repositories, many of which are still minimally and poorly described in print, the prospect of having large numbers of primary sources digitized and made freely available looked to be a game changer (though in practice it hasn’t necessarily played out that way for a combination of reasons, but some sense of the riches that are out there today can be gained from http://manuscripts.cmrs.ucla.edu/index.php).  
After completing my degree, I held various temporary appointments, both full and part-time, including the better part of a year working on a project that actively engaged in the enterprise of making medieval manuscripts more widely available: Carolingian Culture at Reichenau and St. Gall (http://www.stgallplan.org).

My experiences working on the St. Gall project really helped to drive home for me that the field of digital libraries/digital cultural heritage was where I wanted to be, and that realization in turn is what has led me to UMD and the MIM program, which in turn brings this blog post back around to the question of defining DH. With the exception of the St. Gall project, I don’t really consider most of what I have done through my scholarly career to have been digital humanities per se, though there hasn’t really been a time in all these years that technology has not played some role in my academic life, whether as a means of facilitating scholarly interaction and exchange, a practical way to access primary and secondary research materials, or a means of keeping body and soul together, i.e. a paycheck. And while I wouldn’t dispute many of the definitions and characteristics put forth in earlier posts and in this week’s readings, especially the idea that DH is a particularly collaborative, social, and experimental flavor of modern scholarship, I am left wondering whether we haven’t reached a saturation point where there is in practice virtually no humanities scholarship that is not, on some level at least, digital.

That having been said, while there may be no analog humanities these days (except perhaps that practiced by castaways on desert islands), not every scholarly project is equally digital. So what makes some more digital than the others? Ramsay’s idea of building, which so many posts have touched on, rings true to me, as does the idea that digital humanities is particularly collaborative and social (in contrast to the solitary and isolated monographers of the ‘before times’). I recognize that these are descriptive characteristics rather than the elements of a definition; perhaps come May I’ll have learned enough to venture the latter?