More fun with topic modeling

While most of you have moved on to newer and perhaps more exciting challenges (I can imagine that hypertext authoring tools like Twine might be a good deal more interesting to people enrolled in a graduate literature program than Paper Machines was), I’m still plugging away at topic modeling. In response to Matt’s tweet earlier this week, here’s a very preliminary update.

The text I’m working with is Gratian’s Decretum, a 12th c. textbook of canon law. The Decretum is not a literary text. Anyone matriculating in the faculty of canon law at any university in Medieval Europe spent their first year sitting through lectures from the Decretum. The Decretum is a composite text, made up of excerpts from authorities like Augustine, Ambrose and Jerome; from canons of church councils; and from papal letters (real or forged). Gratian wrapped all of this in his own first-person comments (the so-called dicta), which were supposed to carry the thread of his argument. So we’re not dealing with the monolithic work of a single author.

In the late 1990s, the Decretum was discovered to have been composed in two distinct stages, the First Recension and the Second Recension. The immediate goal of my topic modeling exercise is to determine whether I can detect topics that were only added in the Second Recension. I know that at least one such topic exists. My doctoral advisor discovered (the old-fashioned way) that all of the texts in the Decretum relating to the legal status of Jews were added in the Second Recension. If I can get topic modeling working on the text, the plan would be to topic model an electronic text of the standard edition of the Decretum (more or less corresponding to the Second Recension), then to topic model an electronic text that can be thought of as a proxy for the First Recension, and finally to look at the differences to try to detect topics that were added between the First and Second Recensions.

I’m using command-line MALLET (rather than Paper Machines), which gives me the ability to manipulate things like the number of topics to model and the number of iterations at the expense of being a little clunky. Here’s an example of how you run it:

# bin/mallet import-dir --input data/gratian --output gratian-input.mallet --keep-sequence-bigrams --stoplist-file stoplists/mgh.txt

# bin/mallet train-topics --input gratian-input.mallet --num-topics 20 --output-state gratian-state.gz --num-iterations 10000 --output-topic-keys gratian-topic-keys.txt

This test of 10,000 iterations took 12 minutes, 18 seconds to run. I’ve gone up to 100,000 iterations (2 hours, 19 minutes). I won’t show you all 20 lines of output, but here’s the first few to give you an idea:

0 2.5 legis primo presumpserit ordinis rem humana tenere honoris cunctis quarta operis publica celebrare respondetur infirmitate diximus grauius conceditur dignitate
1 2.5 populo facere proprio boni multi ualeat sacerdotium uenia romani malorum clerum tradidit electus dat digna probare possessiones peruenire ordinationis
2 2.5 populi uoluntatem pertinet sentencia uideatur possumus obicitur unitatem dicta urbanus sinodum prohibet permanere sacerdotali decretum matrimonio corpore archiepiscopo dimissa

At this point, there are at least two things happening that I didn’t expect. First, the topic keys aren’t converging. My understanding was that at some point the output of the N+1th iteration wasn’t going to be very different from the output of the Nth iteration. One of the interesting features of command-line MALLET is that it spits out the list of topic keys every 50 iterations, so you can watch it (try to) converge. So far, I’m seeing the words jump around a lot more than I expected. Second, the topic keys I’m getting look a lot more like the topic keys from Lisa Rhody’s ekphrastic poetry corpus than I’d expect for a non-literary text.
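One rough way to put a number on this, since MALLET writes the topic keys out to a file, would be to compare the key-word lists from two runs (or two checkpoints) and look at the best overlap per topic (best overlap rather than matching topic numbers, because topic numbering isn’t stable across runs). Here’s a minimal sketch in Python; the two filenames are hypothetical, and the parsing just assumes the topic-keys format shown above (topic number, weight, then the key words).

def read_topic_keys(path):
    # Parse MALLET's topic-keys format: topic number, weight, key words.
    topics = []
    for line in open(path):
        parts = line.strip().split()
        if len(parts) > 2:
            topics.append(set(parts[2:]))  # drop the topic number and weight
    return topics

def best_overlaps(file_a, file_b):
    # For each topic in file_a, report its best Jaccard overlap with any topic in file_b.
    a, b = read_topic_keys(file_a), read_topic_keys(file_b)
    for i, words_a in enumerate(a):
        score = max(len(words_a & words_b) / float(len(words_a | words_b)) for words_b in b)
        print("topic %2d: best overlap %.2f" % (i, score))

# Hypothetical filenames: topic keys saved from the 10,000- and 100,000-iteration runs.
best_overlaps("gratian-topic-keys-10k.txt", "gratian-topic-keys-100k.txt")

If the scores stay low even between very long runs, that would at least confirm my impression that the model isn’t settling down (or that 20 topics is the wrong number).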

There are many issues that I still have to resolve. Probably the two most important are the number of topics to model, and whether or not stemming the Latin words will make a substantive difference.
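On the stemming question, one quick-and-dirty experiment (before reaching for a real Latin morphological analyzer) would be to strip a handful of common endings from each token before handing the text to MALLET and see whether the topic keys look any different. A minimal sketch, with an entirely ad hoc suffix list:

import re

# Entirely ad hoc list of common Latin endings; this is not real morphology,
# just enough to collapse forms like episcopus / episcopi / episcopum / episcopo.
SUFFIXES = sorted(["ibus", "orum", "arum", "is", "os", "as", "us", "um",
                   "am", "em", "es", "ae", "o", "a", "e", "i"],
                  key=len, reverse=True)

def crude_stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def stem_text(text):
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(crude_stem(t) for t in tokens)

print(stem_text("episcopus episcopi episcopum episcopo"))  # episcop episcop episcop episcop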

weird tape in the mail

I read, if that’s the right way to describe it, “weird tape in the mail.” To be honest, my first reaction was annoyance at the punctuation anomalies (“uncles” when “uncle’s” was meant; “it;s” when “it’s” was meant). My second reaction was that the art was repellent, especially the first appearance of the uncle. It reminded me a bit of George Grosz. The repellent art is, I think, a positive feature of the story-telling experience, as it fits the tone of the story quite well. On a broader level, I found the story at least somewhat engaging, although I agree that the attack on utopian consumerism was a little heavy-handed. I’m withholding judgement until next week, though, because I suspect it’s harder to write in this mode than it looks, and subtlety may be one of the things that goes by the wayside in a shorter piece.

There seems to be a strong affinity between this kind of writing and gaming, and in that regard I may be somewhat at a disadvantage, since I have no experience in that realm. (I felt that disadvantage strongly during my very short-lived attempt to interact with SHADE.) I’m looking forward to the Twine assignment, since my sense is that this may be one of those things where there’s just no substitute for learning by doing.

Transcribing Bentham Experience

Like a number of other people in the class, I thought this exercise was going to be easier than it turned out to be. I have a lot of experience transcribing Medieval Latin manuscripts, and even some experience transcribing 18th century manuscripts in English for the Works of Jonathan Edwards project at Yale. I found it easy both to register for the Transcribe Bentham site and to use their transcription tool. I did not, however, find it easy to select a page for transcription. (Thanks to Melissa and Dan, who pointed out the banner with a link to un-transcribed material at the top of the Transcription Desk page.) And, of course, once I found links to the un-transcribed folios, it took a half-dozen tries to find one where I could actually read the handwriting.

I ended up transcribing JB/107/293/002, which was relatively easy, because it is a fair copy written out by one of Bentham’s copyists.

My main interest in this exercise was not to challenge my ability to read 18th and 19th century English manuscripts, but to evaluate the transcribing environment. In this regard, I think the Transcribe Bentham encoding tool compares quite well with similar systems (such as the excellent T-PEN for Medievalists).

My only reservation about this approach to a “Big Humanities” project is that it privileges easier projects over more difficult projects that might have greater intrinsic scholarly value. I consider the Transcribe Bentham project (relatively) easy because a) the source manuscripts are written in the dominant language of the DH world, English, and b) the primary barrier to transcription is Bentham’s bad handwriting (i.e., transcribers do not need specialized paleographical training, which they almost certainly would for manuscripts much older than these). I understand that the existence of certain things you can’t do (or at least can’t do easily) is not a reason not to do the things you can do. But every project has an opportunity cost, and I think we should always keep in mind which project we’d choose if all of the alternatives were equally doable.

Paper Machines Update

Here is a suggested fix for Paper Machines that has now worked in two cases, which gives me some confidence that it’s generally useful. Thanks to Courtney Wells, who volunteered her MacBook to test this fix. These instructions apply to Zotero running in Firefox (not standalone Zotero), and to Mac OS X systems. Here’s the summary:

  • Uninstall Paper Machines
  • Install Python 2.7.3
  • Reinstall Paper Machines
  • Make sure that the Path to Python executable is /usr/local/bin/python
  • Quit and restart Firefox

Here are the details:

  • Click Tools on the Firefox menu bar, and then select Add-ons. This gets you to the Add-ons Manager. Find Paper Machines 0.3.6, and click on the Remove button.
  • Download the Python 2.7.3 Mac OS X installer from python.org. Open your Downloads folder and you should see a disk image file named python-2.7.3-macosx10.6.dmg. Double-click on it, and when it opens, double-click on Python.mpkg. You will then be led through the installation of Python.
  • Reinstall Paper Machines in Firefox from the Paper Machines download page.
  • Open Zotero by clicking on the Zotero logo at the lower right of Firefox. Control click on TextHeap (under ENGL668K in Group Libraries on the lower left) and select Paper Machines Preferences… at the bottom of the contextual menu. Make sure you’re on the General Settings tab, and change the Path to Python executable from /usr/bin/pythonw to /usr/local/bin/python (see the Paper Machines Preferences screenshot below; a quick way to double-check the path is sketched after this list).
  • Quit and restart Firefox. Open Zotero and find TextHeap. Control click on TextHeap and select Extract Text for Paper Machines. When text extraction runs to completion, you should be able to use other Paper Machines functionality (Word Cloud, Topic Modeling, etc.)
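One extra check before restarting Firefox: run a tiny script with the same path you entered in the preferences, to confirm that it really is the newly installed 2.7 interpreter and not the system Python. The script name is mine; save it anywhere and run it from the Terminal as /usr/local/bin/python check_python.py:

# check_python.py: confirm which interpreter a given path points at.
import sys

sys.stdout.write("executable: %s\n" % sys.executable)
sys.stdout.write("version:    %s\n" % sys.version.split()[0])
if sys.version_info[:2] == (2, 7):
    sys.stdout.write("Looks right for Paper Machines.\n")
else:
    sys.stdout.write("Not Python 2.7; double-check the path in Paper Machines Preferences.\n")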

Please let me know if I’ve explained anything incompletely, or if this fix doesn’t work for you.

Still having problems with Paper Machines?

We don’t really get to declare independence from Travis Brown’s technical support until we have Paper Machines actually running on our laptops. I’ve got everything working except topic modeling, and I think a lot of other people are in the same boat (although I know that Katie did manage to get it working last night). I’ve identified a Paper Machines bug report that may be the same problem we’re having (https://github.com/chrisjr/papermachines/issues/12), but I need more information to be sure. In the interest of getting topic modeling using Paper Machines working for myself (because I actually intend to use it), I’m volunteering to be the point person for debugging this issue. I think (to borrow from Mark Sample) this definitely counts as service, not scholarship.

Please comment on this post with your operating system, Java, and Python version information, plus whether or not Paper Machines is working for you. For operating system information on a Mac, click on the Apple in the top menu bar, and select “About This Mac”. For version information about Java and Python, you’ll need to open the Terminal application and then type “java -version” and “python -V” on the command line. I’m hoping that a Windows person in the class can supplement this with information on how to do the corresponding operations in that environment: since the last Windows OS that I was familiar with was XP Pro, I don’t think that anything I’ll have to say on the subject will be useful.
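If it’s easier, here’s a small Python script that gathers all three pieces of information in one go, so you can just paste its output into a comment. It assumes java is on your PATH; the OS line is written with a Mac in mind, so Windows folks should treat it as a sketch.

# versions.py: print OS, Python, and Java version information.
import platform
import subprocess

print("OS:     %s %s" % (platform.system(), platform.mac_ver()[0] or platform.release()))
print("Python: %s" % platform.python_version())
try:
    # "java -version" writes to stderr, so capture that stream.
    err = subprocess.Popen(["java", "-version"], stderr=subprocess.PIPE).communicate()[1]
    print("Java:   %s" % err.decode("utf-8", "replace").splitlines()[0].strip())
except OSError:
    print("Java:   not found on PATH")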

As an example, the OS on my laptop is Mac OS X Version 10.6.8, the Java version is 1.6.0_39, and the Python version is 2.6.1. I’ve got Paper Machines working except for topic modeling. I’d appreciate hearing from as many of you as possible, including those who have it working (because I need all the data points to see what the non-working installations have in common). Thanks.

No Clever Title

I used the text of John Henry Newman’s The Idea of a University that I found on the Project Gutenberg website last week to produce two word clouds on Wordle and WordItOut.

[Word cloud images: Wordle and WordItOut renderings of Newman]

There were two differences that jumped out right away when I compared the two: WordItOut seemed to do a better job of weeding out stopwords (“may”), and Wordle accepted without question what I’m pretty sure are character-encoding errors (the pseudo-words beginning with ‘Ä’).
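If I were doing this again, I’d scan the plain-text file for tokens containing non-ASCII characters before pasting it into Wordle, just to see what’s lurking in there. A small sketch along those lines (the filename is whatever you saved the Gutenberg plain text as, and I’m assuming the file is UTF-8):

# List the most common tokens containing non-ASCII characters, which is where
# the character-encoding oddities tend to show up.
import io
from collections import Counter

def suspicious_tokens(path, encoding="utf-8"):
    counts = Counter()
    with io.open(path, encoding=encoding, errors="replace") as f:
        for line in f:
            for token in line.split():
                if any(ord(ch) > 127 for ch in token):
                    counts[token] += 1
    return counts

for token, n in suspicious_tokens("newman-idea.txt").most_common(20):
    print("%5d  %s" % (n, token))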

I had pretty much the same experience as everyone else did when I pasted the words from the WordItOut word cloud into the Up-Goer Five Text Editor: it rejected 26 of the words (although it wasn’t concerned in the least that the words, in that order, did not constitute a syntactically valid English sentence).

I then pasted the same words from the WordItOut word cloud into the CLAWS Part-of-Speech tagger. For some reason, the text pasted without spaces between the words, and I had to enter the spaces manually. I noticed that the word list had a similar effect to the “entropic poem” on page 37 of Ramsay’s Reading Machines, which surprised me, since I had assumed that that effect would only be perceptible in a short text.

I get the point of tools like this. There’s a similar one called William Whitaker’s Words that’s very popular among students learning Latin, although the fact that CLAWS accepts bulk input (unlike Whitaker’s Words) is an improvement on the model. And there are useful things, I suppose, to be learned about a text from such tools (e.g., to confirm or deny the claim that John Calvin never used adverbs in writing). I didn’t, however, find the output of CLAWS particularly edifying in this case:

[Screenshot: CLAWS output]


I attempted to hand off the URL for the plain text on the Project Gutenberg site directly to TAPoR using “Your Web Page”, but what I got was an HTTP 403 Forbidden error, so I played with Chapter 1 of Moby Dick instead. My sense was that HyperPo needs a body of text longer than a single chapter in order to be really useful rather than a curiosity.

I don’t feel qualified to comment on whether the use of these tools produces an effect of estrangement and defamiliarization of textuality in general — not being a literature student, I’m not used to relating to textuality in the abstract, as opposed to a particular text or texts. My impression is that tools of this kind will do much more for you if you already know something about the text you are examining in this way, and I certainly got a lot more out of the examination of Gratian’s Decretum than of Newman’s Idea of a University.

A Quick Experiment in “Distant Reading” a Large Medieval Latin Text

[Wordle of Gratian’s Decretum]


My dissertation is on the textual development of Gratian’s Decretum. The Decretum was written around 1140 by the otherwise unknown Gratian, and was the foundational textbook for the systematic study of canon law within the medieval university. (In fact, it remained the basis for the law of the Roman Catholic church right up until 1917.)

Inspired by Charity and Kathryn’s presentation on Wednesday night, I decided to use Wordle to do an experiment in “distant reading” Gratian’s text. The MGH (Monumenta Germaniae Historica) in Munich digitized Emil Friedberg’s still-standard 1879 critical edition in the 1980s, and I cut-and-pasted the whole thing (all 490,446 words) into Wordle.

A few things need to be kept in mind in order to interpret the resulting Wordle.

First, the Decretum was written in Latin, a fully-inflected language, and Wordle does no stemming. This is both a minus and a plus. Deus, Dei, Deum and Deo are just morphologically different forms of one word, and if we were to put them all together, Deus (“God”) would have a more prominent (and less misleading) place in the visual space than it does. Episcopus (“bishop”) is another example. On the other hand, the fact that Wordle does no stemming has the effect of preserving the gendered words, for example eum (“him”) and eam (“her”). These pronouns can, of course, refer to things that are masculine and feminine in a purely grammatical sense, but the difference is nevertheless interesting.
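To get a rough sense of how much difference the lack of stemming makes, one could merge the inflected forms by hand before counting, at least for the words under discussion. A small sketch (the filename is hypothetical, the lemma table only covers the forms mentioned above, and a real experiment would need proper lemmatization):

import re
from collections import Counter

# Hand-built lemma table covering only the forms discussed above.
LEMMAS = {
    "deus": ["deus", "dei", "deum", "deo"],
    "episcopus": ["episcopus", "episcopi", "episcopo", "episcopum", "episcopis"],
    "is (masc.)": ["eum"],
    "is (fem.)": ["eam"],
}

form_to_lemma = {}
for lemma, forms in LEMMAS.items():
    for form in forms:
        form_to_lemma[form] = lemma

counts = Counter()
with open("friedberg-decretum.txt") as f:  # hypothetical filename for the MGH e-text
    for token in re.findall(r"[a-z]+", f.read().lower()):
        if token in form_to_lemma:
            counts[form_to_lemma[token]] += 1

for lemma, n in counts.most_common():
    print("%-12s %6d" % (lemma, n))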

Another linguistic feature is the salience of the word que. This word can mean several different things depending on context, but it shows up on the Wordle because of its use as a relative pronoun (“which”) kicking off a subordinate clause. Latin is a hypotactic language and so subordinate clauses appear much more frequently than in a paratactic language like English.

Second, the Wordle makes sense in the context of the way in which Gratian put the Decretum together. The Decretum consists of short extracts from “authorities”, church councils plus long-dead theologians and popes, which Gratian embeds within a framework of his own comments (called dicta or “sayings”). It is extremely interesting that only two of the individual authorities are named frequently enough to show up in the Wordle: Augustinus (bishop of Hippo Regius in modern-day Algeria, d. 430) and Gregorius (bishop of Rome, d. 604). The word Papa (“Pope”) is more prominent, suggesting the collective, if not individual, heft of the popes in the lineup of authorities. Finally, Concilio (“Council”) shows up because the attribution (“inscription” in the jargon of medieval canon law studies) of so many canons is to one or another of the general or provincial councils that Gratian cited.

The chaining of multiple authorities in sequence is a very prominent feature of the text, and is indicated by the word Item (“Similarly”). One of Gratian’s goals was to show that the authorities were in harmony with each other. In fact his title for the book (which isn’t the one that stuck) was Concordia Discordantium Canonum (“The Agreement of Disagreeing Rules”). To do that, however, he had to bring out the apparent disagreements among the authorities before resolving them (his resolutions usually being introduced by Unde or “Whence”). This gives rise to the use of adversative particles like uel (“or”) and uero (“but”) that foreground the (apparent) contrast between the positions of the authorities.

These are just some of the immediate reactions I had to a quick experiment in “distant reading” an almost half million word text in one morning. I’ll update this post if I come up with more upon further reflection. I’d also appreciate feedback from the group on how to better communicate these ideas.

The Idea of an E-Book

I’m sorry, I just can’t come up with the great posting titles the rest of you do.

The first book I looked for was Lux Mundi (1890), a collection of Anglo-Catholic theological essays edited by Charles Gore. My reason for doing so was practical, since Travis Brown and I are using scanned images from this book, fed through OCR tools like Tesseract and OCRopus, for the ActiveOCR project at MITH. I won’t say the book was chosen at random, but close to it. Travis wanted something from the late 19th century, and suggested that I search for everything in the Hathi Trust collection published in 1890.

The fact that the only other collection it appears in is Google Books, however, rules it out for the purpose of this assignment.

Deciding to stick with the theme of 19th century divines, I looked for John Henry Newman’s The Idea of a University, and found it on Project Gutenberg, the Internet Archive, Hathi Trust and Google Books.

As several others have noted, Project Gutenberg provides the most formats and the least provenance information. The book is available in HTML, EPUB, Kindle, PDF, Plucker, QiOO Mobile, Plain Text UTF-8 and TEI. All of these in addition, of course, to the Online Reader. Some of these formats seem a bit obscure to me — I had to look up Plucker (apparently an e-book reader for PalmOS devices), and QiOO (I’m guessing a reader for Android phones, since it’s Java-based, although they didn’t use the name Android). I fired up the oXygen editor to take a look at the TEI file, and it appears to be TEI (P5?) Lite with a Project Gutenberg-specific modified DTD. Although there are credits for the people responsible for preparing the files for Project Gutenberg, there is no information about which printed text(s) provide the basis for the electronic text.

I got 26 results when I searched the Internet Archive for The Idea of a University by Newman. One of these results was for the Project Gutenberg record, which offers the book in several formats not immediately visible on Project Gutenberg’s own page, including DAISY Digital Talking Book and DjVu (pronounced déjà vu, this is a format for scanned documents that its promoters, although I suspect few others, consider a competitor to image PDFs). There were also at least three (one may have been a duplicate) results from Google Books (digitized from the University of California, Harvard, and New York Public Libraries).

I chose to look in detail at one (26 was way too many), contributed by “Kelly – University of Toronto”. While my first reaction was that “Kelly” might be an individual, a Google search indicated that it is a reference to the John M. Kelly Library at the University of St. Michael’s College, a Catholic university that has an institutional relationship with the public University of Toronto. This version was available in Full Text, PDF, EPUB, Kindle, Daisy and DjVu formats. The document is in the Public Domain. There is no apparent way for users to report or correct errors. This is probably as good a place as any to note that I find the default online reader, which navigates through the text by “turning” pages, incredibly annoying. This is as misguided as the attempts of late 15th century printers to recreate the look of manuscripts in printed texts.

(This is as far as I’m going to be able to get before class, but I will update the post later with the information on the Hathi Trust and Google Books sites.)


From the under-theorized side of the room …

My name is Paul Evans, and I am currently a PhD candidate in the Medieval and Byzantine Studies program at The Catholic University of America in Washington, DC. I am also a graduate research assistant at UMD’s Maryland Institute for Technology in the Humanities (MITH), working as a Scala/Lift developer on MITH’s NEH-funded Active OCR project (http://mith.umd.edu/research/project/active-ocr/).

Before that, I had a number of other academic and professional lives. My undergraduate degree was in the History of Medieval and Early Modern Europe. I then spent 23 years working in the computer industry, for the first ten years as a UNIX system administrator, and then as a manager, director and VP of IT.

My PhD dissertation is focused on the evolution of Gratian’s Decretum (c. 1140), the foundational text of medieval canon law. So I’m working on a traditional topic, using a traditional approach (think 19th century German textual scholarship, like the Monumenta Germaniae Historica). My tools, however, are not traditional. I transcribe the texts from digitized images (still a manual process), encode the transcriptions in XML, and then write web applications in Java, Python or Scala that help me to visualize the variants. To see a sample, check out http://ingobert-app.appspot.com/.

Having read the other introductory blog posts, I feel under-qualified to discuss the critical-theoretical issues raised by the readings that the rest of you have engaged. I will limit myself to the issue of whether or not one has to know how to write code in order to be a DHer in good standing. As the person in the room with (I think) the most technical experience, I’m going to take the counter-intuitive position that the answer is “no” or at least “not much”. I think it’s more important to be able to tell a story that someone (yourself or someone else) can turn into code. To understand what I mean by “tell a story”, read Getting Real, a book on software development by 37 Signals, the people who brought you Basecamp.

I’m looking forward to the discussion tonight.