While most of you have moved on to newer and perhaps more exciting challenges (I can imagine that hypertext authoring tools like Twine might be a good deal more interesting to people enrolled in a graduate literature program than Paper Machines was), I’m still plugging away at topic modeling. In response to Matt’s tweet earlier this week, here’s a very preliminary update.
The text I’m working with is Gratian’s Decretum, a 12th c. textbook of canon law. The Decretum is not a literary text. Anyone matriculating in the faculty of canon law at any university in medieval Europe spent their first year sitting through lectures on the Decretum. The Decretum is a composite text, made up of excerpts from authorities like Augustine, Ambrose and Jerome; from canons of church councils; and from papal letters (real or forged). Gratian wrapped all of this in his own first-person commentary (the so-called dicta), which was supposed to carry the thread of his argument. So we’re not dealing with the monolithic work of a single author.
In the late 1990s, the Decretum was discovered to have been composed in two distinct stages, the First Recension and the Second Recension. The immediate goal of my topic modeling exercise is to determine whether I can detect topics that were only added in the Second Recension. I know that at least one such topic exists. My doctoral advisor discovered (the old-fashioned way) that all of the texts in the Decretum relating to the legal status of Jews were added in the Second Recension. If I can get topic modeling working on the text, the goal would be to topic model an electronic text of the standard edition of the Decretum (more or less corresponding to the Second Recension), then to topic model an electronic text that can be thought of as a proxy for the First Recension, and finally to look at the differences to try to detect topics that were added between the First and Second Recensions.
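To make that comparison concrete, here’s a rough sketch (in Python) of the kind of script I have in mind for the last step. It reads the --output-topic-keys files from the two runs and flags topics in the Second Recension model that have no close counterpart in the First Recension model. The First Recension file name and the 0.2 cutoff are placeholders, since I haven’t produced that run yet:

def read_topic_keys(path):
    """Parse a MALLET --output-topic-keys file: one topic per line (id, weight, key words)."""
    topics = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.split()
            if len(fields) > 2:                  # skip blank lines
                topics.append(set(fields[2:]))   # drop the id and weight columns
    return topics

def best_overlap(topic, other_topics):
    """Jaccard similarity between a topic and its closest match in the other model."""
    return max(len(topic & o) / len(topic | o) for o in other_topics)

second = read_topic_keys("gratian-topic-keys.txt")    # standard edition (Second Recension)
first = read_topic_keys("gratian-r1-topic-keys.txt")  # placeholder: First Recension proxy

for i, topic in enumerate(second):
    score = best_overlap(topic, first)
    if score < 0.2:                              # arbitrary cutoff for "no close counterpart"
        print(i, round(score, 2), " ".join(sorted(topic)))

Matching topics by best overlap rather than by topic number sidesteps the fact that the two models won’t number their topics the same way.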
I’m using command-line MALLET (rather than Paper Machines), which gives me the ability to manipulate things like the number of topics to model and the number of iterations at the expense of being a little clunky. Here’s an example of how you run it:
# bin/mallet import-dir --input data/gratian --output gratian-input.mallet --keep-sequence-bigrams --stoplist-file stoplists/mgh.txt
# bin/mallet train-topics --input gratian-input.mallet --num-topics 20 --output-state gratian-state.gz --num-iterations 10000 --output-topic-keys gratian-topic-keys.txt
This test of 10,000 iterations took 12 minutes, 18 seconds to run. I’ve gone up to 100,000 iterations (2 hours, 19 minutes). I won’t show you all 20 lines of output, but here’s the first few to give you an idea:
0 2.5 legis primo presumpserit ordinis rem humana tenere honoris cunctis quarta operis publica celebrare respondetur infirmitate diximus grauius conceditur dignitate
1 2.5 populo facere proprio boni multi ualeat sacerdotium uenia romani malorum clerum tradidit electus dat digna probare possessiones peruenire ordinationis
2 2.5 populi uoluntatem pertinet sentencia uideatur possumus obicitur unitatem dicta urbanus sinodum prohibet permanere sacerdotali decretum matrimonio corpore archiepiscopo dimissa
At this point, there are at least two things happening that I didn’t expect. First, the topic keys aren’t converging. My understanding was that at some point the output of the N+1th iteration wouldn’t be very different from the output of the Nth iteration. One of the interesting features of command-line MALLET is that it spits out the list of topic keys every 50 iterations, so you can watch it (try to) converge. So far, I’m seeing the words jump around a lot more than I expected. Second, the topic keys I’m getting look a lot more like the topic keys from Lisa Rhody’s ekphrastic poetry corpus than I’d expect for a non-literary text.
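One way to put a rough number on how much the keys are still moving would be to compare the topic keys from runs of different lengths (say, the 10,000- and 100,000-iteration runs) and see how well each topic in one lines up with its best match in the other. Here’s a minimal sketch that reuses read_topic_keys() and best_overlap() from the script above; the 100,000-iteration file name is a placeholder:

run_10k = read_topic_keys("gratian-topic-keys.txt")        # 10,000-iteration run
run_100k = read_topic_keys("gratian-topic-keys-100k.txt")  # placeholder: 100,000-iteration run

scores = [best_overlap(t, run_100k) for t in run_10k]
print("mean best-match overlap:", round(sum(scores) / len(scores), 3))

Strictly speaking, this only tells me whether independent runs end up in the same neighborhood, not whether any single run has converged, but if the numbers stay low it would at least confirm that the jumping around isn’t an artifact of watching the intermediate output.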
There are many issues that I still have to resolve. Probably the two most important are the number of topics to model, and whether or not stemming the Latin words will make a substantive difference.
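For the number of topics, the plan is probably just to sweep over a range of values and compare the resulting key files. Here’s a rough sketch of how I might script that, reusing the same flags as the commands above; the particular topic counts (and running everything serially at 10,000 iterations each) are placeholders rather than settled choices:

import subprocess

# Sweep over several values of --num-topics, writing one topic-keys file per run.
for k in (10, 20, 30, 40):
    subprocess.run(
        ["bin/mallet", "train-topics",
         "--input", "gratian-input.mallet",
         "--num-topics", str(k),
         "--num-iterations", "10000",
         "--output-state", "gratian-state-{}.gz".format(k),
         "--output-topic-keys", "gratian-topic-keys-{}.txt".format(k)],
        check=True,
    )

Stemming is a separate question: if I decide to stem the Latin, it would have to happen before the import-dir step, so it’s really a preprocessing problem rather than a MALLET one.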