More fun with topic modeling

While most of you have moved on to newer and perhaps more exciting challenges (I can imagine that hypertext authoring tools like Twine might be a good deal more interesting to people enrolled in a graduate literature program than Paper Machines was), I’m still plugging away at topic modeling. In response to Matt’s tweet earlier this week, here’s a very preliminary update.

The text I’m working with is Gratian’s Decretum, a 12th-century textbook of canon law. The Decretum is not a literary text. Anyone matriculating in the faculty of canon law at any university in medieval Europe spent their first year sitting through lectures from the Decretum. The Decretum is a composite text, made up of excerpts from authorities like Augustine, Ambrose, and Jerome; from canons of church councils; and from papal letters (real or forged). Gratian wrapped all of this in his own first-person commentary (the so-called dicta), which was supposed to carry the thread of his argument. So we’re not dealing with the monolithic work of a single author.

In the late 1990s, the Decretum was discovered to have been composed in two distinct stages, the First Recension and the Second Recension. The immediate goal of my topic modeling exercise is to determine whether I can detect topics that were only added in the Second Recension. I know that at least one such topic exists: my doctoral advisor discovered (the old-fashioned way) that all of the texts in the Decretum relating to the legal status of Jews were added in the Second Recension. If I can get topic modeling working on the text, the plan is to topic model an electronic text of the standard edition of the Decretum (more or less corresponding to the Second Recension), then to topic model an electronic text that can be thought of as a proxy for the First Recension, and finally to look at the differences to try to detect topics that were added between the First and Second Recensions.

I’m using command-line MALLET (rather than Paper Machines), which lets me control things like the number of topics to model and the number of iterations, at the expense of being a little clunky. Here’s an example of how you run it:

# bin/mallet import-dir --input data/gratian --output gratian-input.mallet --keep-sequence-bigrams --stoplist-file stoplists/mgh.txt

# bin/mallet train-topics --input gratian-input.mallet --num-topics 20 --output-state gratian-state.gz --num-iterations 10000 --output-topic-keys gratian-topic-keys.txt
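One variant I still want to try (a sketch; --optimize-interval is a standard MALLET option, but I haven’t settled on a value) is turning on hyperparameter optimization, which allows some topics to be more prominent than others and often produces cleaner topic keys:

# bin/mallet train-topics --input gratian-input.mallet --num-topics 20 --optimize-interval 10 --output-state gratian-state.gz --num-iterations 10000 --output-topic-keys gratian-topic-keys.txt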

The basic 20-topic test at 10,000 iterations took 12 minutes, 18 seconds to run. I’ve gone up to 100,000 iterations (2 hours, 19 minutes). I won’t show you all 20 lines of output, but here are the first few to give you an idea:

0 2.5 legis primo presumpserit ordinis rem humana tenere honoris cunctis quarta operis publica celebrare respondetur infirmitate diximus grauius conceditur dignitate
1 2.5 populo facere proprio boni multi ualeat sacerdotium uenia romani malorum clerum tradidit electus dat digna probare possessiones peruenire ordinationis
2 2.5 populi uoluntatem pertinet sentencia uideatur possumus obicitur unitatem dicta urbanus sinodum prohibet permanere sacerdotali decretum matrimonio corpore archiepiscopo dimissa

At this point, there are at least two things happening that I didn’t expect. First, the topic keys aren’t converging. My understanding was that at some point the output of the N+1th iteration wasn’t going to be very different from the output of the Nth iteration. One of the interesting features of command-line MALLET is that it spits out the list of topic keys every 50 iterations, so you can watch it (try to) converge. So far, I’m seeing the words jump around a lot more than I expected. Second, the topic keys I’m getting look a lot more like the topic keys from Lisa Rhody’s ekphrastic poetry corpus than I’d expect for a non-literary text.
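A more quantitative way to watch for convergence (a sketch, assuming MALLET’s progress messages, including the periodic per-token log-likelihood figures, go to stderr) is to capture the training chatter in a file and pull out just the LL/token lines:

# bin/mallet train-topics --input gratian-input.mallet --num-topics 20 --output-state gratian-state.gz --num-iterations 10000 --output-topic-keys gratian-topic-keys.txt 2> gratian-train.log
# grep 'LL/token' gratian-train.log

If the sampler is settling down, those numbers should rise and then flatten out; individual words can keep trading places in the topic keys even after the likelihood has plateaued, so this is a better convergence signal than eyeballing the keys.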

There are many issues that I still have to resolve. Probably the two most important are the number of topics to model, and whether or not stemming the Latin words will make a substantive difference.
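On the stemming question, even a crude first pass might show whether it matters. As a sketch (GNU sed assumed; the suffix list and the data/gratian-stemmed directory are made up for illustration, and this is not a real Latin stemmer), one could strip common inflectional endings before the import step:

# mkdir -p data/gratian-stemmed
# for f in data/gratian/*; do sed -E 's/(ibus|orum|arum|um|ae|is|as|os|es|em|am|us)\b//g' "$f" > data/gratian-stemmed/$(basename "$f"); done

and then re-run the import-dir command with --input data/gratian-stemmed to compare the resulting topic keys.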

7 thoughts on “More fun with topic modeling”

  1. Hi Paul,

    It definitely looks like something odd is going on here. If you’re not seeing things settle down after just 1k iterations it’s not likely that more will help.

    What’s the per-token log-likelihood doing? Do you see the same thing with hyperparameter estimation turned on (e.g. “--optimize-interval 10”)? How are you dividing the text up into documents? Are you using a Latin stopword list?

    Also, if you have enough text, you should expect to see different forms of a word grouping together, even in a highly-inflected language like Latin. The fact that you don’t see this in your example topics is also a little worrying.

    Is the script that generates the input to MALLET in your repo? At a glance I don’t see it. I’d be happy to give it a shot.

