Clay Templeton – Maryland Institute for Technology in the Humanities

Reading the Topic Modeling Literature (19 August 2011)

As Sayan Bhattacharyya and I have discussed in several posts over the summer, the technique of unsupervised “topic modeling” or Latent Dirichlet Allocation (LDA) has emerged in the humanities as one way to engage a text in “distant reading”. The appeal of the technique lies chiefly in the minimal assumptions it makes about the structure of meaning in a body of texts. However, this strength can also be a liability when the researcher brings specific research questions to a corpus. Classic topic modeling offers few levers of control by which a researcher can influence the outcome of the exercise.

How to remedy this? Digital Humanities practitioners are not typically in the business of implementing topic models. Rather, the Digital Humanities community has received LDA from the Natural Language Processing community, which in turn built it from basic research in Bayesian methods. As humanists consider the programmatic infrastructure required to launch innovations in topic modeling, the NLP community is addressing a diversifying portfolio of questions in its field by mutating basic LDA. In this post, I present three key questions for practitioners to ask in approaching new topic modeling techniques developed in the Natural Language Processing community:

  1. What kind of questions does the model address?
  2. What new information does the model include to address these questions?
  3. How is the structure of the model adapted so that it can take advantage of the new information to answer the questions?

I illustrate these questions by applying them to Wang and McCallum's paper on Topics over Time. As with many topic modeling papers, much light can be shed simply by reading the introduction.

What kind of questions does the model address?

In the opening section of their paper, Wang and McCallum explain that their approach is motivated by unexploited temporal information. In their more technical language, “the large data sets to which these topic models are applied do not have static co-occurrence patterns; the data are often collected over time, and generally patterns present in the early part of the collection are not in effect later.” We might like to see the Mexican-American War and World War I, for example, emerge as distinct topics in a historical analysis of U.S. State of the Union addresses. The Topics over Time technique is designed to encourage topics to “rise and fall in prominence” as a function of time (Section 1).

What new information does the model include?

As I pointed out in my previous post, topical characterization of a time window is achievable using classic LDA: simply average a topic's prominence over the documents inside the window. By that point, however, the topics have already conflated aspects of things like wars waged decades apart. Thus, for example, a topic including airplanes as a prominent word might be well represented in 1893. This is fine if we're looking for transhistorical themes, but not if we'd rather find historical trends in the use of language. Topics over Time instead uses the date-stamp associated with each document, taking it as input data alongside information about the co-occurrence of words.
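A minimal sketch of that window-averaging step, assuming a document-topic proportion matrix from a classic LDA run and a year for each document (all names and values below are illustrative, not tied to any particular toolkit):

```python
import numpy as np

# Illustrative inputs: doc_topics[d, k] is topic k's proportion in document d,
# as output by classic LDA; years[d] is document d's date-stamp.
rng = np.random.default_rng(0)
doc_topics = rng.dirichlet(np.ones(20), size=200)   # 200 docs, 20 topics
years = rng.integers(1850, 1950, size=200)

def topic_prevalence_by_window(doc_topics, years, window=10):
    """Average each topic's proportion over the documents in each time window."""
    prevalence = {}
    for start in range(int(years.min()), int(years.max()) + 1, window):
        in_window = (years >= start) & (years < start + window)
        if in_window.any():
            prevalence[start] = doc_topics[in_window].mean(axis=0)
    return prevalence   # window start year -> vector of mean topic proportions

trends = topic_prevalence_by_window(doc_topics, years)
```

Plotting one topic's entry across the windows gives a trend line over time, but the topics themselves remain blind to chronology, which is exactly the limitation described above.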

How is the structure of the model adapted?

Topics over Time uses date-stamps to encourage topics to cluster around a point in time. As the inferential machinery behind the model develops topics, it also estimates where the central point in time lies for each topic and how widely the topic tends to disperse around that point. In Wang and McCallum's language, “TOT [Topics over Time] parameterizes a continuous distribution over time associated with each topic, and topics are responsible for generating both observed timestamps as well as words. Parameter estimation is thus driven to discover topics that simultaneously capture word co-occurrences and locality of those patterns in time” (Section 1). The result is topics that more often reveal historically specific themes.
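To make the adaptation concrete, here is a toy simulation of the TOT generative story: each topic carries its own Beta distribution over a normalized time span, and topics generate a timestamp alongside every word. All parameter values here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

K, V = 3, 1000                                    # topics, vocabulary size
phi = rng.dirichlet(np.ones(V) * 0.01, size=K)    # per-topic word preferences
# Per-topic Beta parameters over normalized time in [0, 1]: topic 0 peaks
# early, topic 1 mid-corpus, topic 2 late (values invented for illustration).
psi = np.array([[2.0, 8.0], [6.0, 6.0], [8.0, 2.0]])

def generate_document(n_words=100):
    theta = rng.dirichlet(np.ones(K) * 0.5)       # the document's topic mixture
    words, stamps = [], []
    for _ in range(n_words):
        z = rng.choice(K, p=theta)                # choose a topic, which then
        words.append(rng.choice(V, p=phi[z]))     # generates a word...
        stamps.append(rng.beta(*psi[z]))          # ...and a timestamp
    return words, stamps
```

In real data every word in a document shares the document's date-stamp; the per-word draw is a modeling fiction, like the generative story for the words themselves. Inference runs the story in reverse: a topic that lumped together wars waged decades apart would explain the observed timestamps poorly, so estimation pulls such topics apart.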

Topics over Time is one of a number of topic model adaptations that hold promise for the digital humanities. Dynamic Topic Modeling allows topics to evolve from year to year, capturing the intuition that scientific fields, for example, endure despite changing terminology. Using Supervised LDA (SLDA), another innovation, a modeler can encourage topics to form so that the proportions of topics making up a document are effective predictors of some target variable. This allows the modeler to exert influence on the kind of structure topic modeling explicates. Finally, Dirichlet Forests allow the modeler to engender affinities or aversions between words based on prior knowledge of the content domain.
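As one concrete illustration of the SLDA idea, the sketch below treats a document's topic proportions as predictors of a target variable. This two-stage setup (fit topics first, then regress) only approximates SLDA, which fits the topics and the regression jointly so that the topics themselves are shaped by the target; all variable names and values are invented.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Stand-ins: doc_topics from any LDA run; targets is some per-document
# quantity of interest (a date, a rating, a vote share, ...).
doc_topics = rng.dirichlet(np.ones(20), size=200)
targets = doc_topics @ rng.normal(size=20) + 0.1 * rng.normal(size=200)

model = LinearRegression().fit(doc_topics, targets)
# In true SLDA, coefficients like these feed back into topic inference,
# pulling the topics toward whatever structure best predicts the target.
print(model.coef_)
```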

For all of these techniques, an additional key question is where to find the code to implement them. Code implementing Dirichlet Forest Priors can be found here. An implementation of SLDA can be found here. Another option is always to contact the researchers on a paper and ask them if their code is sharable, or to consult your local topic modeling expert (if you’re fortunate enough to have one!).

Topic Modeling in the Humanities: An Overview (1 August 2011)

In a recent post to this blog, Sayan Bhattacharyya described his contributions to the Woodchipper project in the context of a broader discussion about corpus-based approaches to humanities research. Topic modeling, the statistical technology undergirding Woodchipper, has garnered increasing attention as a tool of hermeneutic empowerment, a method for drawing structure out of a corpus on the basis of minimal critical presuppositions. In this post I map out a basic genealogy of topic modeling in the humanities, from the highly cited paper that first articulated Latent Dirichlet Allocation (LDA) to recent work at MITH.

The Story of Topic Modeling

The original LDA topic modeling paper, the one that defined the field, was published by Blei, Ng, and Jordan in 2003. The basic story is one of assumptions, and it goes like this: First, assume that each document is made up of a random mixture of categories, or topics. Now, suppose each category is defined by its preference for some words over others. Finally, let’s pretend we’re going to generate each word in each document from scratch. Over and over again, we randomly choose a category, then we randomly choose a word based on the preferences of that category.
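To make the story concrete, here is a toy simulation of that generative process; every number below is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

K, V = 5, 1000                                      # topics, vocabulary size
topics = rng.dirichlet(np.ones(V) * 0.01, size=K)   # each topic prefers some words

def generate_document(n_words=100):
    theta = rng.dirichlet(np.ones(K) * 0.5)         # a random mixture of topics
    doc = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta)                  # randomly choose a category...
        doc.append(rng.choice(V, p=topics[z]))      # ...then a word it prefers
    return doc

corpus = [generate_document() for _ in range(10)]
```

The inference problem, described next, is to recover `topics` and each document's `theta` when only `corpus` is observed.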

Obviously the corpus wasn’t actually generated this way. Barring cyborg intervention, it was probably written down by a person or group of people. However, topic modeling calls on us to suspend our disbelief. Let’s just suppose the corpus was generated entirely through this process. Then, given that the corpus is what it is, what are the most likely underlying affinities between words and between categories? Topic modeling infers a plausible answer under the assumption that the “generative story” I told a paragraph ago is true.

Skepticism is warranted, but the proof of the pudding is in the eating, and the results can be quite good. As an aside to his work modeling Martha Ballard's diary, Cameron Blevins (Ph.D. candidate in American History, Stanford University) offers a frank acknowledgment that a newcomer can benefit from topic modeling by using a standard toolkit like MALLET (developed at the University of Massachusetts Amherst): “I don't pretend to have a firm grasp on the inner statistical/computational plumbing of how MALLET produces these topics, but in the case of Martha Ballard's diary, it worked. Beautifully.”

A tool like MALLET gives a sense of the relative importance of topics in the composition of each document, as well as a list of the most prominent words in each topic. The word lists define the topics – it’s up to practitioners to discern meaning in the topics and to give them names. For example, Blevins identifies the theme of “housework” in two topics, and then shows that the prevalence of these topics in the corpus increases over the life-span of Martha Ballard’s diary. Although a correlation between housework and age might seem counterintuitive, it turns out to corroborate the definitive critical commentary on the diary.
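MALLET is a Java toolkit driven from the command line; for readers who work in Python, the gensim library (a swap-in here, not what Blevins used) produces the same two kinds of output on a toy corpus roughly as follows:

```python
from gensim import corpora, models

# A toy corpus of pre-tokenized documents (real projects substitute their own).
texts = [["garden", "seed", "plant"], ["fever", "salve", "birth"],
         ["garden", "plant", "weed"], ["birth", "midwife", "fever"]]

dictionary = corpora.Dictionary(texts)
bows = [dictionary.doc2bow(t) for t in texts]
lda = models.LdaModel(bows, id2word=dictionary, num_topics=2, passes=50)

# The word lists that define (and remain to be named by) each topic.
for k in range(lda.num_topics):
    print(k, lda.show_topic(k, topn=5))

# The relative importance of topics in the composition of each document.
for bow in bows:
    print(lda.get_document_topics(bow))
```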

In this or similar fashion, the “topic proportions” assigned to each document are often used, in conjunction with topic word lists, to draw comparative insights about a corpus. Boundary lines are drawn around document subsets, and topic proportions are aggregated within each piece of territory. In chronological applications like Blevins' study of the diary, it is often advantageous to draw these boundary lines in time. Griffiths and Steyvers (2004) illustrate how to register temporal changes in topic composition using the output from basic LDA, and Newman and Block's (2006) work with the 18th-century Pennsylvania Gazette corpus was perhaps the first diachronic application of topic modeling in the humanities.

Into a DH Frame: Graphs, Maps, and Trees

Applications of topic modeling in the digital humanities are sometimes framed within a “distant reading” paradigm, for which Franco Moretti's Graphs, Maps, Trees (2005) is the key text. Robert K. Nelson, director of the Digital Scholarship Lab and author of the Mining the Dispatch project, explains that “the real potential of topic modeling . . . isn't at the level of the individual document. Topic modeling, instead, allows us to step back from individual documents and look at larger patterns among all the documents, to practice not close but distant reading, to borrow [Moretti's] memorable phrase.” In his recent post on this blog, my fellow MITH intern Sayan Bhattacharyya motivates interface enhancements to the Woodchipper project by appealing to the interplay of distant and close reading. In general, the Woodchipper project aims to facilitate a seamless interpretive experience that toggles between multiple levels of engagement with a corpus of texts.

Five Elements of Topic Modeling Projects

In one of the earliest impulses in the current wave of humanities topic modeling, Matthew Jockers (Stanford University) modeled a corpus of blogs generated during the “Day of DH”, a “community publication project that [brought] together digital humanists from around the world to document what they do on one day.” In Jockers’ methodology, I identify five elements needed to communicate the story of a topic modeling project:

Corpus
Technique
Unit of Analysis
Post Processing
Visualization

For Jockers’ project, these elements (potentially, descriptive metadata elements in a future registry or repository of Bayesian DH projects) are populated as follows:

Corpus: Day of DH Blog posts
Technique: vanilla LDA using MALLET
Unit of analysis: Blog (all the posts on a single blog)
Post Processing: “With a little massaging in R, I read in the matrix and then use some simple distance and clustering functions to group the bloggers into 10 (again an arbitrary number) groups; groups based on shared themes.”
Visualization: “I then output a matrix showing which authors have the most in common.”

A Catalogue of DH Topic Modeling Projects

As I hinted above, topic modeling projects can be distinguished by whether their unit of analysis is synchronic or diachronic, and this distinction turns out to be an efficacious organizer for a directory of DH projects. Jockers' “Day of DH” project, based on the blog as unit of analysis, is synchronic; Newman and Block's earlier, time-bound approach to colonial newspapers was diachronic. The distinction structures this open list of DH topic modeling projects:

Synchronic approaches (Unit of analysis is not time bound)
Matthew Jockers’ work on the Day of DH blog posts (2010).
Elijah Meeks’ work on self-definitions of digital humanists (2011).
Jeff Druin’s work on Proust (2011).
Travis Brown’s work on Jane Austen’s Emma and and Byron’s Don Juan (2011).

Diachronic approaches (Unit of analysis is a time slice)
Block and Newman’s work on the Pennsylvania Gazette (2006).
Cameron Blevins’ work on Martha Ballard’s diary (2010).
Robert K. Nelson’s work on the Richmond Daily Dispatch corpus (2011).
Yang, Torget, and Mihalcea’s work on Texas newspapers (2011).

One might expect this directory to expand rapidly as practitioners enter the field and new techniques are imported from Natural Language Processing (NLP) into the DH community.

In this post I have tried to draw out the unique value of LDA topic modeling as a text mining technique in the humanities, and to identify significant landmarks in the field. In future posts, I will begin to incorporate topic modeling approaches that extend the LDA model into the conversation, and to document our exploratory movements at MITH. Follow me @purplewove on Twitter.
