Reflections on Scale and Topic Modeling
By Sayan Bhattacharyya | Maryland Institute for Technology in the Humanities | 8 August 2011
https://mith.umd.edu/reflections-on-scale-and-topic-modeling/

I recently came across a 1991 interview with the literary critic Harold Bloom (Sterling Professor of Humanities at Yale University) in The Paris Review, in which Bloom remarks:

“As far as I’m concerned, computers have as much to do with literature as space travel, perhaps much less.”

Since coming here (to MITH) as an intern this summer, I have learned about several projects that just might make Bloom change his mind. In the last few weeks, I have been working on one such project, along with Travis Brown and Clay Templeton here at MITH, that draws on cutting-edge work on topic modeling currently being done in the University of Maryland's Computer Science department. Clay has already written about the project in his previous blog post, so I will use this opportunity to offer some reflections of my own.

The question of “scale” has been on my mind over the past couple of weeks. We are processing vast amounts of text data — topic modeling is the kind of approach whose power of discovery is predicated on the assumption that vast amounts of textual data will be available for it to run on. It gives me pause to reflect that the assumption that these approaches will “scale up” quantitatively, becoming more prominent and visible in the coming years, rests on some deeper technological and social assumptions. Their continued success will depend on Moore’s Law continuing to hold (that is, on more and more processing power becoming available more and more cheaply), and also on the willingness (and legal feasibility) of the libraries and institutions that own such vast repositories of texts to make them available in computer-readable formats.

Earlier, our group here at MITH was working with the “unsupervised” topic modeling approach, in which no knowledge of the content of the text is really needed: the algorithm simply cranks away at whatever text corpus it is given and discovers topics from it. For the last week or so, though, we have focused on the new “supervised” topic modeling approach being developed by a research group in the Computer Science department here at Maryland. The idea in “supervised” topic modeling is to “train” the algorithm by making use of domain knowledge. For example, in conjunction with the Civil War era newspaper archive with which we are working, we are drawing on related pieces of knowledge from contemporaneous sources external to our corpus, such as the casualty rate for each week and the Consumer Price Index for each month. The idea behind this approach is that the algorithm will discover more “meaningful” topics if it has a way to use feedback about how well the topics it discovers are associated with a parameter of interest. Thus, if we are trying to bias the algorithm toward discovering topics that pertain more directly to the Civil War and its effects, it makes sense to align the corpus with these other kinds of data, in our case the casualty figures and economic figures for the era, which have a provenance outside the text corpus. This is where the “qualitative” sense of scale becomes important.
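To make the alignment step concrete, here is a minimal Scala sketch of the kind of preprocessing involved: attaching external, dated covariates to each newspaper issue so that a supervised topic model can condition on them. The `Issue` case class, the map keys, and the field names here are hypothetical stand-ins; the actual project code differs.

```scala
import java.time.LocalDate
import java.time.temporal.WeekFields

// Hypothetical record for one dated newspaper issue in the corpus.
case class Issue(id: String, date: LocalDate, text: String)

object CovariateAlignment {
  // ISO week-of-year field used to key the weekly casualty series.
  private val weekOfYear = WeekFields.ISO.weekOfWeekBasedYear()

  def weekKey(d: LocalDate): (Int, Int)  = (d.getYear, d.get(weekOfYear))
  def monthKey(d: LocalDate): (Int, Int) = (d.getYear, d.getMonthValue)

  /** Attach external covariates (weekly casualties, monthly CPI) to each
    * issue; issues whose date falls outside either series are dropped. */
  def align(issues: Seq[Issue],
            weeklyCasualties: Map[(Int, Int), Double],
            monthlyCpi: Map[(Int, Int), Double]): Seq[(Issue, Array[Double])] =
    issues.flatMap { issue =>
      for {
        casualties <- weeklyCasualties.get(weekKey(issue.date))
        cpi        <- monthlyCpi.get(monthKey(issue.date))
      } yield (issue, Array(casualties, cpi))
    }
}
```

In the supervised setting, the resulting per-document values play the role of the response or metadata against which the model's discovered topics are evaluated.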

The more intelligently we try to leverage the power of these approaches, the more the number of areas with which a successful practitioner of this kind of topic modeling must have at least a passing acquaintance will itself “scale” up. This made me think about how people trained in information science — a truly interdisciplinary field — are really well-positioned to do this kind of work. Over the last week, for example, I read several papers on the economic history of the Civil War (which Robert K. Nelson, a historian at the University of Richmond who has worked on topic modeling and history, pointed us to). Who would have guessed that one would have to read Civil War papers in the course of a summer internship in information science? I aligned the economic data with the text corpus, and based on what the data seemed to be telling us, I came up with a design for some experiments to test our hypotheses, which we will carry out over the next few days.

Also, in a piece of exciting news, the paper proposal that we (Travis, Clay, and I) submitted to “Making Meaning,” a graduate student conference organized by the Program in Rhetoric in the English Department at the University of Michigan, has been accepted. The presentation will reflect on how one might situate approaches like topic modeling in the context of literary theory and philosophy. This, too, is an example of how, as “information scientists,” we must see and think in terms of the “big picture”; that is, we must scale up to the big picture.

P.S. Now that this post has turned out to be a reflection on the question of scale, it occurs to me that it is only fitting that the programming language I learned during the earlier part of the internship was — Scala!

Digging into Data with Topic Models
By Sayan Bhattacharyya | Maryland Institute for Technology in the Humanities | 22 July 2011
https://mith.umd.edu/digging-into-data-with-topic-models/

I am a graduate student from the University of Michigan interning this summer at MITH, working on the topic modeling project that is underway here. In this post, I will describe the “what” and “why” of what I have been doing, and I will try to put it in the wider context of corpus-based approaches.

R&D software developer Travis Brown and others at MITH have developed Woodchipper, a visualization tool, which runs the Mallet package developed at the University of Massachusetts at Amherst to perform topic modeling on a selected corpus, and then displays the results of a principal-component analysis. An attractive feature of Woodchipper is that it is oriented towards “drilling down” — a concept that is particularly relevant to the digital humanities. Those of us who “do” humanities pride ourselves on being close readers of texts. To be appealing to humanists, topic modeling, insofar as we can think of it as a method of “distant reading,” will need to be combined with close reading. Woodchipper allows the humanist scholar to view individual texts by displaying each page of a text as a clickable data point on a two-dimensional graph; the spatial layout of the graph is shaped by the results of the principal component analysis.

A visualization that connects the “distant” aspect of the text, its high-level attributes or “topics,” with the “close” aspect of the text, its individual words, is crucial. The challenge is the following: why should the researcher trust the high-level attributes that the model says the text has? Only if the visualization bridges the gap between the topic-level and word-level attributes of the text, by clearly displaying their relationship, will the user be likely to trust the high-level properties discovered by the topic model.

Thus we decided to make the visualization richer and more expressive. Earlier, Woodchipper displayed only a specified number of topics adjudged by the algorithm to be the best topics for the page, and it represented each topic as a list of the few words that were most representative of that topic. However, simply representing a topic by a few selected words is misleading: even if those words are the highest-probability words in the topic, the amount of probability mass each of them accounts for can differ greatly. It would be more logical and more expressive, therefore, to represent a topic by those words which, together, add up to a specified fraction of the topic's total probability mass. Doing so requires changing the server-side Scala code that furnishes these words before the Woodchipper client retrieves the topic (and, with it, the words).
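A minimal Scala sketch of that selection rule, with hypothetical names (the actual server-side code differs): sort a topic's words by probability and keep the smallest prefix whose probabilities add up to the requested fraction of the topic's total mass.

```scala
// Sketch: represent a topic by the smallest set of its highest-probability
// words whose probabilities sum to a given fraction of the total mass
// (e.g. 0.5), rather than by a fixed top N.
object TopicWords {
  def wordsCoveringMass(wordProbs: Map[String, Double],
                        massFraction: Double = 0.5): Seq[String] = {
    val sorted = wordProbs.toSeq.sortBy { case (_, p) => -p }
    val total  = sorted.map(_._2).sum
    // Running totals of probability, in descending order of word probability.
    val cumulative = sorted.scanLeft(0.0) { case (acc, (_, p)) => acc + p }.tail
    // Keep words until the running total first reaches the requested fraction.
    val cutoff = cumulative.indexWhere(_ >= massFraction * total)
    sorted.take(if (cutoff < 0) sorted.length else cutoff + 1).map(_._1)
  }
}
```

With `massFraction = 0.5`, a topic dominated by one or two words is labeled by just those words, while a flatter topic is labeled by many; that is exactly the distinction a fixed top-N list hides.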

We also realized that a further change needed to be made. Each page was too small a unit of text, so that, very often, no word on a page actually matched the topics assigned to that page. We realized that we probably needed to break the documents into larger units than individual pages, in order to show the user a more trustworthy picture of how the top level (“topics”) connects with the bottom level (“words on a specific page”) when we metaphorically “drill down” from top to bottom.
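One straightforward way to do this, sketched below in Scala with hypothetical types, is to group consecutive pages of each document into fixed-size chunks and treat each chunk as the unit handed to the topic model.

```scala
// Sketch: group consecutive pages of a document into larger chunks so that
// each unit passed to the topic model contains enough text for its topic
// assignments to be visibly reflected in its words.
object Chunking {
  case class Page(docId: String, pageNum: Int, text: String)
  case class Chunk(docId: String, firstPage: Int, lastPage: Int, text: String)

  def chunkPages(pages: Seq[Page], pagesPerChunk: Int = 5): Seq[Chunk] =
    pages
      .groupBy(_.docId)
      .toSeq
      .flatMap { case (docId, docPages) =>
        docPages.sortBy(_.pageNum).grouped(pagesPerChunk).map { group =>
          Chunk(docId, group.head.pageNum, group.last.pageNum,
                group.map(_.text).mkString("\n"))
        }
      }
}
```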

Stay tuned to the MITH blog for further posts over the course of the summer from me and my fellow graduate intern, Clay Templeton.
