{"id":3093,"date":"2011-08-01T09:16:10","date_gmt":"2011-08-01T14:16:10","guid":{"rendered":"http:\/\/mith.umd.edu\/?p=3093"},"modified":"2020-10-08T16:03:19","modified_gmt":"2020-10-08T20:03:19","slug":"topic-modeling-in-the-humanities-an-overview","status":"publish","type":"post","link":"https:\/\/mith.umd.edu\/topic-modeling-in-the-humanities-an-overview\/","title":{"rendered":"Topic Modeling in the Humanities: An Overview"},"content":{"rendered":"<p>In a <a href=\"http:\/\/mith.umd.edu\/digging-into-data-with-topic-models\/\" target=\"_blank\" rel=\"noopener noreferrer\">recent post<\/a> to this blog, Sayan Bhattacharyya described his contributions to the <a href=\"http:\/\/mith.umd.edu\/corporacamp\/tool.php\" target=\"_blank\" rel=\"noopener noreferrer\">Woodchipper<\/a> project in the context of a broader discussion about corpus-based approaches to humanities research. Topic modeling, the statistical technology undergirding Woodchipper, has garnered increasing attention as a tool of hermeneutic empowerment, a method for drawing structure out of a corpus on the basis of minimal critical presuppositions. In this post I map out a basic genealogy of topic modeling in the humanities, from the highly cited paper that first articulated Latent Dirichlet Allocation (LDA) to recent work at MITH.<\/p>\n<p><strong>The Story of Topic Modeling<\/strong><\/p>\n<p>The <a href=\"http:\/\/www.google.com\/url?sa=t&amp;source=web&amp;cd=2&amp;ved=0CCcQFjAB&amp;url=http%3A%2F%2Fwww.cs.princeton.edu%2F~blei%2Fpapers%2FBleiNgJordan2003.pdf&amp;rct=j&amp;q=blei%20ng%202003%20topic%20modeling&amp;ei=Jrs2TtuGG-S50AHG_cj3Cw&amp;usg=AFQjCNEGsYCPJ8IZk9Y4xKeIS6WCKUeO-A&amp;sig2=Daec-QOPp6uZnxCp841icg&amp;cad=rja\" target=\"_blank\" rel=\"noopener noreferrer\">original LDA topic modeling paper<\/a>, the one that defined the field, was published by Blei, Ng, and Jordan in 2003. 
The basic story is one of assumptions, and it goes like this: First, assume that each document is made up of a random mixture of categories, or topics. Now, suppose each category is defined by its preference for some words over others. Finally, let&#8217;s pretend we&#8217;re going to generate each word in each document from scratch. Over and over again, we randomly choose a category, then we randomly choose a word based on the preferences of that category.<\/p>\n<p>Obviously the corpus wasn&#8217;t actually generated this way. Barring cyborg intervention, it was probably written down by a person or group of people. However, topic modeling calls on us to suspend our disbelief. Let&#8217;s just suppose the corpus was generated entirely through this process. Then, given that the corpus is what it is, what are the most likely underlying affinities between words and between categories? Topic modeling infers a plausible answer under the assumption that the \u201cgenerative story\u201d I told a paragraph ago is true.<\/p>\n<p>Skepticism is warranted, but the proof of the pudding is that it can be quite nice. As an aside to <a href=\"http:\/\/historying.org\/2010\/04\/01\/topic-modeling-martha-ballards-diary\/\" target=\"_blank\" rel=\"noopener noreferrer\">his work modeling Martha Ballard&#8217;s diary<\/a>, Cameron Blevins (Ph.D. candidate in American History, Stanford University) frankly acknowledges that a newcomer can benefit from topic modeling by using a standard toolkit like <a href=\"http:\/\/mallet.cs.umass.edu\/\" target=\"_blank\" rel=\"noopener noreferrer\">MALLET<\/a> (developed at the University of Massachusetts Amherst): &#8220;I don\u2019t pretend to have a firm grasp on the inner statistical\/computational plumbing of how MALLET produces these topics, but in the case of Martha Ballard\u2019s diary, it worked. 
Beautifully.&#8221;<\/p>\n<p>A tool like MALLET gives a sense of the relative importance of topics in the composition of each document, as well as a list of the most prominent words in each topic. The word lists define the topics \u2013 it&#8217;s up to practitioners to discern meaning in the topics and to give them names. For example, Blevins identifies the theme of \u201chousework\u201d in two topics, and then shows that the prevalence of these topics in the corpus increases over the life-span of Martha Ballard&#8217;s diary. Although a correlation between housework and age might seem counterintuitive, it turns out to corroborate the definitive critical commentary on the diary.<\/p>\n<p>In this or similar fashion, the \u201ctopic proportions\u201d assigned to each document are often used, in conjunction with topic word lists, to draw comparative insights about a corpus. Boundary lines are drawn around document subsets and topic proportions are aggregated within each piece of territory. In chronological applications like Blevins&#8217; study of the diary, it is often advantageous to draw boundary lines in time. <a href=\"http:\/\/web.archive.org\/web\/20120417131033\/http:\/\/www.pnas.org:80\/content\/101\/suppl.1\/5228.full.pdf\" target=\"_blank\" rel=\"noopener noreferrer\">Griffiths and Steyvers (2004)<\/a> illustrate how to register temporal changes in topic composition using the output from basic LDA. 
<a href=\"http:\/\/www.ics.uci.edu\/~newman\/pubs\/JASIST_Newman.pdf\" target=\"_blank\" rel=\"noopener noreferrer\">Newman and Block (2006)<\/a>&#8217;s work with the 18th-century <em>Pennsylvania Gazette<\/em> corpus was perhaps the first diachronic application of topic modeling in the humanities.<\/p>\n<p><strong>Into a DH Frame: Graphs, Maps, and Trees<\/strong><\/p>\n<p>Applications of topic modeling in the digital humanities are sometimes framed within a \u201cdistant reading\u201d paradigm, for which Franco Moretti&#8217;s <em>Graphs, Maps, Trees<\/em> (2005) is the key text. Robert K. Nelson, director of the Digital Scholarship Lab and author of the <a href=\"http:\/\/dsl.richmond.edu\/dispatch\/pages\/intro\" target=\"_blank\" rel=\"noopener noreferrer\">Mining the Dispatch<\/a> project, explains that \u201cthe real potential of topic modeling . . . isn&#8217;t at the level of the individual document. Topic modeling, instead, allows us to step back from individual documents and look at larger patterns among all the documents, to practice not close but distant reading, to borrow <div class=\"fusion-fullwidth fullwidth-box fusion-builder-row-1 hundred-percent-fullwidth non-hundred-percent-height-scrolling\" style=\"background-color: rgba(255,255,255,0);background-position: center center;background-repeat: no-repeat;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;margin-bottom: 0px;margin-top: 0px;border-width: 0px 0px 0px 0px;border-color:#eae9e9;border-style:solid;\" ><div class=\"fusion-builder-row fusion-row\"><div class=\"fusion-layout-column fusion_builder_column fusion-builder-column-0 fusion_builder_column_1_1 1_1 fusion-one-full fusion-column-first fusion-column-last fusion-column-no-min-height\" style=\"margin-top:0px;margin-bottom:0px;\"><div class=\"fusion-column-wrapper fusion-flex-column-wrapper-legacy\" style=\"background-position:left 
top;background-repeat:no-repeat;-webkit-background-size:cover;-moz-background-size:cover;-o-background-size:cover;background-size:cover;padding: 0px 0px 0px 0px;\">[Moretti&#8217;s] memorable phrase.\u201d In his <a href=\"http:\/\/mith.umd.edu\/digging-into-data-with-topic-models\/\" target=\"_blank\" rel=\"noopener noreferrer\">recent post<\/a> on this blog, my fellow MITH intern Sayan Bhattacharyya motivates interface enhancements to the Woodchipper project by appealing to the interplay of distant and close reading. In general, the Woodchipper project aims to facilitate a seamless interpretive experience that toggles between multiple levels of engagement with a corpus of texts.<\/p>\n<p><strong>Five Elements of Topic Modeling Projects<\/strong><\/p>\n<p>In <a href=\"http:\/\/web.archive.org\/web\/20111211124523\/http:\/\/www.stanford.edu:80\/~mjockers\/cgi-bin\/drupal\/node\/39\" target=\"_blank\" rel=\"noopener noreferrer\">one of the earliest impulses<\/a> in the current wave of humanities topic modeling, Matthew Jockers (Stanford University) modeled a corpus of blogs generated during the \u201cDay of DH\u201d, a \u201ccommunity publication project that [brought] together digital humanists from around the world to document what they do on one day.\u201d In Jockers&#8217; methodology, I identify five elements needed to communicate the story of a topic modeling project:<\/p>\n<p>Corpus<br \/>\nTechnique<br \/>\nUnit of Analysis<br \/>\nPost Processing<br \/>\nVisualization<\/p>\n<p>For Jockers&#8217; project, these elements (potentially, descriptive metadata elements in a future registry or repository of Bayesian DH projects) are populated as follows:<\/p>\n<p><strong>Corpus<\/strong>: Day of DH Blog posts<br \/>\n<strong>Technique<\/strong>: vanilla LDA using MALLET<br \/>\n<strong>Unit of analysis<\/strong>: Blog (all the posts on a single blog)<br \/>\n<strong>Post Processing<\/strong>: &#8220;With a little massaging in R, I read in the matrix and then use 
some simple distance and clustering functions to group the bloggers into 10 (again an arbitrary number) groups; groups based on shared themes.&#8221;<br \/>\n<strong>Visualization<\/strong>: &#8220;I then output a matrix showing which authors have the most in common.&#8221;<\/p>\n<p><strong>A Catalogue of DH Topic Modeling Projects<\/strong><\/p>\n<p>As I hinted before, a distinction between diachronic and synchronic units of analysis seems feasible. It also turns out to be an effective way to organize a directory of DH projects. Jockers&#8217; \u201cDay of DH\u201d project, based on the blog as unit of analysis, happens to be synchronic; earlier, Newman and Block&#8217;s time-bound approach to colonial newspapers was diachronic. The distinction structures this open list of DH topic modeling projects:<\/p>\n<p><strong>Synchronic Approaches<\/strong> (Unit of analysis is not time bound)<br \/>\n<a href=\"http:\/\/web.archive.org\/web\/20111211124523\/http:\/\/www.stanford.edu:80\/~mjockers\/cgi-bin\/drupal\/node\/39\" target=\"_blank\" rel=\"noopener noreferrer\">Matthew Jockers&#8217; work<\/a> on the Day of DH blog posts (2010).<br \/>\n<a href=\"https:\/\/dhs.stanford.edu\/comprehending-the-digital-humanities\/\" target=\"_blank\" rel=\"noopener noreferrer\">Elijah Meeks&#8217; work<\/a> on self-definitions of digital humanists (2011).<br \/>\n<a href=\"\/\/dhs.stanford.edu\/algorithmic-literacy\/topic-networks-in-proust\/\" target=\"_blank\" rel=\"noopener noreferrer\">Jeff Drouin&#8217;s work<\/a> on Proust (2011).<br \/>\n<a href=\"http:\/\/mith.umd.edu\/corporacamp\/tool.php\" target=\"_blank\" rel=\"noopener noreferrer\">Travis Brown&#8217;s work<\/a> on Jane Austen&#8217;s <em>Emma<\/em> and Byron&#8217;s <em>Don Juan<\/em> (2011).<\/p>\n<p><strong>Diachronic Approaches<\/strong> (Unit of analysis is a time slice)<br \/>\n<a href=\"http:\/\/www.ics.uci.edu\/~newman\/pubs\/JASIST_Newman.pdf\" target=\"_blank\" rel=\"noopener noreferrer\">Block and 
Newman&#8217;s work<\/a> on the <em>Pennsylvania Gazette<\/em> (2006).<br \/>\n<a href=\"http:\/\/historying.org\/2010\/04\/01\/topic-modeling-martha-ballards-diary\/\" target=\"_blank\" rel=\"noopener noreferrer\">Cameron Blevins&#8217; work<\/a> on Martha Ballard&#8217;s diary (2010).<br \/>\n<a href=\"http:\/\/dsl.richmond.edu\/dispatch\/pages\/intro\" target=\"_blank\" rel=\"noopener noreferrer\">Robert K. Nelson&#8217;s work<\/a> on the <em>Richmond Daily Dispatch<\/em> corpus (2011).<br \/>\n<a href=\"http:\/\/www.aclweb.org\/anthology\/W\/W11\/W11-15.pdf#page=108\" target=\"_blank\" rel=\"noopener noreferrer\">Yang, Torget, and Mihalcea&#8217;s work<\/a> on Texas newspapers (2011).<\/p>\n<p>One might expect this directory to expand rapidly as practitioners enter the field and new techniques are imported from Natural Language Processing (NLP) into the DH community.<\/p>\n<p>In this post I have tried to draw out the unique value of LDA topic modeling as a text mining technique in the humanities, and to identify significant landmarks in the field. In future posts, I will begin incorporating into the conversation topic modeling approaches that extend the LDA model, and documenting our exploratory movements at MITH. 
Follow me @purplewove on twitter.<\/p>\n<h2><\/h2>\n<div class=\"fusion-clearfix\"><\/div><\/div><\/div><\/div><style type=\"text\/css\">.fusion-fullwidth.fusion-builder-row-1 { overflow:visible; }<\/style><\/div><\/p>\n","protected":false},"excerpt":{"rendered":"<p>In a recent post to this blog, Sayan Bhattacharyya described his contributions to the Woodchipper project in the context of a broader discussion about corpus-based [&hellip;]<\/p>\n","protected":false},"author":13,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[66,77],"tags":[164,55],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v15.0 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Topic Modeling in the Humanities: An Overview &ndash; Maryland Institute for Technology in the Humanities<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/mith.umd.edu\/topic-modeling-in-the-humanities-an-overview\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Topic Modeling in the Humanities: An Overview &ndash; Maryland Institute for Technology in the Humanities\" \/>\n<meta property=\"og:description\" content=\"In a recent post to this blog, Sayan Bhattacharyya described his contributions to the Woodchipper project in the context of a broader discussion about corpus-based [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/mith.umd.edu\/topic-modeling-in-the-humanities-an-overview\/\" \/>\n<meta property=\"og:site_name\" content=\"Maryland Institute for Technology in the Humanities\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/UMD.MITH\" \/>\n<meta property=\"article:published_time\" content=\"2011-08-01T14:16:10+00:00\" \/>\n<meta 
property=\"article:modified_time\" content=\"2020-10-08T20:03:19+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/mith.umd.edu\/wp-content\/uploads\/2018\/10\/MITH-logostack-square-grn.png\" \/>\n\t<meta property=\"og:image:width\" content=\"300\" \/>\n\t<meta property=\"og:image:height\" content=\"300\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebSite\",\"@id\":\"https:\/\/mith.umd.edu\/#website\",\"url\":\"https:\/\/mith.umd.edu\/\",\"name\":\"Maryland Institute for Technology in the Humanities\",\"description\":\"\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":\"https:\/\/mith.umd.edu\/?s={search_term_string}\",\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/mith.umd.edu\/topic-modeling-in-the-humanities-an-overview\/#webpage\",\"url\":\"https:\/\/mith.umd.edu\/topic-modeling-in-the-humanities-an-overview\/\",\"name\":\"Topic Modeling in the Humanities: An Overview &ndash; Maryland Institute for Technology in the Humanities\",\"isPartOf\":{\"@id\":\"https:\/\/mith.umd.edu\/#website\"},\"datePublished\":\"2011-08-01T14:16:10+00:00\",\"dateModified\":\"2020-10-08T20:03:19+00:00\",\"author\":{\"@id\":\"https:\/\/mith.umd.edu\/#\/schema\/person\/495c07746050c4324330052ceacf384b\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/mith.umd.edu\/topic-modeling-in-the-humanities-an-overview\/\"]}]},{\"@type\":\"Person\",\"@id\":\"https:\/\/mith.umd.edu\/#\/schema\/person\/495c07746050c4324330052ceacf384b\",\"name\":\"Clay Templeton\",\"image\":{\"@type\":\"ImageObject\",\"@id\":\"https:\/\/mith.umd.edu\/#personlogo\",\"inLanguage\":\"en-US\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/c0af60de3b46654a3cb955444a50f39a?s=96&d=mm&r=g\",\"caption\":\"Clay Templeton\"}}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","_links":{"self":[{"href":"https:\/\/mith.umd.edu\/wp-json\/wp\/v2\/posts\/3093"}],"collection":[{"href":"https:\/\/mith.umd.edu\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mith.umd.edu\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mith.umd.edu\/wp-json\/wp\/v2\/users\/13"}],"replies":[{"embeddable":true,"href":"https:\/\/mith.umd.edu\/wp-json\/wp\/v2\/comments?post=3093"}],"version-history":[{"count":1,"href":"https:\/\/mith.umd.edu\/wp-json\/wp\/v2\/posts\/3093\/revisions"}],"predecessor-version":[{"id":21312,"href":"https:\/\/mith.umd.edu\/wp-json\/wp\/v2\/posts\/3093\/revisions\/21312"}],"wp:attachment":[{"href":"https:\/\/mith.umd.edu\/wp-json\/wp\/v2\/media?parent=3093"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mith.umd.edu\/wp-json\/wp\/v2\/categories?post=3093"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mith.umd.edu\/wp-json\/wp\/v2\/tags?post=3093"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}