{"id":3325,"date":"2011-08-19T13:40:19","date_gmt":"2011-08-19T18:40:19","guid":{"rendered":"http:\/\/mith.umd.edu\/?p=3325"},"modified":"2020-10-08T16:03:18","modified_gmt":"2020-10-08T20:03:18","slug":"reading-the-topic-modeling-literature","status":"publish","type":"post","link":"https:\/\/mith.umd.edu\/reading-the-topic-modeling-literature\/","title":{"rendered":"Reading the Topic Modeling Literature"},"content":{"rendered":"<p>As Sayan Bhattacharyya and I have <a href=\"http:\/\/mith.umd.edu\/topic-modeling-in-the-humanities-an-overview\/\" target=\"_blank\" rel=\"noopener noreferrer\">discussed<\/a> in <a href=\"http:\/\/mith.umd.edu\/reflections-on-scale-and-topic-modeling\/\" target=\"_blank\" rel=\"noopener noreferrer\">several<\/a> <a href=\"http:\/\/mith.umd.edu\/digging-into-data-with-topic-models\/\" target=\"_blank\" rel=\"noopener noreferrer\">posts<\/a> over the summer, the technique of unsupervised \u201ctopic modeling\u201d or Latent Dirichlet Allocation (LDA) has emerged in the humanities as one way to engage a text in \u201cdistant reading\u201d. The appeal of the technique lies chiefly in the minimal assumptions it makes about the structure of meaning in a body of texts. However, this strength can also be a liability when the researcher brings specific research questions to a corpus. Classic topic modeling offers few levers of control by which a researcher can influence the outcome of the exercise.<\/p>\n<p>How to remedy this? Digital Humanities practitioners are not typically in the business of implementing topic models. Rather, the Digital Humanities community has received LDA from the Natural Language Processing community, who in turn built it from basic research in <a href=\"https:\/\/en.wikipedia.org\/wiki\/Bayesian_probability\" target=\"_blank\" rel=\"noopener noreferrer\">Bayesian methods<\/a>. 
As humanists consider the programmatic infrastructure required to launch innovations in topic modeling, the NLP community is addressing a diversifying portfolio of questions in its field by adapting basic LDA. In this post, I present three key questions for practitioners to ask when approaching new topic modeling techniques developed in the Natural Language Processing community:<\/p>\n<ol>\n<li>What kind of questions does the model address?<\/li>\n<li>What new information does the model include to address these questions?<\/li>\n<li>How is the structure of the model adapted so that it can take advantage of the new information to answer the questions?<\/li>\n<\/ol>\n<p>I illustrate the application of these questions to Wang and McCallum&#8217;s paper on <a href=\"http:\/\/citeseer.ist.psu.edu\/viewdoc\/summary?doi=10.1.1.152.2460\" target=\"_blank\" rel=\"noopener noreferrer\">Topics over Time<\/a>. As in many topic modeling papers, much light can be shed simply by reading the introduction.<\/p>\n<p>What kind of questions does the model address?<\/p>\n<p>In the opening section of their paper, Wang and McCallum explain that their approach is motivated by unexploited temporal information. In their more technical language, \u201cthe large data sets to which these topic models are applied do not have static co-occurrence patterns; the data are often collected over time, and generally patterns present in the early part of the collection are not in effect later.\u201d We might like to see the Mexican-American War and World War I, for example, emerge as distinct topics in a historical analysis of U.S. State-of-the-Union addresses. 
The Topics over Time technique is designed to encourage topics to \u201crise and fall in prominence\u201d as a function of time (1).<\/p>\n<p>What new information does the model include?<\/p>\n<p>As I pointed out in my <a href=\"http:\/\/mith.umd.edu\/topic-modeling-in-the-humanities-an-overview\/\" target=\"_blank\" rel=\"noopener noreferrer\">previous post<\/a>, topical characterization of a time window is achievable using classic LDA. Simply average a topic&#8217;s prominence over each document inside the window. However, the topics thus distributed over time have already, at that point, conflated aspects of things like wars waged decades apart. Thus, for example, a topic including airplanes as a prominent word might be well represented in 1893, a decade before powered flight. This is fine if we\u2019re looking for transhistorical themes, but not if we\u2019d rather find historical trends in the use of language. Topics over Time uses the date-stamp associated with each document, taking this as input data alongside information about the co-occurrence of words.<\/p>\n<p>How is the structure of the model adapted?<\/p>\n<p>Topics over Time uses date-stamps to encourage topics to cluster around a point in time. As the inferential machinery behind the model develops topics, it also estimates where the central point in time lies for each topic and how far the topic tends to disperse around that point. In Wang and McCallum&#8217;s language, \u201cTOT [Topics over Time] parameterizes a continuous distribution over time associated with each topic, and topics are responsible for generating both observed timestamps as well as words. Parameter estimation is thus driven to discover topics that simultaneously capture word co-occurrences and locality of those patterns in time\u201d (Section 1). This leads to more topics that reveal historically specific themes.<\/p>\n<p>Topics over Time is one of a number of topic model adaptations that hold promise for the digital humanities. 
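The windowed averaging described above can be sketched in a few lines of Python. Everything here is invented for illustration: the document names, years, and topic proportions are hypothetical stand-ins for the output of any classic LDA run, not real model output.

```python
# Sketch: post-hoc "topics over time" from classic LDA output.
# doc_topics maps each document to its topic-proportion vector;
# doc_years gives each document's date-stamp. All values are invented.
from collections import defaultdict

doc_topics = {
    "sotu_1846": [0.70, 0.10, 0.20],  # hypothetical 3-topic model
    "sotu_1847": [0.60, 0.25, 0.15],
    "sotu_1917": [0.05, 0.80, 0.15],
    "sotu_1918": [0.10, 0.75, 0.15],
}
doc_years = {"sotu_1846": 1846, "sotu_1847": 1847,
             "sotu_1917": 1917, "sotu_1918": 1918}

def topic_prominence_by_window(window_size=10):
    """Average each topic's proportion over the documents in each window."""
    windows = defaultdict(list)
    for doc, year in doc_years.items():
        windows[year // window_size * window_size].append(doc_topics[doc])
    return {start: [sum(col) / len(col) for col in zip(*vecs)]
            for start, vecs in sorted(windows.items())}

print(topic_prominence_by_window())
```

Note that the averaging happens only after the topics are fixed, which is exactly why it cannot undo the conflation: the windows see whatever transhistorical topics LDA already formed.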
<a href=\"http:\/\/citeseerx.ist.psu.edu\/viewdoc\/download?doi=10.1.1.62.2783&amp;rep=rep1&amp;type=pdf\" target=\"_blank\" rel=\"noopener noreferrer\">Dynamic Topic Modeling<\/a> allows topics to evolve from year to year, capturing the intuition that scientific fields, for example, endure despite changing terminology. Using <a href=\"http:\/\/www.cs.princeton.edu\/~blei\/papers\/BleiMcAuliffe2007.pdf\" target=\"_blank\" rel=\"noopener noreferrer\">Supervised LDA<\/a> (SLDA), another innovation, a modeler can encourage topics to form so that the proportions of topics making up a document are effective predictors of some target variable. This allows the modeler to exert influence on the kind of structure topic modeling explicates. Finally, Dirichlet Forests allow the modeler to engender affinities or aversions between words based on prior knowledge of the content domain.<\/p>\n<p>For all of these techniques, an additional key question is where to find the code to implement them. Code implementing Dirichlet Forest Priors can be found <a href=\"http:\/\/pages.cs.wisc.edu\/~andrzeje\/research\/df_lda.html\" target=\"_blank\" rel=\"noopener noreferrer\">here<\/a>. An implementation of SLDA can be found <a href=\"http:\/\/web.archive.org\/web\/20120825213639\/http:\/\/www.cs.princeton.edu:80\/~chongw\/slda\/\" target=\"_blank\" rel=\"noopener noreferrer\">here<\/a>. 
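To make SLDA's premise concrete: once inference is done, each document's topic proportions serve as features for predicting a response. The prediction step can be sketched as below; the coefficients and proportions are invented, and this shows only the idea, not the actual SLDA inference.

```python
# Sketch of Supervised LDA's core premise: a document's topic
# proportions (theta) predict a response y through per-topic
# coefficients (eta). Values below are invented for illustration.
eta = [2.0, -1.0, 0.5]    # hypothetical per-topic regression weights
theta = [0.6, 0.3, 0.1]   # one document's topic proportions
y_hat = sum(e * t for e, t in zip(eta, theta))
print(y_hat)  # approximately 0.95
```

Because SLDA fits eta and the topics jointly, topics are nudged toward whatever structure best predicts the target variable, which is the "lever of control" the post opened by asking for.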
Another option is always to contact the researchers on a paper and ask them if their code is sharable, or to consult your local topic modeling expert (if you\u2019re fortunate enough to have one!).<\/p>\n","protected":false},"excerpt":{"rendered":"<p>As Sayan Bhattacharyya and I have discussed in several posts over the summer, the technique of unsupervised \u201ctopic modeling\u201d or Latent Dirichlet Allocation (LDA) has [&hellip;]<\/p>\n","protected":false},"author":13,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[66,77],"tags":[164,55],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v15.0 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Reading the Topic Modeling Literature &ndash; Maryland Institute for Technology in the Humanities<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/mith.umd.edu\/reading-the-topic-modeling-literature\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Reading the Topic Modeling Literature &ndash; Maryland Institute for Technology in the Humanities\" \/>\n<meta property=\"og:description\" content=\"As Sayan Bhattacharyya and I have discussed in several posts over the summer, the technique of unsupervised \u201ctopic modeling\u201d or Latent Dirichlet Allocation (LDA) has [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/mith.umd.edu\/reading-the-topic-modeling-literature\/\" \/>\n<meta property=\"og:site_name\" content=\"Maryland Institute for Technology in the Humanities\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/UMD.MITH\" \/>\n<meta property=\"article:published_time\" content=\"2011-08-19T18:40:19+00:00\" \/>\n<meta 
property=\"article:modified_time\" content=\"2020-10-08T20:03:18+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/mith.umd.edu\/wp-content\/uploads\/2018\/10\/MITH-logostack-square-grn.png\" \/>\n\t<meta property=\"og:image:width\" content=\"300\" \/>\n\t<meta property=\"og:image:height\" content=\"300\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebSite\",\"@id\":\"https:\/\/mith.umd.edu\/#website\",\"url\":\"https:\/\/mith.umd.edu\/\",\"name\":\"Maryland Institute for Technology in the Humanities\",\"description\":\"\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":\"https:\/\/mith.umd.edu\/?s={search_term_string}\",\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/mith.umd.edu\/reading-the-topic-modeling-literature\/#webpage\",\"url\":\"https:\/\/mith.umd.edu\/reading-the-topic-modeling-literature\/\",\"name\":\"Reading the Topic Modeling Literature &ndash; Maryland Institute for Technology in the Humanities\",\"isPartOf\":{\"@id\":\"https:\/\/mith.umd.edu\/#website\"},\"datePublished\":\"2011-08-19T18:40:19+00:00\",\"dateModified\":\"2020-10-08T20:03:18+00:00\",\"author\":{\"@id\":\"https:\/\/mith.umd.edu\/#\/schema\/person\/495c07746050c4324330052ceacf384b\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/mith.umd.edu\/reading-the-topic-modeling-literature\/\"]}]},{\"@type\":\"Person\",\"@id\":\"https:\/\/mith.umd.edu\/#\/schema\/person\/495c07746050c4324330052ceacf384b\",\"name\":\"Clay Templeton\",\"image\":{\"@type\":\"ImageObject\",\"@id\":\"https:\/\/mith.umd.edu\/#personlogo\",\"inLanguage\":\"en-US\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/c0af60de3b46654a3cb955444a50f39a?s=96&d=mm&r=g\",\"caption\":\"Clay Templeton\"}}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","_links":{"self":[{"href":"https:\/\/mith.umd.edu\/wp-json\/wp\/v2\/posts\/3325"}],"collection":[{"href":"https:\/\/mith.umd.edu\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mith.umd.edu\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mith.umd.edu\/wp-json\/wp\/v2\/users\/13"}],"replies":[{"embeddable":true,"href":"https:\/\/mith.umd.edu\/wp-json\/wp\/v2\/comments?post=3325"}],"version-history":[{"count":1,"href":"https:\/\/mith.umd.edu\/wp-json\/wp\/v2\/posts\/3325\/revisions"}],"predecessor-version":[{"id":21310,"href":"https:\/\/mith.umd.edu\/wp-json\/wp\/v2\/posts\/3325\/revisions\/21310"}],"wp:attachment":[{"href":"https:\/\/mith.umd.edu\/wp-json\/wp\/v2\/media?parent=3325"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mith.umd.edu\/wp-json\/wp\/v2\/categories?post=3325"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mith.umd.edu\/wp-json\/wp\/v2\/tags?post=3325"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}