{"id":2969,"date":"2011-07-22T11:11:49","date_gmt":"2011-07-22T16:11:49","guid":{"rendered":"http:\/\/mith.umd.edu\/?p=2969"},"modified":"2020-10-08T16:03:21","modified_gmt":"2020-10-08T20:03:21","slug":"digging-into-data-with-topic-models","status":"publish","type":"post","link":"https:\/\/mith.umd.edu\/digging-into-data-with-topic-models\/","title":{"rendered":"Digging into Data with Topic Models"},"content":{"rendered":"<p>I  am a graduate student from the <a href=\"http:\/\/www.umich.edu\/\" target=\"_blank\" rel=\"noopener noreferrer\">University of Michigan<\/a> interning this  summer at MITH, working on the topic modeling project that is underway  here. In this post, I will describe the &#8220;what&#8221; and &#8220;why&#8221; of what I have  been doing, and I will try to put it in the wider context of corpus-based approaches.<\/p>\n<p>R&amp;D software developer Travis Brown and others at MITH have developed  Woodchipper, a visualization tool, which runs the Mallet package  developed at the <a href=\"http:\/\/www.umass.edu\/\" target=\"_blank\" rel=\"noopener noreferrer\">University of Massachusetts at Amherst<\/a> to perform topic  modeling on a selected corpus, and then displays the results of a  principal-component analysis. An attractive feature of Woodchipper is  that it is oriented towards &#8220;drilling down&#8221; &#8212; a concept that is  particularly relevant to the digital humanities. Those of us who &#8220;do&#8221;  humanities pride ourselves on being <a href=\"http:\/\/en.wikipedia.org\/wiki\/Close_reading\" target=\"_blank\" rel=\"noopener noreferrer\">close readers<\/a> of texts. To be appealing to humanists, topic modeling, insofar as we  can think of it as a method of &#8220;<a href=\"http:\/\/mikejohnduff.blogspot.com\/2009\/11\/distant-reading.html\" target=\"_blank\" rel=\"noopener noreferrer\">distant reading<\/a>,&#8221;  will need to be combined with close reading. 
Woodchipper allows the humanist scholar to view individual texts by displaying each page of a text as a clickable data point on a two-dimensional graph; the spatial layout of the graph is shaped by the results of the principal component analysis.<\/p>\n<p>Visualizing the relationship between the &#8220;distant&#8221; aspect of the text (its high-level attributes, or &#8220;topics&#8221;) and its &#8220;close&#8221; aspect (its individual words) is crucial. The challenge is the following: why should the researcher trust the high-level attributes that the model says the text has? Only if the visualization bridges the gap between the high-level and text-level attributes of the text, by clearly displaying their relationship, will the user be likely to trust the high-level properties discovered by the topic model.<\/p>\n<p>We therefore decided to make the visualization richer and more expressive. Earlier, Woodchipper displayed only a specified number of topics, those judged by the algorithm to be the best topics for the page. The visualization represented each topic as a list of the first few words that were most representative of that topic. However, representing a topic simply as a few selected words is misleading because, even if those words are the highest-probability words in that topic, the probability mass that each word actually carries within the topic may vary widely. It would therefore be more logical and more expressive to represent a topic by those words which, together, add up to a specified fraction of the total probability mass. Doing so requires changing the Scala code on the server side, which furnishes these words before the Woodchipper client accesses the topic (and, hence, the words).<\/p>\n<p>We also realized that a further change needed to be made. 
Each page was too small, so that, very often, no word on the page actually matched the topics assigned to that page. We realized that we probably needed to break the documents up into larger units in order to show the user a more trustworthy picture of how the top level (&#8220;topics&#8221;) connects with the bottom level (&#8220;words on a specific page&#8221;) when we metaphorically &#8220;drill down&#8221; from top to bottom.<\/p>\n<p>Stay tuned to the MITH blog for further posts over the course of the summer from me and my fellow graduate intern, Clay Templeton.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I am a graduate student from the University of Michigan interning this summer at MITH, working on the topic modeling project that is underway here. [&hellip;]<\/p>\n","protected":false},"author":11,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[66,77],"tags":[164,55],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v15.0 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Digging into Data with Topic Models &ndash; Maryland Institute for Technology in the Humanities<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/mith.umd.edu\/digging-into-data-with-topic-models\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Digging into Data with Topic Models &ndash; Maryland Institute for Technology in the Humanities\" \/>\n<meta property=\"og:description\" content=\"I am a graduate student from the University of Michigan interning this summer at MITH, working on the topic modeling project that is underway here. 
[&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/mith.umd.edu\/digging-into-data-with-topic-models\/\" \/>\n<meta property=\"og:site_name\" content=\"Maryland Institute for Technology in the Humanities\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/UMD.MITH\" \/>\n<meta property=\"article:published_time\" content=\"2011-07-22T16:11:49+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2020-10-08T20:03:21+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/mith.umd.edu\/wp-content\/uploads\/2018\/10\/MITH-logostack-square-grn.png\" \/>\n\t<meta property=\"og:image:width\" content=\"300\" \/>\n\t<meta property=\"og:image:height\" content=\"300\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebSite\",\"@id\":\"https:\/\/mith.umd.edu\/#website\",\"url\":\"https:\/\/mith.umd.edu\/\",\"name\":\"Maryland Institute for Technology in the Humanities\",\"description\":\"\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":\"https:\/\/mith.umd.edu\/?s={search_term_string}\",\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/mith.umd.edu\/digging-into-data-with-topic-models\/#webpage\",\"url\":\"https:\/\/mith.umd.edu\/digging-into-data-with-topic-models\/\",\"name\":\"Digging into Data with Topic Models &ndash; Maryland Institute for Technology in the 
Humanities\",\"isPartOf\":{\"@id\":\"https:\/\/mith.umd.edu\/#website\"},\"datePublished\":\"2011-07-22T16:11:49+00:00\",\"dateModified\":\"2020-10-08T20:03:21+00:00\",\"author\":{\"@id\":\"https:\/\/mith.umd.edu\/#\/schema\/person\/f146dda81152e7a9ea2018aa3c22b377\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/mith.umd.edu\/digging-into-data-with-topic-models\/\"]}]},{\"@type\":\"Person\",\"@id\":\"https:\/\/mith.umd.edu\/#\/schema\/person\/f146dda81152e7a9ea2018aa3c22b377\",\"name\":\"Sayan Bhattacharyya\",\"image\":{\"@type\":\"ImageObject\",\"@id\":\"https:\/\/mith.umd.edu\/#personlogo\",\"inLanguage\":\"en-US\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/f32a60715e2055043d28c86699f6c376?s=96&d=mm&r=g\",\"caption\":\"Sayan Bhattacharyya\"}}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","_links":{"self":[{"href":"https:\/\/mith.umd.edu\/wp-json\/wp\/v2\/posts\/2969"}],"collection":[{"href":"https:\/\/mith.umd.edu\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mith.umd.edu\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mith.umd.edu\/wp-json\/wp\/v2\/users\/11"}],"replies":[{"embeddable":true,"href":"https:\/\/mith.umd.edu\/wp-json\/wp\/v2\/comments?post=2969"}],"version-history":[{"count":1,"href":"https:\/\/mith.umd.edu\/wp-json\/wp\/v2\/posts\/2969\/revisions"}],"predecessor-version":[{"id":21315,"href":"https:\/\/mith.umd.edu\/wp-json\/wp\/v2\/posts\/2969\/revisions\/21315"}],"wp:attachment":[{"href":"https:\/\/mith.umd.edu\/wp-json\/wp\/v2\/media?parent=2969"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mith.umd.edu\/wp-json\/wp\/v2\/categories?post=2969"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mith.umd.edu\/wp-json\/wp\/v2\/tags?post=2969"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}