{"id":9901,"date":"2012-12-18T08:30:36","date_gmt":"2012-12-18T13:30:36","guid":{"rendered":"http:\/\/mith.umd.edu\/?p=9901"},"modified":"2020-10-08T16:00:49","modified_gmt":"2020-10-08T20:00:49","slug":"asking-questions-of-lots-of-text-with-weka","status":"publish","type":"post","link":"https:\/\/mith.umd.edu\/asking-questions-of-lots-of-text-with-weka\/","title":{"rendered":"Asking Questions of Lots of Text with Weka"},"content":{"rendered":"<p><em>Adrian Hamins-Puertolas and Adam Elrafei are students in <a href=\"http:\/\/web.archive.org\/web\/20131220230110\/http:\/\/teams.gemstone.umd.edu\/classof2014\/politic\/\" target=\"_blank\" rel=\"noopener noreferrer\">Team POLITIC<\/a>, an undergraduate research team in the University of Maryland\u2019s GEMSTONE honors research-focused honors college, mentored by MITH Faculty Fellow Peter Mallios.<\/em><\/p>\n<p>Our undergraduate research team uses newly developed technology to understand and quantify how American audiences received Russian authors in the early 1920s. One of the tools we\u2019re exploring is Weka, a collection of machine-learning algorithms that can be used to mine data-sets. MITH has helped us design and construct our database, which contains thousands of articles about Russian authors featured in American literary magazines written during the 1920s). Each article in the database is associated with values indicating the frequency of words in the text, so we can trace how often a single word (unigram) like \u201crevolution\u201d appears throughout our articles, or how often two words appear next to each other (a bigram), such as \u201cRussian revolution\u201d. Both these features offer us paths to think about our dataset in terms of describing and quantifying word proximity.<\/p>\n<p>MITH&#8217;s Travis Brown demonstrated how we could use Weka to train a machine learning classifier that could assign labels to articles in the dataset we had not ever read. 
To test this, we created a smaller training dataset of just 150 articles, a number small enough that we could actually read the entire texts and describe each one manually, answering questions ranging from "Is a given literary author a subject of debate in this article?" to "Is radical politics an issue in the article?" Given these measures, Weka can classify every other article in our dataset with some degree of accuracy.

Weka provided us with a decision tree that correctly classifies answers to the question "Is literary style and artistry an issue in this article?" for approximately 67% of our training set. This success rate should improve as we add new measures for classifying and quantifying the text. One direction is to use MALLET, "an integrated collection of Java code useful for statistical natural language processing ... [and] document classification", to create topics: groups of words that MALLET
finds to be significantly thematically related. Topic modeling is fascinating because a preliminary examination of generated topics has already surfaced a variety of distinct themes and vocabularies in our dataset, ranging from religion to specific Russian authors. We are now running Weka's classification on the generated topics that include religious language in order to answer another of our questions, "Is religion an issue in the article?"

Our current Weka experiments, using a smaller training set of 46 articles, have already produced promising results. For example, when using the J48 decision-tree algorithm on our textual data filtered into unigrams, Weka correctly classifies 76% of our documents on the "Is politics an issue?" question. If we filter the data into both unigrams and bigrams, the correct classification rate decreases to 67%. However, if we filter into unigrams and apply a stemmer (which reduces words to their root forms, ignoring prefixes and suffixes), the rate increases to 77%.

We look forward to expanding our experiments to an even larger subset of our data as we continue to learn more about natural language processing tools in the coming weeks.
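The decision trees behind these experiments work by repeatedly choosing the word feature that best splits the labeled articles. As a toy illustration of that idea, here is a one-question "decision stump" over unigram presence, in pure Python; the mini training set and yes/no labels are invented for the sketch, and this is not Weka's actual J48 algorithm:

```python
from collections import Counter

def best_stump(articles, labels, vocabulary):
    """Find the single word whose presence best predicts the labels.

    Each article is a set of unigrams. Returns a tuple
    (word, label_if_present, label_if_absent, training_accuracy).
    """
    best = None
    for word in vocabulary:
        present = [lab for art, lab in zip(articles, labels) if word in art]
        absent = [lab for art, lab in zip(articles, labels) if word not in art]
        # Predict the majority label on each side of the split.
        yes = Counter(present).most_common(1)[0][0] if present else labels[0]
        no = Counter(absent).most_common(1)[0][0] if absent else labels[0]
        correct = sum(1 for art, lab in zip(articles, labels)
                      if (yes if word in art else no) == lab)
        acc = correct / len(labels)
        if best is None or acc > best[3]:
            best = (word, yes, no, acc)
    return best

# Invented mini training set for "Is politics an issue?" (yes/no labels).
docs = [{"revolution", "soviet", "novel"}, {"tolstoy", "style", "prose"},
        {"revolution", "party"}, {"poetry", "style"}]
labels = ["yes", "no", "yes", "no"]
vocab = set().union(*docs)
word, if_present, if_absent, acc = best_stump(docs, labels, vocab)
```

A real decision tree such as J48 asks many such questions in sequence, and its accuracy depends heavily on which features (unigrams, bigrams, stemmed forms) it is given, which is why the filtering choices above move the classification rate.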