Asking Questions of Lots of Text with Weka
https://mith.umd.edu/asking-questions-of-lots-of-text-with-weka/
Tue, 18 Dec 2012

The post Asking Questions of Lots of Text with Weka appeared first on Maryland Institute for Technology in the Humanities.

Adrian Hamins-Puertolas and Adam Elrafei are students in Team POLITIC, an undergraduate research team in the University of Maryland’s Gemstone research-focused honors college, mentored by MITH Faculty Fellow Peter Mallios.

Our undergraduate research team uses newly developed technology to understand and quantify how American audiences received Russian authors in the early 1920s. One of the tools we’re exploring is Weka, a collection of machine-learning algorithms that can be used to mine datasets. MITH has helped us design and construct our database, which contains thousands of articles about Russian authors featured in American literary magazines during the 1920s. Each article in the database is associated with values indicating the frequency of words in its text, so we can trace how often a single word (a unigram) like “revolution” appears throughout our articles, or how often two words appear next to each other (a bigram), such as “Russian revolution”. Both kinds of features give us ways to describe and quantify word frequency and proximity across the dataset.
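As a rough illustration of the kind of counting our database stores (a sketch, not our actual pipeline, which runs through Weka), unigram and bigram frequencies can be computed in a few lines:

```python
from collections import Counter

def ngram_counts(text, n=1):
    """Count n-grams (n=1 for unigrams, n=2 for bigrams) in a
    lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(grams)

article = "the russian revolution changed how the revolution was discussed"
print(ngram_counts(article, 1)["revolution"])         # 2
print(ngram_counts(article, 2)["russian revolution"]) # 1
```

Each article in the database reduces to a vector of such counts, which is what the classifier actually sees.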

MITH’s Travis Brown demonstrated how we could use Weka to train a machine-learning classifier that assigns labels to articles in the dataset that we had never read. To test this, we created a smaller training dataset of just 150 articles, a number small enough that we could actually read every text in full and annotate it by hand, answering questions ranging from “Is a given literary author a subject of debate in this article?” to “Is radical politics an issue in the article?”. Given these annotations, Weka can classify every other article in our dataset with some degree of accuracy.
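For context, Weka typically reads training data in its ARFF format, where each hand-annotated article becomes one row of attribute values followed by its label. A minimal sketch of such a file, with invented attribute names rather than our actual schema, might look like:

```text
@relation politic_training

@attribute freq_revolution        numeric
@attribute freq_russian_revolution numeric
@attribute politics_issue         {yes,no}

@data
12,3,yes
0,0,no
```

By Weka convention the last attribute is treated as the class to predict.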

Weka provided us with a decision tree that correctly classifies answers to the question “Is literary style and artistry an issue in this article?” for approximately 67% of our training set. This success rate should improve as we add new measures for classifying and quantifying the text. One direction is to use MALLET, “an integrated collection of Java code useful for statistical natural language processing [and] document classification”, to create topics: groups of words that MALLET finds to be significantly thematically related. Topic modeling is fascinating because a preliminary examination of generated topics has already surfaced a variety of distinct themes and vocabularies in our dataset, ranging from religion to specific Russian authors. We are now running Weka’s classifiers on the generated topics that include religious language in order to answer another of our questions: “Is religion an issue in the article?”
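As a toy sketch of how a topic can feed a classifier (the topic words below are invented for illustration; real top words would come from MALLET’s output), a document’s overlap with a topic’s vocabulary can serve as a simple numeric feature:

```python
# Hypothetical top words for a "religion" topic; a real list would
# be taken from MALLET's topic-keys output.
religion_topic = {"church", "faith", "orthodox", "icon", "priest", "soul"}

def topic_score(text, topic_words):
    """Fraction of a document's tokens that appear in the topic's
    top-word list; usable as one attribute in a Weka feature vector."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in topic_words)
    return hits / len(tokens)

doc = "the priest spoke of faith and the orthodox church"
print(round(topic_score(doc, religion_topic), 2))  # 0.44
```

A high score suggests the document draws heavily on the topic’s vocabulary, which is the intuition behind classifying on religious topics.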

Our current Weka experiments, using a smaller training set of 46 articles, have already yielded promising results. For example, when using the J48 decision tree algorithm on our textual data filtered into unigrams, Weka correctly classifies 76% of our documents when answering the “Is politics an issue?” question. If we filter our data into both unigrams and bigrams, the correct classification rate decreases to 67%. However, if we filter our data into unigrams and apply a stemmer (which reduces words to their root forms, ignoring prefixes and suffixes), our correct classification rate increases to 77%.
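To show why stemming helps, here is a deliberately crude suffix-stripping stemmer (a toy stand-in for the real stemmers Weka can apply, such as Snowball): it merges inflected forms like “writers” and “writing” into one root, so the classifier sees one feature instead of several sparse ones.

```python
SUFFIXES = ("ingly", "edly", "ing", "ers", "ed", "es", "s", "ly")

def crude_stem(word):
    """Strip the longest matching suffix, keeping a root of at least
    three letters. A toy illustration, not a production stemmer."""
    for suf in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

print([crude_stem(w) for w in ["revolutions", "writers", "writing"]])
# ['revolution', 'writ', 'writ']
```

Collapsing variants this way is what lifted our classification rate from 76% to 77% on the unigram data.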

We look forward to expanding our experiments to an even larger subset of our data as we continue to learn more about natural language processing tools in the coming weeks.


An Undergraduate View of Data Mining with WEKA
https://mith.umd.edu/undergradweka/
Mon, 05 Nov 2012

The post An Undergraduate View of Data Mining with WEKA appeared first on Maryland Institute for Technology in the Humanities.

Manpreet Khural is an undergraduate member of the Gemstone POLITIC undergraduate research team, led by MITH Faculty Fellow Peter Mallios.

As we, Team POLITIC of Gemstone, make progress in using data mining tools such as Weka, it becomes more evident that this technological approach offers a goldmine of new information that would otherwise be impossible to obtain. We are currently training Weka to answer a set of questions in which we are interested. To do so, we first have to provide it with data from which it can learn, which requires manually annotating article documents. It is in doing this that we see the potential of data mining technology.

That potential lies in the absence of human learning biases. To give Weka the most accurate training set, we have written strict guidelines for how we answer the questions. Even with these guidelines, it is apparent that without strenuous personal effort the answers will always carry certain biases. Human opinion is transient, which makes it difficult to apply a scientific approach to the analysis of texts. We build new associations every day, making it impossible to hold our mindset constant and answer these questions without error.

Data mining, on the other hand, has a much more objective learning process. It makes connections solely on the basis of the patterns that the datasets contain. These patterns offer a new kind of insight into texts because they rest on the use of language, what is actually on the page, rather than on the ideas a reader infers from prior personal associations. Even though the training process can be lengthy, the applications for data mining seem endless; without such technology, we would have to read and annotate every text in our dataset by hand. We foresee data mining as a way to gather information on any topic with a sufficient amount of available text. For example, national defense agencies could use it to answer queries about changes in sentiment on whatever topic interests them. We believe data mining will transform many such industries that aim to understand changes in public sentiment.

