Small Projects & Limited Datasets

I’ve been thinking a lot lately about the significance of small projects in an increasingly large-scale DH environment. We seem to know almost instinctively the value of “big data”: scale changes the name of the game. Still, what about the smaller universes of projects with minimal budgets, fewer collaborators, and limited scopes, which nonetheless have large ambitions about what can be done with the digital resources we have on hand? Without detracting from the importance of big data projects, I, like Natalie Houston, am wondering what small projects offer the field, and whether their outcomes are relevant and useful both in and of themselves and as a benefit to large-scale projects, for example in fine-tuning initial results.

My project in its current iteration involves a limited dataset of about 4,500 poems and challenges rudimentary assumptions about a particular genre of poetry called ekphrasis: poems regarding the visual arts. It is the capstone project to a dissertation in which I use the methods of social network analysis to explore socially inscribed relationships between visual and verbal media, and in which the results of my analysis are rendered visually to demonstrate the versatility and flexibility available to female poets writing ekphrastic poetry. My MITH project concludes my dissertation by demonstrating that network analysis is one way of disrupting existing paradigms for understanding the social signification of ekphrastic poetry, but that other computational methods, such as text modeling, word frequency analysis, and classification, might also be useful.

To this end, I’ve begun by asking three modest questions about ekphrastic poetry using a machine learning application called MALLET:

Could a computer learn to differentiate between ekphrastic poems by male and female poets? In “Ekphrasis and the Other,” W.J.T. Mitchell argues that were we to read ekphrastic poems by women as opposed to ekphrastic poetry by men, we might find a very different relationship between the active, speaking poetic voice and the passive, silent work of art, a dynamic that informs our primary understanding of how ekphrastic poetry operates. Were this true, and were the difference to show up in recurring topics and language use, a computer might be trained to recognize patterns more likely to occur in poetry by men or by women.
Will topic modeling of ekphrastic texts pick out “stillness” as one of the most common topics in the genre? Much of the definition of ekphrasis revolves around the language of stillness: poetic texts, it has been argued, contemplate the stillness and muteness of the images with which they are engaged. Stillness, metaphorically linked to muteness, breathlessness, and death, provides one of the most powerful rationales for understanding how words and images relate to one another within the ut pictura poesis tradition, usually seen as a hostile encounter between rival forms of representation. The argument to this point has rested largely on critical interpretations enacted through close readings of a limited number of texts. Would a program designed to recognize co-occurrences of words, and to assign those words to a “topic” based on the probability that they occur together, also reveal a similar affiliation between stillness and death, muteness, even femininity?
Would a computer be able to ascertain stylistic and semantic differences between ekphrastic and non-ekphrastic texts and reliably classify them according to whether the subject of the poem is an aesthetic object? We tend to believe that there are no real differences between how we describe the natural world and how we describe visual representations of the natural world. We base this assumption on human, interpretive, close readings of poetic texts; however, a computer might recognize subtle differences as statistically significant when considering hundreds of poems at a time. If a classification program such as MALLET could reliably categorize texts as ekphrastic or non-ekphrastic, it is possible that we have missed something along the way.
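To make the third question a little more concrete, here is a minimal sketch of what a text classifier does with labeled poems. It is not MALLET and not my project's actual setup: it is a small multinomial naive Bayes classifier (one of the classifier types MALLET offers) written in plain Python, with add-one smoothing. The function names, the labels, and above all the training lines are invented placeholders for illustration, not lines from the corpus.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(".,;:!?\"'()") for w in text.lower().split())
    return [w for w in words if w]


def train_naive_bayes(labeled_texts):
    """Count how often each label occurs and how often each word occurs per label."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_texts:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary


def classify(model, text):
    """Return the label with the highest log-probability for the text."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        # Start from the prior: how common this label was in training.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # skip words never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            smoothed = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, labeled_texts):
    """Fraction of held-out texts whose predicted label matches the true label."""
    correct = sum(1 for text, label in labeled_texts if classify(model, text) == label)
    return correct / len(labeled_texts)


# Invented placeholder lines (hypothetical, NOT drawn from the actual dataset).
TRAIN = [
    ("the painted figure stands silent in its gilded frame", "ekphrastic"),
    ("her marble face on the canvas is still and mute", "ekphrastic"),
    ("the river runs loud through the green valley", "non-ekphrastic"),
    ("wind moves the wheat and the birds rise singing", "non-ekphrastic"),
]
HELD_OUT = [
    ("a silent portrait in a frame", "ekphrastic"),
    ("birds over the loud river", "non-ekphrastic"),
]

model = train_naive_bayes(TRAIN)
held_out_accuracy = accuracy(model, HELD_OUT)
print(f"held-out accuracy: {held_out_accuracy:.2f}")
```

A toy example like this proves nothing about poetry, of course. With a real corpus, the interesting number is the accuracy on poems the classifier has never seen, and whether it is meaningfully better than guessing the most common label.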

In general, these are small questions constructed in such a way that there is a reasonable likelihood that we may get useful results. (I purposefully choose the word results instead of answers, because none of these would be answers. Instead, the result of each study is designed to turn critics back to the texts with new questions.) And yet, how do we distinguish between useful results and something else? How do we know if it worked? Lots of money is spent trying to answer this question about big data, but what about these small and mid-sized datasets? Is there a threshold for how much data we need for results to be accurate and trustworthy? Can we actually develop standards for how much data we need in order to ask particular kinds of humanities questions and make relevant discoveries? In part, my project also addresses these questions, because otherwise I can’t make convincing arguments about the humanities questions I’m asking.

Small projects (even mid-sized projects with mid-sized datasets) offer the promise of richly encoded data that can be tested, reorganized, and applied flexibly to a variety of contexts without potentially becoming the entirety of a project director’s career. The space between close, highly supervised readings and distant, unsupervised analysis remains wide open as a field of study, and its potential value as a manageable, not wholly consuming, and reproducible option makes it worth seriously considering. What exactly can be accomplished by small and mid-scale projects is largely unknown, but it may well be that small and mid-sized projects are where many scholars will find the most satisfying and useful results.

Lisa Rhody is a Ph.D. candidate in English at the University of Maryland, a Spring 2012 MITH Winnemore Dissertation Fellow, and a lecturer on the arts for the Virginia Museum of Fine Arts. This post first appeared on Lisa’s personal blog on March 31, 2012.