Seeing the Forest through the Thees (and Thous)

My initial reaction to Ramsay’s statement is that, for me, nothing quite induces the defamiliarization of textuality like invoking the ostranenie of the Russian formalists. I’d like to see someone explain that passage in UpGoerFive! That said, I found this week’s exercises thought-provoking and exciting. As soon as I began my first attempts to create word clouds with Augustine’s <i>Confessions</i>, I knew there were going to be problems with my particular translation, whose language is extremely antiquated. My initial Wordle showed “Thee”, “Thou”, and “Thy” to be the most common words, because they are not in the tools’ basic stoplists, even though a modern translator would say “you” and “your”. Further examination revealed a large number of other very common words in archaic forms in my text.

Through some trial and error, and using a text editor with advanced grep capability to perform some batch replace procedures on my text file, I managed to generate a more satisfactory result. The Wordle and WordItOut versions seemed quite similar in my case. Even though WordItOut seems to offer somewhat easier manipulation of the final output, I’m posting the Wordle because I agree with others that Wordles tend to look better:


I found this to be a surprisingly good encapsulation of many of the main themes of the <i>Confessions</i>. Putting the resulting words into UpGoerFive produced the following list of words that are used frequently in my text but are not among the words most commonly used in English today:

nor, unto, lord, earth, soul, whom, heaven, itself, therefore, neither, behold, joy, spirit, whence, flesh, holy, certain, unless

Here we can see that the archaic language is still apparent, even after my attempts to modernize the most frequently used archaic words. “Nor”, “unto”, and “whom” should probably be on the stoplist, since the ideas that my old translation expresses with them would likely be expressed with stoplist words in a translation written today. But if we look past those words, the remaining results are reasonably instructive, and I suspect a machine trying to ‘comprehend’ what the <i>Confessions</i> is about would have a reasonably easy time of it.
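A minimal Python sketch of the kind of batch replacement and stoplist extension described above. The word mapping, both stoplists, and the sample sentence are all invented for illustration; they are not the replacements actually run on the text, nor the stoplists that Wordle or WordItOut use.

```python
import re
from collections import Counter

# Hypothetical mapping of archaic forms to modern ones (illustrative only).
MODERNIZE = {
    "thee": "you",
    "thou": "you",
    "thy": "your",
    "thine": "your",
    "hast": "have",
}

# Tiny stand-in for a tool's basic stoplist, plus the archaic additions.
BASIC_STOPLIST = {"the", "and", "of", "to", "a", "in", "is", "i", "my",
                  "you", "your", "have"}
ARCHAIC_STOPLIST = {"nor", "unto", "whom"}


def modernize(text, mapping=MODERNIZE):
    """Replace whole-word archaic forms, keeping an initial capital."""
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, mapping)) + r")\b", re.IGNORECASE
    )

    def swap(match):
        word = match.group(0)
        new = mapping[word.lower()]
        return new.capitalize() if word[0].isupper() else new

    return pattern.sub(swap, text)


def top_words(text, stoplist, n=10):
    """Count the words that are not on the stoplist, most frequent first."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stoplist).most_common(n)


# Invented sample sentence, not a quotation from any translation.
sample = ("Thou hast given Thy servant joy, nor shall my soul "
          "turn unto sorrow before Thee.")
modern = modernize(sample)
print(modern)
print(top_words(modern, BASIC_STOPLIST | ARCHAIC_STOPLIST))
```

The `\b` word boundaries matter here: without them, replacing “thy” would also mangle a word like “worthy”.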

The CLAWS tagger seems quite powerful, though its results didn’t immediately speak to me. I did notice that it seems to have misidentified Augustine’s use of “times” as a preposition. CLAWS would become particularly powerful, it seems to me, if one converted the results list to a spreadsheet that could easily be sorted by part of speech. TAPOR likewise looks like a very powerful toolset. If I’m not mistaken, its concordance generator could accomplish Father Busa’s entire project in a matter of minutes, assuming one had the works of Aquinas available in text files.
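As a rough sketch of both ideas, the Python below turns tagger output into spreadsheet-ready rows sorted by part of speech, and builds a very simple keyword-in-context concordance. It assumes the tagger output has been saved as whitespace-separated `word_TAG` tokens; that layout, the sample tags, and the function names are assumptions made for illustration, and this is not how CLAWS or TAPOR work internally.

```python
import csv
import io


def parse_tagged(tagged_text):
    """Split 'word_TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in tagged_text.split():
        word, sep, tag = token.rpartition("_")
        if sep:  # ignore any token that carries no tag
            pairs.append((word, tag))
    return pairs


def to_sorted_csv(pairs):
    """Return CSV text with rows sorted by tag, then by word."""
    rows = sorted(pairs, key=lambda p: (p[1], p[0].lower()))
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["word", "tag"])
    writer.writerows(rows)
    return buffer.getvalue()


def concordance(text, keyword, width=3):
    """List each occurrence of keyword with `width` words of context."""
    words = text.split()
    lines = []
    for i, w in enumerate(words):
        if w.strip(".,;:!?\"'").lower() == keyword.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            lines.append(f"{left} [{w}] {right}".strip())
    return lines


# Hypothetical tagged sample; the tags are placeholders in a CLAWS-like style.
tagged = "the_AT times_NN2 of_IO my_APPGE youth_NN1"
print(to_sorted_csv(parse_tagged(tagged)))

plain = "the times of my youth and the times of sorrow"
print(concordance(plain, "times", width=2))
```

The CSV text can be saved to a file and opened in any spreadsheet program, where a mis-tagged word such as “times” is easy to spot once the rows are grouped by tag.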

Ultimately, though, coming back to the question of defamiliarization of the text, this week’s exercises proved to me that there is something valuable in breaking our texts down in this way, even if I’m not sure I see where this is all headed just yet. Text mining procedures like these seem to be taking apart the forest and sorting the trees by species, size, age, etc. Surely that would be useful information for a biologist studying the forest, but how we will get from stacks of sorted trees to an understanding of biodiversity still remains unclear to me.


