A Wordle of our collective DH definitions:
Craziest
Watch this. And think about algorithmic criticism.
Wordly wobblings
According to the Google Ngram Viewer, we did not like “idea” very much in the first half of the eighteenth century. Our feelings about “truth” have varied substantially; we liked it quite a lot during the mid-nineteenth century, but around 1910 we started preferring “idea,” and that preference has stayed fairly consistent since then.
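(For anyone who wants to reproduce the comparison, the Ngram Viewer query can be assembled as a URL. The parameter names below are simply the ones that show up in the viewer’s address bar, so treat them as an assumption rather than a documented API; this is a minimal sketch.)

```python
from urllib.parse import urlencode

# Parameter names copied from the Ngram Viewer's address bar;
# treat them as an assumption, not a documented API.
params = {
    "content": "idea,truth",  # comma-separated terms to compare
    "year_start": 1700,
    "year_end": 2000,
    "smoothing": 3,
}
url = "https://books.google.com/ngrams/graph?" + urlencode(params)
print(url)  # open in a browser to reproduce the comparison
```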
My Up-Goer Five definition of DH goes like this:
It is about doing old things in new ways. Or, if you ask another person, it is about doing new things in even newer ways. People who do it don’t agree on what things are most important or how to study them. Human life changed when books did away with forms of writing that came before them. Computer forms of stuff that used to be only on paper might be doing the same thing now. Computers can make stories look different, but does that mean that they ARE different at the bottom? Or is it only the way that we look at them? If we use computers to read books, we can study different ideas about them. The question is whether those kinds of ideas leave out the kind that came before. The question is also whether the old kinds of study leave out ideas that one can only reach by using new ways. Perhaps the best way to put the question is: How do we decide whether the old or new way is best for something we want to learn (or, better yet, how we can put the two together)?
While the original XKCD comic is funny, I think this concept can only work well when humor, not communication, is the point. It could be helpful if someone is taking him/herself too seriously and wants to re-evaluate a statement in search of excessive jargon, but it does not seem useful for describing something to someone who does not already know what you are talking about. Without the words “digital,” “humanities,” “electronic,” or “interpret” I wasn’t able to make a definition that could let somebody who had never heard of DH know what I was describing.
So, on to Wordle. I used the Gutenberg text of King Lear (minus the fine print and introductory “comments”) and this was what I got:
http://www.wordle.net/show/wrdl/6368114/Gutenberg_King_Lear_Wordle_for_ENGL_668K
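(Incidentally, the “minus the fine print” step can be scripted rather than done by hand. Here is a minimal sketch, assuming the plain-text file uses the usual “*** START OF” / “*** END OF” marker lines that Project Gutenberg adds — older editions word these slightly differently — and a hypothetical filename.)

```python
# Minimal sketch: drop the Project Gutenberg header and license text.
# Assumes the plain-text file contains the usual "*** START OF" and
# "*** END OF" marker lines; older editions word these slightly differently.
def strip_gutenberg(raw: str) -> str:
    lines = raw.splitlines()
    start = next((i + 1 for i, line in enumerate(lines)
                  if line.startswith("*** START OF")), 0)
    end = next((i for i, line in enumerate(lines)
                if line.startswith("*** END OF")), len(lines))
    return "\n".join(lines[start:end])

with open("king_lear.txt", encoding="utf-8") as f:  # hypothetical filename
    text = strip_gutenberg(f.read())
```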
Word it Out gave me this:
Obviously, the speech prefixes dominate these clouds; Lear and Kent are the most prominent in both clouds.
Running the Word it Out list through the Up-Goer Five produced these words:
tell one night Sister say make see great done further now man hath long life late Daughter good Daughters Enter name mans answer away yet part better Father fit eyes nothing cold else old some Horse Gods time home go hand least way take Letter heard here much against still know Sir rather heart both all though found more come art Let most well like little many place follow age gone made other comes hold death none mad call within Brother full power hast head Sisters makes Lady after two set being put came do’s thing What’s toward Boy where’s best world thought men reason stand word Oh before any dead first bring house Friend blood matter true since told dost draw fire doth Fathers course things cause strange sight stands
One thing that surprised me is that “Lady” could stay but “gentleman” had to go. Someone who was not aware of the context could probably gather that family relationships are a major theme of the work represented, but could probably not go much further than that.
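(For the curious, the check Up-Goer Five performs is easy to approximate: keep only words that appear on its list of the ten hundred most common English words. A minimal sketch, assuming a hypothetical file holding that list, one word per line, and a small sample of the cloud words:)

```python
# Approximation of the Up-Goer Five check: flag every word that is not
# on the list of the ten hundred most common English words.
# "ten_hundred.txt" is a hypothetical file holding that list.
with open("ten_hundred.txt", encoding="utf-8") as f:
    allowed = {line.strip().lower() for line in f if line.strip()}

cloud_words = ["tell", "night", "sister", "daughter", "hath", "sir"]  # sample input
rejected = [w for w in cloud_words if w.lower() not in allowed]
print("not allowed:", rejected)
```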
The CLAWS tagger produced this:
-----_PUN place_NN1 hast_VHB turne_NN1 feare_NN1 Storme_NN1 Master_NN1 since_CJS i'th_NN1 th_NN0 Edgar_NP0 halfe_NN1 Edg_NN1 businesse_NN1 else_AV0 Enter_VVB leaue_NN1 Slaue_NN1 done_VDN thing_NN1 stand_NN1 heare_NN1 Ha_ITJ Regan_NP0 Cornwall_NP0 speake_NN1 Lady_NN1 comes_VVZ world_NN1 Madam_NN1 head_NN1 some_DT0 still_AJ0 Sword_NN1 Sir_NN1 againe_VVB thy_DPS farre_NN1 liue_NN1 till_PRP any_DT0 Cordelia_NN1 most_AV0 set_VVN Knaue_NP0 told_VVD forth_AV0 fire_VVB Brother_NP0 Daughters_NP0 Ile_NP0 meanes_NN2 gaue_VVB none_PNI being_VBG fit_AJ0 know_VVB within_PRP do'st_NN1 Douer_NN1 Cor_ITJ call_NN1 nor_CJC Bast_VVB other_AJ0 Gentleman_NN1 Foole_NN1 backe_NN1 men_NN2 things_NN2 Noble_AJ0 neuer_NN1 Trumpet_NN1 pray_VVB seene_NN1 Alacke_VVB hither_AV0 goe_VVB now_AV0 Glou_NP0 more_AV0 bring_VVB vp_NN0 true_AJ0 though_CJS much_AV0 two_CRD Villaine_NP0 euer_NN1 heard_VVD fellow_NN1 gone_VVN Edmund_NP0 Scena_NP0 Fortunes_NN2 hold_VVB put_VVB where_AVQ 's_VBZ whom_PNQ take_VVB himselfe_NN1 do_VDB 's_POS Corn_NN1 ere_PRP sleepe_NN1 euery_NN1 better_AJC King_NN1 say_VVB Stew_NN1 deere_NN1 first_ORD bin_NN1 Fathers_NN2 finde_NN1 Duke_NP0 Gent_NP0 Gloster_NP0 cause_NN1 Knights_NN2 good_AJ0 name_NN1 Oh_ITJ T_PNP is_VBZ returne_NN1 Sonne_UNC Horse_NN1 away_AV0 France_NP0 Exit_NN1 Bastard_NN1 looke_NN1 make_VVB after_PRP o'th_NN1 Prythee_NN1 wits_NN2 makes_VVZ Reg_NP0 word_NN1 little_AV0 vs_PRP Steward_NN1 like_PRP age_NN1 Nature_NN1 thine_DPS cold_NN1 follow_VVB shalt_VM0 against_PRP stands_NN2 What_DTQ 's_VBZ rather_AV0 way_AV0 seeke_VVB further_AV0 came_VVD Father_NN1 haue_VHB answer_NN1 knowne_NN1 long_AV0 home_AV0 many_DT0 loue_VVB Sisters_NN2 life_NN1 Gods_NN2 late_AV0 thee_PNP made_VVD Fortune_NN1 Alb_NP0 eyes_VVZ nothing_PNI farewell_NN1 Edmond_NP0 feele_NN1 purpose_NN1 Tom_NP0 old_AJ0 Friend_NN1 see_VVB found_VVN least_DT0 power_NN1 dead_AJ0 Traitor_NN1 well_AV0 Let_VVB vse_NN1 toward_PRP blood_NN1 euen_NN1 Lear_NP0 draw_VVB Lord_NN1 reason_NN1 mad_AJ0 strange_AJ0 heart_NN1 here_AV0 Letter_NN1 yet_AV0 Albany_NP0 Gon_NP0 Gonerill_NP0 man_NN1 part_NN1 one_CRD great_AJ0 Glo_NP0 dost_VDB heere_AJ0 giue_NN1 downe_NN1 doth_VDZ poore_NN1 lesse_NN1 come_VVB hand_NN1 Kent_NP0 Grace_NP0 art_NN1 helpe_NN1 go_VVB matter_NN1 foule_NN1 course_NN1 thou_PNP strike_VVB Boy_NN1 vpon_NN1 whose_DTQ thinke_NN1 thought_NN1 beare_NN1 peace_NN1 hath_VHZ Exeunt_UNC death_NN1 full_AJ0 Sister_NN1 owne_NN1 house_NN1 selfe_NN1 night_NN1 best_AJS Fiend_NN1 keepe_NN1 both_AV0 tell_VVB Ste_NN1 mans_NN2 sight_VVB Glouster_NN1 all_DT0 hence_AV0 before_PRP Daughter_NN1 time_NN1 ..._SENT **42;7;TOOLONG_UNC
I’m sorry; I can’t give a useful analysis of this. The site is the opposite of the word cloud generators in that it is not even a little bit user-friendly. The key to the tags is not straightforwardly organized. I tried to find what “NPO” (or possibly “NP0”) might mean, but it was not in the list. Perhaps this would make more sense to me if I knew something about coding.
Pushing onward into the land of things I don’t understand, I approached TAPoR and HyperPo. Using this site was extremely frustrating because, once I uploaded the text (I couldn’t copy and paste, so the Gutenberg “comments” came along for the ride), the resulting window did not include labeled buttons. I got the following analyzing the word “daughter”:
If I’m using it right, this tool indicates that the word “daughter” occurs most often in Act 1, Scene 2 — the scene in which Lear divides his kingdom. This scene coincides with the highest number of mentions of “Cordelia” but not of “Gonerill” or “Regan.” I think this set of tools has the most potential usefulness, but I had trouble understanding how to make them useful. I tried some of the “help,” “tutorial,” and “tour” features, but I kept running into “page not found” and “router error” messages; I don’t know if I was doing something wrong or if the site just wasn’t working very well.
Ramsay was right: these tools make the text of King Lear look completely unfamiliar. As I flailed about through these mysterious new waters, I found that the mere strangeness of what I was seeing was almost overwhelming. I can see that I might eventually be able to put these tools to productive use, but first I need to become more comfortable navigating digital environments.
Seeing the Forest through the Thees (and Thous)
My initial reaction to Ramsay’s statement is that for me nothing quite induces the defamiliarization of textuality like invoking the ostranenie of the Russian formalists. I’d like to see someone explain that passage in Up-Goer Five! That having been said, I found this week’s exercises quite thought-provoking and exciting. As soon as I began my first attempts to create word clouds with Augustine’s Confessions, I knew there were going to be problems with my particular translation, the language of which is extremely antiquated. Because of the language, my initial Wordle showed “Thee”, “Thou”, and “Thy” to be the most common words (because they are not in the tools’ basic stoplists, of course, even though a modern translator would say “you” and “your”). Further examination revealed a large number of other very common words in archaic forms in my text.
Through some trial and error, and using a text editor with advanced Grep capability to perform some batch replace procedures on my text file, I managed to generate a more satisfactory result. The Wordle and WordItOut versions seemed quite similar in my case. And even though WordItOut seems to offer somewhat easier manipulation of the final output, I’m posting the Wordle because I agree with others that they tend to look better:
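(For anyone curious, that batch modernization is simple to sketch in Python as well as in an editor’s Grep. The replacement table here is a partial illustration only, not the full list the text needed.)

```python
import re

# Illustrative, partial table of archaic-to-modern replacements;
# the real pass needed many more entries than this.
replacements = {
    r"\bthee\b": "you",
    r"\bthou\b": "you",
    r"\bthy\b": "your",
    r"\bthine\b": "yours",
    r"\bhath\b": "has",
    r"\bdost\b": "do",
}

def modernize(text: str) -> str:
    for pattern, modern in replacements.items():
        text = re.sub(pattern, modern, text, flags=re.IGNORECASE)
    return text
```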
I found this to be a surprisingly good encapsulation of many of the main themes of the Confessions. Putting the resulting words into Up-Goer Five produced the following list of words that appear frequently in my text but are not among the most commonly used in English today:
nor, unto, lord, earth, soul, whom, heaven, itself, therefore, neither, behold, joy, spirit, whence, flesh, holy, certain, unless
Here we can see that the archaic language is still apparent, even after my attempts to modernize the most frequently used archaic words. “Nor”, “unto”, and “whom” should probably be on the stoplist, since the ideas my old translation expresses with them would likely be expressed with stoplist words in a translation written today. But if we look past those words, the remaining results are reasonably instructive, and a machine trying to ‘comprehend’ what the Confessions are about would have a fairly easy time of it, I suspect.
The CLAWS tagger seems quite powerful, though its results didn’t immediately speak to me. I did notice that it seems to have mis-identified Augustine’s use of “times” as a preposition. CLAWS becomes particularly powerful, it would seem to me, if one converts the results list to a spreadsheet that can be easily sorted by part of speech. TAPoR likewise looks like a very powerful toolset — if I’m not mistaken, its concordance generator could accomplish Father Busa’s entire project in a matter of minutes — assuming one had the works of Aquinas available in text files.
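(A minimal sketch of that spreadsheet conversion, assuming the CLAWS output has been saved as a plain-text file of word_TAG pairs in the horizontal format the tagger returns, and using a hypothetical filename:)

```python
import csv

# Sketch: turn CLAWS horizontal output ("word_TAG word_TAG ...") into a
# two-column CSV that any spreadsheet can sort or filter by tag.
with open("claws_output.txt", encoding="utf-8") as f:  # hypothetical filename
    tokens = f.read().split()

with open("claws_by_pos.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["word", "tag"])
    for token in tokens:
        word, _, tag = token.rpartition("_")
        writer.writerow([word, tag])
```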
Ultimately, though, coming back to the question of defamiliarization of the text, this week’s exercises proved to me that there is something valuable in breaking our texts down in this way — even if I’m not sure I see where this is all headed just yet. Text mining procedures like these seem to be taking apart the forest and sorting the trees by species, size, age, etc. Surely that would be useful information for a biologist studying the forest, but how we get from stacks of sorted trees to understanding biodiversity remains unclear to me.
No Clever Title
I used the text of John Henry Newman’s The Idea of a University that I found on the Project Gutenberg website last week to produce two word clouds on Wordle and WordItOut.
There were two differences that jumped out right away when I compared the two: WordItOut seemed to do a better job of weeding out stopwords (“may”), and Wordle accepted without question what I’m pretty sure are character-encoding errors (the pseudo-words beginning with ‘Ä’).
I had pretty much the same experience as everyone else did when I pasted the words from the WordItOut word cloud into the Up-Goer Five Text Editor: it rejected 26 of the words (although it wasn’t concerned in the least that the words, in that order, did not constitute a syntactically valid English sentence).
I then pasted the same words from the WordItOut word cloud into the CLAWS Part-of-Speech tagger. For some reason, the text pasted without spaces between the words, and I had to enter the spaces manually. I noticed that the word list had a similar effect to the “entropic poem” on page 37 of Ramsay’s Reading Machines, which surprised me, since I had assumed that that effect would only be perceptible in a short text.
I get the point of tools like this. There’s a similar one called William Whitaker’s Words that’s very popular among students learning Latin, although the fact that CLAWS accepts bulk input (unlike Whitaker’s Words) is an improvement on the model. And there are useful things, I suppose, to be learned about a text from such tools (e.g., to confirm or deny the claim that John Calvin never used adverbs in writing). I didn’t, however, find the output of CLAWS particularly edifying in this case:
I attempted to hand off the URL for the plain text on the Project Gutenberg site directly to TAPoR using “Your Web Page”, but what I got was an HTTP 403 Forbidden error, so I played with Chapter 1 of Moby Dick instead. My sense was that HyperPo needs a body of text longer than a single chapter in order to be genuinely useful rather than a curiosity.
I don’t feel qualified to comment on whether the use of these tools produces an effect of estrangement and defamiliarization of textuality in general — not being a literature student, I’m not used to relating to textuality in the abstract, as opposed to a particular text or texts. My impression is that tools of this kind will do much more for you if you already know something about the text you are examining in this way, and I certainly got a lot more out of the examination of Gratian’s Decretum than of Newman’s Idea of a University.
Art and Science as Complementary Opposites
I was very drawn to the argument Ramsay puts forth in Reading Machines. This might be because out of all of the readings thus far (okay, only two weeks’ worth of reading, but last week had a good amount of material . . .), Ramsay most willingly acknowledges the divide between humanistic inquiry and computational method. Indeed, as Ramsay argues, while each contains a kernel of the other, algorithmic criticism seeks definitive answers, while literary criticism seeks unanswerable questions.
In this blog post I will try to focus only on “Preconditions” and the first chapter, “An Algorithmic Criticism,” of Ramsay’s book, perhaps setting constraints of my own. I do this to save the rest of my thoughts for class on Wednesday, and I will use this post as a jumping-off point for discussion.
It is difficult to explain why the pairing of two opposing modes of inquiry fascinates me. This discussion reminds me of the interests of early science fiction writers, who, influenced by the Romantic period, used the very methods of rationalism and science as a form of critique. Ramsay states almost exactly this in his discussion of art and science:
“Art has very often sought either to parody science or to diminish its claims to truth.”
With this ever-present tension, how could we possibly use text analysis to aid literary criticism in a way that does not remove the basic tenets of humanistic inquiry? Ramsay has a few answers to this. Computer-based tools represent a limitation that allows us to reorganize and understand a text in new ways. While text analysis can only concern itself with verifiable facts, the user is left to decide what to do with these “facts.”
In other words, computer-based tools like text analysis often act as a form of provocation, a starting point for us to delve deeper into an issue. I certainly encountered this in my own limited/crude experiment with Woodchipper, a topic modeling tool. The fear that comes with using many of these tools—and here I might break my own constraint and reach into the other chapters—is that they can only tell us what we already know. This might be a problem with methodology, as Ramsay points out. The more worthwhile experiments are the ones that tell you things that suggest the opposite of what you believe. Certainly as computer-based tools grow more complex and sophisticated, they will be able to give us answers to questions we previously believed only humans could address. But Ramsay is more interested in discourse than in methodology:
“. . . we can refocus the hermeneutical problem away from the nature and limits of computation (which is mostly a matter of methodology) and move it toward consideration of the nature of the discourse in which text analysis bids participation.”
Another issue which Ramsay may or may not address is that while you can produce results using text analysis (and other tools) without having read the text in question, you may not be able to interpret those results. This is certainly true for Ramsay’s experiment with The Waves. As Ramsay points out after running an equation regarding the speakers in the novel,
“Few readers of The Waves would fail to see some emergence of pattern in this list.”
But what if you haven’t read The Waves? It is a short book, and one you would certainly be expected to have read if you decided to publish anything, including an experiment with text analysis, on the novel. But this issue becomes a problem when we consider “distant reading,” which purports not to require any general or specific knowledge of the text. In fact, distant reading discourages it.
But if you cannot interpret the results unless you have read the book in question, how are we supposed to approach the topic of “How to Read a Million Books”? Even when we consider a hundred or a thousand books at once (or millions, as described in the TED talk video), it might be helpful to know at least a few things about each one, like the fact that The Waves features six speakers.
Here is where methodology asserts its importance once again. Only when a computer-based tool becomes sophisticated enough to allow for interpretive analysis without engaging with the text directly can these tools usurp the primacy of the reader. Perhaps we have reached this stage already, but I cannot help but cling to the importance of close reading, even as we compare a work to hundreds, thousands, or even millions of others.
War of the Wordles
Unfortunately, I lost my first Wordle of War of the Worlds, which had a beautiful custom palette and Martian-like font, and now I’m really mad that I couldn’t find a search function on the Wordle site’s public gallery. Boo. So here’s a second one.
And, the much uglier WordItOut!
Interestingly, many configurations of the Wordle sketch out a bare-bones premise for the book with the most prominent words: “Martians Came”. Both “Mars” and “Earth” are very small, and don’t even appear in the WordItOut version. There are few proper nouns, no character names, but places like “London” and “Woking” show up. “Black” and “red” are also prominent, as are sensory words like “heard”, “see”, “saw.” “Seemed” is much bigger than “know,” giving a feel for the uncertainty that haunts much of the action of the book. WordItOut, on the other hand, picked up many more common “filler” words like “said,” “about,” “through,” “over.” It was also much less fun to play with. Much of the appeal of the Wordle for me was arranging the layout so as to maximize the “sense” I could make out of it visually: how many of the basic “plot” or action words could I manage to juxtapose and highlight with color, straight or curved lines, font “appropriate” to the subject matter? As Ramsay suggests, this is perhaps the greatest potential of text-analysis tools–the ability to operate at a new scale and to manipulate the text on different levels than “close reading” allows.
Not surprisingly, very few of my Wordle words were allowed in the Up-Goer Five Text Editor. While experimenting with Up-Goer Five, I was trying to figure out the best approach–do I hand-pick words from the list of ten hundred, or do I build my definition by attempting to write it first, and then “translate” it? I wove back and forth between these approaches, picking some words and then trying out other phrases that were inspired by them. Ultimately I was disappointed, and I must say my definition of DH was more flippant than informative: “Many conversations about building, making, thinking, doing; money, jobs. Using computers to study humans and read/write ‘algorithmically.’” Without punctuation it’s as long as a tweet.
When I input the Wordle text into the CLAWS Part-of-Speech tagger, it interestingly read many of the verbs as gerunds, tagging them as adjectives. I would really like to know what others think the best application of a tool like this would be. I immediately thought it could be used as a translation aid from one corpus to another, but this doesn’t seem to be a feature.
TAPoR was honestly the tool that got me most excited and seemed most applicable to my research on women’s alternative/independent publishing. It was easy to “mess around” in–I’ve never done any text analysis before but at the most basic level I knew what a stop-word list was, and could figure out how to get the tool to “spit out” what I wanted to see. The descriptions that appear when you hover over a tool were immensely helpful, and I found myself wishing every DH project or toolbox had this feature. Interested by the appearance of place names like London and Woking, I graphed these on the concordance tool to see the protagonist’s (and the Martians’) geographical movements through the novel. I also graphed “Martians” and “People,” the occurrences of which mirrored each other for most of the novel before “People” drops off sharply toward the end, when the protagonist is moving through deserted houses and communities. This exercise really tested my knowledge of the “plot points” in the book–I found myself remembering details that seemed insignificant, all by looking at a graph of the words. I’m just itching to digitize some zines, scrape their text, and compare all the instances of “queer,” “feminist,” and “anti-racist” I can find.
I also couldn’t help but smile at the title of these tools: “Voyant: See through Your Texts.” The entendre is irresistible–use “your texts” (whatever they may be) as a pane or a lens through which to view a specific topic, and/or make your texts transparent, lucid; make bare their meanings. Of course, the implication of Ramsay’s argument is that none of these tools, or the texts to which we apply them, are “transparent.” We might be able to “see” our text differently, from new angles and at previously hidden layers, but it is dangerous to assume that nothing resists the self-evidence of scholarly vision. My partner, who was watching me do these experiments and also helping me with the necessary plugins to run them, kept lingering on these sites to figure out what kinds of algorithms they use and what kinds of patterns they’re finding. I’m not sure most users think about the tools on those levels [DH-ers and hackers are, as usual, another story], and it would be easy to tout their potential while forgetting that our interpretations, the most valued currency in some humanities disciplines, are just begging to be made.
Loved by the King?
I’ve seen Wordles used before in school projects, but usually for display purposes rather than as an analytical tool. Therefore, I was excited to see the application given a new purpose that teachers could easily use in school with a variety of texts.
Word Clouds!
When I imported Project Gutenberg’s text of the first volume of Le Morte D’Arthur into Wordle and Word it Out, these were my results (Sadly, I discovered that the “Loved by the King” font in Wordle was not very, well, kingly, so I switched it to a more appropriate font):
It’s not surprising that the most prominent word in both is “Sir”, as most of the characters go by that epithet, nor that “king” and “knight” are also frequently used, emphasizing the courtly genre of the text. “CHAPTER” is probably so prominent because the table of contents came along in my copy and paste, on top of all the chapter headings themselves. I was surprised that Tristram beats out Arthur (in a book titled after him!). I also found it interesting that words such as “smote”, “battle”, and “slain” are much more prominent than “God” and “worship”, hinting that the divine justification for most of the fighting was not as much of an excuse as it purported to be.
Paraphrasing with Up-Goer Five
Like many of my classmates, I found, when I put the top 100 words into Up-Goer Five, that about half the words were not permitted, primarily in the proper name, antiquated term, and knightly terminology categories. I would doubt the ability of someone to use the Up-Goer Five to summarize books like this with difficult language if I hadn’t seen its application to Hamlet’s “To Be or Not To Be” speech. (I actually recommended this application to my former co-workers, many of whom require their students to paraphrase the famous soliloquies in Shakespeare’s plays on their tests.)
And I thought I was free from dealing with parts of speech…
I was impressed by the CLAWS Part of Speech Tagger’s ability to correctly identify even antiquated pronouns such as “ye” and “thee”, but other than that, I found it difficult to see how these kinds of results could be useful in an analysis of the text. Maybe if further calculations were applied (frequencies of parts of speech?), I could have seen the kinds of patterns that turn into the narratives, or at least the questions, that Ramsay suggests.
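(A rough version of that calculation is easy to sketch: collapse the CLAWS tags into coarse categories and count them. The prefix-to-category grouping below is my own reading of the tagset, so treat it as an assumption rather than an official mapping.)

```python
from collections import Counter

# Rough part-of-speech frequencies from CLAWS "word_TAG" output.
# The prefix-to-category grouping is my own reading of the tagset,
# so treat it as an assumption.
def pos_counts(tagged_text: str) -> Counter:
    counts = Counter()
    for token in tagged_text.split():
        _, _, tag = token.rpartition("_")
        if tag.startswith("N"):
            counts["nouns"] += 1
        elif tag.startswith("V"):
            counts["verbs"] += 1
        elif tag.startswith("AJ"):
            counts["adjectives"] += 1
        elif tag.startswith("AV"):
            counts["adverbs"] += 1
        else:
            counts["other"] += 1
    return counts
```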
Making some conclusions with TAPoR
When I first plugged the text of Le Morte D’Arthur into TAPoR, the frequency count and “Cirrus” were both dominated by articles and other “unimportant” words, but when I asked the program to remove them, it generated a list almost identical to that of Wordle and Word It Out! The Word Trends graphs, though, got interesting when I decided to click on those prominent names.
Leaving the “Segments” setting at 10 to roughly mimic the 9 books in Vol. 1, I discovered that Arthur most frequently appears at the beginning of the book (which makes sense, given that it is devoted to the story of how he came to power), and then is practically forgotten about. Likewise, Tristram dominates the last part of the book, even more so than Arthur. This makes sense because book 8 is all about Tristram’s adventures. Similarly, Launcelot spikes in the middle of the graph, as book 6 is all about his deeds. The juxtaposed graph shows clearly how Malory attempted to integrate the various legends about the knights, which had come from different sources, choosing to do so in an episodic fashion focused on one character at a time rather than jumping back and forth between multiple storylines, as is more typical of contemporary literature.
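(The Word Trends view is doing something simple enough to sketch by hand: split the text into a fixed number of equal segments and count a name in each one. A minimal version, with the text, word, and segment count as parameters:)

```python
# Minimal sketch of a "word trends" count: split the text into equal
# segments and count how often a word appears in each one.
def word_trend(text: str, word: str, segments: int = 10) -> list:
    tokens = text.lower().split()
    size = max(1, -(-len(tokens) // segments))  # ceiling division for equal-ish chunks
    chunks = [tokens[i:i + size] for i in range(0, len(tokens), size)]
    return [chunk.count(word.lower()) for chunk in chunks]

# e.g. word_trend(morte_darthur_text, "tristram") -> one count per tenth of the book
```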
So what is it like to read this?
I think that these activities did give me a sense of what Ramsay refers to as “ostranenie–the estrangement and defamiliarization of textuality” (3). However, I’m skeptical as to how far we can take algorithmic analysis when the potential for grasping at straws exists. As Ramsay mentions later on,
If something is known from a word-frequency list or a data visualization, it is undoubtedly a function of our desire to make sense of what has been presented. We fill in gaps, make connections backward and forward, explain inconsistencies, resolve contradictions, and, above all, generate additional narratives in the form of declarative realizations (62).
How much of this meaning is because we want to see meaning there? And how much is built on prior assumptions? For example, am I reading too much into the Word Trend charts of Malory because I know that his project was one of compilation, rather than invention? I think this gets even trickier when you analyze results of an algorithm that you have designed–your own biases and/or assumptions are built into the project from the start. Hopefully we’ll talk more in class about when these types of practices are productive and when they produce results that just mirror what we already think.
(And if you’re interested in seeing the outcome of Unicorns vs. Zombies according to Google N-Gram, check out my blog post!)
The Prejudice of Stripped Texts
To start this week’s exercise, I decided to have a little fun. Kind of like stretching before a big work out. Using Google’s Ngram Viewer, I compared the heroine of my chosen text, Pride and Prejudice’s Elizabeth Bennet, to her modern-day counterpart, Bridget Jones, with whose diary we are intimately acquainted. Because Helen Fielding has openly admitted to basing her characters on Jane Austen’s—especially Mark Darcy on Mr. Darcy—I thought it would be interesting to see how else they compare. I was surprised to see how Miss Bennet’s popularity waned for so many years and then, at the turn of the century, increased and hasn’t stopped since. Additionally, I was surprised to see that Bridget Jones’ popularity peaked higher than Elizabeth’s ever did.
Then onto the hard part of the work out—creating a definition for digital humanities. And not just any definition, one with strict boundaries. My humble result below.
Wordle vs. WordItOut
While I generally consider myself a hands-on learner and quick on the uptake when it comes to basic computer programs and technologies, I found this week’s exercise to be more than a little frustrating. Wordle would not allow me to insert the Project Gutenberg (or any other) link to get my word output, which resulted in me copying and pasting the book in its entirety into the “Paste in a bunch of text” box. Oh, I pasted in a bunch of text alright! Finally, I got this beauty:
Then it was time for WordItOut, which was a much quicker task after figuring out Wordle’s quirks.
I actually took the time to try to make the two look as similar as possible in coloring for easier comparison. I think Wordle has WordItOut beat in basic aesthetics, but otherwise the results were nearly identical. I was very surprised to see that “Mr.” was the word most used throughout Pride and Prejudice. Despite the novel being nineteenth-century chick-lit by a female author, it is clear that it was still a man’s world at the time of writing and publication. However, the word “Elizabeth” does run a close second, which is a bit refreshing.
Up-Goer Five Text Editor
Next up, the commonality of words. It appears things haven’t changed much in the 200 years since Miss Austen put pen to paper. In fact, other than proper names, only four words she used were not in the top 1000 words of Up-Goer Five: indeed, pleasure, till, and manner. However, this made me curious about what the results would be if basic words like “came,” “made,” “most,” and “go” were excluded from the analysis. I was surprised at “pleasure” being so widely used; it’s not a word I hear often, and it seems the connotation has changed over the years.
CLAWS
CLAWS was my least favorite of all the sites. To me, it did not lay out the results in a clear, easy-to-read manner. It was also counterintuitive that the key wasn’t listed on the same page as the results, so that you had to toggle back and forth between pages. Additionally, this seems more like it would be useful for grade school children learning grammar than it would be for any other purpose.
TAPoR
When it came to TAPoR, I wasn’t nearly as interested in the HyperPo abilities as I was in the program’s ability to run lists of words and compile how many times each word occurs in the text. The word “Elizabeth,” which appeared to be a close second to “Mr.” in the Wordle, is actually used about 200 fewer times than “Mr.” Furthermore, I was particularly interested in the listing ability for two reasons. First, Stephen Ramsay writes extensively on the tf-idf formula and how its findings affect critics when looking for patterns in a text, which I found intriguing. Second, in Italo Calvino’s If on a winter’s night a traveler, a character tries to categorize and determine the genre of books based solely on the words that recur and appear the most in a given work. It’s an interesting thought, trying to decide what a book is about without having read it for its sentences, but for the words it features.
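(For anyone who hasn’t met it, tf-idf weighs a term more heavily the more often it appears in one document and the fewer documents it appears in across the whole collection. Here is a minimal sketch of one common variant, not necessarily the exact weighting Ramsay discusses:)

```python
import math

# One common tf-idf variant (not necessarily the exact weighting Ramsay
# describes): a term's count in one document, scaled down by how many
# documents in the whole collection contain that term.
def tf_idf(term: str, doc: list, corpus: list) -> float:
    tf = doc.count(term)                            # raw frequency in this document
    doc_freq = sum(1 for d in corpus if term in d)  # documents containing the term
    idf = math.log(len(corpus) / (1 + doc_freq))    # rarer across the corpus = heavier
    return tf * idf
```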
While all of these sites were fun to play with and produced interesting results, I think they ultimately take away from the true meaning of what a book is hoping to convey. Making a book a thing of quantitative results removes the reader’s ability to interpret the text for himself and to engage with the nuances the author has created with grammar, punctuation, and voice. The only work that comes to mind that would benefit from these results is Gertrude Stein’s “Portraits and Repetition,” where her goal is to use the same words as many times and in as many ways as possible. As Ramsay himself writes:
“It is one thing to notice patterns of vocabulary, variation in line length, or images of darkness and light; it is another thing to employ a machine that can unerringly discover every instance of such features across a massive corpus of literary texts and then present those features in a visual format entirely foreign to the original organization in which these features appear” (Ramsay 16).
I couldn’t agree more. Just as Project Gutenberg states that anything may be done with a public domain text, which may result in the text being changed in ways that dissolve its power and purpose, stripping it to just its words changes it too.
From Hell’s Heart I Graph At Thee!
The idea of quantifying Moby-Dick is simultaneously exciting and perhaps not altogether surprising, given some of the returns from the tools we were instructed to use. The novel is packed with Shakespearean language, is about a very specialized topic (whaling), and is formally very odd in places. But that, of course, just means Moby-Dick is an ideal text for these sorts of experiments, right? Let’s see…
First, I ran Moby-Dick through Wordle, resulting in this diagram:
Secondly, WordItOut:
The most obvious difference between the two is the choice of largest word. In Wordle’s image, ‘whale’ and ‘one’ are, unsurprisingly, the largest words. WordItOut, however, displays ‘all’ as its largest word, with ‘whale’ and ‘one’ the runners-up. The word ‘all’ is not represented in Wordle’s image at all, meaning that program casts it aside as an all-too-common word, too frequent to be of any use. Now, I do see some logic in this decision; ‘all’ is a common word, and it can sometimes be used as a needless intensifier or a purely quantitative word. In this case, however, I contest Wordle’s decision: in Ahab’s final monologue he explicitly describes Moby Dick as “all-destroying” as he speeds, harpoon in hand, towards the beast that is destroying his ship. The ‘all’ in this case is not just a simple word; it’s certainly an intensifier, but it also represents Ahab’s life (the whaling trade) and Ahab himself (his soul scarred, his body maimed). It is possible to read this word with more than the mere commonality ascribed to it by Wordle’s software.
Secondly, the major characters of the novel are mentioned: Queequeg, Stubb, Starbuck, and Ahab; but some are missing. Ishmael is gone despite being the narrator, though aside from the opening sentence his name is barely mentioned at all (it is mostly the annotations that recall it). More interesting, though, is the absence of one of Ahab’s right-hand men: Flask. Naturally, this means he is mentioned less, or at least referred to by name fewer times than the other mates of the Pequod, but perhaps this opens up a line of inquiry to pursue: why are Starbuck and Stubb getting so much attention as to appear quantitatively more visible?
Next, I placed the contents of the word cloud into the Up-Goer Five, receiving the expected list of forbidden words:
Stubb, stub, brush, check, end, point, boats, captain, sperm, sea, ship, thou, nor, boat, Ahab, ye, whales, deck, Queequeg, Starbuck, chapter, whale, among
This list can be divided easily into three categories: names (Stubb, Ahab, Queequeg, and Starbuck), archaisms (thou, ye, nor), and nautical terms (stub, brush, check, point, boats, sperm, sea, ship, boat, whales, deck, whale). None of these are surprising to see on the list: the names are odd, the archaisms are by definition not going to be common, and our modern society relies far less on ship trade, making the nautical terms scarcer (though I would guess they wouldn’t have appeared in the top 1000 words in 1851 either).
The interesting remainders are “end” and “among,” which, I’ll admit, I am surprised are not within the ten hundred most used words.
Next comes the CLAWS speech tagger. This tool, as Mary and Dan reported, is not only less visually appealing but also harder to read for someone not familiar with its format. Still, the tool was surprisingly good at recognizing the proper nouns (Queequeg, Stubb, Starbuck, and Ahab) as such, rather than returning some sort of error or simply tagging them as ordinary nouns. Since recognizing proper nouns typically depends on context, CLAWS’ ability to identify them is impressive. Aside from the names, the list is mostly nouns and adjectives, with a few prepositions (upon, among) and an interjection (oh), but fewer verbs than I expected, only five by my count: said, cried, go, thought, and know.
Finally, with the TAPoR/Voyant tool, I found myself lucky that the first chapter of Moby-Dick was a default on the website. Unfortunately, the diagnostic returned was not all that interesting, so I went ahead and uploaded the entire text.
Voyant’s cloud, or ‘cirrus’, is prone to including “useless” words like articles, as you can see. Fortunately, while it does not take the liberty that both Wordle and WordItOut do of automatically removing certain words (and thereby removing some potentially important ones, as in the case of ‘all’), it lets you customize your own list and essentially blacklist the words you do not want. Wordle provides this feature as well, but removes words by default; Voyant forces the uploader to think about and choose the words represented.
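(The blacklisting itself is trivial to sketch; the point is that the stoplist becomes the analyst’s explicit choice rather than the tool’s hidden default. A minimal version, with a hypothetical text variable in the usage example:)

```python
from collections import Counter

# Word frequencies where the stoplist is an explicit, user-chosen argument
# rather than a hidden default.
def frequencies(text: str, stoplist: set) -> Counter:
    words = [w.strip(".,;:!?'\"()").lower() for w in text.split()]
    return Counter(w for w in words if w and w not in stoplist)

# e.g. frequencies(moby_dick_text, stoplist={"the", "of", "and", "a", "to", "in"})
```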
As you can see in the screenshot, the first word I selected that seemed, to me, worth scanning was ‘whale’, with a total of 971 uses beginning on the very first page. What is fascinating about Voyant is the multiple ways it will contextualize and build information around a single word. There are two windows dedicated to a frequency chart and to the context around each mention, as well as tabs showing where in the entire corpus your chosen word (or words) appears. This helps to alleviate suspicion, especially when dealing with an ambiguous word (unlike ‘whale’) that may have multiple uses and contexts.
Looking at the use of ‘whale’ throughout the entire book, I would be tempted to explore the periodic lulls in its mentions visible in the line graph. When the graph is given 10 or 15 segments, these oscillations become more drastic and show much more sporadic mentions of the term. Most interestingly, what can be seen is a steady decline in the use of ‘whale’ until the final chapters of the book (the chase sequence), at which point it begins a steep incline. There is a dramatic tension recognizable in the graph through the book’s usage of the term.
So, when I think about Ramsay’s idea of “estrangement” from textuality, I have to wonder what it is within the text, or about the text, that is the primary subject of estrangement. Is it the narrative? In every instance my initial responses have been grounded within the narrative: why is Flask mentioned less? Why is the word ‘all’ important enough that its absence from the word cloud is a significant loss? What time frame is represented by the steep incline at the end of the line graph? All of these questions are brought about because of my familiarity with the reading: a product of the close-reading-focused education that insisted I read Moby-Dick because it, singularly, is important, above thousands of anonymous books. But when it comes to the answers to my questions, are they all necessarily going to return to the narrative? Personally, it seems the temporary estrangement is merely a way of refocusing the narrative and re-reading it, arriving at Ramsay’s purported goal: creating new information and criticism from what the algorithms can show us.