Transcribing and Encoding Bentham

Quote

Having experimented briefly with XML encoding during the Technoromanticism class with Dr. Neil Fraistat, I was somewhat prepared for what this exercise entailed. However, I was pleasantly surprised to learn that the good people behind the Transcribe Bentham project have made XML encoding easier than ever for the average user. The toolbar was incredibly easy to use, and so I had no problem figuring out how to mark up my manuscript (JB/051/376/003). The hard part was the transcription process. Like others who have posted before me, I found several words that I just could not figure out. Initially I was overwhelmed, feeling like I was placing <unclear> tags all over the place. I spent many long minutes staring at my screen begging the words to reveal their secrets. I even tried looking at each individual letter, coming up with strange words like “unassepnable,” which were clearly not correct. After stepping away for a bit and coming back to the manuscript, I was able to further decipher some of the words. Yet, I was still unsure in a few places. Finally, I decided to enlist the help of Charity to see if she could figure out any of my “questionable readings,” and was happy to find that she was able to clear up a few of the words that had been eluding me. Eventually, I still ended up settling a few times on educated guesses surrounded by the <unclear> tags, but overall I felt pretty good that the majority of my transcription was correct.

This morning, when I checked my email, I was pleased to see that my text had been approved. While the editor made some changes and filled in some of my mystery words (“unassignable,” not “unassepnable” or even my actual word guess, “inestimable”), the majority of my encoded transcription was approved as being correct. There were also some stylistic changes. Words that had been separated in the text by line breaks were completed in the top line, leaving no indication that the word was split up in the actual manuscript. I am guessing that this is just to make it easier to read? Also, the notes, which I felt started at the end of the first line, were moved to the top of the entire paragraph. This, as I’ve stated, was a stylistic choice as far as I can tell, and most likely serves to make the content a bit easier to read, especially since the notes describe what is being talked about in the paragraphs. Anyway, I was happy to note that the majority of my attempt at encoding and transcribing Bentham was a success! Although there were some moments of discouragement in which I thought I would never be able to figure out some of Bentham’s handwriting, it was definitely fun when I was finally able to figure out a muddled word. The best part of this assignment was definitely encoding, though. As I stated on my questionnaire, I was very happy to see that the encoding process was made so simple through the toolbar so that beginners like me had no problem encoding Bentham’s manuscript. It is definitely an activity I would be interested in doing again, though perhaps with a different subject matter for transcription.

Transcribble

When I first read this assignment, I thought it was going to be a piece of cake for me.  As a TV production assistant–and beyond–I was required to transcribe taped interviews on an almost daily basis.  But trying to decipher Jeremy Bentham’s handwriting is not even remotely close to rewinding a tape to pick up words during Courtney Love’s drunk ramblings.  Bentham’s handwriting has Love’s slurred speech beat, hands down.

After many of the same issues Mary discussed in her post–namely Firefox not being Transcribe Bentham-friendly–I finally was able to view the manuscripts available for  transcription.  After perusing a few that had not been transcribed and seeing the writing was nearly illegible, I opted for an “easy” manuscript.  Of course most of these had already been done, so it was back to the untranscribed category and clicking at random to find one that might possibly fall into the “easy” category if I was lucky.  I chose JB/002/010/001 because it looked user-friendly.  I was wrong.

It seems every other word got a <gap> label, resulting in numerous ellipses in the finished transcript.  Phrases such as the following left me puzzled and the document …

Screen shot 2013-03-06 at 12.31.16 AM

Overall, I think Transcribe Bentham is a great project.  Although I’m still not sure if its intent is to get people like us who think this stuff is super cool to do free work for them.  Perhaps it’s just mutually beneficial.

Crowdsourcing Transcriptions

I was rather amused at the crowdsourced transcription assignment for class, since there was a Crowdsourcing session at THATCamp Lehigh Valley (which I attended this weekend).  If you like this sort of thing, but can’t stand Bentham’s handwriting, that link gives you many other sites to try your hand on.

I chose to transcribe JB/002/153/001, which is part of Bentham’s economic writings entitled Annuity Notes, mostly because the handwriting looked pretty clear compared to some of the other pages I had seen.  I noticed that the process did get markedly easier as I went through the document; I had more questionable “translations” in the first paragraph than in the rest of the document.  Also, it was easier to decipher words that appeared multiple times.  Despite those advantages, there were still several words I was unsure of (one of which I am pretty sure is a name, so I don’t feel bad about being unable to decipher that one).  Like Cliffie, I asked my boyfriend to take a look, and he agreed on several of my translations and suggested others that made more sense.

I think transcription work like this naturally becomes a collaborative process, especially when issues of handwriting become involved.  When I was teaching, we used to get together with the other grade level teachers to calibrate norms and grade the written “constructed response” standardized test practice questions, and the process went much quicker when you had a colleague right next to you to help interpret handwriting, or to confirm or change your assessment.  I wonder if those of us with a background in English have a natural tendency to get a second pair of eyes to look over our work with our training in peer editing and/or workshopping?

Update: Turns out what I thought was a name (something Billy) was actually “Exchequer Bills”.   Not feeling bad about missing that!

Collaborative Transcriptions

I chose to transcribe and encode JB/051/376/002 for the Transcribe Bentham assignment (you should feel free to tackle pages 1 or 3 of the same folio – they are up for grabs!). Since I completed my transcribing/encoding process at work yesterday, when I came upon a particularly baffling phrase, I pulled in others from my office to help. This only happened a few times (I am still feeling fairly proud of myself for the relative ease with which I deciphered Bentham’s script), but the following phrase/word stumped us all:

Screen Shot 2013-03-01 at 8.45.37 PM

To clarify, the ENTIRE rest of the manuscript is written in English, without a whiff of another language in it (some of his others are written in French, I noticed), so I tried word after word after word (along with Nigel and another officemate). However, after many minutes of simply staring at the characters, willing them into some sort of coherency, I was finally forced to utilize the “?” tag, indicating a ‘questionable reading,’ and entering the phrase “In places.” So, you can imagine my eagerness when I woke up this morning with a response from Transcribe Bentham that my manuscript had been reviewed – I immediately went to the page to see what the “right” answer was – and my transcription had been changed to “Non placel.” Non placel? I thought, That’s not English, no wonder I couldn’t figure it out. Since I had involved two others in my efforts, I decided to update them via Twitter, including the 668k hashtag. Aaaaaand, check out my Storify below to see the resulting convo (it’s better if you click View as Slideshow – also, my post continues on underneath):

  1. Fri, Mar 01 2013 11:27:29

  2. @caritasity @trueXstory @boswells731 Probably “non placet”, literally “it does not please” in Latin.

    Fri, Mar 01 2013 11:29:30

  3. @BonifaceVIII @caritasity @boswells731 Ah, Latin. It gets you every time. ‘non placet’ makes much more sense.

    Fri, Mar 01 2013 11:30:37

  4. @trueXstory @BonifaceVIII @boswells731 – not necessarily in this context, though…? besides @TranscriBentham made the call. :P

    Fri, Mar 01 2013 11:37:08

  5. @caritasity @trueXstory @BonifaceVIII @boswells731 starting to think that ‘non placet’ is right! Will revise (thanks for the correction!)

    Fri, Mar 01 2013 12:09:04

  6. Fri, Mar 01 2013 12:09:18

  7. @BonifaceVIII – nice catch on the latin! i just wish i hadn’t spent a half-hour staring at that phrase with my english-only eyes. :P

    Fri, Mar 01 2013 12:20:42

Although most participants probably transcribe/encode individually, I couldn’t help but make this a collaborative activity, which seems completely in alignment with the spirit of Transcribe Bentham (and the field of DH in general). Beyond the implicit communal nature of the project and the built-in collaboration between transcriber/encoder and the TB Editor, I was able to collaborate in person during my transcription process and digitally afterwards. The speedy response on Twitter from the TB Editor (I’m guessing Dr. Causer?) was both unexpected and gratifying, rendering the Project itself even more transparent. While I was initially skeptical of such an activity (Encoding? Isn’t that why I opted for topic modeling in Technoromanticism instead – to avoid this?), I’ve now concluded that Transcribe Bentham is something I’m definitely going to share with others and hope to revisit when I have more time (post-May!). It’s scholarly work saturated with social interaction, which is honestly how I like my academia served.

Wordly wobblings

My findings from the Google Ngram Viewer are that we did not like “idea” very much in the first half of the eighteenth century.  Our feelings about “truth” have varied substantially; we liked it quite a lot during the mid-nineteenth century, but in 1910 we started preferring “idea” and this has stayed fairly consistent since then.  Ngram

My Up-Goer Five definition of DH goes like this:

It is about doing old things in new ways. Or, if you ask another person, it is about doing new things in even newer ways. People who do it don’t agree on what things are most important or how to study them. Human life changed when books did away with forms of writing that came before them. Computer forms of stuff that used to be only on paper might be doing the same thing now. Computers can make stories look different, but does that mean that they ARE different at the bottom? Or is it only the way that we look at them? If we use computers to read books, we can study different ideas about them. The question is whether those kinds of ideas leave out the kind that came before. The question is also whether the old kinds of study leave out ideas that one can only reach by using new ways. Perhaps the best way to put the question is: How do we decide whether the old or new way is best for something we want to learn (or, better yet, how we can put the two together)?

 

While the original XKCD comic is funny, I think this concept can only work well when humor, not communication, is the point.  It could be helpful if someone is taking him/herself too seriously and wants to re-evaluate a statement in search of excessive jargon, but it does not seem useful for describing something to someone who does not already know what you are talking about.  Without the words “digital,” “humanities,” “electronic,” or “interpret” I wasn’t able to make a definition that could let somebody who had never heard of DH know what I was describing.

So, on to Wordle.  I used the Gutenberg text of King Lear (minus the fine print and introductory “comments”) and this was what I got:

http://www.wordle.net/show/wrdl/6368114/Gutenberg_King_Lear_Wordle_for_ENGL_668K

Word it Out gave me this:

WordItOut-Word-cloud-162602

Obviously, the speech prefixes dominate these clouds; Lear and Kent are the most prominent in both clouds.

Running the Word it Out list through the Up-Goer Five produced these words:

tell one night Sister say make see great done further now man hath long life late Daughter good Daughters Enter name mans answer away yet part better Father fit eyes nothing cold else old some Horse Gods time home go hand least way take Letter heard here much against still know Sir rather heart both all though found more come art Let most well like little many place follow age gone made other comes hold death none mad call within Brother full power hast head Sisters makes Lady after two set being put came do’s thing What’s toward Boy where’s best world thought men reason stand word Oh before any dead first bring house Friend blood matter true since told dost draw fire doth Fathers course things cause strange sight stands

 

One thing that surprised me is that “Lady” could stay but “gentleman” had to go.  Someone who was not aware of the context could probably gather that family relationships are a major theme of the work represented, but could probably not go much further than that.
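For anyone curious how a filter like this decides what stays and what goes, here is a minimal sketch. It assumes the editor simply checks each word against a fixed vocabulary; the `ALLOWED` set and the function name `split_by_allowed` are stand-ins invented for illustration, not the editor's real list or code.

```python
# Stand-in vocabulary for illustration only; the real editor checks against
# its own list of the ten hundred most used words, which is not reproduced here.
ALLOWED = {"lady", "father", "sister", "night", "good", "man"}


def split_by_allowed(words, allowed=ALLOWED):
    """Sort words into those found in the allowed set and those that are not."""
    kept, dropped = [], []
    for word in words:
        # Compare in lower case so that "Lady" matches "lady".
        (kept if word.lower() in allowed else dropped).append(word)
    return kept, dropped


cloud_words = ["Lady", "Gentleman", "Father", "Foole", "night", "Trumpet"]
kept, dropped = split_by_allowed(cloud_words)
print("kept:", kept)
print("dropped:", dropped)
```

A check this simple knows nothing about meaning, which may be why a word like “Lady” can survive while a close relative such as “gentleman” does not: all that matters is whether the exact word is on the list.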

The CLAWS tagger produced this:

-----_PUN 
place_NN1 hast_VHB turne_NN1 feare_NN1 Storme_NN1 Master_NN1 since_CJS 
i'th_NN1 th_NN0 Edgar_NP0 halfe_NN1 Edg_NN1 businesse_NN1 else_AV0 Enter_VVB 
leaue_NN1 Slaue_NN1 done_VDN thing_NN1 stand_NN1 heare_NN1 Ha_ITJ Regan_NP0 
Cornwall_NP0 speake_NN1 Lady_NN1 comes_VVZ world_NN1 Madam_NN1 head_NN1 
some_DT0 still_AJ0 Sword_NN1 Sir_NN1 againe_VVB thy_DPS farre_NN1 liue_NN1 
till_PRP any_DT0 Cordelia_NN1 most_AV0 set_VVN Knaue_NP0 told_VVD forth_AV0 
fire_VVB Brother_NP0 Daughters_NP0 Ile_NP0 meanes_NN2 gaue_VVB none_PNI 
being_VBG fit_AJ0 know_VVB within_PRP do'st_NN1 Douer_NN1 Cor_ITJ call_NN1 
nor_CJC Bast_VVB other_AJ0 Gentleman_NN1 Foole_NN1 backe_NN1 men_NN2
things_NN2 Noble_AJ0 neuer_NN1 Trumpet_NN1 pray_VVB seene_NN1 Alacke_VVB 
hither_AV0 goe_VVB now_AV0 Glou_NP0 more_AV0 bring_VVB vp_NN0 true_AJ0 
though_CJS much_AV0 two_CRD Villaine_NP0 euer_NN1 heard_VVD fellow_NN1 
gone_VVN Edmund_NP0 Scena_NP0 Fortunes_NN2 hold_VVB put_VVB where_AVQ 's_VBZ 
whom_PNQ take_VVB himselfe_NN1 do_VDB 's_POS Corn_NN1 ere_PRP sleepe_NN1 
euery_NN1 better_AJC King_NN1 say_VVB Stew_NN1 deere_NN1 first_ORD bin_NN1 
Fathers_NN2 finde_NN1 Duke_NP0 Gent_NP0 Gloster_NP0 cause_NN1 Knights_NN2 
good_AJ0 name_NN1 Oh_ITJ T_PNP is_VBZ returne_NN1 Sonne_UNC Horse_NN1 away_AV0 
France_NP0 Exit_NN1 Bastard_NN1 looke_NN1 make_VVB after_PRP o'th_NN1 
Prythee_NN1 wits_NN2 makes_VVZ Reg_NP0 word_NN1 little_AV0 vs_PRP Steward_NN1 
like_PRP age_NN1 Nature_NN1 thine_DPS cold_NN1 follow_VVB shalt_VM0 
against_PRP stands_NN2 What_DTQ 's_VBZ rather_AV0 way_AV0 seeke_VVB 
further_AV0 came_VVD Father_NN1 haue_VHB answer_NN1 knowne_NN1 long_AV0 
home_AV0 many_DT0 loue_VVB Sisters_NN2 life_NN1 Gods_NN2 late_AV0 thee_PNP 
made_VVD Fortune_NN1 Alb_NP0 eyes_VVZ nothing_PNI farewell_NN1 Edmond_NP0 
feele_NN1 purpose_NN1 Tom_NP0 old_AJ0 Friend_NN1 see_VVB found_VVN least_DT0 
power_NN1 dead_AJ0 Traitor_NN1 well_AV0 Let_VVB vse_NN1 toward_PRP blood_NN1 
euen_NN1 Lear_NP0 draw_VVB Lord_NN1 reason_NN1 mad_AJ0 strange_AJ0 heart_NN1 
here_AV0 Letter_NN1 yet_AV0 Albany_NP0 Gon_NP0 Gonerill_NP0 man_NN1 part_NN1 
one_CRD great_AJ0 Glo_NP0 dost_VDB heere_AJ0 giue_NN1 downe_NN1 doth_VDZ 
poore_NN1 lesse_NN1 come_VVB hand_NN1 Kent_NP0 Grace_NP0 art_NN1 helpe_NN1 
go_VVB matter_NN1 foule_NN1 course_NN1 thou_PNP strike_VVB Boy_NN1 vpon_NN1 
whose_DTQ thinke_NN1 thought_NN1 beare_NN1 peace_NN1 hath_VHZ Exeunt_UNC 
death_NN1 full_AJ0 Sister_NN1 owne_NN1 house_NN1 selfe_NN1 night_NN1 best_AJS 
Fiend_NN1 keepe_NN1 both_AV0 tell_VVB Ste_NN1 mans_NN2 sight_VVB Glouster_NN1 
all_DT0 hence_AV0 before_PRP Daughter_NN1 time_NN1 ..._SENT **42;7;TOOLONG_UNC

I’m sorry; I can’t give a useful analysis of this.  The site is the opposite of the word cloud generators in that it is not even a little bit user-friendly.  The key to tags is not straightforwardly organized.  I tried to find what “NPO” (or possibly “NP0”) might mean, but it was not in the list.  Perhaps this would make more sense to me if I knew something about coding.
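For readers facing the same wall of output, the tags look like those of the C5 tagset used with the British National Corpus, in which the code is “NP0” (with a zero) and marks a proper noun. The sketch below assumes that reading and shows one way to split each `word_TAG` token and attach a plain-English gloss. `TAG_GLOSSES` is a short hand-picked subset written for this example, and the function name is hypothetical.

```python
# Minimal sketch: split CLAWS-style "word_TAG" tokens and gloss a few tags.
# TAG_GLOSSES is a small hand-picked subset (assumed to follow the C5 tagset),
# not the full key published with the tagger.
TAG_GLOSSES = {
    "NP0": "proper noun",
    "NN1": "singular common noun",
    "NN2": "plural common noun",
    "VVB": "base form of a lexical verb",
    "AJ0": "adjective",
    "AV0": "adverb",
}


def parse_tagged(text):
    """Return (word, tag) pairs from whitespace-separated word_TAG tokens."""
    pairs = []
    for token in text.split():
        # Split on the last underscore, since the tag always comes last.
        word, sep, tag = token.rpartition("_")
        if sep:  # skip anything that has no underscore at all
            pairs.append((word, tag))
    return pairs


sample = "Edgar_NP0 place_NN1 men_NN2 bring_VVB true_AJ0 hither_AV0 Oh_ITJ"
pairs = parse_tagged(sample)
glossed = [(w, t, TAG_GLOSSES.get(t, "not in this short list")) for w, t in pairs]
for word, tag, gloss in glossed:
    print(f"{word:8} {tag:4} {gloss}")
```

Read this way, the output above is labeling “Edgar,” “Regan,” and “Cornwall” as names, and also mislabeling words like “Knaue” and “Brother” as names, probably because of their capital letters.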

Pushing onward into the land of things I don’t understand, I approached TAPoR and HyperPo.  Using this site was extremely frustrating because, once I uploaded the text (I couldn’t copy and paste, so the Gutenberg “comments” came along for the ride), the resulting window did not include labeled buttons.  I got the following analyzing the word “daughter”:

"Daughter"

 

If I’m using it right, this tool indicates that the word “daughter” occurs most often in Act 1, Scene 2 — the scene in which Lear divides his kingdom.  This scene coincides with the highest number of mentions of “Cordelia” but not of “Gonerill” or “Regan.”  I think this set of tools has the most potential usefulness, but I had trouble understanding how to make them useful.  I tried some of the “help,” “tutorial,” and “tour” features, but I kept running into “page not found” and “router error” messages; I don’t know if I was doing something wrong or if the site just wasn’t working very well.

Ramsay was right:  these tools make the text of King Lear look completely unfamiliar.  As I flailed about through these mysterious new waters, I found that the mere strangeness of what I was seeing was almost overwhelming.  I can see that I might eventually be able to put these tools to productive use, but first I need to become more comfortable navigating digital environments.

Seeing the Forest through the Thees (and Thous)

My initial reaction to Ramsay’s statement is that for me nothing quite induces the defamiliarization of textuality like invoking the ostranenie of Russian formalists. I’d like to see someone explain that passage in Upgoerfive! That having been said, I found this week’s exercises quite thought-provoking and exciting. As soon as I began my first attempts to create word clouds with Augustine’s Confessions, I knew there were going to be problems with my particular translation, the language of which is extremely antiquated. Because of the language, my initial Wordle showed “Thee”, “Thou”, and “Thy” to be the most common words (because they are not in the tools’ basic stoplists, of course, even though a modern translator would say “you” and “your”).  Further examination revealed that there were a large number of other very common words in archaic forms in my text.

Through some trial and error, and using a text editor with advanced Grep capability to perform some batch replace procedures on my text file, I managed to generate a more satisfactory result. The Wordle and WordItOut versions seemed quite similar in my case. And even though WordItOut seems to offer somewhat easier manipulation of the final output, I’m posting the Wordle because I agree with others that they tend to look better:

Wordle

I found this to be a surprisingly good encapsulation of many of the main themes of the Confessions. Putting the resulting words into UpGoerFive resulted in the following list of words used frequently in my text that were not among the more commonly used in English today:

nor, unto, lord, earth, soul, whom, heaven, itself, therefore, neither, behold, joy, spirit, whence, flesh, holy, certain, unless

Here we can see that the archaic language is still apparent, even after my attempts to modernize the most frequently used archaic words.  “Nor”, “unto”, and “whom” should really probably be on the stoplist since the ideas that my old translation is expressing with them would probably be expressed with stoplist words in a translation written today.  But if we look past those words, the remaining results are reasonably instructive, and a machine trying to ‘comprehend’ what the Confessions are about would have a reasonably easy time of it, I suspect.
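The batch replacement of archaic forms described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `MODERN` mapping is hypothetical (the post does not list the replacements actually made), and the sample sentence is made up rather than quoted from the translation.

```python
import re

# Hypothetical mapping chosen for illustration; the replacements actually
# used on the Confessions text are not recorded in the post.
MODERN = {"thee": "you", "thou": "you", "thy": "your", "thine": "your"}

# \b keeps the match to whole words, so "Thyself" is left untouched.
PATTERN = re.compile(r"\b(" + "|".join(MODERN) + r")\b", re.IGNORECASE)


def modernize(text):
    """Replace whole-word archaic pronouns, preserving an initial capital."""
    def swap(match):
        word = match.group(0)
        new = MODERN[word.lower()]
        return new.capitalize() if word[0].isupper() else new
    return PATTERN.sub(swap, text)


# Made-up sample sentence, not a quotation.
sample = "Thou knowest that thy servant seeks Thee; Thyself art my joy."
print(modernize(sample))
```

Note what the sketch leaves behind: “knowest” and “art” are untouched, which matches the observation that archaic language stays visible even after the most frequent words are replaced.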

The CLAWS tagger seems quite powerful, though its results didn’t immediately speak to me.  I did notice that it seems to have mis-identified Augustine’s use of “times” as a preposition.  CLAWS becomes particularly powerful, it would seem to me, if one were to convert the results list to a spreadsheet that can be easily sorted by part of speech.  TAPoR likewise looks like a very powerful toolset: if I’m not mistaken, its concordance generator could accomplish Father Busa’s entire project in a matter of a few minutes, assuming one had the works of Aquinas available in text files.
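Converting the tagger’s results into something a spreadsheet can open is a small job. The sketch below assumes the output is a run of whitespace-separated `word_TAG` tokens; the function name `tagged_to_csv` and the sample tokens are invented for illustration.

```python
import csv
import io


def tagged_to_csv(text):
    """Turn word_TAG tokens into CSV text sorted by tag, then by word."""
    rows = []
    for token in text.split():
        word, sep, tag = token.rpartition("_")
        if sep:  # ignore tokens with no tag attached
            rows.append((tag, word))
    rows.sort(key=lambda row: (row[0], row[1].lower()))

    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["tag", "word"])
    writer.writerows(rows)
    return buffer.getvalue()


# Invented sample tokens in the tagger's word_TAG format.
sample = "times_PRP heaven_NN1 behold_VVB soul_NN1 holy_AJ0"
print(tagged_to_csv(sample))
```

Saved to a `.csv` file, the result opens in any spreadsheet program with one column for the tag and one for the word, so a suspect tagging such as “times” as a preposition is easy to spot among its neighbors.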

Ultimately, though, coming back to the question of defamiliarization of the text, this week’s exercises proved to me that there is something valuable in breaking our texts down in this way — even if I’m not sure I see where this is all headed just yet.  Text mining procedures like these seem to be taking apart the forest and sorting the trees by species, size, age, etc.  Surely that would be useful information for a biologist studying the forest, but how we will get from stacks of trees over to understanding biodiversity still remains unclear to me.

No Clever Title

I used the text of John Henry Newman’s The Idea of a University that I found on the Project Gutenberg website last week to produce two word clouds on Wordle and WordItOut.

NewmanWordle

WordItOut-Newman

There were two differences that jumped out right away when I compared the two: WordItOut seemed to do a better job of weeding out stopwords (“may”), and Wordle accepted without question what I’m pretty sure are character-encoding errors (the pseudo-words beginning with ‘Ä’).
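One plausible cause of pseudo-words beginning with ‘Ä’, offered here as a guess rather than a diagnosis, is UTF-8 text being read as Latin-1. A vowel with a macron, common in Latin quotations, is stored as two bytes in UTF-8, and the first of those bytes is the Latin-1 code for ‘Ä’. The sketch uses “ā” as an assumed example character, since the post does not say which characters were affected.

```python
# A macron vowel such as "ā" is stored as two bytes in UTF-8: 0xC4 0x81.
original = "ā"
utf8_bytes = original.encode("utf-8")
print(utf8_bytes)

# Reading those same bytes as Latin-1 turns the first byte into "Ä"
# and the second into an invisible control character.
misread = utf8_bytes.decode("latin-1")
print(repr(misread))

# Reversing the mistake recovers the intended character.
repaired = misread.encode("latin-1").decode("utf-8")
print(repaired)
```

If this is what happened, re-saving or re-opening the Gutenberg file as UTF-8 before pasting it into the word cloud tools would be the first thing to try.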

I had pretty much the same experience as everyone else did when I pasted the words from the WordItOut word cloud into the Up-Goer Five Text Editor: it rejected 26 of the words (although it wasn’t concerned in the least that the words, in that order, did not constitute a syntactically valid English sentence).

I then pasted the same words from the WordItOut word cloud into the CLAWS Part-of-Speech tagger. For some reason, the text pasted without spaces between the words, and I had to enter the spaces manually. I noticed that the word list had a similar effect to the “entropic poem” on page 37 of Ramsay’s Reading Machines, which surprised me, since I had assumed that that effect would only be perceptible in a short text.

I get the point of tools like this. There’s a similar one called William Whitaker’s Words that’s very popular among students learning Latin, although the fact that CLAWS accepts bulk input (unlike Whitaker’s Words) is an improvement on the model. And there are useful things, I suppose, to be learned about a text from such tools (e.g., to confirm or deny the claim that John Calvin never used adverbs in writing). I didn’t, however, find the output of CLAWS particularly edifying in this case:

CLAWS Output

 

I attempted to hand off the URL for the plain text on the Project Gutenberg site directly to TAPoR using “Your Web Page”, but what I got was an HTTP 403 Forbidden error, so I played with Chapter 1 of Moby Dick instead. My sense was that the HyperPo does need a body of text longer than a single chapter in order to be really useful rather than a curiosity.

I don’t feel qualified to comment on whether the use of these tools produces an effect of estrangement and defamiliarization of textuality in general; not being a literature student, I’m not used to relating to textuality in the abstract, as opposed to a particular text or texts. My impression is that tools of this kind will do much more for you if you already know something about the text you are examining in this way, and I certainly got a lot more out of the examination of Gratian’s Decretum than of Newman’s Idea of a University.

War of the Wordles

Unfortunately, I lost my first Wordle of War of the Worlds, which had a beautiful custom palette and Martian-like font, and now I’m really mad that I couldn’t find a search function on the Wordle site’s public gallery. Boo. So here’s a second one.

wordle

And, the much uglier WordItOut!

WordItOut-Word-cloud-162393

Interestingly, many configurations of the Wordle sketch out a bare-bones premise for the book with the most prominent words: “Martians Came”. Both “Mars” and “Earth” are very small, and don’t even appear in the WordItOut! There are few proper nouns, no character names, but places like “London” and “Woking” show up. “Black” and “red” are also prominent, as are sensory words like “heard”, “see”, “saw.” “Seemed” is much bigger than “know,” giving a feel for the uncertainty that haunts much of the action of the book. The WordItOut! on the other hand, picked up much more common “filler” words like “said,” “about,” “through,” “over.” It was also much less fun to play with. Much of the appeal of the Wordle for me was arranging the layout so as to maximize the “sense” I could make out of it visually: how much of the basic “plot” or action words could I manage to juxtapose and highlight with color, straight or curved lines, font “appropriate” to the subject matter? As Ramsay suggests, this is perhaps the greatest potential of text-analysis tools–the ability to operate at a new scale and to manipulate the text on different levels than “close reading” allows.

Not surprisingly, very few of my Wordle words were allowed in the Up-Goer Five Text Editor. While experimenting with Up-Goer Five, I was trying to figure out the best approach–do I hand-pick words from the list of ten hundred, or do I build my definition by attempting to write it first, and then “translate” it? I wove back and forth between these approaches, picking some words and then trying out other phrases that were inspired by them. Ultimately I was disappointed, and I must say my definition of DH was more flippant than informative: “Many conversations about building, making, thinking, doing; money, jobs. Using computers to study humans and read/write ‘algorithmically.’” Without punctuation it’s as long as a tweet.

When I input the Wordle text into the CLAWS Part-of-Speech tagger, it interestingly read many of the verbs as gerunds, tagging them as adjectives. I would really like to know what others think the best application of a tool like this would be. I immediately thought it could be used as a translation aid from one corpus to another, but this doesn’t seem to be a feature.

TAPoR was honestly the tool that got me most excited and seemed most applicable to my research on women’s alternative/independent publishing. It was easy to “mess around” in–I’ve never done any text analysis before but at the most basic level I knew what a stop-word list was, and could figure out how to get the tool to “spit out” what I wanted to see. The descriptions that appear when you hover over a tool were immensely helpful and I found myself wishing every DH project or toolbox had this feature. Interested by the appearance of place names like London and Woking, I graphed these on the concordance tool to see the protagonist’s (and the Martians’) geographical movements through the novel. I also graphed “Martians” and “People,” the occurrences of which mirrored each other for most of the novel before “People” drops off sharply toward the end, when the protagonist is moving through deserted houses and communities. This exercise really tested my knowledge of the “plot points” in the book–I found myself remembering details that seemed insignificant, all by looking at a graph of the words. I’m just itching to digitize some zines, scrape their text, and compare all the instances of “queer,” “feminist,” and “anti-racist” I can find.

I also couldn’t help but smile at the title of these tools: “Voyant: See through Your Texts.” The entendre is irresistible–use “your texts” (whatever they may be) as a pane or a lens through which to view a specific topic, and/or make your texts transparent, lucid; make bare their meanings. Of course, the implication of Ramsay’s argument is that none of these tools, or the texts to which we apply them, are “transparent.” We might be able to “see” our text differently, from new angles and at previously hidden layers, but it is dangerous to assume that nothing resists the self-evidence of scholarly vision. My partner, who was watching me do these experiments and also helping me with the necessary plugins to run them, kept lingering on these sites to figure out what kinds of algorithms they use and what kinds of patterns they’re finding. I’m not sure most users think about the tools on those levels [DH-ers and hackers are, as usual, another story], and it would be easy to tout their potential while forgetting that our interpretations, the most valued currency in some humanities disciplines, are just begging to be made.

 

Loved by the King?

I’ve seen Wordles used before in school projects, but usually for display purposes rather than used as an analytical tool.  Therefore, I was excited to see the application given a new purpose that teachers could easily use in school for a variety of texts.

Word Clouds!

When I imported Project Gutenberg’s text of the first volume of Le Morte D’Arthur into Wordle and Word it Out, these were my results (Sadly, I discovered that the “Loved by the King” font in Wordle was not very, well, kingly, so I switched it to a more appropriate font):

Wordle

 

Word It Out

It’s not surprising that the most prominent word in both is “Sir”, as most of the characters go by that epithet, nor that “king” and “knight” are also frequently used, emphasizing the courtly genre of the text.  “CHAPTER” probably is featured since the table of contents was included in my copy and paste, in addition to all the times it is usually used.  I was surprised that Tristram beats out Arthur (in a book titled after him!).  I also found it interesting that words such as “smote”, “battle”, and “slain” are much more prominent than “God” and “worship”, hinting that the divine justification for most of the fighting was not as much of an excuse as it purported to be.
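Word clouds like these rest on a simple count of how often each word appears once the most common “filler” words are set aside. The sketch below shows that idea under stated assumptions: the `STOP_WORDS` set is a tiny invented example (the tools’ own stop lists are much longer), the function name is hypothetical, and the sample sentence is made up in the style of the text.

```python
import re
from collections import Counter

# Tiny illustrative stop list; the word cloud sites use their own, longer lists.
STOP_WORDS = {"the", "and", "of", "a", "to", "that", "was", "he", "in"}


def top_words(text, n=3, stop_words=STOP_WORDS):
    """Return the n most frequent words that are not in stop_words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in stop_words)
    return counts.most_common(n)


# Made-up sample sentence, not a quotation from Malory.
sample = ("Then the king and the knight rode to the battle, and the knight "
          "smote the king, and Sir Tristram smote the knight.")
print(top_words(sample))
```

Calling `top_words(sample, stop_words=set())` instead puts “the” at the top, which is the effect of running a frequency count before any common words have been removed.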

Paraphrasing with Up-Goer Five

Screen Shot 2013-02-13 at 3.35.29 AM

Like many of my classmates, I found when I put the top 100 words into Up-Goer Five, that about half the words were not permitted, primarily in the proper name, antiquated term, and knightly terminology categories.  I would doubt the ability of someone to use the Up-Goer Five to summarize books like this with difficult language if I hadn’t seen their application to Hamlet’s “To Be or Not To Be” speech.  (I actually recommended this application to my former co-workers, many of whom require their students to paraphrase the famous soliloquies in Shakespeare’s plays on their tests.)

And I thought I was free from dealing with parts of speech…

CLAWS

I was impressed by the CLAWS Part of Speech Tagger’s ability to correctly identify even the antiquated pronouns such as “ye” and “thee”, but other than that, I found it difficult to see how these kinds of results could be useful in an analysis of the text.  Maybe if there were further calculations applied (frequencies of parts of speech?) I could have seen those patterns to turn into narratives–or at least questions–that Ramsay suggests.
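The “further calculation” wondered about here, a frequency count of parts of speech, takes only a few lines once the output is treated as `word_TAG` tokens. This is a sketch, not a feature of the tagger itself; the function name and the sample tokens are invented for illustration.

```python
from collections import Counter


def tag_frequencies(tagged_text):
    """Count how often each part-of-speech tag occurs in word_TAG tokens."""
    tags = [tok.rpartition("_")[2] for tok in tagged_text.split() if "_" in tok]
    return Counter(tags)


# Invented sample tokens in the tagger's word_TAG format.
sample = "ye_PNP smote_VVD knight_NN1 thee_PNP battle_NN1 king_NN1 slain_VVN"
freqs = tag_frequencies(sample)
total = sum(freqs.values())
for tag, count in freqs.most_common():
    # Show each tag with its raw count and its share of all tagged tokens.
    print(tag, count, f"{count / total:.0%}")
```

Run over a whole book, counts like these could prompt questions of the kind Ramsay has in mind, for example whether past-tense verbs crowd out adjectives in the battle chapters.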

Making some conclusions with TAPoR

When I first plugged the text of Le Morte D’Arthur into TAPoR, the frequency count and “Cirrus” were both dominated by articles and other “unimportant” words, but when I asked the program to remove them, it generated a list almost identical to that of Wordle and Word It Out!  The Word Trends graphs, though, got interesting when I decided to click on those prominent names.

Frequency of Arthur, Tristram, and Launcelot’s appearances in the book

 

Leaving the “Segments” setting at 10 to roughly mimic the 9 books in Vol. 1, I discovered that Arthur most frequently appears at the beginning of the book (which makes sense, given that it is devoted to the story of how he came to power), and then is practically forgotten about.  Likewise, Tristram dominates the last part of the book, even more so than Arthur.  This makes sense because book 8 is all about Tristram’s adventures.  Similarly, Launcelot spikes in the middle of the graph, as book 6 is all about his deeds.  The juxtaposed graph shows clearly how Malory attempted to integrate all the various legends about the knights which had come from different sources, choosing to do it in an episodic fashion focusing on the character rather than jump back and forth between multiple storylines as is more typical of contemporary literature.
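The idea behind a trend graph with a “Segments” setting can be sketched as cutting the text into equal slices and counting a name in each one. How TAPoR actually segments a text is not documented in this post, so the equal-slice approach below is an assumption, and `word_trend` is a hypothetical name.

```python
import re


def word_trend(text, word, segments=10):
    """Count occurrences of word in each of `segments` equal slices of text."""
    words = re.findall(r"[a-z']+", text.lower())
    # Ceiling division, so every word falls into one of the segments.
    size = max(1, -(-len(words) // segments))
    counts = [0] * segments
    target = word.lower()
    for position, current in enumerate(words):
        if current == target:
            counts[min(position // size, segments - 1)] += 1
    return counts


# Toy "book": Arthur only at the start, Tristram only at the end.
toy = "arthur " * 3 + "knight " * 6 + "tristram " * 3
print(word_trend(toy, "Arthur", segments=4))
print(word_trend(toy, "Tristram", segments=4))
```

In the toy text the two lists peak at opposite ends, the same shape as the Arthur and Tristram lines in the graph. Because ten slices of equal length will not line up exactly with nine books of unequal length, a peak can straddle a book boundary.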

So what is it like to read this?

I think that these activities did have a sense of what Ramsay refers to as “ostranenie–the estrangement and defamiliarization of textuality” (3).  However, I’m skeptical as to how far we can take algorithmic analysis when the potential for grasping at straws exists.  As Ramsay mentions later on,

If something is known from a word-frequency list or a data visualization, it is undoubtedly a function of our desire to make sense of what has been presented. We fill in gaps, make connections backward and forward, explain inconsistencies, resolve contradictions, and, above all, generate additional narratives in the form of declarative realizations (62).

How much of this meaning is because we want to see meaning there?  And how much is built on prior assumptions?  For example, am I reading too much into the Word Trend charts of Malory because I know that his project was one of compilation, rather than invention?  I think this gets even trickier when you analyze results of an algorithm that you have designed–your own biases and/or assumptions are built into the project from the start.  Hopefully we’ll talk more in class about when these types of practices are productive and when they produce results that just mirror what we already think.

 

(And if you’re interested in seeing the outcome of Unicorns vs. Zombies according to Google N-Gram, check out my blog post!)

The Prejudice of Stripped Texts

To start this week’s exercise, I decided to have a little fun. Kind of like stretching before a big workout. Using Google’s Ngram Viewer, I compared the heroine of my chosen text, Pride and Prejudice’s Elizabeth Bennet, to her modern-day counterpart, Bridget Jones, with whose diary we are intimately acquainted. Because Helen Fielding has openly admitted to basing her characters on Jane Austen’s—especially Mark Darcy on Mr. Darcy—I thought it would be interesting to see how else they compare. I was surprised to see how Miss Bennet’s popularity waned for so many years and then, at the turn of the century, increased and hasn’t stopped since.  Additionally, I was surprised to see that Bridget Jones’ popularity peaked higher than Elizabeth’s ever did.

Ngram Viewer

Then on to the hard part of the workout—creating a definition for digital humanities. And not just any definition, but one with strict boundaries. My humble result is below.

DH Definition

 

Wordle vs. WordItOut

While I generally consider myself a hands-on learner and quick on the uptake when it comes to basic computer programs and technologies, I found this week’s exercise to be more than a little frustrating. Wordle would not allow me to insert the Project Gutenberg (or any other) link to get my word output, which resulted in me copying and pasting the book in its entirety into the “Paste in a bunch of text” box. Oh, I pasted in a bunch of text alright! Finally, I got this beauty:

Wordle

Then it was time for WordItOut, which was a much quicker task after figuring out Wordle’s quirks.

WordItOut

I actually took the time to try to make the two look as similar as possible in coloring for easier comparison. I think Wordle has WordItOut beat in basic aesthetics, but otherwise the results were nearly identical. I was very surprised to see “Mr.” was the word most used throughout Pride and Prejudice. Despite being the nineteenth century’s chick-lit by a female author, it is clear that it was still a man’s world at the time of writing and publication. However, the word “Elizabeth” does run a close second, which is a bit refreshing.

 

Up-Goer Five Text Editor

Next up, the commonality of words. It appears things haven’t changed much in 200 years since Miss Austen put pen to paper. In fact, other than proper names, only four words she used were not in the top 1000 words of Up-Goer Five: indeed, pleasure, till, and manner. However, this made me curious what the results would be if basic words like came, made, most, and go were not allowed to be analyzed. I was surprised at pleasure being so widely used. It’s not a word I hear used often, and it seems the connotation has changed over the years.

Up-Goer Five
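Here is a minimal sketch in Python of the kind of check this comes down to: take the most frequent words in a text and list the ones missing from an allowed vocabulary. This is only my guess at the logic, not the Up-Goer Five editor's own code, and the `allowed` set below is a tiny made-up stand-in for the real list of permitted words.

```python
import re
from collections import Counter

def disallowed_top_words(text, allowed, top_n=100):
    """Return the most frequent words in `text` that are not in the `allowed` set."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # most_common sorts by count, highest first; ties keep first-seen order.
    top = [word for word, _ in Counter(tokens).most_common(top_n)]
    return [word for word in top if word not in allowed]

# Hypothetical allowed list, for illustration only.
allowed = {"the", "and", "came", "made", "most", "go"}
sample = "indeed the pleasure the indeed the"
print(disallowed_top_words(sample, allowed, top_n=3))
```

Dropping basic words like "came" or "made" from the analysis would just mean filtering them out of `tokens` before counting, which would push rarer words up into the top slots.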

 

CLAWS

CLAWS was my least favorite of all the sites. To me, it did not lay out the results in a clear, easy-to-read manner. It was also counterintuitive that the key wasn’t listed on the same page as the results, so that you had to toggle back and forth between pages. Additionally, this seems more like it would be useful for grade school children learning grammar than it would be for any other purpose.

CLAWS

 

TAPoR

When it came to TAPoR, I wasn’t nearly as interested in the HyperPo abilities as I was in the program’s ability to run lists of words and compile how many times each word occurs in the text. The word “Elizabeth,” which appeared to be a close second to “Mr.” in the Wordle, is actually used 200 fewer times than “Mr.” Furthermore, I was particularly interested in the listing ability for two reasons. First, Stephen Ramsay writes extensively on the tf-idf formula and how its findings affect critics when looking for patterns in a text, which I found intriguing. Second, in Italo Calvino’s If on a winter’s night a traveler, a character tries to categorize and determine the genre of books based solely on the words that recur and appear the most in a given work. It’s an interesting thought, trying to decide what a book is about without having read it for its sentences, but for the words it features.

TAPoR
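Since tf-idf comes up here, a small Python sketch may help show why it differs from a plain word count. It uses one common textbook variant (term frequency times the log of the number of documents divided by the number of documents containing the word), which may not be exactly the form Ramsay works with. The function name and the toy "documents" are my own assumptions.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs maps a title to a list of words; returns title -> {word: score}."""
    n = len(docs)
    # Document frequency: in how many documents does each word appear at least once?
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    scores = {}
    for title, tokens in docs.items():
        tf = Counter(tokens)
        scores[title] = {word: tf[word] * math.log(n / df[word]) for word in tf}
    return scores

# Two made-up "chapters" standing in for a real corpus.
docs = {
    "a": ["mr", "mr", "elizabeth"],
    "b": ["mr", "ship"],
}
scores = tf_idf(docs)
print(scores["a"])
```

In this variant, a word that appears in every document scores zero no matter how often it is used, so a word like "Mr." can top a raw frequency list and still tell a critic very little about what sets one text apart from another.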

 

While all of these sites were fun to play with and produced interesting results, I think they ultimately take away from the true meaning of what a book is hoping to convey. Making a book a thing of quantitative results removes the reader’s ability to interpret the text for himself and to engage in the nuances the author has created with grammar, punctuation, and voice. The only work that comes to mind that would benefit from these results would be Gertrude Stein’s “Portraits and Repetition,” where her goal is to use the same words as many times and in as many ways as possible. As Ramsay himself writes:

“It is one thing to notice patterns of vocabulary, variation in line length, or images of darkness and light; it is another thing to employ a machine that can unerringly discover every instance of such features across a massive corpus of literary texts and then present those features in a visual format entirely foreign to the original organization in which these features appear” (Ramsay 16).

I couldn’t agree more. Just as Project Gutenberg states that anything may be done with a public domain text, which may result in the text being changed in ways that dissolve its power and purpose, stripping it to just its words changes it too.