A Quick Experiment in “Distant Reading” a Large Medieval Latin Text

Gratian

 

My dissertation is on the textual development of Gratian’s Decretum. The Decretum was written around 1140 by the otherwise unknown Gratian, and was the foundational textbook for the systematic study of canon law within the medieval university. (In fact, it remained the basis for the law of the Roman Catholic church right up until 1917.)

Inspired by Charity and Kathryn’s presentation on Wednesday night, I decided to use Wordle to do an experiment in “distant reading” Gratian’s text. The MGH (Monumenta Germaniae Historica) in Munich digitized Emil Friedberg’s still-standard 1879 critical edition in the 1980s, and I copied and pasted the whole thing (all 490,446 words) into Wordle.

A few things need to be kept in mind in order to interpret the resulting Wordle.

First, the Decretum was written in Latin, a fully inflected language, and Wordle does no stemming. This is both a minus and a plus. Deus, Dei, Deum and Deo are just morphologically different forms of one word, and if we were to put them all together, Deus (“God”) would have a more prominent (and less misleading) place in the visual space than it does. Episcopus (“bishop”) is another example. On the other hand, the fact that Wordle does no stemming has the effect of preserving the gendered words, for example eum (“him”) and eam (“her”). These pronouns can, of course, refer to things that are masculine and feminine in a purely grammatical sense, but the difference is nevertheless interesting.
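To make the stemming point concrete, here is a minimal Python sketch of what grouping inflected forms would do to the counts. It is not what Wordle does, and everything in it is an assumption for illustration: the tiny hand-made LEMMAS table, the function names, and the made-up sample sentence. A real experiment would need a proper Latin lemmatizer.

```python
import re
from collections import Counter

# Hypothetical, hand-made lemma table covering only a few of the forms
# discussed above; a real experiment would need a full Latin lemmatizer.
LEMMAS = {
    "deus": "deus", "dei": "deus", "deum": "deus", "deo": "deus",
    "episcopus": "episcopus", "episcopi": "episcopus",
    "episcopum": "episcopus", "episcopo": "episcopus",
}

def tokenize(text):
    """Lowercase the text and split it into runs of the letters a-z."""
    return re.findall(r"[a-z]+", text.lower())

def surface_counts(text):
    """Count every inflected form separately (no stemming)."""
    return Counter(tokenize(text))

def lemma_counts(text):
    """Count forms after mapping the ones we know to a shared headword."""
    return Counter(LEMMAS.get(token, token) for token in tokenize(text))

# Made-up sample text, not a quotation from the Decretum.
sample = "Deus dixit. Gloria Dei. Ad Deum. Deo gratias. Eum et eam."

print(surface_counts(sample).most_common(3))
print(lemma_counts(sample).most_common(3))
```

In the ungrouped count every form of Deus appears once, while the grouped count gives the headword four occurrences. Because eum and eam are deliberately left out of the table, they stay separate, which preserves the gender distinction described above.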

Another linguistic feature is the salience of the word que. This word can mean several different things depending on context, but it shows up on the Wordle because of its use as a relative pronoun (“which”) kicking off a subordinate clause. Latin is a hypotactic language and so subordinate clauses appear much more frequently than in a paratactic language like English.

Second, the Wordle makes sense in the context of the way in which Gratian put the Decretum together. The Decretum consists of short extracts from “authorities”, church councils plus long-dead theologians and popes, which Gratian embeds within a framework of his own comments (called dicta or “sayings”). It is extremely interesting that only two of the individual authorities are named frequently enough to show up in the Wordle: Augustinus (bishop of Hippo Regius in modern-day Algeria, d. 430) and Gregorius (bishop of Rome, d. 604). The word Papa (“Pope”) is more prominent, suggesting the collective, if not individual, heft of the popes in the lineup of authorities. Finally, Concilio (“Council”) shows up because the attribution (“inscription” in the jargon of medieval canon law studies) of so many canons is to one or another of the general or provincial councils that Gratian cited.

The chaining of multiple authorities in sequence is a very prominent feature of the text, and is indicated by the word Item (“Similarly”). One of Gratian’s goals was to show that the authorities were in harmony with each other. In fact his title for the book (which isn’t the one that stuck) was Concordia Discordantium Canonum (“The Agreement of Disagreeing Rules”). To do that, however, he had to bring out the apparent disagreements among the authorities before resolving them (his resolutions usually being introduced by Unde or “Whence”). This gives rise to the use of adversative particles like uel (“or”) and uero (“but”) that foreground the (apparent) contrast between the positions of the authorities.

These are just some of the immediate reactions I had to a quick experiment in “distant reading” an almost half-million-word text in one morning. I’ll update this post if I come up with more upon further reflection. I’d also appreciate feedback from the group on how to better communicate these ideas.

Saving Wordle Word Clouds

If anyone wants to know how I saved my word cloud from Wordle, here’s how I did it (you might find a better way): Choose the Print option and through that menu save it as a PDF, then open the PDF and save it as a JPEG. You can probably take a screenshot of the word cloud, too – I just wanted the best resolution possible so that it could be blown up onscreen for the activity.

This might be completely superfluous, but I just wanted to share in case anyone was initially flummoxed – please feel free to comment if you have a better way!

*UPDATE (in response to Paul’s comment):

Paul, I did use my work computer initially, which is a Microsoft one. However, when I went to Wordle.net just now, I was able to download the Java plug-in, restart Firefox, generate a wordcloud, and when I clicked “Print,” (a few times, because I had to keep “Allow”-ing the applet to connect with my printer [which isn't actually even hooked up to my laptop currently]), I was able to get a Print dialogue screen to appear:

[Screenshot: the Print dialogue]

So I could manually choose “Save as PDF”, which then led to this screen, where I was able to save my word cloud into PDF format:

[Screenshot: the Save as PDF dialogue]

I don’t know if it’s because I used Firefox or my OS is different (I’m running 10.7.5), but after downloading Java, I was able to obtain a PDF. However, now I need to double-check my print queue (for my non-existent printer), because this has happened before. Good luck! :/

How to Read a Million Books + (Kathryn & Charity)

Readings Wordle

Pictured above is a word cloud generated from this week’s readings using Wordle.

In-Class Exercise: With a partner, choose at least 2-3 terms from the word cloud above and discuss/define them in terms of our readings this week and/or experiences with the digital bibliography assignment. We will come back together after five minutes or so and share. Since each group will be sharing, you might consider mixing some lesser-mentioned terms (smaller-sized) with the buzzwords (larger-sized) to avoid repeats!

The Past and Future King Lear

I selected Shakespeare because I recently had to explain to a student why the $1.99 complete works he found for Kindle would not suffice for the work we will do in class.  I was giving the standard “you need to get the books for this class ASAP” speech during our first discussion section, and I said something to the effect of “It’s ok if you have or want to use a version other than Signet, but be sure that it’s a reputable edition with good notes.”  I was astonished when this student asked “What do you mean by ‘notes’?”  I tried to explain the necessity of footnotes glossing words that persist in current usage but had different meanings during the early modern period.  Later, the student emailed me a link to Amazon’s page for the edition he had, asking if I thought it was good enough.  He said “I don’t see anything obviously wrong with it other than that it has no notes.”  I saw something else wrong with it:  It did not list the name of the editor or contain any textual information or any clues as to the principles by which the text was prepared.  I used King Lear as an example.  The text as we have it comes from two sources (the 1608 Q1 and the 1623 F1) and so an editor must either print one of those or conflate the two.  When I pointed out that – in the absence of any editorial notes identifying what was on his electronic pages – he might end up reading something very different than what the rest of the class was seeing, he decided to buy the recommended print edition.  (I didn’t even get into variations between copies of those early printings.)

I’m sharing this story because it seems that this student is exactly the sort of reader whom makers of electronic texts hope to reach: he wanted something both inexpensive and reliable.  He just didn’t find the information to evaluate whether or not the Kindle product at which he was looking answered his needs.

So, King Lear.

I started with Project Gutenberg.  I thought it interesting that using the “search” feature yields a disorganized mess that uses “by popularity” as the default ordering system.  I tried again by browsing.  Under Shakespeare’s author heading, one may find four texts for King Lear.  The first is helpfully accompanied by “Scanner’s notes” explaining that it is a reproduction based on the first Folio and pointing out that different copies of F1 differ from each other.  The scanner (who sounds like an editor) provides his rationale for spelling alterations and gives his name and email addresses, encouraging readers to contact him if they find mistakes or disagree with his choices.  This seems sound, but I am bothered by the “Executive Director’s notes” that precede the “Scanner’s notes.”  That note reads as follows:

In addition to the notes below, and so you will *NOT* think all the spelling errors introduced by the printers of the time have been corrected, here are the first few lines of Hamlet, as they are presented herein:

Barnardo. Who’s there?
Fran. Nay answer me: Stand & vnfold
your selfe

Bar. Long liue the King

***

As I understand it, the printers often ran out of certain words or letters they had often packed into a “cliche”. . .this is the original meaning of the term cliche. . .and thus, being unwilling to unpack the cliches, and thus you will see some substitutions that look very odd. . .such as the exchanges of u for v, v for u, above. . .and you may wonder why they did it this way, presuming Shakespeare did not actually write the play in this manner. . . .

The answer is that they MAY have packed “liue” into a cliche at a time when they were out of “v”‘s. . .possibly having used “vv” in place of some “w”‘s, etc. This was a common practice of the day, as print was still quite expensive, and they didn’t want to spend more on a wider selection of characters than they had to.

You will find a lot of these kinds of “errors” in this text, as I have mentioned in other times and places, many “scholars” have an extreme attachment to these errors, and many have accorded them a very high place in the “canon” of Shakespeare. My father read an assortment of these made available to him by Cambridge University in England for several months in a glass room constructed for the purpose. To the best of my knowledge he read ALL those available . . .in great detail. . .and determined from the various changes, that Shakespeare most likely did not write in nearly as many of a variety of errors we credit him for, even though he was in/famous for signing his name with several different spellings.

So, please take this into account when reading the comments below made by our volunteer who prepared this file: you may see errors that are “not” errors. . . .

So. . .with this caveat. . .we have NOT changed the canon errors, here is the Project Gutenberg Etext of Shakespeare’s The Tragedie of King Lear.

Michael S. Hart
Project Gutenberg
Executive Director

This is a very peculiar appeal to authority.  While it is good that Hart wants readers to understand something of the wobbly nature of early modern spelling practices, he is more or less claiming that we should trust the contents of Project Gutenberg’s books because the Executive Director of the company had a well-read father and is telling us that he knows what he is talking about.  As a scholar, I find it offensive to be called a “scholar” by Hart.  I do not need scare quotes, thank you very much.  His overt meaning is that one does not have to be an expert to make sound judgments about texts (which is, after all, the ideal toward which open access strives), but his implication is that experts are actually mucking up the process of letting people understand the play by nostalgically clinging to insubstantial quibbles.  Needless to say, this edition is not annotated.  As an attempt to faithfully reproduce the text of F1, it could have uses to some people (such as scholars who already have some understanding of what to expect, and would not need to lean as heavily on the notes as would a beginner), but Hart’s note could steer people away from more helpful formats for their needs.

Next, I tried Google Books.  A search for King Lear produced “About 2,740,000 results.”  Following Duguid’s point that Google’s sorting is a powerful force on the reader who does not know what he or she is seeking, I decided to investigate the first hit.  It was the “Dover Thrift Study Edition,” which contained word-definition footnotes.  The copyright page (which was visible in the preview) explains that Dover has reproduced “the unabridged text of King Lear, as published in Volume XVII of The Caxton Edition of the Works of William Shakespeare, Caxton Publishing Company, London, n.d.” which is accompanied by a study guide made by test prep company R.E.A. and notes that “were prepared” for Dover by a nameless editor.  Yes, they did indeed cite “n.d.” instead of – for instance – checking WorldCat to determine that the Caxton edition was published in 1910.  I thought I’d try again, but this was a mistake.

The first free edition in the listings is from 1808 and is keyed to performance at Drury Lane and Covent Garden.  There are some scanning oddities, but those are minor in comparison to the real problem:


THIS IS THE HAPPY ENDING VERSION.

It was dreadful (though popular) enough when Nahum Tate decided to “improve” (scare quotes warranted) the play for the tastes of the Restoration stage, but for Google Books to offer this version without a prominent disclaimer is unconscionable.  A version of the play in which Lear and Cordelia live happily ever after could appropriately be called a work “based on” or “inspired by” Shakespeare’s play, but this IS NOT Shakespeare’s play.  Google is doing its readers a major disservice and doesn’t even have the decency to give a person’s name to whom we can complain about shoddy (lack of) cataloguing.

Searching HathiTrust, the first item I found is the New Variorum Shakespeare edited by Horace Howard Furness, Ph.D., LL.D., and published in 1880.  It appeared to be a fine, responsibly prepared edition that reflects the critical tendencies of the nineteenth century.  The notes consist largely of collections of famous critics’ opinions about the text, but – as they are well marked as such – they are unlikely to offer serious trouble to the beginning reader.  There were many fingers, slightly crooked pages, and corner shadows in the scanned images, but I found that the clarity of the text and flexibility of the online viewer made it more pleasant to read than the other scanned versions I found.  While the old-fashioned, ivory tower model of scholarly authority was clearly present in this edition, I think that, of the ebooks I’ve discussed here, this would be the most helpful to a beginner because at least it clearly represents itself for what it is.  The multiplicity of the notes (which may look a bit daunting at first – but the reader would find a similar landscape if she wandered into a bookstore and picked up the modern Arden Shakespeare) draws attention to the fact that editors make choices in presenting texts; the first-hit items from Project Gutenberg and Google Books did not.  HathiTrust’s site layout is more user-friendly than Project Gutenberg’s and its sorting principle is likelier to lead the reader to something useful on a first try than is Google Books’s.

After this research, if I had the conversation with which I began this post to do over again, I would still advise my student that it would be much easier to spring for the $4.95 Signet Classics Lear in physical form.

Download and Read: Augustine’s Confessions Online

For this exercise I wanted to choose something with a long and complex history that would be relevant to my interests, but which also had enough cultural significance to be of interest to a wider audience.  I settled on Augustine’s Confessions, his autobiographical masterpiece written at the end of the 4th century, in which he recounts his early life and conversion to Christianity.  As with any work written before the age of print, the Confessions came to life and first circulated in manuscript form (examples of which can also be found online, for example this digitized microfilm of Troyes, Bibl. mun., 473, and this digital facsimile of a Villanova MS). The work made its way into print at an early date, and was translated from the original Latin into English at least as early as the first half of the 17th century.  Perhaps the most accessible edition of the Latin text is that in the Patrologia Latina (32.659 ff.). The PL (the publication of which in the mid-19th century has to be one of the most successful acts of serial plagiarism ever perpetrated) retains its relevance today as a kind of least common denominator of editions. But as you might expect, over the years there have been numerous editions of the text, not to mention translations into various languages. Not knowing what other sorts of exercises might be in store in the coming weeks for the texts we choose to investigate, I decided to focus my efforts on the English versions of the Confessions.  Rather than attempting to compile a comprehensive survey of every version that might be out there on the web, I decided simply to approach each of the four search interfaces with some common terms (viz., author: Augustine; title: Confessions; and where possible limiting the results to English language hits available in full text), and see what each one returned.

Project Gutenberg

Project Gutenberg returns just four hits in response to a search on “Augustine Confessions”, including one hit each of the English and the Latin text, as well as two anthologies that contain excerpts from the text.  Gutenberg’s English text (available here) was first released in 2002, and is a version of the translation of Edward Pusey from the Library of the Fathers series, a series of translations of patristic texts published in the 19th century by members of the Oxford Movement of High-Church Anglicanism. This is an influential translation, and it will make repeated appearances below.  The text is available in six formats: HTML, ePub, Kindle, Plucker, QiOO, and plain text (UTF-8).  The HTML version is XHTML, and seems to have been carefully proofed. This version also contains some useful additional encoding such as paragraph numbering.  The text can be read online, or downloaded in any of the six available formats. It is in the public domain, and is here released under a Project Gutenberg license, which allows the end-user to use the text for just about any non-commercial purpose.  There isn’t any obvious way to mark up or otherwise correct the text and re-submit it back to the project.

Google Books

Searching Google Books using the terms described above returns 25 hits when limited to those available in full, ranging in date from 1770 to 1912.  Closer examination reveals that many of these 25 are in fact duplicates, and others are irrelevant volumes of a multi-volume series (The Nicene and Post-Nicene Fathers [NPNF]), only one volume of which contains the text under investigation.  Among the ‘good’ hits are a copy from the Loeb Classical Library; a reprint of Pusey’s translation in a series called the Harvard Classical Texts; and a translation by Charles Pilkington in the aforementioned NPNF series.  The volume edited by Temple Scott (1900), which was scanned from a Harvard copy, is in fact a re-issue of Pusey’s translation, while the translation by W. H. Hutchings, scanned from a copy in the Bodleian, purports to be a new translation, albeit of only ten books rather than the full text’s thirteen (the final three books of the Confessions, more philosophical than autobiographical, are sometimes left out). There are a number of other versions available on Google besides these.  The volumes on Google are available in a variety of formats both directly on Google books and through the Google eBook feature, including several formats designed for e-readers and for online reading.  While the quality of those that rely on page images is generally good, the OCR versions remain quite error-laden.  For example, this passage chosen more or less at random: “$e tnbefff$s against tTie SOone of enucatinp; f&e BUT woe to thee, thou torrent of human custom!” (p. 23).  If one examines the page-images it is immediately apparent why the text is so corrupt in the first half of the passage — it is an epigram printed in a gothic font. 
But because of anomalies like this, the poor quality of the OCR would make it difficult and dangerous to use Google’s text for any serious purpose (over and above the fact that there is no obvious easy way to download the entire book in plain text format). There are some nice features of the Google reader such as the ability to create notes and mark up the text with highlighting in different colors. Google’s terms of service would seem to allow download and reuse of their content in a variety of forms.
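One rough way to gauge how corrupt a plain-text OCR file is, before deciding whether it can be used, is to flag tokens that could hardly be English words. The Python sketch below is only an illustration of that idea and not a feature of any of the sites discussed here; the function name and the two heuristics (a symbol or digit inside a token, or a capital letter directly after a lowercase one) are my own assumptions, and they only suit unaccented English text.

```python
import re

def suspicious_tokens(line):
    """Return the tokens of a line that look like OCR damage.

    Heuristics (assumptions, suited to plain unaccented English only):
    a character other than a letter, apostrophe or hyphen inside the
    token, or a capital letter directly after a lowercase letter.
    """
    flagged = []
    for token in line.split():
        # Ignore ordinary punctuation at the edges of the token.
        core = token.strip(".,;:!?\"'()")
        if not core:
            continue
        has_odd_char = re.search(r"[^A-Za-z'-]", core) is not None
        has_inner_cap = re.search(r"[a-z][A-Z]", core) is not None
        if has_odd_char or has_inner_cap:
            flagged.append(token)
    return flagged

# The passage quoted above from the Google Books OCR text.
passage = ("$e tnbefff$s against tTie SOone of enucatinp; f&e "
           "BUT woe to thee, thou torrent of human custom!")

print(suspicious_tokens(passage))
```

On the quoted passage this flags “$e”, “tnbefff$s”, “tTie” and “f&e” but lets “SOone” and “enucatinp” through, so a check of this kind can only give a lower bound on the damage; catching the rest would take a comparison against a dictionary or against the page images.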

Internet Archive

Perhaps the most interesting and unique offering at the Internet Archive is the very first one among the initial hits: a complete audio book from Librivox.  The experience of listening to the Confessions read aloud probably more closely approximates how the text was experienced through much of its early history, when even private reading was often done aloud, than many of the printed versions.  Many of the versions available on IA are copies of books digitized by Google. Pusey’s and Pilkington’s translations are here, but also a version of the text translated into Hebrew that I found on no other site. The IA’s versions are available for free download in a variety of formats, including formats for various e-readers, as a PDF, and as a single plain text file.  Unfortunately, the plain text version is full of OCR errors (not least the common failure to segregate headers, footnotes, and main text), and would require significant clean up to be useful for any serious purpose.  Many of the IA books are listed as not in copyright or with no known copyright restrictions, and can be downloaded freely in various formats.  In addition, descriptive information about the scanned books can be contributed by users through openlibrary.org, and problems can be reported to IA through a link on their site.  IA’s online reader is perhaps the best interface of any of the available online readers.

Hathi Trust

Finally, a search of the Hathi Trust using the same terms described above returns 19 hits, including many of the same translations available via Google (indeed, the watermarks reveal that many of these are Google’s scans). As one might expect, the metadata for Hathi Trust books are generally fuller and more precise than Google’s. Another useful feature is the ability to download citation information. Plain text is available, but only on a page-by-page basis, and even the PDF download of a full book in the public domain requires authentication.  According to the access and use policy, users are asked not to use the Google-digitized books for commercial purposes or to re-host them, but the books are otherwise free to use for non-commercial, educational purposes.

 

In conclusion, I would note that the plain-text version of Pusey’s translation available through Project Gutenberg is probably the most useful of all the free online versions of the text, simply because of its flexibility.  None of the foregoing discussion takes into account the accuracy of either the translations or of the editions upon which they were based.

The Scarlet Ebook

I selected Hawthorne’s The Scarlet Letter for three not so exciting reasons. 1. I have the book on hand. 2. Nearly all of the books I am interested in or enjoy come after the public domain works. 3. I happen to enjoy this one.

With that out of the way, The Scarlet Letter is available on all four resources: Project Gutenberg, the Internet Archive, HathiTrust, and Google Books. Let us go down the list and see what we have here.

Project Gutenberg’s text is available in a variety of formats: HTML, EPUB (no images), Kindle, Plucker, QiOO Mobile, and Plain Text UTF8. It isn’t clear what edition of the text the HTML version is based on, only that this version of the ebook was first released in 1992, produced by Dartmouth College, and was updated in 2005. The HTML version contains all of the materials you might find in a print version of the book, such as biographical information, a list of works, and an editor’s note, but as this is HTML, there was no effort here either to make the text itself resemble a printed book or to take advantage of some of the possibilities of the ebook format.

A few of the other formats seem unfamiliar to me, and others require programs or e-readers to view. Alas, being a non-Kindle user, I moved on to the online reader, which divides the novel into pages, serving as an alternative to scrolling through the text. But the online reader does little else to mediate or alter the text.

The Internet Archive provides what appear to be three versions of the text, but on closer inspection they are all identical copies of the HTML format of The Scarlet Letter taken directly from Project Gutenberg. The site provides a space for reviews (presumably for opinions on the quality of the e-copy or perhaps even the novel itself). It is also interesting to know that the novel has been downloaded 1,848 times.

Typing The Scarlet Letter into the search bar of HathiTrust yielded 931,602 results. Whoa. Could I narrow this down? I clicked the option for “full text only,” and with my results narrowed, I happily clicked the search button only to be bombarded by 480,863 results. Hm. What if I clicked “Nathaniel Hawthorne” as the author? That brought me down to 720 results. Perhaps my search was still off, but I decided that this was the best I was going to get.

I apologize for not having mustered the time or the patience to search through 720 results, although I suspected that the correct items would be found on the first page. First, a word on the functions of the site: HathiTrust provides a few limited options for viewing the text, but these only amount to zooming and flipping pages (or scrolling). The search function is quite nice and works well, although any Word or PDF file has this capability.

Going right down the list, the first selection brought me to a scanned copy of the 1889 Boston Houghton, Mifflin and Company version of the text, featuring black splotches and lines, and even a Due Date card in the back. In all other respects, however, this appeared to be a fairly well-done copy, and I would rather download a PDF of something that resembles a book rather than an HTML version that appears like a poorly designed web page.

How did the other copies fare? Well, it turns out many of them were duplicates, but one version caught my eye: The Scarlet Letter “with illustrations of the author, his environment and the setting of the book; together with a foreword and descriptive captions by Basil Davenport,” published in 1948. And the illustration? Well, it scanned quite well, I suppose. Hawthorne does sport his mustache with pride.

 

Finding most of the copies on HathiTrust in respectable shape, I moved on to the last resource: Google Books. Having already sorted through Project Gutenberg’s wide variety of formats, the Internet Archive’s borrowing of the most simplistic format (HTML) from Project Gutenberg, and HathiTrust’s large quantity of nearly identical copies (available for download as PDFs), I was ready for whatever Google Books had in store.

 

Typing “The Scarlet Letter by Nathaniel Hawthorne” of course yielded many, many results, but I could see right away that only one was an actual copy of the text. Here I found a scanned copy of the text from the 1898 Doubleday and McClure Co. edition. And yes, this one also features a stunning illustration of Nathaniel Hawthorne and his mustache. Google Books gives you the option to download the book in Plain Text, PDF, and EPUB formats. The quality of the copy itself is quite good, from what I can tell. But more importantly, Google put some effort into supporting some unique features. In addition to the search function, clicking a chapter title in the table of contents will bring you to the correct page. This is a long way from a hypertext version of the novel, but Google certainly took a step in the right direction.

 

Ultimately, I was not overly impressed with any version of the text, although I did not experience any of the extreme formatting issues Duguid encountered while researching Tristram Shandy. Moreover, as all copies are free to use for whatever purposes you may desire, I suppose I shouldn’t be one to complain. Google Books provided the most impressive copy of the text, even though I would still prefer my own hard copy of the novel to a scanned e-copy with a search function. I consider my $4 well spent. I can imagine a more robust hypertext version of The Scarlet Letter, but perhaps that is a blog post for another day.

Many versions of many stories in many languages (and many problems)

I decided to work with the great nineteenth-century Brazilian author Machado de Assis (author of Brás Cubas), and to analyze the results more carefully than I do when researching for my own studies. It was not easy to find a great variety of titles by this author, so I had to choose from the small group of titles that had full-text versions available (most of the others were restricted for copyright reasons). In Project Gutenberg, I found only two of the books that Machado wrote, so I decided to work with Varias historias (Many stories), a collection of sixteen short stories published in 1896, in Google Books, HathiTrust and the Internet Archive. I did not know I was going to find so many problems!

Google Books

The first option has only a snippet view, and it is actually a translation into Spanish. So I went to the second option to read it in full, and I saw that it is a 1903 edition from the Library of the University of Texas at Austin. The text was first published in 1896, so this edition comes just seven years after that. Google Books only offers the name of the publishing house, H. Garnier; the year, 1903; and the number of pages, 282.  The formats offered are plain text, PDF, and EPUB. You can download the text, and in the online version the table of contents has links to the different parts of the book. It is also possible to read it in Google Play, a kind of digital cloud for storing books, music, etc. So you can make your own Google Books library.

As far as restrictions on the digital contents are concerned, users are not allowed to sell the digital content or remove the watermark or other sign that says it belongs to Google. These are the same restrictions that HathiTrust and Internet Archive have.

The scanned version had all the pages. But I realized that the print copy itself had a lot of problems!  In one instance, the digits of a page number were transposed (175 instead of 157), and there was a line mistakenly inserted in a dialogue. But, fortunately, one of the readers of this book in its printed form corrected the mistake, so we can now “read it the way it should be”.

[Image: page from the Google Books copy]

The copy was full of marks that made the reading really annoying. In addition to this, another reader, who seems to be learning Portuguese, tried to “help” by translating some words he did not know!

[Image: second page from the Google Books copy]

My question is: What is the advantage of having access to an edition like this? Why digitize such a poorly printed and preserved copy? And why is it the first option, when Google has digitized many other versions of this book?

 

Internet Archive

The copy I was looking for appears in the entry as written in Spanish! The site says that the publisher is Casa de las Américas; that its year of publication is 1904 (which is the first problem, because Casa de las Américas was created after the Cuban Revolution); that its language is Spanish; and that it belongs to the collection of an “unknown library.” But when I “opened” the book, the first thing that appeared was the bookplate of Stanford University. It is a book in the Portuguese language, digitized by Google. When I searched the catalog of Stanford University, the book appeared there, of course.

So, why did they say they do not know the origin? Why is the information so poor? The record mixes correct information about this book (the publication year) with information about another book: its translation into Spanish, published more than sixty years later by Casa de las Américas. And as if two entries were not enough, when I began reading the book’s inside cover I found a third bibliographical entry on a post-it!

This copy was published by the same publishing house just one year later than the copy I found in Google Books: the edition was corrected, and (fortunately) the copy was clean! The formats offered were PDF, EPUB, Kindle, DjVu, and Metadata. But if you want to read it online, there are many problems with some pages. They look like this:

HATHITRUST problems

It’s frustrating! This aside, the catalog record is incorrect. And that annoys me a lot, because I see once again the same mistake: thinking that Portuguese and Spanish are the same. I found that there is an “editable web page” through “Open Library.” So I created an account to see what options I had to correct the mistake. It said that the page had four revisions, but none of them changed the bibliographical entry. Now I had the chance to add some information about the book, and CHANGE the information given. So I changed the information about the publishing house, date, language… I was feeling much better after that! BUT I could not change the Language edition… it is like a curse… Spanish is NOT Portuguese… so I just added a comment warning that it was the original Portuguese edition, instead of the Spanish one that it announced.

HathiTrust

 

The copy I found here belongs to the New York Public Library, and it was digitized by Google (even though it is not possible to read in full in Google books).

The publishing house is the same as the others, H. Garnier, but they do not know the date of publication. It should be after 1903, because it is a corrected version. It is strange because the data does not appear where it appeared in the other two versions. There was only one format, PDF, but it is possible to read it online as well. But this copy is almost illegible!

IA 4

Many stories are missing from one to three pages, a whole story is missing, and there is one page that was attacked by a cannibal or something:

HathiTrust4

HathiTrust has a feedback form to report problems. But if the problems come from books digitized by Google, they only say that “Google is continually improving the quality of images and OCR it delivers to HathiTrust partners.” So, the real answer is: wait.

It is possible to read the text in a Classic view, Scroll, Flip, Thumbnails and Plain text, which I found interesting and useful – but not so useful if the copy lacks pages and sometimes it is almost illegible!

You can download the PDF version only if you are part of the partner institutions (American universities, basically, and just one from Spain and France). You can create a collection (that can be private or public) and add the book.

 

Yes, digitization has a long way to go, but there are things that can be done just by paying more attention to the information that is posted. The quality of the scan is sometimes very poor, if not that of the original!

The Idea of an E-Book

I’m sorry, I just can’t come up with the great posting titles the rest of you do.

The first book I looked for was Lux Mundi (1890), a collection of Anglo-Catholic theological essays edited by Charles Gore. My reason for doing so was practical, since Travis Brown and I are using scanned images from this book, fed through OCR tools like Tesseract and OCRopus, for the ActiveOCR project at MITH. I won’t say the book was chosen at random,  but close to it. Travis wanted something from the late 19th century, and suggested that I search for everything in the Hathi Trust collection published in 1890.

The fact that the only other collection it appears in, however, is Google Books rules it out for the purpose of this assignment.

Deciding to stick with the theme of 19th century divines, I looked for John Henry Newman’s The Idea of a University, and found it on Project Gutenberg, the Internet Archive, Hathi Trust and Google Books.

As several others have noted, Project Gutenberg provides the most formats and the least provenance information. The book is available in HTML, EPUB, Kindle, PDF, Plucker, QiOO Mobile, Plain Text UTF-8 and TEI. All of these are in addition, of course, to the Online Reader. Some of these formats seem a bit obscure to me. I had to look up Plucker (apparently an e-book reader for PalmOS devices), and QiOO (I’m guessing a reader for Android phones, since it’s Java-based, although they didn’t use the name Android). I fired up the oXygen editor to take a look at the TEI file, and it appears to be TEI (P5?) Lite with a Project Gutenberg-specific modified DTD. Although there are credits for the people responsible for preparing the files for Project Gutenberg, there is no information about which printed text(s) provide the basis for the electronic text.

I got 26 results when I searched the Internet Archive for The Idea of a University by Newman. One of these results was for the Project Gutenberg record, which offers the book in several formats not immediately visible on Project Gutenberg’s own page, including DAISY Digital Talking Book and DjVu (pronounced déjà vu, this is a format for scanned documents that its promoters, although I suspect few others, consider a competitor to image PDFs). There were also at least three (one may have been a duplicate)  results from Google Books (digitized from the University of California, Harvard, and New York Public Libraries).

I chose to look in detail at one (26 was way too many) that was contributed by “Kelly – University of Toronto”. While my first reaction was that “Kelly” might be an individual, a Google search indicated that it is a reference to the John M. Kelly Library at the University of St. Michael’s College, a Catholic university that has an institutional relationship with the public University of Toronto. This version was available in Full Text, PDF, EPUB, Kindle, Daisy and DjVu formats. The document is in the Public Domain. There is no apparent way for users to report or correct errors. This is probably as good a place as any to note that I find the default online reader, which navigates through the text by “turning” pages, incredibly annoying. This is as misguided as the attempts of late 15th century printers to recreate the look of manuscripts in printed texts.

(This is as far as I’m going to be able to get before class, but I will update the post later with the information on the Hathi Trust and Google Books sites.)

 

Pride for Google Books, Prejudice for HATHITrust

As a Kindle user, and more importantly, as someone who plans to work in digital publishing, I found this exercise very informative.  I initially attempted to find my favorite book, A Prayer for Owen Meany by John Irving, but it was only available on Google Books.  So, onto a favorite I knew would be a more viable option: Jane Austen’s Pride and Prejudice.

I am fairly familiar with free domain books, as I have downloaded many from Amazon.com for classes.  In fact, I have Pride and Prejudice via free download on my Kindle.  I was not, however, familiar with the answers to any of the questions Professor Kirschenbaum asked us to investigate.

Pride and Prejudice was available on all four platforms: Project Gutenberg, the Internet Archive, the HATHITrust, and Google Books.  With so many options to choose from, I dove into Google Books to see what I could find about the provenance of the book.  Where to begin?  There are seven versions available on page 1 of the initial search alone!  A sampling includes editions from Harvard dating from 1962 but copyrighted in 1918, Lenox Library with an 1853 copyright, and even a version from an imprint located in our neighbor Rockville, MD from 2008.  Some copies can only be read on the Google Books website, but others have PDF and EPUB versions available.  From experience, I know PDFs can easily be transferred to an e-reader.  Thus, with the PDF, the reader now has four options for how to read: on the computer, printed out, on a smart phone, or on an e-reader.

The graphics and formatting were retained in all of the versions I researched.  Additionally, all of the versions I opened had a search feature.  Only a few had the option for reading the text in a more user-friendly way.  Some had options of reading one page at a time, side-by-side as you would a hard copy, and via thumbnails.  You can even save the book to your own online library.  As far as highlighting, Google Books had at least one version where you could create clippings and share them via social media.  For additional social media options, you can write your own review.  I did not, however, find a place where you can write about errors, nor did it seem there were any restrictions to usage, despite a Terms of Service.  Overall, Google Books was very user-friendly and provided a variety of ways to personalize your reading experience.

Next it was on to Project Gutenberg.  I was overwhelmed from the get-go when my search returned 29,141 downloads.  Further investigation led me to realize this was how many times the book had been downloaded for free.  From here I was given a variety of ways I could view and download, from HTML to QiOO Mobile, something I’d never even heard of before.  I clicked on the very first HTML link.  There, I was greeted with an interesting message:

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org

This intrigued me because, in my former life as a TV producer, there were restrictions on everything from music videos, to still images, to movie clips.  Everything came with a price and specifications as to how it could be used.  But what really got me was the release date of August 26, 2008 and a note that the version had last been modified November 5, 2012.  Surely Pride and Prejudice has not changed in the 200 years since it was first published.  But then, at the very bottom of the site, the full terms of license are listed.  Again, there were two interesting passages:

Updated editions will replace the previous one--the old editions will be renamed.

How are these editions being updated?  Why do they need to be updated?  What is being modified?  My questions were endless.  Then:

You may use this eBook for nearly any purpose such as creation of derivative works, 
reports, performances and research.  They may be modified and printed and given 
away--you may do practically ANYTHING with public domain eBooks.

This clause shocked me!  If you can do anything with public domain books, can we trust that we are getting the book as it was intended?  Are we getting the whole book, or some annotated version?  Because it can be modified in any given way, it seems as if we are given license to recreate the book to our liking.  Forget Elizabeth ending up with Darcy, let’s just change it around to have her wind up with the abhorrent Wickham.

The one option I found especially interesting on Project Gutenberg was the availability of a QR code, so that the user can scan it with their smart phone and automatically download a version to their mobile device.  PG also offers a link to “mirror sites,” which are mostly international universities offering the same version of Pride and Prejudice for download from their university library.  I found this to be disappointing because I was hoping it would offer me a version of the book translated into other languages, but it did not.  While it first appeared that PG was going to offer many versions of the book, all formats led to the exact same version, which is much different than the variety offered on Google Books, but also gave me a sense of faith that perhaps Pride and Prejudice wasn’t being mangled by users doing anything they want with the text.

The Internet Archive initially appeared to just cull the books that had been digitized elsewhere.  In fact, various versions specified they came from Google Books (the same Harvard version mentioned earlier) and Project Gutenberg.  Despite the versions being the same, the Internet Archive had a much more user-friendly format.  If you wanted to read the book online, it immediately led you to a side-by-side page layout that animated the page turning when you flipped pages.  It also allowed the graphics to be seen more clearly.  One of the most unusual features was that the version was available as an audio book.  However, the audio was very computerized and it attempted to read aloud quotation marks and other punctuation.  While none of these features change the text, somehow it made it a little more enjoyable to know of the bells and whistles available.

The Internet Archive also offered your basic search functions, download options, and a place to write user reviews.  Strangely, the terms of use have not been modified since 2001, which is surprising given how much has changed in the digital humanities in the past 12 years.  It did, however, give an email address to contact someone about copyright information.  One of the versions even had a link to an editable page, where you can edit the book.  Thus far, only eight users had done so since 2008.  I guess people aren’t as inclined to mess with classics, until you have the bright idea to write Pride and Prejudice and Zombies, as Seth Grahame-Smith did.

Finally, it was on to HATHITrust.  As soon as I clicked on the page I knew it wasn’t nearly as user-friendly and complete as any other online library.  The initial results only returned options that would search for the book in hard copy at nearby libraries.  It was the eighth result that was actually a full-view online version.  I clicked, only to find it was the trusty ol’ Google Books version yet again.  It too had the side-by-side page flip option, but the words were so small you couldn’t read them and the zoom feature did not work.  The only way to read it online was in the traditional view.  However, it was available for PDF download just like the others.

HATHITrust also had what I’ve now come to realize are the basic features of an online library: document search, a personal online library, and a way to share links from the book on a social networking site.  It did have a more prominent feedback link for users to share how they found the quality of the text.  One reportable problem is missing parts—perhaps they got an editable version.

Overall, Google Books and the Internet Archive had the best sites, in my opinion.  Either way, I think it’s great there are so many classic books available to readers so easily.  No matter which site was chosen, the reader was going to get a legitimate copy of Pride and Prejudice, one of the most beloved books of all time.  As for me, I’ll stick to my Kindle for reading digital books for now.  However, a hard copy version will always be my first love.

 

War of the EBooks

I tried to be a scifi nerd and use Neuromancer for this exercise, but I had to settle for War of the Worlds. Doesn’t make for the best catchy blog post title but, what are you gonna do.

Project Gutenberg offers H.G. Wells’ The War of the Worlds in HTML, EPUB, Kindle, Plucker, QiOO Mobile, Plain Text UTF-8, and several kinds of zip files. It can also be read online as an EBook, although it is immensely frustrating to read that way as it is formatted into chunky paragraphs requiring links to the previous or following pages. According to Project Gutenberg it is EBook #36, released in 1992 and updated in 2008. The site allows the user to create bookmarks on the “pages”. Unlike the other sites, it notes that the user “can help us produce ebooks by proof-reading just one page a day” (http://www.gutenberg.org/catalog/world/readfile?fk_files=1697601&pageno=2).

HATHITrust offers downloadable PDFs of single pages without a log-in and a full downloadable PDF for members, as well as an online view of the Bernhard Tauchnitz Leipzig edition. HATHITrust offers two dates: “1898 [i.e. 1929?]”. The online version is originally from the University of Virginia, digitized by Google Books. It allows you to search the book or jump to different sections, to render it in plain text, to share a link to the book or to a single page, to view the book in “Flip” or “Scroll” mode or with thumbnails of the pages, and to create new collections of books with a member log-in. The site notes that the book is public domain in the United States, although, “Google requests that the images and OCR not be re-hosted, redistributed, or used commercially” (http://www.hathitrust.org/access_use#pd-us-google).

Google Books offers EPUB and PDF downloads with both “Flowing Text” and “Scanned Pages.” It can be read in plain text and the user can “Advance Search” the book for specific phrases. Google offered the widest variety of editions, from a limited view of a 2012 edition to a full view of an 1898 illustrated edition published by Harper & Brothers in New York. The latter came from the Pennsylvania State University Library, and has the entire table of contents in hyperlinks, which was the first instance of this I noticed in browsing several editions and which makes navigation quite easy. Unfortunately the book does not offer any information about the illustrator, but it contains a frontispiece of H.G. Wells and a number of beautifully drawn and rendered bluish black and white images that scanned crisply.

The frontispiece from The War of the Worlds. Unfortunately I could not find any information about the illustrator.

At the end of this copy is a library binder’s mark from August 3, 1967, in Philipsburg, Pennsylvania. Also at the end of the book is the mostly blank “Date Due” card, containing crossed-out dates from 1993. Lastly, and most fun for me, there are no fewer than five scanned images of the book’s maroon back cover and bar code, two of which have the archivist’s bright pink latex glove in the corner and two of which were captured when the book was in the process of being opened and flipped over, with a black and white checkered pattern on the edge from what I am assuming is the inside cover of the book.

The back cover of H.G. Wells’s The War of the Worlds, as seen in Google Books.

A pink Martian’s…errrr, archivist’s thumb on the back cover of War of the Worlds.

Google allows the user to search the book and write a review, and offers perhaps the most flexible interface with multiple page views of the book, the ability to “cut” or highlight sections of pages, and a zoom tool. The site restrictions and terms of service state that this “copy and paste” function needs to be “used within the prescribed limits and only for personal non-commercial purposes” (http://books.google.com/intl/en/googlebooks/tos.html). Google watermarks also may not be removed from the digital content.

I found Google Books to be the most versatile interface for viewing and downloading this book. While the Kindle edition I downloaded from Project Gutenberg was readable and there didn’t seem to be huge issues with it in terms of formatting, I found myself annoyed by the fact that new chapters don’t start on new pages. On all of these sites, it was hard to find information about access to these books for people with disabilities.