How to Read a Million Books + (Kathryn & Charity)

Readings Wordle

Pictured above is a word cloud generated from this week’s readings using Wordle.

In-Class Exercise: With a partner, choose at least 2-3 terms from the word cloud above and discuss/define them in terms of our readings this week and/or experiences with the digital bibliography assignment. We will come back together after five minutes or so and share. Since each group will be sharing, you might consider mixing some lesser-mentioned terms (smaller-sized) with the buzzwords (larger-sized) to avoid repeats!

The Past and Future King Lear

I selected Shakespeare because I recently had to explain to a student why the $1.99 complete works he found for Kindle would not suffice for the work we will do in class.  I was giving the standard “you need to get the books for this class ASAP” speech during our first discussion section, and I said something to the effect of “It’s ok if you have or want to use a version other than Signet, but be sure that it’s a reputable edition with good notes.”  I was astonished when this student asked “What do you mean by ‘notes’?”  I tried to explain the necessity of footnotes glossing words that persist in current usage but had different meanings during the early modern period.  Later, the student emailed me a link to Amazon’s page for the edition he had, asking if I thought it was good enough.  He said “I don’t see anything obviously wrong with it other than that it has no notes.”  I saw something else wrong with it:  It did not list the name of the editor or contain any textual information or any clues as to the principles by which the text was prepared.  I used King Lear as an example.  The text as we have it comes from two sources (the 1608 Q1 and the 1623 F1) and so an editor must either print one of those or conflate the two.  When I pointed out that – in the absence of any editorial notes identifying what was on his electronic pages – he might end up reading something very different than what the rest of the class was seeing, he decided to buy the recommended print edition.  (I didn’t even get into variations between copies of those early printings.)

I’m sharing this story because it seems that this student is exactly the sort of reader whom makers of electronic texts hope to reach: he wanted something both inexpensive and reliable.  He just didn’t find the information to evaluate whether or not the Kindle product at which he was looking answered his needs.

So, King Lear.

I started with Project Gutenberg.  I thought it interesting that using the “search” feature yields a disorganized mess that uses “by popularity” as the default ordering system.  I tried again by browsing.  Under Shakespeare’s author heading, one may find four texts for King Lear.  The first is helpfully accompanied by “Scanner’s notes” explaining that it is a reproduction based on the first Folio and pointing out that different copies of F1 differ from each other.  The scanner (who sounds like an editor) provides his rationale for spelling alterations and gives his name and email addresses, encouraging readers to contact him if they find mistakes or disagree with his choices.  This seems sound, but I am bothered by the “Executive Director’s notes” that precede the “Scanner’s notes.”  That note reads as follows:

In addition to the notes below, and so you will *NOT* think all the spelling errors introduced by the printers of the time have been corrected, here are the first few lines of Hamlet, as they are presented herein:

Barnardo. Who’s there?
Fran. Nay answer me: Stand & vnfold
your selfe

Bar. Long liue the King

***

As I understand it, the printers often ran out of certain words or letters they had often packed into a “cliche”. . .this is the original meaning of the term cliche. . .and thus, being unwilling to unpack the cliches, and thus you will see some substitutions that look very odd. . .such as the exchanges of u for v, v for u, above. . .and you may wonder why they did it this way, presuming Shakespeare did not actually write the play in this manner. . . .

The answer is that they MAY have packed “liue” into a cliche at a time when they were out of “v”‘s. . .possibly having used “vv” in place of some “w”‘s, etc. This was a common practice of the day, as print was still quite expensive, and they didn’t want to spend more on a wider selection of characters than they had to.

You will find a lot of these kinds of “errors” in this text, as I have mentioned in other times and places, many “scholars” have an extreme attachment to these errors, and many have accorded them a very high place in the “canon” of Shakespeare. My father read an assortment of these made available to him by Cambridge University in England for several months in a glass room constructed for the purpose. To the best of my knowledge he read ALL those available . . .in great detail. . .and determined from the various changes, that Shakespeare most likely did not write in nearly as many of a variety of errors we credit him for, even though he was in/famous for signing his name with several different spellings.

So, please take this into account when reading the comments below made by our volunteer who prepared this file: you may see errors that are “not” errors. . . .

So. . .with this caveat. . .we have NOT changed the canon errors, here is the Project Gutenberg Etext of Shakespeare’s The Tragedie of King Lear.

Michael S. Hart
Project Gutenberg
Executive Director

This is a very peculiar appeal to authority.  While it is good that Hart wants readers to understand something of the wobbly nature of early modern spelling practices, he is more or less claiming that we should trust the contents of Project Gutenberg’s books because the Executive Director of the company had a well-read father and is telling us that he knows what he is talking about.  As a scholar, I find it offensive to be called a “scholar” by Hart.  I do not need scare quotes, thank you very much.  His overt meaning is that one does not have to be an expert to make sound judgments about texts (which is, after all, the ideal toward which open access strives), but his implication is that experts are actually mucking up the process of letting people understand the play by nostalgically clinging to insubstantial quibbles.  Needless to say, this edition is not annotated.  As an attempt to faithfully reproduce the text of F1, it could have uses to some people (such as scholars who already have some understanding of what to expect, and would not need to lean as heavily on the notes as would a beginner), but Hart’s note could steer people away from more helpful formats for their needs.

Next, I tried Google Books.  A search for King Lear produced “ About 2,740,000 results.”  Following Duguid’s point that Google’s sorting is a powerful force on the reader who does not know what he or she is seeking, I decided to investigate the first hit.  It was the “Dover Thrift Study Edition” which contained word-definition footnotes.  The copyright page (which was visible in the preview) explains that Dover has reproduced “the unabridged text of King Lear, as published in Volume XVII of The Caxton Edition of the Works of William Shakespeare, Caxton Publishing Company, London, n.d.” which is accompanied by a study guide made by test prep company R.E.A. and notes that “were prepared” for Dover by a nameless editor.  Yes, they did indeed cite “n.d.” instead of – for instance – checking WorldCat to determine that the Caxton edition was published in 1910.  I thought I’d try again, but this was a mistake.

The first free edition in the listings is from 1808 and is keyed to performance at Drury Lane and Covent Garden.  There are some scanning oddities, but those are minor in comparison to the real problem:


THIS IS THE HAPPY ENDING VERSION.

It was dreadful (though popular) enough when Nahum Tate decided to “improve” (scare quotes warranted) the play for the tastes of the Restoration stage, but for Google Books to offer this version without a prominent disclaimer is unconscionable.  A version of the play in which Lear and Cordelia live happily ever after could appropriately be called a work “based on” or “inspired by” Shakespeare’s play, but this IS NOT Shakespeare’s play.  Google is doing its readers a major disservice and doesn’t even have the decency to give a person’s name to whom we can complain about shoddy (lack of) cataloguing.

Searching HathiTrust, the first item I found was the New Variorum Shakespeare edited by Horace Howard Furness, Ph.D., LL.D., and published in 1880.  It appeared to be a fine, responsibly-prepared edition that reflects the critical tendencies of the nineteenth century.  The notes consist largely of collections of famous critics’ opinions about the text, but – as they are well marked as such – they are unlikely to offer serious trouble to the beginning reader.  There were many fingers, slightly crooked pages, and corner shadows in the scanned images, but I found that the clarity of the text and flexibility of the online viewer made it more pleasant to read than the other scanned versions I found.  While the old-fashioned, ivory tower model of scholarly authority was clearly present in this edition, I think that, of the ebooks I’ve discussed here, this would be the most helpful to a beginner because at least it clearly represents itself for what it is.  The multiplicity of the notes (which may look a bit daunting at first – but the reader would find a similar landscape if she wandered into a bookstore and picked up the modern Arden Shakespeare) draws attention to the fact that editors make choices in presenting texts; the first-hit items from Project Gutenberg and Google Books did not.  HathiTrust’s site layout is more user-friendly than Project Gutenberg’s and its sorting principle is likelier to lead the reader to something useful on a first try than is that of Google Books.

After this research, if I had the conversation with which I began this post to do over again, I would still advise my student that it would be much easier to spring for the $4.95 Signet Classics Lear in physical form.

Download and Read: Augustine’s Confessions Online

For this exercise I wanted to choose something with a long and complex history that would be relevant to my interests, but which also had enough cultural significance to be of interest to a wider audience.  I settled on Augustine’s Confessions, his autobiographical masterpiece written at the end of the 4th century, in which he recounts his early life and conversion to Christianity.  As with any work written before the age of print, the Confessions came to life and first circulated in manuscript form (examples of which can also be found online, for example this digitized microfilm of Troyes, Bibl. mun., 473, and this digital facsimile of a Villanova MS). The work made its way into print at an early date, and was translated from the original Latin into English at least as early as the first half of the 17th century.  Perhaps the most accessible edition of the Latin text is that in the Patrologia Latina (32.659 ff.). The PL (the publication of which in the mid-19th century has to be one of the most successful acts of serial plagiarism ever perpetrated) retains its relevance today as a kind of least common denominator of editions. But as you might expect, over the years there have been numerous editions of the text, not to mention translations into various languages. Not knowing what other sorts of exercises might be in store in the coming weeks for the texts we choose to investigate, I decided to focus my efforts on the English versions of the Confessions.  Rather than attempting to compile a comprehensive survey of all the versions that might be out there on the web, I decided simply to approach each of the four search interfaces with some common terms (viz., author: Augustine; title: Confessions; and where possible limiting the results to English language hits available in full text), and see what each one returned.

Project Gutenberg

Project Gutenberg returns just four hits in response to a search on “Augustine Confessions”, including one hit each of the English and the Latin text, as well as two anthologies that contain excerpts from the text.  Gutenberg’s English text (available here) was first released in 2002, and is a version of the translation of Edward Pusey from the Library of the Fathers series, a series of translations of patristic texts published in the 19th century by members of the Oxford Movement of High-Church Anglicanism. This is an influential translation, and it will make repeated appearances below.  The text is available in six formats: HTML, ePub, Kindle, Plucker, QiOO, and plain text (UTF-8).  The HTML version is XHTML, and seems to have been carefully proofed. This version also contains some useful additional encoding such as paragraph numbering.  The text can be read online, or downloaded in any of the six available formats. It is in the public domain, and is here released under a Project Gutenberg license, which allows the end-user to use the text for just about any non-commercial purpose.  There isn’t any obvious way to mark up or otherwise correct the text and re-submit it back to the project.

Google Books

Searching Google Books using the terms described above returns 25 hits when limited to those available in full, ranging in date from 1770 to 1912.  Closer examination reveals that many of these 25 are in fact duplicates, and others are irrelevant volumes of a multi-volume series (The Nicene and Post-Nicene Fathers [NPNF]), only one volume of which contains the text under investigation.  Among the ‘good’ hits are a copy from the Loeb Classical Library; a reprint of Pusey’s translation in a series called the Harvard Classical Texts; and a translation by J. G. Pilkington in the aforementioned NPNF series.  The volume edited by Temple Scott (1900), which was scanned from a Harvard copy, is in fact a re-issue of Pusey’s translation, while the translation by W. H. Hutchings, scanned from a copy in the Bodleian, purports to be a new translation, albeit of only ten books rather than the full text’s thirteen (the final three books of the Confessions, more philosophical than autobiographical, are sometimes left out). There are a number of other versions available on Google besides these.  The volumes on Google are available in a variety of formats both directly on Google Books and through the Google eBook feature, including several formats designed for e-readers and for online reading.  While the quality of those that rely on page images is generally good, the OCR versions remain quite error-laden.  For example, this passage chosen more or less at random: “$e tnbefff$s against tTie SOone of enucatinp; f&e BUT woe to thee, thou torrent of human custom!” (p. 23).  If one examines the page-images it is immediately apparent why the text is so corrupt in the first half of the passage: it is an epigram printed in a gothic font.
But because of anomalies like this, the poor quality of the OCR would make it difficult and dangerous to use Google’s text for any serious purpose (over and above the fact that there is no obvious easy way to download the entire book in plain text format). There are some nice features of the Google reader such as the ability to create notes and mark up the text with highlighting in different colors. Google’s terms of service would seem to allow download and reuse of their content in a variety of forms.
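
One rough way to gauge how corrupt a stretch of OCR text is would be to count the tokens that do not look like ordinary words. What follows is only a minimal sketch in Python, not a tool offered by any of the sites discussed here; the function name `suspicious_ratio` and both heuristics (stray symbols inside a word, a capital letter following a lowercase one) are assumptions chosen for illustration, and they will miss garbled tokens that happen to consist entirely of letters, such as "enucatinp".

```python
import re

# A token counts as "ordinary" if it is made of letters (with optional
# internal apostrophes or hyphens) plus optional edge punctuation.
ORDINARY = re.compile(r"^[\"'(\[]*[A-Za-z]+(?:['’-][A-Za-z]+)*[\"')\].,;:!?]*$")

# A lowercase letter followed directly by a capital (as in "tTie") is
# treated as a sign of OCR trouble. This is a heuristic only: it will
# also flag legitimate names such as "McClure".
ODD_CASE = re.compile(r"[a-z][A-Z]")


def suspicious_ratio(line: str) -> float:
    """Return the fraction of whitespace-separated tokens that look garbled."""
    tokens = line.split()
    if not tokens:
        return 0.0
    bad = sum(
        1 for t in tokens
        if not ORDINARY.match(t) or ODD_CASE.search(t)
    )
    return bad / len(tokens)


if __name__ == "__main__":
    garbled = "$e tnbefff$s against tTie"
    clean = "BUT woe to thee, thou torrent of human custom!"
    print(suspicious_ratio(garbled))
    print(suspicious_ratio(clean))
```

Run over a whole file line by line, a score like this could help separate passages worth checking against the page images from those that are probably sound, though it is no substitute for proofreading.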

Internet Archive

Perhaps the most interesting and unique offering at the Internet Archive is the very first one among the initial hits: a complete audio book from Librivox.  The experience of listening to the Confessions read aloud probably more closely approximates how the text was experienced through much of its early history, when even private reading was often done aloud, than many of the printed versions.  Many of the versions available on IA are copies of books digitized by Google. Pusey’s and Pilkington’s translations are here, but also a version of the text translated into Hebrew that I found on no other site. The IA’s versions are available for free download in a variety of formats, including formats for various e-readers, as a PDF, and as a single plain text file.  Unfortunately, the plain text version is full of OCR errors (not least the common failure to segregate headers, footnotes, and main text), and would require significant clean up to be useful for any serious purpose.  Many of the IA books are listed as not in copyright or with no known copyright restrictions, and can be downloaded freely in various formats.  In addition, descriptive information about the scanned books can be contributed by users through openlibrary.org, and problems can be reported to IA through a link on their site.  IA’s online reader is perhaps the best interface of any of the available online readers.

Hathi Trust

Finally, a search of the Hathi Trust using the same terms described above returns 19 hits, including many of the same translations available via Google (the watermarks reveal that many of these are in fact Google’s scans). As one might expect, the metadata for Hathi Trust books are generally fuller and more precise than Google’s. Another useful feature is the ability to download citation information. Plain text is available, but only on a page-by-page basis, and even the PDF download of a full book in the public domain requires authentication.  According to the access and use policy, users are asked not to use the Google-digitized books for commercial purposes or to re-host them, but otherwise the books are free for non-commercial, educational use.

 

In conclusion, I would note that the plain-text version of Pusey’s translation available through Project Gutenberg is probably the most useful of all the free online versions of the text, simply because of its flexibility.  None of the foregoing discussion takes into account the accuracy of either the translations or of the editions upon which they were based.

The Scarlet Ebook

I selected Hawthorne’s The Scarlet Letter for three not so exciting reasons. 1. I have the book on hand. 2. Nearly all of the books I am interested in or enjoy come after the public domain works. 3. I happen to enjoy this one.

With that out of the way, The Scarlet Letter is available on all four resources: Project Gutenberg, the Internet Archive, HATHITrust, and Google Books. Let us go down the list and see what we have here.

Project Gutenberg’s text is available in a variety of formats: HTML, EPUB (no images), Kindle, Plucker, QiOO Mobile, and Plain Text UTF8. It isn’t clear what edition of the text the HTML version is based on, only that this version of the ebook was first released in 1992, produced by Dartmouth College, and updated in 2005. The HTML version contains all of the materials you might find in a print version of the book, such as biographical information, a list of works, and an editor’s note, but as this is HTML, there was no effort here either for the text itself to resemble a printed book, or to take advantage of some of the possibilities of the ebook format.

A few of the other formats seem unfamiliar to me, and others require programs or e-readers to view. Alas, being a non-Kindle user, I moved on to the online reader, which divides the novel into pages, serving as an alternative to scrolling through the text. But the online reader does little else to mediate or alter the text.

The Internet Archive provides what appear to be three versions of the text, but on closer inspection they are all identical copies of the HTML format of The Scarlet Letter taken directly from Project Gutenberg. The site provides a space for reviews (presumably for opinions on the quality of the e-copy or perhaps even the novel itself). It is also interesting to know that the novel has been downloaded 1,848 times.

Typing The Scarlet Letter into the search bar of HATHITrust yielded 931,602 results. Whoa. Could I narrow this down? I clicked the option for “full text only,” and with my results narrowed, I happily clicked the search button only to be bombarded by 480,863 results. Hm. What if I clicked “Nathaniel Hawthorne” as the author? That brought me down to 720 results. Perhaps my search was still off, but I decided that this was the best I was going to get.

I apologize for not having mustered the time or the patience to search through 720 results, although I suspected that the correct items would be found on the first page. First, a word on the functions of the site: HATHITrust provides a few limited options of viewing the text, but these only amount to zooming and flipping pages (or scrolling). The search function is quite nice and works well, although any Word or PDF file has this capability.

Going right down the list, the first selection brought me to a scanned copy of the 1889 Boston Houghton, Mifflin and Company version of the text, featuring black splotches and lines, and even a Due Date card in the back. In all other respects, however, this appeared to be a fairly well-done copy, and I would rather download a PDF of something that resembles a book rather than an HTML version that appears like a poorly designed web page.

How did the other copies fare? Well, it turns out many of them were duplicates, but one version caught my eye: The Scarlet Letter “with illustrations of the author, his environment and the setting of the book; together with a foreword and descriptive captions by Basil Davenport,” published in 1948. And the illustration? Well, it scanned quite well, I suppose. Hawthorne does sport his mustache with pride.

 

Finding most of the copies on HATHITrust in respectable shape, I moved on to the last resource: Google Books. Having already sorted through Project Gutenberg’s wide variety of formats, the Internet Archive’s borrowing of the most simplistic format (HTML) from Project Gutenberg, and HATHITrust’s large quantity of nearly identical copies (available for download as PDFs), I was ready for whatever Google Books had in store.

 

Typing “The Scarlet Letter by Nathaniel Hawthorne” of course yielded many, many results, but I could see right away that only one was an actual copy of the text. Here I found a scanned copy of the text from the 1898 Doubleday and McClure Co. edition. And yes, this one also features a stunning illustration of Nathaniel Hawthorne and his mustache. Google Books gives you the option to download the book in Plain Text, PDF, and EPUB formats. The quality of the copy itself is quite good, from what I can tell. But more importantly, Google put some effort into supporting some unique features. In addition to the search function, clicking a chapter title in the table of contents will bring you to the correct page. This is a long way from a hypertext version of the novel, but Google certainly took a step in the right direction.

 

Ultimately, I was not overly impressed with any version of the text, although I did not experience any of the extreme formatting issues Duguid encountered while researching Tristram Shandy. Moreover, as all copies are free to use for whatever purposes you may desire, I suppose I shouldn’t be one to complain. Google Books provided the most impressive copy of the text, even though I would still prefer my own hard copy of the novel to a scanned e-copy with a search function. I consider my $4 well spent. I can imagine a more robust hypertext version of The Scarlet Letter, but perhaps that is a blog post for another day.

Many versions of many stories in many languages (and many problems)

I decided to work with the great nineteenth century Brazilian author Machado de Assis (author of Brás Cubas), and to analyze the results more carefully than I do when I am researching for my own studies. It was not easy to find a great variety of titles by this author, so I had to choose from a limited group of titles that had full text versions available (most of the others were protected for copyright reasons). On Project Gutenberg, I found only two of the books that Machado wrote, so I decided to work with Varias historias (Many stories), a collection of sixteen short stories that was published in 1896, in Google Books, HathiTrust and the Internet Archive. I did not know I was going to find so many problems!

Google Books

The first option has only a snippet view, and it is actually a translation into Spanish. So I went to the second option to read it in full, and I saw that it is a 1903 edition from the Library of the University of Texas at Austin. The text was first published in 1896, so this edition comes just seven years after that. Google Books only offers the name of the publishing house, H. Garnier, the year, 1903, and the number of pages, 282.  The formats offered are plain text, PDF, and EPUB. You can download the text, and in the online version the table of contents has links to the different parts of the book. It is possible to read it in “Google Play” as well, a kind of digital cloud to store books, music, etc. So, you can make your own Google Books library.

As far as restrictions on the digital contents are concerned, users are not allowed to sell the digital content or remove the watermark or other sign that says it belongs to Google. These are the same restrictions that HathiTrust and Internet Archive have.

The scanned version had all the pages. But I realized that the print copy itself had a lot of problems!  In one instance, a page number was transposed (175 instead of 157), and there was a line mistakenly inserted in a dialogue. But, fortunately, one of the readers of this book in its printed form corrected the mistake, so we can now “read it the way it should be”.

[Image: Google Books]

The copy was full of marks that made the reading really annoying. In addition to this, another reader, who seems to be learning Portuguese, tried to “help” by translating some words he did not know!

[Image: Google Books 2]

My question is: What is the advantage of having access to an edition like this? Why digitize such a poorly printed and preserved copy? And why is it the first option when Google has digitized many other versions of this book?

 

Internet Archive

The copy I was looking for appears in the entry as written in Spanish! The site says that the publisher is Casa de las Américas, its year of publication, 1904 (which is the first problem, because Casa de las Américas was created after the Cuban Revolution), its language is Spanish, and it belongs to the collection of an “unknown library.” But when I “opened” the book, the first thing that appeared was the bookplate of Stanford University; it is a book in the Portuguese language, and it was digitized by Google. When I searched in the catalog of Stanford University, the book appeared there, of course.

So, why did they say they do not know the origin? Why is the information so poor? Correct information about this book (the publication year) is mixed with information about another book: its translation into Spanish, published by Casa de las Américas more than sixty years later. And as if two entries were not enough, when I began reading the book’s inside cover I found a third bibliographical entry on a post-it!


This copy was published by the same publishing house just one year later than the copy I found in Google Books: the edition was corrected, and (fortunately) the copy was clean! The formats offered were PDF, EPUB, Kindle, DjVu, and Metadata. But if you want to read it online, there are many problems with some pages. They look like this:

[Image: HATHITRUST problems]

It’s frustrating! This aside, the catalog record is incorrect. And that annoys me a lot, because I see once again the same mistake: thinking that Portuguese and Spanish are the same.  I found that there is an “editable web page” through “Open Library.” So I created an account to see what options I had to correct the mistake. It said that it had four revisions, but none of them changed the bibliographical entry. Now I had the chance to add some information about the book, and CHANGE the information given. So I changed the information about the publishing house, date, language…I was feeling much better after that! BUT I could not change the Language edition… it is like a curse… Spanish is NOT Portuguese… so I just added a comment warning that it was the original Portuguese edition, instead of the Spanish one that it announced.


HathiTrust

 

The copy I found here belongs to the New York Public Library, and it was digitized by Google (even though it is not possible to read it in full in Google Books).

The publishing house is the same as the others, H. Garnier, but they do not know the date of publication. It should be after 1903, because it is a corrected version. It is strange because the data does not appear where it appeared in the other two versions. There was only one format, PDF, but it is possible to read it online as well. But this copy is almost illegible!

[Image: IA 4]

Many stories lack from one to three pages, a whole story is missing, and there is one page that was attacked by a cannibal or something:

[Image: HathiTrust4]

HathiTrust has a feedback form to report problems. But if problems come from books digitized by Google, they only say that “Google is continually improving the quality of images and OCR it delivers to HathiTrust partners.” So, the real answer is: wait.

It is possible to read the text in a Classic view, Scroll, Flip, Thumbnails and Plain text, which I found interesting and useful – but not so useful if the copy lacks pages and sometimes it is almost illegible!

You can download the PDF version only if you are part of the partner institutions (American universities, basically, and just one from Spain and France). You can create a collection (that can be private or public) and add the book.

 

Yes, digitization has a long way to go, but there are things that can be done just by paying more attention to the information that is posted. The quality of the scan is sometimes very poor, and sometimes so is the quality of the original!

The Idea of an E-Book

I’m sorry, I just can’t come up with the great posting titles the rest of you do.

The first book I looked for was Lux Mundi (1890), a collection of Anglo-Catholic theological essays edited by Charles Gore. My reason for doing so was practical, since Travis Brown and I are using scanned images from this book, fed through OCR tools like Tesseract and OCRopus, for the ActiveOCR project at MITH. I won’t say the book was chosen at random,  but close to it. Travis wanted something from the late 19th century, and suggested that I search for everything in the Hathi Trust collection published in 1890.

The fact that the only other collection it appears in, however, is Google Books rules it out for the purpose of this assignment.

Deciding to stick with the theme of 19th century divines, I looked for John Henry Newman’s The Idea of a University, and found it on Project Gutenberg, the Internet Archive, Hathi Trust and Google Books.

As several others have noted, Project Gutenberg provides the most formats and the least provenance information. The book is available in HTML, EPUB, Kindle, PDF, Plucker, QiOO Mobile, Plain Text UTF-8 and TEI. All of these in addition, of course, to the Online Reader. Some of these formats seem a bit obscure to me: I had to look up Plucker (apparently an e-book reader for PalmOS devices), and QiOO (I’m guessing a reader for Android phones, since it’s Java-based, although they didn’t use the name Android). I fired up the oXygen editor to take a look at the TEI file, and it appears to be TEI (P5?) Lite with a Project Gutenberg-specific modified DTD. Although there are credits for the people responsible for preparing the files for Project Gutenberg, there is no information about which printed text(s) provide the basis for the electronic text.

I got 26 results when I searched the Internet Archive for The Idea of a University by Newman. One of these results was for the Project Gutenberg record, which offers the book in several formats not immediately visible on Project Gutenberg’s own page, including DAISY Digital Talking Book and DjVu (pronounced déjà vu, this is a format for scanned documents that its promoters, although I suspect few others, consider a competitor to image PDFs). There were also at least three (one may have been a duplicate)  results from Google Books (digitized from the University of California, Harvard, and New York Public Libraries).

I chose to look at one (26 was way too many) in detail that was contributed by “Kelly – University of Toronto”. While my first reaction was that “Kelly” might be an individual, a Google search indicated that it is a reference to the John M. Kelly Library at the University of St. Michael’s College, a Catholic university that has an institutional relationship with the public University of Toronto. This version was available in Full Text, PDF, EPUB, Kindle, Daisy and DjVu formats. The document is in the public domain. There is no apparent way for users to report or correct errors. This is probably as good a place as any to note that I find the default online reader, which navigates through the text by “turning” pages, incredibly annoying. This is as misguided as the attempts of late 15th century printers to recreate the look of manuscripts in printed texts.

(This is as far as I’m going to be able to get before class, but I will update the post later with the information on the Hathi Trust and Google Books sites.)

 

Pride for Google Books, Prejudice for HATHITrust

As a Kindle user, and more importantly, as someone who plans to work in digital publishing, I found this exercise very informative.  I initially attempted to find my favorite book, A Prayer for Owen Meany by John Irving, but it was only available on Google Books.  So, onto a favorite I knew would be a more viable option: Jane Austen’s Pride and Prejudice.

I am fairly familiar with free public domain books, as I have downloaded many from Amazon.com for classes.  In fact, I have Pride and Prejudice via free download on my Kindle.  I was not, however, familiar with the answers to any of the questions Professor Kirschenbaum asked us to investigate.

Pride and Prejudice was available on all four platforms: Project Gutenberg, the Internet Archive, the HATHITrust, and Google Books.  With so many options to choose from, I dove into Google Books to see what I could find about the provenance of the book.  Where to begin?  There are seven versions available on page 1 of the initial search alone!  A sampling includes editions from Harvard dating from 1962 but copyrighted in 1918, Lenox Library with an 1853 copyright, and even a version from an imprint located in our neighbor Rockville, MD from 2008.  Some copies can only be read on the Google Books website, but others have PDF and EPUB versions available.  From experience, I know PDFs can easily be transferred to an e-reader.  Thus, with the PDF, the reader now has four options for how to read: on the computer, printed out, on a smart phone, or on an e-reader.

The graphics and formatting were retained in all of the versions I researched.  Additionally, all of the versions I opened had a search feature.  Only a few had the option for reading the text in a more user-friendly way.  Some had options of reading one page at a time, side-by-side as you would a hard copy, and via thumbnails.  You can even save the book to your own online library.  As far as highlighting, Google Books had at least one version where you could create clippings and share them via social media.  For additional social media options, you can write your own review.  I did not, however, find a place where you can write about errors, nor did it seem there were any restrictions to usage, despite a Terms of Service.  Overall, Google Books was very user-friendly and provided a variety of ways to personalize your reading experience.

Next it was on to Project Gutenberg.  I was overwhelmed from the get-go when my search returned 29,141 downloads.  Further investigation led me to realize this was how many times the book had been downloaded for free.  From here I was given a variety of ways I could view and download, from HTML to QiOO Mobile, something I’ve never even heard of before.  I clicked on the very first HTML link.  There, I was greeted with an interesting message:

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org

This intrigued me because, in my former life as a TV producer, there were restrictions on everything from music videos, to still images, to movie clips.  Everything came with a price and specifications as to how it could be used.  But what really got me was the release date of August 26, 2008 and a note that the version had last been modified November 5, 2012.  Surely Pride and Prejudice has not changed in the 200 years since it was first published.  But then, at the very bottom of the site, the full terms of license are listed.  Again, there were two interesting passages:

Updated editions will replace the previous one--the old editions will be renamed.

How are these editions being updated?  Why do they need to be updated?  What is being modified?  My questions were endless.  Then:

You may use this eBook for nearly any purpose such as creation of derivative works, 
reports, performances and research.  They may be modified and printed and given 
away--you may do practically ANYTHING with public domain eBooks.

This clause shocked me!  If you can do anything with public domain books, can we trust that we are getting the book as it was intended?  Are we getting the whole book, or some annotated version?  Because it can be modified in any given way, it seems as if we are given license to recreate the book to our liking.  Forget Elizabeth ending up with Darcy, let’s just change it around to have her wind up with the abhorrent Wickham.

The one option I found especially interesting on Project Gutenberg was the availability of a QR code, so that the user can scan it with their smart phone and automatically download a version to their mobile device.  PG also offers a link to “mirror sites,” which are mostly international universities offering the same version of Pride and Prejudice for download from their university library.  I found this to be disappointing because I was hoping it would offer me a version of the book translated into other languages, but it did not.  While it first appeared that PG was going to offer many versions of the book, all formats led to the exact same version, which is much different than the variety offered on Google Books, but also gave me a sense of faith that perhaps Pride and Prejudice wasn’t being mangled by users doing anything they want with the text.

Using the Internet Archive initially appeared to just cull the books that had been digitized elsewhere.  In fact, various versions specified they came from Google Books—the same Harvard version mentioned earlier—and Project Gutenberg.  Despite the versions being the same, the Internet Archive had a much more user-friendly format.  If you desired to read the book online, it immediately led you to a side-by-side page layout, that, when flipping pages, animated the page turning. It also allowed the graphics to be seen more clearly.  One of the most unique features was that the version was available as an audio book.  However, the audio was very computerized and it attempted to read aloud quotation marks and other punctuation.  While none of these features change the text, somehow it made it a little more enjoyable to know of the bells and whistles available.

The Internet Archive also offered your basic search functions, download options, and a place to write user reviews.  Strangely, the terms of use have not been modified since 2001, which is surprising given how much has changed in the digital humanities in the past 12 years.  It did, however, give an email address to contact someone about copyright information.  One of the versions even had a link to an editable page, where you can edit the book.  Thus far, only eight users had done so since 2008.  I guess people aren’t as inclined to mess with classics, until you have the bright idea to write Pride and Prejudice and Zombies, as Seth Grahame-Smith did.

Finally, it was on to HATHITrust.  As soon as I clicked on the page I knew it wasn’t nearly as user-friendly and complete as any other online library.  The initial results only returned options that would search for the book in hard copy at nearby libraries.  It was the eighth result that was actually a full-view online version.  I clicked, only to find it was the trusty ol’ Google Books version yet again.  It too had the side-by-side page flip option, but the words were so small you couldn’t read them and the zoom feature did not work.  The only way to read it online was in the traditional view.  However, it was available for PDF download just like the others.

HATHITrust also had what I’ve now come to realize are the basic features of an online library: document search, a personal online library, and a way to share links from the book on a social networking site.  It did have a more prominent feedback link for users to share how they found the quality of the text.  One reportable problem is missing parts—perhaps they got an editable version.

Overall, Google Books and the Internet Archive had the best sites, in my opinion.  Either way, I think it’s great there are so many classic books available to readers so easily.  No matter which site was chosen, the reader was going to get a legitimate copy of Pride and Prejudice, one of the most beloved books of all time.  As for me, I’ll stick to my Kindle for reading digital books for now.  However, a hard copy version will always be my first love.

 

War of the EBooks

I tried to be a scifi nerd and use Neuromancer for this exercise, but I had to settle for War of the Worlds. Doesn’t make for the best catchy blog post title but, what are you gonna do.

Project Gutenberg offers H.G. Wells’ The War of the Worlds in HTML, EPUB, Kindle, Plucker, QiOO Mobile, Plain Text UTF-8, and several kinds of zip files. It can also be read online as an EBook, although it is immensely frustrating to read that way as it is formatted into chunky paragraphs requiring links to the previous or following pages. According to Project Gutenberg it is EBook #36, released in 1992 and updated in 2008. The site allows the user to create bookmarks on the “pages”. Unlike the other sites, it notes that the user “can help us produce ebooks by proof-reading just one page a day” (http://www.gutenberg.org/catalog/world/readfile?fk_files=1697601&pageno=2).

HATHITrust offers downloadable PDFs of single pages without a log-in and a full downloadable PDF for members, as well as an online view of the Bernhard Tauchnitz Leipzig edition. HATHITrust offers two dates: “1898 [i.e. 1929?]”. The online version is originally from the University of Virginia, digitized by Google Books. It allows you to search the book or jump to different sections, to render it in plain text, to share a link to the book or to a single page, to view the book in “Flip” or “Scroll” mode or with thumbnails of the pages, and create new collections of books with a member log-in. The site notes that the book is public domain in the United States, although, “Google requests that the images and OCR not be re-hosted, redistributed, or used commercially” (http://www.hathitrust.org/access_use#pd-us-google).

Google Books offers EPUB and PDF downloads with both “Flowing Text” and “Scanned Pages.” It can be read in plain text and the user can “Advance Search” the book for specific phrases. Google offered the widest variety of editions, from a limited view of a 2012 edition to a full view of an 1898 illustrated edition published by Harper & Brothers in New York. The latter came from the Pennsylvania State University Library, and has the entirety of the table of contents in hyperlinks, which was the first instance of this I noticed in browsing several editions and which makes navigation quite easy. Unfortunately the book does not offer any information about the illustrator, but it contains a frontispiece of HG Wells and a number of beautifully drawn and rendered bluish black and white images that scanned crisply.

The frontispiece from The War of the Worlds. Unfortunately I could not find any information about the illustrator.

At the end of this copy is a library binders’ mark from August 3, 1967, in Philipsburg, Pennsylvania. Also contained at the end of the book was the mostly blank “Date Due” card, containing crossed out dates from 1993. Lastly, and most fun for me, there are no less than 5 scanned images of the book’s maroon back cover and bar code, two of which have the archivist’s bright pink latex glove in the corner and two of which were captured when the book was in the process of being opened and flipped over, with a black and white checkered pattern on the edge from what I am assuming is the inside cover of the book.

The back cover of HG Wells’ The War of the Worlds, as seen in Google Books.

A pink Martian’s…errrr, archivist’s thumb on the back cover of War of the Worlds.

Google allows the user to search the book and write a review, and offers perhaps the most flexible interface with multiple page views of the book, the ability to “cut” or highlight sections of pages, and a zoom tool. The site restrictions and terms of service state that this “copy and paste” function needs to be “used within the prescribed limits and only for personal non-commercial purposes” (http://books.google.com/intl/en/googlebooks/tos.html). Google watermarks also may not be removed from the digital content.

I found Google Books to be the most versatile interface for viewing and downloading this book. While the Kindle edition I downloaded from Project Gutenberg was readable and there didn’t seem to be huge issues with it in terms of formatting, I found myself annoyed by the fact that new chapters don’t start on new pages. On all of these sites, it was hard to find information about access to these books for people with disabilities.

La Mort D’Impression? : How Google (and others) Digitize Le Morte D’Arthur

(Apologies if the French translation is off–I don’t speak it and am relying on a machine translation (and I’m sure Julia can tell us why that’s a bad idea!))

Since my interests lie more heavily in the still-copyrighted 20th century, I turned to my other love of Arthurian legends for this task.  Specifically, I looked at the seminal collection of French (and one Middle English) tales written into English as Le Morte D’Arthur by Sir Thomas Malory, which was available in all 4 digital libraries.  I chose to focus on Volume 1 to narrow down the information and compare the resources.

Project Gutenberg offered the second-greatest number of formats (HTML, EPUB, Kindle, Plucker, QiOO Mobile, and Plain Text UTF-8), but for only one edition of the book, which is not clearly identified.  It says the editor is William Caxton, who produced an edition in 1485 that has become the basis for most of the editions of the book (the other being the Winchester Manuscript), and contains his Preface, but it also contains a Bibliographic note by A. W. Pollard without identifying him as the editor.  Nor does it contain a publisher or print date beyond the release date of November 2009.  It also lacks any information as to which specific source was the basis for their digitization.  In terms of page layout, the EPUB and Kindle editions specify that there are no images, but whether that has an impact is unclear without a specified edition.  A big frustration when reading online is the lack of page numbers to correspond with the chapter listings in the table of contents, if not hypertext links from the table of contents to those chapters, making it hard to move through the book unless you know the specific page to jump to.  Although there is no specific place on the book page to report errors, the top of the screen does have an “ad” reading: “Did you know that you can help us produce ebooks by proof-reading just one page a day? Go to: Distributed Proofreaders”.  This suggests that they are crowdsourcing their quality assurance process.  The online reader seems to be restricted to viewing only; however, you can download copies of the books to give you the affordances of the other formats (such as Kindle).

Google Books hosts several editions of Le Morte D’Arthur.  One is the Everyman Library edition, also based on the Caxton text, edited by Ernest Rhys and published by J.M. Dent in 1906.  It was sourced from the University of Michigan and is available as an EPUB and a PDF in addition to online viewing.  This edition includes the rather beautifully illustrated title pages; however, one has to scroll past multiple scans of the University of Michigan title plate, blank pages, and this interesting failure in scanning to find it:

Screenshot of the scanning failure in the Google Books copy.

It also preserves Caxton’s original preface.  Google Books also hosts another version of Caxton’s text published by bompacrazy.com, which appears to be a scan of a PDF and is just plain text. There’s also a digireads.com ebook edition for purchase.  Other than reviews, there does not seem to be a system for reporting errors (otherwise, I’d assume someone would already have cut out the excess pages).  Google Books allows you to download, search within, and save a copy to “My Library”; however, it does not allow you to annotate the book.

HATHITrust also has the Rhys editions, but scanned by Google from Cornell University and the University of Virginia in addition to the University of Michigan.  In addition, it has two other 19th century editions: an 1891 Macmillan publication with the Caxton text edited and introduced by Edward Strachey from the Universities of Michigan and Toronto, digitized by Google; and an 1889 Nutt publication in which Caxton’s text is “‘reprinted page for page, line for line’, but in modern type”, edited by Oskar Sommer and introduced by Andrew Lang, from the University of California, digitized by Google.  Each of the editions is only available in PDF format, and for some reason, both Rhys editions are for volume 2, rather than one of each.  Although HATHITrust offers the most viewing options (Classic View, Scroll, Flip, Thumbnails, and Plain Text), the Flip presentation of a book spine and cover is clearly a graphical representation instead of a realistic one.  (I will say that it’s fun to run your cursor over the “pages” and watch the “jump to page __” numbers flip rapidly.  For some reason this strikes me as similar to riffling the pages of a real book.)  Page layouts are preserved, including italics, spacing, and footnotes.  HATHITrust offers a Feedback form if there are any problems with the text, as well as the ability to search, download single pages or the whole document, add the book to a collection (if one has University access to sign in!), or share it with others.  HATHITrust offers a few full text versions, but many were limited to viewing only or to “snippets” of the full text.

The Internet Archive offers the greatest number of formats, with each edition available for download in PDF, EPUB, Kindle, Daisy, Full Text, and DjVu.  It contains the Rhys edition from the University of Michigan as digitized by Google, but also from the University of Toronto and the New York Public Library; the Strachey edition from Stanford Library and the University of California; and the Sommer edition from the Universities of Toronto and Michigan and Cornell University.  The Internet Archive presents the book as if one were looking at a paper version, with page turns instead of scrolling, in a slightly more realistic way than HATHITrust (and offers the same satisfaction in riffling the pages).  Also, for the Strachey version, it looked as if many of the actual page images were presented instead of just the scanned text; I could clearly see that the bibliographic page in the Stanford book was torn and repaired with tape.  Some pages are badly scanned, with the margins of text cut off or wavy.  However, the marginalia from users has been preserved.

Yet more fingers.

The Internet Archive offers an editable web page on Open Library that seems like the method for users to make changes (such as adding new editions), but I’m not sure if it also acts as an official reporting system for errors.  It allows users to search, bookmark, write reviews, share the book, and have a computer read the text aloud.  Interestingly, when I asked the computer to read aloud, it was forced to spell out “Rhys” rather than pronounce it, but had no trouble pronouncing the words “Igraine” or “pyonce”.  There do not seem to be any restrictions on use, and the site offers “selected metadata” that might be useful for creating databases for further study.

I tested the search features in each library by searching the book for the word “swoon” (since the amount of swooning, primarily among the supposedly noble and heroic knights of the Round Table, surprised me the most when I read the book).  Google Books showed 14 results in the book with hyperlinks to the individual pages and excerpts from the text to show the context of the word.  HATHITrust showed the word on 13 pages for a total of 15 results, also with hypertext linking and excerpts to show context, although the excerpts were shorter than those in Google Books.  Surprisingly, the Internet Archive produced no results; it did manage to find character names when asked, and provided a popup window of context with links to the individual word searched.  The Kindle download from Project Gutenberg found 25 results, displayed in a sidebar which shows the context and the location, which can be clicked on; however, the search term is not highlighted on the page when it is brought up, so it can still take a while to find.
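
For anyone who wants to check counts like these against a plain-text download (such as the Plain Text UTF-8 file from Project Gutenberg), a small keyword-in-context search is easy to sketch in Python. This is only an illustration under stated assumptions, not how any of these sites actually search: the function name kwic and the sample sentence are invented for the example. It shows how the number of hits changes depending on whether the search matches only the exact word or also longer forms such as “swooned” and “swooning”, which may be one reason the four tools disagree.

```python
import re

def kwic(text, term, width=30, prefix=False):
    """Return (offset, context) pairs for each match of `term` in `text`.

    prefix=False matches the whole word only ("swoon");
    prefix=True also matches longer words that start with it ("swooned").
    """
    tail = r"\w*" if prefix else r"\b"
    pattern = re.compile(r"\b" + re.escape(term) + tail, re.IGNORECASE)
    hits = []
    for m in pattern.finditer(text):
        start = max(0, m.start() - width)
        end = min(len(text), m.end() + width)
        # Collapse line breaks and runs of spaces so each context prints on one line.
        context = " ".join(text[start:end].split())
        hits.append((m.start(), context))
    return hits

# Hypothetical sample text, invented for this example (not quoted from Malory).
sample = (
    "Then the knight fell down in a swoon. When he awoke he swooned again, "
    "and the queen, swooning, was borne away. A third swoon followed."
)

exact = kwic(sample, "swoon")
loose = kwic(sample, "swoon", prefix=True)
print(len(exact), "whole-word matches")
print(len(loose), "matches including longer forms")
for offset, context in loose:
    print(offset, "...", context, "...")
```

To try it on a real download, read the file into a string (for example with open(path, encoding="utf-8").read(), where path is wherever you saved it) and pass that in place of sample.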

One of the biggest challenges in examining Le Morte D’Arthur was that the different editions were labelled inconsistently in the catalogs.  For example, some editions claimed to have Janet Cowen as the editor, and when opened, turned out to be the Strachey edition.  Still others were not clearly labeled as to which volume they were.  Most concerning is the lack of any particular identifying information about the Project Gutenberg text.  Clearly, digital libraries need to establish the same criteria as print libraries for making sure their catalog databases are precise and accurate.

Moby-Dick: The Whiteness of the Page

My book of choice for any bibliographic project will usually be Moby-Dick. Katie and Susie can both attest to this after having to sit through a semester of me geeking out over the textual history of the novel. Of course, by posting later than some of the others, I can only echo what they have said: Project Gutenberg provides the most formats for a given text, including an audio option, which neither HATHITrust nor Google Books gives you (they only allowed PDF downloads; HATHITrust required permission, and Google payment), and it was certainly the easiest to download, because it came with virtually no strings attached. But while I have traditionally always turned to it first for my canonical etext needs, I found it the least transparent of the three versions of Moby-Dick I collected.

For those unfamiliar with Melville scholarship in general, one name pretty much reigns as the foremost editor of Melville’s novels, especially Moby-Dick: Hershel Parker. Since the 60s he has edited three ‘authoritative’ versions of MD that have formed the foundation of most Melville scholarship and editing practices since. As someone heavily invested in Melville, I find Parker’s imprint in almost any edition I come across, and the lack of it is suspicious. It is not a bad thing, of course, but it raises questions. Project Gutenberg does not note an editor or identify its copy-text in either of the two full-text editions of MD, but instead includes the note:

Produced by Daniel Lazarus, Jonesey, and David Widger

I do not recognize any of the names personally, and these people are not specifically named as editors, so it is difficult to determine what sort of mark they may have left on the text, and without providing information about the copy-text, the text’s specific origins are unknowable to an outsider. Of course, Project Gutenberg provides a (somewhat reasonable) defense for this:

Creating the works from public domain print editions means that no one owns a United States copyright in these works, so the Foundation (and you!) can copy and distribute it in the United States without permission and without paying copyright royalties.

This is what made Project Gutenberg’s text of MD so easy to acquire, versus HATHITrust and Google, which asserted copyright claims to their digital versions and locked the downloads behind certain obstacles. While I can appreciate the reverence paid to access, the unclear provenance of the text, beyond its status as a “public domain text,” gives me no assurance that the copy-text is reliable. This perhaps is fine for a general reader, but unsettling for a scholar.

On the other hand, HATHITrust and Google Books both provide some more concrete information because the book is viewed through images of a scanned hard copy. What is unfortunate is that the public domain editions available on the two platforms were also very dated. HATHITrust’s edition of MD is a 1929 Macmillan edition (which is about the time Melville was rediscovered but well before academics began critically editing his work), and Google Books’ full-text edition is from 1851, the year the book was published. Google’s edition wins, for me at least, because the 1851 edition is more reputable than whatever edition served as the copy-text of Project Gutenberg’s edition and, it stands to reason, may have served as the copy-text for HATHITrust’s version. Easy access to the first edition of the book leaves scholars with few questions as to what they are working with, and can actually be very useful not only as a text itself, but as an artifact of the novel’s original form (before critical editing).

Of course, I can’t spend all my time musing on editions and validity. The formatting of the texts is also interesting for one major reason: in the Gutenberg edition, since it does not mimic the page scrolling format Google Books and HATHITrust adhere to, we find awkward moments in the text where the body of the text is interrupted by Melville’s footnotes (which he typically wrote in to clarify any esoteric nautical information). In the page scans from the other two databases, this does not occur, because they reproduce the pages and so the text remains in a more traditional form (with footnotes at the bottom, clearly demarcated as outside of the body).

In response to the Duguid article, where one of the primary critiques of Google Books is the poor scanning of pages and distorted words, Google’s edition of MD looks to be pretty polished. In my sampling of the scanned pages, I did not find cut edges, distortions at the spine, or anything of that sort. That problem, however, was prevalent in the HATHITrust version, where the illustrations of the cover page were cut off near the spine, and some marginalia went over the edge of the page (someone made a note on the Table of Contents that spanned the margin between Chapters XIII and XVIII that I think might have said ‘BORING!’, but I cannot be sure).

Finally, in terms of feedback, HATHITrust made the process the easiest by providing, on the same page as the book was read on, a little button that opened a survey asking about the quality of the book, where any errors could be reported including missing, distorted, curved, and blurry text. Google, unfortunately, only allowed users to review the book, and reviews could be more concerned with plot and enjoyment than with textual quality. Project Gutenberg did not provide any easily accessed method of evaluation, but does include links on the home page to get in contact with them, and to submit missing pages for texts (which I suppose counts as one form of correction).

I was surprised, especially after reading Duguid, by what I found in Google Books. Its images of the Moby-Dick text looked more professional and refined than those in the HATHITrust edition, came from an 1851 first edition, and posed no issues in the formatting of the text. The same could not be said of the HATHITrust and Project Gutenberg versions: the former’s scans were less polished and contained marginalia (incomplete and cut off at that), while the latter posed formatting issues by presenting footnotes incorporated into the body of the text without separating them in any way. As I said, the Duguid article made me fearful of what I would find on Google, and the issues it raises with Tristram Shandy are of course valid concerns, but perhaps Google has learned or has improved its process since that article was published in 2007. While Google Books’ major downside was the lack of a reporting feature, of the three editions I looked at, it was surprisingly the one that needed it the least.