It’s been an interesting couple of weeks at the Foreign Literatures in America project, as we’ve really begun to set sail on both the Russian literary reception archive and the Modern British literary archive. Though the projects are large, and seem to grow larger and more exciting the further we delve into them, we have also found that much of our energy has been devoted to getting many small technical details precisely in order. We’ve actually been working on such things for months, but the fine-tuning is now beginning to show results.
There are three basic dimensions to our work of assembling these archives, which we are building in part as resources in themselves and in equal part to develop Natural Language Processing (NLP) tools usable on further reception materials and other data.
The first is the actual high-quality scanning of reception materials. Nick Slaughter, one of the executive editors at FLA and the director of the Russian Authors Initiative, has developed a meticulous and extensive guide detailing how to scan these materials at the University of Maryland. Collaboration with the undergraduate Gemstone team POLITIC (in particular, through team members Matthew Carr, Robert Cai, Adam Elrafei, Andrew Li, and Dan Yang), as part of their own research project quantitatively assessing the politicization of Russian literature in the U.S. in relation to U.S. foreign policy, has begun to yield a rich fund of scans. Meanwhile, Rebecca Borden and Jennifer Wellman, also executive editors of FLA, have been spearheading similar efforts on the American reception of the Polish-English author Joseph Conrad, especially where Conrad’s women readers are concerned. One question that keeps coming up in all this is the efficiency of the process. We are all very committed to securing the highest possible quality scans, for reasons ranging from the curatorial to NLP; that said, there are faster and slower ways of securing files, and we already have thousands of pages of very rare documents that are not of the highest quality. It is a question in my mind whether the scholarly community would be better served by our simply posting what we already have on the FLA website, rather than going to great lengths to re-find data we already have and that scholars really need ASAP.
This raises the second technical dimension of our project: OCR, or optical character recognition software. We have been blown away by just how easy, accurate, and unbelievably fast the ABBYY software we have purchased is, even on scans that are not of the highest quality. The OCRing of our several thousand pages of scans is simply zooming along, not least because Alex Winter, Soumya Yanamandra, and Kay Zhang on the Gemstone team have mastered the system in ways that will forever be beyond my skill set.
This finally leaves the third technical dimension of this phase of the project: namely, developing, or rather fine-tuning, the annotation questions to ask of the data we are gathering, so as to make it possible to train computers to read for and answer the kinds of questions that simple keyword searches alone cannot. Adrian Hamins-Puertolas, Alex Goniprow, Manpreet Khural, Nick Slaughter and I have been working very hard on these questions: we’re up to four revisions now, each tested in a kind of quadruple-blind trial in which we independently apply the questions to various articles to see whether we come up with consistent answers and hence a viable questionnaire. We are very close: out of 16 questions, there remains one on literary “style” and “artistry” that is giving us the most trouble. We’ve been having amazing fun sifting through these questions and pondering issues, some of them very technical, that I don’t think any of us individually had foreseen. It is because of the small details we have been meticulously attending to over the past several weeks, and actually months, that I expect we will reap enormous analytic rewards in the near future.
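For readers curious what checking a questionnaire for consistent answers can look like in practice, here is a minimal sketch in Python. It is only an illustration and not necessarily the procedure the team follows: the question labels, the sample answers, the function names, and the 0.8 cutoff are all hypothetical. For each question it computes the share of annotator pairs who gave the same answer, then lists the questions that fall below the cutoff.

```python
from itertools import combinations


def pairwise_agreement(answers):
    """Return the fraction of annotator pairs that gave the same answer."""
    pairs = list(combinations(answers, 2))
    if not pairs:
        raise ValueError("need answers from at least two annotators")
    return sum(a == b for a, b in pairs) / len(pairs)


def agreement_by_question(responses):
    """Map each question id to its pairwise agreement score.

    `responses` maps a question id to a list of answers, one per annotator.
    """
    return {q: pairwise_agreement(ans) for q, ans in responses.items()}


def flag_inconsistent(responses, threshold=0.8):
    """List the questions whose agreement falls below `threshold` (an assumed cutoff)."""
    scores = agreement_by_question(responses)
    return sorted(q for q, score in scores.items() if score < threshold)


# Hypothetical answers from four annotators reading the same article.
sample = {
    "Q1_mentions_politics": ["yes", "yes", "yes", "yes"],
    "Q2_praises_style": ["yes", "no", "yes", "no"],
}

if __name__ == "__main__":
    for question, score in agreement_by_question(sample).items():
        print(f"{question}: {score:.2f}")
    print("Needs revision:", flag_inconsistent(sample))
```

Raw pairwise agreement is the simplest possible measure; it does not correct for agreement that would occur by chance, so a chance-corrected statistic may be preferable once the questionnaire stabilizes.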
Peter Lancelot Mallios is a MITH Faculty Fellow. He directs the Foreign Literatures in America Project and is a professor in English at the University of Maryland.