Literary DNA and Google Books

Can digital tools help us to trace literary inheritance and influence? In my dissertation, I claim that Washington Irving helped to originate a tradition of bachelor sentimentalism in American literature that Donald Grant Mitchell extended in Reveries of a Bachelor, Melville satirized in Pierre and “The Paradise of Bachelors and the Tartarus of Maids,” and Henry James complicated in “The Lesson of the Master” and other works. If this relationship were visualized in a family tree, it would look something like this:

Bachelor Pedigree

Note that Herman Melville is hanging out by Donald Grant Mitchell but isn’t really connected to the family tree–in part because he’s got kind of a dismissive step-child relationship to Washington Irving, in part because I just couldn’t figure out how to make siblings display in the family tree tool I used. (I would have been better off using a more flexible tool like Gliffy, and now the tool thinks my name is Henry James, but live ‘n learn.) Note also that there are no mothers–appropriate, I suppose, for bachelor literature. Of course, such a family tree grossly oversimplifies literary inheritance and influence; for one thing, most authors have many literary ancestors, many fathers and mothers.

As I was tracing this literary lineage in my dissertation, I kept asking myself how I knew that this genealogy was real, how I could be sure that I wasn’t just making it all up? Although I couldn’t state with certainty that there was this bachelor genealogy (such certainty seems to be beyond the reach–and outside the aims–of literary interpretation), I found some pretty good evidence for it. I won’t rehash all of the arguments here, but as I explain in my dissertation, Mitchell, Melville and James commented on their literary forebears. Mitchell gave a tribute at the Washington Irving Centenary, Melville rejected Irving’s approach to literature as being a “self-acknowledged imitation of a foreign model,” and James remembered his “very young pleasure” in reading Mitchell and Melville. Then there are the similarities in voice (a smooth, subjective sentimentalism), setting (bachelor garrets and lonely hearths), character (the dreaming bachelor), etc.

Now I wonder: Would text analysis tools and massive collections of texts such as Google Books provide further evidence for this bachelor genealogy? By comparing abstractions and visualizations of my four core bachelor texts, would I be able to see a “family resemblance”–perhaps a unique turn of phrase that is repeated through the generations like funny toes or wavy red hair? And what would constitute reliable evidence of inheritance, anyway–similarity in word choice, narrative voice, character, structure, graphical design (i.e., if you’ve got Fabio on the cover, the book probably belongs to the romance tradition)? Is there such a thing as “literary DNA”?

And is “literary DNA” even a valid concept, or is it a scientific term misapplied to the literary realm? By “literary DNA,” I mean the unique characteristics that define a literary work, characteristics developed through the complex process of literary inheritance and creativity. A search of the MLA Bibliography yields no results for “literary DNA” (although “literary influence” gets you over 1000), while Google Scholar shows only 9 results. Perhaps the term rings too much of a positivist approach to literary study. When you broaden the search to Google, though, about 1000 results are retrieved. For example: The Paris Review initially called its archive of 50 years of its interviews The DNA of Literature (USA Today), presumably because you can plumb it to find authors explaining the genealogy of their works. In the Netherlands, researchers at the Huygens Institute KNAW are developing The literary DNA: computer-assisted recognition of narrative elements, software that can recognize themes and motifs in literature. “Literary DNA” is probably most closely associated with Don Foster, the literary scholar who used textual analysis to help identify Joe Klein as the “Anonymous” author of Primary Colors, Ted Kaczynski as the Unabomber, and Shakespeare as the author of the poem A Funerall Elegye in memory of the late Vertuous Maister William Peeter (erroneously, as it turned out). In Author Unknown (which I’m just starting to read), Foster argues, “The scientific analysis of a text–how a mind and a hand conspire to commit acts of writing–can reveal features as sharp and telling as anything this side of fingerprints and DNA” (4). I’m skeptical that a text can be analyzed scientifically, and textual analysis does not necessarily prove authorship definitively, as we see in the case of the poem mistakenly attributed to Shakespeare. Still, Foster’s methods are intriguing–he looks at word choice, punctuation, spelling, grammar, and sentence construction to discover the unique features of an author’s “linguistic system.” Romantic that I am, I dig the idea of literary scholar as sleuth.

So how would you use the computer to study literary influence and inheritance, and what would be the implications of this approach to research? Certainly search tools can help us to track allusions and other signs of influence. In his excellent article “Googling the Victorians,” Patrick Leary describes how the editors of the Strouse Edition of Carlyle’s Writings use Google, Literature Online (LION), the online OED, and other electronic resources to identify unattributed quotations, sparing themselves from having to spend “many often fruitless hours in the library stacks.” By providing quick access to much of the cultural record, the Web may, as Leary argues, “lend new urgency to arguments about what constitutes evidence of literary or intellectual influence” (12). Massive digitization projects such as Google Books are exposing not only influence, but also outright plagiarism. In Slate, Paul Collins reports that a linguist working for Google Books found that a passage from Sacrificial Foundations (1899) was a plagiarism of a plagiarism. As Collins observes, “it may be only a matter of time before some enterprising scholar yokes Google Book Search and plagiarism-detection software together into a massive literary dragnet, scooping out hundreds of years’ worth of plagiarists—giants and forgotten hacks alike—who have all escaped detection until now.” But when is “plagiarism” really creative re-invention? Jonathan Lethem’s plagiarism on plagiarism cleverly makes the point that literature is all about appropriation and re-imagining.
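
To make the idea of a “literary dragnet” a bit more concrete, here is a minimal sketch (not the method used by Google Books or by any plagiarism-detection product) of one simple way to surface word sequences that two texts share. The function names `word_ngrams` and `shared_phrases` are made up for this illustration, and the second sample sentence is invented; only the Shakespeare line is real.

```python
import re


def word_ngrams(text, n=5):
    """Return the set of lowercase word n-grams found in text."""
    # Crude tokenizer: keep runs of letters and straight apostrophes only.
    words = re.findall(r"[a-z']+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_phrases(text_a, text_b, n=5):
    """Return the n-word sequences that appear in both texts, sorted."""
    return sorted(word_ngrams(text_a, n) & word_ngrams(text_b, n))


# The first string quotes Cymbeline; the second is an invented sentence
# that borrows from it, standing in for a later author's text.
earlier = "Fear no more the heat o' the sun, nor the furious winter's rages."
later = "As the old song has it, fear no more the heat o' the sun."

for phrase in shared_phrases(earlier, later, n=5):
    print(phrase)
```

Exact matching like this only catches verbatim borrowing. It says nothing about whether a shared phrase is quotation, allusion, coincidence, or theft, which is the interpretive question raised above.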

Curious about the capabilities of Google Books to track literary inheritance, I compared Google Books editions of Washington Irving’s Sketch Book, Donald Grant Mitchell’s Reveries of a Bachelor, Melville’s Pierre, and Henry James’ The Lesson of the Master, The Death of the Lion; The Next Time, and Other Tales. (I know that Google Books is itself the subject of controversy, a topic I plan to take up in a future post.) What Google Books, the Open Content Alliance, and other huge digital collections offer is lots and lots of data that one could mine to study literary inheritance, the evolution of ideas, and a lot more. I hoped to use Google Books to discover salient features not only of these works, but of the works preceding and succeeding them on the literary family tree. Google Books provides handy (if limited) tools for detecting literary influence and seeing a snapshot of a work through its “About This Book” feature, a sort of reference page gathering together key words, popular phrases, and references from other works. Google Books isolates 20 unique “Key words and phrases” (such as “sleepy hollow” and “diedrich knickerbocker”), as well as three key words associated with each chapter. Proper names seem to be the most frequent key words, so I wasn’t too surprised that the only overlap in key words among the four works was that both The Sketch Book and Pierre use variations of “mourn” and “methinks,” reflecting perhaps the melancholic tone of both works. Popular Passages seems to hold more promise for influence-tracking, since it enables readers to “follow the literary memes* that appear again and again in the world of books” by highlighting the ten passages that appear most frequently in other works, then by providing links to all of those works. 
For Irving and Mitchell, the most frequent “popular passages” were quotations from other works (such as Thomas Gray’s poetry) and passages of their own works that were frequently anthologized, while for James and Melville the popular passages were those most frequently cited by critics or represented in other editions of their works. Through popular passages, you can get a quick sense of how the authors were received in their time and ours and glimpse how a work fits into the larger network of literary influence. To understand how critics and commentators have received The Sketch Book, you can also examine “References from books” (which presumably searches Google Books for citations of the work), “References from scholarly works” (which presumably searches Google Scholar), and “References from web pages” (which presumably searches Google, in a limited way). If, say, The Sketch Book and Pierre were referenced by the same works, that might indicate some kind of kinship. For selected books (in my very small sample, only the travel-focused The Sketch Book), Google Books also provides a map of places mentioned in the book. The map for The Sketch Book shows dozens of markers clustered in Great Britain, revealing at a glance what an Anglophile Irving was. You could compare maps of several books to see if they share similar settings or itineraries.
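
The key-word comparison described above was done by eye, but it could be sketched in a few lines of code. The example below is an illustration under loud assumptions: the key-word lists are typed in by hand (Google Books offered no export that I know of), the Pierre list includes a placeholder entry, and `pairwise_overlap` is a hypothetical helper name. It reports the shared terms for each pair of works along with a Jaccard score (shared terms divided by all distinct terms).

```python
from itertools import combinations


def normalize(term):
    """Lowercase a key word and collapse internal whitespace."""
    return " ".join(term.lower().split())


def pairwise_overlap(keywords_by_work):
    """Map each pair of titles to (shared terms, Jaccard score)."""
    sets = {
        title: {normalize(term) for term in terms}
        for title, terms in keywords_by_work.items()
    }
    result = {}
    for a, b in combinations(sorted(sets), 2):
        shared = sets[a] & sets[b]
        union = sets[a] | sets[b]
        score = len(shared) / len(union) if union else 0.0
        result[(a, b)] = (sorted(shared), score)
    return result


# Hand-entered sample data, not an actual Google Books export.
sample = {
    "The Sketch Book": ["Sleepy Hollow", "Diedrich Knickerbocker",
                        "mourn", "methinks"],
    "Pierre": ["methinks", "mourn", "placeholder term"],
}

for pair, (shared, score) in pairwise_overlap(sample).items():
    print(pair, shared, round(score, 2))
```

The same function would work on the “References from books” lists: feed it the titles that cite each work, and a high overlap between two works might hint at the kind of kinship suggested above.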

After spending several hours playing with About This Book, I didn’t find much evidence for my theory of bachelor genealogy–perhaps I would need a more specialized tool to turn up that kind of evidence. However, it does seem that Google Books could be quite useful for scholars interested in book history, since it includes book advertisements that appear in back matter, catalogs of library and personal collections, and back issues of Publishers Weekly, as well as the aforementioned features that allow you to see how the book has been referenced. By providing access to so much of the cultural record, Google Books can allow scholars to broaden the scope of their inquiry and find connections among works. While I recognize Google Books’ potential as a scholarly tool, I can see several ways to improve it for scholars:

  • Allow users to see more than 10 “popular passages” and 20 keywords; for power users, expose all available data (in a user-friendly way, of course)
  • Be more transparent about how these different features work. I searched the Google Books site for an explanation of the different “About This Book” features, but couldn’t find much beyond a few sentences in the Book Search Blog. If I’m going to trust a tool, I’d like to know how it works.
  • Make it easier to search within “popular passages.” Analyzing popular passages can be quite time-consuming, especially when a passage is cited in 860 books, as is a quotation from Cymbeline that Irving includes in The Sketch Book. Ideally you could search inside your “popular passages” results to see if, say, Melville cited the same passage from Shakespeare.
  • Help readers sort out different editions of the same work. As a book history/textual studies gal, I like the fact that Google Books includes multiple editions of a work, even if the choice of editions included seems to be more an accident of what a library holds than a scholarly decision about which are the key editions. But once you start comparing the “About This Book” feature for those different editions, things get awfully confusing. When I compared two different versions of Reveries of a Bachelor, the 1852 L.C. Page edition and the 1906 Fenno edition, the data was quite different. Perhaps the text of the work changed substantially as different publishers came out with their own editions (neither is the authorized edition first published by Scribner in 1850), or maybe something’s wacky about the way that Google is generating this information. In any case, you get different popular passages (the Page edition includes lots of advertisements for other books published by the same company), different books/scholarly works that refer to this work, and different key words. Curious. So I guess I’d like a “compare editions” tool to reveal what’s really going on…
  • So this request may be a little pie in the sky, but I sure would love a way to know what works Google Books is not searching, or what it might be missing because of OCR errors. When my colleague Jane Segal and I studied humanities scholars’ use of digital resources, several worried that works not in digital form would be neglected and that scholarship would suffer from a sort of ignorance of the analog. I dunno–maybe such a list could be generated by comparing WorldCat records with Google Books? In any case, scholars need to be conscious of the limits of Google Books.
  • Fix errors in generating links to other books. For some reason, if you try to follow “References from books” past the first results page, you get an error that looks like this: “Your search – cites:0F_xecYtdwg1CDwRxfl_QZ – did not match any documents.” Argh!
  • Make it easier to download full-text of books (PDFs are handy, full-text is better).
  • Ensure free access to the full-text of public domain books. I got a little scared when I heard that Google Books would start charging for full-text, but that plan seems to focus on books that are part of its publisher partner program, not its library scanning project.
  • Experiment with visualization tools. For instance, it would be cool to play with some kind of social network graph or citation network showing all of the authors who cited Irving and were cited by him.
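
As a rough sketch of what the data behind that last wish might look like, a citation network can be stored as two lookup tables: who each author cites, and who cites each author. Everything here is an assumption for illustration: the function name `build_citation_graph` is made up, and the edges are entered by hand from the relationships mentioned in this post rather than pulled from Google Books.

```python
from collections import defaultdict


def build_citation_graph(citations):
    """Build 'cites' and 'cited_by' lookups from (citing, cited) pairs."""
    cites = defaultdict(set)
    cited_by = defaultdict(set)
    for citing, cited in citations:
        cites[citing].add(cited)
        cited_by[cited].add(citing)
    return cites, cited_by


# Hand-entered edges based on this post, not data from Google Books.
edges = [
    ("Mitchell", "Irving"),   # Mitchell's tribute at the Irving Centenary
    ("Melville", "Irving"),   # Melville's dismissal of Irving's approach
    ("James", "Mitchell"),    # James recalls reading Mitchell
    ("James", "Melville"),    # ...and Melville
    ("Irving", "Gray"),       # Irving quotes Thomas Gray
]

cites, cited_by = build_citation_graph(edges)
print("Cited by Irving:", sorted(cites["Irving"]))
print("Citing Irving:", sorted(cited_by["Irving"]))
```

A graph-drawing library could render these pairs as nodes and arrows. The harder part is getting reliable (citing, cited) pairs in the first place, which depends on data Google would have to expose.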

Maybe it’s not fair to expect Google to turn Google Books into a scholarly resource–perhaps it’s better for scholars to develop their own applications to analyze Google Books. I’m encouraged by news that members of the digital humanities community have been in conversation with Google. Dan Cohen makes a persuasive case for Google Books providing an API that would allow researchers to mine and manipulate information. Despite my criticisms, I love poking around “About This Book,” which provides a great way to get a snapshot view of how a work fits into the literary ecosystem.

I had planned to discuss my experiments using text analysis tools to trace literary influence, but I’ve gone on way way too long already, so I’ll save that for a future post.

7 responses to “Literary DNA and Google Books”

  1. Have you seen Wmatrix, developed by a group of language and literature scholars at Lancaster University?

    Apart from a trial period, it’s not free, sadly (lack of funding), but it’s a rather clever tool when it comes to visualising keyword and thematic connections between texts. I haven’t quite worked out a use for it at work that would justify the cost, but I got to play with it last summer and was very impressed with its potential for historians. Especially as it was very simple to use (being web-based), unlike various pieces of free software I’ve attempted to install and work with since then.

  2. Thanks! Looks promising. I’ll check it out.

  3. It’s very interesting. Influence is certainly some kind of intertextuality. Looking for similar bits of phrases or for text patterns could be a way to identify influence, and of course digital full-text collections are helpful. But it is kind of superficial. I tried to tackle the problem of intertextuality and how to recognize it (is it really quotation, allusion, or influence? or has one author simply chosen similar words or thought similar thoughts as another?) in the first chapters of my (German) dissertation, “Es gibt für mich keine Zitate”: Intertextualität im dichterischen Werk Ingeborg Bachmanns, published in Tübingen, 2002.

  4. Good point–intertextuality is probably a more sophisticated way of thinking about the issues that have been swirling about in my sometimes muddled mind. Your dissertation sounds very interesting–I wish I knew German!

  5. Found your blog via Steve Ramsay’s post to humanist. There is a “compare editions” tool that works quite well for texts in the wild, NINES’s JUXTA. Google Books and other scanned texts (Making of America, Internet Archive) are great because they have little formal markup. Marked up texts (e.g., DocSouth) can be used also after an identity style sheet template removes the markup. I’ve recently compared 3 versions of Jewett’s Pointed Firs and two versions of Evans’s St. Elmo. After the 6 or 8 hours learning the tool (save 2 by reading the manual), you can compare two versions of a text with relative ease.

  6. Thanks for the link and suggestion to use Gliffy. Let us know if you have feedback or suggestions about our program,
    debik at gliffy dot com

  7. Pingback: Using Text Analysis Tools for Comparison: Mole & Chocolate Cake « Digital Scholarship in the Humanities
