Monthly Archives: December 2007

Literary DNA and Google Books

Can digital tools help us to trace literary inheritance and influence? In my dissertation, I claim that Washington Irving helped to originate a tradition of bachelor sentimentalism in American literature that Donald Grant Mitchell extended in Reveries of a Bachelor, Melville satirized in Pierre and “The Paradise of Bachelors and the Tartarus of Maids,” and Henry James complicated in “The Lesson of the Master” and other works. If this relationship were visualized in a family tree, it would look something like this:

Bachelor Pedigree

Note that Herman Melville is hanging out by Donald Grant Mitchell but isn’t really connected to the family tree–in part because he’s got kind of a dismissive step-child relationship to Washington Irving, in part because I just couldn’t figure out how to make siblings display in Ancestry.com. (I would have been better off using a more flexible tool like Gliffy, and Ancestry.com now thinks my name is Henry James, but live ‘n learn.) Note also that there are no mothers–appropriate, I suppose, for bachelor literature. Of course, such a family tree grossly oversimplifies literary inheritance and influence; for one thing, most authors have many literary ancestors, many fathers and mothers.

As I was tracing this literary lineage in my dissertation, I kept asking myself how I knew that this genealogy was real, how I could be sure that I wasn’t just making it all up? Although I couldn’t state with certainty that there was this bachelor genealogy (such certainty seems to be beyond the reach–and outside the aims–of literary interpretation), I found some pretty good evidence for it. I won’t rehash all of the arguments here, but as I explain in my dissertation, Mitchell, Melville and James commented on their literary forebears. Mitchell gave a tribute at the Washington Irving Centenary, Melville rejected Irving’s approach to literature as being a “self-acknowledged imitation of a foreign model,” and James remembered his “very young pleasure” in reading Mitchell and Melville. Then there are the similarities in voice (a smooth, subjective sentimentalism), setting (bachelor garrets and lonely hearths), character (the dreaming bachelor), etc.

Now I wonder: Would text analysis tools and massive collections of texts such as Google Books provide further evidence for this bachelor genealogy? By comparing abstractions and visualizations of my four core bachelor texts, would I be able to see a “family resemblance”–perhaps a unique turn of phrase that is repeated through the generations like funny toes or wavy red hair? And what would constitute reliable evidence of inheritance, anyway–similarity in word choice, narrative voice, character, structure, graphical design (i.e., if you’ve got Fabio on the cover, the book probably belongs to the romance tradition)? Is there such a thing as “literary DNA”?

And is “literary DNA” even a valid concept, or is it a scientific term misapplied to the literary realm? By “literary DNA,” I mean the unique characteristics that define a literary work, characteristics developed through the complex process of literary inheritance and creativity. A search for the term in the MLA Bibliography yields no results for “literary DNA” (although “literary influence” gets you over 1000), while Google Scholar shows only 9 results. Perhaps the term rings too much of a positivist approach to literary study. When you broaden the search to Google, though, about 1000 results are retrieved. For example: The Paris Review initially called its archive of 50 years of its interviews The DNA of Literature (USA Today), presumably because you can plumb it to find authors explaining the genealogy of their works. In the Netherlands, researchers at the Huygens Institute KNAW are developing The literary DNA: computer-assisted recognition of narrative elements, software that can recognize themes and motifs in literature. “Literary DNA” is probably most closely associated with Don Foster, the literary scholar who used textual analysis to help uncover Joe Klein’s identity as the “Anonymous” author of Primary Colors, Ted Kaczynski as the Unabomber, and Shakespeare as the author of the poem A Funerall Elegye in memory of the late Vertuous Maister William Peeter (erroneously, as it turned out). In Author Unknown (which I’m just starting to read), Foster argues, “The scientific analysis of a text–how a mind and a hand conspire to commit acts of writing–can reveal features as sharp and telling as anything this side of fingerprints and DNA” (4). I’m skeptical that a text can be analyzed scientifically, and textual analysis does not necessarily prove authorship definitively, as we see in the case of the poem mistakenly attributed to Shakespeare. Still, Foster’s methods are intriguing–he looks at word choice, punctuation, spelling, grammar, and sentence construction to discover the unique features of an author’s “linguistic system.” Romantic that I am, I dig the idea of literary scholar as sleuth.
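To make the idea of a “linguistic system” a bit more concrete, here is a minimal sketch of one common stylometric move: counting how often a writer uses a handful of function words and comparing the resulting profiles. This is not Foster’s actual method, just an illustration; the word list, the function names, and the two sample sentences are all my own inventions rather than quotations from any of the bachelor texts.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny list of function words; a real study
# would use a much longer, carefully chosen list.
FUNCTION_WORDS = ["the", "and", "of", "to", "a", "in", "that", "but", "upon", "which"]


def profile(text):
    """Return the relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine_similarity(p, q):
    """Cosine of the angle between two profiles (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    # Invented sample sentences, standing in for two full texts.
    sample_a = "The fire burned low, and the bachelor mused upon the embers."
    sample_b = "The bachelor sat by the hearth and dreamed of the past."
    print(round(cosine_similarity(profile(sample_a), profile(sample_b)), 3))
```

A score near 1 means the two profiles point in nearly the same direction; on its own that proves nothing about authorship or influence, which is part of why I remain skeptical.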

So how would you use the computer to study literary influence and inheritance, and what would be the implications of this approach to research? Certainly search tools can help us to track allusions and other signs of influence. In his excellent article “Googling the Victorians,” Patrick Leary describes how the editors of the Strouse Edition of Carlyle’s Writings use Google, Literature Online (LION), the online OED, and other electronic resources to identify unattributed quotations, sparing themselves from having to spend “many often fruitless hours in the library stacks.” By providing quick access to much of the cultural record, the Web may, as Leary argues, “lend new urgency to arguments about what constitutes evidence of literary or intellectual influence” (12). Massive digitization projects such as Google Books are exposing not only influence, but also outright plagiarism. In Slate, Paul Collins reports that a linguist working for Google Books found that a passage from Sacrificial Foundations (1899) was a plagiarism of a plagiarism. As Collins observes, “it may be only a matter of time before some enterprising scholar yokes Google Book Search and plagiarism-detection software together into a massive literary dragnet, scooping out hundreds of years’ worth of plagiarists–giants and forgotten hacks alike–who have all escaped detection until now.” But when is “plagiarism” creative re-invention? Jonathan Lethem’s plagiarism on plagiarism cleverly makes the point that literature is all about appropriation and re-imagining.
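For the curious, the core of such a “dragnet” can be sketched in a few lines: break two texts into overlapping runs of n words and see which runs they share. This is only a toy version under my own assumptions (the function names and the default of five words are mine, and real plagiarism-detection software is far more sophisticated), but it shows the basic idea of hunting for repeated turns of phrase.

```python
import re


def word_ngrams(text, n=5):
    """Return the set of n-word sequences in text, ignoring case and punctuation."""
    words = re.findall(r"[a-z']+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_phrases(text_a, text_b, n=5):
    """Return the n-word sequences that appear in both texts, sorted."""
    return sorted(word_ngrams(text_a, n) & word_ngrams(text_b, n))


if __name__ == "__main__":
    # Invented sentences, not quotations from any of the works discussed.
    a = "A quiet hearth and a lonely chair by the fire."
    b = "He kept a lonely chair by the fire all winter."
    for phrase in shared_phrases(a, b, n=4):
        print(phrase)
```

A shared phrase is only a lead, of course: it could signal allusion, a common source, a cliché, or plain coincidence, and sorting those out is still the scholar’s job.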

Curious about the capabilities of Google Books to track literary inheritance, I compared Google Books editions of Washington Irving’s Sketch Book, Donald Grant Mitchell’s Reveries of a Bachelor, Melville’s Pierre, and Henry James’ The Lesson of the Master, The Death of the Lion; The Next Time, and Other Tales. (I know that Google Books is itself the subject of controversy, a topic I plan to take up in a future post.) What Google Books, the Open Content Alliance, and other huge digital collections offer is lots and lots of data that one could mine to study literary inheritance, the evolution of ideas, and a lot more. I hoped to use Google Books to discover salient features not only of these works, but of the works preceding and succeeding them on the literary family tree. Google Books provides handy (if limited) tools for detecting literary influence and seeing a snapshot of a work through its “About This Book” feature, a sort of reference page gathering together key words, popular phrases, and references from other works. Google Books isolates 20 unique “Key words and phrases” (such as “sleepy hollow” and “diedrich knickerbocker”), as well as three key words associated with each chapter. Proper names seem to be the most frequent key words, so I wasn’t too surprised that the only overlap in key words among the four works was that both The Sketch Book and Pierre use variations of “mourn” and “methinks,” reflecting perhaps the melancholic tone of both works. Popular Passages seems to hold more promise for influence-tracking, since it enables readers to “follow the literary memes* that appear again and again in the world of books” by highlighting the ten passages that appear most frequently in other works, then by providing links to all of those works. 
For Irving and Mitchell, the most frequent “popular passages” were quotations from other works (such as Thomas Gray’s poetry) and passages of their own works that were frequently anthologized, while for James and Melville the popular passages were those most frequently cited by critics or represented in other editions of their works. Through popular passages, you can get a quick sense of how the authors were received in their time and ours and glimpse how a work fits into the larger network of literary influence. To understand how critics and commentators have received The Sketch Book, you can also examine “References from books” (which presumably searches Google Books for citations of the work), “References from scholarly works” (which presumably searches Google Scholar), and “References from web pages” (which presumably searches Google, in a limited way). If, say, The Sketch Book and Pierre were referenced by the same works, that might indicate some kind of kinship. For selected books (in my very small sample, only the travel-focused The Sketch Book), Google Books also provides a map of places mentioned in the book. The map for The Sketch Book shows dozens of markers clustered in Great Britain, revealing at a glance what an Anglophile Irving was. You could compare maps of several books to see if they share similar settings or itineraries.
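The key-word comparison I did by eye could be sketched roughly as follows, assuming you have typed each work’s key words into a set by hand (Google Books offered me no way to export them). The sets below contain only the terms mentioned in this post, simplified to a single spelling; the real lists have 20 entries each, and the function name is my own.

```python
from itertools import combinations

# Illustrative, hand-entered key-word sets. Only terms mentioned in the post
# appear here, and "variations" of a word are collapsed to one form.
key_words = {
    "The Sketch Book": {"sleepy hollow", "diedrich knickerbocker", "mourn", "methinks"},
    "Pierre": {"mourn", "methinks"},
}


def pairwise_overlap(sets_by_title):
    """Map each pair of titles to the key words the two works share."""
    return {
        (a, b): sets_by_title[a] & sets_by_title[b]
        for a, b in combinations(sorted(sets_by_title), 2)
    }


if __name__ == "__main__":
    for pair, shared in pairwise_overlap(key_words).items():
        print(pair, sorted(shared))
```

Adding the Mitchell and James key words to the dictionary would extend the comparison to all six pairs of works.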

After spending several hours playing with About This Book, I didn’t find much evidence for my theory of bachelor genealogy–perhaps I would need a more specialized tool to turn up that kind of evidence. However, it does seem that Google Books could be quite useful for scholars interested in book history, since it includes book advertisements that appear in back matter, catalogs of library and personal collections, and back issues of Publishers Weekly, as well as the aforementioned features that allow you to see how the book has been referenced. By providing access to so much of the cultural record, Google Books can allow scholars to broaden the scope of their inquiry and find connections among works. While I recognize Google Books’ potential as a scholarly tool, I can see several ways to improve it for scholars:

  • Allow users to see more than 10 “popular passages” and 20 keywords; for power users, expose all available data (in a user-friendly way, of course)
  • Be more transparent about how these different features work. I searched the Google Books site for an explanation of the different “About This Book” features, but couldn’t find much beyond a few sentences in the Book Search Blog. If I’m going to trust a tool, I’d like to know how it works.
  • Make it easier to search within “popular passages.” Analyzing popular passages can be quite time-consuming, especially when a passage is cited in 860 books, as is a quotation from Cymbeline that Irving includes in The Sketch Book. Ideally you could search inside your “popular passages” results to see if, say, Melville cited the same passage from Shakespeare.
  • Help readers sort out different editions of the same work. As a book history/textual studies gal, I like the fact that Google Books includes multiple editions of a work, even if the choice of editions included seems to be more an accident of what a library holds than a scholarly decision about which are the key editions. But once you start comparing the “About This Book” feature for those different editions, things get awfully confusing. When I compared 2 different versions of Reveries of a Bachelor, the 1852 L.C. Page edition and the 1906 Fenno edition, the data was quite different. Perhaps the text of the work changed substantially as different publishers came out with their own editions (neither is the authorized edition first published by Scribner in 1850), or maybe something’s wacky about the way that Google is generating this information. In any case, you get different popular passages (the Page edition includes lots of advertisements for other books published by the same company), different books/scholarly works that refer to this work, and different key words. Curious. So I guess I’d like a “compare editions” tool to reveal what’s really going on…
  • So this request may be a little pie in the sky, but I sure would love a way to know what works Google Books is not searching, or what it might be missing because of OCR errors. When my colleague Jane Segal and I studied humanities scholars’ use of digital resources, several worried that works not in digital form would be neglected and that scholarship would suffer from a sort of ignorance of the analog. I dunno–maybe such a list could be generated by comparing WorldCat records with Google Books? In any case, scholars need to be conscious of the limits of Google Books.
  • Fix errors in generating links to other books. For some reason, if you try to follow “References from books” past the first results page, you get an error that looks like this: “Your search – cites:0F_xecYtdwg1CDwRxfl_QZ – did not match any documents.” Argh!
  • Make it easier to download full-text of books (PDFs are handy, full-text is better).
  • Ensure free access to the full-text of public domain books. I got a little scared when I heard that Google Books would start charging for full-text, but that plan seems to focus on books that are part of its publisher partner program, not its library scanning project.
  • Experiment with visualization tools. For instance, it would be cool to play with some kind of social network graph or citation network showing all of the authors who cited Irving and were cited by him.
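To give a sense of what the “what’s missing” request in the list above might involve, here is a rough sketch that compares a list of catalog titles against a list of digitized titles. It assumes you have already exported both lists somehow (I don’t know of a ready-made way to get them from WorldCat or Google Books), and it matches on normalized titles only, which would be far too crude for real bibliographic work. The function names and the sample input are mine.

```python
def normalize(title):
    """Lowercase a title and collapse punctuation/whitespace for matching."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in title.lower())
    return " ".join(cleaned.split())


def missing_from_digital(catalog_titles, digitized_titles):
    """Return catalog titles with no normalized match in the digitized list."""
    digitized = {normalize(t) for t in digitized_titles}
    return [t for t in catalog_titles if normalize(t) not in digitized]


if __name__ == "__main__":
    # Made-up input for illustration; not a claim about actual coverage.
    catalog = ["The Sketch Book", "Pierre", "Reveries of a Bachelor"]
    digitized = ["the sketch-book", "PIERRE"]
    print(missing_from_digital(catalog, digitized))
```

A serious version would need to match on editions, authors, and dates as well as titles, which circles back to the edition-sorting problem above.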

Maybe it’s not fair to expect Google to turn Google Books into a scholarly resource–perhaps it’s better for scholars to develop their own applications to analyze Google Books. I’m encouraged by news that members of the digital humanities community have been in conversation with Google. Dan Cohen makes a persuasive case for Google Books providing an API that would allow researchers to mine and manipulate information. Despite my criticisms, I love poking around “About This Book,” which provides a great way to get a snapshot view of how a work fits into the literary ecosystem.

I had planned to discuss my experiments using text analysis tools to trace literary influence, but I’ve gone on way way too long already, so I’ll save that for a future post.

Imagining new tools for humanities scholars

In my last post, I noted that most humanities scholars seem to want pretty basic tools that are rooted in immediate needs, tools that would, for example, allow them to convert files from one format to another, easily compose and exchange documents that use Unicode fonts, and find information quickly. Such tools would save scholars time and spare them from frustration. Even as the digital humanities community acknowledges (and serves) the need for basic tools, I believe that it should also continue innovating by developing applications for analyzing, mining and visualizing texts; annotating and searching images and video; collecting and sharing digital objects; etc. As more and more scholarly resources become available in digital formats, I think that humanities scholars will recognize a pressing need for tools that help them manage, analyze and share huge masses of information. It’s just difficult for people to imagine exactly what these tools would do. As the DLF found in its 2004 Scholars’ Panel, “so unfamiliar is this area that we heard from several individuals that they had a hard time articulating precisely what they required from such tools, or what level of software creation skills or consultancy is available to them, and where. We are still in a stage where it is easier to react to an example of an existing tool than to dream them up ex nihilo.”

But dang, it sure is fun–and useful–to dream up new tools. At the 2005 Summit on Digital Tools for the Humanities, participants were deeply and playfully engaged as they sketched out tools to support interpretation, exploration of resources, collaboration, and (my favorite) Visualization of Space, Time and Uncertainty. I love seeing what kind of imaginative tools and hacks Bill Turkel will come up with in his blog Digital History Hacks, such as history appliances.

Here’s my own wish list for scholarly tools. Most of these ideas come out of my practical need for tools to help me find and manage digital information, as well as my curiosity about how a tool from one domain (say, music) might work when applied to the scholarly domain. I’m beginning to regard Zotero as my scholarly workbench, so I’m imagining a lot of these tools as add-ons to Zotero (without the expectation that they would necessarily be developed by the Zotero team or included as part of the standard release).

  • Bibliography Ripper: I’m trying to determine how many of the works I cited in my dissertation are now available electronically, which is incredibly labor intensive, despite my crude attempts to use search tools such as Rollyo to speed up the process. Ideally I could feed my bibliography into a bibliography ripper (OK, perhaps it would need a softer name) that would allow me to select which entries I’d like to dump into a Zotero collection. Then it would automatically go out and search Google Scholar, Open Worldcat, etc. for each resource. If full-text is available, it would be automatically downloaded into my Zotero collection; otherwise, the call number would be captured. Of course, such a tool would be useful not only in allowing me to assemble a collection of research materials that I used before “going digital,” but in pulling citations from the works of other scholars, a common research practice.
  • Recommender: I’m tantalized by the recommendation engine planned for the next release of Zotero. I’d love a tool that would recommend resources based on what I already have in my research collections, saving me from having to go out and find them myself.
  • Auto-summarizer: Matt Kirschenbaum recently gave a wonderful talk on The Remaking of Reading: Data Mining and the Digital Humanities where he described scholarly practices of “not reading” (skimming, looking at bibliographies, reading summaries by others, etc) and distant reading (“using statistical, quantitative methods to ‘read’ large volumes of text at a distance”). Given the volume of information I’m trying to deal with, I could really use a tool that would offer reliable summaries of works (particularly if an abstract isn’t available) and would let me judge quickly whether I need to read more deeply.
  • Shuffle scholarly playlists: When I listen to my iPod on shuffle, I often notice connections among songs and details I had previously overlooked; randomness seems to foster attention. I wonder if a similar effect could be achieved by putting my research collections on shuffle: would my critical attention be stimulated if I asked Zotero to give me a random article?
  • Authoring tools: I think one thing holding back digital scholarship is the lack of powerful, intuitive authoring tools. Sure, blogging and wiki software offers a number of advantages–ease of use, collaboration capabilities, etc. In particular, I see a lot of potential in WordPress. A student at Georgetown developed a well-designed, thoughtful “online research portfolio” about bachelorhood in nineteenth century American lit using WordPress. Developers are building WordPress plug-ins and themes geared towards scholarship, such as Courseware (which “enables you to manage a class with a WordPress blog”) and CommentPress (which “allows readers to comment paragraph by paragraph in the margins of a text”). I’m also a fan of the authoring tools provided by the open educational repository Connexions, which offers converters from Word to XML and a pretty simple edit-in-place interface. What I’m looking for, though, is a way to put together a layered, hypermedia scholarly work, kind of like a DVD with bonus materials. At this early stage, I envision having a track for my main argument, one for supporting materials (texts, images, audio, etc), one for a “making of” feature exploring the process of producing the project, and one for extras such as a Google Map showing where bachelor authors lived, a digital story using images and audio to explore literary bachelorhood, etc.
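Just to show that the Bibliography Ripper on this wish list is not pure fantasy, here is a bare-bones sketch of its logic. Everything here is hypothetical: the two lookup functions are stand-ins that the caller must supply (I’m not describing any real Google Scholar, WorldCat, or Zotero interface), and a real tool would also have to parse citations and cope with ambiguous matches.

```python
def rip_bibliography(entries, find_full_text, find_call_number):
    """Sort entries into those with full text and those needing a library trip.

    find_full_text and find_call_number are caller-supplied functions
    (hypothetical stand-ins for real catalog searches); each takes an entry
    string and returns a string or None.
    """
    collection = []
    for entry in entries:
        url = find_full_text(entry)
        if url is not None:
            # Full text found: record where it lives.
            collection.append({"entry": entry, "full_text": url})
        else:
            # No full text: fall back to capturing the call number.
            collection.append({"entry": entry, "call_number": find_call_number(entry)})
    return collection


if __name__ == "__main__":
    # Stub lookups backed by made-up data, standing in for real searches.
    full_texts = {"Entry A": "http://example.org/entry-a"}
    call_numbers = {"Entry B": "PS0000 .X0"}
    results = rip_bibliography(["Entry A", "Entry B"], full_texts.get, call_numbers.get)
    for item in results:
        print(item)
```

Passing the lookups in as functions keeps the sketch honest about what it doesn’t know: the hard part is the searching, not the sorting.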

Maybe my dream tools are already out there (or really out there) or are being developed–I’d love to find out!

What tools do humanities scholars need?

This week my colleague and I met with some of the leading philologists at Rice to discuss how they do their research and what tools would help make them more productive and innovative. I was primed to talk about text analysis and visualization tools, collaboration tools, collection-building tools, etc., but instead the conversation focused on much more bread-and-butter stuff. These scholars want:

  1. an easy way to convert from one file format to another. A religious studies scholar described the frustrating hours she put into converting files from Nota Bene to Word, hours that could have been spent doing research or writing. Others said that they have valuable files in obsolete formats and acknowledged printing out important documents so they would have at least some means of accessing them. Given the desperate need for an easy way to migrate file formats forward, I think scholars would embrace an easy-to-use batch conversion tool that would work with the sometimes-obscure file formats academics use. Bonus points for free, secure, long-term online storage of data. I think libraries have a real opportunity here to assist scholars in archiving and preserving their data, although scholars may understandably wish to retain custody over it.
  2. a reliable, commonly adopted word processor that handles Unicode well. As an Americanist, I’m blissfully unaware of the challenges of working with character sets such as Coptic, Ethiopic, Greek, Hebrew, etc., but these philologists struggle with font issues every day. Inputting characters is a pain, and exchanging files with publishers and others is even more of a hassle. Scholars said that they had to re-do work because their publishers didn’t have the right fonts installed on their systems. They also seemed to dislike Word, which is designed more for business applications. I started to wonder if Open Office could be adapted to meet scholars’ needs…
  3. a virtual reference desk of key texts in their field. Many foundational philology texts from the nineteenth century have not yet been digitized. Scholars said they would love to be able to consult these texts quickly, particularly dictionaries, lexicons, and other tools. Somewhat surprisingly (at least to a text encoder like me), they said that they wouldn’t require full text, just page images. (Interestingly, two of the three grants that were recently awarded through the NEH/IMLS’s Advancing Knowledge program focus on building reference/contextual tools: “Tufts will develop a digital reference tool allowing researchers and librarians to conduct context-based ‘smart searches’ of un-indexed words from existing databases in the Tufts Digital Library,” while “The University of California, Berkeley, in collaboration with the Queen’s University, Belfast, will develop a digital database of Irish studies materials to test three open-source digital tools. The Context Finder, Context Builder, and Context Provider tools will be aimed at establishing scholarly context.”)

Given that scholars’ most precious commodity is probably time, it makes perfect sense that they most desire tools that help them to be more efficient and avoid getting caught up in frustrating, tedious activities such as converting files, wrestling with word processing programs, and finding books in the library. This conversation echoed the results of a survey my colleague Jane Segal and I conducted in the spring investigating the impact of digital resources on humanities scholarship. Not surprisingly, scholars most commonly use technologies that serve their regular research practices. Of our 85 respondents, 100% use word processing programs, but only 36% use bibliographic software, and only 5% use text analysis tools. Our respondents most desired tools that would help them find resources more quickly: 88% wanted “Search tools that are powerful and easy to use” and “Search tools that go across multiple scholarly web sites,” but only 28% wanted text visualization tools, and only 13% ranked dynamic mapping/GIS tools as a priority.

I should emphasize that scholars are by no means hostile to cutting-edge visualization and analysis tools; they’re just not aware of them or aren’t sure that they would support their research practices. When we asked Whitman and Dickinson scholars what they thought of tools such as text visualization applications, they generally seemed intrigued, but they indicated that they would need to be persuaded that such tools would advance their research projects. That makes sense: except for a handful of “innovators” and “early adopters” eager to try something new (estimated by Rogers to be about 16% of the population), most folks are pretty pragmatic in their adoption of technologies. They need to be frustrated with current tools and convinced that investing time and money in adopting a new one will pay off in increased productivity. They will first adopt tools that help them do what they’ve always done, just better. According to a very interesting 2001 CLIR report on Scholarly Work in the Humanities and the Evolving Information Environment, humanities scholars are adopting technologies “that are enhancing many of their traditional work practices” (28). As Jerome McGann argues in Radiant Textuality, transformations in humanities research must be rooted in the core values and methods of the discipline: “the general field of humanities and education and scholarship will not take the use of digital technology seriously until one demonstrates how its tools improve the ways we explore and explain aesthetic works–until, that is, they expand our interpretational procedures” (xii).

So what drives humanities scholars to adopt new tools? Well, the research on this topic (one that I need to investigate further) seems to suggest that the tool needs to be easy to find and use, that people need to receive incentives and support, etc. Rather than get into all of that, though, let me offer two quick anecdotes:

  • “Research not re-search.” One of my job responsibilities is to run tech training workshops for faculty. Generally not even the promise of the opportunity to learn Really Useful Tools or feast on free lunch can lure people to these workshops, but a good number of humanities faculty and grad students showed up for my sessions on the wonderful, free, open source bibliographic tool Zotero. When I demonstrated how you could automatically download bibliographic information and articles from supported web sites, their faces lit up; they actually oohed and aahed. I felt like David Copperfield. The usefulness of such a tool was immediately apparent: “ohmygosh, I don’t have to copy and paste or type out citations; I can organize (and find!) my research much more easily. Hallelujah!” It was a little more difficult for the participants to grasp how they might use Zotero’s tagging functions, which resemble the schemas they use to organize their notes, but are different enough to prompt some confusion.
  • This summer I attended Ed Ayers’ keynote address at the Geography and the Humanities Symposium and was blown away by his presentation on visualizing dynamic temporal and geographical processes. The audience’s excitement was palpable–I heard a lot of “oh, wows.” The dazzle was due in part to Ed Ayers’ exceptional presentation skills, but also to the power of the visualization tools to illustrate change. I was impressed by how he made the unfamiliar familiar by comparing the dynamic visualizations he was showing to weather maps. As he demonstrated the applications, people could see patterns that would otherwise be hard to detect and thus understand the usefulness of such tools. (Ed Ayers and Will Thomas’ fascinating presentation on “Time, Space and History” at the 2006 Educause conference is available online.)

I guess what I conclude from all of this is pretty obvious: humanities scholars need tools that help them to be more productive–and more innovative, although it’s harder for many folks to imagine what’s possible until they see concrete demonstrations. These tools can seem somewhat magical until scholars start incorporating them into their regular practices. One scholar that I interviewed this summer suggested that digital humanists run workshops on new tools and digital collections at conferences such as the MLA and American Literature Association conferences. She said she and her colleagues are curious about new digital tools and collections, but they aren’t necessarily aware of them and don’t always know how to use them. Of course, the conversation needs to be two-way–tool developers need to understand the needs of scholars. With projects such as NINES, MONK, etc., we can find great models for scholar/developer collaborations (in many cases, digital humanists themselves are both scholars and developers). And I’ve been impressed by the ways that projects such as Zotero and TAPOR have produced handy tutorials and actively promoted themselves.