Monthly Archives: June 2008

Using Text Analysis Tools for Comparison: Mole & Chocolate Cake

How can text analysis tools enable researchers to study the relationships between texts? In an earlier post, I speculated about the relevance of such tools for understanding “literary DNA”–how ideas are transmitted and remixed–but as one reader observed, intertextuality is probably a more appropriate way of thinking about the topic. In my dissertation, I argue that Melville’s Pierre represents a dark parody of Mitchell’s Reveries of a Bachelor. Melville takes the conventions of sentimental bachelor literature, mixes in elements of the Gothic and philosophic/theological tracts, and produces a grim travesty of bachelor literature that makes the dreaming bachelor a trapped quasi-husband, replaces the rural domestic manor with a crowded urban apartment building, and ends in a real, Hamlet-intense death scene rather than the bachelor coming out of reverie or finding a wife. Would text analysis tools support this analysis, or turn up patterns that I had previously ignored?

I wanted to get a quick visual sense of the two texts, so I plugged them into Wordle, a nifty word cloud generator that enables you to control variables such as layout, font and color. (Interestingly, Wordle came up with the perfect visualizations for each text at random: Pierre white type on a black background shaped into, oh, a chess piece or a tombstone, Reveries a brighter, more casual handwritten style, with a shape like a fish or egg.)

Wordle Word Cloud for Pierre

Wordle Reveries Word Cloud

Using these visual representations of the most frequent words in each book enabled me to get a sense of the totality, but then I also drilled down and began comparing the significance of particular words. I noted, for instance, the importance of “heart” in Reveries, which is, after all, subtitled “A Book of the Heart.” I also observed that “mother” and “father” were given greater weight in Pierre, which is obsessed with twisted parental legacies. To compare the books in even more detail, I decided to make my own mashed-up word cloud, placing terms that appeared in both texts next to each other and evaluating their relative weight. I tried to group similar terms, creating a section for words about the body, words about feeling, etc. (I used crop, copy and paste tools in Photoshop to create this mashup, but I’m sure–or I sure hope–there’s a better way.)

Comparison of Reveries and Pierre

(About three words into the project, I wished for a more powerful tool to automatically recognize, extract and group similar words from multiple files, since my eyes ached and I had a tough time cropping out words without also grabbing parts of nearby words. Perhaps each word would be a tile that you drag over to a new frame and move around; ideally, you could click on the word and open up a concordance.) My mashup revealed that in many ways Pierre and Reveries have similar linguistic profiles. For instance, both contain frequently-occurring words focused on the body (face, hand, eye), time (morning, night), thinking, feeling, and family. Perhaps such terms are common in all literary works (one would need to compare these works to a larger literary corpus), but they also seem to reflect the conventions of sentimental literature, with its focus on the family and embodied feeling (see, for instance, Howard).
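The side-by-side grouping done by hand above could be roughed out in a few lines of Python. This is only a minimal sketch, not a description of how Wordle or TAPOR work: the `tokenize` and `shared_top_words` helpers, the tiny stopword list, and the sample sentences (made up, not drawn from either book) are all hypothetical stand-ins.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "his", "her",
             "my", "your", "on", "with", "was", "is"}

def tokenize(text):
    """Lowercase the text and return its alphabetic words, minus stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def shared_top_words(text_a, text_b, n=50):
    """Return the words that rank among the n most frequent in BOTH texts,
    each paired with its count in text A and its count in text B."""
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    top_a = {w for w, _ in counts_a.most_common(n)}
    top_b = {w for w, _ in counts_b.most_common(n)}
    return [(w, counts_a[w], counts_b[w]) for w in sorted(top_a & top_b)]

# Made-up sample sentences standing in for the two books.
sample_a = "My heart leaped; her hand was in my hand, and the night was mild."
sample_b = "His hand shook in the night; his mother saw his face and his hand."

for word, count_a, count_b in shared_top_words(sample_a, sample_b):
    print(f"{word:10} {count_a:3} {count_b:3}")
```

With full texts, raising or lowering `n` controls how much of each book's vocabulary is considered; grouping the shared words into themes (body, time, feeling) would still be a manual, interpretive step.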

The word clouds enabled me to get an initial impression of key words in the two books and the overlap between them, but I wanted to develop a more detailed understanding. I used TAPOR’s Comparator to compare the two texts, generating a complete list of how often words appeared in each text and their relative weighting. When I first looked at the word list, I was befuddled:

Words | Reveries counts | Reveries relative counts | Pierre relative | Pierre counts | Relative ratio Reveries:Pierre
blaze | 45 | 0.0007 | 0 | 1 | 109.4667

What does the relative ratio mean? I was starting to regret my avoidance of all math and stats courses in college. But after I worked with the word clouds, the statistics began to make more sense. Oh, relative ratio means how often a word appears in the first text versus the second–“blaze” is much more prominent in Reveries. Ultimately I trusted the concreteness and specificity of numbers more than the more impressionistic imagery provided by the word cloud, but the word cloud opened up my eyes so that I could see the stats more meaningfully. For instance, I found that mother indeed was more significant in Pierre, occurring 237 times vs. 58 times in Reveries. Heart was more important in Reveries (a much shorter work), appearing 199 times vs. 186 times in Pierre. I was surprised that “think” was more significant in Reveries than in Pierre, given the philosophical orientation of the latter. With the details provided by the text comparison results, I could construct an argument about how Melville appropriates the language of sentimentality.
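To make the arithmetic concrete, here is a minimal Python sketch of a relative ratio, assuming it is simply a word’s share of the first text divided by its share of the second. That is an assumption about how TAPOR’s Comparator arrives at its figure, not a description of its actual code; the `word_counts` and `relative_ratio` helpers and the sample strings are hypothetical. On that reading, the ratio for “blaze” is 109.4667 rather than 45 because the two books differ in length: 109.4667 / 45 ≈ 2.43, which would make Pierre roughly 2.4 times as long as Reveries.

```python
import re
from collections import Counter

def word_counts(text):
    """Count the lowercase alphabetic words in a text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def relative_ratio(word, counts_a, counts_b):
    """Relative frequency of `word` in text A divided by its relative
    frequency in text B. Returns None when the word never appears in B,
    because the ratio is undefined in that case."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("both texts must contain at least one word")
    rel_a = counts_a[word] / total_a   # share of text A taken up by the word
    rel_b = counts_b[word] / total_b   # share of text B taken up by the word
    if rel_b == 0:
        return None
    return rel_a / rel_b

# Made-up sample strings, not passages from either book.
text_a = "the blaze the blaze the fire"                      # 6 words, "blaze" twice
text_b = "the fire and the blaze and the dark night fell"    # 10 words, "blaze" once

counts_a, counts_b = word_counts(text_a), word_counts(text_b)
print(f"{relative_ratio('blaze', counts_a, counts_b):.2f}")  # prints 3.33
```

Dividing by each text’s total is what lets a short book and a long one be compared fairly: the raw counts here are only 2 to 1, but the ratio is 3.33 because the first sample is shorter.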

But the differences between the two texts are perhaps even more interesting than their similarities, since they show how Melville departed from the conventions of male sentimentalism, embraced irony, and infused Pierre with a sort of gothic spiritualism. These differences are revealed more fully in the statistics than in the word clouds. A number of terms are unique to each work. For instance, sentimental terms such as “sympathies,” “griefs,” and “sensibility” appear frequently in Reveries but never in Pierre, as do romantic words such as “flirt,” “sparkle,” and “prettier.” As is fitting for Melville, Pierre’s unique language is typically darker, more archaic, abstract, and spiritual/philosophical, and obsessed with the making of art: “portrait,” “writing,” “original,” “ere,” “miserable,” “visible,” “invisible,” “profound(est),” “final,” “vile,” “villain,” “minds,” “mystical,” “marvelous,” “inexplicable,” “ambiguous.” (Whereas Reveries is subtitled “A Book of the Heart,” Pierre is subtitled “The Ambiguities.”) There is a strand of darkness in Mitchell–he uses “sorrow” more than Melville–but then Mitchell uses “pleasure” 14 times to Melville’s 2 times and “pleasant” 43 times. Reveries is more self-consciously focused on bachelorhood; Mitchell uses “bachelor” 28 times to Melville’s 5. Both authors refer to dreaming; Mitchell uses “reveries” 10 times, Melville 7. Interestingly, only Melville uses “America” (14 times).

Looking over the word lists raises all sorts of questions about the themes and imagery of each work and their relationship to each other, but the data can also be overwhelming. If comparing two works yields over 10,000 lines in a spreadsheet, what criteria should you use in deciding what to select (to use Unsworth’s scholarly primitive)? What happens when you throw more works into the mix? I’m assuming that text mining techniques will provide more sophisticated ways of evaluating textual data, allowing you to filter data and set preferences for how much data you get. (I should note that you can exclude terms and set preferences in TAPOR).

Text analysis brings attention to significant features of a text by abstracting those features–for instance, by generating a word frequency list that contains individual words and the number of times they appear. But I kept wondering how the words were used, in what context they appeared. So Melville uses “mother” a lot–is it in a sweetly sentimental way, or does he treat the idea of mother more complexly? By employing TAPOR’s concordance tool, you can view words in context and see that Mitchell often uses mother in association with words like “heart,” “kiss,” “lap,” while in Melville “mother” does appear with “Dear” and “loving,” but also with “conceal,” “torture,” “mockingly,” “repelling,” “pride,” “cruel.” Hmmm. In Mitchell, “hand” most often occurs with “your” and “my,” signifying connection, while “hand” in Pierre is more often associated with action (hand-to-hand combat, “lift my hand in fury,” etc) or with putting hand to brow in anguish. Same word, different resonance. It’s as if Melville took some of the ingredients of sentimental literature and made something entirely different with them, enchiladas mole rather than a chocolate cake.
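For readers curious about what a concordance does under the hood, here is a minimal keyword-in-context sketch in Python. It is not TAPOR’s implementation; the `concordance` and `collocates` helpers, the four-word default window, and the sample sentence (made up, not quoted from either book) are all hypothetical.

```python
import re
from collections import Counter

def concordance(text, keyword, window=4):
    """Return keyword-in-context lines: up to `window` words on either
    side of each occurrence of `keyword` (matched case-insensitively)."""
    words = re.findall(r"[A-Za-z'-]+", text)
    lines = []
    for i, w in enumerate(words):
        if w.lower() == keyword.lower():
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + 1:i + 1 + window])
            lines.append(f"{left} [{w}] {right}".strip())
    return lines

def collocates(text, keyword, window=4):
    """Count the words that appear within `window` words of `keyword`."""
    words = [w.lower() for w in re.findall(r"[A-Za-z'-]+", text)]
    counts = Counter()
    for i, w in enumerate(words):
        if w == keyword.lower():
            counts.update(words[max(0, i - window):i])
            counts.update(words[i + 1:i + 1 + window])
    return counts

# Made-up sample sentence, not a quotation from Mitchell or Melville.
sample = "The child ran to his mother and kissed her. His mother smiled."

for line in concordance(sample, "mother", window=2):
    print(line)
print(collocates(sample, "mother", window=2).most_common(3))
```

Running `collocates` on each book and comparing the top neighbors of a word like “mother” or “hand” would be one rough way to surface the different company a shared word keeps, though reading the concordance lines themselves remains the real interpretive work.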

Word clouds, text comparisons, and concordances open up all sorts of insights, but how does one use this evidence in literary criticism? If I submitted an article full of word count tables to a traditional journal, I bet the editors wouldn’t know what to do with it. But that may change, and in any case text analysis can inform the kind of arguments critics make. My experience playing with text analysis tools verifies, for me, Steve Ramsay’s recommendation that we “reconceive computer-assisted text analysis as an activity best employed not in the service of a heightened critical objectivity, but as one that embraces the possibilities of that deepened subjectivity upon which critical insight depends.”

Works Cited

Howard, June. “What Is Sentimentality?” American Literary History 11.1 (1999): 63-81. 22 Jun 2008.

Ramsay, Stephen. “Reconceiving Text Analysis: Toward an Algorithmic Criticism.” Literary and Linguistic Computing 18.2 (2003): 167-174. 27 Nov 2007.


THAT Camp Takeaways

My work has been so all-consuming lately that it feels like THAT Camp was months rather than a couple of weeks ago, but I wanted to offer a few observations about THAT Camp before they go completely stale. Like many others, I found THAT Camp much more satisfying than the typical academic conference, since it promoted a strong sense of community (in part by using technologies such as pre-conference blogging and Twitter), was organized around the interests of participants, and encouraged the open exchange of ideas. Academic conferences typically have three functions: 1) to disseminate new ideas; 2) to bring people together to explore those ideas (and share a few beers in the process); and 3) to provide a line on the CV certifying that a scholar is actually making contributions to the research community. THAT Camp excelled at fulfilling the first two functions, and I’m hopeful that search committees and tenure committees (at least in certain communities) will see THAT Camp on a CV and think, “Wow, this person is an innovator!” Besides, the ideas generated and collaborations formed at THAT Camp will likely lead to more lines (academic merit badges?) on CVs.

I don’t have the time—and the reader probably doesn’t have the patience—to describe everything I learned at THAT Camp, but I wanted to highlight a few of the most intriguing projects or compelling ideas.

1) It’s the people, stupid.

I helped to organize a session on emerging research methods and expected that we would focus on how technologies such as visualization and text mining are opening up new approaches to scholarly inquiry. Instead, we spent most of our time engaged in a fruitful discussion about the importance—and difficulty—of collaboration, positing it as the “scholarly primitive” missing from John Unsworth’s list of core research activities. Perhaps the defining statement of the session was one person’s observation that “the cyberinfrastructure is people.” As THAT Camp itself demonstrated, collaboration enables people to develop better ideas, share the workload, sustain projects, and ultimately have a greater impact in the field, but encouraging people to share requires changes in culture and incentive systems.

2) New tools are enabling people to share annotations, resources, and work.

If collaboration is a key research process, there are some really cool tools under development that will support it. For instance, Ben Brumfield demonstrated FromThePage, a tool that allows people (historians, genealogists, history buffs) to transcribe documents, zoom in on manuscript pages, collaborate with others to identify tasks and check their work, view subjects, and more. Travis Brown is working on eComma, which “will enable groups of students, scholars, or general readers to build collaborative commentaries on a text and to search, display, and share those commentaries online.” And then there’s Zotero 2.0, which will let researchers share their collections with others.

3) Through visualization tools, researchers can make sense of a vast amount of information.

For instance, Jeanne Kramer-Smyth demonstrated ArchivesZ, which enables users of archives to visualize how much material (e.g., how many linear feet) is available in an archive related to a particular topic.

4) GIS technologies offer real analytical power, showing changes across time and space, land ownership patterns, and much more.

In a rich session on GIS tools, Josh Greenberg demonstrated how an historical map of New York could be overlaid on a contemporary Google Map, enabling one to view the development of the city. Mikel Maron discussed Open Street Map, a free and open map of the world to which people regularly contribute data. And I was delighted to learn from Shekhar Krishnan that Zotero will be releasing a mapping plug-in that will allow you to view the publication location of works in a collection on a Google Map. I had planned to create my own Google Map showing where bachelor literature was published by extracting the necessary data from Zotero, but, hooray, now I don’t have to go through the extra work. (See for more cool GIS projects).

Research Methods Session at THAT Camp

This weekend I’m at THAT Camp, which is bringing together programmers, librarians, funding officers, project managers, mathematicians, historians, philosophers, literary scholars, linguists, etc. to discuss the digital humanities. Sponsored by the Center for History and New Media at George Mason University, THAT Camp is an un-conference, which means that ideas for sessions emerged organically out of blog posts preceding the gathering and out of a discussion held when the Camp began. As a result of all of the sharing of ideas via blogging and social networking via Twitter, the meeting seems much more intimate, open, and lively than your average conference. People who are passionate and curious about the digital humanities are coming together to talk about teaching, gaming, visualization, project sustainability, etc., and to learn how to hack Zotero and Omeka, build a simple history appliance, and more. As many folks have commented, the toughest part of THAT Camp is deciding which of the four sessions to attend–I want to go to them all. Kudos to CHNM for organizing and hosting the event–I bet some exciting initiatives and collaborations will come out of THAT Camp.

Yesterday afternoon I facilitated a session on research methods. At the request of some of the participants, I’m posting the rough notes I took during this rich discussion.

Touchstones/ pump priming quotations for the session:

  • “Research in the humanities, then, is and has been an activity characterized by the four Rs: reading, writing, reflection, and rustication. If these are the traditional research methods in the humanities, what will “new research methods” look like–and more importantly, why do we need them?”—John Unsworth, New Methods for Humanities Research
  • “The day will come, not that far off, when modifying humanities with ‘digital’ will make no more sense than modifying humanities with ‘print.’” –Steve Wheatley, ACLS
  • Unsworth, Scholarly Primitives: “some basic functions common to scholarly activity across disciplines, over time, and independent of theoretical orientation.” Unsworth lists the following scholarly primitives:
    • Annotating
    • Comparing
    • Discovering
    • Illustrating
    • Referring
    • Representing
    • Sampling
  • “What is a literary-critical ‘problem’? How is it different from a scientific ‘problem’?” –Steve Ramsay



  • Old method: Scholars would find things in the archive, bring them back, provide people w/ information.
  • New: scholars face a deluge of information.
  • Old assumption: info is hard to get to, need expertise to find stuff.
  • New: expertise shifts from finding to filtering and sorting
  • The point of a research method is figuring out how to filter, sort. A bibliography is not a list of Google links; you need to be familiar w/ major sources in field.
  • Experts know how to discern bias; filtering requires expertise.
  • Expertise=familiarity with conceptual/ theoretical approaches in field. Scholars get a sense of theoretical approach by looking at the bibliography—it’s metadata about the book
  • Scholars need to inform students about problems with resources they find. New problems arise with digital—important to know weaknesses of Google Books. Need to teach students to question how resource/ tool created—what it does and doesn’t do.
  • The student world is digital—they need to learn how to operate responsibly in it
  • Two webs: open access, proprietary/ walled off. Students need to be aware of it—not everything is in Google.
  • But it’s also important to meet students where they start—even faculty start with Google; make metadata open so it’s discoverable. Implications of stuff not being accessible—it’s ignored.
  • Old model: one expert; you had to read the one book on the subject. Now there are huge amounts of data, need multiple interfaces to all of it. Need to provide multiple pathways to data. RDF key.
  • If you’re used to doing something a particular way, it’s hard to change that.
  • Origins of print: first people to adopt print were different groups using it for their own agenda. Later library science came along to collect and curate content. Print media enabled new ways of doing existing scholarship. New disciplines developed, such as finding and keeping print materials (librarianship) and the study of books as physical objects. Same thing in shift to digital: there are specialists who focus on the technical side, like building tools. There are scholars, who want to use this stuff and don’t need to know the technical details.


  • At the recent New Horizons conference, Geoffrey Rockwell spoke on mass digitization and the process of research. Search is not that simple—there are multiple places to look. The problem of selection→ how do you decide what makes sense. Then there’s serendipity. How do scholars negotiate mass of stuff? How do they make sense of it, select it? Tools like Zotero help you to share & select info; then you leave Zotero and write paper separately. With textual analysis tools, there’s no way to take textual data and link to publication → you need a relationship to textual analysis work. Can integrated tools be developed so that discovery, search, data collection, analysis, etc. can be carried right through publication in journal, Omeka, etc?


  • Sharing should be one of the scholarly primitives. We’re sharing in new ways. The speed & scale of what you share is changing.
  • How do you cut across disciplines? People from different fields have different takes: literature vs history vs art; different methods, not much cross-fertilization
  • Pronetos: scholars throughout the world get a single place to go to network and engage with other scholars. Organic—if you’re an American historian, you can create an American history group if it doesn’t already exist. Takes on the problem of how to help people network.
  • Zotero Commons will facilitate sharing of expertise, as you can find an expert sharing a particular bibliography.
  • Opening up projects, creating communities around them helps with sustainability
  • Most transformative aspect of new research methods is establishing scholarly networks, collaborative aspect
  • How do you track your efforts in collaboration so that you can document what deserves to be rewarded?
  • Teach collaboration by modeling it for students
  • Sharing depends on discipline—people working on patents don’t necessarily share.
  • Humanists have trouble with sharing—for instance, some NINES users wanted to make tags private
  • Not sharing will become a problem in the long term, since it leads to duplication of effort and unnecessary competition. You can collaborate to come up with a better project.
  • Information gets out quickly, danger is in not sharing–that’s when you get scooped.
  • It’s not the technology that enables the sharing—it’s the people. There’s concern about retaining rights, getting credit, getting ripped off. People are building projects (e.g. institutional repositories) and users are not coming. How will people be encouraged to share?
  • People tend to share within discipline rather than institution.
  • What’s the relationship btwn repositories, blogs, Omeka installations, etc.? Importance of data aggregation, globalization.
  • Cyberinfrastructure is people
  • They’ve been pushing knowledge management practices in the business world for decades, and they still haven’t cracked it.
  • Mashups—pieces in place to make scholars see potential, but haven’t been realized yet.
  • With openly shared research, you facilitate interdisciplinarity and get research out to more people. Institutional repositories (IR) are key for this.
  • IRs are siloed—but w/ Zotero Commons, institution is everyone.
  • If you put your research out there, you’re staking it, not getting scooped.
  • If scholars blog their work at the early stage, they may wonder: are they putting it out too early?
  • What is it about sharing that’s changing over time?
  • Do humanities departments who want to do digital need a marketing department to help people discover their work?
  • Role of libraries as marketing depts., making resources accessible.
  • Professional societies need to step up b/c it’s not realistic for individual schools to do the marketing of digital scholarship.
  • Should professional societies launch their own version of Facebook?
  • We need to get away from the silos.
  • Peer review is a kind of social network.
  • Media Commons: social network for peer review of online texts using CommentPress, etc. Slashdot: reputation ranking, etc. (morphed into peer review)
  • Offer interfaces inflected given different disciplines: NINES, 18th C Connect
  • NINES an example of peer review for digital scholarship. 22 sites peer-reviewed by NINES—22 of first 105 to be put in MLA Bibliography.
  • Journal gives seal of approval; haven’t come up with that kind of stamp for digital world. Part of fear about blogging is that it’s not peer reviewed
  • Blogrolls are a form of peer review–to find good stuff, you look at Matt Kirschenbaum’s blog to see who he reads.
  • Rotunda/ digital publisher as stamp of approval.
  • There are different standards for digital and print. A Nature study of online peer review found it doesn’t work. But there were something like 40 comments in 6 months—isn’t that success, when in normal peer review it would take 1-2 years to get 3 comments? Why is there such a high bar for digital scholarship?
  • Noah Wardrip-Fruin’s peer review processes, different feedback overall from both online/blog-based and traditional peer review.
  • Scholarship over time: digital projects, when do they end?


  • How are traditional research methods tied to the printed book?
  • Interpretation: job of historian is to make sense of what things mean. We’re in the land grab stage right now—dump stuff online, then begin to wall it off. It’s still early—at the ground floor of something that could be big.
  • Historians typically narrativize events. At Miami U. they developed a tool to transform a short story into different genre—for instance, from horror to epic. Students learned elements of genre, wrote XSLT stylesheets to do the transformations.
  • Researchers could try out different narratives on data sets—picking out certain aspects. Historical narrativizing tied to print; digital enables historical multi-narrative. With digital, you can see what breaks when you change parameters.
  • Print to digital: transition from narrative to simulation, counter-factuals
  • How do you read? How many books do you have open?
    • Former practices: contraptions to hold multiple books open. Some ways of laying out books made them a database.
    • How does that work now? Ray Siemens: exploring idea of reading. Tools for document triage


  • The problem of naming a new digital humanities research center: Faculty advisers focused on the word “humanities”—what about social sciences, arts, etc.
  • When does the digital label drop out—or is it useful in defining what you do?
  • NEH Digital Humanities Office: NEH has been doing digital humanities for a long time: it funded TEI 20 years ago. But establishing the office helps to validate digital scholarship.
  • Specialists focus on certain areas of theory–we have the deconstruction scholars who specialize in the field, but their ideas permeate throughout the humanities. Similarly, digital humanists will be the lead group of folks who do digital work, but it will filter down into common research practice.
  • Digital humanities researchers need to make the case for a new methodology.
  • “Digital” useful b/c we are at an early stage: people still wonder what it means to be digital.
  • “Digital humanities” brings together technical skills and humanistic knowledge. Creating a DTD is a fascinating part of digital humanities; sounds like computer stuff, but it’s fundamentally humanities.
  • A tension: bibliography used to be core work, but that kind of work doesn’t necessarily get you tenure now. There’s real suspicion about whether this is truly humanities work.
  • Digital humanities includes tool developers, text encoders, people who use digital methods, as well as those who study digital culture, e.g. video games, underlying structures of social environment. Object they study is digital.
  • Divide between game/ film studies and textual digital humanities.
  • Jerry McGann: “humanists have always worked to preserve and interpret human record. Digital humanities is doing it in digital form.”
  • ADHO used to focus on the textual digital humanities, but is reaching out to digital theorists/ art, etc.
  • There’s a significant skill set to doing digital humanities work. Many scholars don’t really appreciate what it takes to produce digital resources—it’s not just scanning documents.
  • Theoreticians: need a little more dirt under their fingernails—they need to get experience doing these projects to inform their theorizing.