Using Text Analysis Tools for Comparison: Mole & Chocolate Cake

How can text analysis tools enable researchers to study the relationships between texts? In an earlier post, I speculated about the relevance of such tools for understanding “literary DNA”–how ideas are transmitted and remixed–but as one reader observed, intertextuality is probably a more appropriate way of thinking about the topic. In my dissertation, I argue that Melville’s Pierre represents a dark parody of Mitchell’s Reveries of a Bachelor. Melville takes the conventions of sentimental bachelor literature, mixes in elements of the Gothic and philosophic/theological tracts, and produces a grim travesty of bachelor literature that makes the dreaming bachelor a trapped quasi-husband, replaces the rural domestic manor with a crowded urban apartment building, and ends in a real, Hamlet-intense death scene rather than the bachelor coming out of reverie or finding a wife. Would text analysis tools support this analysis, or turn up patterns that I had previously ignored?

I wanted to get a quick visual sense of the two texts, so I plugged them into Wordle, a nifty word cloud generator that enables you to control variables such as layout, font and color. (Interestingly, Wordle came up with the perfect visualizations for each text at random: Pierre white type on a black background shaped into, oh, a chess piece or a tombstone, Reveries a brighter, more casual handwritten style, with a shape like a fish or egg.)

Wordle Word Cloud for Pierre

Wordle Reveries Word Cloud

Using these visual representations of the most frequent words in each book enabled me to get a sense of the totality, but then I also drilled down and began comparing the significance of particular words. I noted, for instance, the importance of “heart” in Reveries, which is, after all, subtitled “A Book of the Heart.” I also observed that “mother” and “father” were given greater weight in Pierre, which is obsessed with twisted parental legacies. To compare the books in even more detail, I decided to make my own mashed-up word cloud, placing terms that appeared in both texts next to each other and evaluating their relative weight. I tried to group similar terms, creating a section for words about the body, words about feeling, etc. (I used crop, copy and paste tools in Photoshop to create this mashup, but I’m sure–or I sure hope–there’s a better way.)

Comparison of Reveries and Pierre

(About three words into the project, I wished for a more powerful tool to automatically recognize, extract and group similar words from multiple files, since my eyes ached and I had a tough time cropping out words without also grabbing parts of nearby words. Perhaps each word would be a tile that you drag over to a new frame and move around; ideally, you could click on the word and open up a concordance.) My mashup revealed that in many ways Pierre and Reveries have similar linguistic profiles. For instance, both contain frequently occurring words focused on the body (face, hand, eye), time (morning, night), thinking, feeling, and family. Perhaps such terms are common in all literary works (one would need to compare these works to a larger literary corpus), but they also seem to reflect the conventions of sentimental literature, with its focus on the family and embodied feeling (see, for instance, Howard).

The word clouds enabled me to get an initial impression of key words in the two books and the overlap between them, but I wanted to develop a more detailed understanding. I used TAPOR’s Comparator to compare the two texts, generating a complete list of how often words appeared in each text and their relative weighting. When I first looked at the word list, I was befuddled:

Words   Reveries counts   Reveries relative counts   Pierre relative   Pierre counts   Relative ratio Reveries:Pierre
blaze   45                0.0007                     0                 1               109.4667

What does the relative ratio mean? I was starting to regret my avoidance of all math and stats courses in college. But after I worked with the word clouds, the statistics began to make more sense. Oh, relative ratio means how often a word appears in the first text versus the second–“blaze” is much more prominent in Reveries. Ultimately I trusted the concreteness and specificity of numbers more than the more impressionistic imagery provided by the word cloud, but the word cloud opened up my eyes so that I could see the stats more meaningfully. For instance, I found that mother indeed was more significant in Pierre, occurring 237 times vs. 58 times in Reveries. Heart was more important in Reveries (a much shorter work), appearing 199 times vs. 186 times in Pierre. I was surprised that “think” was more significant in Reveries than in Pierre, given the philosophical orientation of the latter. With the details provided by the text comparison results, I could construct an argument about how Melville appropriates the language of sentimentality.
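For readers as stats-shy as I was, the arithmetic behind those columns can be sketched in a few lines of Python. This is only a sketch under my own assumptions: that a "relative count" is a word's count divided by the total number of words in its text, and that the "relative ratio" is one relative count divided by the other. TAPOR's actual formula and tokenization may differ, and the function names and sample sentences below are invented for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into words (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def relative_counts(text):
    """Return (raw counts, relative counts) for every word in the text."""
    words = tokenize(text)
    counts = Counter(words)
    total = len(words)
    relative = {word: n / total for word, n in counts.items()}
    return counts, relative


def relative_ratio(word, relative_a, relative_b):
    """How many times more prominent `word` is in text A than in text B.

    Returns None when the word never appears in text B, because the
    ratio is undefined there (the word is unique to text A).
    """
    if relative_b.get(word, 0) == 0:
        return None
    return relative_a.get(word, 0) / relative_b[word]


# Tiny invented samples standing in for the full texts (not quotations).
sample_a = "the blaze warms the heart and the blaze fades"
sample_b = "the mother saw one blaze and the mother wept and the mother fled at night"

counts_a, rel_a = relative_counts(sample_a)
counts_b, rel_b = relative_counts(sample_b)

for word in ("blaze", "mother", "the", "heart"):
    print(word, counts_a[word], counts_b[word], relative_ratio(word, rel_a, rel_b))
```

Dividing by the length of each text is what makes the comparison fair: a raw count of 199 in a short book can signal more prominence than 186 in a much longer one.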

But the differences between the two texts are perhaps even more interesting than their similarities, since they show how Melville departed from the conventions of male sentimentalism, embraced irony, and infused Pierre with a sort of gothic spiritualism. These differences are revealed more fully in the statistics than the word clouds. A number of terms are unique to each work. For instance, sentimental terms such as “sympathies,” “griefs,” “sensibility” appear frequently in Reveries but never in Pierre, as do romantic words such as “flirt,” “sparkle,” and “prettier.” As is fitting for Melville, Pierre’s unique language is typically darker, more archaic, abstract, and spiritual/philosophical, and obsessed with the making of art: “portrait,” “writing,” “original,” “ere,” “miserable,” “visible,” “invisible,” “profound(est),” “final,” “vile,” “villain,” “minds,” “mystical,” “marvelous,” “inexplicable,” “ambiguous.” (Whereas Reveries is subtitled “A Book of the Heart,” Pierre is subtitled “The Ambiguities.”) There is a strand of darkness in Mitchell–he uses “sorrow” more than Melville–but then Mitchell uses “pleasure” 14 times to Melville’s 2 times and “pleasant” 43 times. Reveries is more self-consciously focused on bachelorhood; Mitchell uses “bachelor” 28 times to Melville’s 5. Both authors refer to dreaming; Mitchell uses “reveries” 10 times, Melville 7. Interestingly, only Melville uses “America” (14 times).

Looking over the word lists raises all sorts of questions about the themes and imagery of each work and their relationship to each other, but the data can also be overwhelming. If comparing two works yields over 10,000 lines in a spreadsheet, what criteria should you use in deciding what to select (to use Unsworth’s scholarly primitive)? What happens when you throw more works into the mix? I’m assuming that text mining techniques will provide more sophisticated ways of evaluating textual data, allowing you to filter data and set preferences for how much data you get. (I should note that you can exclude terms and set preferences in TAPOR).

Text analysis brings attention to significant features of a text by abstracting those features–for instance, by generating a word frequency list that contains individual words and the number of times they appear. But I kept wondering how the words were used, in what context they appeared. So Melville uses “mother” a lot–is it in a sweetly sentimental way, or does he treat the idea of mother more complexly? By employing TAPOR’s concordance tool, you can view words in context and see that Mitchell often uses mother in association with words like “heart,” “kiss,” “lap,” while in Melville “mother” does appear with “Dear” and “loving,” but also with “conceal,” “torture,” “mockingly,” “repelling,” “pride,” “cruel.” Hmmm. In Mitchell, “hand” most often occurs with “your” and “my,” signifying connection, while “hand” in Pierre is more often associated with action (hand-to-hand combat, “lift my hand in fury,” etc) or with putting hand to brow in anguish. Same word, different resonance. It’s as if Melville took some of the ingredients of sentimental literature and made something entirely different with them, enchiladas mole rather than a chocolate cake.
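The idea behind a concordance is simple enough to sketch in Python: find each occurrence of a keyword, keep a few words on either side, and then tally which words keep turning up nearby. This is a rough illustration of the general technique, not TAPOR's implementation; the function names, the window size, the stopword list, and the sample sentence (which is invented, not quoted from either book) are all my own assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into words (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def concordance(text, keyword, window=3):
    """Return (left context, keyword, right context) for each occurrence of `keyword`."""
    words = tokenize(text)
    lines = []
    for i, word in enumerate(words):
        if word == keyword:
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + 1:i + 1 + window])
            lines.append((left, word, right))
    return lines


def collocates(text, keyword, window=3, stopwords=frozenset()):
    """Count the words that fall within `window` words of `keyword`."""
    words = tokenize(text)
    nearby = Counter()
    for i, word in enumerate(words):
        if word == keyword:
            context = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
            nearby.update(w for w in context if w not in stopwords and w != keyword)
    return nearby


# Invented sample sentence for illustration only.
sample = ("Dear mother, said he. His loving mother smiled, "
          "yet the mother would conceal her pride.")
STOPWORDS = frozenset({"the", "his", "her", "he", "said", "yet", "would"})

# Keyword-in-context lines, with the keyword aligned in a column.
for left, word, right in concordance(sample, "mother"):
    print(f"{left:>25} [{word}] {right}")

# The most frequent neighbors of the keyword.
print(collocates(sample, "mother", stopwords=STOPWORDS).most_common(5))
```

Running the same two functions over each full text with the same keyword would give two lists of neighbors to set side by side, which is roughly the comparison of “mother” and “hand” described above.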

Word clouds, text comparisons, and concordances open up all sorts of insights, but how does one use this evidence in literary criticism? If I submitted an article full of word count tables to a traditional journal, I bet the editors wouldn’t know what to do with it. But that may change, and in any case text analysis can inform the kind of arguments critics make. My experience playing with text analysis tools verifies, for me, Steve Ramsay’s recommendation that we “reconceive computer-assisted text analysis as an activity best employed not in the service of a heightened critical objectivity, but as one that embraces the possibilities of that deepened subjectivity upon which critical insight depends.”

Works Cited

Howard, June. “What Is Sentimentality?” American Literary History 11.1 (1999): 63-81. 22 Jun 2008 <http://alh.oxfordjournals.org/cgi/content/citation/11/1/63>.

Ramsay, Stephen. “Reconceiving Text Analysis: Toward an Algorithmic Criticism.” Literary and Linguistic Computing 18.2 (2003): 167-174. 27 Nov 2007 <http://llc.oxfordjournals.org/cgi/content/abstract/18/2/167>.

8 responses to “Using Text Analysis Tools for Comparison: Mole & Chocolate Cake”

  1. Very interesting post, Lisa. I don’t think that anybody should submit articles full of word count tables to traditional literary journals. For me, that would amount to intellectual surrender. Running statistical analysis tools is, with a certain degree of necessary input adjustment, a rather trivial matter. What editors and readers are (or at least I hope should be) interested in is what the author does with the numbers. We should also keep in mind that statistical methods in literary studies existed long before the advent of the computer. In fact, the breakthrough of Russian formalism at the beginning of the 20th century was based on the idea of establishing an “objective” literary science that would study the devices of literariness (of that which makes literature literary). This actually turned out to involve, in many instances, a great deal of counting. Surely, it’s a piece of cake to classify the distribution of stressed vowels in a poem, but try counting by hand the number of times Dostoevsky used the adverb “suddenly” in Crime and Punishment. Yet people have done it and come up along the way with very smart, exciting readings based on very detailed textual analysis, all that before the proliferation of the digital text. Now, computers (which Ramsay nicely describes as incredibly powerful yet ultimately impotent) should keep doing what they’re good at, but we should also keep interpreting the data. That doesn’t mean that we should stop theorizing the machine or give up looking for novel, playful, irreverent, “deformative” ways of interacting with the text. I just hope that literary journals will never transform into glorified phone books.

  2. Pingback: Text Analysis of Venture Smith’s Narrative « history-ing

  3. Excellent commentary, Toma! I agree that the interpretive power that the scholar brings to the textual analysis is the point…

  4. I recently used similar (rudimentary) word counting methods to tackle the discourse used by a series of publications on Flemish political theatre from the 1970s.
    In order to gauge the usage context of a particular keyword—in my case this was “vormingstheater”, meaning political or pedagogical theatre—I counted the content words that occurred most often in the vicinity of the keyword. This produced a list of words that are “attracted” by the keyword.
    (I borrowed this method from LABBÉ, Dominique and Cyril, ‘How to Measure the Meanings of Words? Amour in Corneille’s Work’, in: Language Resources and Evaluation, 39.4 (2005), pp. 335-351.)
    The bottom line of my mini research project was that a word counting program allows you new ways to gather information about your texts. Indeed, it doesn’t cancel hermeneutics; there’s always an intelligent reader needed to process the newly formatted data. But still, it allows for wholly new ways of looking at a text you’ve read a hundred times before.
    I would propose that “human readers” automatically do close reading. It’s very difficult to skim through a document quickly, and still pick up all relevant information. On the other hand, the computer is the perfect “shallow reader”. It allows one to process a text swiftly and exhaustively, even if it only scratches the ‘word surface’ of the document.

  5. Hey, I like the distinction between shallow, computer-assisted reading and deep reading.

  6. Pingback: Comparator « Mercurius Politicus

  7. Pingback: Columbia University Libraries FYI » Too cool

  8. I find the title of the post appealing. It is interesting in a way that it made me pay close attention to the entire content. To be perfectly honest, my idea of text analysis is limited. Because of that, the first part of the post seemed to be vague. However, the rest of the post was able to enlighten me as I went on reading. I find the comparison thing very helpful for those who are into research and writing. The process might appear to be painstaking, but I can definitely see the purpose. I have read several articles about dissertations or admissions essays that are used for different scholarships. I know how it feels to apply for a college scholarship, but I am uncertain about how it feels to research and write an essay in order to qualify for a scholarship. Honestly, the thought of it makes me a little bit daunted. After reading your post, I know it is something that can be of incredible help for those people who go through scholarship applications. I also believe that they need to be familiar with this kind of text analysis before they finally decide to proceed to the scholarship finder process.
