Woman vs. machine? Analyzing texts…

Since it took me five years before I could steel myself to look at my dissertation again, I had forgotten some of the main points that I made in it. To uncover key terms in Chapter 1, which explores the popular literature of bachelorhood in 19th-century America, I decided to use text analysis tools. By generating a list of frequently occurring terms, I figured that I could get a snapshot of my argument and, I hoped, have a handy list of search terms to use as I looked for other instances of bachelor literature. I also wanted to play with the tools so that I could better understand their capabilities and limitations. What patterns would the tools reveal? What terms did I use over and over, despite my best efforts to vary my vocabulary? And is word weight a useful measure of the significance of a concept? Wouldn’t the position of a word (for instance, in a heading or thesis paragraph) also matter, and shouldn’t synonyms be considered in the algorithm?

Before using any tools to automatically generate a list of commonly used terms, I decided to go through the chapter and construct my own list of key words. Then I used TAPOR’s Word Frequency tool to automatically generate a list of key terms. In comparing my list and TAPOR’s, I am struck by how I read the chapter through my own interpretive filter. Most of the terms that I included on my list are different descriptors for the bachelor figure in American literature, such as “detached,” “narcissist,” “luxury,” “metamorphosis,” etc. Not surprisingly, TAPOR’s list is much broader. Sure, it overlaps with my list by including terms commonly associated with the bachelor figure, such as “single,” “man,” “unmarried,” “pleasure,” and “sentiment.” But it also includes terms such as “author,” “narrator,” “literature,” “literary,” “writing,” “American,” and “identity,” terms that reflect my argument that anxieties over American authorship were reflected in discourse about bachelorhood. Likewise, the TAPOR list gives high ranking to words associated with domesticity such as “family,” “home,” and “love,” reflecting my argument that the bachelor stood outside family-centered domesticity but remade it on his own terms. Before running TAPOR, I did write a quick summary of my argument that includes terms such as “identity” and “authorship,” so I was certainly aware of how these ideas played into my argument; they just weren’t included in the list I made. But the TAPOR list also includes some words that reflect not so much my argument as my rhetorical style: for instance, “instance” (I apparently use that phrase a lot to provide examples), “according” (attributing sources), “suggests” (summarizing someone else’s argument), “typically” (avoiding the absolute statement), and “likewise” (comparing).
Noticing the language I use to make arguments reminds me of when I was recorded making a speech and became aware of the way I hung my head to the side and “ummed” as I spoke; I became more self-conscious about my style. I suppose what I’ve gotten out of this exercise, besides a handy list of keywords that I hope to use in conducting searches, is an initial confirmation of the claim that text analysis tools can help you to look beyond your own interpretive filter and see other patterns.
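
For anyone curious what a word frequency list involves under the hood, here is a minimal Python sketch. It is not how TAPOR works internally, which I don't know; the tokenizing rule, the tiny stop-word list, the function name word_frequencies, and the sample sentence are all assumptions made up for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real tool would use a much longer one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "that", "is", "i", "my", "as", "it"}


def word_frequencies(text, stop_words=STOP_WORDS):
    """Count how often each word occurs, ignoring case and stop words."""
    # Assumed tokenizing rule: runs of letters, optionally with internal apostrophes.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    return Counter(t for t in tokens if t not in stop_words)


# Invented sample text, standing in for a chapter.
sample = "The bachelor is single. The bachelor writes of home, and home suggests family."
freqs = word_frequencies(sample)

# Print the three most frequent remaining words with their counts.
for word, count in freqs.most_common(3):
    print(word, count)
```

A plain count like this treats every occurrence equally, so it says nothing about where a word appears or which words are synonyms; it also shows why stylistic words such as “instance” can rank highly unless they are filtered out.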

As much as I like the TAPOR tools, I should note one frustration. Ideally you would be able to export word frequencies in some sort of a spreadsheet-friendly format so that you can play with the data and come back to it at a later point, but I didn’t see an easy way to do this. I tried to copy and paste the list of 4286 unique words into Google spreadsheets (which I’m using to share my findings) and ended up crashing my browser. I then pasted the 286 terms that appear at least 5 times into Excel and then into Google spreadsheets, but that process seemed to introduce unnecessary steps. Anyhow, I’ll keep experimenting with TAPOR, HyperPo, Token X, WordHoard, NORA, and the other text analysis, mining and visualization tools out there. Suggestions welcomed!
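
In the meantime, one possible workaround for the export problem is to write the counts to a CSV file, which both Excel and Google spreadsheets can open. The sketch below is an assumption-laden illustration rather than a TAPOR feature: the function name frequencies_to_csv, the min_count parameter, and the sample counts are all invented.

```python
import csv
import io
from collections import Counter


def frequencies_to_csv(freqs, out, min_count=1):
    """Write word,count rows for words occurring at least min_count times.

    Rows are sorted by descending count, then alphabetically.
    Returns the number of data rows written.
    """
    writer = csv.writer(out)
    writer.writerow(["word", "count"])
    rows = 0
    for word, count in sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0])):
        if count >= min_count:
            writer.writerow([word, count])
            rows += 1
    return rows


# Invented counts for illustration only; not figures from my chapter.
freqs = Counter({"bachelor": 12, "home": 7, "instance": 5, "reverie": 2})

# Write to an in-memory buffer here; a real file works the same way.
buffer = io.StringIO()
written = frequencies_to_csv(freqs, buffer, min_count=5)
print(buffer.getvalue())
```

To save an actual file, open it with open("frequencies.csv", "w", newline="", encoding="utf-8") and pass that in place of the buffer. Setting min_count=5 mirrors the cutoff I used when trimming the list to the terms that appear at least 5 times.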

