Tag Archives: text analysis

Digital Pedagogy in Practice: Workshop Materials

On Saturday, March 2, I gave a workshop on digital (humanities) pedagogy for a group of about 20 faculty and staff at Gettysburg College.  I was impressed by the participants’ energy, openness, smarts, and playfulness.  We had fun!

I designed the workshop so that it moved through four phases, with the goal of participants ultimately walking away with concrete ideas about how they might integrate digital approaches into their own teaching:

1)  We explored the rationale for digital pedagogy (pdf of slides), discussing what students need to know in the 21st century, different frameworks for digital pedagogy (e.g. learning science, liberal education,  social learning, and studio learning), and definitions of digital pedagogy and the “digital liberal arts.” I started the session with Cathy Davidson’s exercise in which audience members first jot down on an index card three things they think students need to know in order to thrive in the digital age, then share their ideas with someone they didn’t walk in with, and finally work together to select the one key idea. (The exercise got people thinking and talking.)

2)   In the second session, I gave a brief presentation (pdf) offering specific case studies of digital pedagogy in action (repurposing some slides I’d used for previous workshops). Participants then broke up into groups to analyze an assignment used in a digital humanities class.

3)   Next participants worked in small groups to explore one of the following:

I structured the exercise so that participants first looked at the particular applications of the tool in teaching and scholarship (e.g. Mapping the Republic of Letters and Visualizing Emancipation in the session on information visualization), then played with a couple of tools in order to understand how they work, and finally reflected on the advantages and disadvantages of each tool and their potential pedagogical applications. I deliberately kept the exercises short and simple, and I tried to make them relevant to Gettysburg, drawing data from Wikipedia and other open sources.

4)   Finally participants worked in small teams (set up according to discipline) to develop an assignment incorporating digital approaches.  We concluded the session with a modified gallery walk, in which people circulated through the room and chatted with a representative of each team to learn more about their proposed assignment.

By the end of the day, workshop participants seemed excited by the possibilities and more aware of specific approaches that they could take (as well as a bit exhausted). I got several questions about copyright, so in future workshops I plan to incorporate a more formal discussion of fair use, Creative Commons and the public domain.

Our workshop drew heavily on materials shared by generous digital humanities instructors. (In that spirit, feel free to use or adapt any of my workshop materials. And I’m happy to give a version of this workshop elsewhere.) My thinking about digital humanities pedagogy has been informed by a number of people, particularly my terrific colleague Rebecca Davis.

Slides and Exercises from “Doing Things with Text” Workshop

Last week I was delighted to be back at my old stomping grounds at Rice University’s Digital Media Commons to lead a workshop on “Doing Things with Text.” The workshop was part of Rice’s Digital Humanities Bootcamp Series, led by my former colleagues Geneva Henry and Melissa Bailar. I hoped to expose participants to a range of approaches and tools, provide opportunities for hands-on exploration and play, and foster discussion about the advantages and limitations of text analysis, topic modeling, text encoding, and metadata. Although we ran out of time before getting through my ambitious agenda, I hope my slides and exercises provide useful starting points for exploring text analysis and text encoding.

Using Text Analysis Tools for Comparison: Mole & Chocolate Cake

How can text analysis tools enable researchers to study the relationships between texts? In an earlier post, I speculated about the relevance of such tools for understanding “literary DNA”–how ideas are transmitted and remixed–but as one reader observed, intertextuality is probably a more appropriate way of thinking about the topic. In my dissertation, I argue that Melville’s Pierre represents a dark parody of Mitchell’s Reveries of a Bachelor. Melville takes the conventions of sentimental bachelor literature, mixes in elements of the Gothic and philosophic/theological tracts, and produces a grim travesty of bachelor literature that makes the dreaming bachelor a trapped quasi-husband, replaces the rural domestic manor with a crowded urban apartment building, and ends in a real, Hamlet-intense death scene rather than the bachelor coming out of reverie or finding a wife. Would text analysis tools support this analysis, or turn up patterns that I had previously ignored?

I wanted to get a quick visual sense of the two texts, so I plugged them into Wordle, a nifty word cloud generator that enables you to control variables such as layout, font, and color. (Interestingly, Wordle came up with the perfect visualization for each text at random: Pierre in white type on a black background, shaped into, oh, a chess piece or a tombstone; Reveries in a brighter, more casual handwritten style, with a shape like a fish or egg.)

Wordle Word Cloud for Pierre

Wordle Reveries Word Cloud
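For anyone who would rather script this step than use Wordle's web interface, here is a rough Python sketch using the third-party wordcloud library (not the tool I used above, just one way to get a comparable image; the file names are placeholders):

    # Rough sketch: build a word-cloud image from a plain-text file.
    # Requires the third-party "wordcloud" package (pip install wordcloud);
    # pierre.txt is a placeholder for a local copy of the novel's full text.
    from wordcloud import WordCloud

    with open("pierre.txt", encoding="utf-8") as f:
        text = f.read()

    cloud = WordCloud(width=800, height=600,
                      background_color="black",   # white-on-black, like the Pierre cloud above
                      max_words=150).generate(text)
    cloud.to_file("pierre_cloud.png")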

Using these visual representations of the most frequent words in each book enabled me to get a sense of the totality, but then I also drilled down and began comparing the significance of particular words. I noted, for instance, the importance of “heart” in Reveries, which is, after all, subtitled “A Book of the Heart.” I also observed that “mother” and “father” were given greater weight in Pierre, which is obsessed with twisted parental legacies. To compare the books in even more detail, I decided to make my own mashed-up word cloud, placing terms that appeared in both texts next to each other and evaluating their relative weight. I tried to group similar terms, creating a section for words about the body, words about feeling, etc. (I used the crop, copy, and paste tools in Photoshop to create this mashup, but I’m sure–or I sure hope–there’s a better way.)

Comparison of Reveries and Pierre

(About three words into the project, I wished for a more powerful tool to automatically recognize, extract, and group similar words from multiple files, since my eyes ached and I had a tough time cropping out words without also grabbing parts of nearby words. Perhaps each word would be a tile that you drag over to a new frame and move around; ideally, you could click on the word and open up a concordance.) My mashup revealed that in many ways Pierre and Reveries have similar linguistic profiles. For instance, both contain frequently occurring words focused on the body (face, hand, eye), time (morning, night), thinking, feeling, and family. Perhaps such terms are common in all literary works (one would need to compare these works to a larger literary corpus), but they also seem to reflect the conventions of sentimental literature, with its focus on the family and embodied feeling (see, for instance, Howard).
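Something like the tool I wished for in that parenthesis can be roughed out in a few lines of Python: count the words in each file and pull out the vocabulary the two texts share, ranked by combined frequency. This is a sketch only (the file names are placeholders, and it matches exact words, not synonyms):

    # Sketch: list the words shared by two texts, ranked by combined frequency.
    import re
    from collections import Counter

    def word_counts(path):
        with open(path, encoding="utf-8") as f:
            words = re.findall(r"[a-z']+", f.read().lower())
        return Counter(words)

    pierre = word_counts("pierre.txt")       # placeholder file names
    reveries = word_counts("reveries.txt")

    shared = set(pierre) & set(reveries)
    for word in sorted(shared, key=lambda w: pierre[w] + reveries[w], reverse=True)[:25]:
        print(f"{word:12} Pierre={pierre[word]:5}  Reveries={reveries[word]:5}")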

The word clouds enabled me to get an initial impression of key words in the two books and the overlap between them, but I wanted to develop a more detailed understanding. I used TAPOR’s Comparator to compare the two texts, generating a complete list of how often words appeared in each text and their relative weighting. When I first looked at the word list, I was befuddled:

Word      Reveries counts   Reveries relative   Pierre relative   Pierre counts   Relative ratio (Reveries:Pierre)
blaze     45                0.0007              0                 1               109.4667

What does the relative ratio mean? I was starting to regret my avoidance of all math and stats courses in college. But after I worked with the word clouds, the statistics began to make more sense. Oh, relative ratio means how often a word appears in the first text versus the second, adjusted for the length of each text–“blaze” is much more prominent in Reveries. Ultimately I trusted the concreteness and specificity of numbers more than the more impressionistic imagery provided by the word cloud, but the word cloud opened my eyes so that I could see the stats more meaningfully. For instance, I found that “mother” indeed was more significant in Pierre, occurring 237 times vs. 58 times in Reveries. “Heart” was more important in Reveries (a much shorter work), appearing 199 times vs. 186 times in Pierre. I was surprised that “think” was more significant in Reveries than in Pierre, given the philosophical orientation of the latter. With the details provided by the text comparison results, I could construct an argument about how Melville appropriates the language of sentimentality.
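For anyone else puzzling over that column, the arithmetic is simple once spelled out: divide each raw count by the total number of words in its text, then divide the two normalized frequencies. Here is my own reconstruction of the idea (not TAPOR's actual code, and the word totals are made up):

    # Reconstruction of the "relative ratio" idea: normalize each count by
    # the length of its text, then compare the two frequencies.
    def relative_ratio(count_a, total_a, count_b, total_b):
        freq_a = count_a / total_a   # e.g. "blaze" per word of Reveries
        freq_b = count_b / total_b   # e.g. "blaze" per word of Pierre
        return freq_a / freq_b if freq_b else float("inf")   # guard against a zero count

    # Made-up word totals, chosen only to show the shape of the calculation.
    print(relative_ratio(45, 65_000, 1, 155_000))   # about 107: "blaze" leans heavily toward Reveries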

But the differences between the two texts are perhaps even more interesting than their similarities, since they show how Melville departed from the conventions of male sentimentalism, embraced irony, and infused Pierre with a sort of gothic spiritualism. These differences are revealed more fully in the statistics than in the word clouds. A number of terms are unique to each work. For instance, sentimental terms such as “sympathies,” “griefs,” and “sensibility” appear frequently in Reveries but never in Pierre, as do romantic words such as “flirt,” “sparkle,” and “prettier.” As is fitting for Melville, Pierre’s unique language is typically darker, more archaic, abstract, and spiritual/philosophical, and obsessed with the making of art: “portrait,” “writing,” “original,” “ere,” “miserable,” “visible,” “invisible,” “profound(est),” “final,” “vile,” “villain,” “minds,” “mystical,” “marvelous,” “inexplicable,” “ambiguous.” (Whereas Reveries is subtitled “A Book of the Heart,” Pierre is subtitled “The Ambiguities.”) There is a strand of darkness in Mitchell–he uses “sorrow” more than Melville–but then Mitchell uses “pleasure” 14 times to Melville’s 2 times and “pleasant” 43 times. Reveries is more self-consciously focused on bachelorhood; Mitchell uses “bachelor” 28 times to Melville’s 5. Both authors refer to dreaming; Mitchell uses “reveries” 10 times, Melville 7. Interestingly, only Melville uses “America” (14 times).

Looking over the word lists raises all sorts of questions about the themes and imagery of each work and their relationship to each other, but the data can also be overwhelming. If comparing two works yields over 10,000 lines in a spreadsheet, what criteria should you use in deciding what to select (to use Unsworth’s scholarly primitive)? What happens when you throw more works into the mix? I’m assuming that text mining techniques will provide more sophisticated ways of evaluating textual data, allowing you to filter data and set preferences for how much data you get. (I should note that you can exclude terms and set preferences in TAPOR).
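In the meantime, even a crude threshold helps with the 10,000-line problem: keep only the words that occur a minimum number of times and whose relative ratio strays far from 1. A sketch, assuming the comparison has been exported to a CSV with hypothetical column names:

    # Sketch: filter a word-comparison table down to the "interesting" rows.
    # Assumes a CSV with hypothetical columns: word, count_a, count_b, ratio.
    import csv

    MIN_COUNT = 10    # ignore very rare words
    MIN_SKEW = 3.0    # keep words at least 3x more frequent in one text than the other

    with open("comparison.csv", newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

    interesting = [
        r for r in rows
        if int(r["count_a"]) + int(r["count_b"]) >= MIN_COUNT
        and (float(r["ratio"]) >= MIN_SKEW or float(r["ratio"]) <= 1 / MIN_SKEW)
    ]
    print(f"{len(interesting)} of {len(rows)} words pass the filter")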

Text analysis brings attention to significant features of a text by abstracting those features–for instance, by generating a word frequency list that contains individual words and the number of times they appear. But I kept wondering how the words were used, in what context they appeared. So Melville uses “mother” a lot–is it in a sweetly sentimental way, or does he treat the idea of mother more complexly? By employing TAPOR’s concordance tool, you can view words in context and see that Mitchell often uses “mother” in association with words like “heart,” “kiss,” “lap,” while in Melville “mother” does appear with “Dear” and “loving,” but also with “conceal,” “torture,” “mockingly,” “repelling,” “pride,” “cruel.” Hmmm. In Mitchell, “hand” most often occurs with “your” and “my,” signifying connection, while “hand” in Pierre is more often associated with action (hand-to-hand combat, “lift my hand in fury,” etc.) or with putting hand to brow in anguish. Same word, different resonance. It’s as if Melville took some of the ingredients of sentimental literature and made something entirely different with them, enchiladas mole rather than a chocolate cake.
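The keyword-in-context view I leaned on here is also easy to reproduce offline; NLTK, for example, ships a concordance display. A minimal sketch, assuming a local plain-text copy of the novel and the NLTK tokenizer data installed:

    # Minimal keyword-in-context (concordance) sketch using NLTK.
    # Assumes "pip install nltk" and the punkt tokenizer data are available.
    import nltk
    from nltk.text import Text

    with open("pierre.txt", encoding="utf-8") as f:    # placeholder file name
        tokens = nltk.word_tokenize(f.read())

    # Print each occurrence of "mother" with its surrounding context.
    Text(tokens).concordance("mother", width=79, lines=10)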

Word clouds, text comparisons, and concordances open up all sorts of insights, but how does one use this evidence in literary criticism? If I submitted an article full of word count tables to a traditional journal, I bet the editors wouldn’t know what to do with it. But that may change, and in any case text analysis can inform the kind of arguments critics make. My experience playing with text analysis tools verifies, for me, Stephen Ramsay’s recommendation that we “reconceive computer-assisted text analysis as an activity best employed not in the service of a heightened critical objectivity, but as one that embraces the possibilities of that deepened subjectivity upon which critical insight depends.”

Works Cited

Howard, June. “What Is Sentimentality?” American Literary History 11.1 (1999): 63-81. 22 Jun 2008 <http://alh.oxfordjournals.org/cgi/content/citation/11/1/63>.

Ramsay, Stephen. “Reconceiving Text Analysis: Toward an Algorithmic Criticism.” Literary and Linguistic Computing 18.2 (2003): 167-174. 27 Nov 2007 <http://llc.oxfordjournals.org/cgi/content/abstract/18/2/167>.

What can you do with texts that are in a digital format?

I’ve had a longstanding, friendly debate with a colleague about whether it is sufficient to provide page images of books, or whether text should be converted to a machine- and human-readable format such as XML. She argues that converting scanned books to text is expensive and that the primary goal should be to provide access to more material. True, but converting books into a textual format makes them much more accessible, allowing users to search, manipulate, organize, and analyze them. Here’s my summary of what you can do with an electronic text. Most of these advantages are pretty obvious, but worth articulating.

Read it—on paper (once you print it out or pay for on-demand printing), your computer, or, increasingly, a portable device. From a single XML file, you can generate many forms of output, including HTML, PDF, and formats for mobile devices.
Copy and paste it–avoid the hassle of having to retype passages.
Search it. Several years ago, I wrote a series of learning modules on stereographs, 3D photographs popular in the late 19th and early 20th centuries. I searched for books and articles on stereographs in the library catalog and in journal collections such as JSTOR, but was kind of disappointed by the lack of relevant information. Last year I returned to the topic and used Google Books for my research. I found dozens more relevant sources, such as key theoretical and historical works on stereography (most of which had already been published when I first studied the topic) as well as some fascinating nineteenth and early twentieth century manuals. Sure, I had to wade through a lot more stuff to find what I needed, but being able to search the contents of books and essays as well as the metadata let me uncover much more useful stuff.
Build a personal collection. Forget file cabinets crammed with photocopies. Using tools such as Zotero and EndNote, you can easily download articles and the accompanying bibliographic information onto your laptop, then take your entire collection with you on a plane, to an archive, to a boring meeting, etc. You can search your collection, sort it, create bibliographies, etc.
Share it. Much to the chagrin of movie studios and record companies, digital files are easy to share, so you can give colleagues access to articles, notes, bibliographies, etc. without having to deal with physical delivery (copyright permitting, of course). With the forthcoming Zotero 2.0, sharing will get even easier.
Analyze it. Once you have a book in a text-based format, you can do all sorts of nifty things with it–generate word counts, find out what terms appear most frequently next to a particular word, extract dates, find capitalized terms, compare texts, and much more (see the short sketch after this list). See TAPOR’s tutorial.
Visualize it. Not only are text visualization tools, well, cool, they also can open up interpretive insights. For instance, using the US Presidential Speeches Tag Cloud, you can get a quick, dynamic view of the history of presidential priorities.
Mine it. Look for patterns in large textbases. As Loretta Auvil of NCSA & SEASR explains, text mining tools such as those being developed by MONK and SEASR enable researchers to automatically classify texts according to characteristics such as genre, identify patterns such as repetition (as in the case of Stein’s The Making of Americans), analyze literary inheritance, and study themes across thousands of texts.
Remix & play with it. By taking the elements of a text or collection of texts and remixing them, you not only produce a new creative work, but also see the text in a new way–your attention is brought to particular linguistic elements, like the fragments of a broken vase used to make a mosaic. For instance, when I used the Open Wound “language mixing tool” with Melville’s 1855 sketch “The Paradise of Bachelors and the Tartarus of Maids”, I gained new insights into the violence and anxiety expressed by words such as “agony,” “cut,” and “defective.” Running the tool on the sketch also produced some stunning phrases that could serve as mottoes for this kind of activity: “Exposed are the cutters,” “in the meditation onward,” and “protecting through the scholarship.” I also plan to play with tools that would allow me to mashup several bachelor texts (take the beginning from Irving, the middle from Melville and Hawthorne, the end from Mitchell), replace key words with pictures, etc.

Some really interesting research is underway on the possibilities of text mining for humanities scholarship–including the aforementioned MONK and SEASR projects, as well as CHNM’s “Scholarship in the Age of Abundance: Enhancing Historical Research With Text-Mining and Analysis Tools.”

Literary DNA and Google Books

Can digital tools help us to trace literary inheritance and influence? In my dissertation, I claim that Washington Irving helped to originate a tradition of bachelor sentimentalism in American literature that Donald Grant Mitchell extended in Reveries of a Bachelor, Melville satirized in Pierre and “Paradise of Bachelors and Tartarus of Maids,” and Henry James complicated in “Lessons of the Master” and other works. If this relationship were visualized in a family tree, it would look something like this:

Bachelor Pedigree

Note that Herman Melville is hanging out by Donald Grant Mitchell but isn’t really connected to the family tree–in part because he’s got kind of a dismissive step-child relationship to Washington Irving, in part because I just couldn’t figure out how to make siblings display in Ancestry.com. (I would have been better off using a more flexible tool like Gliffy, and Ancestry.com now thinks my name is Henry James, but live ‘n learn.) Note also that there are no mothers–appropriate, I suppose, for bachelor literature. Of course, such a family tree grossly oversimplifies literary inheritance and influence; for one thing, most authors have many literary ancestors, many fathers and mothers.

As I was tracing this literary lineage in my dissertation, I kept asking myself how I knew that this genealogy was real, how I could be sure that I wasn’t just making it all up? Although I couldn’t state with certainty that there was this bachelor genealogy (such certainty seems to be beyond the reach–and outside the aims–of literary interpretation), I found some pretty good evidence for it. I won’t rehash all of the arguments here, but as I explain in my dissertation, Mitchell, Melville and James commented on their literary forebears. Mitchell gave a tribute at the Washington Irving Centenary, Melville rejected Irving’s approach to literature as being a “self-acknowledged imitation of a foreign model,” and James remembered his “very young pleasure” in reading Mitchell and Melville. Then there are the similarities in voice (a smooth, subjective sentimentalism), setting (bachelor garrets and lonely hearths), character (the dreaming bachelor), etc.

Now I wonder: Would text analysis tools and massive collections of texts such as Google Books provide further evidence for this bachelor genealogy? By comparing abstractions and visualizations of my four core bachelor texts, would I be able to see a “family resemblance”–perhaps a unique turn of phrase that is repeated through the generations like funny toes or wavy red hair? And what would constitute reliable evidence of inheritance, anyway–similarity in word choice, narrative voice, character, structure, graphical design (i.e., if you’ve got Fabio on the cover, the book probably belongs to the romance tradition)? Is there such a thing as “literary DNA”?

And is “literary DNA” even a valid concept, or is it a scientific term misapplied to the literary realm? By “literary DNA,” I mean the unique characteristics that define a literary work, characteristics developed through the complex process of literary inheritance and creativity. A search for the term in the MLA Bibliography yields no results for “literary DNA” (although “literary influence” gets you over 1000), while Google Scholar shows only 9 results. Perhaps the term rings too much of a positivist approach to literary study. When you broaden the search to Google, though, about 1000 results are retrieved. For example: The Paris Review initially called its archive of 50 years of its interviews The DNA of Literature (USA Today), presumably because you can plumb it to find authors explaining the genealogy of their works. In the Netherlands, researchers at the Huygens Institute KNAW are developing The literary DNA: computer-assisted recognition of narrative elements, software that can recognize themes and motifs in literature. “Literary DNA” is probably most closely associated with Don Foster, the literary scholar who used textual analysis to help uncover Joe Klein’s identity as the “Anonymous” author of Primary Colors, Ted Kaczynski as the Unabomber, and Shakespeare as the author of the poem A Funerall Elegye in memory of the late Vertuous Maister William Peete (erroneously, as it turned out). In Author Unknown (which I’m just starting to read), Foster argues, “The scientific analysis of a text–how a mind and a hand conspire to commit acts of writing–can reveal features as sharp and telling as anything this side of fingerprints and DNA” (4). I’m skeptical that a text can be analyzed scientifically, and textual analysis does not necessarily prove authorship definitively, as we see in the case of the poem mistakenly attributed to Shakespeare. Still, Foster’s methods are intriguing–he looks at word choice, punctuation, spelling, grammar, and sentence construction to discover the unique features of an author’s “linguistic system.” Romantic that I am, I dig the idea of literary scholar as sleuth.
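Foster’s notion of a “linguistic system” has a rough computational analogue in standard stylometry: profile each text by the relative frequency of common function words, then measure how close the profiles are. The sketch below is my own toy version of that idea, not Foster’s method, and the file names are placeholders:

    # Toy stylometry sketch (not Foster's actual method): profile texts by
    # function-word frequencies and measure how similar the profiles are.
    import math
    import re
    from collections import Counter

    FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it",
                      "with", "as", "his", "but", "for", "not", "so"]

    def profile(path):
        with open(path, encoding="utf-8") as f:
            words = re.findall(r"[a-z']+", f.read().lower())
        counts, total = Counter(words), len(words)
        return [counts[w] / total for w in FUNCTION_WORDS]

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm

    # Higher cosine similarity = more alike in function-word usage.
    print(cosine(profile("reveries.txt"), profile("pierre.txt")))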

So how would you use the computer to study literary influence and inheritance, and what would be the implications of this approach to research? Certainly search tools can help us to track allusions and other signs of influence. In his excellent article “Googling the Victorians,” Patrick Leary describes how the editors of the Strouse Edition of Carlyle’s Writings use Google, Literature Online (LION), the online OED, and other electronic resources to identify unattributed quotations, sparing themselves from having to spend “many often fruitless hours in the library stacks.” By providing quick access to much of the cultural record, the Web may, as Leary argues, “lend new urgency to arguments about what constitutes evidence of literary or intellectual influence” (12). Massive digitization projects such as Google Books are exposing not only influence, but also outright plagiarism. In Slate, Paul Collins reports that a linguist working for Google Books found that a passage from Sacrificial Foundations (1899) was a plagiarism of a plagiarism. As Collins observes, “it may be only a matter of time before some enterprising scholar yokes Google Book Search and plagiarism-detection software together into a massive literary dragnet, scooping out hundreds of years’ worth of plagiarists—giants and forgotten hacks alike—who have all escaped detection until now.” But when is “plagiarism” creative re-invention? Jonathan Lethem’s plagiarism on plagiarism cleverly makes the point that literature is all about appropriation and re-imagining.

Curious about the capabilities of Google Books to track literary inheritance, I compared Google Books editions of Washington Irving’s Sketch Book, Donald Grant Mitchell’s Reveries of a Bachelor, Melville’s Pierre, and Henry James’ The Lesson of the Master, The Death of the Lion; The Next Time, and Other Tales. (I know that Google Books is itself the subject of controversy, a topic I plan to take up in a future post.) What Google Books, the Open Content Alliance, and other huge digital collections offer is lots and lots of data that one could mine to study literary inheritance, the evolution of ideas, and a lot more. I hoped to use Google Books to discover salient features not only of these works, but of the works preceding and succeeding them on the literary family tree.

Google Books provides handy (if limited) tools for detecting literary influence and seeing a snapshot of a work through its “About This Book” feature, a sort of reference page gathering together key words, popular phrases, and references from other works. Google Books isolates 20 unique “Key words and phrases” (such as “sleepy hollow” and “diedrich knickerbocker”), as well as three key words associated with each chapter. Proper names seem to be the most frequent key words, so I wasn’t too surprised that the only overlap in key words among the four works was that both The Sketch Book and Pierre use variations of “mourn” and “methinks,” reflecting perhaps the melancholic tone of both works.

Popular Passages seems to hold more promise for influence-tracking, since it enables readers to “follow the literary memes that appear again and again in the world of books” by highlighting the ten passages that appear most frequently in other works, then by providing links to all of those works. For Irving and Mitchell, the most frequent “popular passages” were quotations from other works (such as Thomas Gray’s poetry) and passages of their own works that were frequently anthologized, while for James and Melville the popular passages were those most frequently cited by critics or represented in other editions of their works. Through popular passages, you can get a quick sense of how the authors were received in their time and ours and glimpse how a work fits into the larger network of literary influence.

To understand how critics and commentators have received The Sketch Book, you can also examine “References from books” (which presumably searches Google Books for citations of the work), “References from scholarly works” (which presumably searches Google Scholar), and “References from web pages” (which presumably searches Google, in a limited way). If, say, The Sketch Book and Pierre were referenced by the same works, that might indicate some kind of kinship. For selected books (in my very small sample, only the travel-focused The Sketch Book), Google Books also provides a map of places mentioned in the book. The map for The Sketch Book shows dozens of markers clustered in Great Britain, revealing at a glance what an Anglophile Irving was. You could compare maps of several books to see if they share similar settings or itineraries.

After spending several hours playing with About This Book, I didn’t find much evidence for my theory of bachelor genealogy–perhaps I would need a more specialized tool to turn up that kind of evidence. However, it does seem that Google Books could be quite useful for scholars interested in book history, since it includes book advertisements that appear in back matter, catalogs of library and personal collections, and back issues of Publishers Weekly, as well as the aforementioned features that allow you to see how the book has been referenced. By providing access to so much of the cultural record, Google Books can allow scholars to broaden the scope of their inquiry and find connections among works. While I recognize Google Books’ potential as a scholarly tool, I can see several ways to improve it for scholars:

  • Allow users to see more than 10 “popular passages” and 20 keywords; for power users, expose all available data (in a user-friendly way, of course)
  • Be more transparent about how these different features work. I searched the Google Books site for an explanation of the different “About This Book” features, but couldn’t find much beyond a few sentences in the Book Search Blog. If I’m going to trust a tool, I’d like to know how it works.
  • Make it easier to search within “popular passages.” Analyzing popular passages can be quite time-consuming, especially when a passage is cited in 860 books, as is a quotation from Cymbeline that Irving includes in The Sketch Book. Ideally you could search inside your “popular passages” results to see if, say, Melville cited the same passage from Shakespeare.
  • Help readers sort out different editions of the same work. As a book history/textual studies gal, I like the fact that Google Books includes multiple editions of a work, even if the choice of editions included seems to be more an accident of what a library holds than a scholarly decision about which are the key editions. But once you start comparing the “About the Book” feature for those different editions, things get awfully confusing. When I compared two different versions of Reveries of a Bachelor, the 1852 L.C. Page edition and the 1906 Fenno edition, the data was quite different. Perhaps the text of the work changed substantially as different publishers came out with their own editions (neither is the authorized edition first published by Scribner in 1850), or maybe something’s wacky about the way that Google is generating this information. In any case, you get different popular passages (the Page edition includes lots of advertisements for other books published by the same company), different books/scholarly works that refer to this work, and different key words. Curious. So I guess I’d like a “compare editions” tool to reveal what’s really going on…
  • So this request may be a little pie in the sky, but I sure would love a way to know what works Google Books is not searching, or what it might be missing because of OCR errors. When my colleague Jane Segal and I studied humanities scholars’ use of digital resources, several worried that works not in digital form would be neglected and that scholarship would suffer from a sort of ignorance of the analog. I dunno–maybe such a list could be generated by comparing WorldCat records with Google Books? In any case, scholars need to be conscious of the limits of Google Books.
  • Fix errors in generating links to other books. For some reason, if you try to follow “References from books” past the first results page, you get an error that looks like this: “Your search – cites:0F_xecYtdwg1CDwRxfl_QZ – did not match any documents.” Argh!
  • Make it easier to download full-text of books (PDFs are handy, full-text is better).
  • Ensure free access to the full-text of public domain books. I got a little scared when I heard that Google Books would start charging for full-text, but that plan seems to focus on books that are part of its publisher partner program, not its library scanning project.
  • Experiment with visualization tools. For instance, it would be cool to play with some kind of social network graph or citation network showing all of the authors who cited Irving and were cited by him

Maybe it’s not fair to expect Google to turn Google Books into a scholarly resource–perhaps it’s better for scholars to develop their own applications to analyze Google Books. I’m encouraged by news that members of the digital humanities community have been in conversation with Google. Dan Cohen makes a persuasive case for Google Books providing an API that would allow researchers to mine and manipulate information. Despite my criticisms, I love poking around “About This Book,” which provides a great way to get a snapshot view of how a work fits into the literary ecosystem.

I had planned to discuss my experiments using text analysis tools to trace literary influence, but I’ve gone on way way too long already, so I’ll save that for a future post.

Woman vs. machine? Analyzing texts…

Since it took me five years before I could steel myself to look at my dissertation again, I had forgotten some of the main points that I made in it. To uncover key terms in Chapter 1, which explores the popular literature of bachelorhood in 19th century America, I decided to use text analysis tools. By generating a list of frequently occurring terms, I figured that I could get a snapshot of my argument and, I hoped, have a handy list of search terms to use as I looked for other instances of bachelor literature. I also wanted to play with the tools so that I could better understand their capabilities and limitations. What patterns would the tool reveal? What terms did I use over and over, despite my best efforts to vary my vocabulary? And is word weight a useful measure of the significance of a concept? Wouldn’t the position of a word (for instance, in a heading or thesis paragraph) also matter, and shouldn’t synonyms be considered in the algorithm?

Before using any tools to automatically generate a list of commonly used terms, I decided to go through the chapter and construct my own list of key words. Then I used TAPOR’s Word Frequency tool to automatically generate a list of key terms. In comparing my list and TAPOR’s, I am struck by how I read the chapter through my own interpretive filter. Most of the terms that I included on my list are different descriptors for the bachelor figure in American literature, such as “detached,” “narcissist,” “luxury,” “metamorphosis,” etc. Not surprisingly, TAPOR’s list is much broader. Sure, it overlaps with my list by including terms commonly associated with the bachelor figure, such as “single,” “man,” “unmarried,” “pleasure,” and “sentiment.” But it also includes terms such as “author,” “narrator,” “literature,” “literary,” “writing,” “American,” and “identity,” terms that reflect my argument that anxieties over American authorship were reflected in discourse about bachelorhood. Likewise, the TAPOR list gives high ranking to words associated with domesticity such as “family,” “home,” and “love,” reflecting my argument that the bachelor stood outside family-centered domesticity but remade it on his own terms. Before running TAPOR, I did write a quick summary of my argument that includes terms such as “identity” and “authorship,” so I was certainly aware of how these ideas played into my argument–they just weren’t included in the list I made. But the TAPOR list also includes some words that reflect not so much my argument as my rhetorical style–for instance, “instance” (I apparently use that word a lot to provide examples), “according” (attributing sources), “suggests” (summarizing someone else’s argument), “typically” (avoiding the absolute statement), and “likewise” (comparing). Noticing the language I use to make arguments reminds me of when I was recorded making a speech and became aware of the way I hung my head to the side and “ummed” as I spoke–I became more self-conscious of my style. I suppose what I’ve gotten out of this exercise, besides a handy list of keywords that I hope to use in conducting searches, is an initial confirmation of the claim that text analysis tools can help you to look beyond your own interpretive filter and see other patterns.

As much as I like the TAPOR tools, I should note one frustration. Ideally you would be able to export word frequencies in some sort of a spreadsheet-friendly format so that you can play with the data and come back to it at a later point, but I didn’t see an easy way to do this. I tried to copy and paste the list of 4286 unique words into Google spreadsheets (which I’m using to share my findings) and ended up crashing my browser. I then pasted the 286 terms that appear at least 5 times into Excel and then into Google spreadsheets, but that process seemed to introduce unnecessary steps. Anyhow, I’ll keep experimenting with TAPOR, HyperPo, Token X, WordHoard, NORA, and the other text analysis, mining and visualization tools out there. Suggestions welcomed!
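In the meantime, one workaround is to compute the frequencies locally and write them straight to a CSV that Excel or Google spreadsheets will open. A sketch (the file names are placeholders):

    # Workaround sketch: compute word frequencies locally and save them as a CSV.
    import csv
    import re
    from collections import Counter

    with open("chapter1.txt", encoding="utf-8") as f:    # placeholder file name
        counts = Counter(re.findall(r"[a-z']+", f.read().lower()))

    with open("chapter1_frequencies.csv", "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["word", "count"])
        writer.writerows(counts.most_common())   # every unique word, most frequent first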