Presentation on How Digital Humanists Use GitHub

At Digital Humanities 2016, Sean Morey Smith and I presented on our ongoing work examining GitHub as a platform of knowledge for digital humanities. Our results are still preliminary, but we want to share our presentation (PDF). We’re especially grateful to those who agreed to be interviewed for the study and who took our survey. We expect to produce an article (or two) based on our research.

We welcome any questions or feedback.

Studying How Digital Humanists Use GitHub

Over the past academic year, I’ve been fortunate to participate in Rice’s Mellon-sponsored Sawyer Seminar on Platforms of Knowledge, where we’ve examined platforms for authoring, annotation, mapping, and social networking. We’ve discussed both the possibilities that platforms may open up for inquiry, public engagement and scholarly communications and the risks that they may pose for privacy and nuanced humanistic analysis. Inspired by the questions raised by the Seminar, my colleague Sean Smith and I are studying a platform used by a number of digital humanists: GitHub. Digital humanists employ GitHub not only for code, but also for writing projects, syllabi, websites, and other scholarly resources. We’ll present our initial findings at Digital Humanities 2016, but I wanted to offer some background to the study, especially since some of you will soon be receiving emails from me inviting you to participate in it.

Initially I was interested in using GitHub for a case study of how we assess and select digital platforms. Even as many researchers (myself included) rely on digital platforms, I haven’t been able to find many clear rubrics for evaluating them. Building on Quinn Dombrowski’s recommendations for choosing a platform for a web project, we are looking at criteria such as functionality and ease of use. In previous work examining archival management systems, I learned how important it is to talk with users about their experience with tools, so we will be conducting a survey and interviews about GitHub. Sean and I also realized that GitHub itself provides valuable data about how people use GitHub, such as information about collaboration, code re-use, and connections to others. Our study will thus include analysis of publicly available data about selected GitHub users and repositories. (Of course, there is significant prior work on this topic in fields such as social computing that we will draw upon.)

With this project, we are:

  1. Identifying digital humanists who have GitHub accounts. For the purposes of this study, we are looking at presenters at the last three Digital Humanities conferences and people affiliated with organizations that belong to centerNet (assuming that the information is publicly available). Of course, this method is imperfect: it misses digital humanists who didn’t attend the DH conferences or who aren’t affiliated with DH centers, and it may include some people who don’t really consider themselves digital humanists. But it’s a start.
  2. Contacting those whose email addresses are easily retrievable (e.g. available via GitHub) and:
    1. Giving them the opportunity to opt out of having their publicly available GitHub data included in our analysis and in the dataset that we plan to share at the end of the study. (Added 5/18/16: To be extra careful, we plan to anonymize this dataset.)
    2. Inviting them to take a brief survey about their usage and opinions of GitHub
    3. Inviting them to participate in an interview

    We may also contact people whose emails aren’t in the GitHub data but are otherwise available.

  3. Analyzing GitHub data from our dataset to gain insight into how digital humanists use GitHub.

We want to conduct this study openly while at the same time respecting privacy. In conducting interviews for past studies, I’ve been frustrated that I can’t publicly identify and credit people who have made brilliant comments because of the promise of confidentiality. So we’re giving interviewees the option to make all or some of their interview notes public–but of course they can instead keep the notes private and remain anonymous. Survey data will be anonymized but ultimately shared.

Here are important documents related to our study:

I welcome feedback and questions about this study. I hope that it will contribute to developing criteria for evaluating platforms like GitHub and offer insights into how digital humanities researchers and developers work.

Digital Pedagogy in Practice: Workshop Materials

On Saturday, March 2, I gave a workshop on digital (humanities) pedagogy for a group of about 20 faculty and staff at Gettysburg College.  I was impressed by the participants’ energy, openness, smarts, and playfulness.  We had fun!

I designed the workshop so that it moved through four phases, with the goal of participants ultimately walking away with concrete ideas about how they might integrate digital approaches into their own teaching:

1)  We explored the rationale for digital pedagogy (pdf of slides), discussing what students need to know in the 21st century, different frameworks for digital pedagogy (e.g. learning science, liberal education,  social learning, and studio learning), and definitions of digital pedagogy and the “digital liberal arts.” I started the session with Cathy Davidson’s exercise in which audience members first jot down on an index card three things they think students need to know in order to thrive in the digital age, then share their ideas with someone they didn’t walk in with, and finally work together to select the one key idea. (The exercise got people thinking and talking.)

2)   In the second session, I gave a brief presentation (pdf) offering specific case studies of digital pedagogy in action (repurposing some slides I’d used for previous workshops). Participants then broke up into groups to analyze an assignment used in a digital humanities class.

3)   Next participants worked in small groups to explore one of the following:

I structured the exercise so that participants first looked at the particular applications of the tool in teaching and scholarship (e.g. Mapping the Republic of Letters and Visualizing Emancipation in the session on information visualization), then played with a couple of tools in order to understand how they work, and finally reflected on the advantages and disadvantages of each tool and their potential pedagogical applications. I deliberately kept the exercises short and simple, and I tried to make them relevant to Gettysburg, drawing data from Wikipedia and other open sources.

4)   Finally participants worked in small teams (set up according to discipline) to develop an assignment incorporating digital approaches.  We concluded the session with a modified gallery walk, in which people circulated through the room and chatted with a representative of each team to learn more about their proposed assignment.

By the end of the day, workshop participants seemed excited by the possibilities and more aware of specific approaches that they could take (as well as a bit exhausted). I got several questions about copyright, so in future workshops I plan to incorporate a more formal discussion of fair use, Creative Commons and the public domain.

Our workshop drew heavily on materials shared by generous digital humanities instructors. (In that spirit, feel free to use or adapt any of my workshop materials. And I’m happy to give a version of this workshop elsewhere.) My thinking about digital humanities pedagogy has been informed by a number of people, particularly my terrific colleague Rebecca Davis.

Slides and Exercises from “Doing Things with Text” Workshop

Last week I was delighted to be back at my old stomping grounds at Rice University’s Digital Media Commons to lead a workshop on “Doing Things with Text.” The workshop was part of Rice’s Digital Humanities Bootcamp Series, led by my former colleagues Geneva Henry and Melissa Bailar. I hoped to expose participants to a range of approaches and tools, provide opportunities for hands-on exploration and play, and foster discussion about the advantages and limitations of text analysis, topic modeling, text encoding, and metadata. Although we ran out of time before getting through my ambitious agenda, I hope my slides and exercises provide useful starting points for exploring text analysis and text encoding.

Archival Management Systems Report, Wiki & Webinar

[Note: Typically my blog focuses on digital humanities research, but this post discusses some of my related work examining software that helps archives streamline their workflows.]

As archives acquire collections, arrange them, describe them, manage them, and make them publicly available, they produce data in multiple formats, such as notecards, Word documents, Excel files, Access databases, XML (EAD) finding aids, and web pages. Chris Prom suggests that some archives use so many tools in creating this data that their workflows “would make a good subject for a Rube Goldberg cartoon.” As a result, archives replicate data and effort, struggle with version control, face challenges finding and analyzing archival information, and have difficulty making that information publicly available. By using archival management systems such as Archon and Archivists’ Toolkit, however, archives can streamline the production of archival information; make it simpler to find information and generate reports; enable non-professionals to more easily create archival description; conform to archival standards; and share information such as finding aids with the public. To help guide the archival community in selecting the appropriate archival management system, I recently wrote a report for the Council on Library and Information Resources (CLIR).

Working on the report led me to several (admittedly non-revolutionary) insights:

  1. If you want to know what features software users need, ask them.   In the course of interviewing over 30 archivists and developers, I gained a greater understanding of key criteria for archival management software including flexibility, conformity to standards, support for an integrated workflow, ease of use, remote access (since archivists may do initial work processing collections off site), customization capabilities, ability to import and export data, etc.
  2. There is no one-size-fits-all tool.  Some archives prefer to use open source software; others are leery of open source, need a hosted solution, or require lots of support in importing and exporting data, customizing the user interface, etc.  Some archives need a way to publish archival information on the web; others want to export finding aids and pull them into existing publishing tools.
  3. Reports go out-of-date as soon as they are published.  Why not release the report as a wiki so that the community can keep it current and relevant?  With the support of CLIR, I’ve created a wiki called Archival Software.  Right now it more or less replicates the structure and content of my original report, but I hope that it evolves according to the needs of the community.   I invite members of the archival community to update the information, add new sections, restructure the wiki, and do whatever else makes it most useful.
  4. If archival management systems integrate and streamline the archival workflow from accessioning the collection to describing it to managing it to making it publicly available, what would an integrated research tool for the humanities look like–or would such a tool even be desirable or possible, given the variation in research practices? My first thought: Zotero with add-ons for analyzing information (perhaps similar to the tools under development by SEASR), authoring and sharing research  (like the Word plug-in or plug-ins for multimedia authoring or mashup creation, sharing via Internet Archive collaboration), etc.

On March 31, the Society of American Archivists (SAA) will offer a web seminar, Archival Content Management Systems, that is based upon my report.  The webinar will examine the case for archival management systems, explore selection criteria, and provide brief demonstrations of 3 systems.  I think there’s still time to register.  (Apologies for the self-promotion, but I wanted to get the word out…)

Using Text Analysis Tools for Comparison: Mole & Chocolate Cake

How can text analysis tools enable researchers to study the relationships between texts? In an earlier post, I speculated about the relevance of such tools for understanding “literary DNA”–how ideas are transmitted and remixed–but as one reader observed, intertextuality is probably a more appropriate way of thinking about the topic. In my dissertation, I argue that Melville’s Pierre represents a dark parody of Mitchell’s Reveries of a Bachelor. Melville takes the conventions of sentimental bachelor literature, mixes in elements of the Gothic and philosophic/theological tracts, and produces a grim travesty of bachelor literature that makes the dreaming bachelor a trapped quasi-husband, replaces the rural domestic manor with a crowded urban apartment building, and ends in a real, Hamlet-intense death scene rather than the bachelor coming out of reverie or finding a wife. Would text analysis tools support this analysis, or turn up patterns that I had previously ignored?

I wanted to get a quick visual sense of the two texts, so I plugged them into Wordle, a nifty word cloud generator that enables you to control variables such as layout, font and color. (Interestingly, Wordle came up with the perfect visualizations for each text at random: Pierre white type on a black background shaped into, oh, a chess piece or a tombstone, Reveries a brighter, more casual handwritten style, with a shape like a fish or egg.)

Wordle Word Cloud for Pierre

Wordle Reveries Word Cloud

Using these visual representations of the most frequent words in each book enabled me to get a sense of the totality, but then I also drilled down and began comparing the significance of particular words. I noted, for instance, the importance of “heart” in Reveries, which is, after all, subtitled “A Book of the Heart.” I also observed that “mother” and “father” were given greater weight in Pierre, which is obsessed with twisted parental legacies. To compare the books in even more detail, I decided to make my own mashed up word cloud, placing terms that appeared in both texts next to each other and evaluating their relative weight. I tried to group similar terms, creating a section for words about the body, words about feeling, etc. (I used crop, copy and paste tools in PhotoShop to create this mashup, but I’m sure–or I sure hope–there’s a better way.)

Comparison of Reveries and Pierre

(About three words into the project, I wished for a more powerful tool to automatically recognize, extract and group similar words from multiple files, since my eyes ached and I had a tough time cropping out words without also grabbing parts of nearby words. Perhaps each word would be a tile that you drag over to a new frame and move around; ideally, you could click on the word and open up a concordance.) My mashup revealed that in many ways Pierre and Reveries have similar linguistic profiles. For instance, both contain frequently-occurring words focused on the body (face, hand, eye), time (morning, night), thinking, feeling, and family. Perhaps such terms are common in all literary works (one would need to compare these works to a larger literary corpus), but they also seem to reflect the conventions of sentimental literature, with its focus on the family and embodied feeling (see, for instance, Howard).

The word clouds enabled me to get an initial impression of key words in the two books and the overlap between them, but I wanted to develop a more detailed understanding. I used TAPOR’s Comparator to compare the two texts, generating a complete list of how often words appeared in each text and their relative weighting. When I first looked at the word list, I was befuddled:

Words | Reveries counts | Reveries relative | Pierre relative | Pierre counts | Relative ratio Reveries:Pierre
blaze | 45 | 0.0007 | 0 | 1 | 109.4667

What does the relative ratio mean? I was starting to regret my avoidance of all math and stats courses in college. But after I worked with the word clouds, the statistics began to make more sense. Oh, relative ratio means how often a word appears in the first text versus the second–“blaze” is much more prominent in Reveries. Ultimately I trusted the concreteness and specificity of numbers more than the more impressionistic imagery provided by the word cloud, but the word cloud opened up my eyes so that I could see the stats more meaningfully. For instance, I found that mother indeed was more significant in Pierre, occurring 237 times vs. 58 times in Reveries. Heart was more important in Reveries (a much shorter work), appearing 199 times vs. 186 times in Pierre. I was surprised that “think” was more significant in Reveries than in Pierre, given the philosophical orientation of the latter. With the details provided by the text comparison results, I could construct an argument about how Melville appropriates the language of sentimentality.
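The arithmetic behind a relative ratio can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not TAPOR's documented method: it assumes a word's relative count is its raw count divided by the total number of words in the text, and that the ratio divides the first text's relative count by the second's. The function names, the simple tokenizer, and the toy sentences are all hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def relative_counts(text):
    """Return (raw counts, relative counts) for every word in the text."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens)
    relative = {word: n / total for word, n in counts.items()}
    return counts, relative


def relative_ratio(word, text_a, text_b):
    """Ratio of the word's relative count in text_a to that in text_b.

    Returns None when the word never appears in text_b, because the
    ratio is undefined in that case (assumed behavior for this sketch).
    """
    _, rel_a = relative_counts(text_a)
    _, rel_b = relative_counts(text_b)
    if rel_b.get(word, 0) == 0:
        return None
    return rel_a.get(word, 0) / rel_b[word]


# Toy stand-ins for the two books (invented text, not real quotations).
reveries_sample = "the blaze warms the heart and the blaze fades"
pierre_sample = "the mother hides the portrait from the blaze of the sun at noon"

ratio = relative_ratio("blaze", reveries_sample, pierre_sample)
print(ratio)
```

Read this way, the “blaze” row is consistent with itself: a Reveries relative count of about 0.0007 divided by a ratio of 109.4667 gives a Pierre relative count of roughly 0.000006, which would display as 0 once rounded to four decimal places.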

But the differences between the two texts are perhaps even more interesting than their similarities, since they show how Melville departed from the conventions of male sentimentalism, embraced irony, and infused Pierre with a sort of gothic spiritualism. These differences are revealed more fully in the statistics than the word clouds. A number of terms are unique to each work. For instance, sentimental terms such as “sympathies,” “griefs,” “sensibility” appear frequently in Reveries but never in Pierre, as do romantic words such as “flirt,” “sparkle,” and “prettier.” As is fitting for Melville, Pierre’s unique language is typically darker, more archaic, abstract, and spiritual/philosophical, and obsessed with the making of art: “portrait,” “writing,” “original,” “ere,” “miserable,” “visible,” “invisible,” “profound(est),” “final,” “vile,” “villain,” “minds,” “mystical,” “marvelous,” “inexplicable,” “ambiguous.” (Whereas Reveries is subtitled “A Book of the Heart,” Pierre is subtitled “The Ambiguities.”) There is a strand of darkness in Mitchell–he uses “sorrow” more than Melville–but then Mitchell uses “pleasure” 14 times to Melville’s 2 times and “pleasant” 43 times. Reveries is more self-consciously focused on bachelorhood; Mitchell uses “bachelor” 28 times to Melville’s 5. Both authors refer to dreaming; Mitchell uses “reveries” 10 times, Melville 7. Interestingly, only Melville uses “America” (14 times).

Looking over the word lists raises all sorts of questions about the themes and imagery of each work and their relationship to each other, but the data can also be overwhelming. If comparing two works yields over 10,000 lines in a spreadsheet, what criteria should you use in deciding what to select (to use Unsworth’s scholarly primitive)? What happens when you throw more works into the mix? I’m assuming that text mining techniques will provide more sophisticated ways of evaluating textual data, allowing you to filter data and set preferences for how much data you get. (I should note that you can exclude terms and set preferences in TAPOR.)

Text analysis brings attention to significant features of a text by abstracting those features–for instance, by generating a word frequency list that contains individual words and the number of times they appear. But I kept wondering how the words were used, in what context they appeared. So Melville uses “mother” a lot–is it in a sweetly sentimental way, or does he treat the idea of mother more complexly? By employing TAPOR’s concordance tool, you can view words in context and see that Mitchell often uses mother in association with words like “heart,” “kiss,” “lap,” while in Melville “mother” does appear with “Dear” and “loving,” but also with “conceal,” “torture,” “mockingly,” “repelling,” “pride,” “cruel.” Hmmm. In Mitchell, “hand” most often occurs with “your” and “my,” signifying connection, while “hand” in Pierre is more often associated with action (hand-to-hand combat, “lift my hand in fury,” etc) or with putting hand to brow in anguish. Same word, different resonance. It’s as if Melville took some of the ingredients of sentimental literature and made something entirely different with them, enchiladas mole rather than a chocolate cake.
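For readers curious about what a concordance does under the hood, here is a minimal keyword-in-context sketch in Python. It is not TAPOR's implementation; the function name, the `width` parameter, and the sample sentence are hypothetical, and the tokenizer is deliberately crude.

```python
import re


def concordance(text, keyword, width=3):
    """Return each occurrence of keyword with `width` words of context per side."""
    tokens = re.findall(r"[A-Za-z']+", text)
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            # Slice out the neighboring words on each side of the match.
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines


# Invented sample sentence, not a quotation from either book.
sample = "She took my hand in hers, and I would not lift my hand in fury."

kwic = concordance(sample, "hand", width=2)
for line in kwic:
    print(line)
```

Lining up the matches this way is what makes it easy to see which neighbors a word keeps, such as the possessives around “hand” in one book versus the verbs of action in another.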

Word clouds, text comparisons, and concordances open up all sorts of insights, but how does one use this evidence in literary criticism? If I submitted an article full of word count tables to a traditional journal, I bet the editors wouldn’t know what to do with it. But that may change, and in any case text analysis can inform the kind of arguments critics make. My experience playing with text analysis tools verifies, for me, Steve Ramsay’s recommendation that we “reconceive computer-assisted text analysis as an activity best employed not in the service of a heightened critical objectivity, but as one that embraces the possibilities of that deepened subjectivity upon which critical insight depends.”

Works Cited

Howard, June. “What Is Sentimentality?” American Literary History 11.1 (1999): 63-81. 22 Jun 2008 <http://alh.oxfordjournals.org/cgi/content/citation/11/1/63>.

Ramsay, Stephen. “Reconceiving Text Analysis: Toward an Algorithmic Criticism.” Lit Linguist Computing 18.2 (2003): 167-174. 27 Nov 2007 <http://llc.oxfordjournals.org/cgi/content/abstract/18/2/167>.

Digging in the DiRT: Sneak Preview of the Digital Research Tools (DiRT) wiki

When I talk with researchers about a cool tool such as Zotero, they often ask, “Hey, how did you find out about that?” Not everyone has the time or inclination to read blogs, software reviews, and listserv announcements obsessively, but now researchers can quickly identify relevant tools by checking out the newly-launched Digital Research Tools (DiRT) wiki: http://digitalresearchtools.pbwiki.com/. DiRT lists dozens of useful tools for discovering, organizing, analyzing, visualizing, sharing and disseminating information, such as tools for compiling bibliographies, taking notes, analyzing texts, and visualizing data. We also offer software reviews that not only describe the tool’s features, strengths, and weaknesses, but also provide usage tips, links to training resources, and suggestions for how it might be implemented by researchers. So that DiRT is accessible to non-techies and techies alike, we try to avoid jargon and categorize tools by their functions. Although the acronym DiRT might suggest that it’s a gossip site for academic software, dishing on bugs and dirty secrets about the software development process, we prefer a gardening metaphor, as we hope to help cultivate research projects by providing clear, concise information about tools that can help researchers do their work more effectively or creatively.

DiRT is brand new, so we’re still in the process of creating content and figuring out how best to present it; consider it to be in alpha release and expect to see it evolve. (We plan to announce DiRT more broadly in a few months, but we’re giving sneak previews right now in the hope that comments from members of the digital humanities community can help us to improve it.) Currently the DiRT editorial team includes me, my ever-innovative and enthusiastic colleague Debra Kolah, and three whip-smart librarians from Sam Houston State University with expertise in Web 2.0 technologies (as well as English, history, business, and ranching!): Tyler Manolovitz, Erin Dorris Cassidy, and Abe Korah. We’ve committed to provide at least 5 new tool reviews per month, but we can do even more if more people join us (hint, hint). We invite folks to recommend research tools or software categories, write reviews, sign on to be co-editors, and/or offer feedback on the wiki. Please contact me at lspiro@rice.edu. [Update: You can also provide feedback via this form.]

By the way, playing with DiRT has convinced me yet again of the value of collaboration. Everyone on the team has contributed great ideas about what tools to cover, what form the reviews should take, and how to promote and sustain the wiki. Five people can sure do a heck of a lot more than one–and have fun in the process.

Literary DNA and Google Books

Can digital tools help us to trace literary inheritance and influence? In my dissertation, I claim that Washington Irving helped to originate a tradition of bachelor sentimentalism in American literature that Donald Grant Mitchell extended in Reveries of a Bachelor, Melville satirized in Pierre and “Paradise of Bachelors and Tartarus of Maids,” and Henry James complicated in “The Lesson of the Master” and other works. If this relationship were visualized in a family tree, it would look something like this:

Bachelor Pedigree

Note that Herman Melville is hanging out by Donald Grant Mitchell but isn’t really connected to the family tree–in part because he’s got kind of a dismissive step-child relationship to Washington Irving, in part because I just couldn’t figure out how to make siblings display in Ancestry.com. (I would have been better off using a more flexible tool like Gliffy, and Ancestry.com now thinks my name is Henry James, but live ‘n learn.) Note also that there are no mothers–appropriate, I suppose, for bachelor literature. Of course, such a family tree grossly oversimplifies literary inheritance and influence; for one thing, most authors have many literary ancestors, many fathers and mothers.

As I was tracing this literary lineage in my dissertation, I kept asking myself how I knew that this genealogy was real, how I could be sure that I wasn’t just making it all up? Although I couldn’t state with certainty that there was this bachelor genealogy (such certainty seems to be beyond the reach–and outside the aims–of literary interpretation), I found some pretty good evidence for it. I won’t rehash all of the arguments here, but as I explain in my dissertation, Mitchell, Melville and James commented on their literary forebears. Mitchell gave a tribute at the Washington Irving Centenary, Melville rejected Irving’s approach to literature as being a “self-acknowledged imitation of a foreign model,” and James remembered his “very young pleasure” in reading Mitchell and Melville. Then there are the similarities in voice (a smooth, subjective sentimentalism), setting (bachelor garrets and lonely hearths), character (the dreaming bachelor), etc.

Now I wonder: Would text analysis tools and massive collections of texts such as Google Books provide further evidence for this bachelor genealogy? By comparing abstractions and visualizations of my four core bachelor texts, would I be able to see a “family resemblance”–perhaps a unique turn of phrase that is repeated through the generations like funny toes or wavy red hair? And what would constitute reliable evidence of inheritance, anyway–similarity in word choice, narrative voice, character, structure, graphical design (i.e., if you’ve got Fabio on the cover, the book probably belongs to the romance tradition)? Is there such a thing as “literary DNA”?

And is “literary DNA” even a valid concept, or is it a scientific term misapplied to the literary realm? By “literary DNA,” I mean the unique characteristics that define a literary work, characteristics developed through the complex process of literary inheritance and creativity. A search for the term in the MLA Bibliography yields no results for “literary DNA” (although “literary influence” gets you over 1000), while Google Scholar shows only 9 results. Perhaps the term rings too much of a positivist approach to literary study. When you broaden the search to Google, though, about 1000 results are retrieved. For example: The Paris Review initially called its archive of 50 years of its interviews The DNA of Literature (USA Today), presumably because you can plumb it to find authors explaining the genealogy of their works. In the Netherlands, researchers at the Huygens Institute KNAW are developing The literary DNA: computer-assisted recognition of narrative elements, software that can recognize themes and motifs in literature. “Literary DNA” is probably most closely associated with Don Foster, the literary scholar who used textual analysis to help uncover Joe Klein’s identity as the “Anonymous” author of Primary Colors, Ted Kaczynski as the Unabomber, and Shakespeare as the author of the poem A Funerall Elegye in memory of the late Vertuous Maister William Peeter (erroneously, as it turned out). In Author Unknown (which I’m just starting to read), Foster argues, “The scientific analysis of a text–how a mind and a hand conspire to commit acts of writing–can reveal features as sharp and telling as anything this side of fingerprints and DNA” (4). I’m skeptical that a text can be analyzed scientifically, and textual analysis does not necessarily prove authorship definitively, as we see in the case of the poem mistakenly attributed to Shakespeare.
Still, Foster’s methods are intriguing–he looks at word choice, punctuation, spelling, grammar, and sentence construction to discover the unique features of an author’s “linguistic system.” Romantic that I am, I dig the idea of literary scholar as sleuth.

So how would you use the computer to study literary influence and inheritance, and what would be the implications of this approach to research? Certainly search tools can help us to track allusions and other signs of influence. In his excellent article “Googling the Victorians,” Patrick Leary describes how the editors of the Strouse Edition of Carlyle’s Writings use Google, Literature Online (LION), the online OED, and other electronic resources to identify unattributed quotations, sparing themselves from having to spend “many often fruitless hours in the library stacks.” By providing quick access to much of the cultural record, the Web may, as Leary argues, “lend new urgency to arguments about what constitutes evidence of literary or intellectual influence” (12). Massive digitization projects such as Google Books are exposing not only influence, but also outright plagiarism. In Slate, Paul Collins reports that a linguist working for Google Books found that a passage from Sacrificial Foundations (1899) was a plagiarism of a plagiarism. As Collins observes, “it may be only a matter of time before some enterprising scholar yokes Google Book Search and plagiarism-detection software together into a massive literary dragnet, scooping out hundreds of years’ worth of plagiarists–giants and forgotten hacks alike–who have all escaped detection until now.” But when is “plagiarism” creative re-invention? Jonathan Lethem’s plagiarism on plagiarism cleverly makes the point that literature is all about appropriation and re-imagining.
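The simplest computational version of such a dragnet is to look for runs of words that two texts share. The Python sketch below illustrates that idea only; it is not how Google Books or any plagiarism-detection product works, and the function names, the phrase length `n`, and the sample passages are hypothetical.

```python
import re


def ngrams(text, n=4):
    """Return the set of n-word sequences in the text (lowercased)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shared_phrases(text_a, text_b, n=4):
    """Return the n-word phrases that appear in both texts, sorted."""
    common = ngrams(text_a, n) & ngrams(text_b, n)
    return sorted(" ".join(gram) for gram in common)


# Invented passages for illustration only.
earlier = "A bachelor sat alone by the fire and dreamed of home."
later = "He sat alone by the fire, mocking every dream of home."

overlap = shared_phrases(earlier, later, n=4)
print(overlap)
```

A shared phrase is only a lead, of course: it cannot say whether the overlap is quotation, allusion, parody, plagiarism, or coincidence, which is exactly the interpretive question raised above.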

Curious about the capabilities of Google Books to track literary inheritance, I compared Google Books editions of Washington Irving’s Sketch Book, Donald Grant Mitchell’s Reveries of a Bachelor, Melville’s Pierre, and Henry James’ The Lesson of the Master, The Death of the Lion; The Next Time, and Other Tales. (I know that Google Books is itself the subject of controversy, a topic I plan to take up in a future post.) What Google Books, the Open Content Alliance, and other huge digital collections offer is lots and lots of data that one could mine to study literary inheritance, the evolution of ideas, and a lot more. I hoped to use Google Books to discover salient features not only of these works, but of the works preceding and succeeding them on the literary family tree. Google Books provides handy (if limited) tools for detecting literary influence and seeing a snapshot of a work through its “About This Book” feature, a sort of reference page gathering together key words, popular phrases, and references from other works. Google Books isolates 20 unique “Key words and phrases” (such as “sleepy hollow” and “diedrich knickerbocker”), as well as three key words associated with each chapter. Proper names seem to be the most frequent key words, so I wasn’t too surprised that the only overlap in key words among the four works was that both The Sketch Book and Pierre use variations of “mourn” and “methinks,” reflecting perhaps the melancholic tone of both works. Popular Passages seems to hold more promise for influence-tracking, since it enables readers to “follow the literary memes* that appear again and again in the world of books” by highlighting the ten passages that appear most frequently in other works, then by providing links to all of those works. 
For Irving and Mitchell, the most frequent “popular passages” were quotations from other works (such as Thomas Gray’s poetry) and passages of their own works that were frequently anthologized, while for James and Melville the popular passages were those most frequently cited by critics or represented in other editions of their works. Through popular passages, you can get a quick sense of how the authors were received in their time and ours and glimpse how a work fits into the larger network of literary influence. To understand how critics and commentators have received The Sketch Book, you can also examine “References from books” (which presumably searches Google Books for citations of the work), “References from scholarly works” (which presumably searches Google Scholar), and “References from web pages” (which presumably searches Google, in a limited way). If, say, The Sketch Book and Pierre were referenced by the same works, that might indicate some kind of kinship. For selected books (in my very small sample, only the travel-focused The Sketch Book), Google Books also provides a map of places mentioned in the book. The map for The Sketch Book shows dozens of markers clustered in Great Britain, revealing at a glance what an Anglophile Irving was. You could compare maps of several books to see if they share similar settings or itineraries.
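The key-word comparison described above boils down to a set intersection, which is easy to script once the lists are in hand. Here is a minimal Python sketch, assuming the “Key words and phrases” for each book have been copied out by hand (nothing here talks to Google Books). The function name `shared_key_words` and the sample lists are hypothetical; only “sleepy hollow,” “diedrich knickerbocker,” and “methinks” come from the discussion above, and the rest are placeholders rather than real data.

```python
from itertools import combinations


def shared_key_words(key_words_by_work):
    """Map each pair of works to the key words they share (pairs with no overlap are omitted)."""
    # Normalize case and surrounding whitespace so "Methinks" matches "methinks".
    normalized = {
        title: {word.strip().lower() for word in words}
        for title, words in key_words_by_work.items()
    }
    overlaps = {}
    # Compare every pair of works once, in alphabetical order by title.
    for first, second in combinations(sorted(normalized), 2):
        common = normalized[first] & normalized[second]
        if common:
            overlaps[(first, second)] = sorted(common)
    return overlaps


# Hypothetical, hand-entered samples; placeholders are not real Google Books data.
sample = {
    "The Sketch Book": ["sleepy hollow", "diedrich knickerbocker", "methinks"],
    "Pierre": ["Methinks", "placeholder term"],
    "Reveries of a Bachelor": ["another placeholder"],
}

for pair, words in shared_key_words(sample).items():
    print(pair, words)
```

Exact matching like this would miss the “variations” of a word noted above (say, “mourn” versus “mourning”), so a fuller version would probably need stemming or some looser comparison. The same pairwise approach could in principle be applied to the “References from books” lists to spot works cited by the same sources.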

After spending several hours playing with About This Book, I didn’t find much evidence for my theory of bachelor genealogy–perhaps I would need a more specialized tool to turn up that kind of evidence. However, it does seem that Google Books could be quite useful for scholars interested in book history, since it includes book advertisements that appear in back matter, catalogs of library and personal collections, and back issues of Publishers Weekly, as well as the aforementioned features that allow you to see how the book has been referenced. By providing access to so much of the cultural record, Google Books can allow scholars to broaden the scope of their inquiry and find connections among works. While I recognize Google Books’ potential as a scholarly tool, I can see several ways to improve it for scholars:

  • Allow users to see more than 10 “popular passages” and 20 keywords; for power users, expose all available data (in a user-friendly way, of course)
  • Be more transparent about how these different features work. I searched the Google Books site for an explanation of the different “About This Book” features, but couldn’t find much beyond a few sentences in the Book Search Blog. If I’m going to trust a tool, I’d like to know how it works.
  • Make it easier to search within “popular passages.” Analyzing popular passages can be quite time-consuming, especially when a passage is cited in 860 books, as is a quotation from Cymbeline that Irving includes in The Sketch Book. Ideally you could search inside your “popular passages” results to see if, say, Melville cited the same passage from Shakespeare.
  • Help readers sort out different editions of the same work. As a book history/textual studies gal, I like the fact that Google Books includes multiple editions of a work, even if the choice of editions included seems to be more an accident of what a library holds than a scholarly decision about which are the key editions. But once you start comparing the “About This Book” feature for those different editions, things get awfully confusing. When I compared two different versions of Reveries of a Bachelor, the 1852 L.C. Page edition and the 1906 Fenno edition, the data was quite different. Perhaps the text of the work changed substantially as different publishers came out with their own editions (neither is the authorized edition first published by Scribner in 1850), or maybe something’s wacky about the way that Google is generating this information. In any case, you get different popular passages (the Page edition includes lots of advertisements for other books published by the same company), different books/scholarly works that refer to this work, and different key words. Curious. So I guess I’d like a “compare editions” tool to reveal what’s really going on…
  • So this request may be a little pie in the sky, but I sure would love a way to know what works Google Books is not searching, or what it might be missing because of OCR errors. When my colleague Jane Segal and I studied humanities scholars’ use of digital resources, several worried that works not in digital form would be neglected and that scholarship would suffer from a sort of ignorance of the analog. I dunno–maybe such a list could be generated by comparing WorldCat records with Google Books? In any case, scholars need to be conscious of the limits of Google Books.
  • Fix errors in generating links to other books. For some reason, if you try to follow “References from books” past the first results page, you get an error that looks like this: “Your search – cites:0F_xecYtdwg1CDwRxfl_QZ – did not match any documents.” Argh!
  • Make it easier to download full-text of books (PDFs are handy, full-text is better).
  • Ensure free access to the full-text of public domain books. I got a little scared when I heard that Google Books would start charging for full-text, but that plan seems to focus on books that are part of its publisher partner program, not its library scanning project.
  • Experiment with visualization tools. For instance, it would be cool to play with some kind of social network graph or citation network showing all of the authors who cited Irving and were cited by him.

Maybe it’s not fair to expect Google to turn Google Books into a scholarly resource–perhaps it’s better for scholars to develop their own applications to analyze Google Books. I’m encouraged by news that members of the digital humanities community have been in conversation with Google. Dan Cohen makes a persuasive case for Google Books providing an API that would allow researchers to mine and manipulate information. Despite my criticisms, I love poking around “About This Book,” which provides a great way to get a snapshot view of how a work fits into the literary ecosystem.

I had planned to discuss my experiments using text analysis tools to trace literary influence, but I’ve gone on way way too long already, so I’ll save that for a future post.

Imagining new tools for humanities scholars

In my last post, I noted that most humanities scholars seem to want pretty basic tools that are rooted in immediate needs, tools that would, for example, allow them to convert files from one format to another, easily compose and exchange documents that use Unicode fonts, and find information quickly. Such tools would save scholars time and spare them from frustration. Even as the digital humanities community acknowledges (and serves) the need for basic tools, I believe that it should also continue innovating by developing applications for analyzing, mining and visualizing texts; annotating and searching images and video; collecting and sharing digital objects; etc. As more and more scholarly resources become available in digital formats, I think that humanities scholars will recognize a pressing need for tools that help them manage, analyze and share huge masses of information. It’s just difficult for people to imagine exactly what these tools would do. As the DLF found in its 2004 Scholars’ Panel, “so unfamiliar is this area that we heard from several individuals that they had a hard time articulating precisely what they required from such tools, or what level of software creation skills or consultancy is available to them, and where. We are still in a stage where it is easier to react to an example of an existing tool than to dream them up ex nihilo.”

But dang, it sure is fun–and useful–to dream up new tools. At the 2005 Summit on Digital Tools for the Humanities, participants were deeply and playfully engaged as they sketched out tools to support interpretation, exploration of resources, collaboration, and (my favorite) Visualization of Space, Time and Uncertainty. I love seeing what kind of imaginative tools and hacks Bill Turkel will come up with in his blog Digital History Hacks, such as history appliances.

Here’s my own wish list for scholarly tools. Most of these ideas come out of my practical need for tools to help me find and manage digital information, as well as my curiosity about how a tool from one domain (say, music) might work when applied to the scholarly domain. I’m beginning to regard Zotero as my scholarly workbench, so I’m imagining a lot of these tools as add-ons to Zotero (without the expectation that they would necessarily be developed by the Zotero team or included as part of the standard release).

  • Bibliography Ripper: I’m trying to determine how many of the works I cited in my dissertation are now available electronically, which is incredibly labor intensive, despite my crude attempts to use search tools such as Rollyo to speed up the process. Ideally I could feed my bibliography into a bibliography ripper (OK, perhaps it would need a softer name) that would allow me to select which entries I’d like to dump into a Zotero collection. Then it would automatically go out and search Google Scholar, Open WorldCat, etc. for each resource. If full-text is available, it would be automatically downloaded into my Zotero collection; otherwise, the call number would be captured. Of course, such a tool would be useful not only in allowing me to assemble a collection of research materials that I used before “going digital,” but in pulling citations from the works of other scholars, a common research practice.
  • Recommender: I’m tantalized by the recommendation engine planned for the next release of Zotero. I’d love a tool that would recommend resources based on what I already have in my research collections, saving me from having to go out and find them myself.
  • Auto-summarizer: Matt Kirschenbaum recently gave a wonderful talk on The Remaking of Reading: Data Mining and the Digital Humanities where he described scholarly practices of “not reading” (skimming, looking at bibliographies, reading summaries by others, etc) and distant reading (“using statistical, quantitative methods to ‘read’ large volumes of text at a distance”). Given the volume of information I’m trying to deal with, I could really use a tool that would offer reliable summaries of works (particularly if an abstract isn’t available) and would let me judge quickly whether I need to read more deeply.
  • Shuffle scholarly playlists: When I listen to my iPod on shuffle, I often notice connections among songs and details I had previously overlooked; randomness seems to foster attention. I wonder if a similar effect could be achieved by putting my research collections on shuffle, if my critical attention would be stimulated if I asked Zotero to give me a random article?
  • Authoring tools: I think one thing holding back digital scholarship is the lack of powerful, intuitive authoring tools. Sure, blogging and wiki software offers a number of advantages–ease of use, collaboration capabilities, etc. In particular, I see a lot of potential in WordPress. A student at Georgetown developed a well-designed, thoughtful “online research portfolio” about bachelorhood in nineteenth century American lit using WordPress. Developers are building WordPress plug-ins and themes geared towards scholarship, such as Courseware (which “enables you to manage a class with a WordPress blog”) and CommentPress (which “allows readers to comment paragraph by paragraph in the margins of a text”). I’m also a fan of the authoring tools provided by the open educational repository Connexions, which offers converters from Word to XML and a pretty simple edit-in-place interface. What I’m looking for, though, is a way to put together a layered, hypermedia scholarly work, kind of like a DVD with bonus materials. At this early stage, I envision having a track for my main argument, one for supporting materials (texts, images, audio, etc), one for a “making of” feature exploring the process of producing the project, and one for extras such as a Google Map showing where bachelor authors lived, a digital story using images and audio to explore literary bachelorhood, etc.
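Of the wishes above, the shuffle idea is the easiest to mock up. Here is a minimal Python sketch, assuming the collection has already been exported as a plain list of titles; it does not use any Zotero feature or API, and the function name `shuffle_reading_list` and the sample titles are hypothetical placeholders.

```python
import random


def shuffle_reading_list(items, seed=None):
    """Return a shuffled copy of a reading list, leaving the original order untouched."""
    rng = random.Random(seed)  # pass a seed to get the same "playlist" again later
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled


# Hypothetical titles standing in for an exported research collection.
collection = ["Article A", "Article B", "Article C", "Article D"]

playlist = shuffle_reading_list(collection)
print("Read next:", playlist[0])
```

Shuffling the whole list rather than drawing one random item at a time means every item comes up once before anything repeats, which is closer to how a music player’s shuffle mode behaves.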

Maybe my dream tools are already out there (or really out there) or are being developed–I’d love to find out!

What tools do humanities scholars need?

This week my colleague and I met with some of the leading philologists at Rice to discuss how they do their research and what tools would help make them more productive and innovative. I was primed to talk about text analysis and visualization tools, collaboration tools, collection-building tools, etc., but instead the conversation focused on much more bread-and-butter stuff. These scholars want:

  1. an easy way to convert from one file format to another. A religious studies scholar described the frustrating hours she put into converting files from Nota Bene to Word, hours that could have been spent doing research or writing. Others said that they have valuable files in obsolete formats and acknowledged printing out important documents so they would have at least some means of accessing them. Given the desperate need for an easy way to migrate file formats forward, I think scholars would embrace an easy-to-use batch conversion tool that would work with the sometimes-obscure file formats academics use. Bonus points for free, secure, long-term online storage of data. I think libraries have a real opportunity here to assist scholars in archiving and preserving their data, although scholars may understandably wish to retain custody over it.
  2. a reliable, commonly adopted word processor that handles Unicode well. As an Americanist, I’m blissfully unaware of the challenges of working with character sets such as Coptic, Ethiopic, Greek, Hebrew, etc., but these philologists struggle with font issues every day. Inputting characters is a pain, and exchanging files with publishers and others is even more of a hassle. Scholars said that they had to re-do work because their publishers didn’t have the right fonts installed on their systems. They also seemed to dislike Word, which is designed more for business applications. I started to wonder if Open Office could be adapted to meet scholars’ needs…
  3. a virtual reference desk of key texts in their field. Many foundational philology texts from the nineteenth century have not yet been digitized. Scholars said they would love to be able to consult these texts quickly, particularly dictionaries, lexicons, and other tools. Somewhat surprisingly (at least to a text encoder like me), they said that they wouldn’t require full text, just page images. (Interestingly, two of the three grants that were recently awarded through the NEH/IMLS’s Advancing Knowledge program focus on building reference/contextual tools: “Tufts will develop a digital reference tool allowing researchers and librarians to conduct context-based ‘smart searches’ of un-indexed words from existing databases in the Tufts Digital Library,” while “The University of California, Berkeley, in collaboration with the Queen’s University, Belfast, will develop a digital database of Irish studies materials to test three open-source digital tools. The Context Finder, Context Builder, and Context Provider tools will be aimed at establishing scholarly context.”)

Given that scholars’ most precious commodity is probably time, it makes perfect sense that they most desire tools that help them to be more efficient and avoid getting caught up in frustrating, tedious activities such as converting files, wrestling with word processing programs, and finding books in the library. This conversation echoed the results of a survey my colleague Jane Segal and I conducted in the spring investigating the impact of digital resources on humanities scholarship. Not surprisingly, scholars most commonly use technologies that serve their regular research practices. Of our 85 respondents, 100% use word processing programs, but only 36% use bibliographic software, and only 5% use text analysis tools. Our respondents most desired tools that would help them find resources more quickly: 88% wanted “Search tools that are powerful and easy to use” and “Search tools that go across multiple scholarly web sites,” but only 28% wanted text visualization tools, and only 13% ranked dynamic mapping/GIS tools as a priority.

I should emphasize that scholars are by no means hostile to cutting-edge visualization and analysis tools; they’re just not aware of them or aren’t sure that they would support their research practices. When we asked Whitman and Dickinson scholars what they thought of tools such as text visualization applications, they generally seemed intrigued, but they indicated that they would need to be persuaded that such tools would advance their research projects. That makes sense: except for a handful of “innovators” and “early adopters” eager to try something new (estimated by Rogers to be about 16% of the population), most folks are pretty pragmatic in their adoption of technologies. They need to be frustrated with current tools and convinced that investing time and money in adopting a new one will pay off in increased productivity. They will first adopt tools that help them do what they’ve always done, just better. According to a very interesting 2001 CLIR report on Scholarly Work in the Humanities and the Evolving Information Environment, humanities scholars are adopting technologies “that are enhancing many of their traditional work practices” (28). As Jerome McGann argues in Radiant Textuality, transformations in humanities research must be rooted in the core values and methods of the discipline: “the general field of humanities and education and scholarship will not take the use of digital technology seriously until one demonstrates how its tools improve the ways we explore and explain aesthetic works–until, that is, they expand our interpretational procedures” (xii).

So what drives humanities scholars to adopt new tools? Well, the research on this topic (one that I need to investigate further) seems to suggest that the tool needs to be easy to find and use, that people need to receive incentives and support, etc. Rather than get into all of that, though, let me offer two quick anecdotes:

  • “Research not re-search.” One of my job responsibilities is to run tech training workshops for faculty. Generally not even the promise of the opportunity to learn Really Useful Tools or feast on free lunch can lure people to these workshops, but a good number of humanities faculty and grad students showed up for my sessions on the wonderful, free, open source bibliographic tool Zotero. When I demonstrated how you could automatically download bibliographic information and articles from supported web sites, their faces lit up; they actually oohed and aahed. I felt like David Copperfield. The usefulness of such a tool was immediately apparent: “ohmygosh, I don’t have to copy and paste or type out citations; I can organize (and find!) my research much more easily. Hallelujah!” It was a little more difficult for the participants to grasp how they might use Zotero’s tagging functions, which resemble the schemas they use to organize their notes, but are different enough to prompt some confusion.
  • This summer I attended Ed Ayers’ keynote address at the Geography and the Humanities Symposium and was blown away by his presentation on visualizing dynamic temporal and geographical processes. The audience’s excitement was palpable–I heard a lot of “oh, wows.” The dazzle was due in part to Ed Ayers’ exceptional presentation skills, but also to the power of the visualization tools to illustrate change. I was impressed by how he made the unfamiliar familiar by comparing the dynamic visualizations he was showing to weather maps. As he demonstrated the applications, people could see patterns that would otherwise be hard to detect and thus understand the usefulness of such tools. (You can see Ed Ayers and Will Thomas’ fascinating presentation on “Time, Space and History” at the 2006 Educause conference here.)

I guess what I conclude from all of this is pretty obvious: humanities scholars need tools that help them to be more productive–and more innovative, although it’s harder for many folks to imagine what’s possible until they see concrete demonstrations. These tools can seem somewhat magical until scholars start incorporating them into their regular practices. One scholar that I interviewed this summer suggested that digital humanists run workshops on new tools and digital collections at conferences such as the MLA and American Literature Association conferences. She said she and her colleagues are curious about new digital tools and collections, but they aren’t necessarily aware of them and don’t always know how to use them. Of course, the conversation needs to be two-way–tool developers need to understand the needs of scholars. With projects such as NINES, MONK, etc., we can find great models for scholar/developer collaborations (in many cases, digital humanists themselves are both scholars and developers). And I’ve been impressed by the ways that projects such as Zotero and TAPOR have produced handy tutorials and actively promoted themselves.