Monthly Archives: May 2008

Open Education: Imagining Sleep

If I were to design a web-based course, I’d want it to make intelligent use of multimedia (movie clips, podcasts, music, images, etc.), adopt a Creative Commons license so that people could freely use it, be interactive, take an interdisciplinary approach, and, of course, demonstrate a deep knowledge of the course topic. For an excellent example of such a course, check out Imagining Sleep: An Interdisciplinary Course on Sleep and Dream by my pal Carolyn Fay. Carolyn’s certainly qualified to teach about sleep: she wrote a dissertation in French literature on sleep and dreams, and she taught several interdisciplinary courses on sleep while a faculty member at Franklin and Marshall College and Penn State Altoona (plus she has a young daughter, and therefore much experience with interrupted sleep). Imagining Sleep offers a series of lessons on the scientific, cultural, and psychological contexts surrounding sleep, complete with activities, readings, and informative, charmingly narrated podcast lectures. I think an important aspect of digital scholarship is making knowledge available to the wider community, and Imagining Sleep does a great job of organizing that knowledge coherently and using the Web wisely to deliver information.

Digging in the DiRT: Sneak Preview of the Digital Research Tools (DiRT) wiki

When I talk with researchers about a cool tool such as Zotero, they often ask, “Hey, how did you find out about that?” Not everyone has the time or inclination to read blogs, software reviews, and listserv announcements obsessively, but now researchers can quickly identify relevant tools by checking out the newly launched Digital Research Tools (DiRT) wiki: http://digitalresearchtools.pbwiki.com/. DiRT lists dozens of useful tools for discovering, organizing, analyzing, visualizing, sharing, and disseminating information, such as tools for compiling bibliographies, taking notes, analyzing texts, and visualizing data. We also offer software reviews that not only describe the tool’s features, strengths, and weaknesses, but also provide usage tips, links to training resources, and suggestions for how it might be implemented by researchers. So that DiRT is accessible to non-techies and techies alike, we try to avoid jargon and categorize tools by their functions. Although the acronym DiRT might suggest that it’s a gossip site for academic software, dishing on bugs and dirty secrets about the software development process, we prefer a gardening metaphor, as we hope to help cultivate research projects by providing clear, concise information about tools that can help researchers do their work more effectively or creatively.

DiRT is brand new, so we’re still in the process of creating content and figuring out how best to present it; consider it to be in alpha release and expect to see it evolve. (We plan to announce DiRT more broadly in a few months, but we’re giving sneak previews right now in the hope that comments from members of the digital humanities community can help us to improve it.) Currently the DiRT editorial team includes me, my ever-innovative and enthusiastic colleague Debra Kolah, and three whip-smart librarians from Sam Houston State University with expertise in Web 2.0 technologies (as well as English, history, business, and ranching!): Tyler Manolovitz, Erin Dorris Cassidy, and Abe Korah. We’ve committed to providing at least five new tool reviews per month, but we can do even more if more people join us (hint, hint). We invite folks to recommend research tools or software categories, write reviews, sign on to be co-editors, and/or offer feedback on the wiki. Please contact me at lspiro@rice.edu. [Update: You can also provide feedback via this form.]

By the way, playing with DiRT has convinced me yet again of the value of collaboration. Everyone on the team has contributed great ideas about what tools to cover, what form the reviews should take, and how to promote and sustain the wiki. Five people can sure do a heck of a lot more than one–and have fun in the process.

Ways that digital resources can transform teaching and research, grand and small

While trying to determine how many articles in JSTOR and Project Muse cite Making of America (MOA), I stumbled across several articles that describe how databases such as MOA are beginning to transform humanities research. (Funny–when I look for this kind of evidence, I don’t find it, but when I’m not looking, there it is.) Most of the essays focus on how online collections enrich research by making available works that would otherwise be difficult to locate, but in one a social historian imagines large, collaborative projects in which information technology plays a crucial role.

According to Sandra Roff, researchers are discovering sources that they otherwise would not have found because they can run full-text searches on databases such as Making of America and American Periodical Series Online, 1740–1900. Describing her research into the history of the Free Academy, the precursor to the City University of New York, Roff writes:

The standard histories published before the development of the internet now prove to be incomplete since new information is easily retrieved from periodical literature using the new technology. These periodicals can provide a picture of all aspects of life during a particular time period of history, which adds a new dimension to previously static historical facts. Since there are a limited number of indexes available for the greater part of the nineteenth century, research has usually been restricted to periodical sources close to the subject locale or else to periodicals in a particular subject area. Going beyond these parameters often would yield few results and would be considerably time consuming. However, by using these databases, we discovered that news of the Free Academy was not local but had indeed spread around the country. Without the limitations of subject, author and title searching, which were the only way that historical indexes such as Poole’s or any of the indexed New York City newspapers could be searched prior to online databases, articles can now be retrieved using keyword searches. These Boolean searches can reveal mentions of subjects embedded in articles that might earlier had proven elusive even if the periodicals were searched.

Similarly, Charles LaPorte argues that databases such as MOA are making it possible to study “obscure” ideas buried in Victorian periodicals:

What is new and exciting is our increasing access to formerly obscure Victorian ideas through online databases. The study of Victorian periodicals is flourishing today in part because Victorian print culture has never been more accessible, given indices like the Nineteenth Century Masterfile, and sites that reproduce Victorian journals like Chadwyck-Healey’s “Periodicals Contents Index” (PCI), the jointly-produced (and free) “Internet Library of Early Journals” (ILEJ) of the Universities of Birmingham, Leeds, Manchester, and Oxford, and the “Making of America” (MOA) database of Cornell and the University of Michigan. The growth of these and similar resources provides us not only more access to obscure poetry, but also to the print environment of known works, and to Victorian discussions of them.

Cynthia Patterson describes online access as a “bane and boon”: she used the web extensively to locate materials for her study of Philadelphia pictorial magazines, but worried that digitization would make her own research less unique and innovative, since everyone would now have access to the same materials she had so diligently pursued:

Like most scholars, I was finding the World Wide Web an unbelievably rich source for access to networking and research. About that time, I discovered the Research Society for American Periodicals, the Making of America collection at Cornell and Michigan, and the few issues of Godey’s available online. I also discovered Periodyssey, the rare book dealer in New York City, and quietly began buying up bound volumes, first of the Union, then of Graham’s, Godey’s and Peterson’s. I also took coursework through George Mason University’s Center for History and New Media. While I was fascinated with the work they were doing, digital access became a source of dread: I lived in fear that someone else would suddenly digitize the magazines in my study before I could finish my project!

To encourage students to conduct original research, teachers are promoting MOA and other databases that provide access to primary source materials. Christopher Hanlon laments the difficulty of getting students to do serious literary scholarship and explains how requiring them to use online databases such as Making of America for their research led them to produce more interesting, original work. For instance, one of his students drew on magazine articles from MOA to show how the Swede in Crane’s “The Blue Hotel” reflects late nineteenth-century anxiety about Swedish immigration to the US.

By urging my students to use OCR databases to do historical research on literary texts, I was asking them to view the texts on our syllabus in Hayden White’s (1978: 81) sense of a ‘literary artifact,’ but more than that, I was urging them to take charge of their own experience of literature and hence the experience they were asking their readers to share in. Although students still don’t possess a deep sense of history, using online archives can empower students to do something we always ask of them but hardly ever equip them to accomplish: devise their own way into a text, and a way in about which we are, finally, interested.

As these comments suggest, it seems that researchers currently most value digital collections for providing enhanced access to a broader range of materials; my colleague Jane Segal and I reached a similar conclusion in our survey of humanities scholars last year. Through enhanced access, both the depth and breadth of research can be improved, as researchers uncover sources that would be otherwise difficult to discover and can quickly search a wide range of materials. Perhaps in the next five or ten years, researchers will also be saying that how they fundamentally do research and what kinds of questions they can pose have also changed, as projects such as MONK, NINES, etc. provide sophisticated tools for working with digital information and online environments for collaboration, publication, etc. (Or maybe they’re saying this already and I haven’t stumbled across those sources yet.)

In developing digital tools and methods, we should consider how they can help scholars tackle particular research challenges. Calling for historians to undertake “big,” collaborative social science research projects, Richard Steckel suggests that “large-scale archives” and “systematic information collection” can enable researchers to pursue ambitious projects, such as studying climate history, creating an international catalog of films and photographs, digitizing the notes of prominent historians, and creating a database of crime reports from 1800 to the present. He also proposes that historians digitize large collections of diaries and letters, citing MOA, Valley of the Shadow, and the Evans Early American Imprint Collection as examples of successful digitization projects. Although Steckel doesn’t use the term “digital scholarship,” he makes the case for research that requires collaboration, draws on large databases, uses computer-based tools such as GIS and statistical applications, and engages historians in producing documentaries and databases–which sure sounds like digital scholarship to me.

What qualifies as a “grand challenge” in the humanities? Such a question seems to drive initiatives to develop digital scholarship in the humanities. According to the report of the ACLS Commission on Cyberinfrastructure for the Humanities and Social Sciences, building the cyberinfrastructure is itself the humanities’ grand challenge. The AHRC e-Science Scoping Study acknowledges the difficulty of describing specific grand challenges, but points to a few possibilities: developing tools for researchers that facilitate “annotating, collating, visualising and simulating the digital content created and used within their research,” as well as “new collaborative tools and virtual collaborative environments.” Steckel’s climate history idea particularly resonates with me, freaked out as I am about climate change, but other ambitious collaborative projects spring to mind: initiatives that aim to make the humanities more global and interdisciplinary (such as Mappamundi), major GIS projects (such as Africa Map), open access data archives (such as OpenContext), etc. Given the NEH’s recently-announced high-performance computing initiative, I also wonder about the possibilities of using supercomputers to conduct complex queries across massive collections of texts, construct 3D models of cultural heritage sites, run simulations of both historical and literary events, etc.

While I’m on the subject of grand challenges and big projects, in a compelling article in the most recent Literary & Linguistic Computing, Patrick Juola argues for “Killer Applications in Digital Humanities,” which he defines as “a solution sufficiently interesting to, by itself, retrospectively justify looking [at?] the problem it solves—a Great Problem that can both empower and inspire.” Juola suggests that to make digital humanities more relevant to the broader humanities community, it should develop tools that serve “the needs of mainstream humanities scholars.” As examples of potential “killer apps,” Juola describes tools that would enable humanities scholars to automatically create back-of-the-book indices, annotate works, and discover and explore resources.

Amen. I am excited by the potential of big projects and killer apps to open up new discoveries and methods, build knowledge, serve the social good, etc. However, I hope we don’t lose sight of the contributions that small, focused projects can make as well. As an example of the mismatch between scholars’ needs and the tools developed by digital humanities folks, Juola points to an electronic scholarly edition of Clotel, which allows readers to compare passages and track changes. According to Juola, “it is not clear who among Clotel scholars will be interested in using this capacity or this edition,” and the annotation capabilities cannot be applied to other texts. But I think such a comment may reflect an all-too-common underappreciation of textual scholarship. Since Clotel exists in four versions, being able to compare passages is of real benefit to researchers. It’s not as if this project was created without consulting with scholars; indeed, the editor is a distinguished scholar of African-American literature. Although I certainly agree that digital humanities projects should focus on researchers’ needs (hence the significance of projects such as Bamboo, which are trying to discern those needs), I also believe that innovative methods of exploring and representing knowledge can come out of experiments such as the Clotel edition. (I should acknowledge that I’m pals with some of the folks involved in developing this electronic edition.) Of course, ideally experimental tools and interfaces would be developed in as open a fashion as possible so that other projects can build on the work. As the examples I cited at the beginning of this post illustrate, big projects–text collections, databases, annotation tools, GIS maps, etc–can facilitate research into more focused topics, which in turn can contribute to our understanding of the big picture or lead us to a small but nonetheless dazzling insight.

Works Cited:

Hanlon, Christopher. “History on the Cheap: Using the Online Archive to Make Historicists out of Undergrads.” Pedagogy 5.1 (2005): 97-101. <http://muse.jhu.edu/journals/pedagogy/v005/5.1hanlon.html>.

Juola, Patrick. “Killer Applications in Digital Humanities.” Literary and Linguistic Computing 23.1 (2008): 73-83. 15 May 2008 <http://llc.oxfordjournals.org/cgi/content/abstract/23/1/73>.

LaPorte, Charles. “Post-Romantic Ideologies and Victorian Poetic Practice, or, the Future of Criticism at the Present Time.” Victorian Poetry 41.4 (2004): 519-525. 7 May 2008 <http://muse.jhu.edu/journals/victorian_poetry/v041/41.4laporte.html>.

Patterson, Cynthia. “Access: Bane and Boon.” American Periodicals: A Journal of History, Criticism, and Bibliography 17.1 (2007): 117-118. 7 May 2008 <http://muse.jhu.edu/journals/american_periodicals/v017/17.1patterson.html>.

Roff, Sandra Shoiock. “From the Field: A Case Study in Using Historical Periodical Databases to Revise Previous Research.” American Periodicals: A Journal of History, Criticism, and Bibliography 18.1 (2008): 96-100. <http://muse.jhu.edu/journals/american_periodicals/v018/18.1roff.html>.

Steckel, Richard H. “Big Social Science History.” Social Science History 31.1 (2007): 1-34. <http://muse.jhu.edu/journals/social_science_history/v031/31.1steckel.html>.

What can you do with texts that are in a digital format?

I’ve had a longstanding, friendly debate with a colleague about whether it is sufficient to provide page images of books, or whether text should be converted to a machine- and human-readable format such as XML. She argues that converting scanned books to text is expensive and that the primary goal should be to provide access to more material. True, but converting books into a textual format makes them much more accessible, allowing users to search, manipulate, organize, and analyze them. Here’s my summary of what you can do with an electronic text. Most of these advantages are pretty obvious, but worth articulating.

  • Read it—on paper (once you print it out or pay for on-demand printing), on your computer, or, increasingly, on a portable device. From a single XML file, you can generate many forms of output, including HTML, PDF, and formats suited to mobile devices.
  • Copy and paste it–avoid the hassle of having to retype passages.
  • Search it. Several years ago, I wrote a series of learning modules on stereographs, 3D photographs popular in the late 19th and early 20th centuries. I searched for books and articles on stereographs in the library catalog and in journal collections such as JSTOR, but was kind of disappointed by the lack of relevant information. Last year I returned to the topic and used Google Books for my research. I found dozens more relevant sources, such as key theoretical and historical works on stereography (most of which had already been published when I first studied the topic) as well as some fascinating nineteenth- and early twentieth-century manuals. Sure, I had to wade through a lot more stuff to find what I needed, but being able to search the contents of books and essays as well as the metadata let me uncover many more useful sources.
  • Build a personal collection. Forget file cabinets crammed with photocopies. Using tools such as Zotero and EndNote, you can easily download articles and the accompanying bibliographic information onto your laptop, then take your entire collection with you on a plane, to an archive, to a boring meeting, etc. You can search your collection, sort it, create bibliographies, etc.
  • Share it. Much to the chagrin of movie studios and record companies, digital files are easy to share, so you can give colleagues access to articles, notes, bibliographies, etc. without having to deal with physical delivery (copyright permitting, of course). With the forthcoming Zotero 2.0, sharing will get even easier.
  • Analyze it. Once you have a book in a text-based format, you can do all sorts of nifty things with it–generate word counts, find out what terms appear most frequently next to a particular word, extract dates, find capitalized terms, compare texts, and much more. See TAPoR’s tutorial.
  • Visualize it. Not only are text visualization tools, well, cool, they also can open up interpretive insights. For instance, using the US Presidential Speeches Tag Cloud, you can get a quick, dynamic view of the history of presidential priorities.
  • Mine it. Look for patterns in large textbases. As Loretta Auvil of NCSA & SEASR explains, text mining tools such as those being developed by MONK and SEASR enable researchers to automatically classify texts according to characteristics such as genre, identify patterns such as repetition (as in the case of Stein’s The Making of Americans), analyze literary inheritance, and study themes across thousands of texts.
  • Remix & play with it. By taking the elements of a text or collection of texts and remixing them, you not only produce a new creative work, but also see the text in a new way–your attention is brought to particular linguistic elements, like the fragments of a broken vase used to make a mosaic. For instance, when I used the Open Wound “language mixing tool” with Melville’s 1855 sketch “The Paradise of Bachelors and the Tartarus of Maids”, I gained new insights into the violence and anxiety expressed by words such as “agony,” “cut,” and “defective.” Running the tool on the sketch also produced some stunning phrases that could serve as mottoes for this kind of activity: “Exposed are the cutters,” “in the meditation onward,” and “protecting through the scholarship.” I also plan to play with tools that would allow me to mash up several bachelor texts (take the beginning from Irving, the middle from Melville and Hawthorne, the end from Mitchell), replace key words with pictures, etc.
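To make the “analyze it” point concrete, here is a minimal sketch of the sort of word-count analysis described above, using only the Python standard library. The sample passage is just a stand-in for any plain-text e-text you might download:

```python
# A minimal word-frequency sketch for a plain-text e-text.
# Uses only the Python standard library; the sample text is illustrative.
import re
from collections import Counter

def word_frequencies(text, top_n=10):
    """Lowercase the text, split it into words, and return the most common terms."""
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(words).most_common(top_n)

sample = ("Call me Ishmael. Some years ago, never mind how long precisely, "
          "having little or no money in my purse, and nothing particular "
          "to interest me on shore, I thought I would sail about a little.")
for word, count in word_frequencies(sample, top_n=5):
    print(word, count)
```

Tools like TAPoR wrap this kind of counting in a friendlier interface, but the underlying operations (tokenize, count, rank) are this simple once the text exists in a machine-readable format.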

Some really interesting research is underway on the possibilities of text mining for humanities scholarship–including the aforementioned MONK and SEASR projects, as well as CHNM’s “Scholarship in the Age of Abundance: Enhancing Historical Research With Text-Mining and Analysis Tools.”

Evaluating the quality of electronic texts

In my last post, I said that 83% of the primary source works that I used in my dissertation are now available online as full-text. But how reliable are these electronic texts? Can researchers feel comfortable citing them and using them for text analysis? In my view, the quality of an electronic text and its appropriateness for use in scholarship depend on six factors:

  • Quality of the scanning: Is the complete page captured? Is the image skewed or distorted? Is the image of sufficient resolution?
  • Quality of the OCR/text conversion: Is full text provided? What method was used to produce the text–double-keying or OCR? How accurate is the text? Are the texts marked up in TEI (Text Encoding Initiative)? Are words joined across line breaks? Are running heads preserved?
  • Quality of the metadata: Is the bibliographic information accurate? Is it clear what edition you are looking at? If there are multiple volumes, do you know which volume you are getting and how to locate the other volume(s)?
  • Terms of use: What are you legally able to do with the digitized work? Can you download the full-text and use tools to analyze it? Is the content freely and openly available, or do you have to pay for use?
  • Convenience: Can you easily download the text and store it in your own collection? How much work do you have to do to convert the text into a format appropriate for use with text analysis tools? How hard is it to find the electronic text in the first place? Is there a Zotero translator for the collection?
  • Reputation: Is the digital archive well-regarded in the scholarly community? If you cited the archive in your bibliography, would fellow researchers question your decision? Does the archive provide clear information about its process for selecting, digitizing, and preserving texts?

I focused my evaluation on the main collections that I plumbed for the primary source works in my dissertation bibliography: Google Books (GB), Open Content Alliance (OCA), Early American Fiction (EAF), Project Gutenberg (PG), and Making of America (MOA). I found the OCA works in the Internet Archive (they are marked as belonging to the “American Libraries” or “Canadian Libraries” collections.) I apologize in advance for the length of this post, but I want to dig into the details.

Quality of Images

Perhaps the most heated debate over the quality of digitized texts has focused on Google Books. For example, Paul Duguid and Robert Townsend have questioned the quality of Google Books, providing examples of skewed or poorly scanned pages, inaccurate metadata, and the failure to make available materials that should be in the public domain. It goes without saying that providing access to high-quality page images is important: many researchers want to study the illustrations and other visual features of a text and to verify the converted text against the original page. Furthermore, a poorly scanned page probably means that the resulting OCR will also be bad.

To evaluate the quality of Google’s scans, I first took a bird’s-eye view, using the page preview function in Adobe Acrobat to get a quick glimpse of the image quality of 56 nineteenth-century works that I downloaded from GB. Using this admittedly inexact method, I noticed fewer than 100 scanning errors across the approximately 11,000 pages I glanced at. If I found one distorted image (see, for instance, the 1834 Knickerbocker) or a finger in the scan, it was likely that the book would contain other errors as well–maybe the scanning operator was, um, distracted. However, closer scrutiny of the files revealed other errors not visible through the preview. For instance, the text for Typee ended right in the middle of a word, which I don’t think is how Melville meant it to be (although that narrative approach does leave you anticipating what comes next). In the 1906 Maynard & Merrill edition of The Sketch Book, the last few lines on a number of pages are, for lack of a more precise term, stretched and curved, as if the page were turned too quickly during the scanning or photographing. (Judging from all of the black splotches on the pages, this text appears to be scanned from microfilm, so the quality issues may have been introduced during the microfilming, not the scanning.) Although I can’t make a definitive statement about the quality of Google Books’ scans, I’d say there are some problems, but they are not as significant as I thought they would be.

Although Google Books seems to have the most content right now (about 2,680,000 books?), I prefer the quality of the scans provided by the Open Library/Internet Archive, which provides searchable text + image PDF, as well as DjVu files and a flip-book format that simulates the experience of turning book pages online. Whereas most of the pages in Google Books are scanned in black and white, the OCA scans are in full color, showing the coloration of the page and the richness of the illustrations. Although I haven’t conducted a systematic study of OCA scans, I haven’t noticed many problems at all. (Disclaimer: Rice University’s Library is a member of the Open Content Alliance, but I have not been personally involved in working with the organization.)

Likewise, the scans for Early American Fiction are full-color and were captured at a high resolution. When I worked at Virginia’s Electronic Text Center, I visited the photography studio for EAF and witnessed the care with which each page was scanned, so I would be surprised to find many problems. Although the MOA images are black and white, the scans appear to be of high quality and were captured at 600 dpi. Project Gutenberg does not provide page images, so the issue of scanning quality is moot for this collection.

Quality of Text Conversion

Although the ability to search and retrieve an entire text online makes research more convenient and comprehensive, doing deeper analysis requires either downloading texts or running tools on online content. With some subscription-based collections, such as NetLibrary and Questia, you are limited to viewing the work one page at a time, which really restricts what you can do with it. With other collections, you can capture the entire work by downloading it as a web page (Muse, EAF), image-only PDF (Google Books), searchable PDF (OCA and JSTOR, lately?), or plain text with uncorrected OCR (Making of America, OCA).

My favorite sources are those that provide good quality full-text of the article or book in HTML (XML would be even better), since that minimizes the work I need to do in getting the text into a format I can use with text analysis tools. Converting the image-only PDFs provided for download by Google Books into plain text or XML has been troublesome, perhaps because the resolution of the scans and overall quality of the scanning could be better, perhaps because nineteenth century works aren’t necessarily printed clearly. (Several folks have complained about the quality of Google Books PDF files: 1, 2.) To get the Google Books PDFs into a format I could work with, I had to run them through Acrobat 8.0’s optimizer to improve the image quality, then through its OCR engine to convert the files to Adobe Tagged Text XML, a very basic markup format.

In assessing the quality of the full-text files provided by Google Books, the Open Content Alliance, Early American Fiction, Making of America, and Project Gutenberg, I compared the quality of OCR for 3 works important to my dissertation: Mitchell’s Reveries of a Bachelor, Melville’s The Piazza Tales, and Irving’s The Sketch Book. [Results of the analysis are available here.] EAF, OCA, and GB held all three books, but MOA only had Reveries and PG only had The Sketch Book and The Piazza Tales. For GB, I looked at the quality of the OCR I was able to generate from the image-only PDFs, as well as of the plain text provided online by Google Books, which allows you to view only one page at a time. I selected the same three pages in each of the books and counted the number of OCR errors per page. In the case of The Piazza Tales, I was able to use the same edition from all of the collections (the 1856 Dix & Edwards edition), but even for the other works the number of words per page seemed about the same. I found that:

  • The most accurate texts were produced by EAF (1 error total across the 3 sample pages I examined in each of 3 texts), Project Gutenberg (0 errors in 2 texts), and MOA (0 errors in 1 text). The OCR that I generated from the Google Books PDFs was the least accurate, with a total of 111 errors, while there were 44 errors in the 9 pages I examined from OCA. Full disclosure: I was a project assistant for EAF and expected that the quality of text conversion would be high, since the texts were double-keyed, marked up in TEI, and then reviewed by a graduate assistant.
  • The OCR produced by Google Books is of much higher quality than what I was able to create through Adobe Acrobat’s OCR engine–there were 38 errors in the plain text online version vs. 111 in my OCRed version. (I’m betting that Google is working with better images and better software.) But even Google Books’ OCR engine was tripped up by poorly scanned pages; for instance, in one page from The Sketch Book, words were cut off along the left edge, resulting in 25 OCR errors for this page.
  • The number of text conversion errors would increase for OCA if I included words that broke across lines. Although preserving line and paragraph breaks, as OCA does, makes it easier for the reader to move between the page image and the OCRed text, this practice also means that word counts for OCA texts are probably not accurate, since terms such as “pic” <line break> “ture” would not be registered as “picture”. Depending on the density of words on the page, it seems that an average of 4-6 words per page are broken across lines.
  • Running heads are preserved in the MOA, OCA and Google Books texts, so word counts for terms that appear in these heads (e.g. “Reveries of a Bachelor”) would be inflated.
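The line-break problem is at least partially fixable in post-processing. As a minimal sketch (not any collection’s actual pipeline), a single regular expression can rejoin most words that OCR left hyphenated across lines before word counts are computed:

```python
# Rejoin words broken across line breaks in OCR output, e.g. "pic-\nture" -> "picture".
# This is a heuristic: it assumes an end-of-line hyphen marks a broken word,
# so it will occasionally rejoin a genuinely hyphenated compound as well.
import re

def join_linebreak_hyphens(ocr_text):
    """Merge word fragments split by a hyphen plus line break."""
    return re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", ocr_text)

page = "The old pic-\nture hung in the par-\nlor."
print(join_linebreak_hyphens(page))
```

A real cleanup pass would also strip running heads and page numbers, for the reasons noted above, before feeding the text to an analysis tool.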

In addition to sampling texts to determine the number of errors in text conversion, I also wanted to get a macro view of the accuracy of each version. I decided to focus on The Piazza Tales, since OCA, Google Books, and EAF all contain the 1856 Dix & Edwards edition. I used the HTML file downloaded from EAF (the TEI version would have been preferable, but there is no direct access to that file), the plain text file from OCA, and the Adobe Tagged Text XML I generated from the Google Books image-only PDF. (I couldn’t get either HyperPo or TAPOR Tools to work with the Gutenberg file.) To generate the word counts, I ran each text through HyperPo 6, then compared the results [available as spreadsheet]:

  • Word counts differed in each file, even though the editions were identical. For example, HyperPo 6 counted 311 instances of “all” in the Google Books version, 383 in EAF, and 380 in OCA. Likewise “one” appears 327 times in Google Books, 391 times in EAF, and 390 times in OCA. These variations are probably due to differences in the quality of text conversion. I would expect Google Books to register lower counts for words, since the OCR that I was able to produce appears to be least accurate. EAF would probably contain the most accurate results, since it (unlike OCA) typically joins words broken across lines.
  • Each version contains words/characters that would lead to false results. For instance, the word count for EAF includes terms that were not part of the original text but that were used to aid online navigation, such as “link” (1843 times), “page” (1368), and “window” (948). The letter “s” appears 421 times in the GB results, 468 times in EAF, and 471 in OCA, signifying, I suspect, a word fragment.
  • “Piazza,” which is in the running head for the book, appears 243 times in OCA, but does not register in the top 50 terms for GB or EAF. In the case of EAF, the running heads are stripped out, so “Piazza” is not given too much weight. In the case of Google Books, the OCR quality for terms that are capitalized is usually terrible–instead of “THE PIAZZA TALES” you get “‘I’ BZ .1 A Z&A ‘I’ ALB•” or “TUB PI..lZZA TAL.”
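The kind of frequency comparison HyperPo produces can be reproduced with a simple token counter. A sketch, using an invented three-page sample, shows how an un-stripped running head inflates a term’s count, as happens with “Piazza” in the OCA text:

```python
import re
from collections import Counter

def word_counts(text: str) -> Counter:
    """Lowercase the text, tokenize on runs of letters, and tally frequencies."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

# Invented example: three "pages," each topped by the running head
pages = [
    "THE PIAZZA TALES\nI sat upon the piazza at dusk.",
    "THE PIAZZA TALES\nThe sea was calm that evening.",
    "THE PIAZZA TALES\nA sail appeared on the horizon.",
]
counts = word_counts("\n".join(pages))
print(counts["piazza"])  # 4: three from running heads, one from the text itself
```

Note that a tokenizer like this also splits possessives (“bachelor’s” yields “bachelor” and a stray “s”), which is one plausible source of the high counts for the letter “s” noted above.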

As Wesley Raabe has shown, even the most carefully prepared edition of a work, whether print or electronic, is likely to contain errors (and some works aren’t prepared that carefully). When you introduce OCR into the process, the rate of error increases. Now for purposes of search and retrieval, OCR errors may not be all that significant–the OCR probably wasn’t inaccurate for every instance of a word, so you can still determine if a text contains that word. However, if I were to choose a version of a text to use with a text analysis program, I would go with the EAF, since the EAF text was double-keyed, marked up in TEI, and appears to have the highest accuracy rate. Unfortunately, producing double-keyed, marked-up texts is more expensive than creating uncorrected OCR, so there are far more mass-digitized, OCRed books than double-keyed, TEI encoded texts. There are 886 volumes in EAF (which is focused on nineteenth century American literature) and over 2 million in GB. If scholars are going to use text analysis, mining and visualization tools on massive collections of digitized texts, they probably need to be explicit about their tolerance for error. (It would be nice if they could get access to the higher quality OCR produced by GB, too.)

Quality of Metadata

In order to trust the reliability of a work, you need good metadata so that you know what you are looking at (and in some cases to find it in the first place). As Duguid has observed, the metadata for Google Books can be poor, particularly when you’re dealing with a multi-volume work, as you often are when you’re working with nineteenth century literature. Apart from viewing the title page, there’s no way to tell that a work in GB comes in multiple volumes, and the metadata record provides no linkage from one volume to the other in the series. Indeed, sometimes it appears that only one volume in a series has been digitized. However, GB also provides potentially useful information about books–not only the title, author, publisher, and publication date, but also the page count, digitization date, and subject keywords, as well as reviews, unique phrases, and popular passages.

Although OCA does offer a lot of information about works in its collections, the metadata could be organized more effectively. When you search for a work, the information that appears in the results page is too general: you see the title, author name, and number of downloads, but not the publisher or publication date, so you have to follow the link to know what edition you’re getting. Sometimes the OCA omits the publisher and publication date from the metadata record even though this information is available on the title page of the book. At least the OCA typically includes the volume number in the title. I also like the way that the Internet Archive provides detailed metadata about how the electronic text was produced, including the number of images and scanning operator. Ideally you’d get subject terms as well, and the author name and title could serve as hotlinks to more content by that author or with that title. (The Open Library, which also involves Internet Archive guru Brewster Kahle, often provides more complete metadata records, but it’s a meta-catalog, with records for non-digitized works held by libraries, even GB.)

Both EAF and MOA offer complete bibliographic information for works in their collections. Unfortunately, Project Gutenberg often does not provide bibliographic information beyond the title, author, and subject terms, which is a real problem for scholars who need to know what edition they are looking at.

Restrictions on Use of Digitized Materials

Brewster Kahle and John Wilkin of the University of Michigan recently debated what “open” means. According to Kahle, open content, like open source code, can be “downloaded in bulk, read, analyzed, modified, and reused.” Wilkin replies that “There is no uniformly defined constituency called ‘researchers’ who ‘require downloadability.'” While Wilkin may be right about the current needs of average users, this statement overlooks the needs of researchers who want to do more than just search and retrieve a text–for instance, those who want to build their own collections, run analytical tools that require you to load files from your local hard drive, and so on. Of course, with some tools, such as TAPOR, you can provide a URL for the text to be analyzed. The question may not be so much one of downloadability as the ability to access and manipulate online texts–can you get access to the data through an API, for instance? Is text mining even permitted by the terms of use?

Here’s my summary of the terms of use put forward by the collections that I used, with the caveat that I Am Not A Lawyer and may be misinterpreting legalese.

a. Google Books:
Google Books, which does not require a subscription, makes available most of its public domain books for download as image-only PDF files, but imposes four restrictions on their usage: automatic querying is prohibited, the files should be used for non-commercial purposes, attribution to Google must be maintained, and users should “keep it legal,” particularly since copyright law varies by country. [I won’t wade into the debate about whether Google Books is good or bad for what I’ll call a “researcher-friendly” approach to copyright, but Lawrence Lessig and Siva Vaidhyanathan offer interesting perspectives.]

b. Internet Archive/ Open Content Alliance:
The Internet Archive, which is freely available, requires users to adhere to a 7-page terms of use agreement that governs use of all collections. Essentially the Internet Archive makes the user responsible for using content appropriately, requiring that “you certify that your use of any part of the Archive’s Collections will be noncommercial and will be limited to noninfringing or fair use under copyright law.” As the name suggests, the Open Content Alliance seems to be pretty darn open in its approach.

c. Early American Fiction:
Access to EAF requires a subscription, although 158 of 886 volumes are freely available from the University of Virginia Library. (However, something seems to be broken right now at Virginia, since I wasn’t able to open any of the freely available EAF texts.) ProQuest’s terms of use for Early American Fiction are much more explicit and restrictive than those of the Open Content Alliance or Google Books. Essentially, “You will use the Products solely for your own personal or internal use.” You cannot create derivative works, nor can you systematically download works to create a collection of materials. Works can be printed or saved only for private or educational use.

d. Net Library:
NetLibrary, which is available only through subscription and focuses on copyrighted content, prohibits the use of automated methods, such as robots or data mining tools, for collecting or analyzing its data without “express written permission.” It puts forward what seems to me to be a fairly expansive policy limiting use of copyrighted materials: “You may not modify, alter, publish, transmit, distribute, display, participate in the transfer or sale, create derivative works, or in any way exploit, any of the Copyrighted Material, in whole or in part.” Presumably this policy wouldn’t rule out using text analysis tools on works downloaded from NetLibrary (assuming you could actually get the full text of resources), but it does seem to ban remixes or even alteration of the work (conversion to XML?). NetLibrary is owned by OCLC Online Computer Library Center, Inc., an Ohio nonprofit corporation.

e. JSTOR:
In its usage policy, JSTOR, which requires a subscription, recognizes that fair use governs what researchers can do with research materials. Users may not download entire issues or “incorporate Content into an unrestricted database or website,” but they can “search, view, reproduce, display, download, print, perform, and distribute articles, book reviews, journal front and back matter, and other discrete materials.”

f. Project Muse:
Owned by Johns Hopkins UP, Project Muse permits downloading one copy of articles for personal use as well as distributing works to students in a class at the subscriber’s institution. Project Muse, which requires a subscription, will not allow users to employ automated processes to download works, re-use content for commercial purposes, or “modify or create a derivative work of any Journal content without the prior written permission of the copyright holder.”

g. Questia:
Available only through individual subscription, Questia, which focuses on copyrighted content, puts forward very restrictive terms of use: “You may use the Questia Web site for your personal or academic research activities. Any other use, including the reproduction, modification, distribution, transmission, republication, display, performance, rehosting, tampering, framing, or embedding of this site or its content or tools, or any commercial use whatsoever of this Web site or its content or its tools, is strictly prohibited without our prior written consent…. You agree not to reverse engineer, reverse assemble, reverse compile, decompile, disassemble, translate or otherwise alter any executable code, contents or materials downloaded from or made available through our Web site. You agree not to use programs, scripts, code or other available methods to download or view multiple pages of content on Questia in an automated fashion. You agree not to save the content available on Questia on your hard drive or other storage device for viewing offline.”

h. Making of America:
Michigan’s MOA, which is freely available, provides a standard availability notice with the metadata record for each book or journal granting the right to search, but prohibiting the redistribution of materials without permission: “These pages may be freely searched and displayed. Permission must be received for subsequent distribution in print or electronically.” However, a more general usage statement on the home page for the University of Michigan Digital Library Text Collection states that “Users are free to download, copy, and distribute works in the public domain without asking for permission.” Cornell‘s policy for its content in MOA seems more restrictive: “This material is to be used for personal or research use only. Any other use, including but not limited to commercial or scholarly reproductions, redistribution, publication, or transmission, whether by electronic means or otherwise, without prior written permission of the Library is prohibited.”

Bottom line: For most of these online collections, particularly those that are non-commercial, researchers are permitted to download and analyze texts. However, if a scholar wanted to create a free web-based collection of, say, bachelor texts downloaded from various sources, that would probably require getting permission. Converting downloaded files from one format to another for purposes of personal research generally seems OK. If you wanted to extract a few illustrations from digitized public domain texts (assuming they were of sufficient quality to begin with) and reuse them in, say, a video, that would seem to be permitted by fair use if not the explicit policy of the collection, but I’m not totally sure.

Convenience

I want to spend my time analyzing works, not searching all over the web for them, typing bibliographic information into Zotero, or converting files into another format. As already noted, GB allows you to download image-only PDFs of most public domain materials, but you have to optimize and OCR these files if you want to search or analyze them. Nevertheless, it’s pretty easy to find relevant works in GB’s huge collection, and it does provide support for Zotero. At the Open Content Alliance, you can download books in several formats, including DjVu, word-plus-image PDF, and plain text. You can even access the files via an FTP site. However, the Internet Archive does not yet have a Zotero translator (strange considering the recently-announced partnership between the two groups), which means more work for me in capturing bibliographic information.

With the Michigan Making of America collection, you can download the uncorrected full text after reading a warning that the OCR contains errors. I was also interested to note that Michigan now provides a print-on-demand service for works in MOA–the metadata record includes a link “Order a Softcover @ Amazon.com.” The 1864 Reveries of a Bachelor, for instance, costs $23.99 from Amazon, and you can search inside the book at Amazon. (Amazon lists 7,130 results when you search for “scholarly publishing office University of Michigan Library”.) Michigan’s MOA just added Zotero support, but it doesn’t seem to be working yet. It’s pretty easy to work with files from Project Gutenberg (plain text and sometimes HTML) and EAF (HTML), although neither offers Zotero support.
Even if I have to search in multiple collections for texts that I need, manually add bibliographic information, and convert files to different formats, I should acknowledge that it’s so much faster and more efficient to do research online than to look in catalogs for books, track ’em down in the stacks, skim them for key words, photocopy them, and copy down the bibliographic information.

Reputation

Evaluating the reputation of an online collection is tricky–it’s difficult to know how a scholarly community regards particular sources without doing citation analysis or conducting interviews with or surveys of scholars. To get a quick sense of how often various digital collections are cited, I searched for the title of the collection or the base URL (e.g. books.google.com) in Project Muse and JSTOR. Only five works cite Google Books (three of which focus on the book scanning project) and only four cite “Early American Fiction.” However, 37 articles cite Project Gutenberg, 36 cite Making of America, and 30 cite the Internet Archive, mostly its contemporary culture collections such as the Prelinger Archive. I attribute the low rate of citation of Google Books to several factors: it’s the newest collection, having been launched at the end of 2004. Further, many researchers probably feel that they can cite works in GB as if they were looking at the print version rather than including the unwieldy Google Books URL in their bibliography. Reputation–the sense that you get more credit if you cite the print version–probably also factors in. I’m not sure why EAF is not cited more often, but it does contain fewer works than the other collections.

Conclusions

In reviewing the quality of these electronic texts, my intention is not to whine or nitpick–I’m grateful to all of these projects for making such rich content accessible. Since I’ve been involved in a few digitization projects myself, I know how difficult it can be to ensure quality. However, I think researchers need to be aware of the current limitations of electronic texts. Because the OCR or even the keyboarding of texts is not entirely accurate, using tools to search or analyze them will not produce completely accurate results. Perhaps we can use pre-processing [corrected 5/13] tools to increase the accuracy. For instance, we could use such tools to remove running heads and other extraneous information, spellcheck files, and even join words broken across lines. Even if the texts aren’t completely accurate, I think you can come to some fascinating insights about the works by using text analysis tools on them–you just need to be conscious of potential problems with the quality of the text.
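As a sketch of what such pre-processing might look like, the following removes running-head lines and rejoins words hyphenated across line breaks; the regular expressions and the sample text are illustrative assumptions, not a tested pipeline (note that it would also silently de-hyphenate legitimate compounds broken across lines):

```python
import re

def preprocess(raw: str, running_head: str) -> str:
    """Strip running-head lines and rejoin words hyphenated across line breaks."""
    # Drop lines consisting only of the running head, optionally with a page number
    head = re.compile(rf"^\s*{re.escape(running_head)}\s*\d*\s*$",
                      re.IGNORECASE | re.MULTILINE)
    text = head.sub("", raw)
    # Join "pic-" at end of line with "ture" at the start of the next
    text = re.sub(r"(\w+)-\s*\n\s*(\w+)", r"\1\2", text)
    return text

sample = ("REVERIES OF A BACHELOR 42\n"
          "A bachelor's reveries form a pic-\n"
          "ture of quiet life.")
cleaned = preprocess(sample, "Reveries of a Bachelor")
print(cleaned)
```

Running a corpus through a filter like this before word counting would address two of the problems noted above: inflated counts for terms in running heads and undercounts for words split across lines.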

We can also make informed choices about which texts we use for research. Each online collection has its strengths and weaknesses: Google Books contains the most books but its image-only PDF files can be a pain to work with, EAF has the most accurate texts in my field (19th C American fiction) but is also the smallest collection, and MOA has some key works for the study of 19th century culture and offers access to pretty decent OCR. OCA seems to take into account the needs of researchers by building a large collection of public domain works, offering access to files in different formats, and adopting a policy of openness.

How many texts have been digitized?

In remixing my dissertation as a work of digital scholarship, I’m trying to use digital resources for my research as much as possible. But is this even possible? How many research materials in American literature and culture are available online as full-text, and how reliable are these electronic texts? I worked on my dissertation between 1996 and 2002 and used electronic collections that were available at the time–particularly Early American Fiction, Making of America (both the Michigan and Cornell sites), JSTOR, Project Muse, and HarpWeek–but I did most of my research in the stacks at Virginia’s Alderman Library, perusing critical works and flipping through 19th century periodicals on the hunt for bachelor texts. (I’m a bit embarrassed to admit that I cited fewer digital resources than I actually used, but research that I’ve done with my colleague Jane Segal indicates that few literary scholars cite digital resources, even though many use them.) If I were to begin researching my dissertation now, what new possibilities would be open to me, and what problems would I face in trying to rely on digital resources?

To find out, I searched for each of the 296 items in my original bibliography in both free and subscription-based online collections such as Google Books, Open Content Alliance/Internet Archive, JSTOR, Project Muse, Early American Fiction, Making of America, Net Library, and Questia (which requires an individual subscription). I found that 83% of my primary source materials and 37% of my secondary source materials are now available online as full-text. By “full text,” I mean that, at minimum, you can read the work from start to finish online and search within it. If the work is in the public domain or is a journal article, you often can download it, whether as HTML, PDF, plain text, or, in the case of the Open Content Alliance, DJVU images. (I earlier reported that only 22% of my secondary materials were available online, but then I realized that I needed to look for these resources at sites such as Net Library and Questia.) Furthermore, 95% of all the sources listed in my bibliography have been digitized. If a work has been digitized but is not available as full-text, it’s typically a work that Google Books offers as limited preview, snippet view, or no preview because of copyright restrictions. You can still search books that are limited preview or snippet view, but you cannot retrieve more than a few pages (limited preview) or lines (snippet view). Access to 22% of the works–mostly periodicals and secondary ebooks–requires a subscription.

I suspect that more works have been digitized in my field, nineteenth-century American literature, than in most. Works are safely in the public domain (except for critical editions produced in the 20th century), and major digitization initiatives such as Early American Fiction, Making of America, and Wright American Fiction have provided access to thousands of books and magazines. US research libraries have extensive collections focused on American literature, so those works seem to be well-represented in Google Books and the Open Content Alliance.

Here are the numbers:

| Type | Total # | # Full Text | # Ltd. Prev. | # Snip. View | # No Prev. | # Not Digit. | # Subs. Req. | % Full Text | % Digit. |
|------|---------|-------------|--------------|--------------|------------|--------------|--------------|-------------|----------|
| secondary monograph | 119 | 28 | 36 | 10 | 43 | 2 | 25 | 23.5% | 98.3% |
| secondary periodical | 29 | 27 | 0 | 0 | 0 | 2 | 27 | 93.1% | 93.1% |
| primary monograph | 66 | 50 | 5 | 1 | 8 | 2 | 1 | 75.8% | 97% |
| primary periodical | 79 | 70 | 0 | 0 | 2 | 7 | 12 | 88.6% | 91.1% |
| archival | 3 | 0 | 0 | 0 | 0 | 3 | 0 | 0.0% | 0.0% |
| Total Primary | 148 | 120 | 5 | 1 | 10 | 12 | 8.8% | 82.8% | 91.9% |
| Total Secondary | 148 | 55 | 36 | 10 | 43 | 4 | 35.1% | 37.2% | 97.3% |
| Grand Total | 296 | 175 | 41 | 11 | 53 | 16 | 22% | 59.1% | 94.6% |

(In the Total rows, the Subs. Req. column is reported as a percentage of the row total.)
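Some of the summary percentages can be checked directly from the raw counts in the table. A quick sketch for the primary-source totals (the subscription count of 13 comes from summing the primary monograph and primary periodical rows):

```python
# Raw counts for Total Primary, taken from the table above
total, not_digitized = 148, 12
subs_required = 1 + 12  # primary monograph + primary periodical Subs. Req. rows

print(f"{100 * (total - not_digitized) / total:.1f}% digitized")     # 91.9%
print(f"{100 * subs_required / total:.1f}% require a subscription")  # 8.8%
```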

As a side-note, I’m interested to see that I used secondary monographs more than any other resource–a common practice in literary study, I suspect.

Methodology

Initially I planned to systematically search multiple digital collections for the works in my bibliography, but once I realized how long that process would take I decided to scale back my efforts. My goal was to find out if a work was available online as full text, not to discover every electronic version of that work. I experimented with using tools such as Rollyo and the Google Custom Search Engine to search for works across a specific set of sites, but they wouldn’t return results from some major sources such as Google Books (perhaps because Google Books apparently does not permit indexing by other commercial search engines). I longed for a tool that would suck in my bibliography, search for authoritative versions of each text, capture the bibliographic information, and download everything into Zotero (come on, semantic web), but, alas, I had to do this work manually. If I were a better programmer, I might be able to automate part of the collection process, but most online archives prohibit automated methods for downloading files.

So that I could determine whether certain types of works were more likely to be available online, I distinguished between primary and secondary monographs (books as well as essays/articles/poems collected within books), primary and secondary periodicals, and archival materials (which I classified as primary source). Logically enough, where I looked depended on what I was looking for:

  • primary source monographs: Google Books and Early American Fiction. If I didn’t find the work at these sites, I tried the Open Content Alliance and Making of America, then searched for the title using Google.
  • primary source periodicals: Making of America, Google Books, and subscription databases provided by Alexander Street Press, then the general web.
  • secondary source monographs: Google Books, Live Search Books, Net Library, and Questia
  • secondary source periodicals: Google Scholar (which searches JSTOR, Project Muse, and other electronic journal collections)
  • archival resources: web page of repository holding the collection

I wasn’t always able to find the same edition of a work that I cited in my dissertation. However, I was delighted to discover first editions of important works in Early American Fiction, Google Books, Open Content Alliance, and other online collections.

As the statistics cited above indicate, the majority of my primary sources are available online as full-text, while most of my secondary sources are not. But I noted several instances where public domain materials that should be freely available were not.

  • Most of the primary source monographs I used have been digitized, but Google Books treats 8 of these public domain works as “no preview.” I’m not sure why Google treats works such as The Soldier’s Bride and Other Tales (1833) as “no preview,” but a note on the metadata record for this book gives a clue: “Prepared for The Electronic Archive of Early American Fiction at the University of Virginia Library.” My hunch is that Google Books does not make available some public domain works already digitized by its library partners.
  • Thanks to Making of America, many nineteenth century American periodicals have been digitized. Google Books also contains important 19th C magazines such as Southern Literary Messenger (which Poe wrote for) and Salmagundi, but it does not appear to have every issue of many of these magazines. Some important magazines, such as Godey’s Lady’s Book, are not available at all through Google Books (although you can access the magazine if your library subscribes to the Alexander Street Press Godey’s collection).
  • Nearly all of the secondary journal articles that I consulted are available online, most through JSTOR or Project Muse. However, more specialized journals such as the Walt Whitman Quarterly Review are not yet available as full-text online (although the WWQR does have a complete index).
  • Only about 25% of the secondary books that I cited are available as full-text through Questia and/or NetLibrary. However, Google Books seems to have at least digitized most of the contemporary secondary monographs in my bibliography. (I’m assuming that “no preview” means Google has digitized the work but isn’t making even a snippet publicly available.) Publishers such as Oxford UP, University of California Press, Cambridge UP and Knopf appear to have made deals with Google to allow limited preview of their books. Interestingly, Google Books has not digitized some works available through Questia, such as After the Whale and Monumental Anxieties.
  • I looked at three archival collections–single items at the University of Virginia and the Virginia Historical Society, as well as a fairly large collection focused on the author Donald Grant Mitchell at Yale–and none have been digitized yet.

So what are the implications of my findings that most of my primary sources are available online as full text, while many of my secondary sources are, at least in a limited fashion, in a digital format and 62% of them are searchable? As Patrick Leary, Jo Guldi, and others have argued, massive digitization projects promise not only to make the research process more efficient, but also to open up new approaches to research. For example, you can discover important works that would otherwise be invisible to you, trace the use of a phrase across works, and analyze significant patterns in a corpus of texts.

Yet we should also acknowledge that not everything is available online and that research sources are scattered across multiple collections, not yet searchable through a single tool. Despite the efforts of many archives to digitize their collections, studying most archival resources still requires a trip to the archives (although the Web has made it much easier to determine if an archive holds relevant materials and to prepare for a research trip). Many online collections–particularly those focused on works not in the public domain–require a subscription, so if you’re an independent scholar or if your library can’t afford a subscription, you’re out of luck. Furthermore, scholars need to be able to trust the reliability of online texts so that they feel comfortable using (and citing) them–the metadata needs to be accurate, the page images and OCR of sufficient quality, the reputation of the archive high. Given the potentially overwhelming quantity of data, we need better tools to search, manage and analyze information (fortunately, projects such as Zotero and MONK are developing such tools). We also need to be able to use these tools with text collections, whether the tools are integrated into the collection (as Token-X is with the Willa Cather Archive), invoked through an API, or run on collections that we build ourselves by downloading relevant resources. And we need to feel comfortable that we can download and analyze online resources without worrying about being sued for violating licensing terms.

In my next posts, I’ll look at the quality of online texts, discuss what researchers can do with full-text, and detail the problems I ran into trying to get downloaded texts into shape for text analysis tools.