What can you do with texts that are in a digital format?

I’ve had a longstanding, friendly debate with a colleague about whether it is sufficient to provide page images of books, or whether text should be converted to a machine- and human-readable format such as XML. She argues that converting scanned books to text is expensive and that the primary goal should be to provide access to more material. True, but converting books into a textual format makes them much more accessible, allowing users to search, manipulate, organize, and analyze them. Here’s my summary of what you can do with an electronic text. Most of these advantages are pretty obvious, but worth articulating.

Read it–on paper (once you print it out or pay for on-demand printing), your computer, or, increasingly, a portable device. From a single XML file, you can generate many forms of output, including HTML, PDF, and formats suited to mobile devices.
Copy and paste it–avoid the hassle of having to retype passages.
Search it. Several years ago, I wrote a series of learning modules on stereographs, 3D photographs popular in the late 19th and early 20th centuries. I searched for books and articles on stereographs in the library catalog and in journal collections such as JSTOR, but was kind of disappointed by the lack of relevant information. Last year I returned to the topic and used Google Books for my research. I found dozens more relevant sources, such as key theoretical and historical works on stereography (most of which had already been published when I first studied the topic) as well as some fascinating nineteenth and early twentieth century manuals. Sure, I had to wade through a lot more stuff to find what I needed, but being able to search the contents of books and essays as well as the metadata let me uncover much more useful stuff.
Build a personal collection. Forget file cabinets crammed with photocopies. Using tools such as Zotero and EndNote, you can easily download articles and the accompanying bibliographic information onto your laptop, then take your entire collection with you on a plane, to an archive, to a boring meeting, etc. You can search your collection, sort it, create bibliographies, etc.
Share it. Much to the chagrin of movie studios and record companies, digital files are easy to share, so you can give colleagues access to articles, notes, bibliographies, etc. without having to deal with physical delivery (copyright permitting, of course). With the forthcoming Zotero 2.0, sharing will get even easier.
Analyze it. Once you have a book in a text-based format, you can do all sorts of nifty things with it–generate word counts, find out what terms appear most frequently next to a particular word, extract dates, find capitalized terms, compare texts, and much more. See TAPOR’s tutorial.
Visualize it. Not only are text visualization tools, well, cool, they also can open up interpretive insights. For instance, using the US Presidential Speeches Tag Cloud, you can get a quick, dynamic view of the history of presidential priorities.
Mine it. Look for patterns in large textbases. As Loretta Auvil of NCSA & SEASR explains, text mining tools such as those being developed by MONK and SEASR enable researchers to automatically classify texts according to characteristics such as genre, identify patterns such as repetition (as in the case of Stein’s The Making of Americans), analyze literary inheritance, and study themes across thousands of texts.
Remix & play with it. By taking the elements of a text or collection of texts and remixing them, you not only produce a new creative work, but also see the text in a new way–your attention is brought to particular linguistic elements, like the fragments of a broken vase used to make a mosaic. For instance, when I used the Open Wound “language mixing tool” with Melville’s 1855 sketch “The Paradise of Bachelors and the Tartarus of Maids”, I gained new insights into the violence and anxiety expressed by words such as “agony,” “cut,” and “defective.” Running the tool on the sketch also produced some stunning phrases that could serve as mottoes for this kind of activity: “Exposed are the cutters,” “in the meditation onward,” and “protecting through the scholarship.” I also plan to play with tools that would allow me to mashup several bachelor texts (take the beginning from Irving, the middle from Melville and Hawthorne, the end from Mitchell), replace key words with pictures, etc.
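
To make the "Analyze it" item a bit more concrete, here is a minimal sketch in Python of three of the operations mentioned above: word counts, the words that appear next to a particular word, and capitalized terms. The function names and the sample sentence are my own illustrations, not part of TAPOR or any other tool named in this post, and the capitalized-term pattern is a rough heuristic rather than real named-entity detection.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

def word_counts(text):
    """Count how often each word occurs."""
    return Counter(tokenize(text))

def neighbors(text, target, window=2):
    """Count the words found within `window` tokens of each occurrence of `target`."""
    tokens = tokenize(text)
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            start = max(0, i - window)
            counts.update(tokens[start:i] + tokens[i + 1:i + 1 + window])
    return counts

def capitalized_terms(text):
    """Find capitalized words that follow a lowercase word, comma, or semicolon.

    Rough heuristic: it skips sentence-initial words and will miss some names.
    """
    return re.findall(r"(?<=[a-z,;] )[A-Z][a-z]+", text)

# Invented sample sentence, used only for illustration.
sample = ("The bachelor sat by the fire. A bachelor dreams, and the fire "
          "burns low in Temple Bar.")

print(word_counts(sample).most_common(3))
print(neighbors(sample, "bachelor", window=1))
print(capitalized_terms(sample))
```

None of this works on page images alone, which is the point of the debate above: each function needs the text as a string.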

Some really interesting research is underway on the possibilities of text mining for humanities scholarship–including the aforementioned MONK and SEASR projects, as well as CHNM’s “Scholarship in the Age of Abundance: Enhancing Historical Research With Text-Mining and Analysis Tools.”

Evaluating the quality of electronic texts

In my last post, I said that 83% of the primary source works that I used in my dissertation are now available online as full-text. But how reliable are these electronic texts? Can researchers feel comfortable citing them and using them for text analysis? In my view, the quality of an electronic text and its appropriateness for use in scholarship depend on 6 factors:

  • Quality of the scanning: Is the complete page captured? Is the image skewed or distorted? Is the image of sufficient resolution?
  • Quality of the OCR/text conversion: Is full text provided? What method was used to produce the text–double-keying or OCR? How accurate is the text? Are the texts marked up in TEI (Text Encoding Initiative)? Are words joined across line breaks? Are running heads preserved?
  • Quality of the metadata: Is the bibliographic information accurate? Is it clear what edition you are looking at? If there are multiple volumes, do you know which volume you are getting and how to locate the other volume(s)?
  • Terms of use: What are you legally able to do with the digitized work? Can you download the full-text and use tools to analyze it? Is the content freely and openly available, or do you have to pay for use?
  • Convenience: Can you easily download the text and store it in your own collection? How much work do you have to do to convert the text into a format appropriate for use with text analysis tools? How hard is it to find the electronic text in the first place? Is there a Zotero translator for the collection?
  • Reputation: Is the digital archive well-regarded in the scholarly community? If you cited the archive in your bibliography, would fellow researchers question your decision? Does the archive provide clear information about its process for selecting, digitizing, and preserving texts?

I focused my evaluation on the main collections that I plumbed for the primary source works in my dissertation bibliography: Google Books (GB), Open Content Alliance (OCA), Early American Fiction (EAF), Project Gutenberg (PG), and Making of America (MOA). I found the OCA works in the Internet Archive (they are marked as belonging to the “American Libraries” or “Canadian Libraries” collections.) I apologize in advance for the length of this post, but I want to dig into the details.

Quality of Images

Perhaps the most heated debate over the quality of digitized texts has focused on Google Books. For example, Paul Duguid and Robert Townsend have questioned the quality of Google Books, providing examples of skewed or poorly scanned pages, inaccurate metadata, and the failure to make available materials that should be in the public domain. It goes without saying that providing access to high-quality page images is important: many researchers want to study the illustrations and other visual features of a text and to verify the converted text against the original page. Furthermore, a poorly scanned page probably means that the resulting OCR will also be bad.

To evaluate the quality of Google’s scans, I first took a bird’s-eye view, using the page preview function in Adobe Acrobat to get a quick glimpse of the image quality of 56 nineteenth-century works that I downloaded from GB. Using this admittedly inexact method, I noticed no more than about 100 scanning errors across the approximately 11,000 pages I glanced at. If I found one distorted image (see, for instance, the 1834 Knickerbocker) or a finger in the scan, it was likely that the book would contain other errors as well–maybe the scanning operator was, um, distracted. However, closer scrutiny of the files revealed other errors not visible through the preview. For instance, the text for Typee ended right in the middle of a word, which I don’t think is how Melville meant it to be (although that narrative approach does leave you anticipating what comes next). In the 1906 Maynard & Merrill edition of The Sketch Book, the last few lines on a number of pages are, for lack of a more precise term, stretched and curved, as if the page were turned too quickly during the scanning or photographing. (Judging from all of the black splotches on the pages, this text appears to be scanned from microfilm, so the quality issues may have been introduced during the microfilming, not the scanning.) Although I can’t make a definitive statement about the quality of Google Books’ scans, I’d say there are some problems, but they are not as significant as I thought they would be.

Although Google Books seems to have the most content right now (about 2,680,000 books?), I prefer the quality of the scans provided by the Open Library/Internet Archive, which provides searchable text + image PDF, as well as DjVu files and a flip-book format that simulates the experience of turning book pages online. Whereas most of the pages in Google Books are scanned in black and white, the OCA scans are in full color, showing the coloration of the page and the richness of the illustrations. Although I haven’t conducted a systematic study of OCA scans, I haven’t noticed many problems at all. (Disclaimer: Rice University’s Library is a member of the Open Content Alliance, but I have not been personally involved in working with the organization.)

Likewise, the scans for Early American Fiction are full-color and were captured at a high resolution. When I worked at Virginia’s Electronic Text Center, I visited the photography studio for EAF and witnessed the care with which each page was scanned, so I would be surprised to find many problems. Although the MOA images are black and white, the scans appear to be of high quality and were captured at 600 dpi. Project Gutenberg does not provide page images, so the issue of scanning quality is moot for this collection.

Quality of Text Conversion

Although the ability to search and retrieve an entire text online makes research more convenient and comprehensive, doing deeper analysis requires either being able to download texts or to run tools on online content. With some subscription-based collections, such as NetLibrary and Questia, you are limited to viewing the work one page at a time, which really restricts what you can do with it. With other collections, you can capture the entire work by downloading it as a web page (Muse, EAF), image-only PDF (Google Books), searchable PDF (OCA and JSTOR, lately?), or plain text with uncorrected OCR (Making of America, OCA).

My favorite sources are those that provide good quality full-text of the article or book in HTML (XML would be even better), since that minimizes the work I need to do in getting the text into a format I can use with text analysis tools. Converting the image-only PDFs provided for download by Google Books into plain text or XML has been troublesome, perhaps because the resolution of the scans and overall quality of the scanning could be better, perhaps because nineteenth century works aren’t necessarily printed clearly. (Several folks have complained about the quality of Google Books PDF files: 1, 2.) To get the Google Books PDFs into a format I could work with, I had to run them through Acrobat 8.0’s optimizer to improve the image quality, then through its OCR engine to convert the files to Adobe Tagged Text XML, a very basic markup format.

In assessing the quality of the full-text files provided by Google Books, the Open Content Alliance, Early American Fiction, Making of America, and Project Gutenberg, I compared the quality of OCR for 3 works important to my dissertation: Mitchell’s Reveries of a Bachelor, Melville’s The Piazza Tales, and Irving’s The Sketch Book. [Results of the analysis are available here.] EAF, OCA, and GB held all three books, but MOA only had Reveries and PG only had The Sketch Book and The Piazza Tales. For GB, I looked at the quality of the OCR I was able to generate from the image-only PDFs, as well as of the plain text provided online by Google Books, which allows you to view only one page at a time. I selected the same three pages in each of the books and counted the number of OCR errors per page. In the case of The Piazza Tales, I was able to use the same edition from all of the collections (the 1856 Dix & Edwards edition), but even for the other works the number of words per page seemed about the same. I found that:

  • The most accurate texts were produced by EAF (1 error total across the 3 sample pages I examined in each of 3 texts), Project Gutenberg (0 errors in 2 texts), and MOA (0 errors in 1 text). The OCR that I generated from the Google Books PDFs was the least accurate, with a total of 111 errors, while there were 44 errors in the 9 pages I examined from OCA. Full disclosure: I was a project assistant for EAF and expected that the quality of text conversion would be high, since the texts were double-keyed, marked up in TEI, and then reviewed by a graduate assistant.
  • The OCR produced by Google Books is of much higher quality than what I was able to create through Adobe Acrobat’s OCR engine–there were 38 errors in the plain text online version vs 111 in my OCRed version. (I’m betting that Google is working with better images and better software.) But even Google Books’ OCR engine was tripped up by poorly scanned pages; for instance, in one page from The Sketch Book, words were cut off along the left edge, resulting in 25 OCR errors for this page.
  • The number of text conversion errors would increase for OCA if I included words that broke across lines. Although preserving line and paragraph breaks, as OCA does, makes it easier for the reader to move between the page image and the OCRed text, this practice also means that word counts for OCA texts are probably not accurate, since terms such as “pic” <line break> “ture” would not be registered as “picture”. Depending on the density of words on the page, it seems that an average of 4-6 words per page are broken across lines.
  • Running heads are preserved in the MOA, OCA and Google Books texts, so word counts for terms that appear in these heads (e.g. “Reveries of a Bachelor”) would be inflated.
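
I counted these errors by eye, but the same kind of page-level comparison can be roughed out in code if you have a trusted transcription to compare against. The sketch below is an assumption-laden illustration, not the method I used: it aligns the words of a reference text with the words of an OCR text using Python's standard difflib module and counts every word position that fails to match. The function name `count_word_errors` is hypothetical.

```python
import difflib

def count_word_errors(reference, ocr):
    """Count word positions where the OCR text differs from the reference.

    Each substituted, missing, or extra word counts as one error.
    """
    ref_words = reference.split()
    ocr_words = ocr.split()
    matcher = difflib.SequenceMatcher(a=ref_words, b=ocr_words, autojunk=False)
    errors = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # A mismatched stretch costs as many errors as its longer side.
            errors += max(i2 - i1, j2 - j1)
    return errors

# The garbled running head quoted in this post, followed by invented filler words.
reference = "THE PIAZZA TALES by Herman Melville"
ocr = "TUB PI..lZZA TAL. by Herman Melville"
print(count_word_errors(reference, ocr))
```

Dividing a total by the number of pages sampled gives a per-page rate: the 44 OCA errors over 9 pages work out to about 4.9 errors per page. Note that this sketch treats a word broken across lines as two errors, which is a design choice you may or may not want.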

In addition to sampling texts to determine the number of errors in text conversion, I also wanted to get a macro view of the accuracy of each version. I decided to focus on The Piazza Tales, since OCA, Google Books, and EAF all contain the 1856 Dix & Edwards edition. I used the HTML file downloaded from EAF (the TEI version would have been preferable, but there is no direct access to that file), the plain text file from OCA, and the Adobe Tagged Text XML I generated from the Google Books image only PDF. (I couldn’t get either HyperPo or TAPOR Tools to work with the Gutenberg file.) To generate the word counts, I ran each text through HyperPo 6, then compared the results [available as spreadsheet]:

  • Word counts differed in each file, even though the editions were identical. For example, HyperPo 6 counted 311 instances of “all” in the Google Books version, 383 in EAF, and 380 in OCA. Likewise “one” appears 327 times in Google Books, 391 times in EAF, and 390 times in OCA. These variations are probably due to differences in the quality of text conversion. I would expect Google Books to register lower counts for words, since the OCR that I was able to produce appears to be least accurate. EAF would probably contain the most accurate results, since it (unlike OCA) typically joins words broken across lines.
  • Each version contains words/characters that would lead to false results. For instance, the word count for EAF includes terms that were not part of the original text but that were used to aid online navigation, such as “link” (1843 times), “page” (1368), and “window” (948). The letter “s” appears 421 times in the GB results, 468 times in EAF, and 471 in OCA, signifying, I suspect, a word fragment.
  • “Piazza,” which is in the running head for the book, appears 243 times in OCA, but does not register in the top 50 terms for GB or EAF. In the case of EAF, the running heads are stripped out, so “Piazza” is not given too much weight. In the case of Google Books, the OCR quality for terms that are capitalized is usually terrible–instead of “THE PIAZZA TALES” you get “‘I’ BZ .1 A Z&A ‘I’ ALB•” or “TUB PI..lZZA TAL.”
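
For readers who want to run a similar comparison without HyperPo, here is a minimal sketch of the idea in Python. It counts words in several versions of a text, leaves out a small list of navigation words like the ones that inflated the EAF results, and lines up the counts for chosen terms. The function names and the tiny sample strings are hypothetical; the exclusion list is an assumption based on the terms noted above, and it would also discard genuine uses of a word like "page" in the text itself.

```python
import re
from collections import Counter

# Assumed exclusion list: navigation words observed in the EAF HTML file.
NAVIGATION_TERMS = frozenset({"link", "page", "window"})

def count_words(text, exclude=frozenset()):
    """Count lowercase alphabetic tokens, skipping any in `exclude`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in exclude)

def compare_counts(versions, terms):
    """Return {term: {version_name: count}} so versions can be read side by side."""
    counts = {name: count_words(text, NAVIGATION_TERMS)
              for name, text in versions.items()}
    return {term: {name: c[term] for name, c in counts.items()}
            for term in terms}

def spread(row):
    """Difference between the highest and lowest count for one term."""
    return max(row.values()) - min(row.values())

# Invented miniature "versions", only to show the shape of the output.
versions = {"A": "all in all one page link", "B": "all one one"}
print(compare_counts(versions, ["all", "one", "page"]))

# Counts for "all" as reported above.
print(spread({"GB": 311, "EAF": 383, "OCA": 380}))  # 72
```

A large spread for a common word is a quick signal that at least one version has conversion problems worth a closer look.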

As Wesley Raabe has shown, even the most carefully prepared edition of a work, whether print or electronic, is likely to contain errors (and some works aren’t prepared that carefully). When you introduce OCR into the process, the rate of error increases. Now for purposes of search and retrieval, OCR errors may not be all that significant–the OCR probably wasn’t inaccurate for every instance of a word, so you can still determine if a text contains that word. However, if I were to choose a version of a text to use with a text analysis program, I would go with the EAF, since the EAF text was double-keyed, marked up in TEI, and appears to have the highest accuracy rate. Unfortunately, producing double-keyed, marked-up texts is more expensive than creating uncorrected OCR, so there are far more mass-digitized, OCRed books than double-keyed, TEI encoded texts. There are 886 volumes in EAF (which is focused on nineteenth century American literature) and over 2 million in GB. If scholars are going to use text analysis, mining and visualization tools on massive collections of digitized texts, they probably need to be explicit about their tolerance for error. (It would be nice if they could get access to the higher quality OCR produced by GB, too.)

Quality of Metadata

In order to trust the reliability of a work, you need good metadata so that you know what you are looking at (and in some cases to find it in the first place). As Duguid has observed, the metadata for Google Books can be poor, particularly when you’re dealing with a multi-volume work, as you often are when you’re working with nineteenth century literature. Apart from viewing the title page, there’s no way to tell that a work in GB comes in multiple volumes, and the metadata record provides no linkage from one volume to the other in the series. Indeed, sometimes it appears that only one volume in a series has been digitized. However, GB also provides potentially useful information about books–not only the title, author, publisher, and publication date, but also the page count, digitization date, and subject keywords, as well as reviews, unique phrases, and popular passages.

Although OCA does offer a lot of information about works in its collections, the metadata could be organized more effectively. When you search for a work, the information that appears in the results page is too general: you see the title, author name, and number of downloads, but not the publisher or publication date, so you have to follow the link to know what edition you’re getting. Sometimes the OCA omits the publisher and publication date from the metadata record even though this information is available on the title page of the book. At least the OCA typically includes the volume number in the title. I also like the way that the Internet Archive provides detailed metadata about how the electronic text was produced, including the number of images and scanning operator. Ideally you’d get subject terms as well, and the author name and title could serve as hotlinks to more content by that author or with that title. (The Open Library, which also involves Internet Archive guru Brewster Kahle, often provides more complete metadata records, but it’s a meta-catalog, with records for non-digitized works held by libraries, even GB.)

Both EAF and MOA offer complete bibliographic information for works in their collections. Unfortunately, Project Gutenberg often does not provide bibliographic information beyond the title, author, and subject terms, which is a real problem for scholars who need to know what edition they are looking at.

Restrictions on Use of Digitized Materials

Brewster Kahle and John Wilkin of the University of Michigan recently debated what “open” means. According to Kahle, open content, like open source code, can be “downloaded in bulk, read, analyzed, modified, and reused.” Wilkin replies that “There is no uniformly defined constituency called ‘researchers’ who ‘require downloadability.’” While Wilkin may be right about the current needs of average users, this statement overlooks the needs of researchers who want to do more than just search and retrieve a text–for instance, those who want to build their own collections, run analytical tools that require files to reside on a local hard drive, etc. Of course, with some tools, such as TAPOR, you can provide a URL for the text to be analyzed. The question may not be so much one of downloadability as the ability to access and manipulate online texts–can you get access to the data through an API, for instance? Is text mining even permitted by the terms of use?

Here’s my summary of the terms of use put forward by the collections that I used, with the caveat that I Am Not A Lawyer and may be misinterpreting legalese.

a. Google Books:
Google Books, which does not require a subscription, makes available most of its public domain books for download as image-only PDF files, but imposes four restrictions on their usage: automatic querying is prohibited, the files should be used for non-commercial purposes, attribution to Google must be maintained, and users should “keep it legal,” particularly since copyright law varies by country. [I won't wade into the debate about whether Google Books is good or bad for what I'll call a "researcher-friendly" approach to copyright, but Lawrence Lessig and Siva Vaidhyanathan offer interesting perspectives.]

b. Internet Archive/Open Content Alliance:
The Internet Archive, which is freely available, requires users to adhere to a 7-page terms of use agreement that governs use of all collections. Essentially the Internet Archive makes the user responsible for using content appropriately, requiring that “you certify that your use of any part of the Archive’s Collections will be noncommercial and will be limited to noninfringing or fair use under copyright law.” As the name suggests, the Open Content Alliance seems to be pretty darn open in its approach.

c. Early American Fiction:
Access to EAF requires a subscription, although 158 of 886 volumes are freely available from the University of Virginia Library. (However, something seems to be broken right now at Virginia, since I wasn’t able to open any of the freely available EAF texts.) Proquest’s terms of use for Early American Fiction are much more explicit and restrictive than those of the Open Content Alliance or Google Books. Essentially, “You will use the Products solely for your own personal or internal use.” You cannot create derivative works, nor can you systematically download works to create a collection of materials. Works can be printed or saved only for private or educational use.

d. NetLibrary:
NetLibrary, which is available only through subscription and focuses on copyrighted content, prohibits the use of automated methods for collecting or analyzing its data such as robots or data mining tools without “express written permission.” It puts forward what seems to me to be a fairly expansive policy limiting use of copyrighted materials: “You may not modify, alter, publish, transmit, distribute, display, participate in the transfer or sale, create derivative works, or in any way exploit, any of the Copyrighted Material, in whole or in part.” Presumably this policy wouldn’t rule out using text analysis tools on works downloaded from NetLibrary (assuming you could actually get the full text of resources), but it does seem to ban remixes or even alteration of the work (conversion to XML?). NetLibrary is owned by OCLC Online Computer Library Center, Inc., an Ohio nonprofit corporation.

e. JSTOR:
In its usage policy, JSTOR, which requires a subscription, recognizes that fair use governs what researchers can do with research materials. Users may not download entire issues or “incorporate Content into an unrestricted database or website,” but they can “search, view, reproduce, display, download, print, perform, and distribute articles, book reviews, journal front and back matter, and other discrete materials.”

f. Project Muse:
Owned by Johns Hopkins UP, Project Muse permits downloading one copy of articles for personal use as well as distributing works to students in a class at the subscriber’s institution. Project Muse, which requires a subscription, will not allow users to employ automated processes to download works, re-use content for commercial purposes, or “modify or create a derivative work of any Journal content without the prior written permission of the copyright holder.”

g. Questia:
Available only through individual subscription, Questia, which focuses on copyrighted content, puts forward very restrictive terms of use: “You may use the Questia Web site for your personal or academic research activities. Any other use, including the reproduction, modification, distribution, transmission, republication, display, performance, rehosting, tampering, framing, or embedding of this site or its content or tools, or any commercial use whatsoever of this Web site or its content or its tools, is strictly prohibited without our prior written consent…. You agree not to reverse engineer, reverse assemble, reverse compile, decompile, disassemble, translate or otherwise alter any executable code, contents or materials downloaded from or made available through our Web site. You agree not to use programs, scripts, code or other available methods to download or view multiple pages of content on Questia in an automated fashion. You agree not to save the content available on Questia on your hard drive or other storage device for viewing offline.”

h. Making of America:
Michigan’s MOA, which is freely available, provides a standard availability notice with the metadata record for each book or journal granting the right to search, but prohibiting the redistribution of materials without permission: “These pages may be freely searched and displayed. Permission must be received for subsequent distribution in print or electronically.” However, a more general usage statement on the home page for the University of Michigan Digital Library Text Collection states that “Users are free to download, copy, and distribute works in the public domain without asking for permission.” Cornell’s policy for its content in MOA seems more restrictive: “This material is to be used for personal or research use only. Any other use, including but not limited to commercial or scholarly reproductions, redistribution, publication, or transmission, whether by electronic means or otherwise, without prior written permission of the Library is prohibited.”

Bottom line: For most of these online collections, particularly those that are non-commercial, researchers are permitted to download and analyze texts. However, if a scholar wanted to create a free web-based collection of, say, bachelor texts downloaded from various sources, that would probably require getting permission. Converting downloaded files from one format to another for purposes of personal research generally seems OK. If you wanted to extract a few illustrations from digitized public domain texts (assuming they were of sufficient quality to begin with) and reuse them in, say, a video, that would seem to be permitted by fair use if not the explicit policy of the collection, but I’m not totally sure.

Convenience

I want to spend my time analyzing works, not searching all over the web for them, typing bibliographic information into Zotero, or converting files into another format. As already noted, GB allows you to download image-only PDFs of most public domain materials, but you have to optimize and OCR these files if you want to search or analyze them. Nevertheless, it’s pretty easy to find relevant works in GB’s huge collection, and it does provide support for Zotero. At the Open Content Alliance, you can download books in several formats, including DjVu, word plus image PDF, and plain text. You can even access the files via an FTP site. However, the Internet Archive does not yet have a Zotero translator (strange considering the recently-announced partnership between the two groups), which means more work for me in capturing bibliographic information. With the Michigan Making of America collection, you can download the uncorrected full text after reading a warning that the OCR contains errors. I was also interested to note that Michigan now provides a print-on-demand service for works in MOA–the metadata record includes a link “Order a Softcover @ Amazon.com.” The 1864 Reveries of a Bachelor, for instance, costs $23.99 from Amazon, and you can search inside the book at Amazon. (Amazon lists 7,130 results when you search for “scholarly publishing office University of Michigan Library”.) Michigan’s MOA just offered Zotero support, but it doesn’t seem to be working yet. It’s pretty easy to work with files from Project Gutenberg (plain text and sometimes HTML) and EAF (html), although neither offers Zotero support. 
Even if I have to search in multiple collections for texts that I need, manually add bibliographic information, and convert files to different formats, I should acknowledge that it’s so much faster and more efficient to do research online than to look in catalogs for books, track ‘em down in the stacks, skim them for key words, photocopy them, and copy down the bibliographic information.

Reputation

Evaluating the reputation of an online collection is tricky–it’s difficult to know how a scholarly community regards particular sources without doing citation analysis or conducting interviews with or surveys of scholars. To get a quick sense of how often various digital collections are cited, I searched for the title of the collection or the base URL (e.g. books.google.com) in Project Muse and JSTOR. Only five works cite Google Books (three of which focus on the book scanning project) and only four cite “Early American Fiction.” However, 37 articles cite Project Gutenberg, 36 cite Making of America, and 30 cite the Internet Archive, mostly its contemporary culture collections such as the Prelinger Archive. I attribute the low rate of citation of Google Books to several factors. It’s the newest collection, having been launched at the end of 2004. Further, many researchers probably feel that they can cite works in GB as if they were looking at the print version rather than including the unwieldy Google Books URL in their bibliography. Reputation–the sense that you get more credit if you cite the print version–probably also factors in. I’m not sure why EAF is not cited more often, but it does contain fewer works than the other collections.

Conclusions

In reviewing the quality of these electronic texts, my intention is not to whine or nitpick–I’m grateful to all of these projects for making such rich content accessible. Since I’ve been involved in a few digitization projects myself, I know how difficult it can be to ensure quality. However, I think researchers need to be aware of the current limitations of electronic texts. Because the OCR or even the keyboarding of texts is not entirely accurate, using tools to search or analyze them will not produce completely accurate results. Perhaps we can use pre-processing [corrected 5/13] tools to increase the accuracy. For instance, we could use such tools to remove running heads and other extraneous information, spellcheck files, and even join words broken across lines. Even if the texts aren’t completely accurate, I think you can come to some fascinating insights about the works by using text analysis tools on them–you just need to be conscious of potential problems with the quality of the text.
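
As a rough idea of what such pre-processing might look like, here is a minimal sketch in Python of two of the steps: dropping running heads and joining words broken across lines. Both function names are hypothetical, and the sketch makes two assumptions worth stating loudly: you already know the running head text, and broken words end with a hyphen at the line break. OCR that drops the hyphen (as in the “pic” / “ture” case) would need a dictionary check instead, and genuinely hyphenated compounds that happen to break at a line end would be wrongly merged.

```python
import re

def strip_running_heads(lines, heads):
    """Drop lines that contain only a known running head, with or without a page number."""
    normalized = {h.lower() for h in heads}
    kept = []
    for line in lines:
        # Remove a leading or trailing page number, then a trailing period.
        core = re.sub(r"^\d+\s+|\s+\d+$", "", line.strip()).rstrip(".").lower()
        if core in normalized:
            continue
        kept.append(line)
    return kept

def join_broken_words(lines):
    """Rejoin words split by a hyphen at the end of a line (assumes the hyphen survived OCR)."""
    text = "\n".join(lines)
    return re.sub(r"([A-Za-z]+)-\n([a-z]+)", r"\1\2", text)

# Invented page fragment for illustration.
page = [
    "THE PIAZZA TALES. 12",
    "It was a pic-",
    "ture of the old farm-house.",
    "34 THE PIAZZA TALES.",
]
cleaned = join_broken_words(strip_running_heads(page, ["THE PIAZZA TALES"]))
print(cleaned)  # It was a picture of the old farm-house.
```

Running heads mangled by OCR will not match an exact string, so a real tool would probably need fuzzy matching or position-based rules (for example, always dropping the first line of each page).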

We can also make informed choices about which texts we use for research. Each online collection has its strengths and weaknesses: Google Books contains the most books but its image-only PDF files can be a pain to work with, EAF has the most accurate texts in my field (19th C American fiction) but is also the smallest collection, and MOA has some key works for the study of 19th century culture and offers access to pretty decent OCR. OCA seems to take into account the needs of researchers by building a large collection of public domain works, offering access to files in different formats, and adopting a policy of openness.

How many texts have been digitized?

In remixing my dissertation as a work of digital scholarship, I’m trying to use digital resources for my research as much as possible. But is this even possible? How many research materials in American literature and culture are available online as full-text, and how reliable are these electronic texts? I worked on my dissertation between 1996 and 2002 and used electronic collections that were available at the time–particularly Early American Fiction, Making of America (both the Michigan and Cornell sites), JSTOR, Project Muse, and HarpWeek–but I did most of my research in the stacks at Virginia’s Alderman Library, perusing critical works and flipping through 19th century periodicals on the hunt for bachelor texts. (I’m a bit embarrassed to admit that I cited fewer digital resources than I actually used, but research that I’ve done with my colleague Jane Segal indicates that few literary scholars cite digital resources, even though many use them.) If I were to begin researching my dissertation now, what new possibilities would be open to me, and what problems would I face in trying to rely on digital resources?

To find out, I searched for each of the 296 items in my original bibliography in both free and subscription-based online collections such as Google Books, Open Content Alliance/Internet Archive, JSTOR, Project Muse, Early American Fiction, Making of America, Net Library, and Questia (which requires an individual subscription). I found that 83% of my primary source materials and 37% of my secondary source materials are now available online as full-text. By “full text,” I mean that, at minimum, you can read the work from start to finish online and search within it. If the work is in the public domain or is a journal article, you often can download it, whether as HTML, PDF, plain text, or, in the case of the Open Content Alliance, DJVU images. (I earlier reported that only 22% of my secondary materials were available online, but then I realized that I needed to look for these resources at sites such as Net Library and Questia.) Furthermore, 95% of all the sources listed in my bibliography have been digitized. If a work has been digitized but is not available as full-text, it’s typically a work that Google Books offers as limited preview, snippet view, or no preview because of copyright restrictions. You can still search books that are limited preview or snippet view, but you cannot retrieve more than a few pages (limited preview) or lines (snippet view). Access to 22% of the works–mostly periodicals and secondary ebooks–requires a subscription.

I suspect that more works have been digitized in my field, nineteenth-century American literature, than in most. Works are safely in the public domain (except for critical editions produced in the 20th century), and major digitization initiatives such as Early American Fiction, Making of America, and Wright American Fiction have provided access to thousands of books and magazines. US research libraries have extensive collections focused on American literature, so those works seem to be well-represented in Google Books and the Open Content Alliance.

Here are the numbers:

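Type                 | Total # | # Full Text | # Ltd. Prev. | # Snip View | # No Prev. | # Not Digit. | # Subs Req | % Full Text | % Digit
---------------------|---------|-------------|--------------|-------------|------------|--------------|------------|-------------|--------
secondary monograph  | 119     | 28          | 36           | 10          | 43         | 2            | 25         | 23.5%       | 98.3%
secondary periodical | 29      | 27          | 0            | 0           | 0          | 2            | 27         | 93.1%       | 93.1%
primary monograph    | 66      | 50          | 5            | 1           | 8          | 2            | 1          | 75.8%       | 97%
primary periodical   | 79      | 70          | 0            | 0           | 2          | 7            | 12         | 88.6%       | 91.1%
archival             | 3       | 0           | 0            | 0           | 0          | 3            |            | 0.0%        | 0.0%
Total Primary        | 148     | 120         | 5            | 1           | 10         | 12           | 8.8%       | 82.8%       | 91.9%
Total Secondary      | 148     | 55          | 36           | 10          | 43         | 4            | 35.1%      | 37.2%       | 97.3%
Grand Total          | 296     | 175         | 41           | 11          | 53         | 16           | 22%        | 59.1%       | 94.6%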
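
For readers who want to check the arithmetic, the short Python sketch below re-derives the two percentage columns from the counts in the table. It assumes that “% Full Text” is full-text items divided by the total and that “% Digit” is everything except the “Not Digit.” items divided by the total. The variable and function names are illustrative only.

```python
# Counts copied from the table above: (total, full_text, not_digitized).
counts = {
    "secondary monograph": (119, 28, 2),
    "secondary periodical": (29, 27, 2),
    "primary monograph": (66, 50, 2),
    "primary periodical": (79, 70, 7),
    "archival": (3, 0, 3),
}


def percentages(total, full_text, not_digitized):
    """Return (% full text, % digitized), each rounded to one decimal place."""
    pct_full = round(100 * full_text / total, 1)
    pct_digitized = round(100 * (total - not_digitized) / total, 1)
    return pct_full, pct_digitized


for kind, row in counts.items():
    print(kind, percentages(*row))

# The grand total is the column-wise sum of the five source types.
grand_total = tuple(sum(column) for column in zip(*counts.values()))
print("grand total", grand_total, percentages(*grand_total))
```

Summing the rows gives 296 items, 175 of them full text and 16 not digitized, which matches the Grand Total line.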

As a side-note, I’m interested to see that I used secondary monographs more than any other resource–a common practice in literary study, I suspect.

Methodology

Initially I planned to systematically search multiple digital collections for the works in my bibliography, but once I realized how long that process would take I decided to scale back my efforts. My goal was to find out if a work was available online as full text, not to discover every electronic version of that work. I experimented with using tools such as Rollyo and the Google Custom Search Engine to search for works across a specific set of sites, but they wouldn’t return results from some major sources such as Google Books (perhaps because Google Books apparently does not permit indexing by other commercial search engines). I longed for a tool that would suck in my bibliography, search for authoritative versions of each text, capture the bibliographic information, and download everything into Zotero (come on, semantic web), but, alas, I had to do this work manually. If I were a better programmer, I might be able to automate part of the collection process, but most online archives prohibit automated methods for downloading files.

So that I could determine whether certain types of works were more likely to be available online, I distinguished between primary and secondary monographs (books as well as essays/articles/poems collected within books), primary and secondary periodicals, and archival materials (which I classified as primary source). Logically enough, where I looked depended on what I was looking for:

  • primary source monographs: Google Books and Early American Fiction. If I didn’t find the work at these sites, I tried the Open Content Alliance and Making of America, then searched for the title using Google.
  • primary source periodicals: Making of America, Google Books, and subscription databases provided by Alexander Street Press, then the general web.
  • secondary source monographs: Google Books, Live Search Books, Net Library, and Questia
  • secondary source periodicals: Google Scholar (which searches JSTOR, Project Muse, and other electronic journal collections)
  • archival resources: web page of repository holding the collection

I wasn’t always able to find the same edition of a work that I cited in my dissertation. However, I was delighted to discover first editions of important works in Early American Fiction, Google Books, Open Content Alliance, and other online collections.

As the statistics cited above indicate, the majority of my primary sources are available online as full-text, while most of my secondary sources are not. But I noted several instances where public domain materials that should be freely available were not.

  • Most of the primary source monographs I used have been digitized, but Google Books treats 8 of these public domain works as “no preview.” I’m not sure why Google treats works such as The Soldier’s Bride and Other Tales (1833) as “no preview,” but a note on the metadata record for this book gives a clue: “Prepared for The Electronic Archive of Early American Fiction at the University of Virginia Library.” My hunch is that Google Books does not make available some public domain works already digitized by its library partners.
  • Thanks to Making of America, many nineteenth century American periodicals have been digitized. Google Books also contains important 19th C magazines such as Southern Literary Messenger (which Poe wrote for) and Salmagundi, but it does not appear to have every issue of many of these magazines. Some important magazines, such as Godey’s Lady’s Book, are not available at all through Google Books (although you can access the magazine if your library subscribes to the Alexander Street Press Godey’s collection).
  • Nearly all of the secondary journal articles that I consulted are available online, most through JSTOR or Project Muse. However, more specialized journals such as the Walt Whitman Quarterly Review are not yet available as full-text online (although the WWQR does have a complete index).
  • Only about 25% of the secondary books that I cited are available as full-text through Questia and/or NetLibrary. However, Google Books seems to have at least digitized most of the contemporary secondary monographs in my bibliography. (I’m assuming that “no preview” means Google has digitized the work but isn’t making even a snippet publicly available.) Publishers such as Oxford UP, University of California Press, Cambridge UP and Knopf appear to have made deals with Google to allow limited preview of their books. Interestingly, Google Books has not digitized some works available through Questia, such as After the Whale and Monumental Anxieties.
  • I looked at three archival collections–single items at the University of Virginia and the Virginia Historical Society, as well as a fairly large collection focused on the author Donald Grant Mitchell at Yale–and none have been digitized yet.

So what are the implications of my findings that most of my primary sources are available online as full text, while many of my secondary sources are, at least in a limited fashion, in a digital format and 62% of them are searchable? As Patrick Leary, Jo Guldi, and others have argued, massive digitization projects promise not only to make the research process more efficient, but also to open up new approaches to research. For example, you can discover important works that would otherwise be invisible to you, trace the use of a phrase across works, and analyze significant patterns in a corpus of texts.

Yet we should also acknowledge that not everything is available online and that research sources are scattered across multiple collections, not yet searchable through a single tool. Despite the efforts of many archives to digitize their collections, studying most archival resources still requires a trip to the archives (although the Web has made it much easier to determine if an archive holds relevant materials and to prepare for a research trip). Many online collections–particularly those focused on works not in the public domain–require a subscription, so if you’re an independent scholar or if your library can’t afford a subscription, you’re out of luck. Furthermore, scholars need to be able to trust the reliability of online texts so that they feel comfortable using (and citing) them–the metadata needs to be accurate, the page images and OCR of sufficient quality, the reputation of the archive high. Given the potentially overwhelming quantity of data, we need better tools to search, manage and analyze information (fortunately, projects such as Zotero and MONK are developing such tools). We also need to be able to use these tools with text collections, whether the tools are integrated into the collection (as Token-X is with the Willa Cather Archive), invoked through an API, or run on collections that we build ourselves by downloading relevant resources. And we need to feel comfortable that we can download and analyze online resources without worrying about being sued for violating licensing terms.

In my next posts, I’ll look at the quality of online texts, discuss what researchers can do with full-text, and detail the problems I ran into trying to get downloaded texts into shape for text analysis tools.

Strategies for Promoting Social Scholarship

As I noted in my last post, the development of collaborative, online, open access scholarship (which I’ll call “social scholarship”) faces some significant obstacles, including cultural barriers, concerns about intellectual property, and the need for sound economic models for open access publications. But I think social scholarship can and will grow. Here are some strategies to promote it:

1) Develop tools that enable researchers to do what they already do, but better.

Why have some disciplines, such as physics, embraced online delivery of research? As Stephen Pinfield notes in “How Do Physicists Use an E-Print Archive?,” the physics e-print archive arxiv succeeded in part because it “automated” physicists’ existing practices of exchanging pre-prints. Rather than having to go through the hassles of mailing or emailing preprints to multiple colleagues, physicists could easily post them online and, as a side benefit, make them more visible. Once researchers are convinced that a tool can help them do what they already do, only better, then they can also begin to see how it may help them to do new stuff, too. For instance, when I talk to researchers about Zotero, they first recognize its value in downloading bibliographic citations and creating bibliographies, but then begin to get excited about the possibilities of tagging and searching their collections.

2) Make social scholarship cool.
A primary lesson I learned in high school: if the cool people are doing it, pretty much everyone else will want to as well. I typically try something new (whether food, books, music, or technology) because someone I respect has recommended it. In a more scholarly context, I often evaluate the quality of a journal by checking out its editorial board. As researchers see how their colleagues are having a significant impact on research by making their work available as open access, they may be more willing to release their own research as open access. Likewise, as leading scholars come to be associated with open access journals (witness, for example, the Open Humanities Press, which has a top-notch editorial board), these publications will likely gain more legitimacy.

3) Assuage concerns about intellectual property.
Certainly not every researcher will want to blog or post pre-prints about ongoing work—someone pursuing a patent wouldn’t want to give away the goods prematurely, and if a researcher hopes to publish in a journal that doesn’t allow self-archiving, then he or she may not want to test that policy (although plenty of folks do). But researchers’ fears of being scooped or plagiarized if they post material online seem exaggerated. Indeed, posting a pre-print or a blog entry about a research breakthrough may enable a researcher to register that idea without having to wait through the long publication cycle. Sure, the Web enables plagiarizers to easily find information and copy and paste it into a document, but it also makes it easy to search for a unique phrase and catch the plagiarizers. (Witness today’s Chronicle of Higher Education article on journals experimenting with plagiarism detection tools similar to TurnItIn.) By using a Creative Commons license, researchers can make clear the terms under which their work can be used.

4) Experiment with new models for open access publication.
Even as the web makes the distribution of content easier, most academics aren’t ready to dispense with the peer review, copy editing, and in some cases the marketing functions provided by publishers, all of which cost money. So how will we pay for open access publishing? Various economic models are emerging: author fees, university or library support for publishing, etc. SCOAP3 pursues an intriguing collaborative model that has emerged from the high energy physics community, whereby a consortium supported by libraries, research societies and other groups would contract with publishers to provide their services and publish high energy physics journals as open access. To cover the United States’ approximately $4.5 million share of the total costs of publishing these journals, libraries, research societies, government agencies, etc. would re-direct funds to the SCOAP3 consortium. Rather than shifting the costs of open access publication to authors (through publication charges) or individual institutions (by moving the publication function to libraries, for instance), SCOAP3 hopes to control costs by pooling funds and to give authors and libraries (the producers, purchasers and consumers of journal content) a stronger voice in the publication process. The SCOAP3 consortium would contract with publishers to provide peer review and editorial quality control, but the publications would be open access. The publishing industry wouldn’t be closed out of this process; indeed, several publishers and scholarly societies are participating in the conversations about SCOAP3. Final publications would be deposited in open access repositories, enabling data mining and scholarly re-use.

5) Make the case that social scholarship is good and good for you.
Making research openly accessible can appeal to researchers’ altruistic impulses to share their work with independent scholars and researchers whose libraries cannot afford expensive journal subscriptions, as well as to make work paid for by the public available as a public good. Yet open access also makes sense purely for self-interest. As universities increasingly measure the “impact factor” of publications, articles that other researchers can easily find, comment upon, and link to will likely carry more weight. As Michael Jensen points out, the more accessible a work is, the more visible it is and the more likely it is to be cited. (Of course, if tenure committees don’t view electronic publications as being as scholarly as more traditional publications, then self-interest may be undermined–but scholarly organizations such as the MLA and universities such as the members of the University of California system are beginning to recognize the importance of giving proper credit to electronic publications.)

Obstacles to social scholarship

As I noted in an earlier post, humanities scholars are beginning to experiment with social scholarship, embracing open access, creating and using social networking sites and collaborative tools, and undertaking joint research projects. But I must acknowledge that social scholarship (which I’m using as a catch-all term to include open access, web 2.0, and a culture of collaboration) is in its early stages and faces significant obstacles—economic, cultural, and technological. These challenges include:

  1. Lack of awareness of social scholarship: According to a recent article in the Chronicle of Higher Education (“Researchers Develop Online Tools for Science Collaborations”), few scientists are aware of collaborative resources such as blogs and social networking sites. I’ve noticed this lack of awareness among faculty members from pretty much every discipline at my university. As the article points out, many people don’t use new technologies or communication methods unless they have specific needs to meet. Why invest the effort in changing how you do work unless there are concrete payoffs?
  2. Intellectual property concerns: Some researchers worry that if they make their work available online before publishing it with a traditional publisher they will lose control of it. For instance, a competitor may read their blog entry about ongoing research and scoop them—or even plagiarize their work. They also fear that publishers will refuse to publish a work that has already been made available online. From another perspective, copyright law also limits what material you can incorporate into your own work and share—for instance, museums and other cultural institutions seem to be levying higher fees for publication of digital images to which they hold the copyright.
  3. Skepticism about the quality of electronic-only publications: According to research by UC Berkeley’s Center for Studies in Higher Education, faculty in five disciplines (English, biostatistics, law and economics, anthropology, and chemical engineering) associate electronic-only publication with the lack of peer review and thus the lack of quality. If researchers don’t believe that tenure committees will give them credit for publishing in open access journals, then they will stick with more traditional means of publication.
  4. Lack of recognition for social scholarship: In many disciplines, there is currently little incentive for researchers to embrace social scholarship; the incentives are with the traditional system. When I talk to faculty about social scholarship, many appreciate the vision of sharing but worry about the implementation, particularly whether tenure committees will give them credit for collaborative scholarship. What kind of rewards and recognition do you get for commenting on a colleague’s blog, publishing your articles through an institutional repository, sharing your bibliographies, or keeping an open notebook documenting your research? UC Berkeley’s new report “Publishing Needs and Opportunities at the University of California” finds that “a significant minority” of faculty are experimenting with alternative publishing models, but that they “are increasingly frustrated by a tenure and review system that fails to recognize these new publishing models and hence constrains experimentation both in the technologies of dissemination and in the audiences addressed.”
  5. Lack of time to make work available online: Contributing content to user-generated sites, reading and commenting on blogs, sharing bookmarks and doing all of the other work of social scholarship take a lot of time—time that many busy academics don’t have. In a blog post on why Web 2.0 hasn’t been adopted in the biosciences, David Crotty, executive editor of the online publication Cold Spring Harbor Protocols, details how traditional methods of doing research can often be more efficient than Web 2.0 approaches, at least initially, since you can just email a file rather than finding a collaborative site, setting up an account, uploading the file, inviting participants to view it, waiting for them to establish accounts, etc.
  6. Cultural obstacles: Engaging in online discussions and making public thoughts that are in process are not yet part of mainstream academic culture. As David Crotty notes, many academics are unlikely to make critical comments in a public forum, since they don’t want to piss off potential reviewers, employers, or collaborators.
  7. Need for sound economic models for open access publication: Producing academic journals isn’t free, as I learned when I served as the managing editor of Postmodern Culture—even if editors donate their time, funds are needed for copyediting, coordinating editorial review, covering travel costs for editorial meetings, paying for web hosts, etc. How will open access journals be paid for—through author fees? University, society or foundation support? What will guarantee the sustainability of these journals and provide long-term access to their content? If scholars worry about the viability and reputation of open access journals, what will entice them to publish in these journals rather than traditional publications? In Open Access Publishing and the Emerging Infrastructure for 21st-Century Scholarship, Don Waters, Program Officer for Scholarly Communications at The Andrew W. Mellon Foundation, expresses skepticism about the open access model: “One worry about mandates for open access publishing is that they will deprive smaller publishers of much needed subscription income, pushing them into further decline, and making it difficult for them to invest in ways to help scholars select, edit, market, evaluate, and sustain the new products of scholarship represented in digital resources and databases. The bigger worry, which is hardly recognized and much less discussed in open access circles, is that sophisticated publishers are increasingly seeing that the availability of material in open access form gives them important new business opportunities that may ultimately provide a competitive advantage by which they can restrict access, limit competition, and raise prices.”

I believe that these challenges can be overcome and will sketch some strategies for promoting social scholarship in my final posting on this thread.

Becoming a “Digital Scholar”: Digital Discovery 2008

[Below is the text of a presentation that I will be giving at the Digital Discovery conference on March 27, 2008]

When I started a graduate program in English way back in 1992, I used computers mainly to write papers. Then came the web. Within a few years, I was creating web-based assignments for the undergraduate courses I was teaching, marking up electronic texts for the University of Virginia’s Electronic Text Center, creating my own electronic edition of a nineteenth-century sentimental sketch, and copy-editing articles for one of the first humanities journals to be published exclusively online, Postmodern Culture. Even though I was actively engaged in what has now come to be called “digital humanities,” I still did much of my dissertation research (which examines bachelorhood in nineteenth-century American literature and culture) the old fashioned way. I wandered the stacks of Alderman Library smelling the decaying books, skimmed print catalogs such as Lyle Wright’s bibliography of American fiction, flipped through 150 year-old volumes of magazines such as Harpers and The Atlantic, and even took notes with a fountain pen.

I finished my dissertation in 2002. After spending so much time laboring over it, I wanted to move on. But 5 years later, I decided I was ready to take up with the diss again, on new terms. I’m fascinated by the question of how the abundance of digital information and the development of new technologies will affect humanities scholarship, and it seemed to me that the best way for me to understand these transformations would be to undertake a major research project myself. By revisiting my dissertation, I could build on my existing knowledge and compare my research process 5+ years ago to what’s possible today. Thus I decided to remix my dissertation as a work of digital scholarship.

So what’s digital scholarship? According to the ACLS report on cyberinfrastructure for the humanities and social sciences, digital scholarship includes building digital collections, creating tools for collecting, analyzing, and authoring digital information, and “using digital collections and analytical tools to generate new intellectual products.” My project reflects that last idea, as I am exploring the implications of using digital collections and tools. My work is still very much in process, but here are three preliminary observations:

  1. A vast amount of information is now available online. When I first started working on my dissertation, I wished that there were some way for me to search not only bibliographic information, but also the content of works themselves. Well, that wish is beginning to come true. We don’t know exactly how many books Google has digitized, but the number is well over a million, and of course many more works have also been digitized by the Open Content Alliance, Microsoft’s Live Book Search, and countless libraries and archives. I was curious about how many of the nearly 300 works I cite in my dissertation bibliography are now available online, so I searched for them in Google Books, Making of America, online journal collections, and other sites. I found that 77% of my primary source resources and 22% of my secondary sources are available online as full-text, while 92% of all my research materials have been digitized (this number includes works available through Google Books as limited preview, snippet view, and no preview). 13% of the resources (mostly journal articles) require a subscription.

    Now I should note that I study 19th century American literature, which is safely out of copyright and ubiquitous at most research libraries. Still, many significant materials haven’t been digitized, particularly periodical literature and archival materials. Other works are only available through subscription. Even if the resource has been digitized, it often has errors: metadata can be unclear or incomplete, scans can show the fingers of the scanning operators or be cut off, and the quality of the OCR can be poor. Nevertheless, the availability of so much digital information means that how we do research will change.

  2. As we deal with the abundance of information, we need tools to find, organize, manage, analyze and share our research materials. Fortunately, those tools are beginning to be developed.

    For example, when I started researching my dissertation, I wasted a lot of time attempting to organize my notes and looking up bibliographic information that I hadn’t captured accurately. Now I manage my research much more effectively by using Zotero, a free Firefox-based research tool developed by George Mason’s Center for History and New Media. Zotero automatically captures bibliographic information from hundreds of supported web sites and lets you insert properly-formatted notes and bibliographies as you write a paper in Word. Moreover, you can take notes in Zotero, tag your resources, organize them into collections and sub-collections, and search across them. The next version of Zotero will support sharing bibliographic resources on the network.

    To detect patterns in the texts that I collect, I am using text analysis tools such as those developed by TAPOR. With TAPOR, you can create a concordance, compare texts, and look for co-occurring words. I’ve already used TAPOR to generate a list of the most frequently occurring terms in the first chapter of my dissertation, then compared that list to one I created manually. While my own list mainly focused on different descriptors of the bachelor, the TAPOR list reflected key components of my argument, including words associated with domesticity, nationhood, and identity. Text analysis tools can make implicit knowledge explicit and open your eyes to patterns you hadn’t previously been aware of.

    Beyond text, visualization and mashup tools allow you to make sense of data such as demographic information, troop movements, and even patterns in the correspondences of historical figures such as Thomas Jefferson. Ed Ayers envisions historians using dynamic “social weather maps” that allow them to watch historical forces in process. For my own project, I plan to explore the geographic and temporal nature of bachelor literature by developing several interactive maps that show where bachelor narratives were set and where their authors were born, as well as timelines that plot the publication history of bachelor literature.

    Having texts in open formats (such as XML) makes it much easier to analyze and manipulate them; otherwise you have to go through a cumbersome conversion process. I’m OCRing the PDFs I downloaded from Google Books so I can then use text analysis tools on them, but I understand that the resulting text will be somewhat unreliable.

  3. Although the journal and monograph still dominate the humanities, new means of scholarly communication are emerging, enabling the faster dissemination of ideas, more community dialog, and the use of multimedia.
    • I recently started blogging about my research project. Through blogging, I’ve become much more engaged in my research community and have been energized by the generous responses from friends and leaders in digital humanities. Readers have helped me to think through ideas by offering alternative perspectives and alerting me to resources I hadn’t been aware of. Blogging motivates me to follow developments in my field more closely and to synthesize what I’m observing. My blog is also a great memory aid: in preparing this talk, I’ve remembered, “Oh yeah, I blogged about that” and have been able to pull up the relevant entry quickly. I’m reaching many more people than I would through traditional means of publication. For instance, during January alone, my blog was viewed over 2,725 times, far more times, I suspect, than any of my articles have been read.
    • Our culture is increasingly a visual one, dominated by TV, movies, and YouTube videos, but we don’t yet have many examples of video-based scholarship. However, some interesting models are emerging. SciVee, which is sponsored by the Public Library of Science, NSF, and the San Diego Supercomputer Center, makes it easy for scientists to upload videos that accompany published articles, making their work more accessible and visible. Anthropologist Michael Wesch’s video “The Machine is Us/ing Us” demonstrates the potential of video as a means for disseminating ideas—it has been viewed almost 5 million times and explores Web 2.0 in, well, a Web 2.0 way, illustrating the dynamic, interactive nature of the Web. Inspired by these examples, as well as by digital storytelling, I’m planning to create short videos that allow me to express ideas difficult to explore in print. For example, I plan to survey America’s changing perception of the bachelor by showing images of bachelors from the 19th and 20th centuries (accompanied by bachelor songs). I am also working on a short video about the history of the bestseller Reveries of a Bachelor, which went through many editions and changed its physical format as the publisher dreamed up ways to keep it in demand.
    • Then there are the scholars who are simply—and significantly—making their works available online through open access repositories. In so doing, these researchers are making it possible for independent scholars and those at institutions without big library budgets to access their work, advancing the democracy of knowledge. Moreover, they are likely increasing their own visibility as scholars. As Michael Jensen argues, scholars will increasingly be evaluated based on the impact of their work, which will be measured by factors such as number and quality of citations, blog comments, and links to the document.

    Lest I seem naïve, I should acknowledge that significant challenges face digital scholarship. Many studies, including the ACLS report on Cyberinfrastructure and the UC Berkeley Center for Studies in Higher Education reports on faculty attitudes toward digital resources, detail these challenges, but let me just mention a few. I’ve found that many humanities scholars I’ve talked with are not yet aware of digital scholarship. Already feeling stretched by obligations to do research, teach, and perform service, few academics have time to learn new technologies. As the digital environment constantly shifts and tools come and go, it’s overwhelming trying to keep up. In any case, the system doesn’t really reward faculty for experimenting with new technologies. According to a recent MLA report, over half of the tenure committees in the humanities have no experience evaluating “scholarly monographs in electronic format.” Many researchers feel that they will be penalized if they don’t publish in the most prestigious, well-established journals. Then there’s copyright: For my remixed work on bachelorhood, I’d love to provide links to the full text of every work that I cite. I’d also like to remix those original sources to produce new works. But what I can do is constrained by copyright.

    I believe that many of these challenges will be overcome. For example, scholarly societies like the MLA are recognizing the validity of digital scholarship. Organizations like NINES, which focuses on nineteenth-century studies, are providing tools, training, content portals, and support for scholars, as well as conferring legitimacy on digital scholarship. The NEH just turned its Digital Humanities Initiative into a full-fledged Office. The Creative Commons is pressing for greater clarity on copyright.

    What effect will the computer revolution have on humanities scholarship? It’s really too early to say; in a small way, that is what I’m trying to figure out in my project. In the sciences, we’ve seen the rise of new sub-disciplines and methodologies made possible through computation and data archives. In the humanities, I believe that being able to access the full text of millions of books will bring about significant changes in how research is conducted. In a great blog post from earlier this month, Tom Scheinfeldt from the Center for History and New Media suggests that historical scholarship will shift from a focus on ideology to a focus on methodology. As my colleague Jane Segal and I found in our study of the impact of digital archives on humanities scholarship, the Walt Whitman Archive is opening up new areas of inquiry in Whitman studies. For instance, Whitman scholars are shifting to manuscript study and paying increased attention to versions of Leaves of Grass besides the first and deathbed editions. As I’ve discovered through my own attempt to go digital, many challenges lie ahead, but I’m motivated by the opportunity to be creative, learn new skills, and have an impact on scholarship.

Signs that social scholarship is catching on in the humanities

To what extent are humanities researchers practicing “social scholarship”—embracing openness, accessibility and collaboration in producing their work? In defining the characteristics of the humanities cyberinfrastructure, the report of the ACLS Commission on Cyberinfrastructure recommends that it should be “accessible” and “facilitate collaboration.” At the same time, the report contends that solitary scholarship is the norm in the humanities: “Despite the demonstrated value of collaboration in the sciences, there are relatively few formal digital communities and relatively few institutional platforms for online collaboration in the humanities. In these disciplines, single-author work continues to dominate.” Recently, however, I’ve observed several trends that suggest increasing experimentation with collaborative tools and approaches in the humanities:

1) Individual commitment by scholars to open access
Recently several prominent humanities scholars have voiced strong support for open access publishing. For instance, Nick Montfort has stated that he will no longer review articles for non-open access journals. Likewise, danah boyd has declared that she will no longer publish in journals where content is not freely available and that “scholars have a responsibility to make their work available as a public good.” As part of a forum on open access in Anthropology News, Chris Kelty articulated his reluctance to peer-review articles “for a multinational corporation with shareholders and an enormous profit margin” when he isn’t compensated for his labor. Such declarations are increasing awareness of open access and stirring up an important debate about whether it is feasible and desirable. By making publications freely available online, scholars reach a larger audience, serve the fundamental scholarly mission to advance public knowledge, and make their own work more visible. Of course, there are significant economic and cultural obstacles to open access, obstacles that I will look at in my next post.

2) Development of open access publishing outlets
The commitment to publish only in open access journals won’t go very far if there aren’t appropriate forums for this scholarship (unless authors choose to self-publish their work). Already the Directory of Open Access Journals lists 554 humanities journals, including Digital Humanities Quarterly, Transformations, African Studies Quarterly, Southern Spaces, and Bryn Mawr Classical Review. Yet some open access journals struggle with the lack of resources and, perhaps more significantly, the lack of contributors. According to Sigi Jottkandt and Gary Hall, leaders of the new Open Humanities Press, the most significant obstacle “is still the general perception by our colleagues that open access publication is not as academically rigorous as traditional print-based journals and books” (http://www.driver-repository.be/media/docs/OHPBrussels13-2-07.pdf). To tackle the perception that open access journals are somehow less scholarly, the Open Humanities Press emphasizes the prestige of its editorial board, which includes Stephen Greenblatt, N. Katherine Hayles, Jerome McGann, Peter Suber, and Gayatri Chakravorty Spivak. The Open Humanities Press aims to develop open access humanities journals in critical theory, construct a research gateway, and publish foundational books on critical theory that are in the public domain, taking as its main values access, scholarship, diversity and transparency. Academic and commercial publishers are likewise experimenting with open access publishing models. For instance, the University of Michigan Press and the University of Michigan Library are collaborating on the digitalculturebooks imprint, which makes digital versions of works freely available. The MIT Press is publishing Information Technologies and International Development as an open access journal and is providing free online access to the MacArthur Foundation Series on Digital Media and Learning thanks to the support of the MacArthur Foundation. Hindawi Publishing Corporation, a commercial press focused on science and engineering, now publishes all of its journals as open access under a model where authors cover publication costs.

3) Availability of tools to support collaboration
If humanities scholars are to work together on complex research problems, share data and references, and jointly author documents, they need tools that make the whole process easy. Web 2.0 is a notoriously squishy term, but for me it is fundamentally about enabling participation and collaboration. We could list dozens of different collaborative tools, such as blogs, wikis, collaborative bookmarking, social networking, collaborative authoring, social tagging, visualization, mashups, etc. In the digital humanities domain, a number of tools are under development that facilitate collaboration. For example, Stan Katz hails the recent partnership between the Center for History and New Media and the Internet Archive to enable humanities scholars to collaborate by uploading their research notes and collections to the Internet Archive using Zotero. SEASR is a software environment for data analysis that will “empower collaboration among scholars.”

4) Experiments with social peer review
While the traditional peer-review process includes only a few often anonymous reviewers, new approaches to peer review engage a larger community in evaluation and leverage collaborative bookmarking and social tagging applications to determine the impact of a work. For example, in preparing his book Expressive Processing: Digital Fictions, Computer Games, and Software Studies for publication, Noah Wardrip-Fruin is pursuing two methods of peer review: the traditional process, through MIT Press, and blog-based peer review. He’s posting the book in sections to Grand Text Auto and using CommentPress to engage in a conversation with readers. In reading over Wardrip-Fruin’s meta-reflections on blog-based peer review, I was struck by his observation that getting feedback from multiple reviewers helps him to figure out whether something just bothered one reader or is a deeper problem: “the blog-based review form not only brings in more voices (which may identify more potential issues), and not only provides some ‘review of the reviews’ (with reviewers weighing in on the issues raised by others), but is also, crucially, a conversation (my proposals for a quick fix to the discussion of one example helped unearth the breadth and seriousness of the larger issues with the section).” For Wardrip-Fruin, the “social process” produces comments that he trusts more, since they emerge from community dialogue. Some have criticized this approach, arguing that removing anonymity means that comments aren’t as honest and that opening up the review process dilutes its authority, but it seems to me that blog-based peer review resembles an online writing workshop—you hear from multiple readers and get a sense of how your argument is playing out.

5) Development of social networks to support open exchanges of knowledge
Social networking sites provide key organizational and communication tools for a community, whether it is focused on a particular field or spans the disciplines. As HASTAC’s name (the Humanities, Arts, Science and Technology Advanced Collaboratory) suggests, it fosters collaboration focused on innovative, interdisciplinary uses of technology by coordinating a network of research centers, sharing information, cultivating community, overseeing funding programs such as the MacArthur Digital Media and Learning Competition, and more. NINES (Networked Infrastructure for Nineteenth-century Electronic Scholarship) is developing a platform for collaboration (Collex), a network of nineteenth-century scholars, mechanisms for peer review of digital scholarship, and training programs for scholars working on digital projects.

6) Support for collaboration by funding agencies
Funding agencies are emphasizing collaboration in many of their programs. If you look at the tag cloud for the recently announced winners of the MacArthur Digital Media and Learning competition, “collaboration” stands out as the most frequently used term, applied to projects that, for instance, “connect young African social entrepreneurs with young North American professionals,” enable young people to work together on Do It Yourself science projects, or engage high school students in Los Angeles and Cairo in an environmental studies game. Similarly, the NEH/IMLS Digital Partnership program focuses on “innovative, collaborative humanities projects,” encouraging libraries, museums, and scholars to work together to advance public knowledge.

7) More broadly, universities are emphasizing community as a key part of graduate education
The Carnegie Foundation’s The Formation of Scholars: Re-thinking Doctoral Education for the Twenty-First Century argues that graduate programs must create intellectual community to engage graduate students in the work of the department and discipline, retain them, and promote innovative thinking. Perhaps digital humanities projects exemplify the benefits of collaborative approaches to scholarship, since it’s difficult for a solo scholar to pull off the typical digital humanities project. I was motivated to complete my PhD in large part because of the communities that I participated in, particularly my dissertation group and the Electronic Text Center. It seemed that the happiest graduate students in my program were those working on digital humanities projects, which allowed us to collaborate with senior scholars and fellow graduate students, learn new skills, and do work that had immediate benefit for researchers and, often, the general public.

Other examples of social scholarship’s emergence include the growth of blogging and the use of collaborative bibliographic tools such as CiteULike (which includes 500 items that are tagged “humanities”). Despite these signs that social scholarship is beginning to gain traction in the humanities, significant obstacles remain, obstacles that I will discuss in my next post.

Social Scholarship in the Humanities

Scholarship seems to be getting more visibly social. According to Laura Cohen, social scholarship is “the practice of scholarship in which the use of social tools is an integral part of the research and publishing process.” Social scholars may blog, share bookmarks, data and other resources, participate in social networks, make their works-in-progress available for review, and deposit their publications in open access repositories. A recent Scientific American article points out some of the benefits of “open source” science. At social networking sites such as OpenWetWare, which recently received a substantial NSF grant to develop social software for scientists, biologists and bioengineers share research protocols and syllabi, blog the research process, post profiles of their research groups, and find collaborators. As a result, collective wisdom is documented and passed down, failures as well as successes are made visible, lab managers can more easily track ongoing research, and researchers can get quick feedback on their work from colleagues around the world. Open Source Science seems especially appropriate for researchers searching for cures to diseases common in developing nations but of little interest to big pharmaceutical companies, since such openness can facilitate more rapid discoveries and is not constrained by the quest for patents. With Harvard’s recent adoption of an open access policy and the NIH mandate that research publications it funds be deposited in PubMed Central, social scholarship appears to be gaining momentum. To what extent are the humanities part of this movement?

Typically humanists are cast as the loners of academia, brooding over books in solitude. True, rarely do you see humanities scholars jointly authoring works, although they often collaborate to edit essay collections and journals and organize conferences and workshops. Unlike in the sciences, where joint authorship is expected, many tenure committees in the humanities haven’t yet figured out how to assign credit for collaborative work. Yet you can glance at the acknowledgments in any humanities monograph and find ample evidence for the social context out of which scholarship emerges—the friends and colleagues who suggested references and read multiple drafts, the anonymous peer reviewers who provided feedback, the conference attendees and students who served as sounding boards, the assistants who offered research support, the librarians and archivists who tracked down sources, the funders who helped pay for research trips, the partners who put up with it all. Reversing the typical image of scientists as collaborators and humanists as loners, Sayeed Choudhury and Timothy Stinson point out in The Virtual Observatory and the Roman de la Rose: Unexpected Relationships and the Collaborative Imperative that in the “data-poor” environments of the early modern era scientists were reluctant to share information, whereas medieval manuscripts provide ample evidence of humanists working together to write, copy, annotate, illustrate, and disseminate texts. As Choudhury and Stinson suggest, “Perhaps it is not a set of inherent characteristics within specific disciplines that defines their mode of scholarship or communication, but rather the relative ease or difficulty with which practitioners of those disciplines can generate, acquire or process data.” Does scarcity produce secrecy, abundance openness? Information housed in archives remains a scarce resource for humanities scholars, but mass digitization efforts are making other forms of humanities data widely available. Will humanities scholars work together to mine and make sense of this information? In my next posts, I’ll look at some trends indicating that humanities scholars are beginning to embrace social scholarship, as well as discuss some obstacles.

“Knowing” in 3D, ca. 1908 and 2008

“Everything you know is in 3D. You too can see—hear—experience—feel—know. Everyone believe. All of this can be yours.”

These words flash onto the screen during the trailer for U23D, which promotes itself as the first live-action digital 3D film. But U23D is not the first to claim that one can possess knowledge by visualizing the world in three dimensions. At the turn of the twentieth century, stereograph companies such as Underwood and Underwood proclaimed that “To see is to know.” I just finished up a project about stereographs as a means of virtual travel in the early 20th century, so I rushed out to experience a twenty-first-century 3D technology.

Stereographs are pairs of images taken from slightly different perspectives and then mounted side by side; when you look at a stereograph through a viewing apparatus called a stereoscope, you can see a single 3D image (think View-Master). Although I had naively assumed that the push to use images for education didn’t really begin until the TV Age, Judith Babbitts argues that stereograph companies “played a dominant role in creating the popular discourse that redefined what was important to know and how one should go about knowing it. As no knowledge industry had done before it, the stereograph industry identified technology–the technology of the camera–as the essential factor in acquiring information” (127). Both Underwood and its competitor Keystone View Company enlisted educational exper