How many texts have been digitized?

In remixing my dissertation as a work of digital scholarship, I’m trying to use digital resources for my research as much as possible. But is this even feasible? How many research materials in American literature and culture are available online as full-text, and how reliable are these electronic texts? I worked on my dissertation between 1996 and 2002 and used electronic collections that were available at the time–particularly Early American Fiction, Making of America (both the Michigan and Cornell sites), JSTOR, Project Muse, and HarpWeek–but I did most of my research in the stacks at Virginia’s Alderman Library, perusing critical works and flipping through 19th-century periodicals on the hunt for bachelor texts. (I’m a bit embarrassed to admit that I cited fewer digital resources than I actually used, but research that I’ve done with my colleague Jane Segal indicates that few literary scholars cite digital resources, even though many use them.) If I were to begin researching my dissertation now, what new possibilities would be open to me, and what problems would I face in trying to rely on digital resources?

To find out, I searched for each of the 296 items in my original bibliography in both free and subscription-based online collections such as Google Books, Open Content Alliance/Internet Archive, JSTOR, Project Muse, Early American Fiction, Making of America, NetLibrary, and Questia (which requires an individual subscription). I found that 83% of my primary source materials and 37% of my secondary source materials are now available online as full-text. By “full text,” I mean that, at minimum, you can read the work from start to finish online and search within it. If the work is in the public domain or is a journal article, you often can download it, whether as HTML, PDF, plain text, or, in the case of the Open Content Alliance, DjVu images. (I earlier reported that only 22% of my secondary materials were available online, but I then realized that I needed to look for these resources at sites such as NetLibrary and Questia.) Furthermore, 95% of all the sources listed in my bibliography have been digitized. If a work has been digitized but is not available as full-text, it’s typically a work that Google Books offers as limited preview, snippet view, or no preview because of copyright restrictions. You can still search books that are limited preview or snippet view, but you cannot retrieve more than a few pages (limited preview) or lines (snippet view). Access to 22% of the works–mostly periodicals and secondary ebooks–requires a subscription.

I suspect that more works have been digitized in my field, nineteenth-century American literature, than in most. Works from the period are safely in the public domain (except for critical editions produced in the 20th century), and major digitization initiatives such as Early American Fiction, Making of America, and Wright American Fiction have provided access to thousands of books and magazines. US research libraries have extensive collections focused on American literature, so those works seem to be well represented in Google Books and the Open Content Alliance.

Here are the numbers:

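| Type | Total # | # Full Text | # Ltd. Preview | # Snippet View | # No Preview | # Not Digitized | # Subs. Req. (% in Total rows) | % Full Text | % Digitized |
|---|---|---|---|---|---|---|---|---|---|
| Secondary monograph | 119 | 28 | 36 | 10 | 43 | 2 | 25 | 23.5% | 98.3% |
| Secondary periodical | 29 | 27 | 0 | 0 | 0 | 2 | 27 | 93.1% | 93.1% |
| Primary monograph | 66 | 50 | 5 | 1 | 8 | 2 | 1 | 75.8% | 97% |
| Primary periodical | 79 | 70 | 0 | 0 | 2 | 7 | 12 | 88.6% | 91.1% |
| Archival | 3 | 0 | 0 | 0 | 0 | 3 | | 0.0% | 0.0% |
| Total Primary | 148 | 120 | 5 | 1 | 10 | 12 | 8.8% | 82.8% | 91.9% |
| Total Secondary | 148 | 55 | 36 | 10 | 43 | 4 | 35.1% | 37.2% | 97.3% |
| Grand Total | 296 | 175 | 41 | 11 | 53 | 16 | 22% | 59.1% | 94.6% |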
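
For anyone who wants to check the arithmetic, here is a minimal Python sketch that recomputes the percentage columns from the raw counts in the table. The class and function names (`Row`, `combine`, `pct`) are made up for this illustration. One assumption worth flagging: the 82.8% full-text figure for primary sources appears to correspond to 120 of 145 items, that is, with the three archival items left out of the denominator; with all 148 items the figure would be 81.1%.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Row:
    """Counts for one source type, copied from the table above."""
    label: str
    full_text: int
    limited_preview: int
    snippet_view: int
    no_preview: int
    not_digitized: int

    @property
    def total(self) -> int:
        return (self.full_text + self.limited_preview + self.snippet_view
                + self.no_preview + self.not_digitized)

    @property
    def digitized(self) -> int:
        # Everything except "not digitized" exists in some digital form.
        return self.total - self.not_digitized


def pct(part: int, whole: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


def combine(label: str, rows: List[Row]) -> Row:
    """Add up the counts of several rows into a subtotal row."""
    return Row(
        label,
        sum(r.full_text for r in rows),
        sum(r.limited_preview for r in rows),
        sum(r.snippet_view for r in rows),
        sum(r.no_preview for r in rows),
        sum(r.not_digitized for r in rows),
    )


secondary_monograph = Row("secondary monograph", 28, 36, 10, 43, 2)
secondary_periodical = Row("secondary periodical", 27, 0, 0, 0, 2)
primary_monograph = Row("primary monograph", 50, 5, 1, 8, 2)
primary_periodical = Row("primary periodical", 70, 0, 0, 2, 7)
archival = Row("archival", 0, 0, 0, 0, 3)

primary = combine("Total Primary",
                  [primary_monograph, primary_periodical, archival])
secondary = combine("Total Secondary",
                    [secondary_monograph, secondary_periodical])
grand = combine("Grand Total", [primary, secondary])

for row in (primary, secondary, grand):
    print(row.label, row.total,
          pct(row.full_text, row.total), pct(row.digitized, row.total))

# Assumption: the table's primary full-text share excludes archival items.
print(pct(primary.full_text, primary.total - archival.total))
```
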

As a side note, I’m interested to see that I used secondary monographs more than any other resource–a common practice in literary study, I suspect.

Methodology

Initially I planned to systematically search multiple digital collections for the works in my bibliography, but once I realized how long that process would take I decided to scale back my efforts. My goal was to find out if a work was available online as full text, not to discover every electronic version of that work. I experimented with using tools such as Rollyo and the Google Custom Search Engine to search for works across a specific set of sites, but they wouldn’t return results from some major sources such as Google Books (perhaps because Google Books apparently does not permit indexing by other commercial search engines). I longed for a tool that would suck in my bibliography, search for authoritative versions of each text, capture the bibliographic information, and download everything into Zotero (come on, semantic web), but, alas, I had to do this work manually. If I were a better programmer, I might be able to automate part of the collection process, but most online archives prohibit automated methods for downloading files.

So that I could determine whether certain types of works were more likely to be available online, I distinguished between primary and secondary monographs (books as well as essays/articles/poems collected within books), primary and secondary periodicals, and archival materials (which I classified as primary sources). Logically enough, where I looked depended on what I was looking for:

  • primary source monographs: Google Books and Early American Fiction. If I didn’t find the work at these sites, I tried the Open Content Alliance and Making of America, then searched for the title using Google.
  • primary source periodicals: Making of America, Google Books, and subscription databases provided by Alexander Street Press, then the general web.
  • secondary source monographs: Google Books, Live Search Books, NetLibrary, and Questia
  • secondary source periodicals: Google Scholar (which searches JSTOR, Project Muse, and other electronic journal collections)
  • archival resources: web page of repository holding the collection
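
The lookup order in the list above can be written down as a small data structure. This is only a sketch of the manual procedure, not a working crawler: it makes no network requests, and `has_full_text` is a hypothetical callback standing in for whatever check a person (or a permitted API) performs against each collection. The titles in the usage example are placeholders, not items from my bibliography.

```python
from typing import Callable, Dict, List, Optional

# Collections to try, in order, for each type of source (from the list above).
SEARCH_ORDER: Dict[str, List[str]] = {
    "primary monograph": [
        "Google Books",
        "Early American Fiction",
        "Open Content Alliance",
        "Making of America",
        "Google web search",
    ],
    "primary periodical": [
        "Making of America",
        "Google Books",
        "Alexander Street Press",
        "general web",
    ],
    "secondary monograph": [
        "Google Books",
        "Live Search Books",
        "NetLibrary",
        "Questia",
    ],
    "secondary periodical": ["Google Scholar"],
    "archival": ["repository web page"],
}


def first_full_text_source(
    source_type: str,
    title: str,
    has_full_text: Callable[[str, str], bool],
) -> Optional[str]:
    """Return the first collection, in search order, that offers full text.

    `has_full_text(collection, title)` is a hypothetical check supplied by
    the caller. The search stops at the first hit, because the goal is to
    learn whether a work is available at all, not to list every copy.
    """
    for collection in SEARCH_ORDER[source_type]:
        if has_full_text(collection, title):
            return collection
    return None


# Usage with hand-recorded findings (placeholder data for illustration).
findings = {
    ("Early American Fiction", "Example Novel"),
    ("Questia", "Example Critical Study"),
}


def recorded(collection: str, title: str) -> bool:
    return (collection, title) in findings


print(first_full_text_source("primary monograph", "Example Novel", recorded))
print(first_full_text_source("archival", "Example Letters", recorded))
```

Stopping at the first hit mirrors the scaled-back goal described earlier: establishing availability rather than cataloguing every electronic version.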

I wasn’t always able to find the same edition of a work that I cited in my dissertation. However, I was delighted to discover first editions of important works in Early American Fiction, Google Books, Open Content Alliance, and other online collections.

As the statistics cited above indicate, the majority of my primary sources are available online as full-text, while most of my secondary sources are not. But I noted several instances where public domain materials that should be freely available were not.

  • Most of the primary source monographs I used have been digitized, but Google Books treats 8 of these public domain works as “no preview.” I’m not sure why Google treats works such as The Soldier’s Bride and Other Tales (1833) as “no preview,” but a note on the metadata record for this book gives a clue: “Prepared for The Electronic Archive of Early American Fiction at the University of Virginia Library.” My hunch is that Google Books does not make available some public domain works already digitized by its library partners.
  • Thanks to Making of America, many nineteenth-century American periodicals have been digitized. Google Books also contains important 19th-century magazines such as Southern Literary Messenger (which Poe wrote for) and Salmagundi, but it does not appear to have every issue of many of these magazines. Some important magazines, such as Godey’s Lady’s Book, are not available at all through Google Books (although you can access the magazine if your library subscribes to the Alexander Street Press Godey’s collection).
  • Nearly all of the secondary journal articles that I consulted are available online, most through JSTOR or Project Muse. However, more specialized journals such as the Walt Whitman Quarterly Review are not yet available as full-text online (although the WWQR does have a complete index).
  • Only about 25% of the secondary books that I cited are available as full-text through Questia and/or NetLibrary. However, Google Books seems to have at least digitized most of the contemporary secondary monographs in my bibliography. (I’m assuming that “no preview” means Google has digitized the work but isn’t making even a snippet publicly available.) Publishers such as Oxford UP, University of California Press, Cambridge UP and Knopf appear to have made deals with Google to allow limited preview of their books. Interestingly, Google Books has not digitized some works available through Questia, such as After the Whale and Monumental Anxieties.
  • I looked at three archival collections–single items at the University of Virginia and the Virginia Historical Society, as well as a fairly large collection focused on the author Donald Grant Mitchell at Yale–and none have been digitized yet.

So what are the implications of my findings: that most of my primary sources are available online as full text, while many of my secondary sources exist in digital form, at least in a limited fashion, and 62% of them are searchable? As Patrick Leary, Jo Guldi, and others have argued, massive digitization projects promise not only to make the research process more efficient, but also to open up new approaches to research. For example, you can discover important works that would otherwise be invisible to you, trace the use of a phrase across works, and analyze significant patterns in a corpus of texts.

Yet we should also acknowledge that not everything is available online and that research sources are scattered across multiple collections, not yet searchable through a single tool. Despite the efforts of many archives to digitize their collections, studying most archival resources still requires a trip to the archives (although the Web has made it much easier to determine if an archive holds relevant materials and to prepare for a research trip). Many online collections–particularly those focused on works not in the public domain–require a subscription, so if you’re an independent scholar or if your library can’t afford a subscription, you’re out of luck. Furthermore, scholars need to be able to trust the reliability of online texts so that they feel comfortable using (and citing) them–the metadata needs to be accurate, the page images and OCR of sufficient quality, the reputation of the archive high. Given the potentially overwhelming quantity of data, we need better tools to search, manage and analyze information (fortunately, projects such as Zotero and MONK are developing such tools). We also need to be able to use these tools with text collections, whether the tools are integrated into the collection (as Token-X is with the Willa Cather Archive), invoked through an API, or run on collections that we build ourselves by downloading relevant resources. And we need to feel comfortable that we can download and analyze online resources without worrying about being sued for violating licensing terms.

In my next posts, I’ll look at the quality of online texts, discuss what researchers can do with full-text, and detail the problems I ran into trying to get downloaded texts into shape for text analysis tools.

12 responses to “How many texts have been digitized?”

  1. Pingback: Early Modern Notes » New resources for making digital history

  2. You have to keep in mind, using google books, that the metadata offered there are not very good. That means one should not only use title/author as search terms but prominent phrases of a text, because it may be possible that ocr of the title page was not correct, but ocr of the full text was successful.

  3. Lisa, this is a fascinating result. My hunch has been that a lot more stuff has been digitized than most people think (at least most people who aren’t into digital humanities). But I haven’t seen anyone go through an existing work of scholarship and try to ‘reverse engineer’ the source base in digital form. Bill

  4. @jge– Good point! I’ll be addressing metadata in my next post…

  5. Thanks, Bill! That reverse engineering took a lot more work than I thought it would, but I was pretty curious to see the results. By the way, I’m loving The Programming Historian (Bill’s excellent, practical primer on programming for non-programmers: http://niche.uwo.ca/programming-historian/index.php/Main_Page)

  6. Pingback: pobres pero honrados - Tapera

  7. Pingback: Evaluating the quality of electronic texts « Digital Scholarship in the Humanities

  8. Pingback: Where to go next? « Lisa Spiro’s Research Notes: Bachelor 2.0 Project

  9. Late finding this very interesting blog post, but we’ve been including materials we own into the catalog here at Michigan: http://mirlyn.lib.umich.edu . Good metadata (or, as good as you’d expect in a large academic library) and links to the local and Google Books versions.

    (I’m another Virginia alum, btw.)

  10. Pingback: Using Google Books to Research Publishing History « Digital Scholarship in the Humanities

  11. Lisa: Your piece has been doubly helpful to me! Just in understanding how to improve my productivity. Do you have any idea of how one might push Google Books, for instance, to receive an alert when it posts new books with certain keyword content? That would boost Google’s utility to we serious researchers!

    Retired Professor of Geological Engineering
    University of Missouri

  12. Thanks, Dr. Hatheway. I haven’t been able to figure out how to get an alert when Google adds new books that match particular criteria, but the new Google Book Search gadget does provide recommendations that aren’t bad, in my limited experience with it: http://www.google.com/ig/adde?moduleurl=www.google.com/ig/modules/books/library_gadget.xml&source=bsha
