
How many texts have been digitized?

In remixing my dissertation as a work of digital scholarship, I’m trying to use digital resources for my research as much as possible. But is this even possible? How many research materials in American literature and culture are available online as full-text, and how reliable are these electronic texts? I worked on my dissertation between 1996 and 2002 and used electronic collections that were available at the time–particularly Early American Fiction, Making of America (both the Michigan and Cornell sites), JSTOR, Project Muse, and HarpWeek–but I did most of my research in the stacks at Virginia’s Alderman Library, perusing critical works and flipping through 19th century periodicals on the hunt for bachelor texts. (I’m a bit embarrassed to admit that I cited fewer digital resources than I actually used, but research that I’ve done with my colleague Jane Segal indicates that few literary scholars cite digital resources, even though many use them.) If I were to begin researching my dissertation now, what new possibilities would be open to me, and what problems would I face in trying to rely on digital resources?

To find out, I searched for each of the 296 items in my original bibliography in both free and subscription-based online collections such as Google Books, Open Content Alliance/Internet Archive, JSTOR, Project Muse, Early American Fiction, Making of America, Net Library, and Questia (which requires an individual subscription). I found that 83% of my primary source materials and 37% of my secondary source materials are now available online as full-text. By “full text,” I mean that, at minimum, you can read the work from start to finish online and search within it. If the work is in the public domain or is a journal article, you often can download it, whether as HTML, PDF, plain text, or, in the case of the Open Content Alliance, DJVU images. (I earlier reported that only 22% of my secondary materials were available online, but I then realized that I needed to look for these resources at sites such as Net Library and Questia.) Furthermore, 95% of all the sources listed in my bibliography have been digitized. If a work has been digitized but is not available as full-text, it’s typically a work that Google Books offers as limited preview, snippet view, or no preview because of copyright restrictions. You can still search books that are limited preview or snippet view, but you cannot retrieve more than a few pages (limited preview) or lines (snippet view). Access to 22% of the works–mostly periodicals and secondary ebooks–requires a subscription.

I suspect that more works have been digitized in my field, nineteenth-century American literature, than in most. Works are safely in the public domain (except for critical editions produced in the 20th century), and major digitization initiatives such as Early American Fiction, Making of America, and Wright American Fiction have provided access to thousands of books and magazines. US research libraries have extensive collections focused on American literature, so those works seem to be well-represented in Google Books and the Open Content Alliance.

Here are the numbers:

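| Type | Total # | # Full Text | # Ltd. Prev. | # Snip. View | # No Prev. | # Not Digit. | # Subs. Req. | % Full Text | % Digit. |
|---|---|---|---|---|---|---|---|---|---|
| secondary monograph | 119 | 28 | 36 | 10 | 43 | 2 | 25 | 23.5% | 98.3% |
| secondary periodical | 29 | 27 | 0 | 0 | 0 | 2 | 27 | 93.1% | 93.1% |
| primary monograph | 66 | 50 | 5 | 1 | 8 | 2 | 1 | 75.8% | 97% |
| primary periodical | 79 | 70 | 0 | 0 | 2 | 7 | 12 | 88.6% | 91.1% |
| archival | 3 | 0 | 0 | 0 | 0 | 3 | | 0.0% | 0.0% |
| Total Primary | 148 | 120 | 5 | 1 | 10 | 12 | 8.8% | 82.8% | 91.9% |
| Total Secondary | 148 | 55 | 36 | 10 | 43 | 4 | 35.1% | 37.2% | 97.3% |
| Grand Total | 296 | 175 | 41 | 11 | 53 | 16 | 22% | 59.1% | 94.6% |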
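For readers who want to check the arithmetic, here is a minimal Python sketch that recomputes the percentage columns from the raw counts. The counts are copied from the table; the class name `SourceCounts`, its field names, and the helper `combine` are illustrative choices, not part of any tool used for the original tally.

```python
from dataclasses import dataclass


@dataclass
class SourceCounts:
    """Counts for one row of the table (names are illustrative)."""
    name: str
    full_text: int
    limited_preview: int
    snippet_view: int
    no_preview: int
    not_digitized: int

    @property
    def total(self) -> int:
        return (self.full_text + self.limited_preview + self.snippet_view
                + self.no_preview + self.not_digitized)

    @property
    def pct_full_text(self) -> float:
        return 100 * self.full_text / self.total

    @property
    def pct_digitized(self) -> float:
        # Everything except "not digitized" counts as digitized,
        # including limited-preview, snippet-view, and no-preview items.
        return 100 * (self.total - self.not_digitized) / self.total


def combine(name: str, parts: list[SourceCounts]) -> SourceCounts:
    """Add several rows together column by column."""
    return SourceCounts(
        name,
        sum(p.full_text for p in parts),
        sum(p.limited_preview for p in parts),
        sum(p.snippet_view for p in parts),
        sum(p.no_preview for p in parts),
        sum(p.not_digitized for p in parts),
    )


# Counts copied from the table above.
rows = [
    SourceCounts("secondary monograph", 28, 36, 10, 43, 2),
    SourceCounts("secondary periodical", 27, 0, 0, 0, 2),
    SourceCounts("primary monograph", 50, 5, 1, 8, 2),
    SourceCounts("primary periodical", 70, 0, 0, 2, 7),
    SourceCounts("archival", 0, 0, 0, 0, 3),
]
grand_total = combine("Grand Total", rows)

for row in rows + [grand_total]:
    print(f"{row.name}: {row.total} items, "
          f"{row.pct_full_text:.1f}% full text, "
          f"{row.pct_digitized:.1f}% digitized")
```

Dividing by every item in a row reproduces the per-type and grand-total percentages. The 82.8% full-text figure in the Total Primary row appears to correspond to 120 of the 145 non-archival primary items rather than all 148; that reading is inferred from the arithmetic, not stated in the table.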

As a side-note, I’m interested to see that I used secondary monographs more than any other resource–a common practice in literary study, I suspect.

Methodology

Initially I planned to systematically search multiple digital collections for the works in my bibliography, but once I realized how long that process would take I decided to scale back my efforts. My goal was to find out if a work was available online as full text, not to discover every electronic version of that work. I experimented with using tools such as Rollyo and the Google Custom Search Engine to search for works across a specific set of sites, but they wouldn’t return results from some major sources such as Google Books (perhaps because Google Books apparently does not permit indexing by other commercial search engines). I longed for a tool that would suck in my bibliography, search for authoritative versions of each text, capture the bibliographic information, and download everything into Zotero (come on, semantic web), but, alas, I had to do this work manually. If I were a better programmer, I might be able to automate part of the collection process, but most online archives prohibit automated methods for downloading files.

So that I could determine whether certain types of works were more likely to be available online, I distinguished between primary and secondary monographs (books as well as essays/articles/poems collected within books), primary and secondary periodicals, and archival materials (which I classified as primary source). Logically enough, where I looked depended on what I was looking for:

  • primary source monographs: Google Books and Early American Fiction. If I didn’t find the work at these sites, I tried the Open Content Alliance and Making of America, then searched for the title using Google.
  • primary source periodicals: Making of America, Google Books, and subscription databases provided by Alexander Street Press, then the general web.
  • secondary source monographs: Google Books, Live Search Books, Net Library, and Questia
  • secondary source periodicals: Google Scholar (which searches JSTOR, Project Muse, and other electronic journal collections)
  • archival resources: web page of repository holding the collection
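The search order above can be written down as a small lookup table. The following Python sketch is only a way of recording the manual procedure, not an automated crawler: the `has_full_text` callback is a hypothetical stand-in for a person checking a collection by hand, and the title in the usage example is made up.

```python
from typing import Callable, Optional

# Collections to try, in order, for each type of source (from the list above).
SEARCH_ORDER: dict[str, list[str]] = {
    "primary monograph": [
        "Google Books", "Early American Fiction",
        "Open Content Alliance", "Making of America", "Google web search",
    ],
    "primary periodical": [
        "Making of America", "Google Books",
        "Alexander Street Press", "general web",
    ],
    "secondary monograph": [
        "Google Books", "Live Search Books", "Net Library", "Questia",
    ],
    "secondary periodical": ["Google Scholar"],
    "archival": ["repository web page"],
}


def first_full_text_source(
    source_type: str,
    title: str,
    has_full_text: Callable[[str, str], bool],
) -> Optional[str]:
    """Return the first collection, in search order, that has the full text.

    `has_full_text(collection, title)` is supplied by the caller; here it
    stands in for a manual check, since no real collection API is assumed.
    """
    for collection in SEARCH_ORDER[source_type]:
        if has_full_text(collection, title):
            return collection
    return None


# Usage with made-up findings recorded by hand.
manual_findings = {("Early American Fiction", "Hypothetical Novel")}


def checked_by_hand(collection: str, title: str) -> bool:
    return (collection, title) in manual_findings


found_in = first_full_text_source(
    "primary monograph", "Hypothetical Novel", checked_by_hand
)
print(found_in)
```

Stopping at the first hit mirrors the goal stated earlier: finding out whether a work is available as full text, not cataloguing every electronic version of it.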

I wasn’t always able to find the same edition of a work that I cited in my dissertation. However, I was delighted to discover first editions of important works in Early American Fiction, Google Books, Open Content Alliance, and other online collections.

As the statistics cited above indicate, the majority of my primary sources are available online as full-text, while most of my secondary sources are not. But I noted several instances where public domain materials that should be freely available were not.

  • Most of the primary source monographs I used have been digitized, but Google Books treats 8 of these public domain works as “no preview.” I’m not sure why Google treats works such as The Soldier’s Bride and Other Tales (1833) as “no preview,” but a note on the metadata record for this book gives a clue: “Prepared for The Electronic Archive of Early American Fiction at the University of Virginia Library.” My hunch is that Google Books does not make available some public domain works already digitized by its library partners.
  • Thanks to Making of America, many nineteenth-century American periodicals have been digitized. Google Books also contains important 19th-century magazines such as Southern Literary Messenger (which Poe wrote for) and Salmagundi, but it does not appear to have every issue of many of these magazines. Some important magazines, such as Godey’s Lady’s Book, are not available at all through Google Books (although you can access the magazine if your library subscribes to the Alexander Street Press Godey’s collection).
  • Nearly all of the secondary journal articles that I consulted are available online, most through JSTOR or Project Muse. However, more specialized journals such as the Walt Whitman Quarterly Review are not yet available as full-text online (although the WWQR does have a complete index).
  • Only about 25% of the secondary books that I cited are available as full-text through Questia and/or NetLibrary. However, Google Books seems to have at least digitized most of the contemporary secondary monographs in my bibliography. (I’m assuming that “no preview” means Google has digitized the work but isn’t making even a snippet publicly available.) Publishers such as Oxford UP, University of California Press, Cambridge UP and Knopf appear to have made deals with Google to allow limited preview of their books. Interestingly, Google Books has not digitized some works available through Questia, such as After the Whale and Monumental Anxieties.
  • I looked at three archival collections–single items at the University of Virginia and the Virginia Historical Society, as well as a fairly large collection focused on the author Donald Grant Mitchell at Yale–and none have been digitized yet.

So what are the implications of my findings that most of my primary sources are available online as full text, while many of my secondary sources exist in a digital format, at least in a limited fashion, and 62% of them are searchable? As Patrick Leary, Jo Guldi, and others have argued, massive digitization projects promise not only to make the research process more efficient, but also to open up new approaches to research. For example, you can discover important works that would otherwise be invisible to you, trace the use of a phrase across works, and analyze significant patterns in a corpus of texts.

Yet we should also acknowledge that not everything is available online and that research sources are scattered across multiple collections, not yet searchable through a single tool. Despite the efforts of many archives to digitize their collections, studying most archival resources still requires a trip to the archives (although the Web has made it much easier to determine if an archive holds relevant materials and to prepare for a research trip). Many online collections–particularly those focused on works not in the public domain–require a subscription, so if you’re an independent scholar or if your library can’t afford a subscription, you’re out of luck. Furthermore, scholars need to be able to trust the reliability of online texts so that they feel comfortable using (and citing) them–the metadata needs to be accurate, the page images and OCR of sufficient quality, the reputation of the archive high. Given the potentially overwhelming quantity of data, we need better tools to search, manage and analyze information (fortunately, projects such as Zotero and MONK are developing such tools). We also need to be able to use these tools with text collections, whether the tools are integrated into the collection (as Token-X is with the Willa Cather Archive), invoked through an API, or run on collections that we build ourselves by downloading relevant resources. And we need to feel comfortable that we can download and analyze online resources without worrying about being sued for violating licensing terms.

In my next posts, I’ll look at the quality of online texts, discuss what researchers can do with full-text, and detail the problems I ran into trying to get downloaded texts into shape for text analysis tools.