Evaluating the quality of electronic texts

In my last post, I said that 83% of the primary source works that I used in my dissertation are now available online as full-text. But how reliable are these electronic texts? Can researchers feel comfortable citing them and using them for text analysis? In my view, the quality of an electronic text and its appropriateness for use in scholarship depend on 6 factors:

  • Quality of the scanning: Is the complete page captured? Is the image skewed or distorted? Is the image of sufficient resolution?
  • Quality of the OCR/text conversion: Is full text provided? What method was used to produce the text–double-keying or OCR? How accurate is the text? Are the texts marked up in TEI (Text Encoding Initiative)? Are words joined across line breaks? Are running heads preserved?
  • Quality of the metadata: Is the bibliographic information accurate? Is it clear what edition you are looking at? If there are multiple volumes, do you know which volume you are getting and how to locate the other volume(s)?
  • Terms of use: What are you legally able to do with the digitized work? Can you download the full-text and use tools to analyze it? Is the content freely and openly available, or do you have to pay for use?
  • Convenience: Can you easily download the text and store it in your own collection? How much work do you have to do to convert the text into a format appropriate for use with text analysis tools? How hard is it to find the electronic text in the first place? Is there a Zotero translator for the collection?
  • Reputation: Is the digital archive well-regarded in the scholarly community? If you cited the archive in your bibliography, would fellow researchers question your decision? Does the archive provide clear information about its process for selecting, digitizing, and preserving texts?

I focused my evaluation on the main collections that I plumbed for the primary source works in my dissertation bibliography: Google Books (GB), Open Content Alliance (OCA), Early American Fiction (EAF), Project Gutenberg (PG), and Making of America (MOA). I found the OCA works in the Internet Archive (they are marked as belonging to the “American Libraries” or “Canadian Libraries” collections). I apologize in advance for the length of this post, but I want to dig into the details.

Quality of Images

Perhaps the most heated debate over the quality of digitized texts has focused on Google Books. For example, Paul Duguid and Robert Townsend have questioned the quality of Google Books, providing examples of skewed or poorly scanned pages, inaccurate metadata, and the failure to make available materials that should be in the public domain. It goes without saying that providing access to high-quality page images is important: many researchers want to study the illustrations and other visual features of a text and to verify the converted text against the original page. Furthermore, a poorly scanned page probably means that the resulting OCR will also be bad.

To evaluate the quality of Google’s scans, I first took a bird’s-eye view, using the page preview function in Adobe Acrobat to get a quick glimpse of the image quality of 56 19th century works that I downloaded from GB. Using this admittedly inexact method, I noticed fewer than 100 scanning errors across the approximately 11,000 pages I glanced at. If I found one distorted image (see, for instance, the 1834 Knickerbocker) or a finger in the scan, it was likely that the book would contain other errors as well–maybe the scanning operator was, um, distracted. However, closer scrutiny of the files revealed other errors not visible through the preview. For instance, the text for Typee ended right in the middle of a word, which I don’t think is how Melville meant it to be (although that narrative approach does leave you anticipating what comes next). In the 1906 Maynard & Merrill edition of The Sketch Book, the last few lines on a number of pages are, for lack of a more precise term, stretched and curved, as if the page were turned too quickly during the scanning or photographing. (Judging from all of the black splotches on the pages, this text appears to be scanned from microfilm, so the quality issues may have been introduced during the microfilming, not the scanning.) Although I can’t make a definitive statement about the quality of Google Books’ scans, I’d say there are some problems, but they are not as significant as I thought they would be.

Although Google Books seems to have the most content right now (about 2,680,000 books?), I prefer the quality of the scans provided by the Open Library/ Internet Archive, which provides searchable text + image PDF, as well as DJVU files and a flip-book format that simulates the experience of turning book pages online. Whereas most of the pages in Google Books are scanned in black and white, the OCA scans are in full color, showing the coloration of the page and the richness of the illustrations. Although I haven’t conducted a systematic study of OCA scans, I haven’t noticed many problems at all. (Disclaimer: Rice University’s Library is a member of the Open Content Alliance, but I have not been personally involved in working with the organization.)

Likewise, the scans for Early American Fiction are full-color and were captured at a high resolution. When I worked at Virginia’s Electronic Text Center, I visited the photography studio for EAF and witnessed the care with which each page was scanned, so I would be surprised to find many problems. Although the MOA images are black and white, the scans appear to be of high quality and were captured at 600 dpi. Project Gutenberg does not provide page images, so the issue of scanning quality is moot for this collection.

Quality of Text Conversion

Although the ability to search and retrieve an entire text online makes research more convenient and comprehensive, doing deeper analysis requires either being able to download texts or to run tools on online content. With some subscription-based collections, such as Net Library and Questia, you are limited to viewing the work one page at a time, which really restricts what you can do with it. With other collections, you can capture the entire work by downloading it as a web page (Muse, EAF), image-only PDF (Google Books), searchable PDF (OCA and JSTOR, lately?), or plain text with uncorrected OCR (Making of America, OCA).

My favorite sources are those that provide good quality full-text of the article or book in HTML (XML would be even better), since that minimizes the work I need to do in getting the text into a format I can use with text analysis tools. Converting the image-only PDFs provided for download by Google Books into plain text or XML has been troublesome, perhaps because the resolution of the scans and overall quality of the scanning could be better, perhaps because nineteenth century works aren’t necessarily printed clearly. (Several folks have complained about the quality of Google Books PDF files: 1, 2.) To get the Google Books PDFs into a format I could work with, I had to run them through Acrobat 8.0’s optimizer to improve the image quality, then through its OCR engine to convert the files to Adobe Tagged Text XML, a very basic markup format.

In assessing the quality of the full-text files provided by Google Books, the Open Content Alliance, Early American Fiction, Making of America, and Project Gutenberg, I compared the quality of OCR for 3 works important to my dissertation: Mitchell’s Reveries of a Bachelor, Melville’s The Piazza Tales, and Irving’s The Sketch Book. [Results of the analysis are available here.] EAF, OCA, and GB held all three books, but MOA only had Reveries and PG only had The Sketch Book and The Piazza Tales. For GB, I looked at the quality of the OCR I was able to generate from the image-only PDFs, as well as of the plain text provided online by Google Books, which allows you to view only one page at a time. I selected the same three pages in each of the books and counted the number of OCR errors per page. In the case of The Piazza Tales, I was able to use the same edition from all of the collections (the 1856 Dix & Edwards edition), but even for the other works the number of words per page seemed about the same. I found that:

  • The most accurate texts were produced by EAF (1 error total out of the 3 sample pages I examined in 3 texts), Project Gutenberg (0 errors in 2 texts), and MOA (0 errors in 1 text). The OCR that I generated from the Google Books PDFs was the least accurate, with a total of 111 errors, while there were 44 errors in the 9 pages I examined from OCA. Full disclosure: I was a project assistant for EAF and expected that the quality of text conversion would be high, since the texts were double-keyed, marked up in TEI, and then reviewed by a graduate assistant.
  • The OCR produced by Google Books is of much higher quality than what I was able to create through Adobe Acrobat’s OCR engine–there were 38 errors in the plain text online version vs 111 in my OCRed version. (I’m betting that Google is working with better images and better software.) But even Google Books’ OCR engine was tripped up by poorly scanned pages; for instance, in one page from The Sketch Book, words were cut off along the left edge, resulting in 25 OCR errors for this page.
  • The number of text conversion errors would increase for OCA if I included words that broke across lines. Although preserving line and paragraph breaks, as OCA does, makes it easier for the reader to move between the page image and the OCRed text, this practice also means that word counts for OCA texts are probably not accurate, since terms such as “pic” <line break> “ture” would not be registered as “picture”. Depending on the density of words on the page, it seems that an average of 4-6 words per page are broken across lines.
  • Running heads are preserved in the MOA, OCA and Google Books texts, so word counts for terms that appear in these heads (e.g. “Reveries of a Bachelor”) would be inflated.
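To put those error counts on a common footing, here is a minimal Python sketch that converts them to errors per sampled page. The error totals come from the comparison above; the page totals assume 3 sample pages per text (so 9 pages for a collection holding all three books), and the labels are just shorthand for this illustration.

```python
# Error totals as reported above: (errors, number of texts sampled).
# Assumption: 3 sample pages were checked in every text.
PAGES_PER_TEXT = 3

samples = {
    "EAF": (1, 3),
    "Project Gutenberg": (0, 2),
    "MOA": (0, 1),
    "OCA": (44, 3),
    "GB (online plain text)": (38, 3),
    "GB (Acrobat OCR)": (111, 3),
}


def errors_per_page(errors, texts, pages_per_text=PAGES_PER_TEXT):
    """Average number of conversion errors per sampled page."""
    return errors / (texts * pages_per_text)


rates = {name: errors_per_page(e, t) for name, (e, t) in samples.items()}

# Print the collections from most to least accurate.
for name, rate in sorted(rates.items(), key=lambda item: item[1]):
    print(f"{name:24s} {rate:5.2f} errors per sampled page")
```

With so few pages sampled, these rates are rough indications rather than measurements, and they leave out the words broken across lines mentioned above.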

In addition to sampling texts to determine the number of errors in text conversion, I also wanted to get a macro view of the accuracy of each version. I decided to focus on The Piazza Tales, since OCA, Google Books, and EAF all contain the 1856 Dix & Edwards edition. I used the HTML file downloaded from EAF (the TEI version would have been preferable, but there is no direct access to that file), the plain text file from OCA, and the Adobe Tagged Text XML I generated from the Google Books image only PDF. (I couldn’t get either HyperPo or TAPOR Tools to work with the Gutenberg file.) To generate the word counts, I ran each text through HyperPo 6, then compared the results [available as spreadsheet]:

  • Word counts differed in each file, even though the editions were identical. For example, HyperPo 6 counted 311 instances of “all” in the Google Books version, 383 in EAF, and 380 in OCA. Likewise “one” appears 327 times in Google Books, 391 times in EAF, and 390 times in OCA. These variations are probably due to differences in the quality of text conversion. I would expect Google Books to register lower counts for words, since the OCR that I was able to produce appears to be least accurate. EAF would probably contain the most accurate results, since it (unlike OCA) typically joins words broken across lines.
  • Each version contains words/characters that would lead to false results. For instance, the word count for EAF includes terms that were not part of the original text but that were used to aid online navigation, such as “link” (1843 times), “page” (1368), and “window” (948). The letter “s” appears 421 times in the GB results, 468 times in EAF, and 471 in OCA, signifying, I suspect, a word fragment.
  • “Piazza,” which is in the running head for the book, appears 243 times in OCA, but does not register in the top 50 terms for GB or EAF. In the case of EAF, the running heads are stripped out, so “Piazza” is not given too much weight. In the case of Google Books, the OCR quality for terms that are capitalized is usually terrible–instead of “THE PIAZZA TALES” you get “‘I’ BZ .1 A Z&A ‘I’ ALB•” or “TUB PI..lZZA TAL.”
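The size of these discrepancies is easier to see as a percentage. The sketch below is only an illustration in Python, not what HyperPo does internally: `word_counts` is a hypothetical, very simple tokenizer, and treating the EAF count as the reference value is an assumption based on EAF being the double-keyed text.

```python
import re
from collections import Counter


def word_counts(text):
    """Lowercase the text and count runs of letters and apostrophes."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def relative_gap(count, reference):
    """Fraction by which `count` falls short of `reference`."""
    return (reference - count) / reference


# Counts reported above for The Piazza Tales.
reported = {
    "all": {"GB": 311, "EAF": 383, "OCA": 380},
    "one": {"GB": 327, "EAF": 391, "OCA": 390},
}

for word, by_source in reported.items():
    reference = by_source["EAF"]  # assumption: EAF is the most accurate version
    for source in ("GB", "OCA"):
        gap = relative_gap(by_source[source], reference)
        print(f"{word!r}: {source} is {gap:.1%} below EAF")

# The tokenizer on a toy sentence (not a quotation from the book).
print(word_counts("All for one, one for all")["one"])
```

By this measure the OCR I generated from the Google Books PDF undercounts both words by well over 10 percent relative to EAF, while OCA is within 1 percent.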

As Wesley Raabe has shown, even the most carefully prepared edition of a work, whether print or electronic, is likely to contain errors (and some works aren’t prepared that carefully). When you introduce OCR into the process, the rate of error increases. Now for purposes of search and retrieval, OCR errors may not be all that significant–the OCR probably wasn’t inaccurate for every instance of a word, so you can still determine if a text contains that word. However, if I were to choose a version of a text to use with a text analysis program, I would go with the EAF version, since the EAF text was double-keyed, marked up in TEI, and appears to have the highest accuracy rate. Unfortunately, producing double-keyed, marked-up texts is more expensive than creating uncorrected OCR, so there are far more mass-digitized, OCRed books than double-keyed, TEI-encoded texts. There are 886 volumes in EAF (which is focused on nineteenth century American literature) and over 2 million in GB. If scholars are going to use text analysis, mining and visualization tools on massive collections of digitized texts, they probably need to be explicit about their tolerance for error. (It would be nice if they could get access to the higher quality OCR produced by GB, too.)

Quality of Metadata

In order to trust the reliability of a work, you need good metadata so that you know what you are looking at (and in some cases to find it in the first place). As Duguid has observed, the metadata for Google Books can be poor, particularly when you’re dealing with a multi-volume work, as you often are when you’re working with nineteenth century literature. Apart from viewing the title page, there’s no way to tell that a work in GB comes in multiple volumes, and the metadata record provides no linkage from one volume to the other in the series. Indeed, sometimes it appears that only one volume in a series has been digitized. However, GB also provides potentially useful information about books–not only the title, author, publisher, and publication date, but also the page count, digitization date, and subject keywords, as well as reviews, unique phrases, and popular passages.

Although OCA does offer a lot of information about works in its collections, the metadata could be organized more effectively. When you search for a work, the information that appears in the results page is too general: you see the title, author name, and number of downloads, but not the publisher or publication date, so you have to follow the link to know what edition you’re getting. Sometimes the OCA omits the publisher and publication date from the metadata record even though this information is available on the title page of the book. At least the OCA typically includes the volume number in the title. I also like the way that the Internet Archive provides detailed metadata about how the electronic text was produced, including the number of images and scanning operator. Ideally you’d get subject terms as well, and the author name and title could serve as hotlinks to more content by that author or with that title. (The Open Library, which also involves Internet Archive guru Brewster Kahle, often provides more complete metadata records, but it’s a meta-catalog, with records for non-digitized works held by libraries, even GB.)

Both EAF and MOA offer complete bibliographic information for works in their collections. Unfortunately, Project Gutenberg often does not provide bibliographic information beyond the title, author, and subject terms, which is a real problem for scholars who need to know what edition they are looking at.

Restrictions on Use of Digitized Materials

Brewster Kahle and John Wilkin of the University of Michigan recently debated what “open” means. According to Kahle, open content, like open source code, can be “downloaded in bulk, read, analyzed, modified, and reused.” Wilkin replies that “There is no uniformly defined constituency called ‘researchers’ who ‘require downloadability.’” While Wilkin may be right about the current needs of average users, this statement overlooks the needs of researchers who want to do more than just search and retrieve a text–for instance, those who want to build their own collections, run analytical tools that require you to load files residing on your local hard drive, etc. Of course, with some tools, such as TAPOR, you can provide a URL for the text to be analyzed. The question may not be so much one of downloadability as the ability to access and manipulate online texts–can you get access to the data through an API, for instance? Is text mining even permitted by the terms of use?

Here’s my summary of the terms of use put forward by the collections that I used, with the caveat that I Am Not A Lawyer and may be misinterpreting legalese.

a. Google Books:
Google Books, which does not require a subscription, makes available most of its public domain books for download as image-only PDF files, but imposes four restrictions on their usage: automatic querying is prohibited, the files should be used for non-commercial purposes, attribution to Google must be maintained, and users should “keep it legal,” particularly since copyright law varies by country. [I won’t wade into the debate about whether Google Books is good or bad for what I’ll call a “researcher-friendly” approach to copyright, but Lawrence Lessig and Siva Vaidhyanathan offer interesting perspectives.]

b. Internet Archive/ Open Content Alliance:
The Internet Archive, which is freely available, requires users to adhere to a 7-page terms of use agreement that governs use of all collections. Essentially the Internet Archive makes the user responsible for using content appropriately, requiring that “you certify that your use of any part of the Archive’s Collections will be noncommercial and will be limited to noninfringing or fair use under copyright law.” As the name suggests, the Open Content Alliance seems to be pretty darn open in its approach.

c. Early American Fiction:
Access to EAF requires a subscription, although 158 of 886 volumes are freely available from the University of Virginia Library. (However, something seems to be broken right now at Virginia, since I wasn’t able to open any of the freely available EAF texts.) ProQuest’s terms of use for Early American Fiction are much more explicit and restrictive than those of the Open Content Alliance or Google Books. Essentially, “You will use the Products solely for your own personal or internal use.” You cannot create derivative works, nor can you systematically download works to create a collection of materials. Works can be printed or saved only for private or educational use.

d. NetLibrary:
NetLibrary, which is available only through subscription and focuses on copyrighted content, prohibits the use of automated methods for collecting or analyzing its data such as robots or data mining tools without “express written permission.” It puts forward what seems to me to be a fairly expansive policy limiting use of copyrighted materials: “You may not modify, alter, publish, transmit, distribute, display, participate in the transfer or sale, create derivative works, or in any way exploit, any of the Copyrighted Material, in whole or in part.” Presumably this policy wouldn’t rule out using text analysis tools on works downloaded from NetLibrary (assuming you could actually get the full text of resources), but it does seem to ban remixes or even alteration of the work (conversion to XML?). NetLibrary is owned by OCLC Online Computer Library Center, Inc., an Ohio nonprofit corporation.

e. JSTOR:
In its usage policy, JSTOR, which requires a subscription, recognizes that fair use governs what researchers can do with research materials. Users may not download entire issues or “incorporate Content into an unrestricted database or website,” but they can “search, view, reproduce, display, download, print, perform, and distribute articles, book reviews, journal front and back matter, and other discrete materials.”

f. Project Muse:
Owned by Johns Hopkins UP, Project Muse permits downloading one copy of articles for personal use as well as distributing works to students in a class at the subscriber’s institution. Project Muse, which requires a subscription, will not allow users to employ automated processes to download works, re-use content for commercial purposes, or “modify or create a derivative work of any Journal content without the prior written permission of the copyright holder.”

g. Questia:
Available only through individual subscription, Questia, which focuses on copyrighted content, puts forward very restrictive terms of use: “You may use the Questia Web site for your personal or academic research activities. Any other use, including the reproduction, modification, distribution, transmission, republication, display, performance, rehosting, tampering, framing, or embedding of this site or its content or tools, or any commercial use whatsoever of this Web site or its content or its tools, is strictly prohibited without our prior written consent…. You agree not to reverse engineer, reverse assemble, reverse compile, decompile, disassemble, translate or otherwise alter any executable code, contents or materials downloaded from or made available through our Web site. You agree not to use programs, scripts, code or other available methods to download or view multiple pages of content on Questia in an automated fashion. You agree not to save the content available on Questia on your hard drive or other storage device for viewing offline.”

h. Making of America:
Michigan’s MOA, which is freely available, provides a standard availability notice with the metadata record for each book or journal granting the right to search, but prohibiting the redistribution of materials without permission: “These pages may be freely searched and displayed. Permission must be received for subsequent distribution in print or electronically.” However, a more general usage statement on the home page for the University of Michigan Digital Library Text Collection states that “Users are free to download, copy, and distribute works in the public domain without asking for permission.” Cornell’s policy for its content in MOA seems more restrictive: “This material is to be used for personal or research use only. Any other use, including but not limited to commercial or scholarly reproductions, redistribution, publication, or transmission, whether by electronic means or otherwise, without prior written permission of the Library is prohibited.”

Bottom line: For most of these online collections, particularly those that are non-commercial, researchers are permitted to download and analyze texts. However, if a scholar wanted to create a free web-based collection of, say, bachelor texts downloaded from various sources, that would probably require getting permission. Converting downloaded files from one format to another for purposes of personal research generally seems OK. If you wanted to extract a few illustrations from digitized public domain texts (assuming they were of sufficient quality to begin with) and reuse them in, say, a video, that would seem to be permitted by fair use if not the explicit policy of the collection, but I’m not totally sure.

Convenience

I want to spend my time analyzing works, not searching all over the web for them, typing bibliographic information into Zotero, or converting files into another format. As already noted, GB allows you to download image-only PDFs of most public domain materials, but you have to optimize and OCR these files if you want to search or analyze them. Nevertheless, it’s pretty easy to find relevant works in GB’s huge collection, and it does provide support for Zotero. At the Open Content Alliance, you can download books in several formats, including DjVu, word plus image PDF, and plain text. You can even access the files via an FTP site. However, the Internet Archive does not yet have a Zotero translator (strange considering the recently-announced partnership between the two groups), which means more work for me in capturing bibliographic information. With the Michigan Making of America collection, you can download the uncorrected full text after reading a warning that the OCR contains errors. I was also interested to note that Michigan now provides a print-on-demand service for works in MOA–the metadata record includes a link “Order a Softcover @ Amazon.com.” The 1864 Reveries of a Bachelor, for instance, costs $23.99 from Amazon, and you can search inside the book at Amazon. (Amazon lists 7,130 results when you search for “scholarly publishing office University of Michigan Library”.) Michigan’s MOA just offered Zotero support, but it doesn’t seem to be working yet. It’s pretty easy to work with files from Project Gutenberg (plain text and sometimes HTML) and EAF (HTML), although neither offers Zotero support.

Even if I have to search in multiple collections for texts that I need, manually add bibliographic information, and convert files to different formats, I should acknowledge that it’s so much faster and more efficient to do research online than to look in catalogs for books, track ’em down in the stacks, skim them for key words, photocopy them, and copy down the bibliographic information.

Reputation

Evaluating the reputation of an online collection is tricky–it’s difficult to know how a scholarly community regards particular sources without doing citation analysis or conducting interviews with or surveys of scholars. To get a quick sense of how often various digital collections are cited, I searched for the title of the collection or the base URL (e.g. books.google.com) in Project Muse and JSTOR. Only five works cite Google Books (three of which focus on the book scanning project) and only four cite “Early American Fiction.” However, 37 articles cite Project Gutenberg, 36 cite Making of America, and 30 cite the Internet Archive, mostly its contemporary culture collections such as the Prelinger Archive. I attribute the low rate of citation of Google Books to several factors. It’s the newest collection, having been launched at the end of 2004. Further, many researchers probably feel that they can cite works in GB as if they were looking at the print version rather than including the unwieldy Google Books URL in their bibliography. Reputation–the sense that you get more credit if you cite the print version–probably also factors in. I’m not sure why EAF is not cited more often, but it does contain fewer works than the other collections.

Conclusions

In reviewing the quality of these electronic texts, my intention is not to whine or nitpick–I’m grateful to all of these projects for making such rich content accessible. Since I’ve been involved in a few digitization projects myself, I know how difficult it can be to ensure quality. However, I think researchers need to be aware of the current limitations of electronic texts. Because the OCR or even the keyboarding of texts is not entirely accurate, using tools to search or analyze them will not produce completely accurate results. Perhaps we can use pre-processing [corrected 5/13] tools to increase the accuracy. For instance, we could use such tools to remove running heads and other extraneous information, spellcheck files, and even join words broken across lines. Even if the texts aren’t completely accurate, I think you can come to some fascinating insights about the works by using text analysis tools on them–you just need to be conscious of potential problems with the quality of the text.
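As a rough idea of what such pre-processing might look like, here is a minimal Python sketch covering two of the steps mentioned above: dropping running heads and rejoining words split across line breaks. The function names are hypothetical, the sample page is made up (it is not a quotation from the book), and the sketch assumes that the OCR keeps a hyphen at the end of a line where a word is broken. Spellchecking is left out.

```python
import re


def strip_running_heads(lines, heads):
    """Drop lines that hold only a known running head, with or without a page number."""
    known = {head.lower() for head in heads}
    kept = []
    for line in lines:
        # Remove a leading or trailing page number, then stray periods and spaces.
        core = re.sub(r"^\d+\s*|\s*\d+$", "", line.strip()).strip(" .").lower()
        if core in known:
            continue
        kept.append(line)
    return kept


def join_broken_words(text):
    """Rejoin words hyphenated across a line break: 'pic-' + newline + 'ture' -> 'picture'.

    Caveat: this also fuses genuine compounds that happen to break at the
    hyphen ('well-' + newline + 'known' becomes 'wellknown').
    """
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


# Made-up sample page in the style of uncorrected OCR output.
page = [
    "THE PIAZZA TALES. 23",
    "I sat upon the piazza and studied the pic-",
    "ture of the distant mountain.",
]

body = "\n".join(strip_running_heads(page, heads=["The Piazza Tales"]))
cleaned = join_broken_words(body)
print(cleaned)
```

A cleanup like this trades one kind of error for another, so it would be worth spot-checking the output against the page images before trusting any word counts drawn from it.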

We can also make informed choices about which texts we use for research. Each online collection has its strengths and weaknesses: Google Books contains the most books but its image-only PDF files can be a pain to work with, EAF has the most accurate texts in my field (19th C American fiction) but is also the smallest collection, and MOA has some key works for the study of 19th century culture and offers access to pretty decent OCR. OCA seems to take into account the needs of researchers by building a large collection of public domain works, offering access to files in different formats, and adopting a policy of openness.

2 responses to “Evaluating the quality of electronic texts”

  1. Thank you again for your blog, I really appreciate the work in all this!

    One should know that in several countries the legal notice coming with the digitized work may not be correct. In Germany for example simple book scanning does not constitute a copyright to the images one has produced (because there’s no “creative process” in scanning). So if Google Books offers images of books in the public domain, Google’s “terms of use” are of no importance.

    There are 2 points not mentioned: 1. Does the archive offer a way to correct texts? E.g., if one works with a scan by Google and notices OCR/textual errors, what can one do? The answer (for Google): nothing. A pity. (I haven’t tried for the other archives yet.)

    Citing: Some collections offer stable links to texts that are clearly recognizable as such. But did you know that you can identify a Google Book by linking with the string before the second question mark? For example your “Knickerbocker”:
    http://books.google.com/books?id=WxIAAAAAYAAJ.

  2. Pingback: digitization transmission
