What can you do with texts that are in a digital format?

I’ve had a longstanding, friendly debate with a colleague about whether it is sufficient to provide page images of books, or whether text should be converted to a machine- and human-readable format such as XML. She argues that converting scanned books to text is expensive and that the primary goal should be to provide access to more material. True, but converting books into a textual format makes them much more accessible, allowing users to search, manipulate, organize, and analyze them. Here’s my summary of what you can do with an electronic text. Most of these advantages are pretty obvious, but worth articulating.

Read it on paper (once you print it out or pay for on-demand printing), on your computer, or, increasingly, on a portable device. From a single XML file, you can generate many forms of output, including HTML, PDF, and formats for mobile devices.
Copy and paste it–avoid the hassle of having to retype passages.
Search it. Several years ago, I wrote a series of learning modules on stereographs, 3D photographs popular in the late 19th and early 20th centuries. I searched for books and articles on stereographs in the library catalog and in journal collections such as JSTOR, but was disappointed by how little relevant information I found. Last year I returned to the topic and used Google Books for my research. I found dozens more relevant sources, such as key theoretical and historical works on stereography (most of which had already been published when I first studied the topic) as well as some fascinating nineteenth- and early-twentieth-century manuals. Sure, I had to wade through a lot more results to find what I needed, but being able to search the contents of books and essays as well as the metadata let me uncover much more useful material.
Build a personal collection. Forget file cabinets crammed with photocopies. Using tools such as Zotero and EndNote, you can easily download articles and the accompanying bibliographic information onto your laptop, then take your entire collection with you on a plane, to an archive, to a boring meeting, etc. You can search your collection, sort it, create bibliographies, etc.
Share it. Much to the chagrin of movie studios and record companies, digital files are easy to share, so you can give colleagues access to articles, notes, bibliographies, etc. without having to deal with physical delivery (copyright permitting, of course). With the forthcoming Zotero 2.0, sharing will get even easier.
Analyze it. Once you have a book in a text-based format, you can do all sorts of nifty things with it–generate word counts, find out what terms appear most frequently next to a particular word, extract dates, find capitalized terms, compare texts, and much more. See TAPOR’s tutorial.
Visualize it. Not only are text visualization tools, well, cool, they also can open up interpretive insights. For instance, using the US Presidential Speeches Tag Cloud, you can get a quick, dynamic view of the history of presidential priorities.
Mine it. Look for patterns in large textbases. As Loretta Auvil of NCSA & SEASR explains, text mining tools such as those being developed by MONK and SEASR enable researchers to automatically classify texts according to characteristics such as genre, identify patterns such as repetition (as in the case of Stein’s The Making of Americans), analyze literary inheritance, and study themes across thousands of texts.
Remix & play with it. By taking the elements of a text or collection of texts and remixing them, you not only produce a new creative work, but also see the text in a new way–your attention is brought to particular linguistic elements, like the fragments of a broken vase used to make a mosaic. For instance, when I used the Open Wound “language mixing tool” with Melville’s 1855 sketch “The Paradise of Bachelors and the Tartarus of Maids”, I gained new insights into the violence and anxiety expressed by words such as “agony,” “cut,” and “defective.” Running the tool on the sketch also produced some stunning phrases that could serve as mottoes for this kind of activity: “Exposed are the cutters,” “in the meditation onward,” and “protecting through the scholarship.” I also plan to play with tools that would allow me to mash up several bachelor texts (take the beginning from Irving, the middle from Melville and Hawthorne, the end from Mitchell), replace key words with pictures, etc.
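
To make the "Analyze it" item above a little more concrete, here is a minimal Python sketch of four of the operations it mentions: word counts, the words that appear next to a particular word, date extraction, and capitalized terms. This is only an illustration using the standard library, not how TAPOR or any other tool mentioned in this post works. The sample text, the function names, and the heuristics (four-digit years, skipping the first word of each sentence) are all assumptions made up for the example.

```python
import re
from collections import Counter

# Hypothetical sample text, standing in for a digitized book.
SAMPLE = (
    "The stereograph was popular in 1860. By 1905 the stereograph "
    "had reached nearly every parlor in America. Oliver Wendell Holmes "
    "praised the stereograph in 1859."
)


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_counts(text):
    """Count how often each word occurs."""
    return Counter(tokenize(text))


def neighbors(text, target, window=1):
    """Count words appearing within `window` tokens of `target`."""
    tokens = tokenize(text)
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            start = max(0, i - window)
            counts.update(tokens[start:i])
            counts.update(tokens[i + 1:i + 1 + window])
    return counts


def extract_years(text):
    """Find four-digit years between 1000 and 2099 (a crude date heuristic)."""
    return re.findall(r"\b(?:1\d{3}|20\d{2})\b", text)


def capitalized_terms(text):
    """Find capitalized words, skipping the first word of each sentence.

    Crude heuristic: a name that opens a sentence will be missed.
    """
    terms = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        words = re.findall(r"[A-Za-z']+", sentence)
        terms.extend(w for w in words[1:] if w[0].isupper())
    return terms


if __name__ == "__main__":
    print(word_counts(SAMPLE).most_common(3))
    print(neighbors(SAMPLE, "stereograph"))
    print(extract_years(SAMPLE))
    print(capitalized_terms(SAMPLE))
```

None of this needs XML markup. It works on plain text, including uncorrected OCR output, although OCR errors will skew the counts.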

Some really interesting research is underway on the possibilities of text mining for humanities scholarship–including the aforementioned MONK and SEASR projects, as well as CHNM’s “Scholarship in the Age of Abundance: Enhancing Historical Research With Text-Mining and Analysis Tools.”

7 responses to “What can you do with texts that are in a digital format?”

  1. A debate on electronic text versus page images is definitely worth having, but I think the comparison is muddled in this posting. While I work in electronic publishing, I’ll advocate page images here since I fully agree that providing broad access is more important than producing hand-crafted masterpieces of electronic text.

    Nearly all projects which scan pages of text — from venerable projects like Making of America to more recent ones like Google Book Search — use OCR software on these images. OCR accuracy rates vary but are well above 99% for contemporary typefaces. So when you include these projects in the page-image category, you find that you can do nearly everything you can do with electronic text:

    1. Read it: If you have a PDF of page images (such as you can download through Google Books for public-domain works), you can use a process like this to view on an Amazon Kindle: http://www.mobileread.com/forums/showthread.php?t=18968 . As the corpus of digitized works grows, expect methods like this to be made more user-friendly.

    2. Copy and paste it: Many page-image projects allow you to view the OCR of content. While there will be OCR errors and possibly a loss of formatting, it’s not really such a bad start.

    3. Search it: Searching on full text uses the OCR text, and searching on the metadata uses any metadata provided in the system. As you point out, Google Books lets you find items quite easily.

    4. Build a personal collection: You can download PDFs of page images just as easily as other electronic text formats, though these PDFs may not be searchable the way electronic text is.

    5. Share it: PDFs of page images can be shared as easily.

    6. Analyze/visualize/mine it: You can do word-count analysis on the OCR text. What you can’t do is take advantage of any highly structured markup, but there are few sources of uniformly encoded data that allow you to do this kind of searching anyway, despite good efforts by the TEI. See Glen Worthey’s presentation at the DLF Spring Forum 2008: http://www.diglib.org/forums/spring2008/2008springprogram.htm .

    7. Remix & play with it: You can do this with the OCR text.

  2. @Kevin: Good point. A lot of what I was describing applies to texts in a digital form (whether PDF, XML, or page images)–that’s what the title of the post indicates, although my opening paragraph does focus more on page images vs OCRed/keyboarded texts. I wanted to keep the post short and simple, but I should have been clearer.

  3. Pingback: What can you do with texts in a digital format? « (Digital) Humanities

  4. Thanks for mentioning our work at SEASR (Software Environment for the Advancement of Scholarly Research). We are readying our software for first release (set for later this summer) and are seeking humanities collaborators who would like to make use of SEASR. We have a good deal of information up on our website: http://www.seasr.org, including a helpful technology description.

  5. Text images have three advantages: 1. the OCR may be wrong, but the image is always right. 2. you can cite it just like the printed version. 3. printing is easy and gives you a very usable text, because the scanned version has already been optimised for print.
    But of course 2. and 3. are also possible in XML text (with a lot more work).

  6. @JGE: Yes, ideally both the image and the text would be made available.

  7. Pingback: So you’ve digitized your text. Now what? « (Digital) Humanities
