DocSouth Data: Library Collections Built for Text Mining
The popularity of Digital Humanities has led to increased interest in text mining and data analysis among humanists. While there are many tools available for anyone who wants to experiment with these techniques, researchers often hit a roadblock when it comes to finding collections that are ready to be analyzed.
DocSouth Data began as an idea to address this challenge by providing access to the data behind the North American Slave Narratives collection, the most widely used collection in UNC's Documenting the American South. This was an obvious place to start for several reasons. First, it is an extremely compelling collection of clear historical significance. Second, it is complete and coherent in that it includes all known autobiographical narratives of fugitive and former slaves published as broadsides, pamphlets, or books in English up to 1920. Third, because it was transcribed by students, it is extremely reliable data. In addition to the North American Slave Narratives collection, users can also get the data behind three other collections of similar quality: The Church in the Southern Black Community, First-Person Narratives of the American South, and the Library of Southern Literature.
This presentation from Stewart Varner, Managing Director of the Price Lab, will introduce the collection, explain how it was designed, and show how it can be used in digital humanities research.
Text analysis tools in progress from The HathiTrust Research Center
Sayan Bhattacharyya will briefly describe two ongoing tool-building initiatives at the HathiTrust Research Center (HTRC), the research wing of the HathiTrust Digital Library: the HTRC Bookworm and the HTRC "Extracted Features" functionality. The first tool, the HTRC Bookworm, consists of the generic Bookworm tool, developed by Erez Lieberman Aiden and Ben Schmidt, integrated with the HathiTrust Digital Library (HTDL). This integration leverages the extensive metadata that enriches the HTDL, enabling motivated visualizations of facets of a corpus. One such motivation, though by no means the only one, is the tracing of individual words over facets across time or across other dimensions. The second tool, HTRC's "Extracted Features", provides users with bags of words (and some other information) per page. This not only makes text analysis possible on texts that cannot be made available as linear, sequential streams of words because of copyright restrictions, but also lowers the cost of processing for texts that can be. He will argue that, in addition to their utilitarian value, tools such as these may also help problematize such notions as "text" and "reading".
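For readers unfamiliar with the term, the following is a minimal sketch of what a per-page "bag of words" looks like. It does not reproduce the HTRC Extracted Features format or API; the function names (page_bags, term_counts_per_page), the simple tokenizer, and the two sample pages are hypothetical and chosen only for illustration.

```python
import re
from collections import Counter


def page_bags(pages):
    """Return one bag of words (token -> count) per page.

    Word order within each page is discarded, so the original
    running text cannot be read back from the result.
    Tokenization here is a simplifying assumption: lowercase
    runs of letters and apostrophes.
    """
    return [Counter(re.findall(r"[a-z']+", page.lower())) for page in pages]


def term_counts_per_page(bags, term):
    """Count how often a single term appears on each page."""
    return [bag[term.lower()] for bag in bags]


# Hypothetical two-page "volume" used only for illustration.
pages = [
    "The river was wide, and the night was dark.",
    "We crossed the river before dawn.",
]

bags = page_bags(pages)
river_by_page = term_counts_per_page(bags, "river")
```

Because each page is reduced to counts, a term can still be traced page by page (as river_by_page does) even though the sequence of words is no longer available.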
Meyerson Conference Center, Van Pelt-Dietrich Library