Seth Kulick

February 9, 2026 - 12:00pm

Williams 623


Annotation and Models for Information Extraction from Large Text Corpora: Early Modern English, Yiddish, and Scots

 

The availability of the 1.5-billion-word Early English Books Online (EEBO) collection has opened up new possibilities for research, but its utility is limited by the lack of labels that can be used to extract the desired information. In this talk I will discuss methods we are developing to add the appropriate labels in support of linguistic, historical, and other digital humanities research. For example, syntactic annotation is needed by linguists studying language change, and markup that links mentions of people to external databases would expand the possibilities for research on identifying social networks. I will also discuss parallel work for Yiddish and Scots, focusing on corpora such as the 12,000 books provided online by the Yiddish Book Center.

 

Seth Kulick is an associate director at the Linguistic Data Consortium (University of Pennsylvania), with a research focus on natural language processing for text. In recent years, he has enjoyed joint work across disciplines, which has led to his serving as a PI or co-PI on two current NSF grants and one recently completed NSF grant, bringing together linguistics, computer science, and the humanities. The talk will describe the work done under these grants.