Python in the Archives: Computational data mining and visualization of historical records from Mughal India

Sudev J Sheth

Doctoral Candidate

Project Start Date:

June 2017

Data Analysts:

Sudev J Sheth
Doctoral Candidate, CAS (South Asia Studies, History)

Jennifer Sui
CAS '17 (Economics, Statistics)

Faculty Mentors & Advisors:

Dr. Ramya Sreenivasan
Department of South Asia Studies

Dr. Arvind Bhusnurmath
Department of Computer and Information Science

Dr. Devesh Kapur
Department of Political Science

Dr. Sayan Bhattacharyya
Price Lab for Digital Humanities

Our world is producing information faster than we can analyze it. In fact, ninety percent of today’s data has been generated in the past two years alone. The challenge of managing and making sense of endless facts and figures has been a catalyst for the growing field of computational data science. In addition, more and more non-specialists are taking part in algorithmic work because programming languages like Python are user-friendly and relatively easy to learn. For the most part, computational analysis remains oriented towards predictive modeling and optimization for the benefit of business and politics. My project moves away from this trend by combining approaches from data science with historical questions and primary sources from early-modern India (AD 1352-1850). My initial dataset will be generated from one of six detailed archival catalogs compiled by the National Archives of India between 1982 and 2011. The data comprises 627 descriptive entries that will be assembled into a relational database, analyzed using statistical tools in Python, and visualized according to key research questions. I believe that this method will allow me to discover broader patterns, trends, and associations among the constituent elements of the archive that a more traditional reading of selected documents cannot provide. Finally, I hope to demonstrate ways that scholars in the humanities can incorporate computational methods into their research, and how data scientists can benefit from the interesting problems that historical sources present, such as working with uneven datasets and the multiple representational forms that historical evidence takes. The intended audience comprises researchers in the social sciences and humanities along with those working in the emerging fields of data science, visualization, and digital humanities. The main objective is to demonstrate the value of computational tools for both creating and analyzing unconventional datasets.
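To make the workflow described above more concrete (catalog entries assembled into a relational database, then summarized in Python), here is a minimal sketch using Python's standard sqlite3 module. It is an illustration under stated assumptions, not the project's actual schema or code: the table names, column names, and function names are hypothetical, and the sample rows are invented placeholders rather than entries from the National Archives of India catalog. One row deliberately lacks a date to suggest how uneven records might be handled.

```python
import sqlite3

# Invented placeholder rows, NOT real catalog entries.
# Each tuple is (entry_id, year_ce, doc_type); a year of None marks an undated record.
SAMPLE_ENTRIES = [
    (1, 1658, "order"),
    (2, 1661, "deed"),
    (3, 1702, "order"),
    (4, 1705, "petition"),
    (5, 1709, "deed"),
    (6, None, "petition"),
]


def build_database(entries):
    """Load entries into an in-memory relational database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE doc_types (
            id   INTEGER PRIMARY KEY,
            name TEXT UNIQUE NOT NULL
        );
        CREATE TABLE entries (
            id          INTEGER PRIMARY KEY,
            year_ce     INTEGER,  -- NULL when the catalog gives no date
            doc_type_id INTEGER NOT NULL REFERENCES doc_types(id)
        );
        """
    )
    for entry_id, year_ce, doc_type in entries:
        # Store each document type once and link entries to it by id.
        conn.execute(
            "INSERT OR IGNORE INTO doc_types (name) VALUES (?)", (doc_type,)
        )
        (type_id,) = conn.execute(
            "SELECT id FROM doc_types WHERE name = ?", (doc_type,)
        ).fetchone()
        conn.execute(
            "INSERT INTO entries (id, year_ce, doc_type_id) VALUES (?, ?, ?)",
            (entry_id, year_ce, type_id),
        )
    conn.commit()
    return conn


def counts_by_type(conn):
    """Return {document type: number of entries}."""
    rows = conn.execute(
        """
        SELECT t.name, COUNT(*)
        FROM entries AS e
        JOIN doc_types AS t ON t.id = e.doc_type_id
        GROUP BY t.name
        ORDER BY t.name
        """
    ).fetchall()
    return dict(rows)


def counts_by_half_century(conn):
    """Return {first year of 50-year period: number of dated entries}."""
    rows = conn.execute(
        """
        SELECT (year_ce / 50) * 50 AS period, COUNT(*)
        FROM entries
        WHERE year_ce IS NOT NULL  -- undated entries are skipped, not guessed
        GROUP BY period
        ORDER BY period
        """
    ).fetchall()
    return dict(rows)


if __name__ == "__main__":
    conn = build_database(SAMPLE_ENTRIES)
    print(counts_by_type(conn))
    print(counts_by_half_century(conn))
```

Keeping document types in their own table is one possible design choice: it lets counts and associations be computed with joins rather than by matching free-text labels. Summaries like these could then be passed to a plotting library for visualization.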

For more information about the project, visit https://sudevsheth.com/2017/09/16/python-in-the-archive/

Additional project support provided by the Center for the Advanced Study of India.