Full Text Corpora

Since 2015, I have been incrementally developing collections of full text (corpora) of early modern Italian books to use for computational text analysis. There are two sets of texts that I use increasingly for research and that I hope to use for teaching in the future:

  • available diplomatic editions curated by others (437 texts, over 19 million words)
  • diplomatic editions of books known to have been in Galileo’s library, curated by my team

Scholarly Context for Corpus Creation

Since much of the U.S. scholarly work on computational literary analysis has been driven by studies of modern, Anglophone texts, I needed a collection of texts in my area and language of specialization: 14th- to 18th-century Italian. By developing such a collection, in addition to learning about the Italian texts, I have an opportunity to test the portability of 21st-century quantitative methods for text analysis designed for modern English. In terms of scale and scope, my preliminary point of comparison was Matthew Jockers’ corpus used for Macroanalysis (Univ. of Illinois Press, 2013). Andrew Piper explores subcorpora that range in size from 75,000 poems, to 150 novels, to 28,000 documents of fiction and nonfiction, to 65,000 characters in 7,500 novels in Enumerations (Univ. of Chicago Press, 2018). Both scholars were working with teams of graduate students at institutions with robust support for Digital Humanities research.

Since Bowdoin College is an undergraduate-only institution without a non-English language course requirement, I have approached corpus development in two directions: using what is already available and carefully curating a new data set with the paid help of Italian majors and an Italophile outside contractor, MAI Services.

Creating a corpus engages directly with concerns raised by Leah Marcus in Unediting the Renaissance (Routledge, 1996). Marcus focused on the English Renaissance but brought to light the outsized role of 17th-century and later editors in determining the authoritative version of a text, often obscuring how contemporary readers would have encountered and contextualized the text. Given related concerns about how archives determine which texts and which authors are preserved, creating a corpus both relies on the prior power structures that create access to primary materials and represents an opportunity to intervene. I engage with these questions directly in a chapter in an edited volume on Galileo’s correspondence that is currently under review (as of summer 2025).

From here: