Computational Parallax

During the summers of 2015 and 2016 I participated in the National Humanities Center’s first Digital Textual Studies Institute, led by Willard McCarty and Matt Jockers. This was an excellent, intensive opportunity for thinking about what happens when we combine a humanities question with a computational method. Participants are now preparing their articles for publication. I am presenting highlights here that connect with the overall Galileo’s library project.

The motivating question behind this work remains: what did Galileo’s prose sound like to his readers? Did it sound dated? Erudite? Casual? In terms of the reception of Galileo’s ideas, the question of how those ideas were packaged is pressing. I can impose a reading based on critically informed hindsight, but is there a way to access a sense of textual similarity (or difference) that would be more generative? One approach was to borrow methods from authorship attribution studies, which build on stylometry and the identification of passages with similar features. At first this seemed as “easy” as developing a century or decade classifier to answer the question of whether or not Galileo’s prose sounded outdated, but given the high rates of sampling from older texts in the early modern period, the results did not point to time markers in the style. Instead, what came to light were the clustering patterns of certain passages. The unexpected results were a prompt to stop and think about what was happening to the texts, and this is how I arrived at what I am calling computational parallax: the idea that when a text is seen against different backdrops, certain textual features will be highlighted or obscured, so that we learn something about the text in one position and something else in another.

This idea came about because one of the biggest challenges for this project was determining which texts to use: transcriptions of modern critical editions (available in much greater number for Italian texts of the period) or transcriptions of early modern original editions. Given spelling and punctuation changes, not to speak of typographical errors and editorial interventions during many reprintings, the variation between editions for some works could be tremendous. The most notable impact would be on function words, which are typically the most frequent words (MFWs), the features that drive clustering and classification. Working with a research assistant, Sabina Hartnett (Bowdoin ’18), I carried out an experiment to test what would happen to key texts if they were analyzed in a corpus of modern editions and in a corpus of early modern editions. The control texts were the usual suspects in this work so far: Galileo’s treatise on comets, Ariosto’s epic poem, Tasso’s heroic poem, and a relative newcomer to the project, Tasso’s much shorter pastoral L’Aminta. These were the texts that I could find in digital transcription from both periods (or, in the case of Tasso’s heroic poem, create during the time of the project). These control texts were studied with 29 different documents from each period, for a total of 66 documents (about 4.5 million words). We used passages of 2,000 words, since the control group, when tested alone in both early modern and modern formats, clustered entirely according to title at that length. (We are going to rerun the tests with Maciej Eder’s suggested parameters, based on the persuasive work in his article “Does size matter? Authorship attribution, small samples, big problem,” Digital Scholarship in the Humanities 30.2 (2015): 167-182.)
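To make the setup concrete, here is a minimal sketch of how a text might be cut into fixed-length passages and described by the relative frequencies of its most frequent words. It is written in Python for illustration and is not the code used in the project; the function names, the toy word list, and the number of MFWs are all assumptions, and tokenization is assumed to have happened already.

```python
from collections import Counter


def split_passages(words, size=2000):
    """Cut a list of word tokens into consecutive passages of `size` words.

    A final remainder shorter than `size` is dropped (an assumption made
    here for simplicity; other choices are possible).
    """
    return [words[i:i + size] for i in range(0, len(words) - size + 1, size)]


def mfw_profiles(passages, n_mfw=100):
    """Describe each passage by the relative frequency of the corpus MFWs.

    `n_mfw` is a hypothetical parameter: the post does not say how many
    most frequent words were used.
    """
    totals = Counter(word for passage in passages for word in passage)
    mfw = [word for word, _ in totals.most_common(n_mfw)]
    profiles = []
    for passage in passages:
        counts = Counter(passage)
        profiles.append([counts[word] / len(passage) for word in mfw])
    return mfw, profiles


# Toy data (not from the corpus): 24 tokens cut into passages of 8 words.
toy_words = "e la e il di e la che".split() * 3
toy_passages = split_passages(toy_words, size=8)
toy_mfw, toy_profiles = mfw_profiles(toy_passages, n_mfw=2)
```

Because the profile is built from the most frequent words of the whole corpus, the same passage can receive a different profile when the surrounding corpus changes, which is one reason the choice of edition matters.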

What we found was that the passages that broke away from their main text differed between the early modern corpus and the modern corpus. That is, for control text A, in its early modern edition in the early modern corpus the introductory passage might be paired with a pastoral, but for the modern edition of the same text in the modern corpus a middle section would be paired with a collection of letters and the introductory passage would be clustered with the rest of the passages from text A. These “interlopers,” as we called them, offered fascinating insight into the sounds of textual similarity across genres and periods. They provoked intensive close reading and comparison (results forthcoming). But that wasn’t everything.

Importantly, these interloping passages forced us to step back and look at the hclust function we had been using in order to better understand its applications and the assumptions being made about the data. In our explanation of hclust for a broader audience, we focused on the U.S. cities data loaded with the stats package in R. It became very clear that if you add or subtract a city from the data, the clusters will change; so too with the textual passages. That is, even the most impossibly large data set for early modern Italian texts will only provide one snapshot of information about passages’ similarity to one another. Change the texts and the visible similarities change. It seems so self-evident, but data visualization based on calculations in multiple dimensions suggests a fixity or permanence of a measurement, when in fact it is just one of many possible measurements. Galileo had offered an entertaining explanation of parallax in his Dialogue of Cecco de’ Ronchitti (1605), in which the speakers look at trees from different points of view in the Paduan countryside in order to demonstrate how visible features change depending on the backdrop against which an object is seen.
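The effect of adding one item can be sketched with a toy example. The code below is a small pure-Python stand-in for agglomerative clustering, not R’s hclust itself and not the project’s code. It uses complete linkage, which is hclust’s default method; whether the project kept that default is an assumption. The items and their positions are invented for illustration and are neither real cities nor real passages.

```python
from itertools import combinations


def complete_linkage_clusters(points, k):
    """Merge the two closest clusters repeatedly until k clusters remain.

    points: dict mapping an item name to a position on a line (toy data).
    Distance between two clusters is the largest distance between any
    pair of their members (complete linkage).
    """
    clusters = [frozenset([name]) for name in points]

    def dist(c1, c2):
        return max(abs(points[a] - points[b]) for a in c1 for b in c2)

    while len(clusters) > k:
        # Find the pair of clusters with the smallest complete-linkage distance.
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: dist(*pair))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return set(clusters)


# Hypothetical items on a line; "d" is the newcomer added to the data.
small = {"a": 0, "b": 4, "c": 7}
larger = dict(small, d=12)

groups_small = complete_linkage_clusters(small, k=2)
groups_larger = complete_linkage_clusters(larger, k=2)

print(sorted(sorted(g) for g in groups_small))   # [['a'], ['b', 'c']]
print(sorted(sorted(g) for g in groups_larger))  # [['a', 'b', 'c'], ['d']]
```

Nothing about "a", "b", or "c" changed between the two runs, yet "a" stands alone in the first grouping and joins "b" and "c" in the second. Only the backdrop moved.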

To put this in terms of early modern reading and Galileo’s prose, I’ll come back to the original question: What did Galileo’s prose sound like to his seventeenth-century readers? The answer is entirely dependent upon the individual reader’s background: a young man who had only read or seen a performance of Ariosto’s prose comedy The Coffer (1508) would likely note little resemblance to the preface of Galileo’s letter on comets (1623), while a young woman who knew the prologue of Ariosto’s versification of The Coffer (1529) could immediately hear echoes in Galileo’s laments about the unfair treatment of his work by critics. So I’m left wondering what it is that we are really asking of our computational tools as we assemble larger and larger corpora to analyze. What are we missing? What aren’t we seeing by only considering one perspective (one backdrop)?

There is a lot to untangle here, and this post is just meant as a placeholder for ongoing work…