Creating Digital Texts

Old printed text in Italian

Final lines of page 1 of Nicolò Antonio Stelliola’s Il Telescopio (1627). Scanned from the copy at the British Museum.

To build an intentional corpus, one that bypasses the histories of book collecting in 19th-century libraries and reflects known 17th-century book-collecting preferences, I have been working with students and contractors to create a reliable full-text corpus of the books in Galileo’s library. I gave priority to Galileo’s works, his opponents’ works, poetry (known influences and minor works), drama, and possibly influential texts across form and genre.

Even though Google Books runs its Optical Character Recognition (OCR) algorithm on its scans to support full-text search, the outcome is seldom immediately useful for text analysis of early modern Italian printed books. The same is true for Adobe, ABBYY, and many of the other available tools. While Google’s OCR has improved dramatically in the last several years, the above image and its transcriptions below show a few continuing weaknesses that have significant consequences for computational text analysis:

  • Spacing between words
  • Long s differentiation
  • Punctuation recognition
  • Handling catch words

Google’s OCR (August 3, 2022):

Alladetta principale intenzione d’iſpecillo, vengono alligate per confequen
za molte ſpeculazioni verfanti nel genoviſiuo, neceſſarie per l’affinità della »
materiai & Perche nclla intelligenża delle cofe, fi hàneceſsità ്liും

Correct Transcription:

Alla detta principale intenzione d’ispecillo, vengono alligate per conseguenza
molte speculazioni versanti nel geno visivo, necessarie per l’affinità della
materia; & perche nella intelligenza delle cose, si hà necessità dell’intelligenza
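
One way to put a number on the gap between the two transcriptions is a character error rate: the edit distance between the OCR output and the corrected text, divided by the length of the corrected text. The following is a minimal sketch in Python, not the project's own tooling. The function names are hypothetical, and the sample strings are the first lines of the two transcriptions above.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def character_error_rate(ocr: str, reference: str) -> float:
    """Edit distance divided by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(ocr, reference) / len(reference)


# First line of each transcription shown above.
ocr_line = "Alladetta principale intenzione d’iſpecillo, vengono alligate per confequen"
reference_line = "Alla detta principale intenzione d’ispecillo, vengono alligate per conseguenza"

print(f"Character error rate: {character_error_rate(ocr_line, reference_line):.1%}")
```

Because the corrected transcription renders the long s as a plain s, this measure counts every surviving ſ in the OCR output as an error. Whether that is desirable depends on the transcription policy being evaluated.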

Large Language Models (LLMs) are a potential assistant in this work. Workflow development in Summer 2025 with modern Italian printed text showed substantial improvements over previous attempts to automate OCR. That work is ongoing. For the time being, we will continue to manually correct this OCR to establish reliable full-text versions of the books.

Completed files are currently shared via the TAPAS project: http://www.tapasproject.org/tapas-commons/galileos-library-tei-editions

The guiding principle for creating the texts has been to preserve the sequence of words. Words broken across lines are represented without the hyphen. Printers’ catch words at the bottom of a page, typed editorial marginalia, headers, and illustrations have been omitted for now. I relish these details, but I made this temporary decision to accommodate the varying skills of the team members and to prioritize a proof of concept. Paratexts and page numbers have been retained, as have punctuation, accents, and capitalization. Paragraphs have been retained for prose, and line breaks for poetry and drama. The interchangeability of the letters u and v in early modern printing is handled by code that prepares the text for analysis.
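
As an illustration of the last point, one common way to neutralize u/v variation before counting or matching words is to collapse both letters to a single form. The sketch below is an assumption about how such a step could look, not the project's actual preparation code. The function names are hypothetical, and the choice to map v to u (rather than the reverse) is arbitrary.

```python
import re

# Map every v to u so that variant spellings such as "vniuerso"
# and "universo" reduce to the same string.
UV_TABLE = str.maketrans({"v": "u", "V": "U"})


def collapse_u_v(text: str) -> str:
    """Replace v/V with u/U, leaving all other characters untouched."""
    return text.translate(UV_TABLE)


def tokens_for_analysis(text: str) -> list[str]:
    """Lowercase the text, collapse u/v, and return runs of letters.

    The pattern [^\\W\\d_]+ matches sequences of Unicode letters, so
    accented forms such as "hà" stay intact while punctuation and
    apostrophes act as separators.
    """
    return re.findall(r"[^\W\d_]+", collapse_u_v(text).lower())


print(tokens_for_analysis("molte speculazioni versanti nel geno visivo"))
```

The collapsed forms are meant only for analysis. The transcriptions themselves keep the printed spelling, so the original u/v usage remains available for anyone studying orthography.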

A separate subcorpus has been created for just the prefatory letters of these books.

Current Research Questions

  • Did Galileo Galilei’s prose sound archaic, innovative, poetic, or dramatic to contemporary readers?
  • To what extent were Galileo and his opponents implicitly citing other sources?
  • What impact does faulty OCR in Google Books have on search and discoverability in Italian texts of the period?

See Credits for the full list of team members.