{"id":379,"date":"2025-08-27T09:57:33","date_gmt":"2025-08-27T13:57:33","guid":{"rendered":"https:\/\/research.bowdoin.edu\/galileos-library\/?page_id=379"},"modified":"2025-10-22T15:57:49","modified_gmt":"2025-10-22T19:57:49","slug":"creating-digital-texts","status":"publish","type":"page","link":"https:\/\/research.bowdoin.edu\/galileos-library\/data-sets\/full-text-corpora\/creating-digital-texts\/","title":{"rendered":"Creating Digital Texts"},"content":{"rendered":"<div id=\"attachment_381\" style=\"width: 946px\" class=\"wp-caption alignleft\"><a href=\"https:\/\/research.bowdoin.edu\/galileos-library\/files\/2025\/08\/StelliolaExample.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-381\" class=\"size-full wp-image-381\" src=\"https:\/\/research.bowdoin.edu\/galileos-library\/files\/2025\/08\/StelliolaExample.png\" alt=\"Old printed text in Italian\" width=\"936\" height=\"240\" srcset=\"https:\/\/research.bowdoin.edu\/galileos-library\/files\/2025\/08\/StelliolaExample.png 936w, https:\/\/research.bowdoin.edu\/galileos-library\/files\/2025\/08\/StelliolaExample-300x77.png 300w, https:\/\/research.bowdoin.edu\/galileos-library\/files\/2025\/08\/StelliolaExample-768x197.png 768w, https:\/\/research.bowdoin.edu\/galileos-library\/files\/2025\/08\/StelliolaExample-624x160.png 624w\" sizes=\"auto, (max-width: 936px) 100vw, 936px\" \/><\/a><p id=\"caption-attachment-381\" class=\"wp-caption-text\">Final lines of page 1 of Nicol\u00f2 Antonio Stelliola\u2019s Il Telescopio (1627). Scanned from the copy at the British Museum.<\/p><\/div>\n<p>In order to build an intentional corpus, one that bypasses the histories of collecting books in 19th-century libraries and one that reflects known 17th-century book collecting preferences, I have been working with students and contractors to create a reliable full-text corpus of the books in Galileo\u2019s library. 
I gave priority to Galileo\u2019s works, his opponents\u2019 works, poetry (known influences and minor works), drama, and possible influential texts across form and genre.<\/p>\n<p>Even though Google Books deploys its Optical Character Recognition (OCR) algorithm on its scans for full text search, the outcome is seldom immediately useful for text analysis of early modern Italian printed books. The same is true for Adobe, Abbyy, and many of the other available tools. While Google\u2019s OCR has improved dramatically in the last several years, the above image and its transcriptions below show a few continuing weaknesses that have significant consequences for computational text analysis:<\/p>\n<ul>\n<li>Spacing between words<\/li>\n<li>Long s differentiation<\/li>\n<li>Punctuation recognition<\/li>\n<li>Handling catch words<\/li>\n<\/ul>\n<p><strong>Google\u2019s OCR (August 3, 2022):<\/strong><\/p>\n<blockquote><p>Alladetta principale intenzione d\u2019i\u017fpecillo, vengono alligate per confequen<br \/>\nza molte \u017fpeculazioni verfanti nel genovi\u017fiuo, nece\u017f\u017farie per l\u2019affinit\u00e0 della \u00bb<br \/>\nmateriai &amp; Perche nclla intelligen\u017ca delle cofe, fi h\u00e0nece\u017fsit\u00e0 \u0d4dli\u0d41\u0d02<\/p><\/blockquote>\n<p><strong>Correct Transcription:<\/strong><\/p>\n<blockquote><p><em>Alla detta principale intenzione d\u2019ispecillo, vengono alligate per conseguenza<br \/>\nmolte speculazioni versanti nel geno visivo, necessarie per l\u2019affinit\u00e0 della<br \/>\nmateria; &amp; perche nella intelligenza delle cose, si h\u00e0 necessit\u00e0 dell\u2019intelligenza<br \/>\n<\/em><\/p><\/blockquote>\n<p>Large Language Models (LLMs) are a potential assistant in this work. Research workflow development in Summer 2025 with modern Italian printed text showed substantial improvements over previous attempts to automate OCR. That work is ongoing. 
For the time being, we will continue to correct the OCR output manually to establish reliable full-text versions of the books.<\/p>\n<p>Completed files are currently shared via the TAPAS project: <a href=\"http:\/\/www.tapasproject.org\/tapas-commons\/galileos-library-tei-editions\">http:\/\/www.tapasproject.org\/tapas-commons\/galileos-library-tei-editions<\/a><\/p>\n<p>The guiding principle in creating the texts has been to preserve the sequence of words. Words broken across lines are represented without the hyphen. Printers\u2019 catch words at the bottom of a page, typed editorial marginalia, headers, and illustrations have been omitted. This omission is temporary. As someone who relishes these details, I made this decision to accommodate the varying skills of the team members and to prioritize a proof of concept. Paratexts and page numbers have been retained. Punctuation, accents, and capitalization have been maintained. Paragraphs have been retained for prose, and line breaks for poetry and drama. 
The interchangeability of the letters\u00a0<em>u<\/em> and <em>v<\/em>\u00a0in early modern printing is handled by the code that prepares the text for analysis.<\/p>\n<p>A separate subcorpus has been created for just the prefatory letters of these books.<\/p>\n<p><strong>Current Research Questions<\/strong><\/p>\n<ul>\n<li>Did Galileo Galilei\u2019s prose sound archaic, innovative, poetic, or dramatic to contemporary readers?<\/li>\n<li>To what extent were Galileo and his opponents implicitly citing other sources?<\/li>\n<li>What impact does faulty OCR in Google Books have on search and discoverability for Italian texts of the period?<\/li>\n<\/ul>\n<p>See <a href=\"https:\/\/research.bowdoin.edu\/galileos-library\/credits\/\">Credits<\/a> for the full list of team members.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In order to build an intentional corpus, one that bypasses the histories of collecting books in 19th-century libraries and one that reflects known 17th-century book collecting preferences, I have been working with students and contractors to create a reliable full-text corpus of the books in Galileo\u2019s library. 
I gave priority to Galileo\u2019s works, his opponents\u2019 [&hellip;]<\/p>\n","protected":false},"author":41,"featured_media":0,"parent":373,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-379","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/research.bowdoin.edu\/galileos-library\/wp-json\/wp\/v2\/pages\/379","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/research.bowdoin.edu\/galileos-library\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/research.bowdoin.edu\/galileos-library\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/research.bowdoin.edu\/galileos-library\/wp-json\/wp\/v2\/users\/41"}],"replies":[{"embeddable":true,"href":"https:\/\/research.bowdoin.edu\/galileos-library\/wp-json\/wp\/v2\/comments?post=379"}],"version-history":[{"count":0,"href":"https:\/\/research.bowdoin.edu\/galileos-library\/wp-json\/wp\/v2\/pages\/379\/revisions"}],"up":[{"embeddable":true,"href":"https:\/\/research.bowdoin.edu\/galileos-library\/wp-json\/wp\/v2\/pages\/373"}],"wp:attachment":[{"href":"https:\/\/research.bowdoin.edu\/galileos-library\/wp-json\/wp\/v2\/media?parent=379"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}