
Digital and Computational Studies Blog

Bowdoin College - Brunswick, Maine


Digital Reconstructions of Libraries

December 1, 2013 By Crystal Hall

Libraries are very much on my mind these days as I grapple with the best methodologies for reconstructing and visualizing Galileo’s library. I am also working constantly with digital collections: institutional libraries, archives of organizations, and single studies of authors. Perhaps it is no surprise, then, that when first asked to suggest possible readings for the section of the Gateway to Digital Humanities course that focuses on textual analysis, I immediately recommended Jorge Luis Borges’s “Library of Babel.”

To me this short essay represents many of the possibilities and pitfalls of digital and computational library studies. Borges imagines a library that holds one copy of every book that could possibly be written. Some contain gibberish, others perfect copies of known works. Scholars live in the library, searching for answers to questions about human experience. Ideological camps form and battles ensue, but all the while, even this hyperbolically complete library remains enigmatic to its users due to its sheer size. In parallel ways, computers have the potential to create a similar digital library. Natural language processing has already shown that computers can generate prose that has the “sound” of known authors like Immanuel Kant. Programming loops (of the kind the Gateway to Digital Humanities students are applying to images) perform the same action repeatedly (changing one pixel at a time, for example) and could conceptually be employed to produce the infinite variety of texts that populate “The Library of Babel.”

For readers of the Python programming language, I tried to express this impossible program in loop form in Jython. Strings and concatenation would help, but I think this still conveys the message in a light-hearted form:

Screenshot (Crystal Hall, 2013) of JES Jython platform.
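
A rough stand-in in plain Python (an illustrative sketch of the same idea, not the Jython program in the screenshot) might read:

import itertools

# Borges describes books of 410 pages, with 40 lines per page and roughly 80
# characters per line; the alphabet here is simplified to lowercase letters
# plus space, comma, and period.
ALPHABET = "abcdefghijklmnopqrstuvwxyz ,."
BOOK_LENGTH = 410 * 40 * 80

def library_of_babel():
    # itertools.product enumerates every possible sequence of BOOK_LENGTH
    # symbols, one "book" at a time: finite in form, unending in practice.
    for letters in itertools.product(ALPHABET, repeat=BOOK_LENGTH):
        yield "".join(letters)

for count, book in enumerate(library_of_babel()):
    print(book[:80])   # print only the first line of each book
    if count >= 2:     # stop after a few; the full library is out of reach
        break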

My attempt at code in the screenshot (which has legal Jython syntax but is an error-filled program) is a futile approach to bringing order to chaos. Some Digital Humanities (DH) scholars would argue that digital and computational studies could offer partial solutions to comprehending and organizing this vast quantity of textual information. This is quite optimistic, given that estimates suggest 340 million new 140-character tweets are posted to Twitter daily, not to mention the 3.77 billion (and growing) indexed pages on the World Wide Web.

Even working with the available (and manageable) digital data, the tools make certain assumptions and certain information is lost in their application, all of which gives me pause as I reconstruct, and try to find analytical pathways through, the library of a person about whom ideological fields have been defined and passionate battles have been fought for centuries. Matt Jockers has led the field of DH with his work on macroanalysis, currently focused on establishing patterns in nineteenth-century fiction, but his analysis relies only on the books for which a digital copy has been made. The Google Books Ngram Viewer allows users to compare the frequencies of words that appear in digital or digitized books during different time periods, but it assumes consistency of cataloguing and metadata entry across all participating institutions, which is not always the case.

Screenshot (Crystal Hall, 2013) of the Google Books Ngram Viewer.

As I revisit the data for my own project on Galileo, I wonder where I will enter the ideological disputes that surround these fields; I worry about what information will be excluded from the data and about how my users will navigate the digital library I am about to create.

Excel Data and Gephi Data Laboratory

November 15, 2013 By Crystal Hall

My goal for this blog entry is to explain how to organize data in an Excel spreadsheet (which will be saved as a Comma-Separated Values file, or .csv) for import into Gephi, where nodes (individual elements represented as points) and edges (relationships represented by connecting lines) in a network can be visualized and analyzed. My explanation assumes familiarity with the Gephi tutorials based on prepared .gexf files (the extension for files readable by Gephi) of Les Misérables or Facebook data. I assume that my reader is now thinking about applying network analysis to her own research.

New users of Gephi may not have any familiarity with .gexf files, XML mark-up, or other code for organizing data, but they can still make good use of Gephi. Excel is typically a more user-friendly application for this kind of organization, and most databases (Microsoft Access, for example) can be converted to an Excel workbook (.xls) or directly to a .csv file. The explanations assume a basic understanding of storing, copying, and sorting data in Excel. The organizational principles described below can be applied to whichever application you use to generate the tabular .csv files that you will use in Gephi. Other supported formats and their functionality can be found at Gephi’s site.

I am using screenshots from my own research data on the books in Galileo Galilei’s library to help demonstrate the kinds of information each column should contain. Below is a screen shot of one spreadsheet in the Excel workbook that I have used to organize all of my notes related to the project:

Screenshot (Crystal Hall, 2013) of the Excel workbook for the Galileo library project.

There are many worksheets listed in the tab bar at the bottom of the screen for the different kinds of information I have for the project. Importantly, a .csv file only retains the information in the active worksheet (“By author” in this case, the tab shown in white) and will not save the other sheets. It is therefore important to copy the information you want to use from your primary workbook (multiple sheets) into a single-sheet workbook for nodes and a single-sheet workbook for edges. Also, the column headings in my workbook (“My#”, “Fav’s#”, “Author. Favaro’s full citation”, “Year”, etc.) are my own shorthand and cannot be interpreted by Gephi, another reason that copying the information you want to use into new single-sheet workbook files is highly recommended.

1)   You will need to create two .csv files: a node table and an edge table. I use Excel as my tabular application, and Excel files save by default to the .xlsx format. In order to get the .csv, you need to choose that option for file format when saving.
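
If you prefer to script this step, a short Python sketch with the pandas library can export a single worksheet to .csv (this assumes pandas and openpyxl are installed; the file and sheet names here are hypothetical):

import pandas as pd

# Read one worksheet from the multi-sheet workbook and write it out as .csv.
nodes = pd.read_excel("galileo_library.xlsx", sheet_name="Nodes")
nodes.to_csv("node_table.csv", index=False)   # index=False drops the row numbers

edges = pd.read_excel("galileo_library.xlsx", sheet_name="Edges")
edges.to_csv("edge_table.csv", index=False)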

2)   The node table tells Gephi all of the possible nodes in a network and must have at least the columns Id and Label. There should be one line for every node that will appear in either column of the edge table:

Screenshot (Crystal Hall, 2013) of the node table spreadsheet.

This seems easy enough, but what kinds of information are best placed in the Id column, and how should that differ from the Label? The example above is taken from a spreadsheet that I use to organize information about Galileo’s library. All of my nodes in this example are the proper nouns that are found in titles in the library and the titles themselves (about 2650 nodes total). The example above is, in a word, clunky. It is redundant and ultimately makes my network visualization unreadable if I try to add labels over the nodes. Consider the following example in which full titles would become labels over roughly 650 nodes (obscuring nodes and edges in the process):

Screenshot (Crystal Hall, 2013) of a network visualization with full titles as node labels.

Having a unique identifying number (the Id that Gephi expects) allows me to store a lot of information about that node in a spreadsheet or database that I can later choose to access as necessary. Since my organizational system was created long before I knew about Gephi, my Label column corresponds to the Full Title column in my spreadsheet (which ultimately clutters my visualization to the point of illegibility if I add labels). To make this more readable, I need to change the data in the Label column to the data from a “Short Title” column.
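
A minimal sketch of that substitution in Python with pandas, assuming the worksheet has “Id”, “Full Title”, and “Short Title” columns (the file and sheet names are hypothetical):

import pandas as pd

# Swap the cluttered full titles for short titles before exporting the node table.
nodes = pd.read_excel("galileo_library.xlsx", sheet_name="Nodes")
nodes["Label"] = nodes["Short Title"]
nodes[["Id", "Label"]].to_csv("node_table.csv", index=False)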

3)   As you might notice, there are other columns in the first screen shot for the node table. The node table can also include attributes (in parentheses in the example because they are not necessary for a basic visualization of a network). Attributes are a way to categorize data, perhaps by gender, race, age, etc. While not necessary for exploring data with Gephi, they allow for a more nuanced exploration of a network. For example, I will want to add attribute columns for religious affiliation (Jesuit, Benedictine, Protestant, Catholic, etc.) and genre to start visualizing the data in a way that helps me answer my research questions. Attribute columns can also be added in the “Data Laboratory” section of the Gephi interface even after you have loaded the .csv files for the nodes and edges.
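
In .csv form, a node table with attribute columns might begin like this (the rows below are hypothetical and only illustrate the layout):

Id,Label,Religious Affiliation,Genre
4,Short title A,Jesuit,Astronomy
5,Short title B,Protestant,Natural philosophy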

4)   The time interval is another optional column of information to include about your data, which may or may not be applicable or useful. I copy here a partial screenshot from the Gephi.org page as a reference:

Partial screenshot from the Gephi.org documentation.

The Gephi wiki also displays the code behind this process.

Thinking about my own dataset, I need a Time Interval column for every title that shows the earliest year that a book could have entered the library. I will stop my time intervals with Galileo’s death in 1642. From the examples in part 3, the time interval information would look like this in the .csv version of the spreadsheet, with the columns Id, Time Start, Time End:

4,1640,1642
5,1628,1642
6,1637,1642

Once you have imported the .csv into the Data Laboratory, you can merge the Time Start and Time End columns using the merge strategy “Create Time Interval.” This will concatenate and format the values you need in order to view the change in the network over time.

5)   The edge table (the second .csv file that you need to create) then tells Gephi the connections that exist between the nodes. It must have the columns Source and Target:

Screenshot (Crystal Hall, 2013) of the edge table spreadsheet.

This is where having a unique identifier for every node can be very convenient. My source above is the title to which I have given the identifier 299, in which the Cologne Academy is mentioned as a contributor. Book titles can include people or places (Targets), but people or places cannot include titles (Sources), so my edges are directed, and the distinction between source nodes and target nodes is critical.
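
In .csv form, that row of the edge table would look something like this (the Target Id standing in for the Cologne Academy node, 1501, is hypothetical):

Source,Target
299,1501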

6)   As with the node table, there are many optional columns that can add nuance to an analysis of a network. The edge table can also include a Label column to help with categorization of relationship types, a unique Id for the relationship (generated by Gephi), Attributes (e.g., family, friend, co-worker, classmate for social networks), and Time Interval.

7)   The edge table can also include information not found in the node table. Type indicates whether the relationship is directed or undirected. This column can be auto-filled on upload and is visible in the Data Laboratory.

8)   Another option for the edge table is to weight the relationships. Weight is your opportunity to give more importance to certain relationships by assigning them a numerical value.
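
Putting steps 5 through 8 together, an edge table with the optional Type and Weight columns might begin like this (all values below are hypothetical):

Source,Target,Type,Weight
299,1501,Directed,1
299,1502,Directed,2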

Remember to save the files as .csv, then load them in Gephi, nodes first, using the “Import .csv” option in the Data Laboratory toolbar.  Be sure to indicate which type of file you are uploading (node table or edge table), otherwise you risk error messages.

Data can simply be input directly into the Data Laboratory of Gephi, but I am most familiar with the functionality of Excel, have organized my research data using spreadsheets, and prefer to make adjustments, filter data, and store my information in one format. Programming languages such as R seem particularly adept at creating the tabular information needed here, especially when automatically pulling data from a large corpus.
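
As one illustration of that approach, here is a minimal Python sketch (the Ids, labels, and file names are hypothetical) that writes a node table and an edge table with the csv module:

import csv

# Hypothetical data: (Id, Label) pairs for nodes; (Source, Target) pairs of Ids for edges.
nodes = [
    (299, "Short title of book 299"),
    (1501, "Cologne Academy"),
    (1502, "Another proper noun"),
]
edges = [
    (299, 1501),   # the book mentions the Cologne Academy
    (299, 1502),
]

with open("node_table.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Id", "Label"])
    writer.writerows(nodes)

with open("edge_table.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Source", "Target"])
    writer.writerows(edges)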

My approach may not work for everyone or every project, but hopefully seeing real data in a raw format provides context for its presentation in the Data Laboratory:

Screenshot (Crystal Hall, 2013) of the imported data in the Gephi Data Laboratory.

In turn, that should make the analysis of something as complex as the visualization of the connections between names in Galileo’s library less opaque:

Screenshot (Crystal Hall, 2013) of the network visualization of names in Galileo’s library.

“Terms and Conditions May Apply” Storify of Screening and Discussion

November 15, 2013 By jgieseki

Following the screening of “Terms and Conditions May Apply,” Profs. Elias & Gieseking (Government, Digital and Computational Studies) of Bowdoin, and USM Prof. Clearwater (Law) held a brief discussion of the film and issues of privacy, transparency, and participation on the Internet. The Storify of the highlights of their discussion is below.

http://storify.com/jgieseking/terms-and-conditions-may-apply-screening-and-disc

Emese Gaal on INTD 2401: Gateway to the Digital Humanities

November 13, 2013 By Crystal Hall

Jack Gieseking and Crystal Hall recently spoke with Emi Gaal ’15 about her experience in the new “Gateway to the Digital Humanities” course.

Why are you taking the Gateway course?

As someone more involved in the humanities and social sciences, this class seemed like a nice first segue into the more technical realm of computer science while still focusing on broad objectives in both social and technical sciences. Additionally, I have always wanted to take both a computer science class and an art history class at Bowdoin, as both are very interesting to me, and the interdisciplinary nature of this class provides a great introduction to both.


What has surprised you in the seminar so far?

I am most surprised by how useful it is to understand the methodology behind programming within programs such as GIS, as it allows the user to be more intentional about the commands he or she aims to carry out. Being more in the “know” about how the whole digital sphere operates is empowering and I believe it will help me better understand the possibilities and limitations of executing projects.

Do you have any early ideas about your final project?

I don’t have an idea just yet, but I think using GIS, as it is a program with which I am already proficient, would prove to be a great tool to incorporate. Also, since we have only covered one other topic, image analysis, aside from spatial analysis, I feel like I should wait until I have a better idea of what the other topics and tools are in which I could dabble before I solidify an idea.


Terms and Conditions May Apply: Screening and Discussion on November 13th

November 7, 2013 By jgieseki

November 13, 2013, 4:30 PM – 7:00 PM
Visual Arts Center, Kresge Auditorium

Panel discussion “Privacy and Security, Transparency and the Internet” following screening of “Terms and Conditions May Apply”

Have you ever read the Terms and Conditions and Privacy Policies connected to every website you visit, phone call you make, or app you use? Of course you haven’t. But those agreements allow corporations to do things with your personal information you could never even imagine. What are you really agreeing to when you click “I accept”?

Following the screening of “Terms and Conditions May Apply,” join Prof. Elias (Government) and Prof. Gieseking (Digital and Computational Studies) of Bowdoin, and USM Prof. Clearwater (Law), previously of Harvard’s Berkman Center for Internet and Society, for a discussion of the film and issues of privacy, security, transparency, and the Internet.

