Data Ignota: A Data-Ship

I have read Ada Palmer’s Terra Ignota series a few times¹ since it came out - alone or with friends - because on every reread I discover new ideas, new calls forward or back within the text or outside of it, and new structural complexities.

I am, to be frank, obsessed with it, and I have the years-long chat logs, the dedicated Zettelkasten vault, and the nigh-unreadable ebooks bursting with rainbow-hued highlights to prove it.

A Ship Worthy of Apollo’s Call

This project is an attempt to upgrade from those first hollowed logs to a ship worthy of the perils and mysteries we gentle readers face when we heed Apollo’s call to push away from shore and sail toward terra ignota.

The Analyses

I will post demos and examples of the analyses that can be performed on the data generated from this digital edition of the novels. As these accumulate, they should progressively build a map of the feasible… more of a spur to your own explorations than anything definitive or comprehensive.

The Dev Log

I will also keep a dev log along this voyage. This will provide a more sequential view of what we found when, any shoals we hit along the way, and any issues with the crew’s morale or theology which might crop up.

The Dashboard

To help with navigation, I will set up a dashboard page to display the progress made and list any failed sanity checks & smoke tests with the latest version of the data.

The Data

This website will have links to data files generated for and used in the analyses. Feel free to download and re-use for non-commercial use with attribution.

What Is This Ship Made of?

TEI Encoding

I am encoding the novels in the TEI² dialect of XML, as this is the de facto standard for text analysis in the Digital Humanities³.

This is truly a labor of love⁴: going over the text, line by line, and marking it up with various xml elements. This is also the step where most errors will be introduced by yours truly, altho some of those errors can be detected by sanity check queries.

Xquery

It’s worth it, tho, because then I can use Xquery⁵ to ask questions like:

“How many different ways does Mycroft refer to J.E.D.D. Mason while speaking to Bridger?”
“¿Who talks about the Masons in Humanist Spanish? ¿On which pages?”
“How many different Utopian coats are there in the text? How are they described?”

R

To analyze this data further and display the results, I use the R language⁶. In particular I am applying the Tidy Data paradigm⁷ as best I understand it. This is a new language for me and this serves as practical hands-on learning for me.

TEI Publisher

Lastly, I can load this Digital Edition into TEI Publisher⁸ an application running on top of the XML database eXist-db⁹ which will display the text with various enhanced features based on the markup, such as highlighting names and displaying popups with their metadata on hover.

For interest, here is a copy of the ODD configuration file for this digital edition.

Copyright Concerns

The novels are under copyright, so there are parts of this project I can share and others I cannot.

The project therefore has two repositories on Github:

a private repo to version control and back-up the marked-up text
a public repo for everything else, including this website

CSV files containing tidy data about the text.
- To allow others to reuse the data for their own analysis.
- These will NOT contain stretches of text from the books.
Xquery files which extract data from the text.
- To show my work and to help anyone who’s trying to do something similar.
- These will pull from the private repo and put their output xml files there as well.
R markdown files which generate example analysis reports from the data.
- To show my work and to demo how the data can be used.
- These will pull the output xml files from the private repo and put their output html and CSV files in the public repo, i.e. accessible on this website.

The text of the novels, in any format.
Data files containing stretches of text from the novels, in any format.

This project counts as my eighth reread of the first book in the series, Too Like the Lighting.↩︎
Text Encoding Initiative: https://tei-c.org/↩︎
an area of scholarly activity at the intersection of computing and the disciplines of the humanities. https://en.wikipedia.org/wiki/Digital_humanities ↩︎
It took a month of sustained effort to mark up the first twelve chapters of Too Like the Lightning.↩︎
a query and functional programming language that queries and transforms collections of structured and unstructured data, usually in the form of XML. https://en.wikipedia.org/wiki/XQuery ↩︎
a language and environment for statistical computing and graphics. https://www.r-project.org/↩︎
Tidy datasets are easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table. https://vita.had.co.nz/papers/tidy-data.html ↩︎
http://teipublisher.com/↩︎
http://exist-db.org/↩︎