Data Ignota: Characters

Mathieu Glachant

Origins of This Data

This data is generated by extracting all TEI <person> descendant nodes of a <listPerson type="characters"> node in the <standOff> node of the Digital Edition of Terra Ignota.

What Are `<person>` Nodes For?

These <person> nodes contain information about characters in the series. This is metadata added by the editor¹, like everything else in <standOff>.

The <listPerson type="characters"> node lists persons who are present in the text, “on stage” as it were. This list has sub-lists for convenience in editing, e.g. the Saneer-Weeksbooth ’bash or Danaë’s brood, hence the need to include all descendant <person> nodes, not just children.

Note that the metadata contains other lists of persons, e.g. fictional people added as part of world-building or actual historical people referenced in the text. These lists are siblings of the list of characters within <standOff>.

Example Nodes

Here’s a simplified view of what these nodes look like:

<standOff>
  <listPerson type="characters">
    <person xml:id="Mycroft">
      <persName type="emoji">✍</persName>
      <persName type="short">Mycroft</persName>
      <persName type="primary">
        <forename>Mycroft</forename>
        <surname>Canner</surname>
      </persName>
    </person>
    ...
  </listPerson>
  ...
</standOff>

The data dictionary below maps the information above to a column in the data file.

Get the Data

The data extracted from these <person> nodes is available as a CSV file.

Download Link

Download the data

Last Updated

This file was last updated on 2022-11-03.

Raw Data Generation

The raw data is first extracted from the <person> nodes using an XQuery script.

XQuery Script

For easy ingestion with the XML package in R, the script’s output has a <records> root node and one <character> node per character in the original text.

xquery version "3.1";

declare namespace tei = "http://www.tei-c.org/ns/1.0";
declare variable $doc := doc("PATH_TO_TEI_FILE");

<records>
  {
  for $character in $doc//tei:listPerson[@type = "characters"]//tei:person
  return 
  <character>
    ...
    A node per column in the output file, see below for details
    ...
  </character>
  }
</records>

XML Output File

The XQuery script produces an XML file of the form:

<?xml version="1.0" encoding="UTF-8"?>
<records>
   <character>
      <id>#Mycroft</id>
      <emoji>✍</emoji>
      <name>Mycroft</name>
      <fullName>Mycroft Canner</fullName>
      <sameAs>NA</sameAs>
   </character>
   ...
</records>

Data Prep

This XML output file must then be cleaned before being saved to the CSV file provided above.

Clean: Fix NA Values

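First, missing values are set to NA. This is straightforward with the XML and tidyverse packages.

library(XML)        # xmlToDataFrame()
library(tidyverse)  # %>%, mutate(), na_if()

# xml_path: path to the XML output file produced by the XQuery script
characters <- xmlToDataFrame(xml_path) %>%
  mutate(
    # The script writes the literal string "NA" for missing values;
    # convert it to a real NA in every column
    id = na_if(id, "NA"),
    emoji = na_if(emoji, "NA"),
    name = na_if(name, "NA"),
    fullName = na_if(fullName, "NA"),
    sameAs = na_if(sameAs, "NA")
  )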

NB: Luckily for you, when you read in this data as a CSV file, the readr package is smart enough to guess all of this correctly.
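To illustrate the note above, here is a minimal sketch of reading data of this shape with readr. The inline text stands in for the downloaded file, and the #Example row is invented for illustration. It also assumes the published CSV marks missing values with the string NA, as the XML output does.

```r
library(readr)

# Inline stand-in for the downloaded CSV; the #Example row is hypothetical
csv_text <- paste(
  "id,emoji,name,fullName,sameAs",
  "#Mycroft,✍,Mycroft,Mycroft Canner,NA",
  "#Example,NA,Example,Example Person,#Mycroft",
  sep = "\n"
)

# I() tells read_csv() to treat the string as literal data, not as a path
characters_csv <- read_csv(I(csv_text), show_col_types = FALSE)

# By default read_csv() treats "" and "NA" as missing (na = c("", "NA"))
is.na(characters_csv$emoji)
is.na(characters_csv$sameAs)
```

One caveat: a column that happens to be entirely NA is guessed as logical rather than character, so passing col_types explicitly may be safer in scripts.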

Tidy: Separate Delimited Lists

Second, any rows containing delimited lists² must be³ separated into multiple rows.

In the case of our characters, this would only happen if a <person> node had more than one sameAs alternate identity. There are currently 2 characters with alternate identities, but none of them has more than one.

characters <- characters %>%
  # Break space-delimited columns across multiple rows
  separate_rows(sameAs, sep = " ")
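Since no character currently has more than one alternate identity, the effect of this step is not visible in the real data. The following sketch uses invented rows (the ids #A, #X and #Y are hypothetical) to show what separate_rows() would do with a space-delimited sameAs value.

```r
library(tibble)
library(tidyr)

# Hypothetical data: one character pointing at two alternate identities
toy <- tibble(
  id     = c("#A", "#B"),
  name   = c("Alpha", "Beta"),
  sameAs = c("#X #Y", "#A")
)

# Each space-delimited value in sameAs gets its own row;
# the other columns are repeated as needed
toy_tidy <- separate_rows(toy, sameAs, sep = " ")

toy_tidy$sameAs[toy_tidy$id == "#A"]
```

After this step each row pairs one character with at most one alternate identity, which is the shape the data dictionary describes.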

Editing Progress

There are currently 82 characters in the file, including 12 generic ones like ‘unknown Junior Scientist’ or ‘unknown Servicer’.

Names

Names are required and there are currently 0 characters without one.

Emojis

Emojis are optional and there are currently 31 characters with one.
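Counts like the ones above can be recomputed from the data file. This is a sketch only, using an invented three-row data frame in place of the real one; it is not necessarily how the figures on this page were produced.

```r
# Hypothetical stand-in for the characters data frame
toy <- data.frame(
  id    = c("#A", "#B", "#C"),
  emoji = c("✍", NA, NA),
  name  = c("Alpha", "Beta", NA)
)

n_characters   <- nrow(toy)               # total number of rows
n_without_name <- sum(is.na(toy$name))    # names are required, so ideally 0
n_with_emoji   <- sum(!is.na(toy$emoji))  # emojis are optional

c(n_characters, n_without_name, n_with_emoji)
```

Note that nrow() counts rows, so once alternate identities are involved it counts identities rather than distinct persons.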

Data Dictionary

Each column in the data file is listed below, with what it means and how it was generated.

`id`

A unique identifier used within the Digital Edition.

Required, unique, must conform to xml:id requirements, e.g. can’t start with a number.

Derived from the xml:id attribute of the <person> node itself, with a # prepended to match the ref attribute syntax used when pointing to this character, e.g. in a line of dialog.

<id>
  #{data($character/@xml:id)}
</id>
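The constraints on id can be checked in a few lines of base R. This is a sketch under assumptions: the ids #A and #B are invented, and the final pattern is a simplified ASCII-only approximation of the xml:id naming rule, not the full XML definition.

```r
# "#Mycroft" comes from the example above; "#A" and "#B" are hypothetical
ids <- c("#Mycroft", "#A", "#B")

stopifnot(
  !anyNA(ids),                # required
  anyDuplicated(ids) == 0,    # unique
  all(startsWith(ids, "#"))   # carries the ref-style prefix
)

# Strip the leading # to recover the original xml:id values
xml_ids <- sub("^#", "", ids)

# Simplified check: an xml:id cannot start with a digit
starts_ok <- grepl("^[A-Za-z_]", xml_ids)
starts_ok
```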

`emoji`

An emoji evocative of the character, used for visualizations where space is at a premium.

Optional, unique when it exists.

Derived from the <persName type="emoji"> child node of the <person> node itself.

<emoji>
  {
  if ($character/tei:persName[@type = "emoji"]) 
  then normalize-space(data($character/tei:persName[@type = "emoji"]))
  else "NA"
  }
</emoji>

`name`

A short name for the character, attested in the text.

Alphanumeric, required. NA means the character has neither short nor full name in the metadata, an omission by the editor.

Derived from the <persName type="short"> child node of the <person> node itself, or if it does not exist, the full name (see below).

<name>
  {
  if ($character/tei:persName[@type = "short"]) 
  then normalize-space(data($character/tei:persName[@type = "short"]))
  else if ($character/tei:persName[@type = "primary"]) 
  then normalize-space(data($character/tei:persName[@type = "primary"]))
  else "NA"
  }
</name>

`fullName`

A full name for the character, attested in the book.

Alphanumeric, required. NA means the character has no full name in the metadata, an omission by the editor.

Derived from the <persName type="primary"> child node of the <person> node itself.

<fullName>
  {
  if ($character/tei:persName[@type = "primary"]) 
  then normalize-space(data($character/tei:persName[@type = "primary"]))
  else "NA"
  }
</fullName>

`sameAs`

The unique identifier of another character that is actually the same person, aka an alternate identity. Alternate identities result in multiple rows for the same person.

Optional, NA indicates no alternate identity exists for that person.

Derived from the sameAs attribute of the <person> node itself.

<sameAs>
  {
  if ($character/@sameAs) 
  then data($character/@sameAs)
  else "NA"
  }
</sameAs>
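Because sameAs holds the id of another row, alternate identities can be resolved by joining the table to itself. The sketch below uses invented rows (all ids and names are hypothetical) and dplyr, and shows one possible approach rather than the editor's own workflow.

```r
library(dplyr)
library(tibble)

# Hypothetical data: #B is an alternate identity of #A
toy <- tibble(
  id     = c("#A", "#B", "#C"),
  name   = c("Alpha", "Beta", "Gamma"),
  sameAs = c(NA, "#A", NA)
)

# Keep only rows with an alternate identity, then look up
# the name of the character each sameAs value points to
aliases <- toy %>%
  filter(!is.na(sameAs)) %>%
  left_join(
    select(toy, id, name),
    by = c("sameAs" = "id"),
    suffix = c("", "_sameAs")
  )

aliases
```

The result has one row per alias, with name_sameAs holding the name of the character it resolves to.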

¹ i.e. yours truly.
² In the XML output the lists were space-delimited, since they derived from XML attributes.
³ In order to be Tidy Data, in which each row represents a single observation.
