Data Ignota: `<said>` Element

Mathieu Glachant

Origins of This Data

This data is generated by extracting all TEI <said> nodes in the Digital Edition of Terra Ignota.

What Are `<said>` Nodes For?

These <said> nodes contain the text of all direct communication between characters. I usually refer to such a passage as a ‘line of dialog’, or ‘line’ for short.

Example Nodes

A line between Mycroft and Dominic would be encoded thus:

Novel Layout

<p>
  <said>"Where hast thou been, stray?"</said>, Dominic
  snarled. <said>"Thy master needs thee."</said>
</p>

Script Layout

<sp>
  <speaker>Child:</speaker>
  <said>"I miss you, Mycroft."</said>
</sp>

Note that the narration part of that paragraph is always left outside of the node.
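
As a rough illustration of how dialog and narration separate, the sketch below parses the novel-layout example with Python's standard library. It is not part of this project's pipeline: the names `dialog_lines` and `narration` are made up for the example, and the fragment omits the TEI namespace just as the example above does.

```python
import xml.etree.ElementTree as ET

# The novel-layout example from above, without a TEI namespace.
NOVEL = """<p>
  <said>"Where hast thou been, stray?"</said>, Dominic
  snarled. <said>"Thy master needs thee."</said>
</p>"""


def dialog_lines(fragment):
    """Return the text of every <said> node, in document order."""
    root = ET.fromstring(fragment)
    return ["".join(said.itertext()) for said in root.iter("said")]


def narration(fragment):
    """Return the text of the paragraph that sits outside its child nodes."""
    root = ET.fromstring(fragment)
    # Text before the first child, plus the tail text that follows each child.
    parts = [root.text or ""] + [child.tail or "" for child in root]
    return " ".join("".join(parts).split())
```

Calling `dialog_lines(NOVEL)` returns the two quoted lines, while `narration(NOVEL)` returns only what falls between and around them.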

The data dictionary below describes each piece of information available for a line: its book, chapter, and page; who is speaking, to whom, and about what; and so on.

Get the Data

The data extracted from these <said> nodes is available as a CSV file.

Download Link

Download the data

Last Updated

This file was last updated on 2022-11-03.

Raw Data Generation

The raw data is first extracted from the <said> nodes using an XQuery script.

XQuery Script

For easy ingestion with the XML package in R, the script’s output has a <records> root node and one <line> node per line of dialog in the original text.

xquery version "3.1";

declare namespace tei = "http://www.tei-c.org/ns/1.0";
declare variable $doc := doc("PATH_TO_TEI_FILE");

<records>
  {
  for $book in $doc//tei:text[@type = "book"]
    for $chapter in $book//tei:div[@type = "chapter"]
      for $p at $p_index in $chapter//(tei:p | tei:sp)
        for $line at $line_index in $p//tei:said
        return
  <line>
    ...
    A node per column in the output file, see below for details
    ...
  </line>}
</records>

The query iterates through books, chapters, paragraphs, and lines of dialog. The first two levels are numbered in the original file, but the lower two are numbered during data extraction, which is why the query binds an index to each.
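
For readers more at home in Python than XQuery, the nested iteration can be sketched with `enumerate`. This is only an approximation of the query above: the sample document is invented for the example, and `iter_lines` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical miniature document, invented for this example.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text type="book" n="1">
    <div type="chapter" n="2">
      <p><said>First.</said> Narration. <said>Second.</said></p>
      <p>No dialog in this paragraph.</p>
      <sp><speaker>Child:</speaker><said>Third.</said></sp>
    </div>
  </text>
</TEI>"""


def iter_lines(root):
    """Yield (book, chapter, paragraph, line, text) for each <said> node."""
    for book in root.iter(TEI + "text"):
        if book.get("type") != "book":
            continue
        for chapter in book.iter(TEI + "div"):
            if chapter.get("type") != "chapter":
                continue
            # <p> and <sp> nodes in document order, numbered from 1.
            paragraphs = [el for el in chapter.iter()
                          if el.tag in (TEI + "p", TEI + "sp")]
            for p_index, p in enumerate(paragraphs, start=1):
                for line_index, said in enumerate(p.iter(TEI + "said"), start=1):
                    text = "".join(said.itertext())
                    yield (book.get("n"), chapter.get("n"), p_index, line_index, text)


rows = list(iter_lines(ET.fromstring(SAMPLE)))
```

Note that the paragraph without dialog still consumes a paragraph number, so the line in the `<sp>` node is in paragraph 3, not 2.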

XML Output File

The XQuery script outputs an XML file of the form:

<?xml version="1.0" encoding="UTF-8"?>
<records>
   <line>
      <book>1</book>
      <chapter>1</chapter>
      <scene>3</scene>
      <paragraph>1</paragraph>
      <line>1</line>
      <page>13</page>
      <speaker>#Mycroft</speaker>
      <addressee>#reader</addressee>
      <person>#reader</person>
      <org>NA</org>
      <place>NA</place>
      <language>en-en</language>
      <aloud>FALSE</aloud>
      <format>novel</format>
      <text>You will criticize me, reader, for ... </text>
   </line>
   
   etc...
   
</records>
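
A minimal sketch of reading this structure in Python follows, assuming the file has exactly the shape shown above. The helper name `read_records` is hypothetical. The one subtlety is that `<line>` appears at two levels: as the record wrapper and as a column inside it.

```python
import xml.etree.ElementTree as ET

# One record in the shape shown above; the text is elided as in the original.
OUTPUT = """<records>
   <line>
      <book>1</book>
      <chapter>1</chapter>
      <scene>3</scene>
      <paragraph>1</paragraph>
      <line>1</line>
      <page>13</page>
      <speaker>#Mycroft</speaker>
      <addressee>#reader</addressee>
      <person>#reader</person>
      <org>NA</org>
      <place>NA</place>
      <language>en-en</language>
      <aloud>FALSE</aloud>
      <format>novel</format>
      <text>You will criticize me, reader, for ... </text>
   </line>
</records>"""


def read_records(xml_text):
    """Return one dict per record, with every value still a string."""
    root = ET.fromstring(xml_text)
    records = []
    # findall("line") matches direct children only, so the inner <line>
    # column is not mistaken for a record.
    for record in root.findall("line"):
        records.append({field.tag: (field.text or "").strip() for field in record})
    return records


records = read_records(OUTPUT)
```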

Data Prep

This XML output file must then be cleaned, tidied, and purged of content under copyright before being saved to the CSV file provided above.

Clean: Fix Data Types and NA Values

First, the correct data types must be set for each column and missing values set to NA. This is very easy using the XML and tidyverse packages.

library(XML)        # provides xmlToDataFrame()
library(tidyverse)  # provides %>%, mutate(), na_if()

said <- xmlToDataFrame(xml_path) %>%
  mutate(
    # Column data types must be correct
    book = as.integer(book),
    chapter = as.integer(chapter),
    scene = as.integer(scene),
    paragraph = as.integer(paragraph),
    line = as.integer(line),
    page = as.integer(page),
    aloud = as.logical(aloud),
    # Missing values must be _NA_
    speaker = na_if(speaker, "NA"),
    addressee = na_if(addressee, "NA"),
    person = na_if(person, "NA"),
    org = na_if(org, "NA"),
    place = na_if(place, "NA")
  )

NB: Luckily for you, when you read this data in from the CSV file, the readr package is smart enough to guess all of this correctly.
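
Outside R, the same conversions may have to be done by hand. The sketch below uses Python's `csv` module and rests on several assumptions: that the CSV column names match the data dictionary, that missing values are written as `NA`, and that `aloud` is written as `TRUE` or `FALSE`. The sample rows and the `read_said` helper are invented for the example.

```python
import csv
import io

INT_COLUMNS = ("book", "chapter", "scene", "paragraph", "line", "page", "words")

# Hypothetical two-row extract, invented for this example.
SAMPLE_CSV = """book,chapter,scene,paragraph,line,page,speaker,addressee,person,org,place,language,aloud,format,words
1,1,3,1,1,13,#Mycroft,#reader,#reader,NA,NA,en-en,FALSE,novel,12
1,1,3,2,1,13,NA,NA,NA,NA,NA,en-en,TRUE,novel,4
"""


def read_said(handle):
    """Read the CSV, turning 'NA' into None and fixing column types."""
    rows = []
    for raw in csv.DictReader(handle):
        row = {key: (None if value == "NA" else value) for key, value in raw.items()}
        for column in INT_COLUMNS:
            if row[column] is not None:
                row[column] = int(row[column])
        if row["aloud"] is not None:
            row["aloud"] = row["aloud"] == "TRUE"
        rows.append(row)
    return rows


said = read_said(io.StringIO(SAMPLE_CSV))
```

With a real download, `io.StringIO(SAMPLE_CSV)` would be replaced by an open file handle.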

Tidy: Separate Delimited Lists

Second, any rows containing delimited lists¹, e.g. a line of dialog with more than one addressee, must be² separated into multiple rows.

said <- said %>%
  # Break space-delimited columns across multiple rows
  separate_rows(speaker, sep = " ") %>%
  separate_rows(addressee, sep = " ") %>%
  separate_rows(person, sep = " ") %>%
  separate_rows(org, sep = " ") %>%
  separate_rows(place, sep = " ")
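
To make the effect of this step concrete, here is a small Python sketch of the same idea: each space-delimited field is split, and one row is produced per combination of values. It is not the tidyr implementation, the row order is not guaranteed to match, and the record is invented for the example.

```python
from itertools import product

LIST_COLUMNS = ("speaker", "addressee", "person", "org", "place")


def separate_rows(record, columns=LIST_COLUMNS):
    """Return one copy of the record per combination of list values."""
    # Split each list column on spaces; a missing value stays a single None.
    choices = [record[c].split(" ") if record[c] is not None else [None]
               for c in columns]
    return [{**record, **dict(zip(columns, combo))} for combo in product(*choices)]


# Hypothetical line with two addressees.
record = {"line": 1, "speaker": "#Mycroft", "addressee": "#Carlyle #Thisbe",
          "person": None, "org": None, "place": None}
rows = separate_rows(record)
```

Here `rows` holds two records that differ only in `addressee`. A line with two addressees and two people mentioned would produce four rows, which is worth remembering when counting lines in the CSV file.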

Copyright: Remove Raw Text

Lastly, before writing to the CSV file, any text under copyright is removed. This is the last chance to process that text, so let’s add a words column for the number of words in the line.

said <- said %>%
  # Proxy for word count, counting spaces as separators
  mutate(words = str_count(text, " ") + 1)  %>%
  # Must not include the original text for copyright reasons
  mutate(text = NULL)
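
The word-count proxy is simple enough to sketch in a few lines of Python. The helper names are hypothetical, and the whitespace collapsing is meant to mirror the `normalize-space()` call used when the text is extracted.

```python
def count_words(text):
    """Count words as the number of single spaces plus one."""
    # Collapse runs of whitespace first, as normalize-space() does.
    normalized = " ".join(text.split())
    return normalized.count(" ") + 1


def strip_text(record):
    """Return a copy of the record with a words column and no text column."""
    public = {key: value for key, value in record.items() if key != "text"}
    public["words"] = count_words(record["text"])
    return public
```

For the line `"Where hast thou been, stray?"` this gives 5. One limitation of the proxy is that an empty line still counts as one word, since zero spaces plus one is one.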

NB: If you have ideas for other columns I could generate from the text at this step, please create an issue in the tracker here, or start a new discussion here.

Editing Progress

Chapter¹	Progress²	Missing `speaker`	Missing `addressee`
1.01	100%	-	-
1.02	100%	-	-
1.03	100%	-	-
1.04	100%	-	-
1.05	100%	-	-
1.06	100%	-	-
1.07	100%	-	-
1.08	100%	-	-
1.09	100%	-	-
1.10	100%	-	-
1.11	100%	-	-
1.12	100%	-	-
1.13	100%	-	-
¹ Chapters with at least one line edited.
² Percentage of lines edited.

Data Dictionary

This section lists the columns in the data file and explains what each one means and how it was generated.

`book`

The number of the book which contains the line.

Required, numeric.

Derived from the n attribute of the <text type="book"> node the line is a descendant of.

<book>
  {data($book/@n)}
</book>

`chapter`

The number of the chapter containing the line.

Required, numeric.

Derived from the n attribute of the <div type="chapter"> node the line is a descendant of.

<chapter>
  {data($chapter/@n)}
</chapter>

`scene`

The number of the scene containing the line.

Required, numeric.

Derived by counting the number of preceding <milestone unit="scene"> nodes in document order.

<scene>{
  count($line/preceding::tei:milestone[@unit = "scene"])
}</scene>
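
The `preceding::` count can be approximated in Python by walking the document in order and keeping a running tally. This is a sketch under assumptions: the fragment is invented, `scene_numbers` is a hypothetical helper, and like the query it counts every earlier scene milestone in whatever tree it is given.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical chapter fragment, invented for this example.
SAMPLE = """<div xmlns="http://www.tei-c.org/ns/1.0" type="chapter" n="1">
  <milestone unit="scene"/>
  <p><said>One.</said></p>
  <milestone unit="scene"/>
  <p><said>Two.</said> <said>Three.</said></p>
</div>"""


def scene_numbers(root):
    """Return, for each <said> node, the number of scene milestones before it."""
    scenes_seen = 0
    numbers = []
    for element in root.iter():  # iter() walks the tree in document order
        if element.tag == TEI + "milestone" and element.get("unit") == "scene":
            scenes_seen += 1
        elif element.tag == TEI + "said":
            numbers.append(scenes_seen)
    return numbers


scenes = scene_numbers(ET.fromstring(SAMPLE))
```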

`paragraph`

The number, within the chapter, of the paragraph containing the line.

Required, numeric.

Derived from the index $p_index generated while iterating in document order through the <p> and <sp> descendants of the current chapter.

<paragraph>
  {$p_index}
</paragraph>

`line`

The number of the actual line of dialog within the current paragraph.

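Required, numeric.

Derived from the index $line_index generated while iterating in document order through the <said> descendants of the current paragraph.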

<line>
  {$line_index}
</line>

If there is more than one line of dialog in a paragraph, this usually indicates that some narration separates them. The paragraph below contains two lines of dialog, for example.

<p>
  <said>"Where hast thou been, stray?"</said>, Dominic
  snarled. <said>"Thy master needs thee."</said>
</p>

`page`

The number of the page on which the line starts.

Required, numeric.

Derived from the n attribute of the nearest <pb/> milestone node that precedes the line in document order.

<page>
  {data($line/preceding::tei:pb[1]/@n)}
</page>
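
Because the page is taken from the nearest page break before the line starts, a line that runs across a page break keeps the number of the page it began on. The Python sketch below illustrates this with an invented fragment; `page_numbers` is a hypothetical helper that remembers the most recent `<pb/>` seen.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical fragment in which the second line runs across a page break.
SAMPLE = """<div xmlns="http://www.tei-c.org/ns/1.0" type="chapter" n="1">
  <pb n="13"/>
  <p><said>One.</said></p>
  <p><said>Two, <pb n="14"/>continued.</said> <said>Three.</said></p>
</div>"""


def page_numbers(root):
    """Return, for each <said> node, the n of the last <pb/> before it starts."""
    current_page = None  # stays None for lines before the first page break
    pages = []
    for element in root.iter():  # document order: a node comes before its children
        if element.tag == TEI + "pb":
            current_page = element.get("n")
        elif element.tag == TEI + "said":
            pages.append(current_page)
    return pages


pages = page_numbers(ET.fromstring(SAMPLE))
```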

`speaker`

The unique identifier of the character speaking the line. Multiple speakers result in multiple rows for the same line.

Required, NA indicates the line needs to be edited.

Derived from the who attribute of the line’s <said> node.

<speaker>
  {if ($line/@who) then data($line/@who) else "NA"}
</speaker>

Character IDs

The identifier takes the form #Mycroft or #Carlyle. It points to a unique <person> node whose xml:id attribute is set to that value without the leading #.

Those <person> nodes contain metadata about the character (names, affiliations, age, etc.) and are stored outside the text in a <standOff> node within the Digital Edition’s file.

I will, at some point, publish the list of characters as its own file and the primary key will be this identifier to permit joins.

‘Unknown’ Characters

It is not always clear in the text who is speaking, e.g. when Carlyle first overhears through Thisbe’s door in Chapter the First. In other cases the speaker is an unnamed individual in a crowd or group, e.g. the several servicers who witness Vivien dragging Mycroft out of the gutter in Chapter the Sixth.

For such cases, I use generic IDs with the form #Unknown-Soldier or #Unknown-Servicer or the fallback #Unknown.

`addressee`

The unique identifier of the character the line is being spoken to. Multiple addressees result in multiple rows for the same line.

Required, NA indicates the line needs to be edited.

Derived from the toWhom attribute of the line’s <said> node.

See speaker for format and meaning of the identifier.

<addressee>
  {if ($line/@toWhom) then data($line/@toWhom) else "NA"}
</addressee>

Asides to the Reader

The text does not call out Mycroft’s asides to his gentle reader typographically. I have therefore had to make some editorial decisions when marking up those parts of the text which are typeset like narration but which I believe to be such an aside.

This allows me to include the conversations between Mycroft and the reader in this data, but at the cost of some personal interpretation in what does or does not make the cut.

The rule I’ve tried to follow is that narration that uses the second person and/or addresses the reader directly should be included. Usually I have found that once this pattern starts it persists to the end of the paragraph. Some individual cases are less clear-cut, and of course I will have made errors during the editing.

`person`

The unique identifier of a person mentioned by name in the line. Multiple people mentioned result in multiple rows for the same line.

Optional, NA indicates no one was mentioned in the line.

Derived from the ref attribute of any <persName> descendants of the line.

See speaker for format and meaning of the identifier.

<person>
  {if ($line//tei:persName) then 
    for $name in distinct-values($line//tei:persName/@ref)
    return normalize-space(concat($name, " "))
  else "NA"}
</person>
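
The same collect-distinct-references pattern is used for people, organizations, and places. A Python approximation follows; the sample line is invented and `mentioned_ids` is a hypothetical helper. Unlike the query, this sketch also returns `NA` when name nodes exist but none carries a `ref` attribute, and it keeps identifiers in first-seen order.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical line that mentions one character twice and another once.
SAMPLE = """<said xmlns="http://www.tei-c.org/ns/1.0">Ask
  <persName ref="#Carlyle">Carlyle</persName>, or
  <persName ref="#Thisbe">Thisbe</persName>.
  <persName ref="#Carlyle">Carlyle</persName> knows.</said>"""


def mentioned_ids(said, name_tag="persName"):
    """Return the distinct ref values as a space-delimited string, or 'NA'."""
    refs = []
    for name in said.iter(TEI + name_tag):
        ref = name.get("ref")
        if ref is not None and ref not in refs:
            refs.append(ref)
    return " ".join(refs) if refs else "NA"


line = ET.fromstring(SAMPLE)
people = mentioned_ids(line)
orgs = mentioned_ids(line, "orgName")
```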

`org`

The unique identifier of an organization or group mentioned by name in the line. Multiple orgs mentioned result in multiple rows for the same line.

Optional, NA indicates no organization was mentioned in the line.

Derived from the ref attribute of any <orgName> descendants of the line.

<org>
  {if ($line//tei:orgName) then 
    for $name in distinct-values($line//tei:orgName/@ref)
    return normalize-space(concat($name, " "))
  else "NA"}
</org>

Org IDs

The organizations are managed much like the characters, but using <org> nodes instead of <person>. Unlike persons, orgs can nest.

These include Hives, bash’es, nation-strats, the servicers, and the Chicago Museum of Science and Industry as well as its Junior Scientist Club.

`place`

The unique identifier of a place mentioned by name in the line. Multiple places mentioned result in multiple rows for the same line.

Optional, NA indicates no place was mentioned in the line.

Derived from the ref attribute of any <placeName> descendants of the line.

<place>
  {if ($line//tei:placeName) then 
    for $name in distinct-values($line//tei:placeName/@ref)
    return normalize-space(concat($name, " "))
  else "NA"}
</place>

Place IDs

Places are managed much like orgs, but using <place> nodes instead of <org>.

These include planets, continents, cities, bash’houses, palaces, and flower trenches. Like orgs, places can nest.

`language`

The ISO-style code for the language the line is spoken in.

Required. The output file defaults to en-en when the attribute is absent, since Mycroft’s typographic conventions allowed a programmatic approach to setting the attribute in the text.

Derived from the xml:lang attribute of the line’s <said> node.

Note that this is not the language of the line in the text itself, since Mycroft takes it upon himself to translate most of the dialog into English.

<language>
  {if ($line/@xml:lang) then data($line/@xml:lang) else "en-en"}
</language>
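
When reading the TEI file with Python's ElementTree instead of XQuery, one detail is easy to miss: `xml:lang` is stored under the XML namespace URI and not under the literal name `xml:lang`. A minimal sketch of the same default logic, using invented sample nodes and a hypothetical `language_of` helper:

```python
import xml.etree.ElementTree as ET

# ElementTree expands the reserved xml: prefix to this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def language_of(said, default="en-en"):
    """Return the xml:lang value of a <said> node, or the default if unset."""
    return said.get(XML_LANG, default)


# Hypothetical nodes, invented for this example.
marked = ET.fromstring('<said xml:lang="fr-eu">Bonjour.</said>')
unmarked = ET.fromstring('<said>Hello.</said>')
```

Here `language_of(marked)` returns the attribute value and `language_of(unmarked)` falls back to the default.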

Future ISO Language Codes

Several of the languages used are future invented variants of current ones. In that spirit, I have created fictitious ISO codes for this attribute:

en-ar: archaic English, as used by the reader and Dominic,
sp-hu: Humanist Spanish,
jp-mi: Mitsubishi Japanese, on the premise that Ando makes little distinction between voting bloc and nation-strat,
fr-eu: European French, just to piss off Ganymede,
en-tx: the uncapitalized unpunctuated form of text only english used by eureka weeksbooth
la-ma: Masonic neo-latin.

Others to come as they appear in the text, such as Ute-Speak as a dialect of English.
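
For charts or tables it may be convenient to map these codes to readable labels. The lookup below is only a suggestion built from the list above: the label wording is mine, the `en-en` entry is an assumption based on the default value, and `language_label` is a hypothetical helper that falls back to the raw code for anything not yet listed.

```python
# Labels paraphrased from the list above; en-en is the default described earlier.
LANGUAGE_LABELS = {
    "en-en": "English (default)",
    "en-ar": "Archaic English",
    "sp-hu": "Humanist Spanish",
    "jp-mi": "Mitsubishi Japanese",
    "fr-eu": "European French",
    "en-tx": "Text-only English",
    "la-ma": "Masonic neo-Latin",
}


def language_label(code):
    """Return a readable label, or the code itself if it is not in the table."""
    return LANGUAGE_LABELS.get(code, code)
```

Falling back to the code itself keeps the helper usable as new codes are added to the data.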

Non-English Languages in the Text

I have used the <foreign xml:lang="fr-eu"> element to wrap text that is actually given in a language other than English but I am not including this information in this data file. Create a feature request on the project GitHub repository if you’d like me to revisit that decision.

`aloud`

A boolean indicating whether the line is spoken aloud or not. Takes the value FALSE for text messages or for exchanges between Mycroft and the reader, for example.

Required, but the file defaults to TRUE when the attribute has not been set in the text.

Derived from the aloud attribute of the line’s <said> node.

<aloud>
  {if ($line[@aloud="false"]) then "FALSE" else "TRUE"}
</aloud>

`format`

The style in which the dialog is rendered in the text. Takes the value script when the layout changes to a script-like format with the speaker’s name repeated at each line, and novel otherwise.

Required. The query below only ever emits script or novel, so NA is not expected in this column.

Set to script when the line has an <sp> ancestor, and to novel otherwise.

<format>
  {if ($line/ancestor::tei:sp) then "script" else "novel"}
</format>
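
The `ancestor::` test has no direct equivalent in Python's ElementTree, whose nodes do not know their parents. One common workaround, sketched here with an invented fragment and a hypothetical `formats` helper, is to build a child-to-parent map first and then walk upward from each line.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical chapter fragment, invented for this example.
SAMPLE = """<div xmlns="http://www.tei-c.org/ns/1.0" type="chapter" n="1">
  <p><said>In a paragraph.</said></p>
  <sp><speaker>Child:</speaker><said>In a speech.</said></sp>
</div>"""


def formats(root):
    """Return 'script' or 'novel' for each <said> node, in document order."""
    # Map every node to its parent so that ancestors can be walked.
    parents = {child: parent for parent in root.iter() for child in parent}
    result = []
    for said in root.iter(TEI + "said"):
        node = parents.get(said)
        # Climb until an <sp> ancestor is found or the root is passed.
        while node is not None and node.tag != TEI + "sp":
            node = parents.get(node)
        result.append("script" if node is not None else "novel")
    return result


layout = formats(ET.fromstring(SAMPLE))
```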

`words`

The number of words in the line.

Required, numeric.

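Derived during data preparation by counting the spaces in the whitespace-normalized text column and adding one. The text column itself comes from the query below and is removed before the CSV file is written.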

<text>
  {normalize-space(data($line))}
</text>

NB: If you have ideas for other columns I could generate from the text without infringing on the author’s copyright, please create an issue in the tracker here, or start a new discussion here.

¹ In the XML output the lists are space-delimited, since they derive from XML attributes.
² This makes the data tidy: each row represents a single observation.
