Data Ignota: Scenes

Mathieu Glachant

Origins of This Data

This data is generated by extracting all TEI <milestone unit="scene"> descendant nodes of <text> nodes in the Digital Edition of Terra Ignota.

What Are `<milestone unit="scene">` Nodes For?

These <milestone unit="scene"> nodes mark the points at which there are scene changes.

These changes come in three types: location, characters present physically, and characters present remotely. The kind of change is indicated by a type attribute set to location, present, or remote respectively.

Lastly, an ana attribute contains a space-delimited list of the locations or characters for the scene.

Example Nodes

Here’s a simplified view of what these nodes look like:

<milestone unit="scene" type="location" ana="#FlowerTrench"/>
<milestone unit="scene" type="present" ana="#Carlyle"/>
<milestone unit="scene" type="remote"/>
<p>We begin on the morning of ... </p>

The data dictionary below maps each piece of information above to a column in the data file.
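
As an illustration of how the attributes on the example nodes can be read, here is a minimal sketch in R. It assumes the xml2 package (the pipeline described on this page uses XQuery and the XML package instead), and the wrapping TEI document is a tiny stand-in, not the real edition.

```r
library(xml2)

# A tiny stand-in for the TEI source (assumption: the real file is far larger)
tei <- '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
  <milestone unit="scene" type="location" ana="#FlowerTrench"/>
  <milestone unit="scene" type="present" ana="#Carlyle"/>
  <milestone unit="scene" type="remote"/>
  <p>We begin on the morning of ... </p>
</body></text></TEI>'

doc <- read_xml(tei)

# The default namespace is registered as "d1"; rename it to "tei" for readability
ns <- xml_ns_rename(xml_ns(doc), d1 = "tei")
nodes <- xml_find_all(doc, "//tei:text//tei:milestone[@unit = 'scene']", ns)

milestones <- data.frame(
  type = xml_attr(nodes, "type"),
  ana = xml_attr(nodes, "ana"),  # NA when the attribute is absent
  stringsAsFactors = FALSE
)

# ana is a space-delimited list, so split it into individual identifiers
ids <- strsplit(milestones$ana, " ", fixed = TRUE)
```

The third milestone has no ana attribute, so its entry is NA rather than an empty list.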

Get the Data

The data extracted from these <milestone unit="scene"> nodes is available as a CSV file.

Download Link

Download the data

Last Updated

This file was last updated on 2022-11-03.

Raw Data Generation

The raw data is first extracted from the <milestone unit="scene"> nodes using an XQuery script.

XQuery Script

For easy ingestion with the XML package in R, the script’s output has a <records> root node and one <scene> node per scene milestone in the original text.

xquery version "3.1";

declare namespace tei = "http://www.tei-c.org/ns/1.0";
declare variable $doc := doc("PATH_TO_TEI_FILE");

<records>
  {
  for $scene at $scene_index in $doc//tei:text//tei:milestone[@unit = "scene"]
  return 
  <scene>
    {
      (: One child node per column in the output file;
         see the Data Dictionary below for each constructor. :)
    }
  </scene>
  }
</records>

XML Output File

The XQuery script writes an XML file of the form:

<?xml version="1.0" encoding="UTF-8"?>
<records>
  <scene>
    <scene>45</scene>
    <location>#Entrance-SWBH</location>
    <present>#Carlyle</present>
    <remote>#Eureka #Mycroft</remote>
  </scene>

  <!-- etc. -->
   
</records>

Data Prep

This XML output file must then be cleaned before being saved to the CSV file provided above.

Clean: Fix Data Types and NA Values

First, the correct data types must be set for each column and missing values set to NA. This is very easy using the XML and tidyverse packages.

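library(XML)        # provides xmlToDataFrame()
library(tidyverse)  # provides %>%, mutate(), na_if()

# xml_path is the path to the XML output file described above
scenes <- xmlToDataFrame(xml_path) %>%
  mutate(
    # Missing values must be NA
    location = na_if(location, ""),
    present = na_if(present, ""),
    remote = na_if(remote, ""),
    # Column data types must be correct
    scene = as.integer(scene)
  )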

NB: Luckily for you, when you read this data back in from the CSV file, the readr package is smart enough to guess all of this correctly.
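
If you would rather not rely on guessing, the column types can also be stated explicitly. The sketch below is only an illustration: the file is a temporary stand-in for the downloaded CSV, and its rows (including scene 46) are made up for the example.

```r
library(readr)

# Hypothetical stand-in for the downloaded CSV file (rows are illustrative only)
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "scene,location,present,remote",
  "45,#Entrance-SWBH,#Carlyle,#Eureka",
  "45,#Entrance-SWBH,#Carlyle,#Mycroft",
  "46,#Entrance-SWBH,NA,NA"
), csv_path)

# scene is read as an integer, every other column as character;
# by default readr treats empty fields and the string "NA" as missing
scenes <- read_csv(
  csv_path,
  col_types = cols(scene = col_integer(), .default = col_character())
)
```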

Tidy: Separate Delimited Lists

Second, any rows containing delimited lists¹, e.g. a scene with more than one character present, must be² separated into multiple rows.

scenes <- scenes %>%
  # Break space-delimited columns across multiple rows
  separate_rows(location, sep = " ") %>%
  separate_rows(present, sep = " ") %>%
  separate_rows(remote, sep = " ")
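
To make the effect of this step concrete, here is a small sketch that applies it to the single record shown in the XML output example above. The data frame is built by hand for the illustration rather than read from the real file.

```r
library(dplyr)
library(tidyr)

# The record from the XML output example, typed in by hand
scenes <- tibble(
  scene = 45L,
  location = "#Entrance-SWBH",
  present = "#Carlyle",
  remote = "#Eureka #Mycroft"
)

tidy_scenes <- scenes %>%
  separate_rows(location, sep = " ") %>%
  separate_rows(present, sep = " ") %>%
  separate_rows(remote, sep = " ")

# The two remote characters now sit on two rows that share the same scene,
# location, and present values
print(tidy_scenes)
```

Because each column is separated in turn, a scene with several values in more than one column yields one row per combination of values.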

Editing Progress

Currently, scene milestones have been defined up to page -.

Data Dictionary

The columns in the data file, with what each one means and how it was generated.

`scene`

A unique numeric identifier of the scene change within the text.

Required, numeric.

Generated dynamically by the XQuery script as the index while iterating through the <milestone unit="scene"> nodes.

<scene>{
  $scene_index
}</scene>

`location`

The unique identifier of the location the scene takes place in. Multiple locations result in multiple rows for the same scene.

Required; NA indicates the scene needs to be edited.

Derived from the ana attribute of the most recent milestone of type location.

<location>{
  if ($scene[@type = "location"]) 
  then data($scene/@ana) 
  else data($scene/preceding::tei:milestone[@unit = "scene" and @type = "location"][1]/@ana)
}</location>

`present`

The unique identifier of a character present physically for the scene. Multiple characters result in multiple rows for the same scene.

Required; NA indicates the scene needs to be edited.

Derived from the ana attribute of the most recent milestone of type present.

<present>{
  if ($scene[@type = "present"]) 
  then data($scene/@ana) 
  else data($scene/preceding::tei:milestone[@unit = "scene" and @type = "present"][1]/@ana)
}</present>

`remote`

The unique identifier of a character present remotely for the scene. Multiple characters result in multiple rows for the same scene.

Required; NA indicates the scene needs to be edited.

Derived from the ana attribute of the most recent milestone of type remote.

<remote>{
  if ($scene[@type = "remote"]) 
  then data($scene/@ana) 
  else data($scene/preceding::tei:milestone[@unit = "scene" and @type = "remote"][1]/@ana)
}</remote>
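
The location, present, and remote columns all follow the same "most recent milestone of this type" rule. The sketch below mirrors that rule in base R for illustration only. carry_forward is a hypothetical helper, not part of the actual pipeline, and the input is the three example nodes from earlier on this page.

```r
# Hypothetical helper: for each milestone, return the ana value of the most
# recent milestone (itself included) whose type equals `wanted`
carry_forward <- function(type, ana, wanted) {
  out <- rep(NA_character_, length(type))
  current <- NA_character_
  for (i in seq_along(type)) {
    if (type[i] == wanted) current <- ana[i]  # a matching milestone resets the value
    out[i] <- current                         # others inherit the last value seen
  }
  out
}

# The three example milestones, in document order
type <- c("location", "present", "remote")
ana <- c("#FlowerTrench", "#Carlyle", NA)  # the remote milestone has no ana

scenes <- data.frame(
  scene = seq_along(type),
  location = carry_forward(type, ana, "location"),
  present = carry_forward(type, ana, "present"),
  remote = carry_forward(type, ana, "remote"),
  stringsAsFactors = FALSE
)
```

Here every scene inherits the location #FlowerTrench, present is filled in from the second milestone onward, and remote stays NA throughout.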

¹ In the XML output the lists were space-delimited, since they derived from XML attributes.
² This is required for the data to be Tidy Data, in which each row represents a single observation.
