The scenes, providing location and characters present…
This data is generated by extracting all TEI <milestone unit="scene"> descendant nodes of <text> nodes in the Digital Edition of Terra Ignota.
What Are <milestone unit="scene"> Nodes For?
These <milestone unit="scene"> nodes mark the points at which the scene changes.
These changes come in three types: location, characters physically present, and characters remotely present. The type of change is indicated by an additional type attribute set to location, present, or remote, respectively.
Lastly, an ana attribute contains a space-delimited list of the locations or characters for the scene.
Here’s a simplified view of what these nodes look like:
<milestone unit="scene" type="location" ana="#FlowerTrench"/>
<milestone unit="scene" type="present" ana="#Carlyle"/>
<milestone unit="scene" type="remote"/>
<p>We begin on the morning of ... </p>
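To make the role of the type and ana attributes concrete, here is a minimal R sketch that reads them from the simplified snippet with the xml2 package. The wrapping <div> root and the absence of the TEI namespace are assumptions made to keep the example self-contained; the real edition is namespaced, and the actual extraction is done with the XQuery script described further down.

```r
library(xml2)

# Simplified snippet; the <div> wrapper is hypothetical and only there
# so that the fragment parses as a well-formed document.
snippet <- '<div>
  <milestone unit="scene" type="location" ana="#FlowerTrench"/>
  <milestone unit="scene" type="present" ana="#Carlyle"/>
  <milestone unit="scene" type="remote"/>
  <p>We begin on the morning of ... </p>
</div>'

doc <- read_xml(snippet)
nodes <- xml_find_all(doc, ".//milestone[@unit = 'scene']")

milestones <- data.frame(
  type = xml_attr(nodes, "type"),
  ana  = xml_attr(nodes, "ana")  # NA when the attribute is absent
)

# Split each space-delimited ana value into individual identifiers
ids <- strsplit(milestones$ana, " ", fixed = TRUE)
```

A milestone without an ana attribute, like the remote one here, yields a missing value rather than an empty list of identifiers.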
The data dictionary below maps the information above to the columns of the data file.
The data extracted from these <milestone unit="scene"> nodes is available as a CSV file.
This file was last updated on 2022-11-03.
The raw data is first extracted from the <milestone unit="scene"> nodes using an XQuery script.
For easy ingestion with the XML package in R, the script's output has a <records> root node and one <scene> node per scene milestone in the original text.
xquery version "3.1";
declare namespace tei = "http://www.tei-c.org/ns/1.0";
declare variable $doc := doc("PATH_TO_TEI_FILE");

<records>
{
  for $scene at $scene_index in $doc//tei:text//tei:milestone[@unit = "scene"]
  return
    <scene>
      ...
      A node per column in the output file, see below for details
      ...
    </scene>
}
</records>
The XQuery script produces an XML file of the form:
<?xml version="1.0" encoding="UTF-8"?>
<records>
  <scene>
    <scene>45</scene>
    <location>#Entrance-SWBH</location>
    <present>#Carlyle</present>
    <remote>#Eureka #Mycroft</remote>
  </scene>
  etc...
</records>
This XML output file must then be cleaned before being saved to the CSV file provided above.
First, the correct data type must be set for each column, and missing values must be set to NA. This is straightforward with the XML and tidyverse packages.
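library(XML)        # xmlToDataFrame()
library(tidyverse)  # mutate(), na_if(), %>%

# xml_path is the path to the XML file produced by the XQuery script
scenes <- xmlToDataFrame(xml_path) %>%
  mutate(
    # Missing values must be NA
    location = na_if(location, ""),
    present = na_if(present, ""),
    remote = na_if(remote, ""),
    # Column data types must be correct
    scene = as.integer(scene)
  )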
NB: Luckily for you, when you read this data in from the CSV file, the readr package is smart enough to guess all of this correctly.
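To see that guessing at work, here is a small sketch that reads an inline CSV with readr. The sample rows are made up for illustration, the column layout is assumed to match the data dictionary, and passing literal data through I() assumes readr 2.0 or later; in practice you would pass the path to the downloaded CSV file instead.

```r
library(readr)

# Illustrative rows only; I() tells readr to treat the string as literal data
csv_text <- I("scene,location,present,remote
45,#Entrance-SWBH,#Carlyle,#Eureka
46,#Entrance-SWBH,#Carlyle,NA
")

scenes <- read_csv(csv_text, show_col_types = FALSE)

# readr guesses a numeric type for scene (double rather than integer)
# and, by default, reads the literal string NA as a missing value
str(scenes)
```

If you need scene as a true integer column, the col_types argument of read_csv() lets you override the guess.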
Second, any rows containing delimited lists, e.g. a scene with more than one character present, must be separated into multiple rows.
scenes <- scenes %>%
  # Break space-delimited columns across multiple rows
  separate_rows(location, sep = " ") %>%
  separate_rows(present, sep = " ") %>%
  separate_rows(remote, sep = " ")
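As a worked example of what this step does, the sketch below applies the same calls to two made-up scenes. Scene 47 and the identifier #ExampleId are hypothetical and only serve the illustration. Note that when several columns hold lists, a scene ends up with one row per combination of values.

```r
library(dplyr)
library(tidyr)

# Toy input: scene 47 and #ExampleId are invented for this example
toy <- tibble(
  scene    = c(45L, 47L),
  location = c("#Entrance-SWBH", "#Entrance-SWBH"),
  present  = c("#Carlyle", "#Carlyle #Mycroft"),
  remote   = c("#Eureka #Mycroft", "#Eureka #ExampleId")
)

long <- toy %>%
  separate_rows(location, sep = " ") %>%
  separate_rows(present, sep = " ") %>%
  separate_rows(remote, sep = " ")

# Scene 45: 1 present x 2 remote = 2 rows
# Scene 47: 2 present x 2 remote = 4 rows
count(long, scene)
```

Because of this row multiplication, counting scenes in the long data calls for n_distinct(scene) rather than a plain row count.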
Currently, scene milestones have been defined up to page -.
The columns in the data file, what they mean, and how they were generated.
scene
A unique numeric identifier of the scene change within the text.
Required, numeric.
Generated dynamically by the XQuery script as the positional index while iterating through the <milestone unit="scene"> nodes.
<scene>{
  $scene_index
}</scene>
location
The unique identifier of the location the scene takes place in. Multiple locations result in multiple rows for the same scene.
Required; NA indicates the scene needs to be edited.
Derived from the ana attribute of the most recent milestone of type location.
<location>{
  if ($scene[@type = "location"])
  then data($scene/@ana)
  else data($scene/preceding::tei:milestone[@unit = "scene" and @type = "location"][1]/@ana)
}</location>
present
The unique identifier of a character present physically for the scene. Multiple characters result in multiple rows for the same scene.
Required; NA indicates the scene needs to be edited.
Derived from the ana attribute of the most recent milestone of type present.
<present>{
  if ($scene[@type = "present"])
  then data($scene/@ana)
  else data($scene/preceding::tei:milestone[@unit = "scene" and @type = "present"][1]/@ana)
}</present>
remote
The unique identifier of a character present remotely for the scene. Multiple characters result in multiple rows for the same scene.
Required; NA indicates the scene needs to be edited.
Derived from the ana attribute of the most recent milestone of type remote.
<remote>{
  if ($scene[@type = "remote"])
  then data($scene/@ana)
  else data($scene/preceding::tei:milestone[@unit = "scene" and @type = "remote"][1]/@ana)
}</remote>
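The "most recent milestone" rule shared by the location, present, and remote columns can be mimicked in R to make the logic concrete. This is only an illustrative sketch, not the method used to build the data file (that is the XQuery above): the milestone values are made up, and an empty string is assumed to stand for a milestone that has no ana attribute.

```r
library(dplyr)
library(tidyr)

# Hypothetical milestones in document order, one row per scene change.
# "" marks a milestone without an ana attribute (an assumption of this sketch).
milestones <- tibble(
  scene = 1:5,
  type  = c("location", "present", "remote", "present", "remote"),
  ana   = c("#FlowerTrench", "#Carlyle", "", "#Carlyle #Mycroft", "#Eureka")
)

scenes <- milestones %>%
  # One column per milestone type, filled only on the row that sets it
  pivot_wider(names_from = type, values_from = ana) %>%
  # Carry the most recent value of each type forward to later scenes
  fill(location, present, remote, .direction = "down") %>%
  # An explicitly empty value becomes NA only after it has been carried forward
  mutate(across(c(location, present, remote), ~ na_if(.x, "")))
```

Keeping the empty string until after the fill matters: converting it to NA first would let an older remote value leak past a milestone that cleared it.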
If you see mistakes or want to suggest changes, please create an issue on the source repository.
For attribution, please cite this work as
Glachant (2022, Nov. 3). Data Ignota: Scenes. Retrieved from https://syvwlch.github.io/Data-Ignota/tei/scenes/
BibTeX citation
@misc{glachant2022scenes,
  author = {Glachant, Mathieu},
  title = {Data Ignota: Scenes},
  url = {https://syvwlch.github.io/Data-Ignota/tei/scenes/},
  year = {2022}
}