Data Ignota: Who Talks to Whom?

Mathieu Glachant

Our said.csv data file contains information about who said a line of dialog, and to whom. What can we do with this data for the chapters edited to date, i.e. the first twelve?

EDIT 2022-10-16: Let’s also use the new characters.csv file to pull in names and emojis for characters.

knitr::opts_chunk$set(dev = "ragg_png") # Use Ragg device so emojis work
library(tidyverse)
library(gt) # For pretty tables
library(tidygraph) # For network graphs
library(ggraph) # To plot the network graphs

PATH_TO_SAID <- "../../data/said.csv"
PATH_TO_CHARACTERS <- "../../data/characters.csv"

said <- read.csv(PATH_TO_SAID) %>%
  filter(
    book == 1,
    chapter <= 12,
    # Let's remove the asides to the reader from this analysis
    speaker != "#reader",
    addressee != "#reader",
  )

dram_pers <- read.csv(PATH_TO_CHARACTERS) %>%
  select(-fullName)

Who Speaks? Who Listens?

Top Speakers

Let’s build a list of characters with a speaking part.

We need to:

remove the asides to the reader and their rants to Mycroft,
eliminate duplicate rows that are not relevant, and
rank them by number of lines of dialog.

top_speakers <- said %>%
  # Group & summarize
  # to drop rows irrelevant to `speaker`
  group_by(line, speaker) %>%
  summarize(.groups = "drop") %>%
  # Group & summarize 
  # to count lines spoken by character
  group_by(id = speaker) %>%
  summarize(speaks = n()) %>%
  # Arrange by count
  arrange(desc(speaks))
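The group-then-summarize-with-nothing step above is really a de-duplication. As a minimal sketch, assuming said.csv repeats a line once per addressee (the `toy_said` rows below are made up for illustration), the same count can be written with `distinct()` and `count()`:

```r
library(dplyr)

# Hypothetical rows: line 2 is addressed to two characters,
# so it appears twice with the same speaker.
toy_said <- tibble(
  line      = c(1, 2, 2, 3),
  speaker   = c("carlyle", "mycroft", "mycroft", "carlyle"),
  addressee = c("mycroft", "carlyle", "bridger", "bridger")
)

toy_speakers <- toy_said %>%
  # Keep one row per (line, speaker), like group_by() + summarize()
  distinct(line, speaker) %>%
  # Count lines per speaker, most talkative first
  count(id = speaker, name = "speaks", sort = TRUE)

print(toy_speakers)
```

Without the `distinct()` step, Mycroft's line 2 would be counted twice, once per addressee.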

Here are the top five out of 38 speakers:

Name	Emoji	speaks
Carlyle	🙏	246
Mycroft	✍️	218
Bridger	🧸	89
Thisbe	🦨	80
Vivien	🧮	77

Top Addressees

We can do the same for characters who are directly addressed.

top_addressees <- said  %>%
  # Group & summarize
  # to drop rows irrelevant to `addressee`
  group_by(line, addressee) %>%
  summarize(.groups = "drop") %>%
  # Group & summarize
  # to count lines addressed to character
  group_by(id = addressee) %>%
  summarize(spokenTo = n()) %>%
  # Arrange by count
  arrange(desc(spokenTo))

Here are the top five out of 50 addressees:

Name	Emoji	spokenTo
Carlyle	🙏	258
Mycroft	✍️	235
Bridger	🧸	99
Vivien	🧮	81
Ganymede	🌞	80

Top Conversationalists

We can now combine the two lists using a full join¹ on the character id column.

characters <- full_join(
  top_speakers,
  top_addressees,
  by = "id"
) %>%
  # Join to retrieve names
  left_join(dram_pers, by = "id")
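To see what the choice of join does to characters who appear on only one list, here is a small sketch with made-up counts (the `toy_` tables are hypothetical, not taken from the data):

```r
library(dplyr)

# Hypothetical counts: "sniper" only speaks, "spain" is only spoken to.
toy_speakers   <- tibble(id = c("carlyle", "sniper"), speaks = c(3L, 1L))
toy_addressees <- tibble(id = c("carlyle", "spain"), spokenTo = c(2L, 4L))

# full_join() keeps all three characters, with NA where a count is missing
toy_full <- full_join(toy_speakers, toy_addressees, by = "id")

# inner_join() would keep only characters found on both lists
toy_inner <- inner_join(toy_speakers, toy_addressees, by = "id")

print(toy_full)
print(toy_inner)
```

If the totals quoted in this post all come from these tables, the arithmetic works out the same way: 38 speakers and 50 addressees combining into 51 characters means 37 are on both lists, one speaker is never addressed, and 13 addressees never speak.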

Tables are nice and all, but this seems like a good time for a graph. Here are the top 20 out of 51 characters by lines of dialog:

No real surprise there: Carlyle and Mycroft are our two chatterboxes, even when we remove the reader from the conversation, and the counts fall off pretty fast after that. Another Power Law Distribution?

Does Anyone Hog the Conversation?

It looks like some characters speak more or less than they are spoken to. Does anyone stand out in particular?

Looks like MASON speaks less than he is spoken to. This probably reflects how much ’splaining Ganymede and Andō have to do once he arrives on stage. At the other extreme, the Major speaks much more than he is spoken to. He can be rather intimidating, after all!
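The measure behind this comparison is not shown in the text. One way to put a number on it, as a sketch with made-up counts, is the share of a character's lines in which they are the speaker. The `speakShare` column and the `toy_characters` table are assumptions for illustration only:

```r
library(dplyr)
library(tidyr)

# Hypothetical counts for three characters
toy_characters <- tibble(
  id       = c("a", "b", "c"),
  speaks   = c(30L, 10L, NA),
  spokenTo = c(10L, 30L, 5L)
)

toy_balance <- toy_characters %>%
  # Characters missing from one list have zero lines on that side
  replace_na(list(speaks = 0L, spokenTo = 0L)) %>%
  # Above 0.5: speaks more than spoken to; below 0.5: the reverse
  mutate(speakShare = speaks / (speaks + spokenTo)) %>%
  arrange(desc(speakShare))

print(toy_balance)
```

A share is bounded between 0 and 1, which avoids the division by zero that a plain speaks-to-spokenTo ratio would hit for characters who are never addressed.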

Who Speaks to Whom?

Ok, looking at the speaker and addressee columns separately is nice, and so is comparing their totals per character… but we can do more than that, right?

What happens if we look at pairs of characters speaking to each other?

Listing All the Conversations

Let’s build a list of the pairs of speakers and addressees with at least one line of dialog. We’ll also keep track of total line and word count.

Group by line, words, speaker, and addressee, then
summarize to drop duplicates that only matter for other columns,
group again by speaker and addressee, and
summarize to calculate the line and word counts.

pair_list <- said  %>%
  # Group & summarize
  # to drop rows irrelevant to speaker or addressee
  group_by(line, speaker, addressee, words) %>%
  summarize(.groups = "drop") %>%
  # Group & summarize
  # to add line and word counts
  group_by(speaker, addressee) %>%
  summarize(
    lines = n(), 
    words = sum(words), 
    .groups = "drop"
  ) %>%
  # Arrange by count
  arrange(desc(lines))

That’s it, it’s that easy!

Here are the top four directed pairs by number of lines, out of 206:

Speaker	Addressee	Lines	Words
Carlyle	Bridger	72	1864
Bridger	Carlyle	71	965
Carlyle	Mycroft	63	745
Mycroft	Carlyle	63	1520

Note that both conversations are balanced by line count but not by word count. The word counts are more like two-to-one, reflecting how one character is explaining things and answering questions for the other.
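The two-to-one figure can be checked from the four rows in the table. This sketch re-enters those rows by hand and joins the table to itself so that each direction sits next to its reverse; the `_back` suffix and the `wordRatio` column are names chosen here for illustration:

```r
library(dplyr)

# The four directed pairs from the table above
top_pairs <- tibble(
  speaker   = c("Carlyle", "Bridger", "Carlyle", "Mycroft"),
  addressee = c("Bridger", "Carlyle", "Mycroft", "Carlyle"),
  lines     = c(72, 71, 63, 63),
  words     = c(1864, 965, 745, 1520)
)

pair_balance <- top_pairs %>%
  # Self-join: match each pair with the same pair in the other direction
  inner_join(
    top_pairs,
    by = c("speaker" = "addressee", "addressee" = "speaker"),
    suffix = c("", "_back")
  ) %>%
  mutate(
    wordsPerLine = words / lines,
    wordRatio    = words / words_back
  )

print(pair_balance)
```

Carlyle directs about 1.93 times as many words at Bridger as Bridger sends back (1864 / 965), and Mycroft about 2.04 times as many at Carlyle as Carlyle returns (1520 / 745).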

Radar Plots

We can draw some radar plots showing the total word count our top four speakers direct at our top six addressees:

See below for the emoji mapping for the addressees. We are missing an emoji for Martin at 11 o’clock.

Everybody speaks to Carlyle, but Carlyle mostly speaks to Mycroft and Bridger. Only Mycroft speaks to the outside world.

This works pretty well for small groups of characters, but it won’t scale well to the 38 speaking parts we have so far, a number that is only going to grow as we add chapters.
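Each radar plot is effectively one row of a speaker-by-addressee table. As a rough sketch of that reshaping step (not the plotting code used for the figures, and with made-up pairs), `pivot_wider()` can spread the word counts into such a table:

```r
library(dplyr)
library(tidyr)

# Hypothetical directed pairs with word counts
toy_pairs <- tibble(
  speaker   = c("a", "a", "b", "c"),
  addressee = c("b", "c", "a", "a"),
  words     = c(100, 40, 60, 20)
)

toy_matrix <- toy_pairs %>%
  # One row per speaker, one column per addressee;
  # pairs with no dialog get a zero
  pivot_wider(
    id_cols     = speaker,
    names_from  = addressee,
    values_from = words,
    values_fill = 0
  )

print(toy_matrix)
```

The table gains one row and one column for every new character, which is another way to see why this view gets unwieldy as the cast grows.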

Network Graph

We have a list of pairs of characters engaged in dialog with a few different measures of how much dialog it was… that’s all we need to build a directed, weighted network graph, but we’ll also pass in our list of characters for convenience.

network_graph <- tbl_graph(
  edges = pair_list,
  nodes = characters,
  node_key = "id"
)

That’s all it takes to make a network graph, which we can then visualize in various ways, e.g. a hairball of characters with at least fifteen spoken lines of dialog:

See below for the emoji mapping for the other characters. We are missing emojis for Lesley, Martin, and Su-Hyeon, going from left to right.

Mycroft is the hinge between several conversations: one dominated by Carlyle, currently taking place at the Saneer-Weeksbooth bash’house; another over at Ganymede’s palace during the Renunciation Day party; and some more isolated pairs around the periphery, like Ockham and Martin or Dominic and Lesley.
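The plotting code for the hairball is not shown here. A minimal sketch of how such a plot could be built, assuming toy nodes and edges shaped like `characters` and `pair_list` (all `toy_` names, the layout, and the styling choices are illustrative assumptions):

```r
library(dplyr)
library(tidygraph)
library(ggraph)

# Hypothetical nodes and directed edges
toy_nodes <- tibble(
  id     = c("a", "b", "c"),
  speaks = c(40L, 20L, 3L)
)
toy_edges <- tibble(
  speaker   = c("a", "b", "a", "c"),
  addressee = c("b", "a", "c", "a"),
  lines     = c(25L, 20L, 15L, 3L)
)

toy_graph <- tbl_graph(nodes = toy_nodes, edges = toy_edges, node_key = "id") %>%
  # Keep characters with at least fifteen spoken lines;
  # edges touching a dropped node are removed with it
  activate(nodes) %>%
  filter(speaks >= 15)

hairball <- ggraph(toy_graph, layout = "fr") +
  # Edge width encodes the number of lines; arrows show direction
  geom_edge_link(
    aes(edge_width = lines),
    arrow   = arrow(length = unit(3, "mm")),
    end_cap = circle(4, "mm")
  ) +
  geom_node_text(aes(label = id)) +
  theme_void()
```

Printing `hairball` draws the plot. Swapping the `label` aesthetic for an emoji column would give something closer to the figure described above, provided the graphics device can render emojis.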

Emoji-to-Character Mapping

Emoji	Name
🙏	Carlyle
✍️	Mycroft
🧸	Bridger
🦨	Thisbe
🧮	Vivien
🌞	Ganymede
NA	Martin
🪒	Ockham
👸	Danaë
🌸	Andō
🤺	Dominic
NA	Lesley
🎯	Sniper
💡	Eureka
💊	Kosala
🪖	the Major
👑	Spain
🧱	MASON
NA	Su-Hyeon

¹ If we use full_join(), any characters that are not on both lists will be kept, but with an NA for the missing count.
