Data Ignota: Exploring Verbosity

Mathieu Glachant

Let’s define, for now, verbosity as the tendency to use lots of words each time a character communicates with another.

What visualizations of this concept can we generate from the chapters edited to date, i.e. the first twelve?

library(tidyverse)
library(gt)

PATH_TO_SAID <- "../../data/said.csv"

said <- read.csv(PATH_TO_SAID) %>%
  filter(
    book == 1,
    chapter <= 12
  )

Overall Verbosity

To show the distribution of lines of dialog by number of words, we can use a histogram… but we have a problem. Some lines show up more than once in the data, e.g. line 2 in which Mycroft refers by name to both himself and the reader.

line	speaker	person	words
1	#Mycroft	#reader	156
2	#Mycroft	#reader	155
2	#Mycroft	#Mycroft	155
3	#Thisbe	NA	3
4	#Carlyle	#Carlyle	2
5	#Thisbe	NA	1

So first we need to group_by() & summarize() by the line column, a pattern we’ll need whenever each line of dialog should count exactly once, no matter how many rows it occupies in the data.

lines <- said %>%
  # Group by line
  group_by(line) %>%
  # Keep one row per line; duplicate rows share a word count, so mean() returns it unchanged
  summarize(words = mean(words), .groups = "drop")

line	words
1	156
2	155
3	3
4	2
5	1

This operation dropped 890 duplicate lines from our data.
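
One way to arrive at a figure like this is to compare row counts before and after the summary. Here is a minimal sketch on a hypothetical toy data frame: `toy_said` and its values are invented for illustration and only mirror the columns shown above, so the count it produces is not the 890 from the real data.

```r
library(dplyr)

# Hypothetical toy data mirroring the first rows of `said` shown above
toy_said <- data.frame(
  line    = c(1, 2, 2, 3, 4, 5),
  speaker = c("#Mycroft", "#Mycroft", "#Mycroft", "#Thisbe", "#Carlyle", "#Thisbe"),
  person  = c("#reader", "#reader", "#Mycroft", NA, "#Carlyle", NA),
  words   = c(156, 155, 155, 3, 2, 1)
)

# One row per line, same pattern as above
toy_lines <- toy_said %>%
  group_by(line) %>%
  summarize(words = mean(words), .groups = "drop")

# Number of rows removed by the deduplication
n_dropped <- nrow(toy_said) - nrow(toy_lines)
n_dropped
```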

We’re now ready to plot our histogram.
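
The plotting code is not shown here, so what follows is only a sketch of how such a histogram could be drawn with ggplot2. The `toy_lines` values, the bin width, and the axis labels are all assumptions made for illustration.

```r
library(ggplot2)

# Hypothetical stand-in for the deduplicated `lines` table
toy_lines <- data.frame(
  line  = 1:8,
  words = c(156, 155, 3, 2, 1, 12, 7, 40)
)

# Histogram of words per line; the bin width of 10 is an arbitrary choice
p_hist <- ggplot(toy_lines, aes(x = words)) +
  geom_histogram(binwidth = 10) +
  labs(x = "Words per line of dialog", y = "Number of lines")

# print(p_hist) renders the plot
```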

Overall verbosity in these chapters looks like a classic power-law distribution with a fairly long tail: most lines are short, and a few run very long. This shape is pretty standard for human communication.
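
A histogram alone can’t confirm a power law. As a rough sanity check, we could count how many lines have each word count and fit a straight line on log-log axes. The sketch below does this on invented `toy_lines` values; a roughly straight, downward-sloping fit is suggestive only, not proof.

```r
library(dplyr)

# Hypothetical word counts, invented for illustration
toy_lines <- data.frame(
  line  = 1:12,
  words = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 4, 4, 8)
)

# How many lines have each word count
freq <- toy_lines %>%
  count(words, name = "n_lines")

# Linear fit on log-log axes; a power law would appear as a straight line
fit <- lm(log(n_lines) ~ log(words), data = freq)
slope <- unname(coef(fit)[2])
slope
```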

Verbosity by Language

Let’s drill a little deeper to see if the distribution changes for different sets of lines, e.g. by the language spoken¹.

We just group by line and language before summarizing to remove duplicate lines of dialog.

lines_by_lang <- said %>%
  # Group by line AND by language
  group_by(line, language) %>%
  # Summarize and drop everything else except the words per line
  summarize(words = mean(words), .groups = "drop")

Note that there is only ever one language per line of dialog, so this operation does remove all duplicate lines. We’ll see later that this is not always the case.
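
This claim is easy to check by counting distinct languages per line. A minimal sketch, assuming a `language` column as used above; the toy values are invented:

```r
library(dplyr)

# Hypothetical toy data; the language values are invented
toy_said <- data.frame(
  line     = c(1, 2, 2, 3),
  language = c("English", "English", "English", "French"),
  words    = c(156, 155, 155, 3)
)

# Count the distinct languages recorded for each line
langs_per_line <- toy_said %>%
  group_by(line) %>%
  summarize(n_languages = n_distinct(language), .groups = "drop")

# TRUE when no line carries more than one language
one_language_per_line <- all(langs_per_line$n_languages == 1)
one_language_per_line
```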

Alright, time to draw some violin plots!
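
The plotting code is not shown here, so this is only a sketch of one way to draw violin plots with ggplot2. The toy table `toy_lines_by_lang`, its language values, and the axis labels are assumptions for illustration.

```r
library(ggplot2)

# Hypothetical stand-in for `lines_by_lang`
toy_lines_by_lang <- data.frame(
  line     = 1:8,
  language = rep(c("English", "French"), each = 4),
  words    = c(156, 155, 3, 2, 1, 12, 7, 40)
)

# One violin per language, showing the distribution of words per line
p_violin <- ggplot(toy_lines_by_lang, aes(x = language, y = words)) +
  geom_violin() +
  labs(x = "Language", y = "Words per line of dialog")

# print(p_violin) renders the plot
```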

There’s a general trend for well-represented languages to have ‘squatter’ distributions, favoring shorter lines of dialog. This may just be an artifact of the small sample sizes for the other languages. We’d need to revisit this once a few more chapters of data are available.

Also note how languages like Mitsubishi Japanese have a second bulge at around 50 words per line. Perhaps conversations fall into two modes depending on context or participants?

Verbosity by Character

Let’s look at how verbose our top chatterboxes are.

To start, we need a list of our top speakers, which we can build with the following steps:

Group & summarize by line and speaker to get one row per line and speaker,
Group & summarize again by speaker and keep the count, and
Arrange by count and keep the top ones.

Note that in the first step, since some lines of dialog have multiple speakers², there will remain duplicate lines in the data. This is what we want here, since in the following step we will group & count by speaker and we want the duplicate lines to count for their respective speakers.

top_speakers <- said %>%
  # Group by line AND by speaker to deduplicate lines
  group_by(line, speaker) %>%
  summarize(.groups = "drop") %>%
  # Group by speaker and count
  group_by(speaker) %>%
  summarize(count = n()) %>%
  # Arrange by count and keep only the top
  arrange(desc(count)) %>%
  head(5)

This gives us the 5 characters with the most lines of dialog. Now we can filter for them before we group our lines by line and speaker so we can plot their distribution.
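
As an aside, the same group-and-count steps can be written more compactly with dplyr’s `distinct()` and `count()`. This is an alternative sketch, not the code used in this post; the toy data is invented, and line 6 stands in for a line shared by two speakers.

```r
library(dplyr)

# Hypothetical toy data; line 6 is shared by two speakers
toy_said <- data.frame(
  line    = c(1, 2, 2, 3, 4, 5, 6, 6, 7, 8),
  speaker = c("#Mycroft", "#Mycroft", "#Mycroft", "#Thisbe", "#Carlyle",
              "#Thisbe", "#Thisbe", "#Carlyle", "#Thisbe", "#Mycroft"),
  words   = c(156, 155, 155, 3, 2, 1, 4, 4, 6, 30)
)

toy_top_speakers <- toy_said %>%
  # One row per line and speaker
  distinct(line, speaker) %>%
  # Lines per speaker, sorted from most to fewest
  count(speaker, name = "count", sort = TRUE) %>%
  head(2)

toy_top_speakers
```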

lines_by_speaker <- said %>%
  # Filter for top speakers by number of lines
  filter(speaker %in% top_speakers$speaker) %>%
  # Group by line and speaker
  group_by(line, speaker) %>%
  summarize(words = mean(words), .groups = "drop")
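
Before plotting, a quick numeric summary per speaker can serve as a sanity check on the distributions. A minimal sketch on an invented stand-in for `lines_by_speaker`; the summary columns `n_lines`, `median_words` and `max_words` are names chosen here for illustration.

```r
library(dplyr)

# Hypothetical stand-in for `lines_by_speaker`
toy_lines_by_speaker <- data.frame(
  line    = 1:6,
  speaker = c("#Mycroft", "#Mycroft", "#Thisbe", "#Carlyle", "#Thisbe", "#Carlyle"),
  words   = c(156, 155, 3, 2, 1, 10)
)

# Lines, median and maximum words per speaker, most verbose first
speaker_summary <- toy_lines_by_speaker %>%
  group_by(speaker) %>%
  summarize(
    n_lines      = n(),
    median_words = median(words),
    max_words    = max(words),
    .groups      = "drop"
  ) %>%
  arrange(desc(median_words))

speaker_summary
```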

Mycroft really stands out from the others. 300+ words per line of dialog? Which of the characters he speaks to could be on the receiving end of such prolixity?

To generate a list of characters Mycroft speaks to the most, we must:

Filter for lines that Mycroft speaks,
Group & summarize by line and addressee to get one row per line and addressee,
Group & summarize again by addressee and keep the count, and
Arrange by count and keep the top ones.

The pattern should start to look familiar by now, no?

top_Mycroft_addressees <- said %>%
  # Keep just Mycroft's lines of dialog
  filter(speaker == "#Mycroft") %>%
  # Group by line AND by addressee to deduplicate lines
  group_by(line, addressee) %>%
  summarize(.groups = "drop") %>%
  # Group by addressee and count
  group_by(addressee) %>%
  summarize(count = n()) %>%
  # Arrange by count and keep only the top
  arrange(desc(count)) %>%
  head(5)

That gives us the top 5 characters Mycroft speaks to by lines of dialog. Now we can filter for them as addressees and for Mycroft as speaker before we generate our distribution plot.
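
Since the same group-and-count pattern has now shown up twice, it could be wrapped in a small helper. The function `top_by_lines()` below is hypothetical and not part of this post’s code, and the toy data, including the `#Ando` label, is invented for illustration.

```r
library(dplyr)

# Hypothetical helper: top `n` values of `col` by number of distinct lines
top_by_lines <- function(data, col, n = 5) {
  data %>%
    # One row per line and value of `col`
    distinct(line, {{ col }}) %>%
    # Lines per value, sorted from most to fewest
    count({{ col }}, name = "count", sort = TRUE) %>%
    head(n)
}

# Hypothetical toy data; line 2 appears twice, as in the real data
toy_said <- data.frame(
  line      = c(1, 2, 2, 3, 4, 5),
  speaker   = rep("#Mycroft", 6),
  addressee = c("#reader", "#reader", "#reader", "#Ando", "#reader", "#Ando"),
  words     = c(156, 155, 155, 20, 90, 50)
)

toy_top_addressees <- toy_said %>%
  filter(speaker == "#Mycroft") %>%
  top_by_lines(addressee, n = 2)

toy_top_addressees
```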

lines_by_Mycroft_addressee <- said %>%
  # Filter for lines between Mycroft and his top addressees
  filter(
    speaker == "#Mycroft",
    addressee %in% top_Mycroft_addressees$addressee
  ) %>%
  # Group by line and addressee
  group_by(line, addressee) %>%
  summarize(words = mean(words), .groups = "drop")

Ah, of course. Mycroft is much more verbose in his historical & philosophical asides to the reader. Also, notice how Mycroft’s number of words per line shows that bulge near the top when speaking to certain people? It’s particularly marked with Ando & Danaë and may explain the similar bulge we saw in the distribution for Mitsubishi Japanese.

In any case, how does Mycroft’s verbosity compare to the others if we exclude his asides to the reader?

lines_by_speaker <- said %>%
  # Filter for top speakers by number of lines
  # but drop the asides to the reader this time
  filter(
    speaker %in% top_speakers$speaker,
    addressee != "#reader"
  ) %>%
  # Group by line and speaker
  group_by(line, speaker) %>%
  summarize(words = mean(words), .groups = "drop")

Mycroft is still in the top 5 but without his asides to the reader his verbosity distribution looks much more like the others’.
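
One caveat about the filter above: in R, `NA != "#reader"` evaluates to `NA`, and `filter()` drops rows where the condition is `NA`. If any lines in the data have no recorded addressee (an assumption, not something verified here), they would be excluded along with the asides to the reader. A sketch of the difference on invented toy data:

```r
library(dplyr)

# Hypothetical toy data; line 3 has no recorded addressee
toy_said <- data.frame(
  line      = 1:4,
  speaker   = c("#Mycroft", "#Mycroft", "#Thisbe", "#Carlyle"),
  addressee = c("#reader", "#Ando", NA, "#Thisbe"),
  words     = c(156, 20, 3, 2)
)

# `!=` yields NA for the missing addressee, so filter() drops that row too
strict <- toy_said %>%
  filter(addressee != "#reader")

# Keep rows with a missing addressee explicitly
na_safe <- toy_said %>%
  filter(is.na(addressee) | addressee != "#reader")

nrow(strict)
nrow(na_safe)
```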

¹ Note this is the language Mycroft tells us the line was in before translation. This allows an apples-to-apples comparison, since the word count is always done in English.
² Usually when multiple characters cry out the same thing at the same time, and the text ascribes a single quote to all of them.
