A brief look at how many words characters use per line of dialog
Let’s define, for now, verbosity as the tendency to use lots of words each time a character communicates with another.
What visualizations of this concept can we generate from the chapters edited to date, i.e. the first twelve?
To show the distribution of lines of dialog by number of words, we can use a histogram… but we have a problem. Some lines show up more than once in the data, e.g. line 2 in which Mycroft refers by name to both himself and the reader.
line | speaker | person | words |
---|---|---|---|
1 | #Mycroft | #reader | 156 |
2 | #Mycroft | #reader | 155 |
2 | #Mycroft | #Mycroft | 155 |
3 | #Thisbe | NA | 3 |
4 | #Carlyle | #Carlyle | 2 |
5 | #Thisbe | NA | 1 |
So first we need to `group_by()` & `summarize()` by the `line` column, a pattern we’ll need to apply anytime we care about the number of times the line appears in the text.
line | words |
---|---|
1 | 156 |
2 | 155 |
3 | 3 |
4 | 2 |
5 | 1 |
This operation dropped 890 duplicate lines from our data.
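As a minimal sketch of that deduplication, here is the pattern applied to the six sample rows shown above. The `said_sample` data frame is a stand-in built for illustration, and taking the first word count per line is an assumption of this sketch (it relies on duplicate rows of a line always carrying the same count), not necessarily how the original summary was written.

```r
library(dplyr)

# Toy stand-in for the dialog data, copied from the sample rows above.
said_sample <- data.frame(
  line    = c(1, 2, 2, 3, 4, 5),
  speaker = c("#Mycroft", "#Mycroft", "#Mycroft", "#Thisbe", "#Carlyle", "#Thisbe"),
  person  = c("#reader", "#reader", "#Mycroft", NA, "#Carlyle", NA),
  words   = c(156, 155, 155, 3, 2, 1)
)

# One row per line of dialog. Duplicate rows of a line share the same
# word count, so keeping the first value is enough (assumption).
lines_deduped <- said_sample %>%
  group_by(line) %>%
  summarize(words = first(words))

# Number of duplicate rows removed by the operation.
n_dropped <- nrow(said_sample) - nrow(lines_deduped)
```

On this toy sample only one row is dropped, the second copy of line 2.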
We’re now ready to plot our histogram.
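A histogram of words per line could be drawn along these lines with ggplot2. This is only a sketch: the `lines_deduped` toy data, the bin width of 10 words and the axis labels are assumptions, not the settings behind the original figure.

```r
library(ggplot2)

# Toy deduplicated data: one row per line of dialog (illustrative values).
lines_deduped <- data.frame(
  line  = 1:5,
  words = c(156, 155, 3, 2, 1)
)

# Distribution of lines of dialog by number of words.
# binwidth = 10 is an arbitrary choice for this sketch.
verbosity_hist <- ggplot(lines_deduped, aes(x = words)) +
  geom_histogram(binwidth = 10) +
  labs(x = "Words per line of dialog", y = "Number of lines")
```

Printing `verbosity_hist` renders the plot.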
Verbosity overall in these chapters appears to follow a classic power-law distribution with a fairly long tail. This is pretty standard for human communications.
Let’s drill a little deeper to see if the distribution changes for different sets of lines, e.g. by the language spoken [1].
We just group by `line` and `language` before summarizing to remove duplicate lines of dialog.
Note that there is only ever one language per line of dialog, so this operation does remove all duplicate lines. We’ll see later that this is not always the case.
Alright, time to draw some violin-plots!
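Here is a rough sketch of the deduplicate-then-plot sequence for languages. The `said_lang` rows are invented for illustration (only the language name "Mitsubishi Japanese" comes from the text), and using `first(words)` as the summary is an assumption.

```r
library(dplyr)
library(ggplot2)

# Hypothetical rows: line 2 appears twice, but always with the same language.
said_lang <- data.frame(
  line     = c(1, 2, 2, 3, 4, 5, 6),
  language = c("English", "English", "English", "English",
               "Mitsubishi Japanese", "Mitsubishi Japanese",
               "Mitsubishi Japanese"),
  words    = c(12, 40, 40, 5, 8, 52, 47)
)

# One row per line AND language. Since each line has a single language,
# this leaves exactly one row per line.
lines_by_language <- said_lang %>%
  group_by(line, language) %>%
  summarize(words = first(words), .groups = "drop")

# One violin per language, showing the spread of words per line.
verbosity_violin <- ggplot(lines_by_language, aes(x = language, y = words)) +
  geom_violin() +
  labs(x = "Language", y = "Words per line of dialog")
```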
There’s a general trend for well-represented languages to have ‘squatter’ distributions, favoring shorter lines of dialog. This may just be an artifact of small sample size. We’d need to revisit this with a few more chapters of data available.
Also note how languages like Mitsubishi Japanese have a second bulge around 50 words per line, perhaps indicating that conversations fall into two modes depending on context or participants.
Let’s look at how verbose our top chatterboxes are.
To start, we need a list of our top speakers, which we can build in three steps: group by line and speaker to deduplicate, group by speaker and count, then sort by count and keep the top 5.
Note that in the first step, since some lines of dialog have multiple speakers2, there will remain duplicate lines in the data. This is what we want here, since in the following step we will group & count by speaker and we want the duplicate lines to count for their respective speakers.
This gives us the 5 characters with the most lines of dialog. Now we can filter for them before we group our lines by line and speaker so we can plot their distribution.
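The steps described above can be sketched as follows. The `said_toy` rows are made up for illustration: line 1 is duplicated because it has two addressees, and line 3 is shared by two speakers. The column names follow the ones used elsewhere in this post, and `first(words)` is again an assumption.

```r
library(dplyr)

# Hypothetical rows (values invented for this sketch).
said_toy <- data.frame(
  line      = c(1, 1, 2, 3, 3, 4),
  speaker   = c("#Mycroft", "#Mycroft", "#Thisbe", "#Mycroft", "#Thisbe", "#Carlyle"),
  addressee = c("#reader", "#Thisbe", "#Mycroft", NA, NA, "#Thisbe"),
  words     = c(156, 156, 3, 4, 4, 2)
)

top_speakers <- said_toy %>%
  # Step 1: one row per line AND speaker. Lines with several speakers
  # keep one row per speaker, which is what we want for counting.
  group_by(line, speaker) %>%
  summarize(words = first(words), .groups = "drop") %>%
  # Step 2: count lines of dialog per speaker
  group_by(speaker) %>%
  summarize(count = n()) %>%
  # Step 3: sort by count and keep the top 5
  arrange(desc(count)) %>%
  head(5)

# Keep only the top speakers' lines, deduplicated by line and speaker,
# ready for a distribution plot.
top_speaker_lines <- said_toy %>%
  filter(speaker %in% top_speakers$speaker) %>%
  group_by(line, speaker) %>%
  summarize(words = first(words), .groups = "drop")
```

Note that the shared line 3 counts once for each of its two speakers.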
Mycroft really stands out from the others. 300+ words per line of dialog? Which of the characters he speaks to could be on the receiving end of such prolixity?
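To generate a list of the characters Mycroft speaks to the most, we must keep only Mycroft’s lines, deduplicate by line and addressee, count by addressee, and keep the top 5.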
The pattern should start to look familiar by now, no?
```r
library(dplyr)

top_Mycroft_addressees <- said %>%
  # Keep just Mycroft's lines of dialog
  filter(speaker == "#Mycroft") %>%
  # Group by line AND by addressee to deduplicate lines
  group_by(line, addressee) %>%
  summarize(.groups = "drop") %>%
  # Group by addressee and count
  group_by(addressee) %>%
  summarize(count = n()) %>%
  # Arrange by count and keep only the top
  arrange(desc(count)) %>%
  head(5)
```
That gives us the top 5 characters Mycroft speaks to by lines of dialog. Now we can filter for them as addressees and for Mycroft as speaker before we generate our distribution plot.
Ah, of course. Mycroft is much more verbose in his historical & philosophical asides to the reader. Also, notice how Mycroft’s number of words per line shows that bulge near the top when speaking to certain people? It’s particularly marked with Ando & Danaë and may explain the similar bulge we saw in the distribution for Mitsubishi Japanese.
In any case, how does Mycroft’s verbosity compare to the others if we exclude his asides to the reader?
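One way to exclude the asides, sketched under a few assumptions: the reader is coded as `"#reader"` in the `addressee` column (as it is in the `person` column of the sample rows earlier), the `said_toy` rows are invented, and lines with no recorded addressee should be kept.

```r
library(dplyr)

# Hypothetical rows (values invented for this sketch).
said_toy <- data.frame(
  line      = c(1, 2, 3, 4, 5),
  speaker   = c("#Mycroft", "#Mycroft", "#Mycroft", "#Thisbe", "#Mycroft"),
  addressee = c("#reader", "#Thisbe", NA, "#Mycroft", "#reader"),
  words     = c(156, 4, 7, 3, 310)
)

no_asides <- said_toy %>%
  # Keep rows with a missing addressee explicitly: on its own,
  # `addressee != "#reader"` evaluates to NA for them and filter() drops NA.
  filter(is.na(addressee) | addressee != "#reader") %>%
  # Deduplicate by line and speaker, as before
  group_by(line, speaker) %>%
  summarize(words = first(words), .groups = "drop")
```

With this filter, a line addressed to both the reader and another character would still be kept through its other addressee. Whether such lines should count as asides is a judgment call this sketch does not settle.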
Mycroft is still in the top 5 but without his asides to the reader his verbosity distribution looks much more like the others’.
[1] Note this is the language Mycroft tells us the line was in before translation. This allows an apples-to-apples comparison, since the word count is always done in English.
[2] Usually when multiple characters cry out the same thing at the same time, and the text ascribes a single quote to all of them.
For attribution, please cite this work as
Glachant (2022, Oct. 9). Data Ignota: Exploring Verbosity. Retrieved from https://syvwlch.github.io/Data-Ignota/viz/2022-10-09-exploring-verbosity/
BibTeX citation
@misc{glachant2022exploring, author = {Glachant, Mathieu}, title = {Data Ignota: Exploring Verbosity}, url = {https://syvwlch.github.io/Data-Ignota/viz/2022-10-09-exploring-verbosity/}, year = {2022} }