Data Ignota: Data Cleaning and Tidying

Mathieu Glachant

Data quality was the theme today:

Thanks to the handy table in said post, I was able to go back to the Digital Edition in the private repo and fix all of the missing attributes.
I also explained in the post how the raw XML extracted from the Digital Edition is cleaned and tidied before being saved to the provided said.csv file.
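
As a rough illustration of that kind of cleaning and tidying step, here is a minimal sketch in R. It is not the code from the post: the element name `said`, the attribute names `speaker` and `addressee`, and the sample XML are all assumptions made for the example, and it relies on the xml2 package.

```r
library(xml2)

# Hypothetical stand-in for the raw XML; the real Digital Edition
# lives in a private repo and may be structured differently.
raw_xml <- '<chapter n="1">
  <said speaker="A" addressee="B">  First   line
    of speech. </said>
  <said speaker="B">Reply.</said>
</chapter>'

doc <- read_xml(raw_xml)
utterances <- xml_find_all(doc, ".//said")

# One row per utterance, one column per variable.
tidy <- data.frame(
  speaker   = xml_attr(utterances, "speaker"),    # NA when the attribute is missing
  addressee = xml_attr(utterances, "addressee"),
  # Trim the ends, then collapse internal runs of whitespace and line breaks.
  text      = gsub("\\s+", " ", xml_text(utterances, trim = TRUE)),
  stringsAsFactors = FALSE
)

# Rows with a missing attribute are candidates to fix in the source.
missing_attrs <- tidy[is.na(tidy$speaker) | is.na(tidy$addressee), ]

# Written to a temporary directory here; the output path is an assumption.
write.csv(tidy, file.path(tempdir(), "said.csv"), row.names = FALSE)
```

Listing the rows with `NA` attributes is one way to produce a table of what still needs fixing in the source.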

I also started a conversation about using emojis to label main characters.

What It Took to Get Here

Clarifications:
- Fixed all missing speaker and addressee attributes in chapters 1 through 12
- Showed data cleaning & tidying:
  - Added a section on Data Cleaning & Tidying,
    - Moved the relevant R code chunks and made them visible, and
    - Did a general readability editing pass on the post.

Enhancements:
- Mapping main characters to emojis:
  - Would be great to have a persistent mapping of main characters to emojis, for visualizations that will not have a lot of room to label them, e.g. network graphs.
  - Started a discussion to track suggestions, including from the Reading Group on Signal.
  - Started an issue to track implementation.
  - Research how emojis would work in R, e.g. https://github.com/hadley/emo
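
One possible shape for such a persistent mapping, sketched in base R only. The character names, the emoji choices, the `emoji_label` helper, and the file name are all hypothetical placeholders, not decisions from the discussion or the issue. Emojis are written as Unicode escapes so the script itself stays plain ASCII.

```r
# Hypothetical names and emoji choices, for illustration only.
character_emojis <- c(
  "Character A" = "\U0001F9ED",  # compass
  "Character B" = "\U0001F451",  # crown
  "Character C" = "\U0001F4DC"   # scroll
)

# Look up labels, falling back to the plain name when no emoji is mapped.
emoji_label <- function(name, mapping = character_emojis) {
  label <- unname(mapping[name])
  ifelse(is.na(label), name, label)
}

labels <- emoji_label(c("Character A", "Someone Else"))

# Persist the mapping as a small CSV so every visualization reuses it.
# A temporary directory is used here; the real location is an open choice.
mapping_file <- file.path(tempdir(), "character_emojis.csv")
write.csv(
  data.frame(character = names(character_emojis),
             emoji = unname(character_emojis)),
  mapping_file, row.names = FALSE, fileEncoding = "UTF-8"
)
reloaded <- read.csv(mapping_file, fileEncoding = "UTF-8",
                     stringsAsFactors = FALSE)
```

Keeping the mapping in one small file means a network graph and any other plot would pick up the same labels, and a character without an agreed emoji still shows up under their plain name.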

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Citation

For attribution, please cite this work as

Glachant (2022, Oct. 6). Data Ignota: Data Cleaning and Tidying. Retrieved from https://syvwlch.github.io/Data-Ignota/posts/2022-10-06-data-cleaning-and-tidying/

BibTeX citation

@misc{glachant2022data,
  author = {Glachant, Mathieu},
  title = {Data Ignota: Data Cleaning and Tidying},
  url = {https://syvwlch.github.io/Data-Ignota/posts/2022-10-06-data-cleaning-and-tidying/},
  year = {2022}
}