Speakers: Jonathan Blaney (Cambridge University), Gabriel Bodard (University of London), Katharine Shields (King's College London)
Sunoikisis Digital Classics session 2
This session follows on from the preceding one on sources of open texts, with a discussion of the importance of text cleaning and data preparation to any digital analysis or other process. We consider different processes that are more or less tolerant of "messy" texts, including poor OCR and similar artefacts, and highlight the importance of including a realistic level of text preparation in your project planning and budget. We look at a few options for cleaning and repairing large quantities of text, before offering a simple tutorial on regular expressions, which can be used to remove repetitive and predictable unwanted features across one or multiple texts, including at massive scale.
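As a rough illustration of the kind of regular-expression cleaning described above, here is a minimal Python sketch. It is not taken from the session materials: the function name `clean_text`, the sample text, and the three patterns (rejoining words hyphenated across line breaks, dropping lines that hold only a page number, collapsing repeated spaces) are assumptions chosen for the example, and real OCR output would need patterns tailored to its own artefacts.

```python
import re


def clean_text(raw: str) -> str:
    """Apply a few illustrative regex clean-ups to OCR-like text (hypothetical helper)."""
    # Rejoin words split by a hyphen at a line break, e.g. "jum-\nped" -> "jumped".
    # Caveat: this also joins genuinely hyphenated words that happen to break there.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)

    # Remove lines that contain only a page number such as "12" or "[Page 12]".
    # Assumption: no meaningful line in the text consists solely of a number.
    text = re.sub(
        r"^[ \t]*\[?(?:page[ \t]+)?\d+\]?[ \t]*$\n?",
        "",
        text,
        flags=re.MULTILINE | re.IGNORECASE,
    )

    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)

    return text.strip()


sample = "The quick  brown fox jum-\nped over\n[Page 12]\nthe lazy dog.\n"
print(clean_text(sample))
```

Because each substitution is a plain function call, the same routine could be applied in a loop over many files, which is what makes this approach usable on large collections.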
Follow live or later at: https://youtu.be/Or-SaNznWz0
Further information, readings and exercise at: https://github.com/SunoikisisDC/SunoikisisDC-2024-2025/wiki/2-Preparing-Texts