The Training School was intended to give participants hands-on experience in creating ELTeC TEI XML versions of source texts, starting either from scanned page images or from a pre-existing HTML version. We supplied a set of raw materials for participants to work on, along with detailed instructions (see links below). By the end of the Training School, each participant was able to contribute new TEI-encoded texts to the ELTeC GitHub repository.
On Monday, participants learned how to use oXygen for encoding texts in TEI XML, and were introduced to the sampling and balance criteria as well as other metadata requirements for ELTeC. On Tuesday, after an introduction to the principles of OCR, they worked with various transformation scenarios for creating TEI XML texts, starting first from plain text files and then from two formats produced by ABBYY OCR: one in XML, the other in DOCX. We introduced them informally to the key technologies of regular expressions and XPath, and to some of the many features of the oXygen XML editing environment. In the closing session on Wednesday, participants presented what they had achieved and discussed the outcomes of this and other parallel workshops.
The Training School was intended for any researcher interested in contributing texts to the ELTeC. Some previous experience of computer use was needed, but no knowledge of TEI XML was assumed.
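To give a flavour of the regular-expression and XPath work mentioned above, here is a minimal sketch in Python rather than in the oXygen transformation scenarios used at the Training School. It is not the workshop's actual procedure. The heading pattern (`Chapter` followed by a number), the function name `plain_text_to_tei_body`, and the sample text are all assumptions made for illustration; real source texts need their own patterns, and the result would still need a TEI Header and validation against the ELTeC schema.

```python
import re
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Assumed heading convention: a block consisting only of "Chapter <number>".
HEADING = re.compile(r"chapter\s+\d+\.?", re.IGNORECASE)


def plain_text_to_tei_body(text):
    """Build a TEI <body> element from plain text (illustrative only)."""
    ET.register_namespace("", TEI_NS)
    body = ET.Element(f"{{{TEI_NS}}}body")
    div = None
    # Blank lines separate blocks (headings or paragraphs).
    for block in re.split(r"\n\s*\n", text.strip()):
        # Collapse line breaks and runs of whitespace inside a block.
        block = re.sub(r"\s+", " ", block).strip()
        if not block:
            continue
        if HEADING.fullmatch(block):
            # A heading opens a new chapter-level division.
            div = ET.SubElement(body, f"{{{TEI_NS}}}div", {"type": "chapter"})
            ET.SubElement(div, f"{{{TEI_NS}}}head").text = block
        else:
            if div is None:
                # Text before the first heading goes into an untitled division.
                div = ET.SubElement(body, f"{{{TEI_NS}}}div", {"type": "chapter"})
            ET.SubElement(div, f"{{{TEI_NS}}}p").text = block
    return body


# Hypothetical sample input, not one of the workshop texts.
sample = """Chapter 1

It was a dark
and stormy night.

The rain fell.

Chapter 2

Morning came."""

body = plain_text_to_tei_body(sample)

# XPath-style queries (ElementTree supports a limited XPath subset).
ns = {"tei": TEI_NS}
chapters = body.findall("tei:div[@type='chapter']", ns)
paragraphs = body.findall(".//tei:p", ns)
print(len(chapters), "chapters,", len(paragraphs), "paragraphs")
print(ET.tostring(body, encoding="unicode"))
```

The regular expressions do the segmentation (splitting on blank lines, recognising headings), while the XPath queries at the end act as a quick sanity check on the resulting structure, much as one might do interactively in an XML editor.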
Training School Schedule | ||||
Monday 23 September 2019 | ||||
14:00 | Introduction to oXygen and TEI XML | Basic TEI XML structure and how to use oXygen (HTML slides) | Martina | |
16:15 | Introduction to ELTeC | What ELTeC is about: its design and goals (PDF slides) | Carolin | |
16:45 | Getting started | First practical: creating an ELTeC-conformant TEI Header with oXygen (HTML slides; PDF version) | Lou | |
Tuesday 24 September 2019 | ||||
09:00 | Introduction to OCR with OCR4all | Practical and theoretical discussion of OCR; comparison between ABBYY and OCR4all; practical exercise in training a model (PDF slides) | Christian | |
11:15 | Handling plain text files | Second practical: everyone chooses a different Polish text (HTML slides; PDF version). See also The ELTeC Header check list | Lou | |
13:00 | Lunch break | |||
14:00 | Text formats and how to handle them | Talk about formats and tools (HTML slides) | Lou | |
14:45 | Working with ABBYY outputs | Third practical: everyone chooses a text from the list to convert either from ABBYY XML format or from DOCX format (slides; PDF version). See also The ELTeC encoding check list | Lou | |
Wednesday 25 September 2019 | ||||
09:00 | Concluding session | Participants worked on completing their encodings and created a short presentation of the outputs of the workshop; completed texts were uploaded to the GitHub repository. | Everyone |