Corpus design and text contribution for ELTeC

Budapest Training School, 23-25 September 2019

The Training School was intended to give participants hands-on experience in creating ELTeC TEI XML versions of source texts, starting either from scanned page images or from a pre-existing HTML version. We supplied a set of raw materials for participants to work on, along with detailed instructions (see links below). By the end of the Training School, each participant was able to contribute new TEI-encoded texts to the ELTeC GitHub repository.

On Monday, participants learned how to use oXygen for encoding texts in TEI XML, and were introduced to the sampling and balance criteria as well as other metadata requirements for ELTeC. On Tuesday, after an introduction to the principles of OCR, they worked with various transformation scenarios for creating TEI XML texts, starting first from plain text files and then from two formats produced by ABBYY OCR: one in XML, the other in DOCX. We introduced them informally to the key technologies of regular expressions and XPath, and to some of the many features of the oXygen XML editing environment. In the closing session on Wednesday, participants presented what they had achieved and discussed the outcomes of this and other parallel workshops.
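As an illustration of how regular expressions and XPath fit together in the plain-text scenario, the following is a minimal sketch in Python rather than in oXygen, and is not the transformation used at the Training School. It assumes that paragraphs in the plain text are separated by blank lines; the function name `plain_text_to_tei` is hypothetical, and the header it produces is skeletal and would not validate against the ELTeC schema.

```python
import re
import xml.etree.ElementTree as ET

# The TEI namespace; every TEI element lives in it.
TEI_NS = "http://www.tei-c.org/ns/1.0"


def tei(name):
    """Return the namespace-qualified tag name for a TEI element."""
    return f"{{{TEI_NS}}}{name}"


def plain_text_to_tei(text, title):
    """Wrap blank-line-separated paragraphs in a skeletal TEI document.

    Hypothetical helper for illustration only: the header is far too
    sparse to satisfy the ELTeC metadata requirements.
    """
    root = ET.Element(tei("TEI"))
    header = ET.SubElement(root, tei("teiHeader"))
    file_desc = ET.SubElement(header, tei("fileDesc"))
    title_stmt = ET.SubElement(file_desc, tei("titleStmt"))
    ET.SubElement(title_stmt, tei("title")).text = title

    body = ET.SubElement(ET.SubElement(root, tei("text")), tei("body"))

    # Regular expression 1: one or more blank lines separate paragraphs.
    for chunk in re.split(r"\n\s*\n", text.strip()):
        # Regular expression 2: collapse line breaks and runs of
        # whitespace inside a paragraph into single spaces.
        paragraph = re.sub(r"\s+", " ", chunk).strip()
        if paragraph:
            ET.SubElement(body, tei("p")).text = paragraph
    return root


if __name__ == "__main__":
    sample = "First line\nof paragraph one.\n\n\nSecond   paragraph."
    doc = plain_text_to_tei(sample, "Example title")

    # XPath (the subset ElementTree supports): all <p> elements at any depth.
    for p in doc.findall(".//tei:p", {"tei": TEI_NS}):
        print(p.text)

    # Serialize with TEI as the default namespace.
    ET.register_namespace("", TEI_NS)
    print(ET.tostring(doc, encoding="unicode"))
```

The same two steps (pattern-based clean-up, then structural queries to check the result) can be carried out interactively in an XML editor; the script form is only meant to make the logic explicit.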

The Training School was intended for any researcher interested in contributing texts to the ELTeC. Some previous experience of computer use was needed, but no knowledge of TEI XML was assumed.

Training School Schedule
Monday 23 September 2019
14:00  Introduction to oXygen and TEI XML. Basic TEI XML structure and how to use oXygen (HTML slides). Presenter: Martina
16:15  Introduction to ELTeC. What ELTeC is about: its design and goals (PDF slides). Presenter: Carolin
16:45  Getting started. First practical: creating an ELTeC-conformant TEI Header with oXygen (HTML slides; PDF version). Presenter: Lou

Tuesday 24 September 2019
09:00  Introduction to OCR with OCR4all. Practical and theoretical discussion of OCR; comparison between ABBYY and OCR4all; practical exercise in training a model (PDF slides). Presenter: Christian
11:15  Handling plain text files. Second practical: everyone chooses a different Polish text (HTML slides; PDF version). See also The ELTeC Header check list. Presenter: Lou
13:00  Lunch break
14:00  Text formats and how to handle them. Talk about formats and tools (HTML slides). Presenter: Lou
14:45  Working with ABBYY outputs. Third practical: everyone chooses a text from the list to convert either from ABBYY XML format or from DOCX format (slides; PDF version). See also The ELTeC encoding check list. Presenter: Lou

Wednesday 25 September 2019
09:00  Concluding session. Participants worked on completing their encodings and created a short presentation of the outputs of the workshop; completed texts were uploaded to the GitHub repository. Presenter: Everyone
Trainers: Christian Reul, Carolin Odebrecht, Martina Scholger, and Lou Burnard

Participants: Rezearta Murati (University of Shkodër "Luigj Gurakuqi", Albania); Pia Geißel (University of Trier, Germany); Mau Zsófia Ábrahám (Budapest University, Hungary); Tamas Biro (Eötvös Loránd University, Hungary); Katinka Rózsa (Hungary); Sudipt Subhankar (Hungary); Vincas Grigas (Vilnius University, Lithuania); Arūnas Gudinavičius (Vilnius University, Lithuania); Nikolche Mickoski (Macedonian Academy of Sciences and Arts, Macedonia); Camelia Gradinaru (UAIC, Romania); Luiza Marinescu (Romania); Emanuel Modoc (Romania); Ágnes Kocsis (Hungary)

Venue: Eötvös Loránd University, Budapest