Corpus design and text contribution for ELTeC

Budapest Training School, 23-25 September 2019

The Training School was intended to give participants hands-on experience in creating ELTeC TEI-XML versions of source texts starting from scanned page images or from a pre-existing HTML version. We supplied a set of raw materials for candidates to work on, along with detailed instructions (see links below). At the end of the Training School, each participant was able to contribute new TEI encoded texts to the ELTeC GitHub repository.

On Monday, participants learned how to use oXygen for encoding texts in TEI XML, and were introduced to the sampling and balance criteria as well as other metadata requirements for ELTeC. On Tuesday, after an introduction to the principles of OCR, they worked with various transformation scenarios for creating TEI XML texts, starting first from plain text files, and then from two formats produced by ABBYY OCR: one in XML, and the other in DOCX. We introduced them informally to the key technologies of regular expressions and XPath, and to some of the many features of the oXygen XML editing environment. In the closing session on Wednesday, participants presented what they had achieved and discussed the outcomes of this and other parallel workshops.
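To give a flavour of the plain-text scenario, here is a minimal sketch in Python. It is not the workflow used at the Training School, which relied on oXygen transformation scenarios. It assumes a hypothetical input convention in which every chapter begins with a line starting with "Chapter" and paragraphs are separated by blank lines. The function name `plain_text_to_tei_body` and the sample text are invented for illustration. Regular expressions do the splitting, and the XPath subset supported by the standard library is used to check the result.

```python
import re
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
NS = {"tei": TEI_NS}


def plain_text_to_tei_body(text):
    """Wrap plain text in a minimal TEI <body> (hypothetical helper).

    Assumption: each chapter starts with a line beginning with "Chapter",
    and paragraphs are separated by one or more blank lines.
    """
    ET.register_namespace("", TEI_NS)
    body = ET.Element(f"{{{TEI_NS}}}body")
    # Split at the start of every line that begins with "Chapter".
    chunks = re.split(r"^(?=Chapter\s+\S+)", text.strip(), flags=re.MULTILINE)
    for chunk in chunks:
        if not chunk.strip():
            continue
        # The first line of each chunk is taken as the chapter heading.
        head, _, rest = chunk.partition("\n")
        div = ET.SubElement(body, f"{{{TEI_NS}}}div", {"type": "chapter"})
        ET.SubElement(div, f"{{{TEI_NS}}}head").text = head.strip()
        for para in re.split(r"\n\s*\n", rest.strip()):
            if para.strip():
                p = ET.SubElement(div, f"{{{TEI_NS}}}p")
                # Collapse hard line breaks and runs of spaces inside a paragraph.
                p.text = " ".join(para.split())
    return body


SAMPLE = """Chapter 1

First paragraph,
wrapped over two lines.

Second paragraph.

Chapter 2

Only paragraph.
"""

if __name__ == "__main__":
    body = plain_text_to_tei_body(SAMPLE)
    # XPath-style queries (ElementTree supports a limited subset of XPath).
    print(len(body.findall("tei:div[@type='chapter']", NS)), "chapters")
    print(len(body.findall(".//tei:p", NS)), "paragraphs")
    print(ET.tostring(body, encoding="unicode"))
```

The output is only a body fragment: a contribution to ELTeC would also need a conformant TEI Header, and any text before the first chapter heading would need separate handling.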

The Training School was intended for any researcher interested in contributing texts to the ELTeC. Some previous experience of computer use was needed, but no knowledge of TEI XML was assumed.

Training School Schedule
Monday 23 September 2019
14:00	Introduction to oXygen and TEI XML	Basic TEI XML structure and how to use oXygen (HTML slides)	Martina
16:15	Introduction to ELTeC	What ELTeC is about: its design and goals (PDF slides)	Carolin
16:45	Getting started	First practical: creating an ELTeC-conformant TEI Header with oXygen (HTML slides; PDF version)	Lou
Tuesday 24 September 2019
09:00	Introduction to OCR with OCR4all	Practical and theoretical discussion of OCR; comparison between ABBYY and OCR4all; practical exercise in training a model (PDF slides)	Christian
11:15	Handling plain text files	Second practical: everyone chooses a different Polish text (HTML slides; PDF version). See also the ELTeC Header check list	Lou
13:00	Lunch break
14:00	Text formats and how to handle them	Talk about formats and tools (HTML slides)	Lou
14:45	Working with ABBYY outputs	Third practical: everyone chooses a text from the list to convert either from ABBYY XML format or from DOCX format (slides; PDF version). See also the ELTeC encoding check list	Lou
Wednesday 25 September 2019
09:00	Concluding session	Participants worked on completing their encodings and created a short presentation of the outputs of the workshop; completed texts were uploaded to the GitHub repository.	Everyone

Trainers: Christian Reul, Carolin Odebrecht, Martina Scholger, and Lou Burnard
Participants: Rezearta Murati (University of Shkodër "Luigj Gurakuqi", Albania); Pia Geißel (University of Trier, Germany); Mau Zsófia Ábrahám (Budapest University, Hungary); Tamas Biro (Eötvös Loránd University, Hungary); Katinka Rózsa (Hungary); Sudipt Subhankar (Hungary); Vincas Grigas (Vilnius University, Lithuania); Arūnas Gudinavičius (Vilnius University, Lithuania); Nikolche Mickoski (Macedonian Academy of Sciences and Arts, Macedonia); Camelia Gradinaru (UAIC, Romania); Luiza Marinescu (Romania); Emanuel Modoc (Romania); Ágnes Kocsis (Hungary)
Location: Eötvös Loránd University, Budapest