# What is ELTeC all about?
**Christof Schöch (Trier, Germany)**
*** Belgrade Training School, March 22, 2022
https://distantreading.github.io/eltec-slides/ ***
::
- Happy to be here, thanks for the invite
- Multilingual approaches in CLS are on the rise
- Need to be supported by three things
    - multilingual data
    - methods to handle it
    - people who know the context
- So I believe it is inherently a community effort
- At least, that is the premise of ELTeC, the "European Literary Text Collection"

--

### Overview

1. [What is ELTeC?](#/2)
2. [Composition criteria](#/3)
3. [Encoding principles](#/4)
4. [Publication strategy](#/5)
5. [Usage scenarios](#/6)
6. [Conclusion](#/7)

::
- So I'd like to present ELTeC today, with a focus on multilingualism

--

## (1) What is ELTeC?

---

### ELTeC in context

* COST Action "Distant Reading for European Literary History"
    * Research network (31 countries, 200+ researchers)
    * Ambition: Foster digital, cross-lingual research into the history of the European novel
* Areas of activity:
    * Build a multilingual corpus of the European novel
    * Develop appropriate digital methods of analysis
    * Think through the theoretical consequences
    * Create a network of researchers across Europe
    * Capacity building: training schools, exchanges, joint projects

---

### "European Literary Text Collection"

* A multilingual corpus of the European novel
    * Time period: 1840-1920 (production, availability, OCR, copyright)
    * At least 10 different languages (currently: 10 complete, 7 more in progress, + extensions)
    * Comparable (!) collections of 100 novels per language
* Key characteristics
    * Each corpus represents the variety of production
    * Texts encoded in XML-TEI and linguistically annotated (POS, NE)
    * Everything published under open licences (CC)
* More information
    * http://www.distant-reading.net/eltec/
    * Latest release: [v1.1.0, April 2021, 14 collections, 1200 novels](https://github.com/COST-ELTeC/ELTeC)

---

### Progress of our work on ELTeC

See: https://distantreading.github.io/ELTeC/

--

## (2) Eligibility and composition criteria

---

### Eligibility criteria

* Novels, i.e. narrative, fictional prose of a certain length
    * Minimum length: 10,000 words
* Novels written originally in the language of the collection
* Novels first (or at least also) published in Europe

---

### Composition criteria

* Objectives
    * Comparability of the collections
    * Represent the diversity of the novel production
    * Go beyond the canon (and the usual collections)
* Criteria taken into account
    * Period of publication: 1840-59, 1860-79, 1880-99, 1900-1919
    * Length of the text: short (10-50k), medium (50-100k), long (100k+)
    * Author gender: male, female, diverse/mixed
    * Reprint count, 1970-2010: low (0-1), high (2+)
    * Number of novels per author: 3 (9-11 authors), 1 (otherwise)

---

#### Composition of the collections

|ELTeC-eng|ELTeC-rom|
|:---:|:---:|
|100 novels; EC5 100; excellent balance|100 novels; EC5 83; balance difficult to obtain|
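
The composition criteria listed earlier (period of publication, text length, reprint count) amount to a simple binning of each novel. The following is a minimal sketch of that binning, not the project's actual tooling. The `Novel` record, its field names and the function names are hypothetical, and the handling of values that fall exactly on a boundary (for example 50,000 words counted as "medium") is an assumption, since the slides only give the ranges.

```python
from dataclasses import dataclass


@dataclass
class Novel:
    """Hypothetical record; field names are illustrative, not ELTeC metadata names."""
    title: str
    year: int      # year of first publication
    words: int     # length of the text in words
    reprints: int  # reprint count, 1970-2010


def period(year: int) -> str:
    """Map a publication year to one of the four 20-year periods."""
    if not 1840 <= year <= 1919:
        raise ValueError(f"year {year} is outside 1840-1919")
    start = 1840 + 20 * ((year - 1840) // 20)
    return f"{start}-{start + 19}"


def length_class(words: int) -> str:
    """Short (10-50k), medium (50-100k) or long (100k+); boundaries are assumed."""
    if words < 10_000:
        raise ValueError("below the minimum length of 10,000 words")
    if words < 50_000:
        return "short"
    if words < 100_000:
        return "medium"
    return "long"


def reprint_class(reprints: int) -> str:
    """Low (0-1 reprints) or high (2+ reprints)."""
    return "low" if reprints <= 1 else "high"


def describe(novel: Novel) -> tuple[str, str, str]:
    """Return the (period, length, reprint) categories for one novel."""
    return period(novel.year), length_class(novel.words), reprint_class(novel.reprints)


# Invented example values, for illustration only.
example = Novel("Hypothetical novel", year=1873, words=62_000, reprints=3)
print(describe(example))  # ('1860-1879', 'medium', 'high')
```

Counting how many novels of a collection fall into each category would then show how far that collection is from the balance targeted by the criteria.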

---

#### The diversity paradox

* Three objectives
    * Comparability of the corpora (enabled by strict criteria)
    * Diversity of texts within a corpus (enforced by strict criteria)
    * Diversity of languages in ELTeC (supported by loose criteria)

--

## (3) Encoding principles

---

### Three levels of encoding

* Everything is encoded in XML-TEI (of course!?)
* There is a common header for metadata
* Three levels of encoding
    * Level 0: minimal TEI encoding (metadata + `div`, `p`, `hi`)
    * Level 1: semantic TEI encoding (`foreign`, `emph` etc.)
    * Level 2: TEI with token-level linguistic annotation (UPos, NE)
* Controlled by a set of schemas
    * Schemas connected via "ODD chaining" (see Burnard et al. 2021)
    * Validation with RelaxNG and Schematron
    * Validation socially enforced by Lou

---

### Metadata

* Composition criteria (see above)
* Provenance
    * digital source
    * print source
    * first edition
* Type of novel (optional)
    * Subgenre of the novel
    * Narrative perspective
* Textual characteristics
    * Language
    * Orthography (original vs. modernised)
    * Alphabet (Latin, Cyrillic, transition)
    * Encoding level (see above)

::
- Nothing special from the point of view of digital editing
- But more detailed than what is standard in collection building

--

## (4) Publication strategy

---

### Publication strategy

* For the needs of the project
    * Space for collaboration (XML): [GitHub](https://github.com/cost-eltec)
    * Publication of 'releases' with DOI (XML): GitHub + [Zenodo](https://zenodo.org/communities/eltec/)
    * Overview (HTML, XML): [Github.io](https://distantreading.github.io/ELTeC/)
* Distribution platforms (beyond Zenodo):
    * [TEI Publisher](https://teipublisher.com/exist/apps/eltec/index.html)
    * [GAMS](http://glossa.uni-graz.at/context:eltec)
    * [TextGrid Rep](https://dev.textgridrep.org/browse/3tg6g.0)
* Further publication formats
    * Packages for usage with analysis tools like TXM or AntConc
    * Publication via analysis platforms like CWB, TXM Portal, NoSketchEngine etc.

---

#### GitHub

https://github.com/cost-eltec

---

#### Zenodo

https://zenodo.org/communities/eltec/

---

#### TEI Publisher

https://teipublisher.com/exist/apps/eltec/index.html

---

#### GAMS (Graz)

https://glossa.uni-graz.at/archive/objects/context:eltec/methods/sdef:Context/get?mode=home#

--

## (5) Usage scenarios

---

### Some scenarios

* Shared objectives
    * Adapt existing statistical methods to many European languages
    * Evaluate these methods in a multilingual context (beyond eng, deu, fra)
* Some examples
    * Linguistic annotation: Cinková et al. 2020
    * Annotation of Named Entities: Frontini et al. 2020
    * Identification of direct speech: Byszuk et al. 2020
    * Title analysis: Patras et al. 2021
    * Stylometric Authorship Attribution: Schöch et al.

---

### Identification of direct speech

* Key results
    * "Multilingual sentence embeddings" clearly surpass baseline
    * Performance: F1 score ~ 0.89 for all nine languages

---

### Title analysis

---

### Stylometry: Dendrograms

ELTeC-fra ELTeC-rom

---

### Stylometry: Evaluation

(Currently: deu, eng, fra, hun, por, rom, slv)

--

## Conclusion

---

### So, what is ELTeC?

* A multilingual resource, of course
* A learning opportunity regarding collaborative research
* A rallying point for a European, multilingual community
* A foundation for the development of cross-lingual methods
* A modest start for a history of European literature that would be truly digital, multilingual and diverse

---

#### Final Action Conference

https://www.distant-reading.net/events/conference-programme/

::
- Final Action Conference
- Building, Annotating, Analysing ELTeC
- Free participation!

---

### Thank you!

---

### References

* Creation of ELTeC (selection)
    * Lou Burnard, Christof Schöch, Carolin Odebrecht: “In Search of Comity: TEI for Distant Reading”, in: _Journal of the Text Encoding Initiative_, 2021. https://doi.org/10.4000/jtei.3500
    * Christof Schöch, Roxana Patraș, Diana Santos, Tomaž Erjavec: “Creating the European Literary Text Collection (ELTeC): Challenges and Perspectives”, in: _Modern Languages Open_ 2021. http://doi.org/10.3828/mlo.v0i0.364
    * Cinková, Silvie, Tomaž Erjavec, Cláudia Freitas, et al., ‘Evaluation of Taggers for 19th-Century Fiction’, in DH_Budapest_2019.
    * Frontini, Francesca, Carmen Brando, Joanna Byszuk et al., ‘Named Entity Recognition for Distant Reading’, in CLARIN Annual Conference 2020 Proceedings, pp. 27–41.
    * Stanković, Ranka, Cvetana Krstev, Branislava Šandrih Todorović, and Mihailo Škorić. 2021. “Annotation of the Serbian ELTeC Collection”. Infotheca 21 (2): 43–59. https://doi.org/10.18485/infotheca.2021.21.2.3.
* Usage of ELTeC (selection)
    * Cinková, Silvie, and Jan Rybicki, ‘Stylometry in a Bilingual Setup’, in Proceedings of LREC 2020, pp. 977–984.
    * Byszuk, Joanna, Michał Woźniak, Mike Kestemont et al. ‘Detecting Direct Speech in Multilingual Collection of 19th Century Novels’, in Proceedings of LT4HALA 2020, pp. 100–104.
    * Mihurko-Poniž, Katja, Rosario Arias, J. Berenike Herrmann et al. ‘Thresholds to the “Great Unread”: Titling Practices across Multilingual Collections of European Novels’, Day of DH 2021.
    * Krstev, Cvetana. 2021. “White as Snow, Black as Night – Similes in Old Serbian Literary Texts”. Infotheca 21 (2): 119–36. https://doi.org/10.18485/infotheca.2021.21.2.6.
    * Škorić, Mihailo, Ranka Stanković, Milica Ikonić Nešić, Joanna Byszuk, and Maciej Eder. 2022. “Parallel Stylometric Document Embeddings with Deep Learning Based Language Models in Literary Authorship Attribution”. Mathematics 10 (5): 838. https://doi.org/10.3390/math10050838.