ELTeC

# What is ELTeC all about?

**Christof Schöch (Trier, Germany)**

***
Belgrade Training School, March 22, 2022
 https://distantreading.github.io/eltec-slides/
***
<img data-src="img/basics/tcdh-slim.png" height="50"></img>   <img data-src="img/basics/uni-trier.png" height="50"></img>   <img data-src="img/basics/cost-and-eu.png" height="70"></img>

::
- Happy to be here, thanks for the invite
- Multilingual approaches in CLS are on the rise
- Need to be supported by three things
  - multilingual data
  - methods to handle it
  - people who know the context
- So I believe it is inherently a community effort
- At least, that is the premise of ELTeC, the "European Literary Text Collection"

--
### Overview
1. [What is ELTeC?](#/2)
2. [Composition criteria](#/3)
3. [Encoding principles](#/4)
4. [Publication strategy](#/5)
5. [Usage scenarios](#/6)
6. [Conclusion](#/7)

::
- So I'd like to present ELTeC today, with a focus on multilingualism

--
## (1) What is ELTeC?

---
### ELTeC in context 
* COST Action "Distant Reading for European Literary History" 
 * Research network (31 countries, 200+ researchers)
 * Ambition: Foster digital, cross-lingual research into the history of the European novel
* Areas of activity: 
 * Build a multilingual corpus of the European novel
 * Develop appropriate, digital methods of analysis
 * Reflect on the theoretical consequences
 * Create a network of researchers across Europe
 * Capacity building: training schools, exchanges, joint projects

---
### "European Literary Text Collection"
* A multilingual corpus of the European novel 
 * Time period: 1840-1920 (production, availability, OCR, copyright)
 * At least 10 different languages (currently: 10 complete, 7 more in progress, + extensions)
 * Comparable (!) collections of 100 novels per language
* Key characteristics 
 * Each corpus represents the variety of production
 * Texts encoded in XML-TEI and linguistically-annotated (POS, NE)
 * Everything published under open licences (CC)
* More information 
 * http://www.distant-reading.net/eltec/
 * Latest release: [v1.1.0, April 2021, 14 collections, 1200 novels](https://github.com/COST-ELTeC/ELTeC)

---
### Progress of our work on ELTeC
<img data-src="img/eltec-overview_numnovels.png" height="500"></img>
 See: https://distantreading.github.io/ELTeC/

--
## (2) Eligibility and composition criteria

---
### Eligibility criteria
* Novels, i.e. narrative, fictional prose of a certain length
* Minimum length: 10,000 words
* Novels written originally in the language of the collection
* Novels first (or at least also) published in Europe

---
### Composition criteria
* Objectives 
 * Comparability of the collections
 * Represent the diversity of the novel production
 * Go beyond the canon (and the usual collections)
* Criteria taken into account 
 * Period of publication: 1840-59, 1860-79, 1880-99, 1900-1919 
 * Length of the text: short (10-50k), medium (50-100k), long (100k+) 
 * Author gender: male, female, diverse/mixed
 * Reprint count, 1970-2010: low (0-1), high (2+) 
 * Number of novels per author: 3 (9-11 authors), 1 (otherwise)
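
---
#### Sketch: binning a novel by the criteria

The bins listed above can be expressed as a small classifier. This is only a minimal sketch: the function names are illustrative and not part of any ELTeC tooling, and two choices are assumptions of this sketch, namely that a text of exactly 50k (or 100k) words falls into the higher class and that a year outside the four listed periods gets no label.

```python
# Bin boundaries are taken from the slide; all names here are illustrative.
PERIODS = [
    (1840, 1859, "1840-59"),
    (1860, 1879, "1860-79"),
    (1880, 1899, "1880-99"),
    (1900, 1919, "1900-1919"),
]


def period_of(year):
    """Return the publication period label, or None outside the listed periods."""
    for start, end, label in PERIODS:
        if start <= year <= end:
            return label
    return None


def length_class(words):
    """Return short/medium/long, or None below the 10,000-word minimum."""
    if words < 10_000:
        return None
    if words < 50_000:      # assumption: exactly 50k counts as medium
        return "short"
    if words < 100_000:     # assumption: exactly 100k counts as long
        return "medium"
    return "long"


def reprint_class(reprints):
    """Return 'low' for 0-1 reprints (1970-2010), 'high' for 2 or more."""
    return "low" if reprints <= 1 else "high"


def classify(year, words, reprints):
    """Bundle the three bins for one novel into a dictionary."""
    return {
        "period": period_of(year),
        "length": length_class(words),
        "reprints": reprint_class(reprints),
    }


if __name__ == "__main__":
    print(classify(year=1861, words=75_000, reprints=0))
```

Counting how many novels fall into each combination of bins gives a quick view of how balanced a collection is.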

---
#### Composition of the collections

|ELTeC-eng|ELTeC-rom|
|:---:|:---:|
|<img data-src="img/mosaic-eng.svg" height="400">|<img data-src="img/mosaic-rom.svg" height="400">|
|100 novels, EC5: 100, excellent balance|100 novels, EC5: 83, balance difficult to obtain|
|||

---
#### The diversity paradox
<a href="img/eltec-overview_paradox.png"><img data-src="img/eltec-overview_paradox.png" height="400"></img></a>

* Three objectives
  * Comparability of the corpora (enabled by strict criteria)
  * Diversity of texts within a corpus (enforced by strict criteria)
  * Diversity of languages in ELTeC (supported by loose criteria)

--
## (3) Encoding principles

---
### Three levels of encoding
* Everything is encoded in XML-TEI (of course!?) 
* There is a common header for metadata 
* Three levels of encoding 
 * Level 0: minimal TEI encoding (metadata + `div`, `p`, `hi`)
 * Level 1: semantic TEI encoding (`foreign`, `emph` etc.)
 * Level 2: TEI with token-level linguistic annotation (UPos, NE)
* Controlled by a set of schemas 
 * Schemas connected via "ODD chaining" (see Burnard et al. 2021)
 * Validation with RelaxNG and Schematron
 * Validation socially enforced by Lou
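
---
#### Sketch: inspecting a level-0 document

To make the level-0 idea concrete, the following sketch counts the elements in the body of a TEI document and flags any that are not among the three named on the slide (`div`, `p`, `hi`). It is only an illustration under stated assumptions: the sample document is a toy and not schema-valid, the three-element set is a simplification of what the actual schema permits, and the check is no substitute for RelaxNG and Schematron validation. Only the TEI namespace URI is standard.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Elements named on the slide for level 0; the real schema may permit more.
LEVEL0_NAMED = {"div", "p", "hi"}

# Toy document for illustration only (not a valid ELTeC file).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Sample</title></titleStmt></fileDesc></teiHeader>
  <text><body>
    <div type="chapter">
      <p>First paragraph with <hi>highlighted</hi> words.</p>
      <p>Second paragraph with a <foreign>mot</foreign>.</p>
    </div>
  </body></text>
</TEI>"""


def count_body_elements(xml_string):
    """Count elements inside <body> by local name (namespace stripped)."""
    root = ET.fromstring(xml_string)
    body = root.find(f".//{{{TEI_NS}}}body")
    counts = Counter()
    if body is None:
        return counts
    for element in body.iter():
        if element is body:
            continue
        counts[element.tag.split("}", 1)[-1]] += 1
    return counts


def beyond_level0(counts):
    """Return element names that are not in the slide's level-0 set."""
    return sorted(set(counts) - LEVEL0_NAMED)


if __name__ == "__main__":
    counts = count_body_elements(SAMPLE)
    print(dict(counts))
    print("Beyond level 0:", beyond_level0(counts))
```

In the toy document, the `foreign` element is the one that goes beyond the level-0 set, which matches the slide's description of level 1.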

---
### Metadata
* Composition criteria (see above) 
* Provenance 
 * digital source
 * print source
 * first edition
* Type of novel (optional) 
 * Subgenre of the novel
 * Narrative perspective
* Textual characteristics 
 * Language
 * Orthography (original vs. modernised)
 * Alphabet (Latin, Cyrillic, transition)
 * Encoding level (see above)

::
- Nothing special from the point of view of digital editing
- But more detailed than what is standard in collection building

--
## (4) Publication strategy

---
### Publication strategy
* For the needs of the project 
 * Space for collaboration (XML): [Github](https://github.com/cost-eltec)
 * Publication of 'releases' with DOI (XML): Github + [Zenodo](https://zenodo.org/communities/eltec/)
 * Overview (HTML, XML): [Github.io](https://distantreading.github.io/ELTeC/)
* Distribution platforms (beyond Zenodo): 
 * [TEI Publisher](https://teipublisher.com/exist/apps/eltec/index.html)
 * [GAMS](http://glossa.uni-graz.at/context:eltec)
 * [TextGrid Rep](https://dev.textgridrep.org/browse/3tg6g.0)
* Further publication formats 
 * Packages for usage with analysis tools like TXM or Antconc
 * Publication via analysis platforms like CWB, TXM Portal, NoSketchEngine etc.

---
#### Github
<img data-src="img/eltec_github.png" height="500"></img>

https://github.com/cost-eltec

---
#### Zenodo
<img data-src="img/eltec_zenodo.png" height="500"></img>

https://zenodo.org/communities/eltec/

---
#### TEI Publisher
<img data-src="img/eltec_teip.png" height="500"></img>

https://teipublisher.com/exist/apps/eltec/index.html

---
#### GAMS (Graz)
<img data-src="img/gams.png" height="500"></img>

https://glossa.uni-graz.at/archive/objects/context:eltec/methods/sdef:Context/get?mode=home#

--
## (5) Usage scenarios

---
### Some scenarios
* Shared objectives 
 * Adapt existing statistical methods to the multiple European languages
 * Evaluate these methods in a multilingual context (beyond eng, deu, fra)
* Some examples 
 * Linguistic annotation: Cinková et al. 2020
 * Annotation of Named Entities: Frontini et al. 2020
 * Identification of direct speech: Byszuk et al. 2020 
 * Title analysis: Patras et al. 2021
 * Stylometric Authorship Attribution: Schöch et al.

---
### Identification of direct speech
<img data-src="img/byszuk-2020.png" height="400"></img>

* Key results
  * "Multilingual sentence embeddings" clearly surpass baseline
  * Performance: F1 score of about 0.89 for all nine languages

---
### Title analysis
<img data-src="img/patras-2021_annotation.png" width="500"></img> 
 <img data-src="img/patras-2021_lengths.png" width="500"></img>

---
### Stylometry: Dendrograms
<a href="img/ELTeC-fra_eders-d_1000.png"><img height="500" data-src="img/ELTeC-fra_eders-d_1000.png"></a>     <a href="img/ELTeC-rom_eders-d_1000.png"><img height="500" data-src="img/ELTeC-rom_eders-d_1000.png"></a>
 ELTeC-fra                 ELTeC-rom

---
### Stylometry: Evaluation
<a href="img/results_ELTeC-hun.svg"><img height="200" data-src="img/delta-hun.png"></img></a>   <a href="img/results_ELTeC-fra.svg"><img height="200" data-src="img/delta-fra.png"></img></a> <a href="img/results_ELTeC-rom.svg"><img height="200" data-src="img/delta-rom.png"></img></a>   <a href="img/results_ELTeC-slv.svg"><img height="200" data-src="img/delta-slv.png"></img></a> (Currently: deu, eng, fra, hun, por, rom, slv)

--
## Conclusion

---
### So, what is ELTeC?
* A multilingual resource, of course 
* A learning opportunity regarding collaborative research 
* A rallying point for a European, multilingual community 
* A foundation for the development of cross-lingual methods 
* A modest start for a history of European literature that would be truly digital, multilingual and diverse

---
#### Final Action Conference
<img data-src="img/conference.png" height="500"></img>

https://www.distant-reading.net/events/conference-programme/

::
- Final Action Conference
- Building, Annotating, Analysing ELTeC
- Free participation!

---
### Thank you!
<img height="500" data-src="img/danke.png">

---
### References

	
* Creation of ELTeC (selection)
 * Lou Burnard, Christof Schöch, Carolin Odebrecht: “In Search of Comity: TEI for Distant Reading”, in: _Journal of the Text Encoding Initiative_, 2021. https://doi.org/10.4000/jtei.3500
 * Christof Schöch, Roxana Patraș, Diana Santos, Tomaž Erjavec: “Creating the European Literary Text Collection (ELTeC): Challenges and Perspectives”, in: _Modern Languages Open_ 2021. http://doi.org/10.3828/mlo.v0i0.364
 * Cinková, Silvie, Tomaž Erjavec, Cláudia Freitas, et al., ‘Evaluation of Taggers for 19th-Century Fiction’, in DH_Budapest_2019, <http://elte-dh.hu/dh_budapest_2019-abstract-booklet/>
 * Frontini, Francesca, Carmen Brando, Joanna Byszuk et al., ‘Named Entity Recognition for Distant Reading’, in CLARIN Annual Conference 2020 Proceedings, pp. 27–41 <https://office.clarin.eu/v/CE-2020-1738-CLARIN2020_ConferenceProceedings.pdf>
 * Stanković, Ranka, Cvetana Krstev, Branislava Šandrih Todorović, and Mihailo Škorić. 2021. “Annotation of the Serbian ELTeC Collection”. Infotheca 21 (2): 43–59. https://doi.org/10.18485/infotheca.2021.21.2.3.

* Usage of ELTeC (selection)
 * Cinková, Silvie, and Jan Rybicki, ‘Stylometry in a Bilingual Setup’, in Proceedings of LREC 2020, pp. 977–984 <https://www.aclweb.org/anthology/2020.lrec-1.123/>
 * Byszuk, Joanna, Michał Woźniak, Mike Kestemont et al. ‘Detecting Direct Speech in Multilingual Collection of 19th Century Novels’, in Proceedings of LT4HALA 2020, pp. 100–104 <https://lrec2020.lrec-conf.org/media/proceedings/Workshops/Books/LT4HALAbook.pdf>
 * Mihurko-Poniž, Katja, Rosario Arias, J. Berenike Herrmann et al. ‘Thresholds to the “Great Unread”: Titling Practices across Multilingual Collections of European Novels’, Day of DH 2021, <https://www.youtube.com/watch?v=fMtkwCxkzfw>.
 * Krstev, Cvetana. 2021. “White as Snow, Black as Night – Similes in Old Serbian Literary Texts”. Infotheca 21 (2): 119–36. https://doi.org/10.18485/infotheca.2021.21.2.6.
 * Škorić, Mihailo, Ranka Stanković, Milica Ikonić Nešić, Joanna Byszuk, and Maciej Eder. 2022. “Parallel Stylometric Document Embeddings with Deep Learning Based Language Models in Literary Authorship Attribution”. Mathematics 10 (5): 838. https://doi.org/10.3390/math10050838.