Detecting direct speech in multilingual collection of 19th century novels

Joanna Byszuk; Michał Woźniak; Mike Kestemont; Albert Leśniak; Wojciech Łukasik; Artjoms Šeļa; Maciej Eder

The Distant Reading Compendium: A virtual edited volume

Reference

Byszuk, Joanna, Michał Woźniak, Mike Kestemont, Albert Leśniak, Wojciech Łukasik, Artjoms Šeļa, and others. “Detecting Direct Speech in Multilingual Collection of 19th Century Novels”, in Proceedings of LT4HALA 2020-1st Workshop on Language Technologies for Historical and Ancient Languages, ed. by Rachele Sprugnoli and Marco Passarotti (presented at the LT4HALA 2020: 1st Workshop on Language Technologies for Historical and Ancient Languages, Paris: European Language Resources Association (ELRA), 2020), 100–104. URL: https://lrec2020.lrec-conf.org/media/proceedings/Workshops/Books/LT4HALAbook.pdf

Abstract

Fictional prose can be broadly divided into narrative and discursive forms with direct speech being central to any discourse representation (alongside indirect reported speech and free indirect discourse). This distinction is crucial in digital literary studies and enables interesting forms of narratological or stylistic analysis. The difficulty of automatically detecting direct speech, however, is currently underestimated. Rule-based systems that work reasonably well for modern languages struggle with (the lack of) typographical conventions in 19th-century literature. While machine learning approaches to sequence modeling can be applied to solve the task, they typically face a severe skewness in the availability of training material, especially for lesser resourced languages. In this paper, we report the result of a multilingual approach to direct speech detection in a diverse corpus of 19th-century fiction in 9 European languages. The proposed method fine-tunes a transformer architecture with multilingual sentence embedder on a minimal amount of annotated training in each language, and improves performance across languages with ambiguous direct speech marking, in comparison to a carefully constructed regular expression baseline.

Keywords

Direct speech recognition, Multilingual, 19th century novels, Deep learning, Transformer, BERT, ELTeC

BibTeX

@inproceedings{byszuk_detecting_2020,
	address = {Paris},
	title = {Detecting direct speech in multilingual collection of 19th century novels},
	isbn = {979-10-95546-53-5},
	url = {https://lrec2020.lrec-conf.org/media/proceedings/Workshops/Books/LT4HALAbook.pdf},
	abstract = {Fictional prose can be broadly divided into narrative and discursive forms with direct speech being central to any discourse representation (alongside indirect reported speech and free indirect discourse). This distinction is crucial in digital literary studies and enables interesting forms of narratological or stylistic analysis. The difficulty of automatically detecting direct speech, however, is currently underestimated. Rule-based systems that work reasonably well for modern languages struggle with (the lack of) typographical conventions in 19th-century literature. While machine learning approaches to sequence modeling can be applied to solve the task, they typically face a severe skewness in the availability of training material, especially for lesser resourced languages. In this paper, we report the result of a multilingual approach to direct speech detection in a diverse corpus of 19th-century fiction in 9 European languages. The proposed method fine-tunes a transformer architecture with multilingual sentence embedder on a minimal amount of annotated training in each language, and improves performance across languages with ambiguous direct speech marking, in comparison to a carefully constructed regular expression baseline.},
	booktitle = {Proceedings of {LT4HALA} 2020-1st {Workshop} on {Language} {Technologies} for {Historical} and {Ancient} {Languages}},
	publisher = {European Language Resources Association (ELRA)},
	author = {Byszuk, Joanna and Woźniak, Michał and Kestemont, Mike and Leśniak, Albert and Łukasik, Wojciech and Šeļa, Artjoms and Eder, Maciej},
	editor = {Sprugnoli, Rachele and Passarotti, Marco},
	month = may,
	year = {2020},
	keywords = {type\_publication},
	pages = {100--104},
}