

The Distant Reading Compendium: A virtual edited volume

Stylometry in a Bilingual Setup


Cinková, Silvie, and Jan Rybicki. “Stylometry in a Bilingual Setup”, in Proceedings of The 12th Language Resources and Evaluation Conference, Marseille, France, May 11-16, 2020, ed. by Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, and others (presented at the 12th Language Resources and Evaluation Conference, LREC (2020), European Language Resources Association), 977–984. URL:


Stylometry based on most frequent words does not allow direct comparison of original texts and their translations, i.e. across languages. For instance, in a bilingual Czech-German text collection containing parallel texts (originals and translations in both directions, along with Czech and German translations from other languages), authors would not cluster across languages, since the word frequency lists of any two Czech texts are bound to be more similar to each other than to that of a German text, and vice versa. We have tried to come up with an interlingua that would remove the language-specific features and possibly keep the language-independent features of the individual authorial signal, if they exist. We have tagged, lemmatized, and parsed each language counterpart with the corresponding language model in UDPipe, which provides linguistic markup that is cross-lingual to a significant extent. We stripped the output of language-dependent items, but that alone did not help much. As a next step, we transformed the lemmas of both language counterparts into shared pseudolemmas based on a very crude Czech-German glossary, with a 95.6% success rate. We show that, for stylometric methods based on the most frequent words, we can do without translations.
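The pseudolemma idea described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the toy `GLOSSARY`, the helper names, the choice to drop lemmas missing from the glossary, and the use of Manhattan distance over relative frequencies as a stand-in for whatever distance measure the paper actually applies.

```python
from collections import Counter

# Hypothetical toy glossary (not from the paper): maps a (language, lemma)
# pair to a shared, language-neutral pseudolemma.
GLOSSARY = {
    ("cs", "a"): "AND",
    ("de", "und"): "AND",
    ("cs", "být"): "BE",
    ("de", "sein"): "BE",
    ("cs", "v"): "IN",
    ("de", "in"): "IN",
}


def to_pseudolemmas(lemmas, lang):
    """Replace each lemma with its pseudolemma.

    Assumption: lemmas missing from the glossary are simply dropped.
    """
    return [GLOSSARY[(lang, lemma)] for lemma in lemmas if (lang, lemma) in GLOSSARY]


def relative_frequencies(tokens):
    """Return the relative frequency of each token in the list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {token: count / total for token, count in counts.items()} if total else {}


def manhattan_distance(freq_a, freq_b):
    """Sum of absolute frequency differences over the union of tokens."""
    keys = set(freq_a) | set(freq_b)
    return sum(abs(freq_a.get(k, 0.0) - freq_b.get(k, 0.0)) for k in keys)


# Toy lemma sequences standing in for lemmatizer output in each language.
czech = ["a", "být", "v", "a", "pes"]
german = ["und", "sein", "in", "und", "Hund"]

cs_freq = relative_frequencies(to_pseudolemmas(czech, "cs"))
de_freq = relative_frequencies(to_pseudolemmas(german, "de"))
distance = manhattan_distance(cs_freq, de_freq)
```

Once both texts are expressed in the same pseudolemma vocabulary, their frequency profiles become directly comparable, which is what a most-frequent-words method needs in order to compare texts across languages.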


Stylometry, Multilinguality, Authorship attribution, Translation, Czech-German, POS, Lemmatization, ELTeC



	address = {Marseille},
	title = {Stylometry in a {Bilingual} {Setup}},
	url = {},
	booktitle = {Proceedings of {The} 12th {Language} {Resources} and {Evaluation} {Conference}, {LREC} 2020, {Marseille}, {France}, {May} 11-16, 2020},
	publisher = {European Language Resources Association},
	author = {Cinková, Silvie and Rybicki, Jan},
	editor = {Calzolari, Nicoletta and Béchet, Frédéric and Blache, Philippe and Choukri, Khalid and Cieri, Christopher and Declerck, Thierry and Goggi, Sara and Isahara, Hitoshi and Maegaard, Bente and Mariani, Joseph and Mazo, Hélène and Moreno, Asunción and Odijk, Jan and Piperidis, Stelios},
	year = {2020},
	keywords = {type\_publication},
	pages = {977--984},