Sampling criteria for the ELTeC

1. Task

The task for WG1 is to develop guidelines for data and metadata for the creation of the ELTeC. This task can be split into several distinct subtasks: guidelines for corpus design, for basic annotation and metadata schemes, and for the workflow. This discussion paper focuses on corpus design and metadata, because the two interact closely.

The goal of CA16204 is to create a large benchmark corpus of literature from 1840-1920 (first period) for different computational distant reading methods for corpus annotation and analysis. WG1's task of creating annotation guidelines needs to be closely communicated and coordinated with WG2 in order to know which methods and tools need which kind(s) of annotation model and format. The same holds for the development of the metadata scheme.

For creating such a benchmark corpus, we need a corpus design which allows for comparability of texts and individual sub-collections according to different metadata sets. It should be possible for every COST Action member to sample sub-collections from the ELTeC for specific tasks and research questions. In a first step, we focus on the development of clear, operationalized, transparent and motivated selection criteria for the corpus.

It is important to stress that by defining selection criteria for ELTeC we do not intend to define what a novel is. The category 'novel' may be delimited by three groups of criteria, of which at least one core criterion must be met: a) textual: length (>10,000 words), prose, fiction, narrative structure; b) peritextual: the term 'novel' (or its equivalent) appears in the title or subtitle of the text; and c) contextual: the text is bibliographically listed under UDC 82-31 (Novels. Full-length stories).
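
The three criterion groups can be sketched as a simple predicate. This is a minimal illustration only; the field names are hypothetical, not part of the ELTeC metadata scheme:

```python
def is_novel_candidate(text):
    """Return True if at least one of the three criterion groups is met.

    `text` is a dict with hypothetical, illustrative keys.
    """
    # a) textual: length, prose, fiction, narrative structure
    textual = (
        text.get("word_count", 0) > 10_000
        and text.get("is_prose", False)
        and text.get("is_fiction", False)
        and text.get("has_narrative_structure", False)
    )
    # b) peritextual: 'novel' (or equivalent) in title or subtitle
    peritextual = text.get("novel_in_title_or_subtitle", False)
    # c) contextual: listed under UDC 82-31
    contextual = text.get("udc") == "82-31"
    return textual or peritextual or contextual
```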

2. Method

We follow a non-normative, metadata-based approach to sampling criteria within an overall corpus-design framework. Corpus sampling criteria are usually oriented towards the research question and/or the context of the group creating the corpus. In CA16204, we have neither a single research question nor a fixed, previously known group of corpus creators. The research context of the Action is interested in knowledge production in a methodological sense and does not privilege a single method, model or theory. Furthermore, the membership of the Action will fluctuate and consists of researchers from different disciplines with different theoretical and cultural contexts. Thus, we need to build the corpus design on a methodical basis. With this method, we will also be able to select canonical texts, but not exclusively.

Representativeness is an ideal which we would like to pursue but which cannot be achieved as a whole. We will therefore aim to represent the variety of a population. In line with the MoU, the ELTeC will be designed as a monitor corpus to which texts (from different languages and periods) can be added over time. We then need to decide how each criterion is balanced and how it interplays with the other criteria.

3. Objectives of sampling criteria

According to the MoU, the corpus design should be balanced with respect to language and publication date of the texts. This means that the corpus should not be based solely on chronological criteria, i.e. simply taking a text from each year of the period in question. The main sampling criterion 'language' requires that translations not be included at all.

We prefer to take the first edition of a novel, or later editions of it. We also prefer the book edition of a novel over versions printed only in journals, unless a particular literary tradition only features novels printed in serial format. If we consider editions of a novel, these editions should be freely available (under free licences that permit reuse). The first edition is more interesting from a philological point of view: it represents the authentic text of the author. Dealing with historical texts might require some cleaning up or normalization. We will merge all word forms which are separated by line breaks. At the moment, we must assume that there are no (sufficiently good) normalization tools for every language. Later editions of a novel may already be normalized in some way. This might lead to different text representations in ELTeC, which should be indicated in the metadata.
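
Merging word forms separated by line breaks can be sketched with a simple regular-expression pass. This is an assumption-laden sketch: real texts need language-aware handling (e.g. genuine compounds in German that keep their hyphen):

```python
import re

def merge_linebreak_hyphenation(text):
    """Rejoin word forms split across line breaks by a hyphen.

    Minimal sketch: 'sepa-\\nrated' -> 'separated'. Hyphens that are
    not at a line break are left untouched.
    """
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)
```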

Considering later freely available editions of a novel as well has two advantages: first, members of the Action can already provide machine-readable documents (HTML, TEI etc.) of later editions; second, in some languages it may be easier to find later editions which already exist in machine-readable form, so that we do not have to put effort into digitizing them.

Electronic availability should not be a leading sampling criterion, although availability is a limiting factor. A text should not be excluded from ELTeC because it is not digitized, but it should be excluded if it cannot be made freely available in ELTeC. If we used availability as the only selection criterion, we would risk merely copying projects such as Project Gutenberg. The issue remains of finding additional funds to digitize non-canonical books. Until then, the solution is to create pilot corpora (which can later be supplemented or substituted by an alternative) for literatures that do not have a significant number of digitized texts.

We then need additional criteria which can be applied without having to know (read) the texts in question; they should be checkable without deep knowledge of the texts. Otherwise, this would run counter to the goal of the whole Action and the methodical approach of distant reading. The criteria should be operationalizable, i.e. decidable from text metadata. Here, we define text metadata in a wider scope than classical bibliographical metadata alone. In this way, corpus design interacts with metadata: some of a text's metadata can be used as sampling criteria. These are the text-external and text-internal criteria (cf. Hunston 2008) on which we then need to rely. The selection may be assisted by bibliographical overviews (wherever available) for each language, in order to avoid possible canon-derived bias.

We suggest using an online table as a means of collecting nominations for inclusion in the ELTeC but other methods are feasible.

4. Sampling criteria

Creating a language collection involves two steps, both defined in this document: first, selection (identifying text candidates); second, balancing (determining proportions within the corpus).

The following principles apply:

4.1. Eligibility criteria

In order to be considered for inclusion, a text must...

The MoU defines the languages to be sampled. It does not propose distinguishing regional variation (e.g. in German), nor geographical variation (e.g. the French spoken in Belgium, France, or Switzerland). It assumes only European varieties, so English excludes US English; French excludes Quebecois.

We follow a language-based approach (not country-based). This means for example that we include Swiss German texts in the German language collection. We prefer standard varieties over dialect varieties if sampling criteria for text candidates are met.

4.2. Composition criteria

This section briefly summarizes the classification criteria applied, and also the ideal target proportions of titles to be included for each category within the balanced collection.

Date: 1840 to 1920 (first iteration)

We will divide into four groups:

  • group A (1840-1859): code T1
  • group B (1860-1879): code T2
  • group C (1880-1899): code T3
  • group D (1900-1920): code T4

Each time slot should be represented and should contain at least 20% of the total number of titles.
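
The mapping from publication year to time-slot code can be written down directly; a minimal sketch:

```python
def time_slot(year):
    """Map a publication year in 1840-1920 to its ELTeC time-slot code."""
    if 1840 <= year <= 1859:
        return "T1"  # group A
    if 1860 <= year <= 1879:
        return "T2"  # group B
    if 1880 <= year <= 1899:
        return "T3"  # group C
    if 1900 <= year <= 1920:
        return "T4"  # group D
    raise ValueError(f"{year} is outside the 1840-1920 sampling period")
```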

Reprint count

We propose to use the number of times a work has been reprinted during a specific period as an objective measure of its reception. We count the number of reprints attested during the period 1970-2010 according to Worldcat or a relevant national library catalogue and classify texts as either:

  • low: less than 2 reprints
  • high: 2 or more reprints

Note that we do not include digitizations of texts in the reprint count.

At least 30% of titles should be classified as "high"; at least 30% should be classified as "low".
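
The classification and the 30% floor on each class can be sketched as follows (illustrative helper names, not part of any ELTeC tooling):

```python
def reprint_class(reprints):
    """Classify by reprints attested 1970-2010 (digitizations excluded)."""
    return "high" if reprints >= 2 else "low"

def reprint_balance_ok(classes):
    """Check that 'high' and 'low' each cover at least 30% of titles."""
    n = len(classes)
    return (classes.count("high") >= 0.3 * n
            and classes.count("low") >= 0.3 * n)
```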

Author gender

We use the following three categories for actual (not claimed) author gender:

  • male
  • female
  • mixed (undefined or more than one author)

At least 10% and at most 50% of the titles should have a female author.
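
The gender quota is a simple proportion check; a sketch (the function name is illustrative):

```python
def female_quota_ok(genders):
    """At least 10% and at most 50% of titles by a female author."""
    share = genders.count("female") / len(genders)
    return 0.10 <= share <= 0.50
```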

Author title count

The number of titles per author should be controlled. Ideally, no author should be represented by more than three titles. We count:

  • the number of authors represented by a single title
  • the number of authors represented by exactly three novels

No less than 9 and no more than 11 authors should be represented by exactly three novels; all other authors should be represented by a single title only.
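
This constraint (9-11 authors with exactly three titles, all others with exactly one) can be checked mechanically; a minimal sketch:

```python
from collections import Counter

def author_counts_ok(authors):
    """9-11 authors with exactly three titles; everyone else with one."""
    counts = Counter(authors)
    triples = sum(1 for c in counts.values() if c == 3)
    singles = sum(1 for c in counts.values() if c == 1)
    # every author must fall into one of the two groups
    return 9 <= triples <= 11 and triples + singles == len(counts)
```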


Length

We classify titles by their length as follows:

  • short (10k-50k word tokens)
  • medium (50k-100k word tokens)
  • long (>100k word tokens)

Each length category should be represented and should contain at least 20% of the total number of titles.
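
The length classification can be sketched as below. Note an assumption: the boundary values (exactly 50k or 100k tokens) are assigned to the lower class here, since the document does not specify how ties are handled:

```python
def length_class(tokens):
    """Classify a title by its word-token count."""
    if tokens < 10_000:
        # below the eligibility threshold for a novel in ELTeC
        raise ValueError("below the 10k-token minimum")
    if tokens <= 50_000:
        return "short"
    if tokens <= 100_000:
        return "medium"
    return "long"
```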

Since it is an open issue how to classify novels by topic and in particular since different languages do not share the same terminology for the concept, we do not use the topic or type of novel as a sampling criterion. Texts should however provide metadata about their genre, using appropriate keywords in the descriptive metadata.

5. Metadata for texts in ELTeC

We list here some examples of the metadata items to be collected for each text. These will be provided by specific components of the TEI Header structure.

See Encoding Guidelines for more details.

6. Literature

  1. Algee-Hewitt, Mark; McGurl, Mark (2015): Between Canon and Corpus. Six Perspectives on the 20th-Century Novels. Stanford Literary Lab Pamphlet no. 8.
  2. Biber, Douglas (1993): Representativeness in Corpus Design. In: Literary and Linguistic Computing (8), pp. 243–257.
  3. Herrmann, Leonhard (2011): System? Kanon? Epoche? In: Matthias Beilein, Claudia Stockinger and Simone Winko (eds.): Kanon, Wertung und Vermittlung. Literatur in der Wissensgesellschaft. Berlin: De Gruyter (Studien und Texte zur Sozialgeschichte der Literatur, vol. 129), pp. 59–75.
  4. Hunston, Susan (2008): Collection strategies and design decisions. In: Anke Lüdeling and Merja Kytö (eds.): Corpus Linguistics. An International Handbook. 2 volumes. Berlin: De Gruyter (1), pp. 154–168.
  5. IFLA (2009): Functional Requirements for Bibliographic Records (Technical Report). Available online, last accessed 23.12.2016.
  6. Lüdeling, Anke (2011): Corpora in Linguistics. Sampling and Annotation. In: Karl Grandin (ed.): Going Digital. Evolutionary and Revolutionary Aspects of Digitization. New York: Science History Publications (Nobel Symposium, 147), pp. 220–243.
  7. Moisl, Hermann (2009): Exploratory Multivariate Analysis. In: Anke Lüdeling and Merja Kytö (eds.): Corpus Linguistics. An International Handbook. 2 volumes. Berlin: De Gruyter (2), pp. 874–899.
  8. Winko, Simone (1996): Literarische Wertung und Kanonbildung. In: H. L. Arnold and H. Detering (eds.): Grundzüge der Literaturwissenschaft. München, pp. 585–600.
  9. van Zundert, Joris; Andrews, Tara L. (2017): Qu'est-ce qu'un texte numérique? A new rationale for the digital representation of text. In: Digital Scholarship in the Humanities (32), pp. 78–88. DOI: 10.1093/llc/fqx039.
COST Action CA16204 – WG1. Date: 2018-01