As well as the following summary statistics, this page provides links to human-readable versions of each text currently included in the European Literary Text Collection (ELTeC). Click on a language code in the table below to see a list of texts now available in that language. Then click on the identifier of a text to see a simple rendering of the text as produced by CETEIcean. The original source files are stored in a GitHub repository at COST-ELTeC, and may be downloaded freely from there.
The following tables list three different flavours of ELTeC corpus. All ELTeC corpora are encoded in TEI-XML according to one of the ELTeC schemas. The ELTeC core corpora (table ELTeC-core) are, as far as possible, comparable in size and composition. Each contains a balanced selection of 100 texts respecting all the criteria defined for the ELTeC project. The ELTeC plus corpora (table ELTeC-plus) contain smaller collections of texts which cover the same period of time as the core corpora, but which do not meet the balance criteria defined for the project: in some cases, the criteria simply could not be satisfied because the required mixture of texts did not exist; in other cases, future iterations of the collection may contain additional texts. The ELTeC extended corpora (table ELTeC-extension) are ELTeC-conformant in their encoding, but selected according to different design criteria, either to provide additional texts for the same time period, or to provide coverage of a different time period.
The E5C column gives the conformance score calculated for each repository; the score is displayed in green if conformance is high. The other columns give counts for each of the four balance criteria (authorship, length, time slot, and reprint count), with numbers in red indicating that the criterion is unsatisfied. Hovering over the last figure in each group of columns displays the E5C score calculated for that criterion.
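The range figure in the TIME SLOT group is not defined in the text above. In the rows listed on this page it appears to equal the difference between the largest and the smallest of the four time-slot counts, so the following Python sketch parses a table row and recomputes the figure on that assumption. The helper names `parse_row` and `time_slot_range` are illustrative only and are not part of any ELTeC tooling; the two sample rows are copied from the ELTeC-core table.

```python
# Column names copied from the header row of the tables on this page.
COLUMNS = [
    "Language", "Last update", "Texts", "Words",
    "Male", "Female", "1-title", "3-title",
    "Short", "Medium", "Long",
    "1840-59", "1860-79", "1880-99", "1900-20", "range",
    "Frequent", "Rare", "E5C",
]
TIME_SLOTS = ["1840-59", "1860-79", "1880-99", "1900-20"]


def parse_row(line):
    """Split one pipe-delimited data row into a dict keyed by column name."""
    cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
    if len(cells) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} cells, got {len(cells)}")
    return dict(zip(COLUMNS, cells))


def time_slot_range(row):
    """Assumed definition (inferred from the data): max count minus min count."""
    counts = [int(row[slot]) for slot in TIME_SLOTS]
    return max(counts) - min(counts)


# Two rows copied verbatim from the ELTeC-core table.
SAMPLE_ROWS = [
    "cze | 2021-04-09 | 100 | 5621667 | 88 | 12 | 62 | 6 | 43 | 49 | 8 | 12 | 21 | 39 | 28 | 27 | 1 | 19 | 80.00 |",
    "gsw | 2023-03-30 | 100 | 6408326 | 73 | 27 | 32 | 9 | 45 | 40 | 15 | 6 | 16 | 19 | 59 | 53 | 0 | 0 | 66.15 |",
]

for line in SAMPLE_ROWS:
    row = parse_row(line)
    # Print the language code, the recomputed range, and the published range.
    print(row["Language"], time_slot_range(row), row["range"])
# prints "cze 27 27" and "gsw 53 53"
```

A range of 0, as in the deu and fra rows, therefore corresponds to an even 25/25/25/25 split across the four time slots. How the range feeds into the E5C score is not described on this page, so the sketch does not attempt to reproduce that calculation.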
This remains a work in progress! Comments and reports of any problems are much appreciated: send them to the WG1 Issue Tracker.
ELTeC-core | | | | AUTHORSHIP | | | | LENGTH | | | TIME SLOT | | | | | REPRINT COUNT | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Language | Last update | Texts | Words | Male | Female | 1-title | 3-title | Short | Medium | Long | 1840-59 | 1860-79 | 1880-99 | 1900-20 | range | Frequent | Rare | E5C |
cze | 2021-04-09 | 100 | 5621667 | 88 | 12 | 62 | 6 | 43 | 49 | 8 | 12 | 21 | 39 | 28 | 27 | 1 | 19 | 80.00 |
deu | 2022-04-19 | 100 | 12738842 | 67 | 33 | 35 | 9 | 20 | 37 | 43 | 25 | 25 | 25 | 25 | 0 | 48 | 46 | 96.92 |
eng | 2022-11-17 | 100 | 12227703 | 49 | 51 | 70 | 10 | 27 | 27 | 46 | 21 | 22 | 31 | 26 | 10 | 32 | 68 | 100.00 |
fra | 2022-01-24 | 100 | 8712219 | 66 | 34 | 58 | 10 | 32 | 38 | 30 | 25 | 25 | 25 | 25 | 0 | 44 | 56 | 101.54 |
gsw | 2023-03-30 | 100 | 6408326 | 73 | 27 | 32 | 9 | 45 | 40 | 15 | 6 | 16 | 19 | 59 | 53 | 0 | 0 | 66.15 |
hun | 2022-01-24 | 100 | 6948590 | 79 | 21 | 71 | 9 | 47 | 31 | 22 | 22 | 21 | 27 | 30 | 9 | 32 | 67 | 100.00 |
pol | 2022-06-01 | 100 | 8500172 | 58 | 42 | 1 | 33 | 33 | 35 | 32 | 8 | 11 | 35 | 46 | 38 | 39 | 61 | 80.00 |
por | 2022-03-15 | 100 | 6799385 | 83 | 17 | 73 | 9 | 40 | 41 | 19 | 13 | 37 | 19 | 31 | 24 | 26 | 60 | 94.62 |
rom | 2022-05-31 | 100 | 5951910 | 79 | 16 | 59 | 9 | 49 | 31 | 20 | 6 | 21 | 25 | 48 | 42 | 24 | 76 | 83.08 |
slv | 2022-02-02 | 100 | 5682120 | 89 | 11 | 26 | 5 | 53 | 39 | 8 | 2 | 13 | 36 | 49 | 47 | 48 | 52 | 78.46 |
spa | 2022-05-16 | 100 | 8737928 | 78 | 22 | 46 | 10 | 34 | 35 | 31 | 23 | 22 | 29 | 26 | 7 | 46 | 54 | 100.00 |
srp | 2022-03-17 | 100 | 4931503 | 92 | 8 | 48 | 11 | 55 | 39 | 6 | 2 | 18 | 40 | 40 | 38 | 38 | 62 | 80.77 |
ELTeC-plus | | | | AUTHORSHIP | | | | LENGTH | | | TIME SLOT | | | | | REPRINT COUNT | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Language | Last update | Texts | Words | Male | Female | 1-title | 3-title | Short | Medium | Long | 1840-59 | 1860-79 | 1880-99 | 1900-20 | range | Frequent | Rare | E5C |
gle | 2022-04-08 | 1 | 24471 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1.54 |
gre | 2022-01-24 | 17 | 98607 | 13 | 4 | 14 | 1 | 17 | 0 | 0 | 0 | 2 | 8 | 7 | 8 | 4 | 7 | 52.31 |
hrv | 2022-01-26 | 21 | 1440018 | 21 | 0 | 4 | 0 | 6 | 12 | 3 | 6 | 12 | 2 | 1 | 11 | 1 | 0 | 23.08 |
ita | 2022-05-05 | 70 | 5535905 | 59 | 11 | 29 | 5 | 26 | 30 | 14 | 8 | 18 | 21 | 23 | 15 | 39 | 4 | 70.77 |
lit | 2022-05-25 | 32 | 947634 | 25 | 7 | 18 | 1 | 24 | 3 | 5 | 6 | 4 | 6 | 16 | 12 | 9 | 23 | 60.00 |
lav | 2022-04-28 | 31 | 2553907 | 27 | 4 | 14 | 1 | 10 | 14 | 7 | 0 | 2 | 6 | 23 | 23 | 4 | 26 | 52.31 |
nor | 2022-11-12 | 58 | 3686837 | 40 | 18 | 22 | 12 | 28 | 19 | 11 | 5 | 3 | 32 | 18 | 29 | 32 | 26 | 70.77 |
swe | 2021-04-11 | 58 | 4960085 | 29 | 28 | 18 | 8 | 16 | 24 | 18 | 15 | 3 | 20 | 20 | 17 | 17 | 41 | 76.92 |
ukr | 2021-04-09 | 50 | 1840062 | 37 | 13 | 23 | 7 | 34 | 13 | 3 | 5 | 10 | 11 | 24 | 19 | 30 | 20 | 70.77 |
ELTeC-extension | | | | AUTHORSHIP | | | | LENGTH | | | TIME SLOT | | | | | REPRINT COUNT | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Language | Last update | Texts | Words | Male | Female | 1-title | 3-title | Short | Medium | Long | 1840-59 | 1860-79 | 1880-99 | 1900-20 | range | Frequent | Rare | E5C |
nor-ext | 2022-04-27 | 5 | 187124 | 2 | 3 | 5 | 0 | 4 | 1 | 0 | 0 | 0 | 2 | 3 | 3 | 3 | 2 | 35.38 |
fra-ext1 | 2022-04-07 | 370 | 32942955 | 20 | 8 | 38 | 8 | 81 | 161 | 128 | 98 | 150 | 115 | 7 | 143 | 11 | 17 | 54.86 |
fra-ext2 | 2022-03-25 | 100 | 7549824 | 80 | 18 | 49 | 3 | 48 | 30 | 22 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 76.92 |
fra-ext3 | 2022-04-07 | 17 | 1220673 | 0 | 0 | 5 | 1 | 5 | 9 | 3 | 17 | 0 | 0 | 0 | 17 | 0 | 0 | 22.31 |
eng-ext | 2022-03-26 | 14 | 1798258 | 7 | 7 | 9 | 1 | 3 | 2 | 9 | 1 | 5 | 0 | 8 | 8 | 5 | 6 | 68.46 |
srp-ext | 2022-03-09 | 20 | 331568 | 17 | 3 | 12 | 0 | 20 | 0 | 0 | 0 | 1 | 9 | 10 | 10 | 6 | 14 | 48.46 |
por-ext | 2021-09-22 | 21 | 894495 | 18 | 3 | 21 | 0 | 13 | 5 | 3 | 1 | 5 | 8 | 7 | 7 | 5 | 9 | 56.92 |
Summary produced: 2023-03-30