
WG1: Encoding Proposals

Who am I?

I am Lou Burnard, now on my third or fourth life.

  1. I was born on the same day as the poet John Milton, but approximately 300 years later. I studied at Oxford University, completing a master's in English Studies, specialising in 19th-century literature, in 1971
  2. After which I taught World Literature in the University of Malawi for a couple of years
  3. For about 25 years I worked at Oxford University Computing Services, initially as a data centre operator, eventually as Assistant Director
  4. I started the Oxford Text Archive in 1976, the Text Encoding Initiative in 1987, and the British National Corpus in 1994
  5. In 2010 I took early retirement from OUCS and started work as a freelance
  6. Between 2009 and 2012 I worked closely with the TGE Adonis, which eventually became Huma-Num, the French digital humanities infrastructure

Look me up on Google if you get bored during the rest of this talk...

Proposed Encoding Guidelines for the ELTeC

A discussion document setting out the full proposal is available here

We summarize the proposals as follows:

What sort of guaranteed minimum?

The focus is not to represent texts in all their original complexity of structure or appearance, but rather to facilitate a richer and better-informed distant reading than a transcription of their lexical content alone would permit.

For example,

Why XML-TEI?

Why not just use plain text?

Taming the TEI

Basic structure of an ELTeC text

Goal: represent only what is essential to an understanding of the text

What are the essential components of a novel?

It seems uncontroversial to distinguish chapters, headings, and paragraphs in our markup, but how about:

It's not hard to find TEI tags for these, but is it helpful? Can we apply them consistently?
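
The uncontroversial part of this structure (chapters, headings, paragraphs) could be sketched as follows. This is only a minimal illustration: the type value "chapter" and the placeholder text are assumptions, not part of the proposal.

<body>
  <div type="chapter">
    <head>Chapter 1</head>
    <p>First paragraph of the chapter ...</p>
    <p>Second paragraph ...</p>
  </div>
</body>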

TEI encoding typically loses typographic subtleties

Are we bothered?
  • the chapter title is centred
  • there are linebreaks within the paragraphs (and sometimes words get hyphenated as a result)
  • the first word is capitalised
  • paragraphs are indented (except for the first)
  • dashes and quotation marks have a narrative function
  • hyphens may or may not be significant
  • double quotes and single quotes have different functions

Which typographic features should we keep?

Figure 1. (Penguin, 1970)
Figure 2. (Knopf, 1921)
Figure 3. (Everyman, 1991)
Figure 4. (First US ed, 1910)

What about material other than running prose and dialogue?

Novels often contain material other than running prose

We could:

  1. use the appropriate TEI elements for verse or drama (<lg>, <l>, <sp>, <stage>)
  2. use the appropriate TEI elements for lists and tables (<list>, <label>, <item>, <table>, <cell>, <row>)
  3. use the appropriate TEI elements for graphics (<figure>, <graphic>, <head>)

Or we could

Whichever we choose to do, we must be consistent!
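
To make the first set of options concrete, here is a rough sketch of how an embedded list and an illustration might look if we used the TEI elements named above. The text content and the url value are hypothetical placeholders, not taken from any ELTeC text.

<list>
  <label>1.</label>
  <item>First item ...</item>
  <label>2.</label>
  <item>Second item ...</item>
</list>

<figure>
  <head>Caption of the illustration ...</head>
  <!-- url is a hypothetical file name -->
  <graphic url="images/fig1.png"/>
</figure>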

An example

Should this be encoded as:
<p>
  <label>le vieillard.</label> « Oh mon ami ! ne m’avez-vous pas dit que vous n’aviez pas de naissance ?
</p>
or (expensively)
<sp>
  <speaker>le vieillard.</speaker>
  <p>« Oh mon ami ! ne m’avez-vous pas dit que vous n’aviez pas de naissance ?</p>
</sp>
or (deceitfully)
<p>le vieillard.</p>
<p>« Oh mon ami ! ne m’avez-vous pas dit que vous n’aviez pas de naissance ?</p>

Another example

Should this be encoded as:
<p>Even in her photographic days she had relied upon her smile and her figure to attract, and now that she was
  <quote>
    <l>"On the shelf,</l>
    <l>On the shelf,</l>
    <l>Boys, boys, I'm on the shelf,"</l>
  </quote>
  she was not likely to find her tongue. Occasional bursts of song (of which the above is an example) still issued from her lips, but the spoken word was rare.</p>
or (deceitfully)
<p>... and now that she was</p>
<p>"On the shelf, <lb/>On the shelf, <lb/>Boys, boys, I'm on the shelf,"</p>
<p>she was not likely to find her tongue. Occasional ...</p>

Some other open questions

Again, consistency of practice is essential. Whether we decide to drop or to preserve these features, we must do so for every text.

Metadata: the TEI Header

We propose using this for all metadata. It will provide for each text

The schema will check consistency of data supplied.

A possible title statement

We may need to modify the TEI definitions

<titleStmt>
  <title>Howards End : ELTeC edition</title>
  <author dates="1879 1970" sex="M">
    <persName>
      <forename>Edward</forename>
      <forename>Morgan</forename>
      <surname>Forster</surname>
    </persName>
    <persName>E.M. Forster</persName>
    <idno type="viaf">https://viaf.org/viaf/31996364</idno>
    <idno type="wiki">https://www.wikidata.org/wiki/Q189119</idno>
  </author>
  <respStmt>
    <resp>ELTeC encoding</resp>
    <name>Lou Burnard</name>
  </respStmt>
</titleStmt>
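
As far as we know, the dates and sex attributes are not defined for author in the standard TEI scheme, which is one reason the definitions may need modifying. One way such a change could be expressed is an ODD customization along the following lines. This is a sketch under assumptions: the descriptions and the permitted values other than "M" are hypothetical, not agreed project decisions.

<elementSpec ident="author" mode="change">
  <attList>
    <!-- hypothetical definition: closed list of permitted values -->
    <attDef ident="sex" mode="add" usage="opt">
      <desc>sex of the author</desc>
      <valList type="closed">
        <valItem ident="M"><desc>male</desc></valItem>
        <valItem ident="F"><desc>female</desc></valItem>
      </valList>
    </attDef>
    <!-- hypothetical definition: two years separated by a space -->
    <attDef ident="dates" mode="add" usage="opt">
      <desc>years of birth and death, separated by a space</desc>
    </attDef>
  </attList>
</elementSpec>

A closed value list of this kind is what would let the schema reject an unexpected value for sex.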

An example source description

<sourceDesc>
  <bibl>
    <author>E.M. Forster</author>
    <title>Howards End</title>
    <pubPlace>London</pubPlace>
    <publisher>Edward Arnold</publisher>
    <date>1910</date>
    <idno type="wiki">https://www.wikidata.org/wiki/Q1146642</idno>
  </bibl>
  <bibl>
    <title>The Project Gutenberg Etext of Howards End, by E. M. Forster</title>
    <ref target="http://www.gutenberg.org/files/2891/2891-h/2891-h.htm">HTML version downloaded on <date>2017-12-26</date></ref>
  </bibl>
  <note type="editions" source="worldcat">Worldcat lists 484 print editions in English</note>
</sourceDesc>

And finally... profile and revision descriptions

<profileDesc>
  <langUsage>
    <language ident="en-GB" usage="99">British English</language>
    <language ident="de" usage="1">German</language>
  </langUsage>
  <textClass>
    <keywords source="http://wikidata.org">
      <term>social class</term>
      <term>social convention</term>
      <term>modernity</term>
      <term>family drama</term>
    </keywords>
    <catRef target="#author_m #reprint_3"/>
    <classCode scheme="UDC">8231.111</classCode>
  </textClass>
</profileDesc>

The values supplied by target are defined in a project-wide <taxonomy>; this and other project-wide metadata is held in a separate corpus header.
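
For illustration, the corresponding declarations in the corpus header might look something like this. Only the identifiers author_m and reprint_3 come from the example above; the category descriptions are placeholders, since their actual wording is not given here.

<classDecl>
  <taxonomy>
    <category xml:id="author_m">
      <catDesc>placeholder description of the category author_m</catDesc>
    </category>
    <category xml:id="reprint_3">
      <catDesc>placeholder description of the category reprint_3</catDesc>
    </category>
  </taxonomy>
</classDecl>

Each pointer in the target attribute of catRef, such as #author_m, then resolves to the category carrying the matching xml:id.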

<revisionDesc>
  <change when="2018-02-11" who="LB">Added to EN collection</change>
</revisionDesc>

Just one small question...

How do we get there from here?