
Handling plain text files

Quick ELTeC Corpus Building

We've found a great resource on the internet: 100 transcribed Polish novels, all ready to be used. Just a few problems:

This tutorial steps you through the process of going from a plain text file to a level-0 encoded, ELTeC-conformant version. You should be able to carry this out without any knowledge of Polish, though we do have a Polish participant who has kindly agreed to provide advice if you get stuck.

The complete process looks like this:

choose the text and create a document

Take a look at the POL folder. It contains 18 files we've pre-selected for you, along with a tiny bit of metadata in a file called titles.txt which supplies you with an identifier, a filename, a wordcount, and a magic incantation (see below) for each title. Choose one to work on (or we'll give you one)!

Each filename contains the surname of the author, a word from the title, and the date of first publication. You'll need to enter the name and the title in some suitable search engine (Wikipedia, Worldcat ...) to get the metadata necessary to complete your TEI Header, notably the author's full name, dates, and sex, as well as the full title.

We also know that each text is derived from a digital collection called 100 Polish Novels, transcribed by the Computational Stylistics Group in Krakow, and that it is freely available from their GitHub repository at https://github.com/computationalstylistics/100_polish_novels. We do not, however, know much about the print sources for these texts: hopefully your research will turn up some information, at least about their first edition in book form. Armed with that additional information for your chosen text, proceed to create a new file and a valid TEI header for the file.

This process is described in detail in the Header Tutorial.

add some text

Now we are ready to add some text to our document.

Note that the error message "The body of a text must contain at least one chapter" is still present. We will need to introduce more markup. We could do this slowly and painfully, one step at a time, but computers are supposed to make it easier to automate tasks which are slow and painful. In the next section, we'll see how you can take advantage of any systematic patterns in the format of a non-marked-up text to introduce explicit XML markup. To do this we'll use the sophisticated Find/Replace tools built into oXygen.

identify the chapters and headings

To begin, open the Find/Replace dialogue and make sure that the check box Regular expression is selected. A regular expression is a kind of pattern: the find and replace command usually searches for specific character strings; when regular expressions are enabled, it can also search for complex patterns of characters.

For example, in some of our Polish texts, every new chapter begins with a roman numeral on a line of its own. In other texts the chapter number is preceded by the word Rozdział; in yet others the chapter has a title given in uppercase letters only. We can write regular expressions to cater for all of these possibilities. Without regular expression matching, we could search for lines containing the explicit strings I, II, III, IV etc. But it is much simpler to find all such lines by means of a regular expression matching any sequence of one or more of the letters I, V or X followed by a new line.
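The two simplest heading patterns described above can be tried out in a short Python sketch. It is only an illustration: the sample text is invented rather than taken from the POL folder, and Python's `re` module stands in for the oXygen Find/Replace dialogue, whose regular expression syntax may differ in details.

```python
import re

# Invented sample text (an assumption; not copied from any file in the POL folder).
sample = "I\nDawno temu...\nII\nPotem...\nRozdział III\nWreszcie...\n"

# A line made up only of one or more of the letters I, V or X.
# re.MULTILINE lets ^ and $ match at the start and end of every line.
roman_line = re.compile(r"^[IVX]+$", re.MULTILINE)

# A line where the roman numeral is preceded by the word Rozdział.
rozdzial_line = re.compile(r"^Rozdział [IVX]+$", re.MULTILINE)

roman_headings = roman_line.findall(sample)
rozdzial_headings = rozdzial_line.findall(sample)

print(roman_headings)
print(rozdzial_headings)
```

Note that the first pattern does not match the line beginning with Rozdział, because it requires the whole line to consist of the letters I, V or X.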

The real power of regular expression matching is not just that it enables us to find particular strings in the text quickly. It also allows us to specify replacements for those strings. In our case, wherever we identify something which is the heading of a chapter, we need to tag it as a <head>, and also show that it begins a new chapter by inserting a <div type="chapter"> tag, and (to ensure our document remains well-formed XML) preceding it with a </div> tag to close the preceding chapter. The syntax for doing this is simple enough: the replacement part of the find/replace dialogue can include numbered references (like this: \1), each of which is expanded to a part of the matched string, specifically the part enclosed in parentheses.

This is easier to understand with an example. Let's suppose you are working on POL001.
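As the content of POL001 is not reproduced here, the following Python sketch uses an invented fragment in which each chapter starts with a roman numeral on its own line. It is a sketch under that assumption, not the exact find/replace used in the workshop. Python's `re.sub` happens to accept the same `\1` notation for the parenthesised part of the match. The final tidy-up line is an addition of this sketch: the first replacement produces a `</div>` with no preceding chapter to close, and the last chapter needs a closing tag.

```python
import re

# Invented fragment standing in for the start of a novel (an assumption).
text = "I\nPierwszy akapit.\nDrugi akapit.\nII\nTrzeci akapit.\n"

# The parentheses capture the roman numeral so that \1 can reuse it.
heading = re.compile(r"^([IVX]+)$", re.MULTILINE)

# Close the previous chapter, open a new one, and tag the heading.
marked = heading.sub(r'</div>\n<div type="chapter">\n<head>\1</head>', text)

# Tidy-up: drop the spurious first </div> and close the last chapter.
marked = marked.replace("</div>\n", "", 1) + "</div>\n"

print(marked)
```

In an editor, the same tidy-up would be done by hand: delete the `</div>` before the first chapter and add one after the last.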

identify the paragraphs

In this file, every line other than those beginning with a < character now contains a single paragraph. We can use a simple regexp to mark them all up as <p> elements:
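One possible form of such a regexp is sketched below in Python, again on an invented fragment. The pattern is an assumption chosen for illustration: it matches any non-empty line whose first character is not `<` and wraps the whole line in a `<p>` element.

```python
import re

# Invented fragment in which chapters and headings are already tagged (an assumption).
marked = (
    '<div type="chapter">\n'
    "<head>I</head>\n"
    "Pierwszy akapit.\n"
    "Drugi akapit.\n"
    "</div>\n"
)

# A line whose first character is neither < nor a newline, captured whole.
para = re.compile(r"^([^<\n].*)$", re.MULTILINE)

# Wrap every such line in <p>...</p>; lines starting with < are left alone.
result = para.sub(r"<p>\1</p>", marked)

print(result)
```

The pattern skips empty lines as well as lines that start with a tag, so only lines of running text are wrapped.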

Of course, there's still work to be done... we have not tried to deal with the occasional piece of verse, nor have we tried to mark up the trailing headings at the end of some divisions, so these all appear as spurious <p> elements. Feel free to continue perfecting this text, or maybe you'd rather try a different one? Congratulations on getting this far!