Lou Burnard Consulting
Please begin by selecting a text you would like to work on, in a language which you are comfortable with handling. The following items are available in your Work/Novels folder:
Identifier | Title | PDF filename | Other formats | Wordcount |
---|---|---|---|---|
ENG18410 | Cecil, or, The adventures of a coxcomb | ENG18410_Gore | abbyy.xml | |
ENG18470 | The Castle of Ehrenstein | ENG18470_James | abbyy.xml | |
ENG19150 | Pointed Roofs | ENG19150_Richardson | abbyy.xml, docx | |
FRA19120 | Vers le pole en aeroplane | FRA19120_Gilbert | abbyy.xml, docx | |
HUN18460 | Meghasonlott Kedely | HUN18460_Kelmenfy | abbyy.xml, docx | |
HUN19090 | Végzetes Tévedésh | - | HUN19090_Beniczkyne.docx | |
HUN19180 | A három galamb | - | HUN19180_Kehel.html | |
ROM18860 | Teodorei: copila domnului din bucuresti | ROM18860_Macri | abbyy.xml, docx | |
ROM19060 | Mara | ROM19060_Slavici | abbyy.xml, docx | |
ROM19170 | Domnul Badina | ROM19170_Smara | abbyy.xml | |
SRP18860 | Omer Čelebija | SRP18860_Milicevic | abbyy.xml |
Each file with extension .pdf contains a digitized representation of an original printed source, which has been processed by Abbyy OCR. Take a look at this to determine what metadata you can extract for use in the header of your file, and also to familiarize yourself with the basic structure of the text -- whether it has prefaces etc.
Use online resources (WorldCat, Wikipedia, etc.) to collect other required metadata, such as the full name of the author, their dates, sex, etc. When you have this information, proceed to create a new document and a TEI Header The procedure to follow is described in detail in the Header Tutorial.
Each file with extension .abbyy.xml contains an XML representation of the results of the OCR procedure carried out by Abbyy. This representation is quite verbose and contains much information we don't need, but it is easy to translate it to a basic TEI form using an XSLT stylesheet. We won't be teaching you XSLT (a standard language for converting one XML document to another) but we have provided a suitable stylesheet for this task. In oXygen, the easiest way to use it is to use a named ‘transformation scenario’. We have created such a scenario for you: to use it,
Comparing this version with the PDF file you opened earlier, you can see that the tag <pb/> indicates the start of a new page in the original. Each line of the output corresponds with a typographic line in the source. Blocks of text are tagged with a <p> element, though not all of them are paragraphs. Where there was a non-textual image in the original, the <gap> element appears. And of course there are probably quite a few character recognition errors, especially if the original printed text was unclear or damaged.
There is no automatically generated tagging beyond this; we will therefore have to add it by hand. Refresh your memory about the basic structure of an ELTeC-0 text: any front matter, such as the titlepage or a preface, should be wrapped in a <front>, while the text itself should be contained by a <body> element, grouping containing <div> elements.
It's up to you how you do this. Here's a suggestion:
Your next challenge is to identify the chapter divisions. You may be able to detect them using a regexp to search for specific words, as in the Polish texts, but this is not guaranteed. A text may not have any chapter divisions at all, in which case our schema requires you to wrap the whole thing in one <div type="chapter">. Or they may not have been correctly identified by the OCR, in which case you may have to split the current <div> element at the right place, using ALT-SHIFT-D for example. Use the Outline view (Document->Show View->Outline) to see the emerging structure of your document.
At some point you will want to tidy up the text a little: some things that would be easy to fix are:
¬
. This makes it easy to join words up again: use a regexp to remove all occurrences of ¬\n
When you're reasonably happy with the transcription, type CTRL-A to select all of it and CTRL-C to copy it. Then return to the file in which you prepared your header, move the cursor to the appropriate point inside the <text> element, and type CTRL-V to paste it. Is the completed document valid? If not, fix it!
Abbyy also has the useful capability to export its results in Word DOCX format. Here's a quick guide to converting this to TEI:
As you see, this is an XML document, with lots of XML tags, using many different namespaces, most of them defined by Microsoft.
CTRL-SHIFT-C
. Or click the little spanner iconWe will begin by using oXygen's built in xPath browser to investigate the tagging more closely.
//p
into the XPath 2.0 box at top left and press returnbody text
, though there are some other formatting values//hi
to see the<hi> elementsoXygen also provides a useful way of automating changes to the markup without the trouble of writing a stylesheet. We will use it to transform all <p> elements containing an <anchor> into <head> elements.
//p[anchor]
next to Target elements (XPATH) and head
next to New local nameWe now have <head> elements identifying the titles of chapters, but no <div> tags. We suggest using the find and replace command as a simple way of adding these. The search string should be <head
, and the replacement string should be </div><div><div><head
. Test that this works as you expect before applying it to the whole file. When you have applied it, don't forget to scroll up to the beginning of the text and remove the redundant </div>
You can also use find and replace to get rid of redundant rend values: simply use (for example) rend="Body Text (2)"
as search string and nothing as replacement for it.
You should now be able to work through the file chapter by chapter, comparing it with the original PDF as suggested above. Check that each chapter break has been correctly detected and recorded and remember to save your work!