
Re-processing Abbyy outputs

Lou Burnard Consulting

Choosing your text

Please begin by selecting a text you would like to work on, in a language you are comfortable working with. The following items are available in your Work/Novels folder:

| Identifier | Title | PDF filename | Other formats | Wordcount |
|---|---|---|---|---|
| ENG18410 | Cecil, or, The adventures of a coxcomb | ENG18410_Gore | abbyy.xml | |
| ENG18470 | The Castle of Ehrenstein | ENG18470_James | abbyy.xml | |
| ENG19150 | Pointed Roofs | ENG19150_Richardson | abbyy.xml, docx | |
| FRA19120 | Vers le pole en aeroplane | FRA19120_Gilbert | abbyy.xml, docx | |
| HUN18460 | Meghasonlott Kedely | HUN18460_Kelmenfy | abbyy.xml, docx | |
| HUN19090 | Végzetes Tévedésh | - | HUN19090_Beniczkyne.docx | |
| HUN19180 | A három galamb | - | HUN19180_Kehel.html | |
| ROM18860 | Teodorei: copila domnului din bucuresti | ROM18860_Macri | abbyy.xml, docx | |
| ROM19060 | Mara | ROM19060_Slavici | abbyy.xml, docx | |
| ROM19170 | Domnul Badina | ROM19170_Smara | abbyy.xml | |
| SRP18860 | Omer Čelebija | SRP18860_Milicevic | abbyy.xml | |

Each file with extension .pdf contains a digitized representation of an original printed source, which has been processed by Abbyy OCR. Look through the PDF to determine what metadata you can extract for use in the header of your file, and to familiarize yourself with the basic structure of the text, for example whether it has a preface.

Use online resources (WorldCat, Wikipedia, etc.) to collect other required metadata, such as the full name of the author, their dates, sex, etc. When you have this information, proceed to create a new document and a TEI Header. The procedure to follow is described in detail in the Header Tutorial.

Reprocessing Abbyy XML format

Each file with extension .abbyy.xml contains an XML representation of the results of the OCR procedure carried out by Abbyy. This representation is verbose and contains much information we don't need, but it is easy to translate into a basic TEI form using an XSLT stylesheet. We won't be teaching you XSLT (a standard language for converting one XML document into another), but we have provided a suitable stylesheet for this task. In oXygen, the easiest way to apply it is through a named ‘transformation scenario’. We have created such a scenario for you; apply it to your .abbyy.xml file to produce a basic TEI version of the text.

Comparing this version with the PDF file you opened earlier, you can see that the tag <pb/> indicates the start of a new page in the original. Each line of the output corresponds with a typographic line in the source. Blocks of text are tagged with a <p> element, though not all of them are paragraphs. Where there was a non-textual image in the original, the <gap> element appears. And of course there are probably quite a few character recognition errors, especially if the original printed text was unclear or damaged.
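For readers curious about what such a conversion involves, here is a minimal Python sketch of the mapping just described. It is not the stylesheet provided for the exercise. It assumes the Abbyy file uses the element names page, block (with a blockType attribute whose value is "Picture" for images), line and charParams, as in the FineReader XML schema; the function names are hypothetical, and your files may differ.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Return an element name without its '{namespace}' prefix."""
    return tag.rsplit('}', 1)[-1]


def line_text(line):
    """Text of one OCR line, taken from <charParams> children if present."""
    chars = [c.text or '' for c in line.iter() if local(c.tag) == 'charParams']
    return ''.join(chars) if chars else ''.join(line.itertext()).strip()


def abbyy_to_basic_tei(abbyy_xml):
    """Convert Abbyy XML (a string) into a flat run of pb, p and gap elements."""
    out = []
    root = ET.fromstring(abbyy_xml)
    for page in (e for e in root.iter() if local(e.tag) == 'page'):
        out.append(ET.Element('pb'))           # one <pb/> per page
        for block in (e for e in page.iter() if local(e.tag) == 'block'):
            if block.get('blockType') == 'Picture':
                out.append(ET.Element('gap'))  # image: no text to transcribe
                continue
            lines = [line_text(l) for l in block.iter() if local(l.tag) == 'line']
            if lines:
                p = ET.Element('p')            # one <p> per block of text
                p.text = '\n'.join(lines)      # one output line per printed line
                out.append(p)
    return '\n'.join(ET.tostring(e, encoding='unicode') for e in out)
```

Because each block becomes one <p>, the result still needs the manual checking described here: a block is not necessarily a paragraph.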

There is no automatically generated tagging beyond this, so we will have to add it by hand. Refresh your memory about the basic structure of an ELTeC-0 text: any front matter, such as the title page or a preface, should be wrapped in a <front> element, while the text itself should be contained in a <body> element, which groups one or more <div> elements.

It's up to you how you do this. Here's a suggestion:

Your next challenge is to identify the chapter divisions. You may be able to detect them using a regexp to search for specific words, as in the Polish texts, but this is not guaranteed. A text may not have any chapter divisions at all, in which case our schema requires you to wrap the whole thing in one <div type="chapter">. Or they may not have been correctly identified by the OCR, in which case you may have to split the current <div> element at the right place, using ALT-SHIFT-D for example. Use the Outline view (Document->Show View->Outline) to see the emerging structure of your document.
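As an illustration of the regexp approach, the following Python sketch groups a list of paragraph strings into chapters. The heading pattern is only an assumption (the word "Chapter" or "Chapitre" followed by a Roman or Arabic numeral) and will need adapting to the language and typography of your text; the function name is hypothetical.

```python
import re

# Hypothetical heading pattern: adjust the keywords and numbering to your text.
HEADING = re.compile(r'^\s*(?i:chapter|chapitre)\s+(?:[IVXLCDM]+|\d+)\.?\s*$')


def split_into_chapters(paragraphs):
    """Group paragraph strings into chapters, starting a new one at each heading."""
    chapters = []
    current = None
    for para in paragraphs:
        if HEADING.match(para):
            current = {'head': para.strip(), 'paras': []}
            chapters.append(current)
        else:
            if current is None:
                # Text before any heading (or a text with no headings at all)
                # ends up in a single chapter without a head.
                current = {'head': None, 'paras': []}
                chapters.append(current)
            current['paras'].append(para)
    return chapters
```

A text with no detectable headings comes back as one chapter holding everything, which corresponds to the single <div type="chapter"> the schema requires in that case.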

At some point you will want to tidy up the text a little: some things that would be easy to fix are:

When you're reasonably happy with the transcription, type CTRL-A to select all of it and CTRL-C to copy it. Then return to the file in which you prepared your header, move the cursor to the appropriate point inside the <text> element, and type CTRL-V to paste it. Is the completed document valid? If not, fix it!

Reprocessing docx files

Abbyy also has the useful capability to export its results in Word DOCX format. Here's a quick guide to converting this to TEI:

As you see, this is an XML document, with lots of XML tags, using many different namespaces, most of them defined by Microsoft.
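If you would like to see those namespaces listed outside oXygen, the sketch below may help. It relies on the fact that a .docx file is a ZIP archive whose main text is stored in the member word/document.xml; the function name is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET


def docx_namespaces(docx_file):
    """Return {prefix: namespace URI} as declared in word/document.xml."""
    with zipfile.ZipFile(docx_file) as archive:
        with archive.open('word/document.xml') as part:
            # 'start-ns' events yield a (prefix, uri) pair per declaration.
            return {prefix: uri
                    for _, (prefix, uri) in ET.iterparse(part, events=['start-ns'])}
```

Calling docx_namespaces with the path to one of the .docx files in your folder returns a dictionary with one entry per declared prefix.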

We will begin by using oXygen's built-in XPath browser to investigate the tagging more closely.

oXygen also provides a useful way of automating changes to the markup without the trouble of writing a stylesheet. We will use it to transform all <p> elements containing an <anchor> into <head> elements.
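The same change can be sketched in Python, which may help to make clear what the operation does. This is an illustration only, not the oXygen procedure: it assumes the elements are in the TEI namespace and treats an <anchor> anywhere inside the <p> as a match; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

TEI = 'http://www.tei-c.org/ns/1.0'


def promote_anchored_paragraphs(root):
    """Rename every TEI <p> that contains an <anchor> to <head>; return the count."""
    count = 0
    for p in root.iter('{%s}p' % TEI):
        if p.find('.//{%s}anchor' % TEI) is not None:
            p.tag = '{%s}head' % TEI  # only the element name changes
            count += 1
    return count
```

The content and attributes of each renamed element are left untouched, so any rend values remain to be cleaned up afterwards.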

We now have <head> elements identifying the titles of chapters, but no <div> tags. We suggest using the find and replace command as a simple way of adding these. The search string should be <head, and the replacement string should be </div><div type="chapter"><head, so that each heading closes the previous chapter and opens a new one. Test that this works as you expect before applying it to the whole file. When you have applied it, don't forget to scroll up to the beginning of the text and remove the redundant </div> before the first chapter, and to add a closing </div> after the last one.
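The find-and-replace step can also be sketched in Python. This version inserts exactly one opening <div type="chapter"> per heading so that the tags balance, then performs the two tidy-up steps: dropping the redundant first </div> and closing the last chapter. It assumes the input contains no <div> tags yet and that no other element name begins with "head"; the function name is hypothetical.

```python
def add_chapter_divs(body_xml):
    """Wrap each <head> and the content that follows it in a <div type="chapter">."""
    if '<head' not in body_xml:
        return body_xml  # nothing to do
    # Close the previous chapter and open a new one before every heading.
    result = body_xml.replace('<head', '</div><div type="chapter"><head')
    # The first </div> has no matching opening tag, so remove it ...
    result = result.replace('</div>', '', 1)
    # ... and close the final chapter, which nothing else would close.
    return result + '</div>'
```

As with the manual method, try it on a small fragment first and check that the result is well-formed.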

You can also use find and replace to get rid of redundant rend values: simply use (for example) rend="Body Text (2)" as search string and nothing as replacement for it.
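If you prefer to script this clean-up, a small sketch follows. Note that the parentheses in a value such as "Body Text (2)" are special characters in a regular expression, so they must be escaped; re.escape does this here. The function name is hypothetical.

```python
import re


def strip_rend(xml_text, value):
    """Remove every rend attribute with exactly this value, plus the space before it."""
    pattern = r'\s+rend="' + re.escape(value) + '"'
    return re.sub(pattern, '', xml_text)
```

Unlike a plain search for the attribute alone, this also removes the space in front of it, so that <p rend="Body Text (2)"> becomes <p> rather than <p >.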

You should now be able to work through the file chapter by chapter, comparing it with the original PDF as suggested above. Check that each chapter break has been correctly detected and recorded and remember to save your work!