Text formats and how to handle them

Lou Burnard Consulting

Silk purses and sow's ears

Up to a point...

Semantic vs presentational features

We are where we are

Documents might come in...

Low hanging fruit...

Word, LibreOffice, or similar

XHTML and friends

Typesetting and other esoteric formats

Good luck!

Two essential technologies: regexp and XPath

Regular expressions (regexp) let you specify patterns that match strings of characters, and then manipulate the resulting matches
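A minimal sketch of both halves of that idea (matching and manipulating), using Python's standard re module. The sample sentence, the underscore convention for emphasis, and the <hi> element name are assumptions made up for this illustration.

```python
import re

# Hypothetical input: plain text in which emphasis is marked with underscores
text = "The _Silk Purse_ was printed in 1887 and reprinted in 1902."

# Match: an underscore, a run of non-underscore characters, an underscore.
# The parentheses capture the text between the underscores as group 1.
emphasis = re.compile(r"_([^_]+)_")

# Manipulate: rewrite each match, reusing the captured group as \1
marked_up = emphasis.sub(r"<hi>\1</hi>", text)

# Match only: collect every four-digit year beginning 18 or 19
years = re.findall(r"\b1[89]\d\d\b", text)

print(marked_up)
print(years)
```

The same patterns should work largely unchanged in a find-and-replace dialog that supports regular expressions, although the syntax for referring to a captured group (\1 here) varies between tools.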

XPath is a standard syntax for selecting parts of an XML tree in terms of its elements and attributes
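A short sketch of selecting by element and by attribute, using Python's standard xml.etree.ElementTree module. Note that ElementTree implements only a subset of XPath, so the paths are written relative to the root element; in a full XPath processor the first would more usually be written //div[@type='chapter']/head. The sample document is invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this illustration
doc = """<text>
  <div type="chapter" n="1">
    <head>Silk purses</head>
    <p>First paragraph.</p>
    <p rend="italic">Second paragraph.</p>
  </div>
  <div type="appendix" n="2">
    <head>Notes</head>
    <p>A note.</p>
  </div>
</text>"""

root = ET.fromstring(doc)

# Select by element and attribute: the head of every div whose type is "chapter"
chapter_heads = [h.text for h in root.findall("./div[@type='chapter']/head")]

# Select by attribute anywhere in the tree: every p with rend="italic"
italic_paras = [p.text for p in root.findall(".//p[@rend='italic']")]

# Select by element name alone: every p, at any depth
all_paras = root.findall(".//p")

print(chapter_heads)
print(italic_paras)
print(len(all_paras))
```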

Both regexp and XPath are built into oXygen

Programming options

If you have lots of texts in a specific arcane format, it is worth investing time and effort to translate them, using whatever tools are at your disposal.

A multi-stage path is usually easiest, e.g.

Remember: the computer should be doing the boring repetitive work, not you!
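As a purely hypothetical illustration of what a multi-stage path can look like (not the stages this slide originally listed), the sketch below converts an invented format in three small steps: tidy the raw text, mark it up with regular expressions, then parse the result and check it with XPath-style queries. The ".H " heading code, the element names, and the function names are all assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Invented input format: ".H " starts a heading, every other non-blank line is a paragraph
raw = ".H Chapter One\r\nIt was a dark & stormy night.\r\n\r\nThe rain fell.\r\n"

def normalise(source):
    """Stage 1: unify line endings and trim trailing whitespace."""
    lines = source.replace("\r\n", "\n").split("\n")
    return "\n".join(line.rstrip() for line in lines).strip()

def to_xml(clean):
    """Stage 2: turn the invented codes into XML using regular expressions."""
    body = escape(clean)  # protect &, < and > before adding any tags
    body = re.sub(r"^\.H (.+)$", r"<head>\1</head>", body, flags=re.MULTILINE)
    # Wrap every remaining non-blank line that is not already a heading
    body = re.sub(r"^(?!<head>)(.+)$", r"<p>\1</p>", body, flags=re.MULTILINE)
    return "<div>\n" + body + "\n</div>"

def check(xml_text):
    """Stage 3: parse the output and count headings and paragraphs."""
    root = ET.fromstring(xml_text)  # raises ParseError if the XML is not well-formed
    return len(root.findall("./head")), len(root.findall("./p"))

xml_text = to_xml(normalise(raw))
counts = check(xml_text)
print(xml_text)
print(counts)
```

Keeping each stage as a separate function means the output of any one step can be inspected on its own, and the whole chain can be rerun over every file in a collection once it works on one.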