Wednesday, April 8, 2009

Text processing without the pain

Don't get me wrong, I love sed and awk. I have whole libraries of sed and awk scripts for doing all sorts of things. But some of them took a lot longer to write than they should've.

Last night I was faced with the task of translating a whole bunch of documentation from TeX and LaTeX to DocBook 5.0 XML. That meant doing multiline matches with sed and some preprocessing with awk, and my spirit rebelled.

I went berserk with online searches for "alternatives to sed awk", "text processing commandline utilities", and so on. The problem is that the standard text processing utilities need so much explanation that tutorials on sed and awk get churned out constantly, and they outnumber the alternatives by at least an order of magnitude.

I finally did what I should have done in the first place. I went to sourceforge.net and searched for "text processing". And I found Gema (pause for heavenly choir music).

It is not perfect by any means. The documentation in particular is just as cryptic as the original sed man pages. But in less than half an hour I had a script up and running that cleanly handled the multiline matches I needed to do.

As an example:

\\item\[*\]#\\=($1)#\n

will match the following

\\item[First Name]

The first name of the individual

This should not include any periods or commas.

And print out the following:

(First Name) The first name of the individual

This should not include any periods or commas
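For comparison, here is a rough sketch of the same kind of multiline match in Python using the standard re module. This is not Gema and not the script I actually used. It assumes the source is ordinary LaTeX with a single backslash (\item[Label]) followed by blank lines and then the description, and the names ITEM and convert_items are made up for illustration.

```python
import re

# Assumption: items look like "\item[Label]" followed by one or more
# newlines (possibly blank lines) before the description text starts.
ITEM = re.compile(r"\\item\[([^\]]*)\]\s*\n\s*")


def convert_items(text: str) -> str:
    """Rewrite each '\\item[Label]' plus the following blank lines
    as '(Label) ' so the label sits on the same line as its description."""
    return ITEM.sub(r"(\1) ", text)


sample = (
    "\\item[First Name]\n"
    "\n"
    "The first name of the individual\n"
    "\n"
    "This should not include any periods or commas.\n"
)

if __name__ == "__main__":
    print(convert_items(sample), end="")
```

That is a fair bit more ceremony than the one-line Gema rule above.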

It's clean, and it appears to be quick, though none of the files I used it on were that large. It is well worth looking at. You also might want to take a look at this article.