PILCH Hartmut 2015-03-26/12.4

Incoming messages and public thoughts

Today, Thursday, 26 March 2015, in the 12th week of the year (known as calendar week 13), messages and suggestions may arrive here for which this public diary page on the topic PILCH Hartmut can serve as a first point of contact for further processing.

yesterday

yusiao_santcasing_siandzy

Simple parser for text document conversion

I am replacing old versions of my documents with new versions based on a simple yet fairly easily configurable framework which I wrote myself. This framework is limited in that it just progresses linearly and rewrites one input file into one output file on the fly. Sometimes more complex operations are needed that require parsing one or more documents into a tree representation and then deparsing into new objects after replacing some of the elements found therein.
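The linear approach described above can be pictured with a minimal sketch. It is written in Python purely for illustration and is not the framework itself: the rule list `RULES`, the markup it recognises, and the helper names `rewrite_stream` and `rewrite_text` are all assumptions made up for this example.

```python
import io
import re

# Hypothetical rewrite rules: (compiled pattern, replacement template).
# The markup conventions here are invented for the example.
RULES = [
    (re.compile(r"^= (.+) =$"), r"<h1>\1</h1>"),
    (re.compile(r"\*(\w+)\*"), r"<b>\1</b>"),
]

def rewrite_stream(src, dst, rules=RULES):
    """Rewrite src into dst line by line, applying every rule in order."""
    for line in src:
        for pattern, replacement in rules:
            line = pattern.sub(replacement, line)
        dst.write(line)

def rewrite_text(text, rules=RULES):
    """Convenience wrapper that runs rewrite_stream on in-memory strings."""
    src = io.StringIO(text)
    dst = io.StringIO()
    rewrite_stream(src, dst, rules)
    return dst.getvalue()

print(rewrite_text("= Title =\nsome *bold* text\n"))
```

Because each line is written out as soon as it has been rewritten, nothing is ever held as a whole-document structure. That is exactly the limitation mentioned above: an operation that has to look at or rearrange several parts of a document at once does not fit this model.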

I have already solved the problem in my mind, and I am wondering whether I should move to the solution step by step based on my current framework, including other structure building experiences like this one, or whether it would make sense to use existing frameworks such as Tree::Simple and Parse::Yapp.

I’m not the first one to ask myself similar questions, see e.g. here:

For the parser, I’m not sure whether a bottom-up or top-down approach is best. I’m somewhat tempted to use this as an opportunity to learn Parse::RecDescent, which seems like it would be effective for these kinds of documents with sections, paragraphs, inline markup, etc. Are there other suggestions? I’d like to avoid external, non-perl tools, but something like Parse::YAPP could be ok.

For the data structure, I’m debating between converting everything to some standard DOM (e.g. XML::DOM, Mozilla::DOM) or equivalent “grove” (e.g. Data::Grove and the like) or rolling my own generic document tree structure using tools like Tree::Simple or Data::Hierarchy. A standards-based approach seems appealing to be able to leverage tools built on the standard, but I’m worried about a lack of flexibility and burdening the dependency chain with a DOM written for too narrow a purpose. (E.g. XML::DOM requires LWP::UserAgent and also XML::Parser which itself depends on the “expat” library.)

For the translator, the approach pretty much depends on the data structure. If it winds up in a standards-based structure, then I can leverage tools to manipulate that standard. Otherwise, the output formatting would have to be written based on traversal of the data structure. (This assumes a document model approach as opposed to a SAX-style approach.)

Some of these approaches look very good. On the other hand, tree structures are easily built from array references, in which the parsing and deparsing subroutines can be represented as simple function references, or even just as regular expressions and format expressions respectively. It seems to me that incrementally enhancing what is already working for me will be the fastest way forward. An advantage of adopting the existing tools is that they may open up new horizons for other applications as well.
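To make the home-grown option concrete, here is a minimal sketch in Python, where nested lists stand in for Perl array references. It is only an illustration under assumptions: the node types in `NODE_TYPES`, the input markup, and the function names `parse`, `transform` and `deparse` are hypothetical. Each node type pairs a regular expression for parsing with a format string for deparsing.

```python
import re

# Each hypothetical node type pairs a regular expression (for parsing)
# with a format string (for deparsing). Order matters: first match wins.
NODE_TYPES = {
    "heading": (re.compile(r"^== (.+) ==$"), "<h2>{}</h2>"),
    "para": (re.compile(r"^(.+)$", re.S), "<p>{}</p>"),
}

def parse(text):
    """Parse blank-line separated blocks into a tree: [tag, child, ...]."""
    tree = ["doc"]
    for block in re.split(r"\n\s*\n", text.strip()):
        for tag, (pattern, _fmt) in NODE_TYPES.items():
            match = pattern.match(block)
            if match:
                tree.append([tag, match.group(1)])
                break
    return tree

def transform(node, fn):
    """Return a copy of the tree with fn applied to every text leaf."""
    if isinstance(node, str):
        return fn(node)
    return [node[0]] + [transform(child, fn) for child in node[1:]]

def deparse(node):
    """Turn the tree back into text using each node type's format string."""
    if isinstance(node, str):
        return node
    tag, children = node[0], node[1:]
    body = "\n".join(deparse(child) for child in children)
    if tag == "doc":
        return body
    return NODE_TYPES[tag][1].format(body)

tree = parse("== Intro ==\n\nHello old world")
updated = transform(tree, lambda s: s.replace("old", "new"))
print(deparse(updated))
```

The tree exists as a whole between `parse` and `deparse`, so elements can be replaced or rearranged before output, which the purely linear rewriting cannot do. A dedicated module such as Tree::Simple would offer richer traversal and manipulation than these plain nested lists.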

netzplanet_fachkraefteverladung

tomorrow

deplate
http://a2e.de/dok/phm_pub150326
© 2015-03-26 Hartmut PILCH