In recent weeks the production system with which I export concrete documents in multiple languages and formats from abstract data representations has regained vitality. While I plan and implement new functionality, I use the system to document the existing features. Here’s a glimpse of what’s going on.

Mechanisms for Subgrouping within a line
Turn MLHT text into a database source format
Support migration of contents
Short form of group opener without space
Make group-local trigger/hook variables definable globally
Activation of formatters by mere textchunk naming
Building text blocks from external sources

Mechanisms for Subgrouping within a line

Normally each text block occupies one line. This way we achieve maximal overwritability that we need for multilinguality/pluriversionality. Occasionally however we do want to cram several fields into one line. For this we provide two mechanisms.

The modifier “+1” behind the format structure variable says that the line content is a 1-dimensional list structure. As with variable definitions, multidimensional structures are possible; the syntax is the same.

@tab = /proc+table/proc+row+lvlmax+1/
(_tb @tab
|a|1|
|b|2|
)

The modifier “+tabregex” points to a textchunk variable tabregex whose value is a Perl regular expression such as \A(\w+)\s*&\s*(\S.*\S)\s*\Z, by which we split the line into fields.

mrex = \\R\{(.*)\}\{(.*)\}
@tab = /proc+table/proc+row+match+mrex/
(_tb @tab
\R{a}{1}
\R{b}{2} 
)
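
As a rough illustration of the match-based variant, here is a minimal sketch in Python (not the Perl of the actual system). It assumes that the captured groups of the regex become the fields of the row; the function name match_fields is hypothetical.

```python
import re

# Stand-in for the textchunk variable mrex from the example above.
MREX = re.compile(r"\\R\{(.*)\}\{(.*)\}")

def match_fields(line, rex=MREX):
    """Return the captured groups of rex as the fields of one line,
    or None if the line does not match (assumed behaviour)."""
    m = rex.search(line.strip())
    return list(m.groups()) if m else None

rows = [match_fields(line) for line in [r"\R{a}{1}", r"\R{b}{2} "]]
print(rows)
```

Each matching line yields one row whose cells are the captured groups; what the real system does with non-matching lines is not specified here.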

With the modifier +split+srex we point to the text variable srex whose value is a Perl regex like \s*&\s*. The line content is separated into sublines by passing this regex to the Perl function split.

srex = \s*\&\s*
@tab = /proc+table/proc+row+split+srex/
(_tb @tab
a & 1
a & 2
)
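
The split-based variant can be sketched in Python as follows. This is only an approximation of the Perl behaviour: Perl's split drops trailing empty fields by default, whereas Python's re.split keeps them. The function name split_fields is hypothetical.

```python
import re

# Stand-in for the text variable srex from the example above.
SREX = re.compile(r"\s*&\s*")

def split_fields(line, rex=SREX):
    """Split one line into sub-lines at every match of rex."""
    return rex.split(line.strip())

print(split_fields("a & 1"))
```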

With the modifier +lrsplit+lrex we point to the text variable lrex whose value is a Perl regex like ([.?!:])\s+([[:upper:]]). The line content is split into sublines at each match, such that the part matched by the left bracketed subexpression is put back at the end of the left subline, whereas the part matched by the right bracketed subexpression is put back in front of the right subline. This is useful for separating a paragraph into sentences with a separator consisting of a final punctuation mark (e.g. a period) on the left, white space in the middle and an initial upper-case letter on the right. Only the middle white space disappears when splitting; the left and right parts are given back to where they came from. By analogy we also offer the simplified versions lsplit and rsplit, in which only a left or a right subexpression of the separator, respectively, is matched and handed back.

lrex = ([.?!:])\s+([[:upper:]])
@sent = /proc+lines+lrsplit+lrex/
(_tb @sent
Odi et amo.  Quare id faciam? Fortasse requiris.  Nescio!  Sed fieri sentio.  Et excrucior!
)
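
The hand-back behaviour of lrsplit can be sketched in Python as below. This is an illustration under assumptions, not the system's Perl code: Python's re module has no POSIX class [[:upper:]], so [A-Z] is used instead, and the function name lrsplit is simply borrowed from the modifier.

```python
import re

# Approximation of lrex; [A-Z] replaces [[:upper:]], which Python's re lacks.
LREX = re.compile(r"([.?!:])\s+([A-Z])")

def lrsplit(text, rex=LREX):
    """Split text at each match of rex. Group 1 stays at the end of the
    left part, group 2 stays at the start of the right part, and only
    the text between the two groups is dropped."""
    parts = []
    start = 0
    for m in rex.finditer(text):
        parts.append(text[start:m.end(1)])  # left part keeps group 1
        start = m.start(2)                  # right part begins with group 2
    parts.append(text[start:])
    return parts

poem = ("Odi et amo.  Quare id faciam? Fortasse requiris.  "
        "Nescio!  Sed fieri sentio.  Et excrucior!")
for sentence in lrsplit(poem):
    print(sentence)
```

Under these assumptions lsplit and rsplit would be the same loop with only one capturing group, cutting at m.end(1) or m.start(1) respectively.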

Turn MLHT text into a database source format

We have legacy notation in /sig/oas/15/01/spez/_dok.oas_spez1501.txt and the like that is dependent on Deplate and activated with make dbput, looking roughly like this:

person id=lyre nom="Lyre" typ=frm rol=adr lok=ham plz=24057 str="Bierbauch" dom=19 mail="sme@lyre.com"
person id=ldapaper nom="LDA Paper UK LLP" typ=frm rol=adr lok=fra plz=63207 str="Nordkaiplatz" dom=1 mail="fabius.fairmayor@ldapaper.com"
ag dok=memoS1 nom="Memorandum Quintus Baum GmbH Deutsch-Chinesisch" des="90 Zeilen a 1,40 EUR" mon=126 ust=00 odat=2015-01-08 fdat=2015-01-12 de=lingoserv rkod="Auftrag 1501023" status=f:Rechnung stellen.
ag dok=imibS1 nom="Gesamt" des="ADV-Dokumente ins Chinesische und Japanische" odat=2015-01-09 fdat=2015-01-13 rkod="" de=lingoserv pre=imibS1b status=f:Rechnung stellen.

Replace this with something like

(_adr @ul//adrdb
(lyre
dabagrup = adrdb
typ = frm
rol = adr
lok = ham
plz = 24057
str = Bierbauch
dom = 19
mail = sme@lyre.com
Lyre
)

(ldapaper/adrdb
dabagrup = adrdb
typ = frm
rol = adr
lok = fra
plz = 63207
str = Nordkaiplatz
dom = 1
mail = fabius.fairmayor@ldapaper.com
LDA Paper UK LLP
)

)

and

(_spz @ul//spzdb
(memoS1
dabagrup = spzdb
%flds = ||mon|126||ust|00||odat|2015-01-08||fdat|2015-01-12||de|lingoserv||rkod|Auftrag 1501023||status|f|
Memorandum Quintus Baum GmbH Deutsch-Chinesisch
90 Zeilen a 1,40 EUR
)

(imibS1
dabagrup = spzdb
||odat|2015-01-09||fdat|2015-01-13||rkod|||de|lingoserv||pre|imibS1b||status|f|
Gesamt
ADV-Dokumente ins Chinesische und Japanische
)

)

such that the program dokdata2db can then find, in the dbm file, internal special textchunks such as

_dabagrups_ = +spzdb+adrdb+
_dabagrup_spzdb_rellits_ = +imibS1+memoS1+
_dabagrup_adrdb_rellits_ = +ldapaper+lyre+
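
To illustrate how such registry entries could be traversed, here is a minimal Python sketch. It assumes the textchunks are available as a plain dictionary and that list values use the +a+b+ notation shown above; the helper names plus_list and rellits_by_group are hypothetical and not part of dokdata2db.

```python
def plus_list(value):
    """Split a '+a+b+' list value into its non-empty items."""
    return [item for item in value.split("+") if item]

def rellits_by_group(chunks):
    """Map each registered dabagrup to the rellits recorded for it."""
    groups = plus_list(chunks.get("_dabagrups_", ""))
    return {
        group: plus_list(chunks.get(f"_dabagrup_{group}_rellits_", ""))
        for group in groups
    }

# Stand-in for the contents of the dbm file.
chunks = {
    "_dabagrups_": "+spzdb+adrdb+",
    "_dabagrup_spzdb_rellits_": "+imibS1+memoS1+",
    "_dabagrup_adrdb_rellits_": "+ldapaper+lyre+",
}
print(rellits_by_group(chunks))
```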

Going on from there, dokdata2db should be able to use rellits2putrek or similar to write the subelements of the identified rellits to the database. The _dabagrup_ info |+person+adr+tel|+tit+des+| consists of two lists: (1) names of the dabarels involved, (2) shorthands of the fields that constitute the initial lines of the textchunk body.

To simplify, we first specify the data that each record uses with an attribute field dabagrup. Moreover we provide an attribute field sub_dabagrup to be used in the parent section so that we can have a list of subsections that all enter the database automatically. Finally, we replace this notation with a more robust notation such as

(imibS1 @ll/spzdb
odat = 2015-01-09
fdat = 2015-01-13
rkod =
de = lingoserv
pre = imibS1b
status = f
Gesamt
ADV-Dokumente ins Chinesische und Japanische
)

or, for additional notational simplicity, we allow specification of attributes by hash notation alongside the plain attribute notation. The hash notation would imply that we are specifying final values that are not subjected to $m->grupfill expansion.

(_spz @ll/spzdb
||odat|2015-01-09||fdat|2015-01-13||de|lingoserv||pre|imibS1b||status|f|
rkod =
Gesamt
ADV-Dokumente ins Chinesische und Japanische
)
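
A reader for the hash notation could look roughly like the following Python sketch. It assumes that every entry has the form ||key|value, that keys are non-empty, and that values contain no vertical bar; the function name parse_hash is hypothetical.

```python
import re

# One entry: '||key|value', where the value may be empty (assumed grammar).
ENTRY = re.compile(r"\|\|([^|]+)\|([^|]*)")

def parse_hash(line):
    """Turn '||k1|v1||k2|v2|' into a dictionary of final values."""
    return dict(ENTRY.findall(line))

attrs = parse_hash(
    "||odat|2015-01-09||fdat|2015-01-13||de|lingoserv||pre|imibS1b||status|f|"
)
print(attrs)
```

With this grammar an empty value, as in ||rkod|||de|lingoserv|, still parses unambiguously because a key must contain at least one character.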

For the parent section we would specify a shorthand of sub_dabagrup by an extra slash, suggesting an extra hierarchy level.

(_adr @ll//adrdb
(memoS1
...
)

...
)

These would set the internal dabagrup attribute or sub_dabagrup attribute respectively. This attribute would then trigger the pushing of the lit onto the internal _dabagrup_NAME_rellits_ list (e.g. _dabagrup_adrdb_rellits_) at the end of the Tmplfil.pm process. The dabagrup itself would have to have been defined and registered in the internal _dabagrups_ variable by a command like special = dabagrup adrdb |person+adr+tel|tit+des|. In a separate step, dokdata2db would use this _dabagrups_ registry as the starting point for finding the records and writing them to the database.

Support migration of contents

Certain sections whose contents might move elsewhere must be referenceable as a document_id, also with a document id URI, even though they are not documents. The anchor symbol of the living version of the section would be marked with an asterisk suffix, behind which a further prependable part could be added, which would form the dok URI. For example, elal could become elal*mlhtdok, which would make it referenceable as mlhtdok_elal. Moreover we could use the el_dok attribute to point to the last migration source and the al_dok attribute to point to the next migration destination. A referenceable virtual document id marked by an asterisk suffix to the anchor id would be written to a special internal variable _sektdoks__ or similar and then written to the database by dokdata2db, in much the same way as is done with document metadata and other data records. We still have to clarify how these virtual document ids should be integrated into the document id table mlhtdok.
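
The proposed naming rule can be sketched in Python as follows. This is an illustration only: the function name virtual_doc_id is hypothetical, and the handling of an asterisk without a prependable part (returning the bare anchor name) is an assumption.

```python
def virtual_doc_id(anchor):
    """Derive the virtual document id from an anchor such as 'elal*mlhtdok'.
    Returns None for anchors that carry no asterisk marker."""
    name, star, prefix = anchor.partition("*")
    if not star:
        return None  # ordinary section, not referenceable as a document
    # Assumption: without a prependable part the bare name serves as the id.
    return f"{prefix}_{name}" if prefix else name

print(virtual_doc_id("elal*mlhtdok"))
```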

Short form of group opener without space

There seems to be no reason to separate the group-opening bracket and section anchor from the following formatter variable argument by whitespace. Especially in the normal case where no further arguments exist, a group opening without whitespace would look more elegant.

(@vrb
(_v@vrb+1
(_tb@tab+tabregex
(oas_adr*@ulsekt//adrdb
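
A parser that accepts both the spaced and the short form could be sketched in Python like this. The grammar is an assumption derived only from the examples above: the anchor is everything up to the first @ or whitespace, and the formatter argument begins with @. Openers with a literal /.../ formatter or with further arguments are not covered, and parse_opener is a hypothetical name.

```python
import re

# Anchor: any run without '@' or whitespace; formatter: '@' plus non-space text.
OPENER = re.compile(r"^\((?P<anchor>[^@\s]*)\s*(?P<formatter>@\S+)?")

def parse_opener(line):
    """Return (anchor, formatter) for a group-opening line, or None."""
    m = OPENER.match(line)
    if not m:
        return None
    return m.group("anchor"), m.group("formatter")

for line in ["(_tb @tab", "(_tb@tab+tabregex", "(@vrb", "(oas_adr*@ulsekt//adrdb"]:
    print(parse_opener(line))
```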

Make group-local trigger/hook variables definable globally

%indproc = ||enumlist|1||itemlist|1||ilinioi|1||minitrivlist|1|
%sfx2fun = ||url|+call+ahurlval_verb+||dok|+call+ahdokval_verb+|

Conversely, it would be desirable to make some variable types that are now defined only globally, e.g. formatter hierarchy structure variables, also definable group-locally. The grup_add_grupvars mechanism, which currently provides every group/node with all its local variables upon opening, should be replaced or complemented with a mechanism that allows easy access to all variables. There are tradeoffs, as this could lead to slower computing in some cases.

Activation of formatters by mere textchunk naming

By means of the following mapping hash we could ensure that the section anchor name suffixes _ula and _ulb imply use of the formatter structure @ul, so that this formatter would no longer have to be specified in the document.

%litjung = ||_vb|@vrb||_ct|@cit||_ol|@ol||_ul|@ul|

This way documents could become even leaner. However, they would also become less flexible because naming would be burdened with a function. Due to this disadvantage this feature is of low priority.
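
One possible lookup rule is sketched below in Python. It rests on an assumption about the matching: the last underscore-initiated component of the anchor name is compared against the keys of the mapping, so that _ula and _ulb both match _ul. The function name implied_formatter is hypothetical.

```python
# Stand-in for the %litjung mapping hash shown above.
LITJUNG = {"_vb": "@vrb", "_ct": "@cit", "_ol": "@ol", "_ul": "@ul"}

def implied_formatter(anchor, mapping=LITJUNG):
    """Return the formatter implied by the anchor name, or None."""
    # Take the last component that starts with an underscore, e.g. '_ula'.
    tail = anchor[anchor.rfind("_"):] if "_" in anchor else ""
    # Try longer keys first so that a more specific key wins.
    for key in sorted(mapping, key=len, reverse=True):
        if tail.startswith(key):
            return mapping[key]
    return None

print(implied_formatter("_ula"), implied_formatter("_ulb"))
```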

Building text blocks from external sources

We can already assemble text blocks from text blocks defined elsewhere. There is still a need for further variants. For example, it might be necessary to transform the text blocks read in from outside with a call mapcall, which would work similarly to the existing foreach but operate on list expressions.

In /adv/perl/A2E/Tmplfil.pm.tmpl#ELGRUP, develop whatever is needed to make files such as /sig/oas/_lng.oas.txt work well. The constructs proposed in ELGRUP are often less useful than the implementation with proc/call invocations shown below, because the latter creates no text variables. In other words, the ELGRUP constructs are optimal only if one wants to be able to overwrite text variables in other languages. Such situations are likely to be rare.

(_lst @ul ! vals)
@alin = /proc+alineas/proc+linioi/
(intro @alin ! +swpat+eupla+lisboa+) # warn: for backward compatibility only
@litsvar = +swpat+eupla+lisboa+
(intro @alin ! lits)
(_lst @ul mapcall)

The following is probably better

(_in /proc+include/
dbmvals_log.txt
)

or the following

(_ir /call+include_re/
tabregex
dbmvals_log.txt
)

or, alternatively,

(_mc /call+mapcall/
ahval_verb
valsvar
)
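
What mapcall might do with its two argument lines can be sketched in Python as follows. This is purely illustrative: it assumes that the first line names a function, that the second names a list variable in +a+b+ notation, and that the result is one output item per list item. The behaviour of ahval_verb is not specified in these notes, so a bracket-wrapping stand-in is used.

```python
def plus_list(value):
    """Split a '+a+b+' list value into its non-empty items."""
    return [item for item in value.split("+") if item]

def mapcall(func_name, var_name, functions, variables):
    """Apply the named function to every item of the named list variable."""
    func = functions[func_name]
    return [func(item) for item in plus_list(variables[var_name])]

# Hypothetical registries; the real ahval_verb is replaced by a stand-in.
functions = {"ahval_verb": lambda lit: f"[{lit}]"}
variables = {"valsvar": "+swpat+eupla+lisboa+"}
print(mapcall("ahval_verb", "valsvar", functions, variables))
```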

Functionality Growth in Multilingual Hypertext