Automated trademarking in structured documents – DITA in particular
Unabashed plug warning: The following entry gives a conceptual overview of a solution Scriptorium has implemented for managing trademarks in structured tagging. And we’re proud of it.
You know the problem. According to your style standards, only the first instance of a given trademarked term should display the trademark symbol. Structured documentation allows you to re-use document parts (such as DITA topics) in just about any order you like. In Manual A, the first file containing the trademarked text is, say, Topic A; in Manual B the first file containing the trademarked text is Topic E, which is also used in Manual A. Where do you put your trademark markup, and how do you maintain it when running Manual A and Manual B at approximately the same time?
Maintaining the trademarks by hand adds a level of effort that becomes non-negligible when you start considering a large number of manuals. And the process becomes error-prone – those darned human beings. Different writers might tag things in different ways, trademarks might escape notice, or markup might be inserted in inappropriate places by accident.
Isn’t this one of those problems that automated documentation was supposed to solve, not create? I once had a professor who said that computers were supposed to handle the problems that computers can solve so people could work on the problems that only people can solve.
More than one of Scriptorium’s customers has presented us with this problem, so we know it is not uncommon. We have found a way to deal with the problem in DITA, and we believe that the principle is sufficiently generic to use in non-DITA structures as well.
To begin with, forget conditional processing. It won’t help you with the problem of marking only the first instance of a term. In the example of Manual A, above, setting the condition “Manual A” would still display the trademark in both Topic A and Topic E, because both topics belong to that manual. This is not what your editor wants – and he or she will let you know it in spades if he or she is any kind of editor at all.
Scriptorium’s solution for DITA, in simple outline, is as follows:
1. Using XSL, go through the ditamaps and remove all trademarking from the document files.
2. Following a predefined list of trademarked and registered trademarked terms, go through the ditamaps and identify the files that contain each term. Create a temporary file that lists the relevant files in order of book occurrence. (This step prevents having to crawl through the ditamaps more than once.)
3. Using Perl, iterate through the files listed for each term in the temporary file. Check the occurrence of each instance of the term, in text order, and evaluate whether it is a valid occurrence that requires trademarking. If so, wrap the appropriate trademark markup around it and go to the next trademark. If not, keep going through the text and the list of files until you find a valid occurrence of this trademark.
We possibly could have used XSL instead of Perl for the third step, but Perl’s text manipulation capability is much more robust than XSL’s, so we chose Perl.
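To make the outline concrete, here is a minimal sketch of the three steps. It is written in Python rather than the XSL and Perl we actually used, and it works on plain strings instead of crawling real ditamaps. The term list, the file names, and the function names (strip_trademarks, mark_first_occurrences) are all hypothetical; the only thing taken from DITA itself is the tm element with its tmtype attribute. The validity check from step 3 is deliberately left out.

```python
import re

# Hypothetical term list: term -> DITA tmtype value ("tm" or "reg").
TERMS = {"Acme Widget": "tm", "FooBase": "reg"}

# Matches an existing <tm ...>...</tm> element so it can be unwrapped.
TM_ELEMENT = re.compile(r"<tm\b[^>]*>(.*?)</tm>", re.DOTALL)


def strip_trademarks(text):
    """Step 1: unwrap every existing <tm> element, keeping its text."""
    return TM_ELEMENT.sub(r"\1", text)


def mark_first_occurrences(topics, terms=TERMS):
    """Steps 2 and 3: walk the topics in book order and wrap only the
    first occurrence of each term in a <tm> element.

    `topics` is a list of (filename, text) pairs, already in book order
    (an assumption; the real process derives this order from the ditamaps).
    """
    result = {name: strip_trademarks(text) for name, text in topics}
    order = [name for name, _ in topics]
    for term, tmtype in terms.items():
        pattern = re.compile(r"\b" + re.escape(term) + r"\b")

        def wrap(match, tmtype=tmtype):
            return f'<tm tmtype="{tmtype}">{match.group(0)}</tm>'

        for name in order:
            marked, count = pattern.subn(wrap, result[name], count=1)
            if count:
                result[name] = marked
                break  # first occurrence found; go to the next term
    return result
```

Calling mark_first_occurrences with Topic A before Topic E puts the markup in Topic A; calling it with Topic E first puts the markup in Topic E. That is the Manual A versus Manual B situation, handled without anyone editing the source topics by hand.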
In the implementation, the trademarking utility is coordinated by an Ant process. A user runs this utility just before the book is rendered for output. Being in Ant, the trademarking process could probably be integrated into the DITA Open Toolkit build system fairly easily to create a seamless, one-step production process.
A number of interesting problems arise during implementation. For example, in step 3 the process has to evaluate whether the instance of a term is valid for trademarking. Some kinds of invalid instances of a term in the text might be:
- The term is in an indexterm tag.
- The term is in an href attribute.
- The term is in a title.
- The term is in a codeblock tag.
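One way to picture the validity check is to parse the topic and look only at text nodes outside the excluded elements. The sketch below does that with Python's standard xml.etree.ElementTree; it is an illustration under assumptions, not our Perl code. The function names and the SKIP set are hypothetical, and the match is a plain substring test with no word-boundary handling. Attribute values such as href are never touched, because only element text is examined.

```python
import xml.etree.ElementTree as ET

# Elements whose content should never receive trademark markup
# (taken from the list above; extend to suit your style guide).
SKIP = {"indexterm", "title", "codeblock"}


def _make_tm(term, tmtype, tail):
    """Build a <tm> element holding the term, followed by `tail` text."""
    tm = ET.Element("tm", {"tmtype": tmtype})
    tm.text = term
    tm.tail = tail
    return tm


def mark_first_valid(elem, term, tmtype="tm"):
    """Wrap the first valid occurrence of `term` under `elem`, in
    document order. Returns True if an occurrence was marked."""
    if elem.tag in SKIP:
        return False
    # Text that sits directly inside elem, before its first child.
    if elem.text and term in elem.text:
        before, _, after = elem.text.partition(term)
        elem.text = before
        elem.insert(0, _make_tm(term, tmtype, after))
        return True
    for index, child in enumerate(list(elem)):
        if mark_first_valid(child, term, tmtype):
            return True
        # A child's tail belongs to the parent, so it is checked even
        # when the child itself is a skipped element.
        if child.tail and term in child.tail:
            before, _, after = child.tail.partition(term)
            child.tail = before
            elem.insert(index + 1, _make_tm(term, tmtype, after))
            return True
    return False
```

A caller would run this over the files in book order and stop at the first file for which it returns True.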
You might also encounter a condition where a trademarked term could appear both in mixed case and in all uppercase. Per your style guide, only the first instance of either form should be marked, not the first instance of each. That sort of requirement makes life just a little more interesting for a coder.
In general, the issue of trademarking first instances is not a simple problem to solve, and variations in style requirements will undoubtedly add complexity and challenges to the problem. But that’s what automated documentation is supposed to be good at, right? So we humans can get back to working on the more difficult problems that only people can solve.
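A rough way to handle the case requirement is to treat the spellings as one term with several variants and stop at the first match of any of them. The Python sketch below assumes the variants are listed explicitly (a hypothetical design choice) rather than matched case-insensitively, so that a lowercase string such as a file name is not mistaken for the trademark. The function name mark_first_variant is likewise made up for illustration.

```python
import re


def mark_first_variant(texts, variants, tmtype="tm"):
    """Mark only the first occurrence of ANY variant across `texts`,
    which are assumed to be in book order. Returns a new list."""
    # Longest spellings first, so a longer variant wins over a shorter
    # one that happens to be its prefix.
    ordered = sorted(variants, key=len, reverse=True)
    alternatives = "|".join(re.escape(v) for v in ordered)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b")

    def wrap(match):
        return f'<tm tmtype="{tmtype}">{match.group(0)}</tm>'

    out = list(texts)
    for index, text in enumerate(out):
        marked, count = pattern.subn(wrap, text, count=1)
        if count:
            out[index] = marked
            break  # one mark for the whole family of spellings
    return out
```

With the variants FooBase and FOOBASE, whichever spelling comes first in the book gets the markup, and every later instance of either spelling is left alone.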
I’m not sure – is that really such a good deal?
Michael Müller-Hillebrand
I see that too often: Perl hacks…
You are right: Everything could have been done with XSLT, and XSLT 2.0 offers all the necessary features, especially Regular Expression support. The trademarking rules would be in a configuration document (in XML, of course) and the maintainer of the solution would have to know XSLT and no obscure pre-XML hacker’s language (sorry, Perl-lovers).
My guess: Had you done it completely in XSLT 2.0, you would be even prouder.
Congratulations,
– Michael
David Kelly
David Kelly writes:
I agree, keeping it in a single language would probably have been a more elegant, maintainable, and efficient solution. But we also have some customers who, for various reasons, ask us to use XSLT 1.0. One reason we have been given is for the stylesheets to be compatible with as broad a range of transform engines as possible.
And of course, one of my goals with this entry was to stimulate ideas, so I appreciate your insight. I think an XSLT 2.0 solution would be interesting to work out.
Best regards,
David
[this comment did not convert properly from Haloscan to WordPress and had to be restored manually. -Sarah]