The performance of docbook2X, and of most other DocBook tools[2], can be summed up in a short phrase: they are slow.
On a modern computer producing only a few man pages at a time, and with the right software (namely, libxslt as the XSLT processor), the DocBook tools are fast enough. But their slowness becomes a hindrance when generating hundreds or even thousands of man pages at a time.
The author of docbook2X runs into this problem whenever he does automated tests of the docbook2X package. Presented below are some actual benchmarks and possible approaches to efficient DocBook-to-man-pages conversion.
Table 1. docbook2X running times on 2157 refentry documents
| Step | Time for all pages | Avg. time per page |
|---|---|---|
| DocBook to Man-XML | 519.61 s | 0.24 s |
| Man-XML to man pages | 383.04 s | 0.18 s |
| roff character mapping | 6.72 s | 0.0031 s |
| Total | 909.37 s | 0.42 s |
The above benchmark was run on 2157 documents produced by the doclifter man-page-to-DocBook conversion tool, from the section 1 man pages installed on the author's Linux system. The XML files total 44.484 MiB and average 20.6 KiB in length.
The results were obtained using the test script in test/mass/test.pl, with the default man-page conversion options. The test script employs the obvious optimizations, such as loading the XSLT processor, the man-pages stylesheet, db2x_manxml, and utf8trans only once.
Unfortunately, there does not seem to be any obvious way to improve performance further, short of re-implementing the transformation program in a tight programming language such as C.
Some notes on possible bottlenecks:
Character mapping by utf8trans is very fast compared to the other stages of the transformation. Even loading utf8trans separately for each document only doubles the running time of the character mapping stage.
Even though the XSLT processor is written in C, XSLT processing is still comparatively slow. It takes double the time of the Perl script[3] db2x_manxml, even though the XSLT portion and the Perl portion process documents of around the same size[4] (DocBook refentry documents and Man-XML documents, respectively).
In fact, profiling the stylesheets shows that a significant amount of time is spent on the localization templates, in particular on the complex XPath navigation used there. An obvious optimization is to use XSLT keys for the same functionality, as sketched below.
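The following sketch illustrates the idea; it is not the actual docbook2X stylesheet code. The l: namespace and the l:l10n/l:template element names are assumptions modeled on the DocBook XSL localization data format.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:l="http://docbook.sourceforge.net/xmlns/l10n/1.0">

  <!-- Hypothetical sketch, not docbook2X code.  Without a key,
       each localized string is found by a predicate-heavy walk
       over the localization data, e.g.
       $l10n//l:l10n[@language = $lang]/l:template[@name = $name].
       The key below indexes every template by "language#name",
       so the same lookup becomes a single key() call. -->
  <xsl:key name="l10n-template"
           match="l:template"
           use="concat(ancestor::l:l10n/@language, '#', @name)"/>

  <xsl:template name="gentext">
    <xsl:param name="lang" select="'en'"/>
    <xsl:param name="name"/>
    <xsl:value-of
        select="key('l10n-template', concat($lang, '#', $name))/@text"/>
  </xsl:template>

</xsl:stylesheet>
```

(In a real stylesheet the localization data usually lives in a separate document loaded with document(), and key() only sees the current node's document, so the lookup would additionally need an xsl:for-each to switch context; that detail is omitted here.)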
However, when that was implemented, the author found that the time spent setting up the keys dwarfs the time saved by avoiding the complex XPath navigation: it added an extra 10 s to the processing time for the 2157 documents.
Upon closer examination of the libxslt source code, XSLT keys turn out to be implemented rather inefficiently: each key pattern x causes the entire input document to be traversed once, by evaluating the XPath expression //x.
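In other words, a stylesheet pays one full traversal of the input document per declared key pattern, whether or not the lookups ever pay for themselves. A hypothetical pair of declarations (again, not taken from the docbook2X stylesheets) makes the cost concrete:

```xml
<!-- Per the libxslt behavior described above, each declaration
     makes the processor walk the whole input document once to
     build the key's index, as if evaluating //l:template and
     //l:context respectively. -->
<xsl:key name="templates-by-name" match="l:template" use="@name"/>
<xsl:key name="contexts-by-name"  match="l:context"  use="@name"/>
```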
Perhaps a C-based XSLT processor written with performance foremost in mind (libxslt is not particularly efficiently coded) could achieve better conversion times without losing the advantages of XSLT-based transformation. Failing that, one could look into efficient stream-based transformations, such as STX (Streaming Transformations for XML).
[2] With the notable exception of the docbook-to-man tool based on the instant stream processor (although that tool has many correctness problems).
[3] From preliminary estimates, the pure-XSLT solution takes only slightly longer at this stage: 0.22 s per page.
[4] Of course, conceptually, DocBook processing is more complicated. So these timings also give an estimate of the cost of DocBook's complexity: twice the cost of a simpler document type, which is actually not too bad.