There are two ways to get there: the harder up-front way and the easier post-processing way.
(1) Through the PDF configuration file. By default this is PDFtoXHTML.cfg in the Converters/cvtpdf subdirectory of your installation. You can create your own configuration file and reference it via the config options to xdmp:pdf-convert. What you want to do is add a crop box to the pages to crop out the page numbers, headers/footers, etc. The syntax of this is:
[ANNOT PLAN]
0.Iceni Crop Box =1-# [341.15, 91.78, 259.87, 364.84];
[-- END --]
How do you figure out what the geometry of this box should be? You can download a tool called Gemini from Iceni to do this. This works fine, as long as all the documents you process have the same geometry.
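If your documents fall into a few different geometries, one option is to generate the crop-box stanza per geometry rather than hand-editing it. The following is only a minimal sketch: the helper name crop_box_stanza is hypothetical, the four numbers are passed through in the same order as in the example without interpreting them, and the "1-#" token is copied verbatim from the example. It produces just the stanza text, which you would still need to place in your own copy of the configuration file.

```python
def crop_box_stanza(box, pages="1-#"):
    """Render an [ANNOT PLAN] crop-box stanza in the form shown above.

    box:   four numbers, kept in the same order as in the example;
           this sketch does not interpret what each one means.
    pages: page selector token, copied verbatim from the example.
    """
    if len(box) != 4:
        raise ValueError("crop box needs exactly four numbers")
    # Two decimal places, matching the precision used in the example.
    coords = ", ".join(f"{value:.2f}" for value in box)
    return (
        "[ANNOT PLAN]\n"
        f"0.Iceni Crop Box ={pages} [{coords}];\n"
        "[-- END --]\n"
    )


# Geometry values taken from the example above.
print(crop_box_stanza([341.15, 91.78, 259.87, 364.84]))
```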
(2) Via post-processing of the Docbook output. The page starts and ends are marked up in the XHTML generated from PDF conversion, and that propagates to the Docbook as well. Something like:
<para>
<phrase id="pge03"> </phrase>
</para> <para>
<phrase id="pgs04"> </phrase>
</para>
<para/>
You could run some kind of stylesheet that looks for this pattern and removes the page header/page footer
information in the vicinity. It gets tricky because you'll have to decide whether to stitch adjacent
paragraphs around a page break back together, presumably based on the style information. By default the
Docbook doesn't preserve the style information from the XHTML, but you can get it to if you need it
by setting the option preserve-styles to true in the Docbook step. Look in the Docbook pipeline.
(This comes from Installer/conversion/docbook-pipeline.xml in your install directory.)
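As a rough illustration of approach (2), here is a minimal Python sketch (a script rather than a stylesheet) that finds the page-marker paragraphs and drops them along with their immediate neighbours. It rests on several assumptions: that ids of the form pgeNN mark a page end and pgsNN a page start (inferred from the sample above), that the Docbook elements carry no namespace, and that the footer sits just before a page-end marker and the header just after a page-start marker. The function names and the footer_paras/header_paras counts are hypothetical tuning knobs, not converter options, and the sketch does not attempt to stitch split paragraphs back together.

```python
import re
import xml.etree.ElementTree as ET

# Assumed id convention, inferred from the sample: pgeNN = page end, pgsNN = page start.
MARKER_ID = re.compile(r"^pg([es])(\d+)$")


def marker_kind(para):
    """Return ('end' | 'start', page_number) if para holds a page marker, else None."""
    for phrase in para.iter("phrase"):
        match = MARKER_ID.match(phrase.get("id", ""))
        if match:
            kind = "end" if match.group(1) == "e" else "start"
            return kind, int(match.group(2))
    return None


def strip_page_furniture(parent, footer_paras=1, header_paras=1):
    """Remove marker paragraphs plus nearby <para> siblings; return how many were removed.

    footer_paras: sibling slots to examine just before a page-end marker.
    header_paras: sibling slots to examine just after a page-start marker.
    Only <para> siblings in those slots are removed.
    """
    children = list(parent)
    doomed = set()
    for i, child in enumerate(children):
        if child.tag != "para":
            continue
        kind = marker_kind(child)
        if kind is None:
            continue
        doomed.add(i)
        if kind[0] == "end":
            neighbours = range(max(0, i - footer_paras), i)
        else:
            neighbours = range(i + 1, min(len(children), i + 1 + header_paras))
        # Only drop neighbouring paragraphs, never other structural elements.
        doomed.update(j for j in neighbours if children[j].tag == "para")
    for i in sorted(doomed, reverse=True):
        parent.remove(children[i])
    return len(doomed)


# Made-up sample: a footer paragraph, the two markers, then a running header.
sample = ET.fromstring(
    "<section>"
    "<para>Body text one.</para>"
    "<para>Page 3</para>"
    '<para><phrase id="pge03"> </phrase></para>'
    '<para><phrase id="pgs04"> </phrase></para>'
    "<para>Chapter Title</para>"
    "<para>Body text two.</para>"
    "</section>"
)
strip_page_furniture(sample)
print(ET.tostring(sample, encoding="unicode"))
```

Whether one paragraph on each side is the right amount to remove depends entirely on your documents, so check the result against a few converted files before running it in bulk.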