PDF to XML conversion in MarkLogic

Question

There are two ways to get there: the harder up-front way and the easier post-processing way.

(1) Through the PDF configuration file. By default this is PDFtoXHTML.cfg in the Converters/cvtpdf subdirectory of your installation. You can create your own configuration file and reference it via the config options to xdmp:pdf-convert. What you want to do is add a crop box to the pages to crop out the page numbers, headers, footers, and so on. The syntax is:

[ANNOT PLAN]
0.Iceni Crop Box =1-# [341.15, 91.78, 259.87, 364.84];
[-- END --]

How do you figure out what the geometry of this box should be? You can download a tool called Gemini from Iceni to do this. This approach works fine as long as all the documents you process share the same page geometry.
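If you end up generating configuration files for several document layouts, a small helper can render the crop box section for you. This is only a sketch: the function name and its parameters are made up for illustration, the four numbers are passed through in whatever order you supply them (their meaning is not spelled out here, so take them from the tool you measure with), and "1-#" is simply copied from the example above.

```python
def crop_box_section(box, pages="1-#"):
    """Render an [ANNOT PLAN] section like the one shown above.

    `box` holds four numbers, emitted in the order given. Their meaning
    is NOT interpreted here; that is left to whoever measured the box.
    `pages` is copied verbatim into the page-range slot.
    """
    if len(box) != 4:
        raise ValueError("a crop box needs exactly four numbers")
    coords = ", ".join(f"{value:.2f}" for value in box)
    return "\n".join([
        "[ANNOT PLAN]",
        f"0.Iceni Crop Box ={pages} [{coords}];",
        "[-- END --]",
    ])


if __name__ == "__main__":
    # Reproduces the section from the example above.
    print(crop_box_section([341.15, 91.78, 259.87, 364.84]))
```

You would still need to place the resulting section into your own copy of the configuration file by hand or with a separate step; how it interacts with sections already in that file is not covered by this sketch.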

(2) Via post-processing of the DocBook output. The page starts and ends are marked up in the XHTML generated from PDF conversion, and that propagates to the DocBook as well. Something like:

<para>
  <phrase id="pge03"> </phrase>
</para>
<para>
  <phrase id="pgs04"> </phrase>
</para>
<para/>

You could run some kind of stylesheet that looks for this pattern and removes the page header and page footer information in the vicinity. It gets tricky because you'll have to decide whether to stitch adjacent paragraphs around a page break back together, presumably based on the style information. By default the DocBook doesn't preserve the style information from the XHTML, but you can get it to if you need it by setting the option preserve-styles to true in the DocBook step. Look in the DocBook pipeline. (This comes from the Installer/conversion/docbook-pipeline.xml in your install directory.)
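As a rough illustration of the post-processing idea, here is a minimal Python sketch using only the standard library rather than a stylesheet. It makes several assumptions that you should check against your own output: that ids of the form "pge03" and "pgs04" mark a page end and a page start, that the DocBook elements carry no namespace, and that the footer and header each occupy a fixed number of paragraphs next to the markers. The function names and the footer_paras/header_paras parameters are hypothetical. It does not attempt to stitch paragraphs back together across the page break.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: "pge" + digits marks a page end, "pgs" + digits a page start.
PAGE_MARKER = re.compile(r"^pg([es])(\d+)$")


def marker_kind(elem):
    """Return 'e' or 's' if elem is a para holding only a page-marker
    phrase, otherwise None."""
    children = list(elem)
    if elem.tag != "para" or len(children) != 1 or (elem.text or "").strip():
        return None
    phrase = children[0]
    if phrase.tag != "phrase" or (phrase.tail or "").strip():
        return None
    match = PAGE_MARKER.match(phrase.get("id", ""))
    return match.group(1) if match else None


def strip_page_furniture(root, footer_paras=1, header_paras=1):
    """Remove page-marker paras plus a fixed number of neighbouring paras:
    footer_paras before each page end, header_paras after each page start."""
    for parent in list(root.iter()):
        children = list(parent)
        doomed = set()
        for i, child in enumerate(children):
            kind = marker_kind(child)
            if kind is None:
                continue
            doomed.add(i)
            if kind == "e":
                neighbours = range(max(0, i - footer_paras), i)
            else:
                stop = min(len(children), i + 1 + header_paras)
                neighbours = range(i + 1, stop)
            # Only plain paragraphs are treated as header/footer candidates.
            doomed.update(j for j in neighbours if children[j].tag == "para")
        for i in sorted(doomed, reverse=True):
            parent.remove(children[i])
    return root
```

Before trusting the fixed counts, run it on a few converted documents and inspect what gets dropped; if headers or footers vary in length, you would need a smarter test (for example on the preserved style information) in place of the simple neighbour count.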