Question

We have some PDF files on our filesystem that are loaded into MarkLogic Server via MLCP. Once a PDF is ingested, CPF is triggered in MarkLogic with default conversion enabled, which transforms the PDF files into XML (DocBook format) files. A PDF may contain a sentence that starts at the end of one page and spills over onto the next. The issue is that, during transformation, the text from each page is kept in its own tag, so the spilled-over text from the next page appears in a separate tag. For example, consider the sentence "The quick brown fox jumps over the lazy dog". Suppose "The quick brown fox" appears on one page of the PDF and the rest, "jumps over the lazy dog", lands on the next page. After transformation, this is what appears in the XML:

......
<para>The quick brown fox</para>
...... (some information about headers)
<para>jumps over the lazy dog</para>

Is there a way to retain the continuity of the text during transformation?

Solution

There are two ways to get there: the harder up-front way and the easier post-processing way.

(1) Through the PDF configuration file. By default this is PDFtoXHTML.cfg in the Converters/cvtpdf subdirectory of your installation. You can create your own configuration file and reference it via the config options to xdmp:pdf-convert. What you want to do is add a crop box to the pages to crop out the page numbers, headers/footers, etc. The syntax is:

[ANNOT PLAN]
0.Iceni Crop Box =1-# [341.15, 91.78, 259.87, 364.84];
[-- END --]

How do you figure out what the geometry of this box should be? You can download a tool called Gemini from Iceni to do this. This works fine, as long as all the documents you process have the same geometry.

(2) Via post-processing of the DocBook output. Page starts and ends are marked up in the XHTML generated by the PDF conversion, and that markup propagates to the DocBook as well. Something like:

<para>
  <phrase id="pge03"> </phrase>
</para>
<para>
  <phrase id="pgs04"> </phrase>
</para>
<para/>

You could run some kind of stylesheet that looks for this pattern and removes the page header/page footer information in the vicinity. It gets tricky because you'll have to decide whether to stitch adjacent paragraphs around a page break back together, presumably based on the style information. By default the DocBook output doesn't preserve the style information from the XHTML, but you can get it to if you need it by setting the option preserve-styles to true in the DocBook step. Look in the DocBook pipeline. (This comes from Installer/conversion/docbook-pipeline.xml in your install directory.)
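To make the post-processing idea concrete, here is a minimal sketch in Python rather than a stylesheet. It is not MarkLogic's own mechanism, and it rests on several assumptions: the converted document has been exported and uses un-namespaced para and phrase elements as in the snippet above, page markers are phrase elements whose id is "pge" or "pgs" followed by digits, and a paragraph that does not end in sentence punctuation continues after the page break. The names stitch_page_breaks and is_furniture are hypothetical; is_furniture is a predicate you supply to recognise your own headers and footers.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: page-end / page-start markers look like id="pge03" / id="pgs04".
PAGE_MARKER = re.compile(r"^pg[es]\d+$")
# Heuristic: a paragraph ending in one of these is treated as complete.
SENTENCE_END = (".", "!", "?", ":")


def para_text(para):
    """All text inside a paragraph, including inline children, stripped."""
    return "".join(para.itertext()).strip()


def is_page_marker(para):
    """True if the paragraph holds only page-start/page-end phrases."""
    phrases = para.findall("phrase")
    return (bool(phrases)
            and all(PAGE_MARKER.match(p.get("id", "")) for p in phrases)
            and not para_text(para))


def append_para(target, source):
    """Append the content of `source` to `target`, separated by a space."""
    lead = " " + (source.text or "").lstrip()
    if len(target):
        last = target[-1]
        last.tail = (last.tail or "").rstrip() + lead
    else:
        target.text = (target.text or "").rstrip() + lead
    target.extend(list(source))  # carry over any inline children


def stitch_page_breaks(parent, is_furniture=lambda para: False):
    """Drop page markers and header/footer paragraphs under `parent`, and
    merge a paragraph split by a page break back into one paragraph."""
    pending = None    # last real paragraph seen so far
    crossed = False   # True once a page marker has been seen after `pending`
    for para in list(parent):
        if para.tag != "para":
            pending, crossed = None, False
            continue
        if is_page_marker(para):
            parent.remove(para)
            crossed = True
        elif crossed and not para_text(para):
            parent.remove(para)          # empty filler next to a marker
        elif is_furniture(para):
            parent.remove(para)          # caller-identified header/footer
        elif (crossed and pending is not None
              and not para_text(pending).endswith(SENTENCE_END)):
            append_para(pending, para)   # sentence continued across the break
            parent.remove(para)
            crossed = False
        else:
            pending, crossed = para, False


doc = ET.fromstring(
    "<section>"
    "<para>The quick brown fox</para>"
    '<para><phrase id="pge03"> </phrase></para>'
    '<para><phrase id="pgs04"> </phrase></para>'
    "<para/>"
    "<para>Annual Report 2014</para>"
    "<para>jumps over the lazy dog.</para>"
    "</section>"
)
stitch_page_breaks(
    doc, is_furniture=lambda p: (p.text or "").startswith("Annual Report"))
print([para_text(p) for p in doc.findall("para")])
# prints ['The quick brown fox jumps over the lazy dog.']
```

The punctuation check is only a stand-in for the style-based decision described above; with preserved style information you could compare the styles of the two paragraphs instead before merging them.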

Licensed under: CC-BY-SA with attribution