Question

I know barely enough to survive in this digital world.

I have many one-page PostScript files (graphs/images) that I wish to convert to PDF and automatically crop to a tight box. I'm on Windows right now (I do use Linux too, so don't hesitate to post code for Linux).

I have in the past been successful by combining Ghostscript gswin32c.exe and Calibre pdfmanipulate.exe. This is probably a familiar approach to many here.

But this approach has become fraught with problems, for several reasons.

One problem arose after I "upgraded" to the 64 bit gswin64c.exe. The 32 bit version gswin32c.exe still works on my system though, so I can't complain too much.

Another problem arose while dealing with PostScript files that are perhaps improperly coded. There seem to be at least two problems, but I'm not sure which, if either, is responsible, or whether both are. One is that the bounding box line, e.g. %%BoundingBox: 135 179 484 587, is not always placed on the second line from the top. I understand that can be an issue. Another is that the bounding box above corresponds to a "Portrait" orientation in Ghostscript, but the cropping follows the "Landscape" orientation. Yet another problem, which I have not identified, is that for some files the cropping seems quite random.

So here is my 32bit approach (which works for high quality files), followed by the 64bit adaptation which doesn't work (perhaps because it calls some pypdf script on my machine rather than the patched script provided by calibre, if I understand https://bugs.launchpad.net/ubuntu/+source/calibre/+bug/800551 and http://www.mobileread.com/forums/archive/index.php/t-103097.html, but I'm just guessing and don't know a workaround anyhow):

@echo off
echo batch processing with Latex ps2pdf followed by Ghostscript gswin64c.exe and Calibre2 pdfmanipulate.exe
for %%I in (*.ps,*.eps) do (
"C:\Program Files\MiKTeX 2.9\miktex\bin\x64\ps2pdf" %%I
)
for %%I in (*.pdf) do (
"C:\Program Files (x86)\Ghostscript\gs9.00\bin\gswin32c.exe" -dSAFER -dNOPAUSE -dBATCH -sDEVICE#bbox "%%I" 2> bounding
"C:\Program Files (x86)\Calibre2\pdfmanipulate.exe" crop -o "%%~nICropped32.pdf" -b bounding "%%I"
pause
"C:\Program Files\Ghostscript\gs9.04\bin\gswin64c.exe" -dSAFER -dNOPAUSE -dBATCH -sDEVICE#bbox "%%I" 2> bounding
"C:\Program Files (x86)\Calibre2\pdfmanipulate.exe" crop -o "%%~nICropped64.pdf" -b bounding "%%I"
pause
)

The above 32-bit approach works on high quality files, e.g. PostScript level 3 produced by PSTricks or by Maple's standard 2D plot driver, but not on older files, e.g. PostScript level 2 (if that) produced by Maple's classic plot driver.

I have found a workaround for some such files. It consists in using epstopdf from the (MiKTeX) LaTeX distribution. It works on those Maple classic files. Unfortunately it doesn't work on some other postscript files I generated several years ago with PSTricks and other software like Matlab.

And so I need to make several transformations and select the ones that worked. I wonder if you would have suggestions that would make my life easier. If I can fix the BoundingBox and Portrait/Landscape issues I should be quite content.

I thank you in advance for any suggestions. A linux suggestion would be acceptable. My preference will go for a solution that might be able to handle the diversity of files in one single push of the "return" key.

And of course I'm looking for a lossless type of cropping, one that consists only in interpreting the bounding box, but not in transforming it into a (possibly) lower quality pdf.

EDIT: I forgot to say. When I apply gswin32c/pdfmanipulate to a high quality level 3 postscript file, the file named "bounding" fills with information like:

%%BoundingBox: 34 128 567 667
%%HiResBoundingBox: 34.364390 128.875004 566.054069 666.071980

In the example above, the file was already pretty much cropped. Note the closeness between %%BoundingBox and %%HiResBoundingBox

but applied to a low quality level 2 (or so it claims to be) postscript file, the "bounding" file fills with :

%%BoundingBox: 189 137 574 467
%%HiResBoundingBox: 189.485994 137.843996 573.299983 466.668478

but the bounding box really ought to be

%%BoundingBox: 135 179 484 587

The above (135 179 484 587) is the bounding box provided by the PostScript file itself (which I moved to the second line by copy-pasting) and it is consistent with the bounding box interpreted by Ghostview/Ghostscript when in the Portrait orientation.

But it gets completely ignored by Ghostscript...

I don't know where the 189 137 574 467 comes from --- it's very wrong...

EDIT 2. I'd like to clarify a few points, in response to Ken's questions:

Hi Ken, thanks for your reply,

sorry if my question was unclear --- nevertheless you seem to have understood the gist of it --- let me take your questions in turn:

> I'm unsure why you are using 2 applications, it should be possible to perform the entire transformation with just Ghostscript.

I didn't find a way to do it all with Ghostscript so I used another way. I found the Ghostscript/Calibre suggestion here, http://www.mobileread.com/forums/archive/index.php/t-72885.html, and elsewhere, tried it and it worked until recently.

I'm not saying it's not possible to do it all with Ghostscript, I'm merely saying that I didn't find a way to.

> "One problem arose after I "upgraded" to the 64 bit gswin64c.exe" You haven't said what the problem was, have you reported it as a bug ? If people don't report bugs, they don't get fixed......

I gave the links describing the problem and the bug report, here: https://bugs.launchpad.net/ubuntu/+source/calibre/+bug/800551, http://www.mobileread.com/forums/archive/index.php/t-103097.html, my problem is the exact same one.

> You seem to have some confusion between PostScript programs and comments. Any line in a PostScript program beginning '%' is a comment, and has no effect on the operation of the program. So BoundingBox comments won't do anything at all.

I beg to differ, if I may. Take a PostScript file, remove the %%BoundingBox line, save it and open it in Ghostview. Ghostview throws up error messages and then displays the file without using the bounding box information, e.g. as a figure surrounded by a lot of white space instead of tightly framed by the bounding box. So yes, this comment does something, within Ghostview at least. Having removed the %%BoundingBox line, if you then use Calibre/pdfmanipulate to crop the PDF, it will crop it wrongly in cases where having the %%BoundingBox would have worked. So this "comment" is quite useful in the context of displaying and cropping.

> Note there is no requirement for it to be the second line of the file.....

It is recommended by Adobe. Quoting from adobe,

"The second required DSC header comment provides information about the size of the EPS file and must be present so the including application can transform and clip the EPS file properly. This is the bounding box comment."

http://partners.adobe.com/public/developer/en/ps/5002.EPSF_Spec.pdf

Adobe say "must." Personally I couldn't care less if it's a must or not, as long as I can produce pdf from my eps that are properly bounded.

> In general Ghostscript ignores DSC comments, however if you set ProcessDSC to true, then it will make very limited use of it (primarily the BoundingBox comment to set the page size).

with pdfmanipulate it makes all the difference between a properly cropped pdf and an improperly cropped one.

> Moving on. You say you are using LaTeX ps2pdf, if you already have a PostScript file, you can send that to Ghostscript for conversion to PDF. It's not clear to me what exactly you are using Ghostscript for in this case, simply to find the real bounding box of the page ?

yes.

> It's not clear to me what you mean by 'lossless' cropping, if you crop the content you must be losing something clearly, even if it's just white space.....

I mean that I don't want the cropping process to "rasterize" (or whatever it's called, you will know the term) the whole image. The part of the file that is cropped out is not useful to me so it's not much of a loss. The part of the file that is within the crop should be of the same quality as the original. That's the general idea.

You can find comments about this here, which is one place where I found useful information, http://www.charlietanksley.net/philtex/reading-pdfs-on-portables/

> It's easy enough to do the conversion in one pass if you know the size you want to crop to,

no I don't know the size, that's why I'm going to such lengths to have software calculate it for me, and it's obviously not a simple thing because Ghostscript and epstopdf don't always agree on the optimal crop, one getting it right for some files but not for others, the other getting it right for other files but not for some...

> if you don't know the size then you can do it in 2 passes using only Ghostscript by first extracting the BoundingBox as you have done. That will get you 4 numbers, the bottom left and top right of the bounding box (if I remember correctly). You then create a 'translate' PostScript operation to move the content of the page down and left (so that it starts at 0,0, the bottom left corner). You also create a page device request to set the page size, the size being given by width = right - left and height = top - bottom. Feed the original file, along with the PostScript operators, to Ghostscript and select the pdfwrite device and you will get a PDF file.

A batch file example would be great, if you have one handy. I have seen several examples based on pdfwrite and none that I've tried have worked. The devil is in the detail.

> As far as the bounding box goes, it may be a bug, or it may be that the file makes a mark, potentially using a white ink at the outside location. In this case the bounding box device will still regard it as part of the page content. You may be able to see that it isn't, but the device cannot. Consider if the page was first filled with a dark background, and the content outlined using white ink.

The files were all created with software such as Matlab, Maple, PSTricks and it's unlikely (but obviously not impossible) that there would be invisible white marks outside of the area given by the %%BoundingBox.

In many cases, the %%BoundingBox comment contains all the information that is needed and I'd like Ghostscript or Calibre or pdfwrite or whatever to use that information.

> I cannot offer a comprehensive solution without understanding more about what you want to do, and ideally seeing one or more of your problematic files.

That would be very easy, how can I post a postscript file for your viewing? It's 420 kilobytes.

Thanks Ken, let's hope we can find a workable solution.

EDIT 3. I have identified a big part of the problem.

My postscript file has the following bounding box, pretty close to an optimal crop: %%BoundingBox: 135 179 484 587

When I run Ghostscript gswin64c/gswin32c to compute the bounding box, viz

for %%I in (*.ps,*.eps) do ("C:\Program Files\Ghostscript\gs9.04\bin\gswin64c.exe" -dSAFER -dNOPAUSE -dBATCH -dAutoRotatePages=/None -sDEVICE#bbox "%%I" 2> bounding)

I get:

%%BoundingBox: 145 189 475 574
%%HiResBoundingBox: 145.331574 189.485994 474.155986 573.299983

When I run ps2pdf followed by Ghostscript gswin64c, i.e.

for %%I in (*.ps,*.eps) do ("C:\Program Files\MiKTeX 2.9\miktex\bin\x64\ps2pdf" %%I)
for %%I in (*.pdf) do ("C:\Program Files\Ghostscript\gs9.04\bin\gswin64c.exe" -dSAFER -dNOPAUSE -dBATCH -dAutoRotatePages=/None -sDEVICE#bbox "%%I" 2> bounding)

I get the following bounding box:

%%BoundingBox: 189 137 574 467
%%HiResBoundingBox: 189.395994 137.843996 573.299983 466.668478

So the problem is that the conversion from PS to PDF with ps2pdf introduces a change in the bounding box information which results in incorrect cropping. Replacing ps2pdf with something else, like epstopdf, solves the problem here. Of course there are other solutions. Particularly valuable are solutions involving Ghostscript only, as suggested by Ken and luser droog. Their very valuable (and superior to my quick fix) suggestions are below. Something like this has worked:

for %%I in (*.eps,*.ps) do ("C:\Program Files\MiKTeX 2.9\miktex\bin\x64\epstopdf" %%I)
for %%I in (*.pdf) do (
"C:\Program Files\Ghostscript\gs9.04\bin\gswin64c.exe" -dSAFER -dNOPAUSE -dBATCH -dAutoRotatePages=/None -sDEVICE#bbox "%%I" 2> bounding
"C:\Program Files (x86)\Calibre2\pdfmanipulate.exe" crop -o "%%~nICropped.pdf" -b bounding "%%I"
)

Solution

Insufficient space in comments to add this so I'm afraid I'm posting yet another answer....

The reason the BoundingBox looks bogus for the PDF file is a feature of the PDF conversion process. By default it rotates pages until the majority of the text is horizontal; in the case of this file (and, I presume, other files with the same problem), this resulted in a rotation by 90 degrees clockwise.

This means, of course, that the bounding box rotates as well, and inspection of the values shows that this is what has happened. So the BoundingBox is correct for the rotated PDF file.
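The rotation can be checked against the numbers in the question. The sketch below assumes the page is 612 pt wide (US Letter), which is an inference from the values rather than something stated in the thread, and `rotate_bbox_cw` is a made-up helper name. A 90-degree clockwise rotation sends a point (x, y) to (y, W - x), so the box (llx, lly, urx, ury) becomes (lly, W - urx, ury, W - llx).

```shell
#!/bin/sh
# rotate_bbox_cw LLX LLY URX URY PAGE_WIDTH
# Maps a bounding box through a 90-degree clockwise page rotation:
# a point (x, y) lands at (y, PAGE_WIDTH - x).
rotate_bbox_cw() {
    awk -v llx="$1" -v lly="$2" -v urx="$3" -v ury="$4" -v w="$5" 'BEGIN {
        printf "%.2f %.2f %.2f %.2f\n", lly, w - urx, ury, w - llx
    }'
}

# HiResBoundingBox reported for the PostScript file; 612 is the assumed page width
rotate_bbox_cw 145.331574 189.485994 474.155986 573.299983 612
```

Rounded to two decimals this gives 189.49 137.84 573.30 466.67, which agrees with the %%HiResBoundingBox reported for the converted PDF. If the rotation is unwanted, note that -dAutoRotatePages=/None is a pdfwrite setting, so it would have to be given at the PS-to-PDF conversion step rather than at the bbox step.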

Now, I supplied a couple of PostScript programs by private email, here's what I put:

1pass.ps

This reads the BoundingBox line from the source PostScript file, and uses it to set up the page size and offset. You pass in the name of the file to use by setting 'SourceFileName', e.g. with the file you provided:

gs -sDEVICE=pdfwrite -sSourceFileName=classic.ps -o out.pdf 1pass.ps

will produce a file called out.pdf which is the result of reading the BoundingBox, and converting to a PDF file with a page cropped to that size.

%!PS  

%% redefine setpagedevice to prevent changes by the PostScript program  
%% But keep a copy under a different name, so we can use it.  
/Oldsetpagedevice /setpagedevice load def  
/setpagedevice {pop} bind def  

(File to process is ) print SourceFileName ==  

/SourceFile SourceFileName (r) file def  
/BoxString 65535 string def  
/LLx 0 def  
/LLy 0 def  
/URx 0 def  
/URy 0 def  
/FoundBox false def  

/GetValues {  
  token {                   % read a PostScript token  
    /LLx exch def               % Assume it's a number for now  
    token {  
      /LLy exch def  
      token {  
        /URx exch def  
        token {  
          /URy exch def  
          pop                       % Get rid of any remaining string data  
          true              % return success code  
        }{  
          (Failed to read a number from the string) ==  
          false             % return failure code  
        } ifelse  
      }{  
        (Failed to read a number from the string) ==  
        false               % return failure code  
      } ifelse  
    }{  
      (Failed to read a number from the string) ==  
      false                 % return failure code  
    } ifelse  
  } {  
    (Failed to read a number from the string) ==  
    false                   % return failure code  
  } ifelse  
} bind def  

{  
  SourceFile BoxString readline {  
    (%%BoundingBox:) anchorsearch {  
      pop                           %% discard matching string  
      GetValues             %% extract BBox  
      /FoundBox exch def        %% Note success/failure  
      exit                  %% exit this loop  
    } {  
      pop                   %% discard string, no match  
    } ifelse  
  } {  
    (Failed to find a %%BoundingBox comment) ==  
    exit                            %% No more data, exit loop  
  } ifelse  
} loop  

SourceFile closefile            %% close the file  

FoundBox {  
  (LLx = ) print LLx ==  
  (LLy = ) print LLy ==  
  (URx = ) print URx ==  
  (URy = ) print URy ==  
  << /PageSize [URx LLx sub URy LLy sub] >> Oldsetpagedevice  
  LLx neg LLy neg translate  
  SourceFileName run  
} if  
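
To run 1pass.ps over a whole directory in one go, a small sh wrapper along these lines might do. It is only a sketch: `out_name` and `crop_all_1pass` are made-up helper names, the `-cropped.pdf` suffix is an arbitrary choice, and 1pass.ps is assumed to sit in the current directory next to the input files.

```shell
#!/bin/sh
# out_name FILE: derive the output PDF name from a .ps or .eps file name.
out_name() {
    case "$1" in
        *.eps) printf '%s\n' "${1%.eps}-cropped.pdf" ;;
        *.ps)  printf '%s\n' "${1%.ps}-cropped.pdf" ;;
        *)     printf '%s\n' "$1-cropped.pdf" ;;
    esac
}

# crop_all_1pass: convert every .ps/.eps file in the current directory,
# skipping the helper programs 1pass.ps and 2pass.ps themselves.
crop_all_1pass() {
    for f in *.ps *.eps; do
        [ -f "$f" ] || continue          # an unmatched glob stays literal
        case "$f" in 1pass.ps|2pass.ps) continue ;; esac
        gs -sDEVICE=pdfwrite -sSourceFileName="$f" \
           -o "$(out_name "$f")" 1pass.ps
    done
}
```

Calling `crop_all_1pass` from the directory holding the files would then write one cropped PDF per input. Recent Ghostscript releases enable SAFER mode by default, so 1pass.ps may need explicit permission to read the source file; that is worth checking against the documentation for the version in use.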

2pass.ps

This is intended to be used the way you are currently working. It has two advantages over 1pass.ps:

  1. It works with PDF files as well as PostScript files, and with files which do not contain a %%BoundingBox comment.
  2. The BoundingBox is accurate.

It has the disadvantage that you have to process each file twice, once to get the bounding box and once to create the PDF file.

This takes two parameters, the name of the file containing the output of the bbox device, and the name of the file to be converted. Again, using the file you sent, you would use it like this:

First command:

  gs \
   -dNOPAUSE -dBATCH \
   -sDEVICE=bbox \
    classic.ps 2> bounding.txt

Second command:

  gs \
   -sDEVICE=pdfwrite \
   -sBoxFileName=bounding.txt \
   -sPostScriptFileName=classic.ps \
   -o out.pdf \
    2pass.ps

PostScript code for 2pass.ps:

%!PS  

%% redefine setpagedevice to prevent changes by the PostScript program  
%% But keep a copy under a different name, so we can use it.  
/Oldsetpagedevice /setpagedevice load def  
/setpagedevice {pop} bind def  

(Bounding Box parameters in file ) print BoxFileName ==  
(File to process is ) print PostScriptFileName ==  

/BoxFile BoxFileName (r) file def  
/BoxString 256 string def  
/HiResBoxString 256 string def  
/LLx 0 def  
/LLy 0 def  
/URx 0 def  
/URy 0 def  

BoxFile BoxString readline  % Read first line from file  
{  
  /BoxString exch def       % redefine string to be the one we read  
}{  
  (Encountered EOF before newline reading %%BoundingBox) == flush  
} ifelse  

BoxFile HiResBoxString readline % Read second line from file  
{  
  /HiResBoxString exch def      % redefine string to be the one we read  
}{  
  (Encountered EOF before newline reading %%HiResBoundingBox) == flush  
} ifelse  

BoxFile closefile               % close the file  

BoxString (%%BoundingBox:) anchorsearch  
{  
  pop                       % Get rid of the matching string  
  token {                   % read a PostScript token  
    /LLx exch def               % Assume it's a number  
    token {  
      /LLy exch def  
      token {  
        /URx exch def  
        token {  
          /URy exch def  
          pop                       % Get rid of any remaining string data  
        }{  
          (Failed to read a number from the string) ==  
        } ifelse  
      }{  
        (Failed to read a number from the string) ==  
      } ifelse  
    }{  
      (Failed to read a number from the string) ==  
    } ifelse  
  } {  
    (Failed to read a number from the string) ==  
  } ifelse  
}{  
  print (does not contain a BoundingBox) ==  
} ifelse  

(LLx = ) print LLx ==  
(LLy = ) print LLy ==  
(URx = ) print URx ==  
(URy = ) print URy ==  

<< /PageSize [URx LLx sub URy LLy sub] >> Oldsetpagedevice  
LLx neg LLy neg translate  

PostScriptFileName run  
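
2pass.ps expects the %%BoundingBox line to be the first line of the box file and the %%HiResBoundingBox line to be the second. If Ghostscript writes anything else to stderr (a warning, say), that assumption breaks. A possible sh wrapper, with the hypothetical helper names `keep_bbox_lines` and `two_pass`, filters the stderr stream before the second pass; it assumes one-page input and 2pass.ps in the current directory.

```shell
#!/bin/sh
# keep_bbox_lines: read the bbox device's stderr on stdin and keep only
# the %%BoundingBox and %%HiResBoundingBox lines, in their original order.
keep_bbox_lines() {
    grep -E '^%%(HiRes)?BoundingBox:'
}

# two_pass SRC OUT [BOXFILE]: measure SRC with the bbox device, then hand
# the cleaned measurement to 2pass.ps.
two_pass() {
    box=${3:-bounding.txt}
    # "2>&1 >/dev/null" sends stderr into the pipe and discards stdout
    gs -q -dNOPAUSE -dBATCH -sDEVICE=bbox "$1" 2>&1 >/dev/null \
        | keep_bbox_lines > "$box"
    gs -sDEVICE=pdfwrite -sBoxFileName="$box" \
       -sPostScriptFileName="$1" -o "$2" 2pass.ps
}
```

With the file from the thread this would be called as `two_pass classic.ps out.pdf`.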

Other tips

If simply enforcing the BoundingBox comment will do what you want, you can replace the first call to ghostscript with a text-scanner.

Here's the sh version of the script above (can't stand those Windows pathnames!)

for i in *.pdf ; 
do 
    gs -dSAFER -dNOPAUSE -dBATCH -sDEVICE=bbox "$i" 2> bounding ; 
    pdfmanipulate crop -o "${i%.pdf}-cropped.pdf" -b bounding "$i" ; 
done

And you can modify it to use grep like this:

for i in *.pdf ; 
do 
    grep '%%BoundingBox' "$i" > bounding ; 
    pdfmanipulate crop -o "${i%.pdf}-cropped.pdf" -b bounding "$i" ; 
done
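
One caveat with plain grep: DSC allows the header to say `%%BoundingBox: (atend)` and give the real numbers in the trailer, and the comment is normally found in the PostScript source, not in a PDF converted from it. A slightly more defensive extractor, sketched here with awk (the function name `dsc_bbox` is made up), reads the .ps/.eps file and skips the deferred form.

```shell
#!/bin/sh
# dsc_bbox FILE: print the first %%BoundingBox comment that carries numbers,
# skipping a header entry of the form "%%BoundingBox: (atend)".
# The exit status is 1 when no usable comment is found.
dsc_bbox() {
    awk '
        { sub(/\r$/, "") }                     # tolerate Windows line endings
        /^%%BoundingBox:/ && $2 != "(atend)" && NF >= 5 {
            print "%%BoundingBox:", $2, $3, $4, $5
            found = 1
            exit
        }
        END { exit !found }
    ' "$1"
}
```

In the loop above it could replace the grep line, for example as `dsc_bbox "${i%.pdf}.ps" > bounding`, assuming the PostScript source sits next to the PDF under the same base name.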

If I was trying to do this on Windows, I would install cygwin and use the same script.

"I have in the past been successful by combining Ghostscript gswin32c.exe and Calibre pdfmanipulate.exe. This is probably a familiar approach to many here."

I'm unsure why you are using 2 applications, it should be possible to perform the entire transformation with just Ghostscript.

"One problem arose after I "upgraded" to the 64 bit gswin64c.exe"

You haven't said what the problem was, have you reported it as a bug ? If people don't report bugs, they don't get fixed......

You seem to have some confusion between PostScript programs and comments. Any line in a PostScript program beginning '%' is a comment, and has no effect on the operation of the program. So BoundingBox comments won't do anything at all.

That said, there is a convention (Document Structuring Conventions, DSC for short) which describes a way of embedding comments in PostScript files which DSC processors can use. There are rules describing how the program must be structured for this to work. If a PostScript program begins %!PS-Adobe-m.n where m and n are integers, then it is declaring itself to be a DSC-compliant program and the version it is compliant with is the number 'm.n'. In this case the BoundingBox comment won't be used by a PostScript interpreter, but a DSC processor might use it. Note there is no requirement for it to be the second line of the file.....

In general Ghostscript ignores DSC comments, however if you set ProcessDSC to true, then it will make very limited use of it (primarily the BoundingBox comment to set the page size).
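
For files that really are EPS, Ghostscript also has a documented -dEPSCrop switch that sets the page to the file's %%BoundingBox during conversion. It appears to take effect only when the header line identifies the file as EPSF, so the sketch below checks for that first. `is_epsf` and `ps_to_pdf` are hypothetical helper names, and the header pattern is an assumption about how such files usually start.

```shell
#!/bin/sh
# is_epsf FILE: succeed if the first line declares the file as EPS,
# e.g. "%!PS-Adobe-3.0 EPSF-3.0".
is_epsf() {
    head -n 1 "$1" | grep -Eq '^%!PS-Adobe-[0-9.]+ EPSF-[0-9.]+'
}

# ps_to_pdf SRC OUT: convert with pdfwrite, adding -dEPSCrop for EPS input
# so that the page is cropped to the %%BoundingBox comment.
ps_to_pdf() {
    if is_epsf "$1"; then crop=-dEPSCrop; else crop=; fi
    # $crop is left unquoted on purpose so that an empty value disappears
    gs -q -sDEVICE=pdfwrite $crop -o "$2" "$1"
}
```

Plain PostScript files without the EPSF marker fall through to an uncropped conversion and would still need one of the bounding-box approaches.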

Moving on. You say you are using LaTeX ps2pdf, if you already have a PostScript file, you can send that to Ghostscript for conversion to PDF. It's not clear to me what exactly you are using Ghostscript for in this case, simply to find the real bounding box of the page ?

It's not clear to me what you mean by 'lossless' cropping, if you crop the content you must be losing something clearly, even if it's just white space.....

It's easy enough to do the conversion in one pass if you know the size you want to crop to, if you don't know the size then you can do it in 2 passes using only Ghostscript by first extracting the BoundingBox as you have done. That will get you 4 numbers, the bottom left and top right of the bounding box (if I remember correctly). You then create a 'translate' PostScript operation to move the content of the page down and left (so that it starts at 0,0, the bottom left corner). You also create a page device request to set the page size, the size being given by width = right - left and height = top - bottom. Feed the original file, along with the PostScript operators, to Ghostscript and select the pdfwrite device and you will get a PDF file.

As far as the bounding box goes, it may be a bug, or it may be that the file makes a mark, potentially using a white ink at the outside location. In this case the bounding box device will still regard it as part of the page content. You may be able to see that it isn't, but the device cannot. Consider if the page was first filled with a dark background, and the content outlined using white ink.

I cannot offer a comprehensive solution without understanding more about what you want to do, and ideally seeing one or more of your problematic files.

OK, touching briefly on DSC, your point about Ghostview is correct, but Ghostview is:

  1. not part of Ghostscript (surprising though that may be)
  2. a DSC aware application.

My comments were applicable to the PostScript language and meant to explain why Ghostscript ignores these comments.

The point about the 'second required comment'; it must be present (for DSC compliance), it doesn't have to be the second line. Though it wouldn't surprise me to hear that some applications erroneously require that.

As a general rule Ghostscript's pdfwrite PDF output device won't convert anything to rasters. There are some rare exceptions, usually involving unusual font types or colour spaces, or when converting a PDF with transparency to a PDF version prior to support for transparency (eg PDF/A or PDF/X).

To create a PDF file, cropped as required, from Ghostscript:

 gswin32c ^
  -o out.pdf ^
  -sDEVICE=pdfwrite ^
  -dDEVICEWIDTHPOINTS=xx -dDEVICEHEIGHTPOINTS=yy ^
  -dFIXEDMEDIA ^
  -c "-x -y translate" ^
  -f input.ps

You would have to calculate xx, yy, x and y from the returned BoundingBox of a previous invocation, if your PostScript file doesn't already include this information. Given what you say above, that seems to be the case.

In the general case:

  • xx = urx - llx,
  • yy = ury - lly,
  • x = llx,
  • y = lly
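
Using the box the asker quotes from the file itself (135 179 484 587), the arithmetic can be scripted roughly as follows. `crop_geometry` and `crop_ps` are made-up names. The gs call uses -dDEVICEWIDTHPOINTS and -dDEVICEHEIGHTPOINTS, the documented Ghostscript switches for a page size in points, and writes the offset as `x neg y neg translate` so that negative coordinates also work.

```shell
#!/bin/sh
# crop_geometry LLX LLY URX URY: print "xx yy x y", i.e. the page width,
# the page height and the offsets of the lower-left corner.
crop_geometry() {
    awk -v llx="$1" -v lly="$2" -v urx="$3" -v ury="$4" \
        'BEGIN { print urx - llx, ury - lly, llx, lly }'
}

# crop_ps SRC OUT LLX LLY URX URY: one-pass crop with pdfwrite.
crop_ps() {
    src=$1; out=$2
    set -- $(crop_geometry "$3" "$4" "$5" "$6")   # $1=xx $2=yy $3=x $4=y
    gs -o "$out" -sDEVICE=pdfwrite \
       -dDEVICEWIDTHPOINTS="$1" -dDEVICEHEIGHTPOINTS="$2" -dFIXEDMEDIA \
       -c "$3 neg $4 neg translate" -f "$src"
}

crop_geometry 135 179 484 587    # prints: 349 408 135 179
```

A call such as `crop_ps classic.ps out.pdf 135 179 484 587` would then request a 349 x 408 pt page. Files that call setpagedevice themselves may reset the translation; the 1pass.ps program above avoids that by redefining setpagedevice.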

A better solution would probably be to write a PostScript program to do the setup, that's easy enough to do.

You can email a file to 'ken.sharp AT artifex.com', or use any convenient file transfer facility and mail me a URL. I'm most interested in the case where the returned BoundingBox isn't what you expect....

I did look at the URLs you posted above, and I can't see one describing problems with the 64-bit version of Ghostscript. As a final question, which version of Ghostscript are you using ?

Answer about calibre/pdfmanipulate.exe

Calibre has removed pdfmanipulate.exe from recent releases. I found that I had to go back to version 0.8.66 to get pdfmanipulate; I downloaded the portable version, calibre-portable-0.8.66.zip.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow