Without using any additional libraries, how would someone approach the challenge of reading the metadata of .pdf files in Python?

StackOverflow https://stackoverflow.com/questions/22257212

  •  11-06-2023

Question

I know this is not an easy question and I do not expect an easy answer. I want to learn more about this, and the only way to do it is the hard way.

What first steps should I take?

Solution

If you want entries such as 'CreationDate' or 'Author', you can try this quick-and-dirty solution. In a PDF, this information normally looks like this:

obj
<<
/Author(NameOfAuthor)
/CreationDate(D:20040910110429)
/Producer(AcrobatPdfWriter)
>>
endobj
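
The CreationDate value in the example above follows the PDF date convention `D:YYYYMMDDHHmmSS`, optionally followed by a time-zone offset. As a minimal sketch using only the standard library, it could be turned into a `datetime` like this. The helper name `parse_pdf_date` is hypothetical, and the sketch assumes all six date fields are present and ignores any time-zone suffix:

```python
from datetime import datetime


def parse_pdf_date(value):
    """Convert a PDF date string such as 'D:20040910110429' to a datetime."""
    # Drop the optional 'D:' prefix.
    if value.startswith('D:'):
        value = value[2:]
    # Assumption: year through seconds are all present. Anything after the
    # first 14 digits (for example a time-zone offset) is ignored.
    return datetime.strptime(value[:14], '%Y%m%d%H%M%S')


print(parse_pdf_date('D:20040910110429'))  # prints 2004-09-10 11:04:29
```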

I am not sure this applies to every PDF, but I got some decent data that you can clean up afterwards. It only works if the entries are on separate lines.

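metadata_fields = ['Creator', 'CreationDate', 'Producer', 'ModDate']
# PDFs are binary files; latin-1 maps every byte to a character, so decoding never fails.
with open('path_to_your_file.pdf', encoding='latin-1') as my_pdf:
    # Keep each line that mentions at least one of the metadata keys.
    meta_values = [line.rstrip('\n') for line in my_pdf
                   if any(item in line for item in metadata_fields)]
print(meta_values)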

Output:

['<</Producer(AFPL Ghostscript 8.11)', '/CreationDate(D:20040910110429)',
 '/ModDate(D:20040910110429)', '/Creator(PDFCreator Version 0.8.0)']
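
To clean up lines like these, or to cope with several entries sharing one line, a regular expression over the raw bytes may work better than a line-by-line scan. The following is only a sketch under stated assumptions: the names `METADATA_FIELDS`, `ENTRY_RE`, `extract_metadata` and `read_pdf_metadata` are made up for illustration, the values are assumed to be plain literal strings without nested or escaped parentheses, and the Info dictionary is assumed to be stored uncompressed and unencrypted. PDFs that keep it in a compressed object stream, or that encode values as hex or UTF-16 strings, would need more work.

```python
import re

# Keys to look for: the ones used in the answer, plus Author.
METADATA_FIELDS = (b'Author', b'Creator', b'CreationDate', b'Producer', b'ModDate')

# Matches "/Key(value)" or "/Key (value)"; the value stops at the first ")".
# Assumption: values contain no nested or escaped parentheses.
ENTRY_RE = re.compile(rb'/(' + b'|'.join(METADATA_FIELDS) + rb')\s*\(([^)]*)\)')


def extract_metadata(raw):
    """Return a dict of the metadata entries found in the raw bytes of a PDF."""
    found = {}
    for key, value in ENTRY_RE.findall(raw):
        # latin-1 decodes any byte sequence without raising.
        # setdefault keeps the first occurrence of each key.
        found.setdefault(key.decode('latin-1'), value.decode('latin-1'))
    return found


def read_pdf_metadata(path):
    """Read the whole file in binary mode and extract its metadata entries."""
    with open(path, 'rb') as pdf_file:
        return extract_metadata(pdf_file.read())


# Demo on bytes shaped like the output above, with two entries on one line.
sample = (b'<</Producer(AFPL Ghostscript 8.11)\n'
          b'/CreationDate(D:20040910110429)\n'
          b'/ModDate(D:20040910110429)/Creator(PDFCreator Version 0.8.0)>>')
print(extract_metadata(sample))
```

For a real file, `read_pdf_metadata('path_to_your_file.pdf')` would return the same kind of dictionary, or an empty one if no matching entries are stored in plain form.
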
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow