Extracting text from MS Word files in Python

https://stackoverflow.com/questions/125222

02-07-2019

Question

For working with MS Word files in Python on Windows, there are the Python win32 extensions. How do I do the same on Linux? Is there a library for it?

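Solution

You could make a subprocess call to antiword. Antiword is a Linux command-line utility for dumping the text out of a Word document. It works pretty well for simple documents (obviously it loses the formatting). It is available through apt, and probably as an RPM, or you could compile it yourself.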
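
The subprocess call described above can be sketched as follows. This is a minimal sketch, not part of the original answer: the function name `doc_to_text` and its `tool` parameter are hypothetical, and it assumes that antiword is on the PATH and that its output can be decoded as UTF-8 (undecodable bytes are replaced).

```python
import subprocess

def doc_to_text(path, tool="antiword"):
    """Run `tool` on `path` and return whatever it prints to stdout."""
    # Passing a list of arguments (not a shell string) avoids quoting
    # problems with file names that contain spaces.
    result = subprocess.run(
        [tool, path],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError if the tool fails
    )
    # Assumption: the tool writes UTF-8; anything else is replaced.
    return result.stdout.decode("utf-8", errors="replace")
```

Calling `doc_to_text("input_file.doc")` would then return the document text as a string, without going through a temporary output file.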

Other tips

Use the native Python docx module. Here is how to extract all the text from a document:

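import docx  # provided by the python-docx package

document = docx.Document(filename)
# paragraph.text is already a text string, so no manual encoding is needed
docText = '\n\n'.join(
    paragraph.text for paragraph in document.paragraphs
)
print(docText)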

See the python-docx site.

Also check out Textract, which pulls out tables etc.

Parsing XML with regexes invokes Cthulhu. Don't do it!

benjamin's answer is a pretty good one. I have just consolidated it:
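
A minimal sketch of the parser-based alternative, using only the standard library (`zipfile` and `xml.etree.ElementTree`). The function name `docx_paragraphs` is hypothetical. It assumes the main text lives in `word/document.xml` and ignores headers, footers and footnotes; paragraphs nested inside text boxes may be reported more than once.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_paragraphs(source):
    """Return the text of each paragraph (w:p) in a .docx file.

    `source` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # The text sits in w:t elements, which Word may split across
        # several runs inside one paragraph.
        pieces = [t.text or "" for t in para.iter(W_NS + "t")]
        paragraphs.append("".join(pieces))
    return paragraphs
```

Because a real XML parser is used, entities such as `&lt;` are decoded properly, which a tag-stripping regex does not do.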

import zipfile, re

docx = zipfile.ZipFile('/path/to/file/mydocument.docx')
content = docx.read('word/document.xml').decode('utf-8')
cleaned = re.sub('<(.|\n)*?>','',content)
print(cleaned)

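OpenOffice.org can be scripted with Python: see here.

Since OOo can load most MS Word files flawlessly, I'd say that's your best bet.

I know this is an old question, but I was recently trying to find a way to extract text from MS Word files, and the best solution I found by far was wvLib: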

http://wvware.sourceforge.net/

After installing the library, using it from Python is pretty easy:

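# subprocess.getoutput is the Python 3 replacement for commands.getoutput
import subprocess

# word_file and output_txt_file are paths that you supply
exe = 'wvText ' + word_file + ' ' + output_txt_file
out = subprocess.getoutput(exe)
exe = 'cat ' + output_txt_file
out = subprocess.getoutput(exe)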

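And that's it. Essentially, we use the getoutput function to run a couple of shell commands: wvText (which extracts the text from a Word document) and cat (which reads the output file). After that, the entire text of the Word document is in the out variable, ready to use.

Hopefully this will help anyone who has a similar problem.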

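Take a look at "How the doc format works" and "Create word document using PHP in linux". The former is especially useful. Abiword is the tool I recommend. There are limitations, though:

However, if the document has complicated tables, text boxes, embedded spreadsheets, and so forth, then it might not work as expected. Developing good MS Word filters is a very difficult process, so please bear with us as we work on getting Word documents to open correctly. If you have a Word document which fails to load, please open a Bug and include the document so we can improve the importer.

(Note: I posted this on another question as well, but it seems relevant here, so please excuse the repost.)

Now, this is pretty ugly and pretty hacky, but it seems to work for me for basic text extraction. Obviously, to use this in a Qt program you would have to spawn a process for it, but the command line I've hacked together is: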

unzip -p file.docx | grep '<w:t' | sed 's/<[^<]*>//g' | grep -v '^[[:space:]]*$'

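So that's:

unzip -p file.docx: -p == "unzip to stdout"

grep '<w:t': grab just the lines containing '<w:t' (<w:t> is the Word 2007 XML element for "text", as far as I can tell)

sed 's/<[^<]*>//g': remove everything inside tags

grep -v '^[[:space:]]*$': remove blank lines

There is likely a more efficient way to do this, but it seems to work for me on the few docs I've tested it with.

As far as I'm aware, unzip, grep and sed all have ports for Windows and any of the Unixes, so it should be reasonably cross-platform. Despite being a bit of an ugly hack ;)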

If your intention is to use pure Python modules without calling a subprocess, you can use the zipfile Python module.

import zipfile

content = ""
# Load DocX into zipfile
docx = zipfile.ZipFile('/home/whateverdocument.docx')
# Unpack zipfile
unpacked = docx.infolist()
# Find the /word/document.xml file in the package and assign it to variable
for item in unpacked:
    if item.orig_filename == 'word/document.xml':
        # read() returns bytes; decode so the cleanup that follows works on text
        content = docx.read(item.orig_filename).decode('utf-8')

Your content string, however, needs to be cleaned up. One way of doing this is:

# Clean the content string from xml tags for better search
fullyclean = []
halfclean = content.split('<')
for item in halfclean:
    if '>' in item:
        bad_good = item.split('>')
        if bad_good[-1] != '':
            fullyclean.append(bad_good[-1])
        else:
            pass
    else:
        pass

# Assemble a new string with all pure content
content = " ".join(fullyclean)

There is surely a more elegant way to clean up the string, probably using the re module. Hope this helps.

Unoconv might also be a good alternative: http://linux.die.net/man/1/unoconv

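If you have LibreOffice installed, you can simply call it from the command line to convert the file to text, then load the text into Python.

I'm not sure you're going to have much luck without using COM. The .doc format is ridiculously complex, and is often called a "memory dump" of Word at the time of saving!

At Swati, that's in HTML, which is fine and dandy, but most Word documents aren't so nice!

To read Word 2007 and later files, including .docx files, you can use the python-docx package: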
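
A minimal sketch of such a command-line conversion, assuming the office suite is LibreOffice and that its `soffice` binary is on the PATH. The helper names `build_convert_command` and `office_to_text` are hypothetical, and the output is assumed to be readable as UTF-8.

```python
import subprocess
from pathlib import Path

def build_convert_command(doc_path, out_dir, binary="soffice"):
    """Build the command line that converts one file to plain text."""
    return [binary, "--headless", "--convert-to", "txt:Text",
            "--outdir", str(out_dir), str(doc_path)]

def office_to_text(doc_path, out_dir, binary="soffice"):
    """Convert `doc_path` to text in `out_dir` and return the text."""
    subprocess.run(build_convert_command(doc_path, out_dir, binary), check=True)
    # The output file is named after the input file, with a .txt suffix.
    txt_path = Path(out_dir) / (Path(doc_path).stem + ".txt")
    # Assumption: the converted text is UTF-8; other bytes are replaced.
    return txt_path.read_text(encoding="utf-8", errors="replace")
```

Keeping the command construction in its own function makes it easy to inspect or log the exact command before running it.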

from docx import Document
document = Document('existing-document-file.docx')
document.save('new-file-name.docx')

To read .doc files from Word 2003 and earlier, make a subprocess call to antiword. You need to install antiword first:

sudo apt-get install antiword

Then just call it from your Python script:

import os
input_word_file = "input_file.doc"
output_text_file = "output_file.txt"
os.system('antiword %s > %s' % (input_word_file, output_text_file))

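Is this an old question? I believe that such a thing does not exist. There are only answered and unanswered ones. This one is pretty much unanswered, or half answered if you wish. Methods for reading *.docx (MS Word 2007 and later) documents without using COM interop are all covered. But a method for extracting text from *.doc (MS Word 97-2000) using Python only is lacking. Is this complicated? To do: not really. To understand: well, that's another thing.

When I didn't find any finished code, I read some format specifications and dug out some algorithms proposed in other languages.

An MS Word (*.doc) file is an OLE2 compound file. Not to bother you with a lot of unnecessary details, think of it as a file system stored in a file. It actually uses a FAT structure, so the definition holds. (Hm, maybe you can loop-mount it in Linux???) In this way, you can store more files within a file, like pictures etc. The same is done in *.docx by using a ZIP archive instead. There are packages available on PyPI that can read OLE files (olefile, compoundfiles, ...). I used the compoundfiles package to open the *.doc file. However, in MS Word 97-2000, the internal subfiles are not XML or HTML, but binary files. And as if this were not enough, each contains information about another one, so you have to read at least two of them and unravel the stored information accordingly. To understand it fully, read the PDF document from which I took the algorithm.

The code below was composed very hastily and tested on a small number of files. As far as I can see, it works as intended. Sometimes some gibberish appears at the start, and almost always at the end, of the text. There can be some odd characters in between as well.

Those of you who just wish to search for text will be happy. Still, I urge anyone who can help to improve this code to do so.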
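
The difference between the two containers can be seen from the first bytes of a file. The following is a minimal sketch, with a hypothetical function name `sniff_word_container`. It only identifies the container (OLE2 compound file or ZIP archive), not whether the file really is a Word document.

```python
# First 8 bytes of every OLE2 compound file
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"
# Local file header signature at the start of an ordinary ZIP archive
ZIP_MAGIC = b"PK\x03\x04"

def sniff_word_container(data):
    """Classify the leading bytes of a file as 'ole2', 'zip' or 'unknown'."""
    if data.startswith(OLE2_MAGIC):
        return "ole2"   # container used by *.doc
    if data.startswith(ZIP_MAGIC):
        return "zip"    # container used by *.docx
    return "unknown"
```

For a file on disk, pass in the first few bytes, for example `sniff_word_container(open(path, "rb").read(8))`.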


doc2text module:
"""
This is Python implementation of C# algorithm proposed in:
http://b2xtranslator.sourceforge.net/howtos/How_to_retrieve_text_from_a_binary_doc_file.pdf

Python implementation author is Dalen Bernaca.
Code needs refining and probably bug fixing!
As I am not a C# expert I would like some code rechecks by one.
Parts of which I am uncertain are:
    * Did the author of the original algorithm use uint32 and int32 correctly when unpacking?
      I copied each occurrence as in the original algorithm.
    * Is the FIB length for MS Word 97 1472 bytes as in MS Word 2000, and would it make any difference if it is not?
    * Did I interpret each C# command correctly?
      I think I did!
"""

from compoundfiles import CompoundFileReader, CompoundFileError
from struct import unpack

__all__ = ["doc2text"]

def doc2text(path):
    # Python 3 version of the function.
    text = ""
    cr = CompoundFileReader(path)
    # Load WordDocument stream:
    try:
        f = cr.open("WordDocument")
        doc = f.read()
        f.close()
    except Exception:
        cr.close()
        raise CompoundFileError("The file is corrupted or it is not a Word document at all.")
    # Extract file information block and piece table stream information from it.
    # The "<" prefix forces little-endian, fixed 4-byte values on every platform:
    fib = doc[:1472]
    fcClx  = unpack("<L", fib[0x01a2:0x01a6])[0]
    lcbClx = unpack("<L", fib[0x01a6:0x01a6+4])[0]
    tableFlag = unpack("<L", fib[0x000a:0x000a+4])[0] & 0x0200 == 0x0200
    tableName = ("0Table", "1Table")[tableFlag]
    # Load piece table stream:
    try:
        f = cr.open(tableName)
        table = f.read()
        f.close()
    except Exception:
        cr.close()
        raise CompoundFileError("The file is corrupt. '%s' piece table stream is missing." % tableName)
    cr.close()
    # Find piece table inside a table stream:
    clx = table[fcClx:fcClx+lcbClx]
    pos = 0
    pieceTable = b""
    lcbPieceTable = 0
    while True:
        # Slices (not indexes) keep the comparison bytes-to-bytes:
        if clx[pos:pos+1] == b"\x02":
            # This is piece table, we store it:
            lcbPieceTable = unpack("<l", clx[pos+1:pos+5])[0]
            pieceTable = clx[pos+5:pos+5+lcbPieceTable]
            break
        elif clx[pos:pos+1] == b"\x01":
            # This is the beginning of some other substructure, we skip it:
            pos = pos+1+1+clx[pos+1]
        else: break
    if not pieceTable: raise CompoundFileError("The file is corrupt. Cannot locate a piece table.")
    # Read info from pieceTable, about each piece and extract it from WordDocument stream:
    pieceCount = (lcbPieceTable-4)//12
    for x in range(pieceCount):
        cpStart = unpack("<l", pieceTable[x*4:x*4+4])[0]
        cpEnd   = unpack("<l", pieceTable[(x+1)*4:(x+1)*4+4])[0]
        ofsetDescriptor = ((pieceCount+1)*4)+(x*8)
        pieceDescriptor = pieceTable[ofsetDescriptor:ofsetDescriptor+8]
        fcValue = unpack("<L", pieceDescriptor[2:6])[0]
        isANSII = (fcValue & 0x40000000) == 0x40000000
        fc      = fcValue & 0xbfffffff
        if isANSII:
            # Changed from the original code: per the [MS-DOC] FcCompressed
            # definition, 8-bit text starts at offset fc/2.
            fc //= 2
        cb = cpEnd-cpStart
        enc = ("utf-16-le", "cp1252")[isANSII]
        cb = (cb*2, cb)[isANSII]
        text += doc[fc:fc+cb].decode(enc, "ignore")
    return "\n".join(text.splitlines())

Just another option for reading 'doc' files without using COM: miette. It should work on any platform.

License: CC-BY-SA with attribution

Not affiliated with StackOverflow