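I have a UTF-8 file in which some lines contain the U+2028 line separator character (http://www.fileformat.info/info/unicode/char/2028/index.htm). I don't want it to be treated as a line break when I read lines from the file. Is there a way to exclude it from the separators when I iterate over the file or use readlines()? (Other than reading the whole file into a string and then splitting it on \n.) Thanks!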


Solution

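I can't reproduce this behavior (U+2028 being treated as a line break) in Python 2.5, 2.6, or 3.0 on Mac OS X. Could you go into more detail about where you're seeing this?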

That said, here is a subclass of the "file" class that might do what you want:

#!/usr/bin/python
# -*- coding: utf-8 -*-
# Python 2 only: relies on the built-in `file` type and the print statement.
class MyFile(file):
    def __init__(self, *arg, **kwarg):
        file.__init__(self, *arg, **kwarg)
        self.EOF = False

    def next(self, catchEOF=False):
        if self.EOF:
            raise StopIteration("End of file")
        try:
            nextLine = file.next(self)
        except StopIteration:
            self.EOF = True
            if not catchEOF:
                raise
            # Hit EOF while joining: there is nothing more to append.
            return ""
        # If the line was cut at U+2028, glue the following line onto it.
        if nextLine.decode("utf8").endswith(u'\u2028'):
            return nextLine + self.next(catchEOF=True)
        else:
            return nextLine

A = MyFile("someUnicode.txt")
for line in A:
    print line.strip("\n").decode("utf8")

Other tips

I couldn't reproduce this behavior, but here is a naive solution that simply merges the lines it reads until one no longer ends with U+2028.

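#!/usr/bin/env python
# Python 2 code (print statement, str.decode); the __future__ import is for 2.5.

from __future__ import with_statement

def my_readlines(f):
  # Accumulate decoded lines; a line ending in U+2028 is glued to the next one.
  buf = u""
  for line in f:
    uline = line.decode('utf8')
    buf += uline
    if not uline.endswith(u'\u2028'):
      yield buf
      buf = u""
  if buf:
    # The file ended on a U+2028 fragment; emit what is left.
    yield buf

with open("in.txt", "rb") as fin:
  for l in my_readlines(fin):
    print l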
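
For readers on Python 3, the same merging idea can be sketched roughly as below. This is not the answerer's code: it assumes the input is an iterable of already-decoded str pieces, and the name merge_on_u2028 and the sample data are made up for illustration.

```python
LINE_SEPARATOR = "\u2028"

def merge_on_u2028(lines):
    """Yield lines, rejoining any piece that was cut at U+2028.

    `lines` is assumed to be an iterable of already-decoded str objects,
    such as the output of a reader that also splits on U+2028.
    """
    buf = ""
    for line in lines:
        buf += line
        # A piece ending in U+2028 is not a real line end: keep accumulating.
        if not buf.endswith(LINE_SEPARATOR):
            yield buf
            buf = ""
    if buf:
        # The input ended on a U+2028 piece; emit what was collected.
        yield buf

# Hypothetical pieces, as a reader that splits on U+2028 might return them.
pieces = ["first\u2028", "still first\n", "second\n"]
merged = list(merge_on_u2028(pieces))
print(merged)
```

Because it is a generator, it can wrap any line iterator without loading the whole file into memory.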

Thanks to everyone for answering. I think I know why you might not have been able to replicate this. I just realized that it happens if I decode the file when opening it, as in:

import codecs

# Decoding while reading: lines are also split on U+2028.
f = codecs.open(filename, encoding='utf-8')
for line in f:
    print line

The lines are not split on U+2028 if I open the file first and then decode the individual lines:

f = open(filename)
for line in f:
    print line.decode("utf8")

(I'm using Python 2.6 on Windows. The file was originally UTF-16LE and was then converted to UTF-8.)

This is very interesting, I guess I won't be using codecs.open much from now on :-).
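
The difference can be reproduced in memory, without a file. The sketch below is written for Python 3 (the snippets above are Python 2) and uses codecs.getreader on a BytesIO as a stand-in for codecs.open, on the assumption that both go through the same stream reader; the sample text is made up.

```python
import codecs
import io

# Hypothetical sample: one U+2028 inside the first line, two real "\n" line ends.
data = "first\u2028still first\nsecond\n".encode("utf-8")

# Decoding while reading: the codecs stream reader also splits on U+2028.
decoded_reader = codecs.getreader("utf-8")(io.BytesIO(data))
via_codecs = list(decoded_reader)

# Splitting the raw bytes first, then decoding each line: only b"\n" ends a line.
via_bytes = [raw.decode("utf-8") for raw in io.BytesIO(data)]

print(len(via_codecs), len(via_bytes))
```

The codecs reader should return three pieces and the byte-level loop two, which matches the two behaviors described above.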

If you use Python 3.0 (note that I don't, so I can't test), according to the documentation you can pass an optional newline parameter to open to specify which line separator to use. However, the documentation doesn't mention U+2028 at all (it only mentions \r, \n, and \r\n as line separators), so it's actually a surprise to me that this even occurs (although I can confirm it even with Python 2.6).
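
As a rough check of the newline parameter on Python 3, the sketch below writes a small temporary file and reads it back with the built-in open. The file contents are made up for illustration. Since newline only accepts None, '', '\n', '\r', and '\r\n', U+2028 should never act as a line end here.

```python
import os
import tempfile

# Hypothetical sample text with one U+2028 inside the first line.
text = "first\u2028still first\nsecond\n"

# Write the sample as UTF-8 bytes so no newline translation happens on write.
fd, path = tempfile.mkstemp(suffix=".txt")
try:
    with os.fdopen(fd, "wb") as out:
        out.write(text.encode("utf-8"))

    # newline="\n": lines end only at "\n" and are returned untranslated.
    with open(path, encoding="utf-8", newline="\n") as f:
        lines = list(f)
finally:
    os.remove(path)

print(lines)
```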

The codecs module is doing the RIGHT thing. U+2028 is named "LINE SEPARATOR" with the comment "may be used to represent this semantic unambiguously". So treating it as a line separator is sensible.
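
The character's standing in Unicode can be checked from Python itself. This Python 3 sketch uses the standard unicodedata module and str.splitlines, which treats U+2028 as a line boundary; the sample string is made up.

```python
import unicodedata

ch = "\u2028"
name = unicodedata.name(ch)          # official Unicode character name
category = unicodedata.category(ch)  # general category; "Zl" is Line Separator

sample = "a\u2028b\nc"
by_splitlines = sample.splitlines()  # splits on Unicode line boundaries
by_newline = sample.split("\n")      # splits on "\n" only

print(name, category, by_splitlines, by_newline)
```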

Presumably the creator would not have put the U+2028 characters there without good reason ... does the file have u"\n" as well? Why do you want lines not to be split on U+2028?

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow