Question

The documentation for Data.ByteString.hGetContents says

As with hGet, the string representation in the file is assumed to be ISO-8859-1.

Why should it have to "assume" anything about the "string representation in the file"? The data is not necessarily strings or encoded text at all. If I wanted something to deal with encoded text I'd use Data.Text or perhaps Data.ByteString.Char8. I thought the whole point of ByteString is that the data is handled as a list of 8-bit bytes, not as text characters. What is the impact of the assumption that it is ISO-8859-1?

Solution

It's a roundabout way of saying that no decoding is performed. ISO-8859-1 is a single-byte encoding in which every byte value maps to the character with the same code point, so nothing needs to be done and hGetContents simply gives you bytes in the range 0x00 - 0xFF:

$ cat utf-8.txt
ÇÈÄ
$ iconv -f iso8859-1 iso8859-1.txt                         
ÇÈÄ
$ ghci
> import System.IO
> import qualified Data.ByteString as BS
> openFile "iso8859-1.txt" ReadMode >>= (\h -> fmap BS.unpack $ BS.hGetContents h)
[199,200,196,10]
> openFile "utf-8.txt" ReadMode >>= (\h -> fmap BS.unpack $ BS.hGetContents h)
[195,135,195,136,195,132,10]
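
To make the "identity" reading of ISO-8859-1 concrete, here is a minimal sketch that assumes only the bytestring package. The names latin1Bytes and asLatin1 are made up for illustration. It views the four bytes from the ISO-8859-1 file as characters via Data.ByteString.Char8.unpack and checks that each character's code point equals the original byte value:

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BC
import Data.Char (ord)
import Data.Word (Word8)

-- The bytes read from iso8859-1.txt in the session above ("ÇÈÄ\n").
-- (Hypothetical name, used only for this sketch.)
latin1Bytes :: [Word8]
latin1Bytes = [199, 200, 196, 10]

-- Viewing a ByteString as ISO-8859-1 text: each byte becomes the Char
-- with the same numeric value, so nothing is added or lost.
asLatin1 :: BS.ByteString -> String
asLatin1 = BC.unpack

main :: IO ()
main = do
  let bs = BS.pack latin1Bytes
  -- Code points of the characters equal the raw byte values.
  print (map ord (asLatin1 bs) == map fromIntegral (BS.unpack bs))  -- True
  -- The bytes themselves are untouched.
  print (BS.unpack bs)                                              -- [199,200,196,10]
```

In other words, the "assumption" only matters if you choose to look at the bytes as characters; the bytes you get back are the same either way.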

Other tips

Perhaps it's similar to this, then:

There are situations where encodings are handled incorrectly but things still work. An often-encountered situation is a database that's set to latin-1 and an app that works with UTF-8 (or any other encoding). Pretty much any combination of 1s and 0s is valid in the single-byte latin-1 encoding scheme. If the database receives text from an application that looks like 11100111 10111000 10100111, it'll happily store it, thinking the app meant to store the three latin characters "ç¸§". After all, why not? It then later returns this bit sequence back to the app, which will happily accept it as the UTF-8 sequence for "縧", which it originally stored. The database admin interface, however, automatically figures out that the database is set to latin-1 and interprets any text as latin-1, so values look garbled only in the admin interface.
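
The two readings of that bit sequence can be reproduced with a short sketch. It assumes the text package is available alongside bytestring, and the name stored is invented for this example. The same three bytes decode to three characters under Latin-1 and to a single character under UTF-8, while the bytes themselves never change:

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import Data.Text.Encoding (decodeLatin1, decodeUtf8)

-- The three bytes from the example: 11100111 10111000 10100111.
-- (Hypothetical name, used only for this sketch.)
stored :: BS.ByteString
stored = BS.pack [0xE7, 0xB8, 0xA7]

main :: IO ()
main = do
  -- Read as Latin-1: three characters, one per byte (ç, ¸, §).
  print (map fromEnum (T.unpack (decodeLatin1 stored)))  -- [231,184,167]
  -- Read as UTF-8: a single character, U+7E27 (縧).
  print (map fromEnum (T.unpack (decodeUtf8 stored)))    -- [32295]
  -- The underlying bytes are identical in both cases.
  print (BS.unpack stored)                               -- [231,184,167]
```

Note that decodeUtf8 throws on invalid input; it is used here only because these particular bytes are known to be valid UTF-8.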

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow