Question

I need a way to detect whether a file contains characters from a certain charset.

Specifically, I want to detect the presence of UTF-8-encoded Cyrillic characters in a series of files. Is there a tool to do this?

Thanks


Solution

If you are looking for a ready-made solution, you might want to try Enca.

However, if you only want to detect the presence of byte sequences that could be decoded as UTF-8 Cyrillic characters (without a full UTF-8 validity check), you can just grep for something like /(\xD0[\x81\x90-\xBF]|\xD1[\x80-\x8F\x91]){n,}/ (this exact regexp matches n or more consecutive UTF-8-encoded Russian Cyrillic characters). To additionally check that the whole file contains only valid UTF-8 data, you can use something like isutf8(1).

Both methods have their strengths and weaknesses, and either may sometimes give wrong results.
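
For anyone who would rather script the byte-level check than use grep, here is a minimal Python sketch of the same idea. The function names has_cyrillic_run and is_valid_utf8 and the default n=3 are my own choices for illustration, not part of any tool mentioned above. The validity check is only a rough stand-in for isutf8(1), done with Python's strict UTF-8 decoder.

```python
import re

# Byte-level pattern taken from the regexp in the answer:
#   \xD0\x81          -> Ё (U+0401)
#   \xD0\x90-\xD0\xBF -> А..п (U+0410..U+043F)
#   \xD1\x80-\xD1\x8F -> р..я (U+0440..U+044F)
#   \xD1\x91          -> ё (U+0451)
CYRILLIC_CHAR = rb"(?:\xD0[\x81\x90-\xBF]|\xD1[\x80-\x8F\x91])"


def has_cyrillic_run(data: bytes, n: int = 3) -> bool:
    """Return True if data holds at least n consecutive byte pairs that
    would decode as Russian Cyrillic letters in UTF-8.

    This is a heuristic: it does not verify that the rest of the data
    is valid UTF-8. The default n=3 is an arbitrary assumption.
    """
    pattern = re.compile(CYRILLIC_CHAR + b"{%d,}" % n)
    return pattern.search(data) is not None


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the whole byte string decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

To use it, read each file in binary mode (open(path, "rb").read()) and pass the bytes to both functions. A larger n reduces false positives from binary data or other encodings, at the cost of missing files that contain only very short Cyrillic fragments.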

OTHER TIPS

IIRC the ICU library has code that does character set detection, though it's basically a best-effort guess.

Edit: I did remember correctly, check out this paper / tutorial

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow