Question

Just the other day I ran into a strange bug. I had to build a string of characters, and the host system I was communicating with used char 254 as the delimiter. I built my string, upper-cased it with String.toUpperCase, and sent it to the host. On the host I was receiving char 222 as the delimiter! After scratching my head and looking into it more deeply, it seemed odd that

hex : FE, binary: 11111110

was turning into

hex: DE, binary: 11011110

I tried passing Locale.getDefault() and Locale.ENGLISH to toUpperCase, to no avail.

Could it be that the implementation of String.toUpperCase has a mask for ALL chars except specific hard coded ones?

For now I'm using the following to get around the problem:

public static String toUpperCase(String input) {
    char[] chars = input.toCharArray();
    for (int i = 0; i < chars.length; ++i) {
        // Only touch ASCII 'a'..'z' (97..122); clearing bit 5 (mask 223 = 0xDF) gives 'A'..'Z'.
        if (chars[i] > 96 && chars[i] < 123) {
            chars[i] &= 223;
        }
    }
    return new String(chars);
}

My question is: am I missing something? Is there a better way that I am not aware of? Thanks a bunch!


Solution 2

Java strings are UTF-16. The first 256 values of the char primitive type in Java are exactly the same as the Latin-1 (ISO-8859-1) character set. On a Latin-1 chart you can see that capitalizing value 254 (lowercase Icelandic thorn, þ) converts it to value 222 (uppercase Icelandic thorn, Þ).
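A minimal sketch of that mapping, in case it helps to see it run. The class and method names here (ThornCaseDemo, upperCodeOf) are made up for illustration; only Character and String calls from the standard library are used.

```java
import java.util.Locale;

class ThornCaseDemo {
    /** Returns the code of the upper-case form of the given char code (hypothetical helper). */
    static int upperCodeOf(int code) {
        return Character.toUpperCase((char) code);
    }

    public static void main(String[] args) {
        char delimiter = (char) 254; // U+00FE, lowercase thorn

        // Java treats char 254 as a lowercase letter, not as a neutral symbol.
        System.out.println(Character.isLowerCase(delimiter)); // true
        System.out.println(upperCodeOf(delimiter));           // 222

        // The same mapping applies when the char sits inside a larger string.
        String record = "abc" + delimiter + "def";
        String upper = record.toUpperCase(Locale.ROOT);
        System.out.println((int) upper.charAt(3));            // 222
    }
}
```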

The moral is: don't use values which have case as delimiters in a String.
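If the delimiter is dictated by the host and cannot be changed, one possible approach is to upper-case each field on its own and only then join the fields with the delimiter, so the delimiter never passes through toUpperCase. This is only a sketch: DelimitedRecord and buildUpperCased are hypothetical names, and it assumes the fields themselves never contain char 254.

```java
import java.util.Locale;

class DelimitedRecord {
    // Delimiter required by the host in the question: char 254.
    static final char DELIMITER = (char) 254;

    /** Upper-cases each field separately, then joins the fields with the delimiter. */
    static String buildUpperCased(String... fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                sb.append(DELIMITER); // appended after case conversion, so it stays 254
            }
            sb.append(fields[i].toUpperCase(Locale.ROOT));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String record = buildUpperCased("abc", "def");
        System.out.println((int) record.charAt(3)); // 254
    }
}
```

Unlike the bit-mask workaround, this still upper-cases non-ASCII letters inside the fields, which may or may not be what the host expects.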

Other tips

The Unicode character 254 is the lower case thorn, þ, a letter used in Icelandic that stands roughly for the "th" sound. Its upper case version is the character 222, upper case thorn Þ. What did you expect would happen?

According to http://www.unicode.org/faq/casemap_charprop.html:

The Unicode Standard defines the default case mapping for each individual character, with each character considered in isolation. This mapping does not provide for the context in which the character appears, nor for the language-specific rules that must be applied when working in natural language text.

So it looks like the toUpperCase/toLowerCase methods work pretty much the same way regardless of which Locale you use. Specifying a different Locale may affect a few specific letters (like "i" in Turkish), but it doesn't make toUpperCase/toLowerCase stop working on entire groups of letters. So specifying Locale.ENGLISH doesn't make toUpperCase ignore Icelandic letters, or Russian or Greek letters.
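A small sketch of that point, comparing an English and a Turkish locale. LocaleCaseDemo and upperCode are hypothetical names; the locale only changes the result for "i", not for the thorn.

```java
import java.util.Locale;

class LocaleCaseDemo {
    /** Code of the first char of s after upper-casing it with the given locale (hypothetical helper). */
    static int upperCode(String s, Locale locale) {
        return s.toUpperCase(locale).charAt(0);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");
        String thorn = "\u00FE"; // char 254

        // The thorn maps to 222 under both locales.
        System.out.println(upperCode(thorn, Locale.ENGLISH)); // 222
        System.out.println(upperCode(thorn, turkish));        // 222

        // "i" is one of the few letters with a locale-specific mapping.
        System.out.println(upperCode("i", Locale.ENGLISH));   // 73  (I)
        System.out.println(upperCode("i", turkish));          // 304 (U+0130, dotted capital I)
    }
}
```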

It cannot be the case that String.toUpperCase() does anything but convert to upper case in a given char set.

Your question seems to imply that the link between your system and the host uses an 8-bit character set (ASCII?). However, Java uses 16-bit chars (UTF-16 code units), and a String can be encoded to bytes in a variety of character sets (UTF-8, ISO-8859-1, etc.). So something must be doing the conversion, both in choosing the character set and in converting to 8 bits. If the character set is UTF-8, then the first 128 code points (0 to 127) map 1-1 with ASCII. However, you are concerned with chars outside of that range, so a more complex conversion is needed. I'm guessing that is where the problem is.

So I think you should:

  1. Find out what char set the host is expecting
  2. Find out where the conversion from Java's 16-bit chars is happening. Are you doing that yourself?

I would guess that the strange behavior is somewhere in there.
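To check what the host would actually see on the wire, it can help to print the bytes that char 254 turns into under a few candidate character sets. The sketch below is illustrative only: DelimiterBytes and hexIn are hypothetical names, and which charset the real link uses is not known from the question.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DelimiterBytes {
    /** Encodes text with the given charset and returns the bytes as space-separated hex. */
    static String hexIn(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String delimiter = String.valueOf((char) 254);

        // Latin-1 keeps char 254 as the single byte 0xFE.
        System.out.println(hexIn(delimiter, StandardCharsets.ISO_8859_1)); // FE
        // UTF-8 needs two bytes for any code point above 127.
        System.out.println(hexIn(delimiter, StandardCharsets.UTF_8));      // C3 BE
        // US-ASCII cannot represent it; getBytes substitutes '?' (0x3F).
        System.out.println(hexIn(delimiter, StandardCharsets.US_ASCII));   // 3F
    }
}
```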

Sorry I can't be more help. If you give me more details about the comm link and the conversion process, I might be able to shed more light on what's going on.

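import java.nio.charset.StandardCharsets;
import java.util.Locale;

Locale trLocale = Locale.forLanguageTag("tr-TR");
Locale enLocale = Locale.forLanguageTag("en-US"); // BCP 47 tags use hyphens, not underscores
System.out.println("üğişçö".toUpperCase(trLocale)); // ÜĞİŞÇÖ (Turkish: i -> dotted İ)
System.out.println("üğişçö".toUpperCase(enLocale)); // ÜĞIŞÇÖ (i -> plain I)
// Round-tripping a well-formed String through UTF-8 yields an equal String.
value = new String(value.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);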
Licensed under: CC-BY-SA with attribution