Java String.toUpperCase()

Question 1

Java strings are UTF-16. The first 256 values of the char primitive type in Java are exactly the same as the Latin-1 (ISO-8859-1) character set. On a Latin-1 chart you can see that upper-casing value 254 (lower case Icelandic thorn, þ) converts it to value 222 (upper case Icelandic thorn, Þ).

The moral is: don't use characters that have an upper or lower case form as delimiters in a String that may be case-converted.
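To make the delimiter problem concrete, here is a minimal sketch. The class name, the helper methods and the sample record "abc" + þ + "def" are made up for illustration; only the character values 254 and 222 come from the discussion above.

```java
import java.util.Locale;

public class ThornDelimiterDemo {
    // Hypothetical delimiter for illustration: lower case thorn, U+00FE (decimal 254).
    static final String DELIMITER = "\u00FE";

    // Joins fields with the delimiter, as a simple record format might do.
    static String join(String... fields) {
        return String.join(DELIMITER, fields);
    }

    // Splits a record on the delimiter, keeping trailing empty fields.
    static String[] split(String record) {
        return record.split(DELIMITER, -1);
    }

    public static void main(String[] args) {
        String record = join("abc", "def");
        String upper = record.toUpperCase(Locale.ROOT);

        System.out.println((int) record.charAt(3)); // 254 (lower case thorn)
        System.out.println((int) upper.charAt(3));  // 222 (upper case thorn)
        System.out.println(split(record).length);   // 2: delimiter still found
        System.out.println(split(upper).length);    // 1: delimiter was upper-cased away
    }
}
```

If the delimiter has to survive case conversion, a character with no case mapping (a control character or punctuation, for example) avoids the problem.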

Question 2

The Unicode character 254 (U+00FE) is the lower case thorn, þ, a letter used in Icelandic that stands roughly for the "th" sound. Its upper case version is character 222 (U+00DE), the upper case thorn Þ. What did you expect would happen?

Question 3

According to http://www.unicode.org/faq/casemap_charprop.html:

The Unicode Standard defines the default case mapping for each individual character, with each character considered in isolation. This mapping does not provide for the context in which the character appears, nor for the language-specific rules that must be applied when working in natural language text.

So it looks like the toUpperCase/toLowerCase methods work pretty much the same way regardless of which Locale you use. Specifying a different Locale may affect a few specific letters (like "i" in Turkish), but it doesn't make toUpperCase/toLowerCase stop working on entire groups of letters. So specifying Locale.ENGLISH doesn't make toUpperCase ignore Icelandic letters, or Russian or Greek letters.
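A quick way to check this is to upper-case single letters under different locales and print the resulting code points. This is only a sketch; the class and helper names are invented for the example, and the sample letters (þ, д, λ, i) are my own picks.

```java
import java.util.Locale;

public class LocaleCaseDemo {
    // Upper-cases with English rules.
    static String upperEnglish(String s) {
        return s.toUpperCase(Locale.ENGLISH);
    }

    // Upper-cases with Turkish rules.
    static String upperTurkish(String s) {
        return s.toUpperCase(Locale.forLanguageTag("tr-TR"));
    }

    // Hex value of the first code point, to avoid console encoding issues.
    static String firstCodePoint(String s) {
        return Integer.toHexString(s.codePointAt(0));
    }

    public static void main(String[] args) {
        // Locale.ENGLISH still upper-cases non-English letters.
        System.out.println(firstCodePoint(upperEnglish("\u00FE"))); // de  (thorn)
        System.out.println(firstCodePoint(upperEnglish("\u0434"))); // 414 (Cyrillic de)
        System.out.println(firstCodePoint(upperEnglish("\u03BB"))); // 39b (Greek lambda)

        // The locale only changes a few special cases, such as "i".
        System.out.println(firstCodePoint(upperEnglish("i")));      // 49  (I)
        System.out.println(firstCodePoint(upperTurkish("i")));      // 130 (dotted capital I)
    }
}
```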

Question 4

It cannot be the case that String.toUpperCase() does anything but convert to upper case in a given char set.

Your question seems to imply that the link between your system and the host uses an 8-bit character set (ASCII?). However, Java strings are made of 16-bit chars (UTF-16 code units), so something must be converting them to bytes in some character set (ISO-8859-1, UTF-8, etc.) before they go over the link. If the character set is UTF-8, then the first 128 characters (0 to 127) map 1-1 to ASCII. However, you are concerned with chars outside that range, so a more complex conversion is needed. I'm guessing that is where the problem is.

So I think you should:

Find out what char set the host is expecting
Find out where the conversion from Java 16-bit chars is happening. Are you doing that yourself?

I would guess that the strange behavior is somewhere in there.
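To see why the choice of character set matters for chars above 127, here is a small sketch that prints the bytes produced for þ under three standard charsets. The class and method names are made up, and I don't know which charset your host really expects, so treat the three charsets as examples only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Returns the encoded bytes of s as space-separated upper case hex.
    static String hex(String s, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(charset)) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String thorn = "\u00FE"; // lower case thorn, char value 254

        // Latin-1: one byte, same value as the char.
        System.out.println(hex(thorn, StandardCharsets.ISO_8859_1)); // FE
        // UTF-8: two bytes for anything above 127.
        System.out.println(hex(thorn, StandardCharsets.UTF_8));      // C3 BE
        // ASCII cannot represent it; getBytes substitutes '?'.
        System.out.println(hex(thorn, StandardCharsets.US_ASCII));   // 3F
    }
}
```

If the sender and the host disagree on which of these is in use, the byte 254 you expect will not be what arrives.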

Sorry I can't be more help. If you give me more details about the communication link and the conversion process, I might be able to shed more light on what's going on.

Question 5

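import java.nio.charset.StandardCharsets;
import java.util.Locale;

Locale trLocale = Locale.forLanguageTag("tr-TR");
Locale enLocale = Locale.forLanguageTag("en-US"); // language tags use "-", not "_"
System.out.println("üğişçö".toUpperCase(trLocale)); // Turkish rules: i -> İ (U+0130)
System.out.println("üğişçö".toUpperCase(enLocale)); // English rules: i -> I
// Encoding to UTF-8 and decoding again returns an equal string.
value = new String(value.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);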