Question

I have a UTF-8 string that needs to be output to an older system as UTF-7, and I have two questions about this.

  1. How can I efficiently convert a string s that contains non-ASCII characters into the same string with those characters removed?

  2. Is there a simple way of converting extended characters like 'Ō' to their closest non-extended equivalent, 'O'?

Solution

If the older system can actually handle UTF-7 properly, why do you want to remove anything? Just encode the string as UTF-7:

// LoadFromWherever is a placeholder for however you read the UTF-8 text
string text = LoadFromWherever(Encoding.UTF8);
byte[] utf7 = Encoding.UTF7.GetBytes(text);

Then send the UTF-7-encoded text down to the older system.

If you've got the original UTF-8-encoded bytes, you can do this in one step:

byte[] utf7 = Encoding.Convert(Encoding.UTF8, Encoding.UTF7, utf8);
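
One way to sanity-check the conversion is to round-trip it: re-encode the UTF-8 bytes as UTF-7, then decode them back and compare. The following is only a sketch; the class and method names (Utf7Sketch, Utf8ToUtf7, DecodeUtf7) are hypothetical and not part of the original answer. It also assumes that you may be building against .NET 5 or later, where Encoding.UTF7 is marked obsolete (warning SYSLIB0001), hence the pragma.

```csharp
using System;
using System.Text;

// Hypothetical helper names, used for illustration only.
public static class Utf7Sketch
{
#pragma warning disable SYSLIB0001 // Encoding.UTF7 is obsolete on .NET 5+
    // Re-encodes UTF-8 bytes as UTF-7 bytes in one step.
    public static byte[] Utf8ToUtf7(byte[] utf8)
    {
        return Encoding.Convert(Encoding.UTF8, Encoding.UTF7, utf8);
    }

    // Decodes UTF-7 bytes back into a .NET string.
    public static string DecodeUtf7(byte[] utf7)
    {
        return Encoding.UTF7.GetString(utf7);
    }
#pragma warning restore SYSLIB0001
}
```

Every byte produced by Utf8ToUtf7 should be in the 7-bit range, and DecodeUtf7 should return the original text unchanged.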

If you actually need to convert to ASCII, you can do this reasonably easily.

To remove the non-ASCII characters:

var encoding = Encoding.GetEncoding
    ("us-ascii", new EncoderReplacementFallback(""), 
     new DecoderReplacementFallback(""));
byte[] ascii = encoding.GetBytes(text);
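
To see the effect on an actual string, the same encoder can be wrapped in a small helper that returns a string instead of bytes. This is a minimal sketch; AsciiStripSketch and StripNonAscii are hypothetical names introduced here, not something from the original answer.

```csharp
using System;
using System.Text;

// Hypothetical helper names, used for illustration only.
public static class AsciiStripSketch
{
    // Returns the input with every non-ASCII character removed.
    public static string StripNonAscii(string text)
    {
        // An empty replacement string means unencodable characters
        // are dropped rather than replaced with '?'.
        Encoding encoding = Encoding.GetEncoding(
            "us-ascii",
            new EncoderReplacementFallback(""),
            new DecoderReplacementFallback(""));
        byte[] ascii = encoding.GetBytes(text);
        return Encoding.ASCII.GetString(ascii);
    }
}
```

With this helper, StripNonAscii("Tōkyō") returns "Tky": both 'ō' characters are simply dropped.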

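To convert non-ASCII characters to their nearest ASCII equivalent, first normalize to compatibility decomposition form (FormKD), which splits a character such as 'Ō' into 'O' plus a combining macron; the ASCII encoder then drops the combining mark and keeps the base letter: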

string normalized = text.Normalize(NormalizationForm.FormKD);
var encoding = Encoding.GetEncoding
    ("us-ascii", new EncoderReplacementFallback(""), 
     new DecoderReplacementFallback(""));
byte[] ascii = encoding.GetBytes(normalized);
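
The normalize-then-strip approach can be wrapped in a helper as well. Again, this is only a sketch with hypothetical names (ClosestAsciiSketch, ToClosestAscii). One limitation worth knowing: characters that have no Unicode decomposition, such as 'ø', are not mapped to a base letter and are removed entirely.

```csharp
using System;
using System.Text;

// Hypothetical helper names, used for illustration only.
public static class ClosestAsciiSketch
{
    // Decomposes characters (FormKD), then drops whatever is still non-ASCII,
    // which includes the combining marks produced by the decomposition.
    public static string ToClosestAscii(string text)
    {
        string normalized = text.Normalize(NormalizationForm.FormKD);
        Encoding encoding = Encoding.GetEncoding(
            "us-ascii",
            new EncoderReplacementFallback(""),
            new DecoderReplacementFallback(""));
        byte[] ascii = encoding.GetBytes(normalized);
        return Encoding.ASCII.GetString(ascii);
    }
}
```

With this helper, ToClosestAscii("Tōkyō") returns "Tokyo", while ToClosestAscii("Søren") returns "Sren" because 'ø' does not decompose.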
Licensed under: CC-BY-SA with attribution