How do I truncate a Java String to fit in a given number of bytes, once UTF-8 encoded?

https://stackoverflow.com/questions/119328

02-07-2019

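Question

How do I truncate a Java String so that I know it will fit in a given number of bytes of storage once it is UTF-8 encoded?

Solution

Here is a simple loop that counts how big the UTF-8 representation is going to be, and truncates when the limit would be exceeded: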

public static String truncateWhenUTF8(String s, int maxBytes) {
    int b = 0;
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);

        // ranges from http://en.wikipedia.org/wiki/UTF-8
        int skip = 0;
        int more;
        if (c <= 0x007f) {
            more = 1;
        }
        else if (c <= 0x07FF) {
            more = 2;
        } else if (c <= 0xd7ff) {
            more = 3;
        } else if (c <= 0xDFFF) {
            // surrogate area, consume next char as well
            more = 4;
            skip = 1;
        } else {
            more = 3;
        }

        if (b + more > maxBytes) {
            return s.substring(0, i);
        }
        b += more;
        i += skip;
    }
    return s;
}

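This handles surrogate pairs that appear in the input string. Java's UTF-8 encoder (correctly) outputs a surrogate pair as a single 4-byte sequence instead of two 3-byte sequences, so truncateWhenUTF8() returns the longest truncated string it can. If the implementation ignored surrogate pairs, the truncated strings could be shorter than they need to be.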
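To see the 4-byte behaviour for yourself, here is a small check. The class and method names (SurrogatePairBytes, utf8Length) are made up for illustration; the only API it relies on is String.getBytes with StandardCharsets.UTF_8.

```java
import java.nio.charset.StandardCharsets;

final class SurrogatePairBytes {
    /** Number of bytes the string occupies once encoded as UTF-8. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // U+1D11E (musical G clef): one code point, stored as two Java chars
        String clef = "\uD834\uDD1E";
        System.out.println(clef.length());                         // 2 chars
        System.out.println(clef.codePointCount(0, clef.length())); // 1 code point
        System.out.println(utf8Length(clef));                      // 4 bytes, not 6
    }
}
```

This is why the loop above adds 4 for a pair and skips the second char, rather than adding 3 twice.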

I haven't tested this code heavily, but here are some preliminary tests:

import java.nio.charset.Charset;

private static void test(String s, int maxBytes, int expectedBytes) {
    String result = truncateWhenUTF8(s, maxBytes);
    byte[] utf8 = result.getBytes(Charset.forName("UTF-8"));
    if (utf8.length > maxBytes) {
        System.out.println("BAD: our truncation of " + s + " was too big");
    }
    if (utf8.length != expectedBytes) {
        System.out.println("BAD: expected " + expectedBytes + " got " + utf8.length);
    }
    System.out.println(s + " truncated to " + result);
}

public static void main(String[] args) {
    test("abcd", 0, 0);
    test("abcd", 1, 1);
    test("abcd", 2, 2);
    test("abcd", 3, 3);
    test("abcd", 4, 4);
    test("abcd", 5, 4);

    test("a\u0080b", 0, 0);
    test("a\u0080b", 1, 1);
    test("a\u0080b", 2, 1);
    test("a\u0080b", 3, 3);
    test("a\u0080b", 4, 4);
    test("a\u0080b", 5, 4);

    test("a\u0800b", 0, 0);
    test("a\u0800b", 1, 1);
    test("a\u0800b", 2, 1);
    test("a\u0800b", 3, 1);
    test("a\u0800b", 4, 4);
    test("a\u0800b", 5, 5);
    test("a\u0800b", 6, 5);

    // surrogate pairs
    test("\uD834\uDD1E", 0, 0);
    test("\uD834\uDD1E", 1, 0);
    test("\uD834\uDD1E", 2, 0);
    test("\uD834\uDD1E", 3, 0);
    test("\uD834\uDD1E", 4, 4);
    test("\uD834\uDD1E", 5, 4);

}

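Updated: the code example has been modified and now handles surrogate pairs.

Other tips

You should use CharsetEncoder; the simple approach of getBytes() plus copying as many bytes as will fit can cut a UTF-8 character in half.

Something like this: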

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

public static int truncateUtf8(String input, byte[] output) {

    ByteBuffer outBuf = ByteBuffer.wrap(output);
    CharBuffer inBuf = CharBuffer.wrap(input.toCharArray());

    Charset utf8 = Charset.forName("UTF-8");
    utf8.newEncoder().encode(inBuf, outBuf, true);
    System.out.println("encoded " + inBuf.position() + " chars of " + input.length() + ", result: " + outBuf.position() + " bytes");
    return outBuf.position();
}
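If you want the truncated String back rather than a filled byte array, the same idea can be sketched as below. This is only a sketch: truncateWithEncoder is a hypothetical name, and it assumes the input is well-formed UTF-16 (with the default encoder settings, encoding stops at an unpaired surrogate, so the result would be cut there).

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

final class EncoderTruncate {
    /** Longest prefix of s whose UTF-8 encoding fits in maxBytes (maxBytes >= 0). */
    static String truncateWithEncoder(String s, int maxBytes) {
        ByteBuffer out = ByteBuffer.allocate(maxBytes);
        CharBuffer in = CharBuffer.wrap(s);
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        // encode() stops with an overflow result once the next character no
        // longer fits; in.position() then counts only fully encoded chars.
        encoder.encode(in, out, true);
        return s.substring(0, in.position());
    }

    public static void main(String[] args) {
        System.out.println(truncateWithEncoder("a\u0800b", 3)); // only "a" fits
    }
}
```

The encoder never writes part of a character, so the char position it leaves behind is always a safe place to cut.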

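Here is what I came up with. It uses standard Java APIs, so it should be safe and compatible with all the Unicode oddities, surrogate pairs and so on. The solution is taken from http://www.jroller.com/holy/entry/truncating_utf_string_to_the , with a check added for null and a check that returns the string unchanged when it already fits within maxBytes.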

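import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

/**
 * Truncates a string to the number of characters that fit in maxBytes bytes of UTF-8, without cutting a
 * multi-byte character in half at the cut-off point. Also handles surrogate pairs, where two chars in the
 * string represent a single character.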
 *
 * Based on: http://www.jroller.com/holy/entry/truncating_utf_string_to_the
 */
public static String truncateToFitUtf8ByteLength(String s, int maxBytes) {
    if (s == null) {
        return null;
    }
    Charset charset = Charset.forName("UTF-8");
    CharsetDecoder decoder = charset.newDecoder();
    byte[] sba = s.getBytes(charset);
    if (sba.length <= maxBytes) {
        return s;
    }
    // Ensure truncation by having byte buffer = maxBytes
    ByteBuffer bb = ByteBuffer.wrap(sba, 0, maxBytes);
    CharBuffer cb = CharBuffer.allocate(maxBytes);
    // Ignore an incomplete character
    decoder.onMalformedInput(CodingErrorAction.IGNORE);
    decoder.decode(bb, cb, true);
    decoder.flush(cb);
    return new String(cb.array(), 0, cb.position());
}

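UTF-8 encoding has a neat trait that lets you see where in a byte sequence you are.

Check the stream at the byte limit you want.

If its high bit is 0, it's a single-byte char; just replace it with 0 and you're fine.
If its high bit is 1 and so is the next bit, then you're at the start of a multi-byte char, so just set that byte to 0 and you're good.
If the high bit is 1 but the next bit is 0, then you're in the middle of a character; travel back along the buffer until you hit a byte that has two or more 1s in the high bits, and replace that byte with 0.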

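Example: if your stream is 31 33 31 C1 A3 32 33 00, you can make your string 1, 2, 3, 5, 6, or 7 bytes long, but not 4, because that would put the 0 after C1, which is the start of a multi-byte char.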
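In Java you would normally not write a 0 terminator, so here is a rough sketch of the same back-off rule that returns a cut index instead. The names (Utf8BackOff, safeCut, truncate) are invented for this example. The sample bytes in main use C2 A3 (the pound sign) for the two-byte character, since C1 never occurs as a lead byte in well-formed UTF-8; the bit patterns are the same.

```java
import java.nio.charset.StandardCharsets;

final class Utf8BackOff {
    /** Largest cut point <= limit that does not split a multi-byte sequence. */
    static int safeCut(byte[] utf8, int limit) {
        if (limit >= utf8.length) {
            return utf8.length; // everything fits
        }
        int cut = Math.max(limit, 0);
        // Continuation bytes look like 10xxxxxx. Step back until the byte at
        // the cut point starts a character (0xxxxxxx or 11xxxxxx).
        while (cut > 0 && (utf8[cut] & 0xC0) == 0x80) {
            cut--;
        }
        return cut;
    }

    static String truncate(String s, int maxBytes) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, 0, safeCut(utf8, maxBytes), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] stream = {0x31, 0x33, 0x31, (byte) 0xC2, (byte) 0xA3, 0x32, 0x33};
        System.out.println(safeCut(stream, 4)); // 3: backs off instead of splitting C2 A3
    }
}
```

Looking at the byte at the limit (rather than the one before it) matches the description above: that byte is the first one that would be dropped.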

You can use: new String(data.getBytes("UTF-8"), 0, maxLen, "UTF-8");
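One thing worth checking with this one-liner is what happens when maxLen lands in the middle of a character. The sketch below wraps it in a hypothetical naiveTruncate helper; clamping the length with Math.min is my addition, because the String constructor throws if maxLen is larger than the byte array.

```java
import java.nio.charset.StandardCharsets;

final class NaiveCut {
    /** The one-liner, with the length clamped so short inputs do not throw. */
    static String naiveTruncate(String data, int maxLen) {
        byte[] utf8 = data.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, 0, Math.min(maxLen, utf8.length), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "a" + pound sign is 61 C2 A3 in UTF-8; cutting at 2 bytes splits C2 A3
        String cut = naiveTruncate("a\u00A3", 2);
        // The dangling C2 is decoded as U+FFFD (the replacement character)
        System.out.println(cut.equals("a\uFFFD"));                       // true
        // Re-encoding the result takes 1 + 3 bytes, more than the limit of 2
        System.out.println(cut.getBytes(StandardCharsets.UTF_8).length); // 4
    }
}
```

So the result can end in a replacement character and no longer fits the byte budget when encoded again, which is the problem the CharsetEncoder answer warns about.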

You can count the number of bytes without doing any conversion.

foreach character in the Java string
  if 0 <= character <= 0x7f
     count += 1
  else if 0x80 <= character <= 0x7ff
     count += 2
  else if 0x800 <= character <= 0xd7ff // excluding the surrogate area
     count += 3
  else if 0xdc00 <= character <= 0xffff
     count += 3
  else { // surrogate, a bit more complicated
     count += 4
     skip one extra character in the input stream
  }

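You have to detect surrogate pairs (U+D800–U+DBFF followed by U+DC00–U+DFFF) and count 4 bytes for each valid pair. If you get a value from the first range followed by a value from the second range, all is well: skip both and add 4. If not, it is an invalid surrogate pair. I'm not sure how Java deals with that, but your algorithm has to count correctly in that (unlikely) case.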
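A possible Java version of the counting pseudocode, including the pair check, might look like this. The names are hypothetical. Counting an unpaired surrogate as 1 byte is an assumption on my part, chosen to mirror String.getBytes, which substitutes a single '?' for input it cannot encode; adjust it if your encoder is configured differently.

```java
final class Utf8ByteCount {
    /** Counts the UTF-8 bytes s would need, without encoding it. */
    static int utf8Length(CharSequence s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c <= 0x7F) {
                count += 1;
            } else if (c <= 0x7FF) {
                count += 2;
            } else if (Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                count += 4; // valid pair: one 4-byte sequence
                i++;        // skip the low surrogate as well
            } else if (Character.isSurrogate(c)) {
                count += 1; // unpaired surrogate: assumed 1-byte replacement
            } else {
                count += 3;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("a\u0800b")); // 5 = 1 + 3 + 1
    }
}
```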

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow