Java: detect control characters that are not valid in JSON

https://stackoverflow.com/questions/6051509

15-11-2019

Question

I am reinventing the wheel and writing my own JSON parsing methods in Java.

I am going by the (very nice!) documentation on json.org. The only part I am unsure about is where it says "or control character".

Since the documentation is so clear, and JSON is so simple and easy to implement, I thought I should go ahead and follow the specification strictly instead of being lax.

How should I correctly strip control characters in Java? Perhaps there is a Unicode range?

Edit: A (commonly?) missing piece of the puzzle

I have been informed that there are other control characters outside the defined range ¹ ² that can be troublesome in <script> tags.

Most notably the characters U+2028 and U+2029, line separator and paragraph separator, which act as newlines. Injecting a newline into the middle of a string literal will most likely cause a syntax error (unterminated string literal). ³

Although I believe this does not pose an XSS threat, it is still a good idea to add extra rules for use in <script> tags.

Keep it simple and encode every character that is not "printable ASCII" with \u notation. Those characters are uncommon to begin with. If you like, you could add to the whitelist, but I do recommend a whitelist approach.

In case you are unaware, do not forget about </script (not case sensitive), which could cause HTML script injection into your page with the characters </script><script src=http://tinyurl.com/abcdef>. None of these characters are encoded by default in JSON.
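A minimal sketch of that whitelist idea, for illustration only. The class name ScriptSafeJson and the method quote are hypothetical and not part of any JSON library. It assumes the output is embedded in a <script> block, and it additionally escapes '<' so that the sequence </script can never appear in the output; that extra rule is one possible choice, not something the JSON grammar requires.

```java
// Hypothetical helper, shown as a sketch; not taken from any JSON library.
final class ScriptSafeJson {
    private ScriptSafeJson() {}

    /**
     * Returns a double-quoted JSON string literal that contains only
     * printable ASCII. Everything else is written as a 4-digit hex escape.
     */
    static String quote(String s) {
        StringBuilder out = new StringBuilder(s.length() + 2).append('"');
        for (int i = 0; i < s.length(); i++) {
            // Work on UTF-16 code units: a surrogate pair simply becomes
            // two consecutive escapes, which JSON allows.
            char c = s.charAt(i);
            if (c == '"' || c == '\\') {
                // The two printable characters JSON always requires escaping.
                out.append('\\').append(c);
            } else if (c >= 0x20 && c <= 0x7E && c != '<') {
                // Whitelist: printable ASCII, minus '<' (assumption: this
                // blocks "</script" inside an HTML script element).
                out.append(c);
            } else {
                // Control characters, DEL, U+2028, U+2029, '<', and all
                // other non-ASCII code units.
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.append('"').toString();
    }
}
```

For example, ScriptSafeJson.quote("</script>") produces a literal that starts with the escape for U+003C instead of a raw '<', so an HTML parser scanning the page does not see a closing script tag.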

Solution

Will Character.isISOControl(...) do? Incidentally, UTF-16 is an encoding of Unicode codepoints... Are you going to be operating at the byte level, or at the character/codepoint level? I recommend leaving the mapping from UTF-16 to character streams to Java's core APIs...
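To illustrate working at the code point level rather than the byte level, here is a small sketch. The class ControlScan and the method firstIsoControl are hypothetical names; the sketch assumes the input has already been decoded into a Java String by the core APIs.

```java
// Hypothetical helper, shown as a sketch under the assumptions stated above.
final class ControlScan {
    private ControlScan() {}

    /**
     * Returns the UTF-16 index of the first code point for which
     * Character.isISOControl is true, or -1 if there is none.
     */
    static int firstIsoControl(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);        // reads a full surrogate pair if present
            if (Character.isISOControl(cp)) {
                return i;
            }
            i += Character.charCount(cp);     // advance by 1 or 2 code units
        }
        return -1;
    }
}
```

Stepping with Character.charCount keeps the index aligned on whole code points, so supplementary characters are never split in half.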

Other suggestions

Even if it's not very specific, I would assume that they refer to the "control" character category from the Unicode specification.

In Java, you can check if a character c is a Unicode control character with the following expression: Character.getType(c) == Character.CONTROL.
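As a rough sketch, that expression can be wrapped in a helper. The class UnicodeControlCheck and its method names are hypothetical; Character.getType and Character.CONTROL are the standard JDK members.

```java
// Hypothetical wrapper around the expression above, for illustration only.
final class UnicodeControlCheck {
    private UnicodeControlCheck() {}

    /** True if the code point is in Unicode general category Cc (control). */
    static boolean isUnicodeControl(int codePoint) {
        return Character.getType(codePoint) == Character.CONTROL;
    }

    /** Counts all code points in category Cc by scanning the full Unicode range. */
    static int countControls() {
        int count = 0;
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            if (isUnicodeControl(cp)) {
                count++;
            }
        }
        return count;
    }
}
```

Note that this check does not flag U+2028 or U+2029: those belong to the line separator and paragraph separator categories, not to the control category.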

I believe the Unicode definition of a control character is:

The 65 characters in the ranges U+0000..U+001F and U+007F..U+009F.

That's their definition of a control code, but the above is followed by the sentence "Also known as control characters.", so...

I know the question was asked a couple of years ago, but I am replying anyway, because the accepted answer is not correct.

Character.isISOControl(int codePoint)

does the following check:

(codePoint >= 0x00 && codePoint <= 0x1F) || (codePoint >= 0x7F && codePoint <= 0x9F);

The JSON specification at https://tools.ietf.org/html/rfc7159 states:

Strings

The representation of strings is similar to conventions used in the C family of programming languages. A string begins and ends with quotation marks. All Unicode characters may be placed within the quotation marks, except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).

Character.isISOControl(int codePoint)

will flag all characters that need to be escaped (U+0000-U+001F), though it will also flag characters that do not need to be escaped (U+007F-U+009F). It is not required to escape the characters (U+007F-U+009F).
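A minimal sketch of an escaper that follows the quoted rule instead of Character.isISOControl. The class JsonStringEscaper and the method escape are hypothetical names. It escapes the quotation mark, the reverse solidus, and U+0000 through U+001F, and passes everything else through unchanged, including U+007F through U+009F.

```java
// Hypothetical escaper, shown as a sketch; not taken from any JSON library.
final class JsonStringEscaper {
    private JsonStringEscaper() {}

    /** Returns a double-quoted JSON string literal with only the required escapes. */
    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length() + 2).append('"');
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                // Short escapes defined by JSON for common control characters.
                case '\b': out.append("\\b"); break;
                case '\f': out.append("\\f"); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '\t': out.append("\\t"); break;
                default:
                    if (c <= 0x1F) {
                        // Remaining control characters use the 4-digit hex form.
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        // Everything else, including 0x7F to 0x9F, is legal as-is.
                        out.append(c);
                    }
            }
        }
        return out.append('"').toString();
    }
}
```

With this sketch, a DEL character (U+007F) is copied to the output unchanged even though Character.isISOControl reports true for it, which is exactly the difference described above.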

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow