Question

I have a situation where I'm not sure if the input I get is HTML encoded or not. How do I handle this? I also have jQuery available.

function someFunction(userInput){
    $someJqueryElement.text(userInput);
}

// userInput "<script>" returns "&lt;script&gt;", which is fine
// userInput "&lt;script&gt;" returns "&amp;lt;script&amp;gt;", which is bad

I could avoid escaping ampersands (&), but what are the risks in that? Any help is very much appreciated!

Important note: This user input is not under my control. It comes back from an external service, and someone could tamper with it to bypass the HTML escaping that the service itself applies.


Solution

You really should avoid getting into this situation in the first place, because input whose encoding is unknown leads to behavior that is very hard to predict.

Try adding an additional parameter to the function that states whether the input is already encoded.

function someFunction(userInput, isEncoded){
    //Add some conditional logic based on isEncoded
    $someJqueryElement.text(userInput);
}
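As a rough sketch of what that conditional logic might look like: the helper names decodeBasicEntities and toPlainText below are assumptions for illustration, and the decoder only covers the five basic entities, not the full HTML entity table.

```javascript
// Hypothetical helper: decodes only the five basic HTML entities.
function decodeBasicEntities(str) {
    return str
        .replace(/&lt;/g, '<')
        .replace(/&gt;/g, '>')
        .replace(/&quot;/g, '"')
        .replace(/&#39;/g, "'")
        .replace(/&amp;/g, '&'); // last, so "&amp;lt;" becomes "&lt;", not "<"
}

// Returns the plain text that should be handed to .text().
// Already-encoded input is decoded once, so .text() does not encode it twice.
function toPlainText(userInput, isEncoded) {
    return isEncoded ? decodeBasicEntities(userInput) : userInput;
}

// In the page this would be used as:
//   $someJqueryElement.text(toPlainText(userInput, isEncoded));
```

Because the result still goes through .text(), a decoded "<script>" is displayed as text and never parsed as markup.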

If you look at products like FCKeditor, you can choose between editing the source and using the rich text editor. Making the mode explicit removes the need for automatic encoding detection.

If you still insist on automatically detecting HTML-encoded characters, I would recommend using indexOf to check whether certain key sequences are present.

str.indexOf('&lt;') !== -1

The example above detects the encoded form of the < character.
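The same check can be extended to several sequences. This is only a sketch: the ENTITY_MARKERS list and the looksEncoded name are assumptions, and the list should match whatever the external service actually emits.

```javascript
// Hypothetical list of entity sequences to look for.
var ENTITY_MARKERS = ['&lt;', '&gt;', '&amp;', '&quot;', '&#39;'];

// Returns true if the string contains at least one of the sequences.
function looksEncoded(str) {
    return ENTITY_MARKERS.some(function (marker) {
        return str.indexOf(marker) !== -1;
    });
}
```

A bare ampersand, as in "AT&T", does not match any marker, so such input is treated as not encoded.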

~~~New text added after edit below this line.~~~

Finally, I would suggest looking at this answer. It suggests running the string through a decode function and comparing lengths.

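var string = "Your encoded &amp; decoded string here";

// Decode the basic HTML entities. decodeURIComponent is not used here:
// it handles URL percent-encoding, not HTML entities, and throws on a stray "%".
// &amp; is replaced last so that "&amp;lt;" becomes "&lt;" rather than "<".
function decode(str){
    return str.replace(/&lt;/g,'<').replace(/&gt;/g,'>')
              .replace(/&quot;/g,'"').replace(/&#39;/g,"'")
              .replace(/&amp;/g,'&');
}

if(string.length === decode(string).length){
    // Decoding changed nothing: the string contains none of these entities.
}else{
    // Decoding shortened the string: it contains at least one entity.
}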

Again, this can still be fooled by a user who types those entity sequences literally, but that ambiguity is inherent to HTML encoding. So it is reasonable to assume the input is HTML-encoded as soon as one of these character sequences appears.

Other suggestions

You must always correctly encode untrusted input before concatenating it into a structured language like HTML.

Otherwise, you'll enable injection attacks like XSS.
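A minimal sketch of that encoding step for the case where the HTML string is built by hand rather than through jQuery's .text(). The escapeHtml name is an assumption for illustration and is not a jQuery function.

```javascript
// Hypothetical helper: encodes the characters that are significant in HTML
// text and attribute values. "&" is handled first so that the entities
// produced by the later replacements are not encoded again.
function escapeHtml(str) {
    return String(str)
        .replace(/&/g, '&amp;')
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;')
        .replace(/"/g, '&quot;')
        .replace(/'/g, '&#39;');
}

var untrusted = '<script>alert(1)</script>';
var html = '<p>' + escapeHtml(untrusted) + '</p>';
// html is now '<p>&lt;script&gt;alert(1)&lt;/script&gt;</p>'
```

Note that input which is already encoded gets encoded again ("&lt;" becomes "&amp;lt;"). That is the safe default: the text is shown exactly as received and is never interpreted as markup.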

If the input is supposed to contain HTML formatting, you should use a sanitizer library to strip all potentially unsafe tags & attributes.

You can also use the regex /<|>|&(?![a-z]+;)/ to check whether a string contains any non-encoded characters; however, you cannot distinguish a string that has been encoded from an unencoded string that talks about encoding.
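As an illustration of how that regex behaves, here is a small sketch; the hasUnencodedChars name is an assumption.

```javascript
// Matches a literal "<" or ">", or an "&" that does not start
// a lowercase named entity such as "&lt;" or "&amp;".
var UNENCODED = /<|>|&(?![a-z]+;)/;

function hasUnencodedChars(str) {
    return UNENCODED.test(str);
}

hasUnencodedChars('<script>');          // true: literal angle brackets
hasUnencodedChars('&lt;script&gt;');    // false: everything is encoded
hasUnencodedChars('AT&T');              // true: bare ampersand
hasUnencodedChars('Write &lt; for <');  // true: mixed content
hasUnencodedChars('Write &lt; here');   // false, even if the author typed it literally
```

The last call shows the ambiguity mentioned above. Also note that, as written, the pattern reports numeric references such as "&#39;" as unencoded, because "#39" is not matched by [a-z]+.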

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow