I need to validate HTML user input in a web app using JavaScript.

What I have done so far, based on this question: I use a third-party library, sanitize-html, to sanitize the input and then compare the result with the original. If they differ, the HTML is considered invalid.

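import sanitizeHtml from 'sanitize-html';

// sanitizationConfig (the sanitize-html options object) is defined elsewhere.
const isValidHtml = (html: string): boolean => {
    let sanitized = sanitizeHtml(html, sanitizationConfig);
    // Strip whitespace and <br> variants, which browsers serialize differently.
    sanitized = sanitized.replace(/\s/g, '').replace(/<br>|<br\/>/g, '');
    html = html.replace(/\s/g, '').replace(/<br>|<br\/>/g, '');
    return sanitized === html;
};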

The above method works fine with unescaped HTML but not with escaped HTML.

isValidHtml('<'); // false
isValidHtml('&lt;'); // true
isValidHtml('<script>'); // false
isValidHtml('&lt;script&gt;'); // true, but this should also be false!

  1. Am I missing something with this method?
  2. Is there a better way to do this task?

EDIT: As suggested by @Brad in the comments, I tried decoding the HTML first:

function decodeHtml(html: string): string {
    // A detached <textarea> decodes character references and treats tags as plain text.
    const txt = document.createElement('textarea');
    txt.innerHTML = html;
    const decodedHtml = txt.value;
    txt.textContent = null;
    return decodedHtml;
}

and then passed the decoded string to isValidHtml. I got these results:

isValidHtml('<'); // false
isValidHtml('&lt;'); // false, this should be true!!!
isValidHtml('<script>'); // false
isValidHtml('&lt;script&gt;'); // false
  • Why not just let the browser parse it, and then re-serialize the DOM to HTML? Whatever you do, RegEx isn't the answer. Commented Nov 21, 2018 at 3:40
  • @Brad If I do so, &lt; will be decoded as < and the sanitizeHtml method will return an empty string, which means isValidHtml('&lt;') returns false. Commented Nov 21, 2018 at 3:50
  • No it won't... did you try it? Commented Nov 21, 2018 at 4:28
  • @Brad I updated my question. Is that what you are suggesting? Commented Nov 21, 2018 at 13:43
  • Your code is not Javascript. Commented Nov 21, 2018 at 14:39

1 Answer

If you're not actually trying to validate the HTML, and are simply trying to ensure it ends up being valid, I would recommend running it through the DOM parser and getting the HTML back out, effectively letting the browser do the work for you.

Untested, but something like this:

const parser = new DOMParser();
const doc = parser.parseFromString(html, 'text/html');
console.log(doc.documentElement.innerHTML);

Basically, you use the browser's built-in parsing to handle any errors, in the standard way that it does anyway. It will create a tree of nodes. From that tree of nodes, you generate HTML that is guaranteed to be valid.
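
To make the round trip a bit more concrete, here is a minimal sketch, assuming a browser environment. The names roundTripHtml, survivesRoundTrip, and HtmlParserCtor are hypothetical and only for illustration. It reads doc.body.innerHTML rather than doc.documentElement.innerHTML, on the assumption that you want the input fragment back without the <head> and <body> wrappers the parser adds. Outside a browser (for example, in plain Node.js) there is no DOMParser, so the helpers return null.

```typescript
// Structural type for the small part of the DOM API used here, so the sketch
// does not depend on the TypeScript "dom" lib being enabled.
type HtmlParserCtor = new () => {
  parseFromString(source: string, mimeType: string): { body: { innerHTML: string } };
};

// Hypothetical helper: parse `html` with the browser's parser and serialize it back.
// Returns null when no DOMParser is available (for example, in plain Node.js).
function roundTripHtml(html: string): string | null {
  const Parser = (globalThis as { DOMParser?: HtmlParserCtor }).DOMParser;
  if (Parser === undefined) {
    return null;
  }
  const doc = new Parser().parseFromString(html, 'text/html');
  // body.innerHTML omits the <head> and <body> wrappers that the parser adds.
  return doc.body.innerHTML;
}

// Hypothetical check: true if the input comes back unchanged from the round trip,
// false if the parser altered it, null if no parser is available.
function survivesRoundTrip(html: string): boolean | null {
  const serialized = roundTripHtml(html);
  return serialized === null ? null : serialized === html;
}
```

Keep in mind that the serializer normalizes markup (tag case, attribute quoting, and so on), so a strict string comparison may reject input that is perfectly well formed. Also, this only repairs the markup; it does not remove scripts or other unsafe content, so you would still want a sanitizer for that.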

See also: https://developer.mozilla.org/en-US/docs/Web/API/DOMParser#Parsing_an_SVG_or_HTML_document
