
I am using the jsoup library in Java to sanitize input and prevent XSS attacks. It works well for simple inputs such as a script tag wrapping alert('vulnerable').

Example:

String data = "<script>alert('vulnerable')</script>";
data = Jsoup.clean(data, Whitelist.none());
data = StringEscapeUtils.unescapeHtml4(data); //StringEscapeUtils from apache-commons lib
System.out.println(data);

Output: ""

However, if I tweak the input to the following, JSoup cannot sanitize the input.

String data = "<<b>script>alert('vulnerable');<</b>/script>";
data = Jsoup.clean(data, Whitelist.none());
data = StringEscapeUtils.unescapeHtml4(data);
System.out.println(data);

Output: <script>alert('vulnerable');</script>

This output is obviously still prone to XSS attacks. Is there a way to fully sanitize the input so that all HTML tags are removed?

  • I mean, you realize that your second example differs from the first only by leaving out the <SCRIPT> tag in the String ... This doesn't actually have a <SCRIPT> element in it: <<b>script>alert('vulnerable');<</b>/script> ... This is malformed HTML. What are you expecting? Furthermore, I have never quite understood the purpose of JSoup's XSS attack cleaners. If you wish to simply eliminate <SCRIPT> tags, then just remove them... Commented Oct 4, 2020 at 12:53

2 Answers

Not sure if this is the best solution, but a temporary workaround is to parse the raw text into a Document and then clean the combined text of that document and all its child elements:

String unsafe = "<<b>script>alert('vulnerable');<</b>/script>";
Document doc = Jsoup.parse(unsafe);
String safe = Jsoup.clean(doc.text(), Whitelist.none());
System.out.println(safe);

Wait for someone else to come up with the best solution.


1 Comment

Thank you @1218985. Tried your solution. So far this solution looks good as it fulfills what I need.

The problem is that you are unescaping the safe HTML that jsoup has produced. The output of the Cleaner is HTML. The none safelist passes no tags, only the text nodes, and those are emitted as escaped HTML.

So the input:

<<b>script>alert('vulnerable');<</b>/script>

passed through the Cleaner, returns:

&lt;script&gt;alert('vulnerable');&lt;/script&gt;

which is perfectly safe for presenting as HTML. See https://try.jsoup.org/~hfn2nvIglfl099_dVxLQEPxekqg

Just don't include the unescape line.
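
To see why the unescape step undoes the protection, here is a minimal sketch that does not depend on jsoup. The class EscapeRoundTrip and the helpers escapeText and unescapeText are hypothetical names for illustration only. They are not jsoup's or commons-text's implementation and handle just the three characters relevant here.

```java
public class EscapeRoundTrip {
    // Hypothetical helper: escapes the characters that matter in HTML text content.
    // '&' is replaced first so the entities produced below are not escaped again.
    public static String escapeText(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;");
    }

    // Hypothetical helper: reverses escapeText, standing in for an unescape call.
    // '&amp;' is handled last so "&amp;lt;" becomes "&lt;" and not "<".
    public static String unescapeText(String html) {
        return html.replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        // The text that remains of the tweaked input once the <b> and </b> tags are dropped.
        String text = "<script>alert('vulnerable');</script>";

        // Escaped form: a browser renders this as visible text, not as a script element.
        String escaped = escapeText(text);
        System.out.println(escaped);

        // Unescaping turns the harmless text back into live markup.
        System.out.println(unescapeText(escaped));
    }
}
```

The first line printed is the escaped form shown above, and the second is the raw script markup from the question, which is why the escaped string should be written to the page as-is.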
