61

I am saving user-submitted HTML (in a database). I must prevent JavaScript injection attacks. The most pernicious I have seen is JavaScript in a style="expression(...)".

In addition to this, a fair amount of valid user content will include special characters and XML constructs, so I'd like to avoid a white-list approach if possible. (Listing every allowable HTML element and attribute).

Examples of JavaScript attack strings:

1.

"Hello, I have a
<script>alert("bad!")</script>
problem with the <dog>
element..."
"Hi, this <b
style="width:expression(alert('bad!'))">dog</b>
is black."

Is there a way to prevent such JavaScript, and leave the rest intact?

The only solution I have so far is to use a regular expression to remove certain patterns. It solves case 1, but not case 2.

The environment is essentially the Microsoft stack:

  • SQL Server 2005
  • C# 3.5 (ASP.NET)
  • JavaScript and jQuery.

I would like the chokepoint to be the ASP.NET layer - anyone can craft a bad HTTP request.

Edit

Thanks for the links, everyone. Assuming that I can define my list (the content will include many mathematical and programming constructs, so a whitelist is going to be very annoying), I still have a question:

What kind of parser will allow me to just remove the "bad" parts? The bad part could be an entire element, but then what about those scripts that reside in the attributes? I can't remove < a hrefs > willy-nilly.

0

7 Answers 7

45

You think that's it? Check this out.

Whatever approach you take, you definitely need to use a whitelist. It's the only way to even come close to being safe about what you're allowing on your site.

EDIT:

I'm not familiar with .NET, unfortunately, but you can check out stackoverflow's own battle with XSS (https://blog.stackoverflow.com/2008/06/safe-html-and-xss/) and the code that was written to parse HTML posted on this site: Archive.org link - obviously you might need to change this because your whitelist is bigger, but that should get you started.

6
  • Thanks, I'm actually using that site as a test bed. I've successfully removed anything that looks anything like < s c r i p t >, so I need to get the ones that don't... that is, expression:, javascript:, vbscript: etc. Could you suggest a parser that can do that? Commented Jun 2, 2009 at 21:42
  • 2
    If your approach is to remove dangerous things, your code will be vulnerable to injection. The only safe approach is to have a whitelist of specifically allowed elements and attributes.
    – Miles
    Commented Jun 2, 2009 at 21:45
  • Thanks for the feedback. I was afraid that a whitelist was the answer. :) Commented Jun 2, 2009 at 21:53
  • 7
    Jeff, try this string: <scr<script>ipt>. Yay XSS! Blacklists will not work. Whether your whitelist is long or not, there is no way to blacklist this. The OWASP website can help a lot if you want to read more articles about this.
    – Chuck Vose
    Commented Apr 18, 2013 at 1:24
  • 1
    Care to explain here what a whitelist is and how it operates/removes the bad stuff?
    – rogerdpack
    Commented Oct 4, 2016 at 19:57
8

Whitelist for elements and attributes is the only acceptable choice in my opinion. Anything not on your whitelist should be stripped out or encoded (change <>&" to entities). Also be sure to check the values within the attributes you allow.

Anything less and you are opening yourself up to problems - known exploits or those that will be discovered in the future.

5

Currently the best option is to use a Content Security Policy header like this:

Content-Security-Policy: default-src 'self';

This will prevent loading of both inline and external scripts, styles, images, etc., so only resources from the same origin will be loaded and executed by the browser.

However, it will not work on old browsers.

3
  • Will this prevent us from using CDN?
    – Daniel Wu
    Commented Jul 23, 2020 at 7:06
  • 1
    @DanielWu yes, but you can whitelist the CDN domain or file hash
    – Adam
    Commented Jul 23, 2020 at 11:50
  • 1
    This answer would have been better if it showed how to then whitelist a site's known-good external resources, because in its current state it isn't usable for 99% of websites - all modern websites rely on external resources. Commented Jun 6, 2022 at 20:56
4

The only really safe way to go is to use a white-list. Encode everything, then convert the allowed codes back.

I have seen rather advanced attempts to only disallow dangerous code, and it still doesn't work well. It's quite some feat to try to safely catch everything that anyone can think of, and it is prone to do annoying replacements of some things that aren't dangerous at all.

1
  • I found out the hard way. We are now using escaping and whitelisting. Commented Jun 23, 2009 at 8:52
3

Basically, as Paolo said, you should try to focus on what the users are allowed to do, rather than trying to filter out the stuff they're not supposed to do.

Keep a list of allowed HTML tags (things like b, i, u...) and filter out everything else. You probably also want to remove all attributes to the allowed HTML tags (because of your second example, for instance).

Another solution would be to introduce so-called BB code, which is what a lot of forums use. It has similar syntax to HTML, but starts with the idea of a whitelist of allowed code, which is then transformed into HTML. For example, [b]example[/b] would result in example. Make sure when using BB code to still filter out HTML tags beforehand.

1
  • The content I fear, will include many many mathematical and programming constructs (XML, C#, etc.) so I would have loved to avoid a whitelist. Commented Jun 2, 2009 at 21:55
0

what server side code are you using? Depending on which there are a number or ways you can filter out malicious script but it's dangerous territory. Even seasoned proffesionals get caught out: http://www.codinghorror.com/blog/archives/001167.html

-4

You can use this restrict function.

function restrict(elem){
  var tf = _(elem);
  var rx = new RegExp;
  if(elem == "email"){
       rx = /[ '"]/gi;
  }else if(elem == "search" || elem == "comment"){
    rx = /[^a-z 0-9.,?]/gi;
  }else{
      rx =  /[^a-z0-9]/gi;
  }
  tf.value = tf.value.replace(rx , "" );
}
1
  • 6
    dumping code without any explanation will only cause confusion, especially if it doesn't even work: what does this mean --> _(elem) ?
    – vsync
    Commented May 11, 2017 at 8:33

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.