Regular Expressions Suck at Preventing XSS
Depending on who you listen to, XSS is now the top computer security vulnerability, having passed the venerable SQL injection in 2007. If you're a developer, especially a web developer, and you DON'T know what XSS is, stop reading right now and start Googling.
Cross-site scripting (XSS) is a type of computer security vulnerability typically found in web applications which allow code injection by malicious web users into the web pages viewed by other users. - Wikipedia
Typically, the injection takes the form of JavaScript code. How does this code get injected into your site? There are a myriad of ways; HTML is ubiquitous these days. On the application I work on, the easiest vector is email.
We have a web-based email system. Users get an email, usually in HTML, and we display it inside our web application. It's a classic input validation problem; we're essentially presenting user-generated content directly to the user, unfiltered. Well, not quite. Even from the beginning, we did some basic regex validation. The base case for XSS is via a SCRIPT tag, so we try to strip those. I am a big fan of regular expressions; they are great. But in this case, it's like beating off a mugger with a wet noodle.
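To make the wet noodle concrete, here is a minimal sketch of a SCRIPT-stripping regex. It is an illustration only, not the filter our application actually ran; the class name `NaiveScriptStripper` and the pattern are made up for this example.

```java
import java.util.regex.Pattern;

// Hypothetical filter for illustration: strips <script>...</script> blocks.
final class NaiveScriptStripper {
    private static final Pattern SCRIPT_BLOCK = Pattern.compile(
            "<script[^>]*>.*?</script>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static String strip(String html) {
        // Single pass: every match is removed, but the result is never re-scanned.
        return SCRIPT_BLOCK.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // The textbook case is handled: prints <p>hi</p>
        System.out.println(strip("<p>hi</p><script>alert('XSS')</script>"));

        // Removing the inner, empty script blocks splices a new one together:
        // prints <script>alert('XSS')</script>
        System.out.println(strip(
                "<scr<script></script>ipt>alert('XSS')</scr<script></script>ipt>"));

        // No SCRIPT tag at all, so the input comes back untouched.
        System.out.println(strip("<TABLE BACKGROUND=\"javascript:alert('XSS')\">"));
    }
}
```

You could loop until the output stops changing to close the second hole, but the third input never contained a SCRIPT tag in the first place, and no amount of looping helps with that.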
Many other systems need to do the same thing. See Jeff Atwood's solution for Stack Overflow, where they allow HTML-formatted code snippets to be submitted by users. He's not alone; developers all seem to initially gravitate to regular expressions for this task.
I contend that you really, really don't want to do this with regular expressions. Regular expressions are notoriously bad at parsing HTML, XML, or any nested tag language. You don't want to be a casual parser, especially when you're trying to strictly enforce security. They also suck at parsing email addresses, a topic I plan to cover later.
The key is that you're not just protecting against valid, vanilla HTML. You're protecting against anything that a browser can understand, and anything it can misunderstand. Browsers can be tricked into producing a valid DOM from invalid HTML quite easily. Browsers love rendering crap invalid HTML; they even take pride in it.
For example, see this list of obfuscated XSS attacks. Are you prepared to tailor a regex to prevent this real-world attack on Yahoo and Hotmail on IE6/7/8?
<HTML><BODY> <?xml:namespace prefix="t" ns="urn:schemas-microsoft-com:time"> <?import namespace="t" implementation="#default#time2"> <t:set attributeName="innerHTML" to="XSS<SCRIPT DEFER>alert("XSS")</SCRIPT>"> </BODY></HTML>
How about this attack that works on IE6?
<TABLE BACKGROUND="javascript:alert('XSS')">
How about attacks that are not listed on this site? The problem with Jeff's approach is that it's not a whitelist, as claimed. It's only stripping well-behaved tags. We want to strip malicious tags! As someone on this page aptly notes:
The problem with it, is that the html must be clean. There are cases where you can pass in hacked html, and it won't match it, in which case it'll return the hacked html string as it won't match anything to replace. This isn't strictly whitelisting.
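Here is a rough sketch of what that comment is getting at. To be clear, this is not Jeff's code; `WellFormedTagFilter`, both patterns, and the allowed tag list are hypothetical, chosen only to show what happens when a tag matcher assumes clean input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical "whitelist" filter: it can only judge tags it manages to match.
final class WellFormedTagFilter {
    // Assumption: a "tag" is '<', anything except angle brackets, then '>'.
    private static final Pattern ANY_CLEAN_TAG = Pattern.compile("<[^<>]*>");
    // Assumption: only these bare tags, with no attributes, are allowed.
    private static final Pattern ALLOWED_TAG =
            Pattern.compile("</?(b|i|p|code)>", Pattern.CASE_INSENSITIVE);

    static String filter(String html) {
        // Keep matched tags that are on the allowed list, drop the rest.
        return ANY_CLEAN_TAG.matcher(html).replaceAll(tag ->
                ALLOWED_TAG.matcher(tag.group()).matches()
                        ? Matcher.quoteReplacement(tag.group())
                        : "");
    }

    public static void main(String[] args) {
        // Clean input: the SCRIPT tags are dropped. Prints <p>hi</p>alert(1)
        System.out.println(filter("<p>hi</p><script>alert(1)</script>"));

        // Hacked input: the IMG tag is never closed, so ANY_CLEAN_TAG never
        // matches it and the whole string comes back unchanged.
        System.out.println(filter("<img src=x onerror=alert(1) <p>text</p>"));
    }
}
```

The second input is not valid HTML, but a lenient browser will likely still build an IMG element out of it, event handler and all. The filter never rejected the tag; it simply never saw it.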
Why use a regex to parse HTML at all? Use a damn parser! I would suggest a purpose-built tool like AntiSamy. It works by actually parsing the HTML, and then traversing the DOM and removing anything that's not in the configurable whitelist. The major difference is the ability to gracefully handle malformed HTML. I hear you complaining about performance already. To that, I would simply ask whether you feel that HTML rendering time significantly impacts the user's perception of performance in their regular browsing. Yeah, I didn't think so. You can spare a few extra milliseconds to do this correctly.
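The parse-then-traverse idea can be sketched with nothing but the JDK. This is not how AntiSamy is implemented, and it cheats in one important way: it uses the JDK's strict XML parser to stay self-contained, so it only accepts well-formed markup, whereas a real sanitizer needs a lenient HTML parser. `DomWhitelist`, `POLICY`, and the allowed tags and attributes are all hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class DomWhitelist {
    private static final Set<String> NO_ATTRS = Set.of();
    // Hypothetical policy: allowed tag name -> allowed attribute names.
    private static final Map<String, Set<String>> POLICY = Map.of(
            "div", NO_ATTRS, "p", NO_ATTRS, "b", NO_ATTRS,
            "a", Set.of("title"));

    // Assumption: input is well-formed XML. Real HTML needs a lenient parser.
    static Document parse(String markup) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Refuse DOCTYPEs so untrusted input cannot pull in external entities.
            factory.setFeature(
                    "http://apache.org/xml/features/disallow-doctype-decl", true);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not parse markup", e);
        }
    }

    // Sanitizes everything below 'parent'; 'parent' itself is a trusted wrapper.
    static void sanitize(Element parent) {
        NodeList children = parent.getChildNodes();
        // Walk backwards because the NodeList is live and shrinks on removal.
        for (int i = children.getLength() - 1; i >= 0; i--) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                continue; // plain text is kept as-is
            }
            Set<String> allowedAttrs = child.getNodeType() == Node.ELEMENT_NODE
                    ? POLICY.get(child.getNodeName().toLowerCase(Locale.ROOT))
                    : null;
            if (allowedAttrs == null) {
                // Not on the list (or not an element at all): drop the subtree.
                parent.removeChild(child);
                continue;
            }
            Element element = (Element) child;
            NamedNodeMap attrs = element.getAttributes();
            for (int j = attrs.getLength() - 1; j >= 0; j--) {
                String name = attrs.item(j).getNodeName();
                if (!allowedAttrs.contains(name.toLowerCase(Locale.ROOT))) {
                    element.removeAttribute(name);
                }
            }
            sanitize(element);
        }
    }
}
```

Calling `DomWhitelist.sanitize(doc.getDocumentElement())` on a parsed `<div>` wrapper leaves only listed tags and attributes underneath it. The design point is that the default answer is "remove": anything the policy doesn't name, including comments and processing instructions, is gone without anyone having to anticipate it.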
The best part is that AntiSamy actually unit tests for all the XSS attacks on the above site. And it's damn easy to use:
import org.owasp.validator.html.*;

// POLICY_FILE (the AntiSamy policy XML) is assumed to be defined elsewhere.
public String toSafeHtml(String html) throws ScanException, PolicyException {
    Policy policy = Policy.getInstance(POLICY_FILE);
    AntiSamy antiSamy = new AntiSamy();
    CleanResults cleanResults = antiSamy.scan(html, policy);
    return cleanResults.getCleanHTML().trim();
}