Sanitize HTML with Beautiful Soup
If you have a website that displays user-generated HTML (emails, rich-text input, and so on), you likely want to scrub that HTML before you display it. At the very least, you want reasonable protection against XSS. But maybe you also want to prevent the HTML from breaking your page layout. Either way, Beautiful Soup is a good tool for the job.
Here is some Python code that uses Beautiful Soup to clean HTML of any tags and attributes not explicitly whitelisted. The CSS property scrubbing is blacklist-based, which sucks and should be redone with a real CSS parsing library. If you want a really diesel solution to this issue, I suggest you look at AntiSamy for Java or .NET.
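For comparison, a whitelist-based scrub with a real CSS parser might look something like the sketch below. This is not part of the code that follows; it assumes the third-party cssutils library, and the property whitelist is purely illustrative:

import cssutils

# illustrative whitelist, not exhaustive -- tune it to your needs
ALLOWED_CSS_PROPERTIES = set([
    "color", "background-color", "font-weight", "font-style",
    "text-align", "text-decoration",
])

def safe_css_whitelisted(css):
    # parse the declaration and drop every property not on the whitelist
    style = cssutils.parseStyle(css)
    for prop in style.getProperties():
        if prop.name.lower() not in ALLOWED_CSS_PROPERTIES:
            style.removeProperty(prop.name)
    return style.cssText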
As a bonus, there is also a method for converting HTML to plaintext while making a best effort to preserve the structural whitespace.
from BeautifulSoup import BeautifulSoup, Comment
from HTMLParser import HTMLParseError
import htmlentitydefs
import re

def safe_html(html):
    if not html:
        return None

    # remove these tags, complete with contents.
    blacklist = ["script", "style"]

    whitelist = [
        "div", "span", "p", "br", "pre", "table",
        "tbody", "thead", "tr", "td", "a", "blockquote",
        "ul", "li", "ol", "b", "em", "i", "strong", "u", "font",
        ]

    try:
        # BeautifulSoup is catching out-of-order and unclosed tags, so markup
        # can't leak out of comments and break the rest of the page.
        soup = BeautifulSoup(html)
    except HTMLParseError:
        # special handling? For now, let the caller decide.
        raise

    # now strip HTML we don't like.
    for tag in soup.findAll():
        if tag.name.lower() in blacklist:
            # blacklisted tags are removed in their entirety
            tag.extract()
        elif tag.name.lower() in whitelist:
            # tag is allowed. Make sure all the attributes are allowed.
            tag.attrs = [(a[0], safe_css(a[0], a[1])) for a in tag.attrs
                         if _attr_name_whitelisted(a[0])]
        else:
            # not a whitelisted tag. I'd like to remove it from the tree
            # and replace it with its children. But that's hard. It's much
            # easier to just replace it with an empty span tag.
            tag.name = "span"
            tag.attrs = []

    # scripts can be executed from comments in some cases
    comments = soup.findAll(text=lambda text: isinstance(text, Comment))
    for comment in comments:
        comment.extract()

    result = unicode(soup)
    if not result:
        return None
    return result

def _attr_name_whitelisted(attr_name):
    return attr_name.lower() in ["href", "style", "color", "size",
                                 "bgcolor", "border"]

def safe_css(attr, css):
    # blacklist-based: strip width/height so content can't break the layout
    if attr == "style":
        return re.sub("(width|height):[^;]+;", "", css)
    return css

def plaintext(html):
    """Converts HTML to plaintext, preserving whitespace."""
    # from http://effbot.org/zone/re-sub.htm#unescape-html
    def _unescape(text):
        def fixup(m):
            text = m.group(0)
            if text[:2] == "&#":
                # character reference
                try:
                    if text[:3] == "&#x":
                        return unichr(int(text[3:-1], 16))
                    else:
                        return unichr(int(text[2:-1]))
                except ValueError:
                    pass
            else:
                # named entity
                try:
                    text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
                except KeyError:
                    pass
            return text  # leave as is
        return re.sub(r"&#?\w+;", fixup, text)

    html = safe_html(html)  # basic sanitation first
    text = "".join(BeautifulSoup("<body>%s</body>" % html).body(text=True))
    # strip Beautiful Soup's encoding meta-data
    text = text.replace("xml version='1.0' encoding='%SOUP-ENCODING%'", "")
    return _unescape(text)
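To make the intended behavior concrete, here is a quick, hypothetical session (it assumes the code above is saved in a module named sanitize; the module name is made up):

from sanitize import safe_html, plaintext

print safe_html('<p onclick="alert(1)">Hi<script>evil()</script></p>')
# -> <p>Hi</p>        (onclick is dropped; script goes away with its contents)

print safe_html('<marquee>breaking news</marquee>')
# -> <span>breaking news</span>   (unknown tag becomes an attribute-less span)

print plaintext('<p>Hello &amp; goodbye</p>')
# -> Hello & goodbye  (entities are unescaped after sanitization)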
Note: This method is neither fast nor particularly bulletproof. You definitely want to cache the results so you're not performing this transformation while the user is waiting. Some documents will throw an HTMLParseError if Beautiful Soup cannot parse them. You can choose whether to do the secure thing and not show the document at all, or to risk it and show the original HTML.
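As a minimal sketch of both points, here is a hypothetical wrapper that caches results and takes the secure option on parse failure. The in-process dict is a stand-in; in production you would use memcached or your framework's cache:

from HTMLParser import HTMLParseError
import hashlib

_sanitized_cache = {}  # stand-in for a real cache backend

def safe_html_cached(html):
    if not html:
        return None
    key = hashlib.sha1(html.encode("utf-8")).hexdigest()
    if key not in _sanitized_cache:
        try:
            _sanitized_cache[key] = safe_html(html)
        except HTMLParseError:
            # the secure choice: show nothing rather than the raw HTML
            _sanitized_cache[key] = None
    return _sanitized_cache[key]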