Sanitize HTML with Beautiful Soup

If you have a website that displays user-generated HTML (emails, rich text entry, etc.), you likely want to scrub that HTML before you display it. At the very least, you want reasonable protection against XSS. You may also want to keep the HTML from breaking your page layout. Either way, Beautiful Soup is a good tool for the job.

Here is some Python code that uses Beautiful Soup to strip any tags and attributes that are not explicitly whitelisted. The CSS property scrubbing is blacklist-based (it only removes width and height declarations), which sucks and should be redone with a real CSS parsing library. If you want a really heavy-duty solution to this problem, I suggest you look at AntiSamy for Java or .NET.

As a bonus, there is also a function for converting HTML to plaintext while making a best effort to preserve the structural whitespace.

import re
import htmlentitydefs

from BeautifulSoup import BeautifulSoup, Comment
from HTMLParser import HTMLParseError

def safe_html(html):

    if not html:
        return None

    # remove these tags, complete with contents.
    blacklist = ["script", "style" ]

    whitelist = [
        "div", "span", "p", "br", "pre",
        "table", "tbody", "thead", "tr", "td", "a",
        "blockquote",
        "ul", "li", "ol",
        "b", "em", "i", "strong", "u", "font"
        ]

    try:
        # BeautifulSoup is catching out-of-order and unclosed tags, so markup
        # can't leak out of comments and break the rest of the page.
        soup = BeautifulSoup(html)
    except HTMLParseError:
        # special handling? A bare raise keeps the original traceback.
        raise

    # now strip HTML we don't like.
    for tag in soup.findAll():
        if tag.name.lower() in blacklist:
            # blacklisted tags are removed in their entirety
            tag.extract()
        elif tag.name.lower() in whitelist:
            # tag is allowed. Make sure all the attributes are allowed.
            tag.attrs = [
                (name, safe_css(name, value))
                for name, value in tag.attrs
                if _attr_name_whitelisted(name)
            ]
        else:
            # not a whitelisted tag. I'd like to remove it from the tree
            # and replace it with its children. But that's hard. It's much
            # easier to just replace it with an empty span tag.
            tag.name = "span"
            tag.attrs = []

    # scripts can be executed from comments in some cases
    comments = soup.findAll(text=lambda text:isinstance(text, Comment))
    for comment in comments:
        comment.extract()

    # use a local name that does not shadow the safe_html function
    cleaned = unicode(soup)

    if cleaned == ", -":
        return None

    return cleaned

def _attr_name_whitelisted(attr_name):
    return attr_name.lower() in ["href", "style", "color", "size", "bgcolor", "border"]

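def safe_css(attr, css):
    # the attribute whitelist check is case-insensitive, so this one must be too
    if attr.lower() == "style":
        # strip width/height declarations regardless of case or spacing,
        # including a final declaration with no trailing semicolon
        return re.sub(r"(?i)(width|height)\s*:[^;]*;?", "", css)
    return css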

def plaintext(input):
    """Converts HTML to plaintext, preserving whitespace."""

    # from http://effbot.org/zone/re-sub.htm#unescape-html
    def _unescape(text):
        def fixup(m):
            text = m.group(0)
            if text[:2] == "&#":
                # character reference
                try:
                    if text[:3] == "&#x":
                        return unichr(int(text[3:-1], 16))
                    else:
                        return unichr(int(text[2:-1]))
                except ValueError:
                    pass
            else:
                # named entity
                try:
                    text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
                except KeyError:
                    pass
            return text # leave as is
        return re.sub(r"&#?\w+;", fixup, text)

    input = safe_html(input) or "" # basic sanitation first; safe_html returns None for empty input
    text = "".join(BeautifulSoup("<body>%s</body>" % input).body(text=True))
    text = text.replace("xml version='1.0' encoding='%SOUP-ENCODING%'", "") # strip BS meta-data
    return _unescape(text)
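
The style scrubbing in safe_css only removes width and height, so every other property passes through. A whitelist version might look like the sketch below. It is written for Python 3 with only the standard library (unlike the Python 2 listing above), the filter_style name and the list of allowed properties are assumptions for illustration, and splitting on semicolons is a naive stand-in for a real CSS parser.

```python
# Hypothetical allowlist; adjust to whatever your layout can tolerate.
ALLOWED_PROPERTIES = frozenset([
    "color", "background-color", "font-weight", "font-style",
    "text-align", "text-decoration",
])

# Substrings that can load remote content or run script in some browsers.
SUSPICIOUS_VALUE_PARTS = ("url(", "expression(", "\\", "/*")


def filter_style(css, allowed=ALLOWED_PROPERTIES):
    """Keep only the declarations whose property name is in `allowed`."""
    kept = []
    # Naive split: a semicolon inside a quoted string would confuse this.
    for declaration in css.split(";"):
        name, colon, value = declaration.partition(":")
        name = name.strip().lower()
        value = value.strip()
        if not colon or not value or name not in allowed:
            continue
        lowered = value.lower()
        if any(part in lowered for part in SUSPICIOUS_VALUE_PARTS):
            continue
        kept.append("%s: %s" % (name, value))
    return "; ".join(kept)


print(filter_style("color: red; width: 9999px; position: fixed"))  # color: red
```

Anything not on the list is dropped, so a newly discovered layout-breaking property does not need a new regex.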

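One gap worth knowing about: the listing whitelists the href attribute but passes its value through unchanged, so a javascript: link would survive sanitizing. A minimal sketch of a scheme check follows, again in Python 3 with the standard library only. safe_href and ALLOWED_SCHEMES are hypothetical names, and the sketch assumes the attribute value has already been entity-decoded by the parser.

```python
import re

# Hypothetical allowlist of URL schemes.
ALLOWED_SCHEMES = frozenset(["http", "https", "mailto"])

# A scheme is a letter followed by letters, digits, "+", "." or "-", then ":".
_SCHEME_RE = re.compile(r"^([A-Za-z][A-Za-z0-9+.\-]*):")


def safe_href(url):
    """Return the URL if it is relative or uses an allowed scheme, else None."""
    # Browsers ignore tabs, newlines and leading whitespace in a URL, so a
    # scheme split by a tab still runs. Check a copy with all whitespace and
    # control characters removed.
    compact = "".join(ch for ch in url if ord(ch) > 0x20)
    match = _SCHEME_RE.match(compact)
    if match is None:
        return url.strip()  # no scheme, so treat it as a relative link
    if match.group(1).lower() in ALLOWED_SCHEMES:
        return url.strip()
    return None
```

A caller would drop the href attribute whenever this returns None.
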
Note: This method is neither fast nor particularly bulletproof. You definitely want to cache the results so you're not performing this transformation while the user is waiting. Some documents will raise an HTMLParseError if Beautiful Soup cannot parse them. You can choose whether to do the secure thing and not show the document at all, or to risk it and show the original, unsanitized HTML.
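
The caching advice could be sketched as follows. This is a Python 3 illustration with an in-memory dict standing in for whatever store you actually use (a database column, memcached, etc.). SanitizedCache and its parameters are hypothetical, and sanitize would be a function such as safe_html. The sketch takes the secure option on a parse failure and caches None rather than the original HTML.

```python
import hashlib


class SanitizedCache(object):
    """Cache sanitized HTML keyed by a hash of the raw input."""

    def __init__(self, sanitize, parse_errors=(Exception,)):
        self._sanitize = sanitize          # e.g. safe_html
        self._parse_errors = parse_errors  # e.g. (HTMLParseError,)
        self._store = {}                   # stand-in for a real cache backend

    def get(self, html):
        key = hashlib.sha256(html.encode("utf-8")).hexdigest()
        if key not in self._store:
            try:
                self._store[key] = self._sanitize(html)
            except self._parse_errors:
                # Unparseable input: remember the failure and show nothing.
                self._store[key] = None
        return self._store[key]
```

Ideally you would fill the cache when the content is saved, so the first viewer never pays for the parse either.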

Chase Seibert