Parsing invalid xml with Beautiful Soup

Sometimes, bad xml happens to good people. In my case, I was getting a text stream back from a web-service call that proported to be xml, but was actually not well-formed. It had ampersands inside a tag.

 <?xml version="1.0" encoding='UTF-8'?>
 <resume>
   <title>Developer & Manager</title>
    ...
 </resume>

This lead to the following error parsing with python's minidom:

Traceback (most recent call last):
  ...
    response = minidom.parseString(xml)
  File "/usr/lib/python2.6/xml/dom/minidom.py", line 1928, in parseString
    return expatbuilder.parseString(string)
  File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 940, in parseString
    return builder.parseString(string)
  File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 223, in parseString
    parser.Parse(string, True)
ExpatError: not well-formed (invalid token): line 369, column 2025

Virtually all XML parses will rightly balk at this input, because it's not valid. You could easily work-around this issue by replacing all ampersands with the html-entity &amp;, but the real issue is that the web-service is obviously not using an XML parser to create the document. It's likely creating the document by hand, which means that further cases of invalid XML are quite likely.

A more robust solution is to use a lenient parser like Beautiful Soup, which is actually an HTML parser. Even though it doesn't know anything about XML, it's enough for basic parsing. Beautiful Soup will allow ampersands (which are valid in HTML anyway), unclosed tags, bad encodings or virtually anything else. It's designed to make a best effort no matter what. It's also easy to use.

from BeautifulSoup import BeautifulSoup

response = BeautifulSoup(xml)
print response.find("title").string