Creating large XML files in Python with saxutils

In the past, when I needed to produce an XML file problematically, I typically turned to templating engines. Something like the following example in Django templates:

<?xml version="1.0" encoding="utf-8"?>
<jobs>
    {% for job in jobs %}
        {% with job.user|get_profile as profile %}
      <job>
          <title>{{ job.title|xml_escape }}</title>
          <job-board-name>{{ 'JOB_FEED_SITE_NAME'|setting }}</job-board-name>
          <job-board-url>{{ 'SITE_URL_NO_SLASH'|setting }}</job-board-url>
          <job-code>job{{ job.id }}</job-code>
          <detail-url>{{ 'SITE_URL_NO_SLASH'|setting}}{% url job job.id job.slug %}{{ 'simplyhired.com'|campaign_tracking }}</detail-url>

...

You get the idea. Once you start including free text data from real users, you pretty quickly run into issues with invalid XML. Typically, you need some combination of escaping (like saxutils.escape) and stripping (for control character that are illegal in XML). For whatever reason, it's pretty common for the XML parsers themselves not to give you the tools you need to produce valid XML from imperfect data. You can counter this with a XML cleaning library such as 4Suite, or write your own.

This solution is inelegant, but it works. However, if you need to produce very large XML files, you will run into memory constraints. Unless your templating engine has the ability to stream the file contents to disk, which Django's doesn't, you'll be storing all that XML in memory before it can be written. This can lead to hours of XML composing followed by an out of memory exception, with nothing writing to disk.

Using an XML parser to actually compose XML is much more robust, even if the code is somewhat less straight forward. You could use a DOM parser, but that wouldn't solve your memory issue; the whole XML document must still be kept in memory before it can be written. Instead, you can use a SAX parser, which is explicitly designed for streaming.

Python ships with saxutils, which is very capable, if a bit verbose to use as an API. Here is a quick abstraction for simple nested tags without attributes, which is a common pattern.

from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesNSImpl
from website.helpers import xml_helper
import types
from django.utils.encoding import smart_str, force_unicode
from xml.sax import saxutils

def strip_illegal_xml_characters(input):

    if input:

        import re

        # unicode invalid characters
        RE_XML_ILLEGAL = u'([\u0000-\u0008\u000b-\u000c\u000e-\u001f\ufffe-\uffff])' + \
                         u'|' + \
                         u'([%s-%s][^%s-%s])|([^%s-%s][%s-%s])|([%s-%s]$)|(^[%s-%s])' % \
                          (unichr(0xd800),unichr(0xdbff),unichr(0xdc00),unichr(0xdfff),
                           unichr(0xd800),unichr(0xdbff),unichr(0xdc00),unichr(0xdfff),
                           unichr(0xd800),unichr(0xdbff),unichr(0xdc00),unichr(0xdfff),
                           )
        input = re.sub(RE_XML_ILLEGAL, "", input)

        # ascii control characters
        input = re.sub(r"[\x01-\x1F\x7F]", "", input)

    return input

def escape_xml(value):
    return strip_illegal_xml_characters(force_unicode(smart_str(value)))

class SimpleSaxWriter():

    def __init__(self, output, encoding, top_level_tag=u"jobs"):
        logger = XMLGenerator(output, encoding)
        logger.startDocument()
        attrs = AttributesNSImpl({}, {})
        logger.startElementNS((None, top_level_tag), top_level_tag, attrs)
        self._logger = logger
        self.top_level_tag = top_level_tag

    def start_tag(self, name):
        attrs = AttributesNSImpl({}, {})
        self._logger.startElementNS((None, name), name, attrs)

    def end_tag(self, name):
        self._logger.endElementNS((None, name), name)

    def simple_tag(self, name, contents):

        if contents:

            if not isinstance(contents, types.UnicodeType):
                raise TypeError("XML character data must be passed in as a unicode object")

            self.start_tag(name)
            # saxutils will let invalid XML though, left to it's own devices
            self._logger.characters(escape_xml(contents))
            self.end_tag(name)

    def write_entry(self, level, msg):
        raise NotImplementedError()

    def close(self):
        self._logger.endElementNS((None, self.top_level_tag), self.top_level_tag)
        self._logger.endDocument()
        return

Given that framework, an implementing class might look like:

class JobWriter(SimpleSaxWriter):

    def write_entry(self, level, job):

        from website.helpers.tags import label
        from website import helpers

        self.start_tag("job")

        self.simple_tag("title", job.title)
        self.simple_tag("job-board-name", unicode(settings.JOB_FEED_SITE_NAME))

 ...

        self.start_tag("description")
        self.simple_tag("summary", job.description_plaintext)
        self.end_tag("description")

 ...

To actually write to disk, you need to pass in a file. Notice that I'm using mode "wb", because saxutils will produce a byte stream, and that I'm NOT passing an encoding, even though I'm dealing with UTF-8 data. That's because Python will guess wrong in that case, and try to output ASCII.

    file = codecs.open("test.xml", mode="wb")
    feed = JobWriter(file, "UTF-8")
    for job in jobs: # just a list of objects from Django's orm, could be anything
        feed.write_entry(2, job)
    feed.close()
    file.close()


I'm currently working at NerdWallet, a startup in San Francisco trying to bring clarity to all of life's financial decisions. We're hiring like crazy. Hit me up on Twitter, I would love to talk.

Follow @chase_seibert on Twitter