Creating large XML files in Python with saxutils
In the past, when I needed to produce an XML file programmatically, I typically turned to templating engines. Something like the following example in Django templates:
    <?xml version="1.0" encoding="utf-8"?>
    <jobs>
    {% for job in jobs %}
    {% with job.user|get_profile as profile %}
    <job>
        <title>{{ job.title|xml_escape }}</title>
        <job-board-name>{{ 'JOB_FEED_SITE_NAME'|setting }}</job-board-name>
        <job-board-url>{{ 'SITE_URL_NO_SLASH'|setting }}</job-board-url>
        <job-code>job{{ job.id }}</job-code>
        <detail-url>{{ 'SITE_URL_NO_SLASH'|setting}}{% url job job.id job.slug %}{{ 'simplyhired.com'|campaign_tracking }}</detail-url>
        ...
You get the idea. Once you start including free-text data from real users, you pretty quickly run into issues with invalid XML. Typically, you need some combination of escaping (like saxutils.escape) and stripping (for control characters that are illegal in XML). For whatever reason, it's pretty common for the XML parsers themselves not to give you the tools you need to produce valid XML from imperfect data. You can counter this with an XML cleaning library such as 4Suite, or write your own.
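As a rough illustration of that escape-plus-strip combination, here is a minimal Python 3 sketch. The name clean_for_xml and the exact character class are my own assumptions, not part of any library; it drops the control characters XML 1.0 forbids (keeping tab, newline and carriage return) and ignores unpaired surrogates for brevity.

```python
import re
from xml.sax.saxutils import escape

# Characters XML 1.0 forbids outright: C0 controls other than tab, LF and CR,
# plus the noncharacters U+FFFE and U+FFFF. Unpaired surrogates are not
# handled in this sketch.
_ILLEGAL_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def clean_for_xml(text):
    """Strip characters XML cannot represent, then escape &, < and >."""
    return escape(_ILLEGAL_XML_CHARS.sub("", text))


print(clean_for_xml("R&D <lead>\x0b wanted"))
```

Stripping before escaping keeps the two concerns separate: the regular expression only ever sees raw user text, and escape only ever sees text that is already legal.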
This solution is inelegant, but it works. However, if you need to produce very large XML files, you will run into memory constraints. Unless your templating engine can stream the file contents to disk, which Django's can't, you'll be holding all of that XML in memory before it can be written. This can lead to hours of XML composing followed by an out-of-memory exception, with nothing written to disk.
Using an XML parser to actually compose XML is much more robust, even if the code is somewhat less straightforward. You could use a DOM parser, but that wouldn't solve your memory issue; the whole XML document must still be kept in memory before it can be written. Instead, you can use a SAX parser, which is explicitly designed for streaming.
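To make the streaming idea concrete, here is a small Python 3 sketch using the standard library's XMLGenerator. The function name write_items and the element names are made up for illustration. Each element is written to the output as soon as it is produced, so the input can be a generator and only the current item needs to be in memory.

```python
import io
from xml.sax.saxutils import XMLGenerator


def write_items(output, items):
    """Stream one <item> element per value from any iterable."""
    gen = XMLGenerator(output, encoding="utf-8")
    gen.startDocument()
    gen.startElement("items", {})
    for value in items:
        gen.startElement("item", {})
        gen.characters(value)  # escapes &, < and > as it writes
        gen.endElement("item")
    gen.endElement("items")
    gen.endDocument()


# A StringIO stands in for a real file so the result is easy to inspect.
buffer = io.StringIO()
write_items(buffer, (f"row {n} & more" for n in range(3)))
print(buffer.getvalue())
```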
Python ships with xml.sax.saxutils, which is very capable, if a bit verbose as an API. Here is a quick abstraction for simple nested tags without attributes, which is a common pattern.
    import re
    import types

    from django.utils.encoding import smart_str, force_unicode
    from xml.sax.saxutils import XMLGenerator
    from xml.sax.xmlreader import AttributesNSImpl

    # unicode invalid characters (illegal control characters and
    # surrogates that do not form a valid pair)
    RE_XML_ILLEGAL = u'([\u0000-\u0008\u000b-\u000c\u000e-\u001f\ufffe-\uffff])' + \
        u'|' + \
        u'([%s-%s][^%s-%s])|([^%s-%s][%s-%s])|([%s-%s]$)|(^[%s-%s])' % \
        (unichr(0xd800), unichr(0xdbff), unichr(0xdc00), unichr(0xdfff),
         unichr(0xd800), unichr(0xdbff), unichr(0xdc00), unichr(0xdfff),
         unichr(0xd800), unichr(0xdbff), unichr(0xdc00), unichr(0xdfff))


    def strip_illegal_xml_characters(value):
        if value:
            value = re.sub(RE_XML_ILLEGAL, "", value)
            # ascii control characters; note that this range also removes
            # tab, newline and carriage return, which XML itself allows
            value = re.sub(r"[\x01-\x1F\x7F]", "", value)
        return value


    def escape_xml(value):
        # Only strips illegal characters; XMLGenerator.characters() does
        # the actual escaping of &, < and >.
        return strip_illegal_xml_characters(force_unicode(smart_str(value)))


    class SimpleSaxWriter(object):
        def __init__(self, output, encoding, top_level_tag=u"jobs"):
            logger = XMLGenerator(output, encoding)
            logger.startDocument()
            attrs = AttributesNSImpl({}, {})
            logger.startElementNS((None, top_level_tag), top_level_tag, attrs)
            self._logger = logger
            self.top_level_tag = top_level_tag

        def start_tag(self, name):
            attrs = AttributesNSImpl({}, {})
            self._logger.startElementNS((None, name), name, attrs)

        def end_tag(self, name):
            self._logger.endElementNS((None, name), name)

        def simple_tag(self, name, contents):
            # Empty values produce no element at all.
            if contents:
                if not isinstance(contents, types.UnicodeType):
                    raise TypeError("XML character data must be passed in as a unicode object")
                self.start_tag(name)
                # saxutils will let invalid XML through, left to its own devices
                self._logger.characters(escape_xml(contents))
                self.end_tag(name)

        def write_entry(self, level, msg):
            # Subclasses decide what a single entry looks like.
            raise NotImplementedError()

        def close(self):
            self._logger.endElementNS((None, self.top_level_tag), self.top_level_tag)
            self._logger.endDocument()
Given that framework, an implementing class might look like:
    from django.conf import settings


    class JobWriter(SimpleSaxWriter):
        def write_entry(self, level, job):
            from website.helpers.tags import label
            from website import helpers
            self.start_tag("job")
            self.simple_tag("title", job.title)
            self.simple_tag("job-board-name", unicode(settings.JOB_FEED_SITE_NAME))
            ...
            self.start_tag("description")
            self.simple_tag("summary", job.description_plaintext)
            self.end_tag("description")
            ...
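For readers on Python 3, the same base-class-plus-subclass pattern might be sketched as below. This is my own condensed port, not the original code: it drops the Django helpers and the illegal-character stripping, and DictJobWriter with its dict keys is a hypothetical stand-in for a subclass that reads real model objects.

```python
import io
from xml.sax.saxutils import XMLGenerator


class SimpleSaxWriter:
    """Python 3 sketch of the base class pattern, without the Django helpers."""

    def __init__(self, output, encoding, top_level_tag="jobs"):
        self._gen = XMLGenerator(output, encoding)
        self._gen.startDocument()
        self.top_level_tag = top_level_tag
        self.start_tag(top_level_tag)

    def start_tag(self, name):
        self._gen.startElement(name, {})

    def end_tag(self, name):
        self._gen.endElement(name)

    def simple_tag(self, name, contents):
        if contents:  # empty values produce no element
            self.start_tag(name)
            self._gen.characters(contents)  # escapes &, < and >
            self.end_tag(name)

    def close(self):
        self.end_tag(self.top_level_tag)
        self._gen.endDocument()


class DictJobWriter(SimpleSaxWriter):
    """Hypothetical subclass that reads plain dicts instead of ORM objects."""

    def write_entry(self, job):
        self.start_tag("job")
        self.simple_tag("title", job["title"])
        self.start_tag("description")
        self.simple_tag("summary", job["summary"])
        self.end_tag("description")
        self.end_tag("job")


# BytesIO stands in for a file opened in binary mode.
output = io.BytesIO()
feed = DictJobWriter(output, "utf-8")
feed.write_entry({"title": "Editor", "summary": "Fix <XML> & more"})
feed.close()
xml_bytes = output.getvalue()
```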
To actually write to disk, you need to pass in a file. Notice that I'm opening it in mode "wb", because saxutils will produce a byte stream, and that I'm NOT passing an encoding to codecs.open, even though I'm dealing with UTF-8 data; the encoding goes to the writer instead. That's because Python will guess wrong if you give codecs.open an encoding, and try to output ASCII.
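    import codecs

    # Binary mode, no encoding here: the writer does the UTF-8 encoding.
    output_file = codecs.open("test.xml", mode="wb")
    feed = JobWriter(output_file, "UTF-8")
    for job in jobs:  # just a list of objects from Django's orm, could be anything
        feed.write_entry(2, job)
    feed.close()
    output_file.close()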
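On Python 3 the same idea can be sketched with the built-in open, again in binary mode with the encoding given only to XMLGenerator. The temporary path and the two job titles are placeholder data chosen for this example.

```python
import os
import tempfile
from xml.sax.saxutils import XMLGenerator

# Write into a throwaway directory so the example leaves no stray files behind.
path = os.path.join(tempfile.mkdtemp(), "test.xml")

# Binary mode: XMLGenerator does the UTF-8 encoding itself.
with open(path, "wb") as output:
    gen = XMLGenerator(output, encoding="utf-8")
    gen.startDocument()
    gen.startElement("jobs", {})
    for title in ["Café manager", "R&D engineer"]:  # placeholder data
        gen.startElement("job", {})
        gen.startElement("title", {})
        gen.characters(title)
        gen.endElement("title")
        gen.endElement("job")
    gen.endElement("jobs")
    gen.endDocument()

# Read the raw bytes back to check what actually reached the disk.
with open(path, "rb") as written:
    data = written.read()
```

Reading the file back in binary mode shows the non-ASCII title stored as UTF-8 bytes and the ampersand escaped as an entity.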