Python: Convert a Word/PDF document to HTML
My current Django project deals with resume files in Word/PDF format. In order to show a web preview of the file, it's necessary to translate these files to plain HTML. While this was sometimes a pain in the past, I've recently found that it's relatively easy with standard Linux tools.
AbiWord is a general-purpose word processor for Linux. It has pretty good support for Word files, as well as many other formats such as PDF, RTF, etc. Usually it's invoked as a GUI app, just like Microsoft Word. However, being a Linux app, it also has good command-line support.
One of the things you can do from the command line is convert files from one format to another. Here is a quick example:
    # print the HTML translation of a DOC file to the console
    abiword -t output.html resume.doc; cat output.html
It's also relatively simple to invoke this from Python, using the standard library's subprocess module.
    import os
    import subprocess
    import uuid

    from django.conf import settings  # supplies the project-specific PROJECT_DIR


    def document_to_html(file_path):
        # convert the file, using a temporary file w/ a random name
        html_path = os.path.join("/tmp", "%s.html" % uuid.uuid1())
        # passing a list (no shell) keeps odd file names from being run as commands
        p = subprocess.Popen(
            ["abiword", "-t", html_path, file_path],
            stdout=subprocess.PIPE, stderr=subprocess.PIPE,
            cwd=os.path.join(settings.PROJECT_DIR, "website/templates"))
        # communicate() drains both pipes, so neither can fill up and block
        _, error = p.communicate()
        if error:
            raise Exception(error.decode("utf-8", "replace"))
        # read the result back, then delete the temporary file
        with open(html_path) as f:
            html = f.read()
        os.remove(html_path)
        return html
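If you'd rather not manage the temporary file name yourself, here is a rough sketch of the same idea using `tempfile.TemporaryDirectory`, which removes the scratch directory even when the conversion fails. The helper names `abiword_command` and `convert_to_html`, the `preview.html` file name, and the `run` parameter are my own inventions for illustration; `run` exists only so the function can be exercised without AbiWord installed. It assumes the same `abiword -t` invocation shown above and that the output can be read as UTF-8.

```python
import subprocess
import tempfile
from pathlib import Path


def abiword_command(source, target):
    """Build the argument list for the `abiword -t <target> <source>` call."""
    return ["abiword", "-t", str(target), str(source)]


def convert_to_html(source, run=subprocess.run):
    """Convert `source` to HTML in a throwaway directory and return the markup.

    `run` is a hypothetical hook (defaults to subprocess.run) that makes the
    function testable without AbiWord.
    """
    source = Path(source).resolve()
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "preview.html"
        # check=True raises CalledProcessError on a non-zero exit status
        run(abiword_command(source, target), check=True)
        # assumption: the generated file is UTF-8; undecodable bytes are replaced
        return target.read_text(encoding="utf-8", errors="replace")
```

In a Django view this would be called as `html = convert_to_html(resume_path)`, where `resume_path` is wherever your project stores the uploaded file.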
AbiWord produces fairly clean HTML. If you want to scrub it even more, I would suggest something like BeautifulSoup.
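To give a feel for what "scrubbing" might mean here, below is a minimal sketch that uses only the standard library's `html.parser` as a dependency-free stand-in; it is not BeautifulSoup and is far less forgiving of broken markup. It keeps a small whitelist of tags, drops every attribute, and discards `head`, `script`, and `style` along with their contents. The whitelist and the names `ALLOWED_TAGS`, `DROPPED_WITH_CONTENT`, `Scrubber`, and `scrub_html` are assumptions of mine, not anything AbiWord or BeautifulSoup defines.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist; adjust to whatever the preview should allow.
ALLOWED_TAGS = {"p", "br", "b", "i", "u", "ul", "ol", "li",
                "h1", "h2", "h3", "table", "tr", "td"}
# Elements removed together with everything inside them.
DROPPED_WITH_CONTENT = {"head", "script", "style"}


class Scrubber(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROPPED_WITH_CONTENT:
            self.skip_depth += 1
        elif tag in ALLOWED_TAGS and not self.skip_depth:
            # attributes (class, style, ...) are deliberately discarded
            self.parts.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag in DROPPED_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in ALLOWED_TAGS and tag != "br" and not self.skip_depth:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            # re-escape text, since the parser hands it over unescaped
            self.parts.append(escape(data, quote=False))


def scrub_html(html):
    """Return `html` reduced to whitelisted, attribute-free tags and text."""
    scrubber = Scrubber()
    scrubber.feed(html)
    scrubber.close()
    return "".join(scrubber.parts)
```

Tags outside the whitelist (such as `span` or `div`) are removed but their text is kept, which is usually what you want for a resume preview.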