Reading OpenDocument office files from Python - Stefaan Lippens inserts content here

The OpenDocument file format (also known as the "OASIS Open Document Format for Office Applications") is an open and free standard for office files. It's fairly easy to read OpenDocument files from Python. Basically, an OpenDocument file is just a zip archive with a different extension (".ods" for spreadsheets, ".odt" for text documents, ".odg" for graphics and so on). The files in the zip archive are mainly XML files, like content.xml, settings.xml and styles.xml.
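To see the "just a zip archive" claim in practice, here is a minimal sketch that lists the files stored inside an OpenDocument file. The helper name list_members is made up for this illustration; it only relies on the standard zipfile module.

```python
import zipfile

def list_members(path):
    """Return the names of the files stored inside an OpenDocument archive.

    list_members is a hypothetical helper name; path may be a file name
    or a file-like object, as accepted by zipfile.ZipFile.
    """
    with zipfile.ZipFile(path, "r") as archive:
        return archive.namelist()
```

Calling list_members("pelican.ods") on a real spreadsheet would typically show entries such as mimetype, content.xml, styles.xml, settings.xml, meta.xml and META-INF/manifest.xml, though the exact list depends on the application that wrote the file.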

We just need two modules from the nice standard Python library to extract data from an OpenDocument file: zipfile for handling the zip compression and xml.parsers.expat (or another XML parser module) for parsing the XML. A simple, minimal way to read a fictional spreadsheet file pelican.ods is as follows:

# import the needed modules
import zipfile
import xml.parsers.expat

# get content xml data from OpenDocument file
ziparchive = zipfile.ZipFile("pelican.ods", "r")
xmldata = ziparchive.read("content.xml")
ziparchive.close()

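# an XML element: a list of child nodes (elements and text),
# extended with a name and attributes
class Element(list):
    def __init__(self, name, attrs):
        self.name = name
        self.attrs = attrs

# collects the elements reported by the parser into a tree
class TreeBuilder:
    def __init__(self):
        # artificial root element that will hold the document element
        self.root = Element("root", None)
        # stack of the currently open elements, innermost last
        self.path = [self.root]
    def start_element(self, name, attrs):
        element = Element(name, attrs)
        self.path[-1].append(element)
        self.path.append(element)
    def end_element(self, name):
        assert name == self.path[-1].name
        self.path.pop()
    def char_data(self, data):
        # text is stored as a plain string in the innermost open element
        self.path[-1].append(data)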

# create parser and parsehandler
parser = xml.parsers.expat.ParserCreate()
treebuilder = TreeBuilder()
# assign the handler functions
parser.StartElementHandler  = treebuilder.start_element
parser.EndElementHandler    = treebuilder.end_element
parser.CharacterDataHandler = treebuilder.char_data

# parse the data
parser.Parse(xmldata, True)

After importing the modules zipfile and xml.parsers.expat, we open the OpenDocument file with the zipfile module and extract the XML data from content.xml (in just three simple, readable statements; that's what I like about Python).

Next, we have to parse the XML data and store its elements. First I define a class Element for storing the name, attributes and contents of XML elements (note that it's a subclass of list). The XML parse handler functions are the methods of the class TreeBuilder, which also collects the whole XML tree in its attribute root during parsing. The parsing itself is initiated by parser.Parse(xmldata, True).
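To illustrate why subclassing list is convenient here, the following standalone sketch builds a tiny tree by hand. The Element class is repeated so that the snippet runs on its own, and the element names "row" and "cell" are invented for the illustration (real OpenDocument names look different).

```python
class Element(list):
    """An XML element: a list of children plus a name and attributes."""
    def __init__(self, name, attrs):
        super().__init__()
        self.name = name
        self.attrs = attrs

# Build a tiny tree by hand; "row" and "cell" are hypothetical names.
row = Element("row", {})
cell = Element("cell", {"style": "bold"})
cell.append("42")      # character data is stored as a plain string
row.append(cell)       # child elements are stored as Element objects

# Ordinary list operations reach the children.
first_cell = row[0]
cell_text = "".join(c for c in first_cell if isinstance(c, str))
```

One consequence of this design to keep in mind: an element without children is an empty list and therefore evaluates as false in a boolean context, so it is safer to test with "is not None" than with a bare "if element:".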

At this point, the whole XML stream is parsed and the structure is stored in, and accessible from, treebuilder.root. For example, the following code defines a function showtree and calls it to render an indented tree representation of the element names and character data:

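def showtree(node, prefix=""):
    # print the element name, then its children one indentation level deeper
    print(prefix, node.name)
    for e in node:
        if isinstance(e, Element):
            showtree(e, prefix + "  ")
        else:
            # plain string: character data
            print(prefix + "  ", e)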

showtree(treebuilder.root)

This simple example could of course be extended to serve a more real-life application, but you get the point.
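As one possible direction for such an extension, here is a rough sketch that pulls the cell texts out of a spreadsheet. Note that it swaps in a different parser than the one used above, xml.etree.ElementTree from the standard library, because its namespace support keeps the code short. The function name read_rows is made up, and the sketch assumes the usual OpenDocument table and text namespaces; it ignores details such as repeated cells (the table:number-columns-repeated attribute) and does not separate multiple sheets.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace URIs used by OpenDocument for tables and text.
NS = {
    "table": "urn:oasis:names:tc:opendocument:xmlns:table:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
}

def read_rows(path):
    """Return a list of rows, each a list of cell strings (all sheets mixed).

    read_rows is a hypothetical helper; path may be a file name or a
    file-like object, as accepted by zipfile.ZipFile.
    """
    with zipfile.ZipFile(path, "r") as archive:
        root = ET.fromstring(archive.read("content.xml"))
    rows = []
    for row in root.iter("{%s}table-row" % NS["table"]):
        cells = []
        for cell in row.findall("table:table-cell", NS):
            # A cell's visible text lives in zero or more text:p paragraphs.
            paragraphs = cell.findall("text:p", NS)
            cells.append("\n".join("".join(p.itertext()) for p in paragraphs))
        rows.append(cells)
    return rows
```

With the fictional file from above, read_rows("pelican.ods") would give one list of strings per spreadsheet row.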

Moreover, the OpenDocument Fellowship is working on LibOpenDocument, a library for manipulating OpenDocument files. At the time of this writing, there are only PHP and Python implementations, both at an early stage of development.