Working with Open Packaging parts

Article submitted by: Stéphane Rodriguez

This article is a tutorial on working with parts as defined by the Open Packaging Conventions (part 2 of the specification in particular). The intention is not to provide plug-and-play code, but to give insight into the conditions under which parts are easy to work with. A look at current code snippets using the System.IO.Packaging library (.NET 3.0) shows that simple actions do not translate into obvious or easy-to-read source code. The Open Packaging Conventions specification also dwells at length on subjects of limited interest (such as accessing parts through relative names, as if the package were a virtual hard drive), which makes it even less approachable.

Audience: this article assumes you want to access document parts directly, one way or another. Note that if you intend to work with Word/Excel/Powerpoint 2007 documents or XPS documents, you need not access the parts directly: Microsoft provides the regular COM Automation object-level programming interfaces, which abstract those parts away completely, and there are third-party implementations as well. Bear in mind that, laudable as it may be to go that low-level and read or change the underlying serialization of documents, doing so exposes you to entire classes of document corruption. Applications check three layers of validation to make sure a manipulated document is not corrupted: well-formed XML, XML that conforms to the published XML schemas, and application semantics (implementation rules that are not described in the other layers). A failure at any of those three layers can bring up a corruption dialog box the next time you open a manipulated document, and the recovery it offers may be poor: losing important parts such as the data or the formatting is a serious problem. Also remember that both Microsoft, over at OpenXmlDeveloper.org, and non-Microsoft individuals have published plug-and-play code snippets for business scenarios such as inserting or updating custom XML data sources, a server scenario that is poised to be popular in the Office 2007 timeframe. Expect more task-specific code snippets to be posted in the coming months.

 

Physical versus logical parts

There is ample documentation on the web explaining that Open Packaging based files, such as those produced by Word/Excel/Powerpoint 2007 and the MS XPS framework, are ZIP files that can be cracked open by hand to see what's inside. In particular, if you open up a Word 2007 document, it's intriguing to see all those parts split up for you, which makes it potentially simple to update one independently of another:


Cracking-open a Word 2007 document using built-in Windows XP ZIP shell

If you find yourself renaming files with a .ZIP extension just to open them in a Windows XP explorer window, you may like this little trick: once the file is renamed as .ZIP, you can still open it in any of the associated Office 2007 applications through the Open With... option in the shell context menu. That saves a lot of keystrokes. It is possible because Word/Excel/Powerpoint apparently no longer lock the file, a change from past versions.

Similarly, if you crack open an Excel 2007 spreadsheet, you can see that the workbook part is independent of the worksheets themselves, a point underlined by the fact that the worksheet parts are stored in a "sub-folder":


Cracking-open an Excel 2007 spreadsheet using built-in Windows XP ZIP shell

But the "sub-folder" and the visual hierarchy are just a trick. The Windows XP built-in ZIP shell reinforces that trick by displaying them as regular folders, which they are not. This is not quibbling: the discrepancy shows the difference between the physical parts, i.e. how the parts end up being stored in the ZIP file, and the logical parts, i.e. how the programmatic model exposes them to developers. To get a first idea of what this means, let's use Winzip to crack open an Excel 2007 spreadsheet:


Cracking-open an Excel 2007 spreadsheet in Winzip reveals the absence of physical folders
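The same flat list can be obtained programmatically. The following is a minimal Python sketch, not taken from the original article: it builds a tiny in-memory ZIP whose entry names imitate a spreadsheet package (the contents are placeholders, not a valid Excel file) and then lists the entries.

```python
import io
import zipfile

# Entry names imitating a spreadsheet package (assumed for illustration).
SAMPLE_ENTRIES = (
    "[Content_Types].xml",
    "_rels/.rels",
    "xl/workbook.xml",
    "xl/worksheets/sheet1.xml",
)

# Build the ZIP in memory; the contents are placeholders only.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    for name in SAMPLE_ENTRIES:
        archive.writestr(name, "<placeholder/>")

# Read the entry names back, exactly as they are stored.
with zipfile.ZipFile(buffer) as archive:
    entry_names = archive.namelist()

# A directory entry would have a name ending with a slash; there are none.
directory_entries = [name for name in entry_names if name.endswith("/")]

print(entry_names)
print(directory_entries)
```

Each entry is a single name that happens to contain slashes; no entry stands for a folder.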

As you can see above, there is no real sub-folder. The folders are a visual representation, made possible because nothing in the definition of ZIP entries prevents you from inserting slashes in entry names, and historically a number of applications have used those slashes to extract ZIP entries into actual folders. So far so good. Now let's see what happens if we insert a picture in the first worksheet of our spreadsheet:


Cracking-open an Excel 2007 spreadsheet in Winzip after adding a picture

As you can see above, we added a picture to the first worksheet, so we would expect it to be stored as part of what defines or contains this worksheet (for instance under xl/worksheets/sheet1/). Instead, Excel 2007 stores the picture in a different, unrelated "folder" called xl/media. Excel 2007 does this for performance reasons: it tries to share a single stored copy of a picture that is inserted more than once, copied across worksheets and so on. But that obscures the hierarchical nature of the parts. In fact, parts are not stored hierarchically from a physical standpoint. That is clear in this example, but take another look at the first example with Word 2007: a number of the parts sit at the same "folder" level even though they are hierarchically related. For instance, the document contains a page header, so the page header is a child of the document. Question: where is this parent-children relation stored?

The short answer is: in the logical representation of parts, through relationships. Only the logical representation maintains the hierarchical nature of parts, and none of it is reflected in Winzip or even the built-in Windows XP shell. Before getting into the details of what this really means, and how to structure our thinking in order to write source code against it, let's see a diagram of how our Word and Excel documents are really structured from a logical point of view:


Logical view of Word 2007 document parts

 


Logical view of Excel 2007 document parts

Notice how the logical views show a truly hierarchical relationship between parts.

If you use the logical views to express needs such as "inserting a picture in the first worksheet" or "finding all pictures in the first worksheet", they translate directly into simple parent-children navigation. The physical views offer no such model, so you simply cannot do it with them.

Those logical views raise the following questions:

  • How do we map the logical view into the physical view?
  • What are the arrows?
  • What is the root part?

 

Relationships

The arrows are relationships, a way to represent the parent-children relation that ties two given parts together. The Open Packaging Conventions materialize the relationships of any given part as a separate ZIP entry whose name ends with .rels.

Any hierarchy has to begin with a top part. The root part is a logical, empty part, with no name and no actual ZIP entry associated with it.

While the root part has no materialization, the definition of relationships implies that it has an associated .rels entry to maintain the parent-children relations with its child parts. Hence the ZIP entry _rels/.rels.

It may still be a little obscure at this point where those ZIP entry names come from, but the algorithm is pretty simple. Here is how to build a relationships ZIP entry name from an arbitrary part name:


How to build the relationships name from a given part name

It follows, by applying the algorithm to the root part, that the associated relationships entry is named _rels/.rels. Hence the ZIP entry that you can see when you crack open any Open Packaging based file.
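The naming rule can be sketched in a few lines of Python. This is an illustration rather than the article's own code: the function name relationships_name is hypothetical, and the root part is represented here by an empty string, which is an assumption of this sketch.

```python
import posixpath


def relationships_name(part_name: str) -> str:
    """Return the ZIP entry name that holds the relationships of a part.

    The rule: keep the leading portion of the name, insert "_rels",
    then append ".rels" to the last segment. An empty part_name stands
    for the root part (a convention of this sketch).
    """
    leading, last_segment = posixpath.split(part_name)
    return posixpath.join(leading, "_rels", last_segment + ".rels")


print(relationships_name(""))                   # _rels/.rels
print(relationships_name("word/document.xml"))  # word/_rels/document.xml.rels
print(relationships_name("xl/workbook.xml"))    # xl/_rels/workbook.xml.rels
```

Note that the function only manipulates the name as a string; it never needs a real folder to exist.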

The relationships entry itself is a simple declaration of all direct child parts. For optimization reasons, Word/Excel/Powerpoint tend to write this declaration without line breaks, which makes it hard to read unless you use a tool that pretty-prints the XML tree. You can double-click a .rels file (provided the extension is associated with a suitable application) and open it in Internet Explorer, whose default XSLT stylesheet automatically pretty-prints a well-formed XML tree. You can also open the file in Visual Studio, a third-party XML tool, or diffopc.


A typical example of relationships (here the relationships of the root part of a Word 2007 document)

In the example above, the root part has three child parts, whose names are completely arbitrary and in which the slashes are totally irrelevant from a parent-child perspective.
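For readers without the screenshot at hand, here is a hedged Python sketch that reads such a declaration. The sample XML is an assumption modeled on what a Word 2007 root relationships entry typically contains, not a copy of the figure; child_parts is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"

# Assumed sample, pretty-printed for readability.
SAMPLE_RELS = """<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId1"
      Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"
      Target="word/document.xml"/>
  <Relationship Id="rId2"
      Type="http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties"
      Target="docProps/core.xml"/>
  <Relationship Id="rId3"
      Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/extended-properties"
      Target="docProps/app.xml"/>
</Relationships>"""


def child_parts(rels_xml: str) -> list:
    """Return (id, short type, target) for each declared relationship."""
    root = ET.fromstring(rels_xml)
    result = []
    for rel in root.findall(f"{{{RELS_NS}}}Relationship"):
        # Keep only the last segment of the type URI for display.
        short_type = rel.get("Type").rsplit("/", 1)[-1]
        result.append((rel.get("Id"), short_type, rel.get("Target")))
    return result


children = child_parts(SAMPLE_RELS)
for child in children:
    print(child)
```

Each Relationship element carries three things: an identifier unique within the declaration, a relationship type, and the name of the target part.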

That's all there is to relationships. You can add or remove a relationship simply by adding or removing an entry in the relationships declaration. Note that doing so is only simple if you use a logical view where slashes are irrelevant. Fortunately, an API such as System.IO.Packaging lets you do that. The only side effect is that, unless further abstracted, source code that uses the System.IO.Packaging library may be hard to read and follow, because the slashes automatically suggest a sense of hierarchy. But again, ignore the slashes and take each part name in its entirety, so that all that matters is that names are unique, and you'll be fine.

Ideally, I should be able to add or remove relationships between parts by passing the parts themselves as arguments. This further underlines that the hierarchical view provided by the Windows XP shell works against understanding what's going on.

When I add a picture to the first worksheet of an Excel spreadsheet, I really only provide the worksheet part and the picture part as arguments. Likewise, navigating through relationships from top to bottom is a typical recursive parent-children loop.
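The recursive loop can be sketched as follows in Python. Everything here is illustrative: the helper names (rels_name, make_rels, walk) are hypothetical, the relationship type is a dummy value, and the in-memory package contains only .rels entries, which is enough for navigation but is not a valid spreadsheet.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def rels_name(part_name):
    """Name of the ZIP entry holding the relationships of a part."""
    leading, last = posixpath.split(part_name)
    return posixpath.join(leading, "_rels", last + ".rels")


def make_rels(*targets):
    """Build a relationships declaration (dummy type, for the demo only)."""
    items = "".join(
        f'<Relationship Id="rId{i}" Type="urn:example:demo" Target="{t}"/>'
        for i, t in enumerate(targets, 1))
    return f'<Relationships xmlns="{RELS_NS}">{items}</Relationships>'


def walk(archive, part_name="", depth=0, seen=None):
    """Yield (depth, part name) for every part reachable from part_name."""
    seen = set() if seen is None else seen
    try:
        rels_xml = archive.read(rels_name(part_name))
    except KeyError:
        return  # no relationships entry: the part has no children
    for rel in ET.fromstring(rels_xml).findall(f"{{{RELS_NS}}}Relationship"):
        if rel.get("TargetMode") == "External":
            continue  # external targets are not parts of this package
        # A target may be written relative to the name of the source part.
        joined = posixpath.join(posixpath.dirname(part_name), rel.get("Target"))
        child = posixpath.normpath(joined).lstrip("/")
        if child in seen:
            continue  # a part can be referenced by several parents
        seen.add(child)
        yield depth + 1, child
        yield from walk(archive, child, depth + 1, seen)


buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as z:
    z.writestr("_rels/.rels", make_rels("xl/workbook.xml"))
    z.writestr("xl/_rels/workbook.xml.rels", make_rels("worksheets/sheet1.xml"))
    z.writestr("xl/worksheets/_rels/sheet1.xml.rels",
               make_rels("../drawings/drawing1.xml"))
    z.writestr("xl/drawings/_rels/drawing1.xml.rels",
               make_rels("../media/image1.jpg"))

with zipfile.ZipFile(buffer) as z:
    tree = list(walk(z))

for depth, name in tree:
    print("  " * depth + name)
```

In this sample the picture ends up four levels below the root even though its name, xl/media/image1.jpg, suggests a location next to the workbook.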

Note that, while relationships materialize the relations between parts, a part need not include any relationship identifier in its own content. For instance, a worksheet need not include any identifier or tag that would essentially duplicate what the relationships already state. In practice though, it is often done, and this is where it gets a little tricky, or at least non-uniform: each application ultimately decides whether or not to include an identifier in a part to redundantly describe a relation with another part. For instance, in a Word 2007 document you may be surprised at how few references of the form r:id appear in the main document part, whereas the associated relationships list them all. Whenever a part adds a redundant relationship identifier, it uses the r namespace, usually declared at the top of the part, for instance xmlns:r = "http://schemas.openxmlformats.org/officeDocument/2006/relationships". The identifier is often carried by the r:id attribute, although it does not have to be. If you insert a picture, you may find the r:embed attribute instead; it's all up to the application. Word, Excel and Powerpoint tend to behave the same when it comes to pictures and other graphical objects, because all three share a drawing library with a corresponding markup (DrawingML, see part 4 of TC45 specs).

From a pure Open Packaging perspective, none of those r:something attributes are relevant. They are relevant to the application itself, which decides whether or not to store and maintain this information. Obviously, such a declaration makes a part more descriptive and stand-alone when a human scans it, compared with having to navigate through everything just to find out how parts relate to each other. Again, the first step is to rely on the associated relationships as defined in the section above, not on the r:id attributes.
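To see which redundant identifiers a given part carries, one can simply collect every attribute that lives in the r namespace. The Python sketch below does that on a deliberately simplified fragment; the element names in SAMPLE_PART are made up for the illustration and do not follow the real SpreadsheetML or DrawingML schemas.

```python
import xml.etree.ElementTree as ET

R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"

# Simplified, hypothetical part content: only the r: attributes matter here.
SAMPLE_PART = f"""<root xmlns:r="{R_NS}">
  <sheet name="Sheet1" r:id="rId1"/>
  <blip r:embed="rId2"/>
</root>"""


def relationship_references(part_xml):
    """Return (attribute local name, value) for every r: attribute found."""
    prefix = "{" + R_NS + "}"
    found = []
    for element in ET.fromstring(part_xml).iter():
        for attr_name, value in element.attrib.items():
            if attr_name.startswith(prefix):
                found.append((attr_name[len(prefix):], value))
    return found


references = relationship_references(SAMPLE_PART)
print(references)
```

The values found this way (rId1, rId2) are only meaningful once looked up in the relationships entry of the same part.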

 

Content types and Relationship types

The Open Packaging Conventions require that each part be associated with a MIME type (its content type), and that each relationship pointing to a part carry a relationship type. An example of a MIME type is image/jpeg, and an example of a relationship type is http://schemas.openxmlformats.org/officeDocument/2006/relationships/image.

MIME types are standard internet names, to which Microsoft added the new MIME types introduced by the new Word/Excel/Powerpoint/XPS file formats. Relationship types are essentially redundant with the MIME types, and are expressed the XML way, as URIs.

It follows that Open Packaging libraries should make a strong effort to reduce the friction related to MIME types and relationship types, ideally by inferring one from the other or, better yet, by inferring both from the part name.

Since the Open Packaging Conventions let one create a part with an arbitrary name, it is in theory impossible to infer the MIME type and relationship type from a part name. After all, sheet.xml could be the main part of a Word document. That said, Word/Excel/Powerpoint and the XPS framework do not create part names arbitrarily. If you open any Excel spreadsheet, you'll see worksheet names of the form xl/worksheets/sheetxxx.xml, where xxx is a number, and that is hardcoded in the application itself.

If a library were to make life simple for developers, it would at least offer an auto-pilot mode in which the part names used by each application are known in advance, so that both the MIME type and the relationship type can be inferred automatically. This removes a ton of plumbing code. Note that, to make this reliable, one needs the equivalent of regular expressions to identify known part names. Indeed, a worksheet part name such as xl/worksheets/sheetxxx.xml has a varying xxx sub-string in it, and because of the binary Excel file format (XLSB), a worksheet part name may also be of the form xl/worksheets/sheetxxx.bin, with different MIME and relationship types.
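Such an auto-pilot mode could look like the Python sketch below. The table KNOWN_PARTS and the function infer_types are hypothetical, and the table is deliberately tiny: it covers three well-known part name patterns and leaves out everything else, including the .bin worksheets.

```python
import re

REL_BASE = "http://schemas.openxmlformats.org/officeDocument/2006/relationships/"
CT_BASE = "application/vnd.openxmlformats-officedocument."

# (part name pattern, MIME type, relationship type); a partial, assumed table.
KNOWN_PARTS = [
    (re.compile(r"xl/worksheets/sheet\d+\.xml"),
     CT_BASE + "spreadsheetml.worksheet+xml",
     REL_BASE + "worksheet"),
    (re.compile(r"word/document\.xml"),
     CT_BASE + "wordprocessingml.document.main+xml",
     REL_BASE + "officeDocument"),
    (re.compile(r"(xl|word|ppt)/media/image\d+\.jpe?g"),
     "image/jpeg",
     REL_BASE + "image"),
]


def infer_types(part_name):
    """Return (MIME type, relationship type), or None for an unknown name."""
    for pattern, content_type, relationship_type in KNOWN_PARTS:
        if pattern.fullmatch(part_name):
            return content_type, relationship_type
    return None


print(infer_types("xl/worksheets/sheet12.xml"))
print(infer_types("xl/media/image1.jpg"))
print(infer_types("sheet.xml"))  # arbitrary name: nothing can be inferred
```

A caller would fall back to explicit types whenever infer_types returns None, which keeps arbitrary part names usable.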

With the exception of parts whose name ends with .bin, all known part names are documented in part 1 of TC45 specs. If that helps, I may provide a recap table for these in a future update of this article.

All MIME types are stored in a special ZIP entry named [Content_Types].xml, which is not a part, meaning that it need not be referenced in any part or relationships. This ZIP entry lets one quickly find out which object types are present in the ZIP file. Tempting as it is, however, the [Content_Types].xml ZIP entry cannot be used to list the parts in the ZIP file, not by a long shot. The reason is that if you insert, say, two pictures of the same type, there will be only one MIME type entry for both, and thus no chance of finding the two picture part names. The last non-obvious algorithm is the one used to add a MIME type to this ZIP entry.


The algorithm for manipulating content types in [Content_Types].xml. Adding a picture called image1.jpg
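Since the figure is not reproduced here, the following Python sketch shows one plausible version of that algorithm, based on the two kinds of entries that [Content_Types].xml can hold: Default (one MIME type per file extension) and Override (one MIME type for a specific part name). The function add_content_type and its return values are hypothetical, and edge cases such as names without an extension or an already existing Override are left out.

```python
import posixpath
import xml.etree.ElementTree as ET

CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"


def add_content_type(types_root, part_name, content_type):
    """Register the MIME type of a part; return which kind of entry was used."""
    extension = posixpath.splitext(part_name)[1].lstrip(".").lower()
    for default in types_root.findall(f"{{{CT_NS}}}Default"):
        if default.get("Extension", "").lower() == extension:
            if default.get("ContentType") == content_type:
                return "unchanged"  # the extension already maps to this type
            # The extension is taken by another type: declare this part alone.
            ET.SubElement(types_root, f"{{{CT_NS}}}Override",
                          PartName="/" + part_name, ContentType=content_type)
            return "override"
    # First part with this extension: declare a Default for the extension.
    ET.SubElement(types_root, f"{{{CT_NS}}}Default",
                  Extension=extension, ContentType=content_type)
    return "default"


types_root = ET.fromstring(
    f'<Types xmlns="{CT_NS}">'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '</Types>')

first = add_content_type(types_root, "xl/media/image1.jpg", "image/jpeg")
second = add_content_type(types_root, "xl/media/image2.jpg", "image/jpeg")
third = add_content_type(
    types_root, "xl/workbook.xml",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml")

print(first, second, third)
print(len(types_root.findall(f"{{{CT_NS}}}Default")))
```

Adding the second picture leaves the entry untouched, which is exactly why this ZIP entry cannot serve as a list of parts.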
Published Wednesday, November 15, 2006 2:23 PM by dmahugh