
Guided Tour of the Spec, Part 1: Packaging

Have you looked at the spec? (The Office Open XML Document Interchange Specification.) It's huge: 2000 pages of dense text in a single document, all carefully and logically organized, and written for everyone from government entities and compliance organizations to software developers to educators and authors. This broad audience makes the style a bit hard to digest in purely technical terms, at least to me. (As one of many examples, the use of all caps for specific words like MUST and SHALL left me feeling YELLED AT!)

So this article is one person's attempt to sum up a few of the spec's key concepts for developers. My goal is to save you some time in getting started with the spec if you're 100% new to Office Open XML. As always, if you have questions or comments or suggestions about this article, please post them below.

The Package

At the highest level, each Office Open XML document is stored in a package, meaning a ZIP archive containing a bunch of parts and items. Parts are the chunks of content that make up the document, while items are descriptive metadata that defines how the parts are assembled and rendered. Items can be divided into relationship items (defining how parts are related) and content-type items (defining the content type of each part), and relationship items can be further divided into package relationships (relating a part to the package) and part relationships (relating one part to another).
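
Since a package is just a ZIP archive, any ZIP library can show you what's inside it. Here's a minimal Python sketch using the standard zipfile module. The helper names and the three-entry demo package are my own stand-ins for illustration (they don't come from the spec or the samples); to inspect a real document, pass in the bytes of a .docx, .xlsx, or .pptx file instead.

```python
import io
import zipfile


def list_package_entries(data: bytes) -> list:
    """Return the names of all entries in a package, given its raw bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        return package.namelist()


def build_demo_package() -> bytes:
    """Build a tiny stand-in package in memory (hypothetical contents)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("[Content_Types].xml", "<Types/>")
        package.writestr("_rels/.rels", "<Relationships/>")
        package.writestr("word/document.xml", "<document/>")
    return buffer.getvalue()


if __name__ == "__main__":
    for name in list_package_entries(build_demo_package()):
        print(name)
```

For a file on disk, something like list_package_entries(open("sample.docx", "rb").read()) should give you the same kind of listing.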

Parts

Let's look at the parts first. Each document has a main part that is the "outermost" part, and all other parts (if any) are contained or referenced within this main part. The details and location of the main part vary depending on the document type:

  • For a typical WordProcessingML package, see sample.docx in the sample documents on this site. Note that the main part is the document.xml file in the word folder.
  • For a typical SpreadsheetML package, see sample.xlsx in the sample documents on this site. Note that the main part is the workbook.xml file in the xl folder.
  • For a typical PresentationML package, see sample.pptx in the sample documents on this site. Note that the main part is the presentation.xml file in the ppt folder.

Relationship Items

So you're probably wondering: how would my code know to look in those specific places for the main part of each of these documents? The answer: by looking to the officeDocument relationship that points to the main part. This relationship is stored in the .rels file in the _rels folder in the samples mentioned above.

The _rels folder is a fundamental concept of the packaging convention. Relationships are always defined in a _rels folder that is "directly subordinate to the folder of the source of the relationship item." (See section 9.2.3, Representing Relationships.) In the examples above, the officeDocument relationship is a package relationship (i.e., the source is the package itself), so it's stored in a _rels subfolder of the root of the package.
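
To make that lookup concrete, here's a rough Python sketch that finds the main part from the text of a package-level .rels item. The sample XML is hand-written for illustration, not copied from the sample documents, and its namespace and Type URIs are assumptions that may not match your files. For that reason the code ignores the namespace and only checks that the Type value ends in /officeDocument. The function and constant names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hand-written stand-in for the contents of _rels/.rels (an assumption).
# The namespace and Type URIs may differ from the ones in your files.
SAMPLE_ROOT_RELS = """\
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId2"
      Type="http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties"
      Target="docProps/core.xml" />
  <Relationship Id="rId1"
      Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"
      Target="word/document.xml" />
</Relationships>
"""


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def find_main_part(rels_xml: str) -> Optional[str]:
    """Return the Target of the officeDocument relationship, or None."""
    root = ET.fromstring(rels_xml)
    for element in root:
        if local_name(element.tag) != "Relationship":
            continue
        # Match on the end of the Type value so the check does not depend
        # on one particular version of the relationship-type URI.
        if element.get("Type", "").endswith("/officeDocument"):
            return element.get("Target")
    return None


if __name__ == "__main__":
    print(find_main_part(SAMPLE_ROOT_RELS))
```

Matching on the tail of the Type value is a shortcut I chose to keep the sketch version-neutral; stricter code would compare the full URI defined by the version of the spec it targets.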

But suppose you were looking for a part relationship whose source is the document.xml file itself — that relationship would be stored in a _rels subfolder under the word folder that contains document.xml. This consistency in the storage of the relationships makes it easier to write code to recursively traverse a package, because once you have a part you know exactly where to look for the relationships that bind it to its components.
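
That storage rule boils down to a little path arithmetic. As a sketch (the function name is mine, and treating an empty string as "the package itself" is just a convention I picked for this example), here's how code might work out where the relationships for a given source live:

```python
import posixpath


def rels_path_for(part_name: str) -> str:
    """Return the path of the relationships item for a source part.

    An empty part name stands for the package itself (a convention of
    this sketch), which yields the root-level _rels/.rels item.
    """
    folder, file_name = posixpath.split(part_name.lstrip("/"))
    # The relationships item sits in a _rels folder next to the source
    # and is named after the source file, with ".rels" appended.
    return posixpath.join(folder, "_rels", file_name + ".rels")


if __name__ == "__main__":
    print(rels_path_for(""))
    print(rels_path_for("word/document.xml"))
```

With a helper like this, a recursive walk is mostly a loop: read a part, compute its .rels path, read the targets listed there, and repeat for each target.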

Content-Type Items

Content types are exactly what they sound like: metadata describing the "file type" of each part in the package. Typical content types may include simple embedded objects such as text/plain or image/jpeg as well as more expansive types such as application/xml. Note that even the relationships items have a content type (application/vnd.ms-package.relationships+xml in the sample documents). Content types tell the consumer of an Office Open XML package how to interpret or render each of the parts.

Content types are all stored together in a single item named [Content_Types].xml in the root folder of the package. The types in this item apply to all of the parts throughout all levels of the package's internal hierarchy.

A default content type is associated with a file extension, such as xml or jpg. An override content type instead assigns a content type to one specific part, regardless of its file extension. As a typical example of what you'll find in content type items, here's [Content_Types].xml from the sample.docx file:

<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<Types xmlns="http://schemas.microsoft.com/package/2005/06/content-types">
  <Default Extension="rels" ContentType="application/vnd.ms-package.relationships+xml" />
  <Default Extension="xml" ContentType="application/xml" />
  <Override PartName="/word/document.xml" ContentType="application/vnd.ms-word.main+xml" />
  <Override PartName="/word/styles.xml" ContentType="application/vnd.ms-word.styles+xml" />
  <Override PartName="/word/settings.xml" ContentType="application/vnd.ms-word.settings+xml" />
  <Override PartName="/word/theme/theme1.xml" ContentType="application/vnd.ms-office.theme+xml" />
  <Override PartName="/word/fontTable.xml" ContentType="application/vnd.ms-word.fontTable+xml" />
  <Override PartName="/docProps/core.xml" ContentType="application/vnd.ms-package.core-properties+xml" />
</Types>
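
To show how a consumer might use this item, here's a small Python sketch that resolves the content type of a part: an Override for the exact part name wins, otherwise the Default for the file extension applies. The sample string is a trimmed copy of the XML above; the function name is hypothetical, and the exact-match comparison of part names is a simplification on my part rather than something taken from the spec.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Trimmed copy of the [Content_Types].xml shown above.
SAMPLE_TYPES = """\
<Types xmlns="http://schemas.microsoft.com/package/2005/06/content-types">
  <Default Extension="rels" ContentType="application/vnd.ms-package.relationships+xml" />
  <Default Extension="xml" ContentType="application/xml" />
  <Override PartName="/word/document.xml" ContentType="application/vnd.ms-word.main+xml" />
</Types>
"""


def content_type_for(types_xml: str, part_name: str) -> Optional[str]:
    """Return the content type of a part, or None if nothing matches."""
    root = ET.fromstring(types_xml)
    file_name = part_name.rsplit("/", 1)[-1]
    extension = file_name.rsplit(".", 1)[-1].lower() if "." in file_name else ""
    default_match = None
    for element in root:
        kind = element.tag.rsplit("}", 1)[-1]  # drop the namespace prefix
        if kind == "Override" and element.get("PartName") == part_name:
            return element.get("ContentType")  # exact part name wins
        if kind == "Default" and element.get("Extension", "").lower() == extension:
            default_match = element.get("ContentType")
    return default_match


if __name__ == "__main__":
    for name in ("/word/document.xml", "/word/other.xml", "/_rels/.rels"):
        print(name, content_type_for(SAMPLE_TYPES, name))
```

A part with no matching Override or Default (a jpg in this trimmed sample, for instance) comes back as None, which is the case the next section's picture example takes care of by adding a new default.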

Embedded Objects

One of the most flexible aspects of Office Open XML is the support for arbitrary embedded objects, which may be any type of binary content. No longer does a document have to suffer the bloat and inefficiency of Base64 encoding in order to combine XML-based flexibility with modern multimedia content types: you can simply include a relationship to an object and let the Office Open XML consumer merge it all together when you open the document. These can even be references to external resources that are not included in the document package itself. (See section 9.2.3, Representing Relationships.)

As a very simple example of how embedded objects are handled in a typical Office Open XML producer, consider what happens when we insert a picture in the sample.docx sample document. If you do this in Word 2007 and save the document, several things are changed within the package:

  • a default content type is added to [Content_Types].xml, mapping the jpg extension to image/jpeg
  • a media folder is created under the word folder, and the inserted jpg file is stored there
  • the document.xml file in the word folder (i.e., the main part) is modified to add a reference to a new relationship
  • the document.xml.rels file (in the _rels folder under the word folder) gets a new relationship definition, which links the relationship tag within document.xml to the appropriate file part in the media folder

The key concept to understand here is that the document gets a relationship added to it, and the relationship defines where the inserted image is coming from. If the image were an external resource, the document would look exactly the same, but the .rels file would contain a different definition of the relationship.
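
Here's a hedged Python sketch of that idea: given the text of a part's .rels item and a relationship ID, it reports where the content lives and whether it's outside the package. The sample XML is hand-written (the IDs, file names, and URL are made up), and I'm assuming external targets are flagged with a TargetMode="External" attribute; check the relationships markup in your own files before relying on that.

```python
import posixpath
import xml.etree.ElementTree as ET
from typing import Tuple

# Hand-written stand-in for word/_rels/document.xml.rels (an assumption).
SAMPLE_PART_RELS = """\
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId4"
      Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"
      Target="media/image1.jpg" />
  <Relationship Id="rId5"
      Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"
      Target="http://example.com/picture.jpg" TargetMode="External" />
</Relationships>
"""


def resolve_relationship(rels_xml: str, source_part: str,
                         relationship_id: str) -> Tuple[str, bool]:
    """Return (location, is_external) for one relationship of a part."""
    root = ET.fromstring(rels_xml)
    source_folder = posixpath.dirname(source_part)
    for element in root:
        if element.tag.rsplit("}", 1)[-1] != "Relationship":
            continue
        if element.get("Id") != relationship_id:
            continue
        target = element.get("Target", "")
        if element.get("TargetMode") == "External":
            return target, True  # lives outside the package
        # Internal targets are relative to the folder of the source part.
        return posixpath.normpath(posixpath.join(source_folder, target)), False
    raise KeyError(relationship_id)


if __name__ == "__main__":
    print(resolve_relationship(SAMPLE_PART_RELS, "word/document.xml", "rId4"))
    print(resolve_relationship(SAMPLE_PART_RELS, "word/document.xml", "rId5"))
```

Notice that the caller passes the same kind of ID either way; only the relationship definition decides whether the picture comes from the media folder or from somewhere else entirely.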

Alternative Format Import Parts: AFChunks

During the transition from binary document formats to Office Open XML, there are going to be a lot of hybrid documents that contain embedded sections and objects that are in older formats: HTML, RTF, earlier versions of WordProcessingML, and so on. To allow for this inevitable complexity, the Office Open XML formats allow for alternative format chunks that may be in one of these older formats.

AFChunks are handled in essentially the same manner as embedded objects. The document part contains a relationship tag, and the separate definition of that relationship tag in the corresponding .rels part tells where to get the chunk.

Note that AFChunks are intended to support the migration of content to Office Open XML. A conforming consumer should read AFChunks and dynamically insert them in the appropriate places as defined by the relationships, but a conforming producer of Office Open XML should never write AFChunks in its output. See section 10.3.1, "Alternative Format Import part," for more information.

Summary

The examples here are very simple, and I've avoided getting into the details of the three specific document formats because my goal was to start with the packaging conventions and related concepts. A logical continuation of this article would be to cover the components of each of the three document types (WordProcessingML, SpreadsheetML, and PresentationML) in more detail. These topics are addressed in overviews in sections 8.3, 8.4, and 8.5 respectively, and then covered in great detail through the remainder of the spec.

Feedback

Was this article useful? Would you like to see more articles like this that summarize various aspects of the spec? If so, what would you like to see next? Share your feedback in the comments section below.

Published Thursday, March 30, 2006 7:17 AM by dmahugh

Comments

 

orcmid said:

Oops.  Didn't know this was here -- I found it from your blog.

I like it.  Nice breakdown about items and such.  Thought you'd dug a hole when you talked about specific places for the "outermost" parts.  Heh.

Note that references needn't be to "components" they can be any explicit relationship (including back up to a parent) and unimaginable forms of implicit relationships.

And, I think there needs to be a little more about how to know which package relationship is the one that matters?

I'm excited to see this kind of exploration while we all learn this together.  Thanks.  Keep 'em coming.
March 30, 2006 7:22 PM
 

dmahugh said:

Hey, as far as which package relationship is "the one that matters," check out "Step 1: Finding the Start Part" in Kevin Boske's blog post on getting started with the packaging API.  There should be only one part of type officeDocument, and that's the start part.

I'm going to get back to digging through the spec this week, and will post more detail on this soon.

- Doug
April 3, 2006 2:37 PM