Note from Eric White: This is the first post in a series of guest posts by Florian Reuter. He has written a pretty cool library for working with OPC files (published at http://libopc.codeplex.com/). In upcoming posts, he is going to cover Markup Compatibility and Extensibility (MCE) and his libopc library.
The Open Packaging Convention (OPC) is Part 2 of
the Office Open XML standard, the standard behind the new .docx, .xlsx and
.pptx Office formats.
The OPC defines a container format which can be used to
store any kind of data, so it is not only suited for Office formats. E.g. the
XML Paper Specification (XPS) also uses OPC as its packaging layer.
In many ways OPC can be seen as a successor of the OLE
containers used by the proprietary .DOC, .XLS and .PPT formats. Unlike OLE containers,
which are modeled on the FAT file system, OPC containers are valid
ZIP archives plus some extra metadata.
This means that any OPC container can be opened with a ZIP
program. Try it out yourself: create a .docx, .xlsx or .pptx file and change the
extension to .zip. A simple double-click will expose the container structure in
the built-in Windows ZIP viewer.
The metadata is encoded in the additional "_rels"
folders and the "[Content_Types].xml".
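The same inspection can be done programmatically. The following is a minimal sketch in Python, using only the standard `zipfile` module. The tiny package it builds in memory is a made-up stand-in with placeholder contents (not a valid Word document), and `list_entries` is a hypothetical helper name; pass the path of a real .docx to it instead to see an actual container.

```python
import io
import zipfile

# Build a tiny stand-in package in memory so the sketch runs without a real .docx.
# The part contents are placeholders, not a valid Word document.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as package:
    package.writestr("[Content_Types].xml", "<Types/>")
    package.writestr("_rels/.rels", "<Relationships/>")
    package.writestr("word/document.xml", "<document/>")
    package.writestr("word/_rels/document.xml.rels", "<Relationships/>")


def list_entries(source):
    """Return the ZIP entry names of an OPC container (path or file-like object)."""
    with zipfile.ZipFile(source) as package:
        return package.namelist()


entries = list_entries(buffer)

# The OPC metadata lives in "[Content_Types].xml" and in the "_rels" folders.
metadata = [name for name in entries
            if name == "[Content_Types].xml" or "_rels/" in name]

for name in entries:
    marker = "metadata" if name in metadata else "part"
    print(f"{marker:8} {name}")
```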
In order to really understand the OPC it is important to
understand the abstract OPC container structure first.
First of all every OPC container specifies a set of MIME types
also known as content types. Typical content types in a .DOCX document are:
Additionally every OPC container has a "default"
binding between an extension and a content type. E.g.:
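To make the content type lookup concrete, here is a hedged sketch in Python. The XML is an illustrative "[Content_Types].xml", not taken from a specific document. In the OPC, "Default" elements bind an extension to a content type, and "Override" elements (not discussed above) bind one specific part name. `content_type_of` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace of the elements in "[Content_Types].xml".
CT_NS = "{http://schemas.openxmlformats.org/package/2006/content-types}"

# Illustrative content; a real package lists more parts.
CONTENT_TYPES_XML = """
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
  <Default Extension="rels"
           ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
  <Default Extension="xml" ContentType="application/xml"/>
  <Override PartName="/word/document.xml"
            ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
</Types>
""".strip()


def content_type_of(part_name, types_xml):
    """Resolve the content type of a part: overrides first, then extension defaults."""
    root = ET.fromstring(types_xml)
    # An Override binds one specific part name and takes precedence.
    for override in root.findall(CT_NS + "Override"):
        if override.get("PartName").lower() == part_name.lower():
            return override.get("ContentType")
    # Otherwise fall back to the default binding for the extension.
    extension = part_name.rsplit(".", 1)[-1].lower()
    for default in root.findall(CT_NS + "Default"):
        if default.get("Extension").lower() == extension:
            return default.get("ContentType")
    return None


print(content_type_of("/word/document.xml", CONTENT_TYPES_XML))
print(content_type_of("/word/styles.xml", CONTENT_TYPES_XML))
```

In this sample, "/word/styles.xml" has no override, so it falls back to the default binding for the "xml" extension.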
Next every OPC container defines a set of "relation
types". Relation types have the same form as XML namespace names. Typical relation types in a .DOCX file are:
An OPC container also keeps a list of all external relations.
E.g. when a .DOCX document contains a hyperlink to "http://naverage.com",
this external link is stored as an external relation:
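In the ZIP representation, an external relation is a "Relationship" element whose "TargetMode" attribute is "External". The following Python sketch assumes an illustrative relationships file for "word/document.xml"; the ids "rId1" and "rId2" are made up, and `external_targets` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace of the elements in a ".rels" file.
REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"

# Illustrative "word/_rels/document.xml.rels"; the ids are made up.
RELS_XML = """
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId1"
                Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles"
                Target="styles.xml"/>
  <Relationship Id="rId2"
                Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink"
                Target="http://naverage.com"
                TargetMode="External"/>
</Relationships>
""".strip()


def external_targets(rels_xml):
    """Return the targets of all relations marked as external."""
    root = ET.fromstring(rels_xml)
    return [rel.get("Target")
            for rel in root.findall(REL_NS + "Relationship")
            if rel.get("TargetMode") == "External"]


print(external_targets(RELS_XML))
```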
Data is stored inside an OPC container as parts. A part has
a hierarchical name and a type. Here are the typical parts of a .DOCX document:
Finally, OPC containers store relations between parts.
Consider e.g. the part "word/document.xml" and the part
"word/styles.xml". There is obviously a relation between these two
parts: the "word/styles.xml" part contains the style
definitions referenced in the "word/document.xml" part. Therefore, in
a typical .DOCX document, a relation similar to the following is established:
A relation inside an OPC container has a source part, a
destination part as well as a relation id and a relation type. The relation id
is unique with respect to the source part, i.e. no two relations which leave a
source part have the same id.
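The structure of a relation can be sketched in Python as follows. This is an illustration under assumptions, not how any particular library models it: `relations_of` is a hypothetical helper that reads the relations leaving one source part, resolves each internal target relative to the source part's folder, and rejects duplicate ids. External targets are not handled here.

```python
import posixpath
import xml.etree.ElementTree as ET

REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"
STYLES = "http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles"

# Illustrative relations leaving "word/document.xml"; the id is made up.
DOCUMENT_RELS = f"""
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId1" Type="{STYLES}" Target="styles.xml"/>
</Relationships>
""".strip()


def relations_of(source_part, rels_xml):
    """Return {relation id: (relation type, target part name)} for one source part."""
    base = posixpath.dirname(source_part)
    relations = {}
    for rel in ET.fromstring(rels_xml).findall(REL_NS + "Relationship"):
        rel_id = rel.get("Id")
        # Ids must be unique among the relations leaving the same source part.
        if rel_id in relations:
            raise ValueError(f"duplicate relation id {rel_id!r} on {source_part}")
        # Internal targets are resolved relative to the source part's folder.
        target = posixpath.normpath(posixpath.join(base, rel.get("Target")))
        relations[rel_id] = (rel.get("Type"), target)
    return relations


print(relations_of("word/document.xml", DOCUMENT_RELS))
```

Keying the result by relation id mirrors the uniqueness rule: within one source part, an id identifies exactly one relation.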
An OPC container also has a virtual root part (here denoted
with "[root]" or "/"), which is used to model the root of
the relation hierarchy.
Here are the typical relations found in a .DOCX file:
One of the peculiarities of the OPC is how you navigate
within an OPC container. Although most APIs give you the ability to access
parts directly, the relations are usually used to find the right part.
Let's suppose you want to open the document part of a DOCX
document. The straightforward, but wrong, way would be to check whether
the OPC container has a "word/document.xml" stream and open it if
present. Even if you additionally check whether the
"word/document.xml" stream has the content type "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml",
this would be the wrong way to handle a DOCX document, since the part name
"word/document.xml" is not important: a package may store its document part under a different name.
The right way to access the document part of a DOCX document
is to check whether the OPC container has a relation of the main document relation type leaving [root].
If so, we follow the relation and then check the content type of the relation's
target part. If the content type of the target part is "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml",
we have a DOCX document.
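The navigation described above can be sketched in Python with the standard library only. The relation type URI used here is the "officeDocument" relationship type from the Office Open XML specification; it is supplied by this sketch rather than quoted from the text above. `find_main_document` is a hypothetical helper, and the in-memory package is a made-up stand-in whose document part is deliberately not named "word/document.xml".

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"
CT_NS = "{http://schemas.openxmlformats.org/package/2006/content-types}"
# Relation type of the main document part (from the Office Open XML specification).
OFFICE_DOCUMENT = ("http://schemas.openxmlformats.org/officeDocument/2006/"
                   "relationships/officeDocument")
WORD_MAIN = ("application/vnd.openxmlformats-officedocument."
             "wordprocessingml.document.main+xml")


def find_main_document(package):
    """Follow the officeDocument relation leaving [root]; return the target
    part name if its content type marks it as a Word main document."""
    rels = ET.fromstring(package.read("_rels/.rels"))
    types = ET.fromstring(package.read("[Content_Types].xml"))
    overrides = {o.get("PartName"): o.get("ContentType")
                 for o in types.findall(CT_NS + "Override")}
    defaults = {d.get("Extension").lower(): d.get("ContentType")
                for d in types.findall(CT_NS + "Default")}
    for rel in rels.findall(REL_NS + "Relationship"):
        if rel.get("Type") != OFFICE_DOCUMENT:
            continue
        # Targets of [root] relations are resolved against the package root "/".
        target = posixpath.normpath(posixpath.join("/", rel.get("Target")))
        extension = target.rsplit(".", 1)[-1].lower()
        if overrides.get(target, defaults.get(extension)) == WORD_MAIN:
            return target
    return None


# Stand-in package: the document part is deliberately NOT "word/document.xml".
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as package:
    package.writestr(
        "[Content_Types].xml",
        '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
        f'<Override PartName="/content/main.xml" ContentType="{WORD_MAIN}"/>'
        '</Types>')
    package.writestr(
        "_rels/.rels",
        '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
        f'<Relationship Id="rId1" Type="{OFFICE_DOCUMENT}" Target="content/main.xml"/>'
        '</Relationships>')
    package.writestr("content/main.xml", "<document/>")

with zipfile.ZipFile(buffer) as package:
    main_part = find_main_document(package)
print(main_part)
```

A lookup by the fixed name "word/document.xml" would fail on this package, while following the relation from [root] still finds the document part.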
Windows comes with two different libraries for handling OPC
containers: an unmanaged COM-based API and a managed .NET-based API.
Documentation for the two APIs can be found at http://msdn.microsoft.com/en-us/library/windows/desktop/dd742822.aspx
and http://msdn.microsoft.com/en-us/library/system.io.packaging.aspx.
In this series of blog posts we will use libopc
(libopc.codeplex.com), a free and open source library for dealing with the OPC
which can be used on Windows as well as on Linux, iOS and Android.
Libopc comes with a command line tool "opc_dump"
which can be used to dump the structure of an OPC container. This tool is very
handy and it can be used like this:
> opc_dump "Hello World.docx" > dump.txt
In the next post we will take a look at the layer above OPC,
called Markup Compatibility and Extensibility (MCE), before we take a
closer look at libopc.
Florian Reuter (CEO of Naverage UG http://naverage.com
and coordinator of http://libopc.codeplex.com)