I'm a protocols and formats standards junky and I started to look at how to describe a conceptual model for the Office Open XML packages. I notice that there seem to be at least three levels of abstraction involved, and while I mull that over some more I was wondering if anyone with a similar interest had any observations to share.
Here are the at-least three:
So that's what I've been thinking about as I tinker with diagrammatic ways of coming to grips with the OOX Package conventions. The reason I'm doing this is that document processing is a pet interest of mine. I'm interested in explaining and demonstrating how we use abstract levels like this to ultimately accomplish useful processing of digital documents. I think OOX Packages (and OPC even more so in some respects) are ideal choices because these open formats are going to be of great practical value as well as useful objects of study.
Well, Dennis, if you're a "protocols and formats standards junky" then you're probably in the right place. :-)
I appreciate your attempt to come up with a conceptual model for understanding Office Open XML packages. I'm trying to formulate some abstractions to aid my own understanding of the formats, and it's good to hear another perspective. On another thread, Stephane and Sanjay were discussing some of the distinctions between what constitutes a part (i.e., is a chunk of XML a part, or just one of the tangible artifacts of a part?), so it looks like we're all trying to pin down these concepts more specifically.
I'm still mulling it all over, but here are a couple of observations after a first read of your thoughts ...
At the highest level, we just have parts, relationships, and content types, all bundled into a package. The fact that there is a hierarchical structure to the package is just an implementation detail, really, and in fact the draft Ecma spec strongly emphasizes the importance of not writing code to the hierarchy and writing it instead to the defined relationships.
I'm starting to see that's the key to understanding the formats, in general: thinking in terms of parts, relationships, and content types, rather than the physical implementations of those concepts. I started thinking about OOX at a concrete, "what's in the ZIP archive?" level, but last week Stephen Peront (who contributed the embedded-objects sample) pointed out to me that we really shouldn't think at that level. It's the first thing you see when you crack open a package, and of course as a developer you need to know the implementation details, but the part/relationship/content-type abstractions are at the core of what an "Open XML Format" is all about.
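To make the "write to the relationships, not the hierarchy" point concrete, here is a rough Python sketch. It builds a toy package in memory and then locates the main document part by reading the package-level relationships part instead of assuming a fixed path. The helper names `build_sample_package` and `find_main_part` are made up for illustration, the namespace and relationship-type URIs are the ones from the published standard (the draft may have used different ones), and the toy package leaves out `[Content_Types].xml`, which a real package needs.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

# URIs as published in the final standard; assumed here, the draft may differ.
RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
OFFICE_DOC = (
    "http://schemas.openxmlformats.org/officeDocument/2006/"
    "relationships/officeDocument"
)


def build_sample_package(main_part_name):
    """Return the bytes of a toy package whose main part is main_part_name.

    Hypothetical helper: [Content_Types].xml is omitted for brevity.
    """
    rels = (
        f'<Relationships xmlns="{RELS_NS}">'
        f'<Relationship Id="rId1" Type="{OFFICE_DOC}" Target="{main_part_name}"/>'
        "</Relationships>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("_rels/.rels", rels)
        zf.writestr(main_part_name, "<document/>")
    return buf.getvalue()


def find_main_part(package_bytes):
    """Follow the package-level relationship to the main document part."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as zf:
        root = ET.fromstring(zf.read("_rels/.rels"))
        for rel in root.findall(f"{{{RELS_NS}}}Relationship"):
            if rel.get("Type") == OFFICE_DOC:
                # Package-level targets resolve against the package root.
                target = posixpath.join("/", rel.get("Target"))
                return posixpath.normpath(target).lstrip("/")
    return None
```

The same lookup works whether the producer put the main part at `word/document.xml` or somewhere unexpected such as `stuff/anywhere/main.xml`; code that hard-codes the first path would break on the second.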
Another aspect of learning about Open XML is to divide it into the document markup languages and the packaging conventions. Many developers already know a lot about WordProcessingML, SpreadsheetML, and PresentationML from prior experience with Office 2003 and related products. But the packaging conventions are new to everyone. (Well, except Brian and a few of his cohorts.) So I think it's worth concentrating on the packaging conventions first, and viewing the document MLs as essentially blobs for now. That may not be an absolutely necessary abstraction, but anything that lets me ignore 90% of the spec feels like a good short-term tactic at this point on the learning curve. :-)
- Doug
My colleague Bill Anderson left a comment on the blog version of my initial post on this topic. Bill raises two concerns:
I replied to his comment on the blog, and I will probably follow up there after thinking it through. I'm not so sure metadiscussions like that are what people care about here. (I hadn't seen Doug's reply yet; I think I need to turn on some kind of automatic notice.)
What fun. This is useful to know (and continue mulling over).
dmahugh: I'm still mulling it all over, but here are a couple of observations after a first read of your thoughts ... At the highest level, we just have parts, relationships, and content types, all bundled into a package. The fact that there is a hierarchical structure to the package is just an implementation detail, really, and in fact the draft Ecma spec strongly emphasizes the importance of not writing code to the hierarchy and writing it instead to the defined relationships.
I agree completely with the injunction in the Ecma draft. But I don't take that as denying that the model is hierarchical. I take it as saying one should not assume what the actual hierarchy will be in any given package.
I see a (logical) hierarchy in the sense that there are rules (e.g., for custom content and for part relationships) that use the notion of folders and can depend on relative addressing up and down the hierarchy/folder nest.
I prefer to say that there is a (conceptual) folder hierarchy, but that no semantics are built into the hierarchy, the choice of folder names and paths, etc. And in many cases, (sub-)hierarchies must be preserved lest references from one content part to another content part be broken.
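Here is a small Python sketch of the relative addressing I have in mind. It assumes the usual packaging convention that a part's relationships live in a `_rels` folder beside it and that relative targets resolve against the folder of the source part. The function names `rels_part_for` and `resolve_target` are hypothetical, and the part names in the usage note are only examples.

```python
import posixpath


def rels_part_for(part_name):
    """Name of the relationships part that belongs to part_name."""
    folder, leaf = posixpath.split(part_name)
    return posixpath.join(folder, "_rels", leaf + ".rels")


def resolve_target(source_part, target):
    """Resolve a relationship Target against the folder of the source part.

    A target starting with "/" is taken from the package root; anything
    else is relative and may climb with "..".
    """
    base = posixpath.dirname("/" + source_part.lstrip("/"))
    return posixpath.normpath(posixpath.join(base, target)).lstrip("/")
```

With these, `resolve_target("word/document.xml", "media/image1.png")` gives `word/media/image1.png`, and `resolve_target("word/document.xml", "../customXml/item1.xml")` gives `customXml/item1.xml`. If the document part is relocated to `content/word/document.xml`, the first target now resolves to `content/word/media/image1.png`, so the media folder has to move along with it. The folder names carry no meaning, yet the sub-hierarchy still has to stay intact.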
dmahugh: I'm starting to see that's the key to understanding the formats, in general: thinking in terms of parts, relationships, and content types, rather than the physical implementations of those concepts. I started thinking about OOX at a concrete, "what's in the ZIP archive?" level, but last week Stephen Peront (who contributed the embedded-objects sample) pointed out to me that we really shouldn't think at that level. It's the first thing you see when you crack open a package, and of course as a developer you need to know the implementation details, but the part/relationship/content-type abstractions are at the core of what an "Open XML Format" is all about.
Amen. I think there is an even-higher conceptual model and I'm aligned with this as the logical (data?) model above the physical/persistent representation.
dmahugh: Another aspect of learning about Open XML is to divide it into the document markup languages and the packaging conventions. [ ... ] That may not be an absolutely necessary abstraction, but anything that lets me ignore 90% of the spec feels like a good short-term tactic at this point on the learning curve. :-)
I think so too. We're looking at an abstraction for a container/carrier level. It will be interesting to see how one can characterize the bridge to content, but not now.
Update: I just found Doug's new Guided Tour article on the format, starting with the OOX Package. Love it.
dmahugh: On another thread, Stephane and Sanjay were discussing some of the distinctions between what constitutes a part (i.e., is a chunk of XML a part, or just one of the tangible artifacts of a part?), so it looks like we're all trying to pin down these concepts more specifically.
I found the discussion you refer to on the thread off of Sanjay's Java example. I chipped in my two cents there, so won't say more until that one clears moderation/caching.