OpenXML Developer

Modeling OOX Packages

  • I'm a protocols and formats standards junky and I started to look at how to describe a conceptual model for the Office Open XML packages.  I notice that there seem to be at least three levels of abstraction involved, and while I mull on that some more I was wondering if anyone with a similar interest had any observations to share.

    Here are the three (at least):

    1. Package Conceptual Model - this is the highest level of what a generic OOX package is, what its essentials are and what it carries.  This does not deal with the specific OOX content, just the package itself.  Getting to the carrying of office documents is a much bigger deal, with more levels of abstraction.  The relationship and content-type items might not exist independently of parts at this level.
          
    2. Package Logical Model - this is in terms of the abstracted items.  I'm waffling, here, on whether the [Content_Types].xml part is reflected or whether every component simply has a content-type attribute.  I'm pretty certain that parts and relationships are reflected here, as is the hierarchic structure.  This provides a logical, navigational representation for the conceptual OOX packages.  It is independent of any particular method of storage and technology for access and manipulation.
         
    3. Persisted/Serialized Storage Model(s) - this is in terms of storage-system and data stream formats.  Carrying a package in a hierarchical file system (e.g., before-after Zip-extraction) applies as do other possible storage abstractions.  The Zip format as a serialized storage structure or data stream is another case.  This level makes use of the Zip model (itself an abstraction) as a carrier.  Taking it all the way to the bits can be handled below that.  An important characteristic at this level is that the models all have a way to be transformed into and out of the Zip serialization.  I find the use of a constrained hierarchical-storage model easier to visualize and explain, even though the Zip serialization is the key to interchange.
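
A minimal sketch of the third level, showing that a hierarchical-storage view and the Zip serialization can be transformed into each other without loss. It uses only Python's standard `zipfile` module. The `TREE` dictionary, its placeholder contents, and the function names are assumptions made up for illustration; the contents are not a valid Office Open XML package.

```python
import io
import zipfile

# Hypothetical package contents keyed by path, as they might appear in a
# hierarchical file system after Zip extraction. The XML bodies are
# placeholders, not valid package markup.
TREE = {
    "[Content_Types].xml": b"<Types/>",
    "_rels/.rels": b"<Relationships/>",
    "word/document.xml": b"<document/>",
}

def tree_to_zip(tree):
    """Serialize a path -> bytes mapping into Zip bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for path, data in tree.items():
            zf.writestr(path, data)
    return buf.getvalue()

def zip_to_tree(blob):
    """Recover the path -> bytes mapping from Zip bytes."""
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        return {name: zf.read(name) for name in zf.namelist()}

# Round trip: storage view -> Zip serialization -> storage view.
round_tripped = zip_to_tree(tree_to_zip(TREE))
```

The round trip preserves paths and bytes only. Anything the higher levels care about (parts, relationships, content types) still has to be interpreted from those bytes.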

    So that's what I've been thinking about as I tinker with diagrammatic ways of coming to grips with the OOX Package conventions.  The reason I'm doing this is that document-processing is a pet interest of mine.  I'm interested in explaining and demonstrating how we use abstract levels like this to ultimately accomplish useful processing of digital documents.  I think OOX Packages (and OPC even more so in some respects) are ideal choices because these open formats are going to be of great practical value as well as useful objects of study.

    Dennis E. Hamilton AIIM DMware Technical Coordinator http://odma.info http://DMware.info
  • Well, Dennis, if you're a "protocols and formats standards junky" then you're probably in the right place. :-)

    I appreciate your attempt to come up with a conceptual model for understanding Office Open XML packages.  I'm trying to formulate some abstractions to aid my own understanding of the formats, and it's good to hear another perspective.  On another thread, Stephane and Sanjay were discussing some of the distinctions between what constitutes a part (i.e., is a chunk of XML a part, or just one of the tangible artifacts of a part?), so it looks like we're all trying to pin down these concepts more specifically.

    I'm still mulling it all over, but here are a couple of observations after a first read of your thoughts ...

    At the highest level, we just have parts, relationships, and content types, all bundled into a package.  The fact that there is a hierarchical structure to the package is just an implementation detail, really, and in fact the draft Ecma spec strongly emphasizes the importance of not writing code to the hierarchy and writing it instead to the defined relationships.

    I'm starting to see that's the key to understanding the formats, in general: thinking in terms of parts, relationships, and content types, rather than the physical implementations of those concepts.  I started thinking about OOX at a concrete, "what's in the ZIP archive?" level, but last week Stephen Peront (who contributed the embedded-objects sample) pointed out to me that we really shouldn't think at that level.  It's the first thing you see when you crack open a package, and of course as a developer you need to know the implementation details, but the part/relationship/content-type abstractions are at the core of what an "Open XML Format" is all about.
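
A minimal sketch of "writing to the relationships, not the hierarchy". The demo package deliberately stores its main part at an unconventional path, and the reader finds it through the package-level relationships part instead of a hard-coded folder name. The namespace and relationship-type URIs are the ones used in the published Open Packaging Conventions (the draft under discussion may have used different values); the path `content/main.xml` and the function names are hypothetical, and the demo omits `[Content_Types].xml`, so it is not a complete package.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
OFFICE_DOC = ("http://schemas.openxmlformats.org/officeDocument/2006/"
              "relationships/officeDocument")

def build_demo_package():
    """Build an in-memory Zip whose main part lives at a made-up path."""
    rels = (
        f'<Relationships xmlns="{REL_NS}">'
        f'<Relationship Id="rId1" Type="{OFFICE_DOC}" Target="content/main.xml"/>'
        "</Relationships>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("_rels/.rels", rels)
        zf.writestr("content/main.xml", "<doc/>")
    return buf.getvalue()

def find_main_part(blob):
    """Locate the main document part by relationship type, not by path."""
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        root = ET.fromstring(zf.read("_rels/.rels"))
        for rel in root.findall(f"{{{REL_NS}}}Relationship"):
            if rel.get("Type") == OFFICE_DOC:
                return posixpath.normpath(rel.get("Target"))
    return None

main_part = find_main_part(build_demo_package())
```

Code that assumed the main part always sits at `word/document.xml` would fail on this package, while the relationship lookup still finds it.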

    Another aspect of learning about Open XML is to divide it into the document markup languages and the packaging conventions.  Many developers already know a lot about WordprocessingML, SpreadsheetML, and PresentationML from prior experience with Office 2003 and related products.  But the packaging conventions are new to everyone.  (Well, except Brian and a few of his cohorts.)  So I think it's worth concentrating on the packaging conventions first, and viewing the document MLs as essentially blobs for now.  That may not be an absolutely necessary abstraction, but anything that lets me ignore 90% of the spec feels like a good short-term tactic at this point on the learning curve.  :-)
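
A minimal sketch of the "parts as blobs" view at the packaging level: without parsing any document markup, each part can be given a content type by consulting `[Content_Types].xml`, where an `Override` entry for a specific part name takes precedence over a `Default` entry for its extension. The namespace URI and element/attribute names follow the published Open Packaging Conventions; the sample XML and the function name `content_type_of` are assumptions for illustration.

```python
import posixpath
import xml.etree.ElementTree as ET

CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"

# Hypothetical sample of a [Content_Types].xml part.
CONTENT_TYPES_XML = f"""<Types xmlns="{CT_NS}">
  <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
  <Default Extension="xml" ContentType="application/xml"/>
  <Override PartName="/word/document.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
</Types>"""

def content_type_of(part_name, types_xml=CONTENT_TYPES_XML):
    """Return the content type for a part name, or None if none applies."""
    root = ET.fromstring(types_xml)
    # An Override for the exact part name wins (compared case-insensitively).
    for override in root.findall(f"{{{CT_NS}}}Override"):
        if override.get("PartName").lower() == part_name.lower():
            return override.get("ContentType")
    # Otherwise fall back to a Default keyed by the file extension.
    ext = posixpath.splitext(part_name)[1].lstrip(".").lower()
    for default in root.findall(f"{{{CT_NS}}}Default"):
        if default.get("Extension").lower() == ext:
            return default.get("ContentType")
    return None

main_type = content_type_of("/word/document.xml")
props_type = content_type_of("/docProps/app.xml")
```

In this sketch the part body is never opened, which is the point: the package layer can describe what it carries while the markup languages stay opaque.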

    - Doug

    - Doug Mahugh Technical Evangelist, Microsoft
  • My colleague Bill Anderson left a comment on the blog version of my initial post on this topic.  Bill raises two concerns:

    1. The use of "Logical Model" which throws him a little.  I'm thinking about that.
    2. The use of "Persisted/Serialized Storage Model" where "Package Storage Model" might be good enough.  That makes me nervous.  I'm thinking that use of "Storage" is an error on my part.  Maybe I should have used "Physical Model(s)?"

    I replied to his comment on the blog, and I will probably follow up there after thinking it through.  I'm not so sure metadiscussions like that are what people care about here.  (I hadn't seen Doug's reply yet; I think I need to turn on some kind of automatic notice.)

    Dennis E. Hamilton AIIM DMware Technical Coordinator http://odma.info http://DMware.info
  • What fun.  This is useful to know (and continue mulling over).

    dmahugh:

    I'm still mulling it all over, but here are a couple of observations after a first read of your thoughts ...

    At the highest level, we just have parts, relationships, and content types, all bundled into a package.  The fact that there is a hierarchical structure to the package is just an implementation detail, really, and in fact the draft Ecma spec strongly emphasizes the importance of not writing code to the hierarchy and writing it instead to the defined relationships.

    I agree completely with the injunction in the Ecma draft.  But I don't take that as denying that the model is hierarchical.  I take it as saying one should not assume what the actual hierarchy will be in any given package.

    I see a (logical) hierarchy in the sense that there are rules (e.g., for custom content and for part-relationships) that use the notion of folders and can depend on relative addressing up and down the hierarchy/folder-nest.

    I prefer to say that there is a (conceptual) folder hierarchy, but there are no semantics built into the hierarchy, the choice of folder names and paths, etc.  And in many cases, (sub-) hierarchies must be preserved lest references from one content part to another content part be broken.
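
A minimal sketch of why sub-hierarchies matter even though folder names carry no meaning. It assumes the packaging convention that a part's relationships live in a `_rels` folder beside it and that relative targets are resolved against the source part's folder. The part names and both helper functions are hypothetical.

```python
import posixpath

def rels_part_for(source_part):
    """Name of the relationships part that belongs to a source part."""
    folder, name = posixpath.split(source_part)
    return posixpath.join(folder, "_rels", name + ".rels")

def resolve_target(source_part, target):
    """Resolve a relationship target against the source part's folder.

    A target starting with "/" is already absolute and is kept as-is.
    """
    base = posixpath.dirname(source_part)
    return posixpath.normpath(posixpath.join(base, target))

# The folder names are arbitrary; only the relative layout matters.
rels_name = rels_part_for("/word/document.xml")
image = resolve_target("/word/document.xml", "media/image1.png")
custom = resolve_target("/word/document.xml", "../customXml/item1.xml")
moved = resolve_target("/alt/word/document.xml", "media/image1.png")
```

Moving the whole `word` subtree under `/alt` keeps the relative reference valid (it now resolves under `/alt/word/media`), whereas moving `document.xml` alone would leave the same target pointing at a part that is no longer there.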

    dmahugh:

    I'm starting to see that's the key to understanding the formats, in general: thinking in terms of parts, relationships, and content types, rather than the physical implementations of those concepts.  I started thinking about OOX at a concrete, "what's in the ZIP archive?" level, but last week Stephen Peront (who contributed the embedded-objects sample) pointed out to me that we really shouldn't think at that level.  It's the first thing you see when you crack open a package, and of course as a developer you need to know the implementation details, but the part/relationship/content-type abstractions are at the core of what an "Open XML Format" is all about.

    Amen.  I think there is an even-higher conceptual model and I'm aligned with this as the logical (data?) model above the physical/persistent representation.

    dmahugh:

    Another aspect of learning about Open XML is to divide it into the document markup languages and the packaging conventions. [ ... ] That may not be an absolutely necessary abstraction, but anything that lets me ignore 90% of the spec feels like a good short-term tactic at this point on the learning curve.  :-)

    I think so too.  We're looking at an abstraction for a container/carrier level.  It will be interesting to see how one can characterize the bridge to content, but not now.

    Update: I just found Doug's new Guided Tour article on the format, starting with the OOX Package.  Love it.

    Dennis E. Hamilton AIIM DMware Technical Coordinator http://odma.info http://DMware.info
  • dmahugh:

    On another thread, Stephane and Sanjay were discussing some of the distinctions between what constitutes a part (i.e., is a chunk of XML a part, or just one of the tangible artifacts of a part?), so it looks like we're all trying to pin down these concepts more specifically.

    I found the discussion you refer to on the thread off of Sanjay's Java example.  I chipped in my two cents there, so won't say more until that one clears moderation/caching.

    Dennis E. Hamilton AIIM DMware Technical Coordinator http://odma.info http://DMware.info