To celebrate the RTM release of the Open XML SDK 2.0 we’re launching a bunch of new content here at Open XML Developer. This article gives a brief history of the Open XML SDK 2.0 and links to related content here at Open XML Developer.
Introduction
The Open XML file formats are a set of internationally standardised document formats. For developers used to working with complex, loosely documented legacy binary formats, Open XML represented a fantastic change in the way they worked. Working with the Open XML formats is still fairly heavy lifting involving quite complex XML data manipulation; modern office file documents are a very rich representation of information.
The Open XML SDK 2.0 is a tool that aims to drive developer productivity for those working with the Open XML formats. It provides a strongly typed object API and removes the need for developers to manipulate the raw XML data that makes up an Open XML document. Importantly, the Open XML SDK 2.0 is designed to work without any dependence on Microsoft Office or any other tool. You can, for example, comfortably run Open XML SDK 2.0 based code on high volume web servers, a place where you would not typically want to be running Microsoft Word.
In this article we will look briefly at the various approaches for working with Office documents and their pros and cons. We will then dive deep into the Open XML SDK 2.0 and see key features designed to help developers be more productive in working with these open, interoperable standards.
In the beginning
Love it or hate it, Microsoft Office has been a powerful force in business over the past 20 odd years. Applications like Word and Excel (and indeed other non-Microsoft alternatives) have touched almost every business in the world. A schoolchild these days is likely to be as proficient with a word processor as they are with a pen and pencil.
Office Automation:
For a long time developers have wanted to be able to manipulate Office documents from their code. The Microsoft Office suite has historically had good developer capability: there are more ‘mission critical’ Excel macro based applications out there than we probably care to admit! Through Visual Basic for Applications and Microsoft Office Automation we can, as developers and power users, effectively remote control Office applications by using a typed object model.
Office automation was and still is a fantastic solution for many applications. It provides an extensive object model covering almost every feature of the key Microsoft Office applications. Because it is actually manipulating the productivity application, Office automation exposes not only file format information but also application functionality; for example, we can programmatically regenerate a table of contents or perform a spell check. If you are writing a smart client application for Windows and you know that your user will have Office installed, then Office automation should be at the top of your list of approaches to use.
This approach has drawbacks too though, particularly given the nature of applications today where we see high volume workloads on servers. Microsoft Office is a people focussed application and is not really designed to be run on a server; it does not scale the way web developers need it to. Office automation is simply not designed for the sort of workloads that are generated in big web applications. A big web farm would also carry significant additional licensing costs were Microsoft Office required on each server.
Finally, Office automation is not an appropriate choice where your application needs to run on something other than a Windows PC. If you were writing code for a Windows SmartPhone you would not have access to any of the automation APIs.
Cutting out the Middle-man
To meet these needs we would have to go one level deeper and manipulate the actual document file rather than the application. We would sacrifice some functionality in doing this. For example, we would no longer be able to use the API for application functions such as spellchecking, but we would have the power to write extremely lightweight, high performance code that could target a variety of platforms.
In the good old days this was pretty difficult: the Microsoft Office binary formats were not publicly documented and were a complex proposition to work with. Several large Independent Software Vendors wrote applications that could work with these formats and some third party APIs emerged, but the prospects for small software firms or internal IT teams working with these documents were limited.
That is where a more transparent and understandable document file format comes to the rescue.
The early 2000s were the age of angle brackets: if it wasn’t in HTML or XML it was not worth having. Between 2000 and 2003 Microsoft began refining and documenting its formats into an XML based format.
Well, almost...
These new formats really appealed to developers of the day who, already being experts in working with XML, were able to easily manipulate Office documents without a reliance on the Office client applications. They were less popular with end users, as all the pain of moving to a new default format just did not seem worth it.
TODAY: Open Standards and the Open XML SDK v2.0
Over the course of the 2000s Microsoft worked with international standards bodies such as ECMA and ISO to develop, refine, and finally standardise its Open XML file formats. The Open XML file format is now an ISO standard: ISO/IEC 29500, referred to in this article as IS29500.
Key features of this format are:
Because of their simplicity, the Open XML formats are very easy for developers to work with. All you need is an ability to unpack a *.zip file and manipulate a text document.
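As a rough illustration of the "unpack a zip file and manipulate a text document" approach, here is a minimal Python sketch that uses only the standard library. It does not use the Open XML SDK. The helper names (build_package, read_text, replace_text) and the tiny in-memory package are made up for this example, and a real document contains many more parts than the three shown here.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

CONTENT_TYPES = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" '
    'ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" ContentType="application/'
    'vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
    '</Types>'
)
RELS = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/'
    'officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>'
    '</Relationships>'
)
DOCUMENT = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    '<w:p><w:r><w:t>Hello from raw markup</w:t></w:r></w:p>'
    '</w:body></w:document>'
)


def build_package():
    """Build a tiny illustrative word processing package in memory."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
        z.writestr("[Content_Types].xml", CONTENT_TYPES)
        z.writestr("_rels/.rels", RELS)
        z.writestr("word/document.xml", DOCUMENT)
    return buf.getvalue()


def read_text(package):
    """Return the contents of every w:t (text run) element in the main part."""
    with zipfile.ZipFile(io.BytesIO(package)) as z:
        root = ET.fromstring(z.read("word/document.xml"))
    return [t.text or "" for t in root.iter(f"{{{W_NS}}}t")]


def replace_text(package, old, new):
    """Copy the package, rewriting text runs in word/document.xml."""
    ET.register_namespace("w", W_NS)  # keep the conventional w: prefix on output
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(package)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for name in src.namelist():
            data = src.read(name)
            if name == "word/document.xml":
                root = ET.fromstring(data)
                for t in root.iter(f"{{{W_NS}}}t"):
                    if t.text:
                        t.text = t.text.replace(old, new)
                data = ET.tostring(root, encoding="UTF-8", xml_declaration=True)
            dst.writestr(name, data)
    return out.getvalue()


if __name__ == "__main__":
    pkg = build_package()
    print(read_text(pkg))
    print(read_text(replace_text(pkg, "raw markup", "Python")))
```

Even in this small case the code has to know part names, namespaces and element names by heart, which is the kind of bookkeeping a typed API is meant to take away.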
However, for .NET developers the Open XML SDK 2.0 makes working with Open XML documents more accessible than ever before. The SDK 2.0 provides a strongly typed API for manipulating the OPC packages and the Open XML format markup contained within them. In the next section of the document we’ll look at what makes the Open XML SDK 2.0 a great tool.
Strongly Typed API
The Open XML SDK 2.0 includes a strongly typed API for manipulating documents. Rather than having to work with the Open XML markup parts as generic XML you can use these typed objects. This means that you will have full IntelliSense support inside various developer tools and certain tools will also support retrieving member documentation.
The API is able to serialize back into an Open XML document at any time. This means that developers can manipulate a document using the object model and make a simple call to Save() to write out the Open XML format package.
Language Integrated Query (LINQ) is a .NET based technology for querying data structures from code. The Open XML SDK 2.0 supports LINQ as a first class concept, meaning that it is possible to easily query and manipulate collections of objects such as paragraphs or table rows.
The Open XML SDK 2.0 can be coupled with Office Services functionality of Microsoft SharePoint to support batch document conversion and other advanced operations. The SDK 2.0 requires only a medium level of trust and as such will be compatible with most server side deployment scenarios including ‘locked down’ hosting providers and SharePoint 2010 sandboxed solutions.
Developers can layer their own code on top of the API to build application specific libraries for reuse; because the Open XML SDK 2.0 supports free distribution rights these libraries can be distributed broadly.
1. Developer Productivity Tools
The Open XML SDK 2.0 includes the Developer Productivity Tool, which helps developers write their applications more quickly.
Document Comparison:
The SDK 2.0 includes a mechanism to compare two documents. Developers can use this a little like they might use a normal ‘Diff’ tool with text files. The Open XML Diff capability is unique in that it understands the structure of the Open XML formats and is therefore able to show changes not only in the underlying Open XML markup but also in the OPC package structures.
The Open XML Diff tool is particularly useful for comparing documents pre and post manipulation by another Open XML implementation. Need to understand how Microsoft Word applies a particular format? Compare the before and after markup using this tool.
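To give a feel for what a package-aware comparison means, the following Python sketch compares two packages at the part level only. It is not the SDK's Diff tool and does not reproduce its markup-level comparison. The function names make_package and diff_packages and the sample part contents are hypothetical.

```python
import io
import zipfile


def make_package(parts):
    """Build an in-memory zip package from a {part name: content} mapping."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
        for name, content in parts.items():
            z.writestr(name, content)
    return buf.getvalue()


def diff_packages(before, after):
    """Report which parts were added, removed or changed between two packages."""
    with zipfile.ZipFile(io.BytesIO(before)) as a, \
            zipfile.ZipFile(io.BytesIO(after)) as b:
        a_parts = {name: a.read(name) for name in a.namelist()}
        b_parts = {name: b.read(name) for name in b.namelist()}
    common = a_parts.keys() & b_parts.keys()
    return {
        "added": sorted(b_parts.keys() - a_parts.keys()),
        "removed": sorted(a_parts.keys() - b_parts.keys()),
        # Byte-for-byte comparison; a real tool would compare the parsed markup.
        "changed": sorted(n for n in common if a_parts[n] != b_parts[n]),
    }


if __name__ == "__main__":
    before = make_package({
        "word/document.xml": "<doc><p>one</p></doc>",
        "docProps/app.xml": "<props/>",
    })
    after = make_package({
        "word/document.xml": "<doc><p>one</p><p>two</p></doc>",
        "word/styles.xml": "<styles/>",
    })
    for kind, names in diff_packages(before, after).items():
        print(kind, names)
```

A byte-level check like this flags a part as changed even when only insignificant whitespace differs, which is one reason a comparison that understands the markup is more useful in practice.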
Document Reflection:
Often developers will have a template document from which they wish to work. They may have built a Word document and now want to understand how to write code to create that document programmatically. For this need the SDK 2.0 provides a Document Reflector. The Document Reflector is able to parse an existing document and emit C# source code that will recreate that document via the strongly typed API.
If you are familiar with the excellent Red Gate .NET Reflector tool then you will understand just how useful this approach can be.
For more information on how to use the Developer Productivity Tool please see this OpenXMLDeveloper.org article: An Introduction to Open XML SDK 2.0.
2. Interoperability and Standards Conformance
The Open XML SDK 2.0 represents the most complete API for working with the internationally standardised IS29500 Open XML file formats.
Validation of documents is important for many developers, particularly those generating content dynamically from other data sources. As well as the inherent validation provided by the strongly typed API, the SDK 2.0 includes specific validation logic. This logic checks not only schema conformance but also semantic conformance based on the requirements set out in the IS29500 specification text. Validation errors caused by syntactically or semantically incorrect markup return detailed XPath information to allow easy resolution.
The SDK 2.0 provides easy access to documentation directly from within the tools. As well as IntelliSense documentation, the SDK 2.0 validation tools will retrieve guidance from the IS29500 specification and the Microsoft implementation notes.
The IS29500 specification makes specific provision for implementers to add their own markup. This is set out in Part 3 of the specification: Markup Compatibility and Extensibility (MCE). The SDK 2.0 provides specific support for MCE constructs. Documents can be pre-processed to retrieve either Office 2007 or Office 2010 markup. The SDK 2.0 is also able to easily emit MCE constructs should developers wish to specify their own extended markup. For more details on working with MCE in the Open XML SDK 2.0 please see the hands on lab: OpenXML Markup Compatibility and Extensibility with the OpenXML SDK 2.0
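The pre-processing idea can be sketched in a few lines of Python. This is an illustration of the AlternateContent, Choice and Fallback selection rule only, not the SDK's implementation, and it ignores other MCE features such as Ignorable. The sample markup, the urn:example namespaces and the preprocess function are invented for the example.

```python
import io
import xml.etree.ElementTree as ET

MC_NS = "http://schemas.openxmlformats.org/markup-compatibility/2006"
ALTERNATE = f"{{{MC_NS}}}AlternateContent"
CHOICE = f"{{{MC_NS}}}Choice"
FALLBACK = f"{{{MC_NS}}}Fallback"

# Hypothetical markup: a consumer that understands urn:example:v2 should see
# <v2:fancy>, every other consumer should see <plain>.
SAMPLE = """<doc xmlns="urn:example:base"
     xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
     xmlns:v2="urn:example:v2">
  <mc:AlternateContent>
    <mc:Choice Requires="v2"><v2:fancy>new</v2:fancy></mc:Choice>
    <mc:Fallback><plain>old</plain></mc:Fallback>
  </mc:AlternateContent>
</doc>"""


def preprocess(xml_text, understood):
    """Replace each AlternateContent with the branch this consumer can handle.

    `understood` is the set of namespace URIs the consumer knows about.
    """
    # Requires lists namespace prefixes, so record prefix -> URI declarations
    # while parsing. Simplification: one document-wide prefix map.
    prefixes = {}
    root = None
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, payload in ET.iterparse(source, events=("start-ns", "start")):
        if event == "start-ns":
            prefix, uri = payload
            prefixes[prefix] = uri
        elif root is None:
            root = payload

    def select(alternate):
        # Take the first Choice whose required namespaces are all understood,
        # otherwise the Fallback, otherwise nothing.
        for branch in alternate:
            if branch.tag == CHOICE:
                required = branch.get("Requires", "").split()
                if all(prefixes.get(p) in understood for p in required):
                    return list(branch)
            elif branch.tag == FALLBACK:
                return list(branch)
        return []

    def flatten(children):
        result = []
        for child in children:
            if child.tag == ALTERNATE:
                result.extend(flatten(select(child)))
            else:
                result.append(child)
        return result

    def expand(element):
        element[:] = flatten(list(element))
        for child in element:
            expand(child)

    expand(root)
    return root


if __name__ == "__main__":
    for known in ({"urn:example:v2"}, set()):
        print([child.tag for child in preprocess(SAMPLE, known)])
```

The same document therefore yields different markup depending on which namespaces the consumer declares it understands, which is the effect the article describes when it talks about retrieving Office 2007 or Office 2010 markup.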
3. Application Independence
The Open XML SDK 2.0 does not have any dependency on Microsoft Office or any other productivity suite. The SDK 2.0 supports both server side and client side configurations with Microsoft .NET being the only prerequisite.
The Open XML SDK 2.0 is freely distributable. This means that any solution developer who chooses to implement the Open XML formats in their application is able to distribute the Open XML SDK binaries (that is, the *.dll files) with their application.
Learning About the Open XML SDK 2.0
In this section we’ll set out and link to many of the learning resources for the Open XML SDK 2.0.
Articles:
We have a number of OpenXmlDeveloper.org articles that discuss the Open XML SDK 2.0. Some of these articles have been written for CTP releases of the SDK 2.0. For the most part these should work with the final release.
Hands on Labs:
We have refreshed the Hands on Lab content available here at Open XML Developer. While it still provides the detailed drill down on Open XML markup it now also discusses the Open XML SDK 2.0 and includes two new labs that demonstrate how to use the Open XML SDK 2.0.
Conclusion
The 2.0 release of the Open XML SDK represents a major step forward for developers working with the Open XML file formats. It provides the richest client independent document format manipulation API on any platform.
Here at Open XML Developer we will be publishing a number of new articles on the Open XML SDK 2.0 over the coming months. You can subscribe to our RSS feed to be notified of new updates or follow us on Twitter @OpenXmlDev.
To view the Open XML SDK 2.0 Launch forum click here.