Open XML and Java


Note: this code has been updated for the final Ecma schemas and the RTM version of Office. If you notice any mistakes or run into issues, please report them to julien@chable.net

This post is a summary translation of articles by Julien Chable that are available (in French) on MSDN France.

Retrieve a document’s main part

public final static String NS_CORE_DOCUMENT =
"http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";

...

final String APP_ROOT = System.getProperty("user.dir") + File.separator;
ZipFile zipFile = null;

try {
zipFile = new ZipFile(APP_ROOT + "sample.docx");
} catch (IOException e) {
e.printStackTrace();
}

Package p = Package.open(zipFile, PackageAccess.Read);

// Retrieve the core document relationship by its type
PackageRelationship coreDocRelationship =
p.getRelationshipsByType(
PackageRelationshipConstants.NS_CORE_DOCUMENT).getRelationship(0);

// Get the content part from the relationship
PackagePart coreDocument = p.getPart(coreDocRelationship);
System.out.println(coreDocument.getUri() + " -> "
+ coreDocument.getContentType());

Listing 1

Listing 1 output for several types of documents :

  • Word :
    word/document.xml -> application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml
  • Excel :
    xl/workbook.xml -> application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml
  • PowerPoint :
    ppt/presentation.xml -> application/vnd.openxmlformats-officedocument.presentationml.presentation.main+xml

Here are the extensions and the URI of the main part for several types of documents :

  • WordProcessingML (.docx) : word/document.xml
  • SpreadsheetML (.xlsx) : xl/workbook.xml
  • PresentationML (.pptx) : ppt/presentation.xml
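For readers who do not have the article's framework at hand, the same lookup can be sketched with the JDK alone. The main part is found by reading the package relationships part (_rels/.rels) and picking the relationship whose type is the officeDocument one. The class name, the helper name and the sample XML below are assumptions made for illustration; they are not part of the framework used in Listing 1.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class RelationshipLookup {
    // Namespace used by the elements of a relationships part
    static final String NS_RELS =
        "http://schemas.openxmlformats.org/package/2006/relationships";
    // Relationship type of the main document part (same value as in Listing 1)
    static final String NS_CORE_DOCUMENT =
        "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";

    // Hypothetical content of a _rels/.rels part, for illustration only
    static final String SAMPLE_RELS =
        "<Relationships xmlns=\"" + NS_RELS + "\">"
        + "<Relationship Id=\"rId1\" Type=\"" + NS_CORE_DOCUMENT
        + "\" Target=\"word/document.xml\"/>"
        + "</Relationships>";

    /** Returns the Target of the first relationship of the given type, or null. */
    static String findTarget(InputStream relsXml, String type) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().parse(relsXml);
        NodeList rels = doc.getElementsByTagNameNS(NS_RELS, "Relationship");
        for (int i = 0; i < rels.getLength(); i++) {
            Element rel = (Element) rels.item(i);
            if (type.equals(rel.getAttribute("Type"))) {
                return rel.getAttribute("Target");
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        InputStream in = new ByteArrayInputStream(
            SAMPLE_RELS.getBytes(StandardCharsets.UTF_8));
        System.out.println(findTarget(in, NS_CORE_DOCUMENT));
    }
}
```

With a real file, the input stream would come from the ZIP entry named _rels/.rels, for example via zipFile.getInputStream(zipFile.getEntry("_rels/.rels")).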

How to get document’s properties

The following sample demonstrates how to get the core property part of a document :

public final static String NS_CORE_PROPERTIES =
"http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties";
...

Package p = Package.open(zipFile, PackageAccess.Read);

// Get core properties part relationship
PackageRelationship corePropertiesRelationship =
p.getRelationshipsByType(PackageRelationshipConstants.NS_CORE_PROPERTIES).getRelationship(0);

// Get core properties part from the previous relationship
PackagePart coreDocument = p.getPart(corePropertiesRelationship);
System.out.println(coreDocument.getUri() + " -> "
+ coreDocument.getContentType());

Listing 2

The output displays :

docProps/core.xml -> application/vnd.openxmlformats-package.core-properties+xml

Only a few simple lines are needed to get document’s properties :

...
OpenXMLDocument docx = new OpenXMLDocument(Package.open(zipFile,
PackageAccess.Read));
System.out.println(docx.getCoreProperties().getCreator());
System.out.println(docx.getCoreProperties().getTitle());
System.out.println(docx.getCoreProperties().getSubject());

Listing 3

The output displays :

Julien CHABLE

Lorem Ipsum

Sample document
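What getCoreProperties() does internally is not shown in this article. As a rough idea of the work involved, the sketch below reads the same three values from a core properties part with the JDK's DOM API. The creator, title and subject are stored as Dublin Core elements; the class name, helper names and sample XML are assumptions for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class CorePropertiesReader {
    static final String NS_CP =
        "http://schemas.openxmlformats.org/package/2006/metadata/core-properties";
    static final String NS_DC = "http://purl.org/dc/elements/1.1/";

    // Hypothetical content of docProps/core.xml, reduced to three properties
    static final String SAMPLE =
        "<cp:coreProperties xmlns:cp=\"" + NS_CP + "\" xmlns:dc=\"" + NS_DC + "\">"
        + "<dc:title>Lorem Ipsum</dc:title>"
        + "<dc:subject>Sample document</dc:subject>"
        + "<dc:creator>Julien CHABLE</dc:creator>"
        + "</cp:coreProperties>";

    /** Parses an XML string with a namespace-aware DOM parser. */
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    /** Returns the text of the first Dublin Core element with that name, or null. */
    static String dcValue(Document doc, String localName) {
        NodeList nodes = doc.getElementsByTagNameNS(NS_DC, localName);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(SAMPLE);
        System.out.println(dcValue(doc, "creator"));
        System.out.println(dcValue(doc, "title"));
        System.out.println(dcValue(doc, "subject"));
    }
}
```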

How to change document’s properties

Changing a property is as simple as reading one:

// Destination file
File destFile = new File(APP_ROOT + "sample_out.docx");

// Open the document
Package pack = Package.open(zipFile, PackageAccess.ReadWrite);
OpenXMLDocument docx = new OpenXMLDocument(pack);

CoreProperties coreProps = docx.getCoreProperties();
coreProps.setCreator("OpenXMLDeveloper.org powa");
coreProps.setDescription("A new description");
coreProps.setTitle("SampleListing4");

// Save document
docx.save(destFile);

Listing 4
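The setters in Listing 4 hide two steps: updating (or creating) the element in the core properties part, and writing the XML back out. A minimal JDK-only sketch of those two steps follows. It works on an in-memory XML string rather than a real package, and the class name, helper names and sample XML are assumptions for illustration, not the framework's implementation.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class CorePropertiesWriter {
    static final String NS_CP =
        "http://schemas.openxmlformats.org/package/2006/metadata/core-properties";
    static final String NS_DC = "http://purl.org/dc/elements/1.1/";

    // Hypothetical core properties part that has a creator but no description
    static final String SAMPLE =
        "<cp:coreProperties xmlns:cp=\"" + NS_CP + "\" xmlns:dc=\"" + NS_DC + "\">"
        + "<dc:creator>Julien CHABLE</dc:creator>"
        + "</cp:coreProperties>";

    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    /** Sets a Dublin Core property, creating the element when it is missing. */
    static void setDcValue(Document doc, String localName, String value) {
        NodeList nodes = doc.getElementsByTagNameNS(NS_DC, localName);
        Element element;
        if (nodes.getLength() > 0) {
            element = (Element) nodes.item(0);
        } else {
            element = doc.createElementNS(NS_DC, "dc:" + localName);
            doc.getDocumentElement().appendChild(element);
        }
        element.setTextContent(value);
    }

    /** Serializes the DOM tree back to an XML string. */
    static String serialize(Document doc) throws Exception {
        StringWriter out = new StringWriter();
        TransformerFactory.newInstance().newTransformer()
            .transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(SAMPLE);
        setDcValue(doc, "creator", "OpenXMLDeveloper.org");
        setDcValue(doc, "description", "A new description");
        System.out.println(serialize(doc));
    }
}
```

In a real package the serialized XML would then have to replace the docProps/core.xml entry in a new ZIP file, which is the part that docx.save(destFile) takes care of.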

Extended properties

The small framework that accompanies this article doesn’t provide any class or method to access extended properties. As a result, in this sample we need to use the DOM API to extract information from the extended properties part:

...

// Open the package
Package p = Package.open(..., PackageAccess.Read);

// Get extended properties relationship
PackageRelationship extendedPropertiesRelationship = p
.getRelationshipsByType(
PackageRelationshipConstants.NS_EXTENDED_PROPERTIES)
.getRelationship(0);

// Get extended properties part from the previous relationship
PackagePart extPropsPart = p.getPart(extendedPropertiesRelationship);
System.out.println(extPropsPart.getUri() + " -> "
+ extPropsPart.getContentType());

// Extract content
try {
InputStream inStream = extPropsPart.getInputStream();

// Create DOM parser
DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory
.newInstance();
documentBuilderFactory.setNamespaceAware(true);
documentBuilderFactory.setIgnoringElementContentWhitespace(true);

DocumentBuilder documentBuilder;
documentBuilder = documentBuilderFactory.newDocumentBuilder();

// Parse XML content
Document extPropsDoc = documentBuilder.parse(inStream);

// Extract the name and the version of the Open XML file generator
System.out.println("Document generated with "
+ extPropsDoc.getElementsByTagName("Application").item(0)
.getTextContent()
+ " vers. "
+ extPropsDoc.getElementsByTagName("AppVersion").item(0)
.getTextContent());

// Extract statistics about this document
System.out.println("This document contains "
+ extPropsDoc.getElementsByTagName("Words").item(0)
.getTextContent()
+ " words and is composed of "
+ extPropsDoc.getElementsByTagName("Characters").item(0)
.getTextContent()
+ " characters and "
+ extPropsDoc.getElementsByTagName("Lines").item(0)
.getTextContent() + " lines");

inStream.close();
} catch (Exception ioe) {
System.err
.println("Failed to extract extended properties ! :(");
}

Listing 5

Output of Listing 5 :

docProps/app.xml -> application/vnd.openxmlformats-officedocument.extended-properties+xml

Document generated with Microsoft Office Word vers. 12.0000

This document contains 262 words and is composed of 1444 characters and 12 lines
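Listing 5 calls item(0).getTextContent() directly, which throws a NullPointerException when an element is absent (not every producer writes every statistic). A possible refinement, sketched below under that assumption, is a small helper that looks the element up by namespace and falls back to a default value. The class name, helper names and sample XML are illustrative only.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ExtendedPropertiesReader {
    // Namespace of the elements in the extended properties part
    static final String NS_EXT =
        "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties";

    // Hypothetical content of docProps/app.xml, reduced to a few elements
    static final String SAMPLE =
        "<Properties xmlns=\"" + NS_EXT + "\">"
        + "<Application>Microsoft Office Word</Application>"
        + "<Words>262</Words>"
        + "<Characters>1444</Characters>"
        + "<Lines>12</Lines>"
        + "<AppVersion>12.0000</AppVersion>"
        + "</Properties>";

    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    /** Returns the text of the named extended property, or the fallback if absent. */
    static String text(Document doc, String localName, String fallback) {
        NodeList nodes = doc.getElementsByTagNameNS(NS_EXT, localName);
        return nodes.getLength() == 0 ? fallback : nodes.item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(SAMPLE);
        System.out.println("Document generated with "
            + text(doc, "Application", "unknown")
            + " vers. " + text(doc, "AppVersion", "unknown"));
        // The sample has no Pages element, so the fallback is printed here
        System.out.println("Pages: " + text(doc, "Pages", "unknown"));
    }
}
```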

Thumbnail

Many Open XML documents, for example those saved by PowerPoint 2007, contain a thumbnail of the document. This part is identified by the following relationship type: http://schemas.openxmlformats.org/package/2006/relationships/metadata/thumbnail.

The following listing uses two methods, getThumbnails() and extractParts(), to extract the document's thumbnail and save it into the ‘export’ directory:

final String APP_ROOT = System.getProperty("user.dir") + File.separator;
ZipFile zipFile = null; // Source file
try {
zipFile = new ZipFile(APP_ROOT + "sample.pptx");
} catch (IOException e) {
...
}

// Destination folder
File destFile = new File(APP_ROOT + "export");

// Open the package
OpenXMLDocument docx = OpenXMLDocument.open(zipFile, PackageAccess.Read);

// Extract thumbnails
docx.extractParts(docx.getThumbnails(), destFile);

Listing 6

Here are the details of the getThumbnails() and extractParts() methods :

public final static String NS_THUMBNAIL_PART =
"http://schemas.openxmlformats.org/package/2006/relationships/metadata/thumbnail";
...

// Retrieve all thumbnails contained in the document.
public ArrayList<PackagePart> getThumbnails() {
return container.getPartByRelationshipType(
PackageRelationshipConstants.NS_THUMBNAIL_PART);
}

Listing 6-1 (class OpenXMLDocument)

/**
* Extract part content into the specified folder.
*
* @param parts
*            Parts to extract.
* @param destFolder
*            Destination folder.
*/

public void extractParts(ArrayList<PackagePart> parts, File destFolder) {
for (PackagePart part : parts) {
String filename = PackageURIHelper.getFilename(part.getUri());
try {
InputStream ins = part.getInputStream();
FileOutputStream fw = new FileOutputStream(destFolder
.getAbsolutePath()
+ File.separator + filename);
byte[] buff = new byte[512];
int read;
// Write only the bytes actually read; read() returns -1 at end of stream
while ((read = ins.read(buff)) != -1) {
fw.write(buff, 0, read);
}
fw.close();
ins.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}

Listing 6-2 (class OpenXMLDocument)
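The copy loop inside extractParts() is the step most likely to corrupt the extracted file: the number of bytes returned by read() has to be honored, otherwise the last block is padded with stale buffer content. The sketch below isolates that loop in a hypothetical helper (the class and method names are not part of the framework) so it can be checked on in-memory data.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

class StreamCopy {
    /** Copies everything from in to out and returns the number of bytes copied. */
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[512];
        long total = 0;
        int read;
        // read() returns how many bytes were placed in the buffer, or -1 at the end
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
            total += read;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // 1300 bytes is deliberately not a multiple of the 512-byte buffer
        byte[] data = new byte[1300];
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long copied = copy(new ByteArrayInputStream(data), out);
        System.out.println(copied + " bytes copied, " + out.size() + " bytes written");
    }
}
```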

Listing 6 result :

Word document basic creation

To keep this example simple, we create a document by modifying the content of a blank one; this is easier to follow, and to do within this article, than creating a document from scratch. To add paragraphs to a document, the ParagraphBuilder, Paragraph and Run classes are very useful:

// Creation of a paragraph builder
ParagraphBuilder paraBuilder = new ParagraphBuilder();
paraBuilder.setAlignment(ParagraphAlignment.CENTER);

Listing 7-1

Once the ParagraphBuilder is ready, you can create a new paragraph by using the newParagraph() method:

// We create the first paragraph
Paragraph par1 = paraBuilder.newParagraph();

Listing 7-2
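To give an idea of what these classes end up producing, the sketch below builds the WordprocessingML for one centered paragraph containing a single bold run, using only the JDK's DOM API. The exact markup emitted by ParagraphBuilder and Run is not shown in this article, so treat the structure as an assumption based on the WordprocessingML schema; the class and method names are illustrative.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class RunMarkupSketch {
    // WordprocessingML main namespace from the final Ecma schemas
    static final String W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    static Document newDocument() throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder().newDocument();
    }

    /** Builds <w:p> with centered alignment and one bold run holding the text. */
    static Element centeredBoldParagraph(Document doc, String text) {
        Element p = doc.createElementNS(W, "w:p");

        // Paragraph properties: <w:pPr><w:jc w:val="center"/></w:pPr>
        Element pPr = doc.createElementNS(W, "w:pPr");
        Element jc = doc.createElementNS(W, "w:jc");
        jc.setAttributeNS(W, "w:val", "center");
        pPr.appendChild(jc);
        p.appendChild(pPr);

        // Run: <w:r><w:rPr><w:b/></w:rPr><w:t>text</w:t></w:r>
        Element r = doc.createElementNS(W, "w:r");
        Element rPr = doc.createElementNS(W, "w:rPr");
        rPr.appendChild(doc.createElementNS(W, "w:b"));
        r.appendChild(rPr);
        Element t = doc.createElementNS(W, "w:t");
        t.setTextContent(text);
        r.appendChild(t);
        p.appendChild(r);

        return p;
    }

    public static void main(String[] args) throws Exception {
        Element p = centeredBoldParagraph(newDocument(), "Hello");
        System.out.println(p.getTextContent());
    }
}
```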

The following example creates two paragraphs, ‘Hello Office Open XML’ and ‘www.openxmldeveloper.org’, the second one in a large font size:

...

Package pack = Package.open(zipFile, PackageAccess.ReadWrite);
WordDocument docx = new WordDocument(pack);

// Creation of a paragraph builder
ParagraphBuilder paraBuilder = new ParagraphBuilder();
paraBuilder.setAlignment(ParagraphAlignment.CENTER);

// We create the first paragraph
Paragraph par1 = paraBuilder.newParagraph();

// Add runs to modify the style
Run r1 = new Run("Hello");
r1.setBold(true);

Run r2 = new Run(" Office");
r2.setItalic(true);

Run r3 = new Run(" Open");
r3.setUnderline(UnderlineStyle.SINGLE);

Run r4 = new Run(" XML");
r4.setVerticalAlignement(VerticalAlignment.SUPERSCRIPT);

// Add previous runs to the first paragraph
par1.addRun(r1);
par1.addRun(r2);
par1.addRun(r3);
par1.addRun(r4);

// Add the first paragraph in the document’s content
docx.appendParagraph(par1);

// Creation of a second paragraph
paraBuilder.setBold(true);

Paragraph par2 = paraBuilder.newParagraph();

Run r21 = new Run("www.openxmldeveloper.org");
r21.setFontSize(55);
par2.addRun(r21);

// Append the second paragraph to content
docx.appendParagraph(par2);

// Save the document
docx.save(destFile);

Listing 8

The result :

Convert a Word document to HTML

The Open XML format is based in part on XML, so conversion to HTML is quite simple, at least for basic documents, thanks to XSLT.

In this example, we’ll use the straightforward document generated by the previous listing with the following XSLT stylesheet :

<?xml version="1.0" encoding="utf-8"?>

<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">

<xsl:output method="html" />

<!-- Document root -->
<xsl:template match="/w:document">
<xsl:apply-templates select="w:body" />
</xsl:template>


<!-- Body and paragraphs -->
<xsl:template match="w:body">
<html>
<body>
<xsl:for-each select="w:p">
<p>
<xsl:apply-templates select="w:pPr" />
<xsl:apply-templates select="w:r" />
</p>
</xsl:for-each>
</body>
</html>
</xsl:template>

<!-- Paragraph properties -->
<xsl:template match="w:pPr">
<xsl:attribute name="style">
<xsl:apply-templates />
</xsl:attribute>
</xsl:template>

<!-- Text alignment -->
<xsl:template match="w:jc">
text-align:
<xsl:value-of select="@w:val" />
</xsl:template>

<!-- Run -->
<xsl:template match="w:r">
<span>
<xsl:apply-templates select="w:rPr" />
<xsl:value-of select="w:t" />
</span>
</xsl:template>

<!-- Run properties -->
<xsl:template match="w:rPr">
<xsl:attribute name="style">
<xsl:apply-templates />
</xsl:attribute>
</xsl:template>

<!-- Font size (w:sz is expressed in half-points) -->
<xsl:template match="w:sz">
font-size:<xsl:value-of select="@w:val div 2" />pt;
</xsl:template>

<!-- Vertical alignment -->
<xsl:template match="w:vertAlign">
<xsl:variable name="jcVal" select="@w:val" />
<xsl:if test="$jcVal = 'superscript'">
font-size:33%;position:relative;bottom:0.5em;
</xsl:if>
<xsl:if test="$jcVal = 'subscript'">
font-size:33%;position:relative;bottom:-0.5em;
</xsl:if>
</xsl:template>

<!-- Bold -->
<xsl:template match="w:b">font-weight:bold;</xsl:template>

<!-- Italic -->
<xsl:template match="w:i">font-style:italic;</xsl:template>

<!-- Underline -->
<xsl:template match="w:u">text-decoration:underline;</xsl:template>

</xsl:stylesheet>


We’ll use the class WordToHTMLTransformer and the associated method transform() to convert our OpenXML document to HTML :

WordDocument docx = new WordDocument(...);
...
WordToHTMLTransformer wt = new WordToHTMLTransformer();
InputStream transformStream = wt.transform(docx);

Listing 9-1
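The source of WordToHTMLTransformer is not reproduced here, but a transformation of this kind can be run with the JDK's javax.xml.transform API alone. The sketch below applies a deliberately reduced stylesheet (paragraphs, runs and bold only) to a tiny hand-written document part. The class name, the inline stylesheet and the sample XML are assumptions for illustration, not the article's implementation.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XsltSketch {
    static final String W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Reduced stylesheet: one <p> per w:p, one <span> per w:r, bold only
    static final String STYLESHEET =
        "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\""
        + " xmlns:w=\"" + W + "\" exclude-result-prefixes=\"w\">"
        + "<xsl:output method=\"html\"/>"
        + "<xsl:template match=\"/w:document\">"
        + "<html><body><xsl:apply-templates select=\"w:body/w:p\"/></body></html>"
        + "</xsl:template>"
        + "<xsl:template match=\"w:p\">"
        + "<p><xsl:apply-templates select=\"w:r\"/></p>"
        + "</xsl:template>"
        + "<xsl:template match=\"w:r\"><span>"
        + "<xsl:if test=\"w:rPr/w:b\">"
        + "<xsl:attribute name=\"style\">font-weight:bold;</xsl:attribute>"
        + "</xsl:if>"
        + "<xsl:value-of select=\"w:t\"/>"
        + "</span></xsl:template>"
        + "</xsl:stylesheet>";

    // Hypothetical main document part with a single bold run
    static final String SAMPLE =
        "<w:document xmlns:w=\"" + W + "\"><w:body><w:p>"
        + "<w:r><w:rPr><w:b/></w:rPr><w:t>Hello</w:t></w:r>"
        + "</w:p></w:body></w:document>";

    /** Applies the stylesheet to the XML and returns the result as a string. */
    static String transform(String xslt, String xml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)),
            new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(transform(STYLESHEET, SAMPLE));
    }
}
```

With a real document, the XML source would be the input stream of the main document part (word/document.xml) and the stylesheet would be loaded from a file instead of a string.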

The complete example :

final String APP_ROOT = System.getProperty("user.dir") + File.separator;
ZipFile zipFile = null; // Source file

try {
zipFile = new ZipFile(APP_ROOT + "sample_out.docx");
} catch (IOException e) {
e.printStackTrace();
}

// Output file destination
File destFile = new File(APP_ROOT + "output.html");

Package pack = Package.open(zipFile, PackageAccess.ReadWrite);
WordDocument docx = new WordDocument(pack);

WordToHTMLTransformer wt = new WordToHTMLTransformer();
try {
InputStream transformStream = wt.transform(docx);
BufferedWriter outStream = new BufferedWriter(
new OutputStreamWriter(new FileOutputStream(destFile)));

BufferedReader br = new BufferedReader(new InputStreamReader(
transformStream));

String buff;
while ((buff = br.readLine()) != null) {
outStream.write(buff);
// readLine() strips the line terminator, so write it back
outStream.newLine();
}
outStream.close();

br.close();
} catch (Exception e) {
e.printStackTrace();
}

Listing 10

The HTML file generated by Listing 10 in Internet Explorer 7 :

Author

Julien Chable, a student at EFREI in France and a Microsoft Student Partner, writes articles about Java and .NET for several magazines and websites. He can be contacted via his website http://julien.chable.net or his blog http://blogs.developpeur.org/neodante/

Attachment: sources.zip