Conversion of OpenXML (WordProcessingML) to HTML using XSLT : Part 1
This article is a first step toward converting the Open XML format of Microsoft Word to HTML. It explains how to convert WordprocessingML to HTML tags using XSLT, so that the contents of a Word document can be viewed in a browser. The XSLT walks the XML content and converts it into HTML, preserving font properties such as bold, italic and underline.
XSLT Process:
The Process of Conversion:
Create a Word document using Microsoft Office Word 2007 (xyz.docx).
Extract the “document.xml” file from the created document. A .docx file is a ZIP package, and “document.xml” is stored in its “word” folder.
Use the XSLT shown in the next segment of this article (the XSL file is included in the sample download) to convert “document.xml” to HTML.
The C# code snippet shown in the final segment of this article performs the conversion programmatically.
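Step 2 above can also be done in code rather than by hand. The following is a minimal sketch, assuming .NET Framework 4.5 or later (for System.IO.Compression.ZipFile, which is newer than Word 2007) and assuming the main document part sits at word/document.xml, which is where Word stores it by default. The names DocxExtractor and ExtractDocumentXml are hypothetical and are not part of the original sample.

```csharp
using System;
using System.IO;
using System.IO.Compression;

public static class DocxExtractor
{
    // Copies word/document.xml out of a .docx package into outputFolder
    // and returns the path of the extracted file.
    public static string ExtractDocumentXml(string docxPath, string outputFolder)
    {
        Directory.CreateDirectory(outputFolder);
        string targetPath = Path.Combine(outputFolder, "document.xml");

        using (ZipArchive package = ZipFile.OpenRead(docxPath))
        {
            // Assumption: the main document part uses Word's default location.
            ZipArchiveEntry entry = package.GetEntry("word/document.xml");
            if (entry == null)
            {
                throw new FileNotFoundException(
                    "word/document.xml was not found in the package.", docxPath);
            }

            // The second argument overwrites a copy left by an earlier run.
            entry.ExtractToFile(targetPath, true);
        }

        return targetPath;
    }
}
```

The returned path can then be used as the XML input of the transformation step, in place of a manually extracted file.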
The XSLT used for Conversion
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/3/main">
<xsl:output method="html" />
<xsl:template match="/">
<xsl:apply-templates select="//w:body" />
</xsl:template>
<xsl:template match="w:body">
<html>
<head />
<body>
<pre>
<xsl:apply-templates />
</pre>
</body>
</html>
</xsl:template>
<xsl:template match="w:p">
<div>
<xsl:apply-templates select="w:r" />
</div>
</xsl:template>
<xsl:template match="w:r">
<xsl:apply-templates select="w:t" />
</xsl:template>
<xsl:template match="w:t">
<span>
<xsl:apply-templates select="../w:rPr" />
<xsl:value-of select="." />
</span>
</xsl:template>
<xsl:template match="w:rPr">
<xsl:attribute name="style">
<xsl:apply-templates />
</xsl:attribute>
</xsl:template>
<xsl:template match="w:u">text-decoration:underline;</xsl:template>
<xsl:template match="w:b">font-weight:bold;</xsl:template>
<xsl:template match="w:i">font-style:italic;</xsl:template>
</xsl:stylesheet>
The Sample WordProcessingML file taken for Conversion (document.xml) :
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:ve="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:o12="http://schemas.microsoft.com/office/2004/7/core" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.microsoft.com/office/omml/2004/12/core" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/3/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/3/main">
<w:body>
<w:p>
<w:r w:rsidR="00AD4F4A">
<w:t xml:space="preserve">This is </w:t>
</w:r>
<w:r w:rsidR="00A35A66">
<w:t>simple</w:t>
</w:r>
<w:r>
<w:t xml:space="preserve"> text. It preserves both spaces and line </w:t>
</w:r>
<w:r w:rsidR="00AD4F4A" w:rsidRPr="007949A6">
<w:rPr><w:b/></w:rPr>
<w:t>breaks</w:t>
</w:r>
<w:r w:rsidR="007949A6" w:rsidRPr="007949A6">
<w:rPr><w:b/></w:rPr>
<w:t xml:space="preserve"> in </w:t>
</w:r>
<w:proofErr w:type="spellStart"/>
<w:r>
<w:rPr><w:b/></w:rPr>
<w:t>bold</w:t>
</w:r>
<w:r>
<w:t>.This</w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
<w:r>
<w:t xml:space="preserve"> is preformatted text. It preserves both spaces and line </w:t>
</w:r>
<w:r w:rsidR="00AD4F4A" w:rsidRPr="0054521E">
<w:rPr><w:i/></w:rPr>
<w:t>breaks</w:t>
</w:r>
<w:r>
<w:rPr><w:i/></w:rPr>
<w:t xml:space="preserve"> in </w:t>
</w:r>
<w:r w:rsidR="007949A6">
<w:rPr><w:i/></w:rPr>
<w:t>italics</w:t>
</w:r>
<w:r>
<w:t xml:space="preserve"> is </w:t>
</w:r>
<w:r>
<w:rPr><w:b/></w:rPr>
<w:t xml:space="preserve">simple preformatted </w:t>
</w:r>
<w:r>
<w:t xml:space="preserve"> text. It preserves both spaces and line breaks.</w:t>
</w:r>
</w:p>
<w:p/>
<w:p>
<w:pPr>
</w:pPr>
<w:r>
<w:rPr><w:b/></w:rPr>
<w:t xml:space="preserve">New </w:t>
</w:r>
<w:r w:rsidR="0054521E" w:rsidRPr="0054521E">
<w:rPr><w:b/></w:rPr>
<w:t>Heading</w:t>
</w:r>
<w:r>
<w:rPr><w:b/></w:rPr>
<w:t>:</w:t>
</w:r>
<w:r>
<w:rPr><w:u w:val="single"/></w:rPr>
<w:t>underlined</w:t>
</w:r>
<w:r>
<w:t>This is preformatted text. It preserves both spaces and line breaks.</w:t>
</w:r>
</w:p>
<w:sectPr w:rsidR="00510025" w:rsidSect="00320151">
<w:pgSz w:w="12240" w:h="15840"/>
<w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="720" w:footer="720" w:gutter="0"/>
<w:cols w:space="720"/>
<w:docGrid w:linePitch="360"/>
</w:sectPr>
</w:body>
</w:document>
Code Snippet
The following C# code snippet transforms “document.xml” to HTML using the XSLT.
It also launches the browser (Internet Explorer) to display the converted HTML file.
// Namespaces needed: System, System.IO, System.Xml, System.Xml.Xsl and System.Windows.Forms
try
{
    // strXMLFilePath holds the absolute path of the “document.xml” file
    // extracted from the Word document (*.docx)
    string strXMLFilePath = @"C:\Hello.docx\word\document.xml";

    // Save the XSLT shown in this article to a file called “OpenXml_XSLT.xsl”;
    // strXSLFilePath holds the absolute path of that file
    string strXSLFilePath = @"C:\FolderXSLT\OpenXml_XSLT.xsl";

    // strHtmlFilePath holds the name and path of the HTML file to be generated.
    // In this case it is saved in the folder that holds document.xml
    string strHtmlFilePath = Path.Combine(Path.GetDirectoryName(strXMLFilePath), "xyz.html");

    // Load document.xml into an XmlDocument object; Load throws if the file is not well-formed XML
    XmlDocument objXmlDom = new XmlDocument();
    objXmlDom.Load(strXMLFilePath);

    // Load the XSL file (OpenXml_XSLT.xsl) into the XslCompiledTransform object
    XslCompiledTransform objXSLT = new XslCompiledTransform();
    objXSLT.Load(strXSLFilePath);

    // The Transform method converts the XML to HTML and stores the HTML file
    // in the path specified in the variable strHtmlFilePath
    objXSLT.Transform(strXMLFilePath, strHtmlFilePath);

    // The Process.Start method launches the browser (Internet Explorer),
    // which displays the converted HTML file
    System.Diagnostics.Process.Start("IExplore.exe", strHtmlFilePath);
}
catch (Exception ex)
{
    MessageBox.Show(ex.Message);
}
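When experimenting with the stylesheet, it can be convenient to run the same transformation on strings instead of files, for example to try a single paragraph. The following is a minimal sketch under that assumption; XsltRunner and its Transform method are hypothetical names, while XslCompiledTransform, XmlReader and StringWriter are the standard .NET classes.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class XsltRunner
{
    // Applies the stylesheet text to the XML text and returns the result as a string.
    public static string Transform(string xsltText, string xmlText)
    {
        XslCompiledTransform transform = new XslCompiledTransform();
        using (XmlReader xsltReader = XmlReader.Create(new StringReader(xsltText)))
        {
            transform.Load(xsltReader);
        }

        using (XmlReader xmlReader = XmlReader.Create(new StringReader(xmlText)))
        using (StringWriter output = new StringWriter())
        {
            // Writing to a TextWriter lets the transform honor the stylesheet's
            // xsl:output settings (method="html" in this article).
            transform.Transform(xmlReader, null, output);
            return output.ToString();
        }
    }
}
```

A call such as XsltRunner.Transform(File.ReadAllText(strXSLFilePath), File.ReadAllText(strXMLFilePath)) returns the HTML as a string, which can then be inspected or written to disk as needed.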
The Output HTML (after converting the XML file above using the XSLT)
<html xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/3/main">
<head>
<META http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<pre>
<div>
<span>This is </span>
<span>simple</span>
<span> text. It preserves both spaces and line </span>
<span style="font-weight:bold;">breaks</span>
<span style="font-weight:bold;"> in </span>
<span style="font-weight:bold;">bold</span>
<span>.This</span>
<span> is preformatted text. It preserves both spaces and line </span>
<span style="font-style:italic;">breaks</span>
<span style="font-style:italic;"> in </span>
<span style="font-style:italic;">italics</span>
<span> is </span>
<span style="font-weight:bold;">simple preformatted </span>
<span> text. It preserves both spaces and line breaks.</span>
</div>
<div></div>
<div>
<span style="font-weight:bold;">New </span>
<span style="font-weight:bold;">Heading</span>
<span style="font-weight:bold;">:</span>
<span style="text-decoration:underline;">underlined</span>
<span>This is preformatted text. It preserves both spaces and line breaks.</span>
</div>
</pre>
</body>
</html>
Browser View of the HTML:
I hope this helps you get started with WordprocessingML to HTML conversion.