Conversion of OpenXML (WordProcessingML) to HTML using XSLT : Part 1 - OpenXML Developer - Blog - OpenXML Developer


This article is a first step toward converting Microsoft Word's OpenXML format to HTML. It explains how to convert WordprocessingML to HTML tags using XSLT, so that the contents of a Word document can be viewed in a browser. The XSLT walks the XML content and converts it to HTML, preserving font properties such as bold, italic, and underline.

 

XSLT Process:  

The Process of Conversion:

 Create a Word document using Microsoft Office Word 2007 (xyz.docx).

 Extract the “document.xml” file from the created document.

Use the XSLT shown in the next segment of this article (download the samples for the XSL file) to convert “document.xml” to HTML.

The C# code snippet shown in the final segment of this article performs the conversion programmatically.
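
The extraction step can also be done in code, because a .docx file is a ZIP package. The following is a minimal sketch, assuming .NET Framework 4.5 or later (for System.IO.Compression.ZipFile) and assuming the main document part sits at word/document.xml, which is where Word places it. DocxExtractor and ExtractMainDocument are illustrative names, not part of the article's samples.

```csharp
using System;
using System.IO;
using System.IO.Compression; // On .NET Framework, reference System.IO.Compression and System.IO.Compression.FileSystem

static class DocxExtractor
{
    // Hypothetical helper: copies word/document.xml out of a .docx package to outputPath.
    public static void ExtractMainDocument(string docxPath, string outputPath)
    {
        using (ZipArchive archive = ZipFile.OpenRead(docxPath))
        {
            // Assumption: the main part is stored under this name inside the package.
            ZipArchiveEntry entry = archive.GetEntry("word/document.xml");
            if (entry == null)
            {
                throw new FileNotFoundException("word/document.xml not found in package", docxPath);
            }

            // Overwrite the target file if it already exists.
            entry.ExtractToFile(outputPath, true);
        }
    }
}
```

A call such as DocxExtractor.ExtractMainDocument(docxPath, xmlPath) would then produce the file that the XSLT takes as input.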

 

 

The XSLT used for Conversion

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/3/main">

  <xsl:output method="html" />

  <xsl:template match="/">

    <xsl:apply-templates select="//w:body" />

  </xsl:template>

  <xsl:template match="w:body">

    <html>

      <head />

      <body>

        <pre>

          <xsl:apply-templates />

        </pre>

      </body>

    </html>

  </xsl:template>

  <xsl:template match="w:p">

    <div>

      <xsl:apply-templates select="w:r" />

    </div>

  </xsl:template>

  <xsl:template match="w:r">

    <xsl:apply-templates select="w:t" />

  </xsl:template>

  <xsl:template match="w:t">

    <span>

      <xsl:apply-templates select="../w:rPr" />

      <xsl:value-of select="." />

    </span>

  </xsl:template>

  <xsl:template match="w:rPr">

    <xsl:attribute name="style">

      <xsl:apply-templates />

    </xsl:attribute>

  </xsl:template>

  <xsl:template match="w:u">text-decoration:underline;</xsl:template>

  <xsl:template match="w:b">font-weight:bold;</xsl:template>

  <xsl:template match="w:i">font-style:italic;</xsl:template>

</xsl:stylesheet>

 

 

The sample WordprocessingML file used for the conversion (document.xml):

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

<w:document xmlns:ve="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:o12="http://schemas.microsoft.com/office/2004/7/core" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.microsoft.com/office/omml/2004/12/core" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/3/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/3/main">

  <w:body>

    <w:p>

      <w:r w:rsidR="00AD4F4A">

        <w:t xml:space="preserve">This is </w:t>

      </w:r>

      <w:r w:rsidR="00A35A66">

        <w:t>simple</w:t>

      </w:r>

      <w:r w:rsidR="00AD4F4A">

        <w:t xml:space="preserve"> text. It preserves      both spaces and line </w:t>

      </w:r>

      <w:r w:rsidR="00AD4F4A" w:rsidRPr="007949A6">

        <w:rPr>

          <w:b/>

        </w:rPr>

        <w:t>breaks</w:t>

      </w:r>

      <w:r w:rsidR="007949A6" w:rsidRPr="007949A6">

        <w:rPr>

          <w:b/>

        </w:rPr>

        <w:t xml:space="preserve"> in </w:t>

      </w:r>

      <w:proofErr w:type="spellStart"/>

      <w:r w:rsidR="007949A6" w:rsidRPr="007949A6">

        <w:rPr>

          <w:b/>

        </w:rPr>

        <w:t>bold</w:t>

      </w:r>

      <w:r w:rsidR="00AD4F4A">

        <w:t>.This</w:t>

      </w:r>

      <w:proofErr w:type="spellEnd"/>

      <w:r w:rsidR="00AD4F4A">

        <w:t xml:space="preserve"> is preformatted text. It preserves      both spaces and line </w:t>

      </w:r>

      <w:r w:rsidR="00AD4F4A" w:rsidRPr="0054521E">

        <w:rPr>

          <w:i/>

        </w:rPr>

        <w:t>breaks</w:t>

      </w:r>

      <w:r w:rsidR="007949A6">

        <w:rPr>

          <w:i/>

        </w:rPr>

        <w:t xml:space="preserve"> in </w:t>

      </w:r>

      <w:proofErr w:type="spellStart"/>

      <w:r w:rsidR="007949A6">

        <w:rPr>

          <w:i/>

        </w:rPr>

        <w:t>italics</w:t>

      </w:r>

      <w:r w:rsidR="00AD4F4A">

        <w:t>.This</w:t>

      </w:r>

      <w:proofErr w:type="spellEnd"/>

      <w:r w:rsidR="00AD4F4A">

        <w:t xml:space="preserve"> is </w:t>

      </w:r>

      <w:r w:rsidR="00A35A66">

        <w:rPr>

          <w:b/>

        </w:rPr>

        <w:t xml:space="preserve">simple        preformatted          </w:t>

      </w:r>

      <w:r w:rsidR="00AD4F4A">

        <w:t xml:space="preserve"> text. It preserves      both spaces and line breaks.</w:t>

      </w:r>

    </w:p>

    <w:p/>

    <w:p>

      <w:pPr>

        <w:rPr>

          <w:b/>

        </w:rPr>

      </w:pPr>

      <w:r w:rsidR="00AD4F4A" w:rsidRPr="0054521E">

        <w:rPr>

          <w:b/>

        </w:rPr>

        <w:t xml:space="preserve">New </w:t>

      </w:r>

      <w:r w:rsidR="0054521E" w:rsidRPr="0054521E">

        <w:rPr>

          <w:b/>

        </w:rPr>

        <w:t>Heading</w:t>

      </w:r>

      <w:r w:rsidR="00AD4F4A" w:rsidRPr="0054521E">

        <w:rPr>

          <w:b/>

        </w:rPr>

        <w:t>:</w:t>

      </w:r>

    </w:p>

    <w:p>

      <w:r w:rsidR="00AD4F4A">

        <w:t xml:space="preserve">This is </w:t>

      </w:r>

      <w:r w:rsidR="007949A6">

        <w:rPr>

          <w:u w:val="single"/>

        </w:rPr>

        <w:t>underlined</w:t>

      </w:r>

      <w:r w:rsidR="00AD4F4A">

        <w:t xml:space="preserve"> text. It preserves      both spaces and line breaks.</w:t>

      </w:r>

    </w:p>

    <w:p>

      <w:r w:rsidR="00AD4F4A">

        <w:t>This is preformatted text. It preserves      both spaces and line breaks.</w:t>

      </w:r>

    </w:p>

    <w:p/>

    <w:sectPr w:rsidR="00510025" w:rsidSect="00320151">

      <w:pgSz w:w="12240" w:h="15840"/>

      <w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="720" w:footer="720" w:gutter="0"/>

      <w:cols w:space="720"/>

      <w:docGrid w:linePitch="360"/>

    </w:sectPr>

  </w:body>

</w:document>
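
One thing worth checking before running the transform: the stylesheet only matches elements whose namespace URI is identical to the one bound to its w prefix. The sample above declares http://schemas.openxmlformats.org/wordprocessingml/2006/3/main, while documents saved by the released version of Word 2007 typically declare http://schemas.openxmlformats.org/wordprocessingml/2006/main. If the two differ, the xmlns:w value in the stylesheet has to be changed to match. The sketch below reads the namespace of the root element so it can be compared by hand; NamespaceCheck and GetRootNamespace are illustrative names only.

```csharp
using System;
using System.Xml;

static class NamespaceCheck
{
    // Hypothetical helper: returns the namespace URI of the root element of an XML file.
    public static string GetRootNamespace(string xmlPath)
    {
        using (XmlReader reader = XmlReader.Create(xmlPath))
        {
            // Skip the XML declaration, comments, and whitespace up to the root element.
            reader.MoveToContent();
            return reader.NamespaceURI;
        }
    }
}
```

If the returned value is not the URI used in the stylesheet, the templates will silently match nothing and the output will contain no paragraphs.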

 

Code Snippet

The following C# code snippet transforms “document.xml” to HTML using the XSLT.

It also launches the browser (IE) to display the converted HTML file.

// Namespaces used by this snippet (place at the top of the source file).
using System;
using System.Diagnostics;
using System.IO;
using System.Windows.Forms;
using System.Xml;
using System.Xml.Xsl;

// The code below belongs inside a method, for example a button click handler.
try
{
    // strXMLFilePath holds the absolute path of the "document.xml" file
    // extracted from the Word document (*.docx).
    string strXMLFilePath = @"C:\Hello.docx\word\document.xml";

    // Save the XSLT shown in this article to a file called "OpenXml_XSLT.xsl".
    // strXSLFilePath holds the absolute path of that file.
    string strXSLFilePath = @"C:\FolderXSLT\OpenXml_XSLT.xsl";

    // strHtmlFilePath holds the path and file name of the HTML file to be generated.
    // Here it is saved in the folder that holds document.xml.
    string strHtmlFilePath = Path.Combine(Path.GetDirectoryName(strXMLFilePath), "xyz.html");

    // Load document.xml into an XmlDocument object. This fails early if the file
    // is missing or not well-formed; Transform below reads the file by path.
    XmlDocument objXmlDom = new XmlDocument();
    objXmlDom.Load(strXMLFilePath);

    // Load the XSL file (OpenXml_XSLT.xsl) into the XslCompiledTransform object.
    XslCompiledTransform objXSLT = new XslCompiledTransform();
    objXSLT.Load(strXSLFilePath);

    // Transform converts the XML to HTML and writes the HTML file
    // to the path held in strHtmlFilePath.
    objXSLT.Transform(strXMLFilePath, strHtmlFilePath);

    // Process.Start launches the browser (IE), which displays the converted HTML file.
    Process.Start("IExplore.exe", strHtmlFilePath);
}
catch (Exception ex)
{
    MessageBox.Show(ex.Message);
}
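
The snippet above writes the HTML to a file. When the HTML is needed as a string instead (for example, to hand it to a web page or a control), the same XslCompiledTransform class can write to a TextWriter. This is a minimal sketch under that assumption; InMemoryTransform and TransformToString are illustrative names, not part of the article's samples.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Xsl;

static class InMemoryTransform
{
    // Hypothetical helper: applies the stylesheet at xslPath to the XML at xmlPath
    // and returns the transformation result as a string.
    public static string TransformToString(string xmlPath, string xslPath)
    {
        XslCompiledTransform xslt = new XslCompiledTransform();
        xslt.Load(xslPath);

        using (StringWriter writer = new StringWriter())
        {
            // Passing null for the XsltArgumentList means no stylesheet parameters are supplied.
            xslt.Transform(xmlPath, null, writer);
            return writer.ToString();
        }
    }
}
```

Nothing is written to disk here, so the Process.Start call from the snippet above does not apply; the caller decides what to do with the returned markup.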

  

The output HTML (after converting the XML file above using the XSLT)

 <html xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/3/main">

<head>

  <META http-equiv="Content-Type" content="text/html; charset=utf-8">

  </head>

<body>

  <pre>

    <div>

      <span>This is </span>

      <span>simple</span>

      <span> text. It preserves      both spaces and line </span>

      <span style="font-weight:bold;">breaks</span>

      <span style="font-weight:bold;"> in </span>

      <span style="font-weight:bold;">bold</span>

      <span>.This</span>

      <span> is preformatted text. It preserves      both spaces and line </span>

      <span style="font-style:italic;">breaks</span>

      <span style="font-style:italic;"> in </span>

      <span style="font-style:italic;">italics</span>

      <span>.This</span>

      <span> is </span>

      <span style="font-weight:bold;">simple        preformatted          </span>

      <span> text. It preserves      both spaces and line breaks.</span>

    </div>

    <div></div>

    <div>

      <span style="font-weight:bold;">New </span>

      <span style="font-weight:bold;">Heading</span>

      <span style="font-weight:bold;">:</span>

    </div>

    <div>

      <span>This is </span>

      <span style="text-decoration:underline;">underlined</span>

      <span> text. It preserves      both spaces and line breaks.</span>

    </div>

    <div>

      <span>This is preformatted text. It preserves      both spaces and line breaks.</span>

    </div>

    <div></div>

  </pre>

</body>

</html>

 

Browser View of the HTML:

 

 

I hope this helps you get started with WordprocessingML to HTML conversion.

  • Hi,

    Why not use DocX2Html.xsl, which ships with SPS 2007? It's more complete than your starting point. The only problems left are showing pictures, frames, header text, etc.

    I hope this helps.

    AJ
  • Where can we find DocX2Html.xsl?
  • @kamran: try the WordML2HTML XSLT stylesheet from http://www.xmllab.net/Downloads/tabid/61/Default.aspx
  • Hi,

    Great article. How do you obtain the XSLT used for Conversion?

    Thanks.
  • Hello,

    Do you guys know of an article that does the exact opposite of this one?

    I have generated HTML code that I want to convert into a Word file, but I need an XSLT for converting HTML into WordML, ideally one that can handle the formatting in the HTML code.

    Consider that formats can be mixed in HTML, like this:

    <P align="justify">
      <FONT face="arial">
          font face<BR/>
          <FONT size="7">
            <U>font</U> <EM>size</EM><BR/>
               <FONT style="color:#123456">
                <B>font</B> <STRIKE>color</STRIKE>
               </FONT>
          </FONT>
      </FONT>
    </P>

    It can also get as complicated as tables, images, bullets, etc.

    Do you guys know an XSLT that can handle this?

    Hope somebody can help. Thanks and God bless.
