Conversion of OpenXML (WordprocessingML) to HTML using XSLT: Part 1


This article is a first step toward converting the Open XML format of Microsoft Word to HTML. It explains how to convert WordprocessingML to HTML tags using XSLT, so that the contents of a Word document can be viewed in a browser. The XSLT walks the XML content and converts it into HTML, preserving font properties such as “bold”, “italic”, and “underline”.

 

XSLT Process:  

The Process of Conversion:

1. Create a Word document using Microsoft Office Word 2007 (xyz.docx).

2. Extract the “document.xml” file from the created document. A .docx file is a ZIP package, and “document.xml” is stored in its “word” folder.

3. Use the XSLT shown in the next section of this article (download the samples for the XSL file) to convert “document.xml” to HTML.

4. The C# code snippet shown later in this article performs this conversion programmatically.
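The extraction step can also be scripted rather than done by hand. The following is a minimal sketch in Python (the article's own code is C#), assuming the main document part sits at word/document.xml, which is where Word places it. The function name extract_document_xml and the output folder are illustrative, not part of the original samples.

```python
import zipfile
from pathlib import Path


def extract_document_xml(docx_path, out_dir):
    """Copy word/document.xml out of a .docx package into out_dir.

    Hypothetical helper for illustration; assumes the main part is
    stored at word/document.xml inside the package.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / "document.xml"
    # A .docx file is a ZIP archive, so the standard zipfile module can read it.
    with zipfile.ZipFile(docx_path) as package:
        target.write_bytes(package.read("word/document.xml"))
    return target
```

Calling extract_document_xml("xyz.docx", "extracted") would leave extracted/document.xml ready to be fed to the stylesheet.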

 

 

The XSLT used for Conversion

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/3/main">
  <!-- The w namespace URI must match the one declared in document.xml.
       It matches the sample document in this article. Documents saved by released
       versions of Word declare http://schemas.openxmlformats.org/wordprocessingml/2006/main
       instead, so change xmlns:w above accordingly for those files. -->
  <xsl:output method="html" />
  <xsl:template match="/">
    <xsl:apply-templates select="//w:body" />
  </xsl:template>
  <xsl:template match="w:body">
    <html>
      <head />
      <body>
        <pre>
          <xsl:apply-templates />
        </pre>
      </body>
    </html>
  </xsl:template>
  <xsl:template match="w:p">
    <div>
      <xsl:apply-templates select="w:r" />
    </div>
  </xsl:template>
  <xsl:template match="w:r">
    <xsl:apply-templates select="w:t" />
  </xsl:template>
  <xsl:template match="w:t">
    <span>
      <xsl:apply-templates select="../w:rPr" />
      <xsl:value-of select="." />
    </span>
  </xsl:template>
  <xsl:template match="w:rPr">
    <xsl:attribute name="style">
      <xsl:apply-templates />
    </xsl:attribute>
  </xsl:template>
  <xsl:template match="w:u">text-decoration:underline;</xsl:template>
  <xsl:template match="w:b">font-weight:bold;</xsl:template>
  <xsl:template match="w:i">font-style:italic;</xsl:template>
</xsl:stylesheet>
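To make the template logic easier to follow, here is a rough Python sketch that mirrors what the w:p, w:r, w:t and w:rPr templates do for a single paragraph. It is a hand-written tree walk using the standard library, not an XSLT processor, and the function names run_to_spans and paragraph_to_div are made up for illustration. The namespace URI is copied from the stylesheet above.

```python
import xml.etree.ElementTree as ET
from html import escape

# Namespace URI copied from the stylesheet above; it must match the document.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/3/main"
NS = {"w": W}

# The same three mappings as the w:u, w:b and w:i templates.
CSS = {
    f"{{{W}}}u": "text-decoration:underline;",
    f"{{{W}}}b": "font-weight:bold;",
    f"{{{W}}}i": "font-style:italic;",
}


def run_to_spans(run):
    """Mirror the w:r, w:t and w:rPr templates for one run (illustrative)."""
    rpr = run.find("w:rPr", NS)
    spans = []
    for t in run.findall("w:t", NS):
        if rpr is None:
            attr = ""
        else:
            # Each known property contributes one CSS declaration;
            # unknown properties contribute nothing.
            style = "".join(CSS.get(child.tag, "") for child in rpr)
            attr = f' style="{style}"'
        spans.append(f"<span{attr}>{escape(t.text or '', quote=False)}</span>")
    return "".join(spans)


def paragraph_to_div(p):
    """Mirror the w:p template: one div per paragraph, direct child runs only."""
    return "<div>" + "".join(run_to_spans(r) for r in p.findall("w:r", NS)) + "</div>"


sample = (
    f'<w:p xmlns:w="{W}">'
    "<w:r><w:rPr><w:b/></w:rPr><w:t>bold</w:t></w:r>"
    "<w:r><w:t> text</w:t></w:r>"
    "</w:p>"
)
print(paragraph_to_div(ET.fromstring(sample)))
# prints <div><span style="font-weight:bold;">bold</span><span> text</span></div>
```

Like the stylesheet, the sketch only looks at the presence of w:b, w:i and w:u and ignores their attributes, so any value of w:val is treated the same way.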

 

 

The Sample WordprocessingML File Used for Conversion (document.xml):

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:ve="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:o12="http://schemas.microsoft.com/office/2004/7/core" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.microsoft.com/office/omml/2004/12/core" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/3/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/3/main">
  <w:body>
    <w:p>
      <w:r w:rsidR="00AD4F4A">
        <w:t xml:space="preserve">This is </w:t>
      </w:r>
      <w:r w:rsidR="00A35A66">
        <w:t>simple</w:t>
      </w:r>
      <w:r w:rsidR="00AD4F4A">
        <w:t xml:space="preserve"> text. It preserves      both spaces and line </w:t>
      </w:r>
      <w:r w:rsidR="00AD4F4A" w:rsidRPr="007949A6">
        <w:rPr>
          <w:b/>
        </w:rPr>
        <w:t>breaks</w:t>
      </w:r>
      <w:r w:rsidR="007949A6" w:rsidRPr="007949A6">
        <w:rPr>
          <w:b/>
        </w:rPr>
        <w:t xml:space="preserve"> in </w:t>
      </w:r>
      <w:proofErr w:type="spellStart"/>
      <w:r w:rsidR="007949A6" w:rsidRPr="007949A6">
        <w:rPr>
          <w:b/>
        </w:rPr>
        <w:t>bold</w:t>
      </w:r>
      <w:r w:rsidR="00AD4F4A">
        <w:t>.This</w:t>
      </w:r>
      <w:proofErr w:type="spellEnd"/>
      <w:r w:rsidR="00AD4F4A">
        <w:t xml:space="preserve"> is preformatted text. It preserves      both spaces and line </w:t>
      </w:r>
      <w:r w:rsidR="00AD4F4A" w:rsidRPr="0054521E">
        <w:rPr>
          <w:i/>
        </w:rPr>
        <w:t>breaks</w:t>
      </w:r>
      <w:r w:rsidR="007949A6">
        <w:rPr>
          <w:i/>
        </w:rPr>
        <w:t xml:space="preserve"> in </w:t>
      </w:r>
      <w:proofErr w:type="spellStart"/>
      <w:r w:rsidR="007949A6">
        <w:rPr>
          <w:i/>
        </w:rPr>
        <w:t>italics</w:t>
      </w:r>
      <w:r w:rsidR="00AD4F4A">
        <w:t>.This</w:t>
      </w:r>
      <w:proofErr w:type="spellEnd"/>
      <w:r w:rsidR="00AD4F4A">
        <w:t xml:space="preserve"> is </w:t>
      </w:r>
      <w:r w:rsidR="00A35A66">
        <w:rPr>
          <w:b/>
        </w:rPr>
        <w:t xml:space="preserve">simple        preformatted          </w:t>
      </w:r>
      <w:r w:rsidR="00AD4F4A">
        <w:t xml:space="preserve"> text. It preserves      both spaces and line breaks.</w:t>
      </w:r>
    </w:p>
    <w:p/>
    <w:p>
      <w:pPr>
        <w:rPr>
          <w:b/>
        </w:rPr>
      </w:pPr>
      <w:r w:rsidR="00AD4F4A" w:rsidRPr="0054521E">
        <w:rPr>
          <w:b/>
        </w:rPr>
        <w:t xml:space="preserve">New </w:t>
      </w:r>
      <w:r w:rsidR="0054521E" w:rsidRPr="0054521E">
        <w:rPr>
          <w:b/>
        </w:rPr>
        <w:t>Heading</w:t>
      </w:r>
      <w:r w:rsidR="00AD4F4A" w:rsidRPr="0054521E">
        <w:rPr>
          <w:b/>
        </w:rPr>
        <w:t>:</w:t>
      </w:r>
    </w:p>
    <w:p>
      <w:r w:rsidR="00AD4F4A">
        <w:t xml:space="preserve">This is </w:t>
      </w:r>
      <w:r w:rsidR="007949A6">
        <w:rPr>
          <w:u w:val="single"/>
        </w:rPr>
        <w:t>underlined</w:t>
      </w:r>
      <w:r w:rsidR="00AD4F4A">
        <w:t xml:space="preserve"> text. It preserves      both spaces and line breaks.</w:t>
      </w:r>
    </w:p>
    <w:p>
      <w:r w:rsidR="00AD4F4A">
        <w:t>This is preformatted text. It preserves      both spaces and line breaks.</w:t>
      </w:r>
    </w:p>
    <w:p/>
    <w:sectPr w:rsidR="00510025" w:rsidSect="00320151">
      <w:pgSz w:w="12240" w:h="15840"/>
      <w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="720" w:footer="720" w:gutter="0"/>
      <w:cols w:space="720"/>
      <w:docGrid w:linePitch="360"/>
    </w:sectPr>
  </w:body>
</w:document>

 

Code Snippet

The following C# code snippet transforms “document.xml” to HTML using the XSLT.

It also launches the browser (IE) to display the converted HTML file.

// Required namespaces: System, System.IO, System.Xml.Xsl, System.Windows.Forms
try
{
    // Absolute path of the "document.xml" file extracted from the Word document (*.docx)
    string strXMLFilePath = @"C:\Hello.docx\word\document.xml";

    // Absolute path of the XSLT shown in this article, saved as "OpenXml_XSLT.xsl"
    string strXSLFilePath = @"C:\FolderXSLT\OpenXml_XSLT.xsl";

    // Path of the HTML file to generate; here it is saved in the folder that holds document.xml
    string strHtmlFilePath = Path.Combine(Path.GetDirectoryName(strXMLFilePath), "xyz.html");

    // Load the XSL file (OpenXml_XSLT.xsl) into the XslCompiledTransform object
    XslCompiledTransform objXSLT = new XslCompiledTransform();
    objXSLT.Load(strXSLFilePath);

    // Transform reads document.xml directly from its path, converts it to HTML,
    // and writes the result to strHtmlFilePath, so no separate XmlDocument is needed
    objXSLT.Transform(strXMLFilePath, strHtmlFilePath);

    // Process.Start launches the browser (IE), which displays the converted HTML file
    System.Diagnostics.Process.Start("IExplore.exe", strHtmlFilePath);
}
catch (Exception ex)
{
    MessageBox.Show(ex.Message);
}

  

The Output HTML (after converting the XML file above using the XSLT)

<html xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/3/main">
<head>
  <META http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
<body>
  <pre>
    <div>
      <span>This is </span>
      <span>simple</span>
      <span> text. It preserves      both spaces and line </span>
      <span style="font-weight:bold;">breaks</span>
      <span style="font-weight:bold;"> in </span>
      <span style="font-weight:bold;">bold</span>
      <span>.This</span>
      <span> is preformatted text. It preserves      both spaces and line </span>
      <span style="font-style:italic;">breaks</span>
      <span style="font-style:italic;"> in </span>
      <span style="font-style:italic;">italics</span>
      <span>.This</span>
      <span> is </span>
      <span style="font-weight:bold;">simple        preformatted          </span>
      <span> text. It preserves      both spaces and line breaks.</span>
    </div>
    <div></div>
    <div>
      <span style="font-weight:bold;">New </span>
      <span style="font-weight:bold;">Heading</span>
      <span style="font-weight:bold;">:</span>
    </div>
    <div>
      <span>This is </span>
      <span style="text-decoration:underline;">underlined</span>
      <span> text. It preserves      both spaces and line breaks.</span>
    </div>
    <div>
      <span>This is preformatted text. It preserves      both spaces and line breaks.</span>
    </div>
    <div></div>
  </pre>
</body>
</html>

 

Browser View of the HTML:

 

 

I hope this helps you get started with WordprocessingML to HTML conversion.

  • Hi,

    Why not use DocX2Html.xsl, which is shipped with SPS 2007? It's more complete than your starting point. The only problems left are showing pictures, frames, header text, etc.

    I Hope this will help.

    AJ
  • Where can we find DocX2Html.xsl ???
  • @kamran: try the WordML2HTML XSLT stylesheet from http://www.xmllab.net/Downloads/tabid/61/Default.aspx
  • Hi,

    Great article. How do you obtain the XSLT used for Conversion?

    Thanks.
  • Hello,

    Do you know of anything that does the exact opposite of this article?

    I have generated HTML code and want to convert it into a Word file, but I need an XSLT for converting HTML into WordML, ideally one that can handle the formatting in the HTML code.

    Consider that formats can be mixed in HTML, like this:

    <P align="justify">
      <FONT face="arial">
          font face<BR/>
          <FONT size="7">
            <U>font</U> <EM>size</EM><BR/>
               <FONT style="color:#123456">
                <B>font</B> <STRIKE>color</STRIKE>
               </FONT>
          </FONT>
      </FONT>
    </P>

    It can also get as complicated as tables, images, bullets, etc.

    Do you know of an XSLT that can handle this?

    I hope somebody can help. Thanks and God bless.
