Welcome to OpenXML Developer Sign in | Join | Help

Extracting embedded spreadsheets from wordprocessingML documents

This article explains the process for extracting embedded spreadsheet files from a chosen word document as shown below.

When an XLSX files are embedded into a word document, each of the embedded file shall be the target of a relationship in a Main Document part-relationship item. An Embedded Object part may be located within or external to the package containing the word document. This is expressed syntactically by the TargetMode attribute of the Relationship element by stating if it is Internal or External.

For example, the document.xml.rels file will have –

<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">

  .....

  <Relationship Id="rId7" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/oleObject" Target="file:///D:\OpenXML\Book1.xlsx" TargetMode="External"/>

  ......

  <Relationship Id="rId5" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/package" Target="embeddings/Microsoft_Office_Excel_Worksheet1.xlsx"/>

  ....

</Relationships>

 

In the actual document part (document.xml), this is stored as –

<w:p w:rsidR="00D4241A" w:rsidRDefault="008734DB">

 <w:r>

  <w:object w:dxaOrig="7502" w:dyaOrig="2924">

    ......

   <o:OLEObject Type="Embed" ProgID="Excel.Sheet.12" ShapeID="_x0000_i1025" DrawAspect="Content" ObjectID="_1240231300" r:id="rId5"/>

  </w:object>

 </w:r>

</w:p>

<w:p w:rsidR="00B80B59" w:rsidRDefault="00B80B59">

 <w:r>

  <w:object w:dxaOrig="1536" w:dyaOrig="994">

   .....

   <o:OLEObject Type="Link" ProgID="Excel.Sheet.12" ShapeID="_x0000_i1026" DrawAspect="Icon" r:id="rId7" UpdateMode="Always">

    <o:LinkType>EnhancedMetaFile</o:LinkType>

    <o:LockedField>false</o:LockedField>

    <o:FieldCodes>\f 0</o:FieldCodes>

   </o:OLEObject>

  </w:object>

 </w:r>

</w:p>

 

Here we will be looking into extracting the internally embedded spreadsheet fiels.

When this word document is chosen, it should be first checked if it contains an embedded spreadsheet file. This can be done as follows –

 

Assume you want to open the word document stored in a string variable called – documentFileName.

 

 

//Open the document

Package toExtract = Package.Open(documentFileName);

//Extract parts out of the document

foreach (PackagePart pPart in toExtract.GetParts ())

{

partURI = pPart.Uri.OriginalString;

//Check if the part has name ending with “.xlsx”

      if (partURI.ToLower().EndsWith(".xlsx"))

      {

            //code to extract the embedded XLSX

      }

}

 

Now when it is known that the word document has embedded spreadsheet files, we can extract out those XLSX files and write them out to the directory in which the original word document exists. This can be done as follows –

 

Define the document relationship type and embeddings relationship type –

const string documentRelationshipType = "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";

const string embeddingRelationshipType = "http://schemas.openxmlformats.org/officeDocument/2006/relationships/package";

 

Get the main document part (document.xml)

System.IO.Packaging.Package package = System.IO.Packaging.Package.Open(loadFileName, System.IO.FileMode.Open, System.IO.FileAccess.ReadWrite);

System.IO.Packaging.PackagePart documentPart = null;

 

foreach (System.IO.Packaging.PackageRelationship relationship in package.GetRelationshipsByType(documentRelationshipType))

{

Uri documentUri = PackUriHelper.ResolvePartUri(new Uri("/", UriKind.Relative), relationship.TargetUri);

      documentPart = package.GetPart(documentUri);

      // There is only one document.

      break;

}

 

Now get the embeddings parts from the document part and for each of the embeddings check if it is a spreadsheet file and extract those files into the folder in which the word document exists (the method ReadWriteStream() does this) –

 

string pathPart = loadFileName.Substring(0, loadFileName.LastIndexOf('\\') + 1);

foreach (System.IO.Packaging.PackageRelationship relationship in documentPart.GetRelationshipsByType(embeddingRelationshipType))

{

Uri documentUri = PackUriHelper.ResolvePartUri(documentPart.Uri, relationship.TargetUri);

      PackagePart embeddingPart = package.GetPart(documentUri);

      System.IO.Stream partStream = embeddingPart.GetStream();

      string partURI = embeddingPart.Uri.OriginalString;

      string[] parts = partURI.Split('/');

      string extractFileName = pathPart + parts[parts.Length - 1];

 

      if(extractFileName.EndsWith("xlsx"))

      {

            // create a write stream

            FileStream writeStream = new FileStream(extractFileName, FileMode.Create, FileAccess.Write);

            // write to the stream

            ReadWriteStream(partStream, writeStream);

            //to make the workbook visible

            updateExcel(extractFileName);

}

}

package.Close();

 

The method ReadWriteStream() in the above code will read the content of the part stream and copy it into a file stream.

 

Note that there is a method called updateExcel() in the above code. When the embedded spreadsheet file is extracted and written out, the value of the attribute visibility of the <workbookView/> element, is stored as ‘hidden’ and due to this, when the file is opened the worksheets are not visible. Due to this, the extracted spreadsheet is opened up and the value of the visibility attribute is updated to ‘visible’ in the method updateExcel().

 

This can be done as follows –

const string documentRelationshipType = "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";

 

Package package = Package.Open(docFileName, FileMode.Open, FileAccess.ReadWrite);

PackagePart documentPart = null;

 

// Get the document part

foreach (System.IO.Packaging.PackageRelationship relationship in package.GetRelationshipsByType(documentRelationshipType))

{

      Uri documentUri = PackUriHelper.ResolvePartUri(new Uri("/", UriKind.Relative), relationship.TargetUri);

      documentPart = package.GetPart(documentUri);

 

// There is only one document.

break;

}

 

//  Manage namespaces to perform Xml XPath queries.

NameTable nt = new NameTable();

XmlNamespaceManager xmlNamespaceManager = new XmlNamespaceManager(nt);

xmlNamespaceManager.AddNamespace("workbook", "http://schemas.openxmlformats.org/spreadsheetml/2006/main");

 

XmlDocument xmlDocument = new XmlDocument();

xmlDocument.Load(documentPart.GetStream());

 

XmlNode workbookViewNode = xmlDocument.SelectSingleNode("//workbook:workbookView", xmlNamespaceManager);

workbookViewNode.Attributes["visibility"].Value = "visible";

xmlDocument.Save(documentPart.GetStream());

package.Close();

 

The example demo is attached with the article as a zip file.

Published Thursday, May 10, 2007 9:16 AM by SanjayKumarM
Attachment(s): ExtractXLSX.zip

Comments

 

Lanqing said:

With the new OpenXml API SDK, the code could be reduced to the following few lines:

string pathPart = loadFileName.Substring(0, loadFileName.LastIndexOf('\\') + 1);
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(loadFileName, false))
{
   //getting the embedded excell sheet out
   MainDocumentPart docPart = wordDoc.MainDocumentPart;
   foreach (EmbeddedObjectPart embeddedPart in docPart.EmbeddedObjectParts)
   {
       string partURI = embeddedPart.Uri.OriginalString;
       string [] parts = partURI.Split('/');
       if (parts[parts.Length-1].EndsWith("xlsx")
       {
           // found one embedded excel sheet
           string extractedFileName = pathPart + parts[parts.Length - 1];
         
           // create a write stream
           FileStream writeStream = new FileStream(extractFileName, FileMode.Create, FileAccess.Write);
           // write to the stream
           ReadWriteStream(partStream, writeStream);
           //to make the workbook visible
           updateExcel(extractFileName);
       }
   }
}

The function updateExcel() could be simplied as well. The excel document's main part could be simply accessed as SpreadsheetDocument.WorkbookPart. User could get the stream of the part and modify the value of xml attribute "visibility" off the stream.
June 21, 2007 4:22 AM
 

SanjayKumarM said:

Yes, the OpenXML API SDK is of great relief.
June 21, 2007 5:45 AM
Anonymous comments are disabled