wordpress hit counter
Extracting embedded spreadsheets from wordprocessingML documents - OpenXML Developer - Blog - OpenXML Developer
Goodbye and Hello

OpenXmlDeveloper.org is Shutting Down

There is a time for all good things to come to an end, and the time has come to shut down OpenXmlDeveloper.org.

Screen-casts and blog posts: Content on OpenXmlDeveloper.org will be moving to EricWhite.com.

Forums: We are moving the forums to EricWhite.com and StackOverflow.com. Please do not post in the forums on OpenXmlDeveloper.org. Instead, please post in the forums at EricWhite.com or at StackOverflow.com.

Please see this blog post for more information about my plans moving forward.  Cheers, Eric

Extracting embedded spreadsheets from wordprocessingML documents

Extracting embedded spreadsheets from wordprocessingML documents

  • Comments 19

This article explains the process for extracting embedded spreadsheet files from a chosen word document as shown below.

When an XLSX files are embedded into a word document, each of the embedded file shall be the target of a relationship in a Main Document part-relationship item. An Embedded Object part may be located within or external to the package containing the word document. This is expressed syntactically by the TargetMode attribute of the Relationship element by stating if it is Internal or External.

For example, the document.xml.rels file will have –

<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">

  .....

  <Relationship Id="rId7" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/oleObject" Target="file:///D:\OpenXML\Book1.xlsx" TargetMode="External"/>

  ......

  <Relationship Id="rId5" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/package" Target="embeddings/Microsoft_Office_Excel_Worksheet1.xlsx"/>

  ....

</Relationships>

 

In the actual document part (document.xml), this is stored as –

<w:p w:rsidR="00D4241A" w:rsidRDefault="008734DB">

 <w:r>

  <w:object w:dxaOrig="7502" w:dyaOrig="2924">

    ......

   <o:OLEObject Type="Embed" ProgID="Excel.Sheet.12" ShapeID="_x0000_i1025" DrawAspect="Content" ObjectID="_1240231300" r:id="rId5"/>

  </w:object>

 </w:r>

</w:p>

<w:p w:rsidR="00B80B59" w:rsidRDefault="00B80B59">

 <w:r>

  <w:object w:dxaOrig="1536" w:dyaOrig="994">

   .....

   <o:OLEObject Type="Link" ProgID="Excel.Sheet.12" ShapeID="_x0000_i1026" DrawAspect="Icon" r:id="rId7" UpdateMode="Always">

    <o:LinkType>EnhancedMetaFile</o:LinkType>

    <o:LockedField>false</o:LockedField>

    <o:FieldCodes>\f 0</o:FieldCodes>

   </o:OLEObject>

  </w:object>

 </w:r>

</w:p>

 

Here we will be looking into extracting the internally embedded spreadsheet fiels.

When this word document is chosen, it should be first checked if it contains an embedded spreadsheet file. This can be done as follows –

 

Assume you want to open the word document stored in a string variable called – documentFileName.

 

 

//Open the document

Package toExtract = Package.Open(documentFileName);

//Extract parts out of the document

foreach (PackagePart pPart in toExtract.GetParts ())

{

partURI = pPart.Uri.OriginalString;

//Check if the part has name ending with “.xlsx”

      if (partURI.ToLower().EndsWith(".xlsx"))

      {

            //code to extract the embedded XLSX

      }

}

 

Now when it is known that the word document has embedded spreadsheet files, we can extract out those XLSX files and write them out to the directory in which the original word document exists. This can be done as follows –

 

Define the document relationship type and embeddings relationship type –

const string documentRelationshipType = "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";

const string embeddingRelationshipType = "http://schemas.openxmlformats.org/officeDocument/2006/relationships/package";

 

Get the main document part (document.xml)

System.IO.Packaging.Package package = System.IO.Packaging.Package.Open(loadFileName, System.IO.FileMode.Open, System.IO.FileAccess.ReadWrite);

System.IO.Packaging.PackagePart documentPart = null;

 

foreach (System.IO.Packaging.PackageRelationship relationship in package.GetRelationshipsByType(documentRelationshipType))

{

Uri documentUri = PackUriHelper.ResolvePartUri(new Uri("/", UriKind.Relative), relationship.TargetUri);

      documentPart = package.GetPart(documentUri);

      // There is only one document.

      break;

}

 

Now get the embeddings parts from the document part and for each of the embeddings check if it is a spreadsheet file and extract those files into the folder in which the word document exists (the method ReadWriteStream() does this) –

 

string pathPart = loadFileName.Substring(0, loadFileName.LastIndexOf('\\') + 1);

foreach (System.IO.Packaging.PackageRelationship relationship in documentPart.GetRelationshipsByType(embeddingRelationshipType))

{

Uri documentUri = PackUriHelper.ResolvePartUri(documentPart.Uri, relationship.TargetUri);

      PackagePart embeddingPart = package.GetPart(documentUri);

      System.IO.Stream partStream = embeddingPart.GetStream();

      string partURI = embeddingPart.Uri.OriginalString;

      string[] parts = partURI.Split('/');

      string extractFileName = pathPart + parts[parts.Length - 1];

 

      if(extractFileName.EndsWith("xlsx"))

      {

            // create a write stream

            FileStream writeStream = new FileStream(extractFileName, FileMode.Create, FileAccess.Write);

            // write to the stream

            ReadWriteStream(partStream, writeStream);

            //to make the workbook visible

            updateExcel(extractFileName);

}

}

package.Close();

 

The method ReadWriteStream() in the above code will read the content of the part stream and copy it into a file stream.

 

Note that there is a method called updateExcel() in the above code. When the embedded spreadsheet file is extracted and written out, the value of the attribute visibility of the <workbookView/> element, is stored as ‘hidden’ and due to this, when the file is opened the worksheets are not visible. Due to this, the extracted spreadsheet is opened up and the value of the visibility attribute is updated to ‘visible’ in the method updateExcel().

 

This can be done as follows –

const string documentRelationshipType = "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";

 

Package package = Package.Open(docFileName, FileMode.Open, FileAccess.ReadWrite);

PackagePart documentPart = null;

 

// Get the document part

foreach (System.IO.Packaging.PackageRelationship relationship in package.GetRelationshipsByType(documentRelationshipType))

{

      Uri documentUri = PackUriHelper.ResolvePartUri(new Uri("/", UriKind.Relative), relationship.TargetUri);

      documentPart = package.GetPart(documentUri);

 

// There is only one document.

break;

}

 

//  Manage namespaces to perform Xml XPath queries.

NameTable nt = new NameTable();

XmlNamespaceManager xmlNamespaceManager = new XmlNamespaceManager(nt);

xmlNamespaceManager.AddNamespace("workbook", "http://schemas.openxmlformats.org/spreadsheetml/2006/main");

 

XmlDocument xmlDocument = new XmlDocument();

xmlDocument.Load(documentPart.GetStream());

 

XmlNode workbookViewNode = xmlDocument.SelectSingleNode("//workbook:workbookView", xmlNamespaceManager);

workbookViewNode.Attributes["visibility"].Value = "visible";

xmlDocument.Save(documentPart.GetStream());

package.Close();

 

The example demo is attached with the article as a zip file.

Attachment: ExtractXLSX.zip
Page 1 of 2 (19 items) 12