This article explains the process for extracting embedded spreadsheet files from a chosen word document as shown below.
When an XLSX files are embedded into a word document, each of the embedded file shall be the target of a relationship in a Main Document part-relationship item. An Embedded Object part may be located within or external to the package containing the word document. This is expressed syntactically by the TargetMode attribute of the Relationship element by stating if it is Internal or External.
For example, the document.xml.rels file will have –
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
.....
<Relationship Id="rId7" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/oleObject" Target="file:///D:\OpenXML\Book1.xlsx" TargetMode="External"/>
......
<Relationship Id="rId5" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/package" Target="embeddings/Microsoft_Office_Excel_Worksheet1.xlsx"/>
....
</Relationships>
In the actual document part (document.xml), this is stored as –
<w:p w:rsidR="00D4241A" w:rsidRDefault="008734DB">
<w:r>
<w:object w:dxaOrig="7502" w:dyaOrig="2924">
<o:OLEObject Type="Embed" ProgID="Excel.Sheet.12" ShapeID="_x0000_i1025" DrawAspect="Content" ObjectID="_1240231300" r:id="rId5"/>
</w:object>
</w:r>
</w:p>
<w:p w:rsidR="00B80B59" w:rsidRDefault="00B80B59">
<w:object w:dxaOrig="1536" w:dyaOrig="994">
<o:OLEObject Type="Link" ProgID="Excel.Sheet.12" ShapeID="_x0000_i1026" DrawAspect="Icon" r:id="rId7" UpdateMode="Always">
<o:LinkType>EnhancedMetaFile</o:LinkType>
<o:LockedField>false</o:LockedField>
<o:FieldCodes>\f 0</o:FieldCodes>
</o:OLEObject>
Here we will be looking into extracting the internally embedded spreadsheet fiels.
When this word document is chosen, it should be first checked if it contains an embedded spreadsheet file. This can be done as follows –
Assume you want to open the word document stored in a string variable called – documentFileName.
//Open the document
Package toExtract = Package.Open(documentFileName);
//Extract parts out of the document
foreach (PackagePart pPart in toExtract.GetParts ())
{
partURI = pPart.Uri.OriginalString;
//Check if the part has name ending with “.xlsx”
if (partURI.ToLower().EndsWith(".xlsx"))
//code to extract the embedded XLSX
}
Now when it is known that the word document has embedded spreadsheet files, we can extract out those XLSX files and write them out to the directory in which the original word document exists. This can be done as follows –
Define the document relationship type and embeddings relationship type –
const string documentRelationshipType = "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";
const string embeddingRelationshipType = "http://schemas.openxmlformats.org/officeDocument/2006/relationships/package";
Get the main document part (document.xml)
System.IO.Packaging.Package package = System.IO.Packaging.Package.Open(loadFileName, System.IO.FileMode.Open, System.IO.FileAccess.ReadWrite);
System.IO.Packaging.PackagePart documentPart = null;
foreach (System.IO.Packaging.PackageRelationship relationship in package.GetRelationshipsByType(documentRelationshipType))
Uri documentUri = PackUriHelper.ResolvePartUri(new Uri("/", UriKind.Relative), relationship.TargetUri);
documentPart = package.GetPart(documentUri);
// There is only one document.
break;
Now get the embeddings parts from the document part and for each of the embeddings check if it is a spreadsheet file and extract those files into the folder in which the word document exists (the method ReadWriteStream() does this) –
string pathPart = loadFileName.Substring(0, loadFileName.LastIndexOf('\\') + 1);
foreach (System.IO.Packaging.PackageRelationship relationship in documentPart.GetRelationshipsByType(embeddingRelationshipType))
Uri documentUri = PackUriHelper.ResolvePartUri(documentPart.Uri, relationship.TargetUri);
PackagePart embeddingPart = package.GetPart(documentUri);
System.IO.Stream partStream = embeddingPart.GetStream();
string partURI = embeddingPart.Uri.OriginalString;
string[] parts = partURI.Split('/');
string extractFileName = pathPart + parts[parts.Length - 1];
if(extractFileName.EndsWith("xlsx"))
// create a write stream
FileStream writeStream = new FileStream(extractFileName, FileMode.Create, FileAccess.Write);
// write to the stream
ReadWriteStream(partStream, writeStream);
//to make the workbook visible
updateExcel(extractFileName);
package.Close();
The method ReadWriteStream() in the above code will read the content of the part stream and copy it into a file stream.
Note that there is a method called updateExcel() in the above code. When the embedded spreadsheet file is extracted and written out, the value of the attribute visibility of the <workbookView/> element, is stored as ‘hidden’ and due to this, when the file is opened the worksheets are not visible. Due to this, the extracted spreadsheet is opened up and the value of the visibility attribute is updated to ‘visible’ in the method updateExcel().
This can be done as follows –
Package package = Package.Open(docFileName, FileMode.Open, FileAccess.ReadWrite);
PackagePart documentPart = null;
// Get the document part
// Manage namespaces to perform Xml XPath queries.
NameTable nt = new NameTable();
XmlNamespaceManager xmlNamespaceManager = new XmlNamespaceManager(nt);
xmlNamespaceManager.AddNamespace("workbook", "http://schemas.openxmlformats.org/spreadsheetml/2006/main");
XmlDocument xmlDocument = new XmlDocument();
xmlDocument.Load(documentPart.GetStream());
XmlNode workbookViewNode = xmlDocument.SelectSingleNode("//workbook:workbookView", xmlNamespaceManager);
workbookViewNode.Attributes["visibility"].Value = "visible";
xmlDocument.Save(documentPart.GetStream());
The example demo is attached with the article as a zip file.
I notice a small error in the prevoius code. Instead of EmbeddedObjectPart we should search for an EmbeddedPackagePart object.
Here is an updated version:
// Open the word document and assign to the local pointer.
OpenSettings openSettings = new OpenSettings();
openSettings.AutoSave = true;
// Add the MarkupCompatibilityProcessSettings
openSettings.MarkupCompatibilityProcessSettings =
new MarkupCompatibilityProcessSettings(
MarkupCompatibilityProcessMode.ProcessAllParts,
DocumentFormat.OpenXml.FileFormatVersions.Office2010);
using (WordprocessingDocument openXmlPackageForWord = WordprocessingDocument.Open(loadFileName, true, openSettings))
// Get the main document part (workbook.xml) of the word document.
MainDocumentPart mainDocumentPart = openXmlPackageForWord.MainDocumentPart;
// Load the contents of the workbook.
XmlDocument doc = new XmlDocument();
doc.Load(mainDocumentPart.GetStream());
foreach (EmbeddedPackagePart embeddedPart in openXmlPackageForWord.MainDocumentPart.EmbeddedPackageParts)
string partURI = embeddedPart.Uri.OriginalString;
// found one embedded excel sheet
string extractedFileName = pathPart + parts[parts.Length - 1];
if (extractedFileName.EndsWith("xlsx"))
System.IO.Stream partStream = embeddedPart.GetStream();
FileStream writeStream = new FileStream(extractedFileName, FileMode.Create, FileAccess.Write);
updateExcel(extractedFileName);
openXmlPackageForWord.Close();
Best regards.