
OOXML Crawler


Written by Johannes Prinz

 

What is it:

I’ve written an Open XML crawler application that searches for Open XML files and extracts properties from them. It is a Windows application and requires .NET Framework 3.5 to be installed. The source code is attached to this article, which you can download and use under the Microsoft Public License.

The crawler searches for Open XML documents starting from the seed URL you give it, following the links from that page to further pages in a breadth-first search. For each Open XML document found, the crawler lists its properties as they are discovered, adding more columns as more properties turn up.
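If you're curious what that breadth-first crawl boils down to, here is a minimal sketch of the loop. The names, including ExtractLinks, are purely illustrative and not taken from the attached source, which also prioritizes the URLs it queues:

using System;
using System.Collections.Generic;

class CrawlLoopSketch
{
   // Stand-in for the real fetch-and-parse step: fetch the URL and return any
   // hyperlinks found on it (an Open XML document or other content yields none).
   static IEnumerable<Uri> ExtractLinks(Uri page)
   {
      yield break; // the real crawler parses the page here
   }

   static void Crawl(Uri seed)
   {
      var toVisit = new Queue<Uri>();   // plain FIFO queue gives breadth-first order
      var visited = new HashSet<Uri>();

      toVisit.Enqueue(seed);
      while (toVisit.Count > 0)
      {
         Uri current = toVisit.Dequeue();
         if (!visited.Add(current))
            continue;                   // already visited: never crawl a URL twice

         foreach (Uri link in ExtractLinks(current))
         {
            if (!visited.Contains(link))
               toVisit.Enqueue(link);
         }
      }
   }
}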

Usage:

1. Enter a valid URL into the URL box (1).
2. Click the “Add Seed” button (2) and note the queue count (5) incrementing.
3. You can repeat steps 1 and 2 as many times as you like, but eventually you’ll want to hit the Start button (3) to start the crawl.
4. When the crawler finds Open XML documents you’ll see the results listed in the grid view (4).
5. As the crawler discovers more URLs you’ll see the priority queue count (5) increasing.
6. After visiting a URL the visited count (6) is incremented, and the crawler will NOT crawl that URL again.
7. Buttons (7), (8) and (9) clear the results list, the “to visit” URL queue and the visited URL list respectively.

 

Results:

I crawled my internal SharePoint intranet site, which has plenty of Open XML documents in its various document libraries. After crawling 150 URLs from the original seed URL, the crawler had found a further 4538 URLs to visit and 89 Open XML documents. Here are some of the properties it found.

 

As you can see, it has picked up 40-odd properties for the documents it has found.

 

About the Code:

Most of the code in this sample app goes into the logic for the simple PriorityQueue implementation, the crawler itself and the WPF UI.
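The attached source has its own PriorityQueue; purely as an illustration of what such a structure can look like, here is a minimal sketch that keeps a FIFO queue per priority level. The class name, the integer priorities and the SortedDictionary approach are my assumptions, not the code in the download:

using System;
using System.Collections.Generic;
using System.Linq;

class SimplePriorityQueue<T>
{
   // Lower key = dequeued first; within a priority level items keep FIFO order.
   private readonly SortedDictionary<int, Queue<T>> buckets =
      new SortedDictionary<int, Queue<T>>();

   public int Count { get; private set; }

   public void Enqueue(T item, int priority)
   {
      Queue<T> bucket;
      if (!buckets.TryGetValue(priority, out bucket))
      {
         bucket = new Queue<T>();
         buckets.Add(priority, bucket);
      }
      bucket.Enqueue(item);
      Count++;
   }

   public T Dequeue()
   {
      if (Count == 0)
         throw new InvalidOperationException("The queue is empty.");

      // SortedDictionary enumerates keys in ascending order, so the first
      // bucket holds the items that should come out next.
      KeyValuePair<int, Queue<T>> first = buckets.First();
      T item = first.Value.Dequeue();
      if (first.Value.Count == 0)
         buckets.Remove(first.Key);
      Count--;
      return item;
   }
}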

Under the hood the crawler just checks content types. If a URL is another HTML page, the crawler fetches all the links and dumps them into the priority queue; if it’s an Open XML document, it fetches its properties; and if it’s neither OOXML nor HTML, we simply ignore it.

So how do I know that a requested document is an Open XML document? I found out what the content types are for Open XML documents; there are three main types, for word processing documents, spreadsheets and presentations:

 

application/vnd.openxmlformats-officedocument.wordprocessingml.document
application/vnd.openxmlformats-officedocument.presentationml.presentation
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet

 

So in code I check for that:

private Dictionary<object, object> ProcessUri(...)
{
   ...
   if (contentType.ToLower().Contains("application/vnd.openxmlformats-officedocument."))
   ...
}
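Where does contentType come from? It can be read straight from the HTTP response headers. The sketch below is illustrative only; the attached source does this inside its own ProcessUri, and the HEAD request here is an assumption (the real crawler may simply issue a GET and inspect the response it gets back):

using System;
using System.Net;

static class ContentTypeSketch
{
   public static void Classify(Uri uri)
   {
      HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
      request.Method = "HEAD";                     // headers are enough to classify the URL

      using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
      {
         string contentType = response.ContentType.ToLower();

         if (contentType.Contains("application/vnd.openxmlformats-officedocument."))
         {
            // Open XML document: download it and extract its properties.
         }
         else if (contentType.Contains("text/html"))
         {
            // HTML page: harvest its links and push them onto the priority queue.
         }
         // anything else is simply ignored
      }
   }
}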

I had a brief look at the Open XML SDK v2 for getting the document properties. Knowing that Open XML documents are just archives of XML files, I decided to unpack a Word document and have a look at what it looks like on the inside.

Ahaaa! A whole folder full of properties. I can fetch all the properties I need by using the System.IO.Packaging.Package class to get access to those files, reading all the XML nodes below the root and adding them to my properties collection. Note that VisualXPath.exe was a great help here for stubbing out all the XmlNamespaceManager code.

 

private void GetDocumentProperties(MemoryStream stream, Dictionary<object, object> results)
{
   ...
   using (Package doc = Package.Open(stream))
   {
      // Core properties (title, creator, dates, ...) live in /docProps/core.xml.
      propertiesPackage = doc.GetPart(new Uri("/docProps/core.xml", UriKind.Relative));
      xmlDoc.Load(propertiesPackage.GetStream());

      nsMgr = new XmlNamespaceManager(xmlDoc.NameTable);
      nsMgr.AddNamespace("cp",
         "http://schemas.openxmlformats.org/package/2006/metadata/core-properties");
      nsMgr.AddNamespace("dcterms", "http://purl.org/dc/terms/");
      nsMgr.AddNamespace("dc", "http://purl.org/dc/elements/1.1/");
      nsMgr.AddNamespace("dcmitype", "http://purl.org/dc/dcmitype/");
      nsMgr.AddNamespace("xsi", "http://www.w3.org/2001/XMLSchema-instance");

      properties = xmlDoc.SelectNodes("cp:coreProperties/*", nsMgr);
      foreach (XmlNode node in properties)
      {
         if (results.ContainsKey(node.LocalName))
         {
            results[node.LocalName] = node.InnerText;
         }
         else
         {
            results.Add(node.LocalName, node.InnerText);
         }
      }

      // Extended (application) properties live in /docProps/app.xml.
      propertiesPackage = doc.GetPart(new Uri("/docProps/app.xml", UriKind.Relative));
      xmlDoc.Load(propertiesPackage.GetStream());

      nsMgr = new XmlNamespaceManager(xmlDoc.NameTable);
      nsMgr.AddNamespace("vt",
         "http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes");
      nsMgr.AddNamespace("def",
         "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties");

      properties = xmlDoc.SelectNodes("def:Properties/*", nsMgr);
      foreach (XmlNode node in properties)
      {
         if (results.ContainsKey(node.LocalName))
         {
            results[node.LocalName] = node.InnerText;
         }
         else
         {
            results.Add(node.LocalName, node.InnerText);
         }
      }
   }
   ...
}
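For context, a hypothetical call site could look like this (the URL and the WebClient download are illustrative; the attached source has its own download path feeding the MemoryStream):

// Example only: download a document into memory and hand it to GetDocumentProperties.
Uri documentUri = new Uri("http://intranet/Shared%20Documents/report.docx"); // example URL
var results = new Dictionary<object, object>();

using (var client = new System.Net.WebClient())
using (var stream = new MemoryStream(client.DownloadData(documentUri)))
{
   GetDocumentProperties(stream, results);
}
// results now maps property names such as "creator" or "Pages" to their values.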

 

As you can see, I currently inspect app.xml and core.xml. There is also a custom.xml file; if you decide to extend the code to parse it, please do let us know and we’d be happy to include it back into the source code package here.
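As a starting point, here is a sketch of how custom.xml could be handled in the same style, placed inside GetDocumentProperties after the app.xml block. The namespace below is the standard custom-properties namespace; note the part only exists when a document actually has custom properties, hence the PartExists check:

// Sketch only: custom properties live in /docProps/custom.xml; each property
// element carries a "name" attribute and stores its value in a vt:* child.
Uri customPartUri = new Uri("/docProps/custom.xml", UriKind.Relative);
if (doc.PartExists(customPartUri))
{
   xmlDoc.Load(doc.GetPart(customPartUri).GetStream());
   nsMgr = new XmlNamespaceManager(xmlDoc.NameTable);
   nsMgr.AddNamespace("cust",
      "http://schemas.openxmlformats.org/officeDocument/2006/custom-properties");

   foreach (XmlNode node in xmlDoc.SelectNodes("cust:Properties/cust:property", nsMgr))
   {
      string name = node.Attributes["name"].Value;   // the custom property name
      results[name] = node.InnerText;                // the vt:* child holds the value
   }
}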

 

Attachment: OOXMLCrawlerSource.zip