Article by: Sharad Bajaj
Introduction
The new Office Open XML file formats are the default formats for Word 2007, Excel 2007, and PowerPoint 2007. As users create documents in these formats and convert existing documents to them, there will be a growing need to search those documents. This article provides an example of how you can write a program that quickly searches the contents of DOCX files (as created by Word 2007) for a specified text string. The separation of content and metadata in the Open XML file formats is what allows this search to run so rapidly.
In my example code (attached), you can see that I iterate through all DOCX files and open each one using the Package object from the new System.IO.Packaging namespace. The System.IO.Packaging namespace is part of the .NET Framework 3.0, which is available as a download from Microsoft. I have also run the search on a background thread, so that GUI operations such as updating the progress bar and other controls stay responsive.
File Format
First, a little about the DOCX file format. To see it for yourself, create a .docx file in Word 2007 and type a few words of text in it. Save the file, then go to Windows Explorer and rename it so that it has the extension .zip. (Word, Excel, and PowerPoint files are now actually ZIP files.)
Now double-click the file to open it. You will see the folder structure and the files inside it. Navigate to the word folder and open the document.xml file. Whatever you wrote in your document is stored there. One more important thing to note: all the text is enclosed in <w:t> ("text") tags. This is the key we will use to search each file.
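To make the structure concrete, here is a minimal sketch that loads a hand-written, heavily simplified fragment in the style of document.xml and collects the text of each w:t element. The fragment and the class name DocumentXmlSample are illustrative assumptions and are not part of the attached code; a real document.xml contains many more elements and attributes.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper for illustration only; not part of the attached solution.
public static class DocumentXmlSample
{
    // Main WordprocessingML namespace, bound to the "w" prefix in document.xml.
    public const string WordNamespace =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hand-written, simplified stand-in for the contents of word/document.xml.
    public const string Fragment =
        "<w:document xmlns:w=\"" + WordNamespace + "\">" +
        "<w:body><w:p>" +
        "<w:r><w:t>Hello </w:t></w:r>" +
        "<w:r><w:t>Open XML</w:t></w:r>" +
        "</w:p></w:body></w:document>";

    // Returns the inner text of every w:t element, in document order.
    public static List<string> ExtractText(string xml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);

        List<string> result = new List<string>();
        foreach (XmlNode node in doc.GetElementsByTagName("t", WordNamespace))
        {
            result.Add(node.InnerText);
        }
        return result;
    }
}
```

Calling DocumentXmlSample.ExtractText(DocumentXmlSample.Fragment) returns one string per w:t element, which is the same list of text pieces the search later walks through.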
Using Code
Download the code, open the solution, and go to Docfileparse.cs. In this file you will find the class Docfileparse, which has the following constructor:
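/// <summary>
/// Constructor
/// </summary>
/// <param name="_filename">The name of the file to parse</param>
/// <param name="_searchtext">The text to search for</param>
/// <param name="iscasesenstive">Whether the search is case sensitive</param>
/// <param name="_matchword">Whether to match the exact word</param>
public Docfileparse(string _filename, string _searchtext, bool iscasesenstive, bool _matchword)
{
    filename = _filename;
    searchtext = _searchtext;
    casesenstive = iscasesenstive;
    // For a case-insensitive search, compare everything in lowercase.
    if (casesenstive == false)
        searchtext = searchtext.ToLower();
    matchword = _matchword;
}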
The class supports two additional types of search:
[1] Case sensitive
[2] Match whole word
We pass these two parameters to the constructor as well, to let the class know what type of search we want. In a case-sensitive search, we match the text exactly as typed. If the search is not case sensitive, we convert both the search text and the document text to lowercase before comparing.
In the second case, if the user wants to match a whole word, we split the document text into words and compare the search string with each word in turn.
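The two options can be sketched as one small standalone function. This is a minimal sketch and not the code from the attached solution: the class name TextMatcher and its parameter names are hypothetical, and it swaps the lowercase conversion for the StringComparison overloads of IndexOf and Equals, which give a case-insensitive comparison without creating lowercase copies of the text.

```csharp
using System;

// Hypothetical helper for illustration only; not part of the attached solution.
public static class TextMatcher
{
    public static bool Matches(string text, string searchText,
                               bool caseSensitive, bool matchWholeWord)
    {
        // Ordinal comparison, with or without case folding.
        StringComparison comparison = caseSensitive
            ? StringComparison.Ordinal
            : StringComparison.OrdinalIgnoreCase;

        if (!matchWholeWord)
        {
            // Plain substring search.
            return text.IndexOf(searchText, comparison) >= 0;
        }

        // Whole-word search: split on spaces and compare each word in turn.
        foreach (string word in text.Split(' '))
        {
            if (string.Equals(word, searchText, comparison))
                return true;
        }
        return false;
    }
}
```

Like the approach described above, this splits on spaces only, so a word followed directly by punctuation (for example "XML,") would not count as a whole-word match for "XML".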
The GUI main form creates an object of the Docfileparse class in the background thread method ThreadProc:
public void ThreadProc()
{
    DirectoryInfo df = new DirectoryInfo(textBox1.Text);
    int i = 0;
    foreach (FileInfo f in df.GetFiles("*.docx"))
    {
        if (progressBar1.Value < progressBar1.Maximum)
        {
            MethodInvoker mi = new MethodInvoker(this.UpdateProgress);
            this.BeginInvoke(mi);
        }
        else
        {
            MethodInvoker mi = new MethodInvoker(this.resetProgress);
            this.BeginInvoke(mi);
        }
        i++;
        // Create a Docfileparse object and pass the search parameters
        Docfileparse docparser = new Docfileparse(f.FullName, textBox2.Text, checkBox1.Checked, checkBox2.Checked);
        string _message = "";
        // Call searchText on the object; it returns true or false
        if (docparser.searchText(ref _message) == true)
        {
            SetlistboxText(f.FullName);
        }
        docparser = null;
        SetlabelText(i);
    }
    MethodInvoker mdone = new MethodInvoker(this.Processdone);
    IAsyncResult ia = this.BeginInvoke(mdone);
}
This code looks a bit complicated because I am trying to provide a complete working application. If you only want to plug this search into something else, Docfileparse.cs is all you need.
Step-by-step Code Walkthrough
The caller creates an object of the Docfileparse class through its constructor, which initializes the parameters in the class. When the caller then invokes the searchText method, it performs the following steps.
Open the Word (.docx) file package:
private bool openpack()
{
    try
    {
        filepack = Package.Open(filename, FileMode.Open, FileAccess.Read);
        return true;
    }
    catch (Exception)
    {
        return false;
    }
}
Navigate to the document.xml file in the package and load it into an XmlDocument:
private XmlDocument loadXmlDoc()
{
    try
    {
        XmlDocument xmdoc = new XmlDocument();
        Uri pathtodoc = new Uri("/word/document.xml", UriKind.Relative);
        PackagePart newPart = filepack.GetPart(pathtodoc);
        xmdoc.Load(newPart.GetStream(FileMode.Open, FileAccess.Read));
        return xmdoc;
    }
    catch (Exception)
    {
        return null;
    }
}
Now close the package by calling the Package.Close() method in ClosePackage. The XmlDocument already holds all the data in memory, so there is no need to keep the package open.
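The open, load, and close steps can also be combined so that the package is closed even if loading fails partway. The following is a minimal sketch under that assumption, not the code from the attached solution: the class name PackageReader and the method name LoadMainDocument are hypothetical, and the part path /word/document.xml is hard-coded as in the walkthrough above.

```csharp
using System;
using System.IO;
using System.IO.Packaging;
using System.Xml;

// Hypothetical helper for illustration only; not part of the attached solution.
public static class PackageReader
{
    // Loads /word/document.xml from the given .docx file.
    // Returns null if the package has no part at that path.
    public static XmlDocument LoadMainDocument(string path)
    {
        // The using block disposes (closes) the package on every exit path.
        using (Package package = Package.Open(path, FileMode.Open, FileAccess.Read))
        {
            Uri partUri = new Uri("/word/document.xml", UriKind.Relative);
            if (!package.PartExists(partUri))
                return null;

            PackagePart part = package.GetPart(partUri);
            using (Stream stream = part.GetStream(FileMode.Open, FileAccess.Read))
            {
                // XmlDocument reads the whole stream into memory, so the
                // document remains usable after the package is closed.
                XmlDocument doc = new XmlDocument();
                doc.Load(stream);
                return doc;
            }
        }
    }
}
```

Unlike the methods in the walkthrough, this sketch lets exceptions from a missing or corrupt file propagate to the caller instead of catching them and returning false or null.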
Now we search the data in this XmlDocument. If the desired text is found, the searchText method returns true; otherwise it returns false. As I described in the File Format section, all the text is stored in <w:t> tags, so we query the XmlDocument for all elements with this tag and then iterate through them, checking the InnerText of each one against the search text:
private bool IstextinFile(XmlDocument xmldocument)
{
    XmlNodeList textNodes = xmldocument.GetElementsByTagName("w:t");
    string text = "";
    foreach (XmlNode xmnode in textNodes)
    {
        text = xmnode.InnerText;
        if (casesenstive == false)
            text = text.ToLower();
        if (matchword == false)
        {
            // Substring search
            if (text.IndexOf(searchtext) >= 0)
                return true;
        }
        else
        {
            // Whole-word search: compare each space-separated word
            char[] separator = { ' ' };
            string[] words = text.Split(separator);
            foreach (string wordc in words)
            {
                if (wordc.Equals(searchtext) == true)
                    return true;
            }
        }
    }
    return false;
}
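One thing to keep in mind is that Word can split a phrase across several w:t elements, for example when part of it is formatted differently, so a multi-word search string may not appear inside any single element. A possible refinement, sketched here as an assumption and not as part of the attached code, is to join the text of all w:t elements in each paragraph (w:p) before searching. The class name ParagraphSearch is hypothetical.

```csharp
using System;
using System.Text;
using System.Xml;

// Hypothetical helper for illustration only; not part of the attached solution.
public static class ParagraphSearch
{
    // Main WordprocessingML namespace, bound to the "w" prefix in document.xml.
    private const string WordNamespace =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    public static bool ContainsText(XmlDocument doc, string searchText,
                                    bool caseSensitive)
    {
        StringComparison comparison = caseSensitive
            ? StringComparison.Ordinal
            : StringComparison.OrdinalIgnoreCase;

        foreach (XmlElement paragraph in doc.GetElementsByTagName("p", WordNamespace))
        {
            // Join the text of every w:t element inside this paragraph.
            StringBuilder paragraphText = new StringBuilder();
            foreach (XmlNode textNode in paragraph.GetElementsByTagName("t", WordNamespace))
            {
                paragraphText.Append(textNode.InnerText);
            }

            if (paragraphText.ToString().IndexOf(searchText, comparison) >= 0)
                return true;
        }
        return false;
    }
}
```

This sketch looks elements up by namespace URI and local name, so it does not depend on the document using the literal prefix "w".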
That's it, we're done. In the UI I have added a few extras: after a search, you can right-click a file name to open the file in Word, or double-click it to do the same. So far I have written this example for Word only, and I am planning to extend it to Excel and PowerPoint too.
You can explore this further to provide an online tool in your organization for searching documents, spreadsheets, and presentations. In my tests with more than 60 files, this search ran faster than Windows search.